Global site lists: Possibility to display processing status/other improvements
Like others, I'm now working very often with the global site list and frequently add new URLs to it. I've set my projects to use the global site list(s), but I really don't know (per project) how many unused URLs are left. So it would be nice to display (on a per-project basis) how many URLs are still left to post to. This information would give a much better overview of how often the "global site lists queue" has to be refilled and how big the "buffer" of available sites is, e.g. "274523 are ready to process...".
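To make the request concrete, here is a minimal sketch (in Python, purely illustrative) of how such a per-project counter could be calculated; the folder name and the "already tried" log are assumptions on my side, not actual SER internals:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
import glob

def remaining_urls(site_list_dir: str, tried_log: str) -> int:
    """Count site-list URLs a project has not yet attempted (assumed file layout)."""
    with open(tried_log, encoding="utf-8", errors="ignore") as f:
        tried = {line.strip().lower() for line in f if line.strip()}

    remaining = 0
    for path in glob.glob(f"{site_list_dir}/*.txt"):
        with open(path, encoding="utf-8", errors="ignore") as f:
            for line in f:
                url = line.strip().lower()
                if url and url not in tried:
                    remaining += 1
    return remaining

# e.g. print(f"{remaining_urls('identified', 'project1_tried.txt')} are ready to process...")
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~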
Because global site lists are really important, I think taking "care" of them should have some priority. While dealing with this function I noticed the following points where an improvement would be nice:
- Using all 3 types of site lists produces a huge number of duplicates. I clean up my site lists regularly (last time about 2 days ago, after importing/identifying a newly scraped list). Today it removed another 670k duplicates. That's a lot! Wouldn't it be possible to auto-clean them from time to time (in a safe way)?
- When I import via "identify sites", it would be useful to only import sites which aren't duplicates. This would give a much better idea of how many were really imported.
- An option to recheck the whole site list "database" would be great. These files are growing a lot, and I think it would make sense to have the ability to recheck whether the entries are still alive.
- Because mostly only one post per domain is made, it would make sense for SER to save the PR of the page too and start with the page with the highest PR. For determining the PR of the pages (to prevent Google bans because of the high number of requests), it would be useful to support a format like [URL]|[PR]. This could be done with Scrapebox before importing (a small parsing sketch follows this list).
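A rough sketch of what the suggested [URL]|[PR] import could look like (just my illustration, not how SER actually imports); it also covers the duplicate point above by keeping only the highest-PR page per domain:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
from urllib.parse import urlparse

def load_url_pr_list(path: str) -> list[tuple[str, int]]:
    """Parse "[URL]|[PR]" lines, dedupe by domain, highest PR first."""
    best: dict[str, tuple[str, int]] = {}       # domain -> (url, pr)
    with open(path, encoding="utf-8", errors="ignore") as f:
        for raw in f:
            line = raw.strip()
            if not line or "|" not in line:
                continue
            url, pr_text = line.rsplit("|", 1)
            try:
                pr = int(pr_text)
            except ValueError:
                pr = 0                          # unknown PR sorts last
            domain = urlparse(url).netloc.lower()
            if domain and (domain not in best or pr > best[domain][1]):
                best[domain] = (url, pr)
    # mostly only one post per domain is made, so start with the highest-PR page
    return sorted(best.values(), key=lambda item: item[1], reverse=True)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~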
It would be interesting to hear what others think about these points. Perhaps there are better ideas?
Comments
But agreed on your points; I think 2. would take a lot of CPU.
3. Would be good for a clean-up
4. Good idea
2. I don't think this would consume more CPU than identifying and removing dupes as it is now. The difference is that we wouldn't have to do this manually and calculate "imported URLs minus removed duplicates".
3. I think these files are getting big... so this would help to keep only the more relevant URLs.
3. Agreed, blogspot / general blogs are huge (around 40 MB) and it's hard to determine if they are all still alive...
3. On my installation (with all duplicates removed), all identified site lists together are about 110 MB. I'm sure it doesn't make sense to keep them all, but the working ones are of course useful, especially for new projects.
AA list: You can use the lists from submitted/verified, then you've got your AA list.
Using all 3 types of site lists produces a huge number of duplicates. I clean up my site lists regularly (last time about 2 days ago, after importing/identifying a newly scraped list). Today it removed another 670k duplicates. That's a lot! Wouldn't it be possible to auto-clean them from time to time (in a safe way)?
Auto cleanup would be possible, but I fear it would result in memory issues, as these lists can get really big over time. I would prefer if people take care of this on their own; otherwise you end up in a situation where the program crashes because there is no memory left.
When I import via "identify sites", it would be useful to only import sites which aren't duplicates. This would give a much better idea of how many were really imported.
Same problem as above. But I guess this can be added somehow. Right now the program just writes new URLs at the end: no checking, no time wasted, no memory issue.
An option to recheck the whole site list "database" would be great. These files are growing a lot, and I think it would make sense to have the ability to recheck whether the entries are still alive.
Same issue as above
Because mostly only one post per domain is made, it would make sense for SER to save the PR of the page too and start with the page with the highest PR. For determining the PR of the pages (to prevent Google bans because of the high number of requests), it would be useful to support a format like [URL]|[PR]. This could be done with Scrapebox before importing.
Saving the PR would require the program to know it, which is not always the case. And again I smell possible memory usage problems with lists that are too big.
2. Would be nice
What a lot of people seem to have overlooked here:
A possible solution could be the following (a rough code sketch follows the steps and the file format below):
1) only once, at the very first run, put the file pointer at a random position
2) read 1000 characters from that position
3) extract the URL found there
4) calculate the new file pointer position right after that URL
5) save that position as the start position for the next request
6) also save a request counter that can be shown in the user interface ("253 of 200000 site list urls")
7) if EOF is reached, continue from file pointer position 0.
8) if the very first (random) position from point 1) is reached again, then save that the project has processed everything.
The mentioned data could be saved in an extra file "site-list-projectinfo.dat":
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Project 1 ## FirstIndex=1111111 ## ActualIndex=2222222 ## AlreadyProcessed=111 ## url1
Project 2 ## FirstIndex=666666 ## ActualIndex=ALLDONE ## AlreadyProcessed=999 ## url2
<- this means that:
Project 1 has finished 111 URLs from the site list, and the file pointer to find the next URL is at 2222222.
Project 2 has finished all URLs, because the field "ActualIndex" has reached "FirstIndex" again.
9) if a new site list is uploaded by the user, then the file "site-list-projectinfo.dat" can simply be deleted,
because the site list also grows by itself and otherwise a large part of the newly uploaded list would sometimes never be used.
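As promised above, here is a rough sketch of steps 1) to 9) in code. The field names follow the "site-list-projectinfo.dat" example; everything else (function names, the state dictionary) is an assumption for illustration only:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
import os
import random

def _align_to_line_start(path: str, pos: int) -> int:
    """Move a raw byte offset forward to the start of the next full line."""
    with open(path, "rb") as f:
        f.seek(pos)
        f.readline()                              # discard the partial line at pos
        return f.tell()

def next_url(site_list: str, state: dict) -> str | None:
    """Return the next site-list URL for a project, or None when all are done."""
    size = os.path.getsize(site_list)
    if state.get("ActualIndex") == "ALLDONE" or size == 0:
        return None
    if "FirstIndex" not in state:                 # step 1: random start, only once
        start = _align_to_line_start(site_list, random.randrange(size))
        state.update(FirstIndex=start, ActualIndex=start, AlreadyProcessed=0)

    old_pos = state["ActualIndex"]
    with open(site_list, "rb") as f:
        f.seek(old_pos)
        line = f.readline()                       # steps 2-3: read the next URL
        if not line:                              # step 7: EOF reached -> wrap to 0
            f.seek(0)
            line = f.readline()
            old_pos = 0
        new_pos = f.tell()                        # step 4: position right after that URL

    if old_pos < state["FirstIndex"] <= new_pos:  # step 8: first (random) position reached again
        state["ActualIndex"] = "ALLDONE"
    else:
        state["ActualIndex"] = new_pos            # step 5: start position for the next request
    state["AlreadyProcessed"] += 1                # step 6: counter shown in the UI
    return line.strip().decode("utf-8", "ignore") or None

def save_state(project: str, url: str, state: dict) -> None:
    """Append one project line in the format shown above (a real implementation would rewrite it)."""
    with open("site-list-projectinfo.dat", "a", encoding="utf-8") as f:
        f.write(f"{project} ## FirstIndex={state['FirstIndex']} ## "
                f"ActualIndex={state['ActualIndex']} ## "
                f"AlreadyProcessed={state['AlreadyProcessed']} ## {url}\n")
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~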
An easier approach might be to simply add two options for "global site lists" in the project options:
"pick urls sequentially" & "pick randomly" <-- sequential would start from the beginning of the file after each list upload
If "sequentially" is selected, the counter could be shown in the UI and, more importantly, all global site list URLs would be used by the project only once.