Global site lists: Possibility to display processing status/other improvements

edited September 2012 in Feature Requests
Like others, I'm now working very often with the global site list and frequently add new URLs to it. I've set my projects to use the global site list(s), but I don't know (per project) how many unused URLs are left. So it would be nice to display (on a per-project basis) how many URLs are left to try to post to. This information would give a much better overview of how often the "global site lists queue" has to be refilled and how big the "buffer" of available sites is. Something like "274523 are ready to process...".

Because global site lists are really important, I think maintaining them should have some priority. While working with this function, I noticed the following points where an improvement would be nice:

  1. Using all 3 types of site lists produces a huge number of duplicates. I clean up my site lists regularly (last time about 2 days ago, after importing/identifying a newly scraped list). Today it removed another 670k duplicates. That's a lot! Wouldn't it be possible to auto-clean them from time to time (in a safe way)?
  2. When importing via "identify sites", it would be useful to import only sites which aren't duplicates already. This would give a much better idea of how many were really imported.
  3. An option to recheck the whole site list "database" would be great. These files are growing a lot, and I think it would make sense to have the ability to recheck whether the sites are still alive.
  4. Because mostly only one post per domain is made, it would make sense for SER to save the PR of the page too and start with the page that has the highest PR. When identifying the pages, it would be useful (to prevent Google bans due to the high number of requests) to use a format like [URL]|[PR]. This could be done with Scrapebox before importing.
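
For point 4, the sorting could already be done outside SER before importing. A minimal sketch, assuming the "[URL]|[PR]" format suggested above; the function name is mine, and lines without a PR field are simply treated as PR 0:

```python
def sort_by_pr(lines):
    """Parse '[URL]|[PR]' lines and return the URLs ordered by PR, highest first."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        url, sep, pr = line.rpartition("|")
        if sep and pr.isdigit():
            entries.append((int(pr), url))
        else:
            entries.append((0, line))  # no PR field: treat as PR 0
    # Stable sort: URLs with equal PR keep their original order
    entries.sort(key=lambda entry: entry[0], reverse=True)
    return [url for _, url in entries]
```

The output file would then let SER (or any tool consuming it) post to the highest-PR page of each domain first.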

It would be interesting to hear what others think about my points. Perhaps there are better ideas?

Best Answer

  • SvenSven www.GSA-Online.de
    Accepted Answer
    If it were a speed issue, I would have added it already. But as URLs are picked randomly from the site lists, neither duplicates nor the overall size matter at all.

Answers

  • Wrong forum? - I think this should be under SER, not SEO Indexer

    But agreed on your points; I think 2. would take a lot of CPU

    3. would be good for a cleanup

    4. Good idea
  • Oh, my fault... This is of course a feature request for SER, not SI. @mods/sven: can you please move the thread?

    2. I don't think this will consume more CPU than identifying and removing dupes does now. The difference would be that we wouldn't have to do this manually and calculate imported URLs minus removed duplicates.

    3. I think these files are getting big... So this would help to keep only the more relevant URLs.
  • 2. It depends - if you scanned the file first before processing and created a temporary file for checking; how to handle multiple files might be an issue.

    3. Agreed - blogspot / general blogs are huge, around 40 MB, and it's hard to determine if they are all alive ...
  • 2. I'm not talking about duplicates inside the file with the sites to identify. That file should of course be de-duplicated before "importing" it. What I meant were the duplicates of sites which are already in the identified site list. At the moment we have to identify the list and click on "de-duplicate URLs" afterwards. It's "impossible" to know which sites were already there (or too time-consuming, because they are split into multiple files).

    3. On my installation (with all duplicates removed), all identified site lists together are about 110 MB. I'm sure it doesn't make sense to keep them all, but the working ones are of course useful, especially for new projects.
  • AlexRAlexR Cape Town
    These ideas are brilliant! They will definitely add a huge amount of value.
  • AlexRAlexR Cape Town
    Do site lists store data about OBL and PR filters? How can we create an AA site list?
  • I don't think this information is stored (at least not in the lists themselves). BTW: OBL would be nice additional information too, to get the right priority.

    AA list: You can use the lists from submitted/verified - then you've got your AA list.
  • IMO one problem is the
    "Remove Duplicate URLs" / "Remove Duplicate Domains"
    function.

    Which of the two to use depends on the CMS.
    For a CMS like Pligg I can use "Remove Duplicate Domains".
    But for Blog Comment, Trackback, Guestbook, and Image Comment CMSes I only want to use "Remove Duplicate URLs".
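
    The difference between the two modes could be sketched like this. A rough illustration only (not SER's code; using `urlparse` for domain extraction is my assumption):

```python
from urllib.parse import urlparse

def remove_duplicate_urls(urls):
    """Keep the first occurrence of each exact URL (blog comments, guestbooks...)."""
    seen, kept = set(), []
    for url in urls:
        if url not in seen:
            seen.add(url)
            kept.append(url)
    return kept

def remove_duplicate_domains(urls):
    """Keep only the first URL per domain (Pligg-like, one post per domain)."""
    seen, kept = set(), []
    for url in urls:
        domain = urlparse(url).netloc.lower()
        if domain not in seen:
            seen.add(domain)
            kept.append(url)
    return kept
```

    Domain-level dedupe is strictly more aggressive, which is why applying it to per-page platforms throws away valid targets.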
  • @Bytefaker

    03: +1 from me
    04: This is really a great idea. +1 too
  • SvenSven www.GSA-Online.de

    Using all 3 types of site lists produces a huge number of duplicates. I clean up my site lists regularly (last time about 2 days ago, after importing/identifying a newly scraped list). Today it removed another 670k duplicates. That's a lot! Wouldn't it be possible to auto-clean them from time to time (in a safe way)?

    Auto cleanup would be possible, but I fear it would result in memory issues, as these lists can get really big over time. I would prefer people take care of this on their own. Otherwise you end up in a situation where the program crashes because there's no memory left.

    When importing via "identify sites", it would be useful to import only sites which aren't duplicates already. This would give a much better idea of how many were really imported.

    Same problem as above. But I guess this can be added somehow. Right now the program just writes new URLs at the end - no checking, no time wasted, no memory issue.

    An option to recheck the whole site list "database" would be great. These files are growing a lot, and I think it would make sense to have the ability to recheck whether the sites are still alive.

    Same issue as above :/

    Because mostly only one post per domain is made, it would make sense for SER to save the PR of the page too and start with the page that has the highest PR. When identifying the pages, it would be useful (to prevent Google bans due to the high number of requests) to use a format like [URL]|[PR]. This could be done with Scrapebox before importing.

    Saving the PR would require the program to know it, which is not always the case. And again, I smell possible memory usage problems with too-big lists.

  • edited September 2012
    @Sven: Thanks for taking time to answer.

    Related to the memory problem: Do you think the 32-bit memory limit would be the problem? Or where exactly do you see the memory issue? At the moment SER uses only a little memory (only 350 MB at 800 threads)...

    2. Would be nice :)

    What a lot of people overlooked here:

    Like others, I'm now working very often with the global site list and frequently add new URLs to it. I've set my projects to use the global site list(s), but I don't know (per project) how many unused URLs are left. So it would be nice to display (on a per-project basis) how many URLs are left to try to post to. This information would give a much better overview of how often the "global site lists queue" has to be refilled and how big the "buffer" of available sites is. Something like "274523 are ready to process...".

    It would be very useful to know when SER needs to be fed with scraped URLs again, in order to make constant use of the global site lists.
  • AlexRAlexR Cape Town
    +1 for a reporting improvement here. 
  • SvenSven www.GSA-Online.de
    Well yes, the 32-bit problem... no more than 2.1 GB of memory allocation. I've seen all kinds of people having problems with this when they use 1000 threads or many, many projects and then wonder why it crashes. With a feature like automatic cleanup you would end up in this situation even more often. I don't think this will get added.
  • AlexRAlexR Cape Town
    What about doing a cleanup between updates, when all projects stop for a while anyway? Or an option to set a cleanup schedule (e.g. once per week) that pauses all projects and continues them when done? Maybe this would help with the memory issue?
  • @Sven: As I said, I really don't want to get on your nerves with this or create more support requests, but @GlobalGoogler made a very good point. I don't see anything against a feature (besides the work to implement it) that periodically stops everything (with an option for this) and does the cleanup. This would improve speed and the overall experience, and keep at least this regular chore to a minimum.
  • Ok, thanks for your patience! :) But is it somehow possible to see how many URLs from the global lists aren't used yet and are still available?
  • SvenSven www.GSA-Online.de
    Not really - as the program takes random URLs from the list, it doesn't know whether it has used them all.
  • edited May 2013
    @Sven: Since the program doesn't know how many global site list URLs were used, and because it seeks the next URL randomly every time, the same URLs will be unnecessarily used multiple times per project.

    A possible solution could be...

    1) only once, at the start, put the file pointer at a random position
    2) read 1000 characters from that position
    3) extract the URL found there
    4) calculate the new file pointer position right after that URL
    5) save that position as the start position for the next request
    6) also save a request counter that can be shown in the user interface ("253 of 200000 site list URLs")
    7) if EOF is reached, continue from file pointer position 0
    8) if the very first (random) position from point 1) is reached again, record that the project has done them all

    The mentioned data could be saved in an extra file "site-list-projectinfo.dat":
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Project 1 ## FirstIndex=1111111 ## ActualIndex=2222222 ## AlreadyProcessed=111 ## url1
    Project 2 ## FirstIndex=666666 ## ActualIndex=ALLDONE ## AlreadyProcessed=999 ## url2

    <- meaning that ...
    Project 1 has finished 111 URLs from the site list, and the file pointer to find the next URL is 2222222
    Project 2 has finished all URLs, because the field "ActualIndex" has reached "FirstIndex" again.

    9) if a new site list is uploaded by the user, the file "site-list-projectinfo.dat" can simply be deleted.
  • THE ABOVE is only useful if projects don't add URLs to the global site list while doing SE scraping,
    because then the site list grows by itself and a large part of the uploaded list would sometimes never be used.

    An easier approach
    might be to simply add two options for "global site lists" in the project options:
    "pick URLs sequentially" & "pick randomly"    <-- sequential would start from the file beginning after each list upload

    If "sequentially" is selected, the counter could be shown in the UI and, more importantly, all global site list URLs would be used by the project only once.
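
    The contrast between the two proposed options, in a purely hypothetical sketch (helper names are mine; the list is assumed to fit in memory):

```python
import random

def pick_sequential(urls, index):
    """Return (url, next_index); the index doubles as the UI counter."""
    if index >= len(urls):
        return None, index            # one full pass completed, no repeats
    return urls[index], index + 1

def pick_random(urls):
    """Current behaviour: may repeat some URLs and never hit others."""
    return random.choice(urls)
```

    Resetting the index to 0 after each list upload would give exactly the "start from the file beginning" behaviour described above.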
  • SvenSven www.GSA-Online.de
    Well, you can just as well import all site lists directly into a project to speed things up.
  • SvenSven www.GSA-Online.de
    The next version allows you to import site lists directly into a project's target URLs.