Cleanup Check and remove none working URLs FEATURE
Hey guys,
I just want to know how the "cleanup check and remove none working URLs" feature works.
I have a fair few verified URLs now, but they have accumulated over time and I want to know if I should be hitting this button. (I de-dupe on a regular basis.)
Does anyone have any input on:
1) How often should I use this?
2) How long does it take (let's say I have 1 million URLs)?
3) Do I stop GSA to do this?
And anything else you think is helpful.
Thanks in advance
Comments
>1) How often should I use this?
I only use this when I feel it's slowing down my submissions (maybe once every 3 months?). But I do remove duplicates every day.
>2) How long does it take (let's say I have 1 million URLs)?
If you don't use the engine filter, it'll take forever. Select only the engines you use, and it'll greatly speed up the process. Here's how to do it:
https://forum.gsa-online.de/discussion/comment/88203/#Comment_88203
>3) Do I stop GSA to do this?
I don't think you should stop the process. If you do, there will be nothing or very little in the Verified folder for the projects to use...
2) Yes, stop SER while the Clean-Up is running...
But do you really have 1 million URLs in your Verified folder? If you do, then I guess most of them are blog comments, pingbacks, etc. If that's the case, then the Clean-Up function may not be the best way to clean up your list. It's too time-consuming, and those URLs are not worth keeping, unlike contextual links from Article, Social Network and Wiki...
May I suggest another method? It's still going to take days, but you can run SER normally while cleaning up your list. Briefly, the method involves deleting your Identified folder, moving (copy, then delete) your Verified to Identified, running SER as usual (as it runs, it rebuilds your Verified with clean, working URLs), and monitoring your Verified every day. If the number of URLs (after dedup) does not increase, then you've squeezed all the working URLs from Identified.
Steps:
- backup Identified and Verified (Export site lists)
- empty Identified folder
- move files from Verified to Identified (Verified should now be empty)
- start your projects normally
- monitor the Verified folder every day; remove duplicate domains and URLs
- record the total every day until it no longer increases
While you're doing this process, do not add new URLs into Identified...
But even then that still doesn't mean you can get links from all URLs on that fresh list.
If you use the method @Olve1954 described above, it will give you a truly clean list where you'll get links from most of the URLs, but you'll lose a lot of your list. I would guess you'll lose 75% of it, if not more.
But the remaining URLs will be postable so all good in the end.
So I suggest you run it for weeks before deleting your Identified.
But I figured, if you can get 1 million verified URLs, you must be a damn good scraper. So building up your verified list won't be a problem...
Or is there a setting where SER will only use the URLs that it hasn't used in the past?
>Or is there a setting where SER will only use the URLs that it hasn't used in the past?
For each project, SER "remembers" whether it has posted to a site before. SER won't post to that site again if you
- untick "Continuously try to post to a site even if failed before" or
- Delete Target URL History