
Cleanup Check and remove non-working URLs FEATURE

Hey guys,
I just want to know how the "clean up (check and remove non-working URLs)" feature works.

I have quite a few verified URLs now, accumulated over time, and I want to know if I should be hitting this button (I de-dupe on a regular basis).

Does anyone have any input on:

1) How often should I use this?
2) How long does it take (let's say I have 1 million URLs)?
3) Do I stop GSA to do this?

...and anything else you may think is helpful.

Thanks in advance  

Comments

  • Clean-Up works like the "identify and sort in" feature. It takes the URLs from your selected folder (Identified, Success, Verified, or Failed), removes duplicates, re-identifies each URL, and re-sorts it into the appropriate folder.
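
    If it helps to picture the idea, here's a rough Python sketch of just the dedupe-and-drop-dead-URLs part — not SER's actual code (the real feature also re-identifies each URL against SER's engine definitions, which a short script can't reproduce), and the file name is a made-up example:

    ```python
    import urllib.error
    import urllib.request

    SITE_LIST = "verified_urls.txt"  # made-up example file, one URL per line

    def is_alive(url, timeout=15):
        """True if the server answers at all (any HTTP status counts)."""
        try:
            req = urllib.request.Request(
                url, method="HEAD", headers={"User-Agent": "Mozilla/5.0"})
            urllib.request.urlopen(req, timeout=timeout)
            return True
        except urllib.error.HTTPError:
            return True   # got a status code (403, 404, ...) -> site is up
        except Exception:
            return False  # DNS failure, timeout, refused connection, ...

    with open(SITE_LIST, encoding="utf-8", errors="ignore") as f:
        urls = [line.strip() for line in f if line.strip()]

    unique = list(dict.fromkeys(urls))            # dedupe, keep original order
    working = [u for u in unique if is_alive(u)]  # drop dead URLs

    with open(SITE_LIST, "w", encoding="utf-8") as f:
        f.write("\n".join(working) + "\n")

    print(f"{len(urls)} total -> {len(unique)} unique -> {len(working)} working")
    ```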

    >1) How often should I use this?
    I only use this when I feel it's slowing down my submissions (maybe once every 3 months?). But I do remove dups every day.

    >2) How long does it take (let's say I have 1 million URLs)?
    If you don't use the engine filter, it'll take forever. Just select only the engines you use, and it'll greatly speed up the process. Here's how to do it:

    https://forum.gsa-online.de/discussion/comment/88203/#Comment_88203

    >3) Do I stop GSA to do this?
    I don't think you should stop the process. If you do, there will be nothing or very little in the Verified folder for the projects to use...

  • Thanks @Olve1954

    1) So the "clean up (check and remove non-working)" and "identify and sort in" features both check for dead URLs and remove them? Doesn't GSA do this anyway, since you continually reuse the same URLs for new projects, pulling from the same folder?
    Ex:
    If I have used 1 million URLs to post to a weight-loss website, and 500 of them have since died, and I then use the same 1 million to post to my new make-money-online website, won't GSA delete the 500 dead ones as it tries to post to the make-money-online website?

    2) Do I stop GSA to do this? I meant: should I stop posting to my projects while I use the "clean up (check and remove non-working)" function? I guess it would be faster, as it's not using as many resources?

    Thanks
  • 1) No, SER (GSA is the company) does not delete any URLs from the Identified, Success, Verified, or Failed folders. It only adds to them. Hence you need to remove dups once in a while.

    2) Yes, stop SER while the Clean-Up runs...

    But do you really have 1 million URLs in your verified folder? If you do, then I guess most of them are blog comments, pingbacks, etc. If that's the case, the Clean-Up function may not be the best way to clean your list: it's too time-consuming, and those URLs are not worth keeping, unlike contextual links from Article, Social Network, and Wiki engines...


  • Hey @Olve1954

    With the strategy I use for ranking, what I have in my list seems to work for me, so I don't want to get rid of anything unnecessarily. I was looking for the most effective way to get rid of the dead URLs that doesn't take days or weeks...

    Here are my category stats

    Category - Article............: 110521
    Category - Blog Comment.......: 648481
    Category - Directory..........: 20252
    Category - Document Sharing...: 365
    Category - Exploit............: 72115
    Category - Forum..............: 53631
    Category - Guestbook..........: 84179
    Category - Image Comment......: 31138
    Category - Indexer............: 10901
    Category - Microblog..........: 1157
    Category - Pingback...........: 49958
    Category - Referrer...........: 223
    Category - RSS................: 498
    Category - Social Bookmark....: 6795
    Category - Social Network.....: 155774
    Category - Trackback..........: 76223
    Category - Unknown............: 94363
    Category - URL Shortener......: 39831
    Category - Video..............: 6360
    Category - Web 2.0............: 677
    Category - Wiki...............: 36309
    -------------------------------
    Total.........................: 1499751

  • edited August 2014
    A very impressive verified list indeed, and it's not going to be an easy task to clean it.

    May I suggest another method? It will still take days, but you can run SER normally while cleaning up your list. Briefly, the method involves deleting your Identified folder, moving (copy, then delete) your Verified folder into Identified, running SER as usual (as it runs, it rebuilds your Verified folder with clean, working URLs), and monitoring your Verified folder every day. If the number of URLs (after dedup) stops increasing, you've squeezed all the working URLs out of Identified.

    Steps (a rough script sketch follows this list):
    - back up Identified and Verified (Export site lists)
    - empty the Identified folder
    - move the files from Verified to Identified (Verified should now be empty)
    - start your projects normally
    - monitor the Verified folder every day; remove duplicate domains and URLs
    - record the total every day until it no longer increases

    While you're doing this, do not add new URLs to Identified...
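
    If you'd rather script the folder shuffling, something like this rough Python sketch would do it — the site list paths are made-up examples (check your real folders in SER's advanced options), and SER should be stopped while it runs:

    ```python
    import shutil
    from pathlib import Path

    # Made-up example paths -- point these at your actual SER site list folders.
    SITE_LISTS = Path(r"C:\path\to\your\site_lists")
    IDENTIFIED = SITE_LISTS / "identified"
    VERIFIED = SITE_LISTS / "verified"
    BACKUP = SITE_LISTS / "backup_before_cleanup"

    # 1) back up Identified and Verified
    for src in (IDENTIFIED, VERIFIED):
        shutil.copytree(src, BACKUP / src.name, dirs_exist_ok=True)

    # 2) empty the Identified folder
    for f in IDENTIFIED.glob("*.txt"):
        f.unlink()

    # 3) move the Verified files into Identified (Verified ends up empty)
    for f in VERIFIED.glob("*.txt"):
        shutil.move(str(f), str(IDENTIFIED / f.name))

    # Daily monitoring helper: count unique URLs currently in Verified.
    total = len({line.strip()
                 for f in VERIFIED.glob("*.txt")
                 for line in f.read_text(errors="ignore").splitlines()
                 if line.strip()})
    print(f"Unique URLs in Verified: {total}")
    ```

    The last few lines double as the daily monitoring step: rerun them each day and record the printed total until it stops growing.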

  • goonergooner SERLists.com
    @StaceAce - Depending on how old your list is, you are going to lose a lot of URLs with sort and identify. I recently cleaned a 1.2 million verified list that was at most 3 months old, and it was trimmed to 700,000 after sort and identify.

    But even then, that still doesn't mean you can get links from every URL on that fresh list.

    If you use the method @Olve1954 described above, it will give you a truly clean list where you'll get links from most of the URLs - but you'll lose a lot of your list. I would guess you'll lose 75% of it, if not more.

    But the remaining URLs will be postable, so it's all good in the end.

  • Yes, @gooner is right. The Verified method will definitely remove lots of URLs, due to bad captchas, internet connection problems, and websites being temporarily down.

    So I suggest you run it for weeks before deleting your Identified folder.

    But I figure, if you can get 1 million verified URLs, you must be a damn good scraper. So rebuilding your verified list won't be a problem...
  • Hey guys,

    Thanks for your input. I think the method of pulling from the Identified folder sounds like a good path to take.

    I guess I could keep the Identified backup, then once I think the cleaning has finished, compare the files with Scrapebox. That way I'd get all the ones that never got verified again, and I could run that list one more time to catch any where the server didn't respond or my internet connection dropped out. I could get a few more verified this way?
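
    The comparison itself is just a set difference, so you don't even need Scrapebox for that part — a rough sketch, with made-up file names:

    ```python
    # Keeps every URL from the old backup that never re-verified,
    # so they can be fed through SER a second time.
    def load(path):
        with open(path, encoding="utf-8", errors="ignore") as f:
            return {line.strip() for line in f if line.strip()}

    old = load("verified_backup.txt")   # the full export you kept
    new = load("verified_rebuilt.txt")  # what SER has re-verified so far

    leftovers = old - new
    with open("retry_these.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(sorted(leftovers)))

    print(f"{len(leftovers)} URLs never re-verified; run them again for stragglers")
    ```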

    Ouch to losing 75% of the list 
  • @Olve1954 and @Gooner, could I ask another quick question?

    Let's say I have 10 projects and I have already posted to all 10 of them with the 1 million URLs. If I put my 1 million verified URLs into the Identified folder and rebuild the Verified folder, won't that mean I'm trying to post to the same URLs again? Or were you suggesting I only do this with newly created projects?

    Or is there a setting where SER will only use the URLs that it hasn't used in the past?

    Thanks 
  • If you're using the Identified-to-Verified method, I suggest you set up as many new projects as you can. The more projects, the faster the process.

    >Or is there a setting where SER will only use the URLs that it hasn't used in the past?

    For each project, SER "remembers" if it has posted to a site before (see the toy sketch after this list). SER won't post to that site again, as long as you

    - untick "Continuously try to post to a site even if failed before", and
    - don't delete the Target URL History
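
    In case it helps to picture it, here's a toy Python sketch of that per-project memory — it assumes nothing about how SER actually stores the history, and the project names and URLs are invented:

    ```python
    # Each project keeps its own history of targets; a URL is skipped
    # if that project has already attempted it.
    posted_history = {}  # project name -> set of URLs already attempted

    def next_targets(project, candidates):
        seen = posted_history.setdefault(project, set())
        fresh = [u for u in candidates if u not in seen]
        seen.update(fresh)  # remember them for this project only
        return fresh

    targets = ["http://siteA.example", "http://siteB.example"]
    print(next_targets("weight-loss", targets))        # both new -> posted
    print(next_targets("weight-loss", targets))        # [] - already in history
    print(next_targets("make-money-online", targets))  # new project -> both again
    ```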

  • Thanks Olve1954,

    Will give that a try... thanks for the info.