
How to quickly verify a huge identified list with no duplicate URLs and no duplicate domains?

I am done scraping, identifying, removing unknowns from, and de-duping a huge list. Here are the stats of the list I ended up with. Now I want to remove the dead sites, 404s, and any others that return errors. Is there a quick way?
P.S. Sorry if this is a dumb question; I am new to this. Thanks.

-------------------------------
Category - Article............: 2653092
Category - Blog Comment.......: 2454838
Category - Directory..........: 46435
Category - Document Sharing...: 1437
Category - Exploit............: 14602
Category - Forum..............: 203295
Category - Guestbook..........: 113352
Category - Image Comment......: 27150
Category - Indexer............: 17
Category - Microblog..........: 32714
Category - Pingback...........: 1546006
Category - Referrer...........: 1264
Category - RSS................: 402
Category - Social Bookmark....: 57015
Category - Social Network.....: 94746
Category - Trackback..........: 262263
Category - URL Shortener......: 165374
Category - Video..............: 48539
Category - Web 2.0............: 19987
Category - Wiki...............: 89975
-------------------------------
Total.........................: 7832503

Comments

  • You could use the ScrapeBox Alive Check addon.
  • Thanks for your reply.
    @Seljo Does that addon require proxies or not?
  • Idontknow Romania
    edited February 2015
    No proxies needed, but it will take you a week to verify all the links.
  • @Idontknow Actually, the problem is that after processing them with ScrapeBox I will lose the identification of the URLs. Is there any way for ScrapeBox to generate output that GSA SER will accept?
  • Tim89 www.expressindexer.solutions
    Simply create a SER project with your desired engines, import the list directly into the project, and uncheck "use global site lists".
  • Just process the list one file at a time and overwrite each file with the cleaned result.
  • @Tim89 Yeah, that's a simple solution, but it will take a lot of time. I think there should be an option in GSA SER to clean up the list.
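
For anyone who wants to try the file-by-file idea outside of ScrapeBox, here is a rough Python sketch. It assumes each identified list is a plain text file with one URL per line, which may not match your setup. The function names (`is_alive`, `filter_alive`, `clean_file`) are made up for this example and are not part of GSA SER or ScrapeBox. Because each file is rewritten in place under its own name, whatever identification the file name carries is kept.

```python
import concurrent.futures
import http.client
import urllib.error
import urllib.request
from pathlib import Path


def is_alive(url, timeout=10):
    """Return True if the URL answers a HEAD request with a status below 400."""
    try:
        request = urllib.request.Request(
            url, method="HEAD", headers={"User-Agent": "Mozilla/5.0"}
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # 404/500 responses, DNS failures, timeouts and malformed URLs all
        # land here. Servers that refuse HEAD are also treated as dead.
        return False


def filter_alive(urls, check=is_alive, workers=50):
    """Keep only the URLs for which check(url) is true, preserving order."""
    urls = list(urls)
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(check, urls))
    return [url for url, ok in zip(urls, results) if ok]


def clean_file(path, check=is_alive, workers=50):
    """Rewrite one list file in place, keeping only live URLs.

    Returns (urls_before, urls_after). Assumes one URL per line.
    """
    path = Path(path)
    text = path.read_text(encoding="utf-8", errors="ignore")
    urls = [line.strip() for line in text.splitlines() if line.strip()]
    alive = filter_alive(urls, check, workers)
    path.write_text("\n".join(alive) + ("\n" if alive else ""), encoding="utf-8")
    return len(urls), len(alive)
```

You would call something like `clean_file("some_list.txt")` once per file (the file name here is only a placeholder). Back up the files first, since they are overwritten. With nearly 8 million URLs this will still take a long time, and a HEAD request only shows that the server responds, not that the page is still usable for posting.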