How to filter a 20M link list for GSA in a shorter time

Ok so I just got a 20M link list off a friend, went over to GSA and used "Import links from text". GSA spent a whole day processing to identify the list and only got 800k identified and 300k unidentified, meaning it processed only about 1M out of the 20M during that day. Is there any way to filter the links down to what works for GSA within a shorter time?

Comments

  • goonergooner SERLists.com
    1M out of 20M untargeted URLs is actually not bad.

    You could run the list through "Sort and identify" and that will put the links SER thinks it can post to in your identified folder in sitelist format.

    Then set your projects to post from the identified folder; they will continually try to post to all of those links, so you will get as many of them as possible in the end.
  • If I'm not mistaken, he is already using the sort and identify tool and I believe this option is a waste of time when handling large lists.

    I'm also running large lists, but I'm always deduping my lists by domain or else I wouldn't have any chance of keeping up with my scraping at all (see the sketch below). I then import my list straight into a project with no filters at all and mostly all platforms ticked. I believe the sort in and identify tool is a waste of time, because if you just import your list straight into a project, you'll be identifying and posting at the same time.
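
    For anyone who prefers to script that domain dedupe outside of SER, here is a minimal Python sketch. It assumes one URL per line; the filenames links.txt and links_deduped.txt are just placeholders, and it simply keeps the first URL seen for each host:

        from urllib.parse import urlparse

        seen = set()
        with open("links.txt", "r", encoding="utf-8", errors="ignore") as src, \
             open("links_deduped.txt", "w", encoding="utf-8") as out:
            for line in src:
                url = line.strip()
                if not url:
                    continue
                if "://" not in url:
                    url = "http://" + url  # urlparse needs a scheme to find the host
                domain = urlparse(url).netloc.lower()
                if domain and domain not in seen:
                    seen.add(domain)              # keep only the first URL per domain
                    out.write(line.strip() + "\n")

    The set of domains sits in memory, which is fine even for a 20M-line file on a recent machine; the deduped output is what you would then import into the project.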
  • @sven said that a third party is developing a tool with the sole purpose of being a platform identifier and sorter. I'm assuming it's going to be much faster than the inbuilt SER identify and sort in.
  • goonergooner SERLists.com
    edited May 2014
    @fakenickahl - I think you're right, he did use sort and identify; I didn't read it properly.

    We have tested this quite extensively and you get more links using sort and identify, as much as 10% more with the same list. You could get the same results if you imported the list more than 5 times, but it still takes a very long time to process all of those dup URLs.

    So, it's a choice between more links or more speed IMO.
  • @gooner, that's very interesting that you saw such an increase. Another reason I have for avoiding the sort and identify tool is that I once saw a member write that it missed a good amount of potential targets. He wrote that he came to this conclusion by saving the unknown targets and rerunning that list of unknown targets. I did the same and came to the same conclusion: a good amount of potential targets were categorized as unknown during the first run. Just thought you might be interested in this, as I take it you're currently using the sort in and identify tool and are possibly still missing out on a good amount of links.
  • goonergooner SERLists.com
    @fakenickahl - It could be true that some links are missed that way. I use it sometimes, but I should clarify that I use it on a bulletproof server with no proxies and 3 retries. So maybe that method yields more links.

    The main reason I use it is so that I don't have to import URLs across numerous servers every day (splitting a master list per server is easy enough to script, see the sketch below).
    So if I am too busy to take care of it, I use identified lists.

    If I had a way to auto-import when projects were out of targets, I probably wouldn't bother with sort and identify.
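
    If anyone does want to split one master list into per-server chunks, a minimal Python sketch of that could look something like the following; the chunk count and filenames are just placeholders, and it streams the file so the full 20M list never has to fit in memory:

        import itertools

        CHUNKS = 4  # one output file per server / SER install
        outs = [open(f"links_part_{i + 1}.txt", "w", encoding="utf-8") for i in range(CHUNKS)]
        with open("links_deduped.txt", "r", encoding="utf-8", errors="ignore") as src:
            # Deal URLs out round-robin so every chunk gets a similar mix
            for line, out in zip((l for l in src if l.strip()), itertools.cycle(outs)):
                out.write(line if line.endswith("\n") else line + "\n")
        for out in outs:
            out.close()

    Each links_part_N.txt can then be imported on its own server instead of pushing the whole list everywhere.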
  • You guys are right, I have used sort in and identify. I let it run for a whole day and it processed through 1.2 million URLs, but that is out of a 20 million link list, so at that rate I'd basically need around 16-17 days to let GSA process the whole list. I could live with only 1 million links out of the 20M, but I was asking if there was a way to process all 20 million links within a day or something?
  • Based on my experience, there's no way at all you could go through 20 million URLs in a day with just one SER running. I was merely suggesting ways for you to get through it quicker.