Best Practices for Managing Scraped (Unverified) Target URLs

bpm4gsa · July 2014

I have six projects and 60k verified targets, and I have already run campaigns to those targets for all six projects.

Now I have scraped, say, a million new unverified targets.

What do I do with those in GSA SER?

I need to verify them, and I need to run all six projects using the verified URLs from this list.

What is the step-by-step way to accomplish this?

gooner · July 2014

You have two choices:

1) Import them directly into projects and see what gets verified. Ideally you want to put a dummy link in your project as you may not want 1000s of links sent to a money site.

2) Use sort and identify on the list first and then post from your new identified list. This is slow but it means you get a much better percentage of good links when you finally run the list in SER.

I prefer option 1.

Is that clear enough or do you need more info?

bpm4gsa · July 2014

Thanks for the response.

1. If I import into a dummy project to see what gets verified, of course they are now included in my global verified list, but how do I then run my real projects with these verified links?

2. Makes sense... so I would run through the identified list until all projects have completed, then clear the identified list (since the good URLs are now in the verified list)?

gooner · July 2014

No probs:

1) You can choose which projects run from the verified list. You could also set your real projects not to save to verified (in project options), then save a copy of the existing verified list and delete it. That way the new verified list would contain only what you scraped. The simplest solution is to have another VPS for processing scrapes.

2) Yes, but remember you should run a list multiple times to get all the links. Proxies fail, connections time out, and captchas are not 100% solvable. So if you run the list once and then discard it, you are potentially throwing away good links.
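To put a rough number on why repeat passes help, here is a small Python sketch. It assumes each pass over a target succeeds independently with the same probability, which is a simplification (a dead proxy or a hard captcha can fail several times in a row), and the 0.5 used below is a made-up figure for illustration, not a measured SER success rate.

```python
def chance_processed(p_single: float, passes: int) -> float:
    """Chance that a good target succeeds at least once in `passes` tries.

    Assumes every try is independent and succeeds with probability
    `p_single` (an illustrative assumption, not a measured value).
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if passes < 0:
        raise ValueError("passes must not be negative")
    # The target is missed only if every single pass fails.
    return 1.0 - (1.0 - p_single) ** passes


if __name__ == "__main__":
    # Hypothetical 50% per-pass success rate, for illustration only.
    for n in (1, 2, 4, 6):
        print(f"{n} pass(es): {chance_processed(0.5, n):.1%} of good targets caught")
```

Under those assumptions the gain from each extra pass shrinks quickly, so a handful of passes catches most of what a single pass would have thrown away.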

bpm4gsa · July 2014

Sorry for slow responses, on family vacation this week. Thank you again for your help.

1) Ah, another VPS... that would be great. Maybe in a month or so I can swing that.

2) Good point about running more than once. This just compounds the problem of having only one VPS at the moment. It took as long to process a list once through a dummy project as it did to scrape the list in the first place.

So I guess there isn't just one simple, straightforward answer to my question. I'm going to go ahead and accept your last response, but any further suggestions are welcome.

the_other_dude · July 2014

There's really no other way to process a scraped list than the methods @gooner discussed. I do it exactly as he does.

1. Scrape millions of targets every day using custom, precise footprints plus a large keyword list.
2. Dedup all raw URL files by domain.
3. Import the raw URL list directly into 4-5 dummy projects for posting to the URLs.
4. Dedup site lists by domain every day to keep them "clean".

That's my exact procedure. It does take some work to master this seemingly simple task. Now that I have been doing it every day for the last 3 months, it's like shifting gears in a manual transmission vehicle. You don't even pay attention to the gauges, you just do it.

bpm4gsa · July 2014

Dedup by domain, not by url?

Using 4-5 dummy projects, is that how you address the fact that they don't always work the first time (captchas fail, etc.)? Do you run them all at once? And once you have the verifieds from this process, how do you apply them to all of your existing projects that have already posted to your global verified list?

gooner · July 2014

If you dedup by domain, the list will be processed faster.
If you dedup by URL, you will get more links, though I'm not totally sure why that is. It could be that it gives you more chances to get a link per domain, or it could be that SER doesn't recognise one URL from a particular domain but recognises another.

So it's a choice between max speed and max links.
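For anyone who wants to see the difference outside of SER, here is a minimal Python sketch of the two kinds of dedup. It is not how SER does it internally; the function names are made up for this example, and it assumes "domain" simply means the hostname with a leading "www." stripped, so subdomains count as separate domains.

```python
import re
from urllib.parse import urlsplit

# Matches a leading scheme such as "http://" or "https://".
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def domain_of(url: str) -> str:
    """Return the lowercased hostname without a leading 'www.' (assumed definition)."""
    # urlsplit only fills in the host when the URL starts with a scheme or "//".
    candidate = url if _SCHEME.match(url) else "//" + url
    try:
        host = urlsplit(candidate).hostname or ""
    except ValueError:
        # Malformed entries (e.g. broken IPv6 brackets) get an empty domain.
        return ""
    return host[4:] if host.startswith("www.") else host


def dedup_by_url(urls):
    """Keep the first occurrence of every exact URL, preserving order."""
    seen, kept = set(), []
    for url in (u.strip() for u in urls):
        if url and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept


def dedup_by_domain(urls):
    """Keep only the first URL seen for each domain, preserving order."""
    seen, kept = set(), []
    for url in (u.strip() for u in urls):
        if not url:
            continue
        domain = domain_of(url)
        if domain not in seen:
            seen.add(domain)
            kept.append(url)
    return kept


if __name__ == "__main__":
    # Example URLs invented for illustration.
    raw = [
        "http://www.example.com/blog/post-1",
        "http://example.com/forum/register",
        "http://example.com/forum/register",
        "https://other.org/guestbook",
    ]
    print(len(dedup_by_url(raw)), "targets after dedup by URL")
    print(len(dedup_by_domain(raw)), "targets after dedup by domain")
```

Dedup by domain leaves a shorter list (faster to run), while dedup by URL keeps every distinct page on a domain, which matches the speed versus link count trade-off described above.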

strovolo · August 2014

I go a step further. Apart from 4 dummy projects that include all platforms (all 4 running the same list), I have a dummy project for each platform (2 sets; it takes some time to set up).
I found this gives me many more results, because it works around proxy, captcha and SER footprint identification issues by making sure each URL is parsed at least 6 times.
I feed the results to a shared folder (\\xxx.xxx.xxx.xxx\verified) and then my other copies of SER can pick up the lists from there. You can use Dropbox or Google Drive for the same purpose.

kiosh · September 2014

@strovolo "I feed the results to a share folder (\\xxx.xxx.xxx.xxx\verified) and then my other copies of SER can pick the lists from there. You can use dropbox or google drive for the same purpose."

Does this mean that in each copy of SER, your verified folder points to a Dropbox-type folder?
