Best Practices for Managing Scraped (Unverified) Target URLs
I have six projects, and I have 60k verified targets that I have run campaigns against for all six projects.
Now I have scraped, say, a million new unverified targets.
What do I do with those in GSA SER?
I need to verify them, and I need to run all six projects using the verified URLs from this list.
What is the step-by-step way to accomplish this?
Comments
1) Import them directly into projects and see what gets verified. Ideally you want to use a dummy link in your project, as you may not want thousands of links pointing at a money site.
2) Use sort and identify on the list first, then post from your new identified list. This is slow, but it means you get a much better percentage of good links when you finally run the list in SER.
I prefer option 1.
Is that clear enough or do you need more info?
1) You can choose which projects run from the verified list. You could also set your real projects not to save to the verified list in project options, then save off and delete the existing verified list (see the sketch after this list); that way the new verified list would contain only what you scraped. The simplest solution is to have another VPS just for processing scrapes.
2) Yes, but remember you should run a list multiple times to get all the links. Proxies fail, connections time out, and captchas are not 100% solvable, so if you run the list once and then discard it, you are potentially throwing away good links.
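On the "save off and delete the existing verified list" step in 1): here is a minimal sketch of how you could script it, assuming the verified site list is just a folder of .txt files. Both paths are placeholders rather than anything SER defines, so adjust them to your own installation.

```python
# Hypothetical helper: back up the current verified site list, then clear it
# so the next run's verified list only contains results from the new scrape.
import shutil
from datetime import datetime
from pathlib import Path

VERIFIED_DIR = Path(r"C:\SER\site_lists\verified")   # placeholder path
BACKUP_ROOT = Path(r"C:\SER\site_lists\backups")     # placeholder path

def archive_and_clear_verified() -> None:
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    backup_dir = BACKUP_ROOT / f"verified_{stamp}"
    backup_dir.mkdir(parents=True, exist_ok=True)
    for txt in VERIFIED_DIR.glob("*.txt"):
        # Move instead of copy, so the live folder ends up empty ("deleted").
        shutil.move(str(txt), str(backup_dir / txt.name))

if __name__ == "__main__":
    archive_and_clear_verified()
```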
1. Scrape millions of targets every day using custom, precise footprints plus a large keyword list.
2. Dedup all raw URL files by domain (a standalone sketch of this follows below).
3. Import the raw URL list directly into 4-5 dummy projects to post to the URLs.
4. Dedup the site lists by domain every day to keep them "clean".
That's my exact procedure. It does take some work to master this seemingly simple task. Now that I have been doing it every day for the last 3 months, it's like shifting gears in a manual transmission vehicle: you don't even pay attention to the gauges, you just do it.
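For steps 2 and 4, SER has its own dedup tools, but the same idea as a standalone script over a raw scrape file looks roughly like this. A minimal sketch, with made-up file names:

```python
# Keep only the first URL seen for each domain in a raw scraped URL file.
from urllib.parse import urlparse

def dedup_by_domain(in_path: str, out_path: str) -> None:
    seen = set()
    with open(in_path, "r", encoding="utf-8", errors="ignore") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            url = line.strip()
            if not url:
                continue
            domain = urlparse(url).netloc.lower()
            if domain and domain not in seen:
                seen.add(domain)
                dst.write(url + "\n")

if __name__ == "__main__":
    dedup_by_domain("raw_scrape.txt", "deduped_by_domain.txt")
```

Swap the key from the domain to the full URL and you get dedup by URL instead, which is the trade-off discussed next.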
If you dedup by URL instead, you will get more links; I'm not totally sure why that is. It could be that it gives you more chances to get a link per domain, or that SER doesn't recognise one URL from a particular domain but does recognise another.
So it's a choice between max speed and max links.
I found this approach gives me far better results, because it works around proxy, captcha, and SER footprint-identification issues by making sure each URL is parsed at least 6 times.
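One way to arrange that is sketched below as a plain script (not a SER feature), assuming you simply import the same list in several separate passes or into several dummy projects. The file names and the pass count of 6 are placeholders.

```python
# Write N shuffled copies of a target list so the same URLs can be imported
# and attempted in several separate passes.
import random

def make_passes(in_path: str, passes: int = 6) -> None:
    with open(in_path, "r", encoding="utf-8", errors="ignore") as f:
        urls = [line.strip() for line in f if line.strip()]
    for i in range(1, passes + 1):
        random.shuffle(urls)
        with open(f"pass_{i}.txt", "w", encoding="utf-8") as out:
            out.write("\n".join(urls) + "\n")

if __name__ == "__main__":
    make_passes("deduped_by_domain.txt", passes=6)
```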
I feed the results into a shared folder (\\xxx.xxx.xxx.xxx\verified) and my other copies of SER pick the lists up from there. You can use Dropbox or Google Drive for the same purpose.
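You can point SER's site-list folders at the share directly, but if you prefer to push the files yourself, a minimal sketch would look like this, assuming the verified list is a folder of .txt files; the local path is a placeholder for wherever your copy of SER keeps its verified list.

```python
# Copy the local verified site-list files to a shared folder so other
# SER instances can read them from there.
import shutil
from pathlib import Path

LOCAL_VERIFIED = Path(r"C:\SER\site_lists\verified")   # placeholder path
SHARE = Path(r"\\xxx.xxx.xxx.xxx\verified")            # the share from above

def push_verified() -> None:
    for txt in LOCAL_VERIFIED.glob("*.txt"):
        shutil.copy2(txt, SHARE / txt.name)  # overwrite with the newest copy

if __name__ == "__main__":
    push_verified()
```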