Identify and sort in, or run raw url list?
Hey
I'm scraping massive amounts of URLs and just wanted to see what the scraper's choice is: feed the raw URL list into multiple projects, or identify and sort in, then import from the identified list?
Thanks
Comments
However, even after I dedup the scrape file, it still takes forever to sort in 200,000+ domains, and it's killing my CPU (maxed out with 4 cores).
I also dedupe the identified list because I do this every day, and it's pretty time-consuming.
I think I forgot to disable proxies this time. Even when I do disable them, it still kills my CPU though. I do check the sort by engine, but I don't uncheck any of them. I'll try that next time, thanks.
I run multiple instances of ScrapeBox, each one only scraping for one type of engine, e.g. Article, Social Network, or Wiki. Then I remove duplicates (domain names in my case) and "identify platform and sort in", selecting only the engines I scraped for. It's much, much faster this way.
Here's how to save the engine selection. Open a project, select all the engines you want to submit to (Article in this case), right click (in the [Where to Submit] box) -> "Load/Save Engine Selection" -> "Save Current".
You can then use the saved engine selection in "identify and sort in".
I usually concentrate my resources on scraping engines/platforms that can get me the most verified links.
So, I head over to serlist.com and check out their Red List #9 stats (thanks ron and gang...). For example, for Article sites I would choose the following engines:
Article - BuddyPress: 602
Article - Drupal Blog: 762
Article - Free DNS: 392
Article - Wordpress Article: 535
Article - XpressEngine: 1211
Then I head over to SER to get its predefined footprints.
Tools -> Search Online for URLs -> Add predefined Footprints, for Article - BuddyPress, Drupal–Blog, Free DNS, Wordpress Article, and XpressEngine.
Copy the footprints to a file, and here's what I do to further concentrate my scraping efforts. I remove all special operators like inurl:, intitle:, etc., which makes my proxies last much longer. Then I remove footprints that give me very few search results, say fewer than 1,000; it's a waste of time scraping those footprints with keywords. I use a ScrapeBox add-on which makes this task very easy.
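If you don't have the add-on handy, the operator-stripping pass is easy to script yourself. Here's a minimal sketch in Python, assuming the footprints were copied to a plain text file (the file names and the operator list are my assumptions, not part of the original workflow); filtering out low-result footprints still needs real search counts, so that part stays manual or in the add-on.

```python
import re
from pathlib import Path

# Hypothetical file names; point these at wherever you saved SER's predefined footprints.
SRC = Path("footprints_raw.txt")
DST = Path("footprints_clean.txt")

# Strip just the operator prefix so a quoted phrase survives as a plain search term,
# e.g. intitle:"Powered by Drupal" -> "Powered by Drupal".
OPERATORS = re.compile(r"\b(?:inurl|intitle|intext|inanchor|site):", re.IGNORECASE)

cleaned, seen = [], set()
for line in SRC.read_text(encoding="utf-8", errors="ignore").splitlines():
    footprint = " ".join(OPERATORS.sub("", line).split())  # also collapse extra spaces
    if footprint and footprint.lower() not in seen:
        seen.add(footprint.lower())
        cleaned.append(footprint)

DST.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
print(f"kept {len(cleaned)} footprints")
```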
Load these footprints+keywords into your scraper and scrape away...
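The footprint+keyword merge itself is just a cross-join of two files. Another rough sketch with assumed file names (one footprint or keyword per line), not the author's exact tooling:

```python
from itertools import product
from pathlib import Path

# Assumed inputs: one entry per line in each file.
footprints = [ln.strip() for ln in Path("footprints_clean.txt").read_text(encoding="utf-8").splitlines() if ln.strip()]
keywords = [ln.strip() for ln in Path("keywords.txt").read_text(encoding="utf-8").splitlines() if ln.strip()]

# One scraper query per footprint+keyword pair.
with Path("queries.txt").open("w", encoding="utf-8") as out:
    for footprint, keyword in product(footprints, keywords):
        out.write(f"{footprint} {keyword}\n")

print(f"{len(footprints) * len(keywords)} queries written to queries.txt")
```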
"Remove duplicate Domains" for engines like Article, Social Networking, Wiki, etc.
"Remove duplicate URLs" for Blog Comment, Pingback, Trackback, etc.
Load these newly found URLs into SER and "identify and sort in"; remember to use "Engine Filters", and there's no need to use proxies.
Happy scraping and hope this helps...
But sorry, I have to disagree with you: "identify and sort in" and using the Global list do have their benefits over importing newly found URLs into your projects directly, and "identify and sort in" is not a useless feature...
If you are worried about losing links by not identifying first, import the raw list 3-5 times and you will get most of the links.
So, my best solution is to identify and sort them into the identified list and use the list over and over again. As long as it doesn't hurt my LPM (max at 400), I don't delete or clean up my list, but I do remove duplicates every day.
The OCR guys may improve their ReCaptcha accuracy, Sven may improve some engines, or I may subscribe to a new text captcha service; then I can get more verified links from my identified list. If I didn't keep them, I would lose these "potential" links.
Yes, SER randomly picks URLs from the identified list, and this can slow down the build-up of your verified list. But I can speed it up by importing the identified list into a few of my projects...
But, if I'm doing Blog Comments, I probably won't bother with "identify and sort in". Get a few really good footprints, scrape, remove dup URLs, and import directly into blog comment projects.
So yes, @gooner is right, it all depends on what your goals are...
I'm not using it atm... The speed sucks and the solve rate sucks, as you mentioned.
Hoping it will be improved soon.
I would like to thank you for all the tips you shared here.
I don't understand your premise, inasmuch as this:
You import and sort in, which sorts them into the identified global list, and then you let that run over and over on your projects.
How is that different from just importing a list as target URLs for a project? Won't it go through and identify them and sort them into the identified global list as it goes along? And then you can still let it pull from the global identified list over and over, etc...
I'm missing your logic somewhere and I'm not sure what I am missing.
Lately I have been scraping massive lists with GScraper and public proxies, sometimes up to 50k URLs a minute. Just always remember to not only dedup the list you are importing to be sorted, but also to dedup your identified list and your projects as well.
I recently got a dedicated server to process lists in SER, and it's going very smoothly this way.
@loopline,
It depends on how you set up SER's Global list. I've only enabled Verified.
So, even if SER identifies a new site/URL, it cannot write or save it to the Identified list.
Reason for not enabling Identified? I have multiple lists, and I've imported the "blue pill" into my Submitted folder. If I enable Identified, SER will start writing and mixing Verified and Submitted into my Identified folder, and I don't want that to happen.
So yes @loopline, if you've enabled Identified and don't mind mixing up the list, then there is no difference between the methods, except for speed as @the_other_dude has pointed out...
With GSA SER, do I need to run a low number of threads (100-150) to get the maximum verified out of these, or can I run a larger number of threads (500-600) and still get the same verified rate?
What is the best way to get the maximum number of links verified from a raw list?
I use dedicated server and 60 private proxies.
Hence, that's the reason I use "identify and sort in". If SER can identify the site and saves it in the identified list, there's a good chance it'll get a verified link. If not today (due to a bad internet connection, bad captchas, or the website being down), maybe the next day, or the day after...
OK, that makes perfect sense. I also don't let SER write to the global Identified list, and I use it for maintaining the integrity of my high-quality targets, such as contextuals. However, I am a "how stuff works" person, because once I figure out how it works I can do anything I want.
So in this case I was just wanting to understand your logic, as I have also learned that many other people have different approaches than I do, and there is much to be gleaned from many minds thinking together.
Thanks for explaining.
To get the most verified links out of a scraped list you should be running multiple projects that are trying to post to the same targets. Increasing your threads to 600 with one project running is not going to get you more verified links. It's going to make you go through the list faster, depending on your proxies and hardware.