Identify and sort in, or run raw url list?
Hey
I'm scraping massive amounts of URLs and just wanted to see what the scraper's choice is: feed the raw URL list into multiple projects, or identify and sort in, then import from the identified list?
Thanks
Comments
However, even after I dedup the scrape file, it still takes forever to sort in 200,000+ domains, and it's killing my CPU (maxed out with 4 cores).
I also dedupe the identified list because I do this every day, and it's pretty time-consuming.
I think I forgot to disable proxies this time. Even when I do disable them, it still kills my CPU though. I do check the sort by engine, but I don't uncheck any of them. I'll try that next time, thanks.
I run multiple instances of ScrapeBox, each one only scraping for one type of engine, e.g. Article, Social Network, or Wiki. Then I remove duplicates (domain names in my case) and "identify platform and sort in", selecting only the engines I scraped for. It's much, much faster this way.
Here's how to save the engine selection. Open a project, select all the engines you want to submit to (Article in this case), right click (in the [Where to Submit] box) -> "Load/Save Engine Selection" -> "Save Current".
You can then use the saved engine selection in "identify and sort in".
I usually concentrate my resources on scraping engines/platforms that can get me the most verified links.
So, I head over to serlist.com and check out their Red List #9 stats (thanks ron and gang...). For example, for Article sites I would choose the following engines:
Article - BuddyPress: 602
Article - Drupal Blog: 762
Article - Free DNS: 392
Article - Wordpress Article: 535
Article - XpressEngine: 1211
Then I head over to SER to get its predefined footprints.
Tools -> Search Online for URLs -> Add predefined Footprints, for Article - BuddyPress, Drupal–Blog, Free DNS, Wordpress Article, and XpressEngine.
Copy the footprints to a file, and here's what I do to further concentrate my scraping efforts. I remove all special operators like inurl:, intitle:, etc., which makes my proxies last much longer. Then I remove footprints that give me very few search results, say fewer than 1,000; it's a waste of time scraping those footprints with keywords. I use a ScrapeBox add-on which makes this task very easy.
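If you don't have the add-on handy, the operator-stripping pass is easy to script yourself. Here's a minimal sketch in Python, assuming the footprints were copied to a plain text file (the file names and the operator list are my assumptions, not part of the original workflow); filtering out low-result footprints still needs real search counts, so that part stays manual or in the add-on.

```python
import re
from pathlib import Path

# Hypothetical file names; point these at wherever you saved SER's predefined footprints.
SRC = Path("footprints_raw.txt")
DST = Path("footprints_clean.txt")

# Strip just the operator prefix so a quoted phrase survives as a plain search term,
# e.g. intitle:"Powered by Drupal" -> "Powered by Drupal".
OPERATORS = re.compile(r"\b(?:inurl|intitle|intext|inanchor|site):", re.IGNORECASE)

cleaned, seen = [], set()
for line in SRC.read_text(encoding="utf-8", errors="ignore").splitlines():
    footprint = " ".join(OPERATORS.sub("", line).split())  # also collapse extra spaces
    if footprint and footprint.lower() not in seen:
        seen.add(footprint.lower())
        cleaned.append(footprint)

DST.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
print(f"kept {len(cleaned)} footprints")
```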
Load these footprints+keywords into your scraper and scrape away...
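The footprint+keyword merge itself is just a cross-join of two files. Another rough sketch with assumed file names (one footprint or keyword per line), not the author's exact tooling:

```python
from itertools import product
from pathlib import Path

# Assumed inputs: one entry per line in each file.
footprints = [ln.strip() for ln in Path("footprints_clean.txt").read_text(encoding="utf-8").splitlines() if ln.strip()]
keywords = [ln.strip() for ln in Path("keywords.txt").read_text(encoding="utf-8").splitlines() if ln.strip()]

# One scraper query per footprint+keyword pair.
with Path("queries.txt").open("w", encoding="utf-8") as out:
    for footprint, keyword in product(footprints, keywords):
        out.write(f"{footprint} {keyword}\n")

print(f"{len(footprints) * len(keywords)} queries written to queries.txt")
```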
"Remove duplicate Domains" for engines like Article, Social Networking, Wiki, etc.
"Remove duplicate URLs" for Blog Comment, Pingback, Trackback, etc.
Load these newly found URLs into SER and "identify and sort in"; remember to use "Engine Filters", and there's no need to use proxies.
Happy scraping and hope this helps...
But sorry, I have to disagree with you: "identify and sort in" and using the Global list do have their benefits over importing newly found URLs into your projects directly, and "identify and sort in" is not a useless feature...
If you are worried about losing links by not identifying first, import the raw list 3-5 times and you will get most of the links.
So, my best solution is to identify and sort them into the identified list and use the list over and over again. As long as it doesn't hurt my LPM (max at 400), I don't delete or clean up my list, but I do remove duplicates every day.
The OCR guys may improve their ReCaptcha accuracy, Sven may improve some engines, or I may subscribe to a new text captcha service; then I can get more verified links from my identified list. If I didn't keep them, I would lose these "potential" links.
Yes, SER randomly picks URLs from the identified list, and this can slow down the build-up of your verified list. But I can speed it up by importing the identified list into a few of my projects...
But, if I'm doing Blog Comments, I probably won't bother with "identify and sort in". Get a few really good footprints, scrape, remove dup URLs, and import directly into blog comment projects.
So yes, @gooner is right, it all depends on what your goals are...
I'm not using it atm... The speed sucks and the solve rate sucks, as you mentioned.
Hoping it will be improved soon.
I would like to thank you for all the tips you shared here.
I don't understand your premise, inasmuch as this:
You import and sort in, which sorts them into the identified global list, and then you let that run over and over on your projects.
How is that different from just importing a list as target URLs for a project? Won't it go through and identify them and sort them into the identified global list as it goes along? And then you can still let it pull from the global identified list over and over, etc...
I'm missing your logic somewhere and I'm not sure what I am missing.
Lately I have been scraping massive lists with GScraper and public proxies, sometimes up to 50k URLs a minute. Just always remember to not only dedup the list you are importing to be sorted, but also to dedup your identified list and your projects as well.
I recently got a dedicated server to process lists in SER, and it's going very smoothly this way.
@loopline,
It depends on how you set up SER's Global list. I've only enabled Verified.
So, even if SER identifies a new site/URL, it cannot write or save it to the Identified list.
Reason for not enabling Identified? I have multiple lists, and I've imported the "blue pill" into my Submitted folder. If I enable Identified, SER will start writing and mixing Verified and Submitted into my Identified folder, and I don't want that to happen.
So yes @loopline, if you've enabled Identified and don't mind mixing up the list, then there is no difference between the methods, except for speed as @the_other_dude has pointed out...
With GSA SER, do I need to run a low number of threads (100-150) to get the maximum verified out of these, or can I run a larger number of threads (500-600) and still get the same verified rate?
What is the best way to get the maximum number of links verified from a raw list?
I use dedicated server and 60 private proxies.
Hence, that's the reason I use "identify and sort in". If SER can identify the site and saves it in the identified list, there's a good chance it'll get a verified link. If not today (due to a bad internet connection, bad captchas, or the website being down), maybe the next day, or the day after...
OK, that makes perfect sense. I also don't let SER write to the global Identified list, and I use it for maintaining the integrity of my high-quality targets, such as contextuals. However, I am a "how stuff works" person, because once I figure out how it works I can do anything I want.
So in this case I was just wanting to understand your logic, as I have also learned that many other people have different approaches than I do, and there is much to be gleaned from many minds thinking together.
Thanks for explaining.
To get the most verified links out of a scraped list you should be running multiple projects that are trying to post to the same targets. Increasing your threads to 600 with one project running is not going to get you more verified links. It's going to make you go through the list faster, depending on your proxies and hardware.