Someone please enlighten me....

magically · November 2014

Hi,

I really hope someone is able to enlighten me about this mystery.

Here is the problem:

I ran a good scrape with multiple footprints for .edu and .gov and 100K keywords merged together.

The result was more than 250,000 URLs, found in about 1-2 hours with 70 threads using private proxies.

After trimming those for redundancy and pointless extensions like .pdf, .png, etc., I ran them through GSA SER.

And here is the funny part: more than half of those URLs came back as "Already parsed" or "No Engine".

The result is really bad, to put it frankly.
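For readers who want to do this kind of trimming outside a scraping tool, it can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the extension list and the name `trim_urls` are made up for the example and are not what any particular tool does internally.

```python
from urllib.parse import urlsplit

# Assumed list of file extensions that are useless as link targets; adjust as needed.
SKIP_EXTENSIONS = (".pdf", ".png", ".jpg", ".jpeg", ".gif", ".doc", ".zip")


def trim_urls(urls):
    """Drop exact duplicate URLs and URLs whose path ends in a skipped extension."""
    seen = set()
    kept = []
    for url in urls:
        url = url.strip()
        if not url or url in seen:
            continue  # blank line or exact duplicate URL
        seen.add(url)
        # Look only at the path so query strings such as ?x=1 do not hide the extension.
        path = urlsplit(url).path.lower()
        if path.endswith(SKIP_EXTENSIONS):
            continue
        kept.append(url)
    return kept
```

Note that this only removes exact duplicate URLs. Two different URLs on the same domain both survive, which matters for the "already parsed" messages discussed later in the thread.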

I also tried to extract the footprints internally in GSA SER and then combine those with 100K keywords...

Same result: NOTHING!

How is that possible?

Why are there no positives i.e verified links?

Why does it say already parsed when I'm 100% sure both footprints and keywords are unique?

Damn, I'm really frustrated with these endless scrapes with zero results, regardless of settings, footprints, methods, etc.

How on earth do you guys get success?

Anyone here up for running my scraped list through to see why I get nothing out of it?

robby1 · November 2014

Thank you for making music out of the noise I was getting deafened with.

I'm suffering from exactly the same issues.

s4nt0s · November 2014

@magically - If you need me to test the list on my end, I'll run it through.

magically · November 2014

@s4nt0s

Appreciated:)

I will drop you a PM now with a link to download the list, which has more than 224,000 URLs.

Looking forward to seeing whether you get the same results, because I have made many lists and they always end the same way: zero.

@robby1

Let's wait and see what comes out of the re-run of the mentioned list. Surely something must be wrong if more people have the same issues ;)

satyr85 · November 2014

"Why are there no positives i.e verified links?"

250k URLs harvested in 1-2 hours is a very low harvesting speed. You won't get many verified links from so few harvested URLs. Here is a screenshot from one of my harvesting servers:

As you can see, ~500k links are harvested per minute, and that's the volume of links you need to harvest to get tons of possible targets.

Did you remove duplicates from this 250k harvest (duplicate domains/URLs)? If not, that can be the reason for "already parsed".

"Why does it say already parsed when I'm 100% sure both footprints and keywords are unique?"

Your keywords and footprints can be 1000% unique, but Google won't give you unique results for every footprint + keyword search. That's how Google works. For example, right now I'm doing a harvest for BuddyPress sites: 60k keywords, 1 footprint. So far I have harvested ~13 million URLs, and after removing duplicates I get 18k unique domains.

Feel free to PM me your list so I can take a closer look.

magically · November 2014

@satyr85

Yep I do see the "bigger" picture you are pointing out here.

Originally I had approx. 450K URLs before removing duplicate URLs with another tool.

I guess I'm just really bad at doing the scraping part then:D

However, I also only have 40 private proxies to do the job, which naturally reduces the number of threads that can run and increases the time it takes.

I just thought that 250K URLs would give at least 20-30 verified links, but that is clearly not the case.

Anyway, I will forward that list to you in a PM as well.

Perhaps I can learn something and improve it over time, it's actually quite difficult to do a decent job.

satyr85 · November 2014

Just got your list via PM, and it is not properly deduped.

You removed duplicate URLs. That's fine only if you are looking for blog comments, image comments, etc., but from what I see you don't target those engines, so you have to remove duplicate domains and then import the deduped list into GSA.

After removing duplicate domains, your list contains 26k unique domains. The "Already parsed" messages in the log were there because your list was not properly deduped.

Btw, which engines were you targeting with this scrape?

magically · November 2014

Ohhh I see, so by deduping you mean trim to root?

- Remove duplicate domains also

Engines targeted here (the list) were various forums primarily.

Well I will try to re-run it and see if that improves the performance in terms of verified links:D

I guess the 'deduping factor' is rather important - so I need to catch up what that exactly means and how to do it properly.

satyr85 · November 2014

Dedupe = remove duplicate domains in this case. Trimming to root is a different thing. You can remove duplicate domains from a txt file in GSA using:
Options -> Advanced -> Tools -> Remove duplicate domains from file.
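To make the difference between the two operations concrete, here is a rough Python sketch. It is an illustration only, not how GSA SER or Scrapebox implement it. It assumes every URL includes a scheme (`http://` or `https://`), treats the hostname minus a leading `www.` as the "domain", and all function names are made up for the example.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased hostname with a leading 'www.' removed ('' if none)."""
    host = urlsplit(url.strip()).hostname or ""
    return host[4:] if host.startswith("www.") else host


def remove_duplicate_domains(urls):
    """Dedupe by domain: keep the first full URL seen for each host."""
    seen = set()
    kept = []
    for url in urls:
        host = host_of(url)
        if host and host not in seen:
            seen.add(host)
            kept.append(url.strip())
    return kept


def trim_to_root(urls):
    """Trim to root: replace each URL with scheme://host/ and drop repeats."""
    seen = set()
    roots = []
    for url in urls:
        parts = urlsplit(url.strip())
        if not parts.hostname:
            continue  # skip lines without a scheme/host (assumption noted above)
        root = f"{parts.scheme}://{parts.netloc}/"
        if root not in seen:
            seen.add(root)
            roots.append(root)
    return roots
```

With two thread URLs on the same forum, `remove_duplicate_domains` keeps the first thread URL in full, while `trim_to_root` keeps only the forum's home page URL. Both leave one entry per site, but only the first preserves the deep link that was actually scraped.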

magically · November 2014

Ahhh I see, thanks for that tip :)

I was about to do that in Scrapebox. It's new to me that GSA SER can do the same task.

You are right, that brings it down to 20K.
