DRASTIC Efficiency Improvements - Part 2 - SE Logic Module

AlexR · December 2012

Following on from my previous post, I have been thinking about keyword and SE overlap. Whenever I use Scrapebox, a massive scrape ends with it removing duplicates and reporting "80% to 90% duplicates removed", and that has got me thinking. That's with 1 SE...how bad must it be when you are using several SE's powered by the same engine! What a waste to have to parse all these URL's.

That's like going door to door and asking 100 people the question "What is 1 + 1?" Well, you could ask 1 person and get 2, but you'd maybe wonder if it was correct. So you could ask some more people. Maybe 10 people. When 9 people all say 2 and one person says something else, it's time to stop. You're 90% certain the answer is 2. No need to continue...but that's exactly what we do at the moment in GSA...we keep knocking on doors, asking the same question. Yes, as soon as the person says 2 we quickly move on (like a URL that has already been parsed), but we are still checking!

It's something I touched on in the past and I see a new thread about it now where there is discussion on what SE's to select.

There are a LOT of SE's to choose from and we need to be able to sort them per project (different projects target different countries, keywords, niches, etc).

Let's say you have 10 KW's and each generates 100 results on 1 SE (like Google). That's 1 000 SE results to work through and identify the platform, etc.

Now, you expand this list with synonyms and add a few extra words. Let's say that 80% of the results are the same since many of these keywords are so related you would expect at least an 80% similarity. So now you only get 20% new results. (again these figures are very very rough, as it depends on many factors, but I'd guess they're closer to 90%) But we're happy with sacrificing the time to get this extra 20% of target sites.

BUT SHOULD WE BE SACRIFICING THIS TIME AND RESOURCES HERE FOR ONLY 20%???

This is using 1 SE.

Now when you increase this to 10 or 50 and the SE's are powered by similar engines or related, the numbers are NOT in your favour. It's like doing the same step 1, again, and again, and again, and again, and again...depending on how similar they are and how similar your KW's are. ;-)

HERE's a Feature I'd Like to See (maybe as an addition to the SE options, similar to how the proxy module works, but it needs to be per project, to help you choose which SE's to select for each project)

I think it should be available to all in GSA rather than a few testers (the reason is that different SE's for different countries, keywords, will generate different results, and I want each project to be super effective! Also this way the user can use the best results for their needs as well as do the research). It would add a HUGE amount of efficiency to GSA and all should benefit from this!!!

1) You input 1 keyword. (or it takes a random anchor text, since you are targeting this, and normally the keywords you use are related to this and the data is already inputted into GSA)

2) It selects ALL SE's available in GSA. (OR if you want you can remove by country mask, for more fine control)

3) It basically runs the keyword/random anchor through all SE's selected and allocates the URL's that each SE generates to that SE. So it parses ALL the SE's ONCE per project and stores these results.

4) You identify which URL's are COMMON to the most SE's, i.e. the results that most SE's generate for that keyword. You set the number of results, e.g. 10, 20, 50, 100, 200, etc. It then identifies the most common XX URL's across all SE's, e.g. the top 100 URL's for that anchor.

5) Set "% Common URL Match". For each SE you identify what percentage of the URL's in step 4 it returned. A good SE will cover 80 to 90% of the URL's in step 4...so you only need to select 1 good SE and it covers most of the results you need.

6) Set SE "% UNIQUE URL's". You have the option to Use/Select All SE's with a unique threshold above XX percent. Let the user decide the percent. I.e. here you are selecting SE's that are generating different results. This could be some great target URL's or just some very bad results by poor SE's, depending on how you set it. You've got to be asking yourself why these SE's are generating different results, and how much we should tolerate. Hence the user selecting the unique threshold.
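Steps 3 to 6 can be sketched roughly as follows. This is only an illustration of the idea, not anything GSA provides: the `results` dictionary (SE name to the URL's it returned), the function names and the example URL's are all made up, and "% unique" is assumed to mean the share of an SE's own URL's that no other SE returned.

```python
from collections import Counter

def common_urls(results, top_n):
    """Step 4: the top_n URLs returned by the most search engines."""
    # Count each URL once per SE, even if that SE listed it several times.
    counts = Counter(url for urls in results.values() for url in set(urls))
    # Most SEs first; alphabetical order breaks ties so the output is stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [url for url, _ in ranked[:top_n]]

def se_stats(results, top_n):
    """Steps 5 and 6: (% common URL match, % unique URLs) for each SE."""
    common = set(common_urls(results, top_n))
    stats = {}
    for se, urls in results.items():
        own = set(urls)
        others = set().union(*(set(u) for name, u in results.items() if name != se))
        match_pct = 100 * len(own & common) / len(common) if common else 0.0
        unique_pct = 100 * len(own - others) / len(own) if own else 0.0
        stats[se] = (match_pct, unique_pct)
    return stats

# Hypothetical stored results for one keyword (step 3).
results = {
    "google": ["a.com", "b.com", "c.com", "d.com"],
    "bing": ["a.com", "b.com", "c.com", "e.com"],
    "lycos": ["a.com", "b.com", "x.com", "y.com"],
}
top = common_urls(results, 3)
stats = se_stats(results, 3)
for se, (match_pct, unique_pct) in stats.items():
    print(f"{se}: {match_pct:.0f}% common match, {unique_pct:.0f}% unique")
```

In this toy data "google" and "bing" both cover all of the common URL's, so only one of them is needed, while "lycos" covers fewer common URL's but has a higher share of unique ones.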

So what this practically means and why it's a DRASTIC efficiency improvement.

Let's say you have 1 keyword. You set step 4 (the number of common URL's) to 50. You identify the 50 most common URL's for that keyword. In order to get ranked for this keyword in the MOST SE's, you would want links on these 50 common URL's, since these are the URL's that the MOST SE's deem important and authority sites in the niche.

This would be a massively neat feature if that was all. You've basically got the hottest list for sites you need to get links on right there!

But it gets better.

Step 5 basically says which SE's generate the same results and should be ignored. So instead of parsing the results of SE's that just generate the SAME results every time, it says IGNORE ALL SE's that generate a % Common URL Match above 80%. Thus, use 1 SE and ignore all other SE's that generate the same 80% of results. THIS STEP IS CRITICAL AS IT WILL REMOVE THE SE'S THAT JUST GENERATE THE SAME RESULTS!

Step 6 - you can set the threshold for which SE's generate enough value for you to run your queries on (i.e. what percent of unique results an SE has to generate for it to be worth using). Some niches have very few results, so ANY unique new URL's are great, while in others you can afford to "miss" one or two URL's, because you have saved so much on resources and time that you've got 1 000 other URL's to go at instead. Some keywords/niches just don't have this luxury. This way you can really make sure your SE selection is maximised for your keyword/niche.
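A rough sketch of how the two thresholds could be applied, under my reading of steps 5 and 6: keep the one SE with the best common match, drop every other SE whose common match is above the limit, and keep the rest only if their unique percentage clears the second limit. The function name, the default thresholds and the figures in `stats` are invented for the example.

```python
def select_engines(stats, max_match_pct=80.0, min_unique_pct=10.0):
    """stats maps SE name -> (% common URL match, % unique URLs)."""
    # Step 5: keep the single SE that covers the common URLs best...
    best = max(stats, key=lambda se: stats[se][0])
    selected = [best]
    for se, (match_pct, unique_pct) in stats.items():
        if se == best:
            continue
        # ...and ignore the other SEs that mostly repeat the same results.
        if match_pct > max_match_pct:
            continue
        # Step 6: keep an SE only if enough of its results are unique.
        if unique_pct >= min_unique_pct:
            selected.append(se)
    return selected

# Made-up figures for four SEs.
stats = {
    "google": (100.0, 25.0),
    "bing": (100.0, 25.0),
    "lycos": (66.7, 50.0),
    "tiny": (10.0, 5.0),
}
selected = select_engines(stats)
print(selected)
```

With these numbers "bing" is dropped as a repeat of "google", "lycos" is kept for its unique results, and "tiny" is dropped because too few of its results are unique.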

Practically, let's talk numbers.

1 000 keywords.

50 SE results on average per SE.

That's 50 000 unique results.

Now that's assuming you're using 1 SE that generates unique results.

Let's say you use 50 SE's (maybe you missed some really good ones that you didn't think to use!). These SE's have 80% similar results...or maybe you got lucky and just picked the right combination of SE's...but you'd still have picked many that just generate the same results...there's no way to know unless you are doing serious testing.

That means you are running

50 SE's x 50 000 = 2 500 000 URL's to parse.

80% similar = 2 000 000 URL's that didn't need to get parsed.

Now if you are running 10 projects per VPS that's 20 000 000 URL's that didn't need to get parsed. That = ONE DRASTIC EFFICIENCY IMPROVEMENT.
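The same arithmetic as a few lines of Python, so the inputs can be swapped for your own. The variable names are mine, and the 80% overlap is the rough guess from above, not a measured figure.

```python
keywords = 1_000
results_per_keyword = 50   # average SE results per keyword
engines = 50
overlap_pct = 80           # assumed share of results that are repeats
projects_per_vps = 10

results_per_engine = keywords * results_per_keyword   # 50 000
total_to_parse = engines * results_per_engine         # 2 500 000
wasted = total_to_parse * overlap_pct // 100          # 2 000 000
wasted_per_vps = projects_per_vps * wasted            # 20 000 000

print(f"{total_to_parse:,} URLs to parse, {wasted:,} of them repeats")
print(f"{wasted_per_vps:,} repeats across {projects_per_vps} projects")
```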

I hope it's not a big feature, but it would add serious value.

The key is:

1) Storing the SE results for each SE.

2) Computation to find the most common number of XX target URL's.

3) Applying the filter/thresholds.

Please add some discussion! (This post took forever to explain clearly...it's clear in my head at least...I hope I have done it justice!)

AlexR · December 2012

I know this post is a little long but it took me ages to write and construct! I'd love to get some feedback from others on it.

Bytefaker · December 2012

I can agree with this also. Again thanks for taking the time to write these thoughts and suggestions down. I'm hoping that Sven is open to it (and that you get some additional feedback).

AlexR · December 2012

@bytefaker - have just been surprised at the complete lack of feedback on this. Maybe I should rather have entitled it "I'm getting low links" - that always seems to get some good feedback. ;-)

Heisenberg · December 2012

Perhaps it's because not everyone is so stupid to use keywords like this for searching / scraping...

what are blue widgets
samsung blu ray widgets
blue widgets seo
blue one armed widgets
how we make blue widgets
blue jays widgets
what is blue widgets
how we make blue widgets
buy blue widgets
beautiful widgets binary blue
what are blue widgets
blue one armed widgets
apple widgets blue friends

I for one don't!

doubleup · December 2012

People use GSA SER in the same way, but differently. The issues you encounter, i don't experience, as i don't use GSA SER to find sites to place backlinks on.

AlexR · December 2012

@doubleup - so what you're saying is GSA is a weak tool to find sites to place backlinks on? Why else would you not use it for that?

AlexR · December 2012

@Heisenberg - you'll see MUCH advice on generating keywords, and they all generate similar results. So yes, most people will have an overlap to some degree...what I'm proposing is a way to help resolve that...

Anyways...even WITH unique keywords, there is a SE overlap, that I am proposing we solve, so while your comment applies to one aspect that I touched on (too similar keywords), it doesn't fully cover what I was proposing (which is SE speed, and SE overlap).

doubleup · December 2012

@GlobalGoogler - Pacquiao lost to Marquez last week, but i wouldn't say Pacquiao was weak

In a day, using GSA SER to search for places to backlink, you'd get a certain number. Using other programs such as SB, you'd be able to find a FAR greater number in the same period of time. It's all about efficient use of time, and using GSA SER in that way isn't very efficient in my opinion.

AlexR · December 2012

@Sven - would this be possible?

Sven · December 2012

this sounds logical and good but right now I don't see that this can be added any time soon. Though I keep this bookmarked to maybe add one or the other thing when I can.

AlexR · December 2012

@Sven - thanks for considering it!

Also - maybe you can advise me here....

I'd like to take a keyword and run it through ALL the SE's in GSA. I'd like it to get the parsed results for each SE and store it in csv. Then I can do some analysis on this.

I'm not a programmer. Is there an easy way to do this (get all the parsed results for each SE), without doing it one by one?

Sven · December 2012

sorry, there is no such way, as duplicate results get filtered out before showing in the log.

AlexR · December 2012

@sven - would it be possible for it to log how many duplicates it removed? Thus, if we inputted only 1 keyword, and we selected all SE's, it would log which ones had the most duplicates. Then we could see which SE's are generating the most duplicates, and as such overlapping the most. Would help greatly in us choosing the correct SE's to use for our projects. I know this is a temporary solution, but it seems it would help greatly.

Sven · December 2012

Yes, that's shown with a message after SE parsing like "010/020 ..." saying 10 were duplicates out of 20 returned results.

AlexR · December 2012

@sven - This is good news. So just to check if I can do this:

I go to options, tools advanced, I take 1 keyword and select all SE's. I want to run the search without footprints, so this would be the best way. But I don't think it offers a log for this. Is there a way to extract this log too?

OR

Is there a platform that I can select that has the smallest footprint, i.e. that is closest to just a "keyword" straight search with nothing added?

So what I'd do is select a single platform, a single keyword, all SE's. Then let it run, and log it to file.

Then extract and clean up data so that for each SE (about 1000) I could get the statistic:

1) Number of URL's Parsed

2) Number of Duplicates

Sven · December 2012

Yes you can define a engine with a footprint like "a" and it would basically only use that one + the keyword.

AlexR · December 2012

I'm just starting to learn about the scripting.

So I create a .ini file, with a footprint in it. What folder do I place it in? Will GSA then add it automatically to the GUI?

Sven · December 2012

in the engine folder where the executable is. The program will find it automatically.

AlexR · January 2013

Can someone help me with this? I am not a programmer and took a look at the .ini files and wasn't able to do this. So many variables in the file and I'm not sure if I must remove any. I can see where to edit the footprint, but not sure about all the others!

What I am looking for is a "Test.ini" file that I can run that only has a footprint like "a".

There are just too many variables in the .ini's and I'm not sure what to edit. If anyone can assist me with my testing by supplying a test.ini that only has a footprint with "a" I'd be very grateful!

Also - in the footprint, I don't want it to actually submit something. I am only interested in getting the SE's to parse the results.

AlexR · January 2013

What I have done is just copy a trackback .ini and replace 2 lines with:

url must have1=**

search term=a

I then ran a test with keyword "golf".

These are some of my logged results:

[12:15:47] test.com: [ ] 000/000 [Page END] results on google DE for Search Engine Test with query a golf http://www.google.de/search?q=a+golf&as_qdr=all&filter=0&num=100&start=300&cr=countryDE

[12:15:47] test.com: [ ] 000/001 [Page END] results on Lycos DE for Search Engine Test with query a golf http://search.lycos.com/web?q=a+golf&pn=1&region=de

[12:15:51] test.com: [ ] 000/001 [Page END] results on Lycos DE for Search Engine Test with query a golf http://search.lycos.com/web?q=a+golf&pn=1&region=de

[12:15:54] test.com: [ ] 093/142 [Page 001] results on google DE for Search Engine Test with query golf a http://www.google.de/search?q=golf+a&as_qdr=all&filter=0&num=100&start=0&cr=countryDE

[12:15:55] test.com: [ ] 000/001 [Page END] results on Lycos DE for Search Engine Test with query a golf http://search.lycos.com/web?q=a+golf&pn=1&region=de

[12:15:56] test.com: [ ] 001/011 [Page 002] results on MSN DE for Search Engine Test with query golf a http://www.bing.com/search?q=golf+a+loc:DE&filt=all&first=11&FORM=PERE

[12:15:57] test.com: [ ] 042/097 [Page 002] results on google DE for Search Engine Test with query golf a http://www.google.de/search?q=golf+a&as_qdr=all&filter=0&num=100&start=100&cr=countryDE

@sven - so what I would do is: (looking at Google DE)

1) Add up the total results for all pages Google DE.. e.g. 97+142+0. This gives me total parsed results.

2) Add up the total duplicates for SE. e.g. 42+93+0. This gives me SE duplicates/already parsed results.

3) Measure this ratio of duplicates to parsed results.
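The three steps above could be automated with a short script along these lines. It only assumes the log line layout shown in the sample above; the regular expression and function name are mine. The thread describes the first number in "093/142" both as duplicates and as new results, so the sketch just sums the two columns per SE and leaves the interpretation open.

```python
import re
from collections import defaultdict

# Matches e.g. "093/142 [Page 001] results on google DE for ..."
LINE_RE = re.compile(r"(\d+)/(\d+) \[Page (\w+)\] results on (.+?) for ")

def tally(log_lines):
    """Sum the first and second counter of every log line, per SE."""
    totals = defaultdict(lambda: [0, 0])
    for line in log_lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # not a search-result line
        first, second, _page, engine = m.groups()
        totals[engine][0] += int(first)
        totals[engine][1] += int(second)
    return {se: tuple(t) for se, t in totals.items()}

# Lines copied from the log above.
sample = [
    "[12:15:47] test.com: [ ] 000/000 [Page END] results on google DE for Search Engine Test with query a golf http://www.google.de/search?q=a+golf&as_qdr=all&filter=0&num=100&start=300&cr=countryDE",
    "[12:15:47] test.com: [ ] 000/001 [Page END] results on Lycos DE for Search Engine Test with query a golf http://search.lycos.com/web?q=a+golf&pn=1&region=de",
    "[12:15:54] test.com: [ ] 093/142 [Page 001] results on google DE for Search Engine Test with query golf a http://www.google.de/search?q=golf+a&as_qdr=all&filter=0&num=100&start=0&cr=countryDE",
    "[12:15:56] test.com: [ ] 001/011 [Page 002] results on MSN DE for Search Engine Test with query golf a http://www.bing.com/search?q=golf+a+loc:DE&filt=all&first=11&FORM=PERE",
    "[12:15:57] test.com: [ ] 042/097 [Page 002] results on google DE for Search Engine Test with query golf a http://www.google.de/search?q=golf+a&as_qdr=all&filter=0&num=100&start=100&cr=countryDE",
]
totals = tally(sample)
for se, (first, second) in totals.items():
    ratio = first / second if second else 0.0
    print(f"{se}: {first}/{second} = {ratio:.0%}")
```

For Google DE this adds up to 135 out of 239, the same sums as in steps 1 and 2.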

Questions:

1) Is the above footprint edit correct?

2) When it is measuring it as parsed, which SE results is it comparing it to? (i.e. which SE parsed it first? Is this a random one?)

3) If it's using a random one to measure original parse, what I could do is first run it with Google.com, then use this as a base. Then I could measure the value each extra SE adds. Would this be correct?

Sven · January 2013

1. Yes, the footprint seems to be OK

2. It's all random

3. Yes, that's how I would do it as well.

AlexR · January 2013

@Sven - to be certain that I get all the results/pages from the SE, can I confirm that it always shows the [PAGE END] once it has parsed all the results? So this is the marker to show the SE has exhausted the results.

Sven · January 2013

yes the END comes up if no more results are to be expected from the search engine (last page).

AlexR · January 2013

Starting some SE testing this week. Looking forward to seeing some of the results. :-)

AlexR · January 2013

Thanks for your patience! Just setting everything up for the test and need a few pointers.

Just having a minor issue:

1) I selected Google US.

2) See log file here: http://pastebin.com/RN1PZSCW

3) I cleared target URL cache, and target url history on the project.

I'm trying to run initial project with Google.com only as a reference point.

All I see is "000/000 [Page END] results on google US " and one or two entries with "000/140 [Page 001] results on google US"

1) Surely Google.com has 1000 results?

2) Does this indicate that there are only 140 results for the search term "a golf"?

3) Or is it only using Google US and it's only showing 140 results?

4) Where do I find the main Google.com SE in the SE options (which country is it listed under and what's it called)? I want to use this as a reference SE?

Is this correct?

AlexR · January 2013

Sorry to bump this up, but wanting to run the tests today/tomorrow and just want to confirm the above 4 points before I get it running. Macros created, all good to go. Just need to check the above before I begin. :-)

Sven · January 2013

"000/140 [Page 001]" means 0 new results from 140 results on page 1. The program does not know/extract how many more pages/results there are.

AlexR · January 2013

Where do I find the main Google.com SE in the SE options (which country is it listed under and what's it called)? I want to use this as a reference SE?

Ozz · January 2013

google --> International

AlexR · January 2013

Thanks!
