Following on from my previous post, I have been thinking about keyword and SE overlap. The number of times I've used Scrapebox and, after a massive scrape, watched it remove duplicates and report 80% to 90% duplicates removed has got me thinking. That's with 1 SE...how bad must it be when you are using SE's powered by the same engine! What a waste to have to go and parse all those URL's.
That's like going door to door and asking 100 people the answer to the question "What is 1 + 1?" Well, you could ask 1 person and get 2, but you'd maybe wonder if it was correct. So you could ask some more people. Maybe 10 people. When 9 people all say 2 and one person says something else, it's time to stop. You're 90% certain the answer is 2. No need to continue...but that's exactly what we do at the moment in GSA...we keep knocking on doors, asking the answer to the question. Yes, as soon as the person says 2 we quickly move on (like a URL parsed), but we are still checking!
It's something I touched on in the past and I see a new thread about it now where there is discussion on what SE's to select.
There are a LOT of SE's to choose from and we need to be able to sort them per project (different projects target different countries, keywords, niches, etc).
Let's say you have 10 KW's and each generates 100 results on 1 SE (like Google). That's 1 000 SE results to have a go at, identify the platform, etc.
Now you expand this list with synonyms and add a few extra words. Let's say that 80% of the results are the same, since many of these keywords are so related you would expect at least an 80% similarity. So now you only get 20% new results. (Again, these figures are very, very rough, as it depends on many factors, but I'd guess they're closer to 90%.) But we're happy to sacrifice the time to get this extra 20% of target sites.
BUT SHOULD WE BE SACRIFICING THIS TIME AND RESOURCES HERE FOR ONLY 20%???
This is using 1 SE.
Now when you increase this to 10 or 50 SE's, and those SE's are powered by similar engines or related, the numbers are NOT in your favour. It's like doing step 1 again, and again, and again, and again, and again...depending on how similar the SE's are and how similar your KW's are. ;-)
HERE's a Feature I'd Like to See (maybe added as an SE module, similar to how the proxy module works, but this needs to be per project, to help you choose which SE's to select for each project)
I think it should be available to all in GSA rather than a few testers. (The reason is that different SE's for different countries and keywords will generate different results, and I want each project to be super effective! Also this way the user can use the best results for their needs as well as do the research.) It would add a HUGE amount of efficiency to GSA and all should benefit from this!!!
1) You input 1 keyword. (Or it takes a random anchor text, since you are targeting this, and normally the keywords you use are related to it and the data is already entered into GSA.)
2) It selects ALL SE's available in GSA. (OR, if you want, you can filter by country mask for finer control.)
3) It basically runs the keyword/random anchor through all selected SE's and allocates the URL's that each SE generates to that SE. So it parses ALL the SE's ONCE per project and stores these results.
4) You identify which URL's are COMMON to the most SE's. I.e. these are the results that most SE's generate for that keyword. You set the number of results: 10, 20, 50, 100, 200, etc. Thus it identifies the most common XX URL's across all SE's, i.e. the top 100 URL's for that anchor.
5) Set "% Common URL Match". For each SE you identify how many (as a percentage) of the URL's from step 4 it returned. A good SE will cover 80 to 90% of the URL's in step 4...so you only need to select 1 good SE and it covers most of the results you need.
6) Set SE "% UNIQUE URL's". You have the option to use/select all SE's with a unique threshold above XX percent. Let the user decide the percent. I.e. here you are selecting SE's that generate different results. These could be some great target URL's, or just some very bad results from poor SE's, depending on how you set it. You've got to ask yourself why these SE's are generating different results, and how much we should tolerate. Hence the user selecting the unique threshold.
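To make steps 3 to 6 concrete, here's a rough sketch in Python. Everything here is hypothetical: `results_per_se` stands in for whatever GSA would store after parsing each SE once for the test keyword/anchor.

```python
from collections import Counter

def analyse_se_overlap(results_per_se, top_n=50):
    """results_per_se: hypothetical dict of SE name -> list of URL's
    that SE returned for the test keyword/anchor."""
    # Step 4: count, per URL, how many SE's returned it,
    # then keep the top_n most common URL's.
    counts = Counter()
    for urls in results_per_se.values():
        counts.update(set(urls))          # each SE counts a URL once
    common = {url for url, _ in counts.most_common(top_n)}

    stats = {}
    for se, urls in results_per_se.items():
        urls = set(urls)
        # Step 5: "% Common URL Match" - share of the common set this SE covers.
        common_match = 100.0 * len(urls & common) / len(common) if common else 0.0
        # Step 6: "% UNIQUE URL's" - share of this SE's results no other SE returned.
        unique = {u for u in urls if counts[u] == 1}
        unique_pct = 100.0 * len(unique) / len(urls) if urls else 0.0
        stats[se] = (round(common_match, 1), round(unique_pct, 1))
    return common, stats
```

An SE with a high common match and a low unique percentage is exactly the kind of redundant SE you'd want flagged, while a high unique percentage marks an SE worth a closer look.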
So here's what this practically means, and why it's a DRASTIC efficiency improvement.
Let's say you have 1 keyword. You set step 4 to "50 common URL's". You identify the 50 most common URL's for that keyword. In order to rank for this keyword in the MOST SE's, you would want links on these 50 common URL's, since these are the URL's that the MOST SE's deem important and authority sites in the niche.
This would be a massively neat feature if that was all. You've basically got the hottest list for sites you need to get links on right there!
But it gets better.
Step 5 basically says which SE's generate the same results and should be ignored. So instead of parsing the results of SE's that just generate the SAME results every time, it says IGNORE ALL SE's whose % Common URL Match is above 80%. Thus, use 1 SE and ignore all other SE's that generate the same 80% of results. THIS STEP IS CRITICAL AS IT WILL REMOVE THE SE'S THAT JUST GENERATE THE SAME RESULTS!
Step 6 - you can set the threshold for which SE's generate enough value to be worth running your queries on. (I.e. what percent of unique results does an SE have to generate for it to be worth using?) Some niches have very few results, so ANY unique new URL's are great, while in others you can afford to "miss" one or two URL's, because you have saved so much time and so many resources that you've got 1 000 other URL's to go at instead. Some keywords/niches just don't have this luxury. This way you can really make sure your SE selection is maximised for your keyword/niche.
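Steps 5 and 6 combined could look something like this. This is a hypothetical selection rule, assuming `stats` maps each SE name to its `(% common match, % unique)` pair, with both thresholds user-settable:

```python
def select_ses(stats, common_cutoff=80.0, unique_min=10.0):
    # Step 5: SE's that mostly return the common set are interchangeable,
    # so keep just ONE representative of that group.
    redundant = sorted(se for se, (c, u) in stats.items() if c >= common_cutoff)
    keep = set(redundant[:1])
    # Step 6: plus any SE whose unique share clears the user's threshold.
    keep |= {se for se, (c, u) in stats.items() if u >= unique_min}
    return keep
```

The two thresholds pull in opposite directions: a strict `common_cutoff` prunes the clones, while `unique_min` decides how much you pay to chase the stragglers.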
Practically, let's talk numbers.
1 000 keywords.
50 SE results on average per SE.
That's 50 000 unique results.
Now that's assuming you're using 1 SE that generates unique results.
Let's say you use 50 SE's (maybe you missed some really good ones that you didn't think to use!). These SE's have 80% similar results...or maybe you got lucky and just picked the right combination of SE's...but you'd still have picked many that just generate the same results...there's no way to know unless you are doing serious testing.
That means you are running:
50 SE's x 50 000 = 2 500 000 URL's to parse.
80% similar = 2 000 000 URL's that didn't need to get parsed.
Now if you are running 10 projects per VPS, that's 20 000 000 URL's that didn't need to get parsed. That = ONE DRASTIC EFFICIENCY IMPROVEMENT.
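For what it's worth, the arithmetic above checks out (the 80% similarity figure is, of course, the rough assumption from earlier):

```python
keywords = 1_000
results_per_kw = 50        # average results per SE per keyword
ses = 50
similarity_pct = 80        # assumed share of results shared between SE's
projects = 10              # projects per VPS

per_se = keywords * results_per_kw      # 50 000 URL's from one SE
total = ses * per_se                    # 2 500 000 URL's to parse
wasted = total * similarity_pct // 100  # 2 000 000 needless parses
print(wasted, wasted * projects)        # 2000000 20000000
```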
I hope it's not a big feature, but it would add serious value.
The key is:
1) Storing the SE results for each SE.
2) Computing the XX most common target URL's across them.
3) Applying the filters/thresholds.
Please add some discussion! (This post took forever to explain clearly...it's clear in my head at least...I hope I have done it justice!)