DRASTIC Efficiency Improvements - Part 2 - SE Logic Module

AlexRAlexR Cape Town
edited December 2012 in Feature Requests
Following on from my previous post, I have been thinking about keyword and SE overlap. The number of times I have used Scrapebox and, after a massive scrape, seen it report 80% to 90% duplicates removed has got me thinking. That's with 1 SE...how bad must it be when you are using SE's powered by the same engine! What a waste to have to parse all these URL's.

That's like going door to door and asking 100 people the question "What is 1 + 1?" Well, you could ask 1 person and get 2, but you might wonder if it was correct. So you could ask some more people. Maybe 10 people. When 9 people say 2 and one person says something else, it's time to stop. You're 90% certain the answer is 2. No need to continue...but that's exactly what we do at the moment in GSA...we keep knocking on doors and asking the same question. Yes, as soon as the person says 2 we quickly move on (like a URL that is already parsed), but we are still checking!

It's something I touched on in the past and I see a new thread about it now where there is discussion on what SE's to select.

There are a LOT of SE's to choose from and we need to be able to sort them per project (different projects target different countries, keywords, niches, etc). 

Let's say you have 10 KW's and each generates 100 results on 1 SE (like Google). That gives 1 000 SE results to have a go at and to identify the platform, etc.

Now you expand this list with synonyms and add a few extra words. Let's say that 80% of the results are the same, since many of these keywords are so closely related that you would expect at least 80% similarity. So now you only get 20% new results (again, these figures are very rough, as it depends on many factors, but I'd guess the overlap is closer to 90%). But we're happy to sacrifice the time to get this extra 20% of target sites.

BUT SHOULD WE BE SACRIFICING THIS TIME AND RESOURCES HERE FOR ONLY 20%???

This is using 1 SE. 

Now when you increase this to 10 or 50 and the SE's are powered by similar engines or related, the numbers are NOT in your favour. It's like doing the same step 1, again, and again, and again, and again, and again...depending on how similar they are and how similar your KW's are. ;-)

HERE's a Feature I'd Like to See (maybe as an addition to the SE settings per project, similar to how the proxy module works, but it needs to be per project, to help you choose which SE's to select for each project)
I think it should be available to all in GSA rather than a few testers (the reason is that different SE's, for different countries and keywords, will generate different results, and I want each project to be super effective! This way the user can also do the research and use the best results for their needs). It would add a HUGE amount of efficiency to GSA and all should benefit from this!!!

1) You input 1 keyword (or it takes a random anchor text, since you are targeting this; normally the keywords you use are related to it and the data is already entered in GSA).
2) It selects ALL SE's available in GSA (OR, if you want, you can remove some by country mask for finer control).
3) It runs the keyword/random anchor through all selected SE's and allocates the URL's that each SE generates to that SE. So it parses ALL the SE's ONCE per project and stores these results.
4) You identify which URL's are COMMON to the most SE's, i.e. the results that most SE's generate for that keyword. You set the number of results, e.g. 10, 20, 50, 100, 200, etc. It then identifies the most common XX URL's across all SE's, e.g. the top 100 URL's for that anchor.
5) Set "% Common URL Match". For each SE you identify what percentage of the URL's in step 4 it returned. A good SE will cover 80 to 90% of the URL's in step 4...so you only need to select 1 good SE and it covers most of the results you need.
6) Set SE "% UNIQUE URL's". You have the option to Use/Select All SE's with a unique threshold above XX percent. Let the user decide the percentage. Here you are selecting SE's that generate different results. These could be some great target URL's or just some very bad results from poor SE's, depending on how you set it. You've got to ask yourself why these SE's are generating different results, and how much of that we should tolerate. Hence the user selects the unique threshold.
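
As a rough illustration of steps 3 to 6, here is a minimal Python sketch. It is not how GSA works internally. It assumes the per-SE results are already available as plain lists of URL's, and the function name `analyse_search_engines`, the SE names and the URL's are all made up for the example. It also assumes that a "unique" URL means one that no other SE returned, which is only one possible reading of step 6.

```python
from collections import Counter

def analyse_search_engines(results_by_se, top_n=100):
    """results_by_se maps an SE name to the list of URLs it returned for one keyword."""
    # Step 3: store the results per SE (as sets, so repeats within one SE count once).
    url_sets = {se: set(urls) for se, urls in results_by_se.items()}

    # Step 4: count how many SEs returned each URL and keep the top_n most common.
    counts = Counter(url for urls in url_sets.values() for url in urls)
    common = {url for url, _ in counts.most_common(top_n)}

    report = {}
    for se, urls in url_sets.items():
        # Step 5: percentage of the common URLs that this SE returned.
        common_match = 100.0 * len(urls & common) / len(common) if common else 0.0
        # Step 6 (assumed definition): percentage of this SE's results no other SE returned.
        unique = {u for u in urls if counts[u] == 1}
        unique_pct = 100.0 * len(unique) / len(urls) if urls else 0.0
        report[se] = (common_match, unique_pct)
    return common, report

# Hypothetical data: three SEs, one keyword.
example = {
    "se_a": ["u1", "u2", "u3", "u4"],
    "se_b": ["u1", "u2", "u3", "u5"],
    "se_c": ["u1", "u9"],
}
common_urls, se_report = analyse_search_engines(example, top_n=3)
for se, (match_pct, unique_pct) in se_report.items():
    print(f"{se}: common match {match_pct:.0f}%, unique {unique_pct:.0f}%")
```

In this toy data se_a and se_b both cover all of the common URL's, so under step 5 only one of them would be worth keeping, while se_c covers few common URL's but has a higher share of unique ones.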

So what this practically means and why it's a DRASTIC efficiency improvement.

Let's say you have 1 keyword. You set step 4 (the number of common URL's) to 50. You identify the 50 most common URL's for that keyword. In order to get ranked for this keyword in the MOST SE's, you would want links on these 50 common URL's, since these are the URL's that the MOST SE's deem important authority sites in the niche.

This would be a massively neat feature if that was all. You've basically got the hottest list for sites you need to get links on right there!

But it gets better.

Step 5 basically says which SE's generate the same results and should be ignored. So instead of parsing the results of SE's that just generate the SAME results every time, it says IGNORE ALL SE's that generate a % Common URL Match above 80%. Thus, use 1 SE and ignore all other SE's that generate the same 80% of results. THIS STEP IS CRITICAL AS IT WILL REMOVE THE SE'S THAT JUST GENERATE THE SAME RESULTS!

Step 6 - you can set the threshold for SE's that generate the most value for you to run your queries on (i.e. what percentage of unique results the SE has to generate for it to be worth using). Some niches have very few results, so ANY unique new URL's are great, while in others you can afford to "miss" one or two URL's, because you have saved so much on resources and time that you've got 1 000 other URL's to go at instead. Some keywords/niches just don't have this luxury. This way you can really make sure your SE selection is maximised for your keyword/niche.

Practically, let's talk numbers. 
1 000 keywords.
50 SE results on average per SE. 
That's 50 000 unique results.

Now that's assuming you're using 1 SE that generates unique results.

Let's say you use 50 SE's (maybe you missed some really good ones that you didn't think to use!). These SE's have 80% similar results...or maybe you got lucky and just picked the right combination of SE's...but you'd still have picked many that just generate the same results...there's no way to know unless you are doing serious testing.

That means you are running 
50 SE's x 50 000 = 2 500 000 URL's to parse.
80% similar = 2 000 000 URL's that didn't need to get parsed.

Now if you are running 10 projects per VPS, that's 20 000 000 URL's that didn't need to get parsed. That = ONE DRASTIC EFFICIENCY IMPROVEMENT.
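
The arithmetic above can be checked with a few lines of Python. The variable names are only for illustration, and the 80% overlap is the same rough guess used in the post, not a measured value.

```python
# Rough figures from the post (assumptions, not measurements).
keywords = 1_000
results_per_keyword = 50      # average results per keyword on one SE
search_engines = 50
overlap_percent = 80          # share of results assumed to repeat across SEs
projects_per_vps = 10

results_per_se = keywords * results_per_keyword                # 50 000
urls_to_parse = search_engines * results_per_se                # 2 500 000
wasted_per_project = urls_to_parse * overlap_percent // 100    # 2 000 000
wasted_per_vps = wasted_per_project * projects_per_vps         # 20 000 000

print(results_per_se, urls_to_parse, wasted_per_project, wasted_per_vps)
```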

I hope it's not a big feature, but it would add serious value. 

The key is:
1) Storing the SE results for each SE.
2) Computation to find the most common number of XX target URL's.
3) Applying the filter/thresholds.

Please add some discussion! (This post took forever to explain clearly...it's clear in my head at least...I hope I have done it justice!)

Comments

  • AlexRAlexR Cape Town
    I know this post is a little long but it took me ages to write and construct! I'd love to get some feedback from others on it.
  • I can agree with this also. Again, thanks for taking the time to write these thoughts and suggestions down. I'm hoping that Sven is open to it (and that you get some additional feedback).
  • AlexRAlexR Cape Town
    @bytefaker -  have just been surprised at the complete lack of feedback on this. Maybe I should rather have entitled it "I'm getting low links" - that always seems to get some good feedback. ;-)
  • Perhaps it's because not everyone is so stupid to use keywords like this for searching / scraping...

    what are blue widgets
    samsung blu ray widgets
    blue widgets seo
    blue one armed widgets
    how we make blue widgets
    blue jays widgets
    what is blue widgets
    how we make blue widgets
    buy blue widgets
    beautiful widgets binary blue
    what are blue widgets
    blue one armed widgets
    apple widgets blue friends

    I for one don't! :)
  • People use GSA SER in the same way, but differently. The issues you encounter, i don't experience, as i don't use GSA SER to find sites to place backlinks on.
  • AlexRAlexR Cape Town
    @doubleup - so what you're saying is GSA is a weak tool to find sites to place backlinks on? Why else would you not use it for that?
  • AlexRAlexR Cape Town
    @Heisenberg - you'll see MUCH advice on generating keywords, and they all generate similar results. So yes, most people will have an overlap to some degree...what I'm proposing is a way to help resolve that...

    Anyways...even WITH unique keywords, there is a SE overlap, that I am proposing we solve, so while your comment applies to one aspect that I touched on (too similar keywords), it doesn't fully cover what I was proposing (which is SE speed, and SE overlap). 
  • @GlobalGoogler - Pacquiao lost to Marquez last week, but I wouldn't say Pacquiao was weak :)

    In a day, using GSA SER to search for places to backlink, you'd get a certain number. Using other programs such as SB, you'd be able to find a FAR greater number in the same period of time. It's all about efficient use of time, and using GSA SER in that way isn't very efficient in my opinion.
  • AlexRAlexR Cape Town
    @Sven - would this be possible?
  • SvenSven www.GSA-Online.de
    this sounds logical and good but right now I don't see that this can be added any time soon. Though I keep this bookmarked to maybe add one or the other thing when I can.
  • AlexRAlexR Cape Town
    @Sven - thanks for considering it! 

    Also - maybe you can advise me here....

    I'd like to take a keyword and run it through ALL the SE's in GSA. I'd like it to get the parsed results for each SE and store it in csv. Then I can do some analysis on this. 

    I'm not a programmer. Is there an easy way to do this (get all the parsed results for each SE) without doing it one by one?


  • SvenSven www.GSA-Online.de
    sorry, there is no such way, as duplicate results get filtered out already before showing in the log.
  • AlexRAlexR Cape Town
    @sven - would it be possible for it to log how many duplicates it removed? Thus, if we inputted only 1 keyword, and we selected all SE's, it would log which ones had the most duplicates. Then we could see which SE's are generating the most duplicates, and as such overlapping the most. Would help greatly in us choosing the correct SE's to use for our projects. I know this is a temporary solution, but it seems it would help greatly. 


  • SvenSven www.GSA-Online.de
    Yes, that's shown with a message after SE parsing like "010/020 ...", saying 10 were duplicates out of 20 returned results.
  • AlexRAlexR Cape Town
    edited December 2012
    @sven - This is good news. So just to check if I can do this:

    I go to options, tools advanced, I take 1 keyword and select all SE's. I want to run the search without footprints, so this would be the best way. But I don't think it offers a log for this. Is there a way to extract this log too?

    OR

    Is there a platform that I can select that has the smallest footprint, i.e. that is closest to just a "keyword" straight search with nothing added?

    So what I'd do is select a single platform, a single keyword, all SE's. Then let it run, and log it to file. 

    Then extract and clean up data so that for each SE (about 1000) I could get the statistic:
    1) Number of URL's Parsed
    2) Number of Duplicates



  • SvenSven www.GSA-Online.de
    Yes you can define a engine with a footprint like "a" and it would basically only use that one + the keyword.
  • AlexRAlexR Cape Town
    I'm just starting to learn about the scripting. 

    So I create a .ini file, with a footprint in it. What folder do I place it in? Will GSA then add it automatically to the GUI?
  • SvenSven www.GSA-Online.de
    in the engine folder where the executable is. The program will find it automatically.
  • AlexRAlexR Cape Town
    edited January 2013
    Can someone help me with this? I am not a programmer and took a look at the .ini files and wasn't able to do this. So many variables in the file and I'm not sure if I must remove any. I can see where to edit the footprint, but not sure about all the others!

    What I am looking for is a "Test.ini" file that I can run that only has a footprint like "a". 

    There are just too many variables in the .ini's and I'm not sure what to edit. If anyone can assist me with my testing by supplying a test.ini that only has a footprint with "a" I'd be very grateful!

    Also - in the footprint, I don't want it to actually submit something. I am only interested in getting the SE's to parse the results. 
  • AlexRAlexR Cape Town
    What I have done is just copy a trackback .ini and replace 2 lines with:
    url must have1=**
    search term=a
    I then ran a test with keyword "golf".

    These are some of my logged results:
    [12:15:47] test.com: [ ] 000/000 [Page END] results on google DE for Search Engine Test with query a golf http://www.google.de/search?q=a+golf&as_qdr=all&filter=0&num=100&start=300&cr=countryDE
    [12:15:47] test.com: [ ] 000/001 [Page END] results on Lycos DE for Search Engine Test with query a golf http://search.lycos.com/web?q=a+golf&pn=1&region=de
    [12:15:51] test.com: [ ] 000/001 [Page END] results on Lycos DE for Search Engine Test with query a golf http://search.lycos.com/web?q=a+golf&pn=1&region=de
    [12:15:54] test.com: [ ] 093/142 [Page 001] results on google DE for Search Engine Test with query golf a http://www.google.de/search?q=golf+a&as_qdr=all&filter=0&num=100&start=0&cr=countryDE
    [12:15:55] test.com: [ ] 000/001 [Page END] results on Lycos DE for Search Engine Test with query a golf http://search.lycos.com/web?q=a+golf&pn=1&region=de
    [12:15:56] test.com: [ ] 001/011 [Page 002] results on MSN DE for Search Engine Test with query golf a http://www.bing.com/search?q=golf+a+loc:DE&filt=all&first=11&FORM=PERE
    [12:15:57] test.com: [ ] 042/097 [Page 002] results on google DE for Search Engine Test with query golf a http://www.google.de/search?q=golf+a&as_qdr=all&filter=0&num=100&start=100&cr=countryDE


    @sven - so what I would do is: (looking at Google DE)
    1) Add up the total results for all pages Google DE.. e.g. 97+142+0. This gives me total parsed results.
    2) Add up the total duplicates for SE. e.g. 42+93+0. This gives me SE duplicates/already parsed results. 
    3) Measure this ratio of duplicates to parsed results.
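
The three steps above can be sketched in Python for a log saved to file. This is only a rough sketch: the regular expression is based solely on the log lines quoted in this thread, the function name `summarise` is made up, and it would misread an SE whose name contains " for ". It sums the two numbers of each "NNN/MMM" pair per SE without deciding what the first number means, so check that against Sven's explanation before drawing conclusions.

```python
import re
from collections import defaultdict

# Matches e.g. "093/142 [Page 001] results on google DE for Search Engine Test ..."
LINE_RE = re.compile(r"(\d+)/(\d+) \[Page (\w+)\] results on (.+?) for ")

def summarise(log_lines):
    """Return {SE name: (sum of first numbers, sum of second numbers, ratio)}."""
    totals = defaultdict(lambda: [0, 0])
    for line in log_lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that are not SE result lines
        first, second, _page, engine = match.groups()
        totals[engine][0] += int(first)
        totals[engine][1] += int(second)
    return {
        engine: (first, second, first / second if second else 0.0)
        for engine, (first, second) in totals.items()
    }

# Three of the log lines quoted above.
sample = [
    "[12:15:54] test.com: [ ] 093/142 [Page 001] results on google DE for Search Engine Test with query golf a http://www.google.de/search?q=golf+a&as_qdr=all&filter=0&num=100&start=0&cr=countryDE",
    "[12:15:56] test.com: [ ] 001/011 [Page 002] results on MSN DE for Search Engine Test with query golf a http://www.bing.com/search?q=golf+a+loc:DE&filt=all&first=11&FORM=PERE",
    "[12:15:57] test.com: [ ] 042/097 [Page 002] results on google DE for Search Engine Test with query golf a http://www.google.de/search?q=golf+a&as_qdr=all&filter=0&num=100&start=100&cr=countryDE",
]
summary = summarise(sample)
for engine, (first, second, ratio) in summary.items():
    print(f"{engine}: {first}/{second} = {ratio:.2f}")
```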

    Questions:
    1) Is the above footprint edit correct?
    2) When it is measuring it as parsed, which SE results is it comparing it to? (i.e. which SE parsed it first? Is this a random one?) 
    3) If it's using a random one to measure original parse, what I could do is first run it with Google.com, then use this as a base. Then I could measure the value each extra SE adds. Would this be correct?



  • SvenSven www.GSA-Online.de

    1. Yes foot print seems to be OK

    2. It's all random

    3. Yes, that's how I would do it as well.

  • AlexRAlexR Cape Town
    @Sven - to be certain that I get all the results/pages from the SE, can I confirm that it always shows the [PAGE END] once it has parsed all the results? So this is the marker to show the SE has exhausted the results. 
  • SvenSven www.GSA-Online.de
    yes the END comes up if no more results are to be expected from the search engine (last page).
  • AlexRAlexR Cape Town
    Starting some SE testing this week. Looking forward to seeing some of the results. :-)
  • AlexRAlexR Cape Town
    edited January 2013
    Thanks for your patience! Just setting everything up for the test and need a few pointers. 

    Just having a minor issue:

    1) I selected Google US.
    2) See log file here: http://pastebin.com/RN1PZSCW
    3) I cleared target URL cache, and target url history on the project.

    I'm trying to run initial project with Google.com only as a reference point.

    All I see is "000/000 [Page END] results on google US " and one or two entries with "000/140 [Page 001] results on google US"

    1) Surely Google.com has 1000 results? 
    2) Does this indicate that there are only 140 results for the search term "a golf"?
    3) Or is it only using Google US and it's only showing 140 results?
    4) Where do I find the main Google.com SE in the SE options (which country is it listed under and what's it called)? I want to use this as a reference SE? 

    Is this correct? 
  • AlexRAlexR Cape Town
    Sorry to bump this up, but wanting to run the tests today/tomorrow and just want to confirm the above 4 points before I get it running. Macros created, all good to go. Just need to check the above before I begin. :-)
  • SvenSven www.GSA-Online.de
    "000/140 [Page 001]" means 0 new results from 140 results on page 1. The program does not know/extract how many more pages/results there are.
  • AlexRAlexR Cape Town
    Where do I find the main Google.com SE in the SE options (which country is it listed under and what's it called)? I want to use this as a reference SE? 
  • google --> International
  • AlexRAlexR Cape Town
    Thanks!