[Request] Search Engine Duplication Results List
Does anyone have a resource or a list of SEs that utilize their own results and DO NOT pull from other SE results like Google, Bing, etc?
I'm going to need to deselect some of the SEs I have selected recently from SER, as I have a feeling I'm pulling the same results over and over again from different SEs.
Comments
I also realize that SER is going to filter out any duplicate results anyway, so I'm in a bit of a quandary here.
Do I select a larger set of SEs for more results, but sacrifice system resources on the overlapping results (even though they're filtered)?
OR
Do I select a few SEs that I know will deliver unique results, but possibly come up with a fraction of the results for my links to be placed?
I had been using only a select few SEs for my projects in the past, but then I read a few threads from you and Sven about expanding our results with more SEs that no one else really utilizes. Which ones do you use for the best results with minimal overlap?
This will bugger up those who follow my tricks for getting good daily submissions.
If you're using Google as a search engine plus shared proxies, it's worth running three or four Googles and going for weird countries on some of them. There's plenty of choice in GSA. You could even go for a different selection of Googles on each project and tier to spread the search load.
The average person running any kind of search scraper with proxies hits the UK/US-type engines.
So your proxies might be banned from those engines, giving zero results on a term that might return thousands.
You're left running a Google fallback.
If you don't use a big bunch of blog engines, again you can find your IPs are banned.
If you only use Google, without the blog search versions, you need to alter your engine files to suit.
Each Google search can return 100 results per page.
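If you want to sanity-check this outside of SER, here's a rough sketch of what spreading the load across country-specific Googles looks like. It only builds the query URLs; the domain mix, the example keyword, and the assumption that Google still honours the num=100 parameter are all mine, so treat it as an illustration rather than anything SER does internally.

```python
# Minimal sketch: build country-specific Google query URLs so the search
# load is spread across several Google properties instead of hammering
# google.com with every proxy. Assumes Google still honours the classic
# "num" parameter (up to 100 results per page) -- verify before relying on it.
from urllib.parse import urlencode

# A few "weird country" Google domains; swap in whatever mix you prefer.
GOOGLE_DOMAINS = ["google.de", "google.co.jp", "google.com.ar", "google.fi", "google.co.za"]

def build_queries(keyword: str, results_per_page: int = 100) -> list[str]:
    """Return one search URL per country domain for the given keyword."""
    urls = []
    for domain in GOOGLE_DOMAINS:
        params = urlencode({"q": keyword, "num": results_per_page})
        urls.append(f"https://www.{domain}/search?{params}")
    return urls

if __name__ == "__main__":
    for url in build_queries('"powered by wordpress" "leave a comment"'):
        print(url)
```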
I spent almost a full day going over this, so let me share what I found, and what I eliminated:
Bing = Yahoo => eliminate either one, I kept Bing
Startpage = Google => eliminate Startpage
Lycos = Bing => eliminate Lycos
Ecosia = Bing => eliminate Ecosia
Keep DuckDuckGo - very unique results.
I dumped Sky because I couldn't access it on multiple occasions.
Ask is powered by Google, but they layer an algorithm on top of Google, so the results are a little bit unique; keep it.
Eliminate the Googles for English-speaking islands like Samoa, Antigua, Bahamas, Barbados, etc. They're too similar to Google.com.
Most compilers that say "Powered by Google, Bing, Yandex" are owned by a company called "InfoSpace" (they own 100 search engines which all do the same thing), so keep Excite, but get rid of the other compilers like MetaCrawler, Dogpile, etc.
Keep Baidu, Yandex.
Use international search engines which are choices about halfway down the list.
All in all, I end up with about 112 search engines.
Without feeding lists (which skews the results positively), using all platforms across about 30 projects (only about 4 of them bottom tiers with no submission limits), and just using the SEs to find targets, on 100 threads and 30 semi-private proxies I am able to get about 30,000 submissions per day, with roughly 15% verification.
For the record, I compared each search engine manually, side by side, with two browser tabs. I probably went about 4 pages deep with each search engine to see if those results were the same.
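If anyone wants to automate that side-by-side check instead of eyeballing two browser tabs, something like the sketch below would do it: dump the first few pages of results from each engine into a text file, then measure how much the result hosts overlap. The file names and the 50% cut-off are made up; the point is just the comparison.

```python
# Rough sketch of the overlap check described above: given two files of
# result URLs (one engine per file, first few pages each), reduce every
# URL to its host and report the Jaccard overlap of the two sets.
from urllib.parse import urlparse

def hosts(path: str) -> set[str]:
    """Read full URLs (one per line) and return the set of bare hosts."""
    with open(path, encoding="utf-8") as f:
        return {urlparse(line.strip()).netloc.lower().removeprefix("www.")
                for line in f if line.strip()}

def overlap(path_a: str, path_b: str) -> float:
    """Jaccard similarity of the two engines' host sets (0.0 to 1.0)."""
    a, b = hosts(path_a), hosts(path_b)
    return len(a & b) / len(a | b) if a | b else 0.0

if __name__ == "__main__":
    score = overlap("bing_results.txt", "yahoo_results.txt")  # placeholder files
    print(f"Overlap: {score:.0%}")
    if score > 0.5:
        print("Mostly the same index -- probably keep only one of these engines.")
```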
Ozz, you are absolutely correct - I did check that when I did it, and it only goes one page. The only reason I kept it is because they had completely different results than Bing and Google.
20 x random Googles. Bin the rest.
The only captcha software that can keep up with my link building is GSA Captcha Breaker, and it's still only in beta. Which gives some idea of the speed it can work at without locking up.
And I have had to drop my threads by thirty to slow SER down
I found that with the likes of Bing, Yahoo, etc., the results returned were small.
Bang out ten results, then waste time waiting for another search. It became too monotonous and tedious. Better to load several hundred search results at a time, bang through them and hope the decaptcha system can keep up. Which it does now, thanks to Sven.
Which goes back to Ozz's point, and I agree with you. I haven't done anything since then, but if we are talking about efficiency, you want the search engines that can deliver the most results in the quickest amount of time. Which is why I should let go of DuckDuckGo and some others as well.
What kills me is that these search engines are supposed to have completely different algorithms, yet they tend to deliver the same results (the English-speaking SEs), just maybe in a different order.
I think the only way to game it over the long term so you have massive diversity with different websites is to scrape each platform with something like SB for a zillion search terms, weed out the dups, and feed this beast.
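As a rough illustration of that "weed out the dups" step, here is a small sketch that takes a raw ScrapeBox-style harvest, keeps one URL per host, and writes a clean list you could import as targets. The file names are placeholders, and the scraping itself isn't handled here.

```python
# Sketch of the dedup step: take a big scraped URL list, keep one URL per
# host, and write the cleaned list out so it can be fed to SER as targets.
from urllib.parse import urlparse

def dedupe_by_host(in_path: str, out_path: str) -> None:
    """Copy in_path to out_path, keeping only the first URL seen per host."""
    seen: set[str] = set()
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            url = line.strip()
            if not url:
                continue
            host = urlparse(url).netloc.lower().removeprefix("www.")
            if host and host not in seen:
                seen.add(host)
                dst.write(url + "\n")

if __name__ == "__main__":
    dedupe_by_host("scrapebox_harvest.txt", "ser_targets.txt")  # placeholder names
```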
@Ozz - that is a great point! Maybe we should consider translating our keywords, once they have been run through English based SEs and then re-run them according to the new SEs we want results from (German, Polish, Chinese, Russian, etc). If this was automated as a separate option, on down the line, EVEN BETTER!
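To make the idea concrete, here is a rough sketch of how translated keyword lists could be generated ahead of time, one list per target language, before assigning them to the matching foreign Googles. The translation table is a stand-in you'd fill from a CSV or whatever translation service you trust; this isn't a SER feature, just the shape of the workflow while we wait for a built-in option.

```python
# Sketch of the translated-keyword idea: expand an English keyword list into
# per-language variants. The TRANSLATIONS table is a hand-made stand-in --
# replace it with your own translation source.
TRANSLATIONS = {
    "guest post": {"de": "Gastbeitrag", "ru": "гостевой пост", "pl": "wpis gościnny"},
    "leave a comment": {"de": "Kommentar hinterlassen", "ru": "оставить комментарий",
                        "pl": "zostaw komentarz"},
}

def expand_keywords(keywords: list[str], langs: list[str]) -> dict[str, list[str]]:
    """Return one keyword list per language code, skipping missing entries."""
    out: dict[str, list[str]] = {lang: [] for lang in langs}
    for kw in keywords:
        for lang in langs:
            translated = TRANSLATIONS.get(kw, {}).get(lang)
            if translated:
                out[lang].append(translated)
    return out

if __name__ == "__main__":
    for lang, kws in expand_keywords(list(TRANSLATIONS), ["de", "ru", "pl"]).items():
        print(lang, "->", ", ".join(kws))
```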
@Ron - thank you very much for your testing feedback. It has been very useful and will be well utilized. I believe in squeezing out as much efficiency as you possibly can with SER.
So here is the conclusive list we have so far:
Bing
Ask
Google
Yandex
Excite
Baidu
International SEs
Google and Bing with translated keywords (Google DE, Google RUS, etc)