Cleaning up the Search Engines

I decided to take some time to gauge the effectiveness of the search engines other than Google, Yahoo and Bing, as these engines can be good sources of URLs, particularly if your proxies are prone to bans. To do this, I created a simple engine file with one footprint: "powered by wordpress". This has to be one of the most widespread footprints on the web, and any worthwhile search engine should be able to return some results for it.

Next, I unchecked by mask all search engines with google, yahoo or msn in the name. This left me with around 575 search engines. I then started a scrape at 50 threads, with 60 seconds between searches, using a plain residential ISP IP.
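
For anyone who wants to reproduce the mask step outside the tool, here is a minimal sketch of the same idea. It assumes the mask is a simple case-insensitive substring match on the engine name, which is my guess at the behaviour; the sample names and the filter_engines helper are made up for illustration.

```python
# Masks taken from the post; the matching rule (case-insensitive substring)
# is an assumption, not confirmed tool behaviour.
EXCLUDE_MASKS = ("google", "yahoo", "msn")


def filter_engines(names, masks=EXCLUDE_MASKS):
    """Return the engine names that contain none of the masks."""
    return [n for n in names if not any(m in n.lower() for m in masks)]


# Hypothetical engine names, for illustration only.
sample = ["Google US", "Yahoo UK", "MSN Canada", "Acoon", "DuckDuckGo", "cn.bing"]
print(filter_engines(sample))
```

Note that a name such as cn.bing passes this filter, since it contains none of the three masks.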

After the scrape was finished, I took note of the search engines that gave no results. These were:
Acoon
Ananzi
AOL
Atlas
Centrum
DuckDuckGo
Ecosia
EuroSeek
Expopage
FindLink
Gigablast
inout-search-ultimate
iZito
Jayde
krstarica
kvasir
List.ru
Mamma
Meta RU
Metabot.ru
Metager 2
mojeek
Nigma
pathfinder
Portelo
SearchHippo
SearchingQ
Thunderstone
vinden.nl
Volunia
webcrawler
Wirtualna Polska
zapmeta
Zoznam
cn.bing
yahoo.cn

When I have time, I'm going to go through each engine individually to see whether it needs fixing or should simply be deleted. So far, these are the updates I think are needed:

EuroSeek - this now seems to be just a directory, with no search. I'd remove it.
Expopage - does not appear to be a search engine.
FindLink - appears to be dead.
Gigablast - seems to give an ASCII-based captcha that I don't think is solvable via automated tools. Remove?

And some quick fixes to some engines:

[Acoon]
country=germany
url=https://www.acoon.de/cgi/search.exe?begriff=%search%&startwith=%page%
links_on_page=10
start_page=1
inc_page=10
enabled=0
ignore=acoon|overture.com|www.w3.org

[Atlas]
country=Czech Republic
url=http://searchatlas.centrum.cz/?q=%search%&l-choose=&l=cs&kibitz=0&kibitz-db=&trigger=button&offset=%page%
ignore=atlas.cz|centrum.cz|najisto.cz|heureka.cz|aktualne.cz|zena.cz|bearshare.com|centrumholdings.|clicktale.net|www.isb.cz|www.w3.org|.atdmt.com|i0.cz|ippi.cz|google-analytics.com
inc_page=10
start_page=1
links_on_page=10
enabled=0
site=site:
inurl=inurl:
intitle=intitle:

[Jayde]
country=international
url=http://www.jayde.com/sch.html?q=%search%&ds=%page%
links_on_page=5
start_page=0
inc_page=5
enabled=0
ignore=jayde.com|/jayde/|600z.com|www.w3.org|ientry.com|twellow.com|ientrymail.com|www.website.com|www.yoursite.com|ientry.net|quantserve.com
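
To sanity-check a definition like the ones above before loading it, here is a rough sketch of how I read the fields. It assumes %search% is replaced with the URL-encoded query, %page% with start_page + n * inc_page, and that ignore is a pipe-separated list of substrings used to drop matching links. Those semantics are my assumption rather than documented behaviour, the ignore list is shortened, and build_urls and is_ignored are made-up helper names.

```python
import configparser
from urllib.parse import quote_plus

# Jayde definition from above, with a shortened ignore list.
ENGINE_INI = """
[Jayde]
country=international
url=http://www.jayde.com/sch.html?q=%search%&ds=%page%
links_on_page=5
start_page=0
inc_page=5
enabled=0
ignore=jayde.com|/jayde/|600z.com|www.w3.org
"""


def load_engine(ini_text, section):
    # interpolation=None because the values contain literal % signs.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return parser[section]


def build_urls(engine, query, pages=3):
    """Expand the url template for the first few result pages (assumed semantics)."""
    start = engine.getint("start_page")
    step = engine.getint("inc_page")
    urls = []
    for n in range(pages):
        url = engine["url"].replace("%search%", quote_plus(query))
        url = url.replace("%page%", str(start + n * step))
        urls.append(url)
    return urls


def is_ignored(link, engine):
    """True if the link contains any of the pipe-separated ignore substrings."""
    return any(tok in link for tok in engine["ignore"].split("|") if tok)


engine = load_engine(ENGINE_INI, "Jayde")
for url in build_urls(engine, '"powered by wordpress"'):
    print(url)
```

With start_page=0 and inc_page=5, the three printed URLs carry ds=0, ds=5 and ds=10, which matches links_on_page=5 if ds is a result offset.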

Will add to this thread when I have time.
