Common Crawl: The Open Search Engine


Our mission is to democratize access to web information by producing and maintaining an open repository of web crawl data that is universally accessible. We store the crawl data on Amazon’s S3 service, allowing it to be bulk downloaded as well as directly accessed for map-reduce processing in EC2. Somewhere in the IP range between 38.107.191.66 and 38.107.191.119 lurks a fresh new bot, crawling and weaving an open web free from the shroud of mystery and secret algorithms of Google and Bing. Ladies and gentlemen, meet Common Crawl’s CCBot, identified by the User-Agent string CCBot/1.0 (+http://www.commoncrawl.org/bot.html).
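To make the "directly accessed" part concrete, here is a minimal sketch of anonymous access to the corpus with Python and boto3. The bucket name and prefix are assumptions for illustration only; consult commoncrawl.org for the actual locations of the current crawl data.

# A minimal sketch of anonymous access to the crawl corpus on S3.
# The bucket and prefix are assumptions for illustration; check
# commoncrawl.org for the real locations of the current crawl.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Unsigned requests: the crawl corpus is a public dataset.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

BUCKET = "commoncrawl"   # assumed name of the public bucket
PREFIX = "crawl-data/"   # assumed top-level prefix for crawl files

# List the first few objects under the prefix.
resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX, MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Bulk download is then just a matter of fetching each key, e.g.:
# s3.download_file(BUCKET, "crawl-data/some-file.warc.gz", "some-file.warc.gz")

The same keys can be read straight from EC2 instances, which is what makes in-place map-reduce processing practical.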

Common Crawl hosts the data on Amazon’s S3 service, where it can be accessed directly for map-reduce processing in EC2 or downloaded in bulk. The bot obeys common search engine directives such as nofollow tags (used for link metrics rather than content access), meta tags, and robots.txt, and it supports the gzip encoding format.
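Since CCBot honours robots.txt, webmasters can manage it exactly as they would Googlebot or Bingbot. The rules below are a hypothetical sketch: the disallowed path and the crawl-delay value are made-up examples, not Common Crawl recommendations.

User-agent: CCBot
Disallow: /no-crawl/   # hypothetical section to keep out of the open crawl
Crawl-delay: 2         # hypothetical politeness delay, in seconds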

This could be a wonderful fresh start for search engine development in a climate dominated by Google.

Visit: http://www.commoncrawl.org/

Dan Petrovic is a well-known Australian SEO and the managing director of Dejan SEO. He has published numerous research articles in the field of search engine optimisation and online marketing. Dan's work is highly regarded by the worldwide SEO community and has been featured on some of the most reputable websites in the industry.

