Common Crawl: The Open Search Engine
- Wednesday November 9, 2011
Our mission is to democratize access to web information by producing and maintaining an open repository of web crawl data that is universally accessible. We store the crawl data on Amazon’s S3 service, allowing it to be bulk downloaded as well as directly accessed for map-reduce processing in EC2. Somewhere in the IP range between 188.8.131.52 and 184.108.40.206 lurks a fresh new bot, crawling and weaving an open web free from the shroud of mystery and secret algorithms of Google and Bing. Ladies and gentlemen, meet Common Crawl’s CCBot, identified by the User-Agent string: CCBot/1.0 (+http://www.commoncrawl.org/bot.html).
Common Crawl hosts its data on Amazon’s S3 service, where the crawl data can be accessed directly for map-reduce processing in EC2 or downloaded in bulk. The crawler obeys common search engine directives such as nofollow tags (which affect link metrics, not whether the content is accessed), meta tags and robots.txt, and it supports the gzip encoding format.
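For site owners wondering what obeying robots.txt means in practice, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt rules, the example.com URLs and the is_allowed helper are hypothetical and exist only for illustration; this shows how a well-behaved crawler could interpret such rules, not how Common Crawl's crawler is actually implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keeps CCBot out of /private/, allows everything else.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /private/

User-agent: *
Disallow:
"""

# User-Agent string as quoted in the article.
CCBOT_UA = "CCBot/1.0 (+http://www.commoncrawl.org/bot.html)"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/index.html",
                "http://example.com/private/page.html"):
        print(url, is_allowed(ROBOTS_TXT, CCBOT_UA, url))
```

Under these assumed rules the bot may fetch the index page but not anything under /private/, while crawlers matching only the wildcard entry are not restricted.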
This could be a wonderful new start for search engine development in a climate of Google’s dominance.