Anchor Text Harvesting
This article explains how to crawl a single domain and generate a complete list of all external anchor text used.
- Log into 80legs and create a new job. You can name it whatever you like.
- Add the URL of the domain you wish to crawl in the “Simple” mode of “Seed List of URLs” field.
- Set the parameters of “Outgoing Links to Crawl” settings (typically the third option)
- Paying customers can change the maximum number of crawls, otherwise leave it at 1000.
- Don’t touch anything else. Hit “Create Crawl” and wait for the crawler to finish the job.
The time to completion depends on how busy the server farm is, but in most cases it takes about 24 hours. Once completed you will be able to download your results by using the Java app. You will need to activate it using your personal security token which is visible on any job results page.
- Load the Java downloader app and select Document Data “XML to CSV” in the results drop down.
- Select which local directory you wish to download to.
- Start download by clicking on the “ANALYZED_URL” row in the available download files.
Note that some post-download processing may take place so be patient until the whole process has been completed. You will notice all temporary files will disappear from the download directory.
Now you have a whole lot of data you can do wonderful things with including external link anchor text analysis, which is what we’ll focus on in this article.
Ancore is a simple utility we built which converts the 80legs output file into a usable CSV formatted to show:
- Linking URL
- Linked URL
- Anchor Text
Processing your file in Ancore is simple. Enter the crawled domain URL, choose the 80Legs export .zip and hit “Check now!” button.
Download will start shortly after that and the output file will contain something similar to this:
We’re going to give you access to raw anchor text data of Google’s very own blog. Download it and let us know what you can tell about their linking style. Share your observations. How do they link, why do they link?
My initial observation is that Google uses rel=”nofollow” only on one article, and that article talks about nofollow usage.