Correlation Between Latent Dirichlet Allocation (LDA) and Google Rankings

A recently published article on SEOmoz’s Daily SEO Blog discusses an exciting new finding in the field of SEO: a high correlation between Latent Dirichlet Allocation (LDA) and Google’s search results. In short, SEO researchers may have identified a method Google uses to rank search results that we previously didn’t know about, and we may be able to use it to optimize pages better. The article, written by Randfish, details a talk given by Ben Hendrickson at SEOmoz’s recent annual “Mozinar”.

LDA is a complex topic which we won’t go into fully. What you need to know to understand this post is that LDA is a type of “Topic Modelling”, a method of relating words to each other. The SEOmoz post illustrates this well with a diagram built around the words “cat” and “dog”.


Simplistic Term Vector Model by SEOmoz

Here, all words are related to either cat or dog. A neutral word, like “bigfoot,” would be 45° away from both words. Words like “feline” or “canine” would be much closer to cat and dog, respectively, because each is strongly associated with one of them. This model is drastically oversimplified: in actual LDA, there would be millions of different words and phrases which all exist in separate dimensions.
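To make the geometry concrete, here is a minimal sketch of the cat/dog diagram in Python. The coordinates are made up for illustration (they are not taken from SEOmoz’s diagram), and `angle_degrees` is a hypothetical helper name. It simply measures the angle between two term vectors, where a smaller angle means a closer relationship.

```python
import math

def angle_degrees(u, v):
    """Return the angle between two term vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = dot / (norm_u * norm_v)
    # Guard against tiny floating-point overshoot outside [-1, 1].
    cosine = max(-1.0, min(1.0, cosine))
    return math.degrees(math.acos(cosine))

# Hypothetical coordinates: first axis = "cat", second axis = "dog".
terms = {
    "cat": (1.0, 0.0),
    "dog": (0.0, 1.0),
    "feline": (0.95, 0.10),
    "canine": (0.10, 0.95),
    "bigfoot": (0.5, 0.5),
}

for word, vec in terms.items():
    to_cat = angle_degrees(vec, terms["cat"])
    to_dog = angle_degrees(vec, terms["dog"])
    print(f"{word}: {to_cat:.1f} degrees from cat, {to_dog:.1f} degrees from dog")
```

With these invented numbers, “bigfoot” sits halfway between the two axes, while “feline” lands only a few degrees from “cat”. A real topic model works the same way in spirit, just with far more dimensions.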

However, for the sake of this post, understanding this simple example is enough. Now, how does this apply to SEO? Researchers at SEOmoz investigated using LDA modelling to replicate Google search results. We know that Google uses over 200 different factors to determine search results, and even though we know what some of these factors are, we do not know how each is weighted. In the past, SEO researchers have ranked search results using several different models and compared the outcomes to Google’s results. Using simple models like keyword frequency or TF*IDF (term frequency × inverse document frequency), they have found some correlation with Google’s search rankings.
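For readers who have not met TF*IDF before, the sketch below shows one common textbook variant: term frequency is a word’s share of the document, and inverse document frequency is the natural log of (number of documents ÷ number of documents containing the word). The function name `tf_idf` and the three sample documents are assumptions made for illustration; this is not the formula SEOmoz or Google uses, and real implementations often add smoothing.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document (simple, unsmoothed variant)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical mini-corpus.
docs = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "the dog barked",
]

for doc, doc_scores in zip(docs, tf_idf(docs)):
    top_term = max(doc_scores, key=doc_scores.get)
    print(f"{doc!r}: highest-scoring term is {top_term!r}")
```

Notice that a word appearing in every document, such as “the”, scores zero, while a word unique to one document scores highest. That is the whole idea: common words carry little information about what a page is about.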

Correlation does not, of course, imply causation, but it does give strong suggestions about how much weight Google places on various factors. The big breakthrough with this LDA model was its high correlation: a little over 0.33.
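To give a feel for what a correlation figure like this measures, here is a small sketch of a rank correlation between search positions and a page-level score. It assumes Spearman’s rank correlation, which this post does not confirm as the measure used, and it assumes there are no tied values. The positions and topical scores are invented, and the helper names `ranks` and `spearman` are hypothetical.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman rank correlation for two equal-length lists without ties."""
    n = len(x)
    rank_x, rank_y = ranks(x), ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Hypothetical data: position in the results (1 = top) and a made-up
# topical relevance score for the page found at each position.
positions = [1, 2, 3, 4, 5, 6]
topic_scores = [0.9, 0.4, 0.8, 0.5, 0.6, 0.2]

# Negate positions so that "better position" and "higher score" point the same way.
rho = spearman([-p for p in positions], topic_scores)
print(f"rank correlation: {rho:.2f}")
```

For this invented data the result is 0.60. A value of 1 would mean the score orders pages exactly as the search engine does, and 0 would mean no relationship at all.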

A correlation of 0.33 may sound underwhelming to the average person, but it is significantly above the correlation shown by any of the other models. Even though this isn’t conclusive, it suggests that term modelling and LDA may be a major factor used by Google to determine search relevance.

One other piece of evidence helps to back this up. Every two years, SEOmoz conducts an opinion survey of SEO professionals to try to determine the weight that Google places on various factors. The most recent survey showed a major increase in the importance respondents gave to on-page keyword usage. This also suggests that Google is using some kind of term modelling or LDA system to determine relevance.

All this is great, but if you are an SEO professional, then you are probably still wondering how it impacts you or how you can use it to your advantage. To understand this, let’s go back to what term modelling is. Term modelling is based on figuring out how similar some words are to others.

Instead of thinking of this as keywords, think of it as keyword synonyms, or keyword keywords. In addition to using keywords on your page, you may also want to use words related to those keywords. For instance, a page which is being optimized for the keyword “computer” should also contain words like monitor, CPU, and hard drive. Using the word CD, on the other hand, might make the search engine think the page is about music. This new research suggests that using these additional keywords will help improve a page’s ranking for the original word.
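As a rough illustration of how you might audit a page for related terms, the sketch below checks which words from a hand-picked list appear in the page copy. The function name `related_term_coverage`, the sample text, and the term list are all assumptions. The terms come from the “computer” example above, not from any tool or from Google, and choosing good related terms is still up to you.

```python
import re

def related_term_coverage(page_text, related_terms):
    """Split related_terms into those found in page_text and those missing."""
    text = page_text.lower()
    found = {
        term for term in related_terms
        # Whole-word (or whole-phrase) match, case-insensitive.
        if re.search(r"\b" + re.escape(term.lower()) + r"\b", text)
    }
    missing = set(related_terms) - found
    return found, missing

# Hypothetical page copy and related-term list for the keyword "computer".
page = "Our computer store sells every monitor and hard drive you need."
related = ["monitor", "CPU", "hard drive"]

found, missing = related_term_coverage(page, related)
print("found:", sorted(found))
print("missing:", sorted(missing))
```

This only checks presence, not relevance or weighting, so treat it as a checklist aid rather than a ranking signal.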

Because Google uses a very complex search algorithm, and all of the 200 different factors that it uses are weighted differently, we will never know exactly what its search algorithm is. This research does not suggest that LDA is more than a small part of the algorithm, but it does show that it is significantly more important than was thought before.

For more on some of the terms used in this article, check out these links:

Latent Dirichlet Allocation
Topic Modelling
Vector Space Models
Correlation Between Latent Dirichlet Allocation (LDA) and Google Rankings
SEOmoz’s original article


Dan Petrovic, the managing director of DEJAN, is Australia’s best-known name in the field of search engine optimisation. Dan is a web author, innovator and a highly regarded search industry event speaker.
