Obstacles in Experimental Testing and Reverse Engineering of Google Algorithm
One tricky thing about SEO as a marketing discipline is that it’s not an exact science and we’re dealing with proprietary algorithms which grow more complex each year. Despite the strong community in the industry and its contributions to common knowledge, SEO professionals struggle to see the whole picture and are often puzzled by the sometimes strange behaviour of the biggest search engines in the world.
In this article we will outline and comment on results from an experiment the Dejan SEO team performed in 2008. Before we get into the data, let’s consider the following factors which (among others) drive the advancement of search engines today:
- User technology adoption
- Diversification of online channels
- Growing amount of online content
Each major wave of search engine incarnation is followed by a tsunami of spam, forcing search engines to stay agile and combat manipulation. Spam has been, and will remain, one of the major driving factors in the advancement of search engines.
User Technology Adoption
Search engines have to keep up with the growth and diversification of user types and platforms, which forces them to adapt to different languages, geographic locations, browsing platforms and temporal factors.
Diversification of Online Channels
Blogs, forums, local directories, job websites, social networks… search engines are having so much fun feeding on this data, and so much of a headache trying to work out what’s what, validate it, cross reference it and use it as a signal. This will never change and technology will continuously advance, so be prepared for continuous adoption and adaptation.
Growing Amount of Online Content
Google was the first search engine to put true emphasis on the sorting of results. This became more useful as the number of online resources grew and needed to be prioritised, firstly by relevance to the user’s search query and secondly by the page’s importance (e.g. PageRank, also referred to as ‘PR’).
What Problem Do Search Engines Face?
The main problem search engines face today is the rapid influx of new data and the growing number of online resources. ‘Prosumerism’ is gaining momentum and everyone has something to say: each networked individual is a micro content publisher. Computational resources and bandwidth are still limited, so there is always the matter of prioritising and optimising resource usage. Algorithmic changes are essential in order to set crawling, indexation and result serving priorities right. The factors mentioned above keep search engines busy and add to the complexity of our job, which is to understand search engines and enable better exposure for our own or our clients’ websites.
In order to reveal a small piece of the puzzle we decided to test Google’s crawling behaviour by monitoring the flow of PR through a complex iterative navigational structure. A new domain was registered for the purposes of the experiment and fed with a single PR7 link pointed directly to the index page. No other links were added in order to preserve the integrity of the experiment. The estimated flow of PageRank is illustrated below:
This estimate is based on PR flow observed across a sample of one hundred random domains. Typically the toolbar PageRank (TBPR) value drops by around 1 point for each level deeper from the highest source of PR, where the highest source of PR is the home page. The actual PR value is estimated to be on average 25% higher or lower than the visible value. The actual flow of PageRank turned out to be surprisingly different. Pages marked in grey are those where the TBPR value was as expected, green pages have a higher value than expected and red pages a lower one.
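The rule of thumb above can be written down as a small Python sketch. It assumes the home page sits at TBPR 6, a value inferred from the expected PR4 in the third layer rather than stated outright, and the names `expected_tbpr` and `classify` are illustrative only.

```python
def expected_tbpr(home_tbpr: int, layer: int) -> int:
    """Estimate TBPR for a page in a given layer (home page = layer 1).

    Applies the rule of thumb from the text: roughly one point is lost
    for each level below the home page. Values never go below zero.
    """
    return max(home_tbpr - (layer - 1), 0)


def classify(expected: int, observed: int) -> str:
    """Map an observation to the colour coding used in the article."""
    if observed > expected:
        return "green"  # higher than expected
    if observed < expected:
        return "red"    # lower than expected
    return "grey"       # as expected


HOME_TBPR = 6  # assumption: inferred from the expected PR4 in layer 3

for layer in range(1, 6):
    print(f"layer {layer}: expected TBPR {expected_tbpr(HOME_TBPR, layer)}")

# A third-layer page observed at PR5 against an expected PR4.
print(classify(expected_tbpr(HOME_TBPR, 3), 5))
```

Because toolbar values are rounded, a sketch like this can only flag whole-point deviations; smaller differences in the underlying PR stay invisible.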
The first surprise comes in the third layer, where seven pages get PR5 instead of PR4, perhaps because the initial source of PR had a value closer to 7.5 and because the site has only four second level pages to share internal PageRank with. Observe that links closer to “Home” in the navigation (the left side of the horizontal menu) seem to have more PR passed to them. Because TBPR values are rounded to a single digit, this is only evident from the distribution in the third level of the cascading site architecture. Observe the last node of the last page of the top level navigation (the only red item in the third layer). This is the point where Google stopped assigning PageRank to pages, except for the last page in the fifth layer, which for some strange reason scored a PR3, hinting at a possible external link affecting the experiment. To discover why Google seemingly at random stopped assigning PageRank to pages, our team observed the caching behaviour of the site over a period of three months.
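The idea that fewer second level pages each receive a larger share of internal PageRank can be illustrated with the textbook PageRank iteration. This is a minimal sketch, not Google's production algorithm: the site shape built by `build_site`, the page names and the damping factor of 0.85 are all assumptions chosen for illustration.

```python
def build_site(second_level: int, children_each: int) -> dict:
    """Build a hypothetical cascading site as a dict of page -> outgoing links.

    The home page links to every second level page; each second level page
    links back to home and to its own children; each child links back to home.
    """
    links = {"home": []}
    for i in range(1, second_level + 1):
        section = f"s{i}"
        links["home"].append(section)
        links[section] = ["home"]
        for j in range(1, children_each + 1):
            child = f"s{i}.{j}"
            links[section].append(child)
            links[child] = ["home"]
    return links


def pagerank(links: dict, damping: float = 0.85, iterations: int = 100) -> dict:
    """Textbook PageRank by power iteration; ranks sum to 1."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if targets:
                # Each outgoing link receives an equal share of the page's rank.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Pages without links spread their rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


ranks_small = pagerank(build_site(second_level=4, children_each=2))
ranks_large = pagerank(build_site(second_level=8, children_each=2))

# Share of the home page's rank that reaches one second level page.
ratio_small = ranks_small["s1"] / ranks_small["home"]
ratio_large = ranks_large["s1"] / ranks_large["home"]
print(f"4 second level pages: {ratio_small:.3f}")
print(f"8 second level pages: {ratio_large:.3f}")
```

In this textbook model every link on a page receives an equal share, so the sketch will not reproduce the observation that links nearer to “Home” appear to receive more PR.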
In the first week after the PR7 link is placed, only the index page is accepted into Google’s cache, hinting at a potential balancing of resource usage on Google’s part for this newly discovered resource. Given that a high PR source linked to this site (which led to its discovery), we expected more rapid caching of all pages, and this cautious behaviour came as a surprise.
By the end of the second week new pages are visible in Google’s cache. Not surprisingly, the entire first level is in the cache. What is interesting is that only two pages of the third level (2.1 and 3.1) enter the cache, while the rest remain un-cached. At this point we’re starting to make the connection between subtle variations in PageRank distribution and their effect on the caching rate.
It is in week four that the strange caching behaviour becomes clear, as the rate of caching now explains the higher than expected PageRank value for the page 22.214.171.124. What remains unexplained at this stage is why this page got this PageRank in the first place. We’re looking for potential external links that may have affected the experiment, but so far nothing has been found.
It’s finally in the eighth week of the experiment that we’re seeing the full site in Google’s cache, down to its deepest architectural level. At this stage PageRank has updated and revealed its unusual allocation, as seen in Figure 2.
This experiment demonstrates how Google approaches new sites and allocates its resources according to the level of trust it has in newly discovered resources. In this simplified model we’re not observing the many other factors which can influence search engine behaviour; we are focusing on the cascading behaviour of Google PageRank and its effect on resource allocation and on the rate at which discovered content is cached. The apparently random element in content indexation and PageRank assignment, which together with the variable value of PageRank causes indexation to stop suddenly and resume at a later stage (perhaps when a crawler’s internal limit has been reached), appears to be the product of a complex set of internal rules rather than purely random behaviour.
Some degree of random behaviour could potentially be beneficial to a search engine, as it would provide a platform for organic growth of the algorithm, much like sporadic mutations in gene replication enable living species to evolve. A secondary benefit could be a layer of protection against deliberate reverse-engineering of the algorithm, as each attempt to probe the system would produce a subtly distorted result that would not match previous attempts. To prove this, more experimentation is needed, perhaps by repeating the same experiment several times to observe differences in behaviour under the same circumstances.
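If the experiment were repeated, the runs could be compared page by page. The following is a minimal sketch of such a comparison; the function name `compare_runs` and the sample values are hypothetical placeholders, not measurements from the experiment.

```python
def compare_runs(run_a: dict, run_b: dict) -> dict:
    """Return the pages whose observed TBPR differs between two runs.

    Each run maps a page label to its observed TBPR. A missing page or a
    value of None both mean that no PageRank was assigned.
    """
    differences = {}
    for page in sorted(set(run_a) | set(run_b)):
        value_a = run_a.get(page)
        value_b = run_b.get(page)
        if value_a != value_b:
            differences[page] = (value_a, value_b)
    return differences


# Hypothetical placeholder values, for illustration only.
first_run = {"home": 6, "2.1": 5, "3.1": 5}
second_run = {"home": 6, "2.1": 4, "3.1": 5, "4.1": 4}

for page, (before, after) in compare_runs(first_run, second_run).items():
    print(f"{page}: {before} -> {after}")
```

Identical runs would return an empty result, which would count against the idea of a deliberately random element.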