How I Hijacked Rand Fishkin’s Blog

Search Result Hijacking

Search result hijacking is a surprisingly straightforward process. This post covers the theory, presents test cases run by the Dejan SEO team and offers ways for webmasters to defend against search result theft.

I wish to thank Jim Munro, Rob Maas and Rand Fishkin for allowing me to run my experiment on their pages.

Brief Introduction

Before I go any further I’d like to make it clear that this is not a bug, hack or an exploit – it’s a feature. Google’s algorithm prevents duplicate content from displaying in search results, and everything is fine until you find yourself on the wrong end of the duplication scale. From time to time a larger, more authoritative site will overtake a smaller website’s position in the rankings for its own content. Read on to find out exactly how this happens.

Search Theory

When there are two identical documents on the web, Google will pick the one with higher PageRank and use it in results. It will also forward any links from any perceived ‘duplicate’ towards the selected ‘main’ document. This idea first came to my mind while reading a paper called “Large-scale Incremental Processing Using Distributed Transactions and Notifications” by Daniel Peng and Frank Dabek from Google.

PageRank Copy

Here is the key part:

“Consider the task of building an index of the web that can be used to answer search queries. The indexing system starts by crawling every page on the web and processing them while maintaining a set of invariants on the index. For example, if the same content is crawled under multiple URLs, only the URL with the highest PageRank [28] appears in the index. Each link is also inverted so that the anchor text from each outgoing link is attached to the page the link points to. Link inversion must work across duplicates: links to a duplicate of a page should be forwarded to the highest PageRank duplicate if necessary.”
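The invariant quoted above can be illustrated with a toy sketch. This is purely an illustration of the paper’s idea, not Google’s actual system; all page names and the link graph below are made up. It runs a small power-iteration PageRank, then applies the duplicate-merge rule: keep only the highest-PageRank duplicate and forward all inbound links from the others to it.

```python
# Toy illustration of the Percolator paper's duplicate invariant:
# only the highest-PageRank duplicate stays in the index, and links
# to any duplicate are forwarded to that winner.

def pagerank(graph, damping=0.85, iters=50):
    """graph: {page: [pages it links to]}; returns {page: score}."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outlinks in graph.items():
            if outlinks:
                share = damping * rank[p] / len(outlinks)
                for q in outlinks:
                    new[q] += share
            else:  # dangling page: spread its rank evenly
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

def merge_duplicates(graph, duplicates, rank):
    """duplicates: URLs known to carry identical content.
    Keep the highest-PageRank duplicate; rewrite inbound links."""
    winner = max(duplicates, key=lambda p: rank[p])
    merged = {}
    for p, outlinks in graph.items():
        if p in duplicates and p != winner:
            continue  # losing duplicates drop out of the index
        merged[p] = [winner if q in duplicates else q for q in outlinks]
    return winner, merged

graph = {
    "original.example/page": [],
    "copy.example/page": [],                 # identical content, stronger links
    "strong.example": ["copy.example/page"],
    "weak.example": ["original.example/page"],
    "fan1.example": ["strong.example"],
    "fan2.example": ["strong.example"],
}
rank = pagerank(graph)
winner, merged = merge_duplicates(
    graph, {"original.example/page", "copy.example/page"}, rank)
print(winner)  # → copy.example/page (the higher-PageRank duplicate wins)
```

Note how `weak.example`’s link to the original is silently rewritten to point at the winning copy: this is exactly the link-forwarding behaviour the experiments below rely on.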

Case Studies

I decided to test the above theory on real pages from Google’s index. The following pages were our selected ‘victims’.

  1. MarketBizz
  2. Dumb SEO Questions
  3. ShopSafe
  4. Rand Fishkin’s Blog

Case Study #1: MarketBizz

marketbizz

26 October 2012: Rob Maas kindly volunteered for the first stage test and offered one of his English language pages for our first ‘hijack’ attempt. We set up a subdomain called rob.dejanseo.com.au and created a single page http://rob.dejanseo.com.au/ReferentieEN.htm by copying the original HTML and images. The newly created page was +’ed and linked to from our blog. At this stage it was uncertain how similar (or identical) the two documents had to be for our test to work.

30 October 2012: Search result successfully hijacked. Not only did our new subdomain replace Rob’s page in the results, but the info: command now showed the new page even when queried for the original URL, and the original’s PageRank of 1 was replaced by the new page’s PageRank of 0. Note: Do not confuse the toolbar PageRank of zero with real-time PageRank, which was calculated to be 4.

 Hijacked SERP

Notice how the info: search for the URL returns our test domain instead?

So all it took was a stronger stream of PageRank to the new page and a few days to allow for its indexing.

Search for text from the original page also returned the new document:

Hijacked Result

One interesting fact is that site:www.marketbizz.nl still returns the original page “www.marketbizz.nl/en/ReferentieEN.htm” and does not omit it from site search results. Interestingly, that URL does not return any results for the cache: command, just like the copy we created. Google’s merge seems pretty thorough and complete in this case.

Case Study #2: dumbseoquestions.com

dsq

30 October 2012: Jim Munro volunteers his website dumbseoquestions.com in order to test whether authorship helps against result hijacking attempts. We copied his content and replicated it on http://dsq.dejanseo.com.au/ without copying any media across.

1 November 2012: Two days later Jim’s page was replaced by our subdomain, rendering Jim’s original a duplicate in Google’s index. This suggests that authorship did very little or nothing to stop this from happening.

Dumb SEO Questions Hijack

The original website was replaced for both info: command and search queries.

Interesting Discovery

A search for the exact-match brand “Dumb SEO Questions” brings up the correct result and not the newly created subdomain. This potentially reveals a domain/query-match layer of Google’s algorithm in action.

Exact Brand Match

Whether Jim’s authorship helped in this instance is uncertain, but we did discover two conflicting search queries:

  1. Today we were fortunate to be joined by Richard Hearne from Red Cardinal Ltd. (returns the original site)
  2. Dumb+SEO+questions+answered+by+some+of+the+world’s+leading+SEO+practitioners (returns a copy)
One returned the original site while the other showed its copy. At this stage we had not yet tested the impact of rel=”canonical” in preventing result hijacking, so we created a separate experiment for that purpose.

Case Study #3: Shop Safe

shopsafe

The following subdomain was created: http://shopsafe.dejanseo.com.au/, replicating a page which contained rel=”canonical”. Naturally, the tag was stripped from the duplicate page for the purposes of the experiment.

This page managed to overtake the original in search, but never replaced it when tested using the info: command. All +1’s were purposely removed after the hijack to see if the original page would be restored. Several days later the original page overtook the copy; however, it is unclear whether the +1’s had any impact on this.

Possible defense mechanisms:

  1. Presence of rel=”canonical” on the original page
  2. Authorship markup / link from Google+ profile
  3. +1’s

Case Study #4: Rand Fishkin’s Blog

Rand's Blog

Our next test was related to domain authority, so we picked a hard one. Rand Fishkin agreed to a hijack attempt, so we set up a page in a similar way to the previous experiments, with a few minor edits (rel prev/next, authorship, canonical). Given that a considerable amount of code was changed, I did not expect this particular experiment to succeed to its full extent.

We did manage to hijack Rand’s search result for both his name and one of his articles, but only for Australian searches:

Rand Fishkin

Notice that the top result is our test domain, only a few days old. Same goes for the test blog post which now replaces the original site in Australian search results:

Rand's Article

This “geo-locking” could be happening for at least two reasons:

  1. The copy is hosted on a .au domain
  2. Links from .au domains point towards the copied page

Not a Full Hijack

What we failed to achieve was to completely replace his URL in Google’s index (where info: would show our subdomain), which is what happened with Rob’s page. This could be partly because the code was slightly different from the original, and possibly due to Rand’s authorship link, which we left intact for a while (now removed for further testing). Naturally, Rand’s blog also has more social signals and inbound links than our previous test pages.

Interesting Observation

When a duplicate page is created and merged into a main “canonical” document version, the duplicate will display the main version’s PageRank, cache, links and info: results, and in Rand’s case even its +1’s. Yes, even +1’s. For example, if you +1 a designated duplicate, the selected main version will receive the +1’s. Similarly, if you +1 the selected main URL, the change in +1’s will immediately be reflected on any recognised copies.

Example: http://rand.dejanseo.com.au/ – URL shows 18 +1’s which really belong to Rand’s main blog.

When a copy receives higher PageRank, however, and the switch takes place, all links and social signals are re-assigned to the “winning” version. So far we have seen two variants of this: in the case of a full hijack we see no +1’s for the removed version and all +1’s for the winning document, while borderline cases seem to show +1’s for both documents. Note that this could also be due to code/authorship markup on the page itself.

We’re currently investigating the cause for this behavior.

Preventative Measures

Further testing is needed to confirm the most effective ways for webmasters to defend against result/document hijacking by stronger, more authoritative pages.

Canonicalisation

Most scrapers will simply mirror your content or copy a substantial amount of it from your site. This is typically done at the code level (particularly if automated), so the copy often carries your markup with it. This means that the presence of a properly set rel=”canonical” (with a full, absolute URL) helps Google know which document is the canonical version. Note, however, that Google treats rel=”canonical” as a hint and not an absolute directive, so URL replacement in search results could still happen even if you canonicalise your pages.
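A quick way to audit your own pages is to check that each one actually declares the canonical tag you think it does. The sketch below uses only the Python standard library; the HTML and URL are made-up examples (in practice you would fetch the live page first).

```python
# Minimal sketch: extract the rel="canonical" target from a page's HTML
# so you can verify it points at the full, absolute URL you intended.

from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and (a.get("rel") or "").lower() == "canonical":
            self.canonical = a.get("href")

html = """<html><head>
<link rel="canonical" href="http://www.example.com/original-page/" />
</head><body>...</body></html>"""

finder = CanonicalFinder()
finder.feed(html)
print(finder.canonical)  # → http://www.example.com/original-page/
```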

There is a way to protect non-HTML documents too (e.g. PDFs) through the use of HTTP header canonicalisation:

GET /white-paper.pdf HTTP/1.1
Host: www.example.com
(…rest of HTTP request headers…)
 
HTTP/1.1 200 OK
Content-Type: application/pdf
Link: <http://www.example.com/white-paper.html>; rel="canonical"
Content-Length: 785710
(… rest of HTTP response headers…)
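To verify that a server is actually sending this header, you can parse the Link response header and pull out the rel="canonical" target. The sketch below is deliberately simple (it does not handle every corner of the Link header grammar) and the header string is just the example from above.

```python
# Sketch: extract the rel="canonical" URL from an HTTP Link header,
# e.g. to audit header canonicalisation on your PDFs.

import re

def link_header_canonical(header_value):
    """Return the URL marked rel="canonical" in a Link header, or None."""
    for part in header_value.split(","):
        m = re.match(r'\s*<([^>]+)>\s*;\s*rel="?canonical"?', part)
        if m:
            return m.group(1)
    return None

header = '<http://www.example.com/white-paper.html>; rel="canonical"'
print(link_header_canonical(header))  # → http://www.example.com/white-paper.html
```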

Authorship

I am not entirely convinced that authorship will do much to prevent a search result swap by a more juiced URL; however, it could be a contributing factor or a signal, and it doesn’t hurt to have it implemented regardless.

Internal Links

Using full URLs to reference your home page and other pages on your site means that if somebody scrapes your content, they will automatically link back to your pages, passing PageRank to them. This, of course, doesn’t help if they edit the page to rewrite the URLs to their own domain.
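One way to enforce this is to rewrite relative links to absolute URLs at publish time, so a verbatim scrape still links back to your domain. The sketch below uses a simple regular expression (real HTML may need a proper parser); the markup and domain are invented.

```python
# Sketch: rewrite relative href/src values to absolute URLs so that a
# verbatim copy of the page keeps linking back to the original domain.

import re
from urllib.parse import urljoin

def absolutise_links(html, base_url):
    """Replace relative href/src values with absolute URLs."""
    def repl(m):
        attr, quote, url = m.group(1), m.group(2), m.group(3)
        return f'{attr}={quote}{urljoin(base_url, url)}{quote}'
    return re.sub(r'(href|src)=(["\'])([^"\']+)\2', repl, html)

html = '<a href="/about/">About</a> <img src="logo.png">'
print(absolutise_links(html, "http://www.example.com/blog/post/"))
# → <a href="http://www.example.com/about/">About</a> <img src="http://www.example.com/blog/post/logo.png">
```

Already-absolute URLs pass through `urljoin` unchanged, so the rewrite is safe to run over a whole template.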

Content Monitoring

By using services such as CopyScape or Google Alerts, webmasters can monitor mentions of their brand and segments of their content online as they happen. If you notice a high-authority domain replicating your pages, acting quickly and requesting either removal or a link/citation back to your site is an option.
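If you want to check a suspected copy yourself, a common generic technique is word-shingle Jaccard similarity: break both texts into overlapping word n-grams and measure the overlap. This is an illustrative near-duplicate check, not Google’s actual algorithm, and both text samples below are invented.

```python
# Sketch: word-shingle Jaccard similarity as a rough near-duplicate check
# between your page's text and a suspected copy.

def shingles(text, k=5):
    """Set of overlapping k-word sequences from the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Overlap of shingle sets: 1.0 = identical, 0.0 = nothing shared."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

original = ("Search result hijacking is a surprisingly straightforward "
            "process and this post explains how it happens in practice.")
scraped = ("Search result hijacking is a surprisingly straightforward "
           "process and this article explains how it happens in practice.")

print(round(jaccard(original, scraped), 2))  # → 0.44
```

A high score between two pages that should be independent is a good trigger for a removal or attribution request.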

NOTE: I have contacted John Mueller, Daniel Peng and Frank Dabek for comments and advice regarding this article and am still waiting to hear from them. Also, this was meant to be a draft version (it was accidentally published) and is missing information about how page hijacking is reflected in Google Webmaster Tools.

PART II:

The article titled “Mind-Blowing Hack for Competitive Link Research” explains how the behaviour described above allows webmasters to see somebody else’s links in their own Google Webmaster Tools.

Dan Petrovic is a well-known Australian SEO and a managing director of Dejan SEO. He has published numerous research articles in the field of search engine optimisation and online marketing. Dan's work is highly regarded by the world-wide SEO community and featured on some of the most reputable websites in the industry.


97 thoughts on “How I Hijacked Rand Fishkin’s Blog”

  1. This is a very interesting post, also observing 302 redirect isn’t needed to be successful in hijacking.
    I’ve also recently experienced issues with different domains (eg. it, .com) having the same HTML but translated (IT – EN) content. Info, cache, link operators are still showing the wrong page in some cases. Hreflang alternate were in place but seemed not to be helpful; I’m trying canonicalization too but pages haven’t been re-crawled yet…
    Thanks for the experiment, anyway, useful as usual.

  2. Thanks for sharing this Dan, love real-life SEO testing.

    I have a question regarding the PageRank of the new, duplicate pages that you published on DejanSEO.

    “Note: Do not confuse the toolbar PageRank of zero with real-time PageRank which was calculated to be 4.”

    How did you pass enough PageRank to these to make them rank higher than the originals? Was it purely internal linking from your very strong domain?

    Thanks again for sharing!

    Paddy

  3. Darn fine job!

    Goes to show that G has little/no automated interest in showing the Originator, only the Popular (yet again, Popularity is the influencer :sigh: )

    As for prevention of Hijacking – it should be the same as the Anti-Scraping methods.

    1) Create page
    2) Make the page live
    3) Include full name in content (top and bottom)
    4) Include Date/Time stamp
    5) Include full URL as Text
    6) Include full URL as link
    7) Include SiteName in content
    8) Include SiteName in Title
    9) Use the Canonical Link Element
    10) Use Authorship markup to your G+ Profile URL

    11) Add page URL to Sitemap
    12) Ping GWMT Sitemap tool
    13) Use GWMT Fetch as GoogleBot tool for new URL
    14) Link from your G+ Profile to your Page URL
    15) Use services such as Ping.FM and PubSubHub
    16) Social-Bookmark/Social-Share the new page/URL

    Unfortunately, we have no idea just how influential any of that is – but it “should” help.
    Just keep in mind that G is interested in “the best”, which they view as “the most popular”.

  4. Very interesting test, however I believe some of the results are down to localisation of domains and local Google. For instance you are using a .com.au domain in the Australian Google. I highly doubt that the .com.au would show up in the States or UK above the hijacked sites. However that’s still up for testing :)

    The first example also shows this as it is a .nl result. I believe there is some layer in Google’s algorithm that determines whether a foreign ccTLD is more relevant than a local ccTLD (which is why we hardly see any .com.au’s in the UK), so by using .com.au in Google Australia you may not be fully testing the hijacking issue. Very interesting study though. :)

  5. Hi Guys

    Good stuff. It’s interesting to see this kind of analysis and advanced SEO research. I am kinda surprised to see +1’s and social signals being transferred to the popular page.

    Have a few questions , if you don’t mind.

    1) When you said ” Note: Do not confuse the toolbar PageRank of zero with real-time PageRank which was calculated to be 4 “. What exactly do you mean by real-time PageRank ?

    2) You guys have a strong domain and the subdomains on here would naturally rank well. I am interested to see if this sort of duplication will have any impact on a relatively new domain, with a heavy social push on the newly created duplicate pages.

    Looking fwd to the next segment.

    Regards
    Saijo George

  6. I see you are logged in when you are taking the screenshots. Does this make any change in terms of what/which is the sites show up? Normally your personal results will be “skewed” compared to the “non-logged” in users or even other logged in users as they have a different search pattern and so on. Just curious as to whether that make a big difference in the results or not.

  7. This is fascinating. Dan, what effects on the domain of a site with duplicated content would you expect to see? I mean, if its links are being passed to another site, would the entire domain suffer a loss of PR/authority if duplication was extensive?

  8. There could exist two pages, both PR4, where one got its PageRank after the public TBPR update and doesn’t show it yet. Similarly, both pages could show PR4 while one has lost it in the meantime, with the reduction not showing until the next public update.

    I would imagine that hijack attempts on a weak domain would not work well.

  9. I can confirm that the .com.au is currently showing up in the UK SERPs in the place of Rand’s blog. Pretty scary experiment – I can already think of some pretty black-hat things that could be done with this.

  10. It appears that you were logged into google in the “rand fiskin” serp screenshot. If that’s the case, googles personalized results could account for all your test results. Can you replicate your findings by using an independent rank tracking tool?

  11. This blows my mind…while simultaneously scaring the crap out of me. I wouldn’t have believed it had I not checked the SERPs myself. What especially perplexed me was the authorship experiment. I cannot, for the life of me, figure out how and why they would switch his URL out for yours when his was verified. Very strange indeed (but great work!).

  12. Awesome results and article, as usual.

    It made me think a lot, unusual.

    Would all of this have worked if you use a subdirectory or folder instead of a subdomain?

  13. Incredible test with some great takeaways. Thanks for taking the time to do this. Just reiterates the importance that rel=canonical can have on a site.

  14. This experiment feels like little kids playing a hide-and-seek game as well as a high-adrenaline thriller at the same time! All hats off! Google’s engineers have their work cut out to come up with better stability.

  15. I have an appreciation for the time devoted to this test, but I can’t determine a useful purpose for the efforts. We recently had an issue with a client’s competitor using scraped content, which resulted in their site being banned from Google search results. Under the DMCA (Digital Millennium Copyright Act) and a handy Google tool, sites with scraped content can be reported.
    I agree that it is a good idea to document ownership; however, I do not believe it is necessary to go overboard. Periodic checks in Copyscape will find any scrapers, and a couple of emails can get the offending site wiped out. Just my thoughts . . . .

  16. Oh right :)

    Theoretically, if it’s only looking at domain authority (which I would assume it’s not), anyone could do negative SEO by chucking up duplicate content on wordpress.com, which is scary.

    I would also be interested to see if the rankings naturally fall back to normal after a while (assuming the freshness is lost on those posts). From what I can see from your study, where social signals are being transferred to the duplicate page, I don’t think that will happen.

  17. I’d be interested to see how this worked if it was a subpage instead of a subdomain. Did you do any tests like that?

    I wonder what would happen if a stronger competitor copied your entire site on a subdomain of theirs …

    This makes absolute URLs and self referential canonical tags that much more important (although it still seems competitors can outrank you regardless).

  18. Hi Dan,
    “Note: Do not confuse the toolbar PageRank of zero with real-time PageRank which was calculated to be 4.”

    How did you calculate the real time pagerank?

  19. We’ve definitely seen this for a while. Until G+ matures and they can sift through real social profiles vs artificial there’s not going to be much social influence. I suspect they’re able to establish other Google account activity to solidify and expedite a person’s interests etc. You always hope it never happens to you but ya it happens every day.

  20. I’m not sure Incognito mode is that reliable – try adding &pws=0 to the search string. My results when searching on Rand’s name does not have your test page. This blog post is listed though :-)

  21. Amazing experiment. When I search for “Rand Fishkin” in US (incognito), I see the rand.dejanseo.com.au result in the 3rd position. moz.com is on the 2nd page.

    Really scary results. My question is why does G choose your page over moz.com’s page? You mentioned that the page with the higher pagerank is going to replace the other in serps but I don’t think your newly created subdomain has a higher pagerank. What is the factor that makes your page superior?

    Looking forward to more!
    -Oleg

  22. Thx!

    I just ran a couple tests too, in Canada dejanseo.com.au is ranking #1 – in US, ranking #3. This is absolutely shocking to see first hand. Well done sir, well done.

  23. Great article.. I would like to see this test with normal domains and not subdomains.. I think that you will get very different results.. Great article…

  24. That’s a fantastic series of tests with interesting results. Thanks for publishing it. I would’ve thought the oldest (first indexed) would win. Isn’t that what Google’s been saying? Strange that it doesn’t. Great stuff!

  25. So you are saying that the one document with higher PR will take over in results, but how did you build higher PR for those new subdomains that you just created to gain so quickly in search. I would think a new subdomain would take more then a day or two to build PR to overtake the original, but did you do anything aggressive to gain PR so quickly and thus resulting in the SERPs you saw, or was this completely natural and just over took the original page just based on the domain authority?

  26. Once these sites are detected and reported, the big G takes manual action… they have helped plenty; read the comments above for more details… use Copyscape to detect copies of your site, then take appropriate action through Google Webmaster Tools…

  27. I’m relieved that I’m using rel=canonical, have implemented “authorship” with a WordPress plugin, have set my Google+ pages up according to their procedures, referenced back to the sites I write on, and have moved my sites to CloudFlare to monitor bot threats, set threat levels as further protection against scraping, and have blocked whole countries which show a propensity to crawl my sites and are noted as “high-level threats.”

    It’s not only about implementing these things but being very consistent about posting back to your social media (particularly Google+) to retain some authorship protection. Google doesn’t commit to saying “authorship” will happen but in the long-term I believe it can’t hurt publishers to follow that practice.

  28. Great post Dan! I quite like what you do in the SEO industry. This was a great experiment and revealed a lot of things I didn’t know. Look forward to you sharing more of these experiment results in the future.

  29. Dan. Congratulations on an extremely informative and useful blog for SEOs. I’ve done a lot of work on geo targeting and duplication of content, so I fully respect and appreciate the effort you’ve gone to with this blog.

    On the ‘interesting observation’ part, I too noticed that Tweets and likes were being assigned to the duplicate copy when they should be attributed to the original. Again, great insight.

  30. Interesting and more than little scary results. I can see how this would arm blackhat spammers.

    You mentioned that the links to duplicate content get transferred to the original webpage. This raises an obvious Penguin-related question. What prevents a malicious person from scraping your site on a low PR domain and then spamming the duplicate domain with truckloads of bad links.

    I would hope that Google has some fail-safes to prevent abuse like that, but the algo updates in the past year have really shaken my confidence in the big G.

    What do you think about this?

  31. So, I followed up with a test using a couple of Press Releases through PRweb.

    When the content is placed on the site first (using pubsubhubbub) then the original site gets credit for the content over PRWeb (and other channels). It is interesting though that when searching for different exact strings within the article that sometimes it returns PRWeb and sometimes it returns the article on our site.

    On another site that has pubsubhubbub and Google authorship the all the exact strings return our site’s article over PRWeb and their channels.

  32. Dan,
    I am very impressed at the effort you put into this experiment. How did you calculate the real-time pagerank for the subdomains? At the time of the experiment did you or the owners of the sites know that you could view their link profile in GMT?

  33. This is interesting. The possibility of Hijacking other’s reputation with a higher PR links flow to fake page. Means the small businesses which have smaller PR (PR 3 – 4) are easier to become target of irresponsible people. There should be a way to protect these smaller sites.

  34. Hey Dan,
    These are interesting theories and I find your observations very plausible. The mechanics of the game have gotten more complex and ambiguous with the participation of social signals in the picture. I’m looking forward to learn about the results of your investigation. Thanks.