Link Spam Detection
In this article we describe quantitative, metric-based link spam detection in a collection of analysed domains. Accuracy and correlation range from 60 to 90 percent, depending on how the quality parameters are adjusted. Our technique is still a ‘work in progress’ and may not be fit for accurate automated action. Instead we offer a rudimentary, top-level analysis method, suitable for first-round flagging and the creation of spam alerts within large collections of domains and massive backlink profiles.
In the above image we show a case of strong correlation between the metrics on the right and the manually flagged spam domains shown in red on the left.
Metrics & Calculation
Although not all metrics proved to be useful in our analysis, we list all of those included in our spreadsheet (from left to right, starting with column A and ending with W):
- Backlinks (BL)
- Maximum Backlink PageRank (MaxPR)
- PageRank Sum (tPR)
- Unique PageRank Sum (uPR)
- Unique Domain Backlink Sum (udBL)
- D-Factor (D=uPR/udPR)
- Average D-Factor Difference (avgD=D-(D1+…+Dn)/n)
- Unique Government Domain Links (uGOV)
- Unique Educational Domain Links (uEDU)
- Trusted D-Factor (tD=uGOV/uEDU)
- Average Trusted D-Factor (avgtD=tD-(tD1+…+tDn)/n)
- Advanced Trust (aT=avgtD/avgD)
- Manual Trust Value (mT=Manually entered for benchmarking purposes)
- Domain Length (L)
- Formula Variants:
- Booster (Arbitrary value of 1000, used for smoothing the detection highlighting gradient; it can be adjusted freely)
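The derived metrics in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the spreadsheet's actual implementation: the `Domain` class, the `safe_div` helper and the zero-denominator rule are our own inventions, and since no udPR metric is listed, the D-Factor denominator is read here as udBL.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Domain:
    """Hypothetical container for the raw per-domain metrics."""
    name: str
    uPR: float   # Unique PageRank Sum
    udBL: float  # Unique Domain Backlink Sum
    uGOV: float  # Unique Government Domain Links
    uEDU: float  # Unique Educational Domain Links


def safe_div(num, den):
    # Assumption: a zero denominator yields 0.0. The article does not
    # say how such rows are treated in the spreadsheet.
    return num / den if den else 0.0


def derived_metrics(domains):
    """Return D, avgD, tD, avgtD and aT for every domain in the collection."""
    # Assumption: "udPR" in the D-Factor formula is read as udBL.
    D = [safe_div(d.uPR, d.udBL) for d in domains]
    tD = [safe_div(d.uGOV, d.uEDU) for d in domains]
    mean_D, mean_tD = mean(D), mean(tD)

    rows = {}
    for d, D_i, tD_i in zip(domains, D, tD):
        avgD = D_i - mean_D      # difference from the collection average
        avgtD = tD_i - mean_tD   # same idea for the trusted D-Factor
        rows[d.name] = {
            "D": D_i,
            "avgD": avgD,
            "tD": tD_i,
            "avgtD": avgtD,
            "aT": safe_div(avgtD, avgD),  # Advanced Trust
        }
    return rows
```

Because avgD and avgtD are differences from the collection average, every value depends on which domains are analysed together; adding or removing a domain shifts the results for all the others.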
The most accurate results were achieved through the following formulas:
The simplified formula (Sx) is considered a borderline case; however, it offers a massively simplified method of link spam detection that remains useful even with considerably degraded accuracy. The logic behind the simplified formula is as follows: (uPR*udBL)*D.
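As a minimal sketch, the simplified formula translates directly into a one-line function. The function name and the sample values are hypothetical; D is passed in ready-made so that the sketch makes no assumption about how it was derived.

```python
def simplified_score(uPR, udBL, D):
    """Sx = (uPR * udBL) * D, as stated in the text."""
    return (uPR * udBL) * D


# Hypothetical sample values, not taken from the article's data set.
sx = simplified_score(uPR=10, udBL=5, D=2.0)
```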
Likewise, the aT metric gives a reasonable measure of accuracy on its own. In the table below, the first two rows are accurate guesses (aT values of -0.70 and -0.74), with the other three being false positives. We therefore predict that the aT threshold value in this particular collection is above -0.60, which in our collection of observed domains accurately separated the spam from genuine websites.
As a reminder, aT is the ratio of Average Trusted D-Factor to Average D-Factor (aT=avgtD/avgD). This means that the introduction of .edu and .gov domains does indeed hint at quality, though we give .gov more weight after observing more than 100 flagged results. Even though .gov domains seem to be better moderated and harder for spammers to infiltrate, they are not entirely immune to manipulation (e.g. spam in public log files and statistics). This is where qualitative analysis comes in; however, this is outside the scope of our study.
In the table above, the last four columns illustrate our more complex formulas, which tend to place a wider gap between the domains with organic and inorganic links.
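A first-round flagging pass based on aT alone could look like the sketch below. It rests on an assumption: domains whose aT falls below the cutoff (that is, more negative than -0.60) are the ones flagged, which matches the two accurate guesses at -0.70 and -0.74. The function name and domain names are hypothetical, and the cutoff would need re-tuning for any other collection.

```python
AT_THRESHOLD = -0.60  # cutoff reported for this particular collection


def flag_by_at(at_values, threshold=AT_THRESHOLD):
    """Return the names of domains whose aT is below the threshold.

    Assumption: "below the threshold" means spam; the direction of the
    comparison is inferred from the article's example values.
    """
    return sorted(name for name, aT in at_values.items() if aT < threshold)


# Hypothetical domains; the first two aT values are the article's examples.
flagged = flag_by_at({
    "spam-one.example": -0.70,
    "spam-two.example": -0.74,
    "genuine.example": -0.10,
})
```

Such a list is only a starting point for manual review, in keeping with the first-round flagging role described earlier.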
Spreadsheet Access & Comments
To request access to our spreadsheet, or to make a comment or suggestion, please visit the Google+ post for this article.
References and research which inspired our work include:
- Quality-Biased Ranking of Web Documents
- Web Spam Detection: link-based and content-based techniques
- Spam Behavior Analysis and Detection in User-Generated Content on Social Networks