A few months ago, I joined ACM's Special Interest Group on Information Retrieval (SIGIR) - an organization that focuses on search from a scientific standpoint. It's THE place where researchers and search engineers from Microsoft, Google, Yahoo, etc. all want to have their latest papers and theories published.
Accordingly, it's THE place I go to find out what they are up to ;)
Since I happened to join after the conference in August, I missed all the juicy stuff that happened and, quite frankly, got busy and forgot to go look it up afterward. Fortunately for me, they mail members a publication that gives an overview of the papers presented. I just got mine in the mail, and it was very interesting.
The WEBSPAM-UK2006 project produced a reference collection for web spam, built from a cross section of 6,552 pages - each one hand-checked and labeled by human reviewers as normal, spam, or borderline.
This reference set was then made available for search engines and researchers to check their anti-spam tactics against. It's about 55GB in size (compressed), and contains the full HTTP response of each page, along with the human annotations.
This is the first publicly available webspam reference collection, and as such is quite an important group of files. I sure hope nothing I've ever done is in there ;)
The process of creating this collection was actually very interesting, and a few notable facts and observations came out of it.
The researchers labeled 71% of the hosts as normal, 25% as spam, and the remaining 4% as undecided.
That's a lot of spam!
One of the most interesting parts of this was when the reviewers disagreed over what was spam and what wasn't.
"The average agreement is never more than about 80% (slightly more than 83% on the 14 reviewers with overlap of at least 65) and never below 75%. Also, and to some extent surprisingly, the average agreement does not seem to grow with the overlap... This result...seems to indicate that a non-negligible degree of "disagreement" is maybe not the result of statistical noise. Rather, it seems inherent to human rating of Web spamming and seems to indicate, to some extent, the lack of a general consensus on what exactly is spam and what not."
Source: Castillo, Carlos, et al. "A Reference Collection for Web Spam." ACM SIGIR Forum, Vol. 40, No. 2, December 2006, p. 20.
In short, sometimes spam really is in the eye of the beholder, and even trained and experienced humans can disagree on whether something is spam or not.
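The kind of pairwise agreement figure quoted above can be sketched in a few lines. This is just an illustration of the measurement, not the researchers' actual code; the reviewer names and host labels below are made up:

```python
from itertools import combinations

def pairwise_agreement(labels_a, labels_b):
    """Fraction of commonly labeled hosts on which two reviewers agree."""
    shared = set(labels_a) & set(labels_b)
    if not shared:
        return None  # no overlap, no agreement score
    matches = sum(1 for host in shared if labels_a[host] == labels_b[host])
    return matches / len(shared)

# Toy data: each reviewer's label for the hosts they happened to judge
reviewers = {
    "r1": {"a.example": "spam", "b.example": "normal", "c.example": "spam"},
    "r2": {"a.example": "spam", "b.example": "normal", "c.example": "normal"},
    "r3": {"a.example": "spam", "b.example": "spam"},
}

scores = [pairwise_agreement(reviewers[x], reviewers[y])
          for x, y in combinations(reviewers, 2)]
scores = [s for s in scores if s is not None]
average = sum(scores) / len(scores)
```

Even in this toy example, the three reviewers never fully agree, which is the paper's point: some disagreement seems baked into human spam judgments rather than being noise you can average away.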
They also listed a (preliminary and non-exclusive) set of common elements in pages defined as spam, along with each element's prevalence among the spam pages in the collection:
- Keywords in URL (84%)
- Keywords in Anchor Text (80%)
- Multiple Sponsored Links (52%)
- Multiple external ad units (e.g. AdSense) (39%)
- Text obtained from Web search (except internal search) (26%)
- Synthetic text (10%)
- Parked domains (4%)
They of course noted that keywords in the URL or anchor text are not necessarily spam on their own, but become suspicious when taken together with other signals. Synthetic text was considered almost always spam, as was having more than one type of external ad unit (which should serve as a warning sign to affiliates running ads from multiple sources on a single page).
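The "suspicious in combination" idea can be illustrated with a crude additive score. To be clear, this is my own hypothetical sketch - the signal names and weights below are simply the prevalence figures from the list, not a scoring method from the paper:

```python
# Hypothetical weights taken from the prevalence figures above;
# purely illustrative, not the researchers' methodology.
SPAM_SIGNALS = {
    "keywords_in_url": 0.84,
    "keywords_in_anchor_text": 0.80,
    "multiple_sponsored_links": 0.52,
    "multiple_external_ad_units": 0.39,
    "scraped_search_results": 0.26,
    "synthetic_text": 0.10,
    "parked_domain": 0.04,
}

def spam_suspicion(page_signals):
    """Sum the weights of the signals present on a page.

    No single weak signal flags a page by itself, but several together
    push the total up - mirroring the observation that these features
    are suspicious in combination, not in isolation.
    """
    return sum(weight for signal, weight in SPAM_SIGNALS.items()
               if page_signals.get(signal, False))

page = {"keywords_in_url": True, "multiple_external_ad_units": True}
score = spam_suspicion(page)  # 0.84 + 0.39
```

A real classifier would learn weights from the labeled collection rather than reusing prevalence numbers, but the shape of the heuristic is the same: accumulate evidence across signals before calling something spam.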