Review: Robust Methodologies for Modeling Web Click Distributions
Written by Kevin Chai
Wednesday, 05 March 2008 20:19
Authors: Ali, K. & Scarr, M.
Year: 2007
Published in: Proceedings of the 16th International Conference on World Wide Web
Link: http://portal.acm.org/citation.cfm?id=1242642
Importance to my research: Medium
Abstract
Metrics such as click counts are vital to online businesses but their measurement has been problematic due to inclusion of high variance robot traffic. We posit that by applying statistical methods more rigorous than have been employed to date that we can build a robust model of the distribution of clicks following which we can set probabilistically sound thresholds to address outliers and robots. Prior research in this domain has used inappropriate statistical methodology to model distributions and current industrial practice eschews this research for conservative ad-hoc click-level thresholds. Prevailing belief is that such distributions are scale-free power law distributions but using more rigorous statistical methods we find the best description of the data is instead provided by a scale-sensitive Zipf-Mandelbrot mixture distribution. Our results are based on ten data sets from various verticals in the Yahoo domain. Since mixture models can overfit the data we take care to use the BIC log-likelihood method which penalizes overly complex models. Using a mixture model in the web activity domain makes sense because there are likely multiple classes of users. In particular, we have noticed that there is a significantly large set of "users" that visit the Yahoo portal exactly once a day. We surmise these may be robots testing internet connectivity by pinging the Yahoo main website. Backing up our quantitative analysis is graphical analysis in which empirical distributions are plotted against theoretical distributions in log-log space using robust cumulative distribution plots. This methodology has two advantages: plotting in log-log space allows one to visually differentiate the various exponential distributions and secondly, cumulative plots are much more robust to outliers. We plan to use the results of this work for applications for robot removal from web metrics business intelligence systems.
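To make the cumulative plotting idea from the abstract concrete, here is a minimal sketch of how an empirical complementary cumulative distribution could be computed and moved into log-log space. The function name `empirical_ccdf` and the sample click counts are hypothetical illustrations, not taken from the paper.

```python
import math
from collections import Counter

def empirical_ccdf(clicks):
    """Return sorted (value, P(X >= value)) pairs for a list of click counts."""
    n = len(clicks)
    counts = Counter(clicks)
    remaining = n  # number of observations with a value >= the current one
    points = []
    for value in sorted(counts):
        points.append((value, remaining / n))
        remaining -= counts[value]
    return points

# Hypothetical clicks per user, including one robot-like outlier (400).
sample = [1, 1, 1, 2, 2, 3, 5, 8, 400]
ccdf = empirical_ccdf(sample)

# Coordinates for a log-log plot (all values here are >= 1, so the logs exist).
log_points = [(math.log10(v), math.log10(p)) for v, p in ccdf]
```

A single extreme value only adds one point at the far end of the cumulative curve, whereas in a raw histogram it would create a long run of empty bins. That is one way to read the abstract's claim that cumulative plots are more robust to outliers.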
Review
This paper presents a methodology for modelling webpage click distributions (WCD). The authors argue that existing WCD models fail to correctly remove robot traffic (traffic generated by crawlers and software agents), which can lead to misleading web statistics. They therefore propose a mixture model as a more robust way to model WCD. The zero-click class (robot traffic) is modelled by a single-valued degenerate distribution, and positive clicks (human traffic) are modelled by a discrete positive-valued distribution. Such models are referred to as zero-altered or zero-inflated models. The models were tested against Yahoo web-search clicks per session and website clicks per user (data from 10 websites within Yahoo were evaluated, e.g. travel.yahoo). Of the models compared, the Zipf-Mandelbrot model provided the best fit to the click distribution data.

This mixture model might be useful in my research if I decide to incorporate non-textual features (e.g. webpage views and clicks) in assessing user contribution and/or content quality. The model would serve as an anti-fraud mechanism, ensuring that robot traffic is not credited towards user scores.

Important New Terms
- Robot traffic
- Maximum likelihood estimation (MLE)
- Bayesian Information Criterion (BIC)
- Inverse Gaussian (IG) & log-normal continuous distributions
- Power-laws, Zipf & Pareto distributions
- Scale-free and scale-sensitive distributions
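
Several of the terms above (zero-altered mixture, Zipf-Mandelbrot, MLE, BIC) can be tied together in a small sketch. This is not the authors' implementation: it assumes a Zipf-Mandelbrot form P(k) proportional to (k + q)^(-s) on a finite support, a coarse grid search standing in for a proper maximum likelihood optimiser, the standard BIC convention (lower is better), and made-up click counts. All function names are hypothetical.

```python
import math
from collections import Counter

def zm_norm(s, q, n_max):
    """Normalising constant of a Zipf-Mandelbrot law on k = 1..n_max."""
    return sum((j + q) ** (-s) for j in range(1, n_max + 1))

def zero_altered_pmf(k, pi_zero, s, q, n_max):
    """Point mass pi_zero at k = 0; Zipf-Mandelbrot on k >= 1 otherwise."""
    if k == 0:
        return pi_zero
    return (1.0 - pi_zero) * (k + q) ** (-s) / zm_norm(s, q, n_max)

def log_likelihood(counts, pi_zero, s, q, n_max):
    """Log-likelihood of a {click value: frequency} table under the mixture."""
    return sum(c * math.log(zero_altered_pmf(k, pi_zero, s, q, n_max))
               for k, c in counts.items())

def bic(log_lik, n_params, n_obs):
    """Standard BIC: complexity penalty minus twice the log-likelihood."""
    return n_params * math.log(n_obs) - 2.0 * log_lik

S_GRID = [0.5 + 0.1 * i for i in range(36)]  # assumed search range for s
Q_GRID = [0.5 * i for i in range(11)]        # assumed search range for q

def fit(counts, q_grid):
    """Grid-search MLE; the zero-class weight has a closed-form estimate."""
    n = sum(counts.values())
    n_max = max(counts)
    pi_zero = counts[0] / n
    return max((log_likelihood(counts, pi_zero, s, q, n_max), s, q)
               for s in S_GRID for q in q_grid)

# Made-up data: {clicks: number of users}, with a zero-click class.
data = Counter({0: 40, 1: 30, 2: 12, 3: 7, 4: 4, 5: 3, 8: 2, 20: 1, 50: 1})
n_obs = sum(data.values())

ll_zipf, s_zipf, _ = fit(data, [0.0])     # q fixed at 0: plain Zipf
ll_zm, s_zm, q_zm = fit(data, Q_GRID)     # q free: Zipf-Mandelbrot

bic_zipf = bic(ll_zipf, 2, n_obs)  # parameters: pi_zero, s
bic_zm = bic(ll_zm, 3, n_obs)      # parameters: pi_zero, s, q
```

Because the Zipf-Mandelbrot grid contains q = 0, its fitted log-likelihood can never be lower than the plain Zipf one. The BIC penalty for the extra parameter is what decides whether the more flexible model is actually preferred, which matches the overfitting concern raised in the abstract.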