Review: Finding High-Quality Content in Social Media
Written by Kevin Chai
Thursday, 28 February 2008 19:55
Authors: Agichtein, E., Castillo, C., Donato, D., Gionis, A. & Mishne, G.
Year: 2008
Published in: Proceedings of the international conference on Web search and web data mining (WSDM'08)
Link: http://download.tailrank.com/wsdm2008/p183.pdf
Importance to my research: Very high
Abstract
The quality of user-generated content varies drastically from excellent to abuse and spam. As the availability of such content increases, the task of identifying high-quality content in sites based on user contributions -- social media sites -- becomes increasingly important. Social media in general exhibit a rich variety of information sources: in addition to the content itself, there is a wide array of non-content information available, such as links between items and explicit quality ratings from members of the community. In this paper we investigate methods for exploiting such community feedback to automatically identify high quality content. As a test case, we focus on Yahoo! Answers, a large community question/answering portal that is particularly rich in the amount and types of content and social interactions available in it. We introduce a general classification framework for combining the evidence from different sources of information, that can be tuned automatically for a given social media type and quality definition. In particular, for the community question/answering domain, we show that our system is able to separate high-quality items from the rest with an accuracy close to that of humans.
Review
This paper provides a great overview of current research on quality assessment in social media / software. Building on this overview, the authors present a framework that identifies high-quality user-generated content (UGC) within Yahoo! Answers, a popular question/answering portal. The framework combines several sources of quality evidence, with the aim of making the assessment more rigorous and robust. It employs the following techniques in assessing the quality of UGC:
- Link analysis to evaluate relationships between users and objects
- Text analysis for content quality, with references to the field of Automated Essay Scoring (AES). This analysis is used to assess punctuation, typos, syntactic and semantic complexity, and grammar. Text classification algorithms are also employed.
- Implicit feedback for ranking through usage statistics such as the number of clicks an item receives and the dwell time on certain question/answer pages. These measures also take into consideration the popularity of different domains. For example, these statistics may vary widely across topics with differing popularity levels, e.g. celebrity fashion vs. science questions/answers. Usage statistics are therefore used to identify high-quality questions / answers within a specific domain.
- Explicit feedback gathered from users that rate the quality of specific questions and answers
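To make the point about domain popularity concrete, here is a minimal Python sketch of one way usage statistics could be compared within a topic rather than across topics. It is my own illustration, not the authors' method: the function name `normalize_clicks_by_topic`, the z-score normalization and the click counts are all assumptions made up for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_clicks_by_topic(items):
    """Map each item id to the z-score of its click count within its own topic.

    `items` is a list of (item_id, topic, clicks) tuples. The z-score is an
    assumed normalization chosen for illustration only.
    """
    clicks_by_topic = defaultdict(list)
    for _, topic, clicks in items:
        clicks_by_topic[topic].append(clicks)

    # Mean and population standard deviation of clicks for every topic.
    stats = {t: (mean(v), pstdev(v)) for t, v in clicks_by_topic.items()}

    scores = {}
    for item_id, topic, clicks in items:
        mu, sigma = stats[topic]
        # A topic where every item has the same count carries no signal.
        scores[item_id] = 0.0 if sigma == 0 else (clicks - mu) / sigma
    return scores


# Hypothetical click counts: a popular topic and a niche one.
items = [
    ("q1", "celebrity", 900), ("q2", "celebrity", 1100), ("q3", "celebrity", 1000),
    ("q4", "science", 40), ("q5", "science", 90), ("q6", "science", 50),
]
scores = normalize_clicks_by_topic(items)
```

In this made-up data the science question `q5` has far fewer raw clicks than the celebrity question `q2`, yet it scores higher once each count is measured against its own topic.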
I believe the researchers have developed a very sound framework. I plan to evaluate their work further to identify what ideas from their framework can be applied to a generic user contribution measurement model for social media (i.e. not just for question/answering portals). The authors believe that their results and insights are applicable to other social media settings and domains centered around UGC. I certainly hope this is the case!
Important New Terms
- Question/answering portals
- ExpertiseRank - expert finding
- Trust metrics - distrust cannot be considered transitive
- Automated Essay Scoring (AES)
- Gunning-Fog index, Flesch-Kincaid formula & SMOG grading
- Usage statistics
- Sentiment classification
- Part-of-speech (POS) tags and tag n-grams
- KL-divergence
- Log-linear classifiers & stochastic gradient boosted trees
- Overfitting
- Temporal statistics
- Area under the ROC curve
- Variant of co-training or co-boosting & maximum entropy classifier
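As a note to myself on one of the terms above, the area under the ROC curve can be read as the probability that a randomly chosen high-quality item is scored above a randomly chosen low-quality one. The sketch below computes it by direct pairwise comparison. It is an illustration under my own assumptions, not the paper's evaluation code: the function name `roc_auc` and the labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    `labels` holds 1 for high-quality items and 0 for the rest;
    `scores` holds the classifier score for each item.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive item ranked above negative item
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical quality labels and classifier scores for six answers.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

A value of 1.0 would mean every high-quality item is ranked above every low-quality one, while 0.5 corresponds to random ordering. The quadratic loop is fine for a handful of items, but a rank-based computation would be the better choice for large collections.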