Schedule: Every Wednesday, 4:00 p.m. to 5:30 p.m.
Venue: Meeting Room 4-4, School of Information Systems, Singapore Management University
Downloads of papers and slides on this page are available ONLY to authorized group members.
posted Nov 19, 2009 12:12 AM by jianshu weng
Authors: David Carmel, Haggai Roitman, Naama Zwerdling
Appeared in SIGIR '09
Abstract:
This work investigates cluster labeling enhancement by utilizing Wikipedia, the free on-line encyclopedia. We describe a general framework for cluster labeling that extracts candidate labels from Wikipedia in addition to important terms that are extracted directly from the text. The “labeling quality” of each candidate is then evaluated by several independent judges and the top evaluated candidates are recommended for labeling.
Our experimental results reveal that the Wikipedia labels agree with the manual labels that humans assign to a cluster much more than significant terms extracted directly from the text do. We show that in most cases, even when the human-assigned label appears in the text, pure statistical methods have difficulty identifying it as a good descriptor. Furthermore, our experiments show that for more than 85% of the clusters in our test collection, the manual label (or an inflection, or a synonym of it) appears in the top five labels recommended by our system.
posted Nov 4, 2009 5:30 PM by hanbo dai
Title: Know your Neighbors: Web Spam Detection using the Web Topology (paper) (slides)
Authors: Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock, Fabrizio Silvestri
Appeared in SIGIR '07
Abstract: Web spam can significantly deteriorate the quality of search engine results. Thus there is a large incentive for commercial search engines to detect spam pages efficiently and accurately. In this paper we present a spam detection system that combines link-based and content-based features, and uses the topology of the Web graph by exploiting the link dependencies among the Web pages. We find that linked hosts tend to belong to the same class: either both are spam or both are non-spam. We demonstrate three methods of incorporating the Web graph topology into the predictions obtained by our base classifier: (i) clustering the host graph, and assigning the label of all hosts in the cluster by majority vote, (ii) propagating the predicted labels to neighboring hosts, and (iii) using the predicted labels of neighboring hosts as new features and retraining the classifier. The result is an accurate system for detecting Web spam, tested on a large and public dataset, using algorithms that can be applied in practice to large-scale Web data.
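For discussion, method (ii) in the abstract above (propagating predicted labels to neighboring hosts) can be illustrated with a toy sketch. This is not the authors' algorithm: the input format, the `weight` parameter, the simple averaging rule, and the host names are all assumptions made for illustration.

```python
from statistics import mean


def propagate_scores(graph, base_scores, weight=0.5, rounds=1):
    """Blend each host's spam score with the mean score of its neighbors.

    graph: dict mapping host -> list of neighboring hosts (assumed format)
    base_scores: dict mapping host -> spam probability from a base classifier
    weight: share of the neighbors' mean mixed in (hypothetical knob)
    """
    scores = dict(base_scores)
    for _ in range(rounds):
        updated = {}
        for host, own in scores.items():
            neighbors = [scores[n] for n in graph.get(host, []) if n in scores]
            if neighbors:
                updated[host] = (1 - weight) * own + weight * mean(neighbors)
            else:
                # Isolated hosts keep the base classifier's score.
                updated[host] = own
        scores = updated
    return scores


# Toy host graph: three mutually linked hosts and one isolated host.
graph = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["a.example", "c.example"],
    "c.example": ["a.example", "b.example"],
    "d.example": [],
}
base = {"a.example": 0.9, "b.example": 0.8, "c.example": 0.3, "d.example": 0.1}
smoothed = propagate_scores(graph, base)
```

In this toy graph, the score of `c.example` is pulled upward because both of its neighbors look like spam, while the isolated `d.example` keeps its base score.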
posted Oct 30, 2009 1:02 AM by Freddy Chua [updated Oct 30, 2009 1:11 AM]
Title: Detecting spammers and content promoters in online video social networks (paper)(slides)
Authors: Fabrício Benevenuto, Tiago Rodrigues, Virgílio Almeida, Jussara Almeida, Marcos Gonçalves (Federal University of Minas Gerais, Belo Horizonte, Brazil)
Appeared in SIGIR 2009
Abstract: A number of online video social networks, out of which YouTube is the most popular, provide features that allow users to post a video as a response to a discussion topic. These features open opportunities for users to introduce polluted content, or simply pollution, into the system. For instance, spammers may post an unrelated video as response to a popular one, aiming at increasing the likelihood of the response being viewed by a larger number of users. Moreover, opportunistic users--promoters--may try to gain visibility to a specific video by posting a large number of (potentially unrelated) responses to boost the rank of the responded video, making it appear in the top lists maintained by the system. Content pollution may jeopardize the trust of users in the system, thus compromising its success in promoting social interactions. In spite of that, the available literature is very limited in providing a deep understanding of this problem. In this paper, we go a step further by addressing the issue of detecting video spammers and promoters. Towards that end, we manually build a test collection of real YouTube users, classifying them as spammers, promoters, and legitimates. Using our test collection, we provide a characterization of social and content attributes that may help distinguish each user class. We also investigate the feasibility of using a state-of-the-art supervised classification algorithm to detect spammers and promoters, and assess its effectiveness in our test collection. We found that our approach is able to correctly identify the majority of the promoters, misclassifying only a small percentage of legitimate users. In contrast, although we are able to detect a significant fraction of spammers, they proved to be much harder to distinguish from legitimate users.
posted Oct 26, 2009 3:03 AM by Viet An Nguyen [updated Oct 26, 2009 10:14 PM]
Title: RTG: A Recursive Realistic Graph Generator using Random Typing (paper) (slides)
Authors: Leman Akoglu, Christos Faloutsos (Carnegie Mellon University)
Appeared in ECML/PKDD 2009
Abstract: We propose a new, recursive model to generate realistic graphs, evolving over time. Our model has the following properties: it is (a) flexible, capable of generating the cross product of weighted/unweighted, directed/undirected, uni/bipartite graphs; (b) realistic, giving graphs that obey eleven static and dynamic laws that real graphs follow (we formally prove that for several of the (power) laws and we estimate their exponents as a function of the model parameters); (c) parsimonious, requiring only four parameters; (d) fast, being linear in the number of edges; (e) simple, intuitively leading to the generation of macroscopic patterns. We empirically show that our model mimics two real-world graphs very well: Blognet (unipartite, undirected, unweighted) with 27K nodes and 125K edges; and Committee-to-Candidate campaign donations (bipartite, directed, weighted) with 23K nodes and 880K edges. We also show how to handle time so that edge/weight additions are bursty and self-similar.
posted Oct 7, 2009 3:47 AM by Cane Leung [updated Oct 7, 2009 9:01 AM]
Title: Patterns of Influence in a Recommendation Network (paper) (slides)
Authors: J. Leskovec, A. Singh, J. Kleinberg
Appeared in Proceedings of PAKDD 2006
Abstract: Information cascades are phenomena in which individuals adopt a new action or idea due to influence by others. As such a process spreads through an underlying social network, it can result in widespread adoption overall. We consider information cascades in the context of recommendations, and in particular study the patterns of cascading recommendations that arise in large social networks. We investigate a large person-to-person recommendation network, consisting of four million people who made sixteen million recommendations on half a million products. Such a dataset allows us to pose a number of fundamental questions: What kinds of cascades arise frequently in real life? What features distinguish them? We enumerate and count cascade subgraphs on large directed graphs; as one component of this, we develop a novel efficient heuristic based on graph isomorphism testing that scales to large datasets. We discover novel patterns: the distribution of cascade sizes is approximately heavy-tailed; cascades tend to be shallow, but occasional large bursts of propagation can occur. The relative abundance of different cascade subgraphs suggests subtle properties of the underlying social network and recommendation process.
posted Sep 30, 2009 2:50 AM by Meiqun HU
Title: Personalized Tag Recommendation using Graph-based Ranking on Multi-type Interrelated Objects (paper) (slides)
Authors: Ziyu Guan, Jiajun Bu, Qiaozhu Mei, Chun Chen and Can Wang
Appeared in Proceedings of SIGIR '09
Abstract: Social tagging is becoming increasingly popular in many Web 2.0 applications where users can annotate resources (e.g. Web pages) with arbitrary keywords (i.e. tags). A tag recommendation module can assist users in the tagging process by suggesting relevant tags to them. It can also be directly used to expand the set of tags annotating a resource. The benefits are twofold: improving user experience and enriching the index of resources. However, the former is not emphasized in previous studies, though a lot of work has reported that different users may describe the same concept in different ways. We address the problem of personalized tag recommendation for text documents. In particular, we model personalized tag recommendation as a “query and ranking” problem and propose a novel graph-based ranking algorithm for interrelated multi-type objects. When a user issues a tagging request, both the document and the user are treated as a part of the query. Tags are then ranked by our graph-based ranking algorithm, which takes into consideration both relevance to the document and preference of the user. Finally, the top ranked tags are presented to the user as suggestions. Experiments on a large-scale tagging data set collected from Del.icio.us have demonstrated that our proposed algorithm significantly outperforms algorithms which fail to consider the diversity of different users’ interests.
posted Sep 24, 2009 12:54 AM by jianshu weng [updated Sep 24, 2009 12:59 AM]
Title: Discovering Users’ Specific Geo Intention in Web Search (paper)(slides)
Authors: Xing Yi, Hema Raghavan, and Chris Leggetter
Appeared in WWW 2009
Abstract:
Discovering users’ specific and implicit geographic intention in web search can greatly help satisfy users’ information needs. We build a geo intent analysis system that uses minimal supervision to learn a model from large amounts of web-search logs for this discovery. We build a city language model, which is a probabilistic representation of the language surrounding the mention of a city in web queries. We use several features derived from these language models to: (1) identify users’ implicit geo intent and pinpoint the city corresponding to this intent, (2) determine whether the geo-intent is localized around the users’ current geographic location, (3) predict cities for queries that have a mention of an entity that is located in a specific place. Experimental results demonstrate the effectiveness of using features derived from the city language model. We find that (1) the system has over 90% precision and more than 74% accuracy for the task of detecting users’ implicit city level geo intent; (2) the system achieves more than 96% accuracy in determining whether implicit geo queries are local geo queries, neighbor region geo queries or none of these; (3) the city language model can effectively retrieve cities in location-specific queries with high precision (88%) and recall (74%); human evaluation shows that the language model predicts city labels for location-specific queries with high accuracy (84.5%).
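As a rough illustration of the "city language model" idea in the abstract above, the sketch below counts the words that co-occur with a city name in a handful of made-up queries and then scores a new query against each city with add-one smoothed unigram probabilities. The smoothing choice, the single-token city names, the toy queries, and all function names are assumptions for discussion; the paper's actual features and training procedure are not reproduced here.

```python
import math
from collections import Counter, defaultdict


def build_city_models(queries, cities):
    """Count the words surrounding each city mention in toy queries."""
    models = defaultdict(Counter)
    for query in queries:
        words = query.lower().split()
        for city in cities:
            if city in words:
                # Keep every word of the query except the city name itself.
                models[city].update(w for w in words if w != city)
    return models


def predict_city(query, models, vocab_size, alpha=1.0):
    """Return the city whose smoothed unigram model gives the query the highest log-likelihood."""
    words = query.lower().split()
    best_city, best_score = None, -math.inf
    for city, counts in models.items():
        total = sum(counts.values())
        score = sum(
            math.log((counts[w] + alpha) / (total + alpha * vocab_size))
            for w in words
        )
        if score > best_score:
            best_city, best_score = city, score
    return best_city


# Made-up query log; real systems would learn from large web-search logs.
queries = [
    "boston red sox tickets",
    "boston harbor cruise",
    "seattle space needle tickets",
    "seattle ferry schedule",
]
models = build_city_models(queries, ["boston", "seattle"])
vocab_size = len({w for counts in models.values() for w in counts})
guess = predict_city("red sox schedule", models, vocab_size)
```

Here the query "red sox schedule" contains no city name, yet two of its three words were seen only near "boston", so the Boston model scores it higher.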
posted Sep 11, 2009 5:47 AM by jianshu weng [updated Sep 13, 2009 6:50 PM]
Title: Efficient Identification of Starters and Followers in Social Media (paper)
Authors: Michael Mathioudakis, Nick Koudas
Appeared in EDBT 2009
Abstract:
Activity and user engagement in social media such as web logs, wikis, online forums or social networks has been increasing at unprecedented rates. In relation to social behavior in various human activities, user activity in social media indicates the existence of individuals that consistently drive or stimulate 'discussions' in the online world. Such individuals are considered 'starters' of online discussions, in contrast with 'followers' that primarily engage in discussions and follow them. In this paper, we formalize notions of 'starters' and 'followers' in social media. Motivated by the challenging size of the available information related to online social behavior, we focus on the development of random sampling approaches allowing us to achieve significant efficiency while identifying starters and followers. In our experimental section we utilize BlogScope, our social media warehousing platform under development at the University of Toronto. We demonstrate the scalability and accuracy of our sampling approaches using real data, establishing the practical utility of our techniques in a real social media warehousing environment.
posted Sep 5, 2009 10:28 PM by Byung-Won On [updated Sep 24, 2009 1:57 AM]
Title: Collaborative Filtering for Orkut Communities: Discovery of User Latent Behavior (paper) (slides)
Authors: W. Chen, J. Chu, J. Luan, H. Bai, Y. Wang, and E. Chang
Appeared in WWW 2009
Abstract: Users of social networking services can connect with each other by forming communities for online interaction. Yet as the number of communities hosted by such websites grows over time, users have even greater need for effective community recommendations in order to meet more users. In this paper, we investigate two algorithms from very different domains and evaluate their effectiveness for personalized community recommendation. First is association rule mining (ARM), which discovers associations between sets of communities that are shared across many users. Second is latent Dirichlet allocation (LDA), which models user-community co-occurrences using latent aspects. In comparing LDA with ARM, we are interested in discovering whether modeling low-rank latent structure is more effective for recommendations than directly mining rules from the observed data. We experiment on an Orkut data set consisting of 492,104 users and 118,002 communities. Our empirical comparisons using the top-k recommendation metric show that LDA performs consistently better than ARM for the community recommendation task when recommending a list of 4 or more communities. However, for recommendation lists of up to 3 communities, ARM is still a bit better. We analyze examples of the latent information learned by LDA to explain this finding. To effectively handle the large-scale data set, we parallelize LDA on distributed computers and demonstrate our parallel implementation's scalability with varying numbers of machines.
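To make the ARM side of the comparison above concrete, here is a minimal sketch that mines one-to-one rules of the form "joined A, so likely to join B" from toy membership data and uses their confidence to recommend communities. It is a simplification under stated assumptions: the paper mines rules over sets of communities at Orkut scale, whereas the pairwise rules, the `min_support` threshold, the summed-confidence scoring, and the toy data here are illustrative only.

```python
from collections import Counter
from itertools import permutations


def mine_pair_rules(memberships, min_support=2):
    """Return {(a, b): confidence} for rules "joined a -> also joined b".

    memberships: dict mapping user -> set of community names (toy data)
    min_support: minimum number of users sharing both communities
    """
    single = Counter()
    pair = Counter()
    for communities in memberships.values():
        single.update(communities)
        # Ordered pairs, so (a, b) and (b, a) are counted as separate rules.
        pair.update(permutations(sorted(communities), 2))
    return {
        (a, b): count / single[a]
        for (a, b), count in pair.items()
        if count >= min_support
    }


def recommend(memberships, user, rules, k=3):
    """Score unjoined communities by summing the confidence of matching rules."""
    joined = memberships[user]
    scores = Counter()
    for (a, b), confidence in rules.items():
        if a in joined and b not in joined:
            scores[b] += confidence
    return [community for community, _ in scores.most_common(k)]


memberships = {
    "u1": {"jazz", "vinyl", "hiking"},
    "u2": {"jazz", "vinyl"},
    "u3": {"jazz", "hiking"},
    "u4": {"jazz"},
}
rules = mine_pair_rules(memberships)
suggestions = recommend(memberships, "u4", rules)
```

With this toy data, every member of "vinyl" also joined "jazz", so the rule vinyl -> jazz has confidence 1.0, while jazz -> vinyl has confidence 0.5 because only two of the four jazz members joined "vinyl".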
posted Sep 2, 2009 2:42 AM by jianshu weng [updated Sep 13, 2009 6:51 PM]
Title: Query Dependent Ranking Using K-Nearest Neighbor (paper) (slides)
Authors: Xiubo Geng, Tie-Yan Liu, Tao Qin, Andrew Arnold, Hang Li, and Heung-Yeung Shum
Appeared in SIGIR 2008
Abstract:
Many ranking models have been proposed in information retrieval, and recently machine learning techniques have also been applied to ranking model construction. Most of the existing methods do not take into consideration the fact that significant differences exist between queries, and only resort to a single function in ranking of documents. In this paper, we argue that it is necessary to employ different ranking models for different queries and conduct what we call query-dependent ranking. As the first such attempt, we propose a K-Nearest Neighbor (KNN) method for query-dependent ranking. We first consider an online method which creates a ranking model for a given query by using the labeled neighbors of the query in the query feature space and then ranks the documents with respect to the query using the created model. Next, we give two offline approximations of the method, which create the ranking models in advance to enhance the efficiency of ranking. We also prove a theory which indicates that the approximations are accurate in terms of difference in loss of prediction, if the learning algorithm used is stable with respect to minor changes in training examples. Our experimental results show that the proposed online and offline methods both outperform the baseline method of using a single ranking function.
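The online method described in the abstract above can be sketched in three steps: find the k training queries nearest to the new query in query feature space, fit a ranking model on their labeled documents only, and rank the new query's documents with that model. In the sketch below the learner is a deliberately simple stand-in (the difference between the centroids of relevant and non-relevant document vectors), not the learning-to-rank algorithm used in the paper; the data layout, feature values, and function names are likewise assumptions.

```python
import math


def nearest_queries(query_vec, training_queries, k):
    """Return the k training queries closest to query_vec (Euclidean distance)."""
    return sorted(
        training_queries,
        key=lambda q: math.dist(query_vec, q["features"]),
    )[:k]


def centroid(vectors, dim):
    """Component-wise mean of a list of vectors; zeros if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def fit_centroid_model(queries):
    """Stand-in learner: relevant-document centroid minus non-relevant centroid."""
    relevant, other = [], []
    for q in queries:
        for doc_vec, label in q["docs"]:
            (relevant if label else other).append(doc_vec)
    dim = len((relevant or other)[0])
    pos, neg = centroid(relevant, dim), centroid(other, dim)
    return [p - n for p, n in zip(pos, neg)]


def rank(documents, weights):
    """Sort document vectors by their linear score, best first."""
    def score(vec):
        return sum(w * x for w, x in zip(weights, vec))
    return sorted(documents, key=score, reverse=True)


# Toy training set: each query has a feature vector and labeled documents.
training = [
    {"features": [0.0, 0.0], "docs": [([1.0, 0.0], 1), ([0.0, 1.0], 0)]},
    {"features": [0.1, 0.0], "docs": [([0.8, 0.2], 1), ([0.2, 0.8], 0)]},
    {"features": [5.0, 5.0], "docs": [([0.0, 1.0], 1), ([1.0, 0.0], 0)]},
]
neighbors = nearest_queries([0.05, 0.0], training, k=2)
weights = fit_centroid_model(neighbors)
ranked = rank([[0.1, 0.9], [0.7, 0.3]], weights)
```

The third training query prefers the opposite kind of document, so a single global model would be pulled in two directions; restricting training to the two nearby queries gives a model that favors the first feature, which is the intuition behind query-dependent ranking.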