Projects

Recommendation System in Online Collaborative Platforms (2018-Now)

With the proliferation of online social collaborative platforms (such as GitHub and StackOverflow), users find themselves increasingly engaged in various activities on different items. To improve user experience, it is desirable to have a recommender system that can suggest not only items to a user but also the activities to be performed on the suggested items. To this end, we propose a novel approach dubbed Keen2Act, which decomposes the recommendation problem into two stages: the Keen and Act steps. In the Keen step, for a given user, we identify a (sub)set of items in which he/she is likely to be interested. In the Act step, we then recommend activities for the user to perform on the identified set of items. We evaluate our proposed approach on two real-world datasets, and the experiments show promising results, with Keen2Act outperforming several baseline models.
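
The two-step decomposition can be sketched as follows; the score functions, thresholds, and toy data here are stand-ins for the learned Keen2Act models, not the actual method:

```python
# Minimal sketch of the Keen/Act decomposition (hypothetical scores).

def keen_step(user, items, interest_score, threshold=0.5):
    """Keen step: keep the subset of items the user is likely interested in."""
    return [i for i in items if interest_score(user, i) >= threshold]

def act_step(user, kept_items, activity_score, activities, top_k=1):
    """Act step: for each kept item, rank activities and keep the top-k."""
    recs = {}
    for item in kept_items:
        ranked = sorted(activities, key=lambda a: activity_score(user, item, a),
                        reverse=True)
        recs[item] = ranked[:top_k]
    return recs

# Toy scores standing in for the learned models.
interest = {("u1", "repoA"): 0.9, ("u1", "repoB"): 0.2}
activity = {("u1", "repoA", "fork"): 0.8, ("u1", "repoA", "watch"): 0.6}

items = keen_step("u1", ["repoA", "repoB"], lambda u, i: interest.get((u, i), 0.0))
recs = act_step("u1", items, lambda u, i, a: activity.get((u, i, a), 0.0),
                ["fork", "watch"])
print(recs)  # only repoA passes the Keen step; "fork" tops the Act step
```

Decomposing the problem this way means the (usually cheap) Keen filter prunes the item space before the finer-grained activity ranking runs.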

Team: Roy Ka-Wei Lee, Richard J. Oentaryo, and Thong Hoang

Deep Learning for Stable Patch Identification (2018-2020)

Linux kernel stable versions serve the needs of users who value the stability of the kernel over new features. The quality of such stable versions depends on the initiative of kernel developers and maintainers to propagate bug-fixing patches to the stable versions. Thus, it is desirable to consider to what extent this process can be automated. A previous approach relies on words from commit messages and a small set of manually constructed code features. This approach, however, shows only moderate accuracy. In this work, we investigate whether deep learning can provide a more accurate solution. We propose PatchNet, a hierarchical deep learning-based approach capable of automatically extracting features from commit messages and commit code and using them to identify stable patches. PatchNet contains a deep hierarchical structure that mirrors the hierarchical and sequential structure of commit code, distinguishing it from existing deep learning models of source code. Experiments on 82,403 recent Linux patches confirm the superiority of PatchNet over various state-of-the-art baselines, including the one recently adopted by Linux kernel maintainers.
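
The hierarchical idea can be illustrated in miniature: word vectors are pooled into line vectors, lines into hunk vectors, and hunks into one commit vector. Mean pooling stands in for PatchNet's learned convolutional layers, and the embeddings are made up for illustration:

```python
# Sketch of hierarchical commit-code encoding in the spirit of PatchNet.
# Mean pooling replaces the learned layers; vectors are illustrative only.

def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def encode_commit(commit):
    """commit: list of hunks; hunk: list of lines; line: list of word vectors."""
    hunk_vecs = []
    for hunk in commit:
        line_vecs = [mean_pool(line) for line in hunk]
        hunk_vecs.append(mean_pool(line_vecs))
    return mean_pool(hunk_vecs)

# Two hunks, each holding lines of 2-d word embeddings.
commit = [
    [[[1.0, 0.0], [0.0, 1.0]],          # hunk 1, line 1
     [[1.0, 1.0]]],                     # hunk 1, line 2
    [[[0.0, 0.0]]],                     # hunk 2, line 1
]
vec = encode_commit(commit)
print(vec)  # a single fixed-size vector summarizing the whole commit
```

The point of mirroring the commit structure is that the final vector respects hunk and line boundaries rather than treating the diff as one flat token stream.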

Team: Thong Hoang, Julia Lawall, Yuan Tian, Richard J. Oentaryo, and David Lo

Talent Flow Analytics (2015-2016)

Analyzing job hopping behavior is important for understanding the job preferences and career progression of working individuals. When analyzed at the workforce population level, job hop analysis helps to gain insights into talent flow and organizational competition. Traditionally, surveys are conducted on job seekers and employers to study job behavior. While surveys are good at getting direct user input to specially designed questions, they are often not scalable or timely enough to cope with the fast-changing job landscape. In this work, we present a data science approach to analyze job hops performed by about 490,000 working professionals located in a city, using their publicly shared profiles. We develop several metrics to measure how much work experience is needed to take up a job and how recent/established the job is, and then examine how these metrics correlate with the propensity of hopping. We also study how job hop behavior is related to job promotion/demotion. Finally, we perform network analyses at the job and organization levels in order to derive insights into talent flow as well as job and organizational competitiveness.
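
As a minimal illustration of the organization-level network analysis, job hops can be aggregated into a weighted flow graph; the record format and the net-inflow indicator below are simplified assumptions, not the exact metrics from the study:

```python
from collections import Counter

def talent_flow(hops):
    """Aggregate job hops (from_org, to_org) into weighted flow edges,
    ignoring moves within the same organization."""
    return Counter((src, dst) for src, dst in hops if src != dst)

def net_inflow(flows, org):
    """Inflow minus outflow: a crude indicator of how much talent an
    organization attracts relative to how much it loses."""
    inflow = sum(w for (s, d), w in flows.items() if d == org)
    outflow = sum(w for (s, d), w in flows.items() if s == org)
    return inflow - outflow

hops = [("A", "B"), ("A", "B"), ("C", "B"), ("B", "A"), ("C", "C")]
flows = talent_flow(hops)
print(flows[("A", "B")], net_inflow(flows, "B"))  # 2 2
```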

Team: Richard J. Oentaryo, Ee-Peng Lim, Philips K. Prasetyo, Anh Thu Vu, Vivian Lai, Dinusha Wijedasa, Kyong Jin Shim, and David Lo

Software Bug Localization (2014-2018)

Debugging often takes considerable effort and resources. To help developers debug, numerous information retrieval (IR)-based and spectrum-based bug localization techniques have been proposed. IR-based techniques process textual information in bug reports, while spectrum-based techniques process program spectra. Both eventually generate a ranked list of program elements that are likely to contain the bug. However, these techniques only consider one source of information, either bug reports or program spectra, which is not optimal. To deal with the limitation of existing techniques, in this work, we propose a new multi-modal technique that considers both bug reports and program spectra to localize bugs. Our approach adaptively creates a bug-specific model to map a particular bug to its possible location, and introduces a novel idea of "suspicious words" that are highly associated with a bug. We evaluate our approach on 157 real bugs from four software systems, and compare it with a state-of-the-art IR-based bug localization method, a state-of-the-art spectrum-based bug localization method, and three state-of-the-art multi-modal feature location methods that are adapted for bug localization.
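
A simple way to picture the multi-modal idea is to combine a normalized IR score with a normalized spectrum score per program element; the min-max normalization and fixed weight below are illustrative, unlike our adaptive, bug-specific model:

```python
# Sketch of multi-modal bug localization: fuse an IR (bug-report text) score
# and a spectrum-based suspiciousness score. The file names, scores, and the
# fixed alpha weight are hypothetical.

def normalize(scores):
    """Min-max normalize a {element: score} dict to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {e: 0.0 for e in scores}
    return {e: (s - lo) / (hi - lo) for e, s in scores.items()}

def combine(ir_scores, spectrum_scores, alpha=0.5):
    """Rank program elements by a weighted sum of the two normalized modalities."""
    ir, sp = normalize(ir_scores), normalize(spectrum_scores)
    combined = {e: alpha * ir[e] + (1 - alpha) * sp[e] for e in ir}
    return sorted(combined, key=combined.get, reverse=True)

ir = {"Foo.java": 0.9, "Bar.java": 0.3, "Baz.java": 0.1}
sp = {"Foo.java": 0.2, "Bar.java": 0.8, "Baz.java": 0.1}
print(combine(ir, sp))  # neither modality alone decides the ranking
```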

Team: Richard J. Oentaryo, Bu Tien Duy Le, Thong Hoang, Yuan Tian, and David Lo

Traffic Flow Prediction via Local Gaussian Processes (2015-2016)

Traffic speed is a key indicator of the efficiency of an urban transportation system. This project addresses the problem of efficient and fine-grained speed prediction using big traffic data obtained from traffic sensors. Gaussian processes (GPs) have been used to model various traffic phenomena; however, GPs do not scale to big data due to their cubic time complexity. We address such efficiency issues by proposing local GPs that learn from and make predictions for correlated subsets of data. The main idea is to quickly group speed variables in both the spatial and temporal dimensions into a finite number of clusters, so that future and unobserved traffic speed queries can be heuristically mapped to one of these clusters. A local GP corresponding to that cluster can then be trained on the fly to make predictions in real time. We call this step localization, and it is done using non-negative matrix factorization. We additionally leverage the expressiveness of GP kernel functions to model the road network topology and incorporate side information. Extensive experiments using real-world traffic data show that our proposed method significantly improves both runtime and prediction performance.
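
The pipeline can be sketched in a few lines: cluster the training points, map the query to its nearest cluster, and fit an exact GP on that cluster alone. Plain k-means stands in for the NMF-based localization, and the kernel ignores road topology for brevity:

```python
import numpy as np

def rbf(X1, X2, length=1.0):
    """Squared-exponential kernel matrix between two sets of points."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length ** 2)

def gp_predict(X, y, Xq, noise=1e-6):
    """Exact GP regression mean at query points Xq (cubic in len(X))."""
    K = rbf(X, X) + noise * np.eye(len(X))
    return rbf(Xq, X) @ np.linalg.solve(K, y)

def local_gp_predict(X, y, xq, n_clusters=2, n_iters=10, seed=0):
    """Map the query to a cluster, then train a small GP on that cluster only.
    Plain k-means stands in here for the NMF-based localization step."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iters):  # a few k-means iterations
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(n_clusters):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(axis=0)
    k = int(np.argmin(((xq - centers) ** 2).sum(-1)))  # localize the query
    mask = labels == k
    return float(gp_predict(X[mask], y[mask], xq[None])[0])

# Toy 1-d "speed" data forming two well-separated clusters.
X = np.array([[0.0], [0.1], [5.0], [5.1]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(local_gp_predict(X, y, np.array([5.05])))  # close to 1.0
```

Because each GP sees only its cluster, the cubic cost applies to a small subset rather than the full dataset, which is the source of the runtime gain.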

Team: Truc Viet Le, Richard J. Oentaryo, Siyuan Liu, and Hoong Chuin Lau

Business Location Analytics (2015-2016)

If you were the owner of a local cafe, would you not want to know the best location to set up your business? Which factors, such as nearest competitors, hotspots, and human flow, should be considered? In this work, we seek to answer the above questions by investigating the use of Facebook check-ins to evaluate or estimate the success of businesses. Using a dataset of more than twenty thousand food businesses on Facebook pages, we conduct an analysis of several success-related factors, including business categories, locations, and neighboring businesses. From these factors, we extract a set of relevant features and develop an efficient way to predict business success. Our experiments show that the success of neighboring businesses contributes the key features for accurate prediction. We finally illustrate the application of such a prediction method using a user-friendly food business recommender system.
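
As an illustration of the neighboring-business features, one can compute, for each venue, the mean check-in count of other venues within a small radius; the venue names, coordinates, counts, and radius below are made up:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def neighbor_success(venue, venues, radius_km=0.5):
    """Mean check-in count of other venues within radius_km: the kind of
    neighboring-business feature that proved most predictive."""
    lat, lon, _ = venues[venue]
    nbrs = [c for v, (la, lo, c) in venues.items()
            if v != venue and haversine_km(lat, lon, la, lo) <= radius_km]
    return sum(nbrs) / len(nbrs) if nbrs else 0.0

# (lat, lon, check-ins); all values are illustrative.
venues = {
    "cafe_a": (1.3000, 103.8000, 120),
    "cafe_b": (1.3001, 103.8001, 300),   # a few metres from cafe_a
    "cafe_c": (1.4000, 103.9000, 50),    # far away
}
print(neighbor_success("cafe_a", venues))  # only cafe_b is within 500 m
```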

Team: Jovian Lin, Richard J. Oentaryo, Ee-Peng Lim, Anh Thu Vu, Adrian Vu, and Philips K. Prasetyo

Profiling Latent User Attributes in Social Media (2014-2015)

Social media have become an important platform for users to connect and share content. With the massive amount of user-generated data, it is now possible to develop methods for inferring latent user attributes, which are useful for personalization and advertising. However, traditional, purely supervised methods for predicting user attributes have had limited success due to the scarce availability of labeled data. Moreover, they do not yet account for different types of social relationships encoded as multiple social graphs. We thus develop a multi-relational semi-supervised learning framework that utilizes a large pool of unlabeled data and multiple social graphs in a synergistic manner. Our approach is built upon a sound probabilistic basis, and features an efficient learning procedure to estimate the model parameters. Its predictive performance has been demonstrated through extensive empirical studies using Singapore Twitter data.
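
The synergy between graphs can be pictured with a much simpler scheme than our probabilistic model: weighted label propagation over several adjacency lists at once, with labeled nodes clamped. The graphs, weights, and seed labels below are illustrative:

```python
def multi_graph_propagate(graphs, graph_weights, seeds, nodes, n_iters=50):
    """Sketch of multi-relational semi-supervised inference: each unlabeled
    node takes a weighted average of neighbor scores over several social
    graphs at once; labeled nodes keep their known attribute value."""
    scores = {u: seeds.get(u, 0.5) for u in nodes}
    for _ in range(n_iters):
        new = {}
        for u in nodes:
            if u in seeds:                      # labeled nodes stay clamped
                new[u] = seeds[u]
                continue
            num = den = 0.0
            for g, w in zip(graphs, graph_weights):
                for v in g.get(u, []):
                    num += w * scores[v]
                    den += w
            new[u] = num / den if den else scores[u]
        scores = new
    return scores

follows = {"u3": ["u1"], "u1": ["u3"], "u2": []}   # one relationship type
mentions = {"u3": ["u2"], "u2": ["u3"]}            # another relationship type
seeds = {"u1": 1.0, "u2": 0.0}                     # known attribute labels
scores = multi_graph_propagate([follows, mentions], [2.0, 1.0], seeds,
                               ["u1", "u2", "u3"])
print(scores["u3"])  # leans toward u1's label, since "follows" is weighted higher
```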

Team: Richard J. Oentaryo, Freddy C. T. Chua, Ee-Peng Lim, Jia-Wei Low, Philips K. Prasetyo, and David Lo

Algorithm Selection via Ranking (2014-2015)

The abundance of algorithms developed to solve different problems has given rise to an important research question: How do we choose the best algorithm for a given problem? Known as algorithm selection, this issue is prevalent in many domains, as no single algorithm can perform best on all problem instances. Traditional algorithm selection and portfolio construction methods typically treat the problem as a classification or regression task. In this work, we present a new approach that provides a more natural treatment of algorithm selection and portfolio construction as a ranking task. We develop a Ranking-Based Algorithm Selection method, which employs a simple polynomial model to capture the ranking of different solvers for different problem instances. Experiments on the SAT 2012 competition dataset show that our approach yields performance competitive with that of more sophisticated algorithm selection methods.
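
A stripped-down version of the idea: fit one polynomial model per solver over instance features, then select the solver whose prediction ranks first. For brevity this sketch fits runtimes by least squares rather than optimizing a ranking loss, and the scalar instance feature is a toy assumption:

```python
import numpy as np

def poly_features(x, degree=2):
    """[1, x, x^2, ...] features for a scalar instance feature."""
    return np.array([x ** d for d in range(degree + 1)])

def fit_solver_models(X, runtimes, degree=2):
    """Least-squares polynomial model per solver (a stand-in for training
    against a ranking objective)."""
    Phi = np.array([poly_features(x, degree) for x in X])
    return [np.linalg.lstsq(Phi, runtimes[:, s], rcond=None)[0]
            for s in range(runtimes.shape[1])]

def select_solver(models, x, degree=2):
    """Pick the solver whose predicted runtime ranks first (lowest)."""
    phi = poly_features(x, degree)
    return int(np.argmin([phi @ w for w in models]))

# Toy data: solver 0 is fast on small instances, solver 1 on large ones.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
runtimes = np.column_stack([X ** 2, 10.0 - X])
models = fit_solver_models(X, runtimes)
print(select_solver(models, 1.5), select_solver(models, 5.0))  # 0 1
```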

Team: Richard J. Oentaryo, Stephanus Daniel Handoko, and Hoong Chuin Lau

Recommender System for Software Engineering (2014-2016)

APIs provide a plethora of functionalities for developers to reuse without reinventing the wheel. Thousands of APIs are available, and identifying the appropriate APIs for a given requirement can contribute to the success of a project. However, given the large number of APIs, it can be challenging for a developer to find the right ones. In this project, we propose a new approach called APIRec that takes as input a description of a project and outputs a ranked set of APIs that are potentially relevant to the project. At its heart, APIRec employs a personalized ranking model that ranks APIs specific to a project. Based on historical data of API usage, APIRec learns a model that minimizes the incorrect ordering of APIs, which occurs when a used API is ranked lower than an unused (or not-yet-used) API. Our evaluation using a dataset taken from ProgrammableWeb shows that APIRec is substantially better than the recommendation provided by ProgrammableWeb’s native search functionality. It also outperforms popularity-based, vector space model, and collaborative filtering methods.
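
The pairwise idea can be sketched with BPR-style updates on project and API factors: each step nudges a used API's score above that of a sampled unused one, directly penalizing incorrect orderings. The factorization below is illustrative; the actual APIRec model and its features differ:

```python
import numpy as np

def pairwise_train(n_projects, n_apis, used, dim=8, lr=0.05, reg=0.01,
                   epochs=200, seed=0):
    """Pairwise ranking sketch: for each project, push the score of a used
    API above that of a sampled unused API (BPR-style SGD updates)."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_projects, dim))   # project factors
    Q = 0.1 * rng.standard_normal((n_apis, dim))       # API factors
    for _ in range(epochs):
        for p, pos in used.items():
            for i in pos:
                j = int(rng.integers(n_apis))          # sample an unused API
                while j in pos:
                    j = int(rng.integers(n_apis))
                x = P[p] @ (Q[i] - Q[j])               # score gap: used - unused
                g = 1.0 / (1.0 + np.exp(x))            # gradient weight
                P[p] += lr * (g * (Q[i] - Q[j]) - reg * P[p])
                Q[i] += lr * (g * P[p] - reg * Q[i])
                Q[j] += lr * (-g * P[p] - reg * Q[j])
    return P, Q

used = {0: {0, 1}, 1: {2, 3}}                          # project -> used APIs
P, Q = pairwise_train(n_projects=2, n_apis=4, used=used)
print((P[0] @ Q.T).round(2))
```

After training, each project's used APIs score above the unused ones, which is exactly the ordering the loss rewards.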

Team: Ferdian Thung, Richard J. Oentaryo, and David Lo

Inferring Temporal Social Correlation via Topic Model (2013-2014)

The abundance of online user data has led to a surge of interest in understanding the dynamics of social relationships using computational techniques. To this end, we propose to utilize users’ item adoption data to measure the information flow between users over time, termed Temporal Social Correlation (TSC). We develop a Linear Dynamical Topic Model (LDTM) to address several issues, such as the difficulty of representing users’ adoption behavior in latent space for sparse adoption data, and estimating the rate of decay in users’ preferences over time. Using the time series constructed from the topic distributions found by LDTM, we then conduct Granger causality tests to measure TSC. Experiments on bibliographic datasets show that the ordering of (co)authors’ names plays a key role in the flow of information between authors.
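
The Granger step can be illustrated in miniature: x is said to Granger-cause y if x's past lags reduce the error of predicting y beyond what y's own past achieves. The sketch below measures that error reduction directly rather than running the full F-test, and uses synthetic series in place of the LDTM topic trajectories:

```python
import numpy as np

def rss(A, b):
    """Residual sum of squares of the least-squares fit b ≈ A w."""
    w = np.linalg.lstsq(A, b, rcond=None)[0]
    r = b - A @ w
    return float(r @ r)

def granger_gain(y, x):
    """Fractional error reduction from adding x's one-step lag when
    predicting y. A full Granger test would turn this into an F-statistic;
    the gain alone already shows the direction of information flow."""
    yt, y1, x1 = y[1:], y[:-1], x[:-1]
    ones = np.ones_like(yt)
    restricted = rss(np.column_stack([ones, y1]), yt)       # y's past only
    full = rss(np.column_stack([ones, y1, x1]), yt)         # plus x's past
    return (restricted - full) / restricted

rng = np.random.default_rng(0)
x = rng.standard_normal(500)
y = np.concatenate([[0.0], x[:-1]]) + 0.1 * rng.standard_normal(500)
print(granger_gain(y, x) > granger_gain(x, y))  # x "Granger-causes" y
```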

Team: Freddy C. T. Chua, Richard J. Oentaryo, and Ee-Peng Lim

Predicting Response in Mobile Advertising (2012-2013)

Mobile advertising has seen dramatic growth, fueled by the global proliferation of mobile devices. Predicting ad response is thus crucial for maximizing business revenue. However, ad response data change dynamically over time, and are subject to cold-start situations in which limited history hinders reliable prediction. There is also a need for robust regression estimation for high prediction accuracy, and good ranking to distinguish the impacts of different ads. To this end, we develop a generic latent factor model that incorporates importance weights and hierarchical learning. Empirical studies on real-world mobile advertising data show that our model outperforms contemporary temporal models. The results also demonstrate the efficacy of the proposed importance-aware and hierarchical learning in improving overall prediction and prediction in cold-start scenarios, respectively.
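
The importance-weighting idea can be sketched with a small weighted matrix factorization: each observation pulls the model with strength proportional to its weight, so recent responses can dominate older ones. This SGD sketch omits the hierarchical part of the model, and all data values are made up:

```python
import numpy as np

def weighted_mf(obs, n_rows, n_cols, dim=4, lr=0.05, reg=0.01,
                epochs=500, seed=0):
    """Latent factor sketch with importance weights: each observation is a
    tuple (ad, context, response, weight), and higher-weight observations
    contribute larger gradient steps."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_rows, dim))
    V = 0.1 * rng.standard_normal((n_cols, dim))
    for _ in range(epochs):
        for i, j, r, w in obs:
            err = r - U[i] @ V[j]
            U[i] += lr * (w * err * V[j] - reg * U[i])
            V[j] += lr * (w * err * U[i] - reg * V[j])
    return U, V

# Same ad/context observed twice: an old low response (small weight) and a
# recent high response (full weight).
obs = [(0, 0, 0.2, 0.2), (0, 0, 0.8, 1.0)]
U, V = weighted_mf(obs, n_rows=1, n_cols=1)
pred = float(U[0] @ V[0])
print(round(pred, 2))  # lands much nearer the heavily weighted 0.8
```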

Team: Richard J. Oentaryo, Ee-Peng Lim, Jia-Wei Low, David Lo, and Michael Finegold

Detecting Click Fraud in Mobile Advertising (2012)

Click fraud, the deliberate clicking on advertisements with no real interest in the product or service offered, is one of the most daunting problems in online advertising. Building an effective fraud detection method is thus pivotal for online advertising businesses. Our goal is to identify fraudulent publishers who generate illegitimate clicks, and to distinguish them from normal publishers. We have developed and experimented with a wide variety of machine learning algorithms, including state-of-the-art single classifier and ensemble model approaches. Our principal findings are that features derived from fine-grained time-series analysis are crucial for accurate fraud detection, and that ensemble models offer promising solutions to highly imbalanced nonlinear classification tasks with mixed variable types and noisy/missing patterns.

Team: Richard J. Oentaryo, Ee-Peng Lim, David Lo, Feida Zhu, and Michael Finegold

Collective Churn Prediction in Social Network (2011-2012)

In service-based industries, churn poses a significant threat to the integrity of user communities and the profitability of service providers. Research on churn prediction methods has thus been actively pursued, involving either intrinsic (user profile) factors or extrinsic (social) factors. However, existing approaches often address each type of factor separately. We propose a new churn prediction approach based on collective classification (CC), which accounts for both the intrinsic and extrinsic factors by utilizing the local features of, and dependencies among, individuals during the prediction steps. We evaluate our CC approach using real data provided by an established mobile social networking site, with a particular focus on user chat activities. Our results show that using CC and social features derived from interaction records and network structure yields improved prediction compared to using conventional classification with user profile features only.
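
The collective classification loop can be sketched as an iterative procedure: each user's churn score mixes an intrinsic, profile-based score with the current scores of their neighbors, while users with observed outcomes stay clamped. The graph, scores, and mixing weight below are illustrative:

```python
def collective_classify(graph, local_score, labels, n_iters=5, weight=0.5):
    """Iterative collective classification sketch: repeatedly blend each
    user's intrinsic (profile-based) churn score with the mean score of
    their neighbors; users with known outcomes keep their label."""
    scores = dict(local_score)
    for _ in range(n_iters):
        new = {}
        for u in graph:
            fixed = labels.get(u)
            if fixed is not None:               # observed churn outcome
                new[u] = float(fixed)
                continue
            nbr = [scores[v] for v in graph[u]]
            social = sum(nbr) / len(nbr) if nbr else scores[u]
            new[u] = (1 - weight) * local_score[u] + weight * social
        scores = new
    return scores

# u3's profile alone is ambiguous (0.5), but both friends are known churners.
graph = {"u1": ["u3"], "u2": ["u3"], "u3": ["u1", "u2"]}
local = {"u1": 0.9, "u2": 0.9, "u3": 0.5}
known = {"u1": 1, "u2": 1}                      # observed churners
scores = collective_classify(graph, local, known)
print(scores["u3"])  # pulled above its intrinsic 0.5 by the social evidence
```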

Team: Richard J. Oentaryo, Ee-Peng Lim, David Lo, Feida Zhu, and Philips K. Prasetyo

Intelligent System for Complex Manufacturing (2010-2011)

Modeling machining processes plays a crucial role in manufacturing operations, in view of its substantial impact on overall cost effectiveness and productivity. To this end, computational intelligence approaches, such as neural networks, fuzzy systems, and hybrid fuzzy neural networks, have increasingly been employed in recent years. However, the existing approaches are largely based on a batched learning procedure, in which all machining data are assumed to be available and accessible repeatedly. Such an approach is impractical in the face of large data streams, and is not suitable for dynamic, time-varying tasks. In this light, we develop a novel fuzzy neural network, which features a fully online learning scheme established upon a solid statistical (probabilistic) foundation. Empirical studies on tool wear prognosis and chaotic time series prediction tasks have verified the efficacy of the proposed system as an online modeling tool.

Team: Richard J. Oentaryo, Meng Joo Er, Linn San, Lianyin Zhai, and Xiang Li

Integrated Neuro-Cognitive Architecture (2004-2010)

Developing a general machine intelligence that can provide truly natural interaction and human-like cognition has been a major challenge in artificial intelligence research. To this end, cognitive architectures are increasingly investigated as generic blueprints for intelligent agents that can operate across different task domains. A variety of cognitive architectures have been formulated over the years, but there remains a need to further develop salient aspects of general intelligence, such as knowledge consolidation, system scalability, and metacognitive functions. To realize these three salient aspects, we develop an Integrated Neuro-Cognitive Architecture (INCA) that models the putative functional aspects of the major brain systems and their interactions. To systematically coordinate the interaction among the modules, we develop two novel procedures, namely the consolidation and inference cycles. The two cycles distinguish INCA from other contemporary cognitive architectures, and are realized using slow-learning and fast-learning memory models that we develop based on a novel neuro-fuzzy system approach.

Team: Richard J. Oentaryo, Michel Pasquier, and Chai Quek