Research Projects

Non-expert Annotation for Expert Content

Annotation is usually seen as a task for domain experts (or at least for annotators who have had significant training in a given annotation schema and understand the content being annotated). Annotation guidelines can easily run to hundreds of pages and often require days of training to master, and expert annotators often work on annotation as a full-time job. But what if we could get non-experts to annotate content, parallelizing the effort and dramatically reducing the training and annotation time required? We explore the feasibility of leveraging non-expert crowd workers to annotate text (e.g., Ubuntu IRC logs, advising data) with complex labels (e.g., entities) that require expertise they may not possess. In this project, we introduce novel interfaces and explore methods that provide domain-specific knowledge, in the form of distributed context clues, to help crowd workers annotate expert content more accurately.
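
As a minimal sketch of the context-clue idea (the glossary entries, entity types, and function below are illustrative stand-ins, not our deployed interface):

    # Sketch: attach domain-specific context clues to tokens before showing
    # a snippet to a non-expert crowd worker. The glossary is a toy stand-in
    # for the distributed domain knowledge used in the real interface.
    GLOSSARY = {
        "grub": "GRUB: the Ubuntu boot loader (a Software entity)",
        "apt-get": "apt-get: command-line package manager (a Software entity)",
        "lucid": "Lucid Lynx: codename for Ubuntu 10.04 (a Version entity)",
    }

    def add_context_clues(message: str) -> list[tuple[str, str | None]]:
        """Pair each token with a clue the worker can hover over, if any."""
        return [(tok, GLOSSARY.get(tok.lower())) for tok in message.split()]

    for token, clue in add_context_clues("try apt-get to reinstall grub on lucid"):
        print(f"{token:>12}  {clue or ''}")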

Image Annotation for Autonomous Vehicle Safety

Safety is one of the most important concerns in the study of autonomous vehicles. Researchers have spent years collecting data to analyze when and why autonomous vehicles cause accidents, but deploying autonomous vehicles in the real world to collect data in unsafe scenarios is not feasible. Instead, we collect and analyze real-world unsafe scenarios and work out the causes of existing accidents.

Crowd workers have proven effective at annotating objects, and they can do more than draw boxes: previous research has demonstrated crowd workers’ ability to find a variety of external information for annotated objects, such as helping blind people locate a microwave through a real-time camera feed and finding the microwave’s manual online so they can press the right buttons. In this project, we recruit crowd workers to annotate objects in vehicle crash images and to find external information about them, such as the make, model, year, length, width, and height of the vehicles. Workers annotate the details of objects in the scene and bind each measurement to its annotation. We have designed a collaborative annotation system that lets workers efficiently annotate objects and find external information. We are also exploring 3D reconstruction of the objects in these scenes, offering the reconstructed scenes as important data for autonomous vehicle safety research.
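
A minimal sketch of how an annotation might bind externally found vehicle information to an image region (the field names and schema are illustrative assumptions, not our system’s actual data model):

    from dataclasses import dataclass

    @dataclass
    class VehicleInfo:
        make: str            # e.g., "Toyota"
        model: str           # e.g., "Camry"
        year: int
        length_m: float      # real-world dimensions, useful for 3D reconstruction
        width_m: float
        height_m: float

    @dataclass
    class ObjectAnnotation:
        image_id: str
        bbox: tuple[float, float, float, float]   # (x, y, width, height) in pixels
        label: str                                 # e.g., "vehicle"
        external_info: VehicleInfo | None = None   # bound by workers via web search

    ann = ObjectAnnotation(
        image_id="crash_0042.jpg",
        bbox=(120.0, 80.0, 340.0, 210.0),
        label="vehicle",
        external_info=VehicleInfo("Toyota", "Camry", 2012, 4.81, 1.82, 1.47),
    )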

Crowd Machine Learning

Machine learning has achieved great success in recent years, success that relies heavily on the large scale of the data used in model training. However, large amounts of data are not always available, and not always necessary: researchers can select and build small datasets for model training and still reach good performance. Many hybrid intelligence systems have been built to leverage both human and machine intelligence, obtaining good results at low cost. In some of these hybrid systems, experts (or oracles) or non-experts (such as crowd workers) select and annotate data points that are hard for machine learning algorithms to process, and these “challenging” data points are fed into model training. Inspired by Panos Ipeirotis’s idea of “Beat the Machine”, our CrowdML project goes beyond labeling individual challenging data points: we recruit crowd workers to label “challenging” patterns in classification tasks. For example, to build a reliable classifier for dogs and cats, we first build a rough classifier and show its preliminary results to workers. We then ask the workers to focus on the misclassified images and to find specific patterns in them (e.g., that white fluffy dogs are easily misclassified). We designed a workflow that iteratively collects these patterns based on agreement among multiple workers, then feeds the collected patterns and their corresponding images back to the machine learning algorithms to reach good performance.
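
A toy sketch of the agreement-based pattern-collection step (the images, worker reports, and threshold below are illustrative placeholders for the real crowdsourcing tasks):

    from collections import Counter

    # Toy stand-ins: in the real workflow these come from a trained rough
    # classifier and from crowd workers inspecting the misclassified images.
    misclassified = ["dog_14.jpg", "dog_77.jpg", "cat_03.jpg"]
    worker_reports = [                        # one list of patterns per worker
        ["white fluffy dogs", "kittens in baskets"],
        ["white fluffy dogs"],
        ["white fluffy dogs", "dark photos"],
    ]

    AGREEMENT_THRESHOLD = 3  # assumed: keep a pattern once 3+ workers report it

    counts = Counter(p for report in worker_reports for p in report)
    agreed = [p for p, n in counts.items() if n >= AGREEMENT_THRESHOLD]
    print(agreed)  # ['white fluffy dogs'] -> fed back into model training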

CodeMend: Assisting Interactive Programming with Bimodal Embedding

Software APIs often contain too many methods and parameters for developers to memorize or navigate effectively. Instead, developers resort to finding answers through online search engines and systems such as Stack Overflow. However, the process of finding and integrating a working solution is often very time-consuming. Though code search engines have increased in quality, there remain significant language and workflow gaps in meeting end-user needs: novice and intermediate programmers often lack the "language" to query and the expertise to transfer found code to their task. To address this problem, we present CodeMend, a system that supports finding and integrating code. CodeMend leverages a neural embedding model to jointly model natural language and code mined from large Web and code datasets. We also demonstrate a novel, mixed-initiative interface to support the query and integration steps. Through CodeMend, end-users describe their goal in natural language; the system highlights the relevant API functions and the lines in the end-user's program that should be changed, and proposes the actual change. We demonstrate the utility and accuracy of CodeMend through lab and simulation studies.
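
A toy sketch of the ranking step behind this interaction: once a natural-language query and each program line live in a shared embedding space, cosine similarity surfaces the lines to highlight (the random vectors below stand in for CodeMend's jointly trained embeddings):

    import numpy as np

    rng = np.random.default_rng(0)
    dim = 64

    # Placeholder embeddings; CodeMend learns these jointly from Web/code data.
    query_vec = rng.normal(size=dim)            # e.g., "make the bars red"
    line_vecs = rng.normal(size=(20, dim))      # one vector per program line

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    scores = [cosine(query_vec, v) for v in line_vecs]
    top_lines = np.argsort(scores)[::-1][:3]    # lines to highlight/edit
    print(top_lines)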

Machine Learning for Sketched Animation

Apparition is a real-time crowdsourcing system that generates animations from users' requests. In this system, crowd workers are recruited to create elements on the canvas and to draw animations easily using the toolkit the system provides. Different workers are assigned to different animation creation tasks, and combining their output yields high-quality animation sketches that can serve as starting points for professional animation work.

User feedback on Apparition shows that many behaviors recur frequently during animation creation. For example, clicking a button (the trigger of a gun) and releasing an object (a bullet) is a common behavior in many animations. How to find these common behaviors and help workers generate the animations automatically is an interesting question. In this project, we plan to build machine learning models, such as LSTMs, to capture these common behaviors and to build smart tools that automatically create parts of animations for crowd workers. We expect workers to become more efficient with our animation automation toolkit.
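
A minimal PyTorch sketch of the kind of model we have in mind: an LSTM that reads a worker's recent action sequence and predicts the next action, which a smart tool could then perform automatically (the action vocabulary and sizes are illustrative assumptions):

    import torch
    import torch.nn as nn

    NUM_ACTIONS = 32   # e.g., "press button", "release object", "move sprite"

    class BehaviorLSTM(nn.Module):
        def __init__(self, num_actions=NUM_ACTIONS, embed=16, hidden=64):
            super().__init__()
            self.embed = nn.Embedding(num_actions, embed)
            self.lstm = nn.LSTM(embed, hidden, batch_first=True)
            self.head = nn.Linear(hidden, num_actions)

        def forward(self, actions):                 # (batch, seq_len) int64 ids
            h, _ = self.lstm(self.embed(actions))
            return self.head(h[:, -1])              # logits for the next action

    model = BehaviorLSTM()
    seq = torch.randint(0, NUM_ACTIONS, (1, 10))    # one 10-step action trace
    next_action = model(seq).argmax(dim=-1)         # suggested automation step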

Data Quality Control in eBird

eBird is a citizen science project that collects bird observation data around the world; millions of bird observation records (checklists) have been submitted to eBird in the past decade. One of the major concerns in eBird is data quality. Because many eBird observers are amateurs who lack avian knowledge, we design models that automatically judge observers' expertise levels when they submit checklists. These models have helped eBird filter the many checklists that must be reviewed by volunteers in different states. We are now developing models that judge expertise level while accounting for temporal and spatial factors, with a focus on designing metrics that measure users' experience across their years of submissions.
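
A toy sketch of expertise scoring as a supervised problem (the features, labels, and model choice are assumptions for illustration, not eBird's actual review pipeline):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Features per checklist (assumed): [species reported, minutes observed,
    # prior checklists submitted by this observer]
    X = np.array([
        [42, 120, 500],   # experienced birder
        [35,  90, 320],
        [ 3,   5,   1],   # likely novice
        [ 2,  10,   2],
    ], dtype=float)
    y = np.array([1, 1, 0, 0])  # 1 = expert-level (labels from human reviewers)

    clf = LogisticRegression().fit(X, y)
    new_checklist = np.array([[28, 60, 150]])
    print(clf.predict_proba(new_checklist)[0, 1])  # estimated expertise score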

Measuring Breadth of Research

In scientific communities, metrics of scientific impact, such as the impact factor and the h-index, are widely used, but there is no widely accepted metric for the breadth of a scholar's research. In this project, I design a new metric, based on the earlier Generalized Stirling metrics, that considers different aspects of research areas. I also developed a new axiom-based evaluation method to judge whether a metric is suitable for measuring breadth of research, and I test the relationship between breadth of research and scientific impact to answer whether choosing interdisciplinary research projects is a good investment for improving scientific impact across scientific communities.
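
A hedged sketch of the underlying computation: the Stirling family scores breadth as a sum over pairs of research areas, weighting how much a scholar publishes in each pair by how dissimilar the areas are (the generalized form adds exponents; the toy proportions and distance matrix below are made up):

    import numpy as np

    def stirling_breadth(p, d, alpha=1.0, beta=1.0):
        """Generalized Stirling diversity: sum over distinct area pairs of
        (p_i * p_j)^alpha * d_ij^beta, where p_i is the scholar's share of
        output in area i and d_ij is the distance between areas i and j.
        alpha = beta = 1 recovers the Rao-Stirling index."""
        p, d = np.asarray(p), np.asarray(d)
        n = len(p)
        return sum((p[i] * p[j]) ** alpha * d[i, j] ** beta
                   for i in range(n) for j in range(n) if i != j)

    # Toy example: three areas, with areas 0 and 2 the most dissimilar.
    p = [0.5, 0.3, 0.2]
    d = np.array([[0.0, 0.4, 0.9],
                  [0.4, 0.0, 0.5],
                  [0.9, 0.5, 0.0]])
    print(stirling_breadth(p, d))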

Dynamic Scientific Collaboration

Researchers in scientific communities tend to join work on popular research topics over the years, so researchers' behaviors (publication, collaboration) can help us understand how hot topics shift over time. In this project, we use unsupervised learning techniques to find patterns in the variation of researchers' behavior in cluster science. The close collaboration between scholars in cluster physics and cluster chemistry illustrates the general development of cluster science.
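
A minimal sketch of the unsupervised step, assuming we represent each researcher as a yearly behavior trajectory and cluster those trajectories (the feature choice and toy data are illustrative, not our actual pipeline):

    import numpy as np
    from sklearn.cluster import KMeans

    # Rows = researchers, columns = publication counts over five years.
    trajectories = np.array([
        [1, 2, 5, 9, 12],   # ramping up on a rising topic
        [0, 1, 4, 8, 11],
        [7, 6, 4, 2, 1],    # moving away from a fading topic
        [8, 7, 5, 3, 1],
    ], dtype=float)

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(trajectories)
    print(labels)  # e.g., [0 0 1 1]: two distinct behavior patterns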