Bloomberg: Roll Call Vote Prediction using Universal Schema Knowledge Base
Arpit Singh, Kriti Myer, Pallavi Patil, Ronak Zala
Abstract: This project explores the benefits of using an embedded structured knowledge base to improve a prediction task, namely predicting roll-call votes from bill text. The structured knowledge base (KB) is created using universal schema and populated with relevant NYTimes articles and Freebase relations. The relations inferred by the KB are used to augment a baseline model from Kraft et al. (2016). We aim to show that the presence of this KB enhances predictions, especially for politicians who appear infrequently in the training data. We also aim to explore the benefits of structuring the information, as opposed to using raw text features, and to evaluate how performance differs when the knowledge base is built from text drawn from different parts of an article.
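As a rough illustration of the universal-schema idea (the entity pairs, relation names, and hyperparameters below are all hypothetical, not taken from the project), a minimal sketch factorizes a matrix whose rows are entity pairs and whose columns mix KB relations with textual patterns, so that unobserved cells can be scored:

```python
import numpy as np

rng = np.random.default_rng(2)

# Rows: entity pairs; columns: relations (KB relations and textual patterns mixed)
pairs = ["(politician_a, state_x)", "(politician_b, state_y)"]
relations = ["senator_of", "born_in", "'ARG1 represents ARG2'"]
observed = {(0, 0), (0, 2), (1, 0)}  # observed (pair, relation) facts

# Logistic matrix factorization: a fact's score is the sigmoid of the
# dot product between its pair embedding and relation embedding.
P = rng.normal(scale=0.1, size=(len(pairs), 4))
R = rng.normal(scale=0.1, size=(len(relations), 4))
lr = 0.1
for _ in range(200):
    for i in range(len(pairs)):
        for j in range(len(relations)):
            y = 1.0 if (i, j) in observed else 0.0
            p = 1.0 / (1.0 + np.exp(-(P[i] @ R[j])))
            g = p - y  # gradient of the logistic loss w.r.t. the logit
            P[i], R[j] = P[i] - lr * g * R[j], R[j] - lr * g * P[i]

def fact_score(i, j):
    """Probability that pair i participates in relation j."""
    return 1.0 / (1.0 + np.exp(-(P[i] @ R[j])))
```

In a full universal-schema system the unobserved cells would be treated as candidates for inference rather than hard negatives; this toy version only shows the shared embedding space over KB relations and textual patterns.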
Microsoft Research Montreal: Resolving Large Action Spaces in Text-based Games
Bryon Kucharski, Rakesh Radhakrishnan Menon, Satya Surya Venkata Sasi Kiran Yelamarthi, Clayton Thorrez
Abstract: Reinforcement Learning has achieved considerable success in tasks ranging from video-game play to robotics by maximizing cumulative reward. However, a number of challenges still limit the deployment of these models in real-world scenarios. One such challenge is the question of ``how to deal with very large action spaces?''. Humans face a multitude of possible decisions at every instant of daily life, which makes this an important challenge. Current state-of-the-art algorithms struggle to explore such large action spaces and hence require an exorbitant number of samples to train. In this paper, we tackle the problem of large natural language action spaces in a text-based game setting through a combination of action elimination and action generation techniques. We show the efficacy of our approach on a set of cooking tasks in a home world using the TextWorld environment.
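The action-elimination half of such an approach can be sketched very simply (this is a generic illustration, not the authors' implementation; the Q-values and the eliminated set are assumed to come from a learned agent and a learned eliminator):

```python
def select_action(q_values, eliminated):
    """Greedy action selection after action elimination:
    pick the highest-valued action that has not been ruled out."""
    best_action, best_q = None, float("-inf")
    for action, q in enumerate(q_values):
        if action in eliminated:
            continue  # this command was predicted to be invalid or irrelevant
        if q > best_q:
            best_action, best_q = action, q
    return best_action

# Hypothetical Q-values for four text commands; command 1 has been eliminated,
# so the agent never wastes a sample on it despite its high value.
action = select_action([0.2, 0.9, 0.4, 0.1], eliminated={1})
```

Shrinking the candidate set this way is what reduces the number of samples needed: exploration is restricted to commands the eliminator still considers plausible.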
Amazon.com: Fast Intent Classification for Smart Assistants
Akshit Tyagi, Nan Zhuang, Varun Sharma, Zihang Wang, Lynn Samson
Abstract: Smart voice assistants have become an integral part of our daily interactions, in part because of their near-human manner of maintaining dialogue with the end-user. For this, the assistant needs to infer the correct response quickly and with a certain degree of confidence. However, these assistants generally lack the flexibility to reason more deeply about specific requests. Distinguishing which requests need deeper inference and which can be handled by shallow processing makes the assistant's responses seem richer and more human-like. Recently, early-exiting strategies have been used successfully in deep convolutional networks. We propose a strategy that combines these early-exiting strategies with conventional NLP networks, letting the agent learn when it can predict from a shallower branch and when it needs to go through more layers of the same network to respond more confidently.
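The mechanics of early exiting can be sketched as follows (a toy illustration with random weights, not the proposed model: every layer gets its own classifier head, and inference stops at the first head whose confidence clears a threshold):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class EarlyExitClassifier:
    """Toy early-exit network: each layer has its own classifier head,
    and inference returns from the first head that is confident enough."""
    def __init__(self, dim, n_classes, n_layers, threshold=0.9):
        self.layers = [rng.normal(scale=0.3, size=(dim, dim)) for _ in range(n_layers)]
        self.heads = [rng.normal(size=(dim, n_classes)) for _ in range(n_layers)]
        self.threshold = threshold

    def predict(self, x):
        h = x
        last = len(self.layers) - 1
        for depth, (W, head) in enumerate(zip(self.layers, self.heads)):
            h = np.tanh(h @ W)          # one more layer of the shared trunk
            probs = softmax(h @ head)   # this depth's intent prediction
            if probs.max() >= self.threshold or depth == last:
                return int(probs.argmax()), depth  # exit as soon as confident

model = EarlyExitClassifier(dim=16, n_classes=4, n_layers=6, threshold=0.5)
label, exit_depth = model.predict(rng.normal(size=16))
```

Easy utterances exit at a shallow depth and pay only a fraction of the full network's latency; the threshold trades accuracy against response time.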
Chan Zuckerberg Initiative (CZI) 1: Universal biomedical sentence embeddings
Ao Liu, Kushagra Pundeer, Rohan Gangaraju, Surya Teja Devarakonda
Abstract: The motivation to develop pre-trained sentence embeddings has increased considerably, especially in the biomedical domain, due to their significance in many bio-NLP tasks such as sentence similarity and text classification. Word representations trained on a large dataset are effective on a large number of NLP tasks. However, with word embeddings alone, models still have to learn the relationships between words in the context of a sentence. Pre-trained sentence embeddings avoid this by allowing models to leverage existing contextual knowledge, improving performance even in low-resource settings. We aim to develop a universal biomedical sentence embedding model to facilitate various bio-NLP tasks and accelerate research in the biomedical domain. Given the effectiveness of multi-task learning in developing sentence embeddings for open-domain NLP, as demonstrated by works such as GenSen (Subramanian et al., 2018) and the Universal Sentence Encoder (Cer et al., 2018), we investigate the effectiveness of multi-task learning models for the biomedical domain.
Chan Zuckerberg Initiative (CZI) 2: Clustering biomedical citation graphs
Abhishek Mandal, Ajay Venkitaraman, Shyla Gangwar, Sreeparna Mukherjee
Abstract: The Chan Zuckerberg Initiative is invested in accelerating biomedical research and holds that the publication of preprints will facilitate research. Motivated by this, we use the CZI biomedical database to find communities of journal papers and preprints and investigate the interactions between them. These communities are of two types: homogeneous (containing only preprints or only papers) and heterogeneous (containing both). A heterogeneous cluster thus indicates interaction between papers and preprints. We also examine how these communities interact with each other and analyze the concepts shared among them. We observe that the majority of our communities are heterogeneous, sharing concepts and methodologies across community boundaries. We further examine the citation behavior of both preprints and papers and report how such patterns reflect on their impact. Using graph density, clustering coefficient, maximum betweenness, and network constraint, we determine whether a citer is idiosyncratic, within-community, or a broker between multiple communities, and compare this behavior between preprints and journal papers.
Google: Probabilistic Embeddings on Taxonomies
Abhishek Singh, Ninad Khargonkar, Yicheng Shao, Anubha Thandlay
Abstract: Natural language processing models typically employ word embedding methods for the efficient representation of the entities under consideration. The objective of an embedding model is to organize symbolic objects so that their similarity or distance in the embedding space reflects their semantic similarity. This mapping from semantics to geometric representations can express notions of hierarchy, entailment, and transitive relations. Common embedding methods like Word2vec assign a single point vector, which is ill-suited to capturing the structured nature of the data. This is the main motivation for learning richer geometric representations. In our project, we are interested in models that learn embeddings for complex relations either through a structured geometric representation or through learning in a non-Euclidean space. We will explore the probabilistic Box Lattice model with soft edges and Poincaré embeddings. We add implicit and explicit taxonomies to the training data and investigate whether such taxonomies over genres help in learning a better embedding.
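To make the non-Euclidean part concrete: Poincaré embeddings live in the open unit ball, where the standard geodesic distance (Nickel and Kiela, 2017) stretches rapidly near the boundary, giving the space exponentially more room for the leaves of a hierarchy. A small sketch with hypothetical 2-D embeddings:

```python
import numpy as np

def poincare_distance(u, v):
    """Geodesic distance between two points in the Poincaré ball:
    d(u, v) = arccosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))."""
    sq_dist = np.sum((u - v) ** 2)
    denom = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2))
    return np.arccosh(1 + 2 * sq_dist / denom)

# Hypothetical points: a root concept near the origin, a mid-level concept,
# and a leaf pushed toward the boundary of the ball.
root = np.zeros(2)
mid = np.array([0.5, 0.0])
leaf = np.array([0.95, 0.0])

# Euclidean distance from root only doubles (0.5 -> 0.95 is < 2x),
# but the hyperbolic distance grows much faster near the boundary.
d_mid = poincare_distance(root, mid)
d_leaf = poincare_distance(root, leaf)
```

This distortion is what lets a taxonomy's broad top levels sit near the origin while fine-grained leaves fan out near the boundary without crowding.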
IBM Research 1: Fine-tuning BERT for AMR Parsing
Abhirup Mukherjee, David Ter-Ovanesyan, Thomas Lam, Jaskaran Singh
Abstract: Abstract Meaning Representation (AMR) attempts to represent the meaning of a sentence with a single rooted directed graph. For this project, we propose to use the state-of-the-art BERT model (Devlin et al., 2018) to improve current results on AMR parsing. We replace the word embeddings used by Lyu and Titov (2018) with contextual BERT representations, which, we hypothesize, will improve on the current benchmark SMatch score of 74.4 set by Lyu and Titov (2018). We will also perform experiments with various layers and configurations of BERT representations for this task.
IBM Research 2: Analysis of Methods and Types for Complex Question Answering
Hang Liu, Keshav Seth, Wei Xie, Yi-Pei Chen
Abstract: The AI2 Reasoning Challenge (ARC) is a multiple-choice question answering dataset that requires more reasoning than simple string matching to select the correct answer. The classic approach (Sun et al., 2018; Ni et al., 2018; Musa et al., 2018) selects supporting texts from a large corpus using an information retrieval (IR) system, then relies on a reading comprehension model to read through all supporting texts and choose the correct answer. Previous work suggests that information retrieval may be a bottleneck, showing a 42% improvement with human-selected supporting texts over those retrieved by the IR system. We therefore focus on query reformulation as a way to improve the quality of retrieved supporting texts. We propose three ways to tackle the query reformulation problem: symbolic methods, an embedding approach, and reinforcement learning. At this time we have not seen an improvement from these approaches; however, the embedding model looks promising. Future work will concentrate on tuning and adapting the embedding approach to improve these results.
Scripps Research: Improving disease and phenotype NER by improving crowd-sourced annotations
Aditya Ashwinikumar Sathe, Akshita Bhagia, Rohan Paul, Salar Satti
Abstract: Named Entity Recognition (NER) in the biomedical domain suffers from a lack of expert-annotated gold standard datasets. One way to improve NER is to crowd-source annotations from citizen scientists, but different annotators follow different guidelines to label the data, resulting in weaker NER. This project aims to improve systems for detecting diseases and phenotypes in biomedical text through improving the data itself, by aligning the crowd-sourced annotations with expert guidelines using probabilistic graphical models.
IBM Research 3: MultiHop Question Answering Systems
Abhishek Singhal, Ameya Godbole, Dilip Kavarthapu, Zhiyu Gong
Abstract: Multihop reasoning, as the name suggests, requires finding answers by accumulating evidence over multiple documents. HotPotQA is a recent dataset created specifically with multihop reasoning in mind. The dataset comes with benchmarks on the full-wiki setting, which requires reasoning over the entire corpus of Wikipedia abstracts. Making this tractable requires efficient document retrieval. We propose a retriever that searches the corpus and follows inter-document links to efficiently collect relevant documents. Our system works better than a direct text-based search engine when evaluated using the recall of the retrieved documents.
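The two-stage retrieval idea can be sketched on a toy corpus (the documents, scoring function, and hop budget below are illustrative stand-ins, not the proposed system): seed a candidate set with a text match, then expand it along the hyperlink graph so that "bridge" documents become reachable.

```python
def retrieve(question, corpus, links, k_seed=2, hops=1):
    """Toy two-stage retriever: keyword-overlap seeding, then expansion
    along inter-document links (one hop by default)."""
    q_terms = set(question.lower().split())

    def overlap(doc_id):
        return len(q_terms & set(corpus[doc_id].lower().split()))

    seeds = sorted(corpus, key=overlap, reverse=True)[:k_seed]
    retrieved = set(seeds)
    frontier = list(seeds)
    for _ in range(hops):
        # Follow hyperlinks out of the current frontier to pick up
        # documents a pure text search would miss.
        frontier = [nbr for d in frontier for nbr in links.get(d, [])
                    if nbr not in retrieved]
        retrieved.update(frontier)
    return retrieved

corpus = {
    "scott_derrickson": "Scott Derrickson is an American director born in 1966.",
    "ed_wood": "Ed Wood was an American filmmaker and actor.",
    "doctor_strange": "Doctor Strange is a 2016 film directed by Scott Derrickson.",
}
links = {"doctor_strange": ["scott_derrickson"], "scott_derrickson": ["doctor_strange"]}
docs = retrieve("Were Scott Derrickson and Ed Wood both American?", corpus, links)
```

A real system would replace the keyword overlap with a proper ranking function, but the structure is the same: recall comes from combining lexical seeding with graph expansion.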
Lexalytics: Abstractive Summarization of Scientific Documents
Carlos Daniel Mondragon Chapa, Rajvi Kapadia, Soumya Saha, Shikhar Sharma
Abstract: Neural abstractive text summarization is an active field of research in Natural Language Processing. Our work focuses on the summarization of longer-form documents, for which we have found scientific research papers to be a good fit, as they have a discourse structure and come with a reference summary, i.e., the abstract. Moreover, certain sections capture the main theme of a paper better than others. Our model uses a section-level attention mechanism to leverage this sectional structure, and sentence-level embeddings to feed more of the document into the model. We also aim to carry out a comparative analysis of different ways of encoding and feeding input to the prediction network, the results of which can help us understand which input-encoding approach is best suited for summarizing long-form documents.
Microsoft Azure: Using Microsoft Azure as a Machine Learning Service
Deeksha Razdan, Katie House, Shanu Vashishtha, Cheng-Yin Eng
Abstract: The Azure Machine Learning service provides a cloud-based environment that data scientists can use to preprocess data; train, test, and deploy models; and track runs of machine learning experiments. Such a service allows a more efficient workflow and a consistent, scalable computing environment, and reduces the financial cost of purchasing and maintaining expensive hardware that might be used only occasionally. Since the service currently lacks comprehensive documentation for user tasks, our group's goals are to: (1) become acquainted with cloud computing concepts; and (2) produce user-friendly, end-to-end data science notebook tutorials, accessible to a general data science audience, that exploit the advantages of cloud computing. We will discuss our experience using Azure to complete a suite of common data science tasks and present Azure's performance in terms of speed, accuracy, and ease of open-source code compatibility.
Oracle Labs: Learning Graph Embeddings for Companies in Financial Data
Chen Pan, Jiayi Liu, Manisha Kumari Barnwal, Nikhil Adhe
Abstract: SEC filings are public documents, required by the US government of all publicly-traded companies, that contain rich textual information about corporate actors and their relations to one another. Extracting knowledge about public companies is an interesting task with real-world applications: analyzing industry structure and company relations supports better management decisions. In this project, we build a network graph from SEC filings and implement methods to learn text embeddings and network embeddings for companies that integrate both the unstructured natural language and the structured graph data contained in the filings (as well as other textual data sources). We evaluate the utility of these representations using link prediction.
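A common way to run such a link-prediction evaluation is to score candidate edges by the similarity of the two endpoint embeddings and check how highly the true partners rank. A minimal sketch (the company names are fictional and the embeddings are random placeholders for learned ones):

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder 8-dimensional company embeddings; in the real setting these
# would be learned jointly from filing text and the company graph.
emb = {name: rng.normal(size=8)
       for name in ["acme", "globex", "initech", "umbrella"]}

def link_score(a, b):
    """Score a candidate company-company edge by embedding dot product."""
    return float(emb[a] @ emb[b])

def rank_candidates(company, candidates):
    """Rank candidate partners for a held-out edge, highest score first."""
    return sorted(candidates, key=lambda c: link_score(company, c), reverse=True)

ranking = rank_candidates("acme", ["globex", "initech", "umbrella"])
```

Evaluation then reduces to metrics like hits@k or mean reciprocal rank over held-out edges: the better the embeddings integrate text and graph structure, the higher the true counterparties rank.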
Quantiphi Inc: Modeling Disease Progression Using Intensive Care Unit (ICU) Data
Monica Munnangi, Sai Saran Chintha, Subhajit Naskar, Trang Le
Abstract: With the widespread digitization of clinical data, predictive modeling in the medical field has received increased emphasis. Advances in deep learning have pushed the state of the art in clinical outcome prediction tasks. Clinical data is rife with inconsistencies and missing values arising from a variety of medical reasons, and predicting rare events, such as the occurrence of a condition or the worsening of a fatal disease, is difficult precisely because of their rarity in the population. In this project, we focus on sepsis, a life-threatening condition caused by the body's response to infection that can quickly turn for the worse and become fatal. We analyze ICU data, with its irregular missing values, from 40,000 patients to predict the onset of sepsis at least 6 hours in advance, using statistical and deep learning methods to model the multivariate time series.
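Handling the irregular missing values is a prerequisite for any time-series model here. One standard baseline (a generic illustration, not necessarily the project's chosen method) is forward filling, which carries each patient's last observed measurement across the gaps:

```python
def forward_fill(series):
    """Carry the last observed value forward across missing entries (None).
    Leading entries stay None until the first observation arrives."""
    filled, last = [], None
    for v in series:
        if v is not None:
            last = v
        filled.append(last)
    return filled

# Hypothetical hourly heart-rate readings with gaps, as in irregular ICU charts
hourly_hr = forward_fill([None, 98, None, None, 102, None])
```

This matches the clinical reality that vitals are charted sporadically; richer approaches additionally feed the model a missingness mask and time-since-last-observation so it can learn from the gaps themselves.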
American Institutes for Research (AIR): Unsupervised Outlier and Missing Data Detection
Aswin Kolli, Gourav Saha, Robin Wu, Shruti Jalan
Abstract: Data collected from surveys is prone to human error that may manifest as outliers or missing data. In this paper, we propose techniques to identify outliers in the Civil Rights Data Collection (CRDC). We handle missing data through imputation techniques, and we propose to evaluate these models using synthetic outliers and a distance metric.
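A minimal sketch of the two steps combined (mean imputation plus a z-score outlier flag; the data and threshold are hypothetical, and the paper's actual techniques may be more sophisticated):

```python
import statistics

def impute_and_flag(values, z_thresh=3.0):
    """Mean-impute missing entries (None), then flag outliers whose
    distance from the mean exceeds z_thresh standard deviations."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    sd = statistics.stdev(observed)
    imputed = [mean if v is None else v for v in values]
    outliers = [i for i, v in enumerate(imputed)
                if abs(v - mean) / sd > z_thresh]
    return imputed, outliers

# Hypothetical survey responses: one missing value and one likely entry error
imputed, outliers = impute_and_flag([10, 12, None, 11, 300, 9], z_thresh=1.5)
```

Injecting synthetic outliers into clean columns and checking whether the flagging step recovers them is exactly the kind of evaluation the abstract proposes; the z-score here is the simplest possible distance metric.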