Knowledge Extraction and Inference from Text

(AM Tutorial #1 at CIKM 2017)


Soumen Chakrabarti, IIT Bombay
Partha Talukdar, IISc Bangalore


Systems for structured knowledge extraction and inference have made giant strides in the last decade.  Having started from shallow linguistic tagging and coarse-grained recognition of named entities at the resolution of people, places, organizations, and times, modern systems now link billions of pages of unstructured text with knowledge graphs that have hundreds of millions of entities, belonging to tens of thousands of types and related by tens of thousands of relations.  Via deep learning, systems build continuous representations of words, entities, types, and relations, and use these to continually discover new facts to add to the knowledge graph and to support search systems that go far beyond page-level "ten blue links". We will present a comprehensive catalog of the best practices in traditional and deep knowledge extraction and inference, trace their development and interrelationships, and point out various loose ends.

Target audience

We will target new academic researchers and industrial practitioners. Attendees are expected to have some basic familiarity with text indexing and corpus statistics (tokenization, typical heavy-tailed vocabularies, TF-IDF).  They should also be largely familiar with undergraduate-level statistics (probability, distributions, divergence). Some elementary machine learning (clustering, regression, and classification; the basics of logistic regression and support vector machines) will also help.


Before 2007 or so, interest in knowledge representation was limited to researchers of symbolic AI, a section of Semantic Web enthusiasts, and builders of question answering (QA) systems.  Important machine learning techniques were being invented for front-end NLP tasks such as POS tagging, chunking, and named entity recognition (NER).  But it was only after Wikipedia became one of the largest and most reputed repositories of semi-structured knowledge, and Google purchased the Freebase knowledge graph (KG), that large-scale analysis of the implicit and explicit links between knowledge graphs and text corpora began to draw intense attention.  With the more recent triumph of deep learning, the pace of new developments has made it very difficult to keep track of best practices without a bird's-eye view of the field.  Our goal is to provide a 10-year historical perspective that can guide a researcher's or practitioner's choice of methods for a variety of extraction, inference, and search tasks.