A Data-Centric View for Composable Natural Language Processing


Empirical natural language processing (NLP) systems in application domains such as healthcare, finance, and education involve frequent manipulation of data and interoperation among multiple components, ranging from data ingestion, text retrieval, analysis, and generation to human interactions such as visualization and annotation. The diversity of the components in such complex systems makes it challenging to create standardized, robust, and reusable components.

In this talk, we present a data-centric view of NLP operations and tooling, which bridges different styles of software libraries and different user personas, and extends to additional infrastructure such as that for visualization and distributed training. We propose a highly universal data representation called DataPack, which builds on a flexible type ontology that is morphable and extendable to subsume the commonly used data formats in all known (and hopefully, future) NLP tasks, yet remains invariant as a software data structure that can be passed across any NLP building blocks. Based on this abstraction, we develop Forte, a Data-Centric Framework for Composable NLP Workflows, with rich in-house processors, standardized third-party API wrappers, and operation logic implemented at the right level of abstraction to facilitate rapid composition of sophisticated NLP solutions from heterogeneous components.

By defining and leveraging appropriate abstractions of NLP data, Forte aims to bridge silos and divergent efforts in NLP tool development and to bring good software engineering practices into NLP development, with the goal of helping NLP practitioners build robust NLP systems more efficiently.