Programming for Corpus Linguistics with Python and Dataframes

Daniel Keller - Western Kentucky University

Elements in Corpus Linguistics - Cambridge University Press

There is increasing recognition that programming is a valuable skill for corpus linguists. Despite this, few programming books are designed specifically around the unique challenges of corpus linguistics (CL). This volume offers an introduction to programming for CL in the Python language using dataframes. Dataframes provide a fast, efficient, and intuitive set of data structures and functions for working with large, complex datasets such as corpora. This book demonstrates principles of dataframe programming applied to CL analyses, along with complete algorithms for producing concordances, collocate lists, keyword lists, and lexical bundles, and for conducting key feature analysis. Additional algorithms for creating dataframe corpora are also presented, including methods for tokenizing, part-of-speech tagging, and lemmatizing. This book provides a set of core skills that can be applied to a wide range of CL research questions, as well as to original analyses not possible with existing corpus software.
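
To give a rough sense of what a dataframe corpus can look like, the sketch below stores a toy corpus with one token per row in pandas and builds a small keyword-in-context concordance and a frequency list from it. This is only an illustration under stated assumptions: the column names (`text_id`, `token`), the toy data, and the `concordance` helper are hypothetical, and they are not taken from the book or from the CORE download.

```python
import pandas as pd

# Hypothetical toy corpus: one row per token.
# The column names are assumptions, not the book's or CORE's format.
corpus = pd.DataFrame({
    "text_id": ["t1"] * 6 + ["t2"] * 5,
    "token": ["the", "cat", "sat", "on", "the", "mat",
              "a", "dog", "sat", "by", "me"],
})


def concordance(df, node, window=2):
    """Return one row per hit of `node` with left and right context.

    Context is limited to `window` tokens per side and never crosses
    a text boundary, because each text is processed separately.
    """
    rows = []
    for text_id, group in df.groupby("text_id", sort=False):
        tokens = group["token"].tolist()
        for i, tok in enumerate(tokens):
            if tok.lower() == node:
                rows.append({
                    "text_id": text_id,
                    "left": " ".join(tokens[max(0, i - window):i]),
                    "node": tok,
                    "right": " ".join(tokens[i + 1:i + 1 + window]),
                })
    return pd.DataFrame(rows, columns=["text_id", "left", "node", "right"])


# Concordance lines for the node word "sat".
hits = concordance(corpus, "sat")

# A simple frequency list over lowercased tokens.
freq = corpus["token"].str.lower().value_counts()

print(hits)
print(freq.head())
```

The explicit loop is used here for readability only; on a large corpus, a vectorized approach would likely be much faster.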

Download CORE

Download the dataframe version of the Corpus of Online Registers of English (CORE).

Additional Algorithms

Go beyond the algorithms in the book.

Questions?

Contact Daniel.Keller@wku.edu with questions or feedback.
