Lesson 6

Synopsis

Data rarely comes in a readily usable form. From the moment it is sourced to the end point when it is published or archived for long-term storage, it typically undergoes many stages of transformation. Throughout this life cycle, it is up to the data practitioners (creators, end users, and anyone who processes the data in any manner) to ensure its integrity by tightly controlling the four main outcomes of a transformative operation: transliteration, loss, augmentation, and corruption of information, all the while maintaining reversibility through judicious use of version control. This chapter presents an overview of the considerations that go into planning and managing data transformation processes, as well as recommended tools and best practices. Of the two main forms of linguistic data, textual and audiovisual, the chapter largely focuses on the former.

Core concepts & keywords

Transliteration: Converting data into an isomorphic counterpart, with no information gained or lost in the process.

Augmentation: Addition to a dataset from a secondary source.

Corruption: The unintended loss of meaningful information within a dataset.

Loss: The intentional trimming or simplification of immaterial information in a dataset, carried out to increase consistency or density, or to redact sensitive information.

Data Form: The presentational aspect of data, separate from its information content, such as character encoding (UTF-8, ASCII, etc.) and file format (CSV, XML, etc.).

Data Information Content: Meaningful distinctions within the data, not strictly its literal symbols, tokens, or values.

ANSI: The non-Unicode encoding system used by English and Western European language versions of Windows (code page Windows-1252).

UTF-8: A Unicode encoding scheme and a common standard, which uses variable-width characters with a minimum width of 8 bits (1 byte).

Locale: The set of system-wide parameters that defines localization settings such as language, country, and region.
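The variable width of UTF-8 can be observed directly by counting bytes. The following is a minimal sketch, assuming a Unix-like system where `printf` and `wc` are available:

```shell
# ASCII characters occupy a single byte in UTF-8.
printf 'A' | wc -c                # 1 byte

# Accented Latin characters such as 'é' (U+00E9) take two bytes.
printf '\303\251' | wc -c        # 2 bytes

# The euro sign '€' (U+20AC) takes three bytes.
printf '\342\202\254' | wc -c    # 3 bytes
```

Other Unicode encodings make different trade-offs; for example, UTF-16 uses 16-bit units, so the ASCII range is no longer one byte per character.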

Activities

Exercises - Practice what you've learned

  • Find an existing open access dataset online and examine it based on the information in this chapter. Which file type is the data accessible in? How is the data encoded? Are the version history and transformation information available alongside the data?

    • Download "Project Gutenberg Selections" from the NLTK Corpora page (http://www.nltk.org/nltk_data/). Unzip the zip file, and examine the included text files ('austen-emma.txt', 'shakespeare-caesar.txt', ...).

        • What encoding scheme do the files have? Is every file UTF-8? Find out using the 'file' command ('file *' will display results for all files in one go) or through a text editor program.

        • Four of the text files have the Windows style "CRLF" line ending. Which are they?

        • The file command reports 'milton-paradise.txt' as a 'data' file, not a plain text file. Is this correct?

        • Let's bring some consistency to this corpus: every file should have UTF-8 encoding with the Unix-style LF line ending. Apply the conversion to the appropriate files using either (1) command-line tools or (2) a text editor program such as Notepad++ or Atom.
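The inspection and conversion steps in the exercise above can be sketched on the command line. This is a minimal illustration, assuming a Unix-like environment with `file`, `iconv`, and `tr` available; it uses a made-up sample file (`sample.txt`) rather than the actual corpus files:

```shell
# Create a small Latin-1 (ISO-8859-1) sample file with Windows-style CRLF endings.
printf 'caf\351 au lait\r\nbonjour\r\n' > sample.txt

# 'file' reports both the encoding and the line-terminator style.
file sample.txt        # e.g. "ISO-8859 text, with CRLF line terminators"

# Convert to UTF-8, then drop the carriage returns for Unix-style LF endings.
iconv -f ISO-8859-1 -t UTF-8 sample.txt | tr -d '\r' > sample_utf8.txt

file sample_utf8.txt   # now reports UTF-8 text with no CRLF note
```

`iconv` will exit with an error if the input contains byte sequences invalid in the source encoding, which makes it a useful integrity check in itself.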

Implement these practices in your career

  • Practice converting an existing text file to one of the plain-text file formats that work for data exchange (see Table 6.1). You can manually export to a plain-text format through an application such as Word or Excel, or use a separate application for document conversion such as Pandoc (see section 6.1).

  • Try to convert an existing text file to UTF-8 with a text editor such as Notepad++ (see Figure 6.1). Then try converting a folder of text files to a different character encoding in the command line. See Figure 6.2 for examples of Unix command-line tools and commands, and see section 5.1 on command-line tools.

  • Explore new command-line tools, end-to-end processing tools/programming languages and version control systems to see which ones you'd like to implement in your new workflow. Try using a new tool recommended in this chapter.
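Batch conversion of a whole folder, as suggested above, can be sketched with a shell loop. The folder names (`texts/`, `texts_utf8/`) and the source encoding (Windows-1252) here are assumptions for illustration only:

```shell
# Set up a sample input folder with one Windows-1252 file (illustration only).
mkdir -p texts texts_utf8
printf 'r\351sum\351\r\n' > texts/note.txt

# Convert every .txt file to UTF-8 with Unix LF line endings,
# writing results to a separate folder so the originals stay intact.
for f in texts/*.txt; do
    iconv -f WINDOWS-1252 -t UTF-8 "$f" | tr -d '\r' > "texts_utf8/$(basename "$f")"
done
```

Writing converted files to a separate folder, rather than overwriting in place, preserves the reversibility that this chapter recommends; pairing this with version control gives a full record of the transformation.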

Quiz - Test yourself!

Related readings

Carroll, Stephanie Russo, Desi Rodriguez-Lonebear, and Andrew Martinez. 2019. Indigenous data governance: Strategies from United States Native Nations. Data Science Journal 18(1): 31. DOI: https://doi.org/10.5334/dsj-2019-031

Leonard, Wesley Y. 2018. Reflections on (de)colonialism in language documentation. In Bradley McDonnell, Andrea L. Berez-Kroeker, and Gary Holton (eds.), Reflections on language documentation 20 years after Himmelmann 1998, 55–65. Honolulu: University of Hawai‘i Press. http://hdl.handle.net/10125/24808

Linn, Mary S. 2014. Living archives: A community-based language archive model. Language Documentation and Description 12: 53–67. http://www.elpublishing.org/itempage/137

International Arctic Science Committee. 2013. IASC Data Statement. https://iasc.info/data-observations/iasc-data-statement.

Share your thoughts on this article or topic

Use #LingData #DataTransformation #LingDataManagement on your favorite social media platform!

About the author:

Na-Rae Han

Na-Rae Han is a Senior Lecturer at the Department of Linguistics, University of Pittsburgh, where she teaches computational linguistics and data science methods. She participated in multiple linguistic data and annotation projects throughout her career, many of which were published by the Linguistic Data Consortium (LDC).


Citations

Cite this chapter:

Han, Na-Rae. 2022. Transforming data. In The Open Handbook of Linguistic Data Management, edited by Andrea L. Berez-Kroeker, Bradley McDonnell, Eve Koller, and Lauren B. Collister, 73–88. Cambridge, MA: MIT Press Open. https://doi.org/10.7551/mitpress/12200.003.0010

Cite this online lesson:

Gabber, Shirley, Danielle Yarbrough, Andrea L. Berez-Kroeker, Bradley McDonnell, Eve Koller, Lauren B. Collister, and Na-Rae Han. 2022. "Lesson 6." Linguistic Data Management: Online companion course to The Open Handbook of Linguistic Data Management. Website: https://sites.google.com/hawaii.edu/linguisticdatamanagement/course-lessons/06-transforming-data [Date accessed].