Public papers of the Presidents
An exploration of language use in Presidential administrations, now with charts!
The corpus that initially sparked my interest in this textual analysis project was "The Public Papers of the Presidents", an initiative of the National Archives:
The Public Papers of the Presidents, which is compiled and published by the Office of the Federal Register, National Archives and Records Administration, began in 1957 in response to a recommendation of the National Historical Publications Commission. Noting the lack of uniform compilations of messages and papers of the Presidents before this time, the Commission recommended the establishment of an official series in which Presidential writings, addresses, and remarks of a public nature could be made available.
www.archives.gov/federal-register/publications/presidential-papers.html
Although the collection available online goes back to the Hoover administration - https://www.govinfo.gov/app/collection/ppp/ - I decided for reasons of both manageability and personal interest to work with the material from 1980 to the present, the latest available volume being 2014. This proved to be a practical decision, as the corpus is sizeable and the amount of both processing and preprocessing was enough to be somewhat of a burden even with only 68 volumes to work with.
ABOUT THE CORPUS : Each volume, with a couple of exceptions, collects about half a year's worth of written communication and transcriptions of both formal speeches and less formal interactions with the public and press by the sitting President. The main section of each, "Public Papers", runs roughly 700-1000 pages. Each volume also contains appendices of more formal documents like official press releases and nominations of federal officials; these are much shorter, a hundred or so pages at most.
Table of Contents of the second volume of 2013
To get a general sense of the material in each volume, I picked three books at random and skimmed the main section, noting the general character of each segment. No individual "chapter" was longer than 8 pages, and many were only a few paragraphs. I ended up sorting them into three groups, although I didn't mark or record this information anywhere:
Speeches : transcriptions of public speeches or statements made by the POTUS, in a variety of settings and on a variety of topics. These comprised about 35% of the "chapters", but they also tended to be the longest, so I estimate that this group accounted for about 50% of the text.
Statements : these were official communications that were originally delivered as text, rather than being transcriptions, although they use semiformal language and many were probably read out loud by a press secretary at some point. Most of them are policy statements or announcements of an action taken by a branch of the administration. These comprised about 50% of the "chapters".
Letters : these are written communications addressed to other branches of government, generally Congress, rather than the public. They cover similar topics to the "Statements" but tend to use more formal language. About 15% of the "chapters".
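The per-category estimates above can be computed mechanically once each chapter has a label and a page count. This is a hypothetical sketch, not code I actually ran for the skim; the category names and sample numbers are illustrative assumptions.

```python
from collections import defaultdict

def category_shares(chapters):
    """chapters: list of (category, pages) tuples.
    Returns, per category, (share of chapter count, share of total pages)."""
    counts = defaultdict(int)
    pages = defaultdict(int)
    for cat, n in chapters:
        counts[cat] += 1
        pages[cat] += n
    total_ch = len(chapters)
    total_pg = sum(pages.values())
    return {cat: (counts[cat] / total_ch, pages[cat] / total_pg)
            for cat in counts}

# Illustrative sample only -- real labels would come from skimming.
sample = [("speech", 8), ("statement", 2), ("speech", 6),
          ("letter", 3), ("statement", 1)]
shares = category_shares(sample)
```

The point of tracking both ratios is the asymmetry noted above: speeches can be a minority of chapters while still dominating the page count.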
PREPROCESSING THE TEXT: To prepare for analysis, I did some normalizing of the data. Although most years had two volumes, divided sometime in June or July, 1981 had only one volume and 2000 had three. I split 1981 into two volumes and combined parts 2 and 3 of 2000 into one; the resulting volumes were about 50% smaller (for 1981) and 70% larger (for 2000) than the average volume in the corpus, but I decided this would not significantly distort my results, since the volumes already varied considerably in size with no apparent pattern.
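A minimal sketch of that normalization step, assuming the volumes are already plain-text files; the file names here are my own invention, not the corpus's actual naming scheme.

```python
from pathlib import Path

def split_in_two(path, out1, out2):
    """Split a single-volume year (like 1981) into two roughly equal halves,
    cutting at the nearest paragraph boundary past the midpoint."""
    text = Path(path).read_text(encoding="utf-8")
    mid = len(text) // 2
    cut = text.find("\n\n", mid)
    if cut == -1:
        cut = mid  # no paragraph break found; fall back to a raw midpoint cut
    Path(out1).write_text(text[:cut], encoding="utf-8")
    Path(out2).write_text(text[cut:], encoding="utf-8")

def merge_volumes(paths, out):
    """Concatenate extra part-volumes (like 2000 parts 2 and 3) into one file."""
    merged = "\n\n".join(Path(p).read_text(encoding="utf-8") for p in paths)
    Path(out).write_text(merged, encoding="utf-8")
```

Splitting by character count rather than by date is a simplification; a more faithful split would locate the June/July boundary in the text.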
Since I was dealing with such a large quantity of text, I decided that working with raw text files rather than PDFs would save processing time, so I converted all the volumes to text. I was disappointed to find that the available tools for this were limited and not very powerful. Most of these PDFs use a two-column page layout, and none of the conversion software I found could reconstruct words divided by hyphens at a column break, so the output text contains a high number of malformed partial words. Several products I tried, including Adobe Acrobat, could not handle the conversion at all; I suspect the files may have been encoded with an older version of the PDF standard. After some trial and error, I settled on the open-source PDF reader Okular, opening each document and using "Save As Plain Text". Since the malformations caused by the two-column format were evenly distributed throughout the documents, it seemed reasonable to assume they would not affect the aggregate results of analysis.
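One cleanup pass that can recover some of the hyphenation damage is to rejoin any word split by a trailing hyphen at a line break. This is a hedged sketch, not what I actually did: it is an approximation, and as the caveat in the comment notes, it will also incorrectly merge legitimately hyphenated compounds that happen to break at a line end.

```python
import re

def rejoin_hyphenated(text):
    # Join "adminis-\ntration" back into "administration".
    # Caveat: "two-\npart" also becomes "twopart", since we cannot tell
    # a column-break hyphen from a real compound hyphen without a dictionary.
    return re.sub(r"(\w+)-\n(\w+)", r"\1\2", text)
```

A more careful version might only merge when the joined form appears elsewhere in the corpus, or check the candidate against a word list.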