Principal Component Analysis Introduction
Principal Component Analysis (PCA) is a multivariate statistical method used to explore the variance structure of a dataset. The direction in which the data vary the most is defined as the "first principal component", and each subsequent component captures the largest remaining variance in a direction orthogonal to the components before it. These directions can be written as linear combinations of the original variables. See the graph at right for a simple illustration. The original variables are x1 and x2. The direction in which the data have the largest variance is identified as principal component one (PC1), represented by the blue line and defined as 0.78x1 + 0.62x2. The coefficients of the variables in a component are referred to collectively as the "loadings" of the corresponding variables.
The number of principal components that can be derived equals the number of variables in the data. In practice, however, PCA is often used as a dimension reduction method: only the first several principal components are retained to form a lower-dimensional structure, and the data are re-expressed in terms of those components. To interpret a principal component, it is common to consider only the variables whose loadings have high absolute values.
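The ideas above (directions of greatest variance, loadings, and dimension reduction) can be sketched in a few lines of Python with NumPy. This is a minimal illustration on synthetic two-variable data, not the data behind the figure; the sample size, the coefficients used to generate x2, and all variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-variable data (x1, x2) with strong positive correlation.
# Illustrative values only; these are not the data shown in the figure.
n = 500
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.3 * rng.normal(size=n)
X = np.column_stack([x1, x2])

# Center the variables, then eigendecompose the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order

# Reorder so that PC1 (largest variance) comes first.
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
loadings = eigvecs[:, order]  # column j holds the loadings of component j + 1

# Share of the total variance carried by each component.
explained = eigvals / eigvals.sum()

# Dimension reduction: keep only PC1 and represent each observation by its score.
pc1_scores = Xc @ loadings[:, 0]

# Note: the sign of a loading vector is arbitrary; v and -v describe the same direction.
print("PC1 loadings:", loadings[:, 0])
print("explained variance ratio:", explained)
```

With two variables there are exactly two components, and here the first one carries nearly all of the variance, which is the situation in which keeping only PC1 loses little information.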
Step 1. Run DocuScope. Tag the corpora of speeches with DocuScope tags.
Step 2. Run PCA. We performed principal component analysis (PCA) on the tagged texts. The result of this step was to reduce DocuScope’s 28 categories (variables) to a smaller number of internally correlated super-variables (called principal components) that offer an easier-to-interpret lens on the texts under study. We thought of each extracted principal component as a “first draft” of a rhetorical profile: extracted, but not yet refined (tested and interpretively filtered) by a human reader.
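As a rough sketch of what Step 2 involves computationally, the Python snippet below reduces a texts-by-28-categories matrix to a handful of component scores. The input matrix is filled with random placeholder values, and the standardization step, the number of retained components (k = 5), and the number of high-loading variables listed are all assumptions for illustration; the text does not specify these choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical input: 200 texts x 28 category frequencies (random placeholders).
n_texts, n_categories = 200, 28
freq = rng.random((n_texts, n_categories))

# Standardize each category. This is an assumption: the source does not say
# whether the variables were scaled before PCA.
Z = (freq - freq.mean(axis=0)) / freq.std(axis=0, ddof=1)

# PCA via singular value decomposition of the standardized matrix.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
loadings = Vt.T                     # shape (28, 28): one column per component
explained = S**2 / np.sum(S**2)     # variance share per component, descending

# Keep the first few components as the reduced set of "super-variables".
k = 5                               # illustrative choice, not from the source
scores = Z @ loadings[:, :k]        # shape (200, 5): one score per text per component

# Variables with the largest absolute loadings on PC1 drive its interpretation.
top_vars = np.argsort(np.abs(loadings[:, 0]))[::-1][:6]
print(scores.shape, top_vars)
```

Each column of `scores` gives every text a single number per component, which is what later makes it possible to rank texts by how strongly they express that component.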
Step 3. Use Close Reading to Test the Capacity of the Principal Component to Guide the Interpretative Process. The third step was to closely read the speeches on which the principal component “scored highest.” This involved reading as many as 20 speeches, and sometimes more, per component: those in which the language variables most defining of the principal component appeared in the highest concentrations in the corpus. The outcome of this step was a set of filtering decisions about which variables (i.e., language categories) in the reduced set available from PCA actually guided our interpretation and which seemed peripheral (sometimes guiding but not always) or even extraneous (never guiding).
As a result of our close reading and our evaluation of each variable’s utility, we assigned every variable of the principal component to one of three categories: (1) Core, (2) Non-Core, and (3) Extraneous. Core variables were, with few exceptions, integral to the profile regardless of speaker and most central to our interpreting and naming of the rhetorical profile. Non-Core variables appear in a given profile contingently, and their contribution to the overall strategy is conditioned by context. Non-Core variables thus help define some strategies underlying some speeches used by some speakers in some situations, but they are not thought to be robustly explanatory of the profile across contexts.
As long as the core variables remained constant, we treated the rhetorical profile as the same across contexts. This judgment of constancy held even when the non-core variables accompanying the profile’s core changed. Because of this judgment, we can and will see in later chapters the same profile recurring across principal components, though with changes mostly in its non-core variables. Our analysis in each chapter illustrates how the non-core variables of a rhetorical profile help account for its plasticity of rhetorical function across contexts.
Extraneous variables are those in the principal component that we never relied on in our close reading, and we therefore eliminated them (e.g., elaborative language commonplace to many political speeches). The rhetorical profile emerged from this selective filtering of the principal component.
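The bookkeeping behind Step 3 can be sketched as follows: select the highest-scoring speeches for close reading, then record the reader's label for each variable and keep only the non-extraneous ones. The close reading itself is an interpretive act that code cannot perform; the speech identifiers, scores, variable names, and labels below are all invented placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical component scores for 300 speeches (placeholders, not real results).
speech_ids = [f"speech_{i:03d}" for i in range(300)]
scores = rng.normal(size=300)

# Pick the 20 highest-scoring speeches on this component for close reading.
top_n = 20
top_idx = np.argsort(scores)[::-1][:top_n]
reading_list = [speech_ids[i] for i in top_idx]

# After close reading, each variable of the component receives one label.
# The variable names and labels here are invented for illustration.
assignment = {
    "var_a": "core",
    "var_b": "core",
    "var_c": "non-core",
    "var_d": "extraneous",
}

# The rhetorical profile keeps core and non-core variables and drops the rest.
profile_vars = [v for v, label in assignment.items() if label != "extraneous"]
print(reading_list[:3], profile_vars)
```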
Step 4. Calculate the Consequences of Eliminating the Extraneous Variables from the Principal Component. The possibility always exists that we distorted the principal component by dropping the “extraneous” variables. Eliminating variables leaves us open to the concern that a rhetorical profile discards the original principal component more than it refines it. To address this concern, we devised a statistical check on the consequences of dropping variables from the principal component when forming rhetorical profiles. The method measures how closely the ranking of the highest-scoring texts generated by the rhetorical profile matches the ranking generated by the original principal component. The closer the match, the stronger the evidence that our rhetorical profiles are simply reader-filtered refinements of the original principal component and nothing more distortive.
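One way such a ranking comparison could be computed is sketched below. The text does not name the statistic used, so the two measures here (Spearman rank correlation over all texts and overlap among the top 20) are assumptions chosen for illustration, as is the idea of forming the profile by setting the extraneous variables' loadings to zero. The data, loadings, and extraneous indices are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical standardized data: 200 texts x 28 variables, plus one loading vector.
Z = rng.normal(size=(200, 28))
loadings = rng.normal(size=28)

# Suppose close reading marked these variable indices as extraneous (invented).
extraneous = [3, 11, 19, 27]
profile_loadings = loadings.copy()
profile_loadings[extraneous] = 0.0  # assumption: dropping a variable = zero loading

pc_scores = Z @ loadings                 # scores under the original component
profile_scores = Z @ profile_loadings    # scores under the filtered profile

def ranks(values):
    """Rank positions (0 = lowest). Ties are not averaged, which is
    acceptable here because the scores are continuous."""
    order = np.argsort(values)
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(len(values))
    return r

# Spearman rank correlation = Pearson correlation of the two rank vectors.
spearman = np.corrcoef(ranks(pc_scores), ranks(profile_scores))[0, 1]

# Share of texts that appear in the top k under both scorings.
k = 20
top_pc = set(np.argsort(pc_scores)[::-1][:k])
top_profile = set(np.argsort(profile_scores)[::-1][:k])
overlap = len(top_pc & top_profile) / k

print(f"Spearman rho = {spearman:.3f}, top-{k} overlap = {overlap:.2f}")
```

Under this sketch, a correlation and an overlap near 1 would indicate that the filtered profile ranks texts almost exactly as the original component does; values well below 1 would suggest that the dropped variables were doing real work.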