What is a variable? To summarize up front, a variable is a concept or a measure of something that varies. This video discusses variables in two ways: what their function in research is (cause and effect as variables) and what kind of information they describe (categorical vs. continuous data).
We think about variables in a variety of ways in political science: it might be a letter in an equation, a column in a dataset, or more generally a measure of a concept. We’ll use variables in all these ways, although our focus for this first month is on the concept itself. In this video, I will talk about the different functions of variables (cause and effect, for example), but also about the type of data we find in variables.
Let’s start with what’s most important: variables as concepts. Looking at ideas in a scientific way means being clear about what we mean by that idea and how it relates to other things. Our primary focus is going to be on causal relationships, so we can use the word variable to describe a cause (independent variable is the technical term, but there are lots of other words for it) and an effect (the dependent variable because it depends on change in the cause). But as we think about cause and effect, we often talk about other variables as well.
For example, let’s look at the theory of the democratic peace. This is the rare empirical truth in political science: no two democracies have ever gone to war with each other. The cause, two democracies, and the effect, peace, are clear, but no one really understands how democracy leads to peace. One theory argues that liberal ideas shape political thought and political institutions in a democracy, making the idea of attacking another democratic state problematic and creating constraints that keep leaders from doing so. In this case, the cause is liberal ideas, but we need to define the intermediate steps, or intervening variables (like institutions and constraints), in order to make the relationship with the effect (peace) clear.
Another way we think about variables is in establishing causality. There are many different reasons why a country may become more democratic. Modernization theory holds that economic development is the cause of transitions to democracy. But to show that is true, you have to rule out all other possible explanations of democratic transitions. Here I have written one alternate explanation, regional influence for or against democracy, which we call a control variable in this context. Showing that development causes democracy means showing that other explanations do not.
This is particularly important because in real life, it is often hard to tell what is a cause and what is an effect in politics. For example, in their book Why Nations Fail, Acemoglu and Robinson try to explain why some countries end up in virtuous cycles – they are rich, stable, and democratic – while other countries are mired in poverty, high corruption, and political instability. In these cycles, it is hard to determine if a country is poor because it has had a civil war or if it had a civil war because it is poor. And we need to identify what the cause is if we want to change it and improve the outcome.
What’s most essential to remember is the most obvious part: variables vary. You can’t have a variable that doesn’t change, because if a cause doesn’t change, we can’t see if it made an effect change too. Sometimes this isn’t very obvious. For example, many scholars are interested in studying the impact of ethnic diversity on economic growth. The challenge is that the ethnic makeup of most places is pretty consistent over time. It will change over the long term, but the difference from one year to another in the ethnic makeup of the United States is much smaller than the difference between a diverse country like the United States (which changes faster than most countries) and a country with a single dominant ethnicity like South Korea. So if you want to use diversity as a variable, you either need data over a long time frame for a single location or you need data for many locations (in the U.S. or internationally).
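To make that point concrete, here is a minimal Python sketch. The diversity scores are invented purely for illustration (they are not real measurements), and the names `within_country`, `across_countries`, and `constant` are hypothetical. It only shows that a measure that barely moves from year to year gives you very little variation to work with, while the same measure compared across places gives you much more.

```python
from statistics import pvariance

# Hypothetical diversity index (0 to 1) for ONE country over five years.
# Invented numbers: the only point is that they barely change.
within_country = [0.50, 0.50, 0.51, 0.51, 0.52]

# Hypothetical diversity index for FIVE different countries in a single year.
across_countries = [0.05, 0.20, 0.50, 0.65, 0.80]

# A "variable" that never changes has zero variance, so it cannot
# help explain why an effect changes.
constant = [0.50, 0.50, 0.50, 0.50, 0.50]

var_within = pvariance(within_country)
var_across = pvariance(across_countries)
var_constant = pvariance(constant)

print(f"variance within one country over time: {var_within:.5f}")
print(f"variance across countries:             {var_across:.5f}")
print(f"variance of a constant:                {var_constant:.5f}")
```

With made-up numbers like these, the variance across countries is far larger than the variance within one country, which is why comparing many locations is often the more practical design.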
When you hear the word variable, though, you probably think of equations and numerical data, and don’t worry, we will be using those later. In this context, I’d like you to focus on two things. First is the distinction between variables, observations, and values. Let’s start with observations. These are the people or places we are studying – an observation can be an individual response on a survey, it can be a state within the U.S., it can be a city, it can be a country. But it is the unit we are analyzing. Two common phrases you will hear are the “N” of a study, which just means the number of observations, and “time series data,” which means that we are following at least one unit over time. Time-series cross-sectional data, which varies over both time and unit of analysis, is also called panel data. Variables, then, are the topics we are studying about those people or places. And values are the data itself.
Let me use this example of what is called “Nominate” data. This is a score that some scholars came up with that describes the ideology of a member of Congress based on all of their roll call votes when they were in office. Data are by session of Congress – this 92nd session was from 1971-1972 – that’s the time frame, and by member of Congress – that is the individual being studied. So an observation here is for a member of Congress for the 92nd Congress. The variables include the state that person is from, their political party, and the ideology score itself. I’ve highlighted state as a variable. Note that the observation here is about the individual, not about the state (the place) in this case. What you actually study varies depending on your research. And the value is what the variable is for that observation. So the variable varies here – it could be any state. But for this particular person, the value is Virginia. In a spreadsheet or dataset like this, observations are the rows, variables are the columns, and values are the data itself.
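If it helps to see that rows-and-columns idea outside of a spreadsheet, here is a small Python sketch. It is only an illustration: the member names and ideology scores below are placeholders I made up, not real Nominate values, and the column names are assumptions about how such a table might be labeled.

```python
# Each dictionary is one observation (a row): one member in one Congress.
# Member names and ideology scores are placeholders, NOT real Nominate data.
rows = [
    {"congress": 92, "member": "Member A", "state": "Virginia", "party": "D", "ideology": -0.10},
    {"congress": 92, "member": "Member B", "state": "Ohio", "party": "R", "ideology": 0.25},
    {"congress": 92, "member": "Member C", "state": "Virginia", "party": "R", "ideology": 0.30},
]

n = len(rows)                     # the "N" of the study: the number of observations
variables = list(rows[0].keys())  # the variables are the columns
first_state = rows[0]["state"]    # a value: what "state" is for the first observation

# The variable "state" varies across observations; these are the values it takes here.
states_seen = sorted({row["state"] for row in rows})

print("N =", n)
print("variables:", variables)
print("state for the first observation:", first_state)
print("values of the state variable:", states_seen)
```

Notice that the unit of analysis is still the member, even though one of the columns describes a place.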
The last thing I wanted to discuss is the way that numerical variables can vary. There are two main types of numerical variables: categorical and continuous. Categorical variables have categories. A variable with two categories (like a yes/no answer to a question) is called a binary (or sometimes a dummy) variable. Note that I just called a yes/no variable a numerical variable. That’s because we typically will write down a 1 for a yes and a 0 for a no answer if we want to use it in analysis. We’ll give numbers to all the categories in a variable to make them easier to use. Anyway, the other main type of categorical variable to remember is an ordinal or ordered categorical variable. Some variables (like a person’s religion) have categories but you can’t rank them into any logical order. Ordinal variables can be ranked. For example, if the variable is a person’s level of education, the lowest would be elementary, then high school, then college graduate, and so on.
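Here is one way that numbering might look in Python. This is a sketch under my own assumptions: the survey answers are invented, the names `EDUCATION_ORDER` and `religion_code` are hypothetical, and the particular numbers assigned to the unordered categories are arbitrary labels.

```python
# Binary (dummy) variable: yes/no answers coded as 1 and 0.
answers = ["yes", "no", "yes", "yes"]  # invented survey responses
binary = [1 if answer == "yes" else 0 for answer in answers]

# Ordinal variable: the categories have a logical order, so the codes preserve it.
EDUCATION_ORDER = ["elementary", "high school", "college graduate"]
education_code = {level: rank for rank, level in enumerate(EDUCATION_ORDER)}

education = ["high school", "elementary", "college graduate"]  # invented responses
ordinal = [education_code[level] for level in education]

# Unordered categories: the numbers are only labels, so a "2" is not "more" than a "0".
religion_code = {"Religion A": 0, "Religion B": 1, "Religion C": 2}

# A handy property of 0/1 coding: the mean is the share of "yes" answers.
share_yes = sum(binary) / len(binary)

print("binary codes:", binary)
print("ordinal codes:", ordinal)
print("share answering yes:", share_yes)
```

The ordinal codes can be compared (a 2 is more education than a 1), but the distance between codes does not have to mean anything, and for unordered categories even the comparison is meaningless.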
By contrast, continuous variables take values along a range of numbers. For example, a person’s individual income or the temperature outside both range from 0 (or below) to higher numbers.
This is partly a vocabulary issue – just being able to understand what people are talking about when they use these terms – but it does also matter for analysis. You can only do certain types of statistics with certain types of variables. You can only make a cross-tabulation with categorical variables, and OLS regression (the kind we will learn) generally calls for a continuous dependent variable.
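To give a rough sense of the difference, here is a short Python sketch of both tools using only the standard library. All of the data are invented for illustration, the helper name `ols` is hypothetical, and the regression is the simplest one-predictor case rather than anything we would run on real data.

```python
from collections import Counter

# Cross-tabulation: count observations in each combination of two
# categorical variables (invented party labels and yes/no votes).
party = ["D", "R", "D", "R", "D"]
voted_yes = [1, 0, 1, 1, 0]
crosstab = Counter(zip(party, voted_yes))

def ols(x, y):
    """One-predictor OLS: slope = sum of cross-deviations / sum of squared x-deviations."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Continuous variables (invented): years of schooling and income in thousands.
# The numbers are built to be exactly linear: income = 2 * years + 4.
years_school = [8, 10, 12, 14, 16]
income = [20, 24, 28, 32, 36]
slope, intercept = ols(years_school, income)

for (p, v), count in sorted(crosstab.items()):
    print(f"party={p} voted_yes={v}: {count}")
print("slope:", slope, "intercept:", intercept)
```

The cross-tabulation only needs categories to count, while the regression needs numbers whose distances mean something, which is the practical reason the variable type matters.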
In class, we are going to practice identifying types of variables (and judging how well they define concepts) by looking at variables describing democracy around the world.