Voting Survey Study
In this document, you will find the process I used to transform a .csv file of survey results into a presentation containing visualizations and interpretations.
ABOUT THIS DOCUMENT
This document was created to give you insight into how I tackle a data challenge.
It contains details on my methods for organizing the data that I would not include if I were presenting to a panel interested only in the results of the voting survey.
WHAT YOU CAN FIND IN THIS DOCUMENT
A process for labeling the columns and converting values into explorable text
The creation of a table that allows for easy querying
An example of how AI can assist in learning the data
A report on the data with visualizations
ABOUT THE DATA
The source for the data used in this study can be found here:
https://github.com/fivethirtyeight/data/blob/master/non-voters/nonvoters_data.csv
Date of Data: Oct 26, 2020
This is the data behind the story "Why Many Americans Don't Vote."
Data presented here comes from polling done by Ipsos for FiveThirtyEight, using Ipsos’s KnowledgePanel, a probability-based online panel that is recruited to be representative of the U.S. population. The poll was conducted from Sept. 15 to Sept. 25 among a sample of U.S. citizens that oversampled young, Black and Hispanic respondents, with 8,327 respondents, and was weighted according to general population benchmarks for U.S. citizens from the U.S. Census Bureau’s Current Population Survey March 2019 Supplement.
The source data used in this study is a publicly downloadable .csv with 5,837 rows and roughly 115 columns.
A nice codebook was provided:
https://github.com/fivethirtyeight/data/blob/master/non-voters/nonvoters_codebook.pdf
The survey behind the data used numerous question types, which made deciphering and organizing the data challenging:
Yes/No
Yes to all that apply
Choose only 1
If yes, then…
Rate the significance
SUMMARY OF MY PROCESS
Step 1 - Import the .csv into an Azure SQL database
I have a personal Azure account, so I created a SQL database for this project and used the Azure Data Studio’s import wizard to import the nonvoters_data.csv.
*NOTE - I am using an Acer Chromebook, so I needed to install Linux in order to install Azure Data Studio. It works well, but it can be slow and clunky, so I mainly use Azure Data Studio for any tools not available in the Azure portal.
Step 2 - Transform the data!
Now that the raw data is imported into a table, I could trudge through it as it stands: developing SQL to pull the data while using the codebook as a guide to translate and name things, or translating everything in Excel as I visualize. That sounds brutal. I want a data table that is as simple as possible. That will let me play with the data easily and get to know it quickly.
I need to create a method to translate the raw data into something visually interpretable.
It doesn’t have to be perfect - just better than a table of Qs, 1s, and 2s.
2.1 - I made a copy of my raw table (nonvoter_data) as CLEAN_VOTER_TABLE. This is just to ensure I don't corrupt nonvoter_data.
2.2 - I renamed all of the columns in CLEAN_VOTER_TABLE to something I can interpret at a glance. Every column must have a unique name. Plus, these names are going to serve as a key into the other translation tables.
Here is the script to change column names: EXEC sp_rename 'dbo.CLEAN_VOTER_TABLE.Q1', 'Citizen', 'COLUMN';
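Renaming the rest of the columns follows the same pattern, one call per column. Here is a sketch of a few more (the target names other than Citizen are illustrative, not my exact choices):

```sql
-- sp_rename takes the qualified old name, the new name, and 'COLUMN'.
-- The Q-numbered designators come from the codebook; the new names
-- below (other than Citizen) are illustrative examples.
EXEC sp_rename 'dbo.CLEAN_VOTER_TABLE.Q1',   'Citizen',     'COLUMN';
EXEC sp_rename 'dbo.CLEAN_VOTER_TABLE.Q2_1', 'ImpVoting',   'COLUMN';
EXEC sp_rename 'dbo.CLEAN_VOTER_TABLE.Q2_2', 'ImpJuryDuty', 'COLUMN';
```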
2.3 - This was a little tedious, but by no means hard or time-consuming. I took the provided codebook (which looks like the survey, but with question and answer designators) and formatted it into two '$'-delimited CSVs: one file for questions and one file for answers.
2.4 - I also needed a column translator so I could find the questions and answers associated with my new columns in CLEAN_VOTER_TABLE. To save time and ensure accuracy, I generated a schema listing for CLEAN_VOTER_TABLE. The schema script looked like this:
SELECT
t.TABLE_NAME,
c.COLUMN_NAME,
c.DATA_TYPE,
c.CHARACTER_MAXIMUM_LENGTH,
c.NUMERIC_PRECISION,
c.NUMERIC_SCALE,
c.IS_NULLABLE
FROM
INFORMATION_SCHEMA.TABLES t
JOIN
INFORMATION_SCHEMA.COLUMNS c
ON t.TABLE_NAME = c.TABLE_NAME
WHERE
t.TABLE_TYPE = 'BASE TABLE' and
t.TABLE_NAME = 'CLEAN_VOTER_TABLE'
ORDER BY
t.TABLE_NAME,
c.ORDINAL_POSITION;
2.5 - I used the schema output to create a column-translation csv.
2.6 - I imported my 3 CSVs into my database that houses my CLEAN_VOTER_TABLE.
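For reference, a '$'-delimited file can also be loaded with T-SQL rather than an import wizard. This is a sketch only; the file name and staging table layout are assumptions, and on Azure SQL Database the file would have to be served from blob storage via an external data source:

```sql
-- Sketch: loading the '$'-delimited questions file with BULK INSERT.
-- File name and table are assumptions for illustration.
-- (On Azure SQL Database, add DATA_SOURCE = '<your blob data source>'.)
BULK INSERT dbo.QUESTIONS
FROM 'questions.csv'
WITH (
    FIELDTERMINATOR = '$',   -- '$' instead of the usual comma
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2      -- skip the header row
);
```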
Now, I have all of the ingredients to create a new table that I will use to query all of the data from now on.
Step 3 - The creation of VOTING_PIVOT_MONSTER
3.1 - I used a straightforward SELECT INTO method for creating VOTING_PIVOT_MONSTER. I unioned all 112 necessary SELECTs and used my translation tables to replace all of the numeric values (answers) with real answers.
There were a few errors with data types. For example, in a column where the majority of answers were numeric, the data type was inferred as int, but an occasional 'NA' (not applicable) answer would fail the insert.
The following script is an example of what was used to alter data types until everything passed:
ALTER TABLE [dbo].[VOTING_PIVOT_MONSTER]
ALTER COLUMN ANSWER_DESC nvarchar(255);
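For context, this is the 9-column shape of the table the script below populates. The column list comes from the insert script; the data types here are my reconstruction, not a dump of the actual DDL:

```sql
-- Reconstructed sketch of the table definition; types are assumptions.
CREATE TABLE dbo.VOTING_PIVOT_MONSTER (
    QUESTION       nvarchar(50),
    RESPID         int,
    ANSWER_DESC    nvarchar(255),  -- widened so text answers and 'NA' fit
    GENDER         nvarchar(20),
    PPAGE          int,
    EDUC           nvarchar(50),
    RACE           nvarchar(50),
    INCOME_CAT     nvarchar(50),
    VOTER_CATEGORY nvarchar(20)
);
```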
3.2 - Here are the first few lines of the SQL script that populates VOTING_PIVOT_MONSTER:
INSERT INTO VOTING_PIVOT_MONSTER (
QUESTION,
RESPID,
ANSWER_DESC,
GENDER,
PPAGE,
EDUC,
RACE,
INCOME_CAT,
VOTER_CATEGORY
)
select 'voting' AS 'question', a.respid,
b.answer_desc, a.gender, a.ppage, a.educ, a.race,
a.income_cat, a.voter_category
from [dbo].[CLEAN_VOTER_TABLE] a, [dbo].[ANSWERS] b,
[dbo].[COLUMN_TRANS] d
where d.column_name = 'voting'
and d.question_id = b.question_id
and a.voting = b.answer_id
and a.voting NOT IN ('-1 ', 'NA')
UNION
select 'politics' AS 'question', a.respid, b.answer_desc, a.gender, a.ppage, a.educ, a.race, a.income_cat, a.voter_category
from [dbo].[CLEAN_VOTER_TABLE] a, [dbo].[ANSWERS] b, [dbo].[COLUMN_TRANS] d
where d.column_name = 'politics'
and d.question_id = b.question_id
and a.politics = b.answer_id
and a.politics NOT IN ('-1 ', 'NA')
UNION
-- The script goes on for another 800 lines of “Unions” pulling all the questions and answers.
Instead of having 5,837 rows and 115 columns, I now have 494,860 rows and 9 columns.
EASE OF USE - FOR PASSING TO COLLEAGUES OR PASSING TO AI
Structuring the data into such a simple table, with intuitive labeling, makes this project easy to share with others. It also allows it to be passed to an LLM easily. Only a few hours after receiving the raw CSV, the data can be explored by colleagues or handed to an LLM to suggest further avenues of exploration.
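As an example of how easy the long format is to query, tallying the responses to any one question is a single GROUP BY (using the 'voting' question from the script above):

```sql
-- Count responses per answer for a single question.
SELECT ANSWER_DESC, COUNT(*) AS RESPONSES
FROM dbo.VOTING_PIVOT_MONSTER
WHERE QUESTION = 'voting'
GROUP BY ANSWER_DESC
ORDER BY RESPONSES DESC;
```

Swapping in any other value of QUESTION gives the same breakdown for that question, with no knowledge of the original 115-column layout required.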
LLMs, such as ChatGPT-4o, can take advantage of descriptive column names (these were designed with AI in mind).
Here is the straightforward SQL I used to create an AI-friendly CSV:
SELECT a.respid as ID,
a.ppage as 'Age',
a.gender as 'Gender',
a.educ as 'Education Level',
a.race as 'Race',
a.income_cat as 'Income Category',
a.voter_category as 'Voting Frequency',
a.answer_desc as 'Political Party',
--
(select z.answer_desc
FROM [dbo].[VOTING_PIVOT_MONSTER] z
WHERE a.respid = z.respid
and z.question IN ('racism')) as 'Racism is a Problem' ,
--
(select y.answer_desc
FROM [dbo].[VOTING_PIVOT_MONSTER] y
WHERE a.respid = y.respid
and y.question IN ('whichpres')) as 'My Choice as President'
--
FROM [dbo].[VOTING_PIVOT_MONSTER] a
WHERE a.question IN ('yourparty')
GROUP BY a.respid, a.ppage, a.gender, a.educ, a.race, a.income_cat, a.voter_category, a.answer_desc
ORDER BY ID;
Steps for using AI
- Download the results as a csv (there is an export/download limit in Azure of 55K rows)
- Open a new chat in ChatGPT-4o
- Upload the csv into the chat. The LLM will analyze the file and ask you what you need to know.
- I entered the following prompt:
To see the resulting analysis, please click the following link:
REVIEW WHAT WE HAVE:
A simple, 9-column table with all of the data in English terms (VOTING_PIVOT_MONSTER)
Easy to pass to colleagues without needing complex instructions
A simple file-building SQL that anyone could use as a template
Only uses one table
Produces very Excel friendly files
Produces very AI friendly files
After only about 8 hours of preparation, you can have a team meeting to do a very quick and creative analysis of the data.
ANALYSIS OF THE DATA
There are many ways (possibly hundreds) to analyze this dataset. It is truly a wealth of voter-survey data. I chose to concentrate on barriers to voting. What prevents people from voting? There are conditions that make voting a burden, and there are beliefs and attitudes that lessen one's motivation to vote. My analysis tries to identify these barriers and imagine what could minimize them.