Voting Survey Study
In this document, you will find the process I used to transform a .csv file of survey results into a presentation containing visualizations and interpretations.
ABOUT THIS DOCUMENT
This document was created to give you insight into how I tackle a data challenge.
It contains details on my methods for organizing the data that I would not include if I were presenting to a panel interested only in the results of the voting survey.
WHAT YOU CAN FIND IN THIS DOCUMENT
A process for labeling the columns and converting values into explorable text
The creation of a table that allows for easy querying
An example of how AI can assist in learning the data
A report on the data with visualizations
ABOUT THE DATA
The source for the data used in this study can be found here:
https://github.com/fivethirtyeight/data/blob/master/non-voters/nonvoters_data.csv
Date of Data: Oct 26, 2020
This is the data behind the story "Why Many Americans Don't Vote."
Data presented here comes from polling done by Ipsos for FiveThirtyEight, using Ipsos’s KnowledgePanel, a probability-based online panel that is recruited to be representative of the U.S. population. The poll was conducted from Sept. 15 to Sept. 25 among a sample of U.S. citizens that oversampled young, Black and Hispanic respondents, with 8,327 respondents, and was weighted according to general population benchmarks for U.S. citizens from the U.S. Census Bureau’s Current Population Survey March 2019 Supplement.
The source data used in this study is a publicly downloadable .csv with 5,837 rows and roughly 115 columns.
A nice codebook was provided:
https://github.com/fivethirtyeight/data/blob/master/non-voters/nonvoters_codebook.pdf
The survey behind the data used numerous question types, which made deciphering and organizing the data challenging:
Yes/No
Yes to all that apply
Choose only 1
If yes, then…
Rate the significance
SUMMARY OF MY PROCESS
Step 1 - Import the .csv into an Azure SQL database
I have a personal Azure account, so I created a SQL database for this project and used the Azure Data Studio’s import wizard to import the nonvoters_data.csv.
*NOTE - I am using an Acer Chromebook, so I needed to install Linux in order to install Azure Data Studio. It works well, but it can be slow and clunky, so I mainly use Azure Data Studio for any tools not available in the Azure portal.
Step 2 - Transform the data!
Now that the raw data is imported into a table, I could trudge through it as it stands: developing SQL to pull the data while using the codebook as a guide to translate and name things, or translating everything in Excel as I visualize. That sounds brutal. I want a data table that is as simple as possible. That will let me play with the data easily and get to know it quickly.
I need to create a method to translate the raw data into something visually interpretable.
It doesn’t have to be perfect - just better than a table of Qs, 1s, and 2s.
2.1 - I made a copy of my raw table (nonvoter_data) as CLEAN_VOTER_TABLE. This is just to ensure I don't corrupt nonvoter_data.
2.2 - I renamed all of the columns in CLEAN_VOTER_TABLE to something I can interpret at a glance. Every column must have a unique name. Plus, these names are going to serve as a key into the other translation tables.
Here is the script to change column names: EXEC sp_rename 'dbo.CLEAN_VOTER_TABLE.Q1', 'Citizen', 'COLUMN';
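Renaming the rest of the columns follows the same pattern, one call per column. Here is a sketch of a few more (the target names other than Citizen are illustrative, not my exact choices):

```sql
-- sp_rename takes the qualified old name, the new name, and 'COLUMN'.
-- The Q-numbered designators come from the codebook; the new names
-- below (other than Citizen) are illustrative examples.
EXEC sp_rename 'dbo.CLEAN_VOTER_TABLE.Q1',   'Citizen',     'COLUMN';
EXEC sp_rename 'dbo.CLEAN_VOTER_TABLE.Q2_1', 'ImpVoting',   'COLUMN';
EXEC sp_rename 'dbo.CLEAN_VOTER_TABLE.Q2_2', 'ImpJuryDuty', 'COLUMN';
```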
2.3 - This was a little tedious, but by no means hard or time-consuming. I took the provided codebook (which looks like the survey, but with question and answer designators) and formatted it into two '$'-delimited CSVs: one file for questions and one file for answers.
2.4 - I also needed a column translator so I could find the questions and answers associated with my new columns in CLEAN_VOTER_TABLE. To save time and ensure accuracy, I generated a schema listing for CLEAN_VOTER_TABLE. The schema script looked like this:
SELECT
t.TABLE_NAME,
c.COLUMN_NAME,
c.DATA_TYPE,
c.CHARACTER_MAXIMUM_LENGTH,
c.NUMERIC_PRECISION,
c.NUMERIC_SCALE,
c.IS_NULLABLE
FROM
INFORMATION_SCHEMA.TABLES t
JOIN
INFORMATION_SCHEMA.COLUMNS c
ON t.TABLE_NAME = c.TABLE_NAME
WHERE
t.TABLE_TYPE = 'BASE TABLE' and
t.TABLE_NAME = 'CLEAN_VOTER_TABLE'
ORDER BY
t.TABLE_NAME,
c.ORDINAL_POSITION;
2.5 - I used the schema output to create a column-translation csv.
2.6 - I imported my 3 CSVs into my database that houses my CLEAN_VOTER_TABLE.
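For reference, a '$'-delimited file can also be loaded with T-SQL rather than an import wizard. This is a sketch only; the file name and staging table layout are assumptions, and on Azure SQL Database the file would have to be served from blob storage via an external data source:

```sql
-- Sketch: loading the '$'-delimited questions file with BULK INSERT.
-- File name and table are assumptions for illustration.
-- (On Azure SQL Database, add DATA_SOURCE = '<your blob data source>'.)
BULK INSERT dbo.QUESTIONS
FROM 'questions.csv'
WITH (
    FIELDTERMINATOR = '$',   -- '$' instead of the usual comma
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2      -- skip the header row
);
```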
Now, I have all of the ingredients to create a new table that I will use to query all of the data from now on.
Step 3 - The creation of VOTING_PIVOT_MONSTER
3.1 - I used a straightforward SELECT INTO method for creating VOTING_PIVOT_MONSTER. I unioned all 112 necessary SELECTs and used my translation tables to replace all of the numeric values (answers) with real answers.
There were a few errors with data types. For example, in a column where the majority of answers were numeric, the data type was inferred as int, but an occasional 'NA' (not applicable) answer would fail the insert.
The following script is an example of what was used to alter data types until everything passed:
ALTER TABLE [dbo].[VOTING_PIVOT_MONSTER]
ALTER COLUMN ANSWER_DESC nvarchar(255);
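For context, this is the 9-column shape of the table the script below populates. The column list comes from the insert script; the data types here are my reconstruction, not a dump of the actual DDL:

```sql
-- Reconstructed sketch of the table definition; types are assumptions.
CREATE TABLE dbo.VOTING_PIVOT_MONSTER (
    QUESTION       nvarchar(50),
    RESPID         int,
    ANSWER_DESC    nvarchar(255),  -- widened so text answers and 'NA' fit
    GENDER         nvarchar(20),
    PPAGE          int,
    EDUC           nvarchar(50),
    RACE           nvarchar(50),
    INCOME_CAT     nvarchar(50),
    VOTER_CATEGORY nvarchar(20)
);
```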
3.2 - Here are the first few lines of the SQL script that populates VOTING_PIVOT_MONSTER:
INSERT INTO VOTING_PIVOT_MONSTER (
QUESTION,
RESPID,
ANSWER_DESC,
GENDER,
PPAGE,
EDUC,
RACE,
INCOME_CAT,
VOTER_CATEGORY
)
select 'voting' AS 'question', a.respid,
b.answer_desc, a.gender, a.ppage, a.educ, a.race,
a.income_cat, a.voter_category
from [dbo].[CLEAN_VOTER_TABLE] a, [dbo].[ANSWERS] b,
[dbo].[COLUMN_TRANS] d
where d.column_name = 'voting'
and d.question_id = b.question_id
and a.voting = b.answer_id
and a.voting NOT IN ('-1 ', 'NA')
UNION
select 'politics' AS 'question', a.respid, b.answer_desc, a.gender, a.ppage, a.educ, a.race, a.income_cat, a.voter_category
from [dbo].[CLEAN_VOTER_TABLE] a, [dbo].[ANSWERS] b, [dbo].[COLUMN_TRANS] d
where d.column_name = 'politics'
and d.question_id = b.question_id
and a.politics = b.answer_id
and a.politics NOT IN ('-1 ', 'NA')
UNION
-- The script goes on for another 800 lines of “Unions” pulling all the questions and answers.
Instead of having 5,837 rows and 115 columns, I now have 494,860 rows and 9 columns.
EASE OF USE - FOR PASSING TO COLLEAGUES OR PASSING TO AI
Structuring the data into such a simple table, with intuitive labeling, makes this project easy to share with others. It also allows it to be passed to an LLM easily. Only a few hours after receiving the raw CSV, the data can be explored by colleagues or handed to an LLM to suggest further avenues of exploration.
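As an example of how easy the long format is to query, tallying the responses to any one question is a single GROUP BY (using the 'voting' question from the script above):

```sql
-- Count responses per answer for a single question.
SELECT ANSWER_DESC, COUNT(*) AS RESPONSES
FROM dbo.VOTING_PIVOT_MONSTER
WHERE QUESTION = 'voting'
GROUP BY ANSWER_DESC
ORDER BY RESPONSES DESC;
```

Swapping in any other value of QUESTION gives the same breakdown for that question, with no knowledge of the original 115-column layout required.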
LLMs, such as ChatGPT-4o, can take advantage of descriptive column names (these were designed with AI in mind).
Here is the straightforward SQL I used to create an AI-friendly CSV:
SELECT a.respid as ID,
a.ppage as 'Age',
a.gender as 'Gender',
a.educ as 'Education Level',
a.race as 'Race',
a.income_cat as 'Income Category',
a.voter_category as 'Voting Frequency',
a.answer_desc as 'Political Party',
--
(select z.answer_desc
FROM [dbo].[VOTING_PIVOT_MONSTER] z
WHERE a.respid = z.respid
and z.question IN ('racism')) as 'Racism is a Problem' ,
--
(select y.answer_desc
FROM [dbo].[VOTING_PIVOT_MONSTER] y
WHERE a.respid = y.respid
and y.question IN ('whichpres')) as 'My Choice as President'
--
FROM [dbo].[VOTING_PIVOT_MONSTER] a
WHERE a.question IN ('yourparty')
GROUP BY a.respid, a.ppage, a.gender, a.educ, a.race, a.income_cat, a.voter_category, a.answer_desc
ORDER BY ID;
Steps for using AI
- Download the results as a csv (there is an export/download limit in Azure of 55K rows)
- Open a new chat in ChatGPT-4o
- Upload the csv into the chat. The LLM will analyze the file and ask you what you need to know.
- I entered the following prompt:
To see the resulting analysis, please click the following link:
REVIEW WHAT WE HAVE:
A simple, 9-column table with all of the data in English terms (VOTING_PIVOT_MONSTER)
Easy to pass to colleagues without needing complex instructions
A simple file-building SQL that anyone could use as a template
Only uses one table
Produces very Excel friendly files
Produces very AI friendly files
After only about 8 hours of preparation, you can have a team meeting to do a very quick and creative analysis of the data.
ANALYSIS OF THE DATA
There are many ways (possibly hundreds) to analyze this dataset. It is truly a wealth of voter-survey data. I chose to concentrate on barriers to voting. What prevents people from voting? There are conditions that make voting a burden, and there are beliefs and attitudes that lessen one's motivation to vote. My analysis tries to identify these barriers and imagine what could minimize them.