"Learn How to Get Baby's First Number" - United States Social Security Administration
For our analysis of baby names, we combine two datasets. The first, from the Social Security Administration, contains the number of babies given each name in each year, both for the country as a whole and for each state. The second, which we scraped from behindthename.com, has crowdsourced sentiment ratings on 14 attributes for each name. We analyze the data for trends over time and for geographic trends within the country.
While browsing behindthename.com, we noticed what looked like relationships between different attribute ratings, so we computed the correlations between attributes across the dataset. Two of the attributes (Strong/Delicate and Rough/Refined) seem to be moderately gendered, and several others (Classic, Mature, Formal, Upper Class, Serious, Nerdy) are moderately correlated with one another. Below is the correlation plot of all of the name sentiment attributes.
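As a rough illustration of the correlation step, here is a minimal sketch in plain Python. The `ratings` layout, the three attributes shown, and every score are made up for the example and are not the scraped data; `pearson` and `correlation_matrix` are hypothetical helper names.

```python
from math import sqrt

# Hypothetical ratings: name -> {attribute -> score}. Values are invented.
ratings = {
    "Agnes":  {"Classic": 0.9, "Mature": 0.8, "Strong": 0.2},
    "Jayden": {"Classic": 0.1, "Mature": 0.2, "Strong": 0.6},
    "Henry":  {"Classic": 0.8, "Mature": 0.6, "Strong": 0.7},
    "Luna":   {"Classic": 0.3, "Mature": 0.1, "Strong": 0.3},
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(ratings):
    """Correlate every attribute with every other attribute across all names."""
    attributes = sorted(next(iter(ratings.values())))
    columns = {a: [row[a] for row in ratings.values()] for a in attributes}
    return {
        a: {b: pearson(columns[a], columns[b]) for b in attributes}
        for a in attributes
    }


matrix = correlation_matrix(ratings)
for attribute, row in matrix.items():
    print(attribute, {other: round(r, 2) for other, r in row.items()})
```

With the real data loaded into a pandas DataFrame, `DataFrame.corr()` would likely do the same job in one call.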
One of our hypotheses was that the ratings we scraped from behindthename.com are biased by when each name was popular. The ratings are crowdsourced and were all collected recently, so a name that was popular many years ago is likely to be skewed on attributes such as Mature vs. Youthful: most of the people a rater knows with that name are older. We tested this by taking the attributes for the 100 most popular names in each year and running a linear regression of each attribute against year. Attached below are graphs of the attributes over time, along with the R-squared value, mean squared error, and p-value of each regression. Below are the graphs over time for the two attributes with the highest R-squared values.
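The regression step could look something like the sketch below. It assumes the attribute is averaged over each year's top names before fitting, which is one possible reading of the procedure. The names, scores, and the helpers `yearly_attribute_mean` and `fit_line` are illustrative only. The p-value is left out because it needs a t-distribution, which the standard library does not provide; `scipy.stats.linregress` reports it.

```python
def yearly_attribute_mean(top_names, scores):
    """Average one attribute over each year's top names.

    Names with no rating are skipped (an assumption about missing data).
    """
    means = {}
    for year, names in top_names.items():
        rated = [scores[name] for name in names if name in scores]
        if rated:
            means[year] = sum(rated) / len(rated)
    return means


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys against xs, with R-squared and MSE."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return {
        "slope": slope,
        "intercept": intercept,
        "r_squared": 1 - ss_res / ss_tot,
        "mse": ss_res / n,
    }


# Hypothetical inputs: two top names per year instead of 100, invented scores.
top_names = {
    1950: ["Mary", "Linda"],
    1980: ["Jennifer", "Amanda"],
    2010: ["Emma", "Sophia"],
}
mature_scores = {"Mary": 0.9, "Linda": 0.8, "Jennifer": 0.5,
                 "Amanda": 0.4, "Emma": 0.3, "Sophia": 0.2}

means = yearly_attribute_mean(top_names, mature_scores)
years = sorted(means)
print(fit_line(years, [means[y] for y in years]))
```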
Another trend we studied was the number of names that originate in each state, in order to investigate how much influence each state has on which names become popular nationwide. We took the 30 most popular names from each year (1938-2014) in the Social Security database and found the state in which each name first occurred. To avoid counting names that existed before data collection began, we ignored names that appeared in the majority of states at the beginning of the data. We then plotted a map of the U.S. with each state colored by how many popular names originated there, and obtained the graph below.
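One way to sketch the origin count is shown below. The record layout `(year, state, name)`, the function name `origin_counts`, and the sample rows are assumptions for illustration. The sketch also assumes that when a name debuts in several states in the same year, every one of those states gets credit; the description above does not say how ties are handled.

```python
from collections import Counter, defaultdict


def origin_counts(records, popular_names, n_states):
    """Count, per state, how many popular names first appeared there.

    records: iterable of (year, state, name) tuples (hypothetical layout).
    Names present in more than half of the states in the first year of the
    data are skipped, since they likely predate the data.
    """
    records = list(records)
    first_year = min(year for year, _, _ in records)

    # name -> year -> set of states where the name was recorded that year
    appearances = defaultdict(lambda: defaultdict(set))
    for year, state, name in records:
        appearances[name][year].add(state)

    counts = Counter()
    for name in popular_names:
        by_year = appearances.get(name)
        if not by_year:
            continue
        if len(by_year.get(first_year, ())) > n_states / 2:
            continue  # already widespread when the data begins
        debut_year = min(by_year)
        counts.update(by_year[debut_year])  # ties: credit every debut state
    return counts


# Invented rows for a toy country with three states.
records = [
    (1938, "CA", "John"), (1938, "NY", "John"), (1938, "TX", "John"),
    (1940, "CA", "Ava"), (1941, "NY", "Ava"),
    (1950, "TX", "Liam"), (1950, "NY", "Liam"),
]
print(origin_counts(records, {"John", "Ava", "Liam"}, n_states=3))
```

The resulting per-state counts are what would drive the coloring of a map.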
Another hypothesis we wanted to test was that names that are perceived as more gender-neutral have become more common over time. Our first approach to testing this hypothesis was to find the top 100 names for each year since 1937 and look at each name’s rating on the Masculine-Feminine scale. We plotted the number of names that fall into each range on the scale as a stack plot.
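The counting behind such a stack plot might be sketched as follows. The scale is assumed here to run from -1 (masculine) to 1 (feminine) and to be split into four equal ranges; the actual scale and bin edges may differ. The names `bin_counts`, `edges`, and all scores are hypothetical.

```python
from bisect import bisect_right


def bin_counts(top_names, gender_score, edges):
    """Count each year's rated top names per range of the scale.

    Bin i covers [edges[i], edges[i + 1]); the last bin also includes the
    upper edge. Scores outside the edges are clamped into the end bins, and
    names without a rating are skipped.
    """
    n_bins = len(edges) - 1
    result = {}
    for year, names in sorted(top_names.items()):
        counts = [0] * n_bins
        for name in names:
            score = gender_score.get(name)
            if score is None:
                continue
            index = bisect_right(edges, score) - 1
            counts[min(max(index, 0), n_bins - 1)] += 1
        result[year] = counts
    return result


# Invented example: three top names per year instead of 100.
edges = [-1.0, -0.5, 0.0, 0.5, 1.0]
top_names = {1937: ["James", "Mary", "Leslie"], 2014: ["Noah", "Emma", "Riley"]}
gender_score = {"James": -0.9, "Mary": 0.9, "Leslie": 0.2,
                "Noah": -0.8, "Emma": 0.8, "Riley": -0.1}

for year, counts in bin_counts(top_names, gender_score, edges).items():
    print(year, counts)
```

Each bin's counts across the years form one layer of the stack, so the columns of this result could be passed to a plotting call such as matplotlib's `stackplot`.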
Next, we plan to explore the similarities we observed among the trend lines of the most popular male names. We have started reshaping the data so that we can cluster it, to see whether clustering reveals anything new about how the data is distributed.