The goal of our project was to analyze and explore trends and sentiments of baby names in the U.S.
Along the way, we developed hypotheses about various aspects of the data we found interesting. Our successes and struggles with testing those are detailed below. We used machine learning/statistics and data visualization techniques in each section.
We used two datasets for all of our analysis. One came from the Social Security Administration, and we scraped the other from BehindTheName.com, a website that crowd-sources name attribute ratings.
One of the biggest technical challenges we faced was scraping the data from behindthename.com. We initially extended our scraping assignments to make requests to the website and then scrape the attributes from the webpages, but we ran into scaling issues: there were 94,000 names to scrape, and each HTTP GET request took almost three quarters of a second to complete. Run sequentially, the scraping program would have needed roughly 20 hours to collect the data, and since we did not trust a single Python process to run reliably for that long, we decided to restructure the program to make the requests concurrently. We considered Python's concurrency support limited because of the Global Interpreter Lock, so we rewrote the program in Node.js, which was able to make requests in batches of 200 every 5 seconds and scrape all of the data in under an hour. We also had to account for the different ways that behindthename.com handles names that have multiple spellings or multiple origins (the site also gives the meaning and history of names), because these link to different ratings pages.
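As a rough check on the timing figures above, the arithmetic can be worked through directly. This is only a back-of-the-envelope sketch: it treats 0.75 seconds as an upper bound on the per-request time and assumes batches are issued exactly every 5 seconds with no retries or failures.

```python
# Back-of-the-envelope timing from the figures quoted in the text.
NUM_NAMES = 94_000           # names to scrape
SECONDS_PER_REQUEST = 0.75   # upper bound: "almost three quarters of a second"
BATCH_SIZE = 200             # requests issued together
BATCH_INTERVAL_S = 5         # seconds between batches

# Sequential scraping: one request after another.
sequential_hours = NUM_NAMES * SECONDS_PER_REQUEST / 3600

# Batched scraping: one batch of requests every BATCH_INTERVAL_S seconds.
num_batches = -(-NUM_NAMES // BATCH_SIZE)  # ceiling division
batched_minutes = num_batches * BATCH_INTERVAL_S / 60

print(f"sequential: about {sequential_hours:.1f} hours")
print(f"batched: {num_batches} batches, about {batched_minutes:.1f} minutes")
```

Under these assumptions the sequential run takes most of a day, while the batched run fits comfortably in under an hour.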
We were able to find what we wanted in the data that we gathered, as both datasets were pretty rich in what they offered for analysis.
After seeing the behindthename.com data, we suspected that it was biased. (This is possibly because all the names are being rated now, so we only have current perceptions of names, but we can't test for the cause.) We performed linear regression over time on the average attribute values for each year and found high R-squared values of 0.9501 and 0.9736 for Mature vs. Youthful and Classic vs. Modern, respectively. These plots are shown below.
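For readers who want to see how such a fit is computed, here is a minimal least-squares sketch in plain Python. The yearly averages are made-up placeholder values, not our data, and the function name fit_line is ours for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Hypothetical average rating per year (placeholder numbers, not real data).
years = [1940, 1960, 1980, 2000]
avg_rating = [62.0, 55.5, 50.5, 44.0]

slope, intercept, r_squared = fit_line(years, avg_rating)
print(f"slope={slope:.3f}, R-squared={r_squared:.4f}")
```

An R-squared close to 1 means the yearly averages lie almost exactly on a straight line.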
We also felt that some of the attributes had similar meanings, so we created a correlation plot (left) and found that some pairs of attributes were highly correlated. For example, Classic and Mature had a correlation of 0.789. This indicated that the data was effectively lower in dimensionality than the 14 attributes would suggest.
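A correlation plot like this one is built from pairwise correlation coefficients. The sketch below assumes Pearson correlation and uses three invented attribute columns with placeholder ratings, so the attribute names and numbers are illustrative only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical ratings: one list per attribute, one entry per name.
ratings = {
    "classic": [80, 65, 30, 20, 55],
    "mature": [75, 70, 35, 25, 40],
    "strong": [40, 60, 55, 45, 50],
}

# Correlation matrix as a nested dict: corr[a][b] is the correlation of a and b.
attributes = list(ratings)
corr = {
    a: {b: pearson(ratings[a], ratings[b]) for b in attributes}
    for a in attributes
}

for a in attributes:
    print(a, [round(corr[a][b], 3) for b in attributes])
```

Pairs with coefficients near 1 or -1 carry largely redundant information, which is what suggests the lower effective dimensionality.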
We also implemented K-Nearest Neighbors, using a name's values for each of the above attributes as its feature vector. When we tested various names, we found that their closest neighbors were intuitively similar, which suggested that there was some consistency in the sentiment data. For example, in the ten closest neighbors of Katherine, the names Catherine and Catharine show up, as well as Elizabeth and Elisabeth and some other British royalty names.
We also found Hades' neighbors to be interesting, as they included Zeus, another Greek god, and Baltasar and Kain, the Greek spellings of biblical figures.
In addition, when looking at the nearest neighbors of Donald, the eighth most similar name is Melania, and the second most similar name to Melania is Ivanka. The ratings of these names appear biased: Donald has been a popular name throughout history, yet it is rated as 74% a bad name. Also, although the name Donald has historically been given to significantly more male babies than female babies, it is rated as 54% feminine. Even so, it is interesting that these names ended up with similar ratings.
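The nearest-neighbor lookups described above can be sketched as follows. This version assumes Euclidean distance and uses three invented attribute values per name instead of the full 14; both the distance metric and the numbers are assumptions for illustration.

```python
import math


def nearest_neighbors(target, ratings, k):
    """Return the k names whose rating vectors are closest to target's."""
    target_vec = ratings[target]
    distances = [
        (math.dist(target_vec, vec), name)
        for name, vec in ratings.items()
        if name != target
    ]
    distances.sort()  # smallest distance first
    return [name for _, name in distances[:k]]


# Hypothetical rating vectors (placeholder values, 3 attributes per name).
ratings = {
    "Katherine": (80, 70, 60),
    "Catherine": (79, 71, 61),
    "Elizabeth": (85, 75, 58),
    "Jayden": (10, 20, 40),
    "Zeus": (50, 5, 95),
}

print(nearest_neighbors("Katherine", ratings, k=2))
```

With all attributes on the same 0 to 100 scale, no feature scaling is needed before computing distances.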
After plotting where new names originate in the country, we hypothesized that the pattern was driven purely by population, so we decided to control for population and then run a chi^2 goodness-of-fit test with the uniform distribution as our null hypothesis. Our criterion for a new name is the first time any name appeared in the data where that name made it into the top 30 most popular for the country. We ignored the names that were present at the start of the data.
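One possible reading of this criterion is sketched below: take every name that ever reaches the national top N, find the first year it appears anywhere in the data, and record the states where it appeared that year. The record layout (year, state, name, count), the function name, and this interpretation are all assumptions, and the sample records are invented.

```python
from collections import defaultdict


def new_name_origins(records, top_n=30):
    """Map each new popular name to (first_year, states where it first appeared).

    records: iterable of (year, state, name, count) tuples.
    """
    national = defaultdict(lambda: defaultdict(int))  # year -> name -> births
    first_seen = {}                                   # name -> first year in data
    states_seen = defaultdict(set)                    # (name, year) -> states
    for year, state, name, count in records:
        national[year][name] += count
        first_seen[name] = min(year, first_seen.get(name, year))
        states_seen[(name, year)].add(state)

    start_year = min(national)
    # Names that reach the national top_n in at least one year.
    popular = set()
    for counts in national.values():
        ranked = sorted(counts, key=counts.get, reverse=True)
        popular.update(ranked[:top_n])

    # Skip names already present in the first year of the data.
    return {
        name: (first_seen[name], sorted(states_seen[(name, first_seen[name])]))
        for name in popular
        if first_seen[name] > start_year
    }


# Hypothetical records (placeholder names and counts).
records = [
    (2000, "CA", "Ann", 10), (2000, "NY", "Ann", 8),
    (2001, "TX", "Bea", 2), (2001, "CA", "Ann", 9),
    (2002, "TX", "Bea", 20), (2002, "CA", "Bea", 15),
    (2002, "CA", "Ann", 5), (2002, "NY", "Cy", 1),
]
print(new_name_origins(records, top_n=1))
```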
To control for population, we weighted each new name by the ratio of babies born in that state that year to the number of babies born in the country that year, i.e. b_y^S / b_y for state S, where b_y is the number of babies born in year y and b_y^S is the number of babies born in year y in state S. We used this to calculate the score of a state as:
Below are our plots of where names originated, with and without controlling for population.
We ran a chi^2 goodness-of-fit test on the distribution of where baby names originated after normalization, with H_0 as the uniform distribution, and obtained a significant p-value of p = 2.2e-16. We can therefore conclude that the distribution of where baby names originate is significantly different from uniform, after controlling for the population of states.
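For reference, a goodness-of-fit test against a uniform null can be sketched in plain Python as below. This is not our original script: the per-category scores are invented, they are treated as observed counts, and the p-value comes from a small hand-written chi-squared tail function (a statistics library would normally provide this).

```python
import math


def chi2_sf(x, dof):
    """Upper-tail probability P(X >= x) for a chi-squared variable."""
    if x <= 0:
        return 1.0
    z = x / 2
    # Start from a closed form (dof 2 or dof 1), then apply the recurrence
    # Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1) up to a = dof / 2.
    if dof % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    while a < dof / 2:
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1))
        a += 1
    return q


def uniform_gof(observed):
    """Chi-squared goodness-of-fit test with a uniform null hypothesis.

    Returns (statistic, degrees_of_freedom, p_value).
    """
    expected = sum(observed) / len(observed)
    stat = sum((o - expected) ** 2 / expected for o in observed)
    dof = len(observed) - 1
    return stat, dof, chi2_sf(stat, dof)


# Hypothetical scores for five regions (placeholder values).
scores = [30, 12, 8, 25, 5]
stat, dof, p_value = uniform_gof(scores)
print(f"chi^2 = {stat:.3f}, dof = {dof}, p = {p_value:.2e}")
```

A small p-value means the observed scores are unlikely under a uniform distribution across categories.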
A new theory we have after seeing the trends in the names is that the states along the Mexico border are more "influential" because of immigration from Mexico. The new names that appeared in New Mexico were "Sofia" and "Jose", which is evidence to support this theory, but we don't have a way to test it without looking at migration patterns.
As you can see in the graphs above, some of the trends for the most popular male names were very similar. This inspired us to analyze the similarity in popularity over time between names and look for significant groupings. We computed the cross-correlation of the normalized popularity of each name with each other name to create a similarity matrix. We found the two most similar female names to be Florence and Mildred, and the two most similar male names to be George and Arthur.
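A sketch of the similarity computation is below. It assumes each popularity series is normalized to zero mean and unit length, and that similarity is the largest cross-correlation over a small window of lags; the lag window, function names, and the toy series are assumptions, not a record of our exact settings.

```python
import math


def normalize(series):
    """Shift to zero mean and scale to unit length (series must not be constant)."""
    mean = sum(series) / len(series)
    centered = [v - mean for v in series]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered]


def max_cross_correlation(a, b, max_lag=0):
    """Largest cross-correlation of two equal-length series over lags in
    [-max_lag, max_lag]. With max_lag=0 this is the Pearson correlation."""
    a, b = normalize(a), normalize(b)
    n = len(a)
    best = -1.0
    for lag in range(-max_lag, max_lag + 1):
        total = sum(a[i] * b[i + lag] for i in range(n) if 0 <= i + lag < n)
        best = max(best, total)
    return best


# Hypothetical popularity curves (placeholder values, one entry per year).
popularity = {
    "NameA": [0, 1, 0, 0, 0],
    "NameB": [0, 0, 1, 0, 0],
    "NameC": [0, 2, 0, 0, 0],
}

# Similarity matrix as a nested dict.
names = list(popularity)
similarity = {
    x: {y: max_cross_correlation(popularity[x], popularity[y], max_lag=1) for y in names}
    for x in names
}
print(similarity["NameA"])
```

Allowing a nonzero lag lets two names with the same shape of trend, offset by a year or so, still score as similar.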
We then used t-distributed Stochastic Neighbor Embedding to reduce the similarity matrix to two dimensions so that we could visualize it. Take a look at our scatter plot! We found that the names seem to form two clusters. One cluster contained almost all of the names, and was ordered by the year the name was most popular. These names also had only one large peak in popularity between 1937 and 2014. The other cluster was much smaller and contained names that had two large peaks in popularity, one in the late 1930s or early 1940s and one in the early 2010s. We think that the names in this smaller cluster are 'timeless' names, or names that are making a comeback, and this observation supports the 100-Year Rule (described here).
We hypothesized that the most popular names would comprise a consistent proportion of all names given over time. To test this hypothesis, we found the percentages of babies given the top five male and female names in a given year, and used linear regression to test for a significant change over time.
We found that the most popular names have become substantially less dominant over time.
The percent of male babies given one of the top five names has decreased with a slope of -0.25 and an R-squared value of 0.9720.
The percent of female babies given one of the top five names has decreased with a slope of -0.15 and an R-squared value of 0.8284.
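The top-five percentages above could be computed along these lines. The sketch assumes the data for one sex has been grouped into a mapping from year to name counts; the function name and the counts are placeholders, not real figures.

```python
def top_share(counts, top_n=5):
    """Percent of babies given one of the top_n most common names.

    counts maps name -> number of births for a single year and sex.
    """
    total = sum(counts.values())
    top = sorted(counts.values(), reverse=True)[:top_n]
    return 100 * sum(top) / total


# Hypothetical counts per year (placeholder names and numbers).
births_by_year = {
    1950: {"A": 50, "B": 40, "C": 30, "D": 20, "E": 10, "F": 50},
    2010: {"A": 20, "B": 20, "C": 15, "D": 15, "E": 10, "F": 120},
}

shares = {year: top_share(counts) for year, counts in births_by_year.items()}
for year in sorted(shares):
    print(year, f"{shares[year]:.1f}%")
```

The resulting (year, percentage) pairs are what the linear regression is then fitted to.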
We were interested in learning about gender-neutral names and their trends over time. However, we had difficulty finding a useful metric for what makes a name gender neutral. Basing gender neutrality on people's perceptions was not very useful, as seen in the graphic to the left. We also tried finding gender-neutral names by determining which names had a balanced male:female ratio, but that also returned names which we did not think were intuitively gender neutral.
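The ratio-based approach can be illustrated with a short sketch. The tolerance of 0.1 around an even split is an arbitrary assumption, as are the function names and the counts; choosing this threshold is part of what made the metric hard to pin down.

```python
def male_share(male_births, female_births):
    """Fraction of babies with a given name who were recorded as male."""
    return male_births / (male_births + female_births)


def is_balanced(male_births, female_births, tolerance=0.1):
    """True if the male share is within `tolerance` of an even 50/50 split."""
    return abs(male_share(male_births, female_births) - 0.5) <= tolerance


# Hypothetical (male, female) birth counts for placeholder names.
counts = {"NameA": (55, 45), "NameB": (90, 10), "NameC": (3, 2)}

balanced = [name for name, (m, f) in counts.items() if is_balanced(m, f)]
print(balanced)
```

Note that a rare name such as the placeholder NameC passes the test on only five births, which hints at why a pure ratio can return unintuitive results.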
If we had more time, we would be very interested to examine the relationship between pop culture and baby names. We noticed, for example, that the name 'Linda' had a huge spike in popularity in the late 1940s, which was probably due to a very popular song also called 'Linda'. Also, while playing around with our K-Nearest Neighbors script, we noticed that some names from the same movie or book series have very similar sentiment ratings (for example, Severus and Lucius). It would be challenging to figure out where to find data that could connect pop culture names to the year they were most popular.
We were also limited by the Social Security Administration data because it was only reliable from 1937 to 2014. There is some baby name data from before 1937, but before the passage of the Social Security Act, US citizens were not required to report baby names. It would be nice to have more years of reliable data, especially for our name trend analysis.