Female Education and Birth Rates:
By: Christian Kaufman, Jacob Mies, Evan Norkey
PSYC 500 final project
Picture Source: https://www.pinterest.com/pin/736620082791186727/
The goal of our research is to assess the relationship between educational attainment and birth rates for women across the globe. We used an open-source dataset from Our World in Data [1] that records, for each country or region, the mean number of live births per woman and the mean years of schooling completed by women of reproductive age (15-49 years old). The variables used here are the country, the birth rate of the area, and the educational level of women of reproductive age in the area. We felt this was a relevant topic because both measures reflect, and in turn heavily affect, the social, cultural, and economic situation of a given area. Ultimately, this analysis is important because it sheds light on the changes that have occurred socially and culturally in the last 70 years with regard to women's education and birth rates.
The original DataFrame had 49,980 rows (one per country per year) and seven columns ("Entity", "Code", "Year", "Estimates, 1950 - 2020: Annually interpolated demographic indicators - Total fertility (live births per woman)", "Mean years of schooling, women (in reproductive age 15 to 49) (Our World In Data (2017))", "Total population (Gapminder, HYDE & UN)", and "Continent").
The Entity column identifies the entity to which each row's data pertains. Items in this column are countries, broader geographic areas, and levels of income. These data are stored as objects.
The Code column lists abbreviated identifiers for each row's Entity. These data are stored as objects.
The Year column identifies the year to which each row's data pertains. These data are stored as int64s.
The Estimates, 1950 - 2020: Annually interpolated demographic indicators - Total fertility (live births per woman) column lists the number of live births per woman for each row's Entity for that row's Year. These data are stored as float64s.
The Mean years of schooling, women (in reproductive age 15 to 49) (Our World In Data (2017)) column lists the average years of school per woman for each row's Entity for that row's Year. These data are stored as float64s.
The Total population (Gapminder, HYDE & UN) column lists the total population of each row's Entity for that row's Year. These data are stored as float64s.
The Continent column identifies the continent to which each row's Entity belongs. Many rows inexplicably have a NaN value in this column, even when the Entity clearly is part of a continent. These data are stored as objects.
This data was obtained ethically. Our World in Data, the open-source website that hosts it, protects the anonymity of the data and does not expose anything compromising about the populations on which the data were collected. Instead, it simply makes relevant data accessible to the public, with the stated goal of making progress against “poverty, disease, hunger etc.” Because the data are reported only at the level of a country or larger region (the United States, for example), there was no ethical concern that they could be de-anonymized to reveal sensitive information. Also, because the data cannot be edited by users of the site, only by those who collected them empirically, there was no concern that they had been tampered with. Overall, we felt that the data were collected ethically by an organization with a strong reputation in data collection.
We conducted our analyses on a subset of the data including only data pertaining to countries from the years 1950 and 2010. First, we filtered out data from years other than 1950 and 2010. This removed 49,416 observations from the original 49,980, leaving 564.
Only the "Entity," "Year," "Estimates, 1950 - 2020: Annually interpolated demographic indicators - Total fertility (live births per woman)," and "Mean years of schooling, women (in reproductive age 15 to 49) (Our World In Data (2017))" columns were relevant to our research question, so we dropped the other columns. The names of the final two of these columns were excessively long, so we renamed them to "Birth Rate" and "Years of School," respectively.
We only wanted to analyze entries with values for both Birth Rate and Years of School, so we dropped all rows with at least one NaN value. 272 rows included at least one NaN value, leaving 292 observations.
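The year filter, column selection, renaming, and NaN removal described above can be sketched in pandas roughly as follows. This is a minimal illustration on a made-up DataFrame, not the original notebook code; the toy values and the names `raw`, `FERTILITY_COL`, `SCHOOLING_COL`, and `clean_subset` are assumptions introduced for the example.

```python
import numpy as np
import pandas as pd

# Long column names as they appear in the source dataset.
FERTILITY_COL = ("Estimates, 1950 - 2020: Annually interpolated demographic "
                 "indicators - Total fertility (live births per woman)")
SCHOOLING_COL = ("Mean years of schooling, women (in reproductive age 15 to 49) "
                 "(Our World In Data (2017))")

def clean_subset(raw, years=(1950, 2010)):
    """Keep the chosen years and columns, rename, and drop incomplete rows."""
    subset = raw[raw["Year"].isin(years)]
    subset = subset[["Entity", "Year", FERTILITY_COL, SCHOOLING_COL]]
    subset = subset.rename(columns={FERTILITY_COL: "Birth Rate",
                                    SCHOOLING_COL: "Years of School"})
    # dropna() removes any row with at least one NaN value
    return subset.dropna().reset_index(drop=True)

# Toy data (hypothetical values, for illustration only).
raw = pd.DataFrame({
    "Entity": ["A", "A", "A", "B", "B"],
    "Code": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "Year": [1950, 1980, 2010, 1950, 2010],
    FERTILITY_COL: [6.0, 4.5, 2.1, 5.5, np.nan],
    SCHOOLING_COL: [1.2, 4.0, 9.5, 2.0, 8.0],
})
cleaned = clean_subset(raw)
print(cleaned)
```

In this toy example the 1980 row is removed by the year filter and the 2010 row for entity "B" is removed because its birth rate is missing.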
We chose to analyze only entities with an observation in both 1950 and 2010. Given how the data had been processed up to this point, each entity has at most one row per year, so any entity name that is duplicated in df3 must have an observation in both years. We therefore checked how many values in the Entity column of df3 were duplicated. All were duplicated, so we removed no observations in this step.
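A check of this kind can be sketched with pandas' `duplicated`. The `df3` below is a small hypothetical stand-in for the cleaned table, and `in_both_years` and `paired` are names introduced only for this example. The approach assumes each entity has at most one row per year.

```python
import pandas as pd

# Toy stand-in for the cleaned table (hypothetical values).
df3 = pd.DataFrame({
    "Entity": ["A", "A", "B", "C", "C"],
    "Year": [1950, 2010, 1950, 1950, 2010],
})

# keep=False marks every row whose Entity appears more than once,
# i.e. entities that have both a 1950 row and a 2010 row.
in_both_years = df3["Entity"].duplicated(keep=False)
paired = df3[in_both_years]
print(f"{(~in_both_years).sum()} unpaired row(s) would be dropped")
```

Here entity "B" has only a 1950 row, so it is the single unpaired observation.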
Finally, we wanted to limit our analysis to countries, but rows in this dataset can correspond to larger geographic areas, income levels, subdivisions of countries, and more in addition to countries. As such, we pulled a list of modern countries from [2], converted it into a list named "countries", and kept only Entities with a name matching a member of the list. In this step, we dropped 26 observations, bringing our total to 266 observations.
We wanted to ensure that we did not inadvertently drop any countries during the previous step due to differences in spelling or our list being incomplete, so we examined the Entity items removed in the previous step to ensure that none of them were countries.
Eight of the removed items were actually countries, so we added them back to the list. This added sixteen observations to the analyzed data, two per country.
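The filter-then-review procedure described above might look like the following sketch. The entity names, the contents of `countries`, and the variable names `is_country`, `removed_names`, and `data_countries` are hypothetical; the point is only to show how listing the removed names makes spelling mismatches visible.

```python
import pandas as pd

# Toy stand-ins (hypothetical): entity names and a reference country list.
df3 = pd.DataFrame({"Entity": ["France", "France", "World", "World",
                               "Czechia", "Czechia"]})
countries = ["France", "Czech Republic"]

# Keep only rows whose Entity matches a name in the reference list.
is_country = df3["Entity"].isin(countries)

# List what was removed so it can be reviewed by hand for real countries.
removed_names = sorted(df3.loc[~is_country, "Entity"].unique())
print(removed_names)

# After manual review, add any real countries back and filter again.
countries.extend(["Czechia"])
data_countries = df3[df3["Entity"].isin(countries)]
```

In the toy data, "Czechia" is dropped at first because the reference list spells it differently, and it is restored once added to the list.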
The subset on which we conducted our analyses consisted of a total of 284 observations, one for each of 1950 and 2010 for each of 142 countries. 191 Entity names present in the original dataset were not included in our analysis.
Then, to describe the medians and interquartile ranges of data from both 1950 and 2010, box plots were created for "Average Live Births per Woman" and "Average Years of School per Woman." The box plots are shown below along with their associated code cells:
As we can see from these plots, there has been a drastic change in the distributions of both "Average Live Births per Woman" and "Average Years of School per Woman" from 1950 to 2010.
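The quantities that a box plot displays can also be computed directly, which makes the comparison between years easy to state numerically. The sketch below uses NumPy on made-up values; `box_plot_summary`, `toy_1950`, and `toy_2010` are hypothetical names and numbers, not results from our data.

```python
import numpy as np

def box_plot_summary(values):
    """Return the median, quartiles, and IQR that a box plot displays."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return {"q1": q1, "median": median, "q3": q3, "iqr": q3 - q1}

# Hypothetical birth-rate values, for illustration only.
toy_1950 = np.array([5.0, 6.0, 7.0, 6.5, 5.5])
toy_2010 = np.array([2.0, 3.0, 1.5, 2.5, 4.0])

summary_1950 = box_plot_summary(toy_1950)
summary_2010 = box_plot_summary(toy_2010)
print(summary_1950)
print(summary_2010)
```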
We then analyzed the data from the two time periods. For each period, we computed descriptive statistics to illustrate the differences in distribution for both Average Live Births per Woman and Average Educational Attainment. The descriptive statistics made it apparent that the two variables moved in opposite directions from 1950 to 2010: Birth Rate fell while Years of School rose.
We ran two linear regressions, one for each of 1950 and 2010, to model the relationship between Birth Rate and Average Years of School per Woman. In both years, we found a strong negative correlation between these two variables. Our results are depicted in the figures below. It was interesting to see the difference in slope between the two regressions, which suggests that the association between the two variables (Educational Attainment and Birth Rate) was more pronounced in 2010 than in 1950.
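A simple linear regression of this kind can be sketched with NumPy alone. The example below is an assumption-laden illustration rather than the code behind our figures: the data are made up, and `fit_line`, `school`, and `births` are hypothetical names. Note that the slope measures how much the birth rate changes per additional year of schooling, while Pearson's r measures how tightly the points follow the line, so the two should be read separately.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares line y = slope * x + intercept, plus Pearson's r."""
    slope, intercept = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]
    return slope, intercept, r

# Hypothetical values: years of school (x) and births per woman (y).
school = np.array([1.0, 2.0, 4.0, 6.0, 8.0])
births = np.array([7.0, 6.5, 5.0, 4.0, 2.5])

slope, intercept, r = fit_line(school, births)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, r={r:.3f}")
```

Running `fit_line` once on the 1950 data and once on the 2010 data would give the two slopes and correlations to compare.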
We also conducted two permutation tests to test two sets of hypotheses. First, we tested whether the mean birth rate was higher for 1950 than for 2010. Second, we tested whether the mean number of years of school per woman was lower for 1950 than for 2010.
The box plots and regression lines shown above illustrate and visualize our data.
The empirical difference of birth rate means between the years 1950 and 2010 is 2.5008 children.
Difference of means for Birth Rates= 2.5008
The empirical difference of educational attainment means between the years 1950 and 2010 is -5.9746 years.
Difference of Means for Years of School= -5.9746
Birth Rate:
H0: Mean1950 Birth Rate <= Mean2010 Birth Rate
HA: Mean1950 Birth Rate > Mean2010 Birth Rate
Educational Attainment:
H0: Mean1950 Educational Attainment >= Mean2010 Educational Attainment
HA: Mean1950 Educational Attainment < Mean2010 Educational Attainment
Using the functions for permutation hypothesis testing, we simulated permutation samples, computed a permutation replicate (the difference of means) from each sample, and compared the replicates with the empirical difference of means. Below are the functions and how we used them for our data.
Function for simulating a permutation sample (Birth Rate):
*For Educational attainment, the same steps were completed, but with differences in how the data was grouped ("Years of Education" instead of "Birth Rate")
Below is a permutation sample function:
import numpy as np

def permutation_sample(data1, data2):
    """Generate a permutation sample from two data sets."""
    # Concatenate the data sets: data
    data = np.concatenate((data1, data2))
    # Permute the concatenated array: permuted_data
    permuted_data = np.random.permutation(data)
    # Split the permuted array into two: perm_sample_1, perm_sample_2
    perm_sample_1 = permuted_data[:len(data1)]
    perm_sample_2 = permuted_data[len(data1):]
    return perm_sample_1, perm_sample_2
After simulating a sample, we used a "for" loop to replicate the sample:
for _ in range(50):
    # Generate permutation samples from the two original data sets
    perm_sample_1, perm_sample_2 = permutation_sample(data1, data2)
Using these functions and other numpy functions, we completed steps 4-6:
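# Extract the 1950 and 2010 birth rates as NumPy arrays
data_birth1950 = data_countries_1950['Birth Rate'].to_numpy()
data_birth1950
data_birth2010 = data_countries_2010['Birth Rate'].to_numpy()
data_birth2010
# Pool both years into one array
data_birth_both = np.concatenate((data_birth1950, data_birth2010))
data_birth_both
len(data_countries_1950['Birth Rate'])
len(data_countries_2010['Birth Rate'])
# Shuffle the pooled data and split it into two pseudo-samples
data_birth_both_perm = np.random.permutation(data_birth_both)
perm_sample_1950b = data_birth_both_perm[:len(data_birth1950)]
perm_sample_2010b = data_birth_both_perm[len(data_birth1950):]
# Repeat 1000 times, storing each difference of means as a replicate
permutation_replicates = np.empty(1000)
for i in range(1000):
    permutation_sample_1950b, permutation_sample_2010b = permutation_sample(data_birth1950, data_birth2010)
    permutation_replicates[i] = np.mean(permutation_sample_1950b) - np.mean(permutation_sample_2010b)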
That gave us this graph to compare the mean birth rates and average years of education between the years 1950 and 2010, and the simulated permutation replicates:
p_val = np.sum(permutation_replicates >= empirical_diff_means) / len(permutation_replicates)
print(f'p={p_val:0.07}')
p=0.00 (Birth Rate)
Since our p-value = 0.00, essentially none of the permutation replicates produced a difference of means as large as the one we observed. In other words, if the birth rates from 1950 and 2010 came from the same distribution, a difference this large would almost never arise by chance (with 1,000 replicates, this corresponds to p < 0.001 rather than a probability of exactly zero). This p-value indicates a statistically significant difference in mean birth rate between 1950 and 2010.
p=0.00 (Educational Attainment)
Since our p-value = 0.00, essentially none of the permutation replicates produced a difference of means as extreme as the one we observed. In other words, if the educational attainment from 1950 and 2010 came from the same distribution, a difference this extreme would almost never arise by chance (with 1,000 replicates, this corresponds to p < 0.001 rather than a probability of exactly zero). This p-value indicates a statistically significant difference in mean educational attainment between 1950 and 2010.
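The two one-sided tests described above can be combined into a single function, sketched below under stated assumptions. This is not the code we ran: `one_sided_perm_test`, its parameters, and the toy arrays are hypothetical, and the function adds an add-one correction to the p-value (a common convention that we did not use in our analysis) so that a finite number of replicates never reports a probability of exactly zero.

```python
import numpy as np

def one_sided_perm_test(data1, data2, n_reps=1000, tail="greater", seed=0):
    """Permutation test on mean(data1) - mean(data2).

    tail="greater" tests HA: mean1 > mean2; tail="less" tests HA: mean1 < mean2.
    """
    rng = np.random.default_rng(seed)
    observed = np.mean(data1) - np.mean(data2)
    pooled = np.concatenate((data1, data2))
    reps = np.empty(n_reps)
    for i in range(n_reps):
        # Shuffle the pooled data, split it, and record the difference of means
        shuffled = rng.permutation(pooled)
        reps[i] = (np.mean(shuffled[:len(data1)])
                   - np.mean(shuffled[len(data1):]))
    if tail == "greater":
        extreme = np.sum(reps >= observed)
    else:
        extreme = np.sum(reps <= observed)
    # Add-one correction (an assumption of this sketch, not our original method)
    p_value = (extreme + 1) / (n_reps + 1)
    return observed, p_value

# Hypothetical birth-rate values, for illustration only.
toy_1950 = np.array([6.1, 5.8, 6.5, 7.0, 5.9, 6.3])
toy_2010 = np.array([2.4, 3.1, 2.0, 2.8, 3.5, 2.2])

obs_greater, p_greater = one_sided_perm_test(toy_1950, toy_2010, tail="greater")
obs_less, p_less = one_sided_perm_test(toy_1950, toy_2010, tail="less")
print(f"difference={obs_greater:.4f}, p(greater)={p_greater:.4f}, p(less)={p_less:.4f}")
```

For Birth Rate the "greater" tail matches the alternative hypothesis, while for Educational Attainment, where the 1950 mean is expected to be lower, the "less" tail would be used.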
Ultimately, we found a negative correlation between birth rate and education level across countries, in both 1950 and 2010. The descriptive statistics showed a general decrease in the number of children born per woman from 1950 to 2010, while the education levels of women of reproductive age increased over the same period. These findings suggest that as the world has grown and developed, there has been more of an emphasis on education and less of an emphasis on reproduction, and we would expect this trend to continue into the future as the world continues to grow and develop. Our permutation tests returned p = 0.00 for both Birth Rate and Years of Education, meaning that differences of means as extreme as those we observed would almost never arise if the data from 1950 and 2010 came from the same distribution. This indicates a statistically significant difference in both Birth Rate and Years of Education between 1950 and 2010.
Although our data displayed a difference between 1950 and 2010 in both educational attainment for women of reproductive age (15-49 years old) and mean birth rate, this iteration of the analysis has limitations. For one, we did not account for confounding variables. Factors such as the income of a given area or its child mortality rate may partly explain the relationship observed between these variables. Another limitation was the lack of more recent data: the difference between 1950 and 2010 was somewhat expected, given the general shift in women's roles over the last 70 years. In addition, some countries were not included because measurements of one or both variables were missing from the data set. Including them could change the descriptive statistics calculated above and, in turn, the results of our further analyses. Finally, the schooling data covered only women aged 15-49. If data were collected for women of all ages, differences might arise not only in the descriptive statistics but also in further analyses such as the linear regressions and permutation tests.
Conducting some variety of MANOVA test would likely have allowed us to better assess the relationships between our three variables of interest, but we were not comfortable implementing such a test in Python.
If we were to run this type of analysis again, we would suggest a few changes to the methodology. First, it would be beneficial to find a data set with values for every entity, so that the analysis would not be missing certain countries. Next, we feel that it would be beneficial to group the entities (i.e. countries) by geographical location, which would make it possible to draw inferences as to why the values are the way they are (social conditions, cultural implications, etc.). This would give us a better understanding of the cultural climate of these different areas and would further our understanding of their birth rates and educational attainment levels. If we found trends corresponding to different geographical locations, we might be able to use the cultural context of those areas to understand the data more holistically. Finally, we feel that it would be best to conduct further research and data collection on other factors and confounding variables, which would allow for a better social and cultural understanding of why mean birth rate and educational attainment change from location to location over time.
Ultimately, after completing the data analysis regarding how educational attainment and birth rate have changed over time, it seems unlikely that this trend occurred solely due to the passage of time. Some other factor has likely influenced the change in these variables, and we believe that this change has come as a result of the empowerment of women over the last 70 years. The values from 1950 must be understood in the context in which they were measured: at that time, there was less of an emphasis on education for women, as there was generally less opportunity for women. Since then, there has been less of an emphasis on reproduction and more of an emphasis on education, as the opportunities for women entering the workforce have increased along with other social changes. To get a better understanding of this relationship, it would be beneficial to assess the cultural, societal, and economic changes that have occurred in the last 70 years in relation to women entering the workforce at a higher rate, with less cultural emphasis on raising and taking care of children.
[1] Lee, and Barro. “Women's Educational Attainment vs. Number of Children per Woman.” Our World in Data, 2019, ourworldindata.org/grapher/womens-educational-attainment-vs-fertility?tab=table.
[2] Countries-ofthe-World. (n.d.). Alphabetical list of countries of the world. List of countries of the world in alphabetical order. https://www.countries-ofthe-world.com/all-countries.html.