Women's Education & Fertility Rates

Tabitha Foster, Grace Hoffman, Marilyn Curtis, & Emma Delphon

PSYC 500 final project

1. Data Curation and Ethics

About our dataset and why we chose it:

Our data came from the 'Our World in Data' website. 'Our World in Data' relies on scholars around the globe to answer important questions while presenting the best available research and data in an understandable and accessible way. Part of their mission is to build an infrastructure that makes research and data openly available and useful for all (source: ourworldindata.org). Our group was interested in how the number of years of education a woman obtains relates to how many children she has. Women's education is an important topic globally, and looking at how education correlates with fertility rates can give insight into women's issues. As our group is all women, this dataset immediately piqued our interest, and we wanted to take a deeper dive into what the data could tell us.

Ethical concerns regarding our data:

Protecting the moral and legal rights of the individuals whose data we use must be our top priority. Maintaining anonymity throughout the entire process, from collection to analysis, protects individuals' personally identifying information. The data collected and analyzed for this project does not include individual-level records; instead, it relies on averages over large groups. It is also openly available, which means we do not need to worry about misusing private consumer data. Because this data is published for anyone to use, we have the legal right to use it as data scientists.

Aside from purely legal considerations, data scientists are bound by moral principles that they must follow. Transparency is a key part of the data analysis process because it helps ensure that data scientists are using the data for its intended purpose and are not misrepresenting their results. Data scientists ensure transparency by showing the process of their analysis. Throughout this project, we will show how we used the data and the output of our analysis. At every step, we will explain our analysis so it can be fully understood. This transparency serves two purposes: it shows that we are using the data for its intended purpose, and it allows for full disclosure of our results so they cannot be misinterpreted. By maintaining full transparency throughout the project, we can mitigate ethical concerns and ensure that individuals' moral and legal rights are protected.

2. Data Preparation

Our project contained three main variables. Fertility rate, found in the “Fertility Rate” column of our data frame, was a numeric variable, meaning that it contained only numbers. Fertility rate was measured as the average number of children born per woman. Mean years of schooling, found in the “Mean years of schooling for women in reproductive age” column of the data frame, was also a numeric variable. Because both are averages, they can take non-integer values, so we treated them as continuous. Our third variable was the 'Entity' variable, which contained a list of the different countries in this dataset. This was a categorical variable, which is a type of variable that consists of names or labels.

We first looked at the size and shape of our data to get a general feel for the dataset we were working with. We also familiarized ourselves with the types of variables we had.
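A minimal sketch of this kind of first look in pandas is shown here. The small data frame is an invented stand-in for the real dataset, and the "Year" column name is an assumption; only the other three column names come from the text above.

```python
import pandas as pd

# Toy stand-in for the real dataset; every value here is invented.
df = pd.DataFrame({
    "Entity": ["A", "A", "B", "B"],
    "Year": [1940, 1960, 1960, 1980],  # "Year" is an assumed column name
    "Fertility Rate": [6.1, 5.4, 3.2, 2.1],
    "Mean years of schooling for women in reproductive age": [1.5, 2.8, 7.9, 10.4],
})

n_rows, n_cols = df.shape   # (number of rows, number of columns)
column_types = df.dtypes    # one dtype per column

print(n_rows, n_cols)
print(column_types)
```

In a check like this, columns holding averages would normally show up as floating-point rather than integer types.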

We restricted the dataframe to data after the year 1949 (above). The majority of our data fell within this period, so it made sense to narrow the range of the data.
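One way to express this kind of year filter in pandas is with a boolean mask. This is only a sketch: the toy values are invented and the "Year" column name is an assumption.

```python
import pandas as pd

# Invented example rows; "Year" is an assumed column name.
df = pd.DataFrame({
    "Entity": ["A", "A", "A", "A"],
    "Year": [1940, 1949, 1950, 1990],
    "Fertility Rate": [6.3, 6.1, 6.0, 3.4],
})

# Keep only rows strictly after 1949, then renumber the index.
recent = df[df["Year"] > 1949].reset_index(drop=True)
print(recent)
```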

3. Exploratory Data Analysis

We formulated three hypotheses for the data:

As time progresses, mean years of schooling for women increases.
The higher the mean years of schooling for women, the lower the fertility rate, as measured by the average number of children born per woman.
As time progresses, the fertility rate decreases.

To begin our data analysis, we looked for general trends in the data.

First, we plotted the mean fertility rate against the year to see if we could identify any trends in the fertility rate over time. We found that the fertility rate decreased over time (below).

We cleaned up the dataframe by dropping any rows containing NaN values (below).

We computed the mean, median, standard deviation, and size of the new dataframe, grouped by year, to gain a better understanding of the variable.
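The cleaning and summary steps described above could look roughly like the following in pandas. The data values are invented, the "Year" column name is an assumption, and dropping rows (rather than columns) is our reading of the step.

```python
import numpy as np
import pandas as pd

# Invented example rows, including one missing fertility value.
df = pd.DataFrame({
    "Year": [1950, 1950, 1950, 2000, 2000],
    "Fertility Rate": [6.0, 5.0, np.nan, 2.0, 3.0],
})

# Drop rows where the fertility rate is missing.
clean = df.dropna(subset=["Fertility Rate"])

# Summary statistics of the fertility rate for each year.
by_year = clean.groupby("Year")["Fertility Rate"].agg(
    ["mean", "median", "std", "size"]
)
print(by_year)
```

The per-year means in `by_year["mean"]` are the values one would plot against the year to look for a trend.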

We then plotted the mean years of schooling against the year to see if there was any identifiable trend. We found that mean years of schooling increased over time (below).

We cleaned up our dataframe by dropping all rows with NaN values. We also generated statistics for the fertility rate to better understand our variable.

We grouped the data by the mean years of schooling and created a plot to represent the fertility rate. The general trend of the plot showed that as mean years of schooling increased, the fertility rate decreased.

We also created a scatterplot to represent this data, which is shown below. We used the raw data for this plot because the data with the NaN rows removed produced a plot that was confusing and visually unappealing.

Next, we looked at the distribution of our categorical variable, 'Entity'. This represents all of the countries included in this dataset. We grouped by our two variables separately: fertility rate and mean years of schooling. That way, we could create two dataframes that showed the distributions of these variables across countries. We displayed these visually with two bar graphs.

We merged the two datasets containing the distributions of fertility rates and mean years of schooling so that they were easier to work with and we could make comparisons.
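A sketch of the per-country grouping and merge in pandas follows. The values are invented, and summarizing each country by its mean is an assumption about how the grouped dataframes were built.

```python
import pandas as pd

schooling_col = "Mean years of schooling for women in reproductive age"

# Invented example rows for two hypothetical countries.
df = pd.DataFrame({
    "Entity": ["A", "A", "B", "B"],
    "Fertility Rate": [6.0, 4.0, 2.0, 3.0],
    schooling_col: [2.0, 4.0, 10.0, 12.0],
})

# One summary dataframe per variable (assumed here: the mean per country).
fertility_by_country = df.groupby("Entity")["Fertility Rate"].mean().reset_index()
schooling_by_country = df.groupby("Entity")[schooling_col].mean().reset_index()

# Join the two summaries on the shared country column.
merged = fertility_by_country.merge(schooling_by_country, on="Entity")
print(merged)
```

Merging on 'Entity' keeps one row per country with both averages side by side, which makes a country-level scatterplot straightforward.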

We created a scatterplot to show the correlation between the distributions of mean years of schooling and fertility rates. This scatterplot reinforced our earlier judgement that fertility rates decrease as mean years of schooling increase.

The next step in our analysis was to use linear regression and hypothesis testing to observe the differences in the distributions of our variables. We loaded the functions necessary for our analysis.

Below we have the linear regression for our two continuous variables: mean years of schooling and fertility rate (as measured by the average number of children born per woman). As we can see, there is a negative correlation between the two variables; as average years of schooling increases, fertility rate generally decreases.
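A straight-line fit of this kind can be sketched with NumPy's `np.polyfit`. The arrays are invented stand-ins for the two columns, and the choice of `np.polyfit` is an assumption, not necessarily the function used in the project.

```python
import numpy as np

# Invented stand-ins for the two continuous variables.
schooling = np.array([1.0, 2.0, 4.0, 6.0, 8.0, 10.0])
fertility = np.array([6.5, 6.0, 5.0, 3.8, 3.0, 2.0])

# Least-squares line: fertility ~ slope * schooling + intercept.
slope, intercept = np.polyfit(schooling, fertility, 1)

# Pearson correlation coefficient between the two variables.
r = np.corrcoef(schooling, fertility)[0, 1]

print(f"slope={slope:.3f}, intercept={intercept:.3f}, r={r:.3f}")
```

A negative slope and a negative `r` both indicate the downward trend described above; neither says anything about causation.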

Next, we created bootstrap replicates of each variable's data. This was done by drawing random samples, with replacement, from the observed data, each the same size as the original. By resampling the data in this way, we were able to estimate confidence intervals and standard errors.
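The resampling idea can be sketched as follows. The helper name `bootstrap_replicates`, the data values, the number of replicates, and the 95% interval are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Invented stand-in for one variable's observed values.
fertility = np.array([6.5, 6.0, 5.0, 3.8, 3.0, 2.0, 1.8, 4.4])

def bootstrap_replicates(data, func, size, rng):
    """Apply func to `size` resamples of data drawn with replacement."""
    reps = np.empty(size)
    for i in range(size):
        sample = rng.choice(data, size=len(data), replace=True)
        reps[i] = func(sample)
    return reps

reps = bootstrap_replicates(fertility, np.mean, 10_000, rng)

standard_error = np.std(reps)                        # spread of the replicates
ci_low, ci_high = np.percentile(reps, [2.5, 97.5])   # 95% percentile interval
print(standard_error, ci_low, ci_high)
```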

Here, we found the mean of each dataset. We were then able to find the difference between them and compare it to the bootstrap replicates. By comparing it to the sample data, we were able to determine whether the difference in means was within standard error, or whether the distribution was the same for both variables. As the distributions were almost identical, we got a p-value of 1.0.

The final step of our analysis was to conduct a two-sample bootstrap hypothesis test. We selected bootstrap instead of permutation because we did not know whether our distributions were normal. To begin, we needed to decide what exactly we wanted to test, and we decided to split the distribution of mean years of schooling into two categories: high and low. This was done by sorting the mean years of schooling in descending order and then splitting the top and bottom halves of the distribution into separate dataframes. From there, we wanted to test the difference between the distributions of fertility rates in the high and low categories.
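The sort-and-split step could be sketched like this in pandas. The rows are invented, and the handling of an odd number of rows (the middle row goes to the low group) is an assumption of this sketch.

```python
import pandas as pd

schooling_col = "Mean years of schooling for women in reproductive age"

# Invented example rows for six hypothetical countries.
df = pd.DataFrame({
    "Entity": ["A", "B", "C", "D", "E", "F"],
    schooling_col: [2.0, 11.0, 5.0, 9.0, 3.0, 12.0],
    "Fertility Rate": [6.0, 2.0, 4.5, 2.5, 5.5, 1.8],
})

# Sort by schooling, highest first, then cut the table in half.
ranked = df.sort_values(schooling_col, ascending=False).reset_index(drop=True)
half = len(ranked) // 2

high = ranked.iloc[:half]   # top half: more schooling
low = ranked.iloc[half:]    # bottom half: less schooling
print(high["Entity"].tolist(), low["Entity"].tolist())
```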

After sorting the data, we created a cumulative distribution function (CDF) of both the high and low distributions to visually compare the two.
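An empirical CDF of a sample can be computed in a few lines. The function name `ecdf` and the sample values are illustrative assumptions.

```python
import numpy as np

def ecdf(data):
    """Return sorted values x and the fraction of points at or below each x."""
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Invented fertility rates for one of the two groups.
x, y = ecdf(np.array([2.5, 1.8, 2.0, 3.1]))
print(x, y)
```

Plotting `y` against `x` for the high and low groups on the same axes gives the visual comparison described above.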

We then organized our data using NumPy, and we looked at the means separately as well as the difference in means.

We then stated our null hypothesis for testing:

H0: The distributions of fertility rates based on high and low levels of schooling are identical.

We then simulated data assuming that H0 is true (a bootstrap sample). This gives us our shifted data, which we plotted against the raw mean data.

We calculated a replicate from the simulated dataset. From there, we repeated the above steps by finding the difference of means, which we then compared to the observed value to make our decision.

A p-value of 0.0001 tells us that, if the distributions of fertility rates for low and high levels of schooling were identical, there would be about a 0.01% chance of observing a difference of means at least as large as the one in our data. Therefore, we can reject the null hypothesis and conclude that the distributions are not identical.
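The testing steps above can be sketched as a two-sample bootstrap test on shifted data. This is a sketch under stated assumptions, not the project's actual code: the two arrays are invented, the number of replicates is arbitrary, and the test is one-sided (low-schooling mean minus high-schooling mean).

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Invented fertility rates for the low- and high-schooling groups.
low = np.array([6.0, 5.5, 4.5, 5.8, 6.2, 4.9])
high = np.array([2.0, 2.5, 1.8, 3.0, 2.2, 2.7])

observed_diff = np.mean(low) - np.mean(high)

# Shift both groups to share the combined mean, simulating the null hypothesis.
combined_mean = np.mean(np.concatenate([low, high]))
low_shifted = low - np.mean(low) + combined_mean
high_shifted = high - np.mean(high) + combined_mean

# Resample each shifted group with replacement and record the difference of means.
n_reps = 10_000
diffs = np.empty(n_reps)
for i in range(n_reps):
    low_sample = rng.choice(low_shifted, size=len(low_shifted), replace=True)
    high_sample = rng.choice(high_shifted, size=len(high_shifted), replace=True)
    diffs[i] = np.mean(low_sample) - np.mean(high_sample)

# Fraction of simulated differences at least as large as the observed one.
p_value = np.mean(diffs >= observed_diff)
print(observed_diff, p_value)
```

Strictly speaking, shifting both groups to a common mean tests whether the two means are equal, which is a narrower claim than the distributions being identical.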

4. Discussion

Summary

The objective of our project was to examine how mean years of schooling for women of reproductive age and fertility rate relate to each other. To complete our analysis, we pulled data from the Our World in Data website and loaded it into Google Colab. Once the data was loaded, we examined the relationships among three variables: fertility rate, mean years of schooling for women, and country. We found a strong, positive correlation between time and mean years of schooling, a significant negative correlation between time and fertility rate, and a significant negative correlation between fertility rate and mean years of schooling.

Limitations and Future Directions

While data scientists strive to be free of limitations, some still exist for this project. At a broad level, one limitation is in how we interpreted the data. While we can make conjectures, we do not have the ability to give the “why”. Our data did show strong correlations between the different variables, but one must always note that correlation does not equal causation. This is because there will always be extraneous, confounding variables that affect our data, though as data scientists we try to eliminate as many as we can. Because of this, while we can discuss implications for the future, we must note, and possibly further analyze, the other factors that affect our conjectures.

We also recognize limitations specific to our dataset. Because our data comes from around the world and across a large period of time, one confounding variable is the difference in cultures and time periods. For example, more traditional cultures may value education but also put a different or greater emphasis on motherhood. Another limitation is that while we did see general trends in our data, there were outliers that did not follow these trends. These outliers can skew our overall statistics for each variable. Outliers can arise for a number of reasons, such as conflicts in an area or shifts in the political climate. Our other specific limitation was how our data was gathered. As our data was sourced internationally, differences in language interpretation could skew the results. Furthermore, there are surely rural parts of the world that this survey did not reach.

Building on this project, we would want to branch out into analysis of other variables. For example, we could analyze trends between our data and different socioeconomic statuses, or compare our data to the other extraneous factors noted above, such as how conflict affects education rates or fertility rates.

Implications

After obtaining the results of our data analysis, we were able to ask ourselves: how is this beneficial to women? Looking at the general trends, there are a couple of implications. First, we saw that as mean years of schooling increased, fertility rate decreased. With this trend in mind, we can recommend potential policies and interventions. One possible intervention is the implementation of outreach programs that inform younger girls of the importance of obtaining higher education. While we don’t want to discourage them from having children, we could put into place an educational program that informs younger girls about safe sex and the implications of having more children. Based on the data, we also recommend policies that incentivize more women from lower-income or less developed areas to obtain more education, possibly before starting a family.