In 2014, a non-profit called Open Sourcing Mental Illness (OSMI) began giving questionnaires to attendees at various technology conferences. OSMI is dedicated to raising awareness, educating, and providing resources to support mental wellness in the tech and open source communities.
I am looking at the dataset from 2016, which contains 63 questions and responses from 1434 participants, recorded as a CSV. I chose this topic because I have dealt with mental health issues myself, and since I will be working in the data science field, I thought it would be interesting to see a snapshot of the mental health of people in the technology industry.
OSMI has already done some exploratory data analysis. In related literature, I found newspaper articles that looked at the OSMI data from 2016 [1][2][3]. The articles don't focus on the specific mental illnesses of the participants. I have therefore chosen to investigate the question: can I predict what mental health issues people have by analyzing this data?
Exploratory data analysis for my Phase II project has yielded interesting results. After importing the CSV, I had to make various changes to the columns in my dataset. The gender column had 70 unique values, which meant I needed to recode it into a small set of categories. I also recoded the age column, which contained a few nonsensical entries, as well as the column asking what condition(s) respondents had. I dropped the columns in which more than half of the observations were missing. Some rows needed changes as well: I dropped the rows where people either didn't answer or stated they didn't have a mental disorder, leaving approximately 600 entries to work with.
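The cleaning steps above can be sketched in pandas. This is a minimal illustration on a toy frame, not the actual cleaning script: the column names, gender categories, and age cutoffs here are assumptions standing in for the real survey's longer question text.

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the OSMI survey export; the real CSV has
# 63 columns and far more rows, and these column names are illustrative.
df = pd.DataFrame({
    "gender": ["Male", "male", "M", "F", "Female", "woman"],
    "age": [29, 34, 3, 99999, 41, 27],
    "condition": ["Anxiety Disorder", None, "Mood Disorder", None, "PTSD", None],
    "mostly_missing": [None, None, None, None, 1, None],
})

# Collapse the free-text gender field (70 unique values in the real data)
# into a small set of categories.
gender_map = {"male": "male", "m": "male", "f": "female",
              "female": "female", "woman": "female"}
df["gender"] = (df["gender"].str.strip().str.lower()
                .map(gender_map).fillna("other"))

# Treat implausible ages as missing (cutoffs are illustrative).
df.loc[(df["age"] < 15) | (df["age"] > 99), "age"] = np.nan

# Drop columns where more than half of the observations are missing.
df = df.loc[:, df.isna().mean() <= 0.5]

# Keep only rows where a condition was actually reported.
df = df.dropna(subset=["condition"])
```

On the toy frame this drops the mostly-missing column and keeps only the three rows with a reported condition, mirroring the row and column filtering described above.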
I imputed the missing values for the remaining columns using the mode and then split the data into training and test sets. Because respondents could report more than one condition, I used a MultiLabelBinarizer to encode the condition labels before modeling. I ended up with a Naive Bayes model with 88% accuracy; a stochastic gradient descent model reached 99% accuracy.
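A sketch of the label-encoding and modeling step, using synthetic stand-in data (the real features are the recoded survey answers). The MultiLabelBinarizer turns each respondent's list of conditions into a multi-hot row, and a one-vs-rest wrapper then fits one Naive Bayes classifier per condition; the feature matrix and condition lists here are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in: 600 respondents, 10 binarized survey answers as
# features, and one or more reported conditions as labels.
X = rng.integers(0, 2, size=(600, 10))
conditions = [["Anxiety Disorder"], ["Mood Disorder"],
              ["Anxiety Disorder", "Mood Disorder"], ["PTSD"]]
y_raw = [conditions[i % 4] for i in range(600)]

# Each list of conditions becomes a multi-hot row, so a respondent
# can be labelled with several disorders at once.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(y_raw)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# One Naive Bayes classifier per condition via one-vs-rest.
clf = OneVsRestClassifier(MultinomialNB()).fit(X_train, y_train)
print("subset accuracy:", clf.score(X_test, y_test))
```

The mode imputation mentioned above can be a pandas one-liner such as `df.fillna(df.mode().iloc[0])`; an SGD model drops in by swapping `MultinomialNB()` for `SGDClassifier()`.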
I modified the hypothesis to explore whether I can predict if someone currently has a mental health issue, based on their responses to the question "Do you currently have a mental health issue?" That meant dropping the "maybe" responses from the dataset, because I needed a definitive "yes" or "no". For reference, mental health issues include mood disorders, anxiety disorders, post-traumatic stress disorder, etc.
Under this new hypothesis, I built several models when evaluating the data, including a Decision Tree and a Random Forest Classifier. The models had moderate success: the Decision Tree had an accuracy of 83% and the Random Forest Classifier 87%. I also built confusion matrices for all the models. The Decision Tree's confusion matrix showed 184 correct predictions and 38 incorrect ones; the Random Forest's showed 193 correct and 29 incorrect.
As mentioned in the previous phase, I also used Naive Bayes, which now had 82% accuracy, and Stochastic Gradient Descent, with 86% accuracy. The Naive Bayes confusion matrix showed 182 correct predictions and 40 incorrect ones; the Stochastic Gradient Descent matrix showed 190 correct and 32 incorrect.
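The evaluation loop for these four models can be sketched as follows. The data here is a synthetic stand-in (a toy rule links features to target so the models have something to learn), so the printed counts will not match the real results above; only the 222-row test set size is chosen to mirror the reported confusion matrices.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(1)

# Synthetic stand-in: binary survey answers and a yes/no target for
# "Do you currently have a mental health issue?".
X = rng.integers(0, 2, size=(888, 10))
y = X[:, 0] | X[:, 1]  # toy rule so the models have signal to learn

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)  # 222-row test set

models = {
    "decision tree": DecisionTreeClassifier(random_state=1),
    "random forest": RandomForestClassifier(random_state=1),
    "naive bayes": BernoulliNB(),
    "sgd": SGDClassifier(random_state=1),
}
for name, model in models.items():
    preds = model.fit(X_train, y_train).predict(X_test)
    cm = confusion_matrix(y_test, preds)
    # Diagonal of the confusion matrix = correct predictions.
    correct, wrong = np.trace(cm), cm.sum() - np.trace(cm)
    print(f"{name}: {correct} correct, {wrong} incorrect")
```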
I built several models that can predict whether someone has a mental illness based on their responses to a questionnaire: Naive Bayes, Stochastic Gradient Descent, Decision Trees, and Random Forest. These models had varying levels of accuracy.
There were some limitations in the questionnaire and the dataset.
One limitation is that there were only around 32 responses from individuals who identified as genderqueer/other, which is a small sample. For this reason, the predicted incidence rates for genderqueer/other individuals are much more uncertain than the predictions for men or women.
There is also the potential for selection bias in the dataset. Because the survey is based on self-reported information, the people who chose to respond may have higher (or lower) rates of mental health issues than the industry as a whole. More data and further analysis would strengthen the results of this study.
For future work I would want to analyze a greater number of responses from genderqueer/other individuals. I would also want to establish greater reliability of the starting dataset to limit the self-selection bias problem. I believe it would also be beneficial to construct a neural network to validate some of the other models. PyTorch could be used to build a simple network with a single linear layer; since this is a binary classification problem, a sigmoid output (or a loss such as BCEWithLogitsLoss, which applies the sigmoid internally) is all that is needed, with no hidden layers requiring tanh or softmax activations.
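Such a one-layer network might look like the sketch below. This is an assumption about the future design, not built code: the feature count, learning rate, and toy target are all illustrative, and BCEWithLogitsLoss supplies the sigmoid.

```python
import torch
from torch import nn

torch.manual_seed(0)

# One linear layer over 10 binarized survey answers (count illustrative);
# BCEWithLogitsLoss applies the sigmoid internally, so no activation is
# attached to the layer itself.
model = nn.Linear(10, 1)
loss_fn = nn.BCEWithLogitsLoss()
optim = torch.optim.SGD(model.parameters(), lr=0.5)

# Toy data: the target simply mirrors the first survey answer, so the
# network has an easy linear signal to recover.
X = torch.randint(0, 2, (600, 10)).float()
y = X[:, 0].unsqueeze(1)

for _ in range(500):
    optim.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optim.step()

preds = (model(X) > 0).float()  # logit > 0 <=> sigmoid > 0.5
accuracy = (preds == y).float().mean().item()
```

On the real data the training set would be the same recoded features used by the sklearn models, so the network's accuracy could be compared directly against theirs.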
It might also be useful to perform a cluster analysis on these responses, since the resulting clusters could reveal profiles of different perspectives on how the industry is dealing with mental health issues.
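One way to do this would be k-means over the binarized responses, as sketched below on synthetic stand-in data; the number of clusters (3) is an arbitrary assumption that would need tuning (e.g. by silhouette score) on the real responses.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

# Synthetic stand-in for binarized survey responses; with real data each
# cluster center would sketch a "profile" of respondents with similar
# attitudes toward mental health at work.
X = rng.integers(0, 2, size=(600, 10))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=2).fit(X)
labels = kmeans.labels_             # cluster assignment per respondent
profiles = kmeans.cluster_centers_  # per cluster: fraction answering "yes"
```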