Dixit Kapuriya - EDA and Predicting the number of Suicides

EDA and Predicting the number of Suicides

Can we save lives with ML?

Source

Introduction

Suicide is the third leading cause of death among young adults worldwide. Around 800,000 people die by suicide every year, and for every suicide there are many more people who attempt it. A prior suicide attempt is the single most important risk factor for suicide in the general population.

79% of global suicides occur in low- and middle-income countries. Pesticide ingestion, hanging, and firearms are among the most common methods of suicide globally.

In the last 45 years, suicide rates have increased by 60% worldwide. This project uses machine learning algorithms in Python to build a model from thousands of data points and then tests that model.

Prerequisites:

Python 3.x
Understanding of libraries (scikit-learn, pandas, Flask, Matplotlib, Seaborn)
Google Colab
Basic understanding of machine learning regression and classification algorithms

Problem Description:

Suicides are becoming ever more common around the world, as the figures in the introduction show. The project analyzes the correlation between each variable and the suicide rate, and it trains models whose predictions are often close to the true results.

Data Set Information:

We used the suicide rates dataset from Kaggle.

Russell Yates, with the username “Rusty” on the data platform Kaggle, compiled a dataset of about 27,800 records covering 1985 to 2016.

The data is derived from sources including the United Nations Development Program (data from 2018), the World Bank (data from 2018), and the World Health Organization (data from 2018).

Link:

https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016

Data description:

The data set has the following attributes:

Independent Variables:

- Country: Country of the citizen. Each country can have different awareness and prevention programs to limit suicides, so this should help determine the likelihood of a suicide.

- Year: Year of the suicide in question. Huge events that occur in a single year might impact self-harm rates. The program should also be able to see general trends and make predictions accordingly.

- Sex: Gender of the suicide victim. There may be slight differences in suicide rates depending on the gender of the victim.

- Age: Age of the victim during suicide. There may be varying self-harm rates between people of different ages.

- HDI: Human Development Index. This number assesses the overall development of a country, not economic growth alone. Factors that affect the HDI include, but are not limited to, life expectancy at birth, government policy choices, and the expected years of schooling. The HDI of a country should have an impact on suicide rates.

- GDP per capita: A country's gross domestic product divided by its population. It correlates with the wealth of the average citizen and should correlate with people's general happiness as well as suicide rates.

- Suicide Rate Per 100K: Number of suicides per 100k people within the same demographics.

- Population: Population of the demographic group in the given country in a particular year.

- Generation: Generation of the victim, for example Generation X or Baby Boomers. These are converted into a numerical format that the program can analyze. People born at different times may have a different likelihood of suicide, even when other factors are the same.

Dependent Variable:

- Suicide Number: Number of suicides in a particular country within the same demographics. The machine learning algorithm will attempt to see how much each independent variable affects the suicide number.
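The link between the suicide number, the population, and the rate per 100k can be written out directly. This is a minimal sketch, assuming the rate is the count divided by the group's population and scaled to 100,000 people. The function name and the sample figures are illustrative and are not taken from the dataset.

```python
def rate_per_100k(suicides: int, population: int) -> float:
    """Suicides per 100,000 people in one demographic group."""
    if population <= 0:
        raise ValueError("population must be positive")
    return suicides / population * 100_000


# Hypothetical group: 21 suicides among 312,900 people.
example_rate = round(rate_per_100k(21, 312_900), 2)
```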

Flow of data:

Import Dataset:

To import a CSV dataset, we used the function pd.read_csv(). Its basic argument is the path or URL of the file:

Example:

import pandas as pd

path = "URL"

data = pd.read_csv(path)

This dataset contains many string values, continuous values, and some null values, so we first have to clean the data.
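One possible cleaning step is sketched below. It is not necessarily the procedure used in this project: it assumes that mostly empty columns are dropped, remaining rows with nulls are removed, and string columns are turned into integer category codes. The tiny frame and its column names are hypothetical stand-ins for the Kaggle file.

```python
import pandas as pd

# Tiny synthetic frame standing in for the Kaggle file; the column names
# are assumptions modelled on the attributes described above.
raw = pd.DataFrame({
    "country": ["Albania", "Albania", "Brazil", "Brazil"],
    "year": [1987, 1987, 2000, 2000],
    "sex": ["male", "female", "male", "female"],
    "age": ["15-24 years", "35-54 years", "15-24 years", "35-54 years"],
    "hdi": [None, None, 0.68, None],
    "gdp_per_capita": [796, 796, 4100, 4100],
    "generation": ["Generation X", "Silent", "Generation X", "Boomers"],
    "suicides_no": [21, 16, 900, 350],
})


def clean(df: pd.DataFrame, max_null_share: float = 0.5) -> pd.DataFrame:
    """Drop sparse columns, drop rows with nulls, encode strings as integers."""
    out = df.copy()
    # Remove columns where more than max_null_share of the values are missing.
    sparse = [c for c in out.columns if out[c].isna().mean() > max_null_share]
    out = out.drop(columns=sparse).dropna()
    # Replace every non-numeric column with integer category codes.
    for col in out.select_dtypes(exclude="number").columns:
        out[col] = out[col].astype("category").cat.codes
    return out


cleaned = clean(raw)
```

Category codes impose an arbitrary ordering on the strings. Tree-based models tolerate this, while for linear regression one-hot encoding may be the safer choice.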

Dataset Visualization:

Relationship between year and suicide number

Relationship between generation and suicide number

Relationship between age group and suicide number

Relationship between gender and suicide number
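The aggregation behind plots like these can be sketched as follows, assuming each chart shows the total suicide number per value of one column. The rows, column names, and helper functions are hypothetical. The plotting helper relies on pandas' Matplotlib-backed plot method and is defined here without being called.

```python
import pandas as pd

# Synthetic rows for illustration; the column names are assumptions.
df = pd.DataFrame({
    "year": [1990, 1990, 1991, 1991],
    "sex": ["male", "female", "male", "female"],
    "suicides_no": [120, 40, 150, 45],
})


def total_by(frame: pd.DataFrame, column: str) -> pd.Series:
    """Total suicide number for each value of the given column."""
    return frame.groupby(column)["suicides_no"].sum()


def plot_total_by(frame: pd.DataFrame, column: str):
    """Bar chart of the totals; needs Matplotlib installed."""
    ax = total_by(frame, column).plot(kind="bar")
    ax.set_xlabel(column)
    ax.set_ylabel("suicide number")
    return ax


by_year = total_by(df, "year")
by_sex = total_by(df, "sex")
```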

Trying Linear Regression:

Linear regression models the relationship between a dependent variable and one or more independent variables as a linear equation. The variable being predicted (the one the equation solves for) is the dependent variable.
Here is the performance of the model.
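A minimal sketch of fitting and scoring a linear regression with scikit-learn follows. It uses synthetic data, so the numbers say nothing about the suicide dataset. The features, coefficients, and train/test split are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import explained_variance_score
from sklearn.model_selection import train_test_split

# Synthetic data: two numeric features and a target that depends on them
# linearly, plus a little noise. None of these numbers come from the dataset.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(0, 0.1, size=200)

# Hold out a quarter of the rows to test the fitted model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
score = explained_variance_score(y_test, model.predict(X_test))
```

The fitted model.coef_ and model.intercept_ should land close to the values used to generate the data, which is a quick sanity check that the pipeline works.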

Trying Decision Trees Regression:

Decision trees use a tree-like model to show possibilities and make decisions by going from branch to branch. We use the decision tree method to estimate the importance of each factor and analyze how much the prediction relies on it.
Here is the performance of the model.
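Reading factor importance from a fitted tree could look like the sketch below. It uses scikit-learn's DecisionTreeRegressor on synthetic data in which only one of two features drives the target. The feature names and numbers are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target follows "population"; "noise_feature" is unrelated.
rng = np.random.default_rng(1)
population = rng.uniform(1e4, 1e6, size=300)
noise_feature = rng.uniform(0, 1, size=300)
X = np.column_stack([population, noise_feature])
y = population * 1e-4 + rng.normal(0, 1.0, size=300)

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

# feature_importances_ sums to 1; a larger share means the tree's splits
# on that feature reduced the error more.
importances = dict(zip(["population", "noise_feature"], tree.feature_importances_))
```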

Trying Random Forest Regression:

Random forest is a supervised learning algorithm that uses ensemble learning for classification and regression. It constructs a multitude of decision trees at training time. The trees are built independently, with no interaction between them, so they can be trained in parallel. The forest outputs the mode of the individual trees' classes (classification) or the mean of their predictions (regression).
Here is the performance of the model.
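The "mean prediction of the individual trees" can be checked directly with scikit-learn's RandomForestRegressor, as in this sketch on synthetic data. The data and hyperparameters are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem; not related to the suicide dataset.
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(200, 3))
y = X[:, 0] ** 2 + 4.0 * X[:, 1] + rng.normal(0, 0.5, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# Predict new points with every tree separately, then average the results.
X_new = rng.uniform(0, 10, size=(5, 3))
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])
mean_of_trees = per_tree.mean(axis=0)

# The forest's own prediction matches the average over its trees.
forest_prediction = forest.predict(X_new)
```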

Comparison of all models:

Here we compare all three models based on their error.
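A comparison of this kind could be set up as in the following sketch, which fits the three model types on the same synthetic split and records two metrics for each. The data, the choice of mean absolute error, and the hyperparameters are assumptions. The resulting numbers are not the project's results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import explained_variance_score, mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with a non-linear target, for illustration only.
rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(400, 3))
y = X[:, 0] * X[:, 1] + rng.normal(0, 0.5, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(random_state=0),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each model on the same training rows and score it on the same test rows.
results = {}
for name, model in models.items():
    predictions = model.fit(X_train, y_train).predict(X_test)
    results[name] = {
        "mae": mean_absolute_error(y_test, predictions),
        "explained_variance": explained_variance_score(y_test, predictions),
    }

# Name of the model with the lowest mean absolute error.
best = min(results, key=lambda name: results[name]["mae"])
```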

Web app to predict suicide cases:

The figure above is a snapshot of the app that we developed. In the app, you enter the country, year, population, age group, gender, generation, suicides per 100k population, and GDP per capita. After entering these details, click the Predict button to predict the number of suicide cases.

The figure above shows the details that we entered in the app as an example.

The figure above shows the app's output, “No of suicide : 1469”: for the example input, the app predicted 1469 suicide cases.
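A prediction endpoint for such an app might be structured like the sketch below. It assumes the app is built with Flask (listed in the prerequisites) and accepts JSON. The route, the field names, and the predict_suicides function are hypothetical, and the function is a simple stand-in for the trained model rather than the model itself.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Input fields listed in the text; the key names are hypothetical.
FIELDS = [
    "country", "year", "population", "age_group",
    "gender", "generation", "suicides_per_100k", "gdp_per_capita",
]


def predict_suicides(features: dict) -> int:
    """Stand-in for a trained model: applies the rate per 100k to the population."""
    rate = float(features["suicides_per_100k"])
    population = float(features["population"])
    return round(rate * population / 100_000)


@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(silent=True) or {}
    # Reject requests that leave out any of the expected inputs.
    missing = [name for name in FIELDS if name not in payload]
    if missing:
        return jsonify({"error": "missing fields", "fields": missing}), 400
    return jsonify({"suicides": predict_suicides(payload)})
```

In a real deployment the stand-in function would be replaced by a call to the fitted model's predict method, with the string inputs encoded the same way as during training.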

Conclusion:

This project investigates the pattern of suicide cases worldwide. From the available dataset, we conclude that suicide cases among men are comparatively higher than among women. We also observed that cases among men are concentrated in the 30-44 age group, whereas cases among women are concentrated in the 15-29 age group. To predict the number of suicides, we developed three machine learning models: linear regression, decision tree regression, and random forest regression. Of the three, random forest gives the highest explained variance score, 99.75%.

Source Code

GitHub Repository

Technical Paper

Co-author: Abhi Kaila

Guided by: Priyanka Patel (Asst. Professor, CSPIT, CHARUSAT)
