DO PEOPLE SHOW UP FOR THEIR MEDICAL APPOINTMENTS?

Hello! My name is Babajide Tobiloba, and in this report I investigate the 'noshowappointments' Kaggle dataset to identify trends and patterns in the data. The dataset contains records of over 110,000 medical appointments scheduled in various neighbourhoods.

In this investigation, I identified the relationships between the kind of illness, the neighbourhood, whether the person is funded by the government, and whether the person shows up for the appointment.

There are a few things to note about the data:

Generally, ‘0’ means ‘False’ and ‘1’ means ‘True’
The ‘Scholarship’ column refers to whether the appointee was sponsored by a government fund for citizens
The ‘Neighbourhood’ column refers to the place where the appointment takes place

For this analysis, SQL Server was used for the data cleaning and Tableau for the data visualization.

Let us now begin!

ASK

The following questions will be asked:

Does being handicapped affect whether the appointee shows up?
Does the kind of illness affect whether the appointee shows up?
Does having government funding have any relationship with whether the appointee shows up?
Does the neighbourhood affect whether an appointee will show up?
Is there any relationship between receiving at least one SMS and an appointee showing up?

By the end of this analysis, these questions will have been answered.

PREPARE

In order to complete this investigation, I downloaded the public Kaggle dataset using this link: https://www.kaggle.com/datasets/joniarroba/noshowappointments

The dataset was made public for use by JONIHOPPEN.

PROCESS/CLEAN

The table was imported into my local database ‘Tobiloba’ on my local machine as ‘dbo.appointment’. First, I took a look at the full table to see its dimensions:

-- View the entire table

SELECT * FROM Tobiloba.dbo.appointment;

-- In this table, there are 110,527 rows and 14 columns of data
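As a side note, the dimensions can also be read without returning every row. The following is a minimal sketch, assuming the table sits in the dbo schema of the Tobiloba database as described above; the aliases row_count and column_count are illustrative names of my own:

```sql
-- Count the rows without returning them all
SELECT COUNT(*) AS row_count
FROM Tobiloba.dbo.appointment;

-- Count the columns using the standard INFORMATION_SCHEMA views
SELECT COUNT(*) AS column_count
FROM Tobiloba.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'dbo'
  AND TABLE_NAME = 'appointment';
```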

The column names are:

PatientId
AppointmentID
Gender
ScheduledDay
AppointmentDay
Age
Neighbourhood
Scholarship
Hipertension
Diabetes
Alcoholism
Handcap
SMS_received
No_show

I noticed that the ‘Hipertension’ and ‘Handcap’ columns were misspelled, so I renamed them:

-- Rename the Hipertension and Handcap columns (run inside the Tobiloba database)

EXEC sp_rename 'dbo.appointment.Hipertension', 'Hypertension', 'COLUMN';

EXEC sp_rename 'dbo.appointment.Handcap', 'Handicap', 'COLUMN';

Next, I had to check for missing and incorrect values. The ‘Scholarship’, ‘Hypertension’, ‘Diabetes’, ‘Alcoholism’, ‘Handicap’, and ‘SMS_received’ columns should only contain values of 0 and 1:

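-- Check for missing or incorrect values
-- (NULL has to be tested with IS NULL; under the default ANSI_NULLS setting, a comparison such as Gender = NULL never matches)

SELECT * FROM Tobiloba.dbo.appointment
WHERE Gender IS NULL
OR AppointmentDay IS NULL
OR Age IS NULL
OR Neighbourhood IS NULL
OR Scholarship IS NULL OR Scholarship NOT IN (0, 1)
OR Hypertension IS NULL OR Hypertension NOT IN (0, 1)
OR Diabetes IS NULL OR Diabetes NOT IN (0, 1)
OR Alcoholism IS NULL OR Alcoholism NOT IN (0, 1)
OR Handicap IS NULL OR Handicap NOT IN (0, 1)
OR SMS_received IS NULL OR SMS_received NOT IN (0, 1)
OR No_show IS NULL;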

I found that there are 199 rows with either null or incorrect values, so I deleted the rows:

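-- Delete the rows with missing or incorrect values

DELETE FROM Tobiloba.dbo.appointment
WHERE Gender IS NULL
OR AppointmentDay IS NULL
OR Age IS NULL
OR Neighbourhood IS NULL
OR Scholarship IS NULL OR Scholarship NOT IN (0, 1)
OR Hypertension IS NULL OR Hypertension NOT IN (0, 1)
OR Diabetes IS NULL OR Diabetes NOT IN (0, 1)
OR Alcoholism IS NULL OR Alcoholism NOT IN (0, 1)
OR Handicap IS NULL OR Handicap NOT IN (0, 1)
OR SMS_received IS NULL OR SMS_received NOT IN (0, 1)
OR No_show IS NULL;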
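When checking for missing values, it helps to remember how SQL Server compares against NULL. The sketch below is a small illustration on a throwaway table variable, not part of the cleaning itself; the name @demo and its three rows are made up, and the behaviour shown assumes the default ANSI_NULLS ON setting:

```sql
-- Hypothetical three-row table: one row has a missing Gender
DECLARE @demo TABLE (id int, Gender char(1) NULL);

INSERT INTO @demo (id, Gender)
VALUES (1, 'F'), (2, NULL), (3, 'M');

-- With ANSI_NULLS ON, "= NULL" evaluates to UNKNOWN for every row,
-- so this count is 0 even though one Gender is missing
SELECT COUNT(*) AS matched_with_equals
FROM @demo
WHERE Gender = NULL;

-- IS NULL is the predicate that actually finds the missing value,
-- so this count is 1
SELECT COUNT(*) AS matched_with_is_null
FROM @demo
WHERE Gender IS NULL;
```

The same reasoning applies to every column in the check: only IS NULL (or IS NOT NULL) reliably detects missing values.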

Next, I checked the number of rows in the dataset:

-- Total number of appointees

SELECT COUNT(*) AS Appointees FROM Tobiloba.dbo.appointment;

ANALYZE

In this phase of the analysis, I began to explore the data. First, I checked the number of male and female appointees in the dataset:

-- Total number of male and female appointees

SELECT COUNT(Gender) AS Count_Gender, Gender FROM Tobiloba.dbo.appointment

GROUP BY Gender;

Then, I checked the percentage of male and female appointees:

-- Percentage of male appointees

SELECT (SELECT COUNT(Gender) FROM Tobiloba.dbo.appointment WHERE Gender = 'M') * 100.0
/ (SELECT COUNT(Gender) FROM Tobiloba.dbo.appointment) AS Male_percent;

-- Percentage of female appointees

SELECT (SELECT COUNT(Gender) FROM Tobiloba.dbo.appointment WHERE Gender = 'F') * 100.0
/ (SELECT COUNT(Gender) FROM Tobiloba.dbo.appointment) AS Female_percent;

The results showed that there were considerably more female than male appointees.
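The counts and percentages per gender could also be produced in a single pass. This is a minimal sketch using a window aggregate over the grouped counts; the alias Gender_percent and the two-decimal rounding are my own choices for illustration:

```sql
-- One row per gender with its count and its share of all rows
SELECT Gender,
       COUNT(*) AS Count_Gender,
       CAST(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () AS decimal(5, 2)) AS Gender_percent
FROM Tobiloba.dbo.appointment
GROUP BY Gender;
```

SUM(COUNT(*)) OVER () adds up the per-group counts, so each group is divided by the table total without a second scan written by hand.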

Next, I found the average age of an appointee:

-- Find the average age of an appointee
-- (multiplying by 1.0 keeps AVG from truncating if Age was imported as an integer column)

SELECT ROUND(AVG(Age * 1.0), 0) AS Average_Age
FROM Tobiloba.dbo.appointment;

Then, I found the most common neighbourhood:

SELECT Neighbourhood AS Mode, COUNT(*) AS Count

FROM Tobiloba.dbo.appointment

GROUP BY Neighbourhood

HAVING COUNT(*) >= ALL

(SELECT COUNT(*) FROM Tobiloba.dbo.appointment GROUP BY Neighbourhood);
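An alternative way to express the same idea is to sort the grouped counts and keep the top row. The sketch below is one possible formulation, with Appointment_count as an illustrative alias; WITH TIES keeps every neighbourhood that shares the highest count, which matches the behaviour of the >= ALL version:

```sql
-- Neighbourhood(s) with the highest number of rows
SELECT TOP (1) WITH TIES
       Neighbourhood,
       COUNT(*) AS Appointment_count
FROM Tobiloba.dbo.appointment
GROUP BY Neighbourhood
ORDER BY COUNT(*) DESC;
```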

Next, I wanted to check the dates of the first and last appointments, but first I had to change the data type of the two date columns to ‘date’:

-- Change the ScheduledDay and AppointmentDay columns to the date type

ALTER TABLE Tobiloba.dbo.appointment

ALTER COLUMN ScheduledDay date;

ALTER TABLE Tobiloba.dbo.appointment

ALTER COLUMN AppointmentDay date;

-- Find the first and last appointment date

SELECT MIN(AppointmentDay) AS first_day,

MAX(AppointmentDay) AS last_day

FROM Tobiloba.dbo.appointment;
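A type change like the one above fails if any stored value cannot be converted, so it can be worth looking for such values beforehand. The following is a minimal sketch on made-up data, assuming the column was imported as text; the table variable @raw and its values are purely hypothetical, and TRY_CONVERT requires SQL Server 2012 or later:

```sql
-- Hypothetical text column holding two valid dates and one bad value
DECLARE @raw TABLE (ScheduledDay varchar(30) NULL);

INSERT INTO @raw (ScheduledDay)
VALUES ('2016-04-29'), ('2016-05-03'), ('not a date');

-- TRY_CONVERT returns NULL instead of raising an error,
-- so this lists only the values that would break ALTER COLUMN
SELECT ScheduledDay
FROM @raw
WHERE ScheduledDay IS NOT NULL
  AND TRY_CONVERT(date, ScheduledDay) IS NULL;
```

Running the same WHERE clause against the real table before the ALTER TABLE statements would show whether any rows need attention first.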

Next, I found the ages of the youngest and oldest appointees:

-- Find the youngest and oldest ages

SELECT MIN(Age) AS youngest, MAX(Age) AS oldest

FROM Tobiloba.dbo.appointment;

/* The youngest age is shown as -1, which is not a valid age, so the row has to be removed */

-- Delete rows where the Age column has values less than 0

DELETE FROM Tobiloba.dbo.appointment
WHERE Age < 0; -- only one row is affected

-- Check the youngest and oldest again to confirm

SELECT MIN(Age) AS youngest, MAX(Age) AS oldest

FROM Tobiloba.dbo.appointment;
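A destructive statement such as the DELETE above can be rehearsed inside a transaction before it is made permanent. This is a minimal sketch on a temporary table with invented rows (#appointment_demo and its values are hypothetical), not a record of what was run during this analysis:

```sql
-- Hypothetical temp table with one invalid age
CREATE TABLE #appointment_demo (AppointmentID int, Age int);

INSERT INTO #appointment_demo (AppointmentID, Age)
VALUES (1, 62), (2, -1), (3, 0), (4, 115);

BEGIN TRANSACTION;

DELETE FROM #appointment_demo
WHERE Age < 0;

-- Number of rows removed by the DELETE just above (1 for this demo data)
SELECT @@ROWCOUNT AS rows_deleted;

-- Inspect the result while the change is still uncommitted
SELECT MIN(Age) AS youngest, MAX(Age) AS oldest
FROM #appointment_demo;

-- Undo the rehearsal; use COMMIT TRANSACTION instead once the result looks right
ROLLBACK TRANSACTION;

DROP TABLE #appointment_demo;
```

A temporary table is used here rather than a table variable because table variables are not rolled back with the transaction.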

Next, I checked the number of neighbourhoods in the dataset:

-- Check the number of neighbourhoods in the database

SELECT COUNT(DISTINCT(Neighbourhood)) AS Count_Neighbourhood

FROM Tobiloba.dbo.appointment;

Then, I explored the data further to find the percentages of people with different conditions who attended their appointments:

I found the percentage of appointees that showed up for their appointments:

-- Find the percentage of appointees that showed up