Hello! My name is Babajide Tobiloba and in this report, I investigated the 'noshowappointment' Kaggle dataset to identify trends and patterns in the data. The dataset contains specific data about over 110,000 people who scheduled appointments in various neighbourhoods and in this investigation, I identified the relationships between the parameters.
In this investigation, I identified the relationships between the kind of illness, the neighbourhood, whether the person is funded by the government, and whether the person shows up for the appointment.
There are a few things to note about the data:
Generally, ‘0’ means ‘False’ and ‘1’ means ‘True’
The ‘Scholarship’ column refers to whether the appointee was sponsored by a government fund for citizens
The ‘Neighbourhood’ column refers to the place where the appointment takes place
For the purpose of this analysis, SQL server was used for the data cleaning and Tableau was used for the data visualization.
Let us now begin!
ASK
The following questions will be asked:
Does being handicapped affect whether they show up?
Does the kind of illness affect whether the appointee shows up?
Does having government funding have any relationship with whether the appointee shows up?
Does the neighbourhood affect whether an appointee will show up?
Is there any relationship between receiving at least one SMS and an appointee showing up?
At the end of this analysis, these questions would have been answered.
PREPARE
In order to complete this investigation, I downloaded the public Kaggle dataset using this link: https://www.kaggle.com/datasets/joniarroba/noshowappointments
The dataset was made public for use by JONIHOPPEN.
PROCESS/CLEAN
The table was imported into my local database ‘Tobiloba’ on my local machine as ‘dbo.appointment’. First, I took a look at the full table to see its dimensions:
-- View the entire table
SELECT * FROM Tobiloba.dbo.appointment;
-- In this table, there are 110,527 rows and 14 columns of data
The column names are:
PatientId
AppointmentID
Gender
ScheduledDay
AppointmentDay
Age
Neighbourhood
Scholarship
Hipertension
Diabetes
Alcoholism
Handcap
SMS_received
No_show
I noticed that the ‘Hipertension’ and ‘Handcap’ columns were spelled wrongly, so, I renamed them:
-- Change Hipertension and Handcap column name
EXEC sp_RENAME 'appointment.Hipertension', 'Hypertension', 'COLUMN'
EXEC sp_RENAME 'appointment.Handcap', 'Handicap', 'COLUMN';
Next, I had to check for missing and incorrect values. The ‘Scholarship’, ‘Hypertension’, ‘Diabetes’, ‘Alcoholism’, ‘Handicap’, and ‘SMS_received’ columns should only contain values of 0 and 1:
-- Check for missing values or incorrect values
SELECT * FROM Tobiloba.dbo.appointment
WHERE Gender = NULL OR
AppointmentDay = NULL OR
Age = NULL OR Neighbourhood = NULL
OR Scholarship = null OR Scholarship > 1
OR Hypertension = NULL OR Hypertension > 1
OR Diabetes = NULL OR Diabetes > 1
OR Alcoholism = NULL OR Alcoholism > 1
OR Handicap = NULL OR Handicap > 1
OR No_show = null;
I found that there are 199 rows with either null or incorrect values, so I deleted the rows:
DELETE FROM Tobiloba.dbo.appointment
WHERE Gender = NULL OR
AppointmentDay = NULL OR
Age = NULL OR Neighbourhood = NULL
OR Scholarship = null OR Scholarship > 1
OR Hypertension = NULL OR Hypertension > 1
OR Diabetes = NULL OR Diabetes > 1
OR Alcoholism = NULL OR Alcoholism > 1
OR Handicap = NULL OR Handicap > 1
OR No_show = null;
Next, I checked the number of rows in the dataset:
-- Total number of appointees
SELECT DISTINCT COUNT(*) AS Appointees FROM Tobiloba.dbo.appointment;
ANALYZE
In this phase of the analysis, I began to explore the data. First, I checked the number of male and female appointees in the dataset:
-- Total number of male and female appointees
SELECT COUNT(Gender) AS Count_Gender, Gender FROM Tobiloba.dbo.appointment
GROUP BY Gender;
Then, I checked the percentage of male and female appointees:
-- Percentage of male appointees
SELECT(SELECT COUNT(Gender) FROM Tobiloba.dbo.appointment
WHERE Gender = 'M')*100.0/(SELECT COUNT(Gender)
FROM Tobiloba.dbo.appointment) AS Male_percent;
-- Percentage of female appointees
SELECT(SELECT COUNT(Gender) FROM Tobiloba.dbo.appointment
WHERE Gender = 'F')*100.0/(SELECT COUNT(Gender) AS Female_percent
FROM Tobiloba.dbo.appointment) AS Female_percent;
The results of this showed that there were significantly more male than female appointees.
Next, I found the average age of an appointee:
-- Find the average age of an appointee
SELECT ROUND(AVG(Age),0) AS Average_Age
FROM Tobiloba.dbo.appointment
Then, I found the most common neighbourhood:
SELECT Neighbourhood AS Mode, COUNT(*) AS Count
FROM Tobiloba.dbo.appointment
GROUP BY Neighbourhood
HAVING COUNT(*) >= ALL
(SELECT COUNT(*) FROM Tobiloba.dbo.appointment GROUP BY Neighbourhood);
Next, I wanted to check the date of the first and last appointment but first, I had to change the datatype in the columns to ‘date’:
-- Change the Schedule date and Appointment date to datetime format
ALTER TABLE Tobiloba.dbo.appointment
ALTER COLUMN ScheduledDay date;
ALTER TABLE Tobiloba.dbo.appointment
ALTER COLUMN AppointmentDay date;
-- Find the first and last appointment date
SELECT MIN(AppointmentDay) AS first_day,
MAX(AppointmentDay) AS last_day
FROM Tobiloba.dbo.appointment;
Next, I found the youngest age of an appointee and the oldest age of an appointee:
-- Find the youngest and oldest ages
SELECT MIN(Age) AS youngest, MAX(Age) AS oldest
FROM Tobiloba.dbo.appointment;
/*The youngest age is shown to be -1 which indicates
that it is wrong and has to be cleaned*/
-- Drop rows where the Age column has values less than 0
DELETE FROM Tobiloba.dbo.appointment
WHERE Age < 0; -- only one row is affected
-- Check the youngest and oldest again to confirm
SELECT MIN(Age) AS youngest, MAX(Age) AS oldest
FROM Tobiloba.dbo.appointment;
Next, I checked the number of neighbourhoods in the dataset:
-- Check the number of neighbourhoods in the database
SELECT COUNT(DISTINCT(Neighbourhood)) AS Count_Neighbourhood
FROM Tobiloba.dbo.appointment;
Then, I began to further explore the data deeper to find the percentages of people with different conditions that attended their appointments:
I found the percentage of appointees that showed up for their appointments:
-- Find the percentage of appointees that showed up
SELECT(SELECT COUNT(*) FROM Tobiloba.dbo.appointment
WHERE No_show = 'No')*100.0/(SELECT COUNT(*) FROM Tobiloba.dbo.appointment)
AS Showed_up;
I found the percentage of handicapped appointees that showed up for their appointments:
SELECT(SELECT COUNT(*) FROM Tobiloba.dbo.appointment
WHERE Handicap = 1 AND No_show ='No')*100.0/(SELECT COUNT(*)
FROM Tobiloba.dbo.appointment WHERE Handicap = 1) AS Handicap_showup_percent;
I found the percentage of appointees with hypertension that showed up for their appointments:
SELECT(SELECT COUNT(*) FROM Tobiloba.dbo.appointment
WHERE Hypertension = 1 AND No_show ='No')*100.0/(SELECT COUNT(*)
FROM Tobiloba.dbo.appointment WHERE Hypertension = 1)
AS Hypertension_showup_percent;
I found the percentage of appointees with diabetes that showed up for their appointments:
SELECT(SELECT COUNT(*) FROM Tobiloba.dbo.appointment
WHERE Diabetes = 1 AND No_show ='No')*100.0/(SELECT COUNT(*)
FROM Tobiloba.dbo.appointment WHERE Diabetes = 1)
AS Diabetes_showup_percent;
I found the percentage of appointees battling alcoholism that showed up for their appointments:
SELECT(SELECT COUNT(*) FROM Tobiloba.dbo.appointment
WHERE Alcoholism = 1 AND No_show ='No')*100.0/(SELECT COUNT(*)
FROM Tobiloba.dbo.appointment WHERE Alcoholism = 1)
AS Alcoholism_showup_percent;
I found the percentage of appointees funded by the government that showed up for their appointments:
SELECT(SELECT COUNT(*) FROM Tobiloba.dbo.appointment
WHERE Scholarship = 1 AND No_show ='No')*100.0/(SELECT COUNT(*)
FROM Tobiloba.dbo.appointment WHERE Scholarship = 1)
AS Scholarship_showup_percent;
I found the percentage of appointees who received at least one SMS that showed up for their appointments:
SELECT(SELECT COUNT(*) FROM Tobiloba.dbo.appointment
WHERE SMS_received = 1 AND No_show ='No')*100.0/(SELECT COUNT(*)
FROM Tobiloba.dbo.appointment WHERE SMS_received = 1)
AS SMS;
I then proceeded to create tables showing the relationships that will be used to answer the questions that the investigation depends on:
I grouped the dataset by the gender of the appointee and whether they showed up:
SELECT Gender, No_Show, COUNT(*) AS Gender_Count
FROM Tobiloba.dbo.appointment
GROUP BY No_Show, Gender
ORDER BY No_Show DESC, Gender;
I grouped the dataset by the neighbourhood of the appointee and whether they showed up:
/* Group by neighbourhood and whether they attended
their appointment and save result as a table */
SELECT Neighbourhood, No_Show, COUNT(*) AS Neigh_Count
FROM Tobiloba.dbo.appointment
GROUP BY Neighbourhood, No_Show
ORDER BY Neighbourhood ASC,No_Show DESC;
I grouped the dataset by whether the appointee was funded by the government and whether they showed up:
/* Group by Scholarship and whether they attended
their appointment and save result as a table */
SELECT Scholarship, No_Show, COUNT(*) AS Scholarship_Show
FROM Tobiloba.dbo.appointment
GROUP BY Scholarship, No_show
ORDER BY Scholarship ASC, No_Show DESC
I grouped the dataset by whether the appointee is battling hypertension and whether they showed up:
/* Group by Hypertension and whether they attended
their appointment and save result as a table */
SELECT Hypertension, No_Show, COUNT(*) AS Handicap_Show
FROM Tobiloba.dbo.appointment
GROUP BY Hypertension, No_show
ORDER BY Hypertension ASC, No_Show DESC
I grouped the dataset by whether the appointee is battling diabetes and whether they showed up:
/* Group by Diabetes and whether they attended
their appointment and save result as a table */
SELECT Diabetes, No_Show, COUNT(*) AS Handicap_Show
FROM Tobiloba.dbo.appointment
GROUP BY Diabetes, No_show
ORDER BY Diabetes ASC, No_Show DESC
I grouped the dataset by whether the appointee is battling alcoholism and whether they showed up:
/* Group by Alcoholism and whether they attended
their appointment and save result as a table */
SELECT Alcoholism, No_Show, COUNT(*) AS Handicap_Show
FROM Tobiloba.dbo.appointment
GROUP BY Alcoholism, No_show
ORDER BY Alcoholism ASC, No_Show DESC
I grouped the dataset by whether the appointee is handicapped and whether they showed up:
/* Group by Handicap and whether they attended
their appointment and save result as a table */
SELECT Handicap, No_Show, COUNT(*) AS Handicap_Show
FROM Tobiloba.dbo.appointment
GROUP BY Handicap, No_show
ORDER BY Handicap ASC, No_Show DESC
I grouped the dataset by whether the appointee received at least one SMS and whether they showed up:
/* Group by SMS_received and whether they attended
their appointment and save result as a table */
SELECT SMS_received, No_Show, COUNT(*) AS Handicap_Show
FROM Tobiloba.dbo.appointment
GROUP BY SMS_received, No_show
ORDER BY SMS_received ASC, No_Show DESC
After this analysis, I visualized the data through the use of Tableau desktop. The public dashboard can be found here:
https://public.tableau.com/app/profile/tobiloba.babajide/viz/DoctorsAppointment1/Dashboard1
CONCLUSION
At the end of this analysis, the following things were discovered:
There are 110327 appointees in total, with 35% of all appointees being male and 65% of all appointees’ female.
The youngest appointee is 0 years old, the oldest appointee is 115 years old and the average age of an appointee is 37 years old.
There are 81 neighbourhoods represented in the data and Jardim Camburi is the neighbourhood with the most appointments scheduled.
The first appointment date is April 29, 2016; the last appointment date is June 8, 2016.
In general, approximately 79.8% of people who made appointments showed up for the appointments.
Approximately 76.2% of appointees funded by the government showed up for the appointment.
Approximately 72.4% of appointees who received at least one SMS showed up for the appointment.
Approximately 82.7% of appointees with hypertension showed up for their appointment.
Approximately 82% of appointees with diabetes showed up for their appointments.
Approximately 79.9% of appointees with alcoholism showed up for their appointment.
In general, the trend is that, regardless of condition or funding, about 15–20% of people who scheduled appointments did not make it to the appointment. Although that is not necessarily a terrible figure, it can be improved upon.
The SQL code used for the data cleaning and analysis can be found here.