Datasets are used for both instruction and projects (see the projects page). Datasets have been selected using the following criteria as much as possible:
Is real-world data, publicly available, at no cost
Can be used to develop student data and AI skills that are useful in a business context
Data exploration can be used to tell a compelling story
Has some interest or relevance.
In the descriptions below, a dataset's size generally refers to its number of rows, and its features are its columns.
Snapshot: a publicly available resource providing comprehensive health-related data across Chicago's community areas, encompassing over 160 public health indicators.
Size: 78 Chicago Community Areas (rows), each with ~60 extracted factors (columns)
Features (samples): Per Capita Income, % Unemployment, % Below Poverty, Crowded Housing, Dependents, No HS Diploma, Birth Rate, Teen Birth Rate, Low Birth Weight, Preterm Births, Lung Cancer, Female Breast Cancer, Male Prostate Cancer, Diabetes, Stroke, Tuberculosis
Uses: predictive modeling of factors contributing to "community health" categories such as income, crime, cancer, and crowded housing.
Potential Business Problems:
1. Where should a hospital or clinic open a new branch to maximize community impact?
3. How can insurance companies adjust premiums based on population-level risk?
Snapshot: contains comprehensive loan data from Lending Club, an online marketplace bank that connects borrowers with investors, including borrower information, loan terms, and repayment status.
Uses: modeling of factors that contribute to high- or low-quality loans, and application of those insights to ML-based predictive modeling.
Size: ~500K loan applications (rows)
Features: 131 features including Credit scores, loan amount, debt-to-income ratio, interest rate, loan purpose, job title
Business Scenario:
1. Can we improve an existing ML model to better predict whether a loan will be paid or charged off?
2. Does an ML model that more accurately predicts whether a loan is repaid or not actually increase the profits of the bank?
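The second question can be explored by scoring a model on profit rather than on accuracy alone. The sketch below is illustrative only: the per-loan `gain` and `loss` figures, the function name `profit_and_accuracy`, and the toy outcomes are assumptions, not values taken from the Lending Club data.

```python
def profit_and_accuracy(actual_repaid, approved, gain=1_000, loss=5_000):
    """Compare approval decisions against actual loan outcomes.

    actual_repaid: list of bools, True if the loan was fully paid.
    approved: list of bools, True if the model predicts repayment (loan is funded).
    gain / loss: hypothetical profit per repaid loan and cost per charge-off.
    """
    correct = sum(a == p for a, p in zip(actual_repaid, approved))
    profit = 0
    for repaid, funded in zip(actual_repaid, approved):
        if funded:  # only funded loans make or lose money
            profit += gain if repaid else -loss
    return profit, correct / len(actual_repaid)


# Toy outcomes: 8 loans repaid, 2 charged off.
actual = [True] * 8 + [False] * 2

# Model 1 approves everything: 80% accurate, but it funds both bad loans.
model_1 = [True] * 10
# Model 2 rejects both bad loans and three good ones: only 70% accurate.
model_2 = [True] * 5 + [False] * 3 + [False] * 2

print(profit_and_accuracy(actual, model_1))  # (-2000, 0.8)
print(profit_and_accuracy(actual, model_2))  # (5000, 0.7)
```

With these made-up costs the less accurate model is the more profitable one, because a charge-off costs far more than a repaid loan earns. Whether that holds for the real data depends on the actual interest and loss amounts.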
Snapshot: Random samples from the original 100K records from the 2017-2018 Pre-Pandemic NHANES dataset.
Uses: Analysis of relationships between health factors.
Size: 9,828 individuals (rows), each with 17 features (columns)
Features: Time_In_US, Marital Status, Education Level, Born_In_Country, Poverty, Sex, Age, Race, Weight, Waist, Height, BMI, Triglycerides_mol, Sodium, Cholest, Glucose, Iron, Triglycerides.
Potential Business Problems:
What customer characteristics could be used to identify candidates for preventive health care?
The following five datasets can be used together to analyze location-related questions.
Snapshot: US county-level data across a broad range of features, plus several other city-related datasets
Uses: Comparing geographic locations
Size: 3,143 US Counties (rows) each with 732 features (columns)
Features: Name, state, area, latitude/longitude, precipitation, temperature, race percentages, age percentages, male/female counts, population, prominent causes of death, income, labor, politics, education, cost of living, poverty rate, health, industries, zip codes.
Snapshot: Health data from the CDC for the 500 largest cities in the country.
Size: 500 US cities (rows) each with 33 features (columns)
Features (samples): Name, state, population, arthritis, binge drinking, high blood pressure, blood pressure meds, cancer, asthma, cholesterol, COPD, smoking, dental, diabetes, obesity, sleep, stroke, teeth lost, location.
Snapshot: Cost of living index from AdvisorSmith
Size: 511 US cities (rows) each with 3 features (columns)
Features: City, State, Cost of living index
Snapshot: Average income by city, along with land and water areas.
Size: 32,526 US cities (rows) each with 19 features (columns)
Features (samples): City, county, state, zip code, area code, land area, water area, latitude, longitude, income
Snapshot: Risk measurements by county, from FEMA
Size: 3,231 Counties (rows) each with 465 features (columns)
Features (samples): Avalanche, coastal flooding, cold wave, drought, earthquake, hail, heat wave, hurricane, ice storm, landslide, lightning, river flooding, strong wind, tornado, tsunami, volcanic activity, wildfire, winter weather.
Potential Business Problems:
The financial services company you work for is expanding and wants to open new offices in desirable parts of the country. Extract, merge, transform, and analyze the datasets to produce a rank-ordered list of the most desirable cities, along with a set of hypotheses and questions worth exploring further.
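One possible shape for the merge-and-rank step is sketched below. It is a minimal illustration, not a prescribed method: the city names, the measures (`income`, `cost_index`, `risk`), their values, and the equal weighting are all placeholder assumptions and do not come from the datasets above.

```python
from statistics import mean, pstdev

# Placeholder records standing in for three of the datasets; values are made up.
income = {("City A", "IL"): 52000, ("City B", "TX"): 61000, ("City C", "CA"): 75000}
cost_index = {("City A", "IL"): 95.0, ("City B", "TX"): 100.0, ("City C", "CA"): 140.0}
risk = {("City A", "IL"): 10.0, ("City B", "TX"): 30.0, ("City D", "FL"): 50.0}


def zscores(values):
    """Standardize a dict of numbers so measures in different units are comparable."""
    mu = mean(values.values())
    sigma = pstdev(values.values()) or 1.0  # avoid dividing by zero
    return {key: (v - mu) / sigma for key, v in values.items()}


def rank_cities(income, cost_index, risk):
    """Inner-join on (city, state), then score: high income, low cost, low risk."""
    keys = set(income) & set(cost_index) & set(risk)
    inc = zscores({k: income[k] for k in keys})
    cost = zscores({k: cost_index[k] for k in keys})
    rsk = zscores({k: risk[k] for k in keys})
    scores = {k: inc[k] - cost[k] - rsk[k] for k in keys}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_cities(income, cost_index, risk)
for (city, state), score in ranking:
    print(f"{city}, {state}: {score:+.2f}")
```

Cities missing from any one source drop out of the inner join (here, City C and City D), so it is worth checking how many rows survive each merge before trusting the ranking. The choice of measures and weights is itself a hypothesis to examine.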
Snapshot:
Size: 99,337 courses (rows), each representing a single UIC course offered
Features: Year, Semester, Subject (e.g. CS), Number, Title, Dept, # of each: (A,B,C,D,F,W), Instructor name, # students.
See the column descriptions page
Potential Business Problems:
1. Are there courses in a Department that are stumbling blocks for students?
2. For a college, is there a faculty member who is an outlier in terms of grading?
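One way to start on the second question is to convert each section's letter-grade counts into a grade point average and compare instructors. The sketch below assumes a 4-point scale, leaves withdrawals (W) out of the average, and uses a hypothetical cutoff of two standard deviations; the instructor labels and grade counts are placeholders, not UIC data.

```python
from statistics import mean, pstdev

POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}  # W is excluded from the GPA


def section_gpa(counts):
    """Average grade points for one section, given its letter-grade counts."""
    graded = sum(counts.get(g, 0) for g in POINTS)
    if graded == 0:
        return None  # no graded students, so no GPA
    return sum(POINTS[g] * counts.get(g, 0) for g in POINTS) / graded


def grading_outliers(gpa_by_instructor, threshold=2.0):
    """Return instructors whose GPA lies more than `threshold` std devs from the mean."""
    mu = mean(gpa_by_instructor.values())
    sigma = pstdev(gpa_by_instructor.values())
    if sigma == 0:
        return []
    return [name for name, gpa in gpa_by_instructor.items()
            if abs(gpa - mu) / sigma > threshold]


# Placeholder grade counts for eight hypothetical instructors.
sections = {f"Instructor {i}": {"A": 10, "B": 10, "C": 10, "W": 2} for i in range(1, 8)}
sections["Instructor 8"] = {"C": 5, "D": 20, "F": 5, "W": 6}

gpas = {name: section_gpa(counts) for name, counts in sections.items()}
outliers = grading_outliers(gpas)
print(outliers)  # ['Instructor 8']
```

An unusually low or high average does not by itself show unusual grading, since course level and subject differ. Comparing instructors who teach the same course is a reasonable refinement.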
Snapshot: provides detailed crime incident reports, excluding murders, from 2001 to the present, updated daily from the Chicago Police Department's CLEAR system.
Size: ~250K rows per year, each one representing a crime
Features: ID, Date, Year, Type, FBI Code, Beat (area), District, Ward, Community Area, Arrest (binary), Domestic (binary)
Potential Business Problems:
1. How can realtors and property buyers evaluate neighborhood safety?
2. Does gentrification correlate with shifts in crime reporting or policing?
CDC: National Immunization Survey Data (data dictionary, context)
Snapshot: includes estimates of vaccination coverage rates for U.S. children and for U.S. adolescents.
Size: ~43K rows (2023), with 676 features for each row
Features: ~600 columns from the vaccination survey, for example: ... HPVI_ANY (number of HPV vaccination shots), Sex, Poverty Status, WELLCHILD (whether the teen received an 11–12 year old well-child exam or check-up), etc.
Potential Business Problems:
1. Which populations are least likely to be vaccinated, and why?
2. How can government campaigns improve HPV or flu vaccination rates?
Screen Time & Productivity (and related poster presentation). For "screen time vs. X" see also family time, medical and health
Finance-related
Wharton Research Data Services (WRDS) for stock market data. (You will need to register to use the UIC license.)