Datasets are used for both instruction and projects (see the projects page). Datasets have been selected using the following criteria as much as possible:
Is real-world data, publicly available, at no cost
Can be used to develop student data and AI skills that are useful in a business context
Data exploration can be used to tell a compelling story
Has some interest or relevance to students
In the descriptions below, dataset size is generally given as the number of rows, and features are the columns.
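As a quick illustration of the rows-versus-columns convention, the sketch below counts both for a small CSV extract. The sample text and its column names are hypothetical placeholders, not taken from any of the datasets listed here, and the helper `dataset_shape` is an illustrative name.

```python
import csv
import io

# Hypothetical three-row extract; column names and values are placeholders.
SAMPLE_CSV = """area_name,feature_one,feature_two
Area A,10,1.5
Area B,20,2.5
Area C,30,3.5
"""

def dataset_shape(csv_text):
    """Return (number of data rows, number of columns) for CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)                      # first line holds the column names
    n_rows = sum(1 for row in reader if row)   # every non-empty line after it is one row
    return n_rows, len(header)

rows, cols = dataset_shape(SAMPLE_CSV)
print(f"{rows} rows, {cols} columns")
```

Note that the column count includes any identifier column (here `area_name`), so the number of usable features may be slightly smaller than the number of columns.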
Chicago Health Atlas Data, 2005-2011
Snapshot: a publicly available resource providing comprehensive health-related data across Chicago's community areas, encompassing over 160 public health indicators.
Size: 78 Chicago Community Areas (rows), each with ~60 extracted factors (columns)
Features (samples): Per Capita Income, % Unemployment, % Below Poverty, Crowded Housing, Dependents, No HS Diploma, Birth Rate, Teen Birth Rate, Low Birth Weight, Preterm Births, Lung Cancer, Female Breast Cancer, Male Prostate Cancer, Diabetes, Stroke, Tuberculosis
Uses: predictive modeling of factors contributing to "community health" categories such as income, crime, cancer, and crowded housing.
Potential Business Problems:
1. Where should a hospital or clinic open a new branch to maximize community impact?
2. How can insurance companies adjust premiums based on population-level risk?
Lending Club private loan applications
Snapshot: contains comprehensive loan data from Lending Club, an online marketplace bank that connects borrowers with investors, including borrower information, loan terms, and repayment status.
Uses: modeling of factors that contribute to high- or low-quality loans, and application of those insights to ML-based predictive modeling.
Size: ~30 million loan applications (rows)
Features: Credit scores, loan amount, debt-to-income ratio, interest rate, loan purpose, job title
Potential Business Problems:
1. What interest rate should be offered to maximize return and minimize risk?
2. How can investors choose the best set of loans to invest in based on risk and reward?
National Health and Nutrition Examination Survey (NHANES)
Snapshot: Random samples from the original 100K records from the 2017-2018 Pre-Pandemic NHANES dataset.
Uses: Analysis of relationships between health factors.
Size: 9,828 individuals (rows), each with 17 features (columns)
Features: Time_In_US, Marital Status, Education Level, Born_In_Country, Poverty, Sex, Age, Race, Weight, Waist, Height, BMI, Triglycerides_mol, Sodium, Cholest, Glucose, Iron, Triglycerides.
Potential Business Problems:
What customer characteristics could be used to identify candidates for preventive health care?
US Counties
Snapshot: US county-level data across a broad range of features.
Uses: Comparing geographic locations
Size: 3,143 US Counties (rows), each with 732 features (columns)
Features: Name, state, area, latitude/longitude, precipitation, temperature, race percentages, age percentages, male/female number, population, prominent causes of death, income, labor, politics, education, cost of living, poverty rate, health, industries, zip codes.
Potential Business Problems:
Your financial services company is expanding and wants to open new offices in desirable parts of the country. (You may be asked to move to one of these new locations to help open the new office!) Your task is to extract, merge, transform, and analyze the dataset, coming up with a set of hypotheses and questions that have value for future exploration.
Second step: Consider the city dataset together with the county dataset in your analysis.
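One possible way to approach the merge step is to join city rows to county rows on a shared key. The sketch below assumes both datasets carry `state` and `county` fields; those field names, the function `merge_cities_with_counties`, and all sample values are hypothetical placeholders rather than the actual schema of either dataset.

```python
# Hypothetical records: names and values are placeholders, not real data.
COUNTIES = [
    {"state": "AA", "county": "County One", "poverty_rate": 10.0},
    {"state": "BB", "county": "County Two", "poverty_rate": 20.0},
]
CITIES = [
    {"city": "City X", "state": "AA", "county": "County One", "population": 1000},
    {"city": "City Y", "state": "BB", "county": "County Two", "population": 2000},
    {"city": "City Z", "state": "CC", "county": "County Three", "population": 3000},
]

def merge_cities_with_counties(cities, counties):
    """Join city rows to county rows on the (state, county) pair.

    Returns the matched, combined rows plus the names of cities
    that had no matching county row.
    """
    by_key = {(c["state"], c["county"]): c for c in counties}
    merged, unmatched = [], []
    for city in cities:
        county = by_key.get((city["state"], city["county"]))
        if county is None:
            unmatched.append(city["city"])
            continue
        # City fields win on a name clash; county fields fill in the rest.
        merged.append({**county, **city})
    return merged, unmatched

merged, unmatched = merge_cities_with_counties(CITIES, COUNTIES)
print(len(merged), "matched;", "unmatched:", unmatched)
```

Reporting the unmatched cities, rather than silently dropping them, makes it easier to spot key mismatches such as inconsistent county spellings before any analysis begins.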
UIC course grade distributions, Fall 2004 through Fall 2024
Size: 99,337 courses (rows), each representing a single UIC course offered
Features: Year, Semester, Subject (e.g. CS), Number, Title, Dept, # of each: (A,B,C,D,F,W), Instructor name, # students.
Potential Business Problems:
1. Are there courses in a Department that are stumbling blocks for students?
2. For a college, is there a faculty member who is an outlier in terms of grading?
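A common starting point for the "stumbling block" question is the DFW rate: the share of students in a course who receive a D, an F, or a withdrawal. The sketch below is one way to compute it, assuming per-section grade counts keyed by letter; the field names, the 30% threshold, and the sample rows are illustrative assumptions, not values from the UIC dataset.

```python
from collections import defaultdict

# Hypothetical section rows; field names and counts are assumptions.
SECTIONS = [
    {"subject": "CS", "number": "101", "A": 30, "B": 25, "C": 20, "D": 5, "F": 5, "W": 5},
    {"subject": "CS", "number": "101", "A": 20, "B": 30, "C": 25, "D": 5, "F": 5, "W": 5},
    {"subject": "CS", "number": "201", "A": 10, "B": 15, "C": 15, "D": 10, "F": 20, "W": 10},
]

GRADES = ("A", "B", "C", "D", "F", "W")

def dfw_rates(sections):
    """Return {(subject, number): DFW rate}, pooling all sections of a course."""
    totals = defaultdict(lambda: {"dfw": 0, "all": 0})
    for s in sections:
        key = (s["subject"], s["number"])
        totals[key]["dfw"] += s["D"] + s["F"] + s["W"]
        totals[key]["all"] += sum(s[g] for g in GRADES)
    # Skip courses with no recorded grades to avoid dividing by zero.
    return {k: t["dfw"] / t["all"] for k, t in totals.items() if t["all"] > 0}

def stumbling_blocks(sections, threshold=0.30):
    """Return courses whose pooled DFW rate meets or exceeds the threshold."""
    rates = dfw_rates(sections)
    return sorted(course for course, rate in rates.items() if rate >= threshold)

print(stumbling_blocks(SECTIONS))
```

Grouping by instructor name instead of by course would give a comparable first look at the grading-outlier question, although small sections can produce extreme rates by chance.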
Chicago Crime Data
Snapshot: provides detailed crime incident reports, excluding murders, from 2001 to the present, updated daily from the Chicago Police Department's CLEAR system.
Size: ~250K rows per year, each one representing a crime
Features: ID, Date, Year, Type, FBI Code, Beat (area), District, Ward, Community Area, Arrest (binary), Domestic (binary)
Potential Business Problems:
1. How can realtors and property buyers evaluate neighborhood safety?
2. Is gentrification correlating with shifts in crime reporting or policing?
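Both questions above usually start from incident counts per community area per year. The sketch below shows one way to tabulate them, assuming each row has `year` and `community_area` fields; the field names, helper names, and sample incidents are hypothetical placeholders, not records from the Chicago dataset.

```python
from collections import Counter

# Hypothetical incidents; field names and values are placeholders.
INCIDENTS = [
    {"year": 2022, "community_area": 1, "type": "THEFT", "arrest": False},
    {"year": 2022, "community_area": 1, "type": "BATTERY", "arrest": True},
    {"year": 2022, "community_area": 2, "type": "THEFT", "arrest": False},
    {"year": 2023, "community_area": 1, "type": "THEFT", "arrest": False},
    {"year": 2023, "community_area": 2, "type": "THEFT", "arrest": True},
    {"year": 2023, "community_area": 2, "type": "ASSAULT", "arrest": False},
]

def counts_by_area_and_year(incidents):
    """Count incidents for each (community_area, year) pair."""
    return Counter((r["community_area"], r["year"]) for r in incidents)

def change_between_years(counts, area, start_year, end_year):
    """Return the change in incident count for one area between two years."""
    # Counter returns 0 for missing keys, so areas with no incidents are handled.
    return counts[(area, end_year)] - counts[(area, start_year)]

counts = counts_by_area_and_year(INCIDENTS)
print(change_between_years(counts, area=1, start_year=2022, end_year=2023))
print(change_between_years(counts, area=2, start_year=2022, end_year=2023))
```

Raw counts reflect reported incidents only, so a fair comparison across areas would also need population figures to convert counts into rates.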
CDC: National Immunization Survey Data (data dictionary, context)
Snapshot: includes estimates of vaccination coverage rates for U.S. children and for U.S. adolescents.
Size: ~43K rows (2023), with 676 features for each row
Features: ~600 columns from the vaccination survey, for example: ... HPVI_ANY (number of HPV vaccination shots), Sex, Poverty Status, WELLCHILD (did the teen receive an 11–12 year old well-child exam or check-up?), etc.
Potential Business Problems:
1. Which populations are least likely to be vaccinated, and why?
2. How can government campaigns improve HPV or flu vaccination rates?
Screen Time & Productivity (and related poster presentation). For "screen time vs. X" see also family time, medical and health
Finance-related
Wharton Research Data Services (WRDS) for stock market data. (You will need to register to use the UIC license.)