The primary objective of this project was to leverage SQL and Python to extract insights from a large dataset regarding painting galleries, artists, and their works. The project aimed to:
Load data from CSV files into a MySQL database.
Perform data validation and checks.
Answer a set of business questions through SQL queries that analyze the relationship between various aspects such as artists, museums, canvas sizes, and painting subjects.
The dataset used in this project was sourced from Kaggle and consisted of 8 CSV files:
artist.csv - Information on artists, including full name, birth/death years, nationality, and style.
canvas.csv - Details of different canvas sizes used in the paintings.
image_link.csv - URLs pointing to images of the paintings.
museum.csv - Information about the museums that house the paintings, including name and location.
museum_hours.csv - Hours of operation for the museums.
product_size.csv - Information on the size of the products.
subject.csv - The subject matter of paintings (e.g., portraits, landscapes).
work.csv - Details about individual paintings, including artist ID, canvas ID, museum ID, subject ID, and sale price.
The process followed in this project was as follows:
The CSV files were first identified and downloaded from Kaggle. Using Python and the pandas library, the data was cleaned and then loaded into a MySQL database. A provided script (load_csv_datafiles.py) was used to load the data into the database by connecting with MySQL using the appropriate credentials. However, due to foreign key constraints, they were temporarily removed to ensure the data could load without error.
3.1.1. Python Script
# Libraries & dependencies
import pandas as pd
from sqlalchemy import create_engine
# MySQL connection string
conn_string = 'mysql+pymysql://MYUSERID:MYPASSWORD@localhost/painting_gallery'
# Create the engine to connect to MySQL
db = create_engine(conn_string)
# Files to load into MySQL
files = ['artist', 'canvas_size', 'image_link', 'museum_hours', 'museum', 'product_size', 'subject', 'work']
# Loop through each CSV file, read it into a DataFrame, and write to MySQL
for file in files:
# Load CSV data
df = pd.read_csv(f'dataset/{file}.csv')
# Load data into MySQL, replace existing table if it exists
df.to_sql(file, con=db, if_exists='replace', index=False)
The MySQL database painting_gallery was created using the provided SQL script (schema.sql). Each table was structured as follows (example with the artist table):
CREATE DATABASE IF NOT EXISTS painting_gallery;
USE painting_gallery;
CREATE TABLE artist (
artist_id INT PRIMARY KEY,
full_name VARCHAR(255) NOT NULL,
first_name VARCHAR(100),
middle_names VARCHAR(255),
last_name VARCHAR(100),
nationality VARCHAR(100),
style VARCHAR(100),
birth YEAR,
death YEAR
);
This schema was replicated for all other tables.
Data validation was carried out using a 0_datacheck.sql script. This step ensured that there were no missing values or data inconsistencies. Any issues were resolved prior to proceeding with the SQL queries.
The following insights were gathered from the SQL queries:
Museums and Cities: There is a concentration of museums in certain cities (such as Paris, New York) and a few cities stood out as having more museum diversity.
Canvas Size: Larger canvas sizes tend to be more expensive on average, but smaller sizes are more frequently used.
Artistic Popularity: Some artists, especially those who had a significant impact during their time, have a larger number of paintings available for public viewing.
Museum Operations: Many museums have specific hours of operation on weekends, though few are open every single day of the week.
Painting Styles: Certain painting styles (e.g., portraits, landscapes) are more popular and are more likely to be displayed in larger museums.
This project successfully demonstrated the ability to load a diverse dataset into a MySQL database, perform data validation, and use SQL queries to extract business insights. By leveraging SQL's powerful aggregation and filtering capabilities, we answered a series of relevant business questions, contributing to a better understanding of the global art gallery landscape.
Further improvements could include exploring the data in more depth (e.g., trends over time) and enhancing the dataset by adding more attributes (e.g., painting genres, artist ratings).
For future enhancements:
Investigate the relationship between painting price and museum size or location.
Add more fields to better analyze trends, such as artist popularity over time or museum visitation data.
Ensure data integrity by including more rigorous validation steps during the data import process, including the restoration of foreign key constraints.