Kickstarter is a popular crowdfunding platform that has helped thousands of entrepreneurs and creators bring their innovative ideas to life. However, not all Kickstarter projects are successful, and understanding the factors that contribute to success or failure can be valuable for creators and investors alike.
In this dataset, we have collected information on a large number of Kickstarter projects and whether they ultimately succeeded or failed to meet their funding goals. The dataset spans a wide range of project types, including technology startups, creative arts endeavors, and social impact initiatives, among others.
By analyzing this dataset, researchers and analysts can gain insights into the characteristics of successful and unsuccessful Kickstarter projects, such as funding targets, project categories, and funding sources. This information can be used to inform investment decisions and guide future crowdfunding campaigns.
Overall, this dataset provides a comprehensive look at the Kickstarter ecosystem and can serve as a valuable resource for anyone interested in understanding the dynamics of crowdfunding and the factors that contribute to project success or failure.
In this project, I will analyze the dataset using SQL and Microsoft PowerBI, and at the end I will present my findings and conclusions based on the insights gathered from the dataset.
The dataset contains 374,853 projects that applied for crowdfunding on Kickstarter from 22 different countries.
The dataset contains 15 distinct categories and 159 sub-categories of products or services offered by different projects.
The dataset contains a 'state' column, a categorical variable that tells us whether a project was successful, failed, live, or cancelled.
The 'backers' column represents the number of investors in a particular project.
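A quick way to inspect the 'state' column is to count projects per state. The sketch below assumes the table is named projects, matching the queries used later in this report:
SELECT
    state,
    COUNT(*) AS no_of_projects
FROM
    projects
GROUP BY
    state
ORDER BY
    no_of_projects DESC;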
Docker
PostgreSQL database
Microsoft PowerBI
Jupyter Notebook
To run PostgreSQL, I used the official PostgreSQL Docker image. I created a Docker Compose environment to run the PostgreSQL database and pgAdmin. pgdatabase runs on localhost port 5432 and pgAdmin runs on port 8080.
Below is the docker-compose.yaml file:
services:
  pgdatabase:
    image: postgres:13
    environment:
      - POSTGRES_USER=root
      - POSTGRES_PASSWORD=root
      - POSTGRES_DB=kickstarter
    volumes:
      - ./postgres_data:/var/lib/postgresql/data:rw
    ports:
      - "5432:5432"
  pgadmin:
    image: dpage/pgadmin4
    environment:
      - PGADMIN_DEFAULT_EMAIL=admin@admin.com
      - PGADMIN_DEFAULT_PASSWORD=root
    ports:
      - "8080:80"
The dataset is in CSV format. I created a data ingestion script in Python to ingest the data into the PostgreSQL database. The script loads the data into a pandas dataframe, creates a schema readable by PostgreSQL using SQLAlchemy, and loads the data into the database in chunks.
INGESTION SCRIPT:
#Importing necessary libraries
from sqlalchemy import create_engine
import pandas as pd
from time import time
import argparse
import psycopg2  #PostgreSQL driver used by SQLAlchemy


def main(params):
    #Arguments
    user = params.user
    password = params.password
    host = params.host
    port = params.port
    db = params.db
    location = params.location
    table_name = params.table_name

    #Creating pandas dataframe from the csv data file
    df = pd.read_csv(location)

    #Converting the Launched and Deadline columns to datetime type
    df["launch_date"] = pd.to_datetime(df.Launched)
    df["deadline"] = pd.to_datetime(df.Deadline)
    df = df.drop(columns=['Launched', 'Deadline'])

    #Converting all the column names to lower case for better readability in Postgres
    df.columns = df.columns.str.strip().str.lower()

    #Creating a connection to the Postgres database
    engine = create_engine(f'postgresql://{user}:{password}@{host}:{port}/{db}')
    engine.connect()

    #Splitting the dataframe into chunks of 100,000 rows
    n = 100000  #chunk row size
    list_df = [df[i:i+n] for i in range(0, df.shape[0], n)]

    #Creating the table with column headers only, then appending the data chunk by chunk
    df.head(n=0).to_sql(name=table_name, con=engine, if_exists="replace")
    for chunk in list_df:
        t_start = time()
        chunk.to_sql(name=table_name, con=engine, if_exists="append")
        t_end = time()
        print('inserted another chunk, took %.3f seconds' % (t_end - t_start))


if __name__ == '__main__':
    #Parsing the arguments that are provided for the main function
    parser = argparse.ArgumentParser(description='Ingest CSV data into Postgres')
    parser.add_argument('--user', help='Username for Postgres')
    parser.add_argument('--password', help='Password for Postgres')
    parser.add_argument('--host', help='Host for Postgres')
    parser.add_argument('--port', help='Port for Postgres')
    parser.add_argument('--db', help='Database name for Postgres')
    parser.add_argument('--table_name', help='Table name for Postgres where we will write the results to')
    parser.add_argument('--location', help='Path or URL of the CSV file')

    args = parser.parse_args()
    main(args)
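With the database running, the script can be executed from the command line. Below is an example invocation; the script name ingest_data.py and the CSV file name are placeholders, so adjust them to your setup:
python ingest_data.py \
    --user=root \
    --password=root \
    --host=localhost \
    --port=5432 \
    --db=kickstarter \
    --table_name=projects \
    --location=kickstarter_projects.csv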
After the data is loaded into the database, we begin our analysis in SQL. We have also established a connection with Microsoft PowerBI to create visualizations based on our analysis.
The dataset contains information related to different countries and different years, so in order to perform a thorough analysis, I have created a problem statement.
Problem Statement: An investor is looking to invest in the Canadian market. He/she wants to make decisions based on an analysis of this dataset. The investor expects the analysis report to provide insights that support informed decisions about investing in valuable categories of the market. This analysis is therefore narrowed down to projects that are based in Canada.
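In practice, the Canada-specific queries simply add a filter on the country column. A minimal sketch, assuming the column stores full country names rather than ISO codes:
SELECT
    COUNT(*) AS canadian_projects
FROM
    projects
WHERE
    country = 'Canada';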
There are roughly 380,000 different projects from 15 different categories of projects listed in the dataset.
Projects are listed from 22 countries of origin across the world.
The data ranges from 2013 to 2018, and only 5 categories have data for 2018.
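These overview figures can be reproduced directly in SQL. The sketch below assumes the column names produced by the ingestion script (category, country, launch_date):
SELECT
    COUNT(*) AS total_projects,
    COUNT(DISTINCT category) AS total_categories,
    COUNT(DISTINCT country) AS total_countries,
    MIN(EXTRACT(YEAR FROM launch_date)) AS first_year,
    MAX(EXTRACT(YEAR FROM launch_date)) AS last_year
FROM
    projects;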
SQL query for the number of projects and backers across different categories over the full dataset:
SELECT
    p.category,
    COUNT(p.*) AS total_no_of_projects,
    COUNT(p1.*) AS total_no_of_successful_projects,
    SUM(p.backers) AS total_backers_per_category,
    SUM(p1.backers) AS total_successful_backers_per_category,
    ROUND(SUM(p1.backers) * 100.0 / SUM(p.backers), 1) AS success_ratio_of_backers_per_category
FROM
    projects p
FULL OUTER JOIN
    (SELECT * FROM projects WHERE state = 'Successful') p1
    ON p.id = p1.id
GROUP BY
    p.category
ORDER BY
    success_ratio_of_backers_per_category DESC;
Based on the data, it can be seen that most backers invested in the Games category.
The Film & Video category had the highest number of projects.
We have to look at the percentage of projects that were successful within a particular category to get the true picture of successful projects.
The most popular category to invest in among all backers across all countries was Games.
In terms of successful projects, Music was the most successful category, with a 49% success rate.
The year 2015 saw the highest number of projects and the highest number of backers. Even after 2015, backers' interest in investing remained high, as the figures for the following years show more backers than projects.
From 2010 to 2015 there was a steady rise in projects and backers, but after 2015 both numbers dropped steeply. One reason could be that as the number of projects decreased, the number of backers declined with them. The yearly figures come from a query like the one below.
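A minimal sketch of that yearly aggregation, assuming the launch_date column created during ingestion:
SELECT
    EXTRACT(YEAR FROM launch_date) AS launch_year,
    COUNT(*) AS no_of_projects,
    SUM(backers) AS no_of_backers
FROM
    projects
GROUP BY
    launch_year
ORDER BY
    launch_year;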
Product Design was the hottest sub-category across all the years, as a ranking like the one below shows.
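The sub-category ranking can be obtained with a similar aggregation. The column name subcategory is an assumption based on the lower-cased headers produced by the ingestion script:
SELECT
    subcategory,
    COUNT(*) AS no_of_projects,
    SUM(backers) AS no_of_backers
FROM
    projects
GROUP BY
    subcategory
ORDER BY
    no_of_projects DESC;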
THANK YOU!!