Kickstarter is a popular crowdfunding platform that has helped thousands of entrepreneurs and creators bring their innovative ideas to life. However, not all Kickstarter projects are successful, and understanding the factors that contribute to success or failure can be valuable for creators and investors alike.
In this dataset, we have collected information on a large number of Kickstarter projects and whether they ultimately succeeded or failed to meet their funding goals. The dataset spans a wide range of project types, including technology startups, creative arts endeavors, and social impact initiatives, among others.
By analyzing this dataset, researchers and analysts can gain insights into the characteristics of successful and unsuccessful Kickstarter projects, such as funding targets, project categories, and funding sources. This information can be used to inform investment decisions and guide future crowdfunding campaigns.
Overall, this dataset provides a comprehensive look at the Kickstarter ecosystem and can serve as a valuable resource for anyone interested in understanding the dynamics of crowdfunding and the factors that contribute to project success or failure.
In this project, I will analyze the dataset using SQL and Microsoft PowerBI, and at the end I will present my findings and conclusions based on the insights gathered from the dataset.
The dataset contains 374,853 projects that applied for crowdfunding on Kickstarter from 22 different countries.
The dataset contains 15 distinct categories and 159 sub-categories of products or services offered by different projects.
The dataset contains a 'state' column, a categorical variable that tells us whether a project was successful, failed, live, or cancelled.
The 'backers' column represents the number of investors in a particular project.
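A quick way to inspect the 'state' column is to count projects per state. The sketch below assumes the table is named projects, matching the queries used later in this report:
SELECT
    state,
    COUNT(*) AS no_of_projects
FROM
    projects
GROUP BY
    state
ORDER BY
    no_of_projects DESC;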
Docker
PostgreSQL database
Microsoft PowerBI
Jupyter Notebook
To run PostgreSQL, I used the official PostgreSQL Docker image. I created a Docker Compose environment to run the PostgreSQL database and pgAdmin. pgdatabase runs on localhost port 5432 and pgAdmin runs on port 8080.
Below is the docker-compose.yaml file:
services:
  pgdatabase:
    image: postgres:13
    environment:
      - POSTGRES_USER=root
      - POSTGRES_PASSWORD=root
      - POSTGRES_DB=kickstarter
    volumes:
      - ./postgres_data:/var/lib/postgresql/data:rw
    ports:
      - "5432:5432"
  pgadmin:
    image: dpage/pgadmin4
    environment:
      - PGADMIN_DEFAULT_EMAIL=admin@admin.com
      - PGADMIN_DEFAULT_PASSWORD=root
    ports:
      - "8080:80"
The dataset is in CSV format. I created a data ingestion script in Python to ingest the data into the PostgreSQL database. The script loads the data into a pandas dataframe, creates a schema readable by PostgreSQL using SQLAlchemy, and loads the data into the database in chunks.
INGESTION SCRIPT:
#Importing necessary libraries
from sqlalchemy import create_engine
import pandas as pd
from time import time
import argparse
import psycopg2  #PostgreSQL driver used by SQLAlchemy


def main(params):
    #Arguments
    user = params.user
    password = params.password
    host = params.host
    port = params.port
    db = params.db
    location = params.location
    table_name = params.table_name

    #Creating pandas dataframe from the csv data file
    df = pd.read_csv(location)

    #Converting the Launched and Deadline columns to datetime type
    df["launch_date"] = pd.to_datetime(df.Launched)
    df["deadline"] = pd.to_datetime(df.Deadline)
    df = df.drop(columns=['Launched', 'Deadline'])

    #Converting all the column names to lower case for better readability in Postgres
    df.columns = df.columns.str.strip().str.lower()

    #Creating a connection to the Postgres database
    engine = create_engine(f'postgresql://{user}:{password}@{host}:{port}/{db}')
    engine.connect()

    #Splitting the dataframe into chunks of 100,000 rows
    n = 100000  #chunk row size
    list_df = [df[i:i+n] for i in range(0, df.shape[0], n)]

    #Creating the table with column headers only, then appending the data chunk by chunk
    df.head(n=0).to_sql(name=table_name, con=engine, if_exists="replace")
    for chunk in list_df:
        t_start = time()
        chunk.to_sql(name=table_name, con=engine, if_exists="append")
        t_end = time()
        print('inserted another chunk, took %.3f seconds' % (t_end - t_start))


if __name__ == '__main__':
    #Parsing the arguments that are provided for the main function
    parser = argparse.ArgumentParser(description='Ingest CSV data into Postgres')
    parser.add_argument('--user', help='Username for Postgres')
    parser.add_argument('--password', help='Password for Postgres')
    parser.add_argument('--host', help='Host for Postgres')
    parser.add_argument('--port', help='Port for Postgres')
    parser.add_argument('--db', help='Database name for Postgres')
    parser.add_argument('--table_name', help='Table name for Postgres where we will write the results to')
    parser.add_argument('--location', help='Path or URL of the CSV file')

    args = parser.parse_args()
    main(args)
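With the database running, the script can be executed from the command line. Below is an example invocation; the script name ingest_data.py and the CSV file name are placeholders, so adjust them to your setup:
python ingest_data.py \
    --user=root \
    --password=root \
    --host=localhost \
    --port=5432 \
    --db=kickstarter \
    --table_name=projects \
    --location=kickstarter_projects.csv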
After the data is loaded into the database, we begin our analysis in SQL. We have also established a connection with Microsoft PowerBI to create visualizations based on our analysis.
The dataset contains information related to different countries and different years, so in order to perform a thorough analysis, I have created a problem statement.
Problem Statement: An investor is looking to invest in the Canadian market. He/she wants to make decisions based on an analysis of this dataset. The investor expects the analysis report to provide insights that support informed decisions about investing in valuable categories of the market. This analysis is therefore narrowed down to projects that are based in Canada.
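In practice, the Canada-specific queries simply add a filter on the country column. A minimal sketch, assuming the column stores full country names rather than ISO codes:
SELECT
    COUNT(*) AS canadian_projects
FROM
    projects
WHERE
    country = 'Canada';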
There are roughly 380,000 different projects from 15 different categories of projects listed in the dataset.
Projects are listed from 22 countries of origin across the world.
The data ranges from 2013 to 2018, and only 5 categories have data for 2018.
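These overview figures can be reproduced directly in SQL. The sketch below assumes the column names produced by the ingestion script (category, country, launch_date):
SELECT
    COUNT(*) AS total_projects,
    COUNT(DISTINCT category) AS total_categories,
    COUNT(DISTINCT country) AS total_countries,
    MIN(EXTRACT(YEAR FROM launch_date)) AS first_year,
    MAX(EXTRACT(YEAR FROM launch_date)) AS last_year
FROM
    projects;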
SQL query for the number of projects and backers across different categories over the full dataset:
SELECT
    p.category,
    COUNT(p.*) AS total_no_of_projects,
    COUNT(p1.*) AS total_no_of_successful_projects,
    SUM(p.backers) AS total_backers_per_category,
    SUM(p1.backers) AS total_successful_backers_per_category,
    ROUND(SUM(p1.backers) * 100.0 / SUM(p.backers), 1) AS success_ratio_of_backers_per_category
FROM
    projects p
FULL OUTER JOIN
    (SELECT * FROM projects WHERE state = 'Successful') p1
    ON p.id = p1.id
GROUP BY
    p.category
ORDER BY
    success_ratio_of_backers_per_category DESC;
Based on the data, it can be seen that most backers invested in the Games category.
The Film & Video category had the highest number of projects.
We have to look at the percentage of projects that were successful within a particular category to get the true picture of successful projects.
The most popular category to invest in among all backers across all countries was Games.
In terms of successful projects, Music was the most successful category, with a 49% success rate.
The year 2015 saw the highest number of projects and the highest number of backers. Even after 2015, backers' interest in investing remained high, as the figures for the following years show more backers than projects.
From 2010 to 2015 there was a steady rise in projects and backers, but after 2015 both numbers dropped steeply. One reason could be that as the number of projects decreased, the number of backers declined with them. The yearly figures come from a query like the one below.
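A minimal sketch of that yearly aggregation, assuming the launch_date column created during ingestion:
SELECT
    EXTRACT(YEAR FROM launch_date) AS launch_year,
    COUNT(*) AS no_of_projects,
    SUM(backers) AS no_of_backers
FROM
    projects
GROUP BY
    launch_year
ORDER BY
    launch_year;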
Product Design was the hottest sub-category across all the years, as a ranking like the one below shows.
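The sub-category ranking can be obtained with a similar aggregation. The column name subcategory is an assumption based on the lower-cased headers produced by the ingestion script:
SELECT
    subcategory,
    COUNT(*) AS no_of_projects,
    SUM(backers) AS no_of_backers
FROM
    projects
GROUP BY
    subcategory
ORDER BY
    no_of_projects DESC;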
THANK YOU!!