https://www.scipy2020.scipy.org/tutorial-information
https://github.com/chendaniely/scipy-2020-pandas
https://github.com/hugobowne/deep-learning-from-scratch-pytorch
Run it online here:
Overview...
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import tree
from sklearn.metrics import accuracy_score

# Figures inline and set visualization style
%matplotlib inline
sns.set()

Read in data
df = pd.read_csv('https://raw.githubusercontent.com/hugobowne/deep-learning-from-scratch-pytorch/master/data/train.csv')

# View first lines of training data
df.head(n=4)

Check out data types
df.info()

Check out summary statistics
df.describe()

Split our data into train and test sets
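To see what the `stratify` argument does before applying it to the real data, here is a minimal sketch on a toy frame. The frame `toy`, its column names, and the `random_state` value are made up for illustration and are not part of the Titanic data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 12 rows, one third of them with label 1
toy = pd.DataFrame({
    'feature': range(12),
    'label': [0, 0, 1] * 4,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy.drop('label', axis=1),
    toy['label'],
    test_size=0.25,
    random_state=0,
    stratify=toy['label'],
)

# With stratify, both parts keep the label proportions of the full data
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, a small test set could by chance contain very few rows of one class, which makes accuracy numbers harder to compare.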
from sklearn.model_selection import train_test_split

df_train, df_test, y_train, y_test = train_test_split(
    df.drop('Survived', axis=1),
    df[['Survived']],
    test_size=0.33,
    random_state=42,
    stratify=df[['Survived']],
)

Make bar plot of target variable
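The bar plot shows how many rows fall into each class. The same numbers can be read off directly with `value_counts`; the sketch below uses a short invented series rather than the real `Survived` column.

```python
import pandas as pd

# Hypothetical labels: 1 = survived, 0 = died
survived = pd.Series([0, 1, 0, 0, 1, 0], name='Survived')

# Absolute counts per class, i.e. the bar heights of a count plot
counts = survived.value_counts()

# Relative frequencies per class
shares = survived.value_counts(normalize=True)

print(counts)
print(shares)
```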
df_train['Survived'] = y_train
sns.countplot(x='Survived', data=df_train);

Make a first, very naive baseline prediction that everybody died, and compute its accuracy
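The idea behind this baseline can be checked on a handful of invented labels: predicting 0 for everyone scores exactly the share of zeros among the true labels. This is a minimal sketch using `accuracy_score`; the array `y_true` is made up.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 1 = survived, 0 = died
y_true = np.array([0, 1, 0, 0, 1, 0, 0, 1])

# Naive baseline: predict that nobody survived
y_baseline = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_baseline)

# Equivalent by hand: the share of true labels equal to 0
by_hand = 1 - y_true.mean()

print(baseline_accuracy, by_hand)
```

Any model worth keeping should beat this number on the test set.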
df_test['Survived'] = 0

# Every prediction is 0, so each difference is 1 for a survivor and 0 otherwise
pred_diff = y_test['Survived'] - df_test['Survived'].array
accuracy = 1 - sum(pred_diff)/len(pred_diff)
print(accuracy)

Data preparation and cleaning
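The cleaning step fills missing values with the column median. A minimal sketch of that pattern on an invented series (the values in `ages` are not from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing values
ages = pd.Series([22.0, np.nan, 30.0, 40.0, np.nan])

# median() skips NaN by default, so it is computed from 22, 30 and 40
filled = ages.fillna(ages.median())

print(filled.tolist())
```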
df['Age'] = df.Age.fillna(df.Age.median())
df['Fare'] = df.Fare.fillna(df.Fare.median())
df.info()

Convert Sex into a numerical feature
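As a small illustration of what `pd.get_dummies` with `drop_first=True` produces, the sketch below encodes an invented four-row frame. The values are hypothetical; only the column names mirror the dataset.

```python
import pandas as pd

# Hypothetical rows with one categorical and one numeric column
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Fare': [7.25, 71.28, 7.92, 8.05],
})

# One indicator column per category; drop_first removes the first
# category in sorted order ('female'), leaving a single Sex_male column
encoded = pd.get_dummies(toy, columns=['Sex'], drop_first=True)

print(encoded.columns.tolist())
```

Dropping one category avoids carrying two columns that encode the same information.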
df = pd.get_dummies(df, columns=['Sex'], drop_first=True)
df = df[['Sex_male', 'Fare', 'Age', 'Pclass', 'SibSp', 'Survived']]
df.head()

Train/test split
df_train, df_test, y_train, y_test = train_test_split(
    df.drop('Survived', axis=1),
    df[['Survived']],
    test_size=0.33,
    random_state=41,
    stratify=df[['Survived']],
)

Instantiate model and fit to data
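To get a feel for what a fitted decision tree looks like, here is a minimal sketch on six invented rows whose label depends only on the first feature. The data and `random_state=0` are assumptions for illustration; `export_text` prints the learned rules.

```python
from sklearn import tree

# Hypothetical toy data: [Sex_male, Pclass] per row; label 1 = survived
X = [[1, 3], [0, 1], [1, 1], [0, 3], [1, 2], [0, 2]]
y = [0, 1, 0, 1, 0, 1]

# max_depth caps the number of successive splits on any path
clf = tree.DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y)

# Text rendering of the learned splits
rules = tree.export_text(clf, feature_names=['Sex_male', 'Pclass'])
print(rules)
```

Here a single split separates the classes, so the tree stops well short of the depth limit.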
clf = tree.DecisionTreeClassifier(max_depth=3)
clf.fit(df_train, y_train)

Make predictions
Y_pred = clf.predict(df_test)
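The section stops at the predictions. Since `accuracy_score` is imported at the top, a plausible next step is to score the predictions against the held-out labels. The sketch below does this end to end on synthetic data so that it runs on its own; the arrays `X` and `y`, the variable names, and the `random_state` values are assumptions, not the tutorial's.

```python
import numpy as np
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: the label depends only on the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y
)

clf = tree.DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Fraction of test rows whose predicted label matches the true label
test_accuracy = accuracy_score(y_test, y_pred)
print(test_accuracy)
```

With the variables defined above, the corresponding call would presumably be `accuracy_score(y_test['Survived'], Y_pred)`, which can then be compared with the naive baseline.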