Logistic Regression Python
Here is logistic regression implemented in Python with scikit-learn. A few things to notice:
- Preprocess the training data by scaling each feature to zero mean and unit variance. (Standardization only shifts and rescales the features; it does not make them Gaussian.)
In this way the importance/weight of each feature is comparable.
- Train the model with penalty = 'l1' first to find the features with zero importance/weight, and get rid of them.
- Run the model a second time with penalty = 'l2' on the remaining features to fit the training data better. (The script below shows only the first, 'l1', pass.)
- The ROC curve is also drawn here.
- C is the parameter of the regularization (penalty) term, i.e. penalty = lambda * sum(weight^2) for 'l2' (or lambda * sum(|weight|) for 'l1'), and C = 1/lambda.
- scikit-learn takes the inverse of lambda rather than lambda itself, most likely to stay consistent with the convention used for SVMs and other models. Either way, a smaller C means stronger regularization.
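The two-pass idea in the list above ('l1' to drop zero-weight features, then 'l2' on what is left) can be sketched as follows. This is only an illustration under assumptions: the data is synthetic (generated with make_classification instead of read from a database), and the values C=0.1 and C=1.0 are arbitrary choices, not tuned settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the database table (assumption: 20 features, 5 informative).
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)
X = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature

# Pass 1: the L1 penalty drives some weights exactly to zero.
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1, max_iter=200)
l1_model.fit(X, y)
keep = np.flatnonzero(l1_model.coef_[0] != 0)  # indices of surviving features

# Pass 2: refit with the L2 penalty on the surviving features only.
l2_model = LogisticRegression(penalty='l2', solver='liblinear', C=1.0, max_iter=200)
l2_model.fit(X[:, keep], y)

print('kept {0} of {1} features'.format(len(keep), X.shape[1]))
```

Lowering C in the first pass strengthens the penalty and usually zeroes out more features, so it is worth trying a few values before deciding which features to drop.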
import pyodbc
import numpy as np
import matplotlib.pyplot as plt
import sklearn.linear_model as lm
import sklearn.metrics as met
import sklearn.preprocessing as pre
'''get data from a database'''
conn = pyodbc.connect('DRIVER={SQL Server};SERVER=111.111.111.111;DATABASE=database;Trusted_Connection=yes;')
cursor = conn.cursor()
cursor.execute('SELECT * FROM table')
rows = cursor.fetchall()
rows_array = np.array(rows)
'''split the data from the label'''
y = rows_array[:, 1].astype(int)  # 2nd column is the label; slicing one column already gives a 1-D vector
X = rows_array[:, 2:].astype(float)
X = pre.scale(X)  # standardization: zero mean and unit variance for each feature
'''train the model'''
lr = lm.LogisticRegression(penalty='l1', solver='liblinear', max_iter=200, C=1)
lr.fit(X, y)
'''check how well the model works on the training data'''
result = lr.predict(X)
result_prob = lr.predict_proba(X)[:, 1]  # the second column is the probability of the positive class; check lr.classes_ to confirm the column order
compare = np.concatenate((np.reshape(result,(len(y),1)),np.reshape(y, (len(y),1))),axis=1)
'''calculate indicators and print/plot them'''
# loop variables are named pred/actual so they do not shadow the label vector y
tp = sum(int(pred == 1 and actual == 1) for (pred, actual) in compare)
fp = sum(int(pred == 1 and actual == 0) for (pred, actual) in compare)
tn = sum(int(pred == 0 and actual == 0) for (pred, actual) in compare)
fn = sum(int(pred == 0 and actual == 1) for (pred, actual) in compare)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
specificity = tn / (tn + fp)
OneMinusSpec, sensitivity, thresholds = met.roc_curve(y, result_prob, pos_label = 1)
auc = met.roc_auc_score(y, result_prob, average='macro')
print('true positive {0}, false positive {1}, true negative {2}, false negative {3}'.format(tp, fp, tn, fn))
print('recall/sensitivity {0}, precision {1}, specificity {2}'.format(recall, precision, specificity))
print('Area Under ROC Curve {0}'.format(auc))
plt.figure(figsize=(4,4))
plt.plot(OneMinusSpec, sensitivity)
plt.xlim([0,1])
plt.ylim([0,1])
plt.xlabel('1 - specificity')
plt.ylabel('recall / sensitivity')
plt.show()
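As a cross-check on the hand-counted tp/fp/tn/fn above, the same numbers can be read off sklearn.metrics.confusion_matrix. The sketch below uses a small made-up pair of label and prediction arrays (y_true and y_pred are hypothetical names standing in for y and result), so it runs without a database.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions, standing in for y and result above.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# With labels=[0, 1] the matrix is laid out as [[tn, fp], [fn, tp]],
# so ravel() yields the four counts in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

recall = tp / (tp + fn)        # also called sensitivity
precision = tp / (tp + fp)
specificity = tn / (tn + fp)

print('tp {0}, fp {1}, tn {2}, fn {3}'.format(tp, fp, tn, fn))
print('recall {0}, precision {1}, specificity {2}'.format(recall, precision, specificity))
```

Passing labels=[0, 1] explicitly pins the row and column order, which avoids surprises if one class happens to be missing from the predictions.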
Note, the LR coefficients are available through:
lr.coef_
lr.intercept_
The actual linear model is:
log(odds) = dot(lr.coef_, x) + lr.intercept_
The probability is:
1 / (1 + e^(-log(odds)))
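To make the relationship between the coefficients and the predicted probability concrete, here is a minimal sketch that recomputes the probability by hand from coef_ and intercept_ and compares it with predict_proba. It assumes a binary problem and uses synthetic data; the names log_odds, manual_prob and sklearn_prob are introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data (assumption: stands in for the real feature matrix).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

lr = LogisticRegression(C=1.0)
lr.fit(X, y)

# Linear part: one log-odds value per sample.
log_odds = X @ lr.coef_[0] + lr.intercept_[0]

# Logistic (sigmoid) function turns log-odds into a probability.
manual_prob = 1.0 / (1.0 + np.exp(-log_odds))

# Probability of the positive class as reported by scikit-learn.
sklearn_prob = lr.predict_proba(X)[:, 1]

print(np.allclose(manual_prob, sklearn_prob))
```

For a binary model, lr.decision_function(X) returns the same log-odds values, so it can replace the explicit dot product.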