Logistic Regression Python

Here is logistic regression implemented in Python with scikit-learn. A few things to note:

- Preprocess the training data by scaling each feature to zero mean and unit variance.

This way the importance/weight of each feature is comparable across features.

- Train the model with penalty = 'l1' first to find zero importance/weight features and get rid of them

- Run the model a second time with penalty = 'l2' to fit the training data better.

- The ROC curve is also drawn here

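- C is the parameter of the regularization (penalty) term, i.e. penalty = lambda * sum(weight^2) for 'l2' (or lambda * sum(|weight|) for 'l1'), and C = 1/lambda. A smaller C therefore means stronger regularization.

- scikit-learn takes the inverse of lambda rather than lambda itself, presumably to stay consistent with SVM and other models, where a smaller C also means stronger regularization.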
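The two-pass idea from the list above (standardize, fit with 'l1' to find zero-weight features, then refit with 'l2' on what remains) can be sketched as follows. This is a minimal illustration, not the script used in this post: the synthetic data from make_classification stands in for the database table, and the names X_raw, X_std, y_demo, keep, and the value C=0.05 are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real table (assumption: 10 features, 3 informative).
X_raw, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, random_state=0)

# Standardize to zero mean and unit variance so weights are comparable.
X_std = StandardScaler().fit_transform(X_raw)

# Pass 1: L1 penalty. A small C (strong regularization) drives weak weights to exactly zero.
lr_l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.05, max_iter=200)
lr_l1.fit(X_std, y_demo)
keep = np.flatnonzero(lr_l1.coef_[0] != 0)  # indices of features with non-zero weight

# Pass 2: L2 penalty, fitted only on the features that survived pass 1.
lr_l2 = LogisticRegression(penalty='l2', solver='liblinear', C=1.0, max_iter=200)
lr_l2.fit(X_std[:, keep], y_demo)

print('kept feature indices:', keep)
print('L2 coefficients:', lr_l2.coef_[0])
```

How many features survive the first pass depends on C, so it is worth trying a few values before dropping anything.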

import pyodbc

import numpy as np

import matplotlib.pyplot as plt

import sklearn.linear_model as lm

import sklearn.metrics as met

import sklearn.preprocessing as pre

'''get data from a database'''

conn = pyodbc.connect('DRIVER={SQL Server};SERVER=111.111.111.111;DATABASE=database;Trusted_Connection=yes;')

cursor = conn.cursor()

cursor.execute('SELECT * FROM table')

rows = cursor.fetchall()

rows_array = np.array(rows)

'''split the data from the label'''

y = rows_array[:, 1].astype(int) # 2nd column is the label; the slice is already a 1-D array

X = rows_array[:, 2:].astype(float)

X = pre.scale(X) # standardization: zero mean, unit variance per feature

'''train the model'''

lr = lm.LogisticRegression(penalty='l1', solver='liblinear', max_iter=200, C=1)

lr.fit(X, y)

'''check how well the model works on the training data'''

result = lr.predict(X)

result_prob = lr.predict_proba(X)[:, 1] # the second column is the probability of the positive class; check lr.classes_ to confirm which column is which

compare = np.concatenate((np.reshape(result,(len(y),1)),np.reshape(y, (len(y),1))),axis=1)

'''calculate indicators and print/plot them'''

tp = sum(int(x==1 and y==1) for (x,y) in compare)

fp = sum(int(x==1 and y==0) for (x,y) in compare)

tn = sum(int(x==0 and y==0) for (x,y) in compare)

fn = sum(int(x==0 and y==1) for (x,y) in compare)

recall = tp / (tp + fn)

precision = tp / (tp + fp)

specificity = tn / (tn + fp)

OneMinusSpec, sensitivity, thresholds = met.roc_curve(y, result_prob, pos_label = 1)

auc = met.roc_auc_score(y, result_prob, average='macro')

print('true positive {0}, false positive {1}, true negative {2}, false negative {3}'.format(tp, fp, tn, fn))

print('recall/sensitivity {0}, precision {1}, specificity {2}'.format(recall, precision, specificity))

print('Area Under ROC Curve {0}'.format(auc))

plt.figure(figsize=(4,4))

plt.plot(OneMinusSpec, sensitivity)

plt.xlim([0,1])

plt.ylim([0,1])

plt.xlabel('1 - specificity')

plt.ylabel('recall / sensitivity')

plt.show()
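The hand-counted tp/fp/tn/fn above can be cross-checked against scikit-learn's own metric functions. The sketch below uses made-up toy arrays (y_true and y_pred are illustrative names, not data from this post) so it runs without a database.

```python
import numpy as np
import sklearn.metrics as met

# Toy labels and predictions, made up for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# With labels=[0, 1], ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = met.confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

recall = met.recall_score(y_true, y_pred)        # tp / (tp + fn)
precision = met.precision_score(y_true, y_pred)  # tp / (tp + fp)
specificity = tn / (tn + fp)                     # no dedicated sklearn helper used here

print('tp {0}, fp {1}, tn {2}, fn {3}'.format(tp, fp, tn, fn))
print('recall {0}, precision {1}, specificity {2}'.format(recall, precision, specificity))
```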

Note, the LR coefficients are available through:

lr.coef_

lr.intercept_

The actual linear model is:

log(odds) = lr.coef_ * x + lr.intercept_

The probability of the positive class is:

1/(1 + e^(-logodds))
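As a sanity check, the log-odds and the logistic function 1/(1 + e^(-logodds)) can be computed by hand from coef_ and intercept_ and compared with what the fitted model returns. This is a small sketch on synthetic data; X_demo, y_demo, and model are illustrative names, not part of the script above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, used only for this check.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression().fit(X_demo, y_demo)

# Linear part: one log-odds value per row of X_demo.
log_odds = X_demo @ model.coef_[0] + model.intercept_[0]

# Logistic (sigmoid) function turns log-odds into a probability.
prob = 1.0 / (1.0 + np.exp(-log_odds))

# Both should agree with the model's own outputs for the positive class.
print(np.allclose(log_odds, model.decision_function(X_demo)))
print(np.allclose(prob, model.predict_proba(X_demo)[:, 1]))
```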