Some common steps are:
1. extract tokens from the documents, e.g. by regex matching [a-zA-Z]+
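for example, with Python's built-in re module (a minimal sketch; the sample sentence is made up):
import re
tokens = re.findall('[a-zA-Z]+', 'The claim was filed in 2021.')
# tokens = ['The', 'claim', 'was', 'filed', 'in'], digits and punctuation are dropped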
2. apply lemmatization
inflected words are reduced to their dictionary base form: third-person forms go to the base form, and past/future tense verbs go to the present tense
e.g. studies to study, went to go
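for example, with nltk's WordNetLemmatizer (a minimal sketch; the wordnet data must be downloaded first):
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('studies')        # 'study' (defaults to treating the word as a noun)
lemmatizer.lemmatize('went', pos='v')  # 'go' (verbs need the pos='v' hint)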
3.1. optionally, apply stemming
words are reduced to their root form; the root may not itself be a word
e.g. studies to studi
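for example, with nltk's PorterStemmer (a minimal sketch):
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmer.stem('studies')   # 'studi'
stemmer.stem('studying')  # 'studi'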
3.2. optionally, keep only tokens found in a dictionary, e.g. a 90k English word list
if not stemming, a fixed dictionary can be used to constrain the size of the vocabulary,
so only words in the dictionary are kept
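for example, with a plain Python set (the three-word dictionary here is just a stand-in for the 90k list):
dictionary = {'study', 'claim', 'policy'}
tokens = ['study', 'claimzz', 'policy']
kept = [t for t in tokens if t in dictionary]
# kept = ['study', 'policy']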
4. transform the list of document type names into numbers, the y values for a model
le = preprocessing.LabelEncoder()
le.fit(doc_types)
labels = le.transform(doc_types)
note it can also do the inverse transform, from numbers back to type names:
le.inverse_transform(numbers)
5. a document is now represented by a list of tokens (words); next we convert it to word counts
the columns are the vocabulary words
each document is a row, where a column value is the count of the corresponding word
cv = CountVectorizer(vocabulary=vocabulary, ngram_range=(1,1))
counts_train = cv.transform(docs_train) #shape = (#docs, #vocabulary)
# ngram_range: (1,1) unigram only, (1,2) unigram and bigram
now the documents are vectorized
6. transform the count matrix into a term frequency and/or inverse document frequency representation
term frequency = count of the word in the doc / # of words in the doc
document frequency = the fraction of documents that contain the word = # of documents containing the word / total # of documents
inverse document frequency = 1 / document frequency; the less frequent the word, the higher the weight, which means a rare word helps classification
when a word is very rare, the document frequency tends towards 1/total and its inverse towards 'total', which can be too large for a big corpus
so we take the log: idf = log(inverse document frequency)
there is a handy library class to do this:
tt = TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
X_train = tt.fit_transform(counts_train) #fit and transform
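a tiny worked example of the formulas above with numpy (note TfidfTransformer itself uses a slightly different smoothed idf, ln((1 + #docs) / (1 + document count containing the word)) + 1 when smooth_idf=True, and then l2-normalizes each row):
import numpy as np
counts = np.array([[2, 0, 1],
                   [0, 1, 1]])                    # 2 docs x 3 vocabulary words
tf = counts / counts.sum(axis=1, keepdims=True)   # term frequency per document
df = (counts > 0).sum(axis=0) / counts.shape[0]   # normalized document frequency
tfidf = tf * np.log(1 / df)                       # word 3 appears in every doc, so its weight becomes 0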
7. now we are ready to train a model
first choose the hinge loss function: hinge_loss = max(0, 1 - actual * prediction)
clf = SGDClassifier(loss='hinge')
clf.fit(X_train, y_train)
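a quick check of the hinge loss on made-up numbers (the label is +/-1 and the prediction is the raw decision value):
def hinge_loss(actual, prediction):
    return max(0, 1 - actual * prediction)
hinge_loss(1, 2.5)   # 0.0, confidently correct, no loss
hinge_loss(1, 0.3)   # 0.7, correct but inside the margin
hinge_loss(1, -0.5)  # 1.5, wrong side of the boundary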
8. study the performance: testing, error reporting, plotting (a plotting sketch follows the full script below).
The full script below trains a linear SVM model for classifying text.
It uses the 90k English dictionary to select words and excludes all stop words.
The raw document texts come from a database, but that part can easily be replaced with file operations.
import pyodbc
import regex as re
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import preprocessing
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
import numpy as np
from nltk.stem import WordNetLemmatizer
from sklearn import metrics
#load a 90k English word dictionary (search online and download one)
dict90k = set()
with open(r'C:\words\90kEnglishDictionary.txt') as f:
    for line in f:
        if not line.startswith('#'): #skip comment lines
            dict90k.add(line.rstrip('\n').lower())
#load English stop words (search online and download a list)
stopwords = set()
with open(r'C:\words\EnglishStopWords.txt') as f:
    for line in f:
        if not line.startswith('#'): #skip comment lines
            stopwords.add(line.rstrip('\n').lower())
#replace the database part accordingly with file operations if needed.
database_conn = 'DRIVER={SQL Server};SERVER=dbserver;DATABASE=databasename;Trusted_Connection=yes;'
query = "SELECT [Type] \
,[FileName]\
,[Text]\
FROM [Claims_Automation].[bayes].[train_data]"
conn = pyodbc.connect(database_conn)
cursor = conn.cursor()
cursor.execute(query)
rows = cursor.fetchall()
cursor.close()
conn.close()
#the nltk data (wordnet) may need to be downloaded first; on Anaconda the nltk_data package can be installed instead
#import nltk
#nltk.set_proxy('http://proxy:8080', ('user', 'password'))
#nltk.download()
lemmatizer = WordNetLemmatizer()
# tokenize the docs and lemmatize the words
# keep only words in the dictionary, and exclude stop words
# use all surviving words as the vocabulary
docs = []
doc_types = []
files= []
vocabulary = set()
for row in rows:
    doc_type = row[0]
    file_name = row[1]
    text = row[2]
    files.append(file_name)
    doc_types.append(doc_type)
    tokens = re.findall("[a-zA-Z]+", text)
    #lowercase the tokens so they match the lowercased dictionary and stop word lists
    tokens = [lemmatizer.lemmatize(t.lower()) for t in tokens]
    selected_tokens = [t for t in tokens if t in dict90k and t not in stopwords]
    docs.append(' '.join(selected_tokens))
    vocabulary.update(selected_tokens)
# encode doc types into integers
le = preprocessing.LabelEncoder()
le.fit(doc_types)
labels = le.transform(doc_types)
#print(le.classes_)
#split docs into train and test
docs_train, docs_test, zip_train, zip_test = train_test_split(docs, list(zip(labels, files)), test_size=0.2, random_state=0)
y_train = np.array([label for label, file in zip_train])
y_test = np.array([label for label, file in zip_test])
file_test = np.array([file for label, file in zip_test])
# count the words for each document given the vocabulary
# vectorize the word counts
# ngram_range: (1,1) unigram only, (1,2) unigram and bigram
cv = CountVectorizer(vocabulary=sorted(vocabulary), ngram_range=(1,1)) #sort the set for a stable column order
counts_train = cv.transform(docs_train) #shape = (#docs, #vocabulary)
#Transform a count matrix to a normalized tf or tf-idf representation
#sublinear_tf, default=False. Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
#norm {'l1', 'l2'}, default='l2'. Each output row will have unit norm
#use_idf, default=True. Enable inverse-document-frequency reweighting.
#smooth_idf, default=True. Smooth idf weights by adding one to document frequencies to avoid division by 0.
#if the vocabulary were built only from the training data, smooth_idf could be False; here it is built from all docs, so keep the smoothing
tt = TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
X_train = tt.fit_transform(counts_train) #fit and transform
# linear SVM
# loss: 'hinge' for linear SVM, 'log_loss' (called 'log' in older sklearn) for logistic regression
# alpha: weight of the penalty
# max_iter: max epochs
# tol: If not None, training stops when (loss > best_loss - tol) for n_iter_no_change consecutive epochs.
clf = SGDClassifier(loss='hinge',
                    penalty='l2',
                    alpha=1e-3,
                    random_state=88,
                    max_iter=20,
                    tol=None,
                    learning_rate='optimal')
clf.fit(X_train, y_train)
#prediction for test data
counts_test = cv.transform(docs_test)
X_test = tt.transform(counts_test)
pred = clf.predict(X_test)
#print the overall accuracy (fraction of correct predictions)
accuracy = sum(y_test == pred) / len(y_test)
print(accuracy)
#print the misclassified test documents: actual type | file | predicted type
error_types = le.inverse_transform(y_test[y_test!=pred])
error_preds = le.inverse_transform(pred[y_test!=pred])
error_files = file_test[y_test!=pred]
for t, f, p in zip(error_types, error_files, error_preds):
    print('{0} | {1} | {2}'.format(t, f, p))
#print performance report
print(metrics.classification_report(y_test, pred, target_names=le.classes_))
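To cover the plotting part of step 8, a minimal confusion matrix sketch (assumes matplotlib is installed; it reuses y_test, pred and le.classes_ from above):
import matplotlib.pyplot as plt
cm = metrics.confusion_matrix(y_test, pred)
fig, ax = plt.subplots()
ax.imshow(cm, cmap='Blues')
ax.set_xticks(range(len(le.classes_)))
ax.set_yticks(range(len(le.classes_)))
ax.set_xticklabels(le.classes_, rotation=90)
ax.set_yticklabels(le.classes_)
ax.set_xlabel('predicted')
ax.set_ylabel('actual')
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(j, i, cm[i, j], ha='center', va='center')
plt.tight_layout()
plt.show()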