Some common steps are:
1. extract tokens from the documents, e.g. by regex matching [a-zA-Z]+
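for example, with Python's built-in re module (a minimal sketch; the sample sentence is made up):
import re
tokens = re.findall('[a-zA-Z]+', 'The claim was filed in 2021.')
# tokens = ['The', 'claim', 'was', 'filed', 'in'], digits and punctuation are dropped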
2. apply lemmatization
inflected words are reduced to their dictionary base form: third-person forms go to the base form, and past/future tense verbs go to the present tense
e.g. studies to study, went to go
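for example, with nltk's WordNetLemmatizer (a minimal sketch; the wordnet data must be downloaded first):
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('studies')        # 'study' (defaults to treating the word as a noun)
lemmatizer.lemmatize('went', pos='v')  # 'go' (verbs need the pos='v' hint)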
3.1. optionally, apply stemming
words are reduced to their root form; the root may not itself be a word
e.g. studies to studi
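for example, with nltk's PorterStemmer (a minimal sketch):
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmer.stem('studies')   # 'studi'
stemmer.stem('studying')  # 'studi'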
3.2. optionally, keep only tokens found in a dictionary, e.g. a 90k English word list
if not stemming, a fixed dictionary can be used to constrain the size of the vocabulary,
so only words in the dictionary are kept
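for example, with a plain Python set (the three-word dictionary here is just a stand-in for the 90k list):
dictionary = {'study', 'claim', 'policy'}
tokens = ['study', 'claimzz', 'policy']
kept = [t for t in tokens if t in dictionary]
# kept = ['study', 'policy']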
4. transform the list of document type names into numbers, the y values for a model
le = preprocessing.LabelEncoder()
le.fit(doc_types)
labels = le.transform(doc_types)
note it can also do the inverse transform, from numbers back to type names:
le.inverse_transform(numbers)
5. a document is now represented by a list of tokens (words); next we convert it to word counts
the columns are the vocabulary words
each document is a row, where a column value is the count of the corresponding word
cv = CountVectorizer(vocabulary=vocabulary, ngram_range=(1,1))
counts_train = cv.transform(docs_train) #shape = (#docs, #vocabulary)
# ngram_range: (1,1) unigram only, (1,2) unigram and bigram
now the documents are vectorized
6. transform the count matrix into a term frequency and/or inverse document frequency representation
term frequency = count of the word in the doc / # of words in the doc
document frequency = the fraction of documents that contain the word = # of documents containing the word / total # of documents
inverse document frequency = 1 / document frequency; the less frequent the word, the higher the weight, which means a rare word helps classification
when a word is very rare, the document frequency tends towards 1/total and its inverse towards 'total', which can be too large for a big corpus
so we take the log: idf = log(inverse document frequency)
there is a handy library class to do this:
tt = TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
X_train = tt.fit_transform(counts_train) #fit and transform
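a tiny worked example of the formulas above with numpy (note TfidfTransformer itself uses a slightly different smoothed idf, ln((1 + #docs) / (1 + document count containing the word)) + 1 when smooth_idf=True, and then l2-normalizes each row):
import numpy as np
counts = np.array([[2, 0, 1],
                   [0, 1, 1]])                    # 2 docs x 3 vocabulary words
tf = counts / counts.sum(axis=1, keepdims=True)   # term frequency per document
df = (counts > 0).sum(axis=0) / counts.shape[0]   # normalized document frequency
tfidf = tf * np.log(1 / df)                       # word 3 appears in every doc, so its weight becomes 0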
7. now we are ready to train a model
first choose the hinge loss function: hinge_loss = max(0, 1 - actual * prediction)
clf = SGDClassifier(loss='hinge')
clf.fit(X_train, y_train)
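a quick check of the hinge loss on made-up numbers (the label is +/-1 and the prediction is the raw decision value):
def hinge_loss(actual, prediction):
    return max(0, 1 - actual * prediction)
hinge_loss(1, 2.5)   # 0.0, confidently correct, no loss
hinge_loss(1, 0.3)   # 0.7, correct but inside the margin
hinge_loss(1, -0.5)  # 1.5, wrong side of the boundary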
8. study the performance: testing, error reporting, plotting (a plotting sketch follows the full script below).
The full script below trains a linear SVM model for classifying text.
It uses the 90k English dictionary to select words and excludes all stop words.
The raw document texts come from a database, but that part can easily be replaced with file operations.
import pyodbc
import regex as re
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import preprocessing
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
import numpy as np
from nltk.stem import WordNetLemmatizer
from sklearn import metrics
#load a 90k English word dictionary (search online and download one)
dict90k = set()
with open(r'C:\words\90kEnglishDictionary.txt') as f:
    for line in f:
        if not line.startswith('#'): #skip comment lines
            dict90k.add(line.rstrip('\n').lower())
#load English stop words (search online and download a list)
stopwords = set()
with open(r'C:\words\EnglishStopWords.txt') as f:
    for line in f:
        if not line.startswith('#'): #skip comment lines
            stopwords.add(line.rstrip('\n').lower())
#replace the database part accordingly with file operations if needed.
database_conn = 'DRIVER={SQL Server};SERVER=dbserver;DATABASE=databasename;Trusted_Connection=yes;'
query = "SELECT [Type] \
,[FileName]\
,[Text]\
FROM [Claims_Automation].[bayes].[train_data]"
conn = pyodbc.connect(database_conn)
cursor = conn.cursor()
cursor.execute(query)
rows = cursor.fetchall()
cursor.close()
conn.close()
#the nltk data (wordnet) may need to be downloaded first; on Anaconda the nltk_data package can be installed instead
#import nltk
#nltk.set_proxy('http://proxy:8080', ('user', 'password'))
#nltk.download()
lemmatizer = WordNetLemmatizer()
# tokenize the docs and lemmatize the words
# keep only words in the dictionary, and exclude stop words
# use all surviving words as the vocabulary
docs = []
doc_types = []
files= []
vocabulary = set()
for row in rows:
    doc_type = row[0]
    file_name = row[1]
    text = row[2]
    files.append(file_name)
    doc_types.append(doc_type)
    tokens = re.findall("[a-zA-Z]+", text)
    #lowercase the tokens so they match the lowercased dictionary and stop word lists
    tokens = [lemmatizer.lemmatize(t.lower()) for t in tokens]
    selected_tokens = [t for t in tokens if t in dict90k and t not in stopwords]
    docs.append(' '.join(selected_tokens))
    vocabulary.update(selected_tokens)
# encode doc types into integers
le = preprocessing.LabelEncoder()
le.fit(doc_types)
labels = le.transform(doc_types)
#print(le.classes_)
#split docs into train and test
docs_train, docs_test, zip_train, zip_test = train_test_split(docs, list(zip(labels, files)), test_size=0.2, random_state=0)
y_train = np.array([label for label, file in zip_train])
y_test = np.array([label for label, file in zip_test])
file_test = np.array([file for label, file in zip_test])
# count the words for each document given the vocabulary
# vectorize the word counts
# ngram_range: (1,1) unigram only, (1,2) unigram and bigram
cv = CountVectorizer(vocabulary=sorted(vocabulary), ngram_range=(1,1)) #sort the set for a stable column order
counts_train = cv.transform(docs_train) #shape = (#docs, #vocabulary)
#Transform a count matrix to a normalized tf or tf-idf representation
#sublinear_tf, default=False. Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
#norm {'l1', 'l2'}, default='l2'. Each output row will have unit norm
#use_idf, default=True. Enable inverse-document-frequency reweighting.
#smooth_idf, default=True. Smooth idf weights by adding one to document frequencies to avoid division by 0.
#if the vocabulary were built only from the training data, smooth_idf could be False; here it is built from all docs, so keep the smoothing
tt = TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
X_train = tt.fit_transform(counts_train) #fit and transform
# linear SVM
# loss: 'hinge' for linear SVM, 'log_loss' (called 'log' in older sklearn) for logistic regression
# alpha: weight of the penalty
# max_iter: max epochs
# tol: If not None, training stops when (loss > best_loss - tol) for n_iter_no_change consecutive epochs.
clf = SGDClassifier(loss='hinge',
                    penalty='l2',
                    alpha=1e-3,
                    random_state=88,
                    max_iter=20,
                    tol=None,
                    learning_rate='optimal')
clf.fit(X_train, y_train)
#prediction for test data
counts_test = cv.transform(docs_test)
X_test = tt.transform(counts_test)
pred = clf.predict(X_test)
#print the overall accuracy (fraction of correct predictions)
accuracy = sum(y_test == pred) / len(y_test)
print(accuracy)
#print the misclassified test documents: actual type | file | predicted type
error_types = le.inverse_transform(y_test[y_test!=pred])
error_preds = le.inverse_transform(pred[y_test!=pred])
error_files = file_test[y_test!=pred]
for t, f, p in zip(error_types, error_files, error_preds):
    print('{0} | {1} | {2}'.format(t, f, p))
#print performance report
print(metrics.classification_report(y_test, pred, target_names=le.classes_))
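To cover the plotting part of step 8, a minimal confusion matrix sketch (assumes matplotlib is installed; it reuses y_test, pred and le.classes_ from above):
import matplotlib.pyplot as plt
cm = metrics.confusion_matrix(y_test, pred)
fig, ax = plt.subplots()
ax.imshow(cm, cmap='Blues')
ax.set_xticks(range(len(le.classes_)))
ax.set_yticks(range(len(le.classes_)))
ax.set_xticklabels(le.classes_, rotation=90)
ax.set_yticklabels(le.classes_)
ax.set_xlabel('predicted')
ax.set_ylabel('actual')
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(j, i, cm[i, j], ha='center', va='center')
plt.tight_layout()
plt.show()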