
Load data from CSV into a scikit-learn SVM

Here's a step-by-step guide to training an SVM on your data and then evaluating it on the same dataset. The same walkthrough is available as an IPython notebook at http://nbviewer.ipython.org/gist/anonymous/2cf3b993aab10bf26d5f, where you can also see the intermediate data and the resulting accuracy.

Step 0: Install dependencies

You need to install the following libraries:

  • pandas
  • scikit-learn

From command line:

pip install pandas
pip install scikit-learn

Step 1: Load the data

We will use pandas to load our data. pandas is a library that makes it easy to load and manipulate tabular data. For illustration, we first save sample data to a CSV file and then load it back.

We will train the SVM on train.csv and evaluate it on test.csv.

import pandas as pd

train_data_contents = """
class_label,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
C,2,3,0,1
S,23,2,0,0
N,12,0,0,1"""


with open('train.csv', 'w') as output:
    output.write(train_data_contents)

train_dataframe = pd.read_csv('train.csv')
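If you want to confirm the CSV parsed as expected, you can inspect the dataframe before going further:

print(train_dataframe.head())
print(train_dataframe.dtypes)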

Step 2: Process the data

We will convert our dataframe into NumPy arrays, which is the format scikit-learn understands.

We also need to convert the labels "B", "M", "C", ... to numbers, because the SVM does not understand strings (scikit-learn can automate this; see the LabelEncoder sketch after the output below).

Then we will train a linear SVM on the data.

import numpy as np

train_labels = train_dataframe.class_label
labels = sorted(set(train_labels))  # sorted, so the label-to-number mapping is deterministic
train_labels = np.array([labels.index(x) for x in train_labels])
train_features = train_dataframe.iloc[:,1:]
train_features = np.array(train_features)

print "train labels: "
print train_labels
print 
print "train features:"
print train_features

We see here that the length of train_labels (5) exactly matches the number of rows in train_features; each item in train_labels corresponds to a row.
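As an aside, scikit-learn ships a LabelEncoder that does this string-to-number conversion for you and remembers the mapping, so you can apply the same one to the test labels later. A minimal sketch:

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# Learn the mapping from the training labels and apply it in one step.
train_labels = encoder.fit_transform(train_dataframe.class_label)
# Later, reuse the same mapping on the test labels with encoder.transform(...)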

Step 3: Train the SVM

from sklearn import svm
classifier = svm.SVC(kernel='linear')  # a linear SVM, as described above
classifier.fit(train_features, train_labels)
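One caveat that is easy to miss: SVMs are sensitive to the scale of the features. The toy features here are all small integers, but if your real columns have very different ranges, standardizing them first usually helps. A minimal sketch (not part of the original recipe) using scikit-learn's StandardScaler in a Pipeline:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Rescale each feature to zero mean and unit variance before the SVM sees it.
scaled_classifier = make_pipeline(StandardScaler(), svm.SVC(kernel='linear'))
scaled_classifier.fit(train_features, train_labels)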

Step 4: Evaluate the SVM on some testing data

test_data_contents = """
class_label,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
C,2,3,0,1
S,23,2,0,0
N,12,0,0,1
"""

with open('test.csv', 'w') as output:
    output.write(test_data_contents)

test_dataframe = pd.read_csv('test.csv')

test_labels = test_dataframe.class_label
# Reuse the labels list built from the training data, so the numeric
# codes mean the same thing for both train and test.
test_labels = np.array([labels.index(x) for x in test_labels])

test_features = test_dataframe.iloc[:,1:]
test_features = np.array(test_features)

results = classifier.predict(test_features)
num_correct = (results == test_labels).sum()
accuracy = num_correct / len(test_labels)  # this metric is accuracy, not recall
print("model accuracy (%):", accuracy * 100, "%")

Links & Tips

You should be able to take this code, replace train.csv with your training data and test.csv with your testing data, and get predictions for your test data along with accuracy results.

Note that since you're evaluating on the same data you trained on, the accuracy will be unusually high.
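For a more honest estimate, hold out part of your data for evaluation instead of reusing the training set. A minimal sketch using scikit-learn's train_test_split (this assumes you have more rows than the five in the toy example):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    train_features, train_labels, test_size=0.25, random_state=0)
classifier = svm.SVC(kernel='linear')
classifier.fit(X_train, y_train)
print("held-out accuracy (%):", classifier.score(X_test, y_test) * 100, "%")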


An alternative is to skip scikit-learn and use the LIBSVM Python bindings (svmutil) directly, preparing the same data by hand:

from collections import namedtuple

# Using namedtuples for descriptive purposes; in actual code a normal tuple would work fine.
Category = namedtuple("Category", ["index", "name"])
Feature = namedtuple("Feature", ["category_index", "distance_from_beginning",
                                 "distance_from_end", "contains_digit", "capitalized"])

# Set up the categories. LIBSVM requires a numerical label,
# so we associate each category name with an index.
categories = dict()
for index, name in enumerate("B M C S NA".split(' ')):
    # LibSVM expects indexes to start at 1, not 0.
    categories[name] = Category(index + 1, name)

categories


Out[0]: {'B': Category(index=1, name='B'),
 'C': Category(index=3, name='C'),
 'M': Category(index=2, name='M'),
 'NA': Category(index=5, name='NA'),
 'S': Category(index=4, name='S')}


# Faked set of CSV input for example purposes.
csv_input_lines = """category_index,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
C,2,3,0,1
S,23,2,0,0
NA,12,0,0,1""".split("\n")

# We just ignore the header.
header = csv_input_lines[0]


# A list of Feature namedtuples; this is what we will train on.
features = list()
for line in csv_input_lines[1:]:
    split_values = line.split(',')
    # Create a Feature, converting the numeric values to integers.
    features.append(Feature(categories[split_values[0]].index, *map(int, split_values[1:])))

features


Out[1]: [Feature(category_index=1, distance_from_beginning=1, distance_from_end=10, contains_digit=1, capitalized=0),
 Feature(category_index=2, distance_from_beginning=10, distance_from_end=1, contains_digit=0, capitalized=1),
 Feature(category_index=3, distance_from_beginning=2, distance_from_end=3, contains_digit=0, capitalized=1),
 Feature(category_index=4, distance_from_beginning=23, distance_from_end=2, contains_digit=0, capitalized=0),
 Feature(category_index=5, distance_from_beginning=12, distance_from_end=0, contains_digit=0, capitalized=1)]
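If you're reading from a real file, Python's built-in csv module handles the parsing (headers, quoting) more robustly than splitting on commas by hand. A minimal sketch that builds the same list of Features (an alternative, not part of the original recipe):

import csv

# csv.DictReader accepts any iterable of lines, so we can feed it csv_input_lines
# directly; with a real file, pass the open file object instead.
features = [
    Feature(categories[row['category_index']].index,
            int(row['distance_from_beginning']),
            int(row['distance_from_end']),
            int(row['contains_digit']),
            int(row['capitalized']))
    for row in csv.DictReader(csv_input_lines)
]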


# y is the list of numeric labels used in training, one per Feature (order matters).
y = [f.category_index for f in features]

# x is the list of feature vectors; we drop the category index, which sits at position 0 of each namedtuple.
x = [list(f)[1:] for f in features]


from svmutil import svm_parameter, svm_problem, svm_train, svm_predict

# Bare-bones default parameters for the SVM.
param = svm_parameter()

# The (y, x) arguments should be the training dataset.
prob = svm_problem(y, x)
model = svm_train(prob, param)

# For a real accuracy check, the (y, x) arguments here should be the test dataset.
p_labels, p_acc, p_vals = svm_predict(y, x, model)


Out[3]: Accuracy = 100% (5/5) (classification)
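A note on the import: svmutil ships with the LIBSVM distribution itself. If you instead installed LIBSVM's official Python package from PyPI (libsvm-official), the module is namespaced, so the import would be:

# Same functions, namespaced under the libsvm package:
from libsvm.svmutil import svm_parameter, svm_problem, svm_train, svm_predict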
