Lab 7 Sample Output

Notes on sample output

Part of the lab is making sure you can write the BinaryLabels class and the BagOfWordsFeatures class and use them to extract the X and y matrices. Therefore, there is less guidance below on how the sample output was created. However, the command-line arguments used when running the program are always shown.

Implementing your own Labeler

Assuming you run your program with the command-line arguments shown below, you should get the following extracted y matrix. The output below also assumes that the variable holding the instance of the BinaryLabels class is named labeler.
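To sanity-check your own implementation, here is a minimal sketch of what a BinaryLabels-style labeler could look like. This is only an illustration, not the required implementation: the real class must read labels out of the ground-truth XML file, and the method name process below is hypothetical.

```python
class BinaryLabels:
    """Minimal sketch: maps 'true'/'false' hyperpartisan strings to 1/0."""

    def __init__(self):
        # Fixed mapping matching the labeler.labels output shown below.
        self.labels = {'false': 0, 'true': 1}

    def process(self, label_string):
        # Convert one ground-truth label string to its integer class.
        return self.labels[label_string]


labeler = BinaryLabels()
y = [labeler.process(s) for s in ['true', 'true', 'false']]
print(y)  # [1, 1, 0]
```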

$ python3 hyperpartisan_main.py /cs/cs159/data/semeval/articles-training-bypublisher-20181122.parsed.xml /cs/cs159/data/semeval/ground-truth-training-bypublisher-20181122.xml /cs/cs159/data/semeval/vocab.txt --train_size 10 -x5


>>> y

[1, 1, 0, 1, 1, 0, 0, 0, 0, 0]


Here is another run with a larger set of articles.

$ python3 hyperpartisan_main.py /cs/cs159/data/semeval/articles-training-bypublisher-20181122.parsed.xml /cs/cs159/data/semeval/ground-truth-training-bypublisher-20181122.xml /cs/cs159/data/semeval/vocab.txt --train_size 100 -x5


>>> print(y)

[1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0]

>>> print(sum(y))

53

>>> print(labeler.labels)

{'false': 0, 'true': 1}


Implementing your own Feature Extractor

It is hard to display the lil_matrix in full unless both the vocabulary size and the training size are drastically limited. In addition, lil_matrix doesn't have a nice way of converting itself to a string for easy viewing, so the example below converts it to a dense matrix (for debugging very small cases only) so that you can compare your output. The code below assumes that the variable holding the instance of the BagOfWordsFeatures class is named bow_features.
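If you have not used scipy's sparse matrices before, the following toy example (three documents, five vocabulary words, with made-up counts) shows the repr-vs-toarray() distinction used in the transcript below:

```python
import numpy as np
from scipy.sparse import lil_matrix

# Tiny stand-in for X: 3 documents (rows) by 5 vocabulary words (columns).
X = lil_matrix((3, 5), dtype=np.uint8)
X[0, 1] = 2   # word 1 appears twice in document 0
X[2, 4] = 1   # word 4 appears once in document 2

print(repr(X))      # shape, dtype, and number of stored elements
print(X.toarray())  # dense view -- only sensible for very small matrices
```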

$ python3 hyperpartisan_main.py /cs/cs159/data/semeval/articles-training-bypublisher-20181122.parsed.xml /cs/cs159/data/semeval/ground-truth-training-bypublisher-20181122.xml /cs/cs159/data/semeval/vocab.txt --train_size 10 -x5 -s100 -v 20


>>> print(repr(X))

<10x20 sparse matrix of type '<class 'numpy.uint8'>'

with 49 stored elements in List of Lists format>

>>> print(X.toarray())

[[ 1 0 0 0 0 1 0 2 0 0 1 0 3 1 0 0 0 0 2 1]

[ 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 1 0 0 0 0]

[ 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

[ 1 0 0 0 0 0 0 1 0 0 1 1 0 1 0 1 0 1 3 0]

[ 1 0 0 0 1 0 0 0 0 2 0 0 0 2 0 0 1 0 0 1]

[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

[ 0 1 4 0 4 0 0 1 0 2 0 0 0 0 0 1 0 1 0 0]

[ 1 1 0 0 0 0 1 3 0 1 0 0 1 0 0 1 0 0 0 1]

[ 0 0 0 0 2 0 1 2 17 0 0 0 1 0 0 0 0 0 0 3]

[ 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]

>>> print(np.sum(X.toarray(), axis=0))

[ 4 5 4 0 7 1 3 9 18 6 2 1 5 4 0 4 1 2 5 6]

>>> print(np.sum(X.toarray(), axis=1))

[12 4 1 10 8 0 14 10 26 2]

>>> print(bow_features.vocab.index_to_label(0))

most

>>> print(bow_features.vocab.index_to_label(2))

percent

>>> print(bow_features.vocab._dict['percent'])

2

>>> print(ids)

['0000001', '0000002', '0000008', '0000012', '0000015', '0000016', '0000018', '0000020', '0000025', '0000027']


Here is some extracted data from a larger run:

$ python3 hyperpartisan_main.py /cs/cs159/data/semeval/articles-training-bypublisher-20181122.parsed.xml /cs/cs159/data/semeval/ground-truth-training-bypublisher-20181122.xml /cs/cs159/data/semeval/vocab.txt --train_size 100 -x5


>>> print(repr(X))

<100x1603164 sparse matrix of type '<class 'numpy.uint8'>'

with 25168 stored elements in List of Lists format>

>>> print(X.toarray())

[[14 18 15 ... 0 0 0]

[10 12 11 ... 0 0 0]

[11 13 12 ... 0 0 0]

...

[30 12 14 ... 0 0 0]

[27 21 34 ... 0 0 0]

[ 2 0 0 ... 0 0 0]]

>>> s = np.sum(X, axis=0)

>>> print(s)

[[3115 2598 2399 ... 0 0 0]]

>>> print(s[0,0])

3115

>>> print(s[0,1])

2598

>>> bow_features.vocab._dict['obama']

160

>>> s[0, 160]

47

>>> r = X[:,160].toarray()

>>> np.where(r!=[0])[0]

array([ 0, 10, 11, 19, 27, 31, 37, 68, 82, 89, 95])

>>> np.sum(X)

62017
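One detail worth noting in the transcript above: summing a scipy sparse matrix with np.sum returns a 2-D matrix of shape (1, vocab_size), which is why the column sums are indexed as s[0, i] rather than s[i]. A small self-contained demonstration (with a made-up 2x3 matrix):

```python
import numpy as np
from scipy.sparse import lil_matrix

X = lil_matrix(np.array([[1, 0, 2],
                         [0, 3, 0]], dtype=np.uint8))

# Column sums come back as a 2-D matrix of shape (1, n_columns),
# so individual entries need two indices: s[0, i].
s = np.sum(X, axis=0)
print(s.shape)   # (1, 3)
print(s[0, 2])   # 2

# Rows (documents) in which column 2 is nonzero:
r = X[:, 2].toarray()
print(np.where(r != 0)[0])  # [0]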


Classifying data

Here are some sample runs on small sets using the MultinomialNB classifier with no optional arguments. The first set of sample runs uses 5-fold cross-validation and assumes that you assign the return value of cross_val_predict to the variable class_probs and your final prediction for each instance to the variable y_pred.
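As a sketch of how class_probs and y_pred can be produced (using random toy data in place of the real X and y; the cv value and the argmax step are assumptions consistent with the output shown):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_predict

# Toy stand-ins for the extracted feature matrix and labels; in the lab,
# X and y come from BagOfWordsFeatures and BinaryLabels.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(20, 10))
y = np.array([0, 1] * 10)

# method='predict_proba' gives one row per instance holding
# P(class 0) and P(class 1), like class_probs below.
class_probs = cross_val_predict(MultinomialNB(), X, y, cv=5,
                                method='predict_proba')

# Final prediction for each instance: the class with higher probability.
y_pred = np.argmax(class_probs, axis=1)
```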

$ python3 hyperpartisan_main.py /cs/cs159/data/semeval/articles-training-bypublisher-20181122.parsed.xml /cs/cs159/data/semeval/ground-truth-training-bypublisher-20181122.xml /cs/cs159/data/semeval/vocab.txt --train_size 20 -x5 -s 25 -v 1000 -o predictions.txt


>>> print(class_probs)

[[7.83874897e-18 1.00000000e+00]

[1.00000000e+00 2.94147015e-10]

[9.23982239e-01 7.60177605e-02]

[2.51593893e-10 1.00000000e+00]

[1.25244209e-28 1.00000000e+00]

[2.98836482e-04 9.99701164e-01]

[1.00000000e+00 1.75973419e-31]

[1.30155356e-02 9.86984464e-01]

[7.96871024e-44 1.00000000e+00]

[2.61766548e-01 7.38233452e-01]

[1.01711144e-16 1.00000000e+00]

[5.10063975e-65 1.00000000e+00]

[1.52316416e-18 1.00000000e+00]

[1.00000000e+00 2.23315973e-10]

[1.00000000e+00 2.72856184e-26]

[1.00000000e+00 1.98995873e-27]

[2.30914904e-01 7.69085096e-01]

[2.87531686e-33 1.00000000e+00]

[2.35563175e-08 9.99999976e-01]

[8.47334782e-01 1.52665218e-01]]

>>> print(y_pred)

[1 0 0 1 1 1 0 1 1 1 1 1 1 0 0 0 1 1 1 0]


$ python3 semeval-pan-2019-evaluator.py -d /cs/cs159/data/semeval/ground-truth-training-bypublisher-20181122.xml -r predictions.txt

semeval-pan-2019-evaluator.py:32: UserWarning: Missing 599980 predictions

warnings.warn("Missing {} predictions".format(len(groundTruth) - sum(c.values())), UserWarning)

{

"truePositives": 8,

"trueNegatives": 5,

"falsePositives": 5,

"falseNegatives": 2,

"accuracy": 0.65,

"precision": 0.6153846153846154,

"recall": 0.8,

"f1": 0.6956521739130435

}


Here is a test on more data.

$ python3 hyperpartisan_main.py /cs/cs159/data/semeval/articles-training-bypublisher-20181122.parsed.xml /cs/cs159/data/semeval/ground-truth-training-bypublisher-20181122.xml /cs/cs159/data/semeval/vocab.txt --train_size 1000 -x5 -s 25 -v 10000 -o predictions.txt


>>> print(class_probs)

[[2.75307198e-11 1.00000000e+00]

[1.00000000e+00 5.36541138e-11]

[9.99999873e-01 1.26840482e-07]

...

[3.61384760e-34 1.00000000e+00]

[1.00000000e+00 1.39833865e-89]

[2.48991713e-26 1.00000000e+00]]

>>> print(y_pred)

[1 0 0 1 1 0 0 1 1 0 1 1 0 1 0 0 1 1 1 1 0 1 0 1 0 1 1 1 1 0 0 1 1 1 0 1 0

0 0 0 1 1 1 0 0 1 1 0 0 1 0 0 0 0 0 0 1 1 0 1 1 1 0 1 0 1 0 1 1 1 1 0 0 1

1 0 1 1 0 1 0 0 1 1 0 1 0 0 0 1 1 1 0 1 1 0 0 0 1 1 0 1 1 0 1 1 0 0 1 0 1

0 0 1 0 0 1 0 1 0 1 0 0 1 1 0 0 0 1 0 1 0 0 0 1 0 0 1 1 0 1 1 0 1 1 1 0 0

0 1 0 1 0 0 1 0 0 0 1 0 0 0 1 0 1 0 1 1 0 0 1 0 1 1 0 1 1 0 1 0 0 0 1 1 0

0 1 1 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0 1 1 0 0 0 1 0 1 0 1 0 0 0 1 1 0

0 0 0 0 0 1 0 1 0 1 0 1 1 0 1 1 1 0 0 0 1 1 0 0 1 0 1 1 0 1 1 1 1 0 0 1 0

1 1 0 0 0 0 0 0 1 0 1 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 1 0 1

1 0 0 0 0 1 1 0 0 1 1 0 0 0 0 1 1 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 1 1 1 1

1 0 1 1 1 0 0 1 0 0 0 1 1 1 1 0 0 0 0 1 1 0 1 1 0 1 1 1 1 0 1 0 0 0 0 1 0

1 0 0 1 0 1 1 1 1 1 0 1 0 1 0 0 1 1 0 0 0 0 1 0 1 0 1 1 0 1 0 0 0 1 1 0 0

1 0 1 0 0 0 1 0 1 1 0 1 1 0 0 1 0 0 0 1 0 1 1 1 0 0 1 1 1 1 0 0 0 1 1 0 1

1 0 1 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 0 1 0 1 0 0 0 0 1 0 1 0 1 1 0 1 0 0 1

0 1 0 1 1 1 1 0 1 0 1 1 1 0 1 1 0 1 0 0 1 0 0 0 0 1 0 1 1 1 1 1 0 1 0 0 1

0 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 0 1 1 0 1 1 0 0 0 1 0 1 0 1 1 1 0 1

1 0 0 0 1 0 1 0 0 1 0 1 0 1 1 0 1 1 0 1 1 1 0 1 1 0 0 0 1 1 1 0 1 0 0 1 0

1 1 0 0 0 0 0 1 1 1 1 0 1 0 1 0 1 0 0 1 0 1 0 1 1 0 1 0 1 0 1 1 0 0 1 0 1

0 0 0 1 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0

0 1 0 1 0 1 1 0 1 0 0 1 0 1 0 0 1 1 0 0 1 0 1 0 0 0 1 1 0 0 0 1 1 1 1 0 1

0 0 1 1 0 1 1 0 1 1 1 1 0 1 1 0 0 1 0 1 1 1 0 0 0 1 1 1 1 1 0 1 0 0 0 0 1

0 1 0 0 1 0 0 0 0 1 1 1 0 0 1 0 1 0 1 0 1 1 1 1 1 1 1 0 0 0 1 0 1 0 0 1 1

1 0 1 0 0 0 1 0 1 1 0 1 1 1 0 0 0 1 1 0 1 0 0 1 0 0 0 1 0 1 1 0 0 0 1 1 0

1 0 0 1 0 1 1 0 0 0 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 0 1 0 1 1 0 1 0 0

0 1 1 0 1 1 1 0 1 0 1 1 1 1 1 0 0 0 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0

0 1 1 1 0 1 1 0 1 1 0 1 0 1 0 1 1 0 1 0 1 1 1 1 1 0 0 1 0 0 1 0 0 0 0 0 0

0 0 1 1 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 0 1 0 1 0 0 1 1 1 1 0 1 1 1 1 1 1

0 1 1 1 1 1 1 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0 1 0 1 0 0 0 1 1 1 0 1 0 0 1 0

1]

>>> print(sum(y_pred))

495


$ python3 semeval-pan-2019-evaluator.py -d /cs/cs159/data/semeval/ground-truth-training-bypublisher-20181122.xml -r predictions.txt

semeval-pan-2019-evaluator.py:32: UserWarning: Missing 599000 predictions

warnings.warn("Missing {} predictions".format(len(groundTruth) - sum(c.values())), UserWarning)

{

"truePositives": 400,

"trueNegatives": 389,

"falsePositives": 95,

"falseNegatives": 116,

"accuracy": 0.789,

"precision": 0.8080808080808081,

"recall": 0.7751937984496124,

"f1": 0.7912957467853611

}


The second set of runs trains on the by-publisher training data and tests on the by-article data.
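When a separate test set is supplied, cross-validation is no longer needed; a plausible sketch of the fit/predict pattern (again with toy data standing in for the two datasets) is:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the by-publisher training data and by-article test data.
rng = np.random.default_rng(1)
X_train = rng.integers(0, 4, size=(30, 8))
y_train = np.array([0, 1] * 15)
X_test = rng.integers(0, 4, size=(5, 8))

clf = MultinomialNB()
clf.fit(X_train, y_train)

class_probs = clf.predict_proba(X_test)  # one (P0, P1) row per test article
y_pred = clf.predict(X_test)             # hard 0/1 prediction per article
```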

$ python3 hyperpartisan_main.py /cs/cs159/data/semeval/articles-training-bypublisher-20181122.parsed.xml /cs/cs159/data/semeval/ground-truth-training-bypublisher-20181122.xml /cs/cs159/data/semeval/vocab.txt --train_size 1000 -s 25 -v 10000 -t /cs/cs159/data/semeval/articles-training-byarticle-20181122.parsed.xml --test_size 10 -o predictions.txt


>>> print(class_probs)

[[9.48943829e-049 1.00000000e+000]

[9.94727067e-001 5.27293307e-003]

[2.10862345e-008 9.99999979e-001]

[1.71574876e-010 1.00000000e+000]

[1.43464669e-044 1.00000000e+000]

[2.49804965e-243 1.00000000e+000]

[2.22978535e-013 1.00000000e+000]

[1.20089838e-025 1.00000000e+000]

[1.82700520e-012 1.00000000e+000]

[3.10788267e-036 1.00000000e+000]]

>>> print(y_pred)

[1 0 1 1 1 1 1 1 1 1]


$ python3 semeval-pan-2019-evaluator.py -d /cs/cs159/data/semeval/ground-truth-training-byarticle-20181122.xml -r predictions.txt

semeval-pan-2019-evaluator.py:32: UserWarning: Missing 635 predictions

warnings.warn("Missing {} predictions".format(len(groundTruth) - sum(c.values())), UserWarning)

{

"truePositives": 8,

"trueNegatives": 0,

"falsePositives": 1,

"falseNegatives": 1,

"accuracy": 0.8,

"precision": 0.8888888888888888,

"recall": 0.8888888888888888,

"f1": 0.8888888888888888

}


And a larger training/testing setup:

$ python3 hyperpartisan_main.py /cs/cs159/data/semeval/articles-training-bypublisher-20181122.parsed.xml /cs/cs159/data/semeval/ground-truth-training-bypublisher-20181122.xml /cs/cs159/data/semeval/vocab.txt --train_size 2500 -s 25 -v 20000 -t /cs/cs159/data/semeval/articles-training-byarticle-20181122.parsed.xml --test_size 645 -o predictions.txt


>>> print(class_probs)

[[2.34361479e-85 1.00000000e+00]

[4.84487587e-06 9.99995155e-01]

[4.38515516e-07 9.99999561e-01]

...

[1.62721208e-01 8.37278792e-01]

[9.99998648e-01 1.35180573e-06]

[1.95490836e-22 1.00000000e+00]]

>>> print(y_pred)

[1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 0 0 1

1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 0 1 0 1 0 1 1 1 1 1 1 1 1

1 1 0 1 1 1 0 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1

0 0 0 1 1 1 1 1 1 1 0 1 0 1 0 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 0 0 1 1 0 1 0

0 1 1 1 0 0 0 0 1 1 1 0 1 1 1 1 1 1 0 1 0 1 1 1 1 0 1 1 1 0 0 0 0 1 1 1 1

1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1

1 1 0 0 0 0 1 1 0 0 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 0 1 1 0 0

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 0 1 0 1 1 0 0 0 0 1 1 0

0 1 1 1 0 1 1 1 1 0 1 0 1 0 1 1 0 1 1 0 1 1 0 1 1 0 0 0 0 0 1 1 0 1 1 1 0

1 0 1 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 0 1 0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1

0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 1 0 1 0 1 1 1 0 0 1 1 1

1 1 1 0 1 1 1 1 1 0 1 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 0 0

0 0 0 0 1 0 1 0 0 0 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1

1 1 1 0 1 0 0 0 1 0 1 1 0 0 0 1 0 0 0 1 1 1 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0

1 0 0 0 0 0 1 0 0 0 1 1 1 1 1 1 0 1 1 0 1 1 1 1 0 1 1 0 0 1 1 0 0 0 0 0 0

1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 0 0 0 0 1 0 0 0 1 1 0 0 1 1 1 1 1 1 1 1 1 0

1 1 1 0 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 0 1 1 1 0 0 0 1 1 0 0 0 1 1

1 1 1 1 1 1 1 0 0 1 0 1 0 1 0 1]

>>> sum(y_pred)

444


$ python3 semeval-pan-2019-evaluator.py -d /cs/cs159/data/semeval/ground-truth-training-byarticle-20181122.xml -r predictions.txt

{

"truePositives": 218,

"trueNegatives": 181,

"falsePositives": 226,

"falseNegatives": 20,

"accuracy": 0.6186046511627907,

"precision": 0.49099099099099097,

"recall": 0.9159663865546218,

"f1": 0.6392961876832844

}