Rather than stop at coding the text (Option A for the course), my curiosity was piqued and I tried my hand at prediction in LightSIDE. Excel features I haven't quite mastered prevented me from creating a chart that could answer my question, so I still had to manually count how many times each label appeared to get a sense of how often users celebrated or shared content within the DPLN. Machine learning, to this point, was not helpful.
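In hindsight, that manual tally could also be scripted. Here is a minimal sketch in Python, assuming the csv has columns named "text" and "code" (placeholder names, not necessarily my actual headers):

```python
import pandas as pd

# Load the coded tweets (the filename and column names are assumptions)
tweets = pd.read_csv("ncties17_coded.csv")

# Count how many tweets received each label: celebration, content, or both
label_counts = tweets["code"].value_counts()
print(label_counts)

# The same counts as percentages of the whole set
print((tweets["code"].value_counts(normalize=True) * 100).round(0))
```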
After importing my csv file with codes for celebration, I followed directions from Carolyn Rosé (2014) to extract features, train the model and predict labels (figures 5, 6, and 7). Her recommendation was to extract Basic Features and apply the Naive Bayes model.
figure 5. Screenshot of LightSIDE extraction.
figure 6. Screenshot of LightSIDE model training.
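For readers who prefer code to screenshots, the same recipe can be approximated outside of LightSIDE. The sketch below uses scikit-learn as a rough analogue: word counts stand in for Basic Features, and a Naive Bayes classifier is evaluated with cross-validation. The filename and column names are placeholders, and this is not LightSIDE's exact implementation.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

tweets = pd.read_csv("ncties17_coded.csv")

# "Basic Features" in LightSIDE are essentially word (unigram) counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tweets["text"])
y = tweets["code"]

# Train and evaluate a Naive Bayes classifier with 10-fold cross-validation
model = MultinomialNB()
scores = cross_val_score(model, X, y, cv=10)
print("Accuracy:", round(scores.mean(), 2))
```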
LightSIDE found every word and unique piece of punctuation. When trained, it reported performance metrics. As can be seen in the trained model matrix, the accuracy was 0.48 while the Kappa value was 0.22.
"Accuracy is just the percentage of test examples that were predicted with the correct class, but Kappa is an adjusted version." (Rosé, 2014 Nov 24).
The scale is 0 to 1, with the latter being optimal (Rosé, 2014 Nov 24). The model's Kappa value is rather abysmal, meaning it performed only slightly better than chance agreement. This raises the question: why didn't the machine learn? Did the data need more cleaning? Were there too many punctuation marks? The answer is beyond the scope of this introductory unit.
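To make sense of the accuracy and Kappa values together, here is a back-of-the-envelope check (my own arithmetic, not anything LightSIDE reported), using the standard Cohen's kappa relationship:

```python
# Cohen's kappa = (observed agreement - chance agreement) / (1 - chance agreement)
accuracy = 0.48  # observed agreement reported by LightSIDE
kappa = 0.22     # Kappa reported by LightSIDE

# Solving for the implied chance agreement, p_e:
p_e = (accuracy - kappa) / (1 - kappa)
print(round(p_e, 2))  # ~0.33, roughly what chance alone would give with three labels
```

In other words, the model's 48% accuracy is only modestly better than the roughly 33% a chance-level guesser would achieve across my three labels.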
figure 7. Screenshot of LightSIDE predicted labels.
Despite the small Kappa value, it's interesting to dig into the predicted labels to see what LightSIDE did with some codes. For instance, in figure 7 above, several tweets that I coded as "both" celebration and content were labeled only celebration by the model. I can understand this: my more in-depth examination of the tweets drew on the content of attached images and hyperlinks, which had to be interpreted by viewing them in a web browser and could not be inferred from the text alone.
While the predicted labels are interesting, they don't help answer my initial question. Therefore, I employed another tool to produce a word cloud that might indicate which terms occurred most frequently within the text itself. I deleted the Code column from my csv and the "text" header row to leave just the tweet data, saved it as a plain text file, and imported it into the word cloud generator at TagCrowd.com (see figure 8). I programmed it to leave out articles such as "a", "an", and "the" and ran the visualization, adding non-contextual words to omit, such as "http" and "co", which were often fragments of URLs. From the generated word cloud, it's no surprise that "ncties" was counted 204 times, but it's also interesting to note how prominent other terms are. Notably, "ready" was a word from my own tweet (figure 1), which I had been planning to choose for training the model before I decided to attempt it.
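The same cleanup and counting could be scripted as well. Here is a minimal sketch of the idea; the filename, column names, and stopword list are my own assumptions, and TagCrowd's actual filtering likely differs.

```python
import re
from collections import Counter

import pandas as pd

# Keep only the tweet text, dropping the Code column and header (names assumed)
tweets = pd.read_csv("ncties17_coded.csv")
text = " ".join(tweets["text"].astype(str)).lower()

# Words to omit: articles plus URL fragments like "http" and "co"
stopwords = {"a", "an", "the", "http", "https", "co", "t", "rt"}

words = re.findall(r"[a-z']+", text)
counts = Counter(w for w in words if w not in stopwords)

# The most frequent terms, analogous to the largest words in the cloud
print(counts.most_common(20))
```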
While the sample size is small, this could be an indicator that NCTIES17 participants were using Twitter to "share" "student" "learning" and "presenting", while seeking to "empower" "professionally" "ready" "PD" "sessions". The percentage of text labeled "content" (40%) was nearly equal to that labeled "celebration" (38%), with significant overlap coded as "both" (21%). As Ross Brenemann pointed out in an EdWeek column (2015), "the social media platform offers immediacy and practicality lacking from school PD programs". Therefore, we might construe that participants are using Twitter equally to share content and to cultivate professional relationships within the DPLN. One might not be independent of the other.
figure 8. Screenshot of tagcrowd.com word cloud generation.