This project develops a new framework for diagnostic instruments that evaluate student knowledge, and uses data from such an instrument to construct an aggregate network model of that knowledge.
One key challenge in Intelligent Tutoring Systems (ITS) and Adaptive Learning Systems (ALS) lies in evaluating student learning when multiple interconnected skills or pieces of knowledge are involved. Additionally, traditional systems only record "correct" vs "wrong" outcomes and do not capture misconceptions, despite misconceptions being a key focus of pedagogical research.
Building on traditional Item Response Theory (IRT), we propose a model that instead assumes each option on a multiple-choice question (MCQ) suggests the presence of one or more constructs, which in this context may be pieces of knowledge or misconceptions. This differs from traditional IRT, where a single latent ability scale is assumed across the entire test, each item is summarised by an Item Characteristic Curve, and responses are scored only as correct or incorrect. In essence, this framework treats each MCQ option as its own separate true/false question, and considers several independent difficulty scales, one for each skill or piece of knowledge.
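A minimal sketch of this option-level scoring idea is shown below. The question labels, construct names, and option-to-construct mapping are purely illustrative placeholders, not taken from the actual instrument, and the 2PL-style response function is one plausible choice rather than the study's confirmed model.

```python
import math

# Hypothetical option-to-construct map: each MCQ option is treated as its own
# true/false item, keyed to one or more constructs (knowledge pieces or
# misconceptions). Names here are illustrative only.
OPTION_CONSTRUCTS = {
    ("Q1", "A"): ["newton_first_law"],            # correct idea
    ("Q1", "B"): ["impetus_misconception"],       # common misconception
    ("Q1", "C"): ["force_equals_motion"],         # another misconception
    ("Q2", "A"): ["newton_third_law"],
    ("Q2", "B"): ["newton_third_law", "mass_weight_confusion"],
}

def option_response_prob(theta: float, difficulty: float,
                         discrimination: float = 1.0) -> float:
    """2PL-style probability that a student with ability `theta` on a given
    construct endorses an option tied to that construct. Each construct has
    its own independent ability/difficulty scale, rather than one test-wide
    scale as in classical IRT."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

def score_response(question: str, chosen: str) -> dict:
    """Convert a single MCQ selection into binary indicators: 1 for constructs
    suggested by the chosen option, 0 for constructs tied to the other options."""
    indicators = {}
    for (q, opt), constructs in OPTION_CONSTRUCTS.items():
        if q != question:
            continue
        for c in constructs:
            indicators[c] = max(indicators.get(c, 0), int(opt == chosen))
    return indicators

# Example: a student choosing option B on Q1 signals the impetus misconception.
print(score_response("Q1", "B"))
# {'newton_first_law': 0, 'impetus_misconception': 1, 'force_equals_motion': 0}
```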
To validate this approach, preliminary testing was conducted with 1171 students across 8 schools on the topic of "Forces and Dynamics" in Physics. We produced a network of weights between constructs using pairwise Cramér's V tests (rather than raw chi-square statistics, whose magnitude scales with sample size). The outcomes suggest the presence of the expected associations between these theoretical constructs, as shown in the sketch and diagram below. In the diagram, blue arrows show connections where a relationship above a chosen threshold was found, with thicker lines indicating a stronger relationship; thin grey arrows indicate relationships that we expected to observe but did not find in the data.
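The following is a minimal sketch of how such a network could be assembled, assuming student responses have already been scored into binary construct indicators as above. The column names, random data, and threshold value are placeholders for illustration, not the study's actual figures or cut-off.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V for two categorical variables: the chi-square statistic
    normalised by sample size and table dimensions, so values are comparable
    across pairs with different numbers of responses."""
    table = pd.crosstab(x, y)
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.values.sum()
    r, k = table.shape
    return float(np.sqrt((chi2 / n) / (min(r, k) - 1)))

# Placeholder data: one row per student, one 0/1 column per construct,
# as would be produced by the option-level scoring sketched earlier.
rng = np.random.default_rng(0)
data = pd.DataFrame(
    rng.integers(0, 2, size=(200, 4)),
    columns=["newton_first_law", "newton_third_law",
             "impetus_misconception", "mass_weight_confusion"],
)

THRESHOLD = 0.1  # illustrative cut-off for drawing an edge in the network
edges = []
for a, b in combinations(data.columns, 2):
    v = cramers_v(data[a], data[b])
    if v >= THRESHOLD:
        edges.append((a, b, round(v, 3)))  # weighted edge between constructs

print(edges)
```

Edges that clear the threshold would correspond to the thick blue arrows in the diagram, with the Cramér's V value determining line weight.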