By treating this side-effect as a new symptom, the next symptom was that when we visualized texts as a bipartite graph, the Potential Distortion (PD) became too high, making it difficult to see patterns. Visually, because each connection between two similar words is a line, the display cost for each piece of information is high. From the perspective of each pixel along a line connection, the computer screen treats every pixel as an independent variable (i.e., with its own alphabet). However, from the perspective of the drawing algorithm, there is only one variable: the similarity between the two words. If the computer screen could use only one pixel, it would achieve a great deal of Alphabet Compression (AC). Hence, a possible remedy was to use a 2D pixel map or matrix to display each similarity connection using only one pixel. Humans can use patterns of pixels to judge similarity at the passage level, as shown on the right of Fig. 2. In comparison with displaying a similarity connection as a line, the side effect of the pixel- or matrix-based approach is increased cognitive load due to the need to find the source and destination nodes from a pixel. When there are only a few connections, the line-based approach incurs lower cognitive load, as lines guide viewers to the source and destination of each similarity connection. However, when the lines become cluttered, the cognitive load of following each line becomes very high, so the pixel-based approach offers the better trade-off. Following this reasoning, we consider the side-effect tolerable.
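The pixel-map idea above can be sketched in a few lines of Python. The similarity measure and the word lists below are illustrative assumptions, not the measure used in the actual system: each word pair maps to a single cell ("pixel") rather than to a drawn line, so the display cost per similarity value is one cell.

```python
# Sketch: rendering word-level similarity as a pixel map (one cell per
# word pair) instead of one line per connection. The Jaccard-style
# character-overlap measure here is a toy stand-in for any
# word-similarity definition a scholar might supply.

def similarity(word_a, word_b):
    # Toy measure: fraction of shared characters between the two words.
    a, b = set(word_a), set(word_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def similarity_matrix(passage_a, passage_b):
    # One matrix cell per word pair; rows index passage_a, columns passage_b.
    return [[similarity(wa, wb) for wb in passage_b] for wa in passage_a]

def render(matrix, threshold=0.5):
    # Map each cell to a single character: '#' for similar, '.' otherwise,
    # mimicking a one-pixel-per-connection display.
    return "\n".join(
        "".join("#" if cell >= threshold else "." for cell in row)
        for row in matrix
    )

m = similarity_matrix(["night", "dark", "moon"], ["knight", "darkness", "sun"])
print(render(m))
```

Viewed this way, a block of dark cells signals passage-level similarity at a glance, which is the pattern-judging task the matrix view supports.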
Going back to the earlier discussion about visualizing analytical results at the word level, there was another side effect. Because we noticed that the literature scholars had a great deal of knowledge and many different definitions for judging similarity, we considered that it would be difficult for machine learning to acquire such knowledge and definitions solely from the dataset available at that time. Ideally, different definitions would be served by different analytical algorithms, but through our discussions, we could not establish a relatively stable set of definitions, as the literature scholars formulated definitions dynamically, often influenced by their search tasks and the relevant data. So the side-effect was that the literature scholars might have to translate a definition into an algorithm on demand.
By treating this side-effect as a new symptom, the third symptom is the difficulty for literature scholars to construct algorithms. If one were to define an algorithm using a programming language or pseudo-code, there would be a huge amount of flexibility. This means the alphabet of possible algorithms is huge. We therefore reduced this alphabet by allowing users to select from predefined algorithmic components, so they only need to deal with the components rather than statements in a programming language. A "program" at the component level typically has only a few components, so the alphabet of such "programs" is smaller. We introduced interactions as shown in Fig. 3. A side effect was that literature scholars had to learn to build such algorithms. We organized a workshop involving a number of literature scholars. We found that they could learn to "program" their algorithms within two hours, and they enjoyed the learning process. So the side-effect was manageable and the learning cost was justifiable.
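The component-level "programming" described above can be illustrated with a minimal sketch. The component names, their parameters, and the composition function below are hypothetical, not the interface shown in Fig. 3; the point is that a user chooses and orders a few components rather than writing arbitrary statements.

```python
# Sketch: a "program" as an ordered list of predefined components.
# All component names and parameters here are illustrative assumptions.

def lowercase(words):
    # Component: normalize case.
    return [w.lower() for w in words]

def drop_short(min_len):
    # Parameterized component: the user configures a minimum word length.
    def component(words):
        return [w for w in words if len(w) >= min_len]
    return component

def shared_fraction(other):
    # Terminal component: fraction of the words also found in `other`,
    # a toy stand-in for a scholar-defined similarity judgement.
    def component(words):
        return len(set(words) & set(other)) / len(set(words)) if words else 0.0
    return component

def compose(*components):
    # A component-level "program": apply each component in order.
    def program(data):
        for c in components:
            data = c(data)
        return data
    return program

# A scholar-built "program" with three components:
judge = compose(lowercase, drop_short(4), shared_fraction(["night", "moon"]))
print(judge(["The", "Night", "and", "the", "Moon"]))  # -> 1.0
```

Because each "program" is just a short ordered selection from a fixed component set, the space of possible programs is far smaller than the space of arbitrary code, which is the alphabet reduction the paragraph describes.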