Over forty years ago (ages before HTML or Wikipedia), I wrote a science fiction short story about a document analyzer called DAV that could read arbitrary scientific and technical documents and answer questions about them posed by a human. It seemed to me at the time—and I still think—that such functionality could be realized in our world.
A document markup language for report writing that describes the logical and argumentation connections within a report could integrate with, and facilitate, Peggy Sutherland's scheme for justification tracking, the Statshow program for tracking statistical analyses, Conal Brown's project to analyse the information encoded in engineering inspection reports, and possibly Marco De Angelis' GUI-in-a-box platform for hosting interfaces for scientific and engineering models.
The idea is that, if a writer were willing to invest a little extra effort in composing such documents, one could radically improve their utility to would-be readers, or rather, in this case, to people who don't have time to read but need to understand the arguments, evidence and reasoning in a text.
Note that this idea does not depend on natural language processing in which a computer is taught to read unmodified but contextualized texts and utterances made by humans, although it might benefit from and might synergize with natural language processing.
The idea was a riff on document markup languages like GML or the better instincts of word processors. Recall that the earliest form of text markup was to insert (hard-coded) formatting such as italics, boldface, and spacing in the text. Then came document markup, which identified the structural elements of the document, like the title, headers, ordered and unordered lists, text to be emphasized, etc. The reader's browser might then decide what format to use to render these elements, although this idea has fallen out of favor because writers tend to care very much about how their pieces look, and not only about what they say. Nobody really uses GML anymore, and practical use of HTML has mostly fallen back into text formatting rather than document markup; modern HTML almost totally abdicates those better instincts. CSS improves HTML by separating content from format, but it is arguably regressive in terms of focusing the author/marker on formatting details rather than the logical structure of the document.
If you go one step beyond marking up document structure to marking up document semantics and argumentation, a computer could read (without conscious understanding, perhaps) a text and understand its main point, ancillary points, subplots, arguments, lines of evidence, the evidence itself, its sources and provenance, definitions, examples and counterexamples, and so on.
This markup would allow computers to fish up answers to questions such as
why?
so what?
what makes you think so?
how do you know?
who said so?
what is the best estimate?
what is the uncertainty on that?
where is the data?
what is the sample size?
what assumptions underlie this?
what does the statistical analysis say about it?
what would the result be if an assumption were changed?
and to cobble together answers far more relevant than are really possible with search-based schemes alone.
In today's environment characterised by a flood of technical literature, one can't expect a busy person to read even a short abstract; it needs to be digested for them.
I even developed a program in the computer language Logo that would do this. (I think it is in my vertical files somewhere.) 'Why' questions could be answered from a variety of markup tags, which were handled in a priority scheme. When it ran out of specially inserted markings, it would try to parse the sentences to find 'because' clauses and other text fragments that might be explanations suitable as answers, but that is obviously veering into natural language processing. Eventually, when there are no further details under any possibly relevant tags, it gives a dunno answer, which might be "I don't know", "No further details are available in this document", or a similar message.
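That Logo program is long gone, but its fallback logic can be sketched in modern terms. The sketch below is in Python rather than Logo, and the tag names and element structure are hypothetical; only the priority-then-parse-then-dunno flow comes from the original design.

```python
# Sketch of the priority scheme for answering "why?" questions.
# Tag names, element structure, and the fallback order are illustrative.

import re

def answer_why(element):
    """Answer 'why?' for a marked-up element, falling back through tags."""
    # 1. Try explicit markup tags, in priority order.
    for tag in (":because", ":why", ":assumption", ":evidence"):
        if tag in element.get("tags", {}):
            return element["tags"][tag]
    # 2. Fall back to scraping 'because' clauses from the raw text
    #    (this veers into natural language processing).
    match = re.search(r"because\s+([^.;]+)", element.get("text", ""), re.IGNORECASE)
    if match:
        return match.group(1).strip()
    # 3. Give a 'dunno' answer when nothing relevant is found.
    return "No further details are available in this document."

claim = {
    "text": "The bridge was closed because corrosion weakened the main cables.",
    "tags": {},
}
print(answer_why(claim))  # falls through to the 'because' clause
```

A real implementation would also walk justification chains upward through the document hierarchy before giving up.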
The markup tags, inserted by the document writer and perhaps augmented by a human editor or machine postprocessing, identify the elements of the messages and arguments, and link these elements internally within the document and externally to the wider Internet. This language of markup tags could breathe life into the enormous body of gray literature and reports that are generated each year, and knit together documents and their contents in ways that are not otherwise possible (even with the elaborate identification scheme used in the Internet of Things).
A reader can ask what the main point of a document, a section, or even a paragraph is, what the import of its claims is, and what the evidence for them is. A reader can also ask questions of fact, like how big a quantity is, what its value would be in different units, or what its uncertainty is. Readers can dial their size preference for answers that are terse, full, loquacious, or archival (with references). A reader could even ask the software whether there is anything else important in the document that hadn't yet been seen.
The language of the markup tags is obviously the critical part of this scheme. It should be rich enough to express the whole of human argumentation, while at the same time being simple enough to be remembered and deployed with facility by the author. We might appeal to various theories of argumentation, but a homespun language of tags could also be very useful. A convention on the subject, if not a research program, might be warranted. An early list of markup tags is available in the Logo implementation. From tattered memory, I recall something like
:mainpoint
:definition
:calculation
:why
:because
:sowhat
:implication
:conclusion
:assumption
:presumption
:hypothesis
:example
:counterexample
:who
:data
:evidence
:title
:argument()
:thesis()
:claimkind()
:graph()
:picture(), recording()
:next(), prev()
:reference(supports, diverges, contradicts)
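To make the flavour concrete, here is one way tagged text might look, with a minimal Python sketch that extracts the tags. The bracketed `[:tag ...]` syntax and the sample passage are both invented for illustration; they are not a fixed proposal.

```python
# Hypothetical tag syntax: [:tagname text...] brackets a tagged span.
# Both the syntax and the sample passage are illustrative assumptions.

import re

marked_up = """
[:mainpoint The levee should be reinforced this year.]
[:because Sensor data show settlement of 4 cm since 2019.]
[:assumption Settlement continues at the current rate.]
[:data https://example.org/levee-sensors.csv]
"""

TAG_RE = re.compile(r"\[:(\w+)\s+([^\]]*)\]")

def extract_tags(text):
    """Return a dict mapping each tag name to its list of tagged spans."""
    tags = {}
    for name, body in TAG_RE.findall(text):
        tags.setdefault(name, []).append(body.strip())
    return tags

tags = extract_tags(marked_up)
print(tags["mainpoint"][0])   # the document's main point
print(tags["because"][0])     # an answer to "why?"
```

In practice the tags would more likely be attached to spans of ordinary prose rather than standing as separate lines, but the extraction idea is the same.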
The tags could be prefixed by URL, DOI, or other document or record locator schemes. The tags extend and generalise the idea already embodied in Twitter hashtags, which create links and topics, but not semantic or argumentation relations.
The tags can be applied hierarchically and multiply, and they can refer to arbitrary parts of the document text and graphics. They can be used to form answers to multiple questions, although immediate re-use is deprecated. Relevance to each new question is computed, probably with some weighting or fuzzy logic assessment, to compose an answer of the desired size. The markup language could identify specific portions of text or parts of a document, or just areas (sections, paragraphs) or starting points within the text. The markups work much like they do for automatically creating document indexes. The tags need not be extensive or deeply applied within a document to be useful. Even very sparse and high-level application could turn out to be very useful in literature reviews. This fact is already evident in the recent emergence and increasing popularity in several medical journals of structured abstracts that require distinct, labeled sections (e.g., Introduction, Methods, Results, Discussion) intended to facilitate comprehension. These rather ham-handed conventions are obviously too simplistic and too rigid for general use, but if they could be thoughtfully generalised and adapted for unobtrusive application, they could have a salutary impact on current guidance for writing abstracts and, in broad use, could have a profound effect on the accessibility and transparency of the scientific literature.
The next step would of course be automatic document markup, but that would entail some real AI, or at least some serious natural language processing at an intermediate level: recognizing the structure of arguments and evidence without necessarily noticing what's inside. It would operate at the level of a secretary or copy editor who can see the structure of a text without necessarily following or being able to critique it.
Please let me know your thoughts. Maybe this idea has already been proposed and its infrastructure already implemented. If so, it needs to be championed with more clarion calls. I don't see what we want among existing markup languages. However, there are doubtless some rumblings with which this proposal might make common cause. See, for example, the EU's Better Regulation: Guidelines and Toolbox.
The scheme should play well with other revolutions in documentation brought by computers and the Internet.
Justification would be most useful when fully integrated with the other concurrent revolutions. So "who?" questions might refer to an Orcid e-science identifier, and "how do you know?" might bring up a link to a re-doable calculation.
The idea for Justification seems closest to that of the Semantic Web, and perhaps it should be subsumed by earlier and more comprehensive work there. But we envision a system that is far simpler, vastly easier for document writers to use, and perhaps more robust for automated document abstractors to process. The strength of the various schemes that have been proposed to make Internet data machine-readable is also their drawback: great complexity. For instance, there are dozens of entries for 'reason' recognized by the consortium schema.org used with RDFa, and twice as many entries for 'evidence', about a hundred. This system is wildly too detailed for our use. At the same time, it seems insufficiently detailed: the word 'justification' does not appear at all in the schema (https://schema.org/docs/search_results.html?q=justification).
It seems clear that DAV could be designed to access and read documents with RDF encodings, and it seems likely that DAV could be implemented using RDFa and schema.org's type system, although that environment may be a bit too inflexible. The types are arranged in a hierarchy with multiple inheritance. The datatypes are surprisingly simple, including only 'Boolean', 'Date', 'DateTime', 'Number', 'Text' and 'Time', although more elaborate data and dataset structures can be assembled, such as a 'Dataset' that can have a 'distribution', which is a downloadable form of the dataset at a specific location, in a specific format.
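For instance, a schema.org 'Dataset' with a downloadable 'distribution' can be expressed in JSON-LD; the sketch below builds such a record in Python. The dataset, URL, and values are invented for illustration.

```python
import json

# A schema.org Dataset with a 'distribution' (a DataDownload) in JSON-LD.
# The dataset, URL, and values are invented for illustration.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Levee settlement measurements",
    "description": "Quarterly settlement readings, 2019-2024.",
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/levee-sensors.csv",
    },
}
print(json.dumps(dataset, indent=2))
```

A DAV-style reader pointed at the document could follow 'contentUrl' to answer "where is the data?" directly.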
A 'Thing' can have a property 'name' (whose value is an instance of 'Text'). It can also have a property 'description'. The fact that a 'Thing' can also have a property 'alternateName' but no property for an alternate description suggests that the values of properties are exclusive and cannot be multiplied (otherwise there would be no need for 'alternateName'). Yet clearly alternative descriptions are extremely important, because the uses of a description are quite varied, and so multiple forms may be needed to meet different requirements, e.g., descriptions of different lengths, from a six-word summary, single-sentence thesis, precis, elevator pitch, executive summary, abstract, summary, extended abstract, scientific article, monograph, or systematic review, to explanations at Wired's 5 Levels.
The main challenges for the Semantic Web are vastness, uncertainty, vagueness, inconsistency, and deceit. Our approach is either insensitive to these challenges or specifically designed to address them.
DAV focuses on one document or a set of related documents and is not concerned with the vastness of the documents across the Internet. Semantic duplication is not problematic for DAV because it does not depend on one-to-one correspondence between a property and its value. Multiple descriptions can be useful in describing something to different audiences, and multiple justifications may be germane in an argument. Vastness does not appear to be a challenge for DAV.
DAV is designed to wrangle uncertainty. It deploys uncertainty quantification technology, including automatic recoding of crystal box calculations to incorporate uncertain numbers, including intervals, distributions, and probability boxes. It can recalculate expressions encoded in arguments with new or variant values or tenable uncertainties that were not used by the original writer. It can automatically refigure projections from numerical estimates based on known sample sizes (accessible in the marked up text) using modern confidence boxes. It can synthesize meta-analyses on the fly from disparate evidence sources, about both effect magnitudes and statistical significance (of deviance of like sign).
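The simplest of these capabilities, re-running a marked-up calculation with interval-valued inputs, can be sketched in a few lines. The `Interval` class and the example quantities are illustrative; full uncertainty quantification with distributions and probability boxes would be considerably more involved.

```python
# Minimal interval arithmetic, a stand-in for fuller uncertainty
# quantification (distributions, probability boxes). Illustrative only.

class Interval:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)
    def __mul__(self, other):
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))
    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"

# Re-running a marked-up calculation with interval-valued inputs:
flow = Interval(2.0, 3.0)        # flow rate, uncertain measurement
duration = Interval(3600, 4000)  # duration in seconds
volume = flow * duration
print(volume)  # [7200.0, 12000.0]
```

The point is that a reader asking "what is the uncertainty on that?" gets a recomputed bound, not merely the author's original point estimate.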
Vagueness describes situations for which there is an accompanying semantic gradation that yields borderline cases. For instance, the claim that a day is rainy can be judged in terms of how many centimeters of precipitation have fallen on that day, or the claim that a species is endangered might be judged by how many individuals of the species are currently living. There is no bright line between rainy and not rainy, between endangered and not, but more rain makes a day rainier, and fewer individuals makes a species more endangered. Vagueness, ambiguity and imprecision are often considered to be advantageous in expressions in English and other natural languages. Their value is obvious in many settings such as poetry and diplomatic language. Steven Pinker pointed out that “...the vagueness of language, far from being a bug or an imperfection, actually might be a feature of language, one that we use to our advantage in social interactions.” Sometimes this feature is a rhetorical strategy or simply a practical scheme for bandwidth condensation or language compression. Sometimes we may not know all the details necessary to paint the full picture. Philip Fernbach further argued that in fact “ignorance is a feature of the human mind, not a bug.” No one can possibly have in their memory all the details behind a very complex analysis or endeavor. Because the DAV markup language does not require any element, anything for which imprecision or other uncertainty precludes clarity can simply be omitted. Justification chains terminate with such omissions, leading to dunno answers to further questions by the reader (or perhaps suggestions for how to search the larger indexable web). At the same time, because the markup language allows indefinitely long justification chains, language can be high-level (like in an abstract) but simultaneously information-rich and detailed according to the diligence of the composer and the curiosity and appetite of the reader. 
For instance, a claim that a data analysis finding is statistically "significant" (a vague term) can be accompanied by the p-value but also the underlying data and the details about the statistical methodology and assumptions needed to reconstruct the analysis from scratch.
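With the data and methodology accessible in the markup, the reader's software could re-derive the significance claim from scratch. The sketch below uses a simple permutation test on invented data; the actual analysis tagged in a real document could of course be anything.

```python
# Sketch: re-deriving a p-value from data embedded in a document's markup,
# here via a simple permutation test. The data are invented for illustration.

import random
random.seed(0)  # deterministic for reproducibility

control   = [5.1, 4.8, 5.3, 5.0, 4.9]
treatment = [5.9, 6.1, 5.7, 6.0, 5.8]

observed = sum(treatment) / len(treatment) - sum(control) / len(control)

pooled = control + treatment
n = len(treatment)
count = 0
trials = 10000
for _ in range(trials):
    random.shuffle(pooled)
    diff = sum(pooled[:n]) / n - sum(pooled[n:]) / n
    if abs(diff) >= abs(observed):
        count += 1

p_value = count / trials
print(f"observed difference {observed:.2f}, permutation p = {p_value:.4f}")
```

The same machinery lets the reader swap in a different null hypothesis or test statistic and see whether "significant" survives.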
DAV models deductive reasoning, but it does not itself depend on it, so it is largely immune to inconsistency, which it generally passes along to the reader; that is actually the proper thing to do. DAV facilitates the honest acknowledgement by the document author/marker of recognized inconsistencies via markers such as 'cf' and 'but see', and through contextualization in meta-analyses, just as the strength of an argument or line of reasoning as perceived by the author can also be encoded in tags. Although DAV has limited features for detecting inconsistencies not already recognized by the author/marker, one such feature it does employ is the ability to check the dimensional soundness of arithmetic expressions. Inconsistency and exceptions are important features of the world, and they are present in most substantial works (save perhaps for the Principia Mathematica, where a perfect lack of inconsistency would still doom it to incompleteness). Sarcasm, jokes, puns, and lighthearted asides (such as the parenthetical in the previous sentence) are another kind of inconsistency in that they play orthogonally to the straight reading of the text. They can be tagged as to their intent, just as the sarcasm tag /s, irony punctuation, emojis, and emoticons (generalized into the Emotion Markup Language) are already commonly used in tagging emails, tweets and small documents. It is not yet clear whether we should make some provision for tagging other kinds of off-topic interjections and meta commentary such as tl;dr and reviewed-by chains.
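Dimensional soundness checking is easy to sketch. Representing a physical dimension as a vector of exponents over base quantities is a standard technique; the particular encoding below (length, mass, time only) is a simplifying assumption.

```python
# Sketch of dimensional soundness checking for tagged arithmetic
# expressions. Dimensions are exponent vectors over (length, mass, time);
# restricting to three base quantities is a simplifying assumption.

from dataclasses import dataclass

@dataclass(frozen=True)
class Dim:
    length: int = 0
    mass: int = 0
    time: int = 0
    def __mul__(self, other):
        # Multiplying quantities adds their dimensional exponents.
        return Dim(self.length + other.length,
                   self.mass + other.mass,
                   self.time + other.time)

def check_sum(*dims):
    """Addition is only dimensionally sound between like quantities."""
    if len(set(dims)) > 1:
        raise ValueError(f"dimensionally unsound sum: {dims}")
    return dims[0]

metre = Dim(length=1)
second = Dim(time=1)

check_sum(metre, metre)               # fine: m + m
try:
    check_sum(metre, metre * second)  # unsound: m + m*s
except ValueError as e:
    print("caught:", e)
```

Run over the arithmetic tagged in a document, such a check flags author errors that no amount of prose proofreading would catch.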
Although DAV is susceptible to deceit such as outright fabrication of raw data, its main purpose is to reveal and test the justifications and underlying evidence supporting a document's theses, including re-plotting graphs on different scales, checking for outliers, checking assumptions used in data analyses, and even re-running statistical tests with different null hypotheses. DAV could therefore become a significant weapon in the fight against indubiety. Related ideas on tracking data provenance may likewise be helpful to defend against scientific fraud.
Interestingly, Wikipedia concludes its discussion of the challenges facing the Semantic Web:
This list of challenges is illustrative rather than exhaustive, and it focuses on the challenges to the "unifying logic" and "proof" layers of the Semantic Web. The World Wide Web Consortium (W3C) Incubator Group for Uncertainty Reasoning for the World Wide Web (URW3-XG) final report lumps these problems together under the single heading of "uncertainty".[21] Many of the techniques mentioned here will require extensions to the Web Ontology Language (OWL) for example to annotate conditional probabilities. This is an area of active research.[22]
[21] "Uncertainty Reasoning for the World Wide Web" W3.org.
[22] Lukasiewicz, Thomas; Umberto Straccia (2008). "Managing uncertainty and vagueness in description logics for the Semantic Web" Web Semantics: Science, Services and Agents on the World Wide Web. 6 (4): 291–308.
Clearly, justification marking can straightforwardly be applied to writing beyond scientific documents and engineering reports. It seems especially well suited to medical reports and customer complaint responses, as well as more discursive forms such as journalistic/headline writing and bottom line up front styles popular in business and military communications.