Technical Infrastructure & Data Roadmap

Technical infrastructure

ALICE is optimized to be used in tandem with the Transcription Task.

Both the Transcription Task and ALICE make use of a Zooniverse service called Caesar. Caesar can 'read' the stream of classification data (i.e. the data submitted by volunteers) on the Zooniverse platform and perform actions based on predetermined configurations, such as extracting information from a classification and reducing the extracted data into 'results'.
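The extract-then-reduce pattern described above can be illustrated with a minimal Python sketch. It is not Caesar's API or configuration format: the classification structure and the `extract_text` and `reduce_counts` functions are hypothetical and exist only to show how per-classification extracts might be combined into a per-subject result.

```python
from collections import Counter

# Hypothetical classification records. Real Zooniverse classifications
# carry more fields and a different structure.
classifications = [
    {"subject_id": 1, "annotations": [{"task": "T0", "value": "Dear Sir"}]},
    {"subject_id": 1, "annotations": [{"task": "T0", "value": "Dear Sir"}]},
    {"subject_id": 1, "annotations": [{"task": "T0", "value": "Dear Sin"}]},
]


def extract_text(classification, task="T0"):
    """Extractor: pull the answers for one task out of a single classification."""
    return [a["value"] for a in classification["annotations"] if a["task"] == task]


def reduce_counts(extracts):
    """Reducer: combine the extracts for one subject into a tally of answers."""
    counts = Counter()
    for values in extracts:
        counts.update(values)
    return dict(counts)


# One extract per classification, then one reduction over all extracts.
extracts = [extract_text(c) for c in classifications]
result = reduce_counts(extracts)
print(result)
```

In this sketch the extractor runs once per classification, while the reducer runs over all extracts for a subject, which mirrors the division of labour described above.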

At the time of writing, projects using ALICE and the Transcription Task require manual setup by our team. For more information on setting up your project with the Transcription Task, read our documentation.

ALICE

Aggregation

Caesar applies its predetermined extraction and reduction steps to each incoming classification. This produces the annotation information and the related transcription information shown in ALICE. The default aggregation method is as follows:

  • The annotations drawn by volunteers are clustered using the OPTICS reducer. This clustering method is density independent, meaning it works well if the size of the text being transcribed varies across the data set (e.g. a page of handwritten text which grows smaller as the writer runs out of room towards the bottom of the page).

  • Once the lines of text have been identified for each annotation in the cluster, the text is aligned using the Collatex package. The most common word for each column of the alignment table is used to form the aggregate text, and a consensus score is calculated with:

`consensus_score = Sum(# of instances of the most common word) / (# of words in the line)`

This score can be read as the average number of volunteers who agreed on each word in the line of text. A low consensus score can be a helpful indicator of which documents need review.
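The consensus calculation above can be worked through with a short Python sketch. It assumes the alignment table is already available as a list of rows (one per volunteer) and that gaps in the alignment are marked with `None` and ignored; both the `aggregate_line` function and that gap handling are assumptions for illustration, not the behaviour of the actual aggregation code.

```python
from collections import Counter


def aggregate_line(alignment):
    """Build the aggregate text and consensus score for one line.

    alignment: list of rows, one per volunteer; each row is a list of
    words, with None marking a gap (an assumption of this sketch).
    """
    aggregate_words = []
    agreement_total = 0
    # zip(*alignment) walks the table column by column.
    for column in zip(*alignment):
        words = [w for w in column if w is not None]
        if not words:
            continue
        word, count = Counter(words).most_common(1)[0]
        aggregate_words.append(word)
        agreement_total += count
    if not aggregate_words:
        return "", 0.0
    # Sum of most-common-word counts divided by the number of words in the line.
    return " ".join(aggregate_words), agreement_total / len(aggregate_words)


alignment = [
    ["The", "quick", "brown", "fox"],
    ["The", "quick", "brown", "fox"],
    ["The", "quack", "brown", None],
]
text, score = aggregate_line(alignment)
print(text, score)
```

Here the most common words occur 3, 2, 3 and 2 times across the four columns, so the score is 10 / 4 = 2.5 for a line transcribed by three volunteers.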

The final step is to put the text in reading order. This is done by first identifying the angle(s) at which the text was written, and then ordering the text from top to bottom based on its position on the page.
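One way to picture this ordering step is the simplified Python sketch below. It assumes each line is described by hypothetical `start` and `end` points in image coordinates (y increasing downwards), groups lines by rounding their angle to a fixed bin rather than clustering the angles, and orders the angle groups by angle value. These are all simplifying assumptions, not a description of the real implementation.

```python
import math


def reading_order(lines, bin_degrees=15):
    """Sort lines by writing angle, then top to bottom within each angle."""

    def sort_key(line):
        (x1, y1), (x2, y2) = line["start"], line["end"]
        # Writing angle of the line, rounded to the nearest bin.
        angle = math.degrees(math.atan2(y2 - y1, x2 - x1))
        binned = round(angle / bin_degrees) * bin_degrees
        theta = math.radians(binned)
        # Rotate the midpoint so the text direction is horizontal; the
        # rotated y value is then the line's "depth" down the page.
        mid_x, mid_y = (x1 + x2) / 2, (y1 + y2) / 2
        depth = -mid_x * math.sin(theta) + mid_y * math.cos(theta)
        return (binned, depth)

    return sorted(lines, key=sort_key)


lines = [
    {"text": "second", "start": (10, 120), "end": (400, 124)},
    {"text": "margin note", "start": (450, 50), "end": (452, 300)},
    {"text": "first", "start": (12, 60), "end": (405, 58)},
]
ordered = [line["text"] for line in reading_order(lines)]
print(ordered)
```

The two roughly horizontal lines fall into the same angle group and are ordered by their vertical position, while the vertical margin note forms its own group.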

Database

The back-end API for ALICE is called TOVE, or Transcription Object Viewer/Editor. TOVE stores JSON representations of transcriptions, which the front-end (ALICE) uses alongside subject reductions from Caesar.

App data to be exported is stored as flat files, and returned to project builders as a .zip file. The file structure includes nested folders which contain the contents of the request (e.g. an individual subject, group, etc.).
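As an illustration of this kind of export, the following Python sketch packs flat JSON files into a .zip archive with nested folders. The folder layout, the file name `transcription.json` and the `build_export` function are hypothetical; TOVE's actual export structure may differ.

```python
import io
import json
import zipfile


def build_export(groups):
    """Return the bytes of a .zip archive with one folder per group and subject.

    groups: mapping of group name -> mapping of subject id -> transcription data.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for group_name, subjects in groups.items():
            for subject_id, transcription in subjects.items():
                # Hypothetical layout: <group>/subject_<id>/transcription.json
                path = f"{group_name}/subject_{subject_id}/transcription.json"
                archive.writestr(path, json.dumps(transcription, indent=2))
    return buffer.getvalue()


example = {
    "group_a": {"101": {"text": "Dear Sir"}, "102": {"text": "Yours truly"}},
    "group_b": {"201": {"text": "Received with thanks"}},
}
data = build_export(example)

# Read the archive back to list the nested paths it contains.
with zipfile.ZipFile(io.BytesIO(data)) as archive:
    names = sorted(archive.namelist())
print(names)
```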

Transcription data files are generated and saved to storage when a transcription is approved.

Code repositories

ALICE (front-end) code: https://github.com/zooniverse/alice

TOVE (back-end) code: https://github.com/zooniverse/tove

Transcription Task

The Transcription Task is built around a line-drawing task in which two click events mark the start and end of a line. The start and end points of the line are either open or closed, to communicate the direction in which the line was created (open = start; closed = end). The line color indicates the status (completed, in progress, already-interacted-with, not-yet-interacted-with).

Once the transcription mark is made, a sub-task displays. This sub-task displays:

  • a text-entry field where volunteers can type in a new transcription

  • any previous transcriptions via a dropdown menu

    • previous transcriptions can be selected/inserted as an editable string in the text input field

Transcription Task code: https://github.com/zooniverse/front-end-monorepo

The Zooniverse project lifecycle

The chart below shows the typical lifecycle of a Zooniverse project in 6 stages: Design, Create, Data (Pre), Interact, Data (Post), & Share. The following text tracks the use of the resources described here through the stages of a project.

A chart showing the Zooniverse project lifecycle in 6 stages: Design, Create, Data (pre), Interact, Data (post), Share

The resources described here are relevant at all stages of the project lifecycle.

In the Design phase, choosing to use tools like the Transcription Task and ALICE can directly affect the types of resources project builders may need; for example, it can reduce the need to hire a data scientist to run data aggregation code.

In the Create phase, project builders can test the Transcription Task and even preview data output directly in ALICE.

The Data (Pre) phase includes beta testing, which allows project builders to get feedback from Zooniverse volunteers, as well as preview data output. Beta testing is required of all teams who wish for their projects to launch publicly and be featured on https://www.zooniverse.org/projects.

The Interact phase is when volunteers are actively using the Transcription Task to transcribe data. Once images are transcribed completely and 'retired' from the project, they will appear in ALICE.

In the Data (Post) phase, all the images are completely transcribed, and project builders can use ALICE to review and edit the results before exporting their data. The Interact phase and the Data (Post) phase may overlap.

The Share phase is when project builders are sharing the results of their project with the public. Per the Zooniverse Lab Policies: "All publications resulting from public Zooniverse projects must make their classification data open after a proprietary period, normally lasting two years from project launch."

Zooniverse code

Zooniverse code is released under an Apache 2.0 license and is made available as open source code on GitHub (https://github.com/zooniverse). It can be reused, forked and adapted without limits. Some repositories are closed while under active development, but are all ultimately made freely available.

