Multi-rater datasets

Multi-rater datasets were collected for three purposes: to measure interrater variability among pathologists, to evaluate the accuracy of non-pathologists, and to measure the biases introduced by showing algorithm-generated suggestions during annotation. The same FOVs were annotated in each multi-rater dataset. Download links are provided below.

Experimental setup

Participants annotated the FOVs independently of one another. In the evaluation set, participants were shown higher-quality algorithmic suggestions; in the bootstrap control, they were shown lower-quality suggestions; and in the unbiased control, they were shown no suggestions at all.

Label & truth inference process

A constrained clustering step first groups the multi-rater annotations into candidate nuclear locations. An Expectation-Maximization (EM) statistical framework then aggregates the raters' opinions about each candidate nucleus, weighting each participant by their estimated reliability. The aggregated opinion of non-pathologists is called the inferred NP-label; the aggregated opinion of pathologists is called the inferred P-truth.
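As a rough illustration of this kind of reliability-weighted aggregation, the sketch below implements a minimal Dawid-Skene-style EM over binary detection votes. The toy data, variable names, and the simple symmetric reliability model are illustrative assumptions, not the actual implementation used to produce the inferred labels.

```python
# Minimal sketch of EM-based opinion aggregation over candidate nuclei.
# Assumptions: binary votes (nucleus real / not real), one symmetric
# reliability score per rater, NaN where a rater gave no opinion.
import numpy as np

def em_aggregate(votes, n_iter=50):
    """votes: (n_raters, n_nuclei) array in {0, 1}, NaN if unobserved.
    Returns P(nucleus is real) per candidate and per-rater reliability."""
    n_raters, n_nuclei = votes.shape
    observed = ~np.isnan(votes)
    p = np.nanmean(votes, axis=0)          # initialize with majority vote
    reliability = np.full(n_raters, 0.8)   # prior: raters mostly correct
    for _ in range(n_iter):
        # M-step: reliability = agreement with the current consensus.
        for r in range(n_raters):
            mask = observed[r]
            agree = votes[r, mask] * p[mask] + (1 - votes[r, mask]) * (1 - p[mask])
            reliability[r] = np.clip(agree.mean(), 1e-3, 1 - 1e-3)
        # E-step: re-estimate each nucleus, weighting raters by reliability.
        for j in range(n_nuclei):
            mask = observed[:, j]
            rel, v = reliability[mask], votes[mask, j]
            like_real = np.prod(np.where(v == 1, rel, 1 - rel))
            like_fake = np.prod(np.where(v == 0, rel, 1 - rel))
            p[j] = like_real / (like_real + like_fake)
    return p, reliability

# Toy example: 4 raters, 3 candidate nuclei; rater 4 is unreliable.
votes = np.array([[1, 1, 0],
                  [1, 1, 0],
                  [1, 0, 0],
                  [0, 0, 1]], dtype=float)
p, rel = em_aggregate(votes)
print("P(nucleus is real):", p.round(2))
print("rater reliability:", rel.round(2))
```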

Evaluation dataset

> Click here to download the raw data (each annotator independently).

> Click here to download the inferred NP-labels.

> Click here to download the inferred P-truth.

40,028 annotations | 1,358 unique nuclei | 530 boundaries

Bootstrap control dataset

> Click here to download the raw data (each annotator independently).

> Click here to download the inferred NP-labels.

> Click here to download the inferred P-truth.

19,881 annotations | 1,349 unique nuclei | 148 boundaries

Unbiased control dataset

> Click here to download the raw data (each annotator independently).

> Click here to download the inferred NP-labels.

> Click here to download the inferred P-truth.

37,434 annotations | 1,569 unique nuclei | 0 boundaries*

* By definition, we did not show participants any algorithmic suggestions in this control experiment. However, we did ask one practicing pathologist (SP.3) to manually trace all boundaries. All nuclear boundaries in FOVs prefixed by "SP.3_#_U-control_#_" are manually traced (1,223 boundaries).
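For example, one way to pick out these manually traced FOVs is to filter filenames by that prefix pattern, as in this hypothetical snippet. The folder name and file extension are assumptions; only the "SP.3_..._U-control_..." naming convention comes from the note above.

```python
# Hypothetical sketch: select the manually traced unbiased-control FOVs
# by filename. Directory layout and ".csv" extension are assumptions.
from pathlib import Path

fovs = sorted(Path("U-control").glob("*.csv"))  # assumed per-FOV files
traced = [f for f in fovs
          if f.name.startswith("SP.3_") and "_U-control_" in f.name]
print(f"{len(traced)} FOVs with manually traced boundaries")
```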

Advanced: raw data (not recommended)

Some users prefer SQLite to CSV so they can reuse parts of the existing codebase as-is. Please use this link to access the raw single-rater SQLite database. Important note: this is not recommended! The CSV files are better suited for most uses, but feel free to work with the raw database if you have a strong preference for SQLite.
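If you do go the SQLite route, a minimal way to inspect and load the database with pandas is sketched below. The filename and table selection are placeholders (assumptions), since the schema is not documented here; list the tables first to find the real ones.

```python
# Minimal sketch of loading a raw SQLite database with pandas.
# "multirater_raw.sqlite" is a placeholder filename, not the real one.
import sqlite3
import pandas as pd

conn = sqlite3.connect("multirater_raw.sqlite")
tables = pd.read_sql_query(
    "SELECT name FROM sqlite_master WHERE type='table';", conn)
print(tables)  # inspect the schema before querying
df = pd.read_sql_query(f"SELECT * FROM {tables.name.iloc[0]};", conn)
conn.close()
```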