PanopTILs

An integrated region and cell-level annotation dataset for panoptic segmentation of the breast tumor microenvironment

What is PanopTILs?

PanopTILs is a unique large-scale dataset of annotated boundaries of histologic regions, together with the locations, classification labels, and segmentation boundaries of nuclei in the same fields of view. It was created by reconciling and expanding two public datasets, BCSS and NuCLS, to enable in-depth analysis of the tumor microenvironment (TME) in whole-slide images (WSIs) of H&E-stained breast cancer slides. Panoptic segmentation is particularly useful for, but not limited to, the assessment of tumor-infiltrating lymphocytes (TILs) in accordance with clinical guidelines. Having an integrated dataset enables training deep-learning models that produce biologically sensible predictions.

Highlights

>   TCGA invasive breast cancer cases

>   151 Patients from numerous hospitals

>   1,709 Regions of interest

>   814,886 Nuclei

Data format

The dataset is composed of 1024x1024-pixel regions of interest (ROIs) at a resolution of 0.25 microns per pixel (MPP), which corresponds to 40x magnification on many scanners. Annotations of histologic regions (semantic segmentation), as well as nuclear classification and segmentation (object segmentation), are provided for the same ROIs to enable exhaustive detection of all relevant elements within the tumor microenvironment in scanned H&E slides.


The dataset contains four folders:

> rgbs/: These are the RGB images in .png format.

> masks/: These are the corresponding segmentation masks. Each mask is a three-channel .png image.

> csv/: These are the classification labels and segmentation boundary coordinates. 

> vis/: These are visualizations of the segmentation masks, provided for convenience.
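
The folder layout above can be used to gather the files that belong to each ROI. The following is a minimal sketch, not part of the dataset's own tooling. It assumes that the files for one ROI share the same base filename across rgbs/, masks/, and csv/, and that the files in csv/ use a .csv extension; neither is stated above, so check both against the downloaded data. The function name pair_roi_files is hypothetical.

```python
from pathlib import Path


def pair_roi_files(root):
    """Group the files of each ROI by base filename.

    ASSUMPTION: an ROI uses the same file stem in rgbs/, masks/, and csv/,
    and the csv/ folder holds files ending in .csv.
    """
    root = Path(root)
    rgbs = {p.stem: p for p in (root / "rgbs").glob("*.png")}
    masks = {p.stem: p for p in (root / "masks").glob("*.png")}
    csvs = {p.stem: p for p in (root / "csv").glob("*.csv")}

    # Keep only ROIs that are present in all three folders.
    common = sorted(rgbs.keys() & masks.keys() & csvs.keys())
    return {
        stem: {"rgb": rgbs[stem], "mask": masks[stem], "csv": csvs[stem]}
        for stem in common
    }
```

Calling pair_roi_files on the dataset root returns a dictionary keyed by ROI name. ROIs that are missing from any of the three folders are silently skipped, which may or may not be what you want for your own pipeline.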


Each mask is an (m, n, 3) .png image, where m and n are the ROI dimensions, and the three channels encode the following:

> First channel: the region semantic segmentation mask.

> Second channel: the nucleus semantic segmentation mask.

> Third channel: a binary mask of nuclear boundary edges.

This compact data format allows training semantic segmentation models for tissue regions, cell nuclei, or both. Access to the nuclear boundary pixels can be useful in some modeling applications, for example to assign higher weights to edge pixels during training. It also improves visualization.
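
As an illustration of the channel layout and of edge weighting, the sketch below splits a mask array into its three channels and builds a per-pixel weight map. It is one possible approach under stated assumptions, not the authors' training code. It assumes the mask has already been loaded as an (m, n, 3) NumPy array (for example with an image library of your choice), and that any nonzero value in the third channel marks an edge pixel. The function names and the edge weight of 5.0 are arbitrary choices made for this example.

```python
import numpy as np


def split_mask(mask):
    """Split an (m, n, 3) mask array into region, nucleus, and edge channels."""
    if mask.ndim != 3 or mask.shape[2] < 3:
        raise ValueError("expected an (m, n, 3) array")
    region = mask[..., 0]    # first channel: region class codes
    nucleus = mask[..., 1]   # second channel: nucleus class codes
    # ASSUMPTION: any nonzero value in the third channel marks an edge pixel.
    edges = mask[..., 2] > 0
    return region, nucleus, edges


def edge_weight_map(edges, edge_weight=5.0):
    """Per-pixel loss weights: edge_weight on edge pixels, 1.0 elsewhere.

    The default of 5.0 is an arbitrary illustrative value.
    """
    return np.where(edges, edge_weight, 1.0).astype(np.float32)


# Synthetic 4x4 mask used only to demonstrate the functions.
demo = np.zeros((4, 4, 3), dtype=np.uint8)
demo[1:3, 1:3, 1] = 2    # a made-up nucleus class code
demo[1, 1:3, 2] = 255    # two made-up edge pixels
region, nucleus, edges = split_mask(demo)
weights = edge_weight_map(edges)
```

The resulting weights array has the same height and width as the mask and can be multiplied into a per-pixel loss in whichever training framework you use.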

Classes

First channel (regions)

Suggested grouping:

Second channel (nuclei)

Suggested grouping:

Download!

There are two versions of this dataset. One is better suited to training deep-learning models, while the other is suited to validation. When producing validation accuracy metrics, make sure you exclude the slides and hospitals that were used in training.
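
One way to keep hospitals separate between training and validation is to split by the tissue source site code in the TCGA barcode, which is the second hyphen-separated field (for example, A2 in TCGA-A2-...). The sketch below is a hedged example: it assumes that each ROI filename begins with the TCGA barcode of its slide, which is not stated above and should be verified against the downloaded files. The function names and the ROI names in the demo are made up for illustration.

```python
def tss_code(roi_name):
    """Return the tissue source site code from a name starting with a TCGA barcode.

    ASSUMPTION: ROI names begin with a barcode such as 'TCGA-A2-XXXX-...'.
    """
    parts = roi_name.split("-")
    if len(parts) < 3 or parts[0] != "TCGA":
        raise ValueError(f"not a TCGA-style name: {roi_name!r}")
    return parts[1]


def split_by_hospital(roi_names, val_sites):
    """Send ROIs from the given source sites to validation, the rest to training."""
    val_sites = set(val_sites)
    train, val = [], []
    for name in roi_names:
        (val if tss_code(name) in val_sites else train).append(name)
    return train, val


# Made-up ROI names used only to demonstrate the split.
demo_names = [
    "TCGA-A2-XXXX-roi1",
    "TCGA-A2-XXXX-roi2",
    "TCGA-E2-YYYY-roi1",
]
train_rois, val_rois = split_by_hospital(demo_names, val_sites={"E2"})
```

Splitting on the source site rather than on individual ROIs ensures that no hospital contributes to both sets, which also keeps all ROIs from one slide on the same side of the split.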

Click on the links below to download!

All labels are approved by pathologists (for model validation) 

Algorithmically-expanded nuclear labels (for model training) 

Citations

PanopTILs dataset

Liu S, Amgad M, Rathore MA, Salgado R, Cooper LA. A panoptic segmentation approach for tumor-infiltrating lymphocyte assessment: development of the MuTILs model and PanopTILs dataset. medRxiv 2022.01.08.22268814. 

BCSS dataset

Amgad M, Elfandy H, Hussein H, Atteya LA, Elsebaie MA, Abo Elnasr LS, Sakr RA, Salem HS, Ismail AF, Saad AM, Ahmed J. Structured crowdsourcing enables convolutional segmentation of histology images. Bioinformatics. 2019 Sep 15;35(18):3461-7.

NuCLS dataset

Amgad M, Atteya LA, Hussein H, Mohammed KH, Hafiz E, Elsebaie MA, Alhusseiny AM, AlMoslemany MA, Elmatboly AM, Pappalardo PA, Sakr RA. NuCLS: A scalable crowdsourcing approach and dataset for nucleus classification and segmentation in breast cancer. GigaScience. 2022 May 17;11.