Panel of Normals (PON)

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by GATK_Team

on 2017-12-27

A Panel of Normals, or PON, is a type of resource used in somatic variant analysis. Depending on the type of variant you're looking for, the PON will be generated differently. What all PONs have in common is that (1) they are made from normal samples (in this context, "normal" means derived from healthy tissue that is believed to not have any somatic alterations) and (2) their main purpose is to capture recurrent technical artifacts in order to improve the results of the variant calling analysis.

As a result, the most important criteria for choosing normals to include in any PON are the technical properties of how the data were generated. It's very important to use normals that are as technically similar as possible to the tumor (same exome or genome preparation methods, sequencing technology and so on). Additionally, the samples should come from subjects who were young and healthy, to minimize the chance of using as a normal a sample from someone who has an undiagnosed tumor. Normals are typically derived from blood samples.

There is no definitive rule for how many samples should be used to make a PON (even a small PON is better than no PON) but in practice we recommend aiming for a minimum of 40.

At the Broad Institute, we typically make a standard PON for a given version of the pipeline (corresponding to the combination of all protocols used in production to generate the sequence data, starting from sample preparation and including the analysis software) and use it to process all tumor samples that go through that version of the pipeline. Because we process many samples in the same way, we are able to make PONs composed of hundreds of samples.

Variant type-specific recommendations are given below.

Short variants (SNVs and indels)

For short variant discovery, the PON is created by running the variant caller Mutect2 individually on a set of normal samples and combining the resulting variant calls according to the criteria defined in the Best Practices documentation (e.g. excluding any sites that are not present in at least 2 normals). This produces a sites-only VCF file that can be used as a PON for Mutect2.
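To illustrate the "present in at least 2 normals" criterion, here is a minimal Python sketch. It is not the GATK implementation: the function name `sites_in_min_normals`, the `min_normals` parameter, and the representation of each site as a `(chrom, pos, ref, alt)` tuple are assumptions made for illustration only.

```python
from collections import Counter


def sites_in_min_normals(per_normal_sites, min_normals=2):
    """Return the sorted sites called in at least `min_normals` normal samples.

    `per_normal_sites` is an iterable with one collection of sites per normal;
    each site is a hypothetical (chrom, pos, ref, alt) tuple.
    """
    counts = Counter()
    for sites in per_normal_sites:
        # set() ensures a sample contributes at most once to each site's count
        counts.update(set(sites))
    return sorted(site for site, n in counts.items() if n >= min_normals)


# Toy calls from four normals (made-up coordinates, for illustration only)
normals = [
    {("chr1", 1000, "A", "T"), ("chr2", 500, "G", "C")},
    {("chr1", 1000, "A", "T")},
    {("chr3", 42, "C", "CA"), ("chr2", 500, "G", "C")},
    {("chr4", 7, "T", "G")},
]

panel = sites_in_min_normals(normals, min_normals=2)
print(panel)
```

Only the sites seen in two or more of the toy normals are kept; sites private to a single normal are dropped, which is the intent of the criterion described above.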

Copy Number Variants

For CNV discovery, the PON is created by running the initial coverage collection tools individually on a set of normal samples and combining the resulting copy ratio data using a dedicated PON creation tool. This produces a binary file that can be used as a PON.

From jgockley on 2018-01-18

Hi,

I have a question on generating a Panel of Normals for somatic variant detection from WGS data. My issue is as follows:

1. I’m using the new 10x WGS Chromium sequencing strategy, which doesn’t have much released data yet.

2. As such, the only normal sequencing runs I have are from my study, and I’ve read that you don’t want to base the PoN on data included in the study, as you could bias your results.

So should I:

A. Use your 1000 genomes from a different chemistry

B. Generate a PoN from my data and use it regardless of bias

C. Generate a custom Panel of Normals for each sample, leaving that individual’s normal sample out of the set used to create the PoN that will be applied to that sample

D. Not use a PoN at all

From shlee on 2018-01-31

Hi @jgockley,

I think you’ll find the paragraph about PoNs at the end of [Section 2 of Article#11136](https://software.broadinstitute.org/gatk/documentation/article?id=11136#2) helpful:

> Ideally, the PoN includes samples that are technically representative of the tumor case sample—i.e. samples sequenced on the same platform using the same chemistry, e.g. exome capture kit, and analyzed using the same toolchain. However, even an unmatched PoN will be remarkably effective in filtering a large proportion of sequencing artifacts. This is because mapping artifacts and polymerase slippage errors occur for pretty much the same genomic loci for short read sequencing approaches.

From matti on 2018-03-12

Hi,

Could you please elaborate on why the minN parameter of CombineVariants has been disabled in CreateSomaticPanelOfNormals, and/or on how a user can control, in GATK4, the minimum number of input files that must support a given site?

From shlee on 2018-03-14

Hi @matti,

GATK4 CreateSomaticPanelOfNormals is a different tool from GATK3 CombineVariants. Its sole purpose is to create a panel of normals from variant sites present in a minimum of two samples. CombineVariants is still in the process of being ported over to GATK4. However, it sounds like you would like to be able to vary this number in CreateSomaticPanelOfNormals? If you can confirm, then I can ask our developers if they can implement such a feature.

From matti on 2018-03-20

Hi @shlee, being able to vary the minimum support level (i.e. the number of files that support a certain site) would be of great importance for us.

From shlee on 2018-03-22

Hi Matti (@matti),

I’ve put in a feature request on your behalf at https://github.com/broadinstitute/gatk/issues/4552. You can check the status of the request and add comments to it directly in the issue ticket. All you need is a GitHub account.

I just realized you are the Matti I met in Helsinki. I hope the research is going well and that the GRCh38 version of MutSig is working well for you. Please send my regards to the workshop crew.

Soo Hee

From matti on 2018-03-22

Hi Shlee (@shlee),

Yep, it's me :smile: Our research is going well and we are super happy users of GATK and MutSig. I will forward your regards to Eija and the others.

From ddaneels on 2019-02-01

In your description you state:

> It’s very important to use normals that are as technically similar as possible to the tumor … Normals are typically derived from blood samples.

We are dealing with tumor FFPE samples. Would it then be best to create a panel of “normal” FFPE samples, because these are technically more similar to the tumor samples, or would peripheral blood samples suffice?

From dcraig01 on 2019-02-13

Hello,

Regarding the last paragraph of the post:

“At the Broad Institute, we typically make a standard PON for a given version of the pipeline (corresponding to the combination of all protocols used in production to generate the sequence data, starting from sample preparation and including the analysis software) and use it to process all tumor samples that go through that version of the pipeline. Because we process many samples in the same way, we are able to make PONs composed of hundreds of samples.”

Is this PON publicly available, and would you recommend applying it to other datasets?

From thisisi3 on 2019-02-22

Dear GATK team,

First of all, thank you for providing such a great variant discovery tool; we are also happy users.

Our library preparation uses hybrid capture, and we have different panels covering roughly 10 to 100 genes each. Do we need to make a PON for each panel, or can a PON made for a big panel be used for any smaller panel whose regions are strictly contained in the big panel?

Also, is it roughly true that the more samples in the PON, the better?

Thanks

From GER on 2019-08-26

Is there a representative Panel of Normals already created for standard Illumina TruSeq libraries sequenced on a standard Illumina sequencer?

98% of potential Mutect users will be using the above sample prep and sequencing pipeline. Although public WGS data is not an exact replica of what each user will generate, a default PoN would be useful for the many researchers who do not have a large number of normal WGS samples available, rather than each person having to create one on their own from public WGS datasets.
