How to Call somatic copy number variants using GATK4 CNV

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by shlee

on 2017-03-08

A more recent CNV tutorial using v4.0.1.1 has been posted in two parts elsewhere at:

https://software.broadinstitute.org/gatk/documentation/article?id=11682 and
https://software.broadinstitute.org/gatk/documentation/article?id=11683

The first part mostly recapitulates the workflow on this page, while the second part adds detection of allelic ratios. Although the v4.0.1.1 tutorial is under review as of May 2, 2018, we recommend you update to the official workflow, especially if performing CNV analyses on WGS data. The official workflow has algorithmic improvements to the GATK4.beta workflow illustrated here.

This demonstrative tutorial provides instructions and example data to detect somatic copy number variation (CNV) using a panel of normals (PoN). The workflow is optimized for Illumina short-read whole exome sequencing (WES) data. It is not suitable for whole genome sequencing (WGS) data nor for germline calling.

The tutorial recapitulates the GATK demonstration given at the 2016 ASHG meeting in Vancouver, Canada, for a beta version of the CNV workflow. Because we are still actively developing the CNV tools (writing as of March 2017), the underlying algorithms and current workflow options, e.g. syntax, may change. However, the presented basic approach and general concepts will still be germaine. Please check the forum for updates.

Many thanks to Samuel Lee (@slee) for developing the example data, data figures and discussion that set the backbone of this tutorial.

► For a similar example workflow that pertains to earlier releases of GATK4, see Article#6791. ► For the mathematics behind the workflow, see this whitepaper.

Jump to a section

Collect proportional coverage using target intervals and read data using CalculateTargetCoverage
Create the CNV PoN using CombineReadCounts and CreatePanelOfNormals
Normalize a raw proportional coverage profile against the PoN using NormalizeSomaticReadCounts
1. Segment the normalized coverage profile using PerformSegmentation ☞ I get an error at this step!
2. (Optional) Plot segmented coverage using PlotSegmentedCopyRatio ☞ What is the QC value?
Call segmented copy number variants using CallSegments
1. Discussion of interest to some ☞ Why can't I use just a matched normal? ☞ How do the results compare to SNP6 analyses?

Tools, system requirements and example data download

This tutorial uses a beta version of the CNV workflow tools within the GATK4 gatk-protected-1.0.0.0-alpha1.2.3 pre-release (Version:0288cff-SNAPSHOT from September 2016). We previously made the program jar specially available alongside the data bundle in the workshops directory here. The original worksheets are in the 1610 folder. However, the data bundle was only available to workshop attendees. Note other tools in this program release may be unsuitable for analyses.
- The example data is whole exome capture sequence data for chromosomes 1–7 of matched normal and tumor samples aligned to GRCh37. Because the data is from real cancer patients, we have anonymized them at multiple levels. The anonymization process preserves the noise inherent in real samples. The data is representative of Illumina sequencing technology from 2011.
R (install from https://www.r-project.org/) and certain R components. After installing R, install the components with the following command.
- Rscript install_R_packages.R
- We include installRpackages.R in the tutorial data bundle. Alternatively, download it from here.
XQuartz for optional section 5. Your system may already have this installed.
The tutorial does not require reference files. The optional plotting step that uses the PlotSegmentedCopyRatio tool plots against GRCh37 and should NOT be used for other reference assemblies.

1. Collect proportional coverage using target intervals and read data using CalculateTargetCoverage

In this step, we collect proportional coverage using target intervals and read data. We have actually pre-computed this for you and we provide the command here for reference.

We process each BAM, whether normal or tumor. The tool collects coverage per read group at each target and divides these counts by the total number of reads per sample.

java -jar gatk4.jar CalculateTargetCoverage \ -I <input_bam_file> \ -T <input_target_tsv> \ -transform PCOV \ -groupBy SAMPLE \ -targetInfo FULL \ –keepdups \ -O <output_pcov_file>

The target file -T is a padded intervals list of the baited regions. You can add padding to a target list using the GATK4 PadTargets tool. For our example data, padding each exome target 250bp on either side increases sensitivity.
Setting the -targetInfo option to FULL keeps the original target names from the target list.
The –keepdups option asks the tool to include alignments flagged as duplicate.

The top plot shows the raw proportional coverage for our tumor sample for chromosomes 1–7. Each dot represents a target. The y-axis plots proportional coverage and the x-axis targets. The middle plot shows the data after a median-normalization and log2-transformation. The bottom plot shows the tumor data after normalization against its matched-normal.

Different data types come with their own caveats. WGS, while providing even coverage that enables better CNV detection, is costly. SNP arrays, while the standard for CNV detection, may not be part of an analysis protocol. Being able to resolve CNVs from WES, which additionally introduces artifacts and variance in the target capture step, requires sophisticated denoising.

For each of these progressions, how certain are you that there are copy-number events? How many copy-number variants are you certain of? What is contributing to your uncertainty?

back to top

2. Create the CNV PoN using CombineReadCounts and CreatePanelOfNormals

In this step, we use two commands to create the CNV panel of normals (PoN).

The normals should represent the same sequencing technology, e.g. sample preparation and capture target kit, as that of the tumor samples under scrutiny. The PoN is meant to encapsulate sequencing noise and may also capture common germline variants. Like any control, you should think carefully about what sample set would make an effective panel. At the least, the PoN should consist of ten normal samples that were ideally subject to the same batch effects as that of the tumor sample, e.g. from the same sequencing center. Our current recommendation is 40 or more normal samples. Depending on the coverage depth of samples, adjust the number.

The first step combines the proportional read counts from the multiple normal samples into a single file. The -inputList parameter takes a file listing the relative file paths, one sample per line, of the proportional coverage data of the normals.

java -jar gatk4.jar CombineReadCounts \ -inputList normals.txt \ -O sandbox/combined-normals.tsv

The second step creates a single CNV PoN file. The PoN stores information such as the median proportional coverage per target across the panel and projections of systematic noise calculated with PCA (principal component analysis). Our tutorial’s PoN is built with 39 normal blood samples from cancer patients from the same cohort (not suffering from blood cancers).

java -jar gatk4.jar CreatePanelOfNormals \ -I sandbox/combined-normals.tsv \ -O sandbox/normals.pon \ -noQC \ --disableSpark \ --minimumTargetFactorPercentileThreshold 5

This results in two files, the CNV PoN and a target_weights.txt file that typical workflows can ignore. Because we have a small number of normals, we include the -noQC option and change the --minimumTargetFactorPercentileThreshold to 5%.

back to top

3. Normalize a raw proportional coverage profile against the PoN using NormalizeSomaticReadCounts

In this step, we normalize the raw proportional coverage (PCOV) profile using the PoN. Specifically, we normalize the tumor coverage against the PoN’s target medians and against the principal components of the PoN.

java -jar gatk4.jar NormalizeSomaticReadCounts \ -I cov/tumor.tsv \ -PON sandbox/normals.pon \ -PTN sandbox/tumor.ptn.tsv \ -TN sandbox/tumor.tn.tsv

This produces the pre-tangent-normalized file -PTN and the tangent-normalized file -TN, respectively. Resulting data is log2-transformed.

Denoising with a PoN is critical for calling copy-number variants from WES coverage profiles. It can also improve calls from WGS profiles that are typically more evenly distributed and subject to less noise. Furthermore, denoising with a PoN can greatly impact results for (i) samples that have more noise, e.g. those with lower coverage, lower purity or higher activity, (ii) samples lacking a matched normal and (iii) detection of smaller events that span only a few targets.

back to top

4. Segment the normalized coverage profile using PerformSegmentation

Here we segment the normalized coverage profile. Segmentation groups contiguous targets with the same copy ratio.

java -jar gatk4.jar PerformSegmentation \ -TN sandbox/tumor.tn.tsv \ -O sandbox/tumor.seg \ -LOG

For our tumor sample, we reduce the ~73K individual targets to 14 segments. The -LOG parameter tells the tool that the input coverages are log2-transformed.

View the resulting file with cat sandbox/tumor.seg.

What is better, tissue-matched normals or blood normals of tumor samples? What makes a better background control, a matched normal sample or a panel of normals?

Based on what you know about PCA, what do you think are the effects of using more normal samples? A panel with some profiles that are outliers?

Which chromosomes have events?

☞ I get an error at this step!

This command will error if you have not installed R and certain R components. Take a few minutes to install R from https://www.r-project.org/. Then install the components with the following command.

Rscript install_R_packages.R

We include installRpackages.R in the tutorial data bundle. Alternatively, download it from here.

back to top

5. (Optional) Plot segmented coverage using PlotSegmentedCopyRatio

This is an optional step that plots segmented coverage.

This command requires XQuartz installation. If you do not have this dependency, then view the results in the precomputed_results folder instead. Currently plotting only supports human assembly b37 autosomes. Going forward, this tool will accommodate other references and the workflow will support calling on sex chromosomes.

java -jar gatk4.jar PlotSegmentedCopyRatio \ -TN sandbox/tumor.tn.tsv \ -PTN sandbox/tumor.ptn.tsv \ -S sandbox/tumor.seg \ -O sandbox \ -pre tumor \ -LOG

The -O defines the output directory, and the -pre defines the basename of the files. Again, the -LOGparameter tells the tool that the inputs are log2- transformed. The output folder contains seven files--three PNG images and four text files.

Before_After.png (shown above) plots copy-ratios pre (top) and post (bottom) tangent-normalization across the chromosomes. The plot automatically adjusts the y-axis to show all available data points. Dotted lines represent centromeres.
BeforeAfterCRLim4.png shows the same but fixes the y-axis range from 0 to 4 for comparability across samples.
FullGenome.png colors differential copy-ratio segments in alternating blue and orange. The horizontal line plots the segment mean. Again the y-axis ranges from 0 to 4.

Open each of these images. How many copy-number variants do you see?

☞ What is the QC value?

Each of the four text files contain a single quality control (QC) value. This value is the median of absolute differences (MAD) in copy-ratios of adjacent targets. Its calculation is robust to actual copy-number variants and should decrease post tangent-normalization.

preQc.txt gives the QC value before tangent-normalization.
postQc.txt gives the post-tangent-normalization QC value.
dQc.txt gives the difference between pre and post QC values.
scaled_dQc.txt gives the fraction difference (preQc - postQc)/(preQc).

back to top

6. Call segmented copy number variants using CallSegments

In this final step, we call segmented copy number variants. The tool makes one of three calls for each segment--neutral (0), deletion (-) or amplification (+). These deleted or amplified segments could represent somatic events.

java -jar gatk4.jar CallSegments \ -TN sandbox/tumor.tn.tsv \ -S sandbox/tumor.seg \ -O sandbox/tumor.called

View the results with cat sandbox/tumor.called.

Besides the last column, how is this result different from that of step 4?

back to top

7. Discussion of interest to some

☞ Why can't I use just a matched normal?

Let’s compare results from the raw coverage (top), from normalizing using the matched-normal only (middle) and from normalizing using the PoN (bottom).

What is different between the plots? Look closely.

The matched-normal normalization appears to perform well. However, its noisiness brings uncertainty to any call that would be made, even if visible by eye. Furthermore, its level of noise obscures detection of the 4th variant that the PoN normalization reveals.

☞How do the results compare to SNP6 analyses?

As with any algorithmic analysis, it’s good to confirm results with orthogonal methods. If we compare calls from the original unscrambled tumor data against GISTIC SNP6 array analysis of the same sample, we similarly find three deletions and a single large amplification.

back to top

Updated on 2018-05-16

From EADG on 2017-04-11

Hi @shlee,

can I also use the GATK4Pre-Release from the Docker-Container (Version:4.alpha-249-g7df4044-SNAPSHOT) ? The actual nightly don`t seems to contain CalculateTargetCoverage. Also in both release the arguments for input intervals and target feature file are listed under Optional Arguments, but you need to at least one of them to make the programm running.

In addition target-files for both releases seems to differ: Version:4.alpha-249-g7df4044-SNAPSHOT needs a standard .bed-file with fileformat-tag and correct filename extension while the 1.2.3.PreRelease needs an tsv-file formated like that:

##fileFormat = tsv contig start stop name 1 1 101 target-0

Can you suggest which file format will be needed by the final release ?

Greetings EADG

From slee on 2017-04-18

Hi @EADG,

You’re correct that at least one of the optional arguments is required.

In the final release, we will most likely stick with the TSV file format. You can use the tool ConvertBedToTargetFile if you have a BED file you’d like to use.

Unfortunately, I am not familiar with the Docker image you ask about—-maybe @LeeTL1220 can chime in here.

From dannykwells on 2017-05-09

Hi,

This method looks great and I love the thoroughness of the approach. I have a few questions as we try to implement it:

1. The method as described starts with a BAM file. Is there an assumption that this bam file has been processed more than being aligned? That is, do you remove PCR duplicates from the BAM? Presumably BQSR or the like is not performed here?

2. Ideally, there would be a WDL file you could share that would show this pipeline with code – is that something you have/could point me to?

We are the Parker Institute are very interested in using this approach, and your help here will be great to get this going!

From shlee on 2017-05-12

Hi @dannykwells,

Thanks for your compliments on the method and interest in using this approach. We are excited about it too. Let us know what else we can help you with.

1. Yes, the BAMs are preprocessed according to our Best Practices, including duplicate marking and BQSR. I checked with our developers and it is still the case that this workflow uses the `-keepdups` option to include duplicates. Our developers have found including the duplicates gives empirically better results.

2. The WDLs are in https://github.com/broadinstitute/gatk-protected/tree/master/scripts/cnv_wdl/somatic. Two of the scripts are relevant: cnv_somatic_panel_workflow.wdl is for creating the PoN and cnv_somatic_pair_workflow.wdl is for calling on a Tumor-Normal pair.

From Pepetideo on 2017-05-26

would this process work for very small panels or only exome-type panels?

From shlee on 2017-05-30

Hi @Pepetideo,

This workflow works for exome-type panels. I believe it can also work for small panels so long as you have some requisite minimum number of samples in the PoN. This is because the normalization is per target site.

It would be helpful for us to know exactly what you mean by very small panels.

From dannykwells on 2017-05-30

Hi @shlee Thanks! We are hopefully starting to implement soon.

From EADG on 2017-06-02

Hi,

just for your information for the wdl-Workflow you need the actual gatk4-protected. The 1.2.3.PreRelease wont work!

From EADG on 2017-06-02

Hi @shlee ,

is there a bug in the wdl-file ?

Line 67 In cnv_somatic_tasks.wdl is —transform PCOV, but CombineReadCounts need an absolut readcount. So it had to be set to RAW … Can you confirm this ?

Greetings EADG

From shlee on 2017-06-02

Hi @EADG,

You are asking whether CombineReadCounts requires raw or proportional coverage data. I believe it should take either. Are you getting an error message?

For the workflow presented in this page’s tutorial, CombineReadCounts uses proportional coverage (PCOV) data.

From leiendeckerlu on 2017-06-05

Hi there,

i was wondering about the best approach to do CNV on WGS data?

In principle, I shouldn’t need to define a target region/interval (-T / -L options) as I’m interested on the whole genome, however, to improve the performance it might be good to limit CNV on exome regions, right?

So my question would be, whether there is a way to look only for CNV in exonic regions or whether there is like a standard bed file for hg38 that defines exonic regions that I might use with CalculateTargetCoverage ?

Thank you!

From shlee on 2017-06-05

Hi @leiendeckerlu,

Good news and bad news. We have a new GATK4 GermlineCNVCaller. However, it is still experimental. We’ll have a beta release jar [here](https://github.com/broadinstitute/gatk/releases) sometime this week or next I’m told. Again, the release jar will be in BETA status but the tool itself is in ALPHA status and may undergo further changes.

If you are interested, you can read up on the tool [here](https://github.com/broadinstitute/gatk/blob/56e6baa79b4e56ebee5fb8d2b2288373a4269fa8/src/main/java/org/broadinstitute/hellbender/tools/coveragemodel/germline/GermlineCNVCaller.java). I just updated the document portion, which are the lines that start with an asterisk `*`.

As for your question on whether to define target intervals, this depends on the quality of the analysis set [reference you are using](http://gatkforums.broadinstitute.org/gatk/discussion/7857/reference-genome-components), whether you want to include the sex chromosomes knowing the implications for doing so, etc.

Remember that exonic regions are a small fraction of the whole genome. You can check our [Resource Bundle](https://software.broadinstitute.org/gatk/download/bundle) for interval lists for human. I believe there is an WGS interval list for GRCh38. If you insist on exonic regions, then one suggestion I have that I have myself used in the past, is to take GENCODE exons and converted them in to an intervals list for use with whole exome data.

From leiendeckerlu on 2017-06-05

Hi @shlee,

Thank you for this comprehensive answer! I’ve a few follow up questions:

> Good news and bad news. We have a new GATK4 GermlineCNVCaller. However, it is still experimental. We’ll have a beta release jar [here](https://github.com/broadinstitute/gatk/releases) sometime this week or next I’m told. Again, the release jar will be in BETA status but the tool itself is in ALPHA status and may undergo further changes.

That’s great news and I will check out the GermlineCNVCaller once it is out!

However, I’m a little bit confused now: I only need the GermlineCNVCaller if I want to call for CNV in the germline, right? If I’m only interested in CNV in tumors (somatic), I create a PON with all the normal WGS files (> 20T/N) that I have (using the human interval list or as I described below) and then follow the pipeline described above, or am I missing something here?

Also, what’s per definition the PON when doing germline CNV calling?

> Remember that exonic regions are a small fraction of the whole genome. You can check our [Resource Bundle](https://software.broadinstitute.org/gatk/download/bundle) for interval lists for human. I believe there is an WGS interval list for GRCh38. If you insist on exonic regions, then one suggestion I have that I have myself used in the past, is to take GENCODE exons and converted them in to an intervals list for use with whole exome data.

I just went ahead and created a bed file covering all canoncial transcript regions and used this for my analysis. Should work the same, right? I will rerun the analysis with the human intervals list and see how things change.

Again, thanks a lot!

From leiendeckerlu on 2017-06-06

An additional comment regarding optional step 5 above:

> java -jar gatk4.jar PlotSegmentedCopyRatio \

> -TN sandbox/tumor.tn.tsv \

> -PTN sandbox/tumor.ptn.tsv \

> -S sandbox/tumor.seg \

> -O sandbox \

> -pre tumor \

> -LOG

My version requires the additional option of a sequence dictionary “-SD”, e.g. hg38.dict. I’m not sure what’s the case in the final GATK4 but maybe you should/could add that above.

From EADG on 2017-06-06

Hi @shlee,

yes Iam getting an error-message that CombineReadCounts requires an integer-value in the tsv-file (wdl-workflow only). So i changed —transform PCOV to RAW.

Now Iam running in an other error/exeception right at the end: CreatePanelOfNormals throws the following stack trace:

```

17/06/02 13:15:13 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool

17/06/02 13:15:13 INFO DAGScheduler: ResultStage 3 (treeAggregate at RowMatrix.scala:121) finished in 0,112 s

17/06/02 13:15:13 INFO DAGScheduler: Job 2 finished: treeAggregate at RowMatrix.scala:121, took 1,253390 s

Jun 02, 2017 1:15:14 PM com.github.fommil.jni.JniLoader load

INFORMATION: already loaded netlib-native_system-linux-x86_64.so

17/06/02 13:15:14 INFO SparkUI: Stopped Spark web UI at http://194.XX.XX.XX:4040

17/06/02 13:15:14 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!

17/06/02 13:15:14 INFO MemoryStore: MemoryStore cleared

17/06/02 13:15:14 INFO BlockManager: BlockManager stopped

17/06/02 13:15:14 INFO BlockManagerMaster: BlockManagerMaster stopped

17/06/02 13:15:14 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!

17/06/02 13:15:14 INFO SparkContext: Successfully stopped SparkContext

[2. Juni 2017 13:15:14 MESZ] org.broadinstitute.hellbender.tools.exome.CreatePanelOfNormals done. Elapsed time: 0,11 minutes.

Runtime.totalMemory()=2962751488

breeze.linalg.NotConvergedException: at breeze.linalg.svd$.breeze$linalg$svd$$doSVD_Double(svd.scala:109) at breeze.linalg.svd$Svd_DM_Impl$.apply(svd.scala:39) at breeze.linalg.svd$Svd_DM_Impl$.apply(svd.scala:38) at breeze.generic.UFunc$class.apply(UFunc.scala:48) at breeze.linalg.svd$.apply(svd.scala:22) at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:259) at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:193) at org.broadinstitute.hellbender.utils.svd.SparkSingularValueDecomposer.createSVD(SparkSingularValueDecomposer.java:48) at org.broadinstitute.hellbender.utils.svd.SVDFactory.createSVD(SVDFactory.java:35) at org.broadinstitute.hellbender.tools.pon.coverage.pca.HDF5PCACoveragePoNCreationUtils.calculateReducedPanelAndPInverses(HDF5PCACoveragePoNCreationUtils.java:294) at org.broadinstitute.hellbender.tools.pon.coverage.pca.HDF5PCACoveragePoNCreationUtils.create(HDF5PCACoveragePoNCreationUtils.java:105) at org.broadinstitute.hellbender.tools.exome.CreatePanelOfNormals.runPipeline(CreatePanelOfNormals.java:248) at org.broadinstitute.hellbender.utils.SparkToggleCommandLineProgram.doWork(SparkToggleCommandLineProgram.java:39) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:116) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198) at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:121) at org.broadinstitute.hellbender.Main.mainEntry(Main.java:142) at org.broadinstitute.hellbender.Main.main(Main.java:218)

17/06/02 13:15:14 INFO ShutdownHookManager: Shutdown hook called

```

Iam using the latest version of gatk4 protected. Any clue ?

From EADG on 2017-06-07

Hi @shlee again,

i found what was rising the error…

First I disabled spark like recommended for the non-WDL workflow here. After that CreatePanelOfNormals was running for hours for 3 samples and a handful of targets without creating any output.

```

contig start stop name Sample01 Sample02 Sample03

chr1 36466310 36466515 K2 0 905.1473525473677 0

chr1 36466516 36466733 K3 0 1385.874753971472 0

chr1 36466734 36467090 K4 0 171.77080597042197 0

chr1 36467797 36468252 F1 2064.4097124071554 0 0

chrX 134424919 134425661 D4 1265.198074675947 1546.8064910636297 1659.8150860748076

```

Then I replaced all 0 in the tsv-file with 1.0 and voilà CreatePanelOfNormals was finished in seconds.

```

contig start stop name Sample01 Sample02 Sample03

chr1 36466310 36466515 K2 1.0 905.1473525473677 1.0

chr1 36466516 36466733 K3 1.0 1385.874753971472 1.0

chr1 36466734 36467090 K4 1.0 171.77080597042197 1.0

chr1 36467797 36468252 F1 2064.4097124071554 1.0 1.0

chrX 134424919 134425661 D4 1265.198074675947 1546.8064910636297 1659.8150860748076

```

Is there an option that CreatePanelOfNormals automatically ignores lines with 0 coverage for a specific target ?

From shlee on 2017-06-07

Hi @leiendeckerlu,

I misanswered your initial question! Sorry about that and thanks for the followup. So yes, as you say the GermlineCNVCaller is for germline contexts and for your somatic contexts you want to use somatic workflow. For WGS data, to calculate read coverage, please use [SparkGenomeReadCounts](https://github.com/broadinstitute/gatk-protected/blob/133acb5e6eeb2a712ef17a6f10c0b3d34119c276/src/main/java/org/broadinstitute/hellbender/tools/genome/SparkGenomeReadCounts.java).

You are asking workflow questions and I think you’ll find plumbing the newly public repo helpful. Specifically, there are WDL scripts at . The somatic CNV scripts are in . One of these outlines PoN creation. We will be working on documenting all of these workflows in the forum going forward.

Remember there are differences in BED files versus GATK/Picard-style intervals. This is the difference between 0 and 1-based interval coordinates, repectively. You can use the new [ConvertBedToTargetFile](https://github.com/broadinstitute/gatk/blob/56e6baa79b4e56ebee5fb8d2b2288373a4269fa8/src/main/java/org/broadinstitute/hellbender/tools/exome/convertbed/ConvertBedToTargetFile.java). I’ve just finished documenting it so the link will tell you more. Please be sure to use the GATK/Picard-style intervals with our other tools.

P.S. As for your comment:

> My version requires the additional option of a sequence dictionary “-SD”, e.g. hg38.dict. I’m not sure what’s the case in the final GATK4 but maybe you should/could add that above.

These types of differences are the reason this tutorial mentions exactly which version of the tool we use. There is new optional reference handling in GATK4. These will be documented in the argument options.

From shlee on 2017-06-07

@EADG,

I’ve asked a developer to look into your questions. We’ll get back to you shortly.

From slee on 2017-06-08

Hi @EADG,

You should feel free to use Spark for the non-WDL workflow. When Spark is enabled, the Spark implementation of SVD is used to perform PCA (rather than the Apache Commons implementation).

The test case with a very small number of targets that you are trying to run is rather idiosyncratic. Furthermore, there is some hard filtering that occurs before the PCA that gets rid of low coverage or extreme targets, so your first case is probably getting reduced to a single target. This is probably causing the Apache implementation to fail silently. Can I ask what you are trying to test with this small number of targets?

Also, can you post the stacktrace for the previous error with CombineReadCounts that you were running into? Thanks!

From EADG on 2017-06-09

Hi @slee,

thx for your quick answer. I just performed a dry run of the wdl-pipeline, that why i choose such a little number of targets and samples. Is there a minimum number of samples/targets which you can recommend for a dry run ?

09:13:50.875 INFO HDF5PCACoveragePoNCreationUtils - All 29 targets are kept 09:13:50.889 INFO HDF5PCACoveragePoNCreationUtils - There were no columns with a large number of targets with zero counts (<= 1 of 29) to drop 09:13:50.892 INFO HDF5PCACoveragePoNCreationUtils - There are no targets with large number of columns with zero counts (<= 1 of 3) to drop 09:13:50.895 INFO HDF5PCACoveragePoNCreationUtils - No column dropped due to extreme counts outside [0,9889360708, 1,1610151841] 09:13:50.903 INFO HDF5PCACoveragePoNCreationUtils - No count is 0.0 thus no count needed to be imputed. 09:13:50.908 INFO HDF5PCACoveragePoNCreationUtils - None of the 87 counts were truncated as they all fall in the non-extreme range [0,82, Infinity] 09:13:50.909 INFO HDF5PCACoveragePoNCreationUtils - Counts normalized by the column median and log2'd. 09:13:50.909 INFO HDF5PCACoveragePoNCreationUtils - Counts centered around the median of medians 0,00 09:13:50.910 WARN HDF5PCACoveragePoNCreationUtils - No Spark context provided, not going to use Spark... 09:13:50.910 INFO HDF5PCACoveragePoNCreationUtils - Starting the SVD decomposition of the log-normalized counts ...

Is that the filtering which you mentioned ? It seems nothing being filtered for my case...

Stacktrace for CombineReadCount: ``` [2. Juni 2017 12:03:53 MESZ] org.broadinstitute.hellbender.tools.exome.CombineReadCounts done. Elapsed time: 0,00 minutes. Runtime.totalMemory()=1513619456

A USER ERROR has occurred: Bad input: format error in '/data/capture/cromwell-executions/CNVSomaticPanelWorkflow/b60397d6-a8ed-442d-b27c-5c6d51dfca08/call-CombineReadCounts/inputs/data/capture/cromwell-executions/CNVSomaticPanelWorkflow/b60397d6-a8ed-442d-b27c-5c6d51dfca08/call-CollectCoverage/shard-0/execution/Sample1.aligned.duplicate_marked.sorted.coverage.tsv' at line 5: expected int value for column Sample1 but found 0,000

``` Greetings EADG

From slee on 2017-06-12

Hi @EADG,

In practice, I haven’t found it necessary to do a dry run. But perhaps ~10 samples and ~100 targets would be enough to avoid any weird Apache-related errors.

I am also not sure why you are getting that CombineReadCounts error. Are you sure that Sample1.aligned.duplicate_marked.sorted.coverage.tsv is a properly formatted TSV? If looks like you may have some commas in that file that are being parsed incorrectly, resulting in the message that you see. Can you confirm if this is the case?

You should be able to pass PCOV coverage files to CombineReadCounts, and then pass the resulting file on to CreatePanelOfNormals, as is done in the WDL.

From wleeidt on 2017-06-27

I tried running CalculateTargetCoverage with —groupBy SAMPLE option, it finished successfully but I don’t see the 5th column (proportional coverage) in the output tsv file. When I tried —groupBy COHORT, I got the 5th column and all the values are NaN. Here the command I used: `java -Djava.io.tmpdir=$TMPDIR -Xmx4g -jar gatk4-protected-local.jar CalculateTargetCoverage —reference $REF -I $BAM -T $TARGETS -transform PCOV —targetInformationColumns FULL —groupBy SAMPLE -O ${OUTDIR}/${SM}.tsv`. Am I missing something?

From shlee on 2017-06-28

I @wleeidt,

The columns represent: contig, start, stop, name and then the sample. You are saying the first four columns have values (target information) but the sample column is empty of values for the targets. I have two suggestions.

[1] Include your duplicate reads with:

``` —disableReadFilter NotDuplicateReadFilter \

```

[2] Use the latest release at . We’ve merged the protected repo with the gatk repo.

Let me know if you still get the error after [1], and then after [2].

From wleeidt on 2017-07-11

Thanks @shlee. I downloaded the latest release and tried your suggestion with the `—disableReadFilter` parameter. It still didn’t generate the sample column. Here is the command that I used: `/mnt/mywork/bin/GATK4/gatk-4.beta.1/gatk-launch CalculateTargetCoverage -I B_2.sorted.bam -T input_target.tsv —transform PCOV —groupBy SAMPLE —targetInformationColumns FULL —disableReadFilter NotDuplicateReadFilter -O cov/B_2.targetCov.tsv`

From shlee on 2017-07-11

Hi @wleeidt,

Can you check two things for me then. First, can you `samtools view -H B_2.sorted.bam | grep ‘@RG’` and post what you see. Is there an SM field? Second, if all seems well with your RG fields, then can you run Picard CollectHsMetrics on your BAM to ensure there is coverage for your targets? You will have to convert your targets intervals to a Picard format target intervals. You can use the same target intervals for the BAIT_INTERVALS and TARGET_INTERVALS.

From anand_mt on 2017-07-12

shlee said: > Hi

dannykwells,

> Thanks for your compliments on the method and interest in using this approach. We are excited about it too. Let us know what else we can help you with.

> 1. Yes, the BAMs are preprocessed according to our Best Practices, including duplicate marking and BQSR. I checked with our developers and it is still the case that this workflow uses the `-keepdups` option to include duplicates. Our developers have found including the duplicates gives empirically better results.

> 2. The WDLs are in https://github.com/broadinstitute/gatk-protected/tree/master/scripts/cnv_wdl/somatic. Two of the scripts are relevant: cnv_somatic_panel_workflow.wdl is for creating the PoN and cnv_somatic_pair_workflow.wdl is for calling on a Tumor-Normal pair.

Will this work if I have a single tumor-normal pair (not a panel)? Can I just make pon using single normal sample and use it as downstrem ?

From EADG on 2017-07-12

Hi @slee,

sry for my late response, I was captured by a bunch of urgent analyses. I run your original wdl for pon creation with 80 samples and got the same error. I will attach a snippet of the file later this day.

Greetings EADG

From shlee on 2017-07-12

Hi @anand_mt,

Please construct a CNV PoN from a minimum of 10 samples. The recommended number of samples is 40. CreatePanelOfNormals will error if you provide it just a single normal. It will also error if you provide it the same normal (with different names) multiple times. Also, the point of the workflow is to denoise coverage profiles with the PoN. Remember that copy number profiles are inherently noisy as there is some stochastic element to coverage depths. A good PoN is critical to discerning CNV events from inherently noisy coverages with the sensitivity that somatic analyses require.

P.S. I have just finished writing a new tutorial (for the UK workshop) that shows the effects of calling CNV events from two different CNV PoNs. You might find taking the hour to run through it helpful in understanding the importance of the PoN. In fact, let me just attach the worksheet here. It’s the first iteration and will likely undergo further improvements. Please direct any questions you have about the new tutorial to me if anything is unclear.

Hammer_SomaticCNV_worksheet.pdf

From anand_mt on 2017-07-13

Thank you for the document and all the nice explanation. I was about to do something stupidly similar which has been addressed in the document (Using TCGA blood normals as to create PoN). Thank you, super helpful.

From wleeidt on 2017-07-13

Thanks @shlee! After adding read group in the bam header, I got the column with the coverage.

From wleeidt on 2017-07-13

@shlee. I ran into problem running gatk-launch CombineReadCounts -inputList normals.txt -O S1-S2-cov.tsv, where the normals.txt contains the list of tsv files (in full path) generated by CalculateTargetCoverage. I checked to make sure that the files in the normals.txt are present. But when I run CombineReadCounts, it complaint that it couldn't find the files: `Using GATK jar /mnt/work/bin/GATK4/gatk-4.beta.1/gatk-package-4.beta.1-local.jar Running: java -Dsamjdk.useasyncioreadsamtools=false -Dsamjdk.useasynciowritesamtools=true -Dsamjdk.useasynciowritetribble=false -Dsamjdk.compressionlevel=1 -Dsnappy.disable=true -jar /mnt/work/bin/GATK4/gatk-4.beta.1/gatk-package-4.beta.1-local.jar CombineReadCounts -inputList normals.txt -O S1-S2-cov.tsv 12:19:22.811 INFO NativeLibraryLoader - Loading libgklcompression.so from jar:file:/mnt/work/bin/GATK4/gatk-4.beta.1/gatk-package-4.beta.1-local.jar!/com/intel/gkl/native/libgklcompression.so [July 13, 2017 12:19:22 PM PDT] CombineReadCounts --inputList normals.txt --output S1-S2-cov.tsv --maxOpenFiles 100 --help false --version false --showHidden false --verbosity INFO --QUIET false --usejdkdeflater false --usejdkinflater false [July 13, 2017 12:19:22 PM PDT] Executing as wlee@test2 on Linux 3.10.0-514.21.2.el7.x8664 amd64; OpenJDK 64-Bit Server VM 1.8.092-b15; Version: 4.beta.1 12:19:22.842 INFO CombineReadCounts - HTSJDK Defaults.COMPRESSIONLEVEL : 1 12:19:22.842 INFO CombineReadCounts - HTSJDK Defaults.USEASYNCIOREADFORSAMTOOLS : false 12:19:22.842 INFO CombineReadCounts - HTSJDK Defaults.USEASYNCIOWRITEFORSAMTOOLS : true 12:19:22.842 INFO CombineReadCounts - HTSJDK Defaults.USEASYNCIOWRITEFOR_TRIBBLE : false 12:19:22.842 INFO CombineReadCounts - Deflater: IntelDeflater 12:19:22.842 INFO CombineReadCounts - Inflater: IntelInflater 12:19:22.842 INFO CombineReadCounts - Initializing engine 12:19:22.842 INFO CombineReadCounts - Done initializing engine 12:19:22.852 INFO CombineReadCounts - Shutting down engine [July 13, 2017 12:19:22 PM PDT] org.broadinstitute.hellbender.tools.exome.CombineReadCounts done. Elapsed time: 0.00 minutes. Runtime.totalMemory()=1561853952

A USER ERROR has occurred: Couldn't read file /usr/local/wlee/AHM/cov/S1.targetCov.tsv . Error was: /usr/local/wlee/AHM/cov/S1.targetCov.tsv (No such file or directory)`

From shlee on 2017-07-13

@wleeidt,

Can you echo or cat your file to ensure it exists? Make sure the `tsv` file doesn’t have a `.txt` extension.

From wleeidt on 2017-07-13

Yes. they are in .tsv format. Here is a little snippet of the file: ##fileFormat = tsv ##commandLine = CalculateTargetCoverage --output /mnt/HPC/work/wlee/SpinalTap/170707NB5020230003AHMLHJAFXX/cov/20170707-HMLHJAFXX-S1.targetCov.tsv --groupBy sample --transform PCOV --targets /mnt/HPC/work/wlee/SpinalTap/170707NB5020230003AHMLHJAFXX/bed/Exomev1CNVv2Targets.merged.tsv --targetInformationColumns FULL --input /mnt/HPC/work/wlee/SpinalTap/170707NB5020230003AHMLHJAFXX/bam/20170707-HMLHJAFXX-S1.sorted.RG.bam --cohortName --intervalsetrule UNION --intervalpadding 0 --intervalexclusionpadding 0 --readValidationStringency SILENT --secondsBetweenProgressUpdates 10.0 --disableSequenceDictionaryValidation false --createOutputBamIndex true --createOutputBamMD5 false --createOutputVariantIndex true --createOutputVariantMD5 false --lenient false --addOutputSAMProgramRecord true --addOutputVCFCommandLine true --cloudPrefetchBuffer 40 --cloudIndexPrefetchBuffer -1 --disableBamIndexCaching false --help false --version false --showHidden false --verbosity INFO --QUIET false --usejdkdeflater false --usejdkinflater false --disableToolDefaultReadFilters false ##title = Read counts per target and sample contig start stop name 20170707-HMLHJAFXX-S1 chr1 14870 14990 core1 1.735e-05 chr1 69090 70008 46935846379501(OR4F5)12 2.392e-05 chr1 367658 368597 469360600729759(OR4F29)13 4.171e-05 chr1 565346 565466 core2 2.082e-06 chr1 621095 622034 46935870881399(OR4F16)14 4.382e-05 chr1 755880 756000 core_3 6.501e-06

From shlee on 2017-07-13

Next thing to confirm is that none of your files have duplicate sample names?

From wleeidt on 2017-07-13

For testing, I only have 2 samples, and they have unique sample names. I have tried running `CombineReadCounts` on another two samples, and it works. The process of generating the tsv files was consistent, using `CalculateTargetCoverage`.

From shlee on 2017-07-14

@wleeidt, that’s odd. If you provide each coverage with the `-I` argument, does CombineReadCounts process them? If not, would you mind regenerating your coverages for the two samples fresh and then running CombineReadCounts on them?

From wleeidt on 2017-07-14

@shlee. Thanks. When I use `-I` argument for each of the coverage file in the `CombineReadCounts`, it processes them just fine. But when I tried the argument `inputList normals.txt`, where normals.txt contains the fullpaths to the same two coverage files, it fails.

From shlee on 2017-07-14

Alright, then something’s up with your file paths. I think the tool is able to handle both absolute and relative file paths. I don’t think ownership of the target directory should make a difference, e.g. read-write permissions, as you should have at least read permission for all of your directories. Is there anything else you can say is different between the file paths that worked and did not work in your test of the `inputList`?

From EADG on 2017-07-17

HI @slee,

as promised i put an example of one of the tsv-files I feed to CombineReadCounts: ```

fileFormat = tsv

commandLine = org.broadinstitute.hellbender.tools.exome.CalculateTargetCoverage --output Sample.tsv --groupBy sample --transform PCOV --targets /$

title = Read counts per target and sample

contig start stop name Sample chr1 23557682 23559182 X22.1 0,0007630 chr1 23558083 23559786 X31.2 0,0008154 chr1 26695143 26697900 X41.3 0,002453 chr1 26725390 26730523 X52.4 0,0008222 chr1 26720891 26731964 X6_3.5 0,001617 ... ``` Still got the same error...

Hm when I review the course-material from last week I see that there a point(.) instead of a comma (,) maybe it is something wrong with my language options in linux/R/ or elsewhere ?

Thanks in advance EADG

From shlee on 2017-07-17

@wleeidt, I hope you were able to resolve your issue. If so, please do share so the wider community can benefit. If not, let’s file a bug report. Instructions are in [Article#1894](https://software.broadinstitute.org/gatk/guide/article?id=1894).

From shlee on 2017-07-17

@EADG,

Can you replace your commas with points and see if you still get the error? Thanks.

P.S. Our developer says that you should expect the error you see due to the commas.

From wleeidt on 2017-07-17

@shlee. I ran `PlotSegmentedCopyRatio`, and it returned SUCCESS, but no plots are generated except for a file called `plotting_dump.rda`. Here is the command `gatk-4.beta.1/gatk-launch PlotSegmentedCopyRatio -TN S4_tumor.pn.tsv -PTN S4_tumor.ptn.tsv -S S4_tumor.seg -O sandbox -SD hg19.dict -pre S4_gatk4_cnv_segment -LOG`. I tried running it in Xterm on a mac, still no plot is generated.

From EADG on 2017-07-18

Hi shlee, slee,

yes it is working when I replace the commas with points, but since I am using the wdl-pipeline which was provided by slee on gitHub, this is not a sufficient workaround. Due to my expectation that the problem (commas instead of points) arise from my language settings I try different thing to change my actual language on the system. To make the long story short the solution was to start CalculateTargetCoverage with the flags (I added them in the cnv_somatic_tasks.wdl):

```

java -Duser.language=en -Duser.country=US -Xmx${default=4 mem}g -jar ${gatk_jar} CalculateTargetCoverage

```

And now the cnv_somatic_panel_workflow.wdl runs smoothly to the end, with points instead of commas :)

An other solution would be to running the pipeline in a DockerContainer wich have an english java-version installed, like the one which is provide by the gatk-team.

I hope this helps other user too which are working with non-englisch java versions.

Greetings EADG

From shlee on 2017-07-18

@wleeidt, Can you double check that the names of the contigs in the sequence dictionary match in representation to that in your data ?

From wleeidt on 2017-07-18

@shlee. I checked the bam header and the sequence dictionary, and the configs are identical.

From shlee on 2017-07-18

Thanks @EADG. I’ll let our developer know that the tools themselves generate the data with the commas that then error downstream tools. And thanks for the workaround of adding `-Duser.language=en -Duser.country=US` to commands. For gatk-launch, these options would then be:

```

gatk-launch —javaOptions “-Duser.language=en -Duser.country=US“

```

From shlee on 2017-07-18

Ok, @wleeidt. Let’s fill out a bug report so we can recapitulate your error on our side and get to the source of the error. Instructions for filing a bug report are in [Article#1894](https://software.broadinstitute.org/gatk/guide/article?id=1894).

From wleeidt on 2017-07-18

@shlee. Thanks for sending the instruction for submitting a bug report. I have put all the required files, the descriptions of how I ran the command, and the stdout of the tool in a text file in bugReport_by_wleeidt.updated.zip and uploaded to the ftp server.

From shlee on 2017-07-18

Hi @wleeidt,

I have received your data and in my hands I can generate plots. Your data shows a negative dQc value that indicates PoN normalization was ineffective. I will have a developer look into better error messaging and they may followup here with questions.

Three dependencies pop into my head and these are:

- Does your system have R 3.1.3 or higher installed?

- Do you have XQuartz installed?

- Finally, did you run the Rscript to install the dependent packages?

Can you check to see if you have all of the above? If not, then can install these and run your command again? You can find links/instructions for these dependencies in the attached PDF. This is the first iteration (to be improved) of a somatic CNV tutorial that I recently finished writing. It compares the effects of using two different PoNs for denoising and how it can impact somatic CNV detection.

Hammer_SomaticCNV_worksheet.pdf

From wleeidt on 2017-07-19

@shlee. I had installed XQuartz, and R 3.3.3 was installed. I just installed the R dependencies from [https://github.com/broadinstitute/gatk/blob/master/scripts/docker/gatkbase/install_R_packages.R](https://github.com/broadinstitute/gatk/blob/master/scripts/docker/gatkbase/install_R_packages.R). Now `PlotSegmentedCopyRatio` is able to generate the plots. Thanks so much for your help!

From shlee on 2017-07-19

That’s great @wleeidt. Glad you can move ahead.

From wleeidt on 2017-07-19

@shlee. Is there a tool for plotting the B-allele frequency along with the copy number ratio?

From shlee on 2017-07-19

@wleeidt, [PlotACNVResults](https://software.broadinstitute.org/gatk/documentation/tooldocs/4.beta.1/org_broadinstitute_hellbender_tools_exome_plotting_PlotACNVResults.php) can plot both segmented copy-ratio and minor-allele-fraction estimates.

From wleeidt on 2017-07-19

@shlee. I have a question about the PoN. You mentioned previously that it’s recommended to have at least 10 normal samples to generate the PoN. What if I only have 2 normal samples? Is that going to affect the accuracy of the whole CNV analysis? I plotted the minor allele frequency using PlotACNVResults using GetBayesianHetCoverage to get the het SNPs with one normal bam and one tumor bam. And in the end, the minor allele frequency all hovering around 0.3, which looks strange.

Screen Shot 2017-07-19 at 12.53.49 PM.png

From shlee on 2017-07-19

@wleeidt,

In general, I think it best if you follow the recommendations of minimum 10 to 40 samples for the PoN. For your other questions, because I am unfamiliar with the ACNV workflow and GetBayesianHetCoverage, I am passing your question on to our developers. You can click on the Issue Number above to view any ensuing discussion.

From slee on 2017-07-25

@wleeidt,

If you only have 2 normal samples in your PoN, that will reduce the amount of denoising that you can do on your copy-ratio data. Noisy copy-ratio data may yield poor copy-ratio segments, and since the latter is used as input to ACNV, may ultimately affect minor-allele-fraction estimates.

However, from your plots, it looks like your copy-ratio data is not too noisy, so I don’t think this is to blame for your hets/minor-allele-fraction result. It looks like the hets are indeed unusual—-there are many hets with zero minor-allele-fraction and the rest are rather widely distributed in [0, 0.5].

I’m not sure if I can diagnose the issue without knowing more about your data. What was the depth of your normal and tumor? Were they sequenced using the identical capture kits? Could you post your GetBayesianHetCoverage and AllelicCNV command line invocations?

From wleeidt on 2017-07-25

@shlee. I ran into an error in `AllelicCNV` that I didn’t see before, and it only happened to one of my samples.

Screen Shot 2017-07-25 at 7.56.41 AM.png

From wleeidt on 2017-07-25

@slee. The samples are using the same capture kit. Here are the commands: `gatk-launch —javaOptions “-Xmx8g” GetBayesianHetCoverage —reference hg19.fa —snpIntervals hg19.snpv147.interval_list —tumor S4.sorted.bam —normal S1.sorted.bam —tumorHets tumor.S4.hets.tsv —normalHets normal.S1.hets.tsv` and `gatk-launch —javaOptions “-Xmx8g” AllelicCNV -pre S4_gatk4_cnv -S S4_tumor.seg -TN S4_tumor.pn.tsv -THET tumor.S4.hets.tsv`. The mean target coverage for normal samples are around 80, while S4 tumor sample is a bit lower, 48.

From slee on 2017-07-25

@wleeidt It looks like your command lines are in order and the coverage sounds reasonable. Perhaps you can try running the normal through AllelicCNV to see if you get reasonable minor-allele fractions around 0.5, as a sanity check? Could it be reasonable that your tumor is high purity and exhibits a genome-wide CNV duplicating one of the alleles (since the MAFs are suspiciously close to 1/3), or could it be contaminated by another normal? It’s hard for me to say anything conclusive without examining your data in detail, but the tool appears to be fitting the data you’re giving it as expected.

From wleeidt on 2017-07-25

I ran the normal sample and the minor-allele fractions are mostly 0.5 (except for chromosomes 1, 5, and 8). Please see attached plot. The tumor sample that I have is just a reference sample with known CNVs (https://www.horizondiscovery.com/structural-multiplex-reference-standard-hd753).

Screen Shot 2017-07-25 at 2.35.39 PM.png

From slee on 2017-07-25

@wleeidt The normal sample looks very reasonable (at least in terms of hets and minor-allele fractions; it’s possible that any activity in the copy ratio is being denoised away, since your normal is included in your PoN—-which we don’t recommend!)

Just to double check, the normal and tumor samples S1 and S4 are matched?

Also, regarding your AllelicCNV error, this is a bug that we’ve encountered previously and should be fixed (see https://github.com/broadinstitute/gatk-protected/issues/804). It’s possible that you’ve encountered another edge case and we may have to revise the fix. What version of the jar are you running?

From wleeidt on 2017-07-26

@slee. We only have two normal samples, which I used them to make the PoN, and I used one of the two normal samples for the CNV analysis with a tumor sample. What kind of normal samples would you recommend if I don’t use the “normal” that I have in the CNV analysis. Also, for this dataset, the samples are not exactly tumor/normal matched pair. The normal sample is the Horizon’s AKT1 Wild Type Reference Standard (https://www.horizondiscovery.com/akt1-wild-type-reference-standard-hd659) while the “tumor” sample is the Horizon’s Structural Multiplex Reference Standard (https://www.horizondiscovery.com/structural-multiplex-reference-standard-hd753). Essentially, I am trying to have known CNV references to see if we can detect them. I also have tumor/normal matched dataset analysis results, which is very noisy (see attached). Do you know of any GIAB type of dataset that we can use to do benchmarking?

Screen Shot 2017-07-26 at 8.50.05 AM.png

From slee on 2017-07-27

@wleeidt I see—-for the allelic workflow, the samples must be matched. The purpose of the GetBayesianHetCoverage tool is to identify het sites in the normal and to collect the corresponding allelic counts at these sites in the tumor, which allows the minor-allele fraction to be estimated from the change in the alt-allele fraction.

As for the issue of which normal samples to use to build a PoN, let me clarify: the purpose of the PoN is to use normal samples to learn patterns of systematic sequencing noise, which is essentially done using PCA. However, it’s possible that any CNVs in the normal samples, if they contribute enough to the covariance of the data, may get picked up as sequencing noise. When you have only two normals in the PoN, any activity in them may indeed contribute significantly to the covariance!

In our sanity check, we used only two normals to build the PoN and then tried to use the PoN to denoise the copy-ratio data from one of those normals. So it’s likely that CNV activity got “denoised” away in the copy-ratio result for this normal, but that it would still show up in the allele-fraction data—-this is a possible explanation for the behavior we see in chromosomes 1, 5, and 8.

If we had really intended to perform a proper analysis on that normal sample (rather than just a sanity check of the allele-fraction result), a better approach would have been to build a PoN of many other normal samples (as @shlee said earlier, we recommend at least 10s of samples), not including the normal of interest, and then using these to denoise said normal.

As for benchmarking, we often use TCGA samples with matched WES/WGS to check concordance. There are also some cell lines, such as HCC1143 and COLO829, which have CNV call sets. However, I think a somatic benchmark on the level of the GIAB datasets is still not available.

From slee on 2017-07-27

@wleeidt Also, if you could let me know the version you ran, I can look into that AllelicCNV bug you encountered.

From wleeidt on 2017-07-27

@slee. Thanks for the explanation. I was using gatk-4.beta.1.

From slee on 2017-07-27

@wleeidt OK, thanks. Looks like we’ll have to amend the bug fix. Thanks for catching it!

From slee on 2017-08-01

@wleeidt Just to let you know, I amended the bug fix in https://github.com/broadinstitute/gatk/pull/3378. Please let me know if that fixes your error on that sample, when you get a chance.

From Sheila on 2017-08-01

@EADG

Hi EADG,

Can you tell us which version of the tool you are using?

One of the developers says this should not be an issue with the latest release. They recently made a change that should prevent this from happening (set the locale to English/US before running the tool).

-Sheila

From ameynert on 2017-08-02

Running the AllelicCNV tool in gatk-4.beta.3-SNAPSHOT, I’m getting a dynamic library linking error:

```

Using GATK jar /gpfs/igmmfs01/eddie/bioinfsvice/ameynert/software/gatk-4.beta.3-SNAPSHOT/gatk-package-4.beta.3-SNAPSHOT-local.jar

Running: java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true -Xmx16G -jar /gpfs/igmmfs01/eddie/bioinfsvice/ameynert/software/gatk-4.beta.3-SNAPSHOT/gatk-package-4.beta.3-SNAPSHOT-local.jar AllelicCNV —tumorHets ../cnvs/allelic/WW00250a.het_pulldown —tangentNormalized ../cnvs/segments/HGSOC.normals/WW00250a.tn.tsv —segments ../cnvs/segments/HGSOC.normals/WW00250a.called.tsv —outputPrefix ../cnvs/allelic/WW00250a

13:14:33.211 INFO NativeLibraryLoader – Loading libgkl_compression.so from jar:file:/gpfs/igmmfs01/eddie/bioinfsvice/ameynert/software/gatk-4.beta.3-SNAPSHOT/gatk-package-4.beta.3-SNAPSHOT-local.jar!/com/intel/gkl/native/libgkl_compression.so

[02 August 2017 13:14:33 BST] AllelicCNV —tumorHets ../cnvs/allelic/WW00250a.het_pulldown —tangentNormalized ../cnvs/segments/HGSOC.normals/WW00250a.tn.tsv —segments ../cnvs/segments/HGSOC.normals/WW00250a.called.tsv —outputPrefix ../cnvs/allelic/WW00250a —useAllCopyRatioSegments false —maxNumIterationsSNPSeg 25 —smallSegmentThreshold 3 —numSamplesCopyRatio 100 —numBurnInCopyRatio 50 —numSamplesAlleleFraction 100 —numBurnInAlleleFraction 50 —intervalThresholdCopyRatio 4.0 —intervalThresholdAlleleFraction 2.0 —maxNumIterationsSimSeg 10 —numIterationsSimSegPerFit 1 —sparkMaster local[*] —help false —version false —showHidden false —verbosity INFO —QUIET false —use_jdk_deflater false —use_jdk_inflater false

[02 August 2017 13:14:33 BST] Executing as ameyner2@node1h26.ecdf.ed.ac.uk on Linux 3.10.0-327.36.3.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_141-b16; Version: 4.beta.3-SNAPSHOT

```

snipped out normal output

```

Aug 02, 2017 3:30:56 PM com.github.fommil.jni.JniLoader liberalLoad

INFO: successfully loaded /tmp/ameyner2/jniloader3462942138955839305netlib-native_system-linux-x86_64.so

java: symbol lookup error: /tmp/ameyner2/jniloader3462942138955839305netlib-native_system-linux-x86_64.so: undefined symbol: cblas_daxpy

```

Any ideas?

From EADG on 2017-08-10

Hi @Sheila,

i try to get the correct version…

``` java -jar gatk-protected-local.jar CombineReadCounts —version True

Version:version-unknown-SNAPSHOT

Tool returned:

```

But i think it is one before the lattest. I will try the latest version and check if the error persist.

From Sheila on 2017-08-10

@ameynert

Hi,

I have asked someone else on the team to respond. They will get back to you soon.

-Sheila

From shlee on 2017-08-10

Hi @ameynert,

It appears that you are using the jar directly by invoking it with `java -jar`. Can you try your command via the gatk-launch script, e.g. `gatk-launch AllelicCNV …` and let me know if you get the same error? Also, can you tell me the reference you are using, e.g. GRCh38? Thanks.

From EADG on 2017-08-10

Hi Sheila, shlee

the new version (version 4.beta.3 ) fixed the issue, thank for your help.

By the way i see the same behavior/issue with Picard MergeVcfs in the latest picard-version (Picard version: 2.10.9-SNAPSHOT)…

Greetings EADG

From shlee on 2017-08-10

Hi @EADG,

Which issue do you speak of?

From EADG on 2017-08-10

The point and comma issue with CalculateTargetCoverage which was coming from a different language setting…

I also observed the issue with MergeVcfs in Picard when I combine mutiple vcf on a maschine with german language settings..

From shlee on 2017-08-10

Hi @EADG,

Does your previously stated solution:

```

java -Duser.language=en -Duser.country=US -Xmx${default=4 mem}g -jar ${jar}

```

work towards your MergeVcfs issue?

From EADG on 2017-08-11

Hi @shlee,

yes it does but maybe the developers can fix this in coming versions of picard…

From Geraldine_VdAuwera on 2017-08-11

Soon you’ll be able to invoke Picard from GATK4 so this will be solved that way.

From EADG on 2017-08-11

Yeah looking forward to it :)

From ameynert on 2017-08-14

Hi @shlee. I’m running using gatk-launch and it’s assembly hg38.

From shlee on 2017-08-14

Hi @ameynert,

My best guess is your environment’s Java is choking on GRCh38 contigs, e.g. those that contain colons `:` and/or asterisks `*`. Here is a command that I know folks in the research community use that works:

```

java -Xmx7g -Djava.library.path=/usr/lib/jni/ -jar ${jar_fn} AllelicCNV …

```

This was shared with me last week and it is the first time I’ve seen folks specify the java jni path. I’m not sure if this would make a difference for your case, but perhaps you could try it.

Otherwise, can you do me a favor and try your command with the official GATK Docker? The Docker image is `broadinstitute/gatk:4.beta.3`. You can invoke the launch script with `/gatk/gatk-launch`. Currently, our official WDL scripts use the jar. Here’s an example: https://github.com/broadinstitute/gatk/blob/master/scripts/cnv_wdl/somatic/cnv_somatic_allele_fraction_pair_workflow.wdl. But we recommend invoking the program with the `gatk-launch` script.

If you need simple instructions on how to start using Docker locally, then let me know.

P.S. With `gatk-launch`, I believe you specify java options as `—javaOptions “-Djava.library.path=/usr/lib/jni/”`.

P.P.S. I am trying to find my reply to another forum member about why you should exclude such alt and decoy contigs from CNV analyses. Alas, I cannot find it. Please let me know if you need me to expand on why.

From anand_mt on 2017-08-17

Hi,

I have been using gatk4 alpha for copy number analysis. I created PoN and did segmentation. Now I want to check for allelic CN which requires heterozygous sites. For now I am using germline heterozygous variants. is it okay ? Are there any suggestion on selecting variants ?

Sorry for cross-posting. (https://gatkforums.broadinstitute.org/gatk/discussion/10164/gatk4-allelic-cnv#latest)

From shlee on 2017-08-17

Hi @anand_mt,

I believe you provide GetHetCoverage a list of common SNP sites with the `—snpIntervals` argument and the tool should derive sites of heterozygous variants for you from the normal BAM. These sites are written to the `—normalHets` file that you name.

From shlee on 2017-08-17

Hi @ameynert,

I’ve written up simple instructions on getting started with Docker at https://gatkforums.broadinstitute.org/gatk/discussion/10172/. The tutorial is titled (How to) Run the GATK4 Docker locally and take a look inside. The document lives under the gatk-4-beta category at https://gatkforums.broadinstitute.org/gatk/categories/gatk-4-beta on account of GATK4 being in beta status.

From anand_mt on 2017-08-18

shlee said: > Hi

anand_mt,

> I believe you provide GetHetCoverage a list of common SNP sites with the `—snpIntervals` argument and the tool should derive sites of heterozygous variants for you from the normal BAM. These sites are written to the `—normalHets` file that you name.

Thank you for the reply. Just to make sure, ‘common SNP sites’ would be germline variants (present in both tumor and normal)?

From shlee on 2017-08-18

Hi @anand_mt,

Common SNP sites are where a species shows variation commonly across individuals. So the variation refers to germline variants. You could use a known variant sites resource, e.g. 1000 Genomes Project callset. Alternatively, you could use your Normal’s genotype callset, if you have it. The tool uses the callset as a list of genomic positions to check within the BAMs and will do it’s own determination of whether the Normal is heterozygous. Having this list allows the tool to be efficient as it doesn’t have to comb through each genomic position to assess whether the Normal is het at the position.

From ameynert on 2017-08-22

Hi @shlee, I’ve installed Docker and got gatk 4 beta 3 running in an instance. AllelicCNV is still failing for me, though at a different point. I checked, R is installed and looks fine.

```

root@8e464076458d:/home/cnvs# ../../gatk/gatk-launch —javaOptions “-Xmx32G” AllelicCNV —tumorHets $OUTDIR/$TUMOUR.het_pulldown —tangentNormalized $SEGDIR/$TUMOUR.tn.tsv —segments $SEGDIR/$TUMOUR.called.tsv —outputPrefix $OUTDIR/$TUMOUR

Using GATK wrapper script /gatk/build/install/gatk/bin/gatk

Running: /gatk/build/install/gatk/bin/gatk AllelicCNV —tumorHets allelic/WW00246a.het_pulldown —tangentNormalized segments/WW00246a.tn.tsv —segments segments/WW00246a.called.tsv —outputPrefix allelic/WW00246a

09:29:09.575 INFO NativeLibraryLoader – Loading libgkl_compression.so from jar:file:/gatk/build/install/gatk/lib/gkl-0.5.2.jar!/com/intel/gkl/native/libgkl_compression.so

[August 22, 2017 9:29:09 AM UTC] AllelicCNV —tumorHets allelic/WW00246a.het_pulldown —tangentNormalized segments/WW00246a.tn.tsv —segments segments/WW00246a.called.tsv —outputPrefix allelic/WW00246a —useAllCopyRatioSegments false —maxNumIterationsSNPSeg 25 —smallSegmentThreshold 3 —numSamplesCopyRatio 100 —numBurnInCopyRatio 50 —numSamplesAlleleFraction 100 —numBurnInAlleleFraction 50 —intervalThresholdCopyRatio 4.0 —intervalThresholdAlleleFraction 2.0 —maxNumIterationsSimSeg 10 —numIterationsSimSegPerFit 1 —sparkMaster local[*] —help false —version false —showHidden false —verbosity INFO —QUIET false —use_jdk_deflater false —use_jdk_inflater false

[August 22, 2017 9:29:09 AM UTC] Executing as root@8e464076458d on Linux 4.9.36-moby amd64; OpenJDK 64-Bit Server VM 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11; Version: 4.beta.3-SNAPSHOT

09:29:10.003 INFO AllelicCNV – HTSJDK Defaults.COMPRESSION_LEVEL : 1

09:29:10.003 INFO AllelicCNV – HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false

09:29:10.004 INFO AllelicCNV – HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true

09:29:10.005 INFO AllelicCNV – HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false

09:29:10.005 INFO AllelicCNV – Deflater: IntelDeflater

09:29:10.005 INFO AllelicCNV – Inflater: IntelInflater

09:29:10.005 INFO AllelicCNV – GCS max retries/reopens: 20

09:29:10.005 INFO AllelicCNV – Using google-cloud-java patch 317951be3c2e898e3916a4b1abf5a9c220d84df8

09:29:10.005 INFO AllelicCNV – Initializing engine

09:29:10.006 INFO AllelicCNV – Done initializing engine

09:29:11.162 WARN NativeCodeLoader:62 – Unable to load native-hadoop library for your platform… using builtin-java classes where applicable

09:29:14.447 INFO AllelicCNV – Starting workflow for WW00246a…

09:29:14.448 INFO AllelicCNV – Loading input files…

09:29:28.875 INFO AllelicCNV – Number of input target-coverage segments: 3909

09:29:28.876 INFO AllelicCNV – Preparing target-coverage segments…

09:29:28.876 INFO AllelicCNV – Merging copy-neutral and uncalled target-coverage segments…

09:29:28.895 INFO AllelicCNV – Number of segments after copy-neutral merging: 1912

09:29:29.012 INFO AllelicCNV – Performing SNP segmentation…

09:29:29.012 INFO AllelicCNV – Performing initial SNP segmentation (assuming no allelic bias)…

09:33:25.324 INFO AllelicCNV – Initial number of SNP segments: 1049

09:33:25.329 INFO AllelicCNV – SNP-segmentation iteration: 1

09:33:29.235 INFO AlleleFractionInitializer – Initializing allele-fraction model. Iterating until log likelihood converges to within 0.500.

09:34:48.372 INFO AlleleFractionInitializer – Iteration 1, model log likelihood = -47452437.975.

09:35:59.436 INFO AlleleFractionInitializer – Iteration 2, model log likelihood = -47413435.463.

09:37:11.769 INFO AlleleFractionInitializer – Iteration 3, model log likelihood = -47412036.580.

09:38:10.942 INFO AlleleFractionInitializer – Iteration 4, model log likelihood = -47411969.803.

09:39:07.527 INFO AlleleFractionInitializer – Iteration 5, model log likelihood = -47411910.760.

09:40:04.582 INFO AlleleFractionInitializer – Iteration 6, model log likelihood = -47411910.695.

09:40:04.622 INFO AllelicCNV – Mean allelic bias: 1.015

09:43:08.080 INFO AllelicCNV – Number of SNP segments: 1045

09:43:08.081 INFO AllelicCNV – SNP-segmentation iteration: 2

09:43:15.685 INFO AlleleFractionInitializer – Initializing allele-fraction model. Iterating until log likelihood converges to within 0.500.

09:44:29.928 INFO AlleleFractionInitializer – Iteration 1, model log likelihood = -47452736.432.

09:45:46.193 INFO AlleleFractionInitializer – Iteration 2, model log likelihood = -47413750.568.

09:46:45.883 INFO AlleleFractionInitializer – Iteration 3, model log likelihood = -47412310.371.

09:48:07.005 INFO AlleleFractionInitializer – Iteration 4, model log likelihood = -47412253.590.

09:49:08.096 INFO AlleleFractionInitializer – Iteration 5, model log likelihood = -47412214.261.

09:50:06.338 INFO AlleleFractionInitializer – Iteration 6, model log likelihood = -47412214.186.

09:50:06.343 INFO AllelicCNV – Mean allelic bias: 1.015

09:50:39.741 INFO AllelicCNV – Shutting down engine

[August 22, 2017 9:50:39 AM UTC] org.broadinstitute.hellbender.tools.exome.AllelicCNV done. Elapsed time: 21.51 minutes.

Runtime.totalMemory()=1992294400

org.broadinstitute.hellbender.utils.R.RScriptExecutorException:

Rscript exited with 137

Command Line: Rscript -e tempLibDir = ‘/tmp/root/Rlib.5644378460105055694’;source(‘/tmp/root/CBS.3197508950061860503.R’); —args —sample_name=WW00246a —targets_file=/tmp/root/targets-from-snps6017543827668919816.tsv —output_file=/tmp/root/WW00246a-maf-cbs-tmp3563139348210305451.seg —log2_input=FALSE —min_width=2 —alpha=0.01 —nperm=10000 —pmethod=hybrid —kmax=25 —nmin=200 —eta=0.05 —trim=0.025 —undosplits=none —undoprune=0.05 —undoSD=3

Stdout: $sample_name

[1] “WW00246a”

$targets_file

[1] “/tmp/root/targets-from-snps6017543827668919816.tsv”

$output_file

[1] “/tmp/root/WW00246a-maf-cbs-tmp3563139348210305451.seg”

$log2_input

[1] “FALSE”

$min_width

[1] 2

$alpha

[1] 0.01

$nperm

[1] 10000

$pmethod

[1] “hybrid”

$kmax

[1] 25

$nmin

[1] 200

$eta

[1] 0.05

$trim

[1] 0.025

$undosplits

[1] “none”

$undoprune

[1] “0.05”

$undoSD

[1] 3

$help

[1] FALSE

Stderr: at org.broadinstitute.hellbender.utils.R.RScriptExecutor.exec(RScriptExecutor.java:163) at org.broadinstitute.hellbender.utils.segmenter.RCBSSegmenter.writeSegmentFile(RCBSSegmenter.java:114) at org.broadinstitute.hellbender.utils.segmenter.RCBSSegmenter.writeSegmentFile(RCBSSegmenter.java:130) at org.broadinstitute.hellbender.utils.segmenter.RCBSSegmenter.writeSegmentFile(RCBSSegmenter.java:126) at org.broadinstitute.hellbender.tools.exome.SNPSegmenter.writeSegmentFile(SNPSegmenter.java:63) at org.broadinstitute.hellbender.tools.exome.AllelicCNV.calculateAndWriteSNPSegmentation(AllelicCNV.java:407) at org.broadinstitute.hellbender.tools.exome.AllelicCNV.performSNPSegmentationStep(AllelicCNV.java:389) at org.broadinstitute.hellbender.tools.exome.AllelicCNV.runPipeline(AllelicCNV.java:300) at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:38) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:116) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:173) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:192) at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:131) at org.broadinstitute.hellbender.Main.mainEntry(Main.java:152) at org.broadinstitute.hellbender.Main.main(Main.java:233)

root@8e464076458d:/home/cnvs# R

R version 3.2.5 (2016-04-14) — “Very, Very Secure Dishes“

Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.

You are welcome to redistribute it under certain conditions.

Type ‘license()’ or ‘licence()’ for distribution details.

R is a collaborative project with many contributors.

Type ‘contributors()’ for more information and

‘citation()’ on how to cite R or R packages in publications.

Type ‘demo()’ for some demos, ‘help()’ for on-line help, or

‘help.start()’ for an HTML browser interface to help.

Type ‘q()’ to quit R.

```

From shlee on 2017-08-23

Hi @ameynert,

Our developer says that from the standard out:

> You see that at least one round of SNP segmentation, which is done via an R script that runs CBS, finishes successfully….The number of segments from both the copy-ratio and allele-count data is extremely high.

So we think something is up with your data. Have you any insight about your data that could explain this? Would it be possible for you to share your data so we can investigate further the cause of the error? Instructions for submitting data to our FTP site are in [Article#1894](https://software.broadinstitute.org/gatk/guide/article?id=1894).

From ameynert on 2017-08-29

Hi @shlee – It’s ovarian cancer data. Manta analysis shows very high levels of genomic instability. I’ll talk to the PI about sharing. I also should note that I only have 10 normal samples at the moment from which to construct a PON, and the somatic CNV analysis done with GATK 4 beta 3 is not very clean as a result. I am hoping that more normal samples will help with this problem (expecting 200 eventually). I did create a PON of 100 female samples from a population study with samples sequenced on the same machines, but the resulting CNV calls were considerably worse.

From shlee on 2017-08-29

Hi @ameynert,

Given these are human subject data, if you do get approval from your PI to share, I think the course of action is to share with us on FISMA-compliant [FireCloud](https://gatkforums.broadinstitute.org/firecloud/). Our developers are keen on understanding what is going on with your data.

Are you using normals for the PoN that use the same sample prep, e.g. capture kit? It is insufficient to have normals sequenced just from the same machines. I believe a lot of variance comes from sample preparation and this is the key covariate to control for in lieu of a large PoN. This is something I have asked our developers—the difference between a carefully selected and limited PoN-set and a larger PoN-set that adds more dimensions to the variance. The answer is that our algorithms are designed to account for added dimensions in the PCA projection. The thing to note is that there is a different linear range towards improvement depending on the capture kit. E.g. I have in my notes that for ICE capture 150 samples in the PoN is where improvement plateaus, versus for Agilent 300 samples in the PoN is where improvement plateaus (empirical observations in our hands) . Seems your 100-sample PoN falls short of this range and your choice is to (i) add more samples to the PoN (which you are planning on) or (ii) carefully select samples for a limited-PoN.

Like any experimental control, selecting samples for the PoN may be the most challenging part of the workflow. However, it is worth spending time on for the improved results it can give.

From shlee on 2017-09-06

Users have reported a bug in CombineReadCounts that occurs when attempting to combine the same sample sets from different batches that

> have the same target identification column, as produced by CalculateTargetCoverage (e.g. G7, A3)…. [T]he target identification column corresponds to the well name, and when combining samples from different batches those well names are…repeated.

The solution to this is to rename samples and @mcadosch gives a link in that shows one approach. Many thanks for their solution.

From ameynert on 2017-09-07

Hi @shlee, thanks for the detailed answer. The samples are WGS, but I agree the sample source seems to have a strong effect. I’m in the process of confirming with the PI about sharing the read counts and common het SNP info with you.

From liupeng2117 on 2017-10-16

Hi @shlee ,

I have been using GATK4 beta to all CNVs. At step 6, I am wondering what’s the threshold for CallSegments to make CNV calls? It seems that there is no document available for CallSegments.

Many thanks in advance.

From Sheila on 2017-10-18

@liupeng2117

Hi,

Soo Hee is away at a conference. I will have her or someone else on the team get back to you. It may take a few days.

-Sheila

From liupeng2117 on 2017-10-19

Hi @Sheila,

Thank you! Another question, I am using whole exome sequencing(WES) data to do CNV calling. For the target intervals list, shall we use each entire exon as an interval, or split the exon into, say <=100 bp regions as target intervals? Which is better and more powerful?

From slee on 2017-10-19

Hi @liupeng2117,

Thanks for your questions. The caller in CallSegments uses a relatively naive method to call events that is exactly ported from a previous algorithm developed by Broad CGA, ReCapSeg. If you’re interested in the details, we unfortunately don’t have convenient documentation, but you may want to look at the relevant source code: https://github.com/broadinstitute/gatk/blob/master/src/main/java/org/broadinstitute/hellbender/tools/exome/ReCapSegCaller.java You can see that the threshold for calling is based on taking the segments with mean log2 copy ratio below 0.1 as neutral and determining the standard deviation of the points in these segments.

For the upcoming release, we are making significant method-related and performance-related improvements to the CNV pipeline (including using a new segmentation algorithm, rather than CBS—-if you’re interested in many more details, you can see https://github.com/broadinstitute/gatk/issues/2858). These improvements will also include a more principled event caller. Stay tuned!

As for your second question, this new pipeline will include a simple tool called PreprocessIntervals that will do exactly what you ask—-you’ll be able to increase the resolution of your WES analyses by binning exons into smaller regions. We will run some internal evaluations to find an optimal binning size (since we will want to strike a balance between resolution and increasing the variance of the read count in each bin) and give some default recommendations, but this is ultimately a parameter that should be decided based on the particular goals of each analysis.

Right now, a version of the tool is already in the latest version of the GATK, but we have not yet integrated it with the rest of the new CNV pipeline. Notably, the new tool produces a Picard interval list. However, if you’d like to go ahead and try to use the tool with the current CNV pipeline, you can manually convert back to the target-file format expected by that pipeline.

Thanks,

Samuel

From liupeng2117 on 2017-10-23

Hi @slee, thanks for your detailed answer. Great to know a new version of the tool is releasing. Can’t wait to try the new pipeline!

From LanC on 2017-10-24

I tried to call somatic CNV for Mouse tumor cell line data with matched normal strain samples. The genetic background of these normal strain samples actually are much different. Can I summarized them to build PON although they are sequenced with the same technology and batch?

I think the genetic differences of normal samples might be captured as the sequencing errors and cause bias in the somatic CNV calling of the corresponding tumor cell lines. It is right? Is it better to use tumor Vs. matched normal pair to call CNV in this kind of case?

Does GATK also provide CNV calling functions for tumor – normal pairs ? If not, do you have any suggestion that how can I handle it? Many thanks!

From Sheila on 2017-10-26

@LanC

Hi,

I will confirm with the team and get back to you.

-Sheila

From zhengzha2000 on 2017-10-27

Hi,

I am using beta5 local jar from the download button on download page, but it seems CallCNLoHAndSplits is not in the release. Is it replaced Or it would be released in new beta version?

thanks.

From Sheila on 2017-10-30

@zhengzha2000

Hi,

I will check with the team and get back to you.

-Sheila

From Geraldine_VdAuwera on 2017-10-30

@zhengzha2000 Can you clarify where you saw a reference to the tool you’re asking about? I’m not familiar with it and it’s not referenced anywhere in this discussion. Are you sure of the spelling?

From shlee on 2017-10-30

Hi @LanC,

Are you saying the normals are rather different from each other or that the normals are different from your tumor cell line’s parental strain?

The somatic CNV PoN will remove outlier information and also try to average the information in the normals you give it. Your experimental design:

> call somatic CNV for Mouse tumor cell line data with matched normal strain samples. The genetic background of these normal strain samples actually are much different.

should give you CNV events that differ between the tumor and PoN normals. You may have to adjust settings. I think if the normals are wildly different from each other for certain regions, these regions may become unusable.

Could you please describe in more detail (or point us to some publications) the extent of the CNV differences in mouse strains? It would be great for our team to be aware of such intra-species differences so workflows can (ideally) have options that enable such research.

We are actually in the process of developing the exact functionality you are asking for that will allow you to compare just a matched pair.

- If you feel adventurous and would like to test this functionality out, then it is currently available in the GATK4 github repo on the development branch `sl_wgs_acnv`.

- – The relevant WDL scripts (containing commands and workflow structure) are also within the same development branch at .

Otherwise, we hope to have the repackaged somatic CNV workflow available at the earliest end of year.

From shlee on 2017-10-30

Hi @zhengzha2000,

CallCNLoHAndSplits was in the GATK4-alpha release.

We’ve moved on to the beta release and the tools and workflow differ. First, you run somatic CNV (as above), then run a separate ACNV workflow, which requires a matched Normal BAM.

In the near future, we will be integrating these two workflows for more modularity and so the tool names will likely change yet again.

In the meantime, the GATK4-beta tools of interest to you are `VcfToIntervalList`, `GetHetCoverage` OR `GetBayesianHetCoverage`, `AllelicCNV` and `PlotACNVResults`.

I wrote up a mini-tutorial for the September (2018) Helsinki workshop that outlines example commands of the ACNV workflow. You can find a link to workshop materials at . The folder you are interested in is dated `1709`. The mini-tutorial worksheet is in the data bundle `GATK4_AllelicCNV.tar.gz` and has one glaring omission. The AllelicCNV command is missing the `—useAllCopyRatioSegments` parameter. Also, if you need to iterate over parameters, for faster runs, you can tweak options in AllelicCNV, e.g. :

```

—maxNumIterationsSimSeg 2 \

—maxNumIterationsSNPSeg 2 \

—numIterationsSimSegPerFit 2

```

P.S. As I mention to @LanC above, you could try out the development version of the CNV workflow using `sl_wgs_acnv`. The commands are outlined in the WDL script.

From zhengzha2000 on 2017-11-08

@shlee:

Thanks for your answer. I will pay attention to next release and look into the presentation you mentioned.

Zheng

From johnhe on 2017-12-09

Hi,

I don’t know if anyone else had this issue, but I had to update the getoptURL and optparseURL in the install_R_packages.R script linked from this webpage before PerformSegmentation would work for me. Thought I would just share just in case anyone else ran into the same problem. Apologies if this has been discussed before.

Best,

John

From Sheila on 2017-12-14

@johnhe

Hi John,

Thanks for sharing.

-Sheila

From LanC on 2017-12-21

Hi @shlee

Many thanks for your answer and sorry for my late response. Yes what I mean is that the genetic background of the normal mouse strains are much different from each other. I agree with you that: “if the normals are wildly different from each other for certain regions, these regions may become unusable.

Sorry that I have not run the germline CNV identification on these normal strains so that have no accurate estimation of the CNV differences. However I searched one paper that investigated CNV on 351 mouse samples including many classical laboratory strains that we uses. Hope it is helpful.

“Locke, M. Elizabeth O., et al. “Genomic copy number variation in Mus musculus.” BMC genomics 16.1 (2015): 497.”

Many Thanks!

From shlee on 2017-12-22

Hi @LanC,

Thank you so much for the journal article. I am printing it right now to peruse. I think it will be of great interest to our developers as well. Thanks again and happy holidays.

From micknudsen on 2018-02-13

Are there any plans to update this tutorial? The tool `CalculateTargetCoverage` is missing in the final GATK4 release.

From shlee on 2018-02-13

Hi @micknudsen,

I’m just getting to the CNV tutorial now. I’ve just finished generating the tutorial data using the WDL scripts available in the [broadinstitute/gatk](https://github.com/broadinstitute/gatk) repository with the new tools and am turning to writing about each step. The tools and their features have changed. For example, instead of the deprecated `CalculateTargetCoverage`, you will want to use `CollectFragmentCounts` ([link to tooldoc](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_copynumber_CollectFragmentCounts.php)) which is categorized under ‘Coverage Analysis’.

I believe the tooldocs themselves are fairly comprehensive in explaining the workflow. Together with the pipeline scripts (written in WDL), you should be able to piece together the workflow.

If you are very eager to test out the new workflow, and absolutely need the tutorial data and example commands, I can make the data bundle and share it and also share a live draft of the tutorial as a Google Doc. In return, I would ask the favor of feedback. = )

From micknudsen on 2018-02-14

Hi @shlee,

I am currently in the process of untangling the WDL-workflows, and things are going well so far. If I run into troubles, I may ask for your tutorial draft (and happily provide feedback, of course).

Thanks,

Michael

From micknudsen on 2018-02-14

Can I see your draft? :)

From shlee on 2018-02-14

Sure @micknudsen.

https://docs.google.com/document/d/1AX-GDYb2HJB0I-wS_XJjQDfjeOaynu39dzG6SsaeSwo/edit?usp=sharing

It is just a skeleton of commands and is woefully incomplete. I'm not sure how helpful it will be. I'll update this particular Google Doc draft sporadically.

I also think what might be helpful to you is an input JSONs file for the WDL scripts you are studying. Most of the optional parameters need not be specified when you are starting out as you would be using default parameters. Since currently there are no example JSONs available (soon there should be one at https://github.com/gatk-workflows/gatk4-somatic-cnvs), I can share with you the ones I made to generate the tutorial data using GATK4.0.1.1. Here is for the cnvsomaticpair_workflow:

``` { "##COMMENT1:": "WORKFLOW STEP OPTIONS", "CNVSomaticPairWorkflow.gatkdocker": "broadinstitute/gatk:4.0.1.1", "CNVSomaticPairWorkflow.gatk4jaroverride": "/home/shlee/gatk-4.0.1.1/gatk-package-4.0.1.1-local.jar", "CNVSomaticPairWorkflow.isrunoncotator": "False", "CNVSomaticPairWorkflow.oncotator_docker": "broadinstitute/oncotator:1.9.3.0",

"##COMMENT2:": "DATA", "CNVSomaticPairWorkflow.reffasta": "/home/shlee/Documents/ref/hg38/GRCh38fullanalysissetplusdecoyhla.fa", "CNVSomaticPairWorkflow.reffastafai": "/home/shlee/Documents/ref/hg38/GRCh38fullanalysissetplusdecoyhla.fa.fai", "CNVSomaticPairWorkflow.reffastadict": "/home/shlee/Documents/ref/hg38/GRCh38fullanalysissetplusdecoyhla.dict",

"CNVSomaticPairWorkflow.readcountpon": "/home/shlee/Documents/cnv180207/cnvponC.pon.hdf5", "CNVSomaticPairWorkflow.intervals": "/home/shlee/Documents/cnv180207/intervals/targetsC.intervallist",

"CNVSomaticPairWorkflow.commonsites": "/home/shlee/Documents/cnv180207/intervals/thetabiallelicsnpsagilentintervals.intervallist", "CNVSomaticPairWorkflow.tumorbam": "/home/shlee/Documents/hcc/hcc1143Tclean.bam", "CNVSomaticPairWorkflow.tumorbamidx": "/home/shlee/Documents/hcc/hcc1143Tclean.bai", "CNVSomaticPairWorkflow.normalbam": "/home/shlee/Documents/hcc/hcc1143Nclean.bam", "CNVSomaticPairWorkflow.normalbamidx": "/home/shlee/Documents/hcc/hcc1143N_clean.bai",

"##_COMMENT3:": "ANALYSIS PARAMETERS",

"##(optional)binlengthdefaultof1000isappropriateforWGS,setto0forWES": "", "CNVSomaticPairWorkflow.binlength": "0" } ```

From micknudsen on 2018-02-14

Thanks! B)

From tushardave26 on 2018-02-27

@shlee – thanks for the sample JSON file. I do have a question for “CNVSomaticPairWorkflow.common_sites”. Is there any tutorial which explains how to make common sites file for the analysis. I followed the tutorial explained [here](https://gatkforums.broadinstitute.org/gatk/discussion/7812/creating-a-list-of-common-snps-for-use-with-getbayesianhetcoverage “here”). Unfortunately, the code produces the interval files but with no variants. Can you provide your file? Thanks.

From shlee on 2018-02-27

@tushardave26, the `common_sites` file I prepped for use in the tutorial is appropriate for the exome capture data I am using, mapped to GRCh38. You’ll want to make your own for the type of coverage you expect and the reference you are using. You can start with the Mutect2 germline af resource available in the [GATK bundle](https://software.broadinstitute.org/gatk/download/bundle) and subset common biallelic variants with SelectVariants. You can additionally apply an intervals list to subset the variants further.

From pdu on 2018-02-28

Hi,

I’m trying to use GATK-4.0.1.2 to call SCNVs using in vitro murine exome data. Since our exome data is from in vitro cultured cells, I expect that the noise level compared to in vivo tumors to be much lower. Therefore, I have not used a panel of normals in my analysis and have instead opted to compare only to a matched normal tissue sample.

I have been using the workflow in the google doc that @shlee posted above:

1. CollectFragmentCounts for tumor and normal

2. CollectAllelicCounts for tumor and normal

3. DenoiseReadCounts for CollectFragmentCounts output

4. ModelSegments with tumor allelic counts, tumor denoised copy ratios, and normal allelic counts

I’m running into some trouble using ModelSegments. The program runs to completion, but in the error logs, it appears that the models are not a good fit for the data.

When it tries to fit the allele-fraction model, the model log likelihood hovers around -430000, without much change after several iterations. Eventually, the modeler exits because the maximum-likelihood estimate approaches 0.5, and the program recommends that I change the parameters for filtering homozygous sites. Here is an example error output for the allele-fraction modeler:

```

11:37:59.920 INFO MultidimensionalModeller – Fitting allele-fraction model…

11:49:33.812 INFO AlleleFractionInitializer – Initializing allele-fraction model, iterating until log likelihood converges to within 0.500000…

11:49:34.098 INFO AlleleFractionInitializer – Iteration 1, model log likelihood = -430641.392238…

11:49:34.111 INFO AlleleFractionInitializer – AlleleFractionGlobalParameters{meanBias=1.038076, biasVariance=0.337363, outlierProbability=0.147882}

11:49:34.276 INFO AlleleFractionInitializer – Iteration 2, model log likelihood = -430490.985227…

11:49:34.276 INFO AlleleFractionInitializer – AlleleFractionGlobalParameters{meanBias=1.256389, biasVariance=0.210133, outlierProbability=0.147882}

11:49:34.404 INFO AlleleFractionInitializer – Iteration 3, model log likelihood = -430476.106619…

11:49:34.404 INFO AlleleFractionInitializer – AlleleFractionGlobalParameters{meanBias=1.305350, biasVariance=0.243663, outlierProbability=0.147882}

11:49:34.535 INFO AlleleFractionInitializer – Iteration 4, model log likelihood = -430466.256679…

11:49:34.535 INFO AlleleFractionInitializer – AlleleFractionGlobalParameters{meanBias=1.347862, biasVariance=0.271174, outlierProbability=0.147882}

11:49:34.668 INFO AlleleFractionInitializer – Iteration 5, model log likelihood = -430458.344358…

11:49:34.669 INFO AlleleFractionInitializer – AlleleFractionGlobalParameters{meanBias=1.383601, biasVariance=0.301706, outlierProbability=0.147882}

11:49:34.796 INFO AlleleFractionInitializer – Iteration 6, model log likelihood = -430450.787700…

11:49:34.796 INFO AlleleFractionInitializer – AlleleFractionGlobalParameters{meanBias=1.420590, biasVariance=0.332392, outlierProbability=0.147882}

11:49:34.922 INFO AlleleFractionInitializer – Iteration 7, model log likelihood = -430443.858371…

11:49:34.922 INFO AlleleFractionInitializer – AlleleFractionGlobalParameters{meanBias=1.458092, biasVariance=0.363497, outlierProbability=0.147882}

11:49:35.051 INFO AlleleFractionInitializer – Iteration 8, model log likelihood = -430437.980551…

11:49:35.051 INFO AlleleFractionInitializer – AlleleFractionGlobalParameters{meanBias=1.494063, biasVariance=0.393534, outlierProbability=0.147882}

11:49:35.181 INFO AlleleFractionInitializer – Iteration 9, model log likelihood = -430432.981572…

11:49:35.181 INFO AlleleFractionInitializer – AlleleFractionGlobalParameters{meanBias=1.527987, biasVariance=0.423246, outlierProbability=0.147882}

11:49:35.313 INFO AlleleFractionInitializer – Iteration 10, model log likelihood = -430428.734241…

11:49:35.313 INFO AlleleFractionInitializer – AlleleFractionGlobalParameters{meanBias=1.559930, biasVariance=0.452381, outlierProbability=0.147882}

11:49:35.449 INFO AlleleFractionInitializer – Iteration 11, model log likelihood = -430425.055409…

11:49:35.449 INFO AlleleFractionInitializer – AlleleFractionGlobalParameters{meanBias=1.589443, biasVariance=0.480186, outlierProbability=0.145378}

11:49:35.581 INFO AlleleFractionInitializer – Iteration 12, model log likelihood = -430421.815571…

11:49:35.581 INFO AlleleFractionInitializer – AlleleFractionGlobalParameters{meanBias=1.616958, biasVariance=0.497862, outlierProbability=0.132941}

11:49:35.710 INFO AlleleFractionInitializer – Iteration 13, model log likelihood = -430421.147517…

11:49:35.710 INFO AlleleFractionInitializer – AlleleFractionGlobalParameters{meanBias=1.635304, biasVariance=0.497862, outlierProbability=0.126075}

11:49:35.841 INFO AlleleFractionInitializer – Iteration 14, model log likelihood = -430421.034548…

11:49:35.841 INFO AlleleFractionInitializer – AlleleFractionGlobalParameters{meanBias=1.642114, biasVariance=0.497862, outlierProbability=0.128001}

11:49:35.842 WARN AlleleFractionInitializer – The maximum-likelihood estimate for the global parameter AF_reference_bias_variance (0.497862) was near its boundary (0.500000), the model is likely not a good fit to the data! Consider changing parameters for filtering homozygous sites.

```

Does this mean that I should raise —genotyping-homozygous-log-ratio-threshold to include more heterozygous sites called? Or is there another problem that I should be aware of?

Thanks,

Peter

From shlee on 2018-02-28

Hi @pdu,

Can you please share with us the ModelSegments command that is causing the WARN? In the meanwhile, I will consult our developer.

From pdu on 2018-02-28

@shlee Thanks for getting back to me so quickly.

I think the WARN is caused by MultidimensionalModeller using AlleleFractionInitializer – unless you are asking for something else.

The parameters for running the ModelSegments are the same as the ones in the google doc. The initial heterozygous site detection, MultidimensionalKernelSegmenter, and GibbsSampler commands run fine.

From shlee on 2018-02-28

@pdu, while we wait for our developer, can you try running your command without the allelic input as shown below:

```

gatk —java-options “-Xmx10000m” ModelSegments \ —denoised-copy-ratios hcc1143_T_clean.denoisedCR.tsv \ —output out \ —output-prefix hcc1143_T_clean

```

Thanks.

From pdu on 2018-03-01

@shlee, I have run ModelSegments with only the denoised copy ratio .tsv file, and the output did not throw any WARNs. I’m trying to generate plots using PlotModeledSegments, but I am not sure how to generate a .dict file of the contigs I want to plot. Can you point me to a tool that I can use to make the .dict file?

From shlee on 2018-03-02

@pdu, you can use the dictionary from the reference set you used to originally align the data. You can limit the plotting to the larger contigs using the `—minimum-contig-length` argument. For example, `—minimum-contig-length 46709983` will limit GRCh38 contigs to 24 chromosomes.

If you need to generate the dictionary from the FASTA reference, use [CreateSequenceDictionary](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_sam_CreateSequenceDictionary.php).

From pdu on 2018-03-03

@shlee, I plotted the modeled segments generated by ModelSegments with only the denoised copy ratios as input and got the following plot.

![](https://us.v-cdn.net/5019796/uploads/editor/ff/ulrzcodclvqx.png “”)

Looking back, I think I did not denoise correctly. The command I used was

```

gatk DenoiseReadCounts \ -I counts.tsv \ —standardized-copy-ratios sample.standardizedCR.tsv \ —denoised-copy-ratios sample.denoisedCR.tsv

```

How can I perform the normalization shown in the middle panel of section 7 in this tutorial?

From pdu on 2018-03-04

@shlee, I denoised using a PoN created by inputting my normal file into CreateReadCountPanelOfNormals, and now my plots look much better.

![](https://us.v-cdn.net/5019796/uploads/editor/hs/5iwgqy1ckqcn.png “”)

Please let me know when you hear back from the developers about using the allelic inputs.

From shlee on 2018-03-05

Hi @pdu,

I’m glad that tidbit that you can denoise against the normal-only was helpful to you.

Now that you are actually denoising against a control, I think your ModelSegments (which takes in denoised data) should run without error. There are considerations for the sites at which to collect allelic counts that I am looking into currently. Do you have a specific question on how to collect allelic counts or was the error your primary concern? If the latter, please see how the denoised data performs.

From tedtoal on 2018-03-06

I’m curious about tumor purity/contamination by normal. I would think that would be a big factor in accurately calling CNV, yet I’ve seen no mention of it in the GATKCNV documentation. Why is that?

From shlee on 2018-03-07

Hi @tedtoal,

The complexity that tumor purity/heterogeneity brings is the reason why the workflow calls copy ratios and not absolute copy numbers.

From shlee on 2018-03-09

Hi @tushardave26,

I've prepared two slightly different sites-only germline SNPs VCF based on gnomAD http://gnomad.broadinstitute.org/ and placed it on our FTP server:

gnomadgrch38snps_sitesonly.vcf.gz

af-gnomadgrch38snps.vcf.gz

These are provided as-is for use with the CNV workflow, with no guarantees. The af-gnomad_grch38_snps.vcf.gz version allows subsetting based on population allele frequencies.

I subset all SNPs-only sites from the Mutect2 resource file described in footnotes [3] and [4] of Tutorial#11136. When I say all SNPs, I mean I include non-biallelic SNP sites. The subsetting excludes any site with mixed type variants, e.g. SNP and indel, as you would not want to use such sites for allelic CNV. To minimize file size, I removed all extraneous information such that records have information in only four or five columns, columns, like so:

chr1 143931334 . A G,T . . .

and

chr1 143931334 . A G,T . . AC=15,7;AF=0.0003961,0.0001756

If you are analyzing exome capture data, you might consider subsetting these records to your padded intervals regions, to further minimize file size by additionally include the padded intervals to the CollectAllelicCounts command, e.g. -L population.vcf.gz -L padded.interval_list --interval-set-rule INTERSECTION. In the end, the workflow uses only those variants that overlap with copy ratio data.

Finally, each resource file contains 240,555,518 records, for the same number of unique sites.

P.S. I amended my description on 3/9/2018, to include a version of the resource containing population allele frequencies.

From slee on 2018-03-09

Hi @pdu,

With those properly denoised copy ratios, my guess is that you will not run into the warning you previously encountered. Most likely your noisy copy ratios were affecting the joint copy-ratio—allele-fraction segmentation, which was in turn affecting the fit of the allele-fraction model. Can you try running again (and perhaps plotting with PlotModeledSegments) to see if your results look reasonable now?

From pdu on 2018-03-09

Hi @slee,

I tried running ModelSegments again with the new denoised copy ratios, and I’m getting the same error output with AlleleFractionInitializer:

```

06:26:17.961 INFO AlleleFractionInitializer – Iteration 7, model log likelihood = -369270.577908…

06:26:17.961 INFO AlleleFractionInitializer – AlleleFractionGlobalParameters{meanBias=1.615908, biasVariance=0.498201, outlierProbability=0.148022}

06:26:17.961 WARN AlleleFractionInitializer – The maximum-likelihood estimate for the global parameter AF_reference_bias_variance (0.498201) was near its boundary (0.500000), the model is likely not a good fit to the data! Consider changing parameters for filtering homozygous sites.

06:26:17.961 WARN AlleleFractionInitializer – The maximum-likelihood estimate for the global parameter AF_outlier_probability (0.148022) was near its boundary (0.150000), the model is likely not a good fit to the data! Consider changing parameters for filtering homozygous sites.

```

I tried plotting the resulting modeled segments, and I get the following plot:

![](https://us.v-cdn.net/5019796/uploads/editor/zu/jnfwjmr2i70x.png “”)

From slee on 2018-03-09

@pdu, it looks like you do not have that much allele-fraction data to work with. How large is your list of common sites? What is your average depth of coverage, and how many het sites do you end up with in your *hets.tsv file? (There is a ModelSegments parameter `minimum-total-allele-count` that filters out sites based on their total count, since we typically need total counts above ~20-30 to get good binomial statistics. So if your depth is low, you may be not have enough sufficiently covered sites to do a good allele-fraction analysis.)

From pdu on 2018-03-09

@slee,

I am not sure where to find my list of common sites, but I have around 1850 het sites in my hets.tsv file. Based on Picard HS metrics, I have an average bait coverage of 100, and an average target coverage of ~65.

From slee on 2018-03-09

@pdu,

OK, your coverage seems fine, but that is not very many sites. Typically we can expect around ~20k in an exome.

The common sites list is what you provided to CollectAllelicCounts via `-L`. This should be any input compatible with `-L` (e.g., BED file, Picard/GATK interval list, VCF, etc.) that specifies sites of common variation—-typically using an allele-frequency cutoff of around 10% is sufficient. It’s possible that you were using a version of this list that was prepared for the tutorial and did not contain very many sites.

@shlee is currently putting together a version of the gnomAD list she posted above that will also include the allele frequencies, which you can then filter to produce a common-sites list with your desired cutoff.

From shlee on 2018-03-09

A version of the resource containing population allele frequencies is now available and described above.

From amjadd on 2018-04-04

Is the 39 blood PON available to download somewhere?

From shlee on 2018-04-04

Hi @amjadd,

It’s not clear what PoN you are referring to. Also, we highly recommend you create your own PoN to match to your own case sample.

From Chatchawit on 2018-05-16

I suggest you should inform readers about the best-so-far version in the first page. The important information is the GATK version and the corresponding document (steps) to accomplish this pipeline. I think the best-so-far version is probably GATK 4.0.0.0 and your document in Google Drive. I found that “CollectFragmentCounts” is missing in GATK 4.0.4.0 but can be found in GATK 4.0.0.0. Thank you.

From shlee on 2018-05-16

Hi @Chatchawit,

Are you referring to the recent CNV tutorials using v4.0.1.1? These have been under review since May 2 and are posted at:

- https://software.broadinstitute.org/gatk/documentation/article?id=11682

- https://software.broadinstitute.org/gatk/documentation/article?id=11683

Unfortunately, our reviewers have been too busy to make even a single comment as of yet! Thanks for suggesting these be posted to the top of this now outdated tutorial. I will do so.

From Begali on 2018-08-16

shlee Hi

Geraldine_VdAuwera

I have question I can call copy number variants without matched normal BAM file which I do not have

I did first steps PreprocessIntervals &CollectReadCounts for my 33 sample cfdna

based on info here https://software.broadinstitute.org/gatk/documentation/article?id=11682

but when I applied this CreateReadCountPanelOfNormals

gives me error sample not matched account for all without exception

how can generate cnv.pon then ,,,

with best regards

From Sheila on 2018-08-22

@Begali

Hi,

Can you post the exact command you ran and the log output with error message?

Thanks,

Sheila

From Begali on 2018-08-23

@Sheila

first:

java -jar gatk-package-4.0.6.0-local.jar PreprocessIntervals -L annonated_gene109_17.vcf -R hg38.fa —bin-length 0 —interval-merging-rule OVERLAPPING_ONLY -O interval109_17_list

2nd:

java -jar gatk-package-4.0.6.0-local.jar CollectFragmentCounts -I recal109_17.bam -L targets_C.preprocessed.interval_list —interval-merging-rule OVERLAPPING_ONLY -O tumor109_17.counts.hdf5

B) here at least three samples should apply

gatk —java-options “-Xmx6500m” CreateReadCountPanelOfNormals \ -I tumor109_17.counts.hdf5 -I tumor30_17I.counts.hdf5 …….. —minimum-interval-median-percentile 5.0 -O cnvZ.pon.hdf5

error message

java.lang.IllegalArgumentException: Intervals for read-counts file tumor30_17I.counts.hdf5 do not match those in other read-counts files. at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:724) at org.broadinstitute.hellbender.tools.copynumber.CreateReadCountPanelOfNormals.constructReadCountMatrix(CreateReadCountPanelOfNormals.java:328) at org.broadinstitute.hellbender.tools.copynumber.CreateReadCountPanelOfNormals.runPipeline(CreateReadCountPanelOfNormals.java:287) at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:30) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:137) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:182) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:201) at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160) at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203) at org.broadinstitute.hellbender.Main.main(Main.java:289)

18/08/23 09:43:17 INFO ShutdownHookManager: Shutdown hook called

18/08/23 09:43:17 INFO ShutdownHookManager: Deleting directory /tmp/pathology/spark-9712c6b4-ebdd-479b-ae45-4cf9994ff635

how to fix do not match moreover all samples belong to cysts fluid different patients

with best regards

From Sheila on 2018-09-04

@Begali

Hi,

Did you use the exact same interval list when running on each of the samples?

-Sheila

From Begali on 2019-04-21

@Sheila

May I will clarify some issues which I would like to receive your hints about that please

the samples for different patients and for tumor only, normal.bam files are not available at all

Can I call somatic or Germline CNV only using tumor samples, only tumor bam files ????

and those bam files also for different diseases and all Fastq files areprepration at wet lab using NGS

with best regards

Report abuse