Germline copy number variant discovery (CNVs)

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019.

created by Geraldine_VdAuwera

on 2018-01-07

Purpose

Identify germline copy number variants.

Diagram is not available

Reference implementation is not available

This workflow is in development; detailed documentation will be made available when the workflow is considered fully released.

Updated on 2018-01-09

From mglclinical on 2018-02-05

Hi @Geraldine_VdAuwera ,

I am under the impression that GATK4 can be used to detect SVs (structural variations) or CNVs (copy number variations) in germline samples from exome sequencing. Please correct me if my understanding is wrong.

Is there a GATK4 reference implementation for CNV detection in germline samples from exome sequencing?

From Geraldine_VdAuwera on 2018-02-06

Hi @mglclinical, yes we have pipelines in development for this. The germline CNV pipeline (for which this doc is a placeholder) is close to being in a releasable state. The SV pipeline is going to take a few more months, I believe.

From mglclinical on 2018-02-06

Thank you @Geraldine_VdAuwera

From KMS_Meltzy on 2018-03-05

Hello,

I was hoping to run the germline CNV pipeline, but I got stuck at the DetermineGermlineContigPloidy step. I am not sure what the best way is to generate the inferred ploidy model for CASE runs.

Thanks in advance!

From Sheila on 2018-03-09

@KMS_Meltzy

Hi,

Let me have someone on the team get back to you soon.

-Sheila

From shlee on 2018-03-09

Hi @KMS_Meltzy,

This workflow is under development and I am not altogether familiar with it. I think you might find https://github.com/broadinstitute/gatk/tree/master/scripts/cnv_wdl/germline helpful. Note that some of the workflow components are shared in a separate script called `cnv_common_tasks.wdl`.

We have some germline CNV resource files available in the GATK Resource Bundle, e.g., `grch37_germline_CN_priors.tsv`, that were used with the GATK4.beta version of the tools.

From hexy on 2018-04-08

Hi @Geraldine_VdAuwera, your slides showed that GATK4 can be used to detect germline CNVs, but I cannot find the best practices doc. Would you please tell me where to find it?

From Sheila on 2018-04-09

@hexy

Hi,

The germline CNV documentation is not yet ready. We hope to have some out within a month or two. If you search the forum for “germline CNV” you should get some helpful threads/docs.

-Sheila

From hexy on 2018-04-11

@Sheila

Hi, thanks! Hope to see that soon and would you please upload the test data of GATK4 to the ftp server?

From Sheila on 2018-04-16

@hexy

Hi,

>would you please upload the test data of GATK4 to the ftp server?

I am not sure which test data you are referring to?

-Sheila

From mglclinical on 2018-05-07

Hi Geraldine_VdAuwera and Sheila ,

I want to ask a question about GATK4’s ability to detect SVs or CNVs (copy number variations) in germline samples. I know that the best practices for this task are still under development. My question is:

We have a cell line that contains a single-exon deletion in the MECP2 gene, validated by MLPA. The cell line was exome sequenced and analyzed with tools like xhmm, and the single-exon deletion was not detected. I guess xhmm cannot detect this deletion either because it spans just one exon or because my sample size was too small (11 samples).

Does GATK’s germline CNV detection tool suffer from the same problem?

Thanks,

mglclinical

From Sheila on 2018-05-09

@mglclinical

Hi mglclinical,

I know the tools are still in beta, but the team has said our workflow performs better than xhmm :smile:

That said, perhaps it would be nice if you could test the workflow out yourself and report back to us. [This thread](https://gatkforums.broadinstitute.org/wdl/discussion/comment/47511) will help with some details. Another user has also reported that our workflow performs well. You may also find the poster presented at AACR helpful [here](https://drive.google.com/drive/folders/1XDXlck-El8uZ7e60ANDSHycEpkgo6NoP).

-Sheila

From stefstef on 2018-06-12

Hi guys,

Just a quick question: if I just wanted to test out gCNV, do I need to have run BQSR on my BAM files?

Thanks

Stef

From shlee on 2018-06-13

Hi @stefstef,

The quick answer is no. gCNV coverage collection is the same as for the somatic workflow. In terms of qualities, CollectReadCounts only takes into consideration mapping quality. The read filters are

> MappedReadFilter, MappingQualityReadFilter, NonZeroReferenceLengthAlignmentReadFilter, NotDuplicateReadFilter, WellformedReadFilter

You can read about each in the [Tool Docs](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/), under Read Filters.

From manolis on 2018-07-27

Hi, does anyone have an unofficial pipeline? I would like to start testing CNV discovery with GATK4.

Many thanks

From shlee on 2018-07-27

Hi @manolis,

Please check out the gatk GitHub repository scripts folder at https://github.com/broadinstitute/gatk/tree/master/scripts/cnv_wdl. All of the new workflows come with versioned WDL scripts including the gCNV and CNV workflows.

From alphahmed on 2018-08-11

Hi,

Thank you for all the time put in supporting GATK users.

I’ve been trying to complete an analysis of several whole-genomes germline CNVs on our local servers.

I am now at the step of running GermlineCNVCaller in cohort mode on 30 “normal genomes”, providing annotated intervals for denoising, but this seems to require a humongous amount of RAM.

It utilizes around 500GB of memory, just to complete 10% of the initial denoising iterations.

Is there any way I can divide the job into smaller chunks and then recombine the output to finally get the GermlineCNV model?

The use of tar compression and decompression in the cohort and sample modes of the GitHub WDL pipeline is a bit confusing.

I do appreciate all awesome efforts put into this tool and GATK4; but this tool has been in beta for several months now and I can’t wait anymore to make use of this gCNV model.

Ahmed

From alphahmed on 2018-08-11

> Is there any way I can divide the job into smaller chunks and then recombine the output to finally get the GermlineCNV model?

By “smaller chunks” I meant fewer cases in the cohort, with the results then combined to make the model.

From slee on 2018-08-14

Hi @alphahmed,

We typically scatter across genomic chunks, not chunks of samples. If you study the WDL, you’ll see that this is accomplished by using the ScatterIntervals task to break the intervals for coverage collection into chunks containing an equal number of intervals.
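To make the idea of equal-sized genomic chunks concrete, here is a minimal Python sketch. It is not the WDL's actual ScatterIntervals task. It assumes a Picard-style interval list in which header lines start with `@`, and the function name and demo records are hypothetical.

```python
from typing import List


def split_interval_list(lines: List[str], n_chunks: int) -> List[List[str]]:
    """Split interval_list lines into up to n_chunks shards of near-equal size.

    Header lines (starting with '@') are repeated in every shard so that each
    shard is a complete interval list on its own.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    header = [ln for ln in lines if ln.startswith("@")]
    records = [ln for ln in lines if ln.strip() and not ln.startswith("@")]
    base, extra = divmod(len(records), n_chunks)
    shards: List[List[str]] = []
    start = 0
    for i in range(n_chunks):
        # Spread the remainder over the first shards so sizes differ by at most one.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # fewer intervals than requested chunks
        shards.append(header + records[start:start + size])
        start += size
    return shards


# Hypothetical five-interval list with a two-line header.
demo = ["@HD\tVN:1.6", "@SQ\tSN:1\tLN:1000"] + [
    f"1\t{s}\t{s + 99}\t+\ttarget_{i}" for i, s in enumerate(range(1, 500, 100))
]
for n, shard in enumerate(split_interval_list(demo, 2)):
    n_records = len([ln for ln in shard if not ln.startswith("@")])
    print(f"shard {n}: {n_records} intervals")
```

Each shard keeps the original interval order, so per-shard results can later be put back together in genomic order.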

The tar compression/decompression is admittedly a little confusing, but it is required to package up the results from each chunk into a single file when running the WDL on the cloud.

Thanks for your patience. The gCNV model and inference schemes are both relatively sophisticated in comparison to similar tools/methods, so we’re still subjecting the pipeline to rigorous testing and benchmarking. We are hoping to take it out of beta and publish a paper on the model/methods in the coming months.

Hope this helps,

Samuel

From alphahmed on 2018-08-15

Thank you Samuel!

After running the wdl locally, I got only one tar file off the GermlineCNVCaller cohort-mode as a gcnv_model, but the case-mode is requiring an array of gcnv_model_tars (Array[File]). I tried to untar it and just provide it as a directory input for GermlineCNVCaller case-mode, but it didn’t work.

That’s the main reason why I am now trying to run the whole cohort without ScatterIntervals on a large server, hoping that the model output would be acceptable by the case-mode without any tarring.

I’m patiently waiting for your final release and a published paper. I believe this model will be among the best CNV models for short-reads, if not the most reliable one.

Ahmed

From slee on 2018-08-20

@alphahmed if you do not scatter across genomic chunks, then you will only have a single model tar file covering the entire genome, which you should be able to use as input to the case-mode WDL. gcnv_model_tars will then be an array with only a single element.
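As a small illustration of the single-element array, the sketch below builds the relevant fragment of a JSON inputs file in Python. The workflow name `CNVGermlineCaseWorkflow` and the tar file name are assumptions made for this example; use the names from the WDL you are actually running.

```python
import json
from typing import Dict, List, Sequence, Union


def case_model_inputs(
    model_tars: Union[str, Sequence[str]],
    workflow: str = "CNVGermlineCaseWorkflow",  # assumed workflow name
) -> Dict[str, List[str]]:
    """Return the gcnv_model_tars entry, always as a JSON-ready list."""
    if isinstance(model_tars, str):
        # A single unscattered model is still passed as an array of one file.
        model_tars = [model_tars]
    if not model_tars:
        raise ValueError("at least one model tar is required")
    return {f"{workflow}.gcnv_model_tars": list(model_tars)}


# Hypothetical tar produced by an unscattered cohort-mode run.
print(json.dumps(case_model_inputs("cohort-gcnv-model.tar.gz"), indent=2))
```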

Thanks,

Samuel

From alphahmed on 2018-09-27

Is the recommended number of normal cohort bams still 30? What would be the effect of using a smaller number; provided, of course, that they were done under the same experimental parameters?

From shlee on 2018-09-27

Hi @alphahmed,

Can you point me to where our documentation says the number of normal BAMs should be 30? I believe the developer of the gCNV workflow recommends 100 high-coverage WGS BAMs for guaranteed great results. That being said, I’m developing a tutorial that uses 24 samples, which is fewer than this recommended number because these are all the WGS samples I can get my hands on. And although I haven’t performed any comparisons (as the tutorial is about illustrations), concordance with a Phase 3 1000 Genomes Project SV callset seems decent at a glance.

From alphahmed on 2018-09-28

Hi @shlee

The [GermlineCNVCaller documentation](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_copynumber_GermlineCNVCaller.php "GermlineCNVCaller documentation") states: “For WES and WGS samples, we recommend including at least 30 samples.”

I look forward to seeing the final results of your illustration tutorial; getting concordant results with just 24 samples is really impressive!! Meanwhile, could you please let me know:

- Are you using the default parameters? If not, which parameters did you find needed the most tweaking?

- Are you using the WDL pipeline that was updated a few days ago on [GitHub](https://github.com/broadinstitute/gatk/tree/master/scripts/cnv_wdl/germline "github")? I know this question is more about the basics of WDL input formats, but how do you define the input BAMs and BAIs within the `Array[String]+` field? I’ve tried different approaches, including changing it to `[read_lines(normal_bam_list)]` with lists of file locations, but kept getting errors along the lines of “No coercion defined….”

Thank you!

From shlee on 2018-09-28

Thanks for the link @alphahmed. I believe then 30 is the minimum number of samples one should start with.

> Are you using the default parameters? If not, which parameters did you find needed the most tweaking?

I am indeed using default tool parameters. I was asked to use WGS data and to use default parameters for the tutorial. As these workflows are still under BETA status, they are still being tuned. I am aware of current efforts to fine-tune recommended parameters for WGS. Until the tool documentation is updated with new recommendations, the bandwidth I have as a technical writer allows me to test out some parameters towards describing them more clearly if there is a need. So if there are points in the tool documentation that you think could use clarification or illustration, please let us know.

> Are you using the WDL pipeline that was updated a few days ago on GitHub?

As it stands for GCNV tutorial development, most of my efforts have been toward scripting and testing for small tutorial data, in accordance with having a small dataset that can run on a laptop for workshop hands-on tutorials. So the WDL pipelines in the github repo do not apply well to the test cases I am developing. I am aware of the updates to the WDL scripts in the repository and have asked the developers if they prefer we update the version of GATK that the tutorial uses and if there are changes in the WDL pipeline that the tutorial should incorporate. Given the tutorial is meant to be illustrative, the response has been nay. One could ask whether tutorials should highlight steps in the WDL workflow much like I do [here](https://gatkforums.broadinstitute.org/gatk/discussion/7899/reference-implementation-pairedendsinglesamplewf-pipeline); however, given even this _production_-level reference implementation has been changed multiple times soon after being written, it seems the communication team’s efforts are better spent focusing on _illustrative_ tutorials (the How to tutorials), especially for our BETA status workflows. Also, different researchers use different pipelining approaches and our tutorials are meant to be agnostic towards these, to enable every approach.

> how do you define the input BAMs and BAIs within the `Array[String]+` field? I’ve tried different approaches, including changing it to `[read_lines(normal_bam_list)]` with lists of file locations, but kept getting errors along the lines of “No coercion defined….”

We have a repository, [gatk-workflows](https://github.com/gatk-workflows), that provides tried-and-tested WDL scripts and example JSON inputs files filled out with publicly accessible test data. The GCNV workflow isn’t one of the showcased workflows yet, but you can peruse the different workflow JSON inputs files to get an idea of how they are filled out. I am certain one of these illustrates an `Array[String]+` field. Otherwise, you can post to https://gatkforums.broadinstitute.org/wdl/discussions and get help from those who actually develop the features of WDL. You can also check whether the WDL specification provides an example. I highly recommend posting this part of your question to the WDL forum, again at https://gatkforums.broadinstitute.org/wdl/discussions.

P.S. I will see if the developers who have been updating the GCNV wdls can help here.

From asmirnov on 2018-09-28

Hi @alphahmed! I’m one of the gCNV developers.

> Are you using the default parameters? If not, which parameters did you find needed the most tweaking?

We are using the default parameters for the most part, except for `gcnv_sample_psi_scale` and `gcnv_interval_psi_scale`, for both of which we found `0.01` to be a good value. In general, we found that decreasing `gcnv_interval_psi_scale` increases specificity (though sensitivity might suffer a little).

A few other parameters to play around with are `p-active` (roughly corresponds to the probability of multiallelic loci), `p-alt` (probability of non-reference copy number), and `cnv-coherence-length` and `class-coherence-length` (which relate to the average length of CNV events and of multiallelic regions, respectively).
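For anyone running the tool directly rather than through the WDL, the following Python sketch shows one way to layer such overrides onto a base argument list. The flag spellings `--sample-psi-scale` and `--interval-psi-scale` are assumed to be the tool-level counterparts of the WDL inputs named above; check them against the GermlineCNVCaller tool documentation for your GATK version. Only the `0.01` values come from this thread.

```python
from typing import Dict, List

# Values reported in this thread; every other parameter stays at the tool default.
REPORTED_SETTINGS: Dict[str, str] = {
    "--sample-psi-scale": "0.01",    # assumed flag for gcnv_sample_psi_scale
    "--interval-psi-scale": "0.01",  # assumed flag for gcnv_interval_psi_scale
}


def apply_settings(base_args: List[str], settings: Dict[str, str]) -> List[str]:
    """Return a copy of base_args with each flag set to the given value.

    A flag already present has its value replaced; otherwise the flag and
    value are appended. The input list is left unmodified.
    """
    args = list(base_args)
    for flag, value in settings.items():
        if flag in args:
            idx = args.index(flag)
            if idx + 1 >= len(args):
                raise ValueError(f"{flag} is present but has no value")
            args[idx + 1] = value
        else:
            args.extend([flag, value])
    return args


base = ["gatk", "GermlineCNVCaller", "--run-mode", "COHORT"]
print(" ".join(apply_settings(base, REPORTED_SETTINGS)))
```

Keeping the overrides in one dictionary makes it easy to record exactly which values differed from the defaults in a given run.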

> Are you using the WDL pipeline that was updated a few days ago on GitHub? I know this question is more about the basics of WDL input formats, but how do you define the input BAMs and BAIs within the `Array[String]+` field? I’ve tried different approaches, including changing it to `[read_lines(normal_bam_list)]` with lists of file locations, but kept getting errors along the lines of “No coercion defined….”

The recent change to the WDL workflow was needed to reduce case-mode cost on the cloud, and it is functionally equivalent to the previous version. However, make sure to grab the most recent commit, as we just pushed a bug fix yesterday!

Regarding the workflow inputs, see an example here:

https://github.com/broadinstitute/gatk/blob/master/scripts/cnv_cromwell_tests/germline/cnv_germline_cohort_workflow.json
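In a JSON inputs file, an `Array[String]+` input is written as a plain, non-empty JSON array of strings, so no `read_lines(...)` call is needed in the WDL. The Python sketch below turns two path lists into such arrays. The key names `normal_bams` and `normal_bais`, the workflow name, and the bucket paths are assumptions for illustration; take the real keys from the example JSON linked above.

```python
import json
from typing import Dict, List


def read_paths(text: str) -> List[str]:
    """Parse one path per line, skipping blank lines."""
    return [ln.strip() for ln in text.splitlines() if ln.strip()]


def cohort_bam_inputs(
    bams: List[str],
    bais: List[str],
    workflow: str = "CNVGermlineCohortWorkflow",  # assumed workflow name
) -> Dict[str, List[str]]:
    """Build the BAM/BAI array entries of a JSON inputs file."""
    if not bams:
        # The trailing '+' in Array[String]+ means the array must be non-empty.
        raise ValueError("at least one BAM is required")
    if len(bams) != len(bais):
        raise ValueError("each BAM needs exactly one index")
    return {
        f"{workflow}.normal_bams": bams,  # assumed input name
        f"{workflow}.normal_bais": bais,  # assumed input name
    }


# Hypothetical contents of two list files, one path per line.
bam_list = "gs://my-bucket/s1.bam\ngs://my-bucket/s2.bam\n"
bai_list = "gs://my-bucket/s1.bam.bai\ngs://my-bucket/s2.bam.bai\n"
print(json.dumps(cohort_bam_inputs(read_paths(bam_list), read_paths(bai_list)), indent=2))
```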

Let us know if you run into any problems or error modes while running gCNV – we would appreciate your feedback!

From alphahmed on 2018-10-01

Thank you shlee and asmirnov !

From Jyang32 on 2018-12-20

@shlee Is the germline CNV best practices workflow available now? I saw the WDL version on GitHub, but it does not give me enough information. I just want to double-check.

From dkolbe on 2019-02-19

Given the announcement that CNV pipelines are out of beta and ready for production, is there documentation that describes them and how to use them?

From thilakam on 2019-03-18

Hello,

I am trying to run the PostprocessGermlineCNVCalls and I am having trouble with it. I am not sure what I am not doing right.

Thanks for your help.

From shlee on 2019-04-03

Hi everyone (thilakam dkolbe @Jyang32 et al). The gCNV tutorial is now available. Here are links to relevant documentation:

- Main tutorial: https://gatkforums.broadinstitute.org/gatk/discussion/11684

- companion Notebook tutorial: https://gatkforums.broadinstitute.org/gatk/discussion/11685

- companion Notebook tutorial: https://gatkforums.broadinstitute.org/gatk/discussion/11686

- followup discussion: https://gatkforums.broadinstitute.org/gatk/discussion/11687

Thanks for your patience waiting. It’s taken me quite a bit of effort to finalize these before [my departure from the team](https://gatkforums.broadinstitute.org/gatk/discussion/23801/goodbye-note-to-the-gatk-community). Please do ask any clarifying questions you have on the forum and @slee and others will be able to help you.

Best,

Soo Hee

From Royston on 2019-05-03

Hi, can anyone comment on how well gCNV works on targeted gene panels and what would the recommended coverage be to get good quality CNV calls?

From Yangyxt on 2019-08-04

Geraldine_VdAuwera said:

> Hi @mglclinical, yes we have pipelines in development for this. The germline CNV pipeline (for which this doc is a placeholder) is close to being in a releasable state. The SV pipeline is going to take a few more months, I believe.

Hello,

I have been trying to use gCNV to build a model with 20+ training samples in COHORT mode. However, the task has been running for over 300 hours and the job still hasn’t finished.

Here is the script:

```bash
wkd=/paedwy/disk1/yangyxt/wes/healthy_bams_for_CNV
v6dir=/paedwy/disk1/yangyxt/wes/healthy_bams_for_CNV/using_V6_probe
v7dir=/paedwy/disk1/yangyxt/wes/healthy_bams_for_CNV/using_V7_probe
gatk=/home/yangyxt/software/gatk-4.1.0.0/gatk
valid_ploidy_call=${v6dir}/v6_model_dir/v6_normal_cohort-calls
gCNV_model=${v6dir}/v6_gCNV_model

source activate gatk
cd ${v6dir}

$gatk GermlineCNVCaller \
    --run-mode COHORT \
    -L ${v6dir}/v6.cohort.gc.filtered.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    --contig-ploidy-calls ${valid_ploidy_call} \
    --verbosity DEBUG \
    --annotated-intervals ${v6dir}/v6.annotated.tsv \
    --input ${v6dir}/A180346.counts.hdf5 \
    --input ${v6dir}/A180347.counts.hdf5 \
    --input ${v6dir}/A180362.counts.hdf5 \
    --input ${v6dir}/A180576.counts.hdf5 \
    --input ${v6dir}/A190007.counts.hdf5 \
    --input ${v6dir}/A190013.counts.hdf5 \
    --input ${v6dir}/A190047.counts.hdf5 \
    --input ${v6dir}/A190048.counts.hdf5 \
    --input ${v6dir}/PID15-131.counts.hdf5 \
    --input ${v6dir}/PID18-041.counts.hdf5 \
    --input ${v6dir}/PID18-042.counts.hdf5 \
    --input ${v6dir}/PID18-048.counts.hdf5 \
    --input ${v6dir}/PID18-102.counts.hdf5 \
    --input ${v6dir}/PID18-125.counts.hdf5 \
    --input ${v6dir}/PID18-126.counts.hdf5 \
    --input ${v6dir}/PID18-128.counts.hdf5 \
    --input ${v6dir}/PID18-130.counts.hdf5 \
    --input ${v6dir}/PID18-131.counts.hdf5 \
    --input ${v6dir}/PID18-137.counts.hdf5 \
    --input ${v6dir}/PID18-138.counts.hdf5 \
    --input ${v6dir}/PID18-142.counts.hdf5 \
    --input ${v6dir}/PID18-143.counts.hdf5 \
    --input ${v6dir}/PID19-054.counts.hdf5 \
    --input ${v6dir}/PID19-055.counts.hdf5 \
    --output ${gCNV_model} \
    --output-prefix v6_gCNV_normal_cohort

source deactivate
```

I would much appreciate a hint as to whether this is normal or not.

From matdmset on 2019-08-08

Hi @Geraldine_VdAuwera,

Are there any updates planned for this document? I can imagine there’s been a lot of development since Jan ’18, and I’d be interested in seeing the best practice guidelines for CNV detection.

Thanks!

Matthias

From NicolasK on 2019-08-08

@Yangyxt, I ran GermlineCNVCaller in cohort mode with 64 samples (WES) on a 64-core computer for 3 days and the process used much of the CPU power. I think the runtime strongly depends on the number of samples, the sequencing depth and the computer you have.

Best regards,

Nicolas

From Geraldine_VdAuwera on 2019-08-08

@matdmset Yes we’re working on releasing a new version of the workflow and a Terra workspace with a fully working example.

From Yangyxt on 2019-08-21

NicolasK said:

> @Yangyxt, I ran GermlineCNVCaller in cohort mode with 64 samples (WES) on a 64-core computer for 3 days and the process used much of the CPU power. I think the runtime strongly depends on the number of samples, the sequencing depth and the computer you have.
>
> Best regards,
>
> Nicolas

Dear Nicolas,

I used a server in my department where computing resources are allocated by PBS Pro. For the command I showed you, I used 12 cores and 80 GB of RAM. Still, it has taken more than 300 hours. Furthermore, according to the IT support staff in our department, this job only uses one computing thread. (I suppose this means the job only uses one CPU for computing?)

Can I have more info about how to allocate more computing resources to gCNV and does it support multi-threading computation?

Thanks!

From slee on 2019-08-21

@Yangyxt, GermlineCNVCaller is designed to be scattered over the genome in multiple shards. See the tutorial posted above by shlee and the WDLs referenced there to see how this works.
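A rough Python sketch of what scattering could look like outside the WDL: one cohort-mode command per shard interval list, each of which can be submitted as a separate job. It reuses only the options that appear in the script posted earlier in this thread. The shard file names, sample names, and the `shard-NNNN` prefix scheme are hypothetical. `--annotated-intervals` is left out for brevity and would need to be handled per shard if you use it.

```python
from typing import List, Sequence


def cohort_shard_commands(
    shard_interval_lists: Sequence[str],
    count_files: Sequence[str],
    ploidy_calls: str,
    output_dir: str,
    gatk: str = "gatk",
) -> List[List[str]]:
    """Build one GermlineCNVCaller COHORT command per genomic shard."""
    commands = []
    for i, intervals in enumerate(shard_interval_lists):
        cmd = [
            gatk, "GermlineCNVCaller",
            "--run-mode", "COHORT",
            "-L", intervals,
            "--interval-merging-rule", "OVERLAPPING_ONLY",
            "--contig-ploidy-calls", ploidy_calls,
        ]
        for counts in count_files:
            # Every shard sees all samples; only the intervals differ.
            cmd += ["--input", counts]
        # Hypothetical naming scheme so shard outputs do not overwrite each other.
        cmd += ["--output", output_dir, "--output-prefix", f"shard-{i:04d}"]
        commands.append(cmd)
    return commands


# Hypothetical shard and sample file names.
shards = ["shard_0000.interval_list", "shard_0001.interval_list"]
counts = ["sampleA.counts.hdf5", "sampleB.counts.hdf5"]
for cmd in cohort_shard_commands(shards, counts, "ploidy-calls", "gcnv_out"):
    print(" ".join(cmd))
```

Because the shards are independent jobs, a scheduler such as PBS can run them side by side, which is how a run that uses a single thread per process can still make use of many cores.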

From Yangyxt on 2019-08-27

slee said:

> @Yangyxt GermlineCNVCaller is designed to be scattered over the genome in multiple shards. See the tutorial posted above by @shlee and the WDLs referenced there to see how this works.

Dear @NicolasK

Thank you for the information. For WES data, how many shards did you split your interval_list into? And how many CPUs did you allocate to each shard’s job?

And Dear @slee

Thank you for your guidance. I noticed the shards part in the tutorial. And I have another question regarding the cohort mode. I would like to detect CNV events in patients WES data. To set model parameters, we need to run gCNV in cohort mode with control samples’ WES data.

The thing is, we don’t have more than 30 control samples yet for model training.

Given that all our patients are widely heterogeneous in their genetic defects, can I include all the patients’ WES data in cohort mode and use the trained parameters to detect CNV events in these same patients?

Much appreciated if you can share relevant info with me. Thanks!

From NicolasK on 2019-08-28

@Yangyxt

I used the standard parameters, my interval list contains all exons captured by my WES experiment.

This makes 217.683 shards.

Kind regards

From Prabhavi on 2019-12-09

Hi ,

Could you please help me set up the GATK CNV pipeline? I have exome-negative hereditary cancer patients for whom CNV detection should be done, but I have no idea how to implement it.

Please help me

From manolis on 2019-12-27

Hi,

[Here](https://software.broadinstitute.org/gatk/documentation/topic?name=tutorials) there are many tutorials about GATK pipelines. You can take a look [here](https://software.broadinstitute.org/gatk/documentation/article?id=11682) and [here](https://software.broadinstitute.org/gatk/documentation/article?id=11683). [Here](https://software.broadinstitute.org/gatk/best-practices/workflow?id=11147) you can find the full pipeline validated by the GATK team.

For germline CNV, look [here](https://software.broadinstitute.org/gatk/documentation/article?id=11684).

Best
