created by GATK_Team
on 2018-09-04
Some tools in GATK4, like the gCNV pipeline and the new deep learning variant filtering tools, require extensive Python dependencies. To avoid managing these dependencies yourself, we recommend using the GATK4 Docker container, which comes with everything pre-installed, as explained [here](https://software.broadinstitute.org/gatk/documentation/article?id=11090). If you are running GATK4 on a server and/or cannot use the Docker image, we recommend using the Conda package manager as a backup solution. GATK4 ships with a Conda environment file that installs all the dependencies you need, so you do not need to install each one separately. Conda and Docker are intended to solve the same problem, but one big benefit of Conda is that you can use it without root access. Conda should be easy to install if you follow these steps.
1) Refer to the [installation instructions](https://conda.io/docs/user-guide/install/index.html) from Conda. Choose the installer that matches your operating system. You will have the option of downloading Anaconda or Miniconda. Conda provides [documentation](https://conda.io/docs/user-guide/install/download.html#anaconda-or-miniconda) about the difference between Anaconda and Miniconda. We chose Miniconda for this tutorial because we only wanted the GATK conda environment and did not want to take up too much space on our computer. If you are not going to use Conda for anything other than GATK4, you might consider doing the same. If you choose to install Anaconda, you will have access to other bioinformatics packages that may be helpful to you, and you won’t have to install each package you need. Follow the prompts to properly install the .pkg file. Make sure you choose the correct package for the version of Python you are using. For example, if you have Python 2.7 on your computer, choose the version specific to it.
2) Go to the directory where you have stored the GATK4 jars and the `gatk` wrapper script, and make sure gatkcondaenv.yml is present. Run
`conda env create -n gatk -f gatkcondaenv.yml`
`source activate gatk`
3) To check if your Conda environment is running properly, type `conda list` and you should see a list of packages installed.
`gatkpythonpackages` should be one of them.
4) You can also test whether the new variant filtering tool (CNNScoreVariants) runs properly. If you run `python -c "import vqsr_cnn"`, the output should look like `Using TensorFlow backend.` If the Conda environment is not configured correctly, you will immediately get an error saying `ImportError: No module named vqsr_cnn`.
5) If you later upgrade to a new version of GATK4, you will need to recreate the Conda environment from the new GATK4 folder. If you simply run Step 2 again while the old environment still exists, you will get an error message saying “CondaValueError: prefix already exists: /anaconda2/envs/gatk”. For example, when I upgraded from GATK 4.0.1.2 to GATK 4.0.2.0, I simply ran (in my 4.0.2.0 folder) `source deactivate` followed by `conda env remove -n gatk`.
Then, follow Steps 2-4 again to re-install it.
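Steps 2 to 5 can be wrapped in a few shell functions. The following is only a sketch, not part of GATK: the function names are made up for illustration, and it assumes that `conda` is on your PATH, that the environment is called `gatk` as above, and that it has been deactivated before you recreate it.

```shell
# Sketch only: function names are hypothetical; the commands are the ones from Steps 2-5.

# Succeeds if an environment with the given name is listed by `conda env list`.
gatk_env_exists() {
    conda env list | awk -v name="${1:-gatk}" '$1 == name { found = 1 } END { exit !found }'
}

# Removes a stale environment (if any) and recreates it from the YAML file.
# Run this from the GATK4 folder, with the environment deactivated.
recreate_gatk_env() {
    yml="${1:-gatkcondaenv.yml}"
    name="${2:-gatk}"
    if [ ! -f "$yml" ]; then
        echo "error: $yml not found; run this from the GATK4 folder" >&2
        return 1
    fi
    if gatk_env_exists "$name"; then
        conda env remove -n "$name" || return 1
    fi
    conda env create -n "$name" -f "$yml"
}

# With the environment activated, repeats the checks from Steps 3 and 4.
verify_gatk_env() {
    conda list | grep 'gatkpythonpackages' > /dev/null \
        || { echo "error: gatkpythonpackages is not installed" >&2; return 1; }
    python -c "import vqsr_cnn" \
        || { echo "error: cannot import vqsr_cnn" >&2; return 1; }
}
```

A typical session would be `recreate_gatk_env`, then `source activate gatk`, then `verify_gatk_env`. Removing the old environment first is what avoids the `CondaValueError` described in Step 5.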
Important
Do not confuse the above-mentioned GATK conda environment setup with this [bioconda gatk](https://bioconda.github.io/recipes/gatk4/README.html "bioconda gatk") installation. The current version of the bioconda installation of GATK does not set up the conda environment used for the GATK python tools, so that must still be set up manually.
Updated on 2019-08-13
From lakhujanivijay on 2019-03-05
Thank you for the article. :) It would be great if you could add hyperlinks to the following:
1. GATK4 jars
2. the gatk wrapper script
I am having difficulty locating them :# . Could you please help?
From lakhujanivijay on 2019-03-05
Additionally, I followed the steps,
`conda env create -n gatk -f gatkcondaenv.yml`
It gave the output
    Collecting package metadata: done
    Solving environment: done
    Downloading and Extracting Packages
    intel-openmp-2018.0. | 620 KB    | ########## | 100%
    pip-9.0.1            | 1.7 MB    | ########## | 100%
    zlib-1.2.11          | 109 KB    | ########## | 100%
    readline-6.2         | 606 KB    | ########## | 100%
    openssl-1.0.2l       | 3.2 MB    | ########## | 100%
    tk-8.5.18            | 1.9 MB    | ########## | 100%
    certifi-2016.2.28    | 216 KB    | ########## | 100%
    xz-5.2.3             | 667 KB    | ########## | 100%
    python-3.6.2         | 16.5 MB   | ########## | 100%
    sqlite-3.13.0        | 4.0 MB    | ########## | 100%
    setuptools-36.4.0    | 563 KB    | ########## | 100%
    mkl-2018.0.1         | 184.7 MB  | ########## | 100%
    wheel-0.29.0         | 88 KB     | ########## | 100%
    mkl-service-1.1.2    | 11 KB     | ########## | 100%
    Preparing transaction: done
    Verifying transaction: done
    Executing transaction: done
    #
    # To activate this environment, use
    #
    #     $ conda activate gatk
    #
    # To deactivate an active environment, use
    #
    #     $ conda deactivate
Then I activated the gatk environment:
`conda activate gatk`
Then I ran following command which throws errors:
    (gatk) bioinfo$ gatk NeuralNetInference -R reference.fasta -V NA12878.vcf -O NeuralNetInferenceFiltered.vcf -a cnn_1d_annotations.hd5
    No command 'gatk' found, did you mean:
     Command 'gitk' from package 'gitk' (main)
     Command 'gak' from package 'gui-apt-key' (universe)
     Command 'gawk' from package 'gawk' (main)
    gatk: command not found
Can you please help?
From SkyWarrior on 2019-03-05
You need to add GATK to your PATH.
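For readers unsure what that involves, here is a minimal sketch. The helper name `add_gatk_to_path` and the example folder are hypothetical; the only assumption taken from the article is that the `gatk` wrapper script sits in the folder where you unpacked GATK4.

```shell
# Sketch only: prepend the folder containing the `gatk` wrapper script to PATH.
# `add_gatk_to_path` is a made-up helper name, not something GATK provides.
add_gatk_to_path() {
    gatk_dir="$1"
    if [ ! -x "$gatk_dir/gatk" ]; then
        echo "error: no executable gatk script in $gatk_dir" >&2
        return 1
    fi
    case ":$PATH:" in
        *":$gatk_dir:"*) ;;                          # already on PATH, nothing to do
        *) PATH="$gatk_dir:$PATH"; export PATH ;;    # otherwise prepend it
    esac
}

# Example call (the folder name is a placeholder for your own install location):
#   add_gatk_to_path "$HOME/path/to/gatk-folder"
#   command -v gatk    # should now print the full path of the wrapper script
```

The change only lasts for the current shell session. To make it permanent, the equivalent `export PATH=...` line would typically go in a shell startup file such as `~/.bashrc`.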
From lakhujanivijay on 2019-03-05
Thanks [SkyWarrior](https://gatkforums.broadinstitute.org/gatk/profile/SkyWarrior "SkyWarrior"). That helped. However, now I am able to launch GATK without activating the gatk environment.
    bioinfo@bioinfo$ conda activate gatk
    (gatk) bioinfo@bioinfo$ gatk
    Usage template for all tools (uses --spark-runner LOCAL when used with a Spark tool)
        gatk AnyTool toolArgs
    Usage template for Spark tools (will NOT work on non-Spark tools)
        gatk SparkTool toolArgs [ -- --spark-runner sparkArgs ]
    Getting help
        gatk --list       Print the list of available tools
        gatk Tool --help  Print help on a particular tool
    Configuration File Specification
        --gatk-config-file PATH/TO/GATK/PROPERTIES/FILE
    gatk forwards commands to GATK and adds some sugar for submitting spark jobs
      --spark-runner controls how spark tools are run
        valid targets are:
        LOCAL:  run using the in-memory spark runner
        SPARK:  run using spark-submit on an existing cluster
                --spark-master must be specified
                --spark-submit-command may be specified to control the Spark submit command
                arguments to spark-submit may optionally be specified after --
        GCS:    run using Google cloud dataproc
                commands after the -- will be passed to dataproc
                --cluster must be specified after the --
                spark properties and some common spark-submit parameters will be translated to dataproc equivalents
      --dry-run may be specified to output the generated command line without running it
      --java-options 'OPTION1[ OPTION2=Y ... ]' optional - pass the given string of options to the java JVM at runtime.
                Java options MUST be passed inside a single string with space-separated values.
Now, I deactivate `conda`
```
(gatk) bioinfo@bioinfo$ conda deactivate
```
and launch GATK
```
bioinfo@bioinfo$ gatk
```
It still launches, so I wonder whether this is expected.
From SkyWarrior on 2019-03-05
The environment and PATH are separate things, so this behavior is expected. Launching gatk without the environment is a problem for gCNV, CNN and some other tools. The environment must be active for those tasks.
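One way to guard against running those tools with the environment inactive is to check the `CONDA_DEFAULT_ENV` variable, which conda sets to the name of the active environment. The sketch below is illustrative only: `require_gatk_env` is a made-up helper name, and the environment is assumed to be called `gatk`.

```shell
# Sketch only: fail early if the expected conda environment is not active.
# `require_gatk_env` is a hypothetical helper, not part of GATK or conda.
require_gatk_env() {
    wanted="${1:-gatk}"
    if [ "${CONDA_DEFAULT_ENV:-}" != "$wanted" ]; then
        echo "error: conda environment '$wanted' is not active (run: conda activate $wanted)" >&2
        return 1
    fi
}

# Example: only launch a Python-dependent tool when the check passes.
#   require_gatk_env && gatk CNNScoreVariants ...
```

Because `gatk` is found through PATH regardless of the environment, a check like this is the only thing that ties the two together.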
From lakhujanivijay on 2019-03-11
Hi [SkyWarrior](https://gatkforums.broadinstitute.org/gatk/profile/SkyWarrior "SkyWarrior")
That really helped. Thanks!
From tiaojon on 2019-04-25
When I run `conda env create -n gatk -f gatkcondaenv.yml` I get this error:
Collecting package metadata: done
Solving environment: failed
ResolvePackageNotFound: - anaconda::tensorflow==1.12.0=mkl_py36h69b6ba0_0
It seems like I can’t download the 1.12.0 version of tensorflow anymore, because when I check the anaconda site, I can only find version 1.13.1. Is there some way to force the right version? Should I change the .yml file? I wasn’t sure if 1.13.1 is backwards compatible with 1.12.0
From annaship on 2019-06-01
I have the same problem. I saw that tensorflow was not listed as a package in my miniconda so I installed it, v 1.13.1. But when I try again to create the gatk environment, I still get:
Collecting package metadata: done
Solving environment: failed
ResolvePackageNotFound: - anaconda::tensorflow==1.12.0=mkl_py36h69b6ba0_0
Any assistance very welcome!
From sohta on 2019-11-20
I ran into the same issue as tiaojon and annaship did, but resolved it by rewriting the anaconda::tensorflow line so that it looks like the following (which is probably the latest release compatible with 1.12.0 at the moment):
- anaconda::tensorflow=1.12.0=mkl_py36h2b2bbaf_0
cf. https://anaconda.org/anaconda/tensorflow/files?version=1.12.0
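The same edit can be scripted. This is only a sketch: `pin_tensorflow_build` is a made-up helper name, and it assumes the file contains a dependency line starting with `- anaconda::tensorflow=`, as the error message above suggests. Check the build string against the file listing linked above before using it.

```shell
# Sketch only: replace the tensorflow dependency line in a conda environment file.
# A backup of the original file is kept next to it with a .bak suffix.
pin_tensorflow_build() {
    yml="$1"         # e.g. gatkcondaenv.yml
    new_spec="$2"    # e.g. anaconda::tensorflow=1.12.0=mkl_py36h2b2bbaf_0
    cp "$yml" "$yml.bak" || return 1
    # Keep the leading "- " and indentation, swap everything after it.
    sed "s|^\([[:space:]]*-[[:space:]]*\)anaconda::tensorflow=.*|\1$new_spec|" \
        "$yml.bak" > "$yml"
}

# Example, using the build string from the post above:
#   pin_tensorflow_build gatkcondaenv.yml "anaconda::tensorflow=1.12.0=mkl_py36h2b2bbaf_0"
#   conda env create -n gatk -f gatkcondaenv.yml
```

If the new build does not resolve either, the backup file lets you restore the original specification and try a different build.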