created by Sheila
on 2014-07-23
This document describes the procedure used by HaplotypeCaller to evaluate the evidence for variant alleles based on candidate haplotypes determined in the previous step for a given ActiveRegion. For more context information on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation.
The previous step produced a list of candidate haplotypes for each ActiveRegion, as well as a list of candidate variant sites borne by the non-reference haplotypes. Now, we need to evaluate how much evidence there is in the data to support each haplotype. This is done by aligning each sequence read to each haplotype using the PairHMM algorithm, which produces per-read likelihoods for each haplotype. From that, we'll be able to derive how much evidence there is in the data to support each variant allele at the candidate sites, and that produces the actual numbers that will finally be used to assign a genotype to the sample.
We originally obtained our list of haplotypes for the ActiveRegion by constructing an assembly graph and selecting the most likely paths in the graph by counting the number of supporting reads for each path. That was a fairly naive evaluation of the evidence, done over all reads in aggregate, and was only meant to serve as a preliminary filter to whittle down the number of possible combinations that we're going to look at in this next step.
Now we want to do a much more thorough evaluation of how much evidence we have for each haplotype. So we're going to take each individual read and align it against each haplotype in turn (including the reference haplotype) using the PairHMM algorithm (see Durbin et al., 1998). If you're not familiar with PairHMM, it's a lot like the BLAST algorithm, in that it's a pairwise alignment method that uses a Hidden Markov Model (HMM) and produces a likelihood score. In this use of the PairHMM, the output score expresses the likelihood of observing the read given the haplotype by taking into account the information we have about the quality of the data (i.e. the base quality scores and indel quality scores). Note: If reads from a pair overlap at a site and they have the same base, the base quality is capped at Q20 for both reads (Q20 is half the expected PCR error rate). If they do not agree, we set both base qualities to Q0.
This produces a big table of likelihoods where the columns are haplotypes and the rows are individual sequence reads. (example figure TBD)
The table essentially represents how much supporting evidence there is for each haplotype (including the reference), itemized by read.
Having per-read likelihoods for entire haplotypes is great, but ultimately we want to know how much evidence there is for individual alleles at the candidate sites that we identified in the previous step. To find out, we take the per-read likelihoods of the haplotypes and marginalize them over alleles, which produces per-read likelihoods for each allele at a given site. In practice, this means that for each candidate site, we're going to decide how much support each read contributes for each allele, based on the per-read haplotype likelihoods that were produced by the PairHMM.
This may sound complicated, but the procedure is actually very simple -- there is no real calculation involved, just cherry-picking appropriate values from the table of per-read likelihoods of haplotypes into a new table that will contain per-read likelihoods of alleles. This is how it happens. For a given site, we list all the alleles observed in the data (including the reference allele). Then, for each read, we look at the haplotypes that support each allele; we select the haplotype that has the highest likelihood for that read, and we write that likelihood in the new table. And that's it! For a given allele, the total likelihood will be the product of all the per-read likelihoods. (example fig TBD)
At the end of this step, sites where there is sufficient evidence for at least one of the variant alleles considered will be called variant, and a genotype will be assigned to the sample in the next (final) step.
Updated on 2015-09-14
From Juno on 2014-11-27
Hi, Sheila, where is the example fig TBD although it is very simple?
From Sheila on 2014-11-27
@Juno
Hi,
I am so sorry, but we have not posted it yet. However, you can have a look at this presentation which has some nice figures to help: https://www.broadinstitute.org/gatk/events/4768/GATKwr1409-BP-4-Variant_calling_genotyping.pdf
-Sheila
From Geraldine_VdAuwera on 2014-11-27
My bad, it’s been hard to prioritize the figures, but hopefully we can get to them in the near future.
From eflannery on 2015-01-24
Hello,
I have a few questions on haplotype caller. First, I don’t understand where to get the output of candidate haplotypes. How do you generate the table that is talked about in 1?
The attached screen shot shows an original bam for one sample and then the output for that region after using haplotype caller with several samples together. I don’t understand why four haplotypes (denoted by the shaded reads I colored using Tag=HC) are present when there are only two unique cigar strings (the boxes show two of the strings). What does Read name= HC49 mean? Is this the haplotype? Are there 49 different haplotypes? Also the heterozygous SNV at 333,518 is not emitted in the vcf file. Shouldn’t it be?
Thanks for your help.
Screen Shot 2015-01-23 at 6.10.52 PM.png
Screen Shot 2015-01-23 at 6.10.52 PM.png
From Sheila on 2015-01-26
@eflannery
Hi,
The table of candidate haplotypes is not output from GATK. However, you can see the candidate haplotypes that were considered in the bamout file. The candidate haplotypes are labeled as Artificial Haplotypes.
I have noticed HaplotypeCaller outputting two of the same artificial haplotypes. I think this is just a quirk of the tool and it does not affect output. HC49 is indeed the name of the Haplotype. 49 means it is the 49th haplotype considered by HaplotypeCaller. (For example, the very first haplotype constructed in the very first active region will be labeled HC1.)
It is odd that position 333,518 is not in the vcf. Can you check and make sure all the base quality scores and mapping quality scores are high? This could be related to another bug that is waiting to be fixed.
Thanks,
Sheila
From Sheila on 2015-01-26
@eflannery
Hi,
The artificial haplotypes are output in both the forward and reverse orientations. So, the two same artificial haplotypes are not simply a quirk of Haplotype Caller.
-Sheila
From Matteodigg on 2017-04-10
Hi staff,
I read a lot about this topic but I didn’t find an exhaustive answer about my doubts.
I’m working on a target panel and I don’t understand exactly how GATK (v.3.7) select reads to calculate genotype sample.
For example, I was looking to this variant from g.vcf file:
chr1 3311101 . C G, 775.77 . BaseQRankSum=1.060;ClippingRankSum=0.000;DP=106;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQ=381600.00;ReadPosRankSum=-0.933 GT:AD:DP:GQ:PL:SB 0/1:61,42,0:103:99:804,0,1188,985,1314,2299:29,32,28,14
DP here is 103 but looking at bam file, coming out from pre-processing, reads are 115, as you can see on this info:
chr1 3311101 C G
Pair: 115
Unpair: 0
Unmapped: 0
qc_fail: 0
is_read1: 57
is_read1_reverse: 29
is_read1_forward: 28
is_read2: 58
is_read2_reverse: 27
is_read2_forward: 31
is_reverse: 56
is_secondary: 1
is_supplementary: 0
is_proper_pair_mapped: 112
Coverage: 115
mate_is_unmapped: 0
mean_mapping_qual: 59
is_unmapped: 0
However, looking at bam file coming out from HC, reads reported are 106 with two Artificial reads. Looking at the 11 excluded reads, can you help me to figure out why are filtered out? For example, these reads have been discarded:
M03943:73:000000000-AV81U:1:1108:22334:19322 161 1 3310991 60 151M 2 33141307 151 GAGTGAGGGAGGATGCCCGAGGTCTCACGTCAGGACCCTCGACGGGGGTGGGGTGGAGAGGGAGGTGGTGGGTGGGGTGGGTGGTGCGGCCTGGAGTGGAGCCAAGGGGGGGGCGGCCAGGGTCCACATTGCCCTCTAACGGGACCCCTCG array(‘B’, [31, 31, 31, 31, 35, 33, 37, 36, 32, 31, 29, 32, 33, 33, 38, 38, 36, 37, 32, 34, 38, 36, 32, 37, 36, 38, 36, 37, 32, 33, 37, 35, 37, 37, 32, 37, 36, 36, 38, 37, 32, 33, 36, 32, 36, 36, 36, 36, 32, 38, 36, 36, 36, 27, 34, 36, 34, 38, 34, 34, 35, 36, 32, 34, 35, 31, 38, 36, 30, 38, 36, 35, 27, 33, 35, 35, 36, 27, 33, 36, 36, 27, 33, 36, 30, 38, 37, 30, 35, 38, 36, 36, 34, 38, 34, 39, 31, 38, 33, 34, 38, 37, 32, 36, 35, 39, 34, 36, 36, 36, 35, 35, 35, 36, 32, 35, 36, 29, 34, 37, 35, 34, 30, 35, 32, 30, 36, 34, 32, 32, 38, 36, 34, 35, 35, 35, 35, 32, 33, 36, 32, 33, 35, 30, 31, 34, 34, 33, 32, 32, 28]) [(‘BD’, ‘LLNNOPOOLOOOOOPPNKMNNNNNMNNNMMNNPNNNNKNNMNNMMKKKNNNKKNNNNNONNKNNNNNNNNNKNNNKKNNNKNNNNNOMMMNNONNNNNNNNONNKMNKKKKKKMMMMNNPNKNNNNNMOMNONKNNMMLMMMKNNNKKNNM’), (‘MD’, ‘5C103C41’), (‘PG’, ‘MarkDuplicates’), (‘RG’, ‘20161125_01_Cardio’), (‘BI’, ‘PPRQRRSSORRRRQRRPPQQRRRRQPQPQQRQRQRRQPRPQQRQQNNNRQRNNRQRRRQRRNRRRRQRRQRNRQRNNRQRNRQRRQRQQQPRSRRRQQRRRRPRPQRNNNNNNQQQQPRRRNRRQRPQQPQRPPRPQQNPQQNRRQPPRPQ’), (‘NM’, 2), (‘AS’, 141), (‘XS’, 21)]
M03943:73:000000000-AV81U:1:1109:23019:7394 401 1 3311088 1 102H48M 1 3311018 48 GGAGCCAAGGGGGGGGCGGCCAGGGTCCACATTGCCCTCTAACGGGAC array(‘B’, [35, 30, 18, 13, 31, 20, 26, 17, 36, 36, 36, 36, 36, 35, 36, 33, 31, 36, 37, 35, 38, 37, 36, 36, 37, 35, 36, 38, 33, 38, 34, 33, 35, 38, 36, 36, 38, 33, 35, 32, 33, 32, 31, 35, 33, 30, 31, 32]) [(‘SA’, ‘chr2,33141319,-,42S49M59S,1,0;’), (‘BD’, ‘NNNMNNLNKKKKKKNMMNMNONKNNNNNNONKONLOOPOMNNNLNNLL’), (‘MD’, ‘12C35’), (‘PG’, ‘MarkDuplicates’), (‘RG’, ‘20161125_01_Cardio’), (‘BI’, ‘QPRQRQPRPPPPPPOPQPQRSRPQRRRQRRQPRQNRRQQRRRRQQRPP’), (‘NM’, 1), (‘AS’, 43), (‘XS’, 19)]
M03943:73:000000000-AV81U:1:2112:13824:17315 147 1 3311069 60 41S110M 1 3311018 110 CGGGGCGGGGGGGGGGGGGAGGGGGGGTGGGGGGGGGGGGGGGGTGGGGCGGCCTGGTGTGGGCCCAAGGGGGGGGCGGCCAGGGTCCACATTGCCCTCTAACGGGACCCCTCGAGAAGCCTCAGCCTCGCCCGCTGACCCTCCTCTCTCC array(‘B’, [24, 2, 2, 2, 13, 15, 35, 32, 26, 22, 1, 1, 22, 35, 35, 29, 29, 22, 14, 14, 30, 1, 30, 22, 22, 22, 13, 15, 1, 1, 1, 1, 26, 22, 1, 1, 32, 29, 1, 26, 1, 32, 29, 26, 16, 22, 24, 1, 12, 14, 4, 13, 5, 26, 16, 26, 12, 16, 12, 25, 1, 4, 12, 27, 24, 20, 9, 17, 7, 36, 32, 26, 23, 7, 7, 10, 25, 36, 36, 35, 38, 31, 32, 30, 37, 33, 35, 38, 31, 39, 34, 33, 35, 37, 36, 37, 38, 30, 38, 32, 34, 32, 32, 36, 36, 37, 32, 36, 36, 36, 38, 33, 32, 37, 37, 35, 34, 36, 37, 36, 38, 33, 38, 36, 34, 33, 31, 31, 31, 36, 36, 37, 32, 39, 38, 34, 35, 32, 35, 32, 32, 34, 27, 32, 31, 32, 33, 32, 30, 32, 31]) [(‘MC’, ‘150M’), (‘BD’, ‘MKKNMMKKKKKKKKKKKNNNKKKKKNNNKKKKKKKKKKKKKKNNNKKNMMNMNPNNNMNNKNMKNNLNKKKKKKNMMNMNONKNNNNNNONKNMKNNONLMMMKNNNKKNNNMNMNLNMNNOONMNNNMMKMMOPOOOLOOOOOPOPNNLL’), (‘MD’, ‘6T9A4A0G8C78’), (‘PG’, ‘MarkDuplicates’), (‘RG’, ‘20161125_01_Cardio’), (‘BI’, ‘QPPPPQPPPPPPPPPPPQPRPPPPPQPRPPPPPPPPPPPPPPQPRPPPPQPQRRRQPQPRPPQNRQPRPPPPPPPPQPQRSRPQRRRQRRQPRQNRRQQQQQQPQRRNNRRQQPQRPQQRRQSRQRRQQQNQQRRQRRNRRRSSRSRRRPP’), (‘NM’, 5), (‘MQ’, 60), (‘AS’, 85), (‘XS’, 19)]
M03943:73:000000000-AV81U:1:1113:3640:17203 83 1 3311050 60 15S134M 1 3311052 134 TTGGGTAAAAGAGACGGGAGGGGGTGGGGGGGGTGGGTGGTGCGACAAGGAGTGGAGCCAAGGGGGGGGCGGCCAGGATCCACATTGCCCTCTAACGGGACCCCTCGAGAAGCCTCAGCCTCGCCCGCTGACCCTCCTCTCTCCTGCCC array(‘B’, [5, 17, 22, 9, 24, 14, 24, 9, 9, 24, 13, 24, 13, 24, 16, 2, 2, 12, 15, 1, 1, 4, 4, 24, 16, 26, 22, 22, 1, 29, 4, 24, 13, 17, 22, 24, 12, 17, 23, 12, 15, 14, 16, 26, 16, 20, 9, 18, 7, 27, 27, 13, 17, 4, 16, 35, 12, 15, 20, 33, 16, 29, 8, 30, 23, 8, 8, 23, 11, 15, 11, 31, 14, 32, 36, 35, 30, 15, 30, 31, 38, 32, 34, 32, 25, 27, 11, 11, 33, 38, 32, 19, 32, 33, 32, 32, 36, 36, 33, 32, 33, 12, 11, 19, 33, 32, 37, 32, 17, 32, 28, 12, 29, 21, 30, 21, 36, 34, 11, 31, 14, 25, 33, 36, 33, 32, 36, 38, 30, 35, 32, 15, 15, 28, 33, 33, 34, 16, 23, 15, 23, 29, 32, 31, 31, 30, 28, 29, 30]) [(‘MC’, ‘132M’), (‘BD’, ‘KNKNOLDDLMNMNMMKNNNKKKNNNKKKKKKNNNKNNNNNNMMNNNLNNNNNNNNNMNNLNKKKKKKNMMNMNONNONNNNNONKNMKNNONLMMMKNNNKKNNNMNMNLNMNNOONMNNNMMKMMOPNNNLOOOOOPOPOOOQOMKLL’), (‘MD’, ‘6T6T15G1C0T17C11G71’), (‘PG’, ‘MarkDuplicates’), (‘RG’, ‘20161125_01_Cardio’), (‘BI’, ‘PRPQQQIIPQPQRQQPQPRPPPQPRPPPPPPQPRPQPRQPRPQRRQPRQPRPRQPRQRQPRPPPPPPPPQPQRSRQQQRRQRRQPRQNRRQQQQQQPQRRNNRRQQPQRPRQRRQSRQRRQQQNQQRRQRRNRRRRRQRRSSSSSQNPP’), (‘NM’, 7), (‘MQ’, 60), (‘AS’, 99), (‘XS’, 22)]
M03943:73:000000000-AV81U:1:2104:9715:10639 83 1 3311086 60 89S61M 1 3310983 61 GGGGGGTGCCGGGGGTCCACGCGGGGGCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCGCCCGGGGGTGGAGCCAGGGGGGGGGCGGCCAGGGTCCACATTGCCCTCTAACGGGACCCCTCGAGAAG array(‘B’, [26, 29, 26, 22, 22, 12, 16, 13, 3, 16, 1, 4, 22, 22, 13, 16, 3, 15, 24, 24, 12, 15, 1, 2, 1, 2, 12, 3, 20, 20, 16, 2, 2, 1, 22, 22, 30, 26, 22, 22, 22, 1, 2, 22, 22, 2, 2, 2, 22, 2, 2, 2, 26, 22, 2, 2, 2, 26, 2, 30, 26, 22, 22, 35, 35, 29, 2, 30, 22, 2, 2, 2, 23, 26, 7, 8, 8, 23, 8, 11, 15, 11, 11, 11, 16, 8, 26, 32, 24, 14, 17, 24, 27, 17, 13, 15, 21, 19, 26, 32, 33, 36, 36, 35, 36, 35, 34, 28, 36, 37, 37, 39, 37, 36, 36, 36, 34, 38, 38, 33, 38, 34, 33, 35, 38, 36, 37, 38, 33, 37, 32, 33, 32, 32, 35, 36, 35, 31, 33, 33, 27, 31, 34, 30, 35, 33, 31, 28, 32, 31]) [(‘SA’, ‘chr2,33141319,-,30S50M70S,0,0;’), (‘MC’, ‘151M’), (‘BD’, ‘KKKKOONMMMKKKNNNNNMMMMKKKNMKKMMKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKNMMMKMMKKKNNNNNNMNONKKKKKKKNMMNMNONKNNNNNNONKNMKNNONLMNNLOOOLLOOONONNLLL’), (‘MD’, ‘9A4C46’), (‘PG’, ‘MarkDuplicates’), (‘RG’, ‘20161125_01_Cardio’), (‘BI’, ‘PPPPQPRQQQPPPQRRRQQQPQPPPPQNNQQPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPQQNQQPPPQPRQPRQRSRPPPPPPPPPQPQRSRPQRRRQRRQPRQNRRQQQQQQPQRRNNSSRRQRRPPP’), (‘NM’, 2), (‘MQ’, 60), (‘AS’, 51), (‘XS’, 19)]
I can clearly see that some reads have alignments problem, like flagged 401 (MQ=1) and 161 (not proper mapped read), and GATK filter out them, but what about M03943:73:000000000-AV81U:1:2104:9715:10639 and 1:1113:3640:17203 (chimeric reads) ? It depends, maybe, on Q-score (even if Q-score is greather than 30 for this position) or Alignment Score ‘AS’ ?
Is there any detailed documentation on this argument?
Thank you very much for your help, have a nice day!
Matteo
From Sheila on 2017-04-12
@Matteodigg
Hi Matteo,
You can use [`—emitDroppedReads`](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#—emitDroppedReads) to check out why the reads have been dropped.
-Sheila
From Matteodigg on 2017-04-13
Hi @Sheila ,
I tried -edr option with -bamout option, in order to check only that position with -L option and I got this result:
INFO 10:43:40,089 MicroScheduler – 22 reads were filtered out during the traversal out of approximately 241 total reads (9.13%)
INFO 10:43:40,089 MicroScheduler – -> 0 reads (0.00% of total) failing BadCigarFilter
INFO 10:43:40,090 MicroScheduler – -> 20 reads (8.30% of total) failing DuplicateReadFilter
INFO 10:43:40,090 MicroScheduler – -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO 10:43:40,090 MicroScheduler – -> 1 reads (0.41% of total) failing HCMappingQualityFilter
INFO 10:43:40,090 MicroScheduler – -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO 10:43:40,090 MicroScheduler – -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO 10:43:40,091 MicroScheduler – -> 1 reads (0.41% of total) failing NotPrimaryAlignmentFilter
INFO 10:43:40,091 MicroScheduler – -> 0 reads (0.00% of total) failing UnmappedReadFilter**
So, I tried with -drf DuplicateRead option with this output:
INFO 10:46:34,608 MicroScheduler – 2 reads were filtered out during the traversal out of approximately 241 total reads (0.83%)
INFO 10:46:34,608 MicroScheduler – -> 0 reads (0.00% of total) failing BadCigarFilter
INFO 10:46:34,608 MicroScheduler – -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO 10:46:34,609 MicroScheduler – -> 1 reads (0.41% of total) failing HCMappingQualityFilter
INFO 10:46:34,609 MicroScheduler – -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO 10:46:34,609 MicroScheduler – -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO 10:46:34,609 MicroScheduler – -> 1 reads (0.41% of total) failing NotPrimaryAlignmentFilter
INFO 10:46:34,609 MicroScheduler – -> 0 reads (0.00% of total) failing UnmappedReadFilter
However, even with -edr, bamout has always 106 DP.
Can you explain me why I see a total of 241 reads? Does it include artificial reads and Duplicate reads?
Reading this post http://gatkforums.broadinstitute.org/gatk/discussion/8501/outputting-bamout-to-read-post-realignment-coverage-including-non-variants you said:
> Although, —emitDroppedReads will have no effect on the Depth that is output. That is simply to show you where some of the unused reads went/why they were filtered.
But I still continue to not understand why these reads have been filtered. Am I looking to -edr results in a proper way?
Thank you,
Matteo
From Sheila on 2017-04-18
@Matteodigg
Hi Matteo,
I think the 241 reads are the reads that span the entire active region, not just that site. As for why the 11 reads are being filtered out, can you confirm that all 115 reads have high base quality (greater than 10) and high mapping quality (greater than 20)? Also, to confirm, in the bamout you see 115 reads at the site of interest and only two are artificial haplotypes?
If all of this is true, I may need you to submit a bug report.
Thanks,
Sheila
From Matteodigg on 2017-04-19
@Sheila
Hi Sheila,
No, in the bamout I see only 106 reads, 2 of them are artificial haplotypes. I see 115 reads only in bam file used for variant calling.
So, considering 104 reads for that position, the 11 discarded reads have been filtered, I suppose, due to different reasons. Some of them have a secondary alignment, like these, or not proper flag:
M03943:73:000000000-AV81U:1:2104:9715:10639 83 1 3311086 60 89S61M 1 3310983 61 GGGGGGTGCCGGGGGTCCACGCGGGGGCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCGCCCGGGGGTGGAGCCAGGGGGGGGGCGGCCAGGGTCCACATTGCCCTCTAACGGGACCCCTCGAGAAG array(‘B’, [26, 29, 26, 22, 22, 12, 16, 13, 3, 16, 1, 4, 22, 22, 13, 16, 3, 15, 24, 24, 12, 15, 1, 2, 1, 2, 12, 3, 20, 20, 16, 2, 2, 1, 22, 22, 30, 26, 22, 22, 22, 1, 2, 22, 22, 2, 2, 2, 22, 2, 2, 2, 26, 22, 2, 2, 2, 26, 2, 30, 26, 22, 22, 35, 35, 29, 2, 30, 22, 2, 2, 2, 23, 26, 7, 8, 8, 23, 8, 11, 15, 11, 11, 11, 16, 8, 26, 32, 24, 14, 17, 24, 27, 17, 13, 15, 21, 19, 26, 32, 33, 36, 36, 35, 36, 35, 34, 28, 36, 37, 37, 39, 37, 36, 36, 36, 34, 38, 38, 33, 38, 34, 33, 35, 38, 36, 37, 38, 33, 37, 32, 33, 32, 32, 35, 36, 35, 31, 33, 33, 27, 31, 34, 30, 35, 33, 31, 28, 32, 31]) [(‘SA’, ‘chr2,33141319,-,30S50M70S,0,0;’), (‘MC’, ‘151M’), (‘BD’, ‘KKKKOONMMMKKKNNNNNMMMMKKKNMKKMMKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKNMMMKMMKKKNNNNNNMNONKKKKKKKNMMNMNONKNNNNNNONKNMKNNONLMNNLOOOLLOOONONNLLL’), (‘MD’, ‘9A4C46’), (‘PG’, ‘MarkDuplicates’), (‘RG’, ‘20161125_01_Cardio’), (‘BI’, ‘PPPPQPRQQQPPPQRRRQQQPQPPPPQNNQQPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPQQNQQPPPQPRQPRQRSRPPPPPPPPPQPQRSRPQRRRQRRQPRQNRRQQQQQQPQRRNNSSRRQRRPPP’), (‘NM’, 2), (‘MQ’, 60), (‘AS’, 51), (‘XS’, 19)]
M03943:73:000000000-AV81U:1:1105:18756:9300 147 1 3311088 60 77S73M 1 3310971 73 GGTCGCGGTCAGGGCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCGGGGGGGGAGCCAGGGGGGGGGCGGCCAGGGTCCACATTGCCCTCTAACGGGACCCCTCGAGAAGCCTCAGCCTCGCCC array(‘B’, [22, 12, 15, 24, 12, 15, 8, 13, 14, 16, 16, 26, 22, 12, 2, 20, 24, 1, 1, 26, 1, 1, 30, 35, 35, 33, 22, 1, 29, 35, 35, 30, 24, 1, 22, 1, 1, 29, 33, 35, 30, 26, 22, 32, 26, 1, 26, 24, 1, 35, 26, 1, 2, 35, 35, 30, 26, 1, 29, 24, 1, 24, 22, 1, 1, 1, 1, 26, 12, 24, 14, 7, 33, 26, 7, 7, 7, 35, 16, 16, 12, 27, 19, 17, 7, 35, 36, 35, 36, 36, 36, 32, 29, 32, 36, 37, 37, 38, 37, 36, 36, 37, 34, 37, 38, 33, 38, 34, 33, 35, 38, 36, 38, 38, 34, 37, 32, 34, 33, 32, 36, 36, 37, 32, 36, 36, 36, 38, 33, 32, 38, 36, 38, 33, 36, 38, 36, 38, 34, 38, 34, 34, 34, 35, 32, 32, 32, 31, 33, 32]) [(‘SA’, ‘chr2,33141339,-,17S75M58S,0,4;’), (‘MC’, ‘151M’), (‘BD’, ‘NNNMMMNNOONKNMKMMKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKNMMMKKKKKKNNNMNONKKKKKKKNMMNMNONKNNNNNNONKNMKNNONLMMMKNNNKKNNNMNMNLONOOPPONOOONMKLL’), (‘MD’, ‘7A4C60’), (‘PG’, ‘MarkDuplicates’), (‘RG’, ‘20161125_01_Cardio’), (‘BI’, ‘QRQQPQQRQSRPPQNQQPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPQQQPPPPPPQPRQRSRPPPPPPPPPQPQRSRPQRRRQRRQPRQNRRQQPQQQPQRRNNRRQQPQRPRQRRQSRRSSRRQNPP’), (‘NM’, 2), (‘MQ’, 60), (‘AS’, 63), (‘XS’, 19)]
M03943:73:000000000-AV81U:1:1108:22334:19322 161 1 3310991 60 151M 2 33141307 151 GAGTGAGGGAGGATGCCCGAGGTCTCACGTCAGGACCCTCGACGGGGGTGGGGTGGAGAGGGAGGTGGTGGGTGGGGTGGGTGGTGCGGCCTGGAGTGGAGCCAAGGGGGGGGCGGCCAGGGTCCACATTGCCCTCTAACGGGACCCCTCG array(‘B’, [31, 31, 31, 31, 35, 33, 37, 36, 32, 31, 29, 32, 33, 33, 38, 38, 36, 37, 32, 34, 38, 36, 32, 37, 36, 38, 36, 37, 32, 33, 37, 35, 37, 37, 32, 37, 36, 36, 38, 37, 32, 33, 36, 32, 36, 36, 36, 36, 32, 38, 36, 36, 36, 27, 34, 36, 34, 38, 34, 34, 35, 36, 32, 34, 35, 31, 38, 36, 30, 38, 36, 35, 27, 33, 35, 35, 36, 27, 33, 36, 36, 27, 33, 36, 30, 38, 37, 30, 35, 38, 36, 36, 34, 38, 34, 39, 31, 38, 33, 34, 38, 37, 32, 36, 35, 39, 34, 36, 36, 36, 35, 35, 35, 36, 32, 35, 36, 29, 34, 37, 35, 34, 30, 35, 32, 30, 36, 34, 32, 32, 38, 36, 34, 35, 35, 35, 35, 32, 33, 36, 32, 33, 35, 30, 31, 34, 34, 33, 32, 32, 28]) [(‘BD’, ‘LLNNOPOOLOOOOOPPNKMNNNNNMNNNMMNNPNNNNKNNMNNMMKKKNNNKKNNNNNONNKNNNNNNNNNKNNNKKNNNKNNNNNOMMMNNONNNNNNNNONNKMNKKKKKKMMMMNNPNKNNNNNMOMNONKNNMMLMMMKNNNKKNNM’), (‘MD’, ‘5C103C41’), (‘PG’, ‘MarkDuplicates’), (‘RG’, ‘20161125_01_Cardio’), (‘BI’, ‘PPRQRRSSORRRRQRRPPQQRRRRQPQPQQRQRQRRQPRPQQRQQNNNRQRNNRQRRRQRRNRRRRQRRQRNRQRNNRQRNRQRRQRQQQPRSRRRQQRRRRPRPQRNNNNNNQQQQPRRRNRRQRPQQPQRPPRPQQNPQQNRRQPPRPQ’), (‘NM’, 2), (‘AS’, 141), (‘XS’, 21)]
M03943:73:000000000-AV81U:1:1109:23019:7394 401 1 3311088 1 102H48M 1 3311018 48 GGAGCCAAGGGGGGGGCGGCCAGGGTCCACATTGCCCTCTAACGGGAC array(‘B’, [35, 30, 18, 13, 31, 20, 26, 17, 36, 36, 36, 36, 36, 35, 36, 33, 31, 36, 37, 35, 38, 37, 36, 36, 37, 35, 36, 38, 33, 38, 34, 33, 35, 38, 36, 36, 38, 33, 35, 32, 33, 32, 31, 35, 33, 30, 31, 32]) [(‘SA’, ‘chr2,33141319,-,42S49M59S,1,0;’), (‘BD’, ‘NNNMNNLNKKKKKKNMMNMNONKNNNNNNONKONLOOPOMNNNLNNLL’), (‘MD’, ‘12C35’), (‘PG’, ‘MarkDuplicates’), (‘RG’, ‘20161125_01_Cardio’), (‘BI’, ‘QPRQRQPRPPPPPPOPQPQRSRPQRRRQRRQPRQNRRQQRRRRQQRPP’), (‘NM’, 1), (‘AS’, 43), (‘XS’, 19)]
M03943:73:000000000-AV81U:1:2104:5405:17741 83 1 3311083 60 44S106M 1 3310993 106 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCGCCCGGGAGTGGGGCCAGGGGGGGGGCGGCCAGGGTCCACATTGCCCTCTAACGGGACCCCTCGAGAAGCCTCAGCCTCGCCCGCTGACCCTCCTCTCTCCTGCCCAGGGC array(‘B’, [33, 33, 34, 30, 26, 2, 26, 2, 2, 22, 29, 35, 35, 26, 26, 2, 24, 2, 22, 1, 26, 32, 26, 22, 1, 35, 29, 26, 26, 22, 33, 22, 2, 1, 2, 2, 2, 12, 16, 13, 24, 24, 16, 4, 33, 24, 17, 32, 25, 23, 4, 4, 14, 6, 17, 16, 2, 35, 36, 36, 36, 35, 33, 30, 25, 25, 36, 38, 37, 38, 38, 36, 36, 37, 34, 37, 38, 33, 38, 34, 33, 35, 38, 32, 37, 38, 34, 37, 32, 34, 33, 32, 36, 36, 37, 32, 36, 36, 36, 38, 35, 32, 37, 37, 37, 34, 36, 38, 36, 38, 34, 39, 36, 38, 36, 38, 33, 32, 38, 36, 36, 32, 37, 39, 36, 37, 32, 36, 36, 38, 33, 36, 38, 33, 38, 33, 38, 32, 34, 32, 35, 31, 29, 34, 38, 35, 33, 31, 32, 31]) [(‘SA’, ‘chr2,33141320,-,38M112S,0,0;’), (‘MC’, ‘151M’), (‘BD’, This read instead is flagged as a paired read but not mapped in a proper pair:‘KKKKLLKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKNMMMKMMKNNNNNKKNMNONKKKKKKKNMMNMNONKNNNNNNONKNMKNNONLMMMKNNNKKNNNMNMNLNMNNOONMNNNMMKMMOPNNNKNNNNOPOPOOOQONLOPOKNLL’), (‘MD’, ‘7A4A4C88’), (‘PG’, ‘MarkDuplicates’), (‘RG’, ‘20161125_01_Cardio’), (‘BI’, ‘PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPQQNQQPQPRPRPPPQRSRPPPPPPPPPQPQRSRPQRRRQRRQPRQNRRQQQQQQPQRRNNRRQQPQRPRQRRQSRQRRQQQNQQRRQRRNRRRRRQRQRRRRSROSTSPPPP’), (‘NM’, 3), (‘MQ’, 60), (‘AS’, 91), (‘XS’, 0)]
This read, instead, have Quality base of 8. I tried to use -mbq 1 in my command line but I don’t see it again in bamout or gvcf DP:
M03943:73:000000000-AV81U:1:1113:3640:17203 83 1 3311050 60 15S134M 1 3311052 134 TTGGGTAAAAGAGACGGGAGGGGGTGGGGGGGGTGGGTGGTGCGACAAGGAGTGGAGCCAAGGGGGGGGCGGCCAGGATCCACATTGCCCTCTAACGGGACCCCTCGAGAAGCCTCAGCCTCGCCCGCTGACCCTCCTCTCTCCTGCCC array(‘B’, [5, 17, 22, 9, 24, 14, 24, 9, 9, 24, 13, 24, 13, 24, 16, 2, 2, 12, 15, 1, 1, 4, 4, 24, 16, 26, 22, 22, 1, 29, 4, 24, 13, 17, 22, 24, 12, 17, 23, 12, 15, 14, 16, 26, 16, 20, 9, 18, 7, 27, 27, 13, 17, 4, 16, 35, 12, 15, 20, 33, 16, 29, 8, 30, 23, 8, 8, 23, 11, 15, 11, 31, 14, 32, 36, 35, 30, 15, 30, 31, 38, 32, 34, 32, 25, 27, 11, 11, 33, 38, 32, 19, 32, 33, 32, 32, 36, 36, 33, 32, 33, 12, 11, 19, 33, 32, 37, 32, 17, 32, 28, 12, 29, 21, 30, 21, 36, 34, 11, 31, 14, 25, 33, 36, 33, 32, 36, 38, 30, 35, 32, 15, 15, 28, 33, 33, 34, 16, 23, 15, 23, 29, 32, 31, 31, 30, 28, 29, 30]) [(‘MC’, ‘132M’), (‘BD’, ‘KNKNOLDDLMNMNMMKNNNKKKNNNKKKKKKNNNKNNNNNNMMNNNLNNNNNNNNNMNNLNKKKKKKNMMNMNONNONNNNNONKNMKNNONLMMMKNNNKKNNNMNMNLNMNNOONMNNNMMKMMOPNNNLOOOOOPOPOOOQOMKLL’), (‘MD’, ‘6T6T15G1C0T17C11G71’), (‘PG’, ‘MarkDuplicates’), (‘RG’, ‘20161125_01_Cardio’), (‘BI’, ‘PRPQQQIIPQPQRQQPQPRPPPQPRPPPPPPQPRPQPRQPRPQRRQPRQPRPRQPRQRQPRPPPPPPPPQPQRSRQQQRRQRRQPRQNRRQQQQQQPQRRNNRRQQPQRPRQRRQSRQRRQQQNQQRRQRRNRRRRRQRRSSSSSQNPP’), (‘NM’, 7), (‘MQ’, 60), (‘AS’, 99), (‘XS’, 22)]
This read instead is flagged as a paired read but not mapped in a proper pair:
M03943:73:000000000-AV81U:1:1109:23019:7394 97 1 3311018 60 118M 2 33141318 118 CGTCAGGACCCTCGACGGGGGTGGGGTGGAGAGGGAGGTGGTGGGTGGGGTGGGTGGTGCGGCCTGGAGTGGAGCCAAGGGGGGGGCGGCCAGGGTCCACATTGCCCTCTAACGGGAC array(‘B’, [31, 29, 29, 32, 34, 35, 35, 34, 33, 34, 34, 36, 36, 32, 33, 36, 32, 36, 36, 36, 36, 32, 38, 36, 36, 36, 30, 34, 33, 33, 38, 34, 38, 36, 36, 33, 38, 36, 27, 37, 36, 27, 38, 35, 36, 27, 34, 36, 36, 35, 32, 38, 36, 36, 31, 38, 36, 32, 38, 37, 31, 36, 38, 36, 36, 38, 35, 33, 38, 32, 38, 36, 34, 38, 36, 36, 36, 33, 38, 36, 36, 36, 36, 35, 35, 35, 36, 31, 35, 36, 34, 33, 37, 35, 35, 32, 35, 35, 34, 36, 34, 32, 32, 38, 36, 35, 35, 35, 35, 35, 32, 33, 36, 32, 35, 35, 33, 36]) [(‘BD’, ‘LLMNOQOOOOLOONOONNKKKNNNKKNNNNNONNKNNNNNNNNNKNNNKKNNNKNNNNNOMMMNNONNNNNNNNONNKMNKKKKKKMMMMNNPNKNNNNNMOMNONKNNMMLMMMKNN’), (‘MD’, ‘82C35’), (‘PG’, ‘MarkDuplicates’), (‘RG’, ‘20161125_01_Cardio’), (‘BI’, ‘PPQRRSSSSRPRPQQRQQNNNRQRNNRQRRRQRRNRRRRQRRQRNRQRNNRQRNRQRRQRQQQPRSRRRQQRRRRPRPQRNNNNNNQQQQPRRRNRRQRPQQPQRPPRPQQNPQQNRR’), (‘NM’, 1), (‘AS’, 113), (‘XS’, 21)]
However, I really don’t understand why these reads have been filtered. They all have Quality score > 30 (only one with quality = 22) and mapping quality = 60 :
M03943:73:000000000-AV81U:1:1114:22354:18380 147 1 3311054 60 23S125M 1 3311048 125 ATGTGTATAAGAAGCAGGCGGGGGGTGGTGGGTGGGTTGGTTGGTGCGGCCTGGAGTGCAGCCAAGGGGCTGGCGGCCAGGCTCCCCATTGCCCTCTAACGGGACCCCTCGAGAAGCCTCAGCCTCGCCCGCTGACCCTCCTCTCTCC array(‘B’, [15, 17, 13, 17, 13, 13, 24, 13, 8, 18, 16, 24, 18, 14, 16, 31, 26, 25, 15, 22, 1, 1, 3, 22, 12, 16, 22, 12, 16, 22, 3, 12, 16, 23, 3, 25, 6, 17, 3, 25, 7, 17, 33, 32, 25, 25, 14, 7, 10, 27, 21, 27, 23, 16, 18, 27, 25, 25, 19, 18, 12, 27, 20, 9, 17, 33, 33, 24, 10, 18, 16, 7, 31, 14, 30, 12, 10, 32, 18, 7, 10, 18, 28, 10, 14, 14, 32, 14, 10, 34, 34, 10, 35, 20, 30, 28, 27, 32, 27, 27, 30, 8, 15, 27, 33, 11, 36, 34, 29, 32, 30, 31, 15, 31, 31, 30, 29, 21, 31, 33, 35, 10, 10, 34, 29, 15, 34, 33, 33, 28, 11, 19, 34, 30, 15, 15, 15, 31, 31, 14, 31, 29, 31, 30, 28, 28, 27, 31]) [(‘MC’, ‘131M’), (‘BD’, ‘OMNMNLLLLMNLNOONNMMKKKKNNNNNNKNNNKNMKNNMKNNNNMMNMNPNNNNNNOONMNNLNKKNOPNNMMNMNONNONNKKNONKNMKNNONLMMMKNNNKKNNNMNMNLNMNNOONMNNNMMKMMOPOOOLOOOOOPOPNNLL’), (‘MD’, ‘13G3G17G11G10G3A62’), (‘PG’, ‘MarkDuplicates’), (‘RG’, ‘20161125_01_Cardio’), (‘BI’, ‘QQPQQPQQPQRPRRSRPPQPPPPQPRQPRPQPRPQPPRQPPRQPRPQPQRRRQPRPRRSRQRQPRPPPRRRPPQPQRSRPRRRNNRRQPRQNRRQQQQQQPQRRNNRRQQPQRPQQRRQSRQRRQQQNQQRRQRRNRRRSSRSRRRPP’), (‘NM’, 6), (‘MQ’, 60), (‘AS’, 95), (‘XS’, 21)]
M03943:73:000000000-AV81U:1:2112:24719:22353 147 1 3311065 60 150M 1 3310953 150 GGGGGGGGGGTGCGGCCTGGAGGGGAGCCAGGGGGGGGGCGGCCAGGGTCCACATTGCCCTCTAACGGGACCCCTCGAGAAGCCTCAGCCTCGCCCGCTGACCCTCCTCTCTCCTGCCCAGGGCCCGGGACAGGCCACCTGCCCCAGGGG array(‘B’, [25, 24, 2, 2, 26, 32, 26, 22, 26, 26, 24, 12, 15, 29, 14, 24, 20, 17, 29, 24, 17, 22, 3, 32, 24, 17, 13, 5, 16, 24, 1, 36, 35, 36, 35, 36, 33, 33, 30, 25, 36, 37, 38, 38, 38, 36, 36, 36, 32, 37, 38, 34, 38, 34, 33, 33, 38, 35, 37, 38, 34, 38, 32, 34, 33, 32, 36, 36, 31, 27, 36, 35, 36, 34, 34, 32, 37, 35, 37, 34, 37, 38, 36, 38, 34, 38, 37, 37, 36, 38, 33, 32, 37, 36, 36, 32, 37, 38, 36, 37, 32, 36, 36, 38, 34, 36, 38, 33, 38, 35, 38, 33, 36, 38, 36, 37, 36, 37, 38, 37, 36, 35, 38, 36, 36, 32, 36, 37, 37, 32, 39, 36, 37, 38, 36, 38, 32, 36, 35, 33, 34, 34, 34, 34, 33, 34, 31, 31, 33, 31]) [(‘MC’, ‘151M’), (‘BD’, ‘KKKKKKKKNNNMMNMNPNNNNKKNNNMNONKKKKKKKNMMNMNONKNNNNNNONKNMKNNONLMMMKNNNKKNNNMNMNLNMNNOONMNNNMMKMMOPNNNKNNNNNONONNNPNMKNONKNMKMMKNNNONNMOOOOQONLLOPOKKLL’), (‘MD’, ‘3T3T14T7A4C93C16C3’), (‘PG’, ‘MarkDuplicates’), (‘RG’, ‘20161125_01_Cardio’), (‘BI’, ‘PPPPPPPPQPRPQPQRRRQPRPPQPRQRSRPPPPPPPPPQPQRSRPQRRRQRRQPRQNRRQQQQQQPQRRNNRRQQPQRPRQRRQSRQRRQQQNQQRRQRRNRRRRRQRQRRRRRQMRSRPPQNQQPQRRSRPQRQRRRRQOOSTSPPPP’), (‘NM’, 7), (‘MQ’, 60), (‘AS’, 118), (‘XS’, 20)]
M03943:73:000000000-AV81U:1:2112:13824:17315 147 1 3311069 60 41S110M 1 3311018 110 CGGGGCGGGGGGGGGGGGGAGGGGGGGTGGGGGGGGGGGGGGGGTGGGGCGGCCTGGTGTGGGCCCAAGGGGGGGGCGGCCAGGGTCCACATTGCCCTCTAACGGGACCCCTCGAGAAGCCTCAGCCTCGCCCGCTGACCCTCCTCTCTCC array(‘B’, [24, 2, 2, 2, 13, 15, 35, 32, 26, 22, 1, 1, 22, 35, 35, 29, 29, 22, 14, 14, 30, 1, 30, 22, 22, 22, 13, 15, 1, 1, 1, 1, 26, 22, 1, 1, 32, 29, 1, 26, 1, 32, 29, 26, 16, 22, 24, 1, 12, 14, 4, 13, 5, 26, 16, 26, 12, 16, 12, 25, 1, 4, 12, 27, 24, 20, 9, 17, 7, 36, 32, 26, 23, 7, 7, 10, 25, 36, 36, 35, 38, 31, 32, 30, 37, 33, 35, 38, 31, 39, 34, 33, 35, 37, 36, 37, 38, 30, 38, 32, 34, 32, 32, 36, 36, 37, 32, 36, 36, 36, 38, 33, 32, 37, 37, 35, 34, 36, 37, 36, 38, 33, 38, 36, 34, 33, 31, 31, 31, 36, 36, 37, 32, 39, 38, 34, 35, 32, 35, 32, 32, 34, 27, 32, 31, 32, 33, 32, 30, 32, 31]) [(‘MC’, ‘150M’), (‘BD’, ‘MKKNMMKKKKKKKKKKKNNNKKKKKNNNKKKKKKKKKKKKKKNNNKKNMMNMNPNNNMNNKNMKNNLNKKKKKKNMMNMNONKNNNNNNONKNMKNNONLMMMKNNNKKNNNMNMNLNMNNOONMNNNMMKMMOPOOOLOOOOOPOPNNLL’), (‘MD’, ‘6T9A4A0G8C78’), (‘PG’, ‘MarkDuplicates’), (‘RG’, ‘20161125_01_Cardio’), (‘BI’, ‘QPPPPQPPPPPPPPPPPQPRPPPPPQPRPPPPPPPPPPPPPPQPRPPPPQPQRRRQPQPRPPQNRQPRPPPPPPPPQPQRSRPQRRRQRRQPRQNRRQQQQQQPQRRNNRRQQPQRPQQRRQSRQRRQQQNQQRRQRRNRRRSSRSRRRPP’), (‘NM’, 5), (‘MQ’, 60), (‘AS’, 85), (‘XS’, 19)]
M03943:73:000000000-AV81U:1:2102:28044:15728 83 1 3311077 60 72S78M 1 3311001 78 CCCAGGGGTCCCCTGCGGGGGCCTGGGGGGGGGGGGGGTTGAGGAGGGGGGGGGGGGGGGGTTGGGGGGGTCCGGCCTGGAGGGGAGCCAGGGGGGGGGCGGCCAGGGTCCACATTGCCCTCTAACGGGACCCCTCGAGAAGCCTCAGCC array(‘B’, [20, 6, 17, 17, 4, 4, 22, 12, 16, 3, 2, 3, 15, 24, 25, 16, 2, 2, 4, 4, 14, 24, 18, 16, 1, 2, 22, 2, 32, 26, 22, 22, 22, 1, 29, 29, 22, 26, 10, 17, 24, 17, 4, 24, 15, 30, 29, 35, 29, 22, 22, 2, 2, 2, 23, 2, 33, 29, 23, 4, 13, 7, 16, 8, 26, 8, 8, 8, 23, 13, 14, 11, 15, 24, 10, 15, 17, 17, 24, 15, 32, 24, 8, 8, 16, 31, 12, 33, 20, 27, 8, 32, 32, 32, 33, 32, 32, 29, 34, 25, 35, 12, 33, 38, 36, 29, 12, 36, 33, 15, 18, 34, 38, 35, 33, 35, 34, 35, 36, 38, 33, 30, 30, 33, 33, 30, 36, 32, 37, 32, 36, 36, 36, 38, 33, 30, 37, 35, 36, 32, 31, 30, 29, 37, 33, 34, 32, 30, 32, 31]) [(‘MC’, ‘151M’), (‘BD’, ‘KNONLLNNNKKNPNMMKKKNMNPNKKKKKKKKKKKKNMKNNNNNNKKKKKKKKKKKKKKNMKNKKKKKNNNMMNMNPNNNNKKNNNMNONKKKKKKKNMMNMNONKNNNNNNONKNMKNNONLMMMKNNNKKOOONONOMONOOPPNMLL’), (‘MD’, ‘10T7A4C54’), (‘PG’, ‘MarkDuplicates’), (‘RG’, ‘20161125_01_Cardio’), (‘BI’, ‘NRSRPPQRRNNRRRPQPPPPQRRRPPPPPPPPPPPPQPPQPRQPRPPPPPPPPPPPPPPQPPRPPPPPQRRQQPQRRRQPRPPQPRQRSRPPPPPPPPPQPQRSRPQRRRQRRQPRQNRRQQQQQQPQRRNNRRQQPQRPSRSSRTRQPP’), (‘NM’, 3), (‘MQ’, 60), (‘AS’, 63), (‘XS’, 20)]
The only thing that seems to be strange in previous reads is CIGAR string, with Soft clipped bases, but I can confirm that I didn’t use —dontUseSoftClippedBases command. Just to be sure, does HC use any filter based on edit distance or alignment score?
Looking forward for some news,
Thank you for your help
Matteo
From Sheila on 2017-04-21
@Matteodigg
Hi Matteo,
Yes, that is correct HaplotypeCaller does not use any filter based on edit distance or alignment score.
The internal filtering in HaplotypeCaller is complicated. Different filtering is done at different stages. Right now, it is not possible to get perfect accounting for every single read in GATK3. However, this has been reimplemented in a more straightforward way in GATK4. For now, in GATK3 HaplotypeCaller those minor differences are just a limitation of the tool.
-Sheila