Package picard.vcf

Class GenotypeConcordance

java.lang.Object
picard.cmdline.CommandLineProgram
picard.vcf.GenotypeConcordance

@DocumentedFeature public class GenotypeConcordance extends CommandLineProgram

Summary

Calculates the concordance between genotype data of one sample in each of two VCFs - one being considered the truth (or reference) the other being the call. The concordance is broken into separate results sections for SNPs and indels. Satistics are reported in three different files.

Details

This tool evaluates the concordance between genotype calls for a sample in different callsets where one is being considered as the \"truth\" (aka standard, or reference) and the other as the \"call\" that is being evaluated for accuracy. The Comparison can be restricted to a confidence interval which is typically used in order to enable proper assessment of False Positives and the False-Positive Rate (FPR).

Usage example

Compare two VCFs within a confidence region

 java -jar picard.jar GenotypeConcordance \\
       CALL_VCF=input.vcf \\
       CALL_SAMPLE=sample_name \\
       O=gc_concordance.vcf \\
       TRUTH_VCF=truth_set.vcf \\
       TRUTH_SAMPLE=sample_in_truth \\
       INTERVALS=confident.interval_list \\
       MISSING_SITES_HOM_REF = true
 

Output Metrics:

Output metrics consists of GenotypeConcordanceContingencyMetrics, GenotypeConcordanceSummaryMetrics, and GenotypeConcordanceDetailMetrics. For each set of metrics, the data is broken into separate sections for SNPs and INDELs. Note that only SNP and INDEL variants are considered, MNP, Symbolic, and Mixed classes of variants are not included.
  • GenotypeConcordanceContingencyMetrics enumerate the constituents of each contingent in a callset including true-positive (TP), true-negative (TN), false-positive (FP), and false-negative (FN) calls.
  • GenotypeConcordanceDetailMetrics include the numbers of SNPs and INDELs for each contingent genotype as well as the number of validated genotypes.
  • GenotypeConcordanceSummaryMetrics provide specific details for the variant caller performance on a callset including values for sensitivity, specificity, and positive predictive values.


Useful definitions applicable to alleles and genotypes:
  • Truthset - A callset (typically in VCF format) containing variant calls and genotypes that have been cross-validated with multiple technologies e.g. Genome In A Bottle Consortium (GIAB) (https://sites.stanford.edu/abms/giab)
  • TP - True-positives are variant sites that match against the truth-set
  • FP - False-positives are reference sites miscalled as variant
  • FN - False-negatives are variant sites miscalled as reference
  • TN - True-negatives are correctly called as reference
  • Validated genotypes - are TP sites where the exact genotype (HET or HOM-VAR) appears in the truth-set

VCF Output:

  • The concordance state will be stored in the CONC_ST tag in the INFO field
  • The truth sample name will be \"truth\" and call sample name will be \"call\"
  • Field Details

    • TRUTH_VCF

      @Argument(shortName="TV", doc="The VCF containing the truth sample") public PicardHtsPath TRUTH_VCF
    • CALL_VCF

      @Argument(shortName="CV", doc="The VCF containing the call sample") public PicardHtsPath CALL_VCF
    • OUTPUT

      @Argument(shortName="O", doc="Basename for the three metrics files that are to be written. Resulting files will be <OUTPUT>.genotype_concordance_summary_metrics, <OUTPUT>.genotype_concordance_detail_metrics, and <OUTPUT>.genotype_concordance_contingency_metrics.") public File OUTPUT
    • OUTPUT_VCF

      @Argument(doc="Output a VCF annotated with concordance information.") public boolean OUTPUT_VCF
    • TRUTH_SAMPLE

      @Argument(shortName="TS", doc="The name of the truth sample within the truth VCF. Not required if only one sample exists.", optional=true) public String TRUTH_SAMPLE
    • CALL_SAMPLE

      @Argument(shortName="CS", doc="The name of the call sample within the call VCF. Not required if only one sample exists.", optional=true) public String CALL_SAMPLE
    • INTERVALS

      @Argument(doc="One or more interval list files that will be used to limit the genotype concordance. Note - if intervals are specified, the VCF files must be indexed.", optional=true) public List<PicardHtsPath> INTERVALS
    • INTERSECT_INTERVALS

      @Argument(doc="If true, multiple interval lists will be intersected. If false multiple lists will be unioned.") public boolean INTERSECT_INTERVALS
    • MIN_GQ

      @Argument(doc="Genotypes below this genotype quality will have genotypes classified as LowGq.") public int MIN_GQ
    • MIN_DP

      @Argument(doc="Genotypes below this depth will have genotypes classified as LowDp.") public int MIN_DP
    • OUTPUT_ALL_ROWS

      @Argument(doc="If true, output all rows in detailed statistics even when count == 0. When false only output rows with non-zero counts.") public boolean OUTPUT_ALL_ROWS
    • USE_VCF_INDEX

      @Argument(doc="If true, use the VCF index, else iterate over the entire VCF.", optional=true) public boolean USE_VCF_INDEX
    • MISSING_SITES_HOM_REF

      @Argument(shortName="MISSING_HOM", doc="Default is false, which follows the GA4GH Scheme. If true, missing sites in the truth set will be treated as HOM_REF sites and sites missing in both the truth and call sets will be true negatives. Useful when hom ref sites are left out of the truth set. This flag can only be used with a high confidence interval list.") public boolean MISSING_SITES_HOM_REF
    • IGNORE_FILTER_STATUS

      @Argument(doc="Default is false. If true, filter status of sites will be ignored so that we include filtered sites when calculating genotype concordance. ", optional=true) public boolean IGNORE_FILTER_STATUS
    • SUMMARY_METRICS_FILE_EXTENSION

      public static final String SUMMARY_METRICS_FILE_EXTENSION
      See Also:
    • DETAILED_METRICS_FILE_EXTENSION

      public static final String DETAILED_METRICS_FILE_EXTENSION
      See Also:
    • CONTINGENCY_METRICS_FILE_EXTENSION

      public static final String CONTINGENCY_METRICS_FILE_EXTENSION
      See Also:
    • OUTPUT_VCF_FILE_EXTENSION

      public static final String OUTPUT_VCF_FILE_EXTENSION
      See Also:
    • snpCounter

      protected GenotypeConcordanceCounts snpCounter
    • indelCounter

      protected GenotypeConcordanceCounts indelCounter
    • CONTINGENCY_STATE_TAG

      public static final String CONTINGENCY_STATE_TAG
      See Also:
    • CONTINGENCY_STATE_HEADER_LINE

      public static final htsjdk.variant.vcf.VCFHeaderLine CONTINGENCY_STATE_HEADER_LINE
    • OUTPUT_VCF_TRUTH_SAMPLE_NAME

      public static final String OUTPUT_VCF_TRUTH_SAMPLE_NAME
      See Also:
    • OUTPUT_VCF_CALL_SAMPLE_NAME

      public static final String OUTPUT_VCF_CALL_SAMPLE_NAME
      See Also:
  • Constructor Details

    • GenotypeConcordance

      public GenotypeConcordance()
  • Method Details

    • getSnpCounter

      public GenotypeConcordanceCounts getSnpCounter()
    • getIndelCounter

      public GenotypeConcordanceCounts getIndelCounter()
    • customCommandLineValidation

      protected String[] customCommandLineValidation()
      Description copied from class: CommandLineProgram
      Put any custom command-line validation in an override of this method. clp is initialized at this point and can be used to print usage and access argv. Any options set by command-line parser can be validated.
      Overrides:
      customCommandLineValidation in class CommandLineProgram
      Returns:
      null if command line is valid. If command line is invalid, returns an array of error message to be written to the appropriate place.
    • doWork

      protected int doWork()
      Description copied from class: CommandLineProgram
      Do the work after command line has been parsed. RuntimeException may be thrown by this method, and are reported appropriately.
      Specified by:
      doWork in class CommandLineProgram
      Returns:
      program exit status.
    • classifyVariants

      public static boolean classifyVariants(Optional<htsjdk.variant.variantcontext.VariantContext> truthContext, String truthSample, Optional<htsjdk.variant.variantcontext.VariantContext> callContext, String callSample, int minGq, int minDp, boolean ignoreFilterStatus)
    • classifyVariants

      public static boolean classifyVariants(Optional<htsjdk.variant.variantcontext.VariantContext> truthContext, String truthSample, Optional<htsjdk.variant.variantcontext.VariantContext> callContext, String callSample, Optional<GenotypeConcordanceCounts> snpCounter, Optional<GenotypeConcordanceCounts> indelCounter, int minGq, int minDp, boolean ignoreFilteredStatus)
      Attempts to determine the concordance state given the truth and all variant context and optionally increments the genotype concordance count for the given variant type (SNP or INDEL). This will ignore cases where an indel was found in the truth and a SNP was found in the call, and vice versa. We typically fail to classify Mixed, Symbolic variants, or MNPs.
      Parameters:
      truthContext - A variant context representing truth
      truthSample - The name of the truth sample
      callContext - A variant context representing the call
      callSample - The name of the call sample
      snpCounter - optionally a place to increment the counts for SNP truth/call states
      indelCounter - optionally a place to increment the counts for INDEL truth/call states
      minGq - Threshold for filtering by genotype attribute GQ
      minDp - Threshold for filtering by genotype attribute DP
      Returns:
      true if the concordance state could be classified.
    • addMissingTruthAndMissingCallStates

      public static void addMissingTruthAndMissingCallStates(double numVariants, long intervalBaseCount, GenotypeConcordanceCounts counter)
      Method to add missing sites that are KNOWN to be HOM_REF in the case of the NIST truth data set.
    • outputDetailMetricsFile

      public static void outputDetailMetricsFile(htsjdk.variant.variantcontext.VariantContext.Type variantType, htsjdk.samtools.metrics.MetricsFile<GenotypeConcordanceDetailMetrics,?> genotypeConcordanceDetailMetricsFile, GenotypeConcordanceCounts counter, String truthSampleName, String callSampleName, boolean missingSitesHomRef, boolean outputAllRows)
      Outputs the detailed statistics tables for SNP and Indel match categories.
    • normalizeAlleles

      protected static final GenotypeConcordance.Alleles normalizeAlleles(htsjdk.variant.variantcontext.VariantContext truthContext, String truthSample, htsjdk.variant.variantcontext.VariantContext callContext, String callSample, Boolean ignoreFilteredStatus)
      Gets the alleles for the truth and call genotypes. In particular, this handles the case where indels can have different reference alleles.
    • determineState

      public static final GenotypeConcordanceStates.TruthAndCallStates determineState(htsjdk.variant.variantcontext.VariantContext truthContext, String truthSample, htsjdk.variant.variantcontext.VariantContext callContext, String callSample, int minGq, int minDp, Boolean ignoreFilteredStatus)
      A method to determine the truth and call states for a pair of variant contexts representing truth and call. A variety of variant and genotype-level checks are first used to determine if either of the the variant contexts are filtered and after that a comparison of the called genotype alleles to determine appropriate truth and call state Note that this method does NOT check for SNP versus Indel. It is assumed that that check is done by the caller and the results of this method are stored by SNP/Indel. Note that if a variant context has BOTH GQ and DP less than the specified threshold, then it will be of Truth/Call State LOW_GQ
      Parameters:
      truthContext - A variant context representing truth
      truthSample - The name of the truth sample
      callContext - A variant context representing the call
      callSample - The name of the call sample
      minGq - Threshold for filtering by genotype attribute GQ
      minDp - Threshold for filtering by genotype attribute DP
      Returns:
      TruthAndCallStates object containing the TruthState and CallState determined here.