Class BasecallsConverter<CLUSTER_OUTPUT_RECORD>

java.lang.Object
picard.illumina.BasecallsConverter<CLUSTER_OUTPUT_RECORD>
Type Parameters:
CLUSTER_OUTPUT_RECORD - The type of record that this converter will convert to.
Direct Known Subclasses:
SortedBasecallsConverter, UnsortedBasecallsConverter

public abstract class BasecallsConverter<CLUSTER_OUTPUT_RECORD> extends Object
BasecallsConverter uses an underlying IlluminaDataProvider to convert parsed and decoded sequencing data from standard Illumina formats to specific output records (FASTQ records/SAM records).

The underlying IlluminaDataProvider applies several optional transformations, which can include EAMSS filtering, non-PF read filtering, and quality score recoding using a BclQualityEvaluationStrategy.

The converter can also limit the scope of data that is converted from the data provider by setting the tile to start on (firstTile) and the total number of tiles to process (tileLimit).

Additionally, BasecallsConverter can optionally demultiplex reads by writing barcode-specific reads to their associated writers.

  • Field Details

    • DATA_TYPES_WITH_BARCODE

      public static final Set<IlluminaDataType> DATA_TYPES_WITH_BARCODE
    • DATA_TYPES_WITHOUT_BARCODE

      public static final Set<IlluminaDataType> DATA_TYPES_WITHOUT_BARCODE
    • laneFactories

      protected final IlluminaDataProviderFactory[] laneFactories
    • demultiplex

      protected final boolean demultiplex
    • ignoreUnexpectedBarcodes

      protected final boolean ignoreUnexpectedBarcodes
    • barcodeRecordWriterMap

      protected final Map<String,? extends htsjdk.io.Writer<CLUSTER_OUTPUT_RECORD>> barcodeRecordWriterMap
    • includeNonPfReads

      protected final boolean includeNonPfReads
    • writerPool

      protected final htsjdk.io.AsyncWriterPool writerPool
    • converter

    • tiles

      protected List<Integer> tiles
    • barcodeExtractor

      protected BarcodeExtractor barcodeExtractor
    • TILE_NUMBER_COMPARATOR

      public static final Comparator<Integer> TILE_NUMBER_COMPARATOR
      A comparator used to sort Illumina tiles in their proper order. Because the tile number is followed by a colon, a tile number that is a prefix of another tile number should sort after it (e.g. 10 sorts after 100). Tile numbers with the same number of digits are sorted numerically.
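The ordering described for TILE_NUMBER_COMPARATOR can be illustrated with a minimal sketch. This is not the Picard source: the class name TileOrderSketch, the field TILE_ORDER, and the helper sorted are hypothetical, and the code is written only from the description above, assuming non-negative tile numbers.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TileOrderSketch {
    // Hypothetical stand-in for TILE_NUMBER_COMPARATOR, derived from its description.
    static final Comparator<Integer> TILE_ORDER = (a, b) -> {
        final String s1 = a.toString();
        final String s2 = b.toString();
        // A tile number that is a strict prefix of the other sorts after it.
        if (s1.length() < s2.length() && s2.startsWith(s1)) return 1;
        if (s2.length() < s1.length() && s1.startsWith(s2)) return -1;
        // Otherwise compare as strings; with equal digit counts this is numeric order.
        return s1.compareTo(s2);
    };

    // Returns a sorted copy so the caller's list is left untouched.
    static List<Integer> sorted(List<Integer> tiles) {
        final List<Integer> copy = new ArrayList<>(tiles);
        copy.sort(TILE_ORDER);
        return copy;
    }

    public static void main(String[] args) {
        // prints [100, 10, 1101, 1102]
        System.out.println(sorted(List.of(10, 1101, 100, 1102)));
    }
}
```

The result matches a plain string sort of the tile numbers with a colon appended, since a colon compares greater than any digit.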
  • Constructor Details

    • BasecallsConverter

      public BasecallsConverter(File basecallsDir, File barcodesDir, int[] lanes, ReadStructure readStructure, Map<String,? extends htsjdk.io.Writer<CLUSTER_OUTPUT_RECORD>> barcodeRecordWriterMap, boolean demultiplex, Integer firstTile, Integer tileLimit, BclQualityEvaluationStrategy bclQualityEvaluationStrategy, boolean ignoreUnexpectedBarcodes, boolean applyEamssFiltering, boolean includeNonPfReads, htsjdk.io.AsyncWriterPool writerPool, BarcodeExtractor barcodeExtractor)
      Constructs a new BasecallsConverter object.
      Parameters:
      basecallsDir - Where to read basecalls from.
      barcodesDir - Where to read barcodes from (optional; use basecallsDir if not specified).
      lanes - What lanes to process.
      readStructure - How to interpret each cluster.
      barcodeRecordWriterMap - Map from barcode to CLUSTER_OUTPUT_RECORD writer. If demultiplex is false, must contain one writer stored with key=null.
      demultiplex - If true, output is split by barcode, otherwise all are written to the same output stream.
      firstTile - (For debugging) If non-null, start processing at this tile.
      tileLimit - (For debugging) If non-null, process no more than this many tiles.
      bclQualityEvaluationStrategy - The basecall quality evaluation strategy that is applied to decoded base calls.
      ignoreUnexpectedBarcodes - If true, will ignore reads whose called barcode is not found in barcodeRecordWriterMap.
      applyEamssFiltering - If true, apply EAMSS filtering if parsing BCLs for bases and quality scores.
      includeNonPfReads - If true, will include ALL reads (including those which do not have PF set). This option does nothing for instruments that output CBCLs (NovaSeq).
      barcodeExtractor - The `BarcodeExtractor` used to do inline barcode matching.
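The key=null convention for barcodeRecordWriterMap and the effect of ignoreUnexpectedBarcodes can be sketched with standard-library types only. Everything in this sketch is hypothetical: RecordSink stands in for htsjdk.io.Writer, ListSink is an in-memory writer, and the exception thrown when ignoreUnexpectedBarcodes is false is an assumption, not confirmed Picard behaviour. The sketch uses java.util.HashMap because it permits a null key, while Map.of does not.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WriterMapSketch {
    /** Hypothetical stand-in for htsjdk.io.Writer; only the write call is modelled. */
    interface RecordSink<T> { void write(T record); }

    /** Collects records in memory so the routing can be inspected. */
    static final class ListSink implements RecordSink<String> {
        final List<String> records = new ArrayList<>();
        @Override public void write(String record) { records.add(record); }
    }

    /** Builds the map for the non-demultiplexed case: one writer stored under the null key. */
    static Map<String, ListSink> singleOutputMap() {
        final Map<String, ListSink> map = new HashMap<>();
        map.put(null, new ListSink());
        return map;
    }

    /** Writes the record to the writer for its barcode; returns false if the record was skipped. */
    static boolean route(Map<String, ? extends RecordSink<String>> writers, String barcode,
                         String record, boolean ignoreUnexpectedBarcodes) {
        final RecordSink<String> sink = writers.get(barcode);
        if (sink == null) {
            if (ignoreUnexpectedBarcodes) return false; // unexpected barcode is dropped
            // Assumption: failing loudly is one plausible choice when barcodes are not ignored.
            throw new IllegalStateException("No writer for barcode " + barcode);
        }
        sink.write(record);
        return true;
    }

    public static void main(String[] args) {
        final Map<String, ListSink> writers = singleOutputMap();
        route(writers, null, "read-1", false);  // no demultiplexing: barcode is null
        route(writers, "ACGT", "read-2", true); // unexpected barcode, ignored
        System.out.println(writers.get(null).records); // prints [read-1]
    }
}
```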
  • Method Details

    • processTilesAndWritePerSampleOutputs

      public abstract void processTilesAndWritePerSampleOutputs(Set<String> barcodes) throws IOException
      Abstract method for processing tiles of data and outputting records for each barcode.
      Parameters:
      barcodes - The barcodes used optionally for demultiplexing. Must contain at least a single null value if no demultiplexing is being done.
      Throws:
      IOException
    • closeWriters

      public void closeWriters() throws IOException
      Closes all writers. If an AsyncWriterPool is used, calls close on the pool; otherwise iterates over each writer and closes it.
      Throws:
      IOException - thrown if there is an error closing a writer.
    • getTiledFiles

      public static File[] getTiledFiles(File baseDirectory, Pattern pattern)
      Applies a lane- and tile-based regex and returns all files matching that regex for each tile.
      Parameters:
      baseDirectory - The directory to search for tiled files.
      pattern - The pattern used to match files.
      Returns:
      A file array of all of the tile based files that match the regex pattern.
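A rough sketch of this kind of regex-based file lookup, using only java.io.File and java.util.regex. The class TiledFilesSketch, the method filesMatching, and the file-name pattern in main are hypothetical. Whether getTiledFiles matches the whole file name and whether it sorts the result are assumptions made here for illustration.

```java
import java.io.File;
import java.util.Arrays;
import java.util.regex.Pattern;

public class TiledFilesSketch {
    /**
     * Returns the files directly inside baseDirectory whose names fully match the pattern.
     * Assumptions: whole-name matching and path-sorted output.
     */
    static File[] filesMatching(File baseDirectory, Pattern pattern) {
        final File[] matches =
                baseDirectory.listFiles((dir, name) -> pattern.matcher(name).matches());
        if (matches == null) {
            return new File[0]; // not a directory, or an I/O error occurred
        }
        Arrays.sort(matches); // File is Comparable; sorts by path name
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical lane/tile pattern, e.g. s_1_1101.bcl
        final Pattern pattern = Pattern.compile("s_(\\d+)_(\\d+)\\.bcl");
        final File dir = new File(args.length > 0 ? args[0] : ".");
        for (File f : filesMatching(dir, pattern)) {
            System.out.println(f.getName());
        }
    }
}
```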
    • getDataTypesFromReadStructure

      protected static Set<IlluminaDataType> getDataTypesFromReadStructure(ReadStructure readStructure, boolean demultiplex, File barcodesDir)
      Given a read structure, returns the data types that need to be parsed for this run.
      Parameters:
      readStructure - The read structure that defines how the read is set up.
      demultiplex - If true, output is split by barcode, otherwise all are written to the same output stream.
      barcodesDir - The barcodes dir that contains barcode files.
      Returns:
      The set of data types needed to satisfy the read structure.
    • getLaneFactories

      protected IlluminaDataProviderFactory[] getLaneFactories()
      Gets the data provider factories used to create the underlying data providers.
      Returns:
      The factories used to create the underlying data providers.
    • setConverter

      protected void setConverter(BasecallsConverter.ClusterDataConverter<CLUSTER_OUTPUT_RECORD> converter)
      Must be called before doTileProcessing. The converter is not passed to the constructor because the IlluminaDataProviderFactory is often needed in order to construct it.
      Parameters:
      converter - Converts ClusterData to CLUSTER_OUTPUT_RECORD
    • setTileLimits

      protected void setTileLimits(Integer firstTile, Integer tileLimit)
      Uses the firstTile and tileLimit parameters to set which tiles will be processed. The processor will start with firstTile and continue to process tiles in order until it has processed at most tileLimit tiles.
      Parameters:
      firstTile - The tile to begin processing at.
      tileLimit - The maximum number of tiles to process.
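The tile selection described for setTileLimits can be sketched as a pure function over a list of tile numbers. TileLimitSketch and limitTiles are hypothetical names. The sketch assumes the tile list is already in processing order, and throwing when firstTile is not found is an assumption about error handling.

```java
import java.util.List;

public class TileLimitSketch {
    /**
     * Returns the tiles to process: starting at firstTile (if non-null) and
     * keeping at most tileLimit tiles (if non-null).
     */
    static List<Integer> limitTiles(List<Integer> tiles, Integer firstTile, Integer tileLimit) {
        List<Integer> selected = tiles;
        if (firstTile != null) {
            final int start = selected.indexOf(firstTile);
            if (start < 0) {
                // Assumption: an unknown starting tile is treated as an error.
                throw new IllegalArgumentException("firstTile=" + firstTile + " was not found");
            }
            selected = selected.subList(start, selected.size());
        }
        if (tileLimit != null && selected.size() > tileLimit) {
            selected = selected.subList(0, tileLimit);
        }
        return selected;
    }

    public static void main(String[] args) {
        final List<Integer> tiles = List.of(1101, 1102, 1103, 1104);
        System.out.println(limitTiles(tiles, 1102, 2));       // prints [1102, 1103]
        System.out.println(limitTiles(tiles, null, null));    // prints [1101, 1102, 1103, 1104]
    }
}
```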
    • maybeDemultiplex

      protected String maybeDemultiplex(ClusterData cluster, Map<String,BarcodeMetric> metrics, BarcodeMetric noMatch, ReadStructure outputReadStructure)
      If we are demultiplexing and a barcodeExtractor is defined then this method will perform on-the-fly demuxing. Otherwise it will just return the pre-demuxed barcode from `ExtractIlluminaBarcodes`.
      Parameters:
      cluster - The cluster data to demux.
      metrics - The metrics object that will store the demux metrics.
      noMatch - A no-match metric object to store metrics for any read that doesn't demux.
      outputReadStructure - The output `ReadStructure` for this cluster
      Returns:
      The matched barcode or null if no barcode was matched.
    • interruptAndShutdownExecutors

      protected void interruptAndShutdownExecutors(ThreadPoolExecutorWithExceptions... executors)
    • updateMetrics

      protected void updateMetrics(Map<String,BarcodeMetric> metrics, BarcodeMetric noMatch)