Class IlluminaDataProviderFactory

java.lang.Object
picard.illumina.parser.IlluminaDataProviderFactory

public class IlluminaDataProviderFactory extends Object
IlluminaDataProviderFactory accepts options for parsing Illumina data files for a lane and creates an IlluminaDataProvider, an iterator over the ClusterData for that lane, which utilizes these options.

Note: IlluminaDataProviderFactory tends to be used in multithreaded environments (e.g. IlluminaBasecallsToSam calls makeDataProvider in a different thread per tile), so it has been made essentially immutable. makeDataProvider/getAvailableTiles are now idempotent as far as IlluminaDataProviderFactory itself is concerned, although many file handles and other resources are opened each time makeDataProvider is called. In the future we may want dataTypes to be passed to the makeDataProvider factory methods so that client code does not repeat configuration for the same basecallDirectory.
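The threading pattern described in the note can be illustrated with a minimal, self-contained sketch. It does not use Picard: `TileFactory`, its string "clusters", and `countClusters` are hypothetical stand-ins, and the sketch only assumes that an immutable factory can be shared freely while each thread works with its own provider.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Stand-in types that only illustrate the pattern; none of them are Picard classes. */
final class SharedFactorySketch {

    /** Immutable factory: every field is final and assigned once in the constructor. */
    static final class TileFactory {
        private final int lane;
        private final List<Integer> tiles;

        TileFactory(int lane, List<Integer> tiles) {
            this.lane = lane;
            // Defensive copy so later changes to the caller's list cannot leak in.
            this.tiles = Collections.unmodifiableList(new ArrayList<>(tiles));
        }

        List<Integer> getAvailableTiles() {
            return tiles;
        }

        /** Returns a fresh iterator per call; the factory's own state never changes. */
        Iterator<String> makeDataProvider(int tile) {
            List<String> clusters = new ArrayList<>();
            for (int i = 0; i < 3; i++) { // pretend each tile holds three clusters
                clusters.add("lane" + lane + ":tile" + tile + ":cluster" + i);
            }
            return clusters.iterator();
        }
    }

    /** Counts clusters across all tiles, using one task and one provider per tile. */
    static int countClusters(TileFactory factory, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> perTile = new ArrayList<>();
            for (int tile : factory.getAvailableTiles()) {
                perTile.add(pool.submit(() -> {
                    int n = 0;
                    Iterator<String> provider = factory.makeDataProvider(tile);
                    while (provider.hasNext()) {
                        provider.next();
                        n++;
                    }
                    return n;
                }));
            }
            int total = 0;
            for (Future<Integer> f : perTile) {
                total += f.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        TileFactory factory = new TileFactory(1, Arrays.asList(1101, 1102, 1103));
        System.out.println(countClusters(factory, 3)); // prints 9
    }
}
```

Because the factory holds no mutable state, no locking is needed around `makeDataProvider`; only the per-tile iterators are thread-confined.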

  • Constructor Details

    • IlluminaDataProviderFactory

      public IlluminaDataProviderFactory(File basecallDirectory, int lane, ReadStructure readStructure, BclQualityEvaluationStrategy bclQualityEvaluationStrategy, Set<IlluminaDataType> dataTypes)
      Create factory with the specified options, one that favors using QSeqs over all other files
      Parameters:
      basecallDirectory - The baseCalls directory of a complete Illumina directory. Files are found by searching relative to this folder (some of them higher up in the directory tree).
      lane - Which lane to iterate over.
      readStructure - The read structure to which output clusters will conform. When not using QSeqs, EAMSS masking (see BclParser) is run on individual reads as found in the readStructure. If the specified readStructure does not match the read structure implied by the sequencer's output, then the output quality scores may differ from those found in a run's QSeq files.
      bclQualityEvaluationStrategy - The basecall quality evaluation strategy that is applied to decoded base calls.
      dataTypes - Which data types to read
    • IlluminaDataProviderFactory

      public IlluminaDataProviderFactory(File basecallDirectory, File barcodesDirectory, int lane, ReadStructure readStructure, BclQualityEvaluationStrategy bclQualityEvaluationStrategy, Set<IlluminaDataType> dataTypes)
      Create factory with the specified options, one that favors using QSeqs over all other files
      Parameters:
      basecallDirectory - The baseCalls directory of a complete Illumina directory. Files are found by searching relative to this folder (some of them higher up in the directory tree).
      barcodesDirectory - The directory containing barcode files extracted by 'ExtractIlluminaBarcodes'. If null, `basecallDirectory` is used instead.
      lane - Which lane to iterate over.
      readStructure - The read structure to which output clusters will conform. When not using QSeqs, EAMSS masking (see BclParser) is run on individual reads as found in the readStructure. If the specified readStructure does not match the read structure implied by the sequencer's output, then the output quality scores may differ from those found in a run's QSeq files.
      bclQualityEvaluationStrategy - The basecall quality evaluation strategy that is applied to decoded base calls.
      dataTypes - Which data types to read
  • Method Details

    • getOutputReadStructure

      public ReadStructure getOutputReadStructure()
      Sometimes (in the case of skipped reads) the logical read structure of the output cluster data differs from the input readStructure.
      Returns:
      The ReadStructure describing the output cluster data
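To make the effect of skipped reads concrete, here is a small sketch that works on read structure strings only. It assumes the usual Picard notation of `<length><operator>` segments, with `S` marking a skip, and it assumes the output structure is simply the input with the skip segments removed. `dropSkips` is a hypothetical helper, not a Picard method.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustration only: string-level view of how skips change the output structure. */
final class OutputReadStructureSketch {
    // One segment is a length followed by an operator: T, B, M, or S (skip).
    private static final Pattern SEGMENT = Pattern.compile("(\\d+)([TBMS])");

    /** Returns the read structure with every skip (S) segment removed. */
    static String dropSkips(String readStructure) {
        Matcher m = SEGMENT.matcher(readStructure);
        StringBuilder output = new StringBuilder();
        int consumed = 0;
        while (m.find()) {
            if (m.start() != consumed) { // unexpected characters between segments
                throw new IllegalArgumentException("Unparseable read structure: " + readStructure);
            }
            consumed = m.end();
            if (!"S".equals(m.group(2))) {
                output.append(m.group());
            }
        }
        if (consumed != readStructure.length()) { // trailing characters left over
            throw new IllegalArgumentException("Unparseable read structure: " + readStructure);
        }
        return output.toString();
    }

    public static void main(String[] args) {
        System.out.println(dropSkips("25T8B8S25T")); // prints 25T8B25T
    }
}
```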
    • getAvailableTiles

      public List<Integer> getAvailableTiles()
      Return the list of tiles available for this flowcell and lane. These are in ascending numerical order.
      Returns:
      List of all tiles available for this flowcell and lane.
    • setApplyEamssFiltering

      public void setApplyEamssFiltering(boolean applyEamssFiltering)
      Sets whether or not EAMSS filtering will be applied if parsing BCL files for bases and quality scores.
    • makeDataProvider

      public BaseIlluminaDataProvider makeDataProvider()
    • makeDataProvider

      public BaseIlluminaDataProvider makeDataProvider(Integer requestedTile)
    • makeDataProvider

      public BaseIlluminaDataProvider makeDataProvider(List<Integer> requestedTiles)
      Call this method to create a ClusterData iterator over the specified tiles.
      Returns:
      An iterator for reading the Illumina basecall output for the lane specified in the constructor.
    • findUnmatchedTypes

      public static Set<IlluminaDataType> findUnmatchedTypes(Set<IlluminaDataType> requestedDataTypes, Map<IlluminaFileUtil.SupportedIlluminaFormat,Set<IlluminaDataType>> formatToMatchedTypes)
      Given a map of formats to the data types they provide, find and return any requested data types that have no format associated with them.
      Parameters:
      requestedDataTypes - Data types that need to be provided
      formatToMatchedTypes - A map of file formats to the data types they support
      Returns:
      The data types that go unsupported by the formats found in formatToMatchedTypes
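The behaviour described for findUnmatchedTypes amounts to a set difference. The sketch below shows that idea with hypothetical `DataType` and `Format` enums standing in for IlluminaDataType and IlluminaFileUtil.SupportedIlluminaFormat; it is an illustration of the described contract, not Picard's actual implementation.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

/** Illustration only: the enums are hypothetical stand-ins for the Picard types. */
final class UnmatchedTypesSketch {
    enum DataType { BASE_CALLS, QUALITY_SCORES, POSITION, PF }
    enum Format { BCL, LOCS, FILTER }

    /** Returns the requested data types that no format in the map provides. */
    static Set<DataType> findUnmatched(Set<DataType> requested,
                                       Map<Format, Set<DataType>> formatToMatched) {
        Set<DataType> unmatched = EnumSet.noneOf(DataType.class);
        unmatched.addAll(requested); // copy, so the caller's set is left untouched
        for (Set<DataType> matched : formatToMatched.values()) {
            unmatched.removeAll(matched);
        }
        return unmatched;
    }

    public static void main(String[] args) {
        Map<Format, Set<DataType>> formats = new EnumMap<>(Format.class);
        formats.put(Format.BCL, EnumSet.of(DataType.BASE_CALLS, DataType.QUALITY_SCORES));
        Set<DataType> requested =
                EnumSet.of(DataType.BASE_CALLS, DataType.QUALITY_SCORES, DataType.POSITION);
        System.out.println(findUnmatched(requested, formats)); // prints [POSITION]
    }
}
```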
    • determineFormats

      public static Map<IlluminaFileUtil.SupportedIlluminaFormat,Set<IlluminaDataType>> determineFormats(Set<IlluminaDataType> requestedDataTypes, IlluminaFileUtil fileUtil)
      For the given requestedDataTypes, return a map from file format to the set of data types that format will provide. The map covers as many of the requestedDataTypes as possible, choosing the most preferred available format for each.
      Parameters:
      requestedDataTypes - Data types to be provided
      fileUtil - A file util for the lane/directory we wish to provide data for
      Returns:
      A Map<Supported file format, Set of data types file format provides>
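One way to read the determineFormats contract is as a per-data-type lookup against an ordered preference list, grouped by the chosen format. The sketch below follows that reading. The enums, the `PREFERENCE` table and its ordering, and the `available` set (standing in for what an IlluminaFileUtil would report) are all assumptions made for illustration, not Picard's actual tables or code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustration only: types and preference order are hypothetical. */
final class DetermineFormatsSketch {
    enum DataType { BASE_CALLS, QUALITY_SCORES, POSITION, PF }
    enum Format { BCL, LOCS, CLOCS, FILTER }

    // Hypothetical preference lists, most preferred format first.
    private static final Map<DataType, List<Format>> PREFERENCE = new EnumMap<>(DataType.class);
    static {
        PREFERENCE.put(DataType.BASE_CALLS, Arrays.asList(Format.BCL));
        PREFERENCE.put(DataType.QUALITY_SCORES, Arrays.asList(Format.BCL));
        PREFERENCE.put(DataType.POSITION, Arrays.asList(Format.LOCS, Format.CLOCS));
        PREFERENCE.put(DataType.PF, Arrays.asList(Format.FILTER));
    }

    /** Maps each chosen format to the requested data types it will provide. */
    static Map<Format, Set<DataType>> determineFormats(Set<DataType> requested,
                                                       Set<Format> available) {
        Map<Format, Set<DataType>> result = new EnumMap<>(Format.class);
        for (DataType dt : requested) {
            for (Format format : PREFERENCE.getOrDefault(dt, Collections.emptyList())) {
                if (available.contains(format)) {
                    // First available format in preference order wins for this data type.
                    result.computeIfAbsent(format, k -> EnumSet.noneOf(DataType.class)).add(dt);
                    break;
                }
            }
            // A data type with no available format is simply left out of the map.
        }
        return result;
    }

    public static void main(String[] args) {
        Set<DataType> requested = EnumSet.allOf(DataType.class);
        Set<Format> available = EnumSet.of(Format.BCL, Format.CLOCS);
        System.out.println(determineFormats(requested, available));
    }
}
```

In this sketch PF stays uncovered because no FILTER files are available; a caller could detect that with a set difference such as the one findUnmatchedTypes describes.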
    • findPreferredFormat

      public static IlluminaFileUtil.SupportedIlluminaFormat findPreferredFormat(IlluminaDataType dt, IlluminaFileUtil fileUtil)
      Given a data type, find the most preferred file format, even if files are not available.
      Parameters:
      dt - Type of desired data
      fileUtil - Util for the lane/directory in which we will find data
      Returns:
      The file format that is "most preferred" (i.e. fastest to parse/smallest in memory)
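The phrase "even if files are not available" suggests that this lookup consults only the preference order and skips the availability check. The sketch below contrasts the two modes under that assumption. The enums, the `PREFERENCE` table, and the `requireAvailable` flag are hypothetical and exist only to make the distinction visible.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Illustration only: types, preference order, and the flag are hypothetical. */
final class PreferredFormatSketch {
    enum DataType { POSITION, PF }
    enum Format { LOCS, CLOCS, POS, FILTER }

    // Hypothetical preference lists, most preferred format first.
    private static final Map<DataType, List<Format>> PREFERENCE = new EnumMap<>(DataType.class);
    static {
        PREFERENCE.put(DataType.POSITION, Arrays.asList(Format.LOCS, Format.CLOCS, Format.POS));
        PREFERENCE.put(DataType.PF, Arrays.asList(Format.FILTER));
    }

    /**
     * Returns the first format in preference order. When requireAvailable is true,
     * formats missing from the available set are passed over.
     */
    static Optional<Format> findPreferred(DataType dt, Set<Format> available,
                                          boolean requireAvailable) {
        for (Format format : PREFERENCE.getOrDefault(dt, Collections.emptyList())) {
            if (!requireAvailable || available.contains(format)) {
                return Optional.of(format);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Set<Format> available = EnumSet.of(Format.CLOCS);
        // Ignoring availability yields the top of the list.
        System.out.println(findPreferred(DataType.POSITION, available, false)); // prints Optional[LOCS]
        // Requiring availability falls back to the next format that exists.
        System.out.println(findPreferred(DataType.POSITION, available, true));  // prints Optional[CLOCS]
    }
}
```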