Package picard.illumina.parser
Class IlluminaDataProviderFactory
java.lang.Object
picard.illumina.parser.IlluminaDataProviderFactory
IlluminaDataProviderFactory accepts options for parsing Illumina data files for a lane and creates an
IlluminaDataProvider, an iterator over the ClusterData for that lane, which uses these options.
Note: Because IlluminaDataProviderFactory tends to be used in multithreaded environments (e.g. IlluminaBasecallsToSam
calls makeDataProvider in a different thread per tile), it is essentially immutable. makeDataProvider/getTiles
are idempotent as far as IlluminaDataProviderFactory is concerned, although many file handles and other resources are
opened when makeDataProvider is called. In the future, dataTypes may be provided to the
makeDataProvider factory methods so that configuration is not done multiple times for the same basecallDirectory in
client code.
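The thread-per-tile usage described in the note can be sketched as follows. This is a minimal, self-contained illustration and does not use Picard itself: `StubFactory` is a hypothetical stand-in for IlluminaDataProviderFactory, its iterator of fake cluster names stands in for the ClusterData iterator, and `countClustersPerTile` is an invented helper. The point is only that one immutable factory can be shared while each task obtains its own provider.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PerTileThreadingSketch {

    /** Hypothetical stand-in for the factory: all state is final, so sharing it across threads is safe. */
    static final class StubFactory {
        private final List<Integer> tiles;

        StubFactory(List<Integer> tiles) {
            this.tiles = Collections.unmodifiableList(new ArrayList<>(tiles));
        }

        List<Integer> getAvailableTiles() {
            return tiles;
        }

        /** Each call returns a fresh, independent iterator (here: three fake cluster names per tile). */
        Iterator<String> makeDataProvider(int tile) {
            return Arrays.asList(tile + ":c0", tile + ":c1", tile + ":c2").iterator();
        }
    }

    /** Counts clusters per tile using one task per tile, all sharing a single factory. */
    static List<Integer> countClustersPerTile(StubFactory factory, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int tile : factory.getAvailableTiles()) {
                futures.add(pool.submit(() -> {
                    // The provider is created and consumed inside the task, never shared.
                    Iterator<String> provider = factory.makeDataProvider(tile);
                    int count = 0;
                    while (provider.hasNext()) {
                        provider.next();
                        count++;
                    }
                    return count;
                }));
            }
            List<Integer> counts = new ArrayList<>();
            for (Future<Integer> future : futures) {
                counts.add(future.get()); // results come back in tile order
            }
            return counts;
        } catch (Exception e) {
            throw new IllegalStateException("tile task failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

In this sketch the factory holds no mutable state, so no locking is needed; only the per-task provider is stateful.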
-
Field Summary
Fields
Modifier and Type: protected final Map<IlluminaFileUtil.SupportedIlluminaFormat, Set<IlluminaDataType>>
Field: formatToDataTypes
Description: A Map of file formats to the dataTypes they will provide for this run.
-
Constructor Summary
Constructors
Constructor: IlluminaDataProviderFactory(File basecallDirectory, int lane, ReadStructure readStructure, BclQualityEvaluationStrategy bclQualityEvaluationStrategy, Set<IlluminaDataType> dataTypes)
Description: Create factory with the specified options, one that favors using QSeqs over all other files.
Constructor: IlluminaDataProviderFactory(File basecallDirectory, File barcodesDirectory, int lane, ReadStructure readStructure, BclQualityEvaluationStrategy bclQualityEvaluationStrategy, Set<IlluminaDataType> dataTypes)
Description: Create factory with the specified options, one that favors using QSeqs over all other files.
-
Method Summary
Modifier and Type: static Map<IlluminaFileUtil.SupportedIlluminaFormat, Set<IlluminaDataType>>
Method: determineFormats(Set<IlluminaDataType> requestedDataTypes, IlluminaFileUtil fileUtil)
Description: For all requestedDataTypes return a map of file format to set of provided data types that covers as many requestedDataTypes as possible and chooses the most preferred available formats possible.
Modifier and Type: static IlluminaFileUtil.SupportedIlluminaFormat
Method: findPreferredFormat(IlluminaDataType dt, IlluminaFileUtil fileUtil)
Description: Given a data type find the most preferred file format even if files are not available.
Modifier and Type: static Set<IlluminaDataType>
Method: findUnmatchedTypes(Set<IlluminaDataType> requestedDataTypes, Map<IlluminaFileUtil.SupportedIlluminaFormat, Set<IlluminaDataType>> formatToMatchedTypes)
Description: Given a set of formats to data types they provide, find any requested data types that do not have a format associated with them and return them.
Method: getAvailableTiles
Description: Return the list of tiles available for this flowcell and lane.
Method: getOutputReadStructure
Description: Sometimes (in the case of skipped reads) the logical read structure of the output cluster data is different from the input readStructure.
Method: makeDataProvider(Integer requestedTile)
Method: makeDataProvider(List<Integer> requestedTiles)
Description: Call this method to create a ClusterData iterator over the specified tiles.
Modifier and Type: void
Method: setApplyEamssFiltering(boolean applyEamssFiltering)
Description: Sets whether or not EAMSS filtering will be applied if parsing BCL files for bases and quality scores.
-
Field Details
-
formatToDataTypes
protected final Map<IlluminaFileUtil.SupportedIlluminaFormat,Set<IlluminaDataType>> formatToDataTypes
A Map of file formats to the dataTypes they will provide for this run.
-
-
Constructor Details
-
IlluminaDataProviderFactory
public IlluminaDataProviderFactory(File basecallDirectory, int lane, ReadStructure readStructure, BclQualityEvaluationStrategy bclQualityEvaluationStrategy, Set<IlluminaDataType> dataTypes)
Create factory with the specified options, one that favors using QSeqs over all other files.
Parameters:
basecallDirectory - The baseCalls directory of a complete Illumina directory. Files are found by searching relative to this folder (some of them higher up in the directory tree).
lane - Which lane to iterate over.
readStructure - The read structure to which output clusters will conform. When not using QSeqs, EAMSS masking (see BclParser) is run on individual reads as found in the readStructure. If the readStructure specified does not match the readStructure implied by the sequencer's output, then the quality scores output may differ from what would be found in a run's QSeq files.
dataTypes - Which data types to read.
-
IlluminaDataProviderFactory
public IlluminaDataProviderFactory(File basecallDirectory, File barcodesDirectory, int lane, ReadStructure readStructure, BclQualityEvaluationStrategy bclQualityEvaluationStrategy, Set<IlluminaDataType> dataTypes)
Create factory with the specified options, one that favors using QSeqs over all other files.
Parameters:
basecallDirectory - The baseCalls directory of a complete Illumina directory. Files are found by searching relative to this folder (some of them higher up in the directory tree).
barcodesDirectory - The directory containing barcode files extracted by 'ExtractIlluminaBarcodes'. If null, `basecallDirectory` is used instead.
lane - Which lane to iterate over.
readStructure - The read structure to which output clusters will conform. When not using QSeqs, EAMSS masking (see BclParser) is run on individual reads as found in the readStructure. If the readStructure specified does not match the readStructure implied by the sequencer's output, then the quality scores output may differ from what would be found in a run's QSeq files.
bclQualityEvaluationStrategy - The basecall quality evaluation strategy that is applied to decoded base calls.
dataTypes - Which data types to read.
-
-
Method Details
-
getOutputReadStructure
Sometimes (in the case of skipped reads) the logical read structure of the output cluster data is different from the input readStructure.
Returns:
The ReadStructure describing the output cluster data.
-
getAvailableTiles
Return the list of tiles available for this flowcell and lane. These are in ascending numerical order.
Returns:
List of all tiles available for this flowcell and lane.
-
setApplyEamssFiltering
public void setApplyEamssFiltering(boolean applyEamssFiltering)
Sets whether or not EAMSS filtering will be applied if parsing BCL files for bases and quality scores.
-
makeDataProvider
-
makeDataProvider
-
makeDataProvider
Call this method to create a ClusterData iterator over the specified tiles.
Returns:
An iterator for reading the Illumina basecall output for the lane specified in the constructor.
-
findUnmatchedTypes
public static Set<IlluminaDataType> findUnmatchedTypes(Set<IlluminaDataType> requestedDataTypes, Map<IlluminaFileUtil.SupportedIlluminaFormat, Set<IlluminaDataType>> formatToMatchedTypes)
Given a map of formats to the data types they provide, find any requested data types that do not have a format associated with them and return them.
Parameters:
requestedDataTypes - Data types that need to be provided.
formatToMatchedTypes - A map of file formats to the data types those formats will provide.
Returns:
The data types that go unsupported by the formats found in formatToMatchedTypes.
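The behavior described for findUnmatchedTypes amounts to a set difference: the requested types minus the union of everything the formats provide. A minimal sketch of that logic follows. It is not Picard's source: the `DataType` and `Format` enums are hypothetical stand-ins for IlluminaDataType and IlluminaFileUtil.SupportedIlluminaFormat, and their constants are invented for illustration.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

final class UnmatchedTypesSketch {

    // Hypothetical stand-ins for IlluminaDataType and SupportedIlluminaFormat.
    enum DataType { BASES, QUALITIES, POSITION, FILTER }
    enum Format { FORMAT_A, FORMAT_B }

    /** Returns the requested types that no format in the map provides. */
    static Set<DataType> findUnmatchedTypes(Set<DataType> requested,
                                            Map<Format, Set<DataType>> formatToMatchedTypes) {
        // Start from a copy of the request so the caller's set is not modified.
        Set<DataType> unmatched = EnumSet.noneOf(DataType.class);
        unmatched.addAll(requested);
        // Remove every type that at least one format covers.
        for (Set<DataType> provided : formatToMatchedTypes.values()) {
            unmatched.removeAll(provided);
        }
        return unmatched;
    }
}
```

An empty result means every requested type is covered by some format in the map.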
-
determineFormats
public static Map<IlluminaFileUtil.SupportedIlluminaFormat,Set<IlluminaDataType>> determineFormats(Set<IlluminaDataType> requestedDataTypes, IlluminaFileUtil fileUtil)
For all requestedDataTypes return a map of file format to set of provided data types that covers as many requestedDataTypes as possible and chooses the most preferred available formats possible.
Parameters:
requestedDataTypes - Data types to be provided.
fileUtil - A file util for the lane/directory we wish to provide data for.
Returns:
A Map<Supported file format, Set of data types file format provides>
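One way to read the determineFormats contract is: for each requested data type, walk that type's formats from most to least preferred, take the first one whose files are available, and group the results by format. The sketch below illustrates that reading only and is not Picard's implementation. The enums, the `PREFERENCES` table, and the `available` set (standing in for what an IlluminaFileUtil would report) are all hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class DetermineFormatsSketch {

    // Hypothetical stand-ins for IlluminaDataType and SupportedIlluminaFormat.
    enum DataType { BASES, QUALITIES, POSITION }
    enum Format { FAST, SLOW, POSITION_ONLY }

    // Hypothetical preference table: formats per data type, most preferred first.
    static final Map<DataType, List<Format>> PREFERENCES = new EnumMap<>(DataType.class);
    static {
        PREFERENCES.put(DataType.BASES, Arrays.asList(Format.FAST, Format.SLOW));
        PREFERENCES.put(DataType.QUALITIES, Arrays.asList(Format.FAST, Format.SLOW));
        PREFERENCES.put(DataType.POSITION, Arrays.asList(Format.POSITION_ONLY));
    }

    /** Maps each chosen format to the requested types it will provide; uncovered types are left out. */
    static Map<Format, Set<DataType>> determineFormats(Set<DataType> requested, Set<Format> available) {
        Map<Format, Set<DataType>> result = new EnumMap<>(Format.class);
        for (DataType type : requested) {
            List<Format> ordered = PREFERENCES.getOrDefault(type, Collections.<Format>emptyList());
            for (Format format : ordered) {
                if (available.contains(format)) {
                    // First available format in preference order wins for this type.
                    result.computeIfAbsent(format, f -> EnumSet.noneOf(DataType.class)).add(type);
                    break;
                }
            }
        }
        return result;
    }
}
```

Types with no available format simply do not appear in the returned map, which is what makes a follow-up check for unmatched types meaningful.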
-
findPreferredFormat
public static IlluminaFileUtil.SupportedIlluminaFormat findPreferredFormat(IlluminaDataType dt, IlluminaFileUtil fileUtil)
Given a data type find the most preferred file format even if files are not available.
Parameters:
dt - Type of desired data.
fileUtil - Util for the lane/directory in which we will find data.
Returns:
The file format that is "most preferred" (i.e. fastest to parse/smallest in memory).
-