Class RichSequence.IOTools

java.lang.Object
org.biojavax.bio.seq.RichSequence.IOTools
Enclosing interface:
RichSequence

public static final class RichSequence.IOTools extends Object
A set of convenience methods for handling common file formats.
Since:
1.5
Author:
Mark Schreiber, Richard Holland
  • Method Details

    • registerFormat

      public static void registerFormat(Class formatClass)
      Register a new format with IOTools for auto-guessing.
      Parameters:
      formatClass - the RichSequenceFormat class to register.
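      The sketch below illustrates one way a registry of format classes can support auto-guessing: classes are stored, instantiated on demand, and asked in turn whether they recognise the input. It uses only the JDK and is not BioJava's implementation; SketchFormat, FastaLike, GenbankLike, canRead and guess are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class FormatRegistrySketch {
    /** Hypothetical stand-in for a sequence format that can recognise its own input. */
    interface SketchFormat {
        boolean canRead(String firstLine);
        String name();
    }

    /** Hypothetical format: recognises records that start with '>'. */
    static class FastaLike implements SketchFormat {
        public boolean canRead(String firstLine) { return firstLine.startsWith(">"); }
        public String name() { return "FASTA"; }
    }

    /** Hypothetical format: recognises records that start with "LOCUS". */
    static class GenbankLike implements SketchFormat {
        public boolean canRead(String firstLine) { return firstLine.startsWith("LOCUS"); }
        public String name() { return "GenBank"; }
    }

    private static final List<Class<? extends SketchFormat>> FORMATS = new ArrayList<>();

    /** Registers a format class once; duplicates are ignored. */
    static void registerFormat(Class<? extends SketchFormat> formatClass) {
        if (!FORMATS.contains(formatClass)) {
            FORMATS.add(formatClass);
        }
    }

    /** Returns the name of the first registered format that accepts the line, or null. */
    static String guess(String firstLine) {
        for (Class<? extends SketchFormat> formatClass : FORMATS) {
            try {
                SketchFormat format = formatClass.getDeclaredConstructor().newInstance();
                if (format.canRead(firstLine)) {
                    return format.name();
                }
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot instantiate " + formatClass.getName(), e);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        registerFormat(FastaLike.class);
        registerFormat(GenbankLike.class);
        System.out.println(guess(">seq1 example"));
    }
}
```

      Registering a Class rather than an instance means the registry can create a fresh format object for each guess, which is why this sketch requires a no-argument constructor.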
    • readStream

      Guess the format of a stream, then attempt to read it.
      Parameters:
      stream - the BufferedInputStream to attempt to read.
      seqFactory - a factory used to build a RichSequence
      ns - a Namespace to load the sequences into. Null implies that it should use the namespace specified in the file. If no namespace is specified in the file, then RichObjectFactory.getDefaultNamespace() is used.
      Returns:
      a RichSequenceIterator over each sequence in the file
      Throws:
      IOException - in case the stream is unrecognisable or problems occur in reading it.
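      A buffered stream is useful for format guessing because it supports mark and reset: the first bytes can be inspected and the stream rewound before parsing begins. The following is a minimal sketch of that idea using only the JDK. It is not BioJava's detection logic; the class name, the PEEK_LIMIT value and the prefix rules are assumptions made for illustration.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class FormatGuessSketch {
    /** Number of bytes to peek at; a hypothetical limit chosen for this sketch. */
    private static final int PEEK_LIMIT = 64;

    /** Guesses a format label from the leading bytes and leaves the stream position unchanged. */
    static String guessFormat(BufferedInputStream stream) throws IOException {
        stream.mark(PEEK_LIMIT);                 // remember the current position
        byte[] head = new byte[PEEK_LIMIT];
        int n = stream.read(head, 0, PEEK_LIMIT);
        stream.reset();                          // rewind so a parser sees the full input
        if (n <= 0) {
            throw new IOException("Empty stream");
        }
        String start = new String(head, 0, n, StandardCharsets.US_ASCII);
        if (start.startsWith(">")) {
            return "FASTA";
        }
        if (start.startsWith("LOCUS")) {
            return "GenBank";
        }
        if (start.startsWith("ID ")) {
            return "EMBL or UniProt";
        }
        throw new IOException("Unrecognised format");
    }

    public static void main(String[] args) throws IOException {
        byte[] data = ">seq1\nACGT\n".getBytes(StandardCharsets.US_ASCII);
        BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        System.out.println(guessFormat(in));
    }
}
```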
    • readStream

      Guess the format of a stream, then attempt to read it.
      Parameters:
      stream - the BufferedInputStream to attempt to read.
      ns - a Namespace to load the sequences into. Null implies that it should use the namespace specified in the file. If no namespace is specified in the file, then RichObjectFactory.getDefaultNamespace() is used.
      Returns:
      a RichSequenceIterator over each sequence in the file
      Throws:
      IOException - If the stream cannot be read.
    • readFile

      public static RichSequenceIterator readFile(File file, RichSequenceBuilderFactory seqFactory, Namespace ns) throws IOException
      Guess the format of a file, then attempt to read it.
      Parameters:
      file - the File to attempt to read.
      seqFactory - a factory used to build a RichSequence
      ns - a Namespace to load the sequences into. Null implies that it should use the namespace specified in the file. If no namespace is specified in the file, then RichObjectFactory.getDefaultNamespace() is used.
      Returns:
      a RichSequenceIterator over each sequence in the file
      Throws:
      IOException - in case the file is unrecognisable or problems occur in reading it.
    • readFile

      public static RichSequenceIterator readFile(File file, Namespace ns) throws IOException
      Guess the format of a file, then attempt to read it.
      Parameters:
      file - the File to attempt to read.
      ns - a Namespace to load the sequences into. Null implies that it should use the namespace specified in the file. If no namespace is specified in the file, then RichObjectFactory.getDefaultNamespace() is used.
      Returns:
      a RichSequenceIterator over each sequence in the file
      Throws:
      IOException - If the file cannot be read.
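      The ns parameter follows the same three-step fallback throughout this class: an explicit namespace is used if given, otherwise the namespace named in the file, otherwise the default. A minimal sketch of that rule, using plain strings in place of Namespace objects, might look like this. The class name, method name and DEFAULT_NAMESPACE value are hypothetical.

```java
class NamespaceFallbackSketch {
    /** Hypothetical stand-in for the value of RichObjectFactory.getDefaultNamespace(). */
    static final String DEFAULT_NAMESPACE = "default";

    /**
     * Resolves the namespace to load sequences into.
     *
     * @param requested the ns argument passed by the caller, may be null
     * @param fromFile  the namespace named in the file, may be null
     */
    static String resolve(String requested, String fromFile) {
        if (requested != null) {
            return requested;        // an explicit namespace is used as given
        }
        if (fromFile != null) {
            return fromFile;         // null ns: fall back to the file's own namespace
        }
        return DEFAULT_NAMESPACE;    // nothing in the file either: use the default
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, "embl"));
        System.out.println(resolve(null, null));
    }
}
```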
    • readFasta

      Read a fasta file.
      Parameters:
      br - the BufferedReader to read data from
      sTok - a SymbolTokenization that understands the sequences
      ns - a Namespace to load the sequences into. Null implies that it should use the namespace specified in the file. If no namespace is specified in the file, then RichObjectFactory.getDefaultNamespace() is used.
      Returns:
      a RichSequenceIterator over each sequence in the fasta file
    • readFasta

      Read a fasta file building a custom type of RichSequence. For example, use RichSequenceBuilderFactory.FACTORY to emulate readFasta(BufferedReader, SymbolTokenization) and RichSequenceBuilderFactory.PACKED to force all symbols to be encoded using bit-packing.
      Parameters:
      br - the BufferedReader to read data from
      sTok - a SymbolTokenization that understands the sequences
      seqFactory - a factory used to build a RichSequence
      ns - a Namespace to load the sequences into. Null implies that it should use the namespace specified in the file. If no namespace is specified in the file, then RichObjectFactory.getDefaultNamespace() is used.
      Returns:
      a RichSequenceIterator over each sequence in the fasta file
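      To give a feel for what bit-packing buys, the sketch below stores unambiguous DNA at two bits per symbol, four symbols per byte. This only illustrates the general technique and is not the encoding used by RichSequenceBuilderFactory.PACKED; the class and method names are hypothetical, and ambiguity symbols are deliberately rejected.

```java
class BitPackingSketch {
    /** Symbol order defines the 2-bit codes: A=0, C=1, G=2, T=3 (an assumption of this sketch). */
    private static final String ALPHABET = "ACGT";

    /** Packs a DNA string into a byte array, four symbols per byte. */
    static byte[] pack(String dna) {
        byte[] packed = new byte[(dna.length() + 3) / 4];
        for (int i = 0; i < dna.length(); i++) {
            int code = ALPHABET.indexOf(Character.toUpperCase(dna.charAt(i)));
            if (code < 0) {
                throw new IllegalArgumentException("Not an unambiguous DNA symbol: " + dna.charAt(i));
            }
            packed[i / 4] |= (byte) (code << ((i % 4) * 2));
        }
        return packed;
    }

    /** Unpacks the first {@code length} symbols from a packed array. */
    static String unpack(byte[] packed, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            int code = (packed[i / 4] >> ((i % 4) * 2)) & 0b11;
            sb.append(ALPHABET.charAt(code));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] packed = pack("ACGTAC");
        System.out.println(packed.length + " bytes for 6 symbols");
        System.out.println(unpack(packed, 6));
    }
}
```

      The trade-off is typical of packed storage: memory use drops to roughly a quarter of one byte per symbol, at the cost of a shift and mask on every access.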
    • readFastaDNA

      Iterate over the sequences in a FASTA-format stream of DNA sequences.
      Parameters:
      br - the BufferedReader to read data from
      ns - a Namespace to load the sequences into. Null implies that it should use the namespace specified in the file. If no namespace is specified in the file, then RichObjectFactory.getDefaultNamespace() is used.
      Returns:
      a RichSequenceIterator over each sequence in the fasta file
    • readHashedFastaDNA

      Iterate over the sequences in a FASTA-format stream of DNA sequences. In contrast to readFastaDNA, this provides a faster implementation in which all sequences are accessed from memory.
      Parameters:
      is - the BufferedInputStream to read data from
      ns - a Namespace to load the sequences into. Null implies that it should use the namespace specified in the file. If no namespace is specified in the file, then RichObjectFactory.getDefaultNamespace() is used.
      Returns:
      a RichSequenceIterator over each sequence in the fasta file
      Throws:
      BioException - if something goes wrong while reading the file.
    • readFastaRNA

      Iterate over the sequences in a FASTA-format stream of RNA sequences.
      Parameters:
      br - the BufferedReader to read data from
      ns - a Namespace to load the sequences into. Null implies that it should use the namespace specified in the file. If no namespace is specified in the file, then RichObjectFactory.getDefaultNamespace() is used.
      Returns:
      a RichSequenceIterator over each sequence in the fasta file
    • readFastaProtein

      Iterate over the sequences in a FASTA-format stream of Protein sequences.
      Parameters:
      br - the BufferedReader to read data from
      ns - a Namespace to load the sequences into. Null implies that it should use the namespace specified in the file. If no namespace is specified in the file, then RichObjectFactory.getDefaultNamespace() is used.
      Returns:
      a RichSequenceIterator over each sequence in the fasta file
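      The FASTA readers above each return an iterator that yields one sequence at a time rather than loading the whole input. The following is a minimal, JDK-only sketch of that streaming pattern. It is not BioJava's parser; the class name and the String[] {header, residues} record shape are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

class FastaIteratorSketch implements Iterator<String[]> {
    private final BufferedReader reader;
    /** Header of the next unread record without the leading '>', or null when exhausted. */
    private String pendingHeader;

    FastaIteratorSketch(BufferedReader reader) {
        this.reader = reader;
        this.pendingHeader = advanceToHeader();
    }

    /** Skips lines until the next header line and returns its text, or null at end of input. */
    private String advanceToHeader() {
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith(">")) {
                    return line.substring(1).trim();
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return pendingHeader != null;
    }

    /** Returns the next record as {header, residues}, reading only as far as the following header. */
    @Override
    public String[] next() {
        if (pendingHeader == null) {
            throw new NoSuchElementException();
        }
        String header = pendingHeader;
        pendingHeader = null;
        StringBuilder residues = new StringBuilder();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith(">")) {
                    pendingHeader = line.substring(1).trim();
                    break;
                }
                residues.append(line.trim());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String[] { header, residues.toString() };
    }

    public static void main(String[] args) {
        String fasta = ">a first\nAC\nGT\n>b\nTTTT\n";
        FastaIteratorSketch it = new FastaIteratorSketch(new BufferedReader(new StringReader(fasta)));
        while (it.hasNext()) {
            String[] record = it.next();
            System.out.println(record[0] + " -> " + record[1]);
        }
    }
}
```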
    • readGenbank

      Read a GenBank file using a custom type of SymbolList. For example, use RichSequenceBuilderFactory.FACTORY to emulate readFasta(BufferedReader, SymbolTokenization) and RichSequenceBuilderFactory.PACKED to force all symbols to be encoded using bit-packing.
      Parameters:
      br - the BufferedReader to read data from
      sTok - a SymbolTokenization that understands the sequences
      seqFactory - a factory used to build a SymbolList
      ns - a Namespace to load the sequences into. Null implies that it should use the namespace specified in the file. If no namespace is specified in the file, then RichObjectFactory.getDefaultNamespace() is used.
      Returns:
      a RichSequenceIterator over each sequence in the GenBank file
    • readGenbankDNA

      Iterate over the sequences in a GenBank-format stream of DNA sequences.
      Parameters:
      br - the BufferedReader to read data from
      ns - a Namespace to load the sequences into. Null implies that it should use the namespace specified in the file. If no namespace is specified in the file, then RichObjectFactory.getDefaultNamespace() is used.
      Returns:
      a RichSequenceIterator over each sequence in the GenBank file
    • readGenbankRNA

      Iterate over the sequences in a GenBank-format stream of RNA sequences.
      Parameters:
      br - the BufferedReader to read data from
      ns - a Namespace to load the sequences into. Null implies that it should use the namespace specified in the file. If no namespace is specified in the file, then RichObjectFactory.getDefaultNamespace() is used.
      Returns:
      a RichSequenceIterator over each sequence in the GenBank file
    • readGenbankProtein

      Iterate over the sequences in a GenBank-format stream of Protein sequences.
      Parameters:
      br - the BufferedReader to read data from
      ns - a Namespace to load the sequences into. Null implies that it should use the namespace specified in the file. If no namespace is specified in the file, then RichObjectFactory.getDefaultNamespace() is used.
      Returns:
      a RichSequenceIterator over each sequence in the GenBank file
    • readINSDseq

      Read an INSDseq file using a custom type of SymbolList. For example, use RichSequenceBuilderFactory.FACTORY to emulate readFasta(BufferedReader, SymbolTokenization) and RichSequenceBuilderFactory.PACKED to force all symbols to be encoded using bit-packing.
      Parameters:
      br - the BufferedReader to read data from
      sTok - a SymbolTokenization that understands the sequences
      seqFactory - a factory used to build a SymbolList
      ns - a Namespace to load the sequences into. Null implies that it should use the namespace specified in the file. If no namespace is specified in the file, then RichObjectFactory.getDefaultNamespace() is used.
      Returns:
      a RichSequenceIterator over each sequence in the INSDseq file
    • readINSDseqDNA

      Iterate over the sequences in an INSDseq-format stream of DNA sequences.
      Parameters:
      br - the BufferedReader to read data from
      ns - a Namespace to load the sequences into. Null implies that it should use the namespace specified in the file. If no namespace is specified in the file, then RichObjectFactory.getDefaultNamespace() is used.
      Returns:
      a RichSequenceIterator over each sequence in the INSDseq file
    • readINSDseqRNA

      Iterate over the sequences in an INSDseq-format stream of RNA sequences.
      Parameters:
      br - the BufferedReader to read data from
      ns - a Namespace to load the sequences into. Null implies that it should use the namespace specified in the file. If no namespace is specified in the file, then RichObjectFactory.getDefaultNamespace() is used.
      Returns:
      a RichSequenceIterator over each sequence in the INSDseq file
    • readINSDseqProtein

      Iterate over the sequences in an INSDseq-format stream of Protein sequences.
      Parameters:
      br - the BufferedReader to read data from
      ns - a Namespace to load the sequences into. Null implies that it should use the namespace specified in the file. If no namespace is specified in the file, then RichObjectFactory.getDefaultNamespace() is used.
      Returns:
      a RichSequenceIterator over each sequence in the INSDseq file
    • readEMBLxml

      Read an EMBLxml file using a custom type of SymbolList. For example, use RichSequenceBuilderFactory.FACTORY to emulate readFasta(BufferedReader, SymbolTokenization) and RichSequenceBuilderFactory.PACKED to force all symbols to be encoded using bit-packing.
      Parameters:
      br - the BufferedReader to read data from
      sTok - a SymbolTokenization that understands the sequences
      seqFactory - a factory used to build a SymbolList
      ns - a Namespace to load the sequences into. Null implies that it should use the namespace specified in the file. If no namespace is specified in the file, then RichObjectFactory.getDefaultNamespace() is used.
      Returns:
      a RichSequenceIterator over each sequence in the EMBLxml file
    • readEMBLxmlDNA

      Iterate over the sequences in an EMBLxml-format stream of DNA sequences.
      Parameters:
      br - the BufferedReader to read data from
      ns - a Namespace to load the sequences into. Null implies that it should use the namespace specified in the file. If no namespace is specified in the file, then RichObjectFactory.getDefaultNamespace() is used.
      Returns:
      a RichSequenceIterator over each sequence in the EMBLxml file
    • readEMBLxmlRNA

      Iterate over the sequences in an EMBLxml-format stream of RNA sequences.
      Parameters:
      br - the BufferedReader to read data from
      ns - a Namespace to load the sequences into. Null implies that it should use the namespace specified in the file. If no namespace is specified in the file, then RichObjectFactory.getDefaultNamespace() is used.
      Returns:
      a RichSequenceIterator over each sequence in the EMBLxml file
    • readEMBLxmlProtein

      Iterate over the sequences in an EMBLxml-format stream of Protein sequences.
      Parameters:
      br - the BufferedReader to read data from
      ns - a Namespace to load the sequences into. Null implies that it should use the namespace specified in the file. If no namespace is specified in the file, then RichObjectFactory.getDefaultNamespace() is used.
      Returns:
      a RichSequenceIterator over each sequence in the EMBLxml file
    • readEMBL

      Read an EMBL file using a custom type of SymbolList. For example, use RichSequenceBuilderFactory.FACTORY to emulate readFasta(BufferedReader, SymbolTokenization) and RichSequenceBuilderFactory.PACKED to force all symbols to be encoded using bit-packing.
      Parameters:
      br - the BufferedReader to read data from
      sTok - a SymbolTokenization that understands the sequences
      seqFactory - a factory used to build a SymbolList
      ns - a Namespace to load the sequences into. Null implies that it should use the namespace specified in the file. If no namespace is specified in the file, then RichObjectFactory.getDefaultNamespace() is used.
      Returns:
      a RichSequenceIterator over each sequence in the EMBL file
    • readEMBLDNA

      Iterate over the sequences in an EMBL-format stream of DNA sequences.
      Parameters:
      br - the BufferedReader to read data from
      ns - a Namespace to load the sequences into. Null implies that it should use the namespace specified in the file. If no namespace is specified in the file, then RichObjectFactory.getDefaultNamespace() is used.
      Returns:
      a RichSequenceIterator over each sequence in the EMBL file
    • readEMBLRNA

      Iterate over the sequences in an EMBL-format stream of RNA sequences.
      Parameters:
      br - the BufferedReader to read data from
      ns - a Namespace to load the sequences into. Null implies that it should use the namespace specified in the file. If no namespace is specified in the file, then RichObjectFactory.getDefaultNamespace() is used.
      Returns:
      a RichSequenceIterator over each sequence in the EMBL file
    • readEMBLProtein

      Iterate over the sequences in an EMBL-format stream of Protein sequences.
      Parameters:
      br - the BufferedReader to read data from
      ns - a Namespace to load the sequences into. Null implies that it should use the namespace specified in the file. If no namespace is specified in the file, then RichObjectFactory.getDefaultNamespace() is used.
      Returns:
      a RichSequenceIterator over each sequence in the EMBL file
    • readUniProt

      Read a UniProt file using a custom type of SymbolList. For example, use RichSequenceBuilderFactory.FACTORY to emulate readFasta(BufferedReader, SymbolTokenization) and RichSequenceBuilderFactory.PACKED to force all symbols to be encoded using bit-packing.
      Parameters:
      br - the BufferedReader to read data from
      sTok - a SymbolTokenization that understands the sequences
      seqFactory - a factory used to build a SymbolList
      ns - a Namespace to load the sequences into. Null implies that it should use the namespace specified in the file. If no namespace is specified in the file, then RichObjectFactory.getDefaultNamespace() is used.
      Returns:
      a RichSequenceIterator over each sequence in the UniProt file
    • readUniProt

      Iterate over the sequences in a UniProt-format stream of protein sequences.
      Parameters:
      br - the BufferedReader to read data from
      ns - a Namespace to load the sequences into. Null implies that it should use the namespace specified in the file. If no namespace is specified in the file, then RichObjectFactory.getDefaultNamespace() is used.
      Returns:
      a RichSequenceIterator over each sequence in the UniProt file
    • readUniProtXML

      Read a UniProt XML file using a custom type of SymbolList. For example, use RichSequenceBuilderFactory.FACTORY to emulate readFasta(BufferedReader, SymbolTokenization) and RichSequenceBuilderFactory.PACKED to force all symbols to be encoded using bit-packing.
      Parameters:
      br - the BufferedReader to read data from
      sTok - a SymbolTokenization that understands the sequences
      seqFactory - a factory used to build a SymbolList
      ns - a Namespace to load the sequences into. Null implies that it should use the namespace specified in the file. If no namespace is specified in the file, then RichObjectFactory.getDefaultNamespace() is used.
      Returns:
      a RichSequenceIterator over each sequence in the UniProt XML file
    • readUniProtXML

      Iterate over the sequences in a UniProt XML-format stream of protein sequences.
      Parameters:
      br - the BufferedReader to read data from
      ns - a Namespace to load the sequences into. Null implies that it should use the namespace specified in the file. If no namespace is specified in the file, then RichObjectFactory.getDefaultNamespace() is used.
      Returns:
      a RichSequenceIterator over each sequence in the UniProt XML file
    • writeFasta

      public static void writeFasta(OutputStream os, SequenceIterator in, Namespace ns, FastaHeader header) throws IOException
      Writes Sequences from a SequenceIterator to an OutputStream in Fasta Format. This makes for a useful format filter where a StreamReader can be sent to the RichStreamWriter after formatting.
      Parameters:
      os - The stream to write fasta formatted data to
      in - The source of input RichSequences
      ns - a Namespace to write the RichSequences to. Null implies that it should use the namespace specified in the individual sequence.
      header - the FastaHeader
      Throws:
      IOException - if there is an IO problem
    • writeFasta

      public static void writeFasta(OutputStream os, SequenceIterator in, Namespace ns) throws IOException
      Writes Sequences from a SequenceIterator to an OutputStream in Fasta Format. This makes for a useful format filter where a StreamReader can be sent to the RichStreamWriter after formatting.
      Parameters:
      os - The stream to write fasta formatted data to
      in - The source of input RichSequences
      ns - a Namespace to write the RichSequences to. Null implies that it should use the namespace specified in the individual sequence.
      Throws:
      IOException - if there is an IO problem
    • writeFasta

      public static void writeFasta(OutputStream os, Sequence seq, Namespace ns) throws IOException
      Writes a single Sequence to an OutputStream in Fasta format.
      Parameters:
      os - the OutputStream.
      seq - the Sequence.
      ns - a Namespace to write the sequences to. Null implies that it should use the namespace specified in the individual sequence.
      Throws:
      IOException - if there is an IO problem
    • writeFasta

      public static void writeFasta(OutputStream os, Sequence seq, Namespace ns, FastaHeader header) throws IOException
      Writes a single Sequence to an OutputStream in Fasta format.
      Parameters:
      os - the OutputStream.
      seq - the Sequence.
      ns - a Namespace to write the sequences to. Null implies that it should use the namespace specified in the individual sequence.
      header - a FastaHeader that controls the fields in the header.
      Throws:
      IOException - if there is an IO problem
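      As a rough illustration of what writing a single sequence in FASTA format involves, the sketch below emits a header line followed by the residues wrapped at a fixed width, and applies the null-namespace rule described for ns. It uses only the JDK and is not BioJava's writer; the header layout (namespace|name), the 60-column width and all names are assumptions of this sketch.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

class FastaWriteSketch {
    /** Residues per output line; a common convention, assumed here. */
    static final int LINE_WIDTH = 60;

    /**
     * Writes one record in FASTA form.
     *
     * @param os           destination stream
     * @param ns           namespace to write under, or null to use the sequence's own
     * @param seqNamespace the namespace carried by the sequence itself
     * @param name         sequence name
     * @param residues     sequence symbols as text
     */
    static void writeFasta(OutputStream os, String ns, String seqNamespace,
                           String name, String residues) throws IOException {
        String effectiveNs = (ns != null) ? ns : seqNamespace;
        StringBuilder sb = new StringBuilder();
        sb.append('>').append(effectiveNs).append('|').append(name).append('\n');
        for (int i = 0; i < residues.length(); i += LINE_WIDTH) {
            int end = Math.min(i + LINE_WIDTH, residues.length());
            sb.append(residues, i, end).append('\n');
        }
        os.write(sb.toString().getBytes(StandardCharsets.US_ASCII));
        os.flush();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeFasta(out, null, "lcl", "seq1", "ACGT");
        System.out.print(out.toString("US-ASCII"));
    }
}
```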
    • writeGenbank

      public static void writeGenbank(OutputStream os, SequenceIterator in, Namespace ns) throws IOException
      Writes sequences from a SequenceIterator to an OutputStream in GenBank Format. This makes for a useful format filter where a StreamReader can be sent to the RichStreamWriter after formatting.
      Parameters:
      os - The stream to write GenBank-formatted data to
      in - The source of input Sequences
      ns - a Namespace to write the sequences to. Null implies that it should use the namespace specified in the individual sequence.
      Throws:
      IOException - if there is an IO problem
    • writeGenbank

      public static void writeGenbank(OutputStream os, Sequence seq, Namespace ns) throws IOException
      Writes a single Sequence to an OutputStream in GenBank format.
      Parameters:
      os - the OutputStream.
      seq - the Sequence.
      ns - a Namespace to write the sequences to. Null implies that it should use the namespace specified in the individual sequence.
      Throws:
      IOException - if there is an IO problem
    • writeINSDseq

      public static void writeINSDseq(OutputStream os, SequenceIterator in, Namespace ns) throws IOException
      Writes sequences from a SequenceIterator to an OutputStream in INSDseq Format. This makes for a useful format filter where a StreamReader can be sent to the RichStreamWriter after formatting.
      Parameters:
      os - The stream to write INSDseq-formatted data to
      in - The source of input Sequences
      ns - a Namespace to write the sequences to. Null implies that it should use the namespace specified in the individual sequence.
      Throws:
      IOException - if there is an IO problem
    • writeINSDseq

      public static void writeINSDseq(OutputStream os, Sequence seq, Namespace ns) throws IOException
      Writes a single Sequence to an OutputStream in INSDseq format.
      Parameters:
      os - the OutputStream.
      seq - the Sequence.
      ns - a Namespace to write the sequences to. Null implies that it should use the namespace specified in the individual sequence.
      Throws:
      IOException - if there is an IO problem
    • writeEMBLxml

      public static void writeEMBLxml(OutputStream os, SequenceIterator in, Namespace ns) throws IOException
      Writes sequences from a SequenceIterator to an OutputStream in EMBLxml Format. This makes for a useful format filter where a StreamReader can be sent to the RichStreamWriter after formatting.
      Parameters:
      os - The stream to write EMBLxml-formatted data to
      in - The source of input Sequences
      ns - a Namespace to write the sequences to. Null implies that it should use the namespace specified in the individual sequence.
      Throws:
      IOException - if there is an IO problem
    • writeEMBLxml

      public static void writeEMBLxml(OutputStream os, Sequence seq, Namespace ns) throws IOException
      Writes a single Sequence to an OutputStream in EMBLxml format.
      Parameters:
      os - the OutputStream.
      seq - the Sequence.
      ns - a Namespace to write the sequences to. Null implies that it should use the namespace specified in the individual sequence.
      Throws:
      IOException - if there is an IO problem
    • writeEMBL

      public static void writeEMBL(OutputStream os, SequenceIterator in, Namespace ns) throws IOException
      Writes sequences from a SequenceIterator to an OutputStream in EMBL Format. This makes for a useful format filter where a StreamReader can be sent to the RichStreamWriter after formatting.
      Parameters:
      os - The stream to write EMBL-formatted data to
      in - The source of input Sequences
      ns - a Namespace to write the sequences to. Null implies that it should use the namespace specified in the individual sequence.
      Throws:
      IOException - if there is an IO problem
    • writeEMBL

      public static void writeEMBL(OutputStream os, Sequence seq, Namespace ns) throws IOException
      Writes a single Sequence to an OutputStream in EMBL format.
      Parameters:
      os - the OutputStream.
      seq - the Sequence.
      ns - a Namespace to write the sequences to. Null implies that it should use the namespace specified in the individual sequence.
      Throws:
      IOException - if there is an IO problem
    • writeUniProt

      public static void writeUniProt(OutputStream os, SequenceIterator in, Namespace ns) throws IOException
      Writes sequences from a SequenceIterator to an OutputStream in UniProt Format. This makes for a useful format filter where a StreamReader can be sent to the RichStreamWriter after formatting.
      Parameters:
      os - The stream to write UniProt-formatted data to
      in - The source of input Sequences
      ns - a Namespace to write the sequences to. Null implies that it should use the namespace specified in the individual sequence.
      Throws:
      IOException - if there is an IO problem
    • writeUniProt

      public static void writeUniProt(OutputStream os, Sequence seq, Namespace ns) throws IOException
      Writes a single Sequence to an OutputStream in UniProt format.
      Parameters:
      os - the OutputStream.
      seq - the Sequence.
      ns - a Namespace to write the sequences to. Null implies that it should use the namespace specified in the individual sequence.
      Throws:
      IOException - if there is an IO problem
    • writeUniProtXML

      public static void writeUniProtXML(OutputStream os, SequenceIterator in, Namespace ns) throws IOException
      Writes sequences from a SequenceIterator to an OutputStream in UniProt XML Format. This makes for a useful format filter where a StreamReader can be sent to the RichStreamWriter after formatting.
      Parameters:
      os - The stream to write UniProt XML-formatted data to
      in - The source of input Sequences
      ns - a Namespace to write the sequences to. Null implies that it should use the namespace specified in the individual sequence.
      Throws:
      IOException - if there is an IO problem
    • writeUniProtXML

      public static void writeUniProtXML(OutputStream os, Sequence seq, Namespace ns) throws IOException
      Writes a single Sequence to an OutputStream in UniProt XML format.
      Parameters:
      os - the OutputStream.
      seq - the Sequence.
      ns - a Namespace to write the sequences to. Null implies that it should use the namespace specified in the individual sequence.
      Throws:
      IOException - if there is an IO problem
    • getDNAParser

      Creates a DNA symbol tokenizer.
      Returns:
      a SymbolTokenization for parsing DNA.
    • getRNAParser

      Creates an RNA symbol tokenizer.
      Returns:
      a SymbolTokenization for parsing RNA.
    • getNucleotideParser

      Creates a nucleotide symbol tokenizer.
      Returns:
      a SymbolTokenization for parsing nucleotides.
    • getProteinParser

      Creates a protein symbol tokenizer.
      Returns:
      a SymbolTokenization for parsing protein.
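      Conceptually, a symbol tokenizer maps characters in the input text to symbols of an alphabet and rejects characters that do not belong to it. The sketch below shows that idea for unambiguous DNA using only the JDK. It does not use or reproduce SymbolTokenization; the Base enum and the parse method are hypothetical, and ambiguity codes are left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;

class DnaTokenizerSketch {
    /** Hypothetical four-letter alphabet; real alphabets also carry ambiguity symbols. */
    enum Base { A, C, G, T }

    /** Converts text to a list of symbols, accepting upper or lower case. */
    static List<Base> parse(String text) {
        List<Base> symbols = new ArrayList<>(text.length());
        for (char ch : text.toCharArray()) {
            switch (Character.toLowerCase(ch)) {
                case 'a': symbols.add(Base.A); break;
                case 'c': symbols.add(Base.C); break;
                case 'g': symbols.add(Base.G); break;
                case 't': symbols.add(Base.T); break;
                default:
                    throw new IllegalArgumentException("Illegal DNA symbol: " + ch);
            }
        }
        return symbols;
    }

    public static void main(String[] args) {
        System.out.println(parse("acGT"));
    }
}
```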