Class FastaFormat

All Implemented Interfaces:
SequenceFormat, RichSequenceFormat

Format object representing FASTA files. These files are almost pure sequence data.
Since:
1.5
Author:
Thomas Down, Matthew Pocock, Greg Cox, Lukas Kall, Richard Holland, Mark Schreiber, Carl Masak
  • Method Details

    • canRead

      public boolean canRead(File file) throws IOException
      Check to see if a given file is in our format. Some formats may be able to determine this by filename, whilst others may have to open the file and read it to see what format it is in. A file is in FASTA format if the name ends with fa or fas, or the file starts with ">".
      Specified by:
      canRead in interface RichSequenceFormat
      Overrides:
      canRead in class RichSequenceFormat.BasicFormat
      Parameters:
      file - the File to check.
      Returns:
      true if the file is readable by this format, false if not.
      Throws:
      IOException - in case the file is inaccessible.
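      The detection rule described above can be sketched in plain Java. This is a minimal illustration, not the BioJava implementation: the class FastaFileSniffer and the method looksLikeFasta are hypothetical names, and the case-sensitive suffix match is an assumption.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of BioJava: mirrors the documented rule only.
final class FastaFileSniffer {
    private FastaFileSniffer() {}

    static boolean looksLikeFasta(File file) throws IOException {
        String name = file.getName();
        // Rule 1: the file name ends with "fa" or "fas"
        // (case-sensitive here, which is an assumption).
        if (name.endsWith("fa") || name.endsWith("fas")) {
            return true;
        }
        // Rule 2: the first byte of the file is '>'.
        try (InputStream in = new FileInputStream(file)) {
            return in.read() == '>';
        }
    }
}
```

      In this sketch the name check comes first, so a file with a matching name is accepted without being opened.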
    • guessSymbolTokenization

      On the assumption that the file is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it. For formats that only accept one tokenization, just return it without checking the file. For formats that accept multiple tokenizations, it's up to you how you do it. Returns a protein parser if the first line of sequence contains any of F/L/I/P/Q/E, otherwise returns a DNA tokenizer.
      Specified by:
      guessSymbolTokenization in interface RichSequenceFormat
      Overrides:
      guessSymbolTokenization in class RichSequenceFormat.BasicFormat
      Parameters:
      file - the File object to guess the format of.
      Returns:
      a SymbolTokenization to read the file with.
      Throws:
      IOException - if the file is unrecognisable or inaccessible.
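      The guessing heuristic can be illustrated without BioJava types. The sketch below is hypothetical: TokenizationGuess and its Kind enum stand in for the real SymbolTokenization objects, and it assumes that the first line is the description line, that the second line is the first line of sequence, and that the letter match is case-insensitive.

```java
import java.io.BufferedReader;
import java.io.IOException;

// Hypothetical stand-in for the documented heuristic; not the BioJava code.
final class TokenizationGuess {
    enum Kind { DNA, PROTEIN }

    private TokenizationGuess() {}

    static Kind guess(BufferedReader reader) throws IOException {
        reader.readLine();                  // assumed: description line, skipped
        String firstSequenceLine = reader.readLine();
        if (firstSequenceLine == null) {
            return Kind.DNA;                // nothing to inspect: fall back to DNA
        }
        // Any of F/L/I/P/Q/E marks the line as protein.
        for (char c : firstSequenceLine.toUpperCase().toCharArray()) {
            if ("FLIPQE".indexOf(c) >= 0) {
                return Kind.PROTEIN;
            }
        }
        return Kind.DNA;
    }
}
```

      Only one line of sequence is inspected, so a protein sequence whose first line happens to contain none of these six letters would be classified as DNA under this rule.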
    • canRead

      public boolean canRead(BufferedInputStream stream) throws IOException
      Check to see if a given stream is in our format. A stream is in FASTA format if the stream starts with ">".
      Parameters:
      stream - the BufferedInputStream to check.
      Returns:
      true if the stream is readable by this format, false if not.
      Throws:
      IOException - in case the stream is inaccessible.
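      For a stream, the same check has to avoid consuming data that a later parser needs. One way to do that, shown here as a sketch with the hypothetical class FastaStreamSniffer, is to use mark and reset on the BufferedInputStream. Whether the actual implementation restores the stream position is an assumption of this example.

```java
import java.io.BufferedInputStream;
import java.io.IOException;

// Hypothetical helper illustrating a non-consuming peek at the first byte.
final class FastaStreamSniffer {
    private FastaStreamSniffer() {}

    static boolean looksLikeFasta(BufferedInputStream stream) throws IOException {
        stream.mark(16);            // remember the current position
        int first = stream.read();  // -1 on an empty stream
        stream.reset();             // rewind so the caller can still parse from the start
        return first == '>';
    }
}
```

      After the call, the stream is positioned where it was before, so it can be handed straight to a reader.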
    • guessSymbolTokenization

      On the assumption that the stream is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it. For formats that only accept one tokenization, just return it without checking the stream. For formats that accept multiple tokenizations, it's up to you how you do it. Returns a protein parser if the first line of sequence contains any of F/L/I/P/Q/E, otherwise returns a DNA tokenizer.
      Parameters:
      stream - the BufferedInputStream object to guess the format of.
      Returns:
      a SymbolTokenization to read the stream with.
      Throws:
      IOException - if the stream is unrecognisable or inaccessible.
    • readSequence

      Read a sequence and pass data on to a SeqIOListener.
      Parameters:
      reader - The stream of data to parse.
      symParser - A SymbolParser defining a mapping from character data to Symbols.
      listener - A listener to notify when data is extracted from the stream.
      Returns:
      a boolean indicating whether or not the stream contains any more sequences.
      Throws:
      IllegalSymbolException - if it is not possible to translate character data from the stream into valid BioJava symbols.
      IOException - if an error occurs while reading from the stream.
      ParseException
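      The boolean return value supports a simple read loop: keep calling the method until it reports that no more sequences remain. The sketch below shows that contract with a self-contained toy reader. MiniFastaReader and its Listener interface are hypothetical and much simpler than SeqIOListener; they are not BioJava types.

```java
import java.io.BufferedReader;
import java.io.IOException;

// Toy reader illustrating the "returns true while more records remain" contract.
final class MiniFastaReader {
    // Hypothetical, simplified stand-in for a sequence listener.
    interface Listener {
        void record(String header, String sequence);
    }

    private MiniFastaReader() {}

    static boolean readRecord(BufferedReader reader, Listener listener) throws IOException {
        String header = reader.readLine();
        if (header == null || !header.startsWith(">")) {
            throw new IOException("Input does not start with a FASTA description line");
        }
        StringBuilder sequence = new StringBuilder();
        boolean more = false;
        while (true) {
            reader.mark(1);
            int c = reader.read();
            if (c == -1) {
                break;                      // end of input: no more records
            }
            if (c == '>') {
                reader.reset();             // leave the next header for the next call
                more = true;
                break;
            }
            if (c == '\n' || c == '\r') {
                continue;                   // skip blank lines and line terminators
            }
            String rest = reader.readLine();
            sequence.append((char) c);
            if (rest != null) {
                sequence.append(rest.trim());
            }
        }
        listener.record(header.substring(1), sequence.toString());
        return more;
    }
}
```

      A caller would typically wrap this in a do/while loop, invoking readRecord until it returns false.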
    • readRichSequence

      Reads a sequence from the given buffered reader, using the given tokenizer to parse sequence symbols. Events are passed to the listener, and sequences are read into the given namespace. If the namespace is null, the namespace of the sequence in the FASTA file is used; if that is also null, the default namespace is used.
      Parameters:
      reader - the input source
      symParser - the tokenizer which understands the sequence being read
      rsiol - the listener to send sequence events to
      ns - the namespace to read sequences into.
      Returns:
      true if there is more to read after this, false otherwise.
      Throws:
      IllegalSymbolException - if the tokenizer couldn't understand one of the sequence symbols in the file.
      IOException - if there was a read error.
      ParseException
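      The namespace fallback described for this method amounts to a three-step chain. As a rough sketch, with plain strings standing in for Namespace objects and the hypothetical helper NamespaceFallback.resolve:

```java
// Hypothetical illustration of the fallback order; strings stand in for Namespace objects.
final class NamespaceFallback {
    private NamespaceFallback() {}

    static String resolve(String requested, String fromFasta, String defaultNamespace) {
        if (requested != null) {
            return requested;        // 1. the namespace passed by the caller
        }
        if (fromFasta != null) {
            return fromFasta;        // 2. the namespace found in the FASTA record
        }
        return defaultNamespace;     // 3. the default namespace
    }
}
```

      Passing a non-null namespace therefore overrides whatever the record itself carries.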
    • processHeader

      public void processHeader(String line, RichSeqIOListener rsiol, Namespace ns) throws IOException, ParseException
      Parses the header information from the FASTA description line.
      Parameters:
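      line - the FASTA description line to parse
      rsiol - the listener to send header events to
      ns - the namespace to read sequences into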
      Throws:
      IOException
      ParseException
    • writeSequence

      public void writeSequence(Sequence seq, PrintStream os) throws IOException
      writeSequence writes a sequence to the specified PrintStream, using the default format.
      Parameters:
      seq - the sequence to write out.
      os - the PrintStream to write to.
      Throws:
      IOException
    • writeSequence

      public void writeSequence(Sequence seq, String format, PrintStream os) throws IOException
      writeSequence writes a sequence to the specified PrintStream, using the specified format.
      Parameters:
      seq - a Sequence to write out.
      format - a String indicating which sub-format of those available from a particular SequenceFormat implementation to use when writing.
      os - a PrintStream object.
      Throws:
      IOException - if an error occurs.
    • writeSequence

      public void writeSequence(Sequence seq, Namespace ns) throws IOException
      Writes a sequence to the output stream given by beginWriting(), using the default format of the implementing class. If a namespace is given, sequences are written with that namespace; otherwise they are written with the default namespace of the implementing class (which is usually the namespace of the sequence itself). If the sequence passed to this method is not a RichSequence, the method attempts to convert it using RichSequence.Tools.enrich(). This does not guarantee a perfect conversion, so it is better to use RichSequences from the start. If the namespace is null, the sequence's own namespace is used.
      Parameters:
      seq - the sequence to write
      ns - the namespace to write it with
      Throws:
      IOException - if the sequence could not be written.
    • getDefaultFormat

      getDefaultFormat returns the String identifier for the default sub-format written by a SequenceFormat implementation.
      Returns:
      a String.
    • getHeader

    • setHeader

      public void setHeader(FastaHeader header)