Package org.biojavax.bio.seq.io
Class UniProtXMLFormat
java.lang.Object
org.biojavax.bio.seq.io.RichSequenceFormat.BasicFormat
org.biojavax.bio.seq.io.UniProtXMLFormat
- All Implemented Interfaces:
- SequenceFormat, RichSequenceFormat
Format reader for UniProtXML files. This version of the UniProtXML format generates and writes RichSequence objects. Loosely based on code from the old, deprecated org.biojava.bio.seq.io.GenbankXmlFormat object.
Understands http://www.ebi.uniprot.org/support/docs/uniprot.xsd
- Since:
- 1.5
- Author:
- Alan Li (code based on his work), Richard Holland
-
Nested Class Summary
Nested Classes
- Modifier and Type: static class
- Description: Implements some UniProtXML-specific terms.
Nested classes/interfaces inherited from interface org.biojavax.bio.seq.io.RichSequenceFormat
RichSequenceFormat.BasicFormat, RichSequenceFormat.HeaderlessFormat
-
Field Summary
Fields
- protected static final Pattern: rppat, xmlSchema
- static final String: UNIPROTXML_FORMAT (The name of this format)
- protected static final String: every other field listed under Field Details
-
Constructor Summary
Constructors
- UniProtXMLFormat()
-
Method Summary
Modifier and Type, Method, Description
- void beginWriting(): Informs the writer that we want to start writing.
- boolean canRead(BufferedInputStream stream): Check to see if a given stream is in our format.
- boolean canRead(File file): Check to see if a given file is in our format.
- void finishWriting(): Informs the writer that we are done writing.
- String getDefaultFormat(): Returns the String identifier for the default sub-format written by a SequenceFormat implementation.
- SymbolTokenization guessSymbolTokenization(BufferedInputStream stream): On the assumption that the stream is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it.
- SymbolTokenization guessSymbolTokenization(File file): On the assumption that the file is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it.
- boolean readRichSequence(BufferedReader reader, SymbolTokenization symParser, RichSeqIOListener rlistener, Namespace ns): Reads a sequence from the given buffered reader using the given tokenizer to parse sequence symbols.
- boolean readSequence(BufferedReader reader, SymbolTokenization symParser, SeqIOListener listener): Read a sequence and pass data on to a SeqIOListener.
- void writeSequence(Sequence seq, PrintStream os): Writes a sequence to the specified PrintStream, using the default format.
- void writeSequence(Sequence seq, String format, PrintStream os): Writes a sequence to the specified PrintStream, using the specified format.
- void writeSequence(Sequence seq, Namespace ns): Writes a sequence out to the output stream given by beginWriting() using the default format of the implementing class.
Methods inherited from class org.biojavax.bio.seq.io.RichSequenceFormat.BasicFormat
getElideComments, getElideFeatures, getElideReferences, getElideSymbols, getLineWidth, getPrintStream, setElideComments, setElideFeatures, setElideReferences, setElideSymbols, setLineWidth, setPrintStream
-
Field Details
-
UNIPROTXML_FORMAT
The name of this format
ENTRY_GROUP_TAG
ENTRY_TAG
ENTRY_VERSION_ATTR
ENTRY_NAMESPACE_ATTR
ENTRY_CREATED_ATTR
ENTRY_UPDATED_ATTR
COPYRIGHT_TAG
ACCESSION_TAG
NAME_TAG
TEXT_TAG
REF_ATTR
TYPE_ATTR
KEY_ATTR
ID_ATTR
EVIDENCE_ATTR
VALUE_ATTR
STATUS_ATTR
NAME_ATTR
PROTEIN_TAG
PROTEIN_TYPE_ATTR
DOMAIN_TAG
COMPONENT_TAG
GENE_TAG
ORGANISM_TAG
DBXREF_TAG
PROPERTY_TAG
LINEAGE_TAG
TAXON_TAG
GENELOCATION_TAG
GENELOCATION_NAME_TAG
REFERENCE_TAG
CITATION_TAG
TITLE_TAG
EDITOR_LIST_TAG
AUTHOR_LIST_TAG
PERSON_TAG
CONSORTIUM_TAG
LOCATOR_TAG
RP_LINE_TAG
RC_LINE_TAG
RC_SPECIES_TAG
RC_TISSUE_TAG
RC_TRANSP_TAG
RC_STRAIN_TAG
RC_PLASMID_TAG
COMMENT_TAG
COMMENT_MASS_ATTR
COMMENT_ERROR_ATTR
COMMENT_METHOD_ATTR
COMMENT_LOCTYPE_ATTR
COMMENT_ABSORPTION_TAG
COMMENT_ABS_MAX_TAG
COMMENT_KINETICS_TAG
COMMENT_KIN_KM_TAG
COMMENT_KIN_VMAX_TAG
COMMENT_PH_TAG
COMMENT_REDOX_TAG
COMMENT_TEMPERATURE_TAG
COMMENT_LINK_TAG
COMMENT_LINK_URI_ATTR
COMMENT_EVENT_TAG
COMMENT_ISOFORM_TAG
COMMENT_INTERACTANT_TAG
COMMENT_INTERACT_INTACT_ATTR
COMMENT_INTERACT_LABEL_TAG
COMMENT_ORGANISMS_TAG
COMMENT_EXPERIMENTS_TAG
NOTE_TAG
KEYWORD_TAG
PROTEIN_EXISTS_TAG
ID_TAG
FEATURE_TAG
FEATURE_DESC_ATTR
FEATURE_ORIGINAL_TAG
FEATURE_VARIATION_TAG
EVIDENCE_TAG
EVIDENCE_CATEGORY_ATTR
EVIDENCE_ATTRIBUTE_ATTR
EVIDENCE_DATE_ATTR
LOCATION_TAG
LOCATION_SEQ_ATTR
LOCATION_BEGIN_TAG
LOCATION_END_TAG
LOCATION_POSITION_ATTR
LOCATION_POSITION_TAG
SEQUENCE_TAG
SEQUENCE_VERSION_ATTR
SEQUENCE_LENGTH_ATTR
SEQUENCE_MASS_ATTR
SEQUENCE_CHECKSUM_ATTR
SEQUENCE_MODIFIED_ATTR
rppat
xmlSchema
-
-
Constructor Details
-
UniProtXMLFormat
public UniProtXMLFormat()
-
-
Method Details
-
canRead
public boolean canRead(File file) throws IOException
Check to see if a given file is in our format. Some formats may be able to determine this by filename, whilst others may have to open the file and read it to see what format it is in. A file is in UniProtXML format if the second XML line contains the phrase "http://www.uniprot.org/support/docs/uniprot.xsd".
- Specified by:
- canRead in interface RichSequenceFormat
- Overrides:
- canRead in class RichSequenceFormat.BasicFormat
- Parameters:
- file - the File to check.
- Returns:
- true if the file is readable by this format, false if not.
- Throws:
- IOException - in case the file is inaccessible.
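The second-line rule described above can be sketched with plain java.io. This is a minimal illustration under stated assumptions, not the BioJava implementation: the class UniProtXmlSniffer and its method name are hypothetical, and only the schema phrase is taken from the documentation.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical helper (not part of BioJava) that applies the second-line rule. */
final class UniProtXmlSniffer {
    // The phrase quoted in the canRead documentation.
    static final String SCHEMA_PHRASE = "http://www.uniprot.org/support/docs/uniprot.xsd";

    /** Returns true if the second line of the input contains the schema phrase. */
    static boolean secondLineHasSchema(Reader in) throws IOException {
        BufferedReader br = new BufferedReader(in);
        br.readLine();                  // skip the first line (typically the XML declaration)
        String second = br.readLine();  // null when the input has fewer than two lines
        return second != null && second.contains(SCHEMA_PHRASE);
    }

    public static void main(String[] args) throws IOException {
        // Assumed sample header; real files may lay out their attributes differently.
        String sample = "<?xml version=\"1.0\"?>\n"
                + "<uniprot schemaLocation=\"" + SCHEMA_PHRASE + "\">\n";
        System.out.println(secondLineHasSchema(new StringReader(sample)));
    }
}
```

To test a file rather than a string, pass a java.io.FileReader for that file and close it afterwards.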
-
guessSymbolTokenization
public SymbolTokenization guessSymbolTokenization(File file) throws IOException
On the assumption that the file is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it. Formats that accept only one tokenization just return it without checking the file; formats that accept several are free to choose how they decide. This implementation always returns a protein tokenizer.
- Specified by:
- guessSymbolTokenization in interface RichSequenceFormat
- Overrides:
- guessSymbolTokenization in class RichSequenceFormat.BasicFormat
- Parameters:
- file - the File object to guess the format of.
- Returns:
- a SymbolTokenization to read the file with.
- Throws:
- IOException - if the file is unrecognisable or inaccessible.
-
canRead
public boolean canRead(BufferedInputStream stream) throws IOException
Check to see if a given stream is in our format. A stream is in UniProtXML format if the second XML line contains the phrase "http://www.uniprot.org/support/docs/uniprot.xsd".
- Parameters:
- stream - the BufferedInputStream to check.
- Returns:
- true if the stream is readable by this format, false if not.
- Throws:
- IOException - in case the stream is inaccessible.
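A check of this kind on a stream normally has to leave the stream positioned at its start, so that a later read still sees the whole document. The sketch below shows one way to do that with mark() and reset(). It is an illustration only: the class StreamSniffer and the PEEK_LIMIT value are assumptions, and the documentation does not say how the real method rewinds.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of BioJava) that peeks at a stream and rewinds it. */
final class StreamSniffer {
    static final String SCHEMA_PHRASE = "http://www.uniprot.org/support/docs/uniprot.xsd";
    // Assumed look-ahead size; a second line longer than this would be cut off.
    static final int PEEK_LIMIT = 2000;

    static boolean secondLineHasSchema(BufferedInputStream stream) throws IOException {
        stream.mark(PEEK_LIMIT);
        byte[] head = new byte[PEEK_LIMIT];
        int total = 0;
        try {
            int n;
            // Never read more than PEEK_LIMIT bytes, so the mark stays valid.
            while (total < PEEK_LIMIT
                    && (n = stream.read(head, total, PEEK_LIMIT - total)) != -1) {
                total += n;
            }
        } finally {
            stream.reset(); // rewind so the caller can parse from the beginning
        }
        String text = new String(head, 0, total, StandardCharsets.UTF_8);
        String[] lines = text.split("\r?\n", 3);
        return lines.length >= 2 && lines[1].contains(SCHEMA_PHRASE);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = ("<?xml version=\"1.0\"?>\n<uniprot schemaLocation=\""
                + SCHEMA_PHRASE + "\">\n").getBytes(StandardCharsets.UTF_8);
        BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        System.out.println(secondLineHasSchema(in));
        System.out.println((char) in.read()); // first byte is still available after the check
    }
}
```

Reading the bytes directly, instead of wrapping the stream in a reader, keeps the number of bytes consumed under the mark limit.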
-
guessSymbolTokenization
public SymbolTokenization guessSymbolTokenization(BufferedInputStream stream) throws IOException
On the assumption that the stream is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it. Formats that accept only one tokenization just return it without checking the stream; formats that accept several are free to choose how they decide. This implementation always returns a protein tokenizer.
- Parameters:
- stream - the BufferedInputStream object to guess the format of.
- Returns:
- a SymbolTokenization to read the stream with.
- Throws:
- IOException - if the stream is unrecognisable or inaccessible.
-
readSequence
public boolean readSequence(BufferedReader reader, SymbolTokenization symParser, SeqIOListener listener) throws IllegalSymbolException, IOException, ParseException
Read a sequence and pass data on to a SeqIOListener.
- Parameters:
- reader - The stream of data to parse.
- symParser - A SymbolTokenization defining a mapping from character data to Symbols.
- listener - A listener to notify when data is extracted from the stream.
- Returns:
- a boolean indicating whether or not the stream contains any more sequences.
- Throws:
- IllegalSymbolException - if it is not possible to translate character data from the stream into valid BioJava symbols.
- IOException - if an error occurs while reading from the stream.
- ParseException
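Because the return value reports whether more sequences remain, callers typically drive the reader in a loop until it returns false. The sketch below shows that loop shape with a stand-in reader. Everything in it is hypothetical: readOne treats each line as one record and is not a BioJava method.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of a "returns true while more remain" read loop. */
final class ReadLoopSketch {
    /** Stand-in for a format's read method: consumes one line, reports whether more input follows. */
    static boolean readOne(BufferedReader reader, List<String> sink) throws IOException {
        String line = reader.readLine();
        if (line != null) {
            sink.add(line);
        }
        reader.mark(1);            // peek one character ahead
        if (reader.read() == -1) {
            return false;          // nothing left to read
        }
        reader.reset();            // put the peeked character back
        return true;
    }

    /** Calls readOne until it reports that the input is exhausted. */
    static List<String> readAll(BufferedReader reader) throws IOException {
        List<String> records = new ArrayList<>();
        boolean more = true;
        while (more) {
            more = readOne(reader, records);
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readAll(new BufferedReader(new StringReader("A\nB\n"))));
    }
}
```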
-
readRichSequence
public boolean readRichSequence(BufferedReader reader, SymbolTokenization symParser, RichSeqIOListener rlistener, Namespace ns) throws IllegalSymbolException, IOException, ParseException
Reads a sequence from the given buffered reader using the given tokenizer to parse sequence symbols. Events are passed to the listener, and the namespace used for sequences read is the one given. If the namespace is null, then the namespace recorded for the sequence in the input is used. If that is also null, then the default namespace is used.
- Parameters:
- reader - the input source
- symParser - the tokenizer which understands the sequence being read
- rlistener - the listener to send sequence events to
- ns - the namespace to read sequences into.
- Returns:
- true if there is more to read after this, false otherwise.
- Throws:
- IllegalSymbolException - if the tokenizer couldn't understand one of the sequence symbols in the file.
- IOException - if there was a read error.
- ParseException
-
beginWriting
public void beginWriting() throws IOException
Informs the writer that we want to start writing. This will do any initialisation required, such as writing the opening tags of an XML file that groups sequences together.
- Throws:
- IOException - if writing fails.
-
finishWriting
public void finishWriting() throws IOException
Informs the writer that we are done writing. This will do any finalisation required, such as writing the closing tags of an XML file that groups sequences together.
- Throws:
- IOException - if writing fails.
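The begin, write, finish lifecycle exists so that several entries can share one enclosing XML element. The sketch below shows the pattern with a stand-in writer. It is not the BioJava class: GroupedXmlWriter, writeEntry and the literal tag names are assumptions chosen for illustration.

```java
import java.io.PrintStream;

/** Hypothetical stand-in (not the BioJava class) for a writer that groups entries. */
final class GroupedXmlWriter {
    private final PrintStream out;

    GroupedXmlWriter(PrintStream out) {
        this.out = out;
    }

    /** Writes the opening tag of the element that groups all entries. */
    void beginWriting() {
        out.print("<uniprot>\n");
    }

    /** Writes one entry; assumes the accession contains no XML special characters. */
    void writeEntry(String accession) {
        out.print("  <entry><accession>" + accession + "</accession></entry>\n");
    }

    /** Writes the closing tag and flushes the stream. */
    void finishWriting() {
        out.print("</uniprot>\n");
        out.flush();
    }

    public static void main(String[] args) {
        GroupedXmlWriter writer = new GroupedXmlWriter(System.out);
        writer.beginWriting();
        writer.writeEntry("P12345");
        writer.writeEntry("Q67890");
        writer.finishWriting();
    }
}
```

Skipping either call leaves the output without its opening or closing group tag, so the two calls should always be paired around the per-sequence writes.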
-
writeSequence
public void writeSequence(Sequence seq, PrintStream os) throws IOException
Writes a sequence to the specified PrintStream, using the default format.
- Parameters:
- seq - the sequence to write out.
- os - the PrintStream to write to.
- Throws:
- IOException
-
writeSequence
public void writeSequence(Sequence seq, String format, PrintStream os) throws IOException
Writes a sequence to the specified PrintStream, using the specified format.
- Parameters:
- seq - a Sequence to write out.
- format - a String indicating which sub-format of those available from a particular SequenceFormat implementation to use when writing.
- os - a PrintStream object.
- Throws:
- IOException - if an error occurs.
-
writeSequence
public void writeSequence(Sequence seq, Namespace ns) throws IOException
Writes a sequence out to the output stream given by beginWriting() using the default format of the implementing class. If a namespace is given, sequences are written with that namespace; if it is null, the sequence's own namespace is used. If you pass this method a sequence which is not a RichSequence, it will attempt to convert it using RichSequence.Tools.enrich(). This does not guarantee a perfect conversion, so it is better to use RichSequences from the start.
- Parameters:
- seq - the sequence to write
- ns - the namespace to write it with
- Throws:
- IOException - in case it couldn't write something
-
getDefaultFormat
public String getDefaultFormat()
Returns the String identifier for the default sub-format written by a SequenceFormat implementation.
- Returns:
- a String.
-