Interface Node
-
- All Superinterfaces:
java.lang.Cloneable
- All Known Implementing Classes:
AbstractNode, AppletTag, BaseHrefTag, BodyTag, Bullet, BulletList, CompositeTag, DefinitionList, DefinitionListBullet, Div, DoctypeTag, FormTag, FrameSetTag, FrameTag, HeadingTag, HeadTag, Html, ImageTag, InputTag, JspTag, LabelTag, LinkTag, MetaTag, ObjectTag, OptionTag, ParagraphTag, ProcessingInstructionTag, RemarkNode, ScriptTag, SelectTag, Span, StyleTag, TableColumn, TableHeader, TableRow, TableTag, TagNode, TextareaTag, TextNode, TitleTag
public interface Node extends java.lang.Cloneable
Specifies the minimum requirements for nodes returned by the Lexer or Parser. There are three types of nodes in HTML: text, remarks and tags. You may wish to define your own nodes to be returned by the Lexer or Parser, but each of the types must support this interface. More specific interface requirements for each of the node types are specified by the Text, Remark and Tag interfaces.
-
-
Method Summary
All methods are abstract instance methods. Each entry gives the modifier and type, the method, then the description.
void accept(NodeVisitor visitor)
Apply the visitor to this node.
java.lang.Object clone()
Allow cloning of nodes.
void collectInto(NodeList list, NodeFilter filter)
Collect this node and its child nodes into a list, provided the node satisfies the filtering criteria.
void doSemanticAction()
Perform the meaning of this tag.
NodeList getChildren()
Get the children of this node.
int getEndPosition()
Gets the ending position of the node.
Node getFirstChild()
Get the first child of this node.
Node getLastChild()
Get the last child of this node.
Node getNextSibling()
Get the next sibling to this node.
Page getPage()
Get the page this node came from.
Node getParent()
Get the parent of this node.
Node getPreviousSibling()
Get the previous sibling to this node.
int getStartPosition()
Gets the starting position of the node.
java.lang.String getText()
Returns the text of the node.
void setChildren(NodeList children)
Set the children of this node.
void setEndPosition(int position)
Sets the ending position of the node.
void setPage(Page page)
Set the page this node came from.
void setParent(Node node)
Sets the parent of this node.
void setStartPosition(int position)
Sets the starting position of the node.
void setText(java.lang.String text)
Sets the string contents of the node.
java.lang.String toHtml()
Return the HTML for this node.
java.lang.String toHtml(boolean verbatim)
Return the HTML for this node.
java.lang.String toPlainTextString()
A string representation of the node.
java.lang.String toString()
Return the string representation of the node.
-
-
-
Method Detail
-
toPlainTextString
java.lang.String toPlainTextString()
A string representation of the node. This is an important method: it allows a simple string transformation of a web page, regardless of the type of node. For a Text node this is the textual contents itself. For a Remark node this is the remark contents (sic). For tags this is the text contents of their children (if any). Because multiple nodes are combined when presenting a page in a browser, this will not reflect what a user would see. See HTML specification section 9.1 White space http://www.w3.org/TR/html4/struct/text.html#h-9.1.
Typical application code (for extracting only the text from a web page) would be:
for (NodeIterator e = parser.elements (); e.hasMoreNodes ();)
    // or do whatever processing you wish with the plain text string
    System.out.println (e.nextNode ().toPlainTextString ());
- Returns:
- The text of this node including its children.
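The rule described above (a text node yields its own contents, a tag yields the concatenated text of its children) can be sketched without the library. The classes below are hypothetical stand-ins written for illustration only; they are not the org.htmlparser implementation, and remark nodes are left out.

```java
import java.util.ArrayList;
import java.util.List;

final class PlainTextSketch {
    /** Hypothetical stand-in for a node; not the library's Node interface. */
    interface SketchNode {
        String toPlainTextString();
    }

    /** Text node: the plain text is the textual contents itself. */
    static final class SketchText implements SketchNode {
        private final String text;
        SketchText(String text) { this.text = text; }
        @Override public String toPlainTextString() { return text; }
    }

    /** Tag node: the plain text is the concatenated plain text of its children. */
    static final class SketchTag implements SketchNode {
        private final List<SketchNode> children = new ArrayList<>();
        SketchTag(SketchNode... kids) {
            for (SketchNode k : kids) children.add(k);
        }
        @Override public String toPlainTextString() {
            StringBuilder sb = new StringBuilder();
            for (SketchNode child : children) sb.append(child.toPlainTextString());
            return sb.toString();
        }
    }

    /** Models <p>Hello <b>big</b> world</p> and returns its plain text. */
    static String demo() {
        SketchNode p = new SketchTag(
                new SketchText("Hello "),
                new SketchTag(new SketchText("big")),
                new SketchText(" world"));
        return p.toPlainTextString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Note that whitespace is concatenated exactly as stored, which is one reason the result can differ from what a browser would display.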
-
toHtml
java.lang.String toHtml()
Return the HTML for this node. This should be the sequence of characters that were encountered by the parser that caused this node to be created. Where this breaks down is where broken nodes (tags and remarks) have been encountered and fixed. Applications reproducing HTML can use this method on nodes which are to be used or transferred as they were received or created.
- Returns:
- The sequence of characters that would cause this node to be returned by the parser or lexer.
-
toHtml
java.lang.String toHtml(boolean verbatim)
Return the HTML for this node. This should be the exact sequence of characters that were encountered by the parser that caused this node to be created. Where this breaks down is where broken nodes (tags and remarks) have been encountered and fixed. Applications reproducing HTML can use this method on nodes which are to be used or transferred as they were received or created.
- Parameters:
verbatim
- If true, return as close to the original page text as possible.
- Returns:
- The (exact) sequence of characters that would cause this node to be returned by the parser or lexer.
-
toString
java.lang.String toString()
Return the string representation of the node. The return value may not be the entire contents of the node, and non-printable characters may be translated in order to make them visible. This is typically to be used in the manner
System.out.println (node);
or within a debugging environment.
- Overrides:
toString
in class java.lang.Object
- Returns:
- A string representation of this node suitable for printing, that isn't too large.
-
collectInto
void collectInto(NodeList list, NodeFilter filter)
Collect this node and its child nodes into a list, provided the node satisfies the filtering criteria.
This mechanism allows powerful filtering code to be written very easily, without bothering about collection of embedded tags separately. For example, when we try to get all the links on a page, it is not possible to get them at the top level, as many tags (like form tags) can contain links embedded in them. We could get the links out by checking if the current node is a CompositeTag and going through its children. This method provides a convenient way to do that.
Using collectInto(), programs get a lot shorter. Now, the code to extract all links from a page would look like:
NodeList list = new NodeList ();
NodeFilter filter = new TagNameFilter ("A");
for (NodeIterator e = parser.elements (); e.hasMoreNodes ();)
    e.nextNode ().collectInto (list, filter);
Thus, list will hold all the link nodes, irrespective of how deep the links are embedded.
Another way to accomplish the same objective is:
NodeList list = new NodeList ();
NodeFilter filter = new TagClassFilter (LinkTag.class);
for (NodeIterator e = parser.elements (); e.hasMoreNodes ();)
    e.nextNode ().collectInto (list, filter);
This is slightly less specific because the LinkTag class may be registered for more than one node name, e.g. <LINK> tags too.
- Parameters:
list
- The list to collect nodes into.
filter
- The criteria to use when deciding if a node should be added to the list.
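To make the recursion behind collectInto concrete, here is a minimal self-contained sketch. It assumes a hypothetical SketchNode type and uses java.util.function.Predicate in place of NodeFilter, so it illustrates the idea rather than the library's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class CollectIntoSketch {
    /** Hypothetical minimal node: a tag name plus child nodes. */
    static final class SketchNode {
        final String name;
        final List<SketchNode> children = new ArrayList<>();

        SketchNode(String name, SketchNode... kids) {
            this.name = name;
            for (SketchNode k : kids) children.add(k);
        }

        /** Adds this node if it passes the filter, then recurses into its children. */
        void collectInto(List<SketchNode> list, Predicate<SketchNode> filter) {
            if (filter.test(this)) list.add(this);
            for (SketchNode child : children) child.collectInto(list, filter);
        }
    }

    /** Counts the A nodes in <BODY><A/><FORM><DIV><A/></DIV></FORM></BODY>. */
    static int countLinks() {
        SketchNode body = new SketchNode("BODY",
                new SketchNode("A"),
                new SketchNode("FORM",
                        new SketchNode("DIV", new SketchNode("A"))));
        List<SketchNode> links = new ArrayList<>();
        body.collectInto(links, n -> n.name.equals("A"));
        return links.size();
    }

    public static void main(String[] args) {
        System.out.println(countLinks());
    }
}
```

The link nested inside the form is found without the caller inspecting any intermediate node, which is the convenience the description above refers to.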
-
getStartPosition
int getStartPosition()
Gets the starting position of the node. This is the character (not byte) offset of this node in the page.
- Returns:
- The start position.
- See Also:
setStartPosition(int)
-
setStartPosition
void setStartPosition(int position)
Sets the starting position of the node.
- Parameters:
position
- The new start position.
- See Also:
getStartPosition()
-
getEndPosition
int getEndPosition()
Gets the ending position of the node. This is the character (not byte) offset of the character following this node in the page.
- Returns:
- The end position.
- See Also:
setEndPosition(int)
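Because the start offset is inclusive and the end offset points at the character following the node, the pair maps directly onto String.substring. The sketch below uses only the JDK; the page text and the helper names slice and byteOffset are made up for illustration, and "character" here means a Java char (UTF-16 code unit).

```java
import java.nio.charset.StandardCharsets;

final class PositionSketch {
    /** Returns the text in [start, end): start is inclusive, end is exclusive. */
    static String slice(String page, int start, int end) {
        return page.substring(start, end);
    }

    /** Number of UTF-8 bytes that precede the given character offset. */
    static int byteOffset(String page, int charOffset) {
        return page.substring(0, charOffset).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String page = "\u00e9<b>x</b>";        // e-acute followed by a bold tag
        int start = page.indexOf("<b>");       // character offset of the tag
        int end = start + "<b>".length();      // offset of the character after the tag
        System.out.println(slice(page, start, end));
        // The accented character takes two bytes in UTF-8, so the offsets differ.
        System.out.println(start + " chars vs " + byteOffset(page, start) + " bytes");
    }
}
```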
-
setEndPosition
void setEndPosition(int position)
Sets the ending position of the node.
- Parameters:
position
- The new end position.
- See Also:
getEndPosition()
-
getPage
Page getPage()
Get the page this node came from.
- Returns:
- The page that supplied this node.
- See Also:
setPage(org.htmlparser.lexer.Page)
-
setPage
void setPage(Page page)
Set the page this node came from.
- Parameters:
page
- The page that supplied this node.
- See Also:
getPage()
-
accept
void accept(NodeVisitor visitor)
Apply the visitor to this node.
- Parameters:
visitor
- The visitor to this node.
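The accept method follows the visitor pattern: the node decides which visitor callback matches its own kind. A minimal sketch is shown below; SketchVisitor, SketchText and SketchTag are hypothetical types, and the unconditional depth-first recursion into children is an assumption of this example, not a statement about how the library traverses.

```java
import java.util.ArrayList;
import java.util.List;

final class AcceptSketch {
    /** Hypothetical visitor: one callback per node kind. */
    interface SketchVisitor {
        void visitText(String text);
        void visitTag(String name);
    }

    /** Hypothetical node: accept() hands the node to the matching callback. */
    interface SketchNode {
        void accept(SketchVisitor visitor);
    }

    static final class SketchText implements SketchNode {
        final String text;
        SketchText(String text) { this.text = text; }
        @Override public void accept(SketchVisitor visitor) { visitor.visitText(text); }
    }

    static final class SketchTag implements SketchNode {
        final String name;
        final List<SketchNode> children = new ArrayList<>();
        SketchTag(String name, SketchNode... kids) {
            this.name = name;
            for (SketchNode k : kids) children.add(k);
        }
        @Override public void accept(SketchVisitor visitor) {
            visitor.visitTag(name);
            for (SketchNode child : children) child.accept(visitor); // depth-first
        }
    }

    /** Records the order in which a visitor sees <P>Hello<B>big</B></P>. */
    static List<String> trace() {
        SketchNode p = new SketchTag("P",
                new SketchText("Hello"),
                new SketchTag("B", new SketchText("big")));
        List<String> events = new ArrayList<>();
        p.accept(new SketchVisitor() {
            @Override public void visitText(String text) { events.add("text:" + text); }
            @Override public void visitTag(String name) { events.add("tag:" + name); }
        });
        return events;
    }

    public static void main(String[] args) {
        System.out.println(trace());
    }
}
```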
-
getParent
Node getParent()
Get the parent of this node. This will always return null when parsing with the Lexer. Currently, the object returned from this method can be safely cast to a CompositeTag, but this behaviour should not be expected in the future.
- Returns:
- The parent of this node if it has been set, null otherwise.
- See Also:
setParent(org.htmlparser.Node)
-
setParent
void setParent(Node node)
Sets the parent of this node.
- Parameters:
node
- The node that contains this node.
- See Also:
getParent()
-
getChildren
NodeList getChildren()
Get the children of this node.
- Returns:
- The list of children contained by this node if it has been set, null otherwise.
- See Also:
setChildren(org.htmlparser.util.NodeList)
-
setChildren
void setChildren(NodeList children)
Set the children of this node.
- Parameters:
children
- The new list of children this node contains.
- See Also:
getChildren()
-
getFirstChild
Node getFirstChild()
Get the first child of this node.
- Returns:
- The first child in the list of children contained by this node if one exists, null otherwise.
-
getLastChild
Node getLastChild()
Get the last child of this node.
- Returns:
- The last child in the list of children contained by this node if one exists, null otherwise.
-
getPreviousSibling
Node getPreviousSibling()
Get the previous sibling to this node.
- Returns:
- The previous sibling to this node if one exists, null otherwise.
-
getNextSibling
Node getNextSibling()
Get the next sibling to this node.
- Returns:
- The next sibling to this node if one exists, null otherwise.
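The child and sibling accessors can all be derived from the parent reference and the parent's child list. The sketch below shows one way to do that with a hypothetical SketchNode class; it is an illustration under that assumption and not the library's implementation.

```java
import java.util.ArrayList;
import java.util.List;

final class SiblingSketch {
    /** Hypothetical node holding a parent reference and an ordered child list. */
    static final class SketchNode {
        final String name;
        SketchNode parent;
        final List<SketchNode> children = new ArrayList<>();

        SketchNode(String name) { this.name = name; }

        SketchNode add(SketchNode child) {
            child.parent = this;
            children.add(child);
            return this;
        }

        SketchNode getFirstChild() {
            return children.isEmpty() ? null : children.get(0);
        }

        SketchNode getLastChild() {
            return children.isEmpty() ? null : children.get(children.size() - 1);
        }

        SketchNode getNextSibling() {
            if (parent == null) return null;           // no parent, no siblings
            int i = parent.children.indexOf(this);
            return i + 1 < parent.children.size() ? parent.children.get(i + 1) : null;
        }

        SketchNode getPreviousSibling() {
            if (parent == null) return null;
            int i = parent.children.indexOf(this);
            return i > 0 ? parent.children.get(i - 1) : null;
        }
    }

    /** Walks the children of a list node from first to last via getNextSibling(). */
    static String walk() {
        SketchNode ul = new SketchNode("UL")
                .add(new SketchNode("a"))
                .add(new SketchNode("b"))
                .add(new SketchNode("c"));
        StringBuilder sb = new StringBuilder();
        for (SketchNode n = ul.getFirstChild(); n != null; n = n.getNextSibling()) {
            sb.append(n.name);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(walk());
    }
}
```

Checking each accessor for null before dereferencing, as the loop condition does, matches the documented return values.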
-
getText
java.lang.String getText()
Returns the text of the node.
- Returns:
- The contents of the string or remark node, and in the case of a tag, the contents of the tag less the enclosing angle brackets.
- See Also:
setText(java.lang.String)
-
setText
void setText(java.lang.String text)
Sets the string contents of the node.
- Parameters:
text
- The new text for the node.
- See Also:
getText()
-
doSemanticAction
void doSemanticAction() throws ParserException
Perform the meaning of this tag. This is defined by the tag, for example the bold tag <B> may switch bold text on and off. Only a few tags have semantic meaning to the parser. These have to do with the character set to use (<META>) and the base URL to use (<BASE>). Other than that, the semantic meaning is up to the application and its custom nodes.
The semantic action is performed when the node has been parsed. For composite nodes (those that contain other nodes), the children will have already been parsed and will be available via getChildren().
- Throws:
ParserException
- If a problem is encountered performing the semantic action.
-
clone
java.lang.Object clone() throws java.lang.CloneNotSupportedException
Allow cloning of nodes. Creates and returns a copy of this object. The precise meaning of "copy" may depend on the class of the object. The general intent is that, for any object x, the expressions
x.clone() != x
x.clone().getClass() == x.getClass()
will be true, and that the expression
x.clone().equals(x)
will typically be true, although none of these is an absolute requirement.
By convention, the returned object should be obtained by calling super.clone. If a class and all of its superclasses (except Object) obey this convention, it will be the case that x.clone().getClass() == x.getClass().
By convention, the object returned by this method should be independent of this object (which is being cloned). To achieve this independence, it may be necessary to modify one or more fields of the object returned by super.clone before returning it. Typically, this means copying any mutable objects that comprise the internal "deep structure" of the object being cloned and replacing the references to these objects with references to the copies. If a class contains only primitive fields or references to immutable objects, then it is usually the case that no fields in the object returned by super.clone need to be modified.
The method clone for class Object performs a specific cloning operation. First, if the class of this object does not implement the interface Cloneable, then a CloneNotSupportedException is thrown. Note that all arrays are considered to implement the interface Cloneable. Otherwise, this method creates a new instance of the class of this object and initializes all its fields with exactly the contents of the corresponding fields of this object, as if by assignment; the contents of the fields are not themselves cloned. Thus, this method performs a "shallow copy" of this object, not a "deep copy" operation.
The class Object does not itself implement the interface Cloneable, so calling the clone method on an object whose class is Object will result in throwing an exception at run time.
- Returns:
- a clone of this instance.
- Throws:
java.lang.CloneNotSupportedException
- if the object's class does not support the Cloneable interface. Subclasses that override the clone method can also throw this exception to indicate that an instance cannot be cloned.
- See Also:
Cloneable
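The difference between the shallow copy made by Object.clone and an independent copy can be seen in a few lines. The SketchNode class and its attributes field below are hypothetical and exist only for this illustration; they say nothing about how the library's node classes clone themselves.

```java
import java.util.ArrayList;
import java.util.List;

final class CloneSketch {
    /** Hypothetical cloneable node with one mutable field. */
    static final class SketchNode implements Cloneable {
        String text = "";
        List<String> attributes = new ArrayList<>();

        /** Shallow copy: field values are copied, so the list is shared. */
        SketchNode shallowCopy() {
            try {
                return (SketchNode) super.clone();
            } catch (CloneNotSupportedException e) {
                throw new AssertionError(e); // unreachable: the class is Cloneable
            }
        }

        /** Independent copy: the mutable list is replaced with its own copy. */
        SketchNode deepCopy() {
            SketchNode copy = shallowCopy();
            copy.attributes = new ArrayList<>(attributes);
            return copy;
        }
    }

    /** True if a change made through a shallow copy is visible in the original. */
    static boolean shallowSharesList() {
        SketchNode original = new SketchNode();
        SketchNode copy = original.shallowCopy();
        copy.attributes.add("href");
        return original.attributes.contains("href");
    }

    /** True if a change made through a deep copy is visible in the original. */
    static boolean deepSharesList() {
        SketchNode original = new SketchNode();
        SketchNode copy = original.deepCopy();
        copy.attributes.add("href");
        return original.attributes.contains("href");
    }

    public static void main(String[] args) {
        System.out.println("shallow shares list: " + shallowSharesList());
        System.out.println("deep shares list: " + deepSharesList());
    }
}
```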
-
-