Package org.htmlparser.beans
Class StringBean
java.lang.Object
  org.htmlparser.visitors.NodeVisitor
    org.htmlparser.beans.StringBean
All Implemented Interfaces:
java.io.Serializable
public class StringBean extends NodeVisitor implements java.io.Serializable
Extract strings from a URL.
Text within <SCRIPT></SCRIPT> tags is removed.
The text within <PRE></PRE> tags is not altered.
The property Strings, which is the output property, is null until a URL is set. So a typical usage is:
    StringBean sb = new StringBean ();
    sb.setLinks (false);
    sb.setReplaceNonBreakingSpaces (true);
    sb.setCollapse (true);
    sb.setURL ("http://www.netbeans.org"); // the HTTP is performed here
    String s = sb.getStrings ();
You can also use the StringBean as a NodeVisitor on your own parser, in which case you have to refetch your page if you change one of the properties, because changing a property resets the Strings property:
    StringBean sb = new StringBean ();
    Parser parser = new Parser ("http://cbc.ca");
    parser.visitAllNodesWith (sb);
    String s = sb.getStrings ();
    sb.setLinks (true);
    parser.reset ();
    parser.visitAllNodesWith (sb);
    String sl = sb.getStrings ();
According to Nick Burch, who contributed the patch, this is handy if you don't want StringBean to wander off and get the content itself, for example because you already have it or because it is not on a website.
See Also:
Serialized Form
Field Summary
Modifier and Type | Field | Description
protected java.lang.StringBuffer | mBuffer | The buffer text is stored in while traversing the HTML.
protected boolean | mCollapse | If true sequences of whitespace characters are replaced with a single space character.
protected int | mCollapseState | The state of the collapse processing state machine.
protected boolean | mIsPre | Set true when traversing a PRE tag.
protected boolean | mIsScript | Set true when traversing a SCRIPT tag.
protected boolean | mIsStyle | Set true when traversing a STYLE tag.
protected boolean | mLinks | If true the link URLs are embedded in the text output.
protected Parser | mParser | The parser used to extract strings.
protected java.beans.PropertyChangeSupport | mPropertySupport | Bound property support.
protected boolean | mReplaceSpace | If true regular space characters are substituted for non-breaking spaces in the text output.
protected java.lang.String | mStrings | The strings extracted from the URL.
static java.lang.String | PROP_COLLAPSE_PROPERTY | Property name in event where the 'collapse whitespace' state changes.
static java.lang.String | PROP_CONNECTION_PROPERTY | Property name in event where the connection changes.
static java.lang.String | PROP_LINKS_PROPERTY | Property name in event where the 'embed links' state changes.
static java.lang.String | PROP_REPLACE_SPACE_PROPERTY | Property name in event where the 'replace non-breaking spaces' state changes.
static java.lang.String | PROP_STRINGS_PROPERTY | Property name in event where the URL contents change.
static java.lang.String | PROP_URL_PROPERTY | Property name in event where the URL changes.
-
Constructor Summary
Constructor | Description
StringBean() | Create a StringBean object.
-
Method Summary
Modifier and Type | Method | Description
void | addPropertyChangeListener(java.beans.PropertyChangeListener listener) | Add a PropertyChangeListener to the listener list.
protected void | carriageReturn() | Appends a newline to the buffer if there isn't one there already.
protected void | collapse(java.lang.StringBuffer buffer, java.lang.String string) | Add the given text collapsing whitespace.
protected java.lang.String | extractStrings() | Extract the text from a page.
boolean | getCollapse() | Get the current 'collapse whitespace' state.
java.net.URLConnection | getConnection() | Get the current connection.
boolean | getLinks() | Get the current 'include links' state.
boolean | getReplaceNonBreakingSpaces() | Get the current 'replace non breaking spaces' state.
java.lang.String | getStrings() | Return the textual contents of the URL.
java.lang.String | getURL() | Get the current URL.
static void | main(java.lang.String[] args) | Unit test.
void | removePropertyChangeListener(java.beans.PropertyChangeListener listener) | Remove a PropertyChangeListener from the listener list.
void | setCollapse(boolean collapse) | Set the current 'collapse whitespace' state.
void | setConnection(java.net.URLConnection connection) | Set the parser's connection.
void | setLinks(boolean links) | Set the 'include links' state.
void | setReplaceNonBreakingSpaces(boolean replace) | Set the 'replace non breaking spaces' state.
protected void | setStrings() | Fetch the URL contents.
void | setURL(java.lang.String url) | Set the URL to extract strings from.
protected void | updateStrings(java.lang.String strings) | Assign the Strings property, firing the property change.
void | visitEndTag(Tag tag) | Resets the state of the PRE and SCRIPT flags.
void | visitStringNode(Text string) | Appends the text to the output.
void | visitTag(Tag tag) | Appends a NEWLINE to the output if the tag breaks flow, and possibly sets the state of the PRE and SCRIPT flags.
Methods inherited from class org.htmlparser.visitors.NodeVisitor
beginParsing, finishedParsing, shouldRecurseChildren, shouldRecurseSelf, visitRemarkNode
-
-
-
-
Field Detail
-
PROP_STRINGS_PROPERTY
public static final java.lang.String PROP_STRINGS_PROPERTY
Property name in event where the URL contents change.
See Also:
Constant Field Values
-
PROP_LINKS_PROPERTY
public static final java.lang.String PROP_LINKS_PROPERTY
Property name in event where the 'embed links' state changes.
See Also:
Constant Field Values
-
PROP_URL_PROPERTY
public static final java.lang.String PROP_URL_PROPERTY
Property name in event where the URL changes.
See Also:
Constant Field Values
-
PROP_REPLACE_SPACE_PROPERTY
public static final java.lang.String PROP_REPLACE_SPACE_PROPERTY
Property name in event where the 'replace non-breaking spaces' state changes.
See Also:
Constant Field Values
-
PROP_COLLAPSE_PROPERTY
public static final java.lang.String PROP_COLLAPSE_PROPERTY
Property name in event where the 'collapse whitespace' state changes.
See Also:
Constant Field Values
-
PROP_CONNECTION_PROPERTY
public static final java.lang.String PROP_CONNECTION_PROPERTY
Property name in event where the connection changes.
See Also:
Constant Field Values
-
mPropertySupport
protected java.beans.PropertyChangeSupport mPropertySupport
Bound property support.
-
mParser
protected Parser mParser
The parser used to extract strings.
-
mStrings
protected java.lang.String mStrings
The strings extracted from the URL.
-
mLinks
protected boolean mLinks
If true the link URLs are embedded in the text output.
-
mReplaceSpace
protected boolean mReplaceSpace
If true regular space characters are substituted for non-breaking spaces in the text output.
-
mCollapse
protected boolean mCollapse
If true sequences of whitespace characters are replaced with a single space character.
-
mCollapseState
protected int mCollapseState
The state of the collapse processing state machine.
-
mBuffer
protected java.lang.StringBuffer mBuffer
The buffer text is stored in while traversing the HTML.
-
mIsScript
protected boolean mIsScript
Set true when traversing a SCRIPT tag.
-
mIsPre
protected boolean mIsPre
Set true when traversing a PRE tag.
-
mIsStyle
protected boolean mIsStyle
Set true when traversing a STYLE tag.
-
-
Constructor Detail
-
StringBean
public StringBean()
Create a StringBean object. Default property values are set to 'do the right thing':
- Links is set false so text appears like a browser would display it, albeit without the colour or underline clues normally associated with a link.
- ReplaceNonBreakingSpaces is set true, so that printing the text works, but the extra information regarding these formatting marks is available if you set it false.
- Collapse is set true, so text appears compact like a browser would display it.
-
-
Method Detail
-
carriageReturn
protected void carriageReturn()
Appends a newline to the buffer if there isn't one there already, unless the buffer is empty.
-
collapse
protected void collapse(java.lang.StringBuffer buffer, java.lang.String string)
Add the given text collapsing whitespace. Use a little finite state machine:
    state 0: whitespace was last emitted character
    state 1: in whitespace
    state 2: in word
A whitespace character moves us to state 1 and any other character moves us to state 2, except that state 0 stays in state 0 until a non-whitespace character arrives, and going from whitespace to word we emit a space before the character:
    input:        whitespace    other-character
    state\next
    0             0             2
    1             1             space then 2
    2             1             2
Parameters:
buffer - The buffer to append to.
string - The string to append.
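The transition table above can be sketched as a small standalone class. This is an illustrative sketch and not the library's source: the class name CollapseSketch, the helper collapseAll, the use of StringBuilder, and the choice of state 0 as the starting state (so that leading whitespace is dropped) are all assumptions. The set of whitespace characters is the one listed under getCollapse.

```java
/** Hypothetical stand-in for the collapse state machine; not the library's code. */
class CollapseSketch {
    private static final int AFTER_SPACE = 0;   // state 0: whitespace was last emitted character
    private static final int IN_WHITESPACE = 1; // state 1: in whitespace
    private static final int IN_WORD = 2;       // state 2: in word

    // Assumption: begin in state 0 so that leading whitespace produces no output.
    private int state = AFTER_SPACE;

    /** Whitespace characters as listed in the getCollapse description. */
    static boolean isWhitespace(char c) {
        return c == ' ' || c == '\t' || c == '\f' || c == '\u200B' || c == '\r' || c == '\n';
    }

    /** Appends text to buffer, reducing each whitespace run to a single space. */
    void collapse(StringBuilder buffer, String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isWhitespace(c)) {
                // State 0 stays in state 0; states 1 and 2 move to state 1.
                if (state != AFTER_SPACE) {
                    state = IN_WHITESPACE;
                }
            } else {
                // Going from whitespace to word: emit the pending space first.
                if (state == IN_WHITESPACE) {
                    buffer.append(' ');
                }
                state = IN_WORD;
                buffer.append(c);
            }
        }
    }

    /** Convenience wrapper (hypothetical): collapse a whole string with a fresh machine. */
    static String collapseAll(String text) {
        StringBuilder buffer = new StringBuilder();
        new CollapseSketch().collapse(buffer, text);
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + collapseAll("  Hello \t\n  world  ") + "]"); // prints [Hello world]
    }
}
```

Because the space is emitted only when the next word starts, trailing whitespace adds nothing to the buffer, and the state carries over between calls so that text arriving in several pieces is still separated by exactly one space.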
-
extractStrings
protected java.lang.String extractStrings() throws ParserException
Extract the text from a page.
Returns:
The textual contents of the page.
Throws:
ParserException - If a parse error occurs.
-
updateStrings
protected void updateStrings(java.lang.String strings)
Assign the Strings property, firing the property change.
Parameters:
strings - The new value of the Strings property.
-
setStrings
protected void setStrings()
Fetch the URL contents. Only do work if there is a valid parser with its URL set.
-
addPropertyChangeListener
public void addPropertyChangeListener(java.beans.PropertyChangeListener listener)
Add a PropertyChangeListener to the listener list. The listener is registered for all properties.
Parameters:
listener - The PropertyChangeListener to be added.
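The bound-property mechanism behind this method can be illustrated with the standard java.beans.PropertyChangeSupport class, which is the type of the mPropertySupport field. The sketch below uses a hypothetical stand-in bean (BoundPropertySketch, with an invented property name "text") instead of StringBean itself, so it runs without the HTML Parser library; how StringBean wires its own properties internally is not shown here.

```java
import java.beans.PropertyChangeListener;
import java.beans.PropertyChangeSupport;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical bean that shows the listener pattern; it is not StringBean. */
class BoundPropertySketch {
    static final String PROP_TEXT = "text"; // invented property name for this sketch

    private final PropertyChangeSupport support = new PropertyChangeSupport(this);
    private String text;

    void addPropertyChangeListener(PropertyChangeListener listener) {
        support.addPropertyChangeListener(listener); // registered for all properties
    }

    void removePropertyChangeListener(PropertyChangeListener listener) {
        support.removePropertyChangeListener(listener);
    }

    /** Assigns the property and notifies listeners of the change. */
    void updateText(String newText) {
        String oldText = text;
        text = newText;
        support.firePropertyChange(PROP_TEXT, oldText, newText);
    }

    /** Records the events one listener receives during a short scenario. */
    static List<String> demo() {
        List<String> seen = new ArrayList<>();
        BoundPropertySketch bean = new BoundPropertySketch();
        PropertyChangeListener listener =
                event -> seen.add(event.getPropertyName() + "=" + event.getNewValue());
        bean.addPropertyChangeListener(listener);
        bean.updateText("first");
        bean.updateText("first");  // old and new values are equal and non-null: no event
        bean.removePropertyChangeListener(listener);
        bean.updateText("second"); // listener already removed: not recorded
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [text=first]
    }
}
```

With a real StringBean, a listener would compare the event's property name against constants such as PROP_STRINGS_PROPERTY to decide which change it is being told about.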
-
removePropertyChangeListener
public void removePropertyChangeListener(java.beans.PropertyChangeListener listener)
Remove a PropertyChangeListener from the listener list. This removes a registered PropertyChangeListener.
Parameters:
listener - The PropertyChangeListener to be removed.
-
getStrings
public java.lang.String getStrings()
Return the textual contents of the URL. This is the primary output of the bean.
Returns:
The user visible (what would be seen in a browser) text.
-
getLinks
public boolean getLinks()
Get the current 'include links' state.
Returns:
true if link text is included in the text extracted from the URL, false otherwise.
-
setLinks
public void setLinks(boolean links)
Set the 'include links' state. If the setting is changed after the URL has been set, the text from the URL will be reacquired, which is possibly expensive.
Parameters:
links - Use true if link text is to be included in the text extracted from the URL, false otherwise.
-
getURL
public java.lang.String getURL()
Get the current URL.
Returns:
The URL from which text has been extracted, or null if this property has not been set yet.
-
setURL
public void setURL(java.lang.String url)
Set the URL to extract strings from. The text from the URL will be fetched, which may be expensive, so this property should be set last.
Parameters:
url - The URL that text should be fetched from.
-
getReplaceNonBreakingSpaces
public boolean getReplaceNonBreakingSpaces()
Get the current 'replace non breaking spaces' state.
Returns:
true if non-breaking spaces (character '\u00a0', numeric character reference &#160; or character entity reference &nbsp;) are to be replaced with normal spaces (character '\u0020').
-
setReplaceNonBreakingSpaces
public void setReplaceNonBreakingSpaces(boolean replace)
Set the 'replace non breaking spaces' state. If the setting is changed after the URL has been set, the text from the URL will be reacquired, which is possibly expensive.
Parameters:
replace - true if non-breaking spaces (character '\u00a0', numeric character reference &#160; or character entity reference &nbsp;) are to be replaced with normal spaces (character '\u0020').
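The effect of this setting on text that has already been decoded can be shown with a one-line substitution. This is a minimal sketch under stated assumptions: the class NbspSketch and its method are hypothetical, and the sketch assumes that character references have already been translated to the '\u00a0' character before the replacement runs. It does not reproduce how the bean performs the substitution internally.

```java
/** Hypothetical helper showing the substitution on already-decoded text. */
class NbspSketch {
    /** Replaces each non-breaking space with a normal space. */
    static String replaceNonBreakingSpaces(String text) {
        return text.replace('\u00a0', '\u0020');
    }

    public static void main(String[] args) {
        String decoded = "10\u00a0km"; // contains a non-breaking space between the parts
        String replaced = replaceNonBreakingSpaces(decoded);
        System.out.println(replaced.equals("10 km")); // prints true
        System.out.println(decoded.equals("10 km"));  // prints false
    }
}
```

Leaving the setting false keeps the non-breaking spaces in the output, which preserves the formatting information at the cost of text that may not compare equal to the same text typed with normal spaces.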
-
getCollapse
public boolean getCollapse()
Get the current 'collapse whitespace' state. If set to true this emulates the operation of browsers in interpreting text where "user agents should collapse input white space sequences when producing output inter-word space". See HTML specification section 9.1 White space http://www.w3.org/TR/html4/struct/text.html#h-9.1.
Returns:
true if sequences of whitespace (space '\u0020', tab '\u0009', form feed '\u000C', zero-width space '\u200B', carriage-return '\r' and NEWLINE '\n') are to be replaced with a single space.
-
setCollapse
public void setCollapse(boolean collapse)
Set the current 'collapse whitespace' state. If the setting is changed after the URL has been set, the text from the URL will be reacquired, which is possibly expensive. The internal state of the collapse state machine can be reset with code like this:
    setCollapse (getCollapse ());
Parameters:
collapse - If true, sequences of whitespace will be reduced to a single space.
-
getConnection
public java.net.URLConnection getConnection()
Get the current connection.
Returns:
The connection that the parser has, or null if it hasn't been set or the parser hasn't been constructed yet.
-
setConnection
public void setConnection(java.net.URLConnection connection)
Set the parser's connection. The text from the URL will be fetched, which may be expensive, so this property should be set last.
Parameters:
connection - New value of property Connection.
-
visitStringNode
public void visitStringNode(Text string)
Appends the text to the output.
Overrides:
visitStringNode in class NodeVisitor
Parameters:
string - The text node.
-
visitTag
public void visitTag(Tag tag)
Appends a NEWLINE to the output if the tag breaks flow, and possibly sets the state of the PRE and SCRIPT flags.
Overrides:
visitTag in class NodeVisitor
Parameters:
tag - The tag to examine.
-
visitEndTag
public void visitEndTag(Tag tag)
Resets the state of the PRE and SCRIPT flags.
Overrides:
visitEndTag in class NodeVisitor
Parameters:
tag - The end tag to process.
-
main
public static void main(java.lang.String[] args)
Unit test.
Parameters:
args - Pass args[0] as the URL to process.
-
-