Class StringBean

  • All Implemented Interfaces:
    java.io.Serializable

    public class StringBean
    extends NodeVisitor
    implements java.io.Serializable
    Extract strings from a URL.

    Text within <SCRIPT></SCRIPT> tags is removed.

    The text within <PRE></PRE> tags is not altered.

    The property Strings, which is the output property, is null until a URL is set. A typical usage is:

         StringBean sb = new StringBean ();
         sb.setLinks (false);
         sb.setReplaceNonBreakingSpaces (true);
         sb.setCollapse (true);
         sb.setURL ("http://www.netbeans.org"); // the HTTP request is performed here
         String s = sb.getStrings ();
     
    You can also use the StringBean as a NodeVisitor on your own parser. In that case you have to re-parse your page after changing any of the properties, because changing a property resets the Strings property:

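         StringBean sb = new StringBean ();
         Parser parser = new Parser ("http://cbc.ca");
         parser.visitAllNodesWith (sb); // first pass, with links off (the default)
         String s = sb.getStrings ();
         sb.setLinks (true); // changing a property resets Strings
         parser.reset (); // rewind the parser so the page can be visited again
         parser.visitAllNodesWith (sb);
         String sl = sb.getStrings (); // text with the link URLs embedded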
     
    According to Nick Burch, who contributed the patch, this is handy if you don't want StringBean to fetch the content itself, for example because you already have it or because it is not on a website.
    See Also:
    Serialized Form
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected java.lang.StringBuffer mBuffer
      The buffer in which text is stored while traversing the HTML.
      protected boolean mCollapse
      If true, sequences of whitespace characters are replaced with a single space character.
      protected int mCollapseState
      The state of the collapse processing state machine.
      protected boolean mIsPre
      Set true when traversing a PRE tag.
      protected boolean mIsScript
      Set true when traversing a SCRIPT tag.
      protected boolean mIsStyle
      Set true when traversing a STYLE tag.
      protected boolean mLinks
      If true, the link URLs are embedded in the text output.
      protected Parser mParser
      The parser used to extract strings.
      protected java.beans.PropertyChangeSupport mPropertySupport
      Bound property support.
      protected boolean mReplaceSpace
      If true, regular space characters are substituted for non-breaking spaces in the text output.
      protected java.lang.String mStrings
      The strings extracted from the URL.
      static java.lang.String PROP_COLLAPSE_PROPERTY
      Property name in event where the 'collapse whitespace' state changes.
      static java.lang.String PROP_CONNECTION_PROPERTY
      Property name in event where the connection changes.
      static java.lang.String PROP_LINKS_PROPERTY
      Property name in event where the 'embed links' state changes.
      static java.lang.String PROP_REPLACE_SPACE_PROPERTY
      Property name in event where the 'replace non-breaking spaces' state changes.
      static java.lang.String PROP_STRINGS_PROPERTY
      Property name in event where the URL contents change.
      static java.lang.String PROP_URL_PROPERTY
      Property name in event where the URL changes.
    • Constructor Summary

      Constructors 
      Constructor Description
      StringBean()
      Create a StringBean object.
    • Method Summary

      Modifier and Type Method Description
      void addPropertyChangeListener​(java.beans.PropertyChangeListener listener)
      Add a PropertyChangeListener to the listener list.
      protected void carriageReturn()
      Appends a newline to the buffer if there isn't one there already.
      protected void collapse​(java.lang.StringBuffer buffer, java.lang.String string)
      Add the given text collapsing whitespace.
      protected java.lang.String extractStrings()
      Extract the text from a page.
      boolean getCollapse()
      Get the current 'collapse whitespace' state.
      java.net.URLConnection getConnection()
      Get the current connection.
      boolean getLinks()
      Get the current 'include links' state.
      boolean getReplaceNonBreakingSpaces()
      Get the current 'replace non-breaking spaces' state.
      java.lang.String getStrings()
      Return the textual contents of the URL.
      java.lang.String getURL()
      Get the current URL.
      static void main​(java.lang.String[] args)
      Unit test.
      void removePropertyChangeListener​(java.beans.PropertyChangeListener listener)
      Remove a PropertyChangeListener from the listener list.
      void setCollapse​(boolean collapse)
      Set the current 'collapse whitespace' state.
      void setConnection​(java.net.URLConnection connection)
      Set the parser's connection.
      void setLinks​(boolean links)
      Set the 'include links' state.
      void setReplaceNonBreakingSpaces​(boolean replace)
      Set the 'replace non-breaking spaces' state.
      protected void setStrings()
      Fetch the URL contents.
      void setURL​(java.lang.String url)
      Set the URL to extract strings from.
      protected void updateStrings​(java.lang.String strings)
      Assign the Strings property, firing the property change.
      void visitEndTag​(Tag tag)
      Resets the state of the PRE and SCRIPT flags.
      void visitStringNode​(Text string)
      Appends the text to the output.
      void visitTag​(Tag tag)
      Appends a NEWLINE to the output if the tag breaks flow, and possibly sets the state of the PRE and SCRIPT flags.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • PROP_STRINGS_PROPERTY

        public static final java.lang.String PROP_STRINGS_PROPERTY
        Property name in event where the URL contents change.
        See Also:
        Constant Field Values
      • PROP_LINKS_PROPERTY

        public static final java.lang.String PROP_LINKS_PROPERTY
        Property name in event where the 'embed links' state changes.
        See Also:
        Constant Field Values
      • PROP_URL_PROPERTY

        public static final java.lang.String PROP_URL_PROPERTY
        Property name in event where the URL changes.
        See Also:
        Constant Field Values
      • PROP_REPLACE_SPACE_PROPERTY

        public static final java.lang.String PROP_REPLACE_SPACE_PROPERTY
        Property name in event where the 'replace non-breaking spaces' state changes.
        See Also:
        Constant Field Values
      • PROP_COLLAPSE_PROPERTY

        public static final java.lang.String PROP_COLLAPSE_PROPERTY
        Property name in event where the 'collapse whitespace' state changes.
        See Also:
        Constant Field Values
      • PROP_CONNECTION_PROPERTY

        public static final java.lang.String PROP_CONNECTION_PROPERTY
        Property name in event where the connection changes.
        See Also:
        Constant Field Values
      • mPropertySupport

        protected java.beans.PropertyChangeSupport mPropertySupport
        Bound property support.
      • mParser

        protected Parser mParser
        The parser used to extract strings.
      • mStrings

        protected java.lang.String mStrings
        The strings extracted from the URL.
      • mLinks

        protected boolean mLinks
        If true, the link URLs are embedded in the text output.
      • mReplaceSpace

        protected boolean mReplaceSpace
        If true, regular space characters are substituted for non-breaking spaces in the text output.
      • mCollapse

        protected boolean mCollapse
        If true, sequences of whitespace characters are replaced with a single space character.
      • mCollapseState

        protected int mCollapseState
        The state of the collapse processing state machine.
      • mBuffer

        protected java.lang.StringBuffer mBuffer
        The buffer in which text is stored while traversing the HTML.
      • mIsScript

        protected boolean mIsScript
        Set true when traversing a SCRIPT tag.
      • mIsPre

        protected boolean mIsPre
        Set true when traversing a PRE tag.
      • mIsStyle

        protected boolean mIsStyle
        Set true when traversing a STYLE tag.
    • Constructor Detail

      • StringBean

        public StringBean()
        Create a StringBean object. Default property values are set to 'do the right thing':

        Links is set false so text appears like a browser would display it, albeit without the colour or underline clues normally associated with a link.

        ReplaceNonBreakingSpaces is set true, so that printing the text works, but the extra information regarding these formatting marks is available if you set it false.

        Collapse is set true, so text appears compact like a browser would display it.

    • Method Detail

      • carriageReturn

        protected void carriageReturn()
        Appends a newline to the buffer if there isn't one there already, unless the buffer is empty.
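        The rule described for carriageReturn can be sketched with plain JDK classes. This is an illustration only, not the bean's code: the class name CarriageReturnSketch, the NEWLINE constant (fixed to "\n" here) and the StringBuilder parameter are all assumptions made for the example.

```java
// Sketch only: NEWLINE and the StringBuilder parameter are assumptions,
// not the actual fields of StringBean.
final class CarriageReturnSketch {
    static final String NEWLINE = "\n";

    /** Appends NEWLINE unless the buffer is empty or already ends with NEWLINE. */
    static void carriageReturn(StringBuilder buffer) {
        int length = buffer.length();
        if (length == 0) {
            return; // never start the output with a blank line
        }
        boolean endsWithNewline = length >= NEWLINE.length()
            && buffer.substring(length - NEWLINE.length()).equals(NEWLINE);
        if (!endsWithNewline) {
            buffer.append(NEWLINE);
        }
    }

    public static void main(String[] args) {
        StringBuilder buffer = new StringBuilder();
        carriageReturn(buffer);      // empty buffer: nothing is appended
        buffer.append("first line");
        carriageReturn(buffer);      // appends one newline
        carriageReturn(buffer);      // already ends with a newline: no change
        buffer.append("second line");
        System.out.println(buffer);
    }
}
```

        Calling the method twice in a row leaves a single newline, which is why consecutive flow-breaking tags would not produce runs of blank lines in a buffer handled this way.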
      • collapse

        protected void collapse​(java.lang.StringBuffer buffer,
                                java.lang.String string)
        Add the given text collapsing whitespace. Use a little finite state machine:
         state 0: whitespace was last emitted character
         state 1: in whitespace
         state 2: in word
         A whitespace character moves us to state 1 and any other character
         moves us to state 2, except that state 0 stays in state 0 until
         a non-whitespace character is seen. Going from whitespace to word,
         we emit a space before the character:
            input:     whitespace   other-character
         state\next
            0               0             2
            1               1        space then 2
            2               1             2
         
        Parameters:
        buffer - The buffer to append to.
        string - The string to append.
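        The transition table above can be sketched as a small standalone class. This is a minimal illustration written from the table, not the library source: the class name CollapseSketch, the use of StringBuilder, and the starting state of 0 are assumptions. The set of whitespace characters is the one listed for getCollapse.

```java
// Sketch of the three-state whitespace collapser described by the table.
// The class name and the initial state (0) are assumptions for illustration.
final class CollapseSketch {
    /** 0: whitespace was last emitted, 1: in whitespace, 2: in word. */
    private int state = 0;

    /** Whitespace characters as listed in the getCollapse description. */
    private static boolean isCollapsible(char c) {
        switch (c) {
            case '\u0020': // space
            case '\u0009': // tab
            case '\u000C': // form feed
            case '\u200B': // zero-width space
            case '\r':
            case '\n':
                return true;
            default:
                return false;
        }
    }

    /** Appends string to buffer, collapsing each whitespace run to one space. */
    void collapse(StringBuilder buffer, String string) {
        for (int i = 0; i < string.length(); i++) {
            char c = string.charAt(i);
            if (isCollapsible(c)) {
                if (state != 0) {
                    state = 1; // state 0 stays 0; states 1 and 2 move to 1
                }
            } else {
                if (state == 1) {
                    buffer.append(' '); // whitespace to word: emit one space
                }
                state = 2;
                buffer.append(c);
            }
        }
    }

    public static void main(String[] args) {
        CollapseSketch sketch = new CollapseSketch();
        StringBuilder buffer = new StringBuilder();
        sketch.collapse(buffer, "  hello \t\n world");
        sketch.collapse(buffer, "  again ");
        System.out.println("[" + buffer + "]"); // prints [hello world again]
    }
}
```

        Because the state is kept between calls, whitespace that ends one string and begins the next still collapses to a single space. Trailing whitespace is never emitted, since the space is only written when a following non-whitespace character arrives.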
      • extractStrings

        protected java.lang.String extractStrings()
                                           throws ParserException
        Extract the text from a page.
        Returns:
        The textual contents of the page.
        Throws:
        ParserException - If a parse error occurs.
      • updateStrings

        protected void updateStrings​(java.lang.String strings)
        Assign the Strings property, firing the property change.
        Parameters:
        strings - The new value of the Strings property.
      • setStrings

        protected void setStrings()
        Fetch the URL contents. Only do work if there is a valid parser with its URL set.
      • addPropertyChangeListener

        public void addPropertyChangeListener​(java.beans.PropertyChangeListener listener)
        Add a PropertyChangeListener to the listener list. The listener is registered for all properties.
        Parameters:
        listener - The PropertyChangeListener to be added.
      • removePropertyChangeListener

        public void removePropertyChangeListener​(java.beans.PropertyChangeListener listener)
        Remove a PropertyChangeListener from the listener list. This removes a registered PropertyChangeListener.
        Parameters:
        listener - The PropertyChangeListener to be removed.
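        The listener methods follow the standard JavaBeans bound-property pattern built on java.beans.PropertyChangeSupport. The sketch below shows that pattern on a hypothetical stand-in bean; the class name BoundLinksSketch and the property name "links" are assumptions, since the actual values of the PROP_* constants are not given here.

```java
import java.beans.PropertyChangeListener;
import java.beans.PropertyChangeSupport;

// Hypothetical stand-in bean with one bound boolean property.
final class BoundLinksSketch {
    static final String PROP_LINKS = "links"; // assumed name, for illustration only

    private final PropertyChangeSupport support = new PropertyChangeSupport(this);
    private boolean links = false;

    void addPropertyChangeListener(PropertyChangeListener listener) {
        support.addPropertyChangeListener(listener); // registered for all properties
    }

    void removePropertyChangeListener(PropertyChangeListener listener) {
        support.removePropertyChangeListener(listener);
    }

    boolean getLinks() {
        return links;
    }

    void setLinks(boolean newValue) {
        boolean oldValue = links;
        links = newValue;
        // PropertyChangeSupport fires nothing when old and new values are equal.
        support.firePropertyChange(PROP_LINKS, oldValue, newValue);
    }

    public static void main(String[] args) {
        BoundLinksSketch bean = new BoundLinksSketch();
        PropertyChangeListener listener = event -> System.out.println(
            event.getPropertyName() + ": " + event.getOldValue() + " -> " + event.getNewValue());
        bean.addPropertyChangeListener(listener);
        bean.setLinks(true);  // listener is notified
        bean.setLinks(true);  // value unchanged: no event
        bean.removePropertyChangeListener(listener);
        bean.setLinks(false); // listener is no longer registered
    }
}
```

        With a real StringBean, a listener registered this way would presumably receive events named by the PROP_* constants, including the Strings property change fired by updateStrings.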
      • getStrings

        public java.lang.String getStrings()
        Return the textual contents of the URL. This is the primary output of the bean.
        Returns:
        The user-visible text (what would be seen in a browser).
      • getLinks

        public boolean getLinks()
        Get the current 'include links' state.
        Returns:
        true if link text is included in the text extracted from the URL, false otherwise.
      • setLinks

        public void setLinks​(boolean links)
        Set the 'include links' state. If the setting is changed after the URL has been set, the text from the URL will be reacquired, which is possibly expensive.
        Parameters:
        links - Use true if link text is to be included in the text extracted from the URL, false otherwise.
      • getURL

        public java.lang.String getURL()
        Get the current URL.
        Returns:
        The URL from which text has been extracted, or null if this property has not been set yet.
      • setURL

        public void setURL​(java.lang.String url)
        Set the URL to extract strings from. The text from the URL will be fetched, which may be expensive, so this property should be set last.
        Parameters:
        url - The URL that text should be fetched from.
      • getReplaceNonBreakingSpaces

        public boolean getReplaceNonBreakingSpaces()
        Get the current 'replace non-breaking spaces' state.
        Returns:
        true if non-breaking spaces (character '\u00a0', numeric character reference &#160; or character entity reference &nbsp;) are to be replaced with normal spaces (character '\u0020').
      • setReplaceNonBreakingSpaces

        public void setReplaceNonBreakingSpaces​(boolean replace)
        Set the 'replace non-breaking spaces' state. If the setting is changed after the URL has been set, the text from the URL will be reacquired, which is possibly expensive.
        Parameters:
        replace - true if non-breaking spaces (character '\u00a0', numeric character reference &#160; or character entity reference &nbsp;) are to be replaced with normal spaces (character '\u0020').
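        The substitution itself amounts to a single character replacement. A minimal sketch, assuming the numeric and entity references have already been decoded to the character '\u00a0' before this step; the class and method names are hypothetical.

```java
// Sketch only: NonBreakingSpaceSketch is a hypothetical helper, not part of the bean.
final class NonBreakingSpaceSketch {
    /** Returns text with non-breaking spaces replaced by normal spaces when replace is true. */
    static String replaceNonBreakingSpaces(String text, boolean replace) {
        return replace ? text.replace('\u00a0', '\u0020') : text;
    }

    public static void main(String[] args) {
        String text = "10\u00a0km";
        // With replace set, the result contains an ordinary space.
        System.out.println(replaceNonBreakingSpaces(text, true).equals("10 km"));
        // Without it, the non-breaking space is preserved.
        System.out.println(replaceNonBreakingSpaces(text, false).equals(text));
    }
}
```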
      • getCollapse

        public boolean getCollapse()
        Get the current 'collapse whitespace' state. If set to true, this emulates the way browsers interpret text, where user agents should collapse input white space sequences when producing output inter-word space. See HTML specification section 9.1 White space: http://www.w3.org/TR/html4/struct/text.html#h-9.1.
        Returns:
        true if sequences of whitespace (space '\u0020', tab '\u0009', form feed '\u000C', zero-width space '\u200B', carriage-return '\r' and NEWLINE '\n') are to be replaced with a single space.
      • setCollapse

        public void setCollapse​(boolean collapse)
        Set the current 'collapse whitespace' state. If the setting is changed after the URL has been set, the text from the URL will be reacquired, which is possibly expensive. The internal state of the collapse state machine can be reset with code like this: setCollapse (getCollapse ());
        Parameters:
        collapse - If true, sequences of whitespace will be reduced to a single space.
      • getConnection

        public java.net.URLConnection getConnection()
        Get the current connection.
        Returns:
        The connection that the parser has, or null if it hasn't been set or the parser hasn't been constructed yet.
      • setConnection

        public void setConnection​(java.net.URLConnection connection)
        Set the parser's connection. The text from the URL will be fetched, which may be expensive, so this property should be set last.
        Parameters:
        connection - New value of property Connection.
      • visitStringNode

        public void visitStringNode​(Text string)
        Appends the text to the output.
        Overrides:
        visitStringNode in class NodeVisitor
        Parameters:
        string - The text node.
      • visitTag

        public void visitTag​(Tag tag)
        Appends a NEWLINE to the output if the tag breaks flow, and possibly sets the state of the PRE and SCRIPT flags.
        Overrides:
        visitTag in class NodeVisitor
        Parameters:
        tag - The tag to examine.
      • visitEndTag

        public void visitEndTag​(Tag tag)
        Resets the state of the PRE and SCRIPT flags.
        Overrides:
        visitEndTag in class NodeVisitor
        Parameters:
        tag - The end tag to process.
      • main

        public static void main​(java.lang.String[] args)
        Unit test.
        Parameters:
        args - Pass args[0] as the URL to process.