Class SiteCapturer

  • Direct Known Subclasses:
    WikiCapturer

    public class SiteCapturer
    extends java.lang.Object
    Save a web site locally. Illustrative program to save a web site's contents locally. It was created to demonstrate URL rewriting in its simplest form. It uses customized tags in the NodeFactory to alter the URLs. This program has a number of limitations:
    • it doesn't capture forms, this would involve too many assumptions
    • it doesn't capture script references, so funky onMouseOver and other non-static content will not be faithfully reproduced
    • it doesn't handle style sheets
    • it doesn't dig into attributes that might reference resources, so for example, background images won't necessarily be captured
    • worst of all, it gets confused when a URL both has content and is the prefix for other content, e.g. http://whatever.com/top and http://whatever.com/top/sub.html both yield content, since this cannot be faithfully replicated to a static directory structure (this happens a lot with servlet based sites)
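Given those caveats, typical use is to configure the documented properties and then run the capture. The URL and directory below are placeholders, and this sketch assumes the htmlparser library is on the classpath, so it is an illustration of the API rather than a self-contained program:

```java
// Drive the capture through the documented public API; values are examples.
SiteCapturer capturer = new SiteCapturer();
capturer.setSource("http://whatever.com");  // base URL; links outside it are left as-is
capturer.setTarget("/tmp/whatever");        // local directory to re-home pages into
capturer.setCaptureResources(true);         // also copy images and other resources
capturer.capture();
```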
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected boolean mCaptureResources
      If true, save resources locally too; otherwise, leave resource links pointing to the original page.
      protected java.util.HashSet mCopied
      The set of resources already copied.
      protected NodeFilter mFilter
      The filter to apply to the nodes retrieved.
      protected java.util.HashSet mFinished
      The set of pages already captured.
      protected java.util.ArrayList mImages
      The list of resources to copy.
      protected java.util.ArrayList mPages
      The list of pages to capture.
      protected Parser mParser
      The parser to use for processing.
      protected java.lang.String mSource
      The web site to capture.
      protected java.lang.String mTarget
      The local directory to capture to.
      protected final int TRANSFER_SIZE
      Copy buffer size.
    • Constructor Summary

      Constructors 
      Constructor Description
      SiteCapturer()
      Create a web site capturer.
    • Method Summary

      Modifier and Type Method Description
      void capture()
      Perform the capture.
      protected void copy()
      Copy a resource (image) locally.
      protected java.lang.String decode​(java.lang.String raw)
      Unescape a URL to form a file name.
      boolean getCaptureResources()
      Getter for property captureResources.
      NodeFilter getFilter()
      Getter for property filter.
      java.lang.String getSource()
      Getter for property source.
      java.lang.String getTarget()
      Getter for property target.
      protected boolean isHtml​(java.lang.String link)
      Returns true if the link contains text/html content.
      protected boolean isToBeCaptured​(java.lang.String link)
      Returns true if the link is one we are interested in.
      static void main​(java.lang.String[] args)
      Mainline to capture a web site locally.
      protected java.lang.String makeLocalLink​(java.lang.String link, java.lang.String current)
      Converts a link to local.
      protected void process​(NodeFilter filter)
      Process a single page.
      void setCaptureResources​(boolean capture)
      Setter for property captureResources.
      void setFilter​(NodeFilter filter)
      Setter for property filter.
      void setSource​(java.lang.String source)
      Setter for property source.
      void setTarget​(java.lang.String target)
      Setter for property target.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • mSource

        protected java.lang.String mSource
        The web site to capture. This is used as the base URL in deciding whether to adjust a link and whether to capture a page or not.
      • mTarget

        protected java.lang.String mTarget
        The local directory to capture to. This is used as a base prefix for files saved locally.
      • mPages

        protected java.util.ArrayList mPages
        The list of pages to capture. Links are added to this list as they are discovered and removed in FIFO order, leading to a breadth-first traversal of the web site space.
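The queue discipline described here can be sketched in isolation. The link graph below is a stand-in for the real fetch-and-parse step, and the class and method names are illustrative, but the data structures mirror the fields: an ArrayList used head-first as a FIFO work list (mPages) and a HashSet guarding against revisits (mFinished):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Map;

// Breadth-first page traversal using an ArrayList as a FIFO queue.
public class CrawlOrder {
    public static ArrayList<String> traverse(String start, Map<String, String[]> links) {
        ArrayList<String> pages = new ArrayList<>();   // mPages: FIFO work list
        HashSet<String> finished = new HashSet<>();    // mFinished: pages already seen
        ArrayList<String> order = new ArrayList<>();   // visit order, for illustration
        pages.add(start);
        finished.add(start);
        while (!pages.isEmpty()) {
            String page = pages.remove(0);             // head removal => breadth first
            order.add(page);
            for (String link : links.getOrDefault(page, new String[0]))
                if (finished.add(link))                // add() returns false if already seen
                    pages.add(link);
        }
        return order;
    }
}
```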
      • mFinished

        protected java.util.HashSet mFinished
        The set of pages already captured. Used to avoid repeated acquisition of the same page.
      • mImages

        protected java.util.ArrayList mImages
        The list of resources to copy. Images and other resources are added to this list as they are discovered.
      • mCopied

        protected java.util.HashSet mCopied
        The set of resources already copied. Used to avoid repeated acquisition of the same images and other resources.
      • mParser

        protected Parser mParser
        The parser to use for processing.
      • mCaptureResources

        protected boolean mCaptureResources
        If true, save resources locally too; otherwise, leave resource links pointing to the original page.
      • mFilter

        protected NodeFilter mFilter
        The filter to apply to the nodes retrieved.
      • TRANSFER_SIZE

        protected final int TRANSFER_SIZE
        Copy buffer size. Resources are moved to disk in chunks this size or less.
        See Also:
        Constant Field Values
    • Constructor Detail

      • SiteCapturer

        public SiteCapturer()
        Create a web site capturer.
    • Method Detail

      • getSource

        public java.lang.String getSource()
        Getter for property source.
        Returns:
        Value of property source.
      • setSource

        public void setSource​(java.lang.String source)
        Setter for property source. This is the base URL to capture. URLs that don't start with this prefix are ignored (left as is), while those with this URL as a base are re-homed to the local target.
        Parameters:
        source - New value of property source.
      • getTarget

        public java.lang.String getTarget()
        Getter for property target.
        Returns:
        Value of property target.
      • setTarget

        public void setTarget​(java.lang.String target)
        Setter for property target. This is the local directory under which to save the site's pages.
        Parameters:
        target - New value of property target.
      • getCaptureResources

        public boolean getCaptureResources()
        Getter for property captureResources. If true, the images and other resources referenced by the site and within the base URL tree are also copied locally to the target directory. If false, the image links are left 'as is', still referring to the original site.
        Returns:
        Value of property captureResources.
      • setCaptureResources

        public void setCaptureResources​(boolean capture)
        Setter for property captureResources.
        Parameters:
        capture - New value of property captureResources.
      • getFilter

        public NodeFilter getFilter()
        Getter for property filter.
        Returns:
        Value of property filter.
      • setFilter

        public void setFilter​(NodeFilter filter)
        Setter for property filter.
        Parameters:
        filter - New value of property filter.
      • isToBeCaptured

        protected boolean isToBeCaptured​(java.lang.String link)
        Returns true if the link is one we are interested in.
        Parameters:
        link - The link to be checked.
        Returns:
        true if the link has the source URL as a prefix and doesn't contain '?' or '#'; the former because we won't be able to handle server-side queries in the static target directory structure, and the latter because presumably the full page containing that reference has already been captured. The comparison is case insensitive, which is cheating really, but it's cheap.
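The test described above can be written as a stand-alone sketch. The real method compares against the mSource field; here the source is passed as a parameter, and the class name is illustrative:

```java
// Prefix-and-punctuation test for candidate links, per the method contract.
public class CaptureCheck {
    public static boolean isToBeCaptured(String link, String source) {
        // case-insensitive prefix comparison ("cheating really, but it's cheap")
        return link.toLowerCase().startsWith(source.toLowerCase())
            && link.indexOf('?') == -1   // no server-side queries
            && link.indexOf('#') == -1;  // no fragment references
    }
}
```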
      • isHtml

        protected boolean isHtml​(java.lang.String link)
                          throws ParserException
        Returns true if the link contains text/html content.
        Parameters:
        link - The URL to check for content type.
        Returns:
        true if the HTTP header indicates the type is "text/html".
        Throws:
        ParserException - If the supplied URL can't be read from.
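The real method reads the HTTP headers for the link; the sketch below isolates just the header test so it runs without network access. Passing the Content-Type value directly, and the helper name itself, are assumptions for illustration:

```java
// Does a Content-Type header value indicate "text/html"?
public class HtmlCheck {
    public static boolean isHtmlContentType(String contentType) {
        // the header may carry parameters, e.g. "text/html; charset=UTF-8"
        return contentType != null
            && contentType.trim().toLowerCase().startsWith("text/html");
    }
}
```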
      • makeLocalLink

        protected java.lang.String makeLocalLink​(java.lang.String link,
                                                 java.lang.String current)
        Converts a link to local. A relative link can be used to construct both a URL and a file name. Basically, the operation is to strip off the base URL, if any, and then prepend as many dot-dots as necessary to make it relative to the current page. A bit of a kludge handles the root page specially by calling it index.html, even though that probably isn't its real file name. This isn't pretty, but it works for me.
        Parameters:
        link - The link to make relative.
        current - The current page URL, or empty if it's an absolute URL that needs to be converted.
        Returns:
        The URL relative to the current page.
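The strip-and-prepend operation can be sketched as follows. The real method uses the mSource field rather than a base parameter, and this simplified version skips the index.html kludge and edge cases such as links shorter than the base, so treat it as an outline of the idea only:

```java
// Rewrite an absolute link as a path relative to the current page.
public class LinkRewriter {
    public static String makeLocalLink(String link, String current, String base) {
        String relative = link.startsWith(base)
            ? link.substring(base.length() + 1)  // strip base URL and separator
            : link;                              // leave foreign links as-is
        StringBuilder prefix = new StringBuilder();
        if (current.startsWith(base)) {
            String path = current.substring(base.length() + 1);
            // one "../" for each directory between the current page and the root
            for (int i = path.indexOf('/'); i != -1; i = path.indexOf('/', i + 1))
                prefix.append("../");
        }
        return prefix + relative;
    }
}
```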
      • decode

        protected java.lang.String decode​(java.lang.String raw)
        Unescape a URL to form a file name. Very crude.
        Parameters:
        raw - The escaped URI.
        Returns:
        The native URI.
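The documentation describes the un-escaping only as "very crude"; a minimal version using the standard library's URL decoder might look like this (the class name is illustrative, and the real implementation may differ):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Unescape a URL (e.g. "%20" -> " ") to form a usable file name.
public class NameDecoder {
    public static String decode(String raw) {
        try {
            return URLDecoder.decode(raw, "UTF-8");
        } catch (UnsupportedEncodingException uee) {
            return raw; // UTF-8 is always supported; fall back defensively
        }
    }
}
```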
      • copy

        protected void copy()
        Copy a resource (image) locally. Removes one element from the 'to be copied' list and saves the resource it points to locally as a file.
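The chunked transfer implied by TRANSFER_SIZE can be sketched with plain streams. The buffer size here is an assumption (see Constant Field Values for the real one), and the real method also dequeues the resource and opens the URL and target file:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Move bytes to disk in chunks of at most TRANSFER_SIZE.
public class ChunkCopy {
    static final int TRANSFER_SIZE = 4096; // illustrative value

    public static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[TRANSFER_SIZE];
        for (int read = in.read(buffer); read != -1; read = in.read(buffer))
            out.write(buffer, 0, read); // write only the bytes actually read
    }
}
```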
      • process

        protected void process​(NodeFilter filter)
                        throws ParserException
        Process a single page.
        Parameters:
        filter - The filter to apply to the collected nodes.
        Throws:
        ParserException - If a parse error occurs.
      • capture

        public void capture()
        Perform the capture.
      • main

        public static void main​(java.lang.String[] args)
                         throws java.net.MalformedURLException,
                                java.io.IOException
        Mainline to capture a web site locally.
        Parameters:
        args - The command line arguments. There are three arguments: the web site to capture, the local directory to save it to, and a flag (true or false) indicating whether resources such as images and video should be captured as well. These are requested via dialog boxes if not supplied.
        Throws:
        java.net.MalformedURLException - If the supplied URL is invalid.
        java.io.IOException - If an error occurs reading the page or resources.