org.opencms.util
Class CmsHtmlParser

java.lang.Object
  extended by org.htmlparser.visitors.NodeVisitor
      extended by org.opencms.util.CmsHtmlParser
All Implemented Interfaces:
I_CmsHtmlNodeVisitor
Direct Known Subclasses:
CmsHtml2TextConverter, CmsHtmlDecorator, CmsLinkProcessor

public class CmsHtmlParser
extends org.htmlparser.visitors.NodeVisitor
implements I_CmsHtmlNodeVisitor

Base utility class for OpenCms NodeVisitor implementations that provides some frequently used utility functions.

This base implementation is only a "pass-through" class: the content is parsed, but the generated result is identical to the input.

Since:
6.2.0
Version:
$Revision: 1.12 $
Author:
Alexander Kandzior

Field Summary
protected  boolean m_echo
          Indicates whether "echo" mode is on, that is, whether all content is written to the result by default.
protected  java.util.List<java.lang.String> m_noAutoCloseTags
          List of upper case tag name strings of tags that should not be auto-corrected if closing tags are missing.
protected  java.lang.StringBuffer m_result
          The buffer to write the output to.
protected static java.lang.String[] TAG_ARRAY
          The array of supported tag names.
protected static java.util.List<java.lang.String> TAG_LIST
          The list of supported tag names.
 
Constructor Summary
CmsHtmlParser()
          Creates a new instance of the html converter with echo mode set to false.
CmsHtmlParser(boolean echo)
          Creates a new instance of the html converter.
 
Method Summary
protected  java.lang.String collapse(java.lang.String string)
          Collapse HTML whitespace in the given String.
protected  org.htmlparser.PrototypicalNodeFactory configureNoAutoCorrectionTags()
          Internally degrades composite tags that have children in the DOM tree to simple single tags.
 java.lang.String getConfiguration()
          Returns the configuration String of this visitor, or the empty String if none was provided before.
 java.util.List<java.lang.String> getNoAutoCloseTags()
          Returns a list of upper case tag names for which parsing / visiting will not correct missing closing tags.
 java.lang.String getResult()
          Returns the text extraction result.
 java.lang.String getTagHtml(org.htmlparser.Tag tag)
          Returns the HTML for the given tag itself (not the tag content).
 java.lang.String process(java.lang.String html, java.lang.String encoding)
          Extracts the text from the given html content, assuming the given html encoding.
 void setConfiguration(java.lang.String configuration)
          Sets a configuration String for this visitor.
 void setNoAutoCloseTags(java.util.List<java.lang.String> noAutoCloseTagList)
          Sets a list of upper case tag names for which parsing / visiting should not correct missing closing tags.
 void visitEndTag(org.htmlparser.Tag tag)
          Visitor method (callback) invoked when a closing Tag is encountered.
 void visitRemarkNode(org.htmlparser.Remark remark)
          Visitor method (callback) invoked when a remark Tag (HTML comment) is encountered.
 void visitStringNode(org.htmlparser.Text text)
          Visitor method (callback) invoked when a text node is encountered.
 void visitTag(org.htmlparser.Tag tag)
          Visitor method (callback) invoked when a starting Tag is encountered.
 
Methods inherited from class org.htmlparser.visitors.NodeVisitor
beginParsing, finishedParsing, shouldRecurseChildren, shouldRecurseSelf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

m_noAutoCloseTags

protected java.util.List<java.lang.String> m_noAutoCloseTags
List of upper case tag name strings of tags that should not be auto-corrected if closing tags are missing.


TAG_ARRAY

protected static final java.lang.String[] TAG_ARRAY
The array of supported tag names.


TAG_LIST

protected static final java.util.List<java.lang.String> TAG_LIST
The list of supported tag names.


m_echo

protected boolean m_echo
Indicates whether "echo" mode is on, that is, whether all content is written to the result by default.


m_result

protected java.lang.StringBuffer m_result
The buffer to write the output to.

Constructor Detail

CmsHtmlParser

public CmsHtmlParser()
Creates a new instance of the html converter with echo mode set to false.


CmsHtmlParser

public CmsHtmlParser(boolean echo)
Creates a new instance of the html converter.

Parameters:
echo - indicates whether "echo" mode is on, that is, whether all content is written to the result
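
The effect of the echo flag can be illustrated with a small stand-in that does not depend on OpenCms or HTML Parser. This is only a sketch: the class EchoModeSketch and its methods are hypothetical, and it assumes that echo mode simply controls whether each visited fragment is appended to the result buffer.

```java
/**
 * Hypothetical stand-in for a visitor with an "echo" flag.
 * It is not the OpenCms implementation; it only mirrors the documented idea
 * that content is written to the result when echo mode is on.
 */
final class EchoModeSketch {

    /** Assumption: when true, every visited fragment is copied to the result. */
    private final boolean echo;

    /** The buffer the output is written to. */
    private final StringBuffer result = new StringBuffer();

    EchoModeSketch(boolean echo) {
        this.echo = echo;
    }

    /** Called once per fragment (tag, text or comment) in document order. */
    void visit(String fragment) {
        if (echo) {
            result.append(fragment);
        }
    }

    String getResult() {
        return result.toString();
    }

    /** Feeds all fragments to a fresh visitor and returns what it collected. */
    static String run(boolean echo, String... fragments) {
        EchoModeSketch visitor = new EchoModeSketch(echo);
        for (String fragment : fragments) {
            visitor.visit(fragment);
        }
        return visitor.getResult();
    }

    public static void main(String[] args) {
        String[] fragments = {"<p>", "Hello", "</p>"};
        System.out.println("echo on:  [" + run(true, fragments) + "]");
        System.out.println("echo off: [" + run(false, fragments) + "]");
    }
}
```

Under this assumption, a subclass that wants to emit only selected content would leave echo off and append to the buffer itself in the callbacks it overrides.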
Method Detail

configureNoAutoCorrectionTags

protected org.htmlparser.PrototypicalNodeFactory configureNoAutoCorrectionTags()
Internally degrades composite tags that have children in the DOM tree to simple single tags. This avoids auto-correction of unclosed HTML tags.

Returns:
A node factory that will not autocorrect open tags specified via setNoAutoCloseTags(List)

getConfiguration

public java.lang.String getConfiguration()
Description copied from interface: I_CmsHtmlNodeVisitor
Returns the configuration String of this visitor, or the empty String if none was provided before.

Specified by:
getConfiguration in interface I_CmsHtmlNodeVisitor
Returns:
the configuration String of this visitor; by contract never null, but an empty String if not provided.
See Also:
I_CmsHtmlNodeVisitor.getConfiguration()

getResult

public java.lang.String getResult()
Description copied from interface: I_CmsHtmlNodeVisitor
Returns the text extraction result.

Specified by:
getResult in interface I_CmsHtmlNodeVisitor
Returns:
the text extraction result
See Also:
I_CmsHtmlNodeVisitor.getResult()

getTagHtml

public java.lang.String getTagHtml(org.htmlparser.Tag tag)
Returns the HTML for the given tag itself (not the tag content).

Parameters:
tag - the tag to create the HTML for
Returns:
the HTML for the given tag

process

public java.lang.String process(java.lang.String html,
                                java.lang.String encoding)
                         throws org.htmlparser.util.ParserException
Description copied from interface: I_CmsHtmlNodeVisitor
Extracts the text from the given html content, assuming the given html encoding.

Specified by:
process in interface I_CmsHtmlNodeVisitor
Parameters:
html - the content to extract the plain text from
encoding - the encoding to use
Returns:
the text extracted from the given html content
Throws:
org.htmlparser.util.ParserException - if something goes wrong
See Also:
I_CmsHtmlNodeVisitor.process(java.lang.String, java.lang.String)

setConfiguration

public void setConfiguration(java.lang.String configuration)
Description copied from interface: I_CmsHtmlNodeVisitor
Sets a configuration String for this visitor.

This will most likely be done with data from an xsd, custom jsp tag, ...

Specified by:
setConfiguration in interface I_CmsHtmlNodeVisitor
Parameters:
configuration - the configuration of this visitor to set.
See Also:
I_CmsHtmlNodeVisitor.setConfiguration(java.lang.String)

visitEndTag

public void visitEndTag(org.htmlparser.Tag tag)
Description copied from interface: I_CmsHtmlNodeVisitor
Visitor method (callback) invoked when a closing Tag is encountered.

Specified by:
visitEndTag in interface I_CmsHtmlNodeVisitor
Overrides:
visitEndTag in class org.htmlparser.visitors.NodeVisitor
Parameters:
tag - the tag that is ended.
See Also:
I_CmsHtmlNodeVisitor.visitEndTag(org.htmlparser.Tag)

visitRemarkNode

public void visitRemarkNode(org.htmlparser.Remark remark)
Description copied from interface: I_CmsHtmlNodeVisitor
Visitor method (callback) invoked when a remark Tag (HTML comment) is encountered.

Specified by:
visitRemarkNode in interface I_CmsHtmlNodeVisitor
Overrides:
visitRemarkNode in class org.htmlparser.visitors.NodeVisitor
Parameters:
remark - the remark Tag to visit.
See Also:
I_CmsHtmlNodeVisitor.visitRemarkNode(org.htmlparser.Remark)

visitStringNode

public void visitStringNode(org.htmlparser.Text text)
Description copied from interface: I_CmsHtmlNodeVisitor
Visitor method (callback) invoked when a text node is encountered.

Specified by:
visitStringNode in interface I_CmsHtmlNodeVisitor
Overrides:
visitStringNode in class org.htmlparser.visitors.NodeVisitor
Parameters:
text - the text that is visited.
See Also:
I_CmsHtmlNodeVisitor.visitStringNode(org.htmlparser.Text)

visitTag

public void visitTag(org.htmlparser.Tag tag)
Description copied from interface: I_CmsHtmlNodeVisitor
Visitor method (callback) invoked when a starting Tag is encountered.

Specified by:
visitTag in interface I_CmsHtmlNodeVisitor
Overrides:
visitTag in class org.htmlparser.visitors.NodeVisitor
Parameters:
tag - the tag that is visited.
See Also:
I_CmsHtmlNodeVisitor.visitTag(org.htmlparser.Tag)

collapse

protected java.lang.String collapse(java.lang.String string)
Collapse HTML whitespace in the given String.

Parameters:
string - the string to collapse
Returns:
the input String with all HTML whitespace collapsed
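
As an illustration of what collapsing HTML whitespace can mean, the following self-contained sketch replaces every run of whitespace characters with a single space. It is not the OpenCms implementation: the class name WhitespaceCollapseSketch, the set of characters treated as whitespace (space, tab, line feed, carriage return, form feed) and the choice not to trim the ends are all assumptions made for this example.

```java
/**
 * Hypothetical sketch of HTML whitespace collapsing.
 * The actual CmsHtmlParser.collapse(String) may treat more characters as
 * whitespace or trim the result; this sketch does neither.
 */
final class WhitespaceCollapseSketch {

    /** Returns true for the characters this sketch treats as HTML whitespace. */
    static boolean isHtmlWhitespace(char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f';
    }

    /** Replaces every run of whitespace characters with a single space. */
    static String collapse(String input) {
        StringBuilder out = new StringBuilder(input.length());
        boolean inRun = false; // true while we are inside a whitespace run
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (isHtmlWhitespace(c)) {
                if (!inRun) {
                    out.append(' '); // emit one space for the whole run
                    inRun = true;
                }
            } else {
                out.append(c);
                inRun = false;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + collapse("Hello,\n\t  world!  ") + "]");
    }
}
```

Leading and trailing runs are kept as one space each here, so the text can still be joined with neighbouring inline content without words running together.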

getNoAutoCloseTags

public java.util.List<java.lang.String> getNoAutoCloseTags()
Returns a list of upper case tag names for which parsing / visiting will not correct missing closing tags.

Returns:
a List of upper case tag names for which parsing / visiting will not correct missing closing tags

setNoAutoCloseTags

public void setNoAutoCloseTags(java.util.List<java.lang.String> noAutoCloseTagList)
Sets a list of upper case tag names for which parsing / visiting should not correct missing closing tags.

Specified by:
setNoAutoCloseTags in interface I_CmsHtmlNodeVisitor
Parameters:
noAutoCloseTagList - the list of upper case tag names for which parsing / visiting should not correct missing closing tags
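
Because the list is documented as holding upper case tag names, a caller that collects names in mixed case may want to normalize them before passing them to setNoAutoCloseTags(List). The helper below is a minimal sketch of that step. The class NoAutoCloseTagsSketch and the method toUpperCaseTagNames are hypothetical, and the sketch assumes the setter does not normalize the names itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper that prepares tag names for setNoAutoCloseTags(List). */
final class NoAutoCloseTagsSketch {

    /**
     * Returns a new list with every tag name trimmed and converted to upper case.
     * Locale.ROOT keeps the conversion independent of the default locale.
     */
    static List<String> toUpperCaseTagNames(List<String> tagNames) {
        List<String> result = new ArrayList<>(tagNames.size());
        for (String name : tagNames) {
            result.add(name.trim().toUpperCase(Locale.ROOT));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> names = toUpperCaseTagNames(Arrays.asList("div", " Span ", "TD"));
        System.out.println(names);
        // With OpenCms on the classpath the list could then be passed on, for example:
        // parser.setNoAutoCloseTags(names);
    }
}
```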