org.w3c.tidy
Class Clean

java.lang.Object
  extended by org.w3c.tidy.Clean

public class Clean
extends java.lang.Object

Clean up misuse of presentation markup. Filters from other formats such as Microsoft Word often make excessive use of presentation markup such as font tags, B, I, and the align attribute. By applying a set of production rules, it is straightforward to transform this markup to use CSS. Some rules replace some of the children of an element by style properties on the element, e.g.

<p><b>...</b></p> -> <p style="font-weight: bold">...</p>

Such rules are applied to the element's content and then to the element itself until no more rules apply. Once all the rules have been applied to an element, it will have a style attribute with one or more properties. Other rules strip the element they apply to, replacing it by style properties on the contents, e.g.

<dir><li><p>...</li></dir> -> <p style="margin-left: 1em">...

These rules are applied to an element before processing its content and replace the current element by the first element in the exposed content. After applying both sets of rules, you can replace the style attribute by a class value and style rule in the document head. To support this, an association of styles and class names is built. A naive approach is to rely on string matching to test when two property lists are the same. A better approach would be to first sort the properties before matching.
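
The sorting idea in the last two sentences can be sketched as follows. This is a minimal illustration rather than the code in Clean: the class StyleClassRegistry, its method names, and the "c" + number naming scheme are assumptions made for the example, and the parser ignores edge cases such as semicolons inside url(...) values.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of org.w3c.tidy.
class StyleClassRegistry {
    // Canonical (sorted) property list -> generated class name.
    private final Map<String, String> classByStyle = new LinkedHashMap<>();
    private int classNum = 0;

    /** Re-emits the "name: value" pairs of a style string sorted by property name. */
    static String canonicalize(String style) {
        Map<String, String> sorted = new TreeMap<>();
        for (String decl : style.split(";")) {
            int colon = decl.indexOf(':');
            if (colon < 0) {
                continue; // empty or malformed declaration
            }
            String name = decl.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = decl.substring(colon + 1).trim();
            if (!name.isEmpty()) {
                sorted.putIfAbsent(name, value); // first occurrence wins (assumption)
            }
        }
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> e : sorted.entrySet()) {
            if (out.length() > 0) {
                out.append("; ");
            }
            out.append(e.getKey()).append(": ").append(e.getValue());
        }
        return out.toString();
    }

    /** Returns the class for a style, creating one the first time the style is seen. */
    String classFor(String style) {
        String key = canonicalize(style);
        String cls = classByStyle.get(key);
        if (cls == null) {
            cls = "c" + (++classNum); // naming scheme is an assumption
            classByStyle.put(key, cls);
        }
        return cls;
    }
}
```

With this sketch, "font-weight: bold; color: red" and "color: red; font-weight: bold" map to the same class, which plain string matching would miss.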

    Version:
    $Revision: 1.25 $ ($Author: fgiust $)
    Author:
    Dave Raggett dsr@w3.org, Andy Quick ac.quick@sympatico.ca (translation to Java), Fabrizio Giustina

    Field Summary
    private  int classNum
              Sequential number for generated css classes.
    private  TagTable tt
              Tag table.
     
    Constructor Summary
    Clean(TagTable tagTable)
              Instantiates a new Clean.
     
    Method Summary
    private  void addAlign(Node node, java.lang.String align)
              Adds an align style.
    private  void addColorRule(Lexer lexer, java.lang.String selector, java.lang.String color)
              Adds a css rule for color.
    private  void addFontColor(Node node, java.lang.String color)
              Adds a font color style.
    private  void addFontFace(Node node, java.lang.String face)
              Adds a font-family style.
    private  void addFontSize(Node node, java.lang.String size)
              Adds a font size style.
    private  void addFontStyles(Node node, AttVal av)
              Add style properties to node corresponding to the font face, size and color attributes.
    private  java.lang.String addProperty(java.lang.String style, java.lang.String property)
              Creates a string with merged properties.
    private  void addStyleProperty(Node node, java.lang.String property)
              Add style property to element, creating style attribute as needed and adding ; delimiter.
    private  boolean blockStyle(Lexer lexer, Node node)
              Symptom: the only child of a block-level element is a presentation element such as B, I or FONT.
     void bQ2Div(Node node)
              Replace implicit blockquote by div with an indent taking care to reduce nested blockquotes to a single div with the indent set to match the nesting depth.
    (package private) static void bumpObject(Lexer lexer, Node html)
              Where appropriate move object elements from head to body.
    private  boolean center2Div(Lexer lexer, Node node, Node[] pnode)
              Symptom: <center>.
    private  void cleanBodyAttrs(Lexer lexer, Node body)
              Move presentation attribs from body to style element.
    private  Node cleanNode(Lexer lexer, Node node)
              Applies all matching rules to a node.
     void cleanTree(Lexer lexer, Node doc)
              Clean an html tree.
     void cleanWord2000(Lexer lexer, Node node)
              This is a major clean up to strip out all the extra stuff you get when you save as web page from Word 2000.
    private  StyleProp createProps(StyleProp prop, java.lang.String style)
              Create sorted linked list of properties from style string.
    private  java.lang.String createPropString(StyleProp props)
              Create a css property.
    private  void createStyleElement(Lexer lexer, Node doc)
              Create style element using rules from dictionary.
    private  Node createStyleProperties(Lexer lexer, Node node, Node[] prepl)
              Special case: if the current node is destroyed by CleanNode() lower in the tree, this node and its parent no longer exist.
    private  void defineStyleRules(Lexer lexer, Node node)
              Find style attribute in node content, and replace it by corresponding class attribute.
    private  boolean dir2Div(Lexer lexer, Node node)
              Symptom: <dir><li> where <li> is only child.
    private  void discardContainer(Node element, Node[] pnode)
              Used to strip font start and end tags.
     void dropSections(Lexer lexer, Node node)
              Drop if/endif sections inserted by word2000.
     void emFromI(Node node)
              Replace i by em and b by strong.
    (package private)  Node findEnclosingCell(Node node)
              Find the enclosing table cell for the given node.
    private  java.lang.String findStyle(Lexer lexer, java.lang.String tag, java.lang.String properties)
              Finds a css style.
    private  void fixNodeLinks(Node node)
              Ensure bidirectional links are consistent.
    private  boolean font2Span(Lexer lexer, Node node, Node[] pnode)
              Replace font elements by span elements, deleting the font element's attributes and replacing them by a single style attribute.
    private  java.lang.String fontSize2Name(java.lang.String size)
              Map a % font size to a named font size.
    private  java.lang.String gensymClass(Lexer lexer, java.lang.String tag)
              Generates a new css class name.
    private  boolean inlineStyle(Lexer lexer, Node node, Node[] pnode)
              If the node has only one b, i, or font child remove the child node and add the appropriate style attributes to parent.
    private  StyleProp insertProperty(StyleProp props, java.lang.String name, java.lang.String value)
              Insert a css style property.
     boolean isWord2000(Node root)
              Check if the current document is a converted Word document.
     void list2BQ(Node node)
              Some people use dir or ul without an li to indent the content.
    private  void mergeClasses(Node node, Node child)
              Merge class attributes from 2 nodes.
    private  boolean mergeDivs(Lexer lexer, Node node)
              Symptom: <div><div>...</div></div> Action: merge the two divs.
    private  java.lang.String mergeProperties(java.lang.String s1, java.lang.String s2)
              Create new string that consists of the combined style properties in s1 and s2.
    private  void mergeStyles(Node node, Node child)
              Merge style from 2 nodes.
     void nestedEmphasis(Node node)
              Simplifies <b><b>...</b> ...</b> etc.
    private  boolean nestedList(Lexer lexer, Node node, Node[] pnode)
              Symptom: <ul><li><ul>...</ul></li></ul>.
    private  boolean niceBody(Lexer lexer, Node doc)
              Check deprecated attributes in body tag.
    (package private)  boolean noMargins(Node node)
              Used to hunt for hidden preformatted sections.
    private  void normalizeSpaces(Lexer lexer, Node node)
              Map non-breaking spaces to regular spaces.
     Node pruneSection(Lexer lexer, Node node)
              node is <![if ...]> prune up to <![endif]>.
     void purgeWord2000Attributes(Node node)
              Remove word2000 attributes from node.
    (package private)  boolean singleSpace(Lexer lexer, Node node)
              Does element have a single space as its content?
    private  void stripOnlyChild(Node node)
              Used to strip child of node when the node has one and only one child.
     Node stripSpan(Lexer lexer, Node span)
              Word2000 uses span excessively, so we strip span out.
    private  void style2Rule(Lexer lexer, Node node)
              Find style attribute in node, and replace it by corresponding class attribute.
    private  void textAlign(Lexer lexer, Node node)
              Symptom: <p align=center>.
     
    Methods inherited from class java.lang.Object
    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
     

    Field Detail

    classNum

    private int classNum
    Sequential number for generated css classes.


    tt

    private TagTable tt
    Tag table.

    Constructor Detail

    Clean

    public Clean(TagTable tagTable)
    Instantiates a new Clean.

    Parameters:
    tagTable - tag table instance

    Method Detail

    insertProperty

    private StyleProp insertProperty(StyleProp props,
                                     java.lang.String name,
                                     java.lang.String value)
    Insert a css style property.

    Parameters:
    props - StyleProp instance
    name - property name
    value - property value
    Returns:
    StyleProp containing the given property

    createProps

    private StyleProp createProps(StyleProp prop,
                                  java.lang.String style)
    Create sorted linked list of properties from style string.

    Parameters:
    prop - StyleProp
    style - style string
    Returns:
    StyleProp with given style

    createPropString

    private java.lang.String createPropString(StyleProp props)
    Create a css property.

    Parameters:
    props - StyleProp
    Returns:
    css property as String

    addProperty

    private java.lang.String addProperty(java.lang.String style,
                                         java.lang.String property)
    Creates a string with merged properties.

    Parameters:
    style - css style
    property - css properties
    Returns:
    merged string

    gensymClass

    private java.lang.String gensymClass(Lexer lexer,
                                         java.lang.String tag)
    Generates a new css class name.

    Parameters:
    lexer - Lexer
    tag - Tag
    Returns:
    generated css class

    findStyle

    private java.lang.String findStyle(Lexer lexer,
                                       java.lang.String tag,
                                       java.lang.String properties)
    Finds a css style.

    Parameters:
    lexer - Lexer
    tag - tag name
    properties - css properties
    Returns:
    style string

    style2Rule

    private void style2Rule(Lexer lexer,
                            Node node)
    Find style attribute in node, and replace it by corresponding class attribute. Search for class in style dictionary otherwise gensym new class and add to dictionary. Assumes that node doesn't have a class attribute.

    Parameters:
    lexer - Lexer
    node - node with a style attribute
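
    The lookup-or-generate behaviour described above can be sketched roughly as follows. The sketch is not JTidy's implementation: attributes are modelled as a plain Map, the class StyleDictionary and the "c" + number class names are hypothetical, and keying the dictionary on tag plus properties is an assumption suggested by the findStyle signature.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the style dictionary; not JTidy's own types.
class StyleDictionary {
    private final Map<String, String> classByKey = new HashMap<>();
    private final List<String> rules = new ArrayList<>();
    private int classNum = 0;

    /** Replaces attrs["style"] by attrs["class"], reusing a known class when possible. */
    void style2Rule(String tag, Map<String, String> attrs) {
        String style = attrs.remove("style");
        if (style == null) {
            return; // nothing to convert
        }
        String key = tag + " " + style;
        String cls = classByKey.get(key);
        if (cls == null) {
            // Not in the dictionary yet: generate a new class and record its rule.
            cls = "c" + (++classNum);
            classByKey.put(key, cls);
            rules.add(tag + "." + cls + " {" + style + "}");
        }
        attrs.put("class", cls);
    }

    /** Text for a style element built from the collected rules, one rule per line. */
    String styleElementText() {
        StringBuilder sb = new StringBuilder();
        for (String rule : rules) {
            sb.append(rule).append('\n');
        }
        return sb.toString();
    }
}
```

    Two p elements with the same style end up sharing one class, while a div with that style gets its own rule because the tag is part of the key.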

    addColorRule

    private void addColorRule(Lexer lexer,
                              java.lang.String selector,
                              java.lang.String color)
    Adds a css rule for color.

    Parameters:
    lexer - Lexer
    selector - css selector
    color - color value

    cleanBodyAttrs

    private void cleanBodyAttrs(Lexer lexer,
                                Node body)
    Move presentation attribs from body to style element.
     background="foo" -> body { background-image: url(foo) }
     bgcolor="foo" -> body { background-color: foo }
     text="foo" -> body { color: foo }
     link="foo" -> :link { color: foo }
     vlink="foo" -> :visited { color: foo }
     alink="foo" -> :active { color: foo }
     

    Parameters:
    lexer - Lexer
    body - body node
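
    As an illustration of the attribute-to-rule table above, the following sketch converts a map of body attributes into CSS text. The class BodyAttrsToCss and the exact output layout are assumptions made for the example; the real method works on Node and Lexer objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; attributes are modelled as a mutable Map.
class BodyAttrsToCss {
    /** Removes presentation attributes from attrs and returns equivalent CSS rules. */
    static String toCss(Map<String, String> attrs) {
        List<String> bodyDecls = new ArrayList<>();
        String v;
        if ((v = attrs.remove("background")) != null) {
            bodyDecls.add("background-image: url(" + v + ")");
        }
        if ((v = attrs.remove("bgcolor")) != null) {
            bodyDecls.add("background-color: " + v);
        }
        if ((v = attrs.remove("text")) != null) {
            bodyDecls.add("color: " + v);
        }
        StringBuilder css = new StringBuilder();
        if (!bodyDecls.isEmpty()) {
            css.append("body { ").append(String.join("; ", bodyDecls)).append(" }\n");
        }
        // Link colours become rules on the matching pseudo-class selectors.
        String[][] links = {{"link", ":link"}, {"vlink", ":visited"}, {"alink", ":active"}};
        for (String[] pair : links) {
            if ((v = attrs.remove(pair[0])) != null) {
                css.append(pair[1]).append(" { color: ").append(v).append(" }\n");
            }
        }
        return css.toString();
    }
}
```

    Attributes that are not presentational, such as onload, are left in the map untouched.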

    niceBody

    private boolean niceBody(Lexer lexer,
                             Node doc)
    Check deprecated attributes in body tag.

    Parameters:
    lexer - Lexer
    doc - document root node
    Returns:
    true if the body doesn't contain deprecated attributes, false otherwise.

    createStyleElement

    private void createStyleElement(Lexer lexer,
                                    Node doc)
    Create style element using rules from dictionary.

    Parameters:
    lexer - Lexer
    doc - root node

    fixNodeLinks

    private void fixNodeLinks(Node node)
    Ensure bidirectional links are consistent.

    Parameters:
    node - root node

    stripOnlyChild

    private void stripOnlyChild(Node node)
    Used to strip child of node when the node has one and only one child.

    Parameters:
    node - parent node

    discardContainer

    private void discardContainer(Node element,
                                  Node[] pnode)
    Used to strip font start and end tags.

    Parameters:
    element - original node
    pnode - passed in as array to allow modification. pnode[0] will contain the final node
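
    The Node[] parameter is a common Java idiom for an "out" parameter: Java passes references by value, so a method cannot rebind the caller's variable, but it can write into a one-element array. A minimal sketch of the idiom, using a hypothetical Node type and method names that are not JTidy's:

```java
// Hypothetical minimal node type; JTidy's Node has many more fields.
class PnodeDemo {
    static final class Node {
        final String name;
        Node next;

        Node(String name) {
            this.name = name;
        }
    }

    /**
     * If node is a font element, reports its successor through pnode[0]
     * and returns true; otherwise leaves pnode untouched and returns false.
     */
    static boolean skipFont(Node node, Node[] pnode) {
        if ("font".equals(node.name)) {
            pnode[0] = node.next;
            return true;
        }
        return false;
    }

    /** Walks a sibling chain and collects the names of the nodes that survive. */
    static String walk(Node first) {
        StringBuilder seen = new StringBuilder();
        Node[] pnode = new Node[1];
        Node node = first;
        while (node != null) {
            if (skipFont(node, pnode)) {
                node = pnode[0]; // continue from the node the rule handed back
                continue;
            }
            seen.append(node.name).append(' ');
            node = node.next;
        }
        return seen.toString().trim();
    }
}
```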

    addStyleProperty

    private void addStyleProperty(Node node,
                                  java.lang.String property)
    Add style property to element, creating style attribute as needed and adding ; delimiter.

    Parameters:
    node - node
    property - property added to node
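
    A rough sketch of the "create as needed, add ; delimiter" behaviour, with attributes modelled as a Map. The class StyleAttr is hypothetical and the exact whitespace JTidy emits may differ.

```java
import java.util.Map;

// Hypothetical illustration of appending to a style attribute.
class StyleAttr {
    static void addStyleProperty(Map<String, String> attrs, String property) {
        String style = attrs.get("style");
        if (style == null || style.trim().isEmpty()) {
            attrs.put("style", property); // no style attribute yet: create it
            return;
        }
        String trimmed = style.trim();
        // Insert a ';' delimiter only if the existing value does not already end with one.
        String separator = trimmed.endsWith(";") ? " " : "; ";
        attrs.put("style", trimmed + separator + property);
    }
}
```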

    mergeProperties

    private java.lang.String mergeProperties(java.lang.String s1,
                                             java.lang.String s2)
    Create new string that consists of the combined style properties in s1 and s2. To merge property lists, we build a linked list of property/values and insert properties into the list in order, merging values for the same property name.

    Parameters:
    s1 - first property
    s2 - second property
    Returns:
    merged properties
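
    The linked-list merge can be sketched as below. The class PropList is a hypothetical stand-in for StyleProp, and the conflict policy (the value already in the list is kept when a name repeats) is an assumption; the description above does not say which value wins.

```java
// Hypothetical sorted singly linked list of CSS properties.
class PropList {
    static final class Prop {
        final String name;
        final String value;
        Prop next;

        Prop(String name, String value, Prop next) {
            this.name = name;
            this.value = value;
            this.next = next;
        }
    }

    /** Inserts name/value in alphabetical position; an existing name is left as-is. */
    static Prop insert(Prop head, String name, String value) {
        Prop prev = null;
        Prop cur = head;
        while (cur != null) {
            int cmp = cur.name.compareTo(name);
            if (cmp == 0) {
                return head; // already present: keep the earlier value (assumption)
            }
            if (cmp > 0) {
                break; // found the first entry that sorts after name
            }
            prev = cur;
            cur = cur.next;
        }
        Prop node = new Prop(name, value, cur);
        if (prev == null) {
            return node; // new head of the list
        }
        prev.next = node;
        return head;
    }

    /** Adds every "name: value" declaration of a style string to the list. */
    static Prop parse(Prop head, String style) {
        for (String decl : style.split(";")) {
            int colon = decl.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String name = decl.substring(0, colon).trim();
            if (!name.isEmpty()) {
                head = insert(head, name, decl.substring(colon + 1).trim());
            }
        }
        return head;
    }

    /** Merges two style strings into one sorted property string. */
    static String merge(String s1, String s2) {
        StringBuilder sb = new StringBuilder();
        for (Prop p = parse(parse(null, s1), s2); p != null; p = p.next) {
            if (sb.length() > 0) {
                sb.append("; ");
            }
            sb.append(p.name).append(": ").append(p.value);
        }
        return sb.toString();
    }
}
```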

    mergeClasses

    private void mergeClasses(Node node,
                              Node child)
    Merge class attributes from 2 nodes.

    Parameters:
    node - Node
    child - Child node

    mergeStyles

    private void mergeStyles(Node node,
                             Node child)
    Merge style from 2 nodes.

    Parameters:
    node - Node
    child - Child node

    fontSize2Name

    private java.lang.String fontSize2Name(java.lang.String size)
    Map a % font size to a named font size.

    Parameters:
    size - size in %
    Returns:
    font size name

    addFontFace

    private void addFontFace(Node node,
                             java.lang.String face)
    Adds a font-family style.

    Parameters:
    node - Node
    face - font face

    addFontSize

    private void addFontSize(Node node,
                             java.lang.String size)
    Adds a font size style.

    Parameters:
    node - Node
    size - font size

    addFontColor

    private void addFontColor(Node node,
                              java.lang.String color)
    Adds a font color style.

    Parameters:
    node - Node
    color - color value

    addAlign

    private void addAlign(Node node,
                          java.lang.String align)
    Adds an align style.

    Parameters:
    node - Node
    align - align value

    addFontStyles

    private void addFontStyles(Node node,
                               AttVal av)
    Add style properties to node corresponding to the font face, size and color attributes.

    Parameters:
    node - font tag
    av - attribute list for node

    textAlign

    private void textAlign(Lexer lexer,
                           Node node)
    Symptom: <p align=center>. Action: <p style="text-align: center">.

    Parameters:
    lexer - Lexer
    node - node with an align attribute. Will be modified to use css style.

    dir2Div

    private boolean dir2Div(Lexer lexer,
                            Node node)
    Symptom: <dir><li> where <li> is only child. Action: coerce <dir> <li> to <div> with indent. The clean up rules use the pnode argument to return the next node when the original node has been deleted.

    Parameters:
    lexer - Lexer
    node - dir tag
    Returns:
    true if a dir tag has been coerced to a div

    center2Div

    private boolean center2Div(Lexer lexer,
                               Node node,
                               Node[] pnode)
    Symptom: <center>. Action: replace <center> by <div style="text-align: center">.

    Parameters:
    lexer - Lexer
    node - center tag
    pnode - pnode[0] is the same as node, passed in as an array to allow modification
    Returns:
    true if a center tag has been replaced by a div

    mergeDivs

    private boolean mergeDivs(Lexer lexer,
                              Node node)
    Symptom: <div><div>...</div></div> Action: merge the two divs. This is useful after nested <dir>s used by Word for indenting have been converted to <div>s.

    Parameters:
    lexer - Lexer
    node - first div
    Returns:
    true if the divs have been merged

    nestedList

    private boolean nestedList(Lexer lexer,
                               Node node,
                               Node[] pnode)
    Symptom: <ul><li><ul>...</ul></li></ul>. Action: discard outer list.

    Parameters:
    lexer - Lexer
    node - Node
    pnode - passed in as array to allow modifications.
    Returns:
    true if nested lists have been found and replaced

    blockStyle

    private boolean blockStyle(Lexer lexer,
                               Node node)
    Symptom: the only child of a block-level element is a presentation element such as B, I or FONT. Action: add style "font-weight: bold" to the block and strip the <b> element, leaving its children. Example:
     <p>
     <b><font face="Arial" size="6">Draft Recommended Practice</font></b>
     </p>
     
    becomes:
     <p style="font-weight: bold; font-family: Arial; font-size: 6">
     Draft Recommended Practice
     </p>
     

    This code also replaces the align attribute by a style attribute. However, to avoid CSS problems with Navigator 4, this is not done for the caption, tr and table elements.

    Parameters:
    lexer - Lexer
    node - parent node
    Returns:
    true if the child node has been removed
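
    The rule can be sketched on a toy tree as follows. The Node type and method names are hypothetical, only B and I are handled (FONT and the align attribute are left out to keep the sketch short), and the rule is reapplied until it no longer matches.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy tree; JTidy's Node and AttVal types are richer.
class BlockStyleSketch {
    static final class Node {
        String name;
        String style;
        final List<Node> children = new ArrayList<>();

        Node(String name, Node... kids) {
            this.name = name;
            for (Node k : kids) {
                children.add(k);
            }
        }
    }

    /** Applies the rule once; returns true if the only child was stripped. */
    static boolean stripOnlyPresentationChild(Node block) {
        if (block.children.size() != 1) {
            return false;
        }
        Node child = block.children.get(0);
        String property;
        switch (child.name) {
            case "b":
                property = "font-weight: bold";
                break;
            case "i":
                property = "font-style: italic";
                break;
            default:
                return false;
        }
        block.style = (block.style == null) ? property : block.style + "; " + property;
        // Replace the child by its own children, i.e. strip the <b> or <i> tags.
        block.children.clear();
        block.children.addAll(child.children);
        return true;
    }

    /** Reapplies the rule until it no longer matches; returns how often it fired. */
    static int clean(Node block) {
        int applied = 0;
        while (stripOnlyPresentationChild(block)) {
            applied++;
        }
        return applied;
    }
}
```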

    inlineStyle

    private boolean inlineStyle(Lexer lexer,
                                Node node,
                                Node[] pnode)
    If the node has only one b, i, or font child remove the child node and add the appropriate style attributes to parent.

    Parameters:
    lexer - Lexer
    node - parent node
    pnode - passed as an array to allow modifications
    Returns:
    true if child node has been stripped, replaced by style attributes.

    font2Span

    private boolean font2Span(Lexer lexer,
                              Node node,
                              Node[] pnode)
    Replace font elements by span elements, deleting the font element's attributes and replacing them by a single style attribute.

    Parameters:
    lexer - Lexer
    node - font tag
    pnode - passed as an array to allow modifications
    Returns:
    true if a font tag has been dropped and replaced by style attributes

    cleanNode

    private Node cleanNode(Lexer lexer,
                           Node node)
    Applies all matching rules to a node.

    Parameters:
    lexer - Lexer
    node - original node
    Returns:
    cleaned up node

    createStyleProperties

    private Node createStyleProperties(Lexer lexer,
                                       Node node,
                                       Node[] prepl)
    Special case: if the current node is destroyed by CleanNode() lower in the tree, this node and its parent no longer exist. So we must jump back up the CreateStyleProperties() call stack until we have a valid node reference.

    Parameters:
    lexer - Lexer
    node - Node
    prepl - passed in as array to allow modifications
    Returns:
    cleaned Node

    defineStyleRules

    private void defineStyleRules(Lexer lexer,
                                  Node node)
    Find style attribute in node content, and replace it by corresponding class attribute.

    Parameters:
    lexer - Lexer
    node - parent node

    cleanTree

    public void cleanTree(Lexer lexer,
                          Node doc)
    Clean an html tree.

    Parameters:
    lexer - Lexer
    doc - root node

    nestedEmphasis

    public void nestedEmphasis(Node node)
    Simplifies <b><b>...</b> ...</b> etc.

    Parameters:
    node - root Node

    emFromI

    public void emFromI(Node node)
    Replace i by em and b by strong.

    Parameters:
    node - root Node

    list2BQ

    public void list2BQ(Node node)
    Some people use dir or ul without an li to indent the content. The pattern to look for is a list with a single implicit li. This is recursively replaced by an implicit blockquote.

    Parameters:
    node - root Node

    bQ2Div

    public void bQ2Div(Node node)
    Replace implicit blockquote by div with an indent taking care to reduce nested blockquotes to a single div with the indent set to match the nesting depth.

    Parameters:
    node - root Node
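
    A sketch of collapsing nested blockquotes into one indented div. The Node type is hypothetical, the sketch does not model the "implicit" flag, and the indent of 2em per nesting level is an assumption chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy tree used only for this illustration.
class BlockquoteToDiv {
    static final class Node {
        String name;
        String style;
        final List<Node> children = new ArrayList<>();

        Node(String name, Node... kids) {
            this.name = name;
            for (Node k : kids) {
                children.add(k);
            }
        }
    }

    static void bq2Div(Node node) {
        if (node.name.equals("blockquote")) {
            int depth = 1;
            // Fold a chain of single-child blockquotes into this node, counting levels.
            while (node.children.size() == 1
                    && node.children.get(0).name.equals("blockquote")) {
                Node inner = node.children.get(0);
                node.children.clear();
                node.children.addAll(inner.children);
                depth++;
            }
            node.name = "div";
            node.style = "margin-left: " + (2 * depth) + "em"; // 2em per level (assumption)
        }
        for (Node child : node.children) {
            bq2Div(child);
        }
    }
}
```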

    findEnclosingCell

    Node findEnclosingCell(Node node)
    Find the enclosing table cell for the given node.

    Parameters:
    node - Node
    Returns:
    enclosing cell node

    pruneSection

    public Node pruneSection(Lexer lexer,
                             Node node)
    node is <![if ...]> prune up to <![endif]>.

    Parameters:
    lexer - Lexer
    node - Node
    Returns:
    cleaned up Node
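
    The pruning can be illustrated on a flat token list. This is a simplification: the real method walks Node objects, and the handling of nested if sections shown here is an assumption. The class and method names are hypothetical.

```java
import java.util.List;

// Hypothetical illustration on a flat list of markup tokens.
class SectionPruner {
    /** Removes every <![if ...]> ... <![endif]> section, including nested ones. */
    static void dropSections(List<String> tokens) {
        int i = 0;
        while (i < tokens.size()) {
            if (!tokens.get(i).startsWith("<![if")) {
                i++;
                continue;
            }
            int depth = 0;
            // Remove tokens until the endif matching the opening if has been removed.
            while (i < tokens.size()) {
                String t = tokens.remove(i);
                if (t.startsWith("<![if")) {
                    depth++;
                } else if (t.startsWith("<![endif")) {
                    depth--;
                    if (depth == 0) {
                        break;
                    }
                }
            }
        }
    }
}
```

    The list passed in must be mutable, for example an ArrayList.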

    dropSections

    public void dropSections(Lexer lexer,
                             Node node)
    Drop if/endif sections inserted by word2000.

    Parameters:
    lexer - Lexer
    node - root node

    purgeWord2000Attributes

    public void purgeWord2000Attributes(Node node)
    Remove word2000 attributes from node.

    Parameters:
    node - node to cleanup

    stripSpan

    public Node stripSpan(Lexer lexer,
                          Node span)
    Word2000 uses span excessively, so we strip span out.

    Parameters:
    lexer - Lexer
    span - Node span
    Returns:
    cleaned node

    normalizeSpaces

    private void normalizeSpaces(Lexer lexer,
                                 Node node)
    Map non-breaking spaces to regular spaces.

    Parameters:
    lexer - Lexer
    node - Node

    noMargins

    boolean noMargins(Node node)
    Used to hunt for hidden preformatted sections.

    Parameters:
    node - checked node
    Returns:
    true if the node has a "margin-top: 0" or "margin-bottom: 0" style

    singleSpace

    boolean singleSpace(Lexer lexer,
                        Node node)
    Does element have a single space as its content?

    Parameters:
    lexer - Lexer
    node - checked node
    Returns:
    true if the element has a single space as its content

    cleanWord2000

    public void cleanWord2000(Lexer lexer,
                              Node node)
    This is a major clean up to strip out all the extra stuff you get when you save as web page from Word 2000. It doesn't yet know what to do with VML tags, but these will appear as errors unless you declare them as new tags, such as o:p which needs to be declared as inline.

    Parameters:
    lexer - Lexer
    node - node to clean up

    isWord2000

    public boolean isWord2000(Node root)
    Check if the current document is a converted Word document.

    Parameters:
    root - root Node
    Returns:
    true if the document has been generated by Microsoft Word.

    bumpObject

    static void bumpObject(Lexer lexer,
                           Node html)
    Where appropriate move object elements from head to body.

    Parameters:
    lexer - Lexer
    html - html node