FeedNormalizer::HtmlCleaner

Various methods for cleaning up HTML and preparing it for safe public consumption.

Documents used for reference:

Constants

HTML_ELEMENTS

Allowed HTML elements.

HTML_ATTRS

Allowed attributes.

HTML_URI_ATTRS

Allowed attributes that can contain URIs, so extra caution is required. Note: this does not list all URI attributes, just the ones that are allowed.

DODGY_URI_SCHEMES

URI schemes denied by dodgy_uri?: javascript, vbscript, livescript, mocha and data.

Public Class Methods

add_entities(str)

Adds entities where possible. Works like CGI.escapeHTML, but will not escape existing entities; i.e. &#123; will NOT become &amp;#123;

This method could be improved by adding a whitelist of html entities.

     # File lib/html-cleaner.rb, line 152
     def add_entities(str)
       str.to_s.gsub(/\"/, '&quot;').gsub(/>/, '&gt;').gsub(/</, '&lt;').gsub(/&(?!(\#\d+|\#x([0-9a-f]+)|\w{2,8});)/mi, '&amp;')
     end
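Because add_entities is pure string manipulation, its behaviour can be exercised standalone by reproducing the same substitution chain (the method is copied here for illustration, outside its class):

```ruby
def add_entities(str)
  str.to_s.gsub(/\"/, '&quot;').gsub(/>/, '&gt;').gsub(/</, '&lt;').
    gsub(/&(?!(\#\d+|\#x([0-9a-f]+)|\w{2,8});)/mi, '&amp;')
end

# Bare & and raw < are escaped; existing entities pass through untouched.
puts add_entities('x < 1 & "done" &#169; &lt;')
# => x &lt; 1 &amp; &quot;done&quot; &#169; &lt;
```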
clean(str)

Does this:

  • Unescape HTML

  • Parse HTML into tree

  • Find ‘body’ if present, and extract tree inside that tag, otherwise parse whole tree

  • Each tag:

    • remove tag if not whitelisted

    • escape HTML tag contents

    • remove all attributes not on whitelist

    • extra-scrub URI attrs; see dodgy_uri?

Extra (i.e. unmatched) ending tags and comments are removed.
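The per-tag attribute filtering step can be sketched in isolation with plain hashes. The whitelists and the dodgy_uri? stand-in below are illustrative, not the library's actual constants or logic:

```ruby
# Illustrative subsets -- the library's real whitelists are larger.
allowed_attrs     = %w[href title alt]
allowed_uri_attrs = %w[href]

# Simplified stand-in for dodgy_uri? (the real check also unescapes first).
dodgy = lambda { |uri| uri.to_s =~ /\Ajavascript:/i ? true : nil }

attrs = { "href" => "javascript:alert(1)", "title" => "ok", "onclick" => "evil()" }

# Same reject! shape as the clean() source below: drop non-whitelisted
# attributes, and drop URI attributes whose values look suspicious.
attrs.reject! do |attr, val|
  !allowed_attrs.include?(attr) || (allowed_uri_attrs.include?(attr) && dodgy.call(val))
end

p attrs  # => {"title"=>"ok"}
```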

    # File lib/html-cleaner.rb, line 60
      def clean(str)
        str = unescapeHTML(str)

        doc = Hpricot(str, :fixup_tags => true)
        doc = subtree(doc, :body)

        # Get all the tags in the document.
        # Somewhere near Hpricot 0.4.92, "*" started returning all elements,
        # including text nodes, instead of just tagged elements.
        tags = (doc/"*").inject([]) { |m,e| m << e.name if(e.respond_to?(:name) && e.name =~ /^\w+$/) ; m }.uniq

        # Remove tags that aren't whitelisted.
        remove_tags!(doc, tags - HTML_ELEMENTS)
        remaining_tags = tags & HTML_ELEMENTS

        # Remove attributes that aren't on the whitelist, or are suspicious URLs.
        (doc/remaining_tags.join(",")).each do |element|
          next if element.raw_attributes.nil? || element.raw_attributes.empty?
          element.raw_attributes.reject! do |attr,val|
            !HTML_ATTRS.include?(attr) || (HTML_URI_ATTRS.include?(attr) && dodgy_uri?(val))
          end

          element.raw_attributes = element.raw_attributes.build_hash {|a,v| [a, add_entities(v)]}
        end unless remaining_tags.empty?

        doc.traverse_text do |t|
          t.swap(add_entities(t.to_html))
        end

        # Return the tree, without comments. Ugly way of removing comments,
        # but can't see a way to do this in Hpricot yet.
        doc.to_s.gsub(/<\!--.*?-->/i, '')
      end
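The final comment-stripping gsub needs no Hpricot and can be seen on its own:

```ruby
html = 'keep <b>this</b><!-- but not this -->.'

# Same pattern as the last line of clean(): non-greedy match over each comment.
stripped = html.gsub(/<\!--.*?-->/i, '')
puts stripped
# => keep <b>this</b>.
```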
dodgy_uri?(uri)

Returns true if the given string contains a suspicious URL, i.e. a javascript link.

This method rejects javascript, vbscript, livescript, mocha and data URLs. It could be refined to only deny dangerous data URLs, however.
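The core of the check builds, for each denied scheme, a regexp that tolerates control characters and whitespace between the scheme's letters. A standalone sketch of just that scheme-matching step (omitting the entity check and the HTML/URI unescaping passes the real method performs first):

```ruby
# The schemes named in the description above, reproduced for a standalone sketch.
DODGY_URI_SCHEMES = %w[javascript vbscript livescript mocha data]

def dodgy_scheme?(uri)
  DODGY_URI_SCHEMES.each do |scheme|
    # Allow control chars (\000-\037, \177) and whitespace between each
    # character of "scheme:", defeating e.g. "java\tscript:" obfuscation.
    regexp = "#{scheme}:".gsub(/./) { |char| "([\000-\037\177\s]*)#{char}" }
    return true if uri.to_s =~ %r{\A#{regexp}}mi
  end
  nil
end

p dodgy_scheme?("java\tscript:alert(1)")  # => true (tab between the letters)
p dodgy_scheme?("https://example.org/")   # => nil
```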

     # File lib/html-cleaner.rb, line 117
      def dodgy_uri?(uri)
        uri = uri.to_s

        # Special case for poorly-formed entities (missing ';').
        # If these occur *anywhere* within the string, then throw it out.
        return true if (uri =~ /&\#(\d+|x[0-9a-f]+)[^;\d]/i)

        # Try unescaping as both HTML and URI encodings, then try
        # each scheme regexp on each result.
        [unescapeHTML(uri), CGI.unescape(uri)].each do |unesc_uri|
          DODGY_URI_SCHEMES.each do |scheme|

            regexp = "#{scheme}:".gsub(/./) do |char|
              "([\000-\037\177\s]*)#{char}"
            end

            # regexp looks something like
            # /\A([\000-\037\177\s]*)j([\000-\037\177\s]*)a([\000-\037\177\s]*)v([\000-\037\177\s]*)a([\000-\037\177\s]*)s([\000-\037\177\s]*)c([\000-\037\177\s]*)r([\000-\037\177\s]*)i([\000-\037\177\s]*)p([\000-\037\177\s]*)t([\000-\037\177\s]*):/mi
            return true if (unesc_uri =~ %r{\A#{regexp}}mi)
          end
        end

        nil
      end
flatten(str)

For all other feed elements:

  • Unescape HTML.

  • Parse HTML into tree (taking ‘body’ as root, if present)

  • Take text out of each tag, and escape HTML.

  • Return all text concatenated.
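The pipeline above can be roughly approximated without Hpricot. The regex tag-strip below is a fragile stand-in for the real tree traversal and is for illustration only:

```ruby
require 'cgi'

# Rough approximation: the real method walks an Hpricot parse tree;
# here tags are stripped with a regex (not safe on arbitrary HTML).
str  = "<p>1 &lt; 2 is\n<b>true</b></p>"
text = str.gsub("\n", " ").gsub(/<[^>]+>/, '')  # drop newlines and tags
text = CGI.unescapeHTML(text)                   # "1 < 2 is true"
escaped = CGI.escapeHTML(text)                  # re-escape the plain text
puts escaped
# => 1 &lt; 2 is true
```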

     # File lib/html-cleaner.rb, line 99
      def flatten(str)
        str.gsub!("\n", " ")
        str = unescapeHTML(str)

        doc = Hpricot(str, :xhtml_strict => true)
        doc = subtree(doc, :body)

        out = []
        doc.traverse_text {|t| out << add_entities(t.to_html)}

        return out.join
      end
unescapeHTML(str, xml = true)

Unescapes HTML. If xml is true, also converts XML-only named entities to HTML.

     # File lib/html-cleaner.rb, line 143
      def unescapeHTML(str, xml = true)
        CGI.unescapeHTML(xml ? str.gsub("&apos;", "&#39;") : str)
      end
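Since this method depends only on the stdlib CGI module, it can be exercised directly (copied here outside its class for illustration):

```ruby
require 'cgi'

def unescapeHTML(str, xml = true)
  # &apos; is XML-only; rewrite it to the numeric form CGI understands.
  CGI.unescapeHTML(xml ? str.gsub("&apos;", "&#39;") : str)
end

puts unescapeHTML("it&apos;s &lt;b&gt;bold&lt;/b&gt;")
# => it's <b>bold</b>
```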

Private Class Methods

remove_tags!(doc, tags)
     # File lib/html-cleaner.rb, line 163
      def remove_tags!(doc, tags)
        (doc/tags.join(",")).remove unless tags.empty?
      end
subtree(doc, element)

Returns everything below element, or just the doc if element is not present.

     # File lib/html-cleaner.rb, line 159
      def subtree(doc, element)
        doc.at("//#{element}/*") || doc
      end

Generated with the Darkfish Rdoc Generator 1.1.6.