Object
Various methods for cleaning up HTML and preparing it for safe public consumption.
Documents used for reference:
HTML_ELEMENTS: allowed HTML elements.
HTML_ATTRS: allowed attributes.
HTML_URI_ATTRS: allowed attributes that can contain URIs, so extra caution is required. NOTE: This does not list every URI attribute, only the ones that are allowed.
Adds entities where possible. Works like CGI.escapeHTML, but will not escape existing entities; i.e. &#123; will NOT become &amp;#123;
This method could be improved by adding a whitelist of html entities.
# File lib/html-cleaner.rb, line 152
def add_entities(str)
  # The final gsub escapes only ampersands that do not already start an entity.
  str.to_s.gsub(/\"/, '&quot;').gsub(/>/, '&gt;').gsub(/</, '&lt;').gsub(/&(?!(\#\d+|\#x([0-9a-f]+)|\w{2,8});)/mi, '&amp;')
end
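To make the "will not escape existing entities" behaviour concrete, here is a minimal standalone sketch. It copies the method body shown above into a top-level method so it can be run without the library; the sample strings are made up for illustration. The last call also shows why an entity whitelist would help: any `&name;` with 2 to 8 word characters is treated as an existing entity, whether or not it is a real one.

```ruby
# Standalone copy of add_entities, for experimentation only.
def add_entities(str)
  str.to_s
     .gsub(/"/, '&quot;')
     .gsub(/>/, '&gt;')
     .gsub(/</, '&lt;')
     .gsub(/&(?!(\#\d+|\#x([0-9a-f]+)|\w{2,8});)/mi, '&amp;')
end

# Markup characters and bare ampersands are escaped.
puts add_entities('<b>"Fish & chips"</b>')
# prints: &lt;b&gt;&quot;Fish &amp; chips&quot;&lt;/b&gt;

# Existing numeric and named entities are left alone.
puts add_entities('&#123; &amp; &copy;')
# prints: &#123; &amp; &copy;

# "&T;" is too short to look like an entity, so it is escaped;
# "&bogus;" looks like one and passes through unchanged.
puts add_entities('AT&T; &bogus;')
# prints: AT&amp;T; &bogus;
```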
Does this:
Unescape HTML
Parse HTML into tree
Find ‘body’ if present, and extract tree inside that tag, otherwise parse whole tree
Each tag:
remove tag if not whitelisted
escape HTML tag contents
remove all attributes not on whitelist
extra-scrub URI attrs; see dodgy_uri?
Extra (i.e. unmatched) ending tags and comments are removed.
# File lib/html-cleaner.rb, line 60
def clean(str)
  str = unescapeHTML(str)

  doc = Hpricot(str, :fixup_tags => true)
  doc = subtree(doc, :body)

  # get all the tags in the document
  # Somewhere near hpricot 0.4.92 "*" starting to return all elements,
  # including text nodes instead of just tagged elements.
  tags = (doc/"*").inject([]) { |m,e| m << e.name if(e.respond_to?(:name) && e.name =~ /^\w+$/) ; m }.uniq

  # Remove tags that aren't whitelisted.
  remove_tags!(doc, tags - HTML_ELEMENTS)
  remaining_tags = tags & HTML_ELEMENTS

  # Remove attributes that aren't on the whitelist, or are suspicious URLs.
  (doc/remaining_tags.join(",")).each do |element|
    next if element.raw_attributes.nil? || element.raw_attributes.empty?
    element.raw_attributes.reject! do |attr,val|
      !HTML_ATTRS.include?(attr) || (HTML_URI_ATTRS.include?(attr) && dodgy_uri?(val))
    end

    element.raw_attributes = element.raw_attributes.build_hash {|a,v| [a, add_entities(v)]}
  end unless remaining_tags.empty?

  doc.traverse_text do |t|
    t.swap(add_entities(t.to_html))
  end

  # Return the tree, without comments. Ugly way of removing comments,
  # but can't see a way to do this in Hpricot yet.
  doc.to_s.gsub(/<\!--.*?-->/i, '')
end
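The attribute-filtering step inside clean can be tried in isolation. The sketch below works on a plain Hash instead of an Hpricot element, so it needs no gems. The names `ALLOWED_ATTRS`, `URI_ATTRS`, `suspicious?` and `filter_attributes` are hypothetical stand-ins: the real whitelists are longer, and the real check is dodgy_uri?, which is considerably stricter than the prefix test used here.

```ruby
require 'set'

# Hypothetical, much-shortened whitelists (assumptions for this sketch).
ALLOWED_ATTRS = Set.new(%w[href src title alt])
URI_ATTRS     = Set.new(%w[href src])

# Simplified stand-in for dodgy_uri?: only spots a literal javascript: prefix.
def suspicious?(value)
  value.to_s.strip.downcase.start_with?('javascript:')
end

# Keep an attribute only if its name is whitelisted and, for URI-bearing
# attributes, its value does not look suspicious.
def filter_attributes(attrs)
  attrs.reject do |name, value|
    !ALLOWED_ATTRS.include?(name) || (URI_ATTRS.include?(name) && suspicious?(value))
  end
end

attrs = {
  'href'    => 'javascript:alert(1)', # whitelisted name, suspicious value
  'title'   => 'Hi',                  # whitelisted, harmless
  'onclick' => 'evil()'               # not whitelisted
}
p filter_attributes(attrs) # only the title attribute survives
```

The condition mirrors the `reject!` block in clean: an attribute is dropped if its name is off the whitelist, or if it is a URI attribute whose value fails the URI check.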
Returns true if the given string contains a suspicious URL, i.e. a javascript link.
This method rejects javascript, vbscript, livescript, mocha and data URLs. It could be refined to only deny dangerous data URLs, however.
# File lib/html-cleaner.rb, line 117
def dodgy_uri?(uri)
  uri = uri.to_s

  # special case for poorly-formed entities (missing ';')
  # if these occur *anywhere* within the string, then throw it out.
  return true if (uri =~ /&\#(\d+|x[0-9a-f]+)[^;\d]/i)

  # Try escaping as both HTML or URI encodings, and then trying
  # each scheme regexp on each
  [unescapeHTML(uri), CGI.unescape(uri)].each do |unesc_uri|
    DODGY_URI_SCHEMES.each do |scheme|

      regexp = "#{scheme}:".gsub(/./) do |char|
        "([\000-\037\177\s]*)#{char}"
      end

      # regexp looks something like
      # /\A([\000-\037\177\s]*)j([\000-\037\177\s]*)a([\000-\037\177\s]*)v([\000-\037\177\s]*)a([\000-\037\177\s]*)s([\000-\037\177\s]*)c([\000-\037\177\s]*)r([\000-\037\177\s]*)i([\000-\037\177\s]*)p([\000-\037\177\s]*)t([\000-\037\177\s]*):/mi
      return true if (unesc_uri =~ %r{\A#{regexp}}mi)
    end
  end

  nil
end
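The scheme-matching idea can be sketched on its own using only the standard library. This is an illustration under assumptions, not the library's implementation: `SCHEMES` is taken from the description above (the real DODGY_URI_SCHEMES constant is not shown here), `scheme_pattern` and `looks_dodgy?` are hypothetical names, and the special case for entities missing their `;` is left out.

```ruby
require 'cgi'

# Assumed scheme list, based on the description above.
SCHEMES = %w[javascript vbscript mocha livescript data].freeze

# Build a pattern that tolerates control characters and whitespace
# before every character of "scheme:", anchored at the string start.
def scheme_pattern(scheme)
  body = "#{scheme}:".chars.map { |c| "[\\x00-\\x1f\\x7f\\s]*#{Regexp.escape(c)}" }.join
  Regexp.new("\\A#{body}", Regexp::IGNORECASE | Regexp::MULTILINE)
end

# Decode the value both as HTML and as a URI, then test every scheme
# against both decoded forms.
def looks_dodgy?(uri)
  uri = uri.to_s
  candidates = [CGI.unescapeHTML(uri), CGI.unescape(uri)]
  candidates.any? do |candidate|
    SCHEMES.any? { |scheme| scheme_pattern(scheme).match?(candidate) }
  end
end

p looks_dodgy?('javascript:alert(1)')          # => true
p looks_dodgy?("  Java\tScript:alert(1)")      # => true (whitespace, mixed case)
p looks_dodgy?('java%0Ascript:alert(1)')       # => true (URI-encoded newline)
p looks_dodgy?('jav&#x61;script:alert(1)')     # => true (HTML-encoded "a")
p looks_dodgy?('http://example.com/?q=data:')  # => false (scheme not at start)
```

Decoding both ways before matching is what catches obfuscated schemes: a filter that only inspected the raw string would miss the third and fourth examples.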
For all other feed elements:
Unescape HTML.
Parse HTML into tree (taking ‘body’ as root, if present)
Takes text out of each tag, and escapes HTML.
Returns all text concatenated.
# File lib/html-cleaner.rb, line 99
def flatten(str)
  str.gsub!("\n", " ")
  str = unescapeHTML(str)

  doc = Hpricot(str, :xhtml_strict => true)
  doc = subtree(doc, :body)

  out = []
  doc.traverse_text {|t| out << add_entities(t.to_html)}

  return out.join
end
Generated with the Darkfish Rdoc Generator 1.1.6.