Most feeds embed HTML markup within feed elements. Some feeds even embed other types of markup, such as SVG or MathML. Since many feed aggregators use a web browser (or browser component) to display content, Universal Feed Parser sanitizes embedded markup to remove things that could pose security risks.
These elements are sanitized by default:
Note
If the content is declared to be (or is determined to be) text/plain, it will not be sanitized. This is to avoid data loss. You should check the content type, for example in entries[i].summary_detail.type. If it is text/plain, the content has not been sanitized, and you should HTML-escape it before rendering.
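The check described in the note above could be sketched as follows. This is a minimal illustration, not part of Universal Feed Parser: the helper name render_detail is hypothetical, and the plain dicts stand in for a detail object such as entries[i].summary_detail, which is assumed here to expose "type" and "value" keys.

```python
import html


def render_detail(detail):
    """Return markup that is safe to hand to a browser component.

    `detail` is assumed to be dict-like with "type" and "value" keys,
    as a stand-in for entries[i].summary_detail.
    """
    value = detail.get("value", "")
    if detail.get("type") == "text/plain":
        # Plain text was not sanitized, so escape it before rendering.
        return html.escape(value)
    # Other content types are assumed to have been sanitized already.
    return value


# Hypothetical sample data for illustration.
plain = {"type": "text/plain", "value": "<b>not markup</b>"}
marked_up = {"type": "text/html", "value": "<b>markup</b>"}
print(render_detail(plain))
print(render_detail(marked_up))
```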
The following HTML elements are allowed by default (all others are stripped):
The following HTML attributes are allowed by default (all others are stripped):
The following SVG elements are allowed by default (all others are stripped):
The following SVG attributes are allowed by default (all others are stripped):
The following MathML elements are allowed by default (all others are stripped):
The following MathML attributes are allowed by default (all others are stripped):
The following CSS properties are allowed by default in style attributes (all others are stripped):
Note
Not all possible CSS values are allowed for these properties. The allowable values are restricted by a whitelist and a regular expression that allows color values and lengths. URIs are not allowed, to prevent platypus attacks. See the _HTMLSanitizer class for more details.
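To make the idea in the note above concrete, here is a rough sketch of a style filter that combines a property whitelist, a keyword whitelist, and a regular expression for colors and lengths. It is not the code in _HTMLSanitizer: the function name sanitize_style, both sets, and the pattern are assumptions chosen for illustration, and the real lists are much longer.

```python
import re

# Hypothetical, deliberately short whitelists for illustration only.
ALLOWED_PROPERTIES = {"color", "background-color", "font-size", "margin", "padding"}
ALLOWED_KEYWORDS = {"auto", "bold", "italic", "none", "red", "blue", "transparent"}

# A hex color (#rgb or #rrggbb) or a number with an optional length unit.
VALUE_TOKEN = re.compile(
    r"#[0-9a-fA-F]{3}|#[0-9a-fA-F]{6}|-?\d*\.?\d+(px|em|pt|cm|in|%)?"
)


def sanitize_style(style):
    """Keep only declarations whose property and every value token are allowed."""
    kept = []
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        name = name.strip().lower()
        if not sep or name not in ALLOWED_PROPERTIES:
            continue
        tokens = value.split()
        # url(...), expression(...) and anything else unrecognized fails here.
        if tokens and all(
            t.lower() in ALLOWED_KEYWORDS or VALUE_TOKEN.fullmatch(t)
            for t in tokens
        ):
            kept.append(f"{name}: {' '.join(tokens)}")
    return "; ".join(kept)


print(sanitize_style("color: #ff0000; background: url(javascript:alert(1)); font-size: 12px"))
```

Because a value survives only if every token is recognized, URIs are dropped without the filter ever needing to know what a dangerous URI looks like.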
I am often asked why Universal Feed Parser is so hard-assed about HTML and CSS sanitizing. To illustrate the problem, here is an incomplete list of potentially dangerous HTML tags and attributes:
style? Yes, style. CSS definitions can contain executable code.
This sample is taken from http://feedparser.org/docs/examples/rss20.xml:
<description>Watch out for
<span style="background: url(javascript:window.location='http://example.org/')">
nasty tricks</span></description>
This sample is more advanced, and does not contain the keyword javascript: that many naive HTML sanitizers scan for:
<description>Watch out for
<span style="any: expression(window.location='http://example.org/')">
nasty tricks</span></description>
Internet Explorer for Windows will execute the JavaScript in both of these examples.
Now consider that in HTML, attribute values may be entity-encoded in several different ways.
To a browser, this:
<span style="any: expression(window.location='http://example.org/')">
is the same as this (without the line breaks):
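<span style="&#97;&#110;&#121;&#58;&#32;&#101;&#120;&#112;&#114;&#101;
&#115;&#115;&#105;&#111;&#110;&#40;&#119;&#105;&#110;&#100;&#111;&#119;
&#46;&#108;&#111;&#99;&#97;&#116;&#105;&#111;&#110;&#61;&#39;&#104;
&#116;&#116;&#112;&#58;&#47;&#47;&#101;&#120;&#97;&#109;&#112;&#108;
&#101;&#46;&#111;&#114;&#103;&#47;&#39;&#41;">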
which is the same as this (without the line breaks):
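<span style="&#x61;&#x6e;&#x79;&#x3a;&#x20;&#x65;&#x78;&#x70;&#x72;
&#x65;&#x73;&#x73;&#x69;&#x6f;&#x6e;&#x28;&#x77;&#x69;&#x6e;
&#x64;&#x6f;&#x77;&#x2e;&#x6c;&#x6f;&#x63;&#x61;&#x74;&#x69;
&#x6f;&#x6e;&#x3d;&#x27;&#x68;&#x74;&#x74;&#x70;&#x3a;&#x2f;
&#x2f;&#x65;&#x78;&#x61;&#x6d;&#x70;&#x6c;&#x65;&#x2e;&#x6f;
&#x72;&#x67;&#x2f;&#x27;&#x29;">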
And so on, plus several other variations, plus every combination of every variation.
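The effect of entity encoding on a keyword scanner can be demonstrated in a few lines. This is only a sketch: naive_blacklist_flags is a hypothetical stand-in for the kind of naive sanitizer mentioned earlier, and html.unescape stands in for the decoding a browser performs on attribute values.

```python
import html


def naive_blacklist_flags(style_value):
    """Hypothetical naive check: scan the raw text for known-bad keywords."""
    lowered = style_value.lower()
    return "javascript:" in lowered or "expression(" in lowered


plain = "any: expression(window.location='http://example.org/')"

# Two of the many equivalent encodings: decimal and hexadecimal references.
decimal_encoded = "".join(f"&#{ord(ch)};" for ch in plain)
hex_encoded = "".join(f"&#x{ord(ch):x};" for ch in plain)

for candidate in (plain, decimal_encoded, hex_encoded):
    # Only the unencoded form is caught, yet all three decode to the same value.
    print(naive_blacklist_flags(candidate), html.unescape(candidate) == plain)
```

A scanner could decode entities first, but it would then have to match every other quirk of the browser's parsing as well, which is the losing game the rest of this section describes.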
The more I investigate, the more cases I find where Internet Explorer for Windows will treat seemingly innocuous markup as code and blithely execute it. This is why Universal Feed Parser uses a whitelist and not a blacklist. I am reasonably confident that none of the elements or attributes on the whitelist are security risks. I am not at all confident about elements or attributes that I have not explicitly investigated. And I have no confidence at all in my ability to detect strings within attribute values that Internet Explorer for Windows will treat as executable code.
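For readers who want to see the whitelist approach in miniature, the sketch below rebuilds markup from only the elements and attributes it recognizes. It is not Universal Feed Parser's sanitizer: the class name, the function name, and the two tiny sets are hypothetical, and it uses Python's standard html.parser module. URL-bearing attributes such as href are left out of the attribute set on purpose, since they would also need a scheme check.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical, deliberately tiny whitelists for illustration only.
ALLOWED_ELEMENTS = {"a", "b", "em", "i", "p", "span", "strong"}
ALLOWED_ATTRIBUTES = {"title", "lang"}


class WhitelistSanitizer(HTMLParser):
    """Re-emit only whitelisted tags and attributes; drop everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_ELEMENTS:
            return
        kept = "".join(
            f' {name}="{escape(value or "")}"'
            for name, value in attrs
            if name in ALLOWED_ATTRIBUTES
        )
        self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_ELEMENTS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is kept, but re-escaped so it can never become markup.
        self.parts.append(escape(data, quote=False))


def sanitize(markup):
    parser = WhitelistSanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


print(sanitize('Watch out for <span style="any: expression(x)" title="t">nasty tricks</span>'))
```

The style attribute disappears here not because the code recognized the expression() trick, but because style is simply not on the list. That is the property a blacklist cannot offer.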
See also