Extract the contents of escaped CDATA sections (#440) #572

Open
Labib-Bin-Salam wants to merge 1 commit into kurtmckee:main from Labib-Bin-Salam:fix-escaped-cdata-440
@Labib-Bin-Salam

Fixes #440.

The problem

When a feed XML-escapes a CDATA section, feedparser parses the field as an empty string. For example:

<description>&lt;![CDATA[some text]]&gt;</description>
>>> feedparser.parse(rss).entries[0].description
''            # expected: 'some text'

After the document is parsed, the text content of <description> is the literal string <![CDATA[some text]]>. feedparser treats that content as HTML, so it runs through the SGML-based HTML processor, which recognizes <![CDATA[ ... ]]> as a marked section and hands its body to unknown_decl(). BaseHTMLProcessor never overrode unknown_decl(), so the base sgmllib.SGMLParser.unknown_decl() ran, and its default implementation discards the data. As a result, the character data was silently dropped.
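
The first step of that chain can be reproduced with the standard library alone. The sketch below uses xml.etree.ElementTree as a stand-in for the XML parsing stage (an assumption made for illustration; feedparser's own pipeline is different) to show that an XML-escaped CDATA section comes out as literal text:

```python
import xml.etree.ElementTree as ET

# The CDATA delimiters are XML-escaped, so the XML parser sees ordinary text,
# not a CDATA section.
escaped = "<description>&lt;![CDATA[some text]]&gt;</description>"

# After entity decoding, the element's text is the literal marked-section string.
text = ET.fromstring(escaped).text
print(text)
```

This literal string is what a downstream HTML processor would then interpret as a marked section.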

This has been reported several times against real-world feeds (e.g. the feeds linked in #440).

The fix

Override unknown_decl() in BaseHTMLProcessor to emit the contents of a CDATA marked section instead of discarding them. The contents of a CDATA section are character data, so the few characters that are special in the surrounding markup (&, <, >) are escaped before being emitted. This keeps a script smuggled inside an escaped CDATA section inert: it is rendered as literal text rather than executed.
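
A minimal sketch of the idea, not the code in this PR. SketchProcessor is a hypothetical stand-in for BaseHTMLProcessor, and the shape of the string passed to unknown_decl() (the marked section without its "<![" prefix and "]]>" suffix, for example "CDATA[some text") is an assumption about how the underlying parser reports marked sections:

```python
from html import escape


class SketchProcessor:
    """Hypothetical stand-in for BaseHTMLProcessor; not feedparser's class."""

    def __init__(self):
        self.pieces = []  # output fragments emitted so far

    def handle_data(self, text):
        self.pieces.append(text)

    def unknown_decl(self, data):
        # Assumption: data looks like "CDATA[<contents>" for a CDATA section.
        prefix = "CDATA["
        if not data.startswith(prefix):
            return  # any other declaration is still discarded
        # Escape &, < and > so the contents remain character data.
        self.handle_data(escape(data[len(prefix):], quote=False))

    def output(self):
        return "".join(self.pieces)


processor = SketchProcessor()
processor.unknown_decl("CDATA[<script>alert(1)</script> & more")
result = processor.output()
print(result)
```

Escaping at the point of emission means the markup characters inside the section can never be re-parsed as tags by a later stage.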

Real (unescaped) CDATA sections are unaffected: the XML parser strips those before the HTML processor ever sees them.
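
For contrast, here is how a real CDATA section behaves, again using xml.etree.ElementTree as an illustrative stand-in for the XML parsing stage rather than feedparser's actual parser:

```python
import xml.etree.ElementTree as ET

# A genuine CDATA section: the XML parser consumes the delimiters itself.
real = "<description><![CDATA[some <b>text</b>]]></description>"

# Only the section's contents remain; no "<![CDATA[" reaches later stages.
real_text = ET.fromstring(real).text
print(real_text)
```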

Tests

  • Added tests/wellformed/sgml/escaped_cdata_section.xml covering the reported case (runs under both the strict and loose parsers).
  • The full suite passes (4307 passed).

When a feed XML-escapes a CDATA section -- for example, a <description>
whose text content is the literal string '<![CDATA[...]]>' -- the HTML
processor handled the marked section with SGMLParser.unknown_decl(),
whose default implementation discards it. The character data was
therefore dropped and the field was parsed as an empty string.

Override unknown_decl() in BaseHTMLProcessor to emit the contents of a
CDATA marked section (escaped for the surrounding markup) instead of
discarding them. (kurtmckee#440)
Development

Successfully merging this pull request may close these issues.

Failed to parse description field with escaped CDATA.
