Extract the contents of escaped CDATA sections (#440) by Labib-Bin-Salam · Pull Request #572 · kurtmckee/feedparser

Labib-Bin-Salam · 2026-06-25T22:08:50Z

Fixes #440.

The problem

When a feed XML-escapes a CDATA section, feedparser parses the field as an empty string. For example:

<description>&lt;![CDATA[some text]]&gt;</description>

>>> feedparser.parse(rss).entries[0].description
''            # expected: 'some text'

After the document is parsed, the text content of <description> is the literal string <![CDATA[some text]]>. feedparser treats that content as HTML, so it runs through the SGML-based HTML processor, which recognizes <![CDATA[ ... ]]> as a marked section and hands its body to unknown_decl(). BaseHTMLProcessor did not override unknown_decl(), so the base sgmllib.SGMLParser.unknown_decl() ran, and its default implementation discards the data. As a result, the character data was silently dropped.
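The first step of that chain can be checked in isolation. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for feedparser's own parsers (an assumption made only to keep the example self-contained) and shows that XML entity decoding turns the escaped markers back into a literal CDATA-looking string. The names ESCAPED and text_after_xml_parse are hypothetical.

```python
import xml.etree.ElementTree as ET

# The escaped form from the report: the CDATA markers are ordinary escaped text,
# not a real CDATA section.
ESCAPED = "<description>&lt;![CDATA[some text]]&gt;</description>"


def text_after_xml_parse(xml: str) -> str:
    """Return the element's text content after XML entity decoding."""
    return ET.fromstring(xml).text or ""


escaped_text = text_after_xml_parse(ESCAPED)
print(repr(escaped_text))  # '<![CDATA[some text]]>'
```

This literal string is what the HTML processing stage then receives as input.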

This has been reported several times against real-world feeds (e.g. the feeds linked in #440).

The fix

Override unknown_decl() in BaseHTMLProcessor to emit the contents of a CDATA marked section instead of discarding them. The contents of a CDATA section are character data, so the few characters that are special in the surrounding markup (&, <, >) are escaped before being emitted. This keeps a script smuggled inside an escaped CDATA section inert: it is rendered as literal text rather than executed.
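The escaping step can be sketched as a standalone helper. This is not the code from the PR: cdata_to_text and CDATA_PREFIX are hypothetical names, and the sketch assumes the declaration body arrives with the leading <![ and trailing ]]> already stripped (for example 'CDATA[some text'). It relies only on html.escape from the standard library.

```python
from html import escape
from typing import Optional

# Assumed shape of a CDATA marked-section body once '<![' and ']]>' are removed.
CDATA_PREFIX = "CDATA["


def cdata_to_text(decl: str) -> Optional[str]:
    """Return escaped character data for a CDATA declaration, else None.

    None signals "not a CDATA section", so a caller can fall back to
    whatever it does for other declarations.
    """
    if not decl.startswith(CDATA_PREFIX):
        return None
    body = decl[len(CDATA_PREFIX):]
    # quote=False escapes only &, < and >, the characters named above.
    return escape(body, quote=False)


print(cdata_to_text("CDATA[some text"))
print(cdata_to_text("CDATA[<script>alert(1)</script>"))
```

An override of unknown_decl() could pass its argument to a helper like this and emit the result when it is not None. Returning None for everything else leaves other marked sections handled as before.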

Real (unescaped) CDATA sections are unaffected: the XML parser strips those before the HTML processor ever sees them.
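For contrast, a real CDATA section never reaches later stages with its markers intact. The sketch below again uses xml.etree.ElementTree as a stand-in for feedparser's XML parsing (an assumption), and REAL and real_text are hypothetical names.

```python
import xml.etree.ElementTree as ET

# A real (unescaped) CDATA section: the XML parser consumes the markers itself.
REAL = "<description><![CDATA[some <b>text</b>]]></description>"

real_text = ET.fromstring(REAL).text
print(repr(real_text))  # 'some <b>text</b>'
```

Only the section's contents survive XML parsing, so there is no marked section left for unknown_decl() to receive.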

Tests

Added tests/wellformed/sgml/escaped_cdata_section.xml covering the reported case (runs under both the strict and loose parsers).
The full suite passes (4307 passed).

When a feed XML-escapes a CDATA section -- for example, a <description> whose text content is the literal string '<![CDATA[...]]>' -- the HTML processor handled the marked section with SGMLParser.unknown_decl(), whose default implementation discards it. The character data was therefore dropped and the field was parsed as an empty string. Override unknown_decl() in BaseHTMLProcessor to emit the contents of a CDATA marked section (escaped for the surrounding markup) instead of discarding them. (kurtmckee#440)

Labib-Bin-Salam wants to merge 1 commit into kurtmckee:main from Labib-Bin-Salam:fix-escaped-cdata-440
