fix: strip leading whitespace before encoding detection (fixes #508) #570

Open
gaoflow wants to merge 1 commit into kurtmckee:main from gaoflow:fix-508-newline-xml

gaoflow commented Jun 17, 2026

When an XML feed starts with a newline before the XML declaration
(e.g. \n<?xml version="1.0"...), the encoding detection in
convert_to_utf8() fails to find the encoding attribute of the
<?xml declaration because the declaration is not at byte offset 0.
As a result, a second XML declaration is prepended, and the SAX
parser rejects the document with "XML or text declaration not at
start of entity".
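
To illustrate the failure mode, here is a minimal sketch using only the standard library's xml.sax module. The `doubled` bytes are a hypothetical stand-in for a document with a second declaration prepended; they are not the exact output of convert_to_utf8(), and `parses_cleanly` is a helper name made up for this example.

```python
import xml.sax


def parses_cleanly(data: bytes) -> bool:
    """Return True if the SAX parser accepts the document, False otherwise."""
    try:
        xml.sax.parseString(data, xml.sax.ContentHandler())
    except xml.sax.SAXParseException:
        return False
    return True


# A document with a single XML declaration at byte offset 0.
single = b'<?xml version="1.0" encoding="utf-8"?><rss/>'

# Hypothetical result of prepending a second declaration to a feed
# whose own declaration was preceded by a newline.
doubled = b'<?xml version="1.0" encoding="utf-8"?>\n<?xml version="1.0"?><rss/>'

print(parses_cleanly(single))
print(parses_cleanly(doubled))
```

The second document is rejected because an XML declaration is only permitted at the very start of the entity.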

Fix: strip leading ASCII whitespace from the data after BOM
detection and before encoding sniffing, so that XML declarations
preceded by whitespace are correctly detected.
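
The order of operations described above can be sketched roughly as follows. This is not the code from the PR or from feedparser: `sniff_declared_encoding`, the regular expression, and the UTF-8-only BOM handling are simplifying assumptions made for illustration.

```python
import codecs
import re
from typing import Optional, Tuple

# Assumption: only the UTF-8 BOM is handled here; a real implementation
# would need to check other BOMs as well.
ASCII_WHITESPACE = b" \t\r\n"

# Matches an XML declaration at offset 0 and captures its encoding value.
XML_DECL_ENCODING = re.compile(
    rb'^<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']'
)


def sniff_declared_encoding(data: bytes) -> Tuple[Optional[str], bytes]:
    """Return (declared encoding or None, data with BOM and leading whitespace removed)."""
    # Step 1: BOM detection (simplified to UTF-8 for this sketch).
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    # Step 2: strip leading ASCII whitespace so the declaration sits at offset 0.
    data = data.lstrip(ASCII_WHITESPACE)
    # Step 3: sniff the encoding from the XML declaration, if one is present.
    match = XML_DECL_ENCODING.match(data)
    encoding = match.group(1).decode("ascii") if match else None
    return encoding, data


feed = b'\n<?xml version="1.0" encoding="iso-8859-1"?><rss/>'
encoding, cleaned = sniff_declared_encoding(feed)
print(encoding, cleaned[:5])
```

Without the lstrip step, the anchored pattern would not match the feed above and the declared encoding would go undetected.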

…kee#508)

When an XML feed starts with a newline before the XML declaration,
the encoding detection in convert_to_utf8() fails to find the
<?xml encoding attribute, causing it to prepend a second XML
declaration which triggers SAX "XML or text declaration not at
start of entity" errors.

Strip leading ASCII whitespace from the data after BOM detection
but before encoding sniffing, so that XML declarations preceded
by newlines are correctly detected.