bug: 'fast-xml-parser' unexpectedly decodes HTML entities, causing mismatches in diff tools #7107

@indiesewell

App Version

v3.25.14

API Provider

Anthropic

Model Used

Claude 4 Sonnet

Description
In our project, the apply_diff tool, which is orchestrated by [src/core/tools/multiApplyDiffTool.ts], relies on fast-xml-parser to parse its instructions. The utility function responsible for this is located in [src/utils/xml.ts].

We've identified a critical issue where the tool fails when processing XML content that contains text with special characters such as &. The root cause is that fast-xml-parser's default configuration decodes HTML entities during parsing (e.g., converting &amp; to &). This causes a mismatch when the tool's diffing strategy later compares the parsed content against the original file content, leading to a "No sufficiently similar match found" error.

This behavior is particularly problematic and non-obvious for tools that require byte-for-byte or character-for-character accuracy between the original source and the parsed representation.
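
The mismatch can be illustrated without the library. The following is a minimal sketch, not the tool's actual matching code: `decodeBasicEntities` is a hypothetical stand-in for the parser's entity decoding, the file content is assumed for illustration, and a plain substring check stands in for the real similarity-based matching.

```typescript
// Hypothetical helper: mimics a parser that decodes the five predefined XML entities.
function decodeBasicEntities(text: string): string {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, "&"); // decode &amp; last so "&amp;lt;" becomes "&lt;", not "<"
}

// Assumed file content: the entity is stored literally in the file.
const fileContent = "## Team Identity &amp; Project Positioning\n";

// Search block exactly as it is written in the XML payload.
const rawSearch = "Team Identity &amp; Project Positioning";

// What the diff strategy receives once the parser has decoded entities.
const decodedSearch = decodeBasicEntities(rawSearch);

// Exact substring check as a simplified stand-in for the similarity search.
const rawMatches = fileContent.includes(rawSearch); // true
const decodedMatches = fileContent.includes(decodedSearch); // false
```

The raw search block is found in the file, while the decoded one is not, which is the situation the diff strategy ends up in when entities are decoded before matching.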

🔁 Steps to Reproduce
1. Use a tool that wraps file content in an XML structure for processing, like our apply_diff tool.
2. Attempt to modify a file where the search block contains a special character that is represented as an HTML entity (e.g., & becomes &amp;).
3. The apply_diff tool calls parseXml (from src/utils/xml.ts), which uses fast-xml-parser with its default settings.
4. The parser returns an object where the string now contains & instead of &amp;.
5. The diffing engine compares "Team Identity & Project Positioning" (from the parser) with "Team Identity &amp; Project Positioning" (from the file) and fails to find a match.

Minimal Reproducible Example
import { XMLParser } from "fast-xml-parser";

// Simulate the XML content passed to the tool
const xmlInput = `<args> <file> <path>./doc.md</path> <diff> <content> Team Identity &amp; Project Positioning </content> </diff> </file> </args>`;

// Default parser configuration (problematic): entities are decoded
const defaultParser = new XMLParser();
const parsedWithDefaults = defaultParser.parse(xmlInput);
console.log('Parsed with defaults:', parsedWithDefaults.args.file.diff.content);
// Expected output: "Team Identity &amp; Project Positioning"
// Actual output: "Team Identity & Project Positioning"

// Correct parser configuration (works as expected): entities are left as written
const correctParser = new XMLParser({
  processEntities: false, // The key fix
});
const parsedCorrectly = correctParser.parse(xmlInput);
console.log('Parsed correctly:', parsedCorrectly.args.file.diff.content);
// Expected output: "Team Identity &amp; Project Positioning"
// Actual output: "Team Identity &amp; Project Positioning"

Expected Behavior
For a tool designed for precise text manipulation, HTML entity processing should be disabled (fast-xml-parser exposes this through its processEntities option), and the library's decode-by-default behavior should arguably be changed or at least clearly documented as a potential "gotcha". The parsed content should exactly match the original string content from the file.
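
One way to guard this expectation is a round-trip check in the test suite. The sketch below is parser-agnostic and purely illustrative: `ContentExtractor`, `findCorruptedSamples`, and both extractors are hypothetical names, and the regex-based extractors are stand-ins rather than fast-xml-parser itself.

```typescript
// Hypothetical signature: takes an XML string, returns the text of its <content> element.
type ContentExtractor = (xml: string) => string;

// Returns the samples that do NOT survive a wrap-and-parse round trip unchanged.
function findCorruptedSamples(extract: ContentExtractor, samples: string[]): string[] {
  return samples.filter((sample) => extract(`<content>${sample}</content>`) !== sample);
}

// Stand-in extractors for illustration only (regex-based, not fast-xml-parser).
const literalExtractor: ContentExtractor = (xml) =>
  /<content>([\s\S]*)<\/content>/.exec(xml)?.[1] ?? "";
const decodingExtractor: ContentExtractor = (xml) =>
  literalExtractor(xml).replace(/&amp;/g, "&"); // simulates entity decoding

const samples = ["Team Identity &amp; Project Positioning", "plain text"];

// The literal extractor preserves every sample.
const corruptedByLiteral = findCorruptedSamples(literalExtractor, samples);
// The decoding extractor alters the sample that contains an entity.
const corruptedByDecoding = findCorruptedSamples(decodingExtractor, samples);
```

In a real test, the extractor would presumably wrap the project's own parsing function, so that a configuration change that re-enables entity decoding shows up as a non-empty result.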

Actual Behavior
fast-xml-parser decodes entities by default, causing silent data corruption for use cases that depend on literal string matching.

💥 Outcome Summary

Suggested Solution
The fix is to explicitly set processEntities: false in the fast-xml-parser options within our src/utils/xml.ts file. This prevents the library from decoding entities and preserves the original string.

We recommend either:

- Changing the default behavior of fast-xml-parser to not process entities unless explicitly enabled.
- Or, more prominently documenting this default behavior in the README as a critical consideration for any application using the library for text-based diffing or validation.

Environment:

- Library: fast-xml-parser
- Version: 5.x.x
- Context: Node.js-based developer tool, specifically affecting file operations in [src/core/tools/multiApplyDiffTool.ts]

\src\utils\xml.ts:28-56


/**
 * Parses an XML string for diffing purposes, ensuring no HTML entities are decoded.
 * This is a specialized version of parseXml to be used exclusively by diffing tools
 * to prevent mismatches caused by entity processing.
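 * @param xmlString The XML string to parse
 * @param stopNodes Optional tag paths (e.g. "file.diff.content") passed to the parser's stopNodes option; defaults to none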
 * @returns Parsed JavaScript object representation of the XML
 * @throws Error if the XML is invalid or parsing fails
 */
export function parseXmlForDiff(xmlString: string, stopNodes?: string[]): unknown {
	const _stopNodes = stopNodes ?? []
	try {
		const parser = new XMLParser({
			ignoreAttributes: false,
			attributeNamePrefix: "@_",
			parseAttributeValue: false,
			parseTagValue: false,
			trimValues: true,
			processEntities: false, // Do not process HTML entities, keep them as is
			stopNodes: _stopNodes,
		})

		return parser.parse(xmlString)
	} catch (error) {
		// Enhance error message for better debugging
		const errorMessage = error instanceof Error ? error.message : "Unknown error"
		throw new Error(`Failed to parse XML: ${errorMessage}`)
	}
}

\src\core\tools\multiApplyDiffTool.ts:15-15

import { parseXmlForDiff } from "../../utils/xml"

\src\core\tools\multiApplyDiffTool.ts:111-111

			const parsed = parseXmlForDiff(argsXmlTag, ["file.diff.content"]) as ParsedXmlResult

Metadata

Assignees

No one assigned

    Labels

    Issue/PR - Triage: New issue. Needs quick review to confirm validity and assign labels.
    bug: Something isn't working

    Type

    No type

    Projects

    Status

    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests