Custom Data or RcData tags with hyphens don't close correctly #2332

@sparra-inmoba

jsoup 1.20.1 introduced the TagSet class, which allows custom element behaviors to be defined. However, when a custom element with a hyphenated name (such as custom-style) is configured as Tag.Data, the parser fails to detect the element's closing tag, and the serialized output is malformed.

This issue affects both the HTML parser (Parser.htmlParser()) and the XML parser (Parser.xmlParser()); both show the same incorrect behavior.

Expected Behavior:

The input HTML or XML should be preserved as-is, treating the content of custom elements defined as Tag.Data as raw data (similar to <script> or <style>), and not parsing it as nested HTML/XML.

Actual Behavior:

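The parser does not recognize the </custom-style> end tag. The rest of the input, including </body> and </html>, is absorbed into the custom-style element as raw data, and a second set of closing tags is then appended on output, producing an invalid document structure.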
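To make the failure pattern concrete, here is a minimal, self-contained sketch of raw-data end tag matching. It is not jsoup's implementation: the class RawDataEndTag and all of its methods are hypothetical names invented for illustration, and the idea that a letters-only name scan is involved is an assumption suggested by the issue title, not something confirmed in this report. The sketch compares a name reader that stops at the first non-letter with one that reads up to whitespace, '/' or '>'.

```java
final class RawDataEndTag {

    /** Hypothetical lossy reader: accepts letters only, so it stops at '-'. */
    static String lettersOnlyName(String input, int pos) {
        int end = pos;
        while (end < input.length() && Character.isLetter(input.charAt(end))) {
            end++;
        }
        return input.substring(pos, end);
    }

    /** Hypothetical full reader: reads up to whitespace, '/' or '>'. */
    static String fullName(String input, int pos) {
        int end = pos;
        while (end < input.length()) {
            char c = input.charAt(end);
            if (Character.isWhitespace(c) || c == '/' || c == '>') {
                break;
            }
            end++;
        }
        return input.substring(pos, end);
    }

    /**
     * Returns the index of the "</" that starts the matching end tag,
     * or -1 if no candidate end tag has the expected name.
     */
    static int findEndTag(String input, String name, boolean hyphenAware) {
        int from = 0;
        while (true) {
            int idx = input.indexOf("</", from);
            if (idx < 0) {
                return -1;
            }
            String candidate = hyphenAware
                    ? fullName(input, idx + 2)
                    : lettersOnlyName(input, idx + 2);
            if (candidate.equalsIgnoreCase(name)) {
                return idx;
            }
            from = idx + 2; // not our end tag: keep treating the text as raw data
        }
    }

    /** Raw content of the element: everything before the end tag, or all input if none matches. */
    static String rawContent(String input, String name, boolean hyphenAware) {
        int end = findEndTag(input, name, hyphenAware);
        return end < 0 ? input : input.substring(0, end);
    }

    public static void main(String[] args) {
        // Text that follows the opening <custom-style> tag in the example input.
        String rest = " a > p {color: #0000; } </custom-style>\n </body>\n</html>";
        System.out.println("[" + rawContent(rest, "custom-style", true) + "]");
        System.out.println("[" + rawContent(rest, "custom-style", false) + "]");
    }
}
```

With the letters-only reader, the candidate name read from </custom-style> is "custom", which never equals "custom-style", so every remaining character is treated as data. That matches the shape of the reported output, where the original closing tags end up inside the element. A name without a hyphen, such as style, is matched by either reader.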

Reproducible Example:

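// Imports assumed for the enclosing test class (JUnit 5 and AssertJ):
import static org.assertj.core.api.Assertions.assertThat;

import org.jsoup.parser.Parser;
import org.jsoup.parser.Tag;
import org.jsoup.parser.TagSet;
import org.junit.jupiter.api.Test;

@Test
void testHtmlCustomElementsAsData() {
  // Start from the default HTML tags and mark <custom-style> as a data element.
  var tags = TagSet.Html();
  tags.valueOf("custom-style", Parser.NamespaceHtml)
      .set(Tag.Data);

  var html = """
      <html>
       <head>
        <style> a > p {color: #0000; } </style>
       </head>
       <body>
        <custom-style> a > p {color: #0000; } </custom-style>
       </body>
      </html>""";

  // Parse with the customized TagSet; the serialized document should match the input.
  var doc = Parser.htmlParser().tagSet(tags).parseInput(html, "http://example.com");
  assertThat(doc.html()).isEqualToNormalizingNewlines(html);
}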

Actual Output:

<html>
 <head>
  <style> a > p {color: #0000; } </style>
 </head>
 <body>
  <custom-style> a > p {color: #0000; } </custom-style>
 </body>
</html></custom-style>
 </body>
</html>

The same output pattern is observed when using Parser.xmlParser() with a corresponding TagSet configuration.

Reproduction Repository:

You can find a minimal test project reproducing the issue here:
🔗 https://github.com/sparra-inmoba/jsoup-tagset-broken

Labels: bug, fixed