Non-terminated HTML Entities are not recognized properly #2207

@Muthukirthan

Description

Case 1

Input: <p>a&nbspc</p>
Browser result: a c
The browser treats &nbsp as the &nbsp; HTML entity.

Jsoup parsed content: <p>a&amp;nbspc</p>
Browser result: a&nbspc
Jsoup does not recognize &nbsp, so the re-serialized output renders differently in the browser.


Case 2

Input: <p>a&nbsp&shyc</p>
Browser result: a ­c
The browser treats &nbsp and &shy as the &nbsp; and &shy; HTML entities respectively.

Jsoup parsed content: <p>a&nbsp;&amp;shyc</p>
Browser result: a &shyc
Jsoup recognizes &nbsp (possibly because it is followed by an & character), but it does not recognize &shy as &shy;. The re-serialized output renders differently in the browser.


Case 3

Input: <p>a&shyc&nbsp</p>
Browser result: a­c 
The browser treats &nbsp and &shy as the &nbsp; and &shy; HTML entities respectively.

Jsoup parsed content: <p>a&amp;shyc&nbsp;</p>
Browser result: a&shyc 
Jsoup recognizes &nbsp (because the string ends with that entity), but it does not recognize &shy as &shy;. The re-serialized output renders differently in the browser.


After checking a few more cases, the issue appears only for named entities (such as &nbsp;, &amp;, &quot;, and others) that are not terminated with a semicolon and are immediately followed by letters. Hexadecimal and decimal numeric entities are detected even when the semicolon is missing.

Examples:
Detected correctly:
&nbsp,ddd (expected &nbsp;,ddd and got the same result)
&nbsp ddd (expected &nbsp; ddd and got the same result)
djdjb&nbsp (expected djdjb&nbsp; and got the same result)

Detected incorrectly (the issue):
&nbspdhdj (expected &nbsp;dhdj but got &amp;nbspdhdj)
&ampdfgsj (expected &amp;dfgsj but got &amp;ampdfgsj)

Browsers detect these HTML entities correctly. I also validated this with https://mothereff.in/html-entities.
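To make the expected behaviour concrete, here is a minimal sketch of the prefix matching that browsers appear to apply in text content: after an `&`, the longest known entity name is consumed even when no semicolon follows. The class `LegacyEntitySketch`, its methods, and the six-entry table are hypothetical and cover only a tiny subset of names; this is not jsoup's implementation, and attribute values follow stricter rules that are not modelled here.

```java
import java.util.Map;

public class LegacyEntitySketch {
    // Illustrative subset of entity names that are accepted without a trailing ';'.
    static final Map<String, String> LEGACY = Map.of(
        "nbsp", "\u00A0",
        "shy", "\u00AD",
        "amp", "&",
        "quot", "\"",
        "lt", "<",
        "gt", ">");

    // Decodes named references in text content, with or without the ';'.
    static String decode(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '&') {
                String name = longestLegacyName(text, i + 1);
                if (name != null) {
                    out.append(LEGACY.get(name));
                    i += 1 + name.length();
                    // The terminating semicolon is optional; consume it if present.
                    if (i < text.length() && text.charAt(i) == ';') {
                        i++;
                    }
                    continue;
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    // Returns the longest table name that the text starts with at 'start', or null.
    static String longestLegacyName(String text, int start) {
        String best = null;
        for (String name : LEGACY.keySet()) {
            if (text.startsWith(name, start)
                    && (best == null || name.length() > best.length())) {
                best = name;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // The failing inputs from this report, checked against the browser result.
        System.out.println(decode("a&nbspc").equals("a\u00A0c"));
        System.out.println(decode("a&nbsp&shyc").equals("a\u00A0\u00ADc"));
        System.out.println(decode("&ampdfgsj").equals("&dfgsj"));
    }
}
```

The point of the sketch is that the character following the name (a letter, punctuation, or end of input) plays no part in the match, which is why `&nbspdhdj` and `&nbsp,ddd` would both decode under this model.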

Parser: HTML parser
Escape mode: The result is the same for both base and extended. In xhtml escape mode the nbsp entity is output as &#xa0;, but the result is otherwise the same.
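A reproduction along these lines should show the behaviour for each escape mode. It is a sketch that assumes jsoup is on the classpath; the class name `JsoupEntityRepro` and the helper `roundTrip` are made up for this example, and no expected output is listed because it depends on the jsoup version in use.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;

public class JsoupEntityRepro {
    // Parses with the default HTML parser and re-serializes the body
    // using the given escape mode, with pretty-printing switched off.
    static String roundTrip(String html, Entities.EscapeMode mode) {
        Document doc = Jsoup.parse(html);
        doc.outputSettings().escapeMode(mode).prettyPrint(false);
        return doc.body().html();
    }

    public static void main(String[] args) {
        // The three cases from this report.
        String[] inputs = {
            "<p>a&nbspc</p>",
            "<p>a&nbsp&shyc</p>",
            "<p>a&shyc&nbsp</p>"
        };
        for (Entities.EscapeMode mode : Entities.EscapeMode.values()) {
            for (String input : inputs) {
                System.out.println(mode + " | " + input + " -> " + roundTrip(input, mode));
            }
        }
    }
}
```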

I also raised a related question about entities in #2206.

Labels: bug, fixed
