Description
Case1
Input: <p>a&nbspc</p>
Browser result: a c
&nbsp is recognized as an HTML entity.
Jsoup parsed content: <p>a&amp;nbspc</p>
Browser result: a&nbspc
&nbsp is not recognized, so the browser shows a different result.
Case2
Input: <p>a&nbsp&shyc</p>
Browser result: a c
&nbsp and &shy are recognized as their respective HTML entities.
Jsoup parsed content: <p>a&nbsp;&amp;shyc</p>
Browser result: a &shyc
&nbsp is recognized (possibly because the next character is &), but &shy is not recognized as an entity, so the browser shows a different result.
Case3
Input: <p>a&shyc&nbsp</p>
Browser result: ac
&nbsp and &shy are recognized as their respective HTML entities.
Jsoup parsed content: <p>a&amp;shyc&nbsp;</p>
Browser result: a&shyc
&nbsp is recognized (as the string ends with that entity), but &shy is not recognized as an entity, so the browser shows a different result.
On checking a few more cases, this issue is seen only for named entities (like &nbsp, &amp, &quot, and others) when the entity is not terminated with a semicolon and is immediately followed by letters. Hexadecimal and decimal numeric entities are detected even when they are not terminated with a semicolon.
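Examples:
Proper detection, as expected:
&nbsp,ddd (expected a non-breaking space followed by ",ddd", and got the same result)
&nbsp ddd (expected a non-breaking space followed by " ddd", and got the same result)
djdjb&nbsp (expected "djdjb" followed by a non-breaking space, and got the same result)
Invalid detections (ISSUES):
&nbspdhdj (expected a non-breaking space followed by "dhdj", but got &amp;nbspdhdj)
&ampdfgsj (expected &dfgsj, but got &amp;ampdfgsj)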
Browsers are able to detect these HTML entities. Validated with https://mothereff.in/html-entities as well.
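To make the expected browser behaviour concrete, here is a minimal sketch in plain Java (no jsoup dependency). It assumes a small, hand-picked subset of the named entities that HTML allows without a trailing semicolon. The class name LegacyEntityDecoder and the entity table are hypothetical and only for illustration; this is not jsoup's implementation. It decodes such an entity in text content even when letters follow it directly.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LegacyEntityDecoder {
    // Assumption: an illustrative subset of the entity names that may appear
    // without a trailing semicolon. The real HTML list is longer.
    private static final Map<String, String> LEGACY = new LinkedHashMap<>();
    static {
        LEGACY.put("nbsp", "\u00A0");
        LEGACY.put("shy", "\u00AD");
        LEGACY.put("amp", "&");
        LEGACY.put("quot", "\"");
        LEGACY.put("lt", "<");
        LEGACY.put("gt", ">");
    }

    // Decodes the entities above in text content, with or without a semicolon.
    public static String decode(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char ch = text.charAt(i);
            String match = null;
            if (ch == '&') {
                // Pick the longest known name that starts right after '&'.
                for (String name : LEGACY.keySet()) {
                    if (text.startsWith(name, i + 1)
                            && (match == null || name.length() > match.length())) {
                        match = name;
                    }
                }
            }
            if (match == null) {
                // Not an entity we know: copy the character unchanged.
                out.append(ch);
                i++;
                continue;
            }
            out.append(LEGACY.get(match));
            i += 1 + match.length();
            // The semicolon is optional for these names; consume it if present.
            if (i < text.length() && text.charAt(i) == ';') {
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("&ampdfgsj")); // prints "&dfgsj"
    }
}
```

With this sketch, decode("a&nbspc") yields "a", a non-breaking space (U+00A0), then "c", which matches what browsers display for Case1.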
Parser: HTML parser
Escape mode: Same result for both base and extended. In xhtml escape mode the nbsp entity is replaced by &#xa0;, but the result is otherwise the same.
I also raised a related question about entities in #2206.