[stdlib] Add utf8 safeguards, fix `chr` method, add unicode and utf16 parsing for `String` #3239

martinvuyk · 2024-07-13T17:05:00Z

Adds UTF-8 safeguards; this is the second of many steps toward fixing #2842.

`chr(c: Int) -> String` now returns a replacement character (�)
if the Unicode codepoint is invalid.
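
As a rough illustration of that `chr` behaviour, here is a Python sketch; it is not the Mojo implementation from this PR. It assumes "invalid" means a value outside 0..0x10FFFF or inside the surrogate range 0xD800..0xDFFF, and the helper name `chr_or_replacement` is hypothetical.

```python
REPLACEMENT = "\uFFFD"  # U+FFFD REPLACEMENT CHARACTER


def chr_or_replacement(c: int) -> str:
    """Return the character for code point c, or U+FFFD if c is not a
    Unicode scalar value (assumed definition of "invalid")."""
    out_of_range = c < 0 or c > 0x10FFFF
    is_surrogate = 0xD800 <= c <= 0xDFFF
    if out_of_range or is_surrogate:
        return REPLACEMENT
    return chr(c)
```

Python's built-in `chr` raises `ValueError` for out-of-range values and accepts lone surrogates, so the wrapper is what gives the "always returns a string" behaviour.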

Added `String.from_unicode(values: List[Int]) -> String` and
`String.from_utf16(values: List[UInt16]) -> String`, which return a String
containing the concatenated characters. If a Unicode codepoint
is invalid, the parsed String has a replacement character (�) at that index.
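
The two parsing functions could behave roughly like the Python sketch below. This is an assumption-laden illustration, not the PR's code: the Python function names merely mirror the Mojo ones, "invalid" is assumed to mean a non-scalar value or an unpaired surrogate, and UTF-16 input is assumed to be a list of integers in 0..0xFFFF.

```python
from typing import List

REPLACEMENT = "\uFFFD"  # U+FFFD REPLACEMENT CHARACTER


def from_unicode(values: List[int]) -> str:
    """Concatenate code points, substituting U+FFFD for invalid ones."""
    out = []
    for v in values:
        valid = 0 <= v <= 0x10FFFF and not (0xD800 <= v <= 0xDFFF)
        out.append(chr(v) if valid else REPLACEMENT)
    return "".join(out)


def from_utf16(values: List[int]) -> str:
    """Decode UTF-16 code units, substituting U+FFFD for unpaired surrogates."""
    out = []
    i = 0
    n = len(values)
    while i < n:
        unit = values[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: valid only if a low surrogate follows.
            if i + 1 < n and 0xDC00 <= values[i + 1] <= 0xDFFF:
                low = values[i + 1]
                code_point = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
                out.append(chr(code_point))
                i += 2
                continue
            out.append(REPLACEMENT)
        elif 0xDC00 <= unit <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            out.append(REPLACEMENT)
        else:
            out.append(chr(unit))
        i += 1
    return "".join(out)
```

In this sketch an unpaired surrogate yields exactly one replacement character and decoding continues with the next unit, which keeps valid neighbours intact.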

Signed-off-by: martinvuyk <[email protected]>


JoeLoser · 2024-11-22T02:40:15Z

@martinvuyk do you mean for this to still be a draft, and are you still pursuing it? Happy to review it after it's rebased.

martinvuyk · 2024-11-22T13:15:44Z

@JoeLoser I'm still mulling this one over.

Python's behavior is quite varied around this functionality:

> class str(object=b'', encoding='utf-8', errors='strict'):
> [...]
> If at least one of encoding or errors is given, object should be a bytes-like object (e.g. bytes or bytearray). In this case, if object is a bytes (or bytearray) object, then str(bytes, encoding, errors) is equivalent to bytes.decode(encoding, errors). Otherwise, the bytes object underlying the buffer object is obtained before calling bytes.decode(). See Binary Sequence Types — bytes, bytearray, memoryview and Buffer Protocol for information on buffer objects.

Some things hold me back on implementing these fully:

- Every possible datatype that can have those encodings needs to be taken into account
- Every possible string combination to express the standard encodings needs to be taken into account
- Every kind of code-path to decode according to the errors argument needs to be implemented

> errors controls how decoding errors are handled. If 'strict' (the default), a UnicodeError exception is raised. Other possible values are 'ignore', 'replace', and any other name registered via codecs.register_error(). See Error Handlers for details.
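
For reference, this is how the three built-in `errors` handlers quoted above behave in Python itself, using a small byte string chosen purely for illustration. Each handler is a separate code path that an equivalent Mojo API would have to cover.

```python
# Valid UTF-8 for "café " followed by 0xFF, a byte that never occurs in UTF-8.
data = b"caf\xc3\xa9 \xff"

# 'strict' (the default) raises UnicodeDecodeError, a subclass of UnicodeError.
try:
    data.decode("utf-8")
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "raised"

# 'ignore' silently drops the offending byte.
ignored = data.decode("utf-8", errors="ignore")

# 'replace' substitutes U+FFFD for the offending byte.
replaced = data.decode("utf-8", errors="replace")

# str(object, encoding, errors) accepts any bytes-like object and
# is equivalent to bytes.decode for bytes input.
via_str = str(data, "utf-8", "replace")
via_memoryview = str(memoryview(data), "utf-8", "replace")
```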

I'm leaning towards creating `Span.decode(self) -> String` and `String.encode(self) -> Span[Byte]`, since it seems like we are slowly consolidating towards `Span[Byte]` being the equivalent of `bytes`. But there is a glaring issue: newly allocated Spans have no ownership of their data, so that memory would leak, which to me is a massive foot-gun (`String.encode()` for a non-UTF-8 encoding would mean allocating a new buffer). To avoid this, Span would need a flexible pointer that may or may not own its data. This was one big motivation for FlexiblePointer in proposal #3728, which we could still implement independently of the trait proposed there (which Owen convinced me is too constraining). And if we do give it a special pointer and make it able to own its data or not, I'd like us to rename Span to Buffer, since I think that will be more fitting.

Another possibility is going for `String.decode(List[Byte]) -> String` and `String.encode(self) -> List[Byte]`, which would behave similarly to Python's `bytes.decode()` and `str.encode() -> bytes`. But we would be leaving a lot of performance on the table, namely:

- Having decode return a new String would be wasteful for UTF-8 encoded buffers (which is most of the text out there) when the buffer is passed as owned and owns its data (which happens most of the time when receiving data over the wire and decoding)
- `String.encode()` would need to consume the string if we want to give the pointer to the List without wasting copies
- Having both decode and encode require a `List[Byte]` is not as scalable as using `Span[Byte]` (if we make them able to own their data)
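
The copy-versus-view distinction behind these points can be loosely illustrated in Python. This is only an analogy: Python is garbage-collected, so it cannot show the leak concern, and `bytes`, `bytearray` and `memoryview` merely stand in for an owning list and a non-owning span.

```python
text = "héllo"

# Owning round trip: each call allocates a new object.
encoded = text.encode("utf-8")      # new bytes object
decoded = encoded.decode("utf-8")   # new str object

# Non-owning view: a memoryview shares storage with the bytearray it wraps.
owner = bytearray(encoded)          # one copy, to get mutable storage
view = memoryview(owner)            # no copy, just a view
owner[0] = ord("H")                 # mutate through the owner

# Decoding works directly on the view and sees the mutation,
# while the original immutable bytes object is unaffected.
decoded_from_view = str(view, "utf-8")
```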

I can make a proposal for the special pointer (or some other mechanism to signal ownership) and the renaming of Span -> Buffer, or we can go with the `List[Byte]` approach. WDYT?

add better safeguards and fix chr method

7e4f0df

Signed-off-by: martinvuyk <[email protected]>

martinvuyk requested a review from a team as a code owner July 13, 2024 17:05

martinvuyk added 20 commits July 13, 2024 13:05

update changelog

7134f8f

Signed-off-by: martinvuyk <[email protected]>

rename to from_unicode

ab84608

Signed-off-by: martinvuyk <[email protected]>

move from_unicode to be static method

53d7038

Signed-off-by: martinvuyk <[email protected]>

fix from_unicode

6d480b7

Signed-off-by: martinvuyk <[email protected]>

fix docstring

5236388

Signed-off-by: martinvuyk <[email protected]>

fix indentation

439aa21

Signed-off-by: martinvuyk <[email protected]>

fix list constructor

c6f2dfb

Signed-off-by: martinvuyk <[email protected]>

fix use less lines

20bf017

Signed-off-by: martinvuyk <[email protected]>

add utf16 decode

9a62b42

Signed-off-by: martinvuyk <[email protected]>

fix changelog

0bbc386

Signed-off-by: martinvuyk <[email protected]>

fix detail

74e698b

Signed-off-by: martinvuyk <[email protected]>

fix detail

bf4093d

Signed-off-by: martinvuyk <[email protected]>

fix detail

5a2af26

Signed-off-by: martinvuyk <[email protected]>

fix detail

30c027f

Signed-off-by: martinvuyk <[email protected]>

fix detail

ddcbf0d

Signed-off-by: martinvuyk <[email protected]>

simplify utf16 internals

9f5ee3b

Signed-off-by: martinvuyk <[email protected]>

fix detail

fcc789c

Signed-off-by: martinvuyk <[email protected]>

fix detail

e08bc57

Signed-off-by: martinvuyk <[email protected]>

fix detail

9ffd5e6

Signed-off-by: martinvuyk <[email protected]>

fix detail

afb537a

Signed-off-by: martinvuyk <[email protected]>

martinvuyk changed the title ~~[stdlib] Add utf8 safeguards and fix chr method~~ [stdlib] Add utf8 safeguards, fix chr method, add unicode and utf16 parsing for String Jul 14, 2024

martinvuyk added 4 commits July 13, 2024 20:39

fix detail

805041e

Signed-off-by: martinvuyk <[email protected]>

fix detail

0fcdf50

Signed-off-by: martinvuyk <[email protected]>

fix detail

be5a203

Signed-off-by: martinvuyk <[email protected]>

fix detail

fccdbcd

Signed-off-by: martinvuyk <[email protected]>

mzaks reviewed Jul 14, 2024

stdlib/test/builtin/test_string.mojo (outdated, resolved)

martinvuyk added 2 commits July 14, 2024 10:04

add suggestion from @mzaks

f46ce80

Signed-off-by: martinvuyk <[email protected]>

fix use unsafe_get

6b47694

Signed-off-by: martinvuyk <[email protected]>

Merge remote-tracking branch 'upstream/nightly' into add-utf8-safeguards

ca38ca3

martinvuyk mentioned this pull request Jul 16, 2024

[stdlib] Add FileHandle iterator and UTF-8 safeguards #3257

Closed

martinvuyk added 2 commits July 16, 2024 12:14

use variant for unicode parsing

af3be58

Signed-off-by: martinvuyk <[email protected]>

Merge remote-tracking branch 'upstream/nightly' into add-utf8-safeguards

a4eedb0

martinvuyk marked this pull request as draft September 5, 2024 17:03

martinvuyk mentioned this pull request Sep 18, 2024

[stdlib] Add full unicode support for character casing functions #3496

Closed

JoeLoser added the `waiting for response` label (Needs action/response from contributor before a PR can proceed) Nov 22, 2024

martinvuyk mentioned this pull request Nov 22, 2024

[Feature Request] [proposal] A new Buffer type #3797

Closed

1 task
