Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[stdlib] Add utf8 safeguards, fix chr method, add unicode and utf16 parsing for String #3239

Draft
wants to merge 30 commits into
base: nightly
Choose a base branch
from

Conversation

martinvuyk
Copy link
Contributor

@martinvuyk martinvuyk commented Jul 13, 2024

Add utf8 safeguards, the second of many steps to fix #2842

fn chr(c: Int) -> String function now returns a replacement character (�)
if the Unicode codepoint is invalid.

Added String.from_unicode(values: List[Int]) -> String and
String.from_utf16(values: List[UInt16]) -> String functions that return a String
containing the concatenated characters. If a Unicode codepoint
is invalid, the parsed String has a replacement character (�) in that index.

@martinvuyk martinvuyk requested a review from a team as a code owner July 13, 2024 17:05
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
@martinvuyk martinvuyk changed the title [stdlib] Add utf8 safeguards and fix chr method [stdlib] Add utf8 safeguards, fix chr method, add unicode and utf16 parsing for String Jul 14, 2024
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
@JoeLoser
Copy link
Collaborator

@martinvuyk do you mean for this to be a draft still, or still pursuing this? Happy to review it after it's rebased.

@JoeLoser JoeLoser added the waiting for response Needs action/response from contributor before a PR can proceed label Nov 22, 2024
@martinvuyk
Copy link
Contributor Author

martinvuyk commented Nov 22, 2024

@JoeLoser I'm still mulling this one over.

Python's behavior is quite varied around this functionality:

class str(object=b'', encoding='utf-8', errors='strict'):
[...]
If at least one of encoding or errors is given, object should be a bytes-like object (e.g. bytes or bytearray). In this case, if object is a bytes (or bytearray) object, then str(bytes, encoding, errors) is equivalent to bytes.decode(encoding, errors). Otherwise, the bytes object underlying the buffer object is obtained before calling bytes.decode(). See Binary Sequence Types — bytes, bytearray, memoryview and Buffer Protocol for information on buffer objects.

Some things hold me back on implementing these fully:

  • Every possible datatype that can have those encodings needs to be taken into account
  • Every possible string combination to express the standard encodings needs to be taken into account
  • Every kind of code-path to decode according to the errors argument needs to be implemented

errors controls how decoding errors are handled. If 'strict' (the default), a UnicodeError exception is raised. Other possible values are 'ignore', 'replace', and any other name registered via codecs.register_error(). See Error Handlers for details.

I'm leaning towards creating Span.decode(self) -> String and String.encode(self) -> Span[Byte] since it seems like we are slowly consolidating towards Span[Byte] being the equivalent of bytes. But there is a glaring issue which is the memory leakage of newly allocated Spans which have no ownership of their data and to me is a massive foot-gun (String.encode() for a non utf8 encoding would mean allocating a new buffer). Span would need a flexible pointer which can be or not be owned for this not to happen, this was one big motivation for FlexiblePointer in proposal #3728 which we could still implement independent of the trait proposed there (which Owen convinced me is too constraining). And if we do give it a special pointer and make it able to own its data or not, I'd like us to rename Span to Buffer since I think it will be more fitting.

Another possibility is going for String.decode(List[Byte]) -> String and String.encode(self) -> List[Byte] which would behave similarly to Python's bytes.decode() and string.encode(self) -> bytes. But we would be leaving a lot of performance on the table, namely:

  • Having decode return a new String would be wasteful for utf8 encoded buffers (which is most of text out there) for buffers that are passed as owned and own their data (which happens most of the time when receiving data over the wire and decoding)
  • String.encode() would need to consume the string if we want to give the pointer to the List without wasting copies
  • Having both decode and encode require a List[Byte] is not as scalable as using Span[Byte] (if we make them able to own their data)

I can make a proposal for the special pointer (or some other mechanism to signal ownership) and renaming of Span -> Buffer, or we can go with the List[Byte] approach WDYT?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
waiting for response Needs action/response from contributor before a PR can proceed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants