ruff does not honor declaration of character coding #6791

PeterSlickers · 2023-08-22T19:04:03Z

According to PEP 263, a character encoding can be declared in a Python source file. This is done with a specially formatted comment placed on the first or second line of the file:

#!/usr/bin/python
# -*- coding: latin-1 -*-

It seems that Ruff (0.0.285) does not honor the coding declaration and assumes that input files are always encoded as UTF-8. The following Python program demonstrates the problem. It first generates three short Python program files with different encodings and then runs ruff and python3 on each of them.

#!/usr/bin/env python3
# -*- coding: us-ascii -*-

import subprocess


prog = """# -*- coding: {} -*-
print(\"\u00D8resund og Sj\u00E6lland\")
"""

## create artefacts with differing encodings
filenames = []

filenames.append("prog-utf8.py")
print(f"writing file '{filenames[-1]}'")
with open(filenames[-1], "wb") as outstream:
	outstream.write(prog.format("utf8").encode("utf8"))	

filenames.append("prog-usascii.py")
print(f"writing file '{filenames[-1]}'")
with open(filenames[-1], "wb") as outstream:
	# declared encoding differs from the true encoding
	outstream.write(prog.format("us-ascii").encode("utf8"))	

filenames.append("prog-latin1.py")
print(f"writing file '{filenames[-1]}'")
with open(filenames[-1], "wb") as outstream:
	outstream.write(prog.format("latin-1").encode("latin1"))	

## re-check encodings of the artefacts
print("\nTrue encodings")	
for filename in filenames:
	subprocess.call(["file", "-i", filename,])

## run python3 and ruff on the artefacts
for filename in filenames:
	cmd = ["python3", filename,]
	print("---\n" + " ".join(cmd))
	subprocess.call(cmd)
	cmd = ["ruff", filename,]
	print("\n" + " ".join(cmd))
	subprocess.call(cmd)

The first file, encoded as utf8 and declared as such, passes both ruff and python3. This is the expected behaviour.

The second file contains characters in utf8 encoding, but wrongly declares us-ascii encoding. This file throws an error when run with python3, but passes ruff without complaint. I would expect ruff to complain about this file.

The third file contains characters in latin1 encoding and correctly declares its encoding. This program runs successfully with python3, but throws an error when checked with ruff. I would expect ruff not to complain about this file.
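
The three outcomes on the python3 side can be reproduced without writing any files. The following is a minimal sketch, assuming that the standard library's `tokenize.detect_encoding` is a fair stand-in for how the interpreter resolves the PEP 263 comment; the helper name `decode_like_python` and the `TEMPLATE`/`results` names are made up for this illustration and are not part of Ruff or Python.

```python
import io
import tokenize

# Same program text as in the report; the escapes become real non-ASCII characters.
TEMPLATE = '# -*- coding: {} -*-\nprint("\u00D8resund og Sj\u00E6lland")\n'


def decode_like_python(raw: bytes) -> str:
    """Decode source bytes using the encoding declared in the file (hypothetical helper)."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(raw).readline)
    return raw.decode(encoding)


# (declared encoding, actual encoding) pairs matching the three generated files
cases = {
    "prog-utf8.py": TEMPLATE.format("utf8").encode("utf8"),
    "prog-usascii.py": TEMPLATE.format("us-ascii").encode("utf8"),
    "prog-latin1.py": TEMPLATE.format("latin-1").encode("latin1"),
}

results = {}
for name, raw in cases.items():
    try:
        decode_like_python(raw)
        results[name] = "ok"
    except UnicodeDecodeError:
        # the declared encoding cannot represent the bytes that are actually in the file
        results[name] = "decode error"
    print(name, results[name])
```

Only the file whose declaration disagrees with its real encoding fails here, which matches what python3 reports for the three generated files.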


MichaReiser · 2023-08-23T06:26:00Z

Thanks for reporting this issue.

Yes, your observation is correct. Ruff currently has no support for coding pragma comments, mainly because converting between encodings is hard and coding comments don't seem that widespread anymore. Nonetheless, this is a Ruff limitation and we should either document it or fix it.

From an implementation standpoint, I would rather not add support for non-UTF-8 strings in the lexer/parser. We should instead normalize the source to UTF-8 as early as possible, ideally when or right after reading the file. The main challenge that I see comes with fixes: we would need to write the string back using the original encoding, so we would need to keep that information around.
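
The "normalize early, remember the original encoding" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Ruff is or would be implemented: `SourceFile`, `load_source`, and `dump_source` are hypothetical names, and `tokenize.detect_encoding` from the standard library stands in for whatever detection a real implementation would use.

```python
import io
import tokenize
from dataclasses import dataclass


@dataclass
class SourceFile:
    """Decoded source plus the encoding it came from (hypothetical type)."""

    text: str      # what the lexer/parser would see, independent of the on-disk encoding
    encoding: str  # kept around so that fixes can be written back in the original encoding


def load_source(raw: bytes) -> SourceFile:
    """Decode as early as possible, using the PEP 263 declaration if present."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(raw).readline)
    return SourceFile(text=raw.decode(encoding), encoding=encoding)


def dump_source(src: SourceFile) -> bytes:
    """Re-encode (possibly modified) text with the encoding the file was read with."""
    return src.text.encode(src.encoding)


raw = '# -*- coding: latin-1 -*-\nname = "Sj\u00E6lland"\n'.encode("latin-1")
src = load_source(raw)
src.text = src.text.replace("name", "island")  # stand-in for an autofix
fixed = dump_source(src)
```

One caveat this sketch does not handle: a fix could introduce characters that the original encoding cannot represent, in which case `dump_source` would raise `UnicodeEncodeError`.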

…gnarly to plumb this through everywhere, and given the lack of uproar over ruff's complete lack of handling of non-utf8 files (astral-sh/ruff#6791), we can just warn about it instead. The only potential negative effect is that the bytes on disk in the Python file for something like a LocalState that takes its key from the member name might not match exactly what's in the TEAL file (and thus on chain).
- refactor to slightly reduce mypy dependencies in ParseResult
- pull out component ordering and checking to the parse level
- construct read_source directly and add caching at the top level

encukou · 2024-11-21T10:23:30Z

I believe that linters in general should reject at least some source encodings.
See the (rather scary) example from the last section of PEP 672, which, as-is, fails ruff check with F401 `os` imported but unused.

I agree that support for encoding declarations doesn't need to be a priority, but if Ruff can't handle them, maybe there should be a linter rule for them? (For Ruff users, “you shouldn't do this if you want to use Ruff” is basically equivalent to “you shouldn't do this”, right?)
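
As a rough idea of what such a rule might check, here is a minimal sketch in Python. It is an assumption-laden illustration, not an existing Ruff rule: the function name `check_encoding_declaration` and the message wording are invented, and detection again relies on the standard library's `tokenize.detect_encoding`.

```python
import codecs
import io
import tokenize
from typing import Optional

# Codec names (as normalized by codecs.lookup) that a UTF-8-only tool can accept.
ACCEPTED_CODECS = {"utf-8", "utf-8-sig"}


def check_encoding_declaration(raw: bytes) -> Optional[str]:
    """Return a diagnostic message if the file declares a non-UTF-8 encoding, else None."""
    try:
        encoding, _ = tokenize.detect_encoding(io.BytesIO(raw).readline)
    except SyntaxError as exc:
        # raised for unknown encodings or malformed declarations
        return f"invalid encoding declaration: {exc}"
    if codecs.lookup(encoding).name not in ACCEPTED_CODECS:
        return f"source declares encoding '{encoding}', but only UTF-8 is supported"
    return None


print(check_encoding_declaration(b"# -*- coding: latin-1 -*-\nx = 1\n"))
print(check_encoding_declaration(b"# -*- coding: utf8 -*-\nx = 1\n"))
```

A check like this only looks at the declaration; it does not verify that the file's bytes are actually valid in the declared encoding.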

MichaReiser added bug needs-decision labels Aug 23, 2023

charliermarsh mentioned this issue Oct 1, 2023

Syntax error while parsing variables containing unicode? character #7731

Closed

MichaReiser mentioned this issue Jun 14, 2024

ruff detects E902 when using special character in code #11876

Closed

MichaReiser removed the bug label Nov 25, 2024

MichaReiser mentioned this issue Dec 2, 2024

UP009 fix changes a file from UTF-8 to a different declared encoding #14704

Closed
