
tokenize.generate_tokens() performance regression in 3.12 #119118

Closed
devdanzin opened this issue May 17, 2024 · 12 comments

@devdanzin
Contributor

devdanzin commented May 17, 2024

Bug report

Bug description:

There seems to be a significant performance regression in tokenize.generate_tokens() between 3.11 and 3.12 when tokenizing a (very) large dict on a single line. I searched the existing issues but couldn't find anything about this.

To reproduce, rename the file largedict.py.txt to largedict.py in the same directory as the script below, then run the script. That file comes from nedbat/coveragepy#1785.

import io, time, sys, tokenize
import largedict

text = largedict.d  # source text containing a very large single-line dict literal
readline = io.StringIO(text).readline
glob_start = start = time.time()
print(f"{sys.implementation.name} {sys.platform} {sys.version}")

# Report per-500-token timing and a running total every 5000 tokens.
for i, (ttype, ttext, (sline, scol), (_, ecol), _) in enumerate(tokenize.generate_tokens(readline)):
    if i % 500 == 0:
        print(i, ttype, ttext, sline, scol, time.time() - start)
        start = time.time()
    if i % 5000 == 0:
        print(time.time() - glob_start)

print(f"Time taken: {time.time() - glob_start}")

For Python 3.12, this results in:

cpython linux 3.12.3 (main, May 17 2024, 07:19:22) [GCC 11.4.0]
0 1 a_large_dict_literal 1 0 0.04641866683959961
0.046633005142211914
500 3 ':tombol_a_(golongan_darah):' 1 2675 9.689745903015137
1000 3 ':flagge_anguilla:' 1 5261 9.767053604125977
1500 3 ':флаг_Армения:' 1 7879 9.258271932601929
[...]

For Python 3.11, this results in:

cpython linux 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0]
0 1 a_large_dict_literal 1 0 0.013637304306030273
0.013663768768310547
500 3 ':tombol_a_(golongan_darah):' 1 2675 0.002939462661743164
1000 3 ':flagge_anguilla:' 1 5261 0.0028715133666992188
1500 3 ':флаг_Армения:' 1 7879 0.002806425094604492
[...]
352500 3 'pt' 1 2589077 0.003370046615600586
Time taken: 2.1244866847991943

That is, each batch of 500 tokens in Python 3.12 takes over 9 seconds to process, while in Python 3.11 all 352,500 tokens are processed in a bit over 2 seconds total.

I can reproduce this on Linux (WSL) and Windows. It also seems to affect 3.13.

CPython versions tested on:

3.9, 3.10, 3.11, 3.12

Operating systems tested on:

Linux, Windows


@devdanzin devdanzin added the type-bug An unexpected behavior, bug, or error label May 17, 2024
@AlexWaygood AlexWaygood added the performance Performance or resource usage label May 17, 2024
@erlend-aasland
Contributor

I suspect it could be related to the work done on PEP 701, which landed in 3.12.

@pablogsal
Member

Yeah, we are now using a completely different backend (the actual parser), so we need to investigate a bit what's going on here.

@devdanzin
Contributor Author

A data point that might help: when calling list(tokenize.generate_tokens(readline)) (or tuple()), memory blows up very fast with a large single-line dict, while the same dict broken into multiple lines doesn't cause memory issues.
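
One rough way to observe this (a sketch using tracemalloc and synthetic data, not the original largedict) is to compare peak allocations when materializing all tokens from a single-line dict versus a multi-line one:

import io, tokenize, tracemalloc

def peak_tokenize_memory(text):
    # Materialize every token and report the peak allocation while doing so.
    tracemalloc.start()
    list(tokenize.generate_tokens(io.StringIO(text).readline))
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

one_line = "d = {" + ", ".join(f"'k{i}': {i}" for i in range(50_000)) + "}\n"
many_lines = "d = {\n" + "".join(f"    'k{i}': {i},\n" for i in range(50_000)) + "}\n"
print("single line:", peak_tokenize_memory(one_line))
print("many lines: ", peak_tokenize_memory(many_lines))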

@pablogsal
Member

Oh wow, that's very intriguing. I may be able to investigate this next week after PyCon, but thanks for the hint!

@pablogsal
Member

pablogsal commented May 17, 2024

OK, I think I have a guess (I need time to test my theory), but I am 99% sure I know the problem. The problem is this line:

line = PyUnicode_DecodeUTF8(line_start, size, "replace");

We are creating a unique string per token containing the entire line. If the dict spans the entire line, this is a very expensive operation that we are repeating over and over with the same result. The fix is to keep reusing the same result until we move on to a new line.
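
Conceptually, the caching amounts to something like this Python sketch (illustrative only; the real change would live in the C tokenizer):

# Sketch of the caching idea: decode the raw line once and hand back the same
# string object for every token that sits on that line.
class CachedLine:
    def __init__(self):
        self._lineno = None
        self._line = None

    def get(self, raw_line: bytes, lineno: int) -> str:
        if lineno != self._lineno:
            # New physical line: decode once and remember the result.
            self._line = raw_line.decode("utf-8", errors="replace")
            self._lineno = lineno
        # Same line as before: reuse the previously decoded string.
        return self._line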

@erlend-aasland
Contributor

erlend-aasland commented May 17, 2024

Confirming that, Pablo; if I cache the line [1] on the tokenizer iter object instance, I no longer see the performance penalty. Forget that, it's not that simple.

Footnotes

  [1] the PyObject *line object

@lysnikolaou
Contributor

lysnikolaou commented May 17, 2024

Well, yeah, it's not that simple. Doing some scrappy caching seems to have brought down the performance penalty, but it's still much worse than 3.11.

cpython on  main +16 -1 [!] via C v15.0.0-clang via 🐍 v3.11.3 
❯ ./python.exe tmp/t.py
cpython darwin 3.14.0a0 (heads/main-dirty:65de194dd80, May 17 2024, 15:17:08) [Clang 15.0.0 (clang-1500.3.9.4)]
0 1 a_large_dict_literal 1 0 0.07759904861450195
0.07767128944396973
500 3 ':tombol_a_(golongan_darah):' 1 2675 0.08086895942687988
1000 3 ':flagge_anguilla:' 1 5261 0.10818815231323242
1500 3 ':флаг_Армения:' 1 7879 0.1338660717010498
2000 3 ':grupo_sanguíneo_b:' 1 10470 0.1593310832977295
2500 3 ':bayrak_belarus:' 1 13074 0.18597793579101562
3000 3 ':旗_ボスニア・ヘルツェゴビナ:' 1 15514 0.2100069522857666
3500 3 ':Bulgaria:' 1 18735 0.23893284797668457
4000 3 ':bandiera_camerun:' 1 21240 0.2700800895690918
4500 2 2 1 23919 0.2944760322570801
5000 3 ':bendera_pulau_natal:' 1 26767 0.32537102699279785
2.0853629112243652
5500 3 'es' 1 29643 0.3519558906555176
6000 3 'ar' 1 32240 0.38205909729003906
6500 3 'ko' 1 34917 0.411639928817749
7000 2 2 1 37580 0.45705175399780273
7500 3 'fa' 1 40214 0.4636220932006836
8000 55 , 1 43013 0.49083518981933594
8500 3 ':加纳:' 1 45615 0.5203256607055664
9000 3 ':bandera_guatemala:' 1 48073 0.5623579025268555
...

So we get about 0.1-0.5 seconds per 500 tokens, instead of the original ~9 seconds, but we're still a long way from 3.11 performance.

@lysnikolaou
Contributor

lysnikolaou commented May 17, 2024

Oh, found it. It's not just that line, but also the calls to _PyPegen_byte_offset_to_character_offset made a bit further below, which eventually call PyUnicode_* APIs as well. After additionally caching those, I get:

cpython darwin 3.14.0a0 (heads/main-dirty:65de194dd80, May 17 2024, 16:37:58) [Clang 15.0.0 (clang-1500.3.9.4)]
0 1 a_large_dict_literal 1 0 0.08269691467285156
0.08281183242797852
500 3 ':tombol_a_(golongan_darah):' 1 0 0.004801034927368164
1000 3 ':flagge_anguilla:' 1 0 0.003465890884399414
1500 3 ':флаг_Армения:' 1 0 0.004238128662109375
2000 3 ':grupo_sanguíneo_b:' 1 0 0.0038290023803710938
2500 3 ':bayrak_belarus:' 1 0 0.005483865737915039
3000 3 ':旗_ボスニア・ヘルツェゴビナ:' 1 0 0.005335807800292969
3500 3 ':Bulgaria:' 1 0 0.0034456253051757812
4000 3 ':bandiera_camerun:' 1 0 0.0051038265228271484
4500 2 2 1 0 0.003175020217895508
5000 3 ':bendera_pulau_natal:' 1 0 0.0032072067260742188
0.1256239414215088
5500 3 'es' 1 0 0.003175973892211914
[...]
351500 3 'es' 1 0 0.0032269954681396484
352000 3 'ko' 1 0 0.003362894058227539
352500 3 'pt' 1 0 0.003036022186279297
Time taken: 2.6852049827575684
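
The byte-offset-to-column-offset part can be sketched in Python as well (a simplified illustration of the "measure only the difference" idea, not the actual C implementation): instead of decoding the whole line prefix for every token, decode only the bytes added since the previous offset.

def make_byte_to_col(line_bytes: bytes):
    prev_byte = 0
    prev_col = 0

    def byte_to_col(byte_offset: int) -> int:
        nonlocal prev_byte, prev_col
        if byte_offset < prev_byte:
            # Offsets went backwards (e.g. a new line): start over.
            prev_byte, prev_col = 0, 0
        # Decode just the delta instead of the whole prefix up to byte_offset.
        prev_col += len(line_bytes[prev_byte:byte_offset].decode("utf-8", errors="replace"))
        prev_byte = byte_offset
        return prev_col

    return byte_to_col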

I'll push a PR shortly.

@nedbat
Member

nedbat commented May 25, 2024

@lysnikolaou any news on this? We have other people reporting a problem like this in 3.12.

@lysnikolaou
Contributor

Thanks for the reminder @nedbat! Turns out that this was even trickier than I thought, and the fix I showed in #119118 (comment) was buggy. My patch is almost there, but I haven't had the time to fix some final details. I'll have a PR up by early next week.

@pablogsal
Member

Tell me if you want me to help, or I can pick it up if you don't have time, @lysnikolaou.

lysnikolaou added a commit to lysnikolaou/cpython that referenced this issue May 27, 2024
- Cache line object to avoid creating a Unicode object
  for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the
  smallest buffer possible to measure the difference.
pablogsal added a commit that referenced this issue May 28, 2024
* gh-119118: Fix performance regression in tokenize module

- Cache line object to avoid creating a Unicode object
  for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the
  smallest buffer possible to measure the difference.

Co-authored-by: Pablo Galindo <[email protected]>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue May 28, 2024
…nGH-119615)

* pythongh-119118: Fix performance regression in tokenize module

- Cache line object to avoid creating a Unicode object
  for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the
  smallest buffer possible to measure the difference.

(cherry picked from commit d87b015)

Co-authored-by: Lysandros Nikolaou <[email protected]>
Co-authored-by: Pablo Galindo <[email protected]>
lysnikolaou added a commit that referenced this issue May 28, 2024
…19615) (#119682)

- Cache line object to avoid creating a Unicode object
  for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the
  smallest buffer possible to measure the difference.

(cherry picked from commit d87b015)

Co-authored-by: Lysandros Nikolaou <[email protected]>
Co-authored-by: Pablo Galindo <[email protected]>
lysnikolaou added a commit that referenced this issue May 28, 2024
…19615) (#119683)

- Cache line object to avoid creating a Unicode object
  for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the
  smallest buffer possible to measure the difference.

(cherry picked from commit d87b015)

Co-authored-by: Lysandros Nikolaou <[email protected]>
Co-authored-by: Pablo Galindo <[email protected]>
@lysnikolaou
Contributor

Fixed in #119615.

noahbkim pushed a commit to hudson-trading/cpython that referenced this issue Jul 11, 2024
…n#119615)

* pythongh-119118: Fix performance regression in tokenize module

- Cache line object to avoid creating a Unicode object
  for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the
  smallest buffer possible to measure the difference.

Co-authored-by: Pablo Galindo <[email protected]>
estyxx pushed a commit to estyxx/cpython that referenced this issue Jul 17, 2024
…n#119615)

* pythongh-119118: Fix performance regression in tokenize module

- Cache line object to avoid creating a Unicode object
  for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the
  smallest buffer possible to measure the difference.

Co-authored-by: Pablo Galindo <[email protected]>