
Major overhaul to support actual regex semantics #491

Merged
jeertmans merged 109 commits into maciejhirsz:master from robot-rover:master
Dec 6, 2025

Conversation

@robot-rover
Contributor

@robot-rover robot-rover commented Jul 26, 2025

Intro

Hi there. I realize this is a scary PR to see as it changes 50+ files. This is something I've been working on for a while now to address the issues Maciej mentioned in the 0.13 Release Notes/Future of Logos: namely, switching from the current tail-call implementation to a switch-within-a-loop implementation, and fixing the confusing #[regex] implementation that doesn't actually follow the semantics of regular expressions.

I'm posting this as a draft PR right now to see what the maintainers' attitude toward possibly merging these changes is. If you aren't interested, that's not a big deal; I'll just spin these off into a library fork (maybe I'll call it pathos :-P).

I wouldn't focus too much on a close review of the changes, as there are a couple of things that still need polishing. But please take a high-level look so you can give me your thoughts!

EDIT 8/3/25: Since I believe I've addressed all of my concerns except for performance regressions, I'm now posting this as a real PR. I believe it is ready for review. Any further changes will probably be localized to the logos-codegen/src/generator/mod.rs file.

EDIT 8/17/25: I don't believe there will be any further changes (save minor bugfixes).

Justification

There are two things I set out to fix with these changes.

  1. There are a lot of places where the actual semantics of #[regex] seem to be unspecified or even conflict with the documentation. I've seen a lot of GitHub issues involving this in the past, and I ran into it myself trying to implement a VHDL lexer with logos. (VHDL string literals escape quotes by having two double quotes in a row. This was impossible for me to implement with the latest logos because it required a fork with a "miss" if the next character was not also a double quote, and when I inspected the codegen it seemed to have generated code that always returns an Error. Let me know if you are interested and I can dig up the example, but I figured this PR was long enough already.)
  2. The tail call codegen of logos is currently unsafe because the become keyword isn't stabilized yet, which means there is no way to guarantee tail calls. While I haven't ever seen this happen on x86, it makes it hard to trust the library, especially on more niche architectures.
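To make the VHDL example concrete, here is a minimal hand-rolled sketch (not logos code) of the doubled-quote escaping rule. The function name and the suggested equivalent pattern `"([^"]|"")*"` are my own illustration, not anything from the PR:

```rust
// Hypothetical, minimal matcher for a VHDL-style string literal, where an
// embedded quote is written as two double quotes in a row ("" inside "...").
// Under real regex semantics this corresponds to a pattern like "([^"]|"")*".
fn match_vhdl_string(input: &str) -> Option<usize> {
    let bytes = input.as_bytes();
    if bytes.first() != Some(&b'"') {
        return None;
    }
    let mut i = 1;
    while i < bytes.len() {
        if bytes[i] == b'"' {
            // A doubled quote is an escaped quote; a lone quote ends the literal.
            if bytes.get(i + 1) == Some(&b'"') {
                i += 2;
            } else {
                return Some(i + 1);
            }
        } else {
            i += 1;
        }
    }
    None // unterminated literal
}
```

Note the "look at the next character" step after each quote: this is exactly the fork-with-a-miss that the old codegen could not express.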

As a bonus, it seemed like I could get rid of a large amount of code from the logos-codegen/graph folder by using more of the regex-automata API, which seemed like a win since that is where a lot of the bugs were coming from.

Overview of Internal Changes

The biggest change I made was tossing nearly all of the logos-codegen/graph folder and instead compiling all the regex HIR into a DFA using the regex-automata crate. The main benefit of this is that we now implement real regex semantics (and basically for free; I did barely any work to get this). I'd stress that we didn't lose any theoretical runtime here: DFAs match in linear time w.r.t. the input, and on pathological patterns we gained efficiency, since DFAs don't have any concept of a "miss" where they have to go back and try a different "path" through the pattern. This also fixes things such as regex repetitions being "too greedy", where they consume too much of the input and fail even when a match should have been found (see the loop_in_loop test and preceding comment for an example).

The other change I made was changing the codegen to emit a loop containing a match on a unit enum. This solves the issue of a possible stack overflow, but I believe it will have inferior performance until RFC #3720 lands. For more discussion, see the Future performance optimization section below.
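This is not the actual generated code, but a minimal sketch of the switch-in-a-loop shape for a toy pattern (a+b), showing why stack depth stays constant no matter how long the input is. The state names and pattern are invented for illustration:

```rust
// DFA states become a unit enum; one loop drives all transitions, so there
// are no (tail) calls and the stack depth is constant for any input length.
#[derive(Clone, Copy)]
enum State {
    S0,    // start: expect 'a'
    S1,    // saw one or more 'a's: expect 'a' or 'b'
    Match, // accepted
}

/// Recognizes the toy pattern `a+b` and returns the match length, if any.
fn run(input: &[u8]) -> Option<usize> {
    let mut state = State::S0;
    let mut i = 0;
    loop {
        match state {
            State::S0 => match input.get(i) {
                Some(b'a') => { state = State::S1; i += 1; }
                _ => return None,
            },
            State::S1 => match input.get(i) {
                Some(b'a') => { i += 1; }
                Some(b'b') => { state = State::Match; i += 1; }
                _ => return None,
            },
            State::Match => return Some(i),
        }
    }
}
```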

As a bonus, I've gotten rid of some unsafe code in the repo (and all unsafe calls in generated code). There were two major sources of it:

  1. Getting around the overhead of storing the token in the Lexer struct. I've avoided this completely by returning the Token without storing it in the Lexer first.
  2. Re-reading bytes that we've already read without bounds checks. Since the DFA never backtracks (there is no concept of a "miss"), there is never a case where we already know that more bytes are available to read, so we always do a bounds check.
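A sketch of the safe read style this implies, with every fetch going through a checked slice access instead of an unchecked one (the function is a hypothetical stand-in for what the generated code does per byte):

```rust
// Every byte fetch goes through slice::get, which bounds-checks and returns
// None at end of input, rather than unsafe get_unchecked on bytes we "know"
// we already read.
fn read_byte(source: &[u8], offset: usize) -> Option<u8> {
    source.get(offset).copied()
}
```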

EDIT 8/3/25: After further inspection, it seems there is still some unsafe code for slicing the source to return a slice to callbacks. I've left this as it is still useful.

API Changes

There are three significant user-facing changes that I made.

  1. Now that regex has the correct semantics, the behavior of #[regex] changes significantly. I haven't noticed any situations where it no longer matches something that it used to, but I don't think that's impossible. The most noticeable change is that it now matches many cases it didn't previously. It also supports non-greedy repetitions, which are very important for having simpler patterns. Take, for example, Maciej's complaint that parsing a block comment needs to be done with #[regex(r"\/\*([^*]|\*[^/])*\*\/")] (gross). With the new behavior, you can write #[regex(r"\/\*.*?\*\/")], which seems much easier to parse to me.
  2. I added the requirement that you ask for non-utf8 (e.g. implementing Logos with source = [u8]) explicitly by using #[logos(utf8 = false)]. The old heuristic, where you only get it if there are byte string patterns, doesn't really make sense to me because you can easily match non-utf8 with string patterns (i.e. r"\xFF") and you can manually write out your utf8 bytes in a byte string. Additionally, I removed having a custom source implementation for now. If you ask for utf8, logos validates that your patterns can only match valid utf8 and works with strs; otherwise it does no validation and matches [u8]. I realize this is one of the more opinionated changes, and I'm open to discussing / walking this back if convinced. It shouldn't be tons of work to re-add support for a user-implemented Source in the future.
  3. ignore(ascii_case) was removed. This is more of a technical limitation: regex-syntax doesn't support this feature (it does support using the case-insensitive flag together with the disable-unicode flag, but logos supported matching utf8 characters while only being case insensitive in ascii). There might be a sneaky way to implement this, but I can't think of a great reason to want case insensitivity only in ascii while also matching valid utf8, so I didn't pursue this any further. Note that if end users really need this, they can be selective with their regex patterns to do so, i.e. #[regex(r"(?i-u:abc)Ë")], which matches any case of "abc" followed by only a capital E with an umlaut (Ë).
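For the block-comment example in item 1, the lazy repetition matches the shortest span from "/*" to the next "*/". Here is a stdlib sketch of the behavior that pattern specifies (assuming dot is allowed to match newlines, or single-line input; the function is illustrative, not logos API):

```rust
// What the lazy pattern /\*.*?\*/ asks for: the *shortest* span from an
// opening "/*" to the next "*/".
fn block_comment_len(input: &str) -> Option<usize> {
    if !input.starts_with("/*") {
        return None;
    }
    // find() returns the first occurrence, i.e. the lazy (shortest) match.
    input[2..].find("*/").map(|pos| 2 + pos + 2)
}
```

Note how two adjacent comments stop at the first "*/" rather than greedily swallowing both, which is exactly what the old greedy-only semantics made awkward.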

There are several smaller changes that I've added as well
1. I've removed the ability for the Logos enum to have type parameters. This was mostly because I didn't see much mention of this in the docs, I couldn't think of a great use case for it, and I didn't want to add the extra complexity to the new codegen. If there are users of this, I guess it could be worth the effort to reimplement. Fixed as of 8/3/25
2. Since the old Logos MIR differentiated between byte literals and utf8 literals, it assigned their priority based on the number of characters. The regex-syntax HIR only has byte literals (that may happen to be valid utf8). This means that literals containing non-ascii characters will have a slightly higher priority now. I think this is a small issue and not worth the extra complexity to fix.
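A small illustration of the priority shift in item 2, assuming (as described above) that a literal's priority is proportional to its length. The function names are invented for the comparison:

```rust
// The old MIR counted characters for utf8 literals; the regex-syntax HIR
// only sees bytes, so a non-ASCII literal's priority now reflects its UTF-8
// byte length. For ASCII the two agree.
fn old_priority(lit: &str) -> usize {
    lit.chars().count()
}

fn new_priority(lit: &str) -> usize {
    lit.len() // byte length
}
```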

Remaining Work

There are a couple of remaining todo items, such as improving some error messages and some general code cleanup. In addition, I need to flesh out the comments where they are pretty sparse. I'll be working on these things in the coming weeks. Fixed as of 8/3/25

Additionally, I'd like to do some benchmarking to evaluate how bad the performance regressions are. Also, I haven't tested it with older Rust versions, so I need to do that as well (I'm guessing that the 1.82 use<'a> change will break the codegen).

EDIT of 8/3/25: There are still performance regressions. I'll continue to work on these in the coming weeks. Additionally, I'd need to update the logos book. I'm holding off on that work until I get some clarity from a contributor on whether or not this PR is likely to be accepted.

EDIT of 8/17/25: The performance is almost comparable. The one thing I'm still missing is probably reading more than one character at a time when trying to match literals (see below), but I'm planning to fix that later by removing literals that also match a regex and picking them out via a perfect hash function. That is too big a change for this PR, though.

Future performance optimization

There are a couple places where there are some easy performance wins that could be added in the future. I focused on making the implementation correct first, with the intention to come back to these at some point in the future. In no particular order

  • Re-add the tail call codegen. While I haven't done any benchmarking yet, I'm fairly sure this created more efficient assembly in most cases, since it seems that Rust's state machine codegen is currently lacking (see earlier RFC link). While it won't be truly safe until the become keyword lands, it has been used successfully by logos in the past, and I think it would be a worthy opt-in feature. Done as of 8/3/25
  • regex-automata has the slightly unintuitive behavior of returning a match one character after the match has been found (for example, it will tell you that you have found a match for the pattern "abc" in the haystack "abcd" once you give it the character "d"). This is used to implement things such as $ and line endings. It means we are reading one character too many in nearly all cases (for a typical lexer). This could be fixed by examining the graph created from the DFA: if all edges leading out of a node lead to a match for the same pattern, we can get rid of those matches and make the original node an "early match" for the pattern. This is the sort of optimization you can't do for a runtime DFA, because the code for all states has to be homogeneous, but we can get away with it in a compile-time code generator. Done as of 8/17/25
  • In a similar vein, if the edges leading out of a node are all long "ropes" of states (according to some heuristic), we can read multiple bytes from the Source and use a rust match statement. We'd need to benchmark this, but I think it is similar to what logos used to do for multiple ropes. For example, looking for the tokens ["beef", "case", "boom"] gives you the state graph (below) and you could make this into a rust match on the next 4 bytes. As to what exactly the heuristic would be for this transformation, I haven't thought about it enough.
                    ┌─e─►3──e─►6──f──►match beef 
                    │                            
         ┌──b───►1──┤                            
         │          │                            
start  ──┤          └─o─►4──o─►7──m──►match boom 
         │                                       
         └──c───►2────a─►5──s─►8──e──►match case 
  • Similar to what logos currently does for forks, if the match statement is complex enough (there are enough disparate classes that it doesn't make sense to just do 1 or 2 range checks), we should be generating a lookup table. This one seems pretty obvious and easy to implement. Done as of 8/17/25
  • Similar to what logos currently does for loops, regex-automata has the concept of "accelerated states". This could be implemented with a call to a function in the memchr crate (which is how regex-automata does it internally). Done as of 8/17/25. (Note, I looked into implementing fast loops using the cmpistri instruction on x86 but it wasn't appreciably faster.)
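To make the multi-byte "rope" idea concrete, here is a minimal stdlib sketch of dispatching on four bytes at once for the ["beef", "case", "boom"] example from the state graph. The heuristic for when to apply this, and the function itself, are hypothetical:

```rust
// When every path out of a state is a fixed "rope" of the same length, read
// several bytes at once and decide with a single match instead of walking
// the graph byte by byte.
fn match_keyword(input: &[u8]) -> Option<&'static str> {
    // Grab exactly four bytes; shorter input cannot match any keyword.
    let four: &[u8; 4] = input.get(..4)?.try_into().ok()?;
    match four {
        b"beef" => Some("beef"),
        b"boom" => Some("boom"),
        b"case" => Some("case"),
        _ => None,
    }
}
```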

Conclusion

Thanks for reading my massive wall of PR description. I hope you'll consider working together to make this the future of Logos. I really tried to make my changes all in the interest of either code cleanup or what I interpreted as the intended direction of the library ala those 0.13 release notes. If you'd like to take the conversation off of github, I can also be reached over discord. Just let me know <3

@jeertmans
Collaborator

Hi @robot-rover! Many thanks for your PR! I am not sure when I will have time to review this, or if @maciejhirsz is available for this, but can you first make sure that there are no conflicts with the main branch, so tests can run? Thanks!

@robot-rover
Contributor Author

robot-rover commented Nov 25, 2025

Hi, guys, may I ask about what blocks the custom source type in 0.16?

Mostly limiting the scope of this PR. There isn't a technical reason it couldn't be added again.

@robot-rover
Contributor Author

I think @pinkforest's point about the guarantees on when things are Unicode or not is a good catch. It seems like it's not sufficiently handled by the book right now. Let's hold off on merging this until I get a chance to update the relevant book pages.

@robot-rover
Contributor Author

The book is now updated with more information on unicode behavior, I think this should be mergeable now!

@robot-rover
Contributor Author

@jeertmans you might need to acknowledge the regression in codspeed before this can merge.

@pinkforest
Contributor

By the way, did you touch on capture groups at all? It would be nice to have (foo) (bar) and be able to capture items into enums. That was an earlier regex limitation that makes life a bit harder: the choice is either to switch context between two matches (like between keys and values) or to match key=value pairs as one token and split them in a callback, whereas one token capturing the two would require no split operation in a callback.

@robot-rover
Contributor Author

robot-rover commented Nov 28, 2025

In general there is no way to recover capture groups when matching with a DFA. When your NFA has no epsilon moves, your DFA is what's known as one-pass (see https://docs.rs/regex-automata/latest/regex_automata/dfa/onepass/struct.DFA.html), and then you can do it, but this situation is rare: it only happens when it is unambiguous what the next step of a pattern is given the next byte (see the link for examples). In my opinion, adding this functionality to logos just for one-pass patterns isn't worth the effort, the extra complexity in the already complicated codegen logic, or the confusion to users about why only certain patterns can have capture groups. That's just my opinion though.
(@pinkforest )
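A minimal stdlib sketch of the split-in-a-callback workaround @pinkforest mentions: lex key=value as a single token, then split it after the fact instead of relying on capture groups (the helper name is hypothetical):

```rust
// Split a "key=value" token slice into its two halves, as a Logos-style
// callback would do after matching the whole pair as one token.
fn split_pair(token: &str) -> Option<(&str, &str)> {
    token.split_once('=')
}
```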

@jeertmans
Collaborator

@jeertmans you might need to acknowledge the regression in codspeed before this can merge.

Unfortunately, I don't have the permission to do that, but that will not block the PR.

@jeertmans
Collaborator

The book is now updated with more information on unicode behavior, I think this should be mergeable now!

Thanks, I'll build the book locally to check how it looks, and merge this PR in the upcoming days :-)

@jeertmans
Collaborator


Xref: robot-rover#8

@stefnotch

I tried out my logos-demo.zip again [...] I wonder if it'll become faster with rust-lang/rust-analyzer#21097 ?

I just tried it out with the latest nightly Rust toolchain; it's very satisfactory now.
(The proc macro server ships with the Rust toolchain, unlike most parts of Rust-Analyzer)

@robot-rover
Contributor Author

@jeertmans Merged!

Collaborator

@jeertmans jeertmans left a comment


Hi @robot-rover, thank you very much for addressing all my comments! I am ready to merge this. Let me know if you planned on committing any other changes, or if it can be merged.

@robot-rover
Contributor Author

@jeertmans ship it! 🚢

@jeertmans jeertmans merged commit 29d05e6 into maciejhirsz:master Dec 6, 2025
20 of 21 checks passed
@pinkforest
Contributor

pinkforest commented Dec 6, 2025

The docs are 404 https://logos.maciej.codes/

Looks like someone raised:

jeertmans added a commit to jeertmans/logos that referenced this pull request Dec 7, 2025
HTML build artifacts are now inside a nested folder, probably because we updated mdBook or because `linkcheck2` creates other build artifacts, see maciejhirsz#491.

Closes maciejhirsz#516
@stefnotch

Now that this is merged, I suppose quite a lot of issues can be closed? Like #187

(Robot rover commented on so many issues that I have to press "Load more..." on GitHub to see them all. That is so impressive)

@JohanVonElectrum

@stefnotch I've tested mine (#505) and it's fixed.

@jeertmans
Collaborator

I would be happy to accept PRs adding tests that prove that other issues are solved by #491, and thus close those issues accordingly.

@stefnotch

stefnotch commented Dec 7, 2025

It turns out that @robot-rover is ahead of the game and already included a test for #187 in this PR. I guess you'll just need to go through the file and close the linked issues?

// https://github.com/maciejhirsz/logos/issues/187
mod issue_187 {

@jeertmans
Collaborator

Oh good catch, maybe linking this issue got lost in the long list of issues :)

@stefnotch

stefnotch commented Dec 7, 2025

Yeah, it likely never got added to the long list of linked issues. In fact, if you go through the file from top to bottom, almost every issue there still needs to be closed. Like #160

On the topic of housekeeping, the T-backtracking label can be sunset, as every issue related to it is fixed by this PR https://github.com/maciejhirsz/logos/issues?q=state%3Aopen%20label%3A%22T-backtracking%22 🚀

(Here's every issue that robot-rover commented on https://github.com/search?q=repo%3Amaciejhirsz%2Flogos+is%3Aissue+commenter%3Arobot-rover+is%3Aopen&type=issues )

@pinkforest
Contributor

pinkforest commented Jan 30, 2026

I'm getting a ~15% perf increase parsing HTTP headers in 0.16.

0.16

(Criterion: lower bound / estimate / upper bound)
h11types parse headers - three headers   53.210 ns   53.320 ns   53.447 ns
h11types parse headers - nine headers   120.08 ns   120.30 ns   120.53 ns

0.14

(Criterion: lower bound / estimate / upper bound)
h11types parse headers - three headers   65.509 ns   65.757 ns   66.001 ns
h11types parse headers - nine headers   168.54 ns   169.06 ns   169.75 ns

It's faster than or almost on par with httparse, but its use case is very different from mine, and depending on the situation httparse might sometimes be faster.

Chaoses-Ib added a commit to Chaoses-Ib/ib-matcher that referenced this pull request Feb 18, 2026

Labels

book: Related to the book
enhancement: New feature or request

Projects

None yet

9 participants