
Major overhaul to support actual regex semantics #491

Merged
jeertmans merged 109 commits into maciejhirsz:master from robot-rover:master
Dec 6, 2025

Conversation

@robot-rover
Contributor

@robot-rover robot-rover commented Jul 26, 2025

Intro

Hi there. I realize this is a scary PR to see as it changes 50+ files. This is something I've been working on for a while now to address the issues Maciej mentioned in the 0.13 Release Notes/Future of Logos: namely, switching from the current tail-call implementation to a switch-within-a-loop implementation, and fixing the confusing #[regex] implementation that doesn't actually follow the semantics of regular expressions.

I'm posting this as a draft PR right now to see what the maintainers' attitude toward possibly merging these changes is. If you aren't interested, that's not a big deal; I'll just spin these off into a library fork (maybe I'll call it pathos :-P).

I wouldn't focus too much on a close review of the changes, as there are a couple of things that still need polishing. But please take a high-level look so you can give me your thoughts!

EDIT 8/3/25: Since I believe I've addressed all of my concerns except for performance regressions, I'm now posting this as a real PR. I believe it is ready for review. Any further changes will probably be localized to the logos-codegen/src/generator/mod.rs file.

EDIT 8/17/25: I don't believe there will be any further changes (save minor bugfixes).

Justification

There are two things I set out to fix with these changes.

  1. There are a lot of places where the actual semantics of #[regex] seem to be unspecified or even conflict with the documentation. I've seen a lot of GitHub issues involving this in the past, and I ran into it myself trying to implement a VHDL lexer with logos. (VHDL string literals escape quotes by having two double quotes in a row. This was impossible for me to implement with the latest logos because it required a fork with a "miss" if the next character was not also a double quote, and when I inspected the codegen it seemed to have generated code that always returns an Error. Let me know if you are interested and I can dig up the example, but I figured this PR was long enough already.)
  2. The tail call codegen of logos is currently unsafe because the become keyword isn't stabilized yet, which means there is no way to guarantee tail calls. While I haven't ever seen this happen on x86, it makes it hard to trust the library, especially on more niche architectures.
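To make the VHDL example concrete, here is a minimal hand-rolled sketch (not logos code) of the doubled-quote escaping rule. The function name and the suggested equivalent pattern `"([^"]|"")*"` are my own illustration, not anything from the PR:

```rust
// Hypothetical, minimal matcher for a VHDL-style string literal, where an
// embedded quote is written as two double quotes in a row ("" inside "...").
// Under real regex semantics this corresponds to a pattern like "([^"]|"")*".
fn match_vhdl_string(input: &str) -> Option<usize> {
    let bytes = input.as_bytes();
    if bytes.first() != Some(&b'"') {
        return None;
    }
    let mut i = 1;
    while i < bytes.len() {
        if bytes[i] == b'"' {
            // A doubled quote is an escaped quote; a lone quote ends the literal.
            if bytes.get(i + 1) == Some(&b'"') {
                i += 2;
            } else {
                return Some(i + 1);
            }
        } else {
            i += 1;
        }
    }
    None // unterminated literal
}
```

Note the "look at the next character" step after each quote: this is exactly the fork-with-a-miss that the old codegen could not express.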

As a bonus, it seemed like I could get rid of a large amount of code from the logos-codegen/graph folder by using more of the regex-automata API, which seemed like a win since that is where a lot of the bugs were coming from.

Overview of Internal Changes

The biggest change I made was tossing nearly all of the logos-codegen/graph folder and instead compiling all the regex HIR into a DFA using the regex-automata crate. The main benefit of this is that we now implement real regex semantics (and basically for free; I did barely any work to get this). I'd stress that we didn't lose any theoretical runtime here: DFAs match in linear time w.r.t. the input, and on pathological patterns we gained efficiency, since DFAs don't have any concept of a "miss" where they have to go back and try a different "path" through the pattern. This also fixes things such as regex repetitions being "too greedy", where they consume too much of the input and fail even when a match should have been found (see the loop_in_loop test and preceding comment for an example).

The other change I made was changing the codegen to emit a loop containing a match on a unit enum. This solves the issue of a possible stack overflow, but I believe it will have inferior performance until RFC #3720 lands. For more discussion, see the Future performance optimization section below.
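This is not the actual generated code, but a minimal sketch of the switch-in-a-loop shape for a toy pattern (a+b), showing why stack depth stays constant no matter how long the input is. The state names and pattern are invented for illustration:

```rust
// DFA states become a unit enum; one loop drives all transitions, so there
// are no (tail) calls and the stack depth is constant for any input length.
#[derive(Clone, Copy)]
enum State {
    S0,    // start: expect 'a'
    S1,    // saw one or more 'a's: expect 'a' or 'b'
    Match, // accepted
}

/// Recognizes the toy pattern `a+b` and returns the match length, if any.
fn run(input: &[u8]) -> Option<usize> {
    let mut state = State::S0;
    let mut i = 0;
    loop {
        match state {
            State::S0 => match input.get(i) {
                Some(b'a') => { state = State::S1; i += 1; }
                _ => return None,
            },
            State::S1 => match input.get(i) {
                Some(b'a') => { i += 1; }
                Some(b'b') => { state = State::Match; i += 1; }
                _ => return None,
            },
            State::Match => return Some(i),
        }
    }
}
```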

As a bonus, I've gotten rid of some unsafe code in the repo (and all unsafe calls in generated code). There were two major sources of it:

  1. Getting around the overhead of storing the token in the Lexer struct. I've avoided this completely by returning the Token without storing it in the Lexer first.
  2. Re-reading bytes that we've already read without bounds checks. Since the DFA never backtracks (there is no concept of a "miss"), there is never a case where we already know that more bytes are available to read, so we always do a bounds check.
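A sketch of the safe read style this implies, with every fetch going through a checked slice access instead of an unchecked one (the function is a hypothetical stand-in for what the generated code does per byte):

```rust
// Every byte fetch goes through slice::get, which bounds-checks and returns
// None at end of input, rather than unsafe get_unchecked on bytes we "know"
// we already read.
fn read_byte(source: &[u8], offset: usize) -> Option<u8> {
    source.get(offset).copied()
}
```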

EDIT 8/3/25: After further inspection, it seems there is still some unsafe code for slicing the source to return a slice to callbacks. I've left this as it is still useful.

API Changes

There are three significant user-facing changes that I made.

  1. Now that regex has the correct semantics, the behavior of #[regex] changes significantly. I haven't noticed any situations where it no longer matches something that it used to, but I don't think that's impossible. The most noticeable change is that it now matches many cases it didn't previously. It also supports non-greedy repetitions, which are very important for having simpler patterns. Take, for example, Maciej's complaint that parsing a block comment needs to be done with #[regex(r"\/\*([^*]|\*[^/])*\*\/")] (gross). With the new behavior, you can write #[regex(r"\/\*.*?\*\/")], which seems much easier to parse to me.
  2. I added the requirement that you ask for non-utf8 (e.g. implementing Logos with source = [u8]) explicitly by using #[logos(utf8 = false)]. The old heuristic, where you only get it if there are byte string patterns, doesn't really make sense to me because you can easily match non-utf8 with string patterns (i.e. r"\xFF") and you can manually write out your utf8 bytes in a byte string. Additionally, I removed having a custom source implementation for now. If you ask for utf8, logos validates that your patterns can only match valid utf8 and works with strs; otherwise it does no validation and matches [u8]. I realize this is one of the more opinionated changes, and I'm open to discussing / walking this back if convinced. It shouldn't be tons of work to re-add support for a user-implemented Source in the future.
  3. ignore(ascii_case) was removed. This is more of a technical limitation: regex-syntax doesn't support this feature (it does support using the case-insensitive flag together with the disable-unicode flag, but logos supported matching utf8 characters while only being case insensitive in ascii). There might be a sneaky way to implement this, but I can't think of a great reason to want case insensitivity only in ascii while also matching valid utf8, so I didn't pursue this any further. Note that if end users really need this, they can be selective with their regex patterns to do so, i.e. #[regex(r"(?i-u:abc)Ë")], which matches any case of "abc" followed by only a capital E with an umlaut (Ë).
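For the block-comment example in item 1, the lazy repetition matches the shortest span from "/*" to the next "*/". Here is a stdlib sketch of the behavior that pattern specifies (assuming dot is allowed to match newlines, or single-line input; the function is illustrative, not logos API):

```rust
// What the lazy pattern /\*.*?\*/ asks for: the *shortest* span from an
// opening "/*" to the next "*/".
fn block_comment_len(input: &str) -> Option<usize> {
    if !input.starts_with("/*") {
        return None;
    }
    // find() returns the first occurrence, i.e. the lazy (shortest) match.
    input[2..].find("*/").map(|pos| 2 + pos + 2)
}
```

Note how two adjacent comments stop at the first "*/" rather than greedily swallowing both, which is exactly what the old greedy-only semantics made awkward.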

There are several smaller changes that I've added as well
1. I've removed the ability for the Logos enum to have type parameters. This was mostly because I didn't see much mention of this in the docs, I couldn't think of a great use case for it, and I didn't want to add the extra complexity to the new codegen. If there are users of this, I guess it could be worth the effort to reimplement. Fixed as of 8/3/25
2. Since the old Logos MIR differentiated between byte literals and utf8 literals, it assigned their priority based on the number of characters. The regex-syntax HIR only has byte literals (that may happen to be valid utf8). This means that literals containing non-ascii characters will have a slightly higher priority now. I think this is a small issue and not worth the extra complexity to fix.
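A small illustration of the priority shift in item 2, assuming (as described above) that a literal's priority is proportional to its length. The function names are invented for the comparison:

```rust
// The old MIR counted characters for utf8 literals; the regex-syntax HIR
// only sees bytes, so a non-ASCII literal's priority now reflects its UTF-8
// byte length. For ASCII the two agree.
fn old_priority(lit: &str) -> usize {
    lit.chars().count()
}

fn new_priority(lit: &str) -> usize {
    lit.len() // byte length
}
```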

Remaining Work

There are a couple of remaining todo items, such as improving some error messages and some general code cleanup. In addition, I need to flesh out the comments where they are pretty sparse. I'll be working on these things in the coming weeks. Fixed as of 8/3/25

Additionally, I'd like to do some benchmarking to evaluate how bad the performance regressions are. Also, I haven't tested it with older Rust versions, so I need to do that as well (I'm guessing that the 1.82 use<'a> change will break the codegen).

EDIT of 8/3/25: There are still performance regressions. I'll continue to work on these in the coming weeks. Additionally, I'd need to update the logos book. I'm holding off on that work until I get some clarity from a contributor on whether or not this PR is likely to be accepted.

EDIT of 8/17/25: The performance is almost comparable. The one thing I'm still missing is probably reading more than one character at a time when trying to match literals (see below), but I'm planning to fix that later by removing literals that also match a regex and picking them out via a perfect hash function. That is too big a change for this PR, though.

Future performance optimization

There are a couple places where there are some easy performance wins that could be added in the future. I focused on making the implementation correct first, with the intention to come back to these at some point in the future. In no particular order

  • Re-add the tail call codegen. While I haven't done any benchmarking yet, I'm fairly sure this created more efficient assembly in most cases, since it seems that Rust's state machine codegen is currently lacking (see earlier RFC link). While it won't be truly safe until the become keyword lands, it has been used successfully by logos in the past, and I think it would be a worthy opt-in feature. Done as of 8/3/25
  • regex-automata has the slightly unintuitive behavior of returning a match one character after the match has been found (for example, it will tell you that you have found a match for the pattern "abc" in the haystack "abcd" once you give it the character "d"). This is used to implement things such as $ and line endings. It means we are reading one character too many in nearly all cases (for a typical lexer). This could be fixed by examining the graph created from the DFA: if all edges leading out of a node lead to a match for the same pattern, we can get rid of those matches and make the original node an "early match" for the pattern. This is the sort of optimization you can't do for a runtime DFA, because the code for all states has to be homogeneous, but we can get away with it in a compile-time code generator. Done as of 8/17/25
  • In a similar vein, if the edges leading out of a node are all long "ropes" of states (according to some heuristic), we can read multiple bytes from the Source and use a rust match statement. We'd need to benchmark this, but I think it is similar to what logos used to do for multiple ropes. For example, looking for the tokens ["beef", "case", "boom"] gives you the state graph (below) and you could make this into a rust match on the next 4 bytes. As to what exactly the heuristic would be for this transformation, I haven't thought about it enough.
                    ┌─e─►3──e─►6──f──►match beef 
                    │                            
         ┌──b───►1──┤                            
         │          │                            
start  ──┤          └─o─►4──o─►7──m──►match boom 
         │                                       
         └──c───►2────a─►5──s─►8──e──►match case 
  • Similar to what logos currently does for forks, if the match statement is complex enough (there are enough disparate classes that it doesn't make sense to just do 1 or 2 range checks), we should be generating a lookup table. This one seems pretty obvious and easy to implement. Done as of 8/17/25
  • Similar to what logos currently does for loops, regex-automata has the concept of "accelerated states". This could be implemented with a call to a function in the memchr crate (which is how regex-automata does it internally). Done as of 8/17/25. (Note, I looked into implementing fast loops using the cmpistri instruction on x86 but it wasn't appreciably faster.)
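To make the multi-byte "rope" idea concrete, here is a minimal stdlib sketch of dispatching on four bytes at once for the ["beef", "case", "boom"] example from the state graph. The heuristic for when to apply this, and the function itself, are hypothetical:

```rust
// When every path out of a state is a fixed "rope" of the same length, read
// several bytes at once and decide with a single match instead of walking
// the graph byte by byte.
fn match_keyword(input: &[u8]) -> Option<&'static str> {
    // Grab exactly four bytes; shorter input cannot match any keyword.
    let four: &[u8; 4] = input.get(..4)?.try_into().ok()?;
    match four {
        b"beef" => Some("beef"),
        b"boom" => Some("boom"),
        b"case" => Some("case"),
        _ => None,
    }
}
```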

Conclusion

Thanks for reading my massive wall of PR description. I hope you'll consider working together to make this the future of Logos. I really tried to make my changes all in the interest of either code cleanup or what I interpreted as the intended direction of the library ala those 0.13 release notes. If you'd like to take the conversation off of github, I can also be reached over discord. Just let me know <3

@jeertmans
Collaborator

Hi @robot-rover! Many thanks for your PR! I am not sure when I will have time to review this, or if @maciejhirsz is available for this, but can you first make sure that there are no conflicts with the main branch, so tests can run? Thanks!

@robot-rover
Contributor Author

robot-rover commented Nov 25, 2025

Hi, guys, may I ask about what blocks the custom source type in 0.16?

Mostly limiting the scope of this PR. There isn't a technical reason it couldn't be added again.

@robot-rover
Contributor Author

I think @pinkforest's point about the guarantees on when things are Unicode or not is a good catch. It seems like it's not sufficiently handled by the book right now. Let's hold off on merging this until I get a chance to update the relevant book pages.

@robot-rover
Contributor Author

The book is now updated with more information on unicode behavior, I think this should be mergeable now!

@robot-rover
Contributor Author

@jeertmans you might need to acknowledge the regression in codspeed before this can merge.

@pinkforest
Contributor

By the way, did you touch on capture groups at all? It would be nice to have (foo) (bar) and be able to capture items into enums. That was an earlier regex limitation that makes life a bit harder: the choice is either to switch context between two matches (like between keys and values) or to match key=value pairs as one token and split them in a callback, whereas one token capturing the two would require no split operation in a callback.

@robot-rover
Contributor Author

robot-rover commented Nov 28, 2025

In general there is no way to recover capture groups when matching with a DFA. When your NFA has no epsilon moves, your DFA is what's known as one-pass (see https://docs.rs/regex-automata/latest/regex_automata/dfa/onepass/struct.DFA.html), and then you can do it, but this situation is rare: it only happens when it is unambiguous what the next step of a pattern is given the next byte (see the link for examples). In my opinion, adding this functionality to logos just for one-pass patterns isn't worth the effort, the extra complexity in the already complicated codegen logic, or the confusion to users about why only certain patterns can have capture groups. That's just my opinion though.
(@pinkforest )
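A minimal stdlib sketch of the split-in-a-callback workaround @pinkforest mentions: lex key=value as a single token, then split it after the fact instead of relying on capture groups (the helper name is hypothetical):

```rust
// Split a "key=value" token slice into its two halves, as a Logos-style
// callback would do after matching the whole pair as one token.
fn split_pair(token: &str) -> Option<(&str, &str)> {
    token.split_once('=')
}
```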

@jeertmans
Collaborator

@jeertmans you might need to acknowledge the regression in codspeed before this can merge.

Unfortunately, I don't have the permission to do that, but that will not block the PR.

@jeertmans
Collaborator

The book is now updated with more information on unicode behavior, I think this should be mergeable now!

Thanks, I'll build the book locally to check how it looks, and merge this PR in the upcoming days :-)

@jeertmans
Collaborator


Xref: robot-rover#8

@stefnotch

I tried out my logos-demo.zip again [...] I wonder if it'll become faster with rust-lang/rust-analyzer#21097 ?

I just tried it out with the latest nightly Rust toolchain; it's very satisfactory now.
(The proc macro server ships with the Rust toolchain, unlike most parts of Rust-Analyzer)

@robot-rover
Contributor Author

@jeertmans Merged!

Collaborator

@jeertmans jeertmans left a comment


Hi @robot-rover, thank you very much for addressing all my comments! I am ready to merge this. Let me know if you planned on committing any other changes, or if it can be merged.

@robot-rover
Contributor Author

@jeertmans ship it! 🚢

@jeertmans jeertmans merged commit 29d05e6 into maciejhirsz:master Dec 6, 2025
20 of 21 checks passed
@pinkforest
Contributor

pinkforest commented Dec 6, 2025

The docs are 404 https://logos.maciej.codes/

Looks like someone raised:

jeertmans added a commit to jeertmans/logos that referenced this pull request Dec 7, 2025
HTML build artifacts are now inside a nested folder, probably because we updated mdBook or because `linkcheck2` creates other build artifacts, see maciejhirsz#491.

Closes maciejhirsz#516
@stefnotch

Now that this is merged, I suppose quite a lot of issues can be closed? Like #187

(Robot rover commented on so many issues that I have to press "Load more..." on GitHub to see them all. That is so impressive)

@JohanVonElectrum

@stefnotch I've tested mine (#505) and it's fixed.

@jeertmans
Collaborator

I would be happy to accept PRs adding tests that prove that other issues are solved by #491, and thus close those issues accordingly.

@stefnotch

stefnotch commented Dec 7, 2025

It turns out that @robot-rover is ahead of the game and already included a test for #187 in this PR. I guess you'll just need to go through the file and close the linked issues?

// https://github.com/maciejhirsz/logos/issues/187
mod issue_187 {

@jeertmans
Collaborator

Oh good catch, maybe linking this issue got lost in the long list of issues :)

@stefnotch

stefnotch commented Dec 7, 2025

Yeah, it likely never got added to the long list of linked issues. In fact, if you go through the file from top to bottom, almost every issue there still needs to be closed. Like #160

On the topic of housekeeping, the T-backtracking label can be sunset, as every issue related to it is fixed by this PR https://github.com/maciejhirsz/logos/issues?q=state%3Aopen%20label%3A%22T-backtracking%22 🚀

(Here's every issue that robot-rover commented on https://github.com/search?q=repo%3Amaciejhirsz%2Flogos+is%3Aissue+commenter%3Arobot-rover+is%3Aopen&type=issues )

@pinkforest
Contributor

pinkforest commented Jan 30, 2026

I'm getting a ~15% perf increase parsing HTTP headers in 0.16.

0.16

(Criterion: lower bound / estimate / upper bound)
h11types parse headers - three headers   53.210 ns   53.320 ns   53.447 ns
h11types parse headers - nine headers   120.08 ns   120.30 ns   120.53 ns

0.14

(Criterion: lower bound / estimate / upper bound)
h11types parse headers - three headers   65.509 ns   65.757 ns   66.001 ns
h11types parse headers - nine headers   168.54 ns   169.06 ns   169.75 ns

It's faster than or almost on par with httparse, but its use case is very different from mine, and depending on the situation httparse might sometimes be faster.

Chaoses-Ib added a commit to Chaoses-Ib/ib-matcher that referenced this pull request Feb 18, 2026

Labels

book: Related to the book
enhancement: New feature or request

Projects

None yet

9 participants