Major overhaul to support actual regex semantics#491
Major overhaul to support actual regex semantics#491jeertmans merged 109 commits intomaciejhirsz:masterfrom
Conversation
|
Hi @robot-rover! Many thanks for you PR! I am not sure when I will have time to review this, or if @maciejhirsz is available for this, but can you first make sure that there are no conflict with the main branch, so tests can run? Thanks! |
Mostly limiting the scope of this PR. There isn't a technical reason it couldn't be added again. |
|
I think @pinkforest point about the guarantees on when things are Unicode or not is a good catch. It seems like its not sufficiently handled by the book right now. Let's hold off merging this until I get a chance to update the relevant book pages. |
|
The book is now updated with more information on unicode behavior, I think this should be mergable now! |
|
@jeertmans you might need to acknowledge the regression in codspeed before this can merge. |
|
btw did you touch on the capture groups at all, it would be nice to have (foo) (bar) and being able to capture items into enums ? that was earlier regex limitation that makes life a bit harder where the choice is either to switch context between two matches (like between keys and values) over matching key=value pairs through one token capturing the two requiring no split operation through callback either. |
|
In general there is no way to find capture groups when matching with a dfa. When your nfa has no epsilon moves, your dfa is what's known as one pass (see https://docs.rs/regex-automata/latest/regex_automata/dfa/onepass/struct.DFA.html), and then you can do it, but this situation is rare. That only happens when it is unambiguous what the next step of a pattern is given the next byte (see the link for examples). In my opinion adding this functionality to logos just for one pass patterns isn't worth the effort, extra complexity in the already complicated codegen logic and confusion to users on why only certain patterns can have capture groups. That's just my opinion though. |
Unfortunately, I don't have the permission to do that, but that will not block the PR. |
Thanks, I'll build the book locally to check how it looks, and merge this PR in the upcoming days :-) |
Xref: robot-rover#8 |
I just tried it out with the latest nightly Rust toolchain, it's very satisfactory now. |
|
@jeertmans Merged! |
jeertmans
left a comment
There was a problem hiding this comment.
Hi @robot-rover, thank you very much for addressing all my comments! I am ready to merge this. Let me know if you planned on committing any other changes, or if it can be merged.
|
@jeertmans ship it! 🚢 |
|
The docs are 404 https://logos.maciej.codes/ Looks like someone raised: |
HTML build artifacts are now inside a nested folder, probably because we updated mdBook or because `linkcheck2` creates other build artifacts, see maciejhirsz#491. Closes maciejhirsz#516
|
Now that this is merged, I suppose quite a lot of issues can be closed? Like #187 (Robot rover commented on so many issues that I have to press "Load more..." on GitHub to see them all. That is so impressive) |
|
@stefnotch I've tested mine (#505) and it's fixed. |
|
I would be happy to accept PR adding tests that prove that other issues are solved by #491, and thus close those issue accordingly. |
|
It turns out that @robot-rover is ahead of the game and already included a test for #187 in this PR. I guess you'll just need to go through the file and close the linked issues? logos/tests/tests/old_logos_bugs.rs Lines 172 to 173 in d62ecca |
|
Oh good catch, maybe linking this issue got lost in the long list of issues :) |
|
Yeah, it likely never got added to the long list of linked issues. In fact, if you go through the file from top to bottom, almost every issue there still needs to be closed. Like #160 On the topic of housekeeping, the (Here's every issue that robot-rover commented on https://github.com/search?q=repo%3Amaciejhirsz%2Flogos+is%3Aissue+commenter%3Arobot-rover+is%3Aopen&type=issues ) |
|
I'm getting ~15% perf increase parsing http headers in 0.16 0.16
0.14
It's faster / almost on par with httparse but it's use-case is very different to mine and depends on situation httparse might be faster sometimes. |
Intro
Hi there. I realize this is a scary PR to see as it changes 50+ files. This is something I've been working on for a while now to address the issues Maciej mentioned in the 0.13 Release Notes/Future of Logos. Namely, switching from the current tailcall implementation to a switch within a loop implementation and fixing the confusing
#[regex]implementation that doesn't actually follow the semantic of regular expressions.I'm posting this as a draft PR right now to see what the maintainer's attitude toward possibly merging these changes are. If you aren't interested, that is not a big deal, I'll just spin these off into a library fork (maybe I'll call it pathos :-P).
I wouldn't focus too much on close review of the changes, as there are a couple of things that still need polishing. But please take a high level look so you can give me your thoughts!EDIT 8/3/25: Since I believe I've addressed all of my concerns except for performance regressions, I'm now posting this as a real PR. I believe it is ready for review. Any further changes will probably be localized to the
logos-codegen/src/generator/mod.rsfile.EDIT 8/17/25: I don't believe there will be any further changes (save minor bugfixes).
Justification
There are two things I set out to fix with these changes.
#[regex]seem to be unspecified or even conflict with the documentation. I've seen a lot of github issues involving this in the past, and I ran into it myself trying to implement a VHDL lexer with logos (VHDL string literals escape quotes by having two double quotes in a row. This was impossible for me to implement with the latest logos because it required a fork with a "miss" if the next character was not also a double quote, and when I inspected the codegen it seems to have generated code that always returns an Error. (Let me know if you are interested in this and I can dig up the example, but I figured this PR was long enough already).becomekeyword isn't stabilized yet, which means there is no way to guarantee tail calls. While I haven't ever seen this happen on x86, it makes it hard to trust the library, especially on more niche architectures.As a bonus, it seemed like I could get rid of a large amount of code from the
logos-codegen/graphfolder by using more of theregex-automataAPI, which seemed like a win since that seemed to be where a lot of the bugs where coming from.Overview of Internal Changes
The biggest change I made was tossing nearly all of the
logos-codegen/graphfolder and instead compiling all the regex HIR into a DFA using theregex-automatacrate. The main benefit of this is that we now implement real regex semantics (and basically for free, I did barely any work to get this). I'd stress that we didn't lose any theoretical runtime here. DFAs match in linear time w.r.t. the input, and in pathological patterns we gained efficiency, since DFAs don't have any concept of a "miss" where they have to go back and try a different "path" through the pattern. This also fixes things such regex repetitions being "too greedy" where they consume too much of the input and don't match even when a match should have been found (see theloop_in_looptest and preceding comment for an example).The other change I made was changing the codegen to spit out a loop containing a match on a unit enum. This solves the issue of a possible stack overflow, but I believe it will have inferior performance until RFC #3720 lands. For more discussion on this, see _Future performance optimization.
As a bonus, I've gotten rid of some unsafe code in the repo (and all unsafe calls in generated code). There were two major sources of it
EDIT 8/3/25: After further inspection, it seems there is still some unsafe code for slicing the source to return a slice to callbacks. I've left this as it is still useful.
API Changes
There are two significant user-facing changes that I made.
#[regex]changes significantly. I haven't noticed any situations where it no longer matches something that it used to, but I don't think it's impossible. The most noticeable change is that it now matches many cases it didn't previously. It also supports non-greedy repetitions (which are very important for having simpler patterns. Take, for example, Maciej's complaint that parsing a block comment needs to be done with#[regex(r"\/\*([^*]|\*[^/])*\*\/")](gross). With the new behavior, you can write#[regex(r"\/\*.*?\*\/"), which seems much easier to parse to me.#[logos(utf8 = false)]. The old heuristic where you only do it if there are byte string patterns doesn't really make sense to me because you can easily match non-utf8 with string patterns (i.e. r"\xFF") and you can manually write out your utf8 bytes in a byte string. Additionally, I removed having a custom source implementation for now. If you ask for utf8, logos validates that your patterns can only match valid utf8 and works with strs, otherwise it does no validation and matches [u8]. I realize this is one of the more opinionated changes. I'm open to discuss / walk this back if convinced. It shouldn't be tons of work to re-add supporting user implemented Source in the future.ignore(ascii_case)was removed. This is more of a technical limitation.regex-syntaxdoesn't support this feature (they do support by using the case insensitive flag with the disable unicode flag, but logos supported matching utf8 characters while only being case insensitive in ascii. There might be a sneaky way to implement this, but I can't think of a great reason to want case insensitivity only in ascii while also matching valid utf8, so I didn't pursue this any further. Note that if end users really need this, they can be selective with their regex patterns to do so, i.e.#[regex(r"(?i-u:abc)Ë")]which matches any case of "abc" followed by only a capital E with an umlaut (Ë).There are several smaller changes that I've added as well
1. I've removed the ability for the Logos enum to have type parameters. This was mostly because I didn't see a great mention of this in the docs, I couldn't think of a great use case for it, and I didn't want to add the extra complexity to the new codegen. If there are users of this, I guess it could be worth the effort to reimplement.Fixed as of 8/3/252. Since the old Logos MIR differentiated between byte literals and utf8 literals, it assigned their priority based on the number of characters. The regex-syntax HIR only has byte literals (that may happen to be valid utf8). This means that literals containing non-ascii characters will have a slightly higher priority now. I think this is a small issue and not worth the extra complexity to fix.
Remaining Work
There are a couple of remaining todo items such as improving some error messages and some general code cleanup. In addition, I need to flesh out some commenting where they are pretty sparse. I'll be working on these things in the coming weeks.Fixed as of 8/3/25Additionally, I'd like to do some benchmarking to evaluate how bad the performance regressions are. Also, I haven't tested it with older rust versions, so I need to do that as well (I'm guessing that the 1.82
uses<'a>change will break the codegen).EDIT of 8/3/25: There are still performance regressions. I'll continue to work on these in the coming weeks. Additionally, I'd need to update the logos book. I'm holding off on that work until I get some clarity from a contributor on whether or not this PR is likely to be accepted.
EDIT of 8/17/25: The performance is almost comparable. The one thing I'm still missing is probably reading more than one character when trying to match literals (see below), but I'm planning to fix that later by removing literals that also match a regex and picking them out via a perfect hash function. But that is too big a change for this PR.
Future performance optimization
There are a couple places where there are some easy performance wins that could be added in the future. I focused on making the implementation correct first, with the intention to come back to these at some point in the future. In no particular order
Re-add the tail call codegen. While I haven't done any benchmarking yet, I'm fairly sure this created more efficient assembly in most cases, since it seems that Rust's state machine codegen is currently lacking (see earlier RFC link). While it won't be truly safe until theDone as of 8/3/25becomekeyword lands, it has been used successfully by logos in the past, and I think it would be a worthy opt-in feature.The regex-automata has the slightly unintuitive behavior of returning a match 1 character after the match has been found (for example, it will tell you you have found a match for the pattern "abc" in the haystack "abcd" once you give it the character "d". This is used to implement things such asDone as of 8/17/25$and line endings. This means we are reading one character too many in nearly all cases (for a typical lexer). This could be fixed by examining the graph created from the DFA. If all edges leading out of a node lead to a match for the same pattern, we can get rid of those matches and make the original node an "early match" for the pattern. This is the sort of optimization you can't do for a runtime DFA because the code for all states has to be homogeneous, but we can get away with it for a compile time codegenerator.Similar to what logos currently does for forks, if the match statement is complex enough (there are enough disparate classes that it doesn't make sense to just do 1 or 2 range checks, we should be generating a lookup table. This one seems pretty obvious and easy to implement.Done as of 8/17/25Similar to what logos currently does for loops, regex-automata has the concept of "accelerated states". This could be implemented with a call to a function in theDone as of 8/17/25. (Note, I looked into implementing fast loops using the cmpistri instruction on x86 but it wasn't appreciably faster.)memchrcrate (which is how regex-automata does it internally).Conclusion
Thanks for reading my massive wall of PR description. I hope you'll consider working together to make this the future of Logos. I really tried to make my changes all in the interest of either code cleanup or what I interpreted as the intended direction of the library ala those 0.13 release notes. If you'd like to take the conversation off of github, I can also be reached over discord. Just let me know <3