Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate better backtracking optimizations. #516

Open
bd82 opened this issue Nov 7, 2021 · 1 comment
Open

Investigate better backtracking optimizations. #516

bd82 opened this issue Nov 7, 2021 · 1 comment

Comments

@bd82
Copy link
Contributor

bd82 commented Nov 7, 2021

Background

The Java-Parser is implemented in a style as closely resembling the official specifications.
This is done to increase confidence in it's correctness and ease the maintainability/upgradability when new versions of the specs
are released.

So the specification for CompilationUnit looks like:
image

The Matching implementation in the parser would be very similar:

  • $.RULE("compilationUnit", () => {
    // custom optimized backtracking lookahead logic
    const isModule = $.BACKTRACK_LOOKAHEAD($.isModuleCompilationUnit);
    $.OR([
    {
    GATE: () => isModule === false,
    ALT: () => $.SUBRULE($.ordinaryCompilationUnit)
    },
    {
    ALT: () => $.SUBRULE($.modularCompilationUnit)
    }
    ]);
    // https://github.com/jhipster/prettier-java/pull/217
    $.CONSUME(EOF);
    });
    // https://docs.oracle.com/javase/specs/jls/se16/html/jls-7.html#jls-OrdinaryCompilationUnit
    $.RULE("ordinaryCompilationUnit", () => {
    $.OPTION({
    GATE: $.BACKTRACK($.packageDeclaration),
    DEF: () => {
    $.SUBRULE($.packageDeclaration);
    }
    });
    $.MANY(() => {
    $.SUBRULE3($.importDeclaration);
    });
    $.MANY2(() => {
    $.SUBRULE($.typeDeclaration);
    });
    });
    // https://docs.oracle.com/javase/specs/jls/se16/html/jls-7.html#jls-ModularCompilationUnit
    $.RULE("modularCompilationUnit", () => {
    $.MANY(() => {
    $.SUBRULE($.importDeclaration);
    });
    $.SUBRULE($.moduleDeclaration);
    });

The Problem and current mitigation

The parser toolkit (Chevrotain) cannot be used to implement the grammar directly due to different parsing algorithims (LL vs LR if I recall correctly). So instead the parser is using backtracking to figure out which alternative to pick in many "places" in the grammar.

Simple naïve Backtracking is very inefficient, so the parser has slightly more optimized backtracking in many rules
where we try to limit how far forward in the grammar we look instead of the most naïve backtracking approach in
which a backtracked rule is fully parsed twice

Recent Discoveries

It seems like a great percentage of the parsing time is spent going "back and forward" due to backtracking.

Open Questions / More info needed

How bad is the performance impact due to backtracking? in other words how many times do
we reparse the same input?

  • Given an input of N tokens, how many times does CONSUME gets called (successfully/unsuccessfully) ? (1.5N / 3N / 10N) ?
  • Do the same parsing rules gets called multiple times on the same idx in the token stream?
    • Could be counted by logging tracing / logging the data for all SUBRULE invocations and grouping the results, e.g for each SUBRULE called, collect and later group up:
      • occurrenceIdx
      • tokenVector idx
      • sublrule name
      • rule parameters (if any).

More possible optimizations

Once we have some answer to the questions above we can consider options for more optimizations.
Basically if in the extreme case if we end attempting to parse the same subsets of input in (mostly) identical manners many times
It may be more optimal to save and cache the intermediate results and re-use those instead of re-parsing again.
Basically a kind of memoization.

It is also possible the performance problem is in a specific sub-set of the grammar (expressions grammar is often a performance hog). so optimizations may not need be targeted at the whole grammar.

@bd82 bd82 changed the title Investigate better backtracking performance. Investigate better backtracking optimizations. Nov 7, 2021
@bd82
Copy link
Contributor Author

bd82 commented Nov 11, 2021

I've added the missing sections above.

The first actionable item
would be to try and get some actual data on how bad the backtracking is (x1.5 or x10 or somewhere in between...)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant