Fix improper query rewriting with multibyte characters #3

Merged
merged 1 commit into from
Jan 26, 2023

@markrui3 markrui3 commented Jan 26, 2023

Description

Babelfish does not correctly rewrite queries that contain multibyte characters.

Analysis

Babelfish preprocesses the query string and removes unsupported syntax before sending the query to the PG backend. The implementation did not account for multibyte Unicode characters, so when such characters appear ahead of the unsupported syntax, Babelfish emits a broken query. Specifically, a character offset is used instead of a byte offset during the replacement. For example:
Input T-SQL : select "你好世界" from tbl with(nolock);
Executed SQL: select "你好世界" f            (nolock);
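
The offset mismatch can be illustrated with a minimal Python sketch. It is not Babelfish code: the helper names (blank_out, to_byte_offset) are hypothetical, and it only blanks the 4 bytes of "with", so it does not reproduce the exact broken output above. It assumes the query is UTF-8 encoded, where each of the four Chinese characters takes 3 bytes, so "with" sits at character offset 23 but byte offset 31.

```python
# Illustrative only: shows why a character offset cannot be used to index
# a UTF-8 byte buffer. Helper names are hypothetical, not Babelfish APIs.
query = 'select "你好世界" from tbl with(nolock);'
raw = query.encode("utf-8")

char_offset = query.index("with")   # position counted in characters
byte_offset = raw.index(b"with")    # position counted in bytes


def blank_out(buf: bytes, start: int, length: int) -> bytes:
    """Overwrite `length` bytes starting at byte position `start` with spaces."""
    return buf[:start] + b" " * length + buf[start + length:]


def to_byte_offset(text: str, char_pos: int) -> int:
    """Convert a character offset into the matching UTF-8 byte offset."""
    return len(text[:char_pos].encode("utf-8"))


# Using the character offset as a byte offset hits "rom " instead of "with".
wrong = blank_out(raw, char_offset, len(b"with")).decode("utf-8")

# Using the byte offset removes the intended keyword.
right = blank_out(raw, byte_offset, len(b"with")).decode("utf-8")

print(char_offset, byte_offset)
print(wrong)
print(right)
```

With the character offset, the replacement lands 8 bytes too early (4 characters times 2 extra bytes each) and damages "from". Converting the offset first, as to_byte_offset does, keeps the edit on the intended keyword.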

Solution

Consolidate all rewriting behavior into PLtsql_expr_query_mutator. This is more maintainable because there is only one interface for query rewriting. With this patch, Chinese Unicode characters are supported in identifiers.
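
The idea of a single rewriting interface can be sketched in Python as follows. This is an assumption-laden illustration, not the actual PLtsql_expr_query_mutator: the class name QueryMutator and its add/run methods are hypothetical. Every edit is recorded as a byte offset plus the original text expected there, the original text is verified before the edit is accepted, and all edits are applied in one pass.

```python
# Hypothetical sketch of a single rewrite interface that works on byte
# offsets. It is NOT the real PLtsql_expr_query_mutator implementation.
class QueryMutator:
    def __init__(self, query: str):
        self._buf = query.encode("utf-8")
        self._edits = []  # list of (byte_start, byte_length, replacement)

    def add(self, byte_start: int, original: str, replacement: str) -> None:
        """Register an edit; reject it if `original` is not at `byte_start`."""
        orig = original.encode("utf-8")
        found = self._buf[byte_start:byte_start + len(orig)]
        if found != orig:
            raise ValueError(
                f"expected {orig!r} at byte {byte_start}, found {found!r}"
            )
        self._edits.append((byte_start, len(orig), replacement.encode("utf-8")))

    def run(self) -> str:
        """Apply all edits from right to left so earlier offsets stay valid."""
        out = self._buf
        for start, length, repl in sorted(
            self._edits, key=lambda e: e[0], reverse=True
        ):
            out = out[:start] + repl + out[start + length:]
        return out.decode("utf-8")


query = 'select "你好世界" from tbl with(nolock);'
mutator = QueryMutator(query)
mutator.add(query.encode("utf-8").index(b"with"), "with", "    ")
result = mutator.run()
print(result)
```

Checking the original text at registration time means a character offset passed by mistake fails loudly instead of silently corrupting the query.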

Check List

  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is under the terms of the Apache 2.0 and PostgreSQL licenses, and grant any person obtaining a copy of the contribution permission to relicense all or a portion of my contribution to the PostgreSQL License solely to contribute all or a portion of my contribution to the PostgreSQL open source project.

For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@markrui3 markrui3 merged commit 7c4cbad into BABEL_2_X_DEV Jan 26, 2023