Add SQLSRV_ENCODING_UTF8_VARCHAR for VARCHAR columns with UTF-8 collations#1593
Draft
…tions

Fixes #1587: Using UTF-8 encoding in PDO_SQLSRV results in NVARCHAR parameters instead of VARCHAR(_UTF8).

When using SQLSRV_ENCODING_UTF8 (the default), string parameters are bound as SQL_WVARCHAR (NVARCHAR) with UTF-16 conversion. This causes implicit type conversions against VARCHAR columns with _UTF8 collations, degrading index usage and query performance.

This change adds a new encoding constant SQLSRV_ENCODING_UTF8_VARCHAR (value 65002) that:
- Binds parameters as SQL_VARCHAR / SQL_C_CHAR (not NVARCHAR / WCHAR)
- Sends UTF-8 bytes directly without UTF-16 conversion
- SQL Server interprets data using the column's _UTF8 collation
- No implicit conversion, indexes remain usable

The new encoding can be set at connection, statement, or per-parameter level in both PDO_SQLSRV and SQLSRV extensions:
- PDO: PDO::SQLSRV_ENCODING_UTF8_VARCHAR
- SQLSRV: SQLSRV_ENC_UTF8_VARCHAR (CharacterSet: 'utf-8-varchar')

This is a non-breaking, additive change. Existing SQLSRV_ENCODING_UTF8 behavior is unchanged.

Files modified:
- core_sqlsrv.h: Added enum value
- core_stmt.cpp: SQL_VARCHAR/SQL_C_CHAR binding in derive_string_types_sizes, process_resource_param, process_output_string; skip UTF-16 conversion
- core_stream.cpp: SQL_C_CHAR stream read path
- core_util.cpp: Map 65002->CP_UTF8 in string conversion functions
- localizationimpl.cpp: iconv mapping for Linux/Mac
- pdo_init.cpp, pdo_dbh.cpp, pdo_stmt.cpp: PDO constant and validation
- sqlsrv/init.cpp, sqlsrv/stmt.cpp: SQLSRV constant and validation

Tests added:
- pdo_utf8_varchar_encoding.phpt: connection/statement/param level
- pdo_utf8_varchar_extra_coverage.phpt: bindColumn, emulate prepares, output params
- sqlsrv_utf8_varchar_encoding.phpt: connection level with roundtrip
- sqlsrv_utf8_varchar_extra_coverage.phpt: stream params, output params, fetch as stream
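The binding change described in this commit message can be sketched roughly as below. This is not the code from core_stmt.cpp: `Binding` and `choose_binding` are hypothetical names, and the ODBC type names are kept as plain strings so the sketch compiles without ODBC headers. Only the two encoding values and the type pairings come from the description above.

```cpp
#include <string>

// 65001 and 65002 are the values given in the PR description.
enum Encoding : unsigned {
    ENCODING_UTF8 = 65001,          // existing SQLSRV_ENCODING_UTF8
    ENCODING_UTF8_VARCHAR = 65002   // new SQLSRV_ENCODING_UTF8_VARCHAR
};

// Hypothetical summary of how a string parameter would be bound.
struct Binding {
    std::string sql_type;    // type declared to SQL Server
    std::string c_type;      // type of the client-side buffer
    bool convert_to_utf16;   // whether the PHP string is re-encoded first
};

// Pick the binding for a string parameter based on its encoding.
Binding choose_binding(Encoding enc) {
    if (enc == ENCODING_UTF8_VARCHAR) {
        // New path: UTF-8 bytes are passed through as a VARCHAR parameter.
        return {"SQL_VARCHAR", "SQL_C_CHAR", false};
    }
    // Existing path: converted to UTF-16 and bound as NVARCHAR.
    return {"SQL_WVARCHAR", "SQL_C_WCHAR", true};
}
```

Calling `choose_binding(ENCODING_UTF8_VARCHAR)` yields the VARCHAR pairing with no UTF-16 conversion, while `ENCODING_UTF8` keeps the existing NVARCHAR pairing, which is what makes the change additive.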
Codecov Report❌ Patch coverage is
Additional details and impacted files

Coverage Diff

| | dev | #1593 | +/- |
|---|---|---|---|
| Coverage | 85.75% | 85.78% | +0.02% |
| Files | 23 | 23 | |
| Lines | 7211 | 7231 | +20 |
| Hits | 6184 | 6203 | +19 |
| Misses | 1027 | 1028 | +1 |
SQL_C_CHAR binding sends data through the ODBC driver's ANSI codepage conversion. Characters outside the system codepage (CJK, Cyrillic, Arabic) get corrupted on systems where the ANSI codepage is not UTF-8. Replace multi_script test cases (日本語, русский, عربى) with extended Latin characters (Ñoño, café, résumé, naïve) that are representable in Latin1/CP1252 and work consistently across all platforms.
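One way to screen test strings for this constraint is sketched below. `fits_latin1_and_cp1252` is a hypothetical helper, not part of the driver or its test suite. It accepts a UTF-8 string only if every code point is ASCII or lies in U+00A0..U+00FF, the range where Latin-1 and CP1252 agree, and it assumes the caller passes well-formed UTF-8.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: true when every code point in `utf8` is ASCII or in
// U+00A0..U+00FF. Malformed or out-of-range input returns false.
bool fits_latin1_and_cp1252(const std::string& utf8) {
    for (std::size_t i = 0; i < utf8.size();) {
        const unsigned char b0 = static_cast<unsigned char>(utf8[i]);
        if (b0 < 0x80) {  // plain ASCII byte
            ++i;
            continue;
        }
        // Only two-byte sequences led by 0xC2 or 0xC3 encode U+0080..U+00FF.
        if ((b0 != 0xC2 && b0 != 0xC3) || i + 1 >= utf8.size()) {
            return false;
        }
        const unsigned char b1 = static_cast<unsigned char>(utf8[i + 1]);
        if ((b1 & 0xC0) != 0x80) {  // second byte must be a continuation byte
            return false;
        }
        const std::uint32_t cp = ((b0 & 0x1Fu) << 6) | (b1 & 0x3Fu);
        if (cp < 0xA0) {  // U+0080..U+009F differ between Latin-1 and CP1252
            return false;
        }
        i += 2;
    }
    return true;
}
```

Under this check, strings such as café or Ñoño pass, while 日本語 and русский are rejected. The check is deliberately conservative: characters that exist in CP1252 but not in Latin-1, such as the euro sign, are also rejected.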
Fixes #1587: Using UTF-8 encoding in PDO_SQLSRV results in NVARCHAR parameters instead of VARCHAR(_UTF8).
When using SQLSRV_ENCODING_UTF8 (the default), string parameters are bound as SQL_WVARCHAR (NVARCHAR) with UTF-16 conversion. This causes implicit type conversions against VARCHAR columns with _UTF8 collations, degrading index usage and query performance.
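This change adds a new encoding constant SQLSRV_ENCODING_UTF8_VARCHAR (value 65002) that:
- Binds parameters as SQL_VARCHAR / SQL_C_CHAR (not NVARCHAR / WCHAR)
- Sends UTF-8 bytes directly without UTF-16 conversion
- Lets SQL Server interpret the data using the column's _UTF8 collation
- Avoids implicit conversion, so indexes remain usable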
The new encoding can be set at connection, statement, or per-parameter level in both PDO_SQLSRV and SQLSRV extensions:
PDO: PDO::SQLSRV_ENCODING_UTF8_VARCHAR
SQLSRV: SQLSRV_ENC_UTF8_VARCHAR (CharacterSet: 'utf-8-varchar')
This is a non-breaking, additive change. Existing SQLSRV_ENCODING_UTF8 behavior is unchanged.
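The three levels at which the encoding can be set suggest a simple resolution rule, sketched below. The precedence shown (parameter over statement over connection) is an assumption for illustration and is not stated in the PR text. `ENCODING_INHERIT` and `effective_encoding` are hypothetical names.

```cpp
// 65001 and 65002 are the values from the PR description.
// ENCODING_INHERIT is a hypothetical "not set at this level" marker.
enum Encoding : unsigned {
    ENCODING_INHERIT = 0,
    ENCODING_UTF8 = 65001,
    ENCODING_UTF8_VARCHAR = 65002
};

// Assumed precedence: a per-parameter setting wins over the statement
// setting, which wins over the connection setting.
Encoding effective_encoding(Encoding connection,
                            Encoding statement,
                            Encoding parameter) {
    if (parameter != ENCODING_INHERIT) {
        return parameter;
    }
    if (statement != ENCODING_INHERIT) {
        return statement;
    }
    return connection;
}
```

Under this assumption, a connection left on the existing UTF-8 encoding could still opt a single parameter into the new VARCHAR binding without affecting any other query.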
Design
Architecture
Data Flow
Key Design Decisions
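- New constant value (65002): Chosen as `CP_UTF8 + 1` to keep it adjacent to `SQLSRV_ENCODING_UTF8` (65001). The new value sits well above the auto-enum range (0-3) and does not match an existing Windows codepage number, avoiding any collision.
- Same codepage for string conversion: Both `SQLSRV_ENCODING_UTF8` and `SQLSRV_ENCODING_UTF8_VARCHAR` use CP_UTF8 (65001) for `MultiByteToWideChar`/`WideCharToMultiByte` and iconv conversions. The difference is only in ODBC binding, not in how PHP strings are interpreted.
- Non-breaking: The existing `SQLSRV_ENCODING_UTF8` behavior is unchanged. The new encoding is opt-in only, requiring explicit use of the new constant.
- Emulate prepares limitation: With `PDO::ATTR_EMULATE_PREPARES`, string literals use `'...'` (not `N'...'`). Non-ASCII data requires the database default collation to be UTF-8 for correct interpretation. With native prepared statements (the default), this limitation does not apply.

Files modified:
- core_sqlsrv.h: Added enum value
- core_stmt.cpp: SQL_VARCHAR/SQL_C_CHAR binding in derive_string_types_sizes, process_resource_param, process_output_string; skip UTF-16 conversion
- core_stream.cpp: SQL_C_CHAR stream read path
- core_util.cpp: Map 65002->CP_UTF8 in string conversion functions
- localizationimpl.cpp: iconv mapping for Linux/Mac
- pdo_init.cpp, pdo_dbh.cpp, pdo_stmt.cpp: PDO constant and validation
- sqlsrv/init.cpp, sqlsrv/stmt.cpp: SQLSRV constant and validation

Tests added:
- pdo_utf8_varchar_encoding.phpt: connection/statement/param level
- pdo_utf8_varchar_extra_coverage.phpt: bindColumn, emulate prepares, output params
- sqlsrv_utf8_varchar_encoding.phpt: connection level with roundtrip
- sqlsrv_utf8_varchar_extra_coverage.phpt: stream params, output params, fetch as stream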
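Two of the design decisions above, the shared conversion codepage and the literal form used with emulated prepares, can be illustrated with the sketch below. `conversion_codepage` and `quote_literal` are hypothetical names and do not reflect the actual functions in core_util.cpp or the PDO layer. The `N'...'` prefix for the existing UTF-8 encoding is an assumption inferred from the PR's contrast between `'...'` and `N'...'`.

```cpp
#include <string>

constexpr unsigned CODEPAGE_UTF8 = 65001;                      // value of Windows CP_UTF8
constexpr unsigned ENCODING_UTF8 = CODEPAGE_UTF8;              // SQLSRV_ENCODING_UTF8
constexpr unsigned ENCODING_UTF8_VARCHAR = CODEPAGE_UTF8 + 1;  // new constant

static_assert(ENCODING_UTF8_VARCHAR == 65002, "value stated in the PR");

// Codepage handed to the string conversion routines. The new encoding is not
// a real codepage, so it maps back to UTF-8; other values pass through.
constexpr unsigned conversion_codepage(unsigned encoding) {
    return encoding == ENCODING_UTF8_VARCHAR ? CODEPAGE_UTF8 : encoding;
}

// Build a T-SQL string literal as emulated prepares would need to.
// Assumption: only the existing UTF-8 encoding emits the N prefix.
std::string quote_literal(const std::string& value, unsigned encoding) {
    std::string out = (encoding == ENCODING_UTF8) ? "N'" : "'";
    for (char c : value) {
        out += c;
        if (c == '\'') {
            out += '\'';  // T-SQL escapes a single quote by doubling it
        }
    }
    out += '\'';
    return out;
}
```

With the new encoding, `quote_literal("O'Brien", ENCODING_UTF8_VARCHAR)` produces `'O''Brien'`. Because that literal carries no `N` prefix, the server interprets its bytes under the database default collation, which is why the limitation on non-ASCII data applies only to emulated prepares.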