Add SQLSRV_ENCODING_UTF8_VARCHAR for VARCHAR columns with UTF-8 collations by jahnvi480 · Pull Request #1593 · microsoft/msphpsql

jahnvi480 · 2026-03-30T08:06:06Z

Fixes #1587: Using UTF-8 encoding in PDO_SQLSRV results in NVARCHAR parameters instead of VARCHAR(_UTF8).

When using SQLSRV_ENCODING_UTF8 (the default), string parameters are bound as SQL_WVARCHAR (NVARCHAR) with UTF-16 conversion. This causes implicit type conversions against VARCHAR columns with _UTF8 collations, degrading index usage and query performance.

This change adds a new encoding constant SQLSRV_ENCODING_UTF8_VARCHAR (value 65002) that:

Binds parameters as SQL_VARCHAR / SQL_C_CHAR (not NVARCHAR / WCHAR)
Sends UTF-8 bytes directly, without UTF-16 conversion
Lets SQL Server interpret the data using the column's _UTF8 collation
Avoids implicit conversion, so indexes remain usable

The new encoding can be set at connection, statement, or per-parameter level in both PDO_SQLSRV and SQLSRV extensions:

PDO: PDO::SQLSRV_ENCODING_UTF8_VARCHAR
SQLSRV: SQLSRV_ENC_UTF8_VARCHAR (CharacterSet: 'utf-8-varchar')

This is a non-breaking, additive change. Existing SQLSRV_ENCODING_UTF8 behavior is unchanged.
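
The binding choice described above can be sketched roughly as follows. This is a minimal illustration rather than the driver's actual code: the `Encoding` enum, the `BindTypes` struct, and `derive_bind_types` are hypothetical names, and the ODBC type codes are copied in as local constants so the snippet compiles without the ODBC headers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the driver's encoding enum. Only the numeric
// values 65001 and 65002 are taken from the PR description.
enum class Encoding : std::uint32_t {
    Utf8        = 65001,  // SQLSRV_ENCODING_UTF8
    Utf8Varchar = 65002,  // SQLSRV_ENCODING_UTF8_VARCHAR
};

// Local copies of the standard ODBC type codes (normally from <sql.h> / <sqlucode.h>).
constexpr int kSqlCChar    = 1;   // SQL_C_CHAR
constexpr int kSqlCWChar   = -8;  // SQL_C_WCHAR
constexpr int kSqlVarchar  = 12;  // SQL_VARCHAR
constexpr int kSqlWVarchar = -9;  // SQL_WVARCHAR

// Hypothetical result type: what would be handed to SQLBindParameter.
struct BindTypes {
    int  c_data_type;
    int  sql_data_type;
    bool convert_to_utf16;  // whether the PHP string is re-encoded first
};

// UTF8_VARCHAR keeps the UTF-8 bytes and binds as VARCHAR;
// plain UTF8 converts to UTF-16 and binds as NVARCHAR.
constexpr BindTypes derive_bind_types(Encoding enc) {
    return enc == Encoding::Utf8Varchar
        ? BindTypes{kSqlCChar, kSqlVarchar, false}
        : BindTypes{kSqlCWChar, kSqlWVarchar, true};
}

// Usage example: the two encodings differ only in the binding they select.
inline void usage_example() {
    assert(derive_bind_types(Encoding::Utf8).sql_data_type == kSqlWVarchar);
    assert(derive_bind_types(Encoding::Utf8Varchar).sql_data_type == kSqlVarchar);
}
```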

Design

Architecture

                           SQLSRV_ENCODING_UTF8 (65001)          SQLSRV_ENCODING_UTF8_VARCHAR (65002)
                           ─────────────────────────              ──────────────────────────────────────
PHP string (UTF-8 bytes)   → convert to UTF-16 (WCHAR)           → keep as UTF-8 bytes (no conversion)
SQL parameter type         → SQL_WVARCHAR (NVARCHAR)              → SQL_VARCHAR
C data type                → SQL_C_WCHAR                          → SQL_C_CHAR
ODBC wire format           → Unicode (UCS-2/UTF-16)               → Raw bytes (UTF-8)
SQL Server interpretation  → NVARCHAR semantics                   → Column collation (_UTF8)
PDO quote() output         → N'...' (national literal)            → '...' (char literal)
String conversion codepage → CP_UTF8 (65001)                      → CP_UTF8 (65001) — same
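
The "PDO quote() output" row can be illustrated with a small sketch. It assumes the usual T-SQL rule of doubling embedded single quotes; `quote_literal` and its `national` flag are hypothetical names for illustration, not functions from the driver.

```cpp
#include <cassert>
#include <string>

// Builds a T-SQL string literal from raw bytes.
// national == true  -> N'...' (as with SQLSRV_ENCODING_UTF8)
// national == false -> '...'  (as with SQLSRV_ENCODING_UTF8_VARCHAR)
std::string quote_literal(const std::string& value, bool national) {
    std::string out = national ? "N'" : "'";
    for (char ch : value) {
        out += ch;
        if (ch == '\'') {
            out += '\'';  // escape a single quote by doubling it
        }
    }
    out += '\'';
    return out;
}

// Usage example: same input, two literal forms.
inline void usage_example() {
    assert(quote_literal("O'Brien", true) == "N'O''Brien'");
    assert(quote_literal("O'Brien", false) == "'O''Brien'");
}
```

In this sketch the bytes of the value pass through untouched; only the prefix changes, which is why the char-literal form depends on the server interpreting those bytes as UTF-8.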

Data Flow

┌──────────────────────┐
│   PHP Application    │
│  $val = "Grüße 日本語" │  ← UTF-8 encoded string
└──────────┬───────────┘
           │
    ┌──────┴──────────────────────────────────┐
    │                                          │
    ▼ SQLSRV_ENCODING_UTF8                     ▼ SQLSRV_ENCODING_UTF8_VARCHAR
    │                                          │
    │ process_string_param()                   │ process_string_param()
    │ ├─ derive_string_types_sizes()           │ ├─ derive_string_types_sizes()
    │ │  sql_data_type = SQL_WVARCHAR          │ │  sql_data_type = SQL_VARCHAR ◄── KEY DIFF
    │ │  c_data_type   = SQL_C_WCHAR           │ │  c_data_type   = SQL_C_CHAR  ◄── KEY DIFF
    │ ├─ convert_input_str_to_utf16() ◄── YES  │ ├─ (encoding != CP_UTF8) ◄── SKIPPED
    │ │  "Grüße" → UTF-16LE bytes              │ │  buffer stays as UTF-8 bytes
    │ └─ SQLBindParameter(SQL_C_WCHAR,         │ └─ SQLBindParameter(SQL_C_CHAR,
    │       SQL_WVARCHAR, utf16_buf)            │       SQL_VARCHAR, utf8_buf)
    │                                          │
    ▼                                          ▼
┌──────────────────────┐              ┌──────────────────────┐
│   ODBC Driver        │              │   ODBC Driver        │
│ Sends as NVARCHAR    │              │ Sends as VARCHAR     │
│ (Unicode path)       │              │ (raw byte path)      │
└──────────┬───────────┘              └──────────┬───────────┘
           │                                     │
           ▼                                     ▼
┌──────────────────────────────────────────────────────────┐
│                      SQL Server                          │
│                                                          │
│  VARCHAR(255) COLLATE Latin1_General_100_CI_AS_SC_UTF8   │
│                                                          │
│  NVARCHAR param → implicit conversion → index scan ✗    │
│  VARCHAR param  → no conversion       → index seek ✓    │
└──────────────────────────────────────────────────────────┘
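
To make the left and right branches of the diagram concrete, the sketch below shows what the UTF-16 conversion step does to the buffer and what skipping it means. It is a simplified decoder for well-formed UTF-8 only, written for illustration; `utf8_to_utf16` and `bound_byte_length` are hypothetical helpers, not the driver's `convert_input_str_to_utf16()`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Decodes well-formed UTF-8 into UTF-16 code units (no validation of bad input).
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t len = 0;
        if (b0 < 0x80)             { cp = b0;        len = 1; }
        else if ((b0 >> 5) == 0x6) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 >> 4) == 0xE) { cp = b0 & 0x0F; len = 3; }
        else                       { cp = b0 & 0x07; len = 4; }
        // Fold in the continuation bytes (6 payload bits each).
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        i += len;
        if (cp >= 0x10000) {  // outside the BMP: emit a surrogate pair
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<char16_t>(cp);
        }
    }
    return out;
}

// Byte length of the buffer that would be bound for each path.
std::size_t bound_byte_length(const std::string& utf8, bool utf8_varchar) {
    return utf8_varchar ? utf8.size()  // raw UTF-8 bytes, no conversion
                        : utf8_to_utf16(utf8).size() * sizeof(char16_t);
}

// Usage example: "Grüße" is 7 bytes as UTF-8 and 10 bytes as UTF-16.
inline void usage_example() {
    const std::string gruesse = "Gr\xC3\xBC\xC3\x9F" "e";
    assert(bound_byte_length(gruesse, true) == 7);
    assert(bound_byte_length(gruesse, false) == 10);
}
```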

Key Design Decisions

New constant value (65002): Chosen as CP_UTF8 + 1 to keep it adjacent to SQLSRV_ENCODING_UTF8 (65001). The value is well above the auto-enum range (0-3) and is not itself used as a Windows codepage number, avoiding any collision.
Same codepage for string conversion: Both SQLSRV_ENCODING_UTF8 and SQLSRV_ENCODING_UTF8_VARCHAR use CP_UTF8 (65001) for MultiByteToWideChar/WideCharToMultiByte and iconv conversions. The difference is only in ODBC binding — not in how PHP strings are interpreted.
Non-breaking: The existing SQLSRV_ENCODING_UTF8 behavior is unchanged. The new encoding is opt-in only, requiring explicit use of the new constant.
Emulate prepares limitation: With PDO::ATTR_EMULATE_PREPARES enabled, string literals are quoted as '...' (not N'...'), so non-ASCII data is interpreted correctly only if the database default collation is a UTF-8 collation. With native prepared statements (the default), this limitation does not apply.
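
The "same codepage for string conversion" decision amounts to normalizing the new value before any conversion routine sees it. A minimal sketch, with hypothetical constant and function names (only the values 65001 and 65002 come from this description):

```cpp
#include <cassert>

constexpr unsigned kCpUtf8              = 65001;  // CP_UTF8 / SQLSRV_ENCODING_UTF8
constexpr unsigned kEncodingUtf8Varchar = 65002;  // SQLSRV_ENCODING_UTF8_VARCHAR

// Maps the driver-level encoding to the codepage used for actual string
// conversion. 65002 is a binding hint only, so it collapses to CP_UTF8;
// every other value passes through unchanged.
constexpr unsigned conversion_codepage(unsigned encoding) {
    return encoding == kEncodingUtf8Varchar ? kCpUtf8 : encoding;
}

// Usage example: both UTF-8 encodings convert with the same codepage.
inline void usage_example() {
    assert(conversion_codepage(kCpUtf8) == conversion_codepage(kEncodingUtf8Varchar));
}
```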

Files modified:

core_sqlsrv.h: Added enum value
core_stmt.cpp: SQL_VARCHAR/SQL_C_CHAR binding in derive_string_types_sizes, process_resource_param, process_output_string; skip UTF-16 conversion
core_stream.cpp: SQL_C_CHAR stream read path
core_util.cpp: Map 65002->CP_UTF8 in string conversion functions
localizationimpl.cpp: iconv mapping for Linux/Mac
pdo_init.cpp, pdo_dbh.cpp, pdo_stmt.cpp: PDO constant and validation
sqlsrv/init.cpp, sqlsrv/stmt.cpp: SQLSRV constant and validation

Tests added:

pdo_utf8_varchar_encoding.phpt: connection/statement/param level
pdo_utf8_varchar_extra_coverage.phpt: bindColumn, emulate prepares, output params
sqlsrv_utf8_varchar_encoding.phpt: connection level with roundtrip
sqlsrv_utf8_varchar_extra_coverage.phpt: stream params, output params, fetch as stream

codecov · 2026-03-30T08:36:31Z

Codecov Report

❌ Patch coverage is 96.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 85.78%. Comparing base (d1027a4) to head (4f1a18b).

Files with missing lines	Patch %	Lines
source/sqlsrv/init.cpp	80.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##              dev    #1593      +/-   ##
==========================================
+ Coverage   85.75%   85.78%   +0.02%     
==========================================
  Files          23       23              
  Lines        7211     7231      +20     
==========================================
+ Hits         6184     6203      +19     
- Misses       1027     1028       +1

Files with missing lines	Coverage Δ
source/pdo_sqlsrv/pdo_dbh.cpp	`91.75% <ø> (ø)`
source/pdo_sqlsrv/pdo_init.cpp	`89.70% <ø> (ø)`
source/pdo_sqlsrv/pdo_stmt.cpp	`81.68% <ø> (ø)`
source/shared/core_sqlsrv.h	`89.68% <ø> (ø)`
source/shared/core_stmt.cpp	`93.55% <100.00%> (+0.05%)`	⬆️
source/shared/core_stream.cpp	`86.45% <ø> (ø)`
source/shared/core_util.cpp	`89.65% <100.00%> (+0.10%)`	⬆️
source/sqlsrv/stmt.cpp	`88.31% <100.00%> (ø)`
source/sqlsrv/init.cpp	`91.24% <80.00%> (-0.17%)`	⬇️

SQL_C_CHAR binding sends data through the ODBC driver's ANSI codepage conversion. Characters outside the system codepage (CJK, Cyrillic, Arabic) get corrupted on systems where the ANSI codepage is not UTF-8. Replace multi_script test cases (日本語, русский, عربى) with extended Latin characters (Ñoño, café, résumé, naïve) that are representable in Latin1/CP1252 and work consistently across all platforms.
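
One way to tell which test strings are safe under this constraint is to check that every character falls in the Latin-1 range. The sketch below is an approximation under stated assumptions: it treats "representable" as U+0000 to U+00FF (ISO-8859-1), whereas CP1252 assigns different characters to 0x80-0x9F, and it expects well-formed UTF-8. `fits_latin1` is a hypothetical helper, not part of the test suite.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true when every code point of a well-formed UTF-8 string is
// below U+0100, i.e. encodable as a single Latin-1 byte.
bool fits_latin1(const std::string& utf8) {
    for (std::size_t i = 0; i < utf8.size();) {
        const unsigned char b0 = static_cast<unsigned char>(utf8[i]);
        if (b0 < 0x80) {
            i += 1;                        // ASCII
        } else if (b0 == 0xC2 || b0 == 0xC3) {
            i += 2;                        // U+0080..U+00FF take lead byte C2 or C3
        } else {
            return false;                  // anything else is above U+00FF
        }
    }
    return true;
}

// Usage example: extended Latin passes, CJK and Cyrillic do not.
inline void usage_example() {
    assert(fits_latin1("caf\xC3\xA9"));     // café
    assert(!fits_latin1("\xE6\x97\xA5"));   // 日
    assert(!fits_latin1("\xD1\x80"));       // Cyrillic р
}
```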

jahnvi480 requested a review from David-Engel March 30, 2026 08:06

jahnvi480 and others added 4 commits March 30, 2026 15:08

Fix Test 3: use local vars for bindParam refs, set consistent encoding for reads

1447036

Merge branch 'dev' into jahnvi/ghi_1587

348d749

Merge branch 'dev' into jahnvi/ghi_1587

4f1a18b

jahnvi480 marked this pull request as draft April 3, 2026 11:20

jahnvi480 wants to merge 5 commits into dev from jahnvi/ghi_1587
