Add SQLSRV_ENCODING_UTF8_VARCHAR for VARCHAR columns with UTF-8 collations #1593

Draft
jahnvi480 wants to merge 5 commits into dev from jahnvi/ghi_1587

@jahnvi480
Contributor

Fixes #1587: Using UTF-8 encoding in PDO_SQLSRV results in NVARCHAR parameters instead of VARCHAR(_UTF8).

When using SQLSRV_ENCODING_UTF8 (the default), string parameters are bound as SQL_WVARCHAR (NVARCHAR) with UTF-16 conversion. This causes implicit type conversions against VARCHAR columns with _UTF8 collations, degrading index usage and query performance.

This change adds a new encoding constant SQLSRV_ENCODING_UTF8_VARCHAR (value 65002) that:

  • Binds parameters as SQL_VARCHAR / SQL_C_CHAR (not NVARCHAR / WCHAR)
  • Sends UTF-8 bytes directly without UTF-16 conversion
  • Lets SQL Server interpret the data using the column's _UTF8 collation
  • Avoids implicit conversion, so indexes remain usable

The new encoding can be set at connection, statement, or per-parameter level in both PDO_SQLSRV and SQLSRV extensions:

PDO: PDO::SQLSRV_ENCODING_UTF8_VARCHAR
SQLSRV: SQLSRV_ENC_UTF8_VARCHAR (CharacterSet: 'utf-8-varchar')

This is a non-breaking, additive change. Existing SQLSRV_ENCODING_UTF8 behavior is unchanged.

Design

Architecture

                           SQLSRV_ENCODING_UTF8 (65001)          SQLSRV_ENCODING_UTF8_VARCHAR (65002)
                           ─────────────────────────              ──────────────────────────────────────
PHP string (UTF-8 bytes)   → convert to UTF-16 (WCHAR)           → keep as UTF-8 bytes (no conversion)
SQL parameter type         → SQL_WVARCHAR (NVARCHAR)              → SQL_VARCHAR
C data type                → SQL_C_WCHAR                          → SQL_C_CHAR
ODBC wire format           → Unicode (UCS-2/UTF-16)               → Raw bytes (UTF-8)
SQL Server interpretation  → NVARCHAR semantics                   → Column collation (_UTF8)
PDO quote() output         → N'...' (national literal)            → '...' (char literal)
String conversion codepage → CP_UTF8 (65001)                      → CP_UTF8 (65001) — same
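
The rows of this table can be read as a lookup from encoding to binding behavior. The following is a minimal sketch of that lookup, not the driver's code: `Encoding`, `SqlType`, `CType`, `BindingPlan`, `plan_for`, and `quote_literal` are hypothetical names introduced for illustration, and the enum members stand in for the real ODBC identifiers (SQL_VARCHAR, SQL_C_CHAR, and so on). The quoting helper assumes the usual SQL rule of doubling embedded single quotes.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for the driver's encoding constants and the ODBC
// type identifiers; the real definitions live in core_sqlsrv.h and sql.h.
enum class Encoding { Utf8 = 65001, Utf8Varchar = 65002 };
enum class SqlType { WVarchar, Varchar };
enum class CType { WChar, Char };

// One row set of the table: how a string parameter is bound and quoted.
struct BindingPlan {
    SqlType sql_type;            // SQL parameter type
    CType c_type;                // C data type of the bound buffer
    bool convert_to_utf16;       // whether the PHP string is converted first
    const char* literal_prefix;  // prefix used when emitting a quoted literal
};

constexpr BindingPlan plan_for(Encoding enc) {
    return enc == Encoding::Utf8Varchar
        ? BindingPlan{SqlType::Varchar, CType::Char, false, ""}
        : BindingPlan{SqlType::WVarchar, CType::WChar, true, "N"};
}

// Quote a value as a SQL string literal, doubling embedded single quotes.
inline std::string quote_literal(const std::string& value, Encoding enc) {
    std::string out = plan_for(enc).literal_prefix;
    out += '\'';
    for (char ch : value) {
        if (ch == '\'') out += '\'';  // escape by doubling
        out += ch;
    }
    out += '\'';
    return out;
}
```

Under these assumptions, `quote_literal("O'Brien", Encoding::Utf8)` yields `N'O''Brien'`, while the same call with `Encoding::Utf8Varchar` yields `'O''Brien'`.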

Data Flow

┌──────────────────────┐
│   PHP Application    │
│  $val = "Grüße 日本語" │  ← UTF-8 encoded string
└──────────┬───────────┘
           │
    ┌──────┴──────────────────────────────────┐
    │                                          │
    ▼ SQLSRV_ENCODING_UTF8                     ▼ SQLSRV_ENCODING_UTF8_VARCHAR
    │                                          │
    │ process_string_param()                   │ process_string_param()
    │ ├─ derive_string_types_sizes()           │ ├─ derive_string_types_sizes()
    │ │  sql_data_type = SQL_WVARCHAR          │ │  sql_data_type = SQL_VARCHAR ◄── KEY DIFF
    │ │  c_data_type   = SQL_C_WCHAR           │ │  c_data_type   = SQL_C_CHAR  ◄── KEY DIFF
    │ ├─ convert_input_str_to_utf16() ◄── YES  │ ├─ (encoding != CP_UTF8) ◄── SKIPPED
    │ │  "Grüße" → UTF-16LE bytes              │ │  buffer stays as UTF-8 bytes
    │ └─ SQLBindParameter(SQL_C_WCHAR,         │ └─ SQLBindParameter(SQL_C_CHAR,
    │       SQL_WVARCHAR, utf16_buf)            │       SQL_VARCHAR, utf8_buf)
    │                                          │
    ▼                                          ▼
┌──────────────────────┐              ┌──────────────────────┐
│   ODBC Driver        │              │   ODBC Driver        │
│ Sends as NVARCHAR    │              │ Sends as VARCHAR     │
│ (Unicode path)       │              │ (raw byte path)      │
└──────────┬───────────┘              └──────────┬───────────┘
           │                                     │
           ▼                                     ▼
┌──────────────────────────────────────────────────────────┐
│                      SQL Server                          │
│                                                          │
│  VARCHAR(255) COLLATE Latin1_General_100_CI_AS_SC_UTF8   │
│                                                          │
│  NVARCHAR param → implicit conversion → index scan ✗    │
│  VARCHAR param  → no conversion       → index seek ✓    │
└──────────────────────────────────────────────────────────┘
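
To make the "conversion skipped" branch concrete, here is a small self-contained sketch of what each path hands to the driver. It is illustrative only: `utf8_to_utf16` and `bound_bytes` are hypothetical helpers, not the driver's `convert_input_str_to_utf16()`, and the decoder assumes well-formed UTF-8 input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Decode well-formed UTF-8 into UTF-16 code units. Malformed sequences are
// not detected in this sketch.
inline std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 1;
        std::uint32_t cp = lead;
        if (lead >= 0xF0)      { len = 4; cp = lead & 0x07; }
        else if (lead >= 0xE0) { len = 3; cp = lead & 0x0F; }
        else if (lead >= 0xC0) { len = 2; cp = lead & 0x1F; }
        // Fold in the continuation bytes (6 payload bits each).
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        i += len;
        if (cp >= 0x10000) {  // outside the BMP: emit a surrogate pair
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<char16_t>(cp);
        }
    }
    return out;
}

// Size in bytes of the buffer each path would bind.
inline std::size_t bound_bytes(const std::string& utf8, bool convert_to_utf16) {
    return convert_to_utf16 ? utf8_to_utf16(utf8).size() * sizeof(char16_t)
                            : utf8.size();
}
```

For the sample string in the diagram, "Grüße 日本語", the unconverted UTF-8 buffer is 17 bytes and the UTF-16 buffer is 18 bytes (9 code units).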

Key Design Decisions

  1. New constant value (65002): Chosen as CP_UTF8 + 1 to keep it adjacent to SQLSRV_ENCODING_UTF8 (65001). The new value is well above the auto-enum range (0-3) and is not an assigned Windows code page identifier, which avoids collisions with either.

  2. Same codepage for string conversion: Both SQLSRV_ENCODING_UTF8 and SQLSRV_ENCODING_UTF8_VARCHAR use CP_UTF8 (65001) for MultiByteToWideChar/WideCharToMultiByte and iconv conversions. The difference is only in ODBC binding — not in how PHP strings are interpreted.

  3. Non-breaking: The existing SQLSRV_ENCODING_UTF8 behavior is unchanged. The new encoding is opt-in only, requiring explicit use of the new constant.

  4. Emulate prepares limitation: With PDO::ATTR_EMULATE_PREPARES enabled, the new encoding emits string literals as '...' rather than N'...'. Non-ASCII data is then interpreted correctly only if the database default collation is a UTF-8 collation. Native prepared statements (the default) are not affected.
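
Decisions 1 and 2 can be sketched together: the new value differs from CP_UTF8 only for binding purposes and collapses back to 65001 wherever a code page is needed for string conversion. The names below (`kUtf8`, `kUtf8Varchar`, `conversion_codepage`, `binds_as_wide`) are hypothetical and chosen for illustration; the actual enum in core_sqlsrv.h and the mapping in core_util.cpp may be structured differently.

```cpp
#include <cassert>

// Hypothetical mirror of the two encoding values described above.
enum : unsigned {
    kUtf8 = 65001,         // same number as the Windows CP_UTF8 code page
    kUtf8Varchar = 65002   // CP_UTF8 + 1, differs only in ODBC binding
};

static_assert(kUtf8Varchar == kUtf8 + 1, "new value stays adjacent to CP_UTF8");

// Code page to pass to string conversion routines: the VARCHAR variant is
// folded back to CP_UTF8, every other value passes through unchanged.
constexpr unsigned conversion_codepage(unsigned encoding) {
    return encoding == kUtf8Varchar ? kUtf8 : encoding;
}

// Only the original UTF-8 encoding takes the wide (UTF-16) binding path.
constexpr bool binds_as_wide(unsigned encoding) {
    return encoding == kUtf8;
}
```

Keeping the two questions in separate functions reflects the point of decision 2: how the PHP string is interpreted and how it is bound are independent choices.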

Files modified:

  • core_sqlsrv.h: Added enum value
  • core_stmt.cpp: SQL_VARCHAR/SQL_C_CHAR binding in derive_string_types_sizes, process_resource_param, process_output_string; skip UTF-16 conversion
  • core_stream.cpp: SQL_C_CHAR stream read path
  • core_util.cpp: Map 65002->CP_UTF8 in string conversion functions
  • localizationimpl.cpp: iconv mapping for Linux/Mac
  • pdo_init.cpp, pdo_dbh.cpp, pdo_stmt.cpp: PDO constant and validation
  • sqlsrv/init.cpp, sqlsrv/stmt.cpp: SQLSRV constant and validation

Tests added:

  • pdo_utf8_varchar_encoding.phpt: connection/statement/param level
  • pdo_utf8_varchar_extra_coverage.phpt: bindColumn, emulate prepares, output params
  • sqlsrv_utf8_varchar_encoding.phpt: connection level with roundtrip
  • sqlsrv_utf8_varchar_extra_coverage.phpt: stream params, output params, fetch as stream

@jahnvi480 jahnvi480 requested a review from David-Engel March 30, 2026 08:06
@codecov

codecov bot commented Mar 30, 2026

Codecov Report

❌ Patch coverage is 96.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 85.78%. Comparing base (d1027a4) to head (4f1a18b).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| source/sqlsrv/init.cpp | 80.00% | 1 Missing ⚠️ |

@@            Coverage Diff             @@
##              dev    #1593      +/-   ##
==========================================
+ Coverage   85.75%   85.78%   +0.02%     
==========================================
  Files          23       23              
  Lines        7211     7231      +20     
==========================================
+ Hits         6184     6203      +19     
- Misses       1027     1028       +1     
| Files with missing lines | Coverage Δ |
|---|---|
| source/pdo_sqlsrv/pdo_dbh.cpp | 91.75% <ø> (ø) |
| source/pdo_sqlsrv/pdo_init.cpp | 89.70% <ø> (ø) |
| source/pdo_sqlsrv/pdo_stmt.cpp | 81.68% <ø> (ø) |
| source/shared/core_sqlsrv.h | 89.68% <ø> (ø) |
| source/shared/core_stmt.cpp | 93.55% <100.00%> (+0.05%) ⬆️ |
| source/shared/core_stream.cpp | 86.45% <ø> (ø) |
| source/shared/core_util.cpp | 89.65% <100.00%> (+0.10%) ⬆️ |
| source/sqlsrv/stmt.cpp | 88.31% <100.00%> (ø) |
| source/sqlsrv/init.cpp | 91.24% <80.00%> (-0.17%) ⬇️ |

jahnvi480 and others added 4 commits March 30, 2026 15:08
SQL_C_CHAR binding sends data through the ODBC driver's ANSI codepage
conversion. Characters outside the system codepage (CJK, Cyrillic,
Arabic) get corrupted on systems where the ANSI codepage is not UTF-8.

Replace multi_script test cases (日本語, русский, عربى) with extended
Latin characters (Ñoño, café, résumé, naïve) that are representable
in Latin1/CP1252 and work consistently across all platforms.
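
One way to check that a test string stays within the range this commit message relies on is to verify that every code point is at most U+00FF. The helper below is a hypothetical sketch (`within_latin1` is not part of the driver or its test suite) and assumes well-formed UTF-8. Note that Latin-1 and CP1252 differ in the U+0080..U+009F range; the replacement words use only characters from U+00A0..U+00FF, where the two agree.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True when every code point in a well-formed UTF-8 string is at most U+00FF.
inline bool within_latin1(const std::string& utf8) {
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(utf8[i]);
        if (b < 0x80) continue;        // ASCII
        if (b == 0xC2 || b == 0xC3) {  // lead bytes covering U+0080..U+00FF
            ++i;                       // skip the single continuation byte
            continue;
        }
        return false;                  // any other lead byte is above U+00FF
    }
    return true;
}
```

With this check, "café" and "Ñoño" pass, while Cyrillic or CJK text is rejected.
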
@jahnvi480 jahnvi480 marked this pull request as draft April 3, 2026 11:20