Lexical analyzer for Microsoft SQL Server built with C and Flex. Reads SQL statements and breaks them down into classified tokens, reporting errors with line and column positions. Covers a comprehensive subset of T-SQL including keywords, operators, data types, built-in functions, joins, transaction control, and more.
The analyzer reads SQL input and passes it through a Flex-generated scanner (scanner.l) that matches patterns against a rule set. Each matched pattern produces a token with a type classification, the matched value, and its position in the source code (line and column). Tokens are stored in an array and printed grouped by category at the end of the analysis.
The program supports two input modes:
- Manual mode (
test()) — SQL query is hardcoded in the source for quick testing. - Interactive mode (
test_input()) — Paste SQL directly into the console. PressShift+Enterto execute the analysis,Escto exit.
| Category | Examples |
|---|---|
| Queries | SELECT, FROM, WHERE, ORDER BY, UNION, DISTINCT |
| Data Manipulation | INSERT, UPDATE, DELETE, MERGE, VALUES |
| Data Definition | CREATE, ALTER, DROP, TABLE, INDEX, CONSTRAINT |
| Flow Control | IF, ELSE, WHILE, CASE, WHEN, THEN, BEGIN, END |
| Transaction Control | COMMIT, ROLLBACK, SAVEPOINT |
| Data Control | GRANT, REVOKE, DENY |
| Logical Operators | AND, OR, NOT, IS, EXISTS |
| Operators | =, +, -, *, /, >=, <=, <>, !=, :: |
| Data Types | INT, VARCHAR, DECIMAL, DATETIME, FLOAT, XML |
| Functions | COUNT, SUM, AVG, GETDATE, CAST, CONVERT, SQRT |
| Joins | JOIN, INNER, LEFT, RIGHT, FULL, CROSS, ON |
| Identifiers | Table names, column names, @variables, #temp_tables, db.schema.table |
| Literals | String literals ('text'), integers, floats, scientific notation |
| Comments | Single-line (--) and multi-line (/* ... */) |
| Special Symbols | (, ), [, ], ,, ; |
- Unbalanced parentheses/brackets — Uses a stack-based algorithm to detect mismatches and reports the exact position of the problematic bracket.
- Invalid identifiers — Detects identifiers starting with digits (e.g.,
123abc). - Unclosed multi-line comments — Detects
/*without a matching*/. - Invalid float literals — Catches malformed numbers like
3..14. - Invalid scientific notation — Detects incomplete exponents like
1.5E+. - Unrecognized symbols — Any character not matching a rule is flagged with its position.
For valid input:
Keywords Data Definition: CREATE TABLE
Keywords Data Types: INT VARCHAR DECIMAL
Identifiers: movies imdbid title year budget revenue
Keywords Symbols: ( ) , ;
Integer Numbers: 255 5 1
No errors detected!
For input with errors:
Warning: Unbalanced Opening '(' Parentheses!
Error: at line 3, column 12
├── main.c # Entry point, input handling, code formatting
├── scanner.l # Flex rules and token classification logic
├── tokens.h # Token type enum and struct definitions
├── makefile # Build automation
└── script.py # Patches lex.yy.h/c for Windows (comments out unistd.h)
Flex generates #include <unistd.h> in its output files, which doesn't exist on Windows. This script comments out that include so the project compiles with MinGW/MSYS2 without errors. It only runs on Windows and is not needed on macOS/Linux.
- GCC (MinGW/MSYS2 on Windows, or native gcc on macOS/Linux)
- Flex
- Python 3 (only on Windows, for the
unistd.hpatch) - Make
make clean
makeOn Windows only (patch unistd.h issue):
make runscriptRun:
./program.exeManual build without Make:
flex --header-file=lex.yy.h scanner.l
python script.py # Windows only
gcc -o program lex.yy.c main.c
./program.exeNote: The interactive input mode (
test_input()) uses Windows Console API (ReadConsoleInput,SetConsoleMode) for raw keyboard handling. This mode will not work on macOS/Linux without modifications. The manual mode (test()) works on all platforms.
Originally developed on 5 October 2024.
