Skip to content

jaosber/sql-lexical-analyzer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SQL Lexical Analyzer

Screenshot

Lexical analyzer for Microsoft SQL Server built with C and Flex. Reads SQL statements and breaks them down into classified tokens, reporting errors with line and column positions. Covers a comprehensive subset of T-SQL including keywords, operators, data types, built-in functions, joins, transaction control, and more.

How It Works

The analyzer reads SQL input and passes it through a Flex-generated scanner (scanner.l) that matches patterns against a rule set. Each matched pattern produces a token with a type classification, the matched value, and its position in the source code (line and column). Tokens are stored in an array and printed grouped by category at the end of the analysis.

The program supports two input modes:

  • Manual mode (test()) — SQL query is hardcoded in the source for quick testing.
  • Interactive mode (test_input()) — Paste SQL directly into the console. Press Shift+Enter to execute the analysis, Esc to exit.

Token Categories

Category Examples
Queries SELECT, FROM, WHERE, ORDER BY, UNION, DISTINCT
Data Manipulation INSERT, UPDATE, DELETE, MERGE, VALUES
Data Definition CREATE, ALTER, DROP, TABLE, INDEX, CONSTRAINT
Flow Control IF, ELSE, WHILE, CASE, WHEN, THEN, BEGIN, END
Transaction Control COMMIT, ROLLBACK, SAVEPOINT
Data Control GRANT, REVOKE, DENY
Logical Operators AND, OR, NOT, IS, EXISTS
Operators =, +, -, *, /, >=, <=, <>, !=, ::
Data Types INT, VARCHAR, DECIMAL, DATETIME, FLOAT, XML
Functions COUNT, SUM, AVG, GETDATE, CAST, CONVERT, SQRT
Joins JOIN, INNER, LEFT, RIGHT, FULL, CROSS, ON
Identifiers Table names, column names, @variables, #temp_tables, db.schema.table
Literals String literals ('text'), integers, floats, scientific notation
Comments Single-line (--) and multi-line (/* ... */)
Special Symbols (, ), [, ], ,, ;

Error Detection

  • Unbalanced parentheses/brackets — Uses a stack-based algorithm to detect mismatches and reports the exact position of the problematic bracket.
  • Invalid identifiers — Detects identifiers starting with digits (e.g., 123abc).
  • Unclosed multi-line comments — Detects /* without a matching */.
  • Invalid float literals — Catches malformed numbers like 3..14.
  • Invalid scientific notation — Detects incomplete exponents like 1.5E+.
  • Unrecognized symbols — Any character not matching a rule is flagged with its position.

Example Output

For valid input:

Keywords Data Definition: CREATE TABLE
Keywords Data Types: INT VARCHAR DECIMAL
Identifiers: movies imdbid title year budget revenue
Keywords Symbols: ( ) , ;
Integer Numbers: 255 5 1

No errors detected!

For input with errors:

Warning: Unbalanced Opening '(' Parentheses!
Error: at line 3, column 12

Project Structure

├── main.c          # Entry point, input handling, code formatting
├── scanner.l       # Flex rules and token classification logic
├── tokens.h        # Token type enum and struct definitions
├── makefile         # Build automation
└── script.py       # Patches lex.yy.h/c for Windows (comments out unistd.h)

About script.py

Flex generates #include <unistd.h> in its output files, which doesn't exist on Windows. This script comments out that include so the project compiles with MinGW/MSYS2 without errors. It only runs on Windows and is not needed on macOS/Linux.

Requirements

  • GCC (MinGW/MSYS2 on Windows, or native gcc on macOS/Linux)
  • Flex
  • Python 3 (only on Windows, for the unistd.h patch)
  • Make

Build and Run

make clean
make

On Windows only (patch unistd.h issue):

make runscript

Run:

./program.exe

Manual build without Make:

flex --header-file=lex.yy.h scanner.l
python script.py          # Windows only
gcc -o program lex.yy.c main.c
./program.exe

Note: The interactive input mode (test_input()) uses Windows Console API (ReadConsoleInput, SetConsoleMode) for raw keyboard handling. This mode will not work on macOS/Linux without modifications. The manual mode (test()) works on all platforms.


Originally developed on 5 October 2024.

About

Lexical analyzer for Microsoft SQL Server built with C and Flex. Tokenizes SQL statements into categorized tokens (keywords, operators, data types, functions, identifiers, literals, comments) with error detection for unbalanced parentheses, invalid identifiers, and unclosed comments.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors