Why SQL Formatting Is Hard
SQL looks simple on the surface but is notoriously difficult to format well. Unlike JSON or YAML — which have rigid, unambiguous grammars — SQL has decades of dialects, vendor extensions, and underspecified behaviors. A naive formatter that works on standard SELECT statements will break on window functions, CTEs, lateral joins, or PostgreSQL-specific syntax.
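Regex-based approaches are a dead end. They can handle simple cases but collapse under real-world queries: SQL nests arbitrarily deep (subqueries, parenthesized expressions), and keywords can appear inside string literals, comments, and quoted identifiers, none of which a pattern match can track reliably. Building a proper SQL formatter requires a real parse step.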
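To make the failure mode concrete, here is a small sketch of a deliberately naive regex pass that uppercases keywords. The keyword list and the function name are illustrative assumptions, not part of any real formatter.

```python
import re

# Hypothetical keyword list for illustration only.
KEYWORDS = re.compile(r"\b(select|from|where|and|or)\b", re.IGNORECASE)


def naive_uppercase_keywords(sql: str) -> str:
    """Uppercase every keyword match, with no awareness of string literals."""
    return KEYWORDS.sub(lambda m: m.group(0).upper(), sql)


query = "select note from tickets where note = 'select one or more'"
print(naive_uppercase_keywords(query))
# SELECT note FROM tickets WHERE note = 'SELECT one OR more'
```

The regex rewrites the contents of the string literal, which changes what the query matches. Avoiding that requires knowing where literals begin and end, which is exactly the job of a lexer.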
The Three-Stage Pipeline
Our SQL formatter operates in three distinct stages: a lexer splits the query text into tokens, a parser builds an AST from those tokens, and a printer walks the AST to emit the formatted output.
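As a rough illustration of a staged design, the sketch below assumes the three stages are lexing, parsing, and printing. Everything in it (the token pattern, the `Clause` node, the function names) is hypothetical and far smaller than a real formatter: it only understands flat SELECT / FROM / WHERE clauses.

```python
import re
from dataclasses import dataclass, field

# Stage 1 (lexer): one alternative per token kind. String literals are
# matched first so keywords inside them are never seen as keywords.
TOKEN_RE = re.compile(r"""
    (?P<string>'(?:[^']|'')*')
  | (?P<word>[A-Za-z_][A-Za-z_0-9]*)
  | (?P<number>\d+(?:\.\d+)?)
  | (?P<symbol><=|>=|<>|!=|[-+*/=<>(),.;])
  | (?P<space>\s+)
""", re.VERBOSE)

CLAUSE_KEYWORDS = {"SELECT", "FROM", "WHERE"}


@dataclass
class Clause:
    """A toy AST node: a clause keyword plus the tokens that follow it."""
    keyword: str
    tokens: list = field(default_factory=list)


def tokenize(sql):
    tokens, pos = [], 0
    while pos < len(sql):
        m = TOKEN_RE.match(sql, pos)
        if m is None:
            raise ValueError(f"unexpected character {sql[pos]!r} at {pos}")
        if m.lastgroup != "space":
            tokens.append(m.group(0))
        pos = m.end()
    return tokens


def parse(tokens):
    """Stage 2 (parser): group tokens under the clause keyword they follow."""
    clauses = []
    for tok in tokens:
        if tok.upper() in CLAUSE_KEYWORDS:
            clauses.append(Clause(tok.upper()))
        elif clauses:
            clauses[-1].tokens.append(tok)
        else:
            raise ValueError(f"expected a clause keyword, got {tok!r}")
    return clauses


def join_tokens(tokens):
    """Join tokens with spaces, except around '.', '(' and before ',' or ')'."""
    out = ""
    for tok in tokens:
        if out and tok not in {",", ".", ")"} and not out.endswith((".", "(")):
            out += " "
        out += tok
    return out


def render(clauses):
    """Stage 3 (printer): one clause per line, keyword uppercased."""
    return "\n".join(f"{c.keyword} {join_tokens(c.tokens)}" for c in clauses)


def format_sql(sql):
    return render(parse(tokenize(sql)))


print(format_sql("select id,name from users where note = 'from here'"))
```

Because the lexer emits the string literal as a single token, the `from` inside `'from here'` never reaches the parser as a keyword. Each stage can also be tested and replaced on its own.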
Handling Dialects
SQL dialects differ in ways that matter for formatting. PostgreSQL uses :: for type casts. MySQL uses backtick-quoted identifiers. BigQuery supports QUALIFY and STRUCT. T-SQL uses TOP instead of LIMIT.
We handle this by parameterizing the lexer and parser with a dialect configuration. The dialect config specifies which tokens are valid keywords in that dialect, which operators are supported, and how identifier quoting works. This allows a single parse pipeline to handle MySQL, PostgreSQL, SQLite, BigQuery, and T-SQL without branching spaghetti in the core logic.
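One way to picture such a dialect configuration is the sketch below. The `Dialect` fields, the keyword sets, and the `tokenize` function are assumptions made for illustration (the actual config format is not shown here), and the keyword and operator lists are deliberately tiny.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dialect:
    """Hypothetical dialect config: keywords, operators, identifier quoting."""
    name: str
    keywords: frozenset
    operators: tuple          # ordered longest first so '::' wins over ':'
    identifier_quotes: dict   # opening quote character -> closing character


BASE_KEYWORDS = frozenset({"SELECT", "FROM", "WHERE", "LIMIT"})
COMPARISONS = ("<=", ">=", "<>", "=", "<", ">")

POSTGRES = Dialect("postgresql", BASE_KEYWORDS, ("::",) + COMPARISONS, {'"': '"'})
MYSQL = Dialect("mysql", BASE_KEYWORDS, COMPARISONS, {"`": "`"})
TSQL = Dialect("tsql", (BASE_KEYWORDS - {"LIMIT"}) | {"TOP"}, COMPARISONS,
               {"[": "]", '"': '"'})


def tokenize(sql, dialect):
    """Single lexer; all dialect differences come from the config object."""
    tokens, i = [], 0
    while i < len(sql):
        ch = sql[i]
        if ch.isspace():
            i += 1
        elif ch in dialect.identifier_quotes:
            end = sql.find(dialect.identifier_quotes[ch], i + 1)
            if end == -1:
                raise ValueError(f"{dialect.name}: unterminated identifier at {i}")
            tokens.append(("quoted_identifier", sql[i:end + 1]))
            i = end + 1
        elif ch.isalpha() or ch == "_":
            j = i
            while j < len(sql) and (sql[j].isalnum() or sql[j] == "_"):
                j += 1
            word = sql[i:j]
            kind = "keyword" if word.upper() in dialect.keywords else "identifier"
            tokens.append((kind, word))
            i = j
        elif ch.isdigit():
            j = i
            while j < len(sql) and sql[j].isdigit():
                j += 1
            tokens.append(("number", sql[i:j]))
            i = j
        else:
            op = next((o for o in dialect.operators if sql.startswith(o, i)), None)
            if op is None:
                raise ValueError(f"{dialect.name}: unexpected {ch!r} at {i}")
            tokens.append(("operator", op))
            i += len(op)
    return tokens


print(tokenize("price::int", POSTGRES))
print(tokenize("`order`", MYSQL))
print(tokenize("TOP", TSQL), tokenize("TOP", POSTGRES))
```

The same `tokenize` function accepts `price::int` for PostgreSQL, rejects it for MySQL, and classifies `TOP` as a keyword only under T-SQL. The core loop contains no dialect-specific branches; supporting another dialect means adding another config value.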
Edge Cases We Had to Get Right
Error Recovery
Real-world SQL is often syntactically invalid: queries still under development, queries extracted from logs, partial fragments. A formatter that refuses to touch invalid SQL is frustrating to use.
Our parser implements error recovery: when it encounters an unexpected token, it logs the error, skips tokens until it finds a safe resynchronization point (typically a statement boundary or a known clause keyword), and continues parsing. The resulting AST may be incomplete, but the formatter can still produce useful output for the portions it understood.
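The recovery strategy described above is often called panic-mode recovery. Here is a minimal sketch of it over a pre-tokenized input; the `ParseResult` type, the choice of synchronization tokens, and the toy clause grammar are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

CLAUSE_KEYWORDS = {"SELECT", "FROM", "WHERE"}
# Safe points at which parsing can resume: clause keywords and statement ends.
SYNC_TOKENS = CLAUSE_KEYWORDS | {";"}


@dataclass
class ParseResult:
    clauses: list = field(default_factory=list)  # (keyword, body tokens) pairs
    errors: list = field(default_factory=list)   # human-readable messages


def parse_with_recovery(tokens):
    result = ParseResult()
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == ";":
            i += 1  # statement boundary, nothing to record
        elif tok.upper() in CLAUSE_KEYWORDS:
            # Normal path: the clause body runs up to the next sync token.
            j = i + 1
            while j < len(tokens) and tokens[j].upper() not in SYNC_TOKENS:
                j += 1
            result.clauses.append((tok.upper(), tokens[i + 1:j]))
            i = j
        else:
            # Recovery path: log the error, then skip ahead to a sync token.
            result.errors.append(f"unexpected token {tok!r} at index {i}")
            i += 1
            while i < len(tokens) and tokens[i].upper() not in SYNC_TOKENS:
                i += 1
    return result


# A fragment as it might be pulled from a log file, with stray tokens.
tokens = ["2024-01-01", "INFO", "SELECT", "id", "FROM", "users", ";",
          ")", "oops", "SELECT", "1"]
result = parse_with_recovery(tokens)
print(result.clauses)
print(result.errors)
```

The log prefix and the stray `)` each produce one recorded error, and all three well-formed clauses are still recovered, so a printer can format them and leave the rest untouched.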
Try the SQL Formatter
Paste any SQL query — standard or dialect-specific — and get clean, readable output.