How Our SQL Prettifier Works Under the Hood

Why SQL Formatting Is Hard

SQL looks simple on the surface but is notoriously difficult to format well. Unlike JSON or YAML, which have rigid, unambiguous grammars, SQL carries decades of accumulated dialects, vendor extensions, and underspecified behaviors. A naive formatter that works on standard SELECT statements will break on window functions, CTEs, lateral joins, or PostgreSQL-specific syntax.

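Regex-based approaches are a dead end. They handle simple cases but collapse under real-world queries, because a regular expression cannot track nested parentheses and subqueries, and cannot tell a keyword from the same word inside a string literal or comment. Building a proper SQL formatter requires a real parse step.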
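To make the failure mode concrete, here is a minimal Python sketch of a regex-only approach. The function name naive_format and the three-keyword pattern are hypothetical, written only for this illustration; they are not part of our formatter.

```python
import re

# Hypothetical regex-only "formatter": start a new line before each clause keyword.
CLAUSE = re.compile(r"\s+(FROM|WHERE|ORDER BY)\b", re.IGNORECASE)


def naive_format(sql: str) -> str:
    # Replace the whitespace before a clause keyword with a newline
    # and upper-case the keyword.
    return CLAUSE.sub(lambda m: "\n" + m.group(1).upper(), sql)


simple = "select id from users where active = 1"
tricky = "select 'sent from my phone' as note from messages"

print(naive_format(simple))
print("---")
print(naive_format(tricky))
```

The first query is split at FROM and WHERE as intended. The second has its string literal broken across two lines, because the pattern has no way of knowing that the first "from" sits inside quotes.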

The Three-Stage Pipeline

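Our SQL formatter operates in three distinct stages. As the sections below indicate, a dialect-aware lexer turns the query text into tokens, a parser builds an AST from those tokens, and the formatter prints its output from that AST.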
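A pipeline of this shape can be sketched in a few lines of Python, assuming the stages are lexing, parsing into a tree, and printing. The function names (tokenize, parse, render, format_sql), the Clause type, and the three-keyword grammar are all assumptions made for illustration; a real lexer, parser, and printer cover far more of the language.

```python
import re
from dataclasses import dataclass

KEYWORDS = {"SELECT", "FROM", "WHERE"}  # deliberately tiny, for illustration

# Stage 1: lexing. String literals are tried first so their contents stay opaque.
TOKEN_RE = re.compile(r"\s*('(?:[^']|'')*'|\w+|[^\s\w])")


def tokenize(sql: str) -> list[str]:
    return TOKEN_RE.findall(sql)


# Stage 2: parsing. The "AST" here is just a flat list of clauses.
@dataclass
class Clause:
    keyword: str
    body: list[str]


def parse(tokens: list[str]) -> list[Clause]:
    clauses: list[Clause] = []
    for tok in tokens:
        if tok.upper() in KEYWORDS:
            clauses.append(Clause(tok.upper(), []))
        elif clauses:
            clauses[-1].body.append(tok)
        else:
            raise ValueError(f"expected a clause keyword, got {tok!r}")
    return clauses


# Stage 3: printing. Layout decisions live here and nowhere else.
def join_tokens(tokens: list[str]) -> str:
    out = ""
    for tok in tokens:
        # No space before a comma; otherwise separate tokens with one space.
        if out and tok != ",":
            out += " "
        out += tok
    return out


def render(clauses: list[Clause]) -> str:
    lines = []
    for clause in clauses:
        lines.append(clause.keyword)
        if clause.body:
            lines.append("    " + join_tokens(clause.body))
    return "\n".join(lines)


def format_sql(sql: str) -> str:
    return render(parse(tokenize(sql)))


print(format_sql("select id, name from users where note = 'sent from my phone'"))
```

Each clause keyword ends up on its own line with its body indented beneath it, and the quoted value passes through untouched because the lexer emitted it as a single token. Keeping the stages separate means a layout change only touches the printing stage.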

Handling Dialects

SQL dialects differ in ways that matter for formatting. PostgreSQL uses :: for type casts. MySQL uses backtick-quoted identifiers. BigQuery supports QUALIFY and STRUCT. T-SQL uses TOP instead of LIMIT.

We handle this by parameterizing the lexer and parser with a dialect configuration. The dialect config specifies which tokens are valid keywords in that dialect, which operators are supported, and how identifier quoting works. This allows a single parse pipeline to handle MySQL, PostgreSQL, SQLite, BigQuery, and T-SQL without branching spaghetti in the core logic.
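As a rough sketch of what such a dialect configuration could look like, the Python below passes a small config object to a shared lexer. The Dialect class, its field names, and the abbreviated keyword and operator sets are assumptions for illustration, not our actual configuration schema. String literals and comments are left out for brevity.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Dialect:
    name: str
    keywords: frozenset[str]
    operators: tuple[str, ...]
    identifier_quotes: tuple[tuple[str, str], ...]  # (open, close) pairs


# Abbreviated, illustrative configs; each lists at least one operator and quote pair.
POSTGRES = Dialect("postgresql", frozenset({"SELECT", "FROM", "LIMIT"}),
                   ("::", "="), (('"', '"'),))
MYSQL = Dialect("mysql", frozenset({"SELECT", "FROM", "LIMIT"}),
                ("=",), (("`", "`"),))
TSQL = Dialect("tsql", frozenset({"SELECT", "FROM", "TOP"}),
               ("=",), (('"', '"'), ("[", "]")))


def tokenize(sql: str, dialect: Dialect) -> list[tuple[str, str]]:
    quoted = [re.escape(o) + ".*?" + re.escape(c)
              for o, c in dialect.identifier_quotes]
    # Longest operators first so "::" is never split into two tokens.
    ops = [re.escape(op)
           for op in sorted(dialect.operators, key=len, reverse=True)]
    pattern = re.compile(
        "(?P<quoted>" + "|".join(quoted) + ")"
        + "|(?P<op>" + "|".join(ops) + ")"
        + r"|(?P<word>\w+)"
        + r"|(?P<other>\S)"
    )
    tokens = []
    for m in pattern.finditer(sql):
        kind, text = m.lastgroup, m.group()
        # A word is a keyword only if this dialect says so.
        if kind == "word" and text.upper() in dialect.keywords:
            kind, text = "keyword", text.upper()
        tokens.append((kind, text))
    return tokens


print(tokenize('select "id"::int from t', POSTGRES))
print(tokenize("select `id` from t limit 1", MYSQL))
print(tokenize("select top 1 [id] from t", TSQL))
```

The tokenize function contains no per-dialect branches: TOP becomes a keyword only under the T-SQL config, and backticks delimit an identifier only under the MySQL one. Adding a dialect in this sketch means adding a config value, not editing the lexer.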

Edge Cases We Had to Get Right

Error Recovery

Real-world SQL is often syntactically invalid: queries under development, queries extracted from logs, isolated fragments. A formatter that refuses to touch invalid SQL is frustrating to use.

Our parser implements error recovery: when it encounters an unexpected token, it logs the error, skips tokens until it finds a safe resynchronization point (typically a statement boundary or a known clause keyword), and continues parsing. The resulting AST may be incomplete, but the formatter can still produce useful output for the portions it understood.
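The recovery strategy described above is commonly called panic-mode recovery. A minimal Python sketch follows, assuming a toy grammar in which a statement is a sequence of clauses and the resynchronization points are ";" and three clause keywords. ParseResult, parse_with_recovery, and SYNC_KEYWORDS are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field

SYNC_KEYWORDS = {"SELECT", "FROM", "WHERE"}  # illustrative subset of clause keywords


@dataclass
class ParseResult:
    clauses: list[tuple[str, list[str]]] = field(default_factory=list)
    errors: list[str] = field(default_factory=list)


def is_sync(tok: str) -> bool:
    # Safe places to resume: a statement boundary or a known clause keyword.
    return tok == ";" or tok.upper() in SYNC_KEYWORDS


def parse_with_recovery(tokens: list[str]) -> ParseResult:
    result = ParseResult()
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == ";":
            i += 1
            continue
        if tok.upper() not in SYNC_KEYWORDS:
            # Unexpected token: log it, then skip ahead to a sync point.
            result.errors.append(f"unexpected token {tok!r} at position {i}")
            while i < len(tokens) and not is_sync(tokens[i]):
                i += 1
            continue
        # Normal path: a clause keyword followed by its body.
        keyword = tok.upper()
        i += 1
        body = []
        while i < len(tokens) and not is_sync(tokens[i]):
            body.append(tokens[i])
            i += 1
        result.clauses.append((keyword, body))
    return result


# A query under development, with a typo in the first keyword.
result = parse_with_recovery("selec id from users where active = 1".split())
print(result.errors)
print(result.clauses)
```

The misspelled "selec" and the column after it are reported and skipped, while the FROM and WHERE clauses are still parsed. The result is incomplete but still gives a printer something to work with, which matches the behavior the paragraph describes.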