
CSV Format Standards — RFC 4180 vs the Real World

CSV is the most-deployed and least-standardized data format on Earth. Master RFC 4180, the dialects you will actually meet in production, and the quoting rules nobody reads — so the next ingest does not break at 2am.

December 13, 2026 · 8 min read

What RFC 4180 Actually Says

RFC 4180 was published in October 2005 by Y. Shafranovich. It is six pages long and arrived three decades after the format it describes had become ubiquitous. The document explicitly notes its job is to describe "the existing common practices" rather than dictate one — which is why every CSV producer can claim conformance and every CSV consumer can still receive files it cannot read.

The core rules are short. Each record is one line, terminated by CRLF. Records have the same number of fields, separated by commas. The optional first record is a header. Fields containing comma, double-quote, or newline must be enclosed in double-quotes. A double-quote inside a quoted field is escaped by doubling it: "She said ""hi""". The character set defaults to ASCII; UTF-8 is mentioned only as a possibility.

Three things RFC 4180 does not standardize: a delimiter that is not comma, a quote character that is not double-quote, and a way to indicate the encoding. Every dialect that diverges from the spec is filling one of these gaps.

Field Rules in Practice

Three rules cover most field-content questions. First, fields are not trimmed: leading and trailing spaces are part of the value. Second, an empty field and a field containing the empty string are indistinguishable on the wire (this matters when round-tripping to a database that distinguishes NULL from ''). Third, a record with fewer fields than the header is malformed — RFC 4180 forbids ragged rows.

# Header, then three data records
name,email,age
Alice,alice@example.com,30
Bob,bob@example.com,25
Carol,,42
# Carol's email is empty string. There is no way to encode "NULL" in raw CSV.

The empty-vs-null ambiguity is why Postgres COPY adds the \N sentinel and why MySQL LOAD DATA uses the same. If your pipeline needs to preserve nullability, pick a dialect that supports a null marker explicitly — or hand off to JSON Lines and stop pretending CSV is a relational format.
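If you control the consumer, mapping the sentinel back to a real null is a one-liner. A minimal sketch with Python's stdlib csv module, assuming the Postgres-style \N marker (the sample data is hypothetical):

```python
import csv
import io

# Hypothetical sample: \N marks NULL, a bare empty field stays ''
data = 'name,email\nAlice,alice@example.com\nBob,\\N\nCarol,\n'

rows = []
for row in csv.DictReader(io.StringIO(data)):
    # Map the sentinel to None; leave genuine empty strings alone
    rows.append({k: (None if v == r'\N' else v) for k, v in row.items()})

print(rows[1]['email'])  # None
print(rows[2]['email'])  # '' — distinguishable from NULL again
```

The same mapping in reverse (None → \N on write) preserves nullability on the round trip.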

Quoting: When and How

A field needs quoting whenever it contains a comma, a double-quote, a CR, or an LF. Many writers also quote whenever a field contains leading or trailing whitespace, because some readers strip it. The safe default is QUOTE_MINIMAL (quote only when necessary), but for files consumed by unknown tools, QUOTE_ALL is bulletproof at the cost of a few extra bytes per field.
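The two modes are easiest to compare side by side. A small sketch using Python's stdlib csv module:

```python
import csv
import io

row = ['plain', 'Smith, John', 'She said "hi"']

# QUOTE_MINIMAL: quote only fields that need it
buf_min = io.StringIO()
csv.writer(buf_min, quoting=csv.QUOTE_MINIMAL).writerow(row)
print(buf_min.getvalue())  # plain,"Smith, John","She said ""hi"""

# QUOTE_ALL: quote every field, needed or not
buf_all = io.StringIO()
csv.writer(buf_all, quoting=csv.QUOTE_ALL).writerow(row)
print(buf_all.getvalue())  # "plain","Smith, John","She said ""hi"""
```

Note that both modes escape the embedded double-quote by doubling it, per RFC 4180.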

# Original values
name      = Smith, John
quote     = She said "hi"
multiline = Line one
            Line two

# CSV encoding (RFC 4180)
"Smith, John","She said ""hi""","Line one
Line two"

The newline inside the multiline field is literal — the quoted field continues across raw line breaks. This is correct CSV and it is what trips up wc -l when you try to count records.

The escape mechanism is doubling, not backslashing. To an RFC-strict parser, "a\"b" is malformed: the quoted field closes at the second double-quote, leaving the value a\ followed by stray characters — definitely not what you meant. Postgres COPY in CSV mode can be configured to accept a backslash escape (ESCAPE '\'), but RFC-strict parsers reject the backslash form.

Newlines Inside Fields

The presence of literal CR or LF inside a quoted field is the single biggest source of pipeline bugs. Naive readers — anything that calls readline() in a loop — split on the embedded newline and produce two malformed half-records. The data looks corrupted but the file is valid.

The fix is to never write code that reads CSV by line. Use a parser. Every modern stdlib (Python csv, .NET TextFieldParser, Ruby CSV, Go encoding/csv) handles embedded newlines correctly. The cost is one import statement.

If you absolutely must process CSV with shell tools, csvkit (a Python toolkit providing csvcut, csvgrep, csvjson) is the answer. Do not reach for awk on real-world CSV.
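A quick demonstration of why line counting lies. A sketch with Python's stdlib csv module and a made-up record whose address spans two lines:

```python
import csv
import io

# One header plus one record whose address field contains a literal newline
data = 'name,address\r\n"Ada","1 Main St\nSuite 2"\r\n'

naive = data.splitlines()           # what readline()-style code sees
parsed = list(csv.reader(io.StringIO(data)))

print(len(naive))   # 3 "lines" — the embedded newline splits the record
print(len(parsed))  # 2 records — header plus one data row
```

This is the same discrepancy wc -l reports on any file with multiline fields: it counts line endings, not records.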

Dialects in the Wild

These are the six dialects you will meet most often in 2026. Each diverges from RFC 4180 in at least one column. Confusing one for another is how data gets dropped during migration:

Dialect | Delim | Quote | Escape | EOL | Null | Notes
--- | --- | --- | --- | --- | --- | ---
RFC 4180 | , | " | "" (double the quote) | CRLF | Empty field | Theoretical baseline; rarely used unmodified
Excel (US) | , | " | "" | CRLF | Empty field | Adds BOM for UTF-8; numbers may be auto-coerced
Excel (EU) | ; | " | "" | CRLF | Empty field | Locale-dependent; comma is the decimal point
Postgres COPY | , | " | "" or \" | LF | \N (or configurable) | Strict; rejects malformed input by default
MySQL LOAD DATA | \t | none | \\ | LF | \N | Defaults to TSV; CSV requires explicit FIELDS clause
Google Sheets | , | " | "" | LF | Empty field | UTF-8 default; respects RFC 4180 closely
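When the dialect is unknown at ingest time, Python's stdlib csv.Sniffer can often guess the delimiter from a sample. A minimal sketch (the semicolon-delimited sample is hypothetical, standing in for an Excel (EU) export):

```python
import csv

# Hypothetical Excel (EU)-style sample: semicolon-delimited
sample = 'name;city\nAlice;Berlin\nBob;Paris\n'

# Restrict candidates to the delimiters you actually expect
dialect = csv.Sniffer().sniff(sample, delimiters=',;\t')
print(dialect.delimiter)  # ;
```

Sniffing is a heuristic, not a guarantee — for files from a known producer, hard-code the dialect instead.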

BOM and UTF-8

The Byte Order Mark — EF BB BF at the start of a UTF-8 file — is the single most contentious three bytes in CSV. Excel on Windows requires it to recognize a file as UTF-8; without it, Excel decodes as CP1252 and every curly apostrophe turns into â€™. Almost every other tool tolerates the BOM but does not need it.

The result is a stalemate. If you generate CSV for human upload to Excel: include the BOM. If you generate CSV for ingest by Postgres, Snowflake, BigQuery, or any non-Microsoft pipeline: do not include the BOM, or strip it on the consumer side with utf-8-sig (Python) or { bom: true } (csv-parse).

The pragmatic compromise is to strip the BOM on every ingest, regardless of source. It is one line of code and it eliminates a class of bug that surfaces only when a BOM-bearing file appears in a pipeline that previously only saw clean UTF-8.
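The strip-on-ingest rule really is about one line of logic. A minimal sketch operating on raw bytes, before any decoding:

```python
# Strip a UTF-8 BOM defensively on every ingest, regardless of source
def strip_bom(raw: bytes) -> bytes:
    BOM = b'\xef\xbb\xbf'
    return raw[len(BOM):] if raw.startswith(BOM) else raw

print(strip_bom(b'\xef\xbb\xbfname,age\n'))  # b'name,age\n'
print(strip_bom(b'name,age\n'))              # b'name,age\n' — no BOM, unchanged
```

Applying this unconditionally is safe: a file without a BOM passes through untouched.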

Working with CSV in Python

Python's standard library csv module is one of the rare batteries-included implementations that gets every edge case right. Use it for both reading and writing — never reach for str.split(','):

# Python — the standard library handles RFC 4180 correctly
import csv

# Read with explicit dialect
with open('data.csv', encoding='utf-8-sig') as f:  # 'utf-8-sig' strips BOM
    reader = csv.DictReader(f, dialect='excel')
    for row in reader:
        print(row['name'], row['email'])

# Write with minimal quoting (fields containing special chars get quoted)
with open('out.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL, lineterminator='\r\n')
    writer.writerow(['name', 'note'])
    writer.writerow(['Smith, John', 'Said "hi"'])
    # name,note
    # "Smith, John","Said ""hi"""

The two non-obvious flags: encoding='utf-8-sig' on read silently strips the BOM if present, and newline='' on write disables Python's universal-newlines translation so the lineterminator is honored.

Working with CSV in JavaScript / Node

In the Node ecosystem, the two libraries that actually handle real-world CSV are csv-parse for streaming and papaparse for browser-side parsing. Both implement RFC 4180 with explicit options for every common dialect divergence:

// Node.js — csv-parse for streaming, papaparse for browser
import { parse } from 'csv-parse';
import { stringify } from 'csv-stringify/sync';
import fs from 'node:fs';

// Stream parse a 10 GB file without loading into memory
fs.createReadStream('huge.csv')
  .pipe(parse({
    columns: true,        // treat first row as headers
    skip_empty_lines: true,
    relax_quotes: false,  // strict RFC 4180
    bom: true,            // strip UTF-8 BOM if present
  }))
  .on('data', (row) => handleRow(row))  // handleRow: your per-row consumer — don't call Node's global `process`
  .on('end', () => console.log('done'));

// Write back, RFC 4180 compliant
const out = stringify([
  { name: 'Smith, John', note: 'Said "hi"' },
], { header: true, quoted_string: true, record_delimiter: '\r\n' });
// "name","note"\r\n"Smith, John","Said ""hi"""

For files that cannot fit in memory the streaming API is non-negotiable. A 10 GB CSV processed via fs.createReadStream uses a few megabytes of RAM regardless of file size; the same file loaded with fs.readFileSync + papaparse will OOM on most servers.

Pitfalls That Bite Once a Quarter

Excel mangles leading zeros: A column of ZIP codes like "01234" opens in Excel as "1234" — Excel decides the value is numeric on parse. Workaround: write the field as the formula ="01234" (CSV-encoded as "=""01234""") or rename the file to .txt and use Data > From Text.

Date locale wars: "03/04/2026" is March 4 in the US and April 3 in the UK. Always serialize dates as ISO 8601 (YYYY-MM-DD). Always.

Embedded newlines in fields: A multi-line address inside a quoted field is RFC 4180 compliant but breaks naive line-by-line readers (sed, awk, half of bash one-liners). Use a real CSV parser.

Mixed UTF-8 and CP1252: Files exported from older Windows tools often use CP1252 (smart quotes) while everything else expects UTF-8. The result: mojibake on every curly apostrophe. Detect with chardet or normalize on ingest.

Trailing-comma row count drift: "a,b,c," has four fields, not three — the trailing empty field counts. Some parsers tolerate it, some do not. Strip trailing delimiters with sed before ingest if you must support legacy producers.
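For the mixed-encoding case, a pragmatic stdlib-only alternative to chardet is to try UTF-8 first and fall back to CP1252. A minimal sketch — the fallback order is an assumption; adjust it for your actual sources:

```python
# Try strict UTF-8 first; fall back to CP1252 for legacy Windows exports.
# UTF-8 is strict enough that CP1252 bytes rarely decode as valid UTF-8.
def decode_csv_bytes(raw: bytes) -> str:
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw.decode('cp1252')

smart = 'it\u2019s'.encode('cp1252')  # curly apostrophe as a single CP1252 byte
print(decode_csv_bytes(smart))        # correctly recovers the curly apostrophe
```

CP1252 maps every byte to some character, so it can never raise — which is exactly why it must be the fallback, never the first attempt.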

Defense in depth: Validate row count and column count on every ingest. A file that produced 10,000 rows yesterday and 9,997 today is reporting three silent parse failures — surface them before the downstream report ships.
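That guard can be a few lines at the top of every ingest. A minimal sketch using a hypothetical validate helper that counts data rows and rejects ragged ones:

```python
import csv
import io

# Hypothetical ingest guard: fail loudly on ragged rows, return the row count
def validate(text: str) -> int:
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    count = 0
    for lineno, row in enumerate(reader, start=2):
        if len(row) != len(header):
            raise ValueError(
                f'line {lineno}: expected {len(header)} fields, got {len(row)}')
        count += 1
    return count

print(validate('a,b,c\n1,2,3\n4,5,6\n'))  # 2 — compare against yesterday's count
```

Comparing the returned count against the previous run's is what turns three silent parse failures into an alert instead of a wrong report.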

Convert CSV to JSON in seconds

Drop in any CSV — get clean JSON, no upload, dialect detection built in.

Open CSV to JSON →
