What RFC 4180 Actually Says
RFC 4180 was published in October 2005 by Y. Shafranovich. It is six pages long and arrived three decades after the format it describes had become ubiquitous. The document explicitly notes its job is to describe "the existing common practices" rather than dictate one — which is why every CSV producer can claim conformance and every CSV consumer can still receive files it cannot read.
The core rules are short. Each record is one line, terminated by CRLF. Records have the same number of fields, separated by commas. The optional first record is a header. Fields containing comma, double-quote, or newline must be enclosed in double-quotes. A double-quote inside a quoted field is escaped by doubling it: "She said ""hi""". The character set defaults to ASCII; UTF-8 is mentioned only as a possibility.
Three things RFC 4180 does not standardize: a delimiter that is not comma, a quote character that is not double-quote, and a way to indicate the encoding. Every dialect that diverges from the spec is filling one of these gaps.
Field Rules in Practice
Three rules cover most field-content questions. First, fields are not trimmed: leading and trailing spaces are part of the value. Second, an empty field and a field containing the empty string are indistinguishable on the wire (this matters when round-tripping to a database that distinguishes NULL from ''). Third, a record with fewer fields than the header is malformed — RFC 4180 forbids ragged rows.
```
# Header, then three data records
name,email,age
Alice,alice@example.com,30
Bob,bob@example.com,25
Carol,,42
# Carol's email is the empty string. There is no way to encode "NULL" in raw CSV.
```
The empty-vs-null ambiguity is why Postgres COPY adds the \N sentinel and why MySQL LOAD DATA uses the same. If your pipeline needs to preserve nullability, pick a dialect that supports a null marker explicitly — or hand off to JSON Lines and stop pretending CSV is a relational format.
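As a minimal sketch of recovering nullability on ingest, assuming the producer was configured to write a Postgres-style \N for NULL (the helper name is illustrative):

```python
import csv

NULL_SENTINEL = r"\N"  # assumed marker, matching Postgres COPY's text-format default

def read_with_nulls(path):
    """Yield dict rows, mapping the sentinel back to Python None."""
    with open(path, encoding="utf-8-sig", newline="") as f:
        for row in csv.DictReader(f):
            yield {k: (None if v == NULL_SENTINEL else v) for k, v in row.items()}
```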
Quoting: When and How
A field needs quoting whenever it contains a comma, a double-quote, a CR, or an LF. Many writers also quote whenever a field contains leading or trailing whitespace, because some readers strip it. The safe default is QUOTE_MINIMAL (quote only when necessary), but for files consumed by unknown tools, QUOTE_ALL is bulletproof at the cost of a few extra bytes per field.
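A quick sketch of the difference between the two policies, using Python's stdlib writer (the row values are illustrative):

```python
import csv, io

row = ['Smith, John', 'plain', 'Said "hi"']

for policy in (csv.QUOTE_MINIMAL, csv.QUOTE_ALL):
    buf = io.StringIO()
    csv.writer(buf, quoting=policy).writerow(row)
    print(buf.getvalue(), end="")

# QUOTE_MINIMAL: "Smith, John",plain,"Said ""hi"""
# QUOTE_ALL:     "Smith, John","plain","Said ""hi"""
```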
```
# Original values
name = Smith, John
quote = She said "hi"
multiline = Line one
Line two

# CSV encoding (RFC 4180)
"Smith, John","She said ""hi""","Line one
Line two"
```
The newline inside the multiline field is literal — the quoted field continues across raw line breaks. This is correct CSV and it is what trips up wc -l when you try to count records.
The escape mechanism is doubling, not backslashing. To an RFC-strict parser, "a\"b" is a quoted field containing a\ followed by stray trailing characters, not the single value a"b you meant. Postgres COPY in CSV mode can be configured (via its ESCAPE option) to accept the backslash form, but RFC-strict parsers reject it.
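A short sketch of the round trip with Python's stdlib, assuming the default dialect (doubling on, no escape character):

```python
import csv, io

# Writing: an embedded quote is escaped by doubling it.
buf = io.StringIO()
csv.writer(buf).writerow(['a"b'])
print(buf.getvalue())                 # "a""b"

# Reading the doubled form recovers the original value.
print(next(csv.reader(['"a""b"'])))   # ['a"b']

# A backslash-escaped quote is not RFC 4180: the default reader keeps the
# backslash literally and does not recover a"b.
print(next(csv.reader(['"a\\"b"'])))  # not the intended ['a"b']
```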
Newlines Inside Fields
The presence of literal CR or LF inside a quoted field is the single biggest source of pipeline bugs. Naive readers — anything that calls readline() in a loop — split on the embedded newline and produce two malformed half-records. The data looks corrupted but the file is valid.
The fix is to never write code that reads CSV by line. Use a parser. Every modern stdlib (Python csv, .NET TextFieldParser, Ruby CSV, Go encoding/csv) handles embedded newlines correctly. The cost is one import statement.
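As a sketch of the difference, here is record counting done wrong (raw lines) and right (parsed records); the file name is illustrative:

```python
import csv

# Raw line count over-counts: an embedded newline splits one record in two.
with open("data.csv", encoding="utf-8-sig", newline="") as f:
    raw_lines = sum(1 for _ in f)

# Parsed record count respects quoted fields that span lines.
with open("data.csv", encoding="utf-8-sig", newline="") as f:
    records = sum(1 for _ in csv.reader(f))

print(raw_lines, records)  # raw_lines >= records whenever fields embed newlines
```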
If you absolutely must process CSV with shell tools, csvkit (csvcut, csvgrep, csvjson) is the answer. Do not reach for awk on real-world CSV.
Dialects in the Wild
These are the six dialects you will meet most often in 2026. All but the RFC baseline diverge from it in at least one column. Mistaking one for another is how data gets dropped during migration:
| Dialect | Delim | Quote | Escape | EOL | Null | Notes |
|---|---|---|---|---|---|---|
| RFC 4180 | , | " | "" (double the quote) | CRLF | Empty field | Theoretical baseline; rarely used unmodified |
| Excel (US) | , | " | "" | CRLF | Empty field | Adds BOM for UTF-8; numbers may be auto-coerced |
| Excel (EU) | ; | " | "" | CRLF | Empty field | Locale-dependent; comma is the decimal point |
| Postgres COPY (CSV) | , | " | "" or \" | LF | Unquoted empty string (configurable, e.g. \N) | Strict; rejects malformed input by default |
| MySQL LOAD DATA | \t | none | \\ | LF | \N | Defaults to TSV; CSV requires explicit FIELDS clause |
| Google Sheets | , | " | "" | LF | Empty field | UTF-8 default; respects RFC 4180 closely |
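When the producer's dialect is known, registering it once in Python's csv module keeps the divergence in one place. A sketch for the semicolon-delimited Excel (EU) variant; the name excel_eu is mine, not a built-in (the stdlib only ships excel, excel-tab, and unix):

```python
import csv

csv.register_dialect(
    "excel_eu",              # hypothetical name for the Excel (EU) row above
    delimiter=";",
    quotechar='"',
    doublequote=True,        # "" escapes an embedded quote
    lineterminator="\r\n",
)

with open("eu_export.csv", encoding="utf-8-sig", newline="") as f:
    for row in csv.DictReader(f, dialect="excel_eu"):
        ...                  # fields arrive split on ';', not ','
```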
BOM and UTF-8
The Byte Order Mark — EF BB BF at the start of a UTF-8 file — is the single most contentious three bytes in CSV. Excel on Windows requires it to recognize a file as UTF-8; without it, Excel decodes the file as CP1252 and every non-ASCII character turns into mojibake (é arrives as Ã©). Almost every other tool tolerates the BOM but does not need it.
The result is a stalemate. If you generate CSV for human upload to Excel: include the BOM. If you generate CSV for ingest by Postgres, Snowflake, BigQuery, or any non-Microsoft pipeline: do not include the BOM, or strip it on the consumer side with utf-8-sig (Python) or { bom: true } (csv-parse).
The pragmatic compromise is to strip the BOM on every ingest, regardless of source. It is one line of code and it eliminates a class of bug that surfaces only when a BOM-bearing file appears in a pipeline that previously only saw clean UTF-8.
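Both halves of that advice are one line each in Python; a sketch (file names are illustrative):

```python
import codecs
import csv

# Producing for Excel: the utf-8-sig codec writes the EF BB BF prefix for you.
with open("for_excel.csv", "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerow(["name", "city"])

# Ingesting from anywhere: utf-8-sig strips a BOM if present, no-op otherwise.
with open("incoming.csv", encoding="utf-8-sig", newline="") as f:
    rows = list(csv.reader(f))

# Or strip it from raw bytes before decoding:
data = open("incoming.csv", "rb").read().removeprefix(codecs.BOM_UTF8)
```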
Working with CSV in Python
Python's standard library csv module is one of the rare batteries-included implementations that gets every edge case right. Use it for both reading and writing — never reach for str.split(','):
```python
# Python — the standard library handles RFC 4180 correctly
import csv

# Read with explicit dialect
with open('data.csv', encoding='utf-8-sig', newline='') as f:  # 'utf-8-sig' strips BOM
    reader = csv.DictReader(f, dialect='excel')
    for row in reader:
        print(row['name'], row['email'])

# Write with minimal quoting (any field containing special chars gets quoted)
with open('out.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL, lineterminator='\r\n')
    writer.writerow(['name', 'note'])
    writer.writerow(['Smith, John', 'Said "hi"'])

# out.csv:
# name,note
# "Smith, John","Said ""hi"""
```
The two non-obvious flags: encoding='utf-8-sig' on read silently strips the BOM if present, and newline='' on write disables Python's universal-newlines translation so the lineterminator is honored.
Working with CSV in JavaScript / Node
In the Node ecosystem, the two libraries that actually handle real-world CSV are csv-parse for streaming and papaparse for browser-side parsing. Both implement RFC 4180 with explicit options for every common dialect divergence:
```javascript
// Node.js — csv-parse for streaming, papaparse for browser
import { parse } from 'csv-parse';
import { stringify } from 'csv-stringify/sync';
import fs from 'node:fs';

// Stream parse a 10 GB file without loading into memory
fs.createReadStream('huge.csv')
  .pipe(parse({
    columns: true,          // treat first row as headers
    skip_empty_lines: true,
    relax_quotes: false,    // strict RFC 4180
    bom: true,              // strip UTF-8 BOM if present
  }))
  .on('data', (row) => handleRow(row))   // handleRow is your per-row callback
  .on('end', () => console.log('done'));

// Write back, RFC 4180 compliant (CRLF record delimiter)
const out = stringify([
  { name: 'Smith, John', note: 'Said "hi"' },
], { header: true, quoted_string: true, record_delimiter: '\r\n' });
// "name","note"\r\n"Smith, John","Said ""hi"""
```
For files that cannot fit in memory the streaming API is non-negotiable. A 10 GB CSV processed via fs.createReadStream uses a few megabytes of RAM regardless of file size; the same file loaded with fs.readFileSync + papaparse will OOM on most servers.
Pitfalls That Bite Once a Quarter
Defense in depth: Validate row count and column count on every ingest. A file that produced 10,000 rows yesterday and 9,997 today is reporting three silent parse failures — surface them before the downstream report ships.
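A minimal sketch of that check in Python; the expected column count and yesterday's row count would come from your own pipeline metadata, and the names here are illustrative:

```python
import csv

EXPECTED_COLUMNS = 3          # illustrative: pulled from the schema
YESTERDAY_ROW_COUNT = 10_000  # illustrative: pulled from the previous run

def validate(path):
    """Raise on ragged rows; warn when the row count drops versus the last run."""
    rows = 0
    with open(path, encoding="utf-8-sig", newline="") as f:
        for rec_no, row in enumerate(csv.reader(f), start=1):
            if len(row) != EXPECTED_COLUMNS:
                raise ValueError(f"record {rec_no}: expected {EXPECTED_COLUMNS} fields, got {len(row)}")
            rows += 1
    rows -= 1  # don't count the header record
    if rows < YESTERDAY_ROW_COUNT:
        print(f"warning: row count dropped from {YESTERDAY_ROW_COUNT} to {rows}")
    return rows
```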