CSV Quick Fixes
We're about to hit 5 billion rows loaded into CSV Explorer for analysis. I wish I could say all those rows were parsed and loaded automatically from CSV files straight into PostgreSQL, but we've seen our fair share of the quirks and errors of CSV formatting. I remember a late night copying a CSV into PostgreSQL and being stumped by
ERROR invalid input syntax for date '4/31/2017'.
Spoiler: April 31st doesn't exist. So I figured I'd make this cheat sheet of common errors we've seen when importing CSV files into PostgreSQL and how to fix them. All these fixes work without bringing the file into memory, i.e. on large CSVs!
By default, PostgreSQL expects dates to have the day before the month.
The Error: If the month comes first
invalid input syntax for date '31/12/2017'
The Fix: Run
SET datestyle = 'ISO,DMY'; in psql before loading your data. This will stay set for the session.
Nulls in CSV files should be empty commas
Jim,,,,Chocolate Chip. Programs export CSV files with all sorts of values for null.
sed 's/,null/,/g' input.csv Use sed to replace ,null with ,.
The Caveat: This could do bad things if someone's favorite ice cream flavor begins with null. It might be tempting to surround null with commas
's/,null,/,,/g' but then this wouldn't catch the second null in
,null,null,. This also won't catch rows whose first column is null.
sed 's/^null,/,/' input.csv # remove beginning null
sed 's/,null$/,/' input.csv # remove end null
CSV files should have a header row, data, and that's it. Programs like to add extra rows before and after the data.
The Error: MySQL likes to add
(1,300 rows exported), or
This CSV was created by Salesforce
tail can be used to see the first or last lines of a file, but can also be used to remove those rows.
Remove the first three lines from a csv file
tail -n +4 input.csv
Remove the last three lines from a csv file
head -n -3 input.csv
Extra Header Rows & Empty Lines
When CSV files are merged, they often contain the header row several times. Sometimes CSV files have empty lines.
Remove header rows. This will remove every header row including the first.
sed '/^id,name,age/d' input.csv
Delete empty lines
sed '/^$/d' input.csv
Some CSV files have random null bytes in them. I don't know why. Remove them.
invalid byte sequence for encoding "UTF8": 0x00
tr < input.csv -d '\000'
Commas in Numbers
invalid syntax for integer '1,300'
alter table ice_cream_orders alter chocolate_orders type int using replace(chocolate_orders, ',', '')
Postgres doesn't like when numbers have commas in them. If the numbers in your CSV have commas, load the column as text and then use Postgres's
replace to remove the commas and convert it to an int!