CSV Quick Fixes

profile picture of article author Jason by Jason Strauss Data Engineer at CSV Explorer

We're about to hit 5 billion total rows loaded into CSV Explorer for analysis. I wish I could say all those rows were parsed and loaded automatically from CSV files straight into PostgreSQL, but we've seen our fair share of the quirks and errors of CSV formatting. I remember a late night copying a CSV into PostgreSQL and being stumped by ERROR invalid input syntax for date '4/31/2017'.

Spoiler: April 31st doesn't exist. So I figured I'd make this cheat sheet of common errors we've seen when importing CSV files into PostgreSQL and how to fix them. All these fixes work without bringing the file into memory, i.e. on large CSVs!


MM/DD/YYYY

By default, PostgreSQL expects dates to have the day before the month.
The Error: If the month comes first invalid input syntax for date '31/12/2017'
The Fix: Run SET datestyle = 'ISO,DMY'; in psql before loading your data. This will stay set for the session.



Non-null Nulls

Nulls in CSV files should be empty commas Jim,,,,Chocolate Chip. Programs export CSV files with all sorts of values for null.
The Error: Jen,null,null,null,Vanilla or Janet,#N/A,#N/A,#N/A,Chocolate
The Fix: sed 's/,null/,/g' input.csv Use sed to replace ,null with ,.

The Caveat: This could do bad things if someone's favorite ice cream flavor begins with null. It might be tempting to surround null with commas 's/,null,/,,/g' but then this wouldn't catch the second null in ,null,null,. This also won't catch rows whose first column is null.

Remove null from the beginning or end of a row sed 's/^null,/,/' input.csv # remove beginning null
sed 's/,null$/,/' input.csv # remove end null


Extra Rows

CSV files should have a header row, data, and that's it. Programs like to add extra rows before and after the data.
The Error: MySQL likes to add (1,300 rows exported), or This CSV was created by Salesforce
The Fix: head and tail can be used to see the first or last lines of a file, but can also be used to remove those rows.

Remove the first three lines from a csv file tail -n +4 input.csv Remove the last three lines from a csv file head -n -3 input.csv


Extra Header Rows & Empty Lines

When CSV files are merged, they often contain the header row several times. Sometimes CSV files have empty lines.
The Error: id,name,age,ice_cream_flavor,cone_or_cup
...
...
id,name,age,ice_cream_flavor,cone_or_cup
...
The Fix:

Remove header rows. This will remove every header row including the first. sed '/^id,name,age/d' input.csv Delete empty lines sed '/^$/d' input.csv


Null Bytes

Some CSV files have random null bytes in them. I don't know why. Remove them.
The Error: invalid byte sequence for encoding "UTF8": 0x00
The Fix: tr < input.csv -d '\000'

Commas in Numbers

The Error: invalid syntax for integer '1,300'
The Fix: alter table ice_cream_orders alter chocolate_orders type int using replace(chocolate_orders, ',', '')

Postgres doesn't like when numbers have commas in them. If the numbers in your CSV have commas, load the column as text and then use Postgres's replace to remove the commas and convert it to an int!