Linux Follies: byte order

2020-01-09

Dealing with Excel CSV/TSV files in UTF-16LE encoding, and an invisible character

Motivation: I am trying to read a TSV file produced by Blackboard courseware. I am trying to do so using Python's built-in csv library. I am using a simple method:

import csv 

with open('myfile.tsv', 'r', encoding='utf-16le') as cf:
    cr = csv.DictReader(cf, dialect='excel-tab')
    print(cr.fieldnames)   

    for row in cr:       
        print(row)

What I got for the fieldnames was:

['\ufeff"Last Name"', 'First Name']

I.e. the "Last Name" field got munged with \ufeff or U+FEFF in normal Unicode notation.

and the rows looked like:

OrderedDict([('\ufeff"Last Name"', 'Doe'), ('First Name', 'Alice')])

Just using vim to look at the file, there seemed to be nothing weird about the first line which contains the column names. This is a ZERO WIDTH NO-BREAK SPACE character used as a Byte Order Mark (BOM). It allows reading processes, e.g. file(1)/magic(5), to figure out the byte order of the file.

But it messes up the csv.DictReader parsing of the field names.

Turns out, there is an easy fix. You can just modify the field names directly:

cr = csv.DictReader(cf, dialect='excel-tab')
cr.fieldnames[0] = 'Last Name'

And the output is fixed, too:

OrderedDict([('Last Name', 'Doe'), ('First Name', 'Alice')])