Encoding Hell

Haui

2012-09-26 12:00

Some years ago, when I first started using Linux, my system's locale was set to ISO8859-15. Over the years I switched to UTF-8 of course. Though I now tend to use proper filenames for all my files, I come across witnesses of the old days, when I was littering my filenames with crappy [1], or even crappier[2] characters, once in a while. In my defence I have to say, that lots of these files carry names I didn't choose by myself, because they were auto-generated by CD rippers or other software. Some files even date back to the time when I was exclusively using Windows and didn't care about filenames or encodings at all.

Using the command from my posting about rename can usually fix all these filenames, but this might not always be what you want - a folder named glückliche_kühe is renamed to gl_ckliche_k_he - not a perfect solution. What you might really want is to convert the filename from one encoding to another, and good for you, somebody already did all the work and created a nifty little program called convmv, which supports 124 different encodings. The syntax is quiet easy:

convmv -f iso8859-15 -t utf-8 *

This would show which filenames in the current directory would have been converted from ISO8859-15 to UTF-8 if you'd explicitly added the - -notest option to the command-line.

Thats the easy way, but let's assume you want to work with the glückliche_kühe folder without re-encoding the filename. Be aware of the fact, that some graphical file mangers may not handle filenames with wrong encodings correctly. On my system, krusader couldn't open the ISO8859-15 encoded test folder, while gentoo (yes, this is indeed a file manger) only displayed a warning. Additionally, there are situations, where no graphical environment is available at all.

So, the far more interesting question is how to work with these files in a shell environment. The naive approach cd glückliche_kühe fails because the ISO8859-15 ü is different from the UTF-8 ü - our UTF-8 environment will correctly respond that there's no such folder. A simple ls will show a question mark for every crappier character in the filename and that's not exactly useful either, since we can't uniquely identify the names this way. How would you change into glückliche_kühe if there's also a folder called gl_ckliche_k_he? Typing cd gl?ckliche_k?he is ambiguous since the question mark is treated as a special character by the Bash and expands to match any character. Depending on the situation, this might or might not work as the Bash returns a list of all matching filenames for your input sequence gl?ckliche_k?he. One solution is to run ls with the -b option - this way, we instruct ls to print unprintable characters as octal escapes:

user@localhost /tmp/test $ ls -b
gl\374ckliche_k\374he

This gives us something to work with. echo can interpret these escape sequences and Bash's command substitution offers a way to use echo's output as a value.

user@localhost /tmp/test $ cd "$(echo -e "gl\0374ckliche_k\0374he")"
user@localhost /tmp/test/glückliche_kühe $ pwd
/tmp/test/glückliche_kühe

There are three things you should note here. First of all, in order to mark the escaped sequences as octal numbers, you need to add a leading zero in the way I did in this example. Secondly, the -e parameter is required to tell echo to interpret escaped sequences rather than printing the literal characters. The last thing is not exactly related to the encoding problem, but always worth mentioning: the quotes are supposed to be there for a reason!

So, now the encoding hell shouldn't look so scary anymore - at least not with respect to filenames. ;)

Oh, and by the way, if you just want to check if you got any wrongly encoded filenames, this one-liner could help:

find . -print0 | xargs -0 ls -db  | egrep "\\\[0-9\]{3}"

[1] every character c, that is not in [a-zA-Z0-9._-]+

[2] every character c, where utf8(c) != iso8859-15(c)