Remove non-ascii characters in file names


Today someone asked me to help with some files with non-ASCII characters in their names on his Linux box. The problem was that those files couldn’t be read by some apps (unicode still remains a mystery for some). Since I am not a bash-minded person I thought of Python (2.x) first (it works on Windows as well). The following script removes any non-ASCII characters from the names of .rar files in the current directory.

WARNING: Beware that two different names may collapse to the same ASCII-only name, in which case a file may be overwritten!

import os

for file in os.listdir(u"."):
    if os.path.isfile(file) and file.endswith(u'.rar'):
        new_file = "".join(i for i in file if ord(i) < 128)
        if file != new_file:
            print u"Renaming", file.encode('utf8'), u"to", new_file.encode('utf8')
            os.rename(file, new_file)
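Since name collisions are the main risk here, a collision-safe variant can be sketched as follows (same logic, written in Python 3 syntax; the function name and return value are my own invention, not part of the original script):

```python
import os

def strip_non_ascii_names(directory, suffix=u".rar"):
    """Rename matching files, dropping non-ASCII characters from their names.

    Skips a rename when the ASCII-only target already exists, so no file
    is silently overwritten. Returns the (old, new) pairs actually renamed.
    """
    renamed = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if not (os.path.isfile(path) and name.endswith(suffix)):
            continue
        new_name = u"".join(c for c in name if ord(c) < 128)
        new_path = os.path.join(directory, new_name)
        if name != new_name and not os.path.exists(new_path):
            os.rename(path, new_path)
            renamed.append((name, new_name))
    return renamed
```

The existence check turns a destructive collision into a no-op, which is usually the safer default for a bulk rename.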


Note that the u"." is essential so that you get unicode file names back. A plain "." gives you regular byte strings, which are a pain-in-the-butt to work with. If you'd rather stick to bash (if you don’t like Python for some reason), I’ve come up with the following script:

for f in *.rar; do
    # note: '-' goes last in the set, so tr does not treat ".-_" as a range
    mv "$f" "$(echo "$f" | tr -cd 'a-zA-Z0-9._-')"
done
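Coming back to the u"." point above: the same rule carries over to Python 3, where the type of the argument to os.listdir decides the type of the names you get back. A small sketch using a throwaway temp directory:

```python
import os
import tempfile

# create a throwaway directory with one file in it
directory = tempfile.mkdtemp()
open(os.path.join(directory, "sample.rar"), "w").close()

text_names = os.listdir(directory)           # str (unicode) names
byte_names = os.listdir(directory.encode())  # bytes names

print(type(text_names[0]).__name__)  # str
print(type(byte_names[0]).__name__)  # bytes
```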


Note that most scripts deal with the data IN the files, but not with the file names themselves. I hope some other people can use this as well.

Working with unicode in Python (again)

It has been quite some time since the last post, but that does not mean I haven’t done anything interesting :). It is just that it was all so interesting that I didn’t have any time to write about it.

Anyway, this time I have stumbled upon (again) a unicode problem in some Python code which was supposed to be perfectly prepared for it, since it even started with


#!/usr/bin/env python
# -*- coding: UTF-8 -*-

It ran perfectly fine in Eclipse, but to my huge surprise I got problems when running the unit tests from the command line in a terminal. Whaaat? It just worked!

Well, declaring your source as UTF-8 is of course not enough. There are several things to check when you get the “UnicodeDecodeError: ‘ascii’ codec can’t decode byte … in position …: ordinal not in range(128)” kind of errors. Googling around didn’t bring me much luck, to my surprise, so here are my findings for the next time :).

First of all, make absolutely sure you haven’t forgotten the ‘u’ prefix on the strings that contain unicode. Yep, just like that you can screw up the rest of the unicode support. Python (ok, I admit, I use 2.5.4) treats ‘string’ as a regular byte string and not as unicode. So write u’string’ instead!

Second, when doing file operations, don’t forget that you don’t get unicode handling by default. Consider the following:


file_handle = open('message.txt', 'w')     # plain byte-oriented file
message = u'unicode message: caf\xe9'      # note the non-ASCII character
file_handle.write(message)

Well, guess what: you get a UnicodeEncodeError when writing the string away, as soon as it contains a non-ASCII character, because Python implicitly tries to encode it as ASCII. So the solution would be to do something like this:


encoded_message = message.encode(u"utf-8")
file_handle.write(encoded_message)
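To make the symmetry explicit: encode turns a unicode string into a byte string, and decode turns those bytes back into unicode. A minimal sketch (also runs on Python 3):

```python
message = u"unicode message: caf\xe9"

encoded = message.encode("utf-8")  # byte string, safe to write to a file
decoded = encoded.decode("utf-8")  # back to the original unicode string

assert decoded == message
```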

But that’s only half of the problem. At some point you will be reading this data back, and most probably you would like to get your beloved unicode thingy back. Just doing the following will hardly help:


file_handle = open(full_name, 'r')
line = file_handle.readline()

The following will save your day:


file_handle = open(full_name, 'r')
line = file_handle.readline().decode('utf-8')
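Both directions can also be handled in one place with io.open (available since Python 2.6, and the behaviour of the built-in open in Python 3), which takes an encoding argument and does the encode/decode for you. A sketch with a hypothetical file path:

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "message.txt")  # hypothetical path

# the encoding argument makes the handle accept and return unicode directly
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(u"unicode message: caf\xe9")

with io.open(path, "r", encoding="utf-8") as handle:
    line = handle.read()

assert line == u"unicode message: caf\xe9"
```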

Voila. I hope this saves somebody some frustration, even if it will be me some months later :).

Enjoy!


Comments from Andy (thanks!):

Probably there is nothing new for you in what I am saying below; however, from my experience, it covers 99% of unicode-related errors.

A unicode string is a sequence of code points in the range 0 to 0x10FFFF. An encoding is a way of serializing this sequence so that it can be represented in memory, written to a file, sent over a socket, etc.
Encoding a unicode string is needed _at least_ because its ‘as is’ byte representation is not portable due to byte-order issues.
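The byte-order point is easy to see by encoding the same string with the two UTF-16 variants (a small sketch; these exact bytes hold for any conforming codec):

```python
s = u"caf\xe9"  # 'café'

utf8 = s.encode("utf-8")
le = s.encode("utf-16-le")  # little-endian serialization
be = s.encode("utf-16-be")  # big-endian serialization

# same code points, different byte sequences depending on byte order
assert utf8 == b"caf\xc3\xa9"
assert le != be
assert le.decode("utf-16-le") == be.decode("utf-16-be") == s
```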

It is _recommended_ that you work with unicode strings internally, provided the language/API supports unicode.
It is a _must_ that you encode the string before it is consumed by another program.