Remove non-ascii characters in file names

Today someone asked me for help with some files on his Linux box whose names contained non-ASCII characters. The problem was that some apps couldn’t read those files (Unicode still remains a mystery for some). Since I am not a bash-minded person, I thought of Python (2.x) first (it works on Windows as well). The following script removes any non-ASCII characters from the names of the .rar files in the current directory.

WARNING: Beware of files that end up with the same name after renaming, as one of them may be overwritten!

import os

# u"." makes os.listdir return unicode file names (Python 2.x)
for file in os.listdir(u"."):
    if os.path.isfile(file) and file.endswith(u'.rar'):
        # keep only the characters in the ASCII range
        new_file = u"".join(i for i in file if ord(i) < 128)
        if file != new_file:
            print u"Renaming", file.encode('utf8'), u"to", new_file.encode('utf8')
            os.rename(file, new_file)

 

Note that the u"." is essential so that you get unicode file names back. A plain "." gives you regular byte strings, which are a pain in the butt. For those who prefer to stick to bash (if you don’t like Python for some reason), I came up with the following script:

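for f in *.rar; do
    # keep only letters, digits, ".", "_" and "-" ("-" goes last so tr does not read it as a range)
    new=$(printf '%s' "$f" | LC_ALL=C tr -cd 'a-zA-Z0-9._-')
    [ "$f" != "$new" ] && mv -- "$f" "$new"
done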

 

Note that most scripts deal with the data IN the files, not with the file names themselves. I hope some other people can use this as well.
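The overwrite warning above can also be handled in code. What follows is a minimal sketch, not the original script: it assumes Python 3 (where os.listdir already returns unicode strings, so the u"." trick is not needed), and the names ascii_only and safe_rename_all are made up for this example. It skips a rename whenever the target name already exists instead of replacing that file.

```python
import os


def ascii_only(name):
    """Return name with every non-ASCII character dropped."""
    return "".join(ch for ch in name if ord(ch) < 128)


def safe_rename_all(directory=".", suffix=".rar"):
    """Rename files in directory to ASCII-only names, never overwriting.

    Returns a list of (old, new) name pairs that were actually renamed.
    """
    renamed = []
    for old in sorted(os.listdir(directory)):
        old_path = os.path.join(directory, old)
        if not (os.path.isfile(old_path) and old.endswith(suffix)):
            continue
        new = ascii_only(old)
        if new == old:
            continue
        new_path = os.path.join(directory, new)
        # Skip instead of overwriting when the target already exists
        # or when nothing is left of the name.
        if not new or os.path.exists(new_path):
            print("Skipping %r: target %r is unusable" % (old, new))
            continue
        os.rename(old_path, new_path)
        renamed.append((old, new))
    return renamed
```

Calling safe_rename_all() with no arguments works on the current directory, like the script above. Files that were skipped stay untouched, so you can rename them by hand afterwards.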

Python.NET and VS2010/.NET 4

I hope this will help people who, like me, are searching for an answer. Several of these steps were mentioned on the Python for .NET mailing list by other people as well, but I haven’t seen a step-by-step guide. It is not my intention to duplicate other posts, but rather to have an all-in-one post.

Here is how I got VS2010 and .NET 4.0 working with revision 122 of Python.NET, with Python 2.6 installed.

Compile

  1. Get the sources (tarball from SourceForge or directly from SVN)
  2. Open the pythonnet.sln with VS2010 and convert to 2010 format (will happen automagically)
  3. [updated] Add the constructorbinding.cs to the Python.Runtime.csproj project (see also this post in the PythonNET mailing list)
  4. Change the target framework to 4. Repeat the following steps for EACH project:
    Right-click on the project name and select “Properties”
    Select the “Application” tab on the left (if not selected yet)
    Change the “Target framework” to “.NET Framework 4”
  5. Open the buildclrmodule.bat and change the following lines (2 times!)

    %windir%\Microsoft.NET\Framework\v2.0.50727\ilasm /nologo /quiet /dll %ILASM_EXTRA_ARGS% /include=%INCLUDE_PATH% /output=%OUTPUT_PATH% %INPUT_PATH%

    to

    %windir%\Microsoft.NET\Framework\v4.0.30319\ilasm /nologo /quiet /dll %ILASM_EXTRA_ARGS% /include=%INCLUDE_PATH% /output=%OUTPUT_PATH% %INPUT_PATH%
  6. Open the clrmodule.il and change the lines with the version number in the following piece of code

    .assembly extern mscorlib
    {
    .publickeytoken = (B7 7A 5C 56 19 34 E0 89 )
    .ver 2:0:0:0
    }

    to

    .ver 4:0:0:0
  7. Recompile the whole solution and ignore the deprecation warnings.
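Steps 5 and 6 above are plain text substitutions, so they can be scripted if you have to repeat them. This is only a sketch under a few assumptions: it uses Python 3 syntax, the helper names are invented for this example, and it assumes the two files contain the text exactly as quoted in the steps. The .il patch touches only the version inside the mscorlib reference and leaves any other .ver line alone.

```python
import re

OLD_FRAMEWORK = "v2.0.50727"
NEW_FRAMEWORK = "v4.0.30319"

# Matches an ".assembly extern mscorlib { ... }" block up to its closing brace.
MSCORLIB_BLOCK = re.compile(r"\.assembly\s+extern\s+mscorlib\s*\{[^}]*\}")


def patch_bat(text):
    """Point every ilasm call at the .NET 4 framework directory (step 5)."""
    return text.replace(OLD_FRAMEWORK, NEW_FRAMEWORK)


def patch_il(text):
    """Bump .ver to 4:0:0:0 inside the mscorlib reference only (step 6)."""
    def bump(match):
        return match.group(0).replace(".ver 2:0:0:0", ".ver 4:0:0:0")
    return MSCORLIB_BLOCK.sub(bump, text)


def patch_file(path, patcher):
    """Rewrite the file at path in place using one of the patchers above."""
    with open(path) as handle:
        original = handle.read()
    with open(path, "w") as handle:
        handle.write(patcher(original))
```

For example, patch_file("buildclrmodule.bat", patch_bat) and patch_file("clrmodule.il", patch_il), run from the folder that holds those two files. Keep a copy of the originals, since the files are rewritten in place.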

Now you have all necessary files under the pythonnet folder where you have the sources. You need clr.pyd, python.exe and Python.Runtime.dll.

Test

  • Run the newly compiled python.exe
    Type the following in the interactive prompt

    >>> import System
    >>> print System.Environment.Version
    4.0.30319.1

    The last line proves that you’re using the 4.0 runtime. The precompiled binaries available from SourceForge would show


    2.0.50727.3615
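The same check can be put in a small script instead of being typed at the prompt. A rough sketch, assuming Python 3 compatible syntax and assuming that import System only succeeds in an interpreter with Python.NET available; the function names are invented for this example.

```python
def clr_major(version_text):
    """Return the major number from a version string such as '4.0.30319.1'."""
    return int(version_text.split(".")[0])


def describe_runtime():
    """Report which CLR major version this interpreter runs on, if any."""
    try:
        import System  # only importable when Python.NET is available
    except ImportError:
        return "Python.NET is not available in this interpreter"
    version = str(System.Environment.Version)
    return "CLR %d (%s)" % (clr_major(version), version)


print(describe_runtime())
```

With the two version strings shown above, clr_major returns 4 for the freshly compiled build and 2 for the precompiled binaries.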