Tuesday, September 15, 2009

Python 3 Unicode - print() in a putty cygwin terminal with UTF8 enabled

When I started rewriting my hkgolden forum stats program in Python 3, Unicode was the first issue I had to deal with. I launched the program from a PuTTY Cygwin terminal (UTF-8 enabled) and tried to print the web page content to it, and immediately hit two problems: 1. the Chinese text was no longer Chinese, and 2. "UnicodeEncodeError: 'gbk' codec can't encode character '\u2022' in position 188: illegal multibyte sequence".

The Chinese text cannot be shown correctly because print() automatically picks up a default encoding from the terminal/OS, so a plain print() of the text can fail. As you could guess from problem #2, the default on my system is "gbk", because I picked "Simplified Chinese" as my non-Unicode encoding in Windows (print(sys.stdout.encoding) returns 'cp936' for me).
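Problem #2 can be reproduced without any terminal involved. The following is a minimal sketch, not the original program; the variable names are mine and the only assumption is that the standard "gbk" codec is available:

```python
# U+2022 (BULLET) is the character named in the error message.
bullet = "\u2022"

# Encoding to UTF-8 works for any Unicode string.
utf8_bytes = bullet.encode("utf-8")

# Encoding to gbk fails because the codec has no mapping for this character.
try:
    bullet.encode("gbk")
    gbk_failed = False
except UnicodeEncodeError as exc:
    gbk_failed = True
    print(exc)  # the message is pure ASCII, so printing it is safe
```

This is what print() runs into when sys.stdout.encoding is 'cp936': the string itself is fine, but the output encoding cannot represent it.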

After three hours of reading the Python reference docs and googling, I figured out how to bypass print() with sys.stdout.buffer.write(), which writes bytes directly to stdout.

sys.stdout.buffer.write(line.decode("big5").encode())

'line' holds Big5-encoded bytes
line.decode("big5") turns the bytes into a Unicode string (str) in Python 3
line.decode("big5").encode() turns the Unicode string into UTF-8 encoded bytes (UTF-8 is the default for str.encode() in Python 3)
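The three steps can be checked one at a time. This is a small sketch with a hypothetical sample value standing in for one line of the downloaded page; the real 'line' comes from the web page content:

```python
# Hypothetical sample: the two characters U+4E2D U+6587, encoded as Big5 bytes.
line = "\u4e2d\u6587".encode("big5")

text = line.decode("big5")          # Big5 bytes -> str
utf8_bytes = text.encode("utf-8")   # str -> UTF-8 bytes (explicit encoding)
default_bytes = text.encode()       # same result: UTF-8 is the default

# repr() output is ASCII-only, so it prints safely on any terminal.
print(type(line).__name__, type(text).__name__, type(utf8_bytes).__name__)
print(repr(utf8_bytes))
```

Writing the encoding out explicitly, as in encode("utf-8"), is probably the clearer habit even though the default gives the same bytes.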

More research on sys.stdout.buffer.write() led me to the better answers at "Setting the correct encoding when piping stdout in python" and here.


import sys
import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer)
print(line.decode("big5")) # automatically using utf8 to output the unicode string


According to http://www.python.org/doc/3.1/library/codecs.html#codecs.StreamWriter, the stream must be a file-like object open for writing binary data, and that is our "sys.stdout.buffer".
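To see why the binary stream matters, the writer can be tried against in-memory streams instead of the real stdout. This is only a sketch: io.BytesIO and io.StringIO are stand-ins I chose for a binary and a text stream, and the variable names are hypothetical:

```python
import codecs
import io

# A binary stream, playing the role of sys.stdout.buffer.
raw = io.BytesIO()
writer = codecs.getwriter("utf8")(raw)
writer.write("\u4e2d\u6587\n")   # str in, UTF-8 bytes out
captured = raw.getvalue()

# A text stream is the wrong target: the writer hands it bytes.
text_stream = io.StringIO()
bad_writer = codecs.getwriter("utf8")(text_stream)
try:
    bad_writer.write("\u4e2d\u6587\n")
    text_stream_failed = False
except TypeError:
    text_stream_failed = True

print(repr(captured), text_stream_failed)
```

The StreamWriter encodes each string and passes the resulting bytes to the wrapped stream, which is why it has to wrap sys.stdout.buffer rather than sys.stdout itself.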

The best way to deal with Unicode is to treat every input as bytes and decode it as soon as it is received (Big5 in our example), and to send every output as bytes by encoding the internal string representation (the same approach as in Perl). UTF-8 is the best choice for the output encoding here.
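That rule can be sketched as a small helper that decodes at the input boundary and encodes at the output boundary. The function name transcode and its parameters are hypothetical, and the in-memory streams stand in for the web page and for sys.stdout.buffer:

```python
import io


def transcode(src, dst, in_encoding="big5", out_encoding="utf-8"):
    """Read bytes lines from src, decode them, and write re-encoded bytes to dst."""
    for raw_line in src:
        text = raw_line.decode(in_encoding)    # bytes -> str right after input
        dst.write(text.encode(out_encoding))   # str -> bytes right before output


# Hypothetical Big5 input with two lines.
src = io.BytesIO("\u4e2d\u6587\n\u9999\u6e2f\n".encode("big5"))
dst = io.BytesIO()
transcode(src, dst)

result = dst.getvalue().decode("utf-8")
print(repr(dst.getvalue()))
```

In a real script, dst could be sys.stdout.buffer, so everything between the two boundaries works only with str values.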
