The Chinese text cannot be shown correctly because print() automatically picks up a default encoding from the terminal/OS, even if you have written "print(
After three hours of reading the Python reference documentation and googling, I figured out how to bypass print() with sys.stdout.buffer.write(), which writes bytes directly to stdout.
sys.stdout.buffer.write(line.decode("big5").encode())
'line' holds Big5-encoded bytes
line.decode("big5") turns those bytes into a Unicode string (str) in Python 3
line.decode("big5").encode() turns that string into UTF-8 encoded bytes, because str.encode() defaults to UTF-8
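To make the three steps above concrete, here is a minimal sketch of the round trip. The sample text "中文" and the variable names are my own picks for illustration; in the real script 'line' would come from a file or a pipe rather than being built in the code.

```python
# Sample text chosen only for this sketch; a real 'line' would be read as bytes.
sample_text = "中文"
line = sample_text.encode("big5")   # stand-in for the Big5 bytes we received

decoded = line.decode("big5")       # Big5 bytes -> Python 3 str
utf8_bytes = decoded.encode()       # str -> UTF-8 bytes (str.encode() defaults to UTF-8)

# The same characters have different byte representations in each encoding.
print(len(line), len(utf8_bytes))
```

The utf8_bytes value is what you would hand to sys.stdout.buffer.write().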
More research on sys.stdout.buffer.write led me to better answers, such as "Setting the correct encoding when piping stdout in python".
import sys
import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer)
print(line.decode("big5")) # the str is now encoded as UTF-8 automatically on output
According to http://www.python.org/doc/3.1/library/codecs.html#codecs.StreamWriter, the stream must be a file-like object open for writing binary data, and that is exactly what "sys.stdout.buffer" is.
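Since the writer only needs a binary file-like object, the same wrapping should work on any such stream, not just sys.stdout.buffer. A small sketch, assuming an in-memory io.BytesIO as a stand-in for stdout so the resulting bytes can be inspected (the names binary_stream and writer are mine):

```python
import codecs
import io

# io.BytesIO stands in for sys.stdout.buffer: a file-like object opened for binary writing.
binary_stream = io.BytesIO()

# codecs.getwriter() returns a StreamWriter class; calling it wraps the binary stream.
writer = codecs.getwriter("utf8")(binary_stream)

writer.write("中文\n")               # we write str, the wrapper encodes it to UTF-8
written = binary_stream.getvalue()  # the raw bytes that reached the underlying stream
```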
The best way to deal with Unicode is to treat every input as bytes and decode it right after receiving it (Big5 in our example), and to send every output as bytes by encoding the internal string representation (the same as in Perl). The best choice for the output encoding here is UTF-8.
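That rule could be sketched roughly like this. The transcode() helper and its parameter names are hypothetical, written just for this post, and the in-memory streams stand in for a real Big5 input file and for sys.stdout.buffer.

```python
import io

def transcode(source, sink, input_encoding="big5", output_encoding="utf-8"):
    """Hypothetical helper: decode at the input boundary, encode at the output boundary."""
    for raw_line in source:                      # raw_line is bytes
        text = raw_line.decode(input_encoding)   # bytes -> str as soon as input arrives
        # ... work with 'text' as a normal Python str here ...
        sink.write(text.encode(output_encoding)) # str -> bytes only when leaving the program

# In-memory stand-ins for a Big5 file and for sys.stdout.buffer.
source = io.BytesIO("中文\n第二行\n".encode("big5"))
sink = io.BytesIO()
transcode(source, sink)
```

In a real script you would pass a file opened in "rb" mode as the source and sys.stdout.buffer as the sink.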