Thursday, September 24, 2009

Two links for future reference - Spring & inner class and access target from proxy in aop

The first link is here. To create a bean from an inner class, you need a constructor argument that refers back to the parent class bean; otherwise you will get an exception along the lines of "no default constructor".

The second link is here. It is basic, I know, but it is something I will most likely forget if I need it again.

Scala 2.8 scala.io.Source throws java.nio.charset.UnmappableCharacterException at an unmappable sequence of bytes by default

Using Scala 2.8 to grab a web page encoded in big5, Source.fromURL throws UnmappableCharacterException at a Chinese character (a byte sequence) that cannot be mapped to a Unicode character. The default behavior of the Scala Codec is to report this as an exception.

Reading the Codec source, you can see that Codec is built on top of java.nio.charset.CharsetDecoder. According to the javadoc, CharsetDecoder has a method onUnmappableCharacter, and there are three CodingErrorAction values to choose from: IGNORE, REPLACE and REPORT.

In scala.io.Codec source:

def onUnmappableCharacter(newAction: Action): this.type = { _onUnmappableCharacter = newAction ; this }

So that's easy enough,

import java.nio.charset.CodingErrorAction.REPLACE
import scala.io.Codec
// replace unmappable byte sequences instead of throwing
implicit def codec = Codec("big5").onUnmappableCharacter(REPLACE)
scala.io.Source.fromURL(...) // a big5 encoded page with unmappable characters

Then everything should go through quietly without any error, since each unmappable sequence is replaced by the decoder's default replacement value.
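For comparison, Python's bytes.decode takes an errors argument that plays a similar role to CodingErrorAction: "strict" reports the error, "replace" substitutes U+FFFD, and "ignore" drops the offending bytes. This is only a rough Python analogue, not the Scala API, and the helper names decode_lenient and decode_strict are made up for illustration:

```python
def decode_lenient(raw: bytes, encoding: str = "big5") -> str:
    """Decode raw bytes, substituting U+FFFD for anything that cannot be decoded."""
    # errors="replace" is roughly what REPLACE does in java.nio:
    # bad input is swapped for a replacement character instead of raising.
    return raw.decode(encoding, errors="replace")


def decode_strict(raw: bytes, encoding: str = "big5") -> str:
    """Decode raw bytes, raising UnicodeDecodeError on bad input (roughly REPORT)."""
    return raw.decode(encoding, errors="strict")
```

Calling decode_lenient(page_bytes) on a big5 page with a few bad sequences should give a readable string with replacement characters in the bad spots, while decode_strict raises at the first one.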

Wednesday, September 16, 2009

Python in cygwin with puttycyg or mintty - interactive mode without prompt

I finally found the answer here, in the last post of that thread.

$ python -i
Python 3.1.1 (r311:74483, Aug 17 2009, 17:02:12) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>>

Yay!

Tuesday, September 15, 2009

Python 3 Unicode - print() in a putty cygwin terminal with UTF8 enabled

While rewriting my hkgolden forum stat program in Python 3, Unicode was the first thing I had to deal with. I launched the program from a putty cygwin terminal (UTF8 enabled) and tried to print the web page content to it, and immediately ran into two problems: 1. the Chinese was not Chinese anymore, and 2. "UnicodeEncodeError: 'gbk' codec can't encode character '\u2022' in position 188: illegal multibyte sequence".

The Chinese cannot be shown correctly because print() automatically picks up a default encoding from the terminal/OS, and with that encoding print() can fail. As you could guess from problem #2, the default for my system is "gbk", since I have picked "Simplified Chinese" as my non-Unicode encoding in Windows (print(sys.stdout.encoding) returns 'cp936' for me).

After three hours of reading the Python reference docs and googling, I figured out how to bypass print() with sys.stdout.buffer.write(). This method writes bytes directly to stdout.
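Problem #2 can be checked without a terminal: print() encodes the string with sys.stdout.encoding, and any character missing from that encoding raises UnicodeEncodeError. A minimal sketch; the helper name can_encode is made up for illustration, and the result for a given character depends on the encoding your system reports:

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text can be encoded with the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        # Same exception type that print() raises when the terminal
        # encoding has no mapping for a character.
        return False
```

For example, can_encode(text, sys.stdout.encoding) tells you up front whether print(text) is likely to fail on the current terminal.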

sys.stdout.buffer.write(line.decode("big5").encode())

'line' holds big5 encoded bytes
line.decode("big5") turns the bytes into a Unicode string in Python 3
line.decode("big5").encode() turns the Unicode string into utf8 encoded bytes (utf8 is the default for encode())

More research on this sys.stdout.buffer.write led me to the better answers at "Setting the correct encoding when piping stdout in python" and here.


import sys
import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer)
print(line.decode("big5")) # automatically using utf8 to output the unicode string


From http://www.python.org/doc/3.1/library/codecs.html#codecs.StreamWriter, stream must be a file-like object open for writing binary data, and that's our "sys.stdout.buffer".

The best way to deal with Unicode is to treat every input as bytes and decode it right after receiving it (big5 in our example), and to send every output as bytes by encoding the internal string representation (same as in Perl). The best choice for the output encoding here is utf8.
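That rule can be sketched as a pair of small helpers: decode right after reading, encode right before writing. The names read_text and write_text are made up for illustration, and big5 is the default only because it is the encoding in this example:

```python
def read_text(raw: bytes, source_encoding: str = "big5") -> str:
    """Input boundary: turn received bytes into a str as early as possible."""
    return raw.decode(source_encoding)


def write_text(out, text: str) -> None:
    """Output boundary: encode to UTF-8 and write the bytes to a binary stream."""
    out.write(text.encode("utf-8"))
```

With these, write_text(sys.stdout.buffer, read_text(line)) does the same job as the one-liner above, and everything in between works on plain str values.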

Thursday, September 10, 2009

Scala scala.io.Source fromURL blocks / hangs forever without timeout value

Recently I have been working on a Scala project that data mines a forum. Since I am using multiple threads to fetch the web content, I have reached a point where some of my threads block/hang forever at various lines, like Source.getLine, hasNext or even fromURL. Here is one example stack from a thread dump:


"pool-1-thread-194" prio=6 tid=0x0b57d000 nid=0x1710 runnable [0x0f4af000..0x0f4afa94]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
- locked <0x031a5488> (a java.io.BufferedInputStream)
at sun.net.www.MeteredStream.read(MeteredStream.java:116)
- locked <0x031efdc8> (a sun.net.www.http.KeepAliveStream)
at java.io.FilterInputStream.read(FilterInputStream.java:116)
at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:2446)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:264)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:306)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:158)
- locked <0x031efe48> (a java.io.InputStreamReader)
at java.io.InputStreamReader.read(InputStreamReader.java:167)
at java.io.BufferedReader.fill(BufferedReader.java:136)
at java.io.BufferedReader.read(BufferedReader.java:157)
- locked <0x031efe48> (a java.io.InputStreamReader)
at scala.io.BufferedSource$$anonfun$1$$anonfun$apply$1.apply(BufferedSource.scala:29)
at scala.io.BufferedSource$$anonfun$1$$anonfun$apply$1.apply(BufferedSource.scala:29)
at scala.io.Codec.wrap(Codec.scala:65)
at scala.io.BufferedSource$$anonfun$1.apply(BufferedSource.scala:29)
at scala.io.BufferedSource$$anonfun$1.apply(BufferedSource.scala:29)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:146)
at scala.collection.Iterator$$anon$1.next(Iterator.scala:712)
at scala.collection.Iterator$$anon$1.head(Iterator.scala:699)
at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:374)
at scala.collection.Iterator$$anon$17.hasNext(Iterator.scala:319)
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:706)
at scala.io.Source$LineIterator.getc(Source.scala:182)
at scala.io.Source$LineIterator.next(Source.scala:195)
at scala.io.Source$LineIterator.next(Source.scala:165)
at scala.io.Source.getLine(Source.scala:163)


With a debugger attached, you can see that the timeout parameter passed to the socketRead0 method is actually ZERO, which means no timeout at all. That's why fromURL can block forever.

Open up the scala.io.Source source (2.8): fromURL is actually a convenience wrapper around fromInputStream(url.openStream())(codec).

Now that's easy: just forget about the fromURL method and use fromInputStream with a java.net.URLConnection instead, so you can set the timeouts yourself.


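import java.net.URL
import scala.io.Source

val timeout = 60000 // milliseconds
val conn = (new URL(url)).openConnection() // url is the page address as a String
conn.setConnectTimeout(timeout) // give up if the connection cannot be established in time
conn.setReadTimeout(timeout) // give up if a read blocks longer than this
val inputStream = conn.getInputStream()

val src = Source.fromInputStream(inputStream,
                                 Source.DefaultBufSize,
                                 null,
                                 () => inputStream.close()) // close function for the stream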


EDIT: forgot to close the stream!