Thursday, September 24, 2009

Scala 2.8 scala.io.Source throws java.nio.charset.UnmappableCharacterException at an unmappable sequence of bytes by default

Using Scala 2.8 to fetch a web page encoded in Big5, Source.fromURL throws UnmappableCharacterException when it reaches a byte sequence (a Chinese character) that cannot be mapped to a Unicode character. By default, the Scala Codec reports this condition as an exception rather than skipping or replacing the offending bytes.

Reading the Codec source, you can see that Codec is built on top of java.nio.charset.CharsetDecoder. According to the Javadoc, CharsetDecoder has a method onUnmappableCharacter that accepts one of three CodingErrorAction values: IGNORE, REPLACE, or REPORT.
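To see what the three actions do at the java.nio level, here is a minimal sketch that decodes the same bytes with each of them. It makes a few assumptions that are not part of the original problem: it uses windows-1252 instead of Big5, because byte 0x81 is left undefined in the JDK's windows-1252 table and so should count as unmappable, and it uses scala.util.Try, which needs a Scala version newer than 2.8. The names DecoderActions and decodeWith are made up for this illustration.

```scala
import java.nio.ByteBuffer
import java.nio.charset.{Charset, CodingErrorAction}
import scala.util.Try

object DecoderActions {
  // 'A', then 0x81 (assumed to be unmappable in windows-1252), then 'B'
  val bytes: Array[Byte] = Array(0x41.toByte, 0x81.toByte, 0x42.toByte)

  // Decode the sample bytes with the given action for unmappable input.
  // A fresh decoder is created on every call, since decoders are stateful.
  def decodeWith(action: CodingErrorAction): Try[String] = Try {
    Charset.forName("windows-1252")
      .newDecoder()
      .onUnmappableCharacter(action)
      .decode(ByteBuffer.wrap(bytes))
      .toString
  }

  def main(args: Array[String]): Unit = {
    // REPORT is what a new decoder uses by default, so it should fail here
    println(decodeWith(CodingErrorAction.REPORT))
    // REPLACE substitutes the decoder's replacement string for the bad byte
    println(decodeWith(CodingErrorAction.REPLACE))
    // IGNORE silently drops the bad byte
    println(decodeWith(CodingErrorAction.IGNORE))
  }
}
```

REPORT is the only one of the three that surfaces the problem as an exception, which matches the behavior seen with Source.fromURL.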

In the scala.io.Codec source:

def onUnmappableCharacter(newAction: Action): this.type = { _onUnmappableCharacter = newAction ; this }

So the fix is easy enough:

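import java.nio.charset.CodingErrorAction.REPLACE
import scala.io.Codec
// Picked up implicitly by Source.fromURL; unmappable bytes are replaced instead of reported
implicit def codec = Codec("big5").onUnmappableCharacter(REPLACE)
scala.io.Source.fromURL(...) // a big5 encoded page with unmappable characters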

Then everything should proceed without any error, since each unmappable sequence is replaced by the decoder's default replacement string (typically U+FFFD).
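The same behavior can be checked without a network connection. The sketch below is only an illustration under assumptions: it reads from an in-memory stream through Source.fromInputStream rather than fetching a URL, and it uses windows-1252 with byte 0x81 as a stand-in for the Big5 page, since that byte is assumed to be unmappable in the JDK's windows-1252 table. ReplaceDemo and read are hypothetical names.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.CodingErrorAction
import scala.io.{Codec, Source}

object ReplaceDemo {
  // 'A', then 0x81 (assumed to be unmappable in windows-1252), then 'B'
  val bytes: Array[Byte] = Array(0x41.toByte, 0x81.toByte, 0x42.toByte)

  // Read the sample bytes through scala.io.Source using the given codec.
  def read(codec: Codec): String = {
    val src = Source.fromInputStream(new ByteArrayInputStream(bytes))(codec)
    try src.mkString finally src.close()
  }

  def main(args: Array[String]): Unit = {
    // With REPLACE configured, reading should complete without an exception
    val lenient = Codec("windows-1252").onUnmappableCharacter(CodingErrorAction.REPLACE)
    println(read(lenient))

    // An unconfigured codec reports the bad byte, as in the original problem
    try println(read(Codec("windows-1252")))
    catch {
      case e: java.nio.charset.CharacterCodingException =>
        println("failed: " + e)
    }
  }
}
```

The codec is passed explicitly here instead of through an implicit, which makes it easier to compare the two configurations side by side.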
