Thursday, September 10, 2009

Scala scala.io.Source fromURL blocks / hangs forever without timeout value

Recently I have been working on a Scala project that data mines a forum. Because I use multiple threads to fetch the web content, I reached a point where some of my threads block/hang forever at various calls, such as Source.getLine, hasNext or even fromURL. Here is one example of the thread dump stack:


"pool-1-thread-194" prio=6 tid=0x0b57d000 nid=0x1710 runnable [0x0f4af000..0x0f4afa94]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
- locked <0x031a5488> (a java.io.BufferedInputStream)
at sun.net.www.MeteredStream.read(MeteredStream.java:116)
- locked <0x031efdc8> (a sun.net.www.http.KeepAliveStream)
at java.io.FilterInputStream.read(FilterInputStream.java:116)
at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:2446)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:264)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:306)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:158)
- locked <0x031efe48> (a java.io.InputStreamReader)
at java.io.InputStreamReader.read(InputStreamReader.java:167)
at java.io.BufferedReader.fill(BufferedReader.java:136)
at java.io.BufferedReader.read(BufferedReader.java:157)
- locked <0x031efe48> (a java.io.InputStreamReader)
at scala.io.BufferedSource$$anonfun$1$$anonfun$apply$1.apply(BufferedSource.scala:29)
at scala.io.BufferedSource$$anonfun$1$$anonfun$apply$1.apply(BufferedSource.scala:29)
at scala.io.Codec.wrap(Codec.scala:65)
at scala.io.BufferedSource$$anonfun$1.apply(BufferedSource.scala:29)
at scala.io.BufferedSource$$anonfun$1.apply(BufferedSource.scala:29)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:146)
at scala.collection.Iterator$$anon$1.next(Iterator.scala:712)
at scala.collection.Iterator$$anon$1.head(Iterator.scala:699)
at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:374)
at scala.collection.Iterator$$anon$17.hasNext(Iterator.scala:319)
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:706)
at scala.io.Source$LineIterator.getc(Source.scala:182)
at scala.io.Source$LineIterator.next(Source.scala:195)
at scala.io.Source$LineIterator.next(Source.scala:165)
at scala.io.Source.getLine(Source.scala:163)


With a debugger attached, you can see that the timeout parameter passed to the socketRead0 method is actually ZERO. For a socket read, a timeout of zero means "wait indefinitely", which is why fromURL can block forever when the server stops responding.

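If you open up the scala.io.Source source (2.8), fromURL turns out to be just a convenience method for fromInputStream(url.openStream())(codec). Since url.openStream() opens the connection and returns the stream in one step, there is no chance to set a timeout on the connection.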
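To see the defaults that url.openStream() ends up with, here is a minimal sketch that reads the timeouts from a freshly created connection. The object name DefaultTimeouts, the helper defaults and the example.com address are only illustrative; java.net.URLConnection documents a timeout of zero as an infinite timeout.

```scala
import java.net.URL

object DefaultTimeouts {
  // openConnection() only creates the connection object;
  // no network traffic happens until the stream is requested.
  def defaults(url: String): (Int, Int) = {
    val conn = new URL(url).openConnection()
    // Both values are in milliseconds; 0 means "no timeout".
    (conn.getConnectTimeout, conn.getReadTimeout)
  }
}
```

Calling DefaultTimeouts.defaults("http://example.com/") on a default JVM setup should report zero for both values, which matches what the debugger shows in socketRead0.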

The fix is easy: just forget about the fromURL method. Open a java.net.URLConnection yourself, set its timeouts, and pass its stream to fromInputStream instead.


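import java.net.URL
import scala.io.Source

// Timeout in milliseconds, used for both connecting and reading.
val timeout = 60000
// `url` is the address to fetch, as a String.
val conn = (new URL(url)).openConnection()
conn.setConnectTimeout(timeout)
conn.setReadTimeout(timeout)
val inputStream = conn.getInputStream()

// Same as fromURL, but on a stream whose connection has timeouts set.
val src = Source.fromInputStream(inputStream,
                                 Source.DefaultBufSize,
                                 null,                      // no reset function
                                 () => inputStream.close()) // closes the stream when src is closed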


EDIT: forgot to close the stream!
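To check that the timeouts really take effect, here is a rough sketch that points the same idea at a local server that accepts connections but never answers. It assumes a newer Scala than the one used above (the single-argument fromInputStream plus string interpolation), and the names TimeoutDemo, fetch and timesOutAgainstSilentServer are made up for this example.

```scala
import java.net.{ServerSocket, SocketTimeoutException, URL}
import scala.io.{Codec, Source}

object TimeoutDemo {
  // Hypothetical helper: read a URL into a String with explicit timeouts (milliseconds).
  def fetch(url: String, timeoutMs: Int)(implicit codec: Codec): String = {
    val conn = new URL(url).openConnection()
    conn.setConnectTimeout(timeoutMs)
    conn.setReadTimeout(timeoutMs)
    val in = conn.getInputStream()
    try Source.fromInputStream(in)(codec).mkString
    finally in.close() // always release the stream, even on failure
  }

  // Opens a listening socket that never replies, then reports
  // whether fetch gave up with a timeout instead of hanging.
  def timesOutAgainstSilentServer(timeoutMs: Int): Boolean = {
    val server = new ServerSocket(0) // port 0: let the OS pick a free port
    try {
      fetch(s"http://127.0.0.1:${server.getLocalPort}/", timeoutMs)(Codec.UTF8)
      false
    } catch {
      case _: SocketTimeoutException => true
    } finally server.close()
  }
}
```

With a read timeout set, the worker thread gets a java.net.SocketTimeoutException after roughly timeoutMs milliseconds and can retry or skip the page, instead of sitting in socketRead0 forever.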

2 comments:

Anonymous said...

Me too, I got this error with Source.fromURL() in some cases:

java.nio.charset.MalformedInputException: Input length = 1

Normally, I use default connection handler in Java, but I guess the best is Apache HttpComponents.

Matthew Kwong said...

Hi Haiti, it's been half a year I have not touched Scala, but it's good to see people are still using that.