Groovy: Simple file download from URL – Fixed

The Grails app I'm working on right now has some cookbook code that takes a list of URLs and downloads the file each URL points to into a staging directory for other code to work on. There are a couple dozen similar examples on Groovy/Grails blogs around the net:

def downloadFiles = { sourceUrls->
 def stagingDir = "/tmp/stagingdir"
 new File(stagingDir).mkdirs()
 sourceUrls.each { sourceUrl ->
   def filename = sourceUrl.tokenize('/')[-1]
   def file = new FileOutputStream("$stagingDir/$filename")
   def out = new BufferedOutputStream(file)
   out << new URL(sourceUrl).openStream()
   out.close()
 }
}

downloadFiles(
 ["http://lavezzo.com/saic/mvnBuildLifecycle.png",
 "http://lavezzo.com/saic/settings.xml"
 ])

Looks reasonable, right?

What happens if we call it like this?

downloadFiles(
 ["http://lavezzo.com/saic/mvnBuildLifecycle.png",
 "http://lavezzo.com/saic/I have a space.png"
 ])

Disaster! java.net.URL can't handle spaces? Normally, if I were writing the URLs myself, I'd just add my own %20s and call it a day. But in this case the array of URL strings is the output of an XmlSlurper pointed at an HTML file, and I have no control over the spaces in that file. java.net.URLEncoder seems like a good place to look, but it turns out that class is intended for encoding HTML form data. It substitutes a + for each space, which doesn't work in the path of a java.net.URL. java.net.URI's documentation mentions that it quotes illegal characters such as spaces, but only in the multi-argument constructors, not in the URI(String str) constructor. Again, this class seems to assume that you are building the URL yourself and can pass the protocol, port, hostname, etc. each as its own constructor argument.
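To make the difference concrete, here is a small sketch that feeds the same string to each of the classes mentioned above. It is only an illustration: the variable names are mine, and the UTF-8 charset argument is an assumption rather than something the app's code uses.

```groovy
// Sketch only: compares how the JDK classes mentioned above treat a space.
def raw = "http://lavezzo.com/saic/I have a space.png"

// URLEncoder does HTML form encoding: spaces become '+',
// and the ':' and '/' separators get escaped as well.
def formEncoded = URLEncoder.encode(raw, "UTF-8")
println formEncoded

// The single-argument URI constructor parses strictly and rejects the raw space.
def rejected = false
try {
    new URI(raw)
} catch (URISyntaxException e) {
    rejected = true
    println "URI(String) rejected it: ${e.reason}"
}

// The three-argument constructor quotes the illegal characters instead.
def quoted = new URI("http", "//lavezzo.com/saic/I have a space.png", null)
println quoted.toASCIIString()
```

So URLEncoder mangles the whole string rather than just the spaces, and only the multi-argument URI constructor produces something java.net.URL can use.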

Well, it was hard for me to believe, but the answer was to separate out JUST the http portion of each URL string I collected from the web page, pass the two pieces into the URI(String scheme, String ssp, String fragment) constructor, and then call URI's toURL() method. Some Groovy list manipulation conveniences made it a little easier:

def downloadFiles = { sourceUrls->
 def stagingDir = "/tmp/stagingdir"
 new File(stagingDir).mkdirs()
 sourceUrls.each { sourceUrl ->
   def filename = sourceUrl.tokenize('/')[-1]
   def file = new FileOutputStream("$stagingDir/$filename")
   // Split off the scheme ("http"); everything after the first ':' is the scheme-specific part
   def protocolUrlTokens = sourceUrl.tokenize(':')
   // The three-argument constructor quotes illegal characters such as spaces.
   // A null fragment avoids a dangling '#' at the end of the resulting URL.
   def sourceUrlAsURI = new URI(protocolUrlTokens[0],
       protocolUrlTokens[1..-1].join(":"), null)
   def out = new BufferedOutputStream(file)
   out << sourceUrlAsURI.toURL().openStream()
   out.close()
 }
}

downloadFiles(
 ["http://lavezzo.com/saic/mvnBuildLifecycle.png",
 "http://lavezzo.com/saic/I have a space.png"
 ])

It looks silly to be splitting out the http just to put it back together in the constructor. It seems like a simple improvement would be for the one-argument URI constructor to parse the protocol out of the String and then use the three-argument constructor internally.
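Until something like that exists, the split can at least be tucked away in one place. Here is a minimal sketch of such a wrapper. The name toEncodedUrl is hypothetical, and it assumes the incoming string always starts with a scheme followed by a colon.

```groovy
// Hypothetical helper: splits off the scheme and lets URI's
// three-argument constructor quote illegal characters such as spaces.
def toEncodedUrl = { String raw ->
    int colon = raw.indexOf(':')
    if (colon < 1) {
        throw new IllegalArgumentException("No scheme in: " + raw)
    }
    String scheme = raw.substring(0, colon)
    String schemeSpecificPart = raw.substring(colon + 1)
    // A null fragment keeps a dangling '#' out of the result.
    new URI(scheme, schemeSpecificPart, null).toURL()
}

def url = toEncodedUrl("http://lavezzo.com/saic/I have a space.png")
println url

// Caveat: the multi-argument constructors always quote '%', so input
// that is already percent-encoded ends up encoded twice.
def twice = toEncodedUrl("http://lavezzo.com/saic/already%20encoded.png")
println twice
```

One thing to watch for: because these constructors always quote the percent character, a page that mixes raw spaces with URLs that are already encoded would need a check before calling the helper.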

In Charlottesville, Virginia
Jeff
