6 Oct 09

Downloading a bunch of files in parallel using Clojure Agents

I suddenly needed to download around 3000 files from the Internet. I had the URLs in a sequence and wanted a nice way to download the files in parallel.

The idea of using Clojure Agents came naturally to mind, and I was thinking about writing an Agent-based HTTP client in Clojure. I asked around on the Clojure IRC channel and the very helpful Stuart Sierra pointed me towards clojure.contrib.http.agent.

Indeed, c.c.http.agent seemed to be exactly what I had in mind :)

The API seemed to be straightforward enough and I got cracking immediately. I came up with something like this –

;;; downloader.clj -- Parallel Downloader -*- Clojure -*-
;;; Time-stamp: "2009-10-06 13:38:57 ghoseb"
;;; Author: Baishampayan Ghose 
 
(ns downloader
  (:require [clojure.contrib.http.agent :as h]
            [clojure.contrib.duck-streams :as d]))
 
;; A vector of vectors containing the file name and the URL
(def url-data [["file1" "http://some.domain/file1.xml"]
               ["file2" "http://some.domain/file2.xml"]
               ; Many many more :)
               ])
 
(defn download
  "Download the data in the given URL using HTTP Agents
   Args:
     file-name - The file name to save the data in
     url - The URL to fetch
  "
  [file-name url]
  (h/http-agent url
                :handler (fn [agnt]
                           (let [fname file-name]  ; File name in a closure
                             (with-open [w (d/writer fname)]
                               (d/copy (h/stream agnt) w))))))
 
(defn download-all
  "Download all the URLs
   Args:
     url-data - A vector of vectors containing the file name and the url
  "
  [url-data]
  (doseq [[file-name url] url-data]
    (download file-name url)))
 
(download-all url-data)

This looked fine and worked with a small set of URLs. But when I ran it on the full-blown set, the server bailed out because of too many concurrent requests. The reason is that http.agent uses send-off to dispatch actions to the agents, and send-off runs them on a thread pool that grows as needed rather than a fixed-size one, so nearly all of the requests were in flight at once.
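To make the difference between the two dispatch functions concrete, here is a minimal sketch using plain core agents rather than http.agent. The names thread-name-action and distinct-threads are hypothetical helpers made up for this illustration, and the 50 ms sleep only stands in for a slow request.

```clojure
;; Sketch only: count how many different threads end up running the actions.

(defn thread-name-action
  "Hypothetical agent action: pretend to block on I/O, then record the
   name of the thread that ran it as the agent's new state."
  [_]
  (Thread/sleep 50)
  (.getName (Thread/currentThread)))

(defn distinct-threads
  "Dispatch n blocking actions with `dispatch` (send or send-off), wait
   for them all, and return the number of distinct threads used."
  [dispatch n]
  (let [agnts (mapv (fn [_] (dispatch (agent nil) thread-name-action))
                    (range n))]
    (apply await agnts)
    (count (set (map deref agnts)))))

;; send is backed by a fixed-size pool, so this count stays small.
(println "send:    " (distinct-threads send 50))

;; send-off is backed by a pool that grows on demand, so this count is
;; usually close to the number of actions dispatched.
(println "send-off:" (distinct-threads send-off 50))
```

With 3000 URLs instead of 50 actions, the second case is roughly what the server saw. When running this as a script, call (shutdown-agents) at the end so the JVM can exit promptly.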

Clearly, I needed to make sure that only a limited number of files were downloaded in parallel, and to start downloading more only when those were done.

To achieve that, I did this –

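;; 15 being the max parallel downloads. The nil pad keeps the final,
;; shorter batch; a plain (partition 15 url-data) would silently drop it.
(def partitioned-data (partition 15 15 nil url-data))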
 
(defn download-all2
  "Download all the files, step by step
   Args:
     p-url-data - Partitioned url data
  "
  [p-url-data]
  (doseq [url-data p-url-data]
    (let [agnts (map #(download (first %) (second %)) url-data)]
      (apply await agnts)))) ; Wait till the agents finish
 
(download-all2 partitioned-data)

What did I just do? I simply partitioned the data set by the number of parallel downloads I wanted to do, and then modified the download-all function to take the partitioned data, dispatch agents on one partition and wait for them to finish, and then move on to the next partition.
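The same partition-then-await pattern can be tried without any network access. The following is a rough sketch under stated assumptions, not the code I actually ran: fake-download and run-in-batches are hypothetical names, the sleep stands in for a real HTTP fetch, and it uses plain core agents with send-off instead of http.agent. It uses partition-all so that a final batch smaller than the batch size is still processed.

```clojure
(defn fake-download
  "Hypothetical stand-in for a real download: block briefly, then mark
   the agent's state as done."
  [state]
  (Thread/sleep 10)
  (assoc state :done true))

(defn run-in-batches
  "Start one agent per item, at most batch-size at a time, and wait for
   each batch to finish before starting the next one. Returns a vector
   of the final agent states, in input order."
  [batch-size items]
  (reduce (fn [results batch]
            ;; mapv is eager, so every agent in the batch is started here
            (let [agnts (mapv #(send-off (agent {:item %}) fake-download)
                              batch)]
              (apply await agnts)            ; block until the batch is done
              (into results (map deref agnts))))
          []
          (partition-all batch-size items)))

;; Five items in batches of two: batches of 2, 2 and 1.
(def results (run-in-batches 2 ["a" "b" "c" "d" "e"]))
(println (count results) "items processed")
```

One consequence of this design is that each batch takes as long as its slowest download, since nothing new starts until the whole batch has finished. When running this as a script, call (shutdown-agents) at the end so the JVM can exit promptly.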

Simple, yet very beautiful.