<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The Free Geek &#187; agents</title>
	<atom:link href="http://freegeek.in/blog/tag/agents/feed/" rel="self" type="application/rss+xml" />
	<link>http://freegeek.in/blog</link>
	<description>The Chronicles of Nerd-nia</description>
	<lastBuildDate>Wed, 02 Jun 2010 03:22:36 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Downloading a bunch of files in parallel using Clojure Agents</title>
		<link>http://freegeek.in/blog/2009/10/downloading-a-bunch-of-files-in-parallel-using-clojure-agents/</link>
		<comments>http://freegeek.in/blog/2009/10/downloading-a-bunch-of-files-in-parallel-using-clojure-agents/#comments</comments>
		<pubDate>Tue, 06 Oct 2009 08:24:49 +0000</pubDate>
		<dc:creator>Baishampayan</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[agents]]></category>
		<category><![CDATA[clojure]]></category>
		<category><![CDATA[functional]]></category>
		<category><![CDATA[functional programming]]></category>
		<category><![CDATA[http]]></category>
		<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://freegeek.in/blog/?p=80</guid>
		<description><![CDATA[I suddenly needed to download around 3000 files from the Internet. I had the urls in a sequence and I was thinking about a nice way to download the files in parallel. The idea of using Clojure Agents came naturally to my mind and I was thinking about writing an Agent based HTTP client in [...]]]></description>
			<content:encoded><![CDATA[<p>I suddenly needed to download around 3000 files from the Internet. I had the URLs in a sequence and was looking for a nice way to download the files in parallel.</p>
<p>The idea of using <a title="Clojure" href="http://clojure.org/">Clojure</a> <a title="Clojure Agents" href="http://clojure.org/agents">Agents</a> came to mind naturally, and I considered writing an Agent-based HTTP client in Clojure. I asked around on the Clojure IRC channel, and the very helpful Stuart Sierra pointed me towards <a href="http://richhickey.github.com/clojure-contrib/http.agent-api.html">clojure.contrib.http.agent</a>.<br />
Indeed, c.c.http.agent seemed to be exactly what I had in mind <img src='http://freegeek.in/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>The API seemed to be straightforward enough and I got cracking immediately. I came up with something like this &#8211;</p>

<div class="wp_syntax"><div class="code"><pre class="lisp" style="font-family:monospace;"><span style="color: #808080; font-style: italic;">;;; downloader.clj -- Parallel Downloader -*- Clojure -*-</span>
<span style="color: #808080; font-style: italic;">;;; Time-stamp: &quot;2009-10-06 13:38:57 ghoseb&quot;</span>
<span style="color: #808080; font-style: italic;">;;; Author: Baishampayan Ghose </span>
&nbsp;
<span style="color: #66cc66;">&#40;</span>ns downloader
  <span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">:</span><span style="color: #555;">require</span> <span style="color: #66cc66;">&#91;</span>clojure<span style="color: #66cc66;">.</span>contrib<span style="color: #66cc66;">.</span>http<span style="color: #66cc66;">.</span>agent <span style="color: #66cc66;">:</span><span style="color: #555;">as</span> h<span style="color: #66cc66;">&#93;</span>
            <span style="color: #66cc66;">&#91;</span>clojure<span style="color: #66cc66;">.</span>contrib<span style="color: #66cc66;">.</span>duck-streams <span style="color: #66cc66;">:</span><span style="color: #555;">as</span> d<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>
&nbsp;
<span style="color: #808080; font-style: italic;">;; A vector of vectors containing the file name and the URL</span>
<span style="color: #66cc66;">&#40;</span>def url-data <span style="color: #66cc66;">&#91;</span><span style="color: #66cc66;">&#91;</span><span style="color: #ff0000;">&quot;file1&quot;</span> <span style="color: #ff0000;">&quot;http://some.domain/file1.xml&quot;</span><span style="color: #66cc66;">&#93;</span>
               <span style="color: #66cc66;">&#91;</span><span style="color: #ff0000;">&quot;file2&quot;</span> <span style="color: #ff0000;">&quot;http://some.domain/file2.xml&quot;</span><span style="color: #66cc66;">&#93;</span>
               <span style="color: #808080; font-style: italic;">; Many many more :)</span>
               <span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span>
&nbsp;
<span style="color: #66cc66;">&#40;</span>defn download
  <span style="color: #ff0000;">&quot;Download the data in the given URL using HTTP Agents
   Args:
     file-name - The file name to save the data in
     url - The URL to fetch
  &quot;</span>
  <span style="color: #66cc66;">&#91;</span>file-<span style="color: #b1b100;">name</span> url<span style="color: #66cc66;">&#93;</span>
  <span style="color: #66cc66;">&#40;</span>h/http-agent url
                <span style="color: #66cc66;">:</span><span style="color: #555;">handler</span> <span style="color: #66cc66;">&#40;</span>fn <span style="color: #66cc66;">&#91;</span>agnt<span style="color: #66cc66;">&#93;</span>
                           <span style="color: #66cc66;">&#40;</span><span style="color: #b1b100;">let</span> <span style="color: #66cc66;">&#91;</span>fname file-<span style="color: #b1b100;">name</span><span style="color: #66cc66;">&#93;</span>  <span style="color: #808080; font-style: italic;">; File name in a closure</span>
                             <span style="color: #66cc66;">&#40;</span>with-open <span style="color: #66cc66;">&#91;</span>w <span style="color: #66cc66;">&#40;</span>d/writer fname<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#93;</span>
                               <span style="color: #66cc66;">&#40;</span>d/copy <span style="color: #66cc66;">&#40;</span>h/stream agnt<span style="color: #66cc66;">&#41;</span> w<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>
&nbsp;
<span style="color: #66cc66;">&#40;</span>defn download-all
  <span style="color: #ff0000;">&quot;Download all the URLs
   Args:
     url-data - A vector of vectors containing the file name and the url
  &quot;</span>
  <span style="color: #66cc66;">&#91;</span>url-data<span style="color: #66cc66;">&#93;</span>
  <span style="color: #66cc66;">&#40;</span>doseq <span style="color: #66cc66;">&#91;</span><span style="color: #66cc66;">&#91;</span>file-<span style="color: #b1b100;">name</span> url<span style="color: #66cc66;">&#93;</span> url-data<span style="color: #66cc66;">&#93;</span>
    <span style="color: #66cc66;">&#40;</span>download file-<span style="color: #b1b100;">name</span> url<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>
&nbsp;
<span style="color: #66cc66;">&#40;</span>download-all url-data<span style="color: #66cc66;">&#41;</span></pre></div></div>

<p>This looked fine and worked with a small set of URLs. But when I ran it on the full set, the server bailed out because of too many concurrent requests. The reason is that http.agent uses send-off to dispatch actions to the agents, and send-off runs them on a thread pool that grows without a fixed upper bound, so a very large number of requests can be in flight at once.</p>
<p>I needed to make sure that only a limited number of files were downloaded in parallel, with more downloads started only when those were done.</p>
<p>To achieve that, I did this &#8211;</p>

<div class="wp_syntax"><div class="code"><pre class="lisp" style="font-family:monospace;"><span style="color: #66cc66;">&#40;</span>def partitioned-data <span style="color: #66cc66;">&#40;</span>partition <span style="color: #cc66cc;">15</span> url-data<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span> <span style="color: #808080; font-style: italic;">;; 15 being the max parallel downloads</span>
&nbsp;
<span style="color: #66cc66;">&#40;</span>defn download-all2
  <span style="color: #ff0000;">&quot;Download all the files, step by step
   Args:
     p-url-data - Partitioned url data
  &quot;</span>
  <span style="color: #66cc66;">&#91;</span>p-url-data<span style="color: #66cc66;">&#93;</span>
  <span style="color: #66cc66;">&#40;</span>doseq <span style="color: #66cc66;">&#91;</span>url-data p-url-data<span style="color: #66cc66;">&#93;</span>
    <span style="color: #66cc66;">&#40;</span><span style="color: #b1b100;">let</span> <span style="color: #66cc66;">&#91;</span>agnts <span style="color: #66cc66;">&#40;</span>map #<span style="color: #66cc66;">&#40;</span>download <span style="color: #66cc66;">&#40;</span>first <span style="color: #66cc66;">%</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#40;</span>second <span style="color: #66cc66;">%</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span> url-data<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#93;</span>
      <span style="color: #66cc66;">&#40;</span><span style="color: #b1b100;">apply</span> await agnts<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span> <span style="color: #808080; font-style: italic;">; Wait till the agents finish</span>
&nbsp;
<span style="color: #66cc66;">&#40;</span>download-all2 partitioned-data<span style="color: #66cc66;">&#41;</span></pre></div></div>

<p>What did I just do? I simply partitioned the data set by the number of parallel downloads I wanted to do, and then modified the download-all function to take the partitioned data, dispatch agents on one partition and wait for them to finish, and then move on to the next partition.</p>
<p>Simple, yet very beautiful.</p>
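<p>For reference, the same batch-and-wait idea can be sketched with nothing but clojure.core agents. This is a minimal sketch, not the downloader itself: run-in-batches and its parameters are hypothetical names, and it assumes Clojure 1.2 or later for partition-all. Each item becomes the initial state of its own agent, the task is dispatched with send-off, and the next batch starts only after await returns for the current one.</p>

<div class="wp_syntax"><div class="code"><pre class="lisp" style="font-family:monospace;">;; Hypothetical helper for illustration; not part of clojure.contrib.http.agent
(defn run-in-batches
  "Apply task to every item, running at most n tasks at a time.
   Returns a vector of the results in input order."
  [n task items]
  (reduce (fn [results batch]
            ;; doall forces the lazy map, so the whole batch is dispatched now
            (let [agnts (doall (map #(send-off (agent %) task) batch))]
              ;; Block until every action in this batch has finished
              (apply await agnts)
              (into results (map deref agnts))))
          []
          (partition-all n items)))

;; Stand-in task: inc instead of a real download
(run-in-batches 2 inc [1 2 3 4 5]) ; returns [2 3 4 5 6]</pre></div></div>

<p>A real task would fetch the URL and write the file instead of calling inc. Error handling is left out, and one slow item holds up its whole batch, which is the price of keeping the scheme this simple. In a standalone script, call (shutdown-agents) at the end so the JVM can exit.</p>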
]]></content:encoded>
			<wfw:commentRss>http://freegeek.in/blog/2009/10/downloading-a-bunch-of-files-in-parallel-using-clojure-agents/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>
