Wednesday, May 22, 2013

Parallel computing in R

R is a free, general-purpose programming language that is particularly popular for data processing and analysis. I've been using it today for Monte Carlo simulations. Even after optimizing my code (x3 speed increase) and moving the script from my laptop to my workstation (x2 speed increase), each simulation was still going to take a few hours, and I had a lot of them to do.

That's when I decided I was going to learn how to use the snowfall package for R. This allows one to spread computations across multiple computers or, more relevant in my case, across the 8 threads of the processor in my machine. It's fairly easy to use; it took about an hour to modify my code - and I'm sure it'd take much less time now I know how. Snowfall gave close to a further x8 speed increase for the Monte Carlo simulations.

You can find a good introduction to snowfall here

I didn't bother with the SSH stuff mentioned in this guide and it worked just fine for me. Basically, just load the snowfall package (install it first with install.packages("snowfall") if you haven't already):

library( snowfall )

set up a cluster using sfInit,

sfInit( parallel=TRUE, cpus=8, type="SOCK" )

use sfLapply to do anything you would do with lapply, but across all threads,

store <- sfLapply(1:10000, awoneit, numcat = ncat, nitems = nitms, np = nppts)

and finally simplify the list back to a numeric vector to make calculation of means etc. easy:

store <- simplify2array(store)
store <- as.numeric(store)
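
Putting those steps together, here is a rough end-to-end sketch. The function one_sim and the values of ncat, nitms and nppts are made up for illustration (they stand in for my actual simulation function and settings), and I've used cpus=2 so it should run on most machines; treat it as a sketch rather than the exact script I used.

```r
library(snowfall)

# Hypothetical stand-in for one Monte Carlo replicate (NOT the real awoneit):
# np simulated participants each answer nitems items with numcat categories;
# the function returns the mean total score for that replicate.
one_sim <- function(i, numcat, nitems, np) {
  responses <- matrix(sample.int(numcat, np * nitems, replace = TRUE),
                      nrow = np)
  mean(rowSums(responses))
}

# Illustrative settings only.
ncat  <- 5
nitms <- 10
nppts <- 100

# Start a local socket cluster with 2 workers (the post used cpus = 8).
sfInit(parallel = TRUE, cpus = 2, type = "SOCK")

# Same call shape as lapply: extra named arguments are passed on to one_sim.
store <- sfLapply(1:200, one_sim, numcat = ncat, nitems = nitms, np = nppts)

# Shut the workers down once the parallel part is done.
sfStop()

# Collapse the list of single numbers into a plain numeric vector.
store <- as.numeric(simplify2array(store))

mean(store)
```

Because each call returns a single number here, unlist(store) would give the same vector as the two-step simplification.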

Close the cluster using: