When I started graduate school I knew absolutely nothing about computing on anything more powerful than a laptop. I assumed that clusters and servers were exclusive to large, well-funded labs. Access to this kind of hardware is a huge limiting factor in almost any field of research. This is certainly the case in microbiology, where it is increasingly difficult to close the loop on an interesting story without volumes of sequence data. At the time (and for a long time after) I was indignant at this seeming injustice – only large labs with large budgets had access to the large computers required for the most compelling investigations (the results of which enable additional funding…). Over time I’ve slowly come to appreciate that this is not the case. A high performance computer is cheaper than many items deemed essential to a microbiology lab (-80 C freezer (~ $10 K), microscope (~ $20 K)). I suspect there are other students out there with similar misconceptions, so here is a short compilation of things I’ve learned regarding research computers. Imagine that you are a new faculty member, or a postdoc with a little fellowship money, looking to purchase a computer for basic bioinformatics (e.g. genome assembly and annotation). What will you have to spend?
A necessary consideration is how you choose to use a computer in a research setting. Our lab’s “server” is a MacPro workstation (more on that shortly). After some trial and error I settled into an arrangement where I work almost exclusively from my laptop. I use ssh and scp to access the server, and almost never sit in front of it. The rest of this post assumes that the purchaser works in a similar fashion. It’s a matter of personal style; other lab members like to park in front of the monitor (I’d rather park in the coffee shop, or at least my office). Unlike almost everyone else in the world of bioinformatics, my laptop is a PC. Again, a matter of style. It works great for me. The point is that it works almost seamlessly with the MacPro, and with every other Unix/Linux server. It really doesn’t matter what you connect with; the server is the important thing. It’s worth bringing this up because, as a graduate student, I’d rather have a cheap laptop and a decent remote workstation than a nice laptop (of course in an ideal world you would have both). If you can’t ask your advisor for both, ask for the workstation…
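For anyone new to this workflow, the day-to-day commands are minimal. Here’s a sketch – the server address, directory, and job script are hypothetical placeholders, so substitute your own:

```shell
# Hypothetical server address and project directory -- substitute your own.
SERVER="user@macpro.lab.example.edu"
REMOTE_DIR="/Volumes/data/assembly"

# Log in to the workstation for an interactive session:
#   ssh "$SERVER"

# Push raw reads up to the server, and pull results back down:
#   scp reads.fastq.gz "$SERVER:$REMOTE_DIR/"
#   scp "$SERVER:$REMOTE_DIR/contigs.fasta" .

# Long jobs survive a closed laptop if started under nohup
# (or inside screen/tmux), e.g. with a hypothetical job script:
#   ssh "$SERVER" "cd $REMOTE_DIR && nohup ./run_assembly.sh &"

echo "connect with: ssh $SERVER"
```

Nothing here is Mac-specific – the same commands work against any Unix/Linux machine you can reach over the network.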
Our MacPro was purchased back in 2008, for all the right reasons. Next generation sequencing, which generates orders of magnitude more data than older methods, had just been introduced. A Mac with 8 G of RAM and 8 cores seemed like more than enough computer. Nobody in either of the labs using the computer knew much about Unix or Linux, so it didn’t seem like there was much of a choice. Unfortunately this decision violated a cardinal rule of computer purchasing – a computer deemed sufficient for the job you are buying it for is likely to be insufficient for the next job. It’s hard to justify spending money on additional resources without a clear use for them, but in some cases it just has to be done. When I started working with next generation data I upped the RAM to 24 G and added an additional 2 T HD. These upgrades make the MacPro suitable for 75 % of the tasks I need to perform on a day to day basis, but the other 25 % is kind of essential…
The top of the line MacPro today, with 8 T of storage, 12 cores, and 64 G of RAM, will run you around $6,600 (tower only). For a long time I assumed this was the only way to go – more capable workstations must be prohibitively more expensive, right? Since there’s a lot that you can’t do with only 64 G of RAM and 12 cores, I assumed that such work was the sole purview of well-funded labs. Some time back I explained this dilemma to a colleague in our geology department who deals with large geospatial datasets, and who clued me in on the actual price of computers.
Macs are friendly, but some friends can be bossy and expensive to hang out with. In this case it takes a bit of tinkering to make a Mac work well as a Unix server, and if that’s how you’re using it then you aren’t using the features you’ve paid a lot of money for – namely the Mac OS GUI front end. True Unix/Linux workstations are comparatively cheap. For ~ $5 K I found two local companies who could provide a server-grade tower with 128 G RAM (minimal for many applications, but expandable to 512 G) and two 8-core Intel Xeon processors (16 cores total). There’s plenty that such a machine can’t do, but it’s not a bad start.
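When you’re comparing machines like these (or checking what an institutional server actually gives you), a handful of standard Linux utilities will report the specs that matter; a quick sketch:

```shell
# Report the resources that matter for bioinformatics work on a
# Linux workstation (run locally or over ssh):
nproc               # number of logical cores available
free -g | head -2   # total and used RAM, in gigabytes
df -h .             # free storage on the current volume
```

On a Mac the equivalents differ (e.g. `sysctl -n hw.ncpu` for the core count), which is one more small way a true Linux server keeps things simple.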
I know that other groups have migrated to Amazon EC2 for research computing, but there are clear downsides. Amazon “instances” are expensive, and limited to 64 G RAM. If you aren’t a computer expert you could spend a lot of money just trying to figure out how to make things work. It also seems that data exploration would be limited by cost, and by the slight difficulty inherent in setting up instances. Most research institutions have some kind of high-performance computing environment; I’ve used the University of Washington’s Hyak cluster for a lot of my work. This environment is great when many processors are desired, but, as with Amazon EC2, it is expensive and memory limited. I’m pretty new at this and just getting a sense of what’s out there. If you’ve had a good or bad experience with any of this technology, leave a comment!