Tinkering With Supercomputers: Experimental Cluster for Researchers Launched

First Posted: Jul 10, 2013 02:26 PM EDT

There is no doubt about it: to support future scientific discovery, computer systems will need to grow larger and many times faster. The next big jump, exascale, will be 1,000 times faster than the petascale computers that debuted in 2008. Many US computing centers, national labs, and centers of excellence provide researchers with high-performance computer resources for running software. However, a testbed at scale – where researchers can study how computer systems grow and what happens when they grow – has been hard to come by.

PRObE (Parallel Reconfigurable Observational Environment), an experimental cluster available to systems researchers across the US, fills that void. Sponsored by the US National Science Foundation, it serves as a testbed for large-scale experiments that would be impossible at smaller scales. The cluster is located at the New Mexico Consortium (NMC) in Los Alamos, US.

“PRObE is a unique resource where systems researchers have full access to the servers, including the hardware, while their experiment is running,” explains Andree Jacobson, PRObE project manager, and computer and information systems manager at NMC. “In order to predict what will happen with software and hardware at scale, you need to have a testbed at scale to try things out on. Here, researchers can really tinker; they can power off computers, unplug cables, crash hardware or software, modify the network, all while their experiment is running – something impossible at the national centers.”

“Our hardware is not more powerful than you’ll find at the centers, but we have many more machines to do these kinds of tests on than, say, at a university or in the high-performance computing community.” The testbed computers are, in fact, decommissioned machines donated by Los Alamos National Lab (LANL), located just across the street from NMC.

“Los Alamos National Laboratory regularly sets aside machines to make way for equipment that offers more efficiency per watt and per dollar,” says Garth Gibson, a professor of computer science and engineering at Carnegie Mellon University in Pennsylvania, US. Speaking at the 22nd International ACM Symposium on High-Performance, Parallel, and Distributed Computing in New York City, US, Gibson acknowledged the critical necessity of testing at scale. “Not enough of our students have actually experienced large scale before. Something you can do on 12 processors is not the same problem as it is on a thousand.”

The mastermind behind PRObE is Gary Grider, division leader for high-performance computing at LANL. “The idea actually came about in 2006, while I was on a plane working on paperwork to dispose of a computer. At that time, I was involved with the interagency high-end computing working group at LANL, which was trying to build a community around large-scale storage and file systems. One of the outcries of that community was that they had no place to try out ideas – a system that could be disrupted.”

It took a while to bring the PRObE project to fruition, but Grider has always had his eyes set squarely on the future: “Not only do we need researchers building systems software for high-performance machines; we need people who are able to run and operate those machines.” This conviction ultimately led to summer enrichment opportunities for third-year undergraduates. The Computer System, Cluster, and Networking Summer Institute (CSCNSI) at NMC is aimed at students already engaged in a computer science, computer engineering, or similar major.

Following a merit process, 12 highly sought-after positions are awarded to third-year undergraduates, and these students learn all of the ins and outs of HPC tools, booting massively parallel systems, loading software stacks, and understanding security concerns. After being supported through this two-week, cluster-building ‘bootcamp’ by dedicated faculty, the three-student teams are partnered with a mentor from LANL, usually a systems researcher or administrator who has a project in waiting for each group.

Upon completion of the program, the students’ names are provided (with their consent) to all of the national labs and high-performance centers, as potential future employees with advanced training as systems administrators. “Sixty to seventy students have been through the program, and the track record of placement is phenomenal – even better than here at LANL, and we have 1,200 students a year,” says Grider.

Supercomputers are never going to get smaller, concedes Jacobson. “There are going to be more nodes and more cores, and it will be hard to predict what will happen with software and hardware in the future, unless you have a testbed at scale like PRObE. If PRObE can help in understanding how computer systems grow and what will happen as they grow, this project will have made a major contribution to the future of computing.”

The PRObE clusters, named Marmot, Denali, and Kodiak, are all currently available. Brand-new hardware, Susitna, is slated for release in July 2013. PRObE is currently accepting applications to use Kodiak, its largest cluster. The two smaller clusters, Marmot and Denali, are available for staging. Principal investigators can log on to the portal to request new projects. Success on the staging clusters is required before applications for time on Kodiak will be approved.

PRObE is a collaboration between the US National Science Foundation, New Mexico Consortium, Carnegie Mellon University, and the University of Utah.

For more about PRObE, visit the FAQ, portal, or website, or join the Google Group.