Computer Scientists Break Terabyte Sort Barrier

Computer scientists from the University of California, San Diego, have recently reported setting a world-record, by sorting huge amounts of data in minimal time. This record might have significant impact on various industries that rely on data management, since the technology introduced can be applied to almost all platforms.
 To break the terabyte barrier for the Indy Minute Sort, the computer science researchers built a system made up of 52 computer nodes. Each node is a commodity server with two quad-core processors, 24 gigabytes (GB) memory and sixteen 500 GB disks – all inter-connected by a Cisco Nexus 5020 switch. Cisco donated the switches as a part of their research engagement with the UC San Diego Center for Networked Systems. The compute cluster is hosted at Calit2. (Credit: University of California, San Diego)
To break the terabyte barrier for the Indy Minute Sort, the computer science researchers built a system made up of 52 computer nodes. Each node is a commodity server with two quad-core processors, 24 gigabytes (GB) memory and sixteen 500 GB disks – all inter-connected by a Cisco Nexus 5020 switch. Cisco donated the switches as a part of their research engagement with the UC San Diego Center for Networked Systems. The compute cluster is hosted at Calit2. (Credit: University of California, San Diego)

The new record was accomplished during the 2010 “Sort Benchmark”, when the team of scientists from the UC San Diego broke the terabyte barrier: sorting more than one terabyte of data in just 60 seconds. One terabyte is 1,000 gigabytes, equivalent to approximately 40 single-layer Blu-Ray discs or 213 single-layer DVDs. In total, they sorted one trillion data records in 172 minutes. In comparison to the previous record holder, they did so using 4 times fewer computing resources.

Internet advertisements – such as the ones on Facebook, custom recommendations on Amazon, and up-to-the-second search results on Google – all result from sorting data sets. These sets are enormous, sized at thousands of terabytes. Various companies that look for trends, efficiencies and other competitive advantages could utilize the new techniques to perform heavy-duty data sorting – without exhausting computing resources.

The system is composed of 52 computer nodes; each node is a commodity server with two quad-core processors, 24 gigabytes memory and sixteen 500 GB disks – all inter-connected by a Cisco Nexus 5020 switch. Cisco donated the switches as a part of their research engagement with the UC San Diego Center for Networked Systems. The compute cluster is hosted at Calit2.

The world record was published on Sort Benchmark, the website that runs the competition. Its purpose is to provide benchmarks for data sorting and an interactive forum for researchers working to improve data sorting techniques. The project conducted at UC San Diego was led by computer science professor Amin Vahdat. “If a major corporation wants to run a query across all of their page views or products sold, that can require a sort across a multi-petabyte dataset and one that is growing by many gigabytes every day,” he said. “Companies are pushing the limit on how much data they can sort, and how fast. This is data analytics in real time.” In data centers, sorting is often the most pressing bottleneck in many higher-level activities, he notes.

The team actually won two prizes; the second is for tying the world record for the “Indy Gray Sort” which measures sort rate per minute per 100 terabytes of data. “We’ve set our research agenda around how to make this better…and also on how to make it more general,” said Alex Rasmussen, a PhD student and a team member.

George Porter, a research scientist at the Center for Networked Systems at UC San Diego, takes pride in the project’s result: “We used one fourth the number of computers as the previous record holder to achieve that same sort rate performance – and thus one fourth the energy, and one fourth the cooling and data center real estate.” Both world records are in the “Indy” category – meaning that the systems were designed around the specific parameters of the Sort Benchmark competition.

 
  San Diego computer science PhD student Alex Rasmussen, the lead graduate student on the team. (Credit: University of California, San Diego)
San Diego computer science PhD student Alex Rasmussen, the lead graduate student on the team. (Credit: University of California, San Diego)

Although winning awards and setting records is a distinguished goal, the team’s ultimate objective is to generalize their results for the “Daytona” competition and for use in the real world. “Sorting is also an interesting proxy for a whole bunch of other data processing problems,” said Rasmussen. “Generally, sorting is a great way to measure how fast you can read a lot of data off a set of disks, do some basic processing on it, shuffle it around a network and write it to another set of disks. Sorting puts a lot of stress on the entire input/output subsystem, from the hard drives and the networking hardware to the operating system and application software.”

While current sorting methods apply for most data structures, the major difference when dealing with data larger than 1,000 terabytes is it’s well beyond the memory capacity of the computers doing the sorting. The team’s approach was to design a balanced system, in which computing resources like memory, storage and network bandwidth are fully utilized – and as few resources as possible are wasted.

For example, memory often uses as much or more energy than processors, but the energy consumed by memory gets less attention. “Our system shows what’s possible if you pay attention to efficiency – and there is still plenty of room for improvement,” said Vahdat, holder of the SAIC Chair in Engineering in the Department of Computer Science and Engineering at UC San Diego. “We asked ourselves, ‘What does it mean to build a balanced system where we are not wasting any system resources in carrying out high end computation?’ If you are idling your processors or not using all your RAM, you’re burning energy and losing efficiency.”

TFOT also covered the use of “wet” computing systems to boost processing power, researched at the University of Southampton, and a new compound that may revolutionize chip technology, discovered at the SLAC National Accelerator Laboratory.

For more information about the world record breaking the terabyte sorting barrier, see the official press release.