Efficient automation and the data deluge

23-05-2011

The ESRF has put in place a future-proofed computing strategy to cover everything from storage infrastructure and data acquisition systems to online data analysis and user interfaces. As Andy Götz explains, that strategy is one of the hidden enablers of the laboratory’s long-term mission to support data-intensive science.

The fundamental question underlying all computing is “what can be (efficiently) automated?”

The above statement, attributed to the Association for Computing Machinery, is a fitting description of the core computing activity at the ESRF, where efficient automation is very much the headline goal. To understand why, it’s necessary to go back to basics. The ESRF is an example of what the late Jim Gray, a pioneering computer scientist, called the “fourth paradigm” in science. He was referring to data-intensive science – the other three paradigms being theory, experimentation and computation.

The origin of the fourth paradigm is the so-called “data deluge” resulting from the latest generation of big-science experiments – analytical instruments like the ESRF and the Large Hadron Collider at CERN, among others. The ESRF, for its part, will generate approximately 1 petabyte (1 PB, or 10^15 bytes) of raw data in 2011. What’s more, that data output is growing exponentially, creating stresses and strains on the four main building blocks of data-intensive computing: data capture, data curation, data analysis and data visualisation.

So how is the ESRF responding to the challenges posed by the data deluge?

Data capture

Let’s start by considering data acquisition. The capture of raw data is still the main source of data at the ESRF; simulation and analysed data are present on a much smaller scale. The challenge for data-acquisition computing is how to handle the very high data rates from the detector to the hard disk. Detectors producing hundreds of megabytes per second are in routine operation today, while the next generation, producing data at gigabytes per second, is already in the works.

Special measures are therefore required to read out the detectors – either using proprietary protocols or multiple Ethernet links – before transferring data to central storage. The key enablers here are high-performance file-storage systems, and a lot of effort is currently being invested in finding the right combination of network, file servers and buffer computers to handle these high data rates.
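
A rough illustration of the buffer-computer idea is sketched below in Python (frame sizes and rates are made-up placeholders, not ESRF figures): a bounded in-memory queue decouples a fast detector readout from a slower sustained write to central storage, so short bursts from the detector are absorbed rather than lost.

```python
# Minimal sketch of the buffer-computer idea: a bounded queue decouples a fast
# detector readout (producer) from a slower write to central storage (consumer).
# Frame sizes and sleep times are illustrative placeholders, not ESRF figures.
import queue
import threading
import time

FRAME_MB = 4                        # hypothetical frame size
N_FRAMES = 100                      # hypothetical burst length
buffer = queue.Queue(maxsize=32)    # intermediate buffer, 32 frames deep

def detector_readout():
    """Produce frames at a (simulated) high burst rate."""
    for _ in range(N_FRAMES):
        frame = bytes(FRAME_MB * 1024 * 1024)   # placeholder payload
        buffer.put(frame)                       # blocks once the buffer is full
        time.sleep(0.005)                       # fast producer
    buffer.put(None)                            # sentinel: end of burst

def storage_writer():
    """Drain the buffer to (simulated) central storage at a lower sustained rate."""
    written = 0
    while (frame := buffer.get()) is not None:
        time.sleep(0.01)                        # slower consumer
        written += len(frame)
    print(f"wrote {written // (1024 * 1024)} MB to storage")

producer = threading.Thread(target=detector_readout)
consumer = threading.Thread(target=storage_writer)
producer.start()
consumer.start()
producer.join()
consumer.join()
```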

Online analysis

Right now, one of the hot topics in ESRF computing is online data analysis. What this means is automated data analysis at the beamline, giving scientists a first look at the results so that they can judge the quality of the data and make decisions about the experiment in real time (see SAXS speeds ahead with real-time visualizations).

A key enabling technology in this regard is a framework called EDNA – developed in collaboration with the Diamond Light Source in the UK and other facilities/laboratories – to wrap data analysis programs and automate their execution. The automation can run on a single multicore computer. At the same time, EDNA is being extended to run efficiently on a cluster of multicore computers using a batch scheduler.
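
The snippet below is not the real EDNA API, just a minimal Python sketch of the same pattern: an external analysis program is wrapped behind a single function, and independent runs are fanned out over the cores of one machine with a process pool. The executable name and data path are hypothetical placeholders.

```python
# Not the real EDNA API: a minimal sketch of wrapping an external analysis
# program and running independent jobs in parallel on one multicore machine.
# The executable name and the data path are hypothetical placeholders.
import subprocess
from multiprocessing import Pool
from pathlib import Path

def run_analysis(image_path):
    """Wrap a single invocation of an external analysis program."""
    result = subprocess.run(
        ["some_analysis_program", str(image_path)],   # hypothetical executable
        capture_output=True,
        text=True,
    )
    return image_path, result.returncode

if __name__ == "__main__":
    images = sorted(Path("/data/raw").glob("*.edf"))  # hypothetical data directory
    with Pool() as pool:                              # one worker per core by default
        for path, status in pool.imap_unordered(run_analysis, images):
            print(f"{path.name}: {'ok' if status == 0 else 'failed'}")
```

On a cluster, the same per-file wrapper would instead be handed to a batch scheduler, one job per input file.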

For certain applications (such as tomography), it is possible to accelerate online data analysis significantly with the help of graphics processing units (GPUs) – although automating the data analysis depends on capturing the correct metadata during the experiment. Ultimately, the objective is to speed up online data analysis sufficiently to make it attractive for scientists to use. With this in mind, a new user interface based on a workbench concept will enable scientists to set up online data-analysis workflows and browse the results. At the same time, EDNA, which was originally developed for online analysis in macromolecular crystallography (MX), is being extended to other domains such as diffraction tomography. In the long term, the goal is to provide online data analysis for all standard techniques and all upgrade beamlines.

Offline analysis

The distinction between online and offline data analysis is subtle. Online refers to automatically triggered data analysis, whereas offline refers to manually triggered analysis – though both may run the same data analysis programs. The ESRF has developed – and currently maintains – a number of data analysis programs, including PyMCA, PyHST, XOP, Shadow, Fit2D and Fable, to mention a few. It also contributes to programs developed elsewhere, such as BigDFT; in fact, the majority of offline data-analysis programs are developed outside the ESRF.

The main challenge for data analysis programs is to be able to profit from the massive parallelisation offered by GPUs and the new generations of multicore CPUs. This involves analysing the existing codes to find bottlenecks and then rewriting them to exploit parallelism. Such parallelisation can bring much bigger performance improvements than Grid-based solutions: speed-ups of up to a factor of 40 have been achieved. In 2009, a study was carried out to evaluate the EGEE Grid for data analysis as part of the European Union Seventh Framework Programme (FP7)-funded ESRFUP project.

The study concluded that the Grid is unsuitable for data-intensive problems. As a result, the Grid has been abandoned at the ESRF (a change of strategy compared to plans previously outlined in the lab’s “Purple Book”). The ESRFUP findings reveal that it is far more efficient to speed up data analysis at the source – using GPUs, for example – than ship huge quantities of data externally and then analyse them. That said, Grid or cloud computing could still come into play for user communities external to the ESRF working on non-data-intensive problems.
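
A back-of-the-envelope application of Amdahl's law helps explain why the bottlenecks have to be found first: the factor of 40 quoted above is a kernel-level gain, and it only translates into a comparable overall speed-up once nearly all of the run time sits in the accelerated code. The fractions below are illustrative, not measured ESRF profiles.

```python
# Amdahl's law: the overall speed-up is limited by the fraction of the run time
# that is actually accelerated. The fractions below are illustrative only; the
# factor of 40 is the kernel-level speed-up quoted in the text.
def overall_speedup(parallel_fraction, kernel_speedup):
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / kernel_speedup)

for fraction in (0.5, 0.9, 0.99):
    print(f"{fraction:.0%} of the run on the GPU -> "
          f"{overall_speedup(fraction, 40):.1f}x overall")
# 50% -> ~2.0x, 90% -> ~8.2x, 99% -> ~28.8x overall for a 40x kernel gain.
```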

Data storage

The raw product of the ESRF is data. What’s more, that data production is doubling roughly every 18 months. Ten years ago, petabyte-scale storage requirements were difficult to imagine; today the petabyte has become the standard unit for data-producing facilities like the LHC (15 PB/year) and the ESRF (a few PB/year). By way of context, Google processes roughly 24 PB/day. The challenge for the ESRF is to manage these petabyte-scale data requirements efficiently and within budget.
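
For a sense of scale, a simple extrapolation of the figures quoted above (roughly 1 PB of raw data in 2011, doubling every 18 months) is sketched below; it illustrates the trend rather than forecasting actual ESRF volumes.

```python
# Illustrative extrapolation of the quoted figures: ~1 PB of raw data in 2011,
# doubling roughly every 18 months. A trend line, not a capacity plan.
ANNUAL_PB_2011 = 1.0
DOUBLING_TIME_YEARS = 1.5

for year in (2011, 2014, 2017, 2020):
    volume = ANNUAL_PB_2011 * 2 ** ((year - 2011) / DOUBLING_TIME_YEARS)
    print(f"{year}: ~{volume:.0f} PB of raw data per year")
# 2011: ~1, 2014: ~4, 2017: ~16, 2020: ~64 PB/year if the trend were to hold.
```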

That challenge starts as soon as the data are generated. Issues such as which data format to use, which metadata to record, how big files should be and which database to adopt – all these and more need to be addressed. To date, however, these questions have not been tackled in a homogeneous manner across the different experiments at the ESRF.

The MX community has traditionally been the most advanced in managing its data in a database. Over time, such systematic management of data will be extended to all beamlines. This will include the adoption of the HDF5 data format and the generation and indexing of metadata in a database (to allow the tracking of data and online searching of metadata). These are essential steps towards data curation (i.e. long-term storage to enable data to be reanalysed by scientists other than the principal investigator of the experiment).
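
A minimal sketch of that combination is shown below, using h5py and SQLite (the file names, beamline and metadata fields are hypothetical): the raw data and its metadata go into an HDF5 file, and the same metadata is indexed in a small database so that scans can be searched without opening every file.

```python
# Minimal sketch (hypothetical file and field names): data plus metadata written
# to HDF5, and the metadata indexed in a small database for later searching.
import sqlite3
import h5py
import numpy as np

# 1. Store the raw data in HDF5 with the experiment metadata as attributes.
metadata = {"beamline": "ID00", "sample": "demo", "energy_keV": 12.4}
with h5py.File("scan_0001.h5", "w") as f:
    dset = f.create_dataset("entry/data",
                            data=np.zeros((10, 512, 512), dtype="uint16"))
    for key, value in metadata.items():
        dset.attrs[key] = value

# 2. Index the metadata so scans can be found without opening the files.
con = sqlite3.connect("scan_index.db")
con.execute("CREATE TABLE IF NOT EXISTS scans "
            "(path TEXT, beamline TEXT, sample TEXT, energy_keV REAL)")
con.execute("INSERT INTO scans VALUES (?, ?, ?, ?)",
            ("scan_0001.h5", metadata["beamline"],
             metadata["sample"], metadata["energy_keV"]))
con.commit()

# 3. Later: query the index first, then read only the matching HDF5 file.
row = con.execute("SELECT path FROM scans WHERE beamline = 'ID00'").fetchone()
print("matching scan:", row[0])
con.close()
```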

Currently, data are deleted one to three months after the experiment. This will change as the cost of tape archives drops. Another notable trend is towards open publication of experimental data online (e.g. the Protein Data Bank (PDB) for proteins and the palaeontology microtomographic database).

Inside the data centre

All aspects of computing rely on a properly dimensioned and efficient IT infrastructure. With an eye on the future, the ESRF has invested in a large central data facility that will soon be ready to receive its first equipment. The upgraded data centre will provide a high-quality environment for central data storage, compute power and network electronics, as well as the software needed to access and manage these resources. In terms of specifics, the facility will soon be equipped with state-of-the-art file servers (capable of storing 1.5 PB), a tape-based archiving facility (of several petabytes), compute clusters with a peak performance of 15 Tflops and an extensive 10 Gbit/s Ethernet infrastructure.

One of the biggest challenges is to provide very-high-performance file storage that can write data from the beamlines and read it back during the analysis phase, both at high speed. Such online data analysis places a big strain on file storage because of the simultaneous read-and-write access requirements. The current set-up provides 300 MB/s write access and up to 180 MB/s read access. The goal over the next 18 months is to increase that performance by an order of magnitude without compromising on reliability.
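
To put those rates in perspective, the short calculation below works out how long a single, hypothetical 10 TB dataset would take to move at the current rates and at a ten-times-faster write rate.

```python
# What the quoted rates mean for one hypothetical 10 TB dataset.
DATASET_TB = 10
for label, mb_per_s in [("current write (300 MB/s)", 300),
                        ("current read (180 MB/s)", 180),
                        ("10x write target (3 GB/s)", 3000)]:
    hours = DATASET_TB * 1e6 / mb_per_s / 3600
    print(f"{label}: ~{hours:.1f} h for {DATASET_TB} TB")
# ~9.3 h to write, ~15.4 h to read back, ~0.9 h at the ten-fold target.
```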

Providing efficient access to compute resources like CPUs and GPUs is essential for online and offline data analysis. For the last 10 years, the ESRF has used the Condor batch scheduler for this task. Recently, a new resource-management scheduler called OAR has been evaluated and the first results are promising. The challenge is to manage resources to better reflect the way that they are used (i.e. as clusters of CPU cores or GPUs).
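
As an illustration of the batch-scheduler route, a wrapped analysis run can be handed to a Condor-style scheduler as sketched below. Only the classic Condor submit-description keywords and the condor_submit command are assumed; the job script and data file names are hypothetical placeholders.

```python
# Hypothetical example of handing one wrapped analysis job to Condor.
# Only the classic submit-description keywords and condor_submit are assumed;
# the executable and data file names are placeholders.
import subprocess
from pathlib import Path

submit_description = """\
executable = run_analysis.sh
arguments  = scan_0001.h5
output     = scan_0001.out
error      = scan_0001.err
log        = scan_0001.log
queue
"""

Path("scan_0001.submit").write_text(submit_description)
subprocess.run(["condor_submit", "scan_0001.submit"], check=True)
```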

Against this backdrop of evolving IT requirements, one thing at least is certain: the data deluge shows no sign of slowing. For the ESRF’s computing team, the search for ever-more efficient automation goes on.

 

A focus on international scientific collaboration
Innate complexity and human-resource overheads dictate that many instrumentation and software developments in photon science can no longer be undertaken by a single laboratory.

Equally, another significant shift is occurring among scientific users of the ESRF – many of whom now routinely require access to complementary methods (e.g. neutrons and photons) as well as beam time at several synchrotron facilities to complete their investigations.

Such drivers call for much stronger standardisation of practices, data formats, automation, user interfaces and the like. The ESRF is engaged in several large collaborations with partners in Europe and beyond to make progress on these issues. TANGO, PaN-data, LinkSCEEM-2, EDNA and ISPyB are well-established projects along those lines, while the CRISP and PaN-data ODI proposals will soon be decided upon.

 

Andy Götz

 

This article appeared in ESRFnews, March 2011. 

To register for a free subscription and to rapidly receive the current issue, please go to:

http://www.esrf.fr/UsersAndScience/Publications/Newsletter/esrfnewsdigital

 

Top image: The ESRF’s upgraded data centre will soon be ready to receive its first equipment, including state-of-the-art file servers and tape-based archiving.