HepSim: a repository with predictions for high-energy physics experiments

A file repository for calculations of cross sections and kinematic distributions using Monte Carlo generators for high-energy collisions is discussed. The repository is used to facilitate effective preservation and archiving of data from theoretical calculations, as well as for comparisons with experimental data. The HepSim data library is publicly accessible and includes a number of Monte Carlo event samples with Standard Model predictions for current and future experiments. The HepSim project includes a software package to automate the process of downloading and viewing online Monte Carlo event samples. Data streaming over a network for end-user analysis is also discussed.


Introduction
Modern theoretical predictions quickly become CPU intensive. A possible solution to facilitate comparisons between theory and data from high-energy physics (HEP) experiments is to develop a public library that stores theoretical predictions in a form suited for the calculation of arbitrary experimental distributions on commodity computers. The need for such a library is driven by the following developments:
• The Standard Model (SM) predictions should be substantially improved in order to find new physics that can potentially hide within theoretical uncertainties, which are currently at the level of 5%-10% for quantum chromodynamics (QCD) theory. Currently, such uncertainties are the main limiting factor for precision measurements, as well as for searches for new physics beyond the SM. An increase in theoretical precision leads to highly complex, CPU-intensive computations. Such calculations are difficult to perform on commodity computers. In many cases, it is easier to read events with predictions generated after a proper validation, rather than generating them for every measurement or experiment.
• Searches for new physics often include event scans in different kinematic domains. This means that the outputs from theoretical predictions should be sufficiently flexible to accommodate large variations in event selection requirements and to narrow down search results. A theory "frozen" in the form of histograms is often difficult to deal with since histograms need to be computed for each experimental cut.
• The current method to generate predictions for experimental papers lacks transparency. Usually, such calculations are done by experiments using computational resources that are often unavailable to theorists. Theoretical calculations are typically done through "private" communication between data analyzers and theorists, without public access to the original code or to the data that result from the computations performed for publications. Publicly available common samples with SM predictions would, for example, be useful for comparisons between different experiments, which often use different selection cuts.
Let us give an example illustrating the first point. A single calculation of the γ+jet cross section at next-to-leading order (NLO) QCD typically requires several hours on a commodity computer. Sufficient statistical precision for a falling transverse momentum spectrum, p_T(γ), typical for HEP, requires several independent calculations with different minimum p_T(γ) cuts. Next, the calculations of theoretical uncertainties, such as those with renormalisation scale variations or with different sets of the input parton density functions (PDFs), require several additional runs. Thus, a single high-quality prediction for a publication may require up to 1000 CPU hours. Finding a method to store Monte Carlo (MC) events with full systematic variations in highly compressed archive files that can be processed by experimentalists and theorists becomes essential. We will come back to this example in the next sections.
The creation of a library with common data from theoretical models for HEP experiments can be an important step to simplify data analysis and to ensure proper validation, accessibility and long-term preservation for new uses. The idea of storing MC predictions (including NLO calculations) in the form of "n-tuples", i.e. an ordered list of records with detailed information on separate (weighted or unweighted) events, is not new; one way or another, many MC and NLO programs can write data on an event-by-event basis into files that can subsequently be read by analysis programs. The missing parts of this approach are a common standard layout for such files, transparent public access, and an easy-to-use software toolkit to process such data for an arbitrary experimental observable. The HepSim project aims to achieve this goal.
A number of community projects exist that simplify theoretical computations and comparisons with experimental data, such as MCDB (a Monte Carlo database) [1], Professor [2] (a tuning tool for MC event generators), Rivet [3] (a toolkit for validation of MC event generators), APPLgrid [4] (a method to reproduce the results of full NLO calculations with any input parton distribution set) and JetWeb [5] (a WWW interface and database for Monte Carlo tuning). Among these tools, the closest repository that focuses on storing data with theoretical predictions is the MCDB Monte Carlo database developed within the CMS Collaboration. This publicly available repository mainly includes the CompHEP MC events [6] in the HepML format [7]. This paper discusses a public repository with Monte Carlo simulations (including NLO calculations) designed for fast calculation of cross sections or any kinematic distribution. This repository was created during the Snowmass Community Studies [8] in 2013, one of the goals of which was archiving MC simulation files for future experiments. In comparison with the MCDB repository, the proposed repository stores files in a highly compressed format that is better suited for archiving, has a simplified data access model with a possibility of data streaming from the web, and includes tools to perform calculations of kinematic distributions.

Technical requirements
A number of software requirements must be met in order to achieve the goal of creating an archive of events from theory predictions for the HEP community:
• Data should be stored in compact files suitable for network communication. In particular, the data format should minimize the usage of fixed-length data types and utilize the "varint" approach, which uses fewer bytes for smaller numbers than for larger, less common, numbers. For example, such data serialization is implemented for integer values in Google's Protocol Buffers library [9]. For typical HEP events, large numbers (such as energies, masses, particle identification numbers etc.) are usually less common, and this can be used for very effective compression. It is desirable that MC event samples have file sizes of the order of tens of GB or less for effective exchange and wide usage.
• An important requirement for public access is to be able to read the data in a number of programming languages, on any computational platform, with a minimum overhead of installing and configuring the software needed for analysis. Therefore, the data format should be multiplatform from the ground up, with the possibility to process such data on Linux, Windows and Mac computers. Likewise, the files should be self-describing and well suited for structured data, similar to XML. The self-describing feature is needed to store data from different MC generators created by different authors; data attributes can therefore be vastly different and should be accessed by name. The documentation of the data layout should be part of the file, without external documentation of field positions. The programming language used to read the data should be well suited for concurrency (multi-threading).
• Public access via the HTTP protocol is one of the important requirements since this will allow streaming the simulated data to Web browsers which, in the future, may be able to process and analyse the data. Although the samples can be located on the grid, our previous experience shows that sharing event samples using the grid access model is less suited for a wide community due to security restrictions. A more effective data access, such as the GridFTP protocol, can be added in the future.
• When possible, theoretical uncertainties should be encapsulated inside the files. For example, events should include central weights plus all associated systematic variations. Such an "all in one" approach will significantly simplify the calculations: a single pass over the data files can be sufficient to create final predictions with all uncertainties.
• In addition to the general availability of the data with simulations, the project should provide benchmark cross sections and the most representative figures with distributions. All produced plots should be accompanied by analysis programs in order to illustrate the data access.
The above requirements represent a number of software challenges. For example, the usage of the ROOT [10] data-analysis program may be insufficient due to (a) ineffective fixed-length data representation leading to large file sizes (the usage of variable-byte encoding leads to files that are 30-40% smaller, after compression, compared to ROOT and other existing fixed-length data formats); (b) the complexity of dealing with the C++ system programming language. On the other hand, the usage of ROOT should be well supported since this is the main analysis environment for HEP experiments.
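The variable-byte ("varint") integer encoding named in the first requirement can be illustrated with a short sketch. The code below follows the base-128 scheme documented for Protocol Buffers (seven payload bits per byte, with the high bit flagging that more bytes follow). It is a stand-alone illustration written for this text, not code taken from ProMC, and the function names are chosen here for clarity.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        low7 = value & 0x7F            # lowest 7 bits of the remaining value
        value >>= 7
        if value:
            out.append(low7 | 0x80)    # continuation bit set: more bytes follow
        else:
            out.append(low7)           # final byte: continuation bit clear
            return bytes(out)


def decode_varint(data: bytes) -> int:
    """Decode one base-128 varint from the start of `data`."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:            # continuation bit clear: last byte
            return result
        shift += 7
    raise ValueError("truncated varint")


# Small numbers need one byte; a fixed-length 32-bit integer always needs four.
for n in (0, 7, 127, 128, 300, 16384, 2**31 - 1):
    encoded = encode_varint(n)
    assert decode_varint(encoded) == n
    print(f"{n:>12d} -> {len(encoded)} byte(s)")
```

Values below 128 occupy a single byte, while a value close to the 32-bit limit needs five, so the gain depends on small values dominating the record.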
The choice of the programming language may look obvious at first given that C++ is the preferred choice of HEP experiments. However, this can introduce certain limitations since C++ requires professional programming expertise that is typically available only to system programmers. A scripting language, such as Python, should be an essential part of the project.
The scope of this project and its implementation substantially depend on the usage of high-performance computers.

HepSim database
The HepSim repository with reference Monte Carlo events, including leading-order and NLO MC generators, is currently available for validations and checks. The database is accessible using the link given in [11].
HepSim has a front-end that stores metadata using an SQL database engine. The front-end is written in the PHP and JavaScript languages. The MC files are stored on a separate file storage with URL access. There is no requirement to store data on the same web server; data can be scattered over multiple URL locations and it is up to the user to document data locations. The SQL front-end can be used to search the database using the dataset description, MC generator name, production process, cross section and other metadata that are included in the description of MC files.
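To make the idea of a searchable metadata record concrete, the following sketch builds a small in-memory SQL table and queries it by generator name and process keyword. Everything in it is hypothetical: the table layout, column names, dataset names, URLs and numerical values are placeholders invented for illustration and do not reflect the actual HepSim schema or its contents (the real front-end is written in PHP and JavaScript).

```python
import sqlite3

# Hypothetical metadata table; column names are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE datasets (
        name             TEXT PRIMARY KEY,
        process          TEXT,
        generator        TEXT,
        cross_section_pb REAL,
        file_format      TEXT,
        url              TEXT
    )
    """
)

# Placeholder records: the names, values and URLs carry no physical meaning.
rows = [
    ("sample_a", "gamma+jet", "PYTHIA", 1.0, "ProMC", "http://example.org/data/sample_a/"),
    ("sample_b", "gamma+jet", "Jetphox", 2.0, "ProMC", "http://example.org/data/sample_b/"),
    ("sample_c", "ttbar", "MadGraph", 3.0, "ProMC", "http://example.org/data/sample_c/"),
]
conn.executemany("INSERT INTO datasets VALUES (?, ?, ?, ?, ?, ?)", rows)


def find_datasets(conn, generator=None, keyword=None):
    """Return (name, url) pairs matching an optional generator and process keyword."""
    query = "SELECT name, url FROM datasets WHERE 1=1"
    params = []
    if generator is not None:
        query += " AND generator = ?"
        params.append(generator)
    if keyword is not None:
        query += " AND process LIKE ?"
        params.append(f"%{keyword}%")
    query += " ORDER BY name"
    return conn.execute(query, params).fetchall()


for name, url in find_datasets(conn, keyword="gamma"):
    print(name, url)
```

Because each record carries its own URL, the files themselves can live on any server, which matches the separation between the metadata front-end and the file storage described above.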
In order to add an entry to the database, a user should be registered. Upon registration, a dataset can be added by creating a metadata record with the dataset name, physics process, the name of the MC generator, file sizes, a short text description, file format and the URL location of the dataset. Figure 1 shows the HepSim database front-end that lists available samples.
The help menu of HepSim describes how to perform a bulk download of multiple files from the repository and how to read events with minimum requirements for software setup. More advanced usage is explained on the wiki linked to the HepSim web page.

Supported data formats
As a basis for the HepSim public library, the ProMC [12,13] file format has been chosen. This choice is motivated by the possibility to store data with arbitrary layout using variable-byte encoding, including log files from MC generators. ProMC is implemented as a simple, self-contained library that can easily be deployed on a number of platforms including high-performance computers, such as IBM BlueGene/Q.
The ProMC format is based on a dynamic assignment of the number of bytes needed to store integer values, unlike traditional approaches that use fixed-length byte representations. The advantage of this "varint" feature has been discussed in [12,13]. We will illustrate it using another example: to store a single event together with theoretical uncertainties created by an NLO program, one needs to write the information on a few particles together with the event weights representing theoretical uncertainties. For the γ+jet example discussed previously, we need to store a few particles from the hard scattering (where one outgoing particle is a photon), together with event weights from different sets of PDFs. Although the central weight can be stored as a floating-point number without losing numerical precision, the other weights can be encoded as integer values representing deviations from the central weight. This approach can take advantage of the compact varint "compression". For example, if a central weight, denoted as PDF(0), is estimated with the MSTW2008 PDF [14], the 40 associated eigenvector sets for PDF uncertainties can be represented as integer numbers w_n (n = 1, ..., 40) that give the relative deviation of each eigenvector weight from PDF(0) multiplied by 1000, i.e. in units of 0.1% with respect to the central weight PDF(0). The factor 1000 is arbitrary and can be changed depending on the required precision. In many cases, the integer values w_n are close to 0, leading to 1-2 bytes in the varint encoding. Therefore, a single event record with all associated eigenvector PDF sets will use less than 100 bytes.
The ProMC files can be read in a number of programming languages supported on the Linux, Windows and Android platforms. The default language to read and process files is chosen to be Java, since it is well suited for web-application programming and is available on all major computational platforms. This choice may not be convenient for the current HEP experiments, which base their reconstruction software on C++. Therefore, the C++ language is also fully supported.
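Returning to the integer encoding of PDF weights described earlier in this section, the sketch below converts variation weights into integer deviations in units of 0.1% of the central weight and counts the bytes they would occupy as varints. Several details are assumptions made for illustration: the sign convention (variation minus central), the rounding to the nearest integer, and the use of the "ZigZag" mapping that Protocol Buffers provides for signed integers; the text does not state which conventions ProMC files actually use. The weights are arbitrary numbers, not generator output.

```python
SCALE = 1000  # one integer unit corresponds to 0.1% of the central weight


def encode_weights(central, variations, scale=SCALE):
    """Integer deviations of each variation weight relative to the central weight.

    Assumed convention: positive values mean the variation exceeds the central weight.
    """
    return [round(scale * (w / central - 1.0)) for w in variations]


def decode_weights(central, deviations, scale=SCALE):
    """Recover approximate variation weights from the integer deviations."""
    return [central * (1.0 + d / scale) for d in deviations]


def zigzag(n: int) -> int:
    """Map signed to non-negative integers so that small |n| stays small."""
    return 2 * n if n >= 0 else -2 * n - 1


def varint_size(value: int) -> int:
    """Number of bytes a non-negative integer occupies as a base-128 varint."""
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size


# Arbitrary illustrative weights (not taken from any generator).
central = 2.0
variations = [2.0, 2.01, 1.98, 2.1, 1.9]

deviations = encode_weights(central, variations)
n_bytes = sum(varint_size(zigzag(d)) for d in deviations)
print(deviations)
print(f"{n_bytes} bytes as varints vs {8 * len(variations)} bytes as doubles")
```

Under these assumptions a deviation of up to about 6% in either direction fits in a single byte, which is consistent with the 1-2 bytes per weight quoted above; the price is that the decoded weights are only accurate to the chosen 0.1% granularity.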
During the last ten years, we have witnessed a rapid transition from system programming languages, such as C or C++, to dynamic scripting languages (such as Python, Ruby etc.). CPython (the Python language implemented in C) is commonly used by HEP experiments. Taking into account the popularity of this dynamic language, Jython, an implementation of the Python programming language in Java, was used to create example analysis programs. The Jython language has semantics similar to Python, but uses the Java Virtual Machine, which ensures platform independence of the analysis environment.
The underlying Java backend of the repository also implies that analysis programs can be deployed in other popular scripting languages, such as Groovy, JRuby (an analog of Ruby), BeanShell etc., since such languages are fully integrated into the Java platform. The program code can be developed using a full-featured free IDE for the Java programming language, such as Eclipse, NetBeans or ScaVis [15,16].

Available datasets
Currently, the HepSim repository contains events generated by the PYTHIA [19], MadGraph [20], Jetphox [21], MCFM [22], NLOJet++ [23], FPMC [24] and HERWIG++ [25] generators. The repository includes event samples for pp colliders with centre-of-mass energies of 8, 13, 14 and 100 TeV. In some cases, together with the detailed information on produced particles, full sets of theoretical uncertainties (scale, PDF, etc.) are embedded inside the files as discussed in Sect. 3. A number of processes were generated using the IBM BlueGene/Q (located at the Argonne Leadership Computing Facility), the description of which is beyond the scope of this paper.
A typical dataset consists of multiple files. The total size of a single dataset is less than 10 GB. One file typically has 10,000 events with a total size of 80 MB. In many cases, the event records are "slimmed" by removing unstable particles and final-state particles with transverse momentum less than 300-400 MeV. The most essential parton-level information on vector bosons, b-quarks and t-quarks is kept. Each ProMC file, when possible, includes deviations from the central event weight in the form of integer values as discussed in the previous section. This typically leads to a very compact representation of events from NLO generators using the "varint" encoding since large systematic deviations are less common than small ones.
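The slimming step described above can be sketched as a simple filter over an event record. The representation used here is hypothetical: each particle is a plain dictionary with a PDG identification code, a HepMC-style status code (1 for stable final-state particles, 2 for decayed ones) and momentum components in GeV. The set of "essential" particle codes (b, t, Z, W) and the 0.3 GeV threshold are assumptions chosen to match the text; the selection actually applied to a given HepSim sample may differ.

```python
import math

KEEP_PDG = {5, 6, 23, 24}   # |PDG code| of b, t, Z and W: assumed "essential" list
FINAL_STATE = 1             # HepMC-style status code for stable final-state particles


def slim_event(particles, min_pt=0.3):
    """Keep essential parton-level particles and final-state particles above min_pt (GeV)."""
    kept = []
    for p in particles:
        pt = math.hypot(p["px"], p["py"])              # transverse momentum
        is_essential = abs(p["pdg"]) in KEEP_PDG
        is_hard_final = p["status"] == FINAL_STATE and pt >= min_pt
        if is_essential or is_hard_final:
            kept.append(p)
    return kept


# Toy event with made-up momenta, for illustration only.
event = [
    {"pdg": 23,   "status": 2, "px": 10.0, "py": 5.0},   # Z boson: kept as essential
    {"pdg": 113,  "status": 2, "px": 1.0,  "py": 1.0},   # decayed rho0: removed
    {"pdg": 211,  "status": 1, "px": 0.4,  "py": 0.3},   # pion, pT = 0.5 GeV: kept
    {"pdg": -211, "status": 1, "px": 0.1,  "py": 0.0},   # soft pion, pT = 0.1 GeV: removed
]

slimmed = slim_event(event)
print(f"kept {len(slimmed)} of {len(event)} particles")
```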

Data analysis
The HepSim repository is useful for fast reconstruction of theoretical cross sections and distributions from four-momenta of particles using experiment-specific selection, reconstruction and histogram bins. The files can be used as inputs for the DELPHES fast detector simulation program [26], which has a built-in reader to process the ProMC files.
For easy accessibility through the Web, Jython scripts have been prepared. They show how to read the ProMC files with simulation data and how to reconstruct cross sections when event weights are required (i.e. for NLO programs). The example code snippets are based on the ScaVis [15,16] project, but any Java-based IDE should be sufficient to create analysis codes as long as the needed jar library is included in the Java classpath. The data-analysis scripts can read data either through the HTTP protocol (this feature is enabled only for single files) or from files stored on a local file system. In both cases, the ScaVis data-analysis framework can help to read data on any computational platform using different scripting languages, such as Jython, Groovy, Ruby, BeanShell etc., or using the full-featured Java language. Such scripts can be modified and adjusted to create plots after changing data-selection cuts or histogram bins, or they can be used as templates to develop ROOT/C++ and CPython analysis codes.
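As a minimal sketch of the kind of weighted-event analysis such scripts perform, the code below fills a differential spectrum for the central weight and for every systematic variation in a single pass over the events. It does not use the ProMC, ScaVis or HepSim APIs: events are represented as hypothetical tuples (p_T, central weight, list of integer deviations), the deviations are assumed to be in units of 0.1% with positive values meaning a larger weight, and the weights are assumed to be normalised so that their sum gives a cross section. All numbers are illustrative.

```python
from bisect import bisect_right


def fill_spectra(events, edges, n_var, scale=1000):
    """Return d(sigma)/d(pT) per bin for the central weight and n_var variations.

    events : iterable of (pt, central_weight, deviations), with len(deviations) == n_var
    edges  : increasing list of bin edges
    Row 0 of the result is the central prediction, row k the k-th variation.
    """
    n_bins = len(edges) - 1
    sums = [[0.0] * n_bins for _ in range(n_var + 1)]
    for pt, w0, deviations in events:
        if pt < edges[0] or pt >= edges[-1]:
            continue                                   # outside the histogram range
        b = bisect_right(edges, pt) - 1                # index of the bin containing pt
        sums[0][b] += w0
        for k, d in enumerate(deviations, start=1):
            sums[k][b] += w0 * (1.0 + d / scale)       # decode the k-th variation weight
    widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
    return [[s / w for s, w in zip(row, widths)] for row in sums]


# Toy input: (pT in GeV, central weight, two integer deviations).
events = [
    (10.0, 2.0, [100, -100]),
    (60.0, 1.0, [0, 500]),
    (20.0, 3.0, [0, 0]),
    (150.0, 1.0, [0, 0]),      # falls outside the binning and is skipped
]
edges = [0.0, 50.0, 100.0]

spectra = fill_spectra(events, edges, n_var=2)
for label, row in zip(("central", "var 1", "var 2"), spectra):
    print(label, row)
```

Changing a selection cut or the bin edges only requires re-running this loop over the stored events, which is the flexibility that histogram-only predictions lack.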
The processing time for each MC sample is less than 30 min on a desktop computer while, in some cases, the CPU time to generate such event samples is more than 1000 CPU hours on IBM BlueGene/Q of the Argonne Leadership Computing Facility. To increase the communication between theorists and experimentalists, the data can be analyzed on any Java-enabled operating system (Windows, Mac, Linux, etc.). The online HepSim manual [11] includes a description of how to download the event samples and how to read the data using Java, C++/ROOT and CPython.