Modern physics is based on both theoretical analysis and experimental validation. Complex scenarios like subatomic dimensions, high energies, and temperatures near absolute zero are frontiers for many theoretical models. Simulation with stable numerical methods represents an excellent instrument for high-accuracy analysis, experimental validation, and visualization. High performance computing support offers the possibility of running simulations at large scale, in parallel, but the volume of data generated by these experiments creates a new challenge for Big Data Science. This paper presents existing computational methods for high energy physics (HEP) analyzed from two perspectives: numerical methods and high performance computing. The computational methods presented are Monte Carlo methods and simulations of HEP processes, Markovian Monte Carlo, unfolding methods in particle physics, kernel estimation in HEP, and Random Matrix Theory used in the analysis of particle spectra. All of these methods produce data-intensive applications, which introduce new challenges and requirements for ICT systems architecture, programming paradigms, and storage capabilities.
1. Introduction
High Energy Physics (HEP) experiments are probably the main consumers of High Performance Computing (HPC) in the area of e-Science, considering numerical methods in real experiments and assisted analysis using complex simulation. From the discovery of quarks in the last century to the Higgs boson in 2012 [1], all HEP experiments were modeled using numerical algorithms: numerical integration, interpolation, random number generation, eigenvalue computation, and so forth. Data collection from HEP experiments generates data of huge volume, with high velocity, variety, and variability, passing the common thresholds to be considered Big Data. Numerical experiments using HPC for HEP represent a new challenge for Big Data Science.
Theoretical research in HEP is related to matter (fundamental particles and the Standard Model) and basic knowledge of Universe formation. Beyond this, practical research in HEP has led to the development of new analysis tools (synchrotron radiation, medical imaging, or hybrid models [2], wavelets and their computational aspects [3]), new processes (cancer therapy [4], food preservation, or nuclear waste treatment), or even the birth of a new industry (the Internet) [5].
This paper analyzes two aspects: the computational methods used in HEP (Monte Carlo methods and simulations, Markovian Monte Carlo, unfolding methods in particle physics, kernel estimation, and Random Matrix Theory) and the challenges and requirements for ICT systems to deal with processing of Big Data generated by HEP experiments and simulations.
The motivation for using numerical methods in HEP simulations comes from problems that can be formulated using integral or integro-differential equations (or systems of such equations), like the quantum chromodynamics evolution of parton distributions inside a proton, which can be described by the Gribov-Lipatov-Altarelli-Parisi (GLAP) equations [6]; the estimation of the cross section for a typical HEP interaction (a numerical integration problem); and data representation using histograms (a numerical interpolation problem). Numerical methods used for solving differential equations or integrals are based on classical quadratures and Monte Carlo (MC) techniques. These allow generating events in terms of particle flavors and four-momenta, which is particularly useful for experimental applications. For example, MC techniques for solving the GLAP equations are based on simulated Markov chains (random walks), which have the advantage of filtering and smoothing the state vector for estimating parameters.
In practice, several MC event generators and simulation tools are used. For example, the HERWIG (http://projects.hepforge.org/herwig/) project considers an angular-ordered parton shower and cluster hadronization (the tool is implemented in Fortran); the PYTHIA (http://www.thep.lu.se/torbjorn/Pythia.html) project is oriented towards a dipole-type parton shower and string hadronization (the tool is implemented in Fortran and C++); and SHERPA (http://projects.hepforge.org/sherpa/) considers a dipole-type parton shower and cluster hadronization (the tool is implemented in C++). An important tool for MC simulations is GATE (GEANT4 Application for Tomographic Emission), a generic simulation platform based on GEANT4. GATE provides new features for nuclear imaging applications and includes specific modules developed to meet requirements encountered in SPECT (Single Photon Emission Tomography) and PET (Positron Emission Tomography).
The main contributions of this paper are as follows:
introduction and analysis of the most important modeling methods used in High Energy Physics;
identification and description of the computational numerical methods used in High Energy Physics;
presentation of the main challenges for Big Data processing.
The paper is structured as follows. Section 2 introduces the computational methods used in HEP and describes the performance evaluation of parallel numerical algorithms. Section 3 discusses the new challenge for Big Data Science generated by HEP and HPC. Section 4 presents the conclusions and general open issues.
2. Computational Methods Used in High Energy Physics
Computational methods are used in HEP in parallel with physical experiments to generate particle interactions, which are modeled using vectors of events. This section presents the general approach of event generation, simulation methods based on Monte Carlo algorithms, Markovian Monte Carlo chains, methods that describe unfolding processes in particle physics, Random Matrix Theory as support for particle spectrum analysis, and kernel estimation, which produces continuous estimates of the parent distribution from the empirical probability density function. The section ends with a performance analysis of parallel numerical algorithms used in HEP.
2.1. General Approach of Event Generation
The most important aspect of simulation for HEP experiments is event generation. This process can be split into multiple steps, according to the physical model. For example, the structure of LHC (Large Hadron Collider) events comprises (1) the hard process; (2) the parton shower; (3) hadronization; (4) the underlying event. According to the official LHC website (http://home.web.cern.ch/about/computing): “approximately 600 million times per second, particles collide within the LHC …Experiments at CERN generate colossal amounts of data. The Data Centre stores it, and sends it around the world for analysis.” The analysis must produce valuable data, and the simulation results must be correlated with physical experiments.
Figure 1 presents the general approach of event generation, detection, and reconstruction. The physical model is used to create a simulation process that produces different types of events, clustered in vectors of events (e.g., the four types of events in LHC experiments).
General approach of event generation, detection, and reconstruction.
In parallel, the real experiments are performed. The detectors identify the most relevant events and, based on reconstruction techniques, vectors of events are created. The detectors can be real or simulated (software tools), and the reconstruction phase combines real events with events detected in simulation. At the end, the final result is compared with the simulation model (especially with the generated vectors of events). The model can then be corrected for further experiments. The goal is to obtain high accuracy and precision of measured and processed data.
Software tools for event generation are based on random number generators. There are three types of random numbers: truly random numbers (from physical generators), pseudorandom numbers (from mathematical generators), and quasirandom numbers (special correlated sequences of numbers, used only for integration). For example, numerical integration using quasirandom numbers usually gives faster convergence than the standard integration methods based on quadratures. In event generation pseudorandom numbers are used most often.
The most popular HEP applications use the Poisson distribution combined with a basic normal distribution. The Poisson distribution can be formulated as
(1) P[X = k] = (μ^k / k!) exp{−μ}, k = 0, 1, …,
with E(k) = V(k) = μ (V is the variance and E is the expectation value). Given a uniform random number generator called RND(), we can use the following two algorithms as event generation techniques.
The result of running Algorithms 1 and 2 to generate around 10^6 random numbers is presented in Figure 2. In general, the second algorithm gives better results for the Poisson distribution. General recommendations for HEP experiments indicate the use of popular random number generators like TRNG (True Random Number Generators), RANMAR (a fast uniform random number generator used in CERN experiments), RANLUX (an algorithm developed by Lüscher, used by Unix random number generators), and Mersenne Twister (the “industry standard”). Random number generators provided with compilers, operating systems, and programming language libraries can have serious problems because they are based on the system clock; they suffer from lack of uniformity of distribution for large amounts of generated numbers and from correlation of successive values.
<bold>Algorithm 1: </bold>Random number generation for Poisson distribution using many random generated numbers with normal distribution (RND).
(1) procedure Random_Generator_Poisson(μ)
(2) number←-1;
(3) accumulator←1.0;
(4) q←exp{-μ};
(5) while accumulator>q do
(6) rnd_number←RND();
(7) accumulator←accumulator*rnd_number;
(8) number←number+1;
(9) end while
(10) return number;
(11) end procedure
<bold>Algorithm 2: </bold>Random number generation for Poisson distribution using one random generated number with normal distribution.
(1) procedure Random_Generator_Poisson_RND(μ,r)
(2) number←0;
(3) q←exp{-μ};
(4) accumulator←q;
(5) p←q;
(6) while r>accumulator do
(7) number←number+1;
(8) p←p*μ/number;
(9) accumulator←accumulator+p;
(10) end while
(11) return number;
(12) end procedure
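The two listings above translate directly into Python; a minimal sketch (the function names and the use of Python's `random.random()` as the uniform RND() source are our own choices):

```python
import math
import random

def poisson_algorithm1(mu, rnd=random.random):
    """Algorithm 1: multiply uniform draws until the product falls below exp(-mu)."""
    number = -1
    accumulator = 1.0
    q = math.exp(-mu)
    while accumulator > q:
        accumulator *= rnd()
        number += 1
    return number

def poisson_algorithm2(mu, r):
    """Algorithm 2: invert the Poisson CDF using a single uniform draw r."""
    number = 0
    q = math.exp(-mu)
    accumulator = q
    p = q
    while r > accumulator:
        number += 1
        p *= mu / number      # next Poisson term p_k = p_{k-1} * mu / k
        accumulator += p      # running CDF
    return number
```

Algorithm 1 consumes several uniform numbers per sample, while Algorithm 2 inverts the cumulative distribution with a single uniform draw r, which is consistent with its better behavior noted above.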
The art of event generation is to use appropriate combinations of various random number generation methods in order to construct an efficient event generation algorithm that solves a given problem in HEP.
2.2. Monte Carlo Simulation and Markovian Monte Carlo Chains in HEP
In general, a Monte Carlo (MC) method is any simulation technique that uses random numbers to solve a well-defined problem P. If F is a solution of the problem P (e.g., F ∈ Rⁿ or F has a Boolean value), we define F̂, an estimation of F, as F̂ = f({r₁, r₂, …, rₙ}, …), where {r_i}_{1≤i≤n} are random variables whose values cannot be predicted in advance. If ρ(r) is the probability density function, ρ(r) dr = P[r < r′ < r + dr], the cumulative distribution function is
(2) C(r) = ∫_{−∞}^{r} ρ(x) dx ⟹ ρ(r) = dC(r)/dr.
C(r) is a monotonically nondecreasing function with all values in [0,1]. The expectation value is
(3) E(f) = ∫ f(r) dC(r) = ∫ f(r) ρ(r) dr,
And the variance is
(4) V(f) = E[f − E(f)]² = E(f²) − E²(f).
2.2.1. Monte Carlo Event Generation and Simulation
To define a MC estimator the “Law of Large Numbers” (LLN) is used. The LLN can be described as follows: choose n numbers r_i randomly, with a probability density function uniform on a specific interval (a, b), each r_i being used to evaluate f(r_i). For large n (consistent estimator),
(5) (1/n) ∑_{i=1}^{n} f(r_i) ⟶ E(f) = (1/(b − a)) ∫_a^b f(r) dr.
The properties of a MC estimator are as follows: it is normally distributed (with Gaussian density); its standard deviation is σ = √(V(f)/n); it is unbiased for all n (the expectation value is the real value of the integral); it is consistent if V(f) < ∞ (the estimator converges to the true value of the integral for very large n); a sampling phase can be applied to compute the estimator if we do not know anything about the function f; it is suitable only for integration. The sampling phase can be expressed, in a stratified way, as
(6) ∫_a^b f(r) dr = ∫_a^{r₁} f(r) dr + ∫_{r₁}^{r₂} f(r) dr + ⋯ + ∫_{rₙ}^{b} f(r) dr.
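As an illustration of the LLN-based estimator and the variance formulas above, a small Python sketch (the integrand and interval are arbitrary examples, not from the text):

```python
import math
import random

def mc_integrate(f, a, b, n, rng=random.random):
    """Plain MC estimate of integral_a^b f(r) dr, with standard deviation
    sigma = (b - a) * sqrt(V(f)/n) from formula (4)."""
    total = 0.0
    total_sq = 0.0
    for _ in range(n):
        r = a + (b - a) * rng()     # uniform sample on (a, b)
        fr = f(r)
        total += fr
        total_sq += fr * fr
    mean = total / n                # (1/n) sum f(r_i), formula (5)
    var = total_sq / n - mean * mean
    estimate = (b - a) * mean
    sigma = (b - a) * math.sqrt(var / n)
    return estimate, sigma
```

For example, `mc_integrate(math.sin, 0.0, math.pi, 100000)` returns a value close to the exact integral 2, with the reported sigma shrinking as n^(−1/2).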
MC estimations and MC event generators are necessary tools in most HEP experiments, being used at all stages: experiment preparation, simulation runs, and data analysis.
An example of MC estimation is the Lorentz invariant phase space (LIPS) that describes the cross section for a typical HEP process with n particles in the final state.
Consider
(7) σ_n ∼ ∫ |M|² dR_n,
where M is the matrix describing the interaction between particles and dRn is the element of LIPS. We have the following estimation:
(8) R_n(P, p₁, p₂, …, pₙ) = ∫ δ⁽⁴⁾(P − ∑_{k=1}^{n} p_k) ∏_{k=1}^{n} (δ(p_k² − m_k²) Θ(p_k⁰) d⁴p_k),
where P is the total four-momentum of the n-particle system; p_k and m_k are the four-momenta and masses of the final state particles; δ⁽⁴⁾(P − ∑_{k=1}^{n} p_k) expresses total energy-momentum conservation; and δ(p_k² − m_k²) is the on-mass-shell condition for the final state system. Based on the integration formula
(9) ∫ δ(p_k² − m_k²) Θ(p_k⁰) d⁴p_k = d³p_k / (2 p_k⁰),
we obtain the recursive form of the cross section:
(10) R_n(P, p₁, p₂, …, pₙ) = ∫ R_{n−1}(P − pₙ, p₁, p₂, …, p_{n−1}) d³pₙ / (2 pₙ⁰),
which can be numerically integrated using the recurrence relation. As a result, we can construct a general MC algorithm for particle collision processes.
Example 1.
Let us consider the interaction e⁺e⁻ → μ⁺μ⁻, where the Higgs boson contribution is numerically negligible. Figure 3 describes this interaction (Φ is the azimuthal angle, θ the polar angle, and p₁, p₂, q₁, q₂ are the four-momenta of the particles).
The cross section is
(11) dσ = (α²/(4s)) [W₁(s)(1 + cos²θ) + W₂(s) cos θ] dΩ,
where dΩ = d cos θ dΦ, α = e²/(4π) (the fine structure constant), s = (p₁⁰ + p₂⁰)² is the center-of-mass energy squared, and W₁(s) and W₂(s) are constant functions. For pure processes we have W₁(s) = 1 and W₂(s) = 0, and the total cross section becomes
(12) σ = ∫₀^{2π} dΦ ∫_{−1}^{1} d cos θ (d²σ/(dΦ d cos θ)).
Example of particle interaction: e⁺(p₁) + e⁻(p₂) → μ⁺(q₁) + μ⁻(q₂).
We introduce the following notation:
(13) ρ(cos θ, Φ) = d²σ/(dΦ d cos θ),
and let us consider ρ~(cosθ,Φ) an approximation of ρ(cosθ,Φ). Then σ~=∬dΦdcosθρ~. Now, we can compute
(14) σ = ∫₀^{2π} dΦ ∫_{−1}^{1} d cos θ ρ(cos θ, Φ) = ∫₀^{2π} dΦ ∫_{−1}^{1} d cos θ w(cos θ, Φ) ρ̃(cos θ, Φ) ≈ ⟨w⟩_{ρ̃} ∫₀^{2π} dΦ ∫_{−1}^{1} d cos θ ρ̃(cos θ, Φ) = σ̃ ⟨w⟩_{ρ̃},
where w(cosθ,Φ)=ρ(cosθ,Φ)/ρ~(cosθ,Φ) and 〈w〉ρ~ is the estimation of w based on ρ~. Here, the MC estimator is
(15) ⟨w⟩_MC = (1/n) ∑_{i=1}^{n} w_i,
and the standard deviation is
(16) s_MC = ( (1/(n(n−1))) ∑_{i=1}^{n} (w_i − ⟨w⟩_MC)² )^{1/2}.
The final numerical result based on MC estimator is
(17) σ_MC = σ̃ ⟨w⟩_MC ± σ̃ s_MC.
As shown, the principle of a Monte Carlo estimator in physics is to simulate the cross section in interactions and the radiation transport, knowing the probability distributions (or an approximation) governing each interaction of elementary particles.
Based on this result, the Monte Carlo algorithm used to generate events is as follows. It takes ρ̃(cos θ, Φ) as input and in a main loop considers the following steps: (1) generate a (cos θ, Φ) pair from ρ̃; (2) compute the four-momenta p₁, p₂, q₁, q₂; (3) compute w = ρ/ρ̃. The loop is stopped in the case of unweighted events, and we stay in the loop for weighted events. As output, the algorithm returns four-momenta of particles for unweighted events, and four-momenta together with an array of weights for weighted events. The main issue is how to initialize the input of the algorithm. Based on the dσ formula (for W₁(s) = 1 and W₂(s) = 0), we can take as input ρ̃(cos θ, Φ) = (α²/4s)(1 + cos²θ). Then σ̃ = 4πα²/3s.
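A minimal Python sketch of this generator for the pure process (W₁(s) = 1, W₂(s) = 0), producing unweighted events by acceptance-rejection; the constant prefactor α²/4s cancels in the accept/reject ratio and is omitted, and the envelope value 2 is the maximum of 1 + cos²θ:

```python
import math
import random

def generate_event(rng=random.random):
    """Sample an unweighted (cos_theta, phi) pair from rho ~ (1 + cos^2 theta);
    phi is uniform on [0, 2*pi)."""
    while True:
        cos_theta = 2.0 * rng() - 1.0                    # candidate, uniform on [-1, 1]
        if 2.0 * rng() <= 1.0 + cos_theta * cos_theta:   # accept with prob (1+c^2)/2
            phi = 2.0 * math.pi * rng()
            return cos_theta, phi
```

Each accepted pair would then be turned into the four-momenta q₁, q₂ in the chosen frame; the rejection step is what makes the returned events unweighted (weight = 1).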
In HEP, theoretical predictions used for modeling particle collision processes (as shown in the example above) should be provided in terms of Monte Carlo event generators, which directly simulate these processes and can provide unweighted (weight = 1) events. A good Monte Carlo algorithm should be used not only for numerical integration [7] (i.e., providing weighted events) but also for efficient generation of unweighted events, which is a very important issue for HEP.
2.2.2. Markovian Monte-Carlo Chains
A classical Monte Carlo method estimates a function F with F̂ by using a random variable. The main problem with this approach is that we cannot predict any value of a random variable in advance. In HEP simulation experiments the systems are described by states [8]. Let us consider a system with a finite set of possible states S₁, S₂, …, and let S_t be the state at moment t. The conditional probability is defined as
(18) P(S_t = S_j | S_{t₁} = S_{i₁}, S_{t₂} = S_{i₂}, …, S_{tₙ} = S_{iₙ}),
where the mappings (t1,i1),…,(tn,in) can be interpreted as the description of system evolution in time by specifying a specific state for each moment of time.
The system is a Markov chain if the distribution of S_t depends only on the immediate predecessor S_{t−1} and is independent of all previous states:
(19) P(S_t = S_j | S_{t−1} = S_{i_{t−1}}, …, S_{t₂} = S_{i₂}, S_{t₁} = S_{i₁}) = P(S_t = S_j | S_{t−1} = S_{i_{t−1}}).
To generate the time steps (t₁, t₂, …, tₙ) we use the probability of a single forward Markovian step, given by p(t | tₙ) with the property ∫_{tₙ}^{∞} p(t | tₙ) dt = 1, and we define p(t) = p(t | 0). The 1-dimensional Monte Carlo Markovian algorithm used to generate the time steps is presented in Algorithm 3.
<bold>Algorithm 3: </bold>1-Dimensional Monte Carlo Markovian Algorithm.
(1) Generate t₁ according to p(t₁) = p(t₁ | t₀ = 0)
(2) if t₁ < t_max then ▹ Generate the initial state.
(3) P_{N≥1} = ∫₀^{t_max} p(t₁ | t₀) dt₁; ▹ Compute the initial probability.
(4) Retain t₁;
(5) end if
(6) if t₁ > t_max then ▹ Discard all generated and computed data.
(7) N = 0; P₀ = ∫_{t_max}^{∞} p(t₁ | t₀) dt₁ = e^{−t_max};
(8) Delete t₁;
(9) EXIT. ▹ The algorithm ends here.
(10) end if
(11) i = 2;
(12) while (1) do ▹ Infinite loop until a successful EXIT.
(13) Generate t_i according to p(t_i | t_{i−1})
(14) if t_i < t_max then ▹ Generate a new state and new probability.
(15) P_{N≥i} = ∫_{t_{i−1}}^{t_max} p(t_i | t_{i−1}) dt_i;
(16) Retain t_i;
(17) end if
(18) if t_i > t_max then ▹ Discard all generated and computed data.
(19) N = i − 1; P_i = ∫_{t_max}^{∞} p(t_i | t_{i−1}) dt_i;
(20) Retain (t₁, t₂, …, t_{i−1}); Delete t_i;
(21) EXIT. ▹ The algorithm ends here.
(22) end if
(23) i = i + 1;
(24) end while
The main result of Algorithm 3 is that the number of retained steps N follows a Poisson distribution with mean t_max:
(20) P_N = ∫₀^{t_max} p(t₁ | t₀) dt₁ × ∫_{t₁}^{t_max} p(t₂ | t₁) dt₂ × ⋯ × ∫_{t_{N−1}}^{t_max} p(t_N | t_{N−1}) dt_N × ∫_{t_max}^{∞} p(t_{N+1} | t_N) dt_{N+1} = (1/N!) (t_max)^N e^{−t_max}.
We can consider the 1-dimensional Monte Carlo Markovian algorithm as a method used to iteratively generate the system's states (codified as a Markov chain) in simulation experiments. According to the ergodic theorem for Markov chains, the chain defined here has a unique stationary probability distribution [9, 10].
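Assuming the common exponential step density p(t | tₙ) = exp{−(t − tₙ)} (consistent with P₀ = e^{−t_max} in Algorithm 3), the algorithm reduces to the following Python sketch; the count of retained steps is then Poisson-distributed with mean t_max:

```python
import math
import random

def markov_steps(t_max, rng=random.random):
    """Generate forward Markovian time steps with p(t | t_prev) = exp(-(t - t_prev)).
    Returns the retained steps below t_max; the first step above t_max is discarded."""
    steps = []
    t = 0.0
    while True:
        t += -math.log(1.0 - rng())   # exponential increment via inverse CDF
        if t > t_max:
            return steps              # EXIT: discard the overshooting step
        steps.append(t)               # Retain t_i
```

The length of the returned list plays the role of N in formula (20): averaged over many runs it approaches t_max.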
Figures 4 and 5 present the results of running Algorithm 3. For different values of the parameter s used to generate the next step, the results over 1000 iterations differ strongly. For s = 1, Figure 4 shows a noise-like profile. For s = 10, 100, 1000 the profile looks as if part of the information were filtered out and lost. The best results are obtained for s = 0.01 and s = 0.1, and the generated values can easily be accepted for MC simulation in HEP experiments.
Example of 1-dimensional Monte Carlo Markovian algorithm.
Analysis of acceptance rate for 1-dimensional Monte Carlo Markovian algorithm for different s values.
Figure 5 shows the acceptance rate of values generated for the parameter s used in the algorithm; the parameter values correspond to those in Figure 4. The results in Figure 5 show that the acceptance rate decreases rapidly as the parameter s increases. The conclusion is that s must be kept small to obtain meaningful data. A correlation with the normal distribution is evident, showing that a small value of the mean square deviation provides useful results.
2.2.3. Performance of Numerical Algorithms Used in MC Simulations
Numerical methods used to compute the MC estimator rely on numerical quadratures, which approximate the value of the integral of a function f on a specific domain by a linear combination of function values with weights {w_i}_{1≤i≤m} as follows:
(21) ∫_a^b f(r) dr ≈ ∑_{i=1}^{m} w_i f(r_i).
We can consider a consistent MC estimator as a classical numerical quadrature with all w_i = 1. The efficiency of integration methods in 1 dimension and in d dimensions is presented in Table 1. We can conclude that quadrature methods are difficult to apply in many dimensions for various integration domains (regions), and the integral is not easy to estimate.
Efficiency of integration methods for 1 dimension and for d dimensions.
Method                        | 1 dimension | d dimensions
Monte Carlo                   | n^(−1/2)    | n^(−1/2)
Trapezoidal rule              | n^(−2)      | n^(−2/d)
Simpson's rule                | n^(−4)      | n^(−4/d)
m-points Gauss rule (m < n)   | n^(−2m)     | n^(−2m/d)
As a practical example, in a typical high-energy particle collision there can be many final-state particles (even hundreds). If we have n final-state particles, we face a d = 3n − 4 dimensional phase space. As a numerical example, for n = 4 we have d = 8 dimensions, which is very difficult to approach with classical numerical quadratures.
The full decomposition of the integration volume, with one double number (10 Bytes) per volume unit, requires n^d × 10 Bytes. For the example considered, with d = 8 and n = 10 divisions of the interval [0, 1], we have, for one numerical integration,
(22) n^d × 10 Bytes = (10^8 × 10)/1024³ GBytes ≈ 0.93 GBytes.
Considering 10^6 events per second and one integration per event, the data produced in one hour amount to ≈ 3197.4 PBytes.
The previous assumption holds only for multidimensional arrays. Due to the factorization assumption, p(r₁, r₂, …, r_d) = p(r₁)p(r₂)⋯p(r_d), we obtain for one integration
(23) n × d × 10 Bytes = 800 Bytes,
which means ≈ 2.62 TBytes of data produced in one hour of simulations.
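The storage arithmetic above can be checked with a few lines of Python (the helper names are our own; the byte counts and unit conversions mirror formulas (22) and (23)):

```python
def full_grid_bytes(n, d, bytes_per_value=10):
    """Storage for a full d-dimensional grid with n divisions per axis: n^d values."""
    return (n ** d) * bytes_per_value

def factorized_bytes(n, d, bytes_per_value=10):
    """Storage under the factorization p(r1,...,rd) = p(r1)...p(rd): n*d values."""
    return n * d * bytes_per_value

GB = 1024 ** 3   # binary gigabyte, as used in formula (22)
```

With n = 10 and d = 8, `full_grid_bytes` gives 10^9 bytes ≈ 0.93 GBytes per integration, which at 10^6 integrations per second accumulates to ≈ 3197 PBytes per hour; the factorized form needs only 800 bytes, ≈ 2.62 TBytes per hour.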
2.3. Unfolding Processes in Particle Physics and Kernel Estimation in HEP
In particle physics analysis there are two types of distributions: the true distribution (considered in theoretical models) and the measured distribution (considered in experimental models, affected by the finite resolution and limited acceptance of existing detectors). A HEP interaction process starts with a known true distribution and generates a measured distribution, corresponding to an experiment of a well-confirmed theory. The inverse process starts with a measured distribution and tries to identify the true distribution. These unfolding processes are used to identify new theories based on experiments [11].
2.3.1. Unfolding Processes in Particle Physics
The theory of unfolding processes in particle physics is as follows [12]. For a physics variable t we have a true distribution f(t), mapped to an n-vector x of unknowns, and a measured distribution g(s) (for a measured variable s), mapped to an m-vector y of measured data. A response matrix A ∈ R^{m×n} encodes a kernel function K(s, t) describing the physical measurement process [12–15]. The direct and inverse processes are described by the Fredholm integral equation [16] of the first kind, for a specific domain Ω,
(24) ∫_Ω K(s, t) f(t) dt = g(s).
In particle physics the kernel function K(s, t) is usually known from a Monte Carlo sample obtained from simulation. A numerical solution is obtained from the linear system Ax = y. The vectors x and y are assumed to be 1-dimensional in theory, but they can be multidimensional in practice (considering multiple independent linear equations). In practice the statistical properties of the measurements are also well known, and often they follow Poisson statistics [17]. To solve the linear system we have different numerical methods.
The first method is based on the linear transformation x = A# y. If m = n then A# = A⁻¹, and we can use direct Gaussian methods, iterative methods (Gauss-Seidel, Jacobi, or SOR), or orthogonal methods (based on the Householder transformation, Givens methods, or the Gram-Schmidt algorithm). If m > n (the most frequent scenario) we construct the matrix A# = (AᵀA)⁻¹Aᵀ (the Moore-Penrose pseudoinverse). In this case the orthogonal methods offer very good and stable numerical solutions.
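A small Python sketch of the first method for the overdetermined case m > n, forming the normal equations (AᵀA)x = Aᵀy by hand for a system with two unknowns (the 3×2 matrix and right-hand side are arbitrary illustrations):

```python
def solve_normal_equations(A, y):
    """Least-squares solution x = (A^T A)^{-1} A^T y for an m x 2 system,
    solving the 2x2 normal equations directly by Cramer's rule."""
    n00 = sum(row[0] * row[0] for row in A)
    n01 = sum(row[0] * row[1] for row in A)
    n11 = sum(row[1] * row[1] for row in A)
    b0 = sum(row[0] * yi for row, yi in zip(A, y))    # (A^T y)_0
    b1 = sum(row[1] * yi for row, yi in zip(A, y))    # (A^T y)_1
    det = n00 * n11 - n01 * n01
    x0 = (n11 * b0 - n01 * b1) / det
    x1 = (n00 * b1 - n01 * b0) / det
    return [x0, x1]
```

In production code an orthogonal (Householder/QR) factorization is preferred, as noted in the text, since forming AᵀA squares the condition number of A.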
The second method considers the singular value decomposition:
(25) A = U Σ Vᵀ = ∑_{i=1}^{n} σ_i u_i v_iᵀ,
where U ∈ R^{m×n} and V ∈ R^{n×n} are matrices with orthonormal columns and the diagonal matrix is Σ = diag{σ₁, …, σₙ} = UᵀAV. The solution is
(26) x = A# y = V Σ⁻¹ (Uᵀ y) = ∑_{i=1}^{n} (1/σ_i)(u_iᵀ y) v_i = ∑_{i=1}^{n} (c_i/σ_i) v_i,
where c_i = u_iᵀ y, i = 1, …, n, are called Fourier coefficients.
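With NumPy, the SVD-based solution (26) and its Fourier coefficients can be sketched as follows (the small matrix and exact solution are illustrative only):

```python
import numpy as np

# Hypothetical small response matrix (m = 3, n = 2) and an exact solution.
A = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
x_true = np.array([1.0, 3.0])
y = A @ x_true

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) V^T, formula (25)
c = U.T @ y                                        # Fourier coefficients c_i = u_i^T y
x = Vt.T @ (c / s)                                 # x = sum_i (c_i / sigma_i) v_i
```

For a well-conditioned full-rank A this reproduces x exactly; in real unfolding problems small singular values amplify noise through the 1/σ_i factors, which is why regularization is applied in practice.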
2.3.2. Random Matrix Theory
The analysis of particle spectra (e.g., the neutrino spectrum) relies on Random Matrix Theory (RMT), especially if we consider anarchic neutrino masses. RMT is the study of the statistical properties of the eigenvalues of very large matrices [18]. For an interaction matrix A (of size N), where A_ij is an independent identically distributed random variable and Aᴴ is the complex conjugate transpose matrix, we define M = A + Aᴴ, which describes a Gaussian Unitary Ensemble (GUE). The GUE properties are described by the probability distribution P(M) dM: (1) it is invariant under unitary transformations, P(M) dM = P(M′) dM′, where M′ = UᴴMU and U is a unitary matrix (UᴴU = I); (2) the elements of the matrix M are statistically independent, P(M) = ∏_{i≤j} P_ij(M_ij); and (3) the matrix M can be diagonalized as M = U D Uᴴ, where D = diag{λ₁, …, λ_N}, the λ_i are the eigenvalues of M, and λ_i ≤ λ_j if i < j. Properties (2) and (3) lead to the probability distributions
(27) P(M) dM ∼ dM exp{−(N/2) Tr(MᴴM)};
P(M) dM ∼ dU ∏_i dλ_i ∏_{i<j} (λ_i − λ_j)² × exp{−(N/2) ∑_i λ_i²}.
The numerical methods used for eigenvalue computation are the QR method and the power methods (direct and inverse). The QR method is a numerically stable algorithm, and the power method is an iterative one. RMT can be used for many-body systems, quantum chaos, disordered systems, quantum chromodynamics, and so forth.
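A GUE member and its spectrum can be sketched in a few lines of NumPy (the matrix size N = 400 and seed are arbitrary choices; `eigvalsh` applies a numerically stable reduction for Hermitian matrices and returns the eigenvalues in ascending order):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 400
A = rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))   # i.i.d. complex entries
M = A + A.conj().T                    # Hermitian by construction: a GUE member
eigenvalues = np.linalg.eigvalsh(M)   # real eigenvalues lambda_1 <= ... <= lambda_N
```

For large N the empirical eigenvalue density of such matrices approaches Wigner's semicircle law, which is the kind of statistical property RMT studies.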
2.3.3. Kernel Estimation in HEP
Kernel estimation is a very powerful and relevant method for HEP when it is necessary to combine data from heterogeneous sources, like MC datasets obtained by simulation and Standard Model expectations obtained from real experiments [19]. For a set of data {x_i}_{1≤i≤n} with a constant bandwidth h, called the smoothing parameter, we have the estimation
(28) f̂(x) = (1/(nh)) ∑_{i=1}^{n} K((x − x_i)/h),
where K is the kernel. For example, a Gaussian kernel with mean μ and standard deviation σ is
(29) K(x) = (1/(σ√(2π))) exp{−(x − μ)²/(2σ²)},
and it has the following properties: it is positive definite and infinitely differentiable (due to the exp function), and it can be defined on an infinite support (n → ∞). Kernel estimation is a nonparametric method, which means that h is independent of the dataset, and for large amounts of normally distributed data we can find a value of h that minimizes the integrated squared error of f̂(x). This bandwidth value is computed as
(30) h* = (4/(3n))^{1/5} σ.
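Formulas (28)–(30) translate into a short Python sketch (the kernel here is the standard normal, i.e., μ = 0, σ = 1 in (29); the sampling helper is our own addition for demonstration):

```python
import math
import random

def gauss_kernel(u):
    """Standard normal kernel, formula (29) with mu = 0, sigma = 1."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, data, h):
    """Fixed-bandwidth kernel estimate f_hat(x), formula (28)."""
    return sum(gauss_kernel((x - xi) / h) for xi in data) / (len(data) * h)

def optimal_bandwidth(data):
    """h* = (4/(3n))^(1/5) * sigma, formula (30), for approximately normal data."""
    n = len(data)
    mean = sum(data) / n
    sigma = math.sqrt(sum((xi - mean) ** 2 for xi in data) / n)
    return (4.0 / (3.0 * n)) ** 0.2 * sigma

def sample_normal(n, seed=0):
    """Illustrative helper: draw n standard normal points."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]
```

For standard normal data, `kde(0.0, data, optimal_bandwidth(data))` approaches the true density value 1/√(2π) ≈ 0.3989 as the sample grows.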
The main problem in kernel estimation is that the set of data {x_i}_{1≤i≤n} is not normally distributed, and in real experiments the optimal bandwidth is not known. An improvement of the presented method is the adaptive kernel estimation proposed by Abramson [20], where h_i = h/f(x_i), and h and σ are considered global quantities for the dataset. The new form is
(31) f̂_a(x) = (1/n) ∑_{i=1}^{n} (1/h_i) K((x − x_i)/h_i),
and the local bandwidth value that minimizes the integrated squared error of f^a(x) is
(32) h_i* = (4/(3n))^{1/5} σ/f̂(x_i),
where f^ is the normal estimator.
Kernel estimation is used for event selection and confidence level evaluation, for example, in Markovian Monte Carlo chains or in the selection of neural network output used in experiments for the reconstructed Higgs mass. In general, the main usage of kernel estimation in HEP is the search for new particles, by finding relevant data in a large dataset.
A method based on kernel estimation is the graphical representation of datasets using the advanced shifted histogram (ASH) algorithm. This is a numerical interpolation for large datasets, with the main aim of creating a set of nbin histograms H = {H_i} with the same bin width h. Algorithm 4 presents the steps of histogram generation, starting with a specific interval [a, b], a number of points n in this interval, a number of bins, and a number of values used for kernel estimation, m. Figure 6 shows the results of kernel estimation of the function f = −(1/2)x² on [0, 1] and its graphical representation for different numbers of bins. The values on the vertical axis are aggregated in step 17 of Algorithm 4 and increase with the number of bins.
Example of advanced shifted histogram algorithm running for different bins: 10, 100, and 1000.
2.3.4. Performance of Numerical Algorithms Used in Particle Physics
All operations used in the presented methods for particle physics (unfolding processes, Random Matrix Theory, and kernel estimation) can be reduced to scalar products, matrix-vector products, and matrix-matrix products. In [21] the design of a new standard for the BLAS (Basic Linear Algebra Subroutines) in the C language, with extended precision, is described. This permits higher internal precision and mixed input/output types, allowing the implementation of algorithms that are simpler, more accurate, and sometimes faster than is possible without these features. Regarding the precision of numerical computing, Dongarra and Langou established in [22] an upper bound for the residual check of the system Ax = y, with A ∈ R^{n×n} a dense matrix. The residual check is defined as
(33) r_∞ = ‖Ax − y‖_∞ / (n ε (‖A‖_∞ ‖x‖_∞ + ‖y‖_∞)) < 16,
where ε is the relative machine precision of the IEEE representation standard; ‖y‖_∞ = max_{1≤i≤n} |y_i| is the infinity norm of a vector; and ‖A‖_∞ = max_{1≤i≤n} ∑_{j=1}^{n} |A_ij| is the infinity norm of a matrix.
Figure 7 presents the graphical representation of Dongarra's result (using logarithmic scales) for single and double precision. For single precision, ε_s = 2⁻²⁴, for all n ≥ 1.05×10⁶ the residual check is always lower than the imposed upper bound; similarly, for double precision, ε_d = 2⁻⁵³, for all n ≥ 5.63×10¹⁴. If the matrix size is greater than these values, it is not possible to detect whether the solution is correct or not. These results establish upper bounds for the data volume in this model.
Residual check analysis for solving Ax=y system in HPL2.0 using simple and double precision representation.
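The residual check (33) can be sketched in NumPy as follows (the random dense system is purely illustrative; the ε constants are the single and double precision values quoted above):

```python
import numpy as np

EPS_SINGLE = 2.0 ** -24
EPS_DOUBLE = 2.0 ** -53

def residual_check(A, x, y, eps):
    """r_inf from formula (33); a computed solution is accepted when r_inf < 16."""
    n = A.shape[0]
    num = np.linalg.norm(A @ x - y, ord=np.inf)
    den = n * eps * (np.linalg.norm(A, ord=np.inf) * np.linalg.norm(x, ord=np.inf)
                     + np.linalg.norm(y, ord=np.inf))
    return num / den
```

For a well-conditioned system solved in double precision, r_∞ is of order 1, comfortably below the threshold; the point of the figure is that for very large n the normalization term grows until the check loses its discriminating power.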
In a single-processor system, the complexity of an algorithm depends only on the problem size n. We can assume T(n) = Θ(f(n)), where f(n) is a fundamental function (f(n) ∈ {1, n^α, a^n, log n, n, …}). In parallel systems (multiprocessor systems with p processors) we have the serial processing time T*(n) = T₁(n) and the parallel processing time T_p(n). The performance of parallel algorithms can be analyzed using the speed-up, efficiency, and isoefficiency metrics.
The speed-up, S(p), represents how much faster a parallel algorithm is than the corresponding sequential algorithm, and is defined as S(p) = T₁(n)/T_p(n). There are special bounds for the speed-up [23]: S(p) ≤ p p̃/(p + p̃ − 1), where p̃ = T₁/T_∞ is the average parallelism (the average number of busy processors given an unbounded number of processors). Usually S(p) ≤ p, but under special circumstances the speed-up can be S(p) > p [24]. Another upper bound is established by Amdahl's law: S(p) = (s + (1 − s)/p)⁻¹ ≤ 1/s, where s is the fraction of a program that is sequential. The upper bound corresponds to zero processing time for the parallel fraction.
The efficiency is the average utilization of p processors: E(p)=S(p)/p.
The isoefficiency is the growth rate of the workload W_p(n) = p T_p(n) in terms of the number of processors needed to keep the efficiency fixed. If we consider W₁(n) − E W_p(n) = 0 for any fixed efficiency E, we obtain p = p(n). This means that we can establish a relation between the needed number of processors and the problem size. For example, for the parallel sum of n numbers using p processors we have n ≈ E(n + p log p), so n = Θ(p log p).
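The speed-up and efficiency metrics, together with the corrected Amdahl bound, can be captured in a few lines of Python (the timing values in the usage note are illustrative placeholders):

```python
def speedup(t1, tp):
    """S(p) = T1(n) / Tp(n)."""
    return t1 / tp

def efficiency(t1, tp, p):
    """E(p) = S(p) / p, the average utilization of the p processors."""
    return speedup(t1, tp) / p

def amdahl_bound(s, p):
    """Amdahl's law: S(p) = 1 / (s + (1 - s)/p), bounded above by 1/s,
    where s is the sequential fraction of the program."""
    return 1.0 / (s + (1.0 - s) / p)
```

For instance, a job taking 100 s serially and 25 s on 8 processors has speed-up 4 and efficiency 0.5; with a 10% sequential fraction, `amdahl_bound(0.1, p)` never exceeds 10 no matter how large p grows.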
Numerical algorithms are typically implemented on a hypercube architecture; we analyze the performance of different numerical operations using the isoefficiency metric. For the hypercube architecture, a simple model for intertask communication considers T_com = t_s + L t_w, where t_s is the latency (the time needed by a message to cross the network), t_w is the time needed to send one word (1/t_w is called bandwidth), and L is the message length (expressed in number of words). The word size depends on the processing architecture (usually it is two bytes). We define t_c as the processing time per word for a processor. We have the following results.
External product x yᵀ. The isoefficiency is written as
(34) t_c n ≈ E(t_c n + (t_s + t_w) p log p) ⟹ n = Θ(p log p).
The parallel processing time is T_p = t_c n/p + (t_s + t_w) log p. The optimal number of processors is computed using

(35) dT_p/dp = 0 ⟹ −t_c n/p^2 + (t_s + t_w)/p = 0 ⟹ p ≈ t_c n/(t_s + t_w).
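The optimality condition above can be checked numerically; a minimal sketch follows, where the constants t_c, t_s, t_w are illustrative assumptions, not measurements, and log base 2 is used (so the exact minimizer of T_p differs from p* only by a constant factor):

```python
import math

# Sketch of formula (35) for the external product on a hypercube:
# Tp = tc*n/p + (ts + tw)*log(p); setting dTp/dp = 0 gives
# p* ~ tc*n/(ts + tw).  The constants below are illustrative only.
TC, TS, TW = 1e-9, 1e-6, 1e-7  # compute time/word, latency, send time/word (s)

def parallel_time(n, p):
    """Parallel processing time Tp for the external product."""
    return TC * n / p + (TS + TW) * math.log2(p)

def optimal_p(n):
    """Processor count minimizing Tp (up to a constant factor)."""
    return TC * n / (TS + TW)

n = 10**8
p_star = round(optimal_p(n))  # ~9.1e4 with these constants
# Tp near p_star beats both a much smaller and a much larger machine:
print(parallel_time(n, 4096), parallel_time(n, p_star), parallel_time(n, 10**7))
```

Past p*, the communication term (t_s + t_w) log p dominates and adding processors makes the run slower.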
Scalar product (internal product) x^T y = ∑_{i=1}^{n} x_i y_i. The isoefficiency is written as

(36) t_c n^2 ≈ E(t_c n^2 + t_s·2p log p + t_w·2np log p) ⟹ n = Θ(p(log p)^2).
Matrix-vector product y = Ax, y_i = ∑_{j=1}^{n} A_{ij} x_j. The isoefficiency is written as

(37) t_c n^2 ≈ E(t_c n^2 + t_s p log p + t_w np log p) ⟹ n = Θ(p(log p)^2).
Table 2 presents the amount of data that can be processed for a specific architecture size. The cases that meet the upper bound n ≥ 1.05 × 10^6 are marked with (*). To keep the efficiency high for a specific parallel architecture, HPC algorithms for particle physics introduce upper limits on the amount of data, which means that we also have an upper bound on the Big Data volume in this case.
Table 2: Isoefficiency for a hypercube architecture: n = Θ(p log p) and n = Θ(p(log p)^2). We marked with (*) the limitations imposed by Formula (33).

Scenario   Architecture size (p)   n = Θ(p log p)   n = Θ(p(log p)^2)
1          10^1                    1.0 × 10^1       1.00 × 10^1
2          10^2                    2.0 × 10^2       8.00 × 10^2
3          10^3                    3.0 × 10^3       2.70 × 10^4
4          10^4                    4.0 × 10^4       6.40 × 10^5 (*)
5          10^5                    5.0 × 10^5 (*)   1.25 × 10^7
6          10^6                    6.0 × 10^6       2.16 × 10^8
7          10^7                    7.0 × 10^7       3.43 × 10^9
8          10^8                    8.0 × 10^8       5.12 × 10^10
9          10^9                    9.0 × 10^9       7.29 × 10^11
The factors that determine the efficiency of parallel algorithms are task balancing (the workload distribution between all processors in the system → to be maximized); concurrency (the number/percentage of processors working simultaneously → to be maximized); and overhead (extra work introduced by parallel processing that does not appear in serial processing → to be minimized).
3. New Challenges for Big Data Science
Many applications generate Big Data: social networking profiles, social influence analysis, SaaS and Cloud applications, public web information, MapReduce scientific experiments and simulations (especially HEP simulations), data warehouses, monitoring technologies, and e-government services. Data grow rapidly, since applications continuously produce increasing volumes of both unstructured and structured data. The impact on data processing, transfer, and storage is the need to reevaluate the approaches and solutions to better answer user needs [25]. In this context, scheduling models and algorithms for data processing play an important role, becoming a new challenge for Big Data Science.
HEP applications consider both experimental data (applications with terabytes of valuable data) and simulation data (data generated using MC methods based on theoretical models). The processing phase is represented by modeling and reconstruction in order to find the properties of the observed particles (see Figure 8). Then the data are analyzed and reduced to a simple statistical distribution. The comparison of the obtained results will validate how realistic a simulation experiment is and qualify it for use in other new models.
Figure 8: Processing flows for HEP experiments.
Since we face a large variety of solutions for specific applications and platforms, a thorough and systematic analysis of existing solutions for scheduling models, methods, and algorithms used in Big Data processing and storage environments is needed. The challenges for scheduling impose specific requirements in distributed systems: the claims of the resource consumers, the restrictions imposed by resource owners, the need to continuously adapt to changes of resources’ availability, and so forth. We will pay special attention to Cloud Systems and HPC clusters (datacenters) as reliable solutions for Big Data [26]. Based on these requirements, a number of challenging issues are maximization of system throughput, sites’ autonomy, scalability, fault-tolerance, and quality of services.
When discussing Big Data we have in mind the 5 Vs: Volume, Velocity, Variety, Variability, and Value. There is a clear need for many organizations, companies, and researchers to deal with Big Data volumes efficiently. Examples include web analytics applications, scientific applications, and social networks. For these examples, a popular data processing engine for Big Data is Hadoop MapReduce [27]. The main problem is that data arrive too fast for optimal storage and indexing [28]. There are several other processing platforms for Big Data: Mesos [29], YARN (Hortonworks, Hadoop YARN: A next-generation framework for Hadoop data processing, 2013 (http://hortonworks.com/hadoop/yarn/)), Corona (Corona, Under the Hood: Scheduling MapReduce jobs more efficiently with Corona, 2012 (Facebook)), and so forth. A review of various parallel and distributed programming paradigms, analyzing how they fit into the Big Data era, is presented in [30]. The challenges for Big Data Science on modern and future Scientific Data Infrastructure are described in [31]. That paper introduces the Scientific Data Life-cycle Management (SDLM) model, which includes all the major stages and reflects the specifics of data management in modern e-Science, and proposes the SDI generic architecture model, which provides a basis for building interoperable data-centric or project-centric SDI using modern technologies and best practices. This analysis highlights at the same time the performance and the limitations of existing solutions in the context of Big Data. Hadoop can handle many types of data from disparate systems: structured, unstructured, logs, pictures, audio files, communications records, emails, and so forth. Hadoop relies on an internal redundant data structure with cost advantages and is deployed on industry-standard servers rather than on expensive specialized data storage systems [32].
The main challenges for scheduling in Hadoop are to improve existing algorithms for Big Data processing: capacity scheduling, fair scheduling, delay scheduling, longest approximate time to end (LATE) speculative execution, deadline constraint scheduler, and resource aware scheduling.
Data transfer scheduling in Grids, Clouds, P2P systems, and so forth represents a new challenge that is subject to Big Data. In many cases, depending on the application architecture, data must be transported to the place where tasks will be executed [33]. Consequently, scheduling schemes should consider not only the task execution time but also the data transfer time when searching for a more convenient mapping of tasks [34]. Only a handful of current research efforts consider the simultaneous optimization of computation and data transfer scheduling. The big-data I/O scheduler [35] offers a solution for applications that compete for I/O resources in a shared MapReduce-type Big Data system [36]. The paper [37] reviews Big Data challenges from a data management perspective and addresses Big Data diversity, reduction, integration and cleaning, indexing and querying, and finally analysis and mining. On the other side, business analytics, occupying the intersection of the worlds of management science, computer science, and statistical science, is a potent force for innovation in both the private and public sectors. The conclusion is that the data are too heterogeneous to fit into a rigid schema [38].
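A minimal sketch of such a transfer-aware mapping follows; the node names, rates, and cost model are hypothetical illustrations, not taken from the cited schedulers. The idea is simply to score each candidate node by estimated execution time plus data transfer time, rather than execution time alone:

```python
# Hypothetical sketch: choose the node minimizing execution time PLUS
# data transfer time, instead of execution time alone.
def best_node(task_flops, data_mb, nodes):
    """nodes: list of dicts with 'name', 'flops_per_s', 'mb_per_s',
    and 'has_data' (True if the input data is already local)."""
    def cost(node):
        exec_t = task_flops / node["flops_per_s"]
        transfer_t = 0.0 if node["has_data"] else data_mb / node["mb_per_s"]
        return exec_t + transfer_t
    return min(nodes, key=cost)

nodes = [
    {"name": "fast-remote", "flops_per_s": 2e12, "mb_per_s": 100, "has_data": False},
    {"name": "slow-local",  "flops_per_s": 5e11, "mb_per_s": 100, "has_data": True},
]
# Moving 100 GB to the faster node costs more than computing locally.
print(best_node(1e14, 100_000, nodes)["name"])  # slow-local
```

For data-intensive HEP workloads the transfer term often dominates, which is why execution-time-only schedulers pick the wrong node.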
Another challenge is represented by the scheduling policies used to determine the relative ordering of requests. Large distributed systems with different administrative domains will most likely have different resource utilization policies. For example, a policy can take into consideration deadlines and budgets, as well as the dynamic behavior of the system [39]. HEP experiments are usually performed in private Clouds, considering dynamic scheduling with soft deadlines, which remains an open issue.
The optimization techniques for the scheduling process represent an important aspect because scheduling is a main building block for making datacenters more available to user communities, being energy-aware [40] and supporting multicriteria optimization [41]. Examples of optimization include multiobjective and multiconstrained scheduling of many tasks in Hadoop [42] and optimizing short jobs [43]. The cost effectiveness, scalability, and streamlined architecture of Hadoop make it a solution for Big Data processing. Considering the use of Hadoop in public/private Clouds, a challenge is to answer the following questions: what type of data/tasks should move to the public cloud in order to achieve a cost-aware cloud scheduler? And is the public Cloud a solution for HEP simulation experiments?
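One hedged way to frame the cost-aware question above (the policy, parameter names, and numbers are hypothetical, not a published scheduler): move a job to the public cloud only when the private cluster's queue delay would break its soft deadline and the public cost fits the budget.

```python
# Hypothetical cost-aware placement sketch for a hybrid private/public setup.
def place_job(runtime_h, deadline_h, queue_delay_h, budget, public_price_per_h):
    public_cost = runtime_h * public_price_per_h
    if queue_delay_h + runtime_h <= deadline_h:
        return "private"       # private cloud meets the (soft) deadline
    if public_cost <= budget:
        return "public"        # pay the public cloud to meet the deadline
    return "private-late"      # accept a soft-deadline violation

print(place_job(runtime_h=10, deadline_h=12, queue_delay_h=5,
                budget=20, public_price_per_h=1.5))  # public
```

Even this toy policy shows why soft deadlines matter for HEP simulations: a hard-deadline policy would have no "private-late" branch and would reject the job outright.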
The activities for Big Data processing vary widely in a number of issues, for example, support for heterogeneous resources, objective function(s), scalability, coscheduling, and assumptions about system characteristics. The current research directions are focused on accelerating data processing, especially for Big Data analytics (frequently used in HEP experiments), complex task dependencies for data workflows, and new scheduling algorithms for real-time scenarios.
4. Conclusions
This paper presented general aspects of the methods used in HEP: Monte Carlo methods and simulations of HEP processes, Markovian Monte Carlo, unfolding methods in particle physics, kernel estimation in HEP, and Random Matrix Theory used in the analysis of particle spectra. For each method, the proper numerical methods have been identified and analyzed. All of the identified methods produce data-intensive applications, which introduce new challenges and requirements for Big Data systems architecture, especially for processing paradigms and storage capabilities. This paper puts together several concepts: HEP, HPC, numerical methods, and simulations. HEP experiments are modeled using numerical methods and simulations: numerical integration, eigenvalue computation, solving linear equation systems, multiplying vectors and matrices, and interpolation. HPC environments offer powerful tools for data processing and analysis. Big Data was introduced as a concept for a real problem: we live in a data-intensive world, we produce huge amounts of information, and we face upper bounds introduced by theoretical models.
Conflict of Interests
The author declares that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The research presented in this paper is supported by the following projects: “SideSTEP—Scheduling Methods for Dynamic Distributed Systems: a self-* approach”, (PN-II-CT-RO-FR-2012-1-0084); “ERRIC—Empowering Romanian Research on Intelligent Information Technologies,” FP7-REGPOT-2010-1, ID: 264207; CyberWater Grant of the Romanian National Authority for Scientific Research, CNDI-UEFISCDI, Project no. 47/2012. The author would like to thank the reviewers for their time and expertise, constructive comments, and valuable insights.
References
[1] Newman H., "Search for Higgs boson diphoton decay with CMS at LHC," in Proceedings of the ACM/IEEE Conference on Supercomputing (SC '06), New York, NY, USA, ACM, 2006, doi:10.1145/1188455.1188517.
[2] Cattani C., Ciancio A., "Separable transition density in the hybrid model for tumor-immune system competition."
[3] Toma C., "Wavelets-computational aspects of Sterian realistic approach to uncertainty principle in high energy physics: a transient approach."
[4] Cattani C., Ciancio A., Lods B., "On a mathematical model of immune competition."
[5] Perret-Gallix D., "Simulation and event generation in high-energy physics."
[6] Baishya R., Sarma J. K., "Semi numerical solution of non-singlet Dokshitzer-Gribov-Lipatov-Altarelli-Parisi evolution equation up to next-to-next-to-leading order at small x."
[7] Bucur I. I., Fagarasan I., Popescu C., Culea G., Susu A. E., "Delay optimum and area optimal mapping of k-LUT based FPGA circuits."
[8] de Austri R. R., Trotta R., Roszkowski L., "A Markov chain Monte Carlo analysis of the CMSSM."
[9] Şerbănescu C., "Noncommutative Markov processes as stochastic equations' solutions."
[10] Şerbănescu C., "Stochastic differential equations and unitary processes."
[11] Behnke O., Kroninger K., Schott G., Schorner-Sadenius T. H.
[12] Blobel V., "An unfolding method for high energy physics experiments," in Proceedings of the Conference on Advanced Statistical Techniques in Particle Physics, Durham, UK, March 2002, DESY 02-078 (June 2002).
[13] Hansen P. C.
[14] Hansen P. C.
[15] Höcker A., Kartvelishvili V., "SVD approach to data unfolding."
[16] Aziz I., Siraj-ul-Islam, "New algorithms for the numerical solution of nonlinear Fredholm and Volterra integral equations using Haar wavelets."
[17] Sawatzky A., Brune C., Muller J., Burger M., "Total variation processing of images with Poisson statistics," in Proceedings of the 13th International Conference on Computer Analysis of Images and Patterns (CAIP '09), Berlin, Germany, Springer, 2009, pp. 533–540, doi:10.1007/978-3-642-03767-2_65.
[18] Edelman A., Rao N. R., "Random matrix theory."
[19] Cranmer K., "Kernel estimation in high-energy physics."
[20] Abramson I. S., "On bandwidth variation in kernel estimates—a square root law."
[21] Li X., Demmel J. W., Bailey D. H., Henry G., Hida Y., Iskandar J., Kahan W., Kang S. Y., Kapur A., Martin M. C., Thompson B. J., Tung T., Yoo D. J., "Design, implementation and testing of extended and mixed precision BLAS."
[22] Dongarra J. J., Langou J., "The problem with the LINPACK benchmark 1.0 matrix generator."
[23] Lundberg L., Lennerstad H., "An optimal lower bound on the maximum speedup in multiprocessors with clusters," in Proceedings of the IEEE 1st International Conference on Algorithms and Architectures for Parallel Processing (ICAPP '95), April 1995, vol. 2, pp. 640–649.
[24] Gunther N. J., "A note on parallel algorithmic speedup bounds," Technical Report on Distributed, Parallel, and Cluster Computing, in press, http://arxiv.org/abs/1104.4078.
[25] Tak B. C., Urgaonkar B., Sivasubramaniam A., "To move or not to move: the economics of cloud computing," in Proceedings of the 3rd USENIX Conference on Hot Topics in Cloud Computing (HotCloud '11), Berkeley, CA, USA, USENIX Association, 2011, p. 5.
[26] Zhang L., Wu C., Li Z., Guo C., Chen M., Lau F., "Moving big data to the cloud: an online cost-minimizing approach."
[27] Dittrich J., Quiane-Ruiz J. A., "Efficient big data processing in Hadoop MapReduce."
[28] Suciu D., "Big data begets big database theory," in Proceedings of the 29th British National Conference on Big Data (BNCOD '13), Berlin, Germany, Springer, 2013, pp. 1–5.
[29] Hindman B., Konwinski A., Zaharia M., Ghodsi A., Joseph A. D., Katz R., Shenker S., Stoica I., "Mesos: a platform for fine-grained resource sharing in the data center," in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI '11), Berkeley, CA, USA, USENIX Association, 2011, p. 22.
[30] Dobre C., Xhafa F., "Parallel programming paradigms and frameworks in big data era."
[31] Demchenko Y., de Laat C., Wibisono A., Grosso P., Zhao Z., "Addressing big data challenges for scientific data infrastructure," in Proceedings of the IEEE 4th International Conference on Cloud Computing Technology and Science (CLOUDCOM '12), Washington, DC, USA, IEEE Computer Society, 2012, pp. 614–617, doi:10.1109/CloudCom.2012.6427494.
[32] White T.
[33] Celaya J., Arronategui U., "A task routing approach to large-scale scheduling."
[34] Bessis N., Sotiriadis S., Xhafa F., Pop F., Cristea V., "Meta-scheduling issues in interoperable HPCs, grids and clouds."
[35] Xu Y., Suarez A., Zhao M., "IBIS: interposed big-data I/O scheduler," in Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing (HPDC '13), New York, NY, USA, ACM, 2013, pp. 109–110, doi:10.1145/2462902.2462922.
[36] Sotiriadis S., Bessis N., Antonopoulos N., "Towards inter-cloud schedulers: a survey of meta-scheduling approaches," in Proceedings of the 6th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC '11), Barcelona, Spain, October 2011, pp. 59–66, doi:10.1109/3PGCIC.2011.19.
[37] Chen J., Chen Y., Du X., Li C., Lu J., Zhao S., Zhou X., "Big data challenge: a data management perspective."
[38] Gopalkrishnan V., Steier D., Lewis H., Guszcza J., "Big data, big business: bridging the gap," in Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (BigMine '12), New York, NY, USA, ACM, 2012, pp. 7–11, doi:10.1145/2351316.2351318.
[39] van den Bossche R., Vanmechelen K., Broeckhove J., "Online cost-efficient scheduling of deadline-constrained workloads on hybrid clouds."
[40] Bessis N., Sotiriadis S., Pop F., Cristea V., "Using a novel message-exchanging optimization (MEO) model to reduce energy consumption in distributed systems."
[41] Iordache G. V., Boboila M. S., Pop F., Stratan C., Cristea V., "A decentralized strategy for genetic scheduling in heterogeneous environments."
[42] Zhang F., Cao J., Li K., Khan S. U., Hwang K., "Multi-objective scheduling of many tasks in cloud platforms."
[43] Elmeleegy K., "Piranha: optimizing short jobs in Hadoop."