On the frequency distribution of neutral particles from low-energy strong interactions

The rejection of the contamination, or background, from low-energy strong interactions at hadron collider experiments is a topic that has received significant attention in the field of particle physics. This article builds on a particle-level view of collision events, in line with recently-proposed subtraction methods. While conventional techniques in the field usually concentrate on probability distributions, our study is, to our knowledge, the first attempt at estimating the frequency distribution of background particles across the kinematic space inside individual collision events. In fact, while the probability distribution can generally be estimated given a model of low-energy strong interactions, the corresponding frequency distribution inside a single event typically deviates from the average and cannot be predicted a priori. We present preliminary results in this direction, and establish a connection between our technique and the particle weighting methods that have been the subject of recent investigation at the Large Hadron Collider.


Introduction
The subtraction of the contamination, or background, from soft, that is, low-energy, physics processes described by Quantum Chromodynamics (QCD) that take place in protonproton collisions is a critical task at the Large Hadron Collider (LHC). The impact of the correction is going to become even more significant in the upcoming scenarios whereby the hard, that is, high-energy, parton scattering of interest will be superimposed with a higher number of low-energy interactions from collisions between other protons, the so-called pile-up events. This is an important aspect at the LHC and one that is going to have an even more significant impact at the High-Luminosity LHC (HL-LHC), that is, at the accelerator that is going to be built following the LHC upgrade project. In fact, the contribution of pile-up particles to the events of interest often makes the study of rare processes particularly challenging.
Subtraction techniques are well established and typically combine tracking information for charged particles with estimates of the energy flow associated with neutral particles that originate from low-energy QCD interactions [1]. In particular, pile-up subtraction is a key component in the data processing pipelines responsible for the reconstruction and calibration of jets, that is, of collections of particles interpreted as originating from the same scattered parton.
In this context, an important role has been played by correction procedures based on jet area [2], which provides a measure of the susceptibility of reconstructed jets to the soft QCD energy flow. Such methods work by subtracting from the total momentum of the high-energy jets a quantity proportional to an event-level estimate of the background momentum density as well as to the area of the jet of interest. Therefore, this takes into account event-to-event background variability and since the correction can be calculated in a kinematics-dependent way, also the presence of different levels of pile-up activity in different kinematic regions inside events. However, due to the quantum nature of the underlying physics, the density of pile-up particles can be different even inside jets with very similar kinematics in the same collision event. While techniques based on jet area cannot correct for this, more recent methods exploiting information encoded in the substructure of jets [3][4][5][6][7][8] can effectively take this into account.

Advances in High Energy Physics
In this article, we explore a different perspective in order to estimate the frequency distribution of soft QCD particles inside events. We build on a view of collision events as collections of particles whereby soft QCD particles are rejected upstream of jet reconstruction, in line with the particlelevel pile-up subtraction algorithms that are currently being evaluated at the LHC [9][10][11].
Due to the quantum nature of the underlying physics processes and the limited number of final-state particles inside individual collisions, the particle multiplicity across the kinematic space inside events will generally vary across collisions even when the physics processes involved are exactly the same. More precisely, the soft QCD particle-level frequency distribution will normally deviate from the corresponding probability distribution and will be different in different events. What is discussed in this article is a data-driven method of estimating the soft QCD particle multiplicity across the kinematic space inside each event, using the following: (i) The kinematic probability distributions of soft QCD particles and of particles originating from the signal hard scattering, for example, obtained using simulated data. (ii) The average fraction of soft QCD particles in the events. (iii) The observed particle multiplicity, that is, the observed number of particles in different kinematic regions in the event.
Our approach relies on particles from high-energy scattering processes having pseudorapidity distribution more peaked at = 0, as well as higher values of , on average, than those originating from low-energy strong interactions. This is essentially due to the higher transverse momentum transferred between the colliding protons and results in different kinematic distributions of the final-state particles on ( , ) plane, as illustrated in Figure 1 with reference to the control sample. Although different signal processes will generally be associated with different kinematic signatures, the dissimilarity between background soft QCD and hard scattering signal particles in terms of their and distributions typically outweighs the variability associated with the choice of signal process.
Moreover, as a filtering stage upstream of physics analysis, reconstructed events at the experiments are usually subdivided, or "skimmed," into multiple data streams enriched in different signal processes. For the purpose of this discussion, the signal model can be thought of as describing the particlelevel kinematics corresponding to the high-energy processes that the events analysed are enriched in.
The kinematic variables used in this study are those that the relevant signal and background probability distributions can be written as functions of, namely, particle pseudorapidity, , and transverse momentum, . To our knowledge, this is the first method of estimating how the frequency distribution of soft QCD particles inside individual events deviates from the expectation due to the nondeterministic nature of the underlying processes.
This article reports preliminary results on simulated collision events at the LHC, showing that the algorithm produces reasonable estimates of the number of soft QCD particles in different ( , ) regions inside events regardless of the presence in those regions of particles from the hard scattering. Given that assessing the performance of this method requires knowledge of the true frequency distribution of soft QCD particles in each event, which is not available at the experiments, the validation was performed on simulated data, using an event generator commonly employed in the field [12,13]. Specifically, background and signal particles were generated using soft QCD processes and → , respectively. Our interest in the estimation of the multiplicity of soft QCD particles across the kinematic space inside individual collision events relates to the development of furtherimproved methods of rejecting pile-up in high-luminosity regimes. Since our approach is based on a different principle and works in a different way as compared to established techniques, we expect its combined use with existing methods to result in enhanced pile-up subtraction in the upcoming higher-luminosity regimes at the LHC. We speculate that this can also lead to improved missing transverse energy resolution and to higher-quality estimates of particle isolation as the pile-up rates increase.
It should be noted that, in addition to pile-up, another source of soft QCD particles at the LHC is the so-called Underlying Event (UE), which consists of particles from lowenergy parton interactions taking place in the same protonproton collision that contains the particles produced by the hard parton scattering. Pile-up and UE particles originate from similar processes: for this reason, with regard to estimating the frequency distribution of soft QCD particles inside events, it is expected that UE particles will contribute to the background category. In other words, UE particles are expected to count towards the number of soft QCD particles together with those that originate from pile-up. In any case, although the distinction between pile-up and UE particles is conceptually important, pile-up is by far the primary source of soft QCD contamination at the ATLAS and CMS experiments at the LHC.
The algorithm that we describe in this article is a simplified deterministic variant of the Markov Chain Monte Carlo technique used in [14][15][16], where we discussed the idea of filtering collision events particle by particle upstream of jet reconstruction. Both our previous contributions and the present article relate to the development of new subtraction methods, with a view to improving further on the rejection of contamination from low-energy strong interactions in highluminosity hadron collider environments. In particular, it is our opinion that the simplicity and parallelisation potential of this technique make it a promising candidate for inclusion in particle-by-particle event filtering procedures at the reconstruction level at future high-luminosity hadron collider experiments.

The Algorithm
By construction, the probability density function (PDF) describing the kinematics of particles originating from a Advances in High Energy Physics given process, for example, with reference to soft QCD interactions, can be estimated as the limit of the corresponding frequency distribution, averaged over a large enough number of events. On the other hand, the corresponding frequency distribution inside a single event normally deviates from the PDF due to the limited number of particles. In fact, even when the processes involved are exactly the same, different collisions contain independent, and therefore different, realisations of the underlying quantum processes. For this reason, the shapes of the corresponding particle-level frequency distributions, for example, that of soft QCD particles, generally vary across collisions.
Let 0 and 1 denote the kinematic PDFs of background and signal particles, respectively, normalised such that ∫ ∫ ( , ) = 1, = 0, 1. For the purpose of this study, we describe collision events as statistical populations of particles originating from soft QCD interactions and from the hard scattering, using a mixture model of the form 0 0 ( , )+ 1 1 ( , ), where 0 is the fraction of soft QCD particles in the events and 1 = 1 − 0 .
In this context, the probability for a given particle to originate from soft QCD interactions can be expressed using the following quantity: . (1) Inside each collision event, although the actual numbers of background and signal particles in the different ( , ) bins are not known, it is possible to estimate the corresponding expected numbers, ] ( , ) and ] ( , ), given a background and a signal model, respectively.
For the purpose of this discussion, we express ] in terms of ] ( , ) = 0 0 ( , )Δ Δ , where is the total number of particles in the event and Δ and Δ are the bin widths along and axes, respectively. The corresponding expected number of signal particles in the bin, ] ( , ), can be calculated in a similar way using 1 . If one denotes the unknown true numbers of signal and soft QCD particles as functions of particles and in each event by * ( , ) and * ( , ), respectively, then ( , ) = * ( , ) + * ( , ), where ( , ) is the corresponding number of particles in the data. When one considers LHC events with a number of proton-proton interactions in line with what is expected in the upcoming higher-luminosity regimes, the final-state particle multiplicities associated with soft QCD interactions and with the signal hard scattering are such that the expected number of signal particles in each bin, ] ( , ), is on average much lower than the corresponding number of soft QCD particles; that is, ⟨] ( , )⟩ ≪ ⟨ * ( , )⟩, where the average is taken over the ( , ) space.
One therefore expects that the statistical fluctuations on the observed number of particles in each ( , ) bin will also be dominated by those on the number of soft QCD particles; that is, ⟨ ( , )⟩ ≪ ⟨ ( , )⟩.
It should be noted that if the variability of the number of signal particles can be neglected, the quantitŷ is expected to provide a more reliable estimate of the unknown number of soft QCD particles, * ( , ), than does ] ( , ). In fact, if the true number of soft QCD particles in each ( , ) bin inside a given event deviates from the expected value by an amount Δ , that is, if * = ] +Δ , then a fraction 0 of Δ will contribute tôin that bin. On the other hand, ] reflects an expectation and therefore does not contain any information about statistical fluctuations inside events.
Given the estimated number of soft QCD particles in each ( , ) bin,̂( , ), the corresponding unknown actual number can be treated as a random variable following a 4 Advances in High Energy Physics binomial distribution with mean given by expression (2) and standard deviation Based on expression (1), for the purpose of estimating the background particle multiplicity inside each event, the discrimination between soft QCD interactions and the hard parton scattering exploits the different shapes of the corresponding PDFs as functions of and . Specifically, as anticipated, the discrimination relies on particles from the hard scattering having a pseudorapidity distribution more peaked at = 0, as well as having on average higher values of .
The use of expression (1) for 0 ( , ) in order to estimate the multiplicity of soft QCD particles across the kinematic space inside the events is essentially equivalent to weighting particles against PDFs that summarise our knowledge of the kinematics of the underlying processes, thereby taking into account the average fraction of soft QCD particles in the events. As far as the hard scattering is concerned, in addition to → , which is used to illustrate our method in the present article, the algorithm has also been run on particles originating from decays of the Standard Model Higgs boson produced via vector boson fusion [17]. In fact, such a process does not involve colour exchange between the colliding protons and is therefore expected to lead to a lower degree of particle activity around = 0 when compared to → . The following discussion refers to neutral particles, since the identification of charged pile-up particles is made significantly easier by the availability of information from the tracking detectors at the experiments.
It is worth pointing out that although the signal and background PDFs were derived using simulated collision events in the context of this study, similar information can in principle be obtained using control samples from the data. As for the estimation of the soft QCD particle fraction among the neutral particles in the events, 0 , it was decided to use the corresponding fraction of charged particles averaged over the events generated. In fact, the investigation of a more sophisticated approach including the use of a kinematic correction factor obtained from Monte Carlo showed no significant performance improvement [18]. It should also be emphasised that the present estimate of 0 based on charged particles was exclusively obtained for the purpose of this investigation and that more information will typically be available at the experiments, for example, in the form of data on the neutral energy flow provided by the calorimeters.
Whereas 0 was defined as a global event-level quantity, the possibility of introducing a dependence on and is worth investigating in the future, as this could lead to further-improved results. Nonetheless, it was decided to rely on as simple an approach as possible for the purpose of this feasibility study.
An overview of this method is given in Figure 2, which highlights what information is extracted from the models and what comes from the data. We will show that this approach produces a more accurate estimate of the background particle multiplicity inside the events than would be obtained using exclusively the expected number, ] , as long as the statistical fluctuations in the data are dominated by those on the number of soft QCD particles.

Results
We discuss the results of a proof-of-concept study on Monte Carlo data at the generator level. We used Pythia 8.176 [12,13] to generate 1,000 events, each consisting of a → hard parton scattering at √ = 14 TeV, superimposed with 50 soft QCD interactions to simulate the presence of pile-up. Soft QCD interactions were generated with "SoftQCD:all"; "Par-tonLevel:ISR"; "PartonLevel:FSR"; and "PartonLevel:MI" set to "on." The performance of this method on a reference event will be discussed for the sake of illustration, and distributions on all events generated will then be shown in order to confirm the consistency of the results.
As a prerequisite for the execution of this method, the particle-level ( , ) space in each event was subdivided into bins of widths Δ = 0.5 and Δ = 0.05 GeV/ . Whereas our Δ binning will have to be revised in the context of a full detector simulation study, which is outside the scope of this article, we are using this choice of bins as a starting point to illustrate our method. Our analysis focusses on particles with 0 < < 1 GeV/ , which are the majority of those produced by soft QCD interactions.
Signal and background PDF templates as functions of particles and were obtained using a control sample dataset containing ∼300,000 particles from → and ∼300,000 from soft QCD interactions. The corresponding Monte Carlo truth ( , ) distributions of neutral soft QCD particles and of neutral particles from the hard scattering are shown in Figures 1(a) and 1(b), respectively, each normalised to unit volume.
The above-mentioned collections of ∼300,000 particles, though significantly lower statistics compared to the large datasets normally used in the field, are considered adequate for the purpose of estimating the signal and background probability distributions in the context of this study. In fact, our emphasis is on how the soft QCD frequency distributions inside individual events deviate from the corresponding probability distribution. For this purpose, ∼300,000 particles are high-enough statistics for the local features in the frequency distributions due to the presence of statistical fluctuations to be averaged out.
The distributions shown in Figure 1 were used together with the previously mentioned estimate of the average fraction of soft QCD particles over all neutrals in the events, 0 , in order to calculate 0 according to expression (1). The distribution of 0 ( , ) is shown in Figure 3(a). Figure 4(a) displays the true multiplicity of neutral soft QCD particles as a function of particles and in the reference event. As expected, the frequency distribution deviates from the corresponding higher-statistics distribution, shown in Figure 1(a), due to the presence of local features that are typically washed out when multiple events are lumped together.
The estimate of the multiplicity of neutral soft QCD particles across ( , ) space obtained using this method is Advances in High Energy Physics shown in Figure 4(b) with reference to the same event. A comparison with the corresponding true distribution in Figure 4(a) suggests that the local features of the distribution are reasonably well described, for example, the excess at ≃ 2.5 and ≃ 0.2 GeV/ . The performance is discussed further below with reference to all the events generated.
The absolute and relative statistical uncertainties on̂in the reference event are displayed in Figures 3(b) and 5(a), respectively, where the absolute uncertainty, , was estimated using expression (3). In particular, Figure 3(b) suggests that the precision of this method is better than 1 particle, although, in order for this claim to be made, the precision over all events generated also needs to be assessed, as discussed in the following.
It is worth emphasising that, in general, it is not known which ( , ) bins in the event contain signal particles and which do not. Therefore, once it is observed that̂is a better estimate of * than ] is and that it is sufficiently precise, it is also necessary to verify that̂is more accurate than the estimate that would be obtained if the possible presence of signal particles in the bins was simply neglected. For this purpose, the absolute deviation of the estimated number of neutral  Figure 4: (a) True particle-level ( , ) distribution of neutral soft QCD particles corresponding to one of the Monte Carlo events generated in this study. The plot illustrates how the soft QCD particle multiplicity across the kinematic space inside a typical event deviates from the corresponding probability distribution due to the limited particle statistics. (b) The corresponding particle-level ( , ) distribution of neutral soft QCD particles estimated using this algorithm as described in the text. soft QCD particles from the true value, normalised to the true number of signal particles, |̂− * |/ * , is displayed in Figure 5(b) with reference to those ( , ) bins that contain at least 1 signal particle. As can be seen, |̂− * |/ * ≲ 1; that is, the absolute deviation of the estimated number of background particles from the true number is lower than the number of signal particles across the kinematic space in the event, corresponding to the pile-up rate considered. As anticipated, Figures 3(b) and 5 relate to one single event. In order to verify the precision and the accuracy of the algorithm in more detail, the corresponding heatmaps were obtained over all events generated. Figures 6(a) and 6(b) display ⟨ ⟩ and ⟨|̂− * |/ * ⟩ across ( , ) kinematic space, respectively, where the average is taken over events. The plots confirm that the algorithm produces consistent results, with ⟨ ⟩ below 1 particle and ⟨|̂− * |/ * ⟩ significantly lower than 1 across ( , ) space.
It is worth noticing that the present investigation relies on a particle-level kinematic comparison between soft QCD interactions and a specific hard scattering process; namely, Advances in High Energy Physics The plots confirm the level of precision and accuracy that can be achieved using this method, in that is below 1 particle and |̂− * |/ * is significantly lower than 1.

→
. Choosing a different signal process will normally change the final-state particle kinematics, although the difference between particles originating from a hard scattering and soft QCD particles is generally expected to be more pronounced than differences across signal processes. As discussed in Section 2, this is supported by the study documented in [17], where this method was applied to vector boson fusion Standard Model Higgs production. In any case, the potential dependence of the performance of this technique on the choice of signal process deserves further investigation, in order for the results presented in this article to be generalised.

Conclusion
The contamination, or background, from particles produced by low-energy strong interactions is a major issue at the Large Hadron Collider. Although well-established correction procedures are in use at the experiments, new techniques are also being investigated in the field. The primary objective of this new line of development is to meet the requirements that will be posed by the upcoming higher-luminosity operational regimes of the accelerator.
We have investigated a different perspective to mainstream methods, thereby estimating the kinematics of background particles inside individual collision events in terms of their frequency distribution. Whereas the use of probability distributions is traditional, our emphasis on the frequency distributions is, to the best of our knowledge, a distinctive and unique feature of our approach.
Our hope and expectation are that the ability to describe the kinematic frequency distribution of the contaminating particles collision by collision will help towards the development of improved subtraction methods at higher luminosity. In particular, we expect that our method will become useful as a complement to existing correction algorithms applied to jets, that is, to collections of final-state particles interpreted as originating from the same scattered parton.
The preliminary results discussed in this article suggest that our method can produce more accurate estimates of the number of contaminating particles in different kinematic regions inside collision events than would be possible by relying exclusively on the expected numbers. Although the possible impact of mismodelling remains to be investigated in more detail, this encourages further studies in this direction.
However, it should be stressed that a proper quantitative test of the proposed method will require a detailed assessment on specific observables, which is beyond the scope of the present feasibility study.
It should also be emphasised that the algorithm is inherently parallel. In fact, different kinematic regions inside events can be processed independently, and the calculation of the only global variable, that is, of the average fraction of contaminating particles per event, can be performed in advance on a control sample. For this reason, this method is potentially suitable for inclusion in future particle-level event filtering procedures upstream of jet reconstruction.
The possible complementarity between our method and the particle weighting techniques that have recently been proposed at the Large Hadron Collider will also be worth exploring. In fact, our estimate of the probability for individual particles to originate from low-energy QCD processes as opposed to the high-energy signal scattering is based on information that is currently not employed by any particle 8 Advances in High Energy Physics weighting algorithms. In this context, we anticipate that multivariate combinations of different weighting schemes can prove beneficial with a view to improving further on the rejection of contaminating particles.
Finally, it will be useful to study the impact of this method, when used in conjunction with existing techniques, on the resolution of the missing transverse energy as well as on estimates of particle isolation, with a view to quantifying the associated benefit at the level of physics analysis. The source code used for the purpose of this study is made available upon request.

Nomenclature and General Remarks
Collisions: Proton-proton collisions at the Large Hadron Collider Events: Triggered proton-proton collisions Bunch crossings: Intersections between colliding proton beam bunches Physics processes: Either the high-energy parton scattering of interest or low-energy strong interactions Missing transverse energy: Event-level energy imbalance measured on a plane perpendicular to the direction of the colliding particle beams Particle transverse momentum, : Absolute value of the component of the particle momentum vector on a plane perpendicular to the direction of the colliding beams Particle pseudorapidity, : Kinematic quantity expressed in terms of the particle polar angle in the laboratory frame, , by = − log[tan( /2)] Neutral particles: Whenever neutral particles are referred to in the text, neutrinos are not considered.