A non-parametric peak finder algorithm and its application in searches for new physics

We have developed an algorithm for non-parametric fitting and extraction of statistically significant peaks in the presence of statistical and systematic uncertainties. Applications of this algorithm for analysis of high-energy collision data are discussed. In particular, we illustrate how to use this algorithm in general searches for new physics in invariant-mass spectra using pp Monte Carlo simulations.


Introduction
Searches for peaks in particle spectra is a task which is becoming increasingly popular at the Large-Hadron collider that focuses on new physics beyond TeV-scale.Bump searches can be performed either in single-particle (such as p T distributions) or invariant-mass spectra.For instance, searches for new particles decaying into a two-body final state (jet-jet, gamma-gamma, etc.) and multi-body decays are typically done by examining invariant masses of final-state objects (jets, leptons, missing transverse momenta, etc.).For example, assuming seven identified particles (jets, photons, electrons, muons, taus, Z-bosons, missing p T ), a search can be made for parent particles decaying into 2, 3, or 4 daughter particles.This leads to 322 unique daughter groups.Thus, the task of analyzing such invariant-mass combinations becomes rather tedious and difficult to handle.Considering a "blind" analysis techniques for scanning many channels [1], any cut variation increases the number of channels that need to be investigated.Finally, similar challenges exist for automatic searches for new hadronic resonances combining tracks [2].
The task of finding bumps is ultimately related to the task of determining a correct background shape using theoretical or known cross sections.However, a theory can be rather uncertain in the regions of interest, difficult to use for background simulation or entirely nonexistent.Even for a simple jet-jet invariant mass, finding an analytical background function that fits the QCD-driven background spanning many orders in magnitude and which can be used to extract possible excess of events due to new physics requires a careful examination.Attempts to fit two-jet and threejet invariant masses have been discussed in CMS [3,4] and ATLAS [5] papers; while both experiments have reached the necessary precision for such fits using initial lowstatistics data, the used analytical functions are rather different and have many free parameters.This task becomes even more difficult considering multiple channels (invariant-mass distributions) with various cuts or detector-selection criteria (like b-tagging).Each such channel requires a careful selection of analytical functions for background fit and adjustments of their initial values for convergence of a non-linear regression while determining an expectantly smooth background shape.A fully automated approach to searches for new physics has been discussed elsewhere [6].
One technically attractive approach is to find a non-parametric way to extract statistically significant peaks without a prior assumptions on background shapes.Such approach is popular in many areas, from image processing to studies of financial market, where a typical peak-identification task is reduced to data smoothing in order to create a function that attempts to approximate the background.The smoothing can be achieved using the moving average [7], Lowess [8], SPlines [9] algorithms.Statistically significant deviations from smoothed distributions can be considered as peaks.Such technique is certainly adequate for the peak extraction, but it does not pursue the goal of peak identification with a correct treatment of statistical (or systematic) uncertainties.
The closest peak-search approach to high-energy case has been developed for studies of γ-ray spectra where the usual features of interest are the energies and intensities of photo-peaks.Several techniques have been developed, such as those based on least squares [10], second differences with least-squares fitting [11], Fourier transformation [12], Markov chain [13], convolution [14] (just naming a few).While such approaches are well suited for counting-type observables, they typically focus on narrow peaks on top of small and often flat-shaped background.In high-energy collisions, a typical standard-model background distribution has a falling shape spanning many orders of magnitude in event counts.A typical example is jet-jet invariant masses used for new particle searches [3,5].For such spectra, the most interesting regions are the tails of the exponentially suppressed distributions where a new highp T physics may show up.This means that there should be rather different thresholds to statistical noise, depending on the phase-space region, and as the result, a correct treatment of statistical and systematic uncertainties is obligatory.Unlike the γ-ray spectra where peaks are rather common and subject of various classification techniques, peaks in high-energy collisions are rather rare.As a consequence, relatively little progress has been made to develop a non-parametric fitting technique for highenergy physics applications where an observation of peaks is typically a subject for searches for new physics rather than for peak-classification purposes.
The above discussion leads to the need for a non-parametric way of background estimation together with the peak extraction mechanism which can be suited for high-energy collision distributions, such as invariant masses.The algorithm should be able to take into account the discrete nature of input distributions with their statistical (and possibly systematic) uncertainties.

Non-parametric peak finder algorithm
Due to the reasons discussed above, the program called Non-Parametric Peak Finder (NPFinder) was developed using a numerical, iterative approach to detect statistically significant peaks in event-counting distributions.In short, NPFinder iterates through bins of input histograms and, using only one sensitivity parameter, determines the location and statistical significance of possible peaks.Unlike the known smoothing algorithms, the main focus of this method is not how to smooth data and then extract peaks, but rather how to extract peaks by comparing neighboring points and then calling what is left over the "background".Below we discuss the major elements of this algorithm and then we illustrate and discuss its limitations and possible improvements.
For each point i in a histogram, the first-order derivative α i is found taking into account possible (statistical or/and systematic) uncertainties.This is done by calculating the slope between two points including their experimental uncertainties: If point i + 1 is lower than point i, the upper error is used, while if point i + 1 is higher than point i, the lower error uncertainty is used.This is done in order to be always on a conservative side while reducing statistical noise.Mathematically, this can be written as: where the uncertainty δy i+1 is taken with negative sign for y i+1 > y i , and with positive sign otherwise.The uncertainly may not need to be symmetric; but for simplicity we assume that they are symmetric as this is usually the case for statistical nature of uncertainties.The derivatives are averaged calculating a running average for any given position N: The algorithm triggers the beginning of a peak if the local derivatives satisfy: where ∆ is a free positive parameter.When the above conditions are true, NPFinder registers a possible peak and begins classifying next points as a part of the peak.The running average Eq. 2 is not accumulated for the points which belong to a possible peak.∆ is the only free parameter which specifies the sensitivity to the peak finding.Typically, it should be close to 1 and decreasing it to a smaller value will increase sensitivity to the peaks (and likely will increase sensitivity to statistical fluctuations).
NPFinder continues to walk over data points until δα N +1 and δα N +2 are both negative, which signifies the maximum of the peak has been reached.The double condition in Eq. 3 is used to reinforce the peak-search robustness.When this condition is met, NPFinder exits the peak and adds an equal number of points to the right side from the peak center.The requirement of having the same number of points implies that the peaks are expected to be symmetric (which is typically the case).For steeply falling distributions, as in the most common case for high-energy physics, this assumption usually means that we somewhat underestimate the peak significance.
After detecting all peak candidates, NPFinder iterates through the list of possible peaks in order to form a background for each peak.This is achieved by performing a linear regression of points between the first and last points in the peak, i.e. applying the function y i = mx i + b, where m and b are the slop and intercept of the linear regression, which in this case is rather trivial as it is performed via the two points only.It should be noted that the linear regression is also performed taking into account uncertainties: where y 1 is the first point of the peak, y 2 is the final point of the peak, and δy 1 and δy 2 are their statistical uncertainties, respectively.Here the statistical uncertainties are added in order to always be on the conservative side in estimation of the background level under the peak.The intercept parameter then is b = y 1 + δy 1 − mx 1 .
It should be mentioned that the technique of peak finding considered above is somewhat similar to that discussed for γ-ray applications [11].But there are several important differences: the peaks can have arbitrary shapes (not only a Gaussian form), no fitting or smoothing procedure is used and errors on data points are included at the iteration step.
Finally, NPFinder uses the background points to calculate the statistical significance of each peak in a given histogram.This is done by summing up the differences r i of the original points in a peak with respect to the calculated background points, and then dividing this value by it's own square root.For a given peak, it can be approximated by: where the sum runs over all points in the peaks.The algorithm runs over an input histogram or graph, builds a list of peaks and estimates their statistical significance.
A typical statistically significant peak in this approach has σ > 5 − 7. A first peak is usually ignored as it corresponds to the kinematic peak of background distributions.Below we illustrate the above approach by generating fully inclusive pp collision events using the PYTHIA generator [15].The required integrated luminosity was 200 pb −1 .Jets are reconstructed with the anti-k T algorithm [16] with a distance parameter of 0.6 using the cut p T > 100 GeV.Currently, this jet algorithm is the default at the ATLAS and CMS experiments.Then, the dijet invariant-mass distribution is calculated and the NPFinder finder is applied using the parameter ∆ = 1.As expected, no peaks with σ > 5 were found.
Finally, a few fake peaks were generated using Gaussian distributions with different peak positions and widths.The peaks were added to the original background histogram.Figure 1 shows an example with 3 peaks generated at 1000 GeV (20 GeV width, 200000 events), 1500 (50 GeV widths, 30000 events) and at 2800 GeV (40 GeV widths, 1200 events).The algorithm found all three peaks and gave correct estimates of their positions, widths and approximate statistical significance using the input parameter ∆ = 1.
It should be noted that the peak statistical significance of the proposed nonparametric method might be smaller than that calculated using more conventional approaches, such as those based on a χ 2 minimisation with appropriate background and signal functions.This can be due to the assumption on the symmetric form of the extracted peaks, the linear approximation for the background under the peaks, and the way in which the uncertainty is incorporated in the peak-significance calculation.An influence of experimental resolution can also be an issue [17] which can only be addressed via correctly identified signal and background functions.Such drawbacks are especially well seen for the highest-mass peak shown in Figure 1 where a statistical fluctuation to the right of the peak pulls the background level up compared to the expected falling shape.Given the approximate nature of the statistical significance calculations which only serve to trigger attention of analyzers who need to study the found peaks in more detail, the performance of the algorithm was found to be reasonable.
It should be noted that there is a correlation between the peak width and the input parameter ∆: a detection of broader peaks typically requires a smaller value of ∆ (which can be as low as 0.2).
In conclusion, a peak-detection algorithm has been developed which can be used for extraction of statistically significant peaks in event-counting distributions taking into account statistical (and potentially systematic) uncertainties.The method can be used for new physics searches in high-energy particle experiments where a correct treatment of such uncertainties is one of the most important issues.The non-parametric peak finder has only one free parameter which is fairly independent of input background distributions.The algorithm was tested and found to perform well.The code is implemented in the Python programming language with the graphical output using either ROOT (C++) [18] or jHepWork (Java) [19].The code example is available for download [20].

Figure 1 :
Figure 1: Invariant mass of two jets generated with the PYTHIA Monte Carlo model.Several peaks seen in this figure were added using Gaussian distributions with different widths and peak values (see the text).The peaks are found using the NPFinder algorithms which also estimates their statistical significances as discussed in the text.