Leakage Prediction in Machine Learning Models When Using Data from Sports Wearable Sensors

One of the major problems in machine learning is data leakage, which can be directly related to adversarial-type attacks and raises serious concerns about the validity and reliability of artificial intelligence. Data leakage occurs when the independent variables used to train the machine learning algorithm include either the dependent variable itself or a variable that carries clear information about what the model is trying to predict. Such leakage results in unreliable and poor predictions once the model is developed and used. It prevents the model from generalizing, which is required in a machine learning problem, and thus leads to false assumptions about its performance. To obtain a solid and generalized forecasting model that produces strong forecasting results, great attention must be paid to detecting and preventing data leakage. This study presents an innovative system of leakage prediction in machine learning models. It is based on Bayesian inference and provides an analytical approximation to the posterior probability of the unobserved variables, in order to draw statistical conclusions about the relevant correlated variables and to calculate a lower limit on the marginal likelihood of the observed variables being derived from some coupling method. The main notion is that a higher marginal probability for a set of variables suggests a better fit of the data and thus a greater likelihood of a data leak in the model. The methodology is evaluated on a specialized dataset derived from sports wearable sensors.


Introduction
Machine learning models typically receive input data and solve problems such as pattern recognition by applying a sequence of particular transformations. The majority of these transformations turn out to be extremely sensitive to modest changes in input. Under specific scenarios, exploiting this sensitivity can change the behavior of the learning algorithm [1,2]. An adversarial attack is the design of an input in a specific way that leads the learning algorithm to erroneous outputs while not being easily noticed by human observers. It is a severe concern for the reliability and security of artificial intelligence technologies. The issue arises because learning techniques are intended for use in stable situations where training and test data are generated from the same, possibly unknown, distribution [3]. A trained neural network, for example, represents a decision boundary corresponding to a given class. Of course, this boundary is not without flaws. A correctly designed and implemented attack, which corresponds to a modified input from a slightly different dataset, can cause the algorithm to make an incorrect judgment (wrong class) [4][5][6].
Developing and selecting machine learning methodologies to solve complex, usually nonlinear, problems is inextricably linked to the area of application and the target problem. This is one of the essential steps in preprocessing the area of interest and the dataset, as the choice of appropriate algorithms depends not only on the nature and dynamics of the problem but also on the characteristics of the available data, such as the volume, number, and type of the variables in question. The preprocessing of the data concerns the checks and the preparation work that should be carried out on the examined dataset before machine learning algorithms are applied. This step is critical because if the quality of the usage or training data is not ensured, the algorithms' performance will be subpar or the algorithms may produce false results [6,7].
In general, data preparation/preprocessing entails dealing with scenarios in which the original data have issues such as contradicting information, coding discrepancies, and inconsistent field terminology or units of measurement. However, more critical issues must also be addressed, such as the presence of missing values, noise, and extreme values, as well as special requirements that necessitate data transformation, such as discretization, normalization, dimension reduction, or the selection of the most appropriate features [9][10][11]. It should be noted that several techniques can be used in preprocessing, with the choice of the best strategy arising from the nature of the field of knowledge, the problem to be addressed, the available data, and the machine learning algorithm used.
One of the most critical errors that occur during the preprocessing of data for use by machine learning algorithms is data leakage. The leak in question refers to cases where, inadvertently or even intentionally, the value that the model is meant to predict (dependent variable) is contained, indirectly or directly, in the features used to train the algorithm (independent variables). Any variable that provides transparent information about the value that the model is trying to predict is considered a data leak and leads to fictitious results. An obvious solution to this problem is to fit preprocessing steps on the training set only. Applying preprocessing techniques to the whole dataset lets the model learn from both the training and the test sets, resulting in a data leak, and thus the model fails to generalize [2,12,13]. The major problem of data leakage occurs when there is a severe indirect interaction of features that is not easy to detect. For example, it is common in machine learning experiments that the relationship between the dependent and the independent variable is complex (e.g., polynomial, trigonometric, and so on), so new features are created that seem to help capture this relationship. In practice, however, they can create serious data leaks [14,15].
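The point about fitting preprocessing on the training set only can be illustrated with a minimal sketch. The function name `split_then_standardize`, the synthetic data, and the 75/25 split are assumptions made for illustration and are not taken from the study.

```python
import numpy as np


def split_then_standardize(X, test_fraction=0.25, seed=0):
    """Split first, then estimate the scaling statistics on the training rows only."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    X_train, X_test = X[train_idx], X[test_idx]

    # Statistics come from the training split alone, so no test-set
    # information reaches the model through the scaler.
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns

    return (X_train - mean) / std, (X_test - mean) / std, mean, std


# Synthetic stand-in for a feature matrix (hypothetical data).
rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
X_train_s, X_test_s, mean, std = split_then_standardize(X)
```

The leaky variant would compute `mean` and `std` on all of `X` before splitting; the difference is small in code but it lets test-set statistics influence training.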
Similarly, independent and dependent variables may be combined through, for example, an arithmetic operation, a modification, or a conversion, so that they explain more of the variation in the data than they would if they remained separate. Creating a new feature through the interaction of existing features in this way creates data leaks and significant bias in the final machine learning model [4,7,11].
For example, Lu et al. [15] developed a weighted context graph model (WCGM) for information leakage, with the critical goals of first increasing the contextual relevance of information, second classifying the tested data based on the commonality characteristics of its context graphs, and third preserving data proprietors' privacy. The weighted context network reduces complexity by using key sensitive phrases as nodes and contextual linkages as edges. The proposed maximum subgraph matching approach and deep learning algorithms are used to evaluate the similarity of the tested information and the pattern, as well as the responsiveness of the tested data, in order to better match the converted data. The proposed model surpassed the competition regarding accuracy, recall, and run time, indicating its ability to detect real-time data leaks.
Using a variety of datasets, Salem et al. [14] studied the new and developing danger of membership inference attacks, demonstrating the efficacy of the suggested attacks across sectors. They offer two defensive strategies to alleviate the problem. The first, known as dropout, involves randomly deleting specific nodes in each training step of a fully connected neural network. The second, known as model stacking, involves organizing numerous ML models in a ranked order [16]. Extensive testing showed that their defensive strategies may significantly lower the performance of a membership inference attack while retaining a high degree of usefulness, i.e., good target model prediction accuracy. They also suggest a defensive mechanism against a larger class of membership inference attacks while maintaining the ML model's high usefulness.
In this work, we propose an innovative system of leakage prediction in machine learning models, which calculates a lower limit for the marginal probability of the observed variables coming from a coupling method, which in turn indicates that there is data leakage in the examined machine learning model. The methodology is based on Bayesian inference [17][18][19]. The model's goal is to generate an analytical approximation to the posterior probability of the unobserved variables [20,21], to draw statistical inferences about the important correlated variables, and to compute a lower limit for the marginal likelihood of the observed variables generated from a coupling method. The highest probability indicates that there is a data leak [22].
This is done to obtain a solid and generalized forecasting model that produces strong forecasting results without data leakage.

Proposed Approach
The proposed implementation is based on Bayesian inference [23][24][25], which is a method of approaching intractable problems that arise in highly fuzzy environments. More specifically, the methodology offers a reliable solution for the observed variables, the unknown parameters, and the latent states of variables, which are characterized by different types of relationships (interconnected, transformed, hidden, random, and so on). A prior distribution, a posterior distribution, and a likelihood function are used to illustrate Bayesian inference [26] in Figure 1.
The prediction error is defined as the difference between the prior expectation and the likelihood function's peak (i.e., reality). The variance of the prior is the source of uncertainty. The variance of the likelihood function is referred to as noise [27].
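To make these quantities concrete, the following worked equations sketch the update for the simplest case, under the assumption that both the prior and the likelihood are Gaussian. The symbols $\mu_0$, $\sigma_0^2$, $\sigma_n^2$, $y$, and $\delta$ are introduced here for illustration and do not appear in the original.

```latex
% Assumed Gaussian prior and likelihood (illustrative notation)
\theta \sim \mathcal{N}(\mu_0, \sigma_0^2)
  \quad\text{(prior; } \sigma_0^2 \text{ is the uncertainty)},
\qquad
y \mid \theta \sim \mathcal{N}(\theta, \sigma_n^2)
  \quad\text{(likelihood; } \sigma_n^2 \text{ is the noise)}

% Prediction error: likelihood peak minus prior expectation
\delta = y - \mu_0

% Posterior: the prior mean is shifted by a fraction of the prediction error
\theta \mid y \sim \mathcal{N}\!\left(
  \mu_0 + \frac{\sigma_0^2}{\sigma_0^2 + \sigma_n^2}\,\delta,\;
  \frac{\sigma_0^2\,\sigma_n^2}{\sigma_0^2 + \sigma_n^2}
\right)
```

In this setting, a larger prior variance relative to the noise moves the posterior further toward the observation for the same prediction error.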
Parameters and latent variables are grouped together as "unobserved variables." The purpose of the proposed method is therefore as follows [28][29][30][31]:
(1) To generate an analytical approximation to the posterior probability of the unobserved variables, in order to draw statistical inferences about the important correlated variables.

Computational Intelligence and Neuroscience
(2) To derive, from the marginal likelihood of the data presented in the model, a lower limit for the marginal probability of the observed data, with the marginalization conducted over the unobserved variables. The main notion is that a higher marginal probability for a set of variables suggests a better fit of the data and thus a greater likelihood of a data leak in the model.
An example of information gain vs prediction error is presented in Figure 2.
Information gain is calculated mathematically as a function of prediction errors for uncertainty levels ranging from 0.2 to 1.0. The external noise level is set to 0.1 [23,27].
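The text does not give the formula behind Figure 2. As a hedged sketch, one common choice is to measure information gain as the Kullback-Leibler divergence from the prior to the posterior. The code below assumes that definition, a Gaussian prior centered at zero, "uncertainty" read as the prior variance, and "noise" read as the likelihood variance. The function names are hypothetical.

```python
import math


def posterior(prior_mean, prior_var, obs, noise_var):
    """Conjugate Gaussian update: returns posterior mean and variance."""
    gain = prior_var / (prior_var + noise_var)
    post_mean = prior_mean + gain * (obs - prior_mean)
    post_var = prior_var * noise_var / (prior_var + noise_var)
    return post_mean, post_var


def information_gain(prediction_error, prior_var, noise_var=0.1):
    """KL(posterior || prior) in nats, assuming a Gaussian prior centered at 0."""
    post_mean, post_var = posterior(0.0, prior_var, prediction_error, noise_var)
    return 0.5 * (
        math.log(prior_var / post_var)
        + (post_var + post_mean ** 2) / prior_var
        - 1.0
    )


# Uncertainty levels from the text (0.2 to 1.0) with the noise level fixed at 0.1.
for prior_var in (0.2, 0.4, 0.6, 0.8, 1.0):
    gains = [information_gain(e, prior_var) for e in (0.0, 0.5, 1.0, 2.0)]
    print(prior_var, [round(g, 3) for g in gains])
```

Under these assumptions the gain grows with the size of the prediction error at every uncertainty level. Whether this matches the curves in Figure 2 depends on the definition used there.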
The method generally approximates the conditional density of the latent variables given the observed variables, where we assume that a mixture is present. Mixing behavior occurs because the source of each observation is unknown, that is, its classification into a specific component is not observed [32]. Thus, each observation $x_i$ is attributed to one of the components $f_j(\cdot \mid \theta_j)$ with probability $p_j$. Depending on the case, the purpose of the inference is to reconstruct the classification of the observations into components, to construct estimators for the components' parameters, or even to estimate the number of components itself [15]. It is always feasible to associate a latent variable with a random variable $X_i$ drawn from a mixture of $k$ distributions via a delimitation method [25,33]:

$$X_i \sim \sum_{j=1}^{k} p_j\, f_j(x \mid \theta_j).$$

The random variable $Z_i$, with values in $\{1, 2, \ldots, k\}$, is as follows [34]:

$$P(Z_i = j) = p_j, \qquad X_i \mid Z_i = j \sim f_j(x \mid \theta_j).$$

Next, we assume that we have observed the extended data, which consist of independent pairs $(X_i, Z_i)$ with distribution [35]:

$$P(Z_i = z_i,\, X_i \in dx_i) = p_{z_i}\, f_{z_i}(x_i \mid \theta_{z_i})\, dx_i.$$

In the particular case of the two-component normal mixture model with means $\mu_1$ and $\mu_2$, where we consider the same normal prior distribution on the means, $\mu_1, \mu_2 \sim \mathcal{N}(0, 10)$, we will calculate the posterior weight $\omega(z)$ for a classification $z$ in which $l$ observations fall in the first component [24,36]:

$$\sum_{i=1}^{n} \mathbb{I}\{z_i = 1\} = l, \qquad (n_1, n_2) = (l,\, n - l).$$
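The latent-allocation view of a mixture described above can be sketched in a few lines. The weight 0.7, the means (0, 2.5), and the sample size 500 are the values used later in the experiments. The unit standard deviations and the function name `sample_mixture` are assumptions made for illustration.

```python
import numpy as np


def sample_mixture(n, weights, means, sds, seed=0):
    """Draw (x_i, z_i) pairs: first the latent allocation, then the observation."""
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    sds = np.asarray(sds, dtype=float)
    z = rng.choice(len(weights), size=n, p=weights)  # Z_i with P(Z_i = j) = p_j
    x = rng.normal(loc=means[z], scale=sds[z])       # X_i | Z_i = j ~ f_j
    return x, z


# Unit standard deviations are an assumption; the text does not state them.
x, z = sample_mixture(500, weights=[0.7, 0.3], means=[0.0, 2.5], sds=[1.0, 1.0])
counts = np.bincount(z, minlength=2)  # (n_1, n_2) = (l, n - l)
print(counts)
```

In practice only `x` is observed. The allocations `z` are the unobserved variables that the inference has to reconstruct.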
So, we have the joint posterior $\pi(z, \mu_1, \mu_2 \mid x)$ [37]. The posterior weight $\omega(z)$ is obtained by integrating this function over $\mathbb{R} \times \mathbb{R}$ with respect to $\mu_1$ and $\mu_2$, which is a double integral that is easily calculated. For the integration with respect to $\mu_1$, after excluding the factors that do not contain it, it is enough to calculate a one-dimensional integral [24,33,36,38]. This integral can be evaluated in closed form because the last integral is taken over the full support of the exponential distribution [39]. For the integration with respect to $\mu_2$, again excluding the factors that do not contain it, it is enough to calculate the corresponding integral [23,36,38,40]. Following the same methodology as before, an analogous expression is obtained [41]. The posterior probability $\omega(z)$ is then calculated from these two results [21,23,42,43], and substituting the constants $c_1$, $c_2$ gives the final relation. Thus, from the above analysis, it appears that it is practically possible to arrive at detailed expressions of the maximum likelihood and Bayes estimators [44] for the prior distributions of the variables of interest and thus marginalize the set of variables for models where there is a data leak [28,33].
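The integration described above can be sketched as a worked derivation. The sketch assumes unit-variance normal components and reads $\mathcal{N}(0, 10)$ as a prior with variance 10. Neither choice is confirmed by the text. The symbols $t_1$ and $t_2$ are introduced here for illustration and are not claimed to correspond to the constants $c_1$, $c_2$ of the original derivation.

```latex
% Joint posterior of allocations and means (unit component variances assumed)
\pi(z, \mu_1, \mu_2 \mid x) \propto
  p^{l} (1-p)^{n-l}
  \exp\Big\{-\tfrac{1}{2}\sum_{i:\,z_i=1}(x_i-\mu_1)^2\Big\}
  \exp\Big\{-\tfrac{1}{2}\sum_{i:\,z_i=2}(x_i-\mu_2)^2\Big\}
  \exp\Big\{-\tfrac{\mu_1^2}{20}\Big\}
  \exp\Big\{-\tfrac{\mu_2^2}{20}\Big\}

% Integration over \mu_1, with t_1 = \sum_{i:\,z_i=1} x_i (completing the square)
\int_{\mathbb{R}}
  \exp\Big\{-\tfrac{1}{2}\Big[\sum_{i:\,z_i=1}(x_i-\mu_1)^2 + \tfrac{\mu_1^2}{10}\Big]\Big\}
  \, d\mu_1
= \sqrt{\frac{2\pi}{l + 1/10}}\;
  \exp\Big\{-\tfrac{1}{2}\Big[\sum_{i:\,z_i=1} x_i^2 - \frac{t_1^2}{l + 1/10}\Big]\Big\}

% The integral over \mu_2 is identical with l replaced by n - l and
% t_2 = \sum_{i:\,z_i=2} x_i. Dropping factors that do not depend on z:
\omega(z) \propto
  \frac{p^{l} (1-p)^{n-l}}{\sqrt{(l + 1/10)(n - l + 1/10)}}\;
  \exp\Big\{\tfrac{1}{2}\Big[\frac{t_1^2}{l + 1/10} + \frac{t_2^2}{n - l + 1/10}\Big]\Big\}
```

The Gaussian integral in the second step equals a known constant because the completed square is a normal kernel integrated over the whole real line.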

Experiments and Results
A specialized scenario was implemented to model the proposed system, using sports wearables data that record the movements of athletes playing beach volleyball. The dataset comprises three-dimensional acceleration data from joint actions of beach volleyball athletes, each of whom was fitted with an accelerometer worn on the wrist and sampled at 39 Hz. The signal was recorded at 14 bits per axis and then compressed to 16 g. The x, y, and z axes relate to the athletes' spatial arrangement, which is recorded in an independent coordinate system based on the sensor configuration, as there was no transfer to real-world coordinates [45,46]. The 30 athletes recorded ranged in expertise from novice to professional volleyball players. The dataset's goal is to support an identification and classification system that extracts relevant portions from continuous input and classifies them [47]. The categorization includes ten different volleyball activities, such as homemade service, block, nail, and so on. For the evaluation of the system, 10 characteristics were selected and randomly combined into pairs, in order to identify whether the observed variables come from a coupling method and whether there is a data leak.
We first describe some key features. Let $g(\cdot, \cdot \mid \theta)$ be the joint density function of $(X, Z)$ given the parameter vector $\theta$, $f(\cdot \mid \theta)$ be the density function of $X$ given $\theta$, and $k(\cdot \mid x, \theta)$ be the density function of the conditional distribution of $Z$ given the observations $x$ and $\theta$. The algorithm is based on the use of incomplete data, i.e., we can write the distribution of the sample $x$ as follows [1,2,40]:

$$f(x \mid \theta) = \frac{g(x, z \mid \theta)}{k(z \mid x, \theta)}.$$

Taking logarithms:

$$\log f(x \mid \theta) = \log g(x, z \mid \theta) - \log k(z \mid x, \theta).$$

We arrive at the complete (unobserved) log-likelihood:

$$L^{c}(\theta \mid x, z) = L(\theta \mid x) + \log k(z \mid x, \theta),$$

where $L$ is the observed log-likelihood. The algorithm fills in the missing variables $z$ based on $k(z \mid x, \theta)$ and then maximizes the expected complete log-likelihood with respect to $\theta$ [21,25,48]. So, the algorithm is configured as follows:
(1) Give some initial values to $\theta^{(0)}$.
As an application of the above, we consider the particular case of a mixture of two normal variables, where all parameters are known except $\theta = (\mu_1, \mu_2)$. For a simulated sample of 500 observations with actual values $p = 0.7$ and $(\mu_1, \mu_2) = (0, 2.5)$, the log-likelihood has two peaks. Applying the algorithm to this model, we first write the complete likelihood and its logarithm [20,49,50]. For the first step, we need to calculate the expectation of this logarithm, where the mean value is taken for $Z \sim k(z \mid x, \theta)$ and the $Z_i$ are independent [51][52][53][54]. In step $t$, the expected allocations are computed, and the resulting expected log-likelihood is maximized in the second step with respect to $(\mu_1, \mu_2)$. This example involved running the algorithm 20 times (each time with 100 iterations), with the initial conditions drawn at random from a range of possible values. However, the proposed approach was drawn to the highest, principal peak of the log-likelihood in only 8 of the 20 runs. In the remaining 12 runs it was drawn to the pseudo-peak of the log-likelihood (although the likelihood there is much lower). The initial values were closer to the lower peak than the final values, indicating that the early values were more accurate.
When the algorithm converges to the pseudo-peak of the likelihood, we may make 84 percent correct predictions about the coupling between the variables in the dataset. Accordingly, if the algorithm converges to the dominant peak of the likelihood, 93 percent of the variables are accurately predicted to be coupled.
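The two-step procedure for this mixture can be sketched as follows. This is a minimal illustration and not the author's implementation. It assumes unit standard deviations for both components (the text only says the remaining parameters are known), uses synthetic data generated with the stated values, and draws the 20 starting points uniformly from the observed data range. The names `em`, `log_likelihood`, and `weighted_densities` are hypothetical.

```python
import numpy as np

P, SD = 0.7, 1.0  # mixing weight from the text; unit SD is an assumption


def weighted_densities(x, mu1, mu2):
    """Unnormalized weighted component densities for each observation."""
    d1 = P * np.exp(-0.5 * ((x - mu1) / SD) ** 2)
    d2 = (1 - P) * np.exp(-0.5 * ((x - mu2) / SD) ** 2)
    return d1, d2


def log_likelihood(x, mu1, mu2):
    """Observed log-likelihood of the two-component mixture."""
    d1, d2 = weighted_densities(x, mu1, mu2)
    return float(np.sum(np.log((d1 + d2) / (SD * np.sqrt(2 * np.pi)))))


def em(x, mu1, mu2, n_iter=100):
    """EM for the two means; returns final means and the log-likelihood path."""
    history = [log_likelihood(x, mu1, mu2)]
    for _ in range(n_iter):
        # First step: expected allocation of each point to component 1.
        d1, d2 = weighted_densities(x, mu1, mu2)
        r = d1 / (d1 + d2)
        # Second step: weighted means maximize the expected complete log-likelihood.
        mu1 = float(np.sum(r * x) / np.sum(r))
        mu2 = float(np.sum((1 - r) * x) / np.sum(1 - r))
        history.append(log_likelihood(x, mu1, mu2))
    return mu1, mu2, history


# Synthetic sample with the values stated in the text.
rng = np.random.default_rng(0)
n = 500
in_first = rng.random(n) < P
x = np.where(in_first, rng.normal(0.0, SD, n), rng.normal(2.5, SD, n))

# 20 runs of 100 iterations from random starting points.
results = []
for _ in range(20):
    start1, start2 = rng.uniform(x.min(), x.max(), size=2)
    mu1, mu2, history = em(x, start1, start2, n_iter=100)
    results.append((history[-1], mu1, mu2))

best = max(r[0] for r in results)
n_dominant = sum(r[0] > best - 1e-2 for r in results)
print(f"{n_dominant} of 20 runs ended at the highest log-likelihood found")
```

How many runs reach the dominant peak depends on the random starting points and on the simulated sample, so the count printed here need not match the 8 out of 20 reported in the study.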

Discussion and Conclusions
In this work, we proposed an innovative system of leakage prediction in machine learning models, which is based on Bayesian inference, to calculate a lower limit for the marginal probability of the observed variables coming from a coupling method, which indicates that there is data leakage in the examined machine learning model. The methodology is evaluated on a specialized dataset from sports wearable sensors, where the ability of the method to detect variable coupling is demonstrated, even when the coupling is done randomly. The proposed methodology is a Bayesian approach to statistical inference in complicated distributions that are difficult to evaluate directly or by sampling. It is an alternative to Monte Carlo sampling methods. While Monte Carlo techniques use a sequence of samples to approximate a posterior distribution numerically, the proposed algorithm provides a locally optimal, exact analytical solution, allowing even hidden variable coupling to be found. From the maximum a posteriori estimate of each variable's single most probable value to the fully Bayesian estimation that calculates (approximately) the entire posterior distribution of the parameters and latent variables, the algorithm finds a set of optimal parameters of the interrelated variables, which can then be solved in detail using the information obtained from the data. Indeed, this is true even for conceptually comparable variables, such as a basic nonhierarchical model with only two parameters and no latent variables. An extension of the methodology can focus on integrating adversarial machine learning techniques, so that it becomes a complete defense system against attacks that attempt to deceive the models by providing misleading information. A further aim is to determine strategies and procedures for running the model on specified sets of problems with training and test data generated from the same statistical distribution.
Moreover, a future expansion of the proposed system will review the taxonomies of the characteristics of transfer learning, particularly whether and how this system can mitigate them. Finally, transfer learning approaches will be investigated against known distribution attack methods that seek to exploit the dynamics of classification decision boundaries.
Data Availability
The data used in this study are available from the corresponding author upon request.

Conflicts of Interest
The author declares that there are no conflicts of interest.