Estimation of the Conditional Probability Using a Stochastic Gradient Process

The use of conditional probabilities has gained popularity in various fields such as medicine, finance, and image processing, especially with the availability of large datasets that allow us to exploit the full potential of the available estimation algorithms. Nevertheless, such a large volume of data often demands significant computational capacity and, consequently, long computation times. In this article, we propose a low-cost estimation method: we first demonstrate analytically the convergence of our method to the desired probability, and then we perform a simulation to support our point.


Introduction
The likelihood that an event B will occur given that event A has already occurred is called the conditional probability, denoted by P(B|A) or P_A(B). For example, if a card is drawn at random from a deck, there is a one in four chance of getting a heart, but if a red reflection is seen on the table, there is now a one in two chance of getting one. If events A and B have nonzero probabilities, then the conditional probability is given by P(B|A) = P(A ∩ B)/P(A). That covers the scientific side, but conditional probability is also useful in daily life across various fields and is gaining more and more interest. For example, banks estimate the probability of default of a borrower or bond issuer using conditional probability estimation methods based on the Basel II regulations (see [1] for more information). The estimation of this probability is crucial since it allows banks to compute their expected losses and therefore to cover the consequences. Another area where the estimation of conditional probabilities is important is marketing, where it is used to estimate the interest of a customer in a given product or service; marketers are thereby able to focus on the most receptive population in order to optimize marketing costs [2]. This probability is also frequently estimated in medicine, as doctors need to assess the likelihood of a patient being affected by a given disease based on the symptoms the patient presents [3], and in many more areas, such as drug discovery, computer vision, speech recognition, handwriting recognition, biometric identification, document classification, Internet search engines, pattern recognition, and recommender systems [4][5][6][7][8][9][10][11].
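The card example above can be checked by direct enumeration; the deck representation below is purely illustrative.

```python
from itertools import product

# Build a standard 52-card deck as (rank, suit) pairs; hearts and diamonds are red.
ranks = list(range(1, 14))
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))

hearts = [card for card in deck if card[1] == "hearts"]
red = [card for card in deck if card[1] in ("hearts", "diamonds")]
red_hearts = [card for card in red if card[1] == "hearts"]

p_heart = len(hearts) / len(deck)               # P(heart) = 13/52 = 0.25
p_heart_given_red = len(red_hearts) / len(red)  # P(heart | red) = 13/26 = 0.5
print(p_heart, p_heart_given_red)
```

Conditioning on the red reflection shrinks the sample space from 52 cards to the 26 red ones, which is exactly the P(A ∩ B)/P(A) computation.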
In practice, conditional probability estimation methods can be divided into two categories: linear and nonlinear classifiers. Linear classifiers can be split into two subcategories, generative and discriminative models [12, 13], and the most commonly used are
(i) Fisher's linear discriminant,
(ii) logistic regression,
(iii) the naive Bayes classifier.
Nonlinear classifiers can be grouped into the following families of methods:
(i) linear classifiers with transformed data, such as a discretization of continuous variables,
(ii) support vector machines,
(iii) quadratic classifiers,
(iv) k-nearest neighbors,
(v) decision trees.
Let us consider an observable binary random variable U and a random variable V. We define U so that it takes the value 1 when the event of interest occurs and 0 otherwise. We wish to estimate the vector θ such that the conditional probability P(U = 1|V) is written in the form

P(U = 1|V) = 1/(1 + exp(−ρ(V)′θ)),

where ρ(V) is a vector of measurable functions of V that will be made precise below. We are looking for a simple method of estimating the parameter θ that is less demanding in terms of computational capacity. This is useful especially in the Big Data era, where datasets can be massive and any common iterative estimation can take a long time. To do this, we use stochastic approximation, which was introduced by Herbert Robbins and Sutton Monro in 1951 [21]. The goal is to find the unique root θ* of the equation M(θ) = α when M(θ) cannot be directly observed; instead, we assume that we can observe a random variable Y(θ) whose expectation is M(θ). According to [21], there exists a sequence (a_n) of positive real numbers satisfying Σ_{n≥1} a_n = ∞ such that the process θ_n defined by

θ_{n+1} = θ_n + a_n(α − Y(θ_n))

converges to the unique root of M(θ) = α. In our case, we start from the work of Bennar et al. [22], who established the conditions for the almost sure convergence, as well as the quadratic mean convergence, of a stochastic gradient process θ_n to the parameter θ that allows us to estimate E[U|V].
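The Robbins-Monro recursion can be sketched as follows. The target function M(θ) = 2θ, the noise level, and the step sequence a_n = 1/n are illustrative assumptions chosen so that the unique root of M(θ) = α is known in advance.

```python
import random

random.seed(0)

ALPHA = 4.0  # we seek the root theta* of M(theta) = ALPHA

def noisy_M(theta):
    # Y(theta): a noisy observation of M(theta) = 2 * theta (an illustrative target),
    # so the unique root of M(theta) = ALPHA is theta* = 2
    return 2.0 * theta + random.gauss(0.0, 1.0)

theta = 0.0  # arbitrary starting point
for n in range(1, 20001):
    a_n = 1.0 / n                        # sum a_n = inf, sum a_n^2 < inf
    theta += a_n * (ALPHA - noisy_M(theta))

print(theta)
```

Even though each observation of M is corrupted by noise, the decreasing steps average the noise out while still allowing the process to travel any distance, which is why both Σ a_n = ∞ and a_n → 0 are needed.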
Here, we are interested in the case of binary random variables, where E[U|V] coincides with P(U = 1|V), since

E[U|V] = 1 · P(U = 1|V) + 0 · P(U = 0|V) = P(U = 1|V).

We also chose these results as the basis of our work because the stochastic gradient process draws a sample at each iteration in order to compute the estimates without relying on all the available data.
In this article, we first present the convergence results established by Bennar et al.; then we show that these results remain valid in the framework of estimating the conditional probability. We also present a simulation to illustrate the obtained results, and finally we conclude by discussing perspectives for further development.

Preliminaries
Let us consider an observable random variable U and a random variable V, both taking values in R^k with law μ. We seek to estimate the parameter θ in R^p such that ϕ(V, θ) approaches E[U|V] in the least squares sense. It should also be noted that the estimation of the parameters of a logistic regression in the least squares sense is already achieved through the iteratively reweighted least squares method [23], which, unlike our approach, is computationally heavy and requires substantial computing capacity in the case of large datasets.
Let f be the real positive function defined on R^p by

f(θ) = E[(U − ϕ(V, θ))²].

We are looking for the value of θ that minimizes the function f. Let us also define the real positive function g on R^p by

g(θ) = E[(E[U|V] − ϕ(V, θ))²].

We have

f(θ) = g(θ) + E[(U − E[U|V])²],

and since the last term does not depend on θ, the problem reduces to looking for the θ that minimizes the function g. To estimate θ in a sequential way, we use a stochastic gradient algorithm: we consider a random process (θ_n) in R^p defined by

θ_{n+1} = θ_n − a_n ∇_θ (U_{n+1} − ϕ(V_{n+1}, θ_n))²,

where (a_n) is a sequence of positive real numbers. In the following, the abbreviation a.s. means almost sure convergence and q.m. means quadratic mean convergence. The convergence results of [22] rely on the following assumptions:

(H_1) a_n > 0 and Σ_{n≥1} a_n² < ∞;
(H_1′) a_n > 0, Σ_{n≥1} a_n = ∞, and Σ_{n≥1} a_n² < ∞;
(H_2) there exist a and b such that, for all θ, …;
(H_4) θ* is a local minimum of g;
(H_5) θ* is the unique stationary point of g: ∇g(θ) = 0 ⇒ θ = θ*.

Under these assumptions, the process (θ_n) converges a.s. and in q.m. to θ*. Proof. See [22].
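The stochastic gradient recursion for minimizing f can be sketched in Python (the paper's own simulations use R). The linear model ϕ(V, θ) = θ · V, the true coefficient 1.5, the noise level, and the step sequence a_n = 1/(n + 10) are illustrative assumptions; the offset in a_n only tames the first few steps and does not affect assumption (H_1′).

```python
import random

random.seed(1)

def phi(v, theta):
    # hypothetical parametric family: a simple linear model phi(V, theta) = theta * V
    return theta * v

theta = 0.0  # arbitrary starting point
for n in range(1, 50001):
    # one fresh observation (U, V) per iteration, with E[U | V] = 1.5 * V
    v = random.gauss(0.0, 1.0)
    u = 1.5 * v + random.gauss(0.0, 0.5)
    a_n = 1.0 / (n + 10)                     # sum a_n = inf, sum a_n^2 < inf
    grad = -2.0 * (u - phi(v, theta)) * v    # gradient of (u - phi(v, theta))^2 in theta
    theta -= a_n * grad

print(theta)
```

Each iteration touches a single observation, which is the computational advantage over batch least squares: no pass over the full dataset is ever required.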

Proof of Process Convergence

Let ρ_1, ρ_2, …, ρ_p be p real measurable functions of q real variables, and write ρ(V) = (ρ_1(V), ρ_2(V), …, ρ_p(V))′. In order to estimate the value of θ that minimizes E[(E[U|V] − 1/(1 + exp(−ρ(V)′θ)))²], we consider the following stochastic approximation process (θ_n) in R^p defined by

θ_{n+1} = θ_n − a_n ∇_θ (U_{n+1} − 1/(1 + exp(−ρ(V_{n+1})′θ_n)))²,

where (U_1, V_1), (U_2, V_2), …, (U_n, V_n) is a sample of independent and identically distributed copies of (U, V).
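This logistic-form process can be sketched as follows. The feature map ρ(V) = (1, V), the data-generating parameters, the step sequence a_n = 1/(n + 20), and the number of iterations are illustrative assumptions; the sketch tracks the empirical counterpart of E[(U − ϕ(V, θ))²] on a held-out sample to show that the process actually decreases the target criterion.

```python
import math
import random

random.seed(2)

def sigmoid(z):
    # numerically stable logistic function 1 / (1 + exp(-z))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def rho(v):
    # hypothetical feature map rho(V) = (1, V): an intercept plus the raw variable
    return (1.0, v)

TRUE_THETA = (0.0, 1.0)  # assumed data-generating parameters, for illustration only

def draw_pair():
    # one i.i.d. observation (U, V) with P(U = 1 | V) = sigmoid(rho(V)' theta)
    v = random.gauss(0.0, 2.0)
    p = sigmoid(sum(t * r for t, r in zip(TRUE_THETA, rho(v))))
    return (1.0 if random.random() < p else 0.0), v

def empirical_mse(theta, pairs):
    # empirical counterpart of E[(U - phi(V, theta))^2]
    return sum((u - sigmoid(sum(t * r for t, r in zip(theta, rho(v))))) ** 2
               for u, v in pairs) / len(pairs)

holdout = [draw_pair() for _ in range(2000)]
theta = [0.0, 0.0]                        # arbitrary starting point
mse_before = empirical_mse(theta, holdout)

for n in range(1, 200001):
    u, v = draw_pair()                    # one fresh observation per iteration
    r = rho(v)
    p = sigmoid(sum(t * x for t, x in zip(theta, r)))
    a_n = 1.0 / (n + 20)                  # sum a_n = inf, sum a_n^2 < inf
    for i in range(len(theta)):
        # gradient of (u - sigmoid(rho(V)' theta))^2 with respect to theta_i
        theta[i] -= a_n * (-2.0 * (u - p) * p * (1.0 - p) * r[i])

mse_after = empirical_mse(theta, holdout)
print(theta, mse_before, mse_after)
```

The p(1 − p) factor in the gradient comes from the derivative of the sigmoid; it damps updates where the model is already confident, which keeps the recursion numerically stable without any clipping.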
Let us prove that assumption (7) holds. To do this, we use the following result.
Let f be a differentiable map on a convex open set U; then, for all x, y in U,

‖f(y) − f(x)‖ ≤ sup_{z∈U} ‖f′(z)‖ ‖y − x‖,

where, for any point z of U, ‖f′(z)‖ is the operator norm of the differential of f at the point z.
Proof. See [24], p. 31. Then, there exist two real positive functions h and h′ defined on R^p such that
□

Simulation

In order to illustrate our work, we perform a simulation in which we estimate the parameters of a logistic regression. Our simulations are performed using the programming language R. We simulate 10,000 observations of the random variable V ∼ N(3, 10), and we define U as a binary function of V and of a noise term ε ∼ N(0, 3), the noise being introduced to avoid having a perfectly fitted model. We then fit a classical logistic regression with the Fisher scoring algorithm, which converged in 12 iterations.
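A classical fit of this kind can be sketched in stdlib Python (the paper uses R's Fisher scoring via glm). Since the text does not fully specify the data-generating rule, the model P(U = 1|V) = sigmoid(b0 + b1·V), the parameters B_TRUE, the distribution of V, and the sample size below are illustrative assumptions.

```python
import math
import random

random.seed(3)

def sigmoid(z):
    # numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Hypothetical data-generating model: P(U = 1 | V) = sigmoid(b0 + b1 * V).
B_TRUE = (-1.0, 0.5)
data = []
for _ in range(5000):
    v = random.gauss(0.0, 2.0)
    p = sigmoid(B_TRUE[0] + B_TRUE[1] * v)
    data.append((1.0 if random.random() < p else 0.0, v))

# Fisher scoring (equivalently, Newton-Raphson) for the two-parameter logistic MLE.
b0, b1 = 0.0, 0.0
n_iter = 0
for _ in range(25):
    n_iter += 1
    g0 = g1 = 0.0             # score vector X'(y - p)
    h00 = h01 = h11 = 0.0     # Fisher information X'WX with W = diag(p(1 - p))
    for u, v in data:
        p = sigmoid(b0 + b1 * v)
        w = p * (1.0 - p)
        g0 += u - p
        g1 += (u - p) * v
        h00 += w
        h01 += w * v
        h11 += w * v * v
    det = h00 * h11 - h01 * h01
    d0 = (h11 * g0 - h01 * g1) / det   # solve the 2x2 system (X'WX) d = X'(y - p)
    d1 = (h00 * g1 - h01 * g0) / det
    b0, b1 = b0 + d0, b1 + d1
    if abs(d0) + abs(d1) < 1e-8:       # stop when the Newton step is negligible
        break

# accuracy rate: correctly classified observations over the total, at threshold 0.5
accuracy = sum((sigmoid(b0 + b1 * v) >= 0.5) == (u == 1.0) for u, v in data) / len(data)
print(b0, b1, n_iter, accuracy)
```

Note that every Fisher scoring iteration sweeps the entire dataset to rebuild the score and information matrix, which is precisely the cost the proposed one-observation-per-iteration process avoids.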
We define the accuracy rate as the number of correctly classified observations over the total number of observations; the classical model achieves an accuracy of 90.34%. Table 1 shows the remaining outputs of the model. Regarding the proposed process, we initialize it with the following arbitrarily chosen values, Intercept = −3 and θ = −3, and we choose a_n = (1 + exp(−ρ(V)′θ))/n; as ρ(V) and θ are finite, assumption H_1′ is satisfied. We also draw a random sample of one observation at each iteration to perform our calculations. Finally, we set a convergence tolerance of 10^{−12}. Following the simulations, we obtain the results as follows.

Journal of Mathematics
We can see through Figures 1 and 2, as well as Figure 3, that the process converged in 10 iterations. Therefore, we only needed 10 samples of one observation each to obtain a robust estimation of the coefficients. Moreover, we can see in Figure 3, as well as in the summary of the process in Table 2, that the latter records a prediction accuracy of 89% on the set of simulated observations, hence a loss of about 1% in accuracy; in return, we gained greatly in terms of computing capacity.

Conclusion
In this work, we have demonstrated the convergence of the studied process towards the values that minimize the function g(θ), and our simulations show that this theoretical result also holds empirically. Nevertheless, the simulation required that we arbitrarily set a starting point, which can lead to slow convergence of the process when the initial point is far from the targeted value. Moreover, the speed of convergence is also greatly affected by the choice of the sequence a_n. Thus, a possible improvement would be to find the optimal sequence a_n that provides the fastest convergence.

Data Availability
No data were used to support this study.