Novel ANN method for solving ordinary and fractional Black-Scholes equation

The main aim of this study is to introduce a 2-layered Artificial Neural Network (ANN) for solving the Black-Scholes partial differential equation (PDE) of either fractional or ordinary order. First, a discretization method is employed to convert the model into a sequence of Ordinary Differential Equations (ODEs). Each of these ODEs is then solved with the aid of an ANN. Adam optimization is employed as the learning paradigm, since it adds the foreknowledge of slowing the optimization down when getting close to the actual optimum solution. The model also takes advantage of fine-tuning to speed up the process, and of domain mapping to confront the infinite-domain issue. Finally, the accuracy, speed, and convergence of the method for solving several types of Black-Scholes models are reported.


Introduction
Both PDEs of ordinary and fractional order play an important role in the pricing of financial derivatives. PDEs of ordinary order are the basis of various models proposed for pricing different types of options. On the other hand, financial markets show fractal behavior (Mandelbrot, 1963; Peters, 1989; Li and Ma, 2005; Huang, Li and Xiong, 2012), and fractional PDEs (FPDEs), which can better reflect this reality, have gained a lot of interest recently. Hence, finding an accurate and efficient approach for solving both types is a critical issue in pricing.
The most famous PDE in finance is the Black-Scholes (B-S) model, which has been solved by a broad range of numerical techniques; among recent ones, collocation methods provide high accuracy for the generalized B-S model (Roul (2020)). In order to reach sixth-order accuracy in solving the generalized Black-Scholes model, Roul and Goura combined the Crank-Nicolson scheme with a sextic B-spline collocation method.
Regarding machine learning, and ANNs in particular, ANNs have traditionally been used for predicting option prices (Malliaris & Salchenberger (1993); Yao (2000); Amornwattana (2007); Andreou (2008); Jang (2019); Ozbayoglu (2020)). However, to the best of our knowledge, except for a few studies there is no available literature on solving the B-S differential equation by ANN (Cervera (2019); Eskiizmirliler (2020)), and even these few studies are limited to integer-order PDEs; there is no study on the solution of the fractional-order B-S equation. In this paper we construct a 2-layered Artificial Neural Network (ANN) to solve the Black-Scholes model of either fractional or ordinary order. Adaptive Moment Estimation (Adam), which was specifically created for neural networks, acts as the optimizer, adding the foreknowledge of slowing down when getting close to the optimal solution. To make the process faster and increase the efficiency of the method, fine-tuning is applied to the model. Also, to overcome the problem of an infinite problem domain, domain mapping is used to shift the whole problem to a finite interval. The rest of the paper is organized as follows: Section 2 is dedicated to the problem formulation, Section 3 explains the methodology, Section 4 presents the numerical results, and finally the conclusion is provided in Section 5.

Problem Formulation
A put option on an underlying asset S is said to follow a Geometric Brownian Motion (GBM), where σ, r, and W are the volatility, interest rate, and Brownian motion respectively, if it obeys the stochastic differential equation
$dS = rS\,dt + \sigma S\,dW. \qquad (1)$
Using Eq. (1) together with the risk-neutral valuation formula and the classic Feynman-Kac formula, the Black-Scholes operator is formed as
$\frac{\partial U}{\partial t} + \frac{1}{2}\sigma^2 S^2 \frac{\partial^2 U}{\partial S^2} + rS\frac{\partial U}{\partial S} - rU = 0, \qquad (2)$
where U(S, t) is the unknown function determining the option price Robbins & Monro (1951). It has been specified that this option will have a certain payoff at a certain date in the future, depending on the value(s) taken by the stock up to that date.
It is now well known that a time-fractional Black-Scholes equation with a derivative of real order α can be obtained to describe the price of an option under several circumstances, such as when the change in the underlying asset is assumed to follow a fractal transmission system. Fractional derivatives were introduced in option pricing to exploit their memory properties, capturing both major jumps over small periods of time and long-range dependencies in markets. The fractional Black-Scholes model can therefore be formulated as
$D_t^{\alpha} U(S,t) + \gamma_1 S^2 \frac{\partial^2 U}{\partial S^2} + \gamma_2 S \frac{\partial U}{\partial S} - \gamma_3 U = f(S,t), \qquad (3)$
where D denotes the fractional derivative, α is a real number, and $\gamma_i$, i = 1, 2, 3 and f(S, t) are functions of σ, r, and α, written with these notations for simplicity. Based on the type of the option, the corresponding condition set (4) consists of the payoff condition at maturity, U(S, T) = g(S), together with boundary conditions prescribed on the boundaries of the spatial domain, the upper one being denoted $M_b(t)$. In some cases, e.g. European options, $M_b(t)$ moves toward infinity, so the problem domain is semi-infinite.
Some possible strategies for overcoming this obstacle when pricing are defined in Section 3.4.

Methodology
In this section, the four main concepts employed in the present approach are explained. Then the necessary steps for finding the solution are combined to form the proposed method.

Time Discretization
Solving multi-variable equations increases the time complexity and the risk of producing inconsistent answers in computational software. Time-discretization methods are therefore useful tools for converting such models into a series of Ordinary Differential Equations (ODEs). This approach results from applying finite-difference methods to one dimension of an equation to approximate the derivatives with respect to that dimension.
Since fractional equations are also investigated here, both ordinary and fractional time-discretization approaches are discussed. Suppose that the Partial Differential Equation (PDE) is to be solved on [0, T] for the unknown U(S, t), let $\Delta t = \frac{T-0}{N} = \frac{T}{N}$ with N the number of time-steps, and write $U_i(S) = U(S, t_i)$ for $t_i = i\Delta t$.

Ordinary time discretization
Consider a PDE of the form
$\frac{\partial U(S,t)}{\partial t} = \Omega\!\left(S, U, \frac{\partial U}{\partial S}, \frac{\partial^2 U}{\partial S^2}\right),$
where U(S, t) is the answer and Ω is a linear or non-linear function. Instead of solving this problem in two dimensions, it can be converted into a series of dependent ODEs.
By using the defined $U_i(S)$, the following equation is constructed:
$\frac{U_{i+1}(S) - U_i(S)}{\Delta t} = \theta\,\Omega(S, U_{i+1}) + (1-\theta)\,\Omega(S, U_i).$
The method is implicit if θ = 1; in this case only the posterior time-step is used. The method is called explicit if θ = 0, where only the computing time-step is utilized to approximate the solution. If θ = 1/2, the scheme is the common Crank-Nicolson method, which is unconditionally stable and of second order in time, and which uses both the posterior and the computing time-steps to approximate the solution of the model. Due to the non-smoothness of the payoff function and of the activation functions in our ANN, Crank-Nicolson cannot reach its second-order convergence and can also cause extra inconsistencies for the same reason. Therefore, θ = 0 is considered from this point on.
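As an illustration of the θ-scheme above, the following minimal sketch advances a toy linear system dU/dt = AU in time. The matrix A and all parameter values are illustrative only; the actual Black-Scholes operator is not included.

```python
import numpy as np

def theta_step(U, A, dt, theta):
    """One step of the theta-scheme for dU/dt = A @ U.

    theta = 0   -> explicit (forward) Euler
    theta = 1   -> implicit (backward) Euler
    theta = 0.5 -> Crank-Nicolson
    """
    n = len(U)
    I = np.eye(n)
    # Solve (I - dt*theta*A) U_new = (I + dt*(1-theta)*A) U_old
    return np.linalg.solve(I - dt * theta * A, (I + dt * (1 - theta) * A) @ U)

# Toy system: dU/dt = -U componentwise, exact solution exp(-t)
A = -np.eye(3)
U = np.ones(3)
dt, steps = 0.01, 100
for _ in range(steps):
    U = theta_step(U, A, dt, 0.5)  # Crank-Nicolson
# After t = 1 each component should be close to exp(-1)
```

For θ = 0 the linear solve degenerates to an identity system, recovering the purely explicit update used in the rest of the paper.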

Fractional Time Discretization
Suppose the time-fractional PDE (FPDE) to be solved is of the form
$D_t^{\alpha} U(S,t) = \Omega(S, U), \qquad 0 < \alpha < 1,$
where $D_t^{\alpha}$ denotes the Caputo derivative of order α. As the first step for such FPDEs, the Caputo derivative should be discretized. (Readers are advised to see HadianRasanan et al. (2019) for the preliminaries and thorough information about this derivative.) Consider the following theorem.
Theorem 3.1. Suppose that [0, T] is divided into N parts with step-size $\Delta t = T/N$, $0 < \alpha < 1$, and $q(t) \in C^2[0, t_k]$ where $t_k = k\Delta t$. Then the standard L1 discretization holds on this interval:
$D_t^{\alpha} q(t_k) = \frac{\Delta t^{-\alpha}}{\Gamma(2-\alpha)} \sum_{j=0}^{k-1} b_j \big(q(t_{k-j}) - q(t_{k-j-1})\big) + O(\Delta t^{\,2-\alpha}), \qquad b_j = (j+1)^{1-\alpha} - j^{1-\alpha}.$
According to the above theorem, Eq. (3) can be discretized time-step by time-step. Now the unknown function should be approximated in each time-step such that it satisfies Eq. (3) as well as its initial and boundary conditions in Eq. (4). In this regard, the terminal (payoff) condition can be satisfied by setting $U_N(S) = g(S)$ in the computations; the boundary conditions are enforced through the sum of least-square errors used in Section 4.
The remaining step is satisfying Eq. (3) itself, so the cost function is chosen as the sum of squared residuals of the discretized equation over the training points,
$E(W) = \sum_{i=1}^{r} R(S_i; W)^2,$
where $S = (S_1, S_2, \ldots, S_r)$, $S_i$ is the i-th training point, and $R(S_i; W)$ denotes the residual of the discretized Eq. (3) at $S_i$. To find the optimum weights of the network, this cost function is minimized with respect to W, yielding the nonlinear least-square problem
$\min_{W} E(W).$
It is noteworthy that when α = 1, the ordinary time-discretization method is used.
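The L1 discretization of the Caputo derivative used above can be sketched as follows. This is a minimal standalone implementation, not the paper's code; the function and variable names are our own.

```python
import numpy as np
from math import gamma

def caputo_l1(q, alpha, dt):
    """Approximate the Caputo derivative of order alpha (0 < alpha < 1)
    at t_k = k*dt from samples q[0..k], using the standard L1 scheme:

        D^alpha q(t_k) ~ dt^(-alpha) / Gamma(2 - alpha)
                         * sum_{j=0}^{k-1} b_j * (q[k-j] - q[k-j-1]),
        b_j = (j+1)^(1-alpha) - j^(1-alpha).
    """
    k = len(q) - 1
    b = np.array([(j + 1) ** (1 - alpha) - j ** (1 - alpha) for j in range(k)])
    diffs = np.array([q[k - j] - q[k - j - 1] for j in range(k)])
    return dt ** (-alpha) * np.dot(b, diffs) / gamma(2 - alpha)
```

A useful sanity check: for the linear function q(t) = t the scheme is exact, since the increments are constant and the weights telescope, giving the known value $t_k^{1-\alpha}/\Gamma(2-\alpha)$.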

Function Approximation
According to the universal approximation theorem, every continuous function can be approximated by a feed-forward neural network Csáji (2001). A network without hidden layers can only represent essentially linear maps, so for more general functions the approximation is well-established only if the ANN has at least one hidden layer. To calculate the price of an option based on Eq. (2) and Eq. (3), the value of the option at each time-step is approximated using a 2-layered network as
$U(S) \approx \Psi\!\left( \sum_{i=1}^{n} \nu_i\, \psi_i(w_i S + b_i) + \beta_1 \right),$
where n is the number of neurons in the hidden layer, $\psi_i$, i = 1, ..., n are the activation functions in the hidden layer, $w_i$, $b_i$, $\nu_i$, and $\beta_1$ are the network weights and biases, and Ψ is the activation function of the output layer. This formula corresponds to the topology shown in Figure 1. Figure 1: The topology of the network used for solving the Black-Scholes model.
The nodes corresponding to the edges $b_i$ and $\beta_1$ in this network are called the biases; their inputs are fixed to 1 and their weights are additional parameters for adjusting the output of the next layer. In other words, they help the network fit the given data best.
The most famous activation function in deep learning and neural networks is the sigmoid. The two main reasons the sigmoid function is widely used are its derivative and its range. The sigmoid function is defined as
$\sigma(x) = \frac{1}{1 + e^{-x}},$
and its values lie in [0, 1]. These features make it a perfect candidate for problems that produce probabilities and for the Black-Scholes model: the values of options are nonnegative, and the effect of other parameters does not inflate the calculations, as they are multiplied by the small values produced by the sigmoid. On the other hand, the derivative of this well-known activator, its slope, is easily calculable at any point:
$\sigma'(x) = \sigma(x)\,(1 - \sigma(x)).$
Although linear functions such as the identity are not favorable for hidden layers, as they take away the network's ability to generalize and adapt, it is possible to use them as the activator of the output layer, since the hidden layer is present and interacts directly with the inputs.
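The 2-layer sigmoid network described above can be sketched as a forward pass. The parameter names here are illustrative and do not follow the paper's exact notation; the output activation is taken to be the identity, as discussed.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def ann_price(S, w, b, beta, beta1):
    """2-layer approximation sketch:
        U(S) ~ sum_i beta_i * sigmoid(w_i * S + b_i) + beta1
    with an identity activation on the output layer."""
    hidden = sigmoid(np.outer(S, w) + b)   # shape (len(S), n)
    return hidden @ beta + beta1           # identity output activation

# Training relies on the identity sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
```

The derivative identity in the final comment is what makes gradient evaluations cheap: no extra exponentials are needed once the forward pass is stored.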

Fine-Tuning
One of the key factors in the present study is the possibility of applying fine-tuning methods during the training process. Fine-tuning means employing a previously trained neural network to find the solution of a new, similar task. This process is normally applied to datasets related to images and voice; however, following the same approach, it is possible to increase the accuracy and speed of the network in this work.
Building and validating an ANN from scratch can be a huge task in its own right: depending on the data being trained on, many parameters such as the number of layers, the number of nodes in hidden layers, the proper activation functions, and the learning rate must be found through trial and error. If a trained model that already does one task well exists, and that task is similar to ours in at least some remote way, then everything the model has already learned can be taken advantage of and applied to the new specific task. If the tasks are essentially identical, as when solving the problem at different time-steps, the exact weights can be used as initial values. If the models are only somewhat similar, the previous network still contains knowledge that notably speeds up building, modifying, and training the network for the new task; the only job remaining is learning the new features and properties that were not available in the former task. Here, once the network is trained for the first time-step, the obtained parameters, weights, and biases can be effectively re-employed for training on the data fed to the network at the other time-steps. The approximated solution and its convergence rate are compared in Section 5.
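The warm-starting idea above, reusing the weights of one time-step to initialize the next, can be sketched generically. Here `train_one_step` stands for any training routine mapping (initial weights, epoch budget) to trained weights, and the epoch budgets are illustrative only.

```python
def train_step_sequence(train_one_step, n_steps, w0,
                        epochs_first=2000, epochs_rest=200):
    """Fine-tuning across time-steps: the weights learned at step k
    initialize step k+1, so later steps need far fewer epochs."""
    w = train_one_step(w0, epochs_first)      # full training on the first step
    solutions = [w]
    for _ in range(n_steps - 1):
        w = train_one_step(w, epochs_rest)    # warm start from previous weights
        solutions.append(w)
    return solutions
```

The speed-up comes purely from the initialization: the optimizer starts near a solution of the previous, closely related least-squares problem, so only a short corrective training phase is needed per time-step.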

Domain Mapping
Considering the vulnerability of neural networks on infinite domains, due to the bounded range of their activation functions, the domain-mapping approach is utilized to shift the problem from its semi-infinite domain to a finite interval. This helps to prevent the error caused by common remedies such as truncation of the domain.
On finite domains, S = x is considered, but on semi-infinite domains transformation formulas must be used to shift the problem to the desired finite one. Here $x(S) = \frac{2}{\pi}\arctan\!\left(\frac{S}{L}\right)$ is used to map the problem's domain [0, ∞) onto [0, 1], where L is the characteristic length of the mapping Boyd (1989).
Here, L is chosen so that 60% of all training points stand before the mapped strike price, because the price differs significantly from zero on [0, K]. Defining l = 0.6 and noting that the training points are equidistant, the requirement x(K) = l gives
$L = \frac{K}{\tan\!\left(\frac{\pi l}{2}\right)}.$
Under this transformation, the derivatives needed for the calculations of Section 2 follow from the chain rule,
$\frac{\partial U}{\partial S} = \frac{dx}{dS}\,\frac{\partial U}{\partial x}, \qquad \frac{dx}{dS} = \frac{2}{\pi}\,\frac{L}{L^2 + S^2}.$
Since the transformation is applied to the whole problem, the payoff function and the boundary conditions of Eq. (4) should be changed as well.
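A small sketch of the mapping, its inverse, and the choice of L follows, assuming the relation x(K) = l discussed above; the function names are our own.

```python
import numpy as np

def map_S_to_x(S, L):
    """x(S) = (2/pi) * arctan(S/L): maps [0, inf) onto [0, 1)."""
    return (2.0 / np.pi) * np.arctan(S / L)

def map_x_to_S(x, L):
    """Inverse mapping back to the original price domain."""
    return L * np.tan(np.pi * x / 2.0)

def characteristic_length(K, l=0.6):
    """Choose L so that the strike K maps to x = l, i.e. a fraction l of
    equidistant points in [0, 1] lies before the mapped strike.
    From (2/pi) * arctan(K/L) = l one gets L = K / tan(pi*l/2)."""
    return K / np.tan(np.pi * l / 2.0)
```

Note that `map_x_to_S` diverges as x approaches 1, which is why the numerical examples later stop just short of that endpoint.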

Discussion
Using domain mapping makes it possible to approximate the answer on the whole interval of the problem. On the other hand, the number of training points needed for solving the model on smaller intervals is significantly lower, as can be seen in Section 5. However, mapping can decrease the accuracy, since the whole domain is compacted into one small interval and loss of information may occur. In other words, truncating the domain yields an excellent approximation on a sub-domain of the problem, while accepting a slight loss of accuracy lets the mapped approach cover the whole domain. There is thus a trade-off between the two methods.

Adaptive Moment Estimation Learning
Regression modeling is used to determine the coefficients of mathematical functions from empirical data. The method of least squares determines the coefficients such that the sum of the squared deviations between the data and the curve fit is minimized. Finding a satisfactory solution to nonlinear least-square problems is a classic topic for scientists working on nonlinear systems of equations. For a vector function $\Lambda : \mathbb{R}^n \to \mathbb{R}^m$ with $m \ge n$ and $x = (x_1, x_2, \ldots, x_n)$, the goal is to find $x^{*} \in \mathbb{R}^n$ such that
$x^{*} = \arg\min_{x \in \mathbb{R}^n} \|\Lambda(x)\|^2.$
Several methods have been introduced for solving this nonlinear least-square model. The same vector can be constructed when solving differential equations: for the Black-Scholes models, in each time-step the final goal is finding network weights W solving the system of equations
$\Lambda_j(W) = 0, \qquad j = 1, \ldots, m,$
whose j-th component is the residual of the discretized equation at the j-th training point. The best possible solution to this system is obtained when the sum of the squares of these equations becomes small; hence the problem is converted into an optimization problem, and the appropriate weights and biases for our network are found by minimizing
$E(W) = \sum_{j=1}^{m} \Lambda_j(W)^2.$
In Gradient Descent (GD), all available samples in the dataset must be visited to update a single parameter, whereas in Stochastic Gradient Descent (SGD), Schaul et al. (2013), mini-batches, small subsets of the whole dataset, are used instead. For relatively large datasets this makes the algorithm converge faster. GD computes exact gradients, while SGD, as explained, only approximates them. Since SGD fluctuates a lot, due to frequent updates with high variance, it shows a paradoxical behavior: it can explore new and potentially useful directions toward the minimum, but at the same time this behavior puts the network in danger of completely missing the local or global minimum.
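The conversion of the residual system into a scalar least-squares objective can be sketched as follows; the toy residuals are illustrative only, standing in for the discretized-equation residuals at the training points.

```python
import numpy as np

def sum_of_squares_cost(residuals):
    """Turn the system Lambda_j(w) = 0, j = 1..m, into the scalar
    objective E(w) = sum_j Lambda_j(w)^2 minimized over the weights."""
    def cost(w):
        return np.sum(np.asarray([f(w) for f in residuals]) ** 2)
    return cost

# Toy system: w1 + w2 - 3 = 0 and w1 - w2 - 1 = 0, solved by w = (2, 1)
residuals = [lambda w: w[0] + w[1] - 3.0,
             lambda w: w[0] - w[1] - 1.0]
E = sum_of_squares_cost(residuals)
```

Minimizing `E` with any gradient-based optimizer drives all residuals toward zero simultaneously, which is exactly how the per-time-step training objective is formed.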
In other words, RMSprop normalizes the gradient using a moving average of squared gradients. This normalization balances the step size, reducing it for large gradients to avoid exploding and increasing it for small ones to avoid vanishing. Since this approach uses an exponentially decaying average, it is tied to the most recent gradients, so past gradients do not play a great role in updating the parameters. This leads to slow changes in the learning rate; it is, however, considerably faster than GD.
So far, it can be seen that RMSProp and SGD are the best options. Adam is an adaptive algorithm which is generally considered as the combination of these two paradigms with momentum. This methodology has been specifically created to be used by neural networks.
Like RMSprop, Adam employs squared gradients to modify the learning rate. Also, the first and second moments are utilized through moving averages of the gradient, as in SGD with momentum. However, a specific learning rate is calculated for each network parameter (weight) using two hyper-parameters. Here a summary of how Adam optimization works for the present model is provided; for the complete explanation, readers are encouraged to study Kingma & Ba (2015). The convergence of the method has been discussed and refined in several subsequent papers (see Reddi et al. (2018)).
All computations are done using the autograd package of Python, in which Adam optimization is implemented as follows:

    def adam(grad, w, callback=None, num_iters=100, step_size=0.001,
             b1=0.9, b2=0.999, eps=10**-8):
        m = np.zeros(len(w))
        v = np.zeros(len(w))
        for i in range(num_iters):
            g = grad(w, i)
            if callback:
                callback(w, i, g)
            m = (1 - b1) * g + b1 * m       # First moment estimate.
            v = (1 - b2) * (g**2) + b2 * v  # Second moment estimate.
            mhat = m / (1 - b1**(i + 1))    # Bias correction.
            vhat = v / (1 - b2**(i + 1))
            w = w - step_size * mhat / (np.sqrt(vhat) + eps)
        return w

As the name of the method describes, it is derived from adaptive moment estimation. The n-th moment of a random variable w is defined as the expected value of that variable raised to the power n,
$m_n = E[w^n].$
The gradient of the cost function of the neural network can be considered a random variable, since it is usually evaluated on a small random batch of data. The first moment is the mean, and the second moment is the uncentered variance.
To estimate the moments, Adam utilizes exponentially moving averages, computed on the gradient evaluated on the current mini-batch:
$m \leftarrow \beta_1 m + (1-\beta_1)\,g, \qquad v \leftarrow \beta_2 v + (1-\beta_2)\,g^2,$
where m and v denote the moving averages and g is the gradient of the current data presented to the network. According to Kingma & Ba (2015), which the above snippet from the autograd package also follows, the hyper-parameters $\beta_1$ and $\beta_2$ have default values of 0.9 and 0.999 respectively. While the authors did not discuss how these two values were chosen, studies have reported very promising and in most cases excellent estimations using these defaults (see also Bock & Weiß (2019)). The vectors of moving averages are initialized with zeros at the first iteration.
The remaining problem with these moments is their bias towards zero, since m and v are initialized as vectors of zeros. In other words, especially during the initial epochs, and when the decay rates are small (i.e. $\beta_1$ and $\beta_2$ are close to 1), the values of m and v change very little or not at all. The authors therefore proposed the following bias corrections at iteration i:
$\hat{m} = \frac{m}{1-\beta_1^{\,i+1}}, \qquad \hat{v} = \frac{v}{1-\beta_2^{\,i+1}}.$
Now, for each of the parameters (weights), a specific updating rule can be created:
$w \leftarrow w - \eta\,\frac{\hat{m}}{\sqrt{\hat{v}} + \epsilon},$
where η is the step size and ε is a small control parameter preventing the fraction from producing a division-by-zero error.
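A self-contained sketch of the update rule above, applied to a simple quadratic objective; the step size and iteration count are illustrative only.

```python
import numpy as np

def adam_minimize(grad, w, num_iters=5000, step_size=0.01,
                  b1=0.9, b2=0.999, eps=1e-8):
    """Minimal Adam loop: moment estimates, bias corrections, update."""
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for i in range(num_iters):
        g = grad(w)
        m = (1 - b1) * g + b1 * m          # first-moment estimate
        v = (1 - b2) * (g ** 2) + b2 * v   # second-moment estimate
        mhat = m / (1 - b1 ** (i + 1))     # bias corrections
        vhat = v / (1 - b2 ** (i + 1))
        w = w - step_size * mhat / (np.sqrt(vhat) + eps)
    return w

# Minimize f(w) = ||w - 3||^2, whose gradient is 2*(w - 3)
w_opt = adam_minimize(lambda w: 2 * (w - 3.0), np.zeros(2))
# w_opt should approach [3, 3]
```

Because the per-parameter step is normalized by the second-moment estimate, the initial progress is roughly `step_size` per iteration regardless of the gradient's magnitude, which is the "slowing down near the optimum" behavior described above.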
Different scientific studies have shown that Adam outperforms other methods; according to empirical practice, it has better performance and accuracy, as also discussed in Section 5. One issue raised by many studies is the convergence of the method: Kingma & Ba (2015) provided an analysis for convex problems, while other papers questioned the convergence on a few non-convex problems and, with some modifications, finally agreed on its usability.
In Ekström & Tysk (2009), a full analysis of the convexity of the Black-Scholes model is given. Due to differences between real markets and theory, such as the failure of put-call parity, that paper proves that all American options preserve their convexity in bubbled markets as well as non-bubbled ones, whereas European options are convexity-preserving only for bounded payoffs; in this respect, the prices of American options are more robust than their European counterparts. In the same study, it is shown that models for bubbles are convexity-preserving for bounded contracts. More precisely, consider $(x, t) \in [0, \infty) \times [0, T]$, and let $u_1(x, t)$ and $u_2(x, t)$ be option prices whose corresponding volatilities $\alpha_1$ and $\alpha_2$ are nonnegative and satisfy $\alpha_1 \le \alpha_2$.
Theorem 3.2. Assume that g is concave. Then u(x, t) is concave in x for any $t \in [0, T]$. Moreover, the option price is decreasing in the volatility, that is, $u_1(x, t) \ge u_2(x, t)$. Similarly, if g is convex and bounded, then u(x, t) is convex in x for any $t \in [0, T]$, and the option price is increasing in the volatility, that is, $u_1(x, t) \le u_2(x, t)$.
The full proof is available in Ekström & Tysk (2009). As the authors mention, the proof is valid under the assumption of uniqueness of the solution for such an option, which is proved in their thorough study of the properties of Black-Scholes models in more realistic markets as well.

Numerical results and discussion
The results in our experiments are achieved using the stated values. If the learning rate is low, training is more reliable, but optimization takes a long time because the steps towards the minimum of the loss function are small. If, on the other hand, the learning rate is high, training may not converge or may even diverge: weight changes can be so big that the optimizer overshoots the minimum and makes the loss worse. In this research, the best values for the learning rate are found using a genetic algorithm.
Example 4.1. Let us consider a European call option whose interest rate, volatility, and strike price are 0.05, 0.2, and 10 respectively. The governing equation is similar to Eq. (31), with the standard European call conditions: the option is worthless at S = 0 and pays max(S − K, 0) at maturity. With a maturity of 1 year, the approximate price at t = 0 is shown in Figures 2.a and 2.b. The exact solution of this call option can be obtained from the analytical Black-Scholes formula, denoted $U_{exact}(S, t)$. To handle the semi-infinite domain, the problem is mapped to [0, 1] using Section 3.4; U(x, t) is computed using the proposed ANN with sigmoid activation functions, and then reverted to the original model's domain through the inverse mapping so that U(S, t) is obtained.
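For reference, the analytical European call price mentioned above can be evaluated with the standard Black-Scholes formula. This sketch uses the stated parameters r = 0.05, σ = 0.2, K = 10; the helper names are our own.

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(S, K, r, sigma, tau):
    """Analytical European call price with time to maturity tau = T - t."""
    d1 = (log(S / K) + (r + 0.5 * sigma ** 2) * tau) / (sigma * sqrt(tau))
    d2 = d1 - sigma * sqrt(tau)
    return S * norm_cdf(d1) - K * exp(-r * tau) * norm_cdf(d2)

# Parameters from the example: r = 0.05, sigma = 0.2, K = 10, maturity 1 year
price_atm = bs_call(10.0, 10.0, 0.05, 0.2, 1.0)  # at-the-money price at t = 0
```

Deep in the money the price approaches the discounted forward payoff $S - K e^{-r\tau}$, a convenient consistency check for any numerical solver.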
Only 10 equidistant points are used as training points in this case, and the logarithmic absolute errors obtained from the two approaches are compared in Figure 3. It should be noted that after the truncation point the error increases rapidly for the first approach, whereas when truncating the domain the overall error is higher at the beginning of the interval but remains steady and even falls at the end of the domain. Since the mapping function diverges at x = 1, numerical software such as Python cannot perform the calculations there, so the computations stop just before that point.
Example 4.2. The interest rate and volatility are 0.05 and 0.25 respectively. The other coefficients are computed from these two values as $a = \frac{1}{2}\sigma^2$, $b = r - a$, and $c = r$. The fractional order of the equation is α, and an exact solution is available in closed form when α = 1.
Example 4.3. Let us consider a European put option whose interest rate, volatility, and strike price are 0.05, 0.2, and 10 respectively. With a maturity of 1 year, the achieved result is shown in Figure 10. The exact solution of this put option can be obtained from the analytical put formula, denoted $U_{exact}(S, t)$. The training time per epoch on the above-mentioned configuration is 0.032 s. Increasing the number of data points would slightly increase the accuracy of this configuration (other parameters might need to be adjusted as well); however, we preferred this size to reduce complexity and memory usage.
In Figure 10.a, it is shown that when the problem domain is truncated, unlike Example 4.1, the approximate price behaves well throughout the whole unbounded interval. But this does not mean that this behavior is the expected behavior of the option, considering that the boundary condition makes the option price move towards zero. To make it clearer, the errors for the truncated and mapped approximate solutions are compared in Figure 11. In Figure 11.a, the absolute error before the truncation point is relatively better than in Figure 11.b; after the truncation point, however, it starts to increase and then flattens out again, which is predictable from the boundary condition of this specific option. In other words, this behavior cannot be generalized to other options, because farther points lie outside the training dataset and the network cannot learn their values. The preferred way is therefore a mapped domain, with lower accuracy but more stable behavior. Again, only 10 equidistant points are used as training points in this case. The training times per epoch for Adam and RMSprop are 0.038 s and 0.045 s respectively. Figures 13.a and 13.b illustrate the influence of fine-tuning on the convergence of the objective.

Conclusion
This study investigates neural networks with the famous Adam optimizer for solving financial Black-Scholes equations. Converting the PDE into a series of time-dependent ODEs using the Backward-Euler finite-difference method and then solving each of these equations with the proposed model confirms the satisfactory results and fast calculation of the method. The speed of the method comes from the parallel computations of the independent neurons in the neural network, the straightforward evaluation of sigmoid activation functions, which does not add to the complexity of the model, and the small number of training points and hidden neurons needed for very promising accuracy. The neural network also outperforms other methodologies regarding consistency and accuracy in the face of machine or calculation errors because of its fault tolerance. Fine-tuning plays a significant role in this method by reducing the building, validation, and calculation time; it also helps the method converge faster by finding the appropriate direction for the gradients, as depicted in the three examples of Section 4. Domain mapping, which to the best of our knowledge has not been used in ANNs before, is employed to make calculations possible on larger or infinite problem domains. As a result of combining these approaches into one single ANN, reliable, fast, and accurate results were obtained. The methodology is applicable to other types of options priced by either ordinary or fractional models, and other partial differential equations in any other field of study can be solved using this network as well.