The hardware structure of a processing element for optimizing an investment strategy for financial markets is presented. It is shown how this processing element can be instantiated many times on the massively parallel FPGA machine RIVYERA. This leads to a speedup by a factor of about 17,000 in comparison to a single high-performance PC, while saving more than 99% of the consumed energy. Furthermore, it is shown for a specific security and different time periods that the optimized investment strategy delivers an outperformance between 2 and 14 percent relative to a buy-and-hold strategy.
1. Introduction
The goal of technical financial market analysis is to predict the development of indices, stocks, funds, and other securities by evaluating the charts of the past. A method that finds such predictions can lead to an investment strategy. Many well-known chart-analysis methods (e.g., Elliott waves [1], Bollinger Bands [2]) try to extract patterns from the charts, expecting that such patterns will come up in similar ways again in the future. There are more than 100 different chart-analysis methods, but their success is doubted [3]. In most cases, the current development of the markets significantly affects the quality of the different investment strategies. Since the business volume per year on the worldwide stock markets is more than USD 35 trillion [4], it is not surprising that successful investment strategies are the focus of intensive research.
In general, there are many indicators influencing the chart of a security. These are not only economic and political indicators but also psychological ones. It is very difficult to decide which weight should be assigned to which indicator, the more so as there are known and unknown interdependencies between different indicators. Furthermore, weights change over time. Recent papers [5–8] apply data mining methods to historical market rates in order to find investment strategies that perform significantly above average. This approach is extremely compute-intensive, since millions of quotations are fixed worldwide every day. Even with high-performance computers, this amount of data must be reduced. But as shown in the literature [5–8], data mining helps to retain the essential information content needed to arrive at successful investment strategies.
In this paper, we present an investment strategy using a novel data mining method, which is discussed in Section 2. It yields a performance significantly above average for certain periods. It is based on the idea of an iterative search for an optimal set of indicator weights in the space of all possible weights. Since this space grows exponentially with the number of indicators, the method is very compute-intensive, but it can be parallelized optimally. Therefore, an FPGA implementation is very promising, because the computational core can be kept very simple and small in hardware. The two phases of the method have been implemented on the FPGA-based massively parallel computer RIVYERA. The RIVYERA architecture and its approach of efficiently exploiting 128 modern FPGAs in parallel are explained in Section 3. Section 4 describes the architecture and the implementation of one processing element for the main computation. The speedup achieved by this FPGA approach is investigated in Section 5. For different time intervals, the advantage in comparison to a simple buy-and-hold strategy is determined. We do not discuss the financial merits of the investment strategy itself in detail; the main focus lies on the improvement in speed, energy, and cost efficiency of the new method in comparison to an implementation on a sequential computer architecture. Section 6 summarizes the results and concludes the paper.
2. The Process of Optimizing an Investment Strategy for Securities
For a single security P, a successful strategy for buying and selling is desired. For simplicity, in this paper let P be an investment fund that can be traded without trading costs (several discount brokers offer such funds, e.g., Vanguard in the USA, InvestSMART Financial Services Pty Ltd in Australia, European Bank for Fund Services GmbH in Germany). Since taxation regulations vary between countries, taxes are not considered here either.
We consider n indicators I0,I1,…,In-1 that might have influence on the chart of P. Typical indicators are S&P500, Nikkei225, EuroStoxx50, EUR/USD, and so forth. In other methods of technical analysis P itself is used as an indicator as well. We consider a time interval of the past consisting of m+1 subsequent trading days d0,d1,…,dm. m should be large enough to get significant results (e.g., m≥125).
R is an m×n matrix, where Ri,j is the percentage difference of the indicator Ij from di-1 to di. The vector Ri is the ith row of the matrix R: Ri=(Ri,0,Ri,1,…,Ri,n-1). The required data for such a matrix can either be collected manually or downloaded from a trading platform on the internet.
At time d0 we assume a cash capital of one million EUR and a depot with D0=0 pieces of the security P. The value of one million has been chosen in order to abstract from rounding errors; the results for other starting values can be computed proportionally. Generally, let Ci be the cash, Di the number of pieces of P in the depot, and Zi the total property at day di (0≤i≤m). The fund considered here has exactly one market price Pi per day. Therefore, the following condition holds:

Zi = Ci + Di · Pi.
We are looking for a function f(Ri) that derives the decision of buying or selling a certain amount of P from the values of R known up to di. The output of f encodes, on the one hand, the decision to buy, to do nothing, or to sell and, on the other hand, the amount in the first and the last case. The optimal function of this kind is the one that maximizes the value of Zm. This approach is motivated by the assumption of technical analysis that a successful strategy of the past will also be successful in the future.
In order to simplify the search for f, we consider only functions of the form f(Ri)=∑wj·Ri,j. The weight vector is denoted by w=(w0,w1,w2,…,wn-1). We define w* as the vector that yields a maximum value for Zm. A positive value of f is a buy indication, a negative value a sell indication. The amount of P to be traded is Zi-1·|f(Ri)|. A buy at day di is limited to Ci-1 in order not to overdraw the cash account, and a sell is limited to Di-1, accordingly. The restriction to functions of the form f(Ri)=∑wj·Ri,j is based on the assumption that the influence of the different indicators is almost linear. Although this cannot be proven here, the results with this simplification are already remarkable. Nevertheless, it is still worth investigating modifications of this method with nonlinear functions.
There is one problem with investment funds: at the moment on day di when the trading decision is made, the exact value of Pi is not yet known. Therefore, at this point in time it is not possible to compute the exact number of pieces to be traded without risking an overdraw of the cash account. For a buy order, we therefore transmit not the number of pieces but the amount of money for which pieces are to be bought. Vice versa, for a sell order, we transmit the exact number of pieces to be sold.
Let Bi be the amount of money for which pieces of P should be bought at di and Si the number of pieces to be sold at di. Obviously, the following condition holds: (Bi=0)∨(Si=0).
Furthermore, if f(Ri) ≥ 0, it holds:

Bi = min(Zi-1 · f(Ri), Ci-1),  Si = 0,
Ci = Ci-1 - Bi,  Di = Di-1 + Bi/Pi.

And if f(Ri) < 0:

Bi = 0,  Si = min(Zi-1 · |f(Ri)|/Pi-1, Di-1),
Ci = Ci-1 + Si · Pi,  Di = Di-1 - Si.
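The day-by-day bookkeeping above can be sketched in software as follows (an illustrative Python sketch, not the authors' FPGA or C implementation; the function name `score` and the argument layout are assumptions):

```python
# Illustrative sketch of evaluating Z_m for one weight vector w.
# R is the m x n matrix of daily indicator changes (R[i-1] corresponds to R_i),
# P is the price series P[0..m], C0 the starting cash capital.

def score(w, R, P, C0=1_000_000.0, D0=0.0):
    """Simulate all trading days and return the final property Z_m."""
    C, D = C0, D0
    for i in range(1, len(R) + 1):
        Z_prev = C + D * P[i - 1]                 # Z_{i-1} = C_{i-1} + D_{i-1}*P_{i-1}
        f = sum(wj * rij for wj, rij in zip(w, R[i - 1]))
        if f >= 0:                                # buy indication
            B = min(Z_prev * f, C)                # never overdraw the cash account
            C, D = C - B, D + B / P[i]            # pieces bought at price P_i
        else:                                     # sell indication
            S = min(Z_prev * (-f) / P[i - 1], D)  # limited to the depot size
            C, D = C + S * P[i], D - S
    return C + D * P[len(R)]
```

With the zero weight vector the strategy never trades, so the final property equals the starting capital, which is a convenient sanity check.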
It is computationally infeasible to determine w* with a brute-force approach, even on a supercomputer. If one considers only 100 different values for each of 8 indicators, there are 100^8 different combinations. Using a calibration period of 26 weeks with 5 trading days per week, the required number of computations of the function f would be

26 · 5 · 10^16 = 1.3 · 10^18.
As the following examinations show, even the presented RIVYERA implementation would require 377 days to evaluate that number of combinations. Instead of such a brute-force method, we use an iterative approach: initially, a very rough grid of 16 values per indicator is used. For each of the 16^8 weight vectors, the final property Zm is computed. The areas in the grid where Zm is relatively large are the targets of the next iteration: the grid is refined in the environment of the corresponding weights. If a component of a promising weight vector lies at the boundary of the grid, the grid is extended in this direction. In the same way, we keep on refining the grid. Already after 100 iteration steps, the results are satisfying. In each iteration step, 16^8 weight vectors are considered, resulting in

26 · 5 · 16^8 ≈ 5 · 10^11
calculations of the function f plus the resulting calculation of the development of the depot value under the assumption that the corresponding buying and selling decisions are taken into account. The process of refining the grid is part of the host system and not specified in this paper.
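The host-side refinement can be illustrated in one dimension (a hypothetical Python sketch under the simplifying assumptions of a single indicator and a unimodal scoring function; on RIVYERA the grid covers all 8 indicators at once, and the boundary-extension case from the text is omitted for brevity):

```python
# One-dimensional sketch of the iterative grid refinement: a grid of 16
# candidate weights is evaluated, then the grid is narrowed around the
# best candidate found so far.

def refine(score, lo=-1.0, hi=1.0, points=16, iterations=100):
    best = lo
    for _ in range(iterations):
        step = (hi - lo) / (points - 1)
        grid = [lo + k * step for k in range(points)]
        best = max(grid, key=score)
        lo, hi = best - step, best + step   # refine around the best weight
    return best
```

For a unimodal score the interval shrinks geometrically, so 100 iterations are far more than enough to converge.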
The high computational effort is caused by the exhaustive search for w*. This calculation can be remarkably accelerated by an FPGA-based implementation. Hence, the concrete task is the following: how can we accelerate the identification of the weight vector w* that maximizes Zm out of a given set of weight vectors?
3. FPGA-Based Hardware Platform RIVYERA
Introduced in 2008, the massively parallel FPGA-based hardware platform RIVYERA [9] is the direct successor of COPACOBANA, presented in 2006 for cost-optimized breaking of 56-bit DES ciphers in less than two weeks [10]. Besides applications in cryptanalysis (e.g., [11]), RIVYERA finds its applications in the fields of bioinformatics [12–14] and now stock market analysis, as described in this paper.
For the application presented here, the specific RIVYERA S3-5000 is used, distributed by SciEngines GmbH [15]. RIVYERA is designed to be a completely scalable system consisting of two basic elements. Firstly, the built-in multi-FPGA supercomputer provides the resources for parallel high-performance applications (Figure 1, right side). Secondly, a standard server-grade mainboard, equipped with an Intel Core i7-930 processor, 12 GB of RAM, and 2 TB of hard disk space, provides the resources for quick pre- and postprocessing (Figure 1, left side). The RIVYERA S3-5000 is powered by two 650 W supplies and packed in a standard rack-mountable 3 U housing. It runs a standard Linux operating system and therefore constitutes an independent system. The details are discussed briefly in the following.
RIVYERA S3-5000 hardware structure.
The FPGA-based supercomputer consists of a backplane and up to 16 FPGA cards (fully equipped for the application described in this paper). Each FPGA card carries eight user-configurable Xilinx Spartan3-5000 FPGAs and one additional FPGA as communication controller, yielding 128 user-configurable FPGAs in total. Additionally, a DRAM module with a capacity of 32 MB is directly attached to each user FPGA.
All FPGAs are connected by a systolic-like bus system. Each FPGA on an FPGA card is connected with two neighbors, forming a ring that includes the communication controller. The FPGA card slots are likewise connected to each neighboring slot on the backplane, providing the connections between the communication controllers of the FPGA cards. The communication is physically realized by high-throughput symmetric LVDS point-to-point connections. The FPGA-based computer communicates with the host mainboard via a PCIe controller card connected directly to the communication controller on a chosen FPGA card. For applications requiring a higher bandwidth between the host system and the FPGA-based computer, additional PCIe controllers may be attached to other FPGA cards. For a configuration as used for this application, the measured net bandwidth from the host to the FPGA computer reaches up to 66 MB/s. Naturally, the latency depends on which clients communicate with each other, according to the length of the communication chain.
For application development, the RIVYERA provides an API for each of the two basic elements, that is, an API controlling the data transfer between the host software and the FPGAs including broadcast facilities, and an API for the user defined hardware configuration of the FPGAs controlling the data transfer to other FPGAs and the host as well.
A picture of the RIVYERA S3-5000 is shown in Figure 2.
RIVYERA S3-5000. The 16 FPGA-cards forming the FPGA-computer are highlighted. The integrated standard PC cannot be seen behind the cover.
4. Processor Architecture
The FPGA-based part of the presented algorithm is based on exhaustive searches. As different weight vectors can be evaluated independently, the algorithm is suitable for massive parallelization. Therefore, the following description of the technical implementation considers only a single FPGA. Assuming uniform programming of all available FPGAs and an equally divided search space, the computational speed rises approximately linearly with the number of FPGAs. Matching the RIVYERA platform, the implementation presented here is optimized for Xilinx Spartan3-5000 FPGAs [9, 16].
The key aspect in the identification of valuable weight vectors is the calculation of the score Zm for every element of the search space. Since these evaluations dominate the computational effort, the success of creating an efficient processor architecture is directly linked to the performance of the underlying implementation of the scoring function. Thus, the main objective, and therefore the starting point for the design of the processing element, is the creation of a scoring unit with high throughput.
4.1. Scoring Pipeline
The evaluation of Zm consists of repetitive computations of the sequences Ci and Di. Therefore, the throughput of the scoring unit is directly connected to the performance of the computation of these two sequences. Thus, despite the high spatial cost, the advantages of a pipeline architecture are persuasive. The implementation presented here is based on pipelines that yield a new pair (Ci, Di) in every clock cycle. As the values Ci and Di are defined recursively, the pipeline has to wait for its own outputs. Thus, to avoid idle time, l scores for different weight vectors are evaluated concurrently, where l is the length of the longest cyclic path. Hence, l is given by the number of clock cycles that are necessary to compute Ci and Di from Ci-1 and Di-1.
Basically, the structure can be subdivided into three segments. The first one is described by the function f(Ri)=∑wj·Ri,j. Assuming n indicators, the calculation of f(Ri) needs n multiplications and n-1 additions. The corresponding structure for n=4 is shown in Figure 3. As all following calculations directly depend on f(Ri), this computation is part of the longest path of the pipeline. Hence, the additions should be arranged as a balanced tree so that only the minimum number of steps (1 multiplication and ⌈log2 n⌉ additions) is required. However, this path is not an element of the longest cyclic path because wj and Ri,j do not depend on the outputs of the pipeline. A different arrangement of the additions has no effect on l.
Calculation of f(Ri) with four indicators.
To reduce resource usage, the buy order size Bi and the sell order size Si are combined into a general order size Oi. A negative value indicates a sell order, and a positive value denotes a buy order. Instead of the sequence Di, the pipeline calculates the values Di′=Di·Pi, as this enables the evaluation of Zm with just one addition. The calculation of Oi is given by the following instruction:
Oi := -Di′, if Oi* ≤ -Di′,
      Ci,   if Ci ≤ Oi*,
      Oi*,  else.
When the evaluation of f(Ri) is finished, the order size Oi at day i is computed as shown in Figure 4. The intermediate result Oi*=(Di′+Ci)·f(Ri) is restricted to Oi by the usage of two multiplexers and corresponding comparators. The total property Ci+Di′ and the negative depot value -Di′ do not depend on f(Ri) and, thus, can be calculated in parallel to its evaluation. Hence, the longest path of the pipeline is extended by the multiplication and the comparator chain.
Calculation of order size Oi.
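In software terms, the two multiplexers and their comparators implement a clamp of the raw order to the interval [-Di′, Ci] (an illustrative Python sketch; the function and parameter names are assumptions):

```python
# Sketch of the order-size restriction of Figure 4: the raw order
# O* = (D' + C) * f(R_i) is clamped by two comparisons.

def order_size(f_value, cash, depot_value):
    o_star = (depot_value + cash) * f_value
    if o_star <= -depot_value:   # cannot sell more than the depot is worth
        return -depot_value
    if o_star >= cash:           # cannot buy for more than the available cash
        return cash
    return o_star
```

A negative return value is a sell order, a positive one a buy order, matching the combined order-size convention above.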
After the calculation of the order size, new cash and depot values are computed. The value i that identifies the day of the historical data set is increased by 1 until the end of the given time period (i=m) is reached. In this case, i is set to 0, which marks the start of the evaluation of a new weight vector, and the sequences Ci and Di′ are reset to the default values C0 and D0′. In the same clock cycle, the sum of Cm and Dm′ is calculated and transmitted to the multiplexer that yields Zm:

(Ci, Di′) := (C0, D0′) if i = 0,
             (Ci-1 - Oi-1, (Di-1′ + Oi-1) · Pi/Pi-1) else.
In comparison to a multiplication, a division is much more expensive in terms of resource usage [16]. As a consequence, the quotient Pi/Pi-1 is realized as the multiplication Pi·Pi-1^-1. On the one hand, this implies the additional calculation and storage of inverse prices. On the other hand, each inverse needs to be computed only once and can be outsourced to the host system. Likewise, the additional memory usage is negligible, as we will see in Section 4.3.
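That the D′ recursion with host-precomputed inverse prices reproduces the direct depot bookkeeping can be checked on a toy example (illustrative Python; the price and order series are made up):

```python
# Toy check of the pipeline's D' = D * P formulation: the recursion
# D'_i = (D'_{i-1} + O_{i-1}) * P_i * (1/P_{i-1}) must match D_i * P_i
# from directly tracking the number of pieces D_i.

P = [100.0, 104.0, 101.0, 103.0]
inv_P = [1.0 / p for p in P]          # computed once on the host
orders = [0.0, 300.0, -150.0]        # O_0..O_2: money bought (+) or sold (-)

C, Dp = 1000.0, 0.0                  # C_0 and D'_0 = D_0 * P_0
C_ref, D_ref = 1000.0, 0.0           # direct bookkeeping for comparison
for i in range(1, len(P)):
    O = orders[i - 1]
    # pipeline recursion: division replaced by multiplication with inv_P
    C, Dp = C - O, (Dp + O) * P[i] * inv_P[i - 1]
    # reference: track the number of pieces directly
    C_ref, D_ref = C_ref - O, D_ref + O / P[i - 1]
    assert abs(Dp - D_ref * P[i]) < 1e-9
```

The check passes on every day of the toy series, confirming that replacing the division by a multiplication with a stored inverse does not change the result (up to floating-point rounding).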
As noted above, the algorithm is trivially parallelizable, and the computational speed depends linearly on the number of FPGAs. The same holds for the number of pipelines. But how many pipelines can be synthesized on one FPGA, and are there further possibilities to increase that number?
4.2. Optimized Fixed-Point Representation
All in all, one scoring pipeline is built of n+3 multiplications, n+2 additions, 2 subtractions, 3 comparators, and 5 multiplexers, where n is the number of indicators (see Figure 6), that is, 15+2n operations. A Spartan3-5000 FPGA consists of 8,320 Configurable Logic Blocks, which can be separated into 33,280 slices [16]. Additionally, 108 dedicated 18×18 bit multipliers can be assigned for synthesis.
The allocation report of two synthesis results is shown in Table 1. A single-precision floating-point representation of all variables is assumed in both cases. Using 32 dedicated multipliers, 8 indicators yield a consumption of 25% of the available slices. Assuming that 10% of the slices are reserved for further control units, three pipelines can be synthesized on the FPGA. In the case of 16 indicators, an additional 17% is required. The drastic increase results from the 8 additional adders and multipliers and the comparatively high spatial cost of floating-point units [16]. An important point is the synchronization of the different pipeline stages. For example, the third stage (see Figure 5) receives, amongst others, the input values Oi-1 and Pi. While Pi is given, Oi is only known after several calculations. Hence, to ensure synchronicity, the transfer of Pi is delayed using shift registers. The longest cyclic path consists of 2 additions, 3 multiplications, 2 comparators, and 3 multiplexers. As an extension of the pipeline implies the need for more shift registers, this path should be as short as possible. Optimized in terms of space, the longest cyclic path comprises l=57 clock cycles. All in all, only two pipelines are possible in this case.
Table 1: Synthesis result with floating-point representation.

Indicators | Slices        | Multipliers
8          | 8,584 (25%)   | 32 (30%)
16         | 14,064 (42%)  | 44 (42%)
Evaluation of Zm.
Scoring pipeline for 8 indicators.
To counter this problem, a fixed-point representation is introduced in the following. The idea is motivated by the fact that many of the given values lie in limited ranges. For example, the daily price fluctuations in R rarely exceed the interval [-10%, 10%]. Therefore, the values of R are stored in 18 bits, where the fractional part is coded in 12 bits; the resulting codomain is the interval [-32%, 32%) with a precision of 2^-12 %. Likewise, the elements of the weight vector are stored in 18 bits. While a fractional part of 12 bits seems to be the best tradeoff between overflow immunity on the one hand and precision on the other hand, the range of the weight vectors may be determined specifically for every use case. Cash, depot value, and stock prices are stored as integer values in cents. The inverse prices are multiplied by 2^32 and also stored as integer values.
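The 18-bit format can be mimicked in software to see the ranges involved (a Python sketch assuming a two's-complement encoding; the exact bit layout on the FPGA is not spelled out in the text):

```python
# Sketch of the 18-bit fixed-point format for R and w: 12 fractional bits,
# two's complement, codomain [-32%, 32%) with a resolution of 2^-12 %.

FRAC_BITS = 12

def encode(percent):
    """Quantize a percentage to an 18-bit fixed-point bit pattern."""
    q = round(percent * (1 << FRAC_BITS))
    assert -(1 << 17) <= q < (1 << 17), "value outside [-32%, 32%)"
    return q & 0x3FFFF                  # 18-bit two's-complement pattern

def decode(bits):
    q = bits - (1 << 18) if bits & (1 << 17) else bits
    return q / (1 << FRAC_BITS)
```

For example, a daily change of -1.25% round-trips exactly (it is a multiple of 2^-12 %), while 32% would overflow the format.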
The 18-bit representation of Ri,j and wj promotes the efficient usage of the dedicated 18×18 multipliers. Furthermore, the switch from floating-point to fixed-point units leads to a considerable decrease in allocated resources. The length of the longest cyclic path is reduced to l=37. As shown in Table 2, the available resources suffice for up to 6 pipelines per FPGA.
Table 2: Synthesis result with fixed-point representation.

Indicators | Slices       | Multipliers
8          | 4,801 (14%)  | 12 (11%)
16         | 5,395 (16%)  | 20 (19%)
4.3. FPGA Overview
The pipelines are triggered synchronously. The trading day d and the corresponding historical data are set globally for all scoring units. Since l independent score evaluations are calculated in parallel in every pipeline, the value of d has to change only once every l clock cycles. To trigger the pipeline in the (i-1)th recursion, the historical information of day i is necessary. This set consists of the vector of price fluctuations Ri and the values Pi and Pi-1^-1. To transfer these values within one clock cycle, the historical data Hi of day i is stored in a single Block RAM word. Such a word Hi=(Pi-1^-1, Pi, Ri) consists of 32+32+18·n bits, for example, 208 bits for n=8 indicators. The Spartan3-5000 provides 104 RAM blocks with 1,872 Kbit in total [16]. This is obviously enough in our case, as it suffices for over 9,000 days with 8 indicators.
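The packing of one Block RAM word can be sketched as follows (illustrative Python; the field order within the word is an assumption, as the text only gives the field widths):

```python
# Sketch: one day's data packed into a single word
# H_i = (P_{i-1}^-1, P_i, R_i) of 32 + 32 + 18*n bits (208 bits for n=8).

def pack_word(inv_p_prev, p, r_fields):
    word = inv_p_prev & 0xFFFFFFFF            # 32-bit scaled inverse price
    word = (word << 32) | (p & 0xFFFFFFFF)    # 32-bit price in cents
    for r in r_fields:                        # n fixed-point fields, 18 bits each
        word = (word << 18) | (r & 0x3FFFF)
    return word

# 104 blocks * 18 Kbit = 1,872 Kbit of Block RAM hold over 9,000 such words
assert 104 * 18 * 1024 // 208 > 9000
```

The capacity check at the end reproduces the "over 9,000 days" figure from the 208-bit word width.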
As the optimization is based on an exhaustive search, it is necessary to define the search space. The declared objective is to identify the optimal weight combination for 8 indicators, with 8 possible candidates per indicator. So, the search space is declared by an 8×8 matrix. Every row describes an indicator and consists of 8 values, each of which can be used as a weight for the corresponding indicator. As there are 8 possible candidates for each of 8 indicators, the number of possible weight vectors is |W| = 8^8 ≈ 16.7 million. So, one FPGA is able to determine the optimal weight vector out of 16.7 million combinations. One unique combination per pipeline has to be selected in every clock cycle. To accomplish this, every possible combination is identified by a 24-bit identifier in the range [0, |W|-1]. An equally divided subspace ([0, |W|/6 - 1], [|W|/6, |W|/3 - 1], …) is assigned to every pipeline. The weight vector is extracted by masking the identifier: bits 3·j+2 to 3·j give the position of coefficient wj of the weight vector w. For example, the identifier 387₁₀ = 110 000 011₂ references the matrix items W[0][3] (011₂ = 3₁₀) for w0 and W[2][6] for w2. This interpretation is very efficient as the effort of bit masking is comparatively small. Thus, 6 different weight vectors can be selected in a single clock cycle and assigned to the pipelines.
As the data flow is synchronous, the scores Zm of all pipelines are calculated at the same time. Assuming 6 pipelines, 6 results are returned per clock cycle. Obviously, it is neither possible nor sensible to store all 8^8 scores. Likewise, the effort to administrate a list of the best scores is too high, as it would require sorting 6 results into the list in a single clock cycle. An examination of this problem shows that a good tradeoff is to store the best result of every pipeline. Utilizing 6 pipelines and 128 FPGAs, 768 results are obtained in every iteration. This set is widespread enough to calculate new weight coefficients for the next iteration. An overview of the FPGA structure is shown in Figure 7.
Outline of the processor architecture.
5. Results and Performance Analysis
We now consider results and performance for a specific security, the internationally operating investment fund DWS Convertibles, ISIN DE0008474263.
As described in Section 2 (referred to as the calibration phase in the following), the optimal weight vector w* is determined for the security and a randomly chosen time period of 26 weeks (the calibration time interval). We chose 8 indicators that widely represent the current economic environment: S&P 500, DAX, EuroStoxx 50, ASX 200, Nikkei 225, Hang Seng, S&P 500 Future, and EUR/USD. The goal is to find the w* with the maximal value of Zm.
The computational effort with 8 indicators is already rather high. In this paper, we refrain from investigating more indicators since, on the one hand, these 8 indicators represent the activities on the international stock markets to a high degree and, on the other hand, the results with this restriction are already remarkable.
We now focus on the investment strategy where at day di the Ri,j are calculated for the indicators Ij and then, based on the value of w*, the volume of buying or selling orders is computed from f(Ri). To determine the quality of the vector w*, we test it in a different period of time, referred to as the evaluation phase. Of course, this only makes sense for a time interval (the so-called evaluation time interval) that does not overlap with the calibration time interval. We have chosen three different evaluation time intervals of 26 weeks as well. The question is whether or not the new investment strategy yields an outperformance in comparison to a buy-and-hold strategy. Buy-and-hold means P is bought at the beginning of the evaluation time interval and sold at the end.
Figure 8 shows an example of the chart of the security in comparison to the performance of our investment strategy with the same security within three different evaluation time intervals Tk of 26 weeks each, k∈{1,2,3}. T1 (2009-09-14 to 2010-03-15) is a period with a rising tendency for the fund, T2 (2010-09-27 to 2011-03-28) is a period without a clear tendency, and T3 (2011-03-28 to 2011-09-26) is a period with a falling tendency.
Performance of the investment strategy in different evaluation intervals.
The values wk* were determined for each Tk in the iterative way described previously. The resulting investment strategy Sk was then applied to the evaluation time intervals Te, where e∈{1,2,3} and e≠k. The chart Pk,e shows the performance of the monetary assets in the evaluation time interval Te using investment strategy Sk.
In all time intervals, an outperformance of the investment strategies Sk over P between 2% (see P2,1 in Figure 8) and 14% (see P1,3 in Figure 8) can be seen. Although this is no proof in a mathematical sense that such an investment strategy can be applied to arbitrary securities in arbitrary time periods, it seems to be very promising to further improve the method described here.
Considering computing performance as well, RIVYERA and similar computer architectures are perfectly suited for such research. Table 3 shows a comparison between the RIVYERA-based approach and a PC version of the algorithm implemented in C. The test system uses an Intel Core i7-970 with 6×3200 MHz, an ASRock X58 Extreme mainboard, and 8 GB GeIL DIMM DDR3-1066 RAM. The implementation uses all cores of the processor. In addition, the improved number representation is used in the PC version as well.
Table 3: Comparison of PC and RIVYERA.

Runtime (calibration phase: 26 weeks)
Number of weight vectors | PC         | RIVYERA
1 billion                | 44 h 56 m  | 9.15 s
50 billion               | 93 d 14 h  | 7 m 37 s
1 trillion               | 5.13 years | 2 h 32 m

Power consumption
Number of weight vectors | PC (300 W)            | RIVYERA (1300 W)
1 billion                | 13.48 kWh (2.70 €)    | 3.31 Wh (0.0006 €)
50 billion               | 0.674 MWh (137.49 €)  | 165.20 Wh (0.04 €)
1 trillion               | 13.48 MWh (2695.80 €) | 3.31 kWh (0.66 €)
On RIVYERA, 128·p pairs (Ci, Di′) are calculated in every clock cycle, where p is the number of pipelines per FPGA. The clock rate of the implementation is 50 MHz. Assuming 8 indicators, 6 pipelines can be synthesized on each FPGA. This yields 128·6·50,000,000 = 38.4 billion pairs per second. Measurements of the PC version show that 2.26 million calculations per second are possible on the specified test system. This corresponds to a speedup of about 17,000. While RIVYERA requires up to 1300 W, 300 W is assumed for a standard PC. Accordingly, the energy consumption per evaluated pair is reduced by up to 99.975%.
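The throughput, speedup, and energy figures above follow from plain arithmetic, reproduced here as a check:

```python
# Throughput of the RIVYERA implementation: 128 FPGAs * 6 pipelines * 50 MHz.
fpgas, pipelines, clock_hz = 128, 6, 50_000_000
rivyera_rate = fpgas * pipelines * clock_hz        # pairs (C_i, D'_i) per second
assert rivyera_rate == 38_400_000_000              # 38.4 billion pairs/s

pc_rate = 2_260_000                                # measured PC rate (pairs/s)
speedup = rivyera_rate / pc_rate                   # ~17,000

# energy per evaluated pair at 1300 W (RIVYERA) vs. 300 W (PC)
saving = 1 - (1300 / rivyera_rate) / (300 / pc_rate)   # ~0.99975
```

The exact speedup is 38.4·10^9 / 2.26·10^6 ≈ 16,991, which is rounded to 17,000 in the text.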
Of course, such a comparison raises a number of questions. Intel declares 76.8 GFLOPS for the i7-970 [17]. The presented FPGA design needs 15+2n = 31 operations for n=8 indicators. Assuming that the PC version manages to work with the same number of operations, one could deduce that the processor reaches up to 76,800,000,000/31 ≈ 2.48 billion pairs per second. This is more than 1,000 times faster than the actual implementation. So, what is the reason for this gap?
In fact, the computing power of the processor is not the bottleneck. The main problem lies in the intensive memory communication. Even a cache-optimized version needs several RAM accesses (and, of course, many cache accesses) to calculate a single pair. The pipeline structure cannot be translated directly but only simulated by further memory instructions. In contrast, the FPGA Block RAM modules are accessed in parallel to the actual calculations. Thus, there is effectively no latency from memory operations. This is a reason why this algorithm is very suitable for massively parallel computing. A further interesting issue would be a performance comparison using GPGPUs.
The time complexity of the presented algorithm also differs between the platforms. On standard processors, the complexity is 𝒪(wmax^n · n · m), where wmax denotes the maximum number of possible coefficients per indicator. The factor n occurs because the evaluation of f(Ri) needs n multiplications and n-1 additions that have to be executed sequentially. A RIVYERA pipeline calculates one pair (Ci, Di′) per clock cycle in every case, so there is no such dependency; the time complexity is 𝒪(wmax^n · m). However, the dependency on n is not eliminated by this approach. While the size of a standard processor remains constant for increasing n, more adders and multipliers are necessary in an FPGA-based implementation. Accordingly, the spatial complexity is 𝒪(n). As this may lead to fewer pipelines per FPGA, an indirect influence on the runtime cannot be ruled out.
6. Conclusion
The FPGA machine RIVYERA is very suitable for the optimization of the investment strategy presented in this paper. A speedup of 17,000 and an energy saving of more than 99% in comparison to a single high-performance PC have been determined. For the specific investment fund and the different time periods reviewed, the investment strategy optimized with RIVYERA delivers a significant outperformance relative to a buy-and-hold strategy.
Several other securities were tested for different time periods. Although always the same, simple indicators were used, the optimization of the investment strategy using RIVYERA delivered a significant outperformance in almost every case.
References

1. R. N. Elliott, 1938.
2. J. Bollinger, McGraw-Hill, 2001.
3. C. Park and S. Irwin, "What do we know about the profitability of technical analysis?", 2007, vol. 21, pp. 786–826.
4. M. Nagler, 2008.
5. M. Gavrilov, D. Anguelov, P. Indyk, and R. Motwani, "Mining the stock market: which measure is best?", in Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000.
6. K. S. Kannan, P. S. Sekar, M. M. Sathik, and P. Arumugam, "Financial stock market forecast using data mining techniques", in Proceedings of the International MultiConference of Engineers and Computer Scientists (IMECS '10), vol. 1, 2010, pp. 555–559.
7. S. Langdell, "Examples of the use of data mining in financial applications", 2002.
8. T. Rathburn, Data Mining, Down Under Forums, 2007.
9. G. Pfeiffer, S. Baumgart, J. Schröder, and M. Schimmler, "A massively parallel architecture for bioinformatics", in Proceedings of the 9th International Conference on Computational Science (ICCS '09), Lecture Notes in Computer Science, vol. 5544, Springer, 2009, pp. 994–1003. doi:10.1007/978-3-642-01970-8_100.
10. S. Kumar, C. Paar, J. Pelzl, G. Pfeiffer, A. Rupp, and M. Schimmler, "How to break DES for €8,980", in Proceedings of the Workshop on Special-Purpose Hardware for Attacking Cryptographic Systems, Cologne, Germany, 2006.
11. J. Fan, D. V. Bailey, L. Batina, T. Guneysu, C. Paar, and I. Verbauwhede, "Breaking elliptic curve cryptosystems using reconfigurable hardware", 2010, pp. 133–138.
12. M. Schimmler, L. Wienbrandt, T. Guneysu, J. Bissel, and B. Schmidt, "COPACOBANA: a massively parallel FPGA-based computer architecture", CRC Press, 2010, pp. 223–262.
13. L. Wienbrandt, S. Baumgart, J. Bissel, F. Schatz, and M. Schimmler, "Massively parallel FPGA-based implementation of BLASTp with the two-hit method", in Proceedings of the International Conference on Computational Science, vol. 4, 2011, pp. 1967–1976. doi:10.1016/j.procs.2011.04.215.
14. L. Wienbrandt, S. Baumgart, J. Bissel, C. M. Y. Yeo, and M. Schimmler, "Using the reconfigurable massively parallel architecture COPACOBANA 5000 for applications in bioinformatics", in Proceedings of the International Conference on Computational Science (ICCS '10), vol. 1, 2010, pp. 1027–1034. doi:10.1016/j.procs.2010.04.114.
15. SciEngines GmbH, http://www.sciengines.com/
16. Xilinx Inc., Xilinx UG331: Spartan-3 Generation FPGA User Guide, March 2011, http://www.xilinx.com/
17. Intel Corporation, Intel Core i7-900 Desktop Processor Series, http://www.intel.com/