Optimizing Investment Strategies with the Reconfigurable Hardware Platform RIVYERA

The hardware structure of a processing element used for the optimization of an investment strategy for financial markets is presented. It is shown how this processing element can be instantiated multiple times on the massively parallel FPGA machine RIVYERA. This leads to a speedup by a factor of about 17,000 in comparison to a single high-performance PC, while saving more than 99% of the consumed energy. Furthermore, it is shown for a specific security and different time periods that the optimized investment strategy delivers an outperformance between 2 and 14 percent in relation to a buy-and-hold strategy.


Introduction
The goal of technical financial market analysis is to predict the development of indices, stocks, funds, and other securities by evaluating the charts of the past. A method to find such predictions can lead to an investment strategy. Many well-known chart-analysis methods (e.g., Elliott waves [1], Bollinger Bands [2]) try to extract patterns from the charts, expecting that such patterns will come up in similar ways again in the future. There are more than 100 different chart-analysis methods, but their success is doubted [3]. In most cases, the current development of the markets significantly affects the quality of the different investment strategies. Since the business volume per year on the worldwide stock markets is more than USD 35 trillion [4], it is not surprising that successful investment strategies are the focus of intensive research.
In general, there are many indicators influencing the chart of a security. Those are not only economic and political indicators but also psychological ones. It is very difficult to decide which weight should be assigned to which indicator, the more so as there are known and unknown tradeoffs between different indicators. Furthermore, weights change over time. Recent papers [5][6][7][8] try to apply data mining methods to historical market rates in order to find investment strategies that perform significantly above the average. This approach is extremely compute-intensive, since every day millions of quotations are fixed worldwide. Even with the use of high-performance computers, a reduction of this amount of data is required. But as shown in the literature [5][6][7][8], data mining helps to keep the essential information content in order to arrive at successful investment strategies.
In this paper, we present an investment strategy using a novel data mining method, which is discussed in Section 2. It results in a performance significantly above average for certain periods. It is based on the idea of an iterative search for an optimal set of indicator weights in the space of all possible weights of the indicators. Since this space grows exponentially with the number of indicators, the method is very compute-intensive but can be parallelized very well. Therefore, an FPGA implementation seems to be very promising, because the computational core can be kept very simple and small in hardware. The two phases of the method have been implemented on the FPGA-based massively parallel computer RIVYERA. The RIVYERA architecture and its idea of efficiently exploiting 128 modern FPGAs in parallel are explained in Section 3.

International Journal of Reconfigurable Computing
Section 4 describes the architecture and the implementation of one processing element for the main computation. The speedup achieved by such an FPGA approach is investigated in Section 5. For different time intervals, the advantage in comparison to a single buy-and-hold strategy is determined. We do not want to discuss the investment strategy itself in this paper. Instead, the main focus here lies on the improvement in terms of speed, energy, and cost efficiency of the new method in comparison to an implementation on a sequential computer architecture. Section 6 summarizes the results and concludes the paper.

The Process of Optimizing an Investment Strategy for Securities
For a single security $P$, a successful strategy for buying and selling is desired. For simplicity, in this paper let $P$ be an investment fund that can be traded without trading costs (there are several discount brokers offering such funds, e.g., Vanguard in the USA, InvestSMART Financial Services Pty Ltd in Australia, European Bank for Fund Services GmbH in Germany). Since the taxation regulations vary between countries, these are not considered here either. We consider $n$ indicators $I_0, I_1, \ldots, I_{n-1}$ that might have influence on the chart of $P$. Typical indicators are S&P500, Nikkei225, EuroStoxx50, EUR/USD, and so forth. In other methods of technical analysis, $P$ itself is used as an indicator as well. We consider a time interval of the past consisting of $m + 1$ subsequent trading days $d_0, d_1, \ldots, d_m$. The value of $m$ should be large enough to get significant results (e.g., $m \geq 125$).
$R$ is an $m \times n$ matrix, where $R_{i,j}$ is the percentage difference of the indicator $I_j$ from $d_{i-1}$ to $d_i$. The vector $R_i$ is the $i$th row of the matrix $R$: $R_i = (R_{i,0}, R_{i,1}, \ldots, R_{i,n-1})$. The required data for such a matrix can either be collected or downloaded from a trading platform on the internet.
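The construction of $R$ from raw indicator series can be sketched as follows. This is only an illustration: the function names and the sample values are hypothetical and not taken from the paper, and the rows of the returned list are indexed from 0, so that row 0 corresponds to $R_1$.

```python
def percentage_changes(series):
    """Day-over-day percentage differences of one indicator series."""
    return [100.0 * (curr - prev) / prev for prev, curr in zip(series, series[1:])]


def build_return_matrix(indicators):
    """Build R from n indicator series with m + 1 daily values each.

    Returns a list of m rows with n entries; list index 0 holds R_1.
    """
    columns = [percentage_changes(series) for series in indicators]
    return [list(row) for row in zip(*columns)]


# Two hypothetical indicators observed on three trading days (m = 2, n = 2).
R = build_return_matrix([[100.0, 110.0, 99.0], [50.0, 50.0, 55.0]])
print(R)
```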
At time $d_0$ we assume a cash capital of one million EUR and a depot with $D_0 = 0$ pieces of the security $P$. The value of one million has been chosen in order to be able to abstract from rounding errors. The results for other starting values can be computed proportionally. Generally, let $C_i$ be the cash money, $D_i$ the number of pieces of $P$ in the depot, and $Z_i$ the total property at day $d_i$ ($0 \leq i \leq m$). The fund considered here has exactly one market price $P_i$ per day. Therefore, the following condition holds: $Z_i = C_i + D_i \cdot P_i$.
We are looking for a function $f(R_i)$ that computes the decision of buying or selling a certain amount of $P$ from the values of $R$ known up to $d_i$. The output of $f$ is, on the one hand, the decision either to buy, to do nothing, or to sell, and, on the other hand, the amount for the first and the last case. The optimal function of this kind is the one that maximizes the value of $Z_m$. This approach is motivated by the assumption of technical analysis that a successful strategy of the past will also be successful in the future.
In order to simplify the search for $f$, we consider only functions of the kind $f(R_i) = \sum_{j=0}^{n-1} w_j \cdot R_{i,j}$. The weight vector is denoted by $w = (w_0, w_1, w_2, \ldots, w_{n-1})$. We define $w^*$ as the vector which yields a maximum value for $Z_m$. A positive value of $f$ is a buy indication, a negative value a sell indication. The amount of $P$ to be traded is $Z_{i-1} \cdot |f(R_i)|$. A buy at day $d_i$ is limited to $C_{i-1}$ in order not to overdraw the cash account, and a sell is limited to $D_{i-1}$ accordingly. The decision to restrict the search to functions of the kind $f(R_i) = \sum_{j=0}^{n-1} w_j \cdot R_{i,j}$ is based on the assumption that the influence of the different indicators is almost linear. Although this cannot be proven here, the results with this simplification are already remarkable. However, it is still worthwhile to investigate modifications of this method with nonlinear functions.
There is one problem with investment funds: at the point in time on day $d_i$ when the trading decision is made, the exact value of $P_i$ is not known. Therefore, at this point in time it is not possible to compute the exact number of pieces that can be traded without overdrawing the cash account. For a buy order, we therefore transmit not the number of pieces but the amount of money for which we want to buy pieces. Conversely, for a sell order, we transmit the exact number of pieces to be sold.
Let $B_i$ be the amount of money for which pieces of $P$ should be bought at $d_i$ and $S_i$ the number of pieces to be sold at $d_i$. Obviously, the following condition holds: [equation missing in the source]. Furthermore, if $f(R_i) \geq 0$, it holds: [equation missing in the source]. And if $f(R_i) < 0$: [equation missing in the source].
It is computationally infeasible to determine $w^*$ with a brute force approach, even on a supercomputer. If one considers only 100 different values for each of 8 indicators, then there are $100^8$ different combinations of those. Using a calibration period of 26 weeks, with 5 trading days per week, the required number of computations of the function $f$ would be on the order of $10^{18}$. As specified in the following examinations, even the presented RIVYERA implementation would require 377 days to evaluate that number of combinations.
Instead of such a brute force method, we use an iterative approach: initially a very rough grid of 16 values per indicator is used. For each of the $16^8$ weight vectors, the final property $Z_m$ is computed. The areas in the grid where $Z_m$ is relatively large are the targets of the next iteration: in the neighborhood of the corresponding weights, the grid is refined. If a component of a promising weight vector is at the boundary of the grid, then the grid is extended in this direction. In the same way, we keep on refining the grid. After as few as 100 iteration steps, the results are satisfying. In each iteration step $16^8$ weight vectors are considered, resulting in $16^8 \cdot m$ calculations of the function $f$, plus the resulting calculation of the development of the depot value under the assumption that the corresponding buying and selling decisions are carried out. The process of refining the grid is part of the host system and not specified in this paper.
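The size of the brute-force estimate can be checked with a few lines of arithmetic. The sketch below assumes the throughput reported in the performance analysis (128 FPGAs, 6 pipelines, 50 MHz, one evaluation per pipeline and clock cycle); the function name is hypothetical. Under this assumption, the quoted 377 days appear to correspond to $m = 125$ evaluations per weight vector, whereas 26 weeks of 5 days ($m = 130$) would give roughly 392 days.

```python
SECONDS_PER_DAY = 86_400
# Assumed throughput: 128 FPGAs x 6 pipelines x 50 MHz, one pair per cycle.
PAIRS_PER_SECOND = 128 * 6 * 50_000_000


def brute_force_days(values_per_indicator, indicators, trading_days):
    """Days needed to evaluate every weight vector on every trading day."""
    evaluations = values_per_indicator ** indicators * trading_days
    return evaluations / PAIRS_PER_SECOND / SECONDS_PER_DAY


days_130 = brute_force_days(100, 8, 130)  # 26 weeks x 5 trading days
days_125 = brute_force_days(100, 8, 125)  # lower bound on m named in the text
print(round(days_130), round(days_125))
```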
The high computational effort is caused by the exhaustive search for $w^*$. This calculation can be accelerated remarkably by an FPGA-based implementation. Hence, the concrete task is the following: how can we accelerate the identification of the optimal weight vector $w^*$, which maximizes $Z_m$, out of a given set of weight vectors?

FPGA-Based Hardware Platform RIVYERA
Introduced in 2008, the massively parallel FPGA-based hardware platform RIVYERA [9] is the direct successor of COPACOBANA, presented in 2006 for the cost-optimized breaking of 56-bit DES ciphers in less than two weeks [10]. Besides applications in cryptanalysis (e.g., [11]), RIVYERA finds its applications in the fields of bioinformatics [12][13][14] and now stock market analysis, as described in this paper.
For the application presented here, the specific RIVYERA S3-5000 is used, distributed by SciEngines GmbH [15]. RIVYERA is designed to be a completely scalable system consisting of two basic elements. Firstly, the built-in multiple-FPGA supercomputer provides the resources for parallel high-performance applications (Figure 1, right side). Secondly, a standard server-grade mainboard equipped with an Intel Core i7-930 processor, 12 GB of RAM, and 2 TB of hard disk space provides the resources for quick pre- and postprocessing (Figure 1, left side). The RIVYERA S3-5000 is powered by two 650 W supplies and packed in a standard rack-mountable 3 U housing. It runs a standard Linux operating system and, therefore, forms a self-contained system. The details are discussed briefly in the following.
The FPGA-based supercomputer consists of a backplane and up to 16 FPGA cards (fully equipped for the application described in this paper). Each FPGA card is equipped with eight user-configurable Xilinx Spartan3-5000 FPGAs and one additional FPGA as communication controller. In total, these are 128 user-configurable FPGAs. Additionally, a DRAM module with a capacity of 32 MB is directly attached to each user FPGA.
All FPGAs are connected by a systolic-like bus system. Each FPGA on an FPGA card is connected with two neighbors, forming a ring that includes the communication controller. The FPGA card slots are likewise connected to each neighboring slot on the backplane, providing the connections between the communication controllers of the FPGA cards. The communication is physically realized by high-throughput symmetric LVDS point-to-point connections. The FPGA-based computer communicates with the host mainboard via a PCIe controller card connected directly to the communication controller of a chosen FPGA card. For applications requiring a higher bandwidth from the host system to the FPGA-based computer, more than one PCIe controller may be attached to other FPGA cards as well. For the configuration used for this application, the measured net bandwidth from the host to the FPGA computer reaches up to 66 MB/s. Of course, the latency depends on which clients are communicating with each other, according to the length of the communication chain.
For application development, the RIVYERA provides an API for each of the two basic elements, that is, an API controlling the data transfer between the host software and the FPGAs including broadcast facilities, and an API for the user defined hardware configuration of the FPGAs controlling the data transfer to other FPGAs and the host as well.
A picture of the RIVYERA S3-5000 is shown in Figure 2.

Processor Architecture
The FPGA-based part of the presented algorithm is based on exhaustive searches. As different weight vectors can be evaluated independently, the algorithm is suitable for massive parallelization. Therefore, the following description of the technical implementation only considers a single FPGA.
Assuming uniform programming of all available FPGAs and an equally divided search space, the computational speed rises approximately linearly with the number of FPGAs.
According to the RIVYERA platform, the implementation presented here is optimized for Xilinx Spartan3-5000 FPGAs [9,16].
The key aspect concerning the identification of valuable weight vectors is the calculation of the score $Z_m$ for every possible element of the search space. Since these evaluations account for the bulk of the computational effort, the success of creating an efficient processor architecture is directly linked to the performance of the underlying implementation of the scoring function. Thus, the main objective, and therefore the starting point for the design of the processing element, should be the creation of a scoring unit with a high throughput.

Scoring Pipeline.
The evaluation of $Z_m$ consists of repetitive computations of the sequences $C_i$ and $D_i$. Therefore, the throughput of the scoring unit is directly connected to the performance of the computation of these two sequences. Thus, despite the high spatial cost, the advantages of a pipeline architecture are persuasive. The implementation presented here is based on pipelines that yield a new pair $(C_i, D_i)$ in every clock cycle. As the values $C_i$ and $D_i$ are defined recursively, the pipeline has to wait for its own outputs. Thus, to avoid idle time, $l$ scores for different weight vectors are evaluated concurrently, where $l$ is the length of the longest cyclic path. Hence, $l$ is given by the number of clock cycles that are necessary to compute $C_i$ and $D_i$ from $C_{i-1}$ and $D_{i-1}$.
Basically, the structure can be subdivided into three segments. The first one is described by the function $f(R_i) = \sum_{j=0}^{n-1} w_j \cdot R_{i,j}$. Assuming $n$ indicators, the calculation of $f(R_i)$ needs $n$ multiplications and $n - 1$ additions. The corresponding structure for $n = 4$ is shown in Figure 3. As all following calculations directly depend on $f(R_i)$, this computation is part of the longest path of the pipeline. Hence, the additions should be arranged in a way that only the minimum number of steps (1 multiplication and $\log_2 n$ additions) is required. However, this path is not part of the longest cyclic path, because $w_j$ and $R_{i,j}$ do not depend on the outputs of the pipeline. A different arrangement of the additions has no effect on $l$.
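The arrangement of the additions as a balanced tree can be illustrated in software. The following sketch is not the hardware description used on RIVYERA; it merely mimics the structure of Figure 3 and counts the adder levels, which come to $\lceil \log_2 n \rceil$. The function name and the sample values are hypothetical.

```python
def weighted_sum_tree(weights, row):
    """Evaluate f(R_i) = sum_j w_j * R_ij with a balanced adder tree.

    Returns the sum and the number of adder levels that were needed.
    """
    # All n multiplications are independent and form a single step.
    level = [w * r for w, r in zip(weights, row)]
    depth = 0
    while len(level) > 1:
        # Add neighbouring pairs; each pass corresponds to one adder level.
        paired = [level[k] + level[k + 1] for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])  # an odd element passes through unchanged
        level = paired
        depth += 1
    return level[0], depth


value, depth = weighted_sum_tree([1.0, 2.0, 3.0, 4.0], [0.5, -0.25, 1.0, 0.0])
print(value, depth)
```

For $n = 8$ indicators the same function reports three adder levels.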
To save resources, the buy order size $B_i$ and the sell order size $S_i$ are combined into a general order size $O_i$. A negative value indicates a sell order, and a positive value denotes a buy order. Instead of the number of pieces, the pipeline tracks the depot value $D_i \cdot P_i$, which is again denoted by $D_i$ in the following, as this allows $Z_m$ to be evaluated with a single addition. The calculation of $O_i$ is given by the following instruction: [equation missing in the source]. When the evaluation of $f(R_i)$ is finished, the order size $O_i$ at day $i$ is computed as shown in Figure 4. The intermediate result is limited by two multiplexers and the corresponding comparators. The total property $C_i + D_i$ and the negative depot value $-D_i$ do not depend on $f(R_i)$ and, thus, can be calculated in parallel to its evaluation. Hence, the longest path of the pipeline is extended by the multiplication and the comparator chain.
After the calculation of the order size, new cash and depot values are computed. The value $i$ that identifies the day of the historical data set rises by 1 until the end of the given time period ($i = m$) is reached. In this case $i$ is set to 0, which implies the start of the evaluation of a new weight vector. The sequences $C_i$ and $D_i$ are reset to the default values $C_0$ and $D_0$.
In the same clock cycle the sum of $C_m$ and $D_m$ is calculated and transmitted to the multiplexer that refers to $Z_m$: $Z_m = C_m + D_m$.
In comparison to a multiplication, a division is much more expensive with regard to resource usage [16]. As a consequence, the quotient $P_i / P_{i-1}$ is realized as the multiplication $P_i \cdot P_{i-1}^{-1}$. On the one hand, this implies the additional calculation and storage of inverse elements. On the other hand, every such calculation needs to be done only once and can be outsourced to the host system. Likewise, the additional memory usage can be disregarded, as we will see in Section 4.3.
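The recursion evaluated by one pipeline can be sketched in software as follows. The exact update equations are not reproduced in the text, so this is an assumed reading rather than the authors' implementation: the order size is taken as total property times $f(R_i)$, clamped between the negative depot value and the available cash, and the depot value is then revalued with the precomputed inverse price. The function name `score` and the sample data are hypothetical.

```python
def score(weights, R, P, inv_P, cash0=1_000_000.0):
    """Return Z_m for one weight vector (assumed form of the recursion).

    R[i - 1] holds the row R_i, P[i] the price P_i, and inv_P[i] the
    precomputed inverse 1 / P_i supplied by the host.
    """
    cash, depot_value = cash0, 0.0  # C_0 and the value of the empty depot
    for i in range(1, len(P)):
        f = sum(w * r for w, r in zip(weights, R[i - 1]))
        order = (cash + depot_value) * f  # desired order size in money
        # Assumed limits: never sell more than the depot, never overdraw cash.
        order = max(-depot_value, min(cash, order))
        cash -= order
        # Revalue the depot from P_{i-1} to P_i using a multiplication only.
        depot_value = (depot_value + order) * P[i] * inv_P[i - 1]
    return cash + depot_value  # Z_m needs just one addition


# Hypothetical data: the fund itself serves as the only indicator.
prices = [100.0, 110.0, 99.0]
inverse_prices = [1.0 / p for p in prices]
changes = [[10.0], [-10.0]]
z_m = score([0.05], changes, prices, inverse_prices)
print(z_m)
```

With a weight of 0.05, half of the property is invested before the rise and most of it is sold before the fall, so the sketch ends with about 1,047,500 EUR.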
As considered, the algorithm is trivially parallelizable. The computational speed depends linearly on the number of FPGAs. Likewise, this statement applies to the number of pipelines. But how many pipelines can be placed on a single FPGA? Synthesis results for 8 and 16 indicators are given in Table 1. A single precision floating point representation of all variables is assumed in both cases. Using 32 multipliers, 8 indicators yield a consumption of 25% of the available slices. Assuming that 10% of the slices are reserved for further control units, three pipelines can be synthesized on the FPGA. In the case of 16 indicators, an additional 17% are required. The drastic increase results from the 8 additional adders and multipliers and the comparatively high spatial cost of floating point units [16].
An important point is the synchronization of the different pipeline stages. For example, the third stage (see Figure 5) receives, amongst others, the input values $O_{i-1}$ and $P_i$. While $P_i$ is given, $O_i$ is only known after several calculations. Hence, to provide synchronicity, the transfer of $P_i$ is delayed using shift registers. The longest cyclic path consists of 2 additions, 3 multipliers, 2 comparators, and 3 multiplexers. As an extension of the pipeline requires more shift registers, the path should be as short as possible. Optimized in terms of space, the longest cyclic path comprises $l = 57$ clock cycles. All in all, only two pipelines are possible in this case.
To counter that problem, a fixed-point representation is introduced in the following. The idea is motivated by the fact that many of the given values are located in limited ranges. For example, the daily price fluctuations in $R$ rarely exceed the interval [−10%, 10%]. That is the reason why the values of $R$ are stored in 18 bits, where the fractional part is coded in 12 bits and the representable range is the interval [−32%, 32%) with a precision of $2^{-12}$%. Likewise, the elements of the weight vector are stored in 18 bits. While a fractional part of 12 bits seems to be the best tradeoff between overflow immunity on the one hand and precision on the other hand, the range of the weight vectors may be determined specifically for every use case. Cash, depot value, and stock prices are stored as integer values in cents. The inverse prices are multiplied by $2^{32}$ and also stored as integer values.
The 18-bit representation of $R_{i,j}$ and $w_j$ promotes the efficient usage of the dedicated 18 × 18 multipliers. Furthermore, the transition from floating point to fixed-point units leads to a considerable decrease of the allocated resources. The length of the longest cyclic path can be reduced to 37. As shown in Table 2, the available resources suffice for up to 6 pipelines per FPGA.
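The 18-bit format with 12 fractional bits can be illustrated as follows. The sketch is an assumption about how such a format behaves, not a description of the synthesized units: in particular, saturating at the range limits and truncating the product back to 12 fractional bits are choices made here for illustration, and all names are hypothetical.

```python
FRACTION_BITS = 12
TOTAL_BITS = 18
SCALE = 1 << FRACTION_BITS             # 4096 steps per unit, precision 2**-12
MIN_RAW = -(1 << (TOTAL_BITS - 1))     # -131072 represents -32.0
MAX_RAW = (1 << (TOTAL_BITS - 1)) - 1  # 131071 represents 32 - 2**-12


def to_fixed(x):
    """Quantize x to the signed 18-bit grid, saturating at the limits."""
    raw = round(x * SCALE)
    return max(MIN_RAW, min(MAX_RAW, raw))


def fixed_mul(a_raw, b_raw):
    """Multiply two fixed-point values.

    The full product carries 24 fractional bits, so it is shifted right
    by 12 bits to return to the common format (assumed truncation).
    """
    return (a_raw * b_raw) >> FRACTION_BITS


r = to_fixed(-1.5)   # a price change of -1.5 %
w = to_fixed(0.25)   # a hypothetical weight
product = fixed_mul(w, r) / SCALE
print(r, w, product)
```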

FPGA Overview.
The pipelines are triggered synchronously. The trading period $d$ and the corresponding historical data are set globally for all scoring units. Since $l$ independent score evaluations are calculated in parallel in every pipeline, the value of $d$ has to change only once every $l$ clock cycles. To trigger the pipeline in the $(i-1)$th recursion, the historical information of day $i$ is necessary. This set consists of the vector of price fluctuations $R_i$ and the values $P_i$ and $P_{i-1}^{-1}$. To transfer these values within one clock cycle, the historical data $H_i$ of day $i$ is stored in a single Block RAM word. Such a word $H_i = (P_{i-1}^{-1}, P_i, R_i)$ consists of $32 + 32 + 18 \cdot n$ bits, for example, 208 bits for $n = 8$ indicators. The Spartan3-5000 provides 104 RAM blocks with 1,872 Kbit in total [16]. This is obviously enough in our case, as it suffices for over 9000 days with 8 indicators.
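The capacity estimate can be reproduced with a short calculation. It assumes 104 blocks of 18 Kbit each (1,872 Kbit in total) and ignores any constraints on how words are mapped onto individual blocks; the names are hypothetical.

```python
def word_bits(n_indicators, price_bits=32, change_bits=18):
    """Width of one history word H_i = (1 / P_{i-1}, P_i, R_i) in bits."""
    return 2 * price_bits + change_bits * n_indicators


BLOCK_RAM_BITS = 104 * 18 * 1024  # 104 blocks of 18 Kbit = 1,872 Kbit

bits = word_bits(8)
days = BLOCK_RAM_BITS // bits
print(bits, days)
```

For $n = 8$ this yields 208 bits per word and room for 9216 trading days.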
As the optimization is based on an exhaustive search, it is necessary to determine the search space. The declared objective is to identify the optimal weight combination for 8 indicators. 8 possible candidates are given for every indicator. So, the search space is described by an 8 × 8 matrix $W$. Every row describes an indicator and consists of 8 values. Each of these values can be used as a weight for the corresponding indicator. As there are 8 possible candidates for each of 8 indicators, the number of possible weight vectors is $|W| = 8^8 \approx 16.8$ million. So, one FPGA is able to calculate the optimal weight vector out of 16.8 million combinations. One unique combination for every pipeline has to be calculated in every clock cycle. To accomplish this, every possible combination is identified by a 24-bit identifier in the range $[0, 8^8 - 1]$ assigned to every pipeline. The weight vector is extracted by masking the identifier. The bits $3j + 2$ to $3j$ give the position of coefficient $w_j$ of the weight vector $w$. For example, the identifier $387_{10} = 110.000.011_2$ references the matrix items W[0][3] ($011_2 = 3_{10}$) for $w_0$ and W[2][6] for $w_2$. This interpretation is very efficient, as the effort of bit masking is comparatively small. Thus, 6 different weight vectors can be selected in a single clock cycle and assigned to the pipelines.
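The masking scheme can be written down in a few lines. The sketch below follows the bit layout described above (bits $3j + 2$ to $3j$ select the candidate for $w_j$); the function names and the contents of the candidate matrix are hypothetical.

```python
def decode_identifier(identifier, n_indicators=8, bits_per_index=3):
    """Split an identifier into one candidate index per indicator."""
    mask = (1 << bits_per_index) - 1
    return [(identifier >> (bits_per_index * j)) & mask
            for j in range(n_indicators)]


def select_weights(identifier, W):
    """Pick the weight vector referenced by the identifier from matrix W."""
    return [W[j][k] for j, k in enumerate(decode_identifier(identifier, len(W)))]


indices = decode_identifier(387)
print(indices)  # [3, 0, 6, 0, 0, 0, 0, 0]

# Hypothetical candidate matrix: entry W[j][k] is encoded as 10 * j + k.
W = [[10 * j + k for k in range(8)] for j in range(8)]
print(select_weights(387, W))
```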
As the data flow is synchronous, the scores $Z_m$ of all pipelines are calculated at the same time. Assuming 6 pipelines, 6 results are returned per clock cycle. Obviously, it is neither possible nor sensible to store $8^8$ values. Likewise, the effort to maintain a list of the best scores is too high, as it implies sorting 6 results into the list in a single clock cycle. The examination of this problem shows that a good tradeoff is the storage of the best result of every pipeline. Utilizing 6 pipelines and 128 FPGAs, 768 results are returned in every iteration. This set seems to be broad enough to calculate new weight coefficients for the next iteration. An overview of the FPGA structure is shown in Figure 7.

Results and Performance Analysis
For further research, we will now consider results and performance for a certain security, the investment fund DWS Convertibles, ISIN DE0008474263, that is operating internationally.
As described in Section 2 (referred to as the calibration phase in the following), the optimal weight vector $w^*$ is determined for the security and for a randomly chosen time period of 26 weeks (the calibration time interval). We chose 8 indicators that broadly represent the current economic environment: S&P 500, DAX, EuroStoxx 50, ASX 200, Nikkei 225, Hang Seng, S&P 500 Future, and EUR/USD. The goal is to find $w^*$ with the maximal value of $Z_m$.
The computational effort with 8 indicators is already rather high. In this paper, we refrain from investigating more indicators since, on the one hand, these 8 indicators represent the activities on the international stock markets to a high degree and, on the other hand, the results with this restriction are already remarkable.
We now focus on the investment strategy where at day $d_i$ the values $R_{i,j}$ are calculated for the indicators $I_j$ and then the volume of buying or selling orders is computed from $f(R_i)$ based on the value of $w^*$. To determine the quality of the vector $w^*$, we test it in a different period of time, referred to as the evaluation phase. Of course, this only makes sense for a time interval (the so-called evaluation time interval) which does not overlap with the calibration time interval. We have chosen three different evaluation time intervals of 26 weeks as well. The question is whether or not the new investment strategy gives an outperformance in comparison to a buy-and-hold strategy. Buy-and-hold means that $P$ is bought at the beginning of the evaluation time interval and sold at the end. Figure 8 shows an example of the chart of the security in comparison to the performance of our investment strategy with the same security within three different evaluation time intervals $T_k$ of 26 weeks each, $k \in \{1, 2, 3\}$. $T_1$ (2009-09-14 to 2010-03-15) is a period where the tendency for the fund is rising. $T_2$ (2010-09-27 to 2011-03-28) is a period without a clear tendency, and $T_3$ (2011-03-28 to 2011-09-26) is a period where the tendency for the fund is falling.
The values of $w^*_k$ were determined for each $T_k$ in the iterative way described previously. The resulting investment strategy $S_k$ was then applied to the evaluation time intervals $T_e$, where $e \in \{1, 2, 3\}$ and $e \neq k$. The chart $P_{k,e}$ shows the performance of the monetary assets in the evaluation time interval $T_e$ using investment strategy $S_k$.
In all time intervals, an outperformance of the investment strategies $S_k$ over $P$ between 2% (see $P_{2,1}$ in Figure 8) and 14% (see $P_{1,3}$ in Figure 8) can be seen. Although this is no proof in a mathematical sense that such an investment strategy can be applied to arbitrary securities in arbitrary time periods, it seems very promising to further improve the method described here.
Considering computing performance as well, RIVYERA or similar computer architectures are well suited for such research. While RIVYERA requires up to 1300 W, 300 W is assumed for a standard PC. Combined with the speedup, the energy consumed for the same workload is accordingly reduced by up to 99.975%.
Of course, such a comparison raises a number of questions. Intel declares 76.8 GFLOPS for the i7-970 [17]. The presented FPGA design needs $15 + 2n = 31$ operations for $n = 8$ indicators. Assuming that the PC version manages to work with the same number of operations, one could deduce that the referred processor reaches up to 76,800,000,000/31 ≈ 2.48 billion pairs per second. This is more than 1,000 times the rate actually measured. In fact, the computing power of the processor is not the bottleneck. The main problem is located in the intensive memory communication. Even a cache-optimized version needs several RAM accesses (and of course many cache accesses) to calculate a single pair. The pipeline structure cannot be translated directly but can only be simulated by further memory instructions. In contrast, the FPGA Block RAM modules are triggered in parallel to the actual calculations. Thus, memory operations add no latency to the computation. This is one reason why this algorithm is very suitable for massively parallel computing. A further interesting issue would be a comparison in performance using GPGPU.
The time complexity of the presented algorithm also differs between the platforms. On standard processors, the complexity is $O(w_{\max}^n \cdot n \cdot m)$, where $w_{\max}$ denotes the maximum number of possible coefficients for one indicator. The factor $n$ occurs because the evaluation of $f(R_i)$ needs $n$ multiplications and $n - 1$ additions that have to be executed sequentially. A RIVYERA pipeline calculates one pair $(C_i, D_i)$ per clock cycle in every case. So, there is obviously no such dependency. Therefore, the time complexity is $O(w_{\max}^n \cdot m)$. However, the dependency on $n$ is not eliminated by this approach. While the size of a standard processor remains constant for an increasing $n$, more adders and multipliers are necessary in an FPGA-based implementation. Accordingly, the spatial complexity is $O(n)$. As this may lead to fewer pipelines per FPGA, an indirect influence on the runtime cannot be denied.

Conclusion
The FPGA machine RIVYERA is very suitable for the optimization of the investment strategy presented in this paper. A speedup of 17,000 and an energy saving of more than 99% in comparison to a single high-performance PC have been determined. For the specific investment fund and the different time periods reviewed, the investment strategy optimized with RIVYERA delivers a significant outperformance in relation to a buy-and-hold strategy.
Several other securities were tested for different time periods. Although the same simple indicators were always used, the optimization of the investment strategy using RIVYERA delivered a significant outperformance in almost every case.

Figure 2: RIVYERA S3-5000. The 16 FPGA cards forming the FPGA computer are highlighted. The integrated standard PC cannot be seen behind the cover.

Figure 3: Calculation of $f(R_i)$ with four indicators.

Figure 5: Evaluation of $Z_m$.

Figure 7: Outline of the processor architecture.

Table 1: Synthesis result with floating point representation.

Table 2: Synthesis result with fixed-point representation.
Table 3 shows a comparison between the RIVYERA-based approach and a PC version of the algorithm implemented in C. The test system uses an Intel Core i7-970. On RIVYERA, $128 \cdot p$ pairs $(C_i, D_i)$ are calculated in every clock cycle, where $p$ is the number of pipelines per FPGA. The clock rate of the implementation is 50 MHz. Assuming 8 indicators, 6 pipelines can be synthesized on an FPGA. This yields $128 \cdot 6 \cdot 50{,}000{,}000 = 38.4$ billion pairs per second. Measurements of the PC version show that 2.26 million calculations per second are possible on the specified test system. The conclusion is a speedup of about 17,000.
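The throughput, speedup, and energy figures can be retraced with the numbers quoted in the text. The sketch assumes that both machines draw their stated maximum power for the whole run (1300 W for RIVYERA, 300 W for the PC), so it is a rough estimate rather than a measurement; the variable names are hypothetical.

```python
FPGAS, PIPELINES, CLOCK_HZ = 128, 6, 50_000_000
PC_PAIRS_PER_SECOND = 2.26e6             # measured PC rate quoted in the text
RIVYERA_WATTS, PC_WATTS = 1300.0, 300.0  # assumed constant power draw

rivyera_pairs = FPGAS * PIPELINES * CLOCK_HZ
speedup = rivyera_pairs / PC_PAIRS_PER_SECOND

# Energy for the same workload is power times runtime, and the PC needs
# `speedup` times longer than RIVYERA.
energy_ratio = RIVYERA_WATTS / (PC_WATTS * speedup)
energy_saving_percent = 100.0 * (1.0 - energy_ratio)

print(rivyera_pairs, round(speedup), round(energy_saving_percent, 3))
```

The result is a speedup slightly below 17,000 and an energy saving of about 99.97%, in line with the rounded values given above.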

Table 3: Comparison of PC and RIVYERA.