An Optimization of Tree Topology Based Parallel Cryptography

Public key cryptography has become vitally important with the rapid development of wireless technologies. RSA is one of the most important algorithms for secure communications in public-key cryptosystems. Since RSA is computationally expensive because of its modular exponentiation, parallel processing and architecture is a reasonable solution to speed up RSA operations. In this paper, taking into account pipelining and optimization, we improve the throughput and efficiency of the TRSA method, a parallel architecture solution for RSA security based on tree topology. The optimization and pipelining of the tree-based architecture increase its efficiency and throughput. The experimental results demonstrate that these pipelined and optimized approaches outperform the original TRSA.


Introduction
The use of wireless communication has made most of today's systems vulnerable. Employing appropriate measures that provide confidentiality, integrity, authenticity, and availability for all messages in the presence of adversaries can reduce threats to applications that exploit wireless communications, such as wireless sensor networks (WSNs). Asymmetric algorithms, also called public key algorithms, can provide user authentication and are useful in key distribution and management. Public key encryption allows two parties to communicate secretly, even if all communications between them are monitored. Furthermore, it allows enormous flexibility, which is essential, for example, for an online merchant processing credit card orders from multiple purchasers. It is more convenient for the receiver to store a single private key than to share and manage different secret keys.
Since asymmetric cryptosystems are based on number theory, they are expensive in terms of their mathematical calculations [1, 2], which need a large amount of energy and sufficient memory for large keys [3]. Software solutions for cryptography offer flexibility and ease of use and upgrade, but they do not provide enough speed and are less secure than hardware solutions. Software can be easily monitored, and it consumes a lot of resources. Besides, it is difficult to port software among different operating systems, whereas hardware methods need fewer computer resources and have special chips to accelerate the process [4]. As a real-world example, the security problem in sensor networks is more challenging because of the resource-limited nodes. In such low-resource devices, a solution based on parallel architecture becomes more appropriate because of the smaller area occupied by the architecture [5]. Basically, energy efficiency and battery lifetime play a major role in the lifetime of such applications [6].
The main reason for using a topological interconnection of processor elements (PEs) in a parallel architecture is to create a powerful computer or processing element for specific objectives. A parallel architecture is composed of nodes and the interconnection network between them. Depending on the algorithm being deployed on the architecture, pipelining, optimization, or in some cases both are applicable. While the pipelined approach increases the throughput by keeping multiple PEs busy simultaneously, the optimization improves efficiency by eliminating the redundant PEs.
The main contribution of this paper lies in applying parallel pipelining to the tree topology for RSA cryptography and in proposing an optimization of this architecture. The pipelining and, moreover, the optimization improve the throughput and efficiency of TRSA considerably.
The rest of this paper is organized as follows. First, we summarize the related works in the next section. Then, the pipelined TRSA is explained. Section 4 provides the optimization of TRSA. Afterwards, the simulation results are discussed in Section 5. Finally, the conclusion of the work is drawn.

Related Works
We have summarized existing parallel approaches to cryptographic algorithms in [7]. In addition, a parallel cryptography method using RSA, which uses tree topology as its base infrastructure, is proposed in [8]. To the best of the authors' knowledge, from the topology point of view, there is no significant parallel approach to the RSA algorithm other than TRSA. The well-known parallel methods using software and hardware solutions for RSA are in [5, 9-21]. More details on these approaches are presented in [7]. Among these parallel approaches, the best execution time is 3.23 ms with a 1024-bit key [5]. The greatest key length is 3072 bits [12], and the best speedup is 10.9 using a 2560-bit key [13]. The evaluation metric in most of these methods is speedup or time, and hardware implementations have employed Montgomery's algorithm to perform modular multiplication or exponentiation.
These recent approaches have not discussed the time complexity or the order of the algorithms, which is inseparable from parallel processing. To the best of the authors' knowledge, the only existing discussions of time complexity concern the known CRT [22], Montgomery [23], and binary [22] methods, the last of which is an accepted method; these analyses are based on the number of multiplications. Accordingly, to analyze the proposed approach, we deal with the time complexity of the optimization and pipelining of the TRSA method.

Pipelined TRSA
TRSA is a new parallel approach to RSA using the tree interconnection network [8]. TRSA can be applied as a coprocessor for embedded systems such as wireless sensor nodes, which require more speed in transferring confidential information. To enhance TRSA, a pipelining mechanism is employed to achieve higher throughput. When pipelining is applied to the original solution, the latency between consecutive data blocks being encrypted decreases drastically [5]. It is assumed that the base operation is one multiplication followed by one modulus, which is called MulMod for short. Considering the encryption as $C_i = m_i^e \bmod n$, where $i$ is the block number, in TRSA the computation of $C_i$ must be finished before the computation of $C_{i+1}$ starts. In this enhancement of the original approach, when the results of the current operations are sent to the next level of the tree, the computation for the next ciphertext can be started in this level. In principle, a new operation can be initiated with this frequency, even though previous ones are still in the pipeline and the ciphertext for the prior plaintext is not ready yet.
The following definitions are used in the original algorithm.
RSA uses two large prime numbers $p$ and $q$, which give the modulus $n = pq$. The value $\phi = (p-1)(q-1)$ is employed to determine $e$, where $e < n$ and $\gcd(e, \phi) = 1$. Consequently, $d$ has the following relation:

$$d = e^{-1} \bmod \phi \tag{3.1}$$

Let $e_o$, $e_l$, $A$, and $B$ be computed as follows:

$$e_o = e \;\mathrm{div}\; n_p \tag{3.2}$$

$$e_l = e_o + (e \bmod n_p) \tag{3.3}$$

$$A = m^{e_o} \bmod n \tag{3.4}$$

$$B = m^{e_l} \bmod n \tag{3.5}$$
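As a quick check of the definitions above, the following minimal Python sketch computes $e_o$, $e_l$, $A$, and $B$ and verifies that the split exponents recombine to $e$. It assumes that $n_p$ is the number of tree PEs plus one (eight for a seven-PE tree); the function names and the toy parameters are illustrative and are not taken from the paper.

```python
def split_exponent(e, n_p):
    """Split e as in (3.2)-(3.3): e_o for ordinary leaves, e_l for the last leaf."""
    e_o = e // n_p           # e div n_p
    e_l = e_o + e % n_p      # e_o + (e mod n_p)
    return e_o, e_l


def coordinator_values(m, e, n, n_p):
    """Compute A and B as in (3.4)-(3.5) with modular exponentiation."""
    e_o, e_l = split_exponent(e, n_p)
    return pow(m, e_o, n), pow(m, e_l, n)


# Toy parameters, assumed for illustration only (far too small for real use).
p, q, e, m = 61, 53, 17, 65
n = p * q
n_p = 8                      # assumption: 7 tree PEs + 1

e_o, e_l = split_exponent(e, n_p)
A, B = coordinator_values(m, e, n, n_p)

# (n_p - 1) copies of e_o plus one e_l add up to the full exponent e,
# so A^(n_p - 1) * B reproduces m^e mod n.
exponents_recombine = (n_p - 1) * e_o + e_l == e
ciphertext_matches = (pow(A, n_p - 1, n) * B) % n == pow(m, e, n)
```

The identity $(n_p - 1)\,e_o + e_l = e$ is what lets the tree rebuild $m^e \bmod n$ from the two values prepared by the coordinator.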
Figure 1(a) presents the schematic structure of a tree-based architecture using seven PEs, and Figure 1(b) shows the inner structure of each PE. The encryption/decryption procedure in TRSA using seven PEs in a tree topology is as follows. Processor elements $p_0$ to $p_2$ receive the value of $A$ as their inputs, and $p_3$ receives both $A$ and $B$. Each of these PEs performs one MulMod operation on its inputs, and PEs $p_0$ to $p_3$ send their results to their parents, which are $p_4$ and $p_5$. The values $D$ and $E$, which are the inputs of $p_4$ and $p_5$, are computed as in (3.7).
In the second step, processor elements $p_4$ and $p_5$ carry out the operations in (3.8) to compute $F$ and $G$.
Then $p_4$ and $p_5$ send their results to the root. The root executes one more MulMod operation on these results to generate the output of $p_6$, which is the encryption of $m$ using RSA. The details of the operations are discussed in [8].
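The data flow described above can be sketched in Python as follows. The individual PE operations are an assumption: each PE is taken to perform one MulMod on its two inputs, with $p_0$ to $p_2$ squaring $A$ and $p_3$ multiplying $A$ by $B$. This is one assignment that is consistent with the description and yields $m^e \bmod n$ at the root; the function names are hypothetical.

```python
def mulmod(x, y, n):
    """Base operation of a PE: one multiplication followed by one modulus."""
    return (x * y) % n


def trsa_encrypt_7pe(m, e, n):
    """Sketch of TRSA with seven tree PEs plus one coordinator (assumed n_p = 8)."""
    n_p = 8
    e_o = e // n_p
    e_l = e_o + e % n_p

    # Coordinator: prepares the leaf inputs A and B.
    A = pow(m, e_o, n)
    B = pow(m, e_l, n)

    # Level 1 (p0..p3): p0..p2 receive A only, p3 receives A and B.
    r0 = mulmod(A, A, n)
    r1 = mulmod(A, A, n)
    r2 = mulmod(A, A, n)
    r3 = mulmod(A, B, n)

    # Level 2 (p4, p5): each parent combines the results of its two children.
    F = mulmod(r0, r1, n)
    G = mulmod(r2, r3, n)

    # Level 3 (root p6): combines F and G into the ciphertext.
    return mulmod(F, G, n)


# Toy check against Python's built-in modular exponentiation.
n, e, m = 61 * 53, 17, 65
assert trsa_encrypt_7pe(m, e, n) == pow(m, e, n)
```

In this sketch the root holds $A^7 B$, and since $7 e_o + e_l = e$ for $n_p = 8$, the result equals $m^e \bmod n$. Note that $p_0$, $p_1$, and $p_2$ compute the same value.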
The pipeline phases are shown in Figure 2. While the computation for $C_1$ is still in process, the computations for other messages have already been started.
The number of levels depends on the number of nodes in the tree. In this example, the number of nodes is eight, of which seven form the tree and the other one is the coordinator. The number of nodes can be larger or smaller, but not less than four. The number of PEs should follow the tree rule, which is $2^{\text{No. of levels}} - 1$; in addition, one processor element is added as the coordinator of the first operation. Using more nodes yields more parallelization; nevertheless, overloading more processor elements for smaller data and key lengths should be prevented. The coordinator's operations decrease as the number of PEs in the tree increases, which leads to less execution time. In this approach, there is always a tradeoff between the number of PEs, the data size, and the key length.
Figure 2: Steps of pipelining in TRSA.

Analysis of Pipelining
Since levels one, two, and three each perform just one MulMod operation, their execution times, indicated by $T_1$, $T_2$, and $T_3$, are equal and we have

$$T_1 = T_2 = T_3 \tag{3.10}$$
The execution time of the operation in the coordinator depends on the computational power of the coordinator and on $e_l$. Although the tree architecture used here is homogeneous in itself, the architecture as a whole is heterogeneous, which means that the coordinator processor element is different from the other processor elements. The PE known as the coordinator must be more advanced than the others. Assuming this, and knowing that the computation in this PE is more complicated than in the others, the difference between the execution time of this processor element and the others should be very small. As an applied example, the coordinator can be the CPU of the embedded system, and the tree part can be used as a coprocessor. In this case, the coprocessor is homogeneous. According to the above discussion, and using $T_c$ as the execution time of the coordinator, we can have

$$T_c \approx T_1 = T_2 = T_3 \tag{3.11}$$
Using pipelining, as clarified in Figure 2, the throughput of TRSA is about four times the throughput of original TRSA for a tree with seven nodes. In the pipelined version, when the computation of $C_1$ is completed, the calculation of $C_5$ is started as a new block, whilst in the original TRSA, when $C_1$ is computed, the calculation of $C_2$ is started as the next block. $Th$ represents throughput in the subsequent equations:

$$Th_{pipelined} = 4\,Th_{original} \tag{3.12}$$
It should be considered that if the number of processor elements increases, the throughput of pipelining increases too. The relation between the throughput of the pipelined TRSA and the throughput of original TRSA, where $l$ represents the number of levels of the tree, is

$$Th_{pipelined} = (l + 1)\,Th_{original} \tag{3.13}$$
In this approach, all PEs are busy all the time computing ciphertexts, while in the original TRSA just one level of the processor elements was busy computing and the others were idle. The pipelining mechanism provides appropriate load balancing. In this example, having just four pipeline stages has led to a higher throughput, and more pipelining by employing a bigger tree architecture will achieve higher performance and a more efficient encryption implementation.
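The throughput relation can be illustrated with a small timing model in Python. This is a sketch under the assumption that every stage (the coordinator and each tree level) takes one equal time step, in line with the discussion of $T_c$, $T_1$, $T_2$, and $T_3$; the function names are illustrative.

```python
def original_steps(num_blocks, levels):
    """Time steps without pipelining: each block passes all stages before the next starts."""
    return num_blocks * (levels + 1)


def pipelined_steps(num_blocks, levels):
    """Time steps with pipelining: fill the pipeline once, then finish one block per step."""
    if num_blocks == 0:
        return 0
    return (levels + 1) + (num_blocks - 1)


def throughput_gain(num_blocks, levels):
    """Ratio of original to pipelined time for the same number of blocks."""
    return original_steps(num_blocks, levels) / pipelined_steps(num_blocks, levels)


def stage_occupancy(num_blocks, levels):
    """For each time step, list the block index (or None) held by each stage.

    Stage 0 is the coordinator; stages 1..levels are the tree levels.
    """
    stages = levels + 1
    table = []
    for t in range(pipelined_steps(num_blocks, levels)):
        row = []
        for s in range(stages):
            block = t - s
            row.append(block if 0 <= block < num_blocks else None)
        table.append(row)
    return table
```

For a three-level tree, `stage_occupancy(5, 3)[4]` places the fifth block (index 4) in the coordinator while the second block (index 1) is at the root, and `throughput_gain` tends to $l + 1 = 4$ as the number of blocks grows.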

Optimization of TRSA
The original TRSA method can be improved and optimized to increase efficiency as well as decrease area, which is one of the most important factors in energy- and size-limited systems such as sensor nodes. In the original TRSA, some nodes perform the same operation and generate the same result. Knowing this, the results of some PEs can be used instead of the results of some other PEs, and these redundant PEs can be eliminated. Figure 3 illustrates the transition from original TRSA to optimized TRSA. Considering four levels in a tree, the PEs which perform the same operation are marked with the same shape in Figure 3(a). The nodes that can be eliminated from the tree to optimize the algorithm are omitted in Figure 3(c), and new connections are created.
Eliminating redundant PEs leads to a smaller area usage. As the number of levels of the tree increases, the number of redundant PEs increases. If there are more than four levels of parallelization, the number of eliminated PEs increases exponentially.
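One way to picture the elimination of redundant PEs is the Python sketch below. It is an assumption about how the reduced tree could be wired, not a reproduction of Figure 3: each level below the root keeps one PE that squares the shared value and one PE that handles the branch containing $B$, and the root multiplies the two, which gives $2l - 1$ PEs for $l$ levels. The names are hypothetical.

```python
def optimized_trsa_encrypt(m, e, n, levels):
    """Sketch of an optimized tree: duplicate squarings are computed only once.

    Returns the ciphertext and the number of PEs used (assumed wiring).
    """
    n_p = 2 ** levels                 # tree PEs of the full tree plus one
    e_o = e // n_p
    e_l = e_o + e % n_p

    # Coordinator: prepares A and B.
    shared = pow(m, e_o, n)           # value common to all ordinary branches (A)
    last = pow(m, e_l, n)             # value of the branch that carries B

    pe_count = 0
    for _ in range(levels - 1):
        # Two PEs per level: one squaring PE and one PE for the last branch.
        # The tuple assignment uses the old value of `shared` on the right side.
        shared, last = (shared * shared) % n, (shared * last) % n
        pe_count += 2

    # Root PE: final MulMod.
    ciphertext = (shared * last) % n
    pe_count += 1
    return ciphertext, pe_count


# Toy check: four levels need 7 PEs instead of 15 and still give m^e mod n.
c, used = optimized_trsa_encrypt(65, 17, 61 * 53, levels=4)
assert c == pow(65, 17, 61 * 53) and used == 7
```

After $j$ levels the shared value is $A^{2^j}$ and the last branch holds $A^{2^j - 1} B$, so the root produces $A^{2^l - 1} B = m^e \bmod n$.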

Analysis of Optimization
Efficiency of parallel approaches [24-26] is measured using the efficiency factor $E = S/P$, where $P$ and $S$ are the number of processors and the speedup, respectively. Assuming the number of PEs to be 15, as demonstrated in Figure 3, the efficiency of the algorithm increases from $S/15$ to $S/7$ with optimization. In general form, the efficiency of original TRSA is $E = S/(2^l - 1)$, where $l$ is the number of tree levels. With optimization, the efficiency of optimized TRSA becomes $E_{opt} = S/(2l - 1)$. The improvement of efficiency is obtained by the ratio $E_{opt}/E = (2^l - 1)/(2l - 1)$. This ratio shows that the efficiency gain grows exponentially as the number of levels increases.
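The effect of the PE count on efficiency can be tabulated with a few lines of Python. This sketch assumes the same speedup $S$ for the original and the optimized tree, so only the processor counts $2^l - 1$ and $2l - 1$ differ; the function names are illustrative.

```python
def pe_count_original(levels):
    """Number of PEs in the full binary tree with the given number of levels."""
    return 2 ** levels - 1


def pe_count_optimized(levels):
    """Number of PEs after removing redundant ones."""
    return 2 * levels - 1


def efficiency(speedup, pes):
    """Efficiency factor E = S / P."""
    return speedup / pes


def efficiency_gain(levels):
    """Ratio of optimized to original efficiency for equal speedup."""
    return pe_count_original(levels) / pe_count_optimized(levels)


# Gain for a few tree sizes; for four levels the PE count drops from 15 to 7.
gains = {l: efficiency_gain(l) for l in (3, 4, 8, 16)}
```

Because the numerator grows as $2^l$ while the denominator grows linearly in $l$, the gain increases quickly with the number of levels.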
The number of multiplications for the accepted method, which is the binary method, is $k - 1$ in the best case, where $k$ is the number of bits of the exponent [14, 22, 27]. The numbers of multiplications in the binary method for the worst and average cases are $2(k - 1)$ and $1.5(k - 1)$, respectively. Let $l$ be the number of levels; the exponents of $A$ and $B$ in (3.4) and (3.5) are obtained by dividing $e$ by the number of PEs including the coordinator, which is $2^l$. Hence, $2(k - 1 - \log_2 2^l)$ multiplications are needed to compute $A$ and $B$. Thus the number of multiplications of the coordinator for the worst case is $2(k - l - 1)$. The number of multiplications in the tree of optimized TRSA is $2l - 1$, thus the total multiplication count is $2(k - l - 1) + 2l - 1$. Since some operations execute simultaneously in the parallel architecture, the effective number of multiplications is much less than the total. The total number of PEs in optimized TRSA is $2l - 1$, where $l$ is the number of tree levels, and the number of sequential multiplications in the tree is $l$ because of the concurrency of PEs. Therefore, the number of multiplications for encryption is decreased to $2(k - l - 1) + l$.
Using the calculation method from [11], the number of multiplications is denoted by $\eta(k, l)$, and as a result the number of multiplications for the optimized TRSA in the worst case is

$$\eta(k, l) = 2(k - l - 1) + l$$

Table 1 compares the number of multiplications in the best, average, and worst cases for binary, CRT, Montgomery, and optimized TRSA, for both the general form and a tree with 60 levels.
The number of multiplications of primitive RSA is reduced using the optimized TRSA method compared to the CRT and Montgomery methods. Although the best cases of TRSA and the binary method are the same, the average and worst cases of TRSA are improved. The level of this improvement depends on the number of MulMod blocks in the tree topology and on the length of the RSA cryptographic key in bits.
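The multiplication counts discussed in this section can be compared numerically with the short Python sketch below. It only encodes the binary-method counts and the worst-case expression $2(k - l - 1) + l$ for optimized TRSA; the function names and the sample values of $k$ and $l$ are chosen for illustration and are not entries of Table 1.

```python
def binary_multiplications(k):
    """Multiplication counts of the binary method for a k-bit exponent."""
    return {
        "best": k - 1,
        "average": 1.5 * (k - 1),
        "worst": 2 * (k - 1),
    }


def optimized_trsa_worst(k, l):
    """Worst-case sequential multiplications for optimized TRSA with l levels.

    2(k - l - 1) for the coordinator plus l sequential tree multiplications.
    """
    return 2 * (k - l - 1) + l


def worst_case_saving(k, l):
    """Multiplications saved in the worst case relative to the binary method."""
    return binary_multiplications(k)["worst"] - optimized_trsa_worst(k, l)


# Illustrative values: a 1024-bit exponent and a 60-level tree.
saving = worst_case_saving(1024, 60)
```

Algebraically the worst-case saving is $2(k - 1) - [2(k - l - 1) + l] = l$, so under this model each extra tree level removes one sequential multiplication.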

Simulations, Results, and Discussion
Benefiting from the pipelining technique and the optimization method, the variations of TRSA presented in this paper outperform original TRSA. According to [8] and the results of the time complexity analysis, which is based on the number of multiplications, original TRSA also outperforms the well-known existing approaches from the literature, which are CRT, Montgomery, binary, and the sequential approaches.
The advantages of pipelining in terms of throughput are presented in Figure 4(a). The throughput is $Th_{pipelined} = (l + 1)\,Th_{original}$. The advantages of optimization in terms of efficiency are presented in Figure 4(b).
Considering the number of levels for TRSA to be three, the throughput of pipelined TRSA is four times that of original TRSA. The more levels are employed, the better the throughput obtained.
The simulation results of optimization, averaged over 100 iterations, are used to draw Figure 4(b). The results are obtained using the C language in Microsoft Visual Studio 2008, OpenMP, and OpenSSL. The efficiency is $E = S/P$, where $S$ represents the execution time for TRSA and optimized TRSA, and the efficiency is multiplied by 100 to be expressed in percent. When the number of PEs for original TRSA is seven, the number of PEs in optimized TRSA becomes five. Just as in the case of pipelining, the more levels of the tree are employed, the more efficiency is obtained.

Mathematical Problems in Engineering 9
Figure 4(b) shows the efficiency of the original TRSA versus optimized TRSA for 1024-, 2048-, and 4096-bit key lengths using 64-, 128-, and 256-byte input files and 61, 29, and 15 levels, respectively. The optimized TRSA has better efficiency than original TRSA, which means that less area is required.
Selecting the number of levels for a tree depends, among other things, on the area and the speed desired for the security needs of the target system. The speed and the number of PEs are always related to each other.
The optimized TRSA is more efficient than original TRSA, which means that it requires less area, which in turn leads to less hardware complexity. It is necessary to clarify that the two variations of the original TRSA presented in this paper are not in conflict with each other. Taking pipelining into account along with the optimization, the outcome improves speed, throughput, and efficiency, three significant parameters, all at the same time.

Conclusions
This paper has presented a pipelining and an optimization approach for original TRSA to achieve better results in RSA encryption using the TRSA method. The efficiency of the original TRSA approach is improved using the optimization, whilst the throughput is increased by pipelining. The increase in efficiency resulting from the optimization of TRSA depends on the number of key bits and the number of tree levels. Just like efficiency, the throughput also increases as the number of tree levels increases. The simulation results confirmed that, with a larger key, the efficiency of optimized TRSA becomes better than that of original TRSA. Applying the optimization decreases the number of PEs, which results in less area usage, which is extremely important in size- and energy-limited embedded systems.
$n_p$: The number of PEs of the tree + 1
$e_l$: The exponent for the last leaf node's input
$e_o$: The exponent for the other leaf nodes' input

Figure 1: (a) Processor elements in tree architecture, (b) interior structure of each PE.

Figure 4: Optimization and pipelining of TRSA. (a) Throughput of original TRSA versus pipelined TRSA. (b) Efficiency of original TRSA versus optimized TRSA.

Table 1: Number of multiplications in binary, CRT, Montgomery, and optimized TRSA.