Improved Matrix Multiplication by Changing Loop Order

Matrix multiplication has been implemented in various programming languages, and improved performance has been reported in many articles under various settings. Matrix multiplication is of paramount interest to machine learning, a lightweight matrix-based key management protocol for IoT networks, animation, and so on. There has always been a need and an interest for improved performance in terms of algorithm implementation. In this work, the authors compared the run times of matrix multiplication in popular languages such as C++, Java, and Python. This analysis showed that Python’s implementation was poor while Java was relatively slower compared to the C++ implementation. All the aforementioned languages use a row-major scheme, and hence, there are many cache misses encountered when implemented through simple looping. In contrast, the authors show that by changing the loop order, more performance gains are possible. Moreover, we evaluated the performance of matrix multiplication by comparing the execution time under various loop settings. The authors observed tremendous performance gains due to better spatial locality. In addition, the authors also implemented a parallel version of the same algorithm using OpenMP with eight logical cores and achieved a speed-up of seven times compared to the serial implementation.


Introduction
With the emergence of smart devices in many application areas, such as healthcare, location-based services, and selfdriving vehicles heavily depends on the e cient processing of data produced by these devices. Processing such data are very challenging, especially when data is produced in zettabytes. On the other hand, the performance of an application is no more linearly linked to the processor clock frequency as it was a norm until 2004, where doubling the system speed would roughly result in a 50% performance gain. However, in 2004, the chip manufacturers encountered a "Power Wall" and faced fundamental constraints in power delivery and heat dissipation. "Power Wall" refers to the di culty of scaling the performance of computing chips and systems due to fundamental constraints imposed by physics.
is limitation on increasing processor frequency introduced the era of multicore computing, where multiple cores run an application in parallel for better performance. Multicore systems and even low-end small devices such as tablets and smartphones, which have two or more cores, currently dominate the computing domain. On the other hand, writing algorithms to fully exploit these multicore systems remains a challenging task.
With multicore systems, personal computers today o er the computation power of supercomputers. Such speed provides many prospects for performing computationally extensive operations. To exploit this facility, several serial applications present the opportunity to write parallel versions, especially in applications where it is of principal interest to exploit the potential parallelism in the application.
is trend of parallel processing theoretically shows linear performance gains with an increased number of processors at an abstract level and excellent performance when run as a parallel system. However, not all code can run in parallel and, hence, serial parts seriously limit the system performance. Unlike serial counterparts where performance is measured through complexity analysis, the performance of parallel systems is evaluated through Amdahl's or Gustafson's laws [1,2].
Traditionally, until 2002, the majority of computer systems had one processor, and accordingly, sequential algorithms were developed. Today, the latest computers, which consist of multiple processing elements (either multiple CPU cores or GPU), are more powerful due to multiple processing units. Understandable serial codes are slow as they need to process the instructions one by one, in contrast to a parallel execution where multiple instructions are processed simultaneously. e performance of many serial programs can be improved by exploiting the parallel architectures, which can be in terms of loop reorder, pipelining, and speculation, and so on. It is worth noting that not every program can be converted in parallel, and there are cases where the parallel version exhibits poor performance compared to the serial version. With the latest computing architecture and parallel languages, tremendous improvements have been reported for machine learning, AI, a lightweight matrix-based key management protocol for IoT networks, graphics, computational photography, and computer vision by exploiting parallelization [3][4][5][6][7].
Benefits such as power efficiency, resource pooling, cost reduction, system availability, and improved computational power can be obtained with cloud computing infrastructure. All such benefits attract computer scientists, systems engineers, the research community, and high-performance computing (HPC) customers to the cloud domain. On the other hand, HPC programs often use a large number of processors to lower the run times of tasks. An issue with such processors is synchronization, in addition to communication overheads. It has been reported in [3] that shifting an HPC application to the cloud environment can negatively impact the aforementioned difficulties and even introduce additional issues of virtualization, multitenancy, network latency, and so forth.
In this paper, we limit our analysis to a multicore system by observing the program behavior and changing loop orderings [4][5][6][7]. We apply a parallel construct for matrix multiplication to gain better performance. In this work, we study the following: e impact of loop reordering on performance Parallel implementation under OpenMP framework to illustrate the speed-up obtained through parallelism e remaining part of this paper is divided into four sections as. In the background Section, we discuss the groundwork, followed by an explanation of the row and column major order of data in the memory in the impact of row-major programming constructs Section. We discuss the behavior of matrix multiplication for a square matrix by changing the loop ordering. In the Discussion Section, we provide the parallel version of the code and discuss the experimental results. e paper then offers a conclusion as the last section.

Background
Until the beginning of the 21st century, advances in technology would simply be considered an increase in the clock speed. Naturally, the software would effectively "speed-up" automatically over time because of running faster processors.
e clock speed of microprocessors increased exponentially through the 1990s and beyond, but after 2004, it reached a limit due to physics, and the clock speed is now limited by power consumption/heat dissipation. With little improvement in clock speed, such performance gain convenience is no longer an option for software engineers. As a solution to lower power consumption, the dynamic voltage scaling concept was introduced in CMOS technologies. e literature shows that the relationship between frequency and voltage in modern processors [8] can be written as: E = P * T, where "E" denotes the power consumption, "P" represents the average power, and "T" is the time taken for this average power.
Today, advances in technology mean increased parallelism and not enhanced clock speed. us, exploiting such parallelism is one of the outstanding challenges of modern computer science. e parallel programming and parallelization of the tasks are done for the main purpose of allowing tasks to be executed at the same time by utilizing multiple computer resources and multiple cores on the same CPU. is process is very critical, especially for large-scale projects where speed is needed. Parallel programming is making its way into various domains, ranging from drug discovery to data analytics to the animation industry. All these applications are computation intensive and traditional sequential code becomes inefficient. However, just increasing the number of processors does not always guarantee performance gains and depends on the nature of the problem to be parallelized. Parallel implementation is prone to overheads such as: task start-up, time, synchronization, SATA communications, and software overhead imposed by parallel languages, libraries, operating system, and so on. When a parallel code is developed, these factors need to be considered carefully.
In the parallel programming literature, a massive parallel system means that the hardware of a given parallel system is comprised of many processing elements, and currently, the largest parallel computers include processing components in the range of hundreds of thousands to millions. Similarly, embarrassing parallel applications refer to a set of applications where independent tasks can run simultaneously and there exists very little to no need for coordination between the tasks. Another term that is used in the parallel programming domain is "scalability," which points to a parallel system's ability to demonstrate an adequate increase in parallel speed-up. Such a situation is understandable with the addition of more resources. Factors that contribute to scalability include algorithms, overheads, hardware, the characteristics of a program, and so forth. In parallel programming, efficiency is defined as the amount of work needed to be done, while "performance" points to how fast an algorithm can finish a particular work. A faster implementation is not necessarily an efficient one, where efficiency points to the full exploitation of available hardware resources.
In a program running on a parallel system, it is possible that some instructions need to be accomplished in sequence.
is sequential execution has a limiting factor on program speed-up such that even adding more processors may not make the program run any faster. For instance, if a program takes 20 minutes to finish using a serial code with one thread, and when the 5 minute portion of the code cannot be made parallel, the remaining 15 minutes of processing can be written as parallel code. In such situations, irrespective of how many threads are devoted to the parallelized execution of this program, the minimum execution time cannot be less than 15 minutes. For such evaluation of a program in the parallel computing domain, Amdahl's law [1] is used and can be represented as follows: where S shows the theoretical speed-up of the execution, speed represents the speed-up of the part of the task that benefits from improved system resources, and P is the proportion of the execution time that the part benefiting from the improved resources originally occupied. One of the most important criteria in parallel computing is to actually measure how much faster a parallel algorithm runs with respect to the best sequential one. is measure is known as "speed-up." In other words, speed-up is the gain in speed made by a parallel execution compared to a sequential execution. Any program that results in higher speed is not necessarily efficient. e efficiency of a program is described as using p processors, or how effectively all system hardware elements are being utilized. If the efficiency is 1.0, then it is the maximum theoretical efficiency and shows the optimal usage of computational resources available for execution. If speed-up is not greater than linear, the efficiency will be less than or equal to 1.0, and this is normally the situation in practical cases.
In computer science, matrix multiplication is of great interest to many application areas, and a lot of work has been done in this regard in the literature [3,[9][10][11][12][13][14][15][16][17][18][19][20][21][22][23]. It has been shown that the number of processes does not necessarily result in performance gain. Recently, the authors in [3] measured the speed-up and efficiency of a matrix multiplication benchmark running on Amazon EC2. eir experiment shows why the performance of HPC applications on the cloud is not predictable due to the shared resources and multitenant environment of the cloud [3]. Various improvements in matrix multiplication have been discussed in the literature [16,24,25], and further gains are possible using GPUs [26][27][28][29]. Recently, for more secure communication between these IoT devices, the authors in [30] extended the work by proposing a lightweight matrix-based key management protocol for IoT networks.
OpenMP is an open specification for multiprocessing and offers a standard API for defining multithreaded, shared-memory programs [31]. e OpenMP high-level API consists of 80% preprocessor (compiler) directives, 19% library calls, and around 1% of environment variables. is framework presents the fork-join model of parallel execution. OpenMP is an API that is portable, supports threading, and can work with shared-memory programming specifications with "light" syntax. It is to be noted that the exact behavior depends on the OpenMP implementation and the number of threads. OpenMP is an advanced API and works for both C and C++; and it requires compiler support. Since a program can have serial and parallel sections, OpenMP allows a programmer to separate a program into serial regions and parallel regions, hide stack management, and provide synchronization constructs. As a potential drawback, OpenMP cannot detect dependencies in the code nor guarantee speed-up. In addition, it cannot provide freedom from data races, and it is the responsibility of the programmer to avoid such cases.

Impact of Row-Major Programming Constructs
In the computer science domain, two methods exist for storing multidimensional arrays, such as matrices, in linear storage and in random access memory. ey are called column-major and row-major orders.
ese methods are different in the way in which elements are stored contiguously in the memory. In column-major order, elements are arranged consecutively along the column, while elements are arranged consecutively along the row under row-major sequence. Python, C, C++, Objective-C, and Java implement the column-major order when storing elements in memory, while FORTRAN, MATLAB, Julia, and Pascal use the column-major order.
In row-major order, elements are placed in memory as shown in Table 1, where the entire row is placed at one location in the memory, ignoring the cache line size such that there are multiple rows of the matrix. Table 2 represents the representation of the elements of Matrix C in cache, where a minimal block of data is transferred between the memory and the cache in a better algorithm, and the entire block of memory is placed in the cache line i.e., the word length and block length are of the same size in this paper. We then study the effect of spatial locality on program performance.
We name the first Matrix A, the second B; the product of Matrixes A and B is stored in Matrix C, as shown in Figure 1. It can be seen that under the j, k, i loop ordering, the program takes the longest due to poor placements of elements in memory. In Figure 2, we can see B has excellent spatial locality, but the code is dominated by two other Matrices A and C, where the placement is poor and, hence, the program shows worse case behavior when implanted in C, C++, Python, or Java.
As an alternative, we implemented the program as follows, written in C++ in Figure 3. We can now visualize the memory layout in Figure 4, where Matrix B offers poor, Matrix A offers good, while Matrix C presents the best spatial locality for the elements. In Figure 4, it can be seen that for Matrix C, the elements are placed n elements apart, and that is why there is a miss, which takes more time for the system to load the value. For Matrix B, it is relatively closer, as only a desired number of steps are taken, but for Matrix C, only one location is updated, and hence, it offers an excellent spatial locality. is program improves the performance by a factor of three when the matrix is of dimension 4096 and implemented in C++.
We then rewrote the code and adjusted the loop order (see Figure 5).
is arrangement did not affect the correctness of the program. e aim of this arrangement was to Mobile Information Systems obtain a better spatial locality, as the program takes much less time compared to the counterpart implements. e corresponding spatial locality for Figure 5 is shown in Figure 6. It can be seen that since Matrixes C and B offer good spatial locality, Matrix A has excellent spatial locality, and hence, this arrangement surpasses its other counterparts in performance. eoretically, the code shown in Figure 5 is approximately 14 times faster than the one implemented in Figure 2 due to better spatial locality.
is arrangement favors row-major languages, while it will hurt the performance of column-major languages due to the inappropriate memory layout of array elements.

Discussion
is experiment was conducted on Windows 10 and the system information is shown in Table 3. e complexity of the program is 2n 3 , where n represents the loop iterations. Our analysis shows that Python's implementation is poor in all three languages. While Java is relatively slow compared to the C++ implementation.
is study is limited to matric multiplication and test programs written in C++. Java and Python to perform matrix-matrix multiplication as follows: (iv) Observe the behavior of loop reordering.
(v) Evaluate the performance gains with parallel implementation in OpenMP.
We first ran sequential versions of the program while changing the matrix size n from 500 to 2500 in steps of 500. We recorded the time to perform matrix-matrix multiplication for each execution. It is worth noting that interpreters can easily support high-level programming features [5]. is is due to the flexibility with which the interpreter reads, interprets, and performs each program statement and updates the machine state. However, the features of dynamic code alteration come at the cost of performance loss. is is why Python is the slowest in this class. While Python is interpreted, Java overcomes this drawback with a just-intime compiler feature. Since C++ is compiled, it presents the best performance. In Table 4, we show our results where the runtime for C++, Java, and Python is extracted for a matrix starting from size 500 to 2500. e just-in-time compiler of Java can recover some of the performance lost by interpretation. In Java, when the code is executed for the first time, it is interpreted first. Interestingly, the system keeps information about how often various pieces of code are executed. Whenever a piece of code executes frequently, the code is compiled to machine code in real time, and the next executions of that code use the more efficient compiled version [3].
We further studied the behavior of C++ by changing the loop order. In matric multiplication, there exist combinations where changing the order of the loops will not affect the correctness of the program. In the following experiment, we showed such combinations and provided the performance of the code (see Table 5). In all the aforementioned three programming languages, matrices were placed in row-major order. It is worth mentioning that the matrix size was the same for all experiments under C++, Java, and Python, while the time varied significantly due to the arrangements of the matrix elements in the memory. As discussed earlier, the poor arrangement of elements results in more cache misses and thus the program takes more time, while an excellent arrangement guarantees improved performance. e parallel version of the code is shown in Figure 7, where pragma "omp parallel for" is used. When this code runs on real hardware, the number of threads, which depends on the hardware and operating system, becomes of interest for performance. In Figure 7, we did not specify the number of threads, but the code can run on any system where more threads will definitely enhance the execution of the code significantly.
e results show the improvement of code in C++ over Java and Python. In our last experiment, we extracted results for a parallel code written in C++. We used the pragma "# pragma omp parallel for" in the parallel implementation of the code. e parallel matrix multiplication program with the support of OpenMP, shows a noticeable gain in speed-up for varying matrix sizes. rough our analysis, we conclude that the row-major policy is better for the matrix multiplication using the loop ordering i, k, j. It is worth noting that although i, j, k order is the natural one and easy to   understand, it results in poor memory layout, and therefore, the performance is poor. e order i, k, j places the data in such a way that the cache hit is increased, and thus, more gains are observed. Considering the i, k, j order as the most suitable looping order, we recommend this strategy for improved performance, irrespective of serial and parallel versions. We show our results for both serial and parallel implementation in Table 6. Hence, we implemented the best serial code, which was i, k, j ordering, and hence, the results were superior. After implementing the code using the OpenMP framework, it is observed that the execution time of the parallel version is approximately seven times better than the execution time of the serial one, and thus, an improvement of around seven times has been obtained. It is worth noting      Mobile Information Systems that the efficiency is achieved mainly due to the algorithm, while the system performance depends on the data structure.
It can be noted that the performance is not linear in our experimentation, and this is understandable. In theory, the speed should be eight times faster as we have eight cores on the system, but since there are communication, synchronization, etc., overheads involved, and speed-up is not linear,

Conclusions
We implemented matrix multiplication for row-major order languages and observed that C++ is more efficient in terms of runtime. e performance gap in programming languages, such as C++, Java, and Python, is due to the use of interpreters and compilers. Since C++ becomes compiled to machine language, it is the fastest. As a workout solution, the just-in-time feature of Java makes it faster than Python for matrix multiplication. We observed that the layout of elements in the memory has a large impact on the computational cost of a program. e parallel implementation of C++ is obtained in a speed-up of a factor of 7.0. As a future work, it will be interesting to extend this work to implement applications such as chess, binary decision diagrams, and logistic regression, and so on.

Data Availability
Data will be made available from the author upon request.

Conflicts of Interest
e author declares that there are no conflicts of interest.