Multi-language programming environments for high performance Java computing

Recent developments in processor capabilities, software tools, programming languages and programming paradigms have brought about new approaches to high performance computing. A steadfast component of this dynamic evolution has been the scientific community’s reliance on established scientific packages. As a consequence, programmers of high-performance applications are reluctant to embrace evolving languages such as Java. This paper describes the Java-to-C Interface (JCI) tool which provides application programmers wishing to use Java with immediate accessibility to existing scientific packages. The JCI tool also facilitates rapid development and reuse of existing code. These benefits are provided at minimal cost to the programmer. While beneficial to the programmer, the additional advantages of mixed-language programming in terms of application performance and portability are addressed in detail within the context of this paper. In addition, we discuss how the JCI tool is complementing other ongoing projects such as IBM’s High-Performance Compiler for Java (HPCJ) and IceT’s metacomputing environment.


Introduction
It is generally accepted that large heterogeneous distributed systems based on the emerging hybrid shared/distributed memory architectures will become the largest and most cost-effective supercomputers over the next decade.They will provide a platform for a new generation of high performance applications including portable applications that can be executed on a number of Internet sites; resource-intensive applications that must aggregate distributed resources (memory, data, computation) to produce results for the problem sizes of interest; and coupled applications that combine computers, immersive and visualization environments, and remote instruments.This makes the search for the most appropriate programming model and corresponding programming environments more important than ever before.Arguably the most serious obstacle to the acceptance of parallel supercomputers is the so-called software crisis.Software, in general, is considered the most complex artifact in highperformance computing; since the lifespan of parallel machines has been so brief, their software environments rarely reach maturity and the parallel software crisis is especially acute.Hence, portability and software re-use, in particular, are critical issues in enabling high-performance computing on the new type of heterogeneous platforms.In order to avoid these problems application programmers need flexible yet comprehensive interfaces which offer the oportunity for multi-language software design and implementation.
The Java language has several built-in mechanisms which allow the parallelism inherent in scientific programs to be exploited.Threads and concurrency constructs are well-suited to shared memory computers, but not large-scale distributed memory machines.Although sockets and the Remote Method Invocation (RMI) interface allow network programming, they are rather low-level to be suitable for SPMD-style scientific programming, and thus, codes based on them would potentially underperform platform-specific implementations of standard communication libraries like MPI.Nevertheless, as a programming language, Java has the basic qualities needed for writing highperformance applications.With the maturing of compilation technology, such applications written in Java will doubtlessly appear.Fortunately, rapid progress is being made in this area by developing optimizing Java compilers, such as the IBM High-Performance Compiler for Java (HPCJ), which generates native codes for the RS6000 architecture [12].Since the Java language is fairly new, however, it lacks the extensive scientific libraries of other languages like Fortran-77 and C.This is one of the major obstacles towards efficient and user-friendly computationally intensive programming in Java.
In order to overcome the above problems, we have applied a Java-to-C Interface (JCI) generating tool to create Java bindings for various legacy libraries [8].In this article we show that with the existing performancetuned libraries already available on different platforms and the multi-language interfaces automatically created by the JCI tool, we can build diferent kinds of multi-language programming environments for high performance Java computing in a flexible and elegant way.In principle, the binding of native libraries to Java has certain limitations though.In particular, for security reasons applets downloaded over the network may not load libraries or define native methods.We also show one possible solution of this problem by using the IceT virtual environment [11].In this case both processes and data would be allowed to migrate and to be transferred throughout owned and unowned resources, under flexible security measures imposed by the users.We also present some evaluation results, which demonstrate the efficiency of our approach.

Automatic binding of legacy codes to Java
Accesing native codes from Java looks not that difficult as Java implementations have a native method interface specificaly designed for this purpose.Although the native interface was not part of the original Java language specification [10], and different vendors have offered incompatible interfaces, the Java Native Interface (JNI) [9] is now the definitive standard for interfacing native code to Java.Implementing JNI in software development involves more than just dynamic linking to the Java virtual machine.Complications stem from the fact that Java data formats are in general different from those of other languages like C, C++, Fortran, etc. which obviously requires data conversion of both arguments and results in multi-language applications.This conversion is a natural part of the native code in case both parts of a multi-language piece of software are to be written from scratch.For legacy codes, however, an additional interface layer called binding or wrapper must be written which performs data conversion and other functions if necessary.
Binding a native legacy library 1 to Java may also be accompanied by portability problems as the JNI specification is not yet supported in all Java implementations on different platforms.Thus to maintain the portability of the binding one may have to cater for a variety of native interfaces.A large legacy library like MPI, for example, can have over a hundred exported functions, therefore the JCI tool which generates the additional interface layer automatically plays central role in creating flexible multi-language programming environments.
A block diagram of JCI is shown in Fig. 1.The tool takes as input a header file containing the C function prototypes of the native library.It outputs a number of files comprising the additional interface: a file of C stub-functions, files of Java class and native method declarations, and shell scripts for compiling and linking the binding.The JCI tool generates a C stubfunction and a Java native method declaration for each exported function of the native library.Every C stubfunction takes arguments whose types correspond directly to those of the Java native method, and converts the arguments into the form expected by the C library function.As we mentioned already, different Java native interfaces exist, and thus different code may be required for binding a native library to each Java implementation.We have tried to limit the dependence of JCI output on the native interface version to a set of macro definitions describing the particular native interface.Thus it may be possible to re-bind a library to a new Java machine simply by providing the appropriate macros.
The JCI tool offers flexible Java bindings for native libraries.For example, by using different library header files as input, we can create bindings for multiple versions of a library, e.g., MPI-1.1,MPI-1.2,MPI-2.0.Furthermore, JCI can be used to generate Java bindings for libraries written in languages other than C, provided that the library can be linked to C programs, and prototypes for the library functions are given in C. We have created Java bindings for the ScaLAPACK constituent libraries written in Fortran-77: BLAS Level 1-3, PB-BLAS, LAPACK, and ScaLAPACK itself [8].The C prototypes for the library functions have been inferred by a Fortran-to-C translator f2c [6].
While automatic binding is certainly convenient, sometimes there may be a price to pay: the data conversion may impose a performance penalty.For exam- ple, some scientific native library functions take multidimensional arrays (e.g., matrices) as arguments.The JCI tool supports multidimensional arrays, but a runtime penalty is incurred: such arrays must always be relocated in memory in order to be made contiguous before being supplied to a native function.When large data arrays are involved the inefficiency can be significant.In order to avoid it, we have chosen to represent matrices in Java as one-dimensional arrays in our ScaLAPACK library bindings.On the other hand, in the Java binding for MPI [13] multi-dimensional arrays are left intact without significant inefficiency.
Large arrays used as data buffers can have their layout described by an MPI derived data type, and the Java binding performs no conversion for them.Multidimensional arrays used in MPI as descriptors are relatively small.

Environment based on conventional JVM
The initial structure of our programming environment including all basic components is illustrated in Fig. 2. The JCI tool takes as input the header file containing the C function prototype declarations of the native legacy library and generates automatically all files comprising the wrapper as required.Then, the bound libraries can be dynamically linked to the Java Virtual Machine (JVM) upon demand and used during the execution.So far we have done experiments with two varieties of the JVM -the Java Development Kit (JDK) for Solaris on a cluster of Sun workstations; and IBM's port of JDK for AIX 4.1 on the SP2.
Most JVMs contain a Just-in-Time (JIT) compiler to improve the execution performance.A JIT compiler turns Java bytecode into native code on-the-fly, as it is loaded into the JVM.The JVM then executes the generated code directly, rather then interpreting bytecode, which leads to a significant performance improvement.In this way, the best performance results using the programming environment in Fig. 2 can be achieved, but the execution time is still much longer in comparison with similar computations using conventional languages such as Fortran-77 or C and the corresponding compilers.The reason for this noticeable difference is twofold -firstly, the JIT translation adds an extra overhead to the execution time; and secondly, the compilation speed requirements constrain the quality of optimisation that a JIT compiler can perform.Therefore, the performance of this environment is relatively low Fig. 2. Programming environment using a conventional Java virtual machine.as there is usually a large imbalance between the efficiency of the performance-tuned implementations of legacy libraries and the rest of the code at execution time.If, however, the processing requirements of the Java code are negligible, this programming environment can still deliver acceptable performance figures.
In order to evaluate the performance of the multilanguage environment based on conventional JVM, we have used the Java version of the Matrix Multiplication (MATMUL) benchmark from the PARKBENCH suite [14].The original benchmark is in Fortran-77 and performs dense matrix multiplication in parallel.It accesses the BLAS, BLACS and LAPACK libraries included in the PARKBENCH 2.1.1 distribution.MPI is used indirectly through the BLACS native library.We have run MATMUL on a Sparc workstation cluster, and on an IBM SP2 machine with 66MHz Power2 "thin1" nodes, 128Mbyte RAM, 64bit memory bus, and 64Kbyte data cache.The results are shown in Table 1 and Fig. 3.

Environment based on native Java compilers
An optimising native code compiler for Java can be used instead of the JVM in order to overcome the above problem.Such a compiler translates bytecode directly into native executable code as shown in Fig. 4. It works in the same manner as compilers for C, C++, Fortran, etc. and unlike JIT compilers, the static compilation occurs only once, before execution time.Thus, traditional resource-intensive optimisations can be applied in order to improve the performance of the generated native executable code.In our experiments, we have used a version of HPCJ, which generates native code for the RS/6000 architecture.The input of HPCJ is usually a bytecode file, but the compiler will also accept Java source as input.In the latter case it invokes the JDK source-to-bytecode compiler to produce the bytecode file first.This file is then processed by a translator which passes an intermediate language representation to the common back-end from the family of compilers for the RS/6000 architecture.The back-end outputs standard object code which is then linked with other object modules and the previously bound legacy libraries to produce native executable code.In this way, our programming environment conforms to the basic We have run the IS benchmark on two platforms: a cluster of Sun Sparc workstations, and an IBM SP2 system with 120 MHz POWER2 Super Chip processors, 256 MB of memory, 128 KB data cache, and 256 bit memory bus.The results obtained on the SP2 machine are shown in Table 2 and Fig. 5.The Java implementation we have used is IBM's port of JDK 1.0.2D(with the JIT compiler enabled).The communications library we have used is the LAM implementation (version 6.1) of MPI from the Ohio Supercomputer Center [4].The results for the C version of IS under both LAM and IBM MPI are also given for comparison.
It is evident from Fig. 3 that Java MATMUL execution times are only 5-10% longer than Fortran-77 times.These results may seem surprisingly good, given that Java IS is two times slower than C IS (Fig. 5).The explanation is that in MATMUL most of the performance-sensitive calculations are carried out by the native library routines (which are the same for both Java and Fortran-77 versions of the benchmark).In contrast, IS uses a native library (MPI) only for communication, and all calculations are done by the benchmark program.
It is important to identify the sources of the slowdown of the Java version of IS with respect to the C version.To that end we have instrumented the JavaMPI binding, and gathered additional measurements.It turns out that the cumulative time spent in the C functions of the JavaMPI binding is approximately 20 milliseconds in all cases, and thus has a negligible share in the breakdown of the total execution time for the Java version of IS.Clearly, the JavaMPI binding does not introduce a noticeable overhead in the results from Table 2.

Virtual environment using IceT
Up to this point, the motivation behind mixing languages has been driven by a goal to increase performance of a stand-alone process, in which case the utilization of native codes has significant attraction.Although in most cases the performance improvement when utilizing native codes is substantial, a relatively small price has to be paid.This has already been mentioned, namely the additional overhead incurred in the data translation which is an unavoidable part of the wrapping code when passing between languages.Other costs which are incurred, however, are in portability and platform independence.These aspects are not as pronounced in the situation of a stand-alone process.However, high-performance computing often entails multi-process computations which span multiple architectures, operating systems, networks, and filesystems.Such an inhomogeneous environment is ideally suited for pure Java processes, where the bytecode representation of the process gives a high-level of assuredness that any component of the multi-process computation can be moved to and executed on any of the resources.This is not the case when one begins mixing languages.
Fortunately, this is not to say that the aspects of portability and platform independence are entirely lost.Such has been the focus of the computational environment of IceT, where Java's bytecode representation is used to give large degrees of portability to mixedlanguage-based distributed computations.In this sense, the benefits to distributed computations are two-fold: Processes enjoy executional speeds of native-language codes and maintain levels of portability similar to pure-Java processes.
For example, Java portability is often recognized in the form of Java-based applets being downloaded over the network and run locally within one's web browser.However, a Java-based process which has been written to take advantage of the JCI binding of MPI for a specific architecture such as Solaris would have little chance in being downloaded to be executed on a Win-dows95 PC.This lacking aspect of portability is addressed in IceT.
A process' use of native codes is detectable by a judicious investigation of the process' bytecode representation.In IceT, process portability is realized through detecting a process' use of native codes and then supplying the necessary shared library format for the appropriate architecture and operating system.The utility of JCI for automating the wrapping of native libraries is attractive in the sense that a Java-binding of native libraries can be easily generated for multiple platforms.The result of the JCI Java-binding for MPI is a shared library, "libJavaMPI_MPI.so"which holds the MPI compilation for the specific architecture and operating system (e.g., Sparc/Solaris, SP2/AIX, etc.).The particular shared library is required in order for the process to successfully execute.In this example, we've used the JCI-generated MPI library to support a distributed message-passing process which is softinstalled onto a remote, pre-configured MPI cluster using IceT.
To accomplish this task, an application programmer writes the programs based upon the JavaMPI binding.Using IceT, JavaMPI calls, and various shell scripts which are located on the remote MPI cluster, the programmer is able to soft-install both the Java-based programming units and the supporting JCI-generated MPI library as shown in Fig. 6.This example illustrates the additional and often overlooked benefits to Java-wrapping native codes.The MPI program collective in this example enjoys effective communication between processes through the native MPI calls which enables high performance computing.Moreover, the Java-based aspects of the program collective permit a programmer to write and debug MPI applications locally as well as allowing softinstallation of the application onto a larger MPI cluster, or perhaps onto a different MPI configuration such as on an IBM SP2 or Intel Paragon.

Discussion and related work
Many research groups and vendors are pursuing research to improve Java's performance which would enable more scientific and engineering applications to be solved in Java.The need for access to legacy libraries is one of the burning problems in this area.Several approaches can be taken in order to make the libraries available from Java: • Hand-writing existing libraries in Java.Considering the size of the available libraries and the number of years that were invested in their development, rewriting the libraries would require an enormous amount of manual work [2].• Automatically translating Fortran-77/C library code into Java.We are aware of two research groups that have been working in this area -University of Tennessee [5] and Syracuse University [7].This approach offers a very important long-term perspective as it preserves Java portability, while achieving high performance in this case would obviously be more difficult.
• Manually or automatically creating a Java wrapper for an existing native Fortran-77/C library.Obviously, by binding legacy libraries, Java programs can gain in performance on all those hardware platforms where the libraries are efficiently implemented.
The automatic binding, which we are primarily interested in, has the obvious advantage of involving the least amount of work, thus reducing dramatically the time for development.Moreover, it guarantees the best performance results, at least in the short term, because the well-established scientific libraries usually have multiple implementations carefully tuned for maximum performance on different hardware platforms.Last but not least, by applying the software re-use tenet, each native legacy library can be linked to Java without any need for re-coding or translating its implementation.
After the initial period when the first Java versions were built for portability, the Java compiler technology has now entered a second phase where the new versions are also targeting higher performance.For example, JIT compilers have dramatically improved their efficiency, and are now challenging mature C++ compilers.The developers of HPCJ have adopted the 'native compiler' approach in order to gain faster execution times.A different strategy has been chosen by the authors of Toba [15].Toba translates Java bytecode into C source code, which is then compiled with the appropriate compiler optimisation flags for high performance.Another advantage of this approach is that it is as portable as any other C software.

Conclusions
This paper presents a general approach to combine Java and legacy code written in Fortran and/or C into multi-language programming environments where Java serves as a front-end wrapper for existing native libraries.The JCI tool for automatic creation of interfaces to such libraries (whether for scientific computation or message-passing) substantially improves the flexibility and applicability of such interfaces.In addition to the JCI-generated bindings, the basic components of our high-performance Java programming environments include performance-tuned implementations of scientific and communications libraries available on different machines, and a native Java compiler such as IBM's HPCJ.We also believe that our appoach is practical in a sense that legacy code is ubiquitous and it would be much too tedious to port all of it to Java.If Java is to gain acceptance as a high performance language it has to interface with such existing native libraries.
One of the primary goals of our approach has been to gain portability by using Java without sacrificing performance from highly optimized native code.The use of the JCI tool clearly extends Java's usefulness and provides rapid solution to the multi-language interfacing problem, but the JNI-wrapping techniques introduce certain limitations on application portability and mobility.Our solution to this problem is based on extensions to the functionality of the IceT virtual collaborative environment by providing the ability to spawn remote Java applications on machines running an IceT daemon and then soft-loading any missing software components such as native libraries and multi-language bindings to remote Java processes.In this case application programmers are given the best of both worlds -their codes enjoy enhanced portability and wider accessibility to resources similar to pure Java applications and are able at the same time to derive high performance levels exclusive to conventional languages such as C and Fortran.

Fig. 4 .
Fig. 4. Programming environment using the high-performance compiler for Java which generate native executable code.

Fig. 5 .
Fig. 5. Execution time for the NPB IS kernel (class A) on the IBM SP2.

Fig. 6 .
Fig. 6.An IceT user (left) merges with a six-node MPI cluster (right) to soft-install and execute an MPI program on the remote cluster.

Table 1
Performance results for the MATMUL benchmark on the IBM SP2

Table 2
Performance results for the NPB IS kernel (class A) on the IBM SP2