The selection of the computer language to adopt is usually driven by intuition and expertise, since it is very difficult to compare languages taking into account all of their characteristics. In this paper, we analyze the efficiency of programming languages through Data Envelopment Analysis. We collected the input data from
Which programming language is the best one? In general, there can be no single answer, of course. But if we add a precise criterion, asking, for example, for a program with low memory consumption, one would probably not choose Java. In this paper, we propose a way of making this answer more precise, suggesting which programming languages should
Several languages are available to researchers and practitioners, with different syntax and semantics. In common practice, an implementer chooses a programming language according to intuition and prior knowledge, since comparing the performance of the various languages is not straightforward. Only a few studies deal with such an evaluation [
The first problem can be partly overcome by considering the data provided by
The benchmark problems do not require the investigation of a solution space for finding an optimum. As an example, one of these problems requires reading integers from a file and printing the sum of those integers. Hence, all programs reach the same result, and they are tested for computational time, memory usage, and source code size.
Only the best programs are kept for each pair of language and benchmark. If two implementations using the same language exist for a problem, they are evaluated in terms of the three just-mentioned criteria on multiple workloads. If one of them performs worse in terms of at least one criterion and not better in the others, then it is dominated and thus deleted from the program list. This way, the surviving programs are those that perform best, or at least are not clearly worse than any other.
The selection performed on the CLBG site, nonetheless, is quite conservative and is of little help in discriminating among languages. In fact, we would like to compare languages considering run time, memory usage, and source code size as a whole, but these characteristics can neither be directly compared, nor
A possible way to obtain a single ranking, undistorted by a subjective bias, is through the application of the Data Envelopment Analysis (DEA) [
DEA makes it possible to take into account different inputs and outputs (attributes) of the economic units, producing a single ranking. Its ability to cope with attributes of a different nature turns out to be very useful in the present research as well. In our case, we consider computational time, memory usage, and source code size as inputs to the DMU/language, while we consider the number of programs contributed for solving each workload as outputs.
The philosophy underpinning DEA is to assess each unit using individually optimized attribute weights, namely the weights that lead to the best possible efficiency value for that specific unit. In our context, this means, for example, that the least memory-consuming language will have the maximal weight associated with memory consumption; this way, no other language can dominate it. Languages that do not reach the top of the ranking even with individually optimized weights are definitively classified as inefficient [
The rest of the paper is organized as follows. Section
The concept of efficiency is used to measure the quality of a transformation of resources into goods and services. Several definitions of efficiency are commonly adopted in practice. The so-called Pareto efficiency, for example, is reached when no alternative allocation of resources would bring a gain. In terms of inputs and outputs, this means that no more output can be obtained without increasing some input.
Data Envelopment Analysis is a relatively young but well-established optimization-based technique. As mentioned before, it makes it possible to measure and compare the performance of economic entities, each of them constituting a so-called Decision Making Unit.
DMUs are considered as black boxes converting multiple inputs into multiple outputs. Typical examples of DMUs appearing in DEA applications are universities, hospitals, and business firms. They are compared to each other for ranking their performances.
As already mentioned, we consider each language appearing in the CLBG site as an autonomous unit, that is, we assume it to be a DMU.
With DEA, the performance evaluation of a DMU does not require any
DEA’s efficiency scores are normalized and bounded between 0 and 1. Languages that obtain an efficiency score lower than one, that is, that are not at the top of the ranking, are classified as inefficient [
Several DEA models have been developed in the last three decades (see, e.g., [
The original DEA models (see, e.g., the CCR model [
We opt for a nonoriented version, which allows the consideration of both input reductions and output expansions, instead of limiting the analysis to one of these aspects.
Furthermore, since computer language inputs vary greatly in size, we consider the variable return to scale (VRS) version of this model.
Formally, for each DMU, the efficiency score is obtained as the solution of a mathematical programming problem. Consider a set of $n$ DMUs, where DMU$_j$ ($j = 1, \dots, n$) consumes inputs $x_{ij}$ ($i = 1, \dots, m$) to produce outputs $y_{rj}$ ($r = 1, \dots, s$).
The SBM model, in its nonoriented version with variable returns to scale, defines the efficiency $\rho^*$ of a chosen DMU (DMU$_0$) among the $n$ units as
$$\rho^* = \min_{\lambda,\, s^-,\, s^+} \; \frac{1 - \frac{1}{m}\sum_{i=1}^{m} s_i^- / x_{i0}}{1 + \frac{1}{s}\sum_{r=1}^{s} s_r^+ / y_{r0}}$$
subject to
$$x_{i0} = \sum_{j=1}^{n} \lambda_j x_{ij} + s_i^-, \quad i = 1, \dots, m,$$
$$y_{r0} = \sum_{j=1}^{n} \lambda_j y_{rj} - s_r^+, \quad r = 1, \dots, s,$$
$$\sum_{j=1}^{n} \lambda_j = 1, \qquad \lambda_j,\, s_i^-,\, s_r^+ \ge 0,$$
where the intensity vector $\lambda$ identifies the reference point on the efficient frontier, $s^-$ and $s^+$ are the input and output slacks, and the convexity constraint $\sum_j \lambda_j = 1$ imposes variable returns to scale.
If the optimal value of the SBM model is not equal to 1, then DMU0 is inefficient: even with the most favorable weights, DMU0 turns out to be dominated by at least one other DMU. DMUs with efficiency 1 are called SBM-efficient (SBM-efficiency coincides with the so-called Pareto-Koopmans efficiency concept; see, e.g., [
The above model has to be applied to each DMU belonging to the technology set. Note that computer languages are not completely homogeneous DMUs: as we will discuss in Section
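As an illustration of how such an SBM score can be computed in practice, the fractional program can be linearized with the standard Charnes-Cooper change of variables (all variables scaled by a factor t > 0) and solved as an ordinary linear program. The sketch below is not part of the paper; it assumes NumPy and SciPy's `linprog` solver and uses the notation of the SBM model (inputs X, outputs Y, slacks, intensity weights).

```python
import numpy as np
from scipy.optimize import linprog

def sbm_vrs(X, Y, j0):
    """Nonoriented SBM efficiency of DMU j0 under variable returns to scale.

    X is an (m, n) input matrix and Y an (s, n) output matrix; columns
    are DMUs.  The fractional SBM program is linearized with the
    Charnes-Cooper transformation, giving an ordinary LP.
    """
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    m, n = X.shape
    s = Y.shape[0]
    x0, y0 = X[:, j0], Y[:, j0]
    nv = 1 + n + m + s               # variables: [t, lambda, s_minus, s_plus]
    c = np.zeros(nv)
    c[0] = 1.0                       # objective: t - (1/m) sum s_i^- / x_i0
    c[1 + n:1 + n + m] = -1.0 / (m * x0)
    A, b = [], []
    # normalization: t + (1/s) sum s_r^+ / y_r0 = 1
    row = np.zeros(nv); row[0] = 1.0
    row[1 + n + m:] = 1.0 / (s * y0)
    A.append(row); b.append(1.0)
    # inputs: t x_i0 = sum_j lambda_j x_ij + s_i^-
    for i in range(m):
        row = np.zeros(nv); row[0] = x0[i]
        row[1:1 + n] = -X[i]; row[1 + n + i] = -1.0
        A.append(row); b.append(0.0)
    # outputs: t y_r0 = sum_j lambda_j y_rj - s_r^+
    for r in range(s):
        row = np.zeros(nv); row[0] = y0[r]
        row[1:1 + n] = -Y[r]; row[1 + n + m + r] = 1.0
        A.append(row); b.append(0.0)
    # VRS convexity: sum_j lambda_j = t
    row = np.zeros(nv); row[0] = -1.0; row[1:1 + n] = 1.0
    A.append(row); b.append(0.0)
    res = linprog(c, A_eq=np.array(A), b_eq=np.array(b),
                  bounds=[(1e-9, None)] + [(0.0, None)] * (n + m + s),
                  method="highs")
    return res.fun   # optimal rho of DMU j0
```

For instance, with one input and one output, a DMU that uses twice the input of another DMU to produce the same output receives a score of 0.5, while undominated DMUs score 1.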
The code reported is provided by volunteer programmers, who aim at achieving the best result in the competition for each benchmark. All the experiments are run on a single-processor 2 GHz Intel Pentium 4 machine with 512 MB of RAM and an 80 GB IDE disk drive. The operating system is Ubuntu 9.04, Linux kernel 2.6.28-11-generic. The data considered in the current analysis were collected on March 9, 2009.
Several programs may be uploaded to the website for a single benchmark. All the programs written in the same language for the same problem are pairwise compared with respect to several criteria. In each comparison, if one of them is not better than the other according to the whole set of available measurements, and it is worse under at least one measurement, then it is eliminated from the competition. In this way, the surviving programs can be considered the ones achieving the best possible results allowed by a language on a benchmark problem.
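The pairwise elimination just described is a standard Pareto-dominance filter. A minimal sketch (not taken from the CLBG code; measurement tuples and names are hypothetical, with lower values being better on every criterion):

```python
def dominates(a, b):
    """True if tuple a is at least as good as b on every criterion
    (lower is better) and strictly better on at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_filter(programs):
    """Keep only the non-dominated programs.

    programs: dict name -> (time, memory, size) measurement tuple."""
    return {
        name: meas for name, meas in programs.items()
        if not any(dominates(other, meas)
                   for oname, other in programs.items() if oname != name)
    }
```

A program that is slower, larger, and more memory-hungry than some alternative in the same language is dropped, while two programs that trade one criterion for another are both kept.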
Each program is run and measured for every benchmark and each workload. If the program produces the expected output within a cutoff time of 120 seconds, five independent runs are performed. If the program does not return the expected output within a timeout of one hour, it is forced to quit. The time measurements include the program startup time. Memory use is sampled every 0.2 seconds. The recorded values are the lowest time and the highest memory usage across the repeated runs, whether or not the program was forcefully terminated.
The data reported have been the object of some informal assessment (see, e.g., [
Based on the available data, the aim of our analysis is to compare different languages when performing at their best. CPU time, memory usage, and source code size are generally considered the crucial characteristics of the efficiency of computer programs. The latter can be considered proportional to the implementation effort [
For each benchmark, three workloads are tackled. Hence, three measurements are available for each program, both for CPU time and for memory usage, which, together with source code size, constitute the seven inputs used. If multiple programs written in the same language tackle a workload, the input is the sum of the measurements of all of them. For example, if three programs tackle a workload in
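The aggregation step just described, summing the measurements of all programs of a language on each workload, can be sketched as follows. The record layout (keys `language`, `program`, `workload`, `cpu`, `mem`, `size`) is hypothetical; the real CLBG data dump is shaped differently.

```python
from collections import defaultdict

def build_inputs(measurements):
    """Build the 7-element DEA input vector of each language for one
    benchmark: CPU time and memory on workloads 1-3, plus source size.
    """
    inputs = defaultdict(lambda: [0.0] * 7)   # [cpu1-3, mem1-3, size]
    counted = defaultdict(set)                # programs already sized
    for rec in measurements:
        vec = inputs[rec["language"]]
        w = rec["workload"] - 1
        vec[w] += rec["cpu"]       # CPU times of all programs are summed
        vec[3 + w] += rec["mem"]   # memory likewise
        if rec["program"] not in counted[rec["language"]]:
            counted[rec["language"]].add(rec["program"])
            vec[6] += rec["size"]  # source size counted once per program
    return dict(inputs)
```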
In the analysis, we ignore benchmark problems on which fewer than 30 languages are tested, as well as languages coded for fewer than 14 benchmark problems.
A set of thirty-five programming languages is compared. They are all well known in the scientific community; their names and the main characteristics of the implementations considered here are reported in Table
Programming languages considered.
Language | Characteristics
---|---
Ada | Multiparadigm, static, compiled
C | Imperative, static, compiled
C Tiny | Imperative, static, interpreted
C# | Multiparadigm, static, compiled
C++ | Multiparadigm, static, compiled
Clean | Declarative, static, compiled
D | Multiparadigm, dynamic, compiled
Eiffel | Multiparadigm, static, compiled
Erlang | Declarative, dynamic, compiled
Forth | Imperative, dynamic, compiled
Fortran | Imperative, static, compiled
Haskell | Declarative, static, compiled
Java | Imperative, static, compiled
JavaScript | Multiparadigm, dynamic, interpreted
Lisp | Multiparadigm, dynamic, compiled
Lua | Multiparadigm, dynamic, interpreted
Nice | Imperative, static, compiled
Oberon | Imperative, static, compiled
Objective-C | Imperative, dynamic, compiled
Ocaml | Multiparadigm, static, compiled
Oz | Multiparadigm, dynamic, compiled
Pascal | Imperative, static, compiled
Perl | Multiparadigm, dynamic, interpreted
PHP | Imperative, dynamic, interpreted
Pike | Multiparadigm, dynamic, interpreted
PIR | Imperative, dynamic, interpreted
Prolog | Declarative, dynamic, compiled
Python | Multiparadigm, dynamic, compiled
Ruby | Multiparadigm, dynamic, interpreted
Scala | Multiparadigm, static, compiled
Scheme | Declarative, dynamic, compiled
S-Lang | Imperative, dynamic, interpreted
Smalltalk | Imperative, dynamic, compiled
SML | Multiparadigm, static, compiled
Tcl | Multiparadigm, dynamic, interpreted
We consider 14 benchmark problems. The CLBG site reports a detailed description of each of them. Here, we report only their main characteristics.
Each program defines, allocates, and deallocates binary trees. A long-lived binary tree is to be allocated; it lives on while other trees are allocated and deallocated. Then, many further trees are allocated, their nodes are walked through, and they are finally deallocated.
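The core of this benchmark, allocating complete binary trees and walking their nodes, can be sketched as follows (a minimal illustration, not one of the submitted CLBG programs, which also manage the long-lived tree and depth sweeps):

```python
class Node:
    __slots__ = ("left", "right")

    def __init__(self, left=None, right=None):
        self.left, self.right = left, right

def make_tree(depth):
    """Allocate a complete binary tree of the given depth."""
    if depth == 0:
        return Node()
    return Node(make_tree(depth - 1), make_tree(depth - 1))

def check(node):
    """Walk the tree, counting its nodes."""
    if node.left is None:
        return 1
    return 1 + check(node.left) + check(node.right)
```

A tree of depth d has 2^(d+1) - 1 nodes, so the walk doubles in cost with each extra level.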
A permutation of
The expected cumulative probabilities for two alphabets must be encoded. Then, DNA sequences are generated by weighted random selection from the alphabets, using a given linear congruential generator. Finally, three DNA sequences are written line-by-line in Fasta format, which is a text-based format for representing either nucleotide sequences or peptide sequences: base pairs or amino acids are represented using single-letter codes.
The Mandelbrot set (from
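The benchmark renders the set into a bitmap; its core is the familiar per-point escape-time test, sketched below (the iteration limit of 50 is the usual small bound; the rendering loop is omitted):

```python
def escape_iterations(c, limit=50):
    """Iterate z -> z^2 + c from z = 0; return the number of iterations
    before |z| exceeds 2, or `limit` if the point appears to belong
    to the Mandelbrot set."""
    z = 0j
    for i in range(limit):
        if abs(z) > 2.0:
            return i
        z = z * z + c
    return limit
```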
The orbits of Jovian planets have to be modeled by using a simple symplectic integrator.
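A simple symplectic scheme of the kind the benchmark asks for is semi-implicit Euler: velocities are "kicked" by gravity first, then positions "drift" with the updated velocities. The sketch below uses toy units (G = 1) rather than the benchmark's solar masses, astronomical units, and days, and is not one of the submitted programs.

```python
import math

G = 1.0  # toy gravitational constant

def advance(bodies, dt, steps):
    """Semi-implicit (symplectic) Euler integration in the plane.

    Each body is a dict with mass m, position x, y and velocity vx, vy."""
    for _ in range(steps):
        # kick: update velocities from pairwise gravitational pulls
        for i, a in enumerate(bodies):
            for b in bodies[i + 1:]:
                dx, dy = a["x"] - b["x"], a["y"] - b["y"]
                f = G * dt * (dx * dx + dy * dy) ** -1.5
                a["vx"] -= f * b["m"] * dx; a["vy"] -= f * b["m"] * dy
                b["vx"] += f * a["m"] * dx; b["vy"] += f * a["m"] * dy
        # drift: move positions with the new velocities
        for p in bodies:
            p["x"] += dt * p["vx"]; p["y"] += dt * p["vy"]

def energy(bodies):
    """Total mechanical energy, used to check the integrator's drift."""
    e = 0.0
    for i, a in enumerate(bodies):
        e += 0.5 * a["m"] * (a["vx"] ** 2 + a["vy"] ** 2)
        for b in bodies[i + 1:]:
            e -= G * a["m"] * b["m"] / math.hypot(a["x"] - b["x"],
                                                  a["y"] - b["y"])
    return e
```

The symplectic property shows up as a bounded energy error over long runs, which is exactly what the benchmark's reference output checks.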
The prime numbers up to a fixed value are to be counted by using the naïve Sieve of Eratosthenes algorithm [
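The naive sieve can be sketched in a few lines (an illustrative version, not a submitted program; the benchmark repeats the count for several values of the bound):

```python
def count_primes(n):
    """Count the primes <= n with the naive Sieve of Eratosthenes."""
    if n < 2:
        return 0
    flags = [True] * (n + 1)
    flags[0] = flags[1] = False
    for i in range(2, int(n ** 0.5) + 1):
        if flags[i]:
            # cross out every multiple of the prime i, starting at i*i
            for j in range(i * i, n + 1, i):
                flags[j] = False
    return sum(flags)
```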
The Nsieve-bits problem is similar to the previous case, but it is based on arrays of bit flags.
An iterative double-precision algorithm must be used for calculating partial sums of eight given series of real numbers.
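As an illustration of the iterative scheme, the sketch below accumulates two of the kinds of series used by the benchmark (an illustrative subset; the actual benchmark sums eight given series):

```python
def partial_sums(n):
    """Double-precision partial sums of two sample series:
    sum_{k=1..n} (2/3)^(k-1)  (converges to 3) and
    sum_{k=1..n} 1/k^2        (converges to pi^2/6)."""
    s1 = s2 = 0.0
    for k in range(1, n + 1):
        s1 += (2.0 / 3.0) ** (k - 1)
        s2 += 1.0 / (k * k)
    return s1, s2
```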
A step-by-step spigot algorithm [
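A compact way to illustrate the spigot idea is Gibbons' streaming variant of the Rabinowitz-Wagon algorithm, which yields the decimal digits of pi one at a time using exact integer state (an illustration of the technique, not necessarily the variant the submitted programs use):

```python
def pi_digits():
    """Unbounded spigot: generate the decimal digits of pi one by one.

    State (q, r, t) tracks a linear fractional transformation; n is the
    next candidate digit and (k, l) drive the series terms."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            # the candidate digit is confirmed: emit it and rescale
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            # absorb one more term of the series into the state
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)
```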
Three simple numeric functions must be computed, namely, Ackermann [
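Ackermann's function, the first of the three, is a classic stress test of deep recursion; a direct transcription of its definition (values explode quickly, so only tiny arguments are feasible):

```python
import sys
sys.setrecursionlimit(100000)  # Ackermann recurses deeply even for tiny inputs

def ack(m, n):
    """Ackermann's function A(m, n)."""
    if m == 0:
        return n + 1
    if n == 0:
        return ack(m - 1, 1)
    return ack(m - 1, ack(m, n - 1))
```

For instance, A(2, n) = 2n + 3 and A(3, n) = 2^(n+3) - 3.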
The following procedure must be executed: a Fasta format file is to be read line-by-line. Then, for each sequence, the id, the description, and the reverse-complement sequence must be written in Fasta format.
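The reverse-complement step itself is a character-by-character mapping followed by a reversal; a minimal sketch using the IUPAC nucleotide codes (the file parsing and line wrapping performed by the actual benchmark are omitted):

```python
# IUPAC nucleotide codes mapped to their complements
COMPLEMENT = str.maketrans("ACGTUMRWSYKVHDBN", "TGCAAKYWSRMBDHVN")

def reverse_complement(seq):
    """Return the reverse-complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]
```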
The spectral norm of an infinite matrix
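The benchmark approximates the norm by power iteration on B = A^T A over a finite leading block of the matrix, whose entries are given by a closed formula in the benchmark specification. A plain-Python sketch following that scheme:

```python
import math

def a(i, j):
    """Entry (i, j), 0-indexed, of the benchmark's implicit matrix."""
    return 1.0 / ((i + j) * (i + j + 1) // 2 + i + 1)

def mult_av(v):
    n = len(v)
    return [sum(a(i, j) * v[j] for j in range(n)) for i in range(n)]

def mult_atv(v):
    n = len(v)
    return [sum(a(j, i) * v[j] for j in range(n)) for i in range(n)]

def spectral_norm(n):
    """Power iteration on B = A^T A over the leading n x n block;
    the square root of the Rayleigh quotient approximates ||A||_2."""
    u = [1.0] * n
    v = u
    for _ in range(10):
        v = mult_atv(mult_av(u))
        u = mult_atv(mult_av(v))
    vbv = sum(x * y for x, y in zip(u, v))
    vv = sum(y * y for y in v)
    return math.sqrt(vbv / vv)
```

For n = 100 this reproduces the benchmark's published reference output of 1.274219991.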
Each program should print “hello world” and exit. This benchmark measures startup costs. Each program is run several times in a loop by a shell script wrapper.
This problem consists of reading integers, one line at a time, and printing the sum of those integers.
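This benchmark reduces to a one-liner in most languages; a Python sketch that works on any text stream (the benchmark reads from standard input):

```python
import io

def sum_file(stream):
    """Read one integer per line from a text stream and return the sum."""
    return sum(int(line) for line in stream)
```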
Figures
Distribution of the values of CPU time recorded for each language.
Distribution of the values of memory usage recorded for each language.
Distribution of source code size of each language.
We can observe that some languages, such as C and C++, perform rather well in terms of CPU time but quite poorly in terms of source code size. The opposite holds for languages such as Tcl and Python (see Figures
The efficiency of each language is assessed through DEA. As mentioned, inputs are computational time, memory usage, and source code size tested on different workloads, and outputs are the number of times each workload is tackled. The first element to observe is the performance of the various programming languages individually.
Figure
Efficiency of each language on the benchmark problems.
Table
Summary of the results of the DEA analysis: mean and lowest efficiency scores over the available set of benchmark problems.
 | Binary trees | Fannkuch | Fasta | Mandelbrot | N-body | Nsieve | Nsieve-bits
---|---|---|---|---|---|---|---
Mean eff. | 0.68 | 0.74 | 0.72 | 0.74 | 0.58 | 0.65 | 0.67
Lowest eff. | 0.18 | 0.18 | 0.29 | 0.34 | 0.14 | 0.11 | 0.15
No. of eff. | 15 | 15 | 11 | 14 | 11 | 10 | 14

 | Partial-sums | Pidigits | Recursive | Reverse-complement | Spectral-norm | Startup | Sum-file
---|---|---|---|---|---|---|---
Mean eff. | 0.74 | 0.69 | 0.62 | 0.65 | 0.63 | 0.591 | 0.68
Lowest eff. | 0.22 | 0.18 | 0.17 | 0.19 | 0.12 | 0.08 | 0.18
No. of eff. | 17 | 13 | 12 | 12 | 11 | 13 | 12
Detailed results of the DEA analysis.
Language | Efficient | Binary trees | Fannkuch | Fasta | Mandelbrot | N-body | Nsieve | Nsieve-bits | Partial-sums | Pidigits | Recursive | Reverse-complement | Spectral-norm | Startup | Sum-file
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Ada | 4 | 0.652 | 0.563 | 0.727 | 0.624 | 0.656 | 0.557 | 0.365 | 0.342 | ||||||
C | 13 | 0.8 | |||||||||||||
C Tiny | 5 | 0.752 | 0.951 | 0.959 | 0.869 | 0.920 | |||||||||
C# | 1 | 0.408 | 0.472 | 0.636 | 0.572 | 0.409 | 0.709 | 0.519 | 0.568 | 0.312 | 0.423 | 0.505 | 0.124 | 0.834 |
C++ | 12 | 0.393 | 0.811 |
Clean | 8 | 0.746 | 0.761 | 0.719 | 0.899 | 0.563 | |||||||||
D | 14 | ||||||||||||||
Eiffel | 5 | 0.606 | 0.953 | 0.526 | 0.497 | 0.813 | 0.556 | 0.580 | 0.584 | ||||||
Erlang | 2 | 0.315 | 0.525 | 0.453 | 0.251 | 0.497 | 0.403 | 0.501 | 0.188 | 0.084 | 0.184 | ||||
Forth | 4 | 0.579 | 0.349 | 0.615 | 0.500 | 0.421 | 0.300 | 0.196 | 0.413 | 0.305 | |||||
Fortran | 4 | 0.321 | 0.586 | 0.633 | 0.954 | 0.368 | 0.484 | 0.426 | 0.584 | ||||||
Haskell | 5 | 0.564 | 0.534 | 0.782 | 0.407 | 0.190 | 0.343 | ||||||||
Java | 13 | 0.506 | |||||||||||||
JavaScript | 5 | 0.206 | 0.576 | 0.139 | 0.271 | 0.317 | 0.261 | ||||||||
Lisp | 1 | 0.736 | 0.395 | 0.438 | 0.537 | 0.544 | 0.408 | 0.563 | 0.236 | 0.272 | 0.403 | 0.462 | 0.270 | 0.307 | |
Lua | 11 | 0.182 | 0.470 | ||||||||||||
Nice | 2 | 0.390 | 0.382 | 0.724 | 0.321 | 0.229 | 0.287 | 0.621 | 0.088 | 0.500 | |||||
Oberon | 6 | 0.398 | 0.673 | 0.953 | 0.901 | 0.585 | |||||||||
Objective-C | 1 | 0.502 | 0.607 | 0.907 | 0.572 | 0.339 | 0.771 | 0.309 | 0.524 | 0.475 | |||||
Ocaml | 9 | 0.755 | 0.724 | 0.566 | 0.656 | 0.726 | |||||||||
Oz | 0 | 0.520 | 0.427 | 0.349 | 0.386 | 0.165 | 0.222 | 0.335 | 0.328 | 0.541 | 0.197 | 0.189 | 0.272 | 0.174 | 0.293 |
Pascal | 9 | 0.823 | 0.722 | 0.450 | 0.742 | 0.495 | |||||||||
Perl | 8 | 0.390 | 0.661 | 0.285 | 0.220 | 0.349 | 0.598 | ||||||||
PHP | 4 | 0.245 | 0.387 | 0.329 | 0.379 | 0.163 | 0.333 | 0.269 | 0.442 | 0.210 | 0.438 | ||||
Pike | 1 | 0.392 | 0.658 | 0.539 | 0.637 | 0.312 | 0.348 | 0.633 | 0.133 | 0.508 | |||||
PIR | 1 | 0.176 | 0.176 | 0.294 | 0.406 | 0.252 | 0.393 | 0.408 | 0.238 | 0.449 | 0.151 | 0.202 | 0.433 | ||
Prolog | 0 | 0.260 | 0.390 | 0.459 | 0.175 | 0.105 | 0.235 | 0.223 | 0.432 | 0.166 | 0.122 | 0.266 | 0.241 | ||
Python | 12 | 0.562 | 0.370 | ||||||||||||
Ruby | 8 | 0.253 | 0.201 | 0.515 | 0.317 | 0.824 | |||||||||
Scala | 3 | 0.428 | 0.379 | 0.227 | 0.469 | 0.521 | 0.287 | 0.258 | 0.291 | 0.399 | 0.077 | 0.375 | |||
Scheme | 3 | 0.395 | 0.583 | 0.432 | 0.403 | 0.307 | 0.228 | 0.393 | 0.192 | 0.348 | |||||
S-Lang | 0 | 0.281 | 0.701 | 0.437 | 0.545 | 0.356 | 0.248 | 0.149 | 0.656 | 0.447 | 0.649 | 0.763 | |||
Smalltalk | 1 | 0.347 | 0.336 | 0.205 | 0.227 | 0.246 | 0.182 | 0.213 | 0.153 | 0.154 | 0.180 | ||||
SML | 3 | 0.441 | 0.746 | 0.738 | 0.464 | 0.548 | 0.803 | 0.375 | 0.469 | 0.455 | 0.304 | 0.485 | |||
Tcl | 3 | 0.288 | 0.414 | 0.472 | 0.544 | 0.234 | 0.304 | 0.284 | 0.173 | 0.412 | 0.248 |
As can be observed, the best-performing language turns out to be C: its programs are efficient on almost all of the benchmarks considered. A similar evaluation holds for D, Clean, C++, Pascal, and Python. The efficiency of these languages is very robust: the standard deviation of the corresponding distributions is very low, and the median value is one in all these cases.
Figure
Efficiency of groups of languages: imperative versus declarative versus multiparadigm, static versus dynamic, compiled versus interpreted.
The discussion on the merits of static versus dynamic languages, in particular, has been widespread on various blogs dedicated to computer languages (e.g., [
In this paper, we used Data Envelopment Analysis to assess the efficiency of the main programming languages, exploiting its ability to compare elements on the basis of several criteria of a different nature. In particular, we considered the computational time, the memory usage, and the source code size required for solving various benchmark problems under multiple workloads. The programs analyzed are publicly available in the repository of the
According to the results obtained, the most efficient languages are C, D, Clean, C++, Pascal, and Python, while Smalltalk, Prolog, PIR, and Oz are definitely inefficient. Such a conclusion supports the expectation driven by common experience. By grouping languages, it is possible to observe the impact of their main characteristics: being imperative, declarative, or multiparadigm; static or dynamic; compiled or interpreted. These features appear to influence efficiency to different degrees.
Our research aims to support, as far as possible with scientific evidence, the choice of the programming language to be used for implementing efficient algorithms. The relevance of such a choice is often neglected, but the proposed analysis shows that it is far from unimportant. In particular, it appears crucial in the current practice of optimization, where great effort is devoted to boosting the performance of solution algorithms.
The proposed methodology can easily be applied to different datasets. Undoubtedly, other properties of languages, such as reliability or syntactic complexity, might be considered to achieve a finer classification.
All the available benchmark problems are considered equally important in this phase. A further refinement of the results may be obtained by weighting them according to their similarity to the specific problem one needs to tackle. A possible tool for assigning such weights is the Analytic Hierarchy Process (AHP) [