The number of computer programs for the analysis of genetic data is increasing significantly, but such software still needs to be improved greatly, both because results must be analyzed with appropriate methods and because the volume of genetic data is growing exponentially.
Genetic data are typically represented by a set of strings [
The large number of possible candidate solutions during the analysis of genetic data means that the employed algorithms must be selected carefully [
A characteristic feature of the computer programs applied to genetic data is the necessity to analyze large amounts of data using complex algorithms, which means that high performance is crucial. Different user and system requirements mean that the flexibility of software is also important. Finally, users prefer a graphical interface that is accessible from a web browser and applications that update automatically.
Scientists are becoming increasingly involved in software development [
In this study, I describe the
A three-layer software architecture was selected, in which the presentation layer, data processing layer, and data storage layer are kept separate. A multilayered model makes computer programs flexible and reusable, because each layer has its own responsibility. Thus, it is beneficial to segregate the software into layers that communicate via well-defined interfaces. Layers help to separate different subsystems, and the resulting code is cleaner, better structured, and easier to maintain.
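The layer separation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the framework: the names `SequenceStore`, `gc_content`, and `render_report` are hypothetical, and an in-memory dictionary stands in for the database.

```python
from typing import Dict, List


class SequenceStore:
    """Data storage layer (hypothetical): an in-memory stand-in for a database."""

    def __init__(self) -> None:
        self._sequences: Dict[str, str] = {}

    def save(self, name: str, sequence: str) -> None:
        self._sequences[name] = sequence.upper()

    def load(self, name: str) -> str:
        return self._sequences[name]

    def names(self) -> List[str]:
        return sorted(self._sequences)


def gc_content(sequence: str) -> float:
    """Data processing layer (hypothetical): fraction of G and C nucleotides."""
    if not sequence:
        return 0.0
    return sum(1 for base in sequence if base in "GC") / len(sequence)


def render_report(store: SequenceStore) -> str:
    """Presentation layer (hypothetical): formats the results as plain text."""
    lines = [
        f"{name}: GC = {gc_content(store.load(name)):.2f}"
        for name in store.names()
    ]
    return "\n".join(lines)


store = SequenceStore()
store.save("seq1", "ACGT")
store.save("seq2", "GGCC")
print(render_report(store))
```

Because the presentation function only calls `load` and `names`, the storage class could be replaced by a real database client without touching the other two layers.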
Four possible deployment models were considered for the three-layer architecture: the desktop, the database server, the thin client, and the web application, as shown in Figure
Three-layer application deployment models: desktop application (a), database server (b), thin client (c), and web application (d). This solution supports the creation of applications using a web application architecture.
An application architecture with a shared database and data processing modules deployed on the client machine (Figure
Deploying the calculation modules on a server machine allows the execution of these modules by clients on different platforms, which reduces the development costs. The computational power of the server is important because it determines the computational time, which means that poorly equipped client machines can be used. The optimum solutions are a thin client architecture, as shown in Figure
Deploying the calculation modules on a server machine, as shown in Figures
Modules produced for a typical application based on the proposed framework using various programming languages.
The algorithms are implemented in C++. The source code is compiled into machine code, which makes algorithm execution more efficient because the code is executed directly by the processor. The language offers higher-level abstractions that are missing in other languages compiled to machine code (C and Fortran). C++ supports object-oriented programming by providing virtual functions, multiple inheritance, and exceptions, and it facilitates functional and generic programming through templates and lambda functions. The standard C++ library is compact, but it is well tested and efficient. It includes support for input and output, strings and string operations such as regular expressions, and collections such as vectors, lists, sets, and associative arrays based on trees and/or hash tables. Concurrency support mechanisms are included in the C++11 standard (ISO/IEC 14882:2011), so the full capabilities of modern computers with multiple processors and/or multiple cores can be exploited. If an older C++ compiler that does not support C++11 is used, it may be necessary to employ the Boost [
The server application uses the Python language in the presented solution, mainly because development in Python is faster than in C++. Modules that do not constitute a bottleneck during calculations should be implemented in Python. Python is a scripting language with a small core and a simple, regular syntax. It is dynamically type-checked, uses a uniform data model, and manages memory automatically through reference counting, which largely avoids memory leaks. The Python repository of software (PIP)
The combined use of a compiler and an interpreter makes the developed software more flexible. Application customization requires an interpreter in any case, because changing the settings should not require rebuilding the software. Storing the user settings as Python code greatly simplifies the customization of applications, because the settings need not be plain lists of names and values and can use Python control instructions.
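A minimal sketch of settings stored as Python code follows. The setting names, the `APP_DEBUG` environment variable, and the `load_settings` helper are hypothetical and chosen only for illustration; the framework's actual settings files may look different. The settings text is embedded as a string here so that the example is self-contained, whereas in practice it would be a separate file edited by the person deploying the application.

```python
import os

# Hypothetical settings file content: plain Python, so control
# instructions (here, conditional expressions) are allowed.
SETTINGS_SOURCE = """
import os

DEBUG = os.environ.get("APP_DEBUG", "0") == "1"
WORKER_THREADS = 2 if DEBUG else 8
DATABASE_NAME = "genetic_test" if DEBUG else "genetic"
"""


def load_settings(source: str) -> dict:
    """Execute the settings code and return its upper-case names."""
    namespace: dict = {}
    exec(source, namespace)  # acceptable only for trusted, locally edited settings
    return {key: value for key, value in namespace.items() if key.isupper()}


os.environ["APP_DEBUG"] = "1"
settings = load_settings(SETTINGS_SOURCE)
print(settings["WORKER_THREADS"], settings["DATABASE_NAME"])
```

Changing `APP_DEBUG` and reloading the settings alters the configuration without rebuilding any compiled module.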
A client application request is sent to the standard port using the HTTP protocol and it is retransmitted by the web server using interprocess communication mechanisms (e.g., sockets and named pipes) to the server application. Three web servers were investigated: Apache
Flup is a simple WSGI server with a small library (256 kB), so its facilities are limited to calling a Python function when an HTTP request is received from a client and sending the function result back to the client application through the web server. More advanced libraries are Web2py (9 MB) and Django (22 MB), whose facilities include parameter conversion, authentication, authorization, and database support using object-relational mapping. Flup, Web2py, and Django were all tested in the present study; the characteristics of Web2py and Django are similar. However, Django is recommended because all of the available facilities are written explicitly and this library has the best documentation. In the current version of the software, Django uses Flup internally to cooperate with Lighttpd; this configuration works correctly under all popular modern operating systems (Linux, Windows, iOS, etc.).
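The "Python function call per HTTP request" mechanism is defined by the WSGI interface. The sketch below shows the shape of such a function and calls it directly, without Flup, Django, or a web server, using only the standard `wsgiref` module. The names `application` and `call_once` and the response text are hypothetical.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """Hypothetical WSGI callable: the server calls it once per HTTP request."""
    body = f"requested path: {environ['PATH_INFO']}".encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call_once(path):
    """Invoke the callable as a WSGI server would and collect the reply."""
    environ = {"PATH_INFO": path}
    setup_testing_defaults(environ)  # fills in the remaining required WSGI keys
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(application(environ, start_response))
    return captured["status"], body


print(call_once("/status"))
```

Any WSGI-compliant server can host the same `application` function, which is why the server library can be exchanged without rewriting the request handlers.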
The framework was designed to create software that serves multiple users at the same time. The users communicate independently with the server via the Internet, and the framework includes a component that implements the active object pattern [
Active object implementation delivered by the framework. The client requests are transformed automatically into commands, which are executed by separate threads.
The execution of calculation tasks is decoupled from task invocation to enhance concurrency and to simplify multithread usage, as shown in Figure
Cooperation among active object participants. The client request is converted into a command managed by the task manager on the Python side and by the scheduler in C++. The command is stored in the queue, and it is executed when an unoccupied thread is available. The client can request the current command status and the command progress.
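The cooperation described in the caption can be illustrated with a simplified, pure-Python sketch. It is not the framework's implementation, in which the scheduler is written in C++; the class names `Command` and `Scheduler` and their attributes are hypothetical. The sketch shows the essential idea: submitting a task returns immediately, the command waits in a queue until a worker thread is free, and the caller can poll the status and progress.

```python
import queue
import threading


class Command:
    """Hypothetical command holding a task, its status, and its progress."""

    def __init__(self, task, steps):
        self.task = task
        self.steps = steps
        self.status = "queued"
        self.progress = 0.0
        self.result = None
        self.done = threading.Event()

    def run(self):
        self.status = "running"
        try:
            results = []
            for step in range(self.steps):
                results.append(self.task(step))
                self.progress = (step + 1) / self.steps
            self.result = results
            self.status = "done"
        except Exception as error:  # keep the worker thread alive on failure
            self.result = error
            self.status = "failed"
        finally:
            self.done.set()


class Scheduler:
    """Hypothetical scheduler: queues commands and runs them on worker threads."""

    def __init__(self, workers=2):
        self._queue = queue.Queue()
        for _ in range(workers):
            threading.Thread(target=self._work, daemon=True).start()

    def submit(self, task, steps):
        command = Command(task, steps)
        self._queue.put(command)
        return command  # returned immediately; execution happens later

    def _work(self):
        while True:
            command = self._queue.get()
            command.run()
            self._queue.task_done()


scheduler = Scheduler(workers=2)
command = scheduler.submit(lambda step: step * step, steps=4)
command.done.wait(timeout=5)
print(command.status, command.progress, command.result)
```

In a web application, the request handler would return an identifier for the command right after `submit`, and later requests would read `status` and `progress` to update the user interface.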
Software testing is an integral part of the development process. Thus, the testing techniques and libraries that support this process are specified in the presented framework. Three types of tests are considered: unit tests, integration tests, and system tests. Unit tests check individual functions, procedures, and classes in isolation. Integration tests examine the communication between modules, taking into account that the modules are written in different programming languages. System tests examine the functions of a computer program as a whole, without knowledge of the internal structure of the software.
Unit testing uses Boost.Test [
System testing uses the Python language and splinter
The test quality measure is the source code coverage during unit, integration, and system testing. This measure provides numerical data related to the performance of test procedures, which helps to identify inadequately tested parts of the software. The analytic tools used to evaluate coverage in
This section describes the programming tools used to create applications in
The C++ modules require a C++ compiler and it is recommended to use at least two different compilers, particularly the g++ compiler from the GNU Compiler Collection
To speed up the creation of new software, the developer can use a specialized framework. The most popular, freely available frameworks are Bioconductor [
This framework was used to create several applications to analyze genetic data:
This genetic data analysis software development project was performed in academia and it supports students who have a limited amount of time available and who also lack experience in design and programming. I found that agile methodologies [
The presented framework is still being developed; the Gunicorn [
operating system(s): OS Portable;
license: GNU Library or Lesser General Public License version 3.0 (LGPLv3);
getting started: to build a “Hello World” application please download the latest version, extract the files from the archive, install additional software as described in
The author declares that there is no conflict of interests.
This work was supported by the statutory research of the Institute of Electronic Systems of Warsaw University of Technology. The author would like to thank the editor and the anonymous reviewer for their constructive comments. The author is grateful to the students of the Faculty of Electronics and Information Technology of Warsaw University of Technology, who acted as the early users of this software, and to Hanna Markiewicz for proofreading.