A New Implementation of Genome Rearrangement Problem

Unsigned reverse genome rearrangement is an important part of bioinformatics research. It is widely used in biological similarity and homology analysis and helps reveal biological inheritance, variation, and evolution. Branch and bound, simulated annealing, and other algorithms for unsigned reverse genome rearrangement are rarely used in practice because of their huge time and space consumption, and greedy algorithms are mostly used at present. By deeply analyzing the domain of unsigned reverse genome rearrangement algorithms (URGRA) based on the greedy strategy, we model the domain features and interactively design the URGRA algorithm components according to the generative programming method. With the support of the PAR platform, the URGRA algorithm component library is formally implemented, and concrete algorithms are generated by assembly, which improves the reliability of the assembled algorithms.


Introduction
With the development of biotechnology, biological data are growing explosively. At the same time, improvements in computing power and the development of the Internet make it possible to store and process large-scale data. How to use computer technology to extract useful information from these data has become an urgent question. Therefore, bioinformatics has emerged as the times require. Bioinformatics covers the comprehensive application of biology, computer science, and mathematics.
Through the collection, processing, storage, dissemination, analysis, and interpretation of biological information, the biological significance of large amounts of data is clarified and understood.
In organisms, a chromosome is composed of a gene sequence, and the genome is a collection of chromosomes [1]. To determine the similarity or homology between two organisms and to study biological heredity, variation, and evolution, we often align and compare the DNA sequences of the two organisms according to certain rules. Then, through a series of basic character editing operations (insert, delete, and replace), one sequence is transformed into the other. The minimum number of editing operations required to complete this conversion is the edit distance of the two sequences. At the level of a single gene, genetic sequences evolve through such character edits, so edit distance is a useful measure of evolutionary distance. At the chromosomal level, however, genetic sequences evolve mainly through global genome rearrangements. There are five basic rearrangements, namely, translocation, transposition, duplication, deletion, and inversion (also called reversal). Translocation refers to the partial exchange of two nonhomologous chromosomes in a genome. Transposition refers to the exchange of two contiguous gene subsequences in a chromosome. Duplication refers to the copying of a contiguous gene subsequence on a chromosome. Deletion refers to the removal of a contiguous gene subsequence from a chromosome, and inversion refers to reversing the order of a contiguous gene subsequence on a chromosome. Inversion is the most common of these five forms, especially in organisms with only one chromosome. For example, the only difference between the gene sequences of the two most famous bacteria, Escherichia coli and Salmonella typhimurium, is the inversion of a subsequence of the chromosome sequence [2]. In the fruit fly genus Drosophila, inversion accounts for differences between and within species more frequently than translocation or other processes [3].
The importance of inversion in these examples shows that research on algorithms for genome rearrangement by inversion alone (which we call reverse genome rearrangement) is a valuable step toward studying evolutionary distance at the chromosome level.
In this paper, we design an abstract generic algorithm component library for unsigned reverse genome rearrangement algorithms (URGRA) based on the greedy strategy, which improves the reliability and reusability of algorithms in this field. The second section introduces the genome rearrangement problem and related research and briefly describes the domain modeling technology and formal methods. The third section analyzes the domain of reverse genome rearrangement algorithms, establishes the domain feature model of URGRA, identifies the common and variable features, establishes the relationships between features, and designs the algorithm components and the component interaction model. In Section 4, we show how a reverse sorting algorithm based on the First Descending Strip Reversal (FDSR) is developed from the component library and give the experimental results of the algorithm. Finally, we summarize the paper and outline future work.

Reverse Genome Rearrangement
Problem. Given two chromosomes represented by $\tau$ and $\sigma$, with $\tau = (\tau_1, \tau_2, \tau_3, \ldots, \tau_n)$ and $\sigma = (\sigma_1, \sigma_2, \sigma_3, \ldots, \sigma_n)$, where $\sigma_i$ and $\tau_i$ each represent a gene on the chromosome, let $\rho = [i, j]$ ($1 \le i \le j \le n$) denote an inversion interval acting on the chromosome. $\sigma \cdot \rho$ denotes that the subsequence $\sigma_i, \ldots, \sigma_j$ of $\sigma$ is reversed. What we seek is the minimum number of inversion operations; that is, we want to find a series of inversion intervals $\rho_1, \rho_2, \rho_3, \ldots, \rho_r$ such that $\sigma \cdot \rho_1 \cdot \rho_2 \cdot \rho_3 \cdots \rho_r = \tau$ and $r$ is as small as possible. We call $r$ the reverse distance or inversion distance of $\sigma$ and $\tau$ and denote it by $d(\sigma, \tau)$. Although there is no guarantee that the inversion process represents an actual evolutionary sequence, it gives a lower bound on the number of rearrangements that have occurred and indicates the similarity between two species. Therefore, scientists are interested in the minimum number of reversals [4].
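To make the notation concrete, the following Python sketch applies an inversion $\rho = [i, j]$ and computes the inversion distance by exhaustive breadth-first search. The names `apply_reversal` and `reversal_distance` are illustrative and do not come from the paper, and the search is an assumption made for demonstration only: it is practical for very small n and is not one of the algorithms studied here.

```python
from collections import deque


def apply_reversal(perm, i, j):
    """Return a copy of perm with positions i..j (1-based, inclusive) reversed."""
    if not 1 <= i <= j <= len(perm):
        raise ValueError("need 1 <= i <= j <= n")
    return perm[:i - 1] + perm[i - 1:j][::-1] + perm[j:]


def reversal_distance(sigma, tau):
    """Exact d(sigma, tau) by breadth-first search over all reversals."""
    start, goal = tuple(sigma), tuple(tau)
    if sorted(start) != sorted(goal):
        raise ValueError("sigma and tau must contain the same genes")
    n = len(start)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        perm, dist = queue.popleft()
        if perm == goal:
            return dist
        # Try every interval [i, j] with i < j (i == j changes nothing).
        for i in range(1, n + 1):
            for j in range(i + 1, n + 1):
                nxt = apply_reversal(perm, i, j)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    raise AssertionError("unreachable: reversals generate every permutation")
```

Because the number of permutations grows as n!, this brute-force search only serves as a reference for checking greedy results on short sequences.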
In the extended permutation $\pi = (\pi_0, \pi_1, \ldots, \pi_n, \pi_{n+1})$ with $\pi_0 = 0$ and $\pi_{n+1} = n + 1$, if $\pi_i$ and $\pi_{i+1}$ are consecutive numbers, that is, $|\pi_i - \pi_{i+1}| = 1$ ($0 \le i \le n$), then the pair $\pi_i$, $\pi_{i+1}$ is an adjacency; otherwise, it is a breakpoint. In this paper, a strip is defined as the interval between two neighboring breakpoints in $\pi$, that is, a maximal fragment without a breakpoint. Strips are further divided into ascending strips and descending strips. A strip with only one element could be regarded as either ascending or descending, but it is conventionally defined as a descending strip; the exceptions are the elements 0 and n + 1, which are always defined as belonging to ascending strips.
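The definitions of breakpoint and strip can be illustrated with a short Python sketch. The function names (`extend`, `breakpoints`, `strips`) and the tuple layout are assumptions made for this illustration; a breakpoint is reported as the index i of the pair (i, i + 1) in the extended permutation.

```python
def extend(perm):
    """Frame a permutation of 1..n with 0 and n + 1."""
    return [0] + list(perm) + [len(perm) + 1]


def breakpoints(ext):
    """Indices i such that ext[i] and ext[i + 1] are not consecutive numbers."""
    return [i for i in range(len(ext) - 1) if abs(ext[i] - ext[i + 1]) != 1]


def strips(ext):
    """Split ext into strips, returned as (start, end, kind) tuples."""
    last = len(ext) - 1
    bounds = [-1] + breakpoints(ext) + [last]
    result = []
    for left, right in zip(bounds, bounds[1:]):
        start, end = left + 1, right
        if start == 0 or end == last:
            kind = "ascending"      # strips holding 0 or n + 1
        elif start == end:
            kind = "descending"     # single-element strip, by convention
        else:
            kind = "ascending" if ext[start] < ext[end] else "descending"
        result.append((start, end, kind))
    return result
```

For the permutation 3 4 1 2 6 5, the extended form is 0 3 4 1 2 6 5 7, which has four breakpoints and five strips, one of which (6 5) is descending.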
It has been proved that if the genes are signed (that is, they have a direction), the reverse genome rearrangement problem is solvable in polynomial time, and very efficient algorithms have been found. However, if the genes are unsigned, the reverse genome rearrangement problem is NP-hard [5,6]. In this paper, we mainly study the problem of unsigned reverse genome rearrangement. Unless otherwise stated, genome rearrangement below means unsigned reverse genome rearrangement. In 1995, David Sankoff and Kececioglu began to study the inversion distance problem and discussed, for the first time, a greedy approximation algorithm based on breakpoint elimination and an exact branch and bound algorithm [7]. Then, Bafna and Pevzner designed a 1.75-approximation algorithm for unsigned sorting by reversals, with a time complexity of $O(n^2)$. In 1998, Christie gave a polynomial approximation algorithm with an approximation ratio of 1.5 and a time complexity of $O(n^4)$ [8]. Berman et al. designed a polynomial approximation algorithm with an approximation ratio of 1.375 by using the signed reverse sorting algorithm and the cycle decomposition approximation algorithm [9]. In the twentieth century, Professor Mo Zhongxi's team at Wuhan University designed a greedy algorithm based on the breakpoint graph, and Professor Zhu Daming of Shandong University designed a greedy algorithm based on the First Descending Strip Reversal (FDSR) [10]. Many researchers have developed various algorithms since then.
At present, the diversity and complexity of reverse genome rearrangement algorithms make it hard for many users to choose an algorithm suited to the characteristics of different DNA sequences, which leads to avoidable errors in the research process. In addition, the structure of these algorithms is difficult to understand, which hinders their correct use in practice. Because reverse genome rearrangement algorithms are described at a low level of abstraction, the reusability and reliability of reverse genome rearrangement software suffer. Therefore, it is necessary to study reverse genome rearrangement algorithms at the domain level. Research on the algorithm family helps to extract the commonality and variability of the algorithms and provides support for the formal development of reverse genome rearrangement algorithms.

Feature Modeling Technology.
Domain modeling requires identifying the key concepts of a domain and modeling their features [11]. Feature engineering [12] holds that a feature is a first-class entity that runs through the software life cycle and spans the problem space and the solution space, and that features reduce the gap in the understanding of requirements between users and software developers. In FODA [13], features are regarded as the aspects, qualities, and characteristics of a software system that are visible, prominent, or distinctive to users. Features are domain knowledge accumulated by users and experts through long-term use or research in a domain.
Feature modeling is the activity of modeling the commonality and variability of features and the relationships between them. Reference [14] puts forward a feature-oriented domain modeling method (FODM), which considers the service, function, and behavior characteristics of the domain and obtains the feature model through service analysis, function analysis, behavior analysis, domain terminology analysis, commonality and variability analysis, interaction process analysis, and quality requirement analysis activities.

Formal Method PAR.
The PAR [15][16][17][18][19] (partition and recur) method is a unified algorithm programming method based on partition and recurrence. It makes full use of mature programming technologies such as data abstraction, function abstraction, software reuse, and generics to realize the formal development of complex algorithmic problems.
That is, through a series of formal transformations of the problem specification, a fast and correct algorithm is obtained, and then an executable program is obtained through a series of formal equivalent transformations or software conversion tools. PAR consists of the following elements: SNL (structured natural language), Radl (recurrence-based algorithm design language), Apla (abstract programming language), a set of model transformation rules, and a set of automatic conversion tools among the requirements model, algorithm model, abstract program model, and executable programs. The Apla language fully embodies modern programming ideas such as function abstraction and data abstraction, which makes it very suitable for describing abstract algorithm programs. In Apla, all the combined data types and their related operations adopt the generic mechanism. The generics are mainly divided into two categories: ① type parameterization: the keyword sometype is introduced to define type variables, so that the base type of a combined data type can be given directly as a parameter in the type declaration; and ② subroutine parameterization: the keywords func and proc are provided in Apla to declare function parameters and procedure parameters. When declaring these parameters, one only needs to specify how many variables the operation takes and the type of each variable; the parameter is then instantiated by passing a subroutine implementation as the argument. Apla is not only the target language of the Radl-to-Apla program converter but also the source language of the converters from Apla to Ada, Java, C++, Python, and other executable languages, which is beneficial to the formal development of reusable components.
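The two kinds of generics can be mimicked in Python, which is used here only as an analogy since Apla syntax is not reproduced. In this sketch, the type variable `Elem` stands in for a `sometype` parameter and the callable argument `choose` stands in for a `func` parameter; all names are hypothetical.

```python
from typing import Callable, List, Optional, Tuple, TypeVar

Elem = TypeVar("Elem")          # analogue of an Apla `sometype` parameter
Interval = Tuple[int, int]      # 0-based, inclusive positions


def reverse_segment(items: List[Elem], i: int, j: int) -> List[Elem]:
    """Type parameterization: works for integer or character genes alike."""
    return items[:i] + items[i:j + 1][::-1] + items[j + 1:]


def run_reversals(
    perm: List[int],
    choose: Callable[[List[int]], Optional[Interval]],
) -> int:
    """Subroutine parameterization: the strategy `choose` is passed in."""
    count = 0
    interval = choose(perm)
    while interval is not None:
        perm = reverse_segment(perm, *interval)
        count += 1
        interval = choose(perm)
    return count


def first_misplaced(perm: List[int]) -> Optional[Interval]:
    """One possible strategy: move the smallest misplaced gene into place."""
    for i, gene in enumerate(perm):
        if gene != i + 1:
            return i, perm.index(i + 1)
    return None
```

Calling `run_reversals(perm, first_misplaced)` instantiates the generic driver with one concrete strategy; a different greedy rule could be supplied without changing the driver.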

Domain Analysis.
Here, we deeply analyze the core ideas of three typical greedy algorithms. The first, which reverses the first descending strip (FDSR), proceeds as follows:
① Input a gene permutation $\pi$ and check its validity. If it is not valid, report an error.
② Extend $\pi$ to the extended permutation $\pi = (0, \pi_1, \pi_2, \pi_3, \ldots, \pi_n, n + 1)$, and set the inversion distance d to 0.
③ Judge whether $\pi$ is the identity permutation; if not, go to ④. If it is, go to ⑨.
④ If the first strip other than strip 0 is ascending, reverse it into a descending strip.
⑤ Find the position i of the smallest term (i.e., the last term) in the first descending strip and the position j of its inverse adjacent element.
⑥ If the inverse adjacent element is the first element of a descending strip, first flip the strip in which the inverse adjacent element is located.
⑦ If position j is to the right of i, the inversion interval is $\rho = [i + 1, j]$; if position j is to the left of i, the inversion interval is $\rho = [j + 1, i]$.
⑧ Let $\pi = \pi \cdot \rho$, increase the inversion distance d by 1, and return to ③.
⑨ Output the reverse distance d.
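The steps above can be sketched in Python as follows. This is one possible reading of the description, not the authors' Apla implementation: the "inverse adjacent element" of the smallest term k is assumed to be the gene k − 1, every reversal actually performed (including the strip flips of steps ④ and ⑥) is counted in the returned distance, and the function name `fdsr_sort` is hypothetical.

```python
def fdsr_sort(perm):
    """Greedy sorting by reversals; returns (distance, list of intervals).

    Intervals are 1-based and inclusive, matching rho = [i, j] in the text.
    """
    n = len(perm)
    if sorted(perm) != list(range(1, n + 1)):        # step 1: validity check
        raise ValueError("expected a permutation of 1..n")
    ext = [0] + list(perm) + [n + 1]                  # step 2: extension
    reversals = []

    def strip_end(start):
        """Last position of the strip that begins at `start`."""
        end = start
        while end <= n and abs(ext[end] - ext[end + 1]) == 1:
            end += 1
        return end

    def reverse(i, j):
        if i < j:                                     # ignore empty reversals
            ext[i:j + 1] = ext[i:j + 1][::-1]
            reversals.append((i, j))

    while ext != list(range(n + 2)):                  # step 3: identity test
        start = strip_end(0) + 1                      # first strip after strip 0
        end = strip_end(start)
        if end > start and ext[start] < ext[end]:     # step 4: make it descending
            reverse(start, end)
            end = strip_end(start)
        i = end                                       # step 5: smallest term k
        j = ext.index(ext[i] - 1)                     # position of gene k - 1
        if j > i:
            j_end = strip_end(j)
            if j_end > j and ext[j] > ext[j_end]:     # step 6: flip its strip
                reverse(j, j_end)
                j = j_end
            reverse(i + 1, j)                         # step 7, j right of i
        else:
            reverse(j + 1, i)                         # step 7, j left of i
    return len(reversals), reversals                  # step 9
```

Under this reading, each pass of the loop removes at least one breakpoint, so the loop terminates; the result is a valid sorting sequence but is not guaranteed to be of minimum length.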

Algorithm Based on Breakpoint Graph.
The algorithm based on the breakpoint graph is a greedy algorithm with time complexity $O(\max\{b^3(\pi), n\,b(\pi)\})$, where $b(\pi)$ denotes the number of breakpoints of $\pi$, and space complexity $O(n)$. It was developed by Mo Zhongxi's team at Wuhan University. Its main process is as follows:
① Input the extended permutation $\pi$ and judge whether $\pi$ is valid; if not, end. Then, judge whether it is the identity permutation. If it is, end; otherwise, determine the inverse permutation $\pi^{-1}$ of $\pi$, the breakpoint set $B_\pi$ of $\pi$, and the breakpoint location table T of $\pi$. Let i = 1.
② Based on the breakpoint location table, find all the reversal intervals in $\pi$ that can eliminate two breakpoints or one breakpoint, and store them in S2 and S1, respectively.
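Step ② can be illustrated with a small Python sketch. For simplicity it enumerates every interval instead of using the breakpoint location table, so it is an assumption-laden stand-in that does not match the stated complexity; the name `classify_reversals` is hypothetical. It relies on the fact that a reversal [i, j] only changes the two pairs at its ends.

```python
def classify_reversals(perm):
    """Return (S2, S1): reversals removing two breakpoints or one breakpoint.

    Intervals are 1-based and inclusive over the genes of `perm`.
    """
    n = len(perm)
    ext = [0] + list(perm) + [n + 1]

    def is_break(a, b):
        return abs(a - b) != 1

    s2, s1 = [], []
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            # Only the pairs (i - 1, i) and (j, j + 1) are affected.
            before = is_break(ext[i - 1], ext[i]) + is_break(ext[j], ext[j + 1])
            after = is_break(ext[i - 1], ext[j]) + is_break(ext[i], ext[j + 1])
            gain = before - after
            if gain == 2:
                s2.append((i, j))
            elif gain == 1:
                s1.append((i, j))
    return s2, s1
```

For example, for the permutation 1 3 2 the only interval in S2 is [2, 3], which restores the identity in one step.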

Algorithm Based on Eliminating Breakpoints.
The algorithm based on eliminating breakpoints is a greedy algorithm with an approximation ratio of 2 designed by David Sankoff and Kececioglu. The main process can be summarized as follows:
① Input the extended permutation $\pi$, check the validity of the permutation, and calculate $\pi^{-1}$ and the arrays down and up (with down and up, one can decide in $O(1)$ time whether a given interval contains ascending or descending strips).
② Find an inversion interval that can eliminate two breakpoints.
③ If there is no inversion interval $\rho$ that can eliminate two breakpoints, find an inversion interval $\rho$ that can eliminate one breakpoint and leaves a descending strip in the new permutation $\pi \cdot \rho$.
④ If no such interval exists, find an inversion interval that can eliminate one breakpoint.
⑤ If no such interval exists, use the inversion interval $[i, \pi^{-1}_i]$, where i is the smallest index such that $\pi_i \ne i$.
Repeat the above five steps until $\pi$ is the identity permutation. According to the above analysis, we can use a unified flow chart to express the idea of each algorithm, as shown in Figure 1.
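The priority order of steps ② to ⑤ can be sketched in Python as below. This is a simplified illustration under stated assumptions: candidate intervals are found by brute-force enumeration rather than with $\pi^{-1}$ and the down and up arrays, a single-element strip counts as descending, and the name `greedy_breakpoint_sort` is hypothetical. It is not the original implementation and makes no claim about the approximation ratio.

```python
def greedy_breakpoint_sort(perm):
    """Sort by reversals, preferring reversals that remove more breakpoints."""
    n = len(perm)
    if sorted(perm) != list(range(1, n + 1)):
        raise ValueError("expected a permutation of 1..n")
    ext = [0] + list(perm) + [n + 1]

    def is_break(a, b):
        return abs(a - b) != 1

    def gain(i, j):
        before = is_break(ext[i - 1], ext[i]) + is_break(ext[j], ext[j + 1])
        after = is_break(ext[i - 1], ext[j]) + is_break(ext[i], ext[j + 1])
        return before - after

    def has_descending_strip(seq):
        for k in range(1, n + 1):
            if seq[k] - seq[k + 1] == 1:
                return True                      # pair x, x - 1
            if is_break(seq[k - 1], seq[k]) and is_break(seq[k], seq[k + 1]):
                return True                      # isolated gene
        return False

    reversals = []
    while ext != list(range(n + 2)):
        best, best_rank = None, 0
        for i in range(1, n + 1):
            for j in range(i + 1, n + 1):
                g = gain(i, j)
                if g <= 0:
                    continue
                if g == 2:
                    rank = 3                     # step 2
                else:
                    trial = ext[:i] + ext[i:j + 1][::-1] + ext[j + 1:]
                    rank = 2 if has_descending_strip(trial) else 1  # steps 3, 4
                if rank > best_rank:
                    best, best_rank = (i, j), rank
        if best is None:                         # step 5: fallback
            i = next(k for k in range(1, n + 1) if ext[k] != k)
            best = (i, ext.index(i))
        i, j = best
        ext[i:j + 1] = ext[i:j + 1][::-1]
        reversals.append(best)
    return len(reversals), reversals
```

The fallback of step ⑤ always places gene i at position i without increasing the number of breakpoints, which is why the loop in this sketch terminates.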

Domain Modeling.
In the following, we use the feature modeling method proposed by academician Mei Hong's team to model the features of the URGRA domain and construct the feature model based on the service, function, and behavior characteristics of the domain. The reverse manipulation service (reverse_mani) is the core service in this domain. Sequence validity check (seq_check), gene permutation storage table manipulation (perm_store_mani), greedy algorithm mode option (greedy_op), auxiliary permutation storage table manipulation (auxiliary_permu_mani), judging whether the permutation is the identity permutation (is_sorted), and output are the main functions in this domain. Of these, seq_check, perm_store_mani, greedy_op, and output are required functions, whereas auxiliary_permu_mani and is_sorted are optional functions. For greedy_op, Breakpoint_diagram_op, FDSR_op, and breakpoint_op are its behavior characteristics. For output, output_mode is its significant behavior characteristic, and there are three main behavior characteristics: inversion process output (procedure_op), inversion distance output (distance_op), and inversion interval output (interval_op). For auxiliary_permu_mani, break_store, $\pi^{-1}$ and break_pos_store, and the arrays $\pi^{-1}$, up, and down are behavior characteristics. Based on the above analysis, a feature model for this domain is constructed, as shown in Figure 2.
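As a rough illustration of how the required and optional features could be checked when selecting a configuration, consider the following Python sketch. The feature names come from the text; the function `validate_configuration`, the rule that exactly one greedy mode is chosen, and the rule that at least one output mode is needed are assumptions made for this example.

```python
REQUIRED = {"seq_check", "perm_store_mani", "greedy_op", "output"}
OPTIONAL = {"auxiliary_permu_mani", "is_sorted"}
GREEDY_MODES = {"Breakpoint_diagram_op", "FDSR_op", "breakpoint_op"}
OUTPUT_MODES = {"procedure_op", "distance_op", "interval_op"}


def validate_configuration(features, greedy_mode, output_modes):
    """Return a sorted list of problems; an empty list means the choice is valid."""
    features, output_modes = set(features), set(output_modes)
    problems = [f"missing required feature: {name}"
                for name in REQUIRED - features]
    problems += [f"unknown feature: {name}"
                 for name in features - REQUIRED - OPTIONAL]
    if greedy_mode not in GREEDY_MODES:
        problems.append(f"unknown greedy mode: {greedy_mode}")
    if not output_modes:
        problems.append("at least one output mode is needed")
    problems += [f"unknown output mode: {name}"
                 for name in output_modes - OUTPUT_MODES]
    return sorted(problems)
```

A configuration such as the four required features with `FDSR_op` and `distance_op` passes the check, whereas omitting `seq_check` is reported as a problem.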
The features in the feature model form a complete domain feature model through interaction, and the interaction between features is expressed by the constraints and dependencies among them. Therefore, on the basis of the feature model established above, we design the feature interaction model of the URGRA domain.
Through the establishment of the URGRA feature model, we find that the algorithm mainly involves three variable processes: permutation_mani, greedy_op, and output. In addition, the input of the algorithms in this domain is a gene sequence, and the validity of the sequence information needs to be checked before the algorithm is executed. So, the major artifacts in this domain are the seq_check artifact, the perm_store_mani artifact, the greedy_op artifact, and the output artifact. Other features in the feature model are used as auxiliary components, and the interaction model of the components is established according to the dependencies between components, as shown in Figure 3.
In Figure 3, the nodes connected by solid lines represent the basic features that must be included in the URGRA domain, and the direction of the arrows represents the execution priority of the four features from high to low. The dotted arrows represent the associated operations required during algorithm assembly, such as the use of auxiliary storage table operations for greedy mode selection. The dotted lines indicate the interaction between two features during the execution of the algorithm; for example, when the inversion output feature is used to output the permutation process, the distance, or the inversion interval, the gene permutation storage table operation is required.

Type and Algorithm Component Design.
Here, we further analyze the URGRA domain feature model and the algorithm component interaction model described above and encapsulate them into two abstract data type (ADT) components and a reverse rearrangement algorithm component. Because Apla programs are highly abstract, support ADTs well, and lend themselves to formal derivation and correctness verification, we carry out the formal design and implementation of the URGRA model in Apla. This generic ADT is named perm_store. It has a type parameter elem, which can accept either integer or character types, and it provides the following declarations:
type perm_store = private is the storage space specification, which specifies that the storage space used by this self-defined ADT is private.
init(var p: perm_store; permutation: array[0...n, elem]) dynamically allocates storage space and initializes it.
check(p: perm_store; permutation: array[0...n, elem]) verifies whether the gene sequence is valid.
isSorted(p: perm_store; permutation: array[0...n, elem]) determines whether the permutation is the identity permutation.
setValue(p: perm_store; i: integer; permutation: array[0...n, elem]) and getValue(p: perm_store; permutation: array[0...n, elem]; i: integer) set and get an element value.
output(p: perm_store; permutation: array[0...n, elem] = NULL; distance: integer) outputs the inversion distance, inversion process, and inversion interval of the permutation; only the inversion distance is output by default.
reversal(p: perm_store; permutation: array[0...n, elem]; i: integer; j: integer; aux: auxiliary_permu) performs an inversion on the permutation.
distance_mani(p: perm_store; distance: integer) is the operation on the inversion distance.
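A rough Python analogue of the perm_store ADT may help readers unfamiliar with Apla. The class below is a sketch under several assumptions: the operation names are adapted to Python conventions, positions are 1-based, validity is checked only as "no duplicate genes" so that integer and character genes are both supported, `reversal` records each interval and updates the distance itself, and the three output modes are reduced to simple strings. None of this is taken from the actual Apla component.

```python
from typing import Generic, List, Tuple, TypeVar

Elem = TypeVar("Elem")   # integer or character genes


class PermStore(Generic[Elem]):
    def __init__(self, permutation) -> None:          # init
        self._items: List[Elem] = list(permutation)
        self._intervals: List[Tuple[int, int]] = []
        self._distance = 0

    def check(self) -> bool:
        """Valid if non-empty and no gene is repeated."""
        return bool(self._items) and len(set(self._items)) == len(self._items)

    def is_sorted(self) -> bool:
        return self._items == sorted(self._items)

    def get_value(self, i: int) -> Elem:
        return self._items[i - 1]

    def set_value(self, i: int, value: Elem) -> None:
        self._items[i - 1] = value

    def distance_mani(self, step: int = 1) -> None:
        self._distance += step

    def reversal(self, i: int, j: int) -> None:
        if not 1 <= i <= j <= len(self._items):
            raise IndexError("need 1 <= i <= j <= n")
        self._items[i - 1:j] = self._items[i - 1:j][::-1]
        self._intervals.append((i, j))
        self.distance_mani()

    def output(self, mode: str = "distance") -> str:
        if mode == "distance":                         # default output
            return str(self._distance)
        if mode == "interval":
            return " ".join(f"[{i},{j}]" for i, j in self._intervals)
        if mode == "procedure":
            return " ".join(str(x) for x in self._items)
        raise ValueError(f"unknown output mode: {mode}")
```

A short usage example: `store = PermStore([3, 1, 2])`, followed by `store.reversal(1, 2)` and `store.reversal(2, 3)`, leaves the store sorted, and `store.output()` then reports the distance.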
In Apla, an operation on a self-defined ADT must receive a value of that ADT type as an argument, which serves as the operation object; this is why each of the above operations has a variable p of type perm_store. The listing of the auxiliary_permu ADT closes with function get_value(a: auxiliary_permu; i: integer): elem; enddef. This ADT contains a procedure generic parameter someproc initialization_auxiliary(sometype: elem) and an integer parameter n, so that the generic program can support instantiating different greedy algorithm modes. type permission_mani = private is the storage space description, which states that the storage space used by the self-defined ADT is private. procedure set_value(a: auxiliary_permu; i: integer) and function get_value(a: auxiliary_permu; i: integer) set and obtain element values.
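The role of the procedure generic parameter can be sketched in Python by passing an initialization routine to the constructor. The class `AuxiliaryPermu`, the two sample initializers, and the extra `value` argument of `set_value` are assumptions for illustration; the Apla component may be organized differently.

```python
from typing import Callable, List


def inverse_permutation(ext: List[int]) -> List[int]:
    """Table t with t[gene] = position of gene in the extended permutation."""
    table = [0] * len(ext)
    for position, gene in enumerate(ext):
        table[gene] = position
    return table


def breakpoint_positions(ext: List[int]) -> List[int]:
    """Indices i where ext[i] and ext[i + 1] are not consecutive numbers."""
    return [i for i in range(len(ext) - 1) if abs(ext[i] - ext[i + 1]) != 1]


class AuxiliaryPermu:
    def __init__(self, ext: List[int],
                 initialize: Callable[[List[int]], List[int]]) -> None:
        # The initializer is the analogue of the procedure generic parameter.
        self._table = initialize(ext)

    def get_value(self, i: int) -> int:
        return self._table[i]

    def set_value(self, i: int, value: int) -> None:
        self._table[i] = value
```

Instantiating the same class with `inverse_permutation` or with `breakpoint_positions` yields the auxiliary table that a particular greedy mode needs.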

Results and Discussion
The test computer is configured with an AMD A10-7300 Radeon R6 processor (10 compute cores, 4C+6G) at 1.90 GHz, 12 GB of memory, and the Windows 7 operating system.
We used real data to carry out the inversion test. Human and mouse chromosomes share the same gene fragments, totaling 193 genes. These genes are described in [20]. We read the mouse gene arrangement from sourcedata.txt and then output the inversion result to targedata.txt, as shown in Figure 4. The running time is shown in Figure 5. The FDSR algorithm developed by the formal PAR method runs for 3 ms.
With the formal method PAR, we first accurately describe the functional specification of reverse genome rearrangement in the Radl language and then develop loop invariants based on the new strategy of developing loop invariants. Then, we develop an Apla algorithm program based on the obtained algorithm specification and loop invariants, thus formally implementing the perm_store and auxiliary_permu type components and the reverse rearrangement algorithm component. Finally, we assemble the FDSR algorithm from these three components with the support of the PAR platform. Compared with some existing algorithms, our formally developed algorithm ensures the reliability and robustness of the algorithm program and, through component assembly, improves the flexibility of algorithm assembly, which makes the algorithm convenient for researchers to maintain and optimize.

Conclusions
Reverse genome rearrangement is a hot topic in bioinformatics research, and its algorithms have been widely studied. Because of the flexibility of the algorithm design strategies, these algorithms are diverse and complex. In this paper, generative programming technology is used to deeply analyze the domain of reverse genome rearrangement algorithms based on the greedy strategy and to identify the common and variable features. Highly abstract program components are designed in the Apla language by using the formal method PAR, and the FDSR algorithm is generated by automatically assembling components with the support of the PAR platform, thus improving the reliability and reusability of the algorithm components and reducing the development cost. Our team has formally implemented the pairwise sequence alignment algorithm component library and the multiple sequence alignment algorithm component library [21,22]. In the next step, the insertion, deletion, and replacement of single characters in sequence alignment will be taken into account, so as to develop the corresponding components, expand the scope of assemblable algorithms, and better analyze biological similarity and homology.

Data Availability
The data used to support the findings of this study are available in [20].

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Journal of Healthcare Engineering