Formulation and Analysis of Patterns in a Score Matrix for Global Sequence Alignment

Global sequence alignment is one of the most basic pairwise sequence alignment procedures used in molecular biology to understand the similarity that arises among the structure, function, or evolutionary relationship between two nucleotide sequences. The general algorithm associated with global sequence alignment is the dynamic programming algorithm of Needleman and Wunsch. In this paper, patterns are exploited in the score matrix of the Needleman–Wunsch algorithm. With the help of some examples, the general patterns realized are formulated as new a priori propositions and corollaries that are established for both equal and unequal length comparisons of any two arbitrary sequences.


Introduction
Sequence alignment is the matching of strings or sequences of characters to identify patterns that may lead to informed structural or functional relationships between the strings or sequences matched. Varying situational problems seem to require the use of sequence alignment and examples abound from computer science to molecular biology, where sequences are usually aligned to make more meaning out of them. Bare [1] has discussed how researchers have over the years considered genetic sequences as strings of characters instead of focusing on their intrinsic properties to be able to compute the similarity among two or more related sequences. Global and local sequence alignment are used for matching two sequences, and they are categorized as pairwise sequence alignment or multiple sequence alignment when any alignment matches more than two sequences [2][3][4].
The Needleman-Wunsch algorithm [5][6][7] and the Smith-Waterman algorithm [8] are the general algorithms associated with pairwise global and local alignment, respectively. These two algorithms utilize a procedure called the dynamic programming approach. Dynamic programming, as coined by Bellman in the 1940s, is simply the process of solving a bigger problem by finding optimal solutions to its smaller nested problems [9][10][11]. Thus, to tackle a problem in the context of dynamic programming, it must possess the notion of recurrence. In [12,13], an algorithm is said to be a well-defined computational task that accepts input and produces output values after following through a systematic method. Mathematical theory is thus a prerequisite behind the designing of functional programs [14,15], and the algorithm design specializes in solving such problems. Global sequence alignment is mentioned as one of the vast dynamic programming applications in practical problems [16][17][18].
More recently, Ouzounis and Baichoo [19] have stated that even though pairwise sequence alignment has been dealt with over time, concerns still remain in resolving exact evolutionary distances that demand very specific estimates. They have suggested the existence of theoretical relationships within alignments, algorithms, and data that are yet to be found. Motivated by the display of position-dependent arrays for affine gaps using the Needleman-Wunsch algorithm in [20,21], we consider the more basic linear gap penalties using arbitrary sequences and proceed to find patterns in the score matrix which were absent in the presentation. This approach was taken because it is found missing in the available stream of literature, although it can be likened to the edit graph illustration seen in [1]. To the best of our knowledge, there has never been any theoretical exposition of this kind for the most basic and very fundamental concept of constant linear gap penalties, which constitute an implicit part of affine gap penalties. Thus, we focus on finding patterns in global sequence alignment using constant linear gap penalties and attempt the display of completed score matrices of the Needleman-Wunsch algorithm with distinct positional arrays that can be inspected. We give confirmation of this by basic proofs and suggest how predictions can be made. For the reader's convenience, concepts that are deemed useful are recalled in brief, with references to comprehensive works that give more in-depth information.

Needleman-Wunsch Algorithm
Let the recursive formulation for the Needleman-Wunsch algorithm be
$$P(0, 0) = 0, \qquad P_m(i, 0) = i \cdot \text{gap penalty}, \qquad P_m(0, j) = j \cdot \text{gap penalty},$$
where $P(0, 0)$ is the initialization pivot for the score matrix, $P_m(i, 0)$ is the initial row pivot for the score matrix, and $P_m(0, j)$ is the initial column pivot for the score matrix, and
$$P_m(i, j) = \max\{P_D(i, j),\; P_R(i, j),\; P_L(i, j)\},$$
where $P_m(i, j)$ is the pivot for each cell box calculation, $P_D(i, j) = P_m(i - 1, j - 1) + \sigma(\alpha(i), \beta(j))$ is the diagonal value of a cell box, $P_R(i, j) = P_m(i - 1, j) + \text{gap penalty}$ is the right value of a cell box, $P_L(i, j) = P_m(i, j - 1) + \text{gap penalty}$ is the left value of a cell box, and $\sigma(\alpha(i), \beta(j))$ is the score for aligning the sequence characters of $\alpha(i)$ and $\beta(j)$.
Specifically, the procedure for the Needleman-Wunsch algorithm follows the dynamic programming approach of the score matrix, traceback, and alignment as outlined in the subsequent sections.

Score Matrix.
The score matrix is a tabular box constructed to keep count of score results. The score matrix for the Needleman-Wunsch algorithm begins with an initialization process and ends with the calculation of cell boxes.

Initialization.
A sequence matrix is created with $\mu_1 + 1$ columns and $\mu_2 + 1$ rows so that the initial matrix gap can be aligned, where $\mu_1$ and $\mu_2$ are the lengths of the arbitrary sequences $\langle\alpha\rangle$ and $\langle\beta\rangle$, respectively. The characters of $\langle\alpha\rangle$ fill in the horizontal axis, and similarly, the characters of $\langle\beta\rangle$ fill in the vertical axis of the sequence matrix created. Before the scoring begins from the upper left corner of the initialized matrix to the lower right corner of the matrix, the value $P(0, 0) = 0$ is assigned to the intersection of the first row and the first column of the matrix (i.e., the initial gap). A gap penalty is included in an alignment because a mutation may insert or delete a string character from one of the sequences. Arrows that point in the direction of positional movement (diagonal and left or right) are placed in each cell box of the matrix, and the process only terminates when all the cell boxes are completely filled.

Calculation of Cell Boxes in the Score Matrix
(1) Fill the initialized gap values first on both the horizontal axis and the vertical axis with a defined constant gap value score
(2) Follow with the calculation of each cell box having the three position-dependent arrays (left/beside, right/bottom, or diagonal)
(3) Allow only a match/mismatch value for a "diagonal" position, and allow the "bottom" or "beside" positions to take linear gap values only
(4) For each computed cell box, find the maximum score and let that be the pivot
(5) The pivot of a computed cell box directly affects the next cell boxes in the row or the column

Traceback.
A simpler score matrix table that contains only the pivot of each cell box calculation is constructed from the original position-dependent arrays of the score matrix table. Arrow pointers are used to direct a path from the highest score or an optimal value in the matrix (which actually occurs at the lower right end corner of the matrix) and traced back to the next biggest value of the predecessors until we reach the intersection of the first row and the first column with the initial gap.
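The score-matrix steps that feed this traceback (initialization, the three values of each cell box, and the pivot) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the names `sigma`, `fill_score_matrix`, `P`, and `boxes` are hypothetical, the index `i` is assumed to run over the first sequence and `j` over the second as in $P_m(i, j)$, and the default scores (+5, −1, −2) are the ones this study adopts.

```python
def sigma(a, b, match=5, mismatch=-1):
    """Score for aligning characters a and b (assumed match/mismatch scheme)."""
    return match if a == b else mismatch


def fill_score_matrix(alpha, beta, gap=-2):
    """Return (P, boxes) for a constant linear gap penalty.

    P[i][j] is the pivot of cell box (i, j); boxes[(i, j)] holds the
    triple (DV, RV, LV) = (diagonal, right, left) value of that box.
    """
    m1, m2 = len(alpha), len(beta)
    P = [[0] * (m2 + 1) for _ in range(m1 + 1)]   # P[0][0] = 0 is the initial gap
    for i in range(1, m1 + 1):                    # initial gap values along one axis
        P[i][0] = gap * i
    for j in range(1, m2 + 1):                    # initial gap values along the other axis
        P[0][j] = gap * j

    boxes = {}
    for i in range(1, m1 + 1):
        for j in range(1, m2 + 1):
            dv = P[i - 1][j - 1] + sigma(alpha[i - 1], beta[j - 1])  # match/mismatch only
            rv = P[i - 1][j] + gap                                   # gap value only
            lv = P[i][j - 1] + gap                                   # gap value only
            boxes[(i, j)] = (dv, rv, lv)
            P[i][j] = max(dv, rv, lv)             # the pivot is the maximum of the three
    return P, boxes


P, boxes = fill_score_matrix("CTTGA", "CTAGA")
print(P[5][5])        # pivot in the last cell box: 19
print(boxes[(1, 2)])  # (DV, RV, LV) of cell box (1, 2): (-3, -6, 3)
```

Keeping the three values of every cell box, rather than only the pivot, is what makes the position-dependent arrays available for inspection afterwards.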

Alignment Generation from a Traceback.
To write the sequence characters that appear from the optimal alignment path of the traceback stage, the following steps are followed:
(1) When the arrow is diagonal, write both characters in the alignment
(2) When the arrow is vertical, write the corresponding horizontal character, and in place of the vertical character, leave a gap
(3) When the arrow is horizontal, write the corresponding vertical character, and in place of the horizontal character, leave a gap
Thus, for both the vertical and horizontal arrow positions, one character and a gap are written for the alignment, where the gap explicitly replaces a character position in the alignment. An alignment can only be inferred as the best if the optimal value from the score matrix table corresponds to the alignment score calculation (based on the scoring scheme defined).
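A minimal Python sketch of the traceback and alignment rules above is given here. Several things are assumptions rather than part of the source: the function name `needleman_wunsch`, the use of "-" for the gap character, and the tie-breaking order (right value first, then diagonal, then left value), which is one arbitrary choice among the ties the procedure allows. The sketch keys the two gap moves on the right and left values of the recursion rather than on arrow orientation, since the latter depends on how the axes are drawn.

```python
def needleman_wunsch(alpha, beta, match=5, mismatch=-1, gap=-2):
    """Return (optimal score, aligned alpha, aligned beta) for a linear gap penalty."""
    m1, m2 = len(alpha), len(beta)

    def sigma(a, b):
        return match if a == b else mismatch

    # Score matrix of pivots, P[i][j] with i over alpha and j over beta.
    P = [[0] * (m2 + 1) for _ in range(m1 + 1)]
    for i in range(1, m1 + 1):
        P[i][0] = gap * i
    for j in range(1, m2 + 1):
        P[0][j] = gap * j
    for i in range(1, m1 + 1):
        for j in range(1, m2 + 1):
            P[i][j] = max(P[i - 1][j - 1] + sigma(alpha[i - 1], beta[j - 1]),
                          P[i - 1][j] + gap,
                          P[i][j - 1] + gap)

    # Traceback from the last cell box to the initial gap at (0, 0).
    i, j = m1, m2
    row_a, row_b = [], []
    while i > 0 or j > 0:
        if i > 0 and P[i][j] == P[i - 1][j] + gap:
            # Right-value move: one character of alpha against a gap.
            row_a.append(alpha[i - 1]); row_b.append("-"); i -= 1
        elif i > 0 and j > 0 and P[i][j] == P[i - 1][j - 1] + sigma(alpha[i - 1], beta[j - 1]):
            # Diagonal move: write both characters.
            row_a.append(alpha[i - 1]); row_b.append(beta[j - 1]); i -= 1; j -= 1
        else:
            # Left-value move: one character of beta against a gap.
            row_a.append("-"); row_b.append(beta[j - 1]); j -= 1
    return P[m1][m2], "".join(reversed(row_a)), "".join(reversed(row_b))


print(needleman_wunsch("CTTGA", "CTAGA"))  # (19, 'CTTGA', 'CTAGA')
print(needleman_wunsch("AGCTG", "TCAG"))   # (6, 'AGCTG', 'T-CAG')
```

A different tie-breaking order can return a different alignment with the same optimal score; for AGCTG and TCAG, preferring the diagonal first would place the gap one column earlier.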

The Problem of Aligning Any Two Sequences.
The problem of aligning any two sequences can be simplified as discovering the optimal means of aligning any two arbitrary sequences, say $\langle\alpha\rangle$ and $\langle\beta\rangle$, such that the character "−" noted as a gap is filled into both $\langle\alpha\rangle$ and $\langle\beta\rangle$ or either of them, where
(1) Any single character in $\langle\alpha\rangle$ matches a single character in $\langle\beta\rangle$ or a gap
(2) The final sum of scores from the scoring function over the aligned pairs and the gap penalties as given by the function of gap penalty is maximized

Letter Choice for Arbitrary Sequences.
The letter choice of this study shall be that of DNA, considered as a string of four characters: adenine (A), guanine (G), cytosine (C), and thymine (T).

Scheme for Scoring
Definition 1. The linear gap penalty is $v(k) = gk$, where $g$ is an assigned constant and $k$ is the sequence length count of $i = 1, \ldots, \mu_1$ for the score matrix row and $j = 1, \ldots, \mu_2$ for the score matrix column. This is the penalty awarded to gaps and is also known as the linear gap function.

Definition 2.
The affine gap penalty is the penalty awarded to gaps where a greatest consecutive sequence of $k$ gaps is given as $v(k) = q + gk$, where $q$ is the penalty charged for opening the gap and $g$ is the penalty charged for extending it.
Thus, more formally, we define (1) a linear gap as $v(k) = gk$ and (2) an affine gap as $v(k) = q + gk$.
Altschul's theory: assign positive scores to an identity and conserved replacements, and assign negative scores to less likely replacements.
Remark 1. Needleman and Wunsch use the "identity matrix" in scoring, with 1 for a match and 0 for a mismatch. This score was criticised for not reflecting observations from nature, because substitutions within a class (purine-purine or pyrimidine-pyrimidine) occur more readily than purine-pyrimidine substitutions, whereas the identity matrix scores every mismatch alike. Because there is no single agreed score for DNA alignment, the choice of scoring for this study is +5 for a match, −1 for a mismatch, and −2 for a gap, as proposed in [18] following the above theory.
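The two gap functions and the scoring choice can be written down directly. The short Python sketch below is illustrative only: the function names are hypothetical, and the opening penalty `q = -3` in the usage lines is an arbitrary value chosen for the example, since this study fixes only the linear case with $g = -2$.

```python
def linear_gap(k, g=-2):
    """Linear gap penalty v(k) = g*k for a run of k gaps."""
    return g * k


def affine_gap(k, q, g):
    """Affine gap penalty v(k) = q + g*k: opening charge q plus extension charge g per gap."""
    return q + g * k


def sigma(a, b, match=5, mismatch=-1):
    """Diagonal score: +5 for a match and -1 for a mismatch in this study."""
    return match if a == b else mismatch


print(linear_gap(3))              # -6
print(affine_gap(3, q=-3, g=-2))  # -9
print(sigma("C", "C"), sigma("C", "T"))  # 5 -1
```

Setting `q = 0` in `affine_gap` recovers the linear gap function, which is the sense in which the linear penalty is an implicit part of the affine one.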

Pattern Investigations in the Score Matrix of Needleman-Wunsch Algorithm.
The sequences $\alpha$ and $\beta$ that will be used are arbitrary sample sequences that were chosen based on consideration of the following:
Example 1. Let $\langle\alpha\rangle$ = CTTGA and $\langle\beta\rangle$ = CTAGA. Because the lengths of the sequences $\langle\alpha\rangle$ and $\langle\beta\rangle$ are $\mu_1 = \mu_2 = 5$, we align $\alpha$ and $\beta$ in a score matrix following the above formulation.
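As an illustration of a single cell box calculation for CTTGA and CTAGA, the values of cell box $(1, 2)$ can be worked out by hand from the recursion and the scoring scheme (+5, −1, −2). The inputs $P_m(0, 1) = -2$ and $P_m(0, 2) = -4$ are assumed to be the initial linear gap values, and $P_m(1, 1) = 5$ is assumed from the C − C match in the first cell box.

```latex
\begin{align*}
P_D(1,2) &= P_m(0,1) + \sigma(\alpha(1), \beta(2)) = -2 + \sigma(C, T) = -2 + (-1) = -3,\\
P_R(1,2) &= P_m(0,2) + \text{gap penalty} = -4 + (-2) = -6,\\
P_L(1,2) &= P_m(1,1) + \text{gap penalty} = 5 + (-2) = 3,\\
P_m(1,2) &= \max\{-3,\, -6,\, 3\} = 3.
\end{align*}
```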
Remark 3. $\sigma(C, T)$ in $P_D(1, 2)$ was "−1" because that is the assigned score for mismatched characters. Consequently, the recursion follows similarly until all the cell boxes are filled. In the event of a pivot tie in a cell box, one tie value is picked. For the tabular display, these notations are used interchangeably: $P_D(i, j)$ is the same as DV, $P_L(i, j)$ is the same as LV, $P_R(i, j)$ is the same as RV, and $P_m(i, j)$ is the same as the pivot. Figure 2 shows the pivot of each cell box calculation of the score matrix table of CTTGA and CTAGA in Figure 1.
Thus, Figure 2 is a simpler score matrix table constructed from the original score matrix table of CTTGA and CTAGA in Figure 1. The arrow pointers direct the path from the optimal value and trace back to the initialization value of zero. Based on the traceback and the diagonal direction of the arrow pointers, the alignment is written as CTTGA, CTAGA.
To confirm the correctness of the alignment, we check using calculations. Recall the scoring scheme: match = +5, mismatch = −1, and gap = −2. We have four alignment matches: C − C, T − T, G − G, and A − A, and one mismatched alignment: T − A. Hence, the alignment score is $4(+5) + 1(-1) = 19$, which is the same as the optimal value from the score matrix table. The alignment is thus optimal.

Unequal Character Length
Example 2. Suppose $\langle\alpha\rangle$ = AGCTG and $\langle\beta\rangle$ = TCAG. To fill in the score matrix values of the cell boxes, the previously stated recursive formulation is used. This results in Figures 3 and 4.
International Journal of Mathematics and Mathematical Sciences
(4) For each cell box, the preceding bottom value is always less than the immediately adjacent diagonal value

Traceback and Alignment for Unequal Character Length.
The traceback and alignment stages for the example on unequal character length of sequences are, respectively, shown below. Figure 5 shows the pivot of each cell box calculation of the score matrix table of AGCTG and TCAG of Table 3. Based on the traceback and the direction of the arrow pointers, the alignment is written as AGCTG, T−CAG. We check the correctness of the alignment using calculations. Recall the scoring scheme: match = +5, mismatch = −1, and gap = −2. We have two alignment matches: G − G and C − C, two alignment mismatches: A − T and T − A, and one gapped alignment: G against a gap. Hence, the alignment score is $2(+5) + 2(-1) + 1(-2) = 6$, which is the same as the optimal value from the score matrix table. The alignment is thus optimal.
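The hand checks for both examples can be repeated with a short Python sketch that scores an alignment column by column. The function name `score_alignment` and the "-" gap character are assumptions for illustration; the second row `T-CAG` is read off from the column pairs listed above (A with T, G with a gap, C with C, T with A, G with G).

```python
def score_alignment(row_a, row_b, match=5, mismatch=-1, gap=-2):
    """Sum the column scores of two aligned rows of equal length ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap          # one character against a gap
        elif a == b:
            total += match        # identical characters
        else:
            total += mismatch     # different characters
    return total


print(score_alignment("CTTGA", "CTAGA"))  # 4 matches, 1 mismatch: 19
print(score_alignment("AGCTG", "T-CAG"))  # 2 matches, 2 mismatches, 1 gap: 6
```

An alignment is accepted as optimal only when this column-wise score equals the pivot in the last cell box of the score matrix.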

Propositions and Proofs for Equal Sequence Length.
The following propositions are a priori results deduced from the pattern results of equal character length of sequences.
Definition 3. Let the linear gap penalty be $v(k) = gk$, where $g$ is a constant of −2 and $k$ is the sequence length count of $k = 1, \ldots, \mu_1$ for the score matrix row and $k = 1, \ldots, \mu_2$ for the score matrix column.
Definition 4. Define $P(0, 0) = 0$ to be the initialization pivot for the score matrix.
Proof. From Proposition 2 and Definitions 5-7, it is clear that where $i = j$.
and by Proposition 2, we can write that $P_m(1, j) = P_m(i, 1)$. Hence, it is proved.
Proof. Let $P_m(i, j)$, $P_m(i + 1, j)$, $P_m(i, j + 1)$, and $P_m(i + 1, j + 1)$ be any four cell boxes; then we are to show that $P_L(i, j + 1) = P_R(i + 1, j)$. By Definition 7,
$$P_L(i, j + 1) = P_m(i, j) + \text{gap penalty}, \qquad P_R(i + 1, j) = P_m(i, j) + \text{gap penalty}.$$
Since the linear gap penalty is equal in both cases and $P_m(i, j)$ is the same in both cases, we write $P_L(i, j + 1) = P_R(i + 1, j)$. □
Corollary 6. For any two adjacent row cell boxes, the right value of a preceding cell box is less than the diagonal value of the next cell box.

Proof. From the recursive formulation, $P_R(i, j) = P_m(i - 1, j) + \text{gap penalty}$ and $P_D(i, j + 1) = P_m(i - 1, j) + \sigma(\alpha(i), \beta(j + 1))$, so both values share the term $P_m(i - 1, j)$. We are left to show that gap penalty $< \sigma(\alpha(i), \beta(j + 1))$. □
Remark 5. By Altschul's theory, a match and mismatch are chosen to be greater than a gap penalty. We recall the scoring scheme of the constant $g$ being "−2" in the linear gap penalty $gk$ and the diagonal score being "−1" when there is a mismatch and "+5" when there is a match. For any $k$ count, $\forall k = 1, \ldots, \mu$, the linear gap penalty $gk$ decreases, and thus the gap penalty can never be greater than or equal to any of the diagonal scores allocated for match and mismatch. Thus,
$$\text{gap penalty} < \underbrace{-1}_{\text{mismatch diagonal score}} \le \sigma(\alpha(i), \beta(j + 1)) \le \underbrace{+5}_{\text{match diagonal score}}, \quad \forall i, j = 1, \ldots, \mu.$$

Proposition and Proof for Unequal Sequence Length.
The following proposition holds a priori from the pattern results of unequal character length of sequences. We state and prove the following general results by adhering to the same definitions stated earlier under the equal character length of sequences.

Proposition 7.
For each three pointed arrows intersecting any four cell boxes, the left value of a 2nd cell box corresponds to the right value of a 3rd cell box.
Proof. Let $P_m(i, j)$, $P_m(i + 1, j)$, $P_m(i, j + 1)$, and $P_m(i + 1, j + 1)$ be any four cell boxes; then we are to show that $P_L(i, j + 1) = P_R(i + 1, j)$. Refer to the proof of Proposition 5. Although the sequence lengths are unequal, $\mu_1 \neq \mu_2$, the disparity in length has no bearing on the proof. □
Corollary 8. For any two adjacent row cell boxes, the right value of a preceding cell box is less than the diagonal value of the next cell box.
Proof. Refer to the proof of Corollary 6. Again, although the sequence lengths are unequal, $\mu_1 \neq \mu_2$, the disparity in length has no bearing on the proof.
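The two patterns can also be checked numerically on the example sequences. The Python sketch below is a sanity test under stated assumptions, not a substitute for the proofs: the names `pivots` and `check_patterns` are hypothetical, and the index convention ($i$ over the first sequence, $j$ over the second) follows the recursion. The equality $P_L(i, j + 1) = P_R(i + 1, j)$ holds by definition, so only the inequality $P_R(i, j) < P_D(i, j + 1)$ actually depends on the scoring scheme.

```python
def pivots(alpha, beta, match=5, mismatch=-1, gap=-2):
    """Score matrix of pivots P[i][j] for a constant linear gap penalty."""
    m1, m2 = len(alpha), len(beta)
    P = [[0] * (m2 + 1) for _ in range(m1 + 1)]
    for i in range(1, m1 + 1):
        P[i][0] = gap * i
    for j in range(1, m2 + 1):
        P[0][j] = gap * j
    for i in range(1, m1 + 1):
        for j in range(1, m2 + 1):
            s = match if alpha[i - 1] == beta[j - 1] else mismatch
            P[i][j] = max(P[i - 1][j - 1] + s, P[i - 1][j] + gap, P[i][j - 1] + gap)
    return P


def check_patterns(alpha, beta, match=5, mismatch=-1, gap=-2):
    """Return (equality holds, inequality holds) over all applicable cell boxes."""
    P = pivots(alpha, beta, match, mismatch, gap)
    m1, m2 = len(alpha), len(beta)

    def PD(i, j):  # diagonal value
        s = match if alpha[i - 1] == beta[j - 1] else mismatch
        return P[i - 1][j - 1] + s

    def PR(i, j):  # right value
        return P[i - 1][j] + gap

    def PL(i, j):  # left value
        return P[i][j - 1] + gap

    equality = all(PL(i, j + 1) == PR(i + 1, j)
                   for i in range(1, m1) for j in range(1, m2))
    inequality = all(PR(i, j) < PD(i, j + 1)
                     for i in range(1, m1 + 1) for j in range(1, m2))
    return equality, inequality


print(check_patterns("CTTGA", "CTAGA"))         # equal length: (True, True)
print(check_patterns("AGCTG", "TCAG"))          # unequal length: (True, True)
print(check_patterns("CTTGA", "CTAGA", gap=0))  # gap above mismatch: (True, False)
```

The last call uses a gap score of 0, which is not below the mismatch score of −1, and the inequality then fails; this matches the requirement that the gap penalty be smaller than every diagonal score.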

Conclusion
In this paper, the score matrix of the Needleman-Wunsch algorithm was exploited for possible patterns. Given any two arbitrary sequences of equal or unequal length, a general pattern was formulated as new a priori propositions and corollaries. These newly formulated propositions and corollaries are justified with their corresponding proofs.

Data Availability
No data were used to support this study.