An Improved Hashing Approach for Biological Sequence to Solve Exact Pattern Matching Problems



Introduction
Pattern matching is one of the most significant tasks in computer science. Finding a specific pattern within a larger pattern or text is known as pattern matching [1]. Problems of this type arise in many areas of the fourth industrial revolution, including networking, signal processing, data recovery, language processing, and artificial intelligence [2]. Pattern matching is also known as pattern searching or string matching.
String-matching algorithms make up a significant subclass of string algorithms. These algorithms look for the places in a lengthy string or text where a single string or multiple strings, collectively called patterns, appear.
String-matching techniques look for strings in a text (where strings are sequences of characters) that fit a predefined pattern (finite set) [3]. Let t be a text of length n and p be a pattern of length m, where m is less than or equal to n. The pattern and the search window are compared, character by character, to find the pattern in the text string. The term "search window" refers to the area of the text string that is compared to the pattern; the length of the search window equals the length of the pattern [4].
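As a minimal illustration of the search window, the following Python sketch slides a window of length m over the text and compares it with the pattern character by character. The function name `naive_search` is hypothetical and not taken from the paper; the sketch shows the plain brute-force scheme, not the proposed method.

```python
def naive_search(text: str, pattern: str) -> list[int]:
    """Return every index at which pattern occurs in text (brute force)."""
    n, m = len(text), len(pattern)
    matches = []
    # The search window covers text[pos : pos + m] and moves right by one.
    for pos in range(n - m + 1):
        k = 0
        # Compare the window with the pattern character by character.
        while k < m and text[pos + k] == pattern[k]:
            k += 1
        if k == m:
            matches.append(pos)
    return matches
```

Every alignment is inspected here, which is the cost that shift tables and hashing try to avoid.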
Bioinformatics is an interdisciplinary field that develops techniques and software tools for understanding biological data, particularly when the datasets are large and complex [5]. Pattern matching problems appear in many computational bioinformatics tasks, including basic local alignment search, biomarker discovery, sequence matching, homologous sequence identification, and proteogenomic mapping [6,7]. Pattern matching can be used in biotechnology, forensics, medical, and agricultural research to investigate probable disease or anomaly diagnoses [8]. Hashing's effectiveness in storage and search has made it a popular choice for large-scale pattern matching [9]. Hashing methods generate a hash value that helps search for a specific text pattern [10].
Pattern matching can be broken down into several categories. In terms of accuracy, the two primary types are exact and approximate pattern matching [11,12]. Exact matching searches for precise occurrences of the pattern within the text, whereas approximate pattern matching tolerates errors between the pattern and the matched text [13,14].
Based on the number of patterns, pattern matching can be divided into two categories: single and multiple patterns [14,15]. Single pattern matching searches for one pattern in the text, whereas multiple pattern matching searches for several patterns [16]. Exact pattern matching can be further classified into two distinct groups: online and offline. Online string matching preprocesses only the pattern, while offline string matching preprocesses the text and keeps the pattern unchanged [17,18]. Figure 1 displays the classification of string matching.
Although many algorithms have been proposed to solve the pattern matching problem, here we propose an approach based on the hashing technique. The hashing method notably improves efficiency and effectiveness when applied to pattern-matching challenges [20]. The success of this approach can be attributed to its versatile design, which allows it to perform well across various domains and facilitates single and multiple searches [21,22]. The main objective of this paper is to design and construct an efficient algorithm with better performance in exact pattern matching. The algorithm is applied within bioinformatics, specifically to the analysis of biological data such as DNA and protein sequences. The performance of the algorithm is evaluated by comparing it with various methods, including traditional, hybrid, and hashing-based pattern matching techniques. These techniques include Quick Search (QS), Maximum Shift (MS), Minimum Attempts and Comparisons (MAC), Back and Forth Matching (BFM), Efficient Hashing Method (EHM), and Hash-q Unique FNG (HqUF). The evaluation is conducted on publicly available datasets, using identical parameter settings for all methods. The contributions of this research can be summarized as follows:
(1) We propose a hashing approach that improves the EHM algorithm and includes two phases: preprocessing and searching.
(2) The preprocessing phase integrates the EHM algorithm's hashing technique, the QS bad character table, and a hash collision-reducing technique, whereas the searching phase follows our own approach.
(3) Experiments were undertaken to assess the efficacy of the proposed approach, which was then compared with several algorithms across three distinct datasets.
(4) We investigate whether the accuracy is outstanding compared with other work.
The subsequent sections of this work are structured as follows. Recently developed related works are outlined in Section 2. Section 3 comprehensively explains the proposed technique for addressing string-matching problems, including a working example, an analysis of the hashing method, and the time complexity; graphical representations enhance the clarity of the presented information. Section 4 presents the outcomes of the conducted tests, accompanied by graphical and tabular illustrations. The conclusions of our study are given in Section 5.

Literature Review
To address the pattern matching problem, a large number of algorithms employ either their own technique or a hybrid approach. Hash-based string-matching algorithms have recently become very popular for solving string-matching problems. However, the algorithm that first comes to mind for the pattern-finding problem is probably Brute Force (BF) [23]. Brute Force is a straightforward algorithm that compares the pattern with the text character by character and shifts the pattern one position to the right after each attempt [24]. The Boyer-Moore (BM) method introduces three ideas for searching for patterns: the right-to-left comparison, the good suffix rule, and the bad character rule [24,25]. The last two rules are computed in a preprocessing stage from the pattern and the chosen alphabet [26]. Aligning the pattern from left to right, as in the Brute-Force algorithm, while checking characters in right-to-left order makes the string-matching method work better [27]. The Quick Search (QS) algorithm is a variation of the BM algorithm [28]. The shift of the QS algorithm is determined by the bad character table value of the character immediately after the search window [29].

Applied Computational Intelligence and Soft Computing

The Back and Forth Matching (BFM) algorithm employs index-based techniques [30]. When the first and last characters of the search window match the pattern, the second and the second-to-last characters are checked simultaneously, and the remaining characters are checked in the same manner [19,31]. An efficient string-matching algorithm known as Maximum Shift (MS) was designed and implemented in 2014 [32]. [34]. This algorithm uses the hashing approach to identify patterns within a text [35].

Lecroq introduced the Hash-q algorithm, which calculates a hash value between 0 and 255 for each q-gram in the pattern p [36,37]. This algorithm computes a shift for each hash value and uses q values from 3 to 8. With its hash function, many hash collisions are generated for different q values. In a recent study, researchers proposed an efficient hashing approach, an enhanced form of Hash-q known as Hash-q with Unique FNG (HqUF) [38]. This algorithm starts from the ASCII codes of the characters, whose bit representations are A: 01000001, C: 01000011, T: 01010100, and G: 01000111. The basic idea is to use two unique bits per character, namely A: 00, C: 01, G: 11, and T: 10, instead of eight bits. These two bits are the second and third bits from the right of the ASCII code of the corresponding character. If the last q characters of the text window and of the pattern do not have equal hash values, the pattern is moved by m − q + 1; otherwise, the text and pattern are compared from the start up to the last q characters.

Exact string-matching algorithms are widely used in many disciplines, including bioinformatics, computational biology, text processing, and intrusion detection [39]. These algorithms are used to solve various problems related to string matching and pattern recognition. In bioinformatics, exact string-matching methods are used for a wide range of tasks, including genome assembly, sequence alignment, and gene prediction [40]. For DNA and protein sequence alignment, methods like Quick Search and Boyer-Moore are frequently used. These algorithms can quickly match and compare sequences to identify similarities and differences, which is critical for understanding the structure and function of genes and proteins. Exact string-matching algorithms are also used in computational biology to find patterns in DNA and protein sequences. For example, the Boyer-Moore algorithm can search for specific patterns in DNA sequences [41], which is very important for finding genes, regulatory elements, and other functional parts.

In text processing, exact string-matching algorithms are used for tasks such as spell-checking, plagiarism detection, and natural language processing [11]. For example, algorithms such as the Berry-Ravindran algorithm can be used to identify the difference between two strings, which helps with spell-checking and plagiarism detection. Intrusion detection systems use string matching to identify data packets that contain intrusion-related keywords. Whenever new data is received, it is compared with a database that contains all known dangerous code. If a match is detected, an alert is sent. Every intruding packet must be captured and identified using exact string-matching methods.
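The two-bit encoding described above for HqUF can be checked directly from the ASCII codes. The following Python sketch is only an illustration of that bit extraction, not the HqUF implementation; the helper name `two_bit_code` is hypothetical.

```python
def two_bit_code(base: str) -> int:
    """Return the second and third bits from the right of the ASCII code."""
    # Shift out the rightmost bit, then keep the next two bits.
    return (ord(base) >> 1) & 0b11


# Two-bit strings for the four DNA letters.
codes = {base: format(two_bit_code(base), "02b") for base in "ACGT"}
print(codes)  # {'A': '00', 'C': '01', 'G': '11', 'T': '10'}
```

The four codes are distinct only because the alphabet has four letters; two bits cannot separate a larger alphabet such as the 20 amino acids.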
The essential requirement of the abovementioned methods is to reduce the search time. These methods usually achieve this by creating a single function or a hybrid approach that combines the advantageous features of several individual algorithms. Performance can be affected by various factors, including the speed of the processor, the operating system, the database, the length of the string or pattern, the frequency with which it occurs, and the size of the alphabet. Although the time complexity of the BF algorithm is high, it applies to all fields; nonetheless, it performs slowly when dealing with lengthy patterns and text. The BM algorithm is superior to the BF algorithm, although its performance varies with the length of the pattern and the alphabet set. QS is quick and simple to use in real-world applications for both short and long sequences. However, the time required for QS preprocessing grows as the alphabet size of the pattern increases. Compared to previous algorithms, BFM shows a considerable boost in its ability to locate strings inside large text files. However, because a preprocessing phase must be finished before searching can begin, BFM performs better when the text and the pattern are short.

In experimental settings, the MS algorithm yields superior results for English text, DNA sequences, and protein sequences; nevertheless, its performance suffers when dealing with small alphabet sets. Even though the MAC algorithm makes only a limited number of attempts and comparisons, it needs the most time to execute because of its index-based approach. The Rabin-Karp algorithm offers a rapid means of determining the presence of a pattern inside a given text without exhaustively examining all potential positions within the text. However, this technique may exhibit suboptimal time complexity when numerous hash collisions arise. The Hash-q algorithm works fast for small patterns over small alphabets but shows the worst output for large patterns because it has to calculate a hash value between 0 and 255 for each q-gram. The HqUF technique effectively mitigates hash collisions for DNA sequences; it offers good hashing capabilities and generates hash values efficiently. The main problem with the HqUF algorithm is that it only produces unique bits for DNA sequences, and it is complicated to create unique bits for proteins or other sequences that contain more than four characters.
The HqUF method has been identified as a highly effective hash-based text matching technique. This algorithm efficiently adapts the Hash-q algorithm and is well suited to DNA sequences. Nevertheless, a limitation of this technique is that it can generate distinct hash values only for DNA sequences: it cannot generate separate bits if a dataset contains more than four characters. Research has also shown that no single hashing-based algorithm capable of effectively processing all types of data has yet been identified as the best option. In light of the shortcomings of existing algorithms, this research aims to introduce a novel and effective approach that generates distinct hash values across all datasets to prevent hash collisions. If hash collisions can be eliminated, then runtime, character comparisons, and hash comparisons can all be decreased.

Proposed Approach
For string-matching problems, we observe that existing algorithms are assessed by their number of shifts, number of comparisons, and execution time. We concentrated our research on the string-matching algorithm and propose one that may decrease the number of shifts and comparisons for sequential pattern matching. We have improved the hashing method used in the Efficient Hashing Method (EHM) [19]. Our proposed algorithm is divided into two phases: preprocessing and searching.

The Proposed Hash Function.
The hash function operates on the ASCII value of each character. The basic idea of the hashing method is to sum all the ASCII values of a particular string and take the sum modulo a specific prime number to get a remainder. The following equation is used to generate the hash value:

$$h(s) = \left( \sum_{i=0}^{n-1} \mathrm{ASCII}(s_i) \right) \bmod q \tag{3}$$

Here, n is the pattern length or substring length, and q denotes a predefined prime number, where prime numbers are the natural numbers greater than one that are divisible solely by one and by themselves. The likelihood of producing distinct and nonrepetitive values increases when prime numbers are used in the hashing process. As an illustration, consider the string "ResearchTopic": a different hash value may be obtained by assigning a prime number to each letter and summing these values.
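The hash function described above can be sketched in a few lines of Python. The names `ascii_sum` and `hash_value` are hypothetical; the paper's own implementation is in C++ and is not reproduced here.

```python
def ascii_sum(s: str) -> int:
    """Sum of the ASCII codes of all characters in s."""
    return sum(ord(ch) for ch in s)


def hash_value(s: str, q: int) -> int:
    """Remainder of the ASCII sum after division by the prime q."""
    return ascii_sum(s) % q


# The pattern GATTCTA from the working example, with the prime 17.
print(ascii_sum("GATTCTA"), hash_value("GATTCTA", 17))  # 520 10
```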
Hash collisions can occur with this equation. Consider two patterns from a DNA sequence, p1 = ACGTTGA and p2 = TAGCACG, and the prime number 17. The hash values are computed as follows:

$$h(p_1) = (65 + 67 + 71 + 84 + 84 + 71 + 65) \bmod 17 = 507 \bmod 17 = 14$$
$$h(p_2) = (84 + 65 + 71 + 67 + 65 + 67 + 71) \bmod 17 = 490 \bmod 17 = 14$$

These two patterns generate the same hash value, although they are different patterns. This is known as a hash collision. We reduce hash collisions by also dividing the ASCII sum of the pattern by the same prime number to obtain a quotient. The following equation is used to generate the quotient:

$$r(s) = \left\lfloor \frac{\sum_{i=0}^{n-1} \mathrm{ASCII}(s_i)}{q} \right\rfloor \tag{5}$$

Here, n is the pattern or substring length, and q denotes a predefined prime number.

For the patterns above, p1 = ACGTTGA and p2 = TAGCACG, and the prime number 17, the quotients are computed as follows:

$$r(p_1) = \lfloor 507 / 17 \rfloor = 29, \qquad r(p_2) = \lfloor 490 / 17 \rfloor = 28$$

These two patterns generate different quotients, although their hash values are the same.
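The collision and its resolution by the quotient can be reproduced with a short Python sketch. The helper `hash_and_quotient` is a hypothetical name used only for this illustration.

```python
def hash_and_quotient(s: str, q: int) -> tuple[int, int]:
    """Return (remainder, quotient) of the ASCII sum of s divided by q."""
    total = sum(ord(ch) for ch in s)
    return total % q, total // q


p1, p2, q = "ACGTTGA", "TAGCACG", 17
h1, r1 = hash_and_quotient(p1, q)
h2, r2 = hash_and_quotient(p2, q)
print(h1, r1)  # 14 29
print(h2, r2)  # 14 28
```

Because the ASCII sum equals q · r + h, two strings agree on both values exactly when their ASCII sums are equal. Strings that are permutations of the same characters therefore still share both values, which is one reason the character-by-character check in the searching phase remains necessary.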

Preprocessing Phase.
Our proposed algorithm is online: it preprocesses the pattern and keeps the text intact. First, the algorithm calculates the hash of the pattern using the following equation:

$$h(p) = \left( \sum_{i=0}^{m-1} \mathrm{ASCII}(p_i) \right) \bmod q$$

Then, the algorithm calculates the quotient of the pattern using the following equation:

$$r(p) = \left\lfloor \frac{\sum_{i=0}^{m-1} \mathrm{ASCII}(p_i)}{q} \right\rfloor$$

To shift the pattern within the text, our proposed algorithm uses the good properties of the QS algorithm. The shift value of the QS algorithm depends on the character immediately after the search window. The following function generates the next shifting distance to skip character comparisons.
Here, m is the length of pattern P, i denotes the pattern index from 0 to m − 1, and x stands for any character of the text.
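For orientation, the standard Quick Search bad-character table can be sketched as below. This is an assumption: it follows the textbook Quick Search rule, in which each character of the pattern maps to m minus the index of its rightmost occurrence and every other character maps to m + 1. The paper's own table (Table 1) is not reproduced here, and the shift values quoted in the working example may follow a different convention. The name `qs_table` is hypothetical.

```python
def qs_table(pattern: str) -> dict[str, int]:
    """Textbook Quick Search shifts: m - (rightmost index of the character)."""
    m = len(pattern)
    table = {}
    for i, ch in enumerate(pattern):
        # Later occurrences overwrite earlier ones, so the rightmost wins.
        table[ch] = m - i
    return table


def qs_shift(table: dict[str, int], m: int, x: str) -> int:
    """Shift for text character x; characters absent from the pattern give m + 1."""
    return table.get(x, m + 1)


table = qs_table("GCAGAGAG")
print(table)  # {'G': 1, 'C': 7, 'A': 2}
```

Using the rightmost occurrence keeps the shift safe, in the sense that no alignment containing a match is skipped.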

Searching Phase.
During the searching phase, a window of size m slides along the text, starting at position 0. After each attempt, the window is shifted to the right until the end of the text is reached. First, the pattern's first character is compared with the first character of the search window. If they match, a substring of the pattern's length is taken from the window, and its hash value and quotient are computed using the hash and quotient equations above. The hash and quotient of the pattern are then compared with those of the substring. If both match, the final characters of the pattern and the substring are compared. If these match, the second characters from the left and from the right of the substring and pattern are checked together. If there is any difference, the algorithm moves the pattern according to the QS table values saved during preprocessing. If there is no difference, the third characters from the left and from the right are checked, and the remaining characters are compared in the same way.
Algorithm 1 presents the preprocessing stage's pseudocode, and Algorithm 2 presents the searching stage's pseudocode.

Algorithm 1: Preprocessing.
    // Preprocess only the pattern characters and take any prime number
    q ⟵ prime number
    sum ⟵ 0
    for i = 0 to m − 1 do
        sum ⟵ sum + ASCII(p_i)
    end for
    // Generate hash value
    h(p) ⟵ sum mod q
    // Generate quotient value using predefined prime number
    r(p) ⟵ sum divide q
    // Generate QS table for the pattern
    qsBc ⟵ QS bad character table of p

Algorithm 2: Searching.
    j ⟵ 0
    while j ≤ n − m do
        if p_0 = t_j then
            // Create substring based on pattern length
            s ⟵ t_j … t_(j+m−1)
            sum ⟵ 0
            for i = 0 to m − 1 do
                sum ⟵ sum + ASCII(s_i)
            end for
            // Create hash value using predefined prime number
            h(s) ⟵ sum mod q
            // Create quotient value using predefined prime number
            r(s) ⟵ sum divide q
            if h(p) = h(s) and r(p) = r(s) then
                if p_(m−1) = s_(m−1) then
                    for k = 1 to ⌊(m − 1)/2⌋ do
                        if p_k ≠ s_k or p_(m−1−k) ≠ s_(m−1−k) then
                            break
                        end if
                    end for
                end if
            end if
            if every character is matched then
                report an occurrence of the pattern at position j
            end if
        end if
        // Shift by the QS value of the character after the window (stop if none)
        j ⟵ j + qsBc[t_(j+m)]
    end while

When the substring hash and quotient are equal to the pattern hash and quotient, respectively, the character comparison is performed according to the search technique described above. Figure 2 presents a diagram of the proposed algorithm for the string-matching problem; it mainly represents the data flow of the algorithm.
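The two phases described above can be combined into a compact Python sketch. It is an illustration under stated assumptions rather than the authors' C++ implementation: the function names are hypothetical, the shift table follows the textbook Quick Search rule (rightmost occurrence), and the default prime 17 is simply the value used in the working example.

```python
def qs_table(pattern: str) -> dict[str, int]:
    """Textbook Quick Search shifts (assumption: rightmost occurrence)."""
    m = len(pattern)
    return {ch: m - i for i, ch in enumerate(pattern)}


def hash_quotient_search(text: str, pattern: str, q: int = 17) -> list[int]:
    """Return all start indices of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Preprocessing: hash, quotient, and shift table of the pattern.
    p_sum = sum(map(ord, pattern))
    hp, rp = p_sum % q, p_sum // q
    shift = qs_table(pattern)
    matches, pos = [], 0
    while pos <= n - m:
        # Hash the window only when its first character matches.
        if text[pos] == pattern[0]:
            window = text[pos:pos + m]
            s_sum = sum(map(ord, window))
            if (s_sum % q, s_sum // q) == (hp, rp) and window[-1] == pattern[-1]:
                # Check the remaining characters in pairs, moving inward.
                lo, hi = 1, m - 2
                while lo <= hi and window[lo] == pattern[lo] and window[hi] == pattern[hi]:
                    lo, hi = lo + 1, hi - 1
                if lo > hi:
                    matches.append(pos)
        if pos + m >= n:
            break  # no character follows the window
        pos += shift.get(text[pos + m], m + 1)
    return matches


print(hash_quotient_search("TACGGCTCGAGAAAAAATGATTCTAATTCTGTA", "GATTCTA"))  # [18]
```

The sketch reports the occurrence at 0-based index 18 for the text and pattern of the working example. Its intermediate shifts may differ from those quoted in the paper, because the shift table here is the textbook one.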

Working Example.
The aloe vera plant is a succulent that retains water as a gel in its leaves. This moisturising gel is well suited to sunburns, insect bites, minor cuts and wounds, and other skin problems. The chloroplast nucleotide sequence of an Aloe vera voucher was used to test our proposed approach. From the FASTA file, we selected a small portion of the nucleotide sequence, from index 4970 to 5002 (just 33 characters) [42]. The DNA sequence under consideration is as follows: the text is t = TACGGCTCGAGAAAAAATGATTCTAATTCTGTA, the pattern is p = GATTCTA, and the prime number is 17.

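For the pattern, the ASCII sum is 520, so h(p) = 520 mod 17 = 10 and r(p) = 30. Window positions below are counted from 0.

1st attempt: the window starts at position 0 (TACGGCT). The first character of the window, T, does not match the first character of the pattern, G. Shift by qsBc[C] = 3.

2nd attempt: the window starts at position 3 (GGCTCGA). The first characters match, so the hash is computed: h(s) = [ASCII(G) + ASCII(G) + ASCII(C) + ASCII(T) + ASCII(C) + ASCII(G) + ASCII(A)] mod 17 = 496 mod 17 = 3. Here, h(p) ≠ h(s), so shift by qsBc[G] = 7.

3rd attempt: the window starts at position 10 (GAAAAAA). The first characters match, so the hash is computed: h(s) = [ASCII(G) + ASCII(A) + ASCII(A) + ASCII(A) + ASCII(A) + ASCII(A) + ASCII(A)] mod 17 = 461 mod 17 = 2. Here, h(p) ≠ h(s), so shift by qsBc[T] = 5.

4th attempt: the window starts at position 15 (AATGATT). The first character of the window, A, does not match G. Shift by qsBc[C] = 3.

5th attempt: the window starts at position 18 (GATTCTA). As h(p) = h(s) and r(p) = r(s), the next step is performed: the last characters are compared, and then the remaining characters are compared in pairs from both ends. Every character matches, so the pattern is found.

After the match, the window is shifted again. At the final attempt, the shift qsBc[T] = 5 would move the window beyond the end of the text, so the search stops.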
Our proposed algorithm needs six shifts and thirteen character comparisons to find the pattern within the text.

Hashing Method Analysis.
We used a 200 MB DNA sequence to test the effectiveness of the hashing technique in our proposed approach against various prime values [43]. Our approach divides the ASCII sum of the pattern by a predetermined prime number and takes the remainder as the hash value. The hash value is computed by the following equation:

$$h(p) = \left( \sum_{i=0}^{n-1} \mathrm{ASCII}(p_i) \right) \bmod q$$

Here, n is the pattern or substring length, and q denotes a predefined prime number. Consider a pattern p = ACGTA of a DNA sequence and the prime numbers 3 and 229, which were chosen randomly. The hash values are computed as follows:

$$h(p) = (65 + 67 + 71 + 84 + 65) \bmod 3 = 352 \bmod 3 = 1$$
$$h(p) = (65 + 67 + 71 + 84 + 65) \bmod 229 = 352 \bmod 229 = 123$$

When we divide the sum by a smaller prime number (3), the remainder is small (1). With so few possible remainders, the substring hash value coincides with the pattern hash value more often, and as a result the number of attempts and comparisons increases. When we divide the sum by a larger prime number (229), the remainder is larger (123). The substring hash value and the pattern hash value then coincide less often, and as a result the number of attempts and comparisons decreases.
After analyzing the hashing process, it can be concluded that a larger prime number results in fewer attempts and comparisons, leading to favourable output. However, the prime number must be less than the sum of the ASCII values of the letters.
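The effect of the prime on the remainder alone can be illustrated with a small Python sketch. It counts, for a given prime, how many windows of the text have the same remainder as the pattern. The function name `count_hash_hits` is hypothetical, the quotient is deliberately left out, and the short text of the working example stands in for the 200 MB dataset, so the counts only indicate the trend.

```python
def count_hash_hits(text: str, pattern: str, q: int) -> int:
    """Count windows whose ASCII-sum remainder equals the pattern's remainder."""
    m = len(pattern)
    target = sum(map(ord, pattern)) % q
    return sum(
        1
        for pos in range(len(text) - m + 1)
        if sum(map(ord, text[pos:pos + m])) % q == target
    )


text = "TACGGCTCGAGAAAAAATGATTCTAATTCTGTA"
small = count_hash_hits(text, "GATTCTA", 3)    # few possible remainders
large = count_hash_hits(text, "GATTCTA", 229)  # many possible remainders
print(small >= large)  # True
```

For seven DNA letters the ASCII sum lies between 455 and 588, a range narrower than 229, so with the prime 229 two windows share a remainder only when their sums are equal. With the prime 3, every window whose sum differs by a multiple of 3 also counts as a hit.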
Figures 3-5 present the experimental findings on the number of attempts, the number of character comparisons, and the execution time, respectively. We used the prime numbers 3, 7, 11, 13, 17, 39, 47, 73, 97, and 229, which were chosen randomly or pseudorandomly. We ran our algorithm ten times for each prime number to obtain a more reliable result. We used a pattern of length ten taken from the DNA text. The findings indicate an inverse relationship between the magnitude of the prime number and the number of attempts and character comparisons required: the larger the prime number, the fewer attempts and character comparisons. This is because the larger the prime number, the fewer times the substring hash value and the pattern hash value match. However, the impact on execution time is insignificant. Whether the prime number is large or small, execution time varies irregularly because the remainder depends on both the sum of the ASCII values and the prime number. The hash is computed by summing the ASCII values of the characters, which produces a large value.

Time Complexities in Perspective and Comparison to the Proposed Algorithm.
The time a statement takes to execute depends on its complexity. The complexity of the preprocessing stage of the proposed algorithm is O(m). The time complexity of the search phase can be broken down into two cases as follows:
Case I: In the worst-case scenario, every character in the text matches the pattern. The worst case occurs when the characters in the pattern keep matching those in the text at every alignment. For example, for the text t = "LLLLLLLLLLLLLLLLLLLLLL" and the pattern p = "LLLLL", the worst-case complexity is O(nm).
Case II: The pattern p = "FFFFF" and the text t = "SSSSSSSSSSSS" are used to evaluate the best case of the proposed search phase. The searching stage relies directly on the preprocessing stage, and the maximum shift is always the largest QS shift value, which is m + 1.
The best case is therefore O(n/(m + 1)).
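The two cases can be made concrete by counting window alignments. The Python sketch below counts how many attempts a Quick Search style shift makes on the two example inputs; it assumes the textbook shift table, and `count_attempts` is a hypothetical helper rather than part of the proposed algorithm.

```python
def count_attempts(text: str, pattern: str) -> int:
    """Count the window alignments visited with textbook Quick Search shifts."""
    n, m = len(text), len(pattern)
    shift = {ch: m - i for i, ch in enumerate(pattern)}
    attempts, pos = 0, 0
    while pos <= n - m:
        attempts += 1
        if pos + m >= n:
            break  # no character follows the window
        # Characters absent from the pattern give the maximum shift m + 1.
        pos += shift.get(text[pos + m], m + 1)
    return attempts


print(count_attempts("S" * 12, "F" * 5))  # 2 attempts, i.e. n / (m + 1)
print(count_attempts("L" * 22, "L" * 5))  # 18 attempts, i.e. n - m + 1
```

In the worst case each of the n − m + 1 alignments also requires work proportional to m, which gives the O(nm) bound.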
Table 2 shows the time complexity of different string-matching algorithms.

Results and Discussion
We used three different types of data to examine the performance of our proposed algorithm: the E. coli dataset, a DNA sequence, and a protein sequence. Escherichia coli (E. coli) is a small dataset with a length of 4,686,137 characters. This dataset contains only DNA sequence letters (A, C, T, and G). Text files containing the E. coli dataset were acquired from the NCBI website [38,44]. We used large datasets of 200 MB for DNA (an alphabet of 4 characters) and protein (an alphabet of 20 characters) sequences taken from the "Pizza & Chili Corpus" website [43]. For the E. coli dataset, patterns of length 4, 8, 16, 32, 64, 128, 256, 512, and 1024 were chosen randomly, as with the HqUF algorithm. The pattern lengths for DNA and protein sequences are 3, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 150, and 200, and the patterns are randomly selected from the text file. To determine the performance of our proposed HAPM algorithm, 20 distinct patterns were chosen randomly for each pattern length, and each was executed 20 times. The proposed algorithm was implemented in the C++ programming language and run using Code::Blocks version 17.12. The computer is equipped with an Intel Core i5-3210M processor operating at 2.50 GHz, 8 GB of random access memory (RAM), HD Graphics 3000, and a 750 GB hard drive.

The E. coli Dataset Outcomes.
Table 3 displays the average run time. The largest speed increase, 73.04%, is obtained for pattern length 64, and the lowest, 15.38%, for pattern length 512. The table also shows that for small patterns (pattern lengths 4 to 64) the execution time of our algorithm is much lower than that of the HqUF algorithm, which means that the speed increases considerably. This is because a shorter pattern matches the text more often, so searching for shorter patterns takes longer.
Table 4 displays the average number of shifts or attempts. The largest improvement, 31.04%, occurs for pattern length 256, and the lowest, 2.43%, for pattern length 32. Improvements vary for the other pattern lengths: results are often good for small patterns and sometimes good for large patterns. The number of attempts depends on the shift value of the pattern.
Table 5 displays the average number of character comparisons. The largest improvement, 28.69%, is obtained for pattern length 1024, while the lowest, 11.43%, is observed for pattern length 32. For the other pattern lengths, the improvements vary. Results are frequently favourable for large patterns but can also be favourable for small patterns. The number of character comparisons depends on the hash value of the pattern.
Overall, our HAPM algorithm outperforms the HqUF algorithm for each pattern length in terms of execution time, attempts, and character comparisons.

The DNA Dataset Outcomes.
The efficiency of the BFM algorithm, in terms of execution time, number of attempts, and comparisons, is notably superior to that of the Knuth-Morris-Pratt (KMP) and Boyer-Moore-Horspool (BMH) algorithms [30]. The BMH algorithm is a simplified and enhanced version of the BM algorithm. When the EHM algorithm was compared with the BFM, MAC, MS, and QS algorithms in terms of the number of attempts and comparisons, it demonstrated superior performance for both short and long patterns [19]. The HqUF method performs better than the Hash-q method, but it works only for DNA sequences [38].

Algorithms    Time complexity    Data
BF [23]       O(mn)              All datasets
BM [24]       O(mn)              All datasets
QS [29]       O(mn)              All datasets
BFM [30]      O

Figure 6 displays the average number of attempts. From these outcomes, it is clear that the EHM algorithm performs better than the BFM algorithm for all patterns. The HqUF algorithm performs better than the EHM algorithm for all large patterns, but the EHM algorithm performs better for small patterns. This is because the HqUF algorithm uses 2 bits for each DNA character, so for small patterns the last q characters of the pattern match the corresponding q characters of the text most of the time. Our HAPM algorithm shows better results than the HqUF algorithm for all small and large patterns. It also performs better than the EHM and BFM algorithms.
Figure 7 displays the average number of character comparisons. This outcome demonstrates that the EHM algorithm is better than the BFM algorithm for all patterns, whereas the BFM algorithm displays unstable behaviour. The HqUF algorithm displays better results than the EHM algorithm for all large patterns but shows the worst outcomes for small patterns such as pattern length 3. This behaviour of the HqUF technique stems from its 2-bit encoding scheme for each DNA letter: with this encoding, the last q characters of the pattern match the corresponding q characters in the text more often, mainly for smaller patterns. Both our proposed HAPM algorithm and the HqUF algorithm show stable behaviour. However, our HAPM algorithm performs better than the HqUF, EHM, and BFM algorithms for all pattern lengths.
Figure 8 displays the average execution time. These findings show that the EHM algorithm's execution time is the highest for all large pattern lengths, although its performance with very small pattern lengths shows some encouraging signs. The EHM algorithm is a substring-based approach, which is why it takes so long to run: the time required to hash text substrings increases as the pattern length increases. The BFM algorithm shows better results than the EHM algorithm for all patterns. The HqUF approach outperforms the BFM algorithm for large patterns but underperforms for short pattern lengths because of the time its 2-bit encoding requires. Our proposed algorithm shows stable behaviour and better results than the HqUF algorithm for all pattern lengths, but worse results than the EHM and BFM algorithms for short pattern lengths. The poor outcomes for small patterns arise because our HAPM algorithm produces smaller shift values for small patterns and takes longer to hash them.

The Protein Dataset Outcomes.
The MS algorithm is more efficient than four other string-matching algorithms: Quick Search, Horspool, Smith, and Berry-Ravindran [32]. In addition, the MAC algorithm outperforms the MS and IBS (index-based shift) algorithms regarding the number of attempts and the total number of character comparisons [33].
Figure 9 displays the average number of attempts. Both the MS and QS algorithms exhibit unstable behaviour, with the MS algorithm doing significantly better than the QS algorithm across the board. The MAC algorithm exhibits stable behaviour for all pattern lengths and outperforms the MS and QS algorithms. The BFM algorithm generally performs better than the MAC, MS, and QS algorithms.

Figure 10 displays the average number of comparisons. The results indicate that the MS algorithm surpasses the QS algorithm, while the MAC algorithm exceeds both the MS and QS algorithms across all pattern lengths. The BFM method performs better than the MAC algorithm, except for pattern length 20. With the MAC algorithm, specific patterns need fewer attempts because of the indexing strategy, and this is the underlying cause of the reduced character comparisons for those patterns. The EHM algorithm, in turn, shows better outcomes than the BFM algorithm across all pattern lengths. Our proposed HAPM algorithm outperforms the EHM, BFM, MAC, MS, and QS algorithms across all pattern lengths.
Figure 11 displays the average execution time. These results show that the execution time of the QS algorithm is the highest for all pattern lengths. The MS algorithm performs better than the QS and MAC algorithms. Although the numbers of attempts and character comparisons of the MAC algorithm are lower than those of the MS algorithm, its execution time is higher because the MAC algorithm is an index-based method: its preprocessing step takes more time to index the alphabet. The execution time of the BFM algorithm is lower than that of the MAC, MS, and QS algorithms for all patterns and lower than that of the EHM algorithm for all large pattern lengths, but higher than or almost equal to that of the EHM algorithm for some small patterns. The EHM algorithm is a substring-based hashing method whose execution time is higher than that of the BFM algorithm despite its small number of attempts and character comparisons, because hashing text substrings takes longer for longer patterns. Our proposed HAPM algorithm shows stable behaviour and better results than all the other algorithms for all pattern lengths.

Conclusion
In computer science, string matching has grown significantly in popularity and will be crucial to future technology development. The number of hashing-based string-matching algorithms keeps growing, but the most vital objective is reducing hash collisions. We have proposed a hashing-based algorithm that reduces hash collisions. Three different data types were used to test the performance of our proposed algorithm, and 20 distinct patterns for each pattern length were randomly selected from the dataset. Each pattern was executed 20 times, and the average value was taken. We implemented six alternative algorithms to evaluate the performance of our approach and tested them on the same datasets. Comparing our proposed HAPM algorithm with the HqUF method on the E. coli dataset, we achieved a speed-up of up to 73.04% in average run time and improvements of up to 31.04% and 28.69% in the average number of shifts and comparisons, respectively. Our algorithm performs better on the DNA and protein datasets than the previous algorithms in terms of the average number of attempts and comparisons. Still, it shows worse average execution times for some short patterns. In future research, we will develop new strategies based on the suggested hash function to speed up the execution for short patterns.

Figure 3: Number of attempts for prime numbers using DNA sequence.

Figure 4: Number of comparisons for prime numbers using DNA sequence.

Figure 9: Average number of attempts using protein sequence.

Figure 10: Average number of character comparisons using protein sequence.

Figure 11: Average execution time using protein sequence.

Table 1: Quick Search table.

Table 2: The time complexity of different string-matching algorithms.

Table 3: Average run time using E. coli.

Table 4: Average number of attempts using E. coli dataset.

Table 5: Average number of character comparisons using E. coli dataset.

Regarding the average number of attempts on the protein sequence (Figure 9), for some patterns (such as pattern lengths 20, 50, and 90) the BFM algorithm shows results close to those of the MAC algorithm. The reason is that the MAC algorithm is index-based: if the first character of a pattern is less frequent in the text, fewer searches are required. The EHM algorithm offers better performance than the BFM algorithm. Our proposed HAPM algorithm outperforms the EHM, BFM, MAC, MS, and QS algorithms for all pattern lengths.