An Artificial Intelligence Approach for Verifying Persons by Employing the Deoxyribonucleic Acid (DNA) Nucleotides

Deoxyribonucleic acid (DNA) can be considered one of the most useful biometrics. It has been used effectively for recognizing persons. However, there is still a need for a new approach to verifying humans, especially after recent large-scale wars in which many people died or went missing. Such an approach should provide high personal verification performance. In this paper, a personal recognition approach based on artificial intelligence is proposed. This approach is called the artificial DNA algorithm for recognition (ADAR). It utilizes a unique identity for each person acquired from DNA nucleotides, and it can verify individuals efficiently with high performance. The ADAR has been designed and applied to multiple datasets, namely, the DNA classification (DC), sample DNA sequence (SDS), human DNA sequences (HDS), and DNA sequences (DS). For all datasets, a value of 0% is achieved for both the false acceptance rate (FAR) and the false rejection rate (FRR).


Introduction
With advances in science and technology, it is now possible to authenticate people in order to achieve high levels of security. Maintaining private data and meeting the increased demands for security have become important matters. There are several methods that use biometrics to confirm identity, such as fingerprint [1], palm print [2], iris print [3], and voice print [4]. Biometrics involve measuring an individual's distinctive physical or behavioral traits [5]. The Greek word "bio" means life and "metric" means measuring; the two are combined to form the term "biometric" [6].
In fact, different terms are associated with the word "biometrics", such as verification, identification, classification, authentication, and recognition. They can be hard to distinguish from one another; however, these terms have been clarified over years of work. Verification uses a one-to-one policy, where a user declares his/her identity so that it can be compared with specific information belonging to the same user. Then, a decision about accepting or rejecting the personal identity claim is provided [7]. Identification uses a one-to-many policy. Here, the information provided by a user must be matched against the stored information of all users. So, there is no need to provide a user's identity, and the decision either assigns an identity or declines to declare one [8]. Classification refers to categorizing information into a certain group or set [9]. Authentication refers to the process of proving that an action is genuine. In computer science, this term is typically associated with confirming a user's identity [10]. Recognition is a general term, and it can be used to refer to any of the previous biometric styles (verification, identification, classification, or authentication).
Deoxyribonucleic acid (DNA) can offer trustworthy personal verification. It is inherently digital and remains unchanged during a person's lifetime and even after death [11]. DNA has the form of a double helix; it comprises two connected strands that twist around one another to resemble a spiral ladder. Deoxyribose and phosphate are the main components of the backbone of each strand. Each sugar molecule in the DNA is attached to one of four bases (or nucleotides): adenine (A), cytosine (C), guanine (G), or thymine (T) [12]. A, T, C, and G denote the chemical bases that connect the two strands together. Figure 1 shows a sample of the DNA with its chemical components.
DNA bases pair up with each other, A with T and C with G, to form units called base pairs. Each base is also attached to a sugar molecule and a phosphate molecule. The sequence of these pairs differs from one person to another, making the DNA unique for each individual; it can therefore be used for personal verification or any other recognition style.
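The pairing rule (A with T, C with G) means that one strand determines the other. A minimal Python sketch of this rule is given below; the names `COMPLEMENT` and `complementary_strand` are illustrative and do not come from the paper.

```python
# Watson-Crick base pairing: A pairs with T, and C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complementary_strand(strand):
    """Return the base-by-base complement of a DNA strand.

    Hypothetical helper for illustration; raises KeyError if the
    strand contains a symbol other than A, T, C, or G.
    """
    return "".join(COMPLEMENT[base] for base in strand.upper())


if __name__ == "__main__":
    print(complementary_strand("ATCG"))
```

Applying the function twice returns the original strand, which reflects the fact that the pairing is symmetric.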
DNA is known to be highly valuable for personal verification. However, a more effective DNA system is still required. This need has become evident in Iraq, where wars and other crises have left many people dead or missing. Such a DNA verification system should be able to handle a huge number of samples and provide precise outcomes. This work presents a new system based on artificial intelligence that employs a unique DNA pattern of the nucleotides (A, T, C, and G) for verifying persons. The proposed approach is called the artificial DNA algorithm for recognition (ADAR). It can provide high performance, it facilitates searching for DNA verification samples, and its efficiency is demonstrated on four datasets.
The remaining sections are organized as follows: Section 2 presents the literature review, Section 3 describes the ADAR theory, Section 4 discusses the experimental work, and Section 5 provides the conclusion.

Literature Review
There are many prior DNA studies that can be highlighted. In 2005, Mitra presented a survey of the roles of different soft computing techniques, such as fuzzy sets, artificial neural networks (ANNs), evolutionary computation (EC), and support vector machines (SVMs), in classifying and recognizing the major patterns of DNA genomic sequences and protein architecture. The SVM classifier recorded the highest accuracy and least error compared to the other applied methods [14]. In 2009, Wei proposed a system for categorizing the DNA sequences of four types of bacteria. It consists of the following steps: extracting DNA sequence features, constructing the ANN model, and classifying data. After training the ANN model, the accuracies of classifying the four types of bacteria for lengthy and repetitive DNA sequences in the utilized dataset were 92.9%, 90.2%, 80.4%, and 41.7% [15]. In 2012, Khashei et al. presented a novel hybrid model integrating AI and fuzzy logic for the analysis of gene data. Comparative evaluations against conventional approaches such as ANN, linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), K-nearest neighbor (KNN), and SVM demonstrated that the proposed model achieved enhanced classification accuracy. This suggests that the hybrid model holds promise as a viable alternative technique, particularly in scenarios where data scarcity is a concern [16].
In 2017, Pashaei et al. concentrated on the human genome by considering splice site identification with random forest. The effectiveness of the employed classifiers was mostly influenced by the feature extraction and feature selection techniques used in DNA encoding. The feature selection methods removed extraneous information, whereas the feature extraction methods attempted to extract as much information from the DNA sequences as possible. The applied random forest was examined as a means of feature selection and classification in the splice site domain [17]. In 2018, Pashaei and Aydin worked on Markovian encoding models. Recognition of splice sites for persons was considered. A third-order Markov model with SVM (MM3-SVM) was proposed. It outperformed the best-known state-of-the-art methods [18]. In the same year, Kaniwa and Phuthego explained how genetics is affected by next-generation sequencing, which rapidly generates DNA and ribonucleic acid (RNA) sequences. Madrid, Spain, was the site of this study. It was based on the fundamental notion that DNA sequence information was expanded, which made simple and affordable analysis possible [19]. In 2020, Sun et al. devised and implemented a novel multilayer deep neural network (DNN) for survival prediction in a genome-wide association study. This DNN survival model exhibited superior predictive accuracy compared to several existing models, while also successfully identifying clinically significant risk subgroups. The model employed an effective approach for capturing complex architectures among genetic variants. The model was evaluated on genome-wide association studies (GWAS) data from two large-scale randomized clinical trials involving over 7800 participants with age-related macular degeneration (AMD) [20].
In 2021, Alatrany et al. proposed a hybrid machine learning (ML) technique for the prediction of Alzheimer's disease using genome sequences. The most important single-nucleotide polymorphisms linked to Alzheimer's disease were chosen. Using data from a random forest, a deep learning (DL) model for predicting the illness was then provided. Utilizing a convolutional neural network (CNN) and multilayer perceptron (MLP), the simulation results showed that the hybrid model was effective in predicting people who had Alzheimer's disease [21]. In 2022, Manhal investigated the use of DNA to identify individuals. An efficient algorithm was used to find the distinctive DNA patterns. The unique personal DNA pattern (UPDP) was used for personal identification. Four databases were employed, and they all yielded low reported errors [22]. In the same year, Rukhsar et al. introduced DL analysis of RNA sequence gene expression data for cancer classification. Five different kinds of cancer data from the Mendeley archive were examined. The appropriate characteristics were retrieved and chosen using DL. Eight DL algorithms were employed to accomplish classification in the final phase. The DL classifiers were evaluated using k-fold cross-validation and four different data splitting techniques. Among the evaluated classifiers, the CNN exhibited the highest overall performance [23]. Also in the same year, Hamed et al. provided a review on enhancing algorithms for pattern matching. This survey concentrated on biological sequences. It presented analyses of techniques, efficiency, and complexity. Furthermore, it offered comparisons between various matching algorithms [24]. In 2023, Ibrahim et al. proposed a novel fast technique for pattern matching based on biological sequences. This work was designed to speed up the search for DNA sequence patterns [25]. In the same year, Hamed et al. investigated the efficiency of optimizing classification with machine learning, focusing on pattern matching. This study suggested a new DNA sequence classification model that fused a pattern-matching procedure with machine learning techniques [26].
This paper contributes to previous work by introducing an artificial intelligence algorithm named the ADAR. This algorithm verifies persons according to their DNA nucleotide sequences.

Proposed Approach
The proposed approach is called the ADAR. Its construction starts with substantial numbers of DNA sequences. Each DNA sequence has a unique nucleotide pattern code. Each strand of DNA is viewed as a fundamental sequence of nucleotides (or bases). Figure 2 depicts a DNA sequencing sample of two strands with nucleotide arrangements.
Any sequence in a single DNA strand consists of A, G, T, and C nucleotides. In this work, the identity of a person is determined after counting the number of repeated patterns of four nucleotides (quaternary nucleotides).
The ADAR algorithm counts the occurrences of every repeated four-nucleotide pattern. Then, the most repeated pattern is determined. An identity claim is associated with a specific person. Therefore, verification compares three factors: the pattern index, the maximum repetition, and the identity claim.
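As an illustration of the counting and maximum steps described above, the following Python sketch enumerates the 256 four-nucleotide patterns and finds the most frequent one in a sequence. It is not the authors' implementation. The alphabet order "AGTC" (chosen so that the list starts at "AAAA" and ends at "CCCC"), the use of overlapping windows, the tie-breaking rule, and the names `PATTERNS`, `count_patterns`, and `max_pattern` are all assumptions made for this example.

```python
from itertools import product

# Assumed alphabet order, chosen so that the first pattern is "AAAA"
# and the last is "CCCC"; the paper does not state the ordering.
ALPHABET = "AGTC"
PATTERNS = ["".join(p) for p in product(ALPHABET, repeat=4)]  # 4**4 = 256


def count_patterns(sequence):
    """Count every four-nucleotide pattern using overlapping windows.

    Overlapping windows are an assumption of this sketch. Windows that
    contain a symbol outside A, G, T, C are ignored.
    """
    counts = dict.fromkeys(PATTERNS, 0)
    seq = sequence.upper()
    for start in range(len(seq) - 3):
        window = seq[start:start + 4]
        if window in counts:
            counts[window] += 1
    return [counts[p] for p in PATTERNS]


def max_pattern(sequence):
    """Return (i, Y): the 1-based pattern index and its maximum frequency.

    Ties are resolved in favour of the lowest index (an assumption).
    """
    freqs = count_patterns(sequence)
    y = max(freqs)
    return freqs.index(y) + 1, y


if __name__ == "__main__":
    i, y = max_pattern("AAAAAACGT")
    print(i, PATTERNS[i - 1], y)
```

Whether windows overlap changes the counts, so any reimplementation would need to settle that choice against the original system.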
The full ADAR system works in two main phases: enrolment and verification. In the enrolment phase, DNA samples are received and processed for storage in the system. In the verification phase, an identity claim and a DNA sample are provided for testing. A flowchart of the proposed ADAR with the two phases is given in Figure 3.
For the enrolment phase, the ADAR system consists of the following layers: input layer, search layer, max layer, identity layer, and comparison layer (which assigns the factors used for comparison). The verification phase of the ADAR system involves the same layers as the enrolment phase plus an output layer, which is added at the end and provides the verification decision. The proposed ADAR layers for the two phases of enrolment and verification are shown in Figure 4. They can be described as follows:
Input Layer: It receives a DNA sample D as a string of nucleotides (or bases).
Search Layer: It counts the occurrences of each quaternary pattern X of nucleotides (the frequencies of quaternary patterns of nucleotides). It considers all possibilities P(X), starting from "AAAA" and ending with "CCCC" (this covers 256 possible patterns, since 4^4 = 256).
Max Layer: This layer collects the maximum frequencies of the most repeated quaternary patterns of nucleotides for all D samples. The following equation expresses the maximum operation:
$$Y = \max(X_i), \quad i = 1, 2, \ldots, 256,$$
where Y is the maximum frequency of the most repeated quaternary pattern of nucleotides, max is the maximum operation over the frequencies of all X_i patterns, and i = 1, 2, ..., 256 indexes the possible patterns.
Identity Layer: During the enrolment phase, this layer stores the identities of the n persons who provide their DNA sequences. During the verification phase, it matches the identity claim of the person to be verified against his/her stored information.
The ADAR verification algorithm can be summarized as follows:
Step 1: Receiving the DNA sample as a string of nucleotides.
Step 2: Counting the numbers of repeated quaternary patterns of nucleotides.
Step 3: Collecting the maximum frequencies of the most repeated quaternary patterns of nucleotides for all the DNA samples.
Step 4: Matching the identity claim of the person who requires verification against his/her stored information.
Step 5: Comparing the three factors: pattern index (i), maximum repetition (Y), and identity claim.
Step 6: Providing the output verification decision according to all processing layers and the identity claim.
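The six steps above can be sketched in Python as a small enrolment and verification store. This is a hedged illustration under stated assumptions, not the ADAR system itself: the class `AdarStore`, the function `signature`, the use of overlapping windows, the tie-breaking rule, and the use of the pattern string in place of the pattern index i are all hypothetical choices made for the example.

```python
from collections import Counter


def signature(sequence):
    """Return (pattern, Y) for a DNA string.

    `pattern` stands in for the pattern index i, and Y is the maximum
    repetition. Overlapping windows and the tie-break (highest count,
    then alphabetical order) are assumptions of this sketch.
    """
    seq = sequence.upper()
    counts = Counter(seq[k:k + 4] for k in range(len(seq) - 3))
    if not counts:
        raise ValueError("sequence must contain at least four nucleotides")
    return min(counts.items(), key=lambda item: (-item[1], item[0]))


class AdarStore:
    """Hypothetical in-memory store covering both phases."""

    def __init__(self):
        self._enrolled = {}  # identity -> (pattern, Y)

    def enrol(self, identity, sequence):
        # Enrolment phase: process the sample and store its signature.
        self._enrolled[identity] = signature(sequence)

    def verify(self, identity_claim, sequence):
        # Verification phase: accept only if the claimed identity is
        # enrolled and its stored (pattern, Y) matches the new sample.
        stored = self._enrolled.get(identity_claim)
        return stored is not None and stored == signature(sequence)


if __name__ == "__main__":
    store = AdarStore()
    store.enrol("person-1", "AAAAAACGT")
    store.enrol("person-2", "GGGGGGTCA")
    print(store.verify("person-1", "AAAAAACGT"))
    print(store.verify("person-2", "AAAAAACGT"))
```

The sketch does not address what should happen when two enrolled samples share the same (pattern, Y) signature; that behaviour would have to follow the original system.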
Parameters used for the ADAR analysis are given in Table 1.

Results and Discussion
4.1. Datasets Descriptions. Four datasets are employed in this paper: the DNA classification (DC) [28], sample DNA sequence (SDS) [29], human DNA sequences (HDS) [30], and DNA sequences (DS) [31]. Each of these datasets consists of many DNA sequences of nucleotides (A, G, T, and C). The DC dataset involves 106 samples, the SDS dataset includes 426 samples, the HDS dataset contains 4380 samples, and the DS dataset has 11738 samples. All samples are used as strings of DNA nucleotide sequences.

Figure 2: Nucleotide sequences for a DNA sample of two strands [27].
Journal of Electrical and Computer Engineering
In more detail, these datasets and their total numbers of samples provide huge numbers of possibilities for clients and imposters, as shown in Table 2.

ADAR System.
The proposed ADAR approach is implemented as a system. It is applied four times, once for each employed dataset. The ADAR is implemented in both the enrolment and verification phases. Simple yet effective graphical user interfaces (GUIs) are designed and provided. Figure 5 shows the first GUI, which has 5 essential buttons:
(1) Load dataset: it loads the dataset and applies the enrolment phase.
(2) Input DNA pattern: it allows entering a DNA sequence for the verification phase, as demonstrated in Figure 6, where the window requesting a DNA sequence and an example of providing a DNA sequence are shown.
(3) Input identity claim: it allows entering an identity claim for the verification phase, as illustrated in Figure 7, where the window requesting an identity claim and an example of providing an identity claim are given.
(4) Result: it performs the verification process and displays the result of accepting or rejecting the identity claim.
(5) End: it stops and closes the ADAR system. Otherwise, the system keeps running and can be used for further inputs.
As mentioned, the verification result is either an acceptance or a rejection of the identity claim. Figure 8 shows both possible verification results in the ADAR system, where the output of rejecting the identity claim and the output of accepting the identity claim are displayed. Rejecting the identity claim is reported as an incorrect identity with a red-colored icon, and accepting the identity claim is reported as a correct identity with a blue-colored icon.

Results
Discussion. To evaluate the generalization of any ADAR system, separate held-out testing samples can be processed with efficient loop instructions. This leads to intensive evaluations: 106 clients and 11130 imposters for the DC dataset; 426 clients and 181050 imposters for the SDS dataset; 4380 clients and 19180020 imposters for the HDS dataset; and 11738 clients and 137768906 imposters for the DS dataset.
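The reported trial counts are consistent with each of the n samples being tested once as a client and once against each of the other n - 1 identities as an imposter, giving n(n - 1) imposter trials. This reading is inferred from the numbers rather than stated in the text. The short Python sketch below reproduces the figures under that assumption; `trial_counts` and `DATASETS` are illustrative names.

```python
def trial_counts(n_samples):
    """Return (clients, imposters) for a dataset of n_samples.

    Assumes one client trial per sample and one imposter trial for
    every ordered pair of distinct samples, i.e. n * (n - 1).
    """
    return n_samples, n_samples * (n_samples - 1)


# Dataset sizes as reported in the paper.
DATASETS = {"DC": 106, "SDS": 426, "HDS": 4380, "DS": 11738}

if __name__ == "__main__":
    for name, n in DATASETS.items():
        clients, imposters = trial_counts(n)
        print(f"{name}: {clients} clients, {imposters} imposters")
```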
It can be concluded that the ADAR system was successfully constructed. Furthermore, very high verification performance is attained for each of the four datasets, as the false acceptance rate (FAR) equals 0% and the false rejection rate (FRR) equals 0%. It can also be highlighted that the artificial intelligence system in ADAR is user-friendly and easy to implement.
Additional metrics are also considered: precision, recall, loss, and F1-score. In addition, receiver operating characteristic (ROC) curves and confusion matrices are provided in Figures 9 and 10, respectively. For all the employed datasets, the following values are computed: Precision = 1, Recall = 1, Loss = 0, F1-score = 1, and area under the curve (AUC) = 1. This is expected, as the numbers of false positive and false negative verifications are both 0, as shown in the confusion matrices.
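To make the link between the confusion matrix and the reported metrics explicit, the following Python sketch computes FAR, FRR, precision, recall, and F1-score from the four confusion-matrix counts using their standard definitions. The function name `verification_metrics` is illustrative, and the mapping of accepted clients to true positives and rejected imposters to true negatives is an assumption. Loss and AUC are not covered because the paper does not define how they were computed.

```python
def _ratio(numerator, denominator):
    """Divide safely, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def verification_metrics(tp, fp, tn, fn):
    """Compute standard verification metrics from confusion-matrix counts.

    tp: clients accepted, fn: clients rejected,
    tn: imposters rejected, fp: imposters accepted.
    """
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    return {
        "FAR": _ratio(fp, fp + tn),  # share of imposters wrongly accepted
        "FRR": _ratio(fn, fn + tp),  # share of clients wrongly rejected
        "precision": precision,
        "recall": recall,
        "F1": _ratio(2 * precision * recall, precision + recall),
    }


if __name__ == "__main__":
    # DC dataset counts with no false positives or false negatives.
    print(verification_metrics(tp=106, fp=0, tn=11130, fn=0))
```

With zero false positives and zero false negatives, FAR and FRR are both 0 and precision, recall, and F1-score are all 1, which matches the values reported for every dataset.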
The time taken by an ADAR verification has been measured at around 0.23 seconds. This measurement was carried out on an HP laptop with an Intel Core i7 processor running at 2.70 GHz and 8 GB of main memory.

ADAR Limitations.
The proposed ADAR approach still has limitations and challenges to be considered. Examples of these are as follows:
(i) It cannot be used for DNA samples that are not expressed as nucleotides (i.e., that contain different values instead).
(ii) It is designed for verification, so it requires adaptation for identification too.
(iii) It is not a machine learning technique; therefore, developing it in this direction is suggested.

Comparisons.
Comparisons between the proposed ADAR approach and state-of-the-art studies are given in Table 3. This table shows the performance of state-of-the-art studies that were conducted with the Unique Personal DNA Pattern (UPDP) method. They use the same datasets but with the following numbers of samples: 106 samples for the DC, 426 samples for the SDS, 500 samples for the HDS, and 1000 samples for the DS. Manhal et al. [22,32] focus on identification and have reported FAR values of 2.07%, 1.41%, 0.26%, and 0.75% for the DC, SDS, HDS, and DS, respectively. Ahmad et al. [32,33] work on verification and have recorded FAR results of 0.32%, 0.31%, 0%, and 0.16% for the DC, SDS, HDS, and DS, respectively. The verification tasks using our ADAR achieve even better performance. Significantly, it accepts the full numbers of samples for all employed datasets: 106 samples for the DC, 426 samples for the SDS, 4380 samples for the HDS, and 11738 samples for the DS. On each of the datasets, the proposed system attains an FAR of 0%. The FRR is reported as 0% for every method, recognition style (verification or identification), and dataset.
In summary, the proposed system, which uses the ADAR approach for verification, can provide superior performance compared to previous state-of-the-art studies. It also accepts the full numbers of samples for all employed datasets. It can provide high reliability and performance.

Conclusion
This paper provides a new artificial intelligence approach called the ADAR. It has been proposed for person verification by DNA nucleotides. ADAR works in two main phases: enrolment and verification. During the enrolment phase, DNA samples are received, processed, and stored for their unique information. In the verification phase, a DNA sample and identity claim are provided and processed, and their unique information is compared with the stored information to make a verification decision. The ADAR approach involves multiple layers: an input layer for receiving a DNA sample, a search layer for counting the frequencies of repeated quaternary patterns of nucleotides, a max layer for specifying the maximum frequency among the repeated patterns, an identity layer for storing or matching the identity claims, a comparison layer for assigning comparison factors, and an output layer for providing the verification decision in the verification phase.
A system that implements the proposed ADAR is also presented in this study. Moreover, four datasets, namely, the DC, SDS, HDS, and DS, are employed. Remarkable performances of 0% FAR and 0% FRR are achieved when applying the ADAR in a system for any employed dataset. Comparisons with state-of-the-art studies are also illustrated. The ADAR approach can outperform previously proposed methods and approaches, in addition to accepting the full number of DNA samples for any employed dataset. This indicates that the ADAR can deal with a huge number of DNA samples.
In the future, the ADAR could be developed for identification and adapted for machine learning.

Figure 1: Part of a DNA with its chemical components [13].

Figure 3: Flowchart for the proposed ADAR with the two phases of enrolment and verification.

Figure 4: Proposed ADAR architecture for the two phases of (a) enrolment phase and (b) verification phase.

Figure 5: First GUI window in any constructed ADAR system.

Figure 6: Windows to input a DNA sequence for the verification phase: (a) requesting window to enter a DNA sequence and (b) example of providing a DNA sequence.

Figure 7: Windows to input an identity claim for the verification phase: (a) requesting window to enter an identity claim and (b) example of providing an identity claim.

Figure 10: Confusion matrices of (a) ADAR verification for the DC dataset, (b) ADAR verification for the SDS dataset, (c) ADAR verification for the HDS dataset, and (d) ADAR verification for the DS dataset.

Table 2: Used datasets and their total numbers with the probabilities for clients and imposters.

Table 1: Parameters used for the ADAR analysis.

Table 3: Comparisons between the proposed ADAR approach and state-of-the-art studies.