Managing and Retrieving Bilingual Documents Using Artificial Intelligence-Based Ontological Framework

In recent times, artificial intelligence (AI) methods have been applied in document and content management to make decisions and improve organizations' functionalities. However, the lack of semantics and restricted metadata hinders current document management techniques from achieving better outcomes. E-Government activities demand a sophisticated approach to handle a large corpus of data and produce valuable insights. There is a lack of methods to manage and retrieve bilingual (Arabic and English) documents. Therefore, this study aims to develop an ontology-based AI framework for managing documents. A testbed is employed to simulate the existing and proposed frameworks for the performance evaluation. Initially, a data extraction methodology is utilized to extract Arabic and English content from 77 documents. The researchers developed a bilingual dictionary to train the proposed information retrieval technique. A classifier based on the Naïve Bayes approach is designed to identify the documents' relations. Finally, a ranking approach based on link analysis is used to rank the documents according to the users' queries. Benchmark evaluation metrics are applied to measure the performance of the proposed ontological framework. The findings suggest that the proposed framework offers superior results and outperforms the existing framework.


Introduction
The recent development in information retrieval (IR) techniques facilitates effective document management (DM) functionalities in organizations. The process of retrieving relevant information by passing a query to a search engine is called IR [1][2][3][4][5]. A query is a natural-language text used to extract a relevant document. For instance, a search engine can fetch approximately one million webpages for a single user query. Organizations apply business intelligence (BI) tools to process large amounts of data and retrieve valuable information [6][7][8][9][10][11]. To compete effectively, organizations should analyze and leverage a wide range of data, information, and expertise in order to make effective decisions. Decision Support Systems (DSS) are interactive computer-based systems designed to assist decision-makers in identifying and solving problems, completing decision-process tasks, and making decisions [12][13][14][15][16][17][18][19]. These systems are becoming increasingly popular among managers. However, shortcomings such as unstructured data and complex queries reduce the performance of IR technologies. In other words, users often fail to retrieve relevant documents for their queries [20][21][22][23][24][25]. Moreover, the absence of bilingual (English and Arabic) IR systems causes difficulties for organizations in Middle Eastern countries.
On the one hand, a wide range of IR systems is available. On the other hand, there is a lack of domain-specific ontologies or IR systems to serve an organization [26][27][28][29][30]. In the Kingdom of Saudi Arabia (KSA), most organizations offer a sophisticated application for employees and stakeholders to share information and valuable documents.
In the current environment, organizations store documents in Portable Document Format (PDF) and their relevant metadata in a different storage location. AI tools widely use this metadata for decision-making [42][43][44][45][46][47].
There are many techniques for retrieving a document using a query.
Thus, organizations cannot access a document's content without its metadata [48][49][50][51]. The KSA's Vision 2030 motivates researchers to apply innovative techniques to the current functionalities of organizations. Therefore, developing an ontological framework for document management can support organizations in satisfying their stakeholders. In addition, the role of natural language processing (NLP) in the ontological framework enables individuals to interact with the system in their natural language [52][53][54].
The objectives of the study are to: (1) build a data extraction model for extracting text from Arabic and English PDF documents; (2) construct a named entity-relationship (NER) classifier for classifying the documents; and (3) implement a ranking approach to retrieve relevant documents for a user query.
The remaining part of the study is organized as follows: Section 2 reports the features of the existing literature and the research gaps. Section 3 outlines the research methodology, and Section 4 discusses the study's findings. Finally, Section 5 concludes the study with its future directions.

Literature Review
DM is one of the critical processes in an organization. Communication between users of the internal and external units of an organization may generate a document [1][2][3][4][5]. Organizations follow governmental and international archival policies to store and manage their documents [6][7][8][9]. Existing studies present many techniques and frameworks for managing documents and IR [9][10][11][12][13][14][15].
Zaman et al. proposed an ontological framework for retrieving scientific sources [1]. They employed a fuzzy rule base and word sense disambiguation to extract information from multiple scientific documents. The experimental outcome suggests that the framework was less sensitive to modifications of the document file format. However, there is limited information on the performance of the framework.
Yao et al. developed an AI-based ontological model for predicting the side effects of medicines [2]. The model had entities such as values and relationships, which are used to indicate the drug and its side effects. The AI model's fuzzy and dynamically defined latent attributions can redefine vital records. The performance of the IR model is affected by limitations including the lack of negative data and the small dataset.
Crimp and Trotman proposed a linguistic model using Roget's thesaurus and WordNet [3]. They employed the ATIRE search engine and evaluated the model using the mean average precision (MAP) metric. The outcome highlights the better performance of the linguistic model. However, the authors utilized a limited set of features from Roget's and WordNet.
Vocabulary mismatch is one of the limitations of IR systems. To overcome this limitation, query expansion (QE) techniques have been developed. However, QE techniques are based on specialization and context relationships [4]. Raza et al. discussed that domain-specific ontologies are widely used in medicine, agriculture, and other scientific fields [4]. Multiple automated QE systems have been proposed in IR [5]. Yunzhi et al. constructed an Arabic ontology based on Protégé and the SPARQL language to extract candidate expansion terms [6].
Domain-independent ontologies serve as a valuable resource for multiple domains. Aggarwal and Paul extracted expansion concepts from the DBpedia and Wikipedia ontologies using semantic analysis [7]. However, shortcomings such as ambiguous terms and the lack of unique ontological properties add complexity. Zingla et al. and Omar et al. proposed hybrid models for extracting expansion concepts from DBpedia and Wikipedia [8,9]. They employed the Microblog and TREC 2011 datasets to evaluate their ontologies' performance. The existing studies focus on specific domains, and there are no studies covering both DM and IR [10][11][12][13][14][15]. There is a lack of bilingual ontological frameworks for organizations in the KSA. Most studies considered the NER classification of webpages as the primary objective rather than the ranking approach [16][17][18][19][20][21]. Particle Swarm Optimization (PSO) is used to enhance and train Hidden Markov Model (HMM) estimation approaches. PSO identifies the optimal response for a user query; for instance, the metadata of a document can be extracted using this approach [22][23][24][25][26][27]. A text extractor can be built using AI techniques for the automated extraction of key terms from a document [28][29][30][31][32][33][34]. An ontology-based dynamic information extraction framework identifies a wide range of document resources published in the scientific community and extracts their whole structural information [35][36][37][38][39][40][41]. A few research works employed the term-frequency methodology for ranking webpages [48][49][50][51][52][53][54]. Thus, there is a demand for a practical ontological framework for managing documents and retrieving information based on the user query.

Research Methodology
In order to achieve the objectives of the study, the researchers construct a bilingual (Arabic and English) ontological framework for retrieving documents. Figure 1 presents the framework of the proposed study. It covers four phases: data extraction, NER classification, ranking technique, and performance evaluation. The first phase outlines the data extraction process for extracting text from PDF documents. The NER classification using MNB is described in the second phase. The third phase highlights the ranking technique used to retrieve relevant documents. Lastly, the fourth phase evaluates the performance of the proposed ontological framework (POF).

Phase 1: Data Extraction.
This phase transforms a PDF document into a text document. It supports the retrieval process in extracting relevant documents. During communication, employees and stakeholders widely use PDF documents for sharing information, but it is difficult to search a PDF document using a user query. Therefore, a PDFtoWord tool is developed to automate the process of converting a PDF document to a Word document. However, a PDF document may contain handwritten content that cannot be converted into a Word document; in other words, converting handwritten text into standard text is challenging. Figure 2 shows the activities of phase 1. Initially, a document is converted to image format in order to extract the text. The extracted raw text is preprocessed and stored as a set of keywords and a Word file. Phase 1 enables the proposed framework to search a document using a keyword, overcoming the limitations of searching documents using metadata alone.
Thus, this study transforms the PDF document into an image (JPEG or PNG) format. The procedure of the data extraction process is as follows: Step 1: Input a PDF document. Step 2: Convert the document from PDF form to JPEG or PNG format. Let PD be the PDF document and ID be the image form of the PDF document. Doc_To_Img is a function for converting documents from PDF to image form, and hres is the attribute that renders the image at high resolution (1100 × 900 pixels at 600 pixels per inch). Equation (1) expresses the conversion of the PDF document into image format: ID = Doc_To_Img(PD, hres). (1)
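The conversion step can be sketched in Python as below. This is a minimal sketch, not the authors' implementation: `doc_to_img` is a hypothetical name mirroring the Doc_To_Img function above, and the third-party pdf2image package (a Poppler wrapper) is an assumption, since the paper does not name its conversion library. The defaults mirror the 600 ppi and 1100 × 900 pixel settings given in the text.

```python
def doc_to_img(pdf_path, dpi=600, size=(1100, 900)):
    """Convert each PDF page to a high-resolution image.

    Hypothetical counterpart of the paper's Doc_To_Img function. The
    pdf2image package is an assumed dependency, imported lazily so the
    module loads without it.
    """
    from pdf2image import convert_from_path  # requires pdf2image + Poppler
    pages = convert_from_path(pdf_path, dpi=dpi)
    # Resize each page to the fixed high-resolution frame from the text.
    return [page.resize(size) for page in pages]
```

In practice the returned PIL images would be handed directly to the text extractor of the next step.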
A text extractor is designed using the AI-based Tesseract module, which extracts the text from the image [55]. Nonetheless, the module is limited to the English language; thus, a dedicated Arabic dictionary is developed and integrated with the Tesseract module. Let Tesseract() be a function to extract text from an image, P_process be a preprocessing function, RT be the raw text, and d be the document's content. Equations (2) and (3) outline the extraction and preprocessing of the text: RT = Tesseract(ID), (2) and d = P_process(RT). (3)
The P_process function employs an Arabic and English dictionary to ensure the RT is correct. During text extraction, the extracted text may contain errors; for instance, "name" may be misspelt as "mame". Thus, the dictionary corrects the erroneous content.
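The dictionary-based correction inside P_process can be sketched with Python's standard difflib. The tiny word list, the `correct` function name, and the similarity cutoff are illustrative assumptions, not the authors' implementation.

```python
import difflib

# Illustrative word list; the paper's bilingual dictionary is far larger.
DICTIONARY = {"name", "salary", "center", "delay", "document"}

def correct(word, cutoff=0.7):
    """Return the closest dictionary word, or the word itself when no
    sufficiently similar entry exists (sketch of the spell-correction
    step performed by P_process)."""
    if word in DICTIONARY:
        return word
    matches = difflib.get_close_matches(word, DICTIONARY, n=1, cutoff=cutoff)
    return matches[0] if matches else word
```

For the paper's own example, `correct("mame")` returns `"name"`, while words already in the dictionary pass through unchanged.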

Phase 2: NER Classification.
In this study, the researchers employed the Multinomial Naïve Bayes (MNB) algorithm for classifying the documents [56]. Each document is a collection of words, and a class or label consists of homogeneous documents. The MNB algorithm is widely used in NLP applications; it classifies documents based on the statistical properties of their content. Figure 3 outlines the processes of phase 2. The Word document is processed using the Bayesian property, the posterior function is computed for each term in the document, and finally each document is stored as a vector. The following section explains the computation of the Bayesian property and the posterior function in detail. The classification assigns a text segment to a class using the probability of documents in the class relative to other documents. The process of grouping similar documents under a specific class is called labeling. Let S be the set of documents to be classified. Each document in S is treated as a string related to one or multiple documents based on a class L. The classification of documents is based on a train set that contains documents classified according to the document relationships in Figure 4. Figure 5 shows the classification of documents using the train set.
Let f be the vector in S and f_i be the feature in f representing the i-th term in L.
The core of the MNB model is the evaluation of a probability-based decision function. The Bayesian probability for the documents is expressed in equations (4) and (5). The probability of a document belonging to the class L_m is shown in equation (6). Equation (7) outlines the MNB in log space. The evaluation of log(P) is expressed in equation (8).
log(P) = ln(P), P < 1. (8)

The following steps are followed to classify the documents using the MNB classifier: Step 1: Divide the documents (S) into groups of n terms.
Step 2: Repeat the following process for each i-th term in S.

Step 2(a): Compute the Bayesian probability using equation (4).
Step 2(b): Evaluate the P(L_m) function for each document i in L.
Step 2(c): Compute the posterior function by integrating the prior function with the sum over each term.
Step 3: Compute L_S of S using the corresponding equation.
Step 4: Repeat Steps 1 to 3 with the train set.
Step 5: Classify the documents and store them as a vector.
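The classification steps above can be sketched as a minimal multinomial Naïve Bayes in Python. This is a sketch under stated assumptions: Laplace (add-one) smoothing and the toy documents are illustrative choices not stated in the paper, `train_mnb` and `classify` are hypothetical names, and the log-space scoring mirrors the log form of equation (7).

```python
import math
from collections import Counter

def train_mnb(docs, labels, alpha=1.0):
    """Estimate log-priors and smoothed log-likelihoods for each class."""
    classes = set(labels)
    vocab = {w for d in docs for w in d.split()}
    counts = {c: Counter() for c in classes}
    class_docs = Counter(labels)
    for d, c in zip(docs, labels):
        counts[c].update(d.split())
    model = {}
    for c in classes:
        total = sum(counts[c].values())
        log_prior = math.log(class_docs[c] / len(docs))
        # Laplace smoothing avoids log(0) for terms unseen in a class.
        log_like = {w: math.log((counts[c][w] + alpha) / (total + alpha * len(vocab)))
                    for w in vocab}
        unk = math.log(alpha / (total + alpha * len(vocab)))
        model[c] = (log_prior, log_like, unk)
    return model

def classify(model, doc):
    """Assign the class maximizing the log-space posterior (MAP decision)."""
    def score(c):
        log_prior, log_like, unk = model[c]
        return log_prior + sum(log_like.get(w, unk) for w in doc.split())
    return max(model, key=score)
```

On a toy train set such as `train_mnb(["salary payment delay", "salary increase request", "server outage report", "network outage alert"], ["hr", "hr", "it", "it"])`, the query "salary delay" is assigned to the "hr" class.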

Phase 3: Ranking Approach.
In this phase, the researchers apply the ranking approach based on the study [19]. Figure 6 highlights the flow of processes in phase 3. Phase 3 initializes the vector and computes hub and authority weights similarly to the HITS algorithm. However, a random walk feature is employed for updating the hub and authority weights. The approach combines the PageRank [20], HITS [21], and SALSA [22] algorithms and is a link-based ranking technique. Let a_i be the authority weight and h_i be the hub weight. The ranking approach considers documents with higher a_i as better authorities and documents with higher h_i as better hubs. Figures 7(a) and 7(b) show the authorities and hubs pointing to P. The weights of h_i and a_i are updated dynamically.
Documents are ranked according to the user query based on the weights of h_i and a_i. The approach works similarly to HITS, using a bipartite graph (G) and a seed set (R_f). In addition, a parameter, the P-norm, assigns multiple normalized weights to each document link. A duplicative feature is employed to initiate hub and authority weights, and vice versa. The random walk feature of SALSA is used to identify the most reachable node in G. Finally, normalization of the authority vector A⃗ generates the ranked documents. The following procedure is applied for ranking the documents: Step 1: Input the user query and initialize the N_h and N_a nodes and the parameter P (the P-norm value).
Step 2:
Step 3: For each element i in N_h:
Step 3a: For each element j in the set of nodes pointed to by the i-th node, compute Temp = Temp + a_j^P / |B(j)|
Step 3b: Compute h_i = Temp^(1/P)
Step 4: For each element k in N_a:
Step 4a: For each element l in B(k), compute Temp = Temp + h_l^P / |F(l)|
Step 4b: Compute a_k = Temp^(1/P)
Step 5: Repeat Steps 3 and 4 until the weights converge.
Step 6: Update A⃗ with the authority weights.
Step 7: Normalize A⃗ to obtain the ranked documents.

Based on the above terms, the metrics are computed as follows: Precision is the fraction of retrieved documents that are relevant to the user query.
F1-score is the harmonic mean of Precision and Recall.
Accuracy is the fraction of documents that are correctly retrieved or correctly excluded for a user query.
R-precision is used to ensure that the returned documents are relevant to a user query. It computes the precision value at the R-th position.
Mean Average Precision (MAP) is the mean of the average precision over all user queries.
MAP = (1/n) Σ_{q=1}^{n} AveragePrecision(q), where n is the number of queries (q).
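The iterative hub/authority update of Phase 3 (Steps 3 and 4, followed by convergence and normalization) can be sketched as below. This is a simplified, pure-Python reconstruction under stated assumptions: the graph is given as an adjacency list, `pnorm_rank` and the node names are illustrative, a fixed iteration count stands in for the convergence test, and the updates follow the Temp accumulation and P-th-root formulas above rather than the authors' exact implementation.

```python
def pnorm_rank(links, p=2.0, iters=20):
    """Rank documents by hub/authority weights with P-norm updates.

    `links` maps each hub document to the documents it points to, standing
    in for the bipartite graph G built from the seed set.
    """
    hubs = {i: 1.0 for i in links}            # h_i, initialized to 1
    auths, back = {}, {}
    for i, outs in links.items():
        for j in outs:
            auths.setdefault(j, 1.0)          # a_j, initialized to 1
            back.setdefault(j, []).append(i)  # B(j): hubs pointing to j
    for _ in range(iters):
        # Step 3: h_i = ( sum over j pointed to by i of a_j^p / |B(j)| )^(1/p)
        hubs = {i: sum(auths[j] ** p / len(back[j]) for j in outs) ** (1 / p)
                for i, outs in links.items()}
        # Step 4: a_k = ( sum over l in B(k) of h_l^p / |F(l)| )^(1/p)
        auths = {k: sum(hubs[l] ** p / len(links[l]) for l in back[k]) ** (1 / p)
                 for k in auths}
    # Steps 6-7: normalize the authority vector and rank by it.
    norm = sum(auths.values()) or 1.0
    return [k for _, k in sorted(((a / norm, k) for k, a in auths.items()),
                                 reverse=True)]
```

On a toy graph such as `{"d1": ["d2", "d3"], "d2": ["d3"], "d4": ["d3"]}`, the heavily linked node "d3" is ranked first, which matches the intuition that documents with more weighted in-links are better authorities.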

Results and Discussion
To evaluate the performance of the proposed ontological framework (POF), a testbed containing 77 documents in PDF form is developed. Python 3.9.12 in a Windows 10 Professional environment is utilized to implement the frameworks. Initially, a text extractor is employed to extract the text from the PDFs. Figure 8 illustrates the application interface for uploading a PDF file to convert it to a Word file and extract key terms.
An Arabic dictionary is integrated with the text extractor to extract the Arabic content. MNB is used to build the ontology by classifying the documents with NER. Finally, the link-based ranking (LBR) method is applied to rank the documents according to the user query. Table 1 outlines the Arabic and English queries for evaluating the framework's performance; it comprises the five queries most frequently used by the organizations to retrieve documents. Figure 9 shows the list of documents for the term "salary issues". The POF searches the documents and retrieves 27 documents based on the key terms; using the hyperlink, the user can view a specific document. Table 2 reports the findings of the performance evaluation of the POF and shows that the POF achieved compelling results. For instance, at Precision@77 for English Query 1, the POF offered Precision, Recall, F1-Score, and Accuracy of 97.3%, 97.1%, 97.2%, and 98.3%. Similarly, at Precision@77 for the Arabic query, the POF presented Precision, Recall, F1-Score, and Accuracy of 97.7%, 98.4%, 98.05%, and 98.1%. It is evident from the outcome that the POF produced a similar set of results for the English and Arabic queries. The NER classification and link-based ranking approach supported the POF in retrieving an optimal set of documents for user queries. Figure 10 highlights the POF's overall performance (Precision@77) for the English and Arabic queries: the POF achieved an average F1-Score of 97% for the five English and Arabic queries, and it retrieved relevant documents for each of them. Figure 11 portrays the comparative analysis of the frameworks for the English queries. It shows that the POF gained better Precision, Recall, F1-Score, and Accuracy, while the GOF and YOF also accomplished high Precision, Recall, F1-Score, and Accuracy.
Likewise, Figure 12 presents the results for the Arabic queries.
The frameworks achieved good results overall. However, the POF's overall performance is better than that of the existing frameworks. In addition to the benchmark metrics,

Table 1: Queries (English versions; the Arabic versions appear alongside them in the original table).

1. What are the terms or words highly communicated by unit A?
2. What type of documents are accessed through unit B?
3. How many times does unit D use the term "center" in their communication?
4. What are the documents communicated by employee A?
5. Who uses the word "delay" in the documents?

and 95.8%, respectively. The features of HITS and SALSA favored the POF in retrieving a compelling set of documents compared to the other frameworks. Figure 13 shows that the POF offered a superior outcome for the English and Arabic queries compared to the GOF (Figure 14) and the YOF (Figure 15). It reveals that the effectiveness of the data extraction, NER classification, and ranking approach supported the proposed framework in producing better results.
The POF achieves better Precision, Recall, F1-score, and Accuracy for both Arabic and English. It can be applied in any kind of document management environment. However, the GOF and YOF are ontological frameworks for specific document types and cannot be applied in general settings. In addition, the POF offers a ranking technique for searching bilingual documents, unlike the GOF and YOF: it is a link-based searching technique, whereas the GOF and YOF rank documents according to the user query and the term frequencies of the document. Thus, the POF enables a more effective searching environment for users than the GOF and YOF.

Applications of the Proposed Framework.
The proposed ontological framework can be applied in real-time document management and retrieval environments. It enables users to retrieve relevant documents based on keywords. In addition, it offers the following applications for society.
Digital library: Using the proposed framework, a large corpus of documents can be developed to support the organization in facilitating a digital library for employees to share information and manage their routine tasks.
Chatbot: The advent of AI techniques leads to the development of question-answering systems (Chatbot services) for the employees and stakeholders of an organization. The proposed framework can support developers in training and testing Chatbot applications. The NB classifier offers relation-based documents that a Chatbot system can use to provide relevant answers to user queries.
Recommender system: Using phases 1 and 2, a recommender system can be developed for employees to furnish useful data during document creation. The documents' data can be used as keywords or metadata to search a document.
Furthermore, the bilingual feature of the proposed ontology supports Arabic- and English-speaking users in sharing information.

Conclusion
This study developed an ontological framework for managing Arabic and English documents in Saudi Arabian organizations. The proposed framework comprises three phases: converting PDF documents into ordinary Word documents with a set of unique terms; a Naïve Bayes-based entity-relationship document classifier; and a ranking technique for arranging documents as per the user query. The conversion technique uses a modified text extractor for extracting Arabic and English terms from the images. Furthermore, the entity-relationship technique arranges the documents as per the relationships among their terms. The ranking technique combines the features of the HITS and SALSA ranking algorithms to rank the documents at a faster rate. A set of 77 documents was utilized to compare the performance of the proposed framework with recent techniques. The outcome reveals that the proposed ontological framework achieves adequate Precision, Recall, F1-score, and Accuracy for bilingual documents using a user query. In addition, it offers an effective bilingual document management environment for employees and stakeholders of Saudi Arabian organizations. The proposed framework can be extended to other languages. Furthermore, the ranking technique can be improved using metadata with newer deep learning techniques.
Data Availability

The data supporting the results are available on request from the corresponding author.

Conflicts of Interest
The authors declare that they have no conflicts of interest regarding the present study.