A Comparative Study of Some Automatic Arabic Text Diacritization Systems

Arabic diacritization is the task of restoring diacritics, or vowels, in Arabic texts, which are mostly written without them. When automated, this task improves the results of several natural language processing tasks; hence, it is necessary for the field of Arabic language processing. In this paper, we present a comparative study of some automatic diacritization systems. One uses a variant of the hidden Markov model. The other is a pipeline that includes a Long Short-Term Memory deep learning model, a rule-based correction component, and a statistical-based component. Additionally, we propose some modifications to those systems. We trained and tested those systems on the same benchmark dataset using the same evaluation metrics proposed in previous work. The best system achieves 9.42% and 22.82% for the diacritic error rate (DER) and the word error rate (WER), respectively.


Introduction
Arabic is the language of over 422 million natives in the Arab world. It is present in the religious life of over a billion Muslims. It is also present, in its modern standard and dialectal forms, in the daily life of native speakers. Arabic texts follow the Abjad writing system. Every letter is considered a consonant, while diacritics, or vowels, are mostly omitted and left for readers to deduce from their knowledge of the language and the context of the words. The vowel marks, or diacritics, are written either above or below the letters. It is worth mentioning that the typing speed of Arabic script may drop by half if diacritics are included in the written text. Diacritics could also improve the reading comprehension of people suffering from dyslexia. This is because inferring the diacritics is cognitively demanding and, as shown by Al-Wabil et al. [1], Arabic dyslexics lack working-memory and phonological skills. Moreover, the intensive use of the nonvowelized form in Arabic web content becomes an obstacle for dyslexic and visually impaired readers.
Another problem cited by Al-Wabil et al. [1] is that diacritics can help infer the right words but, at the same time, add visual complexity to the text. This demands more effort from dyslexics, since reading requires visual discrimination and memory skills. A proposed solution to this problem is to offer the text in its undiacritized form together with diacritization options by level (partial or full).
Furthermore, diacritics are not only important to children and novice learners of Arabic; they can also offer great help in some natural language processing (NLP) tasks. Those tasks include text-to-speech, machine translation, automatic speech recognition, and Part-Of-Speech (POS) tagging. For example, the diacritized form can narrow down the result list of an information retrieval system. In addition, it can lead to better text classification, which could be useful for sentiment analysis systems. Therefore, the automatic Arabic diacritization task is important both in the Arabic NLP field and in the real life of beginner learners of Arabic and people with Specific Learning Difficulties (SpLD).
In this paper, we present a comparison of some diacritization systems. The first one uses the hidden Markov model (HMM) combined with smoothing techniques. The second system is a pipeline of multiple components, starting with a deep learning model, followed by a rule-based model and a statistical-based model. The third system is quite similar to the second one, but we modified the input layer by using char embedding to replace the one-hot encodings.
The study was carried out on the same benchmark dataset and with the same evaluation metrics proposed in previous work [2]. The best results are achieved by the HMM-based system using the Laplace smoothing technique. When we include the count of nonvocalized letters in the original text, the system achieves 9.42% and 22.82% for the DER and WER, respectively. Without the count of nonvocalized letters in the original text, the system achieves 10.60% and 21.89% for the DER and WER, respectively.
The rest of this paper is organized as follows. First, a section presents the language background. After that, we look at some works that tackled the same problem. Then, we describe the main systems we compared in this study. Finally, we present the experimental settings and the results of the comparison.

Arabic Language Background
In this section, we are going to look into the Arabic language background, specifically the diacritization area.
Arabic diacritic marks can be grouped as shown in Table 1. The first group contains the short diacritics, or Harakat: Fathah, Kasrah, and Dammah. The second contains the tanwīn, or nunation, diacritics, which are visually doubled short diacritics and appear at the end of some words; when read, a nunation gives the sound of the short vowel followed by an /n/ sound. The third contains the shaddah, which geminates a letter and can be combined with the short diacritics and the nunations. The last group comprises the sukūn, which is written above a consonant to mark the absence of a vowel and to indicate the closure of a syllable.
Diacritics fall into two functional classes. The lexical or morphological ones (core diacritics) help distinguish the lexical characteristics of the word, while the inflectional ones help determine the syntactic characteristics of the same lexeme within a sentence. The latter are also called case-ending vowels, and they are harder to infer than the former class [3].

Related Works
In this section, we are going to summarize some previous works related to automatic diacritization.
Elshafei et al.'s system [4] first produces a dictionary of words with their frequencies. Then, it computes bigram and trigram distributions. Then, a hidden Markov model (HMM), with nonvocalized sequences as observed states and diacritized forms as hidden ones, determines the best vocalization. The system was trained on a large diacritized corpus covering different fields. The vocabulary size is 18 […]
Ya'kov Gal's work [5] addresses vowel restoration for Arabic and Hebrew. The unigrams and bigrams are extracted first. Then, an HMM with the Viterbi decoding algorithm finds the best output. As an Arabic dataset, the work used a publicly accessible version of the Qur'an corpus (https://www.sacred-texts.com/) that contains 90,000 words. The system achieved a word error rate (WER) of 14% and 19% for the Arabic and the Hebrew test data, respectively.
Zayyan et al.'s system [6] used multiple lexical layers to infer diacritics. The first layer is word-based and uses n-gram models that take left and right contexts into account. The second layer is letter-based and also uses n-grams that take left and right contexts into account. The dataset is made up of the LDC Arabic Treebank (LDC Arabic Tree Bank Part 3: http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2005T20) (about 340K words) and the dialectal corpus of CallHome Arabic (https://catalog.ldc.upenn.edu/LDC97T19) (120 transcribed telephone conversations). The solution achieved 16.8% and 11.7% for WER and DER, respectively.
Shaalan et al. [7] combined lexicon retrieval methods, a bigram model, and a support vector machine (SVM), which determines the Part-of-Speech (POS) tags. These POS tags are used to select the best bigrams. The system also used an external analyzer to handle the inflectional characteristics of Arabic texts. The training and evaluation dataset used for this system is the LDC Arabic Treebank (Diacritized news Part 2 v2.0: catalogue number LDC2004T02 and 1-58563-282-1). It includes 144,199 tokens. The system achieved a WER of 11.79% and a DER of 3.24%.
To restore diacritics, Shahrour et al. [8] used the MADAMIRA (https://camel.abudhabi.nyu.edu/madamira/) analyzer, J48 decision tree classifiers, and syntax rules. As a dataset, they used the Penn Arabic Treebank (PATB parts 1, 2, and 3). The achieved WER is 9.4%.
Ahmed Said et al.'s work [9] is organized as a sequence of steps. First, it starts with the autocorrection of common Arabic mistakes and tokenization. Second, out-of-vocabulary (OOV) generation is accompanied by the extraction of morphological features using rules and a statistical-based analyzer. Third, an HMM for POS tagging returns the most likely sequence and helps restore the case endings. Lastly, a statistical-based model handles the OOV words.
The system was trained and tested using the standard LDC Arabic Treebank corpus (part 3 version 1.0, #LDC2004T11). The training set contains around 288K words; the blind set contains 52K words. The system achieved a WER of 11.4% and a DER of 3.6%.
Bebah et al. [10] used Alkhalil Morpho Sys (https://sourceforge.net/projects/alkhalil/) to extract morphological information. Then, an HMM chooses the most likely vowelization based on the frequency distribution of words in the corpus. The HMM uses undiacritized sequences as observations, while its hidden states are either the possible vocalization forms or the possible diacritics. The corpus used in the training (respectively, testing) phase is derived mainly from the NEMLAR (http://catalog.elra.info/en-us/repository/browse/ELRA-W0042/), Tashkeela (https://sourceforge.net/projects/tashkeela/), and RDI (http://www.rdi-eg.com/RDI/TrainingData/) corpora. The training corpus consists of 2,463,351 vowelized words. The proposed vocalizer has a WER of 21.11% and a DER of 7.37%.
Advances in Human-Computer Interaction
Samah Alansary's system [11] is rule-based: the Arabic diacritization is done in seven steps grouped into three modules, namely morphological processing, syntactic processing, and morph-phonological processing. The dataset was selected from the International Corpus of Arabic (ICA) (http://www.bibalex.org/ica/en/about.aspx). The vocabulary size of the training set is about 300,000 Arabic words, and the testing vocabulary size is 100,000. Moreover, they used a testing set derived from the LDC Arabic Treebank (nearly 52,000 words). The system achieved a WER of 15.7%.
Mohsen Rashwan et al.'s work [12] applies a context memory to each tokenized word. Then, two deep networks simultaneously use the word and its context. One extracts the features, and the other finds the POS tags. The results of the two networks are then fed to a deep net for classification. The dataset is composed of the LDC's Arabic Treebank (ATB) part 3 corpus (catalog identifier = LDC2005T20) and some customized datasets. On the ATB dataset, the case-ending accuracy is shown to be 88.4%, while the morphological accuracy is about 97%.
Yonatan Belinkov et al.'s system [13] uses a bidirectional Long Short-Term Memory (BLSTM) network to infer the diacritics. It is composed of a char embedding layer, three BLSTM layers, and an output layer that generates probability distributions over labels. The dataset was taken from the Arabic Treebank. The train, dev, and test sets contain, respectively, 470K, 81K, and 80K words. The system's DER is about 4.85%.
Gheith Abandah et al.'s system [14] also used a deep BLSTM for diacritics restoration. The dataset was composed of ten books from the Tashkeela corpus, the Holy Quran, and the LDC's Arabic Treebank part 3, v3.2 (#LDC2010T08). The vocabulary size of the LDC ATB3 corpus is about 305,000 words. The average vocabulary size of all the corpora (including LDC ATB3) is about 402,000 words. Testing on the Tashkeela corpus gave a DER of 2.09% and a WER of 5.82%. Testing on the ATB3 gave a DER of 2.72% and a WER of 9.07%.
Aya Metwally et al. [15] proposed a system with three phases. The first restores morphological diacritics and POS tags using an HMM with the Laplace smoothing technique. In the second phase, the system infers the same things for unseen, out-of-vocabulary (OOV) words. The last phase infers the syntactic vowels using morphological features, POS tags, and a Conditional Random Fields (CRF) classifier. For the experiment, they used the LDC Arabic Treebank part 3 dataset. It consists of 600 articles (about 340,000 words). The WER was reported to be 13.7%.
Chennoufi et al.'s system [16] first uses Alkhalil Morpho Sys to extract morphological features and all the different vowelizations of each word. Then, syntactic rules are used to discard invalid sequences. Third, an HMM is used to choose the best one. Finally, a char-based HMM is used to process unseen words. The dataset consists of the Tashkeela corpus (63 million diacritized words), NEMLAR text (500,000 diacritized words), and the RDI corpus (about 8.5 million diacritized words). The achieved WER and DER are 6.22% and 1.98%, respectively.
Amany Fashwan et al.'s system [17] has two levels. The first one is for morphological processing through unimorphological processing, morphological rules, statistical processing, and processing of unseen words. The second level is for syntactic rules, which help restore case endings. The testing was done on the LDC's Arabic Treebank part 3 (about 52,000 words). The WER is about 14.78%, while the DER is about 4.11%.
The work of Kareem Darwish et al. [18] performs the task in two phases. The first infers internal vowels using bigrams, unigram stems, stem pattern templates, and sequence labeling of stems. The second phase deals with the case endings via an SVM ranking model and heuristics. The training corpora (acquired from a commercial vendor) contain more than 9.7 million words. The testing data, containing about 18,300 words, was made up of 70 WikiNews articles. The work achieved 12.76% WER and 3.54% DER.
Saba' Alqudah et al. [19] also followed the hybrid approach by fusing the MADAMIRA analyzer and a deep bidirectional LSTM network. The outputs of MADAMIRA with a high confidence parameter are input to the network. The experimental dataset is the LDC Arabic Treebank part 3 v3.2 (catalog id = LDC2010T08). It comprises 305,000 words. The system achieved 2.39% and 8.40% as DER and WER, respectively.
Badr Alkhamissi et al. [3] proposed a system that has two models, each with two levels. The first has a word-level encoder. The second is a character encoder. In addition, a cross-level attention unit is used to improve the character's representation by utilizing embeddings of characters to access every word in the sentence. The second model has a forward LSTM that takes as input char embeddings and the one-hot encoding of the previous model. In addition, it has a final classifier using a Softmax. The dataset was generated from the Tashkeela corpus. It was split into train (2,449K tokens), validation (119K tokens), and holdout (125K tokens) sets. The system achieved 5.34% WER and 1.83% DER.
Ismail Hadjir et al. [20] presented a system based on a modification of the HMM using the Viterbi decoding algorithm. The training set is made up of 26 books from the Tashkeela dataset, while the testing set is made up of three other books from the same corpus. The system achieved a precision of up to 80% at the word level.

Presentation of the Three Systems
In this section, we present the three systems compared in this study.

Arabic Diacritizer That Uses a Multilevel Statistical Model.
Mohamed Hadj Ameur et al. [21] proposed, as shown in Figure 1, a model consisting of multiple statistically based layers. The first is a bigram word-based model using a modified version of the HMM. The second phase is dedicated to unsolved cases. It is based on a 4-gram character-based model, which also uses a modified version of the HMM.
The first phase follows these steps. First, a dictionary is built; it associates each undiacritized word with its possible vocalizations. The second step generates a lattice for the undiacritized sequence $W_1, W_2, \dots, W_n$, where each word $W_i$ is associated with its possible diacritized forms from the dictionary. Then, for each possible diacritization, a probability is calculated under the bigram model assumption, in which each diacritized word depends only on the one preceding it. Finally, among all the possible diacritizations, the one with the highest probability is returned.
The Viterbi algorithm uses a variation of the HMM as input to generate the best vocalization. The HMM is defined by the following:
(i) A set of states representing the diacritized words $d_1, d_2, \dots, d_n$
(ii) A set of observations representing the undiacritized words $W_1, W_2, \dots, W_n$
(iii) The matrix of transitions, which contains the transition probabilities
Note that the HMM variation proposed in this model considers the transition probabilities, while the emission probabilities are neglected.
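The bigram assumption and the selection step described above can be written compactly. The following is a sketch in the notation already introduced ($W_i$ for the undiacritized words, $d_i$ for their diacritized forms); the start symbol $d_0$ and the hat notation are assumptions made here for illustration and are not necessarily the exact formulation of [21].

```latex
% Bigram assumption: each diacritized word depends only on its predecessor
% (d_0 is an assumed sentence-start symbol).
P(d_1, d_2, \dots, d_n) \approx \prod_{i=1}^{n} P(d_i \mid d_{i-1})

% Selection: keep the diacritization with the highest probability, where each
% d_i ranges over the dictionary forms associated with W_i.
(\hat{d}_1, \dots, \hat{d}_n) = \arg\max_{d_1, \dots, d_n} \prod_{i=1}^{n} P(d_i \mid d_{i-1})
```

Because the emission probabilities are neglected, only transition terms appear in the product.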
The Viterbi algorithm uses a recursive relation to pick the best diacritized sequence from the hidden Markov model. The probability of each transition indexed $(i, j)$ at level $i$ is computed from its predecessors at level $i-1$, where $v_{i-1}$ is the number of all diacritization possibilities for the $(i-1)$th word. In the forward pass of this algorithm, all the probabilities are calculated while the best transition probabilities along each path are recorded. Then, backtracking returns the optimal path. To solve the problem of unseen bigrams, a smoothing technique (Laplace smoothing or absolute discounting) is used.
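The forward pass and backtracking described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of [21]: the function names, the `<s>` start symbol, the add-one (Laplace) estimate, and the transliterated toy words are all hypothetical, and every lattice column is assumed to hold at least one candidate.

```python
import math
from collections import Counter

START = "<s>"  # hypothetical sentence-start symbol


def count_bigrams(sentences):
    """Count contexts and bigrams over diacritized training sentences."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        prev = START
        for word in sentence:
            contexts[prev] += 1
            bigrams[(prev, word)] += 1
            prev = word
    return contexts, bigrams


def viterbi(lattice, contexts, bigrams, vocab_size):
    """Return the most probable path through the lattice.

    lattice[i] lists the candidate diacritized forms of the i-th word.
    Only transition probabilities are used (emissions are neglected).
    """
    if not lattice:
        return []

    def logp(prev, word):
        # Laplace (add-one) smoothing keeps unseen bigrams above zero.
        return math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size))

    scores = [logp(START, cand) for cand in lattice[0]]
    back = []  # back[i-1][j] = best predecessor index of candidate j at level i
    for i in range(1, len(lattice)):
        new_scores, pointers = [], []
        for cand in lattice[i]:
            options = [scores[k] + logp(prev, cand)
                       for k, prev in enumerate(lattice[i - 1])]
            best_k = max(range(len(options)), key=options.__getitem__)
            new_scores.append(options[best_k])
            pointers.append(best_k)
        scores = new_scores
        back.append(pointers)

    # Backtracking from the best final candidate.
    j = max(range(len(scores)), key=scores.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [lattice[i][j] for i, j in enumerate(path)]


# Toy usage with transliterated placeholders instead of Arabic script.
train = [["kataba", "aldarsa"], ["kataba", "aldarsa"], ["kutiba", "aldarsu"]]
contexts, bigrams = count_bigrams(train)
best = viterbi([["kataba", "kutiba"], ["aldarsa", "aldarsu"]], contexts, bigrams, 4)
```

Working in log space avoids numerical underflow on long sentences; the choice between Laplace smoothing and absolute discounting only changes the `logp` helper.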
The second phase is a 4-gram letter-based model applied to the words left nonvocalized by the first phase. It uses a letter-based HMM, which consists of the following:
(i) States representing the diacritized letters $q_1, q_2, \dots, q_n$
(ii) Observations represented by the nondiacritized letters $l_1, l_2, \dots, l_n$
(iii) The matrix of transition probabilities
This letter-based model is used similarly to the word-based model, with the smoothing techniques. Besides, the best path is selected using the Viterbi algorithm.
It is also worth mentioning that the source code related to this work (https://github.com/Ycfx/Arabic-Diacritizer) seems to be incomplete, so we restored some missing methods (Algorithms 1-4), which were used mainly in the letter-based component. Besides, we corrected some lines of code in other methods.

Multicomponent System for Automatic Arabic Diacritization.
The system proposed by Hamza Abbad et al. [22] is a pipeline of three main components. At the outset, a preprocessing phase is needed. Its main role is to simplify the representation of the data. First, it keeps the Arabic letters and the spaces and replaces numbers with 0s. The other chars are inferred after diacritics restoration. Second, each sentence is represented as an input and its corresponding outputs. The outputs are a one-hot encoded vector for the presence of the Shadda diacritic on a letter and a 2D array of one-hot encoded vectors representing the primary diacritics for each character in the sentence. Along with that, the input is mapped to the 38 numeric labels of the kept characters. They are then one-hot encoded as a two-dimensional array with shape (length of sentence, number of labels). After that, the input array is extended to be a 3D tensor with shape […]
The first parallel layer, which helps predict the presence of Shadda, is connected to a single perceptron with a sigmoid activation function. The second parallel layer, which helps in the prediction of primary diacritics, is connected to seven perceptrons followed by a Softmax function. Figure 2 shows the architecture of the deep learning component.
The second phase is for rule-based corrections. It is connected to the input and output of the previous component and corrects the Shadda and the primary diacritics based on Arabic rules.
The third component is for statistical-based corrections. The outputs and inputs of the previous correction component are merged and transformed. Then the resulting sentence is split into words. Each word is checked, and it goes through four sublevels. The first one is dedicated to word trigram corrections; it generates trigrams from the nonvocalized sentence and selects the most frequent vocalization for the second word of the trigram.
The second sublevel is for word bigram corrections; it is similar to the first one, and it chooses the most frequent vocalization of the second word in the bigram. The third sublevel calculates the Levenshtein distance for a nonvocalized word that has known diacritized forms. This distance is essentially the minimum edit distance between the nonvocalized word and its diacritized forms. The last sublevel is applied when the predicted word has never been seen. It calculates the Levenshtein distance between the corresponding pattern of that predicted word and the saved vocalized forms of the same pattern.
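The Levenshtein distance used by the last two sublevels can be computed with the standard dynamic-programming recurrence. The sketch below is illustrative only: `closest_form` is a hypothetical helper showing how the nearest known form could be selected, and the transliterated strings stand in for Arabic words; it is not taken from the code of [22].

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def closest_form(word, known_forms):
    """Hypothetical helper: pick the known form with the smallest edit distance."""
    return min(known_forms, key=lambda form: levenshtein(word, form))


nearest = closest_form("katabu", ["kataba", "kutubun"])
```

Keeping only two rows of the table makes the memory cost linear in the length of the second string, which matters little for single words but keeps the function simple.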

The Modified Proposed Version.
In this part, we present the modifications we made to the pipeline proposed by Abbad et al. [22]. Specifically, instead of using the char one-hot encoding, we used char embedding to see its impact on the system.
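For reference, the one-hot input that this modification replaces can be sketched as follows. The composition of the 38 labels (36 Arabic letters plus the space and the digit 0), the digit-by-digit replacement, and all names below are assumptions made for illustration; the original code of [22] may define them differently.

```python
import re

# Assumed label set: 36 Arabic letters (U+0621..U+063A and U+0641..U+064A),
# plus the space and "0", which gives 38 labels in total.
ARABIC_LETTERS = ([chr(c) for c in range(0x0621, 0x063B)]
                  + [chr(c) for c in range(0x0641, 0x064B)])
ALPHABET = ARABIC_LETTERS + [" ", "0"]
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def preprocess(text):
    """Replace every digit with 0, then keep only letters, spaces, and 0."""
    text = re.sub(r"\d", "0", text)
    return "".join(ch for ch in text if ch in CHAR_TO_INDEX)


def one_hot_encode(sentence):
    """Return a 2D list of shape (length of sentence, number of labels)."""
    rows = []
    for ch in sentence:
        row = [0] * len(ALPHABET)
        row[CHAR_TO_INDEX[ch]] = 1
        rows.append(row)
    return rows


cleaned = preprocess("\u0643\u062a\u0628 12!")  # the letters k-t-b, a space, "12!"
encoded = one_hot_encode(cleaned)
```

With char embedding, each of these sparse rows is replaced by a dense vector of a chosen dimension, which is the change examined in the rest of this section.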

How the Char Embedding Matrix Is Created.
The Arabic chars were embedded using a simple neural network from previous work available on GitHub (https://github.com/sonlamho/Char2Vec). The network takes as input the one-hot encoding of a character and outputs its context vector, which is the distribution of the neighboring chars.
Suppose that c[i] is a character from the corpus, x is its one-hot encoding of dimension v, 2 × k is the number of characters surrounding c[i], and y is the context vector of dimension 2 × k × v. The network learns the weight matrices U and W, which have the shapes (v, d) and (d, 2 × k × v), respectively, and which verify equation (4). Finally, x · U becomes the embedding vector for the char c[i], and it has dimension d.
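Since x is one-hot, the product x · U simply selects one row of U, which is why the learned matrix U can be read as a table of char embeddings. The short sketch below checks this with plain Python lists; the helper names and the toy values of v, d, and k are illustrative assumptions.

```python
def one_hot(index, v):
    """One-hot vector of dimension v with a 1 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(v)]


def vec_mat(x, U):
    """Multiply a row vector x (length v) by a matrix U of shape (v, d)."""
    d = len(U[0])
    return [sum(x[i] * U[i][j] for i in range(len(x))) for j in range(d)]


v, d, k = 4, 3, 2              # toy sizes: 4 chars, 3-dim embeddings, window k = 2
U = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6],
     [0.7, 0.8, 0.9],
     [1.0, 1.1, 1.2]]

embedding = vec_mat(one_hot(2, v), U)  # embedding of the char with index 2
context_dim = 2 * k * v                # dimension of the context vector y
```

In practice the multiplication is therefore replaced by a direct row lookup, which is what an embedding layer does.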

Including Char Embeddings in the Old System.
The proposed modification uses the same architecture of the deep learning component proposed in Abbad et al.'s work [22], as shown in Figure 2. Nevertheless, the input has changed slightly, so that its tensor has three dimensions. The first one is the sentence dimension, the second is for the time steps, and the third is for the char embedding representations instead of the one-hot encodings. Figure 3 shows the proposed input modification.
Additionally, as shown in Figure 4, using char embedding means modifications in both the input layer of the deep learning component and the input of the rule-based components.
The rule-based part uses the char indexes, which should be extracted from the input sequence of char embeddings. Therefore, we created, in Algorithm 5, a method dedicated to that purpose. The algorithm takes seq, a 2D array representing the sequence of char embeddings, and emb_Mat, a 2D array representing all the learned char embeddings. The method outputs the sequence of char indexes corresponding to the char embeddings in seq.
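One way to recover the char indexes from a sequence of embeddings is to match each row of seq against the rows of emb_Mat. The sketch below uses a nearest-row match on the squared Euclidean distance; this matching rule and the function name are assumptions for illustration and are not necessarily the steps of Algorithm 5.

```python
def embeddings_to_indexes(seq, emb_mat):
    """Map each embedding in seq (time_steps, embedding_dim) to the index of
    the closest row of emb_mat (number of chars, embedding_dim)."""
    indexes = []
    for vector in seq:
        distances = [sum((a - b) ** 2 for a, b in zip(vector, row))
                     for row in emb_mat]
        indexes.append(min(range(len(emb_mat)), key=distances.__getitem__))
    return indexes


# Toy usage: three chars with 2-dimensional embeddings.
emb_mat = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
seq = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0], [0.9, 0.1]]
char_indexes = embeddings_to_indexes(seq, emb_mat)
```

A nearest-row match tolerates small floating-point differences between the stored matrix and the values passed through the input pipeline, whereas an exact equality test would not.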

Experiments and Results
In this section, we are going to describe the dataset and evaluation metrics used in this study. Besides, we are going to look at the results based on those metrics and for the same dataset.


Dataset.
To make a fair comparison with other works, we used the freely available benchmark dataset proposed by Fadel et al. [2]. In that work (https://github.com/AliOsm/arabic-text-diacritization), the corpus was chosen randomly from classical Arabic books and the Holy Quran, which belong to the Tashkeela corpus (https://sourceforge.net/projects/tashkeela/) distribution. To make proper use of it, it was cleaned and preprocessed. The resulting output was then split into training, validation, and holdout sets of 90%, 5%, and 5%, respectively.
From the dataset provided, we generated more statistics about the train, validation, and test sets. Table 2 shows some statistics about different chars and tokens, while Table 3 shows the percentage of out-of-vocabulary words in the validation and holdout sets compared to the training set.

Algorithm 5
(i) Input: 2D tensor representing the sequence of embeddings seq,
(ii) // shape of seq is (time_steps, embedding_dim)
(iii) 2D tensor for the learned embedding matrix emb_Mat,
(iv) // shape of emb_Mat is (number of chars, embedding_dim)
(v) Output: a tensor t // of shape (time_steps), where each row contains the index of the char
(vi) seq_shape := shape(seq)
(vii) b_shape := shape(emb_Mat)

Evaluation
Numbers and punctuation are not taken into account when calculating the metrics defined in that work [2], unlike in other previous comparative works.
When calculating the DER and WER, we can either include the case ending or not. Furthermore, we can choose whether to include the 'no diacritic' class, that is, whether to count the letters that carry no diacritics in the original text.
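The two options described above can be made concrete with a small sketch. It assumes that the DER is the percentage of counted characters whose predicted diacritic differs from the reference and that the WER is the percentage of words containing at least one such character; it also treats the diacritic on the last letter of a word as the case ending, which is a simplification. The data layout and names are hypothetical and do not reproduce the evaluation script of [2].

```python
def error_rates(reference, prediction, case_ending=True, no_diacritic=True):
    """Return (DER, WER) in percent.

    Each sentence is a list of words; each word is a list of
    (letter, diacritic) pairs, with "" meaning no diacritic.
    """
    char_total = char_errors = word_errors = 0
    for ref_word, hyp_word in zip(reference, prediction):
        word_wrong = False
        last = len(ref_word) - 1
        for pos, ((_, ref_d), (_, hyp_d)) in enumerate(zip(ref_word, hyp_word)):
            if not case_ending and pos == last:
                continue  # skip the (assumed) case-ending position
            if not no_diacritic and ref_d == "":
                continue  # skip letters with no diacritic in the reference
            char_total += 1
            if ref_d != hyp_d:
                char_errors += 1
                word_wrong = True
        word_errors += word_wrong
    der = 100.0 * char_errors / char_total if char_total else 0.0
    wer = 100.0 * word_errors / len(reference) if reference else 0.0
    return der, wer


# Toy usage with transliterated letters and diacritic names.
ref = [[("k", "a"), ("t", "a"), ("b", "a")], [("q", ""), ("l", "a"), ("m", "un")]]
hyp = [[("k", "a"), ("t", "u"), ("b", "u")], [("q", ""), ("l", "a"), ("m", "in")]]
full = error_rates(ref, hyp)
no_case = error_rates(ref, hyp, case_ending=False)
no_empty = error_rates(ref, hyp, no_diacritic=False)
```

On this toy input, excluding the case ending lowers both rates, while excluding the 'no diacritic' class shrinks the denominator and therefore raises the DER.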

Results
In this part, we are going to present the results of the tested models.
Before that, it is fair to mention that, for the deep learning models, we did not use the Adadelta optimizer as in the original work [22]. Instead, we used the Adam optimization method, because it is computationally efficient, consumes less memory, and converges faster. Besides, the training dataset used for this model is smaller than the one used in the original work [22]. In addition, we tried replacing the LSTM layers in the systems described above with GRU layers to see whether we would get better results.
The systems that are based on HMM achieved the errors shown in Table 4. When examining the errors made by the HMM-based systems, we found that a common one comes from the normalizeAlif method, which substitutes repetitions of the forms of the Arabic Alef letter "أ", "إ", "آ", "ٱ", or "ا" with "ا". This caused problems for the n-gram distributions and the dictionaries. Consequently, it had an impact on the test results. For example, when the system tries to vowelize the word "فلأن" /fli'ana/, it replaces "أ" with "ا". The result is diacritized as "فُلَانٌ" /fulanun/ (meaning: John Doe, or so-and-so), or sometimes as "فَلَانَ" /falana/ (meaning: it became soft or flexible). However, the diacritized form we want is "فَلِأَنَّ" /fali'anna/ (which means: and because). To solve this issue, we altered the method to substitute the repetitions of each Arabic Alef letter form with its corresponding one. The systems showed some improvements, as shown in Table 5.
As shown in Table 6, the HMM-based systems make common mistakes that are related to the case ending. They also make errors related to the letter-based component, where nunations are sometimes placed in the middle of a word. Also, this letter-based HMM component sometimes inserts nunation at the end of verbs. Generally, the letter-based component has generated some errors concerning the internal diacritics. Sometimes, it negatively affects words that were already well diacritized by the preceding word-based HMM.
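The change made to the normalizeAlif method, as described above, can be illustrated with two regular expressions. Both functions are hypothetical reconstructions written for this illustration: the first mimics the described behavior of collapsing any run of Alef forms into a bare Alef, and the second collapses only repetitions of the same form; neither is copied from the original repository.

```python
import re

# Alef forms: U+0623 (hamza above), U+0625 (hamza below), U+0622 (madda),
# U+0671 (wasla), and U+0627 (bare Alef).
ALEF_FORMS = "\u0623\u0625\u0622\u0671\u0627"


def normalize_alif_lossy(text):
    """Assumed original behavior: any run of Alef forms becomes a bare Alef."""
    return re.sub("[" + ALEF_FORMS + "]+", "\u0627", text)


def normalize_alif_preserving(text):
    """Altered behavior: a repeated Alef form is reduced to one copy of itself."""
    return re.sub("([" + ALEF_FORMS + r"])\1+", r"\1", text)


word = "\u0641\u0644\u0623\u0646"  # the undiacritized word from the example above
lossy = normalize_alif_lossy(word)
preserved = normalize_alif_preserving(word)
```

With the first function the hamza is lost, so the word collides with a different dictionary entry; with the second the word is left unchanged, and only genuine repetitions are reduced.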
In Table 7, we list some common mistakes made by the systems that use the deep learning models (with either LSTM or GRU). As shown, the systems make different kinds of errors. Those mistakes are related to the case endings, the internal diacritics, or both. As far as we can tell, those models share some mistakes with the HMM-based ones. Besides, some of the mistakes of the HMM-based systems are not present here, such as the nunation problem in the middle or at the beginning of a word. Table 5 shows the comparison results of all the systems, and the best results are in italics.
For the WER, when we exclude case endings, all the systems get a much lower WER, which means that a good percentage of words have case-ending errors. Besides, by using the 'no diacritic' class and looking at the DER, we can observe from the table that at least 1% of the chars in the original test set are not diacritized.
We can also observe from the results that the letter-based HMM noticeably improves the error rate, especially the DER. Furthermore, for this dataset, we can see that the HMM-based model by Mohamed Ameur et al. [21] has better results than the other systems that use the deep learning component. Nevertheless, this cannot justify some errors that are related to the letter-based HMM component.
Moreover, the system that uses one-hot encoding and the one that uses char embedding have similar error rates. This can be explained by the fact that the embedding is more useful at the word level because, for […] This confirms the fact that LSTM is more accurate on large sequences.
All the systems have some pros and cons related to the results. As an example, the HMM-based models have good error rates, but they output some bad mistakes, especially when they use the letter-based model. Another example is that both kinds of systems, the HMM-based ones and the deep-net-based ones, make mistakes related to the case-ending diacritics or the internal diacritics. Eventually, the studied systems can be improved by training on a larger […]

Conclusion
To sum up, we have presented in this work a comparative study of some Arabic vowelization systems. The purpose of this study was to verify the reproducibility of those previous works, make some alterations to them, and summarize the impacts. Additionally, we ensured a fair comparison of those systems by training and testing them on the same benchmark dataset. As a result, we noticed that the system that uses the hidden Markov model combined with the smoothing techniques gives the best results. Additionally, the second and third systems, which have a deep learning component, perform better when their deep network uses two LSTM layers than when we used different layer architectures. This can be explained by the fact that, theoretically, LSTM can remember longer sequences than GRU. Furthermore, a much larger dataset could be used to see how the deep learning model would perform. This could give a conclusive answer about whether to use char embedding in this deep learning model or whether the one-hot encoding is already a sufficient solution.
Data Availability
The dataset used in this comparison study is publicly available in the GitHub repository https://github.com/AliOsm/arabic-text-diacritization, from a previous work done by Fadel et al. [2]. For further information about the models trained in this work, you can contact the corresponding author via a.mijlad@uae.ac.ma.