Machine-Based Transliterate of Ottoman to Latin-Based Script

In this paper, a machine-based transliteration system is presented. Automatic transliteration of Ottoman into the modern Latin Turkish script can open a wide window for researchers in the fields of history and literature, because most Turkish people are not familiar with the Ottoman script, yet no concrete solution has been proposed for this problem. The proposed method includes several steps, since transliterating the Ottoman alphabet into a Latin base raises many problems. The first step is the basic character mapping, which covers the regular pronunciation and orthography mapping; irregular and exceptional cases are then handled by rules and normalization. The transliteration system achieved an overall accuracy of 73.9%.


Introduction
Transliteration is the process of transforming the script of a language into another script for the same or a different language. Unlike translation, the meaning does not change. For example, the English word data is "deita" in Korean and "deeta" in Japanese when it is transliterated [1]. The input X = a_1, a_2, ..., a_n is mapped directly into the output Y = c_1, c_2, ..., c_n, where X and Y are two different scripts. Although automatic transliteration looks like an easy and straightforward process, there are various exceptions.
In general, transliteration faces different challenges depending on usage; pronunciation varies according to the morphological differences of the language. In other words, not all sounds are represented in one script. In addition, the expressive power of the script is another issue in transliteration. This forces us to use more than one letter to represent a single letter of the other script, or vice versa. This leads to loss of information if the transliteration method is not sufficiently effective [2]. The Ottoman script was used in Turkey as the main script for more than 700 years, from the 13th century until the beginning of the 20th century. During the 20th century, however, the Turkish language underwent a dramatic reform.
The Ottoman script used an extended Arabic alphabet. Many characteristics of the Ottoman script were adopted from Arabic and Persian, including the 28 letters. Writing ran from right to left. Some grammatical and morphological usages of other regional languages can be found in Ottoman texts, but the main characteristics are still those of Modern Turkish [3,4]. The complexity of Ottoman orthography, its exceptions, and its differences from modern Turkish make it difficult for the majority of contemporary Turkish citizens to understand or use [5]. The other important factor in transliteration is pronunciation. In this respect, vowels in the Ottoman script are embedded sounds rather than explicitly written letters [6].
Transliteration is needed mainly for education, the development of the language, and historical, political, and regional documentation. Moreover, within NLP, many applications in information retrieval, translation, and speech use transliteration. In the case of the Ottoman script, documentation is considered the main reason for transliteration. Although the problem of transliterating Ottoman into Latin Turkish has been studied and analysed as a core issue in several computational studies, including [5,7,8,9], a concrete practical method has not been implemented yet.
On the other hand, dealing with Arabic or Perso-Arabic scripts in any NLP application involves many challenges; in our transliteration system from Ottoman Turkish (OTr.) to Latin Turkish (LTr.) script, the most noticeable challenges are the following.
1.1. Mapping Scripts (Many to One). In the transliteration process, a phoneme in the input script can be mapped to a single phoneme in the target script. In our case, most OTr letters form pairs or triples that map to a single LTr character; for instance, the group containing "خ" is represented by the letter "h", {"ص", "س", "ث"} by "s", and {"د", "ط", "ض"} by "d". Table 1 shows the remaining corresponding letters. This eases our procedure and reduces complexity, because back-transliteration, which would be difficult under a many-to-one mapping, is not needed here.
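The many-to-one mapping described above can be sketched as a reverse lookup table. This is only an illustration under stated assumptions: the table holds just the two complete groups named in the text, and the names `MANY_TO_ONE`, `CHAR_MAP`, and `map_consonants` are hypothetical, not taken from the OtoL code.

```python
# Illustrative many-to-one table: each Latin letter lists the Ottoman
# letters that collapse onto it. Only the two groups spelled out in the
# text are included; the full table (Table 1) is not reproduced here.
MANY_TO_ONE = {
    "s": ("ص", "س", "ث"),
    "d": ("د", "ط", "ض"),
}

# Invert the table so every Ottoman letter points to its Latin letter.
CHAR_MAP = {otr: ltr for ltr, group in MANY_TO_ONE.items() for otr in group}


def map_consonants(word: str) -> str:
    """Replace every known Ottoman letter; leave unknown characters as-is."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in word)
```

Because several source letters share one target, the inverse direction is ambiguous, which is why a one-way table of this kind suffices only when back-transliteration is not required.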

Unicode of Ottoman Transcript.
The Arabic Unicode block is used to represent the alphabet, as the source of the script is Arabic and Persian. The range U+0600 to U+06FF holds the Arabic letters together with extensions for other languages. One property of the script is cursive, joined writing; for instance, a run of isolated "ه" letters and the same letters written joined are stored differently in memory although they consist of the same letters: the letter "ه" itself is U+0647 in both cases, but the unjoined run is held apart by extra characters such as the zero-width non-joiner (U+200C), whereas an elongated joined form "هـ" contains the tatweel (U+0640) [10].
Some letters in OTr act as variables when transliterated: depending on their position in the word, they behave like a vowel or a consonant. At the beginning of a word, the letter "ع" (ayin) is pronounced as "a", "ı", "i", "u", "ü", "ö", or "o"; for example, "ilim" comes from "علم", which means science, and "arz" from "عرض", which means Earth. Examples of "ع" in other positions are shown in Table 3.
In the same way, "ء" (hemza) is not transliterated in a single way. Unlike ayin, it does not stand alone when it sits on top of "و", "ي", or "ا"; it is transliterated as "ü", "i", or "e" according to its position, and is sometimes omitted altogether.

Language Update.
Another issue that complicates the transliteration of a language such as Ottoman Turkish is its rich vocabulary and complexity. The language took vocabulary and syntax from Arabic and Persian in addition to Turkish itself, and because these languages come from different families, spelling and pronunciation follow multiple, mutually incompatible rules [11]. Besides the change of script from OTr to LTr, the language has been updated continuously since 1929 [12,13]. Table 4 includes related examples. The current work does not deal with language update; the OtoL transliteration tool covers only basic transliteration for now, without handling updated words.
In our project, a transliteration method from the Ottoman script to the modern Latin Turkish alphabet is implemented. The method depends mainly on the rules used in manual transliteration, combined with normalization steps; this approach was proposed earlier in [5,8], but without an implementation or reported results. The first section summarizes the related work, concentrating on Turkish transliteration models. In Section 2, the Turkish writing system is briefly explained. The third section describes the proposed system in detail; for simplicity, we call our system OtoL. Section 4 sheds light on the normalization level and error correction, followed by the experimental results. Finally, the last section presents the conclusion of this work and some suggestions for future work.

Table 2: Ottoman represented in Latin script (one to many). Columns: OTr, LTr.

Related Works
Many natural language processing (NLP) applications use transliteration as a basic step, including translation, terminology extraction, and intralingual data linking. In general, transliteration uses the phoneme or the grapheme as the basis of its procedures; the two models perform differently depending on the data set, with the grapheme-based method having the upper hand [14,15]. The authors in [16,17], on the other hand, deployed a phoneme-based procedure. The authors in [18] implemented supervised and semisupervised models in their transliteration approach, while the authors in [19] used finite state automata (FSA) for named entity transliteration.
Since the dramatic change in the Turkish script a century ago, transliteration has been attempted manually. One of the most complete documented resources in this direction is [6]; in more than 34 pages, the author explains all the theoretical details and the importance of transliteration, with examples. Although the modern Turkish script and language have since been updated, the document remains valuable. One of the first proposed frameworks was [8]; the authors summarized their method as a pipeline of natural language processing techniques: morphological parsing, transliteration, dictionary lookup, morphological synthesis, word disambiguation, error handling, and noun detection, respectively. The dictionary contained around 30000 words, which according to the authors should be sufficient for transliterating newspapers and magazines. The next approach was [5], which first reported the basic challenges and explained the need for a transliteration method. The authors applied partial parsing to the text, extracted the root, transliterated the root, and finally searched two dictionaries: a limited Ottoman Turkish to Modern Turkish dictionary that is searched directly, and a more extensive Modern Turkish word list that is searched with a regular expression generated from the Ottoman spelling of a presumptive root, in order to reconstruct the text in modern Latin Turkish. Reference [7] is the most recent published attempt at language update; it implemented a supervised neural network trained on a corpus of old and modern Turkish books. The study is distinguished from previous ones by reporting its results.
The results concern only one part of the transliteration process, namely normalization, and reach a BLEU score of 33.8 with an RNN. A summary of the articles related to this area is shown in Table 5.

Turkish Writing System.
The Turkish language has an important role in expressing ideology and in the construction of national identity. In the Ottoman period, when the Islamic government was in power, an extended Arabic-Persian script was used as the writing system for Turkish. With the establishment of the modern Turkish government, the Latin alphabet came into use from the beginning of 1929 [20]. The Ottoman script keeps the main characteristics of the Arabic script: it is written from right to left, and letters change form according to their position (beginning, middle, end, and isolated). Vowels are limited, since almost all letters are consonants; context therefore plays a large role in reading and meaning. The alphabet extends the 28 Arabic letters with four Persian letters (پ, چ, ژ, گ), which are shown in Table 6 [21,22]. For simplicity, we use OTr. to represent the Ottoman transcript and LTr. for the Latin Turkish transcript.
On the other hand, the modern LTr system was extended to 29 letters by adding seven modified letters (Ç, Ğ, I, İ, Ö, Ş, Ü) to the original twenty-two. The important property of this writing system is that sounds are represented explicitly rather than embedded, unlike in OTr, which affects the efficiency and complexity of transliteration. Other properties include the left-to-right writing direction and capitalization in specific cases [6].
It is worth mentioning that our transliteration method is a single-direction model from OTr to LTr. The reason is to reduce complexity, since OTr is no longer in use for writing. The model is useful for retrieving old Ottoman archives and documents.

Proposed Method.
The proposed method, OtoL, is straightforward and for now excludes any kind of dictionary or learning approach, as shown in Figure 1. It begins with optical character recognition (OCR): since the Ottoman documents available online are limited to image resources, the current version of OtoL uses an online tool (http://miletos.co/en/showcase/ottoman-ocr) for this task (Figure 2). We do not go into the details of the OCR process and its method, since we manually checked the validity of the OTr text obtained. First, the data arranged in text format are converted to Unicode, to remove ambiguity between characters that are similar in shape but different in Unicode. Second, to reduce errors and ambiguity in the current state of the system, a hamza ("ء") written on a carrier letter is dropped and the letter is replaced by its Unicode base letter ("و", "ا", "ي"); for instance, "مؤمن" is transliterated to "mümin", "تأريخ" to "tarih", and "رئيس" to "reis". This step also removes some of the complexity of mapping letters.
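The hamza step can be sketched with Unicode canonical decomposition. This is a minimal sketch and not the OtoL implementation: it assumes that the letters concerned are the precomposed hamza carriers (U+0623, U+0624, U+0625, U+0626), and the function name `strip_hamza` is hypothetical.

```python
import unicodedata

# Combining marks that canonical decomposition (NFD) produces for the
# precomposed hamza carriers: HAMZA ABOVE and HAMZA BELOW.
HAMZA_MARKS = {"\u0654", "\u0655"}


def strip_hamza(text: str) -> str:
    """Replace hamza carrier letters by their base letters.

    NFD splits a carrier such as U+0624 into its base letter (U+0648)
    plus the combining hamza; the hamza is dropped and the result is
    recomposed with NFC so other composed letters stay intact.
    """
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in HAMZA_MARKS)
    return unicodedata.normalize("NFC", kept)
```

Applied to the three examples in the text, the carrier in each word is reduced to a plain waw, alif, or ya before character mapping. A standalone hamza (U+0621) is left untouched here, since the text only describes the carrier cases.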
In fact, the proposed method relies on the phonemic properties of Ottoman writing to overcome the mapping challenges mentioned above. For this purpose, our method uses the rules of [6], which were collected for the American Library Association and the Library of Congress, as Ottoman was one of the major Islamic languages. (1) Front vowels (e, ü, i, ö) and back vowels (a, o, u, ı) follow each other in the syllable; for instance, "كرام" is transliterated to "kiram" and "مقبل" to "mukbil", as shown in Algorithm 1.
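The two vowel classes in rule (1) can be illustrated with a small classifier. This is only a sketch on the Latin side of the mapping: the vowel sets come from the text, and the output names follow the b_vowel and f_vowel outputs of Algorithm 1, but the function `vowel_classes` and its behaviour are assumptions, not the published algorithm.

```python
# Vowel sets as listed in rule (1) of the text.
FRONT_VOWELS = set("eüiö")
BACK_VOWELS = set("aouı")


def vowel_classes(word: str) -> tuple[list[str], list[str]]:
    """Return (b_vowel, f_vowel): the back and front vowels of a
    lowercase Latin Turkish word, each in order of appearance."""
    b_vowel = [ch for ch in word if ch in BACK_VOWELS]
    f_vowel = [ch for ch in word if ch in FRONT_VOWELS]
    return b_vowel, f_vowel
```

For the examples in the text, `vowel_classes("mukbil")` yields the back vowel "u" and the front vowel "i". Input is expected in lowercase, because Python's default `str.lower()` does not apply the Turkish dotted and dotless i rules.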

Data.
The collection of textual data in Ottoman is regarded as one of the first obstacles since, as mentioned before, writing in Ottoman letters ended about a century ago. To overcome this, we explored two sources. The first is a collection of scanned Ottoman books and articles gathered from the archive website (https://archive.org/details/books?query=osmanli&sin=), converted from images into Ottoman-script text by optical character recognition; for this we used the miletos tool (http://miletos.co/en/showcase/ottoman-ocr). The second resource is the Ottoman book Nutuk, which is already available in both OTr and LTr forms. However, these two versions are not transliterations of each other but were simply translated or rewritten. For this reason, we built our own OtoT corpus, which contains the text in both OTr and LTr. The corpus includes 1000 sentences in each part, with the scanned document attached for use with OCR applications.

Table 5: The most related works.

Scientific Programming

Results and Discussion
The current system, OtoL, transliterates in one direction only, unlike other systems that operate bidirectionally. The main reason is that converting Latin Turkish to Ottoman serves little practical purpose. Table 7 shows the results of the system in transliterating OTr to LTr. The results are broken down by program step, with the percentage of correct and incorrect outputs for each. Previous studies did not report results in this way; they only showed samples of their output [5,7,8]. Columns two and three show a basic rule-based step, detecting the beginning syllable of each word, on which most mapping cases depend according to [6], while the last column shows the number of correctly transliterated words.
Since Ottoman Turkish derives words and expressions from different languages, especially Arabic and Persian, in addition to Turkish itself, applying one specific law and confining all writing rules to it is very difficult; nonetheless, we were careful to apply the transliteration rules given in [6].
In detecting the consonant letters and assigning them to the forward and backward classes, which is a straightforward task, the program performed very well, with an efficiency of 97%. Dealing with phonetic syllables, on the other hand, is a more complex case than consonants, because Ottoman writing, like Arabic, contains vowel marks ("movements") and phonetic connotations. Of 451 cases, the OtoL program failed on 68, giving an accuracy of 84.9%.

Algorithm 1: Input: word X. Output: b_vowel, f_vowel.
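The reported syllable accuracy follows directly from the two counts given above. The sketch below only reproduces that arithmetic; the function name `accuracy_percent` is an assumption for illustration.

```python
def accuracy_percent(total: int, errors: int) -> float:
    """Share of correct cases, in percent, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * (total - errors) / total, 1)


# 451 syllable cases with 68 errors leave 383 correct cases.
syllable_accuracy = accuracy_percent(451, 68)
```

With 383 of 451 cases correct, the ratio is about 0.8492, which rounds to the 84.9% stated in the text.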
Finally, the OtoL program obtained an overall accuracy of 73.9%, since transliteration from Ottoman to Latin is not direct and involves the many deviations and challenges mentioned before; to give one example among others, the letter "و" in OTr can stand for any of {"o", "u", "ü", "ö"} in LTr. Table 8 shows a sample of the OtoL results.
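The one-to-many case of "و" can be sketched as candidate generation narrowed by vowel class. This is a hypothetical illustration, not the selection logic of OtoL: the four readings and the two vowel sets come from the text, while the function `vav_candidates` and its `harmony` parameter are assumptions.

```python
# Vowel classes and the four Latin readings of the Ottoman letter waw,
# as listed in the text.
FRONT = {"e", "ü", "i", "ö"}
BACK = {"a", "o", "u", "ı"}
VAV_READINGS = ("o", "u", "ü", "ö")


def vav_candidates(harmony=None):
    """Return the possible readings of waw.

    harmony may be "front", "back", or None when the vowel class of
    the syllable is unknown; None keeps all four readings.
    """
    if harmony == "front":
        return [v for v in VAV_READINGS if v in FRONT]
    if harmony == "back":
        return [v for v in VAV_READINGS if v in BACK]
    return list(VAV_READINGS)
```

Even when the vowel class is known, two readings remain for each class, which is consistent with the observation that many of the remaining errors involve one-to-many letters.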

Conclusion and Future Work
The results of the method used in OtoL for transliterating the Ottoman alphabet into the Latin alphabet are reasonable given the complexities of the language and the contrast between Perso-Arabic and Latin letters. Most of the challenges and difficulties of transliteration were taken into consideration, and the straightforwardness and simplicity of the procedure are the basic merits of this method. Although our method reaches above 91% in detecting vowel and consonant syllables, which is one of the rules for correctly mapping one-to-many characters, especially {و}, {ى}, and {ه}, most of the wrong transliterations unfortunately fall in this area.
To improve the results of the program, the following steps should be applied in newer versions:
(1) Dictionaries should be used to improve performance, especially because many words change when transliterated from the Ottoman to the Latin alphabet, apart from the continuous updates in the Turkish language that replace words of Arabic, Persian, and other origins with original Turkish words.
(3) A bigger and well-prepared data set should be used; the Ottoman text was obtained by applying OCR to images, which introduces small errors and misses some notations.
(4) Heavy normalization should be implemented, especially if we aim to obtain an updated Turkish script; this method has been applied in many language reforms and in social media comment normalization, as in [20,23].
Data Availability
The data and code are available from the corresponding author upon request (ashti.a@garmian.edu.krd).