Afan Oromo Speech-Based Computer Command and Control: An Evaluation with Selected Commands



Introduction
Speech is the simplest and most vital form of communication [1]. Through this speech-based Afan Oromo interface, users can communicate with the computer by speaking commands instead of relying on standard input devices such as keyboards and mice. Humans naturally utilize acoustic, lexical, and contextual information to communicate through speech [2]. To achieve computer-generated speech that resembles human speech, it is important to explore systems that can understand human speech without the need for human interpretation [3,4].
The process of converting speech sound waves into readable text, allowing the computer to hear and respond automatically to human speech, is known as speech recognition [5]. Speech command-based applications are widely used in various fields and have significantly enhanced human-computer interaction [6]. Speech recognition interfaces are integrated into digital devices, e-commerce, e-learning, the Internet of Things, robotics, and medical equipment to facilitate control and monitoring through speech input [7,8].
Speech is a natural, flexible, and simple mode of communication [9]. Therefore, where a computer is used in the dark, the user's hands are occupied, speech recognition is needed for a specific purpose, or manual input is not feasible, as in remote areas, it is advantageous to employ speech-based computer commands for more natural and comfortable communication. Speaking vocal commands directly to applications, instead of relying on a mouse and keyboard to manipulate text, accelerates communication in human-computer interactions. However, achieving accurate automatic speech recognition (ASR) remains a major challenge due to factors such as speaker and language variability, vocabulary size, and noise interference [10,11].
While speech is a natural, flexible, and effortless mode of communication, the relationship between the physical speech command signal and the corresponding words is exceedingly complex and difficult for computers to comprehend [12]. In the natural world, speech sounds vary across individuals based on factors such as age and gender, which makes them hard for machines to differentiate. The accuracy of an ASR system also depends on the vocabulary size and on how much data the model is trained on. In addition, algorithms and features play influential roles in speech recognition [13]. Labeling and mapping sequences of speech vectors to symbol sequences require precise boundaries, as words can be spelled in various ways and changes in accent significantly affect speech recognition accuracy [14].
The approach to exploring automatic speech recognition depends on the amount of collected data and the study's objectives. ASR systems have been developed with algorithms such as the hidden Markov model (HMM), recurrent neural network (RNN), deep neural network (DNN), convolutional neural network (CNN), hybrid hidden Markov model with multilayer perceptron (HMM-MLP), and hybrid hidden Markov model with deep neural network (HMM-DNN) [15][16][17].
Some people confuse multimodal speech with automatic speech recognition command and control, although the two are fundamentally distinct. Multimodal speech employs various communication channels for human-computer interaction, while speech recognition commands do not [18]. Speech-based computer command and control is used for speech output, whereas a recognition system is used for speech input.
Therefore, the primary goal of scholars investigating speech recognition is to develop a better recognizer that can accurately transform sequences of feature vectors into words using phonetic and linguistic information, enabling human-computer conversations on any topic, in any environment [19].
This research is primarily intended for individuals with special needs. Speech-based computer commands enable users to interact with devices using spoken language, benefiting those with disabilities and providing hands-free operation. This technology comprehends natural language and finds applications in virtual assistants and dictation, but it faces challenges with accuracy in noisy environments. Ensuring security, expanding language support, and integrating with AI are important considerations. The advancement of this technology enhances user experiences and accessibility while driving innovation in user interfaces. These commands, which utilize natural language and oral speech, offer convenience to people with disabilities such as arm impairment or visual impairment, as well as those with busy hands. This technology has applications in various areas, such as Windows applications, Wi-Fi control, nuclear reactions, medical devices, robots, and electronic/computing devices. The specific focus of the research is the development of speaker-independent speech-based computer commands for the Afan Oromo language. The collected speech data can be utilized for various purposes, including Google voice recognition for Afan Oromo commands and evaluating the effectiveness of Afan Oromo speech-based computer commands.
Furthermore, this work should encourage researchers to explore Afan Oromo speech-based computer commands and controls, given the limited work in this area and the growing need for computer commands and controls in the Afan Oromo language. This research aims to examine the feasibility of using speech-based computer commands to control text processing.

Literature Review
The way humans interact with machines has been transformed by automatic speech recognition (ASR) technology, which has found extensive applications across various domains. These include voice assistants, transcription services, call center automation, and language learning tools. ASR systems play a pivotal role by converting spoken language into written text, facilitating seamless and efficient communication between humans and machines. As the demand for accurate and robust ASR systems continues to increase, it becomes crucial to explore the present state of the field, identify key research trends, and address associated challenges.
The main objective of this literature review is to thoroughly examine the advancements, techniques, and challenges in the realm of ASR. Through a comprehensive analysis of existing research, we aim to gain a deeper understanding of the fundamental principles, methodologies, and algorithms employed in ASR systems. In addition, we explore the diverse applications of ASR technology and its impact on various industries and sectors.
Recent years have witnessed remarkable progress in ASR technology, owing to advancements in deep learning techniques and the availability of extensive speech datasets. However, despite these advancements, several challenges persist. These include effectively handling variations in speech patterns, mitigating the impact of background noise and environmental conditions, and accommodating the intricacies of diverse languages and accents. By critically analyzing the existing literature, our goal is to identify the current state-of-the-art ASR models and techniques while also highlighting ongoing research efforts aimed at addressing these challenges.

Advances in Human-Computer Interaction

One application enables elderly and disabled users to effortlessly control their electrical appliances within the Internet of Things (IoT) environment without the need for physical movement. Using speech recognition technology, Apple Siri, Amazon Alexa, and Google Assistant are compared in terms of their effectiveness in executing IoT-based voice commands using machine learning. In experiments involving smartphones, smart speakers, and control systems, Google Assistant demonstrates the highest pronunciation accuracy at 95%, while Apple Siri exhibits the lowest performance at 80%. However, the study's findings indicate that the development of online and real-time interaction systems for security applications in smart cities and offices will continue to be challenging due to the limitations of IoT technology [8].
To create a speech control interface for computers, a GMM model with a performance rate of 74.38% has been developed. Future research should focus on expanding the repertoire of commands to enhance the efficiency of the interface [20].
Based on empirical data collected from structured user-centered activities involving military personnel, AI-based decision support for a speech command and control system, when successfully implemented, can provide the advantage of faster information analysis, enabling quicker decision-making and operational superiority over adversaries. AI support for execution may involve evaluating action alternatives for commanders and facilitating various types of operations, such as using speech-to-text tools for swift and accurate communication during briefings. Ensuring transparency is a critical challenge in military decision support, where the ability to explain recommendations and to understand and rely on the system is of utmost importance [21].
A voice-controlled electric fan can comprehend spoken commands, offering a more efficient user experience because spoken commands are captured more rapidly than written or typed ones. In terms of speed, the system can capture the user's speech faster than the user can operate the electric fan's controls directly. According to the results, the Filipino voice commands "IKALAWA" and "IKATLO" exhibit the highest accuracy rate at 100%, while the command "ISA" yields a 50% success rate and "PATAYIN" yields 60%. However, the drawback of this voice command system is that it only supports Filipino and is not applicable to other languages [22].
This speech interface empowers users to execute common computer commands using the Gaussian mixture model (GMM) and a feature technique called Spectral Feature-based Speech Recognition (SFBSR) for speech interface control on PC Windows. The best performance is achieved with 64 centers and 200 iterations. However, as the number of iterations increases, so does the computation time required for creating the GMM. Although the study used a limited number of speech commands for interface development, the average recognition performance of speech commands reaches 74.38% [20].
In 2003 [23], Martha attempted an Amharic speech recognition system for computer command and control, based on an experiment with Microsoft Word using the HMM model.
A corpus was prepared with the participation of 10 female and 16 male speakers in the age range of 20 to 35 years. The application operates on a limited vocabulary, employing a specialized Amharic word recognizer designed for individuals aged 20 to 35. The recognizer was trained on 76.9% of the recorded data, while the remaining data were used to assess its performance. The model reached an accuracy of 80%, which fell short of its goal. An accuracy of 80% is acceptable, but the researcher used a very small amount of data, and the evaluation was limited to live testing. In addition, the interface developed for isolated Amharic words cannot be integrated into Microsoft Word, and the researcher did not recommend a service for this purpose.
An Amharic speech-based dictation system was evaluated for the judicial domain, with an accuracy of 84.550% and a word error rate of 16.475%, using an HMM approach (Sphinx tool) with continuous speech data for training and building two acoustic models and text data for building a language model. The results demonstrated acceptable performance, but transcription for the indirect court was not addressed [24].
A spontaneous, speaker-independent Amharic speech recognizer was developed using training data that consist of 9460 unique words and approximately 3 hours and 10 minutes of speech. According to this study, the best recognizer performance is 41.60% word accuracy for speakers involved in training, 39.86% for test data from speakers both involved and not involved in training, and 23.25% for speakers not involved in training. The recognizer developed and tested using less frequent nonspeech events had lower word accuracy than those that included them [25].
A study on speech segmentation for Amharic notes that extracting information from a large archive requires both the extraction of the audio file structure and the extraction of speech recognition information. Using a monosyllable acoustic model and forced alignment to segment speech automatically, promising results were obtained. The most accurate results were achieved with a decision tree classifier. The highest segmentation accuracies were 91.93% and 85% for read-aloud and spontaneous speech, respectively. Other evaluation techniques were not applied to the recognizer [26].
Currently, people communicate with electronic devices through speech with the help of automatic speech recognition systems. The Sphinx 4 tool was used in one study to investigate the feasibility of developing automatic speech recognition for the Ge'ez language using HMMs. A word accuracy of 79.70% was shown in two experiments conducted online and offline with the Sphinx tool. Test results show that the system using the developed interface has a word accuracy of 67.79%. In this study, recognition accuracy increased as the corpus size increased [27].
For Afan Oromo, phone-based and syllable-based recognition was studied with the HTK tool, using speech corpora collected from 39 males and 24 females of various dialects, with approximately four hours of speech for training and 40 minutes for testing. Syllable recognition showed promise, but the overall accuracy of this model is low: 39.55%, 47.21%, 55.35%, and 43.96% with monophones, triphones, tied-state triphones, and syllable-level recognition, respectively [28].
In another study, the HTK tool was used to apply HMM modeling techniques to large-vocabulary, speaker-independent, continuous Afan Oromo speech. For this study, 2953 utterances (approximately 6 hours of speech) were collected from 57 speakers (42 males and 15 females). According to the researchers, setting the word insertion penalty to 1.0 and the grammar scale factor to 15.0 can improve system performance. Acoustic models that are context-independent (based on monophones) and context-dependent (based on triphones) were developed. In terms of word error rate, the results for the context-independent and context-dependent models were 91.46% and 89.84%, respectively. The outcome, however, needs to be supported by further research [29].
In 2016, the authors conducted research on the feasibility of developing a large-vocabulary, speaker-independent continuous speech recognition system for Afan Oromo. In their experiments, the bigram language model performed best, with 93% word accuracy for the speaker-dependent test dataset and 43.6% for the speaker-independent test dataset using the HMM approach. Compared with the previous researchers' findings, this one is less accurate [30].
To get a better understanding of ASR and ASR speech-based computer commands, we reviewed various literature sources. A significant change in the overall accuracy of speech models is observed because of advancements in the open-source toolkits HTK, CMU-Sphinx, and Kaldi and their fast processing speed. The performance of a speech system is difficult to predict because it depends on variations in speakers, their pronunciations, the rate at which they speak, and the dialects of the regions they belong to [31]. The accuracy of an ASR speech-based computer command recognizer varies with ambiguity and vocabulary size; hence, hybrid HMM works best for large vocabularies and HMM works best for small vocabularies [32].
In this preliminary literature review, most studies on ASR and speech-based computer commands have been conducted in international languages, with few studies on the local languages Afan Oromo and Amharic. The application of Amharic speech commands is related to our field of study, but there is still no Afan Oromo speech-based computer command system. This research gap motivated us to investigate Afan Oromo speech-based computer commands using HMM model evaluation with selected commands.

Materials and Methods
The overall structure of our system consists of three essential components, each playing a vital role in facilitating efficient communication through speech-driven computer commands: the command text translation module, the Afan Oromo ASR (automatic speech recognition) module, and the communication interface.
Command text translation: The command text translation module functions as a conduit between English and Afan Oromo command texts. Its primary purpose is to take selected command texts in English and accurately convert them into their corresponding Afan Oromo equivalents. This step is crucial to ensure that users comfortable with Afan Oromo can seamlessly interact with our system, thereby enhancing accessibility and usability for a broader audience.
Afan Oromo ASR: At the core of our speech-based interaction system lies the Afan Oromo ASR module. It takes the translated Afan Oromo command texts, coupled with the necessary data for robust speech recognition. To achieve this, we prepare corpora, which comprise various datasets, including training speech data, training text data, the acoustic model of speech units, and a statistical language model. These components work together, enabling our system to recognize and comprehend spoken Afan Oromo.
Communication interface: The communication interface serves as the point of interaction between users and our system. It acts as the gateway through which users convey their commands, and our system responds with appropriate actions. Supported by the models generated by the Afan Oromo ASR module, this interface empowers users to communicate their intentions using spoken Afan Oromo commands. It facilitates hands-free operation and benefits individuals with disabilities, those with busy hands, and those seeking a more natural and convenient way to engage with computers and devices.
Figure 1 shows the general architecture of the proposed Afan Oromo ASR speech-based computer command.

The HTK Software Toolkit.
The HMM model is used to train and decode the recognizer for the application's speech-based computer commands. Various tools, including HTK, Sphinx, and Kaldi, have been designed for building HMM-based processing of speech audio data for ASR isolated word-level recognition [33]. HTK, the most popular toolkit for building hidden Markov models, was created especially for the implementation of speech-based isolated word recognition [31]. Therefore, the HTK toolkit was selected for the investigation of Afan Oromo isolated speech-based recognition of computer commands. HTK toolkit libraries play a vital role in tasks such as estimating the parameters of a set of HMMs during training, transcribing unknown utterances, and decoding speech signals [14]. Figure 2 shows many of the tools used in HTK.
The HShell library module in HTK oversees user interactions with the operating system, managing input and output. HLM is utilized for preparing language model files, HNet for creating word networks and lattices, HDict for generating dictionaries, HVQ for forming VQ

The Proposed Afan Oromo Speech-Based Computer Command Prototype.
The development of the prototype follows a systematic approach that can be broken down into three distinct phases: data preparation, recognizer training and testing, and analyzer performance analysis. Each of these phases plays a critical role in creating the proposed Afan Oromo speech-based computer command prototype, as illustrated in Figure 3.
In the initial phase, data preparation, we start from the translated Microsoft (MS) commands. This phase serves as the foundation for the subsequent stages. We extract features from the recordings of these translated commands and store them in the Afan Oromo speech command repository. Simultaneously, the text of these commands is archived in both the Afan Oromo phone dictionary and the Afan Oromo command language model. This archive ensures that all relevant linguistic elements are preserved and readily available for further processing.
The Afan Oromo phone dictionary subsequently undergoes segmentation and transcription. The resulting linguistic units are then integrated into the acoustic model. The purpose of this integration is to establish a bridge between linguistic representations and acoustic patterns, enhancing the model's ability to interpret and analyze spoken commands effectively. This phase ends with preparing the data for MFCC (Mel-frequency cepstral coefficient) feature extraction. This preparation results in the training, testing, and transcribed text corpora used in the subsequent phases.
In the recognizer training and testing phase, we use the prepared transcribed text corpus. Here, the focus is on the acoustic model enriched with linguistic information. The process encompasses not only model training but also rigorous testing, including segmentation and transcription tasks. The curated training corpus is introduced to the acoustic model to further enhance its proficiency. Note that this step precedes actual model training, as the acoustic model must be equipped with essential linguistic resources, such as the phone dictionary and the translated MS commands, before formal training begins.
Meanwhile, the test dataset is used to assess the model's performance against the diverse set of commands it might encounter. This evaluation provides valuable insights into the model's strengths and areas for improvement. The evaluation process examines the decoded output of the model, considering factors such as segmentation and transcription accuracy. The ultimate goal of this phase is to refine the model's capabilities, ensuring that it accurately comprehends and responds to spoken commands.
Finally, the acoustic model is integrated with the analyzer performance analysis. This integration is facilitated via a communication interface that promotes seamless interaction between the two components. The interface shares information between the model and the analyzer, enabling a comprehensive assessment of the prototype's performance. Notably, the interface relies on data derived from our MFCC feature extraction, establishing a strong link between the preparatory phases and the final analysis.
Throughout all phases, a diverse array of HTK tools is employed. These tools contribute to the refinement, enhancement, and analysis of the prototype.
The proposed Afan Oromo speech-based computer command prototype thus comprises three interconnected phases: the groundwork of data preparation, the model refinement of recognizer training and testing, and the comprehensive evaluation of analyzer performance analysis.

Data Preparation.
The initial stages of preprocessing computer text commands involve the translation and transcription of these commands, as well as the creation of phone-based dictionaries. Transcription occurs at two levels: the word level and the phone level. For the transcription at the phone level, we utilize a pronunciation dictionary file containing a list of words and their possible equivalent phone sequences.
The transcribed files, along with the acoustic features of the speech signals, are essential for the development of a recognition model. To obtain audio signals for this process, different speakers read the translated script of each selected text computer command. The preparation includes noise reduction using both front-end and back-end methods. The prepared data are then used to train and test the Afan Oromo speech-based recognizer, employing the HTK toolkit. This approach ensures that our recognition model is effectively trained and evaluated.

Command Translation.
In this study, our initial step is to select English computer commands and translate them into equivalent commands for the Afan Oromo text corpus. We sought the expertise of a linguistic professional to assist us in this process. The commands chosen for translation are Microsoft shortcut commands specifically designed for word processing, such as Copy, Cut, Paste, Save, and Open.

Pronunciation Dictionary.
To train and transcribe at the phone level, language lexicons or pronunciation dictionaries are essential files. The researcher developed Python code to generate phone-based pronunciations for each word. The prepared pronunciation dictionaries include both phone-based and alternative versions. It is important to note that no pronunciation dictionary suitable for this research had previously been created or published, so the researcher had to prepare the pronunciations to be used. The initial step involves creating a phone-based dictionary for training and labeling purposes, utilizing recorded computer command words from the text corpus. This begins by generating a list of words extracted from the text corpus. A Perl program is then executed to facilitate the creation of these word lists.

Transcribing Segmented Speech.
To create a speech corpus, our initial step involves transcribing the chosen text computer commands, which serve as the foundation for the speech corpus preparation. We carefully select the source and define the scope of the text to be used. Transcribing segmented speech is essential for constructing an acoustic model. In this process, we utilize the Afan Oromo alphabet, known as "Qubee," to represent the phones in each segment. Following the Afan Oromo writing rules, we construct words.
It is important to note that some phonemes in the IPA representation of the Afan Oromo alphabet cannot be represented in ASCII and are consequently unsupported by HTK. To handle the transcription of these letters, we transcribe each one with its Qubee spelling, except for glottal sounds, which are treated differently. For instance, "c" corresponds to "c," "ch" represents "ch," and so on.
For glottal sounds, however, we use "hh." In Afan Oromo, the "h" phoneme is not duplicated. The phonemes used in the transcription, along with their IPA equivalents, are listed in Table 1.
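The transcription rules above can be illustrated with a minimal Python sketch. It is not the researcher's script: the digraph list, the characters treated as glottal marks, and the example words are assumptions made for illustration, and the authoritative phone set is the one in Table 1. Doubled letters are simply emitted twice, which is also an assumption.

```python
# Hypothetical digraph inventory; replace with the phone set from Table 1.
DIGRAPHS = ("ch", "dh", "ny", "ph", "sh")
# Assumed orthographic marks for the glottal sound (ASCII and curly apostrophe).
GLOTTAL_MARKS = ("'", "\u2019")


def word_to_phones(word):
    """Split a Qubee-spelled word into ASCII phone symbols."""
    word = word.lower()
    phones = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            phones.append(pair)        # two letters, one phone (e.g. "ch")
            i += 2
        elif word[i] in GLOTTAL_MARKS:
            phones.append("hh")        # glottal sound is written "hh"
            i += 1
        elif word[i].isalpha():
            phones.append(word[i])     # every other letter maps to itself
            i += 1
        else:
            i += 1                     # skip digits, hyphens, punctuation
    return phones


def dictionary_line(word):
    """Format one 'word  phone phone ...' pronunciation entry."""
    return word + "  " + " ".join(word_to_phones(word))


if __name__ == "__main__":
    for w in ("galchi", "olkaa'i"):    # illustrative words only
        print(dictionary_line(w))
```

Matching digraphs before single letters keeps "ch" from being split into "c" and "h"; a real dictionary would also need the alternative pronunciations mentioned earlier.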

Speech Data Collection.
Because no pre-existing speech dataset was available for the investigation and analysis of computer commands conveyed through speech, a new corpus had to be created from the ground up. This process was notably time-intensive. The speech dataset was compiled exclusively from specific text commands used in Microsoft applications, which are vital for text-based operations. To capture the audio signals of spoken commands, the translated text-based computer commands were read by multiple speakers. A subset of the available speakers was selected and asked to read the provided scripts from the text corpus. Speakers were not selected to balance gender or age; the determining factor for selection was speaker availability. The recording sessions took place in both a high school environment and the Oromia Science and Technology video conference room, chosen to minimize background noise interference.
In total, 38 speakers participated in this study, ranging in age from 18 to 40 years. Out of these, 32 speakers (18 males and 14 females) were employed for training.
For word-level transcriptions, an orthographic transcription in HTK label format is required. This can be achieved by either creating separate label files for each line of the word file or using a programming language to construct a master label file (MLF). In this experiment, the second method was chosen, and a Perl script called prompts2mlf was used to generate the .mlf file, resulting in a file named trainwords.mlf containing word-level transcriptions. To convert the word-level transcriptions to phone-level transcriptions, the HLEd command from the HTK tool was used. This command substitutes each word with its corresponding phones by looking them up in the prepared dictionary file. The output is stored in a file called phones0.mlf, which does not include short pauses (sp) after each word-phone group. The HLEd command is executed with the mkphones0.led edit script. Afterward, the HLEd command is run again with the mkphones1.led edit script to create a phones1.mlf file that includes short pauses (sp) after each word-phone group. Overall, the process involves creating the trainwords.mlf file for word-level transcriptions, followed by generating the phones0.mlf and phones1.mlf files for phone-level transcriptions using the HLEd command with the respective edit scripts.
RNNs, however, are well-suited for large-scale datasets. They are designed for complex sequential prediction tasks that involve generating irregular and error-prone sequences as outputs. In speech recognition, when the dataset is small and uncomplicated, the HMM model tends to outperform the RNN model. For our research on Afan Oromo speech-based computer command and control in Microsoft Word, our available data are limited and not complex. Hence, HMMs are considered a more suitable candidate model for investigation than RNNs.
This is the stage where recognition and decoding occur, utilizing the HMM model, as speech is characterized by its temporal structure and encoded as spectral vectors [34].
image classification and object detection to natural language understanding, playing a central role in advancing the capabilities of AI systems.
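The word-level to phone-level expansion performed with HLEd (phones0.mlf without short pauses, phones1.mlf with them) can be pictured with the following Python sketch. It only imitates the effect of the edit scripts and is not a replacement for HLEd. Bracketing each utterance with "sil" follows the convention of the HTK tutorial and is an assumption here, and the lexicon entries and utterance name are illustrative placeholders.

```python
def expand_to_phones(words, lexicon, keep_sp):
    """Replace each word by its phones; optionally add 'sp' after every word."""
    phones = ["sil"]                  # assumed: silence at utterance start
    for word in words:
        phones.extend(lexicon[word])  # dictionary lookup, one pronunciation
        if keep_sp:
            phones.append("sp")       # short pause after the word-phone group
    phones.append("sil")              # assumed: silence at utterance end
    return phones


def format_mlf(transcriptions):
    """Render {utterance_name: [labels]} as master label file text."""
    lines = ["#!MLF!#"]
    for name in sorted(transcriptions):
        lines.append('"*/%s.lab"' % name)
        lines.extend(transcriptions[name])
        lines.append(".")             # a single dot ends each entry
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder lexicon; real entries come from the pronunciation dictionary.
    lexicon = {"bani": ["b", "a", "n", "i"], "cufi": ["c", "u", "f", "i"]}
    words = ["bani", "cufi"]
    phones0 = expand_to_phones(words, lexicon, keep_sp=False)
    phones1 = expand_to_phones(words, lexicon, keep_sp=True)
    print(format_mlf({"utt1": phones0}))
    print(format_mlf({"utt1": phones1}))
```

Keeping the two variants as one function with a flag mirrors the fact that the two edit scripts differ only in whether the short-pause labels are kept.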

Creating Prototype Monophone.
The process of creating a prototype monophone is a crucial phase in HMM model training. The focus of this phase is on specifying the model topology rather than the parameters. In our phone-based recognition system, a 5-state left-to-right HMM architecture is used, consisting of 3 emitting states and 2 nonemitting states (see Figure 5).
The HTK tool HCompV is utilized to calculate the global mean and variance from a set of data files. It sets all Gaussians in a specific HMM to have the same mean and variance. The HCompV command, with appropriate parameters and configuration file (HCopy_proto.txt), is executed using the train.scp file that contains the list of training files. This command modifies the prototype file, replacing zero means and unit variances with global speech means and variances.
With the newly created prototype model from HCompV, a master macro file (MMF) named hmmdefs is manually generated. This file contains a copy of the prototype for each required monophone, including "sil." The format of an MMF is similar to that of an MLF, eliminating the need for multiple HMM specification files. The macros section of the hmmdefs file includes a global options macro and the variance floor macro (vFloors) previously produced by HCompV. The global options macro defines the HMM parameter kind and vector size.
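As an illustration of this topology, the sketch below builds a transition matrix for a 5-state left-to-right model in which the first and last states are nonemitting. The self-loop probability of 0.6 is an arbitrary starting value chosen for the example, not a figure from this study; training re-estimates these values.

```python
def left_to_right_transitions(num_states=5, self_loop=0.6):
    """Transition matrix with no skips: each emitting state may only
    stay where it is or move to the next state."""
    a = [[0.0] * num_states for _ in range(num_states)]
    a[0][1] = 1.0                        # entry state jumps to first emitting state
    for s in range(1, num_states - 1):   # emitting states 1 .. num_states-2
        a[s][s] = self_loop              # stay in the same state
        a[s][s + 1] = 1.0 - self_loop    # advance to the next state
    return a                             # exit state row stays all zero


if __name__ == "__main__":
    for row in left_to_right_transitions():
        print(" ".join("%.1f" % p for p in row))
```

The zero entries are what make the model left-to-right: a state can never move backwards or skip ahead, which matches the temporal ordering of the sounds within a phone.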
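The flat-start idea behind this step can be sketched in plain Python: pool every feature frame, compute one mean and variance per feature dimension, and copy them into each emitting state. The function names are hypothetical, population variance is assumed, and the 0.01 floor factor is the value used in the HTK tutorial rather than one reported in this study.

```python
def global_mean_variance(utterances):
    """utterances: list of utterances, each a list of feature vectors."""
    frames = [frame for utt in utterances for frame in utt]
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
    return mean, var


def flat_start(mean, var, num_emitting=3):
    """Give every emitting state the same global Gaussian."""
    return [{"mean": list(mean), "variance": list(var)}
            for _ in range(num_emitting)]


def variance_floor(var, factor=0.01):
    """Per-dimension floor as a fraction of the global variance (assumed factor)."""
    return [factor * v for v in var]


if __name__ == "__main__":
    # Two tiny made-up utterances with 2-dimensional features.
    data = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]
    mean, var = global_mean_variance(data)
    print(mean, var)
    print(flat_start(mean, var))
    print(variance_floor(var))
```

Starting all states from identical statistics means no hand-labelled phone boundaries are needed; the later re-estimation passes are what pull the states apart.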

Re-Estimating Monophones.
The HERest program is used to re-estimate the monophones. Multiple iterations of HERest are performed, each with a different model directory (hmm0, hmm1, and hmm2) and output directory (hmm1, hmm2, and hmm3). The re-estimation is done using the data listed in the train.scp file and the labels/phones0.mlf file, which contains phone-level transcriptions. Pruning thresholds are set using the -t option to restrict the range of state alignments included in the training process. The threshold is initially set at 250.0 and increased if re-estimation fails for a file. The updated model set is stored in the hmm directories (hmm1, hmm2, and hmm3) after each iteration. Figure 6 shows the flow of the prototype HMM definition.
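The three passes can be written down as command lines, as in the hedged sketch below. It only builds the argument lists and does not run HTK. The file names config, macros, and monophones0, and the pruning increment and limit (150.0 and 1000.0), follow the HTK tutorial and are assumptions; only train.scp, labels/phones0.mlf, the hmm directories, and the initial threshold of 250.0 come from the description above.

```python
def herest_command(src_dir, dst_dir,
                   prune=("250.0", "150.0", "1000.0"),
                   config="config", phone_list="monophones0"):
    """Build one HERest re-estimation command as an argument list."""
    return [
        "HERest", "-C", config,
        "-I", "labels/phones0.mlf",   # phone-level transcriptions
        "-t", *prune,                 # initial threshold, increment, limit
        "-S", "train.scp",            # list of training feature files
        "-H", src_dir + "/macros",    # assumed separate macros file
        "-H", src_dir + "/hmmdefs",
        "-M", dst_dir,                # directory for the updated models
        phone_list,
    ]


def monophone_passes(first=0, passes=3):
    """hmm0 -> hmm1 -> hmm2 -> hmm3 for the default arguments."""
    return [herest_command("hmm%d" % i, "hmm%d" % (i + 1))
            for i in range(first, first + passes)]


if __name__ == "__main__":
    for cmd in monophone_passes():
        print(" ".join(cmd))
```

On a machine with HTK installed, each list could be passed to `subprocess.run(cmd, check=True)`; building the commands separately makes it easy to inspect them before anything is executed.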

Fixing the Silence.
A 3-state left-to-right HMM is created for each phone, including a silence model called "sil." Additional transitions are added to the silence model, allowing individual states to absorb impulsive sounds in the training data and improving the model's robustness. A single-state short-pause model ("sp") is developed and connected to the central state of the silence model. Figure 7 shows the topology of the two silence models.

Realigning the Training Data.
The HVite program is used to realign the training data. It takes word-level transcriptions, phone models, and a dictionary as input. The output is a new phone-level transcription file (aligned.mlf) that matches the acoustic data more accurately. The -o SW option formats the alignment output so that scores and word labels are omitted while time-stamp information is retained, which helps detect significant pauses at the start and end of utterances. The HMM set parameters are re-estimated using HERest after the new phone alignments have been established.

Training Prototype Triphone and Tied-State.
The training of prototype triphone and tied-state models is a fundamental aspect of acoustic modeling in the field of speech recognition. These models play a crucial role in converting spoken language into text by capturing the acoustic characteristics of speech signals. Prototype triphone models represent phonetic transitions within speech, while tied-state models group similar acoustic states together. This process involves building a comprehensive dataset, extracting acoustic features, and employing techniques such as hidden Markov models (HMMs) to train these models. The goal is to enhance the accuracy and efficiency of automatic speech recognition systems, enabling them to accurately transcribe spoken language into text form. The process of training prototype triphones and tied-state triphones includes the following:
3.7.1. Tied-State Triphones.
The first step is to determine whether to use crossword triphones. If so, monophones are converted to triphones, and word boundaries are marked in the training data. Triphone models are developed and re-estimated, and acoustic states are tied to ensure the use of the same parameters. The HERest tool is used to update the context-dependent models.

Making Triphones from Monophones.
Monophones are used to create triphones. The HLEd tool is used to create a list of triphones based on the monophone transcriptions. The triphone transcriptions are created by modifying the monophone transcriptions according to predefined rules.
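The rewriting of monophone transcriptions into triphone labels can be illustrated with a small Python sketch. It assumes word-internal expansion and the left-centre+right label style (for example, a-n+i) commonly used with HTK; the phone sequence below is a hypothetical example rather than an entry from the paper's dictionary.

```python
def to_triphones(phones):
    """Expand one word's monophone sequence into word-internal triphone labels."""
    out = []
    for i, p in enumerate(phones):
        left = phones[i - 1] + "-" if i > 0 else ""                 # left context, if any
        right = "+" + phones[i + 1] if i < len(phones) - 1 else ""  # right context, if any
        out.append(f"{left}{p}{right}")
    return out


print(to_triphones(["b", "a", "n", "i"]))
# ['b+a', 'b-a+n', 'a-n+i', 'n-i']
```

Phones at word edges keep only the context that exists inside the word, which is why the first and last labels are biphones.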

Creating Tied-State Triphones.
Once a set of triphone HMMs is prepared, the states of triphone sets are tied to share data and produce reliable parameter estimates. The HHEd tool is used, and two methods are described: one based on data and the other using decision trees. The decision tree searches for contexts that distinguish clusters based on acoustic properties. The tied-state triphone models are updated using HERest.

Performance Evaluation Technique.
The hidden Markov model (HMM) was used to develop the speech recognizer, as HMMs are integral to most modern speech recognition systems, particularly those employing statistical methods. The speed of the recognizer is measured in real-time factors, while accuracy is typically represented by the word error rate (WER) [35]. The performance of the speech-based recognizer is evaluated by measuring the WER and the word recognition rate [36]. Word errors can include insertions, substitutions, and deletions. The HResults tool in HTK is employed to analyze the system's performance, comparing the original reference transcription file with the output transcription file generated by the HVite tool. HMMs are recognized as the most powerful statistical tool in automatic speech recognition (ASR) for modeling nonlinearly aligned speech and estimating model parameters [37,38].
The recognizer is trained and tested on a total of 64 Microsoft (MS) commands. The performance of a speech-based command interface can be evaluated using the following equations, where N represents the number of words in the test set, D the number of deletions, S the number of substitutions, I the number of insertions, and H the number of correctly recognized words (H = N - D - S). The WER, the word accuracy, and the percentage of correct words are calculated based on the following formula [35]:
$$\mathrm{WER} = \frac{S + D + I}{N} \times 100\%, \qquad \mathrm{Accuracy} = \frac{N - D - S - I}{N} \times 100\%, \qquad \mathrm{Correct} = \frac{H}{N} \times 100\% \quad (1)$$
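The quantities N, D, S, H, and I defined above can be combined as in the following Python sketch, which assumes the standard definitions reported by HTK's HResults (H = N - D - S). The error counts in the example are hypothetical and are not results from this study.

```python
def recognition_metrics(n, d, s, i):
    """Percent correct, accuracy, and WER from word-level error counts."""
    h = n - d - s  # correctly recognized words
    return {
        "correct": 100.0 * h / n,
        "accuracy": 100.0 * (h - i) / n,
        "wer": 100.0 * (s + d + i) / n,
    }


# Hypothetical counts: 64 reference words, 2 deletions, 4 substitutions, 1 insertion
metrics = recognition_metrics(n=64, d=2, s=4, i=1)
print(metrics)
```

Note that accuracy and WER sum to 100% under these definitions, so either one can be reported.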

Result and Discussion
The creation of a functional prototype for an Afan Oromo speech-based command-and-control system involves a series of foundational steps, each contributing to its effectiveness. This section describes these stages and the methods employed. Beginning with the assembly of a speech corpus representing diverse Afan Oromo commands, the data are divided into training and testing sets. The training set is used to fit the recognizer to the patterns of Afan Oromo speech, whereas the test set evaluates the recognizer's proficiency with new commands. This section also highlights the importance of refining the system's performance through experimentation and the fine-tuning of parameters and algorithms. The components of the command-and-control system are explored in detail, from data preparation to recognizer training, testing, and performance analysis. Together, these steps ensure that the linguistic data are carefully processed and that the recognizer is trained and evaluated across various scenarios, serving as a roadmap for developing a speech-based command system.

Performance Evaluation.
Performance evaluation is a critical process used to assess the effectiveness, efficiency, and quality of a system, process, or entity. It involves systematically analyzing and measuring various metrics to gauge how well the subject performs its intended functions or achieves its goals. Performance evaluation provides valuable insights that help stakeholders understand strengths, weaknesses, and areas for improvement.
To analyze recognizer performance, HTK offers the HResults tool. Test data were fed into the recognizers, and the recognized transcriptions were saved in a separate MLF. HResults was then executed on this MLF, together with the reference transcriptions created in the data preparation step, to evaluate the performance of the isolated Afan Oromo word recognizers. During prototype development, researchers primarily engage in testing and evaluation. Prototypes are evaluated using the most commonly used evaluation techniques: live and nonlive (phone-based). This process is commonly referred to as decoding or recognizing the speech signal. Each word is represented by hidden Markov models (HMMs) that correspond to the word's sequence of sound units. As a result, the search graph becomes a complex HMM, and recognition is carried out by aligning the search graph with the speech features extracted from the utterance using the Viterbi algorithm.
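To make the Viterbi alignment mentioned above more concrete, the following is a minimal log-domain sketch in Python. It operates on a toy two-state left-to-right model with made-up likelihoods; it is not the decoder used by HVite, which works on a full word network with token passing.

```python
import math


def viterbi(obs_loglik, log_trans, log_init):
    """Return the most likely state path and its log score.

    obs_loglik[t][j] is the log-likelihood of frame t under state j.
    """
    n_states = len(log_init)
    score = [log_init[j] + obs_loglik[0][j] for j in range(n_states)]
    back = []  # back[t-1][j] = best predecessor of state j at frame t
    for t in range(1, len(obs_loglik)):
        prev, new_score = [], []
        for j in range(n_states):
            best = max(range(n_states), key=lambda i: score[i] + log_trans[i][j])
            prev.append(best)
            new_score.append(score[best] + log_trans[best][j] + obs_loglik[t][j])
        back.append(prev)
        score = new_score
    last = max(range(n_states), key=lambda j: score[j])
    path = [last]
    for prev in reversed(back):  # trace the best predecessors backwards
        path.append(prev[path[-1]])
    return path[::-1], score[last]


NEG_INF = float("-inf")
log_init = [0.0, NEG_INF]                     # must start in state 0
log_trans = [[math.log(0.5), math.log(0.5)],  # state 0: stay or advance
             [NEG_INF, 0.0]]                  # state 1: stay only
obs_loglik = [[math.log(0.9), math.log(0.1)],
              [math.log(0.2), math.log(0.8)],
              [math.log(0.1), math.log(0.9)]]
best_path, best_score = viterbi(obs_loglik, log_trans, log_init)
print(best_path)  # [0, 1, 1]
```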
4.1.1. Phone-Based Recognizer Evaluation.
Word-internal triphones, crossword triphones, and tied-state triphones are all considered in phone-based modeling. HVite uses phone sets, test data, and output files from n-gram language models, along with a phone-based pronunciation dictionary and other inputs, for the phone-based decoding process. We have created four phone-based systems; thus, the phone sets used for the recognition procedure differ. These phone set lists are grouped into categories such as crossword, monophone, triphone, and tied lists. The text files list the monophone, triphone, and tied-state phone sets, respectively. All phone-based systems employ the same language model, test dataset, and pronunciation dictionary. The word-level transcriptions of each test file are used as the test data. The phone-based systems are decoded by running HVite with the corresponding phone list. A fixed amount is added to each token when transitioning from the end of one word to the beginning of the next, known as the word insertion penalty. The language model probability is scaled before being added to each token during this transition, which is called the grammar scale factor. Given their potential impact on recognition performance, it is highly recommended to tune these factors using development test data.
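The effect of the word insertion penalty and the grammar scale factor on a token's score can be sketched as follows. The function name and the log-probability values are hypothetical; the defaults of 0.0 and 5.0 simply mirror the -p and -s values that appear in the live decoding command of this paper.

```python
def cross_word_score(token_logprob, lm_logprob, penalty=0.0, scale=5.0):
    """Token score after leaving one word and entering the next.

    The language model log-probability is multiplied by the grammar scale
    factor, and the fixed word insertion penalty is added.
    """
    return token_logprob + scale * lm_logprob + penalty


# Hypothetical token with acoustic log score -120.0 and LM log-probability -2.0
print(cross_word_score(-120.0, -2.0))                 # -130.0
print(cross_word_score(-120.0, -2.0, penalty=-10.0))  # -140.0
```

A more negative penalty discourages the decoder from hypothesizing many short words, while a larger scale gives the language model more weight relative to the acoustic score.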
A comparison of word-internal models in the phone-based recognizer is shown in Tables 2 and 3. The following commands are used to assess the recognition results once decoding has completed:
HResults -I labels/testref.mlf corpus/monophones1 recog/monorecou.mlf
HResults -I labels/testref.mlf triphones1 recog/wirecout.mlf
HResults -I labels/testref.mlf tiedlist recog/witiedrecout.mlf
As noted above, the four phone-based systems share the same language model, test dataset, and pronunciation dictionary and differ only in the phone sets used for recognition. Table 2 shows the recognition results of the phone-based models and demonstrates that the tied-state and triphone models exhibit slightly higher accuracy. The HVite options -p and -s specify the word insertion penalty and the grammar scale factor, respectively. Word insertion penalties are applied when tokens transition from one word to another. Despite varying the parameters -p and -s, the recognizer's performance remained unaffected.
To assess the accuracy of the nonlive recognizers, speech from 38 speakers (17 females and 21 males), aged between 18 and 40 and selected based on their availability, was used. Out of a total of 64 MS command words, 54 words (84.37%) were used for training and 10 words (15.63%) were reserved for testing. The monophone, tied-state triphone, and triphone recognizers achieved word-level accuracies of 78.12%, 86.87%, and 88.99%, respectively. Consequently, the triphone-based recognizer gives the best nonlive recognition performance.

Live Recognizer Evaluation.
In general, the process of live speech recognition involves assessing the accuracy of commonly used words and phrases. Due to time limitations and the extended duration of this test, each participant was assigned only eight randomly selected command words to orally control Microsoft Word. To evaluate performance, each person was tasked with commanding and operating Microsoft Word using the randomly selected phrases. Among the eight participants in this study, three were females and five were males.

For recognizer evaluation, the following command was utilized:
HVite -H hmm15/macros -H hmm15/hmmdefs -C configs/hvitelive.txt -w wdnet -p 0.0 -s 5.0 dicts/dicts tiedlist
In Table 3, participants who did not undergo training are indicated by the "#" symbol. The evaluation of the Afan Oromo speech-based recognizer performance is based on the participants who underwent training and those who did not. For participants not involved in training, the recognizer shows a maximum accuracy of 96.29% in Table 3. On the other hand, participants who underwent training achieved a maximum accuracy of 100%. To calculate the accuracy, the number of correctly recognized words was divided by the total number of available word commands during the live evaluation of the variable variance model (54 words).
The performance of the recognizer with fixed variance (49 words) is likewise evaluated by considering the users who participated in training and those who did not. The results presented in Table 4 show that users who participated in training achieved a maximum accuracy of 97.95%. On the other hand, the recognizer's accuracy for users who did not participate in training reached a maximum of 91.83%.
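The live accuracy figures are simple ratios of correctly recognized commands to available commands. The sketch below reproduces the reported maxima under two assumptions that are not stated in the text: that percentages are truncated (not rounded) to two decimals, and that the underlying counts are 52 and 54 of 54 words and 45 and 48 of 49 words, which are inferred from the reported percentages and vocabulary sizes.

```python
import math


def live_accuracy(correct, total):
    """Percentage of correctly recognized commands, truncated to two decimals."""
    return math.floor(10000 * correct / total) / 100


# Inferred counts (assumptions, see the note above)
print(live_accuracy(52, 54))  # 96.29
print(live_accuracy(54, 54))  # 100.0
print(live_accuracy(45, 49))  # 91.83
print(live_accuracy(48, 49))  # 97.95
```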

Prototype Communication.
The service-oriented component offers an extension that facilitates communication between software components, networking, and system integration. As shown in Figure 8, a prototype recognizer is developed, and a service-oriented component is designed to enable communication between the recognizer and Microsoft Word. To establish a connection between the recognizer and Microsoft Office, Microsoft needs to authorize the recognizer as a service object. There are two fundamental structures for speech-based computer command and control: the provided service object and the required interface, which is the recognizer. Prior to speech recognition, the recognizer, as depicted in Figure 8, starts recording the audio signal and then searches for a matching command.
To evaluate the performance of the system, both live and nonlive evaluation techniques are employed. In the live setting, the recognizer's performance for Afan Oromo speech-based computer commands is assessed using a fixed variance model (49 words) and a variable variance model (54 words), achieving maximum accuracies of 91.83% and 96.29%, respectively, for those not involved in training. The fixed variance model is more affected by noise, whereas the variable variance model performs better.
In the nonlive recognizer evaluation, a word-internal evaluation is conducted, which includes monophones, triphones, and tied-state triphones. The monophone, tied-state triphone, and triphone-based recognizers achieve accuracy rates of 78.12%, 86.87%, and 88.99%, respectively. Therefore, the triphone-based recognizer exhibits the best performance in nonlive recognition, as the use of word-internal triphone context enhances the recognizer's performance. Based on the comparison of results in both live and nonlive settings, it is recommended to focus on investigating Afan Oromo speech-based command and control in live scenarios.
Since there is no established speech corpus specifically for Afan Oromo MS computer commands, a direct comparison of the results of this research with previous studies is not meaningful. However, the existing literature on the Amharic language reports speech-based computer commands with maximum recognizer accuracies of 87% and 96% using HMM models with fixed variance, which is lower than the performance achieved by the Afan Oromo speech-based computer command recognizer for nonparticipating users [39].
In conclusion, this research has demonstrated the feasibility of speech-based computer commands for Afan Oromo, with promising performance, using the available HMM toolkit and a limited speech corpus.

Conclusions
Automatic speech recognition (ASR) involves converting speech into text to enable computers to understand and respond to human speech.Speech command-based interfaces have been implemented in various systems, including e-commerce, medical equipment, and digital devices, allowing users to control applications through speech input.However, there is currently no developed speech-based computer command interface for Afan Oromo.
The objective of this study was to investigate and develop an Afan Oromo speech-based command and control system using selected MS Word commands. The development process involved creating a speaker-independent, HMM-based Afan Oromo speech recognizer using the HTK toolkit. To collect data, speech recordings were obtained from 38 speakers (17 females and 21 males) between the ages of 18 and 40. Out of a total of 64 MS command words, 54 words (84.37%) were used for training and 10 words (15.63%) were used for testing. The performance of the recognizers was evaluated using both live and nonlive techniques.
In the nonlive recognizer evaluation, a word-internal evaluation was conducted, including monophones, triphones, and tied-state triphones. The monophone, tied-state triphone, and triphone recognizers achieved accuracy rates of 78.12%, 86.87%, and 88.99%, respectively. Therefore, the triphone-based recognizer performed best in nonlive recognition. In the live setting, the recognizer's performance was assessed using fixed variance and variable variance models. The fixed variance model achieved a maximum accuracy of 91.83%, while the variable variance model reached 96.29% for users who did not participate in the training.
Based on these results, it can be concluded that the variable variance model exhibits higher accuracy. In addition, the live recognizer outperformed the nonlive recognizer. However, the performance of the recognizer in a practical MS Office Word environment could not be evaluated due to the requirement of an object-as-a-service component for integration. The development of the speech interface prototype faced limitations in terms of language resources and tools, which affected the accuracy of the recognizer. Despite these challenges, the experiment yielded promising results, demonstrating the potential for developing a prototype for an Afan Oromo speech-based command and control system using selected MS Word commands with fixed and variable variance models.

Figure 1: The architecture of AO speech-based computer command and control.

Table 1: Phoneme with corresponding IPA.
84.3% of the overall speaker pool. In addition, 6 speakers (three males and three females), making up 15.7% of the total, were reserved for testing purposes. This specific age bracket (18-40 years) actively uses Microsoft Word for text manipulation. After the speech data were collected, they were divided into distinct sets for training and testing. The test set served to assess the performance of the recognition system, while the training set was used to train the recognition model. Prior to collecting the speech data, a comprehensive speech text corpus was constructed. The English commands commonly used in Microsoft Word were translated into prompts in the Afan Oromo language. The collected speech data underwent recording, segmentation, coding, and parameterization to extract their acoustic characteristics. The Mel-frequency cepstral coefficients (MFCC) technique was employed to extract all relevant acoustic features. The HCopy tool was employed to extract acoustic information from the recorded utterances. The parameterization of speech data could be done in real time or collectively, extracting all parameters using the HTK tool.
3.4. Data Preparation Phase.
The speech corpus was collected and then divided into training and testing sets. The training set was used to train the recognizer, while the test set was used to evaluate its performance. Multiple experiments were conducted to explore a functional prototype of a command-and-control system for an Afan Oromo speech-based computer. This section provides a description of the key ASR tasks and components used in the statistical approach, including data preparation, recognizer training, recognizer testing, and performance analysis. Figure 3 illustrates the developed system architecture and the results obtained from the HTK tools.
3.4.1. The Task Grammar.
In the process of developing a prototype recognizer, a word-level network called the task grammar is used to define valid word sequences. The ... are necessary due to the limitations of ASCII code. Table 1 offers an illustrative representation of the employed phones. The HDMan command is employed to search for word pronunciations within the source dictionary (dict_phone.lex) and output the results to a dict_phones.lex file. This process involves the creation of monophone lists, removal of specific phones, and the generation of a new dictionary.
Since HTK is not efficient in processing .wav files, they are converted to MFCC format using the HCopy program. In the experiment, a script file listing the source audio files and the corresponding converted MFCC files is created. The HCopy command is run using this script file as a parameter to extract speech features and generate MFCC files for each utterance. Two script files, codetrain.scp and codetest.scp, are written for training and testing, respectively. The HCopy command is executed with the provided configuration file (HCopy_train.txt) to convert the wave format to MFCC. The conversion parameters are specified in the configuration file. By running the HCopy command, a series of MFCC files is produced corresponding to the audio files listed in codetrain.scp.
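A script file of the kind described above pairs each source audio file with its target feature file, one pair per line. The Python sketch below generates such lines; the directory names, file names, and the .mfc extension are hypothetical and are not the paths used in this work.

```python
from pathlib import PurePosixPath


def hcopy_script_lines(wav_paths, out_dir="mfcc"):
    """Pair each source .wav path with a target feature file path."""
    lines = []
    for wav in wav_paths:
        target = PurePosixPath(out_dir) / (PurePosixPath(wav).stem + ".mfc")
        lines.append(f"{wav} {target}")
    return lines


# Hypothetical recordings for the training script file
script_lines = hcopy_script_lines(
    ["wav/train/spk01_001.wav", "wav/train/spk01_002.wav"]
)
print("\n".join(script_lines))
```

The resulting text could be written to a file such as codetrain.scp and passed to the feature extraction step.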
additional preprocessing during training. The coding task is crucial in the data preparation stage. To achieve this in HTK, the HCopy tool is used with a configuration file and a script file containing a list of files. The HLEd tool is employed to generate a single MLF file by setting the TARGETKIND configuration variable to MFCC0D. Adding time derivatives to the static parameters improves speech recognition system performance. RNNs, a type of artificial neural network (ANN) commonly used for modeling sequential data such as speech signals, tend to identify false patterns in the data and overfit. HMMs, on the other hand, are a suitable choice when working with a small and simple dataset. In our case, out of the total of 64 Microsoft command words, 54 words (84.37%) were used for training and 10 words (15.63%) were used for testing.
The average number of words for users involved in training is 52.12. Users with a '#' symbol before their name did not participate in live training. Additionally, the average accuracy for word recognition among the 8 users involved in training is 96.52%.
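The time derivatives mentioned above are commonly computed with a regression formula over neighbouring frames. The following Python sketch shows one such computation for a single coefficient track; the window of 2 frames and the edge handling by replicating the first and last frames are assumptions, not settings taken from this paper's configuration file.

```python
def deltas(series, window=2):
    """First-order regression deltas for one coefficient track.

    Frames beyond the ends of the track are replaced by the first or last frame.
    """
    denom = 2 * sum(k * k for k in range(1, window + 1))
    last = len(series) - 1
    out = []
    for t in range(len(series)):
        num = sum(
            k * (series[min(t + k, last)] - series[max(t - k, 0)])
            for k in range(1, window + 1)
        )
        out.append(num / denom)
    return out


# A linear ramp has a slope of 1.0 away from the edges
ramp_deltas = deltas([0.0, 1.0, 2.0, 3.0, 4.0])
print(ramp_deltas)
```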

Table 4: Recognizer in fixed variance. The average number of words for users involved in training is 44.75. Similarly, users with a '#' symbol before their name did not participate in live training. Additionally, the average accuracy for word recognition among the 8 users involved in this training is 91.32%.
If the audio matches the command key, the recognizer can execute the corresponding operation by either opening or closing the words. The Afan Oromo speech-based computer command interface uses a dictionary to search for Afan Oromo-to-English key commands, enabling communication with the MS computer command. Any updates to the interface require a process of returning, searching, and calling Afan Oromo commands.
4.3. Recognition and Discussion.
Decoding algorithms are used to accomplish the process of recognition. The objective of this experiment, as mentioned earlier, was to develop a speech interface system that allows users to control the computer through spoken commands. To train the Afan Oromo speech-based recognizer, a dataset is prepared by performing translation, transcription, audio data segmentation, and MFCC feature extraction.
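The dictionary lookup from a recognized Afan Oromo word to an English command key, described above, might be sketched as follows. The two dictionary entries, the function names, and the stand-in actions are illustrative assumptions; the paper's actual command list and its integration with Microsoft Word are not shown in this section.

```python
# Illustrative Afan Oromo-to-English command keys (not the paper's command list)
COMMAND_KEYS = {"bani": "open", "cufi": "close"}


def dispatch(recognized_word, actions):
    """Look up the command key for a recognized word and run its action."""
    key = COMMAND_KEYS.get(recognized_word.lower())
    if key is None:
        return None  # no matching command: nothing is executed
    return actions[key]()


# Stand-in actions; a real system would call into the word processor here
actions = {"open": lambda: "opened", "close": lambda: "closed"}
print(dispatch("Bani", actions))     # opened
print(dispatch("unknown", actions))  # None
```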