A Novel Technique for Speech Recognition and Visualization Based Mobile Application to Support Two-Way Communication between Deaf-Mute and Normal People



Introduction
Historically, the term deaf-mute referred to a person who was either deaf and used sign language as a source of communication or was both deaf and unable to speak. This term continues to be used for a person who is deaf but has some degree of speaking ability [1]. In the deaf community, the word deaf is spelled in two separate ways. The small "d" deaf denotes a person's audiological level of hearing without association with other members of the deaf community, whereas the capital "D" Deaf indicates culturally Deaf people who use sign language for communication [2].
According to the World Federation of the Deaf (WFD), over 5% of the world's population (≈360 million people) has disabling hearing loss, including 328 million adults and 32 million children [3]. The degree of hearing loss is categorized into mild, moderate, severe, or profound levels [4]. Hearing loss has a direct impact on a person's speech and language development. People with severe or profound hearing loss have higher voice handicap index (VHI) scores than those who suffer from mild hearing loss [5]. A person with mild hearing loss has fewer problems in speech development; although he/she might not be able to hear certain sounds, speech clarity is not affected that much. A person with severe or profound hearing loss can have a severe problem in speech development and usually relies on sign language as a source of communication.
Deaf people face many irritations and frustrations that limit their ability to do everyday tasks. Research indicates [6] that Deaf people, especially Deaf children, have high rates of behavioral and emotional issues related to the different methods of communication. Most people with such disabilities become introverted and resist social connectivity and face-to-face socialization. The inability to speak with family and friends can cause low self-esteem and may result in the social isolation of a Deaf person. Not only do they lack social interactions, but communication is also a major barrier to Deaf-mute healthcare [7]. In such conditions, it becomes difficult for a caretaker to interact with a deaf person.
Different medical treatments are available to the deaf community to treat deafness, but these treatments are expensive [8]. A 2017 report of the World Health Organization (WHO) [9] states that there are several types of costs associated with hearing loss: (1) direct costs, which include the costs incurred by healthcare systems as well as education support for affected children; (2) indirect costs, which cover the loss of productivity and usually refer to the cost of an individual being unable to contribute to the economy; and (3) intangible costs, which refer to the stigma experienced by families affected by hearing loss. The report concludes that unaddressed hearing loss poses substantial costs to the healthcare system and to the economy as a whole.
Many communication channels are available through which Deaf-mute people can deliver their messages, e.g., notes, helper pages, sign language, books with letters, lip reading, and gestures. Despite these channels, Deaf-mutes and hearing people encounter many problems during communication. The problem is not confined to the Deaf-mute person being unable to hear or speak; another problem is the lack of awareness of Deaf culture among hearing people. The majority of hearing people have little or no knowledge or experience of sign language [10]. There are also more than 300 sign languages, and it is hard for a hearing person to understand and become used to these languages [11]. The above-mentioned problems can be addressed with assistive technology, which can act as an interpreter converting sign languages into text or speech for better communication between the Deaf community and hearing individuals [12]. Other technologies, such as speech technologies, can assist people with hearing loss in different ways by improving their autonomy [13]. A common example of speech technology is speech recognition, also termed automatic speech recognition (ASR). It is the process of converting a speech signal into a sequence of words with the help of an algorithm [14]. The ASR process comprises three steps: (1) feature extraction, (2) acoustic model generation, and (3) the recognition phase [15, 16]. For feature extraction, MFCC is the most commonly used technique [17, 18]. The success of MFCC makes it the standard choice in state-of-the-art speech recognizers such as HTK [19].
The main purpose of this research paper is to use mobile-based assistive technology to provide a simple and cost-effective solution for Deaf-mutes with little or complete speech development. The proposed system uses an HTK-based speech recognizer to identify the speech of Deaf-mutes and provides a communication platform for them. The next two sections explain the related work and the proposed methodology of our system. Section 4 states the experimental setup and results of the proposed system.

Related Work
The Deaf community is not a monolithic group; it comprises the following diverse groups [20, 21]: (1) Hard-of-hearing people: they are neither fully deaf nor fully hearing, also known as culturally marginal people [22]. They can obtain some useful linguistic information from speech.
(2) Culturally deaf people: they might belong to deaf families and use sign language as the primary source of communication. Their voice (speech clarity) may be disrupted.
(3) Congenital or prelingual deaf people: they are deaf from birth or become deaf before they learn to talk, and they are not affiliated with Deaf culture. They might or might not use sign-language-based communication.
(4) Orally educated or postlingual deaf people: they were deafened in childhood but have developed speaking skills.
(5) Late-deafened adults: they have had the opportunity to adjust their communication techniques as their hearing loss progressed.
Each group of the Deaf community has a different degree of hearing loss and uses a different source of communication. Table 1 details the Deaf community groups with their degrees of hearing loss and sources of communication with others. Hearing loss or deafness has a direct impact on communication, educational achievement, and social interaction [23]. A lack of knowledge about Deaf culture is documented in society as well as in the healthcare environment [24]. Kuenburg et al. also indicated that there are significant challenges in communication between healthcare professionals and Deaf people [25]. Improvement in healthcare access among Deaf people is possible by providing sign-language-supported visual communication and implementing communication technologies for healthcare professionals. Some of the technology-based approaches implemented to provide Deaf-mutes with easy-to-use services are as follows.
2.1. Sensor-Based Technology Approach. Sensor-based assistance can be used to address the social problems of Deaf-mutes by bridging the communication gap. Sharma et al. used wearable sensor gloves for detecting the hand gestures of sign language [26]. In this system, flex sensors were used to record the sign language and to sense the environment. The hand gesture of a person activates the glove, and the flex sensors on the glove convert those gestures into electrical signals. Another system [27] was suggested for Deaf-mute people to communicate with a doctor. This experiment used a 32-bit microcontroller, an LCD to display the input/output, and a processing unit. The LCD displays different hand-sign-language-based pictures to the user. The user selects relevant pictures to describe the illness symptoms. These pictures are then converted into patterns and paired with words to make sentences. Vijayalakshmi and Aarthi used flex sensors on a glove for gesture recognition [28]. The system was developed to recognize the words of American Sign Language (ASL). The text output obtained from the sensor-based system is converted into speech by using the popular hidden Markov model (HMM) speech synthesis technique. An HMM-based text-to-speech synthesizer (HTS) was attached to the system for converting the text obtained from people's hand gestures into speech. The training phase of the HTS system extracted spectral and excitation parameters from the collected speech data, which were modeled by context-dependent HMMs. The synthesis phase of the HTS system constructed the HMM sequence by concatenating context-dependent HMMs. Similarly, Arif et al. used five flex sensors on a glove to translate ASL gestures of Deaf-mutes into visual and audio output on an LCD [29].
2.2. Vision-Based Technology Approach. Many vision-based technology interventions are used to recognize the sign languages of Deaf people. For example, Soltani et al. developed a gesture-based game for Deaf-mutes using Microsoft Kinect, which recognizes gesture commands and converts them into text so that they can enjoy an interactive environment [7]. The voice for the mute (VOM) system was developed to take input in the form of fingerspelling and convert it into the corresponding speech [30]. The images of fingerspelling signs are retrieved from the camera. After noise removal and image processing, the fingerspelling signs are matched against the trained dataset. Processed signs are linked to the appropriate text, and this text is converted into the required speech. Nagori and Malode [31] proposed a communication platform that extracts images from video and converts these images into corresponding speech. Sood and Mishra [32] presented a system that takes images of sign language as input and produces speech as output. The features used in vision-based approaches for speech processing are also used in various object-recognition-based applications [33-39].

2.3. Smartphone-Based Technology Approach.
Smartphone technology plays a vital role in helping people with impairments to interact socially and to overcome their communication barriers. The smartphone-based approach is more portable and effective than sensor- or vision-based technology. Many new smartphones are furnished with advanced sensors, fast processors, and high-resolution cameras [40]. A real-time emergency assistant, "iHelp" [41], was proposed for Deaf-mute people to report any kind of emergency situation. The current location of the user is accessed through the smartphone's built-in GPS. Information about the emergency is sent to management through SMS and then passed on to the closest suitable rescue units; hence the user can be rescued through the use of iHelp. MonoVoix [42] is an Android application that acts as a sign language interpreter. It captures signs with the mobile phone camera and then converts them into corresponding speech. Ear Hear [43] is an Android application for Deaf-mute people. It uses sign language to communicate with hearing people, employing speech-to-sign and sign-to-speech technology. For a hearing person to interact with a Deaf-mute, the text-to-speech (TTS) technology takes the speech signal as input, and a corresponding sign language video is played against that input, through which the mute person can easily understand. Bragg et al. [44] proposed a sound detector app that detects alert sounds and alerts the deaf-mute person by vibrating and showing a popup notification.

Proposed Methodology
Nowadays, many technology devices, such as smartphone-enabled devices, prefer speech interfaces over visual ones.
Research [49] highlighted that off-the-shelf speech recognition systems cannot be used to detect the speech of deaf or hearing-impaired people, as these systems exhibit a high word error rate. That work recommended using human-based computation to recognize deaf speech and using text-to-speech functionality for speech generation. In this regard, we proposed and developed an Android-based application named vocalizer to mute (V2M). The proposed application acts as an interpreter and enables two-way communication between a Deaf-mute and a normal person. We refer to a normal person as one who has no hearing or vocal impairment or disability. The main features of the proposed application are listed below.

Normal to Deaf-Mute Person Communication.
This module takes the text or spoken message of a normal person as input and outputs a 3D avatar that performs sign language for the Deaf-mute person. ASL-based animations of the avatar are stored in a central database of the application. Each animation file is given 2-5 tags. The steps of normal to Deaf-mute person communication are as follows: (1) The application takes the text/speech of the normal person as input. (2) The application converts the speech message into text by using the Google Cloud Speech Application Program Interface (API), as this API detects normal speech better than Deaf persons' speech.
(3) The application matches the text to one of the tags associated with an animation file and displays the avatar performing the corresponding sign for the Deaf-mute.
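The tag lookup in step (3) can be sketched as follows; the tag lists and animation file names are illustrative assumptions, not the application's actual database schema:

```python
# Sketch of the tag-matching step; the tag lists and animation file
# names below are illustrative assumptions, not the app's real database.
ANIMATION_TAGS = {
    "greeting.anim": ["hello", "hi", "good morning"],
    "thanks.anim": ["thank you", "thanks"],
}

def find_animation(text):
    """Return the animation file whose tag list contains the given text."""
    text = text.lower().strip()
    for anim_file, tags in ANIMATION_TAGS.items():
        if text in tags:
            return anim_file
    return None  # no matching sign animation found
```

In this sketch, an unmatched utterance simply returns nothing; the real application would need a fallback (e.g., fingerspelling the word letter by letter).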

Deaf-Mute to Normal Person Communication.
Not everyone has knowledge of sign language, so the proposed application uses the disrupted speech of a Deaf-mute person. This disrupted form of speech is converted into a recognizable speech format by using a speech recognition system. HMM-based speech recognition is a growing technology, as evidenced by its rapidly increasing commercial deployment.
The performance of HMM-based speech recognition has already reached a level that can support viable applications [50]. For this purpose, HTK [51] is used to develop the speech recognition system, as this toolkit is primarily designed for building HMM-based speech recognition systems.

ASR System Using HTK. The ASR system is implemented using HTK version 3.4.1. The speech recognition process in HTK follows four steps to obtain the recognized speech of a Deaf-mute: (a) training corpus preparation, (b) feature extraction (acoustic analysis), (c) acoustic model generation, and (d) recognition, as illustrated in Figure 1.

(a) Training Corpus Preparation. The training corpus consists of recordings of speech samples obtained from Deaf-mutes in .wav format. The corpus contains spoken English alphabets (A-Z), English digits (0-9), and 15 common sentences used in daily routine life, e.g., good morning, hello, good luck, and thank you. The utterances of each participant are kept separate from those of the others due to the variance in speech clarity among Deaf-mute people. The training utterances of each participant are labeled in a simple text file (.lab). This file is used in the acoustic model generation phase of the system.

(b) Acoustic Analysis. The purpose of the acoustic analysis is to convert each speech sample (.wav) into a format suitable for the recognition process. The proposed application uses the MFCC approach for acoustic analysis. MFCC is a feature extraction technique in speech recognition [52]. The main advantages of MFCC are (1) low complexity and (2) better performance with high recognition accuracy [53].
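Table 2 refers to a configuration file (config.txt) for the front end. A typical HTK HCopy configuration for MFCC extraction might look as follows; the parameter values are illustrative assumptions based on common HTK defaults, not the paper's actual file:

```
# Illustrative HCopy configuration for the MFCC front end (values assumed)
SOURCEFORMAT = WAV
TARGETKIND   = MFCC_0       # MFCCs plus an energy coefficient
TARGETRATE   = 100000.0     # frame shift in 100 ns units (10 ms)
WINDOWSIZE   = 250000.0     # frame length in 100 ns units (25 ms)
USEHAMMING   = T            # apply a Hamming window
PREEMCOEF    = 0.97         # pre-emphasis coefficient
NUMCHANS     = 26           # number of mel filter-bank channels
NUMCEPS      = 12           # number of cepstral coefficients
CEPLIFTER    = 22           # cepstral liftering parameter
```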

Speech Recognition
The overall working of MFCC is illustrated in Figure 2 [19].
The features of each step of MFCC are listed below.
(1) Pre-Emphasis. The first step of MFCC feature extraction passes the speech signal through a filter. The pre-emphasis filter is a first-order high-pass filter. It is responsible for boosting the higher frequencies of the speech signal.
y(n) = x(n) − α x(n − 1),  0.9 ≤ α ≤ 1.0,
where α represents the pre-emphasis coefficient, x(n) is the input speech signal, and y(n) is the output speech signal with the high-pass filter applied. Pre-emphasis is important because the high-frequency components of speech have small amplitude with respect to the low-frequency components [54]. The silent intervals are also removed in this step by using a logarithmic technique for separating and segmenting speech from noisy background environments [55].
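A minimal sketch of this filter, assuming NumPy and a coefficient of 0.97 (a common choice within the stated range):

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """First-order high-pass filter: y(n) = x(n) - alpha * x(n - 1)."""
    # The first sample has no predecessor, so it is passed through unchanged.
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```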
(2) Framing. The framing process splits the pre-emphasized speech signal into short segments. The voice signal is divided into frames of N samples, with an interframe distance (frameshift) of M samples (M < N). In the proposed application, the frame sample size N = 256 and the frameshift M = 100. The frame size and frameshift in milliseconds are obtained by dividing N and M by the sampling frequency.
(3) Windowing. The speech signal is a nonstationary signal, but it is stationary over very short periods of time. A window function is used to analyze the speech signal and extract the stationary portion of the signal. There are two common types of windowing: (i) the rectangular window and (ii) the Hamming window.
The rectangular window cuts the signal abruptly, so the proposed application uses the Hamming window, which shrinks the values toward zero at the boundaries of the speech signal. The value of the Hamming window is calculated as
w(n) = 0.54 − 0.46 cos(2πn / (N − 1)),  0 ≤ n ≤ N − 1,
and the windowed signal at frame t is y_t(n) = x_t(n) w(n).
(4) Discrete Fourier Transform (DFT). Each windowed frame is converted to the frequency domain by
X_t(k) = Σ_{n=0}^{N−1} y_t(n) e^{−j2πkn/N},  k = 0, …, N − 1,
where X_t(k) is the Fourier transform of the windowed frame y_t(n) and N is the length of the DFT.
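Steps (2) and (3) can be sketched together in Python; the frame parameters follow the paper (N = 256 samples, M = 100 samples), while the use of NumPy is an implementation assumption:

```python
import numpy as np

def frame_and_window(signal, frame_size=256, frame_shift=100):
    """Split the signal into overlapping frames and apply a Hamming window."""
    n_frames = 1 + max(0, (len(signal) - frame_size) // frame_shift)
    frames = np.stack([signal[i * frame_shift : i * frame_shift + frame_size]
                       for i in range(n_frames)])
    # w(n) = 0.54 - 0.46 cos(2*pi*n / (N - 1)): tapers frame edges toward zero
    return frames * np.hamming(frame_size)
```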
(5) Mel-Filter Bank Processing. Human ears act as band-pass filters; i.e., they focus on certain frequency bands and have less sensitivity at higher frequencies (roughly above 1000 Hz). A unit of pitch, the mel, is defined such that perceptually equidistant pairs of sounds in pitch are separated by an equal number of mels [56]; it is calculated as
mel(f) = 2595 log10(1 + f / 700).
(6) Log. This step takes the logarithm of each of the mel-spectrum values. The human ear is less sensitive to slight differences in amplitude at high amplitudes than at low amplitudes; the logarithm function makes the frequency estimates correspondingly less sensitive to slight differences in the input.
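The Hz-to-mel conversion follows directly from the formula above:

```python
import math

def hz_to_mel(freq_hz):
    """mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)
```

By construction, 1000 Hz maps to roughly 1000 mels, and the scale grows more slowly at higher frequencies, mirroring the ear's reduced sensitivity there.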
(7) Discrete Cosine Transform (DCT). The log mel-spectrum is converted from the frequency domain back to the time domain by using the DCT. The result of this conversion is known as the mel frequency cepstral coefficient (MFCC) [57]. The mel frequency cepstrum is calculated as
c(n) = Σ_{m=1}^{M} log(S(m)) cos(πn(m − 0.5) / M),  n = 1, …, K,
where S(m) is the output of the mth mel filter and M is the number of filters. In the proposed methodology, K = 12 because a 12-dimensional feature parameter is sufficient to represent the voice features of a frame [17]. The extraction of the cepstrum via the DCT thus yields 12 cepstral coefficients for each frame. These sets of coefficients are called acoustic vectors (.mfcc). The acoustic vector (.mfcc) files are used for both training and testing.
(c) Acoustic Model Generation. This step provides a reference acoustic model against which comparisons are made to recognize the testing utterances. A prototype is used for the initialization of the first HMM; this prototype is generated for each word of the Deaf-mute dictionary. The HMM topology comprises 6 active states (with observation functions) and two nonemitting states (the initial and the last state, with no observation function), and it is used for all the HMMs. Single Gaussian observation functions with diagonal covariance matrices are used and are described by a mean vector and a variance vector in a text description file known as the prototype. This predefined prototype file, along with the acoustic vectors (.mfcc) of the training data and the associated labels (.lab), is used by the HTK tool HInit for the initialization of the HMMs.
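Step (7) above can be sketched as a direct implementation of the DCT over the log mel-spectrum; the filter-bank size of 26 used in the check below is an assumption, not a value stated in the paper:

```python
import numpy as np

def log_mel_to_mfcc(log_mel, n_ceps=12):
    """DCT of the log mel-spectrum:
    c(n) = sum_m log_mel[m] * cos(pi * n * (m + 0.5) / M), n = 1..n_ceps."""
    M = len(log_mel)
    m = np.arange(M)
    return np.array([np.sum(log_mel * np.cos(np.pi * n * (m + 0.5) / M))
                     for n in range(1, n_ceps + 1)])
```

Because the sum starts at n = 1, a perfectly flat log mel-spectrum yields all-zero coefficients: the DC term is deliberately discarded, so the coefficients capture only the spectral shape.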
(d) Recognition Phase. HTK provides a Viterbi word recognizer called HVite, which is used to transcribe a sequence of acoustic vectors into a sequence of words. HVite uses the Viterbi algorithm to match the acoustic vectors against the MFCC-based models. The testing speech samples are prepared in the same way as the training corpus. In the testing phase, the speech sample is converted into a series of acoustic vectors (.mfcc) using the HTK HCopy tool. These input acoustic vectors, along with the HMM list, the Deaf-mute pronunciation dictionary, and the language model (text labels), are taken as input by HVite to generate the recognized words.
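The pipeline described in steps (a)-(d) maps onto HTK's command-line tools roughly as follows. These command shapes follow the HTK book tutorial and require HTK to be installed; all file names (config.txt, proto, wdnet, dict, and so on) are illustrative assumptions:

```shell
# (b) Feature extraction: convert .wav files to .mfcc acoustic vectors,
#     using a script file listing "source target" pairs
HCopy -C config.txt -S codetrain.scp

# (c) Acoustic model generation: initialize one HMM per dictionary word
#     from the labelled training vectors and the prototype definition
HInit -C config.txt -S train.scp -M hmm0 -l WORD -L labels proto

# (d) Recognition: Viterbi-decode a test utterance against the word
#     network, pronunciation dictionary, and HMM list
HVite -C config.txt -H hmm0/hmmdefs -i recout.mlf -w wdnet dict hmmlist test.mfcc
```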

Messaging Service for Deaf-Mute and Normal Person.
The application also provides a messaging feature to both Deaf-mute and hearing people. A person can choose between an American Sign Language keyboard and an English keyboard for sending messages. The complete flowchart of V2M is illustrated in Figure 3.

Experimental Results and Discussions
4.1. Experimental Setup. The proposed application V2M required a camera, a mobile phone for the installation of the V2M app, a laptop (acting as a server), and an instructor to guide the Deaf-mute students. The complete scenario is shown in Figure 4. A total of 15 students from Al-Mudassir Special Education Complex Baharwal, Pakistan, participated in this experiment; the participating students were between the ages of 7 and 13 and had received some speech training in school. The instructor guided all students in using the mobile application. The experiment consisted of two phases.
4.1.1. Speech Testing Phase. In this phase, the instructor selected the "register voice" option from the app menu and entered a word/sentence or question (label) in the text field of the "register sample" dialog box, for which the training speech samples of the participants were taken (see Figure 5(b)). At first, the instructor used sign language to ask the participants to speak a word/sentence or an answer. The system took 2 to 4 voice samples of each word/sentence. Whenever a participant registered his/her voice, the system acknowledged it with visual support (as in Figure 5(c)). For testing, the researcher asked questions via the V2M app, and it displayed an avatar that performed sign language so that the Deaf-mute participant could understand the questions (see Figure 5(d)). In response, the participant selected the microphone icon (as shown in Figure 5(e)) to speak his/her answer. The app processed and compared the recorded speech sample with the registered samples. After the comparison, it returned the text and spoke out the answer of the participant (see Figure 5(f)).

4.1.2. Message Activity Phase.
The participants required minimal support from the instructor in this phase. They easily composed and sent messages by selecting the sign language keyboard (see Figure 5(g)).

4.2. Qualitative Feedback.
The researchers formalized a questionnaire survey to evaluate the effectiveness of the Deaf-mute application. The survey comprised 12 questions; this short length was chosen so as not to overwhelm the Deaf-mute students with longer interviews, and because these students had no experience of using any Deaf-mute-oriented application. The qualitative feedback is summarized into the following categories (paraphrased from the feedback forms).
Familiarity with Existing Mobile Apps. None of the participants had heard of or used any mobile application dedicated to Deaf-mutes.
Ease of Use and Enjoyment. All participants enjoyed using the app. They liked the idea of an avatar performing sign language. Out of 15 students, 12 performed the given tasks quite easily; the other 3 had never used or interacted with mobile devices before. Initially, they found the app difficult, but it became easier for them after the app's functions were demonstrated 2-3 times. Overall, they found the app user-friendly and interactive.
Application Interface. Participants liked the interface of the app. They learned the steps of the app quite fast, and they also liked the idea of the avatar performing a greeting gesture on the home screen.
Source of Communication. All participants used sign language as their primary source of communication. They welcomed the intervention of a mobile application as a source of communication and acknowledged that the mobile app can be used to convey the message of a deaf-mute to a hearing person.

4.3. Results and Comparative Analysis.
The application's training and testing corpora are obtained from the participants' speech samples. The evaluation uses the following definitions:
True positive (tp) refers to words that are uttered by the person and detected by the system.
False positive (fp) refers to words not uttered by the person but detected by the system.
False negative (fn) refers to words that are uttered by the person but not detected by the system.
True negative (tn) refers to everything else.
The experimental results of the proposed methodology in terms of precision, recall, and accuracy parameters are illustrated in Table 3.
It is observed from Table 3 that the number of speech samples has a direct impact on the precision and recall of the application. The overall average precision is 56.79% and recall is 46.79% when the registered sample count for all statements is 2 (n = 2) for each participant. However, the average precision is 93.16% and recall is 83.19% for a registered sample count of 3 (n = 3). The average accuracy in terms of precision and recall is above 97% when the registered sample count for all statements is 4 (n = 4) for each participant. The F1-score of the best precision and recall is calculated as
F1 = 2 × (precision × recall) / (precision + recall).
Hence it is deduced that the precision of the application decreases when only a limited number of speech samples (n ≤ 2) of the deaf-mute is taken. The application performs best when the number of speech samples for each statement is greater than 2 (2 < n ≤ 4). The speech recognition methodology of the proposed application is compared with other speech recognition systems in Table 4.
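The precision, recall, and F1 computations above can be sketched as follows (the tp/fp/fn counts in the usage check are synthetic, not the paper's data):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from word-level counts."""
    precision = tp / (tp + fp)           # detected words that were actually uttered
    recall = tp / (tp + fn)              # uttered words that were detected
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1
```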

Conclusion
Deaf people face many irritations and frustrations that limit their ability to do everyday tasks. Deaf children have high rates of behavioral and emotional issues related to the different methods of communication. The main inspiration behind the proposed application is to remove the communication barrier for Deaf-mutes, especially children. The app takes the speech or text input of a hearing person and translates it into sign language via a 3D avatar. It provides a speech recognition system for the distorted speech of Deaf-mutes. The speech recognition system uses the MFCC feature extraction technique to extract acoustic vectors from speech samples. The HTK toolkit is used to convert these acoustic vectors into recognizable words or sentences by using a pronunciation dictionary and a language model. The application is able to recognize Deaf-mute speech samples of English alphabets (A-Z), English digits (0-9), and 15 common sentences used in daily routine life, e.g., good morning, hello, good luck, and thank you. It provides a message service for both Deaf-mutes and hearing people. Deaf-mutes can use a customized sign language keyboard for composing messages, and the app can convert a received sign language message to text for a hearing person. The proposed application was tested on 15 children aged between 7 and 13 years, achieving an accuracy of 97.9%. The qualitative feedback of the children also highlighted that it is easy for Deaf-mutes to adopt mobile technology and that the mobile app can be used to convey their messages to a hearing person.

Acknowledgments
This work was supported by the Group, Prince Sultan University, Riyadh, Saudi Arabia [RG-CCIS-2017-06-02]. The authors are grateful for this financial support and the equipment provided to make this research successful.

Figure 2: Block diagram of the MFCC feature extraction technique.

Figure 4: Experimental setup: a participant performing the speech sample registration task.
Figure 5:

Table 2: Details of a configuration file (config.txt).

Table 4: Comparison of the proposed methodology with state-of-the-art ASR systems.
Mannepalli et al., 2016 [47]: MFCC + GMM, 92%
Elouahabi et al., 2016 [48] (Amazigh language): MFCC + HTK (6-state HMM), 80%