This paper discusses the results of a pilot experimental study of speech recognition and perception of the semantic content of utterances in a noisy environment. The experiment included perceptual-auditory analysis of words and phrases in Russian and German (in comparison) in the same noisy environment: two types of noise (pink and white) at various signal-to-noise ratios. The statistical analysis showed that the intelligibility and perception of speech in a noisy environment are influenced not only by the noise type and the signal-to-noise ratio but also by linguistic and extralinguistic factors, such as the redundancy of a particular language at various levels of linguistic structure, changes in the acoustic characteristics of the speaker when switching from one language to another, the speaker’s and listener’s level of proficiency in a specific language, and the acoustic characteristics of the speaker’s voice.
Speech intelligibility and speech recognition are important and actively studied topics in various fields of science: linguistics, medicine, electrical engineering, and information technology. The speech recognition process is investigated from different angles, as only an integrated approach can lead to a better understanding of it. One research direction is the study of the biological and neurological mechanisms of speech perception [
Studies of how listener characteristics affect the process of speech recognition showed that music training positively affects speech-in-noise perception [
In Russia, studies of speech characteristics and of speech intelligibility and recognition in noisy environments started in the middle of the 20th century ([
The current research studies the perception of native (Russian) and nonnative (German) speech in a noisy environment (pink and white noise were chosen for the experiment) and pursues the following aims: to identify the effect of the tested types of noise at various signal-to-noise ratios (in comparison) on speech perception; to identify the effects of linguistic and extralinguistic factors on speech perception in a noisy environment.
Our pilot research included perceptual-auditory analysis, at various levels of linguistic structure, of speech utterances in Russian and German (in comparison) in the same noisy environment (pink and white noise at various signal-to-noise ratios), as well as analysis of the effects of linguistic and extralinguistic factors on speech perception in a noisy environment.
The research material of the study was a specially composed (according to the method of Potapova [
The following requirements for the development of the ad hoc research material were set: test phrases should be grammatically and semantically linked and consist of words that exist in both languages; various types of consonants and vowels should be represented in the test phrases; the acoustic realization of the chosen types of consonants and vowels should be similar and comparable in both languages; consonants and vowels that are comparable by place and manner of articulation should occupy identical positions in a syllable (for all test words and phrases in Russian and German); the rhythmic scheme of the test words and phrases should be identical for both languages; combinations with various types of vowels in stressed position in the first syllable (with regard to unilateral distribution) should be tested for each type of consonant.
According to the requirements of the research material the following test phrases were formulated (see Table
Test phrases for experiment.
N | Russian | German |
---|---|---|
1 | | |
2 | | |
3 | | |
4 | | |
5 | | |
Each phrase consists of 3 words, each having, in its first stressed syllable, a combination of the tested type of consonant with one of the tested types of vowels. Since the pronunciation norms of German require the fricative consonant “s” to be voiced before a vowel, speakers were instructed to pronounce this sound as voiceless in the word “Sascha.”
Half of the speakers who took part in the study (50%) were native speakers of literary Russian without prominent dialectal features of pronunciation who also spoke German; the other half (50%) were native speakers of literary German without prominent dialectal features of pronunciation who also spoke Russian. All speakers had the same level of foreign-language proficiency, B2-C1, which was tested according to the system developed by the Council of Europe [
Each speaker read aloud the test phrases and the isolated words from these phrases three times each. Thus, 480 realizations of the test words and phrases were obtained in total, 120 per speaker (60 in Russian and 60 in German).
All test words and phrases were combined into 2 tables (in Russian and in German). The order of words and phrases was random and differed for different speakers. All test words and phrases were read with the intonation of the completed narrative, followed by a pause.
Audio recording of the research material was conducted in a specially equipped room that excludes external interference and noise: an anechoic chamber of the Institute of Applied and Mathematical Linguistics of Moscow State Linguistic University.
Two noise samples (white and pink) were generated with Cool Edit Pro 2.0 for mixing with the audio recordings of the test words and phrases.
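The noise samples in the study were generated in Cool Edit Pro. For readers who want comparable stimuli programmatically, the following is a minimal sketch and not the authors' procedure. White noise is drawn as independent Gaussian samples, and pink (1/f) noise is approximated with the Voss algorithm, which sums several random sources refreshed at octave-spaced intervals. The function names, the number of rows, and the seed are illustrative assumptions.

```python
import random


def white_noise(n_samples, rng):
    """Independent Gaussian samples: flat spectrum on average."""
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def pink_noise_voss(n_samples, rng, n_rows=12):
    """Approximate 1/f noise with the Voss algorithm (an assumption here,
    not the generator used by Cool Edit Pro).

    Row k holds a random value that is redrawn every 2**k samples, so
    slower rows add more low-frequency energy to the sum.
    """
    rows = [rng.gauss(0.0, 1.0) for _ in range(n_rows)]
    out = []
    for i in range(n_samples):
        for k in range(n_rows):
            if i % (2 ** k) == 0:
                rows[k] = rng.gauss(0.0, 1.0)
        out.append(sum(rows))
    return out


def lag1_autocorrelation(x):
    """Sample correlation between neighbouring samples (near 0 for white noise)."""
    mean = sum(x) / len(x)
    centred = [v - mean for v in x]
    num = sum(a * b for a, b in zip(centred, centred[1:]))
    den = sum(v * v for v in centred)
    return num / den


rng = random.Random(0)  # fixed seed, for repeatability only
white = white_noise(4096, rng)
pink = pink_noise_voss(4096, rng)
```

Neighbouring samples of the pink sequence are strongly correlated, while those of the white sequence are not; `lag1_autocorrelation` gives a quick check of that difference without a spectral analysis.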
Each speech segment was mixed with white and pink noise with various levels of signal-to-noise ratio: 0 dB, −3 dB, −6 dB, −9 dB, and −12 dB. Thus, for each spoken realization of the test material 10 variants of mixed signals were obtained, 4800 samples in total, plus 150 phonograms containing only noise (75 with white noise and 75 with pink noise). The total number of phonograms for the experiment was 4950.
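Mixing at a prescribed signal-to-noise ratio amounts to rescaling the noise so that 10·log10(P_speech / P_noise) equals the target value before it is added to the speech. The sketch below shows that step under simple assumptions: signals are plain lists of samples, power is the mean square over the whole segment, and the helper names are hypothetical. The text does not state how levels were measured, so this is not necessarily the authors' exact procedure.

```python
import math
import random


def mean_power(signal):
    """Mean-square power of a sampled signal."""
    return sum(s * s for s in signal) / len(signal)


def snr_db(speech, noise):
    """Signal-to-noise ratio in dB computed from mean powers."""
    return 10.0 * math.log10(mean_power(speech) / mean_power(noise))


def scale_noise_to_snr(speech, noise, target_snr_db):
    """Return the noise rescaled so that snr_db(speech, result) equals the target."""
    wanted_noise_power = mean_power(speech) / (10.0 ** (target_snr_db / 10.0))
    gain = math.sqrt(wanted_noise_power / mean_power(noise))
    return [gain * n for n in noise]


def mix(speech, noise, target_snr_db):
    """Add noise to speech at the requested SNR (equal lengths assumed)."""
    scaled = scale_noise_to_snr(speech, noise, target_snr_db)
    return [s + n for s, n in zip(speech, scaled)]


# Stand-in signals: a 200 Hz tone as "speech" and Gaussian white noise.
rng = random.Random(1)
rate = 8000
speech = [math.sin(2 * math.pi * 200 * t / rate) for t in range(rate)]
noise = [rng.gauss(0.0, 1.0) for _ in range(rate)]

snr_levels_db = [0, -3, -6, -9, -12]  # levels tested in the study
mixtures = {level: mix(speech, noise, level) for level in snr_levels_db}
```

Under this definition, a ratio of −12 dB means that the noise power is about 15.8 times the speech power (10^1.2), which illustrates how severe the lowest tested condition is.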
The number of listeners was 21: 6 males and 15 females, 19–21 y.o., native speakers of the Russian language without prominent dialectal features of pronunciation with normal hearing, who have a command of English at level B2-C1 (which was tested on the system developed by the Council of Europe [
All phonograms were numbered randomly for presentation to each listener: total number of rotation variants was 11.
Listeners had to listen to the phonograms in the order of their sequence numbers in the assigned rotation variant and to write down their answers for each of them in the table (see the example of the answer table in Table
Answer table for listeners.
# of phonogram | Do you hear a speech signal? (mark the corresponding cell with | | What is the utterance language? | | | Write down everything you have heard (in Russian) |
---|---|---|---|---|---|---|
 | Yes | No | Russian | German | Do not know | |
1 | | | | | | |
2 | | | | | | |
⋮ | | | | | | |
They could listen to each phonogram as many times as they wanted. There were short breaks every half hour. Work time per day did not exceed 4 hours. The perception test ran for 2 days: the total work time for each listener was 8 hours.
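Each row of the answer table can be treated as a record holding the listener's three responses. The sketch below shows one way such records could be scored for the indicators used in the study (utterance miss, identification of the utterance language, and correct transcription). The field names and the sample records are invented for illustration and are not data from the experiment.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    """One answer-table row plus the known stimulus (hypothetical fields)."""
    true_language: str    # "Russian" or "German"
    true_text: str        # what was actually said
    heard_speech: bool    # "Do you hear a speech signal?"
    language_answer: str  # "Russian", "German" or "Do not know"
    written_text: str     # what the listener wrote down


def miss_rate(answers):
    """Share of speech phonograms in which no speech was detected (false negatives)."""
    return sum(not a.heard_speech for a in answers) / len(answers)


def language_id_rate(answers):
    """Share of speech phonograms whose language was named correctly."""
    return sum(a.heard_speech and a.language_answer == a.true_language
               for a in answers) / len(answers)


def transcription_rate(answers):
    """Share of phonograms transcribed exactly (case-insensitive)."""
    return sum(a.written_text.strip().lower() == a.true_text.lower()
               for a in answers) / len(answers)


# Invented records, for illustration only.
answers = [
    Answer("Russian", "мама", True, "Russian", "мама"),
    Answer("Russian", "папа", True, "Russian", "баба"),
    Answer("German", "Mama", True, "Do not know", ""),
    Answer("German", "Papa", False, "Do not know", ""),
]
```

Grouping such records by noise type, signal-to-noise ratio, speaker, or language before calling these functions would give per-condition scores of the kind reported in the results.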
The total number of played phonograms was 23085. The total number of played phonograms, which contained only noise, was 727.
The total number of played phonograms, containing speech signal (test words and phrases mixed with noise), was 22358 (the share of phonograms with white and pink noise types made up 50% each).
These calculations indicate that the size of the base of played phonograms is sufficient to ensure reliable and stable quality of the data.
A summary table with quantitative description of the experiment is presented in Table
Quantitative description of the experiment.
N | Characteristics | Number |
---|---|---|
1 | Research material: number of words and phrases in Russian | 20 |
2 | Research material: number of words and phrases in German | 20 |
3 | Research material: number of words and phrases in Russian and German | 40 |
4 | Number of speakers | 4 |
5 | Number of realizations of test words and phrases in Russian and German per one speaker | 120 |
6 | Total number of realizations (phonograms) of test words and phrases in Russian and German | 480 |
7 | Number of tested types of noise | 2 |
8 | Number of tested levels of signal-to-noise ratio for each type of noise | 5 |
9 | Number of phonograms for perceptual-auditory analysis, which contain speech signal (mixes of test words and phrases with noise) | 4800 |
10 | Number of phonograms for perceptual-auditory analysis, which contain only noise | 150 |
11 | Total number of phonograms for perceptual-auditory analysis | 4950 |
12 | Number of listeners | 21 |
13 | Number of auditions, containing only noise | 727 |
14 | Total number of auditions, containing speech signal (mixes of test words and phrases with noise) | 22358 |
15 | Total number of auditions | 23085 |
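The figures in the table follow from a few multiplications and sums, which the sketch below re-derives as a consistency check. The item count per language assumes 5 phrases plus their 15 constituent words, as described in the text; the variable names are illustrative.

```python
# Design parameters stated in the text.
phrases_per_language = 5
words_per_phrase = 3
repetitions = 3
languages = 2
speakers = 4
noise_types = 2
snr_levels = 5
noise_only_phonograms = 150

# 5 phrases plus their 15 isolated words give 20 items per language.
items_per_language = phrases_per_language * (1 + words_per_phrase)

realizations_per_speaker = items_per_language * repetitions * languages
total_realizations = realizations_per_speaker * speakers

# Every realization is mixed with each noise type at each SNR level.
speech_phonograms = total_realizations * noise_types * snr_levels
total_phonograms = speech_phonograms + noise_only_phonograms

# Audition counts reported for the listening sessions.
speech_auditions = 22358
noise_only_auditions = 727
total_auditions = speech_auditions + noise_only_auditions

print(items_per_language, realizations_per_speaker, total_realizations)
print(speech_phonograms, total_phonograms, total_auditions)
```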
We calculated statistical sampling error for the findings to prove the observed tendencies statistically. The statistical sampling error was calculated using the following formula [
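The formula used by the authors is not reproduced in this text. Purely as an illustration of the kind of calculation involved, the sketch below uses the standard textbook margin of error for a proportion estimated from a simple random sample, z·sqrt(p(1−p)/n). This choice of formula, the 95% z-value of 1.96, and the function name are assumptions and may differ from what the authors used.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Margin of error for an observed proportion p from n observations.

    Textbook normal-approximation formula for simple random sampling;
    assumed here for illustration, not taken from the paper.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(p * (1.0 - p) / n)


# Worst case (p = 0.5) for the 22358 auditions that contained speech.
worst_case = margin_of_error(0.5, 22358)
```

Under these assumptions the worst-case margin for the full set of speech auditions is well below one percentage point; scores computed on subsets (a single speaker, noise type, or signal-to-noise level) rest on fewer auditions and have wider margins.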
The working hypothesis of the study was as follows: speech recognition (detection of speech in noise and identification of the utterance language) and perception of the semantic content of the utterance in a variable noisy environment are influenced by the type of noise and the signal-to-noise ratio, as well as by some linguistic and extralinguistic factors. These factors are the redundancy of a particular language at various levels of linguistic structure, changes in the acoustic characteristics of the speaker when switching from one language to another, the speaker’s and listener’s level of proficiency in a specific language, and the acoustic characteristics of the speaker’s voice.
The experiment showed that within the corpus of the research material pink noise provides better protection of the utterance than white noise at equal integral level of signal-to-noise ratio (for all tested levels) in terms of the following indicators: detection of speech signal in noise (see Figure
Influence of the signal-to-noise ratio on detection of speech signal in noise.
Recognition of the utterance language: comparison of Russian and German in noisy environment (with various types of noise and various levels of signal-to-noise ratio).
The lower the signal-to-noise ratio (i.e., the more the noise level exceeds the level of the desired signal), the greater the difference in masking efficiency between pink and white noise; the difference reaches its maximum at the lowest tested signal-to-noise ratio (−12 dB), where, in terms of detection of the speech signal in noise, the masking efficiency of white noise is ~4.7 times lower than that of pink noise. This result corresponds to findings observed in [
Detection of the utterance in noise also depends on the speaker’s and listener’s level of proficiency in a specific language, as well as on the utterance language. Thus, listeners who are native Russian speakers showed a higher score of detection in noise for utterances in German pronounced by native German speakers than for utterances in Russian pronounced by native Russian speakers (see Figure
Utterance miss (false negative error) for listeners, who are native Russian speakers, depending on the characteristics of the speaker.
Listeners who are native speakers of the language of the utterance are able to detect native speech in a noisy environment regardless of the speaker’s level of proficiency in this language (see Figure
Recognition of the word
Besides, for a number of phonograms with low scores of recognition, the list of these substituting words also differed for speech of native and nonnative speakers: Tables
Main substitutes (4% and more) for phonogram
NN | Substitutes | English transcription | Answer share (in %) |
---|---|---|---|
1 | била | [ | 42 |
2 | Мила | [ | 6 |
3 | мыла | [ | 4 |
4 | хиар | [ | 4 |
5 | 26 substitutes with share 1%–3% | | 43 |
Main substitutes (4% and more) for phonogram
NN | Substitutes | English transcription | Answer share (in %) |
---|---|---|---|
1 | била | [ | 34 |
2 | Зина | [ | 13 |
3 | Мила | [ | 13 |
4 | Дима | [ | 6 |
5 | 28 substitutes with share 1%–3% | | 35 |
From Tables
At the acoustic level, the degree of utterance protection also depends on the fundamental frequency of the speaker’s voice: within the corpus of research words and phrases in Russian and German, utterances produced by male speakers were more concealed from detection in noise (i.e., they showed a higher utterance miss (false negative error) score) than utterances produced by female speakers, for both tested types of noise at all tested signal-to-noise ratios (see Figures
Utterance miss (false negative error) for utterances in Russian against pink noise: comparison of male and female voices.
Utterance miss (false negative error) for utterances in Russian against white noise: comparison of male and female voices.
At the phonetic level, different sounds have different degrees of intelligibility depending on their acoustic nature. Among the sounds tested in the experiment, the most resistant to masking in the stressed syllable (i.e., the best recognized) are the consonants [s] and [m] and the vowel [a], while the most easily masked are the consonants [b] and [p] and the vowel [i] (see Figures
Correct recognition of consonants in first stressed syllable against pink noise.
Correct recognition of consonants in first stressed syllable against white noise.
Correct recognition of vowels in first stressed syllable against pink noise.
Correct recognition of vowels in first stressed syllable against white noise.
At the syntactic level correct recognition of words depends on the context: the scores of correct recognition of words, which functioned as a part of a phrase, were higher than the scores for the same words in an isolated position against both tested (pink and white) types of noise at all tested levels of signal-to-noise ratios (see Figures
Comparative analysis of correct recognition of words in Russian against various types of noise in various context (isolated words or as a part of a phrase), in %.
Number of tested words (out of 15) whose recognition score did not exceed 50% at each level of signal-to-noise ratio.
In the phrases the best recognized parts were subjects (which always came first in the phonogram corpus), with predicates showing the second highest score (see Figures
Recognition of isolated words in the phrase
Recognition of isolated words in the phrase
At the lexical level recognition of utterances in Russian is influenced by the occurrence of words in the language: within the research corpus of Russian words the highest score of recognition was shown by words
Frequency of occurrence in the Russian language of the tested words with the highest and the lowest scores of recognition (according to the results of the current experiment).
N | Word | ipm |
---|---|---|
1 | мама | 322.6 |
2 | Саша | 93.6 |
3 | папа | 143.4 |
 | | |
4 | Милу | 10 |
5 | Борю | 19.6 |
6 | Зинин | 20.8 |
7 | Поле | 8.1 |
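To make the frequency contrast in the table explicit, the sketch below averages the ipm values of the two groups of words. It assumes that the first three entries are the best-recognized words and the remaining four the worst-recognized ones, as the caption and the divider in the table suggest; this grouping and the variable names are assumptions.

```python
from statistics import mean

# ipm values from the table (decimal commas written as points).
best_recognized = {"мама": 322.6, "Саша": 93.6, "папа": 143.4}
worst_recognized = {"Милу": 10.0, "Борю": 19.6, "Зинин": 20.8, "Поле": 8.1}

mean_best = mean(best_recognized.values())
mean_worst = mean(worst_recognized.values())
ratio = mean_best / mean_worst

print(f"mean ipm, best recognized:  {mean_best:.1f}")
print(f"mean ipm, worst recognized: {mean_worst:.1f}")
print(f"ratio: {ratio:.1f}")
```

Under this grouping, every word in the first group is more frequent than every word in the second, which is in line with the frequency effect described in the text.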
According to the results of the experiment, the following factors influence the intelligibility and perception of speech in a noisy environment: type of noise and signal-to-noise ratio (pink noise provides better protection of the utterance than white noise at an equal integral signal-to-noise ratio, for all tested levels, in terms of detection of the speech signal in noise, correct identification of the utterance language, and correct perception of the semantic content of the utterance); utterance language and the speaker’s and listener’s proficiency in a specific language; fundamental frequency of the speaker’s voice (within the corpus of the research material in Russian and German, utterances read by males were detected by listeners statistically less often than utterances read by females against both tested types of noise at all signal-to-noise ratios); context, that is, whether a word is isolated or part of a phrase (within the corpus of the research material in Russian, words within a phrase were more intelligible than isolated words against both tested types of noise at all signal-to-noise ratios); frequency of word occurrence in the language (words with a higher frequency of occurrence in Russian showed better intelligibility); phonetic composition of the word (within the corpus of the research material in Russian, the voiceless alveolar sibilant fricative [s] and the bilabial sonorant [m] among consonants and the open central [a] among vowels showed the best intelligibility, that is, they were the hardest to mask with noise, whereas the bilabial stops, voiced [b] and voiceless [p], among consonants and the close front [i] among vowels showed the worst intelligibility, that is, they were the easiest to mask with noise).
Among the further possible directions of analysis the following can be mentioned: Increase of volume of bilingual research material; Expansion of the inventory of acoustic parameters for the analysis of the language sounds recognition; Increase of the number of speakers and listeners taking into account such factors as age, gender, degree of experience in listening, and proficiency in the utterance language, in relation to the studied languages; Study of the influence of linguistic and extralinguistic factors on the recognition in noisy environment for long connected texts; Organization of the database, including units of the sound composition and intonation system of various languages.
The authors declare that they have no conflicts of interest.
The research was supported by the Ministry of Education and Science of the Russian Federation (Project no. 34.1254.2014K, head of the project R. K. Potapova).