Blindness and certain visual disabilities may be partially compensated for by artificial retinae, that is, light sensors directly implanted in a person’s nervous system [
In the present work, substitution is also initiated from optical data, but these data are processed to deliver depth information specifically, and the signal delivered to the user is auditory. This sensory modality has been exploited before for navigation. Depth perception is a major source of information for navigation; for example, it is used to anticipate trajectories and avoid obstacles. In humans, depth perception is mainly visual: beyond what we can touch, vision provides richer qualitative and quantitative information than the other senses. In the absence of visual input, especially in blind people, an additional signal can provide information on depth [
Our system is inspired by auditory substitution devices that encode visual scenes from a video camera and produce sounds as an acoustic representation called a “soundscape.” In the vOICe (for Oh I See) [
Loomis et al. [
Other authors studied navigation performance via an SSD called BrainPort [
Various systems differ in the way they convert images into sounds. As mentioned before, some techniques use the sequential scanning of columns in images. By processing only one column at a time [
More recent systems handle an entire 2D image snapshot in various ways. Taking the human visual system as a model, a multiresolution retina can encode the center of the field in greater detail than the periphery [
Only a limited number of audition-based SSDs focus on depth acquisition. Yet in vertebrates, different mechanisms evolved to evaluate object distance independently of vision and touch. For instance, echolocation was first discovered in bats in air [
More recently, the emergence of the RGB-D Kinect sensor has paved the way for a new generation of SSD navigation systems based on real-time depth-map acquisition [
We propose an SSD prototype, called MeloSee, constructed on Kinect principles. The emitter-receiver system is mounted on the user’s head (Figure
A blindfolded participant equipped with the set-up called MeloSee. The participant holds the distraction apparatus in her right hand.
Inferring depth from raw optical data about the surrounding environment requires attention and cognitive resources. MeloSee extracts depth information in real time and transposes it onto acoustic scales, so that the user’s resources may be spared and left available for tasks other than navigating.
The goal of our research is to test this depth-to-sound design in a navigation task. Firstly, does it at least allow travelling between walls without sight and without touch, using auditory cues only? Secondly, if the depth-to-sound conversion is relevant, a small amount of training should be sufficient to achieve navigation, even in an unknown space. Thirdly, if learning is effective, it should persist for days without continuous refreshing (long-term learning). Finally, if the depth-to-sound code is easy enough for users to grasp, participants should be able to complete a distraction task while navigating. To test these four ideas, blindfolded participants navigated along different unknown paths in two sessions separated by a one-week interval. In each experimental session, they also took part in two trials with a distraction task. Travel time and errors (contacts with walls and U-turns) were recorded.
The system presented in Figure
Visual-auditory sensory substitution flowchart. The high input-information throughput is significantly reduced before conversion into sound.
The Asus Xtion sensor projects an infrared pattern on the scene. (For this reason, the system does not work properly outdoors.) Depth can be inferred from the deformation of the infrared pattern [
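For readers who wish to experiment, depth frames from such a structured-light sensor can be grabbed with the open-source OpenNI2 runtime. The sketch below uses the `primesense` Python bindings; it is our own minimal illustration, not part of the MeloSee code, and assumes the OpenNI2 runtime is installed on the machine.

```python
# Minimal depth-frame grab from an OpenNI2-compatible sensor such as the
# Asus Xtion. Assumes the OpenNI2 runtime plus `pip install primesense numpy`.
import numpy as np
from primesense import openni2

openni2.initialize()                 # load the OpenNI2 runtime
dev = openni2.Device.open_any()      # first available RGB-D sensor
depth_stream = dev.create_depth_stream()
depth_stream.start()

frame = depth_stream.read_frame()
buf = frame.get_buffer_as_uint16()   # one uint16 per pixel, depth in millimeters
depth_mm = np.frombuffer(buf, dtype=np.uint16).reshape(frame.height, frame.width)
print(depth_mm.min(), depth_mm.max())  # 0 marks pixels where no depth was inferred

depth_stream.stop()
openni2.unload()
```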
An example of this approach is shown in Figure
Retinal depth encoder. (a) Grayscale depth map. (b) Activity computation. (c) RF activities: the closer the object, the lighter the disc (Figure
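In code, the encoder stage can be paraphrased as follows. This is an illustration under stated assumptions rather than the actual MeloSee implementation: we assume a uniform 8×8 grid of receptive fields (the retina in the figure may be non-uniform), define each RF’s activity as the mean depth inside its patch, and use a linear fade between the 50 cm and 250 cm operating limits described below.

```python
import numpy as np

def rf_activities(depth_mm, rows=8, cols=8, d_min=500.0, d_max=2500.0):
    """Collapse a depth map (in millimeters) onto a rows x cols grid of
    receptive fields (RFs). Activity is strongest for close surfaces and
    fades linearly to zero at the maximum operating range."""
    h, w = depth_mm.shape
    acts = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            patch = depth_mm[r * h // rows:(r + 1) * h // rows,
                             c * w // cols:(c + 1) * w // cols]
            valid = patch[patch > 0]   # the sensor reports 0 below its 50 cm limit
            if valid.size == 0:
                continue               # no depth inferred: the RF stays silent
            acts[r, c] = np.clip((d_max - valid.mean()) / (d_max - d_min), 0.0, 1.0)
    return acts
```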
Each RF’s activity is transformed into a particular sound source. A pitch is associated with each RF according to its vertical position (from C4 to C5 on an octave scale, with low frequencies at the bottom and high frequencies at the top). The horizontal position defines the stereophonic left and right gains (amplification or attenuation) applied to the source in the binaural sound representation. Intensity varies inversely with distance. Sounds for all RFs are played in parallel: a stereo tone generator outputs the auditory scene by summing the contributions of all RFs, like a synthesizer with linearly mixed oscillators. The audio update rate was 7.5 Hz, that is, one soundscape update every ~133 ms.
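A minimal sketch of such a tone generator is given below. The exact scale degrees between C4 and C5 and the linear panning law are our assumptions; the sketch simply sums one sine oscillator per active RF, with pitch set by row, stereo gain by column, and amplitude by RF activity.

```python
import numpy as np

FS = 44100                    # audio sample rate (Hz)
BLOCK = int(FS / 7.5)         # one soundscape update (~133 ms)

# One note per RF row, C4 at the bottom up to C5 at the top. An 8-note
# C major scale is assumed here for an 8-row retina.
FREQS = [261.63, 293.66, 329.63, 349.23, 392.00, 440.00, 493.88, 523.25]

def synthesize_block(acts):
    """acts: (rows, cols) RF activities in [0, 1], row 0 = top of the image.
    Returns a (BLOCK, 2) stereo buffer: pitch encodes elevation, left/right
    gain encodes azimuth, amplitude encodes proximity."""
    rows, cols = acts.shape
    t = np.arange(BLOCK) / FS
    out = np.zeros((BLOCK, 2))
    for r in range(rows):
        f = FREQS[rows - 1 - r]        # top image row gets the highest pitch
        for c in range(cols):
            if acts[r, c] == 0.0:
                continue               # silent RF: nothing to add
            pan = c / (cols - 1)       # 0 = full left, 1 = full right
            tone = acts[r, c] * np.sin(2 * np.pi * f * t)
            out[:, 0] += (1.0 - pan) * tone
            out[:, 1] += pan * tone
    return out / (rows * cols)         # linear mixture, normalized against clipping
```

In a real-time version, oscillator phases would also have to be carried across successive blocks to avoid audible clicks at the 7.5 Hz update boundaries.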
The Asus Xtion sensor cannot operate closer than 50 cm from the targeted object or surface. Below this minimum distance, the set-up becomes silent. Sound intensity also fades with distance, down to zero at a certain maximum limit. For our experiment, the maximum was adjusted to 250 cm. While we could set the sensor to operate farther away, doing so would reduce the sound intensity contrast at closer distances.
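The trade-off can be seen directly in the fade curve. Assuming the same linear fade as in the sketches above (the paper only states that intensity decreases with distance, reaching zero at the maximum), raising the maximum range flattens the curve and so shrinks the intensity contrast between nearby distances:

```python
def gain(d_cm, d_min=50.0, d_max=250.0):
    """Linear intensity fade: silent below the 50 cm sensor limit and beyond d_max."""
    if d_cm < d_min or d_cm > d_max:
        return 0.0
    return (d_max - d_cm) / (d_max - d_min)

# Intensity contrast between obstacles at 100 cm and 150 cm:
print(gain(100) - gain(150))                          # 0.25 with a 250 cm ceiling
print(gain(100, d_max=500) - gain(150, d_max=500))    # ~0.11 with a 500 cm ceiling
```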
In a sensory-motor task, latency is a key feature of the system. We assessed the latency of the whole image-to-sound conversion using a blinking LED filmed by the RGB-D sensor and simultaneously wired to an oscilloscope (the LED light prevents depth calculation at the dazzled point). Latency is then the delay between the LED’s extinction and the arrival of the corresponding beep. With a midrange laptop (Win7, Asus PC with an Intel Atom N550 CPU), we measured a latency of approximately 100–150 ms, which is adequate for a real-time system.
Twenty-one healthy participants took part in the study (14 women; age: M = 21.2 years, SD = 2.1). They were students at Grenoble University and were paid for their participation in each session.
The study was conducted in accordance with the Declaration of Helsinki and with the understanding and written consent of each participant.
Participants were blindfolded with a sleep mask during the entire experiment; therefore, they never saw the experimental set-up. They were led to a starting point and asked to walk to the end of a path using MeloSee. They were instructed to navigate as rapidly and accurately as possible, that is, making as few contacts as possible with walls and screens. Before starting the trials, they were informed that there were no barriers, stairs, culs-de-sac, or corridors narrower than a doorway along the route. Because participants were not allowed to touch the walls, the task could only be performed with the SSD switched on, so no baseline trial (e.g., blindfolded participants walking with the SSD switched off) was recorded. During some trials, they had to complete a concurrent discrimination task that monopolized both hands while navigating; therefore, there was no comparison trial with a device requiring a hand (e.g., a control group using a white cane). Importantly, during the trials, participants were warned when they started heading in the wrong direction (i.e., making a U-turn).
Two different paths (Figure
Paths used in the experiment. Path A (left) was 22 m long and path B (right) was 19.6 m long. Squares indicate either the start or finish, depending on the route direction.
The distraction task required participants’ attention without impeding perception of the “melody.” It consisted of detecting a temporal pattern of touches. The participant’s thumb was in contact with a bare loudspeaker held in one hand and connected to an MP3 player. The touch stimuli were 100 Hz sine-wave buzzes lasting 200 ms each. They were not audible to participants during the navigation task but were strong enough to stimulate the skin. Buzzes were emitted singly, in pairs, or in triplets, with a 1 s onset-to-onset interval within a pattern. Patterns were separated by random intervals of 2.7 to 12.0 s and assembled into blocks lasting 30 s. Each block contained one pair, one triplet, and two single-buzz patterns, all randomly distributed. Ten blocks were assembled into a 5 min audio file played in a loop. While navigating, participants had to respond to double buzzes, and only double buzzes, by slapping their thigh with their free hand. Their performance was monitored and assessed in real time with the help of a waveform track representing the buzz sequence. Because the study focused on navigation performance, participants were asked to prioritize the distraction task whenever it was part of a trial: defining navigation as secondary for all participants prevented variable trade-offs between the tasks that would have obscured the distracting effect expected on navigation.
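A stimulus file of this kind is easy to reproduce. The sketch below is a hypothetical reconstruction: the original file’s exact randomization (e.g., whether silence precedes the first pattern of a block) is not documented, so the three inter-pattern gaps per block, the rejection sampling used to fit them into 30 s, and the fixed seed are our choices.

```python
import numpy as np

FS = 44100
BUZZ = np.sin(2 * np.pi * 100 * np.arange(int(0.2 * FS)) / FS)  # 200 ms, 100 Hz

def pattern(n_buzzes):
    """n buzzes with a 1 s onset-to-onset interval, as one waveform."""
    out = np.zeros(int(((n_buzzes - 1) * 1.0 + 0.2) * FS))
    for i in range(n_buzzes):
        start = int(i * 1.0 * FS)
        out[start:start + BUZZ.size] = BUZZ
    return out

def block(rng, duration=30.0):
    """One 30 s block: two singles, one pair, one triplet, in random order,
    separated by random 2.7-12.0 s silences (resampled until they fit)."""
    patterns = [pattern(n) for n in rng.permutation([1, 1, 2, 3])]
    content = sum(p.size for p in patterns) / FS
    while True:
        gaps = rng.uniform(2.7, 12.0, size=3)
        if content + gaps.sum() <= duration:
            break
    out = np.zeros(int(duration * FS))
    pos = 0
    for i, p in enumerate(patterns):
        out[pos:pos + p.size] = p
        pos += p.size + (int(gaps[i] * FS) if i < 3 else 0)
    return out

rng = np.random.default_rng(0)
audio = np.concatenate([block(rng) for _ in range(10)])  # ~5 min, looped on the player
# e.g., scipy.io.wavfile.write("buzzes.wav", FS, (audio * 32767).astype(np.int16))
```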
Prior to the first trial of the first session, participants were familiarized with the upcoming distraction task and with the SSD in an anteroom. First, they learned to recognize the double buzz within a 30 s sequence. Then, the rationale of the SSD was explained to them while they were equipped and blindfolded. The way sound intensity changes with distance from walls or screens was demonstrated (and experienced) in a didactic exchange with the experimenter, with emphasis placed on the importance of head movements. Additional explanations covered the system’s becoming silent at very short range and when facing wide open spaces devoid of obstacles. Participants then moved around the anteroom for two minutes in order to understand the sound coding; during this phase, they were allowed to use their hands to explore the room through both touch and sound. Finally, they experienced the distraction task together with the substitution system for one minute. The overall familiarization lasted less than eight minutes, after which all participants felt ready for the navigation task.
Two sessions, each lasting less than one hour, were separated by one week (intersession interval: M = 7 days, SD = 2). Each session included six trials.
The experimental design was within-participant, and the general procedure was the same in each session. Each participant was randomly assigned to series #1, #2, #3, or #4 (Table
Experimental block design for the two sessions.
| Series | Week 1, trial 11 | 12 | 13 | 21 | 22 | 14 | Week 2, trial 11 | 12 | 13 | 21 | 22 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| #1 | A+ | A+ | A+ | B+ | B+ | A+ | A− | A− | A− | B− | B− | A− |
| #2 | A− | A− | A− | B− | B− | A− | A+ | A+ | A+ | B+ | B+ | A+ |
| #3 | B+ | B+ | B+ | A+ | A+ | B+ | B− | B− | B− | A− | A− | B− |
| #4 | B− | B− | B− | A− | A− | B− | B+ | B+ | B+ | A+ | A+ | B+ |
Each series (#1 to #4) started with a different route and used the four routes (A+, A−, B+, B−) in a different order. In each session, the second run of the second route (trial 22) and the final run of the first route (trial 14) were performed with the concurrent distraction task.
The same procedure was applied in the second week to test for long-term learning. Inverting the travel direction of the two paths resulted in two new routes for the participants.
As previously noted, participants had to navigate as quickly as possible without touching the walls. For each trial, we measured travel time and navigation errors (number of wall contacts and U-turns). Performance on the distraction task was also recorded. For each dependent variable, a repeated-measures ANOVA was conducted with week session (first, second) and trial rank (11, 12, 13, 21, 22, 14) as within-participant factors.
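Such an analysis can be reproduced with standard tools. The sketch below uses `statsmodels`; the CSV file name and long-format column layout are hypothetical, not the authors’ actual analysis script.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Long format: one row per participant x session x trial (hypothetical columns:
# participant, session, trial, travel_time).
df = pd.read_csv("navigation_results.csv")

anova = AnovaRM(df, depvar="travel_time", subject="participant",
                within=["session", "trial"]).fit()
print(anova)  # F and p values for session, trial, and their interaction
```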
The results for travel time are presented in Figure
Navigation performance in the six experimental trials during the first and the second week sessions. Trials 11, 12, 13, and 14 were conducted on one path and trials 21 and 22 on another. Cognitive load was added in trials 22 and 14.
When a cognitive load was added, travel time increased for the second route (trial 22) but not for the more practiced first route (trial 14).
An ANOVA on contact with the walls (Figure
The same pattern of results was observed with the number of U-turns (Figure
Together, our results support the applicability of depth-image-to-sound conversion for navigation along unknown paths. Interestingly, travel time improved between a first and a second trial performed on the same path. It also improved between the first trial on the first path and the first trial on a new path (short-term learning). Performance further improved over sessions (long-term learning), even when a distraction task was introduced.
It is important to note that the aim of this study was to test the system alone in a basic navigation task. The experimental procedure was sufficient to assess performance through several quantitative variables and with an additional cognitive load whose complexity can be adjusted in future work. The paradigm may inspire assays for further development of this real-time device or for later comparisons between different SSDs.
Our system was readily employed by inexperienced users (after only eight minutes of familiarization with the device). Travel time decreased between the first and second trials and remained lower for the remainder of each session. With travel times even shorter a week later, learning also proved to be long-term. In addition to travel time, a decrease in the frequency of navigation errors confirmed the long-term improvement, with both additional measures (wall contacts and U-turns) in agreement. Moreover, when participants had to deal with a tactile distraction task, learning remained robust as evaluated by both travel time and errors. The distraction task itself was hardly affected. The additional load somewhat lengthened travel time on the less-navigated route only, a slight effect compared with the massive learning progress observed.
However, the speed performance, at best around 8.7 m/min, may seem low compared with other systems (e.g., [
Our results seem consistent with other studies [
Our system has the advantage of being built from an RGB-D sensor, a common manufactured component. However, it has two functional limitations. First, because it relies on infrared beams, it cannot operate outdoors, where it is jammed by the stronger competing infrared in sunlight. Other sensors could be considered as input devices for a version that would deliver a similar polyphonic signal from a depth array sampled outdoors. Second, the RGB-D sensor we tested does not pick up information at very close range (below 0.5 m). Participants’ main difficulty was finding their way through narrow passages and confined places, such as corners or door frames. In such situations, they learned to step back to restore the acoustic signal, after which they walked faster and better. Subsequent versions of the system should implement methods to help users distinguish silence at very close range from silence at very long range. Automated processing may additionally be implemented to help the user detect relevant patterns [
The goal of the present experiment was to show that our portable real-time device, which turns a depth scene from an RGB-D sensor into polyphonic stimulation, can serve as a usable SSD.
As our tests on different paths show, polyphonic conversion of a 2D depth array can support navigation in corridors without vision or touch. Performance improved significantly with practice, through both short- and long-term learning, on new as well as more familiar paths. The portable system remained functional even when a supplementary task diverted participants’ attention. The fast online coupling between real space and auditory mapping appears to engage cognitive processes normally driven by natural visual input. Its development could help sightless people find their way along unknown as well as more familiar paths, without monopolizing the navigator’s hands and attention. Future work is, however, required to compare the efficiency of our device with that of other SSDs or guidance aids (white cane or guide dog) and to test it with blind persons.
The authors declare that there is no conflict of interest regarding the publication of this paper.
This work was supported by the “ConstrainPercept” Program (ANR-12-JSH2-0008-01) run by the French National Research Agency (ANR), by the “Pôle Grenoble Cognition” and “SFR Santé-Société” Programs run by Grenoble University, and partially by the LabEx PERSYVAL-Lab (ANR-11-LABX-0025). Special thanks are due to Benjamin de Vulpillières for his editing and proofreading assistance.