Visual sensor networks have emerged as an important class of sensor-based distributed intelligent systems, with unique performance, complexity, and quality of service challenges. Consisting of a large number of low-power camera nodes, visual sensor networks support a great number of novel vision-based applications. The camera nodes provide information from a monitored site, performing distributed and collaborative processing of their collected data. Using multiple cameras in the network provides different views of the scene, which enhances the reliability of the captured events. However, the large amount of image data produced by the cameras, combined with the network's resource constraints, requires new means for data processing, communication, and sensor management. Meeting these challenges requires interdisciplinary approaches that draw on vision processing, communications and networking, and embedded processing. In this paper, we provide an overview of the current state of the art in the field of visual sensor networks by exploring several relevant research directions. Our goal is to provide a better understanding of current research problems in the different research fields of visual sensor networks, and to show how these fields should interact to solve the many challenges of visual sensor networks.
Camera-based networks have been used for security monitoring and surveillance for a very long time. In these networks, surveillance cameras act as independent peers that continuously send video streams to a central processing server, where the video is analyzed by a human operator.
With the advances in image sensor technology, low-power image sensors have appeared in a number of products, such as cell phones, toys, computers, and robots. Furthermore, recent developments in sensor networking and distributed processing have encouraged the use of image sensors in these networks, which has resulted in a new ubiquitous paradigm—visual sensor networks. Visual sensor networks (VSNs) consist of tiny visual sensor nodes called camera nodes, which integrate the image sensor, embedded processor, and wireless transceiver. Following the trends in low-power processing, wireless networking, and distributed sensing, visual sensor networks have developed as a new technology with a number of potential applications, ranging from security to monitoring to telepresence.
In a visual sensor network, a large number of camera nodes form a distributed system in which the camera nodes are able to process image data locally and extract relevant information, to collaborate with other cameras on the application-specific task, and to provide the system's user with information-rich descriptions of captured events. With current trends moving toward the development of distributed processing systems, and with an increasing number of devices with built-in image sensors, the question arises of how these devices can be used together [
Several survey papers on multimedia sensor networks and visual processing can be found in the current literature. In [
One of the main differences between visual sensor networks and other types of sensor networks lies in the nature of how the image sensors perceive information from the environment. Most sensors provide measurements as 1D data signals. However, image sensors are composed of a large number of photosensitive cells. One measurement of the image sensor provides a 2D set of data points, which we see as an image. The additional dimensionality of the data set results in richer information content as well as in a higher complexity of data processing and analysis.
In addition, a camera's sensing model is inherently different from the sensing model of any other type of sensor. Typically, a sensor collects data from its vicinity, as determined by its sensing range. Cameras, on the other hand, are characterized by a directional sensing model—cameras capture images of distant objects/scenes from a certain direction. The 2D sensing range of traditional sensor nodes is, in the case of cameras, replaced by a 3D viewing volume (called field of view, or FoV).
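To make the contrast between the two sensing models concrete, the following Python sketch checks whether a point on the ground plane is sensed by an omnidirectional node with range `r` and by a camera with a heading and an angular half-width. It is a simplified 2D projection of the 3D viewing volume described above; the function names and parameters are illustrative assumptions, not part of any specific system.

```python
import math


def in_disk(sensor, point, r):
    """Omnidirectional model: the point is sensed if it lies within range r."""
    return math.dist(sensor, point) <= r


def in_fov(camera, heading, half_angle, r, point):
    """Directional 2D model: the point must lie within range r AND within
    half_angle radians of the camera's heading (a planar sector)."""
    dx, dy = point[0] - camera[0], point[1] - camera[1]
    if math.hypot(dx, dy) > r:
        return False
    # Smallest signed angle between the bearing to the point and the heading.
    diff = math.atan2(dy, dx) - heading
    diff = math.atan2(math.sin(diff), math.cos(diff))
    return abs(diff) <= half_angle


cam = (0.0, 0.0)
print(in_disk(cam, (-5.0, 0.0), 10.0))                    # behind the node, in range
print(in_fov(cam, 0.0, math.pi / 6, 10.0, (-5.0, 0.0)))   # behind the camera
```

A point directly behind the camera is within range of an omnidirectional sensor but outside the camera's FoV. A full 3D model would additionally bound the vertical viewing angle and could account for occlusions.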
Visual sensor networks are in many ways unique and more challenging compared to other types of wireless sensor networks. These unique characteristics of visual sensor networks are described next.
The lifetime of battery-operated camera nodes is limited by their energy consumption, which is determined by the energy spent on sensing, processing, and transmitting the data. Given the large amount of data generated by the camera nodes, both processing and transmitting image data are quite costly in terms of energy, much more so than in other types of sensor networks. Furthermore, visual sensor networks require large bandwidth for transmitting image data. Thus, both energy and bandwidth are even more constrained than in other types of wireless sensor networks.
Local (on-board) processing of the image data reduces the total amount of data that needs to be communicated through the network. Local processing can involve simple image processing algorithms (such as background subtraction for motion/object detection, and edge detection) as well as more complex image/vision processing algorithms (such as feature extraction, object classification, and scene reasoning). Thus, depending on the application, the camera nodes may provide different levels of intelligence, as determined by the complexity of the processing algorithms they use [
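The benefit of local processing can be illustrated with a back-of-the-envelope energy comparison. The sketch below assumes a simple linear model in which radio energy grows with the number of transmitted bits and on-board processing adds a fixed cost per frame; all constants are hypothetical placeholders chosen for illustration, not measured values for any camera node.

```python
def node_energy_j(bits_sent, tx_j_per_bit, proc_j=0.0):
    """Energy (J) to process a frame on board (proc_j) and transmit bits_sent bits."""
    return proc_j + bits_sent * tx_j_per_bit


def break_even_proc_j(raw_bits, processed_bits, tx_j_per_bit):
    """Largest processing cost (J) for which local processing still saves energy."""
    return (raw_bits - processed_bits) * tx_j_per_bit


# Hypothetical placeholder values, for illustration only.
FRAME_BITS = 320 * 240 * 8   # one uncompressed 8-bit grayscale QVGA frame
FEATURE_BITS = 2_000         # e.g., a compact description of a detected object
TX_J_PER_BIT = 1e-6          # assumed radio energy per transmitted bit
PROC_J = 0.05                # assumed energy of on-board processing per frame

raw = node_energy_j(FRAME_BITS, TX_J_PER_BIT)
local = node_energy_j(FEATURE_BITS, TX_J_PER_BIT, PROC_J)
print(f"send raw frame: {raw:.4f} J, process locally: {local:.4f} J")
```

Local processing pays off whenever its cost stays below the radio energy it saves. With real hardware, both terms should be measured, since processing costs grow quickly with algorithm complexity.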
In order to extract the necessary information from different images, a camera node must employ different image processing algorithms. No single image processing algorithm achieves the same performance for all types of images; for example, an algorithm for face extraction differs significantly from an algorithm for vehicle detection. However, it is often impossible to keep all the necessary image processing algorithms in the constrained memory of a camera node. One solution to this problem is to use mobile agents: specific pieces of software dispatched by the sink node to the region of interest [
Most applications of visual sensor networks require real-time data from the camera nodes, which imposes strict bounds on the maximum allowable delay of data from the sources (cameras) to the user (sink). The real-time performance of a visual sensor network is affected by the time required for image data processing and for the transmission of the processed data throughout the network. Constrained by limited energy resources and by the processing speed of embedded processors, most camera nodes have processors that support only lightweight processing algorithms. On the network side, the real-time performance of a visual sensor network is constrained by the wireless channel limitations (available bandwidth, modulation, data rate), the employed wireless standard, and the current network condition. For example, upon detection of an event, the camera nodes can suddenly inject large amounts of data into the network, which can cause data congestion and increase data delays.
Different error protection schemes can also affect the real-time transmission of image data through the network. Commonly used error protection schemes, such as automatic repeat request (ARQ) and forward error correction (FEC), have been investigated in order to increase the reliability of wireless data transmissions [
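The delay cost of the two schemes can be sketched with textbook probability, assuming independent packet losses with probability `p_loss`. Under ARQ with unlimited retries, the number of transmissions per packet is geometrically distributed, so its mean is 1/(1 - p), and every extra attempt adds at least one more round trip of delay. An (n, k) erasure code instead sends n packets once and succeeds if any k of them arrive, trading bandwidth for bounded delay. The helper names below are hypothetical.

```python
from math import comb


def arq_expected_tx(p_loss):
    """Mean number of transmissions per packet under ARQ with unlimited
    retries and independent losses: the count is geometric, so 1 / (1 - p)."""
    return 1.0 / (1.0 - p_loss)


def fec_block_success(n, k, p_loss):
    """Probability that an (n, k) erasure-coded block can be decoded, i.e.
    that at least k of its n packets arrive (independent losses assumed)."""
    p_ok = 1.0 - p_loss
    return sum(comb(n, i) * p_ok**i * p_loss**(n - i) for i in range(k, n + 1))


# Example: 20% packet loss.
print(arq_expected_tx(0.2))          # about 1.25 transmissions per packet
print(fec_block_success(6, 4, 0.2))  # probability of decoding in a single round
```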
Finally, multihop routing is the preferred routing method in wireless sensor networks due to its energy efficiency. However, multihop routing may increase delays due to queueing and data processing at the intermediate nodes. Thus, the total delay from the data source (camera node) to the sink increases with the number of hops on the routing path. Additionally, bandwidth becomes a scarce resource in multihop networks consisting of traditional wireless sensor nodes. To support the transmission of real-time data, wireless modules that provide larger bandwidths (such as those based on IEEE 802.11b/g/n) can be considered.
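The growth of delay with path length can be made explicit with a simple store-and-forward model, in which each hop contributes queueing, processing, and transmission time. The sketch below is an idealized illustration (propagation delay and channel contention are ignored); the `Hop` fields and the numbers in the example are assumptions. The 250 kbit/s rate in the example corresponds to IEEE 802.15.4 in the 2.4 GHz band.

```python
from dataclasses import dataclass


@dataclass
class Hop:
    queue_s: float   # waiting time in the forwarding node's queue (s)
    proc_s: float    # processing time at the node (s)
    link_bps: float  # data rate of the outgoing link (bit/s)


def end_to_end_delay(packet_bits, hops):
    """Store-and-forward delay: each hop adds queueing, processing, and the
    time to transmit the packet over its link (propagation is ignored)."""
    return sum(h.queue_s + h.proc_s + packet_bits / h.link_bps for h in hops)


# Hypothetical values: three identical hops over 250 kbit/s links.
path = [Hop(queue_s=0.010, proc_s=0.002, link_bps=250_000)] * 3
print(end_to_end_delay(1000, path))  # grows linearly with the number of hops
```

With these assumed values, three hops add 48 ms. The transmission term shrinks with a faster radio, but the queueing and processing terms are paid again at every intermediate node.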
In visual sensor networks, most of the image processing algorithms require information about the locations of the camera nodes as well as information about the cameras' orientations. This information can be obtained through a camera calibration process, which retrieves information on the cameras' intrinsic and extrinsic parameters (explained in the Section
The information content of an image may become meaningless without proper information about the time at which the image was captured. Many processing tasks that involve multiple cameras (such as object localization) depend on tightly synchronized snapshots from the cameras. Time synchronization protocols developed for wireless sensor networks [
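Many synchronization schemes for wireless sensor networks build on a two-way message exchange. As an illustration (not necessarily the protocols cited above), node A timestamps a request at T1 on its own clock, node B records its arrival at T2 and sends a reply at T3 on B's clock, and A records the reply at T4. Assuming equal delay in both directions, the clock offset and one-way delay follow directly. The function name is hypothetical.

```python
def two_way_offset_delay(t1, t2, t3, t4):
    """Two-way message exchange between nodes A and B.

    t1: request sent (A's clock)    t2: request received (B's clock)
    t3: reply sent (B's clock)      t4: reply received (A's clock)
    Assuming equal delay in both directions, returns
    (offset of B's clock relative to A's, one-way message delay)."""
    offset = ((t2 - t1) - (t4 - t3)) / 2
    delay = ((t2 - t1) + (t4 - t3)) / 2
    return offset, delay


# B's clock runs 5 time units ahead of A's; the one-way delay is 2.
print(two_way_offset_delay(100, 107, 110, 107))  # (5.0, 2.0)
```

The symmetric-delay assumption is the main source of error: if the two directions experience different delays (for example, due to queueing or MAC access), the offset estimate is biased by half of that asymmetry.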
The cameras generate large amounts of data over time, which in some cases should be stored for later analysis. An example is monitoring of remote areas by a group of camera nodes, where the frequent transmission of captured image data to a remote sink would quickly exhaust the cameras' energy resources. Thus, in these cases the camera nodes should be equipped with memories of larger capacity in order to store the data. To minimize the amount of data that requires storage, the camera node should classify the data according to its importance by using spatiotemporal analysis of image frames, and decide which data should have priority to be stored. For example, if an application is interested in information about some particular object, then the background can be highly compressed and stored, or even completely discarded [
The stored image data usually becomes less important over time, so it can be replaced with newly acquired data. In addition, the redundancy in the data collected by cameras with overlapping views can be reduced via local communication and processing. This enables the cameras to reduce their storage needs by keeping only data of unique image regions. Finally, by increasing the available memory, more complex processing tasks can be supported on board, which in turn can reduce data transmissions and the space needed for storing processed data.
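The storage policy described above, in which data are prioritized by importance and older data become less important over time, could be realized, for example, with an exponentially aging priority. The class below is a hypothetical design: the importance score is assumed to come from some on-board spatiotemporal analysis, and the half-life parameter is an arbitrary tuning knob.

```python
class FrameStore:
    """Bounded on-board frame store (hypothetical design). Each frame's
    effective priority is its importance halved every `half_life_s` seconds;
    when the store is full, the frame with the lowest priority is dropped."""

    def __init__(self, capacity, half_life_s):
        self.capacity = capacity
        self.half_life_s = half_life_s
        self.frames = []  # list of (frame_id, importance, timestamp)

    def _priority(self, importance, ts, now):
        return importance * 0.5 ** ((now - ts) / self.half_life_s)

    def store(self, frame_id, importance, now):
        self.frames.append((frame_id, importance, now))
        if len(self.frames) > self.capacity:
            victim = min(self.frames, key=lambda f: self._priority(f[1], f[2], now))
            self.frames.remove(victim)

    def ids(self):
        return [f[0] for f in self.frames]


store = FrameStore(capacity=2, half_life_s=10.0)
store.store("a", importance=1.0, now=0.0)
store.store("b", importance=0.2, now=0.0)
store.store("c", importance=0.6, now=10.0)  # "b" is evicted
store.store("d", importance=0.7, now=30.0)  # "a" has aged and is evicted
print(store.ids())
```

In this example, frame `a` starts with the highest importance but is eventually overwritten because its priority decays. A practical store would also account for frame sizes rather than counting frames.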
Visual sensor networks are envisioned as distributed and autonomous systems, where cameras collaborate and, based on exchanged information, reason autonomously about the captured event and decide how to proceed. Through collaboration, the cameras relate the events captured in the images, and they enhance their understanding of the environment. Similar to wireless sensor networks, visual sensor networks should be data-centric, where captured events are described by their names and attributes. Communication between cameras should be based on some uniform ontology for the description of the event and interpretation of the scene dynamics [
With the rapid development of visual sensor networks, numerous applications for these networks have been envisioned, as illustrated in the Table
Applications of visual sensor networks.
General application | Specific application |
---|---|
Surveillance | Public places |
Surveillance | Traffic |
Surveillance | Parking lots |
Surveillance | Remote areas |
Environmental monitoring | Hazardous areas |
Environmental monitoring | Animal habitats |
Environmental monitoring | Building monitoring |
Smart homes | Elderly care |
Smart homes | Kindergarten |
Smart meeting rooms | Teleconferencing |
Smart meeting rooms | Virtual studios |
Virtual reality | Telepresence systems |
Virtual reality | Telereality systems |
(i) Surveillance: Surveillance has been the primary application of camera-based networks for a long time, with large public areas (such as airports and subways) monitored by hundreds or even thousands of security cameras. Since cameras usually provide raw video streams, extracting important information from the collected image data requires a huge amount of processing and human resources, making it time-consuming and prone to error. Current efforts in visual sensor networking are directed toward advancing existing surveillance technology by using intelligent methods to extract information from image data locally on the camera node, thereby reducing the amount of data traffic. At the same time, visual sensor networks integrate resource-aware camera management policies and wireless networking aspects with surveillance-specific tasks. Thus, visual sensor networks can be seen as the next generation of surveillance systems, which are neither limited by the absence of infrastructure nor dependent on large processing resources at one central server. These networks adapt to the dynamics of the environment, operate autonomously, and respond to a user's requests in a timely manner, either by providing an immediate view from any desired viewpoint or by analyzing and providing information from specific, user-determined areas.
(ii) Environmental monitoring: Visual sensor networks can be used for monitoring remote and inaccessible areas over a long period of time. In these applications, energy-efficient operations are particularly important in order to prolong monitoring over an extended period of time. Oftentimes the cameras are combined with other types of sensors into a heterogeneous network, such that the cameras are triggered only when an event is detected by other sensors used in the network [
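The energy benefit of letting other sensors trigger the cameras can be estimated from the camera's duty cycle. The sketch below assumes a camera that draws a constant power while active and a much lower power while sleeping; the function name, battery capacity, and power figures are hypothetical.

```python
def lifetime_hours(battery_wh, p_active_w, p_sleep_w, active_fraction):
    """Battery lifetime (h) of a camera that is active for `active_fraction`
    of the time (e.g. only when woken by another sensor) and asleep otherwise."""
    p_avg_w = active_fraction * p_active_w + (1.0 - active_fraction) * p_sleep_w
    return battery_wh / p_avg_w


# Hypothetical figures: 10 Wh battery, 1 W when active, 10 mW when asleep.
print(lifetime_hours(10.0, 1.0, 0.01, 1.0))   # camera always on
print(lifetime_hours(10.0, 1.0, 0.01, 0.01))  # camera triggered 1% of the time
```

With these placeholder numbers, a camera that is active 1% of the time lasts roughly fifty times longer than one that is always on, and the sleep power then accounts for about half of the average consumption.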
(iii) Smart homes: In some situations, such as caring for hospital patients or people with disabilities, a person must be under the constant care of others. Visual sensor networks can provide continuous monitoring of people, and, using smart algorithms, the network can report information about the person needing care, such as any unusual behavior or an emergency situation.
(iv) Smart meeting rooms: Remote participants in a meeting can enjoy a dynamic visual experience using visual and audio sensor network technology.
(v) Telepresence systems: Telepresence systems enable a remote user to “visit” some location that is monitored by a collection of cameras. For example, museums, galleries or exhibition rooms can be covered by a network of camera nodes that provide live video streams to a user who wishes to access the place remotely (e.g., over the Internet). The system is able to provide the user with any current view from any viewing point, and thus it provides the sense of being physically present at a remote location through interaction with the system's interface [
Visual sensor networks draw on several diverse research fields, including image/vision processing, communication and networking, and distributed and embedded system processing. Thus, their design involves finding the best tradeoff between performance and the different aspects of these networks. According to Hengstler and Aghajan [
Due to its interdisciplinary nature, the research directions in visual sensor networks are numerous and diverse. In the following sections we present an overview of the ongoing research in several areas vital to visual sensor networks: vision processing, wireless networking, camera node hardware architectures, sensor management, and middleware, as illustrated in Figure
Several research areas that contribute to the development of visual sensor networks.
Obtaining precise information about the cameras' locations and orientations is crucial for many vision processing algorithms in visual sensor networks. The information on a camera's location and orientation is obtained through the calibration process, where this information (presented as the camera's orientation matrix
Calibration of cameras can be done at one processing center, which collects image feature points from all cameras in the system and, based on that, it estimates the calibration parameters for the entire system. However, such a calibration method is expensive in terms of energy and is not scalable, and thus it is not suitable for energy-constrained visual sensor networks. Therefore, visual sensor networks require distributed energy-efficient algorithms for multicamera calibration.
The localization algorithms developed for wireless sensor networks cannot be used for calibrating the cameras, since they do not provide sufficient precision, nor do they provide information on the cameras' orientations. The ad hoc deployment of camera nodes and the absence of human support after deployment impose the need for autonomous camera calibration algorithms. Since there is usually no prior information about the network's vision graph (a graph that describes which cameras have overlapping FoVs), its communication graph, or the environment, finding correspondences across cameras (sets of points in one camera's image plane that correspond to points in another camera's image) is challenging and error-prone. Ideally, cameras should be able to self-calibrate based on their observations of the environment. The first step in this process involves finding sets of cameras that image the same scene points. Finding correspondences among these cameras may require excessive, energy-expensive intercamera communication. Thus, the calibration of distributed cameras is further constrained by the limited energy resources of the camera nodes. Additionally, the finite transmission ranges of the camera nodes can limit communication between them.
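To illustrate what the vision graph encodes, the sketch below connects two cameras when they share at least a threshold number of matched features. It deliberately assumes that features have already been matched to common identifiers and gathered in one place, which is precisely the expensive intercamera communication step discussed above; the function name and threshold are hypothetical.

```python
from itertools import combinations


def vision_graph(features, min_shared):
    """Undirected vision graph: an edge (i, j) is added when cameras i and j
    share at least `min_shared` matched features, taken as evidence that
    their fields of view overlap. `features` maps camera id -> set of
    feature identifiers (assumed to be already matched across cameras)."""
    edges = set()
    for i, j in combinations(sorted(features), 2):
        if len(features[i] & features[j]) >= min_shared:
            edges.add((i, j))
    return edges


feats = {"c1": {1, 2, 3, 4}, "c2": {3, 4, 5}, "c3": {9}}
print(vision_graph(feats, min_shared=2))
```

In a distributed setting, each camera would have to exchange compact feature descriptors with its neighbors to build its part of this graph, which is where the energy and range constraints enter.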
Therefore, the challenge of camera calibration in a visual sensor network is to find the cameras' precise extrinsic parameters using existing calibration procedures from computer vision, while respecting the communication constraints and energy limitations of the camera nodes. These calibration methods should cope successfully with changes in the communication graph (caused by variable channel conditions) and changes in the vision graph (due to the loss of cameras or changes in the cameras' positions and orientations).
Calibration based on a known object is a common calibration method from computer vision that has been widely adopted in visual sensor networks [
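One classical way to calibrate a camera from a known object is the Direct Linear Transform (DLT), which estimates the 3x4 projection matrix P from correspondences between known 3D points and their image projections. Each correspondence gives two linear equations in the twelve entries of P, and the least-squares solution is the right singular vector of the stacked system with the smallest singular value. This is a minimal NumPy sketch for illustration; it is not necessarily the method used in the cited works, and the function name is hypothetical.

```python
import numpy as np


def dlt_projection_matrix(X, x):
    """Estimate a 3x4 projection matrix P (up to scale) from n >= 6
    correspondences between known 3D points X (n x 3) and image points
    x (n x 2), using the Direct Linear Transform without normalization."""
    rows = []
    for (Xw, Yw, Zw), (u, v) in zip(X, x):
        Xh = [Xw, Yw, Zw, 1.0]
        # u = (p1 . Xh) / (p3 . Xh)  ->  p1 . Xh - u * (p3 . Xh) = 0; same for v with p2.
        rows.append([*Xh, 0.0, 0.0, 0.0, 0.0, *(-u * c for c in Xh)])
        rows.append([0.0, 0.0, 0.0, 0.0, *Xh, *(-v * c for c in Xh)])
    A = np.asarray(rows, dtype=float)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    P = vt[-1].reshape(3, 4)
    return P / np.linalg.norm(P)


# Synthetic check: project random points with a known camera, then recover it.
rng = np.random.default_rng(0)
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
P_true = K @ np.hstack([np.eye(3), [[0.0], [0.0], [5.0]]])
X = rng.uniform(-1.0, 1.0, size=(10, 3))
Xh = np.hstack([X, np.ones((10, 1))])
proj = Xh @ P_true.T
x = proj[:, :2] / proj[:, 2:]
P_est = dlt_projection_matrix(X, x)
reproj = Xh @ P_est.T
print(np.abs(reproj[:, :2] / reproj[:, 2:] - x).max())  # reprojection error (pixels)
```

In practice, the coordinates are normalized before building the linear system, and the DLT estimate is refined by nonlinear minimization of the reprojection error; the intrinsic and extrinsic parameters can then be recovered from P by RQ decomposition.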
Epipoles of a pair of cameras—the points where the line that connects the centers of the cameras intersects the cameras' image planes [
Thus, in [
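Since the epipole in one image is the projection of the other camera's center, it can be computed directly once the cameras' projection matrices are known. The sketch below only illustrates this definition; it assumes known 3x4 projection matrices and finite camera centers, and the helper names are hypothetical.

```python
import numpy as np


def camera_center(P):
    """Homogeneous camera center C of a 3x4 projection matrix: P @ C = 0."""
    _, _, vt = np.linalg.svd(P)
    C = vt[-1]
    return C / C[-1]  # assumes a finite camera (C[-1] != 0)


def epipole(P_this, P_other):
    """Epipole in the image of camera P_this: the projection of the other
    camera's center. Returned in homogeneous form; a third coordinate of 0
    means that the epipole lies at infinity."""
    return P_this @ camera_center(P_other)


P1 = np.hstack([np.eye(3), np.zeros((3, 1))])                 # center at the origin
P2 = np.hstack([np.eye(3), np.array([[0.0], [0.0], [-1.0]])])  # center at (0, 0, 1)
e1 = epipole(P1, P2)
print(e1[:2] / e1[2])  # inhomogeneous image coordinates of the epipole
```

For two cameras that differ only by a translation along the optical axis, as here, the epipole lies at the image origin; a purely sideways translation would instead give a third coordinate of zero, that is, an epipole at infinity.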
Funiak et al. [
Devarajan et al. [
Most algorithms for camera calibration in visual sensor networks are based on calibration methods established in computer vision, and they are rarely influenced by the underlying network. Thus, future camera calibration algorithms should explore how their outcome is affected by communication constraints and network topology. In particular, it is necessary to find out how multicamera calibration methods are affected by the networking requirements for reliable and energy-efficient intercamera communication. Such an analysis would provide insight into the tradeoffs between the desired calibration precision and the cost of achieving it.
The calibration methods should also be robust to the network's dynamics, for example, to the addition of new cameras or the loss of existing ones. Above all, calibration algorithms should be lightweight, meaning that they should not rely on extensive processing operations; instead, they should be easily implementable on the hardware platforms of existing camera nodes. Due to the ad hoc nature of visual sensor networks, future research is required to develop fully automatic camera calibration algorithms that determine precise calibration parameters with minimal or no a priori knowledge of network distances, network geometry, or corresponding feature points.
The appearance of small CMOS image sensors and the development of distributed wireless sensor networks open the door to a new era in embedded vision processing. The challenge is how to adapt existing vision processing algorithms for use in resource-constrained distributed networks of mostly low-resolution cameras. The main constraint comes from the amount of data that can be transmitted through the network. Additionally, most vision processing algorithms were developed without regard to processing limitations. Furthermore, the timing constraints of existing algorithms need to be carefully reconsidered, as the data may travel over multiple hops. Finally, many vision processing algorithms were developed for single-camera systems, so they now need to be adapted for multicamera distributed systems.
The limited processing capabilities of camera nodes dictate a need for lightweight vision processing algorithms in visual sensor networks. However, distributed processing of image data and data fusion from multiple image sources require more intelligent embedded vision algorithms. As the processing algorithms become more demanding (such as those that rely on extraction of feature points and feature matching across multiple cameras' views), the processing capabilities can become a bottleneck. Considering the hierarchical model for vision processing provided in [
The initial phase of visual data processing usually involves object detection, which may trigger a camera's processing activity and data communication. Object detection is mostly based on lightweight background subtraction algorithms and represents the first step toward collective reasoning by the camera nodes about the objects that occupy the monitored space.
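A common lightweight form of background subtraction keeps an exponential running average of past frames as the background model and marks pixels that deviate from it by more than a threshold as foreground. The NumPy sketch below assumes grayscale frames; the class name and the `alpha` and `threshold` defaults are hypothetical and would need tuning per scene.

```python
import numpy as np


class RunningAverageBackground:
    """Background model kept as an exponential running average of frames."""

    def __init__(self, first_frame, alpha=0.05, threshold=25.0):
        self.bg = first_frame.astype(np.float32)
        self.alpha = alpha          # learning rate of the background model (assumed)
        self.threshold = threshold  # intensity difference counted as foreground (assumed)

    def apply(self, frame):
        """Return a boolean foreground mask and update the background model."""
        f = frame.astype(np.float32)
        mask = np.abs(f - self.bg) > self.threshold
        bg_px = ~mask
        # Update only pixels that currently look like background, so that
        # moving objects do not corrupt the model.
        self.bg[bg_px] = (1 - self.alpha) * self.bg[bg_px] + self.alpha * f[bg_px]
        return mask


# Usage: a bright 2x2 object appears on a dark, static background.
empty = np.zeros((4, 4), dtype=np.uint8)
model = RunningAverageBackground(empty)
frame = empty.copy()
frame[1:3, 1:3] = 200
print(model.apply(frame).sum())  # number of foreground pixels
```

An object could then be declared detected when the foreground mask contains enough connected pixels. Updating the model only at background pixels keeps moving objects out of the background, but an object that stops permanently is then never absorbed, so practical systems usually add a slow update for long-lived foreground regions.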
Many applications of visual sensor networks require reasoning about the presence of objects in the scene. In occupancy reasoning, the visual sensor network is not interested in extracting an individual object's features, but instead extracting the state of the scene (such as information about the presence and quantity of objects in the monitored scene) based on light-weight image processing algorithms. An example of such occupancy reasoning in visual sensor networks is the estimation of the number of people in a crowded scene, as discussed in [
Finding the polygons that contain people based on a projection of the persons' silhouettes onto the planar scene [
Two cameras observe a person from different positions. The cameras’ cones are swept around the person’s silhouette
Polygons obtained as the intersection of planar projections of cones in the case of two objects. The visual hull is the largest volume in which an object can reside. The dark-colored polygons do not contain any objects
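The cone-intersection idea illustrated in the figures can be sketched on a discretized ground plane. Each camera contributes one planar wedge (an interval of bearings) per observed silhouette; grid points that fall inside some wedge of every camera are kept, and connected groups of such points approximate the polygons. The sketch assumes that all cameras see the whole grid and that the bearing intervals do not wrap around ±π; the data layout and function names are illustrative assumptions.

```python
import math
from collections import deque


def wedge_contains(cam, lo, hi, p):
    """True if ground point p lies in the planar cone swept from camera
    position cam with bearings in [lo, hi] radians (no wrap-around)."""
    b = math.atan2(p[1] - cam[1], p[0] - cam[0])
    return lo <= b <= hi


def occupied_regions(wedges, grid):
    """wedges: list of (camera_position, [(lo, hi), ...]), one interval per
    silhouette. Keeps grid points inside a wedge of every camera and groups
    them into 4-connected regions (candidate polygons)."""
    cells = {p for p in grid
             if all(any(wedge_contains(cam, lo, hi, p) for lo, hi in spans)
                    for cam, spans in wedges)}
    regions, seen = [], set()
    for start in cells:
        if start in seen:
            continue
        seen.add(start)
        region, queue = [], deque([start])
        while queue:  # breadth-first flood fill over neighboring grid points
            x, y = queue.popleft()
            region.append((x, y))
            for n in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if n in cells and n not in seen:
                    seen.add(n)
                    queue.append(n)
        regions.append(region)
    return regions


# One person at (5, 5), seen by cameras at (0, 5) and (5, 0).
wedges = [((0, 5), [(-0.1, 0.1)]),
          ((5, 0), [(math.pi / 2 - 0.1, math.pi / 2 + 0.1)])]
grid = [(x, y) for x in range(11) for y in range(11)]
print(occupied_regions(wedges, grid))
```

Because some polygons may be empty, as the dark-colored polygons in the figure show, the number of regions is an upper bound on the number of people rather than an exact count.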
Determining good camera-network deployments and the number of camera nodes to use is also addressed in recent work on occupancy estimation problems. For example, in [
Since detection of objects in the scene is usually the first step in image analysis, it is important to minimize the chance of object detection errors. Thus, reliability and lightweight operation will continue to be the main concerns of image processing algorithms for object detection and occupancy reasoning.
Object tracking is a common task for many applications of visual sensor networks. Object tracking is a challenging task since it is computationally intensive and it requires real-time data processing. The basic methods for target tracking include temporal differencing and template correlation matching [
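Template correlation matching can be illustrated with normalized cross-correlation (NCC), which compares the template with every candidate window after subtracting the means and dividing by the energies, so that uniform changes in brightness and contrast do not affect the score. The exhaustive search below is a minimal NumPy sketch written for clarity rather than speed; the function name is hypothetical.

```python
import numpy as np


def match_template_ncc(image, template):
    """Exhaustive template matching with normalized cross-correlation.
    Returns ((row, col) of the best window's top-left corner, NCC score)."""
    th, tw = template.shape
    t = template.astype(float) - template.mean()
    t_norm = np.sqrt((t ** 2).sum())
    best_score, best_pos = -np.inf, None
    for r in range(image.shape[0] - th + 1):
        for c in range(image.shape[1] - tw + 1):
            w = image[r:r + th, c:c + tw].astype(float)
            w = w - w.mean()
            denom = np.sqrt((w ** 2).sum()) * t_norm
            if denom == 0:
                continue  # flat window or template: correlation undefined
            score = float((w * t).sum() / denom)
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score


rng = np.random.default_rng(1)
img = rng.random((20, 20))
print(match_template_ncc(img, img[5:9, 7:11].copy()))
```

On a camera node, the search would typically be restricted to a small window around the object's last known position, which reduces the cost from scanning the whole image to checking a few candidate locations.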
The availability of multiple views in visual sensor networks improves tracking reliability, but at the price of increased communication overhead among the cameras. Therefore, in resource-constrained visual sensor networks it is important to use lightweight processing algorithms and to minimize the data load that has to be communicated among the cameras. Lau et al. [
Ko and Berry [
The success of the proposed tracking algorithms can be jeopardized when the tracked objects are occluded. Object occlusion, which happens when a camera loses sight of an object because it is obstructed by another object, is an unavoidable problem in visual sensor networks. Although the positions of moving occluders usually cannot be predicted, a multicamera system is still expected to handle the occlusion problem more easily, since it provides multiple views of the object. This problem is discussed in [
Many novel applications of visual sensor networks are based on advanced vision processing that provides a thorough analysis of the objects' appearances and behaviors, thereby giving the user a better understanding of the relationships among the objects and better situation awareness. In these applications, the objective is to achieve automated image understanding by developing efficient computational methods grounded in the fundamental issues of this field. These issues include developing methods for object recognition, classification, activity recognition, context understanding, background modeling, and scene analysis, and understanding their performance.
In such applications, a visual sensor network can be used not only to track human movements but also to interpret them in order to recognize semantically meaningful gestures. Human gesture analysis and behavior recognition have gained increasing interest in the research community and are used in a number of applications, such as surveillance, video conferencing, smart homes, and assisted living. Behavior analysis applications require collaboration among the cameras, which exchange preprocessed, high-level descriptions of the observed scene rather than raw image information. In order to reduce the amount of information exchanged between the cameras, research is directed toward finding an effective way of describing the scene and providing the semantic meaning of the extracted data (features). An example of such research is provided in [
Human behavior interpretation and gesture analysis often use explicit shape models that provide a priori knowledge of the human body in 3D. Oftentimes, these models assume a certain type of body movement, which eases the gesture interpretation problem in the case of body self-occlusion. Recent work of Aghajan and Wu [
Another approach to designing context-aware vision-based networks involves using multimodal information for the analysis and interpretation of the objects' dynamics. In addition to low-power camera nodes, such systems may contain other types of sensors, such as audio, vibration, thermal, and PIR (passive infrared) sensors. By fusing multimodal information from various nodes, such a network can provide better models for understanding an object's behavior and group interactions.
The aforementioned vision processing tasks require extracting features about an event, which can be hard or even impossible for energy- and memory-constrained camera nodes, especially in real time. Thus, although high-resolution data features are desirable, costly feature extraction should be limited. This implies the need to find optimal ways to determine when feature extraction tasks should be performed and when they should be skipped or left to other active cameras, without degrading overall performance. Also, most current work still uses a centralized approach for data acquisition and fusion. Thus, future research should be directed toward migrating the decision-making process to the sensors, and toward dynamically finding the best camera node to serve as a fusion center that combines the extracted information from all active camera nodes.
Communication protocols for “traditional” wireless sensor networks mostly focus on supporting energy efficiency in low-data-rate communications. In addition to energy efficiency, however, visual sensor networks are subject to much tighter quality of service (QoS) requirements than “traditional” wireless sensor networks. Some of the most important QoS requirements of visual sensor networks, such as low data delay and data reliability, are not primary concerns in the design of communication protocols for “traditional” wireless sensor networks. Additionally, the sensing characteristics of image sensors can also affect the design of communication protocols for visual sensor networks. For example, in [
Cameras C1 and C4 observe the same part of the scene, but are not in communication range of each other. Thus, data routing is performed over other camera nodes [
An event captured by a visual sensor network can trigger the injection of large amounts of data into the network from multiple sources. Each camera can inject variable amounts of data into the network, depending on the data processing (image processing algorithm, followed by the data compression and error correction). The end-to-end data transmissions should satisfy the delay guarantees, thus requiring stable data routes. At the same time, the choice of routing paths should be performed such that the available network resources (e.g., energy and channel bandwidth) are efficiently balanced across the network.
Besides energy efficiency and strict QoS constraints, the data communication model can be influenced by the required quality of the image data provided by the visual sensor network. For example, in [
Another important aspect in the design of communication protocols for visual sensor networks is support for camera collaboration on a specific task. Therefore, the reliable transmission of delay-constrained data obtained through the collaboration of a number of camera nodes is the main focus of networking protocols for visual sensor networks. We therefore further discuss how the requirements for reliability, latency, and collaborative processing influence the design of data communication protocols for visual sensor networks. Table
Representatives of networking protocols used in visual sensor networks.
Criteria | Protocol | Strategy |
---|---|---|
Reliability | | Combined redundant data transmission over multipath routes and error correction algorithms |
Reliability | Wu and Abouzeid [ | Multipath cluster-based data transmissions combined with error correction at each cluster head |
Reliability | Chen et al. [ | Multipath geographical routing and error correction along the routing paths |
Reliability | Maimour et al. [ | Comparison of different strategies for load repartition over multiple routing paths |
Delay | | Design of delay-sensitive MAC and routing protocols, and cross-layer approaches |
Delay: MAC protocols | DSMAC, Lin et al. [ | Adjustable sleeping periods of sensor nodes according to the traffic conditions |
Delay: MAC protocols | DMAC, Lu et al. [ | Eliminates the delays caused by sleeping nodes that are unaware of current data transmissions |
Delay: MAC protocols | Ceken [ | TDMA-based delay-aware MAC protocol that provides more time slots for time-critical nodes |
Delay: routing protocols | SPEED, He et al. [ | Transmission delay of a packet depends on the distance to the sink and the delivery speed |
Delay: routing protocols | MMSPEED, Felemban et al. [ | Multispeed transmission and establishment of more than one path to the destination |
Delay: routing protocols | Lu and Krishnamachari [ | Joint routing and delay optimization |
Delay: cross-layer approaches | Andreopoulos et al. [ | Capacity-distortion optimization based on several parameters of the routing, MAC, and physical layers |
Delay: cross-layer approaches | Van der Schaar and Turaga [ | Packetization and packet retransmission optimization |
Delay: cross-layer approaches | Wang et al. [ | Cross-layer protocol for adaptive image transmission for quality optimization of wavelet-transformed images |
Collaborative image routing | | Using spatiotemporal information from multiple correlated data sources |
Collaborative image routing | Obraczka et al. [ | Communication overhead reduction by collective reasoning based on correlated data |
Collaborative image routing | Medeiros et al. [ | Cluster-based object tracking |
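The delivery-speed idea summarized for SPEED in the table can be illustrated as follows: a packet with a deadline needs an average speed of (distance to sink) / (time remaining), and a node forwards it only to a neighbor that makes progress toward the sink fast enough. This is a simplified sketch of the concept, not the published protocol; the per-neighbor delay estimates, the deterministic choice of the fastest qualifying neighbor, and all names are assumptions.

```python
import math


def pick_next_hop(node, sink, neighbors, hop_delay_s, deadline_s):
    """Return the neighbor whose progress speed toward the sink (distance
    gained per second of estimated hop delay) is highest among those that
    meet the speed required by the deadline, or None if none qualifies.
    `hop_delay_s` maps neighbor -> estimated one-hop delay (hypothetical)."""
    required = math.dist(node, sink) / deadline_s
    best, best_speed = None, 0.0
    for nb in neighbors:
        progress = math.dist(node, sink) - math.dist(nb, sink)
        if progress <= 0:
            continue  # does not move the packet closer to the sink
        speed = progress / hop_delay_s[nb]
        if speed >= required and speed > best_speed:
            best, best_speed = nb, speed
    return best


node, sink = (0.0, 0.0), (100.0, 0.0)
nbrs = [(10.0, 0.0), (20.0, 0.0), (-5.0, 0.0)]
delay = {(10.0, 0.0): 0.05, (20.0, 0.0): 0.5, (-5.0, 0.0): 0.01}
print(pick_next_hop(node, sink, nbrs, delay, deadline_s=1.0))
```

Here the neighbor at (20, 0) makes more geographic progress but is too slow, so the nearer, faster neighbor is chosen; returning `None` signals that the deadline cannot be met through any neighbor.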
Reliable data transport is one of the main QoS requirements of visual sensor networks. In wireless sensor networks, the transport layer of the traditional protocol stack is not fully developed, since the traditional functions of this layer that provide reliable data transport, such as congestion control, are not a primary concern in low-data-rate, low-duty-cycle wireless sensor networks. However, the bursty and bulky data traffic in visual sensor networks requires mechanisms that provide reliable data communication over the unreliable channels across the network.
The standard networking protocols designed to offer reliable data transport are not suitable for visual sensor networks. The widely used transport protocol TCP cannot simply be reused in wireless networks, since it cannot distinguish between data losses caused by network congestion and those caused by poor wireless channel conditions. In wireless sensor networks, reliability is often provided through data retransmissions, which introduce delays that are intolerable for visual sensor networks. For example, protocols such as Pump Slowly Fetch Quickly (PSFQ) [
Routing data over multiple paths is often considered a way to reduce the correlation among packet losses and to spread energy consumption more evenly among the cameras. Since data retransmissions increase latency in the network, Wu and Abouzeid [
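As a rough illustration of spreading load over multiple paths, the following Python sketch splits a camera's packets across several routes in proportion to each route's bottleneck residual energy (the energy of its weakest relay). The weighting rule, the `Path` type and all energy values are hypothetical; this is not the scheme of Wu and Abouzeid or of Maimour et al., only one simple way to realize the idea.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    node_energy: list[float]  # residual energy (J) of each relay on the path (hypothetical)

    def bottleneck(self) -> float:
        # A path fails as soon as its weakest relay runs out of energy.
        return min(self.node_energy)


def split_packets(paths: list[Path], n_packets: int) -> dict[str, int]:
    """Assign n_packets to paths in proportion to bottleneck residual energy."""
    weights = [p.bottleneck() for p in paths]
    total = sum(weights)
    if total <= 0:
        raise ValueError("no path has residual energy left")
    exact = [n_packets * w / total for w in weights]
    counts = [int(x) for x in exact]
    # Largest-remainder rounding: leftover packets go to the largest fractional parts.
    leftover = n_packets - sum(counts)
    by_remainder = sorted(range(len(paths)), key=lambda i: exact[i] - counts[i], reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return {p.name: c for p, c in zip(paths, counts)}


paths = [Path("A", [5.0, 3.0, 4.0]), Path("B", [2.0, 6.0]), Path("C", [5.0, 5.0])]
print(split_packets(paths, 100))  # {'A': 30, 'B': 20, 'C': 50}
```

With bottlenecks of 3 J, 2 J and 5 J, a burst of 100 packets splits 30/20/50; the largest-remainder step keeps the total exact for small bursts as well.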
Visual sensor networks can experience significant losses of data due to network congestion. As a way to control data congestion in wireless multimedia networks, Maimour et al. [
Congestion control is a dominant problem in the design of reliable protocols for visual sensor networks. Considering that multimedia data can tolerate a certain degree of loss [
Real-time data delivery is a common requirement for many applications of visual sensor networks. Delays can arise within individual layers of the network protocol stack or from unsynchronized interaction between layers, and they can be further increased by wireless channel variability. Thus, the design of each communication layer of the protocol stack should be carefully considered in order to reduce data latency in the network.
The growing needs of delay-sensitive applications in wireless sensor networks have led to a number of energy-efficient, delay-aware MAC protocols. The main idea behind these protocols is to reduce the sleep delays of sensor nodes operating in duty cycles, and to adapt the nodes' duty cycles according to the network traffic. Since there is already a comprehensive survey on the design of MAC protocols for multimedia applications in wireless sensor networks [
The SMAC [
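The traffic-adaptive sleeping idea behind DSMAC (listed in the table above) can be sketched as a simple control rule: a node doubles its duty cycle when the measured one-hop latency exceeds a target and halves it when the channel is quiet. The thresholds, bounds and the halving rule are assumptions made for illustration; the sketch does not reproduce the exact DSMAC protocol.

```python
def adapt_duty_cycle(duty: float, avg_latency_ms: float, queue_len: int,
                     latency_target_ms: float = 100.0,
                     min_duty: float = 0.01, max_duty: float = 0.5) -> float:
    """Return the next duty cycle (fraction of time the radio is awake).

    Hypothetical DSMAC-style rule: double the duty cycle when one-hop latency
    exceeds the target, halve it when latency is well below the target and
    the transmit queue is empty. All thresholds are illustrative assumptions.
    """
    if avg_latency_ms > latency_target_ms:
        duty *= 2  # packets are waiting on sleeping neighbors: wake up more often
    elif avg_latency_ms < latency_target_ms / 2 and queue_len == 0:
        duty /= 2  # quiet channel: trade latency headroom for energy savings
    return min(max(duty, min_duty), max_duty)


duty = 0.05
for latency, queue in [(180, 4), (150, 2), (90, 0), (30, 0)]:
    duty = adapt_duty_cycle(duty, latency, queue)
    print(f"latency={latency} ms -> duty={duty:.3f}")
```

Doubling and halving react quickly to an event-driven burst of image traffic while the clamping bounds keep the node from either never sleeping or becoming unreachable.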
Finding routing strategies that enable data delivery within a certain time delay is an extremely hard problem. He et al. developed the SPEED protocol [
Such an approach is taken in [
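SPEED (see the table above) makes transmission delay a function of the distance to the sink and the delivery speed. Its core forwarding rule can be sketched as follows: for each neighbor, a node computes the relay speed, that is, the progress toward the sink divided by the measured one-hop delay, and forwards only to neighbors whose relay speed meets a set-point. Because every hop then advances at least the set-point speed, a packet starting d meters from the sink arrives in roughly d divided by the set-point. The full protocol adds randomized selection among qualifying neighbors and backpressure rerouting; this sketch simply picks the fastest neighbor, and the coordinates and delays are hypothetical.

```python
import math
from typing import Optional

Point = tuple[float, float]


def relay_speed(node: Point, neighbor: Point, sink: Point, hop_delay_s: float) -> float:
    """Progress toward the sink per second of one-hop delay (m/s)."""
    progress = math.dist(node, sink) - math.dist(neighbor, sink)
    return progress / hop_delay_s


def choose_next_hop(node: Point, neighbors: dict[str, tuple[Point, float]],
                    sink: Point, setpoint_mps: float) -> Optional[str]:
    """Pick the fastest neighbor whose relay speed meets the set-point.

    Simplification: returns None when no neighbor qualifies, where SPEED
    itself would apply backpressure rerouting.
    """
    speeds = {name: relay_speed(node, pos, sink, delay)
              for name, (pos, delay) in neighbors.items()}
    qualifying = {n: s for n, s in speeds.items() if s >= setpoint_mps}
    if not qualifying:
        return None
    return max(qualifying, key=qualifying.get)


# Hypothetical geometry: node at the origin, sink 100 m away.
neighbors = {
    "a": ((10.0, 0.0), 0.010),  # 10 m progress in 10 ms -> about 1000 m/s
    "b": ((20.0, 0.0), 0.040),  # 20 m progress in 40 ms -> about 500 m/s
    "c": ((-5.0, 0.0), 0.001),  # moves away from the sink -> negative speed
}
print(choose_next_hop((0.0, 0.0), neighbors, (100.0, 0.0), setpoint_mps=600.0))  # a
```

With a 600 m/s set-point, the 100 m trip is expected to take at most about 0.17 s; raising the set-point tightens that bound but leaves fewer qualifying neighbors.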
The data delays at different layers of the network protocol stack may be caused by various factors (channel contention, packet retransmissions, long packet queues, node failures, and network congestion). Cross-layer approaches, which consider the close interactions between different layers of the protocol stack, enable the design of frameworks that support delay-sensitive applications of visual sensor networks.
Andreopoulos et al. [
Cross-layer optimization of the protocol stack enables visual sensor networks to meet various QoS constraints of visual data transmissions, including data communication within delay bounds. This cross-layer optimization also needs to include strategies for inter-camera collaboration, which will reduce the total amount of data transmitted in the network. We discuss this problem further in the next subsection.
In current communication protocols, the camera nodes compete for network resources rather than collaborate to exploit the available resources effectively. Thus, the design of communication protocols for visual sensor networks needs to change fundamentally to support the exchange of information about the content captured by each camera node, which will help to reduce the communication of redundant data and to distribute resources equally among the camera nodes.
Collaboration-based communication should be established between cameras with overlapped FoVs that, based on the spatial-temporal correlation between their images, collectively reason about the events and thus reduce the amount of data and control overhead messages routed through the network [
Finally, supporting data priority has a large effect on the application QoS of visual sensor networks. Camera nodes that detect an event of interest should be given higher priority for data transmissions. In collaborative data processing, camera nodes should collectively decide on data priorities from cameras that provide the most relevant information regarding the captured event. Therefore, protocols that provide differentiated service to support prioritized data flows are needed and must be investigated.
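Differentiated service can start at a single node's transmit queue. The sketch below serves packets from higher-priority flows first, for example frames from a camera that has detected an event, and keeps FIFO order within each priority level. The numeric priority levels are hypothetical; how cameras agree on them collectively remains the open problem described above.

```python
import heapq
import itertools
from typing import Any


class PriorityTxQueue:
    """Transmit queue: higher-priority flows first, FIFO within a priority level."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, Any]] = []
        self._seq = itertools.count()  # tie-breaker preserves arrival order

    def push(self, packet: Any, priority: int) -> None:
        # heapq is a min-heap, so store -priority to pop the highest priority first.
        heapq.heappush(self._heap, (-priority, next(self._seq), packet))

    def pop(self) -> Any:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


# Hypothetical levels: 2 = camera that detected an event, 1 = neighbor confirming it, 0 = background.
q = PriorityTxQueue()
for pkt, prio in [("bg1", 0), ("evt1", 2), ("bg2", 0), ("evt2", 2), ("mid", 1)]:
    q.push(pkt, prio)
print([q.pop() for _ in range(len(q))])  # ['evt1', 'evt2', 'mid', 'bg1', 'bg2']
```

The sequence counter matters: without it, two packets with equal priority would be compared by payload, which could reorder them or fail for non-comparable packet objects.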
In redundantly deployed visual sensor networks a subset of cameras can perform continuous monitoring and provide information with a desired quality. This subset of active cameras can be changed over time, which enables balancing of the cameras' energy consumption, while spreading the monitoring task among the cameras. In such a scenario the decision about the camera nodes' activity and the duration of their activity is based on sensor management policies. Sensor management policies define the selection and scheduling (that determines the activity duration) of the camera nodes' activity in such a way that the visual information from selected cameras satisfies the application-specified requirements while the use of camera resources is minimized. Various quality metrics are used in the evaluation of sensor management policies, such as the energy-efficiency of the selection method or the quality of the gathered image data from the selected cameras. In addition, camera management policies are directed by the application; for example, target tracking usually requires selection of cameras that cover only a part of the scene that contains the non-occluded object, while monitoring of large areas requires the selection of cameras with the largest combined FoV.
While energy-efficient organization of camera nodes is oftentimes addressed by camera management policies, the quality of the data produced by the network is the main concern of the application. Table
Comparison of sensor management policies.
Sensor management policy | QoS criterion: energy efficiency | QoS criterion: bandwidth allocation | Application: large scene monitoring | Application: object tracking | Goal of sensor management metric |
---|---|---|---|---|---|
Dagher et al. [ | Yes | No | Yes | No | Battery lifetime optimization |
Park et al. [ | No | No | Yes | No | Quality of view for every 3D point |
Soro and Heinzelman [ | Yes | No | Yes | No | Exploring trade-offs between the image quality of reconstructed views and energy efficiency |
Zamora and Marculescu [ | Yes | No | No | Yes | Coordinated wake-up policies for energy conservation |
Yang and Nahrstedt [ | No | Yes | No | Yes | Several sensor selection policies (random, event-based, view-based, priority-based) that consider bandwidth constraints |
Pahalawatta et al. [ | Yes | No | No | Yes | Maximizing the sum of information utility provided by the active sensors, subject to the average energy that can be used by the network |
Ercan et al. [ | No | No | No | Yes | Avoidance of object occlusions |
Monitoring of large areas (such as parking lots, public areas, large stores, etc.) requires complete coverage of the area at every point in time. Such an application is analyzed in [
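One common heuristic for picking cameras that jointly cover an area is greedy weighted set cover. In the sketch below the scene is discretized into cells, each camera's FoV is a hypothetical set of cell IDs, and each camera has an assumed activation cost (for example, an energy estimate). The greedy choice is a generic approximation, not the method of any work in the table above.

```python
def select_cameras(coverage: dict[str, set[int]], cost: dict[str, float],
                   required: set[int]) -> list[str]:
    """Greedy weighted set cover: repeatedly pick the camera with the most
    newly covered cells per unit cost until `required` is covered.
    Costs are assumed to be positive."""
    uncovered = set(required)
    chosen: list[str] = []
    while uncovered:
        candidates = (c for c in coverage if c not in chosen)
        best = max(candidates,
                   key=lambda c: len(coverage[c] & uncovered) / cost[c],
                   default=None)
        if best is None or not (coverage[best] & uncovered):
            raise ValueError("the required cells cannot all be covered")
        chosen.append(best)
        uncovered -= coverage[best]
    return chosen


# Hypothetical FoVs (cell IDs) and activation costs.
coverage = {"c1": {1, 2, 3, 4}, "c2": {3, 4, 5}, "c3": {5, 6}, "c4": {1, 2, 3, 4, 5, 6}}
cost = {"c1": 1.0, "c2": 1.0, "c3": 1.0, "c4": 3.0}
print(select_cameras(coverage, cost, set(range(1, 7))))  # ['c1', 'c3']
```

Here the cheap pair c1 and c3 (total cost 2) is preferred over the single wide-FoV camera c4 (cost 3). Rotating which cameras are eligible over time is one way to balance energy consumption across the network.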
Oftentimes the quality of a reconstructed view from a set of selected cameras is used as a criterion for the evaluation of camera selection policies. Park et al. [
A similar problem of finding the best camera candidates is investigated in [
In order to reduce the energy consumption of cameras, Zamora and Marculescu [
Selection of the best cameras for target tracking has been discussed often [
Although a large volume of data is transmitted in visual sensor networks, none of the aforementioned works consider channel bandwidth utilization. This problem is investigated in [
In visual sensor networks, sensor management policies are needed to balance the often conflicting requirements imposed by the wireless networking and vision processing tasks. While reducing energy consumption by limiting data transmissions is the primary challenge of energy-constrained visual sensor networks, the quality of the image data and the application QoS improve as the network provides more data. In such an environment, the optimization methods for sensor management developed for wireless sensor networks are often hard to apply directly to visual sensor networks. Such sensor management policies usually consider neither the event-driven nature of visual sensor networks nor the unpredictability of data traffic caused by event detection.
Thus, more research is needed to further explore sensor management for visual sensor networks. Since sensor management policies depend on the underlying networking policies and vision processing, future research lies in finding the best trade-offs between these two aspects of visual sensor networks. Additional work is needed to compare the performance of different camera scheduling policies, including asynchronous policies (where every camera follows its own on-off schedule) and synchronous policies (where cameras are divided into different sets, so that at each moment one set of cameras is active). From an application perspective, it would be interesting to explore sensor management policies that support multiple applications running on a single visual sensor network.
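The two scheduling families can be contrasted in a few lines. In the synchronous variant the cameras are partitioned into sets that take turns round-robin; in the asynchronous variant each camera follows its own (period, on-slots, offset) schedule. The slot granularity and all schedule parameters below are hypothetical.

```python
def synchronous_active(camera_sets: list[list[str]], slot: int) -> list[str]:
    """Synchronous policy: disjoint camera sets take turns, one set per time slot."""
    return camera_sets[slot % len(camera_sets)]


def asynchronous_active(schedules: dict[str, tuple[int, int, int]], slot: int) -> list[str]:
    """Asynchronous policy: each camera has its own (period, on_slots, offset) schedule."""
    return [
        cam
        for cam, (period, on_slots, offset) in schedules.items()
        if (slot - offset) % period < on_slots
    ]


# Hypothetical schedules for illustration only.
sets = [["cam1", "cam2"], ["cam3", "cam4"]]
schedules = {"a": (4, 2, 0), "b": (4, 1, 1), "c": (3, 1, 0)}
for t in range(4):
    print(t, synchronous_active(sets, t), asynchronous_active(schedules, t))
```

With these asynchronous schedules, slot 2 has no active camera at all. This illustrates why asynchronous policies need explicit coverage analysis, whereas synchronous set rotation guarantees coverage whenever each set is chosen to cover the area on its own.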
A typical wireless sensor node has an 8/16-bit microcontroller and limited memory, and it uses short active periods during which it processes and communicates collected data. Limiting a node's “idle” periods (long periods during which a node listens to the channel) and avoiding power-hungry transmissions of huge amounts of data keep the node's energy consumption sufficiently small, so that it can operate for months or even years. It is desirable to keep the same low-power features in the design of camera nodes, although in this case more energy will be needed for data capture, processing, and transmission. Here, we provide an overview of works that analyze energy consumption in visual sensor networks, as well as an overview of current visual sensor node hardware architectures and testbeds.
The lifetime of a battery-operated camera node is limited by its energy consumption, which is determined by the hardware and working mode of the camera node. In order to collect data about energy consumption and to verify camera node designs, a number of camera node prototypes have been recently built and tested. Energy consumption has been analyzed on camera node prototypes built using a wide range of imagers, starting from very low-power, low-resolution camera nodes [
An estimation of the camera node's lifetime can be done based on its power consumption in different tasks, such as image capture, processing, and transmission. Such an analysis is provided in [
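The lifetime estimate reduces to simple arithmetic: average power is the time-weighted sum of the power drawn in each task, and lifetime is the stored battery energy divided by that average. All power figures, time fractions and the battery rating below are hypothetical, and the model ignores battery self-discharge and voltage droop.

```python
def average_power_mw(task_power_mw: dict[str, float], time_fraction: dict[str, float]) -> float:
    """Time-weighted average power; sleep must be included as a task."""
    if abs(sum(time_fraction.values()) - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1 (include sleep as a task)")
    return sum(task_power_mw[task] * frac for task, frac in time_fraction.items())


def lifetime_days(capacity_mah: float, voltage_v: float, avg_power_mw: float) -> float:
    energy_mwh = capacity_mah * voltage_v  # mAh x V = mWh
    return energy_mwh / avg_power_mw / 24.0


# Hypothetical numbers, not measurements of any cited prototype.
power = {"capture": 100.0, "process": 50.0, "transmit": 60.0, "sleep": 0.1}
fraction = {"capture": 0.01, "process": 0.02, "transmit": 0.02, "sleep": 0.95}
p_avg = average_power_mw(power, fraction)
print(f"{p_avg:.3f} mW, {lifetime_days(2500, 3.0, p_avg):.1f} days")
```

With these assumed numbers the node averages about 3.3 mW and would last roughly 95 days on 2500 mAh at 3 V. The 95% of time spent asleep contributes under 3% of the consumption, which is why shortening the active periods matters more than lowering sleep power.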
In [
Considering the fact that data transmission is the most expensive operation in terms of energy, Ferrigno et al. [
Analysis of the energy consumption of a camera node when performing different tasks [
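Because transmission dominates the energy budget, on-node compression pays off only if the energy spent compressing is smaller than the radio energy it saves. The sketch expresses this break-even test per bit; the per-bit energies and compression ratios are hypothetical placeholders, not measurements from Ferrigno et al. or any other cited prototype.

```python
def energy_budget_mj(raw_bits: int, ratio: float, e_tx_nj: float,
                     e_proc_nj: float) -> tuple[float, float]:
    """Return (send raw, compress then send) energy in millijoules.

    ratio     -- compressed size / raw size (0 < ratio <= 1)
    e_tx_nj   -- radio energy per transmitted bit in nJ (hypothetical)
    e_proc_nj -- compression energy per *input* bit in nJ (hypothetical)
    """
    nj_to_mj = 1e-6
    e_raw = raw_bits * e_tx_nj * nj_to_mj
    e_comp = raw_bits * (e_proc_nj + ratio * e_tx_nj) * nj_to_mj
    return e_raw, e_comp


def should_compress(ratio: float, e_tx_nj: float, e_proc_nj: float) -> bool:
    # Break-even test, independent of image size: e_proc < (1 - ratio) * e_tx.
    return e_proc_nj < (1.0 - ratio) * e_tx_nj


qcif_gray_bits = 176 * 144 * 8  # one 8-bit grayscale QCIF frame
print(energy_budget_mj(qcif_gray_bits, 0.1, 200.0, 20.0))
print(should_compress(0.1, 200.0, 20.0), should_compress(0.95, 200.0, 20.0))  # True False
```

Since the image size cancels out of the comparison, the decision depends only on how much the encoder shrinks the data relative to the ratio of processing to radio energy per bit.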
Today, CMOS image sensors are commonly used in many devices, such as cell phones and PDAs. We can expect widespread use of image sensors in wireless sensor networks only if such networks still preserve the low power consumption profile. Because of energy and bandwidth constraints, low-resolution image sensors are actually preferable in many applications of visual sensor networks. Table
Comparison of different visual sensor node architectures.
Camera node architecture | Processing unit | Memory | Image sensor | RF transceiver |
---|---|---|---|---|
MeshEye [ | Atmel ARM7TDMI Thumb (32-bit RISC), 55 MHz | 64 KB SRAM and 256 KB Flash; external MMC/SD Flash | Two kilopixel imagers Agilent Technologies ADNS 3060 | Chipcon CC2420 IEEE 802.15.4 |
Cyclops [ | Atmel ATmega128L and CPLD—Xilinx XC2C256 CoolRunner | 512 KB Flash 64 KB SRAM | ADCM-1700 Agilent Technology | IEEE 802.15.4 compliant (MICA2 Mote [ |
SIMD (single-instruction-multiple-data)-based architecture [ | Philips IC3D Xetal (for low-level image processing), 8051 MCU (local host for high-level image processing and control) | 1792 B RAM and 64 KB Flash internal to the 8051 MCU; 128 KB dual-port RAM (shared by both processors) | VGA image sensor (one or two) | Aquis Grain ZigBee module based on Chipcon CC2420 |
CMUCam3 [ | ARM7TDMI (32-bit) 60 MHz | 64 KB RAM and 128 KB Flash on MCU; 1 MB AL4V8M440 FIFO frame buffer; Flash (MMC) | Omnivision OV6620 | IEEE 802.15.4 compliant (Telos mote) |
Compared with processors used for wireless sensor nodes, the processing units used in visual sensor node architectures are usually more powerful, with 32-bit architectures and higher processing speed that enables faster data processing. In some architectures [
It is evident that all camera node prototypes shown in Table
Testbed implementations of visual sensor networks are an important final step in evaluating processing algorithms and communication protocols. Several architectures for visual sensor networks can be found in the literature.
Among the first reported video-based sensor network systems is Panoptes [
In [
Researchers from Carnegie Mellon University present a framework for a distributed network of vision-enabled sensor nodes called FireFly Mosaic [
Topology of the visual sensor network used for testing the FireFly system [
The communication and collaboration of camera nodes is scheduled using a collision free, energy-efficient TDMA-based link layer protocol called RT-Link [
Connectivity graph and camera network graph of the FireFly system [
Connectivity graph of the camera nodes from the previous figure. Marked links correspond to the camera network graph
Camera network graph—adjacent links between the cameras indicate that cameras have overlapped FoVs. The dotted lines correspond to the case when the cameras have overlapped views, but cannot communicate directly. The communication schedule must provide message forwarding between these cameras
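A collision-free TDMA schedule, such as the one required for the FireFly network, must place nodes that could interfere with each other in different slots. A generic way to build such a schedule is greedy coloring of the two-hop conflict graph, sketched below. This is an illustrative assumption: it is not the RT-Link scheduling algorithm, it assumes symmetric links, and it ignores the extra forwarding slots needed between cameras that share a view but are not direct neighbors.

```python
def two_hop_conflicts(links: dict[str, set[str]]) -> dict[str, set[str]]:
    """Nodes within two hops may not share a slot (avoids hidden-terminal collisions)."""
    conflicts: dict[str, set[str]] = {}
    for node, nbrs in links.items():
        reach = set(nbrs)
        for m in nbrs:
            reach |= links[m]
        reach.discard(node)
        conflicts[node] = reach
    return conflicts


def assign_slots(links: dict[str, set[str]]) -> dict[str, int]:
    """Greedy coloring: most-constrained nodes first, each takes the lowest free slot."""
    conflicts = two_hop_conflicts(links)
    slots: dict[str, int] = {}
    for node in sorted(conflicts, key=lambda n: (-len(conflicts[n]), n)):
        taken = {slots[m] for m in conflicts[node] if m in slots}
        slot = 0
        while slot in taken:
            slot += 1
        slots[node] = slot
    return slots


# Hypothetical five-node chain A-B-C-D-E with symmetric radio links.
chain = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C", "E"}, "E": {"D"}}
print(assign_slots(chain))
```

For the chain, three slots suffice: nodes three hops apart (A and D, B and E) can reuse a slot, which shortens the TDMA frame and therefore the worst-case per-hop latency.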
The size of the transmitted images at a given resolution is controlled by the quality parameter of the JPEG standard, which is used for image compression. The authors noticed that JPEG processing time does not vary significantly with the image quality level, but it changes with image resolution, mostly due to the large I/O transfer time between the camera and the CPU. The authors also measured the sensitivity of the system's tracking performance with respect to time jitter added to the cameras' image capture times.
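The JPEG quality parameter typically works by scaling the quantization tables. The sketch follows the scaling convention of the IJG libjpeg library (an assumption, since the encoder used in the FireFly system is not specified here), applied to the first row of the example luminance table in Annex K of the JPEG standard. Quality changes only the quantizer step sizes, while the number of 8×8 blocks to transform is fixed by the resolution, which is consistent with processing time tracking resolution rather than quality.

```python
# First row of the example luminance quantization table (Annex K of ITU-T T.81 / JPEG).
LUMA_ROW0 = [16, 11, 10, 16, 24, 40, 51, 61]


def ijg_scale_factor(quality: int) -> int:
    """Map a 1..100 quality setting to a percentage scale (IJG libjpeg convention)."""
    quality = min(max(quality, 1), 100)
    return 5000 // quality if quality < 50 else 200 - 2 * quality


def scaled_table(base: list[int], quality: int) -> list[int]:
    """Scale quantizer steps; entries are clamped to 1..255 as for baseline JPEG."""
    scale = ijg_scale_factor(quality)
    # Larger steps zero out more DCT coefficients, so the entropy-coded file shrinks.
    return [min(max((q * scale + 50) // 100, 1), 255) for q in base]


for quality in (10, 50, 75, 100):
    print(quality, scaled_table(LUMA_ROW0, quality))
```

Quality 50 reproduces the base table, lower settings enlarge the steps (coarser quantization, smaller images), and quality 100 collapses every step to 1.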
The increasing number of hardware and software platforms for smart camera nodes has created the problem of how to network these heterogeneous devices and how to easily build applications that use them. The integration of camera nodes into a distributed and collaborative network benefits from a well-defined middleware that abstracts the physical devices into a logical model, providing a set of services defined through standardized APIs that are portable across different platforms. In wireless sensor networks, middleware provides abstractions for the networking and communication services, and the main challenges are associated with providing abstraction support, data fusion, and managing the limited resources [
In the case of visual sensor networks, the development of middleware support is additionally challenged by the need for high-level software for supporting complex and distributed vision processing tasks. In [
In [
In the future, it is expected that the number of cameras in smart surveillance applications will scale to hundreds or even thousands—in this situation, the middleware will have a crucial role in scaling the network and in integrating the different software components into one automated vision system. In these systems, the middleware should address the system's real-time requirements, together with the other resource (energy and bandwidth) constraints.
Extensive research has been done in the many directions that contribute to visual sensor networks. However, the real potential of these networks can be reached through a cross-disciplinary research approach that considers all the various aspects of visual sensor networks: vision processing, networking, sensor management, and hardware design.
However, in much of the existing work there is no coherence between the different aspects of visual sensor networks. For example, networking protocols used in visual sensor networks are mainly adapted from the routing protocols used in traditional wireless sensor networks, and thus do not provide sufficient support for the data-hungry, time-constrained, collaborative communication of visual sensor networks. Similarly, embedded vision processing algorithms used in visual sensor networks are based on existing computer vision algorithms, and thus they rarely consider the constraints imposed by the underlying wireless network.
Thus, future efforts should be directed toward minimizing the amount of data that has to be communicated, by finding ways to describe captured events with the least amount of data. Additionally, the processing should be lightweight; information-rich descriptors of objects and scenes are not an option. Hence, the choice of the “right” feature set, as well as support for real-time communication, will play a major role in successful task operation.
In order to keep communication between cameras minimal, the cameras need to have the ability to estimate whether the information they provide contributes to the monitoring task. In a postevent detection phase, sensor management policies should decide, based on known information from the cameras and the network status, whether more cameras need to be included in the monitoring. In addition, data exchanged between camera nodes should be aggregated in-network at one of the camera nodes, and the decision about the most suitable data fusion center should be dynamic, considering the best view and the communication/fusion cost. However, considering the oftentimes arbitrary deployment of camera nodes, where the cameras' positions and orientations are not known, the problem is to find the best ways to combine these arbitrary views in order to obtain useful information.
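A dynamic fusion-center choice can be phrased as minimizing a combined cost. In the sketch, each candidate camera is scored by the cost of collecting data from the other participating cameras minus a weighted bonus for its own view quality. The cost model, the weight `alpha` and the example numbers are hypothetical.

```python
def pick_fusion_center(participants: list[str], view_score: dict[str, float],
                       link_cost: dict[frozenset, float], alpha: float = 1.0) -> str:
    """Choose the camera that minimizes gathering cost minus alpha * view quality."""

    def combined_cost(center: str) -> float:
        # Cost of gathering every other participant's data at `center` (symmetric links).
        gather = sum(link_cost[frozenset((p, center))] for p in participants if p != center)
        # Hypothetical trade-off: a better view at the fusion center lowers its cost.
        return gather - alpha * view_score[center]

    return min(participants, key=combined_cost)


# Hypothetical communication costs and view-quality scores.
link_cost = {frozenset(("a", "b")): 2.0, frozenset(("a", "c")): 1.0, frozenset(("b", "c")): 4.0}
view = {"a": 5.0, "b": 9.0, "c": 2.0}
print(pick_fusion_center(["a", "b", "c"], view, link_cost, alpha=1.0))  # b
print(pick_fusion_center(["a", "b", "c"], view, link_cost, alpha=0.0))  # a
```

With `alpha = 1` camera b wins because of its better view; with `alpha = 0` only communication matters and camera a, the best-connected node, is chosen. Re-running the choice as views and link costs change makes the fusion center dynamic.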
In the current literature, distributed source coding (DSC) has been extensively investigated as a way to reduce the amount of transmitted data in wireless sensor networks. In DSC, each data source encodes its data independently, without communicating with the other data sources, while joint decoding is performed at the base station. This model, where sensor nodes have simple encoders and the complexity is shifted to the receiver, fits the needs of visual sensor networks well. However, many issues have to be resolved before DSC can be practical for visual sensor networks. For example, it is extremely hard to define the correlation structure between different images, especially when the network topology is unknown or no network training phase is available. Also, DSC requires tight synchronization between packets sent from correlated sources. Since DSC should be implemented in the upper layers of the network stack, it affects all the other layers below [
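A toy binary example shows the DSC principle of simple encoders and a joint decoder. Assume, hypothetically, that two cameras produce 7-bit blocks x and y that differ in at most one bit. The encoder of x sends only the 3-bit syndrome of a Hamming(7,4) code, without ever seeing y; the decoder combines this syndrome with the side information y to recover x exactly. Real image DSC needs far richer correlation models, which is precisely the difficulty noted above.

```python
from functools import reduce


def syndrome(bits: list[int]) -> int:
    """3-bit syndrome under the Hamming(7,4) parity-check matrix whose
    column i (1-based) is the binary representation of i."""
    if len(bits) != 7:
        raise ValueError("expected a 7-bit block")
    # H @ bits over GF(2) equals the XOR of the (1-based) positions of the set bits.
    return reduce(lambda acc, i: acc ^ (i + 1), (i for i, b in enumerate(bits) if b), 0)


def dsc_encode(x: list[int]) -> int:
    # The encoder never sees the other camera's data: it sends 3 bits instead of 7.
    return syndrome(x)


def dsc_decode(s_x: int, y: list[int]) -> list[int]:
    """Recover x from its syndrome and the side information y,
    assuming (hypothetically) that x and y differ in at most one bit."""
    pos = s_x ^ syndrome(y)  # 1-based position of the differing bit, 0 if none
    x = list(y)
    if pos:
        x[pos - 1] ^= 1
    return x


x = [1, 0, 1, 1, 0, 0, 1]  # block from camera 1
y = [1, 0, 1, 0, 0, 0, 1]  # correlated block from camera 2, one bit differs
s = dsc_encode(x)
print(s, dsc_decode(s, y) == x)  # 1 True
```

If the correlation assumption is violated (two or more differing bits), the decoder silently returns a wrong block, which mirrors why knowing the correlation structure between camera images is essential.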
From the communication perspective, novel protocols need to be developed that support bursty and collaborative in-network communication. Supporting time-constrained and reliable communication is a problem at the forefront of protocol development for visual sensor networks. In order to support collaborative processing, it is expected that some cameras will act as fusion centers by collecting and processing raw data from several cameras. Having several fusion centers can affect the data latency throughout the network as well as the amount of postfusion data. Thus, further research should explore the trade-offs between the ways to combine (fuse) data from multiple sources and the latency introduced by these operations.
Furthermore, in order to preserve network scalability and to cope with time-constrained communication, there is a need for developing time-aware sensor management policies that will favor utilization of those cameras that can send data over multihop shortest delay routes. Such communication should support priority differentiation between different data flows, which can be determined based on vision information and acceptable delays for the particular data.
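Time-aware selection can be sketched with a shortest-delay computation: run Dijkstra's algorithm from the sink over per-link delay estimates, then keep only the cameras whose best route meets a deadline, ordered by delay. The topology, delay values and deadline below are hypothetical; in a real network the link delays would come from MAC-layer measurements and change over time.

```python
import heapq


def build_links(edges: list[tuple[str, str, float]]) -> dict[str, dict[str, float]]:
    """Symmetric adjacency map from (node, node, delay_ms) triples."""
    links: dict[str, dict[str, float]] = {}
    for a, b, delay in edges:
        links.setdefault(a, {})[b] = delay
        links.setdefault(b, {})[a] = delay
    return links


def delay_to_sink(links: dict[str, dict[str, float]], sink: str) -> dict[str, float]:
    """Dijkstra's algorithm from the sink; returns the smallest route delay per node."""
    dist = {sink: 0.0}
    heap = [(0.0, sink)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in links.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def cameras_within_deadline(cameras: list[str], links: dict[str, dict[str, float]],
                            sink: str, deadline_ms: float) -> list[str]:
    dist = delay_to_sink(links, sink)
    eligible = [c for c in cameras if dist.get(c, float("inf")) <= deadline_ms]
    return sorted(eligible, key=lambda c: dist[c])


# Hypothetical topology with per-link delays in milliseconds.
links = build_links([("sink", "r1", 10.0), ("r1", "c1", 5.0), ("r1", "c2", 20.0),
                     ("sink", "c3", 40.0), ("c2", "c3", 5.0)])
print(cameras_within_deadline(["c1", "c2", "c3"], links, "sink", 32.0))  # ['c1', 'c2']
```

Camera c3 is excluded even though it has a direct link to the sink: its best route (35 ms, via c2 and r1) still misses the 32 ms deadline.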
In the future we can expect to see various applications based on multimedia wireless networks, where camera nodes will be integrated with other types of sensors, such as audio sensors, PIRs, vibration sensors, light sensors, and so forth. By utilizing these very low-cost and low-power sensors, the lifetime of the camera nodes can be significantly prolonged. However, many open problems appear in such multimedia networks. The first issue is network deployment: it is necessary to determine the network architecture and the number of sensors of each type that should be used in a particular application, so that all of the sensors are optimally utilized while the cost of the network is kept low. Such multimedia networks usually employ a hierarchical architecture, where ultra-low-power sensors (such as microphones, PIRs, vibration, or light sensors) continuously monitor the environment over long periods of time, while higher-level sensors, such as cameras, sleep most of the time. When the lower-level sensors register an event, they notify the higher-level sensors about it. Such a hierarchical model (as seen in [
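The hierarchical wake-up model can be sketched as an event handler: a low-power sensor reports a trigger in some zone, and only the sleeping cameras whose FoV covers that zone are woken. The zone names and the camera-to-zone mapping are hypothetical, and a real node would also need a timeout to return cameras to sleep.

```python
from dataclasses import dataclass


@dataclass
class Camera:
    name: str
    covers: set[str]     # zones inside this camera's FoV (hypothetical mapping)
    awake: bool = False  # cameras sleep until a low-power sensor triggers them


def on_low_power_event(zone: str, cameras: list[Camera]) -> list[str]:
    """A PIR, acoustic, or vibration trigger in `zone` wakes only the cameras that can see it."""
    woken = []
    for cam in cameras:
        if zone in cam.covers and not cam.awake:
            cam.awake = True
            woken.append(cam.name)
    return woken


cams = [Camera("A", {"z1", "z2"}), Camera("B", {"z2"}), Camera("C", {"z3"})]
print(on_low_power_event("z2", cams))  # ['A', 'B']
print(on_low_power_event("z2", cams))  # [] (already awake)
```

Waking only the cameras that cover the triggered zone keeps the rest of the camera tier asleep, which is where the lifetime gain of the hierarchical model comes from.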
The growing trend of deploying an increasing number of smart sensors in people's everyday lives poses several privacy issues. We have not discussed this problem in this paper, but it is clearly a source of concern for many people who could benefit from visual sensor networks, as information about their private lives can be accessed through the network. The main problem is that the network can collect much more information, including private information, than it needs in order to perform its tasks. As pointed out in [
Based on the work reviewed in this paper, we notice that current research trends in visual sensor networks are divided into two directions. The first direction leads toward the development of visual sensor networks where cameras have large processing capabilities, which makes them suitable for use in a number of high-level reasoning applications. Research in this area is directed toward exploring ways to implement existing vision processing algorithms on embedded processors. Often, the networking and sensor management aspects are not considered in this approach. The second direction in visual sensor network research is motivated by existing research in wireless sensor networks. It is directed toward methods that enable the network to provide small amounts of data from camera nodes that are constrained by resource limitations, such as remaining energy and available bandwidth. Such visual sensor networks are therefore designed to provide data from the network of cameras over long periods of time.
We believe that in the future these two directions will converge. Currently, visual sensor networks are limited by their commercial off-the-shelf (COTS) hardware components, which are not fully optimized for embedded vision processing applications. Future development of faster, low-power processing architectures and ultra-low-power image sensors will open the door to a new generation of visual sensor networks with better processing capabilities and lower energy consumption. However, the main efforts in current visual sensor network research should be directed toward integrating vision processing tasks and networking requirements. Thus, future research on visual sensor networks should explore the following interdisciplinary problems. How should vision processing tasks depend on the underlying network conditions, such as limited bandwidth, limited (and potentially time-varying) connectivity between camera nodes, or data loss due to varying channel conditions? How should the design of network communication protocols be influenced by the vision tasks? For example, how should priorities be assigned to data flows in order to dynamically find the lowest-delay route or the best fusion center? How should camera nodes be managed, considering the limited network resources as well as both the vision processing and networking tasks, in order to achieve application-specific QoS requirements, such as those related to the quality of the collected visual data or coverage of the monitored area?
In the end, widespread use of visual sensor networks depends on the programming complexity of the system, which includes the implementation of both vision processing algorithms and networking protocols. Therefore, we believe that middleware for visual sensor networks will play a major role in making these networks widely accepted in a number of applications. We envision that future visual sensor networks will consist of hundreds or even thousands of camera nodes (as well as other types of sensor nodes) scattered throughout an area. The scalability and integration of various vision and networking tasks for such large camera networks should be addressed by future distributed middleware architectures. Middleware should provide an abstraction of the underlying vision processing, networking, and shared services, where shared services are those used by both the vision processing and networking tasks, such as synchronization, localization, or neighborhood discovery. By providing a number of APIs, the middleware will enable easy programming at the application layer and the use of different hardware platforms in one visual sensor network.
Transmission of multimedia content over wireless and wired networks is a well-established research area. However, the focus of this paper is to survey a new type of wireless network, visual sensor networks, and to point out the unique characteristics and constraints that differentiate visual sensor networks from other types of multimedia networks. We present an overview of existing work in several research areas that support visual sensor networks. In the coming era of low-power distributed computing, visual sensor networks will continue to challenge the research community because of their complex application requirements and tight resource constraints. We discussed many problems encountered in visual sensor network research caused by the strict resource constraints, including embedded vision processing, data communication, camera management issues, and the development of effective visual sensor network testbeds. However, the potential of visual sensor networks to provide a comprehensive understanding of the environment, and their ability to provide visual information from inaccessible areas, will make them indispensable in the coming years.
Many problems still need to be addressed through future research. We discussed some of the open issues not only in the different subfields of visual sensor networks, but, more importantly, in the integration of these areas. Real breakthroughs in visual sensor networks will occur only through a comprehensive solution that considers the vision, networking, management, and hardware issues in concert.
This work was supported by the National Science Foundation under Grant #ECS-0428157.