A Stream Processing System for Multisource Heterogeneous Sensor Data

With the rapid development of the Internet of Things (IoT), a wide variety of sensor data is generated in everyday life. Processing streaming sensor data in the IoT has become a hot research topic, and it is the theme of this paper. Our study provides guidance on the practical aspects of the IoT. Such guidance is rarely offered in current research, which has focused more on theory and less on how to set up a practical system. In our study, we employ several open source projects to establish a distributed real time system for processing streaming IoT data. We address two urgent issues: (1) the integration of multisource heterogeneous sensor data and (2) the processing of streaming sensor data in real time with low latency. For system validation, we test the system with field data from environmental monitoring sensors deployed in an indoor environment. The results show that the proposed system is valid and efficient for multisource heterogeneous sensor data integration and real time streaming data processing.


Introduction
With the rapid development of networks, IoT technologies, and web related technologies in recent years, a large number of data service applications have been developed. These systems run in independent, distributed, and heterogeneous environments, either on the web or offline. Traditionally, the different departments of an enterprise or government maintain independent information systems, which usually store information about personnel, products, or other items to ease the operation of the department. However, owing to a lack of comprehensive design and the inability to foresee future development, it is often hard to share data among systems belonging to different departments. The same problem is common in scientific research, especially in interdisciplinary studies that require interoperability among independent scientific data management systems.
In contrast to these conventional database systems, a large number of IoT applications have emerged to collect, unify, share, and publish sensory data. Mobile Crowd Sensing (MCS) [1] is a typical example among recent research fields; it often requires acquiring, integrating, and further processing data from wireless sensor networks, traditional databases, configuration files, online data services, and other potentially heterogeneous sources.
The integration of heterogeneous data is not a new problem. Besides the semantic web approach, which is dedicated to data representation, considerable attention has been paid to methodologies for querying heterogeneous data sources [2]. A common methodology is Wrapper-Mediator [3]. The mediator handles the user's request and provides a unified data access view, and one wrapper is associated with each data source. The mediator translates the user's query into queries for the individual wrappers, and the wrappers interact with the concrete data sources to mask their heterogeneity. In our study, Global Sensor Network (GSN) is the IoT middleware that realizes this methodology. GSN abstracts the mediator concept as a virtual sensor, which serves as an integrated data view that users can query with uniform SQL statements. The wrapper abstraction inside GSN handles the actual complexity and heterogeneity of the data sources.
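The Wrapper-Mediator methodology can be illustrated with a minimal Python sketch. It is not GSN code: the class names (CsvWrapper, DictWrapper, Mediator), the field names, and the sample values are all hypothetical and serve only to show how wrappers mask source formats while the mediator exposes one unified view.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]  # unified record format exposed by the mediator


class CsvWrapper:
    """Hypothetical wrapper for a source that yields 'node_id,temperature' lines."""

    def __init__(self, lines: Iterable[str]) -> None:
        self._lines = list(lines)

    def fetch(self) -> List[Row]:
        rows = []
        for line in self._lines:
            node_id, temperature = line.split(",")
            rows.append({"node_id": int(node_id), "temperature": float(temperature)})
        return rows


class DictWrapper:
    """Hypothetical wrapper for a source that yields dicts with other field names."""

    def __init__(self, records: Iterable[dict]) -> None:
        self._records = list(records)

    def fetch(self) -> List[Row]:
        # Rename source-specific fields to the unified schema.
        return [{"node_id": r["id"], "temperature": r["temp_c"]} for r in self._records]


class Mediator:
    """Provides one unified view and forwards each query to every wrapper."""

    def __init__(self, wrappers: Iterable[object]) -> None:
        self._wrappers = list(wrappers)

    def query(self, predicate: Callable[[Row], bool]) -> List[Row]:
        results: List[Row] = []
        for wrapper in self._wrappers:
            results.extend(row for row in wrapper.fetch() if predicate(row))
        return results


mediator = Mediator([
    CsvWrapper(["1,21.5", "2,30.0"]),
    DictWrapper([{"id": 3, "temp_c": 31.2}]),
])
hot = mediator.query(lambda row: row["temperature"] > 25.0)
print(hot)
```

The caller of `mediator.query` never sees whether a record came from text lines or from dictionaries, which is the property that the virtual sensor and wrapper abstractions provide in GSN.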
Recent years have witnessed a striking growth in the number of sensors in people's everyday lives, such as smart phones, wearable devices, wireless sensor networks, and social networks (which are a kind of virtual sensor). These sensors surround us, sense the environment we live in, and produce data continuously in order to provide more contextual and more accurate services. At the same time, however, a data overload problem has emerged in isolated systems and leads to an inability to use real time information effectively.
In our research, we aim to provide an IoT application architecture paradigm that comprises procedures spanning data acquisition, integration, and utilization. As mentioned above, modern IoT data processing applications face two common and fundamental challenges: integrating heterogeneous data and manipulating the data in real time. Because realistic IoT applications usually collect data from multiple heterogeneous sources, we use GSN to handle acquisition and integration, and we then write a program to feed the data into Apache Storm. Apache Storm is a general purpose distributed real time data processing system, in contrast to Hadoop, which is characterized by distributed batch processing.
The motivation of this paper is to provide guidance on the practical engineering aspects of the IoT. There is a gap in the current literature, where researchers concentrate more on solving theoretical questions and less on describing how to establish a practical system. The rest of the paper is organized as follows. In Section 2, the development of the IoT and of the middleware built on top of it is reviewed. Section 3 depicts the architecture of the system. Section 4 details the concrete implementation of the system. In Section 5, we conclude our current work and point out avenues for future research.

Related Work
The IoT is expected to become a basic infrastructure of society, just like highways, water supply, and networks. The IoT is more than a revolution of the Internet: it extends human perception of the physical world, and it is a dynamic autonomous network based on its own communication protocols. In such a network, all physical and virtual items are identified by globally unique identifiers and exchange sensory data with each other to share information through intelligent interfaces [4,5]. These intelligent interfaces connect and communicate with users, society, and the environmental context on the basis of agreed protocols. The IoT is an extension and expansion of the Internet-based network that enables intelligent identification, locating, tracking, monitoring, and management.
As the IoT draws extensive attention, a great deal of research focuses on IoT enabling technologies, comprising identification, sensing, communication, and middleware [4,6]. To ease IoT application development, middleware for the IoT is highly emphasized and is becoming a widely employed approach [7][8][9][10][11][12]. IoT middleware bridges the wide gap between physical things and high-level applications and provides dedicated services that one or more clients can use locally or remotely over a network connection. The implementation technologies of middleware comprise agent-based methods [13], context awareness methods [12], service oriented architecture (SOA) based methods [14], and other development paradigms.
Several studies have focused on wireless sensor networks (WSNs). In [10], the authors described the outlook of middleware for wireless sensor networks. The authors of [15] designed and implemented a lightweight middleware platform for distributed computation on wireless sensor networks. As the wireless sensor network is becoming a new paradigm [16] in the realm of the Internet of Everything, some studies have addressed the virtualization of facilities. In [16], the authors give an overview of research on WSN virtualization through middleware and discuss the design goals, system architecture, service qualities, and challenges of WSN virtualization.
Data processing is always a critical issue in the domain of the IoT. Scientific data processing is a distinct case: large-scale scientific data are typically written into parallel file systems for fast writing speed, which in turn leads to poor I/O performance when reading [17]. Furthermore, the generic operations of collecting data from heterogeneous sources and joining them also make reading and analysis time-consuming [18]. To solve these problems, a variety of applications [17][18][19][20] have been developed in recent years.
However, beyond what has been achieved, quite a few open issues remain at the frontier of the IoT, concerning standardization, network addressing, security, privacy, scalability, mobility, heterogeneity abstraction, and other problems [4,6,21].
Recently, some new research hotspots have arisen. As cloud computing attracts attention in both academia and industry, cloud-centric IoT is finding its way into the mainstream. Reference [22] discusses the realization and challenges of cloud-centric IoT. The authors of [23] investigate data mining for the IoT. Context aware computing has also entered the realm of the IoT [24]. Semantic and ontology related approaches are used in the IoT as well [25,26].

System Architecture
Our proposed system architecture consists of three layers, as shown in Figure 1. At the bottom are the physical sensors. In our case, we build a wireless sensor network using two different types of wireless sensor nodes and a programming board connected directly to a PC through a USB cable. GSN (Global Sensor Network) listens on the serial port and receives the sensor data for further aggregation and filtering. GSN can act as a general purpose sensor middleware capable of accepting heterogeneous data from a variety of sensors, including simulated (virtual) sensors. It natively aggregates and filters data through SQL-style queries. As a sensor data server, GSN also provides several ways of publishing data. Through the GSN APIs, spouts (one type of Storm component) can pull data into the distributed computing system.
Because GSN already provides convenient APIs for data access, applications can retrieve sensor data directly from GSN and process it according to their application logic. Nevertheless, we add a data processing layer above the middleware layer. One of the most important reasons is that complex computation sometimes needs to be applied to the aggregated sensor data. Sensors in GSN are virtual sensors in GSN's abstraction; they can correspond to real physical sensors or to something purely virtual, such as an email notifier that reports email content whenever a new email is received, or even a generator that emits random integers periodically. A sensor is characterized by its ability to produce data; in this sense, anything that produces data, particularly software artifacts such as the Twitter streaming APIs, can be a virtual sensor. The data format can also be diverse, including binary data, doubles, strings, and large images.
Therefore, the data is multisource and heterogeneous. Computation applied to such data can be quite complex and may require powerful computing capability. Moreover, in modern IoT applications, the data usually arrives as a real time stream, and the computation on the stream must also run in real time to lower processing latency and provide a better user experience. These factors make a real time distributed data processing layer a necessity.

Brief Introduction to GSN.
In the past, applications were often built directly on top of sensor networks, that is, on the embedded operating systems. However, with the emergence of numerous heterogeneous sensor networks and other forms of sensors, interacting with the sensor systems became a tedious task.
Another challenge arises as the IoT industry grows rapidly around the globe: the sensor infrastructures belonging to different organizations cannot be adequately used for data integration over a wide area, let alone across the globe [35].
The Global Sensor Networks (GSN) middleware and similar middleware emerged in response to these difficulties. They lead to a new programming paradigm in which the middleware interacts with the sensor networks and is responsible for gathering data, and applications interact with the middleware.
GSN is one such middleware. The virtual sensor is a key abstraction in GSN [35]. Arbitrary sensors can be plugged into GSN. A sensor is represented in GSN by an XML file, abstracted as a virtual sensor, which defines data streams as sources and queries on those streams. Another closely related abstraction is the wrapper. A wrapper is a Java class that actually interacts with a data source, such as a physical sensor network, a socket, or a web service. Each data stream defined in a virtual sensor must specify its data sources and wrappers.
One virtual sensor can contain multiple data streams and SQL-style queries, which makes it convenient to integrate data from various sources.
Writing wrappers yourself can still be tedious; therefore, GSN ships with many common wrappers, and most of the time there is no need to write one. The available wrappers include a TinyOS [36] wrapper and a JDBC wrapper. A special wrapper is the remote wrapper, which can retrieve data from a remote virtual sensor on another GSN server on the Internet. In this way, GSN instances around the globe can be interconnected, and the sensor data they gather can be fully exploited.
As IoT middleware, GSN provides easy-to-use APIs for feeding data into applications. There are four types of APIs [37]: a connection oriented data distributor, a connectionless data distributor, a web service, and RESTful APIs. All of them make it easy to retrieve real time data, or historical data within a time window, in compliance with GSN's time model.
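To make the idea of SQL-based integration across several streams concrete, the following Python sketch uses an in-memory SQLite database as a stand-in for the storage behind two data sources. It does not reproduce GSN's virtual sensor XML or its query engine; the table names, column names (node_id, timed, temperature, humidity), and sample values are assumptions chosen for illustration.

```python
import sqlite3

# In-memory database standing in for the tables fed by two hypothetical sources.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sht11 (node_id INTEGER, timed INTEGER, temperature REAL)")
conn.execute(
    "CREATE TABLE mxp430 (node_id INTEGER, timed INTEGER, temperature REAL, humidity REAL)"
)
conn.executemany(
    "INSERT INTO sht11 VALUES (?, ?, ?)",
    [(1, 1000, 21.0), (2, 1000, 23.0)],
)
conn.executemany(
    "INSERT INTO mxp430 VALUES (?, ?, ?, ?)",
    [(7, 1000, 25.0, 40.0)],
)

# One query integrates both sources: readings are merged with UNION ALL
# and then aggregated per timestamp.
integrated = conn.execute(
    """
    SELECT timed, AVG(temperature) AS avg_temperature, COUNT(*) AS readings
    FROM (
        SELECT timed, temperature FROM sht11
        UNION ALL
        SELECT timed, temperature FROM mxp430
    )
    GROUP BY timed
    """
).fetchall()
print(integrated)
```

A single query over both tables yields one integrated record per timestamp, which mirrors how a virtual sensor can combine several source streams into one output stream.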

Brief Introduction to Apache Storm.
Apache Storm is a free and open source distributed streaming data processing system. It is intended for real time data processing, in contrast to batch data processing systems such as Hadoop. Originating in 2011, Storm is now used in a wide range of domains. Yahoo, Twitter, Spotify, Flipboard, Alibaba, Baidu, and many other enterprises have taken advantage of it.
Storm defines how distributed programs should be written by providing a simple and efficient programming paradigm. It defines two basic programming components, the "spout" and the "bolt." A spout acts as a water source: it generates data in some way and emits the data as tuples periodically. A bolt is a component that takes in tuples from spouts or other bolts, applies some computation, and may then produce new tuples and emit them. Programmers implement their own spouts and bolts and specify how these components are connected, which determines how data flows through the system. Such a data flow graph consisting of spouts and bolts is called a "Topology," another key abstraction in Storm.
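The spout, bolt, and Topology concepts can be mimicked in a few lines of plain Python. The sketch below does not use the Storm API and runs in a single process; the function names (sensor_spout, filter_bolt, average_bolt), the tuple layout, and the plausibility range are hypothetical and only illustrate how tuples flow from a source through chained processing steps.

```python
from typing import Dict, Iterable, Iterator, Tuple

Reading = Tuple[int, float]  # (node_id, temperature); hypothetical tuple layout


def sensor_spout(raw: Iterable[Reading]) -> Iterator[Reading]:
    """Source component: emits one tuple per raw reading."""
    for reading in raw:
        yield reading


def filter_bolt(
    stream: Iterable[Reading], low: float = -40.0, high: float = 125.0
) -> Iterator[Reading]:
    """Processing component: drops readings outside an assumed plausible range."""
    for node_id, temperature in stream:
        if low <= temperature <= high:
            yield node_id, temperature


def average_bolt(stream: Iterable[Reading]) -> Iterator[Reading]:
    """Processing component: emits (node_id, running mean) after each tuple."""
    totals: Dict[int, Tuple[float, int]] = {}
    for node_id, temperature in stream:
        total, count = totals.get(node_id, (0.0, 0))
        totals[node_id] = (total + temperature, count + 1)
        yield node_id, (total + temperature) / (count + 1)


# Wiring the components together plays the role of the Topology:
# spout -> filter bolt -> average bolt.
raw = [(1, 20.0), (1, 999.0), (2, 30.0), (1, 22.0)]
topology = average_bolt(filter_bolt(sensor_spout(raw)))
emitted = list(topology)
print(emitted)
```

In Storm itself, each component would run as one or more tasks distributed over the cluster, and the wiring would be declared when the Topology is built rather than by nesting function calls.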
In fact, a Topology is a job to be processed on the Storm cluster. To exploit the potential of distributed computing, a group of computers with Apache Storm installed needs to be connected through a local network, forming a Storm cluster. A distributed program that follows Storm's programming paradigm is usually packaged into a runnable jar file, which is then submitted to the Storm cluster. The submitted program running on the cluster is a Storm job, or Topology in Storm's terminology. Unlike batch jobs, jobs running on a Storm cluster usually do not stop, because they are meant to execute real time tasks. Several jobs can run on a Storm cluster simultaneously. Storm programs are robust because of parallelization, partitioning, and retrying on failure when necessary.
The basic components, spout and bolt, are inherently parallel; the degree of parallelism can be specified in the Topology settings. Even after a Topology has been submitted and is running, the parallelism of every component can still be adjusted according to hardware conditions. Playing the role of a "Hadoop of real time," Storm greatly eases the implementation of parallel real time computation.
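One way to picture component parallelism is to partition the tuple stream among several parallel tasks by key, so that tuples with the same key always reach the same task. Storm offers stream groupings for this purpose (for example, fields grouping); the Python sketch below only imitates the idea with a stable hash and is not Storm's implementation. The helper names (task_for, partition) and the sample readings are hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Reading = Tuple[str, float]  # (node_id, value); hypothetical tuple layout


def task_for(key: str, parallelism: int) -> int:
    """Maps a key to one of `parallelism` tasks using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % parallelism


def partition(tuples: Iterable[Reading], parallelism: int) -> Dict[int, List[Reading]]:
    """Groups tuples by target task; equal keys always share a task."""
    tasks: Dict[int, List[Reading]] = defaultdict(list)
    for node_id, value in tuples:
        tasks[task_for(node_id, parallelism)].append((node_id, value))
    return dict(tasks)


readings = [("node-1", 20.0), ("node-2", 30.0), ("node-1", 22.0), ("node-3", 25.0)]
by_task = partition(readings, parallelism=2)
for task_id, assigned in sorted(by_task.items()):
    print(task_id, assigned)
```

Changing the `parallelism` argument redistributes the keys over more or fewer tasks, which is loosely analogous to adjusting the parallelism of a component.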

Detailed Implementation of the System.
In order to establish the environmental monitoring system, we deployed IRIS sensor nodes at a number of different positions in our laboratory as the sensor layer. The nodes are shown in Figures 2 and 3. Each sensor node carries a voltage sensor, a humidity sensor, a temperature sensor, a pressure sensor, a light sensor, and a three-axis accelerometer. Although the node hardware is identical, the nodes are divided into two groups according to the embedded application installed on them: sht11 and mxp430. The names sht11 and mxp430 refer to two types of sensor chips, and in this article we also use them to denote the corresponding sensor nodes and the embedded applications running on them. The mxp430 nodes collect all of the sensor data types mentioned above, while the sht11 nodes sense only light, temperature, and the voltage of their battery. Both types of nodes also report network topology information, comprising the sensor node id, parent node id, and group id, as well as message metadata such as the timestamp and message sequence id. These sensor nodes form a wireless sensor network whose sink node is a sensor node plugged into a programming board (Figure 4) that is connected to a Linux server through a USB adapter cable. The sensor nodes use TinyOS [36] as their embedded operating system and run a TinyOS application to collect environmental data.
The Linux server runs Ubuntu 14.04.1 LTS together with the Global Sensor Network middleware. The raw data collected from the wireless sensor network is interpreted, stored, and integrated for further use.
The wrapper is where GSN actually acquires data from an outside source. Wrappers are like borders through which data enters the realm of GSN, so obtaining a proper wrapper comes first. GSN ships with a large number of wrappers, but in our case the raw data format is customized by the hardware vendor, so we write our own wrappers. The raw sensor data is first collected and interpreted by a sensor network server and then pushed out through sockets in a well-defined XML format. The wrappers we wrote therefore handle the TCP connection and the XML parsing. In the initialization phase, each wrapper retrieves the parameters it needs, such as the host address and port, and then declares the output structure to persist, which corresponds to the structure of the table stored in the database. Then, in its worker thread, the wrapper first connects to the sensor network server and then, in a loop, reads XML-formatted data, parses it, and stores the parsed data in a database. The wrappers for both sht11 and mxp430 sensor data work in this manner.
After the wrappers have been prepared, the XML files of the virtual sensors are required. Box 1 shows the virtual sensor for sht11 messages.
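The read, parse, and store loop of such a wrapper can be sketched in Python as follows. The actual wrappers are Java classes inside GSN, and the vendor's XML format is not given in this paper, so the message layout (a `reading` element with `node_id`, `temperature`, `light`, and `voltage` children), the table schema, and the function names are assumptions. An in-memory text stream stands in for the TCP socket so that the sketch runs without a sensor network server.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple


def parse_message(xml_text: str) -> Tuple[int, float, float, float]:
    """Parses one hypothetical XML message into a database row."""
    root = ET.fromstring(xml_text)
    return (
        int(root.findtext("node_id")),
        float(root.findtext("temperature")),
        float(root.findtext("light")),
        float(root.findtext("voltage")),
    )


def run_wrapper(stream: Iterable[str], conn: sqlite3.Connection) -> int:
    """Reads one XML message per line, parses it, and stores it; returns the row count."""
    # Output structure: mirrors the table that holds the parsed readings.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sht11 "
        "(node_id INTEGER, temperature REAL, light REAL, voltage REAL)"
    )
    stored = 0
    for line in stream:
        line = line.strip()
        if not line:
            continue  # skip blank lines between messages
        conn.execute("INSERT INTO sht11 VALUES (?, ?, ?, ?)", parse_message(line))
        stored += 1
    conn.commit()
    return stored


# Stand-in for the socket; with a real server, a line iterator could be obtained
# with socket.create_connection((host, port)).makefile("r").
feed = io.StringIO(
    "<reading><node_id>3</node_id><temperature>22.5</temperature>"
    "<light>310</light><voltage>2.9</voltage></reading>\n"
    "<reading><node_id>5</node_id><temperature>19.0</temperature>"
    "<light>120</light><voltage>3.0</voltage></reading>\n"
)
conn = sqlite3.connect(":memory:")
count = run_wrapper(feed, conn)
rows = conn.execute("SELECT node_id, temperature FROM sht11 ORDER BY node_id").fetchall()
print(count, rows)
```

Separating `parse_message` from the loop keeps the format-specific logic in one place, so a second wrapper for a different message type would only need a different parser and table schema.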

Conclusion and Future Work
In this paper, we propose and discuss a system architecture for IoT stream data processing. We divide the architecture into three layers, from bottom to top: the sensor layer, the middleware layer, and the data processing layer. We implement the system and conduct the experiment using popular open source projects. In the sensor layer, TinyOS is the embedded operating system on the wireless sensor nodes; in the middleware layer, the Global Sensor Network functions as the sensor data server; in the data processing layer, Apache Storm is responsible for distributed real time sensor data processing. In constructing the system, the data input and output of GSN are the key procedures, and the wrappers in GSN and the spout in Apache Storm need to be developed manually. Both the wrappers and the spout retrieve data from another system over a socket connection. The benefit of the architecture is that it can retrieve and integrate multisource heterogeneous data and then process it in a distributed, real time manner. Our future work aims at constructing a real time context aware recommender system. As shown in this paper, we can capture information from wireless sensor networks, mobile phones, and other equipment and then combine physical data such as weather and position with social network data, which may indicate an individual's activity, mood, and other attributes. We intend to build the recommender system on this individual oriented information.

Figure 5: Web page exhibiting captured sensor data in GSN.

Figure 6: Data parsed by parser of spout.

Table 1: Captured sensor data of mxp430 nodes.