Pattern Matching Based Sensor Identification Layer for an Android Platform

1Division of Computer and Information Engineering, Hoseo University, Republic of Korea 2Department of Civil Engineering, Hongik University, Republic of Korea 3Division of Computer Engineering, Hansung University, Republic of Korea 4School of Engineering and Computer Science, Baylor University, USA 5Center for Cybersecurity Systems and Networks, Amrita Vishwa Vidyapeetham, Amritapuri, Kollam, India 6Department of Computer Science, Czech Technical University in Prague, Czech Republic 7Department of Information and Communication Engineering, Hannam University, Republic of Korea


Introduction
Modern smartphones are equipped with various sensors such as an accelerometer, gyroscope, and Global Positioning System (GPS) that can track and monitor dynamic environmental information including user's behavior [1], health [2], traffic [3], noise pollution [4], and social interactions [5]. Nowadays, smartphones are not only communication devices but also monitoring devices for various purposes [6]. With the rapid development of sensor devices, wired or wireless external sensors can be connected to a smartphone in Internet of Things (IoT) environments. For example, an application using Bluetooth Low Energy (BLE) beacons can support several services including localization detection, interaction, and activity sensing [7]. Smartphones can easily collect data from the external sensors deployed around a user and physical space because they have the necessary interface and communication capabilities [8]. More broadly, smartphones can be used as gateways in Smart Cities, which are a key application domain for the IoT [9]. Smartphones cooperate with various types of sensors in such a large-scale system, and it is impractical to recognize and interpret meaningful information from each sensor's data packet. In addition, smartphones and IoT devices are always connected to the Internet, and their software must be changed to update system files or fix bugs [10]. Many security and privacy-related issues must also be considered during data collection and software management [11].
Interworking a smartphone with sensors has various merits. However, the packet format of each sensor device is different, even among personal health devices. There are messaging standards for personal health devices, such as the ISO/IEEE 11073 Personal Health Device (PHD) [12] and Health Level Seven [13]. These standards define a common framework for data interoperability in order to establish connections between personal health sensors or devices and to meet the IoT quality aspects [14]. Yet, manufacturers usually provide a different message format for each device to support certain applications. This, in turn, leads to interoperability issues as the number of sensors increases.
In this paper, we propose a client-server-based sensor adaptation layer to support interoperability of nonstandardized sensor devices. In the proposed system, the server identifies sensor devices by using a pattern matching process based on the Boyer-Moore-Horspool (BMH) algorithm to recognize the received data packet. The server also manages and stores in its database the profile of each sensor, which is used to identify an unknown data packet. This profile reflects two features of a data packet: its unique start and end patterns and the range between the minimum and maximum values measured by a sensor. The client running on a smartphone receives the result from the server. Using this result, the client extracts the desired value from the data packet and reformats it into a standard message based on ISO/IEEE 11073 PHD. The application does not need to change when a new sensor device is connected, because the client decodes the data packet and sends a reformatted standard packet to the application.
The rest of this paper is organized as follows. Section 2 introduces the background and related information. Section 3 presents the client-server-based sensor adaptation layer for interoperability among different sensors. Section 4 evaluates the availability of the proposed system and the time efficiency of sensor identification. Finally, Section 5 summarizes the paper and gives concluding remarks.

Background and Related Works
2.1. Data Packet Formats and Standard. In practice, sensors use different data formats. In fact, each sensor has its own data packet format, even when the sensors are made by the same manufacturer. Figure 1 shows different packet formats from various manufacturers. These data packets differ in features such as the length, the start and end patterns, and the range of data values. Application developers must understand the data packet format of each sensor and modify source code when a sensor is changed. A mobile device connected to the sensors cannot interwork with unknown sensors, and a new application must be installed whenever a new sensor is attached. To support interoperability among various PHDs, ISO/IEEE 11073 was proposed. The standard specifies how personal health data should be exchanged between sensors and a monitor such as a smartphone and what standard data format should be used [15]. This standard is available for various u-health services such as disease management, health and fitness, and antiaging applications. The standard consists of a technical report, device specifications, and an optimized exchange protocol between a sensor device and a monitor. Figure 2 shows the data packet structure of an open platform based on ISO/IEEE 11073 [16].
In spite of the ISO/IEEE 11073 committee's efforts to standardize data packet formats, the data formats of existing sensors are nonstandard. Many manufacturers and u-health service providers still use their own data formats, and their sensor devices cannot be regarded as interoperable. This increases maintenance costs and prevents integrated system implementation [17].

2.2. Sensors' Connection to Android Platform. The Android platform is composed of components organized in five layers. As shown in Figure 3, these layers are applications, application framework, Android runtime, native libraries, and Linux kernel [18]. The application layer may include different Android applications written in Java, such as an email client, Short Message Service (SMS), calendar, maps, web browser, contacts, and others. The application framework provides an application programming interface (API) to developers, which is designed to simplify the reuse of components and to communicate with various devices. The Android runtime (ART) includes core libraries and, like its predecessor Dalvik, a virtual execution environment for Dex bytecode. Android also includes a set of core libraries that provide most of the functionality available in the core libraries of the Java programming language. ART is similar to the Java Virtual Machine (JVM), and every application needs ART in order to run. The native libraries are a set of C/C++ libraries used by various components of the Android system. Android relies on Linux for core system services such as security, memory management, process management, the network stack, and device drivers.
In this study, we use the Java Native Interface (JNI) framework, which enables a Java application running in a JVM to call native libraries written in C and C++, for interworking an Android-based smartphone with external sensors. The sensor adaptation layer (SAL) monitors the serial ports and the Bluetooth interface to detect connection requests from PHDs and, with the assistance of the server, sends ISO/IEEE 11073 compatible messages to the application. The details are described in the next section.

2.3. Packet Matching Algorithms. Sensor devices periodically send numerous data packets to the smartphone. An Android application that receives data packets from sensor devices unpacks meaningful values such as temperature, blood pressure, and heart rate. The application can recognize the meaningful value if it already knows the data packet structure. However, it cannot decompose the data packet if it receives an unknown data packet. Therefore, the server part of the proposed SAL uses a packet matching algorithm called Boyer-Moore-Horspool (BMH) [19] when there is no predefined information or profile for a newly arriving data packet.
A packet matching problem is a kind of pattern matching. An exact string matching algorithm is needed to recognize an unknown data packet from an unidentified sensor. There are two types of string matching algorithms: online and offline algorithms. In an online scheme, the compared text is not known in advance, whereas in an offline scheme, the compared text is known and can be preprocessed [20]. The BMH algorithm is an online scheme that requires O(mn) time in the worst case but far fewer comparisons in the average case, where m and n denote the lengths of the pattern and the text, respectively. These characteristics make the BMH algorithm suitable for our data packet matching module.

2.4. Sensor Identification Schemes. Fountain et al. proposed a technique for automatically identifying relationships among different positioning sensors [21]. This technique allows for subsequent automatic alignment of the sensor systems and increases the usability of modular sensor systems. To improve identification accuracy, the authors use an invariant functional method and a calibration error method. The invariant functional method uses a sliding window mechanism to discard noise, and the calibration error method is used to find the matching part of sensor streams. This scheme can improve its performance by increasing the window size, whereas our scheme focuses on maximizing the shift distance (the number of characters skipped without comparison).
A clustering algorithm for unbounded sensor data was designed by Mirsky et al. to accurately infer the present context [22]. This clustering algorithm is suitable for sensor data streams because it can detect overlapping clusters. The authors consider the temporal relation of the arriving points and the clusters' correlated distributions in the clustering decision. By using these mechanisms, the sensor data stream is partitioned dynamically according to both temporal and spatial domains. The clustering algorithm is efficient in terms of completion time, but its accuracy fluctuates with the input data.
Perera et al. proposed a tool called SmartLink that can be used to discover and configure sensors in heterogeneous IoT environments [23]. SmartLink follows the Context-aware Dynamic Discovery of Things (CADDOT) model, which consists of 8 phases: detect, extract, identify, find, retrieve, register, reason, and configure. SmartLink becomes an open wireless hotspot and detects sensors. Next, SmartLink extracts information from the detected sensors and sends the extracted information to a cloud server. After the cloud server identifies the sensor, a plugin that describes how to communicate with the identified sensor is installed. Once all the information about a detected sensor has been collected, registration takes place in the cloud. In our approach, we use a regular expression-based string matching algorithm to identify sensor data, whereas previous studies are based on probabilistic mechanisms. We also design the client-server-based system architecture as well as the sensor identification mechanism.
Liu et al. [24] proposed a naming, addressing, and profile server called NAPS to give a unique name to each IoT device. The role of NAPS is similar to that of a Domain Name Server (DNS). When a device registers with NAPS, NAPS generates the profile of the registered device in XML format. If a user requests a profile from NAPS through its RESTful interface, the user receives all information about the device, including name, address, communication protocol, device type, sampling interval, and so on. This system provides detailed information about an identified sensor device by using a predefined profile, but the profile format is different from that of our approach.
Liu et al. [25] presented a centralized event detection algorithm based on Minimum Cut (Min-cut) theory and a Support Vector Machine-(SVM-) based pattern recognition technique to reduce communication overhead during data collection. In the proposed algorithm, Min-cut theory is used to reduce the number of transmissions of redundant data, and SVM is used to determine whether a participant is in the event region. The training time and the accuracy for new sensor data formats cannot be exactly estimated because this scheme employs machine learning-based algorithms.

Client-Server Based Sensor Adaptation Layer
3.1. The Client-Server Architecture. The proposed client-server system is composed of several modules. Figure 4 shows these modules in the architecture. The client consists of three modules. The connection management module sends a data packet stream received from an unknown sensor device to the server and receives the matching results from the server. To implement the connection management module, we use a Transmission Control Protocol/Internet Protocol (TCP/IP) based socket library running on Linux. The sensor detection module periodically monitors the serial ports and the Bluetooth interface.
If a sensor device is attached to the Android platform, a new device file is created in the /dev directory. In the case of the Bluetooth interface, we use the BlueZ library and its utilities for detecting a connection over the Bluetooth interface. BlueZ provides functions to detect Bluetooth devices and collect Bluetooth data packets [26]. The reformatting data packet module repacks raw data packets from a sensor device into ISO/IEEE 11073 compatible data packets by using the matching result from the server. Reformatted data packets are sent to the upper layer by using JNI, and the application can extract meaningful values from the standardized data packet.
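The detection of a newly attached serial sensor described above can be sketched as a comparison of two snapshots of the /dev directory. This is a minimal Python illustration, not the authors' native implementation; the function names and the device-name prefixes (`ttyUSB`, `ttyACM`, `ttyS`) are assumptions chosen for the example.

```python
import os


def list_device_files(dev_dir="/dev"):
    """Return the set of entry names currently present in dev_dir."""
    try:
        return set(os.listdir(dev_dir))
    except FileNotFoundError:
        return set()


def detect_new_devices(previous, current, prefixes=("ttyUSB", "ttyACM", "ttyS")):
    """Return names that appeared since the previous snapshot.

    Only names starting with one of the (assumed) serial-port prefixes
    are reported, so unrelated device files are ignored.
    """
    return sorted(name for name in current - previous if name.startswith(prefixes))
```

A polling loop would call `list_device_files()` at a fixed interval and pass consecutive snapshots to `detect_new_devices`.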
The server also consists of three modules. The connection management module accepts requests from multiple clients and prepares to receive the data packet stream from clients. The profile generator plays an important role in the performance of sensor identification. A profile captures the unique characteristics of each data packet, much like a fingerprint, and is represented as a regular expression [27]. The data packet stream from a sensor device has unique features that appear repeatedly, such as the start and end patterns, type, and length. The profiles of each sensor are shown in Figure 1 and Table 1. We found two features in the profiles. First, the regular expression of each sensor data packet starts with a unique pattern and includes a different number of wildcard characters. Second, the range of each sensor's data values is different. For instance, the value range of the GPS sensor is from -180 to 180, and that of the blood pressure sensor is from 20 to 300. These two features of the profile are used to improve the efficiency of the pattern matching process. The pattern matching module compares a sensor's profile with the database that stores the raw data packets of all kinds of sensors. This pattern matching process is described in the next modeling subsection.
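The two profile features described above, a regular expression with start and end patterns around a fixed number of wildcards and a plausible value range, can be illustrated with the following Python sketch. The packet layouts, the `Profile` class, the helper names, the ASCII-decimal payload encoding, and the temperature range are assumptions made for this example; only the blood pressure range of 20 to 300 comes from the text.

```python
import re
from dataclasses import dataclass


@dataclass
class Profile:
    name: str
    pattern: re.Pattern   # compiled bytes regex with one capture group
    low: float            # minimum plausible sensor value
    high: float           # maximum plausible sensor value


def make_profile(name, start, wildcards, end, low, high):
    """Build a profile: start pattern, `wildcards` arbitrary bytes, end pattern."""
    regex = re.escape(start) + b"(.{%d})" % wildcards + re.escape(end)
    return Profile(name, re.compile(regex, re.DOTALL), low, high)


def identify(stream, profiles):
    """Return (sensor name, value) for the first profile whose pattern
    matches and whose decoded value lies in the profile's range."""
    for profile in profiles:
        for match in profile.pattern.finditer(stream):
            try:
                # Assumption: the payload is an ASCII decimal number.
                value = float(match.group(1).decode("ascii"))
            except (UnicodeDecodeError, ValueError):
                continue
            if profile.low <= value <= profile.high:
                return profile.name, value
    return None


# Hypothetical packet layouts, for illustration only.
PROFILES = [
    make_profile("temperature", b"\x02T", 3, b"\x03", -40, 125),
    make_profile("blood_pressure", b"\x02B", 3, b"\r\n", 20, 300),
]
```

With these hypothetical profiles, `identify(b"xx\x02B120\r\nyy", PROFILES)` reports the blood pressure sensor, while a packet whose decoded value falls outside every range is left unidentified.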

3.2. Modeling and Analysis. We present an estimate of the efficiency of our modified BMH algorithm for identifying sensor devices. We model the efficiency in terms of the expected value of the shift, which determines how far the pattern moves when a mismatch occurs. We simplify the analysis by assuming uniform distributions of characters in both the byte streams and all profiles. In practice, these are not random, but the probabilistic performance gives a rough idea.
The key insight of the BMH algorithm is to match on the tail of the pattern rather than the head and to skip along the text in jumps of multiple characters rather than checking every character of the byte stream. Jumping along the text decreases the number of comparisons that have to be performed, which is the key to the efficiency of the algorithm. In the BMH algorithm, the jump distance is given by J(c), the distance from the rightmost occurrence of c among the first m-1 characters of the pattern to the pattern's right end.
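The jump function J(c) and the tail-first comparison described above can be sketched for the standard, wildcard-free BMH algorithm as follows. This is a textbook-style Python illustration with assumed function names, not the modified algorithm proposed in this paper.

```python
def build_jump_table(pattern: bytes) -> dict:
    """Map each byte c to J(c): the distance from c's rightmost occurrence
    among the first m-1 pattern bytes to the right end of the pattern."""
    m = len(pattern)
    # Later (more rightward) occurrences overwrite earlier ones.
    return {c: m - 1 - i for i, c in enumerate(pattern[:-1])}


def bmh_search(text: bytes, pattern: bytes) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1
    jump = build_jump_table(pattern)
    pos = 0
    while pos <= n - m:
        j = m - 1
        # Compare from the tail of the pattern toward its head.
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            return pos
        # Shift by J(c) for the text byte aligned with the pattern's last
        # position; bytes absent from the table allow a full shift of m.
        pos += jump.get(text[pos + m - 1], m)
    return -1
```

For example, searching for `b"abd"` in `b"abcabd"` mismatches on `c`, which does not occur in the pattern, so the window jumps three positions at once and the match is found on the second alignment.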
A sensor profile is m characters long, with a start pattern (whose first character is s_1) at the beginning and an end pattern at the end, and it may contain wildcard characters, as shown in Figure 5. Note that a wildcard character substitutes for any other character in a string. The key difference from the original algorithm lies in how we generate the jump table that determines how far to shift the pattern when a mismatch occurs. Our goal is to find the first match of the profile in a byte stream n characters in length.
Therefore, we modify the jump table construction part of the BMH algorithm to handle the sensor profiles. We preprocess the sensor patterns and build profiles for sensor identification. Then, we feed the byte stream to the server, which reports the matching result whenever one is found.
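One possible way to build a jump table for a profile that contains wildcards is sketched below in Python. It is not necessarily the authors' construction: the sketch assumes each wildcard matches exactly one byte and conservatively caps every shift at the distance from the rightmost wildcard to the right end of the profile, so that no candidate alignment is skipped. The names `WILDCARD`, `make_wildcard_profile`, `build_wildcard_jump_table`, and `find_profile`, and the example packet layout, are hypothetical.

```python
WILDCARD = None  # marker for a position that matches any single byte


def make_wildcard_profile(start: bytes, wildcards: int, end: bytes) -> list:
    """Profile = start pattern + `wildcards` wildcard positions + end pattern."""
    return list(start) + [WILDCARD] * wildcards + list(end)


def build_wildcard_jump_table(profile):
    """Return (table, default): the shift for listed bytes and for all others."""
    m = len(profile)
    table, default = {}, m
    for i, symbol in enumerate(profile[:-1]):
        shift = m - 1 - i
        if symbol is WILDCARD:
            # A wildcard can align with any byte, so it bounds every shift;
            # larger shifts recorded so far become redundant.
            default = shift
            table.clear()
        else:
            table[symbol] = shift
    return table, default


def find_profile(stream: bytes, profile) -> int:
    """Return the index of the first match of profile in stream, or -1."""
    m, n = len(profile), len(stream)
    if m == 0 or m > n:
        return -1
    table, default = build_wildcard_jump_table(profile)
    pos = 0
    while pos <= n - m:
        j = m - 1
        while j >= 0 and (profile[j] is WILDCARD or profile[j] == stream[pos + j]):
            j -= 1
        if j < 0:
            return pos
        pos += table.get(stream[pos + m - 1], default)
    return -1
```

For a hypothetical profile with start pattern `b"\x02"`, three wildcards, and end pattern `b"\r\n"`, the default shift equals the end-pattern length (2), so under this conservative construction a longer end pattern directly increases the achievable shift.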
Here, the size of the character set is a parameter of the model. We assume that both byte streams and all profiles are random strings with uniform distributions.
We first consider the probability distribution of J(c) and then estimate its expected value. The possible values range from J(c) = 1 to J(c) = m. We summarize the four possible cases as shown in Figure 6.
(a) The character c is in the end pattern (but is not its rightmost character). In that case, the probability that J(c) = x, where x is positive and smaller than the length of the end pattern, is equal to the probability that each of the last (x - 1) characters of the end pattern (excluding the rightmost) differs from c, up to the first occurrence of c.
(b) The character c is in neither the start pattern nor the end pattern. In that case, the value of J(c) is identical to the number of characters in the end pattern. This occurs when all characters in both the start and end patterns mismatch c.
(c) The character c is in the start pattern. J(c) ranges from m minus the length of the start pattern, for the last character of the start pattern, to m - 1, for s_1. The probability that J(c) = x, for x in this range, is equal to the probability that c does not occur in (x - γ - 1) characters of the pattern at all (excluding the rightmost), followed by one final matching character. Recall that γ is the number of wildcards and that a wildcard can substitute for any other character in our algorithm.
(d) The character c occurs only as the rightmost character of the pattern. In that case, J(c) is the maximum possible shift distance, m. The corresponding probability is the probability that c occurs only at the rightmost position of the end pattern.
Thus, the probabilities of all possible outcomes must sum to 1 (see Appendix). Accordingly, the expected value of the shift function, E[J(c)], can be computed from the four cases above (see Appendix). Figures 7(a)-7(c) show the expected number of shifts as the number of wildcards γ varies, with the remaining pattern parameters ranging from 1 to 3. The slope and the y-intercept of each curve depend on these parameters. Note that the expected shift of a naive algorithm is a constant value of 1. Accordingly, the proposed algorithm has an advantage over the naive algorithm.
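As a point of comparison, the same style of calculation can be carried out for the standard BMH algorithm without wildcards. The derivation below is a sketch for that baseline only and is not the formula from the Appendix; the symbol σ for the size of the character set is an assumption introduced here. With pattern length m and c drawn uniformly and independently of the pattern, J(c) = x for x < m requires a match at distance x and mismatches at the x - 1 nearer positions, and J(c) = m requires mismatches at all m - 1 positions:

```latex
\begin{align*}
\Pr[J(c) = x] &= \frac{1}{\sigma}\left(1 - \frac{1}{\sigma}\right)^{x-1},
  \qquad 1 \le x < m, \\
\Pr[J(c) = m] &= \left(1 - \frac{1}{\sigma}\right)^{m-1}, \\
\sum_{x=1}^{m} \Pr[J(c) = x]
  &= 1 - \left(1 - \frac{1}{\sigma}\right)^{m-1}
     + \left(1 - \frac{1}{\sigma}\right)^{m-1} = 1, \\
E[J(c)] &= \sum_{k=0}^{m-1} \Pr[J(c) > k]
  = \sum_{k=0}^{m-1} \left(1 - \frac{1}{\sigma}\right)^{k}
  = \sigma\left(1 - \left(1 - \frac{1}{\sigma}\right)^{m}\right).
\end{align*}
```

For m = 2 this gives E[J(c)] = 2 - 1/σ, which already exceeds the constant shift of 1 of the naive algorithm whenever σ > 1.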

Prototype Implementation and Performance Evaluation
Figure 8 shows the hardware components of our prototype, and Table 2 shows the client and server specifications. Temperature, dust, and carbon dioxide sensors are connected to an Android-based development board with serial cables. A blood pressure sensor is connected to the development board over Bluetooth. The SAL reads data packet streams from device files whenever one of the sensors is attached. After receiving a pattern matching result from the server, the SAL parses the sensing value from the data packet stream and repacks the message based on the ISO/IEEE 11073 standard. The application plots the value sent from the SAL, as shown in Figure 9. If we randomly turn one of the sensors on and off, the application starts and stops plotting the graph immediately. The application is not modified even though a new sensor is attached.
Figure 10 compares the server-side execution time of each scheme. The total number of sample data packets is between 1000 and 10000. The sequential scheme compares the profile data packet with a sample data packet byte by byte. For the sliding window scheme [21], we set the window size to 1/3 of the profile data packet. The clustering algorithm [22] selects the sample data packet that has the highest similarity to the profile data packet. In the probability scheme [23], the positions at which the profile packet and sample data packets are compared are randomly selected. Our scheme compares the proposed regular expression-based profile with sample data packets. The execution time of the sequential scheme increases dramatically as the packet length and the total number of sample data packets increase. The sliding window scheme performs better than the sequential scheme but worse than the other schemes. In this scheme, the window size plays an important role in determining the execution time. It is difficult to choose a proper window size given the variable data packet size and changes in sensing values. The clustering and probability schemes are scalable because they do not check the entire data packet but only some parts of it. Our scheme is slower than the clustering and probability schemes but faster than the sequential and sliding window schemes. Our scheme is also scalable because it skips comparing some parts of the data packet by using the proposed regular expression-based profile.
Table 3 summarizes the pattern matching accuracy of each scheme. We tested with a total of 5,000 sample data packets, including 20 data packets for each sensor (CO2, dust, blood pressure (BP), and temperature (TEMP); 80 packets in total), and ran each scheme 10 times. The accuracy is the number of correct recognitions divided by the number of data packets of each sensor (20). In the case of the sequential scheme, all sensors are identified accurately. In the sliding window scheme, the accuracy decreases as the window size decreases and the length of the data packet increases. The clustering and probability schemes show low accuracy in recognizing a sensor data packet because it is difficult to set up proper classification criteria and a probability model for a sensor data stream. Our scheme shows accuracy similar to that of the sequential scheme, except for the CO2 sensor, whose data packet does not have an end pattern.

Conclusions and Future Work
Sensor devices are used for many applications, especially with smartphones. Numerous manufacturers provide diverse sensor devices and use their own data packet formats. Applications that work with these sensor devices must be modified whenever the attached sensor device is changed because the data format is different. We proposed a client-server-based sensor adaptation layer for an Android platform to solve this problem. The proposed interface, called SAL, reformats nonstandard data packets into standard data packets with the assistance of the server. The server defines profiles, represented as regular expressions, that are used for identifying sensor devices, and it conducts a pattern matching process based on the modified Boyer-Moore-Horspool algorithm. Our model and analysis results show that the profiles reflect features of the sensor data packet and that the length of the end pattern strongly affects the performance of the matching process. We also implemented a prototype to show that our system design is operable. Future work on our system is to generate regular expressions for adding new types of sensors. Currently, we predefine and register the regular expressions of known sensors manually. To make our system more practical, we should automate the recognition of sensor data streams and generate a regular expression from a given sensor data stream.

Figure 4: Android architecture overview and sensor adaptation layer.

Wireless Communications and Mobile Computing 7
Figure 7 axes: expected number of shifts, E[J(c)] (vertical), versus number of wildcards, γ (horizontal).

Figure 7(d) shows the results of a practical case in which the two frequently used parameter values are 2 and 1, respectively. As shown in the figure, at a parameter value of 16, the proposed mechanism is 20% better than the naive algorithm. In our investigations, the proposed algorithm is most effective when one pattern parameter is 2 or more or another is significantly large.

Figure 8: Hardware components of our prototype.

Figure 9: Screen capture of our prototype.

Figure 10: The execution time of profile matching process.

Table 3: Accuracy comparison. Columns: Dust, BP, TEMP, and CO2 for each scheme.