H2SA-ALSH: A Privacy-Preserved Indexing and Searching Schema for IoT Data Collection and Mining

Currently, smart devices of the Internet of Things generate massive amounts of data for different applications. However, the process of IoT data collection, transmission, and mining can expose sensitive information to external users. In this paper, we propose a novel indexing and searching schema based on homocentric hyperspheres and similarity-aware asymmetric LSH (H2SA-ALSH) for privacy-preserved data collection and mining over IoT environments. The H2SA-ALSH collects multidimensional data objects and indexes their features according to the Euclidean norm and cosine similarity. Additionally, we design a c-k-AMIP searching algorithm based on H2SA-ALSH. Using the proposed indexing schema, our approach can boost the performance of maximum inner product (MIP) queries and top-k queries for a given query vector. Experiments on real-world datasets show that our algorithm outperforms other ALSH-based algorithms in accuracy and efficiency. At the same time, our indexing scheme protects the user's privacy by generating similarity-based indexing vectors without exposing raw data to external users.


Introduction
In recent years, Internet of Things (IoT) technology has been applied to a wide range of applications [1,2], mainly driven by the rising number of Internet-connected devices, which already amounts to several billion [3]. The devices of IoT [4] aim to connect everyday objects, such as humans, plants, and even animals, to the Internet to enable interactions among these objects [5]. Applications of IoT have been widely developed in medical healthcare [6,7], vehicular networks [8,9], and industrial IoT [10]. With the widespread popularity of IoT, a massive amount of data is generated and spread at a relatively fast speed.
Thus, applications in different IoT domains see an explosion of information generated from heterogeneous devices every day. Recently, data collection and mining over IoT data streams have attracted increasing research interest [11][12][13][14][15].
However, due to weak privacy and security protection in IoT devices, some smart applications of IoT expose sensitive data and user privacy to security threats [16]. Thus, data mining over raw data will collect and expose user-sensitive information. As in stream data mining [17], the interesting knowledge, regularities, or high-level information that is extracted can easily raise privacy protection concerns. At present, MIP (maximum inner product) search is prominent, and it is used in a wide range of applications, such as matrix factorization-based recommendation systems [18][19][20], multiclass label prediction [21,22], SVM classification [23], and even deep learning [24]. However, conducting the MIP search in high-dimensional space is time-consuming. Moreover, it may leak the user's privacy, because a query system needs to collect the raw data from the devices of the IoT system. Many studies try to construct an appropriate approximate structure for the search. This is usually called approximate maximum inner product (AMIP) search [25][26][27][28][29]: given a query q and a set D of target objects, with data objects x ∈ D, the AMIP algorithms compute approximate maximum inner product results for q in D.
Motivated by these promising techniques, we extract target features from the raw data objects of IoT devices and conduct the maximum similarity search between the input vector q and the set of extracted indexing vectors; only the object with the matched features needs to be transmitted to the user for a final decision. This protects the user's privacy from sensitive information collection and exposure to third-party query services [38,39]. The contributions of the paper are as follows:
(1) We propose a novel privacy-preserved indexing and searching schema, termed H2SA-ALSH, for high-dimensional data object collection and mining. The indexing scheme is based on homocentric hyperspheres and a similarity-aware algorithm (H2SA). The searching computes the cosine similarity between a query vector and data objects. The proposed schema can support AMIP search, top-k search, etc., without exposing raw data.
(2) We optimize the proposed indexing solution to fit IoT data collection and mining. In the process of IoT data collection, we establish an incremental indexing mechanism, which indexes an input item immediately when it arrives. For IoT data mining, we use SRP-LSH to accelerate the search by filtering out low-similarity objects. Moreover, the algorithm is not sensitive to the data, i.e., it presents acceptable performance over datasets with different distributions.
(3) We conduct comprehensive experiments to evaluate the H2SA-ALSH indexing and searching scheme using three real-world data sets. The experimental results show that the proposed approach is more accurate and efficient than the state-of-the-art algorithms. As a result, a searching query is not conducted directly over the raw data objects in IoT environments.

Problem Definition
In this section, we briefly present the preliminaries of the proposed techniques and state our research problem formally. We use the common notation of the AMIP literature to present the MIP search problem and the corresponding AMIP search problem.
Definition 1. Maximum inner product (MIP) search. For a data collection $T \subset \mathbb{R}^d$ that has already received $n$ data objects and an arbitrary query $q \in \mathbb{R}^d$, the MIP search aims to find $t^* \in T$ that satisfies

$$t^* = \arg\max_{t \in T} \langle t, q \rangle.$$

Definition 2. The $c$-approximate maximum inner product ($c$-AMIP) search. Given an approximation ratio $c$ $(0 < c < 1)$, the goal of the $c$-AMIP search is to construct an approximate structure with which a user can find, for a query $q \in \mathbb{R}^d$, an approximate result $t \in T$ that satisfies $\langle t, q \rangle \ge c \langle t^*, q \rangle$, where $t^*$ is the exact result of the MIP search.
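To make Definitions 1 and 2 concrete, the following minimal Python sketch computes the exact MIP result by a linear scan and checks whether a candidate meets the $c$-AMIP condition. The function names and the toy data are illustrative assumptions and do not come from the paper; the check also assumes that the exact maximum inner product is positive, which is the case in which a ratio $c < 1$ is meaningful.

```python
from typing import List, Sequence

Vector = Sequence[float]


def inner(a: Vector, b: Vector) -> float:
    """Inner product <a, b> of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def exact_mip(data: List[Vector], q: Vector) -> int:
    """Index of the object in `data` with the largest inner product with q."""
    return max(range(len(data)), key=lambda i: inner(data[i], q))


def satisfies_c_amip(data: List[Vector], q: Vector, candidate: int, c: float) -> bool:
    """Check <t, q> >= c * <t*, q> for the candidate t = data[candidate]."""
    best = inner(data[exact_mip(data, q)], q)
    return inner(data[candidate], q) >= c * best


# Toy collection T (n = 3, d = 2) and query q; the values are made up.
data = [[0.5, 0.0], [0.6, 0.6], [0.0, 2.0]]
q = [1.0, 1.0]

best_index = exact_mip(data, q)
ok_candidate = satisfies_c_amip(data, q, candidate=1, c=0.5)
bad_candidate = satisfies_c_amip(data, q, candidate=0, c=0.5)
```

The linear scan costs $O(nd)$ per query, which is the baseline that the approximate structures discussed in this paper try to avoid.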
In this paper, we convert the $c$-AMIP search problem to the $c_0$-ANN problem. The $c_0$-ANN problem aims to find the nearest neighbour according to the Euclidean distance. The definition of the $c_0$-ANN problem is as follows:

Definition 3. Given an approximation ratio $c_0$ $(c_0 > 1)$ and a query vector $q \in \mathbb{R}^d$, $c_0$-ANN aims to find a data object $t \in T$ that satisfies

$$\|t - q\| \le c_0 \|t^* - q\|,$$

where $t^*$ is the exact nearest neighbour of $q$ in $T$.
LSH is a common method for solving the $c_0$-ANN problem. We describe the LSH paradigm using a nearest-neighbour formulation whose similarity measure is $\mathrm{Sim}(q, p)$. Let $h$ be a hash function that maps an item to a hash value; the corresponding definition is as follows.
Definition 4. A hash family $H$ is called $(S_0, cS_0, p_1, p_2)$-sensitive if, for multidimensional data objects $x$ and $y$, a hash function $h$ drawn from $H$ satisfies:
(1) If $\mathrm{Sim}(x, y) \ge S_0$, then the probability of $h(x) = h(y)$ is at least $p_1$.
(2) If $\mathrm{Sim}(x, y) \le cS_0$, then the probability of $h(x) = h(y)$ is at most $p_2$.
Here $c < 1$ and $p_1 > p_2$.
We adopt the common LSH technology to solve the ANN search problem: similar data objects have a higher probability of receiving the same hash value than objects with lower similarity. Thus, LSH can solve nearest neighbour and similarity problems over multidimensional data even in sublinear time [40].
Furthermore, we transform the AMIP search problem into the nearest neighbour problem via asymmetric locality sensitive hashing (ALSH). There has been some research on ALSH technologies [27,30,31,32,33]. In this paper, we use the QNF transformation [32]. For a data object $t = (o_1, o_2, \cdots, o_d)$ and a query $q = (q_1, q_2, \cdots, q_d)$, the transformation is as follows:

$$P(t) = \left[o_1, o_2, \cdots, o_d;\ \sqrt{M^2 - \|t\|^2}\right], \tag{3}$$

$$Q(q) = \left[\lambda q_1, \lambda q_2, \cdots, \lambda q_d;\ 0\right], \quad \lambda = \frac{M}{\|q\|}. \tag{4}$$

In formulas (3) and (4), the constant $M$ represents the largest Euclidean norm in the data collection $T$. The maximum Euclidean norm may change constantly as more data is collected from IoT devices. In our schema, we assign an appropriate $M$ to each ALSH unit, namely the maximum Euclidean norm within that unit, so that every data object $t$ in the unit satisfies $\|t\| \le M$. Through the QNF transformation, the AMIP search problem can be converted into the nearest neighbour search problem. The following formula is used in the transformation:

$$\|Q(q) - P(t)\|^2 = M^2 + \lambda^2 \|q\|^2 - 2\lambda \langle t, q \rangle. \tag{5}$$

In Equation (5), for a query $q$, $M^2$ and $\lambda^2 \|q\|^2$ are constant, so we have

$$\arg\max_{t \in T} \langle t, q \rangle = \arg\min_{t \in T} \|Q(q) - P(t)\|^2.$$

The problem $\arg\min_{t \in T} \|Q(q) - P(t)\|^2$ is the nearest neighbour search problem, and it can be solved quickly by the L2-LSH technology.

We now present the signed random projections LSH (SRP-LSH) and L2-LSH, whose similarity measures are the correlation similarity and the L2 distance, respectively. For the correlation similarity, let $A$ and $B$ be multidimensional vectors and $\theta$ the angle between them, where $0^\circ \le \theta \le 180^\circ$. With $d(A, B)$ denoting the correlation distance, the correlation similarity is $1 - d(A, B)$, and SRP-LSH can solve the maximum correlation similarity search. The procedure is as follows: first, a random vector $v$ with $v_i \sim N(0, 1)$ is drawn. The random vector determines a hash function $h_v$, which returns a binary result: if $\langle v, x \rangle < 0$, then $h_v(x) = 0$; otherwise $h_v(x) = 1$. The LSH family $H$ is formed by several such random vectors.
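The SRP-LSH procedure just described can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the helper names, the seed, and the two test vectors are assumptions. It draws random Gaussian vectors, applies the sign test, and measures how often two vectors at a 45-degree angle receive the same bit; for sign random projections the collision probability is known to be $1 - \theta/\pi$, which gives 0.75 for this angle.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def make_srp_hash(dim: int, rng: random.Random) -> Callable[[Vector], int]:
    """Draw v with v_i ~ N(0, 1) and return h_v(x) = 0 if <v, x> < 0 else 1."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]

    def h(x: Vector) -> int:
        return 0 if sum(vi * xi for vi, xi in zip(v, x)) < 0 else 1

    return h


def srp_signature(hashes: List[Callable[[Vector], int]], x: Vector) -> Tuple[int, ...]:
    """Concatenate the bits of several SRP hash functions into one signature."""
    return tuple(h(x) for h in hashes)


def collision_rate(x: Vector, y: Vector, trials: int, rng: random.Random) -> float:
    """Fraction of independently drawn hash functions with h(x) == h(y)."""
    same = 0
    for _ in range(trials):
        h = make_srp_hash(len(x), rng)
        same += h(x) == h(y)
    return same / trials


rng = random.Random(0)
x = [1.0, 0.0]
y = [1.0, 1.0]                                   # 45 degrees away from x
rate = collision_rate(x, y, trials=20000, rng=rng)
expected = 1.0 - (math.pi / 4.0) / math.pi       # 1 - theta/pi = 0.75

family = [make_srp_hash(2, rng) for _ in range(8)]
signature = srp_signature(family, x)             # an 8-bit signature for x
```

A signature built from several hash functions, as in `srp_signature`, is the kind of bit string that can serve as a bucket identifier.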
By the SRP-LSH, we can conclude

$$\Pr[h_v(x) = h_v(y)] = 1 - \frac{\theta}{\pi},$$

i.e.,

$$\theta = \pi \left(1 - \Pr[h_v(x) = h_v(y)]\right).$$

Now, we briefly present the indexing schema based on the asymmetric LSH scheme for high-dimensional AMIP search. We also adopt the L2-LSH and SRP-LSH. The indexing features from IoT devices are calculated from the Euclidean norm and the cosine similarity of the data. In more detail, when a data object $t$ arrives, the schema calculates the Euclidean norm of $t$, assigns the feature to a specific block according to that norm, and assigns it to a specific bucket within the block according to the cosine similarity. The block and the bucket together determine the data item's storage unit. When conducting a query, the schema adopts the QNF transformation and searches for the $c$-AMIP results through the L2-LSH, more precisely through QALSH [32]. We have kept the block partition principle of H2-ALSH: the blocks are divided by the Euclidean norm of the data objects with the division ratio. Besides, we consider another factor that determines the inner product, namely the angle between the given query and the data objects. We use SRP-LSH to divide one block into buckets, so the buckets are the minimum storage unit in our schema. The overview of our indexing schema is shown in Figure 1.
When we conduct the AMIP search, we traverse the blocks in order, from the block with the largest Euclidean norm to the block with the smallest. Within one block, we then traverse the buckets from high similarity to low similarity according to the cosine similarity.
In our schema, the calculation focuses on the data objects that can be considered candidates, that is, those with a higher probability of becoming AMIP search results, and the search process finishes when it is no longer necessary to traverse the remaining data objects. Filtering out the unnecessary data objects allows the schema to reach a remarkable time performance.
Our work differs from [32], in which the data objects are treated as static items and all data are divided into blocks by the Euclidean norm only. Our schema considers IoT environments, where the data is updated frequently and we cannot sort a whole static set. Instead, each input object is inserted into our H2SA-ALSH unit when it arrives. The indexing construction does not decrease the accuracy of subsequent queries. Therefore, our indexing schema is more appropriate for IoT scenarios in which the features are dynamically generated by distributed devices and applications.
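The reduction from MIP search to nearest neighbour search in formulas (3) to (5) can be checked numerically. The sketch below assumes the QNF form $P(t) = [t; \sqrt{M^2 - \|t\|^2}]$ and $Q(q) = [\lambda q; 0]$ with $\lambda = M/\|q\|$; the random data, the dimensions, and the function names are illustrative assumptions, and an exact linear scan stands in for the L2-LSH/QALSH index.

```python
import math
import random
from typing import List, Sequence

Vector = Sequence[float]


def inner(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def norm(a: Vector) -> float:
    return math.sqrt(inner(a, a))


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def qnf_data(t: Vector, M: float) -> List[float]:
    """P(t): append sqrt(M^2 - ||t||^2), clamped at 0 against rounding error."""
    return list(t) + [math.sqrt(max(M * M - inner(t, t), 0.0))]


def qnf_query(q: Vector, M: float) -> List[float]:
    """Q(q): scale q by lambda = M / ||q|| and append 0."""
    lam = M / norm(q)
    return [lam * x for x in q] + [0.0]


rng = random.Random(1)
data = [[rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(200)]
q = [rng.uniform(-1.0, 1.0) for _ in range(8)]

M = max(norm(t) for t in data)       # largest Euclidean norm in the collection
lam = M / norm(q)
Qq = qnf_query(q, M)

# Exact MIP in the original space versus exact NN in the transformed space.
mip_index = max(range(len(data)), key=lambda i: inner(data[i], q))
nn_index = min(range(len(data)), key=lambda i: sq_dist(qnf_data(data[i], M), Qq))
```

Both indices identify the same object, because the squared distance equals $2M^2 - 2\lambda\langle t, q\rangle$ and only the last term depends on $t$.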

Indexing Construction
Given a continuous object series $T$ and an incoming object $t_i \in T$, we first calculate the Euclidean norm $\|t_i\|$. To divide the blocks $[S_1, S_2, \cdots, S_K]$ effectively, we introduce the interval rate $b$. Given an AMIP search approximation ratio $c$, the query angle $\beta$ in the bucket, and the ANN approximation ratio $c_0$, the rate $b$ is defined through the quantity $l = 1 - \beta_0 \cdot (\tan\beta + \frac{1}{3}\tan^3\beta)$.
We now explain $b$ and use $S$ to denote blocks. We assign data objects to different blocks and different buckets. There are several buckets $B$ in $S_i$, and different buckets group objects according to their cosine similarity. Every indexing unit has a unique identifier that consists of a block identifier ($S$) and a bucket identifier ($B$). The schema determines the bucket identifier of a data object from the hash family of SRP-LSH. Assuming that the hash family of SRP-LSH uses $k_s$ hash functions, the bucket identifier can be expressed as $k_s$ bits. The bound $M_k$ of a block can be initialized later. All data objects that satisfy $bM_k < \|t_i\| \le M_k$ are assigned to the block $S_k$. As data objects are put into the buckets, the buckets grow. We set a threshold $N_0$: once the number of data objects in a bucket reaches $N_0$, the bucket uses QNF to convert the $d$-dimensional data into $(d+1)$-dimensional data and then builds the QALSH indexing. For buckets whose number of data objects is less than $N_0$, the raw feature is stored directly. When dividing the blocks, the schema determines the first block from the first data object $t_0$ of $T$. The maximum norm $M_{base}$ of this block is $\|t_0\|$, and the block is set as the benchmark block. For subsequent data, we calculate the specific block from the norm $\|t_i\|$ and the benchmark $M_{base}$. The process is presented in Algorithm 1.
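The incremental assignment of an arriving object to a block and a bucket can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's Algorithm 1: the class name `NormAngleIndex`, the choice of letting block $j$ cover norms in $(M_{base} b^{j+1}, M_{base} b^{j}]$ (so that objects larger than the benchmark get negative block numbers), and the omission of the $N_0$ threshold and the QALSH index are all hypothetical simplifications.

```python
import math
import random
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]
UnitKey = Tuple[int, Tuple[int, ...]]   # (block id, k_s-bit bucket id)


def inner(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


class NormAngleIndex:
    """Toy incremental index: blocks by Euclidean norm, buckets by SRP bits."""

    def __init__(self, dim: int, b: float, ks: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.b = b                                   # interval ratio, 0 < b < 1
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(ks)]
        self.m_base: Optional[float] = None          # norm of the first object
        self.units: Dict[UnitKey, List[Vector]] = defaultdict(list)

    def block_id(self, norm: float) -> int:
        # Block j holds norms in (m_base * b**(j + 1), m_base * b**j].
        return math.floor(math.log(norm / self.m_base) / math.log(self.b))

    def bucket_id(self, t: Vector) -> Tuple[int, ...]:
        # One SRP bit per random vector: 0 if <v, t> < 0, else 1.
        return tuple(0 if inner(v, t) < 0 else 1 for v in self.planes)

    def insert(self, t: Vector) -> UnitKey:
        norm = math.sqrt(inner(t, t))
        if norm == 0.0:
            raise ValueError("zero-norm objects are not handled in this sketch")
        if self.m_base is None:
            self.m_base = norm                       # benchmark block from t_0
        key = (self.block_id(norm), self.bucket_id(t))
        self.units[key].append(t)
        return key


index = NormAngleIndex(dim=2, b=0.5, ks=4, seed=0)
k0 = index.insert([1.0, 0.0])    # benchmark object, block 0
k1 = index.insert([3.0, 0.0])    # larger norm, same direction
k2 = index.insert([0.3, 0.0])    # smaller norm, same direction
```

Objects that point in the same direction share a bucket identifier and differ only in their block, which is the separation of norm and angle that the schema relies on.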

Similarity-Aware AMIP Searching
To answer an arbitrary maximum inner product query $q$, we first calculate the Euclidean norm $\|q\|$ and set the current MIP value to $\varphi = -\infty$. Since the first block has the largest maximum norm, it is the most likely to contain the MIP data object. Thus, we traverse the blocks from $S_1$ to $S_K$. Each block contains many buckets, organised by angle similarity. Moreover, the MIP data objects are most likely to have a high cosine similarity with the query $q$, so the buckets are traversed in ascending order of angle.
For a block $S_i$, the AMIP process can be described in three main stages. First, for a query $q$ and block $S_i$, we calculate a stopping bound $ub$. All $t \in S_i$ satisfy $bM_i < \|t\| \le M_i$, and $\langle t, q \rangle = \|t\| \|q\| \cos\beta \le \|t\| \|q\| \le M_i \cdot \|q\|$. For a block, we therefore have $ub = M_i \cdot \|q\|$. In the AMIP algorithm, we consider the effect of both the data norm $\|t\|$ and the angle $\beta$ between the query $q$ and $t \in D$. Since $\langle t, q \rangle = \|t\| \|q\| \cos\beta$, within each bucket we use SRP-LSH to estimate the angle $\beta^*$, and hence the cosine similarity, between $q$ and $t \in B_i$. If the similarity of a bucket satisfies the given similarity, the schema conducts the AMIP search process in it. The cosine similarity estimate introduces errors, and in a later section we quantify the specific error. Then, we use these two conditions, $ub$ and the given cosine similarity, to run AMIP in the buckets.
(1) Before starting to traverse the block $S_i$, the schema stops traversing the remaining blocks if $ub \le \varphi$, and the algorithm returns the AMIP data object.
(2) If $ub > \varphi$, we traverse the buckets in the block; if the cosine similarity of a bucket does not satisfy the given similarity, the schema skips that bucket and moves on to the other buckets.
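The block-level stopping rule based on $ub = M_i \cdot \|q\|$ can be illustrated in isolation. The sketch below is a simplification under stated assumptions: blocks are built in one batch from a static, norm-sorted list (rather than incrementally), each visited block is scanned exactly, and the bucket-level similarity filter and the QALSH lookup are left out. The function names and the random data are hypothetical. Because the omitted steps are the only approximate ones, this simplified search returns the exact MIP value.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]
Block = Tuple[float, List[Vector]]      # (M_i, objects with b*M_i < ||t|| <= M_i)


def inner(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def norm(a: Vector) -> float:
    return math.sqrt(inner(a, a))


def build_blocks(data: List[Vector], b: float) -> List[Block]:
    """Group objects into norm blocks, largest norms first."""
    blocks: List[Block] = []
    for t in sorted(data, key=norm, reverse=True):
        if blocks and norm(t) > b * blocks[-1][0]:
            blocks[-1][1].append(t)          # still inside (b*M_i, M_i]
        else:
            blocks.append((norm(t), [t]))    # t opens a new block with M_i = ||t||
    return blocks


def pruned_mip(blocks: List[Block], q: Vector) -> Tuple[Optional[Vector], float, int]:
    """Scan blocks in order and stop once ub = M_i * ||q|| <= phi."""
    phi, best, visited = -math.inf, None, 0
    q_norm = norm(q)
    for m_i, members in blocks:
        ub = m_i * q_norm
        if ub <= phi:
            break                            # no remaining block can beat phi
        visited += 1
        for t in members:
            ip = inner(t, q)
            if ip > phi:
                phi, best = ip, t
    return best, phi, visited


rng = random.Random(2)
data = []
for _ in range(500):
    scale = rng.uniform(0.1, 5.0)            # spread the norms over several blocks
    data.append([scale * rng.gauss(0.0, 1.0) for _ in range(4)])
q = [rng.gauss(0.0, 1.0) for _ in range(4)]

blocks = build_blocks(data, b=0.5)
best, phi, visited = pruned_mip(blocks, q)
```

Comparing `visited` with `len(blocks)` shows how many blocks the bound allowed the search to skip for this query.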
In the process of cosine similarity searching, we apply hash banding to improve the calculation accuracy. In detail, the identifier of a bucket is represented by $k_s$ bits. With hash banding, the $k_s$ bits are divided into $k_s/b_s$ bands, and each band has $b_s$ bits. For a query $q$, the SRP-LSH hash functions compute $k_s$ bits, which are also divided into $k_s/b_s$ bands. If one of the $k_s/b_s$ bands of $q$ is the same as the corresponding band of the bucket, we say that a hash similarity collision occurs, and the angle meets our calculation requirement. The complete AMIP searching algorithm is described in Algorithm 2.

Algorithm 1
Input: a time series $T$ with objects $t_1, t_2, \cdots, t_k$, an interval ratio $b$, and a threshold $N_0$
Output: the number $K$ of disjoint sets, and the $K$ disjoint sets with blocks $\{S_1 = \{B_1, \cdots\}, S_2 = \{B_1, \cdots\}, \cdots, S_K = \{B_1, \cdots\}\}$
k = 1;
Compute the bucket $B_j$ of the block $S_i$ using the SRP-LSH hash family;
Build hash tables for $B_j$ using ALSH;
End
If $|B_j| > N_0$ then
Insert into the QALSH indexing of $B_j$;
End
End
$K = |S|$;
Return $K$, $\{S_1 = \{B_1, \cdots\}, S_2 = \{B_1, \cdots\}, \cdots, S_K = \{B_1, \cdots\}\}$.

Accuracy Analysis
Proof. According to [33,37], the probability that QALSH returns a $c_0$-ANN result is at least $(1/2 - 1/e)$. If we fix the QALSH error rate at $1/e$, then the AMIP algorithm that searches for the MIP will return a

We first derive the expression $\langle t, q \rangle / \langle t^*, q \rangle$, assuming that $t^*$ is the MIP result for a query $q$ and that, in the block $S_i$, $bM_i < \|t^*\| \le M_i$ and $\lambda = M_i / \|q\|$. According to the previous formula, we have

As with $c_0$-ANN, according to [33], for $Q(q)$ and $P(t)$, QALSH returns a $c_0^2$-ANN result, that is, $\|Q(q) - P(t)\| / \|Q(q) - P(t^*)\| \le c_0^2$. Let $\beta^*$ be the angle between $t^*$ and $q$. Combining the above formulas, we have

By SRP-LSH, we know

Now we calculate $E(1/\cos\beta^*)$. Assume that $\alpha$ is the angle variable that changes as a threshold for a query and that $\beta$ represents the angle between $q$ and $t$, where $t \in M_i$ and $\alpha \in [0, \pi]$. For the theoretical analysis, we assume that the angles of the data items obey the uniform distribution, i.e., $\Pr[\beta > \alpha] = 1 - \Pr[\beta \le \alpha] = 1 - \alpha/\pi$, and that $\beta$ is the angle of the smaller-similarity bucket traversed in $M_i$ for $q$. Then, we assume that the number of data items for a block $M_i$ is

Thus, we can get the cumulative density function of $\alpha$ as follows:

Also, we can calculate the derivative of $F_\beta(\alpha)$ to get the probability density function:

Assuming $\beta_0$ as the threshold, $\beta \in [0, \beta_0]$, we have

Finally, we have

Let $l$ be $(1 - \beta_0 \cdot (\tan\beta + \frac{1}{3}\tan^3\beta))$. We can depict the

Complexity Analysis
In this section, we analyse the space and time complexity of our algorithm.

Theorem 6. Given an approximation ratio $c$ $(0 < c < 1)$ for a $c$-AMIP search, we use $O(nd + n \log n)$ space to construct the indexing structure, and a $c$-AMIP object search costs at most $O(n \log n)$ time.
Proof. The storage structure of the H2SA-ALSH does not differ essentially from that of the H2-ALSH, and we also use QALSH to store and index the data. Algorithm 1 has two parts of space overhead: the space cost of the arrived data $T$, which is $O(nd)$, and the space cost of the LSH indexing (QALSH). According to H2-ALSH [33], the space overhead of the QALSH hash tables is $O(n \log n)$. Thus, the space overhead of Algorithm 1 is $O(nd + n \log n)$. To answer a $c$-AMIP query, in the worst case, Algorithm 2 needs to check the objects in all disjoint units; the schema then searches all the units, and the search costs $O(n \log n)$ query time.
In more detail, the $O(n \log n)$ query time represents the worst case. For real data sets, the H2SA-ALSH filters out most of the data units, whether the data is randomly distributed or skewed. The H2SA-ALSH stops within the first few blocks, and in one block the schema only searches a few buckets. Therefore, the average query time of a $c$-AMIP object search is much better than the worst case.

Experimental Evaluation
We conduct experiments on three real-world data sets (Mnist [41], Sift [42], and YearPredictionMSD [43], referred to as Year) and compare our algorithm with three state-of-the-art AMIP algorithms. The experiments mainly evaluate the precision of the AMIP results, the time efficiency of constructing the index, and the query efficiency. We run all the experiments on an Intel Xeon E5-2620 machine with eight cores and 32 GB of memory. All the algorithms in the experiments are implemented in C++ and run on CentOS 7.
The main evaluation metrics of the experiments are the recall and precision of the AMIP results, the overall approximation ratio, and the running time of the AMIP search. To evaluate the performance of our algorithm, we compare our approach with Simple-ALSH [27], H2-ALSH [32], and Sign-ALSH [31]. The experiments measure the recall and precision of all methods for the 0.5-k-AMIP search while varying k from 1 to 10. Thus, we get the top-k MIP objects by 0.5-AMIP. Figures 2-4 show the recall and precision curves of the evaluation. The curves in Figures 2-4 show that the H2SA-ALSH outperforms the other algorithms in top-k searching (k = 1, 2, 5, 10), which means that the H2SA-ALSH obtains more precise search results than the other algorithms (Simple-ALSH, Sign-ALSH, and H2-ALSH).
Furthermore, we use the metric of approximation ratio to evaluate the precision of the search results. For the approximate c-k-AMIP search, we set the given approximation ratio c to be 0.5. Then, we compare the approximation ratios of our algorithm with other algorithms. The comparison is conducted under c-k-AMIP searching using top-k searching (k = 1, 2, 5, and 10).
The approximation ratio is expressed as $\langle o, q \rangle / \langle o^*, q \rangle$, whose value is at most 1. The overall approximation ratio is the average approximation ratio over all queries and reflects the precision. The greater the ratio, the better the AMIP search results. As shown in Figure 5, the overall approximation ratios of all algorithms are higher than the given approximation ratio of 0.5. Our algorithm has a better approximation ratio than all the other algorithms, which means that our algorithm reaches better precision for an arbitrary query.
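For reference, the overall approximation ratio described above can be computed as in the following Python sketch. The function names and the toy numbers are illustrative assumptions. The recall helper uses the common definition of recall for top-k search (the fraction of the exact top-k objects that appear in the returned list); the paper does not spell out its formula, so that definition is an assumption as well.

```python
from typing import Hashable, List, Sequence, Tuple


def overall_ratio(pairs: List[Tuple[float, float]]) -> float:
    """Average of <o, q> / <o*, q> over (returned, exact) inner-product pairs."""
    return sum(returned / exact for returned, exact in pairs) / len(pairs)


def recall_at_k(returned_ids: Sequence[Hashable], exact_ids: Sequence[Hashable]) -> float:
    """Fraction of the exact top-k objects that appear in the returned top-k."""
    return len(set(returned_ids) & set(exact_ids)) / len(exact_ids)


# Two toy queries: the first returns half of the best inner product,
# the second returns the exact MIP object.
pairs = [(1.0, 2.0), (3.0, 3.0)]
ratio = overall_ratio(pairs)

# One toy top-3 query in which two of the three exact results are found.
recall = recall_at_k(returned_ids=[1, 2, 3], exact_ids=[2, 3, 4])
```

Both helpers assume positive exact inner products and non-empty inputs.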
To examine the query efficiency, we evaluate the performance of our algorithm on approximate object searching. We compare the average computation time per query with that of the latest H2-ALSH algorithm. Figure 6 shows that the average query time of our algorithm is lower than that of H2-ALSH over the three data sets. On the Year dataset in particular, the query efficiency of our approach improves by nearly 60% compared with H2-ALSH.

Conclusion
In this paper, we propose a novel indexing and searching schema, termed H2SA-ALSH, for IoT environments. The H2SA-ALSH constructs an index for multidimensional data objects according to the Euclidean norm and cosine similarity without collecting the raw data objects. At the same time, approximate disturbance elements are built into the extracted indexing features. By collecting and indexing the disturbed features on the fly, we design a c-k-AMIP searching algorithm that achieves accurate and efficient maximum inner product searching and top-k searching for a given vector. Experiments on real-world data sets demonstrate the accuracy and efficiency improvements of our approach compared with three AMIP-based algorithms.

Data Availability
The authors declare that all the data and materials in this manuscript are available.

Conflicts of Interest
The authors declare that they have no conflicts of interest.