A Type-Based Blocking Technique for Efficient Entity Resolution over Large-Scale Data

In data integration, entity resolution (ER) is an important technique for improving data quality. Existing research typically assumes that the target dataset contains only string-type data and uses a single similarity metric. For larger high-dimensional datasets, redundant information needs to be verified using traditional blocking or windowing techniques. In this work, we propose a novel ER method using a hybrid approach that includes type-based multiblocks, a varying window size, and more flexible similarity metrics. In our new ER workflow, we reduce the search space for entity pairs through constraints on redundant attributes and matching likelihood. We develop a reference implementation of the proposed approach and validate its performance using a real-life dataset from an Internet of Things project. We evaluate the data processing system using five standard metrics: effectiveness, efficiency, accuracy, recall, and precision. Experimental results indicate that the proposed approach could be a promising alternative for entity resolution and could feasibly be applied to real-world data cleaning for large datasets.


Introduction
Entity resolution (ER), also known as record linkage, entity reconciliation, or merge/purge, is the procedure of identifying a group of entities (records) representing the same real-world entity [1][2][3]. Generally speaking, ER has become the first step of data processing and is widely used in many application domains, such as digital libraries, smart cities, financial transactions, and social networks. Especially with the rapid development of Internet of Things technology, it is common for data to contain a huge amount of inaccurate information and different types of ambiguities [4]. Accordingly, developing proper ER techniques to clean and integrate data collected from multiple sources has received much attention [1,5,6]. The ultimate goal of ER technologies is to improve data quality or to enrich data to facilitate more detailed data analysis.
Researchers have proposed many automatic methods and techniques to resolve the ER problem across multiple sources, that is, to detect whether records refer to the same entity and can therefore be merged. Wang and Madnick [7] proposed a rule-based method that uses rules and unique key attributes developed by experts. However, this kind of method has some additional restrictions; for example, the result of the rules must always be correct. Bilenko and Mooney [8] proposed an adaptive duplicate detection method, but it mainly focuses on string similarity measures. Recently, based on different similarity metrics, some researchers have proposed machine learning methods to classify entity pairs as "match," "non-match," or "possible-match," such as the automatic record linkage tool proposed by Christen [9], which uses support vector machine classification. However, this kind of method requires a large amount of manually labelled data, and if entity pairs are classified as "possible-match," a manual review process is required to assess and classify them as "match" or "non-match." Thus, this is usually a time-consuming, cumbersome, and error-prone process [10].
The work mentioned above mainly focused on the development of automatic algorithms [11]. A general procedure for solving the ER problem includes calculating the similarity of all entity pairs by means of similarity metrics [12], such as the Jaccard similarity coefficient and the Levenshtein distance. Entities whose similarity values are higher than the specified threshold are considered to be the same entity. However, on large datasets, the performance deteriorates drastically because of the large number of attribute comparisons [2]. To solve this problem, blocking and windowing mechanisms [2,10,13,14] have been introduced. The goal of a blocking scheme is to group entities into blocks by a clustering technique so that comparisons are restricted to entities within the same block. Windowing methods, such as the sorted neighborhood method (SNM) [13,15] or the multipass sorted neighborhood method (MPN) [16,17], sort entities according to the keywords and then slide a fixed-size window over them to compare the entities within the window.
Algorithmic approaches have been improving in quality, but there are still some problems that have not been fully studied [1]. First, existing research typically assumes that the target dataset contains only string-type data and uses a single similarity metric [1,11,18,19]. Second, for large high-dimensional target datasets, redundant information is verified, which not only increases the computational complexity but may also deteriorate the quality of ER results [20]. Third, the common drawback of most blocking techniques is that they put complete entities in one block (or multiple blocks) according to the block keywords, which leads to a lack of the flexibility needed to compose higher-performance ER workflows in combination with specialized, complementary methods [5,21]. Fourth, the performance of typical windowing methods (SNM or MPN) depends strongly on the size of the sliding window [21], but they often employ a fixed window size. The larger the fixed window size, the more comparisons are executed and the lower the overall efficiency; however, a small window may lead to a high number of missed matches (e.g., the closest entities are not placed in the same window) and to low effectiveness [2,21,22,23].
To address the above problems, this paper introduces a novel ER method. First, we introduce a novel blocking scheme based on attribute value types. In these different types of blocks, we adopt diverse similarity metrics to achieve maximum efficiency gains, because even though a metric may show robust and high performance for one type of data [24,25], it may perform poorly on others [26][27][28]. Second, we provide an attribute clustering method, a novel and effective blocking approach, which is a preprocessing scheme that resolves ER with significantly lower redundancy and higher efficiency. Its core is to divide attribute names into nonoverlapping clusters according to the similarity or correlation of the attribute values. Third, we introduce a comparison mechanism that uses a dynamically adjustable window size to specify the processing order of all individual comparisons in each block. It aims at identifying the closest pairs of entities with significantly fewer comparisons while maintaining the original level of effectiveness. Fourth, during the ER completion step, we adopt a weighted undirected graph to gather the output of each block and use an improved weighting and pruning scheme to enhance its efficiency. Finally, we assess the performance of our methods using a real-life dataset, and the experimental results support their effectiveness.
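The general procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the token-based Jaccard coefficient, the threshold of 0.5, the sample records, and the function names `jaccard` and `naive_er` are hypothetical and not taken from the paper.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient of the whitespace-token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def naive_er(records: list, threshold: float) -> list:
    """Compare every pair of records and keep those above the threshold.

    This performs n * (n - 1) / 2 comparisons, which is the cost that
    blocking and windowing try to avoid.
    """
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if jaccard(records[i], records[j]) > threshold
    ]


# Hypothetical sample records.
records = [
    "john smith urumqi",
    "smith john urumqi",
    "mary jones kashgar",
]
matches = naive_er(records, threshold=0.5)
```

Because every pair is compared, the number of comparisons grows quadratically with the number of records, which motivates the blocking and windowing mechanisms discussed next.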

The Overall Framework
The overall framework of the proposed ER method is shown in Figure 1. It consists of three parts: the first part

Type-Based Blocking Approach
At the core of our approach lies the notion of type-based blocking (defined as Definition

The attribute vector is defined as $A = \{A_1, A_2, \ldots, A_m\}$, and $A_l$ ($1 \le l \le m$) is used to represent the $l$th attribute. Accordingly, the value of $A_l$ in entity $r_i$ is represented as

Definition 3. (Blocking Graph) Given an entity collection closestMap, the undirected blocking graph $G_e = \{V_e, E_e, WS\}$ is generated by it, where $V_e$ is the set of its nodes, $E_e$ is the set of its undirected edges, and $WS$ is the weighting scheme that determines the weight of every edge.
3.1. Splitting Attributes to Different Blocks. The comparison between attribute values is an important task for resolving the ER problem. A variety of methods have been developed for this purpose [29][30][31][32], and they typically rely on string comparison techniques. However, the dataset might contain other types of data, such as numerical, enumeration, and date, and there is still much work to be done on these types of data [33][34][35].
The purpose of splitting attributes into different blocks is to facilitate the introduction of a variety of flexible similarity metrics to identify redundant attributes or closest entities. For instance, consider the entities (Jimi, F) and (Jimi, M), in which "F" and "M" represent "female" and "male," respectively, for the gender attribute. The similarity between these two entities obtained by traditional methods, such as the Levenshtein distance [36], is very high. However, these two entities represent two individuals with the same name but different genders. The attribute gender is an enumeration type, and the similarity of an enumeration type should be calculated by testing for equality rather than by the methods often used for string-type data. The functionality of splitting attributes into different blocks is outlined in Algorithm 1.
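The (Jimi, F) versus (Jimi, M) example can be worked through numerically. The sketch below is illustrative only: normalizing the edit distance by the longer string length and joining the attributes with a space are assumptions, and the helper names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def string_similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1] by the longer length (an assumption)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def enum_similarity(a: str, b: str) -> float:
    """Enumeration values either agree or they do not."""
    return 1.0 if a == b else 0.0


e1 = ("Jimi", "F")
e2 = ("Jimi", "M")

# Treating the whole entity as one string hides the gender conflict:
# one edit over six characters gives a similarity of about 0.83.
flat_sim = string_similarity(" ".join(e1), " ".join(e2))

# A type-aware comparison keeps the two attributes apart.
name_sim = string_similarity(e1[0], e2[0])   # identical names
gender_sim = enum_similarity(e1[1], e2[1])   # different enumeration values
```

With a string metric over the whole record the pair looks like a likely match, whereas the per-type comparison exposes a complete disagreement on the enumeration attribute.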

3.2. Attribute Clustering Method. The increase of data dimensionality brings new challenges to the efficiency and effectiveness of many existing ER methods. In this article, we introduce an attribute clustering method to remove redundant attributes. As a result, we obtain a more compact and easily interpretable representation of the target concept. The core of the proposed attribute clustering method is to divide attributes into disjoint clusters according to the similarity or correlation of the attribute values within the same type block, which is based on the concept of redundancy. This scheme offers a better balance between computational cost and precision for resolving ER [37]. In order to facilitate the discussion, the following definitions are introduced.

3.2.1. String-Type Block. The attribute clustering method for the string type depends on two parts: (1) the model that consistently represents the values of an attribute and (2) the similarity metric that captures the common pattern between two sets of attribute values. In this paper, the weight of each term is obtained by the TF-IDF (term frequency-inverse document frequency) algorithm [38]. Accordingly, the two sets of attribute values can be represented as $v_i = \{w_{i1}, w_{i2}, \ldots, w_{in}\}$ and $v_j = \{w_{j1}, w_{j2}, \ldots, w_{jn}\}$, and $\theta$ is the angle between $v_i$ and $v_j$. Thus, the similarity between two sets of attribute values with cosine similarity can be defined as

$$\mathrm{sim}(v_i, v_j) = \cos\theta = \frac{\sum_{k=1}^{n} w_{ik} w_{jk}}{\sqrt{\sum_{k=1}^{n} w_{ik}^{2}}\,\sqrt{\sum_{k=1}^{n} w_{jk}^{2}}}, \tag{1}$$

where $\mathrm{sim}(v_i, v_j)$ takes values in the interval [0, 1], with higher values indicating higher similarity between the given attributes. If it is higher than an established threshold, the given attributes are considered redundant.
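The string-type redundancy check can be sketched as follows. This is a minimal illustration, not the authors' implementation: each attribute's values are treated as one document, the smoothed IDF term `log((1 + N) / (1 + df)) + 1` is an assumption (the paper does not state its TF-IDF variant), and the attribute names and values are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs: list) -> list:
    """Return one sparse TF-IDF vector (term -> weight) per token list."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u: dict, v: dict) -> float:
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical attributes: the tokens of all values of each attribute.
attributes = {
    "driver_name": ["li", "wei", "zhang", "min"],
    "owner_name": ["li", "wei", "zhang", "min"],
    "station": ["north", "gate", "south", "gate"],
}
names = list(attributes)
vecs = dict(zip(names, tfidf_vectors([attributes[n] for n in names])))

sim_redundant = cosine(vecs["driver_name"], vecs["owner_name"])
sim_unrelated = cosine(vecs["driver_name"], vecs["station"])
```

Two attributes carrying the same tokens score close to 1 and would be flagged as redundant for any threshold below that, while attributes with no shared tokens score 0.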
where $o_{ij}$ is the observed frequency and $e_{ij}$ is the expected frequency. The calculation of $e_{ij}$ is given in (3), where $n$ is the number of data tuples and $\mathrm{count}(A = a_i)$ is the number of tuples in which the value $a_i$ appears in attribute $A$.
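Equations (2) and (3) are not reproduced in the text above. Assuming they take the standard Pearson chi-square form, $\chi^2 = \sum_i \sum_j (o_{ij} - e_{ij})^2 / e_{ij}$ with $e_{ij} = \mathrm{count}(A = a_i) \times \mathrm{count}(B = b_j) / n$, the statistic could be computed as in the following sketch. The function name and the sample columns are hypothetical.

```python
from collections import Counter


def chi_square(a: list, b: list) -> float:
    """Pearson chi-square statistic for two equally long nominal columns."""
    n = len(a)
    count_a = Counter(a)            # count(A = a_i)
    count_b = Counter(b)            # count(B = b_j)
    observed = Counter(zip(a, b))   # o_ij: joint frequency of (a_i, b_j)
    chi2 = 0.0
    for va, ca in count_a.items():
        for vb, cb in count_b.items():
            expected = ca * cb / n  # e_ij under independence
            chi2 += (observed[(va, vb)] - expected) ** 2 / expected
    return chi2


# Hypothetical columns: perfectly associated versus independent.
associated = chi_square(["x", "x", "y", "y"], ["p", "p", "q", "q"])
independent = chi_square(["x", "x", "y", "y"], ["p", "q", "p", "q"])
```

A large statistic relative to the critical value at the chosen confidence level suggests that the two attributes are correlated, so one of them is a candidate for removal; the lookup of the critical value is omitted here.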
For the general numerical attribute, the correlation is measured by the Pearson correlation coefficient [40], which is shown in (4):

$$r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A \sigma_B}, \tag{4}$$

where $a_i$ and $b_i$ are the values of attributes $A$ and $B$ in the $i$th tuple, $\sigma_A$ and $\sigma_B$ are the standard deviations of $A$ and $B$, and $\bar{A}$ and $\bar{B}$ are their mean values. $r_{A,B}$ is their correlation coefficient, whose value space is [−1, 1]. If $r_{A,B}$ is bigger than the established threshold, one of the given attributes can be removed as redundant [34].
For the ordinal numerical attribute, the procedure has three steps: (1) we sort the values of attribute $A$ and divide them into the sets $M_1, M_2, \ldots, M_{100}$, also known as percentiles [41,42]; attribute $B$ is processed in the same way as $A$; (2) we count the number of attribute values that fall into each interval and use this count $ca_i$ to replace $M_i$ ($1 \le i \le 100$), so that the replacement sets $\{ca_1, ca_2, \ldots, ca_{100}\}$ and $\{cb_1, cb_2, \ldots, cb_{100}\}$ are obtained; (3) $\{ca_1, ca_2, \ldots, ca_{100}\}$ and $\{cb_1, cb_2, \ldots, cb_{100}\}$ are treated as general numerical attributes to calculate their correlation.
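The Pearson-based redundancy test for general numerical attributes can be illustrated with a short sketch. It assumes population standard deviations (dividing by the number of tuples); the column names and values are hypothetical, and the threshold of 0.80 follows the setting reported for numerical blocks in the experiments.

```python
import math


def pearson(a: list, b: list) -> float:
    """Pearson correlation coefficient of two equally long numeric columns.

    Raises ZeroDivisionError if either column is constant.
    """
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Population standard deviations, matching the n in the denominator.
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    cov_sum = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cov_sum / (n * sd_a * sd_b)


# Hypothetical columns: the price is a fixed multiple of the volume.
volume_l = [10.0, 20.0, 30.0, 40.0]
price = [65.0, 130.0, 195.0, 260.0]
reversed_rank = [4.0, 3.0, 2.0, 1.0]

r_same = pearson(volume_l, price)
r_opposite = pearson(volume_l, reversed_rank)
is_redundant = r_same > 0.80
```

Here the price column is fully determined by the volume column, so the coefficient is close to 1 and one of the two attributes can be dropped.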

Journal of Sensors

3.2.4. Enumeration-Type Block. For the enumeration-type attributes, the correlation is measured in the same way as for the nominal numerical attribute.
The detailed functionality of the proposed attribute clustering method is described in Algorithm 2, which is based on the method proposed by Papadakis et al. [43].
In principle, the attribute clustering method proposed in this paper works as follows: each attribute name in the input map is associated with the most similar or most strongly correlated attribute name (lines c-h). The connect between two attribute names is stored in a data structure on the condition that the similarity or correlation of their values exceeds the threshold δ (line f). The transitive closure of the stored connects is then computed to form the basis for partitioning attribute names into clusters (line i). As a result, each cluster is taken from one connected component of the transitive closure (line j), and the singleton clusters are removed from C (lines k-m).
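The steps just described can be sketched with a union-find structure standing in for the transitive closure. This is an illustrative sketch rather than Algorithm 2 itself: the Jaccard-based `jaccard_values` similarity, the sample attributes, and the function names are assumptions, and any per-type similarity or correlation measure could be passed in instead.

```python
def jaccard_values(a: list, b: list) -> float:
    """Stand-in similarity: Jaccard coefficient of two value sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def cluster_attributes(values: dict, similarity, delta: float) -> list:
    """Cluster attribute names whose best match exceeds the threshold delta."""
    names = list(values)
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the representative, halving paths on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for name in names:
        # Associate each attribute with its most similar other attribute.
        best, best_sim = None, delta
        for other in names:
            if other == name:
                continue
            s = similarity(values[name], values[other])
            if s > best_sim:
                best, best_sim = other, s
        if best is not None:
            # Storing the connect; merging sets yields the transitive closure.
            parent[find(name)] = find(best)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), set()).add(name)
    # Singleton clusters are dropped.
    return [c for c in clusters.values() if len(c) > 1]


# Hypothetical string-type block.
block = {
    "plate": ["A1", "B2", "C3"],
    "plate_no": ["A1", "B2", "C3"],
    "station": ["S1", "S2", "S1"],
    "color": ["red", "blue", "red"],
}
redundant_clusters = cluster_attributes(block, jaccard_values, delta=0.76)
```

Each returned cluster groups attribute names that carry nearly the same information, so all but one attribute per cluster can be treated as redundant.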
3.3. The Closest Entity Detection. SNM was first proposed in the mid-1990s [44]. Its basic idea is to sort the dataset (entity set) according to the sort keywords and to sequentially move a fixed-size window over the sorted entities. Candidate entity pairs are then generated only from entities within the same window. This technique reduces the complexity from O(n × n) to O(n × w), where n is the number of input entities and w is the window size. The linear complexity makes SNM more robust against load balancing problems [16,17]. However, this method still has room for improvement, such as the following:
(a) When some attribute values are missing or the length difference between the attribute values is large, the weighted summation method for calculating the similarity between entities is no longer appropriate.
(b) The interference caused by redundant attributes increases the cost of comparison and affects the quality of ER results [20].
(c) The comparison strategy in SNM is that when each new entity enters the current window, it needs to be compared with the previous w − 1 (w is the window size) entities to find "matching" entities. However, it is difficult to determine the sliding window size [2].
(d) Real-life datasets often contain a variety of data types, such as enumeration, numerical, date, and string, but the traditional SNM usually assumes that the dataset contains only string-type data and employs a single similarity metric based on the string type [11].
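For reference, the classical fixed-window SNM described above can be sketched as follows. The function name, the sort key, and the sample entities are hypothetical; the sketch only generates candidate pairs and leaves the similarity comparison itself out.

```python
def sorted_neighborhood(entities: list, key, window_size: int) -> list:
    """Return candidate index pairs produced by a fixed-size sliding window."""
    # Sort entity indices by the sort key.
    order = sorted(range(len(entities)), key=lambda i: key(entities[i]))
    pairs = []
    for pos, idx in enumerate(order):
        # Each entity entering the window is paired with the previous w - 1.
        for prev_idx in order[max(0, pos - window_size + 1):pos]:
            pairs.append((prev_idx, idx))
    return pairs


# Hypothetical entities, sorted by their own string value.
entities = ["delta", "alpha", "charlie", "bravo", "echo"]
candidates = sorted_neighborhood(entities, key=lambda e: e, window_size=3)
```

With five entities and a window of three, the sketch yields 7 candidate pairs instead of the 10 pairs of an exhaustive comparison; the gap widens as n grows while w stays fixed.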
To solve the above problems, this paper proposes the multiblocking sorted neighborhood (MBN) algorithm. In order to facilitate the discussion, the definition of valid weight is introduced.

Definition 4. (Valid Weight) When comparing the attribute values $r_{il}$ and $r_{jl}$, if one of them is missing or their length ratio is lower than the specified threshold, the valid weight $\vartheta_{ij}^{l}$ is set to 0. The length ratio is given in (5) and is adopted only in string- and enumeration-type blocks:

$$\mathrm{lenRatio} = \frac{len_{big}}{len_{small}}, \tag{5}$$

$$len_{big} = \max\left(\mathrm{length}(r_{il}), \mathrm{length}(r_{jl})\right), \tag{6}$$

$$len_{small} = \min\left(\mathrm{length}(r_{il}), \mathrm{length}(r_{jl})\right). \tag{7}$$

3.3.1. The Similarity Metrics. As mentioned above, the data types in this article include string, numeric, date, and enumeration. Attribute value similarity metrics are discussed based on these four types.
(1) String-Type Attribute. Cosine similarity [45][46][47] is used to calculate the similarity of attribute values $x_{il}$ and $x_{jl}$, which is denoted $\mathrm{sim\_att}_{string}(x_{il}, x_{jl})$. Accordingly, the similarity of entities $r_i$ and $r_j$ is given by (8).
(2) Numeric-Type Attribute. The numeric type is divided into nominal, ordinal, and general numerical as mentioned above.
Diversity is adopted for the numeric type. The diversity of attribute values $x_{il}$ and $x_{jl}$ from attribute $A_l$ ($1 \le i, j \le n$, $1 \le l \le m$) is denoted $div_{ij}^{l}$ and is calculated as follows:
(a) $A_l$ is one of the general numerical attributes: $div_{ij}^{l} = |x_{il} - x_{jl}| / (\max A_l - \min A_l)$, where $\max A_l$ is the maximum value of $A_l$ and $\min A_l$ is the minimum one.
(b) $A_l$ is one of the ordinal numerical attributes, and $A_l$ has the ordered states $M$, which can be expressed as $\{1, \ldots, M_k\}$. The corresponding ranking number $s_{il}$ (when $x_{il} \in M_r$, $1 \le r \le k$, the ranking number of $x_{il}$ is $r$) replaces $x_{il}$, and $z_{il} = (s_{il} - 1)/(M_r - 1)$ is calculated accordingly; next, $s_{il}$ is replaced by $z_{il}$. Finally, $z_{il}$ is processed as in step (a).
(c) $A_l$ is one of the nominal attributes: if

Consequently, the diversity of entities $r_i$ and $r_j$ is defined in (9), and the similarity between them is shown in (10), where $r$ is the total number of attributes in this numeric block.

$$\mathrm{sim\_ent}_{num}(r_i, r_j) = 1 - \mathrm{div\_ent}(r_i, r_j). \tag{10}$$

(3) Enumeration-Type Attribute. Because the possible values of an enumeration attribute are a set of predefined specific symbols, there is usually no implicit semantic relationship between them. This article uses equality or inequality to measure the similarity between two enumerated attribute values, as shown in (11). Accordingly, the similarity of entities $r_i$ and $r_j$ is obtained by (12), where $s$ is the number of attributes in the enumeration-type block.
(4) Date-Type Attribute. The processing of date-type attributes first divides the attribute values into a set of ordered states $\{1, \ldots, M_k\}$. Correspondingly, the similarity between $x_{il}$ and $x_{jl}$ from the date attribute $A_l$ is defined as (13), and the similarity of entities $r_i$ and $r_j$ is calculated by (14).
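The general numerical case above can be sketched as follows. The attribute-level diversity follows the range-normalized absolute difference given in step (a), and the entity-level similarity follows (10). Because (9) is not reproduced above, aggregating the attribute diversities by an unweighted mean is an assumption of this sketch, as are the function names and sample values.

```python
def numeric_diversity(x: float, y: float, lo: float, hi: float) -> float:
    """Range-normalized absolute difference of two values of one attribute."""
    if hi == lo:
        return 0.0  # a constant attribute cannot distinguish entities
    return abs(x - y) / (hi - lo)


def numeric_entity_similarity(r_i: tuple, r_j: tuple, ranges: list) -> float:
    """One minus the mean attribute diversity (the mean is an assumption)."""
    divs = [
        numeric_diversity(x, y, lo, hi)
        for x, y, (lo, hi) in zip(r_i, r_j, ranges)
    ]
    return 1.0 - sum(divs) / len(divs)


# Hypothetical numeric block with two attributes and their (min, max) ranges.
ranges = [(0.0, 100.0), (1.0, 5.0)]
r_i = (20.0, 3.0)
r_j = (30.0, 3.0)
sim = numeric_entity_similarity(r_i, r_j, ranges)
```

The first attribute differs by a tenth of its range and the second is identical, so the mean diversity is 0.05 and the entity similarity is 0.95.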
3.3.2. The Varying Window Sizing Strategies. The window size $ws$ in traditional SNM is fixed, and the challenge is how to select a reasonable value for $ws$. A window that is too large increases the comparison cost, whereas one that is too small may lead to missed matches.
In this paper, depending on the type-based blocks, we use different strategies to dynamically adjust the window size in these blocks.
For the string-type block, we adopt the method proposed in [15] to dynamically adjust the window size based on the number of closest entities in the current window, as defined in (15), where $ws_i$ is the window size of the $i$th window, $ws_{\min}$ and $ws_{\max}$ are the minimum and maximum window sizes, and $n_{actclo}$ is the number of closest entities. However, [15] does not present how to calculate the minimum and maximum window sizes. In this paper, we obtain them with the following steps: (1) selecting the key attribute; (2) clustering the values of the key attribute based on the cosine similarity metric; (3) setting $ws_{\min}$ to the number of elements in the smallest cluster and $ws_{\max}$ to the number of elements in the largest cluster.
For the numerical-type and date-type blocks, the conditions for expanding the window size are $1 - div_{i w_i}^{l} > \delta$ and $\mathrm{sim\_att}_{date}(x_{il}, x_{w_i l}) > \delta$, and the conditions for narrowing the window size are $1 - div_{i w_p}^{l} < \delta$ and $\mathrm{sim\_att}_{date}(x_{il}, x_{w_p l}) < \delta$, where $x_{il}$ is from the entity that most recently slid into the current window, $x_{w_i l}$ is from the last entity in the current window, and $x_{w_p l}$ is from the $p$th entity in the current window. In addition, the expansion step $ws_{step}$ ($ws_{step} = 1$) and the minimal window size $ws_{\min}$ are set in advance. Accordingly, the expanded window size $ws_i$ is shown in (16), and the narrowed window size is described by (17).
$$ws_i = \begin{cases} ws_{\min}, & p \le ws_{\min}, \\ p, & ws_{\min} < p < ws_i. \end{cases} \tag{17}$$

For the enumeration-type block, the window size is defined as follows:

(c) Pruning the edge-weight graph aims at selecting the globally best pairs by iterating over the edges of the edge-weight graph in order to filter out edges that do not satisfy the pruning criterion, such as edges with low weight.
(d) Collecting the connected subgraphs from the pruned edge-weight graph constitutes the final output of the complete entities.
In general, a strong indication of the similarity of two entities is provided by the number of blocks they have in common; the more blocks they share, the more likely they are to match. Therefore, the weight of an edge connecting entities $r_i$ and $r_j$ is set to the number of blocks in which these two entities are marked as matched. However, different blocks contribute differently to the computation of the similarity between the complete entities. For example, the output from a string-type block has a positive effect because its values are sufficiently dispersed. We therefore improved the common block scheme (CBS) method proposed in [14,35] and named the result the improved common block scheme (ICBS), which is shown in (19). For the sake of discussion, we use $e_{i,j}$ to denote the edge between entities $r_i$ and $r_j$; correspondingly, $e_{i,j}.weight$ denotes its weight.
where $e_{i,j}^{k}.weight$ is the weight of the $k$th block, and if the entities $r_i$ and $r_j$ are marked as matched in the $k$th block, then $B_{i,j}^{k} = 1$; otherwise $B_{i,j}^{k} = 0$. We adopt the weight edge pruning (WEP) method [14] to prune the edges with lower weight.
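The gathering step can be sketched end to end under several stated assumptions. Since (19) is not reproduced above, the edge weight is taken to be the sum of the block weights over the blocks that mark the pair as matched; the pruning threshold is taken to be the mean edge weight, which is one common form of weight edge pruning; and the block weights, block names, and entity identifiers are hypothetical.

```python
from collections import defaultdict


def icbs_edge_weights(block_matches: dict, block_weights: dict) -> dict:
    """Sum the weight of every block that marks a pair as matched."""
    weights = defaultdict(float)
    for block, pairs in block_matches.items():
        for i, j in pairs:
            weights[(min(i, j), max(i, j))] += block_weights[block]
    return dict(weights)


def weight_edge_pruning(weights: dict) -> dict:
    """Keep edges whose weight is at least the mean edge weight (assumed)."""
    if not weights:
        return {}
    mean = sum(weights.values()) / len(weights)
    return {edge: w for edge, w in weights.items() if w >= mean}


def connected_components(edges) -> list:
    """Collect the connected subgraphs of the pruned graph."""
    adjacency = defaultdict(set)
    for i, j in edges:
        adjacency[i].add(j)
        adjacency[j].add(i)
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical per-block match output and block weights.
block_matches = {
    "string": [(1, 2), (2, 3)],
    "numeric": [(1, 2)],
    "enum": [(1, 2), (3, 4)],
    "date": [(1, 2), (3, 4)],
}
block_weights = {"string": 1.5, "numeric": 1.5, "enum": 0.5, "date": 0.5}

weights = icbs_edge_weights(block_matches, block_weights)
kept = weight_edge_pruning(weights)
resolved = connected_components(kept)
```

Setting every block weight to 1 reduces this sketch to plain CBS, where the edge weight is simply the number of blocks that agree on the pair.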

Experiment Evaluation
This section experimentally evaluates the techniques presented in this paper. We begin our analysis with the dataset collection in Section 4.1. The following experiments are then conducted: the splitting of attributes into different blocks and the attribute clustering method are examined in Section 4.2. In order to evaluate the improved common block scheme (ICBS), we compare it with the CBS presented in [14] in Section 4.3. Finally, we compare our MBN method with the classical SNM in Section 4.4 to evaluate the effectiveness of our similarity metric schemes and dynamic window size adjustment strategies in different blocks. All approaches and experiments were implemented in Java 1.8, using IntelliJ IDEA as the development tool. For the implementation of attribute clustering and pruning of the edge-weight graph, we referred to and improved the source code publicly released at http://SorceForge.net by Papadakis et al. [21]. We also employed some open source libraries in our implementation, such as jgrapht and commons-math.

4.1. Dataset Collection.
To thoroughly evaluate our techniques, we employ a large-scale, real-world dataset generated by a real-world Internet of Things project that is developed and maintained by our research group. The time span of the dataset is from June 2016 to November 2016. If terrorists can easily obtain flammable items (such as gasoline and natural gas), this poses a huge potential danger to society. Therefore, the purpose of this project is to monitor these combustible materials in order to predict and monitor potential violent attacks in Xinjiang (one of China's provinces). In addition to refueling records, the project also collects information about drivers, vehicles, and so forth and stores it in the cloud platform. In future research, we will combine the clean data output by the method proposed in this paper with machine learning algorithms, such as random forests or rotation forests, to build smart models, for example, to predict illegal vehicles.

where true positive (TP) is the number of positive testing samples correctly predicted as positive, false positive (FP) is the number of negative testing samples incorrectly predicted as positive, true negative (TN) is the number of negative testing samples correctly predicted as negative, and false negative (FN) is the number of positive testing samples wrongly predicted as negative.
In this experiment, we selected four tables to verify our methods, as shown in Table 1. For the purpose of verifying the attribute clustering method, we added some redundant attributes and noise values to these four tables. The experimental results demonstrated that the average accuracy achieved by splitting attributes into different blocks was 98.18% and that the average accuracies obtained by our attribute clustering method on the four tables were 89.28%, 86.83%, 87.64%, and 92.08%, respectively. These results indicate that our methods work well. The main reason is that we adopt different methods to handle attributes based on the attribute value types and value distributions. For instance, in the enumeration-type block, the confidence level was set to α = 0.001; the similarity threshold was set to 0.80 in the numerical- and date-type blocks and to 0.76 in the string-type block.

4.3. Evaluation of ICBS Method.
Combined with the WEP [14] scheme, Figure 3 presents the accuracy comparison between CBS [14] and our ICBS. As shown in Figure 3, our ICBS achieved high accuracy (> 84%) on the four tables mentioned in Section 4.2. For ICBS, the weight varied between block types; generally speaking, the weight of the string and numerical types was enlarged for these four tables, and the weight of the enumeration and date types was reduced. For CBS, in contrast, the weight of every block type was set to 1. The experimental results demonstrated that these four types of block contribute differently to the calculation of the similarity of the complete entity and that our ICBS is an appropriate and important part of our ER workflow.
where TP and FP are the numbers of true-matched and true-nonmatched candidate entity pairs generated by the ER method, and FN is the number of nonmatched entity pairs. We extracted the Cylinder_check table from the dataset to verify the effectiveness of our MBN method. This table has a total of 534,017 entities. We carried out this experiment on 500, 1,000, 3,000, 5,000, 8,000, and 10,000 entities, respectively. The entity similarity threshold was set to 0.79 in both MBN and SNM. The window size was set to 16 in SNM. In MBN, the initial window size for the date- and numerical-type blocks was set to 8, the minimum window size was set to 5, the maximum window size was set to 30 for the string type, and the window size of the enumeration type was calculated in real time. The input data of both SNM and MBN came from the Cylinder_check table, from which the redundant attributes had been removed. The comparisons based on precision, recall, and F-score achieved by MBN and SNM are shown in Figures 4 and 5.
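The equations that the TP, FP, and FN symbols belong to are not reproduced above. Assuming the standard definitions of precision, recall, and F-score over candidate entity pairs, they can be computed as in the sketch below. The counts are hypothetical and are not results from the paper.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    """Standard precision, recall, and F-score from pair counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 80 correct matches, 20 false matches, 10 missed matches.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
```

A method that widens its windows tends to raise recall at the cost of more comparisons, while stricter similarity thresholds tend to raise precision; the F-score summarizes the balance between the two.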
Figures 4 and 5 demonstrated that, compared with SNM, our MBN method improved the precision and recall for the closest entity detection. That means MBN is a more objective method to evaluate the similarity between entities because it splits entities into different blocks based on the attribute value types and adopts different similarity metrics in these blocks. In addition, compared with the fixed window size in SNM, the dynamically adjustable window schemes introduced in our ER method improve the performance while reducing the number of missed matches. Moreover, in SNM, the performance of detecting the closest entities mainly depends on its single sort key attribute, which means that if the key attribute is not selected properly or contains noisy data, its performance will be greatly reduced. Thus, it is necessary to select multiple key attributes and flexibly integrate several ER workflows to achieve better performance in the MBN method. Generally speaking, both methods had acceptable recall on the same dataset. In Figure 6, the F-score comparison also indicated that MBN performed better than the original SNM method.

Conclusion and Future Work
In this paper, we introduce a novel hybrid approach as an effective method for entity resolution in the context of voluminous and high-dimensional data. In contrast to existing entity resolution methods, our approach reduces the search space for entity pairs through constraints on redundant attributes and matching likelihood and meets multiple similarity comparison requirements based on attribute value types. Our experimental evaluation verified the effectiveness as well as the efficiency of our method on a real-world dataset. We believe our approach is a promising alternative for the entity resolution problem that can be adopted in many other applications, such as national censuses (e.g., to match data from different census collections to detect and correct conflicting or missing information) and business mailing lists (e.g., to identify redundant entities about the same customer to avoid wasting money on mailing several copies of an advertisement flyer to one person). Of course, many interesting problems remain to be addressed in the future, including general guidance for selecting the key attribute in each block, and we will attempt to introduce parallelization techniques into our approach to further improve its efficiency. Another interesting direction of our research is how to apply and develop our method to improve the efficiency of approaches that rely on entity resolution in machine learning and probabilistic inference.

Figure 2: The internal functionality of our gathering method.

4.2. Evaluation of Splitting Attributes to Different Blocks and Attribute Clustering Methods. To assess the performance of our attribute splitting and attribute clustering methods, accuracy is used as the evaluation criterion in this part, and its definition is as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
3.2.3. Date-Type Block. The correlation between date-type attributes is handled in the same way as that of the ordinal numerical attribute.

Input: sd: dataset; δ: the threshold for the number of possible values of an enumeration attribute; λ: the number of values which are randomly selected from sd;
Output: Map<BT (block type), list of attribute names>, BT ∈ {NUME, STRING, DATE, ENUM}
(a) Map<attribute, List<v1, v2, …, vλ>> mediateData ← sd; // Using Map to store attribute and its
(Note: the δ and λ should be adjusted according to the size of dataset)
Algorithm 1: Splitting attributes into different blocks.

Input: this is for same type block; δ: the threshold for the attributes similarity
Output: Set of attribute names clusters: C
(a) connects ← {};
(b) noOfAttributes ← the size of Map<attribute name, List<v1, v2, …, vn>>;

Table 1: The accuracy of splitting attributes into different blocks and of attribute clustering.