On Detecting and Removing Superficial Redundancy in Vector Databases

. A mathematical model is proposed in order to obtain an automatized tool to remove any unnecessary data, to compute the level of the redundancy, and to recover the original and filtered database, at any time of the process, in a vector database. This type of database can be modeled as an oriented directed graph. Thus, the database is characterized by an adjacency matrix. Therefore, a record is no longer a row but a matrix. Then, the problem of cleaning redundancies is addressed from a theoretical point of view. Superficial redundancy is measured and filtered by using the 1-norm of a matrix. Algorithms are presented by Python and MapReduce, and a case study of a real cybersecurity database is performed.


Introduction
Current systems of knowledge extraction are based on the creation of the best models to solve a specific problem with particular data.In addition, the computational algorithms used are implemented and applied through different data management and processing architectures, from the most rudimentary to the most advanced analytical platforms using Big Data (BD) in real time.
In the most current cases, the creation of specific models, capable of analyzing, categorizing, and predicting different situations, such as anticipating trends or reacting to certain events, requires Big Data analytics.These techniques give rise to different challenges such as data inconsistency, incompleteness, scalability, timeliness, or data security [1,2].But, previously, data must be well-constructed [3].
On the other hand, good quality data is required to obtain good quality knowledge (in another case we fall in the wellknown garbage-in, garbage-out constituted scenario; see [4]).Hence we should notice that not every datum is useful; for instance, it is expected that only 35% of considered data for analysis will be really useful by 2020 (see [5]).
In addition, motivated by the high increase of the number of incidents, the sensors, and the Internet of Things (IoT) devices, the rate of acquisition of data grows exponentially and, therefore, the volume of databases can become in a dangerous situation because the data obesity can be presented.Also, real-world databases are severely sensitive to be inconsistent, incomplete, and noisy.This fact turns out to be especially significant when several data sources need to be integrated.Working in a multisource system of acquisition of data generates high overlapping because new data are continuously included from different sources and thus increasing the probability of finding noise and dirty data.For instance, the above situation is typically in data warehouses, federated database systems, critical infrastructures, etc. (see [3,6]).Thus, an appropriate strategy to remove unnecessary data (redundant data) in an automatized process is needed.
Data cleaning deals with detecting and removing errors from data and with eliminating the noise produced by the owned data collection procedure [7].Hence, in a first approach, we state two types of redundancy: superficial and deep redundancy that could appear at instance or variables (features) level.
(1) Superficial redundancy refers to all variables that we do not need to take into account in our further analysis from a natural point of view (empty variables, constant, and identical cases).The study of superficial redundancy allows us to filter the database without advanced statistical analysis or previous transformation of the data in treatable variables.Moreover, this redundancy may be studied in any database.
(2) Deep redundancy collects all variables containing the same information, encoded in different ways as well as correlated variables, associated variables, or in general nondiscriminant variables to the fixed target.Note that in the first case of deep redundancy a simple frequency analysis could be enough to recognize the variables with the same information.However, the detection of the relation of correlation between variables, or the computation of the relevant features for a specific target, requires more advanced statistical analysis.
Note that it would be expected that redundancy appears in more than one type.The case of duplicated cases in a database is a special type of redundancy when the database is build up from several data sources, and it can show up in both types described above.In fact, removing the duplicate information is a very complex process in databases of cybersecurity reports, since identifying them is a difficult task that requires expert knowledge (deep redundancy).For example, we can have the same incident, reported by different sources, at different times, and by using a different lexical language.Or we can observe the same case reported twice by the same source because of defects in the collection procedure such as stuck-up of cases by updates.Following [7], a data cleaning approach should satisfy several requirements: (1) It should be able to remove all main errors and inconsistencies of data from individual and multiple sources.
(2) Manual inspection and programming effort should be limited.
(3) It should be flexible enough in order to integrate additional sources.
(4) Furthermore, data cleaning should be integrated together with schema-related data transformations.
(5) Data transformations along the cleaning procedure should be specified in a declarative way and be reusable.
There are several research works that develop different approaches to data cleaning of databases by as special data mining treatment, data transformations, or specific operators (see [8][9][10][11][12][13][14]).Also, some of them perform the data cleaning on a separate cluster that, later, we need to integrate into the data smart center.However, these works are focused on the study of duplicated cases, remove of typographical errors, or detecting inconsistencies or discrepancies.Thus, one of the remaining challenges of the data science is to design and propose efficient representations, access, and tools that let us preprocess and clean huge amount of variety of data before starting data analysis procedures [3].Although there are a quite amount of commercial tools available to support these tasks, the cleaning and transformation work needs to be done manually or by low-level programs (see [7]).
Our goal in this paper is to give a mathematical model for detecting and removing superficial redundancies in an instance and variable level, for a single or multisource context, over certain kind of databases (vector databases).Our proposal is an oriented directed-graph which is theoretically based.Then, we address the problem of cleaning redundancies by using elementary algebraic tools.A matrix with entries in  = {0, 1} is attached to a given database and removing redundancy operations arise as standard transformations on the matrix.Then we can give a concrete expression of the level of redundancy by using the 1-norm of a matrix.Thus we would be able to report redundancy in order to clean reports.Note that we do not intend just to delete all superficial redundancy data but to know the level in order to perform further actions in the design of statistical analysis.We remark that redundancy is not bad itself because some measures like the reputation of sources might be performed using redundant reports.
Moreover, in this work, we present a tool in open source that cleans the database in an automatized way and that computes the level of redundancy of the database.It also permits obtaining the original and the filtered database at any time of the process as well as the level of redundancy and the associated graph.The aim and procedures would be fully applied to any standard of reporting cases by means of formal language processing.The scripts mentioned above are available at a public GitHub repository (see [15,16]).
In particular, this approach is applicable to clean up databases of cybersecurity reports (cyber databases).A cyber database contains a lot of unstructured information together with a high level of correlations where they are performed by human agents, that is, expert knowledge.The structural variety of the data of security reports is not unique (from machine-generated data to synthetic or artificial data).Moreover, the value of each feature could be structured, unstructured, or semistructured, and these typologies provide quantitative, (pseudo) qualitative, or string features.A security report usually is integrated, transformed, and combined with different data collection engines that provide only limited or null support for data cleaning, focusing on data transformations of management and schema integration.Since these engines receive information from different sources, in most cases we cannot modify the acquisition data process.Therefore, in order to extract knowledge from data, the best chance to get success is to optimize the different phases of the treatment and analysis of the data, and the first point is cleaning the database in an automatized way.Thus, we need to study the redundancy levels in order to detect superfluous reporters or optimize the resources.But, not every tool is useful.It must be noted here that some tools might be useless due to security constraints.In this case, it would not be possible to use online and private license software because sharing the data is not allowed.Therefore, in this situation, the tool to clean up the database would need to be integrated into an ecosystem with high levels of security.
In the final part of this paper, we also apply the developed tool to a real case of cleaning up a cyber database obtaining a 64% of superficial redundancy.
The paper is structured as follows: In Section 2 we give the model of a vector database from a matrix approach by graphs and our main results related to the computation of the level of redundancy.In Section 3, we develop the experimental section.This section includes the materials, the development of the tool that we have created to clean up databases as well as a comparison with some existent tools, and the case of study, in which we apply our tool to compute the level of the redundancy of a real fragment of a cyber database.Finally, our conclusions and references are given.

A Graph Approach to the Redundancy of a Database
A graph database is a database that can be structured in graph form so that the nodes of the graph contain the information and the edges contain properties and/or define relations of the information contained in the nodes.One of the main strengths of these kinds of databases is the capability to give answers in short time for questions regarding relations (see [17]).
In this section, we will define a graph structure on a database that conceptually differs from the usual one described above, and the motivation is based on the problem of detecting or cleaning redundancies in a database.In general, to show whether two columns or variables of our database are redundant or not, in some sense, one looks at the information contained in these variables and then decide.Although this will be our procedure eventually, we will cluster the set of variables by looking at the meaning they have and then we will consider the usual procedure.The point is that once the clustering is done the database and the clusters define a graph structure in a natural way where not all the nodes contain information.

A Graph Model for a Database.
Observe that in the above discussion we started by considering a usual database and we finished with the database plus a clustering of the variables.Before defining the graph structure, we will formalize this situation, and we will use it as the starting point.Definition 1.A vector database VD is a tuple of databases D  , each one of them coming with a label, which satisfy the following properties: (1) All the databases D  have the same length; that it is, all of them have the same number of rows .
(2) If a database has a unique column, the column name agrees with the database label.(3) Two different databases must have different labels.(4) To have same column names in different databases is allowed.
Table 1: Second form of a vector database.
(5) The nature of features is of any type (strings, floats, integers, etc.).
Remark 2. We will state some notation for the sake of clarity.
(1) We will use the notation { 1 , . . .,   } for the labels of the databases that are one-column composed as well as their unique column name, and we will denote by  the above set.
(2) The set  = { 1 , . . .,   } is going to be the set of labels of the databases.
(3) The set   = {  1 , . . .,     } is the set of column names of each single database   .From  = {  }, we can construct the set  = { 1 , . . .,   } and the different column labels collected from all the databases.We can reorder  according to , and then we use the following notation {  1 , . . .,     } for the ordered    .(4) The th row or report of a vector database is given by the vector constructed from the th rows of each database   and is denoted by   .
With the notation described in Remark 2, a vector database has the following form: where   could have the form ...

Row 1
Row n or Row 1 Row n Remark 3. If we apply the ordering that the set  provides, then we can understand a vector database as a unique table; see Table 1.So, any database in the form of (1) or Table 1 verifies the condition of Definition 1.
Example 4. We consider the example shown in Table 2.
(  ( ) From the last sets, we obtain the different column names,  = { 1 = Region Code,  2 = Port,  3 = Type IP}.Now, we transform    into    in order to normalize the vector database depending on .
We can now give a graph structure to the object defined in Definition 1.In order to do so we have to define the set of nodes, , and the set of arrows, .The graph,  = (, ), is going to be composed of two layers of nodes in such a way that all the arrows have their source in one layer and target in the other one.
(1) Layer 1: the nodes are the elements of the sets  and .
(2) Layer 2: the nodes are the elements of the set .
(3) We have an edge   →   if and only if   is a column name of the database   .From now on this relation will be expressed as   ∈   .
(4) Every variable   has a cycle.
The look of such a graph is shown in Figure 1.
Example 5.The associated graph to the vector database of Example 4 is shown in Figure 2.
In the sequel,  will denote the set {0, 1}, and the set of matrices with entries in  with  rows and  columns will be denoted by  × .2.
Remark 7. Note if we are only interested in the database, without more relations than those that have been defined, the study of the adjacency matrix is reduced to the matrices C and I.For this reason, in this paper, we will use the notation  = (C, I) to make reference to such adjacency matrix.
where the matrices C and I are highlighted in bold characters.
Remark 10.An added problem that we will have to take into account when carrying out advanced analysis and the cleaning of the data is the problem of removing missing values.We have two different missing values in our model; first, in some rows, we do not have the complete information on all the features.Secondly, we do not have all possible arguments in each feature subvector.The problem of missing data is solved, a priori, by substitution by two categories: sample zeroes = empty (lack of information in the sample or in the database) or structural zeroes = 0 (lack of information due to nonexistence).
Example 11.From the vector database of Example 4 we can construct the following weighted square matrices, one for each case of the database: ) ) ) ) ) ) 2.2.Cleaning the Database: Level of Redundancy.Once the graph structure is defined, our proposal is to use it to compute the level of redundancy of the database.The redundancies we are going to deal with are defined as follows: Type I: it includes empty variables.
Type II: it includes constant variables.
Type III: it includes equivalences across the databases   .  1 ∼   2 ⇔ their variables contain the same information.
In order to give a close mathematical formula for the level of redundancy in terms of the graph structure we will have to define these redundancies properly.
Definition 12. Let  = (C, I) be the adjacency matrix of the database.One defines the following maps: where C  = (   ) ∈  × is defined as We can define the analogous map R I TI acting over the submatrix I. Definition 14.The level of redundancy of Type I,  TI , is the rate of variables or columns that are always empty in the database and is computed by where ‖‖ 1 denotes the 1-norm of the matrix , that is, the number of 1  s in matrix .
Remark 15.Note that we can also obtain the redundancy of Type I in each subset of variables  and  separately: The next type of redundancy is also known (for numerical variables) as 0-variance redundancy, and it is constituted by the columns or variables that always take a constant value for all rows and so they do not provide us with relevant information.Note that redundancy of Type I is redundancy Type II when the constant value is zero.
Definition 16.Let  = (C, I) be the adjacency matrix of the database.One defines the following maps: where C  = (   ) ∈  × is defined as otherwise (16) Note that the constant  is different for each position   , but its value in a position must be equal for all rows.Again, we can define the analogous map R  TII acting over the submatrices I.
Remark 13 can be made in this case.
Definition 17.The level of redundancy of Type II,  TII , is the rate of variables or columns that are always constant in the database and is computed by Note that, in the same way as Remark 15, we can also compute redundancy Type II in both subsets of variables,  and .
The redundancy of Type III has to be with the possible equivalences between the sets of variables of the different databases of the tuple, that is, between the elements of .In this case, we make comparisons between the variables   that could have the same content in their arguments   .Definition 18.Let  = {  } and   = {   } be the set of databases and the set of variables for each   where  = 1, . . .,  and  = 1, . . .,   .One defines the following equivalence relation: after reordering the variables if it were necessary (18) This equivalence relation introduces a partition in    : in such a way that in any of those partitions   included all     that are equivalent to each other.
Definition 19.Let  = (C, I) be the adjacency matrix of the database, and fix a representative    for each part   defined by the above equivalence relation.One defines the following map: where C  = (   ) ∈  × and Definition 20.The level of redundancy of Type III,  TIII , is the rate of     that are in the subsets   once the representatives one has chosen have been taken out and is computed by Definition 21.Let  = (C, I) be the adjacency matrix of the database.Let  ( * = (C ( * , I ( * ) be the adjacency matrix of the database after applying all types of redundancy described above.Then, the redundancy of the database is defined as follows: Remark 22.At this point it is worth noticing that our procedure is able to write down superficial redundancy after each subset and filtered report may be recovered if it is necessary.

Experimental Section
This section is divided into three different subsections.In the first part, the materials that have been used in the experimental section are described.The second part is devoted to developing the tool RIASC Tool for Removing Redundancies (RTRR).It has been created in Python language to apply what we have developed in Section 2. Finally, we summarize the results obtained from the application of this tool to a real database.

Materials. The RIASC Tool for Removing Redundancies
(RTRR) is available at public git repository (see [15,16]).It has been coded using Python 2.7 (version 1, v1), and there is also a version for MapReduce (version 2, v2).Both versions allow the user to clean a database to prepare it for processing the data.The computation of time and complexities of the presented algorithms had been performed by using Cloudera's virtual machine with OS Red Hat (64-bit).The CPU specifications are as follows: 4 GB of RAM with 4 cores of which only 1.5 GB of RAM was available for the MapReduce task.
In the case of study, the database is a real fragment with cybersecurity reports with 363 variables or columns and 2600 rows.It has been anonymized due to confidentiality constraints.

RTRR: Development and Comparative.
The execution flow of the RTRR (see Figure 3) begins with the input data source.It is assumed that the input database is taken in a nonstructured way.Therefore the graph structure has to be discovered and introduced later or has to be given by the user as a second input.These are the two modes the RTRR can work with.The way the RTRR clusters the variables or column names in order to get the graph structure is by looking at the lexical structure of these column names.It is assumed that the column names have the structure where the empty space that separates each word can be changed by any other character (delimiter character), and it has to be included as an input.
In the first case mentioned above, the user distinguishes between the three types of nodes of the eventual graph structure, providing two different text files (.txt).One of them is with the list of names of the single-column-tables (), chosen among the column names.Another one is with the list of names of the rest of the tables (), chosen among the rest of the column names.In this case, an element  ∈  is not formed by the complete chain of words of one of the column names (27) but by that chain of words cut off at some and the columns are yellow nodes; those two different colors allow the user to identify the direction of the relationship.When we introduce an input data source in the tool, an initial graph representation of the initial database is generated, as well as a new graph representation for each type of redundancy that the user decides to remove.The user obtains two graphs (the original one and another without one of the types of redundancy) in the basic case and four graphs (the original one and three more, one for each type of redundancy removed) in the worst case.
In order to make a comparison between RTRR and other tools of cleaning databases, we have proposed the following assessment indicators for an eventual tool that integrates the cleaning procedure: (1) Required computational resources-for example, we should take into account the minimal required RAM needed for the tool to work properly and also the operative systems that can support it, such as Windows (W), Linux (L), and MacOS (M) (2) Types of cleaning redundancies tasks, that is, what kind of redundancies the tool is able to remove (3) Provided services-for example, the possibility of getting back to the original database, the cleaned database, the removed entries, or variables at each stage of the cleaning process as well as graphical representations of it (4) Formats and sizes of datasets that the tool is able to manage (5) The platform features, such as graphical interface and online version (6) License.
Note that the assessment of each indicator could be considered positive or negative depending on the context the tool is working in.For instance, we can find cleaning tools that work only online without existing any version working in local hosts.This could be positive, a priori, but in a confidentiality context this tool could not be taken into account.
In Table 5, we show a comparison among different tools that deal, in some way, with at least one of the redundancies defined above.These tools are Data Wrangler  [18][19][20][21][22][23][24]).Although there exist more tools for cleaning data, most of them are focused on removing duplicated data as identical cases; see the references [25][26][27][28][29].Most of the analyzed tools are designed to clean at the row level, so the way to proceed when removing empty variables is by transposing the database (that is what the symbol * means in Table 5).This fact, however, demands a high level of computation resources.Also, with some of above tools we are not allowed to include as many variables as we want, so transposing the database is not possible when the amount of rows is large.Even when transposing the database is possible, in some of these tools, the allowed type of variables is just numerical, and this imposes a strong restriction.All of them let us recover the original dataset as well as the cleaned dataset.However, none of them compute the level of redundancy.
In Table 4 we establish the analysis of the proposed assessment indicators for RTRR tool.
Although the tool RTRR covers different needs regarding cleaning data, there are still certain limitations that make this work open for future research.
In the first place, it would be necessary to enlarge the possible types of inputs and outputs that the tool can deal with.It is important to highlight that RTRR has been designed to work in a localhost, so it would be necessary to adapt it to different frameworks; for instance, it would be important to have another version being able to work in the cloud.
Note also that there is no graphic interface for RTRR, and it would be important to design it in order to bring the tool to nonexpertise users.
As the last limitation, it is worth highlighting the fact that RTRR cleans redundancies of three different types, although more types can be detected in a database.As future work, the introduction of more types of redundancies will be considered.

A Case Study
Definition 24.A cyber database is formed by a huge amount of security reports ( • ) as the row described in a vector database, that is to say, vectors with  states or features whose confidential information is about a security incident.
Usually, a cyber database has the following properties: regarding the volume of data, a cyber database has a large amount of data every day.The structural variety of the data of security reports depends on how we acquire the data, from machine-generated data (acquired by engines), correlated    Regarding the valence of a cyber database, this is a dense set of data because we usually find high valence.Data are related to each other because they rely on real events reported by several agents.But these relations are not explicit and there are links among sources, types of attacks, reports, and incidents of the same type of attacks.Finally, the value is precisely the actionable knowledge that we can get from the cyber database from analyzing the quality of the data, automatized process, prediction of incidents, or detecting intrusion in different networks (see [30][31][32]).
Remark 25.In a cybersecurity context, we usually cannot design the whole acquisition process of data.Then, the task of cleaning data is always the first available stage of the procedure in which we can try to improve the efficiency.Usually, superficial redundancy is presented because the common data integration systems are not designed for cybersecurity reports.Now we will analyze the superficial redundancy of the real cyber database we started describing at the beginning of this section, by applying the tool developed in Section 3.2.
Recall that the database has 363 variables.Recall also that the clustering of the set of variables that defines the graph structure was done manually because we have improved the procedure with expert knowledge, leading to # = 28, # = 24, and # = 131.
After applying the redundancy maps described in Section 2 on our sample, we obtain the redundancy levels that are shown in Table 6.The computation has been done in an accumulative way.
Results on Table 6 show that the most common type of redundancy in the sample is Type I. Also, the second type of redundancy more frequent in the sample is Type III.Both facts could be because the data integration engines are multisource and they are not designed specifically for cybersecurity reports.
The evolution of the associated graphs in the different cleaning stages is given in Figure 6, where connected green nodes represent the elements in , the isolated green nodes represent the elements in , and the yellow nodes represent the elements in .
The time from obtaining the initial matrices, the graph, and the level of redundancies that is taken for both, Python mode and MapReduce mode, is shown in Table 7 Result 1 (computational gain).After removing redundancies of types I, II, and III, we drop off about 64% of the rows of the original database.Hence the filtered database would be, roughly speaking, a third part of the original one.

Conclusions and Future Work
In this work, we have developed a novel graph approach for certain databases that allows computing the level of redundancy as the 1-norm of some adjacency and weighted matrices.
Furthermore, a tool (RTRR) to detect and remove some kind of redundancies has been presented and described, making use of the above theory.
Finally, this tool has been applied to a real cyber database made up by several sources, presenting a level of redundancy quite high and showing that, approximately, two-thirds of the data could be unuseful for further analysis.
As future work, we propose to model more types of redundancies, in particular, to face deep redundancies.Also, we will focus on improving the tool RTRR by creating a graphical interface that makes the tool more accessible to nonexpert users to create an online version.

Figure 1 :
Figure 1: Example of the graph associated with a vector database.

Figure 2 :
Figure 2: Associated graph with database shown in Table2.

Example 8 .
The adjacency matrix associated with the vector database of Example 4 is Applying the maps R * TI over the matrices  = (C, I), we obtain the filtered adjacency matrix   = (C  , I  ), which contains the database once Type I redundant variables have been dropped.Remark 13.In case the entries of the database were numerical, note that we could get the filtered database (without redundancy Type I), {   }, constructing    = (C   , I   ) =   ⋅   from   , where ⋅ denotes the usual Hadamard (entrywise) matrix product.
, ∀.Applying R TIII over the matrices  = (C, I), we obtain the filtered adjacency matrix   = (C  , I  ).Note that I  = I  because redundancy Type III does not concern individual features.Moreover, we can construct    = (C   , I   ) =   ⋅   from   and we get the row   filtered by redundancy Type III,    .

Table 5 :
Analysis of cleaning dataset tools.In the notation •/•, the first argument represents a feature of the free version of the corresponding tool, and the second one represents the same feature but in the enterprise version.Also, the symbol * means that the tool cleans by row level instead of column level.

Table 2 :
Example of a vector database.

Table 3 :
The filtered database of Table2after removing the superficial redundancies Types I, II, and III.

Table 6 :
Level of redundancy in the case of study.

Table 7 :
Time needed to complete the tasks in Python and MapReduce.