A mathematical model is proposed in order to obtain an automated tool that removes unnecessary data, computes the level of redundancy, and recovers the original and the filtered database at any stage of the process, for a vector database. This type of database can be modeled as a directed graph, so the database is characterized by an adjacency matrix; therefore, a record is no longer a row but a matrix. The problem of cleaning redundancies is then addressed from a theoretical point of view: superficial redundancy is measured and filtered by using the 1-norm of a matrix. Algorithms are implemented in Python and MapReduce, and a case study on a real cybersecurity database is performed.
Current knowledge extraction systems are based on creating the best models to solve a specific problem with particular data. In addition, the computational algorithms used are implemented and applied through different data management and processing architectures, from the most rudimentary ones to the most advanced analytical platforms using Big Data (BD) in real time.
In most current cases, the creation of specific models capable of analyzing, categorizing, and predicting different situations, such as anticipating trends or reacting to certain events, requires Big Data analytics. These techniques give rise to different challenges, such as data inconsistency, incompleteness, scalability, timeliness, or data security [
On the other hand, good quality data are required to obtain good quality knowledge (otherwise we fall into the well-known
In addition, driven by the sharp increase in the number of incidents, sensors, and Internet of Things (IoT) devices, the rate of data acquisition grows exponentially and, therefore, the volume of databases can reach a dangerous point at which data obesity appears. Moreover, real-world databases are highly prone to being inconsistent, incomplete, and noisy. This fact turns out to be especially significant when several data sources need to be integrated. Working with a multisource data acquisition system generates high overlapping, because new data are continuously included from different sources, thus increasing the probability of finding noise and dirty data. This situation is typical of data warehouses, federated database systems, critical infrastructures, etc. (see [
Data cleaning deals with detecting and removing errors from data and with eliminating the noise produced by the data collection procedure itself [ . Superficial redundancy refers to all the variables that we do not need to take into account in our further analysis from a natural point of view (empty variables, constant variables, and identical cases). The study of superficial redundancy allows us to filter the database without advanced statistical analysis or a previous transformation of the data into treatable variables. Moreover, this redundancy may be studied in any database. Deep redundancy collects all the variables containing the same information encoded in different ways, as well as correlated variables, associated variables, or, in general, variables that are nondiscriminant for the fixed target. Note that in the first case of deep redundancy a simple frequency analysis could be enough to recognize the variables carrying the same information. However, detecting the correlation between variables, or computing the relevant features for a specific target, requires more advanced statistical analysis.
Note that redundancy can be expected to appear in more than one type. Duplicated cases in a database are a special type of redundancy when the database is built up from several data sources, and they can show up in both types described above. In fact, removing duplicate information is a very complex process in databases of cybersecurity reports, since identifying duplicates is a difficult task that requires expert knowledge (deep redundancy). For example, we can have the same incident reported by different sources, at different times, and using a different lexical language. Or we can observe the same case reported twice by the same source because of defects in the collection procedure, such as cases piling up through updates.
Following [ , a data cleaning approach should satisfy several requirements: it should be able to remove all main errors and inconsistencies of data from individual and multiple sources; manual inspection and programming effort should be limited; it should be flexible enough to integrate additional sources. Furthermore, data cleaning should be integrated with schema-related data transformations, and the data transformations along the cleaning procedure should be specified in a declarative way and be reusable.
There are several research works that develop different approaches to data cleaning of databases, such as special data mining treatments, data transformations, or specific operators (see [
Our goal in this paper is to give a mathematical model for detecting and removing superficial redundancies at the instance and variable level, in a single-source or multisource context, over a certain kind of database (vector databases). Our proposal is a theoretically grounded directed-graph model. We then address the problem of cleaning redundancies by using elementary algebraic tools. A matrix with entries in
Moreover, in this work, we present an open-source tool that cleans the database automatically and computes its level of redundancy. It also permits obtaining the original and the filtered database at any point of the process, as well as the level of redundancy and the associated graph. The approach and procedures could be fully applied to any standard for reporting cases by means of formal language processing. The scripts mentioned above are available at a public GitHub repository (see [
In particular, this approach is applicable to cleaning up databases of cybersecurity reports (cyber databases). A cyber database contains a lot of unstructured information together with a high level of correlation, since reports are produced by human agents, that is, with expert knowledge. The structural variety of the data of security reports is not unique (from machine-generated data to synthetic or artificial data). Moreover, the value of each feature can be structured, unstructured, or semistructured, and these typologies provide quantitative, (pseudo)qualitative, or string features. A security report is usually integrated, transformed, and combined with different data collection engines that provide only limited or no support for data cleaning, focusing instead on data transformations for management and schema integration. Since these engines receive information from different sources, in most cases we cannot modify the data acquisition process. Therefore, in order to extract knowledge from the data, the best chance of success is to optimize the different phases of the treatment and analysis of the data, and the first step is cleaning the database in an automated way. Thus, we need to study the redundancy levels in order to detect superfluous reporters or to optimize resources. However, not every tool is useful: some tools may be unusable due to security constraints. In such cases, it is not possible to use online or privately licensed software because sharing the data is not allowed. Therefore, in this situation, the tool to clean up the database needs to be integrated into an ecosystem with high levels of security.
In the final part of this paper, we also apply the developed tool to a real case of cleaning up a cyber database, obtaining 64% of superficial redundancy.
The paper is structured as follows: In Section
A graph database is a database that can be structured in graph form, so that the nodes of the graph contain the information and the edges contain properties and/or define relations between the information contained in the nodes. One of the main strengths of this kind of database is the capability to answer questions regarding relations in a short time (see [
In this section, we will define a graph structure on a database that conceptually differs from the usual one described above; the motivation comes from the problem of detecting and cleaning redundancies in a database. In general, to decide whether two columns or variables of our database are redundant or not, in some sense, one looks at the information contained in these variables and then decides. Although this will eventually be our procedure, we will first cluster the set of variables by the meaning they carry and then apply the usual procedure. The point is that, once the clustering is done, the database and the clusters define a graph structure in a natural way, where not all the nodes contain information.
Observe that in the above discussion we started by considering a usual database and finished with the database plus a clustering of the variables. Before defining the graph structure, we will formalize this situation and use it as the starting point.
A vector database All the databases If a database has a unique column, the column name agrees with the database label. Two different databases must have different labels. Having the same column names in different databases is allowed. The nature of the features can be of any type (strings, floats, integers, etc.).
We will state some notation for the sake of clarity. We will use the notation The set The set The
With the notation described in Remark
If we apply the ordering that the set
So, any database in the form of (
Second form of a vector database.
We consider the example shown in Table
In this case, we can construct the following sets: From the last sets, we obtain the different column names,
We can now give a graph structure to the object defined in Definition:
Layer 1: the nodes are the elements of the sets
Layer 2: the nodes are the elements of the set
We have an edge
Every variable
The look of such a graph is shown in Figure
Example of a vector database.
Attacker | Attacker | Source | Source | Target | Target | Target | Time |
---|---|---|---|---|---|---|---|
35 | 137 | 35 | 137 | 45 | | static | 23:56:24 |
56 | 1456 | 56 | 1456 | 45 | | static | 23:56:39 |
67 | 1456 | 67 | 1456 | 45 | | dynamic | 23:57:02 |
67 | 378 | 67 | 378 | 45 | | dynamic | 23:57:43 |
Example of the graph associated with a vector database.
The graph associated with the vector database of Example
Associated graph with database shown in Table
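For illustration, the construction of this two-layer graph can be prototyped in a few lines of Python. The following is a minimal sketch of ours (it is not part of the RTRR sources), using the networkx library; the toy label-to-columns mapping and the column subnames are hypothetical, chosen to resemble the example above:

```python
# Minimal sketch (ours): the two-layer graph of a vector database.
# Layer-1 nodes are database labels; layer-2 nodes are their columns,
# disambiguated by label; every edge goes from a label to one of its columns.
import networkx as nx

def build_vector_db_graph(databases):
    """databases: dict mapping each database label to its list of column names."""
    g = nx.DiGraph()
    for label, columns in databases.items():
        g.add_node(label, layer=1)
        for col in columns:
            g.add_node((label, col), layer=2)
            g.add_edge(label, (label, col))
    return g

# Hypothetical mapping resembling the example; the column subnames are ours.
example = {
    "Attacker": ["id", "port"],
    "Source":   ["id", "port"],
    "Target":   ["id", "ip", "type"],
    "Time":     ["Time"],   # single-column table: column name equals the label
}
graph = build_vector_db_graph(example)
print(graph.number_of_nodes(), graph.number_of_edges())   # 12 nodes, 8 edges
```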
In the sequel,
The adjacency matrix of a vector database is defined as the adjacency matrix,
Note that if we are only interested in the database, with no more relations than those that have been defined, the study of the adjacency matrix reduces to the matrices
The adjacency matrix associated with the vector database of Example
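Under the same toy mapping, the adjacency matrix can be assembled directly. The following sketch (ours) orders the nodes as labels followed by columns, so the only nonzero block is the label-to-column incidence:

```python
# Sketch (ours): adjacency matrix of the two-layer graph. Entry (i, j) is 1
# exactly when there is an edge from node i to node j.
import numpy as np

def adjacency_matrix(databases):
    labels = list(databases)
    columns = [(lb, c) for lb in labels for c in databases[lb]]
    nodes = labels + columns                  # row/column ordering of the matrix
    index = {n: i for i, n in enumerate(nodes)}
    a = np.zeros((len(nodes), len(nodes)), dtype=int)
    for lb in labels:
        for c in databases[lb]:
            a[index[lb], index[(lb, c)]] = 1  # edge label -> column
    return a, nodes

A, order = adjacency_matrix(example)          # `example` as in the sketch above
```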
From each row or report
An added problem that we will have to take into account when carrying out advanced analysis and cleaning of the data is that of handling missing values. There are two different kinds of missing values in our model: first, in some rows we do not have complete information on all the features; secondly, we do not have all possible arguments in each feature subvector. The problem of missing data is solved, a priori, by substitution with two categories: sample zeroes =
From the vector database of Example
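The substitution itself is straightforward to prototype. In the sketch below (ours), the two category tokens are placeholders and not the paper's notation; `structural_positions` marks the columns whose absences are structural rather than sampling gaps:

```python
# Sketch (ours): making the two kinds of missing values explicit categories.
import pandas as pd

SAMPLE_ZERO = "sample_zero"          # the value exists but was not reported
STRUCTURAL_ZERO = "structural_zero"  # the argument cannot occur for this report

def fill_missing(df, structural_positions=()):
    out = df.copy()
    for i in range(out.shape[1]):    # positional loop: column names may repeat
        token = STRUCTURAL_ZERO if i in structural_positions else SAMPLE_ZERO
        col = out.iloc[:, i]
        out.iloc[:, i] = col.astype(object).where(col.notna(), token)
    return out
```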
Once the graph structure is defined, our proposal is to use it to compute the level of redundancy of the database. The redundancies we are going to deal with are defined as follows:
In order to give a closed mathematical formula for the level of redundancy in terms of the graph structure, we will have to define these redundancies properly.
Let
We can define the analogous map
Applying the maps
If the entries of the database were numerical, note that we could obtain the filtered database (without Type I redundancy),
The level of redundancy of Type I,
Note that we can also obtain the redundancy of Type I in each subset of variables
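As a concrete illustration, a Type I filter and the corresponding level of redundancy can be prototyped as follows. This is a sketch of ours: the normalization by the 1-norm of the original adjacency matrix is our plausible reading of the definition, not a verbatim transcription of the paper's formula.

```python
# Sketch (ours): Type I redundancy removes the empty variables, and its level
# is read off from the entrywise 1-norm of the adjacency matrices (for a 0/1
# adjacency matrix, the entrywise 1-norm counts the label -> column edges).
import numpy as np

def type_i_filter(df):
    """Drop the empty variables: columns whose cells are all missing."""
    keep = [i for i in range(df.shape[1]) if not df.iloc[:, i].isna().all()]
    return df.iloc[:, keep]

def one_norm(a):
    return np.abs(a).sum()

def redundancy_level(a_original, a_filtered):
    # (||A||_1 - ||A'||_1) / ||A||_1: the fraction of edges (variables) removed.
    return (one_norm(a_original) - one_norm(a_filtered)) / one_norm(a_original)
```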
The next type of redundancy is also known (for numerical variables) as
Let
Again, we can define the analogous map
Remark
The level of redundancy of Type II,
Note that, in the same way as Remark
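A Type II filter admits an equally short sketch (ours): for numerical variables this is just the removal of zero-variance features, and for categorical ones it keeps any column with at least two distinct observed values.

```python
# Sketch (ours): Type II redundancy removes the constant variables.
def type_ii_filter(df):
    keep = [i for i in range(df.shape[1])
            if df.iloc[:, i].dropna().nunique() > 1]
    return df.iloc[:, keep]
```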
The redundancy of Type III has to do with the possible equivalences between the sets of variables of the different databases of the tuple, that is, between the elements of
Let
Let
Applying
The level of redundancy of Type III,
Let
At this point, it is worth noting that our procedure keeps track of the superficial redundancy at each stage, so that each subset and each filtered report may be recovered if necessary.
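Before turning to the examples, a Type III filter can also be sketched (ours): two columns, possibly belonging to different databases of the tuple, are equivalent when they hold exactly the same cases, and only the first occurrence is kept, as happens with the Attacker and Source columns of the example.

```python
# Sketch (ours): Type III redundancy removes duplicated variables across
# the databases of the tuple; the first occurrence of each column survives.
def type_iii_filter(df):
    seen, keep = set(), []
    for i in range(df.shape[1]):
        key = tuple(df.iloc[:, i].astype(str))  # fingerprint of the column's cases
        if key not in seen:
            seen.add(key)
            keep.append(i)
    return df.iloc[:, keep]
```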
If we apply the Type I redundancy filter to the adjacency matrix of Example
Finally, we apply redundancy Type III, getting the following adjacency matrix:
The filtered database of Table
Attacker | Attacker | Target | Time |
---|---|---|---|
35 | 137 | static | 23:56:24 |
56 | 1456 | static | 23:56:39 |
67 | 1456 | dynamic | 23:57:02 |
67 | 378 | dynamic | 23:57:43 |
This section is divided into three subsections. In the first one, the materials used in the experimental section are described. The second one is devoted to developing the tool
The
In the case study, the database is a fragment of a real set of cybersecurity reports with 363 variables (columns) and 2600 rows. It has been anonymized due to confidentiality constraints.
The execution flow of the RTRR (see Figure
General working flow of RTRR.
In the first case mentioned above, the user distinguishes between the three types of nodes of the eventual graph structure by providing two different text files (.txt). One of them contains the list of names of the single-column tables (
The input data source is introduced in the tool to be processed and, when the process finishes, the user obtains four outputs for each request to erase a redundancy type (the user can request more than one type in the same process, as will be described later):
The first output is the graph that represents the input data source, that is to say, the graph with all the variables before removing redundancies.
The second output is the cleaned input data source, that is, the database without redundancies.
The third output is the graph linked to the cleaned input data source, a new graph without the redundant variables.
The last output is a text file that gathers the information about the level of redundancy and the list of the variables removed throughout the execution process.
There are three key features in the tool: the first one is the cleaning of the input source by removing redundant variables. The second one is the two working modes that the tool provides for erasing redundant variables, namely the single mode and the accumulative mode, which will be described below. The third one is the generation of graphs based on the recognition of variables and the subsequent elimination of redundancies.
The tool has two modes for removing redundancies: single mode and accumulative mode.
The single mode allows removing only one type of redundancy at a time because it always processes the original input database. If the user specifies more than one redundancy type to be erased, those types of redundancies are eliminated individually, according to Figure
Single mode of RTRR.
The accumulative mode allows concatenating the removal of several redundancies in the same execution, that is to say, removing multiple redundancies at once. The diagram is radically different because the cleaned database obtained at each stage is used as the input database of the next one, as shown in Figure
Accumulative mode of RTRR.
Finally, the tool generates directed graphs with the isolated names and the relationships between the names and the columns. The isolated names and the names are green nodes and the columns are yellow nodes; these two colors allow the user to identify the direction of each relationship. When an input data source is introduced in the tool, a graph representation of the initial database is generated, as well as a new graph representation for each type of redundancy that the user decides to remove. The user obtains two graphs (the original one and another without one of the types of redundancy) in the basic case and four graphs (the original one and three more, one for each type of redundancy removed) in the worst case.
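The difference between the two working modes can be summarized in a short sketch (ours), reusing the three filters sketched in the previous section; the dictionary and function names below are not taken from the RTRR sources:

```python
# Sketch (ours) of the two working modes of the tool.
FILTERS = {"I": type_i_filter, "II": type_ii_filter, "III": type_iii_filter}

def single_mode(df, types):
    """Each requested type is removed from the ORIGINAL database independently,
    producing one cleaned copy per requested type."""
    return {t: FILTERS[t](df) for t in types}

def accumulative_mode(df, types):
    """The cleaned database of each stage is the input of the next stage, so
    the final database is free of all the requested types at once."""
    for t in types:
        df = FILTERS[t](df)
    return df
```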
In order to compare RTRR with other database cleaning tools, we propose the following assessment indicators for any tool that integrates a cleaning procedure:
Required computational resources: for example, the minimal RAM required for the tool to work properly and the operating systems that can support it, such as Windows (W), Linux (L), and MacOS (M).
Types of redundancy cleaning tasks, that is, what kinds of redundancies the tool is able to remove.
Provided services: for example, the possibility of getting back the original database, the cleaned database, and the removed entries or variables at each stage of the cleaning process, as well as graphical representations of it.
Formats and sizes of the datasets that the tool is able to manage.
Platform features, such as a graphical interface or an online version.
License.
Note that the assessment of each indicator could be considered positive or negative depending on the context in which the tool works. For instance, there are cleaning tools that work only online, with no version working on local hosts. This could be positive a priori, but in a confidentiality context such a tool could not be taken into account.
In Table
In Table
Analysis of RTRR.
Indicator | RTRR Tool |
---|---|
Minimal required RAM | NO |
Redundancy type I | |
Redundancy type II | |
Redundancy type III | |
Representations | Graphs |
Allowed input format | v1: CSV, PostgreSQL, MySQL, MS Excel; v2: CSV, MS Excel |
Output format | CSV |
Size restrictions | NO |
Operating systems | W/L/M |
Online version | X |
Local version | |
Interface | Console |
License | GPL (free) |
Company | RIASC |
Analysis of cleaning dataset tools. In the notation
Indicator | T1 | T2 | T3 | T4 | T5 | T6 | T7 |
---|---|---|---|---|---|---|---|
Minimal required RAM | NO | NO | 4 GB | NO | NO | 2 GB | 128 MB |
Redundancy type I | NO | NO | NO | NO | NO | | NO |
Redundancy type II | NO | NO | NO | NO | NO | NO | NO |
Redundancy type III | NO | NO | NO | NO | NO | NO | NO |
Representations | NO | NO | statistical | statistical | statistical | NO | NO |
Allowed input format | text | CSV, text | CSV, text, | CSV, MS Excel, | CSV, text, | CSV, text, | CSV, text, |
Output format | CSV, text | CSV, TSV | CSV, JSON | CSV, | CSV, text | CSV, text | CSV, text |
Size restrictions | 1000 cases and 40 variables | NO | NO | NO | NO | 1,048,576 cases and 16,384 variables | 1000 cases/ |
Operating systems | W/L/M | W/L/M | W/M | W/L/M | W | W/M | W |
Online version | | X | X | Cloud | X | X | Cloud |
Local version | X | | | | | | |
Interface | Web | Web | Desktop | Desktop | Console/desktop | Desktop | Desktop |
License | Free | BSD | Free | Free/ | Free/ | Free/ | Free/ |
Company | Stanford/ | Open | Trifacta | Human | WinPure | Microsoft | Ashisoft |
Although the RTRR tool covers different needs regarding data cleaning, there are still certain limitations that leave this work open for future research.
In the first place, it would be necessary to enlarge the set of input and output formats that the tool can deal with. It is also important to highlight that RTRR has been designed to work on a localhost, so it would be necessary to adapt it to different frameworks; for instance, it would be important to have another version able to work in the cloud.
Note also that there is no graphical interface for RTRR, and it would be important to design one in order to bring the tool to nonexpert users.
As the last limitation, it is worth highlighting the fact that RTRR cleans redundancies of three different types, although more types can be detected in a database. As future work, the introduction of more types of redundancies will be considered.
A cyber database is formed by a huge amount of security reports (
Usually, a cyber database has the following properties:
Volume: a cyber database receives a large amount of data every day.
Variety: the structural variety of the data of security reports depends on how the data are acquired, ranging from machine-generated data (acquired by engines) and correlated data (data correlated by engines based on different rules) to synthetic or artificial data (added by expert agents). The types of variables are not uniform: they can be structured, unstructured, or semistructured, and these typologies provide quantitative, qualitative, string, or pseudo-qualitative features. Cyber databases are dynamical, and features can take new values, a priori unknown, every single day, even those features that are categorical.
Velocity: the rate of creation, storage, and analysis of the data is very high. It is also not constant, because the sources do not update the reported information periodically.
Veracity: cyber databases are volatile because of their short lifetime. Data quality is measured bearing in mind a certain white noise in the transmission. Moreover, the validity of the data depends strongly on the accuracy of the information and the reliability of the data source.
Valence: a cyber database is a dense set of data, since we usually find high valence. Data are related to each other because they rely on real events reported by several agents, but these relations are not explicit, and there are links among sources, types of attacks, reports, and incidents of the same type of attack.
Value: finally, the value is precisely the actionable knowledge that we can get from the cyber database by analyzing the quality of the data, automating processes, predicting incidents, or detecting intrusions in different networks (see [
In a cybersecurity context, we usually cannot design the whole data acquisition process. Then, the task of cleaning data is always the first available stage of the procedure in which we can try to improve efficiency. Usually, superficial redundancy is present because common data integration systems are not designed for cybersecurity reports.
Now we will analyze the superficial redundancy of the real cyber database described at the beginning of this section, by applying the tool developed in Section
After applying the redundancy maps described in Section
Level of redundancy in the case of study.
Adjacency | | | | Redundancy |
---|---|---|---|---|
| 28 | 24 | 131 | – |
| 28 | 15 | 129 | |
| 23 | 9 | 129 | |
| 13 | 8 | 76 | |
The results in Table
The evolution of the associated graphs in the different cleaning stages is given in Figure
Graphs.
Original graph
Graph without T2
Graph without T2 and T3
Cleaned without T1, T2, and T3
The time taken to obtain the initial matrices, the graph, and the levels of redundancy, for both the Python mode and the MapReduce mode, is shown in Table
Time needed to complete the tasks in Python and MapReduce.
Redundancy | Python | MapReduce |
---|---|---|
| 13 sec | 6 min 10 sec |
| 20 sec | 6 min 3 sec |
| 6 sec | 6 min 21 sec |
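For completeness, the following sketch (ours) illustrates how the MapReduce variant could phrase, for instance, the detection of Type III redundancy; it is a plain-Python simulation of the map, shuffle, and reduce phases, not the actual distributed job used in the measurements above:

```python
# Sketch (ours): map/reduce-style detection of duplicated columns (Type III).
import hashlib
from itertools import groupby

def map_phase(df):
    # Emit one (fingerprint, column_position) pair per column.
    for i in range(df.shape[1]):
        cells = "\x1f".join(df.iloc[:, i].astype(str))
        yield hashlib.sha1(cells.encode()).hexdigest(), i

def reduce_phase(pairs):
    # Group by fingerprint; every column after the first of a group is redundant.
    redundant = []
    for _, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        positions = [pos for _, pos in group]
        redundant.extend(positions[1:])
    return redundant                  # positions of the removable columns
```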
After removing redundancies of Types I, II, and III, about 64% of the original database is dropped. Hence, the filtered database is, roughly speaking, a third of the original one.
In this work, we have developed a novel graph approach for certain databases that allows computing the level of redundancy as the 1-norm of some adjacency and weighted matrices.
Furthermore, a tool (RTRR) to detect and remove some kind of redundancies has been presented and described, making use of the above theory.
Finally, this tool has been applied to a real cyber database built up from several sources, which presented quite a high level of redundancy, showing that approximately two-thirds of the data could be useless for further analysis.
As future work, we propose to model more types of redundancies and, in particular, to face deep redundancies. We will also focus on improving the RTRR tool by creating a graphical interface that makes it more accessible to nonexpert users and by developing an online version.
The authors declare that they have no conflicts of interest.
The authors would like to acknowledge the Spanish National Cybersecurity Institute (INCIBE) that partially supported this work.