To process huge amounts of data, computing resources need to be organized into clusters that can be scaled out easily. However, traditional SQL databases built on the relational data model are difficult to deploy in such clusters, which has motivated the movement known as NoSQL. Yet NoSQL databases have limits of their own, imposed by their data models. In this paper, the original soft set theory is extended, and a new theoretical system called the n-tier soft set is proposed. We systematically construct its concepts, definitions, and operations, establishing it as a novel soft set algebra. Several features of this algebra display its natural advantages as a data model that can combine the logicality of the SQL model (also known as the relational model) with the flexibility of NoSQL models. This data model provides a unified and normative logical perspective for organizing and manipulating data, combines metadata (semantics) with data to form a self-describing structure, and combines indexes with data to enable fast locating and correlating.
Since the beginning of the 21st century, with the explosion of Internet applications, the total amount and complexity of digital information possessed by humanity have increased at an unprecedented speed, exhibiting many new features. Some professionals believe that we have entered the era of big data [
To process large-volume, fast-flowing, and complex data within a limited time and generate value from it, more computing resources must be acquired. There are usually two schemes: scale up and scale out.
Scaling up means configuring a single computer with higher-performance hardware, such as more and stronger CPUs and larger and faster memory and disks, without increasing the number of computers. However, the performance of computer hardware available on the market at any given time has an upper limit, and the performance-price ratio of high-end products is usually low, which incurs high cost.
By increasing the number of computers rather than the performance of a single computer, scaling out incorporates a large number of cost-effective low- or mid-end computers into a cluster to increase computing power. That not only reduces cost by comparison but also makes the cluster more resilient: even if some computing nodes fail, the entire cluster can continue to provide services.
However, the relational data model [
Firstly, according to the definition of the relational model and its normalization theory, a tuple is an ordered list of atomic values that cannot be nested or contain collection types (set, list, and so on), which makes it difficult to represent complex structures; there is no such restriction on the variables used in application programming, resulting in an “impedance mismatch” (a metaphor for the mismatch between the data forms of the relational data model and the application programming model). At present, this problem is usually bridged by a middle layer called ORM (object-relational mapping).
Secondly, the relational model uses normalization to reduce redundancy, avoid anomalies, and ensure the integrity of databases. In a relational database that follows the third (or a higher) normal form, the data involved in a unit process of an application are typically scattered across different tables. To ensure the ACID requirements of a transaction (the four basic elements of correct transaction execution: atomicity, consistency, isolation, and durability) and the integrity constraints required by the normal form, a series of locks must be taken and resources consumed. Under high concurrency or with huge volumes, this can severely affect the performance and availability of the database.
Moreover, the relational model is algebraically based on relations rather than mappings, and so cannot express indexes by itself (whereas a mapping is a natural mathematical abstraction of an index). That renders indexes external structures, separated from data in implementation, which not only increases the demand for storage space but also makes data difficult to locate from one another on their own. To correlate data across different tables, it is necessary to write complex SQL queries and use the expensive Join operation. And to support Joins between tables, the related tables must be placed on the same node, which is not conducive to data dispersion in a cluster and usually requires manual sharding design, making relational databases difficult to scale out.
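The contrast between relation-based and mapping-based location can be made concrete in a short sketch (illustrative Python with hypothetical data, not part of any database product):

```python
# A relation is a set of tuples: finding a tuple containing a given
# value requires scanning. A mapping is a natural abstraction of an
# index: looking up a key is a direct location step.
rows = {("0001", "Joe"), ("0002", "Eva")}       # relation: must scan
by_id = {"0001": "Joe", "0002": "Eva"}          # mapping: direct lookup

# O(n) scan over the relation vs. O(1) average lookup in the mapping.
name_by_scan = next(name for (cid, name) in rows if cid == "0002")
name_by_index = by_id["0002"]
assert name_by_scan == name_by_index == "Eva"
```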
Meanwhile, a relational database needs a rigid predefined schema: one has to predefine the structures and constraints of tables. In practice the schema is very difficult to change, falling short when dealing with changing sources and requirements.
Those problems of relational databases have motivated the development of some database products called NoSQL and inspired a new round of innovation for database theory and practice [
The main reason these databases shift the view of data from relations to key-value structures (including simple key-value, column, and document) is to deal with aggregates. Unlike tuples in relational databases, aggregates are usually designed and used by upper applications (not by databases). An aggregate organizes all the data needed by a single processing unit so that they are accessed together, eliminating expensive and complex SQL queries and table Joins. Aggregates, as natural and independent data distribution units, also make data easy to disperse in a cluster. The form of an aggregate is also free, so content can easily be added or deleted. Thus, impedance mismatch can be solved without an ORM intermediate layer.
Although key-value typed databases have partly solved some problems of relational databases, they lack rigorous mathematical foundations and provide no connectivity between aggregates, making complex querying and understanding the connections among data difficult. On the other hand, relational databases, with their rigorous and precise algebraic foundation, may use a powerful query language based on relational algebra to analyze and reason about data freely and logically when the data are small and reside on a single machine. However, with big data or in a cluster, it is difficult to dig value out of the connections among data using the Join operation. So the graph database, based on graph theory, is designed to explore the connections among data expediently. The graph model represents data as a set of nodes, node attributes, and edges, providing fast and efficient traversal of the whole graph through index-free adjacency. However, the graph model focuses on connections and networks and is not good at expressing entities and their attributes (mathematically, nodes in a graph have no attributes, and in implementation, simple key-value pairs are used to store attributes), so it has a specialized range of application and lacks generality [
At present, the mainstream database models are the relational (SQL) model and the NoSQL (key-value, column family, document, graph, etc.) models. The latter were proposed to solve the problem that the relational model is too rigid to change the database schema (especially over vast amounts of data) and difficult to distribute. However, the new NoSQL models sacrifice the mathematical rigor of the relational model and the freedom of query expression.
A model that rests on the same mathematical-logical foundation as the relational model while using a key-value class of data structures urgently needs study. Such a model would be easy both to distribute and to change in schema. We believe this improvement can use the “key-value pair” data structure in a distributed environment to realize a database with rigorous algebraic logic, which combines the advantages of SQL and NoSQL and has concrete practical significance.
All these problems motivate us to explore a new data model that will not only retain the merits of key-value structures, lend data the ability to describe itself, and be easy to locate and move in a cluster but also have an appropriate normalization and a rigorous algebraic basis like the relational model, enabling a powerful, product-independent query language to be applied freely and logically. In the end, we focused on an algebraic theory called soft set theory. Soft set theory is a mathematical theory proposed by the Russian mathematician Molodtsov in 1999 to address uncertainty problems. The basic idea is to provide semantically parameterized sets by using a generalized set-valued mapping [
Precisely because a soft set is a mapping that allows fuzzy semantics for its parameters and sets as its return values, because mappings in mathematics have a natural connection with key-value structures, and because sets as return values can have internal structures that can be manipulated, we finally saw the hope that soft sets could serve as a mathematical abstraction for an intricate key-value structure [
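To make this connection concrete, here is a minimal Python sketch of a soft set as a mapping from parameters to subsets of a universe (the animals and labels are hypothetical examples, not from the paper):

```python
# A soft set (F, E) over a universe U is a mapping F: E -> P(U):
# each (possibly fuzzy) parameter is a key whose value is a subset of U.
U = {"elephant", "monkey", "zebra", "sparrow"}

F = {
    "large": {"elephant", "zebra"},
    "can_climb": {"monkey"},
    "can_fly": {"sparrow"},
}

# Every value is a subset of U with internal structure that can be
# manipulated by ordinary set operations.
assert all(value <= U for value in F.values())
assert F["large"] | F["can_fly"] == {"elephant", "zebra", "sparrow"}
```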
Molodtsov gave the initial definition of soft set and a general operation and introduced several possible applications in [
In the second section, we will review the soft set theory. Because previous soft set theories are not suitable to be the algebraic basis of the data model we need, we will extend the original soft set theory from the basic structure and systematically introduce a new soft set algebra called n-tier soft set, including its definitions, operations, and related concepts, which will form a complete system and provide the theoretical basis for the later data model. In the third section, we will illustrate why and how to use n-tier soft set to build a data model, define the infrastructure and modeling principles, and finally, explain its features and advantages.
Before defining n-tier soft set, we first review the basic definition of soft set.
Let a nonempty set
This definition is slightly different from Molodtsov’s initial one [
A mapping can also be treated as a set of ordered pairs, so an equivalent definition is given.
Let a nonempty set
Examples for soft set: let
The definitions above show that mapping and set are two equivalent views of a soft set. So, for soft sets, the general properties and operations of sets also apply (for example, intersection, union, and complement in the sense of a general set). However, the results of these operations may not be closed within soft sets (like the union operation
Those notations are concise and enable us to see an important property of soft sets clearly, that is, the ability to maintain mapping after some splitting, merging, or deformation operations.
Meanwhile, because a soft set can be seen as a set-valued mapping, we can also consider a soft set as a set. Such a definition provides a crucial recursive way to construct new structures, which furnishes soft set theory with new and richer content. Next, we will introduce a new notation
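Because the value of a key can itself be a soft set, mappings can nest recursively. The following sketch (hypothetical data, plain Python) shows a 2-tier soft set as a dict of dicts of sets:

```python
# A 2-tier soft set: the outer mapping's values are themselves soft
# sets (mappings to sets), giving a nested key-value structure.
two_tier = {
    "person": {"name": {"Joe", "Eva"}, "sex": {"male", "female"}},
    "course": {"name": {"Database"}},
}

def tiers(s):
    """Count the mapping layers: a plain set contributes no tier,
    and each dict layer adds one."""
    if isinstance(s, dict):
        return 1 + max(tiers(v) for v in s.values())
    return 0

assert tiers(two_tier) == 2
assert two_tier["person"]["name"] == {"Joe", "Eva"}
```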
Firstly, we define n-tuple, n-ary Cartesian product, and some other related concepts and introduce some notations to facilitate the following discussion.
An n-tuple is a finite ordered list of
And let
In this paper, we use
Let
Using the usual notation
In particular, when
N-tier soft set: let
When
Among which,
When
Among which,
When
In this paper,
When
When
Next, we will define some other important concepts related to soft set.
Soft empty set
Among which,
Soft universal set
Soft subset
It is important to note here that in earlier soft set theory, the conditions of soft subset can be summarized as follows:
Equality = : let
Let
Using induction, we first prove the base case.
According to the definition, when
Then, when
Next, we prove the inductive step: if when
According to the inductive hypothesis,
So, if the proposition is true when
Soft power set
It is easy to prove the following properties of soft subset and soft power set by using similar inductive methods in Proof 1.
For any
Soft union
In addition, let
Soft intersection
In addition, let
Soft difference
Soft complement
Soft symmetry difference
The above operations of n-tier soft set have the following properties.
Let Commutative law: Associative law: Distributive law: Identity element: Zero element: Inverse element: Complementary law: Idempotent law: Absorption law: De Morgan law:
Using an inductive method similar to that in Proof 1, these properties can be proved directly from the definitions; the details will not be repeated here.
Soft range
Key set
Value set
Selection
Please note that an n-ary predicate is reduced to an
Domain remove
Since there is no ambiguity, we use the same symbol for the n-tier soft set and the n-tuple; the reader can distinguish them by context.
Domain rise
In particular, when
Uncurrying
Uncurrying transforms an n-tier soft set into an
Currying
Note here that an n-ary function is reduced to an (
Currying transforms an (n−1)-ary mapping into an n-tier soft set.
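Uncurrying and currying can be sketched on nested Python dicts (a plain illustration of the idea under the nested-dict representation above, not the paper's formal operators):

```python
def uncurry(soft_set, prefix=()):
    """Flatten a nested soft set into one mapping whose keys are the
    tuples of keys along each access path."""
    flat = {}
    for key, value in soft_set.items():
        if isinstance(value, dict):
            flat.update(uncurry(value, prefix + (key,)))
        else:
            flat[prefix + (key,)] = value
    return flat

def curry(flat):
    """Rebuild the nested soft set from the tuple-keyed entries."""
    nested = {}
    for path, value in flat.items():
        node = nested
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value
    return nested

ss = {"person": {"name": {"Joe"}, "sex": {"male"}}}
flat = uncurry(ss)
assert flat == {("person", "name"): {"Joe"}, ("person", "sex"): {"male"}}
assert curry(flat) == ss  # currying inverts uncurrying
```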
Concatenate production
Particularly,
Soft direct production
Soft mapping production
According to the definition, soft mapping production is associative, namely,
Soft relation: let
Associated relation of a soft set: let
Associated soft set of a relation: let
Note that here we indirectly used the inductive definition of tuples. That is, any n-ary tuple can be considered a nested binary tuple when
Mathematically, the n-tier soft set defined in this section and its operations offer rich content for further study. They have nice properties: soft intersection, soft union, soft complement, and the other operations satisfy all the usual properties of set operations (commutative law, associative law, etc.). However, this paper does not focus on the mathematics. Next, we will focus on explaining why and how to use the n-tier soft set as a data model for databases in the era of big data.
Things and events have no natural expression of their existence. Only through purposeful selection, abstraction, and simplification can we transform specific aspects of irregular fields into structured, manipulatable objects. A data model describes the static characteristics and dynamic behavior of a database system at an abstract level, provides a logical framework for data representation and operation, and fundamentally determines how data are stored, organized, and manipulated. Therefore, the data model is the core and foundation of a database system, and every database system must be based on some data model. The data model also constitutes a bridge between the upper applications, the database system itself, and its underlying physical implementation, enabling them to view and use the data in a unified way.
We have already explained the problems of the relational data model and the most popular NoSQL data models in the Introduction. In the second section, the n-tier soft set was defined as a nested set-valued mapping, which makes it possible to express complex key-value structures. Next, we will set up a new data model using the n-tier soft set algebra.
Just as we often use a table to represent a relational model, for ease of illustration we first introduce a plain-text representation of n-tier soft sets. It is similar to JSON and independent of specific programming languages, and we call it SSSN (soft set serialization notation). The basic construction rules are as follows (a demonstration only; the strict definition and parsing method will not be discussed in this paper):
(1) Strings are written in double quotation marks, numerical values as literal numbers, and Boolean values as true/false, for example:
"hello World this is SSSN" #String
12345678 #Number
true #Boolean
(2) Tuples are written as contents enclosed in parentheses and separated by commas, for example:
("Joe", "Male", "New York")
(3) Sets are written as contents enclosed in braces and separated by commas, with no repeated elements, for example:
{"Elephant", "Monkey", "Zebra", "Panda"}
(4) Mappings are written as contents enclosed in braces, separated by commas, and paired by colons (several-to-one ordered pairs; the left side of a colon cannot be duplicated), for example:
{"name":"Joe", "sex":"male", "address":"New York"}
(5) Bijective mappings are written as contents enclosed in braces, separated by commas, and paired by double colons (one-to-one ordered pairs; neither side of a double colon may be duplicated), for example:
{"20181001"::"Oct−1−2018", "20181031"::"Oct−31−2018"}
When colons or double colons are used for pairing, the left side may only be a string, number, Boolean, or tuple, and the right side may be any type of value defined above:
{
  ("name", "sex"):{"Joe":"male", "Eva":"Female"},
  ("name", "birthday"):{"Joe":"20001001", "Eva":"20000110"}
}
In the following discussion, we will see that expressing and transmitting n-tier soft sets in this way conveys not only the semantics and data integrated by the n-tier soft set but also some important logical constraints.
Next, we will define the n-tier soft set data model.
Domain soft set: let
The elements in
Domain soft set (values abridged):
{ "Joe", "male", "20011231", "19990723", "female", "Eva", "19980301", "Adam", "Bob", "860702", "13320255520", ... }
{
  "name":{"Joe", "Eva", "Adam", "Bob", ... },
  "sex":{"male", "female", ... },
  "birthday":{"19870220", "19990723", "19980301", ... },
  "telephone":{"860702", "13320255520", "1191101", ... },
  ...
}
A domain soft set combines semantics and data, determines the domains involved, defines the semantic name and value range of each domain, and establishes the finite boundaries of a database system.
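In Python terms, a domain soft set like the abridged example above is simply a mapping from domain names to value ranges; the union of the ranges recovers the finite universe of the system:

```python
# Abridged version of the domain soft set example: each semantic domain
# name maps to that domain's value range.
domain_soft_set = {
    "name": {"Joe", "Eva", "Adam", "Bob"},
    "sex": {"male", "female"},
    "birthday": {"19870220", "19990723", "19980301"},
}

# The finite boundary of the system is the union of all value ranges.
universe = set().union(*domain_soft_set.values())
assert {"Joe", "male", "19990723"} <= universe
```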
Domain relation soft set: let
The elements in
Domain relation soft set:
{
  ("person", "name"):{"0001":{"Joe"}, "0002":{"Eva"}, ... },
  ("person", "sex"):{"0001":{"male"}, "0002":{"female"}, ... },
  ...
}
By soft mapping production (Definition
Database soft set: let
The elements in
Database soft set:
{
  "college":{
    ("person", "name"):{"0001":{"Joe"}, ... },
    ("person", "student_id"):{"0001":{"20130201025"}, ... },
    ("course", "name"):{"CS001":{"Database"}, ... },
    ("course", "credits"):{"CS001":{"3"}, ... },
    ...
  },
  "E−Shopping":{
    ("person", "name"):{"0001":{"Joe"}, ... },
    ("person", "customer_id"):{"0001":{"00302001"}, ... },
    ...
  },
  ...
}
Aggregate soft set: let
The elements in
Aggregate soft set: let
{
  "student":{
    ("person", "name"):{"0001":{"Joe"}, ... },
    ("person", "student_id"):{"0001":{"20130201025"}, ... },
    ...
  },
  "course":{
    ("course", "name"):{"CS001":{"Database"}, ... },
    ("course", "credits"):{"CS001":{"3"}, ... },
    ...
  },
  ...
}
Aggregate soft sets divide a soft set database into different subsets and give each subset a name.
Meanwhile, let
Therefore, all the objects defined in this section can be represented as an n-tier soft set consisting of a semantic set
For example, for the domain relation soft set
{
  "person":{
    "0001":{"name":{"Joe"}, "sex":{"male"}, ... },
    "0002":{"name":{"Eva"}, "sex":{"female"}, ... },
    ...
  }
}
Then, it looks like the BigTable data model proposed by Google in [
Through the above definitions, we obtain the basic components needed to build the n-tier soft set data model. Next, we use an example to demonstrate the evolution from the relational model through the four popular NoSQL models to the n-tier soft set model, to show why and how to use the n-tier soft set data model for modeling.
In the traditional modeling process for relational databases, the initial stage is to understand the conceptual entities in the modeling domain and the relationships between them. Through discussion among domain experts, system architects, and data architects, these understandings usually crystallize into a so-called conceptual model, often represented by an ER diagram. Although the ER diagram is typically used when modeling for the relational model, it can also provide a common conceptual starting point for all the other models in our discussion.
We suppose that we have designed a conceptual model, as shown in Figure
ER model for E-shopping.
Then, firstly, let us see how to model the scenario with the relational model.
After obtaining a suitable conceptual model, the relational model transforms it into the structures and constraints of tables. As shown in Figure
Relational model for E-shopping.
The advantages of the relational model lie in its simple and intuitive expression, its strict and elegant mathematical foundation, and the freedom afforded by separating logic from physics. Without any underlying implementation information, a relational database can freely express and retrieve the information contained in an existing dataset with a small number of concise operations (relational algebra has been proved equivalent to first-order predicate calculus restricted to safe expressions). However, we can also observe several problems with the relational model, as can be seen from Figure
Flat: a tuple is a flat and restricted structure that can only contain indivisible elements. These elements are regarded as atoms at the model level; they have no internal structure and cannot be nested, which restricts the model's ability to express complex objects and brings the so-called impedance mismatch problem.
Rigid: in a table, every tuple must contain the same fixed number of elements, and each element is rigidly coupled with its position, so even if a position actually holds no value, its place must be filled with null.
Semantic and data separation: table heads (semantics) and table bodies (data) are separated. In the theory of the relational data model, table names and column names are defined by a metalanguage, and in a specific implementation, a relational database uses a data dictionary separated from the data to store this metadata. That makes it necessary to process metadata separately before transferring data. This separation of semantics and data makes it difficult to transmit data over a network, whereas data formats such as XML or JSON, which combine semantics with data, can transmit complete information at once.
Index and data separation: the relational model does not express information about how tuples are located or sorted. To find the tuples containing certain values in a table, one has to scan and compare them one by one.
This renders the relational model overly reliant on external index structures in real use. However, indexing is not part of the relational model; it not only consumes large storage space but also incurs maintenance costs.
Data and data separation: whether within the same table or between different tables, the tuples of the relational model are separated from one another. Their connections, which need to be computed dynamically, are implicit in the values of specific data. Conceptually, this shows that the relational model does not directly express the relations between entities. To find links between entities, it is necessary to connect tables with the Join operation, which is usually very time-consuming.
These problems in the relational model have prompted the development of NoSQL data models and database products.
Key-value model for E-shopping.
Column family model for E-shopping.
Document model for E-shopping.
Generally speaking, all three models above use key-value pairs as the basic structure to organize data. Different models use differently structured values, which provide different ways of aggregating information.
Key-value pairs are simple but essential. Keys can provide semantics for values, which decouples data from their positions and eliminates the rigidity of the system. A key-value pair is a self-described whole that no longer depends on others in form. At the same time, keys can also help locate values so that they can be accessed quickly. This allows key-value pairs to be easily dispersed into a cluster, and their contents and forms can be very free and flexible. So, we can predetermine all the required content according to the convenience of the upper application and aggregate it together for fast access without Join operations. That partly solves the problems of the relational model. However, key-value typed models also have some problems:
Values can only be accessed one way, by keys; keys cannot be retrieved from values in reverse (the directions and granularities of access for the different models are shown by the arrows in the figures). To find specific key-value pairs by value, it is necessary to build external indexes or use external frameworks such as MapReduce for scan processing.
There is no connection between key-value pairs. Discrete key-value pairs have many advantages, and they can be formed and operated independently, but we also hope that they can maintain their logical connections (we will see how to achieve this in the subsequent discussion of the n-tier soft set model).
The form of key-value typed databases is changeable (they are known as schemaless databases), but the same is not true for querying and reasoning (which is what the relational model is good at). The contents of aggregates are prepared and stored for specific needs, and aggregates designed for one application are not necessarily suitable for others, which becomes another kind of inflexibility.
Key-value typed models have no rigorous mathematical basis.
A strict mathematical foundation not only makes the definition and expression of a model more rigorous but also facilitates theoretical study of the model, the deduction of its properties and theorems (or the reuse of existing results), and the assessment of its logical reliability and completeness. It also makes it easy to design a concise and general query language (for example, the relational model achieves powerful logical expression with only a few operations).
Graph model for E-shopping.
The various models and their problems have been discussed above. Now, let us take a look at how to model with the n-tier soft set model (hereafter referred to as the NTSS model).
Entity in the NTSS model.
Relation in the NTSS model.
Connections in the NTSS model.
Entities and attributes: as shown in Figure
Relations: as shown in Figure
Connections in the NTSS model: as shown in Figure
The cardinality constraint of a connection is expressed and implemented by the values of its domain relation pairs. For any connection C between domain A and domain B, we have the following:
If it is a 1 : 1 connection, the values of domain relation pair C,
If it is a 1 : n connection,
If it is an n : m connection,
For example, in Figure
Macroscopically: we can see the similarities between the NTSS model and the ER model in the upper half of the figure. The NTSS model, like conceptual models such as ER, retains an intuitive panorama of its modeling domain and constructs a network of domains with rich semantics and sufficient connections, close to natural human thinking.
Microscopically: in the lower half of the figure, each connection between domains in the NTSS model is represented by a pair of named soft sets that are the reverse of each other. An NTSS database is actually made up of such pairs of soft sets.
In implementation: an NTSS database is an n-tier soft set, so it can be uncurried (Definition
{
  "E−Shopping":{
    ("customer_id", "name"):{
      "0001":"Joe",
      "0002":"Eva",
      ...
    },
    ("name", "customer_id"):{
      "Joe":{"0001", "0086", "0223", ... },
      "Eva":{"0002", "0332", "0487", ... },
      ...
    }
  }
}
which is a 4-tier soft set and can be transformed into key-value pairs as
{
  "E−Shopping, (customer_id, name), 0001":"Joe",
  "E−Shopping, (customer_id, name), 0002":"Eva",
  ...
  "E−Shopping, (name, customer_id), Joe":{"0001", "0086", "0223", ... },
  "E−Shopping, (name, customer_id), Eva":{"0002", "0332", "0487", ... },
  ...
}
So, if we use a hashtable as the underlying implementation of an NTSS database, the information contained in the keys will be implied in storage addresses, and values will be hashed while maintaining the logical structure of the database.
In usage: through our formal definitions, for upper application programmers, an NTSS database is just a function with a set of well-defined operations and uniform specifications. In fact, referring to the example mentioned above, let B be the database soft set which contains the "E−Shopping" database; in upper programming languages, the database soft set B is just a function whose return values are also functions.
By giving the parameter "E−Shopping", B("E−Shopping") returns the value (a domain relation soft set) of the database named "E−Shopping", which can still be regarded as a function. By giving the parameter ("customer_id", "name"), B("E−Shopping")("customer_id", "name") returns the value of the domain relation (still a function) between "customer_id" and "name". By giving a "customer_id" such as "0001", B("E−Shopping")("customer_id", "name")("0001") returns the name of that customer. This is very natural in languages that support functional programming and naturally constitutes a concise query language.
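A minimal sketch of this functional access style in plain Python (the data are the hypothetical example values above; this is an illustration, not the prototype's implementation):

```python
# Wrap each nesting level of an n-tier soft set in a callable, so the
# database can be queried as a chain of function applications.
def as_function(soft_set):
    if not isinstance(soft_set, dict):
        return soft_set
    return lambda key: as_function(soft_set[key])

B = as_function({
    "E-Shopping": {
        ("customer_id", "name"): {"0001": {"Joe"}, "0002": {"Eva"}},
        ("name", "customer_id"): {"Joe": {"0001"}, "Eva": {"0002"}},
    }
})

# B is a function; each application peels one tier off the soft set.
assert B("E-Shopping")(("customer_id", "name"))("0001") == {"Joe"}
assert B("E-Shopping")(("name", "customer_id"))("Eva") == {"0002"}
```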
NTSS model for E-shopping.
Next, we will expound the advantages of the NTSS model and explain why the NTSS model is suitable for dealing with big data.
First, we show the performance advantages of the NTSS database over a relational database through a comparative experiment. We implemented a prototype database based on NTSS (in Python) and compared it with MySQL 8.0.15 on a computer with a 2.6 GHz Intel Core i7, 16 GB of 1600 MHz DDR3 memory, and a 512 GB PCI SSD. We built three experimental tables, Customer, Product, and Buy, to express records of customers purchasing products. Each time, 10,000 records are written and the write time is recorded; then the names of the customers who purchased 5 random products are queried and the read time is recorded.
MySQL write and read statements are similar to the following:
# Write
insert into customer (cust_id, cust_name, cust_sex) values ('c00001', 'Joe', 'male')
insert into product (prod_id, prod_name, prod_desc) values ('p00001', 'tv', 'just_a_television')
insert into buy (cust_id, prod_id, buy_time) values ('c00001', 'p00001', '20190101163749')
# Read
select prod_name, cust_name from cust a, buy
NTSS write and read statements are similar to the following:
import ntssdb as nb
nb = nb.connect(host="localhost", dbname="test", user="root", password="pw")
# Write
nb("cust_id", "cust_name", "cust_sex").put("c00001", "Joe", "male")
nb("prod_id", "prod_name", "prod_desc").put("p00001", "tv", "just a television")
nb("buy_id", "cust_id", "prod_id", "buy_time").put("b00001", "c00001", "p00001", "20190101163749")
# Read
nb("prod_name", "prod_id", "buy_id", "cust_id", "cust_name").get("tv", "phone", "pad", "car", "coke")
In the experiment, we compared five key indicators with MySQL:
(1) how insertion time grows with the amount of data when MySQL is not indexed;
(2) how reading time grows with the amount of data when MySQL is not indexed;
(3) how insertion time grows with the amount of data when MySQL is indexed;
(4) how reading time grows with the amount of data when MySQL is indexed;
(5) space usage.
It can be seen from Figures
Performance comparison between NTSS and unindexed MySQL.
Performance comparison between NTSS and indexed MySQL.
Comparison of storage consumption between MySQL and NTSS DB.
When MySQL was indexed, the insertion and reading time is O(log(n)) in theory (because the index of MySQL is usually implemented by
For space usage, NTSS takes about 2.73 times as much space as a nonindexed MySQL database to store the same data (NTSS: 402 MB, MySQL: 147 MB). However, if MySQL is to query more freely (indexing all columns), its index space will be about 307 MB, so it will take up 147 + 307 = 454 MB in total, which is more than NTSS.
We do not compare performance with current NoSQL databases. As a prototype database implemented in Python, NTSS is not comparable in performance with mature NoSQL databases that have evolved for many years. Compared with current NoSQL databases, NTSS has the advantages of query freedom and mathematical logicality. Take MongoDB as an example: as a popular database, MongoDB is widely applied in everyday applications and has extremely high performance on some queries, but it has no mathematical logicality and cannot query freely (it is strong at querying from key to value, but weak at querying from value to key). So, if you need to get the relationships between values, it costs a lot (an index structure or a traversal scan is needed). The NTSS model, however, has complete mathematical logicality and can query freely between keys and values. The NTSS database cannot compete with MongoDB from an implementation perspective because NTSS remains at the prototype level; it will gradually approach the current mainstream NoSQL databases through future improvements.
Based on the above experiments and previous discussions, we can clearly see that the NTSS model has the following advantages:
- Efficient performance: as the comparison experiment shows, MySQL is a relational database whose data and indexes are separate; its performance depends on index design, and write/read performance cannot be reconciled with query convenience. An NTSS database, by contrast, can be transformed to key-value pairs and implemented directly as a hashtable; therefore, any datum in it can be written or read with an average time complexity of O(1).
- Schemaless: the NTSS model represents entities or aggregates as interconnections between domains rather than as fixed tables. Connections in the NTSS model are logically represented by n-tier soft sets and implemented as key-values in the underlying storage; they are independent of each other and can be added or deleted at will without mutual influence. This solves the flatness and rigidity problems of the relational model. For example, to split the "name" domain connected to "customer_id" into "first_name" and "last_name," we only need to add two new connections between "first_name," "last_name," and "customer_id" and delete the original one. This affects no other part of the database, either logically or physically.
- Semantic and data integration: the NTSS model represents semantics and data in an integrated way, which makes the data easier to move and disperse; metadata no longer needs to be processed separately.
- Index and data integration: an instance of the NTSS model is a nested index structure, and each atomic datum has a unique logical access path. The data stored in an NTSS database form a complete index system in themselves, and every domain can be used as an index key to locate data in the domains connected to it. This solves the index/data separation problem of the relational model and is the key to efficient performance and sufficient connections.
- Sufficient connections: the atomic data in an NTSS database are no longer isolated but form a network. In the NTSS model, entity domains are connected to each other, attribute domains are connected to entity domains, these connections are static states of the model, and each connection is bidirectional. This solves the lack of connections in the relational model and the key-value typed models; in particular, key-value typed models can only be accessed in one direction.
- Rigorous mathematical foundation: based on n-tier soft set theory, the NTSS model has a rigorous formal definition, which is not available in other key-value typed models. This not only makes the NTSS model more precise in definition and expression but also facilitates more in-depth theoretical research: it enables us to infer richer properties (or to reuse existing mathematical results on soft sets) and to understand its logical reliability and completeness, and it makes it convenient to design a concise and general query language that achieves complete logical expressiveness with as few operations as the relational model.
- Powerful query ability: through the rigorously defined operations, the fast access brought by index and data integration, and the sufficient connections between data, the NTSS model can query as freely and completely as the relational model, but in big data environments. In the comparison experiment with MySQL, we not only wrote and read key-values but also stored the same logical structure as the relational model and implemented the same query as a multi-table join SELECT SQL statement.
- Convenient for programming: from a programming perspective, all the structures that make up the NTSS model, including tuples, sets, and dictionaries, are built into most programming languages and can be processed natively.
- Easy modeling: from the similarity between the NTSS model and the ER model, it can be seen that the macroscopic view of the NTSS model is close to the original appearance of human thinking and modeling, so modeling can be carried out intuitively.
- Convenient for statistics: each domain can be used as a statistical dimension, and most of the values related to it are already sets that can be obtained directly; for these sets, counts, sums, averages, and other statistical indicators are easy to calculate.
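The schemaless and bidirectional-connection advantages above can be sketched as a tiny dict-based store. This is a minimal illustration under our own assumptions (the `Store` class, its methods, and the domain/value names are hypothetical, not part of the NTSS definition): each connection is kept in both directions, so either domain can act as the index key, and a domain can be split by simply adding new connections.

```python
# Minimal sketch (assumption: connections are modelled as pairs of hash
# maps, one per direction, so every connection is bidirectional and
# O(1) average to read or write).
from collections import defaultdict

class Store:
    def __init__(self):
        # connections[(src_domain, dst_domain)] maps a value in src to
        # the set of connected values in dst.
        self.connections = defaultdict(lambda: defaultdict(set))

    def connect(self, dom_a, val_a, dom_b, val_b):
        # Insert the connection in both directions.
        self.connections[(dom_a, dom_b)][val_a].add(val_b)
        self.connections[(dom_b, dom_a)][val_b].add(val_a)

    def query(self, dom_from, val, dom_to):
        # Either domain can serve as the index key.
        return self.connections[(dom_from, dom_to)][val]

db = Store()
db.connect("customer_id", "c1", "name", "Ada Lovelace")

# Schemaless evolution: split "name" into "first_name" and "last_name"
# by adding two new connections; no other part of the store is touched.
db.connect("customer_id", "c1", "first_name", "Ada")
db.connect("customer_id", "c1", "last_name", "Lovelace")

assert db.query("customer_id", "c1", "first_name") == {"Ada"}
assert db.query("last_name", "Lovelace", "customer_id") == {"c1"}
```

The returned sets also serve the statistics point directly: `len()`, sums, and averages can be computed on them without any extra scan.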
Using the conclusions in [
Comparison of data models.
| | Relational | Key-value | Column family | Document | Graph | NTSS |
|---|---|---|---|---|---|---|
| Strict algebra basis | Yes | No | No | No | Yes | Yes |
| Completeness and consistency | Strong | Weak | Weak | Weak | Strong | Weak (but with native cardinality constraints) |
| Query expressiveness | Strong | Unidirectional | Multilevel unidirectional | Multilevel unidirectional | Strong | Strong |
| Data self-descriptive | No | Weak | Strong | Strong | Strong | Strong |
| Distributed support | Difficult | Easy | Easy | Easy | Difficult | Easy |
| Schema flexibility | Predefined rigid schema | Schemaless | Schemaless | Schemaless | Schemaless | Schemaless |
| Connecting data | Difficult | Difficult | Difficult | Difficult | Easy | Easy |
| Dependency on external indexes | Strong | Weak | Medium | Medium | Medium | No |
Through the discussion above, we can see that the NTSS model is indeed a data model suited to the 4 Vs of big data. For Volume, an NTSS database is a discrete key-value structure and has natural support for distributed clusters. For Velocity, the underlying key-value implementation provides fast and flexible data processing. For Variety, as a schemaless model, it can be altered at will, making it easy to respond to changing requirements or heterogeneous data sources. For Value, the complete logical structure is preserved between the data and can be queried freely, and storing set values also facilitates statistics and data mining. Moreover, based on the features of the NTSS model, it is possible to realize an implementation with intelligent data distribution, which can automatically adapt to the status of the cluster, intelligently divide the soft aggregations, and still maintain the semantic and logical structure between the data, without manual sharding design or aggregation design.
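The distribution claim can be sketched in a few lines: once an NTSS database is reduced to key-value pairs, each pair can be placed on a cluster node deterministically by hashing its key. This is only an illustration of the idea, not the paper's distribution algorithm; the node names and key encoding are assumptions of ours.

```python
# Sketch (assumption: keys encode "domain:value" strings; node names
# are illustrative). Hash-based placement assigns every key-value pair
# to exactly one node without manual sharding design.
import hashlib

NODES = ["node-a", "node-b", "node-c"]

def node_for(key):
    # Deterministic placement: hash the key, take it modulo the node
    # count. sha256 spreads keys evenly across nodes.
    digest = hashlib.sha256(key.encode()).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]

placement = {k: node_for(k) for k in ["customer_id:c1", "name:Ada", "order:o9"]}
assert all(n in NODES for n in placement.values())
assert node_for("customer_id:c1") == node_for("customer_id:c1")  # stable
```

A production design would replace the modulo with consistent hashing so that adding or removing a node moves only a fraction of the keys, but the logical structure between the data is preserved either way, since every connection is itself a key-value pair.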
We have proposed the n-tier soft set theory and the n-tier soft set data model, defined them in a strictly formalized way, and illustrated the design process and considerations. We explained why and how to use the n-tier soft set model for modeling and described its features and advantages.
However, many details have not been covered, such as richer algebraic properties and detailed implementation aspects, which will be addressed progressively in future work.
Nevertheless, we believe that through this paper we have not only expanded the frontier of soft set theory but also shed light on the promising prospect of developing a new database product based on the NTSS model to meet the challenge of big data. In the future, the database, currently a theoretical verification implemented in Python, will be rewritten in Scala and open-sourced to improve its capability.
The data used to support the findings of this study are available upon request.
The authors declare that they have no conflicts of interest.
This study was supported by the National Natural Science Foundation of China (grant no. 72071021).