An Innovative Model for Extracting OLAP Cubes from NOSQL Database Based on Scalable Naïve Bayes Classifier

Due to the unstructured nature and sheer volume of modern data, relational databases are no longer adequate for data management. As a result, a new class of databases known as NOSQL has been introduced. The problem is that such databases are difficult to analyze. Online analytical processing (OLAP) is the foundational technology for data analysis in business intelligence, but because OLAP technologies were designed primarily for relational database systems, performing OLAP over NOSQL data is difficult. In this article, we present a model for extracting OLAP cubes from a document-oriented NOSQL database. A scalable Naïve Bayes classifier method was used for this purpose. The proposed solution is divided into three phases: preparation, Naïve Bayes, and NBMR. Our proposed algorithm, NBMR, is based on the Naïve Bayes classifier (NBC) and the MapReduce (MR) programming model. Each NOSQL database document with nearly the same attributes will belong to the same class, and as a result, OLAP cubes can be used to perform data analysis. Because the proposed model allows for distributed and parallel Naïve Bayes classifier computation, it is suitable for large-scale data sets. Our proposed model is an efficient approach in terms of both speed and the reduced number of required comparisons.


Introduction
Data generation has increased dramatically in recent decades. The rise of large-scale web platforms, such as Google, Facebook, Twitter, and Amazon, has resulted in the development of Big Data management approaches. The term "Big Data" refers to a massive amount of information, and there are numerous theoretical definitions of it [1]. The 3 Vs, Volume, Variety, and Velocity, form one of the most widely accepted definitions of Big Data [2]. Volume represents the massive amount of data, on the order of petabytes and exabytes. Variety arises because data produced by various sources, such as smart devices and social media, is structured, semi-structured, or unstructured. Velocity is the rate at which data are processed [3,4].
Managing and analyzing Big Data is difficult. Because relational databases (RDBMS) are not suitable for large amounts of data, there is a growing interest in using NOSQL (Not Only SQL) database systems. NOSQL systems are an interesting alternative to relational databases because they are more scalable and flexible. The data structure of schema-oriented, highly structured relational database management systems must be known ahead of time. A relational database stores all data in tables as rows and columns, whereas a NOSQL database does not. Relational databases are an expensive solution for handling large amounts of data; meanwhile, the scaling approach of NOSQL databases is much simpler [5,6]. The ACID principles are guaranteed by all relational databases. ACID (Atomicity, Consistency, Isolation, and Durability) is a set of database transaction properties designed to ensure that the database remains consistent in the event of a failure while processing a transaction. For this purpose, each transaction, which consists of operations, acts as a single unit, produces consistent results, is unaffected by any other transaction, and, once committed, remains in the system.
In contrast to relational databases, NOSQL databases achieve scalability and efficiency by relaxing the strict ACID guarantees and ensuring BASE (Basically Available, Soft state, Eventual consistency) properties instead [7].
Thus, data replication, sharding, and distribution across various storage services are used to keep the system accessible in the event of a failure. In contrast to relational systems, which are strict about ensuring consistency, NOSQL systems allow data to be temporarily inconsistent: the system state may change as new data are added, and NOSQL eventually ensures data consistency [8,9].
Despite all of the benefits of NOSQL databases, analyzing such databases is difficult. Data should be analyzed for decision making in many complex contexts, including healthcare, security, and business. As a result, a suitable method for analyzing the NOSQL database should be developed. OLAP (online analytical processing) is a technique for processing and analyzing data. OLAP is a multidimensional structure that allows for quick access to data and advanced analysis [10,11]. Because traditional OLAPs are based on relational databases, Big Data analysis on NOSQL databases is difficult [12].
One of the most common types of NOSQL databases is the document-oriented database, which is the focus of this research. In the document-oriented NOSQL database model, data are stored in a collection of documents with different attributes [13,14]. The primary goal of this research is to develop a model for converting a document-oriented NOSQL database to a structured model and extracting OLAP cubes. Previously, a variety of methods, such as rule-based, partitioning, and similarity-based approaches, were proposed; however, none of them used Naïve Bayes to extract the OLAP cube. There are two kinds of Naïve Bayes classifiers: the traditional classifier and a parallel variant based on the Naïve Bayes classifier and the MapReduce programming model [15]. The Naïve Bayes classifier is a simple learning technique for text classification that uses Bayes' rule to map each document to the class with the highest posterior probability [16]. Its most basic premise is that the attributes are conditionally independent of one another. When the data set is large, storage and processing can be difficult; to overcome these issues, a parallel MapReduce processing system is combined with Naïve Bayes [17,18]. NBMR is the name of our proposed algorithm, which is scalable to large data sets. The scope of this study is depicted in Figure 1. The rest of the article is structured as follows: the related work is presented in Section 2, the background of this study in Section 3, and the proposed model in Section 4; the experimental results are examined in Section 5; finally, Section 6 concludes the article.

Related Work
In this section, we look at the previous solutions that have been presented. Partitioning, chunking, rule-based mapping (a set of rules for direct mapping from one model to another), parallel processing, MapReduce, and other methods have previously been proposed, each with its own benefits and drawbacks. The goal of Venkatesh and Ranjitha [19] was to build OLAP cubes out of a NOSQL database. To this end, the MC-CUB operator was created.
This operator enables the processing of OLAP cubes over a large volume of data stored in a column-oriented NOSQL database. All possible aggregations are performed at various levels and granularities, and the MapReduce pattern is used to parallelize operations on the stored data in a distributed environment. The authors of [20] present Halop, a solution for storing OLAP systems for Big Data. In this solution, chunking and partitioning methods are used to store the data, which are loaded using a parallel MapReduce processing system. NOSQL databases were used in decision-making systems in the study by Song et al. [21]: a set of rules has been defined to automatically and directly convert the multidimensional conceptual model to the logical NOSQL model, and data preaggregation has been considered to improve query speed. This method has been investigated for both document-oriented and column-oriented NOSQL databases. In [22], the authors proposed a system for converting a NOSQL database to a relational database. A Schema Analyzer first specifies the schema of the NOSQL database; following the specification of the schema, ETL processes are started in parallel to extract, transform, and load data from the unstructured NOSQL database into a structured relational database. Wu et al. [23] proposed a new model that combines the benefits of both relational and nonrelational databases in terms of logic and flexibility. For this purpose, they proposed a new soft set algebra called the n-tier soft set. This model allows data to describe itself and to move and locate within a cluster with ease, while retaining the algebraic capabilities of a relational model, allowing for powerful queries. In the study by Chevalier et al. [24], data storage was done with a document-oriented NOSQL database. In the proposed approach, the data model is specified with features and values, and the data route is kept as a tree structure.
To convert the multidimensional model to NOSQL, three mapping and conversion rules have been presented: in the first mapping rule, all data cubes are stored in a set with a simple format (without subdocuments); in the second rule, subdocuments are stored in the set; in the third rule, the cubes are kept in multiple distributed sets. Each rule has its own advantages and disadvantages. Chevalier et al. [25] investigated dimensions with complicated hierarchies in two cases, non-strict (a child level with multiple parents) and non-covering (a child level with no parent), and presented a method for modelling the complicated hierarchy in the document-oriented model; finally, an algorithm for aggregation is presented. Sohrabi and Azgomi [26] described an algorithm called MSJL for determining the similarity of large data sets. This algorithm is based on the LSH approximate similarity algorithm, which uses chunking and partitioning for Big Data management, as well as the MapReduce parallel processing system. Davardoost et al. [27] proposed a new method, called LSHMR, for extracting OLAP cubes from document-oriented NOSQL databases. LSHMR was based on LSH and the MapReduce programming model: the MapReduce framework enables distributed and parallel computing, while LSH is a fast, approximate similarity search that is used to reduce the number of comparisons. The method of Vernica et al. [28] was built on the MapReduce programming model and prefix filtering. According to the prefix filtering principle, similar sets must share a common token in their prefixes; this makes it a good filter that can cut down the number of candidate pairs [28].
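The prefix filtering principle can be illustrated with a small sketch (the token ordering and threshold below are illustrative choices, not the implementation of [28]): only pairs of sets whose prefixes share a token become candidates for exact similarity verification.

```python
from collections import defaultdict
from math import ceil

def prefix_filter_candidates(sets, threshold):
    """Generate candidate pairs for a set-similarity join: only pairs
    whose prefixes share at least one token survive (prefix filtering)."""
    # Impose a global token order (here: rarest tokens first, then lexicographic).
    freq = defaultdict(int)
    for s in sets:
        for tok in s:
            freq[tok] += 1
    order = lambda tok: (freq[tok], tok)

    index = defaultdict(set)   # token -> ids of sets whose prefix contains it
    candidates = set()
    for i, s in enumerate(sets):
        tokens = sorted(s, key=order)
        # Prefix length for Jaccard threshold t: |s| - ceil(t * |s|) + 1.
        prefix_len = len(tokens) - ceil(threshold * len(tokens)) + 1
        for tok in tokens[:prefix_len]:
            for j in index[tok]:
                candidates.add((j, i))
            index[tok].add(i)
    return candidates

sets = [{"a", "b", "c"}, {"a", "b", "d"}, {"x", "y", "z"}]
print(prefix_filter_candidates(sets, 0.5))  # {(0, 1)}: only sets 0 and 1 share a prefix token
```

Sets 0 and 1 survive the filter because the token "a" appears in both prefixes; set 2 shares no prefix token with any other set, so all pairs involving it are pruned without comparison.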
Various methods for converting an unstructured model to a structured model and vice versa have been presented. Table 1 summarizes the algorithms and approaches and compares them from various perspectives, including chunking and partitioning, use of the MapReduce parallel processing technique, and rule-based processing.

Background
Because the goal of this article is to develop a model for extracting OLAP cubes from a document-oriented NOSQL database, the multidimensional conceptual model and the document-oriented NOSQL database are formalized first, followed by the MapReduce parallel programming model.

Formal Definition of the Multidimensional Conceptual Model.
The multidimensional model includes fact tables and dimension tables. The multidimensional schema, called E, is defined as follows: F_E is a finite set of fact tables, where each fact table has a name N_F and a set of measures M_F. Each dimension has a set of attributes. There are two types of attributes: a simple attribute is atomic and inseparable, while a complex attribute comprises several attributes [29,30].

Formal Definition of the Document-Oriented NOSQL
Database. NOSQL databases are seen as a viable alternative to relational databases, particularly in the context of Big Data. A NOSQL database is schemaless: instead of tables with fixed data types, columns, rows, and schemas, NOSQL databases use documents. In document-oriented databases, data are stored as a collection of documents [7,31]. Each collection contains multiple documents, each of which has a dynamic structure; the collection is unstructured, and the documents it contains may have different fields [32]. In a document-oriented NOSQL database, the information is stored as key-value pairs in the structure of JSON or XML documents [27,33]. Figure 2 shows an instance of a document-oriented database. The data are stored as a collection of documents C = {d_1, d_2, ..., d_n}. Each document in collection C has a unique key. The documents' structure is defined using attributes and values, where attributes may be simple or composite.

MapReduce.
Figure 3 depicts a MapReduce overview. MapReduce is a popular parallel computing method for Big Data analysis. The MapReduce paradigm is straightforward, with two tasks: Map and Reduce. The input data set is divided into chunks, and each chunk is processed in parallel by Map tasks; the outputs of the Map phase are then sorted and used as the input of the Reduce tasks [34]. Computation is executed as follows: (i) Map tasks divide the data set into chunks and generate (Key, Value) pairs that are processed in parallel. (ii) In the shuffle phase, the (Key, Value) pairs from each Map are collected and grouped by key. (iii) (Key, Value) pairs with the same key are then passed to the same Reducer [35,36].
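The Map-shuffle-Reduce flow described above can be sketched in a few lines (an in-process illustration, not a distributed implementation), using the classic word-count example:

```python
from collections import defaultdict

def run_mapreduce(chunks, mapper, reducer):
    """Minimal in-process MapReduce: map each chunk to (key, value)
    pairs, shuffle the pairs by key, then reduce each key's values."""
    # Map phase: each chunk is processed independently (in parallel on a cluster).
    pairs = [kv for chunk in chunks for kv in mapper(chunk)]
    # Shuffle phase: group the values of identical keys together.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce phase: one reducer call per distinct key.
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count: each mapper emits (word, 1); each reducer sums the ones.
chunks = ["big data big", "data analysis"]
mapper = lambda chunk: [(w, 1) for w in chunk.split()]
reducer = lambda key, values: sum(values)
print(run_mapreduce(chunks, mapper, reducer))
# {'big': 2, 'data': 2, 'analysis': 1}
```

The same skeleton is reused later in the NBMR phase, with documents as keys and attribute counts as values.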

First Phase (Preparation Phase).
In the first phase, the document-oriented NOSQL database is converted to a matrix. The database, which serves as the system's input, consists of n documents, each with up to m attributes; because it is unstructured, some attributes are present in a given document while others are absent. A matrix D with 0/1 elements is created: the rows of D represent documents, the columns represent attributes, and each element indicates whether a document has the corresponding attribute. If the attribute is null, the element is set to 0; otherwise, it is set to 1. The second phase then entails locating documents with similar attributes and classifying them accordingly: because documents with similar structures belong to the same class, OLAP can be used to analyze the data.
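The construction of matrix D can be sketched as follows (the toy collection and attribute names are hypothetical; the columns are the union of attributes across all documents):

```python
def build_matrix_d(collection, attributes):
    """Build the binary matrix D: rows = documents, columns = attributes;
    D[i][j] = 1 if document i has a non-null value for attribute j, else 0."""
    return [
        [1 if doc.get(attr) is not None else 0 for attr in attributes]
        for doc in collection
    ]

# A toy document collection: each document may have different fields.
collection = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "email": "bob@example.com"},
]
# Columns of D: the sorted union of attributes over the collection.
attributes = sorted({k for doc in collection for k in doc})
print(attributes)                              # ['age', 'email', 'name']
print(build_matrix_d(collection, attributes))  # [[1, 0, 1], [0, 1, 1]]
```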

Second Phase (Naïve Bayes).
In the second phase, the Naïve Bayes algorithm is applied to the matrix D. A Naïve Bayes classifier is a learning algorithm based on the conditional probability values of the data set. Every attribute is assumed to be independent of the others, which is a strong assumption. The classification principle of Naïve Bayes is relatively simple [38]: the probability of an element belonging to each class is calculated first, and the most likely class is then chosen as the classification result. The parts of the Naïve Bayes phase are shown in detail in Figure 5. To begin, the matrix D is split into two sections: training data and test data. In the training phase, the probabilities are calculated using Bayes' rule and a model is created. In the combining section, the test data are compared with the model; if the model does not cover a test record, its probabilities are calculated and stored in an intermediate table.
The prediction step is straightforward: the model is used to determine which class the test data belong to [15,39]. Documents with similar structures are mapped to the same classes in the prediction section, and finally, OLAP cubes are extracted based on the classification. The steps for classifying a document-oriented NOSQL database based on Naïve Bayes theory are as follows: (1) Let Collection C = {d_1, d_2, ..., d_n}; in a document-oriented database, data are stored in a collection of documents, each d_i representing a document, and documents that are structurally similar will be placed in the same class. The conditional probabilities of step 4 are calculated as follows: the conditional probabilities of the attributes given the classes are P(a_1|C_1), P(a_2|C_1), ..., P(a_m|C_1); P(a_1|C_2), P(a_2|C_2), ...; P(a_m|C_N). Assuming the attributes are conditionally independent, the probability P(C_i|d) that document d belongs to class C_i can be calculated according to equation (2):

P(C_i|d) = P(d|C_i) P(C_i) / P(d). (2)

(2) Based on step 5, the class that maximizes equation (2) must be found. Because the denominator P(d) is the same for all C_i, only the numerator of equation (2) needs to be maximized. Assuming the attributes are conditionally independent, equation (3) can be used:

class(d) = argmax_{C_i} P(C_i) ∏_{j=1..m} P(a_j|C_i). (3)

Third Phase (NBMR).
This section explains how to put our proposed model into action. Because calculating the conditional probabilities in the third phase is time consuming, we use the MapReduce programming model to parallelize and speed up the process. Our proposed algorithm, dubbed NBMR, is a hybrid of the Naïve Bayes classifier (NBC) and the MapReduce distributed programming model. The Naïve Bayes classifier assigns a document to the class with the highest probability, as determined by Bayes' rule, and MapReduce allows for parallel computing, which is ideal for large amounts of data. We implemented three tasks: Master, Training, and Prediction.
The Master task controls the other threads as a server thread. Algorithms 1-3 show the pseudocode for each task; each task performs the appropriate process on its input data. The input of the Master task is a document-oriented NOSQL database. The Preparation task is called first, and the documents' matrix D is created. Matrix D is then divided into training and test data. The Master task divides the training file "Tr" into k parts (Tr_1, Tr_2, ..., Tr_k) and the testing file "T" into k parts (T_1, T_2, ..., T_k), then calls the Training and Prediction tasks, collects their results, and aggregates the probabilities that are used to classify the NOSQL database. Figure 6 depicts the process of putting our proposed model into action. As shown in Figure 6, our proposed model contains two sequential MapReduce jobs, where the output of the first is the input of the second. The Master task splits the training file "Tr" into k parts and assigns each part as a mapper input. Each mapper receives a portion of the data block and divides each record into specific Key-Value pairs <Key = document, Value = attribute>, then computes the probability of each attribute for each class. In the Reduce phase, the Key-Value pairs are aggregated by class, and the model is created. The Master task splits the testing file "T" into k sections, and each part T_i is assigned to a mapper in the Prediction task. During the Map phase, each mapper receives a block of data from the previous MapReduce output. Using the model, each mapper classifies the documents in its part T_i of the testing file. If the model does not contain a classification for a document, it is computed by calculating the probabilities, and the model is updated. All results are aggregated in the Reduce phase, and eventually the class of each document is predicted, which is used to extract the OLAP cubes. Table 2 contains a list of variables along with their definitions for quick reference.
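The training job described above can be sketched as a map/reduce pair (an in-process stand-in for the distributed Training task; the key scheme and toy data are illustrative): each mapper emits per-class counts for its split Tr_i, and the reducer aggregates them into the model from which the conditional probabilities are derived.

```python
from collections import defaultdict

def train_mapper(part):
    """Map: for one training split Tr_i, emit ((class, key), count) pairs:
    a per-class document count and a per-class count for each present attribute."""
    out = []
    for row, label in part:
        out.append(((label, "docs"), 1))        # one document seen for this class
        for j, v in enumerate(row):
            if v == 1:
                out.append(((label, j), 1))     # attribute j present in this class
    return out

def train_reducer(pairs):
    """Reduce: aggregate counts by (class, key); the resulting model gives
    P(a_j | C) as model[(C, j)] / model[(C, "docs")]."""
    model = defaultdict(int)
    for key, count in pairs:
        model[key] += count
    return model

# Toy training data split into k = 2 parts (each mapper runs in parallel in NBMR).
tr_parts = [
    [([1, 0], "C1"), ([1, 1], "C1")],
    [([0, 1], "C2")],
]
pairs = [kv for part in tr_parts for kv in train_mapper(part)]  # stand-in for parallel maps
model = train_reducer(pairs)
print(model[("C1", 0)], model[("C1", "docs")])  # 2 2  ->  P(a_0 | C1) = 2/2
```

The Prediction job follows the same pattern: mappers classify the documents of their split T_i against this model, and the reducer aggregates the predicted classes.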

Conclusions
Due to the diversity and large volume of data, the relational database is ineffective for data management, and as a result, the use of NOSQL databases has grown in popularity. However, due to its unstructured nature, analyzing such a database is difficult. A new model for extracting data cubes is presented in this article. Partitioning, chunking, similarity, and rule-based solutions have all been proposed in previous studies; here, a new method for extracting OLAP cubes from document-oriented NOSQL databases is proposed, based on a scalable Naïve Bayes classifier. The proposed solution consists of three main phases: preparation, Naïve Bayes, and NBMR. In the first phase, the NOSQL document-oriented database is converted to a matrix. In the second phase, we use Naïve Bayes theory to calculate probabilities; this phase consists of three main tasks, training, combining, and prediction, each of which carries out the appropriate data process. In the third phase, we use MapReduce to parallelize and speed up the operation. NBMR is the name of our proposed algorithm, which is based on the Naïve Bayes classifier (NBC) and the MapReduce (MR) programming model. Each NOSQL database document with nearly the same attributes will be assigned to the same class, allowing OLAP cubes to be used for data analysis. The proposed model is suitable for large-scale data sets because it allows distributed and parallel computing of the Naïve Bayes classifier, and it is an efficient approach in terms of both speed and the reduced number of required comparisons. Other machine learning techniques could be used in future research. This article concentrated on document-oriented NOSQL databases; however, the same techniques can be applied to other types of NOSQL databases, such as graph-oriented and column-oriented databases.

Data Availability
The data used to support this study are obtained from [40].

Conflicts of Interest
The authors declare that they have no conflicts of interest.