
In recent years, data has become a special kind of information commodity, and its circulation has promoted the development of the information commodity economy. With the development of big data, data markets have emerged to facilitate data transactions. However, optimal pricing and data quality allocation in the big data market have not yet been fully studied. In this paper, we propose a big data market pricing model based on data quality. We first analyze the dimensional indicators that affect data quality and establish a linear evaluation model. Then, from the perspective of data science, we analyze the impact of quality level on big data analysis (i.e., machine learning algorithms) and define a utility function of data quality. Experimental results on real datasets show the applicability of the proposed quality utility function. In addition, we formulate the profit maximization problem and give a theoretical analysis. Finally, numerical examples illustrate that the data market can maximize profits through the proposed model.

With the rapid development of information technology, big data has become a core resource in all walks of life. Government departments, research institutions, IT companies, financial institutions, and others generate massive amounts of data during their operations. In addition, with the rise of mobile networks and smart terminals, a large proportion of people now carry smartphones with sensors, which can easily collect data beyond what was previously possible using GPS, cameras, microphones, etc. Storing and computing on big data is no longer the sole purpose. Analyzing data with data mining and machine learning provides an opportunity for breakthroughs in processing video, images, and speech [

Marketplaces are enablers for the exchange of data. Data trading has therefore become an innovative business model that has driven the advent of the DT (Data Technology) era. In this era, data has become an important asset for companies, moving from exclusive internal use to sharing between companies. However, standardized data sharing channels and unified transaction specifications are still lacking, and big data trading platforms and data markets have emerged to fill this gap.

Nowadays, data products and related services are increasingly being provided to the online data market, which carries the data publisher’s data and provides it to data consumers. Figure

A typical big data market model.

However, the big data market has not yet formed a unified pricing mechanism, and existing pricing strategies remain imperfect; i.e., different data markets offer different pricing mechanisms. Currently, the major pricing mechanisms in the data market include subscription, bundling, and discrimination. However, the impact of data quality on the pricing mechanism has rarely been studied. Many studies [

The key contributions of this paper can be summarized as follows:

We first summarize several dimensional indicators that affect data quality and establish a linear model to calculate quality scores. Based on this, we propose a method that divides quality levels using the square root of the quality score.

We propose a utility model based on the quality level and verify it on real-world datasets using machine learning algorithms. The results support the applicability of this utility model.

From the perspective of economics, we consider consumers’ willingness to pay and formulate an optimized pricing scheme based on the quality utility function. Numerical experiments show that data platform owners can maximize profits by choosing the quality level and subscription fee.

The remainder of this paper is organized as follows: Section

The valuation of intangible assets, such as cloud computing services [

Before studying data pricing, we first review representative work on existing pricing methods. The information service market usually involves three commonly used pricing mechanisms:

Subscription mechanism: Windows Azure Data Market [

Bundling pricing: this strategy originates from capital data market, and it represents an aggregation technique [

Version control pricing mechanism: the strategy is a widespread differentiation strategy used in information-product markets. Wei et al. [

There are also some scholars who have studied the pricing model of data products from different perspectives. Koutris et al. [

From an extensive review of the literature, we conclude that existing work on data pricing either investigates published data pricing methods or studies new approaches that focus on relevance and privacy. Data quality is a key factor in data assessment, yet it has been largely ignored so far.

Throughout the data life cycle (data creation, transformation, transmission, and application), each stage may introduce various data quality problems. Liu et al. [

Data quality is characterized by multidimensionality and complexity. Therefore, in this paper, we consider an optimized pricing model based on data quality, hoping to provide data platform owners with useful pricing decision recommendations.

When the data market owner wants to sell data at a reasonable price, the first thing to consider is to evaluate the value of data. On the one hand, data value can be measured by the size of data [

Data quality includes multiple dimensions. The measurement of dimensions will vary according to the type of data, so quality has to be evaluated using the criteria that the data has to comply with. In [

Metric definitions, description, and calculation.

Attributes | Metric | Description | Variables | Formula |
---|---|---|---|---|
Accuracy | Proportion of accurate cells | Indicates the proportion of cells in a data source that have correct values according to the domain and the type of information of the data source. | | |
Completeness | Proportion of complete cells | Indicates the proportion of complete cells in a dataset, i.e., cells that are not empty and have a meaningful value assigned (a value coherent with the domain of the column). | | |
Redundancy | Proportion of duplicate records | Expresses the proportion of duplicate records in the data source. Since this factor is a cost indicator, we convert it to a benefit indicator. | | |
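The three metrics in the table can be illustrated with a minimal sketch. The variables and formulas of the original table are not available here, so the computations below are assumptions: each metric is taken as a simple proportion, and redundancy is converted to a benefit indicator as one minus the share of duplicate records. The function names, the sample rows, and the per-column validators are hypothetical.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[object]]


def _is_filled(value: Optional[object]) -> bool:
    """A cell counts as filled when it is neither None nor an empty string."""
    return value is not None and value != ""


def completeness(rows: List[Row], columns: List[str]) -> float:
    """Share of cells that are filled (assumed formula)."""
    total = len(rows) * len(columns)
    filled = sum(1 for r in rows for c in columns if _is_filled(r.get(c)))
    return filled / total if total else 0.0


def accuracy(rows: List[Row], validators: Dict[str, Callable[[object], bool]]) -> float:
    """Share of cells that are filled and pass their column's domain check."""
    total = len(rows) * len(validators)
    ok = sum(
        1
        for r in rows
        for c, is_valid in validators.items()
        if _is_filled(r.get(c)) and is_valid(r[c])
    )
    return ok / total if total else 0.0


def non_redundancy(rows: List[Row], columns: List[str]) -> float:
    """One minus the share of duplicate records (benefit form, assumed)."""
    if not rows:
        return 0.0
    distinct = {tuple(r.get(c) for c in columns) for r in rows}
    return 1.0 - (len(rows) - len(distinct)) / len(rows)


# Hypothetical toy data: one invalid age, one missing age, one duplicate row.
columns = ["age", "city"]
rows = [
    {"age": 34, "city": "Paris"},
    {"age": -5, "city": "Rome"},
    {"age": None, "city": "Oslo"},
    {"age": 34, "city": "Paris"},
]
validators = {
    "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
    "city": lambda v: isinstance(v, str),
}

metrics = {
    "accuracy": accuracy(rows, validators),
    "completeness": completeness(rows, columns),
    "non_redundancy": non_redundancy(rows, columns),
}
print(metrics)
```

All three values lie in [0, 1] and are oriented so that larger is better, which makes them directly usable as inputs to a weighted score.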

Several other quality dimensions also have their own calculation methods. Due to space limitations, we omit them here.

Creating a universal data quality assessment standard that applies to all types of data is an arduous task. Without loss of generality, we present a linear model below; other choices are possible.

We adopt the method of dividing the quality level in [

Mapping of quality scores and levels.
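As a rough illustration of the two steps above, the sketch below computes a weighted linear quality score and then maps it to a level with a square root. The exact weights, score range, and mapping of the paper are not reproduced in this text, so everything concrete here is an assumption: weights that sum to one, a score rescaled to [0, 100], and a level defined as the integer part of the square root of that score (giving levels 0 to 10).

```python
import math
from typing import Dict


def quality_score(metrics: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted linear combination of benefit-type metrics in [0, 1]."""
    if set(metrics) != set(weights):
        raise ValueError("metrics and weights must cover the same dimensions")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights are assumed to sum to 1")
    return sum(weights[name] * metrics[name] for name in metrics)


def quality_level(score_0_to_100: float) -> int:
    """Assumed mapping: level = floor(sqrt(score)), with score in [0, 100]."""
    if not 0.0 <= score_0_to_100 <= 100.0:
        raise ValueError("score must lie in [0, 100]")
    return int(math.sqrt(score_0_to_100))


# Hypothetical metric values and weights, for illustration only.
metrics = {"accuracy": 0.75, "completeness": 0.875, "non_redundancy": 0.75}
weights = {"accuracy": 0.5, "completeness": 0.3, "non_redundancy": 0.2}

score = quality_score(metrics, weights)   # value in [0, 1]
level = quality_level(100.0 * score)      # value in 0..10
print(score, level)
```

Under this assumed mapping the score intervals widen as the level grows, so moving from one level to the next requires a larger score improvement at the top of the scale than at the bottom.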

In current big data business applications, model-based methods are usually applied to big data sets to extract knowledge and information for solving complex business problems. Figure

Big data business intelligence service.

It can be seen that data plays an important role in the entire business analysis. The quality of data directly determines the accuracy of the machine learning model [

We suppose that a utility function

Usually we assume that the function

The first property is reasonable because quality utility cannot be negative. The second property is the natural requirement that higher quality is better. Several justifications can be given for the third property. One way to justify it is to require that the marginal utility

To determine the utility function of data quality in big data analysis, we study the problem from the perspective of classification-based machine learning.

Next, we describe the process of classification. For a data set

A basic machine learning workflow.

As shown in (

Suppose that the classification accuracy for each item

In this paper, for simplicity, we consider the following exponential-based utility function:
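The three properties discussed earlier (non-negativity, monotonic increase, diminishing marginal utility) can be checked numerically for an exponential-type function. The paper's exact formula and fitted parameters are not reproduced in this text, so the form `u(q) = a - b * exp(-c * q)` and the parameter values below are purely hypothetical stand-ins chosen to satisfy those properties.

```python
import math


def utility(q: float, a: float = 1.0, b: float = 0.9, c: float = 0.5) -> float:
    """Hypothetical exponential-type utility; requires a >= b > 0 and c > 0."""
    return a - b * math.exp(-c * q)


levels = list(range(0, 11))                 # assumed quality levels 0..10
values = [utility(q) for q in levels]
# Marginal gain obtained when moving from level q to level q + 1.
gains = [values[i + 1] - values[i] for i in range(len(values) - 1)]

non_negative = all(v >= 0 for v in values)
increasing = all(g > 0 for g in gains)
diminishing = all(gains[i + 1] < gains[i] for i in range(len(gains) - 1))

print(non_negative, increasing, diminishing)
```

With these parameters the utility starts at `a - b` for level 0 and saturates toward `a`, so each additional quality level adds less utility than the previous one.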

To verify the suitability of the proposed utility function, we use a real dataset called MNIST [

Due to the multidimensionality and complexity of data quality, classifying quality levels while taking all quality dimensions into account would be a difficult task. Our goal is to illustrate the effect of different quality levels of a given dataset on the classification capability of a model. For simplicity, and to reflect this motivation, the experimental design borrows the concept of signal-to-noise ratio (SNR) [

Select

For each sample of the selected

The quality level is the inverse image of the noise level. For simplicity, Figure

Accuracy trends under different quality levels (
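The general idea of the experiment, degrading data with noise and observing the classifier's accuracy, can be sketched without any external library. This is not the paper's MNIST setup: the data below are synthetic two-class Gaussian points, the classifier is a simple nearest-centroid rule, and noise is added to both training and test samples. All names and parameter values are assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple

Sample = Tuple[List[float], int]


def make_data(n_per_class: int, rng: random.Random) -> List[Sample]:
    """Two unit-variance Gaussian classes centred at (0, 0) and (4, 4)."""
    data = []
    for label, centre in enumerate([(0.0, 0.0), (4.0, 4.0)]):
        for _ in range(n_per_class):
            data.append(([rng.gauss(c, 1.0) for c in centre], label))
    return data


def add_noise(data: List[Sample], sigma: float, rng: random.Random) -> List[Sample]:
    """Add zero-mean Gaussian noise of standard deviation sigma to each feature."""
    return [([x + rng.gauss(0.0, sigma) for x in point], label) for point, label in data]


def fit_centroids(data: List[Sample]) -> Dict[int, List[float]]:
    """Per-class mean vector."""
    centroids = {}
    for label in {lab for _, lab in data}:
        points = [p for p, lab in data if lab == label]
        centroids[label] = [sum(col) / len(points) for col in zip(*points)]
    return centroids


def predict(centroids: Dict[int, List[float]], point: List[float]) -> int:
    """Label of the closest centroid (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda lab: sum((x - c) ** 2 for x, c in zip(point, centroids[lab])),
    )


def accuracy_at(sigma: float, seed: int = 0) -> float:
    rng = random.Random(seed)
    train = add_noise(make_data(200, rng), sigma, rng)
    test = add_noise(make_data(200, rng), sigma, rng)
    centroids = fit_centroids(train)
    hits = sum(predict(centroids, p) == lab for p, lab in test)
    return hits / len(test)


noise_levels = [0.0, 1.0, 2.0, 4.0, 8.0]    # higher noise = lower quality
accuracies = [accuracy_at(s) for s in noise_levels]
for sigma, acc in zip(noise_levels, accuracies):
    print(f"noise sigma={sigma:>4}: accuracy={acc:.3f}")
```

In this toy setting the accuracy on clean data is close to 1 and drops markedly at the highest noise level, which mirrors the qualitative relationship between quality level and accuracy that the figure reports.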

In this section, we first analyze consumers’ willingness to pay from the perspective of consumer behavior. Then, we introduce the profit maximization model with data quality level. Finally, closed-form solutions for the subscription fee and quality level are derived and proved to be globally optimal. The key notations used throughout the paper are defined in Table

Frequently used notations.

Notation | Description |
---|---|
| The quality level of the data products and services |
| Subscription fees for data products and services |
| Customers’ Willingness to Pay |
| Customer sensitivity to quality level |
| Profit resulting from the separate sales of the data product and service under |
| Data utility with curve fitting parameter |
| The number of customers willing to buy a data product or service |
| The unit price of the data quality |
| Lagrangian of the profit function |

Every consumer in the market has personal preferences and interests. They make purchasing decisions based on their own needs, preferences, and prices by a self-selection process. This self-selection is described by a consumer’s Willingness To Pay (WTP) [

We denote customers’ sensitivity to quality level by

In Section

In the above equation,

where, without loss of generality, we assume that
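The self-selection process can be illustrated with a small simulation. The paper's equations are not reproduced in this text, so the sketch relies on a common modelling assumption rather than the authors' exact formulation: each customer's sensitivity is drawn uniformly from [0, 1], willingness to pay equals sensitivity times the utility of the offered quality level, and a customer subscribes only when willingness to pay is at least the fee. Function and parameter names are hypothetical.

```python
import random


def expected_buyers(n_customers: int, fee: float, utility_value: float) -> float:
    """Expected number of subscribers under uniform sensitivity on [0, 1].

    A customer with sensitivity theta buys when theta * utility_value >= fee,
    which happens with probability 1 - fee / utility_value (clipped at 0).
    """
    if utility_value <= 0:
        return 0.0
    return n_customers * max(0.0, 1.0 - fee / utility_value)


def simulate_buyers(n_customers: int, fee: float, utility_value: float, seed: int = 0) -> int:
    """Monte Carlo count of customers whose willingness to pay covers the fee."""
    rng = random.Random(seed)
    return sum(1 for _ in range(n_customers) if rng.random() * utility_value >= fee)


n = 10_000
analytic = expected_buyers(n, fee=0.3, utility_value=0.6)
simulated = simulate_buyers(n, fee=0.3, utility_value=0.6)
print(analytic, simulated)
```

Under these assumptions, demand falls linearly in the fee and vanishes once the fee exceeds the utility of the offered quality level.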

The profit maximization problem can be formulated as follows:

The goal of (

We use Karush-Kuhn-Tucker (KKT) [

The closed-form solutions of

To get this result, we first need to find (

Next, we can consider two special cases where the data quality level

On the one hand, if

We solve the second derivatives of
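A numerical check can complement the analytical derivation. The sketch below does not reproduce the paper's closed-form solutions; it assumes a profit of the form `N * p * (1 - p / u(q)) - c * q`, with uniformly distributed customer sensitivity, a hypothetical exponential-type utility `u(q)`, and a quality cost that is linear in the level. Under these assumptions the best fee for a fixed level is `u(q) / 2`, which the grid search should recover approximately.

```python
import math
from typing import Iterable, Tuple


def utility(q: float, a: float = 1.0, b: float = 0.9, k: float = 0.5) -> float:
    """Hypothetical exponential-type utility of quality level q."""
    return a - b * math.exp(-k * q)


def profit(fee: float, q: float, n_customers: int = 1000, unit_cost: float = 25.0) -> float:
    """Assumed profit: revenue from subscribers minus a linear quality cost."""
    share = max(0.0, 1.0 - fee / utility(q))   # fraction of customers who buy
    return n_customers * fee * share - unit_cost * q


def best_fee_and_level(levels: Iterable[int], fees: Iterable[float]) -> Tuple[float, float, int]:
    """Exhaustive search returning (profit, fee, level) with the highest profit."""
    fees = list(fees)
    return max((profit(p, q), p, q) for q in levels for p in fees)


levels = range(0, 11)
fee_grid = [i / 1000 for i in range(1, 1001)]   # fees from 0.001 to 1.000

best_profit, best_fee, best_level = best_fee_and_level(levels, fee_grid)
print(best_profit, best_fee, best_level)
print("u(q*)/2 =", utility(best_level) / 2)
```

The search trades off two effects: a higher quality level raises the fee customers accept but also raises cost, so the most profitable level is an interior one rather than the maximum.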

In this section, we consider using the previous utility function

Figures

Data platform profit under different subscription fees.

Data platform profits under different quality levels.

In Figure

Data platform profit under different quality level costs.

In this paper, we proposed a data pricing and profit maximization model based on data quality levels. We first constructed a linear model of the quality score based on data quality dimensions and used the square root of the score to divide the quality levels. Then we established a quality level utility model and verified its applicability with machine learning algorithms. Finally, we proposed an optimized pricing mechanism that allows data platform owners to choose quality levels and subscription fees so as to maximize profits.

The data used to support the findings of this study are included within the article.

The authors declare that there are no conflicts of interest regarding the publication of this paper.

This research was supported in part by the National Natural Science Foundation of China (Grant no. 91646202) and the National Key R&D Program of China (SQ2018YFB140235).