Copyright Protection of Literary Works Based on Data Mining Algorithms



Introduction
The rapid development of computer storage technology and network technology has brought massive amounts of information to people. This information mainly takes the form of images, videos, audio, animations [1], and texts, among which texts are the most widely disseminated and the most frequently used. The massive dissemination of information brings convenience to people's work and life, but it also has drawbacks, such as frequent copyright disputes and illegal copying, which create an urgent need for author identification methods that can resolve copyright disputes. Research has found that texts written by different authors differ considerably in style, whereas different texts written by the same author share the same writing techniques, habitual sentence structures, vocabulary, etc. [2]. The author recognition method first extracts and counts the features of a large number of texts written by different authors and trains a classifier. Then, for a disputed text, it uses effective feature extraction methods to obtain statistical vectors and inputs them into the trained classifier. Finally, it outputs a specific classification category or a specific author. Text author recognition can assist in resolving copyright disputes over disputed works (especially disputed works of well-known authors), combating piracy, and maintaining integrity. The key part of the text author recognition method is training and building the classifier [3].
Classification is a typical supervised machine learning method, and it is also an important research topic in the field of data mining. The classification function or classifier is obtained by continuously learning from training data. When classification is needed, the test data are passed to the obtained function or classifier, which outputs a category. How to choose a suitable classification model for a given application is an important issue. Text classification technology can be widely used in fields such as natural language processing and understanding, information management, data evaluation, and information filtering. The more common text classification methods include the support vector machine, the K-nearest neighbor method, Bayesian classification, neural networks, and decision tree classification. The support vector machine is mainly used in pattern recognition and related fields. It is a pattern recognition method based on statistical learning theory, and its characteristic is that it maximizes the geometric margin and minimizes the empirical error at the same time. Based on the known samples, the nearest neighbor algorithm determines whether a new sample belongs to the same category as a known sample. The nearest neighbor algorithm has many variants and improvements, but the general idea is to store all or part of the training samples first, then calculate the distance between the test sample and the training samples through a similarity function, and finally determine the category of the test sample. The nearest neighbor algorithm can achieve classification quickly, especially in the field of statistical pattern recognition. The principle of the neural network is to simulate the structure of the human brain as a set of connected input/output units; the network learns from the training samples by adjusting the values associated with these units.
Based on this, this paper combines data mining technology to conduct research on the copyright protection of literary works, constructs a literary copyright protection system, and improves the copyright protection effect of modern digital literary works.

Related Work
Literature [4] proposed the Triangle Similarity Quadruple (TSQ) and Tetrahedral Volume Ratio (TVR) algorithms. The TSQ algorithm constructs the Macro Embedding Primitive (MEP) and selects the ratio of the side lengths of the triangle or the ratio of the base to the height in the MEP as the watermark embedding primitive; the TVR algorithm constructs a tetrahedral sequence and uses the ratio between the tetrahedron volumes as the watermark embedding primitive. Literature [5] calculates the distance from each vertex of the model to the center of its vertex neighborhood and the distance from the vertex to the center of the model and embeds the watermark by modifying the ratio between the two. This algorithm is a non-blind watermarking algorithm, which can resist similarity transformation, noise, simplification, and their joint attacks. However, the transparency of the watermark is insufficient.
Literature [6] proposed two digital watermarking algorithms based on local distance: the Vertex Flood Algorithm (VFA) and the Triangle Flood Algorithm (TFA). The VFA algorithm divides the vertex set according to the distance from each vertex of the model to the center of a selected triangle and embeds the watermark by modifying the distance from the vertices in each set to the center of the selected triangle. The TFA algorithm repeatedly selects a triangle and its adjacent triangles, sorts them into a triangle traversal sequence according to the distance from the non-shared vertex to the shared edge, and then modifies the height of each triangle in the traversal sequence to embed the watermark. Literature [7] embeds the watermark by modifying the distance from the model vertex to the center of the model. As a global geometric feature, this distance reflects the shape of the 3D model well and remains sufficiently stable as long as the visual effect of the model is unchanged. Therefore, the algorithm has better robustness against noise and simplification attacks. Literature [8] improves the transparency of the watermark by controlling the intensity of local watermark embedding and uses a weighting method during watermark extraction to improve robustness against simplification and noise attacks. Literature [9] embeds both robust and fragile watermarks in the 3D model by modifying this distance and uses a weighting method to improve the robustness of the algorithm when extracting the watermark. Literature [10] proposed a multiple digital watermarking algorithm.
This algorithm uses the distance from the vertex to the center of the model to embed the first watermark, introduces the affine invariant range, and embeds the second watermark by modifying the vertex order of the triangle faces. The complementary advantages of the two watermarks broaden the range of attacks the algorithm can resist. Literature [11] focuses on improving the transparency of watermarking. Literature [12] improves the method of controlling the embedding strength of local watermarks. Literature [13] uses the K-means clustering method to select a specific set of vertices according to the curvature of the vertices and uses genetic algorithms to embed the watermark.
Literature [14] proposed a digital watermarking algorithm based on the Extended Gaussian Image (EGI). The algorithm builds sets of triangle faces based on the normal vectors of the faces and embeds the watermark by modifying the mean value of the normal vectors of each set. Literature [15] divides the vertices of the 3D model into 6 regions and establishes an extended Gaussian image of the normal vectors for each region, which realizes the repeated embedding of watermark information in each region and optimizes the method of modifying the vertex coordinates. Literature [16] proposed a digital watermarking algorithm based on the complex extended Gaussian image (Complex EGI), which establishes a complex weight for each partition and selects the partitions with larger weights to embed the watermark, which effectively improves the robustness. Literature [17] uses the vertex neighborhood of each vertex to calculate an average vector and embeds the watermark by modifying the length of the average vector. The algorithm can handle polygonal mesh models with arbitrary topologies and has good robustness to affine transformations, but it cannot resist attacks such as mesh reconstruction and mesh simplification. Literature [18] uses the model center and principal component analysis to transform the model into an affine invariant space, transforms the vertex coordinates into spherical coordinates, and then constructs a histogram reflecting the distribution of the radial component of the vertices. The watermark is embedded by moderately changing the distribution of the radial component in the histogram. The algorithm can resist similarity transformation and simplification attacks, but it cannot resist shearing attacks, and it has weak resistance to noise attacks.
Literature [19] defines the distance from the vertex of the 3D model to the center of the model as the vertex norm and proposes a highly robust blind watermarking algorithm based on the statistical characteristics of the vertex norm.
This algorithm establishes a histogram of all vertex norms, divides the histogram into several partitions according to the number of watermark bits, and embeds the watermark by slightly changing the mean or variance of the vertex norms of each partition. This algorithm combines the stability of both the global geometric features and the statistical features of the 3D model and has achieved good robustness against various common attacks. However, the algorithm depends on the center position of the model, so it cannot resist shearing attacks. It also has shortcomings in transparency.

Literary Works Watermarking Algorithm Based on Text Data Mining
By analyzing the characteristics of DXF files, a common BIM model format, this paper builds on existing two-dimensional vector graphics digital watermarking algorithms to propose a digital watermarking algorithm for data copyright protection based on the BIM model. This paper selects the vertex coordinates of the multiface meshes of the entities in the BIM model data to embed the watermark. In practical applications, the vertex coordinates in the BIM model contain many identical values, which leaves few effective carriers for embedding the watermark. To solve this problem, random noise is added to the original coordinate data within the error tolerance to increase the embedding capacity of the watermark. In order to enhance the ability to resist pruning attacks, the watermark information needs to be embedded as evenly as possible in the X and Y coordinates of all multiface mesh vertices of the BIM model data. In order to maintain the synchronization relationship between data and watermark and realize blind watermark detection, the idea of coordinate mapping is adopted. At the same time, the security of the watermark is improved by Logistic scrambling of the watermark image. The algorithm first extracts the vertex coordinates of all the multiface meshes in the data to construct a vertex set and obtains the high-order part of the coordinate data. After that, it establishes a mapping relationship between the high-order part and the watermark through a one-way mapping function, uses the low-order part of the coordinate value as the embedding carrier of the watermark, and embeds the watermark into the vertex coordinates using the quantization modulation method. Moreover, it selects the initial value of the chaotic transformation as the key for watermark extraction. When the watermark is extracted, no original data is needed, and blind detection is realized. The embedding process of the watermark is shown in Figure 1.
Logistic mapping, also known as the insect population model, is a typical chaotic sequence in chaos theory, and its equation form is formula (1): $L_{i+1} = \mu L_i (1 - L_i)$. A chaotic phenomenon is a random-like process that appears in a deterministic system. The process is bounded, non-convergent, and sensitive to initial values. Using chaotic sequences to encrypt the watermark is simple, and because the sequence has no periodicity it is difficult to crack, which improves the security of the watermark. For an image of $M \times N$ size, a one-dimensional chaotic encryption sequence $w$ is obtained after $M \times N$ iterations.
When the conditions $0 < L_i < 1$ and $3.5699456\ldots < \mu \le 4$ are satisfied, the Logistic mapping works in a chaotic state. In particular, when $\mu$ is close to 4, the iteratively generated values follow a pseudo-random distribution. This paper uses the Logistic chaotic map to encrypt an image of $32 \times 64$ size and then reduces the dimensionality of the generated binary watermark image to obtain a one-dimensional sequence $w$ with a length of $S = w_m \times w_n$. The initial value $L_0 = 0.98$ of the chaotic transformation was selected after many trials [20].
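The scrambling step can be illustrated with a minimal sketch. The text fixes only the initial value 0.98 and the 32 × 64 image size; the value `mu = 3.99`, the threshold of 0.5 used to binarize the chaotic sequence, and the XOR combination are assumptions made here for illustration, and the function names are hypothetical rather than taken from the paper.

```python
def logistic_sequence(l0, mu, n):
    """Iterate L_{i+1} = mu * L_i * (1 - L_i) and return n values."""
    values = []
    x = l0
    for _ in range(n):
        x = mu * x * (1.0 - x)
        values.append(x)
    return values


def logistic_scramble(bits, l0=0.98, mu=3.99):
    """XOR the bits with a key stream thresholded at 0.5 (assumed scheme)."""
    key = [1 if v > 0.5 else 0 for v in logistic_sequence(l0, mu, len(bits))]
    return [b ^ k for b, k in zip(bits, key)]


# A toy 32 x 64 binary image flattened to a one-dimensional sequence.
rows, cols = 32, 64
image = [[(r * c) % 2 for c in range(cols)] for r in range(rows)]
flat = [bit for row in image for bit in row]

scrambled = logistic_scramble(flat)
restored = logistic_scramble(scrambled)  # same key, so XOR undoes itself
```

Because XOR is its own inverse, the same function with the same initial value acts as the inverse scrambling, which is consistent with using the initial value as the extraction key.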
Due to the large number of repeated coordinate values in the BIM model, there are fewer effective carriers for embedding the watermark. To solve this problem, this paper adds random noise to the original coordinate data within the error tolerance to increase the embedding capacity of the watermark. The repeated coordinate values in the vertex set of the polyhedral mesh of the original data are subjected to the noise adding operation shown in formula (2) to obtain the processed vertex set $V_e$ [21].
Here, $(x_e, y_e)$ represents the vertex coordinates of the polyhedral mesh after adding noise, $(x, y)$ is the vertex coordinates of the original data, rand is a random function that generates a random number within $(0, 1)$, and $Q$ is the allowable range of error.
This algorithm embeds the watermark with the multifaceted mesh vertices of the BIM model data entity as the object. The vertices of the multifaceted mesh of the BIM model data form the set $V_K = \{V_i = (x_i, y_i) \mid i = 1, 2, \ldots, K\}$. Among them, $V_i$ represents the vertex of each polyhedral mesh, $(x_i, y_i)$ is the X, Y coordinate value of the vertex, and $K$ represents the number of vertices of the polyhedral mesh. The specific process of watermark embedding is as follows:
Step 1. The algorithm reads the BIM model data, extracts all the multiface mesh vertices in the model object entity, and constructs the multiface mesh vertex set $V_K$.
Step 2. The algorithm adds noise to the two coordinate values $(x_i, y_i)$ of each vertex $V_i$ in the set $V_K$ and at the same time enlarges them by a factor of $10^t$; the result is denoted as $V_i' = (x_i', y_i')$. Among them, $V_i'$ represents each polyhedral mesh vertex after noise processing, and $(x_i', y_i')$ is the two coordinate values after noise is added to the vertex.
Step 3. The algorithm selects the embedding position $x_w$ of the watermark according to the data accuracy requirements, using formula (3): $x_w = \mathrm{mod}(\mathrm{floor}(x_i'/p), S)$. Then, the algorithm modifies the vertex coordinates of the multiface mesh one by one according to the mapping relationship between the high-order part of the data and the watermark bit $w(x_w)$. Here, floor represents rounding down, the mod function is the modulo operation and returns the remainder after dividing $x_i'/p$ by $S$, $p$ is the difference between the magnification and the most significant digit after the decimal point, and $S$ represents the length of the watermark; $p = 1000$ is selected in this paper.
Step 4. The algorithm uses quantization index modulation (QIM) to embed the watermark into the processed coordinate value $x_i'$ and calculates the watermarked value $x_i^w$, where the quantization amplitude is $R$. There are two cases, one for each value of the embedded watermark bit [22]. In the same way, according to the embedded watermark bit and the QIM method, the watermark is embedded in the $y_i'$ coordinate of the vertex $V_i'$ of the multifaceted mesh.
Step 5. The algorithm reduces the coordinate values $(x_i^w, y_i^w)$ in $V_i^w$ after the watermark is embedded by a factor of $10^t$ and merges them with the unmodified data to generate the watermarked BIM model data.
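Steps 3 and 4 can be sketched as follows, as an illustration under stated assumptions rather than the paper's exact formulas. The position rule follows the description in Step 3 (round $x_i'/p$ down, then take the remainder modulo $S$). The embedding uses the standard two-lattice form of QIM (multiples of $R$ for bit 0, the same lattice shifted by $R/2$ for bit 1), which may differ in detail from the two cases of the original. The step `r = 100.0`, the requirement that `p` be a multiple of `r`, and all function names are hypothetical.

```python
import math


def watermark_position(x_scaled, p=1000, s=2048):
    """Map the high-order part of a scaled coordinate to a watermark index."""
    return math.floor(x_scaled / p) % s


def qim_embed(x_scaled, bit, r=100.0, p=1000):
    """Quantize the low-order part onto the lattice that encodes `bit`."""
    high = math.floor(x_scaled / p)
    low = x_scaled - high * p              # carrier: low-order part in [0, p)
    offset = bit * r / 2.0                 # bit 0 -> k*r, bit 1 -> k*r + r/2
    low_w = math.floor((low - offset) / r + 0.5) * r + offset
    if low_w >= p:                         # keep the high-order part unchanged
        low_w -= r
    return high * p + low_w


def qim_extract(x_scaled, r=100.0, p=1000):
    """Return the bit whose lattice lies closest to the low-order part."""
    low = x_scaled - math.floor(x_scaled / p) * p
    d0 = abs(low - math.floor(low / r + 0.5) * r)
    d1 = abs(low - (math.floor((low - r / 2.0) / r + 0.5) * r + r / 2.0))
    return 0 if d0 <= d1 else 1


coords = [123456.7, 980.0, 5012.3, 77777.0]   # already scaled coordinates
bits = [1, 0, 1, 0]
marked = [qim_embed(x, b) for x, b in zip(coords, bits)]
recovered = [qim_extract(x) for x in marked]
```

Because only the low-order part is quantized, the position computed from the high-order part is the same before and after embedding, which is what allows blind extraction without the original data. In this sketch a perturbation smaller than `r / 4` does not change the extracted bit.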
The extraction of the watermark is the reverse process of watermark embedding (Figure 3). The specific steps to extract the watermark are as follows:
Step 1. The algorithm reads the BIM model data to be detected, extracts all the vertices of the multifaceted mesh that can be watermarked, and magnifies the vertex coordinates by a factor of $10^t$, where the magnification index $t$ takes the same value as when the watermark was embedded.
Step 2. According to the mapping relationship established by the one-way mapping function and the watermark, the algorithm finds the position $x_w$ of the watermark.
Step 3. The algorithm performs the QIM operation based on the quantization value $R$ used when the watermark was embedded and extracts the value of the watermark bit $w'(x_w)$ by formula (6).
Step 4. In this algorithm, the same watermark is embedded multiple times, and the value of the watermark bit $w'(x_w)$ is used to determine the value of the extracted watermark information $w''$: when the value of the extracted watermark bit is less than 1, the value of the watermark information is 1; otherwise it is 0.
Step 5. The algorithm performs dimension increase processing on the obtained one-dimensional watermark information $w'$ and inversely scrambles it to obtain the watermark image $W'$.
Step 6. Finally, the watermark similarity is evaluated by calculating the normalized correlation coefficient (NC) between the original watermark and the extracted watermark. NC is a measure of similarity: the greater the value, the greater the similarity. The size of the watermark image is $M \times N$, $W(m, n)$ represents the original watermark information, and $W'(m, n)$ is the extracted watermark information.
The BIM model data is a digital expression of the physical and functional characteristics of an engineering project facility. Based on 3D digital technology, it integrates the engineering data of the various kinds of information related to a construction project. The diversity of BIM professional software has led to a diversification of data formats, and the format of the BIM model data is very important for the selection of the hiding domain. Existing application systems are all developed on the basis of geometric data models, and data exchange is mainly carried out through graphics information exchange standards such as IGES, DXF, and DWG. The DXF data model is often used for information exchange between AutoCAD and other software. It is mainly composed of graphic objects and non-graphic objects and also contains limited attribute information, which makes it convenient to operate on. For BIM model data in DXF format, the vertices of the multifaceted mesh are an important feature position of the model data. However, the coordinates of the vertices of the multiface mesh in the BIM model data have many repeated values, and there are fewer effective carriers for embedding watermarks. In order to solve this problem, random noise is added, within the error tolerance range, to the frequency-domain amplitude coefficients obtained by transforming the original coordinate data, which increases the watermark embedding capacity. As shown in Figure 4, W1 is the watermark image extracted without any preprocessing of the original data, and it contains serious noise; W2 is the watermark extracted after the noise preprocessing, and the watermark image is clearly visible. The algorithm proposed in this paper includes a watermark embedding part and a watermark extraction part. First, this paper takes the multiface mesh elements in the BIM model data as the unit and constructs a complex number sequence with all the multiface mesh vertices as feature points.
Moreover, this paper uses the DFT to obtain the amplitude coefficients as the embedding carrier of the watermark, uses the QIM method to embed the watermark in the amplitude coefficients of the DFT frequency domain, and then performs the IDFT to obtain the watermarked BIM model data. After an attack, the watermark is extracted using the voting principle and detected with the correlation method. The original data is not needed at this point, so blind detection is realized. In order to enhance the ability to resist entity deletion attacks, the watermark information is embedded as evenly as possible in the X and Y coordinate transformation coefficients of all multiface mesh vertices in the BIM model data. In order to avoid an excessive influence on the original data, the amplitude value is enlarged before embedding. In order to maintain the synchronization relationship between data and watermark and realize blind watermark detection, the idea of coordinate mapping is adopted. According to the properties of the DFT, in order to avoid the large error caused by a translation attack on the data, the watermark is not embedded in the amplitude of the first transformation coefficient of the set of multiface mesh vertices. To ensure the security of the watermark, Logistic chaotic mapping is used to scramble the original watermark image. The flowchart of the algorithm is shown in Figure 5.
First, the BIM model data in the space domain needs to be DFT-transformed to the frequency domain. The specific process of the transformation is as follows:
Step 1. $V_d = \{v_j\}$, $v_j = (x_j, y_j)$ represents the set of all polyhedral mesh vertices in the original BIM model data, where $j = 1, 2, \ldots, N$, $v_j$ is the coordinates of a polyhedral mesh vertex, $(x_j, y_j)$ is the X, Y coordinate value of the vertex, and $N$ is the number of polyhedral mesh vertices. Using multiface mesh elements as the unit, the complex number sequence $a_j$ is generated according to formula (8).
Step 2. For the $N$-point sequence $a_j$, its DFT is $A_l = \sum_{j=1}^{N} a_j e^{-(2\pi i/N) jl}$, with $j, l \in \{1, 2, \ldots, N\}$.
Here, $A_l$ represents the data after the DFT. The $a_j$ in the formula can be a complex value. In practice, $a_j$ is a real value, that is, the imaginary part is 0. At this time, the formula can be expanded to $A_l = \sum_{j=1}^{N} a_j \left[\cos(2\pi jl/N) - i \sin(2\pi jl/N)\right]$. Each coefficient has two values, the amplitude coefficient $|A_l|$ and the phase coefficient $\angle A_l$, as shown in formula (11). The set of amplitude coefficients is denoted as $\{|A_l|\}$, and the set of phase coefficients is $\{\angle A_l\}$.
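The transform and its amplitude/phase split can be checked numerically with a minimal sketch. The text does not reproduce formula (8), so the construction $a_j = x_j + i\,y_j$ used below is an assumption, as are the function names. The sketch uses 0-based indices, so the coefficient that absorbs a translation of all vertices is `coeffs[0]`.

```python
import cmath


def dft(seq):
    """Direct N-point DFT: A_l = sum_j a_j * exp(-2*pi*i*j*l/N)."""
    n = len(seq)
    return [sum(a * cmath.exp(-2j * cmath.pi * j * l / n)
                for j, a in enumerate(seq))
            for l in range(n)]


def idft(coeffs):
    """Inverse transform: a_j = (1/N) * sum_l A_l * exp(+2*pi*i*j*l/N)."""
    n = len(coeffs)
    return [sum(c * cmath.exp(2j * cmath.pi * j * l / n)
                for l, c in enumerate(coeffs)) / n
            for j in range(n)]


vertices = [(1.0, 2.0), (3.5, -1.0), (0.0, 4.0), (2.0, 2.0)]
seq = [complex(x, y) for x, y in vertices]      # assumed form of formula (8)

coeffs = dft(seq)
amplitudes = [abs(c) for c in coeffs]           # |A_l|, the embedding carrier
phases = [cmath.phase(c) for c in coeffs]       # angle of A_l, left untouched

# Recombine amplitude and phase, then invert to recover the vertices.
rebuilt = idft([cmath.rect(r, phi) for r, phi in zip(amplitudes, phases)])

# Translating every vertex by the same offset only changes coefficient 0.
shifted = dft([a + complex(10.0, -5.0) for a in seq])
```

The last two lines illustrate why a watermark placed in the first coefficient would be sensitive to a translation of the model while the remaining amplitudes are not affected.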
The specific steps of the watermark generation and embedding algorithm are as follows:
(1) Generation of the watermark information. The algorithm reads an image with a size of $M \times M$ ($M \ge 2$) pixels as the original watermark image. In order to improve the security of the watermark, the original watermark is scrambled by Logistic mapping, and the dimensionality of the scrambled binary matrix is reduced to obtain a one-dimensional binary sequence $W = \{w_m = 0, 1 \mid m = 0, 1, \ldots, P - 1\}$, where $P$ represents the length of the watermark.
(2) The algorithm reads the BIM data, the amplitude coefficients $|A_l|$ obtained by the DFT of $a_j$ are enlarged by a factor of $10^7$, and noise is added.
(3) The algorithm uses the QIM method to embed the watermark into the amplified amplitude coefficients and obtains the watermarked amplitude coefficients $|A_l'|$.
(4) The algorithm scales the obtained $|A_l'|$ back to the original data size; the reduction factor is equal to the enlargement factor.
(5) The algorithm combines the watermarked amplitude values with the unmodified phase coefficients to generate the new coefficients $A_l'$ and then applies the IDFT to obtain the complex number sequence $a_l'$ after the watermark is embedded.
(6) The algorithm modifies the vertices of the multifaceted mesh according to $a_l'$ and obtains the set of multifaceted mesh vertices after the watermark is embedded, $V_d' = \{v_j' = (x_j', y_j')\}$, $j \in \{1, 2, \ldots, N\}$, so as to obtain the watermarked BIM data.
The essence of watermark extraction is the reverse process of watermark embedding. When the data owner finds suspicious BIM model data, the algorithm extracts the watermark according to the following steps:
(1) The algorithm reads the vertices of the multifaceted mesh of the BIM data to be tested, forms a set $V_d'$, and generates a complex number sequence $a_j'$ according to formula (8).
(2) The algorithm performs the DFT on $a_j'$ to obtain the amplitude coefficients.
(3) The algorithm uses the same parameters as in the embedding process and applies the QIM method to extract the suspicious watermark bits $w_m'$.
(4) For the extracted one-dimensional watermark $W' = \{w_m' = 0, 1 \mid m = 0, 1, \ldots, P - 1\}$, the algorithm performs dimension increase processing and Logistic inverse scrambling to recover the watermark image.
(5) The algorithm uses equation (14) to calculate the normalized correlation coefficient between the extracted watermark image and the original watermark image to measure the robustness. The larger the value of NC, the more similar the two images and the better the robustness.
Here, $M \times M$ is the size of the watermark image, XNOR is the exclusive NOR operation, $W(m_1, m_2)$ is the original watermark information, and $W'(m_1, m_2)$ is the extracted watermark information. The closer NC is to 1, the more robust the algorithm is.
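As a small worked example of this measure, the sketch below assumes that NC is the fraction of positions at which the two binary images agree, that is, the XNOR of $W(m_1, m_2)$ and $W'(m_1, m_2)$ summed over all pixels and divided by $M \times M$. This form is inferred from the symbol definitions above, not copied from equation (14), and the function name is hypothetical.

```python
def normalized_correlation(original, extracted):
    """Fraction of matching bits between two equally sized binary images."""
    if len(original) != len(extracted):
        raise ValueError("images must have the same number of rows")
    matches = 0
    total = 0
    for row_o, row_e in zip(original, extracted):
        if len(row_o) != len(row_e):
            raise ValueError("rows must have the same length")
        for a, b in zip(row_o, row_e):
            matches += 1 if a == b else 0   # XNOR of two bits
            total += 1
    return matches / total


w_original = [[1, 0], [0, 1]]
w_same = [[1, 0], [0, 1]]
w_one_error = [[1, 0], [1, 1]]

nc_same = normalized_correlation(w_original, w_same)            # 1.0
nc_one_error = normalized_correlation(w_original, w_one_error)  # 0.75
```

Under this definition a value of 1 means every extracted bit matches the original, and a value near 0.5 is what two unrelated random binary images would give on average.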

Literary Works Protection Based on Data Mining Algorithm
For digitized literary works, we can use the watermarking algorithm to watermark the features of a work and obtain a watermarked digital literary work. After that, we can combine data mining algorithms to perform text feature recognition and feature classification to improve the copyright protection effect for literary works. The author recognition method mainly includes two modules: a training module and a classification module. The training module preprocesses the original corpus, extracts the key features of the text, and trains the classifier. The disputed text classification module preprocesses the disputed text, extracts the statistical feature vector from it, inputs the vector into the trained classifier, and finally outputs the author category from the classifier. The methods used in the first two stages of these two modules are exactly the same. The main function of the training module is to build the trained classifier. For a disputed work, the key statistical features are extracted and input into the trained classifier, and the author's category is finally judged based on the similarity value. The flowcharts of the training module and the classification module are shown in Figures 6(a) and 6(b), respectively. The corpus must first undergo text normalization, and once it is expressed in a form that can be processed by the computer, the normalized text is segmented.
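The two modules can be sketched in a few lines. The paper does not specify the features or the classifier, so everything below is an illustrative assumption: the features are relative frequencies of a small hypothetical function-word list, the "classifier" is one averaged profile per author, and the decision uses cosine similarity as the similarity value. The toy texts are English and whitespace-tokenized, whereas a Chinese corpus would need word segmentation first.

```python
import math
from collections import Counter

# Hypothetical feature set: a few function words.
FUNCTION_WORDS = ["the", "and", "of", "to", "but", "i", "we", "not"]


def features(text):
    """Shared preprocessing and feature extraction for both modules."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def train(corpus):
    """Training module: average the feature vectors of each author."""
    profiles = {}
    for author, texts in corpus.items():
        vectors = [features(t) for t in texts]
        profiles[author] = [sum(col) / len(vectors) for col in zip(*vectors)]
    return profiles


def classify(profiles, disputed_text):
    """Classification module: pick the author with the most similar profile."""
    vec = features(disputed_text)
    return max(profiles, key=lambda author: cosine(profiles[author], vec))


corpus = {
    "A": ["the sea and the sky of the north", "the road to the city of the king"],
    "B": ["i wait but we do not go", "we try but i do not know"],
}
profiles = train(corpus)
guess_a = classify(profiles, "the house of the old man and the dog")
guess_b = classify(profiles, "i think we should not stay but leave")
```

Both modules call the same `features` function, which mirrors the statement that their first two stages are identical; only the final stage differs (building profiles versus comparing against them).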
The system structure is shown in Figure 7(a). A named entity is the actual entity referred to in a Chinese text sentence, such as a unit name, a person name, a geographic name, or an organization name. Named entity recognition is one of the basic tasks in natural language processing, and it plays an important role in word segmentation, syntactic analysis, machine translation, and other technologies. At present, the lexical analysis tools developed by the Chinese Academy of Sciences and Harbin Institute of Technology include a module for named entity recognition in Chinese text sentences. The principle of this module is shown in Figure 7(b).
After combining the watermarking algorithm to obtain the above model, this paper conducts experimental verification on the model. First, the effect of the text data mining algorithm in the feature recognition of the watermarking algorithm is verified, and the results shown in Table 1 are obtained. The above verifies that the text data mining algorithm has a very good effect in the feature recognition of the watermark algorithm. On this basis, the copyright protection effect is evaluated.
This part is carried out by the expert evaluation method, and the results are shown in Table 2.
The above research has verified that the copyright protection effect of literary works based on data mining algorithms is very good.

Conclusion
While the digitization of literary works brings people new ways of producing and living, its characteristics have also brought about a copyright crisis. When products exist in digital form, they can be easily edited, modified, and stored through computers or other digital equipment. At the same time, they can be copied and transmitted at low cost and without loss through various storage media, computer networks, or other data transmission methods. These very advantages of digital literary works make it easy to illegally appropriate, copy, edit, and disseminate them without authorization, infringing the owner's copyright. This paper combines data mining technology to study the copyright protection of literary works, constructs a literary copyright protection system, and improves the copyright protection effect of modern digital literary works. The experimental results verify that the copyright protection system for literary works based on data mining algorithms performs very well.

Data Availability
The labeled dataset used to support the findings of this study is available from the author upon request.

Conflicts of Interest
The author declares no conflicts of interest.