Research on Universal Combinatorial Coding

The concept of universal combinatorial coding is proposed. Many coding methods are related to one another to some degree, which suggests that a universal coding method exists and can serve as a bridge connecting them. Universal combinatorial coding is lossless and is based on combinatorics. Its combinatorial and exhaustive properties relate it closely to existing coding methods. It does not depend on the probability statistics of the information source, and it has characteristics that span the three branches of coding theory. This paper analyzes the relationship between universal combinatorial coding and a variety of coding methods and investigates several applications of the method. In addition, the efficiency of universal combinatorial coding is analyzed theoretically. Its multiple characteristics and multiple applications are unique among existing coding methods. Universal combinatorial coding has both theoretical research value and practical application value.


Introduction
Coding theory includes three branches: source coding, channel coding, and secrecy coding [1]. The primary mission of source coding is data compression, with examples such as Huffman coding [2], arithmetic coding [3,4], and dictionary coding [5,6]. Channel coding, for example error-correcting codes, improves the reliability of communication. Secrecy coding guarantees information security during transmission and is usually achieved by data encryption and decryption.
Generally speaking, each existing coding method belongs to only one of the three branches. In fact, many coding methods are related to one another to some degree, which suggests that a universal coding method exists. Such a method can reflect a variety of encoding features from different points of view and become a bridge connecting many coding methods, thus promoting the further development of coding technology.

Theory of Combinatorial Coding
Universal combinatorial coding is a lossless coding method. The term "universal" has three meanings: first, the encoding method does not depend on the probability statistics of the information source; second, the method has multiple characteristics because of its combinatorial and exhaustive properties; third, it has multiple applications.
Assume that a sequence $a_1 a_2 \cdots a_n$ with $n$ elements is to be coded and that it contains $m$ different code elements, where the frequency of the $k$th code element is $c_k$. If the elements are arranged in all possible permutations, ordered according to the benchmark sequence, a new dictionary space with a strict ordering of sequences is formed. The problem is how to compute the position of the sequence in this dictionary space.
Consider the $i$th element of the sequence. If it is the same as the $j$th element of the benchmark sequence, then, at the current position, the number of sequences that precede the sequence being coded is given by (1). Each count in (1) is the number of occurrences of the relevant element. Equation (1) also involves the number of key elements, where the count of each key element is not 0. Equation (1) requires numerous computations of permutations and combinations, so its computational efficiency is too low; it can therefore be optimized to (2). For the $i$th character, the total number of sequences that involve the $j - 1$ elements in front of the $i$th character is given by (3). Finally, the position (ordinal) in the dictionary of the sequence with $n$ elements is obtained by summing these totals over all positions.
The basic idea of decoding is to presuppose that the element at the position being decoded is a certain element, taken in the order of the benchmark sequence. The corresponding permutation value $P_1$ is calculated under this assumption and compared with the ordinal. If the ordinal is greater than or equal to $P_1$, the permutation and combination value $P_2$ of the next element in the benchmark sequence is calculated, and the ordinal is compared with $P_1 + P_2$, and so on, until the ordinal is less than $P_1 + P_2 + \cdots + P_j$. At this point, the $j$th element of the benchmark sequence is the element at the current position. Finally, in order to decode the element at the next position, the new ordinal is obtained by subtracting $P_1 + P_2 + \cdots + P_{j-1}$ from the current ordinal. When the value of the ordinal is 0, the algorithm ends. If elements still remain at this point, they are appended to the decoded sequence in benchmark order.
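The encoding and decoding steps described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `count_arrangements`, `encode`, and `decode` are hypothetical, each permutation value is computed directly from factorials (the unoptimized form), and the sequence "bdaca" with benchmark sequence "abcd" is the example used later in this paper.

```python
from collections import Counter
from math import factorial


def count_arrangements(counts):
    """Number of distinct orderings of a multiset with the given counts."""
    total = factorial(sum(counts.values()))
    for c in counts.values():
        total //= factorial(c)
    return total


def encode(sequence, benchmark):
    """Return the zero-based ordinal of `sequence` in benchmark order."""
    counts = Counter(sequence)
    ordinal = 0
    for symbol in sequence:
        # Add the number of sequences that have an earlier benchmark
        # element at this position (and the same prefix so far).
        for earlier in benchmark:
            if earlier == symbol:
                break
            if counts[earlier] > 0:
                counts[earlier] -= 1
                ordinal += count_arrangements(counts)
                counts[earlier] += 1
        counts[symbol] -= 1
    return ordinal


def decode(ordinal, frequency_table, benchmark):
    """Rebuild the sequence from its ordinal and its frequency table."""
    counts = Counter(frequency_table)
    decoded = []
    for _ in range(sum(counts.values())):
        # Try benchmark elements in order until the ordinal falls
        # inside the block of sequences that start with that element.
        for symbol in benchmark:
            if counts[symbol] == 0:
                continue
            counts[symbol] -= 1
            block = count_arrangements(counts)
            if ordinal < block:
                decoded.append(symbol)
                break
            ordinal -= block
            counts[symbol] += 1
    return "".join(decoded)


ordinal = encode("bdaca", "abcd")
restored = decode(ordinal, {"a": 2, "b": 1, "c": 1, "d": 1}, "abcd")
print(ordinal, restored)
```

The decoder needs only the ordinal, the frequency table, and the benchmark sequence, which is why the frequency table has to be recorded together with the ordinal.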

Optimized or Parallel Computing
Computing the ordinal by the combinatorial method has high computational complexity; the calculation of the ordinal can be optimized by the proportion method.
In the calculation of (3), the permutation and combination values of the $j - 1$ elements in the group corresponding to the $i$th position are proportional to one another; that is, the ratio between each permutation and combination value $P_{i,k}$ and the first one, $P_{i,1}$, equals the ratio between the count of the corresponding element and the count of the first element. Equation (3) can therefore be optimized to (6). In fact, the permutation and combination value of the first element in the $(i + 1)$th element group is also proportional to that of the first element in the $i$th element group, as shown in (7). In (7), $c_j$ is the actual frequency of the $i$th character, which is the $j$th element in the frequency table, and the number of remaining elements is $r$ ($r$ = (the number of sequence elements) $- i$). $P_{1,1}$ is a particular case that can be processed specially: take $P_{0,1}$ as the permutation and combination number of all elements of the sequence to be coded, $n$ as the number of all elements of the sequence to be coded, and $c_1$ as the count of the first element in the benchmark sequence.
In (8), $c_1$ represents the count of the first element in the benchmark sequence when the $i$th step is processed, that is, the current count of the first element in the benchmark sequence. When empty is 0, (8) is (7).
In the computation, the first permutation and combination value, corresponding to the first element, is computed from the whole permutation and combination value Max, which can be calculated through
$$\mathrm{Max} = \frac{n!}{\prod_{k=1}^{m} c_k!}, \quad (9)$$
where $c_k$ is the frequency of the $k$th of the $m$ different code elements. With the proportion method, the computing speed of the ordinal is greatly improved.
The decoding process of proportion computation is similar.
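A minimal sketch of the proportion idea is given below, under the assumption that it amounts to the following observation: among all arrangements of the elements not yet coded, the share that begins with a given element equals that element's count divided by the number of remaining elements. The function name `encode_by_proportion` is hypothetical, and the sketch is not claimed to reproduce equations (6) to (8) term by term. The whole value Max is computed once, and every later quantity is obtained from it by one multiplication and one division.

```python
from collections import Counter
from math import factorial


def encode_by_proportion(sequence, benchmark):
    """Zero-based ordinal computed with a single factorial evaluation."""
    counts = Counter(sequence)
    remaining = len(sequence)
    # Max: number of arrangements of the whole multiset, computed once.
    total = factorial(remaining)
    for c in counts.values():
        total //= factorial(c)
    ordinal = 0
    for symbol in sequence:
        # Combined count of the benchmark elements that come before `symbol`.
        earlier = 0
        for element in benchmark:
            if element == symbol:
                break
            earlier += counts[element]
        # Arrangements that start with an earlier element take the share
        # earlier / remaining of `total`; the division is always exact.
        ordinal += total * earlier // remaining
        # Arrangements that start with `symbol` become the new total.
        total = total * counts[symbol] // remaining
        counts[symbol] -= 1
        remaining -= 1
    return ordinal


print(encode_by_proportion("bdaca", "abcd"))
```

Because each step touches only the current counts and one running total, the inner work per position is proportional to the number of different code elements rather than to the cost of recomputing factorials.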
Once the value of $n$ is determined, Max can be calculated in advance for the case in which all element frequencies are equal, in order to accelerate the computation of Max. In this case Max is the largest and is called Whole; Whole is then stored in a file, from which it can be read directly when needed. The Max corresponding to an actual sequence is then obtained by adjusting Whole according to the actual frequencies in that sequence. Obviously, Whole can be calculated as
$$\mathrm{Whole} = \frac{n!}{\left((n/m)!\right)^{m}},$$
where $n$ is the sequence length and $m$ is the number of different code elements. For convenient processing, the sequence length $n$ is usually an integer multiple of $m$.
In addition, the sequence can be split into segments, so that the ordinal numbers can be computed in parallel and the calculation efficiency improved [7].
GPU parallel computing is developing rapidly and can be used in various areas [8,9]. Thousands of concurrent threads can greatly increase the computing speed of the universal combinatorial coding method, whereas existing coding methods such as arithmetic coding or dictionary coding are difficult to adapt to GPU parallel computing. Because this paper describes universal combinatorial coding from a theoretical angle, the GPU parallel method is not described in detail. Figure 1 compares the average time taken by the CPU and the GPU to compute the whole ordinal at different sequence lengths.
Figure 1 shows that the slope of the CPU serial computing curve is larger than that of the GPU parallel computing curve: as the sequence length increases, the CPU serial computation time grows fast, while the GPU parallel computation time grows relatively slowly. When the sequence length is 8 K, 16 K, or 32 K, the GPU parallel computation takes more time than the CPU serial computation, because the sequence is not long enough for the time saved by GPU parallel computation to make up for the time spent on data transfers between the CPU and the GPU. When the sequence length is 64 K, the GPU parallel computation starts to take less time than the CPU serial computation, which indicates that the time saved exceeds the data transfer time. From this point on, the advantage of GPU parallel computing grows with the sequence length.
The growth of the speedup is shown in Figure 2.
According to the trend of the curve in Figure 2, the longer the sequence is, the greater the speedup is. The experiment shows that computing the whole ordinal can be accelerated by GPU parallel technology and that the speedup increases with the sequence length.

Relation between Universal Combinatorial Coding and Other Coding Methods
Universal combinatorial coding has many features in common with a variety of classical encoding methods. (Many current encoding methods are derived from these classical methods and improved for special applications [10][11][12].) Preliminary studies have shown that universal combinatorial coding has the characteristics of tree coding, arithmetic coding, and dictionary coding.
The following discussion relates universal combinatorial coding to these other coding methods.

Universal Combinatorial Coding Is a Tree Coding.
Assume that there is a sequence $a_1 a_2 \cdots a_n$ that contains $n$ elements and consists of $m$ different elements, where $m = 2^V$; that is to say, each element occupies $V$ bits. The benchmark sequence space contains the full permutations of the $m$ different elements, so there are obviously $m!$ benchmark sequences. Once a benchmark sequence is fixed, a tree can be generated, and the position of $a_1 a_2 \cdots a_n$ in this tree is determined. The other paths from the root to the leaf nodes represent the sequences that have the same code elements as $a_1 a_2 \cdots a_n$ but in a different order. The total number of sequences is Max; serial numbers start from 0, so the largest serial number, Max − 1, is the biggest ordinal.
For example, assume the sequence is "bdaca", in which the number of different elements is 4, so $V$ is 2. The number of benchmark sequences is 4! = 24, which means that there are 24 trees, all of them quad trees with strict order. Assume that the benchmark sequence is "abcd"; then each code element can be represented as a: 00, b: 01, c: 10, and d: 11. Thus the sequence can be expressed as 0111001000 and occupies 10 bits. Represented in tree form, the sequence gives the tree in Figure 3.
The quad tree in Figure 3 consists of the sequences that each contain two "a", one "b", one "c", and one "d". The number of such sequences is $C_5^2 \cdot C_3^1 \cdot C_2^1 \cdot C_1^1 = 60$. These sequences are arranged strictly in order, with serial numbers from 0 to the biggest ordinal number, 59. The ordinal number corresponding to the sequence "bdaca" is 34; it is the number of sequences that precede "bdaca", that is, sequences with the same code elements in a different order. So the position of "bdaca" in the quad tree also reflects the combinatorial property. Decoding is the reverse process. Thus, combinatorial coding can be expressed as a tree structure: it calculates the position of the coding sequence in an $m$-ary tree based on the principles of combinatorial theory.
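As a check on the ordinal quoted for "bdaca", the count of preceding sequences can be worked out position by position. The layout below is only an illustrative derivation under the benchmark sequence "abcd": at each position it counts the arrangements of the remaining elements that place an earlier benchmark element there.

```latex
\begin{align*}
\text{position 1 (b):}\quad & \underbrace{\tfrac{4!}{1!\,1!\,1!\,1!}}_{\text{a placed first}} = 24 \\
\text{position 2 (d):}\quad & \underbrace{\tfrac{3!}{1!\,1!\,1!}}_{\text{a placed next}}
   + \underbrace{\tfrac{3!}{2!\,1!}}_{\text{c placed next}} = 6 + 3 = 9 \\
\text{position 3 (a):}\quad & 0 \\
\text{position 4 (c):}\quad & \underbrace{\tfrac{1!}{1!}}_{\text{a placed next}} = 1 \\
\text{position 5 (a):}\quad & 0 \\
\text{ordinal:}\quad & 24 + 9 + 0 + 1 + 0 = 34
\end{align*}
```

At position 2 the element b contributes nothing because its only occurrence has already been used, which is why only elements with a nonzero remaining count enter the sum.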

Universal Combinatorial Coding Is Dictionary Coding.
If the paths of the $m$-ary tree are arranged strictly in order from left to right, a dictionary is obtained. It can be described as follows. Assume that there is a sequence $a_1 a_2 \cdots a_n$ that consists of $m$ different elements, where $m = 2^V$ and each code element occupies $V$ bits. The numbers of occurrences of the code elements are $c_1, c_2, \ldots, c_m$; obviously, $c_1 + c_2 + \cdots + c_m = n$. The benchmark sequence is made up of the $m$ different code elements, so the number of benchmark sequences is $m!$. Once the benchmark sequence is fixed, a dictionary space is determined, and so is the position (ordinal number) of $a_1 a_2 \cdots a_n$ in it. The dictionary space contains all the sequences that have the same code elements as $a_1 a_2 \cdots a_n$ in a different order. The number of sequences can be calculated according to (9).
The ordinal of the coding sequence can also be calculated according to the related equations in the preceding sections. The whole dictionary space and the position of the coding sequence in the dictionary are shown in Table 1.
Each sequence in the dictionary is strictly ordered under the constraints of the benchmark sequence. Of course, the combination dictionary is never physically stored; it is implicit in the frequency table of the sequence, so it occupies no actual space and takes no time to build. The position (ordinal number) of the coding sequence in the dictionary can be calculated, and the size of the ordinal number is less than the size of the sequence space. Taking advantage of this characteristic, universal combinatorial coding can be used for data compression. The dictionary space of universal combinatorial coding is fixed and objectively existent: it contains all the sequences that have the same code element counts as the coded sequence.
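For a sequence as short as "bdaca" the implicit dictionary can be written out explicitly, which makes the notion of a position in the dictionary space concrete. The sketch below is for illustration only; the helper name `build_dictionary` is hypothetical, and materializing the dictionary is exactly what the coding method avoids for sequences of realistic length.

```python
from itertools import permutations


def build_dictionary(sequence, benchmark):
    """List every distinct reordering of `sequence` in benchmark order."""
    order = {symbol: index for index, symbol in enumerate(benchmark)}
    distinct = set(permutations(sequence))
    ranked = sorted(distinct, key=lambda p: [order[s] for s in p])
    return ["".join(p) for p in ranked]


dictionary = build_dictionary("bdaca", "abcd")
print(len(dictionary))              # size of the dictionary space
print(dictionary.index("bdaca"))    # ordinal of the coded sequence
print(dictionary[0], dictionary[-1])
```

The list index of a sequence in this explicit dictionary is the same zero-based ordinal that the combinatorial calculation yields without building the list.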

Universal Combinatorial Coding Is Arithmetic Coding.
Universal combinatorial coding uses an absolute position value to express a sequence in the dictionary space, and this value is an integer, so it can be seen as a form of arithmetic coding. Traditional arithmetic coding expresses a sequence as a real number between 0 and 1, which can be understood as a relative position value. A method closer to universal combinatorial coding is range encoding, which can also be seen essentially as arithmetic coding. Range encoding, however, requires a sufficiently large positive integer. This integer plays the role of the largest ordinal number, but it is not precise and is larger than the largest ordinal number actually needed. In other words, the ordinal number of universal combinatorial coding is actually the smallest and most accurate "positive integer which is large enough."
Most importantly, both range coding and 0-1 arithmetic coding are based on probability, and the probability is always assumed to be constant, which inevitably leads to error. Even worse, in many cases the probability of the sequence elements cannot be predicted. Adaptive arithmetic coding does not depend on a known probability, but its compression ratio is reduced accordingly. Universal combinatorial coding, by contrast, is based not on probability but on frequency, and it adjusts the frequencies during the computation according to the actual situation, so the accuracy of the ordinal number is ensured. The drawback is that the frequency of each element must be recorded. Figure 4 is a schematic diagram of the results calculated for the instance "bdaca" by combinatorial coding, range coding (0 to 10000), and 0-1 arithmetic coding.
In summary, universal combinatorial coding uses an integer value to represent the sequence exactly, while the traditional arithmetic coding methods use a relative or approximate position to express the sequence. If universal combinatorial coding instead uses a proportion to represent the relative position of the sequence in the entire dictionary space, the proportion is similar to the result of 0-1 arithmetic coding: 34/59 ≈ 0.576.
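The comparison with 0-1 arithmetic coding can be reproduced with a small sketch. It assumes a static model in which each symbol's probability is its frequency in "bdaca" divided by the sequence length, with the subintervals laid out in benchmark order; this model is an assumption made here for illustration and is not necessarily the one behind Figure 4. The function name `arithmetic_interval` is hypothetical.

```python
from collections import Counter
from fractions import Fraction


def arithmetic_interval(sequence, benchmark):
    """Return (low, width) of the 0-1 arithmetic coding interval."""
    counts = Counter(sequence)
    n = len(sequence)
    # Assumed static model: probability = frequency / sequence length.
    prob = {s: Fraction(counts[s], n) for s in benchmark}
    cumulative, running = {}, Fraction(0)
    for s in benchmark:
        cumulative[s] = running
        running += prob[s]
    low, width = Fraction(0), Fraction(1)
    for s in sequence:
        # Narrow the interval to the subinterval assigned to `s`.
        low += width * cumulative[s]
        width *= prob[s]
    return low, width


low, width = arithmetic_interval("bdaca", "abcd")
print(float(low), float(low + width))   # interval bounds
print(34 / 59)                          # relative ordinal position
```

Under this assumed model the interval begins at 0.5696, close to but not identical with the relative position 34/59; the difference comes from the constant probabilities, whereas the combinatorial count updates the frequencies after every element.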
The combinatorial nature of universal combinatorial coding relates it, to a greater or lesser degree, to many other coding methods; that is, the method has multicode connectivity. Universal combinatorial coding therefore has the potential to become a tool for measuring the characteristics of a variety of coding methods.

Multiapplication of Universal Combinatorial Coding
Besides the characteristics of tree coding, arithmetic coding, and dictionary coding, universal combinatorial coding has further special properties. These properties make it applicable in many ways.

The Estimation of the Size of the Ordinal.
Using the relation between the ordinal and the character frequencies of the sequence, the size of the ordinal can be estimated preliminarily. Current research shows that the larger the frequency differences among the characters in the sequence are, the smaller the dictionary containing the sequence is and, at the same time, the smaller the maximum ordinal and the average ordinal are. When the frequency differences among the characters are less than 1, the global maximum ordinal is obtained. The global maximum ordinal is related only to the sequence length. The sizes of the sequence and of the various ordinals satisfy the following inequality:
$$\mathrm{len}_{\text{ordinal}} \le \mathrm{len}_{\text{Max}} \le \mathrm{len}_{\text{Whole}} \le \mathrm{len}_{\text{sequence}}.$$
The difference $\mathrm{len}_{\text{sequence}} - \mathrm{len}_{\text{Whole}}$ grows with the sequence length, but its rate of increase becomes smaller and smaller, as can be seen in Figure 5.
The global maximum ordinal is used for preliminary estimates or theoretical analysis of the ordinal. It can be precalculated and stored in a file, and it can then be used in calculating the ordinal of the sequence to be coded.
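The size relations discussed above can be explored numerically. The sketch below is an illustration under stated assumptions: lengths are measured in bits, the sequence space is taken to hold every sequence of the same length over the same code elements, and the names `max_count`, `whole_count`, and `bits` are hypothetical.

```python
from math import factorial


def max_count(counts):
    """Dictionary size (Max) for the given element frequencies."""
    total = factorial(sum(counts))
    for c in counts:
        total //= factorial(c)
    return total


def whole_count(n, m):
    """Largest dictionary size (Whole): all m frequencies equal n / m."""
    if n % m:
        raise ValueError("n must be an integer multiple of m")
    return factorial(n) // factorial(n // m) ** m


def bits(count):
    """Bits needed to store any ordinal in range(count)."""
    return max(1, (count - 1).bit_length())


n, m, v = 8, 4, 2                  # 8 elements, 4 code elements, 2 bits each
skewed = [5, 1, 1, 1]              # large frequency differences
print(bits(max_count(skewed)))     # ordinal length for a skewed sequence
print(bits(whole_count(n, m)))     # ordinal length in the worst case
print(n * v)                       # length of the raw sequence
```

The skewed frequency table gives a much smaller dictionary than the balanced one, which matches the observation that larger frequency differences lead to smaller ordinals.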

Combinatorial Compression Property.
The property that the ordinal length must be less than the sequence length can be used for combinatorial compression.
Universal combinatorial coding mainly uses the ordinal to reduce frequency redundancy, and it codes the whole sequence. A space A is determined by the length of the coding sequence. Space A can be transformed into a smaller space B through the restriction of the sequence frequency table: space B is made up of the sequences that have the same frequency table as the coding sequence. The position of the coding sequence in space B is smaller than its position in space A. For example, for the sequence "bdaca", the number of different elements is 4, and each code element occupies 2 bits (a: 00; b: 01; c: 10; d: 11). Assume that the benchmark sequence is "abcd"; then the position in space A is 0111001000 (456 in decimal). After the constraint of the frequency table (a: 2; b: 1; c: 1; d: 1) is applied, the dictionary space B is formed, and the position of the sequence in space B is 34, as shown in Figure 6.
It can be seen from Figure 6 that space B, obtained through the restriction of the sequence frequency table, is smaller than the original sequence space A. At the same time, the average position number (ordinal number) of the coding sequence in space B also becomes smaller.
The arrangement of the key in combinatorial coding exchanges data and position. That is to say, the $(r+1)$th round key can be deduced from the $r$th round key: first find the data at the $i$th location (counting from 0) of the $r$th round key; then, with the roles of data and location exchanged, write it into the $(r+1)$th round key. To prevent the location and data from falling into an endless cycle, the resulting $(r+1)$th round key is then shifted right circularly, which gives the final $(r+1)$th round key. The process is shown in Figure 7 (for convenience of description, take m = 4).
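One possible reading of the data and position replacement step is sketched below. It rests on several assumptions that the text does not confirm: a round key is taken to be a permutation of the integers 0 to m − 1, the replacement is taken to mean that the value found at a location becomes the location at which that location's index is stored (the inverse permutation), and the circular right shift is taken to be by one place. The function name `next_round_key` is hypothetical.

```python
def next_round_key(key, shift=1):
    """Derive the next round key from `key` (assumed interpretation)."""
    m = len(key)
    swapped = [0] * m
    for location, data in enumerate(key):
        # Exchange roles: the data becomes the location, and the
        # original location becomes the data stored there.
        swapped[data] = location
    # Circular right shift. Without it, applying the exchange twice
    # would simply return the original key.
    shift %= m
    return swapped[-shift:] + swapped[:-shift]


key = [2, 0, 3, 1]          # hypothetical round key for m = 4
for _ in range(3):
    key = next_round_key(key)
    print(key)
```

Under this reading the shift matters because the exchange of data and location is its own inverse, which would otherwise make the keys repeat every second round.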
The data and position replacement method makes every key different: not only the round keys but also the group keys within each round. A group key is created in a similar way to a round key: the previous key is shifted right in turn, and then its data and positions are exchanged. Figure 8 shows the relationship among the main key, the round keys, and the group keys.
The key generation method of combinatorial encryption gives the main key, the round keys, and the group keys the same large key space, so the difficulty of breaking the cipher is increased further. In addition, the combinatorial encryption method uses the code element rather than the bit as the information processing unit, so the group length of the sequence can be longer, and the method is more suitable than existing encryption algorithms for parallel computing on multi-core and multi-GPU systems.
Figure 9 compares the key space of the combinatorial encryption method with that of existing encryption methods (which usually use the bit as the encryption unit). For convenience, only the exponent of the key space is plotted on the axis in Figure 9.
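The growth of the key space can be estimated with a short calculation. The sketch assumes that a key is one of the m! possible benchmark sequences over m = 2^V code elements and that a bit-oriented key of the same width V has 2^V possible values; both assumptions are made here for illustration and may not match the exact setup behind Figure 9. The function name `permutation_key_bits` is hypothetical.

```python
from math import lgamma, log


def permutation_key_bits(v):
    """Base-2 exponent of the key space m! for m = 2**v code elements."""
    m = 2 ** v
    # lgamma(m + 1) equals ln(m!), so very large factorials are avoided.
    return lgamma(m + 1) / log(2)


for v in (2, 4, 8, 16):
    print(v, round(permutation_key_bits(v)))
```

For V = 8 the exponent is about 1684, compared with an exponent of 8 for an 8-bit key under the same assumption, which illustrates how quickly the permutation key space expands.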
It can be seen from Figure 9 that the key space of the combinatorial encryption method expands rapidly as the number of bits per code element increases, so security is increased.
Universal combinatorial coding can also perform error detection by using the ordinal as an error-detecting code, which can be applied to data validation in communication transmission or information storage. It can further be used to produce an information abstract (message digest) by combining the combinatorial verifying method with a secret key. The information abstract is only used to check whether the data has been tampered with; the data does not need to be recovered from it. According to the key sequence, the combination calculation can be applied to the sequence multiple times; that is, a further ordinal calculation is applied to each ordinal obtained, until the length of the ordinal meets the user requirement. The last ordinal is then used as the information abstract. Of course, the number of ordinal calculations must be recorded. For information abstracts computed through universal combinatorial coding, the collision rate is very low in theory.

The Assessment of Universal Combinatorial Coding Efficiency
Universal combinatorial coding is independent of the probability statistics of the information source, so Shannon entropy cannot be used directly to assess its coding quality. However, most evaluation methods still use probability, so in this paper probability is still adopted to assess the efficiency of universal combinatorial coding by approximate calculation. For comparison with Shannon theory, suppose that the source is a stationary memoryless sequence over $m$ elements, that the sequence length is $N$, and that the probability distribution of the source symbols is $p_i$ ($i = 1, \ldots, m$). The information source is divided into $N/n$ segments of length $n$ for processing. When the sequence length $N$ and the segment length $n$ are very large, the average code length can be calculated as follows. For each segment of length $n$, the number of ordinals obtained by universal combinatorial coding is
$$\frac{n!}{\prod_{i=1}^{m} (n p_i)!},$$
so the length of the binary code occupied by each ordinal is
$$\left\lceil \log_2 \frac{n!}{\prod_{i=1}^{m} (n p_i)!} \right\rceil.$$
Although the frequency of each character also occupies space, this space is very small compared with that of the ordinal. So the total code length of the sequence with $N$ elements is
$$\frac{N}{n} \left\lceil \log_2 \frac{n!}{\prod_{i=1}^{m} (n p_i)!} \right\rceil.$$
Putting (19) into inequality (18) shows that, when $n$ is large enough, the average code length in universal combinatorial coding asymptotically approaches the source entropy limit.
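The approach to the entropy limit can be checked numerically. The sketch below assumes an illustrative distribution (0.4, 0.2, 0.2, 0.2) and segment lengths for which every expected count is an integer; it ignores the ceiling and the space taken by the frequency table, and the function names `entropy` and `bits_per_symbol` are hypothetical.

```python
from math import lgamma, log, log2


def entropy(probs):
    """Shannon entropy of the distribution, in bits per symbol."""
    return -sum(p * log2(p) for p in probs if p > 0)


def bits_per_symbol(n, probs):
    """Ordinal length per symbol for a segment of length n."""
    counts = [round(n * p) for p in probs]
    if sum(counts) != n:
        raise ValueError("n * p must be an integer for every symbol")
    # ln of n! / prod(count!) via lgamma, then converted to bits.
    log_count = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)
    return log_count / log(2) / n


probs = (0.4, 0.2, 0.2, 0.2)       # assumed example distribution
print(entropy(probs))
for n in (5, 50, 500, 5000):
    print(n, bits_per_symbol(n, probs))
```

The per-symbol ordinal length stays below the entropy and moves toward it as the segment length grows, in line with the asymptotic statement above.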
So the calculation result is magnified, which means that the actual coding efficiency is better than the approximate calculation result.

Conclusions
The concept of universal combinatorial coding is put forward and its related properties are analyzed in this paper. Universal combinatorial coding spans the three branches of coding theory. It has the characteristics of multiple coding methods and various application prospects, for example, combinatorial compression, encryption and decryption, and error detection. Theoretical analysis shows that the average code length of universal combinatorial coding is close to the source entropy limit when the segment length is large enough, and the actual coding efficiency should be better than the approximate calculation results.