An Immune Clonal Selection Algorithm for Synthetic Signature Generation

The collection of signature data for system development and evaluation generally requires significant time and effort. To overcome this problem, this paper proposes a clonal selection algorithm based on detector generation for synthetic signature set generation. The goal of synthetic signature generation is to improve the performance of signature verification by providing more training samples. Our method uses the clonal selection algorithm to maintain the diversity of the overall set and avoid a sparse feature distribution. The algorithm first generates detectors with a segmented r-continuous bits matching rule and a P-receptor editing strategy to provide a wider search space. The clonal selection algorithm is then used to expand and optimize the overall signature set. We demonstrate the effectiveness of our clonal selection algorithm, and the experiments show that adding the synthetic training samples can improve the performance of signature verification.


Introduction
Handwritten signature recognition is an effective identity authentication method, since every person's signature is different and its dynamic characteristics in particular are difficult to imitate. Recently, many signature verification methods have been proposed [1][2][3]; their main goal is to improve verification accuracy by investigating effective classification features and algorithms.
An important challenge is that most existing approaches require sufficient signature samples to be effective. First, the performance evaluation of these systems requires a large number of test samples [4]. More importantly, the performance of most classifiers (such as neural networks and hidden Markov models) generally depends on the amount of training data, and training a stable and efficient classifier requires a sufficient number of samples [5]. Although some commercial signature databases have been established, sharing and distributing these data is very difficult because of legal issues [6]. Besides, the number of signature databases that can be shared is fairly limited. A direct solution is to collect signature samples oneself. However, database collection is time consuming and expensive, since users are unwilling to submit their private data because of potential security problems. In addition, the tedious repeated submission process degrades the quality of the signature samples.
To overcome the database collection problem, several synthetic signature generation methods have been presented. With these methods, synthetic signatures are created automatically by synthesizing real signatures [4][5][6][7]. According to the sample synthesis strategy, existing methods can be divided into three categories: duplicated samples, combination of different samples, and synthetic individuals. The duplicate-based method [7][8][9][10] generates a new sample through different transformations, and it is suitable for producing different signatures corresponding to the same person. The combination-based method [11] creates a new sample by combining a person's handwritten letters or units from different samples. In the synthetic-based method [6], some kind of a priori knowledge (such as stroke placement distribution or length features) is used to create a new sample, and this method can create signatures of new individuals. In summary, existing methods focus on novel sample deformation techniques that make the new sample simulate the characteristics of the real data. The accuracy of signature verification is then improved by adding the new training data.
In fact, adding synthetic training samples does not always improve the performance of the classifier. On the one hand, adding synthetic samples can increase the diversity of the training set, which helps to optimize the decision parameters and thus improve the identification rate. On the other hand, unnatural deformation produces a large bias away from the real samples, which can degrade the accuracy. Besides, the feature distribution of the whole sample set also affects the classifier's performance. A sparse or uneven distribution makes the classifier unstable [12]. Therefore, the diversity and effectiveness of the overall training sample set have an important impact on the performance of signature verification.
Accordingly, the artificial immune system (AIS) is introduced as a means for creating synthetic signatures. Our goal is to use the clonal selection algorithm (CSA) to expand the signature set from an initial set composed of a small amount of data. The resulting sample set can be used as training data to improve the verification performance. Our method expands the population of signature data in each generation rather than creating synthetic samples one at a time. Throughout the process, we focus on the quality of the population more than on that of the individuals. By exploiting the self-recognition capabilities and the diversity manipulation mechanisms of AIS, the method aims to preserve the diversity and effectiveness of the overall training sample set. To investigate whether AIS improves the quality of the synthetic sample set, the duplicate-based method is selected as our synthesis strategy, and a new clonal selection algorithm is proposed by introducing a novel detector generation algorithm. The detector generation algorithm uses the segmented r-continuous bits matching rule and the P-receptor editing strategy to create the initial population of the CSA. The experiments show the effectiveness of the method.

Algorithm Overview
This paper uses the clonal selection algorithm to expand the signature data set. While ensuring the effectiveness of each new sample, the method focuses on the diversity and effectiveness of the overall set. The standard clonal selection algorithm generates the initial population randomly [13]. Without guidance from the input samples, the structures of the antibodies in the initial population and of the antigens are quite dissimilar, which hurts the convergence efficiency of the algorithm. Therefore, a detector generation algorithm is introduced into the clonal selection process, yielding a novel clonal selection algorithm.
The process is shown in Figure 1. Our process flow can be divided into two steps: detector generation and clonal selection. In detector generation, an iterative enumeration algorithm is first used to generate new samples that match the input sample under the segmented r-continuous bits matching rule, which means that the new sample has several successive stroke sections that are similar to some sections of the input sample. Then the P-receptor editing strategy is used to create new samples that have P different strokes from the input sample. Finally, the two sets are combined as the detector set; the purpose is to obtain, in advance, some individuals that differ from the input sample. As a result, the clonal selection algorithm searches the samples in a wider range and avoids losing the opportunity to generate other useful individuals. In the clonal selection process, the initial population is iteratively updated by the cloning-mutation-selection operation. More effective samples are obtained by hypermutation, and the diversity of the overall set is maintained.
Before introducing the algorithm, we first introduce the individual representation used in it. In this paper, every signature is defined as an immune cell. The input samples are defined as antigens, and the synthetic samples are treated as antibodies. A signature sample TS consists of a sequence of strokes, and every stroke $s_i$ is a sequence of sampling points, namely, the points sampled between a pair of pen-down and pen-up operations. The horizontal coordinate $x$, the vertical coordinate $y$, the pressure $p$, and the time $t$ are recorded for each sampling point sp. Because different persons have different writing habits, it is difficult to use a fixed-length sequence to define all the signatures. Even when the same person writes the same signature at different times, the number of strokes is not fixed.
So every signature TS is defined as a variable-length stroke sequence as follows:
$$TS = \{s_1, s_2, \ldots, s_n\},$$
where $n$ is the number of strokes.
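The representation above can be sketched as a small data structure. This is only an illustration under stated assumptions: the names `SamplePoint`, `Signature`, and `split_into_strokes` are hypothetical, and the input is assumed to be a time-ordered list of points, each flagged with whether the pen was down.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SamplePoint:
    """One sampling point sp: coordinates, pressure, and time stamp."""
    x: float
    y: float
    pressure: float
    t: float


Stroke = List[SamplePoint]   # points between a pen-down and the next pen-up
Signature = List[Stroke]     # variable-length stroke sequence TS


def split_into_strokes(events: List[Tuple[SamplePoint, bool]]) -> Signature:
    """Group time-ordered (point, pen_down) pairs into strokes.

    A stroke is a maximal run of consecutive pen-down points; pen-up
    points only act as separators (an assumption of this sketch).
    """
    signature: Signature = []
    current: Stroke = []
    for point, pen_down in events:
        if pen_down:
            current.append(point)
        elif current:
            signature.append(current)
            current = []
    if current:  # the recording may end while the pen is still down
        signature.append(current)
    return signature
```

Because `Signature` is a plain list of lists, two signatures by the same writer can hold different numbers of strokes, which is the property the variable-length definition is meant to capture.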

Detector Generation Algorithm
The detector generation algorithm is the key component of our method. Detector generation [14] is widely used in the negative selection algorithm to generate candidate data. Here, it is used to create the initial population. The initial population of the standard CSA is generally generated randomly, so the structure of a random individual differs greatly from that of the antigen. In theory, many iterations are then required to obtain a population that can identify the antigens. The efficiency problem caused by the random initial population can be alleviated by creating an antigen-guided initial population. For example, the mature detectors or existing memory cells created by self-tolerance can be used as the initial population [15]. However, the diversity of an initial population created in this way is not as wide as that of a random initial population. As a result, the algorithm may fall into a local optimum and lose the opportunity to learn other effective structures. To solve this, a new detector generation algorithm is presented that combines the segmented r-continuous bits matching rule and the P-receptor editing strategy. The detectors generated by the segmented r-continuous bits matching rule capture the global structure of the antigen, and the detectors generated by the P-receptor editing strategy give the initial population a wider search space.
3.1. The Segmented r-Continuous Bits Matching Rule. The r-continuous bits matching rule is used to compute the matching degree of two strings and is widely used in artificial immune systems [16,17]. The value $r$ determines the matching degree. Since the r-continuous bits matching rule only measures a partial sequence, it is difficult to make the generated detector maintain the global structure of the antigen by applying the rule directly. Therefore, we introduce the idea of sequence segmentation. First, given two signatures, the two stroke sequences are divided into $k$ ($k > 1$) segments simultaneously. Then a single stroke is defined as a matching bit, and if every pair of corresponding segments between the signatures is r-continuous bits matched, the two signatures are segmented r-continuous bits matched. Compared with the plain r-continuous bits matching rule, the segmented r-continuous bits matching rule improves the global matching degree of two signatures.
Figure 2 illustrates the segmented r-continuous bits matching rule on strings. The two strings in Figure 2(a) are 3-continuous bits matched but not segmented 3-continuous bits matched, because the second string has only one segment that is 3-continuous bits matched. The strings in Figure 2(b) are segmented 3-continuous bits matched, since the two strings have 2 segments that are 3-continuous bits matched. The vertical line is the segment line, and the bits in the rectangles are the same. From the simple example of Figure 2, we can see that a generated detector that is only r-continuous bits matched cannot control the last 4 bits, because the front 4 bits already satisfy the requirement. Under the segmented r-continuous bits rule, the controllability of the global structure is improved.
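The difference between the two rules can be made concrete with a minimal sketch. It assumes equal-length sequences and segment boundaries placed at near-equal intervals; the function names and the example strings are hypothetical and are not the strings of Figure 2.

```python
from typing import Sequence


def r_continuous_match(a: Sequence, b: Sequence, r: int) -> bool:
    """True if a and b agree on at least r consecutive positions."""
    if len(a) != len(b) or r < 1:
        raise ValueError("sequences must have equal length and r >= 1")
    run = 0
    for u, v in zip(a, b):
        run = run + 1 if u == v else 0  # length of the current agreeing run
        if run >= r:
            return True
    return False


def segmented_r_continuous_match(a: Sequence, b: Sequence, r: int, k: int) -> bool:
    """True if every one of the k corresponding segments is r-continuous matched.

    Segment boundaries are placed at i * n // k, which is an assumption of
    this sketch rather than the segmentation used in the paper.
    """
    if len(a) != len(b) or k < 1:
        raise ValueError("sequences must have equal length and k >= 1")
    n = len(a)
    bounds = [i * n // k for i in range(k + 1)]
    return all(
        r_continuous_match(a[lo:hi], b[lo:hi], r)
        for lo, hi in zip(bounds, bounds[1:])
    )


# The first pair agrees only in its front half; the second pair agrees on a
# run of three bits inside each half.
front_only = ("10110100", "10111011")
both_halves = ("10110101", "00111101")
print(r_continuous_match(*front_only, r=3))                 # True
print(segmented_r_continuous_match(*front_only, r=3, k=2))  # False
print(segmented_r_continuous_match(*both_halves, r=3, k=2)) # True
```

The first pair shows why the plain rule gives no control over the tail of the detector: once the front bits match, the remaining bits are unconstrained.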
There are two important parameters in the segmented r-continuous bits matching rule: the bits length $r$ and the segment number $k$. The bits length $r$ reflects the local matching degree, and the segment number $k$ reflects the global matching degree. The stroke number varies considerably between signatures; even signatures given by the same person often have different stroke numbers in different acquisition sessions. Therefore, if a fixed segment number and bits length are used to control the matching degree between the detector and the antigen, it is difficult to capture the important structural information while adapting to different writing habits and acquisition sessions. To solve this, the two parameters are determined from the stroke number $n$ of the signature.
Given a signature sample TS, a mutated sample is denoted by $TS^{I}$, where $I$ is a subset of $\{1, 2, \ldots, n\}$: if the $i$th stroke of the sample TS is mutated in $TS^{I}$, then $i$ is an element of $I$. The number of possible sets $I$ is exponentially large, namely $2^{n}$. However, not every $I$ makes TS and $TS^{I}$ segmented r-continuous bits matched. It is infeasible to check every set $I$ in turn to find the desirable cases; besides, it is unnecessary to search all the cases. Our goal is to generate samples that are segmented r-continuous bits matched with the antigen while introducing more mutation to improve the diversity of the initial population. Accordingly, an iterative enumeration method is proposed that progressively adds valid bits to the set $I$ while maintaining the segmented r-continuous bits matching rule, until the set $I$ cannot be expanded. The whole process is shown in Algorithm 1.
A constraint-based enumeration method is proposed to update the set $I$. Given any set $I$ (with $m$ the maximum in $I$), the method uses a segment flag $f$ to indicate the segment number of the front $m$ strokes of sample $TS^{I}$; that is, the front $m$ bits of samples TS and $TS^{I}$ are segmented r-continuous bits matched, and the largest segment number is $f$. According to the current set $I$, four bit positions are defined: $a$, $b$, $c$, and $d$. Here, $a$ is the next possible mutated bit position; $b$ is the smallest value that makes the front $b$ bits of samples TS and $TS^{I}$ segmented r-continuous bits matched (segment number is $f + 1$); $c$ is the largest value that makes the last $(n - c)$ bits of samples TS and $TS^{I}$ segmented r-continuous bits matched (segment number is $(k - f)$); $d$ is the largest value that makes the last $(n - d)$ bits of samples TS and $TS^{I}$ segmented r-continuous bits matched (segment number is $(k - f - 1)$). The four positions are computed from the maximum $m$ and the segment flag $f$ by (3). Then the bits from $a$ to $d$ are selected successively to update the set $I$; the details of the process are given in Algorithm 2. By Algorithm 2, the possible sets $I$ are generated such that TS and $TS^{I}$ are segmented r-continuous bits matched. The strokes at the positions in the set $I$ are then mutated by adding random noise to create a mutated sample $TS^{I}$. The generated samples are the first part of the detector set.

Algorithm 1: The segmented r-continuous bits matched detector generation.
(1) Input: number of matched continuous bits $r$; number of segments $k$
(2) Output: the result set list Ψ, which is a set of integer sets.
(3) Set the initial integer set $I = \emptyset$;
(4) Add the set $I$ into the set Ψ;
(5) Set the segment flag $f$ of set $I$ as 0;
(6) repeat
(7) for all $I \in \Psi$ do
(8) Update the set $I$ and get the candidate set $\Psi'$;
(9) Remove $I$ from Ψ;
(10) if Update succeed then
(11) Add all the elements of the set $\Psi'$ into Ψ;
(12) end if
(13) end for
(14) until Ψ is not changed

3.2. The P-Receptor Editing Strategy. Figure 3 shows the hierarchical structure of receptor editing. This type of mutation can edit the stroke in any position without any r-continuous bits matching requirement. The P-receptor editors can be created from the $(P - 1)$-receptor editors. This property improves the generation efficiency of the P-receptor editors. The details of this process are given in Algorithm 3.
In Algorithm 3, the parameter $P$ reflects the degree of discrimination between the antigen TS and the new sample $TS^{I}$. Because the P-receptor editors can be created from the $(P - 1)$-receptor editors, the process is the same as the mutation in the clonal selection process. As a result, the parameter $P$ is set to a small value for generating the initial population (in this paper, $P = 2$). The P-receptor editors are the second part of the detector set, which is used as the initial population Pop 0 of the clonal selection algorithm.
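The level-by-level construction of Algorithm 3 can be sketched as follows. This is an illustrative reading, not the authors' implementation: the function name `p_receptor_editors` is hypothetical, stroke indices are 0-based here, and the stroke mutation is passed in as a callable so that any distortion can be plugged in.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple, TypeVar

S = TypeVar("S")  # stroke type; any representation works for this sketch


def p_receptor_editors(
    strokes: Sequence[S],
    p: int,
    mutate: Callable[[S], S],
) -> List[Tuple[FrozenSet[int], List[S]]]:
    """Build every sample that differs from `strokes` in exactly p positions.

    Each level-j editor is derived from a level-(j-1) editor by mutating one
    stroke whose index is larger than all indices already edited, so every
    index set is produced exactly once.
    """
    n = len(strokes)
    level: List[Tuple[FrozenSet[int], List[S]]] = [(frozenset(), list(strokes))]
    for _ in range(p):
        next_level: List[Tuple[FrozenSet[int], List[S]]] = []
        for edited, sample in level:
            start = max(edited) + 1 if edited else 0
            for i in range(start, n):
                child = list(sample)          # copy, then edit one more stroke
                child[i] = mutate(child[i])
                next_level.append((edited | {i}, child))
        level = next_level                    # parents are dropped, as in step (12)
    return level


# Toy usage: strokes are letters and "mutation" is upper-casing.
for edited, sample in p_receptor_editors(["a", "b", "c", "d"], 2, str.upper):
    print(sorted(edited), sample)
```

With four strokes and $P = 2$ the sketch yields six editors, one per pair of stroke positions, which illustrates why a small $P$ keeps the initial population manageable.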

Clonal Selection Algorithm
After the initial population is obtained, the clonal selection algorithm is used to remove the invalid samples and generate more valid samples. In this section, we first define the basic operators of our CSA, namely, the affinity operator, the mutation operator, and the density operator. Then the process and the stop criterion of our CSA are described.
Algorithm 2: Updating the integer set.
(1) Input: the current set $I$;
(2) Output: the candidate set $\Psi'$;
(3) Compute the four positions $a$, $b$, $c$, and $d$ by (3);
(4) if $b > c$ then
(5) for $i = a \to c$ and $b \to d$ do
(6) Create the new candidate integer set $I' = I \cup \{i\}$ and push it into the set $\Psi'$;
(7) end for
(8) else
(9) for $i = a \to d$ do
(10) Create the new candidate integer set $I' = I \cup \{i\}$ and push it into the set $\Psi'$;
(11) end for
(12) end if

Algorithm 3: The P-receptor detector generation.
(1) Input: the initial sample $TS = s_1, s_2, \ldots, s_n$; the number $P$;
(2) Output: the result sample set Φ;
(3) Add the sample $TS^{\emptyset}$ into the set Φ;
(4) for $j = 1 \to P$ do
(5) for all $TS^{I} \in \Phi$ do
(6) $m$ = the maximum in the set $I$;
(7) for $i = m + 1 \to n$ do
(8) Create the new set $I' = I \cup \{i\}$;
(9) Mutate the $i$th stroke of $TS^{I}$;
(10) Create the new sample $TS^{I'}$ and add it into Φ;
(11) end for
(12) Remove the sample $TS^{I}$ from Φ;
(13) end for
(14) end for

Affinity Operator. The affinity measures the degree of matching between the antigen and the antibody. Because different signatures often have different numbers of strokes, it is difficult to compute the matching cost directly. So the stroke segmentation algorithm based on dynamic programming (DP) [18] is first used to make the two signatures have the same number of strokes and to establish a bijective mapping between the two stroke sequences. During the DP-based segmentation process, a set of temporally ordered candidate segment points is first extracted according to the curvature feature. The segmentation of the two signatures is then treated as an optimization problem, which maximizes the matching degree between the two signatures by selecting the segment points from the ordered candidate segment points. For a selected segment point, the optimal segmentation contains the optimal segmentation of the input stroke(s) up to this point. Accordingly, dynamic programming is used to search for the segment points recursively through a recurrence formula in order to reach the optimum. Figure 4 shows the DP-based segmentation result.
Then, the Mahalanobis distance is used to compare the features of the corresponding strokes of the two segmented samples. The affinity between the two signatures is computed from the distances $\mathrm{distance}_i$, where $\mathrm{distance}_i$ is the Mahalanobis distance between the $i$th segments of the samples $TS_1$ and $TS_2$ and $i$ runs over the segments produced by the DP-based segmentation. The feature vector used to compute the Mahalanobis distance includes 2 geometric features and 4 dynamic features, as shown in Table 1. Given a set of antigens Ag, the affinity of the antibody Ab is the maximum value of the affinity between the antibody and each of the antigens.
Mutation Operator. The mutation operator first selects some strokes randomly from the signature to mutate the individual. Then each selected stroke is distorted by adding random noise to the horizontal coordinate $x$, the vertical coordinate $y$, and the pressure $p$ of its sampling points. The three noise terms are uniform random numbers drawn from $-\delta$ to $\delta$ (in this paper, $\delta = 0.05$), and they are combined with $\max_x$, $\min_x$, $\max_y$, $\min_y$, $\max_p$, and $\min_p$, the maximum and minimum values of the x-axis coordinate, the y-axis coordinate, and the pressure of the antibody Ab, respectively. Figure 5 shows the new sample that is created by mutating the sample in Figure 4(a).
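A minimal sketch of the mutation operator follows. It rests on an assumption about the form of the distortion: each coordinate is shifted by a uniform random number in $[-\delta, \delta]$ multiplied by the value range (maximum minus minimum) of that coordinate over the whole antibody. The function name `mutate_signature` and the parameter `n_selected` (how many strokes to distort) are hypothetical.

```python
import random
from typing import List, Optional, Tuple

Point = Tuple[float, float, float, float]  # (x, y, pressure, t)
Stroke = List[Point]


def mutate_signature(
    signature: List[Stroke],
    n_selected: int,
    delta: float = 0.05,
    rng: Optional[random.Random] = None,
) -> List[Stroke]:
    """Return a copy of `signature` with `n_selected` random strokes distorted.

    Assumed noise model: value += U(-delta, delta) * (max - min), applied to
    x, y, and pressure; time stamps are left untouched.
    """
    rng = rng or random.Random()
    points = [pt for stroke in signature for pt in stroke]
    # Value range of x, y, and pressure over the whole antibody.
    spans = [
        max(pt[c] for pt in points) - min(pt[c] for pt in points)
        for c in range(3)
    ]
    chosen = set(rng.sample(range(len(signature)), n_selected))
    mutated: List[Stroke] = []
    for idx, stroke in enumerate(signature):
        if idx in chosen:
            stroke = [
                (
                    x + rng.uniform(-delta, delta) * spans[0],
                    y + rng.uniform(-delta, delta) * spans[1],
                    pr + rng.uniform(-delta, delta) * spans[2],
                    t,
                )
                for x, y, pr, t in stroke
            ]
        mutated.append(list(stroke))
    return mutated
```

Scaling the noise by the value range keeps the distortion proportional to the size of the signature, so the same $\delta$ can be used for signatures captured at different resolutions.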
Density Operator. Density manipulation is an important characteristic of CSA for maintaining the diversity of the sample set. The density of the antibody Ab is computed with respect to the current sample set and its size. With the above three operators defined, the clonal selection algorithm is used to expand the initial population Pop 0 while improving the diversity and distribution of the population. The clonal selection algorithm is described in Algorithm 4.
The algorithm is terminated when the sample set meets the following requirements: (1) the size of the population reaches its preset value; (2) the dispersion is below a threshold; (3) the change of the dispersion is below a threshold. The dispersion measures the sparsity of the sample distribution.

Evaluation
We have implemented the proposed algorithm and used several signatures to show the efficiency of our method.

Data Collection.
We collected signature data for our experiment. Thirteen students were invited to give their signatures. Every student was first asked to write his or her signature 50 times. Then, five other students were each asked to forge that signature 10 times. So there are 50 genuine samples and 50 forged samples for every person. During the collection process, we recorded the horizontal and vertical coordinates, the pressure, and the time stamp of the sample points. The pen-up and pen-down events were also captured, and the sample points between a pair of pen-down and pen-up events constitute a stroke. The genuine samples are then divided into 10 groups of five samples each. Besides, we also include a public benchmark provided by the First International Signature Verification Competition (SVC2004) [19]. This corpus consists of 40 sets of signatures. Each set contains 20 genuine signatures from one contributor and 20 skilled forgeries from five other contributors, and the 20 genuine signatures are divided into 4 groups.
Fierrez's method, which is based on hidden Markov models (HMM), is used as our verification system [20]. The similarity of the signature is computed by using 10 left-to-right HMM states and mixtures of 8 Gaussians per state.
We compute the equal error rate (EER) for performance comparison. Figure 6 shows the relation between the size of the training set and the EER for person 1. As the size of the training set (x-axis) increases up to three groups, the EER (y-axis) decreases significantly. Our experiment focuses on investigating whether our algorithm can improve the performance when the training set is small. So each time only one group (five samples) is used as the input training set. The other signatures are used as the test samples to evaluate the performance.
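For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate (FAR) equals the false rejection rate (FRR). The sketch below shows one common way to estimate it from verification scores; it assumes that a higher score means "more likely genuine" and that a signature is accepted when its score reaches the threshold. The function name and the example scores are hypothetical and are not taken from the experiments.

```python
from typing import Sequence


def equal_error_rate(genuine: Sequence[float], forgery: Sequence[float]) -> float:
    """Estimate the EER by scanning every observed score as a threshold.

    FRR = fraction of genuine scores below the threshold (wrongly rejected).
    FAR = fraction of forgery scores at or above it (wrongly accepted).
    The EER is taken as the mean of FAR and FRR where they are closest.
    """
    best_gap, best_eer = None, None
    for th in sorted(set(genuine) | set(forgery)):
        frr = sum(s < th for s in genuine) / len(genuine)
        far = sum(s >= th for s in forgery) / len(forgery)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


genuine_scores = [0.9, 0.8, 0.7, 0.4]
forgery_scores = [0.6, 0.5, 0.3, 0.2]
print(equal_error_rate(genuine_scores, forgery_scores))  # 0.25
```

In the toy data one genuine score falls below the best forgery score, so at the threshold 0.6 one of four genuine samples is rejected and one of four forgeries is accepted, giving an EER of 25%.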

Parameter Settings.
Three parameters determine the sizes of the temporary, clone, and new populations, respectively. We have ascertained experimentally that larger values of these three parameters achieve better results. However, to balance computational time against accuracy, we set them to 30, 1000, and 200, respectively. The mutation range is set as 0.05. The termination criterion includes three parameters: the size of the population is set as 200; the dispersion threshold is set as 0.75; the variation threshold of the dispersion is set as 0.01.

Statistical Analysis.
We select 20 sets of signatures from the SVC2004 database for a statistical analysis of the proposed CSA. The input consists of one group randomly selected from each set of signatures, and the proposed CSA is executed 10 times to generate 10 sets of synthetic signatures from the same input. Then the corresponding hidden Markov models are trained on the synthetic signatures and evaluated for signature verification. A summary of the evaluation results is given in Table 2, which shows the average, standard deviation, and maximum values of the EERs. The statistical analysis shows that, although the CSA is a randomized algorithm, the evaluation performance of the generated synthetic signatures is stable. The reason is that our detector generation algorithm creates the initial population under the guidance of the input samples, so many invalid samples are never searched thanks to the high quality of the initial population.

Comparison between Real and Synthetic Training Samples.
We compare the performance obtained by training the corresponding HMMs on different training samples. The proposed CSA is first used to create synthetic signatures in order to expand the initial group. Then the initial group (five genuine signatures) and the expanded group are each used as the training set for the HMM-based verification system. Cross validation is used to compare the performance: each time, one group is used for training the HMM-based recognition system, and the other nine groups and all the forged samples are used for estimating the corresponding EER. The comparison result on our own database is shown in Table 3, which records the average EER for every invited contributor. The bolded values show the better case. The table shows that the performance is improved by our method. Except for person 12, the EERs of the other twelve persons decrease. Among these twelve persons, the EERs of persons 1, 5, 7, and 9 fall by more than 50%. Figure 7 shows the relation between the false acceptance rate (FAR) and the false rejection rate (FRR) for persons 1, 7, and 9. The FAR-FRR curves show that the performance is improved significantly by using the expanded sample set as the training set.
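The evaluation protocol described above, in which each group of five genuine signatures is used for training in turn and everything else for testing, can be sketched as a simple split generator. The function name `one_group_training_splits` is hypothetical, and samples are represented by plain integers for illustration.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def one_group_training_splits(
    genuine: Sequence[T],
    forgeries: Sequence[T],
    group_size: int = 5,
) -> Iterator[Tuple[List[T], List[T], List[T]]]:
    """Yield (training group, remaining genuine samples, all forgeries).

    Each fold trains on a single group; the other groups and every forged
    sample are held out to estimate the EER of that fold.
    """
    groups = [
        list(genuine[i:i + group_size])
        for i in range(0, len(genuine), group_size)
    ]
    for g, train in enumerate(groups):
        test_genuine = [s for h, grp in enumerate(groups) if h != g for s in grp]
        yield train, test_genuine, list(forgeries)


# 50 genuine and 50 forged samples per person, as in our own collection.
folds = list(one_group_training_splits(range(50), range(100, 150)))
print(len(folds), len(folds[0][0]), len(folds[0][1]), len(folds[0][2]))  # 10 5 45 50
```

The average EER reported for a person would then be the mean of the per-fold EERs, with the training group either used directly or first expanded by the CSA.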

Comparison between Galbally's Method and Ours.
We use the SVC2004 database and our own collection to compare the quality of our generated synthetic signature set with that of Galbally's method [7]. Cross validation is used to compute the average EERs for each set of signatures, following the process described in Section 5.4. Galbally's method and ours each use one group as the input and create the corresponding synthetic signatures each time. Then the average EER is computed to compare the performance. Table 4 shows the comparison on the SVC2004 database, while Table 5 shows the comparison on our own collection. The bolded values show the better results. Of the 40 sets of signatures in Table 4, 27 sets show lower average EERs for our method than for Galbally's method, while 9 sets show that Galbally's method performs better than ours. Of the 13 sets of signatures in Table 5, 8 sets show lower average EERs for our method than for Galbally's method, while 4 sets show that Galbally's method performs better than ours. A summary of the evaluation results is given in Table 6, which shows the average, standard deviation, and maximum values of all the average EERs on SVC2004 and on our own database. The results show that introducing our CSA method to optimize the whole signature set improves the verification performance (12.7% and 20.8% improvement on the SVC2004 database and our own collection, respectively). Besides, the verification performance of our method is more stable across different sets of signatures (the standard deviation of ours is lower than that of Galbally's method).

Comparison between Our CSA and Other CSAs.
We then compare the performance of different CSAs for synthetic sample generation. We compare three algorithms: the standard CSA (CSA1), the CSA that uses the antigen as the initial population (CSA2), and our CSA (CSA3). The main difference lies in the generation of the initial population. CSA1 generates the initial population randomly. CSA2 uses the antigen group as the initial population. CSA3 uses the segmented r-continuous bits matched detectors and the receptor editors as the initial population. We randomly select one group as the antigen group and then use the three algorithms to expand the sample set for each person. We record the dispersion of each iteration and compare the running time of the three algorithms, as shown in Table 7. The bolded values show the fastest algorithm. The input consists of one group that is selected randomly from each set of our own collection. Each CSA is executed 10 times, and the average running time is then computed. Table 7 shows that our algorithm is significantly faster than the other two algorithms in most cases. The main reason is that our initial population is created by the proposed detector generation algorithm. Figure 8 shows the corresponding convergence analysis results. The dispersion of the initial population in CSA1 is rather high, while the size of the initial population in CSA2 is very small, so both need a large number of iterations to converge. In contrast, our algorithm uses the segmented r-continuous bits matching rule and the P-receptor editing strategy to create the initial population, so both the size and the dispersion are optimized in advance. The iteration number of CSA3 is 4 in this case, which is significantly smaller than that of CSA1 and CSA2.
We also use the expanded sample sets created by the three algorithms to train HMMs for signature verification. The average EERs are shown in Table 8. The table shows that the EERs of our algorithm are lower than those of CSA1 and CSA2. In some cases, the EERs of CSA1 and CSA2 are higher than those of verification using the initial real samples as the training set. The experiment shows that generating a better initial population is important for improving both the efficiency and the effectiveness of the method.

Conclusion
In this paper, we present a novel clonal selection algorithm for synthetic sample generation. Our method focuses on the overall set rather than on creating samples one at a time, in order to improve the signature verification performance by expanding the initial signature set. The proposed clonal selection algorithm keeps the diversity of the population while keeping the feature distribution nonsparse. To improve the efficiency and effectiveness of the standard CSA, a detector generation algorithm is introduced that combines the segmented r-continuous bits matching rule and the P-receptor editing strategy to create the initial population for the clonal selection process. The experiments show the efficiency and effectiveness of the method. By using the synthetic samples as training samples, the performance of the signature verification system is improved. Future work is to extend our method to other types of synthetic signature generation methods, such as the combination-based or the synthetic-individual method.

Figure 3: Hierarchical structure of receptor editing (RE) on sample TS.
Figure 4 illustrates the DP-based segmentation. The sample in Figure 4(a) has 17 strokes, and the sample in Figure 4(b) has 10 strokes. The x-axis coordinate curves of the two samples are shown in Figure 4(c), where the horizontal and vertical axes are the time stamp and the x-axis coordinate of the sample point, respectively. The endpoints are marked by circles. After the DP-based segmentation, each sample is divided into 17 strokes. The vertical lines in Figure 4(d) are the segmentation lines.

Figure 4: The segmentation result. (a) The first sample; (b) the second sample and its x-axis coordinate curves; (c) two samples' x-axis coordinate curves; (d) the segmentation results of the two samples.

Figure 6: The relation between the size of training set and EER.

Figure 8: The convergence analysis. (a) The dispersion; (b) the population size.

Table 1 :
The stroke feature. 1 ; (2) the dispersion is below a threshold  2 ; (3) the change of the dispersion is below a threshold  3 .The dispersion  measures the sparsity of the sample distribution, which is computed by

Table 2: EER statistics (%) over 10 runs of our CSA on the same input (20 subjects).

Table 3: Average EERs (%) of different training samples.

Table 5: Average EERs (%) comparison between Galbally's method and ours on our own collection.

Table 6: EER statistics (%) for the comparison between Galbally's method and ours.

Table 7: Comparison of the average running time (s).

Table 8: Average EERs (%) of different CSA algorithms.