An Adaptive Test Sheet Generation Mechanism Using Genetic Algorithm

For test-sheet composition systems, it is important to adaptively compose test sheets with diverse conceptual scopes and with diverse discrimination and difficulty degrees to meet various assessment requirements during real learning situations. Computation time and item exposure rate also influence performance and item bank security. Therefore, this study proposes an Adaptive Test Sheet Generation (ATSG) mechanism, in which a Candidate Item Selection Strategy adaptively determines candidate test items and conceptual granularities according to the desired conceptual scopes, and an Aggregate Objective Function applies a Genetic Algorithm (GA) to find an approximate solution of the mixed integer programming problem for test-sheet composition. Experimental results show that the ATSG mechanism generates test sheets that meet the various assessment requirements more efficiently and precisely than existing approaches. Furthermore, according to the experimental findings, the Fractal Time Series approach can be applied to analyze the self-similarity characteristics of the GA's fitness scores in order to improve the quality of test-sheet composition in the near future.


Introduction
With the rapid development of information and assessment technology, computerized testing is widely used to assess, predict, and diagnose learners' learning statuses because it can effectively analyze examinees' abilities and learning barriers. The test quality offered by a computerized testing system depends not only on the quality of the test items but also on whether the test sheets satisfy the various requirements of the assessment parameters, such as the difficulty degree, the discrimination degree, the associated concepts, and the expected testing time. Thus, how to efficiently assist teachers in composing an appropriate test sheet that meets diverse assessment requirements has become an important research issue.
Hwang [1] applied the dynamic programming technique to this issue, but the solution is inefficient for a large item bank because of the exponential growth of time and space complexity. Su and Wang [2] developed an assistance system that provides teachers with statistical information for manually composing the desired test sheets, but manually selecting appropriate test items from a large item bank is still inefficient and makes it difficult to ensure the quality of the test sheets. The problem of automatic test item allocation has therefore become pressing; it can be regarded as a combinatorial optimization problem, which has been proven to be NP-hard [3]. Hwang et al. [4] formulated this problem as a mixed integer programming model and proposed approximate solutions using the Genetic Algorithm (GA) approach [5]. Their experimental results show that the approach can efficiently and automatically compose a good enough test sheet for a large-scale test.
However, the aforementioned studies mainly aim to automatically generate a test sheet with the highest discrimination degree while meeting constraints on expected testing time and concept relevance. These mechanisms are suitable for large-scale tests only, and by nature they can hardly satisfy the various purposes of assessment in real learning situations. In order to efficiently understand students' learning problems, it is important to compose test sheets with diverse conceptual scopes (C), discrimination degrees (D), and difficulty degrees (P), such as displacement and summative assessments with normally distributed C and P, and formative and diagnostic assessments with various or specific C and P [6-9]. Moreover, the computation time of the test-sheet composition process and the Item Exposure Rate are also of concern. A long computation time decreases the performance of the test-sheet composition system, and a high item exposure rate decreases the quality of the test items and the Item Bank Security [10, 11]. Accordingly, to consider not only the various assessment requirements but also the computation time and item exposure rate, this study defines a new problem of automatic test item allocation, called the Adaptive Test Sheet Generation problem. To solve it, this research proposes an Adaptive Test Sheet Generation (ATSG) mechanism, consisting of a Candidate Item Selection Strategy (CISS) and an Aggregate Objective Function (AOF). The CISS process adaptively determines the candidate test item set and the conceptual granularities according to the desired concept scope, and the AOF applies a GA to solve the mixed integer programming problem. The evaluation results show that the proposed approach can generate test sheets that meet the various assessment requirements.

Related Work
The test sheet generation problem was originally identified for large-scale tests, where test items that cover all required concepts and have the highest degree of discrimination are selected from a test item bank. Hwang [1] proposed an algorithm based on the dynamic programming technique to find optimal test sheets, but its exponential time complexity causes efficiency problems for a large number of candidate test items. Therefore, the researchers formulated this problem as a mixed integer programming model and applied a genetic algorithm [4] to find an approximate solution. In this paper, assume that a set of test items, which are related to $m$ concepts, should be selected from $n$ items in the item bank. Each test item $Q_i$ is defined as $Q_i = (t_i, d_i, r_{i1}, r_{i2}, \ldots, r_{im})$, $1 \le i \le n$, where $Q_i$ is a test item in the item bank IB and has a set of parameters including the expected time $t_i$ needed for answering, the degree of discrimination $d_i$, and the degree of association $r_{ij}$ between $Q_i$ and a concept $C_j$.
The assessment requirement of a Test Sheet TS includes the lower bound $l$ and upper bound $u$ of the total expected answering time, and the lower bound $h_j$ of the total relevance of each concept $C_j$. To formulate the problem, a decision variable $x_i$ is defined as a binary indicator, that is, $x_i = 1$ if the test item $Q_i$ is selected and $x_i = 0$ otherwise.
The goal of this problem is to maximize the discrimination degree of the selected items, subject to $l \le \sum_{i=1}^{n} t_i x_i \le u$ and $\sum_{i=1}^{n} r_{ij} x_i \ge h_j$ for $1 \le j \le m$. A Genetic Algorithm (GA) approach [5] is used to solve this problem, where a chromosome is represented as an $n$-bit binary string $x_1, x_2, \ldots, x_n$ and the fitness rank is the sum of the selected items' discrimination degrees minus the penalty scores. The penalty scores measure how far the expected time and concept relevance constraints are violated. The genetic algorithm iteratively generates a new generation of chromosomes through the Crossover and Mutation processes, which act as random functions, and finds the best chromosomes according to their fitness ranks. In Crossover, the chromosomes of the next iteration are generated by combining halves of two chromosomes that are randomly selected from the current iteration; a chromosome with a higher fitness rank is more likely to be selected. Mutation is the other operation that changes a chromosome: an arbitrary bit of a chromosome is randomly changed. This kind of evolutionary algorithm iteratively approaches the optimal solution and uses random operations, such as Crossover and Mutation, to avoid being trapped in local optima. According to the evaluation, the GA-based test sheet generation approach can provide good solutions among more than ten thousand test items within an acceptable response time. Furthermore, the greedy algorithm approach [12], the tabu search algorithm [13], and the discrete particle swarm optimization algorithm [14] were subsequently applied to enhance the computation efficiency of test sheet generation based on the aforementioned problem formulation.
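The penalty-based fitness rank described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the function name `baseline_fitness`, the parameter `penalty_weight`, and the linear form of the penalties are hypothetical choices made here, since the exact penalty scores of [4] are not given in this paper.

```python
from typing import Sequence


def baseline_fitness(x: Sequence[int],
                     t: Sequence[float],
                     d: Sequence[float],
                     r: Sequence[Sequence[float]],
                     l: float, u: float,
                     h: Sequence[float],
                     penalty_weight: float = 1.0) -> float:
    """Sum of selected discrimination degrees minus constraint penalties.

    x[i] is 1 if item i is selected; r[i][j] is the relevance of item i
    to concept j. The penalty form and weight are assumptions.
    """
    selected = [i for i, bit in enumerate(x) if bit == 1]
    total_d = sum(d[i] for i in selected)
    total_t = sum(t[i] for i in selected)
    # Time penalty: distance of the total time from the interval [l, u].
    time_violation = max(0.0, l - total_t) + max(0.0, total_t - u)
    # Concept penalty: shortfall of each concept below its lower bound h_j.
    concept_violation = 0.0
    for j, h_j in enumerate(h):
        relevance_j = sum(r[i][j] for i in selected)
        concept_violation += max(0.0, h_j - relevance_j)
    return total_d - penalty_weight * (time_violation + concept_violation)


# Toy data (made up): three items, two concepts.
t = [10, 20, 30]
d = [0.5, 0.8, 0.3]
r = [[1, 0], [0, 1], [1, 1]]
print(baseline_fitness([1, 1, 0], t, d, r, l=25, u=40, h=[1, 1]))
print(baseline_fitness([1, 0, 0], t, d, r, l=25, u=40, h=[1, 1]))
```

In the toy data, the first chromosome satisfies every constraint, so its fitness is just the sum of its discrimination degrees, while the second violates both the time and the concept constraints and is pushed far below it.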
In addition, the test sheet composition problem was extended to a parallel test sheet composition problem, where multiple test sheets are generated at one time. These sheets must have similar concept relevance, discrimination, and difficulty degrees but contain no common test items. The problem was solved by extending the existing tabu search algorithm [15] and the particle swarm optimization algorithm [16].

Adaptive Test Sheet Generation Problem
In order to efficiently understand students' learning problems, the parameters of a test sheet, including the conceptual scopes (C), discrimination degrees (D), and difficulty degrees (P), should be adaptively composed according to the various assessment purposes, such as displacement and summative assessments with normally distributed C and P, and formative and diagnostic assessments with various or specific C and P. As illustrated in Figure 1, for the formative assessment, like a small-scale test, a test sheet with specific and detailed concepts, that is, a low-level conceptual scope with fine-grained granularity, is required to evaluate the students' specific conceptual capabilities during learning; for the diagnostic assessment, like a specific-scale test, a test sheet with diverse conceptual scopes and granularities is used to diagnose the students' learning problems; for the displacement and summative assessments, like a large-scale test, a test sheet with high-level conceptual granularities is required to evaluate the students' learning performance before and after learning, respectively. However, as seen in Figure 2, the existing approaches do not take the adaptive requirements, that is, C, P, and D, into account and focus only on the highest D. Consequently, their composed test sheets may contain miss-included and error-included concept nodes and cannot meet the adaptive requirements. Moreover, they also need much more computation time to select candidate test items from the item bank because they have no item selection strategy to filter out the irrelevant ones in advance. In addition, the item exposure rate, which denotes the number of times a test item is used in test sheets, also needs to be considered to enhance the Item Bank Security. Therefore, three issues must be solved to satisfy the adaptive requirements of a test sheet: (i) how to generate a test sheet that precisely meets the adaptive requirements in terms of conceptual granularities, discrimination, difficulty, and expected test time; (ii) how to speed up the test sheet generation process to reduce the computation time; (iii) how to take the item exposure rate into account to enhance the Item Bank Security.

An Adaptive Test Sheet Generation Problem Is Defined as Follows
Assume that a set of test items should be selected from $n$ items in the item bank $Q = \{Q_1, Q_2, \ldots, Q_n\}$. All items should be related to the concepts in a concept hierarchy $H$, a tree of concepts as shown in Figure 1. The tree $H$ contains $m$ concepts as the tree nodes $C$, namely, $C = \{C_1, C_2, \ldots, C_m\}$. Based on the definition of $Q_i$ in Section 2, the item exposure times $e_i$ and the degree of difficulty $p_i$ are also taken into account in this study. Thus, each test item $Q_i$ is defined as $Q_i = (t_i, d_i, p_i, e_i, r_{i1}, r_{i2}, \ldots, r_{im})$, $1 \le i \le n$.
An example is provided in Figure 3, where the concept hierarchy $H$ is a tree of concepts $C_j$ and the test item set $Q$ is a set of test items $Q_i$. A weight $r_{ij}$ denotes the relevance degree between the concept $C_j$ and the test item $Q_i$; for example, the relevance of $C_2$ and $Q_1$ is $r_{12} = 0.75$. $\delta(C_j)$ denotes the subtree of the concept $C_j$; for example, $C_1$ and $C_2$ belong to $\delta(C_5)$. In this study, a test sheet TS is defined as $TS = (Q_s, t', p', C', r')$, where TS includes the expected test time $t'$ of the test sheet, the target difficulty degree $p'$, the target concepts $C' \subset C$, and the lower bound $r'$ of the average concept relevance. Based on the definitions of the existing studies mentioned in Section 2, a decision variable $X = (x_1, x_2, \ldots, x_n)$ is defined, where $x_i$ is 1 if the test item $Q_i$ is selected for the test sheet and 0 otherwise.
The goal of the adaptive test sheet generation problem is to generate a test sheet that (i) approaches the target parameters $p'$ and $t'$, (ii) has the highest average discrimination degree, (iii) has a balanced sum of concept relevance weights for each required conceptual granularity and its descendants within the required concept range $C'$, with the average relevance higher than $r'$, and (iv) has the lowest average item exposure rate. This is a multiobjective optimization problem, and the objective functions are defined as follows.
The objective function of the discrimination degree, $D(X)$, is inversely related to the average discrimination degree of the test sheet. The objective function of the expected test time is the distance between the sum of the expected test times of the selected items and the target expected time $t'$. The objective function of the difficulty degree is the distance between the average difficulty degree and the target difficulty degree $p'$. Let $r(X)$ be the average sum of the relevance degrees of each concept in the test sheet. Let the generalized concept relevance $\tilde{r}_{ij}$ denote the maximum concept relevance of a test item $Q_i$ toward the concept $C_j$ or its descendant concepts. The objective function of concept relevance is the distance between the sums of generalized concept relevance degrees and the average sum $r(X)$; it expresses the degree of imbalance of the concept relevance. The objective function of the item exposure rate is the average exposure times of the selected items.
The multiobjective optimization problem is to find a test sheet $X$ that minimizes the values of all objective functions, subject to the lower bound of the average concept relevance, that is, subject to $r(X) \ge r'$.
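One reading that is consistent with the verbal definitions of the objectives is sketched below. The function names $T$, $P$, $R$, and $E$, the tilde and prime notation, the use of the generalized relevance inside $r(X)$, and the choice of "one minus the average" for $D(X)$ (which presumes $d_i \in [0, 1]$) are assumptions made for illustration and may differ from the authors' exact formulas.

```latex
\begin{align*}
D(X) &= 1 - \frac{\sum_{i=1}^{n} d_i x_i}{\sum_{i=1}^{n} x_i}, &
T(X) &= \Bigl|\sum_{i=1}^{n} t_i x_i - t'\Bigr|, \\
P(X) &= \Bigl|\frac{\sum_{i=1}^{n} p_i x_i}{\sum_{i=1}^{n} x_i} - p'\Bigr|, &
\tilde{r}_{ij} &= \max_{C_k \in \delta(C_j)} r_{ik}, \\
r(X) &= \frac{1}{|C'|} \sum_{C_j \in C'} \sum_{i=1}^{n} \tilde{r}_{ij} x_i, &
R(X) &= \sum_{C_j \in C'} \Bigl|\sum_{i=1}^{n} \tilde{r}_{ij} x_i - r(X)\Bigr|, \\
E(X) &= \frac{\sum_{i=1}^{n} e_i x_i}{\sum_{i=1}^{n} x_i}, &
&\min_{X \in \{0,1\}^n} \bigl(D(X), T(X), P(X), R(X), E(X)\bigr)
\ \text{ s.t. } r(X) \ge r'.
\end{align*}
```

Under this reading, $R(X) = 0$ exactly when every target concept receives the same total generalized relevance, which matches the stated purpose of measuring the imbalance of the concept relevance.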

Methodology
To solve the Adaptive Test Sheet Generation Problem, an Adaptive Test Sheet Generation (ATSG) mechanism is proposed. The ATSG mechanism consists of a Candidate Item Selection Strategy (CISS), which adaptively determines the candidate test item set and the conceptual granularities according to the desired concept scope, and an Aggregate Objective Function (AOF), which applies a Genetic Algorithm (GA) to find an approximate solution of the mixed integer programming problem for test-sheet composition. The CISS process is illustrated in Figure 4.

Candidate Item Selection Strategy (CISS)
The CISS process includes two phases: (1) specifying the Concept Granularity and (2) selecting the Candidate Test Item Set.

Phase 1: Specifying Concept Granularity
Concepts associated with a test sheet might be of various granularities in specific educational situations, so the conceptual granularities should be determined before generating a test sheet. Because the required concepts $C_j \in C'$ might be of various granularities, the most specific required concepts should be selected as the target concept set to precisely express the requirements. Let $C''$ denote the target concept set, in which no concept is an ancestor of another; the goal of the first phase is to determine the concepts in $C''$: $C'' = \{C_j \in C' \mid \text{no other concept of } C' \text{ belongs to } \delta(C_j)\}$.

Phase 2: Selecting Candidate Test Item Set
Let $\theta$ be the candidate test item set, whose test items should be related to the target concept set. In Phase 2, test items whose related concepts all lie outside the subtrees of the concepts in $C''$ are filtered out: $\theta = \{Q_i \in Q \mid r_{ik} > 0 \text{ for some } C_k \in \delta(C_j),\ C_j \in C''\}$. In addition, the generalized concept relevance degrees of all test items toward all concepts in $C''$ are calculated.
After this phase, the search space is reduced from $Q$ to $\theta$. An example of the CISS process is provided in Figure 5, where the required concept set is assumed to be $C' = \{C_4, C_5, C_9, C_{10}\}$. In Phase 1, $C_4$, $C_9$, and $C_{10}$ are selected into $C''$ as the most specific required concepts. In Phase 2, only the test items associated with the subtrees of the concepts in $C''$ can be selected into the candidate item set $\theta$, so $Q_3$ and $Q_4$ are filtered out before solving the optimization problem.
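The two CISS phases can be sketched in Python as follows. The function names, the dictionary-based tree representation, and in particular the example tree and relevance weights are hypothetical: Figure 5 is not reproduced here, so the data below are made up only so that the outcome matches the example in the text (the required concepts $C_4$, $C_5$, $C_9$, $C_{10}$ give the targets $C_4$, $C_9$, $C_{10}$, and $Q_3$ and $Q_4$ are filtered out).

```python
from typing import Dict, List, Set


def subtree(children: Dict[str, List[str]], concept: str) -> Set[str]:
    """Return delta(concept): the concept itself plus all its descendants."""
    result = {concept}
    stack = [concept]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child not in result:
                result.add(child)
                stack.append(child)
    return result


def specify_granularity(children: Dict[str, List[str]],
                        required: Set[str]) -> Set[str]:
    """Phase 1: keep the required concepts that have no required descendant."""
    targets = set()
    for concept in required:
        descendants = subtree(children, concept) - {concept}
        if not (descendants & required):
            targets.add(concept)
    return targets


def select_candidates(children: Dict[str, List[str]],
                      relevance: Dict[str, Dict[str, float]],
                      targets: Set[str]) -> Dict[str, Dict[str, float]]:
    """Phase 2: keep items related to a target subtree.

    Returns, for each kept item, its generalized relevance toward each
    target concept (the maximum relevance within that concept's subtree).
    """
    candidates = {}
    for item, weights in relevance.items():
        generalized = {
            c: max(weights.get(k, 0.0) for k in subtree(children, c))
            for c in targets
        }
        if any(value > 0 for value in generalized.values()):
            candidates[item] = generalized
    return candidates


# Hypothetical concept tree and relevance weights (not taken from Figure 5).
children = {"C1": ["C2", "C3"], "C2": ["C4", "C5"],
            "C3": ["C6", "C7", "C8"], "C5": ["C9", "C10"]}
relevance = {"Q1": {"C4": 0.75}, "Q2": {"C9": 0.5, "C10": 0.25},
             "Q3": {"C6": 1.0}, "Q4": {"C7": 0.5, "C8": 0.5},
             "Q5": {"C10": 1.0}}
required = {"C4", "C5", "C9", "C10"}

targets = specify_granularity(children, required)
candidates = select_candidates(children, relevance, targets)
print(sorted(targets))     # ['C10', 'C4', 'C9']
print(sorted(candidates))  # ['Q1', 'Q2', 'Q5']
```

Because the filtering happens once before the search, the chromosome length in the later GA step depends on the number of candidates rather than on the size of the whole item bank.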

Aggregate Objective Function
An aggregate objective function $F(X)$ is defined to solve the multiobjective optimization problem. It includes the discrimination score $S_D$ and the penalty scores of the expected time $P_t$, the difficulty degree $P_p$, the concept relevance balance $P_r$, the concept relevance lower bound $P_{r'}$, and the exposure times $P_e$. All scores and penalty scores are normalized to the range from 0 to 1. The discrimination score $S_D$ is inversely related to the objective function $D(X)$. The penalty score of the expected time is the ratio of the distance between the sum of the expected test times and the target expected time to the target expected time, capped at 1: $P_t = \min\bigl(1, \lvert \sum_{i} t_i x_i - t' \rvert / t'\bigr)$. The penalty score of the difficulty degree is the value of the objective function of the difficulty degree. The penalty score of the concept relevance balance degree is the average distance between the sum of the relevance degrees of a concept and the average sum over the concepts. The penalty score of the concept relevance lower bound is greater than 0 only if the average concept relevance $r(X)$ is lower than the lower bound $r'$, in which case its value is the ratio of the shortfall to the lower bound, capped at 1: $P_{r'} = \min\bigl(1, \max(0, r' - r(X)) / r'\bigr)$. The penalty score of the exposure times is the ratio of the average exposure times to the exposure times parameter $e'$, which denotes the maximum exposure times to be considered, again capped at 1: $P_e = \min\bigl(1, (\sum_{i} e_i x_i / \sum_{i} x_i) / e'\bigr)$. Thus, a single aggregate objective function $F(X)$ integrates the score and all penalty scores into a single objective score, as in (5.1).
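The score and the penalties can be put together as in the following sketch. How the terms are combined into $F(X)$ is not recoverable from the text, so the unweighted form used here (the discrimination score minus the mean of the five penalties) is purely an assumption, as are the function names, the clipping used to normalize the balance penalty, the value returned for an empty test sheet, and the presumption that $d_i$ and $p_i$ lie in $[0, 1]$.

```python
from typing import Sequence


def clamp01(value: float) -> float:
    """Clip a value to the range [0, 1]."""
    return max(0.0, min(1.0, value))


def aggregate_objective(x: Sequence[int],
                        d: Sequence[float], t: Sequence[float],
                        p: Sequence[float], e: Sequence[float],
                        g: Sequence[Sequence[float]],
                        t_target: float, p_target: float,
                        r_lower: float, e_max: float) -> float:
    """Assumed form of F(X): discrimination score minus mean penalty.

    g[i][j] is the generalized relevance of candidate item i toward
    target concept j.
    """
    selected = [i for i, bit in enumerate(x) if bit == 1]
    if not selected:
        return -1.0  # assumption: an empty sheet gets the lowest score
    k = len(selected)
    # Discrimination score: average discrimination (assumes d in [0, 1]).
    s_d = clamp01(sum(d[i] for i in selected) / k)
    # Expected-time penalty: relative distance to the target, capped at 1.
    p_t = clamp01(abs(sum(t[i] for i in selected) - t_target) / t_target)
    # Difficulty penalty: distance of the average difficulty to the target.
    p_p = clamp01(abs(sum(p[i] for i in selected) / k - p_target))
    # Balance penalty: mean absolute deviation of the per-concept sums.
    n_concepts = len(g[0])
    sums = [sum(g[i][j] for i in selected) for j in range(n_concepts)]
    r_avg = sum(sums) / n_concepts
    p_r = clamp01(sum(abs(s - r_avg) for s in sums) / n_concepts)
    # Lower-bound penalty: relative shortfall below r_lower, capped at 1.
    p_r_low = clamp01((r_lower - r_avg) / r_lower)
    # Exposure penalty: average exposure relative to e_max, capped at 1.
    p_e = clamp01(sum(e[i] for i in selected) / k / e_max)
    return s_d - (p_t + p_p + p_r + p_r_low + p_e) / 5.0


# Toy data (made up): three candidate items, two target concepts.
d = [0.8, 0.6, 0.4]
t = [10, 20, 30]
p = [0.5, 0.7, 0.3]
e = [0, 2, 4]
g = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(aggregate_objective([1, 1, 0], d, t, p, e, g, 30, 0.6, 0.5, 10))
print(aggregate_objective([0, 0, 1], d, t, p, e, g, 30, 0.6, 0.5, 10))
```

With the toy data, selecting the first two items hits the time and difficulty targets exactly and is penalized only slightly for exposure, so it scores higher than selecting the third item alone.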
The genetic algorithm (GA) can be applied to solve the Adaptive Test Sheet Generation Problem by maximizing the aggregate objective function $F(X)$. The overall process is shown in Figure 4. The CISS process adaptively determines the desired concept scopes and granularities, and the out-of-scope test items, that is, the error-included concept nodes in Figure 2, are filtered out to reduce the problem space of test sheet generation. The candidate test items are encoded into chromosomes, each of which is an $N$-bit binary string $x_1, x_2, \ldots, x_N$, where $N$ is the number of candidate test items and $x_i = 1$ denotes that the test item $i$ is selected into the test sheet. In the beginning, a set of chromosomes whose bit values are randomly set is generated as the initial selection states. Then, each chromosome is evaluated by the aggregate objective function $F(X)$. The higher the score of a chromosome, the more likely it is to be retained to produce the next generation. In the Crossover step, the chromosomes with higher scores of $F(X)$ are selected to generate new chromosomes: two chromosomes are each broken into two segments at a randomly selected position, and new chromosomes are generated by exchanging a segment with each other. Further, in the Mutation step, a random bit of a random chromosome in the new generation is inverted in order to avoid being trapped in local optima. The process then returns to the Crossover step to generate the next generation until the iteration limit is reached. Finally, the chromosome with the highest score of $F(X)$ over the whole process is the approximate solution.
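The loop described above can be sketched as a small, generic GA in Python. Several details are assumptions rather than the authors' implementation: the function name `run_ga`, roulette-wheel selection with shifted weights (so that negative scores are allowed), applying the mutation rate per chromosome, and the fixed random seed. The default population size, iteration limit, and mutation rate follow the values reported in the experiments.

```python
import random
from typing import Callable, List, Sequence, Tuple


def run_ga(fitness: Callable[[Sequence[int]], float],
           n_bits: int,
           pop_size: int = 30,
           iterations: int = 1000,
           mutation_rate: float = 0.1,
           seed: int = 0) -> Tuple[List[int], float]:
    """Maximize `fitness` over n_bits-long binary strings (n_bits >= 2)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best: List[int] = population[0][:]
    best_score = fitness(best)
    for _ in range(iterations):
        scores = [fitness(c) for c in population]
        # Remember the best chromosome seen over the whole process.
        for chromosome, score in zip(population, scores):
            if score > best_score:
                best, best_score = chromosome[:], score
        # Selection: higher scores get proportionally larger weights.
        lowest = min(scores)
        weights = [s - lowest + 1e-9 for s in scores]
        next_gen: List[List[int]] = []
        while len(next_gen) < pop_size:
            a, b = rng.choices(population, weights=weights, k=2)
            # Crossover: cut both parents at one random position and
            # exchange the tail segments.
            cut = rng.randint(1, n_bits - 1)
            next_gen.append(a[:cut] + b[cut:])
            next_gen.append(b[:cut] + a[cut:])
        next_gen = next_gen[:pop_size]
        # Mutation: invert one random bit of some chromosomes.
        for chromosome in next_gen:
            if rng.random() < mutation_rate:
                pos = rng.randrange(n_bits)
                chromosome[pos] = 1 - chromosome[pos]
        population = next_gen
    # Also score the final generation.
    for chromosome in population:
        score = fitness(chromosome)
        if score > best_score:
            best, best_score = chromosome[:], score
    return best, best_score


# Toy usage: maximize the number of 1-bits in an 8-bit string.
solution, score = run_ga(lambda bits: float(sum(bits)), n_bits=8,
                         iterations=50)
print(solution, score)
```

In the ATSG setting, `fitness` would be the aggregate objective function evaluated on the candidate set $\theta$, and `n_bits` would be the number of candidate test items.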

Experiment and Evaluation
In order to evaluate the effectiveness of the proposed methodology in supporting various purposes of assessment in real learning situations, three experiments have been conducted. Firstly, item banks of various sizes are used to evaluate the efficiency and fitness scores of the proposed ATSG mechanism. Secondly, various levels of target concepts $C'$ are used to evaluate the performance of the ATSG mechanism and the degree to which it satisfies the required concepts. Thirdly, the exposure times of the selected test items are measured over 50 uses; the exposure times are accumulated, so the experiment can evaluate whether the ATSG mechanism prevents the generation of test sheets with high exposure times. For the three experiments, a control-group system has also been developed based on Hwang's methodology [4], where the objective function, shown in (5.1), was modified to meet the experimental requirements. The control-group system differs as follows:
(1) It does not run the CISS; all test items are considered in the GA.
(2) It does not consider the exposure times of test items.
(3) It does not calculate the generalized concept relevance, so the required concepts for the control group are expanded to all their descendant concepts.
The parameters of the GAs used by the experimental and control systems were chosen to balance effectiveness and efficiency. In the three experiments, the GAs were limited to 1,000 iterations and the mutation rate was 0.1. The population size was 30, and all initial bits of the chromosomes were set to 0 because the number of test items was much larger than the number of selected test items.

Various Sizes of the Item Bank
Item banks with 1,000 to 20,000 test items are used to evaluate the systems' efficiency and effectiveness. For each item bank, 10 test sheets with randomly chosen parameters are generated by the control and experimental systems. Effectiveness is measured by the fitness score of the aggregate objective function $F(X)$. The effectiveness results are shown in Figure 6, where the experimental system has more stable and generally higher fitness scores than the control system. The efficiency results are shown in Figure 7, where the response time of the GA increases as the item bank grows. The reason is that more candidate test items require longer chromosomes, and processing all bits of the chromosomes takes correspondingly longer. Of the two systems, the experimental system, which applies the CISS process to dramatically reduce the number of candidate test items, has a much shorter response time.

Various Levels of Target Concepts
This experiment demonstrates the systems' effectiveness in generating a test sheet for a specific level of target concepts. Target concepts from the most coarse-grained level, level 1, to the most fine-grained level, level 6, are randomly chosen for the two systems. As shown in Figure 8, the concept relevance scores of the control system are much lower than those of the experimental system, especially when the concept level is fine grained. The reason is that, without filtering out-of-scope test items, the GA of the control system has difficulty choosing test items with the correct concepts. Figure 9 also shows that the test sheets generated by the control system contain many out-of-scope test items, which seriously affects the test quality.
The response times in Figure 10 also reveal that the control system needs more computation time to generate a test sheet because many out-of-scope test items are also processed.

Exposure Times Measurement of Test Items
In the last experiment, 50 test sheets with similar target concept ranges are generated from an item bank containing 2,000 test items, and the test items used are recorded to calculate the exposure times of each test item. The average exposure times of the test items are shown in Figure 11, where the control system and the experimental system show no noticeable difference. According to the analysis of each test sheet, although the experimental system can avoid test items with high exposure times, the average exposure times still accumulate because of the small range of target concepts. In contrast, out-of-scope test items are often used in the test sheets generated by the control system, so the exposure times of any single test item accumulate slowly. As a result, the exposure times of the experimental system are no better than those of the control system.

Discussion
The proposed ATSG mechanism addresses the Adaptive Test Sheet Generation Problem in the following respects.

The Control of the Concept Granularity of the Test Sheets and the Prevention of the Irrelevant Problem Space
To simplify the discussion of this problem, assume that the concept tree is an $L$-level balanced tree and that the number of branches at each level is $B$. Let an adaptive requirement of the test sheet contain $n$ target concepts in level $X$. By applying the CISS mechanism, the problem space of the test sheet generation problem can be reduced to $n/B^{X-1}$ of the original problem space.
Proof. Assume that $m$ items are related to a concept. The number of candidate test items in the previous research is $mB^{L-1}$. By using the candidate item selection strategy, the number of candidate test items is $mnB^{L-X}$. Thus, the ratio of the new problem space to the previous problem space is $mnB^{L-X}/mB^{L-1} = n/B^{X-1}$.
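The ratio in the proof can be checked numerically with a short Python sketch. The function names and the sample values ($m = 10$, $n = 2$, $B = 3$, $L = 6$, $X = 4$) are arbitrary choices for illustration and do not come from the experiments.

```python
from fractions import Fraction
from typing import Tuple


def candidate_counts(m: int, n: int, branches: int,
                     levels: int, level: int) -> Tuple[int, int]:
    """Return (m * B^(L-1), m * n * B^(L-X)): item counts without and with CISS."""
    original = m * branches ** (levels - 1)
    reduced = m * n * branches ** (levels - level)
    return original, reduced


def reduction_ratio(n: int, branches: int, level: int) -> Fraction:
    """Return n / B^(X-1) as an exact fraction."""
    return Fraction(n, branches ** (level - 1))


original, reduced = candidate_counts(m=10, n=2, branches=3, levels=6, level=4)
print(original, reduced)                  # 2430 180
print(Fraction(reduced, original))        # 2/27
print(reduction_ratio(2, 3, 4))           # 2/27
```

For these sample values, two target concepts at level 4 of a six-level ternary tree leave 180 of 2,430 items, that is, 2/27 of the original problem space, in agreement with $n/B^{X-1}$.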

The Generation of a Test Sheet to Precisely Fit the Target Concept Range, Difficulty, and Expected Test Time
In the new objective functions, the distances to the target values are used instead of the lower and upper bounds of the previous studies. Thus, the difficulty and the expected test time can be fitted precisely. Moreover, the candidate item selection strategy and the penalty score of the concept relevance balance degree ensure that the test sheet contains balanced target concepts. As shown in Section 5.2, the concept relevance scores of the test sheets generated by the experimental system are also much higher than those of the control system.

The Consideration of the Item Exposure Rate
The penalty score of the exposure times, $P_e$, discourages items with high exposure rates from being selected into the test sheet.

The Extensibility of the ATSG Mechanism
Most approaches mentioned in the related work section applied more efficient evolutionary algorithms, for example, the greedy algorithm approach [12], the tabu search algorithm [13], and the discrete particle swarm optimization algorithm [14], to enhance the computation efficiency of test sheet generation. However, these approaches have not yet taken the conceptual granularity, exposure rates, and test item filtering into account. Therefore, these enhanced evolutionary approaches can be expected to replace Hwang's methodology [4] to improve the efficiency of the Selecting Candidate Test Item Set phase (Figure 4) in the CISS process of the ATSG mechanism.

The Future Work of the ATSG Mechanism
According to our observations of the experimental results, the fitness score changes with the item bank size and the computation time (see Figure 6). Because the fitness scores directly affect the quality of the generated test sheet, an important new issue is how to analyze the characteristics and predict the trends of the fitness scores over time and item bank size in order to improve the quality of test sheet composition. However, this kind of time series problem may not be modeled well by a conventional distribution model because the quality of the GA selection strategy seems to have the characteristics of self-similarity. According to the study of Li [17], Fractal Time Series, which have the feature of Long-Range Dependence (LRD) and obey the Power Law, are a suitable mathematical approach to model and analyze the features and phenomena of self-similar series [18], for example, the data series in cyber-physical networking systems [19], the time series of sea level [20] and of molecular motion on the cell membrane [21], DNA series [22], and the fractal lattice geometry using an Iterated Function System (IFS) on simplexes [23]. Accordingly, in the near future, we will try to apply the fractal time series approach to analyze and model the series of fitness scores in order to identify their self-similarity characteristics.

Conclusion
In this paper, an Adaptive Test Sheet Generation (ATSG) mechanism is proposed, in which the Candidate Item Selection Strategy (CISS) is devised to reduce the problem space of test sheet composition and an Aggregate Objective Function (AOF) based on the Genetic Algorithm (GA) is modeled to find an approximate solution. In this approach, the adaptive conceptual scope and granularity and the item exposure rates are considered in order to meet the various purposes of assessment in real learning situations. Experimental results show that the ATSG mechanism is able to generate the various test sheets more efficiently, precisely, and adaptively than the existing approaches in terms of various conceptual scopes, computation time, and item exposure rates. Furthermore, according to the experimental findings, the fractal time series approach can be expected to be applied in the near future to analyze and model the series of the GA's fitness scores, in order to identify their self-similarity characteristics and improve the quality of test sheet composition.

Figure 1: Test sheet types to meet various assessment requirements.

Figure 2: Issues for existing test sheet generation mechanisms.

Figure 3: Concept hierarchy H and its related test items Q.

Figure 4: The flowchart of the CISS process.

Figure 5: An example of the candidate item selection strategy (CISS) process.

Figure 6: Fitness scores in various sizes of item banks.

Figure 7: Response time in various sizes of item banks.

Figure 8: Concept relevance scores for various levels of target concepts.

Figure 9: Amount of out-of-scope test items for various levels of target concepts.

Figure 10: Response time for various levels of target concepts.

Figure 11: Average exposure times during 50 times of usage.