Autonomous Development of Algorithmic Concepts for Program Comprehension

A developmental model of algorithmic concepts is proposed here for program comprehension. Unlike traditional approaches, which cannot go beyond their predesigned representations, this model develops its internal representation autonomously, from chaos into algorithmic concepts, by mimicking concept formation in the brain under an uncontrollable environment that consists of program source codes from the Internet. The developed concepts can be employed to identify what algorithm a program performs. The accuracy of such identification reached 97.15% in the reported experiment.


Introduction
Our idea of autonomous development of algorithmic concepts for program comprehension is inspired by the autonomous development paradigm [1] for constructing developmental robots.
Program comprehension [2,3], also known as program understanding, is concerned with ways to analyze source code for purposes such as code reuse [4], code plagiarism detection [5,6], algorithm recognition [7-11], and programming tutoring [12]. Over the past decades, scientists have proposed many approaches. Most of them depend on predesigned representations [13], which include mental models that describe human mental representations of the program to be understood, cognitive models that describe the cognitive processes and temporary information structures in the programmer's head that are used to form mental models, and programming plans that are generic fragments of code representing typical scenarios in programming. In these traditional approaches, a machine cannot do anything beyond the predesigned representation. For example, traditional approaches to algorithm recognition are unable to recognize algorithms whose programming plans or templates [7-10] are not defined in the library of algorithm templates.
In contrast, the autonomous development paradigm, called autonomous mental development (AMD), enables machines to develop their minds autonomously as they interact with their environments [1,14]. With AMD, a robot can learn any task, including those whose representations are not defined before the robot is born. Our previous work shows that applying AMD theory to program comprehension can avoid predefining templates for algorithm recognition [11].
Unlike predesigned representations, which do not change when the machine runs, the representation proposed in this paper changes, developing gradually from randomness into algorithmic concepts as the machine interacts with an environment that consists of program source codes. This is similar to the internal representation in a human brain, which develops from having no idea of apples at birth to holding a concept for apples once the brain has obtained enough information about them.
Although the brain has no idea of apples when the human being encounters one for the first time, some information about the apple (i.e., an image of the apple) will reside in the memory of the brain, which makes something change in the memory. When the human being encounters an apple again, the brain may not find this second apple so strange, by recalling the image of the first apple. Moreover, the brain may be more familiar with apples after the human being encounters the second apple. This means that the image of the first apple is updated with some information obtained from the second apple, so that the updated image stands for both apples rather than just for the first one. In this way, the apple image in the memory of the brain is updated whenever the human being encounters an apple, and the brain becomes more and more familiar with apples. Finally, the brain is so familiar with apples that almost nothing of the apple image needs updating. At this time, the apple image in the brain represents all the apples in the world, and we say that a concept for apples has formed in the brain (i.e., the apple image has become the concept for apples).
Our developmental approach, which is based on a computational neural network, mimics the above procedure of concept formation in the brain to develop algorithmic concepts autonomously from algorithmic information extracted from program source codes. This might be an easy task if each of the program source codes implemented one of two simple algorithms. For instance, one might apply a concept-formation neural network [15] to divide these program source codes into two groups, with each group representing a concept for one of the two algorithms. In this case, the task reduces to classifying (or clustering) the program source codes.
On the other hand, it might be a difficult task to mine program source codes that come from online judge systems [16] on the Internet. The algorithmic knowledge underlying these program source codes, which are submitted by college students from all over the world, is very valuable for programming tutoring. However, discovering such algorithmic knowledge is a challenge because the space consisting of these program source codes changes uncontrollably every day. We can hardly know in advance what algorithmic concepts, or how many, lie in these program source codes, nor can we predict what new algorithmic concepts will emerge in this changing space. In simple words, this changing environmental space is muddy, so the developmental approach is necessary for such a muddy task [14].
In this paper, we propose a developmental model for concept formation under such an environmental space, which is unknown and changes uncontrollably as described above. The algorithmic concepts generated by this developmental model are readable by both machines and humans. This feature is desirable in program comprehension. In a concept assignment task [17], for instance, a machine with this developmental model is able to associate a perceived program with one of its developed concepts, the one standing for the algorithm performed by the program, while at the same time the machine is able to display the associated concept in a human-readable expression, making it possible for a human to understand what algorithm the program performs.

Algorithmic Concepts.
A person understands a program at an algorithmic level when being able to explain the sequence of operations that the program performs; that is, a program could be understood as a sequence of operations.
Figure 1(a) shows a program that can be understood as a sequence of concrete operations, or a concrete algorithm, as described by the flowchart in Figure 1(c). The idea behind this specific sequence of concrete operations is to use an array as a function table. The function table describes a function f by tabulating all the arguments x and their corresponding function values f(x). In Figure 1(a), an array (call it a) is used as the function table. When obtaining an argument value x (e.g., x = 0) from its input, the program can return the corresponding function value (e.g., 8899) easily, just by outputting the value of a[x] (e.g., a[0]).
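The function-table idea above can be sketched as follows (the array name `table` and the values other than 8899 are illustrative assumptions; only the value 8899 for argument 0 comes from the example of Figure 1(a)):

```python
# A function table: the array index is the argument x, the element is f(x).
# Only table[0] = 8899 is taken from the example; the other values are made up.
table = [8899, 1234, 5678]

def f(x):
    # Outputting table[x] replaces computing f(x) directly.
    return table[x]

print(f(0))  # -> 8899, as in the example of Figure 1(a)
```

Looking a value up by index costs constant time, which is why the tabulated function can be answered "easily", as the text puts it.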
The same idea of using an array as a function table also underlies the flowchart in Figure 1(d), which expresses the concrete algorithm of the program in Figure 1(b). The main distinction of the program in Figure 1(b) from the one in Figure 1(a) is that the outputs of the former are vectors; for example, the output a[0] is the vector (1, 2, 3, 4, 5, 6). This implies that the two concrete algorithms described in Figures 1(c) and 1(d) can be conceptualized as one abstract algorithm, as expressed by the flowchart in Figure 1(e).
An algorithmic concept is a common idea behind all concrete algorithms that have the same characteristics. The idea behind the two concrete algorithms above, for example, is such an algorithmic concept, which also refers to the abstract algorithm expressed in Figure 1(e). Each of the two programs above is regarded as an instance of this algorithmic concept.
There are many algorithmic concepts in the brains of human programmers. Each algorithmic concept is a common notion shared by programmers to refer to similar concrete algorithms. Thus, an algorithmic concept represents a class of similar concrete algorithms. Understanding similar concrete algorithms as one algorithmic concept makes it easier for programmers to manipulate programs, where such manipulation includes creating, maintaining, explaining, reengineering, and reusing [17]. It is also convenient for programmers to communicate with each other using algorithmic concepts.
Different algorithmic concepts represent different classes of concrete algorithms. Each concrete algorithm in a class is an instance of the algorithmic concept that represents the class.

The Developmental Model for Concepts.
From our developmental standpoint, the internal state S of the brain B depends on the environment E that the agent of the brain B explores, formally written as S = f(E). More specifically, the environmental area E_t that the brain B has sensed from the very beginning to the current time t > 0 determines the internal state S_{t+1} of the brain B at the next time t + 1; that is, S_{t+1} = f(E_t). Biologically, the brain runs by sending signals of discrete spikes, so we regard the brain as working in discrete time (i.e., t = 0, 1, 2, ...).
In this work, the environment E consists of program source codes that come from online judge systems [16] on the Internet. We assume that at each time instance t the brain senses one and only one program p_t. Thus, we have E_t = {p_0, p_1, ..., p_t}. Note that many different environmental sequences may form E_t for a given time instance t, because the brain may explore the environment E randomly.

(Figure 1, panels (e)-(h): (e) the abstract algorithm: for each input x, output the corresponding function value f(x) from a function table stored in an array; (f) the brain B, implemented as a one-layer neural network of m x n neurons, where each input vector e triggers the procedure R(e) to recall a neuron j in the brain B, and the synaptic vector s of the neuron j is updated by the procedure U(s, e); (g) the algorithmic signature of the program in (a); (h) the matrix for the algorithmic signature in (g).)
The internal state S is modeled as a set of elements, called images. Each image m in the set S will be recalled when the brain obtains the algorithmic information of a program p in E. This can be formally written as m = R(p), where R denotes the procedure of recalling m triggered by p. In addition, the image m will be updated into m'; that is, m' = U(m, p), where U denotes the procedure of updating m with the algorithmic information of the program p. The effect of this updating is equivalent to removing the image m from the set S and adding the updated image m' to it. Thus, the change of the internal state S can be formulated as S_{t+1} = (S_t \ {m}) ∪ {m'}, where m = R(p_t), m' = U(m, p_t), and p_t denotes the program that the brain encounters at the time instance t.
At the very beginning (i.e., t = 0), the brain has no idea of any program but may remember some algorithmic information of any program that it encounters. For this reason, we assume that the initial state S_0 of the set S consists of elements that will change to remember some algorithmic information of programs. In addition, for computational convenience, we use the same notation m = R(p) for an element m in S_0 that will change to remember some algorithmic information of the program p, so that, for each time instance t, there always exists one and only one element m in S_t that satisfies m = R(p) for each given program p in E. Biologically, the initial state S_0 refers to innateness [18], which is the result of evolution.
Being innate, each element m in the initial state S_0 contains no algorithmic information at all. After being updated by the procedure U(m, p), however, the element m turns into m', which contains some algorithmic information of the program p. More and more elements of the internal state S are updated to contain algorithmic information as the brain senses more and more programs, while some of them are updated many times and come to contain rich information about algorithms. When some element m in S is updated only slightly by the procedure U(m, p_t) at a time instance t (i.e., its updated image m' is almost the same as the element m itself), the updated image m' is regarded as an algorithmic concept. Note that the algorithmic concept m' is developed from something that initially had nothing to do with any algorithmic information.
This developmental model for concept formation is described formally as the following algorithm, where all developed concepts are collected in a set C (which is initialized empty, i.e., C = ∅).

Step 3a. e ← e_t
Step 3b. m ← R(e)
Step 3c. m' ← U(m, e)

In summary, this developmental model consists of three elements: a brain B, its internal state S, and its environment E. The behavior of the brain B is characterized by its recalling procedure R and updating procedure U. Each time the brain B senses a program p in the environment E, the procedure R recalls an image m in the internal state S that satisfies m = R(p), and then the procedure U updates the image m such that the updated image m' = U(m, p). When the difference between the image m and its updated image m' becomes very small after many updates, we regard the updated image m' as an algorithmic concept and put it into the set C of developed concepts.
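The loop of Steps 3a-3c can be sketched as a minimal runnable toy (the names `images`, `recall`, and `update`, the fixed toy input, and the threshold `DELTA` are illustrative assumptions, not the paper's implementation):

```python
import random

random.seed(0)

DIM = 3
# Innate state S0: images initialized at random, unrelated to any algorithm.
images = [[random.random() for _ in range(DIM)] for _ in range(4)]
ages = [0] * len(images)
concepts = []      # the set C of developed concepts, initially empty
DELTA = 1e-3       # "very small difference" threshold (assumed value)

def recall(e):
    # R(e): the image with the largest inner product with the input e.
    return max(range(len(images)),
               key=lambda i: sum(a * b for a, b in zip(images[i], e)))

def update(i, e):
    # U(m, e): an age-dependent blend of the old image and the new input.
    ages[i] += 1
    w2 = 1.0 / ages[i]              # learning rate, decreasing with age
    w1 = 1.0 - w2                   # retention rate, increasing with age
    old = images[i]
    images[i] = [w1 * m + w2 * x for m, x in zip(old, e)]
    diff = max(abs(a - b) for a, b in zip(old, images[i]))
    if ages[i] > 1 and diff < DELTA and i not in concepts:
        concepts.append(i)          # image barely changed: a concept formed

for t in range(100):                # Step 3a: sense one input e_t per step
    e = [1.0, 0.0, 0.0]             # toy input standing for one algorithm
    i = recall(e)                   # Step 3b: m <- R(e)
    update(i, e)                    # Step 3c: m' <- U(m, e)
```

After repeated encounters with the same input, the winning image converges to it, its updates shrink below `DELTA`, and it is collected into `concepts`, mirroring how an image matures into a concept in the model.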

The Internal Representation for Development.
We apply the theory of lobe component analysis (LCA) to the implementation of our developmental model for concept formation. LCA, which fits the idea of AMD, was proposed for developing feature detectors (neurons) in neural networks [19].
The brain B in our developmental model is implemented as a one-layer network consisting of m x n neurons. The internal state S of the brain B is composed of all the synaptic vectors that represent the synaptic weights of the neurons in the brain B (see Figure 1(f)). Every synaptic vector s in S can contain algorithmic information from programs in the environment E. All these synaptic vectors are initialized with random values to represent the initial state S_0 of S at the time t = 0. This means that each neuron in the brain B is "born" with no relation to algorithmic information.
The inputs to the brain B are also vectors, with the same dimension as the synaptic vectors. Each input vector e represents the algorithmic information of a program source code p in the environment E.
The LCA neural network contains two cell-centered mechanisms, called lateral inhibition and Hebbian learning. The former can be applied to the implementation of our recalling procedure R(e) and the latter to the implementation of our updating procedure U(s, e).
The recalling procedure R(e) is implemented as follows. We compute the response r_e(s) of every synaptic vector s in S by

r_e(s) = e · s / ‖s‖, (1)

where ‖s‖ denotes the norm of the vector s and e · s the inner product of the two vectors e and s. The response r_e(s) is the projection of the input vector e on the synaptic vector s. The greater the response r_e(s) is, the closer the synaptic vector s is to the input vector e: a synaptic vector s_1 is closer to the input vector e than another synaptic vector s_2 if its response r_e(s_1) is greater than the response r_e(s_2). If there is no synaptic vector closer to the input vector e than a synaptic vector s*, then the synaptic vector s* has the maximal response r(e) = max over all s in S of r_e(s). Thus, it is reasonable to take the synaptic vector s* that satisfies r_e(s*) = r(e) as the image that the brain B recalls when receiving the input vector e; that is, R(e) = s*. We also say that the procedure R(e) recalls the neuron j that the synaptic vector s* belongs to.

The updating procedure U(s, e) is formulated as

s' = U(s, e) = w_1(n_j) s + w_2(n_j) r(e) e, (2)

where w_1(n_j) denotes the retention rate, w_2(n_j) the learning rate, and w_1(n_j) + w_2(n_j) = 1. Both the retention rate w_1(n_j) and the learning rate w_2(n_j) are functions of the age n_j of the neuron j that the synaptic vector s belongs to. The age n_j counts the number of times that the synaptic vector of the neuron j has been updated. The retention rate w_1(n_j) is a monotonically increasing function. Initially, the age n_j is equal to 1 and the retention rate w_1(n_j) is equal to 0, so that U(s, e) = w_2(n_j) r(e) e, which means that no information in the synaptic vector s remains in the updated synaptic vector s' of the neuron j at its first update. As the age n_j increases, the retention rate w_1(n_j) is no longer equal to 0, so that the first term w_1(n_j) s in (2) is not zero, which means that some information in the synaptic vector s remains in the updated synaptic vector s' of the neuron j. When the age n_j becomes a large number, the retention rate w_1(n_j) approaches 1, so that almost all information in the synaptic vector s remains in the updated synaptic vector s' of the neuron j.
The learning rate w_2(n_j) is a monotonically decreasing function. When n_j = 1, the learning rate w_2(n_j) is equal to 1, which means that the updated synaptic vector s' of the neuron j takes all its information from the input vector e. When the age n_j becomes a large number, the learning rate w_2(n_j) approaches 0, which means that the updated synaptic vector s' of the neuron j accepts little new information from the current input vector e at the update.
Thus, the effect of the updating procedure U(s, e) depends on the age n_j of the neuron j that the synaptic vector s belongs to. When the age n_j becomes infinite, we have

lim as n_j → ∞ of U(s, e) = s. (3)

This means that the resulting vector s' of the updating procedure U(s, e) will be almost the same as the synaptic vector s when the age n_j becomes a large number. For this reason, we say that the neuron j is mature when its age n_j becomes large enough.
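The recalling and updating procedures described above can be sketched as follows (for simplicity this sketch uses the plain rates w_1(n) = 1 - 1/n and w_2(n) = 1/n rather than the amnesic versions defined in the experiment section; all function names are illustrative):

```python
import math

def response(e, s):
    # r_e(s) = e . s / ||s||: the projection of the input e on the synapse s.
    norm = math.sqrt(sum(x * x for x in s)) or 1.0
    return sum(a * b for a, b in zip(e, s)) / norm

def recall(e, synapses):
    # R(e): the index of the synaptic vector with the maximal response to e.
    return max(range(len(synapses)), key=lambda i: response(e, synapses[i]))

def update(s, e, age):
    # U(s, e) = w1(n) * s + w2(n) * r_e(s) * e, per equation (2).
    w2 = 1.0 / age                 # learning rate: 1 at age 1, then decreasing
    w1 = 1.0 - w2                  # retention rate: 0 at age 1, approaching 1
    r = response(e, s)
    return [w1 * si + w2 * r * ei for si, ei in zip(s, e)]
```

At age 1, w_1 = 0, so the updated vector equals r_e(s) e: nothing of the old synapse survives the first update, exactly as stated above.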

Formation of Algorithmic Concepts.
Each program p in the environment E is an instance of some algorithmic concept. For example, the program in Figure 1(a) is an instance of the algorithmic concept expressed in Figure 1(e). Initially, the brain B has no idea of any algorithmic concept, so the set C of developed concepts is empty. All the synaptic vectors in the internal state S are initialized with random values to represent the initial state S_0 at the time t = 0, and these values are evidently independent of the environment E. Thus, the brain B has no idea of any instance of an algorithmic concept c (e.g., the one expressed in Figure 1(e)) when it encounters the first instance p_1 (e.g., in Figure 1(a)) of the algorithmic concept c at the time t = t_1. However, the brain B will remember some algorithmic information of the instance p_1 by updating the synaptic vector s^(0)_j of some neuron j with the input vector e^(1) that represents the algorithmic information of the instance p_1. The synaptic vector s^(0)_j is the closest one to the input vector e^(1) in comparison with the other synaptic vectors in the internal state S, so the procedure R(e^(1)) recalls the neuron j; that is, s^(0)_j = R(e^(1)). Because this is the first time that the neuron j is recalled to update its synaptic vector (i.e., its age is 1, w_1(1) = 0, and w_2(1) = 1), the result s^(1)_j of the updating procedure U(s^(0)_j, e^(1)) by (2) is

s^(1)_j = r(e^(1)) e^(1), (4)

where r(e^(1)) is the response of the synaptic vector s^(0)_j. This means that the updated synaptic vector s^(1)_j of the neuron j contains some algorithmic information of the instance p_1 from the input vector e^(1). This process is equivalent to S ← (S \ {s^(0)_j}) ∪ {s^(1)_j}. The age of the neuron j is increased by one before the next update of its synaptic vector.
When the second instance p_2 (e.g., in Figure 1(b)) of the algorithmic concept c arrives at the time t = t_2 (t_2 > t_1), the brain B will not find the instance p_2 so strange, since the synaptic vector s^(1)_j of the neuron j contains some algorithmic information of the first instance p_1, which is similar to the second instance p_2. The brain B will recall the same neuron j again, because its current synaptic vector s^(1)_j is the most similar to the input vector e^(2) that represents the algorithmic information of the instance p_2; that is, s^(1)_j = R(e^(2)). Thus, the synaptic vector of the neuron j is updated again, and its age becomes 2. By (2) and (4), the result s^(2)_j of the updating procedure U(s^(1)_j, e^(2)) is

s^(2)_j = w_1(2) r(e^(1)) e^(1) + w_2(2) r(e^(2)) e^(2). (5)

Because w_1(2) ≠ 0 and w_2(2) ≠ 0 in (5), the newly updated synaptic vector s^(2)_j of the neuron j contains algorithmic information of both instances p_1 and p_2, from the vectors e^(1) and e^(2), respectively. At this time, the neuron j stands for both instances p_1 and p_2 rather than just for the first instance p_1. This means that the brain B has become more familiar with the algorithmic concept c.
For the reasons above, the same neuron j in the brain B will be recalled to update its synaptic vector whenever an instance of the algorithmic concept c arrives, so its synaptic vector contains more and more information about the algorithmic concept c. When the kth instance p_k of the algorithmic concept c arrives at the time t = t_k (k = 3, 4, 5, ...), the neuron j is recalled to update its synaptic vector for the kth time. Obviously, the number k is exactly the age of the neuron j at the time t_k. By (2), the result s^(k)_j of the kth update U(s^(k-1)_j, e^(k)) is obtained as

s^(k)_j = w_1(k) s^(k-1)_j + w_2(k) r(e^(k)) e^(k), (6)

where s^(k)_j denotes the kth-updated synaptic vector of the neuron j and e^(k) denotes the input vector that represents the algorithmic information of the kth instance p_k of the algorithmic concept c. By inspecting (6), we can conclude that the synaptic vector of the neuron j is updated greatly (i.e., there is a great difference between s^(k-1)_j and s^(k)_j) when the age k is small, and only slightly when the age k becomes large, because w_1(k) is an increasing function from 0 to 1 whereas w_2(k) is a decreasing function from 1 to 0. As the age k becomes large enough, the synaptic vector of the neuron j is almost unchanged.
On the other hand, a large age k means that the synaptic vector s^(k)_j contains algorithmic information from a large number of instances of the algorithmic concept c. The larger the age k is, the more instances of c the neuron j represents. Finally, the neuron j represents almost all instances of the algorithmic concept c when its age k exceeds a very large number. At this time, we regard the synaptic vector s^(k)_j as having developed into a representation of the algorithmic concept c (i.e., an algorithmic concept has formed in the brain B). Thus, the neuron j is collected as a developed concept, that is, C ← C ∪ {j}, when its age k is greater than a threshold, which is a very large number.
From the concept formation above, it can be seen that a developed concept stands for an idea shared by a group of programs that have similar concrete algorithms. In other words, a developed concept represents a common idea that programmers have when they write similar programs. Thus, it is helpful to apply the developed concepts to program understanding. Note that the developmental process of algorithmic concepts is unsupervised. This feature is desirable for understanding what algorithmic concepts are employed by the programs that come from online judge systems on the Internet.

Representation for Algorithmic Information
3.1. Algorithmic Signatures. An algorithm is a finite list of instructions for calculating a function [20]. There are many kinds of notation for expressing an algorithm, such as natural language, flowcharts, pseudocode, and problem analysis diagrams. In Figure 1(c), for example, a flowchart describes a concrete algorithm. From the flowchart, we can extract some algorithmic information, described as an algorithmic signature in Figure 1(g). This algorithmic signature consists of a control flow statement while and three noncontrolling language points array, input, and output. It characterizes the concrete algorithm of the program in Figure 1(a).
Generally, an algorithmic signature consists of several units. Each unit contains two parts: a control flow statement and its relevant language points. The control flow statements are the most important components because they decide the control flow of an algorithm. There are four control flow statements: while, for, if, and switch. The noncontrolling language points within a control flow statement are regarded as relevant to that control flow statement. The algorithmic signature in Figure 1(g) has only one unit, in which there are three language points relevant to the control flow statement while.
There are two relations between units: the sequence relation and the nesting relation. The program in Figure 2(a) has four control flow statements, so its algorithmic signature has the four units shown in Figure 2(b). The first unit consists of a control flow statement while with its three relevant language points equal to, input, and output. This first unit has nesting relations to the three other units, respectively; the pair "{" and "}" of the while statement shows the nesting relation. The three units nested in the first unit have sequence relations with each other, arranged from top to bottom. They all have one and the same relevant language point, greater than.
The algorithmic signature of each program can be generated from the parse tree of its source code. It is not difficult to identify control flow statements, noncontrolling language points, and their relationships in the parse tree. Every program can be converted into its algorithmic signature by a depth-first traversal of its parse tree.
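The traversal can be sketched with Python's own `ast` module as a stand-in (the paper parses C++ sources, so `signature` and its choice of non-controlling nodes here are illustrative assumptions, not the paper's implementation; C++'s switch has no Python counterpart):

```python
import ast

CONTROL = (ast.While, ast.For, ast.If)   # control flow statement node types

def signature(source):
    # Depth-first traversal of the parse tree: one unit per control flow
    # statement, recording its nesting depth and the non-controlling
    # language points (here: subscripts and calls) found inside it.
    units = []

    def visit(node, depth):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, CONTROL):
                points = sorted({type(n).__name__ for n in ast.walk(child)
                                 if isinstance(n, (ast.Subscript, ast.Call))})
                units.append((depth, type(child).__name__, points))
                visit(child, depth + 1)
            else:
                visit(child, depth)

    visit(ast.parse(source), 0)
    return units
```

For a snippet such as `while x: if a[0]: print(x)` this yields a `While` unit at depth 0 and a nested `If` unit at depth 1, mirroring the unit and nesting structure described for Figure 2(b).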

Matrixes for Algorithmic Signatures.
An algorithmic signature can be represented by an m x n matrix. Each row of the matrix may consist of a controlling number followed by noncontrolling numbers, representing one unit of the algorithmic signature. The controlling number denotes the control flow statement of the unit, whereas the noncontrolling numbers indicate whether particular noncontrolling language points appear in the unit.
The first row of the matrix in Figure 1(h) represents the unit of the algorithmic signature in Figure 1(g). The remaining rows are all filled with zeros, meaning that there are no more units in this algorithmic signature. The first number 80 in the first row is a controlling number, which indicates that the control flow statement of the unit is a while statement. Following the controlling number are noncontrolling numbers, where the three 1s indicate that the three noncontrolling language points array, input, and output, respectively, appear in the control flow statement.
The matrix in Figure 2(c) represents the algorithmic signature in Figure 2(b). The top four rows represent the four units of the algorithmic signature, respectively. Because the second unit is nested in the first unit, all the numbers except the last in the second row move right by one position, whereas the last number moves to the first position. Thus, the controlling number of the second row is in the second position from the left. The number 190 indicates that the control flow statement of the second unit is an if statement. The second row and the third row are identical because the second unit and the third unit are the same and have a sequence relation to each other.
Each position for a noncontrolling number in a row is associated with a noncontrolling language point. Positions from the second to the sixth in the first row of the two matrixes above are associated with the language points greater than, equal to, array, input, and output, respectively. A noncontrolling number takes only two values, 0 and 1. The value 1 indicates that the associated language point exists in the unit of the algorithmic signature, whereas the value 0 indicates that it does not.
Controlling numbers are designed to be greater than noncontrolling numbers. There are four controlling-number values, 80, 110, 190, and 220, which denote the four control flow statements while, for, if, and switch, respectively. Note that the controlling numbers for while and for are closer to each other because both are iteration statements.
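Encoding one unit as a matrix row can be sketched as follows (the function name and argument layout are illustrative assumptions; the controlling numbers and the language-point order are the ones given above):

```python
CONTROL_NUM = {"while": 80, "for": 110, "if": 190, "switch": 220}

def encode_unit(stmt, present_points, all_points):
    # One row: the controlling number first, then one 0/1 noncontrolling
    # number per language point, in a fixed order.
    return [CONTROL_NUM[stmt]] + [1 if p in present_points else 0
                                  for p in all_points]

POINTS = ["greater than", "equal to", "array", "input", "output"]
row = encode_unit("while", {"array", "input", "output"}, POINTS)
# row == [80, 0, 0, 1, 1, 1], the first row of the matrix in Figure 1(h)
```

Keeping controlling numbers far above 1 makes the control flow statement dominate the row, which matters later when developed rows are decoded by their maximum entry.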

Signatures of Developed Concepts.
Figure 2(d) shows the procedure from program source codes to signatures of developed concepts. The program source code (e.g., in Figure 1(a)) is converted into its algorithmic signature (e.g., in Figure 1(g)) through its parse tree. The algorithmic signature is then converted into a matrix (e.g., in Figure 1(h)). The matrix is converted into an input vector for the brain B, which is based on LCA. The outputs of the brain B are vectors that stand for developed concepts.
The relationship between a matrix and its corresponding vector is shown in Figure 2(e). The first column of the matrix maps into the first m components of the vector, where m is the number of rows of the matrix. The second column of the matrix maps into the second m components of the vector, and so on. In this way, we can convert the matrix of an algorithmic signature into its corresponding vector as an input to the brain B. Conversely, the first m components of the vector map into the first column of the matrix, the second m components map into the second column, and so on. In this way, we can convert the developed vectors from the output of the brain B into their corresponding matrixes.
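This column-by-column correspondence is a column-major flattening and its inverse; a sketch (function names are illustrative):

```python
def to_vector(matrix):
    # Column k of an m x n matrix fills components k*m .. k*m + m - 1.
    m, n = len(matrix), len(matrix[0])
    return [matrix[i][j] for j in range(n) for i in range(m)]

def to_matrix(vector, m, n):
    # The inverse mapping, used to turn developed vectors back into matrixes.
    return [[vector[j * m + i] for j in range(n)] for i in range(m)]

v = to_vector([[1, 2], [3, 4]])   # -> [1, 3, 2, 4]
```

A 30 x 30 signature matrix therefore maps to a 900-component vector, matching the 900-dimensional synaptic vectors used in the experiment.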
Figure 2(f) shows a matrix derived from a developed concept. It is a simplified version of a matrix from our experimental results, used here for convenience of discussion. We can see that the maximum number in the first row of the matrix is 78.86. We suppose that 2 is the threshold for determining whether a row represents a unit or not. Because the maximum number 78.86 is greater than the threshold 2, we treat this row as a unit. In the same way, we find four other units, whose maximum numbers are 180.32, 154.49, 160.09, and 154.17, respectively.

(Figure 2, panels (e)-(i): (e) the relationship between a matrix and its corresponding vector; (f) a simplified version of a developed matrix from our experimental results; (g) the matrix generated from the one in (f); (h) the signature produced from the matrix in (g), which represents a developed concept; (i) the signature that characterizes the algorithmic concept shared by both programs in Figures 1(a) and 1(b).)
In order to obtain the signatures of developed concepts, these maximum numbers should be replaced by their corresponding controlling numbers. We replace each maximum number with its closest controlling number. For example, we replace the maximum number 78.86 by the controlling number 80 because 80 is the controlling number closest to 78.86. For the same reason, the maximum numbers 180.32, 154.49, 160.09, and 154.17 are each replaced by the controlling number 190. The next step is to determine the noncontrolling numbers in a unit. Each nonmaximum number (e.g., 0.05 in the first row) in a unit should be replaced by a noncontrolling number. We suppose that 0.5 is the threshold: if a number is greater than 0.5, we replace it with 1, and otherwise with 0. Because all five nonmaximum numbers in the first row are less than 0.5, they are all replaced with 0. We do the same for the other four rows. Figure 2(g) shows the result of the conversion.
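The row-by-row conversion just described can be sketched as follows (the thresholds 2 and 0.5 are the values used above; the function name and the example row values other than 78.86 and 0.05 are illustrative assumptions):

```python
CONTROL_NUMS = (80, 110, 190, 220)

def decode_row(row, unit_threshold=2.0, point_threshold=0.5):
    peak = max(row)
    if peak <= unit_threshold:
        return None                 # the row does not represent a unit
    k = row.index(peak)
    # Noncontrolling numbers: 1 if above the threshold, otherwise 0.
    decoded = [1 if (i != k and v > point_threshold) else 0
               for i, v in enumerate(row)]
    # Replace the maximum with its closest controlling number.
    decoded[k] = min(CONTROL_NUMS, key=lambda c: abs(c - peak))
    return decoded

print(decode_row([78.86, 0.05, 0.12, 0.31, 0.02, 0.40]))
# -> [80, 0, 0, 0, 0, 0]
```

Snapping the peak to the nearest controlling number works because the four controlling values are spaced far apart relative to the drift a synaptic vector accumulates during development.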
With these controlling numbers and noncontrolling numbers, we can convert the matrix into a signature (e.g., in Figure 2(h)). This signature is not the algorithmic signature of a program, although both have the same format. The former characterizes a developed concept that is employed by a group of programs, whereas the latter represents the concrete algorithm of only one program. For example, the signature in Figure 2(i) characterizes the developed concept shared by a group including the two programs in Figures 1(a) and 1(b), whereas the algorithmic signature in Figure 1(g) represents the concrete algorithm of the program in Figure 1(a) only.

Experiment and Results
In this experiment, the brain B is composed of a one-layer network of 20 × 20 neurons, each of which has a 900-dimensional synaptic vector. For the updating procedure U(s, e) in (2), the retention rate is defined as w1(n) = (n − 1 − μ(n))/n and the learning rate as w2(n) = (1 + μ(n))/n, where μ(n) is an amnesic function as follows [19]:

μ(n) = 0 if n ≤ t1; μ(n) = c(n − t1)/(t2 − t1) if t1 < n ≤ t2; μ(n) = c + (n − t2)/r if n > t2.

The parameters in μ(n) were set as follows: t1 = 20, t2 = 200, c = 2, and r = 5000. After a developing phase for concept formation, the same brain B with all its 400 updated synaptic vectors was applied to a concept assignment task in program comprehension for an evaluation of the developed concepts.
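The retention and learning rates above can be sketched as follows, assuming the standard amnesic-average update of LCA from [19] (s ← w1·s + w2·e). The parameter values 20, 200, 2, and 5000 come from the text; the plain-Python vector arithmetic is an assumption for illustration.

```python
# Sketch of the amnesic-average update (after [19]); parameter values from the text.
T1, T2, C, R = 20, 200, 2, 5000

def mu(n):
    """Amnesic function: zero at first, then a linear ramp, then slow growth."""
    if n <= T1:
        return 0.0
    if n <= T2:
        return C * (n - T1) / (T2 - T1)
    return C + (n - T2) / R

def update(s, e, n):
    """Update synaptic vector s with input e at the neuron's n-th update."""
    w1 = (n - 1 - mu(n)) / n   # retention rate
    w2 = (1 + mu(n)) / n       # learning rate; note w1 + w2 == 1
    return [w1 * si + w2 * ei for si, ei in zip(s, e)]

s = [0.0] * 4
for n in range(1, 11):          # ten updates with the same input vector
    s = update(s, [1.0] * 4, n)
```

While n ≤ t1 the amnesic term is zero and the update reduces to an ordinary running mean, so after repeated presentations of the same input the synaptic vector converges to that input.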

Experimental Data.
The environment E is composed of 2341 C++ program source codes from an online judge system, submitted by sixty college students for solving sixty simple problems. All of these program source codes were judged correct by the online judge system. The number of control flow statements is not greater than ten in each of these program source codes, and the nesting level of control flow statements is not greater than six.
In the developing phase, each of these 2341 program source codes was chosen randomly 300 times to form the environmental sequence S. This means that every program source code occurs 300 times at random positions in the sequence S, whose length is therefore 2341 × 300. During the test for the evaluation of the developed concepts, however, each of these 2341 program source codes was chosen randomly only once for the concept assignment task.
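A minimal sketch of forming such a sequence (not the authors' code; the fixed seed is only for reproducibility): each program occurs exactly 300 times, in shuffled order.

```python
import random

# Hypothetical construction of the environmental sequence: every program
# source code occurs exactly `repetitions` times, at random positions.
def make_sequence(programs, repetitions=300, seed=0):
    seq = [p for p in programs for _ in range(repetitions)]
    random.Random(seed).shuffle(seq)
    return seq

seq = make_sequence(["p1", "p2", "p3"])   # toy stand-ins for source codes
```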
Each program source code was converted into a 30 × 30 matrix which represents the algorithmic signature of the program source code. In the algorithmic signature, each unit is composed of a control flow statement and at most 25 relevant language points. Because the control flow statement is more important than its relevant language points, its controlling number was allocated to occupy five adjacent positions in a row of the matrix, whereas each noncontrolling number was allocated to occupy only one position for a corresponding language point. The first five positions in the first row of a matrix, for example, are all filled with the same value of the controlling number that represents the control flow statement of the first unit in an algorithmic signature. The other 25 positions for noncontrolling language points in the first row are listed in Table 1, where all positions in a row are numbered 1, 2, . . ., 30 from left to right. Controlling numbers were designed to be greater than noncontrolling numbers in order to distinguish the control flow statement from the noncontrolling language points. The values of the controlling numbers are 80, 110, 190, and 220, denoting the four control flow statements while, for, if, and switch, respectively. Each noncontrolling number, however, has only two values: 1 for existence of the corresponding language point in a unit and 0 for nonexistence. It should be pointed out that the numbers in a nesting unit move five positions to the right, whereas the numbers in a sequence unit do not (see Section 3.2).
Finally, each matrix was converted into a 900-dimensional vector as an input vector to the brain B.
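The encoding above can be sketched as follows. This is a hypothetical reconstruction from the description: the controlling number fills five adjacent cells, each relevant language point sets one further cell to 1 at its Table 1 position, a nested unit is shifted right by five cells, and the 30 × 30 matrix is flattened row by row into a 900-dimensional vector. The shifting of positions beyond the row end is an unstated detail; this sketch simply drops them.

```python
# Hypothetical sketch of the row encoding and flattening described in the text.
WIDTH = 30
CONTROL = {"while": 80, "for": 110, "if": 190, "switch": 220}

def encode_unit(stmt, point_positions, nesting_depth=0):
    """point_positions: 1-based column numbers (6..30) as in Table 1."""
    row = [0] * WIDTH
    shift = 5 * nesting_depth            # a nested unit moves right 5 cells
    for i in range(5):                   # controlling number: 5 adjacent cells
        row[shift + i] = CONTROL[stmt]
    for pos in point_positions:          # noncontrolling language points
        col = shift + pos - 1
        if col < WIDTH:                  # drop overflow in this sketch
            row[col] = 1
    return row

def to_vector(matrix):
    """Flatten a 30 x 30 matrix into a 900-dimensional input vector."""
    return [v for row in matrix for v in row]

matrix = [encode_unit("while", [7])] + [[0] * WIDTH for _ in range(29)]
vec = to_vector(matrix)
nested = encode_unit("if", [7], nesting_depth=1)   # one level of nesting
```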

Developmental Results
Figure 3(a) shows the ages of all the 400 neurons in the brain B at the end of the developing phase. The neurons are numbered 1, 2, . . ., 400 in the descending order of their ages. Those whose ages are greater than or equal to 3000 are regarded as mature enough to represent algorithmic concepts. We found 78 mature neurons and put them into the set C of developed concepts. They are numbered 1, 2, . . ., 78 in the same order as in Figure 3(a). The matrix of the first developed concept (i.e., the one numbered 1 in the set C) is shown partially in Figure 3(b), which presents the top left part of the matrix. This matrix implies a five-unit signature because each of the top five rows has numbers that are greater than the threshold number 2. In each of the top five rows, there are five numbers in five adjacent positions that are greater than 2, which means that the controlling number occupies five adjacent positions in each of these rows. The value of the controlling number is the one closest to the average of these five numbers. For example, the average of the five numbers in the first row is 78.86. Among the four values 80, 110, 190, and 220 for controlling numbers, the value 80 is the closest to the average 78.86. Thus, the controlling number of the first unit is 80, which means that a while statement is the control flow statement of the unit. In a similar way, we can tell that the control flow statements of the other four units are all if statements. All these four units are nested in the first unit because their first numbers (from the left) that are greater than the threshold number 2 are in the sixth column, which means that their corresponding controlling numbers are five positions to the right of the first row's, indicating their nesting relations with the first unit. Moreover, these four units have sequence relations with each other because their corresponding controlling numbers are in the same columns.
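Reading a developed row back can be sketched as follows; this is an illustrative reconstruction of the procedure just described, not the authors' code. It locates the first entry above the threshold 2, averages the five-cell run, snaps the average to the closest of the four controlling values, and infers the nesting depth from the column offset.

```python
# Hypothetical decoder for one row of a developed matrix, per the text:
# threshold 2 marks the unit, the 5-cell average picks the controlling value,
# and the starting column (a multiple of 5) gives the nesting depth.
CONTROLLING = {80: "while", 110: "for", 190: "if", 220: "switch"}

def read_unit(row, threshold=2.0):
    start = next((i for i, v in enumerate(row) if v > threshold), None)
    if start is None:
        return None                       # this row is not a unit
    avg = sum(row[start:start + 5]) / 5   # average of the 5-cell run
    value = min(CONTROLLING, key=lambda c: abs(c - avg))
    return CONTROLLING[value], start // 5  # statement and nesting depth

unit = read_unit([79.1, 78.6, 79.0, 78.5, 79.1] + [0.0] * 25)
nested = read_unit([0.0] * 5 + [190.2, 189.8, 190.1, 189.9, 190.0] + [0.0] * 20)
```

The first row averages to 78.86, which snaps to 80 (a while statement at depth 0); the second starts in the sixth column, so it decodes as an if statement nested one level deep.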
Figure 3(c) shows the signature of the first developed concept. The four if statements are presented within the pair "{" and "}" of the while statement, indicating that their units are nested in the first unit. In addition, these four statements are presented from top to bottom, indicating that their units have sequence relations with each other. Following the symbol "//" are the noncontrolling language points relevant to their corresponding control flow statement. It can be seen, for example, that the second unit has four relevant language points, logical and, equal to, variable, and decimal integer constant, because they follow the symbol "//" within the pair "{" and "}" of its control flow statement if. This signature is readable in some sense: from it we can see that the first developed concept refers to four actions, each of which may or may not be executed according to its corresponding condition in every loop.

Concept Assignment Results.
To test the developed concepts for evaluation, we stopped the development of the brain B by disabling its updating procedure U(s, e). When receiving an input vector e that represents a program p in E, the brain B would still activate its recalling procedure R(e) and recall a neuron n such that its synaptic vector s = R(e). The recalled neuron n would be assigned to (or associated with) the program p if the recalled neuron n was in the set C of developed concepts (i.e., n was a developed concept), which means that the program p is supposed to perform the abstract algorithm that the developed concept n represents. In this way, the 78 developed concepts in the set C were associated with 1229 out of the total 2341 program source codes in the environment E, that is, 52.50% of all the program source codes. The continuous curve scaled by the right vertical axis in Figure 3(d) shows the number of program source codes that each developed concept was associated with. 77 developed concepts (i.e., 98.72% of the 78 developed concepts) have 10 or more recalls (associated program source codes). Only one has fewer than 10 recalls: the 45th developed concept has only one associated program source code. By comparing the matrix of the 45th developed concept with that of the 42nd developed concept, we found that they were almost the same. This probably means that some program source codes which should have been assigned to the 45th developed concept were wrongly assigned to the 42nd developed concept, so that the 42nd developed concept had a great number of recalls whereas the 45th developed concept had only one.
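The assignment step can be sketched as follows. This is an assumption-laden illustration: it models recall as a nearest-neighbor search over synaptic vectors using Euclidean distance (the paper's recalling procedure may use a different similarity), and returns no assignment when the winning neuron is not in the mature set.

```python
# Hypothetical sketch of concept assignment with updating disabled:
# recall the neuron whose synaptic vector is closest to the input, and
# assign the program only if that neuron is a developed (mature) concept.
def recall(synapses, e):
    def dist2(s):
        return sum((a - b) ** 2 for a, b in zip(s, e))
    return min(range(len(synapses)), key=lambda i: dist2(synapses[i]))

def assign(synapses, mature, e):
    i = recall(synapses, e)
    return i if i in mature else None    # unassigned if neuron is immature

synapses = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]  # toy 2-D synaptic vectors
mature = {1, 2}                                   # indices of mature neurons
label = assign(synapses, mature, [4.2, 4.8])
```

With these toy vectors the input [4.2, 4.8] recalls neuron 2, which is mature, so the program would be associated with that concept.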

Evaluation of Developed Concepts.
We found by inspection that program source codes which are assigned to the same developed concept always perform the same abstract algorithm if they solve the same problem. Table 2 shows that the fourth developed concept had twenty-six associated program source codes. Among them, seventeen solve problem 1, eight solve problem 2, and one solves problem 60. By inspecting these programs one by one, we found that all the programs that solve problem 1 perform the same abstract algorithm and that all the programs that solve problem 2 also perform the same abstract algorithm. Moreover, we found that the abstract algorithms of these two groups of programs are the same. However, we also found that the program that solves problem 60 performs a different abstract algorithm. We think that this program had a wrong concept assignment and that the other twenty-five programs had their correct concept assignments. Thus, the fourth developed concept has an assignment accuracy of 25/26 ≈ 96.15%. Table 3 lists the concept assignment accuracies of the first ten developed concepts as well as their recalls. The series of "×" scaled by the left vertical axis in Figure 3(d) shows the assignment accuracy of each developed concept. Out of the 1229 program source codes that were associated with the 78 developed concepts, 1194 had their correct concept assignments. The overall accuracy of these concept assignments is 1194/1229 ≈ 97.15%.

Conclusion
An algorithmic concept is an idea behind a group of programs that have similar concrete algorithms. Human programmers have many algorithmic concepts in their minds, which guide them in designing programs. To understand which algorithmic concept was used to design a given program is a difficult task, especially in an uncontrollable environment that consists of program source codes from online judge systems on the Internet.
Our developmental model, inspired by the autonomous development paradigm, aims to tackle this difficult task. Unlike traditional approaches that cannot do anything beyond their predesigned representations, our model is able to develop its internal representation from randomness into algorithmic concepts that are not designed in advance. On the basis of an LCA neural network, the developmental process mimics the recalling procedure and the updating procedure in the human brain. Every program arrival triggers the recalling procedure to recall a neuron in the neural network. The updating procedure then modifies the synaptic weights of the recalled neuron with algorithmic information of the program that triggered the recalling procedure. The algorithmic information of a program is characterized as an algorithmic signature, which is converted into a vector as an input to the neural network. After a neuron has been recalled to update its synaptic weights enough times, its synaptic weights become a representation of an algorithmic concept.

Figure 1:
Figure 1: (a) For each input number, this program prints a corresponding number from a one-dimensional array. (b) For each input number, this program prints a corresponding vector from a two-dimensional array. (c) This flowchart depicts the concrete algorithm for the program in (a). (d) This flowchart depicts the concrete algorithm for the program in (b). (e) The abstract flowchart for both programs in (a) and (b): for each input x, output the corresponding function value f(x) from a function table stored in an array. (f) The brain B is implemented as a one-layer neural network which has m × m neurons. Its inputs are d-dimensional vectors. Each input vector e will trigger the procedure R(e) to recall a neuron n in the brain B. The synaptic vector s of the neuron n will be updated by the procedure U(s, e). (g) The algorithmic signature of the program in (a). (h) The matrix for the algorithmic signature in (g).

Figure 2:
Figure 2: (a) A C++ program which has four control flow statements. (b) The algorithmic signature of the program in (a). (c) A matrix for the algorithmic signature in (b). (d) From program source codes to signatures of developed concepts. (e) The relationship between a matrix and its corresponding vector. (f) A simplified version of a developed matrix from our experimental results. (g) This matrix is generated from the one in (f). (h) This signature is produced from the matrix in (g), which represents a developed concept. (i) This signature characterizes the algorithmic concept shared by both programs in Figures 1(a) and 1(b).

Figure 3:
Figure 3: (a) It shows the ages of all the 400 neurons at the end of the developing phase. The neurons are numbered 1, 2, . . ., 400 in the decreasing order of their ages. (b) This is a part of the matrix derived from the first developed synaptic vector in (a). The corresponding controlling numbers occupy five adjacent positions in every unit. (c) The signature of the first developed concept. (d) The concept assignment accuracy of each developed concept (indicated by a series of "×" scaled by the left vertical axis) and the recalls that each developed concept has (indicated by the continuous curve scaled by the right vertical axis).
{p1, . . ., pt}. The distinction of S from E is that the elements in S are ordered in time whereas those in E are not. In other words, S denotes a sequence which consists of elements from E. Note that S may have duplicate elements; for example, p1 = p10 if the brain encounters the program p1 again at the time instance t = 10. Moreover, there are many possible

Table 1:
Positions for noncontrolling language points in the first row of a matrix.

Table 2:
The fourth developed concept had twenty-six associated programs.

Table 3:
The accuracy and recalls of the first ten developed concepts.