Traditional Web-based learning systems analyze learning behaviors insufficiently and offer little personalized study guidance, so a few user clustering algorithms have been introduced to address this. When analyzing behaviors with these algorithms, researchers generally focus on continuous data and tend to neglect discrete data, although both are generated by online learning actions. Moreover, the implicit coupled interactions among the data are frequently ignored by these algorithms. As a result, a large amount of significant information that could improve clustering accuracy is lost. To solve these issues, we propose a coupled user clustering algorithm for Web-based learning systems that takes into account both discrete and continuous data, as well as the intracoupled and intercoupled interactions of the data. The experimental results in this paper demonstrate that the proposed algorithm outperforms the compared algorithms.
Information technology and data mining have brought great changes to the field of education. Web-based learning is a significant and advanced type of education that uses computer network technology, multimedia digital technology, database technology, and other modern information technologies to support learning in a digital environment.
At present, many educational institutions and researchers have begun to study Web-based learning systems. They mainly study the systems’ composition, the construction of a learning mode, the design and development of hardware, relevant supportive policies and services, and so forth. Meanwhile, an increasing number of Web-based learning systems are developing rapidly, for instance, online study communities and virtual schools [
User clustering can uncover hidden information in a large amount of data. By clustering users in different ways, Web-based learning systems can provide personalized learning guides and learning resource recommendations to learners, which can greatly improve learning efficiency in these systems.
Recently, user clustering algorithms have been applied in several Web-based learning systems. In order to choose a suitable learning method, clustering was addressed [
All the above methods combine traditional clustering algorithms and apply them in Web-based learning systems, where learners’ attribute information is extracted by analyzing their learning behaviors and then used for user clustering. Most of these attributes are continuous data. Learners’ behavioral information contains a great deal of continuous data, such as “total time length of learning resources” and “comprehensive test result.” In contrast, attributes with categorical features, like “chosen lecturer,” “chosen learning resource type,” and so forth, are easily neglected. Although this kind of data is a smaller component of learning behavior information, it also plays a significant role in learner clustering.
In addition, the discrete and continuous data extracted from learning behaviors in Web-based learning systems are interrelated: there are implicit coupling relationships among them. This coupling is often ignored by traditional clustering algorithms, which leads to a massive loss of significant information during similarity computation and user clustering. Consequently, the quality of the services provided, such as learning guides and learning resource recommendations, is not satisfactory. For example, it is common sense that “total time length of learning resources” has a positive impact on “comprehensive test result.” Generally, the longer the “total time length of learning resources,” the better the “comprehensive test result.” However, some special groups of students behave differently. They get either a better “comprehensive test result” with a shorter “total time length of learning resources” or a worse “comprehensive test result” with a longer “total time length of learning resources.” This special correlation between attributes, which is often ignored, is taken into account in the user clustering of our approach. Ignoring it affects user clustering accuracy and makes it hard to guarantee that all users receive high-quality personalized services, so an effective mechanism is needed to compensate for the loss of the ignored information.
To solve the above issues, this paper proposes a coupled user clustering algorithm based on mixed data, namely, CUCA-MD. The algorithm builds on the fact that both discrete and continuous data exist in learning behavior information and analyzes each kind according to its own features. In the analysis, CUCA-MD takes into account intracoupling and intercoupling relationships and builds separate user similarity matrices for discrete attributes and continuous attributes. We then obtain the integrated similarity matrix by weighted summation and perform user clustering with a spectral clustering algorithm. In this way, we take full advantage of the mixed data generated from learning actions in Web-based learning systems. Meanwhile, the algorithm considers the correlation and coupling relationships of attributes, which enables us to find interactions between users, especially users of the previously mentioned special groups. Consequently, it can provide suitable and efficient learning guidance and help for users.
The contributions of this algorithm can be summarized in three aspects. Firstly, it takes into account the coupling relationships of attributes in Web-based learning systems, which have frequently been neglected, and improves clustering accuracy. Secondly, it considers the different features of discrete data and continuous data and builds the user similarity matrix based on mixed data. Thirdly, it captures and analyzes individuals’ learning behaviors and provides customized and personalized learning services to different groups of learners.
The rest of the paper is organized as follows. The next section introduces related works. The clustering algorithm model is proposed in Section
User clustering on mixed data has been achieved in some fields but rarely in the Web-based learning area. Ahmad and Dey came up with a clustering algorithm based on updated
Recently, an increasing number of researchers have paid special attention to interactions of object attributes and have become aware that the independence assumption on attributes often leads to a massive loss of information. In addition to the basic Pearson’s correlation [
User clustering model plays a significant role in user evaluation framework [
The coupled user clustering model based on mixed data.
The model is built on the discrete data and continuous data extracted from learning behaviors. According to their different features, we tease out the corresponding attributes in parallel by analyzing the behaviors. Then intracoupled and intercoupled relationships are introduced into the user similarity computation, which yields separate user similarity matrices for discrete attributes and continuous attributes. Finally we use weighted summation to integrate the two matrices and apply the Ng-Jordan-Weiss (NJW) spectral clustering algorithm [
This paper proposes a coupled user clustering algorithm based on mixed data that is suitable for the field of education. It suits not only user clustering analysis in Web-based learning systems but also corporate training and performance review, as well as other Web-based activities that involve user participation and behavior recording. The implementation of CUCA-MD in Web-based learning systems is introduced in this section.
Among the data generated from users’ learning behaviors, discrete data plays a significant role in user behavior analysis and user clustering. This section demonstrates how to compute user similarity from discrete data in Web-based learning systems, considering both intracoupled similarity within an attribute (i.e., value frequency distribution) and intercoupled similarity between attributes (i.e., feature dependency aggregation).
In Web-based learning systems (
A fragment example of user discrete attributes.
 |  |  |  |
---|---|---|---|
 | Wang | Video | Online |
 | Liu | Video | Online |
 | Zhao | e-book | Interactive |
 | Li | Audio | Offline |
 | Li | PPT | Offline |
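As a rough illustration of how value frequencies drive the intracoupled similarity, the sketch below tallies the frequency of each value in the fragment above and scores a pair of values with the frequency-based form used in the coupled nominal similarity literature, f(x)·f(y) / (f(x) + f(y) + f(x)·f(y)), where f(x) is the number of users taking value x. That formula and the names `rows`, `value_frequencies`, and `intracoupled_similarity` are assumptions made for illustration, not a restatement of the paper's definition.

```python
from collections import Counter

# Rows of the fragment table above; the column meanings are not restated here.
rows = [
    ("Wang", "Video", "Online"),
    ("Liu", "Video", "Online"),
    ("Zhao", "e-book", "Interactive"),
    ("Li", "Audio", "Offline"),
    ("Li", "PPT", "Offline"),
]


def value_frequencies(rows, column):
    """Count how many users take each value of one discrete attribute."""
    return Counter(row[column] for row in rows)


def intracoupled_similarity(freq, x, y):
    """Frequency-based similarity of two values of the same attribute.

    Assumed form: f(x) * f(y) / (f(x) + f(y) + f(x) * f(y)).
    """
    fx, fy = freq[x], freq[y]
    return (fx * fy) / (fx + fy + fx * fy)


third = value_frequencies(rows, 2)
print(third)
print(intracoupled_similarity(third, "Online", "Offline"))      # both occur twice
print(intracoupled_similarity(third, "Online", "Interactive"))  # frequencies 2 and 1
```

Under this assumed form, the pair with equal frequencies (Online, Offline) scores higher than the pair with unequal frequencies (Online, Interactive).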
Data objects with features can be organized by the information table
An example of information table.
To analyze intracoupled and intercoupled correlation of user attributes, we define a few basic concepts as follows.
Given an information table
These SIFs describe the relationships between objects and attribute values from different levels. For example,
Given an information table
This IIF
Given an information table
When given all the objects with the
Intracoupled and intercoupled similarity of attributes are, respectively, introduced as follows.
Given an information table
Greater similarity is assigned to attribute value pairs with approximately equal frequencies. The higher these frequencies are, the closer the two values are. For example,
Cost and Salzberg [
Given an information table
The value subset
With (
Given an information table
In Table
Coupled Attribute Value Similarity (CAVS) is proposed in terms of both intracoupled and intercoupled value similarities. For example, the coupled interaction between
Given an information table
In Table
With the specification of IaAVS and IeAVS, a coupled similarity between objects is built based on CAVS. Then we sum all CAVSs analogous to the construction of Manhattan dissimilarity [
Given an information table
In Table
In this way, a user similarity matrix of
For instance, we get a user similarity matrix of
Continuous data has different features from discrete data. This section demonstrates user similarity computation using a Taylor-like expansion, which involves intracoupled interaction within an attribute (i.e., the correlations between an attribute and its own powers) and intercoupled interaction among different attributes (i.e., the correlations between an attribute and the powers of others).
After students log onto a Web-based learning system, the system will record their activity information, such as times of doing homework and number of learning resources. This paper refers to a Web-based personalized user evaluation model [
Comprehensive evaluation index system.
First-level index | Second-level index |
---|---|
Autonomic learning | Times of doing homework |
 | Average correct rate of homework |
 | Number of learning resources |
 | Total time length of learning resources |
 | Times of daily quiz |
 | Daily average quiz result |
 | Comprehensive test result |
 | Number of collected resources |
 | Times of downloaded resources |
 | Times of making notes |
Interactive learning | Times of asking questions |
 | Times of marking and remarking |
 | Times of answering classmates' questions |
 | Times of posting comments on the BBS |
 | Times of interaction by BBS message |
 | Times of sharing resources |
 | Average marks made by the teacher |
 | Average marks made by other students |
 | Times of marking and remarking made by the student for the teacher |
 | Times of marking and remarking made by the student for other students |
A fragment example of user continuous attributes.
 |  |  |  |  |  |  |
---|---|---|---|---|---|---|
 | 0.61 | 0.55 | 0.47 | 0.72 | 0.63 | 0.62 |
 | 0.75 | 0.92 | 0.62 | 0.63 | 0.74 | 0.74 |
 | 0.88 | 0.66 | 0.71 | 0.74 | 0.85 | 0.87 |
 | 0.24 | 0.83 | 0.44 | 0.29 | 0.21 | 0.22 |
 | 0.93 | 0.70 | 0.66 | 0.81 | 0.95 | 0.93 |
In this section, the intracoupled and intercoupled relationships of the continuous attributes extracted above are represented in turn. We use an example to make this more explicit. We single out 6 continuous attributes of the same 5 students mentioned in Section
Here we use an information table
The usual way to calculate the interactions between 2 attributes is Pearson’s correlation coefficient [
However, Pearson’s correlation coefficient only fits linear relationships. It is far from sufficient to fully capture pairwise attribute interactions. Therefore we expect to use more dimensions to expand the numerical space spanned by
Firstly, we use a few additional attributes to expand interaction space in the original information table. Hence, there are
Extended user continuous attributes.
 |  |  |  |  |  |  |  |  |  |  |  |  |
---|---|---|---|---|---|---|---|---|---|---|---|---|
 | 0.61 | 0.37 | 0.55 | 0.30 | 0.47 | 0.22 | 0.72 | 0.52 | 0.63 | 0.40 | 0.62 | 0.38 |
 | 0.75 | 0.56 | 0.92 | 0.85 | 0.62 | 0.38 | 0.63 | 0.40 | 0.74 | 0.55 | 0.74 | 0.55 |
 | 0.88 | 0.77 | 0.66 | 0.44 | 0.71 | 0.50 | 0.74 | 0.56 | 0.85 | 0.72 | 0.87 | 0.76 |
 | 0.24 | 0.06 | 0.83 | 0.69 | 0.44 | 0.19 | 0.29 | 0.08 | 0.21 | 0.04 | 0.22 | 0.05 |
 | 0.93 | 0.86 | 0.70 | 0.49 | 0.66 | 0.44 | 0.81 | 0.66 | 0.95 | 0.90 | 0.93 | 0.86 |
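The extension step can be sketched in a few lines. The example below is a minimal illustration rather than the paper's implementation: it places the square of each attribute next to the original value (the extended table above is consistent with second powers rounded to two decimals) and then computes Pearson's correlation coefficient between any two extended columns. The function names `expand_with_powers` and `pearson` are hypothetical, and the maximal power of 2 is read off the table rather than stated as a general setting.

```python
from math import sqrt

# Continuous attribute values of the five students (rows) from the fragment table.
data = [
    [0.61, 0.55, 0.47, 0.72, 0.63, 0.62],
    [0.75, 0.92, 0.62, 0.63, 0.74, 0.74],
    [0.88, 0.66, 0.71, 0.74, 0.85, 0.87],
    [0.24, 0.83, 0.44, 0.29, 0.21, 0.22],
    [0.93, 0.70, 0.66, 0.81, 0.95, 0.93],
]


def expand_with_powers(rows, max_power=2):
    """Replace each value x by x, x**2, ..., x**max_power (column-interleaved)."""
    return [[x ** p for x in row for p in range(1, max_power + 1)] for row in rows]


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long value lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


expanded = expand_with_powers(data)
columns = list(zip(*expanded))  # one tuple per extended attribute

# Correlation of every extended attribute with every other one.
corr = [[pearson(a, b) for b in columns] for a in columns]

print([round(v, 2) for v in expanded[0]])
print(round(corr[0][1], 3))  # first attribute vs. its own square
print(round(corr[0][2], 3))  # first attribute vs. second attribute
```

In this layout, the correlation between an attribute and its own power corresponds to an intracoupled interaction, and the correlation with another attribute or its power corresponds to an intercoupled interaction.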
Secondly, the correlation between pairwise attributes is calculated. It captures both local and global coupling relations. We take the
Here we do not consider all relationships but only take the significant coupling relationships into account, because involving all relationships may cause overfitting when modeling the coupling relationships, which goes against the inherent interaction mechanism of the attributes. So, based on the updated correlation, the intracoupled and intercoupled interactions of attributes are proposed. Intracoupled interaction is the relationship between
For attribute
Here
For attribute
The
Based on the result, we can find that there is hidden correlation between user attributes. For instance, all the
Intracoupled and intercoupled interactions are integrated in this section as a coupled representation scheme.
In Table
The coupled representation of attribute
Taking an example in Table
Integrated coupling representation of user continuous attributes.
 |  |  |  |  |  |  |  |  |  |  |  |  |
---|---|---|---|---|---|---|---|---|---|---|---|---|
 | 3.85 | 3.80 | 0.70 | 0.70 | 2.20 | 1.46 | 3.24 | 3.23 | 3.35 | 3.70 | 3.76 | 3.81 |
 | 4.54 | 4.50 | 1.34 | 1.34 | 2.89 | 1.98 | 3.66 | 3.65 | 3.82 | 4.31 | 4.37 | 4.51 |
 | 5.51 | 5.46 | 0.88 | 0.88 | 3.54 | 2.44 | 4.46 | 4.45 | 4.66 | 5.22 | 5.28 | 5.47 |
 | 1.53 | 1.52 | 1.17 | 1.17 | 1.01 | 0.80 | 1.03 | 1.02 | 1.06 | 1.42 | 1.44 | 1.52 |
 | 5.94 | 5.89 | 0.94 | 0.94 | 3.73 | 2.49 | 4.95 | 4.94 | 5.17 | 5.68 | 5.75 | 5.90 |
Finally we obtained the global coupled representation of all the
Incorporated with the couplings of attributes, each user is represented as
With the new user attributes information of the coupled information table, we utilize the formula below [
For instance, we get a user similarity matrix of
In Sections
For example, in the former examples we listed 3 discrete attributes and 6 continuous attributes; then
With consideration of intracoupled and intercoupled correlation of user attributes, we get the user similarity matrix
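A minimal sketch of the integration step, under stated assumptions: the discrete and continuous similarity matrices are combined by a weighted sum whose weights add up to 1, and the result is normalized as D^(-1/2) A D^(-1/2), the affinity normalization that NJW spectral clustering applies before taking the leading eigenvectors and running k-means on their row-normalized form. Using the mixed similarity matrix directly as the affinity matrix A is an assumption, and the matrices, weights, and function names below are hypothetical placeholders; the paper's own weighting rule and similarity values are not reproduced here.

```python
from math import sqrt


def weighted_sum(m_discrete, m_continuous, w_discrete, w_continuous):
    """Combine two user similarity matrices entry by entry."""
    if abs(w_discrete + w_continuous - 1.0) > 1e-9:
        raise ValueError("weights are expected to sum to 1")
    return [
        [w_discrete * d + w_continuous * c for d, c in zip(row_d, row_c)]
        for row_d, row_c in zip(m_discrete, m_continuous)
    ]


def normalized_affinity(similarity):
    """Return D^(-1/2) * A * D^(-1/2), with A the similarity matrix whose
    diagonal is set to zero and D the diagonal matrix of row sums of A."""
    n = len(similarity)
    a = [[0.0 if i == j else similarity[i][j] for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in a]
    return [
        [a[i][j] / sqrt(degree[i] * degree[j]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical 3-user similarity matrices (symmetric, illustrative values only).
m_d = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]]
m_c = [[1.0, 0.6, 0.3], [0.6, 1.0, 0.4], [0.3, 0.4, 1.0]]

m_mixed = weighted_sum(m_d, m_c, w_discrete=0.5, w_continuous=0.5)
normalized = normalized_affinity(m_mixed)
for row in normalized:
    print([round(v, 3) for v in row])
```

The remaining NJW steps (eigendecomposition, row normalization, and k-means) are omitted to keep the sketch dependency-free; a numerical library would normally handle them.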
We conducted experiments and user studies using the coupled user clustering algorithm proposed in this paper. The data for the experiments are collected from a Web-based learning system of China Educational Television (CETV), named “New Media Learning Resource Platform for National Education” (
In the experiment, we asked 180 users (indicated by
Public data sets on learners’ behaviors in online learning systems are currently scarce, and most of them do not contain labeled user clustering information. Moreover, because learners’ behaviors are always somewhat subjective, labeling learners with classes based only on their behaviors, without knowing the information behind them, is not fully accurate. Therefore, we conduct a few user studies, collecting relevant user similarity data directly from students and from teachers, as the basis for verifying the accuracy of learner clustering in Web-based learning systems.
Through analyzing the continuous attributes extracted from Table
The clustering results of
Top 5 “most similar” | Top 5 “least similar” | ||
---|---|---|---|
Before clustering |
|
|
|
|
|
|
|
|
|||
After clustering |
|
|
None of them stays in the same cluster with |
Similarity accuracy: |
Dissimilarity accuracy: |
||
|
|
|
|
Similarity accuracy: |
Dissimilarity accuracy: |
||
Comprehensive accuracy | Comprehensive similarity accuracy: |
Comprehensive dissimilarity accuracy: |
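One way to read the accuracy figures in the table above is sketched below. Since the defining formula is not reproduced here, the sketch assumes that similarity accuracy is the share of a user's top-5 “most similar” peers that land in the same cluster as that user, and that dissimilarity accuracy is the share of the top-5 “least similar” peers that land in a different cluster. The user identifiers, cluster labels, and function names are hypothetical.

```python
def similarity_accuracy(user, most_similar, cluster_of):
    """Share of the listed most-similar users that share the user's cluster."""
    same = sum(1 for other in most_similar if cluster_of[other] == cluster_of[user])
    return same / len(most_similar)


def dissimilarity_accuracy(user, least_similar, cluster_of):
    """Share of the listed least-similar users placed in a different cluster."""
    apart = sum(1 for other in least_similar if cluster_of[other] != cluster_of[user])
    return apart / len(least_similar)


# Hypothetical clustering result: user id -> cluster label.
cluster_of = {"u1": 0, "u2": 0, "u3": 0, "u4": 0, "u5": 1, "u6": 0,
              "u7": 1, "u8": 2, "u9": 2, "u10": 1, "u11": 2}

# Hypothetical survey answers for user u1.
most_similar = ["u2", "u3", "u4", "u5", "u6"]
least_similar = ["u7", "u8", "u9", "u10", "u11"]

print(similarity_accuracy("u1", most_similar, cluster_of))
print(dissimilarity_accuracy("u1", least_similar, cluster_of))
```

Averaging these per-user values over all surveyed users would be one way to obtain comprehensive figures; that averaging is likewise an assumption.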
As indicated in (
We keep adjusting the number of clusters
Parameter estimation of
In the following experiments, we make use of the 20 continuous user attributes in Table
To verify the efficiency of CUCA-MD in user clustering in Web-based learning systems, we compare it with three other algorithms, which are also based on mixed data, namely,
With the clustering result, we use statistics to make an analysis. In Table
We compare the clustering results using user similarity accuracy and user dissimilarity accuracy. Figure
Clustering result analysis (30 h).
The collection and analysis of learning behaviors is an ongoing process, so we illustrate the relationship between average learning length and user clustering accuracy. From Figures
Clustering result of different phases.
Besides, we can verify clustering accuracy by analyzing the structure of the user clustering results. A clustering algorithm performs best when it reaches the smallest distance within a cluster and the biggest distance between clusters; thus, we use the evaluation criteria of Relative Distance (the ratio of average intercluster distance to average intracluster distance) and Sum Distance (the sum of object distances within all the clusters). The larger the Relative Distance and the smaller the Sum Distance, the better the clustering results. From Figure
Clustering structure analysis (30 h).
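The two structural criteria can be sketched as follows, assuming a precomputed pairwise distance matrix (for example, one minus the user similarity) and counting each unordered pair of users once. The function name, the toy matrix, and the pair-based averaging are assumptions made for illustration; the paper's exact computation may differ.

```python
from itertools import combinations


def clustering_distances(distance, labels):
    """Return (relative_distance, sum_distance) for a clustering.

    relative_distance: average inter-cluster pair distance divided by the
                       average intra-cluster pair distance.
    sum_distance:      sum of pair distances inside all clusters.
    """
    intra, inter = [], []
    for i, j in combinations(range(len(labels)), 2):
        (intra if labels[i] == labels[j] else inter).append(distance[i][j])
    avg_intra = sum(intra) / len(intra)
    avg_inter = sum(inter) / len(inter)
    return avg_inter / avg_intra, sum(intra)


# Toy symmetric distance matrix for four users (illustrative values only).
distance = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]

good = clustering_distances(distance, [0, 0, 1, 1])
poor = clustering_distances(distance, [0, 1, 0, 1])
print(good)
print(poor)
```

On this toy input, the grouping that keeps close users together yields the larger Relative Distance and the smaller Sum Distance, in line with the criteria described above.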
In this paper, we proposed a coupled user clustering algorithm based on mixed data for Web-based learning systems (CUCA-MD), which incorporates the intracoupled and intercoupled correlation of user attributes with different features. The algorithm builds on the fact that both discrete and continuous data exist in learning behavior information and analyzes each kind according to its own features. In the analysis, CUCA-MD takes into account intracoupling and intercoupling relationships and builds separate user similarity matrices for discrete attributes and continuous attributes. We then obtain the integrated similarity matrix by weighted summation and perform user clustering with a spectral clustering algorithm. In the experiments, we verify that the proposed CUCA-MD outperforms the compared algorithms in user clustering in Web-based learning systems, through user study, parameter estimation, user clustering, and result analysis.
In this paper, we analyze the discrete data and continuous data generated in online learning systems with different methods and build separate user similarity matrices for attributes with discrete and continuous features, which makes the algorithm more complicated. In future studies, we hope to process continuous data and discrete data simultaneously while taking into account the coupling correlation of user attributes, which we expect to further improve algorithm efficiency.
The authors declare that there is no conflict of interests regarding the publication of this paper.
This work is supported by the National Natural Science Foundation of China (Project no. 61370137), the National 973 Project of China (no. 2012CB720702), and Major Science and Technology Project of Press and Publication (no. GAPP_ZDKJ_BQ/01).