Massive open online courses (MOOCs) give people access to free courses offered by top universities around the world and have therefore attracted great attention and engagement from college teachers and students. However, in contrast to their large-scale enrollment, the completion rate of these courses is very low. One reason students quit is that they encounter problems they cannot solve by discussing them with classmates. To keep such students in the course and thereby improve the completion rate, we address the task of recommending study partners based on both content information and social network information. By analyzing the content of messages posted by learners in the course discussion forum, we investigate learners’ behavior features and classify the learners into three groups. We then propose a topic model to measure learners’ course knowledge awareness. Finally, we construct a social network from learners’ activities in the course forum and use the relationships in this network, combined with behavior features and course knowledge awareness, to recommend study partners for a target learner. The experimental results show that our method outperforms a recommendation method based only on content information.
Massive open online courses (MOOCs) provide free learning opportunities to people worldwide. The term “MOOC” denotes a course that is open, with no barriers to entry in terms of either cost or educational prerequisites. The courses are also online, accessed on the Web, and massive, requiring a significant number of students to contribute to a connected learning environment. Currently, there are two types of MOOCs: cMOOCs and xMOOCs. In 2008, George Siemens and Stephen Downes opened a course called Connectivism and Connective Knowledge Online Course (CCK08), the first cMOOC. The “c” in cMOOC stands for connectivist: the approach is based on the idea that learning happens within a network, where learners use digital platforms such as blogs, wikis, and social media to make connections with content and to create and construct knowledge by themselves. However, the xMOOCs such as
The xMOOCs have attracted tremendous numbers of users and have played an increasingly important role in online learning. Millions of learners with different professional backgrounds and motivations came from different countries and gathered in the same “classroom”; therefore, higher education is broadened to an unprecedented worldwide range. According to the report from Columbia University [
However, large enrollment is accompanied by another problem: low completion rates [
Many reasons can cause the low completion rate [
With the rise of Web 2.0, social networking is increasingly attracting the attention of academic and industry researchers. Social network analysis (SNA) is a technique for analyzing social networks based on network theory. Recently, researchers in higher education have begun to use social network analysis to help solve problems in the field. These methods focus on improving teaching quality by introducing social networking into traditional classrooms or by analyzing the relationships between students in the same class. Chen et al. [
The aim of this paper is to provide a method for recommending study partners to students in the same xMOOC course. In general, a student who has a question about a course concept can ask for help in the course forum, but the message may be buried under numerous other posts and never receive an answer. The recommendation system suggests peers who may be able to answer the question, so that the student can ask for help more directly. To achieve this objective, we first establish a behavior model that describes learners’ behavior features by analyzing forum activities such as posts and comments. We then propose a topic model with a term dictionary to compute the topic similarity among forum learners. Meanwhile, we extract a social network among all learners by treating each conversation in the forum as a relationship edge. Finally, we recommend study partners with high topic similarity and high relationship strength to the target learner. Such recommendations improve the chance that learners solve the problems they encounter during the learning process and thereby help sustain their enthusiasm for learning. The main contributions can be summarized as follows.
We introduce a method for classifying learners based on behavior features derived from their activities in the course forum. Learners can be divided into three groups: like-questioning learner (
We propose an approach based on the latent Dirichlet allocation (LDA) model to measure learners’ course concept awareness. To make the topic similarity among learners computable, we construct a term dictionary that constrains the topic words, based on the observation that the discussion topics in a course forum do not change dramatically.
We construct a social network according to the learners’ activities in the course discussion forum and employ social network analysis to measure the strength of the relationship tie among learners.
We provide an in-depth analysis of the experimental results, revealing that using learners’ behavior features, topic similarity, and relationship tie strength together is of great benefit when recommending study partners for xMOOC learners.
The rest of this paper is organized as follows. In Section
Recommendation systems can be divided into two areas of focus: object recommendation and link recommendation [
Collaborative filtering (CF) recommendation suggests items to people based on users who are similar to them. The “friends-of-friends” method can be seen as an adaptation of the ideas in collaborative filtering. However, the reasons why a person adds a new friend are complicated. For example, one may accept a new friend because they study at the same school, come from the same city, or have similar interests. The “friends-of-friends” method may therefore recommend people who are not similar to the user and who may share information the user is not interested in. Bian and Holtzman [
The articulated social network structure can be considered as a social graph, and the task of recommending friends to a specific user is then equivalent to predicting new links in this graph. Graph-based approaches recommend friends by considering the local features [
Content-based recommendation relies on the user’s previous behavior or interests. User behaviors such as web browsing and interaction with other members often reflect user interest, as does user self-generated content such as user profiles, semantic tags, and posted messages. Manca et al. [
In this paper, we combine the content-based approach with the graph-based approach. We first establish a user behavior model to represent each user’s behavior. We also construct a relation network from users’ interactions in the course forum. We then incorporate the user behavior features into the network and recommend friends for users based on similarity measures. Unlike in other friend recommendation systems, the relationship between learners is not an explicit friendship tie but only a latent communication link in the course forum. Such ties form more randomly, and their weights are weaker. In addition, when recommending a study partner for a learner, we should pay more attention to specialty than to diversity.
An overview of our proposed study partner recommendation system framework is outlined in Figure
Proposed recommendation framework.
We save the crawled data to a database and preprocess it, performing word segmentation and removing stop words and other specific tokens such as “http://,” “www,” “href,” “com,” and “org.”
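The preprocessing step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `preprocess`, the tiny stop-word list, and the regular-expression tokenizer are all assumptions made for the example, and a real system would use a fuller stop-word list and a proper word segmenter.

```python
import re
from typing import List

# Hypothetical, deliberately small stop-word list (assumption for illustration).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "for", "see"}

# Web-specific tokens of the kind named in the text; "http" stands in for "http://".
WEB_TOKENS = {"http", "https", "www", "href", "com", "org"}


def preprocess(message: str) -> List[str]:
    """Lowercase a message, split it into alphabetic tokens,
    and drop stop words and web-specific tokens."""
    tokens = re.findall(r"[a-z]+", message.lower())
    return [t for t in tokens if t not in STOP_WORDS and t not in WEB_TOKENS]


if __name__ == "__main__":
    print(preprocess("See http://www.example.org for the variance of a sample"))
```

Splitting on non-alphabetic characters is a simplification that works for English text; forums in other languages would need a language-specific segmenter.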
The detail of user behavior model and user topic awareness model will be described in Section
In this section, we introduce our approach to study partner recommendation, in particular the user model and the recommendation based on the social graph.
There are three different user roles in Coursera courses:
The weekly number of messages.
In the 10 selected courses, a total of 53,662 messages (including posts and comments) were submitted by 13,064 students who participated in the forum discussion. However, as is shown in Figure
The number of messages versus the number of students.
Therefore, active students in the course forum are candidate users for study partner recommendation.
The messages posted in the course discussion forum can be roughly divided into three categories [
For convenience, the course team often creates subforums to organize the messages in the course discussion forum. Each subforum indicates the category of content discussed in it. We can therefore label the category of every message simply by checking which subforum it belongs to.
As we know, some students prefer to ask more questions while some students prefer to answer more questions. We propose the user model to represent user’s behavior and classify users into three classes:
The message in the course discussion forum may be one of three types.
To identify whether a message is a question or a nonquestion, we simply check whether it contains a question word such as “what,” “why,” or “which” or a question mark (“?”).
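The rule above can be sketched in a few lines. The function name `is_question` is hypothetical, and the word list contains only the three examples given in the text; the full list used in the experiments is not stated.

```python
import re

# Question words taken from the examples in the text; the complete list is an assumption.
QUESTION_WORDS = {"what", "why", "which"}


def is_question(message: str) -> bool:
    """Return True if the message contains a question mark or a question word."""
    if "?" in message:
        return True
    words = re.findall(r"[a-z]+", message.lower())
    return any(word in QUESTION_WORDS for word in words)


if __name__ == "__main__":
    print(is_question("Why does the variance increase"))
    print(is_question("Thanks, that solved it."))
```

A keyword rule of this kind is cheap but imprecise: a statement such as "the method which we used" would also be labeled a question.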
As the author of a message, a user can initiate a new discussion thread, reply with a new post to an existing thread, or add a new comment on a post. Therefore, users can take three roles, namely, thread starter, poster, and commenter, and the same user can take all three.
Suppose a student
To recommend learning partners for a target student, we should consider not only the behavior characteristics of individual students but also how closely students are related in terms of course content. Therefore, we propose a topic model to measure the similarity of students over topics.
Latent Dirichlet allocation (LDA) is a widely used generative model for inferring the latent topic distribution in a large corpus. All documents are represented as a “document-word” matrix and are clustered into several topics; the distribution of each document over topics and the distribution of each topic over words are then calculated. Figure
Classical LDA topic model and the proposed topic model for study buddy recommendation.
The graphical representation of LDA model
The graphical representation of LDA model with term dictionary (TDLDA)
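For reference, the classical LDA model of Blei et al. can be written as the joint distribution below for a single document with N words. The symbols are the standard ones (topic proportions θ, topic assignments z, words w, Dirichlet prior α, topic-word parameters β) and are not necessarily the notation used in the rest of this section.

```latex
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)
```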
Generally, topics in documents evolve as new topics emerge; in xMOOC courses, however, most learning materials are provided by the course team and the topics center on fixed course concepts. The topics in the course forum therefore also focus on a limited set of course concepts, and the topic words for each concept are stable. We can thus construct a term dictionary for each course in advance, infer the topic distribution of the forum discussion over these terms, and deduce the topic distribution of every student over the topic terms. The proposed topic model can be represented by a graphical model as Figure
Suppose there are
For all threads
For each of relationship edges
choose one of
choose one of
choose the edge
Here, different from the original LDA model, the relationship edge between two students is described as the distribution over term dictionary rather than a word. That is, every edge is generated by a mixed model of some discrete words. Given the parameters
Our goal is to estimate the topic distribution conditional probability given by parameters
Estimating this probability is intractable, and we follow variational inference to approximate the probability. So we define the following variational distribution for all topic dictionary words:
According to original LDA model, we compute the derivative
The topic similarity of two students is measured by the distance between their topic awareness. Combining the topic similarity with the relationship matrix among all students in the same course, we can construct a relationship network. Recommendations can then be made according to the edge weights between students.
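As a rough illustration of a topic similarity between two students, the sketch below assumes that each student's topic awareness is a vector with one value per topic and that similarity is the sum over topics of the product of the two values (an inner product). This is one possible reading, not the paper's exact formula, and the name `topic_similarity` is hypothetical.

```python
from typing import Sequence


def topic_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two topic-awareness vectors of equal length.

    Assumption: larger values mean the two students are aware of the same topics.
    """
    if len(a) != len(b):
        raise ValueError("topic vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    same_focus = topic_similarity([0.5, 0.5, 0.0], [0.5, 0.5, 0.0])
    disjoint = topic_similarity([0.5, 0.5, 0.0], [0.0, 0.0, 1.0])
    print(same_focus, disjoint)
```

Other measures over topic distributions, such as cosine similarity or a divergence-based distance, would fit the same interface.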
In the course forum, each conversation not only transfers information from one user to another but also establishes a relationship edge between the two users. According to the structure of the Coursera course forum, all messages in a thread focus on the same problem, and a thread contains many posts and comments. Therefore, there are edges between the thread starter and all posters and commenters in the thread, as well as between each poster and all commenters on that post. All possible edges in one thread are illustrated in Figure
The relationships between the forum users.
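The edge extraction for one thread can be sketched as below. The thread layout (a dictionary with a `starter` and a list of `posts`, each with an `author` and a list of `comments` given as author IDs) is a hypothetical data structure, and the edge direction, from the responder to the user being answered, is an assumption made for the example.

```python
from collections import Counter
from typing import Dict


def thread_edges(thread: Dict) -> Counter:
    """Count directed edges (responder, addressee) within a single thread.

    Posters and commenters are linked to the thread starter, and
    commenters are also linked to the author of the post they comment on.
    Self-loops are skipped.
    """
    edges: Counter = Counter()
    starter = thread["starter"]
    for post in thread["posts"]:
        poster = post["author"]
        if poster != starter:
            edges[(poster, starter)] += 1
        for commenter in post["comments"]:
            if commenter != starter:
                edges[(commenter, starter)] += 1
            if commenter != poster:
                edges[(commenter, poster)] += 1
    return edges


if __name__ == "__main__":
    example = {
        "starter": "s1",
        "posts": [
            {"author": "s2", "comments": ["s3", "s1"]},
            {"author": "s3", "comments": []},
        ],
    }
    print(thread_edges(example))
```

Summing these counters over all threads of a course gives a raw interaction count per ordered pair of students, which can then be turned into edge weights.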
As previous analysis depicts, we can divide all students into three groups:
The type of relationship between learners.
From | To
According to Table
Therefore, the relationship network between students can be represented by a weighted directed graph. Consider two neighbor nodes,
The weight between two users consists of two terms. The first term is the topic awareness similarity between two nodes, that is, the product of topic awareness on all
As noted above, the relationship among students is a weighted directed graph. Let us start with a simple network as shown in Figure
Example of a learner’s network.
In this graph, students C and D are direct neighbors of E, and students A and B are indirect neighbors. A friends-of-friends (FOF) recommendation would favor A, because A is connected to E through both C and D. In the weighted graph, however, B is the best recommendation because of its strong connection with D.
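The contrast with friends-of-friends counting can be sketched with a small weighted graph. The scoring rule used here, the product of the two edge weights along each two-hop path summed over all paths, is an assumption chosen for illustration rather than the paper's scoring formula, and the edge weights are made up.

```python
from typing import Dict, List, Tuple

Graph = Dict[str, Dict[str, float]]


def two_hop_scores(graph: Graph, target: str) -> List[Tuple[str, float]]:
    """Score indirect neighbors of `target` and return them best first.

    Assumed rule: each path target -> neighbor -> candidate contributes
    the product of its two edge weights; contributions are summed.
    """
    direct = graph.get(target, {})
    scores: Dict[str, float] = {}
    for neighbor, w1 in direct.items():
        for candidate, w2 in graph.get(neighbor, {}).items():
            if candidate == target or candidate in direct:
                continue  # skip the target itself and existing neighbors
            scores[candidate] = scores.get(candidate, 0.0) + w1 * w2
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical weights: A is reachable through both C and D, but only weakly;
    # B is reachable through D alone, over a strong edge.
    graph: Graph = {
        "E": {"C": 0.5, "D": 0.5},
        "C": {"A": 0.2},
        "D": {"A": 0.2, "B": 0.9},
    }
    print(two_hop_scores(graph, "E"))
```

With these weights B ranks above A even though A shares two neighbors with E and B only one, which is the behavior the example network is meant to show.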
To find recommendations for target student
The final score is the sum of all
To validate our approach to the study partner recommendation problem in xMOOC courses, we gathered data through the Coursera API. We crawled basic information on all courses, such as course name, start date, and course category, as well as the forum messages of 15 courses in the “Statistics and Data” category that were in progress or had just finished, and saved them to a database. We tested our LDA with term dictionary and the corresponding algorithm against the original LDA model.
Coursera provided an API for accessing its basic data: requesting specific URLs returned data in JSON format. We gathered three datasets: course information, user information, and forum message information. The course information includes course ID, course name, course categories, instructor, open times, starting date, and duration. The user information includes user ID, nickname, gender, birth date, and user role. The forum message information includes message type (thread/post/comment), message ID, title, author ID, posted time, tags, and content, as well as the hierarchy of these messages. Our data consist of 633 courses, 15 course forums, and over 10000 users involved in course forums. We then preprocessed the data by selecting 10 course forums whose courses lasted more than 7 weeks or had finished and by deleting inactive users. The final data information is as shown in Table
Dataset information.
Number of courses  10 
Number of threads  6684 
Number of posts  36669 
Number of comments  21879 
Number of active students  8614 
According to (
The question index of active students.
In this paper, we choose parameter
To verify the validity and accuracy of the proposed method, we split the messages of each course forum into two parts according to posted time, taking the first part as training data and the second part as validation data. After training, we randomly select students from the training data as target users and recommend study partners to each of them. If a recommended student communicates with the target user in the second part of the data, the recommendation is valid.
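The evaluation protocol can be sketched as follows. The message representation (a tuple of timestamp, author, and addressee), the function names, and the definition of precision as valid recommendations divided by all recommendations made are assumptions for illustration; communication in either direction is counted as contact.

```python
from typing import Dict, List, Set, Tuple

Message = Tuple[int, str, str]  # (timestamp, author, addressee), hypothetical layout


def split_by_time(messages: List[Message], split_time: int) -> Tuple[List[Message], List[Message]]:
    """Split messages into training (before split_time) and validation (at or after)."""
    train = [m for m in messages if m[0] < split_time]
    valid = [m for m in messages if m[0] >= split_time]
    return train, valid


def precision(recommended: Dict[str, List[str]], valid_messages: List[Message]) -> float:
    """Fraction of recommendations confirmed by contact in the validation data."""
    contacts: Set[frozenset] = {frozenset((a, b)) for _, a, b in valid_messages}
    total = sum(len(recs) for recs in recommended.values())
    if total == 0:
        return 0.0
    hits = sum(
        1
        for target, recs in recommended.items()
        for rec in recs
        if frozenset((target, rec)) in contacts
    )
    return hits / total


if __name__ == "__main__":
    messages = [(1, "u1", "u2"), (2, "u2", "u3"), (5, "u1", "u3"), (6, "u4", "u1")]
    train, valid = split_by_time(messages, split_time=4)
    print(precision({"u1": ["u3", "u2"]}, valid))
```

In this toy run, u3 is a valid recommendation for u1 because the two communicate after the split, while u2 is not.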
Suppose
By varying the split time week by week, we repeat the recommendation experiment with both the LDA model and the LDA model with term dictionary (TDLDA). The final average recommendation precision of 10 courses is shown in Figure
Comparison of recommendation precision between LDA and TDLDA.
In the first 3 weeks, the number of students is large while forum activity is still scarce, so the relationship matrix of students is very sparse and the behavior features of students are inaccurate. The recommendation accuracy of both methods is therefore low, and the difference between them is not significant. As time passes, the behavior features become more accurate and the number of relationship edges grows, while the total number of students decreases because some students drop out of the course. The accuracy of both methods increases, but the proposed method performs much better than the traditional LDA.
After tuning the recommendation model, we applied it to recommend study partners for the remaining 5 courses we gathered from Coursera. We put learners into three groups by analyzing forum messages in the first 4 weeks and randomly select 50
Recommendation precision for the remaining 5 courses.
Course  Accuracy 

Big Data in Education  6.36% 
Computational Methods for Data Analysis  12.53% 
Financial Engineering and Risk Management Part I  10.47% 
Computational Molecular Evolution  10.68% 
High Performance Scientific Computing  11.73% 
In the remaining 5 courses, the recommendation precision is above 10% for every course except Big Data in Education (bigdataedu001). This is because the number of active learners in this course is much smaller than in the other courses: there are only 418 active learners in bigdataedu001, while each of the other courses has over 800.
We presented a method for study partner recommendation in xMOOC courses to address the problem of helping students finish their learning process and improving the completion rate of xMOOC courses.
By studying this problem with support from social network analysis and topic modeling, we conclude that the LDA model with term dictionary is more effective for recommendation than the original LDA model when modeling the topics of messages in xMOOC course forums. In this paper, we developed a study partner recommendation system based on the LDA model with term dictionary that produces relevant, high-quality recommendations and also provides insight into each individual’s behavior features and perception of course concepts. The results show that the proposed approach outperforms the original LDA approach.
The study partner recommendation system still has much room for improvement. The main cause of the limited recommendation accuracy is the lack of more specific behavior data on students. We also cannot reach students who have signed up for a course but have not left any message in the course forum. Furthermore, there are many other reasons for a student to drop out of a course, such as the gap between the student’s expectations and reality. How to design a more personalized course based on each student’s level is a bigger challenge.
The authors declare that there is no conflict of interests regarding the publication of this paper.
This work was supported by the Fundamental Research Funds for the Central Universities of China under Grant nos. N110316001 and N130404004, the Liaoning Province Science and Technique Foundation under Grant no. 20132170041, and the Ministry of Education-Intel Special Research Foundation under Grant no. MOEINTEL201206.