Blogs are a popular way to express opinions on the Internet. Owing to their popularity and public character, blogs attract the attention of many researchers. In this paper we compare two national blogospheres (Polish and American) from different angles, such as the characteristics of messages and interactions, the structure of social groups, the topics discussed in them, and the influence of real-world events on the behavior of such groups. Our approach combines user activity at both the individual and the community level. The comparison reveals several differences and the distinct characters of the two portals. Methods for the analysis of group dynamics, user roles, and topics in groups are presented.
Nowadays a large part of our lives has moved to the Internet, particularly to social media. It is hard to imagine that we would stop using them. Willingly or not, we are present in them, even if only passively, searching for sources of information. A large part of official and unofficial life has moved there. There are various reasons for this situation, but one thing can be said with certainty: this is a process that cannot be stopped. The majority of us are only passively involved in it, treating the various forms of social media as sources of information, that is, places where one can learn something. But there are also people who participate in social media actively and creatively: expressing their opinions, commenting on those of others, promoting the opinions of others, and so forth. They leave many “traces” of their activities, which can then be analyzed to find interesting patterns of human life that can be used in marketing, business, politics, or public security.
Social media may take many forms, for example, blogs, forums, media sharing systems, microblogging, social networking, and wikis. Among them, blogs play a special role. The term “blogosphere,” first introduced by Brad L. Graham in 1999, should be understood as a term describing all blogs. Observing the development of the blogosphere, one can say that blogs have come a long way from frivolous diaries to very serious sources of information. Undoubtedly, the reasons for this are the development of tools for creating blogs and the fact that many important people have discovered that blogs are a very good place to express their opinions and to observe an immediate response to them. It is believed that blogs have become a flywheel for the development of online social networking [
The blogosphere is an interesting source of data for analysis. It is characterized (in most cases) by high dynamics: posts and comments on them are added frequently; one can analyze the reactions of readers to the posts, in terms of both response speed and emotion (sentiment analysis). One can analyze the themes of posts and find those that attract the greatest interest (receiving the most comments), as well as the users who generally write such influential posts. Until recently, the analysis of the processes taking place in the blogosphere was the domain of research conducted mainly by psychologists and sociologists. These studies were limited in scope because of problems with data collection. The development of technological capabilities allowing for automatic and incremental collection of any amount of data from the blogosphere, and for storing them in huge databases, has significantly broadened the possible directions of research.
The paper presents a comparison of various aspects of user activity in the Polish and American blogospheres.
To our knowledge, no comparative analysis of two blogospheres as comprehensive and wide-ranging as the one presented here exists. Particular areas of research appear in single studies. In some articles the authors analyze groups in the blogosphere (but without taking into account the dynamics of change); others examine influential bloggers or analyze topics of discussion. Our approach assumes a broad comparison of two national blogospheres: analyzing the structure of the groups that form and persist for a period of time, comparing the roles users play both within a group and globally in the whole network, identifying topics of conversation, and studying the reaction time for posts in the different blogospheres. Such a global approach to the analysis of users allows creating much more advanced user profiles, at both the individual and the global level, and finding user characteristics that are common to different nationalities as well as those that differentiate them.
The structure of the paper is as follows. In Section
The blogosphere soon became an interesting research area for psychologists and sociologists. The research methodology was largely based on designing questionnaires and putting questions to a properly selected group of respondents (selected according to, e.g., demography). The results of the analysis depended strictly on the truthfulness of the responses and on the sample size of blogs, which, because of the need for manual processing, was not big. The most interesting subject of research was to determine why people started a blog and what reasons they had for continuing to write. Researchers also tried to find differences based on the gender and demographics of bloggers.
Initially, these analyses concerned a single nationality. Later, blogs belonging to representatives of different nations were analyzed to find out whether there were any differences related to cultural diversity. Analyses of individual nationalities concerned tracking changes in the demographics of bloggers or studying certain groups of bloggers.
The vast majority of authors [
In [
Since computer scientists became interested in the analysis of the blogosphere, research has sped up, because it is now possible to collect data from the blogosphere automatically using web crawlers, save them in large, efficient databases, and perform virtually any analysis on such data. There is thus no need to design an experiment, invent questions, collect responses, and analyze only those data. Usually all data from a page are collected, such as demographic information, text of posts, comments (as well as information about their authors), links, tags, dates, and all other kinds of available information. Research directions are now much less related to demography, because such data are usually not available. Because all data are available in a database, one can freely invent and change the directions of analysis. Generally, this research can be divided into two directions: structure analysis and content analysis.
One of the directions of the analysis was to use methods of social network analysis [
In other studies [
The first approach to find clusters in blogospheres and recognize the structure was in [
In [
In [
A social network is not a homogeneous structure; rather, it consists of areas in which vertices communicate with each other more frequently than with vertices outside the given area. Such areas are called groups (also communities, modules, clusters, or subgroups). There are many methods of finding such groups, which may or may not overlap [
Even though most methods have been developed for static environments, many researchers have recognized the need to better reflect the dynamic nature of most social networks (especially those coming from social media sites) [
Palla et al. in [
In social network analysis there are many definitions of role [
Roles in the literature are often discussed in the context of influences [
A lot of studies relate to certain social media and attempt to define their specific roles [
Aggarwal and Wang in [
Topic modeling [
In this section we describe the measures and methods applied to the comparison of the two blogospheres, American and Polish. First, we provide definitions of the measures used to assess different characteristics. Next, we describe methods for the analysis of group dynamics, user roles, and topics in groups.
The lifetime lt of a post
In other words, the lifetime of a post is the time span between the writing of the post and the last comment on that post.
The reaction time
The reaction time for a post is the time span between the writing of the post and the first comment on that post.
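The two measures can be illustrated with a minimal sketch. The function names, the use of `datetime` timestamps, and the convention of returning `None` for a post without comments are assumptions made for illustration only; they are not taken from the implementation used in the experiments.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence


def lifetime(post_time: datetime,
             comment_times: Sequence[datetime]) -> Optional[timedelta]:
    """Time between writing the post and its last comment.

    Returns None for a post without comments (an assumed convention).
    """
    if not comment_times:
        return None
    return max(comment_times) - post_time


def reaction_time(post_time: datetime,
                  comment_times: Sequence[datetime]) -> Optional[timedelta]:
    """Time between writing the post and its first comment."""
    if not comment_times:
        return None
    return min(comment_times) - post_time


# Hypothetical post with three comments (timestamps are made up).
post = datetime(2012, 1, 1, 12, 0)
comments = [
    datetime(2012, 1, 1, 12, 30),
    datetime(2012, 1, 2, 9, 0),
    datetime(2012, 1, 1, 14, 0),
]

post_lifetime = lifetime(post, comments)       # 21 hours
post_reaction = reaction_time(post, comments)  # 30 minutes
```

Both measures are undefined for posts that received no comments, so such posts have to be excluded or handled separately when distributions of these values are computed.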
To analyse groups dynamics, whole range of time was divided into smaller periods of time (called later
Identification of continuation between groups
Using above measures we can define transition
Now we can label transitions:
- addition: when a small group attaches to a big one
- deletion: when a small group detaches from a big one
- merge: when many groups join together into a bigger one
- split: when a group divides into 2 or more groups in the next time slot
- split_merge: a combination of the split and merge events
- constancy: simple continuation of a group without a significant change of size
- change_size: simple continuation of a group with a significant change of size
- decay: when a group disappears in the next time slot
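The labels above can be sketched as a simple decision procedure. This is only an illustration under stated assumptions: the function `label_transition`, its inputs (sizes of the linked groups in two consecutive time slots), and both thresholds (`small_ratio`, `size_tolerance`) are hypothetical and do not reproduce the exact measures or threshold values used in the paper.

```python
from typing import Sequence


def label_transition(pred_sizes: Sequence[int],
                     succ_sizes: Sequence[int],
                     small_ratio: float = 0.1,
                     size_tolerance: float = 0.05) -> str:
    """Label one transition between linked groups in consecutive time slots.

    pred_sizes: sizes of the linked groups in the earlier slot.
    succ_sizes: sizes of the groups that continue them in the next slot.
    small_ratio, size_tolerance: hypothetical thresholds deciding what
    counts as a "small" group and a "significant" change of size.
    """
    if not pred_sizes:
        raise ValueError("at least one predecessor group is required")
    if not succ_sizes:
        return "decay"  # the group has no continuation in the next slot
    if len(pred_sizes) == 1 and len(succ_sizes) == 1:
        before, after = pred_sizes[0], succ_sizes[0]
        relative_change = abs(after - before) / before
        return "constancy" if relative_change <= size_tolerance else "change_size"
    if len(pred_sizes) > 1 and len(succ_sizes) > 1:
        return "split_merge"
    # Exactly one side holds several groups: check whether all but the
    # largest of them are small relative to the largest one.
    many = pred_sizes if len(pred_sizes) > 1 else succ_sizes
    largest = max(many)
    others = sorted(many)[:-1]
    all_small = all(size <= small_ratio * largest for size in others)
    if len(pred_sizes) > 1:
        return "addition" if all_small else "merge"
    return "deletion" if all_small else "split"
```

For instance, under these assumed thresholds a group of 100 users joined by a group of 5 users would be labeled `addition`, whereas two groups of 40 and 35 users joining together would be labeled `merge`.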
In above definitions we used function
Users can play different roles on a global level and different ones in each of the groups they belong to (local level of roles). The set of roles we use for analysis in this paper was proposed by us in [
The presented roles take into consideration responses from other users on the content the user writes (in both the form of posts and comments). To meet such assumptions, we defined
Using the above definitions we can describe the set of roles: Influential User (infUser): Influential Blogger (infBlog): Influential Commentator (infComm): Standard Commentator (comm): Not Active (notActive): Standard Blogger (stdBlog): user that does not match any from above roles.
Topics for groups were assigned based on clusters uncovered by the LDA method. The method for analyzing topics in groups was used by us in [
The whole method can be described as the following sequence of steps. First, we used the LDA method provided by the mallet tool (
We can formalize it in the following way. Let us define
Using above notation we can define topics for a group
In this section we compare the Polish and American blogospheres from different points of view, especially in terms of user activity, group formation, and the topics discussed by users in groups. For this purpose, we chose one dataset as a representative of the Polish blogosphere and one as a representative of the American blogosphere.
The first dataset contains data from the portal
Categories of posts.
The second dataset is the
Moreover, because of performance issues with the group extraction method, we eliminated the edges with weight equal to one in each time slot before detecting communities. The other types of analysis (such as role finding) were conducted on the full graphs, without any edge removal.
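The preprocessing described above can be sketched as follows. The data layout (one dictionary of weighted directed edges per time slot), the user names, and the interpretation of an edge weight as a positive count of interactions are assumptions made for illustration; the function name `drop_weak_edges` is hypothetical.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]  # (source user, target user); assumed representation


def drop_weak_edges(slot_edges: Dict[Edge, int]) -> Dict[Edge, int]:
    """Keep only edges whose weight is greater than one.

    Weights are assumed to be positive integer counts, so this removes
    exactly the edges with weight equal to one.
    """
    return {edge: weight for edge, weight in slot_edges.items() if weight > 1}


# Hypothetical graphs for two time slots.
slots: Dict[int, Dict[Edge, int]] = {
    0: {("anna", "ben"): 1, ("ben", "cara"): 3},
    1: {("anna", "cara"): 2, ("cara", "anna"): 1, ("ben", "anna"): 5},
}

# Input for community detection: weak edges removed in every slot.
community_input = {slot: drop_weak_edges(edges) for slot, edges in slots.items()}

# Input for role analysis: the full graphs, left untouched.
role_input = slots
```

Keeping the two inputs separate makes it explicit that the edge removal affects only group extraction and not the other analyses.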
As we can observe in Table
Comparison of data quantity in both datasets.
Measure | | |
---|---|---|
Number of posts | 380 700 | 414 225 |
Number of posts without comments | 74 979 (19.7%) | 45 604 (11%) |
Average number of comments in one post | 18.65 | 48.28 |
Number of comments | 5 703 140 | 17 796 819 |
Number of comments to posts | 2 781 303 (48.77%) | 6 961 369 (39.12%) |
Number of comments to other comments | 2 921 837 (51.23%) | 10 753 162 (60.88%) |
Number of authors | 31 750 | 680 341 |
Number of authors of posts | 10 131 (31.91%) | 1 027 (0.15%) |
Number of authors of comments | 29 536 (93.03%) | 661 676 (97.26%) |
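The derived values in the table can be recomputed from the raw counts. The short check below uses only numbers from the table; the keys `first` and `second` simply denote the first and second data columns, and the variable names are chosen for illustration. It suggests that the reported average number of comments per post is consistent with averaging over posts that received at least one comment (dividing by all posts would give roughly 14.98 and 42.96 instead).

```python
# Raw counts copied from the table; "first"/"second" are the two data columns.
posts = {"first": 380_700, "second": 414_225}
posts_without_comments = {"first": 74_979, "second": 45_604}
comments = {"first": 5_703_140, "second": 17_796_819}


def percent(part: int, whole: int) -> float:
    """Share of `part` in `whole`, in percent, rounded to two decimals."""
    return round(100 * part / whole, 2)


# Share of posts that received no comments.
uncommented_share = {
    column: percent(posts_without_comments[column], posts[column])
    for column in posts
}

# Average number of comments per post that has at least one comment.
avg_comments_per_commented_post = {
    column: round(
        comments[column] / (posts[column] - posts_without_comments[column]), 2
    )
    for column in posts
}
```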
Figure
Lifetime for posts.
Figure
Reaction time for a post.
For group extraction we used CPM method (CPMd version which is designed to discover groups in directed networks) from CFinder (
Figure
Number of stable groups at given size.
In Figure
Percentage of stable groups in relation to all groups.
Figure
Number of events.
Figures
Number of groups and evolution events in time and correlation with real-world events for
Number of groups and evolution events in time and correlation with real-world events for
Figures
Topics discussed in at least 10% of all groups in
Topics discussed in at least 10% of all groups in
One can notice that
When we look at the results of both methods of associating topics with groups, we can see that they are quite similar (in terms of the proportions of the different topics).
Figures
Global roles of users.
Local roles of users.
As far as local roles are concerned, a few interesting observations can be made. First, the number of inactive users is much lower than in the previous case: this means that most inactive users (in fact, the conditions in the experiments allow them to write no more than one comment) are outside groups, which is understandable. Moreover, the number of
In the paper, a comparative analysis of two different blogospheres, Polish and American, is presented. The approach is based on a comprehensive analysis of the structure and content of the blogosphere.
The preliminary analysis of the structure of both blogospheres shows that discussions conducted in
Differences in the number of these groups are significant: in
In turn, events have a big impact on the dynamics of both blogospheres. Due to the different nature of
Thus, the comparison of the two blogospheres gave interesting results: in some aspects nationality does not matter, but in others it has a big impact on user behavior. One can see differences in the characteristics of people from different countries in the context of their activity in social media (taking into account its dynamic nature), for example, in the categories of topics they find interesting, the speed of reaction to novelty, and the way of reacting to different categories of world events. The presented approach may have many practical applications. It can, for example, support sociologists and psychologists in their research on behavioral analysis in different national communities (e.g., among emigrants). The results of our experiments suggest that in marketing, for example, user profiles should take nationality into account, and therefore product marketing campaigns (e.g., a global advertising campaign) should be differentiated by country. Similarly, to predict customer behavior, one should take the context of nationality into account. These observations can also be used in the development of election campaigns.
The research can be continued in several ways. One of them is analyzing and comparing differences in sentiment, for example, to determine which nation is more optimistic. Another direction could be comparing the ability to predict the future of groups in both blogospheres. Furthermore, extending the comparison to other national blogospheres could reveal further characteristics related to nationality.
The authors declare that there is no conflict of interests regarding the publication of this paper.
The research reported in the paper was partially supported by the Grants nos. INNOTECH-K2/IN2/89/182461/NCBR/13 and 008/R/ID1/2011/01 from the Polish National Centre for Research and Development.