Analyzing the Impact of Storage Shortage on Data Availability in Decentralized Online Social Networks

Maintaining data availability is one of the biggest challenges in decentralized online social networks (DOSNs). Existing work often assumes that the friends of a user can always contribute sufficient storage capacity to store all published data. However, this assumption does not always hold in today's online social networks (OSNs), because users now often access OSNs from smart mobile devices. The limited storage capacity of mobile devices may jeopardize data availability. It is therefore desirable to know the relation between the storage capacity contributed by the OSN users and the level of data availability that the OSN can achieve. This paper addresses this issue. A data availability model over storage capacity is established, and a novel method is proposed to predict the data availability on the fly. Extensive simulation experiments have been conducted to evaluate the effectiveness of the data availability model and the on-the-fly prediction.


Introduction
In the last decade, online social networks (OSNs), such as Facebook [1], Twitter, and Sina Weibo [2], have gained extreme popularity with more than a billion users worldwide. OSNs allow a user to publish the data to all his friends in his friend circle.
Currently, OSN platforms are typically centralized: the users store their data on centralized servers deployed by the OSN service providers. The service providers can utilize and analyze these data to learn the users' private information, such as interests and personal affairs, and in the worst case may sell this information to third parties. Therefore, the current centralized online social networks (COSNs) have raised serious privacy concerns [3][4][5][6].
In order to address the data privacy issue, the decentralized online social networks (DOSNs) have been proposed recently [7][8][9][10][11]. Although the DOSN products [12] are not as popular and mature as the OSN products [1], DOSN is indeed under active research and development [13][14][15][16][17]. In DOSNs, in order to protect the data privacy the centralized servers are bypassed and the data published by a user are stored and disseminated only among the friend circle of the user [9,10]. Although DOSNs can help protect the data privacy, maintaining data availability becomes a big challenge. This is because if a friend of the user is offline, the data stored in the friend cannot be accessed by other friends.
In order to achieve good data availability in DOSN, the data replication approach has been widely used. In this approach, a certain number of data replicas are created for each data item published by a user and these data replicas are stored in the user's friend circle. By doing so, if a friend is offline, the data in this offline friend node can be accessed through the replicated data stored in other friend nodes.
In the existing data replication work in DOSN, it is typically assumed that the friends of a user are always capable of contributing sufficient storage capacity to store all the published data [9,14,18]. This assumption is often unrealistic today. Users now often access the OSN services from smart mobile devices, such as smart phones, whose resources are much more limited than those of the desktop computers used in the "old fashioned" style of accessing OSNs. Moreover, the number of friends in a friend circle is limited (typically less than 200) [19]. Therefore, it is desirable to know what level of data availability can be achieved given the total storage capacity contributed by the friend circle. However, the existing work in DOSN has not yet conducted quantitative research in this aspect.
This paper aims to address the above issue and build a quantitative model to capture the relation between the total storage capacity contributed by the friends and the level of data availability in the DOSN.
We investigate the relation between the total storage capacity and data availability because a data item is regarded as available as long as it is stored on some online friend node in the DOSN, no matter which online friends hold the data replicas. The location of the data replicas does not directly affect the data availability; it mainly has an impact in the following two aspects.
(i) Data accessing performance: depending on, for example, the bandwidth and latency of the friends where the data are stored, other friends who access the data may experience different performance.
(ii) Data maintenance overhead: when a friend goes offline, the data replicas on that friend have to be regenerated on other online friends. Various attributes of the friend, such as the storage capacity it contributes, its bandwidth, and its latency, affect this overhead. For example, if a friend offers a large storage capacity, then potentially more data have to be regenerated on other friends when this friend goes offline.
How to optimize data accessing performance and reduce data maintenance overhead is the work of the underlying data replication and placement strategies. This work is situated at the level of maintaining data availability. This is why this work mainly concerns the total storage size provided by the friends collectively. Following on from this work, we plan to work down the management levels in DOSN and develop the placement strategies for data replicas among the friends in DOSN.
In order to build the data availability model, we need a deep understanding of the DOSN properties that are related to data availability. In this paper, we analyze these properties and establish probabilistic models for them. The models for the individual properties are then integrated to construct the data availability model. Finally, a novel method is proposed to predict the level of data availability on the fly.
Using the data availability model developed in this paper, the DOSN designers can determine the average size of the storage pool that each friend should contribute for the published data, given the level of data availability that the DOSN desires to achieve. Moreover, in DOSN, the friends become online and offline dynamically; the data availability will drop when the number of online friends decreases. The on-the-fly prediction method can be used to conduct the real-time prediction for the level of data availability in the near future. The quantitative prediction results produced by the model can greatly help the data replication and storage policies make judicious decisions on the fly.
The rest of this paper is organized as follows. Section 2 discusses related work on analyses of OSN properties, the existing DOSN approaches, and data availability. Section 3 states the problem that we address. Section 4 presents the data availability model over storage capacity. Section 5 presents the on-the-fly prediction model. Section 6 presents case studies. Section 7 conducts extensive experiments to verify our models and analyzes the experimental results. Finally, we conclude the paper.

Related Work
This section discusses the related work mainly in the following three aspects: (i) the existing work of analyzing the OSN properties, including both the characterizations of OSN networks and the analyses of user behaviors (Section 2.1), (ii) the existing research on DOSN, that is, the alternative approaches to decentralizing the OSNs (Section 2.2), and (iii) the existing studies on data availability in DOSN (Section 2.3). Moreover, this section also discusses the existing work in achieving data availability in grids and clouds (Section 2.4).

Characterizations of OSN Networks.
Some studies use graphs to represent OSN networks and investigate the graph structures of OSNs, such as degree distribution, network diameter, and clustering property. They conduct the analyses on crawled data gathered from popular OSN sites such as Facebook, Twitter, MySpace, Flickr, YouTube, LiveJournal, Cyworld, and Orkut [13,19,20,21,22]. It has been found that (i) OSNs manifest power-law, small-world, and scale-free properties; (ii) the social network is nearly fully connected; (iii) the neighborhoods of the users in the social graph contain a surprisingly dense structure, while the graph is sparse as a whole; (iv) most users have a moderate number of friends (less than 200). The findings about the number of friends will be used to design the simulation experiments in this paper.

Analyses of User Behaviours.
The work in [23][24][25][26][27] studied the patterns of user behaviors through crawled or clickstream data. Jin et al. [23] conducted a comprehensive review of user behavior in OSNs from several perspectives, including social connectivity and interaction among users, traffic activity, and the characteristics in mobile environments. Benevenuto et al. [24] collected clickstream data over 12 days to study the characteristics of OSN sessions, including the accessing frequency, session durations, and total time spent on OSNs. Schneider et al. [25] focused on feature popularity, session characteristics, and the dynamics in the OSN sessions. Kwon and Wen [26] empirically examined how individual characteristics affect the actual user acceptance of social network services. Yan et al. [27] studied human behavior using data obtained from the "Sina Microblog," which is one of the most popular OSN sites in China. They found that human activity patterns are heterogeneous and bursty and often follow the power-law distribution.
Since the existing research has revealed the dynamic characteristics of user behaviors, such as the distributions of online and offline durations, these will be used as known parameters when we derive the data availability model and the on-the-fly prediction in this paper.

DOSN.
To address the data privacy problem in COSNs, several decentralized approaches have been proposed [7][8][9][10][11]. Buchegger et al. [7] proposed a decentralized, peer-to-peer approach coupled with encryption. Yeung et al. [8] adopted a decentralized approach that uses URIs as the identifiers throughout, which can provide the same (or even higher) level of user interaction as many of the current popular OSN sites. Tandukar and Vassileva [9] also proposed a decentralized OSN. With this approach, users can maintain control over their data to protect their data privacy and forward the social data selectively to reduce the irrelevant data among the users. None of these approaches stores the data published by a user only in his friend circle.
There is another type of DOSNs [10,11], known as friend-to-friend storage systems, which focus on providing data storage services for all participants. Li and Dabek [10] argued that a node should choose the neighbors where its data are stored based on existing social relationships instead of randomly. Sharma et al. [11] found that the limitation of storing data only on friends has a marked impact on the data availability. They showed that the problem of obtaining maximal availability while minimizing redundancy is NP-complete and proposed greedy data placement heuristics to improve the data availability. Our data availability model and the on-the-fly prediction can be integrated into these existing DOSNs; for example, the quantitative results produced by our models can be used to help make the data replication and/or data storage decisions.

Data Availability in DOSN.
Because of the requirement of protecting data privacy, the data published by a user are only stored in his friend circle in the DOSN. Consequently, data availability is one of the biggest challenges in DOSNs. The existing work in improving data availability mainly focuses on designing smart data replication and data storage policies.
Shakimov et al. [28] propose three schemes for storing the data in DOSNs: the cloud-based scheme, the desktop-based scheme, and the hybrid scheme combining the two. In the cloud-based scheme, the data are stored in cloud servers. In the desktop-based scheme, two mechanisms may be used: (i) the data replicas are encrypted when they are stored on potentially untrusted hosts; (ii) the users take advantage of the trust embedded in the social network to store the data replicas on trustworthy friends. The drawbacks of these mechanisms come from the complexity and overhead of the encryption key or trust management.
The approach proposed by Koll et al. [18] exchanges the recommendations among the socially related nodes in order to effectively distribute a user's data replicas among the eligible nodes carefully selected in the OSN.
In the approach developed by Olteanu and Pierre [14], nodes are ranked by preference when selecting where to store the data (and their replicas) published by a user. The online friends of the user have the highest priority. When all friends are offline, the data are stored on nodes which are not in the user's friend circle.
Buchegger et al. designed a two-tiered DOSN architecture (PeerSoN) [7]. One tier serves as a look-up service which is implemented by OpenDHT. The second tier consists of the peers and contains the user data. When a user is offline, all his data will be stored across the whole network.
Cutillo et al. [29] propose a P2P-based DOSN (Safebook), in which each node is accessible through the so-called shells. The profile data is mirrored and stored in a subset of a node's direct contacts, which form the so-called innermost shell. The data retrieval requires traversing the shells along a path of the nodes that are online and are friends with each other.
Tegeler et al. [30] propose an approach called Gemstone. Gemstone protects the user's privacy by encrypting all data using ABE and stores the user's data in the so-called data holding agents (DHAs). If a DHA itself is offline, the data have to be passed to the DHAs of this offline DHA.
All the above existing work about data availability focuses on how to store the data replicas so that they are still accessible when the users or certain friends of the users are offline. They all implicitly assume that the friends are always able to contribute the adequate storage capacities to store the replicated data.

Data Availability in Grids and Clouds.
We also studied the existing work in achieving data availability in grids and clouds. Amjad et al. [31] surveyed the dynamic replication strategies for improving data availability in data grids. Kossmann et al. [32] proposed a modular cloud storage system. Zeng et al. [33] studied the cloud storage architecture and then pointed out the key techniques.
However, we found that the focuses and the considerations in achieving data availability in grids and clouds are quite different from those in DOSN. One of the biggest differences is that the data replication mechanisms in grids or clouds do not treat the total storage capacity as a limitation, although some studies considered the case where the storage capacity of individual nodes in a grid system is limited. Namely, these studies all explicitly or implicitly assume that the total storage space in grids or clouds is always sufficient to store the data replicas. This assumption is reasonable for grids and clouds because of the scale of such systems. However, it is not always true for DOSN due to the aforementioned facts that (1) smart mobile devices, whose storage capacity is limited, are often used in DOSN and (2) the number of friends in a friend circle is also limited.

Problem Statement
Figure 1 illustrates the data availability problem. In Figure 1, the user publishes data at a series of time points along the time line. Assume $t_1$ is the first time point at which he publishes data, $\mathrm{Data}_1$, after he comes online, and $t_n$ is the last time point at which the user publishes data, $\mathrm{Data}_n$, before he goes offline at the time point $t^{u}_{out}$. Now let us consider one of the friends in the user's friend circle. Assume that the friend goes offline at time point $t_{out}$ just before the user publishes $\mathrm{Data}_i$ (and after the user publishes $\mathrm{Data}_{i-1}$) and then comes online at time point $t_{in}$ after the user goes offline. Therefore, $\mathrm{Data}_i$ to $\mathrm{Data}_n$ are the data that the friend missed while he was offline and consequently need to be updated when he comes online. Since the user is already offline, the friend can only update the missed data from other online friends where the data replicas are stored. Note that if the friend comes online before the user goes offline, the friend can update all missed data from the user directly. Therefore, data availability is not a problem under this circumstance.
[Figure 1: time line of the user's publishing events, marking where the user publishes his first data, where the oldest data replicated in online friends is published, where the user publishes his last data, where the user goes offline, and where the friend comes online.]
When a friend comes online, assume that the total amount of the data that the friend tries to update is $D_{update}$. Out of $D_{update}$, the amount of data that are stored in online friends of the user is $D_{stored}$. The level of data availability (denoted by DA) is defined as
$$\mathrm{DA} = \frac{D_{stored}}{D_{update}}. \tag{1}$$
The data replication frameworks typically work in the following way [10,18,34]. When the user publishes a data item, a certain number of data replicas are created and stored in the storage pools of the selected friends of the user. When a friend goes offline, the data replicas stored on this friend are recreated and stored on other online friends to maintain a fixed number of data replicas for each data item. If the size of the storage pools is unlimited, the new data will just be added to the friend's storage pool. If the storage pool is limited and already full, the oldest data in the storage pool will be replaced with the new data. Therefore, the size of the storage pool determines what period of data is stored in the pool, which affects the data availability of the DOSN. Consider Figure 1 again; for example, if the storage pool in the friends is limited and can only store the data published from $t_n$ back to some earlier time point $t_j$, then the data published earlier than $t_j$ are not available when the friend comes online at $t_{in}$.
One aim of this paper is to establish the data availability model to capture the relation between the level of data availability and the total size of the storage pools contributed by the friends. This is presented in Section 4. Now consider a time point $t_f$ after the current time $t_c$. The other aim of this paper is to predict the level of data availability at $t_f$ on the fly, which is presented in Section 5. This prediction is very useful for the data replication or storage policies to make judicious decisions dynamically.
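The oldest-first replacement rule and the DA ratio described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not in the paper: the friends' pools are modelled as one aggregate pool, replicas are ignored, and the names `StoragePool` and `data_availability` are hypothetical.

```python
from collections import deque


class StoragePool:
    """Fixed-capacity pool that evicts the oldest item first when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()  # (publish_time, size), oldest on the left
        self.used = 0

    def store(self, publish_time, size):
        if size > self.capacity:
            return  # item can never fit; skipped in this sketch
        while self.used + size > self.capacity:
            _, evicted_size = self.items.popleft()
            self.used -= evicted_size
        self.items.append((publish_time, size))
        self.used += size

    def oldest_time(self):
        """Publishing time of the oldest item still held, or None if empty."""
        return self.items[0][0] if self.items else None


def data_availability(missed_items, pool):
    """DA = size of missed data still held in the pool / size of missed data."""
    total = sum(size for _, size in missed_items)
    if total == 0:
        return 1.0  # nothing to update
    held_times = {t for t, _ in pool.items}
    stored = sum(size for t, size in missed_items if t in held_times)
    return stored / total


pool = StoragePool(capacity=3)
published = [(t, 1) for t in range(1, 6)]  # five unit-size items at t = 1..5
for t, size in published:
    pool.store(t, size)
missed = [item for item in published if item[0] >= 2]  # friend offline since t = 2
print(pool.oldest_time(), data_availability(missed, pool))
```

With a capacity of three items, only the data published at times 3 to 5 survive, so a friend who missed the items from time 2 onwards can recover three of the four missed items.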
The notations that are used in the derivations of the data availability models are listed in Table 1.

The Data Availability Model over Storage Capacity
As discussed in Section 3, the total size of the storage pool contributed by a user's friends (denoted by SS) determines the period of the published data stored in the storage pool. $t_o$ denotes the publishing time of the oldest data stored in the storage pool (marked in Figure 1 as the point where the oldest data replicated in online friends is published), and $t^{u}_{out}$ denotes the time when the user goes offline. Then $[t_o, t^{u}_{out}]$ is the period of the published data stored in the storage pool. This section first determines $t_o$ (Section 4.1) and then presents the method of establishing the relation between SS and the DA of the data published by the user (Section 4.2).

Calculating $t_o$.
In order to determine $t_o$, the size of the data published by the user has to be calculated first. $n(t_{pu})$ denotes the number of times that the user publishes data in the time duration $t_{pu}$. $n(t_{pu})$ is a discrete random variable. $f_{pu}(n(t_{pu}))$ denotes the probability density function (pdf) of $n(t_{pu})$. $\bar{s}$ denotes the average size of the data published by the user each time. $D(t_{pu})$ denotes the total size of the data published by the user in $t_{pu}$. Clearly, $D(t_{pu}) = \bar{s} \cdot n(t_{pu})$. Therefore, the pdf of $D(t_{pu})$, denoted by $g_{pu}(D(t_{pu}))$, can be determined by (2) and the expectation of $D(t_{pu})$ can be calculated by (3) as follows:
$$g_{pu}(D(t_{pu})) = f_{pu}\left(\frac{D(t_{pu})}{\bar{s}}\right), \tag{2}$$
$$E[D(t_{pu})] = \bar{s} \cdot E[n(t_{pu})]. \tag{3}$$
The publishing time of the oldest data stored in the storage pool, $t_o$, can be calculated using (4) given SS, where $r$ is the replication degree in the OSN, that is, the number of replicas created for each data item. Consider
$$SS = r \cdot E[D(t^{u}_{out} - t_o)]. \tag{4}$$
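As a hedged illustration of how the publishing time of the oldest stored data follows from the storage size, the sketch below assumes that the expected published size grows with the duration and, for the concrete numbers, that publishing follows a Poisson process (the case treated in Section 6), so the expected size is the average item size times the rate times the duration. The function names, the bisection solver, and the numeric values are assumptions made for this example only.

```python
def expected_published_size(rate, avg_size, duration):
    """Expected total size published in `duration` under a Poisson process."""
    return avg_size * rate * duration


def oldest_stored_time_poisson(ss, r, avg_size, rate, t_user_out):
    """Closed form: the pool covers a window of ss / (r * avg_size * rate)."""
    return t_user_out - ss / (r * avg_size * rate)


def oldest_stored_time(ss, r, expected_size, t_user_out, horizon, tol=1e-9):
    """Solve r * expected_size(t_user_out - t_o) = ss for t_o by bisection.

    `expected_size` must be nondecreasing in the duration; `horizon` bounds
    how far back in time the search goes.
    """
    if r * expected_size(horizon) < ss:
        return t_user_out - horizon  # everything within the horizon fits
    lo, hi = 0.0, horizon  # candidate window lengths
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if r * expected_size(mid) < ss:
            lo = mid
        else:
            hi = mid
    return t_user_out - hi


# Hypothetical values: 2 items/min, 0.5 MB per item, 3 replicas, 90 MB pool.
rate, avg_size, r, ss, t_user_out = 2.0, 0.5, 3, 90.0, 100.0
t_o_closed = oldest_stored_time_poisson(ss, r, avg_size, rate, t_user_out)
t_o_numeric = oldest_stored_time(
    ss, r, lambda d: expected_published_size(rate, avg_size, d), t_user_out, 1000.0
)
print(t_o_closed, round(t_o_numeric, 6))
```

The bisection variant is included only to show that the same relation can be solved numerically when the expected published size has no convenient closed form.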

Establishing the Relation between DA and SS.
When a friend comes online at $t_{in}$ (as in Figure 1) and his last logout time (denoted by $t_{out}$) is no earlier than $t_o$, the friend can update all the data missed during his offline duration from other online friends. Namely, DA for a friend coming online at $t_{in}$, denoted by $\mathrm{DA}(t_{in}, t_{out})$, is 100% in this case. When $t_{out}$ is earlier than $t_o$, the data published in $[t_{out}, t_o]$ are not available to the friend. Therefore, DA in this case equals the proportion of the data that are published in $[t_o, t^{u}_{out}]$ to those published in $[t_{out}, t^{u}_{out}]$, where $t^{u}_{out}$ is the time when the user goes offline. In summary, $\mathrm{DA}(t_{in}, t_{out})$ can be calculated using
$$\mathrm{DA}(t_{in}, t_{out}) = \begin{cases} 1, & t_{out} \ge t_o, \\ \dfrac{E[D(t^{u}_{out} - t_o)]}{E[D(t^{u}_{out} - t_{out})]}, & t_{out} < t_o, \end{cases} \tag{5}$$
where $E[D(t)]$ is the expected total size of the data published by the user in a duration $t$.
$t_{off}$ denotes the time duration of a friend being offline continuously. $f_{off}(t_{off})$ denotes the pdf of $t_{off}$. The probability that a friend went offline at $t_{out}$ and then comes online at $t_{in}$ is $f_{off}(t_{in} - t_{out})\,dt_{out}$, and the corresponding $\mathrm{DA}(t_{in}, t_{out})$ is obtained by (5). Then, DA at time point $t_{in}$ can be expressed by (6), which weights $\mathrm{DA}(t_{in}, t_{out})$ by this probability over the possible values of $t_{out}$.
$\mathrm{DA}_{[t^{u}_{out}, h]}$ denotes the expectation of DA over the time duration between $t^{u}_{out}$ and the user's next login at $h$, where $h$ is the duration between the user's two consecutive logins (the work in [25,35,36] has presented the method to obtain the value of $h$). $\mathrm{DA}_{[t^{u}_{out}, h]}$ can be calculated by (7), where $f_{at}(t_{in})$ is the probability density function that a friend comes online at time $t_{in}$.
$\mathrm{DA}_{[0, t^{u}_{out}]}$ denotes the expectation of DA over the time duration between 0 and $t^{u}_{out}$. Since the user is online between 0 and $t^{u}_{out}$, DA is 100% over this duration; that is, (8) holds:
$$\mathrm{DA}_{[0, t^{u}_{out}]} = 100\%. \tag{8}$$
$t_{on}$ denotes the time duration of a friend being online continuously. $f_{on}(t_{on})$ denotes the pdf of $t_{on}$. DA of the data published by the user under the given value of $h$, denoted by $\mathrm{DA}(h)$, can be calculated by combining (7) and (8), which gives (9). $h = t_{on} + t_{off}$ is also a random variable. $f(h)$ denotes the probability density function of $h$, which can be derived from the probability density functions of $t_{on}$ and $t_{off}$ and has also been studied in the literature [26,37].
Therefore, DA of the data published by the user can be finally calculated using (10). As can be seen from (9), DA is a function over $\mathrm{DA}_{[t^{u}_{out}, h]}$, which is in turn a function over $\mathrm{DA}(t_{in}, t_{out})$ (shown in (7)). $\mathrm{DA}(t_{in}, t_{out})$ is a function over $t_o$ (shown in (5)). As shown in (4), $t_o$ can be calculated from SS. Therefore, we have now established the function of DA over SS.

Table 1: Notations used in the derivations of the data availability models.
Notation | Description
login, logout | The login and logout events, respectively. When either of these two events occurs, the state of a user changes from OFFLINE to ONLINE or from ONLINE to OFFLINE
$t_{on}$, $f_{on}(t_{on})$, $F_{on}(t_{on})$ | The time duration of a user being online continuously (i.e., the time duration from a login event to the following logout event), which is a random variable and whose probability density function and probability distribution function are denoted by $f_{on}(t_{on})$ and $F_{on}(t_{on})$, respectively
$t_{off}$, $f_{off}(t_{off})$, $F_{off}(t_{off})$ | The time duration of a user being offline, which is also a random variable and whose probability density function and probability distribution function are denoted by $f_{off}(t_{off})$ and $F_{off}(t_{off})$, respectively
$f_{pu}(n, t)$ | The number of times $n$ that the user publishes data is a discrete random variable, whose probability density function in a duration $t$ is denoted by $f_{pu}(n, t)$
$\bar{s}$ | The statistical average size of the data published by the user each time. $\bar{s}$ is a constant
$r$ | The replication degree, that is, the number of replicas created for each data item
$t_o$ | The publishing time of the oldest data stored in the storage pool
SS | The total storage capacity contributed by all online friends
$S$ | The maximum storage capacity that each friend is able to contribute
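The per-friend case distinction of (5) can be illustrated with a small Python sketch. It assumes, purely for illustration, that the expected published size is proportional to the duration (as under Poisson publishing), so the ratio of expected sizes reduces to a ratio of durations; `da_for_friend` and `average_da` are hypothetical helper names, and the plain average merely stands in for the weighted expectation used in the model.

```python
def da_for_friend(t_friend_out, t_o, t_user_out):
    """DA seen by a friend who logged out at t_friend_out.

    t_o is the publishing time of the oldest stored data and t_user_out is
    the user's logout time, with t_o <= t_user_out.
    """
    if t_friend_out >= t_o:
        return 1.0  # everything the friend missed is still stored
    # Only data published in [t_o, t_user_out] survive out of the data
    # published in [t_friend_out, t_user_out].
    return (t_user_out - t_o) / (t_user_out - t_friend_out)


def average_da(friend_logout_times, t_o, t_user_out):
    """Plain average of the per-friend DA over the observed logout times."""
    values = [da_for_friend(t, t_o, t_user_out) for t in friend_logout_times]
    return sum(values) / len(values)


print(da_for_friend(40.0, 70.0, 100.0))   # friend left before t_o
print(da_for_friend(80.0, 70.0, 100.0))   # friend left after t_o
print(average_da([40.0, 80.0], 70.0, 100.0))
```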

Predicting the Data Availability on the Fly
Using the method presented in Section 4, we can calculate the SS required to achieve the desired DA of the data published by the user. Note that SS is the total size of the storage pool contributed by all online friends of the user. The friends log in and out dynamically, and therefore the number of online friends varies over time. When the number of online friends decreases, the size of the individual storage pool contributed by each online friend has to be increased in order to maintain the desired DA. The existing work in the literature often assumes that the friends of a user are always capable of contributing sufficient storage capacity for the replicated data published by the user. Consequently, there is little work yet in the literature investigating the impact of the friends' dynamic behaviors (i.e., dynamic login and logout) on DA. However, as discussed in the introduction, it is not always acceptable to assume that the friends are willing and able to contribute unlimited storage capacity in today's OSNs. In this paper, we assume that the maximum storage capacity that each friend is able to contribute is $S$. When the required SS exceeds the total storage capacity contributed by all online friends, the DA will drop. Due to the friends' dynamic behaviors, it is very useful to be able to predict the DA on the fly. This section addresses this issue.
Consider Figure 1 again. Assume the current time is $t_c$. The problem of the on-the-fly prediction of DA is to predict the DA at a future time point $t_f$ ($t_f > t_c$). According to the discussions above, the key to predicting DA is to predict the number of online friends. At the current time $t_c$, we know how many friends are online or offline.
We can predict the number of friends who are online at a future time $t_f$ if we can predict the following two parameters: (i) how many of the friends who are online at time $t_c$ do not change their states from online to offline before or at $t_f$, and (ii) how many of the friends who are offline at time $t_c$ change their states to online before or at $t_f$. The methods of predicting the above two parameters are presented in Sections 5.1 and 5.2, respectively. Section 5.3 combines the results obtained in Sections 5.1 and 5.2 to predict the number of online friends and further predict the DA at time $t_f$.

Predicting the Number of the Friends Who Are Online at $t_c$ and Do Not Change to Offline before or at $t_f$.
Given an online friend $v_i$ at the current time $t_c$, we know the time point at which the friend logged in (i.e., became online), which is denoted by $t^{on}_{in}$. The probability that friend $v_i$ does not change to offline before or at the future time $t_f$ equals the probability that $v_i$ will only log out after $t_f$ (i.e., $v_i$'s logout time, denoted by $t^{on}_{out}$, is greater than $t_f$). This probability, denoted by $P^{on}_{out}(t^{on}_{out} > t_f)$, in turn equals the probability that $v_i$'s online duration is greater than $(t_f - t^{on}_{in})$ under the condition that $v_i$'s online duration is no less than $(t_c - t^{on}_{in})$, which can be computed using the conditional probability shown in (11). The condition $(t_{on} \ge t_c - t^{on}_{in})$ in (11) reflects the fact that $v_i$ has been staying online for the duration of $(t_c - t^{on}_{in})$:
$$P^{on}_{out}(t^{on}_{out} > t_f) = P(t_{on} > t_f - t^{on}_{in} \mid t_{on} \ge t_c - t^{on}_{in}) = \frac{1 - F_{on}(t_f - t^{on}_{in})}{1 - F_{on}(t_c - t^{on}_{in})}. \tag{11}$$
$V_{on}$ and $N_{on}$ denote the set and the number of all online friends at time $t_c$, respectively. Then the number of the friends in $V_{on}$ who are still online at time $t_f$ can be predicted using
$$\sum_{v_i \in V_{on}} P^{on}_{out}(t^{on}_{out} > t_f). \tag{12}$$
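The conditional probability of staying online, and its sum over the currently online friends, can be sketched as follows. The survival functions (one minus the distribution function) for the exponential and power-law cases are standard results; the function names and the numeric parameters are hypothetical and chosen only for this example.

```python
import math


def exp_survival(rate):
    """P(T > t) for an exponential duration with the given rate."""
    return lambda t: math.exp(-rate * t)


def power_law_survival(alpha, t_min):
    """P(T > t) for a power-law duration with exponent alpha > 1, t >= t_min."""
    return lambda t: 1.0 if t <= t_min else (t / t_min) ** (1.0 - alpha)


def prob_still_online(survival, t_login, t_now, t_future):
    """P(online duration > t_future - t_login | duration >= t_now - t_login)."""
    return survival(t_future - t_login) / survival(t_now - t_login)


def expected_still_online(survival, login_times, t_now, t_future):
    """Expected number of currently online friends still online at t_future."""
    return sum(prob_still_online(survival, t, t_now, t_future) for t in login_times)


p_exp = prob_still_online(exp_survival(0.1), t_login=10.0, t_now=31.0, t_future=41.0)
p_pl = prob_still_online(power_law_survival(2.0, 1.0), t_login=21.0, t_now=31.0, t_future=41.0)
print(p_exp, p_pl)
```

In the exponential case the result depends only on the gap between the future and current times, which reflects the memoryless property; in the power-law case a friend who has already been online for a long time is more likely to remain online.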

Predicting the Number of the Friends Who Are Offline at $t_c$ and Change the States to Online before or at $t_f$.
The method of predicting the number of the friends who are offline at the current time $t_c$ and change their states to online before or at the future time $t_f$ is similar to that presented in Section 5.1. Given an offline friend $v_i$ at time $t_c$, we know the time when $v_i$ logged off, denoted by $t^{off}_{out}$. The probability that $v_i$ changes the state to online before or at $t_f$ equals the probability that $v_i$'s login time, $t^{off}_{in}$, is no later than $t_f$. This probability, denoted by $P^{off}_{in}(t^{off}_{in} \le t_f)$, in turn equals the probability that $v_i$'s offline duration is no greater than $(t_f - t^{off}_{out})$ under the condition that $v_i$'s offline duration is no less than $(t_c - t^{off}_{out})$, which can be calculated using
$$P^{off}_{in}(t^{off}_{in} \le t_f) = P(t_{off} \le t_f - t^{off}_{out} \mid t_{off} \ge t_c - t^{off}_{out}) = \frac{F_{off}(t_f - t^{off}_{out}) - F_{off}(t_c - t^{off}_{out})}{1 - F_{off}(t_c - t^{off}_{out})}. \tag{13}$$
$V_{off}$ and $N_{off}$ denote the set and the number of all offline friends at time $t_c$, respectively. Then the number of the friends in $V_{off}$ who change their states to online before or at time $t_f$ can be predicted using
$$\sum_{v_i \in V_{off}} P^{off}_{in}(t^{off}_{in} \le t_f). \tag{14}$$
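The probability that an offline friend returns by the future time can be sketched in terms of the distribution function of the offline duration. The two distribution functions below are the standard exponential and power-law forms; the helper names and the example numbers are assumptions for illustration.

```python
import math


def exp_cdf(rate):
    """Distribution function of an exponential duration."""
    return lambda t: 1.0 - math.exp(-rate * t)


def power_law_cdf(alpha, t_min):
    """Distribution function of a power-law duration (alpha > 1, t >= t_min)."""
    return lambda t: 0.0 if t <= t_min else 1.0 - (t / t_min) ** (1.0 - alpha)


def prob_back_online(cdf, t_logout, t_now, t_future):
    """P(offline duration <= t_future - t_logout | duration >= t_now - t_logout)."""
    elapsed = t_now - t_logout      # offline time already observed
    until_future = t_future - t_logout
    return (cdf(until_future) - cdf(elapsed)) / (1.0 - cdf(elapsed))


def expected_back_online(cdf, logout_times, t_now, t_future):
    """Expected number of currently offline friends online by t_future."""
    return sum(prob_back_online(cdf, t, t_now, t_future) for t in logout_times)


q_exp = prob_back_online(exp_cdf(0.2), t_logout=20.0, t_now=31.0, t_future=36.0)
q_pl = prob_back_online(power_law_cdf(2.0, 1.0), t_logout=21.0, t_now=31.0, t_future=51.0)
print(q_exp, q_pl)
```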

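Predicting the Number of Online Friends and the DA at $t_f$.
$N_{on}(t_f)$ denotes the number of online friends at the future time $t_f$. $N_{on}(t_f)$ can be calculated by (15) by combining (12) and (14) as follows:
$$N_{on}(t_f) = \sum_{v_i \in V_{on}} P^{on}_{out}(t^{on}_{out} > t_f) + \sum_{v_i \in V_{off}} P^{off}_{in}(t^{off}_{in} \le t_f), \tag{15}$$
where $V_{on}$ and $V_{off}$ are the sets of friends who are online and offline at the current time $t_c$, respectively.
$S$ is the maximum storage capacity that each friend is able to contribute. Then the total storage capacity contributed by all online friends at time $t_f$ is $(S \cdot N_{on}(t_f))$. Using the method presented in Section 4, the DA at $t_f$ can be determined.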
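Putting the two predictions together, the following sketch estimates the number of online friends at a future time and the total storage they can offer. It uses the identity that the return probability equals one minus a ratio of survival values; the exponential rates, the per-friend capacity, and the helper names are hypothetical and serve only as an example.

```python
import math


def predict_online_friends(online_logins, offline_logouts, t_now, t_future,
                           surv_on, surv_off):
    """Expected number of online friends at t_future.

    surv_on / surv_off are survival functions P(T > t) of the online and
    offline durations; the login/logout times are the latest ones observed
    before t_now for the friends currently online/offline.
    """
    stay = sum(surv_on(t_future - t) / surv_on(t_now - t) for t in online_logins)
    back = sum(1.0 - surv_off(t_future - t) / surv_off(t_now - t)
               for t in offline_logouts)
    return stay + back


def storage_shortfall(required_ss, per_friend_capacity, n_online):
    """How much of the required pool the online friends cannot provide."""
    return max(0.0, required_ss - per_friend_capacity * n_online)


surv_on = lambda t: math.exp(-0.1 * t)    # assumed online-duration model
surv_off = lambda t: math.exp(-0.05 * t)  # assumed offline-duration model
n_future = predict_online_friends([5.0, 12.0, 20.0], [8.0, 25.0],
                                  t_now=31.0, t_future=41.0,
                                  surv_on=surv_on, surv_off=surv_off)
total_capacity = 100.0 * n_future  # each friend offers at most 100 units
print(n_future, total_capacity, storage_shortfall(250.0, 100.0, n_future))
```

When the shortfall is positive, the storage window shrinks, the oldest stored data become more recent, and the DA obtained from the model of Section 4 drops accordingly.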

Case Study
When we derived the DA model over storage capacity and the on-the-fly prediction of DA in Sections 4 and 5, we used the generic form of the probability distribution for online and offline durations (i.e., $f_{on}(t_{on})$ and $f_{off}(t_{off})$) as well as for the data publishing pattern, that is, the number of times that the user publishes the data in a given time duration (i.e., $f_{pu}(n, t)$). However, it has been shown that the online and offline durations may follow the power-law distribution or the exponential distribution [35,37,38] and that the data publishing pattern may follow the Poisson process [37]. In this section, we conduct a few case studies by substituting the generic form of the probability distribution with the power-law, the exponential, and the Poisson distribution. In fact, any probability distribution can be used in the proposed models. Even if the mathematical derivations cannot be carried out in closed form for some probability distributions, the Mathematica software [39] can be used to calculate the model results.

Poisson Distribution.
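The data publishing pattern may follow the Poisson process [35]. If $n(t_{pu})$, the number of times that the user publishes data in the duration $t_{pu}$, follows the Poisson distribution with the parameter $\lambda_{pu}$, then we have (16). Consequently, $E[n(t_{pu})]$ can be calculated using (17), as follows:
$$P(n(t_{pu}) = k) = \frac{(\lambda_{pu} t_{pu})^{k}}{k!} e^{-\lambda_{pu} t_{pu}}, \tag{16}$$
$$E[n(t_{pu})] = \lambda_{pu} t_{pu}. \tag{17}$$
Further, with $\bar{s}$ denoting the average size of the data published each time and $D(t_{pu})$ the total size published in $t_{pu}$, (3) can be transformed to
$$E[D(t_{pu})] = \bar{s} \lambda_{pu} t_{pu}. \tag{18}$$
With (18), (4) becomes
$$SS = r \bar{s} \lambda_{pu} (t^{u}_{out} - t_o). \tag{19}$$
Therefore, given the storage capacity SS, the replication degree $r$, and the logout time of the user $t^{u}_{out}$, the publishing time of the oldest data stored in the storage pool, $t_o$, can be calculated using
$$t_o = t^{u}_{out} - \frac{SS}{r \bar{s} \lambda_{pu}}. \tag{20}$$
Moreover, with (18), (5) then becomes
$$\mathrm{DA}(t_{in}, t_{out}) = \begin{cases} 1, & t_{out} \ge t_o, \\ \dfrac{t^{u}_{out} - t_o}{t^{u}_{out} - t_{out}}, & t_{out} < t_o. \end{cases} \tag{21}$$

Power-Law Distribution.
If the offline duration, $t_{off}$, follows the power-law distribution with parameter $\alpha_{off}$, then we have (22), where $C = (\alpha_{off} - 1)\, t_{min}^{\alpha_{off} - 1}$ given the minimal duration $t_{min}$ [40]:
$$f_{off}(t_{off}) = C\, t_{off}^{-\alpha_{off}}, \quad t_{off} \ge t_{min}. \tag{22}$$
The corresponding distribution function is $F_{off}(t_{off}) = 1 - (t_{off}/t_{min})^{1 - \alpha_{off}}$; the online duration is treated in the same way with parameter $\alpha_{on}$. We now show how to use the power-law distribution to derive the on-the-fly prediction for the number of online friends, which is obtained in Section 5 through (11), (13), and (15).
Equation (11) can be further derived with the power-law distribution to obtain (for elapsed durations of at least $t_{min}$)
$$P^{on}_{out}(t^{on}_{out} > t_f) = \left(\frac{t_f - t^{on}_{in}}{t_c - t^{on}_{in}}\right)^{1 - \alpha_{on}}.$$
Equation (13) can be further derived to obtain
$$P^{off}_{in}(t^{off}_{in} \le t_f) = 1 - \left(\frac{t_f - t^{off}_{out}}{t_c - t^{off}_{out}}\right)^{1 - \alpha_{off}}.$$
Equation (15) can be further derived to
$$N_{on}(t_f) = \sum_{v_i \in V_{on}} \left(\frac{t_f - t^{on}_{in}}{t_c - t^{on}_{in}}\right)^{1 - \alpha_{on}} + \sum_{v_i \in V_{off}} \left[1 - \left(\frac{t_f - t^{off}_{out}}{t_c - t^{off}_{out}}\right)^{1 - \alpha_{off}}\right],$$
where $t^{on}_{in}$ and $t^{off}_{out}$ in each term are the latest login and logout times of the friend $v_i$.

Exponential Distribution.
If a random variable $t$ follows the exponential distribution with parameter $\lambda$, then its probability density function and probability distribution function can be expressed as
$$f(t) = \lambda e^{-\lambda t}, \quad F(t) = 1 - e^{-\lambda t}.$$
We now show how to use the exponential distribution to derive the on-the-fly prediction for the number of online friends.
With the exponential distribution (with parameters $\lambda_{on}$ and $\lambda_{off}$ for the online and offline durations), (11) can be derived to obtain
$$P^{on}_{out}(t^{on}_{out} > t_f) = \exp(-\lambda_{on}(t_f - t_c)).$$
Also, (13) can be transformed to
$$P^{off}_{in}(t^{off}_{in} \le t_f) = 1 - \exp(-\lambda_{off}(t_f - t_c)).$$
Further, (15) then becomes
$$N_{on}(t_f) = N_{on} \exp(-\lambda_{on}(t_f - t_c)) + N_{off} \left[1 - \exp(-\lambda_{off}(t_f - t_c))\right],$$
where $N_{on}$ and $N_{off}$ are the numbers of friends online and offline at the current time $t_c$.

Evaluation
In the experiments, the online and offline durations of the users follow either the power-law distribution (PL) or the exponential distribution (Exp), as observed in the literature [37]. The user publishes data following the Poisson process, and $r$ replicas are created for each data item and stored in the online friends.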
In order to evaluate the DA model over storage capacity, the DA is predicted given the size of the storage capacity and the values of the other OSN parameters. Then the simulated OSN is run using those parameter values. Each friend contributes the same storage capacity, and this capacity is adjusted during the run so that the total storage capacity of all online friends always equals the storage capacity used to predict the DA. During the run, whenever a friend comes online, the DA of the published data for that friend is recorded. The average of all recorded DA values is regarded as the actual DA, which is compared against the predicted DA to measure the accuracy of the prediction.
In order to evaluate the on-the-fly prediction, the experimental scenario is designed as follows. A user and his friends log in and out following the specified distribution during a fixed time interval starting at time 0. The current time is set to a time point within this interval at which the user is offline. The online or offline states of all friends at the current time, as well as their latest login or logout times before the current time, are collected. The collected data, combined with the specified distributions, are used to predict the number of online friends and the DA at future time points (i.e., time points later than the current time). The predicted data are then compared against the data obtained from the actual run. For example, the number of friends of a user is set to 150. Figure 2 shows the online/offline state of each friend when the current time is set to the 31st min. A point above the red line (i.e., the zero line) represents the latest login time of a friend who is online at the 31st min, while a point below the red line shows the latest logout time of a friend who is offline at the 31st min.
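To illustrate how collected states can be turned into a forecast, the following sketch predicts the expected number of online friends at a future time point for the exponential case only. It does not reproduce (11), (13), or (15); it uses the standard result for a two-state continuous-time Markov chain, which applies when online and offline durations are exponential with rates `lam_on` and `lam_off`. Under this assumption the exponential distribution is memoryless, so only the current state of each friend matters; for the power-law case the latest login or logout times would also be needed. The function names are hypothetical.

```python
import math


def prob_online(online_now, s, lam_on, lam_off):
    """Probability that a friend is online s time units from now.

    Assumes exponential online durations (rate lam_on) and exponential
    offline durations (rate lam_off), i.e. a two-state Markov chain.
    """
    total = lam_on + lam_off
    stationary = lam_off / total      # long-run fraction of time spent online
    decay = math.exp(-total * s)      # how fast the current state is forgotten
    if online_now:
        return stationary + (1.0 - stationary) * decay
    return stationary * (1.0 - decay)


def expected_online_friends(states, s, lam_on, lam_off):
    """Expected number of online friends s time units from now.

    states is an iterable of booleans: True if the friend is online now.
    """
    return sum(prob_online(st, s, lam_on, lam_off) for st in states)
```

For instance, with 150 friends whose current states are held in a list of booleans, `expected_online_friends(states, 5.0, lam_on, lam_off)` gives the forecast five minutes ahead. As `s` grows, the forecast converges to the stationary value regardless of the collected states, which is one way to see why such predictions lose accuracy at longer horizons.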
In the rest of this section, the DA model over storage capacity is evaluated in Section 7.1 with regard to the following aspects: (i) the impact of storage capacity on DA, (ii) the impact of the DOSN parameters, including online/offline duration and the rate of user publishing data, on DA, and (iii) the accuracy of the relation established between DA and SS.
In Section 7.2, the on-the-fly prediction is evaluated with regard to the following aspects: (i) the accuracy of predicting the number of online friends on the fly and (ii) the accuracy of the DA predicted on the fly.
Unless stated otherwise, the experimental parameters used in the performance evaluations take the values shown in Table 2.

Figure 3 shows the impact of the total storage capacity (i.e., SS in Section 4) on the DA calculated from the DA model presented in Section 4. As shown in Figure 3, the DA increases as SS increases. Under both the exponential distribution and the power-law distribution of the friends' online duration, the DA levels off once SS exceeds a certain value. These results suggest that it is unnecessary to ask the friends to contribute unlimited storage capacity, as is often assumed in the literature [14,18].
From this figure, we can also determine the SS that is required to achieve a certain DA. For example, DA reaches 99% under PL and Exp when SS is 194.38 and 151.97, respectively.
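Reading the required SS off the curve can also be done numerically. The sketch below assumes only that the DA is a non-decreasing function of SS, as Figure 3 indicates, and uses bisection to find the smallest SS that reaches a target DA. The function `required_storage` and the stand-in curve `toy_da` are hypothetical; `toy_da` is a placeholder saturating function, not the DA model of Section 4, and does not reproduce the values reported above.

```python
import math


def required_storage(da_model, target, lo=0.0, hi=1.0, tol=1e-6):
    """Smallest SS (within tol) such that da_model(SS) >= target.

    da_model must be non-decreasing in SS; target must lie in (0, 1).
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    # Grow the upper bound until the target is reached.
    while da_model(hi) < target:
        hi *= 2.0
        if hi > 1e12:
            raise ValueError("target DA is not reachable")
    # Bisection: hi always satisfies the target; lo is moved up when mid fails.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if da_model(mid) >= target:
            hi = mid
        else:
            lo = mid
    return hi


def toy_da(ss, scale=40.0):
    """Placeholder saturating curve standing in for a real DA model."""
    return 1.0 - math.exp(-ss / scale)
```

Calling `required_storage(toy_da, 0.99)` returns the SS at which the placeholder curve reaches 99%; substituting an implementation of the actual DA model for `toy_da` would give the corresponding value for a given set of DOSN parameters.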

Impact of On/Offline Durations on DA.
As can be seen from the derivation of the DA model presented in Section 4, the online/offline durations have an impact on DA. We conducted experiments to evaluate this impact. Since the online and offline durations have a similar impact, only the results for the offline durations are presented in this subsection. Given the distribution of the offline duration, the average duration is controlled by $\lambda_{off}$: the inverse of $\lambda_{off}$ is the length of the duration. Figure 4 shows the impact of $\lambda_{off}$ on DA. In the experiments in Figure 4, SS is set to 194.38 under PL and 151.97 under Exp (as shown in Figure 3), so that DA is 99% under the default value of $\lambda_{off}$ (as in Table 2). We then change the value of $\lambda_{off}$ and plot the corresponding DA. It can be observed that DA increases as $\lambda_{off}$ increases under both Exp and PL. These results can be explained as follows. When $\lambda_{off}$ increases, the average length of the friends' offline durations decreases. Given a certain SS, the period covered by the stored data is fixed. Therefore, shorter offline durations of the friends result in a higher probability that the times of the data that the friends try to update fall into this period. Consequently, DA is higher.

The parameter $\lambda_{pu}$ controls the data publishing rate. The higher the $\lambda_{pu}$, the higher the data publishing rate. Figure 5 demonstrates the impact of $\lambda_{pu}$ on DA. The setting of SS is the same as that in Figure 4. The figure shows that DA decreases as $\lambda_{pu}$ increases. This is because when the data are published at a higher rate, the period covered by the stored data is shorter given a fixed SS. Consequently, DA is lower.

7.1.4. Accuracy of the DA Model. The DA model over storage capacity proposed in Section 4 can calculate the DA given an SS. We conducted experiments to study how accurate the calculated DA is, compared with the DA obtained from the actual run. The results are presented in Figure 6. The results under Exp and PL show a similar pattern. Therefore, only the results under Exp are presented.
In Figure 6, the setting of SS is the same as that in Figure 4 (i.e., SS = 151.97). The DA calculated by the DA model is 99%, which is the red line in Figure 6(a). We run the simulated OSN with this SS and plot the actual DA over time, which is the blue line in Figure 6(a). It can be seen that the actual DA is fairly close to the calculated DA in most cases. These results suggest that the DA model is effective. In order to reveal the fundamental reason for this, we also compared the publishing time of the oldest stored data obtained from the DA model (the red line in Figure 6(b)) with the time of the oldest data that a friend tried to update when he came online at a time point (plotted in blue in Figure 6(b)). If the time of the oldest data that a friend tries to update is not earlier than the calculated time, the DA model is effective. As can be seen from Figure 6(b), the blue line is higher (i.e., the corresponding time is later) than the red line in most cases. This gives the fundamental reason why the DA model is effective; that is, with the SS obtained by the DA model, the online friends can in most cases store the data that a friend tries to update when he comes online.

Accuracy of the Predicted Number of Online Friends and the Impact of Online and Offline Durations.
As shown in Section 5, the predicted number of online friends determines the value of the on-the-fly DA. Therefore, we conducted experiments to evaluate the accuracy of predicting the number of online friends. The experimental scenario has been presented in the third paragraph of Section 7. The experimental results are shown in Figure 7. In Figure 7, the current time point is set to the 31st min and the on-the-fly prediction predicts the number of online friends from the 31st min onwards, which is plotted in blue. The actual number of online friends from the 31st min onwards is plotted in green. Figures 7(a), 7(b), 7(c), and 7(d) show the results under different $\lambda_{on}$ and $\lambda_{off}$ (i.e., online and offline durations).
It can be seen from Figure 7(a) that, compared with the actual values, the prediction of the number of online friends is fairly accurate in the first 10 minutes. This shows the effectiveness and applicability of the proposed prediction method, since the prediction can be repeated on the fly as time elapses. By comparing Figures 7(a), 7(b), 7(c), and 7(d), we can see that the length of the accurate prediction decreases as the settings of $\lambda_{on}$ and $\lambda_{off}$ change from Figure 7(a) to Figure 7(d). These results indicate that the online and offline durations have an impact on the prediction accuracy. After carefully analyzing the changing trend of $\lambda_{on}$ and $\lambda_{off}$, it appears that the smaller of the online and offline durations (i.e., $\min(1/\lambda_{on}, 1/\lambda_{off})$) determines the length of the accurate prediction. The smaller the value of $\min(1/\lambda_{on}, 1/\lambda_{off})$, the shorter the length of the accurate prediction. This is because when $\min(1/\lambda_{on}, 1/\lambda_{off})$ is smaller, the friends change state more frequently and, consequently, it is more difficult to predict accurately further into the future.

Figure 8 presents the experimental results that show the accuracy of the on-the-fly prediction of DA. The experimental settings in Figure 8 are the same as those in Figure 7. It can be seen that the trends shown in Figure 8 are consistent with those in Figure 7. This once again shows the effectiveness of the on-the-fly prediction.

Conclusions
This paper proposes a data availability model over storage capacity for DOSNs. Further, a novel method is proposed to predict the data availability on the fly. Extensive simulation experiments have been conducted. The results show that the proposed data availability model is able to capture the relation between data availability and storage capacity effectively, and that the on-the-fly prediction method can predict the level of data availability accurately.
This work is situated at the level of maintaining the data availability. How to optimize the data access performance and reduce the data maintenance overhead is the task of the underlying data replication and placement strategies. In the future, we plan to move down to that level of data management in DOSNs and develop strategies for placing data replicas among friends. When designing the placement strategies, the attributes of individual friends, such as the bandwidth and latency associated with a friend and the storage capacity contributed by a friend, will be taken into account.