Three ways to estimate the size of on-line social networks
The immense size of on-line social networks and their operators’ restrictions on user bandwidth makes actually counting the number of users each serves prohibitively expensive if not impossible, which in turn makes it difficult to measure the effectiveness of initiatives such as marketing campaigns. There are, however, three methods for estimating the number of users on the network. The maximum likelihood estimator, the mark and recapture method and the random walkers method all yield decent guesses at the user number, but the result of each is slightly different, and each is most appropriate for a slightly different purpose. Informed users should understand all three and know when to employ each for the best results.
Maximum likelihood estimator
Finding the size of a network with no information to go on is a task for maximum likelihood estimation. This works by requesting random users for the system and counting the number of duplications. Since, given a set number of requests, the likelihood of duplicates increases if there are fewer total users as opposed to a larger number, it is possible to compare the number of requests with the number of duplicates through a series of esoteric mathematical procedures and estimate the size of the whole network, in this case the total number of. While the “random” user furnished by the system may not be truly random in the mathematical sense, the difference is likely to be insignificant over the size of sample in question. Statistically the estimate is surprisingly good, especially if it is possible to make a large number of requests
Using maximum likelihood estimation does not require any advance information about the size of the network. The only capability needed estimate the probable number of users in a network is the ability to request random users. Almost all social networks will furnish a supposedly randomly selected user, making this one of the more widely applicable techniques. However, an experiment using Twitter’s “public timeline” feature demonstrates the need for care in selection of a sample. The timeline is restricted to users who a) have a custom user icon and b) are not “protected” and c) actually post status updates. While it is quite easy to access, the result obtained from using this data estimate the size of the network as something less than eight million users. Twitter itself claims to have on the order of forty million active users. Restrictions of the sample can lead to misleading results, or they can be quite useful to someone who wanted, for example, to estimate the number of Twitter users who actually have custom icons and who post updates.
Mark and recapture
Like maximum likelihood estimators, mark and recapture techniques are based on uniform sampling. Modeled on the same methods used by ecologists to estimate the wild populations of living organisms or the occurrence of certain medical conditions in a given population, this involves sampling the same group twice and then comparing the two results. The more overlap that exists in the two results, the smaller the sample is likely to be. While more sophisticated variations of the technique involve more than two samples, even this comparatively primitive operation yields useful results. Just as with the maximum likelihood estimation, however, users must take care that the sample represents the group to be counted and not some subset or superset of that population. Mark and recapture procedures can also give misleading estimates if the size of the population changes rapidly during the sampling procedure. Given the rapid growth of social networks, this is a factor for which users must allow when using mark and recapture to estimate network size.
Another experiment using Twitter’s “public timeline” took the list of users from one day and compared it to the previous day to gain an estimate of the size. For limited use such as a “snapshot” size at a particular period of time, this can be a useful technique. Using it to evaluate a quality of the sample over a long period, however, becomes more and more inaccurate as the size of the population changes.
When the information required for uniform sampling is unavailable, the size of a network can be estimated using random walkers. In this procedure, one user is selected, and then a member of that user’s friends list. Then a friend of a friend is selected, and so on. Since a social network may contain small areas that are unconnected to the larger whole, such as a family or a class whose students have no outside friends, at least not with those accounts, it is important to walk the Largest Connected Component. One must also allow for the nature of the friends process on each network, since some, like LiveJournal, allow unilateral friendships and others, like Facebook, allow friendship to exist only if both parties consent.
Random walking can be an expensive process, since the process requires crawling almost the whole of a network achieve an estimate. The process can be made faster, and thus somewhat less expensive, by the use of parallel processing starting from different nodes, and by optimizing the parts of the analysis program that are repeated most often. In the end, though, the time and expense mean that the use of random walkers is a technique best used only when others are unavailable.
Determining the size of a social network is sometimes a necessity. The accuracy and precision required of the estimate, the availability of data from the network and the technical and monetary resources available all have an effect on the choice of the best technique. In the end, that choice must be made by the user who needs the estimate.
Original Paper: Estimating the Size of Online Social Networks
Shaozhi Ye* Google Inc. 1600 Amphitheatre Pkwy Mountain View, CA 94043, USA Email: firstname.lastname@example.org *Corresponding author S. Felix Wu Department of Computer Science University of California, Davis One Shields Ave Davis, CA 95616, USA Email: email@example.com Abstract: The huge size of online social networks (OSNs) makes it prohibitively expensive to precisely measure any properties which require the knowledge of the entire graph. To estimate the size of an OSN, i.e., the number of users an OSN has, this paper introduces three estimators using widely available OSN functionalities/services. The rst estimator is a maximum likelihood estimator (MLE) based on uniform sampling. An O(logn) algorithm is developed to solve the estimator. In our experiments it is 70 times faster than the naive linear probing algorithm. The second estimator is mark and recapture (MR), which we employ to estimate the number of Twitter users behind its public timeline service. The third estimator (RW) is based on random walkers and is generalized to estimate other graph properties. In-depth evaluations are conducted on six real OSNs to show the bias and variance of these estimators. Our analysis addresses the challenges and pitfalls when developing and implementing such estimators for OSNs.