In recent years, researchers and data analysts have increasingly used online social network data to study human behavior. Before such study can begin, one must first obtain appropriate data. This process poses many challenges: e.g., a platform may provide a public API for accessing data, but such APIs are often rate limited, restricting the amount of data that an individual can collect in a given amount of time. Thus, to collect data efficiently, the data collector needs to make intelligent use of her limited API queries. The network science literature has proposed numerous network crawling methods, but it is not always easy for the data collector to select an appropriate one: methods that are successful on one network may fail on others.
In this work, we demonstrate that the performance of network crawling methods is highly dependent on the structural properties of the network. To do so, we perform a detailed, hypothesis-driven analysis of the performance of eight popular crawling methods with respect to the task of maximizing node coverage. We perform experiments on both directed and undirected networks, under five different query response models: complete, paginated, partial, in-out, and out responses. We identify three important network properties: community separation, average community size, and average node degree. We begin by performing controlled experiments on synthetic networks, and then verify our observations on real networks. Finally, we provide guidelines to data collectors on how to select an appropriate crawling method for a particular network.
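To make the node-coverage objective concrete, the following minimal sketch (hypothetical code, not the paper's implementation) crawls a toy undirected graph under a fixed query budget, assuming the complete response model, in which querying a node reveals its full neighbor list. Coverage is the fraction of all nodes observed so far; the graph, seed, and budget are illustrative assumptions.

```python
from collections import deque

# Toy undirected graph as adjacency lists (hypothetical example data).
GRAPH = {
    0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4],
    4: [3, 5], 5: [4, 6], 6: [5, 7], 7: [6],
}

def bfs_crawl(graph, seed, budget):
    """Crawl with at most `budget` queries; each query on a node returns
    its full neighbor list (the 'complete' response model)."""
    queried, observed = set(), {seed}
    frontier = deque([seed])
    while frontier and len(queried) < budget:
        node = frontier.popleft()
        if node in queried:
            continue
        queried.add(node)
        for nbr in graph[node]:  # complete response: all neighbors revealed
            if nbr not in observed:
                observed.add(nbr)
                frontier.append(nbr)
    return observed

covered = bfs_crawl(GRAPH, seed=0, budget=3)
coverage = len(covered) / len(GRAPH)  # fraction of nodes observed
```

A different crawler (e.g., a random walk) would expand a different set of nodes for the same budget, which is precisely why method choice matters on networks with strong community structure.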
The copyright of the published articles stays with the authors.