Collecting and Analyzing Big Data on a Small Budget

[Image: Computer chip on top of American coins]
Publication Date: Wednesday, January 3, 2018
  • Ryan Thomson, Doctoral Candidate, Dept of Sociology and Criminology & Law, University of Florida
  • Raffaele Vacca, PhD, Dept of Sociology and Criminology & Law, University of Florida

The revolution of informatics, Big Data, and computational social science[1] has reached every corner of the academy and beyond. Within the social sciences and humanities, many researchers are eager to engage with Big Data[2] but often lack the financial means to purchase the prepackaged datasets designed for marketing firms. How can we get low-cost access to the large amounts of public raw data on online activity? This problem has been taken up by the open science movement, which seeks to make data, statistical programs, findings, and scholarship available to everyone. Thanks to the dedication of its members, Big Data from the web are increasingly within reach for scholars of all types.

Over the past twenty years, the R programming language for statistical computing and its large user community have played a central role in the open science movement. R is open source (part of the GNU project) and freely available, with thousands of add-on packages. As a result, R has become incredibly popular, surpassing both Stata and SAS as the leading software for data analysis on several metrics.[3] While specific R packages for data extraction and mining have also grown in popularity, most of them are specialized for a single source of information. One of the strongest of these is the twitteR package, developed by Jeff Gentry.[4]

Another software breakthrough for the collection of social media data came in 2014, when Aaron Beveridge, a former doctoral student in the University of Florida Department of English, developed and released Massmine with his colleague Nicholas Van Horn.[5] Massmine is a suite of Mac and Linux tools designed for academics without programming experience who want to collect social media data from many different sources. Massmine simultaneously expanded data collection across multiple social media platforms, greatly increased the size of collectable datasets, and decreased hardware requirements. The program outputs JSON data, which are easily converted into comma-separated (CSV) tables that can be opened in any spreadsheet software. Massmine runs in the background of personal computers (or servers), compiling social media data over days, months, or years. Both twitteR and Massmine have streamlined the technical API protocols for Twitter data established in the mid-2000s (Tab. 1).

Table 1 – A comparison of Massmine and the twitteR R package

Feature | Massmine | twitteR
Interface | Command line from the terminal | Package within R
Platforms | Twitter, Google Trends, Tumblr, Wikipedia, web scraping | Twitter
Output file type | .json | R data
API compatible | Yes | Yes
Max mining capacity | Unlimited (streaming); depends on hardware | 3,200 tweets
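The JSON-to-CSV conversion mentioned above can be sketched in a few lines of Python. This is a minimal illustration rather than Massmine's own tooling: it assumes the output holds one JSON object per line, and the field names (`created_at`, `text`) and the helper `json_lines_to_csv` are hypothetical placeholders.

```python
import csv
import io
import json


def json_lines_to_csv(json_lines, fields):
    """Convert newline-delimited JSON records to CSV text.

    Assumption: one JSON object per line. Keys missing from a record
    become empty cells; keys not listed in `fields` are dropped.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for line in json_lines.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        writer.writerow({name: record.get(name, "") for name in fields})
    return buffer.getvalue()


# Hypothetical records; real tweet objects carry many more fields.
sample = "\n".join([
    '{"created_at": "2017-10-30", "text": "Go Gators!", "lang": "en"}',
    '{"created_at": "2017-10-30", "text": "Hello, Gainesville"}',
])
csv_text = json_lines_to_csv(sample, ["created_at", "text"])
print(csv_text)
```

Choosing the columns explicitly keeps the table small, since a raw tweet record usually contains far more fields than a spreadsheet analysis needs.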

The other advance came in the form of hardware. A Raspberry Pi (Rpi) is a $35 single-board computer the size of a credit card. Massmine can run on an Rpi and collect terabytes of social media data.[6] Developed in the UK, the third-generation Raspberry Pi Model B has a 1.2 GHz 64-bit quad-core ARMv8 processor, 1 GB of memory, wireless LAN, Bluetooth Low Energy, HDMI, and four USB ports. An Rpi and its power cord cost about $45. This mini-computer runs a variety of operating systems (OS), and the standard Raspbian OS (available through the NOOBS installer) can support Massmine. Raspbian is installed on a microSD card: you can do this yourself, or buy a pre-loaded card for about $18. An Rpi case costs between $5 and $20, and most come with heat sinks or a fan. Many all-inclusive kits cost between $40 and $80, depending on the items included. When linked to an external hard drive (with an independent power source) and synced with Massmine, a Raspberry Pi can stream terabytes of information into an organized data server that can be controlled remotely from another computer.

What kind of Twitter data can be obtained with Massmine and a Raspberry Pi? To give a few examples, we used these tools to collect some demo Twitter data from Gainesville, Florida, home to the University of Florida. Massmine is extremely flexible: it extracts a variety of content and supports a wide range of research questions about Twitter activity and interactions. Everything from semiotics (the study of signs, symbols, and meaning-making) to geospatial and social network analysis of Twitter data is open for consideration. Where-On-Earth Identifiers (WOEIDs), for example, enable researchers to analyze place-specific patterns in Twitter usage and content. We might be interested in knowing what the primary Twitter client applications are in a large college town like Gainesville, FL (Fig. 1). Both twitteR and Massmine collect this type of information. In the demo data collected for this article, approximately one quarter of those tweeting in Gainesville were posting via a web browser, while one third used email posting or a third-party application. Also, the Twitter application for iPhones was used more than twice as often as its Android counterpart.

 

Figure 1 - Clients for the 300 Most Recent Tweets in Gainesville, Florida at 10:16am on 10-30-2017.
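A breakdown like the one in Figure 1 amounts to counting the client label attached to each tweet and dividing by the total. The sketch below assumes the client labels have already been extracted into a plain list; the labels and the helper `client_shares` are hypothetical and do not reproduce the article's data.

```python
from collections import Counter


def client_shares(sources):
    """Return (client, share) pairs sorted from most to least common."""
    counts = Counter(sources)
    total = sum(counts.values())
    return [(client, n / total) for client, n in counts.most_common()]


# Hypothetical client labels, one per tweet (not the article's data).
demo_sources = [
    "Twitter for iPhone", "Twitter for iPhone", "Twitter Web Client",
    "Twitter for Android", "Twitter for iPhone", "Twitter Web Client",
    "Twitter for iPhone", "Twitter for Android",
]
shares = client_shares(demo_sources)
for client, share in shares:
    print(f"{client}: {share:.0%}")
```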

Examining the activity and contents of specific accounts is also an option. For example, the official University of Florida Twitter account (@UF) has tweeted 39,864 times since it was created in June 2009. That’s an average of roughly 13 tweets a day. The account’s activity tends to be the highest during the Fall and Spring semesters, with a slight lull over the summer. UF’s favorite hashtags include #GoGators, #GatorDay, and #UFgrad among numerous other gator references (Tab. 2). Interesting information about discussion topics and seasonal trends of specific Twitter users or groups of users can be obtained in a similar way.
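The daily average quoted above can be checked with simple date arithmetic. The text does not give the exact creation day or the day the tweet count was taken, so the sketch assumes June 1, 2009 and October 30, 2017 (the date of the Figure 1 sample); both dates are assumptions.

```python
from datetime import date

TOTAL_TWEETS = 39_864                # figure reported in the text
ACCOUNT_CREATED = date(2009, 6, 1)   # assumption: exact day in June 2009 is not given
COUNT_TAKEN = date(2017, 10, 30)     # assumption: same day as the Figure 1 sample

days_active = (COUNT_TAKEN - ACCOUNT_CREATED).days
tweets_per_day = TOTAL_TWEETS / days_active
print(f"{days_active} days, {tweets_per_day:.1f} tweets per day")
```

Shifting either date by a few weeks changes the result only slightly, so the "roughly 13 tweets a day" figure is not sensitive to these assumptions.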

Table 2 - Hashtags in the University of Florida’s official tweets.

@UF Hashtag Frequencies

Hashtag | Frequency
GoGators | 248
Gatorday | 173
UFgrad | 107
UFbff | 34
Gators | 89
WeChomp | 75
Happybirthday | 53
Gatorgood | 40
Gatornation | 38
UF | 33
UF21 | 27
Chompville | 24
GatorsWin | 22
CWS | 20
ItsGreatUF | 18
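A frequency table like Table 2 can be produced by pulling hashtags out of the tweet texts and counting them. The sketch below is one simple way to do it, assuming the tweets are available as plain strings; the sample texts and the helper `hashtag_frequencies` are hypothetical, and the choice to ignore letter case is an assumption about how variants such as #GoGators and #gogators should be grouped.

```python
import re
from collections import Counter

# A hashtag is taken to be "#" followed by letters, digits, or underscores.
HASHTAG = re.compile(r"#(\w+)")


def hashtag_frequencies(tweets):
    """Count hashtags across tweet texts, ignoring letter case."""
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts


# Hypothetical tweet texts, not taken from the @UF timeline.
demo_tweets = [
    "Congrats to the class of 2017! #UFgrad #GoGators",
    "Game day in the Swamp #GoGators",
    "Happy #GatorDay from campus #gogators",
]
freq = hashtag_frequencies(demo_tweets)
for tag, n in freq.most_common():
    print(tag, n)
```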

Like other online social networks, Twitter is also a rich source of information about human interactions. These data can be analyzed and visualized using methods from network science and social network analysis. Twitter is one of the most commonly studied online social networks in the social, health and natural sciences because of its explosive growth in the past few years, the wide accessibility of its data, and its combination of social network, micro-blogging platform, and news media aspects.

Different types of interaction occur between Twitter users and can be analyzed as a network, including following, retweeting, and replying. By examining networks of Twitter interactions, scientists have studied how information and influence spread online; how the influence of Twitter users can be measured; and how influence and information diffusion are affected by network structure, including centrality, network communities, and subgraph patterns.[7] Fig. 2 shows a network of interactions among Twitter users over a single day. The visualization includes about 97,000 users (nodes), who are linked if one referenced or replied to another using a mention (@). The network was extracted from tweets posted on June 23, 2014, that contained at least one hashtag related to the 2014 Soccer World Cup (which was being held in Brazil in those weeks).

 

Figure 2 – A social network of mentions among Twitter users in a day.
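A mention network of the kind shown in Figure 2 can be represented as a list of directed links from the author of a tweet to each user mentioned in it. The sketch below builds such links and counts how many distinct users mention each account (in-degree, one simple centrality measure). It assumes tweets are available as (author, text) pairs; the sample data and the helpers `mention_edges` and `in_degree` are hypothetical and are not the procedure used to produce the figure.

```python
import re
from collections import Counter

# A mention is taken to be "@" followed by letters, digits, or underscores.
MENTION = re.compile(r"@(\w+)")


def mention_edges(tweets):
    """Build directed (author, mentioned_user) links from (author, text) pairs.

    User names are lowercased and self-mentions are dropped.
    """
    edges = []
    for author, text in tweets:
        for name in MENTION.findall(text):
            if name.lower() != author.lower():
                edges.append((author.lower(), name.lower()))
    return edges


def in_degree(edges):
    """Count the distinct users who mention each user."""
    return Counter(target for _, target in set(edges))


# Hypothetical tweets, for illustration only.
demo = [
    ("ana", "Great goal @luis! #WorldCup"),
    ("ben", "@luis what a match, right @ana?"),
    ("luis", "Thanks @ana and @ben"),
]
edges = mention_edges(demo)
degree = in_degree(edges)
for user, n in degree.most_common():
    print(user, n)
```

The resulting edge list can be loaded into network analysis software for visualization or for computing community structure and other measures.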

A Raspberry Pi micro-computer running software like Massmine has the potential to alter the scope and rate of knowledge production in the social sciences and humanities, opening the world of Big Data to the public. Rather than relying on costly third-party services and specialized marketing data, scholars and community organizations now have the DIY tools necessary for accessing data and knowledge on online discourses, behaviors and interactions. At a time in which an increasingly large part of human, social and political life happens online, access to Big Data has also become increasingly important. Innovations such as Massmine and Raspberry Pi deserve credit as groundbreaking extensions of the open science movement.

 

[1] Lazer, D., Pentland, A., Adamic, L., Aral, S., Barabási, A.-L., Brewer, D., … Alstyne, M. V. (2009). Computational Social Science. Science, 323(5915), 721–723.

[2] Lazer, D., & Radford, J. (2017). Data ex Machina: Introduction to Big Data. Annual Review of Sociology, 43(1), 19–39.

[3] Tippmann, S. (2015). Programming tools: Adventures with R. Nature News, 517(7532), 109.

 Muenchen, R. (2016). The Popularity of Data Analysis Software.

[5] Van Horn, N. M., & Beveridge, A. (2016). MassMine: Your Access To Data. The Journal of Open Source Software, 1(8), 50. Also see: Massmine

[7] Cha M, Haddadi H, Benevenuto F, Gummadi PK. Measuring User Influence in Twitter: The Million Follower Fallacy. ICWSM. 2010;10(10-17):30.

 Kwak H, Lee C, Park H, Moon S. What is Twitter, a social network or a news media? ACM Press; 2010 [cited 2015 May 20]. p. 591.

 Takhteyev Y, Gruzd A, Wellman B. Geography of Twitter networks. Social Networks. 2012 Jan;34(1):73–81.
