Collaboration metrics from the UF network data, and their evolution

We used publicly available data on UF publications and grants to extract networks of collaborations among UF researchers in 2008-2012. This allowed us to define network metrics on collaboration, which can be used to evaluate and monitor the activity of specific research institutes. As an example, we apply some of these collaboration metrics to the UF Clinical and Translational Science Institute.

Part I: Combining Networks From Publication And Grant Activity

Collaborations on publications and grants among UF researchers generate two different social networks. Their union gives us the most accurate and comprehensive picture of the UF scientific network available in the data. In the “union” network, relations can represent collaborations on a grant, a publication, or both. The position of specific actors, such as patent authors, and the relations among them, can be highlighted in the network.

Part II: Collaboration Metrics Based On Network-Level Measures Of Cohesion

Measures of network cohesion and distance show that the UF scientific network became more cohesive, that is, the overall tendency toward scientific collaboration in the university increased over 2008-2012. During the same years, the CTSI came to cover an increasingly large part of the UF network, and average distances between UF scientists gradually decreased.
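
As an illustration of how such metrics can be computed, the following minimal sketch uses the igraph package in R; the collaboration network object g (an undirected igraph graph for one year of collaborations) and its construction are assumptions made only for this example.

    library(igraph)

    # Minimal sketch: network-level cohesion and distance measures, assuming 'g'
    # is an undirected igraph object representing one year of UF collaborations.
    edge_density(g)                    # density: share of possible ties that are present
    transitivity(g, type = "global")   # global clustering coefficient
    mean_distance(g)                   # average shortest-path distance between reachable pairs
    components(g)$no                   # number of disconnected components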

Part III: Collaboration Metrics Based On Node-Level Measures Of Centrality

Different metrics can be used to assess the centrality of researchers in the UF scientific community. Degree centrality is the number of collaborators a researcher has, that is, their degree of connectedness to the network. Betweenness centrality is a measure of brokerage, as it quantifies the extent to which a researcher falls on the network paths between other researchers and acts as a bridge between separate areas of the network. Closeness centrality measures the extent to which a scientist is close to every other actor and can easily reach the rest of the network. Researchers in the CTSI show higher average values on all these measures in 2008-2012, in both the publication and grant networks, compared to researchers outside the CTSI. In other words, the CTSI gathers some of the most central UF scientists, who tend to be better connected to the university scientific network, more often located in brokering positions between different research groups, and able to more easily reach other UF scientists to whom they are not directly connected.
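
A minimal sketch of these three measures, using igraph in R, is shown below; the network object g and the logical vertex attribute ctsi (TRUE for CTSI members), used to illustrate the group comparison, are assumptions for this example.

    library(igraph)

    # Minimal sketch: node-level centrality measures on a collaboration network 'g'
    deg <- degree(g)                          # number of collaborators
    btw <- betweenness(g, normalized = TRUE)  # brokerage: presence on paths between others
    clo <- closeness(g, normalized = TRUE)    # reachability: inverse average distance to others

    # Compare average centrality of CTSI vs. non-CTSI researchers, assuming a
    # logical vertex attribute 'ctsi' is available on the network.
    tapply(deg, V(g)$ctsi, mean)
    tapply(btw, V(g)$ctsi, mean)
    tapply(clo, V(g)$ctsi, mean)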

Part IV: Collaboration Metrics Based On Diversity Of Actor Attributes

In our scientific networks, the neighborhood of a researcher is the set of all their collaborators in a given year. Thus, the degree of department, college, or discipline diversity in a researcher’s neighborhood measures the extent to which that researcher engages in interdisciplinary research. CTSI researchers show higher neighborhood diversity on average over 2008-2012, which means that the CTSI has functioned as a hub for interdisciplinary research at UF in recent years. This is confirmed by the higher number of authors per publication and investigators per grant in the CTSI, compared to the rest of the university. Finally, the CTSI exhibits a prevalence of “open triads” of researchers, as opposed to closed triads, over the years. As open triads tend to connect separate and distant areas of the network, this structural feature also suggests an increasing diversity of backgrounds, methods, and substantive topics in CTSI research activities.
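
One way these two ideas can be quantified is sketched below with igraph in R: Blau's heterogeneity index for department diversity in each researcher's neighborhood, and the local clustering coefficient as an (inverse) indicator of open triads. The network object g and the vertex attribute dept are assumptions made for illustration.

    library(igraph)

    # Minimal sketch: diversity of department affiliations in each researcher's
    # neighborhood (Blau's heterogeneity index), assuming a vertex attribute 'dept'.
    blau <- sapply(seq_len(vcount(g)), function(v) {
      depts <- V(g)$dept[as.integer(neighbors(g, v))]
      if (length(depts) == 0) return(NA)
      1 - sum((table(depts) / length(depts))^2)
    })

    # Local clustering coefficient: low values indicate a prevalence of open
    # (rather than closed) triads around a researcher.
    local_clust <- transitivity(g, type = "local", isolates = "zero")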

Designing a network intervention on the UF scientific network

Social Network Analysis is not just about describing and explaining networks; it is also about using networks for specific interventions with specific goals. Network interventions have traditionally been designed and tested in the health sciences to block contagions or spread healthy practices, and in management and business administration to improve the economic performance of organizations. We explored ways of intervening on research collaboration networks to improve the scientific productivity of a university.

Part I: Why Intervene On A Scientific Network

A scientific network is like a brain: new connections create new ways for the whole system to think and operate. However, when researchers are simply encouraged to pick collaborators for a project (for example, by a traditional Request for Applications from a research funding agency), they tend to collaborate within their comfort zone, that is, to reproduce existing connections in the network by working with previous or otherwise close collaborators. Implementing a network intervention on a scientific network means creating new links that cross scientific comfort zones. Moreover, researchers who create links, that is, collaborations, in a scientific network are rarely aware of the whole network structure. On the other hand, by mapping this structure and creating network-aware collaborations, we can create specific new links that enhance specific properties of a university’s scientific network.


Part II: A Survey To Collect Researchers’ Reactions

Based on the structure of the UF scientific network, and on specific network criteria, we identified new links (collaborations) that would be potentially successful, foster interdisciplinary exchange, and enhance certain properties of the whole network. We interviewed the potential collaborators we identified, and collected their views on the possibility of successfully working with each other.

A social network analysis of scientific collaborations at the University of Florida

Social Network Analysis (SNA) is a methodology and a theoretical perspective that studies patterns of relations among actors. When applied to the network of scientific collaborations at a university, SNA can provide many insights on the structure and the evolution of its scientific community.

Part I: Defining And Constructing The Social Network Of Scientific Collaborations At University Of Florida

A social network is a set of actors and the relations among them. Defining actors as researchers at UF, and relations as professional collaborations between researchers, we can use publicly available data on publications and grants to visualize the social network of scientific collaborations at UF over the years. This reveals the structure of the UF scientific community, and its interaction with formal organizations and institutional boundaries such as departments, institutes, and academic units.
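
As a minimal illustration of this construction (using the igraph package in R, not necessarily the tools used in the project), a co-authorship network can be built from a table of paper-author pairs; the data frame pubs and its columns paper_id and author_id are hypothetical stand-ins for the actual publication data.

    library(igraph)

    # Minimal sketch: build a co-authorship network from a hypothetical table
    # 'pubs' with one row per (paper_id, author_id) pair.
    author_paper <- graph_from_data_frame(pubs[, c("author_id", "paper_id")],
                                          directed = FALSE)
    V(author_paper)$type <- V(author_paper)$name %in% pubs$paper_id  # TRUE = paper node

    # Project the two-mode (author-by-paper) graph onto authors: two researchers
    # are tied if they co-authored at least one paper.
    coauthors <- bipartite_projection(author_paper, which = "false")
    plot(coauthors, vertex.size = 3, vertex.label = NA)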


Part II: Visualizing Group And Actor Characteristics In Collaboration Networks

Once we have network data on the UF scientific community, we can use SNA to visualize and analyze: specific kinds of collaboration (e.g. publications vs grants); the position and centrality of particular departments, centers or institutes within UF’s scientific network; individual characteristics of UF researchers, be they network properties (e.g. actor centrality) or non-network attributes (e.g. a researcher’s number of publications); and the evolution of the UF scientific network over the years. SNA methods also allow us to detect cohesive subgroups (“communities”) of researchers who tend to work together in the university. Furthermore, the UF network can be aggregated from the individual level of researchers to the collective level of UF organizations, so as to visualize networks of collaborations among UF departments, institutes or academic units.
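
The last two operations can be sketched as follows with igraph in R; the researcher-level network g and the vertex attribute dept (department affiliation) are illustrative assumptions.

    library(igraph)

    # Minimal sketch: cohesive subgroups and department-level aggregation,
    # assuming a researcher-level network 'g' with a vertex attribute 'dept'.
    comm <- cluster_louvain(g)        # detect cohesive subgroups ("communities")
    sizes(comm)                       # community sizes

    # Aggregate researchers into their departments to obtain a
    # department-by-department collaboration network.
    dept_ids <- as.integer(factor(V(g)$dept))
    g_dept <- contract(g, mapping = dept_ids, vertex.attr.comb = "first")
    g_dept <- simplify(g_dept, edge.attr.comb = "sum")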

Creative Solutions To Elusive Data: Web-Scraping Online Police Reports To Map Co-Offending Networks In US Cities

Publication Date: Wednesday, May 30, 2018

One truth new researchers quickly discover is that data collection is costly. In the social sciences, researchers expend copious amounts of time and grant money observing people, interviewing them, or gathering records about their characteristics and behaviors.

The more bountiful the fruits of our labor, the more we guard them from other researchers[1]. Data collection costs are particularly high in certain fields. Criminological data, for example, typically contain sensitive information on criminal behavior and victimization, which makes them more highly protected and at times inaccessible. However, with recent developments in data science and computing, these once-elusive data have become much more accessible, provided you have the tools and the know-how[2]. This article shows an example of how free software tools can be used to scrape criminological data from the web to study crime and victimization patterns in US cities.

The tool, in this case, is the R project, a completely free, open-source software environment and programming language designed for statistical computing and graphics. The community of R users has been expanding exponentially in recent years[3]. This has led to the development of a vast array of freely available “add-on” packages to perform tasks which move far beyond the scope of other statistical software such as Stata, SAS, and SPSS. The sweeping versatility of R relative to its competitors allows for creative solutions to collecting traditionally elusive crime data.

One of the traditional approaches to examining crime is the analysis of official records. Typically this would require that we approach police departments, prisons, and other criminal justice agencies in the hope that we might be granted access to the necessary data. However, in the age of the Internet many criminal justice agencies publish information online. One well-known example is the Uniform Crime Reports (UCR), published annually by the Federal Bureau of Investigation (FBI). The UCR includes swathes of information on nationwide crime, law enforcement deaths and assaults, and hate crimes.

While a lot of information can be gained from national data sources such as the UCR, many criminological researchers are interested in crime at a more local level. Unfortunately, this is the level at which crime statistics become more sensitive and difficult to access. That said, many local police departments have been quietly publishing official records on individual arrests, incidents, citations, ordinance violations, and traffic accidents to the ‘daily bulletin’ boards found on their respective websites. The result is an extraordinary wealth of untapped local crime data available online, in some cases going back as much as a decade. R and its packages can be used to “web scrape” these data sources.

Web scraping is an automated, systematic approach to extracting and refining online information. In the social sciences it is most often used to collect data from social media and networking websites such as Twitter, Facebook, LinkedIn and the like. Open source software such as MassMine and R packages such as twitteR have helped streamline data scraping, becoming popular tools for downloading tweets, statuses, and posts[4]. While these tools target social networking websites, web scraping targets the underlying text-based HTML code composing both these and other websites. R functions can recognize patterns in HTML code (such as HTML start and end tags), extract raw text from HTML pages, and convert that text into a dataset that is amenable to statistical analysis.
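
As a minimal illustration of the idea (not the exact code used in this project), the base-R sketch below pulls the raw HTML of a daily-bulletin-style page and turns its table cells into a data frame; the URL and the five-column record layout are assumptions made only for the example.

    # Minimal sketch: scrape a hypothetical 'daily bulletin' table with base R.
    url  <- "https://www.example-police-dept.gov/daily-bulletin"   # hypothetical URL
    html <- paste(readLines(url, warn = FALSE), collapse = "\n")

    # Extract the contents of every table cell (<td>...</td>) from the raw HTML
    cells <- regmatches(html, gregexpr("<td[^>]*>(.*?)</td>", html, perl = TRUE))[[1]]
    cells <- trimws(gsub("<[^>]+>", "", cells))    # strip remaining tags

    # Reshape into one row per incident, assuming five cells per record
    incidents <- as.data.frame(matrix(cells, ncol = 5, byrow = TRUE),
                               stringsAsFactors = FALSE)
    names(incidents) <- c("date", "offense", "address", "officer", "case_id")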

We used base R in conjunction with RSelenium, a package designed to automate website navigation, to scrape police incident data from the Gainesville Police Department website and six other US police departments (Union County PD, South Carolina; Wilmington PD, North Carolina; Cedar Hill PD, Texas; Cleveland County PD, North Carolina; Concord PD, North Carolina; Wood County PD, Ohio). The raw data include roughly 1.6 million incidents from 2007 to 2017, with information on the individuals and locations involved in arrests, citations, summons, ordinance violations, victimizations, and traffic accidents. Each observation also identifies the responding police officer and the time of the incident down to the minute. We focus on victimizations, that is, offenses committed against citizens or institutions in violation of the law and subsequently reported to the respective police departments by the victims, and on arrests, that is, offenders who have been formally apprehended[5] by the police for committing one or more of these offenses. Table 1 and Figure 1 show some of the information that can be obtained from the Gainesville, FL arrest and victimization data.
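
The sketch below shows, in broad strokes, how RSelenium can drive a browser through such a site; the URL, the CSS selector, and the page structure are hypothetical, and a locally installed Selenium server and browser driver are assumed.

    library(RSelenium)

    # Minimal sketch: automate navigation of a dynamic bulletin page.
    # Assumes a compatible browser and Selenium driver are installed locally.
    drv   <- rsDriver(browser = "firefox", verbose = FALSE)
    remDr <- drv$client

    remDr$navigate("https://www.example-police-dept.gov/daily-bulletin")  # hypothetical URL
    Sys.sleep(2)                               # allow dynamically loaded content to render
    page_html <- remDr$getPageSource()[[1]]    # raw HTML for downstream parsing

    # Click a 'next page' link, if present (CSS selector is hypothetical)
    next_btn <- remDr$findElement(using = "css selector", value = "a.next-page")
    next_btn$clickElement()

    remDr$close()
    drv$server$stop()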

Table 1. Characteristics of Arrests and Victimizations in Gainesville, FL between 2007 and 2017

Figure 1. Monthly Arrests and Victimization in Gainesville, FL between May 2007 and April 2017

In addition to traditional statistical description and modeling of crime patterns, these data can be used to map ‘co-offending networks’ – networks of people who commit crimes together. It is well known that crime is often committed by groups[6]: it is a form of human interaction, and it can be analyzed as a social network[7]. Two individuals are co-offenders if they are arrested during the same incident: in a social network this produces a link between them. As an example, we constructed a co-offending network using the decade of arrest data scraped from the Gainesville Police Department website. This network consists of 34,822 people (nodes) distributed across 28,887 disconnected components. The largest component (Figure 2) consists of 834 people[8]. This could be a group of interrelated gangs, or groups of University students co-offending with their peers. In this figure, node size represents the number of victimizations experienced: the larger the node, the more victimizations. Arrests are represented by node color, with ‘hotter’ colors indicating a greater number of arrests. This allows us to examine the ‘victim-offender overlap’, the consistently replicated observation that victims of crime are often known offenders[9]. Co-offending networks can also be used to identify subgroups, such as gangs or crime families; detect the most central offenders, who participate in criminal activities with many other partners; and examine how crime “partnerships” and groups form.
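
A minimal sketch of this construction with igraph in R is shown below; the data frame arrests and its columns incident_id and person_id are hypothetical stand-ins for the scraped records.

    library(igraph)

    # Minimal sketch: build a co-offending network from arrest records, assuming
    # a data frame 'arrests' with one row per (incident_id, person_id) pair.
    pair_list <- lapply(split(arrests$person_id, arrests$incident_id),
                        function(p) if (length(p) > 1) t(combn(as.character(p), 2)))
    edges  <- do.call(rbind, pair_list)            # one row per co-arrested pair
    co_net <- simplify(graph_from_edgelist(edges, directed = FALSE))

    # Disconnected components and the largest one
    comp    <- components(co_net)
    largest <- induced_subgraph(co_net, which(comp$membership == which.max(comp$csize)))
    vcount(largest)   # number of people in the largest co-offending component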

The potential of web scraping is limited only by the information available online. Whenever data are embedded in a website’s HTML, it is possible to use web scraping techniques to access and refine those data. Consequently, web scraping need not be limited to a single source or type per project. For example, we plan to scrape hourly weather data from Weather Underground and merge them with the arrest data. Research has long found that violent crime, including assault, domestic violence, and to a lesser extent homicide, becomes more frequent as temperature increases[10]. The arrest and weather data obtained via web scraping could allow us to examine the relationship between weather and crime in more depth. The Internet also houses information that could let us examine cultural events (e.g. movie openings) and victimization by scraping data from websites such as boxofficemojo.com, or natural disasters and looting by incorporating data from the Federal Emergency Management Agency (FEMA). The knowledge from such projects could help local law enforcement better plan their activities around weather forecasts and expected foot traffic from cultural events, predict looting targets following hurricanes, and more.

R-supported web-scraping techniques provide wide access to vast quantities of up-to-date information on an almost unlimited range of topics. As an affordable and easily reproducible mode of data collection, they have the potential to drastically transform the way we do research about crime and human interactions[11].

Figure 2. Gainesville, FL Co-offending Network from May 2007 to April 2017


[1] Kitchin, R. (2014). The Data Revolution: Big Data, Open Data, Data Infrastructures & Their Consequences. Washington, DC: SAGE

[2] And, of course, the approval of your Institutional Review Board.

[3] Tippmann, S. (2015). Programming tools: Adventures with R. Nature News, 517(7532), 109.

[4] Thomson, R. and Vacca, R. (2018) Collecting and Analyzing Big Data on a Small Budget. Bureau of Economic and Business Research. Retrieved from: https://www.bebr.ufl.edu/survey/website-article/collecting-and-analyzing…

[5] But not necessarily charged or sentenced.

[6] Warr, M. (2002) Companions in Crime: The Social Aspects of Criminal Conduct. New York, NY: Oxford University Press

[7] Papachristos, A. V. (2014) The Network Structure of Crime. Sociology Compass, 8(4): 347-357

[8] This is substantially more than the next largest component, which only consists of 28 nodes.

[9] Jennings, W. G., Piquero, A. R., and Reingle, J. M. (2012) On the overlap between victimization and offending: A review of the literature. Aggression and Violent Behavior, 17: 16-26

[10] Cohn, E. G. (1990) Weather and Crime. British Journal of Criminology, 30: 51-64

[11] Munzert, S., Rubba, C., Meißner, P., and Nyhuis, D. (2015) Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. West Sussex, UK: Wiley

Author(s): Smith, Thomas Bryan; Vacca, Raffaele

The Structure, Evolution and Interaction of Multiplex Networks of Scientific Collaboration at a Research University

The aim of this paper is to contribute to the understanding of the structural evolution of scientific collaboration networks. A large body of literature has focused on the structure and evolution of co-authorship networks, typically examining networks within a specific discipline, but spanning different academic organizations. By contrast, this paper narrows its focus to a single academic organization (the University of Florida), but expands the network boundary in two ways: including collaborations among scientists in many different disciplines; and examining three dimensions or layers of scientific collaboration, namely, co-authorship on peer-reviewed scientific articles, co-participation in awarded grants, and co-membership in PhD/Master committees.

Collecting data over a five-year window (2011-2015), we obtain a longitudinal multiplex network with three layers (publications, grants, committees). We analyze the structure of this network by examining the evolution of its global and local properties, in order to shed light on its stochastic formation process and on the role played by individual investigators.

First, we study the network community structure of each layer, and the extent to which community membership is explained by factors such as disciplinary affiliation and workplace location. Results show that intra-department relations are as important as inter-department relations for community formation in the three layers, with department affiliations predicting approximately 50% of the community structure over time. However, we find a high degree of heterogeneity across network communities: publication communities predict 45% and 30% of community memberships in the grant and committee layers, respectively. This finding suggests that each dimension of collaboration only partially influences the others, and that different mechanisms may drive connectivity in different layers.
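
One common way to quantify this kind of agreement (a sketch, not necessarily the measure used in the paper) is to compare a detected community partition with department affiliations, or with the communities of another layer, via normalized mutual information in igraph; the layer objects g_pub and g_grant and the attribute dept are hypothetical.

    library(igraph)

    # Minimal sketch: agreement between detected communities and other partitions.
    comm_pub   <- cluster_louvain(g_pub)
    comm_grant <- cluster_louvain(g_grant)

    # Agreement between publication communities and department affiliations
    compare(membership(comm_pub), as.integer(factor(V(g_pub)$dept)), method = "nmi")

    # Agreement between publication and grant communities (same vertex set assumed)
    compare(membership(comm_pub), membership(comm_grant), method = "nmi")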

Second, we probe the topological weaknesses of the layers to assess the role of single scholars in connecting different areas of the network. We find that the co-authorship and committee network structures are somewhat similar: they appear to gradually converge toward a power-law degree distribution, with a network architecture sustained by interlinked “stars”, which for the co-authorship network is consistent with a small-world model. By contrast, the grant network shows a core-periphery structure. By testing different breakdown scenarios, we conclude that only the committee layer presents a highly resilient architecture, while network connectivity in the other two layers is strongly dependent on the presence of a few hub investigators. This finding has significant implications for academic research policy, suggesting that academic research networks would benefit from a system of incentives for highly connected scholars to i) remain in the university, maintaining an efficient network of collaborations; and ii) increase the involvement of their collaborators in research projects, in order to reduce the dependency of the overall network on their own work. A number of inferential tests and heuristic methodologies are implemented to assess the robustness of our findings.
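
The sketch below illustrates one such breakdown scenario with igraph in R: hub investigators are removed in decreasing order of degree, and the share of nodes remaining in the largest connected component is tracked. The collaboration layer g and the choice of removing up to 50 hubs are assumptions made for the example.

    library(igraph)

    # Minimal sketch: targeted breakdown scenario on a collaboration layer 'g'.
    # Remove the highest-degree hubs one by one and track the size of the
    # largest connected component relative to the full network.
    removal_order <- order(degree(g), decreasing = TRUE)
    giant_share <- sapply(0:50, function(k) {
      h <- if (k == 0) g else delete_vertices(g, removal_order[1:k])
      max(components(h)$csize) / vcount(g)
    })
    plot(0:50, giant_share, type = "l",
         xlab = "Number of hubs removed",
         ylab = "Share of nodes in the largest component")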
