Big data social network analytics: Overview

Social network analytics (SNA) is the methodology used to examine how groups of people relate to each other. Big data analytics is a branch of computer science that extracts insight from massive datasets. How massive? Let's say big data is any dataset too large to fit on your computer. Big data social network analytics uses massive datasets to examine how very large groups of people relate to each other.

In this series of posts I'll explain some issues and solutions that are unique to big data SNA. I'll mention some of the basics of SNA and big data analytics, but the focus will really be on what arises at the intersection of the two.

A primer on social network analytics 

SNA is conducted in two main phases. First, we construct social network graphs. Social network graphs are composed of nodes and edges. Nodes are people. Edges are relationships. The social network graph is a bunch of people, and all of the relationships they have with each other. In the second phase of SNA, we examine how the structure of the social network graph influences how a person's behavior impacts the network. For example, say I'm connected to you, and you're connected to your friend, but I'm not connected to your friend. In this case, if information is to pass from me to your friend, you have to be the intermediary. To read more about social network analytics, click here.
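To make the "me, you, your friend" example concrete, here is a minimal sketch in Python. The three names and the helper functions build_graph and shortest_path are my own illustration, not part of any particular SNA library. It stores the graph as a dictionary of neighbours and uses a breadth-first search to find the route information would take.

```python
from collections import deque

# Hypothetical three-person network: "me" knows "you", "you" know "friend".
edges = [("me", "you"), ("you", "friend")]

def build_graph(edge_list):
    """Build an undirected adjacency map: person -> set of neighbours."""
    graph = {}
    for a, b in edge_list:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def shortest_path(graph, source, target):
    """Breadth-first search; returns the people on a shortest path, or None."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        person = queue.popleft()
        if person == target:
            # Walk back through the predecessors to rebuild the path.
            path = []
            while person is not None:
                path.append(person)
                person = previous[person]
            return path[::-1]
        for neighbour in graph.get(person, ()):
            if neighbour not in previous:
                previous[neighbour] = person
                queue.append(neighbour)
    return None

graph = build_graph(edges)
print(shortest_path(graph, "me", "friend"))  # ['me', 'you', 'friend']
```

The only route from "me" to "friend" runs through "you", which is exactly what makes "you" the intermediary in this tiny graph.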

A primer on big data analytics

Big data analytics is different from small-data analytics both in its computational requirements and in its statistical challenges. Regarding the computational requirements, when you are trying to analyze a dataset that is so large that it cannot even fit on a single computer, it can help if you get a few computers to work together to process the dataset. Frameworks such as Hadoop evolved to orchestrate this distribution of datasets and analyses across a large cluster of computers. Some of the statistical challenges of big data analytics stem from the processes that generate these massive datasets. Big datasets are usually not developed for the needs of analysts. They are usually the by-product of other, more boring, requirements close to the core value-offerings of large organizations. Therefore, a big data analyst needs to transform messy, unfocused data developed for other purposes into formats amenable to mathematical modeling. Read a deeper overview of big data analytics here.
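The idea of getting several computers to split up the work can be imitated on one machine. The sketch below is plain Python, not Hadoop's API, and the records and function names are made up for illustration. It deals a log out to a few pretend workers, lets each count its own share (the "map" step), and then merges the partial counts (the "reduce" step).

```python
from collections import Counter
from functools import reduce

def split_into_chunks(records, n_chunks):
    """Deal records into n_chunks roughly equal parts, one per pretend worker."""
    return [records[i::n_chunks] for i in range(n_chunks)]

def map_chunk(chunk):
    """Map step: count the messages sent by each person within one chunk."""
    return Counter(sender for sender, _receiver in chunk)

def merge_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right

# Made-up (sender, receiver) records standing in for a very large log.
records = [("ana", "bo"), ("bo", "cy"), ("ana", "cy"), ("cy", "ana"), ("ana", "bo")]

partial_counts = [map_chunk(chunk) for chunk in split_into_chunks(records, 3)]
total = reduce(merge_counts, partial_counts, Counter())
print(total["ana"])  # 3
```

On a real cluster each chunk would live on a different machine and the framework would handle moving the partial results around. The point here is only that the final answer does not depend on how the records were split.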

Overview of posts on big data social network analytics

In these seven posts, I will walk you from the start of a big data social network analysis to its end. In the process, I will highlight unique methodological difficulties and opportunities. The posts will cover getting data, querying data, filtering data, writing efficient algorithms, constructing the social graph, quantifying the role that nodes have in the graph, and visualizing massive datasets. That is:

First, you need to get data about people and their behaviors. In small data contexts you usually just go out and collect data. It is extraordinarily expensive to go out and collect data on hundreds of thousands of people. Usually, the mechanisms that generate big data sets don’t care about your analytics needs. In this post, we think about the characteristics required of a database for it to be amenable to social network analytics.

Second, you need to query these massive datasets. You certainly don’t want to use all of the data that is available to you. Problem is, the criteria used to tag massive datasets are horribly complex. Massive datasets usually grow out of the amalgamation of diverse processes. And, as such, you need diverse querying modes. In the SNA context you need a clear thing you’re trying to track. Only pull your data if that thing is occurring. In this post, I explain why you need to invest time understanding your dataset on its own terms.

Third, you need to filter your data. In statistics, there is an old adage: "Garbage In, Garbage Out." This means that even if your model is excellent, if your data is garbage, your results will be garbage. In small-data contexts, you tend to want to keep as much data as possible. In large-data contexts, you are OK dropping data. Lots of data. While in a small-data context dropping data tends to give you a more biased view of your results, in a big-data context dropping data, when you use the right criteria, can give you a clearer image of the process you’re trying to model. In this post, we discuss bad data, useless data, and how to filter them.
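As a rough illustration of dropping data on explicit criteria, here is a small Python sketch. The record fields ("sender", "receiver") and the three rules are assumptions I made up for the example, not the criteria from any real project. The point is that each reason for dropping a record is written down, so you can defend it later.

```python
def filter_records(records, blocked_senders=frozenset()):
    """Keep only records that can become a meaningful edge in the graph."""
    kept = []
    for record in records:
        sender = record.get("sender")
        receiver = record.get("receiver")
        if not sender or not receiver:
            continue  # bad data: one end of the relationship is missing
        if sender == receiver:
            continue  # useless data: a person interacting with themselves
        if sender in blocked_senders:
            continue  # useless data: e.g. an automated account
        kept.append(record)
    return kept

# Made-up records; "newsletter-bot" is a hypothetical automated sender.
raw = [
    {"sender": "ana", "receiver": "bo"},
    {"sender": "ana", "receiver": None},
    {"sender": "cy", "receiver": "cy"},
    {"sender": "newsletter-bot", "receiver": "bo"},
]
clean = filter_records(raw, blocked_senders={"newsletter-bot"})
print(len(clean))  # 1
```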

Fourth, you need to write efficient social network computational algorithms. There are many ways to compute relevant statistics for billions of people. Not all are equally fast. And not all statistics are equally important. Pick a fast, and useful, algorithm. Just because you have terabytes of RAM at your disposal does not mean you get to be sloppy and write inefficient code. In this post, we discuss tips and tricks to write algorithms that produce results in less than a few days.

Fifth, only once your dataset is focused, small, and clean are you ready to engineer the features of SNA. We need to derive nodes and edges, à la traditional social network analysis. The basics of this step are not too different between small and large data SNA.
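Deriving nodes and edges can be sketched in a few lines of Python. I assume here that the cleaned data is a list of (person, person) interaction pairs and that the direction of an interaction doesn't matter. Both are simplifying assumptions for the example, and derive_graph is a hypothetical helper, not a library function.

```python
from collections import Counter

def derive_graph(interactions):
    """Turn interaction pairs into a node set and weighted, undirected edges."""
    nodes = set()
    edge_weights = Counter()
    for a, b in interactions:
        if a == b:
            continue  # a self-interaction is not a relationship
        nodes.update((a, b))
        # Sort the pair so ("bo", "ana") and ("ana", "bo") count as one edge.
        edge_weights[tuple(sorted((a, b)))] += 1
    return nodes, edge_weights

# Made-up interactions between three people.
interactions = [("ana", "bo"), ("bo", "ana"), ("bo", "cy"), ("cy", "cy")]
nodes, edge_weights = derive_graph(interactions)
print(sorted(nodes))                # ['ana', 'bo', 'cy']
print(edge_weights[("ana", "bo")])  # 2
```

The edge weight is simply how many times two people interacted. Whether that is the right notion of a relationship depends on what you are trying to track.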

Sixth, now it's time to bring your features together with some kind of machine learning model. To do this well you need good SNA technique. Big data engineering prowess won’t save you here. This step is all about solid computational sociology thinking. In this post, we work through a few examples of how big data social network analysts quantify the dynamics of nodes in the graphs that they are part of.
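One simple and widely used way to quantify a node's role is degree centrality: the share of other people in the graph that a person is directly connected to. The Python sketch below is just one possible measure, picked for illustration, and the names and edges are made up.

```python
def degree_centrality(edges):
    """Return person -> fraction of the other people they are connected to."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    n = len(neighbours)
    if n < 2:
        return {person: 0.0 for person in neighbours}
    return {person: len(friends) / (n - 1) for person, friends in neighbours.items()}

# Made-up graph in which "bo" is connected to everyone else.
edges = [("ana", "bo"), ("bo", "cy"), ("bo", "di")]
scores = degree_centrality(edges)
print(scores["bo"])  # 1.0
```

A score like this can then be fed into a model as one feature among many. It says how connected someone is, not how important they are, so it is a starting point rather than an answer.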

Seventh, you need to visualize your results in a way that other people can understand. An Excel spreadsheet with 120 million rows won’t tell a story on its own. The trick here is that not all of the data that needs to be included in the analysis needs to be visualized for the analysis to be interpretable. In this post, I show how displaying fewer results can produce more insight, the kind that highlights the core value offerings of your analysis.

Coda

These posts come out of hands-on work with a successful big data social network analytics group that I have the privilege of working with. That being said, if you want to comment, celebrate, berate, or append something to any of these posts, don't hesitate to reach out :) Enjoy!