Big data social network analytics: getting data

The most basic problem in performing social network analytics on big data is … getting big data.

Most big datasets are not collected with an eye to the needs of a social network analyst. SNA is an obscure, niche technique. Most big datasets exist to solve critical, often boring, business problems close to the core infrastructure of an organization. Collecting and storing big datasets is expensive, so organizations only do it when their very existence depends on it. We'll see a few examples below.

For a big dataset to be amenable to SNA, it needs to have the basic building blocks of any SNA technique: nodes and edges. You need to be able to track the identity of nodes and edges across the various records of a dataset. And if you want to integrate several datasets into one analysis, you need to be able to track the identity of nodes and edges across datasets.

Identity of nodes

An identifiable node is an entity with distinctive characteristics that allow you to track it within and between datasets. The larger the dataset, the more detail you will need to uniquely track the identity of an entity.

Let's work with a straightforward example: a person. A person's name is often not enough to uniquely identify them in a big dataset. When you're dealing with small data, an incompletely written-out name may be sufficient to track an individual. Perhaps there is only one J. Smith in your dataset. But in a big-data setting, where the processes we track span entire nations if not the whole globe, there are many, many J. Smiths. There are many Jonathan Smiths, and maybe even more than one Jonathan A Smith. But perhaps there is only one Jonathan A Smith with an MD from Emory in 1988. This level of extreme detail is necessary to uniquely identify nodes in big datasets.
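
To make this concrete, here is a minimal sketch in Python of treating a node's identity as a composite key rather than a bare name. The field names and values are hypothetical; in practice you would use whatever attributes your dataset actually carries.

```python
from dataclasses import dataclass

# Hypothetical record layout: these fields are placeholders for whatever
# attributes your dataset actually provides.
@dataclass(frozen=True)
class PersonKey:
    name: str
    degree: str
    institution: str
    grad_year: int

# "J. Smith" alone collides with many records; the composite key is far
# more likely to be unique.
key = PersonKey(name="Jonathan A Smith", degree="MD",
                institution="Emory", grad_year=1988)

# frozen=True makes the key hashable, so it can index dicts or sets of nodes.
nodes = {key: {"specialty": "gastroenterology"}}
```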

Tracking nodes across datasets

Even if you are able to track the identity of a node within a dataset, it might be difficult to track it with precision across datasets. One dataset might include middle initials; the other might include full middle names. Is Jonathan A Smith from one dataset the Jonathan Andrew Smith or the Jonathan Aaron Smith from the other dataset? In situations like this, our match becomes probabilistic. We cannot say for sure that two different-looking nodes from two different datasets are in fact the same node. We bring in additional contextual variables to differentiate between likely candidates. Maybe Jonathan Andrew Smith is a gastroenterologist and Jonathan Aaron Smith is a notary. In this case, Jonathan A Smith with an MD from Emory is likely Jonathan Andrew Smith because of the medical background. It would be even better if we had our physician's National Provider Identifier. When formal organizations uniquely tag individuals with professional IDs, it becomes much easier to track the identities of those individuals when their IDs appear in datasets. In practice, matching across datasets does not have to be perfect as long as it's 'good enough.' Exactly what 'good enough' means is a topic we'll explore in detail in a later post.
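
Here is a toy illustration of this kind of probabilistic matching, using Python's standard-library difflib for string similarity and a made-up occupation field as the contextual tie-breaker. Real record-linkage pipelines use more careful scoring, but the shape of the logic is the same.

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]; real pipelines use sturdier matchers."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Record from dataset 1 (initials only) and candidates from dataset 2.
query = {"name": "Jonathan A Smith", "occupation": "physician"}
candidates = [
    {"name": "Jonathan Andrew Smith", "occupation": "physician"},
    {"name": "Jonathan Aaron Smith", "occupation": "notary"},
]

def match_score(record, candidate):
    score = name_similarity(record["name"], candidate["name"])
    # Contextual variable as a tie-breaker: a shared occupation boosts the score.
    if record["occupation"] == candidate["occupation"]:
        score += 0.2
    return score

best = max(candidates, key=lambda c: match_score(query, c))
print(best["name"])  # Jonathan Andrew Smith, thanks to the medical context
```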

Just for fun, let's work with another example. Say you have access to two book databases, one based on an Amazon web scrape and one based on an eBay web scrape. Each of these databases will contain, among other things, the names of authors. If you have only the names of these authors, it will be hard to determine whether authors with similar names are the same person or not. Sure, you can manually check 5, 10, or even 100. But can you do this manually for 30,000 names? Formal identifiers, such as Google Scholar profiles or those issued by domain-specific organizations such as this, are godsends.

Picture the setup concretely: two scrapes, one from Amazon and the other from eBay, each listing authors alongside their publications, and your job is to parse that content and decide which author records refer to the same person.
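
As a rough sketch of why manual checking does not scale, assume two hypothetical lists of raw author-name strings from the scrapes. Even before any real matching, you can count how many normalized names collide and therefore need disambiguation.

```python
from collections import Counter

def normalize(name: str) -> str:
    # Collapse case and whitespace so trivially different spellings collide.
    return " ".join(name.lower().split())

# Hypothetical raw author-name strings pulled from the two scrapes.
amazon_authors = ["J. Smith", "Jane  Doe", "JANE DOE"]
ebay_authors = ["Jane Doe", "J. Smith", "John Smith"]

counts = Counter(normalize(n) for n in amazon_authors + ebay_authors)
ambiguous = {name: c for name, c in counts.items() if c > 1}
print(ambiguous)  # names that appear more than once and still need disambiguation
```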

Identity of edges

In order to build a social network graph, you need edges in addition to nodes. An edge is a shared process that two or more nodes engage in. This is extremely vague because it is extremely general. A patient that two physicians see is an edge: seeing this patient is something the two physicians have in common. A publication that two academics wrote together is an edge. When a Twitter user retweets content from another Twitter user, that, too, is an edge. For SNA, edges ideally stand for relationships. In practice, a relationship is a vague qualitative term that cannot be directly measured. Instead, you use proxies that, in the domain you're interested in, indirectly suggest a relationship. In healthcare, physicians who work closely together tend to share patients. Authors who publish together clearly also work together. And so on.
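
As a small illustration, here is how shared-patient counts (entirely made-up numbers) can be turned into a weighted graph with the networkx library, with the strength of the proxy stored as an edge weight.

```python
import networkx as nx

# Hypothetical proxy data: each tuple is (physician_a, physician_b, shared_patients).
# Shared patients stand in for a working relationship we cannot observe directly.
shared_patients = [
    ("Dr. Lee", "Dr. Patel", 12),
    ("Dr. Lee", "Dr. Gomez", 3),
    ("Dr. Patel", "Dr. Gomez", 7),
]

G = nx.Graph()
for a, b, n in shared_patients:
    G.add_edge(a, b, weight=n)  # edge weight = strength of the proxy tie

print(G.number_of_nodes(), G.number_of_edges())  # 3 nodes, 3 edges
```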

Most big datasets don't have nodes and edges in the same record. So chances are your data will not contain straightforward statements like "Authors X and Y have 3 publications together." Big datasets will say something more like "Author X has these publications: A, B, C, D, E, F, G" and "Author Y has these publications: A, B, D, H." For SNA, you need to link the processes that nodes have in common. You have to be able to say "this publication A from author X is the SAME publication A from author Y." The number of shared publications, which is a type of edge, has to be inferred. Knowing that author X has 7 publications is irrelevant to connecting them to author Y if you can't identify the edge. In practice, it helps to have an exact identifier of an edge. For books, this could be an International Standard Book Number (ISBN). If you are matching two books based on just their names, books with very similar names might produce false positives. International organizations have taken pains to produce formal identifiers for books. Once you can link authors to these ISBNs, the identity of edges is exact, and you can link nodes with much greater accuracy.
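
A minimal sketch of this inference step, assuming hypothetical per-author publication sets keyed by ISBN: intersecting the sets recovers the co-publication edges and their weights.

```python
from itertools import combinations

# Hypothetical per-author records, keyed by ISBN so "publication A" from author X
# can be recognized as the same publication A from author Y.
publications_by_author = {
    "Author X": {"978-0-11", "978-0-12", "978-0-13", "978-0-14"},
    "Author Y": {"978-0-11", "978-0-12", "978-0-15"},
    "Author Z": {"978-0-16"},
}

edges = {}
for (a, pubs_a), (b, pubs_b) in combinations(publications_by_author.items(), 2):
    shared = pubs_a & pubs_b
    if shared:
        edges[(a, b)] = len(shared)  # inferred co-publication edge weight

print(edges)  # {('Author X', 'Author Y'): 2}
```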

We note, in passing, that the identity of an edge has to be precise enough that you can determine whether multiple similar edges are the same edge, and precise enough to link an edge with a person. However, the actual real-world identity of an edge does not have to be known at all! For example, health claims datasets track patients and their encounters with physicians. The physician's NPI makes it clear who the physician is. However, in order to maintain the anonymity of the patients, unique cryptographic hashes replace any given patient's name. This allows you to link the same patient with multiple physicians without compromising the identity of the patient. You can now say "Physicians X and Y both saw patients A, B, and C."
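
A sketch of the idea, using Python's standard hashlib and made-up claims rows: the patient identifier is replaced by a salted hash that is stable enough to link physicians together but reveals nothing about the patient.

```python
import hashlib

def pseudonymize(patient_id: str, salt: str) -> str:
    """Replace a patient identifier with a stable, opaque token."""
    return hashlib.sha256((salt + patient_id).encode()).hexdigest()

# Hypothetical claims rows: (physician NPI, raw patient identifier).
claims = [("1234567890", "alice"), ("1234567890", "bob"), ("9876543210", "alice")]

SALT = "project-specific-secret"  # keep this out of the published dataset
linked = [(npi, pseudonymize(pid, SALT)) for npi, pid in claims]

# Both physicians can now be linked through the same opaque patient token,
# without the patient's real identity ever appearing in the data.
print(linked)
```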

Tracking edges across datasets

Just as it is important to track nodes across datasets to construct a comprehensive world-view, you need to be able to track edges across datasets. There are a few reasons you might want to do this, including linking two edges from different datasets and de-duplicating edges that appear in both datasets.

There are many situations in which you would want to link two edges from different datasets. For example, you might want to track the public appearances of two political figures, like Trump and Putin. Perhaps your hypothesis is that more co-appearances suggest a stronger professional relationship. One dataset might have a list of places or events at which Trump has appeared. The other will have a list of places or events at which Putin has appeared. You have to be able to say "this event in Trump's dataset is the same as that event in Putin's dataset!" For small datasets, this is doable by hand. But in massive datasets, you need to automate the process. If the datasets don't codify processes that are candidates for edges in similar ways, it will be hard to reliably link the events.
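
A toy version of that automation, with hypothetical appearance records: normalizing event names and pairing them with dates produces keys that can be matched across the two datasets.

```python
# Hypothetical appearance records: (event name, date) for each figure.
trump_events = [("G20 Summit, Osaka", "2019-06-28"), ("UN General Assembly", "2018-09-25")]
putin_events = [("G20 summit  Osaka", "2019-06-28"), ("Victory Day Parade", "2019-05-09")]

def event_key(name: str, date: str) -> tuple:
    # Normalize punctuation, case, and spacing so the same event codified
    # slightly differently in each dataset still produces the same key.
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return (" ".join(cleaned.split()), date)

co_appearances = {event_key(*e) for e in trump_events} & {event_key(*e) for e in putin_events}
print(len(co_appearances))  # 1: the Osaka G20 appearance matched despite formatting differences
```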

There are other settings in which you might want to drop edges. Consider two big datasets that cover similar topics. The datasets might partially overlap: each might be sufficient on its own to link nodes via edges, and they may contain both some of the same nodes and some of the same edges. In that case you need to drop edges that exist in both datasets to avoid double-counting. For example, let's say you have two web-scraped datasets covering local political figures and the events they attend. One dataset covers appearances at town halls, and the other covers appearances in any setting that deals with healthcare. Your boss might ask you to map out the network of political figures to determine who is likely to have a strong professional relationship with whom. Clearly, some town halls deal with healthcare, so some co-appearances in each dataset will in fact be the same appearance. You need to track the identity of these edges to make sure you only count them once in the analysis. Otherwise, for any given pair of nodes, it will be difficult to determine the proportion of edges that are real co-appearances versus the proportion that are double-counted.
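
A minimal sketch of the de-duplication step, with hypothetical edge records from the two scrapes: reducing each co-appearance to a canonical (pair, event) key and collecting the keys in a set counts each shared appearance exactly once.

```python
# Hypothetical edge records from two scrapes: (figure_a, figure_b, event_id).
townhall_edges = [("Rivera", "Okafor", "townhall-2023-04-11"),
                  ("Rivera", "Chen", "townhall-2023-05-02")]
healthcare_edges = [("Rivera", "Okafor", "townhall-2023-04-11"),   # same event, scraped twice
                    ("Rivera", "Okafor", "clinic-opening-2023-06-01")]

def canonical(edge):
    a, b, event = edge
    return (tuple(sorted((a, b))), event)  # order of the pair shouldn't matter

deduped = {canonical(e) for e in townhall_edges + healthcare_edges}
print(len(deduped))  # 3, not 4: the shared town-hall appearance is counted once
```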

Getting data

In practice, there are several sources for big datasets. Sometimes your organization produces the data in-house as part of some other task; for example, does your company's website maintain server logs? Other times, you scrape it from the internet in massive sweeps. Finally, you might simply buy it from big-data vendors. Do you have a few million bucks sitting around for terabytes of juicy data? If your multi-million-dollar social network analytics company depends on it, this might be the way to go.