Big data social network analytics: querying data

Big data social network analytics depends on data mining. Data mining is the process of perusing complex datasets composed primarily of data you don't care about, and extracting the data that is relevant to you. A query is computer code that limits a database to the subset of data directly relevant to your analysis. If you're examining how politicians in Massachusetts seek campaign funding, you can probably limit your dataset to MA politicians, even though it might contain lots of Idaho politicians. If you care about how logistical networks carry lumber and you have a generic logistics database, query for lumber; you can probably ignore textile and produce logistics.

In this post, we will explore the types of queries that prepare a dataset for social network analysis, processes for constructing these queries, and best practices for linking a series of queries together.

Querying nodes and edges

Big datasets used for social network analysis have to contain nodes and edges. Nodes can be people, edges can be processes, and a social network is people connected together by shared processes that designate relationships. It makes sense, therefore, that when you query a big dataset for social network analytics, it's a good idea to query the dataset for nodes or for edges. A node query takes the form “give me all the data related to node X, or node attribute Y.” This can be “give me all the data you have on one of these three physicians” or it can be “give me all the data you have on authors or politicians who have any kind of criminal background.” An edge query takes the form “give me all the data related to this edge, or edge feature.” This can be “give me all the data you have on this specific list of conferences” or “give me all the data you have on town halls in which the term ‘healthcare’ appears on the agenda.”

These queries produce results that look similar, but are useful for different things. They are similar in that each produces a list of nodes and edges. But these lists will not be the same. Do you know the exact people you’re interested in? Clearly, do a node query. Do you know what type of person you’re interested in, even though you don’t know the exact person? Do a node feature query. Do an edge query if you have a list of exact processes and you want to see who's linked by those processes. Do an edge attribute query if you want all processes of a class, where you understand the characteristics of the class but not all of the processes that belong in that class.
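To make the distinction concrete, here's a minimal sketch in Python with pandas. The toy edge table, its column names, and the filter values are all invented for illustration; your dataset will have its own schema.

```python
import pandas as pd

# Toy edge list: each row is one process (edge) linking two people (nodes).
edges = pd.DataFrame({
    "person_a":   ["smith", "jones", "smith", "lee"],
    "person_b":   ["jones", "lee",   "lee",   "park"],
    "event_type": ["town_hall", "conference", "town_hall", "fundraiser"],
    "agenda":     ["healthcare reform", "forestry", "healthcare costs", "campaign"],
})

# Node query: "give me everything touching these exact people."
people_of_interest = {"smith"}
node_query = edges[
    edges["person_a"].isin(people_of_interest) | edges["person_b"].isin(people_of_interest)
]

# Edge attribute query: "give me every process of this class."
edge_query = edges[
    (edges["event_type"] == "town_hall")
    & edges["agenda"].str.contains("healthcare", case=False)
]

print(node_query)
print(edge_query)
```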

In practice, I’ve found edge feature queries to be most useful for big data social network analytics. Usually, you know the type of process you’re interested in tracking. Usually, you start with a question about a process. Which politicians are corrupt? Query for corruption. Which writers are influential? Query for the kind of influence you want to understand. Who controls the flow of goods? Query for the goods.

Query generation protocol

Humans and computers do not think alike. A human asks business questions like “Who controls the flow of goods in a logistical chain?” Computers think in 0’s and 1’s. In big data analytics, more than in other fields of analytics, you need a rigorous process to convert questions that humans care about into queries that computers can execute, and back again.

Query generation protocols are ideal when you don’t understand a topic in depth, and therefore don’t know what universe of topics is relevant when constructing a query. If someone says ‘track lumber’, it takes a bit of googling to wrap your head around the machinery of sawmills, the organizational makeup of lumbering crews, the dynamic foci of logging, and the multi-modal transportation mechanisms of a log as it works toward factories. Query generation protocols are also ideal when the datasets you use track processes using a different ‘vocabulary’ than the vocabulary you think in. You might care about “pine trees” when in fact datasets (like this USDA dataset) use industry-specific codes for “Fir”, “Douglas Fir”, and “California mixed conifer group” that are just numbers like “260” and “370.” You need to shift from “pine tree” to a bunch of specialized codes somehow.

It is important to have a pre-established query generation protocol if you want to move fast, and if you want your query to be thorough. An incomplete query will distort your social network model. If three lumber organizations have 70/30, 50/50, and 30/70 investment into two types of pine trees, and you miss the code for one of the types, the social network model you construct to track the logging logistics will systematically distort your understanding of the firms invested in the second type of pine tree.

In a big data context, you’re not reliant on two codes. You’re reliant on thousands. To reliably shift from vague human constructs to formal, dataset-specific query criteria, the following kind of process has worked for me.

Fluffy human qualitative construct to rigorous human qualitative construct

The language you define your question in is usually not precise. Defining your question in a rigorous way is often the hardest part of the challenge. No technology can do this for you. Think deeply and talk to experts. Do you really care about “cancer” writ large (all the varieties?) or just lymphoma, or specifically Non-Hodgkin's Lymphoma, or the various subtypes…?

Rigorous human qualitative construct to expanded list of interrelated rigorous human qualitative constructs

Mature industries have formal ontologies/taxonomies relating constructs to each other. This is just a fancy way of saying that any given word relevant to a domain has its relationships to other words mapped out. Some relations are of synonymy. Others are of conceptual hierarchy. There are other relations. If you’re interested in “meningitis” you probably care about “Cerebrospinal fever.” If you’re working with a big dataset produced by an industry that has been around for a while, dig around for such ontologies, because the construct you care about might be closely related to a few other constructs that should be lumped into your query.

In the bizz, we call these semantic network traversals. Semantic (meaning) network (interlaced set of constructs) traversal (moving between). Write a set of business rules that define the boundaries of how you traverse the semantic network, and code that utilizes APIs relevant to your domain. An awesome example is the Unified Medical Language System.
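Here's a minimal sketch of what a bounded traversal can look like. The hand-built ontology dict is a stand-in for a real ontology service (real UMLS access goes through its own API and credentials), and the relation names, depth limit, and business rules are illustrative assumptions, not a standard.

```python
from collections import deque

# Hand-built stand-in for an ontology: each construct maps to related constructs,
# tagged with the type of relation. A real system would pull these from an
# ontology service (e.g., UMLS) rather than a hard-coded dict.
ONTOLOGY = {
    "meningitis": [("cerebrospinal fever", "synonym"),
                   ("encephalomeningitis", "related"),
                   ("nonpyogenic meningitis", "child")],
    "encephalomeningitis": [("meningoencephalitis", "synonym")],
    "nonpyogenic meningitis": [],
    "cerebrospinal fever": [],
    "meningoencephalitis": [],
}

# Business rules for the traversal: which relations to follow, and how far.
ALLOWED_RELATIONS = {"synonym", "child", "related"}
MAX_DEPTH = 2

def traverse(seed: str) -> set:
    """Breadth-first traversal of the semantic network, bounded by relation type and depth."""
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        term, depth = queue.popleft()
        if depth >= MAX_DEPTH:
            continue
        for neighbor, relation in ONTOLOGY.get(term, []):
            if relation in ALLOWED_RELATIONS and neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return seen

print(traverse("meningitis"))
```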

Expanded list of inter-related rigorous human qualitative constructs to strings to query industry data dictionaries

In this step, you have to convert the list of constructs you want data for into parts of words that can be used to pull codes from specialized dictionaries. Most industries have dictionaries mapping qualitative descriptions to industry-specific codes. Take this ICD-10 example: G03.0: “Nonpyogenic meningitis.” You have the description, “Nonpyogenic meningitis.” You have the diagnostic code, “G03.0.” This code tracks the disease in a variety of datasets. The ICD-10 has more than 14,000 such entries.

An automatic query generation protocol peruses these 14,000 descriptions, written in English, for any matches to your criteria. Your protocol should be good enough to draw up this code for a query interested in “meningitis.” Your job is to come up with all the strings that roughly pull for relevant codes. If your interest is in meningitis and other related issues such as encephalomeningitis, you need to ready your query to include the codes whose descriptions contain the prefix “encephalo…”. This step is heavy on natural language processing techniques such as lemmatization and the use of regular expressions.
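As a sketch of what this looks like, here's a crude pattern match against a three-row stand-in for a data dictionary. The stems and the shape of the dictionary are assumptions; a production protocol would layer lemmatization on top of the regexes.

```python
import re
import pandas as pd

# Stand-in for an industry data dictionary (the real ICD-10 file has 14,000+ rows).
dictionary = pd.DataFrame({
    "code":        ["G03.0", "G04.90", "Z91.0"],
    "description": ["Nonpyogenic meningitis",
                    "Encephalitis and encephalomyelitis, unspecified",
                    "Allergy status, other than to drugs and biological substances"],
})

# Query strings: crude stems so that "meningitis", "encephalomeningitis",
# "encephalitis", etc. all match. Compiled as case-insensitive regexes.
patterns = [re.compile(p, re.IGNORECASE) for p in [r"mening", r"encephal"]]

matches = dictionary[
    dictionary["description"].apply(lambda d: any(p.search(d) for p in patterns))
]
print(matches[["code", "description"]])
```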

Strings to query industry data dictionaries to criteria used in datasets

There might be dozens of industry dictionaries because there are dozens of kinds of criteria that might track your process. ICD-10 codes are for diseases. If you care about medical devices, you need SKUs. The list of dictionaries you need is long and messy. And you’re often going to have to combine dictionaries. For example, the ICD commission has done its job and put together an excellent list of codes. The FDA has not, so if you’re interested in constructing a query for specific pharmaceutical agents you’re going to have to coalesce multiple data dictionaries into a meta-dictionary of National Drug Codes. Part of what differentiates an organization with mature analytic capacities from a group of high-brow amateurs is the centralization of such dictionaries that help you understand your data lakes.
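Here's a minimal sketch of coalescing several dictionaries into one meta-dictionary, assuming each source has already been normalized to code/description columns. The source names and rows are invented.

```python
import pandas as pd

# Hypothetical per-source dictionaries, already normalized to code/description columns.
sources = {
    "labeler_a": pd.DataFrame({"code": ["0001-0001"], "description": ["drug A 10mg tablet"]}),
    "labeler_b": pd.DataFrame({"code": ["0002-0377"], "description": ["drug B oral solution"]}),
}

# Meta-dictionary: stack the sources, tag each row with where it came from,
# and drop duplicate codes so one code maps to one description.
frames = []
for name, df in sources.items():
    tagged = df.copy()
    tagged["source"] = name
    frames.append(tagged)

meta = pd.concat(frames, ignore_index=True).drop_duplicates(subset="code")
print(meta)
```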

Alright, so let's pretend you have your dictionaries together, and it's time to query them. I hope to G*d you automate this step. You’re going to have to peruse millions of records to pull all the codes that contain the strings that might be of interest. In a future post, we’ll explore algorithms that ensure this process is fast.
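One way to automate the scan when the dictionaries run to millions of records is to stream them in chunks instead of loading everything at once. A sketch, assuming a CSV of code/description pairs; the file name and column names are placeholders, not a standard layout.

```python
import re
import pandas as pd

# Patterns built in the previous step.
PATTERNS = [re.compile(p, re.IGNORECASE) for p in [r"mening", r"encephal"]]

def scan_dictionary(path: str, chunksize: int = 100_000) -> pd.DataFrame:
    """Stream a large data dictionary in chunks and keep rows matching any pattern."""
    combined = "|".join(p.pattern for p in PATTERNS)
    keep = []
    for chunk in pd.read_csv(path, usecols=["code", "description"], chunksize=chunksize):
        mask = chunk["description"].str.contains(combined, case=False, regex=True, na=False)
        keep.append(chunk[mask])
    return pd.concat(keep, ignore_index=True)

# Usage (assuming a file of this shape exists):
# candidate_codes = scan_dictionary("icd10_dictionary.csv")
```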

Criteria used in datasets to manually filtered criteria used in datasets

You read right. This step is manual. Here is my reasoning. No query is perfect, so you're always dealing with a type 1 / type 2 error tradeoff. To minimize both kinds of error, you usually need two queries. The first is an initial, overly inclusive query with a low false negative rate. This query should be fast and crude, meaning it will be high in false positives. But that's fine: a second query will take care of the high false positive rate left behind by the first. This second query is usually slow and precise. It's OK that we're working slowly in the second query because the results of the first query are much, much smaller than the initial dataset. Cancer screenings, FBI screenings, and other high-stakes, high-resource screenings all operate under this logic.
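Here's a sketch of the two-pass pattern: a cheap, over-inclusive first pass, then a slower, stricter pass over the much smaller result. The functions, patterns, and rows are placeholders; in practice the precise pass might be a model, an expensive join, or a human review.

```python
import pandas as pd

def broad_pass(records: pd.DataFrame) -> pd.DataFrame:
    """Fast, crude, over-inclusive: one substring match. High false positive rate is fine."""
    return records[records["description"].str.contains("allerg", case=False, na=False)]

def precise_pass(records: pd.DataFrame) -> pd.DataFrame:
    """Slow, precise: run only on the small result of the broad pass.
    Here the 'expensive' check is just a stricter regex; in practice it might be
    a model, an API call, or a human review."""
    keep = records["description"].str.contains(r"pollen|seasonal", case=False, regex=True, na=False)
    return records[keep]

records = pd.DataFrame({
    "code": ["J30.1", "Z91.0"],
    "description": ["Allergic rhinitis due to pollen",
                    "Allergy status, other than to drugs and biological substances"],
})
print(precise_pass(broad_pass(records)))
```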

Are you interested in pollen allergies? Then manually filter out “Z91.0: Allergy status, other than to drugs and biological substances” which will probably come up for a query with the term “allergy”.

To get this step right, I recommend adding a column called include to your dataset, and writing “include” for all rows you want included. Then, have a separate script that subsets your query criteria to only those records with an “include” in that column. This allows you to go back and filter your data in a different way at a later date.
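A minimal sketch of that workflow, assuming the candidate criteria live in a DataFrame and an analyst has already filled in the include column by hand:

```python
import pandas as pd

# Candidate criteria produced by the automated steps above, after a manual
# review pass in which an analyst wrote "include" next to each code to keep.
criteria = pd.DataFrame({
    "code":    ["J30.1", "Z91.0"],
    "include": ["include", ""],   # analyst left Z91.0 blank: filtered out
})

# A separate, tiny script subsets to reviewed rows. Re-running with a different
# include column lets you re-filter the same candidates later.
final_criteria = criteria[criteria["include"] == "include"]
print(final_criteria["code"].tolist())
```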

Primary and secondary queries

A query limits a big database in two ways. It limits records to only what's interesting to you. We covered this above. It also limits fields. This is because most big databases are not just long, they are also wide. You have billions of records, and for each record hundreds of fields. Not only do you want just a handful of records, you also want just those fields that are critically important to your analysis. This helps keep the results of your query manageable.

The problem is that you never quite know which fields are important. To handle this, I'll introduce the primary/secondary query distinction. A primary query pulls fields that are of interest in every project. A secondary query pulls fields of interest only in some projects. Primary queries are straightforward; secondary queries reach into other tables/columns and add complexity. I’ll leave it up to the computer scientists to define how we should partition different kinds of concerns among different kinds of secondary scripts.

Primary queries are written to pull data from a dataset that meet some criteria. These will be the criteria you came up with, above, in your query generation protocol. “Keep a subset of fields from a record if the code 1234xyz appears in field abc.” The kinds of data you pull in a primary query are central to your analysis. In social network analysis, this should be, at the very least, node identifiers, edge identifiers, and some primary key. The node and edge fields are used to construct social networks. The key is used for secondary queries.
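A minimal sketch of a primary query, assuming a wide events table with the invented field names below; the criteria set is whatever your query generation protocol produced.

```python
import pandas as pd

# Stand-in for a wide table with billions of rows and hundreds of fields.
events = pd.DataFrame({
    "event_id":  [101, 102, 103],                 # primary key
    "person_id": ["p1", "p2", "p1"],              # node identifier
    "edge_id":   ["conf_a", "conf_a", "town_b"],  # edge identifier
    "diag_code": ["G03.0", "J30.1", "G03.0"],
    "notes":     ["...", "...", "..."],           # one of hundreds of fields we don't need
})

CRITERIA = {"G03.0"}                              # output of the query generation protocol
PRIMARY_FIELDS = ["event_id", "person_id", "edge_id"]

# Primary query: keep only matching records, and only the fields every project needs.
primary = events.loc[events["diag_code"].isin(CRITERIA), PRIMARY_FIELDS]
print(primary)
```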

Secondary queries are designed to pull specific kinds of information from a dataset to “complete the picture” of the primary query. Some examples I’ve run into are: location information (‘where did this event occur’), event type (‘what type of event was this’), payor information (‘who funded/paid for this event’), etc. Secondary queries are used to filter down edges, engineer features, and do other tasks important in a data science context. In big data social network analysis, primary queries pull nodes, edges, and primary keys, while secondary queries pull node attributes and edge attributes.
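And a matching sketch of a secondary query, joining attribute fields back onto the primary results by the primary key. The second table and its fields are invented.

```python
import pandas as pd

# Primary query result from the previous sketch: nodes, edges, and primary keys.
primary = pd.DataFrame({
    "event_id":  [101, 103],
    "person_id": ["p1", "p1"],
    "edge_id":   ["conf_a", "town_b"],
})

# Stand-in for another (wide) table holding attributes we only sometimes need.
event_details = pd.DataFrame({
    "event_id":   [101, 102, 103],
    "location":   ["Boston", "Boise", "Springfield"],
    "event_type": ["conference", "conference", "town_hall"],
    "payor":      ["org_x", "org_y", "org_z"],
})

# Secondary query: pull node/edge attributes keyed to the primary results.
secondary = primary.merge(
    event_details[["event_id", "location", "event_type", "payor"]],
    on="event_id", how="left",
)
print(secondary)
```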

Coda

… AAANNDD, there you have it, folks. Thank you for your patience. Hopefully you’ve learned something about how one designs queries for big data social network analysis… And even if you’re never going to touch such a thing, hopefully there’s a nugget or two here to benefit you in whatever you study.