A strange difference between small-data settings and big-data settings is that while in a small-data setting you're usually careful to keep data, in a big-data setting you're usually careful to drop data. Let me explain.
In a small-data setting, every datapoint might impact the conclusions you draw. These norms are heavily enforced by academic peer review. Fidelity to the experimental method is taken to imply that the default is to keep data.
In a big data social network analytics setting, the mechanisms that generate data are so profoundly unreliable, so hilariously uncalibrated to social network analytics, that throwing out bad or useless data is a virtue. Sometimes bad data messes up your analysis. Other times you simply have enough to spare. Let's talk about these separately.

Bad data
What does bad data look like? And why do we drop it?
Bad data can be surprisingly easy to pick up on. Often, it simply has an implausible form. This plausibility can be assessed vis-à-vis industry standards. An example from healthcare informatics: physicians have federally defined National Provider Identifiers (NPIs). There are a limited number of these, and most of them are publicly available from the feds. If what is supposedly a physician NPI is not on the federal list of physician NPIs, you probably don't care about it. A dizzying percentage of data that is supposed to meet a defined industry standard does not. Another way in which data might have invalid form is if it simply does not look like what your data vendor said it would look like. The data dictionary from your vendor might say "valid IDs look like this: …" and then 30% of IDs don't look like that. Drop them.
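This kind of form-based filter is usually a one-liner. Here is a minimal sketch, assuming the vendor's data dictionary specifies a 10-digit ID (real NPIs are in fact 10-digit numbers, though the field name and records below are made up for illustration):

```python
import re

# Assumed vendor spec: "valid IDs are 10 digits". NPIs are 10-digit numbers,
# but the exact pattern should come from your vendor's data dictionary.
VALID_ID = re.compile(r"^\d{10}$")

def filter_valid_ids(records):
    """Keep only records whose 'npi' field matches the expected form."""
    return [r for r in records if VALID_ID.match(r.get("npi", ""))]

records = [
    {"npi": "1234567890", "name": "Dr. A"},   # plausible form -> kept
    {"npi": "12-34",      "name": "Dr. B"},   # malformed -> dropped
    {"npi": "abcdefghij", "name": "Dr. C"},   # malformed -> dropped
]
clean = filter_valid_ids(records)  # only Dr. A survives
```

In a real pipeline you would also check surviving IDs against the published federal list, not just the pattern.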
It is important to drop data that has low face validity for a few reasons. First, it often cannot be traced to a process in the real world you're interested in. Invalid NPIs won't map to real people, so don't waste your time modeling their dynamics. Second, data without face validity might actually map onto processes you should not include in your dataset. What if instead of a physician ID, some records include hospital IDs? Then some of the nodes in your network are hospitals rather than people, and you're modeling a totally different type of process than the one you thought you were modeling. Third, these data might mess up your algorithms, big time. We'll cover this in a future post on writing efficient algorithms in big data engineering.
Sometimes, data is bad even if it has high face validity. This is a trickier decision because it's based on theory that needs to be independently assessed. In statistics, every decision has a type I / type II tradeoff. You drop data that meets some threshold when the probability that it is bad data outweighs the probability that it is good data. Just where that threshold sits should be judiciously examined. Here is an example. Say you're using an edge type as a proxy for another kind of process you're interested in. Perhaps you're interested in quantifying the strength of the professional relationship between two political candidates based on their number of public appearances together. Simply appearing once or twice in public does not imply the existence of a professional relationship. In this situation you might want to set some standard, like 'only if politicians appear together in public more than 5 times will we consider it likely that they have any kind of professional relationship.' Here, you're taking a theoretical (independently verifiable) stance that filtering edges below a certain weight will get rid of processes that, by and large, are not interesting to your analysis.
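The edge filter itself is trivial once you've committed to a threshold. A minimal sketch, using the "more than 5 appearances" standard from the example above (the candidate names and counts are invented):

```python
# Theory-driven edge filtering: drop co-appearance edges whose weight
# (number of joint public appearances) doesn't clear the threshold.
# The value 5 is the illustrative standard from the text, not a universal
# constant; it must be justified independently for your own analysis.
MIN_APPEARANCES = 5

edges = [
    ("candidate_a", "candidate_b", 12),  # likely a real professional tie
    ("candidate_a", "candidate_c", 2),   # probably coincidental -> dropped
    ("candidate_b", "candidate_d", 5),   # not *more than* 5 -> dropped
]

kept = [(u, v, w) for u, v, w in edges if w > MIN_APPEARANCES]
```

The hard part is never the code; it's defending the threshold.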
Useless data

What is useless data? And why do we drop it?
Data is useless if it helps you understand processes that are unimportant. For example, perhaps there are nodes (people) who are unimportant. Who are the stakeholders who need to understand the output of your analysis? Chances are that they need to act on a small subset of the possibilities you afford them. You may map out what hundreds of millions of people are doing, but maybe only a few thousand of them are important. In this situation, the main utility of the nodes you're not directly interested in is to characterize the social network milieu of the nodes that you are interested in. And in many situations they won't have any relevance to the important nodes. If dropping lots and lots of data only distorts your understanding of the social dynamics of ultimately unimportant processes, then just drop the data!
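One common version of this filter keeps the important nodes plus their immediate neighbors (the milieu) and throws out everything else. A minimal sketch with made-up node names:

```python
# Keep edges that touch an "important" node; edges between two unimportant
# nodes tell us nothing about the nodes we actually care about.
edges = [
    ("vip_1", "minor_a"),
    ("vip_1", "minor_b"),
    ("minor_c", "minor_d"),  # irrelevant to any important node -> dropped
]
important = {"vip_1"}  # the few-thousand nodes the stakeholders act on

kept_edges = [(u, v) for u, v in edges if u in important or v in important]
kept_nodes = {n for edge in kept_edges for n in edge}
```

Whether to keep one hop of neighbors, two, or none at all is a judgment call about how much milieu you need to contextualize the important nodes.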
Data is also useless if it does not help you understand any one node at a deeper level. That is, perhaps it helps you map out edges/relationships that are relatively unimportant to the node. Let's think about this logarithmically. How many people do you know? 1,000? 3,000? Chances are that a handful of people, let's say 30, have a strong impact on you. Everyone else is secondary. In a big data social network analytic setting we need to remember that the vast majority of the impact that a network has on a node, and that the node has on the network, is mediated through a small number of alters. If 90% of the dynamics of a node are understood by analyzing its top 100 strongest alters, introducing another 1,000 nodes for a mere 10% gain in clarity seems … unappealing. This is a vast difference between academia and industry. In academia, you include the other 1,000 or get rejected from the peer-reviewed journal. In industry, you consider that 1,000 extra alters per node will increase your run-time by 2 days, realize that means $2,000, shrug it off, and move on. You have deadlines to meet.
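The "top 100 strongest alters" idea sketches out as a per-node top-k truncation. Here is a minimal version (k=2 in the toy example so the output is easy to read; the text's example would use k=100):

```python
from collections import defaultdict

def top_k_alters(weighted_edges, k):
    """weighted_edges: iterable of (node, alter, weight) triples.
    Returns a dict mapping each node to its k highest-weight alters."""
    neighbors = defaultdict(list)
    for node, alter, weight in weighted_edges:
        neighbors[node].append((weight, alter))
    return {
        node: [alter for _, alter in sorted(alters, reverse=True)[:k]]
        for node, alters in neighbors.items()
    }

# Toy ego network with invented weights.
edges = [("ego", "a", 10), ("ego", "b", 3), ("ego", "c", 7), ("ego", "d", 1)]
result = top_k_alters(edges, k=2)  # {"ego": ["a", "c"]}
```

For genuinely big data you'd do this with a heap or inside your distributed framework rather than a full sort, but the truncation logic is the same.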
You want to drop useless data because it slows you down. We'll talk about this later when considering big data social network algorithms. For now, let's just observe that cutting out useless data will often save you days of run-time and thousands of dollars.
Alright, now off you go filtering data. Per usual, feel free to reach out to me with questions, requests, addenda, or comments at: emil_g_moldovan AT protonmail DOT com