Mathematical Theory of Data Accuracy
Ask an analyst to find issues in a dataset and they’ll come back in a few hours with a long list. Ask them instead to validate a dataset and they’ll be busy for months.
Data inaccuracy is detectable, data accuracy is not.
Why the asymmetry? I believe that much of it comes down to the detectability of data quality issues.
Physics provides a useful analogy for data accuracy detectability: it is theorized that roughly 85% of the matter in the universe is “dark matter” - that is, matter which does not interact with the electromagnetic spectrum and so evades detection.
In much the same way, all datasets contain dark issues - data accuracy issues that cannot be directly observed. These issues are not merely difficult to find; they are undetectable in principle.
We will see that dark issues exist in all non-trivial datasets as a result of a simple mathematical structure underlying data problems. The focal point of this analysis is the Maximal Alert Theorem, which says that the only way to detect accuracy issues is to look for violations of real-world assumptions. From this we will develop an equation that counts the dark issues as a function of our real-world assumptions.
For those interested in the theory itself, and in how it compares to other mathematical theories, there is an appendix on this topic.
The Setup
Let \(\hat{X}=(x_1, x_2, \dots, x_N)\) be observable datapoints measuring some ‘true’ values \(\hat{T}=(t_1, t_2, \dots, t_N)\).
For now we assume that the data is discrete and only takes integer values.
Definition: Data Accuracy Issue
A data accuracy issue is simply when \(\hat{X} \neq \hat{T}\).
Note: We are only dealing with discrete data for this blog post. This definition will be amended when dealing with continuous data.
Example
Let’s say you work for the local ferry boat company, and you wish to know the volume of traffic on the ferry. You get a volunteer to observe the ferry as it loads and unloads and to record the data. Let’s define:
Var | Definition |
---|---|
\(t_1\) | The true total cars entering the ferry at the first stop. |
\(x_1\) | The observed total cars entering the ferry at the first stop. |
\(t_2\) | The true total cars leaving the ferry at the second stop. |
\(x_2\) | The observed total cars leaving the ferry at the second stop. |
We can’t see \(t_1\) and \(t_2\) directly, but we hope that our volunteer did a dutiful job so that \(x_1=t_1\) and \(x_2=t_2\).
Data Accuracy Issues at the Ferry
Let’s take a look at some of the data from the ferry.
day | \(x_1\) | \(x_2\) |
---|---|---|
Monday | 9 | 5 |
Tuesday | 7 | 3 |
Wednesday | 5 | 7 |
Now let’s add the true numbers of cars entering and exiting. We normally don’t get to see these.
day | \(t_1\) | \(t_2\) | \(x_1\) | \(x_2\) |
---|---|---|---|---|
Monday | 9 | 5 | 9 | 5 |
Tuesday | 8 | 3 | 7 | 3 |
Wednesday | 7 | 5 | 5 | 7 |
A data accuracy issue is whenever \(\hat{T}\neq \hat{X}\). Let’s mark the rows with data accuracy issues.
day | \(t_1\) | \(t_2\) | \(x_1\) | \(x_2\) | Accuracy Issue? |
---|---|---|---|---|---|
Monday | 9 | 5 | 9 | 5 | No |
Tuesday | 8 | 3 | 7 | 3 | Yes |
Wednesday | 7 | 5 | 5 | 7 | Yes |
Tuesday has \(t_1 \neq x_1\), and Wednesday has both \(t_1 \neq x_1\) and \(t_2 \neq x_2\), so these two days have data accuracy issues.
Adding Real World Assumptions
Along with our observations \(\hat{X}\) and the true values \(\hat{T}\), we also want to model our real-world assumptions about the data.
We introduce an assumption \(g\), where \(g(\hat{T})=\text{True}\) if the assumption holds, and \(g(\hat{T})=\text{False}\) if the assumption is violated.
Generally we can have many real-world assumptions: \(g_1, g_2, \dots, g_M\).
Let’s go through an example to make this more concrete.
Definition: Plausible
If \(\hat{T}\) satisfies all assumptions \(g_1, g_2, \dots, g_M\), we call \(\hat{T}\) plausible.
Real World Assumptions Example
Let’s go back to the ferry example.
A safe assumption is that \(t_1\geq t_2\). In other words, we can’t have more cars leave the ferry than get on the ferry.
Therefore we define our real-world assumption: \(g(t_1,t_2) = t_1 \geq t_2\). Let’s call this assumption the conservation of cars assumption.
General Data Accuracy Alerts
A data accuracy alert monitors data and fires when a data accuracy issue is detected.
Let’s come up with a definition given our framework.
Definition: Deterministic Alert
An alert that, when the true values are plausible, fires only when there is a data accuracy issue. More formally:
A binary function of the data, \(f\), is a deterministic alert if:
When \(\hat{T}\) is plausible and \(f(\hat{X}) = \text{True}\), then \(\hat{T} \neq \hat{X}\).
…
In other words, if a deterministic alert fires, something is wrong with the data. It probably won’t catch all of the data accuracy issues, but it should never fire when the data is fine. We don’t care how the alert behaves when the true values aren’t plausible, because by assumption that never happens.
The next example will make this more concrete.
Deterministic Alert Example
Let’s come up with a deterministic alert for our ferry problem.
We know that \(t_1 \geq t_2\) by the conservation of cars assumption. Therefore, if we see \(x_1 < x_2\), then we know we have a data accuracy issue.
Let’s define our deterministic alert \(a\) as \(a(x_1,x_2) = x_1 < x_2\). When \(a\) is \(\text{True}\) we know there is a data accuracy issue.
Note: We have not yet proven that \(a\) satisfies the formal definition of a deterministic alert. The next section contains a proposition that shows that \(a\) indeed satisfies the definition.
Given the ferry problem setup, let’s try to find data accuracy issues:
day | \(x_1\) | \(x_2\) | \(a(x_1,x_2)\) |
---|---|---|---|
Monday | 9 | 5 | \(\text{False}\) |
Tuesday | 7 | 3 | \(\text{False}\) |
Wednesday | 5 | 7 | \(\text{True}\) |
Now let’s join back with the true values to see how good this alert is.
day | \(t_1\) | \(t_2\) | \(x_1\) | \(x_2\) | Accuracy Issue? | \(a(x_1,x_2)\) |
---|---|---|---|---|---|---|
Monday | 9 | 5 | 9 | 5 | No | \(\text{False}\) |
Tuesday | 8 | 3 | 7 | 3 | Yes | \(\text{False}\) |
Wednesday | 7 | 5 | 5 | 7 | Yes | \(\text{True}\) |
Our data accuracy alert caught the issue on Wednesday, but it did not catch the issue on Tuesday.
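As a minimal sketch of this alert in code (the function and variable names are my own, chosen for illustration), here is the ferry data and alert in Python:

```python
# Ferry observations (day, x1, x2), taken from the tables above.
observations = [
    ("Monday", 9, 5),
    ("Tuesday", 7, 3),
    ("Wednesday", 5, 7),
]

def a(x1, x2):
    """Deterministic alert: fires when more cars leave than entered."""
    return x1 < x2

for day, x1, x2 in observations:
    print(day, "alert fired" if a(x1, x2) else "no alert")
# Only Wednesday fires; Tuesday's issue (t1 = 8 recorded as x1 = 7) stays dark.
```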
Maximal Alert
It is natural to ask - what is the most comprehensive alert on the data? Is there an alert that will catch every issue?
Given two deterministic alerts \(a_1\) and \(a_2\), we can define a new deterministic alert that fires whenever \(a_1\) fires or \(a_2\) fires.
Proposition: If \(a_1\) and \(a_2\) are deterministic alerts, so is \((a_1 \text{ or }a_2)\).
Proof: If \((a_1 \text{ or } a_2)\) fires on \(\hat{X}\), then \(a_1\) or \(a_2\) fired, and since each is a deterministic alert, any plausible \(\hat{T}\) satisfies \(\hat{T} \neq \hat{X}\). Q.E.D.
We can imagine a maximal deterministic alert which fires if any other deterministic alert fires. If we magically knew all other deterministic alerts we could create this maximal alert by or-ing them all together.
Definition: Maximal Alert
The alert constructed by ‘or-ing’ all other deterministic alerts together. Therefore the maximal alert fires iff one or more other deterministic alerts fire.
…
For a problem with real-world assumptions \(g_1, g_2... g_M\) we also consider the assumption violation alert (or \(ava\)) which fires when any assumption is violated. Formally we can define the ava as follows.
\[ava(\hat{X}) = \neg g_1(\hat{X}) \text{ or } \neg g_2(\hat{X}) \text{ or } \dots \text{ or } \neg g_M(\hat{X})\]
where the \(\neg\) symbol means ‘not’.
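As a sketch of how the \(ava\) might look in code, assuming each assumption is represented as a plain predicate on the datapoints (all names here are illustrative):

```python
# Each assumption g takes the datapoints and returns True when it holds.
assumptions = [
    lambda x: x[0] >= x[1],  # conservation of cars: x1 >= x2
]

def ava(x, assumptions):
    """Assumption violation alert: fires if any assumption is violated."""
    return any(not g(x) for g in assumptions)

print(ava((9, 5), assumptions))  # False: all assumptions hold
print(ava((5, 7), assumptions))  # True: conservation of cars violated
```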
Proposition: The assumption violation alert is a deterministic alert.
Proof: First we show that \(\neg g(\hat{X})\) is a deterministic alert for every assumption \(g\). We need to show that if \(\neg g\) fires and \(\hat{T}\) is plausible, then \(\hat{T} \neq \hat{X}\).
If the alert fires we have \(g(\hat{X}) = \text{False}\). If the setup is plausible we have \(g(\hat{T}) = \text{True}\).
Therefore \(g(\hat{T}) \neq g(\hat{X})\), which implies that \(\hat{T} \neq \hat{X}\) (if the two were equal, \(g\) would take the same value on both).
Now that we have shown each \(\neg g(\hat{X})\) is a deterministic alert, note that the \(ava\) is an “or” of deterministic alerts, so by the previous proposition it is itself a deterministic alert.
Q.E.D.
Theorem: Maximal Alert Theorem: The Maximal Alert is the Assumption Violation Alert.
Proof:
We must show that if either of these alerts fires, so does the other.
If the assumption violation alert fires, so does the maximal one since the maximal alert fires if any alert fires.
If the maximal alert fires, then the assumption violation does as well. We can show this by contradiction:
Say there exists a pair \((\hat{T}=\hat{T}^*, \hat{X}=\hat{X}^*)\) where the maximal alert fires and the assumption violation alert doesn’t.
Now consider \((\hat{T}=\hat{X}^*, \hat{X}=\hat{X}^*)\).
- The maximal alert will fire, since \(\hat{X}=\hat{X}^*\) is unchanged and alerts are functions of \(\hat{X}\) alone.
- There are no data accuracy issues, since \(\hat{X} = \hat{X}^* = \hat{T}\).
- \(\hat{T}\) does not violate any assumptions, because the assumption violation alert did not fire on \(\hat{X}^*\) and \(\hat{T} = \hat{X}^*\). In other words, \(\hat{T}\) is plausible.
The maximal alert can’t fire in a plausible situation with no data accuracy issues since this contradicts the definition of a deterministic alert. Therefore the maximal alert can’t fire when the assumption violation alert is silent.
Q.E.D.
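The theorem can also be checked by brute force on a tiny state space. In the finite setting, the definition of a deterministic alert reduces to: \(f\) may only fire on states outside the plausible set \(P\). The sketch below (my own construction, assuming a two-count system with values 0-2) enumerates every such \(f\) and confirms that or-ing them all together reproduces the \(ava\):

```python
from itertools import product

# Tiny system: two car counts, each ranging over 0..2, so 9 data states.
states = list(product(range(3), repeat=2))
P = {s for s in states if s[0] >= s[1]}  # plausible set: conservation of cars

def is_deterministic_alert(f):
    # A deterministic alert may only fire on states that equal no plausible T,
    # i.e. states outside P.
    return all(s not in P for s in states if f[s])

# Enumerate all 2^9 binary functions on the state space.
deterministic_alerts = []
for bits in product([False, True], repeat=len(states)):
    f = dict(zip(states, bits))
    if is_deterministic_alert(f):
        deterministic_alerts.append(f)

# Or them all together to build the maximal alert, and compare with the ava.
maximal = {s: any(f[s] for f in deterministic_alerts) for s in states}
ava = {s: s not in P for s in states}
print(maximal == ava)  # True
```

The check succeeds because the deterministic alerts are exactly the indicator functions of subsets of the complement of \(P\), and their union is the full complement - which is precisely where the \(ava\) fires.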
Implications
When something looks wrong with the data, an (often lengthy) search for the underlying issue takes place. This search will end with a “smoking gun”. This smoking gun is always a violation of real-world assumptions made on the data.
The interpretation of the Maximal Alert Theorem is that you simply can’t do better than checking for real-world assumption violations. Any alert on the data is really just checking for assumption violations.
The only way to detect data accuracy issues is to make real-world assumptions and look for violations.
It’s worth noting that we achieve optimality by taking our assumptions - which are functions on \(\hat{T}\) - and feeding in the data \(\hat{X}\) instead.
Accuracy Coverage Equation
The accuracy issues that are not caught by the maximal alert constitute the “dark issues” discussed in the introduction. We will now quantify exactly how many dark issues there are.
In a data accuracy system, let \(P\) denote the set of \((t_1, \dots, t_N)\) that are plausible (that don’t violate real-world assumptions). Let \(\lvert P \rvert\) denote the size of the plausible set.
Let \(\lvert \hat{X} \rvert\) be the number of states the data can be in. For example, in a binary system \(x_1... x_N\) can be in \(2^N\) states.
Theorem: Accuracy Coverage Equation
Fraction of Accuracy Issues Detected = \(\frac{\lvert \hat{X} \rvert- \lvert P \rvert}{ \lvert \hat{X} \rvert - 1}\)
Proof (optional):
Size of the state space of \((\hat{T},\hat{X})\) where \(\hat{T}\) is plausible \(= \lvert P \rvert \cdot \lvert \hat{X} \rvert\)
- State space of plausible \(\hat{T}\) times state space of \(\hat{X}\).
States with no issues \(= \lvert P \rvert\)
- Every plausible \(\hat{T}\) has exactly one state \(\hat{X}\) without accuracy issues (\(\hat{X}=\hat{T}\)).
States with data accuracy issues \(= \lvert P \rvert \lvert \hat{X} \rvert - \lvert P \rvert\)
- Total system states minus states without issues.
Detectable issues \(= \lvert P \rvert(\lvert \hat{X} \rvert- \lvert P \rvert)\)
- State space of plausible \(\hat{T}\) times state space of \(\hat{X}\) that violates the \(ava\) (any such \(\hat{X}\) automatically differs from \(\hat{T}\), since \(\hat{T}\) is plausible and \(\hat{X}\) is not).
Fraction of accuracy issues detected \(= \frac{\lvert P \rvert(\lvert \hat{X} \rvert- \lvert P \rvert)}{ \lvert P \rvert \lvert \hat{X} \rvert - \lvert P \rvert} = \frac{\lvert \hat{X} \rvert- \lvert P \rvert}{ \lvert \hat{X} \rvert - 1}\)
Q.E.D.
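As a sanity check, here is a short Python sketch that counts states directly for the ferry example, under the assumed (purely illustrative) constraint that car counts range over 0-9:

```python
from itertools import product

# Ferry example with car counts 0..9 (an assumed range for illustration).
states = list(product(range(10), repeat=2))   # all (x1, x2) pairs: |X| = 100
P = [s for s in states if s[0] >= s[1]]       # plausible states:   |P| = 55

num_states, num_plausible = len(states), len(P)

# Count directly: every (plausible T, X) pair with X outside P is detectable;
# every (plausible T, X) pair with X != T is an issue.
detectable = num_plausible * (num_states - num_plausible)
issues = num_plausible * num_states - num_plausible

print(detectable / issues)                               # 0.4545...
print((num_states - num_plausible) / (num_states - 1))   # same, via the equation
```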
Data Issues are Detectable, Data Correctness is Not
Looking at the coverage equation \(\frac {\lvert \hat{X} \rvert- \lvert P \rvert}{ \lvert \hat{X} \rvert - 1}\), we see that we get 100% coverage only in the trivial case \(\lvert P \rvert = 1\). In that case there was no reason to gather any data in the first place, since \(t_1, \dots, t_N\) is fully determined by the assumptions. Therefore we can say the following:
In non-trivial discrete data systems there will always be accuracy issues that are undetectable.
Sadly, this also carries over to the continuous case, which we will cover in the future.
As we gather more expectations about the real world, we can add assumptions, which lowers \(\lvert P \rvert\) and raises the coverage.
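For instance, using the same illustrative 0-9 ferry system as the sketch above: \(\lvert \hat{X} \rvert = 100\), and the conservation of cars assumption alone leaves \(\lvert P \rvert = 55\) plausible states, for coverage of \(45/99 \approx 45\%\). Adding a hypothetical capacity assumption \(t_1 \leq 5\) shrinks the plausible set to \(\lvert P \rvert = 21\) and lifts coverage to \(79/99 \approx 80\%\).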
Summary
We looked at discrete systems and found that the only way to detect data accuracy issues is to feed the observed data into our real-world assumptions. A data accuracy alert is best when it simply takes all of the real-world assumptions in our system and checks whether any has been violated. We derived a surprisingly simple equation for data accuracy coverage in finite systems. In accordance with our intuition, as the number of plausible states decreases, the coverage of data accuracy issues increases.
This theory is simple and intuitive. It suggests a business process where real-world assumptions are systematically gathered in an organization and turned into a maximal alert.
In a future post we will look at continuous systems, where the theory remains simple but becomes more about degrees of freedom and geometry. That being said, all digital data systems are discrete at some level.
Thanks for reading, and please reach out with comments or suggestions.
Appendix:
Similarities to Channel Coding
The problem statement here is similar to that of noisy channel coding, in that there is a ‘message’ - the true values - that gets corrupted and must be deciphered.
This theory is much simpler than Shannon’s, because we do not wish to make any assumptions about the channel. Quite the opposite: we have gone to great lengths to avoid saying almost anything about the function that transforms \(\hat{T}\) into \(\hat{X}\) (we haven’t even given this function a name).
The reason is to develop a theory that can be applied with very little inspection of the data acquisition process. It is my view that modeling the data acquisition transfer function would be costly for many big data problems, because of intricate data ingestion procedures, high data variety, and the inability to see the ‘true’ values.
Another major difference between Shannon’s theory and ours is our heavy use of real-world assumptions.
In short, compared to Shannon’s setup, we are willing to make fewer assumptions about the channel and more about the message itself. If one were willing to make probabilistic assumptions about the transfer function, they might be able to use the deep results of Shannon’s theory.
Next Steps
The next steps are to relax the assumptions of determinism and discrete values. But relaxing these assumptions comes at a cost.
We will relax the deterministic assumption by allowing probabilistic assumptions. The deterministic theory developed above allows us to make epistemological statements about the data - what is knowable. Probabilistic assumptions (and alerts) are very useful tools, but will not expand what is knowable.