
The Trustable Score

The Trustable Score of each Statement made to support Expectations is defined recursively from the Evidence scores. Evidence scores are based on either:

  • Automated assessment of Artifacts using a well-defined metric.
  • Calibrated Subject Matter Expert (SME) assessment of the Evidence Statement using the referenced Artifacts.

Warning

The implementation of Trustable Score calculation and its definition in the tools is work-in-progress. This section describes the current behaviour of the tools. For the complete and correct theoretical definition of the score, see the Scoring Roadmap.

Calibrated SME Assessment

Statements in the Trustable Methodology must be verifiable propositions. That is, the notion that they are true or false must be meaningful and measurable.

There are many examples of Statements that are verifiable propositions but are too complex or high-level to easily measure directly (e.g. It will rain tomorrow). The field of Decision Analysis provides a path forward in such cases: Calibrated Probability Assessment. This means using Subject Matter Experts' assessments of their confidence in a given Statement as a probability measure. While early work confirmed this approach had merit in some cases (e.g. weather forecasting), it also identified assessor overconfidence as having a serious impact on assessor accuracy[1]. More recent work has identified credible methods for calibrating assessors, such as the use of structured feedback[2].

Key Concepts

  • Confidence: A measure of the probability a Statement is true, given by a Subject Matter Expert.
  • Accuracy: Typical error between a Subject Matter Expert's confidence in a Statement and its actual probability.
  • Calibration: Adjustment of Subject Matter Expert measurements to achieve a known standard of accuracy.
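To make these concepts concrete, the sketch below groups an SME's past assessments by stated confidence and compares each group against the observed proportion of true Statements. This is the basic measurement behind calibration; the `calibration_report` helper and the assessment history are purely hypothetical, not part of any tool.

```python
from collections import defaultdict

def calibration_report(history):
    """Group (confidence, outcome) pairs by stated confidence and
    return the observed proportion of true Statements in each group."""
    bins = defaultdict(list)
    for confidence, outcome in history:
        bins[confidence].append(outcome)
    return {c: sum(o) / len(o) for c, o in sorted(bins.items())}

# Hypothetical assessment history: (stated confidence, was the Statement true?).
history = [
    (0.9, True), (0.9, True), (0.9, True), (0.9, False),
    (0.5, True), (0.5, False),
]
report = calibration_report(history)
# report[0.9] == 0.75: Statements rated at 90% confidence were true only
# 75% of the time, a sign of overconfidence; report[0.5] == 0.5 is well
# calibrated.
```

A well-calibrated assessor's observed proportions track their stated confidences; systematic gaps like the one above indicate the need for further calibration.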

Assessing Evidence

In the Trustable Methodology, we require assessors to have two key qualities:

  1. Expertise. Assessors with good knowledge of the subject area are needed to reduce epistemic uncertainty. While in theory anyone can provide accurate estimates of their own confidence, these are likely to reflect their uncertainty in the subject (i.e. their scores will lie close to 0.5) and provide little additional value. On the other hand, Subject Matter Experts have the context and understanding to offer assessments with reduced epistemic uncertainty (though not necessarily greater accuracy).
  2. Calibration. The Assessor should be calibrated to compensate for their innate overconfidence. This provides improved accuracy, though not necessarily reduced uncertainty.

In practice, this means SMEs must only assess topics within their expertise and must undergo calibration exercises. Furthermore, since confidence assessments are probabilities, their correlation with reality grows with the number of assessments. Therefore, assessments should be performed frequently and by a significant number of individuals, and used to infer long-run trends rather than treated as an exact reflection of the current reality.

When assessing a Statement, SMEs should consider the following:

  • Is my assessment based on only the referenced Artifacts, or do I need to reference other documents before providing a score?
  • Is my assessment of the whole Statement, or do I need to break the Statement down further before providing a score?
  • Am I sufficiently calibrated?
  • Would an automated validator reach a similar conclusion?

Note

Unscored Evidence is always assumed to have a score of zero.

Calibration

In addition to general calibration training, Statements that can be verified by testing can also be used to help calibrate SMEs. Where testing cannot be used, the following strategies can be used to improve calibration:

  • Self-Validation of historic estimates
  • Cross-Validation using other estimates
  • Statistical Anomaly Detection

SME Scoring Guidance

The SME assessment is an assignment of a probability to a statement being true. The purpose of calibration is to ensure that the SME's assessments match the actual probability of the statement's truth (strictly speaking, if the SME states a statement is true with 90% confidence, it actually is true 90% of the time). Therefore:

  • A score of 0 means the SME is certain the Statement is false.
  • A score of 1 means the SME is certain the Statement is true.
  • A score of 0.5 means the SME has no information or intuition to indicate whether the Statement is more likely to be true or false.

Defining scores for all items

A Trustable Graph comprises a set of Statements \(S\) and a set of directed edges \(L\), such that the graph is defined by the ordered pair \((S,L)\). The existence of an edge \((s,s')\in L\) means that Statement \(s\) is supported, in whole or in part, by the claim made by Statement \(s'\).

The Trustable Score function \(T: S \rightarrow [0,1]\) is defined as,

\[ T(s) = \frac{1}{|\{s': (s,s')\in L\}|}\sum_{s'\in\{s' : (s,s')\in L\}} T(s'). \]

That is, the Trustable Score of a Statement \(s\) is the mean of the scores of its supporting Statements.

Therefore, if scores are defined for all Evidence Statements (recall this is the set of Statements with no outgoing edges, \(S_E = \{s\in S : (s,s') \not\in L,\; \forall s' \in S\}\)), the definition of \(T(s)\) is sufficient to recursively define the scores of all items in the graph.
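As a minimal sketch of this recursion (assuming, purely for illustration, that the graph is stored as a mapping from each Statement to the list of Statements supporting it; the `trustable_score` function and the example graph are hypothetical):

```python
def trustable_score(graph, evidence_scores):
    """Recursively evaluate T(s) as the mean of the supporting
    Statements' scores; Evidence Statements take their assessed score."""
    def T(s):
        children = graph.get(s, [])
        if not children:
            # Evidence Statement: unscored Evidence defaults to zero.
            return evidence_scores.get(s, 0.0)
        return sum(T(c) for c in children) / len(children)
    return {s: T(s) for s in graph}

# Hypothetical graph: one claim supported by two Evidence Statements.
graph = {"claim": ["e1", "e2"], "e1": [], "e2": []}
scores = trustable_score(graph, {"e1": 0.8, "e2": 0.6})
# scores["claim"] is the mean of 0.8 and 0.6, i.e. 0.7
```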

Calculating scores for all items

We briefly discuss here how the Trustable Score can be calculated using powers of the adjacency matrix. Given a graph of \(n\) nodes, label the nodes with indices \(1\leq i\leq n\), such that the set of statements \(S\) is equivalent to \(\{s_i:i= 1,...,n\}\). We may then write the entries of adjacency matrix \(\mathbf{W}\) as

\[ w_{ij} = \begin{cases} \frac{1}{|\{s': (s_i,s')\in L\}|},\; (s_i, s_j)\in L \\ 0,\;\text{otherwise} \end{cases}. \]

That is, the \((i,j)^\text{th}\) entry of the adjacency matrix is zero where there is no edge from \(s_i\) to \(s_j\) and the inverse of the number of children of \(s_i\) otherwise.

Then, given the vector of Evidence scores \(\mathbf{t}_E\) whose entries are given by

\[ t_i = \begin{cases} T(s_i), \; s_i \in S_E \\ 0, \; \text{otherwise} \end{cases}, \]

we claim that the vector of Trustable Scores for all nodes, \(\mathbf{t}\) with \(t_i = T(s_i)\), is given by the sum of the products of adjacency matrix powers with the Evidence scores,

\[ \mathbf{t} =\sum_{i=0,...,n} \mathbf{W}^i \mathbf{t}_E, \]

since \(\mathbf{W}^m \mathbf{t}_E\) contains the contributions to the score of all nodes from paths of length \(m\) (the \(m=0\) term contributing the Evidence scores themselves), and an acyclic digraph of \(n\) nodes cannot contain paths of length greater than \(n\).

Note this coincides exactly with the result presented in the Roadmap, under the assumption that all leaf scores are considered to be correctness scores, \(\mathbf{t}_E={\mathbf{c}_r}_E\) and that the argument is complete, such that \(\mathbf{C}_p=\mathbf{I}\).
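The power-series formula can be checked numerically on a small example. The four-node graph below is hypothetical, constructed only to illustrate the calculation:

```python
import numpy as np

# s0 is supported by s1 and s2; s1 is supported by s3.
# s2 and s3 are Evidence Statements with assessed scores 0.6 and 0.8.
n = 4
W = np.zeros((n, n))
W[0, 1] = W[0, 2] = 0.5  # s0 has two children, so each entry is 1/2
W[1, 3] = 1.0            # s1 has a single child
t_E = np.array([0.0, 0.0, 0.6, 0.8])

# t = sum over i = 0..n of W^i t_E
t = sum(np.linalg.matrix_power(W, i) @ t_E for i in range(n + 1))
# t recovers the recursive definition: T(s1) = T(s3) = 0.8, T(s2) = 0.6,
# and T(s0) = (0.8 + 0.6) / 2 = 0.7.
```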

Equivalent Adjacency List Implementation

Our reference implementation of the Trustable Score calculation uses the graphalyzer backend. graphalyzer represents Trustable graphs as directed acyclic graphs using adjacency lists and evaluates the score by dynamic programming over the graph. For acyclic graphs this permits score computation in time proportional to the number of nodes and edges.

This recursion is mathematically equivalent to the matrix power-series formulation above, but evaluates it directly on the graph structure.

Trustable Score

The Trustable Scores can be computed by the following recurrence:

\[ s(v) = c(v)\left( r(v) + \sum_{u \in succ(v)} w_{vu}s(u) \right) \]

where:

  • \(v, u\) are nodes in the Trustable graph,
  • \(s(v)\) is the Trustable score of node \(v\),
  • \(c(v)\) is the completeness factor associated with node \(v\),
  • \(r(v)\) is the correctness value of node \(v\),
  • \(succ(v)\) is the set of immediate successor nodes of \(v\),
  • \(w_{vu}\) is the weight of the directed edge \(v \rightarrow u\).

For leaf (Evidence) nodes, the successor set is empty and the definition reduces to \(s(v)=c(v)r(v)\).

The code matching the formulation is as follows:

n = graph.size
score = np.zeros(n)
for v in reversed(graph.topological_order):
    score[v] = completeness[v] * (
        correctness[v]
        + sum(weight_vu * score[u] for u, weight_vu in graph.successors[v])
    )
return score

The implementation iterates through the nodes in reverse topological order, ensuring that all successor nodes are scored before their parents. Each node’s score is computed as a single expression matching the recurrence: the correctness value plus weighted successor contributions, scaled by completeness. Leaf (Evidence) nodes have no successors, so their score reduces to completeness[v] * correctness[v].
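On a small concrete graph, the recurrence plays out as follows. The `topological_order`, `successors`, `completeness`, and `correctness` values below are made up for illustration:

```python
import numpy as np

# Node 0 is supported by leaves 1 and 2 with equal edge weights of 0.5.
topological_order = [0, 1, 2]
successors = {0: [(1, 0.5), (2, 0.5)], 1: [], 2: []}
completeness = np.array([1.0, 1.0, 1.0])
correctness = np.array([0.0, 0.8, 0.6])

score = np.zeros(3)
for v in reversed(topological_order):
    # s(v) = c(v) * (r(v) + sum of w_vu * s(u) over successors u)
    score[v] = completeness[v] * (
        correctness[v] + sum(w * score[u] for u, w in successors[v])
    )
# Leaves reduce to c * r: score[1] = 0.8, score[2] = 0.6;
# the root accumulates 0.5 * 0.8 + 0.5 * 0.6 = 0.7.
```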

Node Sensitivity

For node sensitivity, the accumulated influence of node \(v\) on node \(t\) can be calculated with the following recurrence:

\[ b(t) = 1 \]
\[ b(u) = \sum_{v \in pred(u)} w_{vu}c(v)b(v) \]

where:

  • \(t\) is the target node,
  • \(v\) and \(u\) are nodes in the graph,
  • \(b(v)\) is the accumulated influence of node \(v\) on the target \(t\),
  • \(c(v)\) is the completeness factor of node \(v\),
  • \(w_{vu}\) is the weight of the directed edge \(v \rightarrow u\).

Because the graph is acyclic, this recurrence can be evaluated in a single topological traversal of the graph.

The code matching the formulation is as follows:

sensitivity = np.zeros(graph.size)
sensitivity[t] = 1.0
pos_t = graph.topological_order.index(t)
for u in graph.topological_order[pos_t + 1 :]:
    sensitivity[u] += sum(
        weight_vu * completeness[v] * sensitivity[v]
        for v, weight_vu in graph.predecessors[u]
    )
return sensitivity

The sensitivity of the target node is initialised to 1. The graph is then traversed in topological order. For each edge \(v \rightarrow u\), influence is propagated from \(v\) to \(u\), scaled by:

  • the edge weight \(w_{vu}\),
  • the completeness factor of the parent node \(v\).

This procedure accumulates the total influence of every node on the target. Because the graph is acyclic, each propagation step is performed exactly once, yielding linear time complexity in the size of the graph.
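The traversal can be illustrated on a minimal hypothetical graph: a target node with two leaf children, equal edge weights, and full completeness (all values below are made up):

```python
import numpy as np

# Node 0 is the target; it is supported by leaves 1 and 2 (weights 0.5).
topological_order = [0, 1, 2]
predecessors = {0: [], 1: [(0, 0.5)], 2: [(0, 0.5)]}
completeness = np.array([1.0, 1.0, 1.0])

t = 0  # target node
sensitivity = np.zeros(3)
sensitivity[t] = 1.0
pos_t = topological_order.index(t)
for u in topological_order[pos_t + 1:]:
    # b(u) = sum of w_vu * c(v) * b(v) over predecessors v of u
    sensitivity[u] += sum(
        w * completeness[v] * sensitivity[v] for v, w in predecessors[u]
    )
# sensitivity == [1.0, 0.5, 0.5]: each leaf influences the target
# through a single edge of weight 0.5.
```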

Edge Sensitivity

For Edge Sensitivity, we compute \(\frac{\partial s(t)}{\partial w_{vu}}\) for all nodes \(t\) by propagating the influence of \(v\) to its ancestors in reverse topological order:

\[ b(v) = 1 \]
\[ b(t) = c(t)\left( \sum_{i \in succ(t)} w_{ti}b(i) \right) \quad \text{for } t \neq v \]
\[ \frac{\partial s(t)}{\partial w_{vu}} = c(v)s(u)b(t) \]

where:

  • \(v\) is the parent node of the edge,
  • \(u\) is the child node of the edge,
  • \(t\) is a target node,
  • \(i\) is a successor node of \(t\),
  • \(w_{vu}\) is the weight of the edge \(v \rightarrow u\),
  • \(s(u)\) is the Trustable score of node \(u\),
  • \(c(v)\) is the completeness factor of node \(v\),
  • \(b(t) = \frac{\partial s(t)}{\partial s(v)}\) is the sensitivity of node \(t\)'s score to node \(v\)'s score.

The code matching the formulation is as follows:

n = graph.size
v, u = edge
score = _vector_score(graph, completeness, correctness)

dscore_dv = np.zeros(n)
dscore_dv[v] = 1.0

pos_v = graph.topological_order.index(v)

for t in reversed(graph.topological_order[:pos_v]):
    dscore_dv[t] = completeness[t] * sum(
        weight_ti * dscore_dv[i] for i, weight_ti in graph.successors[t]
    )

return completeness[v] * score[u] * dscore_dv

The implementation first computes the global score vector. It then computes \(b(t) = \partial s(t) / \partial s(v)\) for all \(t\) in a single reverse topological traversal, starting from \(b(v) = 1\) and accumulating weighted successor contributions for each ancestor. The final edge sensitivity is the product of the completeness of \(v\), the score of \(u\), and \(b(t)\).

This is equivalent to the chain rule applied to the recursive score equation, but avoids computing per-node sensitivities separately, yielding linear time complexity in the size of the graph.
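A finite-difference check on a small hypothetical graph confirms the chain-rule result. The `score_vec` helper below re-implements the score recurrence for illustration; the graph and its values are made up:

```python
import numpy as np

def score_vec(successors, order, completeness, correctness):
    # Score recurrence: s(v) = c(v) * (r(v) + sum of w * s(u)).
    s = np.zeros(len(order))
    for v in reversed(order):
        s[v] = completeness[v] * (
            correctness[v] + sum(w * s[u] for u, w in successors[v])
        )
    return s

# Hypothetical graph: node 0 supported by leaves 1 and 2, weights 0.5.
order = [0, 1, 2]
completeness = np.array([1.0, 1.0, 1.0])
correctness = np.array([0.0, 0.8, 0.6])
base = {0: [(1, 0.5), (2, 0.5)], 1: [], 2: []}

# Analytic: d s(0) / d w_01 = c(0) * s(1) * b(0), with b(0) = 1.
analytic = completeness[0] * score_vec(base, order, completeness, correctness)[1]

# Numeric: nudge w_01 by eps and observe the change in s(0).
eps = 1e-6
bumped = {0: [(1, 0.5 + eps), (2, 0.5)], 1: [], 2: []}
numeric = (
    score_vec(bumped, order, completeness, correctness)[0]
    - score_vec(base, order, completeness, correctness)[0]
) / eps
# Both evaluate to s(1) = 0.8, since s(0) is linear in w_01.
```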


  1. Lichtenstein S, Fischhoff B, Phillips LD. 1982. Calibration of probabilities: The state of the art to 1980. In Judgment under Uncertainty: Heuristics and Biases. pp306-334. Cambridge University Press. 

  2. Moore A, Swift S et al. 2019. Confidence Calibration in a Multiyear Geopolitical Forecasting Competition. Management Science. 61(11) pp3552-3565. https://doi.org/10.1287/mnsc.2016.2525