Safety Cases Are Dependency Graphs That Nobody Maintains
Every organization building frontier AI systems faces the same structural problem: demonstrating that a system is safe requires assembling a web of interdependent claims — capability evaluations, behavioral tests, monitoring results, theoretical arguments — into a coherent safety case. The conclusions depend on the evaluations, which depend on the threat models, which depend on capability assumptions, which depend on forecasts about what the system can do. This is a dependency graph. And right now, every organization maintains it informally.
When a capability evaluation returns a surprising result, there is no systematic way to trace which safety claims are undermined. When a new attack vector is published, there is no automated process for identifying which behavioral guarantees depend on the assumption that the attack was infeasible. The dependency structure exists — it has to, because the safety argument has logical structure — but it lives in researchers’ heads, in prose documents, and in slide decks. It is implicit, manually maintained, and brittle in exactly the way that matters most.
I’m building Scholion, a system for extracting claim-dependency structures from research documents and maintaining them as machine-tractable graphs. Here’s what the evidence says about why this matters, what it would take to validate, and what changes if it works.
The structure of safety reasoning is the vulnerability
Karl Weick’s analysis of the Mann Gulch disaster identified a pattern he called the “collapse of sensemaking”: an organization’s shared interpretive framework — the implicit structure that makes individual observations coherent — can disintegrate under stress, and when it does, the failure is catastrophic precisely because the structure was never explicit enough to reconstruct. The smokejumpers at Mann Gulch didn’t just lose coordination. They lost the shared framework that made coordination possible. Role structures, communication patterns, and interpretive defaults all collapsed simultaneously because they were coupled in ways nobody had mapped.
Safety cases at frontier AI organizations have the same structural vulnerability. Consider a concrete example. Anthropic’s Responsible Scaling Policy commits to evaluating dangerous capabilities before they emerge, using what the policy describes as “informal forecasts” to predict capability trajectories between evaluation cycles. This is a load-bearing claim: the entire evaluation framework assumes that capability jumps can be anticipated well enough to trigger assessments before dangerous thresholds are crossed. The RSP itself acknowledges that these forecasting methods are “open research questions” that need improvement before being “relied upon for risk judgments.”
This is not a criticism of the RSP — it’s the most carefully reasoned public safety framework in the industry. The point is structural. The forecasting assumption sits beneath the evaluation schedule, which sits beneath the capability assessment, which sits beneath the deployment decision. That’s a dependency chain four levels deep, and the foundation is an acknowledged open question. If informal forecasting fails to anticipate a capability jump, the failure doesn’t stay local — it propagates through the entire chain, potentially undermining deployment decisions that appeared well-supported.
The RSP documents this chain in prose. Specific invalidating conditions are maintained as lists in appendix tables. When conditions change, humans must read the prose, identify which claims are affected, and manually update their assessments. This is sensemaking in Weick’s sense — an interpretive process that works when conditions are stable and familiar, and fails when they aren’t, precisely because the structure it depends on was never externalized.
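To make the contrast concrete, here is a minimal sketch of what that same chain looks like once it is externalized: a plain dependency map plus a traversal that answers "if this claim is contested, what else is at risk?" The claim names below are paraphrases of the RSP example, not text from the policy.

```python
from collections import deque

# Illustrative dependency chain from the RSP example above; edges point
# from a supporting claim to the claims that rest on it.
SUPPORTS = {
    "informal capability forecasts are reliable": ["evaluation schedule is adequate"],
    "evaluation schedule is adequate": ["capability assessment is current"],
    "capability assessment is current": ["deployment decision is well-supported"],
    "deployment decision is well-supported": [],
}

def affected_by(contested_claim: str, supports: dict[str, list[str]]) -> list[str]:
    """Return every claim that transitively depends on the contested claim."""
    affected, seen = [], set()
    queue = deque(supports.get(contested_claim, []))
    while queue:
        claim = queue.popleft()
        if claim in seen:
            continue
        seen.add(claim)
        affected.append(claim)
        queue.extend(supports.get(claim, []))
    return affected

print(affected_by("informal capability forecasts are reliable", SUPPORTS))
# -> every downstream claim in the four-level chain, ending at the deployment decision
```

Nothing in that traversal is sophisticated. The point is that it is only possible once the structure exists somewhere other than prose.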
What exists and what doesn’t
Several systems address pieces of this problem. Citation graphs (Semantic Scholar, Connected Papers) track paper-level relationships but treat all citations as equivalent — they can’t distinguish a foundational dependency from a passing reference. Scite.ai classifies citations as supporting or contrasting at the paper level, but not the claim level. The Argument Interchange Format represents argument structure within single documents but doesn’t track dependencies across papers or through time. Knowledge graph projects like the Open Research Knowledge Graph and AutoSchemaKG operate at the entity and relationship level, not the argumentative structure level.
What doesn’t exist is claim-level dependency tracking across documents with crux identification and automated invalidation propagation. No system currently extracts the Toulmin structure of scientific arguments — the claims, the warrants connecting evidence to conclusions, the backing supporting those warrants, the qualifiers hedging those conclusions, the rebuttals acknowledging counterarguments — and maps the dependency edges between them across papers. No system identifies which dependencies are cruxes: dependencies where, if the supporting claim falls, the downstream conclusion collapses. And no system propagates the implications when a foundational claim is contested.
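To pin down what crux identification means operationally, here is a toy sketch under one possible definition: a support edge is a crux when the claim it supports has no other support, so that claim stands or falls with the single dependency. The edge types and claim identifiers are illustrative, not Scholion's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    source: str      # id of the supporting claim
    target: str      # id of the supported claim
    kind: str        # e.g. "evidential", "warrant", "backing" (illustrative types)

EDGES = [
    Edge("paper_a.claim_3", "paper_b.claim_1", "evidential"),
    Edge("paper_a.claim_3", "paper_c.claim_2", "evidential"),
    Edge("paper_d.claim_7", "paper_c.claim_2", "warrant"),
]

def supports_of(target: str, edges: list[Edge]) -> list[Edge]:
    return [e for e in edges if e.target == target]

def cruxes(edges: list[Edge]) -> list[Edge]:
    """A support edge is a crux when its target has no other support:
    if the source claim falls, the target collapses with it."""
    return [e for e in edges if len(supports_of(e.target, edges)) == 1]

for edge in cruxes(EDGES):
    print(f"{edge.target} stands or falls with {edge.source} ({edge.kind})")
# paper_b.claim_1 has a single support edge, so that edge is a crux;
# paper_c.claim_2 has two, so neither of its edges is a crux on its own.
```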
The gap matters because the hardest problem in safety case maintenance isn’t finding contradictions — it’s tracing their consequences through a web of implicit dependencies that nobody has made explicit.
What Scholion enables
If claim-dependency tracking works, three things change beyond the immediate utility of better-maintained safety cases.
Structural oversight becomes possible. The scalable oversight problem — how a less capable system can verify the outputs of a more capable one — is usually framed as a knowledge problem. But it’s also a structural problem. A judge that can inspect the dependency graph of an argument doesn’t need to evaluate every claim on its merits. It needs to verify that the structure is sound: that conclusions actually depend on their stated premises, that load-bearing claims have adequate support, that cruxes are identified and monitored. Logic, probability, and argumentation structures are inherently more legible than raw natural language prose. If this holds, a structurally competent judge can oversee a substantively stronger researcher, not by competing on knowledge but by verifying structural integrity.
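A hedged sketch of what "verifying that the structure is sound" could mean in practice: checks that run purely on the graph, with no access to the substance of any claim. The specific checks and the load-bearing marker are assumptions for illustration, not a fixed audit protocol.

```python
# `premises` maps each claim to the claims it cites as support;
# `load_bearing` marks claims that deployment decisions rest on.

def unsupported_load_bearing(premises: dict[str, list[str]], load_bearing: set[str]) -> set[str]:
    """Load-bearing claims that cite no support at all."""
    return {c for c in load_bearing if not premises.get(c)}

def circular_support(premises: dict[str, list[str]]) -> set[str]:
    """Claims that transitively appear among their own premises."""
    def reachable(start: str) -> set[str]:
        seen, stack = set(), list(premises.get(start, []))
        while stack:
            c = stack.pop()
            if c not in seen:
                seen.add(c)
                stack.extend(premises.get(c, []))
        return seen
    return {c for c in premises if c in reachable(c)}
```

Both checks are decidable from the graph alone, which is exactly what makes a structurally competent judge plausible without matching the researcher's knowledge.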
Safety cases become living documents. Currently, safety cases are snapshots assembled at evaluation time and maintained through manual review. A system that tracks claim dependencies as a graph can update automatically when new evidence arrives. When a monitoring system flags an anomalous behavior, it’s not just an incident — it’s evidence that propagates through the graph and potentially undermines specific safety claims, which in turn undermine specific deployment decisions. The graph provides the propagation logic that manual review cannot deliver at scale.
Research communities gain a shared epistemic map. The AI safety literature — roughly 2,000 to 5,000 papers in the narrow scope, 50,000 to 100,000 including adjacent fields — has dependency structures that span hundreds of papers and decades of work. When a foundational result is challenged, the implications take months to propagate through informal channels. A maintained dependency graph makes the propagation immediate and complete, converting what is currently a slow, lossy social process into a structural one.
Each of these claims is contingent on capabilities that haven’t been fully validated. But the strategic value concentrates in the first — structural oversight — because it addresses a problem that grows more acute as AI systems become more capable. If structural verification is feasible, it provides an oversight mechanism that scales with system capability rather than requiring the overseer to match the system’s knowledge.
Why these claims are testable now
Three capabilities have matured to the point where Scholion is buildable rather than aspirational.
LLM extraction is good enough for structured claims. Elicit demonstrates 99%+ accuracy on structured field extraction from scientific papers. The harder task — extracting argumentative structure with typed dependencies — hasn’t been benchmarked at the same scale, but multi-pass LLM pipelines using structured output (Instructor/Pydantic with Claude or GPT-4) achieve 25–80% accuracy improvements over single-pass extraction on complex decomposition tasks. The closest benchmarks in argument mining show 35–70% F1 on full argument structure, with inter-annotator agreement at κ = 0.35–0.63 — which means even human experts disagree substantially. The extraction problem is genuinely hard, but “hard” is different from “impossible,” and the gap between current LLM capability and the argument mining state of the art suggests significant room for improvement with well-designed schemas and prompting strategies.
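As a sketch of what a well-designed schema could look like as a structured-output target, here is a minimal Pydantic model for Toulmin-style extraction. The field names are illustrative, not Scholion's actual schema.

```python
from typing import Optional
from pydantic import BaseModel, Field

class ToulminClaim(BaseModel):
    """One argumentative unit extracted from a paper (illustrative schema)."""
    claim: str = Field(description="The conclusion being asserted")
    grounds: list[str] = Field(description="Evidence offered in support")
    warrant: Optional[str] = Field(None, description="Why the grounds license the claim")
    backing: Optional[str] = Field(None, description="Support for the warrant itself")
    qualifier: Optional[str] = Field(None, description="Hedges such as 'in most settings'")
    rebuttals: list[str] = Field(default_factory=list, description="Acknowledged counterarguments")

class ExtractionResult(BaseModel):
    paper_id: str
    claims: list[ToulminClaim]
```

A library like Instructor passes a model like this as the response target of a Claude or GPT-4 call; the multi-pass gains cited above would come from extracting claims in one pass and identifying dependencies in a second pass that references the already-extracted claims.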
The corpus is tractably small. The narrow AI safety corpus — value alignment, corrigibility, scalable oversight, debate, constitutional AI — contains roughly 2,000 to 5,000 papers. A 5-stage extraction pipeline processing 50,000 papers costs approximately $200–$400 through batch APIs. The binding constraint is not compute but schema design and extraction validation.
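One set of assumptions consistent with that range, shown as explicit arithmetic; the per-paper token count and batch price are illustrative placeholders, not measured values.

```python
# Back-of-envelope consistent with the $200–$400 figure above.
# All numbers are assumptions for illustration, not measured costs.
papers = 50_000
stages = 5
tokens_per_paper_per_stage = 8_000        # assumed average prompt size
price_per_million_tokens = (0.10, 0.20)   # assumed batch-API rate range, USD

total_tokens = papers * stages * tokens_per_paper_per_stage   # 2.0e9 tokens
low, high = (total_tokens / 1e6 * p for p in price_per_million_tokens)
print(f"${low:,.0f} to ${high:,.0f}")     # $200 to $400
```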
Formal substrates for dependency reasoning exist. Datalog with provenance semirings provides algebraically precise support tracking that generalizes de Kleer’s Assumption-Based Truth Maintenance System labels. Transitive closure — the core operation for “what depends on what” — is the textbook Datalog use case. The CALM theorem gives principled guidance on which operations can be distributed coordination-free (monotonic derivation) versus which require consensus (retraction, dispute resolution). Truth Maintenance Systems from the 1970s–80s solved the single-agent version of this problem; Scholion extends that machinery across an entire research community.
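A toy fixpoint sketch of that substrate: Datalog-style transitive closure over "supports" edges, where every derived fact carries the set of base edges it rests on, the why-provenance that ATMS labels correspond to. A real implementation would use an actual Datalog engine with semiring provenance; this only shows the shape of the computation, reusing short names from the RSP chain.

```python
from itertools import product

# supports(x, y) base facts
base_edges = {("forecast", "schedule"), ("schedule", "assessment"), ("assessment", "decision")}

# fact -> set of alternative derivations, each a frozenset of base edges
facts: dict[tuple[str, str], set[frozenset]] = {e: {frozenset([e])} for e in base_edges}

changed = True
while changed:
    changed = False
    snapshot = list(facts.items())
    # join supports*(x, y) with supports*(y, z) to derive supports*(x, z)
    for ((x, y1), provs_xy), ((y2, z), provs_yz) in product(snapshot, snapshot):
        if y1 != y2:
            continue
        for p, q in product(provs_xy, provs_yz):
            derivation = p | q
            if derivation not in facts.setdefault((x, z), set()):
                facts[(x, z)].add(derivation)
                changed = True

# Every long-range dependency now records exactly which base edges it rests on;
# retracting one of them identifies the affected derivations directly.
print(facts[("forecast", "decision")])
```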
The gap that matters: no existing system combines Toulmin-structured extraction with cross-document dependency tracking, crux identification, and automated propagation. And critically, no validated Toulmin-structured dataset for scientific papers exists. This means the annotation schema and dataset are themselves research contributions — potentially the most valuable near-term output, because they provide the ground truth against which any extraction pipeline must be evaluated.
How I’m validating
I’ve structured validation as three nested claims, each depending on the previous one holding, with explicit kill criteria at each stage.
Claim 1 — Toulmin extraction is reliable enough to be useful. Validated if a multi-pass LLM pipeline achieves >70% F1 on claim extraction and >50% F1 on dependency identification against expert annotation. Killed if inter-annotator agreement on the schema itself is below κ = 0.4, which would mean the categories don’t carve the domain at its joints and the schema needs fundamental revision. The first test: annotate 20–50 AI safety papers using a formalized Toulmin schema with typed dependency edges, measure inter-annotator agreement, and iterate on the guidelines until the schema is stable. This is the cheapest and most informative experiment — it costs nothing but time and produces a publishable dataset regardless of downstream results.
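The measurement itself is standard; the hard part is aligning annotations to a shared set of candidate spans. A minimal sketch, assuming that alignment has already been done and using made-up labels:

```python
from sklearn.metrics import cohen_kappa_score, f1_score

# Assumes both annotators (or annotator vs. pipeline) have labeled the same
# pre-aligned candidate spans; span alignment is itself part of the hard work.
annotator_a = ["claim", "warrant", "none", "claim", "backing", "claim"]
annotator_b = ["claim", "claim",   "none", "claim", "backing", "rebuttal"]

kappa = cohen_kappa_score(annotator_a, annotator_b)       # schema stability (kill: below 0.4)
f1 = f1_score(annotator_a, annotator_b, average="macro")  # extraction quality (target: above 0.7)
print(f"kappa={kappa:.2f}, macro-F1={f1:.2f}")
```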
Claim 2 — cross-document dependency matching works. Validated if semantic similarity plus LLM-based judgment can identify matching claims across papers with >60% precision at >40% recall. Killed if the false positive rate makes the dependency graph too noisy to be useful — if every claim appears to depend on everything, the graph provides no signal. The test: take a known dependency chain spanning 3–5 papers, extract claims from each independently, and evaluate whether automated matching recovers the chain.
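A sketch of the first stage of that matcher, assuming sentence embeddings propose candidates and an LLM judge (not shown) confirms or rejects them; the model name and threshold are assumptions, not tuned choices.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

claims_paper_a = ["Informal forecasts can anticipate capability jumps between evaluations."]
claims_paper_b = ["Capability jumps are predictable from trends observed at smaller scale.",
                  "Behavioral evaluations transfer reliably across model sizes."]

emb_a = model.encode(claims_paper_a, convert_to_tensor=True)
emb_b = model.encode(claims_paper_b, convert_to_tensor=True)
scores = util.cos_sim(emb_a, emb_b)   # 1 x 2 similarity matrix

# Candidate matches above an assumed threshold go to the LLM judge for confirmation.
candidates = [(i, j) for i in range(len(claims_paper_a))
              for j in range(len(claims_paper_b)) if scores[i][j] > 0.5]
print(candidates)
```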
Claim 3 — the system is useful for safety case maintenance. Validated if a safety researcher reviewing a dependency graph identifies a non-obvious vulnerability faster than reviewing the same material in prose. Killed if domain experts consistently find the graph representation less useful than their existing tools — which would mean the structural overhead exceeds the structural benefit. This is the most expensive test and the last one to run.
Before running any experiments, I’m building calibrated judgment through the existing literature and direct conversations with researchers working on scalable oversight, argument mining, and safety case methodology. The community working at the intersection of formal argumentation and AI safety is small enough that conversations may be more informative than additional literature review.
Why I’m working on this
I built the initial compiler pipeline for AWS Neuron SDK, the compilation stack for Trainium chips. Before that I built Neptune ML, bringing graph machine learning to production on AWS. I worked on AutoML at H2O.ai with the team that built Driverless AI, and most recently led the training effort for Amazon’s first production LLM. The thread across these roles has been the same: identifying where research capabilities are mature enough to become production systems, then building the bridge.
The safety case problem isn’t one I encountered abstractly. It’s a structural pattern I recognize from building production ML systems: implicit dependencies that work until they don’t, maintained through institutional knowledge and manual review, accumulating untracked technical debt. The difference is that in ML infrastructure, dependency failures cause production outages. In safety cases, the stakes are higher and the feedback loops are longer — which makes the case for explicit dependency tracking stronger, not weaker.
Scholion is part of a broader investigation into what research infrastructure looks like when the pace of AI capability development outstrips the tools for reasoning about it. Safety cases maintained in prose documents and appendix tables are a specific instance of a general pattern: sequential, manual processes designed for a slower pace of change failing to keep up when the underlying technology accelerates. A system that makes the dependency structure of safety reasoning explicit, machine-maintainable, and automatically updated when evidence changes is one piece of what adapted infrastructure looks like.
I’ve built working prototypes of the overlay interface — a reading surface with simultaneous structural decomposition and guided reasoning layers — and I’ve done manual claim extraction and dependency tracing on AI safety literature to validate the schema and surface the hard ontological problems. Six specific problems emerged from that work: claims aren’t atomic, limitations function as negative dependencies, circular evaluation structures need detection, evidential support requires weighted edges, counterfactual claims hide in discussion sections, and traditional epistemological taxonomies don’t carve the domain at the joints that matter for dependency tracking. These aren’t bugs — they’re the empirical findings of a schema validation study, and they’re the kind of findings that determine whether the system works.
I’ll be writing about findings as the work progresses. If you work on safety cases, scalable oversight, argument mining, or related areas — thomas@thbrdy.dev.