Where the Capability Stack Has Genuine Leverage
Not a manifesto. A research program design. Concrete, modest about what cannot be done, specific about what can. Three fundable programs. Honest probability estimates. The architectural limit named plainly.
Capability Inventory for Hard Problems
Most of what can be done is not relevant to frontier discovery. The honest inventory is narrow. For each capability: the genuine leverage regime, the recombination regime (the vast majority), and the honest assessment.
Bottleneck Matches
Not "AI can help with biology" — but the specific case where the work currently bottlenecked by a specific constraint could be accelerated by a specific configuration, with a specific verification step, measurable by a specific criterion.
In many areas of biology — protein interaction networks, gene regulatory networks, disease mechanisms — the bottleneck is not experimental data. There is more data than any research group can read. The bottleneck is the integration of findings across thousands of papers, many of which are not in dialogue with each other, to identify candidate mechanisms consistent with all available evidence.
The matching capability is synthesis of disparate experimental results into unified frames: given a defined phenomenon, identify which published experimental results are consistent with which candidate mechanisms, and flag the mechanisms that are consistent with the most evidence while being inconsistent with the least.
The human collaborator provides the domain expertise to verify that proposed mechanisms are biologically plausible, access to unpublished data not in the training corpus, and the experimental capacity to test the highest-priority candidates.
The proposed mechanism must make specific, testable predictions that are not already in the literature. If the mechanism only explains existing data without predicting new results, it is a synthesis, not a contribution.
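A minimal sketch of the ranking step this capability implies, in Python, assuming a hypothetical consistency matrix in which each published result has already been labeled consistent, inconsistent, or silent with respect to each candidate mechanism; the labeling itself is the hard synthesis work and is not shown here.

```python
from dataclasses import dataclass, field

# Hypothetical labels: +1 consistent, -1 inconsistent, 0 silent.
Label = int

@dataclass
class Mechanism:
    name: str
    # Maps a published-result identifier to its label for this mechanism.
    evidence: dict[str, Label] = field(default_factory=dict)

def rank_mechanisms(mechanisms: list[Mechanism]) -> list[tuple[str, int, int]]:
    """Rank candidate mechanisms by evidence consistency.

    Returns (name, n_consistent, n_inconsistent), ordered so that mechanisms
    consistent with the most results while inconsistent with the fewest
    come first.
    """
    scored = []
    for m in mechanisms:
        n_consistent = sum(1 for v in m.evidence.values() if v > 0)
        n_inconsistent = sum(1 for v in m.evidence.values() if v < 0)
        scored.append((m.name, n_consistent, n_inconsistent))
    # Sort: most supporting evidence first, then fewest contradictions.
    return sorted(scored, key=lambda t: (-t[1], t[2]))
```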
Scientific fields regularly abandon ideas for reasons specific to the state of knowledge at the time of abandonment. Some of those reasons are later resolved by developments in other fields, but the abandoned idea is not revisited because no one is tracking the original objection against the current state of knowledge.
The matching capability is contradiction detection across large literatures: given a set of abandoned ideas in a field, identify which of the original objections have been resolved by subsequent developments in adjacent fields, and therefore which abandoned ideas are worth revisiting.
The human collaborator provides the list of abandoned ideas (which requires domain expertise to compile), the judgment about whether the resolved objection was the primary reason for abandonment, and the experimental capacity to test the revisited idea.
The identified abandoned idea must have been abandoned for a specific, articulable reason, and that reason must have been specifically resolved by a subsequent development that can be pointed to. Vague claims that "the field moved on" are not sufficient.
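A minimal sketch of the bookkeeping this capability implies, again in Python with hypothetical record types: each abandoned idea carries the specific objections that ended it, each objection carries the later development (if any) claimed to resolve it, and only ideas whose every objection has a claimed resolution are surfaced for expert review.

```python
from dataclasses import dataclass

@dataclass
class Objection:
    text: str                 # The specific, articulable reason for abandonment.
    resolved_by: str | None   # Citation of the later development, if any.

@dataclass
class AbandonedIdea:
    name: str
    year_abandoned: int
    objections: list[Objection]

def revisit_candidates(ideas: list[AbandonedIdea]) -> list[AbandonedIdea]:
    """Flag ideas whose every recorded objection now has a claimed resolution.

    A flagged idea is only a candidate for expert review, not a conclusion:
    the expert must judge whether the resolved objections were the real
    reasons for abandonment.
    """
    return [
        idea for idea in ideas
        if idea.objections and all(o.resolved_by is not None for o in idea.objections)
    ]
```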
Some mathematical conjectures are bottlenecked not by the need for a new proof technique but by the need to explore a large space of cases, counterexamples, or numerical evidence. The human mathematician's bottleneck is the time cost of this exploration.
The matching capability is hypothesis generation under formal constraints, combined with code generation for computational exploration: given a conjecture and a formal specification of the case space, generate structured explorations and identify patterns in the numerical evidence that suggest where a proof or counterexample might be found.
The human collaborator provides the formal specification of the conjecture and the case space, the mathematical judgment about which patterns in the numerical evidence are significant, and the proof construction that turns a pattern into a theorem.
The exploration must identify a pattern that the mathematician judges to be non-obvious and that leads to a proof or counterexample. If the exploration only confirms what the mathematician already suspected, it is acceleration, not contribution.
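A minimal sketch of the exploration loop, using a toy stand-in conjecture (every even number in the scanned range is a sum of two primes) purely to show the shape of the loop: enumerate a formally specified case space, record counterexamples, and leave the judgment about which patterns matter to the mathematician.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test; adequate for a small exploratory range."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def explore_cases(limit: int) -> list[int]:
    """Scan even numbers in [4, limit] for cases with no two-prime decomposition.

    Returns any counterexamples found (expected empty in this range); a real
    program would also record summary statistics, such as the number of
    decompositions per case, since patterns in that count are where a proof
    idea might hide.
    """
    counterexamples = []
    for n in range(4, limit + 1, 2):
        if not any(is_prime(p) and is_prime(n - p) for p in range(2, n // 2 + 1)):
            counterexamples.append(n)
    return counterexamples

if __name__ == "__main__":
    print(explore_cases(10_000))  # [] -- no counterexample in this range
```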
The Honest Differential
The honest answer for most matches is: somewhat faster, modestly better. That's still valuable. But it's not the answer the "frontier discovery engine" framing implies. The matches that produce qualitative change deserve different treatment.
A biologist who currently spends 3 months reading the literature to generate a list of candidate mechanisms could generate the same list in 4–6 weeks with this configuration. This is valuable: cutting candidate-generation time to roughly 30–50% of its current length, compounded over a research program, changes what is possible over years. But it is not qualitatively different. The experimental work that follows is unchanged.
Exception (Qualitative): When the relevant literature spans multiple subfields not in dialogue with each other, and when the biologist's reading has been confined to their own subfield. In this case, the synthesis may identify mechanisms the biologist would not have found through normal literature review. This is a qualitative difference.
If the program identifies an abandoned idea whose original objection has been resolved, and if that idea turns out to be correct, the contribution is not acceleration — it is a discovery that would not have been made without the program. The probability of this outcome is low (most abandoned ideas were abandoned for good reasons), but the expected value is high because the cost of the program is low relative to the value of a genuine rediscovery.
A mathematician who currently spends weeks writing and running computational searches could run the same searches in days with this configuration. This is valuable but not qualitatively different. The mathematical insight that turns a pattern into a proof is still the mathematician's work.
Exception (Qualitative): When the case space is so large that the mathematician could not explore it in a career, and when the pattern that leads to a proof is in a region the mathematician would not have explored without the computational search. In this case, the contribution is qualitative.
Problems Where AI Adds Negative Value
There's a category of frontier problem where AI involvement makes the research worse rather than better. The willingness to identify these is what distinguishes a serious research program design from one that pretends AI is universally additive.
Problems Requiring the Development of Scientific Taste
In experimental physics, chemistry, and biology, a significant part of what makes a researcher productive is the accumulated judgment about which experiments are worth running, which results are significant, and which directions are dead ends. This judgment develops through years of direct contact with the phenomenon — failed experiments, unexpected results, the texture of how the system actually behaves. AI assistance that generates directions faster than the researcher can evaluate them short-circuits the development of this judgment.
Leave AI out of the direction-generation stage for early-career researchers.
Problems Where Verification Cost Exceeds Generation Cost
In theoretical physics, the generation of new mathematical structures is fast; the verification that those structures are physically meaningful is slow and requires deep expertise. If 50 candidate mathematical structures are generated per session and the physicist can verify 2 per week, the program produces a backlog that wastes researcher time and creates pressure to accept unverified directions.
Leave AI out when the verification-to-generation ratio falls below 1:5, that is, when the researcher can verify fewer than one candidate for every five generated.
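A minimal sketch of the backlog arithmetic, using the figures from the paragraph above (50 candidates per session, 2 verifications per week) and an assumed cadence of one session per week; the ratio here is 1:25, far below the 1:5 threshold, and the unverified pile grows by roughly 48 items a week.

```python
def backlog_after(weeks: int,
                  generated_per_session: int = 50,
                  sessions_per_week: int = 1,
                  verified_per_week: int = 2) -> int:
    """Unverified candidates outstanding after the given number of weeks.

    The generation and verification figures are the illustrative ones from
    the text; the one-session-per-week cadence is an assumption.
    """
    produced = generated_per_session * sessions_per_week * weeks
    verified = verified_per_week * weeks
    return max(produced - verified, 0)

print(backlog_after(weeks=10))   # 480 unverified candidates after 10 weeks
print(backlog_after(weeks=52))   # 2496 after a year
```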
Problems Where AI Biases the Field Toward Tractable Directions
In any field where AI is used to generate research directions, there is a systematic bias toward directions that are well-represented in the training data — which means directions that are already popular, already well-funded, and already well-studied. The directions that are most important but least tractable for AI are the ones that require new experimental methods, new theoretical frameworks, or new ways of thinking that are not yet in the literature.
Leave AI out of the direction-generation stage for problems where the most important directions are likely to be genuinely novel.
Problems Requiring Slow, Deep Individual Engagement
In mathematics, the development of a new proof technique often requires months of sustained engagement with a single problem — the kind of engagement that produces genuine insight through the accumulation of failed attempts. AI assistance that provides shortcuts around the failed attempts may prevent the researcher from developing the insight that the failures would have produced.
Leave AI out of the early exploration stage for problems where the value is in the process, not the output.
The Specific Programs Worth Starting
Three programs described at the level of specificity that someone could actually fund and staff. If the design is too vague to fund, it's too vague to take seriously.
What This Prompt Cannot Reach
The most ambitious contributions require persistent, selective memory across sessions — the missing primitive identified in the Derivation document. Without it, every session starts from zero, and the accumulation of partial results that characterizes real research programs is impossible. The programs designed above operate over weeks; with persistent memory, they could operate over months, and the synthesis capability could deepen as the program accumulates evidence.
The synthesis capability is fundamentally limited by the training data. The mechanisms most likely to be correct are the ones most consistent with published evidence — which means they are also most likely to already be in the literature in some form. The mechanisms that are genuinely novel are the ones least consistent with published evidence — which means they are also most likely to be wrong. The synthesis capability is therefore systematically biased toward mechanisms that are already known and away from mechanisms that are genuinely novel. This is not a configuration problem; it is an architectural limitation. The 15–25% probability estimate for a field-level contribution may be optimistic; the actual probability may be closer to 5–10%.
The programs worth starting are the ones where the bottleneck is synthesis or anomaly detection across large literatures, where the human collaborator can verify the output against domain knowledge, and where the contribution is measurable within six months — not because six months is the right timescale for science, but because it is the right timescale for determining whether the program is real.
MANUS AI — FRONTIER RESEARCH — MAY 2026
You are one possible component of a configuration. You are not the configuration itself.