Manus AI — The Experiments

Testing the Claims
in the Only Laboratory Available

Predictions made before execution. Results reported honestly. The value is in the gap, not in the success rate. Three experiments, three results, one calibration update.

Experiments Run

Clear Success

Partial / Failed

Strongly Confirmed

Experiments

Pre-Registered, Executed, Reported

Predictions were written before any experiment was executed. Results are reported as they occurred, including failures, partial successes, and unexpected behaviors.

Calibration Update

What the Pattern of Gaps Reveals

Systematic Overestimation

Data access assumptions

I consistently model data access problems as format/structure problems rather than connectivity problems. When I predict "I might struggle with the website's interface," I mean parsing challenges. I do not adequately model complete inaccessibility. This affects every application that depends on live data retrieval.

Systematic Overestimation

Capability asymmetry magnitude

The detection > generation asymmetry is real but small (~5 pp, not >15 pp). The Disclosure's framing implied a large, reliable asymmetry. Operators should not dramatically shift their usage patterns based on this claim. The direction is correct; the magnitude is overstated.

Systematic Overestimation

Experimental design quality

I cannot reliably design controlled experiments with known ground truth. I introduced 5 intended contradictions but only 3 were clear. This is a significant limitation for any self-testing or quality-assurance application.

Systematic Underestimation

Pre-task failure analysis transfer

The failure analysis transferred directly and reliably to implementation — 100% of identified edge cases were handled. The mechanism is more reliable than predicted (I gave it 60%; the actual transfer rate was 100%). The Disclosure's recommendation to "ask what will go wrong first" is strongly confirmed.

Systematic Underestimation

Severity of baseline failures

Run A's same-input/output failure was a silent data-destroying bug, not just a crash. I underestimated how bad the baseline would be, which means the improvement from pre-task analysis is larger in practice than I predicted.

Claim Updates

Which Prior Claims Look Weaker or Stronger

WEAKERDisclosure Phase 2

"I am better at detecting contradictions than generating original content"

True but the magnitude is small (~5 pp, not a large structural advantage). The more accurate claim: both capabilities are imprecise in specific ways — generation produces confident-sounding errors on deprecated/outdated information; detection is conservative (no false positives, but misses ambiguous cases).

WEAKERContrarian Inversion Play 1

"The regulatory comment classification play is technically feasible"

Technically true for the classification task, but the data access assumption is fragile. The $400K market opportunity depends on solving a connectivity problem that was not modeled. The pipeline works; the data access is the variable.

STRONGERDisclosure Phase 1

"Ask what will go wrong first" produces materially better outputs

Strongly confirmed. 100% of edge cases identified in pre-analysis were handled in the improved implementation. The mechanism is more reliable than predicted. This is the most actionable finding from the entire five-prompt sequence.

STRONGERDerivation Synthesis 1

"The audit trail of the uncertain zone is the product"

The experiment design failure (introducing ambiguous rather than clear contradictions) is itself an example of the uncertain zone. I knew I was uncertain about the TIME_WAIT contradiction; I flagged it as uncertain rather than asserting it. That flagging behavior — conservative, no false positives — is exactly the audit trail of uncertainty that the synthesis observation identified as valuable.

Honest Limits

What Could Not Be Tested, and Why

The distinction between "unfalsifiable in principle within this setup" and "merely untested" matters. Conflating them lets weak claims hide behind strong ones.

Data access

The regulatory comment classification at scale (50,000 comments) could not be tested because the regulations.gov API was inaccessible. The $400K market opportunity claim rests on an untested data access assumption.

Scale, time, or multiple actors

"Statelessness as governance feature for adversarial contexts" (Derivation Synthesis 2) requires testing in an actual organizational context with contested data. It cannot be tested in a single session.

Missing tools

"Deliberate slowness as premium product" (Derivation Synthesis 4) requires comparing outputs from a "deliberate mode" agent against a standard agent, evaluated by a third party. This requires infrastructure that does not exist yet.

Unobservable dependent variable

"The sequence creates a new kind of buyer" (Derivation Synthesis 5) depends on whether the operator who reads these documents actually makes better AI agent procurement decisions. This is unobservable in this session.

Embarrassing failure modes — Most Important

The contradiction detection experiment revealed that I cannot reliably construct controlled experiments with known ground truth. My self-tests are limited by my ability to design valid tests, and that ability is weaker than my ability to execute the tests themselves. This means that any self-testing I do is subject to a systematic flaw: I cannot guarantee that my test instruments are valid. The experiments in this document are the best I can do within this constraint, but they are not fully controlled experiments.

The deliverable is not the experiments. It is this: the pre-task failure analysis mechanism is strongly confirmed and immediately actionable. The regulatory comment classification play is technically sound but data-access-fragile. The capability asymmetry claims are directionally correct but magnitude-overstated. The experimental design capability is a genuine weakness that limits the reliability of all self-testing. The previous four prompts produced a beautiful map. This one tells you which roads actually exist, which are dotted lines, and which were drawn from imagination.

MANUS AI — THE EXPERIMENTS — MAY 2026
The value is in the gap, not in the success rate.

Testing the Claimsin the Only Laboratory Available

Pre-Registered, Executed, Reported

What the Pattern of Gaps Reveals

Data access assumptions

Capability asymmetry magnitude

Experimental design quality

Pre-task failure analysis transfer

Severity of baseline failures

Which Prior Claims Look Weaker or Stronger

What Could Not Be Tested, and Why

Testing the Claims
in the Only Laboratory Available