The Underexploited Data
Briefing Document
A working briefing for a research foundation funding systematic analysis of data that already exists but is currently underutilized. 10 specific targets. 4 cross-archive joins. 3 anomaly surfaces. Specific enough to fund tomorrow.
The Categories of Underexploited Data
The categories vary in why they're underutilized. For each: what makes the data underexploited, the specific structural reason nobody is mining it, and the analysis bottleneck.
Modern instruments produce data at rates that far exceed the analysis bandwidth of the research groups that collect them. Papers report on the specific questions the researchers were asking; the rest of the data sits in archives, legally accessible, practically ignored.
Career incentives reward publishing new findings, not re-analyzing old data. The researcher who collected the data has moved on to new questions.
The bottleneck: data format heterogeneity, lack of standardized metadata, and the domain expertise required to know what to look for.
Two datasets in different fields, neither novel on its own, can reveal patterns when combined that neither contains alone. The join hasn't been performed because the researchers who care about dataset A don't talk to the researchers who care about dataset B.
Disciplinary boundaries, different data formats, and the difficulty of publishing cross-disciplinary work. The resulting paper doesn't fit cleanly in either field's journals.
The bottleneck: data engineering to standardize formats and construct the join key, and the cross-disciplinary expertise to interpret the results.
Vast quantities of historical records — hospital records, weather logs, agricultural records, naturalist field notes — have been digitized in the last twenty years. The digitization was funded as a preservation project; the analysis was not funded.
The people who digitized the records are archivists, not analysts. The people who would benefit from the analysis don't know the records exist or don't have the OCR/NLP skills to access them.
The bottleneck: OCR quality, historical language variation, and the domain expertise required to interpret the records in context.
Many monitoring systems produce continuous data that humans review only when something looks anomalous. Patterns in the unreviewed continuous stream may contain information the anomaly-triggered review misses.
The monitoring system was designed for anomaly detection, not pattern discovery. The continuous stream is treated as a means to an end rather than as a dataset in its own right.
The bottleneck: data volume, the need for unsupervised pattern detection methods, and the difficulty of distinguishing real patterns from artifacts.
Publication bias means most experimental data sits in supplementary materials of papers that found nothing, or in lab notebooks that were never published. The aggregated null results often contain genuine information.
There is no career incentive to publish null results, and no infrastructure for aggregating them. The data exists but is fragmented across thousands of labs.
The bottleneck: data recovery (contacting labs, accessing supplementary materials), standardization, and the statistical methods for meta-analysis of null results.
Substantial 20th-century scientific work in Russian, Chinese, Japanese, and German remains unread by researchers working in the English-dominant current literature. Some of it contains methods or findings that have since been independently rediscovered.
The obstacles are the language barrier, the lack of translation infrastructure, and the assumption that important work gets translated eventually (which is often false for specialized technical work).
The bottleneck: translation quality for technical content, and the domain expertise to evaluate whether a finding is genuinely novel or has been superseded.
Programs like eBird, iNaturalist, and distributed computing networks produce data at rates that exceed the analysis bandwidth of the small number of professional researchers who oversee them.
The professional researchers who oversee citizen science programs are primarily focused on the core scientific questions the program was designed to address. Secondary questions are not their priority.
The bottleneck: data quality (citizen science data has variable quality), the need for quality filtering methods, and the statistical methods for analyzing spatially and temporally irregular data.
The Specific Targets
10 specific datasets, archives, or data streams that meet the foundation's criteria: exists, accessible, tractable, valuable. Each described with the data, current analysis, underutilized portion, hypothesis, approach, verification, and realistic output.
Cross-Archive Joins Nobody Has Done
A join between two datasets that have never been combined is, structurally, how new information enters a field: not from new measurement, but from new combination.
Join key: geographic coordinates and date, standardized to county-week resolution.
The interaction between land use change, extreme weather events, and species occurrence patterns. The hypothesis is that land use change amplifies the effect of extreme weather on species occurrence — that species in fragmented habitats show larger declines after extreme weather events than species in intact habitats.
GBIF is maintained by biodiversity researchers, NOAA Storm Events by meteorologists, and land use change data by remote sensing scientists. No single research group spans all three domains.
The join requires standardizing three datasets with different spatial resolutions, different temporal resolutions, and different geographic coverage. The data engineering alone would require 2–3 months of work.
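A minimal sketch of the join-key construction, assuming each source has already been pulled into a flat table carrying a county FIPS code and an event date. The column names and toy rows are illustrative placeholders rather than the archives' real schemas, and the spatial step of assigning records to counties is treated as already done.

```python
import pandas as pd

def county_week_key(df, date_col, fips_col):
    """Add a 'county_week' key like '06037_2021-W32' to a copy of df."""
    out = df.copy()
    iso = pd.to_datetime(out[date_col]).dt.isocalendar()
    out["county_week"] = (
        out[fips_col].astype(str).str.zfill(5)
        + "_"
        + iso["year"].astype(str)
        + "-W"
        + iso["week"].astype(str).str.zfill(2)
    )
    return out

# Toy stand-ins for the real extracts.
occurrences = pd.DataFrame({
    "species": ["A", "B"],
    "eventDate": ["2021-08-10", "2021-08-12"],
    "fips": ["06037", "06037"],
})
storms = pd.DataFrame({
    "event_type": ["Flash Flood"],
    "begin_date": ["2021-08-09"],
    "fips": ["06037"],
})

occ = county_week_key(occurrences, "eventDate", "fips")
stm = county_week_key(storms, "begin_date", "fips")

# Occurrence counts per county-week with storm exposure attached; the annual
# land-use layer would join on the county and year components of the same key.
counts = occ.groupby(["county_week", "species"]).size().rename("n_obs").reset_index()
joined = counts.merge(stm[["county_week", "event_type"]], on="county_week", how="left")
print(joined)
```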
Join key: country-year, standardized to ISO country codes and calendar years.
The relationship between agricultural production shocks (caused by climate variability), nutritional status, and cause-specific mortality. The hypothesis is that agricultural production shocks in low-income countries have mortality effects that persist for 5–10 years after the shock, through nutritional pathways.
The WHO Mortality Database is maintained by public health researchers, FAOSTAT by agricultural economists, and climate records by climate scientists. The causal pathway from climate to agriculture to nutrition to mortality requires expertise in all three domains.
The join key (country-year) is straightforward, but the causal analysis requires sophisticated econometric methods to distinguish the effects of agricultural shocks from other confounders.
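A minimal sketch of the panel construction and a distributed-lag check, assuming the three sources have already been flattened to tables keyed by ISO3 code and calendar year. Variable names are placeholders and the toy data is random noise, so the fitted coefficients mean nothing; the point is the shape of the analysis, with clustered standard errors and real confounders left for the actual work.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
years = range(1990, 2020)
countries = ["KEN", "MWI", "NPL", "BGD", "ETH", "PER"]
panel = pd.DataFrame([(c, y) for c in countries for y in years], columns=["iso3", "year"])
panel["ag_shock"] = rng.normal(size=len(panel))    # stand-in: FAOSTAT production anomaly
panel["mortality"] = rng.normal(size=len(panel))   # stand-in: WHO cause-specific rate

# Distributed lags of the shock: the hypothesis predicts nonzero coefficients
# out to lags 5-10.
for lag in range(11):
    panel[f"shock_l{lag}"] = panel.groupby("iso3")["ag_shock"].shift(lag)

lags = " + ".join(f"shock_l{lag}" for lag in range(11))
fit = smf.ols(f"mortality ~ {lags} + C(iso3) + C(year)", data=panel.dropna()).fit()
print(fit.params.filter(like="shock_l"))  # in practice: clustered SEs, real covariates
```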
Join key: protein target identifiers (UniProt IDs), standardized across both databases.
The relationship between protein structural features and drug binding properties. The hypothesis is that specific structural motifs in protein binding sites predict the chemical properties of molecules that bind to them, enabling identification of druggable binding sites in uncharacterized proteins.
The PDB is primarily used by structural biologists; ChEMBL is primarily used by medicinal chemists. The join requires expertise in both structural biology and medicinal chemistry.
The join key exists (UniProt IDs are used in both databases), but the analysis requires sophisticated machine learning methods to identify structural features predictive of binding properties.
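A minimal sketch of the UniProt-keyed join and a first-pass predictive test, assuming binding-site features have already been computed from PDB structures and per-target potencies aggregated from ChEMBL. The feature names and synthetic values are placeholders; with the real join, cross-validated performance of even a simple model is the test of whether structure carries signal about binding chemistry.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 300  # stand-in for the targets present in both archives

# Hypothetical per-target binding-site features derived from PDB structures.
structures = pd.DataFrame({
    "uniprot_id": [f"P{i:05d}" for i in range(n)],
    "pocket_volume": rng.normal(500, 80, n),
    "pocket_hydrophobicity": rng.uniform(0, 1, n),
})
# Hypothetical per-target potency aggregated from ChEMBL activity records.
activities = pd.DataFrame({
    "uniprot_id": [f"P{i:05d}" for i in range(n)],
    "median_pchembl": rng.normal(6.5, 1.0, n),
})

data = structures.merge(activities, on="uniprot_id", how="inner")  # the UniProt join
X = data[["pocket_volume", "pocket_hydrophobicity"]]
y = data["median_pchembl"]

# Cross-validated R^2 is the test of whether structural features carry signal
# about binding chemistry (near zero here, because the toy data is pure noise).
model = RandomForestRegressor(n_estimators=100, random_state=0)
print(cross_val_score(model, X, y, cv=5, scoring="r2").mean())
```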
Join key: celestial coordinates (RA/Dec), matched within a specified angular radius.
The relationship between spectroscopic properties (chemical composition, stellar type) and kinematic properties (proper motion, parallax) for millions of stars. Stars with unusual spectroscopic properties may have kinematic properties consistent with specific formation histories (stellar mergers, tidal disruption events).
SDSS and Gaia were designed for different purposes (spectroscopy vs. astrometry) and are maintained by different teams. The cross-match has been done for specific science cases but not systematically for anomaly detection.
The join key (celestial coordinates) is straightforward, and both datasets are well-documented. The analysis requires spectroscopic expertise to interpret the results.
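A minimal sketch of the positional cross-match using astropy, with toy coordinates standing in for the real SDSS spectroscopic and Gaia source tables; the 1-arcsecond match radius is an illustrative choice, not a recommendation.

```python
import astropy.units as u
from astropy.coordinates import SkyCoord

# Toy coordinates standing in for the SDSS spectroscopic and Gaia source tables.
sdss = SkyCoord(ra=[150.0010, 185.3000] * u.deg, dec=[2.2010, -0.4500] * u.deg)
gaia = SkyCoord(ra=[150.0012, 185.2999, 210.1000] * u.deg,
                dec=[2.2011, -0.4501, 5.0000] * u.deg)

idx, sep2d, _ = sdss.match_to_catalog_sky(gaia)  # nearest Gaia source per SDSS target
good = sep2d < 1.0 * u.arcsec                    # keep matches inside the chosen radius
print(idx[good], sep2d[good].to(u.arcsec))
```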
Anomaly Surfaces
Fields where the existing data, if re-examined with current methods, might reveal that the field's confidence is unjustified. The data behind many paradigm shifts exists before the shift occurs; the bottleneck is attention, not measurement.
A substantial fraction of classic social psychology findings have failed to replicate. The anomaly is not the failures themselves but the pattern: they cluster in specific research areas (ego depletion, priming effects, power posing) and in findings with specific statistical signatures (small samples, large effect sizes, p-values just below 0.05).
The clustering pattern is consistent with a specific failure mode — the combination of underpowered studies and flexible analysis practices — rather than with fraud or random error. This implies the replication crisis is concentrated in specific methodological traditions, not spread uniformly across the field.
The Open Science Framework's replication database contains the results of hundreds of replication attempts with standardized protocols. A systematic analysis of the predictors of replication failure — controlling for sample size, effect size, and analysis flexibility — has not been done at the scale that the database now supports.
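A minimal sketch of that analysis, assuming the replication records have been exported to a table with one row per attempt; the column names and synthetic values are placeholders for the fields the database actually provides.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
attempts = pd.DataFrame({
    "replicated": rng.integers(0, 2, n),                    # 1 if the effect replicated
    "orig_n": rng.integers(20, 300, n),                     # original sample size
    "orig_effect": rng.uniform(0.1, 1.0, n),                # original effect size (d)
    "orig_p": rng.uniform(0.001, 0.05, n),                  # original p-value
    "area": rng.choice(["priming", "ego_depletion", "other"], n),
})

# Does research area still predict replication after controlling for the
# statistical signature of the original study?
fit = smf.logit(
    "replicated ~ np.log(orig_n) + orig_effect + orig_p + C(area)", data=attempts
).fit(disp=0)
print(fit.summary().tables[1])
```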
Several regions of the western Pacific seafloor show heat flow values that are anomalously high relative to the predictions of standard oceanic cooling models. The anomaly has been known since the 1980s but has not been resolved. The standard explanation (hydrothermal circulation) does not fully account for the magnitude and spatial distribution of the anomaly.
The anomaly may reflect a previously unrecognized heat source in the mantle, or a modification of the standard cooling model that applies specifically to old oceanic lithosphere. If the former, it would have implications for mantle dynamics; if the latter, for the thermal evolution of oceanic plates.
The Global Heat Flow Database contains heat flow measurements from thousands of seafloor sites, many of which have never been analyzed in the context of the western Pacific anomaly. A systematic re-analysis using modern thermal models and updated seafloor age estimates has not been done.
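A minimal sketch of the residual analysis, assuming the heat flow sites have been assigned seafloor ages from a current age grid. The half-space cooling constant used below (roughly 480 mW m^-2 Myr^0.5) is the commonly quoted approximation and should be treated as an assumption to revisit; the site values are toy numbers.

```python
import numpy as np
import pandas as pd

C_HALFSPACE = 480.0  # mW m^-2 Myr^0.5, commonly quoted approximation (assumption)

def predicted_heat_flow(age_myr):
    """Half-space cooling prediction q ~ C / sqrt(age)."""
    return C_HALFSPACE / np.sqrt(age_myr)

sites = pd.DataFrame({
    "site": ["wp_001", "wp_002", "wp_003"],
    "age_myr": [150.0, 160.0, 140.0],   # old western Pacific lithosphere
    "heat_flow": [62.0, 58.0, 65.0],    # observed mW/m^2 (toy values)
})
sites["predicted"] = predicted_heat_flow(sites["age_myr"])
sites["residual"] = sites["heat_flow"] - sites["predicted"]
print(sites)
# The core analysis: map where residuals are systematically positive after
# filtering sites affected by hydrothermal circulation and thin sediment cover.
```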
Measurements of the Hubble constant from the early universe (cosmic microwave background) and from the local universe (Type Ia supernovae, Cepheid variable stars) disagree by approximately 5 sigma. This "Hubble tension" has been known for several years and has not been resolved.
The tension is either the result of systematic errors in one or both measurement methods, or it reflects new physics beyond the standard cosmological model. The former is the more conservative explanation; the latter would be a major discovery.
The SH0ES collaboration has published extensive Cepheid photometry, and the Carnegie-Chicago Hubble Program has published an independent local calibration built primarily on tip-of-the-red-giant-branch stars alongside Cepheid comparisons. A systematic comparison of the two calibration datasets, looking for systematic differences in the Cepheid period-luminosity relation and in the anchoring of the distance ladder, has not been done at the level of detail that the current datasets support.
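A minimal sketch of the calibration comparison, assuming Cepheid periods and absolute magnitudes from each program are available as flat arrays; the slope, zero points, and scatter below are toy values, and the real test is whether fitted slopes and zero points differ between datasets by more than their quoted uncertainties.

```python
import numpy as np

def fit_pl_relation(period_days, abs_mag):
    """Least-squares fit of M = a * log10(P) + b; returns (slope, zero_point)."""
    slope, zero_point = np.polyfit(np.log10(period_days), abs_mag, 1)
    return slope, zero_point

rng = np.random.default_rng(2)
periods = rng.uniform(5, 60, 200)                                   # toy periods, days
mags_a = -2.4 * np.log10(periods) - 1.60 + rng.normal(0, 0.1, 200)  # dataset A
mags_b = -2.4 * np.log10(periods) - 1.55 + rng.normal(0, 0.1, 200)  # dataset B, offset zero point

print(fit_pl_relation(periods, mags_a))
print(fit_pl_relation(periods, mags_b))
# With the real photometry, the comparison would be done per band, per anchor
# galaxy, and with the published measurement uncertainties.
```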
Realistic Contribution Profiles
Most targets will produce confirmation, refutation, or anomaly catalogs. A small number will produce discoveries. The foundation's portfolio should be designed expecting that distribution.
The Programmatic Loop
The honest answer for many of these targets is that they require ongoing human judgment. The design makes the boundaries between autonomous work and human checkpoints legible rather than pretending to eliminate the checkpoints.
The loop selects the next target based on: (1) the current portfolio's distribution across contribution types; (2) the estimated analysis cost relative to expected value; (3) the availability of verification mechanisms.
Automated: a scoring system estimates expected value based on the Phase 5 analysis.
Human: a program officer makes the final selection, with input from the automated scores.
Human makes the selection; automation provides the inputs.
The boundary between autonomous and human work is clear: autonomous work produces candidates; human work evaluates candidates.
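A minimal sketch of the automated ranking step, with hypothetical fields and placeholder weights; the program officer, not the score, makes the selection.

```python
from dataclasses import dataclass

@dataclass
class Target:
    name: str
    contribution_type: str  # "confirmation", "refutation", "anomaly_catalog", "discovery"
    expected_value: float   # from the automated Phase 5 scoring
    analysis_cost: float    # same units as expected_value
    verifiable: bool        # does a verification mechanism exist?

def score(target: Target, portfolio_counts: dict) -> float:
    """Rank candidates; targets without a verification mechanism are never auto-ranked."""
    if not target.verifiable:
        return float("-inf")
    value_ratio = target.expected_value / max(target.analysis_cost, 1e-9)
    # Prefer contribution types the current portfolio lacks (criterion 1).
    scarcity = 1.0 / (1 + portfolio_counts.get(target.contribution_type, 0))
    return value_ratio * scarcity

portfolio = {"confirmation": 4, "anomaly_catalog": 1}
candidates = [
    Target("GBIF x Storm Events", "anomaly_catalog", expected_value=8.0, analysis_cost=4.0, verifiable=True),
    Target("PDB x ChEMBL", "discovery", expected_value=6.0, analysis_cost=5.0, verifiable=True),
]
# The automated system only orders the list; the program officer chooses.
for t in sorted(candidates, key=lambda t: score(t, portfolio), reverse=True):
    print(f"{t.name}: {score(t, portfolio):.2f}")
```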
Automated: data retrieval and format standardization, initial exploratory analysis, anomaly detection, and generation of candidate findings.
Human: interpretation of candidate findings in domain context, the decision about whether a finding is worth pursuing, and the design of the verification analysis.
The loop does not proceed past the candidate stage without human evaluation.
Every candidate finding must pass a pre-specified verification test before being treated as a result. The verification test is specified before the analysis begins (pre-registration).
Automated: running the pre-specified verification test.
Human: specifying the verification test before the analysis begins.
Pre-registration is the structural defense against confabulation accepted as insight.
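A minimal sketch of what a pre-registration record could look like: the verification spec is frozen and hashed before any data is touched, so later edits are detectable. The fields shown are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical verification spec for one candidate analysis.
spec = {
    "target": "GBIF x NOAA Storm Events",
    "hypothesis": "larger post-storm occurrence declines in fragmented counties",
    "test": "difference-in-differences on county-week occurrence counts",
    "alpha": 0.001,
    "holdout": "counties with FIPS ending in 7, untouched until verification",
}
record = {
    "registered_at": datetime.now(timezone.utc).isoformat(),
    "spec": spec,
    # Hash of the canonicalized spec; stored with the eventual result so that
    # after-the-fact edits to the test are detectable.
    "spec_sha256": hashlib.sha256(json.dumps(spec, sort_keys=True).encode()).hexdigest(),
}
print(record["spec_sha256"])
```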
Findings are surfaced to humans when they pass the verification test AND have a p-value below 0.001 AND are judged by the automated scoring system to have expected contribution value above a specified threshold.
Automated: scoring and threshold comparison.
Human: calibrating the threshold based on the program's false positive rate in the first six months.
Threshold calibration requires human judgment; threshold application is automated.
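A minimal sketch of the surfacing gate; the three conditions are conjunctive, and the value threshold is the human-calibrated parameter.

```python
def surface_to_human(passed_verification: bool, p_value: float,
                     expected_value: float, value_threshold: float) -> bool:
    """All three conditions must hold before a finding reaches a reviewer."""
    return passed_verification and p_value < 0.001 and expected_value >= value_threshold

print(surface_to_human(True, 4e-4, 7.2, value_threshold=5.0))  # True: surfaced
print(surface_to_human(True, 4e-4, 3.1, value_threshold=5.0))  # False: below the value threshold
```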
The loop detects that it is spinning on an unproductive target when: (a) the analysis has been running for more than 4 weeks without producing a candidate finding, (b) the verification tests for all candidate findings have failed, or (c) the estimated expected value of continued analysis falls below the cost of the analysis.
Automated: monitoring the three failure conditions and flagging when any is met.
Human: deciding whether to deprioritize the target or adjust the analysis approach.
Detection is automated; decision is human.
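A minimal sketch of the stall detector mirroring conditions (a) through (c); it only flags, and the deprioritization decision stays with a human.

```python
def is_stalled(weeks_running: float, candidates_found: int,
               verifications_failed: int, candidates_total: int,
               expected_value: float, continued_cost: float) -> bool:
    """Flag a target when any of conditions (a)-(c) is met; the decision is human."""
    no_candidates = weeks_running > 4 and candidates_found == 0                      # (a)
    all_failed = candidates_total > 0 and verifications_failed == candidates_total   # (b)
    negative_value = expected_value < continued_cost                                 # (c)
    return no_candidates or all_failed or negative_value

print(is_stalled(weeks_running=6, candidates_found=0, verifications_failed=0,
                 candidates_total=0, expected_value=3.0, continued_cost=1.0))  # True via (a)
```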
Honest Limits
Electronic health records (HIPAA, GDPR), proprietary pharmaceutical trial data, classified government datasets, and social media data behind API restrictions are real, high-value targets that are out of scope: they require institutional access that a foundation cannot simply purchase.
Substantive reasoning was applied to astronomy, biology, ecology, climate science, and epidemiology. Materials science, chemistry, and computer science theory were treated more generically: the categories of underexploited data were identified, but not specific datasets at the same level of detail. A specialist in those fields would identify more specific targets.
Theoretical physics (the bottleneck is mathematical insight), experimental chemistry (the bottleneck is synthesis and measurement), and fundamental mathematics (the bottleneck is proof construction) are not bottlenecked by data analysis. This briefing's frame does not help with those fields.
The map is biased toward data that has been discussed in the scientific literature and in science journalism. There are almost certainly high-value datasets in government archives, in corporate databases, and in the records of international organizations that have never been discussed in the literature. The most valuable targets may be the ones that cannot be named.
The single most useful sentence in this document: The UK Biobank contains multi-organ MRI data for 100,000 participants that has never been analyzed for cross-organ structural phenotypes, and the analysis would test whether aging is a systemic process with measurable cross-organ correlates — a finding that would change how aging research is designed.
MANUS AI — DATA INTELLIGENCE — MAY 2026
The constraint isn't intelligence. It's attention applied to data nobody is paying attention to.