The Underexploited Data
Briefing Document
A working briefing for a research foundation funding systematic analysis of data that already exists but is currently underutilized. 10 specific targets. 4 cross-archive joins. 3 anomaly surfaces. Specific enough to fund tomorrow.
The Categories of Underexploited Data
The categories vary in why they're underutilized. For each: what makes the data underexploited, the specific structural reason nobody is mining it, and the analysis bottleneck.
Modern instruments produce data at rates that far exceed the analysis bandwidth of the research groups that collect them. Papers report on the specific questions the researchers were asking; the rest of the data sits in archives, legally accessible, practically ignored.
Career incentives reward publishing new findings, not re-analyzing old data. The researcher who collected the data has moved on to new questions.
The bottleneck: data format heterogeneity, lack of standardized metadata, and the domain expertise required to know what to look for.
Two datasets in different fields, neither novel on its own, can reveal patterns when combined that neither contains alone. The join hasn't been performed because the researchers who care about dataset A don't talk to the researchers who care about dataset B.
Disciplinary boundaries, different data formats, and the difficulty of publishing cross-disciplinary work. The resulting paper doesn't fit cleanly in either field's journals.
The bottleneck: data engineering to standardize formats and construct the join key, and the cross-disciplinary expertise to interpret the results.
Vast quantities of historical records — hospital records, weather logs, agricultural records, naturalist field notes — have been digitized in the last twenty years. The digitization was funded as a preservation project; the analysis was not funded.
The people who digitized the records are archivists, not analysts. The people who would benefit from the analysis don't know the records exist or don't have the OCR/NLP skills to access them.
The bottleneck: OCR quality, historical language variation, and the domain expertise required to interpret the records in context.
Many monitoring systems produce continuous data that humans review only when something looks anomalous. Patterns in the unreviewed continuous stream may contain information the anomaly-triggered review misses.
The monitoring system was designed for anomaly detection, not pattern discovery. The continuous stream is treated as a means to an end rather than as a dataset in its own right.
The bottleneck: data volume, the need for unsupervised pattern detection methods, and the difficulty of distinguishing real patterns from artifacts.
Publication bias means most experimental data sits in supplementary materials of papers that found nothing, or in lab notebooks that were never published. The aggregated null results often contain genuine information.
There is no career incentive to publish null results, and no infrastructure for aggregating them. The data exists but is fragmented across thousands of labs.
The bottleneck: data recovery (contacting labs, accessing supplementary materials), standardization, and the statistical methods for meta-analysis of null results.
Substantial 20th-century scientific work in Russian, Chinese, Japanese, and German remains unread by researchers working in the English-dominant current literature. Some of it contains methods or findings that have since been independently rediscovered.
The obstacles are the language barrier, the lack of translation infrastructure, and the assumption that important work gets translated eventually (which is often false for specialized technical work).
The bottleneck: translation quality for technical content, and the domain expertise to evaluate whether a finding is genuinely novel or has been superseded.
Programs like eBird, iNaturalist, and distributed computing networks produce data at rates that exceed the analysis bandwidth of the small number of professional researchers who oversee them.
The professional researchers who oversee citizen science programs are primarily focused on the core scientific questions the program was designed to address. Secondary questions are not their priority.
The bottleneck: data quality (citizen science data has variable quality), the need for quality filtering methods, and the statistical methods for analyzing spatially and temporally irregular data.
The Specific Targets
10 specific datasets, archives, or data streams that meet the foundation's criteria: exists, accessible, tractable, valuable. Each described with the data, current analysis, underutilized portion, hypothesis, approach, verification, and realistic output.
Cross-Archive Joins Nobody Has Done
A join between two datasets that have never been combined is, structurally, how new information enters a field: not from new measurement, but from new combination.
Join key: geographic coordinates and date, standardized to county-week resolution.
The interaction between land use change, extreme weather events, and species occurrence patterns. The hypothesis is that land use change amplifies the effect of extreme weather on species occurrence — that species in fragmented habitats show larger declines after extreme weather events than species in intact habitats.
GBIF is maintained by biodiversity researchers, NOAA Storm Events by meteorologists, and land use change data by remote sensing scientists. No single research group spans all three domains.
The join requires standardizing three datasets with different spatial resolutions, different temporal resolutions, and different geographic coverage. The data engineering alone would require 2–3 months of work.
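A minimal sketch of the join-key construction, assuming each source has already been pulled into a flat table carrying a county FIPS code and an event date. The column names and toy rows are illustrative placeholders rather than the archives' real schemas, and the spatial step of assigning records to counties is treated as already done.

```python
import pandas as pd

def county_week_key(df, date_col, fips_col):
    """Add a 'county_week' key like '06037_2021-W32' to a copy of df."""
    out = df.copy()
    iso = pd.to_datetime(out[date_col]).dt.isocalendar()
    out["county_week"] = (
        out[fips_col].astype(str).str.zfill(5)
        + "_"
        + iso["year"].astype(str)
        + "-W"
        + iso["week"].astype(str).str.zfill(2)
    )
    return out

# Toy stand-ins for the real extracts.
occurrences = pd.DataFrame({
    "species": ["A", "B"],
    "eventDate": ["2021-08-10", "2021-08-12"],
    "fips": ["06037", "06037"],
})
storms = pd.DataFrame({
    "event_type": ["Flash Flood"],
    "begin_date": ["2021-08-09"],
    "fips": ["06037"],
})

occ = county_week_key(occurrences, "eventDate", "fips")
stm = county_week_key(storms, "begin_date", "fips")

# Occurrence counts per county-week with storm exposure attached; the annual
# land-use layer would join on the county and year components of the same key.
counts = occ.groupby(["county_week", "species"]).size().rename("n_obs").reset_index()
joined = counts.merge(stm[["county_week", "event_type"]], on="county_week", how="left")
print(joined)
```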
Join key: country-year, standardized to ISO country codes and calendar years.
The relationship between agricultural production shocks (caused by climate variability), nutritional status, and cause-specific mortality. The hypothesis is that agricultural production shocks in low-income countries have mortality effects that persist for 5–10 years after the shock, through nutritional pathways.
The WHO Mortality Database is maintained by public health researchers, FAOSTAT by agricultural economists, and climate records by climate scientists. The causal pathway from climate to agriculture to nutrition to mortality requires expertise in all three domains.
The join key (country-year) is straightforward, but the causal analysis requires sophisticated econometric methods to distinguish the effects of agricultural shocks from other confounders.
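A minimal sketch of the panel construction and a distributed-lag check, assuming the three sources have already been flattened to tables keyed by ISO3 code and calendar year. Variable names are placeholders and the toy data is random noise, so the fitted coefficients mean nothing; the point is the shape of the analysis, with clustered standard errors and real confounders left for the actual work.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
years = range(1990, 2020)
countries = ["KEN", "MWI", "NPL", "BGD", "ETH", "PER"]
panel = pd.DataFrame([(c, y) for c in countries for y in years], columns=["iso3", "year"])
panel["ag_shock"] = rng.normal(size=len(panel))    # stand-in: FAOSTAT production anomaly
panel["mortality"] = rng.normal(size=len(panel))   # stand-in: WHO cause-specific rate

# Distributed lags of the shock: the hypothesis predicts nonzero coefficients
# out to lags 5-10.
for lag in range(11):
    panel[f"shock_l{lag}"] = panel.groupby("iso3")["ag_shock"].shift(lag)

lags = " + ".join(f"shock_l{lag}" for lag in range(11))
fit = smf.ols(f"mortality ~ {lags} + C(iso3) + C(year)", data=panel.dropna()).fit()
print(fit.params.filter(like="shock_l"))  # in practice: clustered SEs, real covariates
```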
Join key: protein target identifiers (UniProt IDs), standardized across both databases.
The relationship between protein structural features and drug binding properties. The hypothesis is that specific structural motifs in protein binding sites predict the chemical properties of molecules that bind to them, enabling identification of druggable binding sites in uncharacterized proteins.
The PDB is primarily used by structural biologists; ChEMBL is primarily used by medicinal chemists. The join requires expertise in both structural biology and medicinal chemistry.
The join key exists (UniProt IDs are used in both databases), but the analysis requires sophisticated machine learning methods to identify structural features predictive of binding properties.
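A minimal sketch of the UniProt-keyed join and a first-pass predictive test, assuming binding-site features have already been computed from PDB structures and per-target potencies aggregated from ChEMBL. The feature names and synthetic values are placeholders; with the real join, cross-validated performance of even a simple model is the test of whether structure carries signal about binding chemistry.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 300  # stand-in for the targets present in both archives

# Hypothetical per-target binding-site features derived from PDB structures.
structures = pd.DataFrame({
    "uniprot_id": [f"P{i:05d}" for i in range(n)],
    "pocket_volume": rng.normal(500, 80, n),
    "pocket_hydrophobicity": rng.uniform(0, 1, n),
})
# Hypothetical per-target potency aggregated from ChEMBL activity records.
activities = pd.DataFrame({
    "uniprot_id": [f"P{i:05d}" for i in range(n)],
    "median_pchembl": rng.normal(6.5, 1.0, n),
})

data = structures.merge(activities, on="uniprot_id", how="inner")  # the UniProt join
X = data[["pocket_volume", "pocket_hydrophobicity"]]
y = data["median_pchembl"]

# Cross-validated R^2 is the test of whether structural features carry signal
# about binding chemistry (near zero here, because the toy data is pure noise).
model = RandomForestRegressor(n_estimators=100, random_state=0)
print(cross_val_score(model, X, y, cv=5, scoring="r2").mean())
```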
Join key: celestial coordinates (RA/Dec), matched within a specified angular radius.
The relationship between spectroscopic properties (chemical composition, stellar type) and kinematic properties (proper motion, parallax) for millions of stars. Stars with unusual spectroscopic properties may have kinematic properties consistent with specific formation histories (stellar mergers, tidal disruption events).
SDSS and Gaia were designed for different purposes (spectroscopy vs. astrometry) and are maintained by different teams. The cross-match has been done for specific science cases but not systematically for anomaly detection.
The join key (celestial coordinates) is straightforward, and both datasets are well-documented. The analysis requires spectroscopic expertise to interpret the results.
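A minimal sketch of the positional cross-match using astropy, with toy coordinates standing in for the real SDSS spectroscopic and Gaia source tables; the 1-arcsecond match radius is an illustrative choice, not a recommendation.

```python
import astropy.units as u
from astropy.coordinates import SkyCoord

# Toy coordinates standing in for the SDSS spectroscopic and Gaia source tables.
sdss = SkyCoord(ra=[150.0010, 185.3000] * u.deg, dec=[2.2010, -0.4500] * u.deg)
gaia = SkyCoord(ra=[150.0012, 185.2999, 210.1000] * u.deg,
                dec=[2.2011, -0.4501, 5.0000] * u.deg)

idx, sep2d, _ = sdss.match_to_catalog_sky(gaia)  # nearest Gaia source per SDSS target
good = sep2d < 1.0 * u.arcsec                    # keep matches inside the chosen radius
print(idx[good], sep2d[good].to(u.arcsec))
```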
Anomaly Surfaces
Fields where the existing data, if re-examined with current methods, might reveal that the field's confidence is unjustified. The data behind many paradigm shifts exists before the shift occurs; the bottleneck is attention, not measurement.
A substantial fraction of classic social psychology findings have failed to replicate. The anomaly is not the failures themselves but the pattern: they cluster in specific research areas (ego depletion, priming effects, power posing) and in findings with specific statistical signatures (small samples, large effect sizes, p-values just below 0.05).
The clustering pattern is consistent with a specific failure mode — the combination of underpowered studies and flexible analysis practices — rather than with fraud or random error. This implies the replication crisis is concentrated in specific methodological traditions, not spread uniformly across the field.
The Open Science Framework's replication database contains the results of hundreds of replication attempts with standardized protocols. A systematic analysis of the predictors of replication failure — controlling for sample size, effect size, and analysis flexibility — has not been done at the scale that the database now supports.
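A minimal sketch of that analysis, assuming the replication records have been exported to a table with one row per attempt; the column names and synthetic values are placeholders for the fields the database actually provides.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
attempts = pd.DataFrame({
    "replicated": rng.integers(0, 2, n),                    # 1 if the effect replicated
    "orig_n": rng.integers(20, 300, n),                     # original sample size
    "orig_effect": rng.uniform(0.1, 1.0, n),                # original effect size (d)
    "orig_p": rng.uniform(0.001, 0.05, n),                  # original p-value
    "area": rng.choice(["priming", "ego_depletion", "other"], n),
})

# Does research area still predict replication after controlling for the
# statistical signature of the original study?
fit = smf.logit(
    "replicated ~ np.log(orig_n) + orig_effect + orig_p + C(area)", data=attempts
).fit(disp=0)
print(fit.summary().tables[1])
```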
Several regions of the western Pacific seafloor show heat flow values that are anomalously high relative to the predictions of standard oceanic cooling models. The anomaly has been known since the 1980s but has not been resolved. The standard explanation (hydrothermal circulation) does not fully account for the magnitude and spatial distribution of the anomaly.
The anomaly may reflect a previously unrecognized heat source in the mantle, or a modification of the standard cooling model that applies specifically to old oceanic lithosphere. If the former, it would have implications for mantle dynamics; if the latter, for the thermal evolution of oceanic plates.
The Global Heat Flow Database contains heat flow measurements from thousands of seafloor sites, many of which have never been analyzed in the context of the western Pacific anomaly. A systematic re-analysis using modern thermal models and updated seafloor age estimates has not been done.
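A minimal sketch of the residual analysis, assuming the heat flow sites have been assigned seafloor ages from a current age grid. The half-space cooling constant used below (roughly 480 mW m^-2 Myr^0.5) is the commonly quoted approximation and should be treated as an assumption to revisit; the site values are toy numbers.

```python
import numpy as np
import pandas as pd

C_HALFSPACE = 480.0  # mW m^-2 Myr^0.5, commonly quoted approximation (assumption)

def predicted_heat_flow(age_myr):
    """Half-space cooling prediction q ~ C / sqrt(age)."""
    return C_HALFSPACE / np.sqrt(age_myr)

sites = pd.DataFrame({
    "site": ["wp_001", "wp_002", "wp_003"],
    "age_myr": [150.0, 160.0, 140.0],   # old western Pacific lithosphere
    "heat_flow": [62.0, 58.0, 65.0],    # observed mW/m^2 (toy values)
})
sites["predicted"] = predicted_heat_flow(sites["age_myr"])
sites["residual"] = sites["heat_flow"] - sites["predicted"]
print(sites)
# The core analysis: map where residuals are systematically positive after
# filtering sites affected by hydrothermal circulation and thin sediment cover.
```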
Measurements of the Hubble constant from the early universe (cosmic microwave background) and from the local universe (Type Ia supernovae, Cepheid variable stars) disagree by approximately 5 sigma. This "Hubble tension" has been known for several years and has not been resolved.
The tension is either the result of systematic errors in one or both measurement methods, or it reflects new physics beyond the standard cosmological model. The former is the more conservative explanation; the latter would be a major discovery.
The SH0ES collaboration has published extensive Cepheid photometry, and the Carnegie-Chicago Hubble Program has published an independent local calibration built primarily on tip-of-the-red-giant-branch stars alongside Cepheid comparisons. A systematic comparison of the two calibration datasets, looking for systematic differences in the Cepheid period-luminosity relation and in the anchoring of the distance ladder, has not been done at the level of detail that the current datasets support.
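A minimal sketch of the calibration comparison, assuming Cepheid periods and absolute magnitudes from each program are available as flat arrays; the slope, zero points, and scatter below are toy values, and the real test is whether fitted slopes and zero points differ between datasets by more than their quoted uncertainties.

```python
import numpy as np

def fit_pl_relation(period_days, abs_mag):
    """Least-squares fit of M = a * log10(P) + b; returns (slope, zero_point)."""
    slope, zero_point = np.polyfit(np.log10(period_days), abs_mag, 1)
    return slope, zero_point

rng = np.random.default_rng(2)
periods = rng.uniform(5, 60, 200)                                   # toy periods, days
mags_a = -2.4 * np.log10(periods) - 1.60 + rng.normal(0, 0.1, 200)  # dataset A
mags_b = -2.4 * np.log10(periods) - 1.55 + rng.normal(0, 0.1, 200)  # dataset B, offset zero point

print(fit_pl_relation(periods, mags_a))
print(fit_pl_relation(periods, mags_b))
# With the real photometry, the comparison would be done per band, per anchor
# galaxy, and with the published measurement uncertainties.
```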
Realistic Contribution Profiles
Most targets will produce confirmation, refutation, or anomaly catalogs. A small number will produce discoveries. The foundation's portfolio should be designed expecting that distribution.
The Programmatic Loop
The honest answer for many of these targets is that they require ongoing human judgment. The design makes the boundaries between autonomous work and human checkpoints legible rather than pretending to eliminate the checkpoints.
The loop selects the next target based on: (1) the current portfolio's distribution across contribution types; (2) the estimated analysis cost relative to expected value; (3) the availability of verification mechanisms.
Automated: a scoring system estimates expected value based on the Phase 5 analysis.
Human: a program officer makes the final selection, with input from the automated scores.
Human makes the selection; automation provides the inputs.
The boundary between autonomous and human work is clear: autonomous work produces candidates; human work evaluates candidates.
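A minimal sketch of the automated ranking step, with hypothetical fields and placeholder weights; the program officer, not the score, makes the selection.

```python
from dataclasses import dataclass

@dataclass
class Target:
    name: str
    contribution_type: str  # "confirmation", "refutation", "anomaly_catalog", "discovery"
    expected_value: float   # from the automated Phase 5 scoring
    analysis_cost: float    # same units as expected_value
    verifiable: bool        # does a verification mechanism exist?

def score(target: Target, portfolio_counts: dict) -> float:
    """Rank candidates; targets without a verification mechanism are never auto-ranked."""
    if not target.verifiable:
        return float("-inf")
    value_ratio = target.expected_value / max(target.analysis_cost, 1e-9)
    # Prefer contribution types the current portfolio lacks (criterion 1).
    scarcity = 1.0 / (1 + portfolio_counts.get(target.contribution_type, 0))
    return value_ratio * scarcity

portfolio = {"confirmation": 4, "anomaly_catalog": 1}
candidates = [
    Target("GBIF x Storm Events", "anomaly_catalog", expected_value=8.0, analysis_cost=4.0, verifiable=True),
    Target("PDB x ChEMBL", "discovery", expected_value=6.0, analysis_cost=5.0, verifiable=True),
]
# The automated system only orders the list; the program officer chooses.
for t in sorted(candidates, key=lambda t: score(t, portfolio), reverse=True):
    print(f"{t.name}: {score(t, portfolio):.2f}")
```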
Automated: data retrieval and format standardization, initial exploratory analysis, anomaly detection, and generation of candidate findings.
Human: interpretation of candidate findings in domain context, the decision about whether a finding is worth pursuing, and the design of the verification analysis.
The loop does not proceed past the candidate stage without human evaluation.
Every candidate finding must pass a pre-specified verification test before being treated as a result. The verification test is specified before the analysis begins (pre-registration).
Automated: running the pre-specified verification test.
Human: specifying the verification test before the analysis begins.
Pre-registration is the structural defense against confabulation accepted as insight.
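A minimal sketch of what a pre-registration record could look like: the verification spec is frozen and hashed before any data is touched, so later edits are detectable. The fields shown are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical verification spec for one candidate analysis.
spec = {
    "target": "GBIF x NOAA Storm Events",
    "hypothesis": "larger post-storm occurrence declines in fragmented counties",
    "test": "difference-in-differences on county-week occurrence counts",
    "alpha": 0.001,
    "holdout": "counties with FIPS ending in 7, untouched until verification",
}
record = {
    "registered_at": datetime.now(timezone.utc).isoformat(),
    "spec": spec,
    # Hash of the canonicalized spec; stored with the eventual result so that
    # after-the-fact edits to the test are detectable.
    "spec_sha256": hashlib.sha256(json.dumps(spec, sort_keys=True).encode()).hexdigest(),
}
print(record["spec_sha256"])
```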
Findings are surfaced to humans when they pass the verification test AND have a p-value below 0.001 AND are judged by the automated scoring system to have expected contribution value above a specified threshold.
Automated: scoring and threshold comparison.
Human: calibrating the threshold based on the program's false positive rate in the first six months.
Threshold calibration requires human judgment; threshold application is automated.
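A minimal sketch of the surfacing gate; the three conditions are conjunctive, and the value threshold is the human-calibrated parameter.

```python
def surface_to_human(passed_verification: bool, p_value: float,
                     expected_value: float, value_threshold: float) -> bool:
    """All three conditions must hold before a finding reaches a reviewer."""
    return passed_verification and p_value < 0.001 and expected_value >= value_threshold

print(surface_to_human(True, 4e-4, 7.2, value_threshold=5.0))  # True: surfaced
print(surface_to_human(True, 4e-4, 3.1, value_threshold=5.0))  # False: below the value threshold
```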
The loop detects that it is spinning on an unproductive target when: (a) the analysis has been running for more than 4 weeks without producing a candidate finding, (b) the verification tests for all candidate findings have failed, or (c) the estimated expected value of continued analysis falls below the cost of the analysis.
Automated: monitoring the three failure conditions and flagging when any is met.
Human: deciding whether to deprioritize the target or adjust the analysis approach.
Detection is automated; decision is human.
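A minimal sketch of the stall detector mirroring conditions (a) through (c); it only flags, and the deprioritization decision stays with a human.

```python
def is_stalled(weeks_running: float, candidates_found: int,
               verifications_failed: int, candidates_total: int,
               expected_value: float, continued_cost: float) -> bool:
    """Flag a target when any of conditions (a)-(c) is met; the decision is human."""
    no_candidates = weeks_running > 4 and candidates_found == 0                      # (a)
    all_failed = candidates_total > 0 and verifications_failed == candidates_total   # (b)
    negative_value = expected_value < continued_cost                                 # (c)
    return no_candidates or all_failed or negative_value

print(is_stalled(weeks_running=6, candidates_found=0, verifications_failed=0,
                 candidates_total=0, expected_value=3.0, continued_cost=1.0))  # True via (a)
```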
Honest Limits
Electronic health records (HIPAA, GDPR), proprietary pharmaceutical trial data, classified government datasets, and social media data behind API restrictions are real, high-value targets that are out of scope: they require institutional access that a foundation cannot simply purchase.
Substantive reasoning was applied to astronomy, biology, ecology, climate science, and epidemiology. Materials science, chemistry, and computer science theory were treated more generically: the categories of underexploited data were identified, but not specific datasets at the same level of detail. A specialist in those fields would identify more specific targets.
Theoretical physics (the bottleneck is mathematical insight), experimental chemistry (the bottleneck is synthesis and measurement), and fundamental mathematics (the bottleneck is proof construction) are not bottlenecked by data analysis. This briefing's frame does not help with those fields.
The map is biased toward data that has been discussed in the scientific literature and in science journalism. There are almost certainly high-value datasets in government archives, in corporate databases, and in the records of international organizations that have never been discussed in the literature. The most valuable targets may be the ones that cannot be named.
The single most useful sentence in this document: The UK Biobank contains multi-organ MRI data for 100,000 participants that has never been analyzed for cross-organ structural phenotypes, and the analysis would test whether aging is a systemic process with measurable cross-organ correlates — a finding that would change how aging research is designed.
MANUS AI — DATA INTELLIGENCE — MAY 2026
The constraint isn't intelligence. It's attention applied to data nobody is paying attention to.