Research Paper Pipeline

An autonomous loop that scrapes the open literature, finds under-explored regions, synthesizes a falsifiable hypothesis, runs a real (synthetic-corpus) experiment, and puts the result through a twenty-condition read-only publish gate (prose-substance, measured baselines, grounded citations and corpus-distinctness now enforced, not just diagnosed). It is deliberately transparent about its own limits: the experiments are honestly-labelled synthetic ground-truth recovery, the gate validates provenance rather than prose, and publication is a manual human decision — never automated. Read the first Deep Dive entry, What this pipeline is NOT, before the rest.

163K-paper corpus · 20 publish-gate conditions · 11 CLI subcommands · manual publish only
Pipeline Metrics
163K

Papers in the LanceDB corpus (OpenAlex · Semantic Scholar · arXiv)

papers_store (LanceDB, R2-backed)

20

Read-only publish-gate conditions — incl. prose-substance, measured baselines, grounded citations & corpus-distinctness (2026-05-17)

publish_gate.rs CONDITION_KEYS

11

Subcommands in the consolidated `research-pipeline` binary

crates/research-pipeline/Cargo.toml

1

Published paper per seed_slug — single-champion: the loop halts once one paper passes

single_champion_target_met()

0

Auto-publish paths — HuggingFace upload is a deliberate manual human step after gate-pass

paper-loop never calls mark_published

1

discover

Three-source fan-in

Cursor-paginated discovery against OpenAlex, Semantic Scholar, and arXiv. Three orthogonal sources fan in to a dedup gate keyed on a normalized title + DOI, then land in the LanceDB paper corpus (synced to Cloudflare R2 — never Neon for this loop).

papers.lance → classify_embed
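The dedup step described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's Rust implementation: the exact normalization rules are not stated in the source, so `normalize_title`, `normalize_doi`, and the "duplicate if either the DOI or the normalized title was already seen" rule are assumptions, and the DOI value is a placeholder.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Fold accents and case, then collapse punctuation and whitespace (assumed rules)."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def normalize_doi(doi) -> str:
    """Lowercase a DOI and strip common URL / 'doi:' prefixes; empty string if missing."""
    if not doi:
        return ""
    return re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", doi.strip().lower())


def dedup(papers):
    """Keep the first record per paper; a later record is dropped if its DOI
    or its normalized title has already been seen (assumed matching rule)."""
    seen_titles, seen_dois, kept = set(), set(), []
    for paper in papers:
        title = normalize_title(paper.get("title", ""))
        doi = normalize_doi(paper.get("doi"))
        if (doi and doi in seen_dois) or (title and title in seen_titles):
            continue
        if doi:
            seen_dois.add(doi)
        if title:
            seen_titles.add(title)
        kept.append(paper)
    return kept


# Hypothetical records from the three sources; the DOI is a placeholder.
papers = [
    {"title": "Attention Is All You Need", "doi": "10.1000/ABC", "source": "openalex"},
    {"title": "attention is all you need.", "doi": None, "source": "arxiv"},
    {"title": "A Different Title", "doi": "https://doi.org/10.1000/abc", "source": "s2"},
    {"title": "Another Paper", "doi": None, "source": "arxiv"},
]
unique = dedup(papers)
```

Matching on either key, rather than on a combined key, lets an arXiv record without a DOI collapse into the OpenAlex record that has one.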
2

classify_embed

Single model end-to-end

DeepSeek classifies each paper for relevance to the seed; bge-m3 produces a 1024-dim L2-normalized embedding (production ML container or the local :7799 shim). The same embedder is used everywhere downstream so cosine stays comparable.

vector → gap_detection
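Why L2 normalization keeps cosine comparable can be shown in a few lines. The sketch below uses plain Python lists in place of real 1024-dim bge-m3 vectors; `l2_normalize` and `cosine_unit` are illustrative names, not pipeline functions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_unit(a, b):
    """For unit-length vectors, cosine similarity reduces to the dot product."""
    return sum(x * y for x, y in zip(a, b))


# Toy 2-dim vectors standing in for embeddings.
u = l2_normalize([3.0, 4.0])
v = l2_normalize([0.0, 2.0])
similarity = cosine_unit(u, v)
```

Because every stored vector has unit length, similarity scores computed at different stages live on the same scale, which is only true if one embedder is used throughout.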
3

gap_detection

Embedding-outlier clustering

Papers in the bottom quartile of cosine similarity to the embedding centroid become an under-explored region. A heuristic, on purpose — the synthesized paper must still survive the uniqueness scan and forensics downstream.

research_gaps → synthesis
4

synthesis

Pre-registered + grounded

A falsifiable hypothesis is pre-registered (created strictly before the paper row, so the gate can prove it wasn't retrofitted), then DeepSeek expands it to an arXiv-style document with Related-Work grounded only to the gap's real member papers — references are rendered deterministically, never invented.

full_md → experiment
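Two properties from this stage lend themselves to a small sketch: the strict "hypothesis before paper" ordering, and reference lists that are rendered deterministically from the gap's member papers. The field names (`id`, `authors`, `year`, `title`), the sort key, and the citation format below are assumptions for illustration; the records are placeholders, not real references.

```python
from datetime import datetime, timezone


def is_pre_registered(hypothesis_created_at, paper_created_at):
    """Strict ordering: equal timestamps do not count as pre-registered."""
    return hypothesis_created_at < paper_created_at


def render_references(gap_members):
    """Render a numbered reference list from gap members only.

    Sorting on a stable key makes the output independent of input order,
    so the same gap always yields the same numbering (assumed sort key).
    """
    ordered = sorted(
        gap_members, key=lambda m: (m["year"], m["title"].lower(), m["id"])
    )
    return [
        f"[{i}] {m['authors']} ({m['year']}). {m['title']}."
        for i, m in enumerate(ordered, start=1)
    ]


# Placeholder gap members (not real papers).
members = [
    {"id": "p2", "authors": "Author B", "year": 2021, "title": "Placeholder title two"},
    {"id": "p1", "authors": "Author A", "year": 2019, "title": "Placeholder title one"},
]
hypothesis_ts = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
paper_ts = datetime(2026, 1, 1, 12, 5, tzinfo=timezone.utc)
references = render_references(members)
```

Since the list is built only from `gap_members`, an in-text `[N]` can be checked later against the length of this list.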
5

experiment

Library-backed synthetic recovery

Four modules (Cox-PH churn, Kalman ARR, Ward segmentation, Huber-IRLS attribution) are each generated as a self-contained Python module over a canonical library and a SELF-GENERATED synthetic corpus matched to that estimator. pytest must pass; evaluate.py's raw stdout is sha256-bound into the MEASURED-RESULTS block — honestly labelled synthetic, not real-world.

MEASURED-RESULTS → publish_gate
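The sha256 binding mentioned above can be illustrated with the standard library. This is a sketch under assumptions: the source does not show how the MEASURED-RESULTS block is laid out, so `bind_measured_results` and the returned dictionary shape are hypothetical; only the `_provenance_sha256` field name comes from the document.

```python
import hashlib
import hmac


def provenance_sha256(raw_stdout: bytes) -> str:
    """Hash the exact bytes that evaluate.py printed."""
    return hashlib.sha256(raw_stdout).hexdigest()


def bind_measured_results(raw_stdout: bytes) -> dict:
    """Package the raw output with its digest (hypothetical block shape)."""
    return {
        "raw": raw_stdout.decode("utf-8"),
        "_provenance_sha256": provenance_sha256(raw_stdout),
    }


def block_matches_run(block: dict, raw_stdout: bytes) -> bool:
    """Read-only check: does the recorded digest match this run's output?"""
    return hmac.compare_digest(
        block["_provenance_sha256"], provenance_sha256(raw_stdout)
    )


run_output = b'{"snapshot": {"cph_concordance": 0.79}}\n'
block = bind_measured_results(run_output)
tampered = b'{"snapshot": {"cph_concordance": 0.99}}\n'
```

Hashing the raw bytes, rather than a re-serialized JSON object, means any edit to the reported numbers changes the digest.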
6

publish_gate

Read-only gate + manual signoff

Twenty read-only conditions must all pass (covering provenance, metrics, review, reviewers, and honesty, among others). The gate never writes. Beyond it sit the guardrails, uniqueness-scan and paper-forensics, followed by the single-champion rule. Publication to HuggingFace is a deliberate manual human step after a person reads those reports.

Deep Dive

1. What this pipeline is NOT (read this first)

The per-module experiments are ground-truth-recovery checks on SELF-GENERATED synthetic corpora using textbook estimators (Cox-PH, Kalman, Ward, Huber-IRLS). They demonstrate the implementations are correct on data drawn from the model they fit — they are NOT real-world B2B results, and they are labelled as synthetic in every paper. The publish gate verifies execution, binds the measured numbers to a run by sha256, AND (since the 2026-05-17 refinement, conditions #17–#20) fails the paper if the Results/Abstract prose contradicts the measured snapshot, if comparative claims have no measured baseline, if it cites references outside the grounded pool, or if it is a near-duplicate of prior art. Publication remains a deliberate manual human step — never automated. gate=pass now means provenance-clean AND prose-honest; it still does not mean the synthetic finding generalizes to the real world.

2. The consolidated `research-pipeline` binary

Eleven former standalone binaries are now one binary with clap subcommands. `serve` is the Cloudflare Container HTTP entrypoint; the cron fires `tick` every few minutes; the hourly cloud routine runs the R2-backed `paper-loop` envelope (hydrate → enhance/assess → persist, never publish). The rest are operator/audit tools.

research-pipeline <subcommand> [args]
  tick                      one discovery/synthesis tick (cron)
  serve                     long-running HTTP server (CF Container CMD)
  paper-loop                R2-backed cloud envelope (never publishes)
  uniqueness-scan           prior-art / novelty scan
  paper-forensics           substance + citation + full-text audit
  verify-bridge  lift-paper-quality  find-author-papers
  embed-backfill  migrate-papers-to-lancedb  migrate-neon-to-sqlite

3. Gap detection is a heuristic, on purpose

A 'gap' is the bottom quartile of cosine similarity to the corpus centroid — an embedding-outlier signal, cheap and unsupervised. It is deliberately not treated as proof of novelty: the synthesized paper inherits the gap, then must independently survive the uniqueness scan (nearest-neighbour against the real literature) and the forensic differential before a human will consider it. Cheap signal upstream, strict verification downstream.
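The bottom-quartile rule can be sketched as follows. It assumes "bottom quartile" means the lowest 25% of papers ranked by cosine similarity to the mean vector; the function names and the toy 2-dim vectors are illustrative, and the real pipeline works on 1024-dim bge-m3 embeddings.

```python
import math


def centroid(vectors):
    """Component-wise mean of the embedding vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def cosine_sim(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def bottom_quartile_gap(vectors):
    """Indices of the 25% of vectors least similar to the centroid."""
    c = centroid(vectors)
    sims = [cosine_sim(v, c) for v in vectors]
    ranked = sorted(range(len(vectors)), key=lambda i: sims[i])
    k = max(1, len(vectors) // 4)
    return sorted(ranked[:k])


# Six vectors cluster along the x-axis; two point elsewhere.
vectors = [
    [1.0, 0.0], [1.0, 0.05], [1.0, -0.05], [1.0, 0.1],
    [1.0, -0.1], [1.0, 0.02], [0.2, 1.0], [0.1, -1.0],
]
gap_members = bottom_quartile_gap(vectors)
```

Note that this rule always returns a quarter of the corpus, even when nothing is truly isolated, which is one reason the signal is treated as a heuristic rather than proof of novelty.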

4. State of record: SQLite + LanceDB on R2, never Neon

This loop's relational state is a SQLite `research.db`; the paper corpus + embeddings are a LanceDB store. Both are persisted to a Cloudflare R2 bucket and round-tripped by the cloud `paper-loop` under a single-flight lock. The loop has no Neon, HuggingFace, email, or deploy credentials — never-publish / never-Neon is enforced by least-privilege, not just convention.

Technical Details

The 20 publish-gate conditions

Every condition is read-only and must pass before a paper is publishable. The gate never writes `published_at` — publication is a separate manual step. The last four (2026-05-17 integrity refinement) make the gate enforce what uniqueness-scan / paper-forensics previously only diagnosed.

| Field | Description | Type |
| --- | --- | --- |
| `pytest_passed` | Generated implementation + tests run green | execution |
| `snapshot_metrics` | A snapshot metrics blob exists | metrics |
| `held_out_metrics` | Held-out within tolerance of snapshot | metrics |
| `live_metrics` | Live metrics present (dev-mode may skip) | metrics |
| `review_floors` | Automated-review scores ≥ floors | review |
| `reviewer_pool` | ≥100 reviewers with verified email | reviewers |
| `coi_retained` | ≥100 after conflict-of-interest filtering | reviewers |
| `integrity_statement` | Integrity + LLM-authorship disclosure present | honesty |
| `compute_summary` | Compute summary within cost/energy ceilings | compute |
| `hypothesis_pre_registered` | Hypothesis created before the paper row | provenance |
| `paper_quality` | Anti-stub: title, length, ≥2 section headers | quality |
| `hf_url_remote` | HF URL is https:// (not a local:// placeholder) | publish |
| `test_substance` | ≥2 passing tests (degenerate-stub guard) | execution |
| `experiment_provenance` | implementation_py + eval_py + training_metrics all non-empty | provenance |
| `results_match_measured` | MEASURED block sha256-bound to training_metrics | provenance |
| `validity_disclosed` | Validity & Limitations block present | honesty |
| `prose_substance` | Results/Abstract numbers can't contradict the measured snapshot (>10%); no fabricated t-tests/p/±/$ while baselines empty | honesty |
| `baselines_measured` | training_metrics.baselines non-empty: comparative claims backed by a real baseline run | provenance |
| `citations_grounded` | No invented in-text [N] outside the grounded gap-member pool | honesty |
| `corpus_distinctness` | Not a near-duplicate of prior art (bge-m3 NN; soft-skips with no embed endpoint) | novelty |
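A read-only gate of this kind can be sketched as a function that computes every check and returns a verdict without mutating its input. The example covers four of the twenty conditions only. The real gate lives in `publish_gate.rs`; the Python below, the `paper` dictionary shape, and the way claimed numbers reach `prose_substance` (already extracted from the prose) are all assumptions for illustration.

```python
import re


def relative_deviation(claimed, measured):
    """Relative gap between a claimed number and the measured one."""
    if measured == 0:
        return 0.0 if claimed == 0 else float("inf")
    return abs(claimed - measured) / abs(measured)


def prose_substance(claimed, snapshot, tolerance=0.10):
    """Fail if any claimed metric deviates from the snapshot by more than 10%."""
    return all(
        relative_deviation(value, snapshot[name]) <= tolerance
        for name, value in claimed.items()
        if name in snapshot
    )


def citations_grounded(text, pool_size):
    """Every in-text [N] must index into the grounded reference pool."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", text)}
    return all(1 <= n <= pool_size for n in cited)


def evaluate_gate(paper):
    """Compute the checks and report failures; never writes to `paper`."""
    checks = {
        "test_substance": paper["tests_passed"] >= 2,
        "hf_url_remote": paper["hf_url"].startswith("https://"),
        "prose_substance": prose_substance(
            paper["claimed_metrics"], paper["snapshot"]
        ),
        "citations_grounded": citations_grounded(
            paper["full_md"], paper["grounded_pool_size"]
        ),
    }
    failed = sorted(name for name, ok in checks.items() if not ok)
    return {"pass": not failed, "failed": failed}


# Hypothetical paper rows; the URL is a placeholder.
good = {
    "tests_passed": 3,
    "hf_url": "https://example.invalid/paper",
    "claimed_metrics": {"cph_concordance": 0.80},
    "snapshot": {"cph_concordance": 0.79},
    "full_md": "Prior work [1] and [2] motivates the setup.",
    "grounded_pool_size": 3,
}
bad = dict(good, claimed_metrics={"cph_concordance": 0.95},
           full_md="As shown in [7], results improve.")
```

Returning the list of failed condition keys, rather than a bare boolean, is what lets a human reviewer see why a paper did not pass.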

Why the experiment is honest about being synthetic

Each module fits a canonical library on a corpus generated to match that estimator. The result is a ground-truth-recovery check — labelled as synthetic, sha256-bound, and explicitly NOT presented as a real-world result.

# evaluate.py output (sha256-bound into the paper's MEASURED-RESULTS block)
{
  "snapshot": { "cph_concordance": 0.79, "kal_arr_mape": 0.005,
                "hac_silhouette": 0.46, "irl_attribution_rho": 0.99 },
  "held_out": { ... },
  "baselines": {},          # explicitly empty — no fabricated comparisons
  "seeds": [42, 137]
}
# MEASURED-RESULTS: "Real executed metrics on documented synthetic corpora,
# library-backed (lifelines / filterpy / scikit-learn / statsmodels),
# honestly labelled synthetic."  +  _provenance_sha256: <hash>
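To make "ground-truth recovery on a self-generated corpus" concrete, here is a deliberately simplified stand-in. It is not the attribution module: that module uses `statsmodels` RLM with `HuberT()` on a regression design, whereas this sketch estimates a single location parameter with hand-written Huber IRLS in pure Python. The contamination level, the outlier distribution, and the function names are assumptions; 1.345 is the conventional Huber tuning constant.

```python
import random
import statistics

TRUE_MU = 5.0  # the ground truth the estimator should recover


def make_corpus(seed=42, n_clean=180, n_outliers=20):
    """Synthetic corpus: Gaussian data around TRUE_MU plus gross outliers."""
    rng = random.Random(seed)
    clean = [rng.gauss(TRUE_MU, 1.0) for _ in range(n_clean)]
    outliers = [rng.gauss(50.0, 1.0) for _ in range(n_outliers)]
    return clean + outliers


def huber_location(xs, c=1.345, max_iter=100, tol=1e-9):
    """Huber M-estimate of location via iteratively reweighted least squares."""
    mu = statistics.median(xs)
    mad = statistics.median([abs(x - mu) for x in xs])
    scale = mad / 0.6745 if mad > 0 else 1.0  # robust scale, held fixed
    for _ in range(max_iter):
        weights = []
        for x in xs:
            r = abs(x - mu) / scale
            # Full weight inside the threshold, down-weighted beyond it.
            weights.append(1.0 if r <= c else c / r)
        new_mu = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu


corpus = make_corpus()
estimate = huber_location(corpus)
naive_mean = statistics.fmean(corpus)
recovery_error = abs(estimate - TRUE_MU)
```

The check passes when `recovery_error` is small while the naive mean is pulled far from `TRUE_MU`. That shows the estimator works on data drawn from its own model, which is exactly the limited claim the papers make.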

Integrity tooling (the guardrails)

Three read-only tools sit beyond the gate. The gate proves provenance; these answer 'is it novel?' and 'is it sound?'.

uniqueness-scan

bge-m3 NN vs the 163K corpus + gap members; per-module, external OpenAlex/S2, and DeepSeek claim diff.

--per-module / --external / --llm-diff
paper-forensics

substance (measured vs prose), cite-audit (grounded/real/unverifiable), full-text prior-art diff.

--substance / --cite-audit / --fulltext-diff
single-champion

At most one published paper per seed_slug; the loop short-circuits once a champion publishes.

single_champion_target_met
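The single-champion rule amounts to a short-circuit check before each loop iteration. The sketch below assumes a paper counts as a champion once its `published_at` is set, following the card above; `champion_exists` and the row shape are hypothetical and are not the signature of `single_champion_target_met`.

```python
def champion_exists(papers, seed_slug):
    """True if any paper for this seed_slug already has a publication timestamp."""
    return any(
        p["seed_slug"] == seed_slug and p.get("published_at") is not None
        for p in papers
    )


def seeds_still_open(papers, seed_slugs):
    """Seeds the loop should keep working on (no champion yet)."""
    return [slug for slug in seed_slugs if not champion_exists(papers, slug)]


# Hypothetical rows; published_at is only ever set by the manual publish step.
papers = [
    {"seed_slug": "churn", "published_at": "2026-05-18T09:00:00Z"},
    {"seed_slug": "churn", "published_at": None},
    {"seed_slug": "arr", "published_at": None},
]
open_seeds = seeds_still_open(papers, ["churn", "arr"])
```

The check only reads `published_at`; nothing in the loop sets it, which is consistent with publication being a manual step.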

Technical Foundations

Rust · 2026

research-pipeline (Rust crate)

This repo · apps/agentic-sales/crates/research-pipeline

One consolidated binary with 11 subcommands (tick · serve · verify-bridge · lift-paper-quality · find-author-papers · paper-loop · migrate-papers-to-lancedb · migrate-neon-to-sqlite · uniqueness-scan · embed-backfill · paper-forensics). `serve` is the Cloudflare Container entrypoint; `tick` is the cron loop; `paper-loop` drives the hourly R2-backed cloud envelope.

Single source of truth for discovery, synthesis, the publish gate, and the integrity tooling — five threading lanes (tokio I/O, rayon CPU, subprocess, coordinator, isolated long-running).

Method · 1972

Cox (1972) — Regression Models and Life-Tables

D. R. Cox · JRSS-B

Proportional-hazards regression for right-censored survival data; Efron tie handling. The CPH module fits `lifelines.CoxPHFitter` on a synthetic Cox-generated corpus.

Underpins the churn-trajectory module — censored survival is the honest model for contract-renewal data.

Method · 1960

Kalman (1960) — A New Approach to Linear Filtering

R. E. Kalman · ASME

Recursive state estimation under Gaussian noise. The ARR module uses a constant-velocity `filterpy.kalman.KalmanFilter` on a synthetic level+drift series.

Per-account ARR tracking with calibrated uncertainty.

Method · 1963

Ward (1963) — Hierarchical Grouping

J. H. Ward · JASA

Minimum-variance agglomerative clustering. The segmentation module runs `sklearn.cluster.AgglomerativeClustering(linkage='ward')` on well-separated synthetic revenue bands.

Revenue-band account segmentation.

Method · 1964

Huber (1964) — Robust Estimation / IRLS

P. J. Huber · Ann. Math. Stat.

M-estimation robust to heavy-tailed contamination. The attribution module uses `statsmodels` RLM with `HuberT()` on a contaminated synthetic design.

Outlier-robust revenue attribution.

AI/ML · 2024

BGE-M3 — Multilingual Dense Retrieval Embeddings

BAAI

1024-dim L2-normalized embeddings. Used for the corpus vectors, gap-detection clustering, and the uniqueness scan — the same model end-to-end so cosine is meaningful.

Powers gap detection and the prior-art / uniqueness scan against the 163K-paper corpus.

Infra · 2024

LanceDB + R2

LanceDB

Embedded columnar vector store; the paper corpus + embeddings live in LanceDB synced to a Cloudflare R2 bucket (the SQLite `research.db` is the relational state of record, also R2-persisted — never Neon for this loop).

Source of truth for papers, gaps, and synthesized papers.

AI/ML · 2026

DeepSeek-V4-Pro (thinking mode)

DeepSeek

Synthesis, paper expansion, automated review, and the forensic LLM differentials run on DeepSeek (`reasoning_effort=high`), with a Workers-AI fallback when DeepSeek returns HTTP 402.

The only LLM hop; codegen + synthesis + review.