Research Paper Pipeline
An autonomous loop that scrapes the open literature, finds under-explored regions, synthesizes a falsifiable hypothesis, runs a real (synthetic-corpus) experiment, and puts the result through a twenty-condition read-only publish gate (prose-substance, measured baselines, grounded citations and corpus-distinctness now enforced, not just diagnosed). It is deliberately transparent about its own limits: the experiments are honestly-labelled synthetic ground-truth recovery, the gate validates provenance rather than prose, and publication is a manual human decision — never automated. Read the first Deep Dive entry, What this pipeline is NOT, before the rest.
Papers in the LanceDB corpus (OpenAlex · Semantic Scholar · arXiv)
papers_store (LanceDB, R2-backed)
Read-only publish-gate conditions — incl. prose-substance, measured baselines, grounded citations & corpus-distinctness (2026-05-17)
publish_gate.rs CONDITION_KEYS
Subcommands in the consolidated `research-pipeline` binary
crates/research-pipeline/Cargo.toml
Published paper per seed_slug — single-champion: the loop halts once one paper passes
single_champion_target_met()
Auto-publish paths — HuggingFace upload is a deliberate manual human step after gate-pass
paper-loop never calls mark_published
discover
Three-source fan-in
Cursor-paginated discovery against OpenAlex, Semantic Scholar, and arXiv. Three orthogonal sources fan in to a dedup gate keyed on a normalized title + DOI, then land in the LanceDB paper corpus (synced to Cloudflare R2, never Neon for this loop).
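The dedup gate can be sketched as follows. This is a minimal illustration, not the crate's implementation: the function names, the record fields (`title`, `doi`, `source`), and the choice to treat a match on either the DOI or the normalized title as a duplicate are all assumptions.

```python
import re
import unicodedata


def normalize_title(title):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", " ", ascii_only.lower()).strip()


def normalize_doi(doi):
    """Return a bare lowercase DOI, or None when the source supplied none."""
    if not doi:
        return None
    bare = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", doi.strip().lower())
    return bare or None


def dedup(records):
    """Keep the first record seen per DOI or per normalized title (assumed rule)."""
    seen = set()
    kept = []
    for rec in records:
        keys = [("title", normalize_title(rec["title"]))]
        doi = normalize_doi(rec.get("doi"))
        if doi:
            keys.append(("doi", doi))
        if any(key in seen for key in keys):
            continue  # already ingested from another source
        seen.update(keys)
        kept.append(rec)
    return kept
```

Registering both keys for every kept record means a later copy that lacks a DOI is still caught by its title.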
papers.lance → classify_embed
classify_embed
Single model end-to-end
DeepSeek classifies each paper for relevance to the seed; bge-m3 produces a 1024-dim L2-normalized embedding (production ML container or the local :7799 shim). The same embedder is used everywhere downstream so cosine stays comparable.
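Why L2 normalization matters can be shown in a few lines: once every vector has unit length, cosine similarity reduces to a plain dot product. The sketch below uses toy 2-dim vectors in place of 1024-dim bge-m3 output, and the helper names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity of two already L2-normalized vectors (a dot product)."""
    return sum(x * y for x, y in zip(a, b))
```

Mixing embedders would break this shortcut, since vectors from different models do not share a space and their dot products are not comparable.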
vector → gap_detection
gap_detection
Embedding-outlier clustering
Papers in the bottom quartile of cosine similarity to the embedding centroid become an under-explored region. This is a deliberate heuristic: the synthesized paper must still survive the uniqueness scan and forensics downstream.
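A minimal sketch of the bottom-quartile rule, assuming the centroid is the normalized mean of the unit vectors and that the quartile is taken by rank. The function name and the dict-of-vectors input are hypothetical.

```python
import math


def _unit(vec):
    """Scale a vector to unit length (inputs are assumed non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def bottom_quartile_outliers(embeddings):
    """Return ids in the lowest 25% of cosine similarity to the centroid."""
    ids = list(embeddings)
    units = {pid: _unit(embeddings[pid]) for pid in ids}
    dim = len(units[ids[0]])
    # Centroid: mean of the unit vectors, renormalized to unit length.
    centroid = _unit([sum(units[pid][i] for pid in ids) / len(ids) for i in range(dim)])
    sims = {pid: sum(a * b for a, b in zip(units[pid], centroid)) for pid in ids}
    ranked = sorted(ids, key=lambda pid: sims[pid])  # least similar first
    cutoff = max(1, len(ids) // 4)
    return ranked[:cutoff]
```

With eight papers the cutoff is two, so the two vectors pointing furthest from the bulk of the corpus are returned as the candidate gap.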
research_gaps → synthesis
synthesis
Pre-registered + grounded
A falsifiable hypothesis is pre-registered (created strictly before the paper row, so the gate can prove it wasn't retrofitted), then DeepSeek expands it to an arXiv-style document with Related-Work grounded only to the gap's real member papers. References are rendered deterministically, never invented.
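Both guarantees are mechanical enough to sketch. The example below assumes ISO-8601 timestamps on the two rows, a stable sort order for references, and numeric `[N]` citation markers; the function names and the `title` / `year` fields are hypothetical.

```python
import re
from datetime import datetime


def is_pre_registered(hypothesis_created_at, paper_created_at):
    """True only when the hypothesis row strictly predates the paper row."""
    return datetime.fromisoformat(hypothesis_created_at) < datetime.fromisoformat(paper_created_at)


def render_references(gap_members):
    """Number references in a fixed order so reruns give identical output."""
    ordered = sorted(gap_members, key=lambda p: (p["year"], p["title"]))
    return [f"[{i}] {p['title']} ({p['year']})" for i, p in enumerate(ordered, start=1)]


def ungrounded_citations(body, n_references):
    """Return in-text [N] markers that point outside the rendered list."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", body)}
    return sorted(n for n in cited if not 1 <= n <= n_references)
```

Because the reference list is built from the gap members rather than from model output, any marker outside `1..len(references)` can only be an invention.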
full_md → experiment
experiment
Library-backed synthetic recovery
Four modules (Cox-PH churn, Kalman ARR, Ward segmentation, Huber-IRLS attribution) are each generated as a self-contained Python module over a canonical library and a SELF-GENERATED synthetic corpus matched to that estimator. pytest must pass; evaluate.py's raw stdout is sha256-bound into the MEASURED-RESULTS block, which is honestly labelled synthetic, not real-world.
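The sha256 binding can be illustrated with the standard library. This is a sketch under assumptions: the `_provenance_sha256` field name comes from the MEASURED-RESULTS example, while the function names and the block layout are hypothetical.

```python
import hashlib
import json


def provenance_sha256(raw_stdout):
    """Hash the exact bytes evaluate.py printed, with no re-serialization."""
    return hashlib.sha256(raw_stdout).hexdigest()


def build_measured_block(raw_stdout):
    """Pair the parsed metrics with a hash of the bytes they came from."""
    return {
        "metrics": json.loads(raw_stdout),
        "_provenance_sha256": provenance_sha256(raw_stdout),
    }


def block_matches_stdout(block, raw_stdout):
    """Recompute the hash and compare; any edited digit changes it."""
    return block["_provenance_sha256"] == provenance_sha256(raw_stdout)
```

Hashing the raw bytes rather than the parsed JSON avoids false mismatches from key ordering or float formatting.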
MEASURED-RESULTS → publish_gate
publish_gate
Read-only gate + manual signoff
Twenty read-only conditions must all pass (provenance, metrics, review, reviewers, honesty). The gate never writes. Beyond it sit the guardrails, uniqueness-scan and paper-forensics, followed by the single-champion rule. Publication to HuggingFace is a deliberate manual human step after a person reads those reports.
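The all-must-pass rule is simple to sketch once each condition has been reduced to a boolean. `CONDITION_KEYS` mirrors the name cited for publish_gate.rs, and the keys are the twenty listed in the condition table; `evaluate_gate` and its input shape are hypothetical, and the real gate is written in Rust.

```python
CONDITION_KEYS = [
    "pytest_passed", "snapshot_metrics", "held_out_metrics", "live_metrics",
    "review_floors", "reviewer_pool", "coi_retained", "integrity_statement",
    "compute_summary", "hypothesis_pre_registered", "paper_quality",
    "hf_url_remote", "test_substance", "experiment_provenance",
    "results_match_measured", "validity_disclosed", "prose_substance",
    "baselines_measured", "citations_grounded", "corpus_distinctness",
]


def evaluate_gate(results):
    """Return (passed, failing_keys) without mutating or persisting anything.

    A missing key counts as a failure, so a skipped check cannot pass silently.
    """
    failing = [key for key in CONDITION_KEYS if not results.get(key, False)]
    return (not failing, failing)
```

Returning the failing keys rather than a bare boolean keeps the gate useful as a diagnostic while it stays read-only.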
Deep Dive
Technical Details
The 20 publish-gate conditions
Every condition is read-only and must pass before a paper is publishable. The gate never writes `published_at` — publication is a separate manual step. The last four (2026-05-17 integrity refinement) make the gate enforce what uniqueness-scan / paper-forensics previously only diagnosed.
| Condition | Description | Category |
|---|---|---|
| pytest_passed | Generated implementation + tests run green | execution |
| snapshot_metrics | A snapshot metrics blob exists | metrics |
| held_out_metrics | Held-out within tolerance of snapshot | metrics |
| live_metrics | Live metrics present (dev-mode may skip) | metrics |
| review_floors | Automated-review scores ≥ floors | review |
| reviewer_pool | ≥100 reviewers with verified email | reviewers |
| coi_retained | ≥100 after conflict-of-interest filtering | reviewers |
| integrity_statement | Integrity + LLM-authorship disclosure present | honesty |
| compute_summary | Compute summary within cost/energy ceilings | compute |
| hypothesis_pre_registered | Hypothesis created before the paper row | provenance |
| paper_quality | Anti-stub: title, length, ≥2 section headers | quality |
| hf_url_remote | HF URL is https:// (not a local:// placeholder) | publish |
| test_substance | ≥2 passing tests (degenerate-stub guard) | execution |
| experiment_provenance | implementation_py + eval_py + training_metrics all non-empty | provenance |
| results_match_measured | MEASURED block sha256-bound to training_metrics | provenance |
| validity_disclosed | Validity & Limitations block present | honesty |
| prose_substance | Results/Abstract numbers can't contradict the measured snapshot (>10%); no fabricated t-tests/p/±/$ while baselines empty | honesty |
| baselines_measured | training_metrics.baselines non-empty — comparative claims backed by a real baseline run | provenance |
| citations_grounded | No invented in-text [N] outside the grounded gap-member pool | honesty |
| corpus_distinctness | Not a near-duplicate of prior art (bge-m3 NN; soft-skips with no embed endpoint) | novelty |
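As an illustration of the `prose_substance` numeric check, the sketch below flags any prose figure that differs from the measured snapshot by more than 10%. Reading ">10%" as relative difference is an assumption, and the function names and the claims dict are hypothetical.

```python
def contradicts_snapshot(prose_value, measured_value, tolerance=0.10):
    """True when the prose number is more than `tolerance` off the measured one."""
    if measured_value == 0:
        return prose_value != 0
    return abs(prose_value - measured_value) / abs(measured_value) > tolerance


def contradicting_claims(claims, snapshot, tolerance=0.10):
    """Return the metric names whose prose value contradicts the snapshot."""
    return sorted(
        name
        for name, value in claims.items()
        if name in snapshot and contradicts_snapshot(value, snapshot[name], tolerance)
    )
```

For example, a prose claim of 0.80 against a measured `cph_concordance` of 0.79 passes, while 0.60 against a measured `hac_silhouette` of 0.46 is flagged.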
Why the experiment is honest about being synthetic
Each module fits a canonical library on a corpus generated to match that estimator. The result is a ground-truth-recovery check — labelled as synthetic, sha256-bound, and explicitly NOT presented as a real-world result.
# evaluate.py output (sha256-bound into the paper's MEASURED-RESULTS block)
{
"snapshot": { "cph_concordance": 0.79, "kal_arr_mape": 0.005,
"hac_silhouette": 0.46, "irl_attribution_rho": 0.99 },
"held_out": { ... },
"baselines": {}, # explicitly empty — no fabricated comparisons
"seeds": [42, 137]
}
# MEASURED-RESULTS: "Real executed metrics on documented synthetic corpora,
# library-backed (lifelines / filterpy / scikit-learn / statsmodels),
# honestly labelled synthetic." + _provenance_sha256: <hash>
Integrity tooling (the guardrails)
Three read-only tools sit beyond the gate. The gate proves provenance; these answer 'is it novel?' and 'is it sound?'.
uniqueness-scan
bge-m3 NN vs the 163K corpus + gap members; per-module, external OpenAlex/S2, and DeepSeek claim diff.
--per-module / --external / --llm-diff
paper-forensics
substance (measured vs prose), cite-audit (grounded/real/unverifiable), full-text prior-art diff.
--substance / --cite-audit / --fulltext-diff
single-champion rule
At most one published paper per seed_slug; the loop short-circuits once a champion publishes.
single_champion_target_met
Technical Foundations
research-pipeline (Rust crate)
This repo · apps/agentic-sales/crates/research-pipeline
One consolidated binary with 11 subcommands (tick · serve · verify-bridge · lift-paper-quality · find-author-papers · paper-loop · migrate-papers-to-lancedb · migrate-neon-to-sqlite · uniqueness-scan · embed-backfill · paper-forensics). `serve` is the Cloudflare Container entrypoint; `tick` is the cron loop; `paper-loop` drives the hourly R2-backed cloud envelope.
Single source of truth for discovery, synthesis, the publish gate, and the integrity tooling — five threading lanes (tokio I/O, rayon CPU, subprocess, coordinator, isolated long-running).
Cox (1972) — Regression Models and Life-Tables
D. R. Cox · JRSS-B
Proportional-hazards regression for right-censored survival data; Efron tie handling. The CPH module fits `lifelines.CoxPHFitter` on a synthetic Cox-generated corpus.
Underpins the churn-trajectory module — censored survival is the honest model for contract-renewal data.
Kalman (1960) — A New Approach to Linear Filtering
R. E. Kalman · ASME
Recursive state estimation under Gaussian noise. The ARR module uses a constant-velocity `filterpy.kalman.KalmanFilter` on a synthetic level+drift series.
Per-account ARR tracking with calibrated uncertainty.
Ward (1963) — Hierarchical Grouping
J. H. Ward · JASA
Minimum-variance agglomerative clustering. The segmentation module runs `sklearn.cluster.AgglomerativeClustering(linkage='ward')` on well-separated synthetic revenue bands.
Revenue-band account segmentation.
Huber (1964) — Robust Estimation / IRLS
P. J. Huber · Ann. Math. Stat.
M-estimation robust to heavy-tailed contamination. The attribution module uses `statsmodels` RLM with `HuberT()` on a contaminated synthetic design.
Outlier-robust revenue attribution.
BGE-M3 — Multilingual Dense Retrieval Embeddings
BAAI
1024-dim L2-normalized embeddings. Used for the corpus vectors, gap-detection clustering, and the uniqueness scan, with the same model end-to-end so cosine is meaningful.
Powers gap detection and the prior-art / uniqueness scan against the 163K-paper corpus.
LanceDB + R2
LanceDB
Embedded columnar vector store; the paper corpus + embeddings live in LanceDB synced to a Cloudflare R2 bucket (the SQLite `research.db` is the relational state of record, also R2-persisted; never Neon for this loop).
Source of truth for papers, gaps, and synthesized papers.
DeepSeek-V4-Pro (thinking mode)
DeepSeek
Synthesis, paper expansion, automated review, and the forensic LLM differentials run on DeepSeek (`reasoning_effort=high`), with a Workers-AI fallback on 402.
The only LLM hop; codegen + synthesis + review.