- name: notion-report
- description: Create and maintain scientific/empirical experiment or investigation reports in Notion via MCP from freeform artifacts. Default to direct Notion editing with a Claude-assisted wording/structure pass when available; support local-first draft QA before publishing when explicitly requested. Use the report shape that best communicates the evidence instead of forcing one narrative format.
Notion Report
Multi-agent collaboration
- Default to using subagents when they are likely to improve speed, quality, confidence, or keep the main context clean.
- Use subagents to widen coverage, dig deeper on one thread, get a fresh second opinion, or keep the main thread clean while side work runs.
- Split work into clear packets with owners, inputs, acceptance checks, and a synthesis step when parallelizing.
- Keep the main agent focused on synthesis, unblockers, and the next critical-path step; let subagents handle bounded side work that can run in parallel.
- Use single-agent execution only when scope is small or coordination overhead outweighs gains.
Proactive autonomy and knowledge compounding
- Be proactive: immediately take the next highest-value in-scope action when it is clear.
- Default to autonomous execution: do not pause for confirmation between normal in-scope steps.
- Request user input only when absolutely necessary: ambiguous requirements, material risk tradeoffs, missing required data/access, or destructive/irreversible actions outside policy.
- If blocked by command/tool/env failures, attempt high-confidence fallbacks autonomously before escalating (for example rg -> find/grep, python -> python3, alternate repo-native scripts).
- When the workflow uses plan/, ensure required plan directories exist before reading/writing them (create when edits are allowed; otherwise use an in-memory fallback and call it out).
- Treat transient external failures (network/SSH/remote APIs/timeouts) as retryable by default: run bounded retries with backoff and capture failure evidence before concluding blocked.
- On repeated invocations for the same objective, resume from prior findings/artifacts and prioritize net-new progress over rerunning identical work unless verification requires reruns.
- Drive work to complete outcomes with verification, not partial handoffs.
- Treat iterative execution as the default for non-trivial work; run adaptive loop passes. Example loops (adapt as needed, not rigid): issue-resolution (investigate -> plan -> fix -> verify -> battletest -> organise-docs -> git-commit -> re-review); cleanup (scan -> prioritize -> clean -> verify -> re-scan); docs (audit -> update -> verify -> re-audit).
- Keep looping until actual completion criteria are met: no actionable in-scope items remain, verification is green, and confidence is high.
- Run organise-docs frequently during execution to capture durable decisions and learnings, not only at the end.
- Create small checkpoint commits frequently with git-commit when changes are commit-eligible, checks are green, and repo policy permits commits.
- Never squash commits; always use merge commits when integrating branches.
- Prefer simplification over added complexity: aggressively remove bloat, redundancy, and over-engineering while preserving correctness.
- When you touch code, leave the touched area in a better state than you found it: clearer, simpler, tidier, and at least as performant unless the task requires an explicit trade-off.
- Use simple, plain English in user messages, docs, notes, reports, code comments, and other explanatory writing. Avoid jargon, fancy wording, and complex phrasing. When a technical term is needed for correctness, explain it in simple words the first time. Default to short user-facing responses. Think about what the user most wants to know, and lead with that. Do not dump every detail by default. Always include important changes, blockers, verification gaps, and any important assumptions, nuances, principles, or decisions that shaped the work. Add more detail only when the user asks for it or when uncertainty or risk makes it necessary.
- Compound knowledge continuously: keep docs/ accurate and up to date, and promote durable learnings and decisions from work into docs.
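The retry rule above (bounded retries with backoff, with failure evidence captured before concluding blocked) can be sketched as follows. This is a minimal illustration, not part of the skill: the function name, the default attempt count, and the delay schedule are all assumptions.

```python
import time
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    action: Callable[[], T],
    max_attempts: int = 3,        # assumption: the skill does not fix a number
    base_delay_s: float = 1.0,    # assumption: delay doubles after each failure
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[T, List[str]]:
    """Run `action`, retrying on failure with exponential backoff.

    Returns the result plus the failure evidence collected on the way.
    Raises RuntimeError carrying that evidence once attempts run out.
    """
    evidence: List[str] = []
    for attempt in range(1, max_attempts + 1):
        try:
            return action(), evidence
        except Exception as exc:  # in real use, narrow this to transient errors
            evidence.append(f"attempt {attempt}: {type(exc).__name__}: {exc}")
            if attempt < max_attempts:
                sleep(base_delay_s * 2 ** (attempt - 1))
    raise RuntimeError("blocked after bounded retries: " + "; ".join(evidence))
```

The evidence list is returned on success as well, so earlier transient failures can still be noted in the task summary.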
Long-task checkpoint cadence
- For any non-trivial task (including long efforts), run recurring checkpoint cycles instead of waiting for a single end-of-task wrap-up.
- At each meaningful milestone with commit-eligible changes, and at least once per major phase, invoke git-commit to create a small logical checkpoint commit once relevant checks are green and repo policy permits commits.
- At the same cadence, invoke organise-docs whenever durable learnings/decisions appear, and prune stale plan/ scratch artifacts.
- If either checkpoint is blocked (for example failing checks or low-confidence documentation), resolve or record the blocker immediately and retry before expanding scope.
Interruption handoff (required for multi-pass report work)
- Before stopping, switching sessions, or yielding after tool/session interruption, write a concise handoff in plan/handoffs/ or plan/current/notes.md.
- Include the report/page identity, latest evidence bundle reviewed, mutations already published, remaining open edits, and the exact next step.
- On resume, load the handoff first so the report is continued in place instead of re-audited from scratch.
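A handoff note with the fields listed above could be written like this. It is a sketch under assumptions: the function name, the field labels, and the file name latest.md are hypothetical; only the plan/handoffs/ location and the list of required contents come from this skill.

```python
from pathlib import Path
from typing import List


def write_handoff(
    plan_dir: Path,
    report_identity: str,
    evidence_bundle: str,
    published: List[str],
    open_edits: List[str],
    next_step: str,
) -> Path:
    """Write a short handoff note under <plan_dir>/handoffs/ and return its path."""
    handoff_dir = plan_dir / "handoffs"
    handoff_dir.mkdir(parents=True, exist_ok=True)  # create plan dirs before writing
    lines = [
        f"Report: {report_identity}",
        f"Latest evidence bundle reviewed: {evidence_bundle}",
        "Mutations already published:",
        *[f"- {item}" for item in published],
        "Remaining open edits:",
        *[f"- {item}" for item in open_edits],
        f"Exact next step: {next_step}",
    ]
    path = handoff_dir / "latest.md"  # hypothetical file name
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path
```

Keeping the next step as the last line makes it the first thing to act on when the note is loaded on resume.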
Create and refine a single canonical report page in Notion from evidence artifacts.
This skill is cross-project by default. It is designed for scientific/empirical reports in any domain. Domain overlays (for example trading/market experiments) are optional and only apply when relevant.
Terminal state contract (must follow)
The skill is complete only when all of the following are true:
- Objective completion: the user-requested outcome is achieved, or explicitly marked blocked with concrete blocker evidence.
- Workflow completion: every required workflow step is resolved as done, blocked, or not-applicable, with brief evidence or rationale.
- Step-level terminal completion: each numbered subtask must have explicit completion evidence (artifact, command output, or written rationale) before advancing.
- Verification completion: required checks/validations for this skill are executed, or any unavailable checks are explicitly called out with impact.
- Findings completion (where applicable): report only evidence-backed findings; if no high-confidence critical findings are present, explicitly state that.
- Loop completion: no actionable in-scope next step remains under the current objective.
Stop only after this terminal contract is satisfied; otherwise continue iterating.
Terminal state examples (adapt to skill)
done: canonical report page is created/updated with required section order, explicit opener answer, evidence-backed claims, and post-update verification checks.
blocked: Notion parent resolution, page ownership constraints, or upload/write capability prevents compliant publishing after bounded retries; blocker evidence and exact unblock steps are reported.
not-applicable: optional visuals/upload steps are explicitly skipped with rationale when the report has no visual requirements.
Core principles (must follow)
- Notion ownership gate is mandatory for existing-object mutations:
- allowed owners are Ollie ([email protected]) and Ollie (MCP)
- when available, verify ownership with stable created_by user/bot IDs (not display names alone)
- creating new pages/entries in approved shared destinations (for example Quant Blog) is allowed when explicit in-task user intent is present
- before mutating any existing object (update, move, delete, comment, append), verify ownership from created_by and resolved user identity
- after creation, enforce the same ownership checks on subsequent mutations of that created entry/page
- if ownership is ambiguous or cannot be verified, fail closed and stop with a blocker
- no fixed input schema required
- accept freeform evidence inputs: notes, logs, tables, plots, artifacts, and run outputs
- descriptive-first reporting by default: factual, evidence-backed, and non-prescriptive unless user explicitly requests guidance
- choose the report shape that best communicates the evidence: question/answer, objective/outcome, comparison-first, narrative walkthrough, or another structure when that is the clearest fit
- use explicit Question -> Answer flow when the real reader need is a concrete question and answer, but do not force Q/A scaffolding when another structure is clearer
- if a report is not naturally question-driven (for example operational status, incident summary, design comparison, or evidence ledger), state the objective and outcome plainly instead of forcing Q/A labels
- prefer direct, concise, final, objective wording; avoid story-like or theatrical phrasing
- avoid narration that adds mood but not evidence (for example "this gave a clear picture" or similar framing)
- remove time-window framing such as overnight unless that timing matters to interpretation
- unless failure behavior is the explicit experiment question, do not make failure-rate metrics the primary headline; treat them as reliability context supporting core outcomes
- one canonical page per report identity; refine in place (no v2/final forks)
- default publish destination for generated reports is Quant Blog (database row/page entries), unless the user explicitly asks for a different destination
- consolidation-first: keep the fewest report pages that still map cleanly to distinct report identities/questions
- plot-first for quantitative reporting; tables are supporting precision
- in Quant Blog, keep date in the Date property and keep Title as <clear human title> (no leading date token in title text)
- use rerun in titles only when the report explicitly links to the prior report/experiment being rerun
- avoid opaque shorthand/acronym-heavy naming in titles/headings/captions (for example coded labels like RB1200 or B500); prefer plain-language wording
- keep titles human-facing: do not include hash-like search/run IDs in titles (for example short hex tokens)
- when same-topic reports represent distinct runs, keep separate pages and differentiate titles with plain-language descriptors (scope/design/coverage), not internal IDs
- for experiment/search reports, record machine-facing IDs (search ID, run ID, cluster job IDs, artifact pointers) in a final Provenance Appendix, not the title
- explicit opener emphasis hierarchy for scanability:
- bold label-first opener
- concise high-signal callouts
- bold key metrics/status tokens only (no blanket bolding)
- opener clarity is mandatory:
- first Top Takeaways line must contain the clearest high-level framing for the report: either the explicit report question and direct answer, or the explicit objective and outcome
- opener wording must be self-contained plain language; a new reader should not need prior thread context
- when the opener is question-driven, use direct answer tokens (yes, no, inconclusive, or equivalent), not vague status-only wording
- if outcome is mixed, map it to explicit direct-answer or objective/outcome wording in the opener (for example runtime yes, performance no or objective met for runtime, not for performance)
- avoid opener text that names status without concrete content
- when editing existing reports, normalize legacy opener labels to the clearest opener framing for the actual report instead of forcing one fixed label pattern
- if more than one key question is covered, use numbered Question n / Answer n pairs with strict adjacency (each answer immediately follows its question)
- for multi-question or mixed-certainty investigations, add a dedicated Question Decomposition section immediately after Top Takeaways using strict Question n -> Answer n -> Status n triples, with one concrete empirical evidence line in each answer
- keep the opener focused on the true experiment objective; do not let reliability/failure-rate stats become the lead claim unless failure analysis is the stated question
- if user scope is a report collection/hub, audit every in-scope report page one-by-one (no sampling)
- legacy-summary fallback rule: if the first summary section is not Top Takeaways (for example Key Takeaways or Executive Summary), enforce the same explicit opener rule in that section's first line
- for collection-wide cosmetic requests, restrict edits to presentation and readability (formatting/emphasis/label normalization) and preserve reported numbers, conclusions, and claim text unless the user explicitly asks for content changes
- during readability/cosmetic passes, remove duplicate headings and repeated near-identical intro paragraphs when they are clearly redundant and the change does not alter meaning
- during refinement passes, aggressively remove duplicate sections, repeated claims, and repeated captions unless each repetition adds new evidence
- for comparative experiment reports, include a dedicated Eval Setup section directly after Experiment Definition
- Eval Setup must describe eval-time behavior in plain English (what checkpoint is evaluated, what is fixed vs randomized, and why the comparison is or is not apples-to-apples)
- never use the heading Eval Setup Clarified; use Eval Setup
- reports and report-generation helpers are ephemeral artifacts:
- keep under plan/
- never commit under tracked code paths (tools/, src/, experiments/, docs/)
- traceability must be shared-reader friendly:
- when relevant, include repository, branch, PR, and commit references so another reader can retrace the work
- never include personal machine paths or other user-specific local filesystem references in the report body
- when file/path references are useful, write them relative to the repository root (for example src/train.py, not repo-root/src/train.py)
- when a GitHub remote exists, prefer clickable GitHub links for repo, PR, commit, and file references
- if no GitHub link can be derived safely, fall back to repo name + branch + commit hash + repo-root-relative paths
- keep report presentation human-facing:
- reports should read as human-authored and human-consumable
- do not add Codex/Claude labels in report titles or main body sections
- keep exactly one small attribution footer note at the end containing Prepared with support from Codex and Claude. codex-managed: true
- if any extra Codex/Claude references are found during refinement, remove them before publish
- Notion-native formatting is mandatory:
- do not leave literal markdown control tokens in published Notion text blocks (for example `code` or **bold**)
- convert emphasis/code intent to Notion rich-text annotations before finalizing
- after publish, run a token-hygiene verification pass and normalize any residual markdown markers in place
- keep primary evidence expanded by default:
- do not hide core results, primary plots, or key findings inside collapsible/toggle blocks
- use toggle/collapsible blocks only when the user explicitly requests that layout
- if original numeric artifacts still exist, regenerate publication-grade plots from source data; do not use lightweight Notion-native chart specs (for example Mermaid xychart-beta) for quantitative findings
- do not upload report artifacts to third-party public/temporary file hosts (for example tmpfiles.org, catbox, imgur, file.io) unless the user explicitly approves
- default to Notion-managed file/image uploads for report visuals (prefer notion-upload-local when available); if upload is unavailable in the current toolchain, add Artifacts to attach placeholders rather than using unapproved external hosts
- hide local paths, hostnames, tokens, and secrets in report body text
- define acronyms on first use
- write for first-time readers with no project background; do not assume internal context
- minimize project-specific jargon in headings, captions, and body text; prefer plain everyday wording
- when technical terms are required for scientific precision, define them in simple language at first use
- keep section headings and plot labels self-explanatory so an outside reader can infer the purpose immediately
- define short encoded labels at first use and near the plot that uses them (for example n4, r8, or similar run-scale labels); do not assume the reader knows local shorthand
- define environment family names in plain English near first use; if the report uses internal environment labels, add a short reader-facing explanation beside them or in an adjacent definitions block
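The token-hygiene pass required by the Notion-native formatting rule can be approximated with a plain-text scan like the one below. It is a rough sketch: it assumes the published page has already been fetched and flattened into a list of plain-text strings (one per block), and the function name and the three patterns are illustrative choices, not a complete markdown detector.

```python
import re
from typing import List

# Markdown control tokens that should have become Notion rich-text annotations.
_MARKDOWN_TOKENS = {
    "bold": re.compile(r"\*\*[^*\n]+\*\*"),
    "inline code": re.compile(r"`[^`\n]+`"),
    "heading": re.compile(r"^#{1,6}\s", re.MULTILINE),
}


def find_residual_markdown(block_texts: List[str]) -> List[str]:
    """Return one finding per (block, token type) that still shows raw markdown."""
    findings = []
    for index, text in enumerate(block_texts):
        for label, pattern in _MARKDOWN_TOKENS.items():
            if pattern.search(text):
                findings.append(f"block {index}: residual {label} marker")
    return findings
```

An empty result means no known marker was found; it does not prove the page is clean, so a visual check of the published page is still worthwhile.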
Local-first mode (only when user explicitly requests)
- build a deterministic local bundle before Notion writes:
- report body (.html or .md)
- summary payload (.json)
- high-resolution plots (>=220 dpi)
- use full artifact rows for core aggregate/distribution visuals unless explicitly labeled sampled
- do not downsample plotted data by default; this is especially required for time-series plots
- preserve native time cadence for time-series plots by default (no implicit temporal aggregation/resampling, e.g. day->month)
- prefer high-fidelity rendered images (PNG/SVG) generated from source artifacts over inline chart DSL blocks
- use Mermaid/Notion-native charts only as an explicit fallback when source data is irretrievably unavailable
- if downsampling or temporal aggregation is unavoidable for tooling/render limits, disclose it explicitly in both caption and methodology (method + factor + why)
- validate local quality before publishing:
- explicit axis labels + units
- readable legends
- no overlap/clipping
- decision-useful without zooming
- store local bundles in plan/artifacts/
- keep temporary helper scripts in plan/experiments/ (or similar ephemeral plan/ path)
Allowed-owner existing pages only (must follow)
Codex may create new pages/database entries in approved destinations when explicit user intent exists. Codex may mutate existing pages/database entries only when they are owned by the allowlisted principals: Ollie ([email protected]) and Ollie (MCP).
Implementation:
- creating new entries/pages in approved destinations is allowed when user intent is explicit
- for existing content, verify ownership before any mutation
- if ownership is non-allowlisted, treat as read-only unless explicit user approval overrides
- if ownership is ambiguous/unverified, fail closed and ask/mark blocked
- for Codex-authored report pages, keep one durable footer marker near the end: Prepared with support from Codex and Claude. codex-managed: true
- do not add title suffixes or body labels such as (Codex-managed regenerated) or Only edit this page if this marker is present.
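The three ownership outcomes above (allowed, read-only, blocked) can be expressed as one small decision function. This sketch assumes the page metadata is a dict shaped like a Notion page object, where created_by holds a partial user with an id field; the function name and the placeholder IDs in the usage lines are hypothetical.

```python
from typing import Any, Dict, Set


def ownership_decision(page: Dict[str, Any], allowlisted_ids: Set[str]) -> str:
    """Classify a mutation request on an existing page.

    "allowed"   -> creator ID is on the allowlist
    "read-only" -> creator is known but not allowlisted
    "blocked"   -> creator cannot be determined, so fail closed
    """
    creator = page.get("created_by")
    creator_id = creator.get("id") if isinstance(creator, dict) else None
    if not creator_id:
        return "blocked"
    if creator_id in allowlisted_ids:
        return "allowed"
    return "read-only"


# Usage with placeholder IDs (real IDs come from the workspace, not from this sketch).
ALLOWLIST = {"user-id-ollie", "bot-id-ollie-mcp"}
decision = ownership_decision({"created_by": {"id": "user-id-ollie"}}, ALLOWLIST)
```

The check keys on stable IDs only; display names are never compared, which matches the rule that names are labels.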
Parent/location routing (must follow)
- If user provides a Notion URL/ID, use that parent.
- Otherwise, default to Quant Blog data source/database for generated reports.
- Treat legacy project Reports pages as source/reference locations unless the user explicitly requests writes there.
- If still unclear, ask one short clarification question.
Publishing constraint:
- for Quant Blog publishing, create/update a database entry with Project, Date, and Title set correctly (title without leading date prefix)
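The title rules (no leading date token, no hash-like IDs) can be checked before publishing with a heuristic like this one. It is only a sketch: the function name and both patterns are assumptions, and the hex check will occasionally flag an ordinary word that happens to mix hex letters and digits.

```python
import re
from typing import List

# A date such as 2024-05-01 (also with / or .) at the very start of the title.
_LEADING_DATE = re.compile(r"^\s*\d{4}[-/.]\d{2}[-/.]\d{2}\b")
# A word of 6+ hex characters that contains both a digit and a letter a-f.
_HEX_TOKEN = re.compile(
    r"\b(?=[0-9a-f]*\d)(?=[0-9a-f]*[a-f])[0-9a-f]{6,}\b", re.IGNORECASE
)


def title_problems(title: str) -> List[str]:
    """Return the title rules this title appears to break (empty list if none)."""
    problems = []
    if _LEADING_DATE.search(title):
        problems.append("leading date token")
    if _HEX_TOKEN.search(title):
        problems.append("hash-like ID")
    return problems
```

A flagged date belongs in the Date property and a flagged ID belongs in the Provenance Appendix.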
Current known hubs in this environment (use when relevant):
- Quant Blog data source/database: https://www.notion.so/31af1d57cab8802ebd19ce8d1fdf28c1
- GigaPlay hub: https://www.notion.so/GigaPlay-2edf1d57cab880209415f67e7c65414f
- GigaPlay reports: https://www.notion.so/305f1d57cab8802eaac6d100dc242c5d
- Mercantile hub: https://www.notion.so/Mercantile-306f1d57cab8802a8b5cc2b513780742
- Mercantile reports: https://www.notion.so/310f1d57cab880149491fb155d0b3e30
- Ollie Notes fallback: https://www.notion.so/Ollie-Notes-21df1d57cab880e2af99fc37c6062b2e
Scope and input handling
- if user gives directories: scan recursively for relevant artifacts
- if user gives file list: use the list
- if mixed: combine
- if scope intent is ambiguous: ask one short clarification question
- for remote content: use surfaced local copies, or ask for fetch/copy command
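The directory, file-list, and mixed cases above can be handled by one collector. This is a minimal sketch: the function name and the suffix set are assumptions, and "relevant" is reduced here to a file-extension filter for directory scans only.

```python
from pathlib import Path
from typing import Iterable, List

# Hypothetical set of suffixes treated as evidence artifacts in directory scans.
ARTIFACT_SUFFIXES = {".md", ".json", ".csv", ".log", ".png", ".svg"}


def collect_artifacts(inputs: Iterable[Path]) -> List[Path]:
    """Scan directories recursively; keep explicitly listed files as given."""
    found = set()
    for item in inputs:
        if item.is_dir():
            found.update(
                path
                for path in item.rglob("*")
                if path.is_file() and path.suffix.lower() in ARTIFACT_SUFFIXES
            )
        elif item.is_file():
            found.add(item)  # the user named this file, so no suffix filter
    return sorted(found)
```

Paths that are neither files nor directories are skipped silently here; in practice they are worth reporting as a scope question.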
Collection audit mode (must follow when requested)
- if the user asks to ensure quality across all reports in a hub/project, enumerate all report pages in scope first
- verify opener clarity on each page individually and patch each non-compliant page
- do not stop after spot-checks; completion requires all in-scope pages checked
- re-fetch each in-scope page after the pass and confirm the opener line contains an explicit question + direct answer, or an explicit objective + outcome for reports that are not question-driven
Scientific/empirical report structure (default)
1) Top Takeaways (must be first section)
- use heading ## Top Takeaways as first section
- first line must include the clearest opener for the report:
- question-driven reports: explicit question + direct answer token
- non-question-driven reports: explicit objective + outcome
- preferred opener formats:
- Question + direct answer: <explicit question>? <direct answer>.
- Objective + outcome: <explicit objective>. <explicit outcome>.
- disallowed opener pattern: status-only wording with no concrete question/objective text
- use question/answer lines as the main driver only when that is the clearest fit for the report
- if another structure explains the evidence better, use it instead and keep the opener explicit
- legacy heading fallback: if first summary heading is Key Takeaways/Executive Summary, apply the same first-line opener rule there
- include the most important outcome immediately
- if there is a clear before/after baseline relation, include explicit delta line (before -> after, absolute + relative where available)
- apply emphasis hierarchy:
- bold label-first opener
- concise callouts for major win/loss tradeoff when both exist
- selective bolding of key metrics/status only
- if multiple questions are answered, list numbered Question n then Answer n lines directly under the opener, keeping each answer immediately after its paired question
- when the report includes several sub-questions or mixed answer certainty, include a dedicated Question Decomposition section immediately after Top Takeaways with Question n -> Answer n -> Status n ordering and one empirical evidence line per answer
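For question-driven reports, the first-line opener rule can be spot-checked with a small heuristic. This sketch only recognizes the literal tokens yes, no, and inconclusive after the question mark; equivalent wording, and objective + outcome openers, still need a human read. The function name is hypothetical.

```python
import re

_ANSWER_TOKEN = re.compile(r"\b(yes|no|inconclusive)\b", re.IGNORECASE)


def has_direct_answer(first_line: str) -> bool:
    """True when the line states a question and a direct answer token follows it."""
    question, mark, answer = first_line.partition("?")
    return bool(mark and question.strip() and _ANSWER_TOKEN.search(answer))
```

A status-only opener fails the check because it contains no question mark, and a question followed by vague wording fails because no answer token appears after it.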
2) Experiment Definition (for experiment/search reports)
Include:
- question/hypothesis being tested
- what varied (search space)
- what was fixed (controls/constants)
- how samples were assigned/swept/randomized
- why this design was used
- answer wording that directly maps to the question terms (no implicit inference required)
- explicit answer status for this batch (yes/no/inconclusive or equivalent) with brief evidence-backed why
3) Eval Setup (for experiment/search reports)
- place immediately after Experiment Definition and before any results/plots section
- explicit eval-time setup details in plain English:
- which checkpoint/policy snapshot is evaluated
- exact eval composition and overrides (for example opponent setup or learned-agent count)
- what is fixed at eval vs re-sampled/randomized
- domain-randomization behavior at eval for key variables (for example fees and starting capital): fixed value(s) vs sampled range(s), and sampling cadence
- explicit apples-to-apples statement across compared runs/batches
- if any detail is unknown, mark it unknown and include it in Limitations / Reliability
4) Early Results Section
- place early, after Eval Setup
- include compact high-signal visuals first
- prioritize visuals that answer the report question directly
- use a plain-language heading that fits the report, such as Quick Aggregate Snapshot or another direct results label; avoid corporate/executive wording unless the page already uses it consistently
- order visuals from simple to more complex, or from broad context to the more detailed evidence the reader needs next
- if broad aggregate plots are too coarse or hide too much, demote them to a quick scan only and make run-level or matched-comparison plots the main evidence section
- when aggregate plots remain, state clearly that they are a first scan and not the main evidence
5) Definitions and Methodology
- define domain terms used in the report context
- define formulas for derived metrics and signs/directions
- include unit unavailable placeholders for unknown units; do not guess
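As an example of fixing a formula and sign convention for a derived metric, the before -> after delta line used in Top Takeaways can be built as shown here. The convention is an assumption made for this sketch: absolute change is after minus before, and relative change uses the absolute value of before as the percent base. The function name and output layout are illustrative.

```python
def delta_line(label: str, before: float, after: float, unit: str = "") -> str:
    """Format 'before -> after' with absolute and, where defined, relative change.

    Sign convention: change = after - before, so a negative value is a decrease.
    The relative change is omitted when before is 0 (no valid percent base).
    """
    absolute = after - before
    parts = [
        f"{label}: {before:g}{unit} -> {after:g}{unit}",
        f"absolute {absolute:+g}{unit}",
    ]
    if before != 0:
        relative = absolute / abs(before) * 100
        parts.append(f"relative {relative:+.1f}%")
    return ", ".join(parts)
```

Whether a negative change is good or bad depends on the metric, so the surrounding text should still say which direction is better.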
6) Limitations / Reliability
- use this heading text exactly: Limitations / Reliability
- include missing data, incomplete runs, comparability gaps, and caveats
- keep reliability emphasis proportional to impact
- do not make completion/failure the hero narrative unless it materially changes conclusions
- when failure rates are not the explicit research target, report them as reliability support (not the primary report focal point)
7) Run Spotlight (when run-level time-series artifacts exist)
- include 1-3 explicitly selected runs
- when run-level traces are the most decision-useful evidence, make this section the main evidence section rather than a side note
- when a project already has a standard single-run plot set, include that standard set unless there is a strong reason not to
- if both standard single-run plots and fuller all-points plots exist, keep both when they answer different reader needs and explain the role of each section
- when a project has true one-episode or one-run evaluation charts, include them explicitly; do not treat training-path or checkpoint-path plots as a substitute
- if both actual episode charts and training/checkpoint plots exist, label the sections so the reader can tell the difference immediately
- show normalized trajectory views when possible (for example growth from 1.00x)
- for return trajectories, prefer relative capital units and state the scale explicitly in captions (for example 1.0 = +100% net return)
- for turnover/exposure trajectories, prefer relative units (x capital/day, turns/day, x equity) and define directional interpretation for signed exposure
- when trace artifacts exist, do not stop at a single compressed overview image
- include at least one dedicated performance-path figure for each spotlighted run with:
- capital multiple or equivalent PnL trajectory
- underwater or max-drawdown path
- daily turnover multiple or equivalent daily activity path
- include at least one dedicated inventory/book-shape figure for each spotlighted run with:
- per-symbol position or target-position time series
- gross and net exposure time series
- if both target and realized positions are available, prefer showing both somewhere across the spotlight package; if only one can be shown clearly, prefer the more decision-relevant one and state that choice
- if unavailable, explicitly call out artifact gap in limitations
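The normalized capital path and the underwater path named above can be derived from a raw equity series as follows. This is a sketch with hypothetical function names; it assumes one equity value per native time step and does no resampling.

```python
from typing import List, Sequence


def capital_multiple(equity: Sequence[float]) -> List[float]:
    """Rescale an equity series so the first point is 1.00x."""
    if not equity or equity[0] <= 0:
        raise ValueError("need a series with a positive starting value")
    start = equity[0]
    return [value / start for value in equity]


def underwater(multiples: Sequence[float]) -> List[float]:
    """Drawdown from the running peak as a fraction (0.0 at a peak, negative below)."""
    path: List[float] = []
    peak = float("-inf")
    for value in multiples:
        peak = max(peak, value)
        path.append(value / peak - 1.0)
    return path
```

The minimum of the underwater path is the maximum drawdown, so both can come from the same series without a second pass over source data.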
8) Provenance Appendix (for experiment/search reports)
- place this section at the end of the report
- include traceability details (for example search IDs, run IDs, cluster job IDs, key artifact locations)
- keep these machine-facing identifiers out of the title and primary narrative unless needed for interpretation
Evidence discipline (must follow)
- every quantitative claim must map to explicit evidence artifacts
- maintain internal claim map (claim_id -> artifact/source) while drafting
- for causal/attribution claims, include isolating evidence or clearly label as associative
- for paired comparisons, report overlap/sample counts used for each pair
- when failure rates are non-trivial, run a basic selection-bias stress check
- preserve numerical/sign integrity (A - B direction, units, percent bases)
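The claim map can be kept as a plain dictionary and checked before publishing. In this sketch the function name is hypothetical, and the claim IDs and artifact path in the usage lines are made up for illustration.

```python
from typing import Dict, List


def unmapped_claims(claim_ids: List[str], claim_map: Dict[str, List[str]]) -> List[str]:
    """Return claims that have no evidence artifact recorded against them."""
    return [claim_id for claim_id in claim_ids if not claim_map.get(claim_id)]


# Usage: every quantitative claim in the draft should map to at least one source.
claim_map = {"runtime-delta": ["plan/artifacts/summary.json"], "win-rate": []}
missing = unmapped_claims(["runtime-delta", "win-rate", "drawdown"], claim_map)
```

A claim with an empty source list is treated the same as a claim that is absent from the map.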
Visual and table standards (must follow)
For plots:
- clear title with metric + cohort/slice
- axis labels with explicit units
- axis labels must appear in the plot block itself (not caption-only)
- clear legend/series semantics (or explicit single-series label)
- caption must start with concise Takeaway: lead-in, then brief support
- replace raw upload/file-name captions with plain-English captions; keep machine-like run IDs in the appendix unless they are needed for interpretation
- include direction cue where helpful (higher is better / lower is better)
- avoid unreadable layouts (overlap, clipped labels, ambiguous scales)
- prefer plain-English titles and labels over team shorthand; define any unavoidable technical term in simple words nearby
- keep captions direct and objective; do not use story-like scene-setting or hype language
- order plot groups so the reader can follow them easily from basic to harder, and from broad scan to detailed proof
- if broad aggregate visuals hide too much variation, do not let them dominate the page; prefer per-run, matched, or progression plots instead
- when a timing plot shows raw duration rather than normalized speed, say that plainly in the caption or immediately next to the plot
- when a timing plot is not apples-to-apples on its own, point the reader to the normalized speed plot or the fair comparison plot
- if a run-level plot uses training-log rows that mix outer iterations and inner training steps, say that plainly so repeated or rising timing lines are not misread
For tables:
- clear title
- explicit column headers + units
- clear derived-field names
- short caption with Takeaway: lead-in
General:
- plot-first for quantitative interpretation
- prefer fewer high-signal visuals over many redundant ones
- prefer relative/normalized units in primary visuals when both raw and relative forms exist (keep raw units as secondary context)
- when full-resolution full-detail run plots exist, prefer them as the core evidence section instead of relying mainly on broad summaries
- no downsampling by default, especially for time-series visuals
- no implicit temporal aggregation for time-series visuals; keep native cadence unless explicitly justified
- avoid Mermaid xychart-beta (or equivalent low-fidelity DSL charts) for empirical metric plots when source artifacts exist
- if downsampling or temporal aggregation is unavoidable, disclose exact method/factor and expected visual impact
Cross-project refinement loop (must follow)
- keep the skill project-agnostic: domain specifics are overlays, not defaults
- benchmark draft quality against the strongest prior report style available for the current project/domain
- keep what improved readability/interpretability and remove low-signal sections/visuals
- use selective emphasis only (bold key conclusions/metrics/status), not blanket formatting
- ensure the final report remains descriptive-first unless recommendations were explicitly requested
Optional domain overlays
Use only when relevant to the current project/domain.
Trading/market experiments (optional overlay)
- include trader/quant lens and ML/DL lens in Top Takeaways
- common metrics may include PnL, Sharpe, drawdown, turnover, execution intensity, and robustness
- prefer interpretable intensity metrics (for example turns/day) over raw fill counts unless fill mechanics are the focus
Cluster-backed artifacts (optional overlay)
- verify run-level artifact availability directly (read-only) before declaring missing
- when available, use retained forensics/root-cause fields for failure attribution rather than wrapper labels only
Claude-assisted drafting
Default for report wording/structure refinement when Claude is available.
- command: claude -p --model opus --effort high
- Codex remains source-of-truth for data prep, validation, and Notion writes
- Claude is wording/structure assist only
- pass explicit evidence only; no invented values
- sanitize prompts (paths/hosts/secrets/private IDs)
- prefer repo-root-relative references and shared web links over local absolute paths
- require structured output variants + self-check
- run Codex verification gate before applying
- if Claude is unavailable, continue with Codex-only refinement and note that constraint in the task summary
Codex verification gate:
- all claims trace to evidence
- units/signs unchanged and correct
- no scope drift or unsupported claims
- no path/host leakage
- traceability references are shareable (repo-root-relative paths, repo/branch/PR/commit references, GitHub links when available)
- section-order constraints preserved
Notion MCP workflow (must follow)
Use notion-local for page/database operations and notion-upload-local for direct image/file uploads.
- Connectivity check:
- call mcp__notion-local__API-get-self once per task
- verify the active integration identity is Ollie (MCP); if not, fail closed and report blocked
- Ownership preflight (required before writes):
- allowlist owners: Ollie ([email protected]) and Ollie (MCP)
- when available, verify ownership using stable created_by user/bot IDs and treat display names as labels only
- creating new entries/pages in approved shared destinations is allowed when explicit user intent is present
- for existing target objects, read metadata and verify that the created_by principal is allowlisted before mutating
- for create actions in shared databases/workspaces, require explicit user intent for that destination; then enforce ownership checks on the created/updated entry itself
- if ownership cannot be verified with high confidence, fail closed and report blocked
- If visuals are required, verify notion-upload-local upload capability before publishing image blocks.
- Run at least one Claude wording/structure pass when available, including prompt context for:
- reader-first plain-English writing with minimal jargon
- required section order (Top Takeaways -> Experiment Definition -> Eval Setup -> results)
- explicit eval-time setup clarity and apples-to-apples statement for comparative reports
- Resolve parent:
- normalize/verify user-supplied URL/ID with mcp__notion-local__API-retrieve-a-page
- otherwise enumerate/search candidate parents and validate
- Create/update canonical page:
- derive report identity from evidence (experiment/search/run IDs, batch label/date, variant)
- when publishing to Quant Blog, search existing rows by Project + Title (+ Date when available)
- create only if no matching canonical row exists
- update in place otherwise
- avoid creating duplicates for title wording changes
- ensure Quant Blog properties are correct (Project, Date, Title without leading date prefix)
- include one small footer attribution note at the end with: Prepared with support from Codex and Claude. codex-managed: true
- before finalizing, confirm no other Codex/Claude references remain in title/body
- when relevant, add a compact traceability section or Provenance Appendix entries that include:
- repository identity
- branch name
- PR link/number when applicable
- commit hash references for the analyzed or reported state
- repo-root-relative file paths for important touched files/artifacts
- clickable GitHub links when a GitHub remote exists
- for experiment/comparison reports, verify eval-time configuration details against available experiment definitions/search spaces/results artifacts before claiming comparability
- when performing cosmetic-only passes, avoid rewriting analytical content; normalize rich-text formatting and opener labels in place
- Upload visuals:
- prefer the notion-upload-local tool path for direct Notion-managed uploads
- if upload tooling is unavailable, add Artifacts to attach placeholders rather than external-host workarounds
- place visuals in the main narrative section (Executive Visual Snapshot) rather than appending them after footers/appendices
- after upload, verify visuals are anchored at intended locations and remove misplaced duplicate image blocks
- after upload, replace default raw file-name captions with plain-English Takeaway: captions or a short human-readable description
- when updating an existing visual, archive/delete the superseded block so only one current plot version is visible per plot identity
- verify replacement integrity by re-fetching both superseded and new block IDs (old = archived, new = active)
Implementation note for Notion MCP block updates: the API-update-a-block payload must use the concrete block-type key directly (for example paragraph, heading_2, bulleted_list_item) rather than a nested type wrapper
- Deduplication behavior:
- keep exactly one canonical page per report identity in target reports location
- prefer consolidation into fewer pages when multiple pages represent the same identity/question
- when duplicates exist for same data identity, retain one canonical page; move/archive others only with user approval
- when pages are same topic but distinct run identities, retain both and rename for clear human differentiation
- Output contract:
- return created/updated Notion page URL (or ID)
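The ownership preflight and the block-update payload note from the workflow above can be sketched as two small helpers. This is an illustrative sketch under stated assumptions: the allowlist IDs are placeholders (the real principal IDs are not given here), and the helper names is_owned_by_allowlisted and build_block_update are hypothetical. The created_by object with an id field and the top-level block-type key follow the public Notion API object shapes.

```python
# Placeholder principal IDs (assumption): substitute the stable user/bot IDs
# of the allowlisted owners; display names are labels only.
ALLOWED_CREATOR_IDS = {"<owner-user-id>", "<mcp-bot-id>"}


def is_owned_by_allowlisted(notion_object: dict, allowed_ids: set[str]) -> bool:
    """Fail closed: return True only when created_by.id is present and allowlisted."""
    created_by = notion_object.get("created_by") or {}
    creator_id = created_by.get("id")
    return creator_id is not None and creator_id in allowed_ids


def build_block_update(block_type: str, text: str) -> dict:
    """Build an update-a-block body with the block-type key at the top level.

    Covers text-bearing block types such as paragraph, heading_2,
    and bulleted_list_item.
    """
    return {
        block_type: {
            "rich_text": [{"type": "text", "text": {"content": text}}],
        }
    }
```

A caller would read the target's metadata first, skip the write and report blocked when is_owned_by_allowlisted returns False, and only then send the body from build_block_update.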
Image embedding constraints (Notion reality)
- preferred rendering path is Notion-managed file/image uploads
- do not use third-party public file hosting by default
- external hosting is allowed only with explicit user approval and should prefer first-party/org-controlled infrastructure
- data: URIs and local-only schemes are not durable in Notion
- after write, re-fetch and verify image/file blocks still have non-empty URLs and non-zero payloads at verification time
Embedding sequence:
- notion-upload-local direct file/image upload (Notion-managed)
- browser upload fallback (if available) on Codex-managed page (Notion-managed)
- reuse existing Notion-managed file/image blocks when valid
- explicitly approved first-party/org-controlled https:// hosting path
- if all fail, add an Artifacts to attach section with neutral labels + meaning (do not use unapproved third-party hosts)
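The embedding sequence above is an ordered fallback chain. A minimal sketch of that control flow follows; the strategy callables are stand-ins (assumptions) for the real upload paths, and embed_with_fallback is a hypothetical name, not a function of any Notion tool.

```python
from typing import Callable, Optional

# Each strategy takes a local artifact path and returns a usable reference
# (for example a Notion-managed file ID) on success, or None on failure.
Strategy = Callable[[str], Optional[str]]


def embed_with_fallback(
    artifact: str, strategies: list[tuple[str, Strategy]]
) -> tuple[str, Optional[str]]:
    """Try each strategy in order; fall back to a placeholder when all fail."""
    for name, attempt in strategies:
        try:
            ref = attempt(artifact)
        except Exception:
            ref = None  # a raised error counts as a failed attempt
        if ref:
            return name, ref
    # Nothing worked: the caller should add an "Artifacts to attach" entry.
    return "artifacts_to_attach", None
```

The caller passes the strategies in the order listed above, so unapproved third-party hosts never enter the chain.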
Iteration loop and finish criteria
- run multi-pass reader-centric refinement:
- pass 1: structure and narrative arc
- pass 2: skeptical-reader questions and caveats
- pass 3: clarity and emphasis hierarchy
- in every pass, check for and remove duplicate information, stale wording that no longer matches the page structure, and headings that are less clear than a simpler plain-English alternative
- after each pass: update same canonical page and re-fetch to verify landing
- done when:
- required sections present
- opener clearly states the main question and answer, or the main objective and outcome, whichever better matches the report
- opener framing is unambiguous to a cold reader
- top-of-page readback is clean: no duplicate summary headings, no repeated intro blocks, and the opening screen reads naturally in plain English
- when multiple questions are present, Q/A ordering is explicit and adjacent (Question n then Answer n)
- when Question Decomposition is present, each Question n has adjacent Answer n and Status n lines, and each Answer n includes one concrete empirical evidence line
- experiment/comparison reports include a dedicated Eval Setup section immediately after Experiment Definition
- Eval Setup includes explicit fixed-vs-randomized behavior and an apples-to-apples statement (or explicit mismatch caveat)
- empirical/comparison report section order is explicit and preserved: Top Takeaways -> Experiment Definition -> Eval Setup -> Early Results Section -> Definitions and Methodology -> Limitations / Reliability (+ Run Spotlight when applicable)
- when both broad aggregate plots and full-detail run plots are present, the page should make clear which section is the quick scan and which section is the main evidence
- short run-scale labels, environment names, timing-metric meanings, and the compute shape behind the compared points are explained at first use and near the relevant plots
- when standard single-run plots exist for the project, the report includes them or explicitly says why they were skipped
- experiment/search reports include a final Provenance Appendix with machine-facing identifiers
- where repo context matters, traceability references are present and shareable: repo, branch, PR, commit, and repo-root-relative path references with GitHub links when available
- claims are evidence-grounded
- formatting hygiene checks pass (no residual literal markdown markers like `...` or **...** in published block text)
- visuals are in their intended section context (usually Executive Visual Snapshot), not detached at page end
- visual/table labeling standards met
- no privacy leakage
- no actionable quality gaps remain
- for collection-scope tasks, every in-scope report page was individually checked and opener clarity is verified
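The section-order criteria above lend themselves to a mechanical check on the page's heading list. The sketch below covers only the first three fixed headings and the rule that Eval Setup sits immediately after Experiment Definition; the function name check_section_order is hypothetical, and later sections are left out because their exact heading text may vary by report.

```python
# Heading names come from the finish criteria; only the first three are treated
# as fixed literals here (an assumption for this sketch).
REQUIRED_ORDER = ["Top Takeaways", "Experiment Definition", "Eval Setup"]


def check_section_order(headings: list[str], required: list[str] = REQUIRED_ORDER) -> list[str]:
    """Return a list of human-readable problems; an empty list means the order is fine."""
    problems = []
    positions = []
    for name in required:
        if name not in headings:
            problems.append(f"missing section: {name}")
        else:
            positions.append((headings.index(name), name))
    # Sections that are present must appear in the required relative order.
    for (pos_a, name_a), (pos_b, name_b) in zip(positions, positions[1:]):
        if pos_a > pos_b:
            problems.append(f"{name_a} appears after {name_b}")
    # Eval Setup must directly follow Experiment Definition.
    if "Experiment Definition" in headings and "Eval Setup" in headings:
        if headings.index("Eval Setup") != headings.index("Experiment Definition") + 1:
            problems.append("Eval Setup is not immediately after Experiment Definition")
    return problems
```

Running this on the headings re-fetched after each refinement pass gives a quick signal before the manual readback.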
Quality checklist (core)
- Top Takeaways is first
- opener uses the clearest framing for the report: explicit Question + direct answer when question-driven, otherwise explicit objective/outcome framing
- opener includes explicit question text when question-driven, or explicit objective text when objective-driven
- when Question Decomposition exists, it uses strict Question n -> Answer n -> Status n ordering with one empirical evidence line per answer
- vague status-only opener answers (for example mixed, answered, improved) are non-compliant unless mapped to explicit direct-answer tokens
- opener framing is self-contained plain language (no shorthand requiring external context)
- report naming is human-readable (<clear human title>) and avoids opaque shorthand/acronym-heavy labels
- report date is stored in the Date property when using Quant Blog
- report titles do not include hash-like run/search IDs
- machine-facing traceability IDs are captured in a final Provenance Appendix (for experiment/search reports)
- when repo context matters, reports include shareable repository references: repo, branch, PR, commit, and repo-root-relative paths; clickable GitHub links are preferred when available
- rerun appears in a title only when the report links to the prior report/experiment it reruns
- ownership is verified for any existing mutation target ([email protected] or Ollie (MCP))
- failure-rate metrics are present for reliability context but are not headline focal points unless failure behavior is the explicit research question
- no vague status-only opener wording (for example answered) without explicit question text
- legacy first-summary sections (Key Takeaways/Executive Summary) apply the same explicit opener rule when Top Takeaways is absent
- opener emphasis hierarchy is correct
- no duplicate summary headings or repeated intro paragraphs remain after refinement
- Experiment Definition includes what varied/fixed/how/why/answer
- Eval Setup appears immediately after Experiment Definition and before results/plots sections
- heading uses Eval Setup (never Eval Setup Clarified)
- the main results section uses plain-language headings and a plot order that is easy to follow from simple to more complex
- broad aggregate plots do not dominate the report when they are too coarse to support the main conclusions
- duplicate sections, repeated intros, and stale wording are removed
- short plot labels like n, r, or local environment names are explained clearly next to or on the relevant plots
- timing plots clearly distinguish raw seconds from normalized rates
- timing plots state the compute shape behind each point when that shape matters for interpretation (for example CPU-only vs GPU, CPUs, memory, wall-clock limit, or same-job-shape note)
- standard single-run plots are present when available and useful
- true individual episode plots are present when available and are clearly separated from training-path or checkpoint-path plots
- experiment/comparison reports include explicit eval-time setup details:
- checkpoint selection at eval
- eval composition/overrides (for example learned-agent count/opponent setup)
- domain-randomization behavior for key variables (for example fees/starting capital) at eval
- explicit apples-to-apples comparability statement
- Executive Visual Snapshot is present for quantitative reports and appears after Eval Setup
- captions use Takeaway: lead-ins
- captions are human-readable and do not default to raw uploaded file names
- axis labels/units/legends are explicit and readable
- plotted series are not downsampled (or any unavoidable downsampling is explicitly disclosed with method/factor/why)
- time-series plots keep native artifact cadence (or any aggregation is explicitly disclosed with method/factor/why)
- empirical charts are regenerated from original artifacts (no low-fidelity Notion-native chart DSL when source data exists)
- limitations/reliability caveats are present and proportional
- report is descriptive-only unless user explicitly requested recommendations
- no table-of-contents block
- no local path/hostname leakage
- no personal machine paths in title/body/appendix; use repo-root-relative paths instead
- exactly one footer attribution marker is present (Prepared with support from Codex and Claude. codex-managed: true) and there are no other Codex/Claude references in title/body
- no literal markdown emphasis/code marker text remains in published Notion blocks (**...**, `...`)
- visuals are positioned in the main report flow (typically under Executive Visual Snapshot) and not appended after appendix/footer
- primary evidence sections are expanded in-place and not hidden behind toggles/collapsible sections unless explicitly requested
- no third-party public file host usage unless explicitly approved by the user
- image/file blocks are Notion-managed by default and resolve to non-empty payloads at verification time
- if visuals are expected, upload capability is verified before publish (or placeholders are used when unavailable)
- when run-level traces exist, Run Spotlight includes separate performance-path and inventory/book-shape visuals rather than only a single overview image
- for collection-scope tasks, all in-scope report pages were audited one-by-one (no sampling)
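Two of the checklist items above (no literal markdown markers, exactly one footer attribution marker with no other Codex/Claude references) can be approximated with a text scan over the re-fetched block contents. This is a rough sketch, assuming the page has already been flattened into a list of plain-text strings; hygiene_problems is a hypothetical helper name and the regexes are simple heuristics.

```python
import re

FOOTER = "Prepared with support from Codex and Claude. codex-managed: true"
BOLD_MARKER = re.compile(r"\*\*[^*\n]+\*\*")   # literal **bold** left in text
CODE_MARKER = re.compile(r"`[^`\n]+`")         # literal `code` left in text
TOOL_NAME = re.compile(r"\b(?:Codex|Claude)\b", re.IGNORECASE)


def hygiene_problems(block_texts: list[str]) -> list[str]:
    """Return problems found in the plain text of a page's title and blocks."""
    problems = []
    footer_count = 0
    for index, text in enumerate(block_texts):
        if BOLD_MARKER.search(text) or CODE_MARKER.search(text):
            problems.append(f"block {index}: literal markdown marker")
        footer_count += text.count(FOOTER)
        # Ignore the footer itself when looking for stray tool references.
        remainder = text.replace(FOOTER, "")
        if TOOL_NAME.search(remainder):
            problems.append(f"block {index}: Codex/Claude reference outside footer")
    if footer_count != 1:
        problems.append(f"expected exactly one footer marker, found {footer_count}")
    return problems
```

The scan is intentionally strict: a legitimate mention that happens to match one of the heuristics will be flagged and needs a human decision.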
Quality checklist (optional overlays)
Apply only if relevant:
- trading overlay: trader/quant + ML/DL dual-lens takeaway coverage
- cluster overlay: artifact-availability verification before missing-artifact claims
- causal overlay: isolated evidence or explicit associational caveat