versa/skills/notebook-osm/SKILL.md
Standalone marimo notebook (osm-monaco-viz.py) that self-authors TWO Airflow DAGs (osm + gtfs), triggers them via REST, runs polars + pyarrow analytics on both datasets, and renders TWO maps: streets via MapLibre GL JS + martin vector tiles + 3D terrain, transit via folium with 98 bus-stop CircleMarkers. Use when working with the notebook content itself, the dual-DAG self-authoring pattern, the two URL spaces (server-side AIRFLOW_API_INTERNAL_URL vs browser-bound MARTIN_PUBLIC_URL), the MapLibre/folium rendering split, or the surfaced-and-fixed bug catalog.
npx skillsauth add overthinkos/overthink-plugins notebook-osmInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
A pure data layer that ships ONE file: a 10-cell self-contained marimo notebook demonstrating the full marimo stack end-to-end.
The notebook self-authors its own Airflow DAGs at runtime — no DAG files are baked into the image. It writes both the OSM pipeline DAG and the GTFS transit pipeline DAG into the airflow DAGs folder, triggers them in parallel via the REST API, polls both to success, then runs analysis + visualisation cells.
| Property | Value |
|----------|-------|
| Type | Pure data layer (no packages, no services, no tasks) |
| Dependencies | none (file-only) |
| Data | data/notebooks/ → workspace:notebooks/ (deploy-time bind/seed) |
| Eval | notebook-osm-viz-present — file exists in volume |
| Notebook | /workspace/notebooks/osm-monaco-viz.py (~20 cells) |
| # | Purpose | Returns |
|---|---|---|
| 1 | imports | mo, os, time, textwrap, requests, Path, polars, folium |
| 2 | _resolved_urls — diagnostic cell: renders a polars DataFrame of the seven URL env vars (purpose / env_var / value / side / is_default) and exports each as a typed value so downstream map cells consume them as parameters instead of re-reading os.environ per cell. The user's first window into "what URLs am I actually using" — verifies port: [auto] allocations and any cross-pod URL overrides without inspecting charly.yml. | urls, martin, airflow_api_internal, airflow_public, airflow_dags_dir, versatiles_public, versatiles_style, versatiles_assets |
| 3 | markdown header (mo.md) | rendered intro + URL strategy table |
| 4 | self-author all six DAG files (osm + gtfs + gpqtiles + duckdb-mvt + duckdb-freestiler + shortbread) | dag_files, dag_ids = [6 ids], dags_dir |
| 5 | trigger ALL SIX DAGs in parallel + poll until each succeeds | dag_run_states = {6 ids → success} |
| 6 | OSM GPU geom_kb histogram via cudf-polars-cu13 | df_gpu, parquet_path |
| 7 | OSM tag-key histogram via pyarrow (workaround) | df_tags |
| 8 | DuckDB Spatial geom-type counts (INSTALL spatial) | df_duckdb_spatial |
| 9 | polars-st bounding-box area top-10 (.st.* namespace) | df_polars_st |
| 10 | geopolars version + public-API inventory (alpha tripwire) | df_geopolars |
| 11 | streets MapLibre GL JS map (vector tiles + 3D terrain) — monaco source via the martin parameter from cell #2 | renders inline iframe |
| 12 | helper: build_pipeline_maplibre_html — shared MapLibre template for the pipeline comparison cells | (build_pipeline_maplibre_html,) |
| 13 | gpq-tiles MapLibre map — monaco-gpqtiles source via martin parameter | renders inline iframe |
| 14 | DuckDB ST_AsMVT MapLibre map — monaco-duckdb-mvt source via martin parameter | renders inline iframe |
| 15 | DuckDB → freestiler MapLibre map — monaco-duckdb-freestiler source via martin parameter | renders inline iframe |
| 16 | versatiles convert round-trip demo: PMTiles → .versatiles → PMTiles | df_versatiles_convert |
| 17 | Shortbread MapLibre map — monaco-shortbread on versatiles serve, styled with @versatiles/style.colorful() | renders inline iframe |
| 18 | cudf-polars CUDA probe + GPU vs CPU benchmark on 2M-row synthetic numeric data | df_cudf_polars_bench |
| 19 | GTFS analytics — stops/routes/trips counts + top routes | df_route_stops, df_stops, gtfs_summary |
| 20 | render df_route_stops as table | (display) |
| 21 | transit folium map (98 bus-stop CircleMarkers) | renders folium |
The four MapLibre cells (11, 13, 14, 15) share the martin URL via
cell #2's exports — no per-cell os.environ.get("MARTIN_PUBLIC_URL")
re-reads. The diagnostic cell is the single source of truth for all
seven URL env vars (R3).
The five DAGs are independent (notebook_osm_pipeline,
notebook_gtfs_pipeline, notebook_osm_gpqtiles_pipeline,
notebook_osm_duckdb_mvt_pipeline,
notebook_osm_duckdb_freestiler_pipeline) but triggered + polled
together by the single trigger cell. Cells 13-15 depend on the
existence of monaco.parquet which the OSM DAG produces in its
pbf_to_geoparquet task — so the three pipeline DAGs run only
after the original OSM DAG's task chain produces the parquet
intermediate.
The cell uses textwrap.dedent('''...''') to embed a complete DAG
file as a string literal, then Path.write_text(...) to drop it
into ${AIRFLOW_DAGS_DIR} (defaults /workspace/dags,
overridable for cross-pod). Idempotent — overwriting on every
notebook run keeps the DAG body in sync with the notebook (single
source of truth).
The OSM DAG ends with a reload_martin task that runs
supervisorctl restart martin after writing pmtiles — see
/charly-versa:osm-tools-layer for why.
The GTFS DAG uses gtfs_parquet.parse_gtfs(zip_path) then
write_parquet(dir) to produce one .parquet per GTFS table.
# 1. Get JWT from /auth/token
# 2. For each DAG: poll /api/v2/dags/<id> until registered (max 90s,
# given AIRFLOW__DAG_PROCESSOR__REFRESH_INTERVAL=10s)
# 3. Unpause if needed (PATCH /api/v2/dags/<id>)
# 4. Trigger (POST /api/v2/dags/<id>/dagRuns with logical_date)
# 5. Poll BOTH DAGs concurrently until each reaches success/failed
Returns a dag_run_states dict so dependent cells can gate on
dag_run_states["notebook_osm_pipeline"] == "success".
| Use | Where consumed | URL |
|---|---|---|
| Notebook → Airflow REST API | Inside the pod (Python requests) | AIRFLOW_API_INTERNAL_URL (default http://localhost:8080) |
| Folium / MapLibre tile URL | User's browser on the host | MARTIN_PUBLIC_URL (default http://127.0.0.1:23000) |
The marimo kernel runs INSIDE the pod (container-internal reach).
The user's browser runs OUTSIDE the pod (only host-port mappings
work). Mixing the two is the most common failure mode — see
/charly-versa:versa "Required charly.yml env block".
tags columnThe tags column is Arrow Map<String,String>. polars 1.38's
arrow2 MapArray reader decodes the logical type as
List<List<Struct>> and asserts a plain Struct inner — panics on
.collect(), .read_parquet(), AND .collect(engine="gpu"). NOT
GPU-specific.
Workaround in cell 7: read tags via pyarrow.read_table (which
handles map<string,string> correctly) + tally tag KEYS via
collections.Counter. The GPU cell (cell 6) avoids tags entirely
by grouping on a flat geom_kb column.
Branca's _repr_html_ (folium's HTML renderer) emits a
"trust" wrapper around the iframe srcdoc only when
figure.height is None. The wrapper hides the actual map content
in non-Jupyter renderers (marimo).
Fix: m.get_root().height = "500px" sets the Figure's
height (NOT the Map's) → branca takes the simple-iframe branch
without the trust wrapper. One-line fix using pure folium API.
Tippecanoe → pmtiles → martin produces VECTOR tiles
(Mapbox Vector Tile / PBF format,
Content-Type: application/x-protobuf). Folium's TileLayer treats
responses as PNG/JPG → silent failure → grey map.
Fix in cell 11 (the streets MapLibre cell): replaced folium with MapLibre GL JS via
mo.iframe. Vector source pointed at martin's TileJSON URL,
polygon/line/circle layers filtered by geometry-type.
Per the official MapLibre 3D terrain example:
terrainSource + hillshadeSource raster-dem sources at
https://tiles.mapterhorn.com/tilejson.json (CORS-permissive)hillshade layer renders relief shadingterrain: { source: 'terrainSource', exaggeration: 1.5 }sky: {} for atmospheric horizonpitch: 60, maxPitch: 85TerrainControl lets the user toggle terrainThe whole scene is elevated — the vector OSM data follows the DEM topography.
Both download_pbf and download_gtfs HEAD-check Last-Modified
before re-fetching. Saves bandwidth on warm-cache re-runs (Geofabrik
~12 MB PBF + transitous.org ~770 KB GTFS).
End-to-end run via marimo's own export (executes every cell server-side):
charly cmd versa "/home/user/.pixi/envs/default/bin/marimo \
export ipynb /workspace/notebooks/osm-monaco-viz.py \
--include-outputs --sort topological -o /tmp/notebook-run.ipynb -f"
Expected results (verified):
{"notebook_osm_pipeline": "success", "notebook_gtfs_pipeline": "success", ...}'highway' count 4571'5' with 46 distinct stopscircleMarker calls/charly-versa:versa — image composing this layer/charly-versa:versa-layer — the runtime kernel/charly-versa:airflow-layer — DAG trigger REST API/charly-versa:osm-tools-layer — martin + tippecanoe + reload pattern/charly-versa:maputnik-layer — companion vector-tile editor/charly-versa:versa-mcp — read-only inspection of this notebook's sessiontools
OpenCharly CLI (charly) binary installed into container/VM images for in-container use. Use when working with charly binary deployment inside containers, native D-Bus support, or the full charly toolchain (charly binary + virtualization + gocryptfs + socat).
development
Operator CachyOS workstation profile — a kind:local template + target:local deploy that installs the full dev stack (30 candies) onto a CachyOS host via ShellExecutor. Lives in the overthinkos/cachyos submodule. MUST be invoked before editing or applying the charly-cachyos workstation profile.
tools
Fedora box with the full charly toolchain using shared candies. Rootless-first — runs as uid=1000 with passwordless sudo (no root, no cap_add: ALL). Same candy list as charly-arch. Includes NVIDIA GPU runtime. MUST be invoked before building, deploying, configuring, or troubleshooting the charly-fedora box.
tools
Arch Linux box with the full charly toolchain. Rootless-first — runs as uid=1000 with passwordless sudo (no root, no cap_add: ALL). Composes /charly-coder:charly-mcp so the box is reachable as an MCP gateway on port 18765. NVIDIA GPU runtime composed in. MUST be invoked before building, deploying, configuring, or troubleshooting the charly-arch box.