arm-examples/llm-debug-test-failures

Name: llm-debug-test-failures
Author: arm-examples

skills/llm-debug-test-failures/SKILL.md

npx skillsauth add arm-examples/llm-runner llm-debug-test-failures

Debug failing model/tests

Use this when llm-cpp-ctest-* fails due to:

- model output drift (expected anchors not found)
- context/runtime parameters (context full, truncation, batch sizing)
- prompt/template issues (chat formatting differences)
- backend/framework regressions

Windows note: if python3 isn’t available, use python (or py -3) for the scripts below.

Workflow

1) Re-run the failing test with maximum signal

Point --test-dir at the build directory you used (./build in the commands below):

ctest --test-dir ./build --output-on-failure -V

To run just one failing test, use the test name shown by CTest:

ctest --test-dir ./build -R llm-cpp-ctest-<config> -V --output-on-failure

Tip: ctest --test-dir ./build -N lists tests without running them.

Optional helper (run once, then rerun only failing tests verbosely):

python3 skills/llm-debug-test-failures/scripts/rerun_failing_ctest.py build --cpp-transcript ./llm-test-transcript.txt
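The "run once, then rerun only the failures" idea can be sketched in a few lines of Python. This is not the rerun_failing_ctest.py script itself, whose internals are not shown here. It is a minimal sketch that assumes CTest has written its usual Testing/Temporary/LastTestsFailed.log (one "<index>:<name>" entry per line) under the build directory. The function names failed_test_names and rerun_command are hypothetical.

```python
import re
from pathlib import Path


def failed_test_names(log_text: str) -> list[str]:
    """Extract test names from LastTestsFailed.log text ("<index>:<name>" per line)."""
    names = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only, so names containing ':' survive.
        _index, _sep, name = line.partition(":")
        if name:
            names.append(name)
    return names


def rerun_command(build_dir: str, names: list[str]) -> list[str]:
    """Build a ctest command that reruns exactly the named tests, verbosely."""
    pattern = "^(" + "|".join(re.escape(n) for n in names) + ")$"
    return ["ctest", "--test-dir", build_dir, "-R", pattern, "-V", "--output-on-failure"]


if __name__ == "__main__":
    # Assumed default location; adjust if your build directory differs.
    log = Path("build") / "Testing" / "Temporary" / "LastTestsFailed.log"
    if log.exists():
        names = failed_test_names(log.read_text())
        if names:
            print(" ".join(rerun_command("build", names)))
```

If you only need the rerun and not the list of names, ctest --test-dir ./build --rerun-failed --output-on-failure covers the same ground without any scripting.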

2) Inspect the model response and config/runtime values

The C++ and JNI tests print the model response and key context when an assertion fails.

If you want the C++ tests to print prompts/responses even when assertions pass, either:

- set LLM_TEST_DEBUG_RESPONSES=1 (environment variable), or
- add --debug-responses when running the llm-cpp-tests binary directly.

If you want a file you can attach to bugs/PRs (works even when CTest output is hard to read), set:

- C++: LLM_TEST_TRANSCRIPT_PATH=./llm-test-transcript.txt (or pass --transcript <path>)
- JNI: add -Dllm.tests.transcript=./llm-jni-transcript.txt to the Java command
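For the C++ side, the two environment variables above can be set from a small Python wrapper around ctest. This is a sketch under assumptions: it assumes the tests inherit the environment from ctest (no ENVIRONMENT test property overriding these names), and the helper names debug_env and run_single_test are hypothetical. The transcript path is made absolute because the test's working directory may differ from the directory you launch ctest from.

```python
import os
import subprocess


def debug_env(transcript_path: str, base_env=None) -> dict:
    """Return a copy of the environment with the C++ test debug variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["LLM_TEST_DEBUG_RESPONSES"] = "1"
    # Absolute path, so the file lands where you expect regardless of the test's cwd.
    env["LLM_TEST_TRANSCRIPT_PATH"] = os.path.abspath(transcript_path)
    return env


def run_single_test(build_dir: str, test_regex: str, transcript_path: str) -> int:
    """Run the matching CTest tests verbosely and return ctest's exit code."""
    cmd = ["ctest", "--test-dir", build_dir, "-R", test_regex, "-V", "--output-on-failure"]
    return subprocess.run(cmd, env=debug_env(transcript_path), check=False).returncode
```

A call such as run_single_test("build", "llm-cpp-ctest-<config>", "llm-test-transcript.txt") would then rerun one test with responses printed and a transcript written, with <config> replaced by the name CTest reports.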

For JNI tests, you can also force printing prompts/responses even when tests pass by adding -Dllm.tests.debug=true to the Java command (copy it from ctest -V output and add the flag).

If the failure is “output drift” but the answer is still correct:

- constrain the prompt first (“Answer with a single word.”)
- then update the expected anchors (keep them high-signal)

Reference: skills/llm-add-model-support/references/output-validation.md.
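To see why short, high-signal anchors tolerate drift better than long expected strings, here is a minimal sketch of an anchor check. The matching rule (case-insensitive substring) is an assumption for illustration; the repository's tests may match anchors differently, and missing_anchors is a hypothetical helper.

```python
def missing_anchors(response: str, anchors: list[str]) -> list[str]:
    """Return the anchors that do not occur in the response (case-insensitive)."""
    haystack = response.casefold()
    return [anchor for anchor in anchors if anchor.casefold() not in haystack]


# A verbose but correct answer still satisfies a short anchor...
assert missing_anchors("The capital of France is Paris.", ["paris"]) == []
# ...while an anchor that encodes exact phrasing breaks on harmless rewording.
assert missing_anchors("Paris.", ["the capital of france is paris"]) == [
    "the capital of france is paris"
]
```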

3) Validate the config file and model paths

- Confirm the config JSON exists under model_configuration_files/ and is referenced by test/CMakeLists.txt (CONFIG_FILE_NAME list).
- Confirm --model-root points at resources_downloaded/models (CTest passes this automatically).
- If the error is “context is full”, inspect contextSize and batchSize in the config JSON and any test overrides.
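These checks can be scripted as a quick pre-flight. The sketch below assumes contextSize, batchSize, and llmModelName are top-level keys in the config JSON and that llmModelName resolves relative to the model root. The real schema may nest or name them differently, so treat the key lookups as placeholders. check_config is a hypothetical helper, not part of the repository.

```python
import json
from pathlib import Path


def check_config(config_path: str, model_root: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    path = Path(config_path)
    if not path.is_file():
        return [f"config not found: {path}"]

    config = json.loads(path.read_text())
    problems = []

    # ASSUMPTION: both sizes are top-level positive integers.
    for key in ("contextSize", "batchSize"):
        value = config.get(key)
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            problems.append(f"{key} should be a positive integer, got {value!r}")

    # ASSUMPTION: llmModelName is a path relative to the model root.
    model_name = config.get("llmModelName")
    if not model_name:
        problems.append("llmModelName is missing or empty")
    elif not (Path(model_root) / model_name).exists():
        problems.append(f"model path does not exist: {Path(model_root) / model_name}")

    return problems
```

Running it against a config under model_configuration_files/ with resources_downloaded/models as the model root surfaces missing files before you spend time on a full test run.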

4) Trace into the backend integration (if needed)

Most backend-specific issues live under:

- src/cpp/frameworks/llama_cpp/
- src/cpp/frameworks/onnxruntime_genai/
- src/cpp/frameworks/mediapipe/
- src/cpp/frameworks/mnn/

Look for:

- prompt/template construction differences
- model loading paths derived from config fields
- modality handling (text vs vision) and batch/context sizing

5) Trace into the upstream framework source (when it looks like a framework bug)

If the wrapper code looks correct but behavior/crashes originate in the underlying framework, use the build tree’s fetched sources.

Identify the backend and pinned revision:
- From the failing test output: use the printed framework/config summary.
- Or run: python3 scripts/dev/framework_versions.py --build-dir build
- The pin and local source path are also defined in the backend’s CMakeLists.txt:
  - llama.cpp: src/cpp/frameworks/llama_cpp/CMakeLists.txt (LLAMA_GIT_SHA, LLAMA_SRC_DIR)
  - MNN: src/cpp/frameworks/mnn/CMakeLists.txt (MNN_GIT_TAG, MNN_SRC_DIR)
  - ONNX: src/cpp/frameworks/onnxruntime_genai/CMakeLists.txt (ONNXRUNTIME_GIT_TAG, ONNXRT_GENAI_GIT_TAG, *_SRC_DIR)
  - MediaPipe: src/cpp/frameworks/mediapipe/CMakeLists.txt (MEDIAPIPE_GIT_SHA, MEDIAPIPE_SRC_DIR)

Open the fetched upstream code under your build directory (default locations):
- llama.cpp: build/llama.cpp/
- MNN: build/mnn/
- onnxruntime: build/onnxruntime/
- onnxruntime-genai: build/onnxruntime-genai/
- mediapipe: build/mediapipe/

Connect the dots from wrapper → upstream:
- Start at the wrapper implementation (e.g. src/cpp/frameworks/mnn/MnnImpl.cpp) and follow the calls into upstream headers/sources (includes typically reference the fetched *_SRC_DIR).
- Use a local search tool to find the symbol or error string in upstream sources (e.g. grep -RIn -- "<symbol-or-error>" build/mnn).

Decide “wrapper bug vs framework bug”:
- Wrapper bug signals: incorrect mapping of contextSize/batchSize, wrong model path, wrong prompt template, modality mismatch, or misuse of the framework API.
- Framework bug signals: crash/assert inside upstream code with correct inputs, regression tied to a version bump, or behavior that contradicts framework docs for the pinned revision.
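When a backtrace is available, a first pass at the wrapper-vs-framework question is to check which source tree each frame lives in. The sketch below assumes frame paths have already been made relative to the repository root and that the fetched sources sit in the default build/ locations listed above. classify_frame is a hypothetical helper. A frame inside upstream code is only a hint: the wrapper may still have passed bad inputs.

```python
from pathlib import PurePosixPath

# Default locations taken from the lists above; adjust if your build dir differs.
WRAPPER_ROOT = PurePosixPath("src/cpp/frameworks")
UPSTREAM_ROOTS = {
    "llama.cpp": PurePosixPath("build/llama.cpp"),
    "mnn": PurePosixPath("build/mnn"),
    "onnxruntime": PurePosixPath("build/onnxruntime"),
    "onnxruntime-genai": PurePosixPath("build/onnxruntime-genai"),
    "mediapipe": PurePosixPath("build/mediapipe"),
}


def _is_under(path: PurePosixPath, root: PurePosixPath) -> bool:
    """True if path is root or lies below it (compares whole path components)."""
    return path.parts[: len(root.parts)] == root.parts


def classify_frame(frame_path: str) -> str:
    """Label a repo-relative source path as wrapper, framework:<name>, or other."""
    path = PurePosixPath(frame_path)
    for name, root in UPSTREAM_ROOTS.items():
        if _is_under(path, root):
            return f"framework:{name}"
    if _is_under(path, WRAPPER_ROOT):
        return "wrapper"
    return "other"
```

For example, classify_frame("src/cpp/frameworks/mnn/MnnImpl.cpp") returns "wrapper", while a path under build/mnn/ returns "framework:mnn". Comparing whole path components keeps build/onnxruntime-genai/ from being mistaken for build/onnxruntime/.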

Practical heuristics (fast triage):

If the model won’t load:
- Wrapper-side first:
  - Does LlmConfig expand llmModelName relative to --model-root the way the backend expects?
  - Is the config pointing at a file vs a directory (some frameworks expect a folder with multiple artifacts)?
  - Are optional artifacts (e.g. projection model for vision) set/expanded correctly?
- Framework-side likely when:
  - All paths exist and are readable, but load fails with a framework internal error, assert, or crash.

If you hit “context is full” / truncation:
- Wrapper-side first:
  - Verify the test/config values printed in the summary: contextSize, batchSize, and any test overrides.
  - Check whether the wrapper counts tokens/bytes differently than the framework (off-by-one style issues), or whether it re-encodes prior chat turns unexpectedly.
- Framework-side likely when:
  - The same prompt/config works on a prior pinned revision but fails on the new one (regression after bump).

If the failure is output drift (anchors not found):
- Wrapper-side first:
  - Prompt/template formatting differences (chat template application, missing system prompt, stopwords).
  - Wrong backend selected (confirm framework= in the config summary).
  - Model mismatch (a different model artifact is being loaded than expected).
- Framework-side likely when:
  - Identical prompt/config/model on the same revision produces unstable output across runs (nondeterminism), or a known decoding change landed upstream.

If the failure is crash / SIGSEGV / abort:
- Wrapper-side first:
  - Invalid inputs (null/empty strings, empty model path, missing image path, bad tensor sizes).
  - Lifetime/order issues (free/reset during active decode; encode called with unexpected isFirstMessage sequencing).
- Framework-side likely when:
  - The crash happens inside fetched upstream sources with valid inputs, especially if it is reproducible with a minimal prompt.

Backend “where to look first”:

- llama.cpp: wrapper chat template/tokenization glue in src/cpp/frameworks/llama_cpp/, fetched sources in build/llama.cpp/
- mnn: model artifact layout + file I/O in src/cpp/frameworks/mnn/, fetched sources in build/mnn/
- onnxruntime-genai: session/init + provider config in src/cpp/frameworks/onnxruntime_genai/, fetched sources in build/onnxruntime*/
- mediapipe: Bazel-built engine wiring in src/cpp/frameworks/mediapipe/, fetched sources in build/mediapipe/

Debug failing LLM integration tests caused by model output drift, incorrect context/runtime parameters (contextSize, batchSize, threads), prompt/template mismatches, or backend/framework regressions. Use when tests fail and you need to see the model response, reproduce a single failing CTest, or trace issues into src/cpp/frameworks (llama.cpp, onnxruntime-genai, mediapipe, mnn).

4 stars

development

Updated Apr 4, 2026

Install this skill globally with one command (npx skillsauth add arm-examples/llm-runner llm-debug-test-failures). Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Purpose | Status | Reported % |
| --- | --- | --- | --- |
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 23, 2026, 1:36 AM. Scan duration: 1.8s. 1 file scanned.

Related Skills

arm-examples/llm-update-downloads (tools)

Verified / Trusted / Community

Update scripts/py/requirements.json entries (URLs + sha256sum) for models/tools, validate hash changes, and keep downloads deterministic without committing artifacts. Use when adding or refreshing model/tool downloads.

4 stars, SKILL.md, updated Apr 4, 2026

arm-examples/llm-session-start (tools)

Verified / Trusted / Community

Run fast “session start / doctor” checks for this repository (toolchain + wiring sanity, framework version report, optional upstream update check), optionally generate a debug bundle, and when needed bump pinned backend framework versions with build+ctest verification. Use at session start or when upgrading llama.cpp/onnxruntime-genai/mediapipe/mnn pins.

4 stars, SKILL.md, updated Apr 4, 2026

arm-examples/llm-jni-smoke (tools)

Verified / Trusted / Community

Run a fast JNI-focused build/test smoke check (JNI on, minimal test run), and isolate JNI toolchain issues. Use when changing JNI/Java code or validating JNI setup.

4 stars, SKILL.md, updated Apr 4, 2026

arm-examples/llm-config-schema-change (testing)

Verified / Trusted / Community

Safely add or change model config schema keys (JSON) and update parsing, tests, and docs. Use when editing model_configuration_files schema or LlmConfig parsing without doing broader model onboarding.

4 stars, SKILL.md, updated Apr 4, 2026

Install manually

# Clone the repo
git clone https://github.com/arm-examples/llm-runner.git

# Copy into Claude Code skills folder (global); create the folder first if it does not exist
mkdir -p ~/.claude/skills
cp -r llm-runner/skills/llm-debug-test-failures ~/.claude/skills/

Repository

arm-examples/llm-runner

4 stars

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT