skills/llm-debug-test-failures/SKILL.md
Debug failing LLM integration tests caused by model output drift, incorrect context/runtime parameters (contextSize, batchSize, threads), prompt/template mismatches, or backend/framework regressions. Use when tests fail and you need to see the model response, reproduce a single failing CTest, or trace issues into src/cpp/frameworks (llama.cpp, onnxruntime-genai, mediapipe, mnn).
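Use this when llm-cpp-ctest-* fails due to model output drift, incorrect context/runtime parameters (contextSize, batchSize, threads), prompt/template mismatches, or backend/framework regressions.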
Windows note: if python3 isn’t available, use python (or py -3) for the scripts below.
From the build dir you used:
ctest --test-dir ./build --output-on-failure -V
To run just one failing test, use the test name shown by CTest:
ctest --test-dir ./build -R llm-cpp-ctest-<config> -V --output-on-failure
Tip: ctest --test-dir ./build -N lists tests without running them.
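When a build registers many configs, it can help to collect the exact test names before choosing one for -R. A minimal sketch, assuming the usual "Test #N: name" lines that ctest -N prints; `parse_test_names` is a hypothetical helper and not part of this repository:

```python
import re
from typing import List

# Matches lines such as "  Test  #3: llm-cpp-ctest-foo" from `ctest -N` output.
_TEST_LINE = re.compile(r"^\s*Test\s+#\d+:\s+(\S+)\s*$")


def parse_test_names(ctest_listing: str, prefix: str = "llm-cpp-ctest-") -> List[str]:
    """Return the test names in `ctest -N` output that start with `prefix`."""
    names = []
    for line in ctest_listing.splitlines():
        match = _TEST_LINE.match(line)
        if match and match.group(1).startswith(prefix):
            names.append(match.group(1))
    return names
```

The listing text can come from `subprocess.run(["ctest", "--test-dir", "./build", "-N"], capture_output=True, text=True).stdout`.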
Optional helper (run once, then rerun only failing tests verbosely):
python3 skills/llm-debug-test-failures/scripts/rerun_failing_ctest.py build --cpp-transcript ./llm-test-transcript.txt
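The helper's implementation is not reproduced here. As a rough sketch of the same idea, CTest records failed tests in Testing/Temporary/LastTestsFailed.log inside the build directory, and those names can be turned into a -R filter. The function names below are hypothetical, and the "index:name" log format is an assumption worth checking against your CMake version:

```python
from pathlib import Path
from typing import List


def read_failed_tests(build_dir: str) -> List[str]:
    """Read test names from CTest's LastTestsFailed.log (assumed 'index:name' lines)."""
    log = Path(build_dir) / "Testing" / "Temporary" / "LastTestsFailed.log"
    if not log.exists():
        return []
    names = []
    for line in log.read_text().splitlines():
        _index, sep, name = line.partition(":")
        if sep and name.strip():
            names.append(name.strip())
    return names


def rerun_command(build_dir: str, names: List[str]) -> List[str]:
    """Build a verbose ctest command that reruns exactly the given tests.

    Names are used as-is in the regex; names containing regex
    metacharacters would need escaping first.
    """
    pattern = "^(" + "|".join(names) + ")$"
    return ["ctest", "--test-dir", build_dir, "-R", pattern, "-V", "--output-on-failure"]
```

For the simple case, CTest's own --rerun-failed option does the same job without any scripting.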
The C++ and JNI tests print the model response and key context when an assertion fails.
If you want the C++ tests to print prompts/responses even when assertions pass, set:
- LLM_TEST_DEBUG_RESPONSES=1 (environment variable), or
- --debug-responses when running the llm-cpp-tests binary directly.

If you want a file you can attach to bugs/PRs (works even when CTest output is hard to read), set:
- C++ tests: LLM_TEST_TRANSCRIPT_PATH=./llm-test-transcript.txt (or pass --transcript <path>)
- JNI tests: add -Dllm.tests.transcript=./llm-jni-transcript.txt to the Java command

For JNI tests, you can also force printing prompts/responses even when tests pass by adding -Dllm.tests.debug=true to the Java command (copy the command from ctest -V output and add the flag).
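To apply the C++ switches from a script rather than a shell, one option is to copy the current environment and add the two variables before launching CTest. A minimal sketch; the variable names come from the text above, while `debug_env` and `run_single_test` are hypothetical helpers:

```python
import os
import subprocess
from typing import Dict, Mapping, Optional


def debug_env(base: Optional[Mapping[str, str]] = None,
              transcript_path: str = "./llm-test-transcript.txt") -> Dict[str, str]:
    """Return a copy of the environment with response printing and a transcript enabled."""
    env = dict(os.environ if base is None else base)
    env["LLM_TEST_DEBUG_RESPONSES"] = "1"
    env["LLM_TEST_TRANSCRIPT_PATH"] = transcript_path
    return env


def run_single_test(build_dir: str, test_regex: str) -> int:
    """Run the matching CTest entries verbosely with the debug environment.

    Returns ctest's exit code (non-zero when a test fails).
    """
    cmd = ["ctest", "--test-dir", build_dir, "-R", test_regex, "-V", "--output-on-failure"]
    return subprocess.run(cmd, env=debug_env()).returncode
```

A relative transcript path is resolved by the test process, whose working directory may differ from yours, so an absolute path can be easier to locate afterwards.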
If the failure is “output drift” but the answer is still correct, see the reference: skills/llm-add-model-support/references/output-validation.md.
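The referenced output-validation.md is not reproduced here. Purely as an illustration of why anchor-style checks tolerate wording drift, the sketch below tests a response for a few short anchor strings after normalizing case and whitespace. `missing_anchors` and the sample strings are hypothetical and are not the repository's test code:

```python
import re
from typing import Iterable, List


def _normalize(text: str) -> str:
    """Lower-case and collapse whitespace so formatting changes do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def missing_anchors(response: str, anchors: Iterable[str]) -> List[str]:
    """Return the anchors that do not occur in the normalized response."""
    haystack = _normalize(response)
    return [anchor for anchor in anchors if _normalize(anchor) not in haystack]
```

An empty result means every anchor was found, even if the surrounding wording differs from an earlier model revision.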
- Confirm the config file exists in model_configuration_files/ and is referenced by test/CMakeLists.txt (CONFIG_FILE_NAME list).
- Confirm --model-root points at resources_downloaded/models (CTest passes this automatically).
- Check contextSize + batchSize in the config JSON and any test overrides.

Most backend-specific issues live under:
- src/cpp/frameworks/llama_cpp/
- src/cpp/frameworks/onnxruntime_genai/
- src/cpp/frameworks/mediapipe/
- src/cpp/frameworks/mnn/

Look for:
If the wrapper code looks correct but behavior/crashes originate in the underlying framework, use the build tree’s fetched sources.
Identify the backend and pinned revision:
python3 scripts/dev/framework_versions.py --build-dir build

The pins are set in each backend's CMakeLists.txt:
- src/cpp/frameworks/llama_cpp/CMakeLists.txt (LLAMA_GIT_SHA, LLAMA_SRC_DIR)
- src/cpp/frameworks/mnn/CMakeLists.txt (MNN_GIT_TAG, MNN_SRC_DIR)
- src/cpp/frameworks/onnxruntime_genai/CMakeLists.txt (ONNXRUNTIME_GIT_TAG, ONNXRT_GENAI_GIT_TAG, *_SRC_DIR)
- src/cpp/frameworks/mediapipe/CMakeLists.txt (MEDIAPIPE_GIT_SHA, MEDIAPIPE_SRC_DIR)

Open the fetched upstream code under your build directory (default locations):
- build/llama.cpp/
- build/mnn/
- build/onnxruntime/
- build/onnxruntime-genai/
- build/mediapipe/

Connect the dots from wrapper → upstream:
- Start from the wrapper source (e.g. src/cpp/frameworks/mnn/MnnImpl.cpp) and follow the calls into upstream headers/sources (includes typically reference the fetched *_SRC_DIR).
- Search the fetched sources for the symbol or error text (e.g. grep -RIn -- "<symbol-or-error>" build/mnn).

Decide “wrapper bug vs framework bug”:
- Likely a wrapper bug: wrong contextSize/batchSize, wrong model path, wrong prompt template, modality mismatch, or misuse of the framework API.

Practical heuristics (fast triage):
If the model won’t load:
- Does LlmConfig expand llmModelName relative to --model-root the way the backend expects?

If you hit “context is full” / truncation:
- Check contextSize, batchSize, and any test overrides.

If the failure is output drift (anchors not found):
- ... framework= in the config summary).

If the failure is crash / SIGSEGV / abort:
- ... isFirstMessage sequencing)

Backend “where to look first”:
- llama.cpp: wrapper chat template/tokenization glue in src/cpp/frameworks/llama_cpp/, fetched sources in build/llama.cpp/
- mnn: model artifact layout + file IO in src/cpp/frameworks/mnn/, fetched sources in build/mnn/
- onnxruntime-genai: session/init + provider config in src/cpp/frameworks/onnxruntime_genai/, fetched sources in build/onnxruntime*/
- mediapipe: bazel-built engine wiring in src/cpp/frameworks/mediapipe/, fetched sources in build/mediapipe/
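Where grep is not available (for example in a stock Windows shell), the symbol search from the wrapper → upstream step can be approximated in Python. A minimal sketch; `search_tree` is a hypothetical helper, it matches a fixed string rather than a regular expression, and skipping files that contain NUL bytes only approximates grep -I:

```python
import os
from typing import Iterator, Tuple


def search_tree(root: str, needle: str) -> Iterator[Tuple[str, int, str]]:
    """Yield (path, line_number, line) for each text line under `root` containing `needle`."""
    target = needle.encode("utf-8")
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            path = os.path.join(dirpath, filename)
            try:
                with open(path, "rb") as handle:
                    data = handle.read()
            except OSError:
                continue  # unreadable file: skip it and keep searching
            if b"\0" in data:
                continue  # NUL byte present: treat the file as binary
            for number, line in enumerate(data.splitlines(), start=1):
                if target in line:
                    yield path, number, line.decode("utf-8", errors="replace")
```

For example, `for hit in search_tree("build/mnn", "<symbol-or-error>"): print(*hit, sep=":")` prints matches in a path:line:text layout similar to grep -n.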