.agents/skills/python-code-reviewer/SKILL.md
Review Python OpenInference instrumentation code for correctness and completeness. Use this skill when reviewing a Python instrumentor package — whether it's a new instrumentor, a PR that modifies one, or when the user asks to audit/review/check an existing instrumentor's code quality. Trigger on phrases like "review the instrumentor", "check the code", "audit the package", "is this instrumentor correct", or any request to validate an OpenInference Python instrumentation package against project standards.
Review a Python OpenInference instrumentation package against the project's established patterns and conventions. This is a checklist-driven review — go through each section, report findings with file paths and line numbers, and surface issues organized by severity.
Step 1: Identify the package to review
Locate the package under python/instrumentation/openinference-instrumentation-<name>/ and read __init__.py, _wrappers.py (or equivalent), pyproject.toml,
and the full tests/ directory.
Step 2: Pull the instrumented library source and use it as ground truth
OpenInference instrumentors work by monkey-patching functions in the library they instrument. All correctness judgments — whether wrappers target the right methods, handle the right signatures, process the right data structures, and cover the right edge cases — must be verified against the actual library source code. Do NOT make assumptions about how the instrumented library works.
Note: The tox env name <pkg> and the library's Python import path <library> often differ. For example, google_genai is the tox env name but the library installs as google/genai/ in site-packages. Check test-requirements.txt or pyproject.toml to find the actual library package name.
Set up the tox environment to install the pinned library version. Look up the
tox envlist in python/tox.ini to find the correct env name (use the highest Python
version available, e.g., py314, py313):
cd python && uvx --with tox-uv tox run -e <pyVER>-ci-<pkg> -- --co -q
(-- --co -q tells pytest to collect without running, which triggers the install.)
If the .tox env already exists, skip this step.
If tox setup fails (missing Python version, dependency conflicts), fall back to
pip install <library> in a temporary venv to unblock the review.
Locate the installed library source at:
python/.tox/<pyVER>-ci-<pkg>/lib/python<X.Y>/site-packages/<library>/
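A minimal sketch of how that path is assembled, assuming the POSIX layout shown above. `library_source_dir` is a hypothetical helper for illustration and is not part of the repo.

```python
from pathlib import Path


def library_source_dir(repo_root: Path, py_ver: str, pkg: str, library: str) -> Path:
    """Build the site-packages path of an installed library inside a tox env.

    py_ver is the tox factor such as "py313"; the interpreter directory
    name ("python3.13") is derived from it.
    """
    digits = py_ver.removeprefix("py")      # "313"
    major, minor = digits[0], digits[1:]    # "3", "13"
    env_dir = repo_root / "python" / ".tox" / f"{py_ver}-ci-{pkg}"
    return env_dir / "lib" / f"python{major}.{minor}" / "site-packages" / library


# The tox env name and the import path differ for this package.
path = library_source_dir(Path("."), "py313", "google_genai", "google/genai")
print(path.as_posix())
```

If the directory does not exist, the tox env was probably never created or uses a different Python version than the one assumed here.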
Reference the library source throughout the review. Before flagging any finding, verify it against the actual code:
Calibrate severity based on what the library actually does:
Step 3: Run all review sections below
Step 4: Present findings organized by severity:
The tox.ini install pattern matters because a broken pattern silently installs the wrong version of the library, making the "pinned version" test target useless.
Read python/tox.ini and find the commands_pre entries for this package.
Correct pattern (google_adk style — 4 steps, substitute <pkg> with the actual
package name):
<pkg>: uv pip uninstall -r test-requirements.txt
<pkg>: uv pip install --reinstall-package openinference-instrumentation-<pkg> .
<pkg>: python -c 'import openinference.instrumentation.<pkg>'
<pkg>: uv pip install -r test-requirements.txt
Broken pattern (causes under-resolution — the pinned version test may silently test the wrong version):
<pkg>: uv pip install --reinstall {toxinidir}/instrumentation/openinference-instrumentation-<pkg>[test]
Flag the broken pattern as Critical — it defeats the purpose of version-pinned testing.
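As a rough aid, the broken pattern can be searched for mechanically. The sketch below assumes the broken form is always a `--reinstall` install of the package directory with the `[test]` extra, as in the example above; `find_broken_installs` is a hypothetical helper, and a manual read of tox.ini is still needed for variants it does not match.

```python
import re

# Assumption: "--reinstall" followed by whitespace, then a path ending in the
# package name with the [test] extra. "--reinstall-package" does not match
# because it has no whitespace directly after "--reinstall".
BROKEN = re.compile(
    r"uv pip install\s+--reinstall\s+\S*openinference-instrumentation-[\w-]+\[test\]"
)


def find_broken_installs(tox_ini_text: str) -> list[int]:
    """Return 1-based line numbers in tox.ini that use the broken install pattern."""
    return [
        number
        for number, line in enumerate(tox_ini_text.splitlines(), start=1)
        if BROKEN.search(line)
    ]


sample = """\
  openai: uv pip uninstall -r test-requirements.txt
  openai: uv pip install --reinstall-package openinference-instrumentation-openai .
  groq: uv pip install --reinstall {toxinidir}/instrumentation/openinference-instrumentation-groq[test]
"""
print(find_broken_installs(sample))  # [3]
```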
Check that test-requirements.txt exists in the package root and contains:
- the instrumented library pinned to an exact version (e.g., openai==2.8.0)
- opentelemetry-sdk
- pytest-recording (if using VCR cassettes)

If test-requirements.txt is missing entirely, flag as Critical (the correct tox
pattern depends on it).
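The content checks can be sketched as a small function. This is an illustration only: `check_test_requirements` is a hypothetical name, and the function cannot tell which pinned line is the instrumented library, so the reviewer still has to confirm that by eye.

```python
def check_test_requirements(text: str) -> list[str]:
    """Return a list of problems found in a test-requirements.txt body."""
    # Drop comments and blank lines before inspecting requirements.
    lines = [line.split("#", 1)[0].strip() for line in text.splitlines()]
    requirements = [line for line in lines if line]
    problems = []
    if not any("==" in req for req in requirements):
        problems.append("no exact version pin (==) found")
    if not any(req.startswith("opentelemetry-sdk") for req in requirements):
        problems.append("opentelemetry-sdk is missing")
    return problems


print(check_test_requirements("openai==2.8.0\nopentelemetry-sdk\npytest-recording\n"))  # []
```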
Verify that the tox envlist has both pinned and -latest variants:
py3{10,14}-ci-{pkg,pkg-latest}
And that the -latest variant upgrades the library:
pkg-latest: uv pip install -U <library-name>
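To see which env names an envlist entry produces, the brace groups can be expanded by hand or with a few lines of code. The sketch below handles only flat, non-nested brace groups and is not how tox itself parses its configuration; `expand_envlist` is a hypothetical helper.

```python
import itertools
import re


def expand_envlist(entry: str) -> list[str]:
    """Expand flat brace groups, e.g. 'py3{10,14}-ci-{a,b}', into env names."""
    # Splitting on a capture group alternates literal text (even indexes)
    # with the contents of each brace group (odd indexes).
    parts = re.split(r"\{([^{}]*)\}", entry)
    choices = [
        [choice.strip() for choice in part.split(",")] if index % 2 else [part]
        for index, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in itertools.product(*choices)]


envs = expand_envlist("py3{10,14}-ci-{openai,openai-latest}")
print(envs)
# Both the pinned and the -latest variant should be present for the package.
assert "py314-ci-openai" in envs and "py314-ci-openai-latest" in envs
```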
Read tests/conftest.py and verify these fixtures exist:
Required fixtures:
- in_memory_span_exporter: returns InMemorySpanExporter()
- tracer_provider: creates TracerProvider with SimpleSpanProcessor wired to the exporter
- instrument (autouse): calls Instrumentor().instrument(tracer_provider=...), clears exporter, yields, then calls .uninstrument() and clears again

Scope considerations:
VCR config fixture (if using cassettes):
@pytest.fixture(scope="session")
def vcr_config() -> dict[str, Any]:
    return {
        "before_record_request": _strip_request_headers,
        "before_record_response": _strip_response_headers,
        "decode_compressed_response": True,
        "record_mode": "once",
    }
With helper functions that strip sensitive headers from recorded cassettes.
If the instrumentor calls external APIs (LLM providers, embedding services, etc.):
- Tests should use the @pytest.mark.vcr decorator
- Cassettes should live in tests/cassettes/ (pytest-recording default)
- pytest-recording should be in test-requirements.txt

If tests use mocking instead of VCR, that's acceptable but note it as a pattern difference.
This is the most important testing pattern. Tests should verify ALL span attributes, not just spot-check a few. The pattern prevents regressions where unexpected attributes appear or expected ones disappear silently.
Correct pattern:
attributes = dict(span.attributes or {})
assert attributes.pop(OPENINFERENCE_SPAN_KIND) == OpenInferenceSpanKindValues.CHAIN.value
assert attributes.pop(INPUT_VALUE)
assert attributes.pop(INPUT_MIME_TYPE) == JSON
assert attributes.pop(OUTPUT_VALUE)
assert attributes.pop(OUTPUT_MIME_TYPE) == JSON
# ... pop all remaining attributes ...
assert not attributes # Nothing unexpected left
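The difference between a spot check and the pop pattern can be shown with a plain dictionary. The attribute keys below are placeholders rather than real OpenInference constants, and `leftover_after_expected` is a hypothetical helper used only to make the point.

```python
def leftover_after_expected(attributes: dict, expected: dict) -> dict:
    """Pop every expected key, checking its value, and return what remains."""
    remaining = dict(attributes)  # copy so the caller's mapping is untouched
    for key, value in expected.items():
        # pop raises KeyError if an expected attribute is missing
        assert remaining.pop(key) == value
    return remaining


# Hypothetical span attributes; "surprise" stands for an unexpected extra.
recorded = {"kind": "CHAIN", "input": "hi", "output": "hello", "surprise": 1}
expected = {"kind": "CHAIN", "input": "hi", "output": "hello"}

# A .get()-style spot check passes even though "surprise" leaked onto the span.
assert recorded.get("kind") == "CHAIN"

# The pop pattern surfaces the extra attribute.
print(leftover_after_expected(recorded, expected))  # {'surprise': 1}
```

In a real test the final `assert not attributes` turns that leftover dictionary into a failure.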
What to flag:
- Spot-check assertions with no assert not attributes: High
- Using span.attributes[KEY] or span.attributes.get(KEY) instead of pop: Medium
(functional but doesn't catch unexpected extras)
- Pop assertions without assert not attributes at the end: High

There should be at least one test that uses the using_attributes() context manager and
verifies that context attributes appear on spans:
with using_attributes(
    session_id="test-session",
    user_id="test-user",
    metadata={"key": "value"},
    tags=["tag-1", "tag-2"],
    prompt_template="template {var}",
    prompt_template_version="v1.0",
    prompt_template_variables={"var": "value"},
):
    ...  # run instrumented code
Then verify these attributes appear on the spans via pop assertions.
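A sketch of what those pop assertions might look like follows. The string keys are an assumption about the values of the corresponding SpanAttributes constants, and the sketch also assumes that dict-valued fields are stored as JSON strings; a real test should import the constants and check against the instrumented output rather than the hand-built dictionary used here.

```python
import json


def pop_context_attributes(attributes: dict) -> None:
    """Pop and check the context attributes set by the using_attributes call above."""
    # Assumed key names; prefer the SpanAttributes constants in real tests.
    assert attributes.pop("session.id") == "test-session"
    assert attributes.pop("user.id") == "test-user"
    assert json.loads(attributes.pop("metadata")) == {"key": "value"}
    assert list(attributes.pop("tag.tags")) == ["tag-1", "tag-2"]
    assert attributes.pop("llm.prompt_template.template") == "template {var}"
    assert attributes.pop("llm.prompt_template.version") == "v1.0"
    assert json.loads(attributes.pop("llm.prompt_template.variables")) == {"var": "value"}


# Hand-built stand-in for dict(span.attributes) after the span-specific pops.
attributes = {
    "session.id": "test-session",
    "user.id": "test-user",
    "metadata": json.dumps({"key": "value"}),
    "tag.tags": ("tag-1", "tag-2"),
    "llm.prompt_template.template": "template {var}",
    "llm.prompt_template.version": "v1.0",
    "llm.prompt_template.variables": json.dumps({"var": "value"}),
}
pop_context_attributes(attributes)
assert not attributes  # nothing unexpected left
```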
Check which conventions apply based on the type of library being instrumented. Not every instrumentor needs every attribute — match the conventions to what the library actually does.
Every span must have:
- OPENINFERENCE_SPAN_KIND: set to the appropriate kind enum value
- INPUT_VALUE + INPUT_MIME_TYPE: what went into the operation
- OUTPUT_VALUE + OUTPUT_MIME_TYPE: what came out

MIME types should be application/json for structured data (dicts, Pydantic models) and
text/plain for strings. Flag if MIME type is missing when value is set: High.
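The MIME type rule can be illustrated with a small helper. This is a sketch, not the instrumentor's actual serialization code: `value_and_mime_type` is a hypothetical name, and the `model_dump()` branch assumes Pydantic v2 style models.

```python
import json
from typing import Any

TEXT = "text/plain"
JSON = "application/json"


def value_and_mime_type(value: Any) -> tuple[str, str]:
    """Serialize a value and pick the MIME type that matches it."""
    if isinstance(value, str):
        return value, TEXT
    # Assumption: Pydantic v2 models expose model_dump(); anything else is
    # expected to be JSON-serializable as-is.
    if hasattr(value, "model_dump"):
        value = value.model_dump()
    return json.dumps(value), JSON


print(value_and_mime_type("hello"))           # ('hello', 'text/plain')
print(value_and_mime_type({"question": 1}))   # ('{"question": 1}', 'application/json')
```

A wrapper that sets the value but skips the MIME type, or labels a JSON string as text/plain, is the case to flag.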
When setting input/output attributes, the instrumentor should use:
from openinference.instrumentation import get_input_attributes, get_output_attributes
span.set_attributes(dict(get_input_attributes(val, mime_type=OpenInferenceMimeTypeValues.JSON)))
LLM spans should set:
- LLM_MODEL_NAME: the model identifier
- LLM_PROVIDER: the provider name (e.g., "openai", "anthropic")
- LLM_INVOCATION_PARAMETERS: JSON of parameters like temperature, max_tokens
- LLM_INPUT_MESSAGES: array of message objects with role and content
- LLM_OUTPUT_MESSAGES: array of response message objects
- LLM_TOKEN_COUNT_PROMPT / LLM_TOKEN_COUNT_COMPLETION / LLM_TOKEN_COUNT_TOTAL: token usage

Span kind should be LLM.

Embedding spans should set:
- EMBEDDING_MODEL_NAME
- EMBEDDING_EMBEDDINGS: the embedding vectors (unless masked by TraceConfig)
- EMBEDDING_TEXT: input text

Span kind should be EMBEDDING.
When the library supports tool use or function calling:
- TOOL_NAME: name of the tool/function
- TOOL_DESCRIPTION: description (if available)
- TOOL_PARAMETERS: JSON schema of parameters

Span kind should be TOOL.
These typically produce multiple span kinds in a hierarchy:
- CHAIN for orchestration/workflow spans
- AGENT for agent execution spans
- TOOL for tool invocations
- LLM for underlying model calls

Retriever spans should set:
- RETRIEVAL_DOCUMENTS: array of retrieved documents
- DOCUMENT_ID, DOCUMENT_CONTENT, DOCUMENT_SCORE, DOCUMENT_METADATA

Span kind should be RETRIEVER.
For instrumentors that create multiple spans, verify:
- Each child span points at its parent (parent.span_id matches parent span's context.span_id)
- All spans from one operation share the same trace_id

Tests should verify hierarchy explicitly:
trace_ids = {span.context.trace_id for span in spans}
assert len(trace_ids) == 1 # All in one trace
assert child_span.parent.span_id == parent_span.context.span_id
Flag missing hierarchy tests as High for multi-span instrumentors.
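For instrumentors with several levels, the two assertions above generalize to a list of (child, parent) edges. The sketch below uses stand-in dataclasses that mimic only the `context` and `parent` fields of a finished span; `FakeSpan`, `Ctx`, and `assert_hierarchy` are hypothetical names, and a real test would pass the spans returned by the in-memory exporter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Ctx:
    trace_id: int
    span_id: int


@dataclass(frozen=True)
class FakeSpan:
    name: str
    context: Ctx
    parent: Optional[Ctx] = None


def assert_hierarchy(spans: list, edges: list) -> None:
    """Check that all spans share one trace and each (child, parent) edge holds."""
    by_name = {span.name: span for span in spans}
    assert len({span.context.trace_id for span in spans}) == 1
    for child, parent in edges:
        child_span = by_name[child]
        assert child_span.parent is not None, f"{child} has no parent"
        assert child_span.parent.span_id == by_name[parent].context.span_id


# CHAIN -> AGENT -> TOOL, all in trace 1.
chain = FakeSpan("chain", Ctx(trace_id=1, span_id=10))
agent = FakeSpan("agent", Ctx(1, 11), parent=chain.context)
tool = FakeSpan("tool", Ctx(1, 12), parent=agent.context)
assert_hierarchy([chain, agent, tool], [("agent", "chain"), ("tool", "agent")])
```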
Common correct hierarchies:
- CHAIN -> LLM (simple chain with LLM call)
- CHAIN -> AGENT -> TOOL (agent framework)
- CHAIN -> AGENT -> LLM (agent making LLM calls)
- CHAIN -> RETRIEVER -> EMBEDDING (RAG pipeline)
- CHAIN -> CHAIN -> LLM (nested chains)

If the instrumented library uses threads or async:
- Verify that context propagates across thread and task boundaries (contextvars.copy_context() if needed)
- Flag if the library uses ThreadPoolExecutor or similar and the
instrumentor doesn't handle context propagation: Critical

Every wrapper should check suppression at the start. Either pattern is acceptable:
# Pattern 1: private key (common in this repo)
# requires: from opentelemetry import context as context_api
if context_api.get_value(context_api._SUPPRESS_INSTRUMENTATION_KEY):
    return wrapped(*args, **kwargs)
# Pattern 2: public API
from opentelemetry.instrumentation.utils import is_instrumentation_enabled
if not is_instrumentation_enabled():
    return wrapped(*args, **kwargs)
Missing suppression check — High.
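Returning to the thread check earlier in this section, the standard-library sketch below shows why a worker thread loses the caller's context and how `contextvars.copy_context()` carries it across. The `current_span` variable is a stand-in for the OpenTelemetry context, on the assumption that the default runtime context is backed by a ContextVar; the function names are hypothetical.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the active span stored in the tracing context.
current_span = contextvars.ContextVar("current_span", default=None)


def read_span():
    return current_span.get()


def run_in_pool(propagate: bool):
    """Run read_span in a worker thread, optionally carrying the caller's context."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        if propagate:
            # Snapshot the caller's context and run the task inside it.
            ctx = contextvars.copy_context()
            return pool.submit(ctx.run, read_span).result()
        # On standard CPython builds the worker starts with an empty context,
        # so the value set by the caller is typically not visible here.
        return pool.submit(read_span).result()


current_span.set("parent-span")
print(run_in_pool(propagate=True))  # parent-span
```

An instrumentor that wraps a library submitting work to a pool without this kind of hand-off tends to produce orphaned child spans, which is the situation to flag.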
The instrumentor should accept and respect TraceConfig:
- Pass the config to OITracer or use it to mask attributes before setting them
- hide_inputs and hide_outputs should work

Missing TraceConfig support: Medium (functional but incomplete).
Organize findings into a table:
| Severity | Section | Finding | Location |
|----------|---------|---------|----------|
| Critical | 1.1 | Uses broken tox install pattern | python/tox.ini:142 |
| High | 2.3 | Tests don't use exhaustive pop assertions | tests/test_instrumentor.py:85 |
| ... | ... | ... | ... |
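If findings are collected as structured data during the review, the table can be generated in severity order. This is a minimal sketch; `Finding` and `render_findings` are hypothetical names, and only the three severities used in this checklist are ranked.

```python
from dataclasses import dataclass

# Lower number sorts first; severities are the ones used in this checklist.
SEVERITY_ORDER = {"Critical": 0, "High": 1, "Medium": 2}


@dataclass
class Finding:
    severity: str
    section: str
    finding: str
    location: str


def render_findings(findings: list) -> str:
    """Render findings as a Markdown table, most severe first."""
    rows = sorted(findings, key=lambda f: SEVERITY_ORDER[f.severity])
    lines = [
        "| Severity | Section | Finding | Location |",
        "|----------|---------|---------|----------|",
    ]
    lines += [f"| {f.severity} | {f.section} | {f.finding} | {f.location} |" for f in rows]
    return "\n".join(lines)


print(render_findings([
    Finding("High", "2.3", "Tests don't use exhaustive pop assertions", "tests/test_instrumentor.py:85"),
    Finding("Critical", "1.1", "Uses broken tox install pattern", "python/tox.ini:142"),
]))
```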
Then list what's working well — positive findings help the user understand what doesn't need to change.
Finally, ask the user what they'd like to do:
tox run -e test-<pkg>