skills/streaming-testing/SKILL.md
Compare A2A streaming behaviour across supervisor versions. Captures SSE events, analyzes metadata flags (is_narration, is_final_answer), and produces side-by-side comparison reports.
Compare A2A streaming behaviour across supervisor versions (e.g. 0.3.0 vs 0.2.41). Captures SSE events, analyzes metadata correctness, and generates side-by-side markdown reports.
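Requires the dev stack (docker-compose.dev.yaml) with both supervisors running:
- 0.3.0 supervisor (container caipe-supervisor, port 8000)
- 0.2.41 supervisor (container caipe-supervisor-041, port 8041)

# One command: captures from both supervisors, analyzes, compares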
./skills/streaming-testing/run_comparison.sh "what can you do?"
# Output lands in /tmp/streaming-comparison-<timestamp>.md
All scripts live in scripts/ and use Python stdlib only.
| Script | Purpose | Usage |
|--------|---------|-------|
| trace_a2a_streaming.py | SSE timeline with summary stats | python3 scripts/trace_a2a_streaming.py 8000 "query" |
| capture_a2a_events.py | Save raw A2A events to JSON | python3 scripts/capture_a2a_events.py http://localhost:8000/ "query" /tmp/out.json |
| analyze_a2a_metadata.py | Validate metadata flags in a capture | python3 scripts/analyze_a2a_metadata.py /tmp/out.json |
| compare_a2a_events.py | Side-by-side comparison of two captures | python3 scripts/compare_a2a_events.py /tmp/a.json /tmp/b.json |
| analyze_accumulation_flow.py | Content accumulation + duplicate detection | python3 scripts/analyze_accumulation_flow.py "query" --port 8000 |
| validate_artifacts.py | Parse/validate artifact reports | python3 scripts/validate_artifacts.py --query "query" --port 8000 |
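
All of the scripts in the table consume the supervisor's SSE stream. As a rough illustration of that first step, here is a minimal stdlib-only sketch of splitting an SSE body into JSON events. It assumes each event carries one JSON document on its `data:` lines; `parse_sse_events` and the sample payload fields are hypothetical and are not taken from the code in scripts/.

```python
import json


def parse_sse_events(lines):
    """Yield one decoded JSON payload per SSE event.

    In the SSE format an event is a run of lines ended by a blank line,
    and its payload is the newline-joined content of its `data:` lines.
    """
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # Blank line: dispatch the buffered event, if there is one.
            if data_parts:
                yield json.loads("\n".join(data_parts))
                data_parts = []
        elif line.startswith("data:"):
            value = line[5:]
            # A single leading space after the colon is not part of the value.
            data_parts.append(value[1:] if value.startswith(" ") else value)
        # Comment lines (":...") and other fields (event:, id:) are ignored here.
    if data_parts:
        # Lenient choice for this sketch: flush an event that was not
        # terminated by a blank line before the stream ended.
        yield json.loads("\n".join(data_parts))


# Illustrative input only; real payloads will have a different shape.
sample = [
    'data: {"kind": "artifact-update", "text": "Hello"}\n',
    "\n",
    ": keep-alive comment\n",
    'data: {"kind": "status-update",\n',
    'data:  "final": true}\n',
    "\n",
]
events = list(parse_sse_events(sample))
print(len(events), events[0]["text"])
```

Taking an iterable of lines rather than a socket keeps the parser testable without a running supervisor.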
python3 scripts/capture_a2a_events.py http://localhost:8000/ "what can you do?" /tmp/cap-030.json
python3 scripts/capture_a2a_events.py http://localhost:8041/ "what can you do?" /tmp/cap-041.json
python3 scripts/analyze_a2a_metadata.py /tmp/cap-030.json
python3 scripts/analyze_a2a_metadata.py /tmp/cap-041.json
This validates:
- is_final_answer is present on streaming_result artifacts
- No [FINAL ANSWER] marker leaked into text
- No internal metadata leaked into text (is_task_complete=, Returning structured response)
- is_narration and is_final_answer are mutually exclusive
- A final_result or partial_result artifact exists

python3 scripts/compare_a2a_events.py /tmp/cap-030.json /tmp/cap-041.json --output /tmp/comparison.md
Generates a side-by-side markdown report of the two captures.
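
To make "side-by-side" concrete, here is a minimal sketch of rendering two per-capture summaries as one markdown table. It is not the logic of compare_a2a_events.py: `comparison_table`, the summary keys, and the sample numbers are all hypothetical, and the real report may contain different sections.

```python
def comparison_table(label_a, summary_a, label_b, summary_b):
    """Render two flat summary dicts as a markdown table, one metric per row."""
    # Keep first-seen order and include metrics present in only one summary.
    metrics = list(dict.fromkeys([*summary_a, *summary_b]))
    rows = [
        f"| Metric | {label_a} | {label_b} |",
        "|--------|---|---|",
    ]
    for name in metrics:
        a = summary_a.get(name, "n/a")
        b = summary_b.get(name, "n/a")
        rows.append(f"| {name} | {a} | {b} |")
    return "\n".join(rows)


# Made-up sample values, for illustration only.
report = comparison_table(
    "0.3.0", {"events": 42, "marker_leaks": 0, "narration_events": 3},
    "0.2.41", {"events": 37, "marker_leaks": 0},
)
print(report)
```

A metric missing from one side is shown as n/a rather than dropped, which keeps version-specific fields such as narration counts visible in the report.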
For a live timeline view of each SSE event as it arrives:
python3 scripts/trace_a2a_streaming.py 8000 "what can you do?"
python3 scripts/trace_a2a_streaming.py 8041 "what can you do?"
| Metric | What it means | Good value |
|--------|--------------|------------|
| Time to first content | How long until the user sees text | < 5s |
| Total time | End-to-end response time | < 20s (simple), < 120s (RAG) |
| has is_final_answer | Supervisor correctly tags the final answer | True |
| Narration events | Narration chunks for typing-status display | 0 for simple queries, > 0 for RAG |
| Marker leaks | [FINAL ANSWER] leaked to client | Must be 0 |
| Metadata leaks | Internal state leaked to client | Must be 0 |
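
The metrics in this table can be derived from a captured event list. The sketch below assumes a simplified, hypothetical event shape (a dict with `t` in seconds since the request started, `text`, and a `metadata` dict); the capture files written by capture_a2a_events.py may be structured differently, so treat `summarize` as an illustration of the checks rather than the analyzer's implementation.

```python
FINAL_MARKER = "[FINAL ANSWER]"
# Internal strings that should never reach the client text.
METADATA_LEAKS = ("is_task_complete=", "Returning structured response")


def summarize(events):
    """Compute the table's metrics from a list of simplified event dicts."""
    # First event that carries non-empty text marks "time to first content".
    first_content = next((e["t"] for e in events if e.get("text")), None)
    flags = [e.get("metadata", {}) for e in events]
    return {
        "time_to_first_content": first_content,
        "total_time": events[-1]["t"] if events else None,
        "has_is_final_answer": any(m.get("is_final_answer") for m in flags),
        "narration_events": sum(1 for m in flags if m.get("is_narration")),
        "marker_leaks": sum(FINAL_MARKER in e.get("text", "") for e in events),
        "metadata_leaks": sum(
            any(s in e.get("text", "") for s in METADATA_LEAKS) for e in events
        ),
        # The two flags are meant to be mutually exclusive on any one event.
        "flag_conflicts": sum(
            1 for m in flags if m.get("is_narration") and m.get("is_final_answer")
        ),
    }


# Hypothetical events, for illustration only.
events = [
    {"t": 0.4, "text": "", "metadata": {}},
    {"t": 1.2, "text": "Searching...", "metadata": {"is_narration": True}},
    {"t": 3.9, "text": "Here is the answer.", "metadata": {"is_final_answer": True}},
]
summary = summarize(events)
print(summary)
```

With a summary like this, the "Must be 0" rows reduce to simple equality checks on marker_leaks and metadata_leaks.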
- In 0.3.0, the generate_structured_response node builds a JSON preamble before the content key, so time to first content will be higher than in 0.2.41.
- The is_narration flag was added in 0.3.0.

# Check containers
docker ps | grep supervisor
# Restart
docker restart caipe-supervisor
docker restart caipe-supervisor-041
The scripts have a default 120s (trace) or 300s (capture) timeout. For RAG queries, increase the timeout:
# capture_a2a_events.py reads timeout from the URL (not a CLI arg).
# For trace, it's hardcoded at 120s in http.client timeout.
# If queries consistently time out, check supervisor logs:
docker logs caipe-supervisor 2>&1 | tail -20
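
The trace timeout is described above as hardcoded in an http.client call. If you edit the script locally to raise it, the change might look roughly like the sketch below. `open_connection` and `timeout_s` are hypothetical names, not the script's own; only `http.client.HTTPConnection` and its `timeout` parameter are standard library.

```python
import http.client
from urllib.parse import urlsplit


def open_connection(base_url, timeout_s=300.0):
    """Create (but do not open) an HTTP connection with an explicit socket timeout."""
    parts = urlsplit(base_url)
    # HTTPConnection connects lazily, on the first request, so building it
    # does not require the supervisor to be running.
    return http.client.HTTPConnection(
        parts.hostname, parts.port or 80, timeout=timeout_s
    )


# Example: allow long RAG responses more time between reads.
conn = open_connection("http://localhost:8000/", timeout_s=600.0)
print(conn.host, conn.port, conn.timeout)
```

Note that this timeout applies to each blocking socket operation (connect, each read), not to the response as a whole, so a stream that keeps emitting events will not be cut off by it.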
If capture_a2a_events.py produces 0 artifacts:
- Check that the agent card is served: curl http://localhost:8000/.well-known/agent.json
- Check the supervisor logs: docker logs caipe-supervisor 2>&1 | tail -50

# Simple query comparison
./skills/streaming-testing/run_comparison.sh "what can you do?"
# RAG query comparison (takes longer)
./skills/streaming-testing/run_comparison.sh "what is caipe?"
# Just analyze a single capture
python3 scripts/analyze_a2a_metadata.py /tmp/cap-030.json --verbose
# Custom labels in comparison report
python3 scripts/compare_a2a_events.py /tmp/cap-030.json /tmp/cap-041.json \
--label-a "0.3.0 (post-fix)" --label-b "0.2.41 (golden)"