platforms/codex/skills/midea-recall-diagnose/SKILL.md
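Diagnoses why `/rag-recall/api/search/keyword` failed to recall a target doc/faq in the sit/uat/prod environments. Two input forms are supported: 1) a complete request (headers + body; if `headers.appId` is missing but `body.appId` exists, it may be backfilled); 2) requestId + targetId. Every diagnosis follows the same path, "replay -> ELK -> ES -> minimal code check"; broad search and conflicting criteria are forbidden.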
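- Precedence: `SKILL.md` > `references/*.md`. If they conflict, follow this file only.
- Keyword replay may only be done with `curl -X POST` from a terminal; opening the URL in a browser address bar is forbidden.
- Use ELK + ES for evidence; calling `/rag-recall/api/search/trace/recordInfo` is forbidden.
- Apart from the keyword replay, all ELK evidence is collected with `python3 scripts/elk_api_query.py` (Kibana API), and all ES evidence with `python3 scripts/es_proxy_query.py`. sit/uat go through the Kibana Dev Tools console proxy; prod goes through the Zhongli Cloud (中立云) console `requestEs`. In a normal diagnosis, querying ELK/ES through Playwright pages is forbidden, as are connecting to ES directly with `curl` and hand-writing shell requests to the proxy endpoints.
- Once headers + body are available, replay first and obtain a fresh `requestId` before querying ELK/ES.
- The first KQL combines `requestId` + `TRACE_TARGET_ES` + `targetId`.
- `requestId`/`targetId`/`TRACE_TARGET_ES` must match completely and exactly; `*` wildcards (such as `replay_*`) are forbidden.
- Do not scan three days of logs by `targetId` alone and then narrow down step by step.
- Validate every KQL with `python3 scripts/elk_guard.py ... --kql '<KQL>'`; if validation fails, do not continue querying ELK.
- `elk_api_query.py` must ignore local proxy environment variables (`HTTP_PROXY`/`HTTPS_PROXY`/`ALL_PROXY`, etc.) and connect to ELK directly; going through a local proxy is forbidden.
- The default time window is the replay time ±15 minutes; widen it to `now-3d~now` only if that returns nothing.
- `TRACE_TARGET_ES` is printed only when `traceTargetIds` is non-empty. If the original request had `traceTargetIds=[]`, the original `requestId` will most likely have no such log lines, so you must replay with `targetIds` injected.
- `phase=request` carries `requestDsl=...`; `phase=response` carries `isError/tookMs/returnedHitCount/totalHitCount`; `targetUrl` has the form `GET /<index or comma-separated indices> [cluster=N] (<desc>)`. See `references/trace-target-es-format.md` for samples.
- If `headers.appId` is missing but `request.body.appId` exists, it may be backfilled as a replay request header; the same applies to `appChannel`. Apart from these two verified fields, no other key auth header may be guessed.
- Field names follow what actually appears in `requestDsl`; check the ES `_mapping` only when a field is unclear.
- Resolve the cluster directly from `[cluster=...]` in `targetUrl`. If there is no cluster marker, extract the actual index name from `requestDsl` / `targetUrl` and route by it. Querying a fixed address directly is forbidden.
- When `[cluster=...]` is present, do not ask the user for `sourceSystem`. `sourceSystem` may be used to disambiguate only when there is no cluster marker and `requestDsl` hits a shared index that maps to several clusters. If the cluster still cannot be pinned down uniquely, abort; no fallback is allowed.
- sit/uat can run `_count`/`_search`/`_mapping` through `POST /api/console/proxy?path=<path>&method=<method>` behind Kibana Dev Tools. prod does not use the Kibana/ELK address; it must go through the Zhongli Cloud ES console, `POST /Elasticsearch/2024-01-11/dataRetrievales/requestEs`, with a request body containing `cinsId/path/method/body`. Both channels may only be used via `scripts/es_proxy_query.py`, and a normal diagnosis never starts Playwright.
- Extract the real stage order from `CHAIN_NAME`. If it is unavailable, read the key chain code dynamically (`SearchLiteFlowService` + `LiteFlowConstants`). Fall back to the default order only if both fail.
- First-loss is decided by `phase=response hit=false` only. A missing downstream `phase=request` alone does not prove that the downstream stage lost the target.
- When an upstream stage has `response hit=false`, a downstream stage without `phase=request` may only be described as "downstream not triggered / no request issued (caused by the upstream loss)", never as "downstream stage lost the target".
- Without `phase=response hit=false` evidence, do not output "vector recall is gone/lost"; if needed, write only "vector stage request not triggered (upstream first-loss at XXX)".
- Validate the first-loss conclusion with `python3 scripts/first_loss_guard.py`.
- Code checks read 2 to 4 files; scanning the whole codebase is forbidden.
- When `appId` (preferring `headers.appId`, then `body.appId`) or another key auth header is missing, do not guess; ask for it to be supplied.
- Input form 1: `env + targetType + targetIds + request.headers + request.body`
- Input form 2: `env + targetType + targetIds + requestId`
- Replay (fresh `requestId` + injected `traceTargetIds`).
- Generate and validate the KQL with `scripts/elk_guard.py`, then query ELK with `requestId` + `TRACE_TARGET_ES` + `targetId`.
cat >/tmp/diag_input.json <<'JSON'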
<input-json>
JSON
jq -e . /tmp/diag_input.json >/dev/null
python3 scripts/prepare_diagnosis.py --input /tmp/diag_input.json
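- Replace `body.requestId` with a fresh value (`<original ID>_replay_<ts>` or `uuidgen`).
- Merge `targetIds` into `body.traceTargetIds`.
- If `headers.appId` is missing but `body.appId` exists, backfill the request header from `body.appId`; the same applies to `appChannel`.
curl -X POST '<base_url>/rag-recall/api/search/keyword' \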
-H 'Content-Type: application/json' \
-H 'appId: <appId>' \
-H 'appChannel: <appChannel>' \
-d '<body-with-fresh-requestId-and-traceTargetIds>'
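The replay preparation described above (fresh `requestId`, merged `traceTargetIds`, backfilled `appId`/`appChannel`) can be sketched in Python as follows. This is only an illustration under stated assumptions: `build_replay_request` is a hypothetical helper, not part of the skill's scripts, and it assumes headers and body are plain dicts.

```python
import copy
import time
import uuid


def build_replay_request(headers, body, target_ids, use_uuid=False):
    """Return (headers, body) for a replay; the inputs are left unmodified.

    Hypothetical helper for illustration only.
    """
    new_headers = dict(headers)
    new_body = copy.deepcopy(body)

    # Fresh requestId: "<original>_replay_<ts>" or a random UUID.
    original_id = new_body.get("requestId")
    if use_uuid or not original_id:
        new_body["requestId"] = str(uuid.uuid4())
    else:
        new_body["requestId"] = f"{original_id}_replay_{int(time.time())}"

    # Merge targetIds into traceTargetIds, keeping order and dropping duplicates.
    merged = list(new_body.get("traceTargetIds") or [])
    for target_id in target_ids:
        if target_id not in merged:
            merged.append(target_id)
    new_body["traceTargetIds"] = merged

    # Backfill only the two verified headers; never guess other auth headers.
    for key in ("appId", "appChannel"):
        if not new_headers.get(key) and new_body.get(key):
            new_headers[key] = new_body[key]

    if not new_headers.get("appId"):
        raise ValueError("appId missing from both headers and body; ask the user")
    return new_headers, new_body
```

The returned body would then be serialized and passed to the `curl -X POST` call; the helper deliberately raises instead of inventing an `appId`.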
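- Record the replay's `requestId`, total hit count, and any error message.
- The KQL combines `requestId` + `targetId` + `TRACE_TARGET_ES` (`link_id=requestId` may be added).
# Generate the recommended KQL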
python3 scripts/elk_guard.py \
--request-id '<replayRequestId>' \
--target-id '<targetId>' \
--mode first \
--emit-template
# Validate the KQL you are about to run; on failure, stop and do not continue
python3 scripts/elk_guard.py \
--request-id '<replayRequestId>' \
--target-id '<targetId>' \
--mode first \
--kql '<your-kql>'
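To make the exact-match rule concrete, here is a minimal sketch of the kind of check a KQL guard could perform. It is not the real `scripts/elk_guard.py`, whose logic is not shown here; `check_first_kql` is a hypothetical function, and it assumes each term appears in the KQL as a double-quoted phrase.

```python
def check_first_kql(kql, request_id, target_id):
    """Return a list of violations; an empty list means the KQL may be run.

    Hypothetical check; assumes terms are written as double-quoted phrases.
    """
    violations = []
    if "*" in request_id or "*" in target_id:
        violations.append("wildcard in requestId/targetId")

    required = (
        ("requestId", request_id),
        ("targetId", target_id),
        ("TRACE_TARGET_ES", "TRACE_TARGET_ES"),
    )
    for label, token in required:
        # The full token must appear quoted, so truncated IDs do not pass.
        if f'"{token}"' not in kql:
            violations.append(f"{label} missing or not quoted exactly")

    if "*" in kql:
        violations.append("wildcard in KQL")
    return violations
```

A non-empty result corresponds to the abort rule: stop and report the reason rather than running a weakened query.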
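- Use the `--emit-template` output as is; trimming it down to requestId-only / targetId-only "because it is too long" is not allowed.
- Query through the ELK API with `scripts/elk_api_query.py` (it automatically reuses the browser session cookie); querying ELK through Playwright pages is forbidden.
python3 scripts/elk_api_query.py \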
--env prod \
--request-id '<replayRequestId>' \
--target-id '<targetId>' \
--mode first \
--time-window 'now-15m~now'
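The time-window rule (replay time ±15 minutes first, then `now-3d~now` only when nothing is found) can be sketched as below. The names `elk_windows` and `search_with_fallback` are hypothetical, and `run_query` stands in for whatever actually queries ELK; the real scripts may express windows differently.

```python
from datetime import datetime, timedelta, timezone


def elk_windows(replay_time, now=None):
    """Yield (start, end) windows: replay time +/-15 min, then the last 3 days."""
    now = now or datetime.now(timezone.utc)
    yield replay_time - timedelta(minutes=15), replay_time + timedelta(minutes=15)
    yield now - timedelta(days=3), now


def search_with_fallback(run_query, replay_time, now=None):
    """Try each window in order and stop at the first one that returns hits.

    run_query is a caller-supplied function (start, end) -> list of hits.
    """
    for start, end in elk_windows(replay_time, now):
        hits = run_query(start, end)
        if hits:
            return hits, (start, end)
    return [], None
```

If both windows come back empty, the result is `([], None)`, which matches the rule that the diagnosis is then treated as lacking a traced reproduction.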
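- First obtain the stage order from `CHAIN_NAME`, then walk it in order to find the first `cmpId` with `phase=response hit=false`.
- `phase=request` only indicates whether a stage was triggered; it cannot replace `phase=response` as first-loss evidence.
- For a DOC target, the output must give both:
  - `FIRST_LOSS` (in full chain order)
  - `DOC_PATH_FIRST_LOSS` (DOC main path only: `full_range_meta_filter -> full_range_docTxtRecall -> recall_doc_vector_v3_filter -> doc_item_vector_retrieval_batch_es -> full_range_rerank`)
python3 scripts/first_loss_guard.py \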
--target-type DOC \
--chain-line 'CHAIN_NAME[_FULL_RANGE_SEARCH_WITH_LLM_] full_range_meta_filter[...]==>full_range_docTxtRecall[...]==>doc_item_vector_retrieval_batch_es[...]==>full_range_rerank[...]' \
--events '[{"cmpId":"full_range_docTxtRecall","phase":"response","hit":true},{"cmpId":"doc_item_vector_retrieval_batch_es","phase":"response","hit":false}]' \
--assert-first-loss doc_item_vector_retrieval_batch_es
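The first-loss rule itself is small enough to sketch. The function below is an illustration, not the logic of `scripts/first_loss_guard.py`; it assumes events are dicts with the `cmpId`/`phase`/`hit` keys used in the `--events` JSON above.

```python
def find_first_loss(chain_order, events):
    """Return the first cmpId, in chain order, whose response has hit=False.

    Only phase=response events count; request events are ignored because a
    missing or present request never proves a loss.
    """
    response_hit = {
        event["cmpId"]: event["hit"]
        for event in events
        if event.get("phase") == "response" and "hit" in event
    }
    for cmp_id in chain_order:
        if response_hit.get(cmp_id) is False:
            return cmp_id
    return None  # no response hit=false evidence, so no first-loss claim
```

Returning `None` when no `phase=response hit=false` event exists mirrors the rule that a loss may not be asserted from request events alone.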
When `CHAIN_NAME` is unavailable, let the script extract the chain order from the code directly:
python3 scripts/first_loss_guard.py \
--target-type DOC \
--repo-root '<rag-recall-root>' \
--chain-id '_FULL_RANGE_SEARCH_WITH_LLM_' \
--events '<events-json>'
- `--chain-order` is a debugging override only, not a regular input.
- If `first_loss_guard.py` returns `BLOCKED`/`FAIL`, do not output "first loss at the vector stage".
python3 scripts/prepare_diagnosis.py \
--input /tmp/diag_input.json \
--config references/env-config.local.yaml \
--request-dsl '<requestDsl-or-raw-elk-line>' \
--source-system '<sourceSystem-if-needed>' | jq '.esConsoleRoute'
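Routing starts from the `targetUrl` field, whose documented shape is `GET /<index or comma-separated indices> [cluster=N] (<desc>)`. A minimal parsing sketch follows; `parse_target_url` is a hypothetical helper, and it assumes the path holds only index names and that both the cluster marker and the description are optional.

```python
import re

# Assumed shape: METHOD /<indices> [cluster=N] (<desc>), last two parts optional.
TARGET_URL_RE = re.compile(
    r"^(?P<method>[A-Z]+)\s+/(?P<indices>[^\s\[\(]+)"
    r"(?:\s+\[cluster=(?P<cluster>[^\]]+)\])?"
    r"(?:\s+\((?P<desc>.*)\))?\s*$"
)


def parse_target_url(target_url):
    """Split a TRACE_TARGET_ES targetUrl into method, indices, cluster, desc."""
    match = TARGET_URL_RE.match(target_url.strip())
    if not match:
        raise ValueError(f"unrecognized targetUrl: {target_url!r}")
    return {
        "method": match["method"],
        "indices": match["indices"].split(","),
        "cluster": match["cluster"],  # None when no [cluster=...] marker
        "desc": match["desc"],
    }
```

When `cluster` comes back as `None`, the index names are the only routing evidence, which is the case where a shared index may require `sourceSystem` to disambiguate.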
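- `requestDsl index route is ambiguous`: first check whether a shared FAQ index was hit; if necessary, supply a `sourceSystem` to disambiguate.
- `unable to resolve ES console route` or `sourceSystem ... has no ES cluster mapping`: abort immediately and supply valid `requestDsl`/`sourceSystem` evidence.
- sit/uat default to `transport=kibana_console_proxy` and need only the actual path/index and the DSL. Index suffixes differ between environments and must be taken from the ELK `requestDsl`/`targetUrl`; guessing a suffix is forbidden.
- prod defaults to `transport=zhongli_cloud_proxy`. You must first resolve a unique Zhongli Cloud (中立云) cluster and use that cluster's `instance_id` as `cinsId`. In production, falling back to the Kibana/ELK address to inspect actual ES content is forbidden.
- Q1: rerun the original `requestDsl`.
- Q2: target existence (DOC uses `doc_id`; FAQ uses `knowledge_base_id`).
- Q3: keep the filter and drop the text `must` clause (needed only for the text stage).
- Use `scripts/es_proxy_query.py` over the configured ES transport (it automatically reuses the browser cookie); operating Playwright pages is forbidden.
- The prod Zhongli Cloud proxy call has a fixed format: `POST /Elasticsearch/2024-01-11/dataRetrievales/requestEs`, with a request body containing `cinsId`, `path`, `method`, and `body`, where `body` must be the JSON string passed to ES, not an object.
- The sit/uat Kibana console proxy call has a fixed format: `POST /api/console/proxy?path=<path>&method=<method>`, and the request body is the JSON object passed to ES.
- Example of rerunning `requestDsl`:
python3 scripts/es_proxy_query.py \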
--env prod \
--request-dsl '<requestDsl-or-raw-elk-line>' \
--path '/<index>/_search' \
--method POST \
--body '<requestDsl-json-body>'
python3 scripts/es_proxy_query.py \
--env prod \
--request-dsl '<requestDsl-or-raw-elk-line>' \
--path '/<index>/_count' \
--method POST \
--body '{"query":{"term":{"doc_id":"<targetId>"}}}'
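The difference between the two transports is easy to get wrong: prod wraps the ES body as a JSON string, while sit/uat send it as a JSON object. The sketch below only illustrates that payload shape. `build_es_proxy_call` is a hypothetical helper, not a substitute for `scripts/es_proxy_query.py`, which remains the only permitted way to issue the request.

```python
import json
from urllib.parse import urlencode


def build_es_proxy_call(env, path, method, es_body, cins_id=None):
    """Return (url_path, payload) in the shape each transport expects."""
    if env in ("sit", "uat"):
        # Kibana console proxy: ES path and method go in the query string,
        # and the request body stays a JSON object.
        query = urlencode({"path": path, "method": method})
        return f"/api/console/proxy?{query}", es_body
    if env == "prod":
        if not cins_id:
            raise ValueError("prod needs the resolved cluster's instance_id as cinsId")
        payload = {
            "cinsId": cins_id,
            "path": path,
            "method": method,
            # The ES body is serialized to a string, not nested as an object.
            "body": json.dumps(es_body, ensure_ascii=False),
        }
        return "/Elasticsearch/2024-01-11/dataRetrievales/requestEs", payload
    raise ValueError(f"unknown env: {env}")
```

For the target-existence query above, the prod payload's `body` would be the string `{"query": {"term": {"doc_id": "<targetId>"}}}` rather than a nested dict.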
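- Connecting to ES directly with `curl` is forbidden. If the browser cookie has expired or the proxy endpoint returns not-logged-in / no-permission, abort first and ask the user to log in to the console again in the browser and rerun the script; falling back to clicking through pages with Playwright is not allowed.
- If prod returns `PROD_ES_PROXY_NOT_CONFIGURED`, the local skill configuration does not yet contain the real Zhongli Cloud (中立云) `requestEs` address. This is a skill maintenance issue: during maintenance it is acceptable to capture the real API with Playwright/Network and write it into `env-config.local.*`, but a normal diagnosis must not start Playwright on the spot.
- `--cookie-domain midea.com` may be added to `scripts/es_proxy_query.py`; hand-writing requests that bypass the script is still not allowed.
- Generate and validate the KQL from `requestId` + `targetId` + `TRACE_TARGET_ES`, then query ELK with `scripts/elk_api_query.py --mode first`.
- If there are no results, widen the window to `now-3d~now`.
- If `traceTargetIds=[]`, or there is still no valid evidence after widening the window, conclude "reproduction with trace not completed" and ask for the complete request so it can be replayed.

Trigger conditions (any one suffices):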
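Minimum required files (choose 2 to 4 by scenario):
- `api/src/main/java/com/midea/jr/robot/rag/recall/api/web/controller/SearchController.java`
- `infrastructure/src/main/java/com/midea/jr/robot/rag/recall/infrastructure/aspect/EsQueryTraceAspect.java`
- `common/src/main/java/com/midea/jr/robot/rag/recall/common/utils/TraceTargetScanUtils.java`
- `common/src/main/java/com/midea/jr/robot/rag/recall/common/constant/LiteFlowConstants.java`
- `domain/src/main/java/com/midea/jr/robot/rag/recall/domain/search/cmp/fullrange/FullRangeDocTxtRecallCmp.java`
- `domain/src/main/java/com/midea/jr/robot/rag/recall/domain/search/cmp/fullrange/RecallDocItemVectorBatchEsCmp.java`
- `domain/src/main/java/com/midea/jr/robot/rag/recall/domain/search/cmp/fullrange/FullRangeFaqTxtRecallCmp.java`

- `phase=response hit=false`: the highest-priority evidence that the stage did not hit the target.
- Downstream `phase=request` missing: only means the stage may not have been triggered. It must be explained through the upstream first-loss and cannot stand alone as "this stage lost the target".
- Target existence = 0: the index lacks the data, the publish has not taken effect, or index routing does not cover the target.
- Target existence > 0 and original DSL = 0: a text-match or filter-condition problem.
- Original DSL > 0 but the target is not in the final result: a ranking / threshold / TopN problem.
- First loss at `full_range_rerank` or later: stop digging into ES and hand the case over as a rerank / admission problem.

- Stage (`cmpId`)
- Status (`response hit=true|false|unknown`; `request triggered|not_triggered` may be added but must not be used to decide first-loss)
- Code location (`file:line`)
- Counts (`total/returned/rank/score`)

- Any forbidden query pattern (targetId-only, requestId-only, missing `TRACE_TARGET_ES`, broad search first) must abort the current path immediately.
- A wildcarded or truncated `requestId` (such as `replay_*`), or a downgraded first KQL, is a violation and aborts immediately.
- Running an ELK API query without first passing `elk_guard.py` is a violation and aborts immediately.
- Connecting to ES directly with `curl`, or hand-writing shell calls to the Zhongli Cloud (中立云) `requestEs` proxy endpoint instead of using `scripts/es_proxy_query.py`, is a violation and aborts immediately.
- On abort, output `BLOCKED_BY_GUARD: <violation reason>`.
- Continue only after `elk_guard.py` validation passes.

- `scripts/prepare_diagnosis.py`
- `scripts/elk_guard.py`
- `scripts/elk_api_query.py`
- `scripts/es_proxy_query.py`
- `scripts/first_loss_guard.py`
- `references/quick-runbook.md`
- `references/trace-target-es-format.md`
- `references/env-config.example.yaml`
- `references/env-config.local.yaml`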
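The ES evidence rules (target existence count versus original DSL count versus final result) form a small decision ladder, sketched below. `interpret_es_evidence` is a hypothetical function, and its return strings are shorthand for the categories named in this document, not output of any script.

```python
def interpret_es_evidence(target_count, original_dsl_count, returned):
    """Map the Q2/Q1 counts and the final outcome to a problem category.

    target_count: hits for the target-existence query (Q2).
    original_dsl_count: hits for the target when rerunning the original DSL (Q1).
    returned: whether the target appeared in the final API result.
    """
    if target_count == 0:
        return "index missing data / publish not effective / routing does not cover target"
    if original_dsl_count == 0:
        return "text match or filter condition problem"
    if not returned:
        return "ranking / threshold / TopN problem"
    return "target returned"
```

The order of the checks matters: existence is tested before the original DSL, so a target that is absent from the index is never misreported as a matching problem.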