skills/gisa-benchmark-general-information-seeking/SKILL.md
Build structured information-seeking agents that decompose complex queries into multi-turn search-and-browse workflows, aggregate results from multiple web sources, and return answers in typed structured formats (items, sets, lists, tables). Applies the GISA benchmark's ReAct-based agent architecture and evaluation methodology. Trigger phrases: "build an information-seeking agent", "search agent pipeline", "multi-turn web research agent", "structured web search workflow", "aggregate information from multiple sources", "web research with structured output"
npx skillsauth add ndpvt-web/arxiv-claude-skills gisa-benchmark-general-information-seeking
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to build and operate information-seeking agents that go beyond single-query search. Based on the GISA benchmark methodology, it teaches how to decompose complex information needs into iterative search-and-browse cycles, aggregate findings from many web pages, and return results in strictly typed structured formats (single items, unordered sets, ordered lists, or multi-column tables). The core insight: real information-seeking requires both deep reasoning (multi-hop navigation to locate specific facts) and wide aggregation (synthesizing data scattered across many sources) -- and the agent must plan which strategy to use at each step.
The GISA framework uses a ReAct (Reason + Act) agent loop with exactly two tools: a Search tool (issues queries to a search API and receives ranked results) and a Browse tool (fetches a URL's content and summarizes it). The agent iterates through cycles of reasoning about what information is still missing, formulating targeted search queries, selecting promising URLs from search results, browsing those pages, and extracting structured facts. The critical architectural constraint is a fixed budget (e.g., 30 tool invocations), forcing the agent to plan efficiently rather than exhaustively crawling.
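The loop described above can be sketched as follows. This is a minimal illustration, not the benchmark's implementation: `run_agent`, `Step`, the `policy` callable, and the stub tools `fake_search` and `fake_browse` are all hypothetical names, and the policy stands in for the model's reasoning step. The only details taken from the text are the two tools and the fixed invocation budget.

```python
from typing import Callable, Dict, List, NamedTuple, Optional, Tuple


class Step(NamedTuple):
    tool: str
    argument: str
    observation: str


# A policy sees the history and the remaining budget, and returns
# (tool_name, argument). The tool name "finish" carries the final answer.
Policy = Callable[[List[Step], int], Tuple[str, str]]


def run_agent(
    policy: Policy,
    tools: Dict[str, Callable[[str], str]],
    budget: int = 30,
) -> Tuple[Optional[str], List[Step]]:
    """Run reason/act cycles until the policy finishes or the budget is spent."""
    history: List[Step] = []
    for used in range(budget):
        tool, argument = policy(history, budget - used)  # reason
        if tool == "finish":
            return argument, history
        observation = tools[tool](argument)  # act
        history.append(Step(tool, argument, observation))
    # Budget exhausted: give the policy one last chance to answer.
    tool, argument = policy(history, 0)
    return (argument if tool == "finish" else None), history


# Stub tools (assumptions): a real agent would call a search API and a fetcher.
def fake_search(query: str) -> str:
    return "https://example.org/report"


def fake_browse(url: str) -> str:
    return f"summary of {url}"


def scripted_policy(history: List[Step], remaining: int) -> Tuple[str, str]:
    """Toy policy: search once, browse the result, then finish."""
    if remaining == 0 or len(history) >= 2:
        return "finish", history[-1].observation if history else "no evidence"
    if not history:
        return "search", "example query"
    return "browse", history[-1].observation


TOOLS = {"search": fake_search, "browse": fake_browse}
answer, trace = run_agent(scripted_policy, TOOLS, budget=30)
print(answer)
print([step.tool for step in trace])
```

Passing the remaining budget to the policy is a design choice made here so that the reasoning step can plan around it; the source only states that the budget is fixed.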
What distinguishes this approach from naive search is answer-type-aware planning. Before searching, the agent classifies the expected output as one of four types -- item (single value), set (unordered collection), list (ordered sequence with explicit sort criteria), or table (multi-column schema with defined columns and tie-breaking rules). This classification drives the search strategy: items need precise targeted queries; sets need breadth-first coverage; lists need both coverage and ordering verification; tables need systematic column-by-column population across many rows.
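One way to make the classification explicit is to record it as a small typed spec before any search is issued. The sketch below is an assumption about how that could look: `AnswerType`, `AnswerSpec`, `STRATEGY`, and `problems` are hypothetical names, and the classification itself (deciding which type applies) is still left to the agent's reasoning.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class AnswerType(Enum):
    ITEM = "item"
    SET = "set"
    LIST = "list"
    TABLE = "table"


# Search strategy per answer type, paraphrasing the paragraph above.
STRATEGY = {
    AnswerType.ITEM: "precise targeted queries",
    AnswerType.SET: "breadth-first coverage",
    AnswerType.LIST: "coverage plus ordering verification",
    AnswerType.TABLE: "systematic column-by-column population",
}


@dataclass(frozen=True)
class AnswerSpec:
    answer_type: AnswerType
    sort_criterion: Optional[str] = None  # required for lists
    columns: Tuple[str, ...] = ()         # required for tables

    def problems(self) -> List[str]:
        """Return reasons the spec is incomplete (empty list means usable)."""
        issues: List[str] = []
        if self.answer_type is AnswerType.LIST and not self.sort_criterion:
            issues.append("list answers need an explicit sort criterion")
        if self.answer_type is AnswerType.TABLE and not self.columns:
            issues.append("table answers need a column schema")
        return issues


spec = AnswerSpec(
    AnswerType.TABLE,
    sort_criterion="worldwide_gross descending",
    columns=("rank", "title", "worldwide_gross"),
)
print(STRATEGY[spec.answer_type], spec.problems())
```

Checking the spec up front catches an underspecified list or table before any tool budget is spent on it.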
The GISA error analysis reveals that 49.2% of failures occur at the search level -- poor query formulation, failure to follow hyperlinks within pages, and inability to resolve conflicting information across sources. The key operational lesson: agents should issue fewer but more targeted queries (humans average 3.5 queries vs. agents' 7.6), browse more pages per query (humans browse 19 pages vs. agents' 4.6), and aggressively follow in-page hyperlinks rather than returning to search.
1. Classify the answer type. Analyze the user's information need and determine the output format: item (single fact), set (unordered collection), list (ordered sequence -- identify the sort criterion), or table (define the column schema and any tie-breaking/sort rules). Write this classification explicitly before proceeding.
2. Decompose the query into sub-questions. Break the information need into atomic search tasks. For tables, each column may require separate searches. For lists, identify both the membership criterion (what belongs in the list) and the ordering criterion (how to sort). For sets, enumerate the categories of items to collect.
3. Formulate the first search query. Write a precise, keyword-optimized search query targeting the highest-priority sub-question. Prefer specific entity names, date ranges, and domain-specific terms over natural language questions. Avoid overly broad queries.
4. Process search results and select URLs. From the search API response (typically top-10 results), select the 2-4 most promising URLs based on domain authority, snippet relevance, and likelihood of containing structured data (e.g., Wikipedia tables, official databases, ranking sites). Prioritize authoritative primary sources over aggregator blogs.
5. Browse and extract structured facts. Fetch each selected URL, parse the content, and extract facts relevant to the query. Store extracted data in a running structured format (JSON objects or TSV rows). Track which cells/fields are still empty and which sources conflict.
6. Resolve conflicts and verify. When sources disagree, issue targeted verification queries (e.g., add "site:official-source.org" or search for the specific conflicting fact). Apply a recency heuristic: for live data, prefer the most recently updated source.
7. Assess completeness and iterate. Check the accumulated data against the answer schema. Identify missing rows, empty cells, or unverified ordering. If gaps remain and the tool budget allows, return to step 3 with refined queries targeting the missing information specifically.
8. Follow in-page links aggressively. When a browsed page contains hyperlinks to related entities (e.g., a list page linking to individual entries), follow those links rather than issuing new search queries. This mirrors the human browsing pattern that GISA found correlates with higher accuracy.
9. Normalize the output. Apply consistent formatting: lowercase column headers, strip currency symbols and commas from numbers, round floats to a consistent precision, normalize strings to lowercase for comparison, represent missing values as empty strings. Format tables as TSV with a header row.
10. Validate and emit the final answer. Verify the output matches the declared answer type. For lists, confirm the sort order. For tables, confirm all required columns are present and rows are ordered correctly. Emit the answer in the specified structured format.
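The normalization and validation steps that close the workflow above can be sketched as follows. This is a minimal illustration under stated assumptions: `normalize_cell`, `to_tsv`, and `is_sorted_by` are hypothetical helper names, two decimal places is an arbitrary choice of "consistent precision", the set of currency symbols is illustrative, and the sort check assumes the sort column is fully populated with numbers. The sample figures are the illustrative ones used in this document's table example.

```python
import re
from typing import Dict, Sequence


def normalize_cell(value: object, precision: int = 2) -> str:
    """Return a canonical string for one table cell."""
    if value is None:
        return ""  # missing values become empty strings
    if isinstance(value, float):
        return f"{value:.{precision}f}"
    text = str(value).strip()
    numeric = re.sub(r"[$€£¥,]", "", text)  # drop currency symbols and commas
    if re.fullmatch(r"-?\d+", numeric):
        return numeric
    if re.fullmatch(r"-?\d+\.\d+", numeric):
        return f"{float(numeric):.{precision}f}"
    return text.lower()  # non-numeric text: lowercase for comparison


def to_tsv(columns: Sequence[str], rows: Sequence[Dict[str, object]]) -> str:
    """Emit a TSV string with a lowercase header row."""
    header = [c.strip().lower() for c in columns]
    lines = ["\t".join(header)]
    for row in rows:
        lines.append("\t".join(normalize_cell(row.get(c)) for c in columns))
    return "\n".join(lines)


def is_sorted_by(tsv: str, column: str, descending: bool = True) -> bool:
    """Check row order on a numeric column (assumes no empty cells in it)."""
    lines = tsv.splitlines()
    index = lines[0].split("\t").index(column)
    values = [float(line.split("\t")[index]) for line in lines[1:]]
    pairs = zip(values, values[1:])
    return all(a >= b if descending else a <= b for a, b in pairs)


rows = [
    {"Title": "Inside Out 2", "Worldwide_Gross": "$1,696,200,000", "Distributor": None},
    {"Title": "Deadpool & Wolverine", "Worldwide_Gross": "$1,338,500,000",
     "Distributor": "Walt Disney Studios"},
]
table = to_tsv(["Title", "Worldwide_Gross", "Distributor"], rows)
print(table)
print(is_sorted_by(table, "worldwide_gross", descending=True))
```

Running the sort check on the emitted string rather than on the in-memory rows verifies the same text that will be returned as the answer.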
Example 1: Table query -- aggregate scattered data
User: "Find the top 10 highest-grossing films of 2024 worldwide, with columns
for rank, title, worldwide gross, domestic gross, and distributor."
Approach:
1. Classify answer type: TABLE with schema [rank, title, worldwide_gross,
domestic_gross, distributor], sorted by worldwide_gross descending.
2. Search query 1: "highest grossing films 2024 worldwide box office"
3. Browse Box Office Mojo and The Numbers -- extract top 10 with worldwide
gross and domestic gross figures.
4. For any missing distributor fields, search: "[film title] 2024 distributor
production company"
5. Cross-verify gross figures between Box Office Mojo and The Numbers.
Resolve discrepancies by preferring Box Office Mojo (primary source).
6. Normalize: strip "$" and commas from gross figures, lowercase headers.
Output:
rank	title	worldwide_gross	domestic_gross	distributor
1	inside out 2	1696200000	652900000	walt disney studios
2	deadpool & wolverine	1338500000	636700000	walt disney studios
3	moana 2	1125700000	449600000	walt disney studios
...
Example 2: List query -- ordered sequence with sort criterion
User: "List the 5 most-spoken languages in the world by total number of
speakers (native + non-native), in descending order."
Approach:
1. Classify answer type: LIST, sort criterion = total speakers descending.
2. Search query: "most spoken languages world total speakers 2024 ethnologue"
3. Browse Ethnologue or Wikipedia's "List of languages by total number of
speakers" page.
4. Extract language names and total speaker counts.
5. Verify ordering: confirm English > Mandarin in total speakers (native +
L2) or vice versa by checking a second source.
6. Normalize: lowercase language names, represent speaker counts as integers.
Output:
1. english
2. mandarin chinese
3. hindi
4. spanish
5. french
Example 3: Set query -- unordered collection
User: "What countries have successfully landed a spacecraft on the Moon?"
Approach:
1. Classify answer type: SET (unordered, membership-only).
2. Search query: "countries successful moon landing spacecraft"
3. Browse NASA page and Wikipedia "Moon landing" article.
4. Extract country names. Follow in-page links to verify recent missions
(e.g., India's Chandrayaan-3, Japan's SLIM).
5. Cross-verify with a second source to ensure no country is missed.
6. Normalize: lowercase country names.
Output:
{soviet union, united states, china, india, japan}
Do:
Prefer specific queries: "2024 Nobel Physics winner nationality" outperforms "Nobel Prize 2024".
Avoid:
| Error Type | Frequency | Mitigation |
|---|---|---|
| Ineffective query formulation | ~14% of failures | Reformulate with more specific terms, entity names, date qualifiers, or site-specific operators. If first query returns irrelevant results, do not repeat it -- change the query structure. |
| Failure to follow hyperlinks | ~18% of failures | When a browsed page references related pages (e.g., "see also", linked entity names), follow those links directly rather than returning to search. |
| Conflicting information across sources | ~18% of failures | Establish a source hierarchy before searching (e.g., official > Wikipedia > news > blog). When conflicts arise, search for the specific disputed fact with a site-restricted query. |
| Extraction errors | ~16% of failures | After extracting data, re-read the source passage to verify the extraction. Pay special attention to table cell alignment and header-to-value mapping. |
| Instruction-following errors | ~32% of failures | Re-read the output format requirements before emitting the final answer. Verify column order, sort direction, and delimiter format match the specification exactly. |
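The source-hierarchy mitigation for conflicting information can be sketched as a simple ranking. The names `Claim`, `SOURCE_RANK`, and `resolve_conflict` are hypothetical, the tier order copies the example hierarchy in the table (official > Wikipedia > news > blog), and using recency as a tie-breaker within a tier is an assumption that combines the hierarchy with the recency heuristic from the workflow. The claim values and dates are made up for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List

# Lower rank wins. Unknown source kinds are ranked last.
SOURCE_RANK: Dict[str, int] = {"official": 0, "wikipedia": 1, "news": 2, "blog": 3}


@dataclass(frozen=True)
class Claim:
    value: str
    source_kind: str
    updated: date


def resolve_conflict(claims: List[Claim]) -> Claim:
    """Pick the claim from the best source tier; break ties by recency."""
    if not claims:
        raise ValueError("no claims to resolve")
    return min(
        claims,
        key=lambda c: (
            SOURCE_RANK.get(c.source_kind, len(SOURCE_RANK)),
            -c.updated.toordinal(),  # newer dates sort first within a tier
        ),
    )


claims = [
    Claim("100", "blog", date(2025, 1, 10)),
    Claim("95", "official", date(2024, 6, 1)),
    Claim("97", "official", date(2024, 9, 15)),
]
winner = resolve_conflict(claims)
print(winner.value, winner.source_kind)
```

A ranking like this only settles which extracted value to keep; the table's advice to issue a site-restricted verification query still applies when the top-tier sources themselves disagree.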
Paper: GISA: A Benchmark for General Information-Seeking Assistant (Zhu et al., 2026)
Key takeaway: Read Section 4 (Error Analysis) and Table 5 for the breakdown of where search agents fail -- 49% of errors are at the search level (bad queries, missed hyperlinks, unresolved conflicts), and 47% are at the output level (extraction and formatting mistakes). The human trajectory analysis in Section 3.3 shows that fewer queries + more browsing per query is the optimal strategy.