skills/effgen-enabling-small-language/SKILL.md
Deploy and optimize small language models (SLMs) as autonomous agents using the effGen framework. Implements prompt compression (70-80% context reduction), five-factor complexity routing, intelligent task decomposition, and unified memory for local SLM-based agent systems. Triggers: 'set up effgen agent', 'deploy small language model agent', 'optimize prompts for small model', 'compress agent context for SLM', 'build local AI agent with effgen', 'route tasks by complexity for small models'
npx skillsauth add ndpvt-web/arxiv-claude-skills effgen-enabling-small-language
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to help users build, deploy, and optimize autonomous agent systems powered by small language models (1.5B-32B parameters) using the effGen framework. Rather than relying on API calls to large cloud models, effGen applies four complementary techniques — prompt compression, task decomposition, complexity-based routing, and unified memory — to make local SLMs perform agent tasks that normally require GPT/Claude-scale models. Claude can guide users through installing effGen, configuring agents with appropriate tools, writing optimized prompts that fit within tight context windows, and architecting multi-agent systems that decompose complex workflows.
effGen's core insight is that small language models fail at agent tasks not because they lack reasoning ability, but because standard frameworks waste their limited context windows on verbose prompts, redundant instructions, and tasks beyond their complexity ceiling. The framework addresses this through four interlocking techniques that have complementary scaling behavior: prompt optimization benefits smaller models more (11.2% gain at 1.5B vs. 2.4% at 32B), while complexity routing benefits larger models more (3.6% at 1.5B vs. 7.9% at 32B), meaning the combination yields consistent improvements at every scale.
Prompt Optimization uses a five-stage pipeline to compress context by 70-80% without losing task semantics. It replaces verbose phrases with concise equivalents (e.g., "due to the fact that" becomes "because"), eliminates redundant sentences, splits long sentences at conjunctions, converts prose to bullet-point structure, and reorganizes content into labeled sections (Task, Instructions, Output). The optimizer is model-size-aware — it categorizes models into four tiers (TINY <1B, SMALL 1-3B, MEDIUM 3-7B, LARGE 7B+) and adjusts token budgets, few-shot example counts, and compression aggressiveness per tier.
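As a rough illustration of the size tiers described above, the sketch below maps a parameter count (in billions) to a tier name. The function name `tier_for` is hypothetical, and the boundary handling is an assumption: the text gives overlapping ranges (1-3B, 3-7B, 7B+), so this sketch treats exactly 3B as SMALL and exactly 7B as LARGE. It is not effGen's actual implementation.

```python
def tier_for(params_billion: float) -> str:
    """Map a model size in billions of parameters to a tier name.

    Boundary handling is an assumption: 3B counts as SMALL, 7B as LARGE.
    """
    if params_billion < 1:
        return "TINY"
    if params_billion <= 3:
        return "SMALL"
    if params_billion < 7:
        return "MEDIUM"
    return "LARGE"


for size in (0.5, 1.5, 3, 7, 32):
    print(size, tier_for(size))
```

A tier name like this could then be used to look up per-tier token budgets and few-shot limits.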
Complexity-Based Routing scores incoming tasks on five weighted factors: task length (15%), number of requirements (25%), domain breadth across 8 domains (20%), tool requirements across 8 categories (20%), and reasoning depth from simple lookup to multi-step synthesis (20%). Tasks scoring above a configurable threshold (default 7.0/10) get decomposed into subtasks; those below it execute directly. The router selects from five execution strategies — single agent, parallel sub-agents, sequential sub-agents, hierarchical, or hybrid — based on the dependency structure of the decomposed subtasks.
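The five weights above sum to 100%, so one plausible reading is a weighted sum of per-factor scores. The sketch below assumes each factor is already scored on a 0-10 scale and that the factors combine linearly; the names `WEIGHTS`, `overall_score`, and `should_decompose`, and the example factor values, are illustrative rather than taken from effGen.

```python
# Weights as stated in the text; the dictionary keys are hypothetical names.
WEIGHTS = {
    "task_length": 0.15,
    "requirements": 0.25,
    "domain_breadth": 0.20,
    "tool_requirements": 0.20,
    "reasoning_depth": 0.20,
}
THRESHOLD = 7.0  # default decomposition threshold from the text


def overall_score(factors: dict) -> float:
    """Weighted sum of per-factor scores, each assumed to lie in 0-10."""
    return sum(WEIGHTS[name] * factors[name] for name in WEIGHTS)


def should_decompose(factors: dict, threshold: float = THRESHOLD) -> bool:
    """Decompose only when the overall score is strictly above the threshold."""
    return overall_score(factors) > threshold


# Made-up factor scores for a multi-domain research-and-code task.
example = {
    "task_length": 6,
    "requirements": 9,
    "domain_breadth": 8,
    "tool_requirements": 8,
    "reasoning_depth": 9,
}
score = overall_score(example)
print(round(score, 2), should_decompose(example))
```

With these made-up factor scores the weighted sum is 0.9 + 2.25 + 1.6 + 1.6 + 1.8 = 8.15, which is above the default threshold, so the task would be decomposed.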
Install effGen and load a model. Run pip install effgen (or pip install effgen[vllm] for 5-10x faster inference). Load a quantized SLM with load_model("Qwen/Qwen2.5-1.5B-Instruct", quantization="4bit") — 4-bit quantization is critical for fitting models into consumer GPU memory.
Analyze task complexity before choosing architecture. Use the five-factor scoring rubric: count requirements (questions, conjunctions, bullets), identify domains touched (technical, research, business, creative, data, scientific, legal, financial), check tool needs (web search, code execution, file I/O, retrieval), and assess reasoning depth (lookup/define vs. compare/synthesize/architect). Score 0-10; if above 7.0, plan for sub-agent decomposition.
Compress the system prompt and task context. Apply the five-stage optimization pipeline: (a) replace verbose phrases with concise equivalents, (b) split sentences longer than ~20 words at conjunctions, (c) remove duplicate or near-duplicate sentences, (d) convert paragraph-form instructions to bullet lists, (e) organize into labeled sections. Target 70-80% reduction. For a SMALL model (1-3B), budget roughly 512 tokens for the system prompt and cap few-shot examples at 30% of total prompt budget.
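The five stages listed above can be approximated with plain string processing. The following is a minimal sketch under stated assumptions, not effGen's optimizer: the phrase table is hypothetical apart from the "due to the fact that" pair, long sentences are split only at "and"/"but", and duplicates are detected by exact match after normalization. A toy pipeline like this will not reach the 70-80% reduction reported for the real one.

```python
import re

# Hypothetical phrase table; only the first pair comes from the text above.
VERBOSE_PHRASES = {
    "due to the fact that": "because",
    "in order to": "to",
    "at this point in time": "now",
}


def replace_verbose(text: str) -> str:
    """Stage (a): swap verbose phrases for concise equivalents."""
    for phrase, short in VERBOSE_PHRASES.items():
        text = re.sub(re.escape(phrase), short, text, flags=re.IGNORECASE)
    return text


def split_sentences(text: str) -> list:
    """Naive sentence splitter on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def split_long(sentence: str, max_words: int = 20) -> list:
    """Stage (b): split sentences longer than max_words at 'and'/'but'."""
    if len(sentence.split()) <= max_words:
        return [sentence]
    parts = re.split(r",?\s+(?:and|but)\s+", sentence)
    return [p.strip() for p in parts if p.strip()]


def dedupe(sentences: list) -> list:
    """Stage (c): drop sentences that repeat after normalization."""
    seen, unique = set(), []
    for s in sentences:
        key = re.sub(r"\W+", " ", s.lower()).strip()
        if key not in seen:
            seen.add(key)
            unique.append(s)
    return unique


def compress(task: str, instructions: str, output_format: str) -> str:
    """Stages (d) and (e): bullet the instructions under labeled sections."""
    sentences = []
    for s in split_sentences(replace_verbose(instructions)):
        sentences.extend(split_long(s))
    bullets = [f"- {s.rstrip('.')}" for s in dedupe(sentences)]
    return "\n".join(
        [f"Task: {task}", "Instructions:", *bullets, f"Output: {output_format}"]
    )


compressed = compress(
    "answer questions",
    "Cite sources. Cite sources. Be brief due to the fact that context is small.",
    "short answer",
)
print(compressed)
```

Comparing token counts before and after, using the target model's own tokenizer, is the practical check on whether the prompt fits the tier's budget.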
Select and configure tools. Choose from effGen's built-in tools: Calculator, WebSearch (DuckDuckGo), PythonREPL, CodeExecutor (Docker-sandboxed), FileOps, Retrieval (embedding-based RAG), AgenticSearch (grep-based exact match). Register them in the AgentConfig.tools list. For retrieval agents, prepare your knowledge base and select the embedding backend (SentenceTransformer or SimpleEmbedding) and vector store (FAISS or Chroma).
Configure the agent with an optimized system prompt. Create an AgentConfig with the compressed prompt, selected tools, and model. Use task-specific prompt adaptations: for coding tasks, preserve syntax structure; for reasoning tasks, inject "step by step" directives; for analysis tasks, request structured output.
Decompose complex tasks into subtasks. For tasks scoring above the complexity threshold, use the decomposition engine to split into parallel (independent) or sequential (dependent) subtasks. Validate: check for circular dependencies, balance complexity across subtasks, and run topological sort for execution order. Visualize the dependency graph before execution.
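The validation described above (cycle detection plus a topological order) can be sketched with Python's standard-library `graphlib`. The subtask names and the `execution_batches` helper are hypothetical; effGen's decomposition engine may represent dependencies differently.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical subtask graph: each key maps to the set of subtasks it depends on.
subtasks = {
    "research": set(),
    "simulate": set(),
    "compare": {"research"},
    "report": {"simulate", "compare"},
}


def execution_batches(graph: dict) -> list:
    """Return batches of subtasks; every subtask in a batch has all of its
    dependencies satisfied by earlier batches."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the graph has a circular dependency
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = execution_batches(subtasks)
print(batches)

try:
    execution_batches({"a": {"b"}, "b": {"a"}})
    has_cycle = False
except CycleError:
    has_cycle = True
print("cycle detected:", has_cycle)
```

Subtasks within one batch are independent of each other and could run in parallel, while the batches themselves must run in order.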
Route subtasks to appropriate execution strategies. Map each subtask to a specialization (research, coding, analysis, synthesis) and assign to sub-agents. Choose parallel execution for independent subtasks (e.g., "search for X" and "search for Y"), sequential for dependent chains (e.g., "fetch data" then "analyze data"), or hybrid for mixed dependency structures.
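To make the parallel versus sequential distinction concrete, here is a small sketch using only the standard library. `fake_agent` is a hypothetical stand-in for a sub-agent call, and the two runner functions are illustrative, not part of effGen.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_agent(subtask: str, context: str = None) -> str:
    """Hypothetical stand-in for a sub-agent; echoes what it was asked to do."""
    prefix = f"{context} -> " if context else ""
    return f"{prefix}done({subtask})"


def run_parallel(subtasks: list) -> list:
    """Independent subtasks: run concurrently, results in input order."""
    with ThreadPoolExecutor(max_workers=max(1, len(subtasks))) as pool:
        return list(pool.map(fake_agent, subtasks))


def run_sequential(subtasks: list) -> str:
    """Dependent chain: each subtask receives the previous result as context."""
    context = None
    for subtask in subtasks:
        context = fake_agent(subtask, context)
    return context


parallel_results = run_parallel(["search for X", "search for Y"])
chain_result = run_sequential(["fetch data", "analyze data"])
print(parallel_results)
print(chain_result)
```

A hybrid strategy would combine the two: run each independent group in parallel, then feed the results into the dependent steps.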
Configure the unified memory system. Set up short-term memory for active conversation context, long-term memory (JSON or SQLite backend) for persistent facts and session history, and vector memory (FAISS or Chroma) for semantic search over past interactions. Classify stored information by MemoryType and ImportanceLevel to manage retrieval priority.
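The short-term and long-term layers described above can be pictured with a bounded buffer and a SQLite table. This is a sketch under assumptions: the `Importance` enum and `MemorySketch` class are hypothetical stand-ins (the members of effGen's `ImportanceLevel` and `MemoryType` are not given in the text), and the vector-memory layer is left out.

```python
import sqlite3
from collections import deque
from enum import IntEnum


class Importance(IntEnum):
    """Hypothetical importance levels; not effGen's ImportanceLevel."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


class MemorySketch:
    def __init__(self, short_term_size: int = 4, db_path: str = ":memory:"):
        # Short-term memory: the most recent turns only.
        self.short_term = deque(maxlen=short_term_size)
        # Long-term memory: persistent facts in SQLite.
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS facts (text TEXT, importance INTEGER)"
        )

    def remember_turn(self, text: str) -> None:
        self.short_term.append(text)

    def store_fact(self, text: str, importance: Importance) -> None:
        self.db.execute("INSERT INTO facts VALUES (?, ?)", (text, int(importance)))
        self.db.commit()

    def recall(self, limit: int = 3) -> list:
        """Return stored facts, most important first, oldest first within a level."""
        rows = self.db.execute(
            "SELECT text FROM facts ORDER BY importance DESC, rowid ASC LIMIT ?",
            (limit,),
        ).fetchall()
        return [row[0] for row in rows]


memory = MemorySketch(short_term_size=2)
for turn in ("hi", "what is 2+2?", "4"):
    memory.remember_turn(turn)
memory.store_fact("user prefers metric units", Importance.LOW)
memory.store_fact("user's project uses Python 3.11", Importance.HIGH)
print(list(memory.short_term))
print(memory.recall())
```

Ordering recall by importance is one simple way to let high-priority facts reach the small model's limited context first.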
Run the agent and monitor execution. Execute with agent.run("task") or use the CLI (effgen run "task"). For interactive sessions, use effgen chat. For production deployment, use effgen serve --port 8000 to expose an API. Enable DETAILED_LOGGING during development to trace tool calls, decomposition decisions, and memory operations.
Iterate on the complexity threshold and prompt compression. Review the router's decision history to analyze strategy distribution and sub-agent usage rates. If simple tasks are being over-decomposed, raise the threshold above 7.0. If complex tasks fail in single-agent mode, lower it. Monitor prompt token counts and adjust compression aggressiveness per model tier.
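One way to review routing decisions is to tabulate them offline. The sketch below assumes a hypothetical log format (a list of dicts with `score` and `strategy` keys); effGen's actual decision-history structure may differ, and the records here are made up.

```python
from collections import Counter

# Hypothetical decision log; the field names and values are illustrative.
history = [
    {"score": 2.1, "strategy": "single"},
    {"score": 8.7, "strategy": "hybrid"},
    {"score": 7.4, "strategy": "parallel"},
    {"score": 3.0, "strategy": "single"},
]


def strategy_distribution(records: list) -> dict:
    """Fraction of decisions that used each execution strategy."""
    counts = Counter(r["strategy"] for r in records)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


def decomposition_rate(records: list, threshold: float = 7.0) -> float:
    """Fraction of tasks whose score is strictly above the threshold."""
    if not records:
        return 0.0
    return sum(r["score"] > threshold for r in records) / len(records)


distribution = strategy_distribution(history)
print(distribution)
print(decomposition_rate(history), decomposition_rate(history, threshold=8.0))
```

Recomputing the decomposition rate at a few candidate thresholds shows how many tasks would shift between single-agent and decomposed execution before the setting is changed.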
Example 1: Local calculator agent with a 1.5B model
User: "Set up a basic effGen agent that can do math and run Python code locally on my machine with a small model."
Approach:
Output:
from effgen import Agent, load_model
from effgen.core.agent import AgentConfig
from effgen.tools.builtin import Calculator, PythonREPL
model = load_model("Qwen/Qwen2.5-1.5B-Instruct", quantization="4bit")
# Compressed system prompt for SMALL model tier (~80 tokens vs ~400 verbose)
system_prompt = """Role: math/code assistant
- Solve math with Calculator tool
- Run code with PythonREPL tool
- Return final numeric answer
- Show work in steps"""
config = AgentConfig(
    name="math_agent",
    model=model,
    tools=[Calculator(), PythonREPL()],
    system_prompt=system_prompt
)
agent = Agent(config=config)
result = agent.run("Calculate compound interest on $10000 at 5% for 3 years")
print(result)
Example 2: Complexity-scored task routing for a multi-domain query
User: "I have queries of varying difficulty — some are simple lookups, others need web research and code generation. How do I route them efficiently?"
Approach:
Output:
from effgen.core.complexity_analyzer import ComplexityAnalyzer
from effgen.core.router import SubAgentRouter
analyzer = ComplexityAnalyzer()
# Simple task — scores low, runs as single agent
simple = analyzer.analyze("What is the capital of France?")
# => ComplexityScore(overall=2.1, should_decompose=False)
# Complex task — scores high, triggers decomposition
complex_task = analyzer.analyze(
    "Research the latest advances in quantum computing, write a Python "
    "simulation of Grover's algorithm, and produce a comparison table "
    "of gate-based vs. annealing approaches with citations."
)
# => ComplexityScore(overall=8.7, should_decompose=True)
# Factors: requirements=3, domains=[technical,research,scientific],
# tools=[web_search, code_executor], reasoning=synthesis
router = SubAgentRouter(complexity_analyzer=analyzer)
decision = router.route(complex_task)
# => Strategy: hybrid
# Subtask 1 (parallel): "Research quantum computing advances" -> research agent
# Subtask 2 (parallel): "Write Grover's algorithm simulation" -> coding agent
# Subtask 3 (sequential, depends on 1): "Compare gate-based vs annealing" -> analysis agent
Example 3: Prompt compression for a retrieval-augmented agent
User: "My RAG agent's prompts are too long for my 3B model's context window. Help me compress them."
Approach:
Output:
from effgen.prompts.optimizer import PromptOptimizer, OptimizerConfig
config = OptimizerConfig(model_size="SMALL") # 1-3B tier
optimizer = PromptOptimizer(config=config)
verbose_prompt = """
You are a helpful research assistant. Please carefully analyze the provided
documents and answer the user's question based on the information contained
within them. If the answer is not found in the documents, please let the
user know that you could not find the relevant information. Please provide
citations for any claims you make. Due to the fact that accuracy is important,
please double-check your answers before responding.
"""
optimized = optimizer.optimize(verbose_prompt)
# Result (~75% compression):
# "Role: research assistant
# - Answer from provided docs only
# - Cite sources
# - State if info not found
# - Verify before responding"
print(f"Original: ~{len(verbose_prompt.split())} words")
print(f"Optimized: ~{len(optimized.split())} words")
# Original: ~65 words -> Optimized: ~18 words
Use quantization="4bit" when loading models for consumer GPUs. This is essential for fitting 3B+ models into 8GB VRAM while maintaining adequate quality for agent tasks.
Use Docker sandboxing (CodeExecutor) for any agent that runs user-provided or LLM-generated code, because SLMs are more prone to generating unsafe code than large models.
The ensure_within_context() method automatically truncates supplementary context while preserving the core task prompt. Monitor for this and consider moving to the next model tier up.
[...] max_retries in the agent config.
[...] SimpleEmbedding with cosine similarity over numpy arrays. For production, install the full backend (pip install effgen[faiss] or pip install effgen[chroma]).
Paper: EffGen: Enabling Small Language Models as Capable Autonomous Agents (Srivastava et al., 2026). Focus on Section 3 for the four-technique architecture, Table 2 for the complementary scaling analysis (prompt optimization vs. complexity routing across model sizes), and the 13-benchmark evaluation in Section 5.
Code: github.com/ctrl-gaurav/effGen (MIT licensed, installable via pip install effgen).