skills/dziribot-rag-intelligent-conversational/SKILL.md
Build dialect-aware RAG conversational agents that handle non-standard orthography, code-switching, and multi-script input. Uses a dual-path architecture: deterministic NLU for structured flows + RAG fallback for open-domain queries. Trigger phrases: 'build a dialect chatbot', 'RAG agent for Arabic dialect', 'handle code-switching in chatbot', 'multi-script NLU pipeline', 'Algerian Arabic conversational agent', 'dialect-aware customer service bot'
npx skillsauth add ndpvt-web/arxiv-claude-skills dziribot-rag-intelligent-conversational
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill teaches Claude to build hybrid conversational agents that handle non-standardized dialects with code-switching, multi-script input, and orthographic variation. The core architecture from DziriBOT combines a fast deterministic NLU path (Rasa DIET or fine-tuned transformer) for structured intent routing with a RAG fallback path for knowledge-intensive open-domain responses. This dual-path design achieves sub-100ms latency on structured queries while still answering novel questions grounded in enterprise documentation.
Dual-Path Routing Architecture. DziriBOT routes every user message down one of two paths. The deterministic path uses a Rasa NLU pipeline (WhitespaceTokenizer + RegexFeaturizer + CountVectorsFeaturizer with character n-grams + DIET classifier) or a fine-tuned DziriBERT model to classify intents across 69 classes with sub-100ms latency. When the classifier confidence meets or exceeds a threshold, the system triggers structured dialogue flows (forms, slot-filling, API calls). When confidence is below the threshold, the message falls through to the dynamic RAG path, which embeds the query with intfloat/multilingual-e5-base, retrieves relevant chunks from a FAISS HNSW index, re-ranks results, and generates a grounded answer via a quantized LLM (Llama-3.2-3B INT8).
Multi-Script Preprocessing Pipeline. The critical enabler is a normalization layer that unifies orthographic chaos before any model sees the text. For Arabic script: all Alef variants collapse to plain Alef, Ta Marbuta maps to terminal Ha, Alef Maqsura maps to Ya, Kashida decorative elongations are stripped, and diacritics are removed. For Latin/Arabizi script: phonetic numeral de-substitution (7->h, 3->a, 9->q), lowercasing, apostrophe normalization. Both scripts get repeated-character squashing ("baaaaazef" -> "bazef") and privacy masking (phone numbers -> [PHONE] token). This preprocessing alone accounts for significant accuracy gains, particularly on rare intents with high orthographic noise.
Low-Resource Data Augmentation. With 69 intent classes and a severe long-tail distribution (50% of classes have <10 examples), the paper addresses data scarcity through three augmentation strategies: manual paraphrasing (3-5 semantic variants per rare intent by native speakers), lexical synonym substitution within the dialect ("nheb" -> "bghit"/"hab"), and supervised back-translation through French with a semantic similarity threshold of >0.8 to filter bad translations. This brought the minimum class size to 13 (Arabic) and 28 (Latin) samples per intent.
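The lexical synonym substitution strategy can be sketched as follows. This is a minimal illustration, assuming a small hand-built dialect synonym dictionary: the `SYNONYMS` table and the `synonym_variants` helper are hypothetical names, and only the "nheb" -> "bghit"/"hab" entry comes from the description above.

```python
from itertools import product

# Hypothetical dialect synonym dictionary; only this entry is taken from the text.
SYNONYMS = {
    "nheb": ["bghit", "hab"],
}


def synonym_variants(utterance: str, synonyms: dict[str, list[str]]) -> list[str]:
    """Return every variant of `utterance` obtained by swapping in synonyms.

    The original utterance itself is excluded from the result.
    """
    tokens = utterance.split()
    # For each token, the candidates are the token itself plus its synonyms.
    options = [[tok] + synonyms.get(tok, []) for tok in tokens]
    variants = [" ".join(combo) for combo in product(*options)]
    return [v for v in variants if v != utterance]


print(synonym_variants("nheb nactivi win max", SYNONYMS))
# → ['bghit nactivi win max', 'hab nactivi win max']
```

Each generated variant keeps the intent label of its source utterance, so a native speaker should still review the output before it is added to the training set.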
Enumerate all intent classes for your domain. Expect a long-tail distribution. For each intent, write at least 10-15 seed utterances covering common spelling variants, both scripts if applicable, and code-switched forms. Store as JSONL with fields: text, intent, script (arabic/latin), entities.
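A short sketch of the JSONL layout and a long-tail check may help here. The two records and the helper names (`to_jsonl`, `rare_intents`) are illustrative assumptions; only the field names (text, intent, script, entities) come from the step above.

```python
import json
from collections import Counter

# Hypothetical seed records; the field names follow the schema described above.
records = [
    {"text": "bghit nactivi win max", "intent": "activate_service",
     "script": "latin", "entities": []},
    {"text": "واش الفرق بين pixX و sama", "intent": "general_promo_question",
     "script": "arabic", "entities": []},
]


def to_jsonl(rows: list[dict]) -> str:
    """Serialize rows as JSONL, keeping Arabic characters human-readable."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


def rare_intents(jsonl: str, min_examples: int = 10) -> dict[str, int]:
    """Count examples per intent and return the intents below `min_examples`."""
    counts = Counter(
        json.loads(line)["intent"] for line in jsonl.split("\n") if line.strip()
    )
    return {intent: n for intent, n in counts.items() if n < min_examples}


print(rare_intents(to_jsonl(records)))
# → {'activate_service': 1, 'general_promo_question': 1}
```

Running `rare_intents` after each data collection round gives a quick list of the classes that still need seed utterances or augmentation.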
Implement a normalization module with these ordered steps:
import re


def normalize_arabic(text: str) -> str:
    """Normalize Arabic-script dialect text."""
    # 1. Strip diacritics (tashkeel)
    text = re.sub(r'[\u064B-\u065F\u0670]', '', text)
    # 2. Unify Alef variants -> plain Alef
    text = re.sub(r'[\u0622\u0623\u0625]', '\u0627', text)
    # 3. Alef Maqsura -> Ya
    text = text.replace('\u0649', '\u064A')
    # 4. Ta Marbuta -> Ha
    text = text.replace('\u0629', '\u0647')
    # 5. Remove Kashida (Tatweel)
    text = text.replace('\u0640', '')
    # 6. Squash repeated characters (3+ -> 1)
    text = re.sub(r'(.)\1{2,}', r'\1', text)
    return text.strip()


def normalize_arabizi(text: str) -> str:
    """Normalize Latin-script (Arabizi) dialect text."""
    text = text.lower()
    # Phonetic numeral de-substitution. Only digits that touch a letter are
    # rewritten, so standalone numbers (amounts, USSD codes) stay intact.
    numeral_map = {'7': 'h', '3': 'a', '9': 'q', '5': 'kh', '2': 'a'}
    text = re.sub(
        r'(?<=[a-z])[23579]|[23579](?=[a-z])',
        lambda m: numeral_map[m.group(0)],
        text,
    )
    # Normalize apostrophes
    text = re.sub(r"['\u2018\u2019\u0060]", "'", text)
    # Squash repeated characters (3+ -> 1)
    text = re.sub(r'(.)\1{2,}', r'\1', text)
    return text.strip()


def mask_pii(text: str) -> str:
    """Replace phone numbers with [PHONE] token.

    Call this before the normalizers: repeated-character squashing would
    otherwise collapse digit runs inside a phone number.
    """
    text = re.sub(r'\b0[567]\d{8}\b', '[PHONE]', text)
    return text
Use Unicode block detection to classify input as Arabic-script or Latin-script, then apply the corresponding normalizer:
def detect_script(text: str) -> str:
    arabic_chars = sum(1 for c in text if '\u0600' <= c <= '\u06FF')
    latin_chars = sum(1 for c in text if 'a' <= c.lower() <= 'z')
    return 'arabic' if arabic_chars > latin_chars else 'latin'
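One possible way to chain masking, script detection, and normalization is sketched below. The `preprocess` helper and its parameter names are assumptions made for illustration; the ordering is also an assumption, chosen so that the script is detected on the raw text and phone numbers are masked before repeated-character squashing can alter their digits.

```python
from typing import Callable

PHONE_TOKEN = "[PHONE]"


def preprocess(
    text: str,
    mask: Callable[[str], str],
    detect: Callable[[str], str],
    normalizers: dict[str, Callable[[str], str]],
) -> str:
    """Detect the script, mask PII, then normalize around the mask tokens."""
    # Detect on the raw text: the Latin letters in "[PHONE]" would skew the count.
    script = detect(text)
    # Mask before normalizing so digit runs are not squashed or rewritten.
    masked = mask(text)
    normalize = normalizers[script]
    # Normalize each segment between tokens so the token itself stays untouched.
    segments = [normalize(segment) for segment in masked.split(PHONE_TOKEN)]
    joined = f" {PHONE_TOKEN} ".join(segments)
    # Collapse the extra whitespace introduced by the join.
    return " ".join(joined.split())
```

With the functions defined above, a call would look like `preprocess(text, mask_pii, detect_script, {"arabic": normalize_arabic, "latin": normalize_arabizi})`.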
For each intent with fewer than 30 examples: (a) write 3-5 manual paraphrases per utterance, (b) apply lexical synonym substitution using a dialect synonym dictionary, (c) optionally back-translate through a pivot language (e.g., French or MSA) and filter with cosine similarity > 0.8 against the original embedding.
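The similarity filter in (c) can be sketched in a few lines. This is an illustration under stated assumptions: `embed` is a caller-supplied function (for example a wrapper around a sentence embedding model) and `filter_back_translations` is a hypothetical helper name; only the 0.8 threshold comes from the text.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_back_translations(
    original: str,
    candidates: list[str],
    embed: Callable[[str], Vector],
    threshold: float = 0.8,
) -> list[str]:
    """Keep candidates whose similarity to the original exceeds `threshold`."""
    reference = embed(original)
    return [
        candidate
        for candidate in candidates
        # Drop exact copies: they add no new training signal.
        if candidate != original and cosine(reference, embed(candidate)) > threshold
    ]
```

Candidates that drift too far from the source meaning fall below the threshold and are discarded, which is the role the similarity check plays in the back-translation step.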
Set up a Rasa config.yml pipeline optimized for dialect:
pipeline:
  - name: WhitespaceTokenizer      # Preserves dialect spelling artifacts
  - name: RegexFeaturizer          # Domain patterns: USSD codes, phone numbers
  - name: CountVectorsFeaturizer   # Unigram lexical features
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 3
    max_ngram: 4                   # Character n-grams capture subword patterns
  - name: DIETClassifier
    epochs: 100
    embedding_dimension: 128
    learning_rate: 0.001
    drop_rate: 0.2
    weight_sparsity: 0.1
    sparse_input_dropout_rate: 0.2
    number_of_transformer_layers: 1
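To see why character n-grams help with spelling variation, the sketch below approximates the `char_wb` analyzer in plain Python and compares n-gram overlap between variants. It is a simplified stand-in written for illustration, not the featurizer's actual implementation; `char_wb_ngrams` and `jaccard` are hypothetical helper names.

```python
def char_wb_ngrams(text: str, min_n: int = 3, max_n: int = 4) -> set[str]:
    """Character n-grams taken inside word boundaries (each word space-padded)."""
    grams: set[str] = set()
    for word in text.lower().split():
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            for i in range(len(padded) - n + 1):
                grams.add(padded[i:i + n])
    return grams


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of n-grams that two strings have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


base = char_wb_ngrams("bghit")
for variant in ("beghit", "bghiit", "facture"):
    print(variant, round(jaccard(base, char_wb_ngrams(variant)), 2))
```

The spelling variants share a sizeable fraction of their n-grams with "bghit", while the unrelated word "facture" shares none, which is the signal the classifier can exploit.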
For higher accuracy (+1-5% F1 over DIET), fine-tune a pretrained dialect BERT:
Add a classification head on the [CLS] token output (768-dim -> num_intents).
When NLU confidence is below threshold (e.g., < 0.65):
Embed text with intfloat/multilingual-e5-base using prefix encoding ("query: " for user questions, "passage: " for document chunks).
def route_message(text: str, nlu_result: dict, confidence_threshold: float = 0.65):
    """Route to deterministic flow or RAG fallback."""
    if nlu_result['confidence'] >= confidence_threshold:
        return {
            'path': 'deterministic',
            'intent': nlu_result['intent'],
            'entities': nlu_result['entities']
        }
    else:
        return {
            'path': 'rag',
            'query': text
        }
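The retrieval half of the RAG path can be illustrated with a brute-force search. This sketch deliberately swaps the FAISS HNSW index for a linear scan and leaves out re-ranking and generation; `retrieve` is a hypothetical helper, and `embed` is assumed to be a caller-supplied function returning unit-length vectors, so that the inner product equals cosine similarity.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def retrieve(
    query: str,
    chunks: list[str],
    embed: Callable[[str], Vector],
    top_k: int = 3,
) -> list[tuple[float, str]]:
    """Rank document chunks against the query using E5-style prefixes."""
    # Queries and passages get different prefixes before embedding.
    query_vec = embed("query: " + query)
    scored = []
    for chunk in chunks:
        chunk_vec = embed("passage: " + chunk)
        # Inner product; equals cosine similarity for unit-length vectors.
        score = sum(q * c for q, c in zip(query_vec, chunk_vec))
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

In a real deployment the passage embeddings would be computed once at indexing time rather than on every query; the loop above recomputes them only to keep the example short.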
Craft a bilingual system prompt that grounds responses in retrieved context and instructs the LLM to respond in the user's dialect:
You are a helpful customer service assistant. Answer ONLY based on the
provided context. If the context does not contain the answer, say you
don't know. Respond in the same language/script the user wrote in.
Context:
{retrieved_chunks}
User question: {query}
Answer:
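Filling the template can be sketched as below. The `build_prompt` helper and the bracketed chunk numbering are assumptions added for illustration; the template text itself is the one shown above.

```python
PROMPT_TEMPLATE = (
    "You are a helpful customer service assistant. Answer ONLY based on the\n"
    "provided context. If the context does not contain the answer, say you\n"
    "don't know. Respond in the same language/script the user wrote in.\n"
    "Context:\n"
    "{retrieved_chunks}\n"
    "User question: {query}\n"
    "Answer:"
)


def build_prompt(query: str, chunks: list[str]) -> str:
    """Fill the template with numbered context chunks and the user query."""
    # Numbering the chunks is an illustrative choice, not part of the template.
    context = "\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return PROMPT_TEMPLATE.format(retrieved_chunks=context, query=query)
```

Passing the user's original message as `query`, rather than the normalized form, keeps the script and spelling visible to the LLM so it can answer in kind.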
Report accuracy, weighted F1, and macro F1 separately for each script (Arabic and Latin). Pay special attention to macro F1 — it exposes failures on rare intents that weighted F1 can mask. Target: >85% macro F1 on both scripts.
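The gap between macro and weighted F1 is easy to reproduce on toy data. The sketch below uses invented labels and counts purely for illustration; the helper names are hypothetical, and labels that appear only in the predictions are ignored for simplicity.

```python
from collections import Counter


def per_class_f1(y_true: list[str], y_pred: list[str]) -> dict[str, float]:
    """F1 for every label that occurs in y_true."""
    scores = {}
    for label in set(y_true):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_and_weighted_f1(y_true: list[str], y_pred: list[str]) -> tuple[float, float]:
    """Return (macro F1, support-weighted F1)."""
    f1 = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    macro = sum(f1.values()) / len(f1)
    weighted = sum(f1[label] * support[label] for label in f1) / len(y_true)
    return macro, weighted


# Toy data: 18 examples of a frequent intent, 2 of a rare one that is always missed.
y_true = ["check_balance"] * 18 + ["file_complaint"] * 2
y_pred = ["check_balance"] * 20
macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(f"macro F1 = {macro:.2f}, weighted F1 = {weighted:.2f}")
# → macro F1 = 0.47, weighted F1 = 0.85
```

To report per-script numbers, call `macro_and_weighted_f1` once on the Arabic-script subset of the test set and once on the Latin-script subset.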
Example 1: Building a Telecom Customer Service Bot for Algerian Darja
User: "I need a chatbot for an Algerian telecom company. Customers write in Arabic, French, and Arabizi. The bot should handle balance checks, offer inquiries, and complaints, but also answer general questions about promotions from our docs."
Approach:
Define intents: check_balance, ask_offer_details, file_complaint, activate_service, general_promo_question, etc.
check_balance:
*505#.
Output: A dual-path bot where "bghit nactivi win max" triggers a structured activation flow, while "واش الفرق بين pixX و sama" retrieves promotional docs and generates a comparison answer in Darja.
Example 2: Adapting the Architecture to Hinglish (Hindi + English)
User: "I want to apply this dialect chatbot approach to Hinglish customer support for an Indian e-commerce company."
Approach:
Detect the script via Unicode block: Devanagari (\u0900-\u097F) vs Latin.
Reuse the multilingual-e5-base embeddings (the model already supports Hindi) and reindex product catalog docs.
Output: Same dual-path architecture, but the preprocessing, base transformer, and training data are swapped for the target dialect. The routing logic, FAISS indexing, and prompt template structure remain identical.
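For the Hinglish case, script detection could mirror the Arabic/Latin detector with the Devanagari block swapped in. This is a minimal sketch; `detect_script_hinglish` is a hypothetical name and the sample sentences are invented for illustration.

```python
def detect_script_hinglish(text: str) -> str:
    """Classify input as Devanagari or Latin by counting characters per block."""
    devanagari = sum(1 for c in text if "\u0900" <= c <= "\u097F")
    latin = sum(1 for c in text if "a" <= c.lower() <= "z")
    return "devanagari" if devanagari > latin else "latin"


print(detect_script_hinglish("मुझे refund चाहिए"))     # → devanagari
print(detect_script_hinglish("mujhe refund chahiye"))  # → latin
```

As with the Arabic detector, a code-switched message is assigned to whichever script contributes more characters, so short Devanagari messages with long English product names may be routed to the Latin normalizer.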
Example 3: Adding RAG Fallback to an Existing Rasa Bot
User: "I have a working Rasa chatbot but it fails on questions about our product docs. How do I add a RAG fallback?"
Approach:
Add a custom action action_rag_fallback triggered when NLU confidence < 0.65.
Embed the query with multilingual-e5-base (prefix: "query: ").
rules:
  - rule: RAG fallback on low confidence
    steps:
      - intent: nlu_fallback
      - action: action_rag_fallback
Set the FallbackClassifier threshold to 0.65 in your pipeline config.
Output: The existing bot handles known intents as before. Unknown or ambiguous queries now get grounded answers from your documentation instead of "I don't understand."
Use char_wb n-grams to generalize across "bghit"/"beghit"/"bghiit".
Use intfloat/multilingual-e5-base with explicit prefix encoding (query: / passage:) for embedding; it significantly outperforms symmetric embedding models on cross-lingual retrieval.
Paper: DziriBOT: RAG Based Intelligent Conversational Agent for Algerian Arabic Dialect (Bechiri & Lanasri, 2026). Look for: the five-tiered pipeline architecture, the DIET vs DziriBERT comparison tables (results showing 87.38% vs 86.98% accuracy), the Arabic/Arabizi normalization rules, and the RAG latency breakdown across hardware configurations.