skills/43-wentorai-research-plugins/skills/tools/ocr-translate/multilingual-research-guide/SKILL.md
Strategies for translating academic papers while preserving technical accuracy
npx skillsauth add brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research multilingual-research-guide
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
A skill for translating academic papers, theses, and research documents between languages while preserving technical precision, citation integrity, and discipline-specific terminology. Covers workflow design, terminology management, and quality assurance.
Source Document
      |
      v
1. Document Preparation
   - Extract text (OCR if scanned)
   - Identify formulas, figures, tables (do NOT translate these)
   - Build terminology glossary
      |
      v
2. Segmentation
   - Split into translatable units (sentences/paragraphs)
   - Tag non-translatable elements: equations, citations, proper nouns
      |
      v
3. Translation
   - Apply machine translation (first pass)
   - Human post-editing (second pass)
   - Terminology consistency check (third pass)
      |
      v
4. Quality Assurance
   - Back-translation verification (sample)
   - Domain expert review
   - Formatting and citation check
      |
      v
Target Document
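The segmentation step (step 2) could be sketched roughly as below. This is an illustrative assumption, not part of the original skill: the helper name `segment_text` and the punctuation-based splitting rule are hypothetical, and a production pipeline would more likely use a language-aware sentence segmenter.

```python
import re


def segment_text(text: str) -> list[list[str]]:
    """Split text into paragraphs, then each paragraph into sentences.

    Hypothetical helper: a naive splitter that breaks after '.', '!', '?'
    when followed by whitespace, and after the CJK marks '。', '！', '？'.
    """
    # Paragraphs are assumed to be separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r'\n\s*\n', text) if p.strip()]
    segmented = []
    for paragraph in paragraphs:
        # Latin sentence marks need trailing whitespace; CJK marks do not.
        sentences = re.split(r'(?<=[.!?])\s+|(?<=[。！？])', paragraph)
        segmented.append([s.strip() for s in sentences if s.strip()])
    return segmented


units = segment_text("First sentence. Second one!\n\nNew paragraph here.")
print(units)
```

A splitter this simple also breaks after abbreviations such as "Fig. 3" or "e.g. this", so it works best when equations and citations have already been replaced by placeholders.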
def build_terminology_glossary(source_text: str, domain: str,
                               source_lang: str = 'zh',
                               target_lang: str = 'en') -> list[dict]:
    """
    Look up known domain terms that occur in the source text.

    Args:
        source_text: Raw text of the source document
        domain: Research domain (e.g., 'machine_learning', 'biochemistry')
        source_lang: Source language code
        target_lang: Target language code

    Returns:
        List of terminology entries for terms found in source_text
    """
    # Common domain-specific glossaries
    glossaries = {
        'machine_learning': {
            'zh_en': {
                '过拟合': 'overfitting',
                '欠拟合': 'underfitting',
                '梯度下降': 'gradient descent',
                '损失函数': 'loss function',
                '卷积神经网络': 'convolutional neural network',
                '注意力机制': 'attention mechanism',
                '预训练模型': 'pre-trained model',
                '微调': 'fine-tuning',
                '批归一化': 'batch normalization',
                '学习率': 'learning rate'
            }
        },
        'biochemistry': {
            'zh_en': {
                '蛋白质折叠': 'protein folding',
                '酶动力学': 'enzyme kinetics',
                '基因表达': 'gene expression',
                '转录因子': 'transcription factor',
                '信号通路': 'signaling pathway',
                '代谢组学': 'metabolomics'
            }
        }
    }
    # An unknown domain or language pair yields an empty glossary
    domain_terms = glossaries.get(domain, {}).get(f'{source_lang}_{target_lang}', {})
    entries = []
    for source_term, target_term in domain_terms.items():
        if source_term in source_text:
            entries.append({
                'source': source_term,
                'target': target_term,
                'domain': domain,
                'verified': True,
                'notes': ''
            })
    return entries
import re

def enforce_terminology(translated_text: str,
                        glossary: list[dict]) -> tuple[str, list[str]]:
    """
    Check and enforce terminology consistency in translated text.

    Each glossary entry may carry an optional 'incorrect_variants' list of
    known mistranslations; entries without it are left untouched.

    Returns:
        Tuple of (corrected_text, list of warnings)
    """
    warnings = []
    corrected = translated_text
    for entry in glossary:
        target_term = entry['target']
        # Check for common mistranslations or inconsistent usage
        for variant in entry.get('incorrect_variants', []):
            pattern = re.compile(re.escape(variant), flags=re.IGNORECASE)
            # Case-insensitive replacement; the lambda keeps any backslashes
            # in target_term from being read as regex escapes
            corrected, count = pattern.subn(lambda _: target_term, corrected)
            if count:
                warnings.append(
                    f"Found '{variant}' -- should be '{target_term}'"
                )
    return corrected, warnings
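The third-pass terminology consistency check from the workflow can also be run segment by segment, independent of any list of known mistranslations. The sketch below is an assumption rather than part of the original skill: `find_missing_terms` is a hypothetical helper that expects source and translated segments to be aligned one to one and glossary entries to have the `source` and `target` keys used above.

```python
def find_missing_terms(source_segments: list[str],
                       translated_segments: list[str],
                       glossary: list[dict]) -> list[str]:
    """Report segments whose source contains a glossary term but whose
    translation lacks the approved target term (case-insensitive)."""
    warnings = []
    for index, (source, translated) in enumerate(
            zip(source_segments, translated_segments)):
        translated_lower = translated.lower()
        for entry in glossary:
            term_in_source = entry['source'] in source
            term_in_target = entry['target'].lower() in translated_lower
            if term_in_source and not term_in_target:
                warnings.append(
                    f"Segment {index}: '{entry['source']}' present but "
                    f"'{entry['target']}' missing in translation"
                )
    return warnings


# Illustrative data only
glossary = [{'source': '过拟合', 'target': 'overfitting'},
            {'source': '学习率', 'target': 'learning rate'}]
source = ['模型出现过拟合。', '我们降低了学习率。']
translated = ['The model shows over-fitting.', 'We lowered the learning rate.']
for warning in find_missing_terms(source, translated, glossary):
    print(warning)
```

Plain substring matching flags inflected forms (for example "fine-tuned" where the glossary says "fine-tuning"), so the warnings are best treated as prompts for the human post-editor rather than as errors.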
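import deepl
from typing import Optional

def translate_academic_text(text: str, source_lang: str, target_lang: str,
                            auth_key: str,
                            glossary_id: Optional[str] = None) -> str:
    """
    Translate academic text using DeepL with an optional glossary.
    """
    translator = deepl.Translator(auth_key)
    target = target_lang.upper()
    # DeepL expects a regional variant for English targets;
    # 'EN-US' is an assumed default here
    if target == 'EN':
        target = 'EN-US'
    result = translator.translate_text(
        text,
        source_lang=source_lang.upper(),  # required when a glossary is used
        target_lang=target,
        # 'prefer_more' falls back to the default register for target
        # languages without formality support, where 'more' would fail
        formality="prefer_more",          # academic style
        glossary=glossary_id,
        preserve_formatting=True,
        tag_handling="xml"                # preserve XML/HTML tags
    )
    return result.text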
Before sending text to any translation engine, protect elements that should not be translated:
import re

def protect_elements(text: str) -> tuple[str, dict]:
    """
    Replace non-translatable elements with placeholders.
    Returns protected text and a mapping to restore later.
    """
    placeholders = {}

    def protect(pattern: str, label: str, source: str, flags: int = 0) -> str:
        # Substitute every match in a single pass, so the text is never
        # modified while it is still being scanned
        def replace(match: re.Match) -> str:
            key = f'__{label}_{len(placeholders)}__'
            placeholders[key] = match.group()
            return key
        return re.sub(pattern, replace, source, flags=flags)

    # Protect LaTeX equations: environments first, then $$...$$, then $...$
    text = protect(
        r'\\begin\{equation\}.*?\\end\{equation\}|\$\$.*?\$\$|\$.*?\$',
        'MATH', text, re.DOTALL)
    # Protect citations: \cite{...} and (Author, 2020) / (Author et al., 2020)
    text = protect(
        r'\\cite\{[^}]+\}|\([A-Z][a-z]+(?:\s+et\s+al\.)?,\s*\d{4}\)',
        'CITE', text)
    # Protect URLs
    text = protect(r'https?://\S+', 'URL', text)
    return text, placeholders
def restore_elements(text: str, placeholders: dict) -> str:
    """Restore protected elements from placeholders."""
    for key, value in placeholders.items():
        text = text.replace(key, value)
    return text
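Translation engines sometimes drop, duplicate, or alter placeholder tokens, so it can help to verify them before restoring. The following is a minimal sketch under that assumption: `check_placeholders` is a hypothetical helper, not part of the original skill, and it relies on the `__MATH_n__`, `__CITE_n__`, and `__URL_n__` key format used above.

```python
import re


def check_placeholders(translated_text: str, placeholders: dict) -> dict:
    """Compare the placeholders expected from the protection step with
    those actually present in the translated text."""
    found = re.findall(r'__(?:MATH|CITE|URL)_\d+__', translated_text)
    return {
        # Expected keys that the engine dropped or mangled
        'missing': [key for key in placeholders if key not in found],
        # Tokens that look like placeholders but were never issued
        'unexpected': [key for key in found if key not in placeholders],
        # Keys that appear more than once in the translation
        'duplicated': sorted({key for key in found if found.count(key) > 1}),
    }


# Illustrative data only
placeholders = {'__MATH_0__': '$E = mc^2$', '__CITE_1__': r'\cite{einstein1905}'}
translated = 'The relation __MATH_0__ is well known.'
report = check_placeholders(translated, placeholders)
print(report)
# {'missing': ['__CITE_1__'], 'unexpected': [], 'duplicated': []}
```

Any segment with a non-empty `missing` list is a candidate for re-translation or manual repair before `restore_elements` is called.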
For critical documents, perform back-translation on a random 10-20% sample of paragraphs. Compare the back-translated text with the original to identify semantic drift. Flag any paragraph where back-translation diverges significantly from the source.
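The sampling and comparison described above might look like the sketch below. It is illustrative only: the function names, the 15% default fraction, and the 0.6 threshold are assumptions, and character-level similarity from `difflib` is a crude stand-in for semantic comparison. The back-translation call itself is left out because it depends on the translation engine in use.

```python
import random
from difflib import SequenceMatcher


def sample_paragraphs(paragraphs: list[str], fraction: float = 0.15,
                      seed: int = 0) -> list[int]:
    """Return sorted indices of a random sample (at least one paragraph)."""
    if not paragraphs:
        return []
    sample_size = max(1, round(len(paragraphs) * fraction))
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return sorted(rng.sample(range(len(paragraphs)), sample_size))


def flag_divergent(originals: list[str], back_translations: list[str],
                   threshold: float = 0.6) -> list[tuple[int, float]]:
    """Flag pairs whose character-level similarity falls below threshold.

    Returns (position in the input lists, similarity ratio) tuples.
    """
    flagged = []
    for index, (original, back) in enumerate(zip(originals, back_translations)):
        matcher = SequenceMatcher(None, original.lower(), back.lower(),
                                  autojunk=False)
        ratio = matcher.ratio()
        if ratio < threshold:
            flagged.append((index, round(ratio, 2)))
    return flagged


# Illustrative usage: pick which of 20 paragraphs to back-translate
indices = sample_paragraphs([f"paragraph {i}" for i in range(20)])
print(indices)
```

Because a faithful back-translation can still be worded differently from the source, low scores are a signal for human review, not proof of a translation error.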