skills/from-code-centric-concept-centric-teaching/SKILL.md
Generate LLM-assisted coding labs that teach concepts through 'Vibe Coding' — producing working code paired with mandatory conceptual reflection, prompt logging, and critical thinking assessments. Use when: 'create a vibe coding lab for transformers', 'design an NLP exercise with reflection questions', 'build a concept-focused coding tutorial', 'generate a lab that teaches X concept not just syntax', 'create a coding assignment with critical reflection', 'design a learn-by-prompting exercise'.
npx skillsauth add ndpvt-web/arxiv-claude-skills from-code-centric-concept-centric-teaching
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to design and deliver Vibe Coding learning experiences — a pedagogical method from Al-Khalifa (2026) where LLM-generated code serves as a vehicle for conceptual mastery rather than an end in itself. Instead of asking learners to write code from scratch (and get stuck on syntax), Claude generates working implementations and then guides the learner through structured reflection questions, prompt logging analysis, and concept-probing modifications. The result: learners spend cognitive effort on understanding why rather than debugging what.
Vibe Coding inverts the traditional learn-to-code workflow. In conventional instruction, students write code to prove they understand concepts. In Vibe Coding, the LLM writes the code and the student proves understanding through three assessment channels: (1) prompt logging — documenting how they directed the LLM and why they made specific requests, (2) critical reflection — answering structured questions that probe conceptual understanding of the generated code, and (3) modification exercises — making targeted changes that require understanding the code's architecture, not just its syntax.
The core insight is that debugging syntax errors and fighting import paths consumes cognitive bandwidth that could be spent on understanding attention mechanisms, loss functions, or architectural tradeoffs. By offloading implementation to the LLM, the learner's limited attention is redirected toward the concepts the code embodies. This is not "letting the AI do the work" — the reflection and modification components are harder than writing the code, because they require genuine understanding.
The method works because it structures three layers of accountability: the prompt log shows the learner's intent, the reflection questions test their comprehension, and the modification tasks prove their transfer ability. A learner who cannot explain why a particular loss function was chosen, or what happens when you remove an attention head, has not learned — regardless of whether the code runs.
1. Identify the target concept. Determine the core idea the learner should master (e.g., "self-attention in transformers", "TF-IDF weighting", "backpropagation through time"). Separate the concept from the implementation details.
2. Generate a working, well-commented implementation. Write complete, runnable code that demonstrates the concept. Include inline comments that label conceptual components (e.g., # Query, Key, Value projections: the core of attention). Use realistic data or provide sample data inline.
3. Annotate the code with concept markers. Add # CONCEPT: comments at critical junctures that link code to theory. For example: # CONCEPT: Softmax here normalizes attention weights to sum to 1, creating a probability distribution over input tokens.
4. Write 4-6 critical reflection questions. These must test understanding, not recall. Use question types such as explain-why, predict-change, connect-theory, and identify-limitation.
5. Design 2-3 targeted modification exercises. Each modification should require conceptual understanding to complete correctly. Specify what to change, what behavioral difference to expect, and what concept the modification tests. Example: "Replace the learned positional embeddings with sinusoidal encodings. Run both versions on the sample input and explain why the outputs differ."
6. Create the prompt log template. Provide a structured format for the learner to record: (a) the prompt they gave the LLM, (b) what they were trying to achieve conceptually, (c) whether the output matched their expectation, and (d) how they refined the prompt and why.
7. Include a concept map or dependency note. State what prerequisite concepts the learner should already know, and what concepts this lab prepares them for next. This situates the lab in a learning progression.
8. Provide verification checkpoints. Give the learner concrete ways to verify both that the code works (expected output snippets) and that their conceptual understanding is correct (key insights they should have reached by the end).
9. Add an "explore further" section. List 2-3 extensions that the learner could pursue using the same vibe coding approach: prompting the LLM for variants and then analyzing the differences conceptually.
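One way to keep a generated lab honest against the steps above is to hold its parts in a small structure and check the stated counts before handing it to a learner. The sketch below is only an illustration: the class name `VibeCodingLab`, its fields, and the `problems` method are hypothetical and do not come from the skill or the paper.

```python
from dataclasses import dataclass, field


@dataclass
class VibeCodingLab:
    """Hypothetical container for the lab components named in the steps."""
    concept: str
    prerequisites: list[str]
    code: str
    reflection_questions: list[str]
    modification_exercises: list[str]
    next_concepts: list[str] = field(default_factory=list)
    extensions: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return readable gaps; an empty list means no gap was found."""
        issues = []
        if "# CONCEPT:" not in self.code:
            issues.append("code has no '# CONCEPT:' markers")
        if not 4 <= len(self.reflection_questions) <= 6:
            issues.append("expected 4-6 reflection questions")
        if not 2 <= len(self.modification_exercises) <= 3:
            issues.append("expected 2-3 modification exercises")
        if not self.prerequisites:
            issues.append("no prerequisite concepts listed")
        return issues


draft = VibeCodingLab(
    concept="self-attention in transformers",
    prerequisites=["matrix multiplication", "softmax"],
    code="scores = Q @ K.T  # CONCEPT: dot product measures similarity",
    reflection_questions=["Why divide by sqrt(d_k)?"],
    modification_exercises=["Add a causal mask", "Split into two heads"],
)
print(draft.problems())  # ['expected 4-6 reflection questions']
```

A check like this only counts components. Whether a question actually tests understanding rather than recall still needs a human or model review.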
Example 1: Teaching Self-Attention via Vibe Coding
User: "Create a vibe coding lab that teaches self-attention in transformers"
Approach:
Output structure:
# === VIBE CODING LAB: Self-Attention Mechanism ===
# Goal: Understand how self-attention computes contextual representations
# Prerequisites: Matrix multiplication, softmax function, embeddings
import torch
import torch.nn.functional as F
def self_attention(x, W_q, W_k, W_v):
    """
    Compute self-attention for input sequence x.
    x: (batch, seq_len, d_model)
    """
    # CONCEPT: Each token gets three representations: what it's looking for (Q),
    # what it offers (K), and what it actually contains (V)
    Q = x @ W_q  # Queries
    K = x @ W_k  # Keys
    V = x @ W_v  # Values
    d_k = Q.shape[-1]
    # CONCEPT: Dot product measures similarity between queries and keys.
    # Scaling by sqrt(d_k) prevents softmax from saturating for large dimensions.
    scores = (Q @ K.transpose(-2, -1)) / (d_k ** 0.5)
    # CONCEPT: Softmax converts raw scores to a probability distribution.
    # Each token now has a weighted "attention" over all other tokens.
    attn_weights = F.softmax(scores, dim=-1)
    # CONCEPT: The output for each token is a weighted sum of all Value vectors,
    # where weights reflect relevance determined by Q-K similarity.
    output = attn_weights @ V
    return output, attn_weights
# --- Sample usage with synthetic data ---
torch.manual_seed(42)
batch, seq_len, d_model = 1, 4, 8
x = torch.randn(batch, seq_len, d_model)
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)
output, weights = self_attention(x, W_q, W_k, W_v)
print("Attention weights shape:", weights.shape) # (1, 4, 4)
print("Attention weights (row sums should be 1.0):")
print(weights.squeeze().sum(dim=-1)) # Should print tensor of 1.0s
## Critical Reflection Questions
1. **Explain-why**: Why do we divide by sqrt(d_k) before applying softmax?
What would happen to the attention distribution without this scaling
as d_k increases to 512 or 1024?
2. **Predict-change**: If you replaced `F.softmax(scores, dim=-1)` with
`F.relu(scores)`, would the attention mechanism still work? What
property of the output would be lost, and why does it matter?
3. **Connect-theory**: The attention weight matrix has shape (seq_len, seq_len).
Each row sums to 1. What does row i, column j represent in plain language?
How does this relate to the concept of "contextual embeddings"?
4. **Identify-limitation**: This implementation has no masking. For what
NLP task would this be a problem, and what would go wrong specifically?
## Modification Exercises
**Mod 1**: Add a causal mask so that position i can only attend to positions
<= i. Print the attention weights and verify that every entry above the main diagonal is zero.
*Concept tested*: Autoregressive generation and information leakage.
**Mod 2**: Split Q, K, V into 2 attention heads (d_model/2 each), compute
attention separately, and concatenate. Compare the attention patterns
of each head on the same input.
*Concept tested*: Why multi-head attention captures diverse relationships.
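Reflection question 1 can also be explored empirically as a verification checkpoint. The sketch below is a standalone illustration in plain Python rather than PyTorch. The helper names `softmax` and `attention_row` are hypothetical, and the random Gaussian query and keys are an assumption standing in for real projections.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_row(d_k, seq_len=4, scale=True, seed=0):
    """Attention weights of one random query over seq_len random keys."""
    rng = random.Random(seed)
    query = [rng.gauss(0, 1) for _ in range(d_k)]
    keys = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in range(seq_len)]
    # Dot-product similarity between the query and each key
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    if scale:
        scores = [s / math.sqrt(d_k) for s in scores]
    return softmax(scores)


for d_k in (8, 512):
    scaled = attention_row(d_k, scale=True)
    unscaled = attention_row(d_k, scale=False)
    print(
        f"d_k={d_k:4d}  max weight scaled={max(scaled):.3f}  "
        f"unscaled={max(unscaled):.3f}"
    )
```

Because the same seed produces the same query and keys in both calls, the only difference is the division by sqrt(d_k). The largest unscaled weight is never smaller than the scaled one, and the gap typically widens as d_k grows, which is the saturation the CONCEPT comment describes.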
Example 2: Teaching TF-IDF for Information Retrieval
User: "Help me design a concept-focused tutorial on TF-IDF"
Approach:
Output structure:
# === VIBE CODING LAB: TF-IDF from Scratch ===
# Goal: Understand why TF-IDF works better than raw term frequency
# Prerequisites: Basic probability, logarithms
import math
from collections import Counter
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are common pets",
]
# CONCEPT: Term frequency measures how important a word is WITHIN a document.
# A word that appears 5 times in a doc is more relevant to that doc than
# one appearing once.
def term_frequency(term, document):
    words = document.split()
    return words.count(term) / len(words)
# CONCEPT: Inverse document frequency measures how DISTINCTIVE a word is
# across the entire collection. Words appearing in most documents (like "the",
# found in 2 of the 3 documents here) get low IDF; rare words get high IDF.
def inverse_document_frequency(term, all_docs):
    num_docs_containing = sum(1 for doc in all_docs if term in doc.split())
    if num_docs_containing == 0:
        return 0
    return math.log(len(all_docs) / num_docs_containing)
# CONCEPT: TF-IDF = TF * IDF. This balances local importance (TF) with
# global distinctiveness (IDF). Common words get suppressed; distinctive
# words get boosted.
def tfidf(term, document, all_docs):
    return term_frequency(term, document) * inverse_document_frequency(term, all_docs)
# Compute TF-IDF for key terms
for term in ["the", "cat", "pets"]:
    for i, doc in enumerate(documents):
        score = tfidf(term, doc, documents)
        print(f"TF-IDF('{term}', doc{i}) = {score:.4f}")
    print()
## Critical Reflection Questions
1. **Explain-why**: "the" appears twice in each of docs 0 and 1, yet its TF-IDF
there (0.1352) is lower than that of "pets" in doc 2 (0.1831), which appears
only once. Explain the mathematical reason AND the intuitive reason this is
desirable for search.
2. **Predict-change**: If you added 100 more documents that all contain
the word "cat", what would happen to the TF-IDF score of "cat" in doc 0?
What does this tell you about TF-IDF's behavior with corpus growth?
3. **Connect-theory**: Search engines moved beyond TF-IDF to neural
retrieval. What fundamental limitation of TF-IDF does neural retrieval
address? (Hint: consider synonyms.)
## Modification Exercise
Change the IDF formula to use `log(1 + N/df)` instead of `log(N/df)`.
Recompute scores and explain what practical problem the +1 smoothing solves.
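For an instructor checking answers to this modification, the two IDF formulas can be compared side by side on the document frequencies possible in the three-document corpus. This is a minimal sketch; the function names `idf_plain` and `idf_smoothed` are illustrative and are not part of the lab code.

```python
import math


def idf_plain(num_docs, doc_freq):
    """IDF as used in the lab: log(N / df)."""
    return math.log(num_docs / doc_freq)


def idf_smoothed(num_docs, doc_freq):
    """Variant from the modification exercise: log(1 + N / df)."""
    return math.log(1 + num_docs / doc_freq)


num_docs = 3
for doc_freq in (1, 2, 3):
    print(
        f"df={doc_freq}: plain={idf_plain(num_docs, doc_freq):.4f}  "
        f"smoothed={idf_smoothed(num_docs, doc_freq):.4f}"
    )
# df=1: plain=1.0986  smoothed=1.3863
# df=2: plain=0.4055  smoothed=0.9163
# df=3: plain=0.0000  smoothed=0.6931
```

The df=3 row is the one to discuss: under the plain formula a term that occurs in every document scores exactly zero no matter how often it appears, while the smoothed variant keeps a small positive weight. Neither formula handles df=0, which the lab code guards against separately.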
Example 3: Generating a Prompt Log Template
User: "I want to learn about CNNs for text classification using vibe coding"
Claude provides the code lab, then includes this prompt log template:
## Prompt Log (fill this in as you work)
| # | Your Prompt to the LLM | Conceptual Goal | Output Match Expectation? | Refinement & Why |
|---|------------------------|-----------------|--------------------------|------------------|
| 1 | "Write a 1D CNN for text classification in PyTorch" | Get a baseline implementation to study conv-over-text architecture | Partially — used 2D conv instead of 1D | Asked specifically for Conv1d with embedding input shape |
| 2 | "Add multiple filter sizes (3, 4, 5) like the Kim 2014 paper" | Understand how different n-gram windows capture different features | Yes — parallel convolutions with concat | N/A |
| 3 | ... | ... | ... | ... |
### Prompt Log Reflection
After completing the lab, answer: What did your prompt refinements reveal
about your evolving understanding of the concept? Which misconceptions
did prompting help you identify?
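If the prompt log is collected programmatically rather than typed by hand, the same table can be produced from structured entries. The sketch below assumes a hypothetical `PromptLogEntry` record and `render_prompt_log` helper; neither is defined by the skill, and the column names are copied from the template above.

```python
from dataclasses import dataclass


@dataclass
class PromptLogEntry:
    """One row of the prompt log (hypothetical structure)."""
    prompt: str
    conceptual_goal: str
    matched_expectation: str
    refinement: str = "N/A"


def _cell(text):
    """Escape pipes so a prompt cannot break the Markdown table."""
    return text.replace("|", "\\|")


def render_prompt_log(entries):
    """Render entries as a Markdown table with the template's columns."""
    header = (
        "| # | Your Prompt to the LLM | Conceptual Goal "
        "| Output Match Expectation? | Refinement & Why |"
    )
    divider = "|---|---|---|---|---|"
    rows = [
        f"| {i} | {_cell(e.prompt)} | {_cell(e.conceptual_goal)} "
        f"| {_cell(e.matched_expectation)} | {_cell(e.refinement)} |"
        for i, e in enumerate(entries, start=1)
    ]
    return "\n".join([header, divider, *rows])


log = [
    PromptLogEntry(
        prompt="Write a 1D CNN for text classification in PyTorch",
        conceptual_goal="Get a baseline implementation to study",
        matched_expectation="Partially",
        refinement="Asked specifically for Conv1d",
    ),
]
print(render_prompt_log(log))
```

Keeping the entries as data also makes the reflection step easier to support, since the sequence of refinements can be reviewed in order.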
Include # CONCEPT: annotations in generated code so the learner can map code regions to ideas even if they struggle with the language.
Al-Khalifa, H. (2026). From Code-Centric to Concept-Centric: Teaching NLP with LLM-Assisted "Vibe Coding." arXiv:2602.01919v1. Accepted at Teaching NLP Workshop @ EACL 2026.
Key takeaway: The three-component assessment structure (prompt log + critical reflection + modification exercises) is what makes LLM-assisted learning rigorous rather than passive.