baekenough/spark-best-practices

Name: spark-best-practices
Author: baekenough

.claude/skills/spark-best-practices/SKILL.md

npx skillsauth add baekenough/oh-my-customcode spark-best-practices

Apache Spark Best Practices

Performance Optimization

Broadcast Joins (CRITICAL)

Use broadcast(small_df) for small-large table joins
Default broadcast threshold: 10MB (spark.sql.autoBroadcastJoinThreshold)
Avoid broadcast for tables > 100MB
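The thresholds above can be sketched as a small decision helper. This is a minimal sketch, not the skill author's code: the function names and the `small_table_bytes` argument are hypothetical, and the caller is assumed to already know the approximate size of the smaller table. It uses `DataFrame.hint("broadcast")`, which requests the same join strategy as wrapping the DataFrame in `broadcast()`.

```python
MB = 1024 * 1024
AUTO_BROADCAST_BYTES = 10 * MB   # default spark.sql.autoBroadcastJoinThreshold
MAX_BROADCAST_BYTES = 100 * MB   # upper bound suggested in the list above


def needs_broadcast_hint(small_table_bytes):
    """True when the table is above the auto-broadcast default but still under the cap."""
    return AUTO_BROADCAST_BYTES < small_table_bytes <= MAX_BROADCAST_BYTES


def join_small_to_large(large_df, small_df, key, small_table_bytes):
    """Join two DataFrames, hinting a broadcast of small_df when its size is in range."""
    if needs_broadcast_hint(small_table_bytes):
        # Same strategy as pyspark.sql.functions.broadcast(small_df),
        # expressed as a DataFrame method.
        small_df = small_df.hint("broadcast")
    return large_df.join(small_df, on=key, how="inner")
```

Below the 10MB default Spark can usually choose a broadcast join on its own, so the sketch only adds the hint in the range between the two thresholds.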

Shuffles (CRITICAL)

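Minimize shuffles: they move data between executors over the network and are among the most expensive operations
Use coalesce() to reduce the number of partitions without a full shuffle
Use repartition() only when necessary (it always causes a full shuffle)
Filter early: apply filters before joins so fewer rows are shuffled; on columnar sources such as Parquet, filters can also be pushed down to the scan (predicate pushdown)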
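The ordering described above (filter, then join, then reduce partitions) might look like the following. This is a sketch under assumptions: the function name, the parameters, and the example condition and key are hypothetical, and only standard DataFrame methods (`filter`, `join`, `coalesce`) are used.

```python
def filter_join_and_compact(events_df, users_df, condition, key, out_partitions):
    """Filter early, join on the reduced data, then shrink the partition count."""
    # Filter before the join so fewer rows take part in the shuffle.
    # `condition` is a SQL expression string, e.g. "event_date >= '2026-01-01'".
    recent = events_df.filter(condition)
    joined = recent.join(users_df, on=key, how="inner")
    # coalesce() merges existing partitions without a full shuffle;
    # repartition(n) would redistribute every row across the cluster.
    return joined.coalesce(out_partitions)
```

A call could look like `filter_join_and_compact(events, users, "event_date >= '2026-01-01'", "user_id", 8)`, where the DataFrames and column names are placeholders.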

Caching

Cache DataFrames used multiple times: df.cache() or df.persist()
Choose storage level: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY
Unpersist when done: df.unpersist()
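A cache-use-release cycle along these lines could be sketched as follows. The function name is hypothetical and the two actions are only illustrative. The sketch uses `df.cache()`. To pick one of the storage levels listed above, `df.persist(...)` with a `StorageLevel` value would take its place.

```python
def summarize_with_cache(df):
    """Run two actions against one cached DataFrame, then release the cache."""
    cached = df.cache()  # lazy: the data is materialized by the first action
    try:
        total_rows = cached.count()                # first action fills the cache
        distinct_rows = cached.distinct().count()  # second action reuses it
    finally:
        # Free executor memory even if one of the actions fails.
        cached.unpersist()
    return total_rows, distinct_rows
```

Caching pays off here only because the same DataFrame feeds more than one action. With a single action, the cache adds cost and no benefit.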

Resource Management

Executor Configuration

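Executor memory: set the executor heap to roughly 80% of the memory available to each executor, leaving the rest for off-heap overhead
Executor cores: 4-5 cores per executor is a common starting point
Dynamic allocation: enable it when the workload varies over time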
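As a rough illustration of these rules of thumb, the sketch below derives executor settings from a node's size. The function name, the one-core and 1 GB per-node reservations, and the choice of 5 cores are assumptions made for the example, not values from the skill. The configuration keys themselves are standard Spark settings.

```python
def executor_conf(node_memory_gb, node_cores, cores_per_executor=5):
    """Derive executor settings for one worker node from the rules of thumb above."""
    usable_cores = max(1, node_cores - 1)          # assumption: 1 core kept for the OS
    usable_memory_gb = max(1, node_memory_gb - 1)  # assumption: 1 GB kept for the OS
    cores = min(cores_per_executor, usable_cores)
    executors_per_node = max(1, usable_cores // cores)
    memory_per_executor_gb = usable_memory_gb / executors_per_node
    # Heap gets about 80% of each executor's share; the rest is left for overhead.
    heap_gb = max(1, int(memory_per_executor_gb * 0.8))
    return {
        "spark.executor.cores": str(cores),
        "spark.executor.memory": f"{heap_gb}g",
        # Dynamic allocation typically also needs an external shuffle service
        # or shuffle tracking to be enabled on the cluster.
        "spark.dynamicAllocation.enabled": "true",
    }
```

The resulting dictionary can be passed to a session builder or translated into `--conf` flags. The numbers are a starting point to adjust after observing real jobs.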

Partitioning

Target partition size: roughly 100-200MB
Too few partitions: cores sit idle and the cluster is underutilized
Too many partitions: scheduling overhead from many small tasks
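The size guideline above translates into a simple calculation. In this sketch the function name and the 128MB default (a value inside the 100-200MB range) are assumptions, and the total data size is taken as known by the caller.

```python
import math

MB = 1024 * 1024


def target_partition_count(total_bytes, target_partition_mb=128):
    """Number of partitions that keeps each one near the target size."""
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / (target_partition_mb * MB)))
```

The result can then be used as the argument to `repartition()` or `coalesce()`, depending on whether the partition count needs to go up or down.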

Data Processing

UDFs

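Prefer built-in functions over UDFs
When custom Python logic is unavoidable, use a Pandas UDF so data is processed in vectorized batches
Avoid row-at-a-time Python UDFs (each row is serialized between the JVM and a Python worker)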
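To illustrate the first point, string cleanup that is often written as a Python UDF can be expressed with built-in SQL functions instead. This is a minimal sketch: the function name and the `_clean` column suffix are hypothetical, and it relies on `DataFrame.selectExpr` with the built-in `upper` and `trim` functions.

```python
def add_clean_column(df, column):
    """Add '<column>_clean' using built-in SQL functions instead of a Python UDF."""
    # upper() and trim() are evaluated inside the JVM, so no rows are sent to a
    # Python worker, unlike a Python UDF doing similar string cleanup.
    return df.selectExpr("*", f"upper(trim(`{column}`)) AS `{column}_clean`")
```

Because the expression stays within Spark SQL, the optimizer can also see and plan around it, which is not the case for an opaque Python function.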

Storage Formats

Parquet: default for analytics (columnar, compression)
ORC: alternative to Parquet
Delta/Iceberg: ACID transactions, time travel

References

Spark Performance Tuning

Description: Apache Spark best practices for PySpark and Scala distributed data processing
Stars: 11
Category: data-ai
Updated: Apr 3, 2026

The npx skillsauth command above installs this skill globally. It works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
VirusTotal (multi-engine malware detection): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 3, 2026, 8:18 PM (4.0s, 1 file scanned)

SKILL.md

name: spark-best-practices
description: Apache Spark best practices for PySpark and Scala distributed data processing
scope: core
user-invocable: false

Install manually

# Clone the repo
git clone https://github.com/baekenough/oh-my-customcode.git

# Copy into Claude Code skills folder (global)
cp -r oh-my-customcode/.claude/skills/spark-best-practices ~/.claude/skills/

See the official Claude Code Skills documentation for the skills path.

Repository: baekenough/oh-my-customcode (11 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT