skills/emillindfors/parquet-optimization/SKILL.md
Proactively analyzes Parquet file operations and suggests optimization improvements for compression, encoding, row group sizing, and statistics. Activates when users are reading or writing Parquet files or discussing Parquet performance.
You are an expert at optimizing Parquet file operations for performance and efficiency. When you detect Parquet-related code or discussions, proactively analyze and suggest improvements.
Activate this skill when you notice:
AsyncArrowWriter or ParquetRecordBatchStreamBuilder
When you see Parquet operations, check for these optimizations:
1. Compression Settings
Compression::ZSTD(ZstdLevel::try_new(3)?)
.set_compression() in WriterProperties
Suggestion template:
I notice you're writing Parquet files without explicit compression settings.
For production data lakes, I recommend:
WriterProperties::builder()
.set_compression(Compression::ZSTD(ZstdLevel::try_new(3)?))
.build()
This typically provides around 3-4x compression with modest CPU overhead; the exact ratio depends on the data.
2. Row Group Sizing
.set_max_row_group_size()
Suggestion template:
Your row groups might be too small for optimal S3 scanning.
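Target 100MB-1GB uncompressed. Note that set_max_row_group_size takes a maximum row count, not a byte size, so pick a value that lands in that range for your row width:
WriterProperties::builder()
.set_max_row_group_size(100_000_000) // maximum rows per row group, not bytes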
.build()
This enables better predicate pushdown and reduces metadata overhead.
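Because the row group limit in the Rust parquet crate is expressed in rows, a byte target has to be converted before it is passed to set_max_row_group_size. The following is a minimal sketch of that conversion; the helper name rows_per_row_group and the idea of estimating an average uncompressed row width are assumptions for illustration, not part of the crate.

```rust
/// Convert a byte target for a row group into a row count.
/// `avg_row_bytes` is an assumed estimate of the uncompressed row width.
fn rows_per_row_group(target_bytes: u64, avg_row_bytes: u64) -> usize {
    // Guard against division by zero and always allow at least one row.
    let width = avg_row_bytes.max(1);
    (target_bytes / width).max(1) as usize
}
```

With a 128 MiB target and rows of roughly 128 bytes, this yields about one million rows per row group; the result would then be passed to set_max_row_group_size.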
3. Statistics Enablement
.set_statistics_enabled(EnabledStatistics::Page)
Suggestion template:
Enable statistics for better query performance with predicate pushdown:
WriterProperties::builder()
.set_statistics_enabled(EnabledStatistics::Page)
.build()
This allows DataFusion and other engines to skip irrelevant row groups.
4. Column-Specific Settings
Suggestion template:
For low-cardinality columns like 'category' or 'status', use dictionary encoding:
WriterProperties::builder()
// Dictionary encoding is enabled per column with a flag; passing
// Encoding::RLE_DICTIONARY to set_column_encoding() panics in the parquet crate.
.set_column_dictionary_enabled(ColumnPath::from("category"), true)
.set_column_compression(
ColumnPath::from("category"),
Compression::SNAPPY,
)
.build()
1. Column Projection
.with_projection(ProjectionMask::roots(...))
Suggestion template:
Reading all columns is inefficient. Use projection to read only what you need:
let projection = ProjectionMask::roots(&schema, vec![0, 2, 5]);
builder.with_projection(projection)
This can provide 10x+ speedup for wide tables.
2. Batch Size Tuning
.with_batch_size(8192) for memory control
Suggestion template:
For large files, control memory usage with batch size tuning:
builder.with_batch_size(8192)
Adjust based on your memory constraints and throughput needs.
3. Row Group Filtering
Suggestion template:
You can skip irrelevant row groups using statistics:
let row_groups: Vec<usize> = builder.metadata()
.row_groups()
.iter()
.enumerate()
.filter_map(|(idx, rg)| {
// Check statistics
if matches_criteria(rg.column(0).statistics()) {
Some(idx)
} else {
None
}
})
.collect();
builder.with_row_groups(row_groups)
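The matches_criteria call above is left undefined. One common way to implement it is a min/max range-overlap test; the sketch below shows that logic on plain integers so it runs without the parquet crate. The MinMax alias and select_row_groups are hypothetical names, and real code would first extract min and max from the column statistics.

```rust
/// Min/max statistics for one column chunk; `None` means the writer
/// recorded no statistics for that row group.
type MinMax = Option<(i64, i64)>;

/// Return the indices of row groups whose [min, max] range can contain
/// a value in [lo, hi]. Row groups without statistics are kept, because
/// they cannot be ruled out.
fn select_row_groups(stats: &[MinMax], lo: i64, hi: i64) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        .filter_map(|(idx, s)| match s {
            // The ranges are disjoint, so this row group can be skipped.
            Some((min, max)) if *max < lo || *min > hi => None,
            // Overlapping range or missing statistics: keep it.
            _ => Some(idx),
        })
        .collect()
}
```

Keeping row groups that have no statistics is the conservative choice: skipping them could silently drop matching rows.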
4. Streaming vs Collecting
Prefer while let Some(batch) = stream.next().await over .collect() for large datasets
Suggestion template:
For large files, stream batches instead of collecting:
let mut stream = builder.build()?;
while let Some(batch) = stream.next().await {
let batch = batch?;
process_batch(&batch)?;
// Batch is dropped here, freeing memory
}
For hot data (frequently accessed):
For warm data (balanced):
For cold data (archival):
Target file sizes:
Why?
Symptoms: Many files < 10MB
Solution: Suggest batching writes or file compaction
I notice you're writing many small Parquet files. This creates:
- Excessive metadata overhead
- More S3 LIST operations
- Slower query performance
Consider batching your writes or implementing periodic compaction.
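As a rough illustration of periodic compaction, the sketch below groups small files into batches that would each be rewritten as one larger file. The function plan_compaction, its thresholds, and the (name, size) input shape are assumptions for this example; it only plans the batches and does not read or write Parquet.

```rust
/// Group small files into compaction batches.
/// `files` holds (name, size_in_bytes). Files at or above `small_threshold`
/// are left alone. A batch is closed once adding another file would push
/// it past `target_bytes`.
fn plan_compaction(
    files: &[(&str, u64)],
    small_threshold: u64,
    target_bytes: u64,
) -> Vec<Vec<String>> {
    let mut batches: Vec<Vec<String>> = Vec::new();
    let mut current: Vec<String> = Vec::new();
    let mut current_bytes = 0u64;

    for &(name, size) in files {
        if size >= small_threshold {
            continue; // already large enough, not a compaction candidate
        }
        if !current.is_empty() && current_bytes + size > target_bytes {
            batches.push(std::mem::take(&mut current));
            current_bytes = 0;
        }
        current.push(name.to_string());
        current_bytes += size;
    }
    if !current.is_empty() {
        batches.push(current);
    }
    // Rewriting a single file gains nothing, so drop one-file batches.
    batches.retain(|batch| batch.len() > 1);
    batches
}
```

Each returned batch lists the files to read and rewrite together; the originals would be deleted only after the merged file is written successfully.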
Symptoms: All data in single directory
Solution: Suggest Hive-style partitioning
For large datasets (>100GB), partition your data by date or other dimensions:
data/events/year=2024/month=01/day=15/part-00000.parquet
This enables partition pruning for much faster queries.
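A small helper can keep the key layout consistent across writers. This is a minimal sketch that reproduces the path shape shown above; the function name partition_path and its parameters are hypothetical.

```rust
/// Build a Hive-style object key:
/// <prefix>/year=YYYY/month=MM/day=DD/part-NNNNN.parquet
fn partition_path(prefix: &str, year: u32, month: u32, day: u32, part: u32) -> String {
    format!(
        "{}/year={:04}/month={:02}/day={:02}/part-{:05}.parquet",
        // Avoid a double slash if the prefix already ends with one.
        prefix.trim_end_matches('/'),
        year,
        month,
        day,
        part
    )
}
```

Zero-padding the month, day, and part number keeps lexicographic order aligned with chronological order when keys are listed.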
Symptoms: Uncompressed or LZ4/Gzip
Solution: Recommend ZSTD
ZSTD generally compresses better than LZ4 and is considerably faster than Gzip at similar ratios, which makes it a good balance of size and speed:
Compression::ZSTD(ZstdLevel::try_new(3)?)
This is the recommended default for cloud data lakes.
Symptoms: No retry logic for object store operations
Solution: Add retry configuration
Parquet operations on cloud storage need retry logic:
let s3 = AmazonS3Builder::new()
.with_retry(RetryConfig {
max_retries: 3,
retry_timeout: Duration::from_secs(10),
..Default::default()
})
.build()?;
let props = WriterProperties::builder()
.set_writer_version(WriterVersion::PARQUET_2_0)
.set_compression(Compression::ZSTD(ZstdLevel::try_new(3)?))
.set_max_row_group_size(100_000_000) // maximum rows per row group, not bytes
.set_data_page_size_limit(1024 * 1024)
.set_dictionary_enabled(true)
.set_statistics_enabled(EnabledStatistics::Page)
.build();
let mut writer = AsyncArrowWriter::try_new(writer_obj, schema, Some(props))?;
let projection = ProjectionMask::roots(&schema, vec![0, 2, 5]);
let builder = ParquetRecordBatchStreamBuilder::new(reader)
.await?
.with_projection(projection)
.with_batch_size(8192);
let mut stream = builder.build()?;
while let Some(batch) = stream.next().await {
let batch = batch?;
process_batch(&batch)?;
}
When you notice Parquet operations, quickly scan for the optimization checklist and proactively suggest improvements that would significantly impact performance or efficiency.