skills/starrocks-expert/SKILL.md
Expert guidance for StarRocks database operations including table management, data loading, query optimization, materialized views, partitioning strategies, and cluster configuration. Use when working with StarRocks databases, OLAP queries, data warehousing, ETL pipelines, or when the user mentions StarRocks, distributed analytics, or columnar storage.
npx skillsauth add tiansuyu/agent-skills starrocks-expertInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
StarRocks is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries.
Check table structure:
SHOW CREATE TABLE database_name.table_name;
DESC database_name.table_name;
View partitions:
SHOW PARTITIONS FROM database_name.table_name;
Check data loading status:
SHOW LOAD WHERE label = 'your_load_label';
SHOW STREAM LOAD;
Query execution analysis:
EXPLAIN VERBOSE SELECT ...;
SHOW QUERY PROFILE '<query_id>';
Table Types:
Data Loading Methods:
Materialized Views:
Use this skill when:
For detailed table design guidance, see TABLE.md:
date_trunc() expressionsFor detailed infrastructure guidance, see INFRA.md:
For detailed query optimization guidance, see QUERY.md:
CREATE TABLE IF NOT EXISTS database_name.table_name (
event_time DATETIME NOT NULL COMMENT 'Event timestamp',
id BIGINT NOT NULL COMMENT 'Business ID',
amount DECIMAL(18, 2) COMMENT 'Transaction amount',
status VARCHAR(50) COMMENT 'Status code',
created_at DATETIME COMMENT 'Creation timestamp'
)
DUPLICATE KEY (event_time, id)
PARTITION BY date_trunc('day', event_time) -- Expression-based partitioning
DISTRIBUTED BY HASH(id) BUCKETS 16
PROPERTIES (
"replication_num" = "3",
"storage_medium" = "SSD",
"storage_cooldown_time" = "2024-12-31 23:59:59"
);
See TABLE.md for detailed guidance on table types, partitioning strategies, and index selection.
curl --location-trusted -u user:password \
-H "label:load_label_$(date +%s)" \
-H "column_separator:," \
-H "columns:date_col,id,amount,status,created_at" \
-T data.csv \
http://fe_host:8030/api/database_name/table_name/_stream_load
LOAD LABEL database_name.load_label_20240101
(
DATA INFILE("s3://bucket/path/data_*.parquet")
INTO TABLE table_name
FORMAT AS "parquet"
)
WITH BROKER
(
"aws.s3.access_key" = "your_access_key",
"aws.s3.secret_key" = "your_secret_key",
"aws.s3.region" = "us-east-1"
)
PROPERTIES (
"timeout" = "3600"
);
When optimizing queries:
Partitions are created automatically based on data:
-- Partitions auto-created on INSERT
CREATE TABLE events (
event_time DATETIME,
user_id BIGINT
)
DUPLICATE KEY (event_time)
PARTITION BY date_trunc('day', event_time) -- Auto-creates daily partitions
DISTRIBUTED BY HASH(user_id) BUCKETS 16;
-- Hourly granularity for high-frequency data
PARTITION BY date_trunc('hour', event_time)
-- Monthly for long-term storage
PARTITION BY date_trunc('month', event_time)
CREATE TABLE metrics (
tenant_id INT,
dt DATETIME,
metric_name STRING,
value DOUBLE
)
PRIMARY KEY (tenant_id, dt, metric_name)
PARTITION BY tenant_id, date_trunc('day', dt)
DISTRIBUTED BY HASH(tenant_id) BUCKETS 16;
-- Show partitions first
SHOW PARTITIONS FROM table_name;
-- Drop specific partition by name
ALTER TABLE table_name DROP PARTITION p20230101_10001;
CREATE MATERIALIZED VIEW mv_daily_summary
BUILD IMMEDIATE
REFRESH ASYNC START('2024-01-01 00:00:00') EVERY(INTERVAL 1 DAY)
AS
SELECT
date_col,
status,
COUNT(*) as total_count,
SUM(amount) as total_amount,
AVG(amount) as avg_amount
FROM database_name.table_name
GROUP BY date_col, status;
SHOW LOAD WHERE label = 'your_label'SHOW PROC '/backends'EXPLAIN VERBOSE to check execution plan-- Check compaction status
SHOW PROC '/backends';
SHOW TABLET STATUS FROM table_name;
-- Trigger manual compaction if needed
ALTER TABLE table_name COMPACT;
date_trunc() expressions for time-based partitioning❌ Don't use VARCHAR for time columns that will be partitioned ❌ Don't apply functions to partition columns in WHERE (breaks pruning) ❌ Don't create too many buckets (causes small files) ❌ Don't create too few buckets (causes data skew) ❌ Don't skip partition pruning predicates ❌ Don't use UNIQUE KEY if you only need deduplication (use DUPLICATE + GROUP BY) ❌ Don't load data without labels (prevents duplicate detection) ❌ Don't create composite partitions that exceed 100k total partitions
"replication_num" = "3" # Number of replicas
"storage_medium" = "SSD" # Storage type (SSD/HDD)
"enable_persistent_index" = "true" # For primary key tables
"bloom_filter_columns" = "id,email" # Bloom filter index
"colocate_with" = "group_name" # Colocate tables
SET enable_pipeline_engine = true; # Enable vectorized execution
SET pipeline_dop = 0; # Parallelism (0 = auto)
SET query_timeout = 300; # Query timeout in seconds
SET enable_spill = true; # Enable disk spill for large joins
When working with a new StarRocks project:
Understand the use case
Design table schema
Set up data pipeline
Optimize queries
Monitor and maintain
devops
Local text-to-speech via sherpa-onnx (offline, no cloud)
devops
Feishu cloud storage file management. Activate when user mentions cloud space, folders, drive.
devops
Feishu document read/write operations. Activate when user mentions Feishu docs, cloud docs, or docx links.
devops
Local text-to-speech via sherpa-onnx (offline, no cloud)