spark-optimization

Optimize Apache Spark jobs with partitioning, caching, shuffle optimization, and memory tuning. Use when improving Spark performance, debugging slow jobs, or scaling data processing pipelines.


Apache Spark Performance Optimization Guide

Skill Overview


Spark Optimization provides production-grade performance tuning solutions for Apache Spark, covering partition strategies, memory tuning, shuffle optimization, and handling data skew. It helps address slow Spark jobs, OOM issues, and scalability problems.

Use Cases

1. Optimizing Slow Spark Jobs


When a Spark job takes too long to run, you need to investigate bottlenecks from perspectives such as the execution plan, partition strategy, and shuffle overhead. This skill offers a systematic performance analysis approach, including how to interpret the Spark UI, analyze the execution plan, and apply solution patterns for common performance issues.

2. Scaling Large-Scale Data Pipelines


When processing TB-level data, configuring executor resources, partition strategies, and memory parameters appropriately is crucial. This skill provides configuration templates for production environments and tuning recommendations to help Spark jobs scale smoothly to larger data volumes.

3. Diagnosing Data Skew and Memory Issues


Data skew can cause a few tasks to slow down the entire job, while improper memory settings can lead to frequent GC or OOM. This skill provides methods for detecting skew, salting strategies, and best practices for memory allocation to quickly pinpoint and resolve these issues.

Core Features

1. Intelligent Partition Strategy


Compute the optimal number of partitions based on data size (128–256MB per partition) to avoid excessive scheduling overhead from too many partitions, or resource waste from too few. Supports optimizations such as partition pruning and dynamic partition writing to reduce unnecessary data scanning.
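The 128–256MB rule can be turned into a simple sizing helper. This is an illustrative pure-Python sketch (the function name is ours, not part of any Spark API):

```python
import math

def optimal_partitions(total_size_gb: float, target_partition_mb: int = 128) -> int:
    """Illustrative helper: number of partitions so each holds ~target_partition_mb of data."""
    return max(1, math.ceil(total_size_gb * 1024 / target_partition_mb))
```

In Spark you would then apply the result with something like `df.repartition(optimal_partitions(100))`, which for 100GB at 128MB per partition gives 800 partitions.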

2. Join Performance Optimization


Select the best join strategy for each scenario: use broadcast joins for small tables to avoid a shuffle; use bucketed joins with pre-sorting to eliminate runtime shuffle between large tables; for skewed data, rely on AQE (Adaptive Query Execution) to split skewed partitions automatically, or fall back to manual salting.
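The decision logic above can be sketched as a small heuristic. This is a pure-Python illustration of the reasoning, not Spark's actual planner; the 10MB threshold mirrors the default value of spark.sql.autoBroadcastJoinThreshold:

```python
AUTO_BROADCAST_THRESHOLD_MB = 10  # mirrors Spark's default spark.sql.autoBroadcastJoinThreshold

def choose_join_strategy(small_table_mb: float, bucketed: bool = False) -> str:
    """Illustrative decision sketch for picking a join strategy."""
    if small_table_mb <= AUTO_BROADCAST_THRESHOLD_MB:
        # Small side fits in memory: ship it to every executor, no shuffle of the large side.
        return "broadcast-hash join (no shuffle of the large side)"
    if bucketed:
        # Both sides pre-bucketed and sorted on the join key: runtime shuffle is avoided.
        return "bucketed sort-merge join (shuffle avoided via pre-bucketing)"
    return "sort-merge join (AQE may split skewed partitions at runtime)"
```

In PySpark, the broadcast case corresponds to `large.join(broadcast(small), "key")` using `pyspark.sql.functions.broadcast`.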

3. Memory and Cache Management


Accurately calculate the memory allocation ratios for executors, balancing execution memory and cache memory to avoid memory overflow or frequent spilling. Provides guidance on choosing cache/persist and the timing for checkpointing to ensure efficient memory utilization.
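As a sketch of the arithmetic, Spark's unified memory model reserves a fixed 300MB per executor, then splits the remainder via spark.memory.fraction (default 0.6) and spark.memory.storageFraction (default 0.5). The helper below is illustrative, not a Spark API:

```python
RESERVED_MB = 300  # Spark's fixed reserved memory per executor

def unified_memory_split(executor_heap_mb: float,
                         memory_fraction: float = 0.6,     # spark.memory.fraction default
                         storage_fraction: float = 0.5):   # spark.memory.storageFraction default
    """Sketch of Spark's unified memory model with default fractions."""
    usable = executor_heap_mb - RESERVED_MB
    unified = usable * memory_fraction        # shared pool for execution + storage
    storage = unified * storage_fraction      # portion protected for cached data
    execution = unified - storage             # remainder available for shuffles, joins, sorts
    return {"unified": unified, "storage": storage, "execution": execution}
```

For an 8GB executor heap this yields roughly 4.7GB of unified memory, split evenly between storage and execution; note that storage can borrow from execution (and vice versa) at runtime, so these are soft boundaries.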

Common Questions

If a Spark job runs very slowly, where should you start troubleshooting?


First, use the Spark UI to check the execution plan and stage durations. Look for data skew (a few tasks taking far longer than the average), frequent disk spilling, or excessively long GC time. Common causes include poorly chosen partition counts, excessive shuffle data, too many small files, and high serialization overhead. Enable AQE (Adaptive Query Execution), use explain() to inspect the execution plan, and then adjust shuffle partition counts and broadcast thresholds accordingly.
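A typical starting configuration for this kind of diagnosis might look like the fragment below. The keys are standard Spark SQL settings; treat the grouping as a starting point, not a prescription:

```python
# Configuration sketch: AQE flags commonly enabled when diagnosing slow jobs.
# Pass each entry to SparkSession.builder.config(key, value) when building the session.
AQE_CONF = {
    "spark.sql.adaptive.enabled": "true",                      # turn on AQE
    "spark.sql.adaptive.coalescePartitions.enabled": "true",   # merge small shuffle partitions
    "spark.sql.adaptive.skewJoin.enabled": "true",             # split skewed join partitions
}
```

With the session built, `df.explain(mode="formatted")` (Spark 3.0+) prints a readable execution plan for spotting unexpected shuffles or missing broadcasts.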

How many Spark partitions should be set?


The recommended partition size should be kept between 128MB and 256MB.
Partition count = data size (GB) × 1024 / partition size (MB).
For example, with 100GB of data and 128MB partitions, you need about 800 partitions. Too few partitions lead to low executor utilization and memory pressure, while too many increase task-scheduling overhead. With AQE enabled, set spark.sql.adaptive.coalescePartitions.enabled=true so Spark automatically coalesces small shuffle partitions (on Databricks, spark.sql.shuffle.partitions=auto achieves the same effect).

What solutions exist for Spark data skew?


Data skew appears when a small number of tasks take much longer than the average. Solutions include:
1) Enable AQE skew join for automatic optimization;
2) Salt skewed keys to spread hot data across multiple partitions;
3) Use broadcast join to broadcast small tables to all executors;
4) Adjust the partition strategy by increasing partition counts along the skew dimension.
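The salting idea (option 2 above) can be sketched in plain Python; the salt count and function names are illustrative. In Spark you would build the salted key with an expression like `concat(col("key"), lit("_"), (rand() * N).cast("int"))`:

```python
import random

NUM_SALTS = 8  # illustrative; pick a value close to the observed skew factor

def salt_key(key: str, num_salts: int = NUM_SALTS) -> str:
    """Skewed side: append a random salt so one hot key spreads over num_salts buckets."""
    return f"{key}_{random.randrange(num_salts)}"

def replicate_key(key: str, num_salts: int = NUM_SALTS) -> list:
    """Other side: emit every salted variant of the key so the join still matches each bucket."""
    return [f"{key}_{i}" for i in range(num_salts)]
```

After the join you strip the salt suffix and re-aggregate; the cost is replicating the smaller side num_salts times, which is usually far cheaper than one straggler task processing the entire hot key.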

You can group records by spark_partition_id() to measure the data volume per partition; if the largest partition exceeds roughly 2x the average, the skew should be addressed.
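Once you have per-partition row counts (for example from grouping by `pyspark.sql.functions.spark_partition_id()` and counting), the 2x rule is a one-line check. A minimal sketch with illustrative names:

```python
def skew_ratio(partition_counts):
    """Largest partition size divided by the mean partition size."""
    mean = sum(partition_counts) / len(partition_counts)
    return max(partition_counts) / mean

def is_skewed(partition_counts, threshold: float = 2.0) -> bool:
    """Flag skew when the largest partition exceeds threshold x the average."""
    return skew_ratio(partition_counts) > threshold
```

For counts like [100, 100, 100, 900] the ratio is 3.0, so the job would benefit from one of the skew remedies above.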