I often find myself in situations where I need to optimize Spark scripts for processing large volumes of data. However, I frequently encounter challenges such as memory failures or excessively long evaluation times, which can lead to significant costs.
What are the best strategies you employ to test the performance of Spark jobs that handle heavy workloads without incurring substantial expenses? I'm particularly interested in methods that can help identify bottlenecks and optimize performance effectively while keeping costs low. Any insights or recommendations would be greatly appreciated.
I'm no expert in Spark (it's something I've been trying to learn more about recently), so I can't answer your question all that well from my own experience.
Still, I follow a lot of great Data Engineers in the field, and one of them happens to have written about this literally today! I'll quote from him:
Many high-paying data engineering jobs require expertise with distributed data processing, usually Apache Spark. Distributed data processing systems are inherently complex; add the fact that Spark provides us with multiple optimization features (knobs to turn), and it becomes tricky to know what the right approach is.
You realize there are too many moving parts (configs, metadata, partitioning, clustering, bucketing, transformation types, join strategies, lazy loading, testing, etc.) to learn to become a Spark expert. On top of all the moving parts, you also need to know how they interact and influence each other during data processing. Trying to understand all of the components of Spark feels like fighting an uphill battle with no end in sight; there is always something else to learn or know about.
What if you knew precisely how Apache Spark works internally and the optimization techniques you can use? A distributed data processing system's optimization techniques (partitioning, clustering, sorting, data shuffling, join strategies, task parallelism, etc.) are like knobs, each with its own tradeoffs.
When it comes to gaining mastery of Spark (and most distributed data processing systems), the fundamental ideas are:
- Reduce the amount of data (think raw size) to be processed.
- Reduce the amount of data that needs to be moved between executors in the Spark cluster (data shuffle).
I recommend thinking about reducing data to be processed and shuffled in the following ways:
- Data Storage: How you store your data dictates how much of it needs to be processed. Does your query often filter on a particular column? Partition your data by that column. Use a file format that stores metadata Spark can exploit at read time (e.g., Parquet). Co-locate data with bucketing to reduce data shuffle. If you need advanced features like time travel or schema evolution, use a table format (such as Delta Lake).
- Data Processing: Filter before processing (Spark's lazy evaluation lets it push filters down automatically), analyze resource usage (with the Spark UI) to ensure maximum parallelism, know which kinds of code result in a data shuffle, and understand how Spark performs joins internally so you can minimize shuffling.
- Data Model: Know how to model your data for the types of queries you expect in a data warehouse. Analyze the tradeoff between pre-processing and data freshness when deciding whether to store data as one big table.
- Query Planner: Use the query plan to check how Spark plans to process the data. Ensure metadata is up to date with statistical information about your data to help Spark choose the optimal way to process it.
- Writing efficient queries: While Spark performs many optimizations under the hood, writing efficient queries is a key skill. Learn how to write code that is readable and performs only the computations you actually need.
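To make the "reduce the data processed and shuffled" idea concrete, here's a minimal PySpark sketch of my own (not from his post); the paths, table names, and columns are made up for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("reduce-data-and-shuffle").getOrCreate()

# Read only what you need: select a few columns and filter early so Spark
# can prune columns and push the predicate down into the Parquet scan.
orders = (
    spark.read.parquet("s3://my-bucket/orders")  # hypothetical path
    .select("order_id", "customer_id", "amount", "order_date")
    .filter(F.col("order_date") >= "2024-01-01")
)

# The customer dimension is small, so broadcast it to every executor; this
# turns a shuffle-heavy sort-merge join into a broadcast hash join.
customers = spark.read.parquet("s3://my-bucket/customers")  # hypothetical path
enriched = orders.join(F.broadcast(customers), on="customer_id", how="left")

enriched.groupBy("customer_id").agg(F.sum("amount").alias("total_spend")).show()
```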
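For the data storage point (partitioning, Parquet, bucketing), here's a rough sketch continuing from the snippet above; treat it as a starting point rather than a recipe:

```python
# Partition by the column your queries usually filter on, so Spark can skip
# whole directories at read time (partition pruning).
(
    enriched.write.mode("overwrite")
    .partitionBy("order_date")
    .parquet("s3://my-bucket/orders_enriched")  # hypothetical output path
)

# Bucketing co-locates rows with the same key, which can avoid a shuffle in
# later joins/aggregations on that key. Bucketed writes go through
# saveAsTable because they need the metastore.
(
    enriched.write.mode("overwrite")
    .bucketBy(16, "customer_id")
    .sortBy("customer_id")
    .saveAsTable("orders_enriched_bucketed")
)
```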
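And for the query planner point: explain() shows the physical plan (including which join strategy Spark chose and where the shuffles are), and keeping table statistics fresh helps the cost-based optimizer. Same made-up names as above:

```python
# Look for the join strategy (BroadcastHashJoin vs. SortMergeJoin) and for
# Exchange nodes, which are the shuffles.
enriched.explain(mode="formatted")

# For tables in the metastore, keep statistics up to date so the cost-based
# optimizer can make better choices (e.g., what to broadcast).
spark.sql("ANALYZE TABLE orders_enriched_bucketed COMPUTE STATISTICS FOR ALL COLUMNS")
```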
Going back to his post: he also has a great diagram showing these concepts together, but since I don't think I can share that in this answer, feel free to message me on Taro Slack or LinkedIn and I'll share it there!
Also, if you haven't already, I recommend asking ChatGPT or Bard for tips, as they can point you in the right direction.
Hope this helps!
Thank you Gideon! This is very helpful!
Just to follow up: Joseph Machado's new course on Spark is out, and I think it's a great way to learn more!
https://josephmachado.podia.com/efficient-data-processing-in-spark?coupon=SUBSPECIAL524
I haven't used Spark, but I have used Dask, a Python-native library that, like PySpark, does distributed data processing. Below are my insights for debugging in Dask, but similar ideas should apply to Spark.
I found the Dask dashboard really important (the Spark UI is the equivalent).
To improve performance without running large jobs, I benchmark on small(ish) datasets and see how each change affects performance. Most Spark/Dask jobs are a groupby + apply.
Spark basically parallelizes lots of tiny tasks, so if you can isolate that tiny unit of work and make it as efficient as possible, the gains carry over to the large jobs as well.
So I test the apply code on a small dataset and measure how different changes to my apply function affect performance.
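Concretely, that looks something like the sketch below (the column names and file are made up; I'm just timing the per-group function that the big job will apply thousands of times):

```python
import time

import pandas as pd

# The per-group function that the distributed job will apply over and over.
def featurize(group: pd.DataFrame) -> pd.Series:
    return pd.Series({
        "total": group["amount"].sum(),
        "n_orders": len(group),
    })

# Benchmark it on a small local sample first; if this is slow, the cluster
# job will be slow too, no matter how many workers you throw at it.
sample = pd.read_parquet("data/orders_sample.parquet")  # small local file

start = time.perf_counter()
result = sample.groupby("customer_id").apply(featurize)
print(f"{time.perf_counter() - start:.2f}s for {len(sample):,} rows")
```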
In my experience, 95% of the issues come from partition sizes. If you're struggling with performance, changing the partition size is the best first step. Ideally start with ~100MB partitions. Make sure the partitions aren't getting bloated throughout your code as you add/remove columns. If an individual partition/apply step takes a long time, reduce the partition size.
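In Dask, that tuning looks roughly like the snippet below (in Spark, the spark.sql.files.maxPartitionBytes setting and df.repartition() play a similar role); the path and sizes are just illustrative:

```python
import dask.dataframe as dd

ddf = dd.read_parquet("s3://my-bucket/orders/")  # hypothetical path

# Check how big the partitions actually are; bloated partitions (e.g. after
# adding wide columns) are a common cause of memory failures.
print(ddf.npartitions)
print(ddf.memory_usage_per_partition(deep=True).compute())

# Aim for roughly 100MB partitions as a starting point; shrink them if a
# single partition's apply step takes too long or blows up memory.
ddf = ddf.repartition(partition_size="100MB")
```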
Honestly, the biggest performance gains came from reframing the problem and rethinking the approach to the actual problem you're solving. Talking with your coworkers and getting an outside perspective helps a lot.
Sometimes I get too stuck trying to optimize code, when from an outside perspective you might realize you can trade off a bit of accuracy in the features for a huge performance gain.
For example, I remember the story of how, instead of trying to make elevators faster, a building simply installed mirrors; people were busy looking at themselves, so the elevators felt faster.