PySpark tips for day-to-day use
1. Checkpointing 📝:
Use `df.checkpoint()` to break the lineage and avoid long dependency chains in iterative jobs.
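A minimal sketch, assuming an active SparkSession named `spark`; the checkpoint directory is a placeholder, so point it at durable storage in real jobs:

```python
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

df = spark.range(1_000_000)
for i in range(10):
    df = df.withColumn("v", (df["id"] + i) % 7)
    if i % 3 == 0:
        df = df.checkpoint()  # materializes the data and truncates lineage
```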
2. Partition Pruning 🌳:
Leverage `df.filter("date = '2024-01-01'")` on partitioned data to skip unnecessary data scans.
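For example, assuming a Parquet dataset partitioned by `date` (the path is hypothetical):

```python
events = spark.read.parquet("/data/events")  # partitioned by date=YYYY-MM-DD
one_day = events.filter("date = '2024-01-01'")
one_day.explain(True)  # PartitionFilters in the scan confirm the pruning
```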
3. Z-Ordering for Performance ⚡:
In Delta tables, run `OPTIMIZE` with `ZORDER BY` to co-locate related data and speed up queries on frequently filtered columns.
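A sketch, where the table name `events` and the column `user_id` are assumptions:

```python
spark.sql("OPTIMIZE events ZORDER BY (user_id)")
```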
4. Dynamic Partition Overwrite 🔄:
Enable `spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")` for efficient partition updates.
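With that setting, an overwrite touches only the partitions present in the incoming data. A sketch (path and partition column are assumptions):

```python
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Only the date partitions present in `updates` get rewritten;
# the rest of the table is left intact.
(updates.write
    .mode("overwrite")
    .partitionBy("date")
    .parquet("/data/events"))
```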
5. Avoid Wide Transformations 📉:
Shuffles are expensive; cut down on wide operations like `groupBy()` and `join()` where you can, and trim data before the ones you keep.
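One way to shrink a shuffle, with hypothetical `orders` and `users` DataFrames: prune rows and columns before the join.

```python
open_orders = orders.filter("status = 'OPEN'").select("user_id", "total")
result = open_orders.join(users, "user_id")  # less data crosses the shuffle
```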
6. Use `mapPartitions` for Efficiency 🚀:
When per-row work shares an expensive setup step (a model, a connection), process whole partitions with `df.rdd.mapPartitions()` so that cost is paid once per partition rather than once per row.
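A sketch, assuming `df` has an `id` column; the pure function below stands in for a computation with heavy per-partition setup:

```python
import math

def heavy_transform(rows):
    # Expensive setup (load a model, open a connection) would go here,
    # once per partition; then iterate over the rows.
    for row in rows:
        yield (row["id"], math.sqrt(row["id"]))

result = df.rdd.mapPartitions(heavy_transform).toDF(["id", "root"])
```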
7. Memory-Efficient Joins 🧠:
If the smaller dataset fits in driver memory, collect it with `rdd.collectAsMap()` and broadcast the result for a shuffle-free map-side join.
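A sketch of the pattern; all table and column names are hypothetical:

```python
# Collect the small side into a dict on the driver, broadcast it,
# then look values up per row, avoiding a shuffle entirely.
small_map = small_df.rdd.map(lambda r: (r["key"], r["value"])).collectAsMap()
lookup = spark.sparkContext.broadcast(small_map)

joined = big_df.rdd.map(
    lambda r: (r["key"], r["amount"], lookup.value.get(r["key"]))
)
```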
8. Use `selectExpr()` for SQL Expressions 🗃️:
Simplify transformations: `df.selectExpr("col1 * 2 as col1_doubled")`.
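For example, with hypothetical columns `col1` and `col2`:

```python
out = df.selectExpr("col1 * 2 as col1_doubled", "upper(col2) as col2_upper")
```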
9. Schema Inference Control 🧐:
For CSVs, skip automatic schema inference (`inferSchema=False`) and supply an explicit schema for faster reads; inference costs an extra pass over the file.
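A sketch with assumed column names and file path:

```python
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
])
df = spark.read.csv("/data/users.csv", header=True, schema=schema)
```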
10. Audit DataFrames 🕵️‍♂️:
Use `df.explain(True)` to see execution plans and optimize accordingly.
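For example:

```python
# Look for PushedFilters / PartitionFilters in the scan and for
# Exchange (shuffle) nodes in the physical plan.
df.filter("date = '2024-01-01'").explain(True)
```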