How would you remove duplicates from a large dataset in PySpark?
1. Load Dataset into Dataframe
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)
2. Check for Duplicates
dup_count = df.count() - df.dropDuplicates().count()
3. Partition the data to optimize performance
df_repartitioned = df.repartition(100)
4. Remove duplicates using the dropDuplicates() method
df_no_dup = df_repartitioned.dropDuplicates()
5. Cache the resulting Dataframe to avoid recomputing
df_no_dup.cache()
6. Save the cleaned dataset
df_no_dup.write.csv("path/to/cleaned/data.csv", header=True)
Why partition the data in step 3?
Repartitioning distributes the rows (and therefore the computation) across multiple partitions and nodes, so the deduplication runs in parallel, making the process more efficient and scalable.
Why cache the resulting DataFrame in step 5?
Caching the DataFrame avoids recomputing the entire deduplication pipeline if the cleaned data is used again after it is saved, which can significantly improve performance.