Pandas vs PySpark vs SQL


Each operation below is shown side by side in PySpark, pandas (Python), and SQL.

**Data Loading**
- PySpark: `df = spark.read.csv("file.csv")`
- pandas: `df = pd.read_csv("file.csv")`
- SQL: `SELECT * FROM table_name`
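As a runnable sketch of the pandas row (the file name, column names, and contents here are made up; `StringIO` stands in for a real `file.csv` on disk):

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for "file.csv"
csv_text = "col1,col2\n10,a\n60,b\n75,c\n"

# pandas equivalent of loading a CSV with a header row
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)  # (3, 2)
```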

**Filtering Rows**
- PySpark: `df.filter(df['col1'] > 50)`
- pandas: `df[df['col1'] > 50]`
- SQL: `SELECT * FROM table_name WHERE col1 > 50`

**Selecting Columns**
- PySpark: `df.select('col1', 'col2')`
- pandas: `df[['col1', 'col2']]`
- SQL: `SELECT col1, col2 FROM table_name`
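A quick pandas sketch of both rows together, on a small invented frame:

```python
import pandas as pd

df = pd.DataFrame({"col1": [10, 60, 75], "col2": ["a", "b", "c"]})

# Filtering rows: boolean mask keeps rows where col1 > 50
filtered = df[df["col1"] > 50]

# Selecting columns: double brackets return a DataFrame, not a Series
selected = df[["col1", "col2"]]

print(len(filtered))  # 2
```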

**Grouping and Aggregation**
- PySpark: `df.groupBy("col1").agg({"col1": "sum"})` (note that `agg` takes a dictionary mapping columns to aggregate functions)
- pandas: `df.groupby('col1').sum()`
- SQL: `SELECT col1, SUM(col1) FROM table_name GROUP BY col1`
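pandas' `agg` also accepts the same dictionary form as the PySpark call above; a minimal sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({"col1": ["a", "a", "b"], "val": [1, 2, 3]})

# Dict form mirrors the PySpark agg({"col": "func"}) style
out = df.groupby("col1").agg({"val": "sum"})

print(out.loc["a", "val"])  # 3
```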

**Joining Tables/DataFrames**
- PySpark: `df1.join(df2, df1.id == df2.id)` (the join condition uses `==`, not `=`)
- pandas: `df1.merge(df2, on='id')`
- SQL: `SELECT * FROM table1 JOIN table2 ON table1.id = table2.id`
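A small pandas sketch of the join row (frames and keys invented); `merge` defaults to an inner join on the shared key, like SQL's `JOIN ... ON`:

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3], "x": ["a", "b", "c"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "y": ["p", "q", "r"]})

# Inner join: only ids present in both frames survive
joined = df1.merge(df2, on="id")

print(len(joined))  # 2 (ids 2 and 3 appear in both frames)
```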

**Adding a New Column**
- PySpark: `df = df.withColumn("new_col", df["col"] + 10)`
- pandas: `df['new_col'] = df['col'] + 10`
- SQL: `ALTER TABLE table_name ADD COLUMN new_col INT;`
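In pandas this assignment is vectorized over the whole column, no loop needed (data invented):

```python
import pandas as pd

df = pd.DataFrame({"col": [1, 2, 3]})

# Element-wise addition applied to every row at once
df["new_col"] = df["col"] + 10

print(df["new_col"].tolist())  # [11, 12, 13]
```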

**Removing a Column**
- PySpark: `df = df.drop("col1")`
- pandas: `df = df.drop("col1", axis=1)`
- SQL: `ALTER TABLE table_name DROP COLUMN col1;`

**Handling Missing Data**
- PySpark: `df.fillna(value)`
- pandas: `df.fillna(value)`
- SQL: `UPDATE table_name SET column = value WHERE column IS NULL`
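A short pandas sketch contrasting the two usual missing-data strategies (invented data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col": [1.0, np.nan, 3.0]})

# fillna substitutes a constant for NaN; dropna removes the rows instead
filled = df.fillna(0)
dropped = df.dropna()

print(filled["col"].tolist())  # [1.0, 0.0, 3.0]
```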

**Sorting Data**
- PySpark: `df.orderBy(df["col"].desc())`
- pandas: `df.sort_values('col', ascending=False)`
- SQL: `SELECT * FROM table_name ORDER BY column DESC`
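The pandas row as a sketch (data invented); `ascending=False` gives the same order as SQL's `ORDER BY col DESC`:

```python
import pandas as pd

df = pd.DataFrame({"col": [2, 3, 1]})

# Descending sort, like ORDER BY col DESC
out = df.sort_values("col", ascending=False)

print(out["col"].tolist())  # [3, 2, 1]
```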

**Window Functions**
- PySpark: `from pyspark.sql.window import Window` and `from pyspark.sql.functions import rank`, then `window = Window.partitionBy("col").orderBy("col2")` and `df.withColumn('rank', rank().over(window))` (note the capital `W` in `Window`; `rank()` requires an ordered window)
- pandas: `df['rank'] = df['column'].rank()`
- SQL: `SELECT col1, RANK() OVER (PARTITION BY col1 ORDER BY col2) FROM table_name`
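Plain `Series.rank()` ranks the whole column; to approximate `RANK() OVER (PARTITION BY ...)` in pandas you rank within groups. A sketch with invented columns:

```python
import pandas as pd

df = pd.DataFrame({"grp": ["a", "a", "b"], "val": [10, 30, 20]})

# groupby + rank approximates RANK() OVER (PARTITION BY grp ORDER BY val);
# method="min" matches SQL RANK's tie behavior
df["rank"] = df.groupby("grp")["val"].rank(method="min")

print(df["rank"].tolist())  # [1.0, 2.0, 1.0]
```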

**Saving Data**
- PySpark: `df.write.csv("file.csv")`
- pandas: `df.to_csv("file.csv")`
- SQL: `COPY (SELECT * FROM table_name) TO 'file.csv' WITH CSV` (PostgreSQL syntax)
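A round-trip sketch of the pandas row, using an in-memory buffer instead of a real `file.csv` so it is self-contained:

```python
import io

import pandas as pd

df = pd.DataFrame({"col": [1, 2]})

# Write to a buffer (index=False avoids a spurious index column),
# then read it straight back
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
back = pd.read_csv(buf)

print(back.equals(df))  # True
```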

**Performance**
- PySpark: optimized for distributed processing
- pandas: limited by a single machine's memory
- SQL: dependent on the database engine's capabilities

**Handling Null Values**
- PySpark: `df.fillna(value)`, `df.dropna()`, `df.replace(to_replace, value)`
- pandas: the same methods: `fillna`, `dropna`, `replace`
- SQL: `COALESCE`, `NULLIF`, `IS NULL`, `IS NOT NULL`

**Pivoting/Unpivoting Data**
- PySpark: `df.groupBy('col').pivot("column_to_pivot").sum("value")`
- pandas: `df.pivot_table(values='value', index='column', columns='column_to_pivot', aggfunc='sum')`
- SQL: `PIVOT (SUM(value) FOR column_to_pivot IN (...))` (DBMS-dependent)
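The pandas pivot call above, made concrete on a tiny invented long-format frame:

```python
import pandas as pd

df = pd.DataFrame({
    "column": ["r1", "r1", "r2"],
    "column_to_pivot": ["x", "y", "x"],
    "value": [1, 2, 3],
})

# Long-to-wide reshape with a sum aggregate; missing combinations become NaN
wide = df.pivot_table(values="value", index="column",
                      columns="column_to_pivot", aggfunc="sum")

print(wide.loc["r1", "x"])  # 1
```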

**Iterating Over Rows**
- PySpark: not recommended because of its distributed nature; use `map` or `foreach` instead
- pandas: `for index, row in df.iterrows():` (slow for large datasets; prefer vectorized operations)
- SQL: typically not done row by row; handled declaratively via queries
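A sketch contrasting the slow loop with its vectorized equivalent (data invented):

```python
import pandas as pd

df = pd.DataFrame({"col": [1, 2, 3]})

# Row-by-row loop: works, but pays Python-level overhead per row
total = 0
for _, row in df.iterrows():
    total += row["col"]

# Vectorized equivalent: one call, same answer, much faster at scale
assert total == df["col"].sum()
print(total)  # 6
```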

**Performance Optimization**
- PySpark: Catalyst optimizer, Tungsten execution engine
- pandas: N/A; single-threaded execution
- SQL: indexing, query optimization, parallel execution (depends on the DBMS)

**DataFrame/Query Caching**
- PySpark: `df.cache()`
- pandas: N/A; the entire DataFrame is already in memory
- SQL: `CACHE TABLE` or `CREATE INDEX` (DBMS-dependent)

**Schema Management**
- PySpark: supports schema enforcement and evolution via the DataFrame schema
- pandas: dtypes are inferred, but can be set manually
- SQL: managed via `CREATE TABLE` and `ALTER TABLE`

**Data Visualization**
- PySpark: not directly supported, but integrates with libraries such as matplotlib
- pandas: supports matplotlib, seaborn, etc.
- SQL: not supported; data is visualized via external tools such as Tableau or Power BI

**Debugging**
- PySpark: log-based debugging; `explain()` for execution plans
- pandas: direct inspection with `print`, `info()`, `describe()`
- SQL: `EXPLAIN` for query plans; error messages

**File Formats Supported**
- PySpark: CSV, JSON, Parquet, ORC, Avro, etc.
- pandas: CSV, Excel, JSON, HDF5, etc.
- SQL: varies; often requires external tools or extensions for non-tabular formats

**Complex Aggregations**
- PySpark: `df.groupBy("column").agg({"col1": "sum", "col2": "max"})`
- pandas: `df.groupby("column").agg({"col1": "sum", "col2": "max"})`
- SQL: complex aggregations via `GROUP BY`, subqueries
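The pandas dict form, applying a different aggregate to each column, on an invented frame:

```python
import pandas as pd

df = pd.DataFrame({"column": ["a", "a", "b"],
                   "col1": [1, 2, 3], "col2": [5, 9, 7]})

# One dict, one pass: sum col1 and take the max of col2 per group
out = df.groupby("column").agg({"col1": "sum", "col2": "max"})

print(out.loc["a"].tolist())  # [3, 9]
```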

**Window Function Support**
- PySpark: extensive support, e.g. `rank()`, `row_number()`, `lag()`
- pandas: limited support via `rolling`, `expanding`
- SQL: extensive support, e.g. `ROW_NUMBER()`, `RANK()`, `LAG()`
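A sketch of pandas' `rolling`, its closest analogue to a SQL window frame such as `ROWS BETWEEN 1 PRECEDING AND CURRENT ROW` (data invented):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# Moving sum over a 2-row window; the first row has no full
# window yet, so it comes back as NaN
out = s.rolling(window=2).sum()

print(out.tolist())  # [nan, 3.0, 5.0, 7.0]
```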

**Date/Time Functions**
- PySpark: `date_add()`, `date_sub()`, `datediff()`, `to_date()`
- pandas: `pd.to_datetime()`, the `df['date'].dt` accessor
- SQL: `DATEADD`, `DATEDIFF`, `TO_DATE`, `TO_CHAR`
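The pandas pattern in practice: parse strings with `pd.to_datetime`, then use the `.dt` accessor (dates invented):

```python
import pandas as pd

df = pd.DataFrame({"date": ["2024-01-15", "2024-03-01"]})

# Parse strings to datetime64, then pull components via .dt
df["date"] = pd.to_datetime(df["date"])
df["month"] = df["date"].dt.month

print(df["month"].tolist())  # [1, 3]
```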

**Parallel Processing**
- PySpark: yes, across multiple nodes
- pandas: no; single machine
- SQL: typically no, unless the DBMS supports parallel query execution

**Data Source Integration**
- PySpark: reads from HDFS, S3, JDBC, etc.
- pandas: reads from local files, and from databases via connectors
- SQL: reads from database tables
