Pandas vs PySpark vs SQL


Each operation below is shown side by side in PySpark, pandas (Python), and SQL.

**Data Loading**
- PySpark: `df = spark.read.csv("file.csv")`
- pandas: `df = pd.read_csv("file.csv")`
- SQL: `SELECT * FROM table_name`
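As a runnable sketch of the pandas row (the file name, column names, and contents here are made up; `StringIO` stands in for a real `file.csv` on disk):

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for "file.csv"
csv_text = "col1,col2\n10,a\n60,b\n75,c\n"

# pandas equivalent of loading a CSV with a header row
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)  # (3, 2)
```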

**Filtering Rows**
- PySpark: `df.filter(df['col1'] > 50)`
- pandas: `df[df['col1'] > 50]`
- SQL: `SELECT * FROM table_name WHERE col1 > 50`

**Selecting Columns**
- PySpark: `df.select('col1', 'col2')`
- pandas: `df[['col1', 'col2']]`
- SQL: `SELECT col1, col2 FROM table_name`
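A quick pandas sketch of both rows together, on a small invented frame:

```python
import pandas as pd

df = pd.DataFrame({"col1": [10, 60, 75], "col2": ["a", "b", "c"]})

# Filtering rows: boolean mask keeps rows where col1 > 50
filtered = df[df["col1"] > 50]

# Selecting columns: double brackets return a DataFrame, not a Series
selected = df[["col1", "col2"]]

print(len(filtered))  # 2
```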

**Grouping and Aggregation**
- PySpark: `df.groupBy("col1").agg({"col1": "sum"})` (note that `agg` takes a dictionary mapping columns to aggregate functions)
- pandas: `df.groupby('col1').sum()`
- SQL: `SELECT col1, SUM(col1) FROM table_name GROUP BY col1`
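pandas' `agg` also accepts the same dictionary form as the PySpark call above; a minimal sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({"col1": ["a", "a", "b"], "val": [1, 2, 3]})

# Dict form mirrors the PySpark agg({"col": "func"}) style
out = df.groupby("col1").agg({"val": "sum"})

print(out.loc["a", "val"])  # 3
```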

**Joining Tables/DataFrames**
- PySpark: `df1.join(df2, df1.id == df2.id)` (the join condition uses `==`, not `=`)
- pandas: `df1.merge(df2, on='id')`
- SQL: `SELECT * FROM table1 JOIN table2 ON table1.id = table2.id`
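A small pandas sketch of the join row (frames and keys invented); `merge` defaults to an inner join on the shared key, like SQL's `JOIN ... ON`:

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3], "x": ["a", "b", "c"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "y": ["p", "q", "r"]})

# Inner join: only ids present in both frames survive
joined = df1.merge(df2, on="id")

print(len(joined))  # 2 (ids 2 and 3 appear in both frames)
```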

**Adding a New Column**
- PySpark: `df = df.withColumn("new_col", df["col"] + 10)`
- pandas: `df['new_col'] = df['col'] + 10`
- SQL: `ALTER TABLE table_name ADD COLUMN new_col INT;`
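In pandas this assignment is vectorized over the whole column, no loop needed (data invented):

```python
import pandas as pd

df = pd.DataFrame({"col": [1, 2, 3]})

# Element-wise addition applied to every row at once
df["new_col"] = df["col"] + 10

print(df["new_col"].tolist())  # [11, 12, 13]
```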

**Removing a Column**
- PySpark: `df = df.drop("col1")`
- pandas: `df = df.drop("col1", axis=1)`
- SQL: `ALTER TABLE table_name DROP COLUMN col1;`

**Handling Missing Data**
- PySpark: `df.fillna(value)`
- pandas: `df.fillna(value)`
- SQL: `UPDATE table_name SET column = value WHERE column IS NULL`
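A short pandas sketch contrasting the two usual missing-data strategies (invented data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col": [1.0, np.nan, 3.0]})

# fillna substitutes a constant for NaN; dropna removes the rows instead
filled = df.fillna(0)
dropped = df.dropna()

print(filled["col"].tolist())  # [1.0, 0.0, 3.0]
```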

**Sorting Data**
- PySpark: `df.orderBy(df["col"].desc())`
- pandas: `df.sort_values('col', ascending=False)`
- SQL: `SELECT * FROM table_name ORDER BY column DESC`
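The pandas row as a sketch (data invented); `ascending=False` gives the same order as SQL's `ORDER BY col DESC`:

```python
import pandas as pd

df = pd.DataFrame({"col": [2, 3, 1]})

# Descending sort, like ORDER BY col DESC
out = df.sort_values("col", ascending=False)

print(out["col"].tolist())  # [3, 2, 1]
```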

**Window Functions**
- PySpark: `from pyspark.sql.window import Window` and `from pyspark.sql.functions import rank`, then `window = Window.partitionBy("col").orderBy("col2")` and `df.withColumn('rank', rank().over(window))` (note the capital `W` in `Window`; `rank()` requires an ordered window)
- pandas: `df['rank'] = df['column'].rank()`
- SQL: `SELECT col1, RANK() OVER (PARTITION BY col1 ORDER BY col2) FROM table_name`
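Plain `Series.rank()` ranks the whole column; to approximate `RANK() OVER (PARTITION BY ...)` in pandas you rank within groups. A sketch with invented columns:

```python
import pandas as pd

df = pd.DataFrame({"grp": ["a", "a", "b"], "val": [10, 30, 20]})

# groupby + rank approximates RANK() OVER (PARTITION BY grp ORDER BY val);
# method="min" matches SQL RANK's tie behavior
df["rank"] = df.groupby("grp")["val"].rank(method="min")

print(df["rank"].tolist())  # [1.0, 2.0, 1.0]
```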

**Saving Data**
- PySpark: `df.write.csv("file.csv")`
- pandas: `df.to_csv("file.csv")`
- SQL: `COPY (SELECT * FROM table_name) TO 'file.csv' WITH CSV` (PostgreSQL syntax)
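A round-trip sketch of the pandas row, using an in-memory buffer instead of a real `file.csv` so it is self-contained:

```python
import io

import pandas as pd

df = pd.DataFrame({"col": [1, 2]})

# Write to a buffer (index=False avoids a spurious index column),
# then read it straight back
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
back = pd.read_csv(buf)

print(back.equals(df))  # True
```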

**Performance**
- PySpark: optimized for distributed processing
- pandas: limited by a single machine's memory
- SQL: dependent on the database engine's capabilities

**Handling Null Values**
- PySpark: `df.fillna(value)`, `df.dropna()`, `df.replace(to_replace, value)`
- pandas: the same methods: `fillna`, `dropna`, `replace`
- SQL: `COALESCE`, `NULLIF`, `IS NULL`, `IS NOT NULL`

**Pivoting/Unpivoting Data**
- PySpark: `df.groupBy('col').pivot("column_to_pivot").sum("value")`
- pandas: `df.pivot_table(values='value', index='column', columns='column_to_pivot', aggfunc='sum')`
- SQL: `PIVOT (SUM(value) FOR column_to_pivot IN (...))` (DBMS-dependent)
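The pandas pivot call above, made concrete on a tiny invented long-format frame:

```python
import pandas as pd

df = pd.DataFrame({
    "column": ["r1", "r1", "r2"],
    "column_to_pivot": ["x", "y", "x"],
    "value": [1, 2, 3],
})

# Long-to-wide reshape with a sum aggregate; missing combinations become NaN
wide = df.pivot_table(values="value", index="column",
                      columns="column_to_pivot", aggfunc="sum")

print(wide.loc["r1", "x"])  # 1
```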

**Iterating Over Rows**
- PySpark: not recommended because of its distributed nature; use `map` or `foreach` instead
- pandas: `for index, row in df.iterrows():` (slow for large datasets; prefer vectorized operations)
- SQL: typically not done row by row; handled declaratively via queries
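A sketch contrasting the slow loop with its vectorized equivalent (data invented):

```python
import pandas as pd

df = pd.DataFrame({"col": [1, 2, 3]})

# Row-by-row loop: works, but pays Python-level overhead per row
total = 0
for _, row in df.iterrows():
    total += row["col"]

# Vectorized equivalent: one call, same answer, much faster at scale
assert total == df["col"].sum()
print(total)  # 6
```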

**Performance Optimization**
- PySpark: Catalyst optimizer, Tungsten execution engine
- pandas: N/A; single-threaded execution
- SQL: indexing, query optimization, parallel execution (depends on the DBMS)

**DataFrame/Query Caching**
- PySpark: `df.cache()`
- pandas: N/A; the entire DataFrame is already in memory
- SQL: `CACHE TABLE` or `CREATE INDEX` (DBMS-dependent)

**Schema Management**
- PySpark: supports schema enforcement and evolution via the DataFrame schema
- pandas: dtypes are inferred, but can be set manually
- SQL: managed via `CREATE TABLE` and `ALTER TABLE`

**Data Visualization**
- PySpark: not directly supported, but integrates with libraries such as matplotlib
- pandas: supports matplotlib, seaborn, etc.
- SQL: not supported; data is visualized via external tools such as Tableau or Power BI

**Debugging**
- PySpark: log-based debugging; `explain()` for execution plans
- pandas: direct inspection with `print`, `info()`, `describe()`
- SQL: `EXPLAIN` for query plans; error messages

**File Formats Supported**
- PySpark: CSV, JSON, Parquet, ORC, Avro, etc.
- pandas: CSV, Excel, JSON, HDF5, etc.
- SQL: varies; often requires external tools or extensions for non-tabular formats

**Complex Aggregations**
- PySpark: `df.groupBy("column").agg({"col1": "sum", "col2": "max"})`
- pandas: `df.groupby("column").agg({"col1": "sum", "col2": "max"})`
- SQL: complex aggregations via `GROUP BY`, subqueries
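The pandas dict form, applying a different aggregate to each column, on an invented frame:

```python
import pandas as pd

df = pd.DataFrame({"column": ["a", "a", "b"],
                   "col1": [1, 2, 3], "col2": [5, 9, 7]})

# One dict, one pass: sum col1 and take the max of col2 per group
out = df.groupby("column").agg({"col1": "sum", "col2": "max"})

print(out.loc["a"].tolist())  # [3, 9]
```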

**Window Function Support**
- PySpark: extensive support, e.g. `rank()`, `row_number()`, `lag()`
- pandas: limited support via `rolling`, `expanding`
- SQL: extensive support, e.g. `ROW_NUMBER()`, `RANK()`, `LAG()`
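A sketch of pandas' `rolling`, its closest analogue to a SQL window frame such as `ROWS BETWEEN 1 PRECEDING AND CURRENT ROW` (data invented):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# Moving sum over a 2-row window; the first row has no full
# window yet, so it comes back as NaN
out = s.rolling(window=2).sum()

print(out.tolist())  # [nan, 3.0, 5.0, 7.0]
```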

**Date/Time Functions**
- PySpark: `date_add()`, `date_sub()`, `datediff()`, `to_date()`
- pandas: `pd.to_datetime()`, the `df['date'].dt` accessor
- SQL: `DATEADD`, `DATEDIFF`, `TO_DATE`, `TO_CHAR`
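The pandas pattern in practice: parse strings with `pd.to_datetime`, then use the `.dt` accessor (dates invented):

```python
import pandas as pd

df = pd.DataFrame({"date": ["2024-01-15", "2024-03-01"]})

# Parse strings to datetime64, then pull components via .dt
df["date"] = pd.to_datetime(df["date"])
df["month"] = df["date"].dt.month

print(df["month"].tolist())  # [1, 3]
```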

**Parallel Processing**
- PySpark: yes, across multiple nodes
- pandas: no; single machine
- SQL: typically no, unless the DBMS supports parallel query execution

**Data Source Integration**
- PySpark: reads from HDFS, S3, JDBC, etc.
- pandas: reads from local files, and from databases via connectors
- SQL: reads from database tables
