Pandas vs PySpark vs SQL
|
Example / Operation |
PySpark |
Python - Pandas |
SQL |
|
Data Loading |
df = spark.read.csv("file.csv") |
df = pd.read_csv("file.csv") |
SELECT * FROM table_name |
|
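A minimal runnable sketch of the pandas loading call, reading from an in-memory buffer in place of a real file.csv (the column names and values are invented for illustration):

```python
import io
import pandas as pd

# Stand-in for the contents of "file.csv".
csv_text = "col1,col2\n10,a\n60,b\n80,c\n"

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (3, 2)
```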
Filtering Rows |
df.filter(df['col1'] > 50) |
df[df['col1'] > 50] |
SELECT * FROM table_name WHERE col1 > 50 |
|
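The pandas filtering cell in action, on a toy frame whose values are chosen so the mask keeps two rows:

```python
import pandas as pd

df = pd.DataFrame({"col1": [10, 60, 80], "col2": ["a", "b", "c"]})

# Boolean mask indexing, the pandas counterpart of WHERE col1 > 50.
filtered = df[df["col1"] > 50]
print(filtered["col1"].tolist())  # [60, 80]
```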
Selecting Columns |
df.select('col1', 'col2') |
df[['col1', 'col2']] |
SELECT col1, col2 FROM table_name |
|
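A quick sketch of the pandas column-selection row (toy data):

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4], "col3": [5, 6]})

# Double brackets select a sub-DataFrame, like SELECT col1, col2.
subset = df[["col1", "col2"]]
print(list(subset.columns))  # ['col1', 'col2']
```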
Grouping and Aggregation |
df.groupBy("col1").agg({"col1": "sum"}) Note: agg takes a dictionary mapping columns to aggregate functions |
df.groupby('col1').sum() |
SELECT col1, SUM(col1) FROM table_name GROUP BY col1 |
|
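The pandas grouping cell as a runnable sketch, summing a value column per group (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"col1": ["x", "x", "y"], "value": [1, 2, 3]})

# Sum per col1 group, like SELECT col1, SUM(value) ... GROUP BY col1.
totals = df.groupby("col1")["value"].sum()
print(totals.to_dict())  # {'x': 3, 'y': 3}
```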
Joining Tables/DataFrames |
df1.join(df2, df1.id == df2.id) |
df1.merge(df2, on='id') |
SELECT * FROM table1 JOIN table2 ON table1.id = table2.id |
|
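The pandas merge from the join row, on two toy frames; merge defaults to an inner join, so only the overlapping ids survive:

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3], "a": ["p", "q", "r"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "b": ["u", "v", "w"]})

# Inner join by default, like JOIN ... ON table1.id = table2.id.
joined = df1.merge(df2, on="id")
print(joined["id"].tolist())  # [2, 3]
```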
Adding a New Column |
df = df.withColumn("new_col", df["col"] + 10) |
df['new_col'] = df["col"] + 10 |
ALTER TABLE table_name ADD COLUMN new_col INT; |
|
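The pandas column-assignment cell on a toy frame; the expression is vectorized, so it is evaluated for every row at once:

```python
import pandas as pd

df = pd.DataFrame({"col": [1, 2, 3]})

# Vectorized assignment: adds 10 to every row of col.
df["new_col"] = df["col"] + 10
print(df["new_col"].tolist())  # [11, 12, 13]
```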
Removing a Column |
df = df.drop("col1") |
df = df.drop("col1", axis=1) |
ALTER TABLE table_name DROP COLUMN col1; |
|
Handling Missing Data |
df.fillna(value) |
df.fillna(value) |
UPDATE table_name SET column = value WHERE column IS NULL |
|
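The pandas fillna/dropna cells in a small sketch with one missing value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col": [1.0, np.nan, 3.0]})

filled = df.fillna(0)   # replace NaN with a chosen value
dropped = df.dropna()   # or discard rows containing NaN
print(filled["col"].tolist(), len(dropped))  # [1.0, 0.0, 3.0] 2
```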
Sorting Data |
df.orderBy(df["col"].desc()) |
df.sort_values('col', ascending=False) |
SELECT * FROM table_name ORDER BY column DESC |
|
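The pandas sorting cell on toy data; `ascending=False` mirrors ORDER BY ... DESC:

```python
import pandas as pd

df = pd.DataFrame({"col": [2, 3, 1]})

# Descending sort, like ORDER BY col DESC.
desc = df.sort_values("col", ascending=False)
print(desc["col"].tolist())  # [3, 2, 1]
```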
Window Functions |
from pyspark.sql.functions import rank; from pyspark.sql.window import Window; w = Window.partitionBy("col").orderBy("col2"); df.withColumn('rank', rank().over(w)) |
df['rank'] = df['column'].rank() |
SELECT col1, RANK() OVER (PARTITION BY col1 ORDER BY col2) FROM table_name |
|
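A pandas sketch of the window-function row: a grouped rank approximates RANK() OVER (PARTITION BY grp ORDER BY val). The frame and its column names are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({"grp": ["a", "a", "b"], "val": [10, 20, 5]})

# Rank within each grp partition, ordered by val.
df["rank"] = df.groupby("grp")["val"].rank()
print(df["rank"].tolist())  # [1.0, 2.0, 1.0]
```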
Saving Data |
df.write.csv("file.csv") |
df.to_csv("file.csv") |
COPY (SELECT * FROM table_name) TO 'file.csv' WITH CSV |
|
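The pandas saving cell, writing to an in-memory buffer instead of a real file.csv so the output can be inspected directly:

```python
import io
import pandas as pd

df = pd.DataFrame({"col1": [1, 2]})

# to_csv writes to a path or any file-like object; index=False drops row labels.
buf = io.StringIO()
df.to_csv(buf, index=False)
print(buf.getvalue())
```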
Performance |
Optimized for distributed processing |
Limited by memory |
Dependent on database capabilities |
|
Handling Null Values |
df.fillna(value), df.dropna(), df.replace(to_replace, value) |
df.fillna(value), df.dropna(), df.replace(to_replace, value) |
COALESCE, NULLIF, IS NULL, IS NOT NULL |
|
Pivoting/Unpivoting Data |
df.groupBy('col').pivot("column_to_pivot").sum("value") |
df.pivot_table(values='value', index='column', columns='column_to_pivot', aggfunc='sum') |
PIVOT(SUM(value) FOR column_to_pivot IN (...)) (DBMS dependent) |
|
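The pandas pivot_table cell on a toy long-format frame: one row per index value, one column per pivot value, summed cells:

```python
import pandas as pd

df = pd.DataFrame({
    "column": ["r1", "r1", "r2"],
    "column_to_pivot": ["x", "y", "x"],
    "value": [1, 2, 3],
})

# Wide table: rows = column, columns = column_to_pivot, cells = sum of value.
pivoted = df.pivot_table(values="value", index="column",
                         columns="column_to_pivot", aggfunc="sum")
print(pivoted.loc["r1", "x"])  # 1
```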
Iterating Over Rows |
Not recommended due to distributed nature; use map or foreach |
for index, row in df.iterrows(): ... (slower for large datasets) |
Typically not used in SQL; handled via queries |
|
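A small sketch contrasting row-by-row iteration with the vectorized equivalent it should usually be replaced by:

```python
import pandas as pd

df = pd.DataFrame({"col": [1, 2, 3]})

# Row-by-row iteration works, but is slow on large frames...
total = 0
for _, row in df.iterrows():
    total += row["col"]

# ...so prefer the vectorized equivalent where possible.
assert total == df["col"].sum() == 6
```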
Performance Optimization |
Catalyst optimizer, Tungsten execution engine |
N/A - single-threaded execution |
Indexing, query optimization, parallel execution (depends on DBMS) |
|
DataFrame/Query Caching |
Df.cache() |
N/A - entire DataFrame is in memory |
CACHE TABLE or CREATE INDEX (DBMS dependent) |
|
Schema Management |
Supports schema enforcement and evolution via DataFrame schema |
Schema is inferred; can be set manually |
Managed via CREATE TABLE and ALTER TABLE |
|
Data Visualization |
Not directly supported, but can integrate with libraries like matplotlib |
Supports matplotlib, seaborn, etc. |
Not supported; data is visualized via external tools like Tableau, Power BI |
|
Debugging |
Log-based debugging, explain() for execution plans |
Direct inspection with print, info(), describe() |
EXPLAIN for query plans, error messages |
|
File Formats Supported |
CSV, JSON, Parquet, ORC, Avro, etc. |
CSV, Excel, JSON, HDF5, etc. |
Varies; often requires external tools or extensions for non-tabular formats |
|
Complex Aggregations |
df.groupBy("column").agg({"col1": "sum", "col2": "max"}) |
df.groupby("column").agg({"col1": "sum", "col2": "max"}) |
Complex aggregations via GROUP BY, subqueries |
|
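The pandas complex-aggregation cell run on a toy frame: one aggregate per column, SUM(col1) and MAX(col2) per group:

```python
import pandas as pd

df = pd.DataFrame({"column": ["a", "a", "b"],
                   "col1": [1, 2, 3],
                   "col2": [4, 9, 5]})

# Different aggregate per column within one groupby.
out = df.groupby("column").agg({"col1": "sum", "col2": "max"})
print(out.loc["a"].tolist())  # [3, 9]
```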
Window Functions |
Extensive support, e.g., rank(), row_number(), lag() |
Limited support via rolling, expanding |
Extensive support, e.g., ROW_NUMBER(), RANK(), LAG() |
|
Date/Time Functions |
date_add(), date_sub(), datediff(), to_date() |
pd.to_datetime(), df['date'].dt functions |
DATEADD, DATEDIFF, TO_DATE, TO_CHAR |
|
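The pandas date/time cells in a short sketch: parse strings with pd.to_datetime, then pull date parts through the .dt accessor (dates are made up):

```python
import pandas as pd

df = pd.DataFrame({"date": ["2024-01-01", "2024-03-15"]})

# Parse, then extract the month of each row via the .dt accessor.
df["date"] = pd.to_datetime(df["date"])
print(df["date"].dt.month.tolist())  # [1, 3]
```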
Parallel Processing |
Yes, across multiple nodes |
No, single machine |
No, typically single machine unless parallel query execution is supported |
|
Data Source Integration |
Reads from HDFS, S3, JDBC, etc. |
Reads from local files, databases via connectors |
Reads from various database tables |
|