Posts

Showing posts from November, 2024

Databricks Auto Loader

Databricks Auto Loader: Complete Explanation

In Databricks, Auto Loader is a data ingestion tool designed to handle continuous and scalable data loading from cloud storage (such as AWS S3, Azure Blob Storage, or Google Cloud Storage) directly into Databricks' Delta Lake. It is particularly useful in data engineering and ETL (Extract, Transform, Load) pipelines for automatically managing new data as it arrives.

Key Features of Databricks Auto Loader

- Incremental Data Ingestion: Auto Loader can incrementally ingest new files from a specified source location. It only processes files that have been added or updated, reducing processing time and costs.
- Schema Inference and Evolution: Auto Loader can automatically infer the schema of your data and evolve it over time as the structure of the data changes, reducing manual intervention.
- Highly Scalable and Optimized: Built for high scalability, Auto Lo...
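The ingestion pattern described above can be sketched in PySpark. This is a minimal illustrative configuration, not a complete pipeline: it assumes a Databricks runtime where `spark` is already provided, and the bucket path, schema/checkpoint locations, and table name are placeholders.

```python
# Minimal Auto Loader sketch (Databricks runtime assumed; `spark` is
# provided by the environment, and all paths/names are illustrative).
df = (
    spark.readStream
    .format("cloudFiles")                                 # Auto Loader source
    .option("cloudFiles.format", "json")                  # format of incoming files
    .option("cloudFiles.schemaLocation", "/tmp/schema")   # where the inferred schema is tracked
    .load("s3://my-bucket/landing/")                      # cloud storage path to monitor
)

(
    df.writeStream
    .option("checkpointLocation", "/tmp/checkpoint")      # tracks which files were already processed
    .trigger(availableNow=True)                           # process all available files, then stop
    .toTable("bronze_events")                             # write into a Delta table
)
```

The checkpoint location is what makes the ingestion incremental: on each run, Auto Loader skips files it has already processed.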

Pandas vs PySpark vs SQL

Pandas vs PySpark vs SQL

| Operation | PySpark | Python - Pandas | SQL |
|---|---|---|---|
| Data loading | df = spark.read.csv("file.csv") | df = pd.read_csv("file.csv") | SELECT * FROM table_name |
| Filtering rows | df.filter(df['col1'] > 50) | df[df['col1'] > 50] | SELECT * FROM table_name WHERE col1 > 50 |
| Selecting columns | df.select('col1', 'col2') | df[['col1', 'col2']] | SELECT col1, col2 FROM table_name |
| Grouping and aggregation | df.groupBy("col1").agg({"col1": "sum"}) (note: agg takes a dictionary) | df.groupby('col1').sum() | SELECT col1, SUM(col1) FROM table_name GROUP BY col1 |
| Joining tables/DataFrames | df1.join(df2, df1.id == df2.id) | ... | |
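The Pandas column of the table above can be exercised end to end with a small runnable sketch. The data and column names (`id`, `col1`, `col2`, the second DataFrame `df2`) are illustrative, not from the original post.

```python
import pandas as pd

# Illustrative data for the Pandas operations from the comparison table.
df = pd.DataFrame({"id": [1, 2, 3], "col1": [40, 60, 80], "col2": [1, 2, 3]})

filtered = df[df["col1"] > 50]               # filtering rows: keeps col1 = 60 and 80
selected = df[["col1", "col2"]]              # selecting columns
grouped = df.groupby("col1")["col2"].sum()   # grouping and aggregation

df2 = pd.DataFrame({"id": [1, 2], "city": ["NY", "LA"]})
joined = df.merge(df2, on="id")              # joining DataFrames (inner join on id)

print(len(filtered))           # 2
print(list(selected.columns))  # ['col1', 'col2']
print(len(joined))             # 2
```

`merge` is the pandas counterpart of both the PySpark `join` and the SQL `JOIN`; by default it performs an inner join.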

Cloud Services for Azure and AWS

Cloud Services for Azure and AWS (Amazon Web Services)

Azure and AWS (Amazon Web Services) offer similar services and tools for data engineering, each with its own suite of solutions to build, manage, and optimize data pipelines, storage, and analytics. Here's a comparison of the key Azure data engineering services alongside their equivalent services from AWS:

1. Data Storage Solutions

| Azure Service | AWS Equivalent |
|---|---|
| Azure Blob Storage: a scalable object storage for unstructured data (e.g., images, videos, documents) | Amazon S3 (Simple Storage Service): a highly scalable object storage used for backup, archival, and data lakes |
| Azure Data Lake Storage (ADLS): a specialized version of Azure Blob Storage with hierarchical namespace and enhanced performance for big data analytics | Amazon S3 with Glacier / AWS Lake Formation: Amazon S3 used with AWS Lake Formation to manage large-scale data lakes, with advanced security and governance features |
| Azure SQL Database | Amazon RDS (Relational Dat... |

Azure Data Factory (ADF)

Introduction to Azure Data Factory (ADF)

Azure Data Factory (ADF) is a cloud-based data integration service offered by Microsoft Azure. It provides a powerful platform for creating, scheduling, and managing data workflows that can move and transform data from a wide variety of sources. ADF makes it easier to handle data at scale and automate data-driven processes for cloud-based data engineering. Whether you're managing ETL tasks or building complex data pipelines, ADF is a central tool in modern data integration.

Key Features of Azure Data Factory

Data Integration and Movement

- Connect to various data sources: ADF supports a wide variety of data sources, including cloud platforms (Azure Blob Storage, Azure SQL Database, Amazon S3) and on-premises data sources (SQL Server, Oracle). You can integrate data from virtually any platform into your workflow.
- Data movement: ADF allows for secure data movement across different environments, enabling the synchronization and consolidation...

Understanding RANK() vs. DENSE_RANK()

Understanding RANK() vs. DENSE_RANK() in SQL

The RANK() and DENSE_RANK() window functions are powerful tools in SQL for ranking rows within a result set. This guide explains their differences, particularly in how they handle ties in ranking.

Overview

- RANK(): assigns a rank to each row, leaving gaps in rank values when there are ties.
- DENSE_RANK(): similar to RANK(), but does not leave gaps in the ranking sequence when there are ties.

Syntax

function() OVER (PARTITION BY <column> ORDER BY <column>)

- PARTITION BY (optional): divides the result set into partitions for ranking.
- ORDER BY: specifies the order in which to rank the rows.

Example Dataset

Consider the following employees table:

| Name | Salary |
|---|---|
| Alice | 70000 |
| Bob | 70000 |
| Charlie | 60000 |
| David | 80000 |
| Eva | 60000 |

Example Que...
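The difference between the two functions can be seen by running them on the example dataset above. A runnable sketch using Python's stdlib `sqlite3` (SQLite 3.25+ supports window functions):

```python
import sqlite3

# RANK() vs DENSE_RANK() on the example employees table from the post.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Alice", 70000), ("Bob", 70000), ("Charlie", 60000),
     ("David", 80000), ("Eva", 60000)],
)

rows = conn.execute("""
    SELECT name,
           salary,
           RANK()       OVER (ORDER BY salary DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk
    FROM employees
""").fetchall()

for name, salary, rnk, dense_rnk in rows:
    print(name, salary, rnk, dense_rnk)
# David gets rank 1; Alice and Bob tie at 2; Charlie and Eva then receive
# RANK() = 4 (a gap after the tie) but DENSE_RANK() = 3 (no gap).
```

The tie at 70000 is where the two functions diverge: RANK() jumps from 2 to 4, while DENSE_RANK() continues with 3.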

Mastering SQL: Finding Nth Highest and Nth Minimum Salaries

Finding Nth Highest and Nth Minimum Salary in SQL

Here are SQL queries to find the 2nd highest salary, 2nd minimum salary, Nth highest salary, and Nth minimum salary from an employees table.

In SQL, the OFFSET clause is used in conjunction with the LIMIT clause to control which rows are returned in a query result. Specifically, OFFSET specifies the number of rows to skip before starting to return rows from the query.

- <number> after LIMIT indicates how many rows to return.
- <number> after OFFSET indicates how many rows to skip.

1. 2nd Highest Salary

SELECT DISTINCT salary FROM employees ORDER BY salary DESC LIMIT 1 OFFSET 1;

Explanation: This query sorts salaries from highest to lowest and skips the highest salary to return the second highest.

2. 2nd Minimum Salary

SELECT DISTINCT salary FROM employees ORDER BY salary ASC LIMIT 1 OFFSET 1;

Explanation: This query sorts salaries from lowest to highest and skips the minimum salary to return the second ...
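The LIMIT/OFFSET pattern above can be verified with a quick runnable check using Python's stdlib `sqlite3`; the salary values are illustrative.

```python
import sqlite3

# Verify the 2nd-highest / 2nd-lowest LIMIT/OFFSET queries on sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?)",
                 [(40000,), (50000,), (50000,), (70000,), (90000,)])

second_highest = conn.execute(
    "SELECT DISTINCT salary FROM employees ORDER BY salary DESC LIMIT 1 OFFSET 1"
).fetchone()[0]
second_lowest = conn.execute(
    "SELECT DISTINCT salary FROM employees ORDER BY salary ASC LIMIT 1 OFFSET 1"
).fetchone()[0]

print(second_highest)  # 70000
print(second_lowest)   # 50000
```

For the Nth highest or Nth minimum, the same shape generalizes: keep `LIMIT 1` and use `OFFSET N-1`.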

Window functions in SQL

Mastering SQL Window Functions: A Comprehensive Guide

Window functions in SQL are powerful tools for performing complex calculations across a set of rows related to the current row. Unlike typical aggregate functions that group rows, window functions allow each row to retain its original details. This makes them ideal for analytics and ranking tasks. In this guide, we'll explore some of the most commonly used window functions and how to use them effectively.

1. ROW_NUMBER() - Assigning Sequential Numbers

The ROW_NUMBER() function assigns a unique, sequential integer to rows within a partition (group) of the result set.

Use case: useful for adding unique IDs or ranking rows without ties.

SELECT id, name, value, ROW_NUMBER() OVER (PARTITION BY name ORDER BY value DESC) AS row_num FROM my_table;

2. RANK() - Ranking with Ties Allowed

RANK() assigns a rank to each row within a partition. Rows with the same values get the same rank, but gaps are left in the rankin...
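The ROW_NUMBER() query above can be run as-is against a small table. A runnable sketch with Python's stdlib `sqlite3` (SQLite 3.25+); the table contents are illustrative stand-ins for `my_table`:

```python
import sqlite3

# ROW_NUMBER() partitioned by name, ordered by value descending.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER, name TEXT, value INTEGER)")
conn.executemany("INSERT INTO my_table VALUES (?, ?, ?)",
                 [(1, "a", 10), (2, "a", 30), (3, "b", 20)])

rows = conn.execute("""
    SELECT id, name, value,
           ROW_NUMBER() OVER (PARTITION BY name ORDER BY value DESC) AS row_num
    FROM my_table
""").fetchall()
print(rows)
# Within partition 'a', value 30 gets row_num 1 and value 10 gets row_num 2;
# the single 'b' row gets row_num 1 in its own partition.
```

Note that numbering restarts at 1 inside each partition, which is what makes ROW_NUMBER() useful for per-group IDs.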

Challenges in Data Engineering in Day-to-Day Life

Daily issues faced by data engineers:

Remove duplicates from a large dataset in PySpark?

How would you remove duplicates from a large dataset in PySpark?

1. Load the dataset into a DataFrame:
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)

2. Check for duplicates:
dup_count = df.count() - df.dropDuplicates().count()

3. Repartition the data to optimize performance:
df_repartitioned = df.repartition(100)

4. Remove duplicates using the dropDuplicates() method:
df_no_dup = df_repartitioned.dropDuplicates()

5. Cache the resulting DataFrame to avoid recomputing:
df_no_dup.cache()

6. Save the cleaned dataset:
df_no_dup.write.csv("path/to/cleaned/data.csv", header=True)

Why partition the data in step 3? Repartitioning helps distribute the computation across multiple nodes, making the process more efficient and scalable.

Why cache the resulting DataFrame in step 5? Caching the DataFrame avoids recomputing the entire dataset when saving the cleaned data, which can significantly improve performance.
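The dedup-and-count idea from the steps above can be tried locally with pandas, whose `drop_duplicates` is the single-machine counterpart of PySpark's `dropDuplicates`. The data here is illustrative.

```python
import pandas as pd

# Locally runnable analogue of steps 2 and 4: count duplicates, then drop them.
df = pd.DataFrame({"id":   [1, 2, 2, 3, 3, 3],
                   "name": ["a", "b", "b", "c", "c", "c"]})

dup_count = len(df) - len(df.drop_duplicates())  # how many rows are duplicates
df_no_dup = df.drop_duplicates()                 # keep the first occurrence of each row

print(dup_count)       # 3
print(len(df_no_dup))  # 3
```

As with PySpark, passing a subset of columns (e.g. `drop_duplicates(subset=["id"])`) deduplicates on those columns only rather than on full rows.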