
Interview Q2 - Delta Lake performance issues

Most Delta Lake performance issues come down to 4 things. And most engineers use only 1 or 2 of them correctly. Here's the full breakdown of VACUUM vs OPTIMIZE vs Z-ORDER vs LIQUID CLUSTERING 👇

─────────────────────
What each one does
─────────────────────

VACUUM → Storage cleanup
Removes obsolete files no longer referenced by the Delta transaction log.

OPTIMIZE → File compaction
Compacts many small files into larger files for efficient reads.

Z-ORDER → Query pruning
Co-locates related column values in the same files to improve data skipping.

LIQUID CLUSTERING → Adaptive data layout
Incrementally reorganizes the data layout without rewriting entire tables.

─────────────────────
Decision Map
─────────────────────

Storage Growth → VACUUM
Too Many Files → OPTIM...
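In an interview it helps to show the actual commands. Here is a minimal sketch in Databricks-style Delta SQL, assuming a hypothetical table named customers and a runtime that supports liquid clustering (note that Z-ORDER and liquid clustering are alternatives on a given table, not companions):

-- VACUUM: physically delete files no longer referenced by the
-- transaction log, keeping 7 days (168 hours) of history for time travel.
VACUUM customers RETAIN 168 HOURS;

-- OPTIMIZE: compact many small files into fewer large ones.
OPTIMIZE customers;

-- Z-ORDER: compact AND co-locate rows with similar customer_id values,
-- so filters on customer_id can skip whole files.
OPTIMIZE customers ZORDER BY (customer_id);

-- LIQUID CLUSTERING: declare a clustering key once; later OPTIMIZE runs
-- reorganize the data incrementally instead of rewriting the whole table.
ALTER TABLE customers CLUSTER BY (customer_id);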

Interview Q1 - 1 billion row Delta table

Deleting 5 rows from a 1 billion row Delta table should NOT take 40 minutes. But it does. And here's exactly why.

─────────────────────
The Problem
─────────────────────

You have a 1 billion row Delta table. Business asks you to delete 5 incorrect customer records.

You run:

DELETE FROM customers WHERE customer_id IN ( ... )

Expected runtime: Seconds
Actual runtime: 30+ minutes

Why? Because without Deletion Vectors, Delta must:
→ Read the entire parquet file
→ Remove the rows
→ Rewrite the entire file
→ Update the Delta log

Even if you delete 1 row, the entire file is rewritten. That's millions of rows rewritten for just 5 deletes.

Example:
file1.parquet → 1,000,000 rows, 2 deleted
file2.parquet → 900,000 rows, 1 deleted
→ Delta rewrites 1.9 million rows for 3 deletes.

─────────────────────
The Solution: Deletion Vectors
─────────────────────

Instead of ...
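The fix is usually a single table property. A minimal sketch, again assuming the hypothetical customers table and a recent Delta Lake / Databricks runtime with deletion vector support (the IDs below are made up for illustration):

-- Enable deletion vectors: DELETE now records removed rows in a small
-- sidecar bitmap instead of rewriting whole parquet files.
ALTER TABLE customers
SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true');

-- This delete touches only the deletion vectors and the Delta log;
-- the million-row parquet files stay untouched.
DELETE FROM customers WHERE customer_id IN (101, 202, 303, 404, 505);

The marked rows are physically removed later, when the affected files are eventually rewritten, for example by OPTIMIZE, or on Databricks by forcing it with REORG TABLE customers APPLY (PURGE).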