
ETL in Databricks

November 7, 2025

Introduction

In the era of big data, transforming raw data into meaningful insights is crucial. Databricks, built on Apache Spark, provides a unified platform to perform ETL (Extract, Transform, Load) at scale. In this blog, we’ll explore what ETL is, how it works in Databricks, and walk through a hands-on practical example using a sample dataset.

What is ETL?

ETL stands for:

  • Extract: Retrieve data from various sources (databases, APIs, files).
  • Transform: Clean, filter, join, or reshape the data.
  • Load: Store it into a target system.

Why Use Databricks for ETL?

  • Built on Apache Spark: High-performance distributed computing.
  • Delta Lake Support: ACID transactions and schema enforcement.
  • Notebooks Interface: Easy to code, visualize, and collaborate.
  • Unified Data Platform: Ideal for batch and real-time processing.

ETL in Databricks Using an API (Hands-On)

API We’ll Use:

We’ll use the “Fake Store API” – a public dummy API that returns fake product data in JSON format.
URL: https://fakestoreapi.com/products

STEP 1: Extract – Call API and Load into DataFrame

  • requests: Python library for making HTTP requests.
  • pandas: Used to convert the JSON into tabular format before converting to a Spark DataFrame.
  • SparkSession: Entry point to use Spark APIs (though not used directly here, it’s needed if using Spark DataFrames).
  • requests.get(…): Makes a GET call to the API.
  • The API https://fakestoreapi.com/products returns a JSON array of product objects (like [{…}, {…}]).
  • .json(): Converts the response body (which is JSON) into a list of Python dictionaries.
  • pd.DataFrame(data): Converts the list of dictionaries to a pandas DataFrame. Each product becomes a row, and dictionary keys become columns.


STEP 2: TRANSFORM – Clean & Enrich the Data

Step 2.1 Why this is needed:

  • col() is a Spark function used to reference a column by name when performing transformations.
  • Required for all the .withColumn(…) operations coming next.

Step 2.2 Converts Pandas → Spark:

  • df_pd is a Pandas DataFrame (single-machine memory).
  • createDataFrame(df_pd) converts it into a Spark DataFrame (df_raw), which:
    • Can handle big data
    • Allows distributed processing
    • Can write to Delta tables, Parquet, etc.
  • Now you’re in the world of PySpark.

Step 2.3 Flatten Nested Columns

  • The API contains a nested rating field:
    "rating": { "rate": 3.9, "count": 120 }
  • withColumn(…): Extracts values from nested fields and adds them as new columns.
  • .drop(“rating”): Removes the original nested field.

What’s happening:

  • The rating column is a nested struct (like a dictionary).
  • You extract two new columns:
    • rating_score → from rate
    • rating_count → from count
  • Then you remove the original rating column using .drop(“rating”).

After this step, the nested rating column is gone and the flat rating_score and rating_count columns appear in its place.

Step 2.4 Add a New Column (Derived Value)

What this does:

  • Creates a new column price_with_tax
  • Multiplies the original price by 1.18 (adds 18% tax)
  • You now have:
    • Original price
    • Tax-inclusive price_with_tax

Step 2.5 Rename Columns

  • Makes names more descriptive and analytics-friendly.

Step 2.6 Filter and Categorize

  • Categorizes products into “High” or “Low” based on price.
  • when(…).otherwise(…) is like SQL CASE WHEN.


Step 2.7 Handle Null Values

  • Ensures nulls are replaced with defaults to prevent analytics issues.

STEP 3: LOAD – Save to Delta Table

  • .format(“delta”): Saves using Delta Lake (supports ACID, schema evolution).
  • .mode(“overwrite”): Replaces existing data at the path.

Register Delta Table (for SQL access)

ETL in Databricks vs. No-Code Tools (Comparison)

| Feature | Databricks (Code-Based ETL) | No-Code ETL Tools |
|---|---|---|
| User Interface | Requires coding (Python, SQL, PySpark) in notebooks | Drag-and-drop GUI, visual transformations |
| Flexibility | Very high – supports custom logic, API calls, conditions | Limited to predefined connectors and functions |
| Scalability | Enterprise-grade; handles big data via Spark | Depends on platform; some limited to moderate data volumes |
| Supported Sources | Any API, database, file, or real-time stream | Limited to supported connectors |
| Complex Logic (e.g., loops) | Fully programmable using Python, Spark, or SQL | Limited – hard to implement pagination, conditionals, or looping logic |
| Debugging/Testing | Granular, code-level debugging | Visual, step-by-step testing; limited code visibility |
| Monitoring | Custom logs or integration with Azure Monitor | Built-in UI monitoring, alerts, retry policies |
| Learning Curve | Steep for beginners | Easy for non-developers |
| Version Control | Git integration supported | Versioning may be limited or manual |
| Use Case Fit | Best for complex, large-scale, or unstructured ETL | Best for simple, structured, repeatable ETL tasks |

Pros of ETL in Databricks

| Advantage | Description |
|---|---|
| Full Flexibility | Use Python, SQL, Scala, or R to build custom logic, API connectors, or complex workflows. Ideal for advanced transformations, conditional logic, and dynamic handling. |
| Supports Complex Workflows | Easily implement loops, if-else, retry logic, pagination for APIs, joins, window functions, etc. |
| Scalable and Distributed | Runs on Apache Spark – optimized for large-scale data. Processes billions of rows across multiple nodes. |
| Connects to Anything | Can call any REST API, connect to databases, message queues, blob storage, etc., using Python libraries or Spark connectors. |
| Interactive Development | Notebooks allow step-by-step development, debugging, and visualization in real time. |
| Version Control & DevOps Friendly | Git integration for collaborative development and CI/CD pipelines. |
| ML & Advanced Analytics Ready | Seamlessly integrate machine learning models, SQL analytics, and visualization in the same environment. |
| Enterprise Security | Integration with Azure AD, data lake permissions, Unity Catalog, and data masking. |

Cons of ETL in Databricks

| Limitation | Description |
|---|---|
| Steeper Learning Curve | Requires knowledge of Python, PySpark, SQL, and Spark internals. Not ideal for business users or analysts without a technical background. |
| Cost | Higher operational cost due to cluster startup time, compute resource pricing, and long-running jobs if not optimized. |
| Overkill for Simple Tasks | For basic CSV-to-table jobs, it can be too complex compared to no-code options like Dataflows or Pipelines. |
| Manual Job Orchestration (if not using Workflows) | While Workflows and Job APIs exist, scheduling and monitoring aren't as intuitive as in low-code tools like Azure Data Factory or Microsoft Fabric Pipelines. |
| Delta Format Dependency | Optimized for Delta Lake; exporting to formats like CSV requires explicit conversion steps. |
| Debugging Can Be Tricky | Especially in distributed Spark jobs with complex joins or transformations, tracking errors across partitions can be difficult. |
| Cold Start Delays | Spark clusters take 1–3 minutes to start, which adds overhead for short-running ETL jobs. |

 
