By Tal Sofer, Product Manager at Treeverse

Published on February 28, 2024

In a previous article, we explored how dbt and Databricks work together. We will now review a specific feature of dbt Cloud – dbt Cloud Jobs – and demonstrate how to use Databricks Workflows to achieve the same functionality.

If you are using dbt-core with Databricks as your data platform, this article can give you an overview of how to build your own dbt jobs. Additionally, it offers guidelines on when to opt for dbt Cloud jobs over Databricks workflows and vice versa.

Understanding dbt Cloud Jobs

dbt Cloud Jobs are a solution for executing a collection of dbt commands within a dbt environment. These jobs eliminate the manual execution of dbt commands from the command line, offering a simplified approach to handling data transformations. There are two primary types of dbt Cloud Jobs: Deploy Jobs and Continuous Integration (CI) Jobs, each addressing distinct use cases.

Deploy Jobs 

  • Designed for constructing production data assets.
  • Executed in a production environment with triggers like scheduled runs and execution upon completing another job (job chaining).
  • Run sequentially, ensuring a controlled deployment process.

Continuous Integration (CI) Jobs 

  • Designed for building and testing new dbt code before merging into production.
  • Triggered by opening or updating pull requests in a Git repository linked to a dbt Cloud project.
  • Run in a dedicated environment against a staging database, in parallel, to support effective collaboration.
  • Maintain isolation by using ephemeral schemas with a unique naming convention.

dbt Cloud makes it simple to define both Deploy and CI jobs.

My first dbt Cloud job, Source: dbt Cloud

Databricks dbt jobs

If you’re utilizing dbt-core for your projects instead of dbt Cloud and your data platform is Databricks, you might have wondered, “Can I run dbt-core jobs?” The exciting news is that you absolutely can! And this post will walk you through how to leverage Databricks workflows to achieve that.

Databricks Workflows: What are they? 

Databricks Workflows is a fully managed orchestration service for data processing, machine learning, and analytics pipelines on the Databricks platform. It lets you create jobs that execute diverse tasks: notebooks, Spark submit jobs, SQL queries, Python applications, dbt transformations, and more. A job comprises one or more tasks, and you control both their order and their execution mode: parallel or sequential.

The Power of dbt Tasks

One type of task offered by Databricks is the dbt task, a tool to seamlessly integrate dbt transformations into your workflow. Remember this task type, as it forms the fundamental building block for the dbt-core job we are about to construct. 

Creating Your First Databricks dbt Job

To create a Databricks dbt job, follow Databricks’ guide on incorporating dbt transformations into a Databricks job. The initial setup involves defining a Databricks job containing a single dbt task, as illustrated in the example below. 

Creating your first Databricks dbt job, Source: Databricks

This Databricks job comprises a single dbt task, utilizing the example jaffle_shop dbt project. The task executes three dbt commands (a programmatic sketch of the full task definition follows the list):

  1. deps: Pulls the latest version of the project dependencies from Git.
  2. seed: Loads CSV files located in the seed-paths directory of the example project, ensuring raw data availability for transformations.
  3. run: Executes the compiled SQL model code against the warehouse catalog and schema defined in the job.
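If you prefer to define the job programmatically rather than through the UI, the same setup can be expressed as a payload to the Databricks Jobs API. The sketch below is a minimal, illustrative example: the warehouse ID, cluster ID, catalog, and schema values are placeholders, and the field names reflect the Jobs API 2.1 documentation as I understand it, so verify them against your workspace before relying on them.

```python
import os
import requests

# Workspace URL and a personal access token, read from the environment (placeholders).
DATABRICKS_HOST = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
DATABRICKS_TOKEN = os.environ["DATABRICKS_TOKEN"]

# A minimal job definition with a single dbt task running deps, seed, and run.
job_spec = {
    "name": "jaffle_shop_dbt_job",
    "git_source": {
        "git_url": "https://github.com/dbt-labs/jaffle_shop",  # the example project used above
        "git_provider": "gitHub",
        "git_branch": "main",
    },
    "tasks": [
        {
            "task_key": "dbt_transformations",
            "dbt_task": {
                "commands": ["dbt deps", "dbt seed", "dbt run"],
                "warehouse_id": "<sql-warehouse-id>",  # SQL warehouse that executes the models
                "catalog": "<catalog>",                # target catalog for the transformations
                "schema": "<schema>",                  # target schema for the transformations
            },
            # dbt itself runs on a cluster; the adapter is installed as a task library.
            "existing_cluster_id": "<cluster-id>",
            "libraries": [{"pypi": {"package": "dbt-databricks>=1.7.0"}}],
        }
    ],
}

response = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    json=job_spec,
)
response.raise_for_status()
print("Created job:", response.json()["job_id"])
```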

Building my first Databricks dbt job, I was on a mission to determine whether Databricks jobs could encompass every key feature of dbt Cloud jobs or if dbt Cloud jobs provided significant added value over Databricks dbt jobs. The next section summarizes my conclusions comparing the two types of jobs.

dbt Cloud Jobs and Databricks dbt Jobs: Comparison

Supported use cases

Examining the supported use cases, dbt Cloud offers Deploy and CI jobs. To determine if Databricks jobs offer equivalent support, the comparison is categorized into three criteria:

  • Supported trigger types
  • Job environment control
  • Execution modes 

Supported trigger types

  • Run on Schedule: Databricks jobs can run on a schedule.
  • Execution upon Completing Another Job: Databricks jobs can include multiple tasks with controllable execution order to simulate dbt Cloud job chaining. Execution order is controlled by defining task dependencies.
  • Execution upon Opening or Updating a Pull Request: While Databricks does not provide an official GitHub action tailored for this scenario, you can create custom GitHub actions that use the Databricks CLI. This approach allows you to trigger a Databricks dbt job when a pull request is created or updated within your GitHub repository.
  • By API: Databricks allows job triggering via its REST API (see the sketch after this list).
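To make the last two trigger types concrete, here is a minimal sketch of the call a custom CI step could make to start an existing Databricks dbt job. It uses the Jobs API run-now endpoint directly instead of the Databricks CLI; the job ID and credentials are placeholders you would supply as pipeline secrets.

```python
import os
import requests

# Placeholders: set these as CI secrets/variables in your pipeline.
DATABRICKS_HOST = os.environ["DATABRICKS_HOST"]
DATABRICKS_TOKEN = os.environ["DATABRICKS_TOKEN"]
DBT_JOB_ID = int(os.environ["DBT_JOB_ID"])  # ID of the Databricks dbt job to trigger

def trigger_dbt_job(job_id: int) -> int:
    """Trigger a run of an existing Databricks job and return the run ID."""
    response = requests.post(
        f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
        json={"job_id": job_id},
    )
    response.raise_for_status()
    return response.json()["run_id"]

if __name__ == "__main__":
    run_id = trigger_dbt_job(DBT_JOB_ID)
    print(f"Triggered dbt job {DBT_JOB_ID}, run ID {run_id}")
```

Running a script like this (or the equivalent Databricks CLI command) from a workflow that reacts to pull request events is one way to approximate dbt Cloud CI jobs.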

Job environment control

Databricks jobs can control the dbt-core version, define the job destination, and set the job source code, mirroring dbt Cloud environments. When implementing dbt environments in Databricks jobs, consider the following (a configuration sketch follows the figure below):

  • Control the dbt-core Version: Databricks dbt tasks are associated with a specific dbt-databricks version, influencing the dbt-core version the job runs.
  • Define the Job Destination: You can set the values of the dbt task catalog and schema to establish the destination for the job. This determines the target database and schema for the dbt transformations.
  • Set the Job Source Code: Populate the source for the dbt task by incorporating code from Git providers. Databricks supports different Git ref types, such as branches, tags, or commits, allowing you to specify the source accurately.
How to implement dbt Cloud environments in Databricks, Source: Databricks
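As a rough illustration of how these three environment-like controls map onto a job definition, the fragment below shows where each setting lives in a dbt task spec. The values and version numbers are placeholders and the field names reflect my reading of the Jobs API documentation; a production environment might pin a Git tag, while a staging environment points at a branch.

```python
# Sketch of the environment-related pieces of a Databricks dbt task definition.
prod_like_task = {
    "task_key": "dbt_prod",
    "dbt_task": {
        "commands": ["dbt deps", "dbt build"],
        "catalog": "prod_catalog",   # job destination: target catalog
        "schema": "analytics",       # job destination: target schema
        "warehouse_id": "<sql-warehouse-id>",
    },
    # Controls the dbt-core version indirectly, via the pinned dbt-databricks adapter version.
    "libraries": [{"pypi": {"package": "dbt-databricks==1.7.3"}}],  # illustrative version
    "existing_cluster_id": "<cluster-id>",
}

# Job-level source code definition: branches, tags, or commits are all supported refs
# (only one ref type is set at a time).
git_source_for_prod = {
    "git_url": "https://github.com/<org>/<dbt-project>",
    "git_provider": "gitHub",
    "git_tag": "v1.2.0",        # production: pin to a released tag
    # "git_branch": "main",     # alternative: track a branch
    # "git_commit": "<sha>",    # alternative: pin an exact commit
}
```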

Execution Modes

In dbt Cloud, Deploy jobs run sequentially, while CI jobs can run concurrently with the use of paid offerings. To cater to both the Deploy and CI use cases, it’s essential to have a mechanism for running jobs either concurrently or sequentially.

In Databricks, jobs can run concurrently, with a configurable maximum number of concurrent runs. However, achieving sequential execution is less straightforward. There are two approaches to implementing sequential runs for dbt jobs:

  1. The Hard Way: Develop business logic that utilizes the Databricks jobs API to trigger jobs sequentially.
  2. The Easy Way: Instead of defining multiple Databricks jobs, define a single job with multiple dependent dbt tasks. This simplifies achieving sequential execution for dbt jobs within the Databricks environment (see the sketch after the figure below).
The easy way to implement dbt job chaining in Databricks: one job with dependent dbt tasks, Source: Databricks
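Here is a minimal sketch of "the easy way": a single job whose second dbt task depends on the first, so the two run strictly in sequence, while max_concurrent_runs keeps the job itself from overlapping with a previous run. As before, names, IDs, and model selectors are placeholders, and the field names follow the Jobs API documentation as I understand it.

```python
# One Databricks job, two dbt tasks chained via depends_on (sequential execution).
chained_dbt_job = {
    "name": "dbt_staging_then_marts",
    "max_concurrent_runs": 1,  # avoid overlapping runs of the job itself
    "git_source": {
        "git_url": "https://github.com/<org>/<dbt-project>",
        "git_provider": "gitHub",
        "git_branch": "main",
    },
    "tasks": [
        {
            "task_key": "dbt_staging",
            "dbt_task": {
                "commands": ["dbt deps", "dbt run --select staging"],  # illustrative selector
                "warehouse_id": "<sql-warehouse-id>",
                "catalog": "<catalog>",
                "schema": "<schema>",
            },
            "existing_cluster_id": "<cluster-id>",
            "libraries": [{"pypi": {"package": "dbt-databricks>=1.7.0"}}],
        },
        {
            "task_key": "dbt_marts",
            "depends_on": [{"task_key": "dbt_staging"}],  # runs only after dbt_staging succeeds
            "dbt_task": {
                "commands": ["dbt deps", "dbt run --select marts"],  # illustrative selector
                "warehouse_id": "<sql-warehouse-id>",
                "catalog": "<catalog>",
                "schema": "<schema>",
            },
            "existing_cluster_id": "<cluster-id>",
            "libraries": [{"pypi": {"package": "dbt-databricks>=1.7.0"}}],
        },
    ],
}
```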

Run Details 

Examining run details, both dbt Cloud and Databricks provide comprehensive information. The Databricks dbt job run details closely mirror dbt Cloud’s experience, showing executed dbt commands, status, and facilitating artifact downloads. However, Databricks jobs include additional details from the Spark ecosystem, which might be confusing for dbt-focused users.

dbt Cloud run details, Source: dbt Cloud
Databricks dbt job run details, Source: Databricks

Job monitoring capabilities

Both dbt Cloud and Databricks offer robust job run visibility with detailed run history, and they support sending email notifications and notifications to other systems. While comparing the monitoring capabilities of these two platforms, I discovered a minor feature unique to dbt Cloud—source freshness checks that do not fail the entire dbt job. This functionality is achieved by selecting the source freshness checkbox on a dbt Cloud job definition. If you want to replicate this source freshness check in your Databricks dbt job, you can incorporate the source freshness dbt command. However, it’s essential to note that in Databricks, if the source freshness check fails, the entire job fails.
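For completeness, adding the freshness check to a Databricks dbt task is just a matter of including it in the task's command list, as sketched below; keep in mind that, unlike the dbt Cloud checkbox behavior, a freshness failure here fails the whole job run.

```python
# Illustrative command list for a dbt task that also checks source freshness.
# Note: if `dbt source freshness` fails, the entire Databricks job run fails.
dbt_commands = [
    "dbt deps",
    "dbt source freshness",  # no "soft-fail" option as in dbt Cloud
    "dbt build",
]
```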

Summary

When comparing dbt Cloud jobs to Databricks dbt jobs, it becomes clear that Databricks provides a viable alternative, particularly for executing Deploy jobs. Both options have their pros and cons, outlined below:

dbt Cloud job

Pros:
  • Easy to set up.
  • No ops knowledge or familiarity with Databricks required.
  • Aligns with dbt language and abstractions (e.g., Environment is a job parameter).
  • Encourages best practices (e.g., source freshness and docs generation checks are embedded into job creation).

Cons:
  • Some key job triggers (job chaining and concurrent CI checks) are only available in the paid dbt Cloud offering.

Databricks dbt job

Pros:
  • Implements most dbt Cloud features without paid offerings.
  • Handles dbt specifics well.
  • Allows running jobs with various task types in complex workflows without using an orchestrator.

Cons:
  • Requires familiarity with Databricks.
  • GitHub integration is not supported out of the box.
  • Complexity in implementing certain aspects of dbt Cloud jobs (e.g., Environments).

In conclusion, the following recommendations provide guidance on when to opt for one solution over the other:

dbt Cloud:

  • Prefer dbt Cloud for running CI jobs.
  • Choose dbt Cloud if Databricks familiarity is a limitation.

Databricks dbt jobs:

  • Use Databricks dbt jobs for complex jobs with multiple task types if avoiding additional orchestrators is preferred.
  • A cost-effective choice when not opting for paid dbt Cloud features.
