
Last updated on December 19, 2023

You can ingest data files from external sources using a variety of technologies, from Oracle and SQL Server to PostgreSQL and systems like SAP or Salesforce. When putting this data into your data lake, you might run into the issue of identifying new files and orchestrating processes. This is where Databricks Autoloader helps.

Databricks Autoloader (also spelled Auto Loader) detects new files and stores metadata about processed files in RocksDB at a checkpoint location. It helps you handle late-arriving data and optimize compute resource use. Furthermore, since it employs Structured Streaming, it opens the door to creating a near real-time process to populate databases.

Databricks Auto Loader, which was introduced around the beginning of 2020, has become a key part of ingestion processes across many teams. It offers a highly efficient method for incrementally processing fresh data while also ensuring that each file is handled exactly once.

Keep reading to learn everything you need to know about Databricks Autoloader.

What is Databricks Autoloader?

Image source: Databricks blog

Databricks Auto Loader incrementally and efficiently processes new data files as they arrive in your cloud storage. It can handle billions of files to migrate or backfill a table, as well as the near-real-time intake of millions of files each hour.

It supports the following cloud storage solutions:

  • AWS S3
  • Azure Blob Storage
  • Azure Data Lake Storage Gen2
  • Google Cloud Storage
  • ADLS Gen1
  • Databricks File System

Databricks Auto Loader supports the following file formats: 

  • JSON
  • CSV
  • XML
  • PARQUET
  • AVRO
  • ORC
  • TEXT
  • BINARYFILE

Auto Loader provides a Structured Streaming source called cloudFiles. Given an input directory path on the cloud file storage, the cloudFiles source automatically processes new files as they arrive – with the option of additionally processing existing files in that directory.

In Delta Live Tables, Auto Loader supports both Python and SQL. Databricks recommends using Auto Loader when you’re using Apache Spark Structured Streaming to consume data from cloud object storage.

Another important feature of Auto Loader is that it stores state data at a checkpoint location within the RocksDB key-value store. Because the state is kept at this checkpoint, it can resume where it left off even if it fails, ensuring exactly-once semantics.
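
As a minimal sketch (assuming a Databricks notebook where spark is available, with placeholder paths and table name), a basic Auto Loader stream looks like this; the checkpointLocation is where that RocksDB state lives, so a restarted stream resumes where it left off:

# Minimal Auto Loader stream; all paths and the table name are placeholders.
(spark.readStream
    .format("cloudFiles")                                         # Auto Loader source
    .option("cloudFiles.format", "json")                          # format of the incoming files
    .option("cloudFiles.schemaLocation", "/mnt/schemas/events")   # where the inferred schema is tracked
    .load("/mnt/landing/events")                                  # input directory to monitor
    .writeStream
    .option("checkpointLocation", "/mnt/checkpoints/events")      # RocksDB state is kept here
    .toTable("bronze_events"))                                    # write into a Delta table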

Key Capabilities of Databricks Autoloader

Auto Loader monitors the ingestion process so data is handled only once

Autoloader identifies files and ensures that their metadata is saved in a scalable key-value store (RocksDB) in its pipeline’s checkpoint location. This key-value store guarantees that data is only handled once.

In the event of a failure, Autoloader can resume from the last checkpoint location and continue to give exactly-once assurances while writing data into Delta Lake. To achieve fault tolerance or exactly-once semantics, you don’t need to keep or manage any state yourself.

Autoloader + Delta Live Tables for incremental ingestion

For incremental data ingestion, Databricks suggests using Autoloader in Delta Live Tables. Delta Live Tables enhances the capability of Apache Spark Structured Streaming and allows you to build a production-quality data pipeline with just a few lines of declarative Python or SQL.
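
A short sketch of the Python flavor, assuming a Delta Live Tables pipeline; the landing path and table name are placeholders:

import dlt

@dlt.table(comment="Raw orders ingested incrementally with Auto Loader")
def raw_orders():
    # Delta Live Tables manages the schema and checkpoint locations for this stream.
    return (
        spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/landing/orders")   # placeholder landing path
    )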

Event logs and metrics for automatic monitoring

You don’t need to provide a schema or checkpoint location because Delta Live Tables handles these parameters for your pipelines automatically. 

Other key capabilities of Databricks Autoloader include:

  • Autoscaling computing infrastructure to save resources,
  • Checking data quality against expectations,
  • Handling schema evolution automatically.

Common Databricks Autoloader Patterns

Image: Data ingestion in Delta Lake. Source: Databricks

Auto Loader automates various typical data ingestion operations. Here are some examples of common patterns:

Using glob patterns to filter folders or files

When given a path, glob patterns can be used to filter directories and files.
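
For instance (the bucket and directory layout below are made up), a glob in the load path limits which subdirectories and files the stream picks up:

# Only pick up CSV files under any region's daily/ subfolder; the bucket layout is hypothetical.
df = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.schemaLocation", "/mnt/schemas/sales")
        .load("s3://my-bucket/sales/*/daily/*.csv"))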

Enabling simple ETL

Activating schema inference with Auto Loader and using the pattern below is an easy way to get your data into Delta Lake without losing any of it. Databricks advises running this code in an Azure Databricks job so that your stream restarts automatically when the schema of your source data changes.

With the default settings, the schema is inferred as string types, any parsing errors (there should be none if everything remains a string) go to the rescued data column, and any new columns cause the stream to fail and the schema to evolve.
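
A hedged sketch of that pattern, with placeholder paths and table name:

# Ingest everything as strings into Delta, evolving the table as new columns appear.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/schemas/raw")   # schema inference state
    .load("/mnt/landing/raw")
    .writeStream
    .option("checkpointLocation", "/mnt/checkpoints/raw")
    .option("mergeSchema", "true")                             # let the Delta table pick up new columns
    .toTable("raw_data"))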

Avoiding data loss in well-structured data sets

Databricks advises using the rescuedDataColumn option when you know your schema but want to be alerted whenever unexpected data arrives.
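
A sketch, assuming a known schema and a placeholder input path; the rescuedDataColumn option collects values that don't match the declared schema instead of dropping them:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Expected, well-known schema for the feed (illustrative).
expected_schema = StructType([
    StructField("order_id", IntegerType()),
    StructField("customer", StringType()),
    StructField("amount", StringType()),
])

df = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("rescuedDataColumn", "_rescued")    # unexpected values land here instead of being lost
        .schema(expected_schema)
        .load("/mnt/landing/orders_csv"))           # placeholder path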

Enabling semi-structured flexible data pipelines

When you get data from a third-party vendor, you might not be aware of when they add new columns to the information they give. Or you might not have the capacity to update your data pipeline. 

With schema evolution, you can restart the stream and have Auto Loader update the inferred schema automatically. You can also use schemaHints for some of the “schemaless” fields provided by the vendor.
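
As an illustration (paths and hint values are placeholders), you can combine automatic schema evolution with hints for the fields you do know:

df = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/schemas/vendor_feed")
        .option("cloudFiles.schemaEvolutionMode", "addNewColumns")                     # evolve when the vendor adds columns
        .option("cloudFiles.schemaHints", "amount DECIMAL(18,2), event_ts TIMESTAMP")  # pin types you already know
        .load("/mnt/landing/vendor_feed"))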

Converting nested JSON data

Since Auto Loader infers top-level JSON columns as strings, you may end up with nested JSON objects that require further transformation. You can use the semi-structured data access APIs to work with complex JSON content.
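
For example, assuming a Databricks runtime where the colon syntax for semi-structured data access is available, and with made-up column and field names, you can drill into a JSON string column like this:

# 'payload' is a hypothetical top-level column that Auto Loader inferred as a JSON string.
orders = df.selectExpr(
    "payload:order.id::int   AS order_id",    # extract and cast a nested field
    "payload:order.items     AS items_json",  # keep a nested object as a JSON string
    "payload:customer.email  AS email",
)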

How to Use Databricks Autoloader

Configure Databricks Autoloader 

Here’s how to set up your Databricks Autoloader.

You can customize your Auto Loader setup using the following options, which are common to both file detection techniques: directory listing mode and file notification mode.

Here's a selection of helpful options (a combined example follows the list):

  • cloudFiles.allowOverwrites – determines whether changes to files already in the input directory are allowed to overwrite existing data; it defaults to false.
  • cloudFiles.format – describes the format of the data in the source path; for example, "json" for JSON files, "csv" for CSV files, and so on.
  • cloudFiles.includeExistingFiles – determines whether files already present in the stream's input path are included, or only new files that arrive after initial setup; it defaults to true.
  • cloudFiles.inferColumnTypes – determines whether exact column types are inferred when schema inference is used; it defaults to false, in which case columns are inferred as strings.
  • cloudFiles.maxBytesPerTrigger – specifies the maximum number of bytes processed by Auto Loader in each trigger.
  • cloudFiles.maxFileAge – specifies how long an event is tracked for deduplication purposes; it's typically used when ingesting millions of files per hour at a high rate.
  • cloudFiles.resourceTags – key-value pairs that help identify the related resources.
  • cloudFiles.schemaEvolutionMode – defines the mode for schema evolution, such as what happens when new columns are discovered in the data.
  • cloudFiles.schemaHints – schema information that you give Auto Loader to supplement or override what it infers.
  • cloudFiles.schemaLocation – indicates where the inferred schema and subsequent updates are stored.
  • cloudFiles.validateOptions – determines whether the Auto Loader options you have set are validated.
  • cloudFiles.backfillInterval – the file notification mode does not guarantee 100% delivery of uploaded files, so backfills can be used to guarantee that all files are eventually processed; this option sets the interval at which backfills are triggered.
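
To make these options concrete, here is a hedged sketch that combines a few of them; every path and value shown is a placeholder:

df = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.includeExistingFiles", "true")   # also pick up files already in the path
        .option("cloudFiles.inferColumnTypes", "true")       # infer real types instead of all strings
        .option("cloudFiles.maxBytesPerTrigger", "10g")      # soft cap on bytes per micro-batch
        .option("cloudFiles.schemaLocation", "/mnt/schemas/clicks")
        .option("cloudFiles.validateOptions", "true")        # fail fast on misconfigured options
        .load("/mnt/landing/clicks"))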

Set up Autoloader File Notification Mode

In file notification mode, Auto Loader automatically configures notification and queue services that subscribe to file events from the input directory.

You can use file alerts to scale Autoloader to consume millions of files each hour. The file notification mode is more performant and scalable than the directory listing mode for large input directories or a high volume of files, but it requires extra cloud permissions.

You can toggle between file alerts and directory listings at any time while still ensuring that data is processed precisely once.

Note: Changing the source path is not supported for Auto Loader in file notification mode. If you change the path anyway, you might fail to ingest files that already exist in the new directory at the time of the update.

The Autoloader file notification mode uses cloud resources. To automatically provision the cloud infrastructure for it, you need to have a higher level of permissions. 

Auto Loader can set up the file notification services for you automatically. All you need to do is set cloudFiles.useNotifications to true and provide the permissions required to create cloud resources. You might also need to specify extra options so Auto Loader can create those resources.
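
A sketch for AWS, where the bucket, region, and paths are placeholders; cloudFiles.region tells Auto Loader where to create the notification resources:

df = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.useNotifications", "true")             # switch from directory listing to file notifications
        .option("cloudFiles.region", "us-east-1")                  # AWS region for the notification resources (placeholder)
        .option("cloudFiles.schemaLocation", "/mnt/schemas/events")
        .load("s3://my-bucket/events/"))                           # placeholder bucket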

Notification events for files

When a file is added to an S3 bucket, AWS S3 generates an ObjectCreated event regardless of whether it was submitted using a put or a multi-part upload. 

On Azure Data Lake Storage Gen2, Auto Loader listens for the FlushWithClose event to process a file, and its streams support the RenameFile action for discovering files; determining the size of a renamed file requires an API request to the storage system.

Ingest CSV, JSON, or Image Data with Databricks Autoloader

Auto Loader supports data ingestion for a variety of file formats, including:

  • JSON
  • CSV
  • PARQUET
  • AVRO
  • TEXT
  • BINARYFILE
  • ORC

You can use the following code to ingest CSV data with Auto Loader:

spark.readStream.format("cloudFiles").option("cloudFiles.format", "csv")

For ingesting JSON files, set the format option to "json" instead of "csv".

Scheduled Batch & Streaming Loads

You may start using Databricks Autoloader’s features for Streaming Jobs with the following:

spark.readStream.format("cloudFiles").option("cloudFiles.format", "json").load("/input/path")

In this example, Databricks Auto Loader creates a cloudFiles source that expects JSON files in the given input directory path, which is continuously checked for new files, comparable to configuring any other streaming source.

For example, if data arrives at regular intervals, like every few hours, you can use Auto Loader in a scheduled job and lower your operating costs by taking advantage of Structured Streaming's Trigger Once mode.
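
As a sketch (paths and the table name are placeholders), pairing Auto Loader with a one-shot trigger in a scheduled job processes whatever has arrived since the last run and then stops:

# Scheduled, batch-style ingestion: process everything that has arrived, then shut down.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/schemas/daily_feed")
    .load("/input/path")
    .writeStream
    .option("checkpointLocation", "/mnt/checkpoints/daily_feed")
    .trigger(availableNow=True)                                    # or trigger(once=True) on older runtimes
    .toTable("bronze_daily_feed"))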

Benefits of Using Databricks Autoloader

Cost-efficiency

The cost of finding files using Autoloader scales with the number of files ingested rather than the number of directories into which the files may fall. 

Scalability

Autoloader can quickly discover billions of files. Asynchronous backfills can be conducted to prevent wasting computation resources.

Support for schema inference and evolution

Autoloader can identify schema drifts, inform you when schema changes occur, and recover data that might otherwise be ignored or lost. 

Cloud native 

Autoloader uses cloud native APIs to get lists of files in storage.

Cost savings on cloud storage

Furthermore, Autoloader’s file notification mode might help you save even more money on cloud storage by bypassing directory listing entirely. 

Finding files is cost-effective

Autoloader can set up file notification services on storage automatically, making file finding significantly less expensive.

Conclusion

Databricks Autoloader is a solid data ingestion tool that offers a versatile and dependable method for dealing with schema changes, data volume fluctuations, and recovering from job failures.

If you work with massive amounts of data and load data files incrementally, you must consider the long-term upkeep of your pipeline. It might be tempting to improvise a solution to address current demands, but a more solid strategy will pay off in the long run. Using Databricks Autoloader, you can create a scalable, reliable, and stable data ingestion pipeline.

For more insights about data pipeline management, take a look at this article: ETL Testing: A Practical Guide.
