Over the past five years, many new concepts and tools in the data ecosystem have helped bring engineering best practices to data work. This trend includes the data mesh, data quality testing, observability, and data monitoring.
The practices we would like to borrow from software engineering and use in data engineering and data science revolve around data lifecycle management from development to production. To do that, we first need the ability to easily get an isolated version of the data that we can work with. It’s the basis for any of those best practices.
Data formats, databases, and data version control systems provide the ability to isolate a copy of the data, each with their own terminology, as often happens when a new concept emerges. Whether you are using a shallow copy, a shallow clone, or a branch, you want to achieve the same thing: isolation.
Keep reading to learn more about the concept of the data shallow copy and dive into use cases from Databricks Delta Lake, Snowflake, Google BigQuery, Apache Iceberg, and lakeFS.
Let’s get the terminology right
Shallow copy vs. deep copy – do they mean the same thing?
A shallow copy duplicates the original element but doesn’t replicate any elements that the original element references.
A deep copy duplicates not only the original element but also any elements that the original element references. The copy operation produces a new instance of the referenced object for each reference in the original element, copies the contents of the referenced object into the new instance of the object, and stores a reference to the new instance of the object in the new element.
For example, if the element is a table – either a Delta Lake or a Snowflake table – a shallow copy will create a version of the table but will not copy the data itself; it will only point to it. A deep copy, on the other hand, will also copy the data, creating a complete copy of the table in the storage.
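The same distinction shows up with in-memory objects, and Python's `copy` module makes it easy to see. Here is a toy analogy (not warehouse code): the "table" is a dict whose metadata points at a list of data files.

```python
import copy

# A "table" whose metadata references a list of data files.
table = {"name": "events", "files": ["part-0.parquet", "part-1.parquet"]}

shallow = copy.copy(table)      # new top-level object, shared references
deep = copy.deepcopy(table)     # recursively copies referenced objects too

# A change to the shared "data" is visible through the shallow copy only.
table["files"].append("part-2.parquet")

print(shallow["files"])  # ['part-0.parquet', 'part-1.parquet', 'part-2.parquet']
print(deep["files"])     # ['part-0.parquet', 'part-1.parquet']
```

The shallow copy still points at the original list, just as a shallow copy of a table points at the original data files; the deep copy owns a fully independent replica.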
A concept similar to a shallow copy is a branch. If you consider your data lake a repository, a branch is a shallow copy of all the datasets in that repository. The term should be intuitive to users of source control systems like Git, but in our discussion it refers to datasets rather than source code. Apache Iceberg, Project Nessie, and lakeFS all use this terminology. The last two differ from Iceberg in functionality, but the basic idea of a shallow copy applies to all three.
Shallow clone – is this term related?
You may have heard another term: shallow clone. What is it, and how does it differ from shallow copy?
A shallow clone is a concept found in the version control system Git. Git shallow clone allows you to get only the most recent commits rather than the entire repository history. It doesn’t matter if your project has years of history and thousands of commits – you can use it to choose a given depth to pull.
Note: The term “table clone” or “table shallow clone” is used in data to refer to a shallow copy, which might be confusing as it’s not directly related to Git’s shallow clone.
When it comes to data, shallow copies, shallow clones, and clones all refer to the same idea, and in some cases they are even called branches. As long as we understand that the terms are interchangeable, we're good.
Shallow copy of a table: differences in terminology and use cases
Delta clone: Delta Lake by Databricks
In Databricks, a Delta clone clones a source Delta table to a destination Delta table at a given version. A clone can be deep or shallow: deep clones copy the data from the source, whereas shallow clones don’t.
A deep clone replicates the source table's data as well as its metadata to the clone destination. Stream metadata is also cloned, so a stream that writes to the Delta table can be paused on the source table and resumed from where it left off on the clone destination.
In a shallow clone, the data files aren't copied to the clone target; only the table metadata is duplicated, and the clone continues to reference the data files in the source directory.
No wonder shallow clones are cheaper to produce. Deep clones are independent of the source they were copied from, but they are costly to make since they replicate both the data and the metadata.
Typical use cases for Delta clones include:
- Data archiving
- Machine learning model replication
- Short-term tests on a production table
- Data exchange
- Override table properties
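In Databricks SQL, both kinds of clone are single statements. A sketch of the syntax (table names are illustrative, and exact options vary by runtime version):

```sql
-- Deep clone: copies both metadata and data files; independent of the source.
CREATE TABLE prod.events_archive DEEP CLONE prod.events;

-- Shallow clone at a specific version: metadata only; data files are
-- still read from the source table's directory.
CREATE TABLE dev.events_experiment SHALLOW CLONE prod.events VERSION AS OF 10;
```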
Zero Copy Cloning: Snowflake
One of the most challenging aspects of conventional databases is transferring database objects from one environment to another. It calls for careful preparation, significant storage expenses, and long waits for the full procedure to complete.
This is where Zero Copy Cloning comes in. This Snowflake feature streamlines the process of duplicating your data without incurring any additional storage expenses or excessive wait times.
It allows you to quickly and easily build a duplicate of any table, schema, or whole database without incurring any additional expenses because the derived copy shares the underlying storage with the original object.
The most powerful aspect of Zero Copy Cloning is that the cloned and original items are independent of each other, which means that any modifications made to one object don’t affect the other. Until you modify it, the copied item uses the same storage as the original. This may be quite handy for creating backups that are free until the copied item is altered.
Any modifications made to a cloned snapshot, on the other hand, generate extra storage components, resulting in higher expenditures.
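In SQL, a clone is a single statement. A hedged sketch (object names are illustrative; the `AT` clause combines cloning with Snowflake time travel):

```sql
-- Clone a table: new metadata over the same micro-partitions, no data copied.
CREATE TABLE orders_dev CLONE orders;

-- Clone an entire database as it looked one hour ago (3600 seconds).
CREATE DATABASE analytics_backup CLONE analytics AT (OFFSET => -3600);
```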
How does Zero Copy Cloning work?
Snowflake automatically divides all table data into micro-partitions, the smallest contiguous units of storage, each holding 50 MB to 500 MB of uncompressed data.
When a database object is cloned, Snowflake generates new metadata information referring to the original source object’s micro-partitions rather than producing duplicates of existing micro-partitions. There is no need for user involvement because Snowflake’s Cloud Services Layer handles all of these processes.
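The mechanics can be sketched in a few lines of Python. This is a deliberately simplified model, not Snowflake's implementation: a clone starts as new metadata referencing the same partitions, and only a write creates new partitions for the clone.

```python
# Toy model of zero-copy cloning: tables are metadata (lists of partition ids)
# over a shared partition store; cloning copies only the metadata.
store = {"p1": [1, 2], "p2": [3, 4]}      # partition id -> rows
original = {"partitions": ["p1", "p2"]}   # table metadata

# Clone: new metadata referencing the SAME partitions -- no data copied.
clone = {"partitions": list(original["partitions"])}

def write(table, rows):
    """Copy-on-write: a modification adds a new partition for this table only."""
    pid = f"p{len(store) + 1}"
    store[pid] = rows
    table["partitions"].append(pid)

write(clone, [5, 6])                      # only the clone's metadata grows

print(original["partitions"])  # ['p1', 'p2']
print(clone["partitions"])     # ['p1', 'p2', 'p3']
```

Until `write` is called, the clone consumes no extra storage; afterwards, only the changed partition is new, which is exactly why unmodified clones are free.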
Which objects can Snowflake Zero Copy Cloning support?
Data containment objects:
- Databases
- Schemas
- Tables
Data configuration and transformation objects:
- Stages
- File formats
- Sequences
- Streams
- Tasks
Table Clones: Google Cloud BigQuery
A Google Cloud BigQuery table clone is a lightweight, writable copy of another table (known as the base table). Users are only charged for the data in the table clone that differs from the base table, so initially a table clone has no storage cost. Aside from the billing model for storage and some extra metadata about the base table, a table clone behaves like a regular table: it can be queried, copied, deleted, and so on.
When you create a table clone, it becomes independent of the original table. Changes made to the base table or clone are not reflected in the other.
If you require read-only, lightweight replicas of your tables, you can use table snapshots.
Typical use cases for table clones include:
- Creating duplicates of production tables for development and testing.
- Creating sandboxes that enable users to develop their own analytics and data manipulations without having to physically replicate all of the production data. Only modified data is charged.
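In BigQuery's SQL dialect, both clones and snapshots are created with a `CLONE` clause. A sketch (dataset and table names are illustrative):

```sql
-- A writable clone: billed only for data that later diverges from the base.
CREATE TABLE mydataset.events_dev
CLONE mydataset.events;

-- A read-only snapshot, by contrast, optionally taken at a past timestamp.
CREATE SNAPSHOT TABLE mydataset.events_snap
CLONE mydataset.events
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
```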
Branch: Apache Iceberg
In Apache Iceberg, the closest concept to a table clone is a branch. Iceberg keeps a snapshot log in the table metadata that records every change made to a table, and snapshots serve as the foundation for reader isolation and time travel.
Iceberg offers branches and tags, which are named references to snapshots of tables with their own separate lifecycles. This comes in handy for more comprehensive snapshot lifespan management.
Branches are discrete lineages of snapshots that point to the lineage’s head. They contain retention attributes that specify the minimum number of snapshots to keep on the branch and the maximum age of individual snapshots to keep on the branch.
Typical use cases include:
- You can use branching and tagging to meet GDPR standards while also keeping critical historical snapshots for audits.
- Users can also create experimental branches for testing and verifying new tasks as part of data engineering procedures.
- Historical tags can be used to preserve important historical snapshots for auditing purposes.
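With Iceberg's Spark SQL extensions, branches and tags are created through `ALTER TABLE`. A sketch (table and reference names are illustrative; the retention clauses are optional):

```sql
-- A branch with its own retention policy.
ALTER TABLE catalog.db.events CREATE BRANCH `audit-branch`
  RETAIN 30 DAYS
  WITH SNAPSHOT RETENTION 5 SNAPSHOTS;

-- A tag pinning a snapshot for long-term audits.
ALTER TABLE catalog.db.events CREATE TAG `eoy-2023` RETAIN 365 DAYS;
```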
A branch of multiple datasets, in any format
Here, a branch is a real branch, like what one would expect when thinking about Git, but for datasets.
lakeFS, an open-source data version control system, combines principles from object storage such as S3 with Git concepts, including branches. However, instead of working with code, these branches work with data.
In lakeFS, branches enable users to create their own “isolated” view of the repository. Changes made on one branch are not reflected on other branches. And once they’re done with the work, users can apply modifications from one branch to another by merging them.
Branches work as a reference to a commit as well as a set of uncommitted modifications. Note that the initial creation of a branch is a metadata operation that doesn’t duplicate objects.
A typical workflow looks like this:
- To begin, create a new branch from main to make an immediate “copy” of your production data.
- Next, make modifications or updates on the isolated branch to understand their impact before exposing them to consumers.
- Finally, merge the feature branch back to main to atomically promote the improvements to production.
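The three steps above can be modeled with a toy version of branch metadata. This is illustrative Python only, not the lakeFS implementation (real work goes through the lakeFS API or lakectl): a branch is just a pointer to a commit, so creating one copies no objects.

```python
# Toy model: a branch is a pointer to a commit; commits map paths to versions.
commits = {"c1": {"users.csv": "v1"}}   # commit id -> object listing
branches = {"main": "c1"}

def create_branch(name, source):
    # Metadata-only operation: copy the commit pointer, not the objects.
    branches[name] = branches[source]

def commit(branch, changes, cid):
    # A new commit is the parent snapshot plus the staged changes.
    snapshot = dict(commits[branches[branch]])
    snapshot.update(changes)
    commits[cid] = snapshot
    branches[branch] = cid

def merge(source, dest, cid):
    # Simplified merge: apply the source branch's snapshot onto dest.
    commit(dest, commits[branches[source]], cid)

create_branch("dev", "main")                  # 1. isolated "copy" of prod
commit("dev", {"users.csv": "v2"}, "c2")      # 2. change data on the branch
assert commits[branches["main"]]["users.csv"] == "v1"   # main is unaffected
merge("dev", "main", "c3")                    # 3. promote atomically
print(commits[branches["main"]]["users.csv"])  # v2
```

Note how `create_branch` touches only pointers, which is why branch creation is instantaneous regardless of how much data the repository holds.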
CI/CD for data
You can connect data quality checks to commit and merge processes using lakeFS hooks and make sure that validation checks are performed at each stage where necessary.
Isolated dev/test environments
lakeFS lets you create isolated dev/test environments for ETL testing in an instant and at a low cost thanks to copy-on-write. You can test and validate changes without affecting production data, as well as execute analysis and experimentation on production data in an isolated clone.
Reproducibility and troubleshooting
Looking at data as it was at a certain point in time is valuable in at least two scenarios: ML experiments and troubleshooting production errors. With lakeFS, you can version all components of an ML experiment, including the data, and use copy-on-write to keep the storage footprint of data versions small.
Since data changes constantly, it is difficult to reconstruct its state at the moment an error occurred. To isolate a problem for faster troubleshooting, you can use lakeFS to create a branch from the relevant commit.
Rollback of data changes
Human error, misconfiguration, or broad-scale systemic impacts happen all the time. With lakeFS, rescuing data from deletion or corruption events becomes an immediate one-line operation: simply locate a suitable historical commit and then restore to or copy from it.
Regardless of what it’s called – shallow copy, branch, time travel, or Delta clone – the concept is incredibly useful to data practitioners.
Branches help to maintain stability in the data pipeline while allowing team members to apply isolated modifications to the code and test their impact before merging. Branching makes it easier to create problem fixes, add new features, troubleshoot problems, and integrate new versions once they have been tested in isolation.
Explore more use cases of the branching mechanism that is part of lakeFS at every step of the way: in test, during deployment, and in production.