Israeli startup Treeverse is developing dataset copy management and version control for data pipeline builders with its open-source Lakefs product.
Analytics and AI/ML data supply pipelines depend upon consistent, repeatable and reliable delivery of clean data sets extracted from source data lakes. Such pipelines are equivalent to software programs and they take effort and time to develop and test. The testing effort requires datasets on which the pipeline operate and these are mostly copies of a source dataset. If the source dataset has a snapshot, copies made then can be used to create virtual copies, timestamped sets of pointers to a source dataset that can be used for pipeline development.