As companies race to adopt AI, many firms in highly regulated fields such as Healthcare, Financial Services, and Defence risk being left behind. How can they innovate and move fast while still satisfying the demands of regulators?
Companies that shy away from this new technology will almost certainly be left behind their competitors, but early adopters run the risk of having to pay exorbitant fines for regulatory violations.
The solution to this quandary lies in finding the right mix of software tools: one that lets teams adopt the latest AI technologies while automating the process of fulfilling regulatory requirements. In this context, lakeFS has emerged as a key technology helping industry leaders adopt AI successfully.
Data Quality
Data quality is a key concern in highly regulated industries. Consider, for example, the datasets used to train a CNN for computer vision in autonomous cars. Computer vision experts need to ensure that the model is trained on a wide variety of cases so that it can recognise pedestrians, pets, and other vehicles. Defects in data quality are a major safety concern and can result in loss of life and damage to property.

The granular version control that lakeFS provides allows analytical teams to track exactly which version of a dataset was used to train each model. Models trained on data later found to be faulty or incomplete can be flagged and replaced with models built on corrected, fully traceable data.
Furthermore, many data versioning systems cannot scale to the large video and image datasets required for robust CNN training. lakeFS is engineered from the ground up to support the largest datasets with lightning-fast performance.
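To make this concrete, here is a minimal sketch of how a training pipeline might pin its input data to an immutable lakeFS commit and record that commit alongside the trained model. It uses the high-level lakefs Python SDK; the repository, branch, and path names are hypothetical, and exact method names may vary between SDK versions, so treat it as illustrative rather than definitive.

```python
import lakefs  # high-level lakeFS Python SDK: pip install lakefs

# Hypothetical repository holding the computer-vision training data.
repo = lakefs.repository("cv-training-data")
branch = repo.branch("main")

# Resolve the branch head to an immutable commit ID and store it with the
# model's metadata, so the exact training snapshot can always be retrieved.
commit_id = branch.get_commit().id
model_metadata = {"dataset_commit": commit_id}
print(f"Training against dataset snapshot {commit_id}")

# Later: read the training manifest from that commit, not the moving branch head.
with repo.ref(commit_id).object("manifests/train.jsonl").reader(mode="r") as f:
    manifest = f.read()
```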
Reproducibility
Reproducibility is another requirement in highly regulated industries. In the insurance industry, regulators periodically audit how a particular set of quotes was computed, for the sake of legal compliance. For instance, consider a GBM model used to compute actuarial risk, a key input to policy pricing. The regulator suspects that race or gender is being illegally considered in the pricing of policies and asks for evidence of how the model was trained. A complete data lineage and history of the modeling process is necessary to replay the training process to the regulator's satisfaction.
lakeFS provides full traceability of the model development process by recording the history of all changes within an object store (S3, MinIO, and many more). The branch-and-commit structure that lakeFS provides maps well onto the non-linear process of model training, as different datasets and hyperparameters are tried and the performance of the resulting models is compared. In this way, it is easy to ‘jump’ to any stage of the training process and reproduce the result.
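Because every commit is an immutable reference, an audit replay can read the data exactly as it was at training time. The sketch below does this through the lakeFS S3-compatible gateway with plain boto3, where objects are addressed as <repository>/<branch, tag, or commit>/<path>. The endpoint, credentials, repository, and commit ID shown are placeholders.

```python
import boto3

# lakeFS exposes an S3-compatible endpoint; point boto3 at it instead of AWS S3.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",  # placeholder lakeFS endpoint
    aws_access_key_id="AKIA...",                # lakeFS access key (placeholder)
    aws_secret_access_key="...",                # lakeFS secret key (placeholder)
)

# The repository acts as the "bucket"; the key starts with a ref.
dataset_commit = "c2f1a9e..."  # commit ID recorded when the model was trained (placeholder)
obj = s3.get_object(
    Bucket="pricing-models",
    Key=f"{dataset_commit}/training/quotes_2023.parquet",
)
training_data = obj["Body"].read()  # identical bytes to what the model saw at training time
```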
Performance at Scale
One of the technical challenges for regulated industries is governance at scale. Many tools provide either computational scale or governance, but not both. lakeFS stands out in this regard because it operates at the metadata level, letting big data tools provide the computational scale. Integrations like the lakeFS Hadoop FileSystem make this seamless for Data Engineering teams. For Machine Learning teams that want high-performance model training, lakeFS offers rich Python support and can map datasets seamlessly onto the local filesystem for lightning-fast analytics at scale. In this way, lakeFS can perform for the largest datasets in Aerospace, Banking, Health-Tech, and many more.
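In practice this means existing Python data tools keep doing the heavy lifting while lakeFS only resolves references. For example, pandas (with s3fs installed) can read a large Parquet dataset directly from a lakeFS branch through the S3-compatible gateway; the endpoint, repository, and path below are illustrative.

```python
import pandas as pd  # requires pandas and s3fs

# Read straight from a lakeFS branch through the S3-compatible gateway.
# lakeFS serves only metadata and object locations; the data transfer and the
# computation happen in the same tools you would use against plain S3.
df = pd.read_parquet(
    "s3://flight-telemetry/main/silver/sensor_readings/",  # <repo>/<branch>/<path>
    storage_options={
        "key": "AKIA...",    # lakeFS access key (placeholder)
        "secret": "...",     # lakeFS secret key (placeholder)
        "client_kwargs": {"endpoint_url": "https://lakefs.example.com"},
    },
)
print(df.shape)
```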
Governance
AI governance is particularly difficult in large organizations because many teams build models over the same datasets. As each team prepares the data in the way that suits it, the organization ends up with numerous copies and permutations of the same information. This is a nightmare for GDPR compliance and other data governance concerns.
For example, organizations holding health data provide very limited access to patients’ personal health information. AI teams are generally required to work with anonymized or even synthetic data. How can these datasets and their permissions be managed effectively?
lakeFS provides a rich language of visibility and permissions on top of its commit-and-branch model. Branches give teams a private “view” of the data without copying it all over the place. Separate read and write permissions can then be defined so that employees can only access the data they are allowed to see. And when it comes to retention, if we are required to delete a dataset, we can be sure that all of its copies have been deleted.
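For example, a data platform team might give each AI team its own zero-copy branch as a private, permission-scoped view of the approved anonymized data. Below is a minimal sketch using the lakefs Python SDK; the repository and team names are hypothetical, and the read/write permissions themselves are configured in lakeFS’s access-control settings rather than in this code.

```python
import lakefs  # high-level lakeFS Python SDK

repo = lakefs.repository("patient-data-anonymized")  # hypothetical repository

# Create a zero-copy branch per AI team: each team gets a stable, private view
# of the anonymized data without duplicating a single object.
for team in ["oncology-ml", "radiology-ml", "claims-ml"]:
    repo.branch(f"view-{team}").create(source_reference="main")
```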
Team Collaboration
Large enterprises are divided into many smaller teams that must collaborate to build solutions together. Regulated industries must control this collaboration for the sake of privacy, safety, and other legal requirements.

Consider, for example, an aerospace company developing models for the analysis of aerial photography. It has thousands of employees developing models for identifying objects, aggregating maps from sets of photos, and analyzing time series of images. These models are reused across different types of cameras and platforms: satellites, high-altitude photography, and low-altitude drone footage. Managing these diverse datasets effectively across the organization is a considerable challenge. lakeFS provides a rich framework for teams to manage and share their data and models effectively.
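A common collaboration pattern this enables is branch–commit–merge on data, mirroring how code is shared: a team experiments on an isolated branch, commits the result with descriptive metadata, and merges it back once reviewed. A rough sketch with the lakefs Python SDK follows; the repository, branch, and metadata names are hypothetical, and method names may differ slightly between SDK versions.

```python
import lakefs  # high-level lakeFS Python SDK

repo = lakefs.repository("aerial-imagery")  # hypothetical repository
main = repo.branch("main")

# 1. Isolate the experiment on its own zero-copy branch.
experiment = repo.branch("object-detection-v2").create(source_reference="main")

# 2. ...the team writes new labels and model artifacts to the branch here...

# 3. Commit with metadata that auditors and other teams can search later.
experiment.commit(
    message="Retrain object detector on Q3 satellite imagery",
    metadata={"camera": "satellite", "model": "object-detection", "owner": "team-geo"},
)

# 4. Merge the reviewed changes back so other teams can reuse them.
experiment.merge_into(main)
```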
lakeFS as a Key Infrastructure for AI
The world is at a turning point, with every industry racing to leverage AI. The winners in this race will win big, while the losers will go home with nothing. For highly regulated industries such as Healthcare, Autonomous Vehicles, and Financial Services, innovating in this space is a particularly difficult challenge. lakeFS is a key technology empowering organizations to innovate confidently in today’s rapidly evolving AI landscape.


