Guy Hardonag

Last updated on September 1, 2024

Data is the core of any ML project, and the volume and quality of that data are the most important factors in training algorithms that deliver quality output.

The problem is that all datasets are faulty to some extent. That’s why taking time and effort to prepare data is a good idea. Data preparation is a critical phase in the machine learning process, but it’s just one of the many steps required in the production of machine learning datasets.

Keep reading to learn the best practices for preparing machine learning data, along with the first steps you can take in this challenging but rewarding area of your ML project.

Understanding the role of data in machine learning

When attempting to address real-world issues with ML applications, the first step is generally data preparation. Raw data often contains inconsistencies that you need to address before the dataset can be fed to machine learning algorithms. 

What problems might you face when working with raw data? Here are some common issues you may encounter (a short example of consolidating such data follows the list):

  1. Missing dataset values – It’s not always possible to have values for each row of feature variables that correspond to the target variable. A data collection system may be unable to capture particular values due to technological reasons. Such errors are subsequently recorded in the dataset as missing values.
  2. Different file formats – When working on actual machine learning datasets, it’s quite uncommon to see the same data format for all files. Prepare to see multiple formats.
  3. Variable value inconsistency – Variables in a dataset may contain values that aren’t useful. You need to get rid of them to optimize your data processing effort.
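
For illustration, here's a minimal sketch, using pandas (which this article doesn't prescribe), of consolidating files that arrive in different formats and getting a first count of missing values. The file names are hypothetical.

```python
import pandas as pd

# Hypothetical files, each delivered in a different format.
orders_csv = pd.read_csv("orders.csv")
orders_json = pd.read_json("orders.json")
orders_parquet = pd.read_parquet("orders.parquet")

# Align on a shared set of columns before concatenating.
common_cols = orders_csv.columns.intersection(orders_json.columns).intersection(orders_parquet.columns)
orders = pd.concat(
    [df[common_cols] for df in (orders_csv, orders_json, orders_parquet)],
    ignore_index=True,
)

# Count missing values per column to see where the gaps are.
print(orders.isna().sum())
```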

Key steps in data preparation for machine learning

1. Data Collection

Data collection or gathering sounds like a piece of cake. But it’s far from that. 

In most organizations, data is scattered across many departments, and even across separate tracking points within them. Marketers may have access to a CRM, but that system works in complete isolation from the web analytics solution. 

If you have various channels of engagement, acquisition, and retention, consolidating all data streams into centralized storage will be challenging. To handle this task, various approaches have emerged over the years.

ETL and data warehouses

Many organizations start by storing data in warehouses. These are often designed for structured (or SQL) records compatible with conventional table formats. All of the sales records, payrolls, and CRM data are likely to fall into this group. 

Another common aspect of working with warehouses is transforming data before loading it into the warehouse. This method is known as Extract, Transform, and Load (ETL). Testing data at the ETL stage is a smart move.

The trouble with this strategy is that you never know in advance which data will be valuable to the ML project. As a result, warehouses are typically accessed through business intelligence interfaces to view the metrics we already know we need to track.

ELT and data lakes

Data lakes are storage systems that can hold both structured and unstructured data, such as images, videos, voice recordings, PDF files, etc. However, even structured data isn't transformed before it's stored. 

You load the data in its current state and determine how to utilize and transform it later, on demand. This method is known as Extract, Load, and Transform (ELT). Data lakes are considered to be better-suited for machine learning. 

2. Data Cleaning

The next step is data cleaning, which means removing all the potential issues in your machine learning dataset.

Many data engineers make missing values a priority since they can significantly affect prediction accuracy. In machine learning, assumed or estimated values are “more right” for an algorithm than absent ones. 

Here are a few strategies for ML-focused data cleaning (a brief sketch follows the list):

  • Replace missing values with dummy values, such as “n/a” for categorical values or “0” for numerical values
  • Substitute the mean for missing numerical values
  • Fill in missing categorical values with the most frequently occurring value
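
As a rough illustration of these strategies, here's a minimal pandas sketch on a hypothetical toy dataset. In a real project, you would choose the imputation method per column based on domain knowledge.

```python
import pandas as pd
import numpy as np

# Hypothetical toy dataset for illustration.
df = pd.DataFrame({
    "category": ["a", None, "b", "a"],
    "price": [10.0, np.nan, 12.5, 11.0],
})

# Strategy 1: dummy values -- "n/a" for categorical columns, 0 for numeric ones.
dummy_filled = df.fillna({"category": "n/a", "price": 0})

# Strategy 2: mean imputation for numeric columns.
mean_filled = df.assign(price=df["price"].fillna(df["price"].mean()))

# Strategy 3: most frequent value (mode) for categorical columns.
mode_filled = df.assign(category=df["category"].fillna(df["category"].mode()[0]))
```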

You can automate the process of data cleaning if you use an ML-as-a-Service platform. For example, Azure Machine Learning allows you to select among available methodologies, whereas Amazon ML does so automatically.

3. Exploratory Data Analysis (EDA)

Next, it’s time for exploratory data analysis to understand the data you’re working with early in the process and develop insights into its usefulness for the project. A common mistake is to launch into ML model building without taking the time to really understand the data you managed to collect.

Data exploration entails looking at the type and distribution of data included inside each variable, as well as the relationships between variables and how they fluctuate in relation to the output you’re expecting or hoping to achieve.

This stage helps identify issues such as collinearity (variables that move together) or circumstances that call for dataset standardization and other data manipulations. It can also reveal opportunities to improve model performance, such as lowering dataset dimensionality.

Data visualization greatly helps here. Human brains are great at spotting patterns, as well as the data points that don't match the pattern, and visualization is what makes that possible at a glance.
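
Here's a minimal sketch of this kind of exploration with pandas; the dataset path is hypothetical, and most teams would go further with dedicated plotting libraries.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical path to the consolidated, cleaned dataset from the previous steps.
df = pd.read_parquet("prepared_dataset.parquet")

# Type and distribution of each variable.
print(df.dtypes)
print(df.describe(include="all"))

# Pairwise correlations between numeric variables; large absolute values hint at collinearity.
print(df.corr(numeric_only=True))

# Quick visual check of each numeric variable's distribution.
df.hist(figsize=(10, 8))
plt.show()
```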

4. Data Transformation

Finally, it’s time to transform the data. As you work on your problem, you will almost certainly have to review various transformations of your preprocessed data. This stage will depend on the specific algorithm you are using as well as your understanding of the issue domain.

Three common transformations are scaling, attribute decomposition, and attribute aggregation (a short sketch follows the list):

  • Scaling – The preprocessed data may include features measured on very different scales, such as dollars, kilograms, and sales volume. Many machine learning approaches work better when features share the same scale, such as 0 to 1 between the smallest and largest value of a given feature. Consider any feature scaling that may be required.
  • Decomposition – A date may include day and time components that may be further subdivided for ML purposes. Perhaps just the time of day is important to your project? Consider what feature decompositions are possible.
  • Aggregation – Some features can be aggregated into a single feature that is more relevant to your application. Consider the capabilities of feature aggregations to streamline your project.
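
The sketch below illustrates all three transformations on a hypothetical transactions table with pandas; the column names and the simple min-max scaling are assumptions for illustration, not a prescribed recipe.

```python
import pandas as pd

# Hypothetical transactions table for illustration.
df = pd.DataFrame({
    "amount_usd": [12.0, 250.0, 99.5],
    "weight_kg": [0.2, 5.0, 1.1],
    "ordered_at": pd.to_datetime(["2024-01-03 08:15", "2024-01-03 21:40", "2024-01-04 12:05"]),
    "customer_id": [1, 2, 1],
})

# Scaling: rescale numeric features to the 0-1 range (min-max scaling).
for col in ["amount_usd", "weight_kg"]:
    df[col + "_scaled"] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())

# Decomposition: split a timestamp into the components the model may actually care about.
df["order_hour"] = df["ordered_at"].dt.hour
df["order_dayofweek"] = df["ordered_at"].dt.dayofweek

# Aggregation: roll up rows into per-customer features.
per_customer = df.groupby("customer_id")["amount_usd"].agg(total_spend="sum", order_count="count")
```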

Best practices for data management in machine learning projects

Keep data quality in check

The first thing you should consider is whether your data can be trusted for ML training. As the saying goes, “garbage in, garbage out.” Even the most advanced machine learning algorithms can't compensate for bad data, so data quality is an essential objective for every ML team, and the pre-processing phase should guarantee it.

There are many questions you could ask at this point:

  • Were there any technical issues with the data transfer? 
  • How many missing values do you have in your data? 
  • Is your data sufficient for your task?
  • Is your data biased? 

Data quality work starts at the first stage of data collection and continues through ingestion and transformation. Make data quality management a core part of your ML project. 
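
As a rough illustration, the sketch below turns a few of the questions above into programmatic checks with pandas; the threshold and column names are hypothetical and would be defined by your own task.

```python
import pandas as pd

def quality_report(df: pd.DataFrame, min_rows: int, group_col: str) -> dict:
    """A minimal, illustrative set of checks echoing the questions above."""
    return {
        # How many missing values do you have in your data?
        "missing_values_per_column": df.isna().sum().to_dict(),
        # Is your data sufficient for your task? (a crude row-count proxy)
        "enough_rows": len(df) >= min_rows,
        # Is your data biased? (a crude proxy: how balanced is a key group or label column)
        "group_balance": df[group_col].value_counts(normalize=True).to_dict(),
    }

# Example usage with a hypothetical dataset and column name:
# report = quality_report(pd.read_parquet("prepared_dataset.parquet"), min_rows=10_000, group_col="label")
```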

Establish a data governance framework

Data governance is a collection of policies, methods, and technologies that assure the correct administration and use of data in projects such as machine learning applications.

Data governance is a critical component of data management within a business. It has become even more crucial considering the growth of machine learning applications.

A data governance framework focuses on ensuring and improving:

  1. Data integrity, safety, and reliability
  2. Data collection and storage
  3. Data sharing
  4. Data quality and data privacy
  5. Regulatory compliance

In the context of machine learning, the first point is crucial for getting trustworthy and relevant results. Validation, cleaning, and enrichment procedures all help maintain high data quality standards.

Access to sensitive data must be protected, and teams need to follow data protection standards such as GDPR and CCPA. Encryption, access control, and frequent system audits all help boost data security and privacy.

Another area related to governance initiatives centers on tracking the origin and transformations of data as it passes through the ML pipeline. This helps you evaluate the impact of data on model performance and ensure data pipeline traceability. Data lineage is especially significant in machine learning applications because it helps you identify data sources and data changes that contribute to model results.

Finally, there’s compliance with appropriate industry standards and norms, such as the Health Insurance Portability and Accountability Act (HIPAA), to avoid any legal or ethical difficulties associated with data use in ML applications.

Implement version control for datasets

Keeping track of several versions of the same data can be just as difficult. Without adequate synchronization, things quickly come apart. 

Data versioning is an approach that lets you keep track of numerous versions of the same data without incurring massive storage costs. 

Creating machine learning models calls for more than just executing code; it also requires training data and the correct parameters. Updating models is an iterative process where you may need to keep track of all past changes. Data versioning allows you to save a snapshot of the training data and experimental outcomes to make implementation easier at each iteration.

Almost every business is subject to data protection requirements such as GDPR, which require them to maintain specific information in order to verify compliance and the history of data sources. In this case, data versioning can help with both internal and external audits.

Data version control is essential for automating data quality checks and enforcing data governance. Isolating data at ingest (or after it's generated) and exposing it to users only once it has been verified requires data branching (a shallow copy), data merge operations, and pre-merge hooks that execute the validation tests.

Monitoring and maintenance

As with any project, constant monitoring and regular maintenance are part of the job. Regularly updating datasets to reflect changes in the real world makes sense if you want your models to deliver accurate results. 

Another aspect relates to quality. Smart teams implement tools for continuous monitoring of data quality and drift. They also put automated data validation processes in place.

What kind of tools can you use to accomplish those and other data-related tasks? Here’s a lineup of top tools used in the field today.

Tools and technologies for efficient data preparation

A data cleaning tool accelerates and streamlines the process by automating numerous operations. Trifacta Wrangler, Astera Centerprise, OpenRefine, Winpure, and TIBCO Clarity are some good examples of data cleaning tools.

You can use such a tool for:

  • Automated data profiling – It scans and profiles the complete dataset automatically to find potential data quality concerns such as missing values, duplication, inconsistencies, and formatting mistakes. 
  • Standardization – Such tools employ standardization criteria to guarantee that data follows a uniform format and check it against established rules or reference data.
  • Deduplication – Data cleansing solutions can readily discover duplicate data or entries and automatically combine or eliminate them.
  • Parsing and transformation – A tool may parse complicated data structures, such as addresses or names, and automatically change them into a uniform format.
  • Error correction – Using specified criteria, these tools can automatically repair common mistakes such as misspellings or wrong values.
  • Data quality validation – Built-in checks that confirm the cleaned data meets defined quality rules.
  • Quality/governance automation – Integration with a data version control system to automate quality and governance checks.

A good data cleaning and transformation tool should have capabilities for extensive data profiling and cleaning, data quality checks, and options for workflow automation to automate the entire data cleansing effort: from data profiling to transformation, validation, and loading to the desired destination.

Best Practice: Data CI/CD for pre-processing pipelines 

To ensure the quality of the data generated by pre-processing pipelines, implementing automated data quality tests is a must. Building this CI/CD process for data pipelines requires two sets of tools: data version control systems and data quality tools. 

Pre-processing pipelines transport processed data from data lakes to downstream users, such as business dashboards and machine learning algorithms. It’s critical to guarantee that production data complies with corporate data governance requirements. 

One best practice is building data quality gates that assure quality and reliability at each stage of the data lifecycle. Continuous Integration (CI) tests run against the data, and the data is promoted to production for business use only if all data governance criteria are met.
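
A minimal sketch of that gate pattern is shown below, using pandas for the checks. The column names, the rules, and the merge_to_production() helper are hypothetical stand-ins for your own governance criteria and for your data version control system's promotion step (for example, a merge that only proceeds after pre-merge hooks pass).

```python
import pandas as pd

def passes_quality_gates(df: pd.DataFrame) -> bool:
    """Illustrative CI checks; real projects would codify their own governance rules."""
    checks = [
        df["id"].notna().all(),         # no missing primary keys
        df["id"].is_unique,             # no duplicate records
        (df["amount_usd"] >= 0).all(),  # values within an expected range
    ]
    return all(bool(c) for c in checks)

def merge_to_production(staging_path: str) -> None:
    # Hypothetical stub standing in for your data version control system's merge operation.
    print(f"Promoting {staging_path} to production")

def promote_if_valid(staging_path: str) -> None:
    # Run the CI tests against the isolated (staging) copy of the data,
    # and only promote it if every gate passes.
    df = pd.read_parquet(staging_path)
    if passes_quality_gates(df):
        merge_to_production(staging_path)
    else:
        raise ValueError("Data failed CI quality gates; promotion blocked")
```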

Building data quality gates in machine learning data preparation

Data version control systems

Data version control systems help implement CI/CD for data in the context of machine learning using various methods. For example, in the open-source tool lakeFS, you can implement CI/CD by using pre-commit and pre-merge hooks that trigger specific rules and validation checks.

Examples of data version control tools include lakeFS, GitLFS, Project Nessie, and XetHub.

Data quality tools

Data quality tools streamline, and in many cases automate, the data management tasks that ensure data is ready for analytics, data science, and machine learning use cases. Such a tool helps teams evaluate current data pipelines, identify quality bottlenecks, and automate corrective procedures.

Using a data quality tool also boosts customer confidence in the data: customers know the tool has removed low-quality data, leaving only high-quality data that enables true data-driven decision making. 

Some examples of data quality tools are Great Expectations, Monte Carlo, Deequ, Lightup, Anomalo, Acceldata, Bigeye, and Datafold.
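
As a rough illustration, here's a tiny sketch using the legacy pandas-flavored Great Expectations API (newer releases organize validation differently, around expectation suites and checkpoints); the file and column names are assumptions, so treat this as an example of the idea rather than a reference.

```python
import great_expectations as ge
import pandas as pd

# Wrap a hypothetical DataFrame with the legacy Great Expectations dataset API.
df = ge.from_pandas(pd.read_csv("orders.csv"))  # hypothetical file

# Declare a couple of expectations and inspect the validation results.
print(df.expect_column_values_to_not_be_null("customer_id"))
print(df.expect_column_values_to_be_between("amount_usd", min_value=0, max_value=10_000))
```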

Conclusion

Quality data is at the foundation of machine learning. Data challenges are nothing new, but as ML models become larger and data becomes more abundant, you need to find scalable methods for assembling great training data.

Luckily, data practitioners can choose from more and more tools that help them overcome the obstacles to clean, accurate, and trustworthy data. 

One of the rapidly evolving fields is data version control. Check out this article for a practical guide to how a data version control tool enables reproducibility in ML projects: ML Data Version Control and Reproducibility at Scale.
