April 9, 2022

Machine Learning How To: Reproducible and Trackable Training

Machine Learning How To: Reproducible and Trackable Training

The investigation and experimentation phase is a common building block in developing any machine learning solution. As the model developer you spend a good chunk of your time doing research on different algorithms, gathering and processing the data, hyper-parameters tuning and so on.
Through this process you may end up adding new data to your original collected dataset, and try various algorithms or tinker with the hyper-parameters of the same algorithm many times.
You may be able to keep track of the differences between a few of these experimental setups with scripts or notebooks that are named first.ipynb, second.ipynb, final.ipynb, final2.ipynb, final_final.ipynb. However, this inefficient process causes a lot of headaches for you and your team when you want to compare results. It also makes it difficult to know which model used which dataset, and decide on which model to push to production.
This blog post is part one of a series covering different aspects of the life-cycle of machine learning in production.
In this post I will go cover the topics listed below:

  • The model training process
  • How to version control data
  • How to version control models
  • How to easily compare the performance of different model configurations
  • How to write readable, reusable production-ready code for your machine learning pipelines.

For this blog post series, I will be using a small movie review dataset and fit some basic models on it. The goal of these posts is not to optimise the data or model, but to demonstrate how changes in the model development process can be easily tracked and reproduced. All the code used for this blog series can be found on my Github. This post will go through the code inside the ml directory.

Tools and services used

  • pyenv: pyenv is a virtual environment manager that allows you to easily install and manage different Python versions in different environments. You can follow the installation instructions from its Github page.
  • AWS: S3 for remote data storage
  • Git and Github: For tracking code (and later for CI/CD, to be covered in later blog posts)
  • Docker: Containerisation is a useful tool providing reproducible environments
  • Data Version Control (DVC): An open source tool for tracking data, building ML pipelines, and performance metrics. Its integration with Github makes it very easy to pull the data and run experiments on different git branches.

In the following sections, I will walk through how to build the published project from scratch.

1. Create a Virtual Environment

Create a virtual environment and activate it.

> pyenv virtualenv 3.9 venv
> pyenv activate venv

2. Define the project structure

A project needs to have a well-defined and easy to understand structure. This makes it possible for multiple people to work on a repository without stepping on each other's toes.

An example

The ml directory of the linked repository, has different directories. Each responsible for a specific component of the machine learning project.


Each directory has a __main__.py and a config.py script. The main script is the script responsible for executing a specific task. The task is described by the enclosing directory name. This separation of concerns and distinct naming makes it easy for a new developer to navigate through the code-base.
The config script is responsible for initialising any configurable variables. I include all the environment variables in a .env file. Note that, .env files can contain sensitive information, e.g. secrets, it is therefore important to not push these files to your remote repository. This can be done by including .env (and other files you do not want to be tracked) to .gitignore. However, it is useful to include a list of all environment variables needed for running the code and the values of non-sensitive information in a file that is pushed to the repository. This way anyone else cloning the project will know what to include in their .env file
I have listed all the variables in a file named .env.example. Below are the contents of this file. Notice that the AWS secret and id are listed but the values are not included:



A few notes

  • services/ and __main__.py include code for inference which I will cover in the next blog post.
  • The metrics folder includes the performance metrics on test data. More on that in step 8.
  • Details on the Dockerfile and docker-compose file can be found in step 4.
  • Details on the DVC related directory and files can be found in steps 7 & 8.
  • Details on the shell script can be found in step 8.

3. Write code

After deciding on the project's code structure, it is time to add code. The data preparation and training steps are very specific to the data and problem at hand and the code snippets below are just an example for training a model that will predict positive or negative sentiments given a piece of text.

3.1 Preprocessing

The movie review dataset has two columns: Phrase which is the review of a movie and label which is a rating of 0-4 of the movie. The code snippet below, is the content of the prepare_data/__main__.py script.


import pandas as pd

from sklearn.model_selection import train_test_split

from model_params import TrainTestSplit
from .config import Config

def split_train_test_data():
    config = Config()
    # select the reviews with the lowest and highest ratings
    data = pd.read_csv(f"{config.data_path}/{config.data_file}", sep="\t")
    data = data[(data["Sentiment"] == 0) | (data["Sentiment"] == 4)]
    data["labels"] = data.apply(lambda x: 0 if x["Sentiment"] == 0 else 1, axis=1)

    #split the dataset into training and testing data
    data_train, data_test = train_test_split(data, test_size=TrainTestSplit.TEST_SIZE)


if __name__ == "__main__":

Notice the use of Config() for configurable variables. All such variables are defined in a config.py script in the same directory. Below is the content of this file:


from os import getenv
from sys import exit

from utils.logger import logger

class Config:
    def __init__(self):
        if getenv("DATA_PATH") is None:
            self.data_path = getenv("DATA_PATH")

        if getenv("DATA_FILE") is None:
            self.data_file = getenv("DATA_FILE")

        if getenv("TRAIN_DATA_FILE") is None:
            self.train_data_file = getenv("TRAIN_DATA_FILE")

        if getenv("TEST_DATA_FILE") is None:
            self.test_data_file = getenv("TEST_DATA_FILE")

    def exit_program(self, env_var):
        error_message = (f"prepare_data: {env_var} is missing from the set environment variables.")

3.2 Training

A very similar structure applies to the training script. Below is an example of using the training dataset created by the preprocessing script and fitting a logistic regression model to it:


import pandas as pd

from joblib import dump
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

from .config import Config
from model_params import LogisticRegressionConfig

def train():
    config = Config()
    # initialise model parameters
    lr_params = {
        "n_jobs": LogisticRegressionConfig.n_jobs,
        "C": LogisticRegressionConfig.C,
        "max_iter": LogisticRegressionConfig.max_iter,
    # load the training dataset created by prepare_data/
    train_dataframe = pd.read_parquet(f"{config.data_path}/{config.train_data_file}")

    # separate the input text and labels
    X = train_dataframe["Phrase"]
    y = train_dataframe["labels"]

    # define a series of processes that transform input data X
    clf = Pipeline(
            ("vect", CountVectorizer()),
            ("tfidf", TfidfTransformer()),
            ("clf", LogisticRegression(**lr_params)),
    # fit the model and save it in the model_path
    clf.fit(X, y)
    dump(clf, f"{config.model_path}/{config.model_file}")

if __name__ == "__main__":

The train function dumps the model to a configurable path. All the configurable variables are defined in config.py in the train/ directory:


from os import getenv
from sys import exit

from utils.logger import logger

class Config:
    def __init__(self):
        if getenv("DATA_PATH") is None:
            self.data_path = getenv("DATA_PATH")

        if getenv("TRAIN_DATA_FILE") is None:
            self.train_data_file = getenv("TRAIN_DATA_FILE")

        if getenv("MODEL_PATH") is None:
            self.model_path = getenv("MODEL_PATH")

        if getenv("MODEL_FILE") is None:
            self.model_file = getenv("MODEL_FILE")

    def exit_program(self, env_var):
        error_message = (
            f"train: {env_var} is missing from the set environment variables.")

3.3 Model Parameters

It is useful to have a script that includes all the model parameters and configurations. This way any model parameter changes will happen in one place only. For the simple logistic regression model used above, I define all the parameters used in train/__main__.py in a LogisticRegressionConfig class and include the split between train and test datasets, used in prepare_data/__main__.py in a TrainTestSplit class. As I make changes to the model or add new models I can add other classes to this script. Below is an example of what is included in ml/model_params.py:


class TrainTestSplit:
    TEST_SIZE = 0.2

class LogisticRegressionConfig:
    n_jobs = 1
    C = 1e5
    max_iter = 1000

3.4 Testing

Depending on the problem, a set of performance metrics should be defined. For the sentiment analysis model, I use accuracy and confusion matrix. This is what the content of test/__main__.py looks like:


import pandas as pd

from json import dump
from joblib import load
from sklearn.metrics import confusion_matrix, accuracy_score

from .config import Config

def test_model():
    config = Config()

    # load the model from the model_path
    model = load(f"{config.model_path}/{config.model_file}")

    # load the test data created by prepare_data/__main__.py
    data_test = pd.read_parquet(f"{config.data_path}/{config.test_data_file}")
    # separate the input from the labels
    X = data_test["Phrase"]
    y = data_test["labels"]
    # make predictions on input text, X
    y_pred = model.predict(X)

    # compute metrics
    accuracy = accuracy_score(y, y_pred)
    true_negative, false_positive, false_negative, true_positive = confusion_matrix(y, y_pred, normalize="true").ravel()

    # save the results in metric_path
    with open(f"{config.metric_path}/{config.metric_file}", "w") as metrics:
                "results": {
                    "accuracy": accuracy,
                    "true_negative": true_negative,
                    "false_positive": false_positive,
                    "false_negative": false_negative,
                    "true_positive": true_positive,

if __name__ == "__main__":

Similar to the other directories, all the configurable variables are defined in test/config.py:


from os import getenv
from sys import exit

from utils.logger import logger

class Config:
    def __init__(self):
        if getenv("DATA_PATH") is None:
            self.data_path = getenv("DATA_PATH")

        if getenv("TEST_DATA_FILE") is None:
            self.test_data_file = getenv("TEST_DATA_FILE")

        if getenv("MODEL_PATH") is None:
            self.model_path = getenv("MODEL_PATH")

        if getenv("MODEL_FILE") is None:
            self.model_file = getenv("MODEL_FILE")

        if getenv("METRIC_PATH") is None:
            self.metric_path = getenv("METRIC_PATH")

        if getenv("METRIC_FILE") is None:
            self.metric_file = getenv("METRIC_FILE")

    def exit_program(self, env_var):
        error_message = (
            f"test: {env_var} is missing from the set environment variables.")

3.5 Freeze your dependencies into requirements.txt

As you write and run scripts you need to pip install different packages. In order to reproduce the results on a different machine, a file with a list of the packages and their versions is needed. This file is usually called requirements.txt. Below is the set of libraries used in the example project:



You can either add the packages one by one or create this file automatically by running pip freeze > requirements.txt in your terminal. Note that this command will include all the packages installed in your virtual environment and not just those used in your project.

4. Create a docker image

Another tool that helps with the reproducibility of a machine learning pipeline, and any pipeline for that matter, is Docker. You can install docker by following instructions from docker's website.
Below is an example of a Dockerfile you can use for training:


FROM python:3.9-slim as py39slim
RUN apt-get update -y && apt-get install -y git


# copy over the requirements.txt file and install all requirements
COPY ./ml/requirements.txt /ml
RUN pip install -r requirements.txt --no-cache-dir

# copy over the ml directory
COPY ./ml /ml

ENTRYPOINT ["/bin/bash", "-c"]

With multiple environment variables in the code, I find it easier to write a docker-compose.yml file that copies over all the variables from .env rather then including them in a docker run command. Here is an example of a docker-compose.yml file:


version: '3.8'
      context: ./..
      dockerfile: ml/Dockerfile
      - ./data:/ml/data
      - ./metrics:/ml/metrics
      - ./model:/ml/model
      - ./prepare_data:/ml/prepare_data
      - ./train:/ml/train
      - ./test:/ml/test
      - ./../shared_utils/logger:/ml/utils/logger

5. Install DVC

Follow the instructions on DVC's website on how to install. I installed it in my virtual environment using pip:

> pip insall dvc
> pip install 'dvc[s3]'  #Needed for working with boto3

6. Gather your dataset and upload it to a remote storage space

I uploaded the example dataset to an S3 bucket, named sent-analysis.

7. Setup DVC

7.1 Initialisation

If you have cloned the repository, DVC is already initialised in the ml folder; however, if you are working on a project from scratch you will need to initialise it by running dvc init. This will create a .dvc/ directory in the project.

7.2 Link to remote storage

For your data to be accessible from different environments you can save your data in a remote storage location, e.g. s3 a and link DVC to it by running:

dvc remote add s3-remote s3://<bucket_name>/

You will notice a config file in the .dvc/ directory created linking to the above remote storage. This makes it very easy to push and pull data to your s3 bucket. For example, a coworker can easily download the dataset on their local machine by running dvc pull in the cloned repository. If any changes are made to the data on your local machine, you can sync the remote data by running dvc push.

8. Create pipelines


A pipeline is a well-structured and easily maintainable workflow. Each pipeline is made from multiple steps, or stages as referred by DVC.


In a machine learning pipeline, all the artefacts need to be traceable, and any change in any stage should trigger the reproduction of all the affected stages.


You can create stages with DVC using dvc run. The name, input, and output of each stage can be defined in this command. Running a stage creates a dvc.yml file that includes all the dependencies of the corresponding stage (and the other stages that are run in the project).
Take for example the model testing step. To create a stage, all the data, scripts, outputs that should be tracked are included in the stage definition:

dvc run -n test_model \
-d test/__main__.py \
-d data/test_data.parquet \
-d model/lr.model \
-M metrics/model_performance.json \
docker-compose run sentiment.ml "python -m test ."

Note the different tags used in the snippet above, -n is the name tag assigned to this example stage. All the stage dependencies, i.e. the script that should be run to test the model (test/__main__.py), test_data.parquet, and lr.model are tagged with -d. The performance metric file is given the metrics tag, -M.
In a training pipeline, the testing stage is usually the last step after pulling the data, preprocessing it and training a model. Each step can be created with dvc run similar to the test stage above. By defining all the dependencies in each stage, any change to an artefact will propagate across all effected stages.

Define pipeline stages in a shell script

Running different dvc run commands from the terminal one by one will make it hard for reproducing and modifying the stages. I find it easier to include all stages in a shell script. That way all the stages are in one place, can be easily reproduced from another machine, modified and version controlled with Git.
The shell script training_testing_pipeline.sh in the repository includes all the stages. Each stage can be run individually by running ./training_testing_pipeline.sh <stage_name> inside the terminal from the ml directory. Alternatively, all stages can be run together by excluding the stage or by running dvc repro.
Below is a description of each stage in the order they should run:

Pulling data
if [[ "$*" =~ "pull_data_from_s3" ]]; then
   dvc run -n pull_data_from_s3 \
   -d s3://sent-analysis/reviews.tsv \
   -o ./data/reviews.tsv \
   aws s3 cp s3://sent-analysis/reviews.tsv ./data/reviews.tsv

The first stage, pull_data_from_s3 is responsible for pulling the data from the remote storage, s3, and saving it to a specified path, in this case, data/. Running this stage will create a data/ directory in your local repository and download the dataset, reviews.tsv to it (Assuming you have included this file in an s3 bucket named sent-analysis and have set it up as the remote storage in step 7.2). Note that the output, ./data/reviews.tsv is tagged with -o.
Running this stage will create/add details to dvc.yaml:

  cmd: aws s3 cp s3://sent-analysis/reviews.tsv ./data/reviews.tsv
    - s3://sent-analysis/reviews.tsv
    - ./data/reviews.tsv

Another file that gets created/modified with running a stage is, dvc.lock. This file contains the md5 hash values for each artefact which allows for changes in any part of the pipeline to be tracked, i.e. any change results in a change in the hash value which can then trigger all the dependencies to be recomputed and get new hash values. I will show this in an example in the following section. Below is the output of dvc.lock from running the pull_data_from_s3 stage:

  cmd: aws s3 cp s3://sent-analysis/reviews.tsv ./data/reviews.tsv
    - path: s3://sent-analysis/reviews.tsv
      etag: d8a6be2d1deb19f9cd76f1b69a793b5f
      size: 8481022
    - path: ./data/reviews.tsv
      md5: d8a6be2d1deb19f9cd76f1b69a793b5f
      size: 8481022
Preparing data

Now that the dataset is pulled from the remote storage, it is time for the preprocessing step.

if [[ "$*" =~ "prepare_data" ]]; then
   dvc run -n prepare_data \
   -d data/reviews.tsv \
   -d prepare_data/__main__.py \
   -o data/train_data.parquet \
   -o data/test_data.parquet \
   docker-compose run sentiment.ml "python -m prepare_data ."

A few things to note:

  1. data/reviews.tsv was an output in the previous stage but is a dependency in this stage. This follows the logical flow of data in the pipeline, i.e. the data is collected and passed down to the preprocessing step. Declaring the dependency of prepare_data on reviews.tsv means that any change to the data which is identified by the md5 hash in dvc.lock will trigger its dependent stage, prepare_data to run again.
  2. A stage can have multiple outputs, in this case, training data and testing data.
  3. The command at this stage is run in docker through docker-compose.
    Let's take a look at the dvc.lock content created after running this stage:
  cmd: docker-compose run sentiment.ml "python -m prepare_data ."
   - path: data/reviews.tsv
     md5: d8a6be2d1deb19f9cd76f1b69a793b5f
     size: 8481022
   - path: prepare_data/__main__.py
     md5: fd698d5634e84fe9ea713e10c59154ca
     size: 719
   - path: data/test_data.parquet
     md5: a9a52c60878a1cc92cef68ee8a4eaaae
     size: 206588
   - path: data/train_data.parquet
     md5: 57fe388a9b3d96ed734ddad601941a53
     size: 808111

Notice that reviews.tsv is identified with same md5 hash, d8a6be2d1deb19f9cd76f1b69a793b5f, from the previous stage. These unique hashes make it possible to pinpoint discrepancies in the machine learning pipeline in regards to different versions of artefacts being included in different stages.

Training the model

The previous stages have created the data needed to train the model. Now we can run the scripts inside the train directory and create a model.

if [[ "$*" =~ "train_model" ]]; then
   dvc run -n train_model \
   -d train/__main__.py \
   -d data/train_data.parquet \
   -o model/lr.model \
   docker-compose run sentiment.ml "python -m train ."

Note that the training script fits a logistic regression model on the data and saves the model in a model directory which is tracked via the output tag and gets a unique md5 hash attached to it in dvc.lock. Below is the output of dvc.lock from running this stage:

  cmd: docker-compose run sentiment.ml "python -m train ."
   - path: data/train_data.parquet
     md5: 57fe388a9b3d96ed734ddad601941a53
     size: 808111
   - path: train/__main__.py
     md5: 0b044356b96a8419fcd64ba7087c225d
     size: 939
   - path: model/lr.model
     md5: afb779dc7ffb7b659c1d16937c57b910
     size: 373778
Testing the model

This stage is important for both evaluating the model and also comparing this model with other candidate models.

if [[ "$*" =~ "test_model" ]]; then
   dvc run -n test_model \
   -d test/__main__.py \
   -d data/test_data.parquet \
   -d model/lr.model \
   -M metrics/model_performance.json \
   docker-compose run sentiment.ml "python -m test ."

As seen, the test_model stage is dependent on all previous stages.
Note that the output of the performance evaluation script is tagged with the metrics tag, -M.

9. Visualise the pipeline flow

Now that all training and testing stages have been defined, we can visualise the flow of data in the pipeline by running dvc dag:

The flow of data and illustration of how one stage depends on the previous stage

10. Run experiments and compare results

It is time to see the pipeline's value in action. Let's consider two scenarios:

Modifying the model's hyper-parameters

Say that you have trained the model by running the shell script, but want to make changes to its hyper-parameters. Let's see how we can easily switch between versions of the model like switching between different code versions on Git branches.
For illustration purposes, I branched off of main by creating a branch named exp and tweaked the test size in model_params.py. This should trigger recomputation of all three stages, prepare_data, train_model and test_model.
I will run the pipeline as an experiment, with this command dvc exp run -n more-test-data --no-run-cache. Note that more-test-data is a name I gave to this experiment.
As expected, this command skips the first stage of the pipeline since the dataset has not changed. Below is the output from the terminal:

Stage 'pull_data_from_s3' didn't change, skipping                                                                                                       
Running stage 'prepare_data':
> docker-compose run sentiment.ml "python -m prepare_data ."
Creating ml_sentiment.ml_run ... done
Updating lock file 'dvc.lock'                                                                                                                           

Running stage 'train_model':
> docker-compose run sentiment.ml "python -m train ."
Creating ml_sentiment.ml_run ... done
Updating lock file 'dvc.lock'                                                                                                                           
Running stage 'test_model':
> docker-compose run sentiment.ml "python -m test ."
Creating ml_sentiment.ml_run ... done
Updating lock file 'dvc.lock'     

The results of this experiment can be compared with the main branch, and other experiments (if any) by running dvc exp show. Below is a screenshot of the result:

Comparison of performance metrics of two experiments

This experiment produced three new files that are tracked by DVC, train_data.parquet, test_data.parquet and lr.model. This means that the hashes of these files have also been updated in dvc.lock. The image below shows the changes in the lock file.

The changes made in dvc.lock after modifying the test size.

I add the changed files to git and push them to the remote repository. I then run dvc push to push the three DVC tracked files to the remote storage.
This allows anyone else working on this code to access both sets of files just by switching the branches and pulling the corresponding data from DVC. In other words, when a coworker checks out the exp branch and runs dvc pull the dvc.lock file dictates which versions of the three train, test and model files should be downloaded.

Comparing different models

Now let's change the training model. For this, I branch off of the main branch, svm-exp, modify the main script inside the train folder and fit an svm model instead of the logistic regression model.
This change should only affect the last two stages as it relates to training and testing the model on the same data produced by prepare_data. Below is the output of training an svm model and running dvc exp show. As seen we can easily compare the results of the main branch with the other two experiments, one from increasing the test data and one from training and different model. It is clear from the results of the svm model that the changes made in the svm-exp experiment outperform the other two.
Now I can add the changes and push the DVC tracked files, create a pull request on Github and get the better performing model merged.

In this blog post, I described the steps to generate easy-to-read, reproducible code for training machine learning models. The next posts will build on top of this article and discuss the next steps after deciding on the model for production use.
I hope you enjoyed this post and use some of these ideas in your own projects. If you like reading about machine learning and natural language processing, follow me on twitter for more content!