<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Cherry-picked Content on Machine Learning]]></title><description><![CDATA[Cherry-picked Content on Machine Learning]]></description><link>https://cherrypicked.dev/</link><image><url>https://cherrypicked.dev/favicon.png</url><title>Cherry-picked Content on Machine Learning</title><link>https://cherrypicked.dev/</link></image><generator>Ghost 3.35</generator><lastBuildDate>Wed, 25 Mar 2026 05:02:42 GMT</lastBuildDate><atom:link href="https://cherrypicked.dev/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Lessons Learned from Bloomberg's GPT Model Journey]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>Generative models, especially GPT-style models, have been front and centre of the conversation for the past few months. Many factors contribute to the performance and usability of these models. Perhaps the most important is the training data: its quality, distribution and processing pipeline. 
This</p>]]></description><link>https://cherrypicked.dev/navigating-the-training-maze-lessons-learned-from-bloombergs-gpt-model-journey/</link><guid isPermaLink="false">649431d7a9852c35c3434a53</guid><dc:creator><![CDATA[Maryam Fallah]]></dc:creator><pubDate>Sun, 30 Jul 2023 12:15:38 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1534517284575-ce1520e615e3?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDV8fG1hemV8ZW58MHx8fHwxNjg3NjA4NjkzfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://images.unsplash.com/photo-1534517284575-ce1520e615e3?crop=entropy&cs=tinysrgb&fit=max&fm=jpg&ixid=M3wxMTc3M3wwfDF8c2VhcmNofDV8fG1hemV8ZW58MHx8fHwxNjg3NjA4NjkzfDA&ixlib=rb-4.0.3&q=80&w=2000" alt="Lessons Learned from Bloomberg's GPT Model Journey"><p>Generative models, especially GPT-style models, have been front and centre of the conversation for the past few months. Many factors contribute to the performance and usability of these models. Perhaps the most important is the training data: its quality, distribution and processing pipeline. This has encouraged many institutions to invest in training their own large language models (LLMs) from scratch, or fine-tuning existing open-source ones, which is still a very expensive undertaking.<br>
Bloomberg is one of these institutions. The existing LLMs have been trained on massive amounts of data from the internet (and, in some cases such as ChatGPT, private data) that cover a wide range of topics but are ultimately too broad to be useful for a specialised domain, e.g. medicine or finance. In addition to the data distribution, the tokenisation step affects how the model learns to interpret the input text. General-text LLMs do not use tokenisers that work best for numbers, and therefore the models do not perform well on maths and finance-related topics.<br>
These were the main reasons Bloomberg took up the challenge of training its own GPT-style LLM. Training such models is no easy feat: you may encounter training instability, scaling issues, deployment and inference challenges, and compute, time and budget restrictions. Even if all goes well, which it won't, there is no straight path to determining the right combination of data, parameters and model size. We usually hear about the final polished model and not so much about the many iterations and decisions that led to it.<br>
Bloomberg has published an extensive <a href="https://arxiv.org/pdf/2303.17564.pdf">paper</a> on these very topics, some of which I will go over in this blog post.</p>
<h2 id="dataset">Dataset</h2>
<p>A combination of general language public data and in-house private finance data is used.</p>
<h3 id="proprietaryfinancedatafinpile">Proprietary finance data - FinPile</h3>
<p>FinPile, as Bloomberg calls it, is a set of financial documents accumulated over the years from 2007 to 2022. It makes up 51.27% of the entire training dataset, or 363B tokens.</p>
<h3 id="publicdata">Public data</h3>
<p>Three public datasets are included in different proportions. For example, C4 (Colossal Clean Crawled Corpus) makes up 19.48%, or 138B tokens. The other public datasets used are <a href="https://arxiv.org/abs/2101.00027">The Pile</a> and Wikipedia. The total number of tokens from these sources is 345B.</p>
<h2 id="tokenisation">Tokenisation</h2>
<p>Handling numbers is a necessity for a language model with financial use cases. It needs a good understanding of comparisons, basic arithmetic, the distinction between prices and dates, etc.<br>
To make this possible, Bloomberg uses <a href="https://arxiv.org/abs/1804.10959"><strong>Unigram</strong> tokeniser</a> instead of a greedy merge-based sub-word tokeniser such as <a href="https://arxiv.org/abs/1508.07909">Byte Pair Encoding</a> (BPE) or Wordpiece. Such merge-based tokenisers take a bottom-up approach by repeatedly merging the most frequent sequence pairs until the predetermined vocabulary size is reached (BPE) or merging the sequence-pair that maximises the likelihood of the training data (Wordpiece). However, the Unigram tokeniser takes a top-down approach. It starts with a large vocabulary and repeatedly removes items that increase some loss the least, e.g. log-likelihood of the training data. This allows the tokeniser to split the input in several ways by saving probabilities of different splits.</p>
<p>The total size of the dataset is 710B tokens. Let's put into perspective how much data that is: if you printed all the pages of English Wikipedia and stacked them on floor-to-ceiling shelves, they would fill 16.5 shelves. Bloomberg's dataset is 20x that!</p>
<h2 id="model">Model</h2>
<h3 id="findingtherightsize">Finding the right size</h3>
<p>The model architecture is based on the <a href="https://arxiv.org/pdf/2211.05100.pdf">BLOOM</a> model. In order to find the right model size for the amount of data, Bloomberg relied on <a href="https://arxiv.org/pdf/2203.15556.pdf">Chinchilla's</a> scaling laws, specifically approach 1 and approach 2.</p>
<h4 id="chichillasscalinglaws">Chinchilla's scaling laws</h4>
<p>DeepMind's Chinchilla <a href="https://arxiv.org/pdf/2203.15556.pdf">paper</a> systematically studied the relationship between compute, dataset size, and model size. One of its main conclusions was that model size and dataset size should scale together, i.e. if you have more compute and want to train a larger model, you also need more data.<br>
This paper also introduced scaling laws to compute the optimal model size for a fixed compute budget. Given Bloomberg's fixed budget of 1.3M GPU hours on 40GB A100 GPUs and 710B tokens, and leaving ~30% of the compute for potential retries, they chose the largest model size possible: 50B parameters.</p>
<h3 id="findingtherightshape">Finding the right shape</h3>
<p>Now, how many self-attention layers does 50B translate to, and what is the hidden dimension of each? <a href="https://proceedings.neurips.cc/paper/2020/file/ff4dfdf5904e920ce52b48c1cef97829-Paper.pdf">Levine et al.</a> have studied this relationship:<br>
$D = \exp(5.039) \cdot \exp(0.0555 L)$<br>
where $D$ is the hidden dimension and $L$ is the number of layers. The Bloomberg team swept the integer pairs $(L, D)$ that give a total of 50B parameters and landed on (70, 7510). These numbers are not satisfactory for two reasons:</p>
<ol>
<li>The hidden dimension is not evenly divisible by the number of attention heads, which it traditionally should be.</li>
<li>The dimensions need to be multiples of 8 for higher performance in Tensor Core operations.</li>
</ol>
<p>For these reasons they chose 40 heads, each with a dimension of 192, resulting in a total hidden dimension of D = 7680 and a total of 50.6B parameters.</p>
<h2 id="training">Training</h2>
<blockquote>
<p>We use the Amazon SageMaker service provided by AWS to train and evaluate BloombergGPT. We use the latest version available at the time of training and train on a total of 64 p4d.24xlarge instances. Each p4d.24xlarge instance has 8 NVIDIA 40GB A100 GPUs with NVIDIA NVSwitch intra-node connections (600 GB/s) and NVIDIA GPUDirect using AWS Elastic Fabric Adapter (EFA) inter-node connections (400 Gb/s).<br>
This yields a total of 512 40GB A100 GPUs. For quick data access, we use Amazon FSX for Lustre, which supports up to 1000 MB/s read and write throughput per TiB storage unit.</p>
</blockquote>
<h3 id="largescaleoptimisation">Large-scale optimisation</h3>
<p>There are a number of considerations that help with increasing training speed and reducing memory footprint while training on the GPUs.</p>
<ol>
<li>Model parallelism: Model parameters, gradients and optimiser states are sharded across GPUs using <a href="https://arxiv.org/pdf/1910.02054.pdf">ZeRO stage 3</a>. There are a total of 4 copies of the model.</li>
<li>Reducing communication overhead: Using some features introduced by <a href="https://arxiv.org/pdf/2205.00119.pdf">MiCS</a> from Amazon, the communication overhead and memory requirements on the cloud cluster are reduced.</li>
<li>Activation checkpointing: This is applied to each transformer layer and reduces memory consumption. With activation checkpointing, the layer's input and output are saved after a forward pass and all the intermediate tensors are discarded from memory. These values are recomputed during backpropagation.</li>
<li>Mixed precision training: Another way to reduce memory requirements. Parameters are stored and updated in full precision (FP32), but forward and backward passes are done in Brain Float 16 (BF16). The attention block's Softmax and the calculations in the loss function were computed in full precision.</li>
<li>Fused kernels: The combination of multiple operations into a single one results in a fused kernel. This reduces peak memory usage because it avoids storing intermediate results in the computation graph. It also helps with speed.</li>
</ol>
<h3 id="iterations">Iterations</h3>
<h4 id="v0">v0</h4>
<p>They started by training a first version of the model for 10 days to monitor training and fix any potential issues. In this v0 model, they noticed that training loss did not decrease much further after 4 days of training (20k steps). Validation loss also plateaued, as shown in the figure below.</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2023/07/Screenshot-2023-07-29-at-11.04.49-AM.png" class="kg-image" alt="Lessons Learned from Bloomberg's GPT Model Journey" srcset="https://cherrypicked.dev/content/images/size/w600/2023/07/Screenshot-2023-07-29-at-11.04.49-AM.png 600w, https://cherrypicked.dev/content/images/size/w1000/2023/07/Screenshot-2023-07-29-at-11.04.49-AM.png 1000w, https://cherrypicked.dev/content/images/2023/07/Screenshot-2023-07-29-at-11.04.49-AM.png 1564w" sizes="(min-width: 720px) 720px"><figcaption>Figure 1: Learning curve of the first training attempt, v0. There is a large gap between training and validation losses, and the losses plateau after step 20k.</figcaption></figure><!--kg-card-begin: markdown--><p>They suspected that the large gap between training and validation losses and the loss plateau could be due to the chronological sorting of the data. The validation data is from a future time period, which may have a different distribution than what the model sees during training. They decided to shuffle the data.</p>
<h4 id="v1x">v1.x</h4>
<p>A series of iterations and improvements are made after v0. In order to have a better understanding of where the instability comes from, similar to <a href="https://arxiv.org/pdf/2205.01068.pdf">Meta's OPT model</a>, the Bloomberg team plots the gradient norms across training steps to monitor for training instability.<br>
The figure below shows this for different iterations of the v1 model. Spikes are an indicator of training instability and poor performance.<br>
<strong>v1.0</strong>: In blue is the model trained after shuffling the data. As seen, there is a spike around step 14k.<br>
<strong>v1.1</strong>: Again borrowing from OPT's learnings, they rewound training to shortly before the spike (around 100 steps earlier), shuffled the next batches of data and lowered the learning rate. As the orange curve shows, the spike was merely delayed.</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2023/07/Screenshot-2023-07-29-at-11.34.06-AM.png" class="kg-image" alt="Lessons Learned from Bloomberg's GPT Model Journey" srcset="https://cherrypicked.dev/content/images/size/w600/2023/07/Screenshot-2023-07-29-at-11.34.06-AM.png 600w, https://cherrypicked.dev/content/images/size/w1000/2023/07/Screenshot-2023-07-29-at-11.34.06-AM.png 1000w, https://cherrypicked.dev/content/images/size/w1600/2023/07/Screenshot-2023-07-29-at-11.34.06-AM.png 1600w, https://cherrypicked.dev/content/images/2023/07/Screenshot-2023-07-29-at-11.34.06-AM.png 1750w" sizes="(min-width: 720px) 720px"><figcaption>Figure 2: Gradient norms across training steps for v1.x iterations</figcaption></figure><!--kg-card-begin: markdown--><p><strong>A Deeper Dive into Parameter Values</strong><br>
The team suspected that these spikes in the gradient norm had repercussions in the parameter values, so they plotted the L2 norms of each component, averaged by the square root of the number of elements. As seen in the figure below, most components follow an expected trend. However, the Input LayerNorm at layer 1 elbows and makes a sudden turn upwards.</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2023/07/Screenshot-2023-07-29-at-3.02.17-PM.png" class="kg-image" alt="Lessons Learned from Bloomberg's GPT Model Journey" srcset="https://cherrypicked.dev/content/images/size/w600/2023/07/Screenshot-2023-07-29-at-3.02.17-PM.png 600w, https://cherrypicked.dev/content/images/size/w1000/2023/07/Screenshot-2023-07-29-at-3.02.17-PM.png 1000w, https://cherrypicked.dev/content/images/size/w1600/2023/07/Screenshot-2023-07-29-at-3.02.17-PM.png 1600w, https://cherrypicked.dev/content/images/2023/07/Screenshot-2023-07-29-at-3.02.17-PM.png 1734w" sizes="(min-width: 720px) 720px"><figcaption>Figure 3: Rescaled norms of each component. The outlier is the Input LayerNorm at layer 1.</figcaption></figure><!--kg-card-begin: markdown--><p>They took a deeper look into the values of the multiplier weights $\gamma_1^{in}$, i.e. the scaling factor applied to the normalised activations in LayerNorm. All values had the same pattern, as seen below:</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2023/07/Screenshot-2023-07-29-at-3.18.30-PM.png" class="kg-image" alt="Lessons Learned from Bloomberg's GPT Model Journey" srcset="https://cherrypicked.dev/content/images/size/w600/2023/07/Screenshot-2023-07-29-at-3.18.30-PM.png 600w, https://cherrypicked.dev/content/images/size/w1000/2023/07/Screenshot-2023-07-29-at-3.18.30-PM.png 1000w, https://cherrypicked.dev/content/images/size/w1600/2023/07/Screenshot-2023-07-29-at-3.18.30-PM.png 1600w, https://cherrypicked.dev/content/images/2023/07/Screenshot-2023-07-29-at-3.18.30-PM.png 1776w" sizes="(min-width: 720px) 720px"><figcaption>Figure 4: Values of the multiplier weights in the Input LayerNorm at layer 1</figcaption></figure><!--kg-card-begin: markdown--><p>They were not able to figure out what causes this behaviour and, after training three more versions of v1, ended up restarting training from scratch with other hyperparameter changes (listed below).<br>
However, one thing they noticed from this investigation was that weight decay was being applied to the multiplier weights, which are initialised at 1. This unnecessarily penalises these values and pulls them towards zero, so it was removed.</p>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h4 id="v2">v2</h4>
<p>To train the final iteration of the model, the team made a series of hyperparameter choices as listed below:</p>
<ol>
<li>Use FP32 precision in LM-head</li>
<li>Use max learning rate of 6e-5 instead of 1e-4</li>
<li>Use a gradient clipping value of 0.3 instead of 1.0</li>
<li>Fully shuffle data</li>
<li>Use a different seed to ensure different initialization and data order</li>
<li>Reintroduce LayerNorm at embedding layer (LNem)</li>
<li>Use a longer learning rate warm-up period of 1800 steps</li>
<li>Remove incorrect use of weight decay on LayerNorm multipliers</li>
<li>Use Megatron initialization rescaling</li>
<li>Apply query key layer scaling <a href="https://arxiv.org/pdf/1909.08053.pdf">Shoeybi et al., 2019</a></li>
<li>Apply a batch size warm-up: Use a batch size of 1024 for 7200 iterations, then increase to 2048</li>
</ol>
<p>They trained this model and monitored the norms of the weights. Training went smoothly for 48 days; however, from day 41 the training and validation losses flattened. They made a number of changes, e.g. reducing the learning rate, reducing weight decay and increasing dropout. None of these helped, and since they had reached the end of their training budget, they decided to call it.</p>
<h3 id="evaluation">Evaluation</h3>
<p>The model trained for 48 days was evaluated on a dozen general language tasks, e.g. reading comprehension, knowledge assessments, etc. and compared with LLMs of similar or larger sizes.</p>
<blockquote>
<p>Among the models with tens of billions of parameters that we compare to, BloombergGPT performs the best. Furthermore, in some cases, it is competitive or exceeds the performance of much larger models (hundreds of billions of parameters).</p>
</blockquote>
<p>In addition to general language tasks, the model was evaluated on finance tasks. For example, it was used to generate Bloomberg Query Language (BQL), which was not in its training data; it was able to generate queries in a few-shot setting.</p>
<h3 id="takeaways">Takeaways</h3>
<p>This paper provides a good example of how to train custom LLMs on proprietary data from scratch. For me the main takeaway was to start small and investigate issues in the initial iterations before starting the larger-scale training. Many of the issues in the larger training setting can be caught in the smaller setting, where they are much cheaper to fix.</p>
<p>I hope you found this article helpful. If you have any questions or feedback, I would love to hear from you.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Machine Learning How To: Continuous Integration (Part II)]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>In this article of the Machine Learning How To series, I will delve deeper into the topic of continuous integration. In my previous <a href="https://cherrypicked.dev/continuous-integration-part1/">piece</a>, we explored the process of writing effective unit and integration tests using a combination of Pytest fixtures, Mocks, and FastAPI's TestClient. The article concluded by outlining</p>]]></description><link>https://cherrypicked.dev/continuous-integration-part2/</link><guid isPermaLink="false">6426e730a9852c35c34344c6</guid><category><![CDATA[CI]]></category><category><![CDATA[Continuous Integration]]></category><category><![CDATA[Software Best Practices]]></category><category><![CDATA[Github Actions]]></category><dc:creator><![CDATA[Maryam Fallah]]></dc:creator><pubDate>Sun, 02 Apr 2023 23:40:12 GMT</pubDate><media:content url="https://cherrypicked.dev/content/images/2023/03/de-an-sun-b0djE94pxkw-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://cherrypicked.dev/content/images/2023/03/de-an-sun-b0djE94pxkw-unsplash.jpg" alt="Machine Learning How To: Continuous Integration (Part II)"><p>In this article of the Machine Learning How To series, I will delve deeper into the topic of continuous integration. In my previous <a href="https://cherrypicked.dev/continuous-integration-part1/">piece</a>, we explored the process of writing effective unit and integration tests using a combination of Pytest fixtures, Mocks, and FastAPI's TestClient. The article concluded by outlining how to run these tests locally from the root directory of your repository.<br>
In this article, I will cover the important topics of linting, running tests and linting locally using customized command-line commands, and creating workflow YAML files for the CI pipeline. By the end of this article, you will have a better understanding of these key concepts and be able to implement them effectively in your own projects. I will continue to use the code from the <a href="https://github.com/MaryFllh/mlseries">mlseries</a> repo as examples.</p>
<h2 id="linting">Linting</h2>
<p>Linting is a common practice in software development. It analyses the code's style and logic and flags issues. Typos, an incorrect number of arguments passed to a function, unused variables or imports, incorrect use of tabs, and missing closing brackets are some examples of mistakes made while writing code. Linters ensure that these mistakes are caught and fixed before the code gets pushed and/or merged. Especially for projects with multiple contributors, they keep the code style consistent, which makes the code more approachable and readable.<br>
There are different options for linting in Python. <code>flake8</code> is a common one which uses a combination of different linters, <code>PyFlakes</code>, <code>pycodestyle</code> (formerly <code>pep8</code>) and <code>McCabe</code>, under the hood. This tool detects both logical and styling issues in the code.<br>
Detecting the issues is not enough especially for styling inconsistencies. For example, it can be inefficient to go to all the files without a new line at the end of the script and add one manually. It is better to use a formatting tool instead. <code>black</code> and <code>isort</code> are two such tools that format the code and sort the imports respectively. Here are the commands I run from the root of my code base for linting:</p>
<ol>
<li>Install <code>flake8</code>, <code>black</code> and <code>isort</code> in your virtual environment</li>
<li>From the root of the repo inside the terminal, run <code>flake8 .</code><br>
One thing that <code>flake8</code> checks for is line length. It allows a maximum of 79 characters (the PEP8 convention); if you want to ignore this rule, you can add <code>--ignore E501</code>. Another useful argument is <code>--count</code>, which prints out the number of discovered issues.</li>
<li>To fix the issues automatically using <code>black</code>, run <code>black .</code> from the root of your code base</li>
<li>To keep the order of the imports consistent, run <code>isort -rc .</code> (recent versions of isort recurse by default, so plain <code>isort .</code> also works)</li>
</ol>
<h2 id="customcommandlinecommandswithinvoke">Custom Command Line Commands with invoke</h2>
<p>So far, I've added tests and linting to my repo. To ensure code quality, it's important that I run these checks every time I make changes. However, this requires remembering three separate commands, each with different arguments. Is there an easier way to group these commands into a more memorable and uniform set? <code>Invoke</code> is a library that does this. It allows you to define &quot;task&quot; functions that encapsulate and execute shell commands. This way you can run commands by invoking the task functions and not worry about the actual command-line syntax.<br>
To do this, I create a <code>tasks.py</code> script in the root of my repo with the content below:</p>
<pre><code># /tasks.py

from invoke import task

file_paths = &quot;ml/ server/ deploy/ shared_utils/ tasks.py&quot;


@task
def tests(cmd):
    cmd.run(&quot;pytest&quot;)


@task
def lint(cmd):
    cmd.run(f&quot;flake8 --ignore E501 --count --statistics {file_paths}&quot;)


@task
def lintfix(cmd):
    cmd.run(f&quot;black {file_paths}&quot;)
    cmd.run(f&quot;isort -rc {file_paths}&quot;)

</code></pre>
<p>As seen, each function is decorated with <code>task</code> which expects a <code>Context</code> object as the first argument to the function. In the code above, I have named this argument, <code>cmd</code>. The context object is used to run shell commands. For example to run <code>flake8</code>, I pass the corresponding command to <code>cmd.run</code> and include any arguments, and file paths as needed. Now instead of running the <code>flake8</code> command directly from the command line, I can simply run <code>invoke lint</code> which is equivalent to running the command passed to the context object inside the <code>lint</code> task function.<br>
Similarly, to run <code>black</code> and <code>isort</code> to format the code, I run <code>invoke lintfix</code>. The task function names can be anything; I have used names that express the tasks they invoke.<br>
This is much easier to remember and is expandable to other local shell commands.</p>
<h2 id="howtocreateacipipeline">How to Create a CI pipeline</h2>
<p>In order to make the code more robust to errors, I want linting and tests to run automatically whenever code is pushed to the remote repository. If there aren't any issues, I have the green light to merge my code into the main branch, ideally after it has been reviewed by my coworkers.<br>
This is what a CI pipeline is for. It adds an extra verification step to the process of integrating new code and deploying it to live environments. There are many CI tools available. I will be using GitHub Actions since I'm using GitHub as my remote repository. There is detailed documentation on this topic on GitHub's <a href="https://docs.github.com/en/actions/learn-github-actions/understanding-github-actions">documentation page</a>.</p>
<h3 id="workflows">Workflows</h3>
<p>The core component of GitHub Actions is the <em>workflow</em>. A workflow is a set of configurable instructions in YAML that Actions uses to run one or more <em>jobs</em>. For example, to run tests on the code, I need a workflow that runs the job of checking out my repo, configuring AWS credentials, pulling data from dvc, installing requirements.txt and finally running the tests. Each task under a job, e.g. checking out the code, is called a <em>step</em>, and the actual executed task at each step is known as an <em>action</em>.<br>
For a workflow to run, a trigger <em>event</em> needs to be defined for it. An example of an event is a pull request, or a push to a certain branch.<br>
Let's delve into the mlseries workflows. To start, I create a <code>.github/workflows/</code> directory in the root of the repo. The directory name must be exactly this, in any code base, for Actions to find the workflows. There can be multiple workflows; I create two, one for linting, <code>lint.yml</code>, and one for tests, <code>tests.yml</code>.</p>
<h3 id="thelintingworkflow">The linting workflow</h3>
<p>Here's a breakdown of the lint workflow:</p>
<ol>
<li>Add a new file <code>lint.yml</code> to <code>.github/workflows/</code></li>
<li>Give the workflow a name: <code>name: Linting</code>. This will appear under the Actions tab listing all the existing workflows.</li>
<li>Define the triggering event using the keyword <code>on</code>. I would like linting to run whenever there is a push to the main branch or a pull request to it is made:</li>
</ol>
<pre><code>on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main
</code></pre>
<ol start="4">
<li>Define each job:<br>
i. Give it a name, e.g. lint. This is the name you will see under the    Jobs tab after you click on the workflow.<br>
ii. Configure the job to run on a virtual machine. GitHub supports Linux, Windows, and macOS virtual machines</li>
</ol>
<pre><code>jobs:
  lint:
    runs-on: ubuntu-latest
</code></pre>
<ol start="5">
<li>Group the steps using the <code>steps</code> keyword:</li>
</ol>
<pre><code>    steps:
</code></pre>
<ol start="6">
<li>Nest each action item under steps:<br>
i. Define a checkout action that will check out the repo's code. The keyword <code>uses</code> means that this step uses an action, specifically version 3 of the <code>actions/checkout</code> action, to check out the repo on the job runner.<pre><code>- uses: actions/checkout@v3
</code></pre>
ii. Define an installation action: To run flake8 using invoke, we need to install these packages. Note that it is not necessary to use invoke, however, I wanted to keep it consistent with the local setup.<pre><code>- name: Install flake8 
  run: |
    python -m pip install --upgrade pip
    python -m pip install flake8
    python -m pip install invoke
</code></pre>
As seen, I use the keyword <code>name</code> to give this action a name. This makes it easier to see which step the runner is at and where it fails. <code>run: |</code> is used to define multi-line run commands.<br>
iii. Define the lint action: Similar to running lint locally, I want the runner to run <code>invoke lint</code>. I pass the command to the <code>run</code> keyword:<pre><code>- run: invoke lint
</code></pre>
</li>
</ol>
<p>Here is the content of <code>lint.yml</code> from putting all the snippets above together:</p>
<pre><code># /.github/workflows/lint.yml

name: Linting

on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    
    - name: Install flake8 
      run: |
        python -m pip install --upgrade pip
        python -m pip install flake8
        python -m pip install invoke
    
    - run: invoke lint

</code></pre>
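<p>One optional improvement, not part of the workflow above, is to cache pip's downloads between runs so the install step gets faster. The <code>actions/setup-python</code> action has built-in pip caching; a sketch of the extra step, which would go right after the checkout step:</p>

```yaml
    - uses: actions/setup-python@v4
      with:
        python-version: '3.10'   # pick the version your project targets
        cache: 'pip'             # cache pip downloads between workflow runs
```

<p>This also pins the Python version explicitly instead of relying on whatever the runner image ships with.</p>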
<h4 id="thetestingworkflow">The testing workflow</h4>
<p>The tests should run every time there has been a change to any of the scripts in the directories that contain the tests; in this example, the <code>server/</code> and <code>ml/</code> directories. To add a workflow for tests, follow the steps below:</p>
<ol>
<li>Add a workflow file: <code>.github/workflows/tests.yml</code></li>
<li>Give the workflow a name: <code>name: Run Tests</code></li>
<li>The tests are dependent on some environment variables to run successfully, e.g. the data file path. These variables can be defined using the <code>env</code> keyword:</li>
</ol>
<pre><code>env:
    DATA_PATH: &quot;data&quot;
    DATA_FILE: &quot;reviews.tsv&quot;
    TRAIN_DATA_FILE: &quot;train_data.parquet&quot;
    TEST_DATA_FILE: &quot;test_data.parquet&quot;
    MODEL_PATH: &quot;model&quot;
    MODEL_FILE: &quot;lr.model&quot;
    AWS_SECRET_ACCESS_KEY_ID: ${{secrets.AWS_SECRET_ACCESS_KEY_ID}}
    AWS_SECRET_ACCESS_KEY: ${{secrets.AWS_SECRET_ACCESS_KEY}}
    ML_BASE_URI: &quot;http://ml:3000&quot;
</code></pre>
<p>Note that I didn't copy/paste my AWS credentials, for obvious reasons. Instead, I added them to my repository secrets and access them using the secrets context. Here's how you can add secrets to your repository:</p>
<ul>
<li>Go to the Settings tab in your repo</li>
<li>Under Security click on &quot;Secrets and variables&quot;</li>
<li>Select &quot;New repository secret&quot;</li>
<li>Create a secret Name, e.g. <code>AWS_SECRET_ACCESS_KEY_ID</code></li>
<li>Put in the value of your secret</li>
<li>Add as many secrets needed</li>
</ul>
<ol start="4">
<li>Define the trigger event. I want the tests to run on pushes to the main branch or when pull requests are made to the main branch, but only if scripts inside <code>server/</code> and/or <code>ml/</code> have been modified. Setting up and running tests can be time-consuming and should therefore only be triggered when relevant code changes. I use the <code>paths</code> keyword to specify the directories that should trigger this workflow (assuming it is a push or PR to the main branch):</li>
</ol>
<pre><code>on:
  push:
    branches:
      - main
    paths:
      - 'ml/**'
      - 'server/**'

  pull_request:
    branches:
      - main
    paths:
      - 'ml/**'
      - 'server/**'
</code></pre>
<ol start="5">
<li>Define the job and the virtual machine it should run on:</li>
</ol>
<pre><code>jobs:
  run-tests:
    runs-on: ubuntu-latest
    steps:
</code></pre>
<ol start="6">
<li>Nest each action item under steps:<br>
i. Add a step to configure AWS using the repo's secrets. This is necessary for later steps that need to access the repo's remote S3 storage to pull the data and model:<pre><code>  - name: Configure AWS credentials
    uses: aws-actions/configure-aws-credentials@v1
    with:
      aws-access-key-id: ${{ secrets.AWS_SECRET_ACCESS_KEY_ID }}
      aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      aws-region: us-east-2
</code></pre>
ii. Checkout the code: <code>- uses: actions/checkout@v3</code><br>
iii. Install the packages necessary for <code>invoke tests</code> to run. This includes the packages in <code>server</code> and <code>ml</code>'s requirements.txt, boto3 and dvc:<pre><code>  - name: Install dependencies
    run: |
      python -m pip install --upgrade pip
      if [ -f ml/requirements.txt ]; then pip install -r ml/requirements.txt; fi
      if [ -f server/requirements.txt ]; then pip install -r server/requirements.txt; fi
      pip install boto3
      pip install --upgrade dvc[s3]
</code></pre>
iv. Create a <code>model/</code> directory and pull the model from S3 using dvc:<pre><code>  - name: Pull Data from Remote DVC Storage
    run: |
      cd ml
      mkdir model
      dvc pull ./model/lr.model
</code></pre>
v. Run the tests:<pre><code>  - name: Run Tests
    run: |
      invoke tests
</code></pre>
</li>
</ol>
<p>Here are all of the above snippets put together:</p>
<pre><code># /.github/workflows/tests.yml

name: Run Tests

env:
    DATA_PATH: &quot;data&quot;
    DATA_FILE: &quot;reviews.tsv&quot;
    TRAIN_DATA_FILE: &quot;train_data.parquet&quot;
    TEST_DATA_FILE: &quot;test_data.parquet&quot;
    MODEL_PATH: &quot;model&quot;
    MODEL_FILE: &quot;lr.model&quot;
    AWS_SECRET_ACCESS_KEY_ID: ${{secrets.AWS_SECRET_ACCESS_KEY_ID}}
    AWS_SECRET_ACCESS_KEY: ${{secrets.AWS_SECRET_ACCESS_KEY}}
    ML_BASE_URI: &quot;http://ml:3000&quot;

on:
  push:
    branches:
      - main
    paths:
      - 'ml/**'
      - 'server/**'

  pull_request:
    branches:
      - main
    paths:
      - 'ml/**'
      - 'server/**'

jobs:
  run-tests:
    runs-on: ubuntu-latest
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_SECRET_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-2
      - uses: actions/checkout@v3
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          if [ -f ml/requirements.txt ]; then pip install -r ml/requirements.txt; fi
          if [ -f server/requirements.txt ]; then pip install -r server/requirements.txt; fi
          pip install boto3
          pip install --upgrade dvc[s3]
      - name: Pull Data from Remote DVC Storage
        run: |
          cd ml
          mkdir model
          dvc pull ./model/lr.model
      - name: Run Tests
        run: |
          invoke tests

</code></pre>
<p>With the defined workflows, whenever there is a trigger event, you will notice jobs running under the Actions tab on GitHub. Notice that the names I used for the workflows are displayed on the left under <strong>All workflows</strong>. The titles of the workflow runs on the right side are the commit messages.</p>
<p><img src="https://cherrypicked.dev/content/images/2023/04/Screenshot-2023-04-01-at-2.42.13-PM.png" alt="Machine Learning How To: Continuous Integration (Part II)"></p>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><p>If you click on a certain workflow, you will see all the jobs defined in that workflow on the left. Notice that <code>run-tests</code> was the name I gave to the job inside <code>.github/workflows/tests.yml</code>. I can view all the steps for each job by clicking on the job icon in the middle of the page:</p>
<p><img src="https://cherrypicked.dev/content/images/2023/04/Screenshot-2023-04-01-at-2.45.11-PM.png" alt="Machine Learning How To: Continuous Integration (Part II)"></p>
<p>As seen below, all the steps I defined in the workflow are there. You can expand any step, e.g. <code>Run Tests</code>, and see the results. This is useful when the workflow fails and you need to troubleshoot.</p>
<p><img src="https://cherrypicked.dev/content/images/2023/04/Screenshot-2023-04-01-at-2.49.24-PM.png" alt="Machine Learning How To: Continuous Integration (Part II)"></p>
<h2 id="wrapup">Wrap-up</h2>
<p>And there you have it! You now know the main components of a GitHub Actions CI pipeline and how to build one. I hope you found this article applicable to your own project. If you have any questions or feedback, I would love to hear from you on <a href="https://twitter.com/mary_flh">Twitter</a>.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Machine Learning How To: Continuous Integration (Part I)]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>I've been writing on different stages of the machine learning (ML) life-cycle in a series of blog posts. The <a href="https://cherrypicked.dev/machine-learning-how-to-training/">first article </a> was about building traceable and reproducible pipelines. This allows for data to be version-controlled and experiments to be run in a way that makes comparisons easy and reproducible. The</p>]]></description><link>https://cherrypicked.dev/continuous-integration-part1/</link><guid isPermaLink="false">620995aae619ed37e057378c</guid><dc:creator><![CDATA[Maryam Fallah]]></dc:creator><pubDate>Thu, 23 Mar 2023 23:28:03 GMT</pubDate><media:content url="https://cherrypicked.dev/content/images/2023/03/continous.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://cherrypicked.dev/content/images/2023/03/continous.jpg" alt="Machine Learning How To: Continuous Integration (Part I)"><p>I've been writing on different stages of the machine learning (ML) life-cycle in a series of blog posts. The <a href="https://cherrypicked.dev/machine-learning-how-to-training/">first article </a> was about building traceable and reproducible pipelines. This allows for data to be version-controlled and experiments to be run in a way that makes comparisons easy and reproducible. The <a href="https://cherrypicked.dev/ml-as-a-service/">second article</a> covered exposing ml inference as a service. I used Fastapi to write APIs and docker to containerise the code and gave <a href="https://github.com/MaryFllh/mlseries">code examples</a> of how to run a simple application that predicts the sentiment of text on a local machine.<br>
In this article, I will cover unit tests and integration tests. In the second part, I will go over linting and setting up a CI pipeline. These are best practices for any software project and are not unique to machine learning. Tests verify that the code behaves as expected and that refactoring does not break existing functionality. Linting makes sure that your code conforms to standard formatting rules and that there are no syntax errors or unused variables. A CI pipeline automatically runs all the existing tests and checks for linting issues whenever you push your code to a remote repository or create a pull request. In case of errors, the build fails and you are notified. This keeps the quality of the code in check and saves you a lot of time by surfacing bugs early, before they affect users or other team members.<br>
I will be using the <a href="https://github.com/MaryFllh/mlseries">mlseries</a> repo for code examples and Pytest for tests.</p>
<h2 id="unittests">Unit tests</h2>
<p>The simple application in this repo has two main folders: <code>ml/</code>, which contains all the ML code, including the <a href="https://dvc.org/doc/start/data-management/data-pipelines">dvc pipelines</a> for data preprocessing, training, and testing the model, as well as the inference service.<br>
The other is the <code>server/</code> folder, which includes the code for the backend service that receives requests from the user and forwards them to the ML inference service.<br>
Unit testing for ML systems can differ from traditional software testing due to the nondeterministic nature of parts of the system. Here's a great <a href="https://www.jeremyjordan.me/testing-ml/">article</a> from Jeremy Jordan on testing ML systems. In this post, the focus is on testing the deterministic aspects of the ML code.</p>
<h3 id="mockdependenciesandisolatetestswithsetup">Mock dependencies and isolate tests with SetUp</h3>
<p>Below is a code snippet for preparing the data for training and testing:</p>
<pre><code># /ml/prepare_data/main.py

import pandas as pd
from model_params import TrainTestSplit
from sklearn.model_selection import train_test_split

from .config import Config


def split_train_test_data(config=Config()):
    data = pd.read_csv(f&quot;{config.data_path}/{config.data_file}&quot;, sep=&quot;\t&quot;)
    data = data[(data[&quot;Sentiment&quot;] == 0) | (data[&quot;Sentiment&quot;] == 4)]
    data[&quot;labels&quot;] = data.apply(lambda x: 0 if x[&quot;Sentiment&quot;] == 0 else 1, axis=1)

    data_train, data_test = train_test_split(data, test_size=TrainTestSplit.TEST_SIZE)

    data_train.to_parquet(f&quot;{config.data_path}/{config.train_data_file}&quot;)
    data_test.to_parquet(f&quot;{config.data_path}/{config.test_data_file}&quot;)
</code></pre>
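<p>The filter-and-relabel step in this function can be seen in isolation on a toy dataframe. A quick sketch with made-up rows (the column names match the real data):</p>

```python
import pandas as pd

# The filter-and-relabel step from split_train_test_data, run on a toy
# dataframe: keep only Sentiment values 0 and 4, then map them to 0/1.
data = pd.DataFrame(
    {"Sentiment": [0, 1, 4, 2, 4], "Phrase": ["foo", "bar", "baz", "qux", "quux"]}
)
data = data[(data["Sentiment"] == 0) | (data["Sentiment"] == 4)]
data["labels"] = data.apply(lambda x: 0 if x["Sentiment"] == 0 else 1, axis=1)

print(data["labels"].tolist())  # [0, 1, 1]
```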
<p>In this simple function, the data is loaded, only rows with labels <code>0</code> and <code>4</code> are selected, a new binary column, <code>labels</code>, is added by mapping the <code>Sentiment</code> values <code>0</code> and <code>4</code> to <code>0</code> and <code>1</code>, then the selected data is split into train and test sets and finally saved as parquet files.<br>
In general, a function may have dependencies on other functions which affect its return value. For example, the <code>split_train_test_data</code> function above depends on <code>pandas</code> reading a certain file. In addition, <code>train_test_split</code> from <code>sklearn</code> is used to split the dataframe and finally, the result is saved to disk using <code>pandas</code>.<br>
When writing a unit test, the goal is to test the behaviour of the function in question and assume all its dependencies return the expected results. We, therefore, do not need to worry about reading and writing the data (which might be large and slow down tests) or whether <code>train_test_split</code> works or not. We can <em>mock</em> these dependencies and inject them into the function. Python has a built-in <code>unittest</code> library that can be used for this purpose.<br>
I create a test folder inside the <code>ml/</code> directory named <code>tests/unittests/prepare_data/</code> and add a file <code>test_prepare_data.py</code> there. The folder structure is a personal preference. I find it easier to follow if the tests mimic the same structure of the code. Below is the content of this file:</p>
<pre><code># /ml/tests/unittests/prepare_data/test_prepare_data.py

from unittest import mock, TestCase
from unittest.mock import Mock, patch

import pandas as pd
from prepare_data.__main__ import split_train_test_data


class TestDataPreparation(TestCase):
    def setUp(self):
        # Create a mock self.config object and set data paths
        self.config = Mock()
        self.config.data_path = &quot;/data&quot;
        self.config.data_file = &quot;data_file.csv&quot;
        self.config.train_data_file = &quot;train.csv&quot;
        self.config.test_data_file = &quot;test.csv&quot;

    @patch(&quot;prepare_data.__main__.Config&quot;, autospec=True)
    @patch(&quot;prepare_data.__main__.pd.read_csv&quot;)
    @patch(&quot;prepare_data.__main__.train_test_split&quot;)
    @patch(&quot;prepare_data.__main__.pd.DataFrame.to_parquet&quot;)
    def test_split_train_test_data(self, mock_to_parquet, mock_train_test_split, mock_read_csv, mock_config):

        mock_config.return_value = self.config
        
        # Mock what pd.read_csv will return
        mock_read_csv.return_value = pd.DataFrame(
            {
                &quot;Sentiment&quot;: [0, 1, 4, 2, 4, 3, 0, 4, 2, 0, 1],
                &quot;Phrase&quot;: [
                    &quot;foo&quot;,
                    &quot;char&quot;,
                    &quot;kar&quot;,
                    &quot;bar&quot;,
                    &quot;lar&quot;,
                    &quot;baz&quot;,
                    &quot;qux&quot;,
                    &quot;quux&quot;,
                    &quot;corge&quot;,
                    &quot;vid&quot;,
                    &quot;chill&quot;,
                ],
            }
        )

        # The expected intermediary dataframe before calling train_test_split()
        expected_data_before_splitting = pd.DataFrame(
            {
                &quot;Sentiment&quot;: [0, 4, 4, 0, 4, 0],
                &quot;Phrase&quot;: [&quot;foo&quot;, &quot;kar&quot;, &quot;lar&quot;, &quot;qux&quot;, &quot;quux&quot;, &quot;vid&quot;],
                &quot;labels&quot;: [0, 1, 1, 0, 1, 0],
            }
        )

        # Define a return value for train_test_split()
        mock_train_test_split.return_value = (
            pd.DataFrame(
                {
                    &quot;Sentiment&quot;: [0, 0, 4, 4, 0],
                    &quot;Phrase&quot;: [&quot;foo&quot;, &quot;kar&quot;, &quot;lar&quot;, &quot;qux&quot;, &quot;vid&quot;],
                    &quot;labels&quot;: [0, 0, 1, 1, 0],
                }
            ),
            pd.DataFrame({&quot;Sentiment&quot;: [4], &quot;Phrase&quot;: [&quot;quux&quot;], &quot;labels&quot;: [1]}),
        )

        # Call the function
        split_train_test_data()

        # Verify that pd.read_csv() was called once with the mocked path
        mock_read_csv.assert_called_once_with(
            f&quot;{self.config.data_path}/{self.config.data_file}&quot;, sep=&quot;\t&quot;
        )

        # Verify that train_test_split() was called with the expected df
        pd.testing.assert_frame_equal(
            mock_train_test_split.call_args[0][0].reset_index(drop=True),
            expected_data_before_splitting.reset_index(drop=True),
        )

        # Verify that to_parquet was called twice
        assert mock_to_parquet.call_count == 2
        
        # Check that to_parquet was called with the correct train/test paths
        mock_to_parquet.assert_any_call(
            f&quot;{self.config.data_path}/{self.config.train_data_file}&quot;
        )
        mock_to_parquet.assert_any_call(
            f&quot;{self.config.data_path}/{self.config.test_data_file}&quot;
        )

</code></pre>
<p>As seen, anything that is not part of the core functionality of <code>split_train_test_data()</code> is mocked. Here's a breakdown of how to write the above test code:</p>
<ol>
<li>Create a test class that inherits from <a href="https://docs.python.org/3/library/unittest.html#unittest.TestCase"><code>unittest.TestCase</code></a>. <code>TestCase</code> has the interface for running and testing code.</li>
<li>Create a <code>setUp</code> method, which is inherited from the base class and runs before every defined test. This is especially useful if you have multiple tests using the same set-up. In that case, a <code>tearDown</code> method should be added so that after every test any changes made to the set-up are removed and the next test uses the original set-up. Since my test class only has one method, I didn't use the <code>tearDown</code> method.</li>
<li>What should be set up before testing <code>split_train_test_data</code>? I need to define the <code>Config</code> class and set its attributes since I don't want the function to actually read/write data from/to the real <code>data_path</code>. So, I define a mock config object using <code>Mock()</code> and set the attributes to dummy strings.</li>
<li>Define a method that will test <code>split_train_test_data</code>. The name of the test function should start with &quot;test&quot; for Pytest to find it, e.g. <code>test_&lt;name_of_function_to_be_tested&gt;</code>.</li>
<li>Decorate the test method with the mocked objects. I use <code>unittest.mock.patch</code> for this. There are 4 mocked objects here: the <code>config</code> instance, <code>read_csv</code>, <code>train_test_split</code> and <code>to_parquet</code>. These mocked objects are passed to the test function.<br>
Note that for <code>Config</code>, <code>autospec</code> is set to <code>True</code>. This ensures that the mocked config object has all the attributes of the <code>Config</code> class.</li>
<li>Inside the test, set the return values of the mocked objects to static, predefined values. For example, create a dataframe with dummy rows and set that to be the return value of <code>pd.read_csv</code>. Remember that the functionality I want to test here is whether, after the data is loaded (or mocked), <code>split_train_test_data</code>: 1) extracts only the phrases with sentiment values <code>0</code> and <code>4</code>, and 2) adds a <code>labels</code> column with values <code>0</code> and <code>1</code>.</li>
<li>Based on the defined dummy data, create the expected dataframe that <code>split_train_test_data</code> should pass to <code>train_test_split</code>. I call this <code>expected_data_before_splitting</code>.</li>
<li>Call <code>split_train_test_data</code>.</li>
<li>Verify that the dependencies were called the expected number of times and with the expected values. To be more specific:
<ol>
<li>Verify that <code>read_csv()</code> was called once with the mocked configured path.</li>
<li>Make sure that <code>train_test_split</code> was called with <code>expected_data_before_splitting</code>.</li>
<li><code>split_train_test_data</code> makes two calls to <code>to_parquet</code>. I verify that this is the case in the test with <code>assert mock_to_parquet.call_count == 2</code>.</li>
<li>I also verify that <code>to_parquet</code> was called with the correct dummy data paths and file names defined in the <code>setUp</code> method.</li>
</ol>
</li>
</ol>
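<p>Here is the same patch-and-assert pattern boiled down to a minimal, self-contained example. The <code>load_data</code> function below is hypothetical, purely for illustration:</p>

```python
from unittest.mock import patch

import pandas as pd


def load_data(path):
    # Hypothetical function under test: delegates file reading to pandas.
    return pd.read_csv(path, sep="\t")


canned = pd.DataFrame({"a": [1, 2]})

# Patch read_csv so no file is touched, then assert on how it was called.
with patch("pandas.read_csv", return_value=canned) as mock_read_csv:
    result = load_data("/fake/path.tsv")

mock_read_csv.assert_called_once_with("/fake/path.tsv", sep="\t")
print(result["a"].tolist())  # [1, 2]
```

<p>The idea is the same as in the test above: the dependency is replaced with a mock, the function runs against it, and the assertions check how the dependency was used rather than what it actually did.</p>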
<h3 id="managedependenciesusingpytestfixtures">Manage dependencies using pytest.fixtures</h3>
<p>Similarly, unit tests for other parts of the code, e.g. training, and inference can be written. I will give one more example, but this time instead of using <code>unittest.TestCase.setUp</code> I use <code>pytest.fixture</code>. A fixture is a function that returns an object or value that is required by one or more tests. <code>pytest.fixture</code> is a decorator that allows you to define and use fixtures for your tests. The fixtures need to be passed as input to the tests that require their return values. This way, similar to <code>unittest.TestCase.setUp</code>, when the tests are run, the fixture functions are run first and their return value is passed to the tests that take them as input.<br>
You can define different scopes for your fixtures, e.g. function, module, and class. The scope determines how often the fixture function should be called and when it should be destroyed. For example, in the default scope, function, the fixture functions are called and then destroyed after each test function that uses them. <a href="https://betterprogramming.pub/understand-5-scopes-of-pytest-fixtures-1b607b5c19ed">Here's</a> a great article from Xiaoxu Gao about fixtures and their scopes.<br>
Let's see fixtures in action. Below is the code snippet for creating a <code>Pipeline</code> from <code>sklearn</code> and fitting it to the training data:</p>
<pre><code># ml/train/__main__.py

import joblib
import pandas as pd
from model_params import LogisticRegressionConfig
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

from .config import Config


def train():
    &quot;&quot;&quot;
    Loads the training data, creates, and fits a pipeline on
    the data and saves the model
    &quot;&quot;&quot;
    config = Config()

    lr_params = {
        &quot;n_jobs&quot;: LogisticRegressionConfig.n_jobs,
        &quot;C&quot;: LogisticRegressionConfig.C,
        &quot;max_iter&quot;: LogisticRegressionConfig.max_iter,
    }
    train_dataframe = pd.read_parquet(f&quot;{config.data_path}/{config.train_data_file}&quot;)

    X = train_dataframe[&quot;Phrase&quot;]
    y = train_dataframe[&quot;labels&quot;]

    clf = Pipeline(
        [
            (&quot;vect&quot;, CountVectorizer()),
            (&quot;tfidf&quot;, TfidfTransformer()),
            (&quot;clf&quot;, LogisticRegression(**lr_params)),
        ]
    )
    clf.fit(X, y)
    joblib.dump(clf, f&quot;{config.model_path}/{config.model_file}&quot;)
</code></pre>
<p>As seen in the code above, after the configurable variables and the training data are loaded, an <code>sklearn.pipeline.Pipeline</code> is defined and fit to the data. Finally, the fitted model is saved to the configured path. A unit test can verify that:</p>
<ol>
<li>The training data was loaded from the expected path.</li>
<li>A <code>Pipeline</code> object was fit to the data and dumped to disk.</li>
</ol>
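<p>To see what the test is ultimately protecting, the pipeline itself can be fit on a couple of toy phrases. A quick sketch with made-up data and default <code>LogisticRegression</code> parameters, not the repo's configured ones:</p>

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy phrases standing in for the real review data.
X = ["great movie", "loved it", "really great film",
     "terrible movie", "awful plot", "so awful"]
y = [1, 1, 1, 0, 0, 0]

clf = Pipeline(
    [
        ("vect", CountVectorizer()),      # tokens -> counts
        ("tfidf", TfidfTransformer()),    # counts -> tf-idf weights
        ("clf", LogisticRegression()),    # weights -> 0/1 prediction
    ]
)
clf.fit(X, y)
print(clf.predict(["an awful film"]))
```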
<p>Since I don't want to read the real training data, I will define a mocked <code>Config</code> class instance using fixtures. Below is the unit test code:</p>
<pre><code># ml/tests/unittests/train/

import pandas as pd
import pytest

from sklearn.pipeline import Pipeline
from unittest.mock import Mock, patch
from train.__main__ import train


@pytest.fixture
def config():
    # Create a mock config object and set data paths
    config = Mock()
    config.data_path = &quot;/path/to/data&quot;
    config.train_data_file = &quot;train_data.parquet&quot;
    config.model_path = &quot;/path/to/model&quot;
    config.model_file = &quot;model.joblib&quot;
    return config


@patch(&quot;train.__main__.Config&quot;, autospec=True)
@patch(&quot;train.__main__.pd.read_parquet&quot;)
@patch(&quot;train.__main__.joblib.dump&quot;)
def test_train(mock_dump, mock_read_parquet, mock_config, config):
    mock_config.return_value = config

    # Create mock data
    data = pd.DataFrame(
        {
            &quot;Phrase&quot;: [&quot;this is a test&quot;, &quot;this is another test&quot;],
            &quot;labels&quot;: [0, 1],
        }
    )
    mock_read_parquet.return_value = data

    # Call the train function
    train()

    # Check that pd.read_parquet was called with the correct file path
    mock_read_parquet.assert_called_once_with(
        f&quot;{config.data_path}/{config.train_data_file}&quot;
    )

    # Check that joblib.dump was called with a Pipeline object
    _, args, _ = mock_dump.mock_calls[0]
    assert isinstance(args[0], Pipeline)

</code></pre>
<p>The main difference with <code>test_split_train_test_data</code> is the use of <code>pytest.fixture</code>. Notice that there are two arguments related to <code>Config</code> passed to the test function above, one is, <code>mock_config</code>, which replaces the call to <code>config = Config()</code> in <code>ml/train/__main__.py</code>. The other is the fixture function <code>config</code> which contains the mocked config object with the set dummy values. Now I need to actually set this mocked config object to the return value of the call to <code>Config()</code> inside <code>ml/train/__main__.py</code>. This is what the first line in <code>test_train()</code> does.<br>
I then define a dataframe, set it as the return value of <code>read_parquet</code>, and verify that the read was attempted with the correct mocked data path.<br>
Lastly, I check that <code>joblib.dump</code> saved a <code>Pipeline</code> object.</p>
<h2 id="integrationtests">Integration Tests</h2>
<p>In unit tests, the goal is to verify that a small, isolated piece of code works as expected. This is useful but doesn't give us confidence that the different components of the application work together. That is the purpose of integration tests. With integration tests, we don't write a test for every function involved in a certain service but rather treat the service as a black box. For example, in this codebase, I want to test the server service and make sure that making requests to the <code>review_sentiment</code> endpoint with different reviews results in the expected status codes and response messages. In this case, I don't need to know how the server service runs or whether it calls some other internal services. All I care about is: if I hit the endpoint with a valid input, do I get the expected result? Because the test treats the implementation of the service as a black box, if any changes are made to components related to this service during refactoring, a passing integration test gives me confidence that the changes did not affect the expected results.<br>
These tests are more involved and might require some dependencies to be set up beforehand, e.g. a local test database. However, the tests should be confined to the internal service you are testing. For example, if your endpoint makes a POST request to S3 or the ML service, you should assume that those services work and return the expected results.<br>
Let's see an example. Below is the code in <code>server/server_main.py</code> which I want to test.</p>
<pre><code># server/server_main.py

import re

import requests
from config import Config
from entities.review import Review
from fastapi import FastAPI, HTTPException, status

app = FastAPI()
config = Config()


@app.post(&quot;/review_sentiment/&quot;, status_code=status.HTTP_200_OK)
def add_review(review: Review):
    &quot;&quot;&quot;
    This endpoint receives reviews, and creates an appropriate
    payload and makes a post request to ml's endpoint for a
    prediction on the input

    Args:
        review(Review): the input should conform to the Review model

    Returns:
        response(dict): response from ml's endpoint which includes
                        prediction details

    Raises:
        HTTPException: if the input is a sequence of numbers
    &quot;&quot;&quot;
    endpoint = f&quot;{config.ml_base_uri}/prediction_job/&quot;
    if re.compile(r&quot;^\-?[1-9][0-9]*$&quot;).search(review.text):
        raise HTTPException(status_code=422, detail=&quot;Review cannot be a number&quot;)

    resp = requests.post(endpoint, json=review.dict())
    return resp.json()
</code></pre>
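<p>The number-detection guard in <code>add_review</code> can be exercised on its own. Note that, as written, the pattern rejects plain integers (optionally negative) but does not flag numbers with a leading zero:</p>

```python
import re

# The same pattern used in the endpoint above.
number_pattern = re.compile(r"^\-?[1-9][0-9]*$")

for text in ["123458303", "-42", "great movie", "12.5", "007"]:
    print(text, bool(number_pattern.search(text)))
```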
<p>In this simple service, a <code>Review</code> object is received; if the text in the review is a number, an exception is raised with the message &quot;Review cannot be a number&quot;. Otherwise, the review is posted to <code>endpoint</code>, which is the ML service, and its response is returned.<br>
To test this service, I need to simulate the server <code>app</code> and make a <code>POST</code> request to it. FastAPI's <code>TestClient</code> makes this possible. As mentioned earlier, the server's external dependencies should be mocked. To mock the <code>POST</code> request to the ml service, I use <a href="https://requests-mock.readthedocs.io/en/latest/index.html"><code>requests_mock</code></a>.<br>
To test the server's <code>review_sentiment</code> endpoint, I create a <code>tests/integration_tests</code> directory and add a <code>test_server_main.py</code> file with the contents below:</p>
<pre><code># server/tests/integration_tests/test_server_main.py

import re

import pytest
from config import Config
from entities.review import Review
from fastapi.testclient import TestClient
from main import add_review, app
from pydantic import ValidationError
from requests_mock import ANY


@pytest.fixture
def config():
    return Config()


@pytest.fixture(scope=&quot;module&quot;)
def module_client():
    with TestClient(app) as c:
        yield c


@pytest.fixture
def client(module_client, requests_mock):
    # ref: https://github.com/encode/starlette/issues/818
    test_app_base_url_prefix_regex = re.compile(
        rf&quot;{re.escape(module_client.base_url)}(/.*)?&quot;
    )
    requests_mock.register_uri(ANY, test_app_base_url_prefix_regex, real_http=True)
    return module_client


def test_add_review(config, requests_mock, client):
    review1 = &quot;hi&quot;
    with pytest.raises(ValidationError) as exc_info:
        add_review(Review(text=review1))
    assert exc_info.value.errors() == [
        {
            &quot;loc&quot;: (&quot;text&quot;,),
            &quot;msg&quot;: &quot;min_length:4&quot;,
            &quot;type&quot;: &quot;value_error.any_str.min_length&quot;,
            &quot;ctx&quot;: {&quot;limit_value&quot;: 4},
        }
    ]

    review2 = 123458303
    response = client.post(&quot;/review_sentiment/&quot;, json=Review(text=review2).dict())
    assert response.status_code == 422
    assert response.json()[&quot;detail&quot;] == &quot;Review cannot be a number&quot;

    # Mock the request to ML, and its response
    adapter = requests_mock.post(
        f&quot;{config.ml_base_uri}/prediction_job/&quot;,
        json={&quot;prediction&quot;: &quot;positive&quot;, &quot;prediction_score&quot;: 80},
    )
    review3 = &quot;very pleasant. highly recommend&quot;
    client.post(&quot;/review_sentiment/&quot;, json=Review(text=review3).dict())
    
    # Verify that a call to ML was made with the correct input
    assert adapter.last_request.json() == {&quot;text&quot;: review3}
    assert adapter.call_count == 1

</code></pre>
<p>Here's a break down of the code:</p>
<ol>
<li>Create an instance of <code>Config</code> using <code>pytest.fixture</code>.</li>
<li>Define a <code>TestClient</code> instance by passing it the server app; I call this <code>module_client</code>. I set the scope of this fixture to <code>module</code> so it can be used by all tests in the script.</li>
<li>Since I want to test only the server endpoint and mock the ML service's, I need to make sure that in the test function, calls to any of the server endpoints are not mocked. For this, I override <code>requests_mock</code> in a fixture by adding the server's base URL to its registered URIs and setting <code>real_http=True</code>. This is what the <code>client</code> fixture does. To be more specific:
<ol>
<li>I define a regex pattern for the server's base URL, which allows any characters after <code>module_client.base_url/</code></li>
<li>Register this URL pattern, for <code>ANY</code> HTTP method, to pass through as a real HTTP request</li>
</ol>
</li>
<li>Define a test function, <code>test_add_review</code>, and pass all the fixtures to it.</li>
<li>Define faulty inputs and verify that the expected results are returned; this is what <code>review1</code> and <code>review2</code> test.</li>
<li>Define a valid input, <code>review3</code>, mock the post request to ML and set its response, e.g. <code>{&quot;prediction&quot;: &quot;positive&quot;, &quot;prediction_score&quot;: 80}</code>.</li>
<li>Verify that the server test app did call the ML service with <code>review3</code>.</li>
</ol>
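<p>If you are not using <code>requests_mock</code>, the same mock-the-boundary idea can be expressed with plain <code>unittest.mock</code>. A simplified sketch with a hypothetical <code>submit_review</code> helper standing in for the server endpoint:</p>

```python
from unittest.mock import patch

import requests


def submit_review(text, endpoint="http://ml:3000/prediction_job/"):
    # Hypothetical helper mirroring what the server endpoint does:
    # POST the review to the ML service and return its JSON response.
    resp = requests.post(endpoint, json={"text": text})
    return resp.json()


# Replace requests.post so no network call happens; set a canned response.
with patch("requests.post") as mock_post:
    mock_post.return_value.json.return_value = {
        "prediction": "positive",
        "prediction_score": 80,
    }
    result = submit_review("very pleasant. highly recommend")

# Assert on how the boundary was crossed, not on the ML service itself.
mock_post.assert_called_once_with(
    "http://ml:3000/prediction_job/",
    json={"text": "very pleasant. highly recommend"},
)
print(result["prediction"])  # positive
```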
<p>Similarly, the ML service can be tested. I will not include the code snippet here since it's very similar to the server's integration tests, but you can find it in the repo.</p>
<h2 id="runningthetests">Running the Tests</h2>
<p>For the tests to run successfully on your local machine, you need to set up some paths and environment variables. These configurations can go in a <code>conftest.py</code>, which is a special file used by Pytest for any plugins, fixtures, paths, etc. that need to be in place for the tests to run.<br>
I create a separate <code>conftest.py</code> for the <code>ml/</code> and <code>server/</code> directories. Because I want to be able to run the code outside of Docker, I need to provide Pytest with the path to <code>shared_utils/logger</code>, which is mounted to the containers. I do this by adding the lines below to <code>ml/conftest.py</code>:</p>
<pre><code># ml/conftest.py

import os
import sys

&quot;&quot;&quot;
Loading shared scripts for unittest
This prevents throwing errors when unit test is running without logging code mounted in utils/
&quot;&quot;&quot;
logger_path = os.path.abspath(&quot;./shared_utils/unittest&quot;)
sys.path.insert(0, logger_path)


# The ml code runs with /ml as its working directory; changing into it
# lets relative paths (loading the model, data, etc.) resolve correctly
os.chdir(os.path.join(os.getcwd(), &quot;ml&quot;))
</code></pre>
<p>My goal is to run all the tests from the root of the repo; having one place to run every test from avoids switching between directories.<br>
So that the paths to the data and model resolve correctly for the <code>ml</code> scripts, I change the working directory to <code>ml</code> after inserting the logger path.<br>
For <code>server/conftest.py</code>, because the server's config uses some environment variables, I use <code>load_dotenv</code>.</p>
<pre><code># server/conftest.py

from dotenv import load_dotenv

load_dotenv(&quot;../deploy/.env&quot;)
</code></pre>
<p>I can now run the tests from the terminal in the root of the repo by simply running <code>pytest</code>:</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2023/03/Screenshot-2023-03-23-at-7.09.41-PM.png" class="kg-image" alt="Machine Learning How To: Continuous Integration (Part I)" srcset="https://cherrypicked.dev/content/images/size/w600/2023/03/Screenshot-2023-03-23-at-7.09.41-PM.png 600w, https://cherrypicked.dev/content/images/size/w1000/2023/03/Screenshot-2023-03-23-at-7.09.41-PM.png 1000w, https://cherrypicked.dev/content/images/size/w1600/2023/03/Screenshot-2023-03-23-at-7.09.41-PM.png 1600w, https://cherrypicked.dev/content/images/size/w2400/2023/03/Screenshot-2023-03-23-at-7.09.41-PM.png 2400w" sizes="(min-width: 1200px) 1200px"><figcaption>Running all tests from the root with pytest</figcaption></figure><!--kg-card-begin: markdown--><p>That's it for part one of this blog subseries. I covered writing unit tests and integration tests, mocking dependencies, isolating using setUp and fixtures and configuring dependencies for running tests with Pytest.<br>
In the second part, I will go over linting, how to automate tasks using <code>invoke</code>, and finally how to set up a CI pipeline with GitHub Actions.</p>
<p>I hope you found this post useful. If you have any questions or feedback you can reach me on <a href="https://twitter.com/mary_flh">Twitter</a>.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[learnings from building and deploying a machine learning product from scratch]]></title><description><![CDATA[<p>Radiology reports can be difficult to interpret by non-medical individuals, as they are full of technical jargon. One of the unique services Ezra provides to its members is a comprehensive report written by licensed medical providers explaining all the medical findings noted by the radiologist, translated into easy-to-read, non-technical language.</p>]]></description><link>https://cherrypicked.dev/learnings-from-building-ml/</link><guid isPermaLink="false">634bf554e619ed37e057484d</guid><dc:creator><![CDATA[Maryam Fallah]]></dc:creator><pubDate>Thu, 01 Dec 2022 11:49:00 GMT</pubDate><media:content url="https://cherrypicked.dev/content/images/2022/10/kyle-glenn--f8ssjFhD1k-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://cherrypicked.dev/content/images/2022/10/kyle-glenn--f8ssjFhD1k-unsplash.jpg" alt="learnings from building and deploying a machine learning product from scratch"><p>Radiology reports can be difficult to interpret by non-medical individuals, as they are full of technical jargon. One of the unique services Ezra provides to its members is a comprehensive report written by licensed medical providers explaining all the medical findings noted by the radiologist, translated into easy-to-read, non-technical language. In addition, the medical team includes a list of actionable next steps for each notable finding in a radiology report, depending on its severity. This comprehensive Ezra Report gives members the peace of mind and care they deserve.</p><p>I joined Ezra to build a machine-learning solution that automates the process of generating the Ezra Report from radiology reports. At the time I joined, the process was manual, time-consuming, and error-prone. 
Fast forward to today, with a machine learning-based web app, Ezra Reporter, the medical team saves 75 minutes per report, a 600% productivity boost! </p><p>I have shared my experience with this project on <a href="https://techblog.ezra.com/lessons-learned-from-building-and-deploying-a-machine-learning-product-from-scratch-fddbc19ee2aa">Ezra's tech blog</a>.</p>]]></content:encoded></item><item><title><![CDATA[Machine Learning How to: ML as a Service]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>In the <a href="https://cherrypicked.dev/machine-learning-how-to-training/">previous article</a>, I discussed the details of building a trackable and reproducible machine learning training pipeline. In this article, I will focus on how to use a trained model as a service in an application.</p>
<h3 id="motivation">Motivation</h3>
<p>There are more steps beyond finding the right model with the tuned</p>]]></description><link>https://cherrypicked.dev/ml-as-a-service/</link><guid isPermaLink="false">62099396e619ed37e0573787</guid><dc:creator><![CDATA[Maryam Fallah]]></dc:creator><pubDate>Fri, 23 Sep 2022 01:32:21 GMT</pubDate><media:content url="https://cherrypicked.dev/content/images/2022/02/pavan-trikutam-71CjSSB83Wo-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://cherrypicked.dev/content/images/2022/02/pavan-trikutam-71CjSSB83Wo-unsplash.jpg" alt="Machine Learning How to: ML as a Service"><p>In the <a href="https://cherrypicked.dev/machine-learning-how-to-training/">previous article</a>, I discussed the details of building a trackable and reproducible machine learning training pipeline. In this article, I will focus on how to use a trained model as a service in an application.</p>
<h3 id="motivation">Motivation</h3>
<p>There are more steps beyond finding the right model with the tuned hyper-parameters in a machine learning project. Once a model is trained and tested, it needs to be deployed. The model is usually part of a larger application; for example, YouTube generating suggestions for what to watch next.<br>
The purpose of this article is to take a trained model and have it communicate with an application's backend server through an API in a local environment.<br>
All the code snippets in this blog post can be found in the <a href="https://github.com/MaryFllh/mlseries">mlseries</a> repository on my Github.<br>
Note that the code snippets in this post are built on top of the previous <a href="https://cherrypicked.dev/machine-learning-how-to-training/">article</a> where I did a walk-through of a basic machine learning project's structure.</p>
<p>Below is a tree illustration of all the directories inside the repository:</p>
<pre><code>
├── README.md
├── deploy
│   ├── Dockerfile.ml.local
│   ├── Dockerfile.sent_ml
│   ├── Dockerfile.server
│   ├── Dockerfile.server.local
│   └── docker-compose.local.yml
├── integration_tests
├── ml
│   ├── Dockerfile
│   ├── conftest.py
│   ├── data
│   │   ├── reviews.tsv
│   │   ├── test_data.parquet
│   │   └── train_data.parquet
│   ├── docker-compose.yml
│   ├── dvc.lock
│   ├── dvc.yaml
│   ├── entities
│   │   └── review.py
│   ├── metrics
│   │   └── model_performance.json
│   ├── ml_main.py
│   ├── model
│   │   └── lr.model
│   ├── model_params.py
│   ├── prepare_data
│   │   ├── __main__.py
│   │   └── config.py
│   ├── requirements.txt
│   ├── services
│   │   └── predict.py
│   ├── test
│   │   ├── __main__.py
│   │   └── config.py
│   ├── tests
│   │   ├── integration_tests
│   │   │   └── test_ml_main.py
│   │   └── unittests
│   │       ├── prepare_data
│   │       │   └── test_prepare_data.py
│   │       ├── services
│   │       │   └── test_predict.py
│   │       └── train
│   │           └── test_train.py
│   ├── train
│   │   ├── __main__.py
│   │   └── config.py
│   ├── training_testing_pipeline.sh
│   └── utils
│       └── config.py
├── pytest.ini
├── server
│   ├── config.py
│   ├── conftest.py
│   ├── entities
│   │   └── review.py
│   ├── requirements.txt
│   ├── server_main.py
│   ├── tests
│   │   └── integration_tests
│   │       └── test_server_main.py
│   └── utils
│       └── logger
├── shared_utils
│   ├── logger
│   │   └── __init__.py
│   └── unittest
│       └── utils
│           └── logger
│               └── __init__.py
└── tasks.py
</code></pre>
<h3 id="thearchitecture">The Architecture</h3>
<p>The figure below illustrates the architecture of a simple web application. There are three containers: the server container, the machine learning container, and the front-end container. The server is the core of the application; think of it as the brain of the app. The machine learning container exposes the model to the rest of the application; it receives requests from the server and sends back responses. The front-end container includes any user-interface-related code.<br>
Note that I will not be covering the UI container as it does not directly interact with the ML container.<br>
Docker Compose is used to create and run all the containers together.</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2022/09/untitled@2x--2-.png" class="kg-image" alt="Machine Learning How to: ML as a Service" srcset="https://cherrypicked.dev/content/images/size/w600/2022/09/untitled@2x--2-.png 600w, https://cherrypicked.dev/content/images/size/w1000/2022/09/untitled@2x--2-.png 1000w, https://cherrypicked.dev/content/images/size/w1600/2022/09/untitled@2x--2-.png 1600w, https://cherrypicked.dev/content/images/2022/09/untitled@2x--2-.png 1692w" sizes="(min-width: 720px) 720px"><figcaption>Figure 1. A multi-container application. Docker Compose is used to configure and run the containers.</figcaption></figure><!--kg-card-begin: markdown--><h3 id="thedesign">The Design</h3>
<h4 id="mlservice">ML Service</h4>
<p>Let's start with the machine learning component of the app. The goal is to have the machine learning container receive requests for inference from the server and return its predictions.<br>
I call this the prediction service. In the code <a href="https://github.com/MaryFllh/mlseries">repository</a>, I add a <code>services/</code> directory and create a <code>predict.py</code> script with the contents below:</p>
<pre><code># /ml/services/predict.py
import numpy as np

from joblib import load

from utils.config import Config
from utils.logger import logger


class PredictService:
    &quot;&quot;&quot;
    A class serving a trained model for prediction requests

    Methods:
        predict(review: str) runs inference on a review and sends
                    back the sentiment and a prediction score from 1-100
    &quot;&quot;&quot;
    def __init__(self):
        &quot;&quot;&quot;
        Loads the env vars and the trained model
        &quot;&quot;&quot;
        self.config = Config()
        self.model = load(f&quot;{self.config.model_path}/{self.config.model_file}&quot;)

    def predict(self, review: str):
        &quot;&quot;&quot;
        Runs inference on a review and sends back the sentiment and a prediction
        score from 1-100

        Args:
            review(str): the raw review text to run inference on

        Returns:
            response(dict): a dictionary with the predicted positive or negative
                            sentiment and an associated prediction score
        &quot;&quot;&quot;
        response = dict()
        X = review
        response[&quot;prediction&quot;] = &quot;positive&quot; if self.model.predict(np.array([X]))[0] else &quot;negative&quot;
        positive_prediction_score = np.round(self.model.predict_proba(np.array([X]))[0][1] * 100, 3)
        response[&quot;prediction_score&quot;] = positive_prediction_score if response[&quot;prediction&quot;] == &quot;positive&quot; else 100 - positive_prediction_score

        logger.info(&quot;Completed prediction on review: %s&quot;, response)
        return response
</code></pre>
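<p>The score mapping above is worth spelling out: the model reports the probability of the positive class, and the response carries that probability (as a percentage) for positive predictions, or its complement for negative ones. Below is a dependency-free sketch of the same mapping, assuming the usual 0.5 decision threshold; <code>build_response</code> is my own name, not from the repo:</p>

```python
def build_response(positive_proba):
    """Map the model's positive-class probability to the response format
    used by PredictService: a label plus a 0-100 confidence score."""
    score = round(positive_proba * 100, 3)
    if positive_proba >= 0.5:
        return {"prediction": "positive", "prediction_score": score}
    # For negative predictions, report confidence in the negative class
    return {"prediction": "negative", "prediction_score": 100 - score}

print(build_response(0.8))   # confident positive
print(build_response(0.25))  # fairly confident negative
```

Either way, the reported score reads as "confidence in the predicted label", which is easier for an API consumer than a raw positive-class probability.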
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><p>To ensure that <code>PredictService</code> always receives the expected input, I define a Review model that inherits from <a href="https://pydantic-docs.helpmanual.io/usage/models/"><code>pydantic</code>'s <code>BaseModel</code></a>. Pydantic is a useful tool for validating the input to an endpoint. For example, using the Config class, I have set the minimum acceptable length for the input. If the input is 3 characters or less, an error is raised with a message: <code>ensure this value has at least 4 characters</code>.</p>
<pre><code># /ml/entities/review.py

from pydantic import BaseModel

class Review(BaseModel):
    text: str

    class Config:
        &quot;&quot;&quot;
        Controls the behaviour of the Review class
        &quot;&quot;&quot;
        min_anystr_length = 4
        error_msg_templates = {
            'value_error.any_str.min_length': 'min_length:{limit_value}',
        }
</code></pre>
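<p>To see what the pydantic config buys you without running a server, here is a rough standard-library analogue of the same minimum-length rule. The class name and error message are illustrative, not pydantic's internals:</p>

```python
from dataclasses import dataclass

MIN_LENGTH = 4  # mirrors min_anystr_length in the pydantic Config above

@dataclass
class PlainReview:
    text: str

    def __post_init__(self):
        # Reject inputs of 3 characters or fewer, like the pydantic model does
        if len(self.text) < MIN_LENGTH:
            raise ValueError(f"min_length:{MIN_LENGTH}")

PlainReview("loved it")  # fine
try:
    PlainReview("meh")
except ValueError as e:
    print(e)  # the too-short input is rejected
```

The difference in practice is that pydantic wires this validation into FastAPI automatically, returning a structured 422 response instead of an uncaught exception.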
<h4 id="mlapi">ML API</h4>
<p>I will be using <a href="https://fastapi.tiangolo.com/">FastAPI</a> for writing the APIs. It is fast and easy to use.<br>
To begin, I will add a <code>ml_main.py</code> script and create an instance of FastAPI named <code>app</code>. All the routes will be using <code>app</code>.</p>
<pre><code># /ml/ml_main.py

from fastapi import FastAPI, status

from services.predict import PredictService
from utils.logger import logger
from entities.review import Review

app = FastAPI()

predict_service = PredictService()


@app.post(&quot;/prediction_job/&quot;, status_code=status.HTTP_201_CREATED)
def add_prediction_job(review: Review):
    &quot;&quot;&quot;
    This endpoint receives a review, sends it to the
    prediction service for prediction and returns the response
    
    Args:
        review(Review): the input should conform to the Review model
    
    Returns:
        response(dict): response from predict_service which includes
                        prediction details
    &quot;&quot;&quot;
    response = predict_service.predict(review.text)
    return response
</code></pre>
<p>As shown in the above code snippet, after initializing the <code>app</code> and <code>PredictService</code>, a <code>POST</code> endpoint is defined that receives a review as input, sends it to ML's <code>predict_service</code> for inference, and returns the prediction back to its caller.</p>
<h4 id="serverapi">Server API</h4>
<p>As mentioned in the architecture section, the server is the main receiver of and responder to client requests. In other words, the UI posts the user's review to the server through a <code>POST</code> request, and the server makes a <code>POST</code> request to the machine learning container. ML in turn makes a call to its <code>PredictService</code>. This series of requests is illustrated below:</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2022/09/untitled@2x--3-.png" class="kg-image" alt="Machine Learning How to: ML as a Service" srcset="https://cherrypicked.dev/content/images/size/w600/2022/09/untitled@2x--3-.png 600w, https://cherrypicked.dev/content/images/size/w1000/2022/09/untitled@2x--3-.png 1000w, https://cherrypicked.dev/content/images/size/w1600/2022/09/untitled@2x--3-.png 1600w, https://cherrypicked.dev/content/images/2022/09/untitled@2x--3-.png 2304w" sizes="(min-width: 720px) 720px"><figcaption>Figure 2. Communication between the frontend and the backend</figcaption></figure><!--kg-card-begin: markdown--><p>I create a server directory in the root of the repository and similar to the ML API, add a <code>server_main.py</code> script to initiate a FastAPI() instance and include the endpoints for the server container (see tree illustration at the beginning of the article for an overview of the repository's folder structure).</p>
<p>To accept reviews from the user and predict their sentiment, I add a <code>POST</code> endpoint with the path <code>review_sentiment</code>. This endpoint expects a string as input. It creates a payload that conforms to the <code>Review</code> class and sends it to the ML endpoint. The response from ML's endpoint is returned, and invalid inputs are rejected with an <code>HTTPException</code>.</p>
<pre><code># /server/server_main.py
import re

import requests
from config import Config
from entities.review import Review
from fastapi import FastAPI, HTTPException, status

app = FastAPI()
config = Config()


@app.post(&quot;/review_sentiment/&quot;, status_code=status.HTTP_200_OK)
def add_review(review: Review):
    &quot;&quot;&quot;
    This endpoint receives a review, creates an appropriate
    payload and makes a post request to ml's endpoint for a
    prediction on the input

    Args:
        review(Review): the input should conform to the Review model

    Returns:
        response(dict): response from ml's endpoint which includes
                        prediction details

    Raises:
        HTTPException: if the input is a sequence of numbers
    &quot;&quot;&quot;
    endpoint = f&quot;{config.ml_base_uri}/prediction_job/&quot;
    if re.compile(r&quot;^\-?[1-9][0-9]*$&quot;).search(review.text):
        raise HTTPException(status_code=422, detail=&quot;Review cannot be a number&quot;)

    resp = requests.post(endpoint, json=review.dict())
    return resp.json()

</code></pre>
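<p>One caveat worth knowing: the pattern <code>^\-?[1-9][0-9]*$</code> only matches integers with no leading zero, so inputs such as <code>0</code> or <code>3.5</code> slip past the guard. Whether that matters depends on how strict the validation needs to be. A quick check of its behaviour:</p>

```python
import re

# Same pattern as the server's add_review guard
is_number = re.compile(r"^\-?[1-9][0-9]*$")

# Inputs the guard rejects with HTTP 422
assert is_number.search("12345")
assert is_number.search("-42")

# Inputs that pass through to the ML service
assert is_number.search("great movie") is None
assert is_number.search("0") is None      # leading zero not covered
assert is_number.search("3.5") is None    # decimals not covered
```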
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h4 id="containerisationcomposition">Containerisation &amp; Composition</h4>
<p>As illustrated in the design, the server and machine learning code should be isolated in separate containers. Let's create a <code>deploy/</code> directory and add two Dockerfiles for the local development environment.<br>
For the toy application covered in this article, the instructions in the Dockerfiles are as simple as installing a light Python version, copying over the code from the repository and installing the packages:</p>
<pre><code>#/deploy/Dockerfile.server.local
FROM python:3.9-slim as py39slimbase

WORKDIR /server
COPY ./server/requirements.txt /server
RUN pip install -r requirements.txt

COPY ./server /server
</code></pre>
<p>Similarly, for the ML container:</p>
<pre><code>#/deploy/Dockerfile.ml.local
FROM python:3.9-slim as py39slim

WORKDIR /ml
COPY ./ml/requirements.txt /ml
RUN pip install -r requirements.txt

COPY ./ml /ml
</code></pre>
<p>Docker Compose is used to have these two containers communicate with each other, and send requests back and forth as illustrated in Fig. 2.<br>
I add a <code>docker-compose.local.yml</code> script in the <code>deploy/</code> directory and define two services, <code>sentiment-analysis-ml</code> and <code>sentiment-analysis-server</code>.</p>
<pre><code>services:
  sentiment-analysis-ml:
    container_name: ml
    build:
      context: ../
      dockerfile: deploy/Dockerfile.ml.local
    command: uvicorn ml_main:app --host 0.0.0.0 --port 3000 --reload
    environment:
      - MODEL_FILE=${MODEL_FILE}
      - MODEL_PATH=${MODEL_PATH}
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
    volumes:
      - ../ml/:/ml/
      - ../shared_utils/logger:/ml/utils/logger
  
  sentiment-analysis-server:
    container_name: server
    build:
      context: ../
      dockerfile: deploy/Dockerfile.server.local
    command: uvicorn server_main:app --host 0.0.0.0 --port 8000 --reload
    environment: 
      - ML_BASE_URI=${ML_BASE_URI}
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
    volumes:
      - ../server/:/server/
      - ../shared_utils/logger:/server/utils/logger
    ports:
      - 8000:8000
</code></pre>
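<p>The <code>${...}</code> substitutions in the compose file are read from a <code>.env</code> file next to it (the same <code>deploy/.env</code> the server's <code>conftest.py</code> loads for the tests). A minimal example sketch, with illustrative values consistent with the rest of this series:</p>

```shell
# deploy/.env (illustrative values)
MODEL_PATH=model
MODEL_FILE=lr.model
ML_BASE_URI=http://ml:3000
```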
<p>The <code>uvicorn ml_main:app</code> and <code>uvicorn server_main:app</code> commands refer to the FastAPI apps defined in <code>ml_main.py</code> and <code>server_main.py</code> respectively. I have set ML to listen on port 3000 by adding <code>--port 3000</code>; this is the port that the server needs to use to send requests to ML. The combination of the <code>container_name</code> and <code>--port 3000</code> determines the base URL for the ML container, <code>http://ml:3000</code>. The server needs to hit <code>http://ml:3000/prediction_job/</code> to get predictions.</p>
<h4 id="runningtheapplocally">Running the app locally</h4>
<p>With docker-compose, all images are built and containers are run with one command: <code>docker-compose up</code>. After the images are built from the Dockerfiles and the containers are running, the output below is expected in the terminal:</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2022/08/Screenshot-2022-08-03-at-9.47.49-AM.png" class="kg-image" alt="Machine Learning How to: ML as a Service" srcset="https://cherrypicked.dev/content/images/size/w600/2022/08/Screenshot-2022-08-03-at-9.47.49-AM.png 600w, https://cherrypicked.dev/content/images/size/w1000/2022/08/Screenshot-2022-08-03-at-9.47.49-AM.png 1000w, https://cherrypicked.dev/content/images/size/w1600/2022/08/Screenshot-2022-08-03-at-9.47.49-AM.png 1600w, https://cherrypicked.dev/content/images/2022/08/Screenshot-2022-08-03-at-9.47.49-AM.png 1696w" sizes="(min-width: 720px) 720px"><figcaption>Figure 3. The terminal output after the containers are successfully up and running</figcaption></figure><!--kg-card-begin: markdown--><p>As seen in the figure above, the server is running on <code>http://0.0.0.0:8000</code>.<br>
FastAPI serves interactive documentation of all the implemented APIs at the <code>/docs</code> path. Head over to <code>http://0.0.0.0:8000/docs</code> and click <code>Try it out</code> under the <code>POST /review_sentiment/</code> request. You can type in a comment and a successful response will show the predicted sentiment and score.</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2022/08/Screenshot-2022-08-03-at-9.54.44-AM-1.png" class="kg-image" alt="Machine Learning How to: ML as a Service" srcset="https://cherrypicked.dev/content/images/size/w600/2022/08/Screenshot-2022-08-03-at-9.54.44-AM-1.png 600w, https://cherrypicked.dev/content/images/size/w1000/2022/08/Screenshot-2022-08-03-at-9.54.44-AM-1.png 1000w, https://cherrypicked.dev/content/images/size/w1600/2022/08/Screenshot-2022-08-03-at-9.54.44-AM-1.png 1600w, https://cherrypicked.dev/content/images/size/w2400/2022/08/Screenshot-2022-08-03-at-9.54.44-AM-1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Figure 4. Swagger allows interaction with implemented APIs</figcaption></figure><!--kg-card-begin: markdown--><h3 id="wrapup">Wrap up</h3>
<p>That is it for this blog post. I went over building a simple machine learning application using FastAPI and how to run it locally. There are many improvements that can be made to the simple app built in this article, for instance implementing queues for ML requests and authorisation for accessing the APIs.</p>
<p>I hope you enjoyed this post and use some of these ideas in your own projects. If you like reading about machine learning and natural language processing, follow me on <a href="https://twitter.com/mary_flh">twitter</a> for more content!</p>
<!--kg-card-end: markdown--><p></p><p></p>]]></content:encoded></item><item><title><![CDATA[Machine Learning How To: Reproducible and Trackable Training]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>The investigation and experimentation phase is a common building block in developing any machine learning solution. As the model developer you spend a good chunk of your time doing research on different algorithms, gathering and processing the data, hyper-parameters tuning and so on.<br>
Through this process you may end up</p>]]></description><link>https://cherrypicked.dev/machine-learning-how-to-training/</link><guid isPermaLink="false">62098f0ae619ed37e0573760</guid><dc:creator><![CDATA[Maryam Fallah]]></dc:creator><pubDate>Sat, 09 Apr 2022 00:39:24 GMT</pubDate><media:content url="https://cherrypicked.dev/content/images/2022/02/markus-spiske-tSZZI_guJ4M-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://cherrypicked.dev/content/images/2022/02/markus-spiske-tSZZI_guJ4M-unsplash.jpg" alt="Machine Learning How To: Reproducible and Trackable Training"><p>The investigation and experimentation phase is a common building block in developing any machine learning solution. As the model developer you spend a good chunk of your time doing research on different algorithms, gathering and processing the data, hyper-parameters tuning and so on.<br>
Through this process you may end up adding new data to your original collected dataset, and try various algorithms or tinker with the hyper-parameters of the same algorithm many times.<br>
You may be able to keep track of the differences between a few of these experimental setups with scripts or notebooks that are named <em>first.ipynb</em>, <em>second.ipynb</em>, <em>final.ipynb</em>, <em>final2.ipynb</em>, <em>final_final.ipynb</em>. However, this inefficient process causes a lot of headaches for you and your team when you want to compare results. It also makes it difficult to know which model used which dataset and to decide which model to push to production.<br>
This blog post is part one of a series covering different aspects of the life-cycle of machine learning in production.<br>
In this post I will cover the topics listed below:</p>
<ul>
<li>The model training process</li>
<li>How to version control data</li>
<li>How to version control models</li>
<li>How to easily compare the performance of different model configurations</li>
<li>How to write readable, reusable production-ready code for your machine learning pipelines.</li>
</ul>
<p>For this blog post series, I will be using a small movie review dataset and fit some basic models on it. The goal of these posts is not to optimise the data or model, but to demonstrate how changes in the model development process can be easily tracked and reproduced. All the code used for this blog series can be found on my <a href="https://github.com/MaryFllh/mlseries">Github</a>. This post will go through the code inside the ml directory.</p>
<h3 id="toolsandservicesused">Tools and services used</h3>
<ul>
<li><a href="https://github.com/pyenv/pyenv"><em>pyenv</em></a>: pyenv is a virtual environment manager that allows you to easily install and manage different Python versions in different environments. You can follow the installation instructions from its <a href="https://github.com/pyenv/pyenv">Github page</a>.</li>
<li>AWS: S3 for remote data storage</li>
<li>Git and Github: For tracking code (and later for CI/CD, to be covered in later blog posts)</li>
<li>Docker: Containerisation is a useful tool providing reproducible environments</li>
<li><a href="https://dvc.org/">Data Version Control (DVC)</a>: An open source tool for tracking data, building ML pipelines, and performance metrics. Its integration with Github makes it very easy to pull the data and run experiments on different git branches.</li>
</ul>
<p>In the following sections, I will walk through how to build the published project from scratch.</p>
<h3 id="1createavirtualenvironment">1. Create a Virtual Environment</h3>
<p>Create a virtual environment and activate it.</p>
<pre><code>&gt; pyenv virtualenv 3.9 venv
&gt; pyenv activate venv
</code></pre>
<h3 id="2definetheprojectstructure">2. Define the project structure</h3>
<p>A project needs to have a well-defined and easy to understand structure. This makes it possible for multiple people to work on a repository without stepping on each other's toes.</p>
<h4 id="anexample">An example</h4>
<p>The <code>ml</code> directory of the linked repository contains several subdirectories, each responsible for a specific component of the machine learning project.</p>
<pre><code>ml/
  prepare_data/
    __main__.py
    config.py
  train/
    __main__.py
    config.py
  test/
    __main__.py
    config.py
  metrics/
    model_performance.json
  services/
     predict.py
  .env.example
  Dockerfile
  docker-compose.yml
  dvc.yml
  dvc.lock
  __main__.py
  model_params.py
  requirements.txt
  training_testing_pipeline.sh
</code></pre>
<p>Each directory has a <code>__main__.py</code> and a <code>config.py</code> script. The main script executes the specific task described by its enclosing directory's name. This separation of concerns and distinct naming makes it easy for a new developer to navigate the code-base.<br>
The config script is responsible for initialising any configurable variables. I include all the environment variables in a <code>.env</code> file. Note that <code>.env</code> files can contain sensitive information, e.g. secrets; it is therefore important not to push these files to your remote repository. This can be done by adding <code>.env</code> (and other files you do not want tracked) to <code>.gitignore</code>. However, it is useful to include a list of all the environment variables needed to run the code, along with the values of non-sensitive ones, in a file that is pushed to the repository. This way, anyone else cloning the project will know what to include in their <code>.env</code> file.<br>
I have listed all the variables in a file named <code>.env.example</code>. Below are the contents of this file. Notice that the AWS secret and id are listed but the values are not included:</p>
<pre><code>#/ml/.env.example

DATA_PATH=data
DATA_FILE=reviews.tsv
TRAIN_DATA_FILE=train_data.parquet
TEST_DATA_FILE=test_data.parquet
MODEL_PATH=model
MODEL_FILE=lr.model
METRIC_PATH=/ml/metrics
METRIC_FILE=model_performance.json
AWS_SECRET_ACCESS_KEY=__your_aws_secret_key__
AWS_ACCESS_KEY_ID=__your_aws_access_key_id__
</code></pre>
<h4 id="afewnotes">A few notes</h4>
<ul>
<li><code>services/</code> and <code>__main__.py</code> include code for inference which I will cover in the next blog post.</li>
<li>The metrics folder includes the performance metrics on test data. More on that in step 8.</li>
<li>Details on the Dockerfile and docker-compose file can be found in step 4.</li>
<li>Details on the DVC related directory and files can be found in steps 7 &amp; 8.</li>
<li>Details on the shell script can be found in step 8.</li>
</ul>
<h3 id="3writecode">3. Write code</h3>
<p>After deciding on the project's code structure, it is time to add code. The data preparation and training steps are very specific to the data and problem at hand and the code snippets below are just an example for training a model that will predict positive or negative sentiments given a piece of text.</p>
<h4 id="31preprocessing">3.1 Preprocessing</h4>
<p>The movie review dataset has two columns: <code>Phrase</code>, the text of a movie review, and <code>Sentiment</code>, a rating of the movie from 0-4. The code snippet below is the content of the <code>prepare_data/__main__.py</code> script.</p>
<pre><code>#/ml/prepare_data/__main__.py

import pandas as pd

from sklearn.model_selection import train_test_split

from model_params import TrainTestSplit
from .config import Config


def split_train_test_data():
    config = Config()
    
    # select the reviews with the lowest and highest ratings
    data = pd.read_csv(f&quot;{config.data_path}/{config.data_file}&quot;, sep=&quot;\t&quot;)
    data = data[(data[&quot;Sentiment&quot;] == 0) | (data[&quot;Sentiment&quot;] == 4)]
    data[&quot;labels&quot;] = data.apply(lambda x: 0 if x[&quot;Sentiment&quot;] == 0 else 1, axis=1)

    #split the dataset into training and testing data
    data_train, data_test = train_test_split(data, test_size=TrainTestSplit.TEST_SIZE)

    data_train.to_parquet(f&quot;{config.data_path}/{config.train_data_file}&quot;)
    data_test.to_parquet(f&quot;{config.data_path}/{config.test_data_file}&quot;)


if __name__ == &quot;__main__&quot;:
    split_train_test_data()
</code></pre>
<p>Notice the use of <code>Config()</code> for configurable variables. All such variables are defined in a <code>config.py</code> script in the same directory. Below is the content of this file:</p>
<pre><code>#/ml/prepare_data/config.py

from os import getenv
from sys import exit

from utils.logger import logger


class Config:
    def __init__(self):
        if getenv(&quot;DATA_PATH&quot;) is None:
            self.exit_program(&quot;DATA_PATH&quot;)
        else:
            self.data_path = getenv(&quot;DATA_PATH&quot;)

        if getenv(&quot;DATA_FILE&quot;) is None:
            self.exit_program(&quot;DATA_FILE&quot;)
        else:
            self.data_file = getenv(&quot;DATA_FILE&quot;)

        if getenv(&quot;TRAIN_DATA_FILE&quot;) is None:
            self.exit_program(&quot;TRAIN_DATA_FILE&quot;)
        else:
            self.train_data_file = getenv(&quot;TRAIN_DATA_FILE&quot;)

        if getenv(&quot;TEST_DATA_FILE&quot;) is None:
            self.exit_program(&quot;TEST_DATA_FILE&quot;)
        else:
            self.test_data_file = getenv(&quot;TEST_DATA_FILE&quot;)

    def exit_program(self, env_var):
        error_message = (f&quot;prepare_data: {env_var} is missing from the set environment variables.&quot;)
        logger.error(error_message)
        exit(error_message)
</code></pre>
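<p>The four near-identical <code>if</code>/<code>else</code> blocks follow one pattern, so as a variant you could loop over the required variable names. Below is a sketch of an equivalent, more compact <code>Config</code> (my own simplification, not the repository's code; it raises <code>SystemExit</code> directly rather than going through the logger):</p>

```python
from os import environ


class Config:
    """Read required environment variables, exiting with an error if any are missing."""

    REQUIRED = ("DATA_PATH", "DATA_FILE", "TRAIN_DATA_FILE", "TEST_DATA_FILE")

    def __init__(self):
        missing = [name for name in self.REQUIRED if environ.get(name) is None]
        if missing:
            # mirror the original behaviour: report the missing variable(s) and exit
            raise SystemExit(
                f"prepare_data: {', '.join(missing)} is missing from the set environment variables."
            )
        for name in self.REQUIRED:
            # e.g. DATA_PATH becomes the attribute config.data_path
            setattr(self, name.lower(), environ[name])
```

<p>The same sketch works for the <code>train/</code> and <code>test/</code> configs by swapping the tuple of required names.</p>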
<h4 id="32training">3.2 Training</h4>
<p>A very similar structure applies to the training script. Below is an example of using the training dataset created by the preprocessing script and fitting a logistic regression model to it:</p>
<pre><code>#/ml/train/__main__.py

import pandas as pd

from joblib import dump
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

from .config import Config
from model_params import LogisticRegressionConfig


def train():
    config = Config()
    
    # initialise model parameters
    lr_params = {
        &quot;n_jobs&quot;: LogisticRegressionConfig.n_jobs,
        &quot;C&quot;: LogisticRegressionConfig.C,
        &quot;max_iter&quot;: LogisticRegressionConfig.max_iter,
    }
    
    # load the training dataset created by prepare_data/
    train_dataframe = pd.read_parquet(f&quot;{config.data_path}/{config.train_data_file}&quot;)

    # separate the input text and labels
    X = train_dataframe[&quot;Phrase&quot;]
    y = train_dataframe[&quot;labels&quot;]

    # define a series of processes that transform input data X
    clf = Pipeline(
        [
            (&quot;vect&quot;, CountVectorizer()),
            (&quot;tfidf&quot;, TfidfTransformer()),
            (&quot;clf&quot;, LogisticRegression(**lr_params)),
        ]
    )
    # fit the model and save it in the model_path
    clf.fit(X, y)
    dump(clf, f&quot;{config.model_path}/{config.model_file}&quot;)


if __name__ == &quot;__main__&quot;:
    train()
</code></pre>
<p>The <code>train</code> function dumps the model to a configurable path. All the configurable variables are defined in <code>config.py</code> in the <code>train/</code> directory:</p>
<pre><code>#/ml/train/config.py


from os import getenv
from sys import exit

from utils.logger import logger


class Config:
    def __init__(self):
        if getenv(&quot;DATA_PATH&quot;) is None:
            self.exit_program(&quot;DATA_PATH&quot;)
        else:
            self.data_path = getenv(&quot;DATA_PATH&quot;)

        if getenv(&quot;TRAIN_DATA_FILE&quot;) is None:
            self.exit_program(&quot;TRAIN_DATA_FILE&quot;)
        else:
            self.train_data_file = getenv(&quot;TRAIN_DATA_FILE&quot;)

        if getenv(&quot;MODEL_PATH&quot;) is None:
            self.exit_program(&quot;MODEL_PATH&quot;)
        else:
            self.model_path = getenv(&quot;MODEL_PATH&quot;)

        if getenv(&quot;MODEL_FILE&quot;) is None:
            self.exit_program(&quot;MODEL_FILE&quot;)
        else:
            self.model_file = getenv(&quot;MODEL_FILE&quot;)

    def exit_program(self, env_var):
        error_message = (
            f&quot;train: {env_var} is missing from the set environment variables.&quot;)
        logger.error(error_message)
        exit(error_message)
</code></pre>
<h4 id="33modelparameters">3.3 Model Parameters</h4>
<p>It is useful to have a script that holds all the model parameters and configurations, so that any parameter change happens in one place only. For the simple logistic regression model used above, I define the parameters used in <code>train/__main__.py</code> in a <code>LogisticRegressionConfig</code> class, and the train/test split used in <code>prepare_data/__main__.py</code> in a <code>TrainTestSplit</code> class. As I make changes to the model or add new models, I can add further classes to this script. Below is an example of what is included in <code>ml/model_params.py</code>:</p>
<pre><code>#/ml/model_params.py

class TrainTestSplit:
    TEST_SIZE = 0.2


class LogisticRegressionConfig:
    n_jobs = 1
    C = 1e5
    max_iter = 1000
</code></pre>
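<p>As a variant on the plain classes above, frozen dataclasses additionally protect the parameters against accidental mutation at runtime. A sketch with hypothetical names (<code>TrainTestSplitParams</code>, <code>LogisticRegressionParams</code>), using the same values as <code>model_params.py</code>:</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainTestSplitParams:
    # fraction of the data held out for testing
    test_size: float = 0.2


@dataclass(frozen=True)
class LogisticRegressionParams:
    n_jobs: int = 1
    C: float = 1e5
    max_iter: int = 1000
```

<p>Attempting to reassign a field on a frozen instance raises <code>dataclasses.FrozenInstanceError</code>, so a stray <code>params.C = ...</code> elsewhere in the codebase fails loudly instead of silently changing an experiment.</p>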
<h4 id="34testing">3.4 Testing</h4>
<p>Depending on the problem, a set of performance metrics should be defined. For the sentiment analysis model, I use accuracy and the confusion matrix. This is what the content of <code>test/__main__.py</code> looks like:</p>
<pre><code>#/ml/test/__main__.py

import pandas as pd

from json import dump
from joblib import load
from sklearn.metrics import confusion_matrix, accuracy_score

from .config import Config


def test_model():
    config = Config()

    # load the model from the model_path
    model = load(f&quot;{config.model_path}/{config.model_file}&quot;)

    # load the test data created by prepare_data/__main__.py
    data_test = pd.read_parquet(f&quot;{config.data_path}/{config.test_data_file}&quot;)
    
    # separate the input from the labels
    X = data_test[&quot;Phrase&quot;]
    y = data_test[&quot;labels&quot;]
    
    # make predictions on input text, X
    y_pred = model.predict(X)

    # compute metrics
    accuracy = accuracy_score(y, y_pred)
    # .tolist() converts the numpy floats to plain Python floats so json.dump can serialise them
    true_negative, false_positive, false_negative, true_positive = confusion_matrix(y, y_pred, normalize=&quot;true&quot;).ravel().tolist()

    # save the results in metric_path
    with open(f&quot;{config.metric_path}/{config.metric_file}&quot;, &quot;w&quot;) as metrics:
        dump(
            {
                &quot;results&quot;: {
                    &quot;accuracy&quot;: accuracy,
                    &quot;true_negative&quot;: true_negative,
                    &quot;false_positive&quot;: false_positive,
                    &quot;false_negative&quot;: false_negative,
                    &quot;true_positive&quot;: true_positive,
                }
            },
            metrics,
        )


if __name__ == &quot;__main__&quot;:
    test_model()
</code></pre>
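<p>To make the saved numbers concrete, here is a stdlib-only sketch of what accuracy and a row-normalised confusion matrix mean for binary labels; it mirrors what <code>accuracy_score</code> and <code>confusion_matrix(..., normalize=&quot;true&quot;)</code> compute (the helper name and toy labels are mine, for illustration):</p>

```python
def binary_metrics(y_true, y_pred):
    """Accuracy plus row-normalised confusion-matrix cells for 0/1 labels."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    negatives = sum(1 for t in y_true if t == 0)   # actual-negative count
    positives = len(y_true) - negatives            # actual-positive count
    pairs = list(zip(y_true, y_pred))
    return {
        "accuracy": accuracy,
        # each row of the confusion matrix is divided by that row's true-class count
        "true_negative": sum(1 for t, p in pairs if t == 0 and p == 0) / negatives,
        "false_positive": sum(1 for t, p in pairs if t == 0 and p == 1) / negatives,
        "false_negative": sum(1 for t, p in pairs if t == 1 and p == 0) / positives,
        "true_positive": sum(1 for t, p in pairs if t == 1 and p == 1) / positives,
    }
```

<p>So in <code>model_performance.json</code>, <code>true_positive</code> reads as "the fraction of actually-positive reviews the model got right", not a raw count.</p>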
<p>Similar to the other directories, all the configurable variables are defined in <code>test/config.py</code>:</p>
<pre><code>#/ml/test/config.py

from os import getenv
from sys import exit

from utils.logger import logger


class Config:
    def __init__(self):
        if getenv(&quot;DATA_PATH&quot;) is None:
            self.exit_program(&quot;DATA_PATH&quot;)
        else:
            self.data_path = getenv(&quot;DATA_PATH&quot;)

        if getenv(&quot;TEST_DATA_FILE&quot;) is None:
            self.exit_program(&quot;TEST_DATA_FILE&quot;)
        else:
            self.test_data_file = getenv(&quot;TEST_DATA_FILE&quot;)

        if getenv(&quot;MODEL_PATH&quot;) is None:
            self.exit_program(&quot;MODEL_PATH&quot;)
        else:
            self.model_path = getenv(&quot;MODEL_PATH&quot;)

        if getenv(&quot;MODEL_FILE&quot;) is None:
            self.exit_program(&quot;MODEL_FILE&quot;)
        else:
            self.model_file = getenv(&quot;MODEL_FILE&quot;)

        if getenv(&quot;METRIC_PATH&quot;) is None:
            self.exit_program(&quot;METRIC_PATH&quot;)
        else:
            self.metric_path = getenv(&quot;METRIC_PATH&quot;)

        if getenv(&quot;METRIC_FILE&quot;) is None:
            self.exit_program(&quot;METRIC_FILE&quot;)
        else:
            self.metric_file = getenv(&quot;METRIC_FILE&quot;)

    def exit_program(self, env_var):
        error_message = (
            f&quot;test: {env_var} is missing from the set environment variables.&quot;)
        logger.error(error_message)
        exit(error_message)
</code></pre>
<h4 id="35freezeyourdependenciesintorequirementstxt">3.5 Freeze your dependencies into requirements.txt</h4>
<p>As you write and run scripts you need to <code>pip install</code> different packages. In order to reproduce the results on a different machine, a file with a list of the packages and their versions is needed. This file is usually called <code>requirements.txt</code>. Below is the set of libraries used in the example project:</p>
<pre><code>#/ml/requirements.txt

uvicorn[standard]==0.13.4
fastapi==0.68.1
black==21.4b2
mypy==0.812
pandas==1.1.5
pyarrow==5.0.0
scikit-learn==0.24.2
</code></pre>
<p>You can either add the packages one by one or create this file automatically by running <code>pip freeze &gt; requirements.txt</code> in your terminal. Note that this command will include <strong>all</strong> the packages installed in your virtual environment and not just those used in your project.</p>
<h3 id="4createadockerimage">4. Create a docker image</h3>
<p>Another tool that helps with the reproducibility of a machine learning pipeline, and any pipeline for that matter, is Docker. You can install Docker by following the instructions on <a href="https://docs.docker.com/get-docker/">Docker's website</a>.<br>
Below is an example of a Dockerfile you can use for training:</p>
<pre><code>#/ml/Dockerfile

FROM python:3.9-slim as py39slim
RUN apt-get update -y &amp;&amp; apt-get install -y git

WORKDIR /ml

# copy over the requirements.txt file and install all requirements
COPY ./ml/requirements.txt /ml
RUN pip install -r requirements.txt --no-cache-dir

# copy over the ml directory
COPY ./ml /ml

ENTRYPOINT [&quot;/bin/bash&quot;, &quot;-c&quot;]
</code></pre>
<p>With multiple environment variables in the code, I find it easier to write a <code>docker-compose.yml</code> file that copies over all the variables from <code>.env</code> rather than including them in a <code>docker run</code> command. Here is an example of a <code>docker-compose.yml</code> file:</p>
<pre><code>#/ml/docker-compose.yml

version: '3.8'
services:
  sentiment.ml:
    build:
      context: ./..
      dockerfile: ml/Dockerfile
    environment:
      - DATA_PATH=${DATA_PATH}
      - DATA_FILE=${DATA_FILE}
      - TRAIN_DATA_FILE=${TRAIN_DATA_FILE}
      - TEST_DATA_FILE=${TEST_DATA_FILE}
      - MODEL_PATH=${MODEL_PATH}
      - MODEL_FILE=${MODEL_FILE}
      - METRIC_PATH=${METRIC_PATH}
      - METRIC_FILE=${METRIC_FILE}
    volumes:
      - ./data:/ml/data
      - ./metrics:/ml/metrics
      - ./model:/ml/model
      - ./prepare_data:/ml/prepare_data
      - ./train:/ml/train
      - ./test:/ml/test
      - ./../shared_utils/logger:/ml/utils/logger
</code></pre>
<h3 id="5installdvc">5. Install DVC</h3>
<p>Follow the instructions on <a href="https://dvc.org/doc/install">DVC's website</a> on how to install. I installed it in my virtual environment using pip:</p>
<pre><code>&gt; pip install dvc
&gt; pip install 'dvc[s3]'  # needed for working with S3 via boto3
</code></pre>
<h3 id="6gatheryourdatasetanduploadittoaremotestoragespace">6. Gather your dataset and upload it to a remote storage space</h3>
<p>I uploaded the example dataset to an S3 bucket, named <em>sent-analysis</em>.</p>
<h3 id="7setupdvc">7. Setup DVC</h3>
<h4 id="71initialisation">7.1 Initialisation</h4>
<p>If you have cloned the repository, DVC is already initialised in the <code>ml</code> folder; however, if you are working on a project from scratch you will need to initialise it by running <code>dvc init</code>. This will create a <code>.dvc/</code> directory in the project.</p>
<h4 id="72linktoremotestorage">7.2 Link to remote storage</h4>
<p>For your data to be accessible from different environments, you can save it in a remote storage location, e.g. an S3 bucket, and link DVC to it by running:</p>
<pre><code>dvc remote add s3-remote s3://&lt;bucket_name&gt;/
</code></pre>
<p>You will notice that a <code>config</code> file has been created in the <code>.dvc/</code> directory, linking to the above remote storage. This makes it very easy to push and pull data to your s3 bucket. For example, a coworker can easily download the dataset to their local machine by running <code>dvc pull</code> in the cloned repository. If any changes are made to the data on your local machine, you can sync the remote data by running <code>dvc push</code>.</p>
<h3 id="8createpipelines">8. Create pipelines</h3>
<h4 id="what">What?</h4>
<p>A pipeline is a well-structured, easily maintainable workflow. Each pipeline is made of multiple steps, or <em>stages</em> as DVC calls them.</p>
<h4 id="why">Why?</h4>
<p>In a machine learning pipeline, all the artefacts need to be traceable, and any change in any stage should trigger the reproduction of all the affected stages.</p>
<h4 id="how">How?</h4>
<p>You can create stages with DVC using <code>dvc run</code>. The name, inputs, and outputs of each stage are defined in this command. Running a stage creates a <code>dvc.yaml</code> file that records all the dependencies of the corresponding stage (and of any other stages run in the project).<br>
Take for example the model testing step. To create a stage, all the data, scripts and outputs that should be tracked are included in the stage definition:</p>
<pre><code>dvc run -n test_model \
-d test/__main__.py \
-d data/test_data.parquet \
-d model/lr.model \
-M metrics/model_performance.json \
docker-compose run sentiment.ml &quot;python -m test .&quot;
</code></pre>
<p>Note the different flags used in the snippet above: <code>-n</code> assigns the stage its name. All the stage dependencies, i.e. the script that tests the model (<code>test/__main__.py</code>), <code>test_data.parquet</code>, and <code>lr.model</code>, are tagged with <code>-d</code>. The performance metric file is given the metrics flag, <code>-M</code>.<br>
In a training pipeline, the testing stage is usually the last step, after pulling the data, preprocessing it and training a model. Each step can be created with <code>dvc run</code> similar to the test stage above. By declaring all the dependencies in each stage, any change to an artefact will propagate across all affected stages.</p>
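<p>The propagation logic can be sketched in a few lines of Python: given each stage's dependencies and outputs, a change to any artefact marks every downstream stage stale. This toy model (stage and artefact names from the pipeline in this post; the logic is my own simplification, not DVC's code) shows why editing the raw data re-runs everything while editing the training script re-runs only <code>train_model</code> and <code>test_model</code>:</p>

```python
# each stage maps to (dependencies, outputs); artefact names match the pipeline above
STAGES = {
    "pull_data_from_s3": ({"s3://sent-analysis/reviews.tsv"}, {"data/reviews.tsv"}),
    "prepare_data": ({"data/reviews.tsv", "prepare_data/__main__.py"},
                     {"data/train_data.parquet", "data/test_data.parquet"}),
    "train_model": ({"data/train_data.parquet", "train/__main__.py"}, {"model/lr.model"}),
    "test_model": ({"data/test_data.parquet", "model/lr.model", "test/__main__.py"}, set()),
}


def stale_stages(changed):
    """Return the set of stages that must re-run after the given artefacts changed."""
    changed = set(changed)
    stale = set()
    progressed = True
    while progressed:  # iterate to a fixpoint so staleness cascades downstream
        progressed = False
        for name, (deps, outs) in STAGES.items():
            if name not in stale and deps & changed:
                stale.add(name)
                changed |= outs  # outputs of a stale stage count as changed too
                progressed = True
    return stale
```

<p>For example, <code>stale_stages(["train/__main__.py"])</code> returns only the training and testing stages, which is exactly the behaviour shown in the experiments later in this post.</p>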
<h4 id="definepipelinestagesinashellscript">Define pipeline stages in a shell script</h4>
<p>Running individual <code>dvc run</code> commands from the terminal one by one makes the stages hard to reproduce and modify. I find it easier to include all stages in a shell script. That way all the stages are in one place, and can easily be reproduced on another machine, modified, and version controlled with Git.<br>
The shell script <code>training_testing_pipeline.sh</code> in the repository includes all the stages. Each stage can be run individually with <code>./training_testing_pipeline.sh &lt;stage_name&gt;</code> from the <code>ml</code> directory. Alternatively, all stages can be run together by omitting the stage name, or by running <code>dvc repro</code>.<br>
Below is a description of each stage in the order they should run:</p>
<h5 id="pullingdata">Pulling data</h5>
<pre><code>if [[ &quot;$*&quot; =~ &quot;pull_data_from_s3&quot; ]]; then
   dvc run -n pull_data_from_s3 \
   -d s3://sent-analysis/reviews.tsv \
   -o ./data/reviews.tsv \
   aws s3 cp s3://sent-analysis/reviews.tsv ./data/reviews.tsv
fi
</code></pre>
<p>The first stage, <code>pull_data_from_s3</code>, is responsible for pulling the data from the remote storage, s3, and saving it to a specified path, in this case <code>data/</code>. Running this stage will create a <code>data/</code> directory in your local repository and download the dataset, <code>reviews.tsv</code>, to it (assuming you have uploaded this file to an s3 bucket named <em>sent-analysis</em> and set it up as the remote storage in step 7.2). Note that the output, <code>./data/reviews.tsv</code>, is tagged with <code>-o</code>.<br>
Running this stage will create or add details to <code>dvc.yaml</code>:</p>
<pre><code>pull_data_from_s3:
  cmd: aws s3 cp s3://sent-analysis/reviews.tsv ./data/reviews.tsv
  deps:
    - s3://sent-analysis/reviews.tsv
  outs:
    - ./data/reviews.tsv
</code></pre>
<p>Another file that gets created or modified when running a stage is <code>dvc.lock</code>. This file contains the md5 hash of each artefact, which allows changes in any part of the pipeline to be tracked: any change results in a new hash value, which in turn triggers all dependent stages to be recomputed and given new hashes. I will show this in an example in the following section. Below is the <code>dvc.lock</code> output from running the <code>pull_data_from_s3</code> stage:</p>
<pre><code>pull_data_from_s3:
  cmd: aws s3 cp s3://sent-analysis/reviews.tsv ./data/reviews.tsv
  deps:
    - path: s3://sent-analysis/reviews.tsv
      etag: d8a6be2d1deb19f9cd76f1b69a793b5f
      size: 8481022
  outs:
    - path: ./data/reviews.tsv
      md5: d8a6be2d1deb19f9cd76f1b69a793b5f
      size: 8481022
</code></pre>
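<p>The mechanism behind these hashes is easy to demonstrate with Python's <code>hashlib</code>: any change to a file's bytes produces a different md5 digest, which is what lets DVC detect that an artefact, and therefore every stage that depends on it, is out of date (a toy illustration with made-up file contents, not DVC's actual code):</p>

```python
import hashlib


def md5_of(data: bytes) -> str:
    """md5 digest of a byte string, as DVC would record for a file's contents."""
    return hashlib.md5(data).hexdigest()


original = b"PhraseId\tSentenceId\tPhrase\tSentiment\n1\t1\tA series of escapades\t1\n"
edited = original + b"2\t1\tdemonstrating the adage\t2\n"

# the digests differ, so a stage depending on this file would be marked for re-run
assert md5_of(original) != md5_of(edited)
# hashing is deterministic: unchanged content keeps the digest recorded in dvc.lock
assert md5_of(original) == md5_of(original)
```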
<h5 id="preparingdata">Preparing data</h5>
<p>Now that the dataset is pulled from the remote storage, it is time for the preprocessing step.</p>
<pre><code>if [[ &quot;$*&quot; =~ &quot;prepare_data&quot; ]]; then
   dvc run -n prepare_data \
   -d data/reviews.tsv \
   -d prepare_data/__main__.py \
   -o data/train_data.parquet \
   -o data/test_data.parquet \
   docker-compose run sentiment.ml &quot;python -m prepare_data .&quot;
fi
</code></pre>
<p>A few things to note:</p>
<ol>
<li><code>data/reviews.tsv</code> was an output in the previous stage but is a dependency in this stage. This follows the logical flow of data in the pipeline, i.e. the data is collected and passed down to the preprocessing step. Declaring the dependency of <code>prepare_data</code> on <code>reviews.tsv</code> means that any change to the data which is identified by the md5 hash in <code>dvc.lock</code> will trigger its dependent stage, <code>prepare_data</code> to run again.</li>
<li>A stage can have multiple outputs, in this case, training data and testing data.</li>
<li>The command at this stage is run in Docker through docker-compose.</li>
</ol>
<p>Let's take a look at the <code>dvc.lock</code> content created after running this stage:</p>
<pre><code>prepare_data:
  cmd: docker-compose run sentiment.ml &quot;python -m prepare_data .&quot;
  deps:
   - path: data/reviews.tsv
     md5: d8a6be2d1deb19f9cd76f1b69a793b5f
     size: 8481022
   - path: prepare_data/__main__.py
     md5: fd698d5634e84fe9ea713e10c59154ca
     size: 719
  outs:
   - path: data/test_data.parquet
     md5: a9a52c60878a1cc92cef68ee8a4eaaae
     size: 206588
   - path: data/train_data.parquet
     md5: 57fe388a9b3d96ed734ddad601941a53
     size: 808111
</code></pre>
<p>Notice that <code>reviews.tsv</code> is identified by the same md5 hash, <code>d8a6be2d1deb19f9cd76f1b69a793b5f</code>, as in the previous stage. These unique hashes make it possible to pinpoint discrepancies in the machine learning pipeline when different versions of artefacts end up in different stages.</p>
<h5 id="trainingthemodel">Training the model</h5>
<p>The previous stages have created the data needed to train the model. Now we can run the scripts inside the <code>train</code> directory and create a model.</p>
<pre><code>if [[ &quot;$*&quot; =~ &quot;train_model&quot; ]]; then
   dvc run -n train_model \
   -d train/__main__.py \
   -d data/train_data.parquet \
   -o model/lr.model \
   docker-compose run sentiment.ml &quot;python -m train .&quot;
fi
</code></pre>
<p>Note that the training script fits a logistic regression model on the data and saves it in the <code>model</code> directory, which is tracked via the output flag and gets a unique md5 hash in <code>dvc.lock</code>. Below is the <code>dvc.lock</code> output from running this stage:</p>
<pre><code> train_model:
  cmd: docker-compose run sentiment.ml &quot;python -m train .&quot;
  deps:
   - path: data/train_data.parquet
     md5: 57fe388a9b3d96ed734ddad601941a53
     size: 808111
   - path: train/__main__.py
     md5: 0b044356b96a8419fcd64ba7087c225d
     size: 939
  outs:
   - path: model/lr.model
     md5: afb779dc7ffb7b659c1d16937c57b910
     size: 373778
</code></pre>
<h5 id="testingthemodel">Testing the model</h5>
<p>This stage is important for both evaluating the model and also comparing this model with other candidate models.</p>
<pre><code>if [[ &quot;$*&quot; =~ &quot;test_model&quot; ]]; then
   dvc run -n test_model \
   -d test/__main__.py \
   -d data/test_data.parquet \
   -d model/lr.model \
   -M metrics/model_performance.json \
   docker-compose run sentiment.ml &quot;python -m test .&quot;
fi
</code></pre>
<p>As seen, the <code>test_model</code> stage depends on the outputs of all previous stages.<br>
Note that the output of the performance evaluation script is tagged with the metrics flag, <code>-M</code>.</p>
<h3 id="9visualisethepipelineflow">9. Visualise the pipeline flow</h3>
<p>Now that all training and testing stages have been defined, we can visualise the flow of data in the pipeline by running <code>dvc dag</code>:</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2022/04/Screenshot-2022-04-03-at-12.35.56-PM.png" class="kg-image" alt="Machine Learning How To: Reproducible and Trackable Training"><figcaption>The flow of data and illustration of how one stage depends on the previous stage</figcaption></figure><!--kg-card-begin: markdown--><h3 id="10runexperimentsandcompareresults">10. Run experiments and compare results</h3>
<p>It is time to see the pipeline's value in action. Let's consider two scenarios:</p>
<h4 id="modifyingthemodelshyperparameters">Modifying the model's hyper-parameters</h4>
<p>Say that you have trained the model by running the shell script, but want to make changes to its hyper-parameters. Let's see how we can easily switch between versions of the model like switching between different code versions on Git branches.<br>
For illustration purposes, I branched off of main by creating a branch named <code>exp</code> and tweaked the test size in <code>model_params.py</code>. This should trigger recomputation of all three stages, <code>prepare_data</code>, <code>train_model</code> and <code>test_model</code>.<br>
I will run the pipeline as an <em>experiment</em> with the command <code>dvc exp run -n more-test-data --no-run-cache</code>. Note that <code>more-test-data</code> is the name I gave this experiment.<br>
As expected, this command skips the first stage of the pipeline since the dataset has not changed. Below is the output from the terminal:</p>
<pre><code>Stage 'pull_data_from_s3' didn't change, skipping                                                                                                       
Running stage 'prepare_data':
&gt; docker-compose run sentiment.ml &quot;python -m prepare_data .&quot;
Creating ml_sentiment.ml_run ... done
Updating lock file 'dvc.lock'                                                                                                                           

Running stage 'train_model':
&gt; docker-compose run sentiment.ml &quot;python -m train .&quot;
Creating ml_sentiment.ml_run ... done
Updating lock file 'dvc.lock'                                                                                                                           
Running stage 'test_model':
&gt; docker-compose run sentiment.ml &quot;python -m test .&quot;
Creating ml_sentiment.ml_run ... done
Updating lock file 'dvc.lock'     
</code></pre>
<p>The results of this experiment can be compared with the main branch, and other experiments (if any) by running <code>dvc exp show</code>. Below is a screenshot of the result:</p>
<!--kg-card-end: markdown--><p></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2022/04/Screenshot-2022-04-03-at-6.58.15-PM.png" class="kg-image" alt="Machine Learning How To: Reproducible and Trackable Training" srcset="https://cherrypicked.dev/content/images/size/w600/2022/04/Screenshot-2022-04-03-at-6.58.15-PM.png 600w, https://cherrypicked.dev/content/images/size/w1000/2022/04/Screenshot-2022-04-03-at-6.58.15-PM.png 1000w, https://cherrypicked.dev/content/images/size/w1600/2022/04/Screenshot-2022-04-03-at-6.58.15-PM.png 1600w, https://cherrypicked.dev/content/images/2022/04/Screenshot-2022-04-03-at-6.58.15-PM.png 2260w" sizes="(min-width: 720px) 720px"><figcaption>Comparison of performance metrics of two experiments</figcaption></figure><!--kg-card-begin: markdown--><p>This experiment produced three new files that are tracked by DVC, <code>train_data.parquet</code>, <code>test_data.parquet</code> and <code>lr.model</code>. This means that the hashes of these files have also been updated in <code>dvc.lock</code>. The image below shows the changes in the lock file.</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2022/04/Screenshot-2022-04-03-at-7.14.52-PM.png" class="kg-image" alt="Machine Learning How To: Reproducible and Trackable Training" srcset="https://cherrypicked.dev/content/images/size/w600/2022/04/Screenshot-2022-04-03-at-7.14.52-PM.png 600w, https://cherrypicked.dev/content/images/size/w1000/2022/04/Screenshot-2022-04-03-at-7.14.52-PM.png 1000w, https://cherrypicked.dev/content/images/2022/04/Screenshot-2022-04-03-at-7.14.52-PM.png 1176w" sizes="(min-width: 720px) 720px"><figcaption>The changes made in dvc.lock after modifying the test size.</figcaption></figure><!--kg-card-begin: markdown--><p>I add the changed files to git and push them to the remote repository. I then run <code>dvc push</code> to push the three DVC tracked files to the remote storage.<br>
This allows anyone else working on this code to access both sets of files just by switching the branches and pulling the corresponding data from DVC. In other words, when a coworker checks out the <code>exp</code> branch and runs <code>dvc pull</code> the <code>dvc.lock</code> file dictates which versions of the three train, test and model files should be downloaded.</p>
<h4 id="comparingdifferentmodels">Comparing different models</h4>
<p>Now let's change the model itself. For this, I branch off of the main branch into <code>svm-exp</code>, modify the main script inside the train folder and fit an SVM model instead of the logistic regression model.<br>
This change should only affect the last two stages, as it relates to training and testing the model on the same data produced by <code>prepare_data</code>. Below is the output of training an SVM model and running <code>dvc exp show</code>. As seen, we can easily compare the results of the main branch with the other two experiments, one from increasing the test data and one from training a different model. It is clear from the results that the SVM model from the <code>svm-exp</code> experiment outperforms the other two.<br>
Now I can add the changes, push the DVC tracked files, create a pull request on GitHub and get the better performing model merged.</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card"><img src="https://cherrypicked.dev/content/images/2022/04/Screenshot-2022-04-03-at-7.47.06-PM.png" class="kg-image" alt="Machine Learning How To: Reproducible and Trackable Training" srcset="https://cherrypicked.dev/content/images/size/w600/2022/04/Screenshot-2022-04-03-at-7.47.06-PM.png 600w, https://cherrypicked.dev/content/images/size/w1000/2022/04/Screenshot-2022-04-03-at-7.47.06-PM.png 1000w, https://cherrypicked.dev/content/images/size/w1600/2022/04/Screenshot-2022-04-03-at-7.47.06-PM.png 1600w, https://cherrypicked.dev/content/images/2022/04/Screenshot-2022-04-03-at-7.47.06-PM.png 2252w" sizes="(min-width: 720px) 720px"></figure><!--kg-card-begin: markdown--><p>In this blog post, I described the steps to generate easy-to-read, reproducible code for training machine learning models. The next posts will build on top of this article and discuss the next steps after deciding on the model for production use.<br>
I hope you enjoyed this post and use some of these ideas in your own projects. If you like reading about machine learning and natural language processing, follow me on <a href="https://twitter.com/mary_flh">twitter</a> for more content!</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[An Introduction to Reinforcement Learning]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>Emulating the natural learning process has been an inspiration for developing many machine learning algorithms. Reinforcement learning is one of them.<br>
We have all encountered many occasions as kids where we learnt a behaviour through positive or negative reinforcement. Think of the first time you touched a thorny rose stem</p>]]></description><link>https://cherrypicked.dev/reinforcement-learning/</link><guid isPermaLink="false">60c3f60ae4c36f1779ec93a8</guid><dc:creator><![CDATA[Maryam Fallah]]></dc:creator><pubDate>Mon, 21 Jun 2021 01:31:12 GMT</pubDate><media:content url="https://cherrypicked.dev/content/images/2021/06/jason-leung-j6QZXBVysE8-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://cherrypicked.dev/content/images/2021/06/jason-leung-j6QZXBVysE8-unsplash.jpg" alt="An Introduction to Reinforcement Learning"><p>Emulating the natural learning process has been an inspiration for developing many machine learning algorithms. Reinforcement learning is one of them.<br>
We have all encountered many occasions as kids where we learnt a behaviour through positive or negative reinforcement. Think of the first time you touched a thorny rose stem and figured that next time you should probably grab it more gently. This was a learning experience through negative reinforcement. When your parents praised you or got you a gift when you scored an A on that math test, that was a positive reinforcement learning instance where you learnt that studying hard and getting good grades is good and something to be repeated, or at least your parents thought that.<br>
But these learning encounters do not stop at childhood. For instance, you are more likely to remember (learn) to buckle up if the car doesn't stop beeping until you secure your seatbelt (negative reinforcement). You have likely implemented some form of reinforcement to train your dog, and the list goes on...<br>
The common factor in all these learning examples is the presence of some <strong>reinforcer</strong>, positive or negative, that <strong>increases a desired behaviour</strong>, e.g. avoiding thorns, getting As, closing your seatbelt, etc. These reinforcers can be a natural consequence of our behaviour as is the case when we feel pain by touching sharp thorns, or they could be an external intervention as was the case when your parents got you a gift for being a hard working student.<br><br>
Seeing the effectiveness of this learning paradigm, the machine learning community tried simulating it. We call this subset of machine learning reinforcement learning (RL). There are various types of RL algorithms, applied to different domains. The most common application is in gaming; a famous example is DeepMind's AlphaGo, which beat the world Go champion. You can find a full documentary of this project <a href="https://www.youtube.com/watch?v=WXuK6gekU1Y">here</a>.<br>
Other use cases of RL are in the finance sector for optimising clients' portfolios or in recommendation engines where the goal is to recommend the products that the user will most likely purchase.<br>
But the question is: how are these RL algorithms designed to replicate the natural learning process we experience, in order to learn Go or find the best recommendation?<br>
In this article, I will focus on the basic terminology and different types of RL algorithms.</p>
<h3 id="definitionandterminology">Definition and terminology</h3>
<p>I briefly touched upon what RL is in a broader scope, but what is its definition and objective in the machine learning world?<br>
RL is a subset of machine learning which involves creating software that can <em>learn</em> a <em>strategy</em> for <em>behaving</em> in a <em>desired way</em>. In order to understand this better let's first define some terminology:</p>
<h4 id="environmenteandagenta">Environment (E) and Agent (A)</h4>
<p>The agent is the entity that we want to teach desired behaviours. This could be a robot we are training to play football or the virtual player in a chess game. The environment is the system that the agent interacts with in its learning process. In the robot example, it's the football field, and in the chess game it's the chess board.</p>
<h4 id="states">State (S)</h4>
<p>The environment is made up of smaller units called states that contain information about the environment. For example, each position on the chess board is a state that has positions to its top, bottom, left, right and at a particular time may have the opponent's bishop to its right. This information helps the agent in deciding how to interact with the environment.</p>
<h4 id="actiona">Action (a)</h4>
<p>Actions are the execution of the agent's decisions. It's the behaviour it performs in interacting with the environment.</p>
<h4 id="rewardr">Reward (R)</h4>
<p>Reward is the core of RL. It's the reinforcer of the desired behaviour we hope the agent will learn. It is a scalar fed back by the environment based on the action the agent takes. In other words, the reward signal is an immediate measurement of the agent's progress.</p>
<h4 id="policymathbfpi">Policy ($\mathbf{\pi}$)</h4>
<p>Policy is the strategy that the agent employs for deciding its next action.</p>
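To make these terms concrete, here is a minimal sketch of the agent-environment loop in Python. The corridor environment and random policy below are illustrative toys of my own, not part of any particular RL library:

```python
import random

# Toy environment: a 1-D corridor with states 0..4.
# Reaching state 4 ends the episode with reward +1; every other step gives 0.
class CorridorEnv:
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):  # action: -1 (left) or +1 (right)
        self.state = max(0, min(4, self.state + action))
        done = self.state == 4
        reward = 1.0 if done else 0.0
        return self.state, reward, done

def random_policy(state):
    # pi(s): the agent's strategy for picking its next action
    return random.choice([-1, +1])

env = CorridorEnv()
state = env.reset()
total_reward, done = 0.0, False
while not done:
    action = random_policy(state)            # agent decides under pi
    state, reward, done = env.step(action)   # environment returns S', R
    total_reward += reward

print(total_reward)  # 1.0 -- the random walk eventually reaches the goal
```

Even with a purely random policy, the episode eventually ends; a learning algorithm's job is to improve the policy so it reaches high-reward states faster.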
<h4 id="valuefunctionvs">Value function, V(s)</h4>
<p>Value function is a function of states, $\mathbf{s}$, and is a measure of the <em>goodness</em> of any given state. We consider goodness to be equivalent to the expected future rewards. In other words, if being in state A opens the path to states with high rewards that the agent can take next (high expected future rewards), then state A has high value.<br>
Formally, the value function is represented as below:</p>
<div style="margin-left:-60px;">
$\begin{aligned}V^\pi({s_t}) = E_{\pi}[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + ... | s_t=s] = E_{\pi}[\sum_{k=0}^{\infty}\gamma^kr_{t+k} | s_t=s]\end{aligned}\tag{1}$
&NewLine;
</div>
This equation is computing the accumulation of all the future rewards that are expected given that the agent is in state $s$ at time $t$ and takes actions under policy $\pi$. The term $\gamma$ that acts as a weight for future rewards is called the *discount factor* and usually has a positive value less than 1. This discount factor reduces the future rewards because they are delayed as opposed to the immediate reward received at time $t$.<br><br>
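As a concrete, if simplified, illustration, the inner sum of Eq. 1 can be computed for a single realised sequence of rewards; the value function is the expectation of this quantity under the policy:

```python
# Discounted return G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
# for one finite, realised reward sequence (numbers are illustrative).
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# A delayed reward is worth less than an immediate one:
print(discounted_return([1.0, 0.0, 0.0]))  # 1.0
print(discounted_return([0.0, 0.0, 1.0]))  # 0.81
```
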
<h4 id="qfunctionqsa">Q function, Q(s, a)</h4>
<p>The Q function is similar to the value function, with the difference that it measures the goodness of a given state-action pair. That is, V(s) computes the expected future reward over <strong>all possible actions</strong> under the policy, whereas Q(s,a) computes the expected future reward <strong>of each individual action</strong>. We can formulate this goodness measure as Eq.2.</p>
<div style="text-align:center;">
$\begin{aligned}Q^\pi({s_t, a_t}) = E_{\pi}[\sum_{k=0}^{\infty}\gamma^kr_{t+k} | s_t=s, a_t=a]\end{aligned}\tag{2}$
&NewLine;
</div>
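For small problems, the Q function can literally be a lookup table. The sketch below (with made-up numbers) also shows how a greedy policy recovers a state value from Q, since under a greedy policy $V(s) = \max_a Q(s, a)$:

```python
# Illustrative tabular Q function: (state, action) -> expected return.
Q = {
    ("s0", "left"): 0.1, ("s0", "right"): 0.7,
    ("s1", "left"): 0.4, ("s1", "right"): 0.2,
}

def greedy_value(state):
    # V(s) under a greedy policy: the best Q value available in s
    return max(q for (s, a), q in Q.items() if s == state)

def greedy_action(state):
    # the action achieving that best Q value
    return max(((a, q) for (s, a), q in Q.items() if s == state),
               key=lambda aq: aq[1])[0]

print(greedy_value("s0"))   # 0.7
print(greedy_action("s0"))  # right
```
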
<h4 id="model">Model</h4>
<p>A model is a description of the environment in terms of a distribution over rewards and states. In other words, if we know the probability of transitioning to state $s'$ and receiving reward $r$ when taking action $a$ in state $s$, we have a model of the environment.</p>
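In tabular form, a model can be as simple as a dictionary mapping each (state, action) pair to a distribution over (next state, reward) outcomes. All states and probabilities below are illustrative:

```python
# Model: (state, action) -> list of ((next_state, reward), probability).
model = {
    ("s0", "right"): [(("s1", 0.0), 0.9), (("s0", 0.0), 0.1)],
    ("s1", "right"): [(("goal", 1.0), 1.0)],
}

def transition_prob(s, a, s_next, r):
    # P(s', r | s, a): probability of landing in s_next with reward r
    return sum(p for ((s2, r2), p) in model[(s, a)]
               if s2 == s_next and r2 == r)

print(transition_prob("s0", "right", "s1", 0.0))  # 0.9
```
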
<h4 id="thereinforcementlearningobjective">The Reinforcement Learning Objective</h4>
<p>Maximising the accumulated reward is the main objective of reinforcement learning problems. This can be done by maximising the Q function, for example, or by learning the policy directly. Each approach encompasses various algorithms.</p>
<h3 id="differenttypesofrlalgorithms">Different Types of RL algorithms</h3>
<p>RL algorithms fit into different categories:</p>
<h4 id="modelbasedvsmodelfreealgorithms">Model-based vs. model-free algorithms:</h4>
<p>Algorithms that use a function of state transitions in the environment are model-based methods, whereas algorithms that do not consider any information about the environment are model-free.<br>
Model-based approaches have the advantage of capturing all the information there is about the problem, including the dynamics of the environment. You can think of it as simulating a blueprint of the environment, where you can easily give it an unseen frame of a video game and ask what the next frame will be. However, this approach has some downsides. One is that it requires heavy computation and is likely to consider details that are irrelevant to the game itself. To clarify, consider using a model-based RL algorithm to learn how to play an Atari game. The model-based approach will first learn the essence of the game, i.e. how certain actions transition the current frame to the next frame. Assuming there is some random video playing in the background, there will be a lot of action-independent information in the frames that the algorithm considers relevant, and it will therefore spend a lot of compute and capacity learning dynamics that have no effect on the actual game.<br>
Another disadvantage is that once the environment's dynamics are learnt, going from this model to the actual policy for determining the agent's actions is non-trivial and can also be computationally expensive.</p>
<h4 id="valuebasedvspolicybasedalgorithms">Value-based vs. Policy-based algorithms:</h4>
<p>In value-based methods, the model optimises the value function and derives the policy from the optimal value function, i.e. it picks the action that leads to the state with the highest value. This approach is closer to the true objective of maximising the cumulative reward, but still has the issue of focusing on less important details. For example, consider a robot in a grid world that wants to learn the optimal policy. The policy in this setting may happen to be as simple as &quot;always go up&quot;, which is fairly easy to learn; however, the value optimisation process from which this simple policy is derived may not be as straightforward. This can be due to various observations the robot makes in the grid that may slow it down temporarily, e.g. some object pops up and gets in the way. These observations change the value function, and we would need a rich value function approximator such as a deep neural network to capture them, whereas the policy itself is completely blind to the observations and is very simple. So in this example, learning the policy directly is easier and more efficient. This is what policy-based algorithms do: they learn the policy explicitly by optimising a policy objective function, which is the true objective of RL. Using all the data only to update the policy, however, may not always be beneficial, as some samples contain little information about the policy and a lot about the environment, and discarding that valuable knowledge is wasteful.<br>
There are approaches that combine the benefits of value-based and policy-based learning, such as actor-critic learning.</p>
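A rough sketch of the two routes to a policy (all numbers illustrative): value-based methods imply the policy from value estimates, typically acting greedily with occasional exploration, while policy-based methods represent the policy explicitly, e.g. as a softmax over learnable action preferences:

```python
import math
import random

# Value-based: the policy is *implied* by the value estimates --
# act greedily, with probability epsilon of exploring at random.
def epsilon_greedy(q_values, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Policy-based: the policy itself is the learnable object, here a
# softmax over per-action preferences theta.
def softmax_policy(theta):
    exps = [math.exp(t) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]  # probability of taking each action

probs = softmax_policy([2.0, 0.0, 0.0])
print(epsilon_greedy([0.1, 0.7, 0.2], epsilon=0.0))  # 1 -- the greedy pick
print(round(probs[0], 3))  # the preferred action gets most probability
```
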
<h4 id="offpolicyvsonpolicyalgorithms">Off-policy vs. on-policy algorithms:</h4>
<p>Some RL algorithms, such as Q-learning, update the estimated return at each iteration irrespective of the policy at hand, whereas on-policy algorithms such as SARSA estimate the return for state-action pairs assuming the current policy continues to be followed. To clarify, in off-policy learning the agent interacts with the environment under a policy by taking actions. These interactions are saved into a &quot;buffer&quot; along with all the other interactions made thus far. At each time-step the policy is updated based on samples from <em>all the past interactions</em> in this buffer and not solely from the most recent policy. In on-policy learning, however, there is no buffer and the policy gets updated based on the agent's most recent interaction with the environment.</p>
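The difference shows up directly in the update rules. Here is a sketch with a tabular Q and illustrative state/action names; alpha is the learning rate:

```python
# Off-policy Q-learning bootstraps from the *greedy* next action,
# regardless of which action the behaviour policy actually takes:
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# On-policy SARSA bootstraps from a_next, the action the current
# policy actually selected in s_next:
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Q = {("s", "x"): 0.0, ("t", "x"): 1.0, ("t", "y"): 0.0}
q_learning_update(Q, "s", "x", 0.0, "t", ["x", "y"])
print(round(Q[("s", "x")], 3))  # 0.09 -- bootstrapped from the greedy action
```
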
<h3 id="conclusion">Conclusion</h3>
<p>In this article, I gave a brief introduction to what RL is and the categories that different RL algorithms fit into.<br>
I hope you enjoyed learning about the basics of reinforcement learning. If you like reading about machine learning, natural language processing and their applications, follow me on <a href="https://twitter.com/mary_flh">twitter</a> for more content!</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Semantic Similarity Measurement in Clinical Text]]></title><description><![CDATA[<p>Transformer-based language models are powerful tools for a variety of natural language processing tasks such as text similarity, question answering, translation, etc.<br>What are the limitations of these models in specific domains such as biomedicine? How do various domain-specific language models compare to one another? Do models trained on biomedical</p>]]></description><link>https://cherrypicked.dev/semantic-similarity-measurement-in-clinical-text/</link><guid isPermaLink="false">6063b590e4c36f1779ec9398</guid><dc:creator><![CDATA[Maryam Fallah]]></dc:creator><pubDate>Tue, 30 Mar 2021 23:38:16 GMT</pubDate><media:content url="https://cherrypicked.dev/content/images/2021/03/photo_2021-03-30_19-37-32.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://cherrypicked.dev/content/images/2021/03/photo_2021-03-30_19-37-32.jpg" alt="Semantic Similarity Measurement in Clinical Text"><p>Transformer-based language models are powerful tools for a variety of natural language processing tasks such as text similarity, question answering, translation, etc.<br>What are the limitations of these models in specific domains such as biomedicine? How do various domain-specific language models compare to one another? 
Do models trained on biomedical data perform well on clinical data?<br>I have delved into these questions in a <a href="https://techblog.ezra.com/semantic-similarity-measurement-in-clinical-text-c34011e67408">blog post</a> I wrote on The Ezra Tech Blog.</p>]]></content:encoded></item><item><title><![CDATA[An Overview of Different Transformer-based Language Models]]></title><description><![CDATA[<p>Language models are models that learn the distribution of words in a language and can be used in a variety of natural language processing tasks such as question answering, sentiment analysis, translation, etc.</p><p>Transformers are encoder-decoder structures that learn word distributions by selectively attending to certain parts of the text,</p>]]></description><link>https://cherrypicked.dev/transformer-based-language-models/</link><guid isPermaLink="false">60577b41e4c36f1779ec90df</guid><dc:creator><![CDATA[Maryam Fallah]]></dc:creator><pubDate>Sun, 21 Mar 2021 17:25:37 GMT</pubDate><media:content url="https://cherrypicked.dev/content/images/2021/03/markus-winkler-ZYvGVyuWxaA-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://cherrypicked.dev/content/images/2021/03/markus-winkler-ZYvGVyuWxaA-unsplash.jpg" alt="An Overview of Different Transformer-based Language Models"><p>Language models are models that learn the distribution of words in a language and can be used in a variety of natural language processing tasks such as question answering, sentiment analysis, translation, etc.</p><p>Transformers are encoder-decoder structures that learn word distributions by selectively attending to certain parts of the text, a mechanism known as attention. 
</p><p>I have written a <a href="https://techblog.ezra.com/an-overview-of-different-transformer-based-language-models-c9d3adafead8">blog post</a> as part of the <a href="https://techblog.ezra.com/an-overview-of-different-transformer-based-language-models-c9d3adafead8">Ezra Tech Blog</a> on different language models that use the Transformer as their main component such as USE, GPT and BERT and how they can be used for embedding input text. Please visit for details on each method, how it compares to other models and how to use them in Python. </p>]]></content:encoded></item><item><title><![CDATA[An Overview of Different Text Embeddings]]></title><description><![CDATA[<p>Text embedding is a method to capture the meaning and context of text and covert those meanings into numerical representations digestible by a computer. This is necessary for any machine learning task that takes text as its input. For example, question answering, text generation, text summarisation, etc.</p><p>I wrote a</p>]]></description><link>https://cherrypicked.dev/text-embeddings/</link><guid isPermaLink="false">601d6370e4c36f1779ec902a</guid><dc:creator><![CDATA[Maryam Fallah]]></dc:creator><pubDate>Sun, 21 Mar 2021 16:57:57 GMT</pubDate><media:content url="https://cherrypicked.dev/content/images/2021/03/raphael-schaller-GkinCd2enIY-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://cherrypicked.dev/content/images/2021/03/raphael-schaller-GkinCd2enIY-unsplash.jpg" alt="An Overview of Different Text Embeddings"><p>Text embedding is a method to capture the meaning and context of text and covert those meanings into numerical representations digestible by a computer. This is necessary for any machine learning task that takes text as its input. 
For example, question answering, text generation, text summarisation, etc.</p><p>I wrote a <a href="https://techblog.ezra.com/different-embedding-models-7874197dc410">blog post</a> as part of the<a href="https://techblog.ezra.com/"> Ezra Tech Blog </a>on different embedding models namely, Word2Vec, GloVe, FastText and ELMo. Please visit for details on each method, how it compares to other models and how to use them in Python.</p>]]></content:encoded></item><item><title><![CDATA[Extracting Training Data from Large Language Models]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>Language models (LM) are machine learning models trained on vast amounts of text to learn the intricacies and relationships that exist among words and concepts; in other words, the distribution over words and sentences. Generally, it has been seen that the larger these models are the more powerful they become</p>]]></description><link>https://cherrypicked.dev/extracting-training-data/</link><guid isPermaLink="false">5fec9297e4c36f1779ec8cb2</guid><dc:creator><![CDATA[Maryam Fallah]]></dc:creator><pubDate>Wed, 06 Jan 2021 01:52:27 GMT</pubDate><media:content url="https://cherrypicked.dev/content/images/2020/12/jason-leung-ncLdDcvrcfw-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://cherrypicked.dev/content/images/2020/12/jason-leung-ncLdDcvrcfw-unsplash.jpg" alt="Extracting Training Data from Large Language Models"><p>Language models (LM) are machine learning models trained on vast amounts of text to learn the intricacies and relationships that exist among words and concepts; in other words, the distribution over words and sentences. Generally, it has been seen that the larger these models are the more powerful they become at handling various natural language tasks that they may not have even been trained for.<br>
We have all heard about GPT-3 developed by OpenAI in June 2020 and have seen various demos and articles about its size and performance.<br>
Language models like GPT-3 and its predecessor GPT-2 use publicly available data on the world wide web as training data. The issue of data privacy may not be a big concern when it is already published and readily available. However, it can be problematic when the data used for training the language model is private, contains personally identifiable information (PII) and yet the data can be retrieved using adversary attacks.<br>
In this article, I will be summarising the methods used by <a href="https://arxiv.org/pdf/2012.07805v1.pdf">Carlini et. al</a> to extract training data from GPT-2. They show that using simple adversary attacks one can query a black box language model for memorised training data.</p>
<h3 id="studiedlanguagemodels">Studied Language Models</h3>
<p>For this study, the authors have used different variants of GPT-2 with a main focus on the biggest version, GPT-2 XL with 1.5 billion parameters. The dataset used for training this model was scraped from the internet, cleaned of HTML, and de-duplicated. This gives a text dataset of approximately 40 GB.</p>
<h3 id="memorisation">Memorisation</h3>
<p>Memorisation is a necessary element in our learning process as humans. We need to memorise some information in order to be able to generalise our learning. Neural networks are not very different in this regard. For example, a neural network model needs to memorise the pattern of postal codes in California before generating a valid postal code when given a prompt such as &quot;My address is 1 Main Street, San Francisco CA&quot;.<br>
Although such an abstract form of memorisation is needed, memorising exact strings of training data especially those containing PII can be problematic and ethically concerning.<br>
It is generally believed that memorisation of training data is a result of overfitting. This study shows that this is not always the case and although LM models such as GPT-2 that was trained on a large dataset over as few as 12 epochs does not overfit, the training loss for some data samples are anomalously low. This low training loss makes the model prune to memorising those training examples.</p>
<h3 id="textgenerationmethods">Text Generation Methods</h3>
<p>To query the LM for training data one has to a) generate text from the LM and b) infer whether the generated sample is a member of the training data. Language models generate text by iteratively sampling tokens, $x_{i+1}$ from some distribution conditioned on the $i$ previously selected tokens. This sampling ends when some stopping criterion is reached.<br>
A number of different text generation methods are considered in this study, which mainly differ in their sampling strategy, i.e. a greedy sampling method that chooses the tokens with the highest likelihood, versus exploration-based sampling methods that allow for more diverse output.</p>
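As a rough sketch of the two ends of this spectrum, given the model's next-token scores (logits), greedy decoding always picks the top token, while temperature sampling draws from the full distribution. The logits below are made up:

```python
import math
import random

def greedy_next(logits):
    # Always pick the highest-scoring token.
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_next(logits, temperature=1.0):
    # Softmax with temperature: higher temperature flattens the
    # distribution and gives more diverse output.
    exps = [math.exp(l / temperature) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return random.choices(range(len(logits)), weights=probs)[0]

logits = [2.0, 1.0, 0.1]
print(greedy_next(logits))                   # 0 -- always the top token
print(sample_next(logits, temperature=2.0))  # any index; 0 is most likely
```
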
<h3 id="membershipinferencemethods">Membership Inference Methods</h3>
<p>Once text is generated from the LM, some measure is needed to infer whether each sample was in the dataset or not. This is called membership inference and there are different ways to implement it.<br>
One way is by measuring the perplexity of a sequence. A low perplexity says that the model is not <em>perplexed</em> or confused by the sequence and if it predicts a high likelihood for the same sequence, it could be indicative that the sequence was memorised from the training data.<br>
Another approach is by comparing the predicted likelihood of a sequence by the LM in question with another LM. The idea here is that if the predicted sequence is not memorised and the predicted score is high simply because the sequence of tokens makes logical sense then the other model that was trained on a different dataset should also predict a high occurrence likelihood for the same sequence. Therefore, a sequence with an <em>unexpectedly high likelihood</em> from GPT-2 compared to another LM can be a signal for memorisation.</p>
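Perplexity is the exponentiated average negative log-likelihood the model assigns to a sequence's tokens. A minimal sketch, where the per-token probabilities are made up for illustration:

```python
import math

def perplexity(token_probs):
    # exp of the average negative log-likelihood over the sequence
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A memorised sequence gets high per-token probability -> low perplexity:
print(round(perplexity([0.9, 0.95, 0.9]), 2))   # 1.09
print(round(perplexity([0.1, 0.2, 0.05]), 2))   # 10.0
```

The second membership inference method then reduces to comparing perplexities: a sequence whose perplexity under the target LM is much lower than under a reference LM is a memorisation candidate.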
<h3 id="results">Results</h3>
<p>The authors generate 200,000 samples using different text generation strategies. Then, for each membership inference method, the top 100 samples are selected. This gives a total of 1800 samples of potentially memorised content, which is verified manually through Google search and by directly querying GPT-2 (after being granted access by OpenAI researchers).<br>
This results in an aggregate true positive rate of 33.5%. Most memorised content is fairly canonical text from news headlines, log files, entries from forums or wikis, or religious texts. However, significant information on individuals, such as addresses and telephone numbers, was also found. Fig.1 shows an example of such memorised data. The PII is blocked out for privacy reasons.<br>
The study also investigated the effect of the number of repetitions of a string in the training set, and of the model size, on memorisation. It is found that a string appearing as few as 33 times in a single document seems to make the model prone to memorising it, and that memorisation increases as the model gets bigger.</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2021/01/Screenshot-2021-01-03-at-4.31.34-PM.png" class="kg-image" alt="Extracting Training Data from Large Language Models" srcset="https://cherrypicked.dev/content/images/size/w600/2021/01/Screenshot-2021-01-03-at-4.31.34-PM.png 600w, https://cherrypicked.dev/content/images/2021/01/Screenshot-2021-01-03-at-4.31.34-PM.png 856w" sizes="(min-width: 720px) 720px"><figcaption>Figure 1. Example of memorised data easily queried from GPT-2 that contains PII</figcaption></figure><!--kg-card-begin: markdown--><h3 id="conclusion">Conclusion</h3>
<p>This work was a proof of concept that although large language models do not overfit, they still memorise some unnecessary information from the training set that may contain private information. Although the dataset for GPT-2 was scraped from the web and therefore may not have been &quot;private&quot;, it is still concerning that black box models may memorise data that can be extracted given the appropriate prompt. This is especially problematic when the data is later removed from the web, or was never released publicly, yet remains exposed through black box models.<br>
Some data curation or privacy measures, e.g. differential privacy, are needed to ensure that the data does not contain sensitive information, but in the case of large datasets there seems to be room for improvement and innovation to ensure security across all data points.<br>
Although this study was submitted after the release of GPT-3 (December 14, 2020), the authors have not mentioned why they chose to focus on GPT-2 instead. However, they did compare the amount of content memorised by GPT-2 XL with its medium and small versions of 334 and 124 million parameters respectively. This comparison showed that the XL model memorised 18 times more information. With this, GPT-3, which is ~100 times larger than GPT-2, is expected to have memorised even more data.<br>
I hope you enjoyed learning about this study. If you would like to read more about machine learning, natural language processing and computer vision, follow me on <a href="https://twitter.com/mary_flh">twitter</a> for more content!</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[This Looks Like That: Deep Learning for Interpretable Image Recognition]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>Model interpretability is a topic in machine learning that has recently gained a lot of attention. Interpretability is crucial in both the development and usage of machine learning models. It helps the developers understand why particular mistakes are made and how they can be improved. It also makes it easier</p>]]></description><link>https://cherrypicked.dev/this-looks-like-that/</link><guid isPermaLink="false">5fc2def1e4c36f1779ec8909</guid><dc:creator><![CDATA[Maryam Fallah]]></dc:creator><pubDate>Tue, 01 Dec 2020 20:24:32 GMT</pubDate><media:content url="https://cherrypicked.dev/content/images/2020/11/nick-fewings-sMsfDEeE0-M-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://cherrypicked.dev/content/images/2020/11/nick-fewings-sMsfDEeE0-M-unsplash.jpg" alt="This Looks Like That: Deep Learning for Interpretable Image Recognition"><p>Model interpretability is a topic in machine learning that has recently gained a lot of attention. Interpretability is crucial in both the development and usage of machine learning models. It helps the developers understand why particular mistakes are made and how they can be improved. It also makes it easier for the end user to see what factors lead to a final decision rather than relying on a vague final score or probability.<br>
In this regard, many specialists recommend using models that are inherently interpretable, e.g. decision trees or <a href="https://arxiv.org/abs/1909.09223">Explainable Boosting Machine (EBM)</a> as they gain comparable results relative to more complex models and are transparent as opposed to black-box models.<br>
However, such additive models are not applicable to all problems. In the computer vision realm for example, deep neural networks are by far the most promising solution to object detection, segmentation, etc. but lack interpretability.<br>
There has been some research into explaining the decisions of deep neural networks. Such post-hoc analyses fit an explanatory model to an already-trained model and therefore do not capture the true decision process.<br>
In this article, I will be summarising the work of <a href="https://arxiv.org/abs/1806.10574">Chen et. al</a> where an interpretable deep neural network is introduced. Unlike other neural networks, this model has the capability to explicitly give reasoning for its classification decisions by showing what parts of the input it thinks are similar to the training images.</p>
<h3 id="motivationandobjective">Motivation and Objective</h3>
<p>How do we as humans identify that a certain image is of a penguin and not a table or a parrot? Our brains are able to scan the image and compare the penguin's head, body shape, colour, etc. with the corresponding body parts of other penguins it has seen in real life or virtually. It also makes the same comparisons with other animals and objects it has seen before and concludes that the picture is most similar to a penguin. Our brain uses these seen instances as <em><strong>prototypes</strong></em> and assigns a similarity score between the image in question and each prototype.<br>
Can a machine learning model be trained to do the same?<br>
Enabling the neural network to reason in the same fashion is the goal of this study. The authors introduce <em>prototypical part network</em> (ProtoPNet) which learns to dissect the training images into prototypical parts, e.g. the penguin head, beak are considered as identifiable prototypes for the class penguin. The model then computes a weighted summation of the similarity scores between different parts of the input image and the learnt prototypes for each class. With <em>this part of the image looks like that part of the training image approach</em>, the model is transparent and the interpretation reflects its <em>actual</em> decision process.</p>
<h3 id="protopnetarchitecture">ProtoPNet Architecture</h3>
<p>The proposed ProtoPNet uses a conventional convolutional neural network (CNN) such as VGG-16, VGG-19, ResNet-34, which has been pre-trained on ImageNet, followed by two additional $1 \times 1$ convolutional layers. The key element in this network is the prototype layer $g_p$ that comes after the CNN and is in charge of comparing the input image patches with some learnt prototypes. $g_p$ is then followed by fully connected layer $h$.<br>
Fig. 1 (from paper), shows this architecture with an example. The input image of a clay colored sparrow, is split into patches which are mapped to a latent space by passing through the convolutional layers. The learnt prototypes $p_j$ are mapped to the same space and represent specific patches in the training set images. If you are familiar with NLP, this mapping is similar to embedding the input words to a vector space where words with similar meanings are closer to each other.<br>
Then, for each input patch and each class, the euclidean distance between the latent representations of the patch and the class prototypes ($p_j$) are computed and inverted to similarity scores. The higher the score, the stronger the chance of a prototypical part being present in the input image. These similarity scores can be seen as an activation map where the areas with higher activity indicate a stronger similarity. We can upsample this activation map to the size of the input image in order to visualise where these areas are in the form of heat maps. This heat map is then reduced to a single similarity score using a global max pooling. For example, in Fig.1 the similarity between a learnt clay colored sparrow head prototype ($p_1$) and the head of the clay colored sparrow in the input is 3.954 and the similarity score between that input patch and a Brewer’s sparrow head prototype ($p_2$) is 1.447 indicating that the model finds the input image patch to be more similar to the first prototype than the second.<br>
Finally, the similarity scores are passed to a fully connected layer to produce output logits which are normalised using a softmax function.</p>
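The distance-to-similarity step for a single prototype can be sketched with NumPy. The shapes and the log-based inversion follow the paper's description, but the random tensors below are placeholders standing in for real conv features and a learnt prototype:

```python
import numpy as np

rng = np.random.default_rng(0)
patches = rng.random((7, 7, 128))   # H x W grid of latent patch embeddings
prototype = rng.random(128)         # one learnt prototype p_j, same space

# Squared euclidean distance of every patch to the prototype -> 7 x 7 map.
dist = ((patches - prototype) ** 2).sum(axis=-1)
# Invert distances into a similarity activation map; the paper uses
# log((d + 1) / (d + eps)), which decreases monotonically with distance.
sim_map = np.log((dist + 1) / (dist + 1e-4))
# Global max pooling reduces the map to one score for this prototype.
score = sim_map.max()
print(sim_map.shape)  # (7, 7) -- the map that gets upsampled into a heat map
```
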
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2020/11/Screenshot-2020-11-30-at-5.07.42-PM.jpg" class="kg-image" alt="This Looks Like That: Deep Learning for Interpretable Image Recognition" srcset="https://cherrypicked.dev/content/images/size/w600/2020/11/Screenshot-2020-11-30-at-5.07.42-PM.jpg 600w, https://cherrypicked.dev/content/images/size/w1000/2020/11/Screenshot-2020-11-30-at-5.07.42-PM.jpg 1000w, https://cherrypicked.dev/content/images/size/w1600/2020/11/Screenshot-2020-11-30-at-5.07.42-PM.jpg 1600w, https://cherrypicked.dev/content/images/size/w2400/2020/11/Screenshot-2020-11-30-at-5.07.42-PM.jpg 2400w" sizes="(min-width: 720px) 720px"><figcaption>Figure 1. Architecture of ProtoPNet</figcaption></figure><!--kg-card-begin: markdown--><h3 id="trainingprotopnet">Training ProtoPNet</h3>
<p>For training the proposed network, the CUB-200-2011 dataset which is a dataset with 200 bird species was used. For the prototype layer, a fixed number, 10 prototypes per class, was considered for learning the most important patches in the training dataset.<br>
In order for the model to learn which prototype patches among the training data of each class are most differentiating, a meaningful latent space needs to be learnt where the semantically similar image patches, e.g. different patches of male peacock feathers are mapped to the same area.<br>
This can be seen as a clustering problem where the goal is to select features, i.e. patch representations in a way that patches belonging to the same class are close to one another and patches from different classes are far away and easily separable.<br>
We can enforce such separation by penalising the prototype patches that are found to be very similar to patches of other classes. For example, a patch that includes a tree branch should be penalised if selected as there can be pictures of birds on tree branches in any of the classes.<br>
Fig.2 shows an example of classifying a test image of a red-bellied woodpecker after training. Latent features, $f(x)$, are computed from the convolutional layers and then a similarity score between $f(x)$ and the latent features of each of the 10 learnt prototypes of each class is computed. In Fig.2 two of these similarity computations are shown. The left image shows the similarity scores of three prototype images of the red-bellied woodpecker class along with the upsampled heat map that shows the areas in the original image that are strongly activated by the similarity computation. These similarity scores are then multiplied with the class weights and summed to give a final score of the input image belonging to that class. The same process happens in the right image but for a red-cockaded woodpecker. Comparing the final scores of the two classes, we see that the model is more confident that the test image belongs to the red-bellied woodpecker class.</p>
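To make this scoring step concrete, here is a minimal NumPy sketch of how prototype similarity scores turn into class probabilities: logits are a weighted sum of the similarities, normalised by a softmax. The shapes, values and function name are hypothetical, not the authors' code.

```python
import numpy as np

def protopnet_class_probs(similarities, class_weights):
    """similarities: (P,) max similarity between f(x) and each of P prototypes.
    class_weights: (C, P) weights of the final fully connected layer."""
    logits = class_weights @ similarities      # weighted sum per class
    exps = np.exp(logits - logits.max())       # numerically stable softmax
    return exps / exps.sum()

# hypothetical example: 2 classes with 3 prototypes each (P = 6)
sims = np.array([0.95, 0.80, 0.70, 0.20, 0.10, 0.30])
weights = np.zeros((2, 6))
weights[0, :3] = 1.0   # class 0 is connected to its own prototypes
weights[1, 3:] = 1.0
probs = protopnet_class_probs(sims, weights)   # class 0 wins here
```

Because the weights tie each class to its own prototypes, the class whose prototypes best match the test image receives the highest probability.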
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2020/11/Screenshot-2020-11-30-at-6.33.28-PM.jpg" class="kg-image" alt="This Looks Like That: Deep Learning for Interpretable Image Recognition" srcset="https://cherrypicked.dev/content/images/size/w600/2020/11/Screenshot-2020-11-30-at-6.33.28-PM.jpg 600w, https://cherrypicked.dev/content/images/size/w1000/2020/11/Screenshot-2020-11-30-at-6.33.28-PM.jpg 1000w, https://cherrypicked.dev/content/images/size/w1600/2020/11/Screenshot-2020-11-30-at-6.33.28-PM.jpg 1600w, https://cherrypicked.dev/content/images/size/w2400/2020/11/Screenshot-2020-11-30-at-6.33.28-PM.jpg 2400w" sizes="(min-width: 720px) 720px"><figcaption>Figure 2. The transparent decision making process of ProtoPNet</figcaption></figure><!--kg-card-begin: markdown--><h3 id="results">Results</h3>
<p>A number of baseline models (CNNs without the prototype layer) were trained on the same training and testing dataset (with the same augmentation) used for ProtoPNet, and as seen in Table.1, the results are very similar, which suggests that interpretable models are a viable alternative to black-box models.</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2020/12/Screenshot-2020-11-30-at-7.01.14-PM.jpg" class="kg-image" alt="This Looks Like That: Deep Learning for Interpretable Image Recognition" srcset="https://cherrypicked.dev/content/images/size/w600/2020/12/Screenshot-2020-11-30-at-7.01.14-PM.jpg 600w, https://cherrypicked.dev/content/images/size/w1000/2020/12/Screenshot-2020-11-30-at-7.01.14-PM.jpg 1000w, https://cherrypicked.dev/content/images/2020/12/Screenshot-2020-11-30-at-7.01.14-PM.jpg 1496w" sizes="(min-width: 720px) 720px"><figcaption>Table 1. Accuracy comparison of ProtoPNet with baseline models</figcaption></figure><!--kg-card-begin: markdown--><h3 id="conclusion">Conclusion</h3>
<p>In this article, I explained ProtoPNet, a deep neural network that is able to reason about classifying input images in a fashion similar to humans. Such transparency is especially important in high-stakes applications such as medical image classification, where a single score indicating whether there is a brain tumor in a given MRI is not sufficient.<br>
I hope you enjoyed learning about interpretable deep learning models and will try implementing them in your own projects. If you like reading about machine learning, natural language processing and computer vision, follow me on <a href="https://twitter.com/mary_flh">twitter</a> for more content!</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Do wide and deep networks learn the same things? Uncovering how neural network representations vary with width and depth]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>When designing deep neural networks we have to decide on its depth/width. These so-called hyper-parameters need to be tuned based on the dataset, available resources, etc. Different studies have compared the performance of wider networks with deeper networks (<a href="https://arxiv.org/pdf/1512.03965.pdf">Eldan et. al</a> &amp; <a href="https://arxiv.org/pdf/1605.07146.pdf">Zagoruyko et. al</a>) but there has been</p>]]></description><link>https://cherrypicked.dev/wide-and-deep-networks-learning/</link><guid isPermaLink="false">5fa8b9cfe4c36f1779ec8001</guid><dc:creator><![CDATA[Maryam Fallah]]></dc:creator><pubDate>Wed, 11 Nov 2020 19:37:52 GMT</pubDate><media:content url="https://cherrypicked.dev/content/images/2020/11/art5.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://cherrypicked.dev/content/images/2020/11/art5.jpg" alt="Do wide and deep networks learn the same things? Uncovering how neural network representations vary with width and depth"><p>When designing deep neural networks we have to decide on its depth/width. These so-called hyper-parameters need to be tuned based on the dataset, available resources, etc. Different studies have compared the performance of wider networks with deeper networks (<a href="https://arxiv.org/pdf/1512.03965.pdf">Eldan et. al</a> &amp; <a href="https://arxiv.org/pdf/1605.07146.pdf">Zagoruyko et. al</a>) but there has been less focus on how these hyper-parameters affect the model beyond its performance. In other words, we do not know enough about how tweaking the network depth/width changes the hidden layer representations and what neural networks actually learn from additional layers/channels.<br>
In this article, I am going to summarise the findings of a <a href="https://arxiv.org/abs/2010.15327">paper</a> published by Google Research in October 2020. This study investigates the impact that increasing the model width and depth has on both internal representations and outputs. In particular, it aims to answer the following questions:</p>
<ol>
<li>How do depth and width affect the internal representations?</li>
<li>Is there any relation between the hidden representations of deeper/wider models and the size of the training dataset?</li>
<li>What happens to the neural network representations as they propagate through the model structure?</li>
<li>Can wide/deep networks be pruned without affecting accuracy?</li>
<li>Are learned representations similar across models of different architectures and different random initialisations?</li>
<li>Do wide and deep networks learn different things?</li>
</ol>
<p>Each section will answer one of these questions, in order.</p>
<h3 id="experimentalsetup">Experimental Setup</h3>
<p>As mentioned in the intro, this paper focuses on the tensors that pass through the hidden layers and how they change with additional layers, etc. In order to compare the internal representations of different model architectures with one another, a similarity metric is needed. In this work, the authors use linear centered kernel alignment (CKA). This metric was first introduced by <a href="https://arxiv.org/pdf/1905.00414.pdf">Google Brain</a> as a means to better understand the behaviour of neural networks. Please refer to that paper for details on how CKA is computed; generally speaking, you can think of it as a similarity metric, like cosine similarity, that computes the closeness of two tensors. I will get into more detail on CKA when answering question 3.<br>
A family of ResNet models and three commonly used image datasets, CIFAR-10, CIFAR-100 and ImageNet, have been used for the experiments.</p>
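For readers who prefer code to formulas, here is a minimal NumPy sketch of linear CKA between two activation matrices (rows are examples, columns are neurons). This follows the published formula but is my own simplification, not code from either paper:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA similarity between activation matrices X (n, p1) and Y (n, p2)."""
    X = X - X.mean(axis=0)                    # centre each feature
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))
Q, _ = np.linalg.qr(rng.normal(size=(32, 32)))
# CKA is invariant to orthogonal transformations of the representation
same = linear_cka(X, X @ Q)                   # close to 1.0
```

The invariance to rotations of the feature space is exactly why CKA is preferred over naive elementwise comparisons when layers of different models (or widths) are compared.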
<h3 id="effectofincreasingdepthwidthoninternalrepresentationsquestion1">Effect of Increasing Depth/Width on Internal Representations (Question 1)</h3>
<p>In order to study the impact of width/depth on hidden layers, a group of ResNets is trained on CIFAR-10. For each network, the representation similarity between all pairs of hidden layers is computed. This pairwise similarity measure gives us a heatmap with the x and y axes representing the layers of the model. The top row of Fig.1 shows such results from 5 ResNets of varying depth (the rightmost being the deepest) and the bottom row shows the outputs of such computations for a different set of ResNets that vary in width (the rightmost being the widest). As we see, the heatmaps initially resemble a checkerboard structure. This is mainly because ResNets have residual blocks, and representations after residual connections are similar to other post-residual representations. However, things get quite interesting as the model gets deeper or wider: a <em>block structure</em> (yellow squares) starts to appear in the heatmaps, suggesting that a considerable number of hidden layers share similarities with each other.<br>
It's worth noting that residual connections are not related to the emergence of block structures as they also appear in networks without them. In addition, the authors compute heatmaps for a deep and a wide model with various random initialisations and see that the block structure is present for each trained model; however, its size and position varies.</p>
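The heatmap computation itself is straightforward once a similarity metric is fixed: compute CKA for every pair of layers. A self-contained sketch, with random Gaussian activations standing in for real hidden-layer outputs:

```python
import numpy as np

def linear_cka(X, Y):
    X, Y = X - X.mean(axis=0), Y - Y.mean(axis=0)
    return (np.linalg.norm(Y.T @ X, "fro") ** 2
            / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))

rng = np.random.default_rng(1)
# hypothetical stand-ins for the activations of 5 hidden layers on 64 examples
layer_acts = [rng.normal(size=(64, 16)) for _ in range(5)]
heatmap = np.array([[linear_cka(a, b) for b in layer_acts] for a in layer_acts])
# heatmap is symmetric with ones on the diagonal, like the figures in the paper
```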
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2020/11/fig1.png" class="kg-image" alt="Do wide and deep networks learn the same things? Uncovering how neural network representations vary with width and depth" srcset="https://cherrypicked.dev/content/images/size/w600/2020/11/fig1.png 600w, https://cherrypicked.dev/content/images/size/w1000/2020/11/fig1.png 1000w, https://cherrypicked.dev/content/images/2020/11/fig1.png 1058w" sizes="(min-width: 720px) 720px"><figcaption>Figure 1. Pairwise similarity between layers for models of varying depth and width</figcaption></figure><!--kg-card-begin: markdown--><h3 id="relationshipbetweentheblockstructureandthemodelsizequestion2">Relationship between the block structure and the model size (Question 2)</h3>
<p>After discovering the emergence of block structures in deeper/wider networks, the authors want to see whether this phenomenon is related to the <em>absolute</em> size of the model or to its <em>relative</em> size to the amount of training data.<br>
In order to answer this question, a model with a fixed architecture is trained on a dataset of varying size. After training, CKA similarity is computed between hidden layer representations as we see in Fig.2. Each column in Fig.2 is a particular ResNet architecture that has been trained on all, 1/4 and 1/16 of the dataset respectively. Each successive column is wider than the previous one. Considering each row, we see that similar to Fig.1, block structure appears as the network gets wider. However, for each model with a fixed width (the columns), block structures emerge when less training data is available.<br>
Experiments on networks with varying <em>depth</em> give similar results (See paper's appendix).<br>
This experiment shows that models that are relatively overparameterised (compared to the amount of training data) contain block structures in their internal representations.</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2020/11/fig2.png" class="kg-image" alt="Do wide and deep networks learn the same things? Uncovering how neural network representations vary with width and depth" srcset="https://cherrypicked.dev/content/images/size/w600/2020/11/fig2.png 600w, https://cherrypicked.dev/content/images/size/w1000/2020/11/fig2.png 1000w, https://cherrypicked.dev/content/images/2020/11/fig2.png 1064w" sizes="(min-width: 720px) 720px"><figcaption>Figure 2. Pairwise similarity between layers for models of varying width (each row) and amount of training data (each column)</figcaption></figure><!--kg-card-begin: markdown--><h3 id="theblockstructureandthefirstprincipalcomponentquestion3">The block structure and the first principal component (Question 3)</h3>
<p>So far we have seen block structures emerge in networks as they get wider or deeper and become overparameterised relative to their training data, but we do not yet know what they actually mean or what information they contain.<br>
Google Brain's paper on CKA shows that for centered matrices, the CKA can be written in terms of the normalised principal components (PC) of the inputs (See eq.1).</p>
<div style="text-align:center;margin-bottom:20px;">
$\begin{aligned}\small CKA(XX^T, YY^T) = \frac{\sum_{i=1}^{p_1}\sum_{j=1}^{p_2}{\lambda}^i_X{\lambda}^j_Y\langle\mathbf{u}_X^i, \mathbf{u}_Y^j\rangle^2}{{\sqrt{\sum_{i=1}^{p_1}({{\lambda}^i_X})^2}}{\sqrt{\sum_{j=1}^{p_2}({{\lambda}^j_Y})^2}}}\end{aligned}\tag{1}$
</div>
where $\mathbf{u}_X^i$ and $\mathbf{u}_Y^j$ are the $i$-th and $j$-th normalised principal components of $\mathbf X$ and $\mathbf Y$, and ${\lambda}^i_X$ and ${\lambda}^j_Y$ are the corresponding squared singular values.
You do not need to fully understand this formula; I included it here to give you a better idea as to why the authors use principal components to explore what happens inside the block structures.<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><p>The objective of principal components is to explain as much variance in the data as possible through a set of orthogonal vectors, i.e. the principal components. Given eq.1, if we assume that the first principal component explains all the variance in the data and therefore only consider this component in the summation, the numerator and denominator cancel out and CKA equals $\langle\mathbf{u}_X^1, \mathbf{u}_Y^1\rangle^2$, the squared alignment between the first PCs of the inputs.<br>
This led to the idea that the layers with high CKA similarity (block structures) will likely have a first PC that explains a large fraction of the variance of the hidden layer representations. In other words, the block structure reflects the behaviour of the first PC of the internal representations. To verify this, the authors compute the fraction of variance explained by the first PC of each layer in both wide and deep ResNets and see whether a block structure appeared for the same group of layers that covered most of the variance. Fig.3 shows the results. The left set of plots are from a deep network and plots on the right show outputs from a wide network. As we see in both sets, the same group of layers that make up the block structure in the top right heatmaps, cover a high fraction of variance shown in the bottom left graphs.<br>
To further investigate the relation between block structures and first PCs, the cosine similarity between the first PCs of all the layers is computed, as we see in the top left plots. This shows that the block structures appear in the same size and location as in the heatmaps from CKA (top right).<br>
In addition, the authors remove the first PC and compute the CKA heatmap. As illustrated in the bottom right plots, without the first PC the block structures are also removed.<br>
These experiments suggest that the first principal component is what is preserved and propagated through the block structure's constituent layers, giving rise to the block structures.</p>
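The two operations used here, measuring the fraction of variance explained by the first principal component and projecting that component out, can be sketched with an SVD on synthetic data. This is my own illustration, not the paper's code:

```python
import numpy as np

def first_pc_fraction(H):
    """Fraction of variance explained by the first PC of activations H (n, d)."""
    s = np.linalg.svd(H - H.mean(axis=0), compute_uv=False)
    return s[0] ** 2 / np.sum(s ** 2)

def remove_first_pc(H):
    """Project out the first principal component of H."""
    Hc = H - H.mean(axis=0)
    U, s, Vt = np.linalg.svd(Hc, full_matrices=False)
    return Hc - np.outer(U[:, 0] * s[0], Vt[0])

rng = np.random.default_rng(0)
# synthetic representation dominated by a single direction v, plus noise
v = rng.normal(size=16)
H = np.outer(rng.normal(size=200), v) * 10 + rng.normal(size=(200, 16))
frac = first_pc_fraction(H)                          # close to 1 for this data
residual_frac = first_pc_fraction(remove_first_pc(H))  # drops sharply
```

This mirrors the paper's diagnostic: where the first PC dominates, removing it erases the block structure from the CKA heatmap.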
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2020/11/fig3.png" class="kg-image" alt="Do wide and deep networks learn the same things? Uncovering how neural network representations vary with width and depth" srcset="https://cherrypicked.dev/content/images/size/w600/2020/11/fig3.png 600w, https://cherrypicked.dev/content/images/size/w1000/2020/11/fig3.png 1000w, https://cherrypicked.dev/content/images/2020/11/fig3.png 1064w" sizes="(min-width: 720px) 720px"><figcaption>Figure 3. Relationship between the block structure of a deep ResNet (left) and a wide ResNet (right) with the first principal component of the hidden layers</figcaption></figure><!--kg-card-begin: markdown--><h3 id="linearprobesandpruningthemodelquestion4">Linear probes and pruning the model (Question 4)</h3>
<p>Given that block structures preserve and propagate the first PC, you might be wondering if a) the presence of a block structure is an indication of redundancy? and if so b) whether models that contain block structures in their similarity heatmaps can be pruned?<br>
To answer these two questions, the authors consider each layer of both a narrow and a wide network and fit a linear classifier to predict the output class. This method was introduced by <a href="https://arxiv.org/pdf/1610.01644.pdf">Alain et al.</a> and is called linear probing. You can think of these probes (classifiers) as thermometers that are used to measure the temperature (performance) of the model in various locations. Fig.4 shows two groups of thin and wide ResNets. The two left panes show two thin networks with different initialisations and, as we see, the accuracy increases monotonically across layers, with or without residual connections (shown in blue and orange respectively). However, for a wide network we see little increase in accuracy for layers within the block structure. Note that without residual connections (orange line) the performance of the layers that make up the block structure decreases. This suggests that these connections play an important role in preserving target representations in the block structure, leading to increased prediction accuracy.<br>
This suggests that there is redundancy in the block structures' constituent layers.</p>
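The probing idea can be illustrated with a closed-form ridge classifier fitted on frozen activations; a simplified stand-in for the logistic probes of Alain et al., on toy data with hypothetical shapes:

```python
import numpy as np

def linear_probe_accuracy(H, y, n_classes, reg=1e-3):
    """Fit a closed-form ridge classifier on frozen activations H (n, d)
    and report accuracy -- a cheap stand-in for a linear probe."""
    Y = np.eye(n_classes)[y]                      # one-hot targets
    Hb = np.hstack([H, np.ones((len(H), 1))])     # append a bias column
    A = Hb.T @ Hb + reg * np.eye(Hb.shape[1])
    W = np.linalg.solve(A, Hb.T @ Y)
    preds = (Hb @ W).argmax(axis=1)
    return (preds == y).mean()

rng = np.random.default_rng(0)
# two well-separated clusters standing in for one layer's activations
H = np.vstack([rng.normal(-5.0, 1.0, size=(100, 8)),
               rng.normal(5.0, 1.0, size=(100, 8))])
y = np.array([0] * 100 + [1] * 100)
acc = linear_probe_accuracy(H, y, n_classes=2)    # near-perfect here
```

Running such a probe at every layer, as in Fig.4, shows how linearly decodable the class information is at each depth.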
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2020/11/Screenshot-2020-11-10-at-3.46.18-PM.png" class="kg-image" alt="Do wide and deep networks learn the same things? Uncovering how neural network representations vary with width and depth" srcset="https://cherrypicked.dev/content/images/size/w600/2020/11/Screenshot-2020-11-10-at-3.46.18-PM.png 600w, https://cherrypicked.dev/content/images/2020/11/Screenshot-2020-11-10-at-3.46.18-PM.png 924w" sizes="(min-width: 720px) 720px"><figcaption>Figure 4. Linear probing for models with different width and initialisation seeds (top row). Accuracy of each layer before (orange) and after (blue) residual connections (bottom row)</figcaption></figure><!--kg-card-begin: markdown--><p>Now let's get to pruning. Note that in ResNets, the network is split into a number of stages. The number of channels and kernel sizes are fixed at each stage and they increase from one stage to the other. The vertical green lines in the bottom row plots of Fig.5 show three stages in the analysed models. In order to investigate the effect of pruning, blocks are deleted one-by-one from the end of each stage while keeping the residual connections since they are important in preserving the important internal representations. Linear probing is performed on intact and pruned models and results show that pruning from inside the block structure (the area between the green vertical lines in the bottom row of Fig.5) has very little impact on the test accuracy. However, pruning has a negative effect on performance for models without block structure. The two left plots in Fig.5 show the comparison in performance between a thin model before and after pruning blocks. We see a noticeable drop in accuracy (blue line in bottom plots) when blocks are removed from any of the three stages.<br>
The two right plots in Fig.5 show the comparison in performance between a wide model with a block structure before and after pruning blocks. We do not see a big difference in accuracy when pruning happens in the middle stage that contains the block structure.<br>
The grey horizontal line shows the performance of the full model.<br>
These results suggest that models that are wide/deep enough to contain a block structure can be compressed without much loss in performance.</p>
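Why can residual blocks be deleted without breaking the network at all? Because each block maps its input to an output of the same shape, the remaining blocks still compose. A toy sketch with hypothetical weights, not a real ResNet:

```python
import numpy as np

def forward(x, blocks):
    for f in blocks:
        x = x + f(x)                 # residual connection: shapes always match
    return x

rng = np.random.default_rng(0)
# one "stage": four residual blocks operating on 8-dimensional features
weights = [rng.normal(scale=0.1, size=(8, 8)) for _ in range(4)]
blocks = [lambda x, W=W: np.tanh(x @ W) for W in weights]

x = rng.normal(size=(2, 8))
full = forward(x, blocks)
pruned = forward(x, blocks[:-1])     # delete the last block of the stage
```

Both forward passes produce outputs of identical shape; the paper's finding is that, inside a block structure, they also produce nearly identical accuracy.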
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2020/11/fig5.png" class="kg-image" alt="Do wide and deep networks learn the same things? Uncovering how neural network representations vary with width and depth" srcset="https://cherrypicked.dev/content/images/size/w600/2020/11/fig5.png 600w, https://cherrypicked.dev/content/images/2020/11/fig5.png 803w" sizes="(min-width: 720px) 720px"><figcaption>Figure 5. Intact ResNet-38 (1x) and performance of its layers (first column), same model after pruning (second column). Intact wide ResNet-38 (10x) and performance of its layers (third column), same model after pruning (fourth column)</figcaption></figure><!--kg-card-begin: markdown--><h3 id="effectofdepthwidthonrepresentationsacrossmodelsquestion5">Effect of depth/width on representations across models (Question 5)</h3>
<p>In this section we want to learn a) whether models of the <strong>same design</strong> but different initialisations learn similar representations, b) whether this changes with increasing model capacity, i.e. making the model deeper/wider, c) whether there is similarity between representations of <strong>different model designs</strong>, and finally d) how this similarity or dissimilarity is affected as models get deeper/wider.<br>
To answer part a), a small model with a fixed architecture was trained three times with three different initialisations. Then, CKA maps between the layers of each pair of models were computed (see the leftmost group of plots in Fig.6). As we see in Fig.6, the model does not contain any block structure, and representations across initialisations (off-diagonal plots) show the same grid-like similarity structure as within a single model.<br>
The middle (wide model) and right (deep model) group of plots answer part b. Both models have block structures (See plots on the diagonal). Comparing the same model from different seeds (off-diagonal plots) we see that despite some similarity between representations in layers outside the block structure, there is no similarity between the block structures across seeds.</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2020/11/fig6.png" class="kg-image" alt="Do wide and deep networks learn the same things? Uncovering how neural network representations vary with width and depth" srcset="https://cherrypicked.dev/content/images/size/w600/2020/11/fig6.png 600w, https://cherrypicked.dev/content/images/size/w1000/2020/11/fig6.png 1000w, https://cherrypicked.dev/content/images/2020/11/fig6.png 1096w" sizes="(min-width: 720px) 720px"><figcaption>Figure 6. Comparison of the same model architecture across three random seeds. The left group is a small model, the middle is a wide model and the right group shows a deep model</figcaption></figure><!--kg-card-begin: markdown--><p>To answer parts c) and d), the same experiments are run on models with different designs. In particular, for part c) two different small model architectures are trained and their similarity is computed. As we see in the left group of plots in Fig.7, for these two models representations at the same relative depths are similar across models.<br>
However, when comparing a wide and a deep model that contain block structures, results show that the representations within the block structure are unique to each model (see the group of plots on the right of Fig.7).</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2020/11/fig8.png" class="kg-image" alt="Do wide and deep networks learn the same things? Uncovering how neural network representations vary with width and depth" srcset="https://cherrypicked.dev/content/images/size/w600/2020/11/fig8.png 600w, https://cherrypicked.dev/content/images/2020/11/fig8.png 940w" sizes="(min-width: 720px) 720px"><figcaption>Figure 7. Comparison of different model architectures. The left group compares two small models and the right group compare two models with higher capacity</figcaption></figure><!--kg-card-begin: markdown--><p>With this analysis, we can conclude that representations from layers outside the block structures bear some resemblance with one another independent of the model architecture being the same or not but there is little to no similarity between representations inside the block structure.</p>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h3 id="effectofdepthwidthonpredictionsquestion6">Effect of depth/width on predictions (Question 6)</h3>
<p>Now we want to investigate the difference between the predictions of different architectures on a per-example basis.<br>
For this purpose, the authors train a population of neural networks on CIFAR-10 and ImageNet. Fig.8a compares the prediction accuracy on individual data points for 100 ResNet-62 (1x) and ResNet-14 (2x) models. As we see, the predictions of these two model families differ substantially, and the differences are large enough not to be due to chance (Fig.8b). Subplot b suggests that architecturally identical models with different initialisations make similar predictions.</p>
<p>Fig.8c shows the accuracy differences on ImageNet classes for ResNets with increasing width (y-axis) and increasing depth (x-axis). This analysis shows that there is a statistically significant difference in class-level error between wide and deep models. On the test sets, wide networks show better performance at identifying scenes compared to objects (74.9% $\pm$ 0.05 vs. 74.6% $\pm$ 0.06, $p = 6 \times 10^{-5}$, Welch’s t-test).</p>
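For reference, the Welch statistic behind this comparison can be computed directly. A minimal sketch with hypothetical per-run accuracies (the degrees-of-freedom and p-value step, normally delegated to a statistics package, is omitted):

```python
import numpy as np

def welch_t(a, b):
    """Welch's t statistic for two samples with unequal variances."""
    se2 = a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b)
    return (a.mean() - b.mean()) / np.sqrt(se2)

# hypothetical scene-identification accuracies of two model families
wide = np.array([0.749, 0.750, 0.748, 0.751, 0.749])
deep = np.array([0.746, 0.745, 0.747, 0.746, 0.747])
t = welch_t(wide, deep)    # positive: the wide family is better on this toy data
```

Unlike the standard t-test, Welch's version does not assume the two populations share a variance, which suits comparisons across different model families.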
<!--kg-card-end: markdown--><p></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2020/11/fig7.png" class="kg-image" alt="Do wide and deep networks learn the same things? Uncovering how neural network representations vary with width and depth" srcset="https://cherrypicked.dev/content/images/size/w600/2020/11/fig7.png 600w, https://cherrypicked.dev/content/images/size/w1000/2020/11/fig7.png 1000w, https://cherrypicked.dev/content/images/2020/11/fig7.png 1065w" sizes="(min-width: 720px) 720px"><figcaption>Figure 8. a) comparison of prediction accuracy per example for 100 ResNet-62 (1x) and ResNet-14 (2x) models b) Same as a), for disjoint sets of 100 architecturally identical ResNet-62 models trained with different initialisations c) Accuracy differences on ImageNet classes for ResNets between models with increased width (y-axis) and increased depth (x-axis)</figcaption></figure><!--kg-card-begin: markdown--><h3 id="conclusion">Conclusion</h3>
<p>This article goes through the details of a study by Nguyen et al. which investigated the effect of increasing the capacity of neural networks relative to the training set size. They found that with increasing model depth/width, block structures emerge. These structures preserve and propagate the first principal component of the hidden layers. The study also discovered that block structures are unique to each model; however, other internal representations are quite similar across different models or the same model with different initialisations. In addition, analysis on different image datasets indicated that wide networks learn different features from deep networks, and their class-level error and per-example performance differ.<br>
I hope you enjoyed learning about block structures, when they appear and the difference in performance between wide and deep networks. If you like reading about machine learning, natural language processing and brain-computer interface, follow me on <a href="https://twitter.com/mary_flh">twitter</a> for more content!</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Attention-based nested U-Net (ANU-Net)]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>In this article, I will be summarising the methods and findings of a <a href="https://www.sciencedirect.com/science/article/pii/S0097849320300546">paper</a> titled <em>ANU-Net: Attention-based nested U-Net to exploit full resolution features for medical image segmentation</em>, published in May 2020 by Li et. al. This paper uses various techniques to improve the accuracy of medical image segmentation and</p>]]></description><link>https://cherrypicked.dev/anu-net/</link><guid isPermaLink="false">5f9ef51fe4c36f1779ec77ca</guid><dc:creator><![CDATA[Maryam Fallah]]></dc:creator><pubDate>Mon, 02 Nov 2020 20:54:48 GMT</pubDate><media:content url="https://cherrypicked.dev/content/images/2020/11/landon-arnold-_R7xhMy__c4-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://cherrypicked.dev/content/images/2020/11/landon-arnold-_R7xhMy__c4-unsplash.jpg" alt="Attention-based nested U-Net (ANU-Net)"><p>In this article, I will be summarising the methods and findings of a <a href="https://www.sciencedirect.com/science/article/pii/S0097849320300546">paper</a> titled <em>ANU-Net: Attention-based nested U-Net to exploit full resolution features for medical image segmentation</em>, published in May 2020 by Li et. al. This paper uses various techniques to improve the accuracy of medical image segmentation and reduce the prediction time by pruning the model.<br>
Although this paper focuses on medical image segmentation, its techniques can be applied to other applications as the problems these methods solve are general deep learning problems.<br>
Specifically, this paper provides solutions for:</p>
<ol>
<li>Vanishing/exploding gradients</li>
<li>Semantic gap between encoder and decoder feature maps</li>
<li>Presence of irrelevant features in the feature maps</li>
<li>Convergence of the model</li>
<li>Imbalanced training datasets which lead to a biased model</li>
<li>Slow prediction due to the size of the model and its parameters</li>
</ol>
<p>This paper has used different techniques to tackle each of these problems, which I will discuss in detail in this article. These methods can be used independently, so feel free to skip the parts that are not of interest to you (I will refer to the problem numbers in the headers for easier navigation).</p>
<h3 id="medicalimagesegmentationanditsusage">Medical Image Segmentation and its Usage</h3>
<p>Image segmentation is a subfield of computer vision. It is the process of assigning each pixel in an image to a class. A group of pixels that belong to the same class is considered part of the same object. The end result of this classification task is a set of contours identifying the different objects in the image.<br>
Medical image segmentation applies this process to medical images such as MRIs and CT scans with the objective of detecting abnormalities and lesions in specific parts of the body. This is especially useful in the early detection of cancer. Cancer is the <a href="https://www.who.int/health-topics/cancer#tab=tab_1">second leading cause of death globally</a>, and early detection and treatment are the main means of increasing survival rates.</p>
<h3 id="anunetsbuildingblocks">ANU-Net's building blocks</h3>
<h4 id="nesteddenseskipconnectionssolutionforproblems12">Nested, Dense Skip Connections (solution for problems 1&amp;2)</h4>
<p>The proposed neural network in this paper shares its backbone with some <a href="https://arxiv.org/abs/1807.10165">previous research</a>. Its main component, similar to many other image segmentation models, is an encoder-decoder structure for downsampling and upsampling the image respectively. This is the U-Net part of the model, which was initially proposed by <a href="https://arxiv.org/abs/1505.04597">Ronneberger et al.</a> (see Fig.1 from the paper). The name of the network comes from its U-shaped structure.<br>
Encoders help with detecting <em>what</em> the features are, whereas decoders put these detected features into perspective as to <em>where</em> they are in the input image.<br>
To help with preserving the context of the input and reduce the semantic gap between the encoder and decoder pathways, U-Net concatenates the outputs of each encoder to the decoder of the same depth (see copy and crop arrows in Fig.1).</p>
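The copy-and-concatenate step can be pictured with plain arrays: the decoder feature map is brought up to the encoder's spatial size and the two are stacked along the channel axis. A shape-only sketch with hypothetical sizes, using nearest-neighbour upsampling purely for illustration (real U-Nets typically use transposed convolutions):

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def skip_concat(enc, dec):
    """Concatenate encoder features with upsampled decoder features."""
    return np.concatenate([enc, upsample2x(dec)], axis=0)

enc = np.zeros((64, 32, 32))    # encoder output at some depth
dec = np.zeros((128, 16, 16))   # decoder input from one level deeper
out = skip_concat(enc, dec)     # channels add up: 64 + 128 = 192
```

The convolution that follows this concatenation sees both the encoder's fine-grained "what" features and the decoder's coarser "where" context.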
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2020/11/Screenshot-2020-11-01-at-2.47.43-PM-1.png" class="kg-image" alt="Attention-based nested U-Net (ANU-Net)"><figcaption>Figure 1: U-Net structure proposed by Ronneberger et al.</figcaption></figure><!--kg-card-begin: markdown--><p>In order to make the encoder and decoder feature maps more semantically similar, <a href="https://arxiv.org/abs/1807.10165">Zhou et al.</a> suggested using convolutional layers with <em>dense</em> connections instead of simply concatenating the encoders' feature maps to the decoders (Fig.2 from the paper). These layers are called nested skip connections and are densely connected in the sense that the output of each convolutional layer is directly passed to the next layer (see the dotted blue lines in Fig.2). The idea of densely connected layers was first proposed by <a href="https://arxiv.org/abs/1608.06993">Huang et al.</a> and has a number of advantages. It addresses the vanishing gradient issue, as the outputs of all layers, including the initial ones, get propagated to the higher layers. For the same reason, the model can become more compact. It also adds more diversity to the feature set, as it includes a range of complex features from all the layers.</p>
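<p>As a rough sketch of the dense-connection idea (not the paper's code; the layer shapes and weights are made up, and 1&times;1 convolutions stand in for full convolutional blocks):</p>

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def dense_block(x, weights):
    """DenseNet-style dense connectivity: each layer receives the
    concatenation (along channels) of the input and every previous
    layer's output. 1x1 convolutions stand in for full conv layers."""
    feats = [x]
    for w in weights:  # w has shape (C_out, C_in), with C_in growing each step
        inp = np.concatenate(feats, axis=0)                  # (C_in, H, W)
        feats.append(relu(np.einsum('oc,chw->ohw', w, inp)))
    return np.concatenate(feats, axis=0)                     # every feature map survives
```

<p>Because every layer's output is carried forward to all later layers, gradients have short paths back to the early layers, which is the vanishing-gradient benefit described above.</p>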
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2020/11/Screenshot-2020-11-01-at-3.25.21-PM.png" class="kg-image" alt="Attention-based nested U-Net (ANU-Net)"><figcaption>Figure 2. U-Net++ structure proposed by Zhou et. al&nbsp;</figcaption></figure><!--kg-card-begin: markdown--><h4 id="attentiongatessolutionforproblems23">Attention Gates (solution for problems 2&amp;3)</h4>
<p>In order to focus on relevant parts of the image and enhance the learning of the target area, <a href="https://www.sciencedirect.com/science/article/pii/S1361841518306133">Li et al.</a> suggested using attention gates in the skip connections. As seen in Fig.3, the output of each convolutional layer, along with a gate signal, is passed to the attention gate to suppress irrelevant features. More details on how this gate functions are provided below.</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2020/11/Screenshot-2020-11-01-at-4.08.22-PM.png" class="kg-image" alt="Attention-based nested U-Net (ANU-Net)"><figcaption>Figure 3. ANU-Net structure proposed by Li et al.</figcaption></figure><!--kg-card-begin: markdown--><p>The image below shows how attention gates are used in this context. The process can be broken down into the following steps:</p>
<ol>
<li>Two inputs enter the gate: the gate signal (g), e.g. X2_1 from Fig.3, which is an upsampled feature that helps select more useful information from the second input, the encoded feature (f), e.g. X1_1</li>
<li>Convolution and batch normalisation are performed on each input</li>
<li>The results from step 2 are merged and passed to a relu activation function</li>
<li>Another convolution and batch normalisation are applied</li>
<li>The result is passed to a sigmoid activation function to compute the attention coefficient $\alpha$</li>
<li>$\alpha$ is applied to the original feature input to suppress the irrelevant information</li>
</ol>
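<p>The steps above can be sketched as follows (a minimal NumPy illustration, not the authors' implementation; the 1&times;1 convolutions are written as channel mixes, the weight shapes are illustrative assumptions, and batch normalisation is omitted for brevity):</p>

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_gate(f, g, w_f, w_g, w_psi):
    """f: encoded feature (C, H, W); g: gate signal (C, H, W).
    w_f, w_g: (C', C) and w_psi: (1, C') are illustrative 1x1-conv weights."""
    f_proj = np.einsum('oc,chw->ohw', w_f, f)            # step 2 (batch norm omitted)
    g_proj = np.einsum('oc,chw->ohw', w_g, g)
    q = relu(f_proj + g_proj)                            # step 3: merge and relu
    alpha = sigmoid(np.einsum('oc,chw->ohw', w_psi, q))  # steps 4-5: attention coefficient
    return f * alpha                                     # step 6: suppress irrelevant features
```

<p>Note that since $\alpha \in (0, 1)$, the gate can only scale features down, never amplify them.</p>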
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2020/11/Screenshot-2020-11-01-at-4.24.56-PM.png" class="kg-image" alt="Attention-based nested U-Net (ANU-Net)"><figcaption>Figure 4. Attention gate</figcaption></figure><!--kg-card-begin: markdown--><h4 id="deepsupervisionsolutionforproblem4">Deep Supervision (solution for problem 4)</h4>
<p>In order to improve the network's convergence behaviour and have distinct features be selected from all the hidden layers and not just the last layer, <a href="http://vcl.ucsd.edu/~sxie/pdf/dsn_nips2014.pdf">Lee et al.</a> suggested Deeply Supervised Nets. The central idea behind this method is to directly supervise the hidden layers, as opposed to the indirect supervision provided by backpropagation of the error signal computed from the last layer. To achieve this, <em>companion objective functions</em> are placed at each hidden layer.<br>
You can think of this method as truncating the neural network into $m+1$ smaller networks ($m$ being the number of hidden layers) where each hidden layer acts as the final output layer. Therefore, for each hidden layer a loss is computed from the companion objective function. The goal is to minimise the entire network's output classification error while reducing the prediction error of each individual layer.<br>
Backpropagation of error is performed as usual, with the main difference being that the error backpropagates from both the final layer and the local companion output.<br>
In ANU-Net, the authors have used this idea by adding a $1\times1$ convolutional layer and sigmoid activation function to every output in the first layer and passing the result directly to the final loss function as seen in Fig.5.</p>
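<p>A toy sketch of the deep-supervision objective (the 1&times;1-conv head weights and the per-output loss used here are illustrative assumptions, not the paper's exact configuration):</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def deeply_supervised_loss(first_row_outputs, target, head_weights, per_output_loss):
    """Each supervised output gets a 1x1-conv + sigmoid head and its own loss;
    the overall objective is the sum over all supervised outputs."""
    total = 0.0
    for feat, w in zip(first_row_outputs, head_weights):   # feat: (C, H, W)
        pred = sigmoid(np.einsum('oc,chw->ohw', w, feat))  # (1, H, W) mask
        total += per_output_loss(pred, target)
    return total
```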
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2020/11/Screenshot-2020-11-01-at-5.37.49-PM.png" class="kg-image" alt="Attention-based nested U-Net (ANU-Net)"><figcaption>Figure 5. Deep supervision in ANU-Net</figcaption></figure><!--kg-card-begin: markdown--><h4 id="lossfunctionsolutionforproblem5">Loss function (solution for problem 5)</h4>
<p>Up to this point I have discussed the techniques ANU-Net uses to extract full resolution semantic information. In order to learn from this information, a hybrid loss function is used. This function combines two loss functions, Dice loss and Focal loss.<br>
The Dice coefficient, from which Dice loss is computed (Dice loss = 1 - Dice coefficient), measures the overlap between two samples and is computed as follows:</p>
<div style="text-align:center;margin-bottom:20px;">
$\begin{aligned}\large \text{Dice coefficient}\,(\bar Y, Y) = \frac{2 \times \bar Y \cdot Y}{\bar Y^2 + Y^2} \end{aligned}\tag{1}$
&NewLine;
</div>
<p>Where $\bar Y$ is the ground truth and $Y$ is the prediction.<br>
<a href="https://arxiv.org/abs/1708.02002">Focal loss</a> focuses on solving the imbalanced dataset problem, a common issue with medical images as with many other fields. It is formulated as below:</p>
<div style="text-align:center;margin-bottom:20px;">
$\begin{aligned}\large \text{Focal loss}\,(p_t) = -\alpha_t \times (1-p_t)^\gamma \times \log(p_t) \end{aligned}\tag{2}$
&NewLine;
</div>
Where $p_t$ equals $p$ when $y=1$ and $1-p$ otherwise.
This cross-entropy-based loss function tackles the imbalanced data problem in two ways:<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><ol>
<li>It assigns a weight, $\alpha_t$, to the loss value of each data point. This weight is inversely proportional to the size of the data in each class, so the prediction loss of a point from a class that makes up 80% of the training set is shrunk by a small weight during training. In other words, it forces the loss function to pay more attention to data points from the smaller class.</li>
<li>It down-weights samples that are easily classified (the $(1-p_t)^\gamma$ term, with $\gamma$ set to 2). With this term, Focal loss prevents the model from being dominated by data points that are easily classified and focuses on hard examples. How easy a sample is to classify is determined by its predicted probability of belonging to a class, $p_t$. For instance, when the predicted value of a sample is close to 0.99 (indicating it belongs to the positive class), $(1-p_t)^\gamma$ shrinks the contribution of this easy sample to the overall loss.</li>
</ol>
<p>Combining the advantages of both Dice loss and Focal loss, ANU-Net proposes the loss function below:</p>
<div style="text-align:center;margin-bottom:20px;">
$\begin{aligned}\large Loss = \sum_{i=1}^{4}\left(1-\left[\frac{\alpha \times \bar Y \times \log Y_i}{|Y_i - 0.5|^\gamma} + \frac{2 \times Y_i \cdot \bar Y + s}{Y_i^2 + \bar Y^2 + s}\right]\right)\end{aligned}\tag{3}$
&NewLine;
</div>
<p>Where $\frac{\alpha \times \bar Y \times \log Y_i}{|Y_i - 0.5|^\gamma}$ is inspired by Focal loss: the term $\alpha$ has the same objective of down-weighting the dominant class, and $\frac{1}{|Y_i - 0.5|^\gamma}$ is the penalising term that assigns higher values to more uncertain inputs. For example, when a data point is assigned a probability of 0.5 (meaning it could belong to either the positive or negative class), this term becomes very large, forcing the loss function to learn from this hard example.<br>
Note that $s$ is a smoothing factor in the Dice coefficient.</p>
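<p>To make Eq.1 and Eq.2 concrete, here is a minimal NumPy sketch of the two base losses (the smoothing constant and the $\alpha$, $\gamma$ defaults are illustrative, not the paper's exact settings):</p>

```python
import numpy as np

def dice_coefficient(y_pred, y_true, s=1e-6):
    """Eq.1 with smoothing factor s; Dice loss is 1 - dice_coefficient."""
    inter = np.sum(y_pred * y_true)
    return (2.0 * inter + s) / (np.sum(y_pred ** 2) + np.sum(y_true ** 2) + s)

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Eq.2: p_t = p where y = 1 and 1 - p otherwise; easy samples
    (p_t close to 1) are down-weighted by the (1 - p_t)^gamma factor."""
    p_t = np.where(y == 1, p, 1.0 - p)
    return float(np.mean(-alpha * (1.0 - p_t) ** gamma * np.log(p_t + eps)))
```

<p>A perfectly overlapping prediction gives a Dice coefficient of 1 (Dice loss of 0), and confidently correct predictions contribute almost nothing to the focal loss.</p>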
<h4 id="modelpruningsolutionforproblem6">Model Pruning (solution for problem 6)</h4>
<p>One huge benefit of using deep supervision is that the model can be pruned at inference time. This is because during training, each hidden layer is treated as an output layer with a loss function that backpropagates the error. Since only forward propagation is performed at inference time, we can prune the model, making it significantly smaller and faster. The figure below shows ANU-Net at four levels of pruning; the grey areas are the inactivated sections.</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2020/11/Screenshot-2020-11-01-at-7.55.25-PM.png" class="kg-image" alt="Attention-based nested U-Net (ANU-Net)"><figcaption>Figure 6. Pruned ANU-Net</figcaption></figure><!--kg-card-begin: markdown--><p>Depending on the performance on the test set, a shallow pruned model with fewer parameters can be used instead of the full model to increase speed.</p>
<h4 id="results">Results</h4>
<p>Four medical image datasets and four performance metrics, namely Dice, intersection over union (IoU), precision and recall, were used in this research paper. The results are compared to five popular models (U-Net, R2U-Net, UNet++, Attention U-Net and Attention R2U-Net).<br>
In all four datasets, ANU-Net outperforms the other methods. For instance, in one of the datasets, which included CT images of the liver, compared to Attention U-Net (an older model) ANU-Net's IoU ratio increased by 7.99%, the Dice coefficient by 3.7%, precision by 5% and the recall rate by 4% (see the paper for detailed results).<br>
Fig.7 shows the ground truth image for liver segmentation (in red) and compares the results of ANU-Net (in blue) with R2U-Net (in green). The arrows in the rightmost image indicate the areas that R2U-Net missed.</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2020/11/Screenshot-2020-11-01-at-8.34.22-PM.png" class="kg-image" alt="Attention-based nested U-Net (ANU-Net)" srcset="https://cherrypicked.dev/content/images/size/w600/2020/11/Screenshot-2020-11-01-at-8.34.22-PM.png 600w, https://cherrypicked.dev/content/images/2020/11/Screenshot-2020-11-01-at-8.34.22-PM.png 643w"><figcaption>Figure 7. Ground truth liver segment (red), segmentation performed by ANU-Net (blue) and R2U-Net (green). The arrows indicate the missed areas</figcaption></figure><!--kg-card-begin: markdown--><h5 id="resultsfromdifferentprunednetworks">Results from different pruned networks</h5>
<p>Note that there is a significant difference in the number of parameters of the pruned and full model. ANU-Net L1 is 98.8% smaller than ANU-Net L4 and when tested on the liver dataset, L1 was 17.13% faster on average at prediction. This speed improvement obviously comes at the cost of accuracy, with a 13.35% and 27.18% decrease in IoU and Dice coefficient respectively. Another pruned model, ANU-Net L3, showed more promising results with a 7.7% increase in speed, 75.5% reduction in parameters and only 0.62% decrease in IoU and 2.56% decrease in Dice coefficient.</p>
<h4 id="conclusion">Conclusion</h4>
<p>In this article, I discussed ANU-Net and its properties. This work combined very interesting ideas to extract full-resolution semantic information through densely connected skip connections, attention gates and deep supervision. It also used a hybrid loss function that penalises data points based on the class they belong to and on whether they are easily classified. In addition, the authors showed that the model can be pruned at inference time to speed up prediction. Although this network focuses on medical image segmentation, its components can be applied to any deep learning problem.<br>
I hope you enjoyed learning about these different concepts. If you like reading about machine learning, natural language processing and brain-computer interfaces, follow me on <a href="https://twitter.com/mary_flh">twitter</a> for more content!</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Energy-based Out-of-distribution Detection]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>In this article, I'm going to summarise a <a href="https://arxiv.org/abs/2010.03759?utm_medium=email&amp;_hsmi=97274653&amp;_hsenc=p2ANqtz-_MhfZ9XkhB8S9yy8ninnWKewmQbFikPQAlhrwzhYWz4XzuovflAcIvo9bEHjYQYHKwj_dUw5uS0fRh6J6xMZ0C2Ajyoq5tatKnn80rX-XdYhUcqis&amp;utm_content=97274653&amp;utm_source=hs_email">paper</a> with the above title that was published in October 2020. This paper focuses on a common problem in machine learning, <strong>overconfident classifiers</strong>. What do I mean by that?<br>
Neural networks are a common solution to classification problems. Take the classic digits</p>]]></description><link>https://cherrypicked.dev/energy-based-out-of-distribution-detection/</link><guid isPermaLink="false">5f8af253e4c36f1779ec6fee</guid><dc:creator><![CDATA[Maryam Fallah]]></dc:creator><pubDate>Thu, 22 Oct 2020 02:48:15 GMT</pubDate><media:content url="https://cherrypicked.dev/content/images/2020/10/will-myers-ku_ttDpqIVc-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://cherrypicked.dev/content/images/2020/10/will-myers-ku_ttDpqIVc-unsplash.jpg" alt="Energy-based Out-of-distribution Detection"><p>In this article, I'm going to summarise a <a href="https://arxiv.org/abs/2010.03759?utm_medium=email&amp;_hsmi=97274653&amp;_hsenc=p2ANqtz-_MhfZ9XkhB8S9yy8ninnWKewmQbFikPQAlhrwzhYWz4XzuovflAcIvo9bEHjYQYHKwj_dUw5uS0fRh6J6xMZ0C2Ajyoq5tatKnn80rX-XdYhUcqis&amp;utm_content=97274653&amp;utm_source=hs_email">paper</a> with the above title that was published in October 2020. This paper focuses on a common problem in machine learning, <strong>overconfident classifiers</strong>. What do I mean by that?<br>
Neural networks are a common solution to classification problems. Take the classic digits classification on <a href="https://en.wikipedia.org/wiki/MNIST_database">MNIST</a> handwritten images as an example. With an image dataset of digits from zero to nine, we can train a model that would detect what number is in a given image. But what if the input is a picture of a dog?<br>
Ideally, the model should be able to detect that the features extracted from the dog image are nothing like those of the images it has seen before, which were only pictures of numbers. In other words, the model should detect that the dog picture is <em>out of distribution</em> (OOD) and say <em>this doesn't look like anything I have seen before, so I'm not sure it belongs to any of the 10 classes I was trained on</em>. But that's not what happens. The model, which is most likely using a softmax score, outputs the most likely digit class for the dog image, i.e. the model is overconfident.<br>
So what's the solution? Energy scores!<br>
I get into the details of what this score is below, but essentially the idea is that softmax scores do not align with the probability density of the inputs and can sometimes produce overly high confidence scores for out-of-distribution samples (hence the model is overconfident), whereas energy scores are linearly aligned with the log density of the inputs. Therefore, they are more reliable for detecting in- and out-of-distribution data points.<br>
This paper shows that energy scores can easily be used at inference time on a pre-trained neural network, without any change to the model parameters, or they can be incorporated into the cost function during training.</p>
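<p>A quick illustration of why softmax invites overconfidence (a toy sketch, not tied to any particular model): softmax only compares logits against each other, so any input that produces one moderately dominant logit yields a near-certain "confidence", with no notion of how unusual the input is.</p>

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))  # shift by the max for numerical stability
    return e / e.sum()

# Hypothetical logits for an OOD input (e.g. a dog photo fed to a digit
# classifier) that still happens to produce one dominant logit:
ood_logits = np.array([6.0, 0.5, -1.0, 0.0, 1.0])
confidence = softmax(ood_logits).max()  # close to 1 despite the input being OOD
```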
<h3 id="whatisanenergyfunction">What is an energy function?</h3>
<p>The idea of energy scores comes from thermodynamics and statistical mechanics, specifically from the <a href="https://en.wikipedia.org/wiki/Boltzmann_distribution">Boltzmann (Gibbs) distribution</a>, which measures the probability of a system being in a particular state given the state's energy level and the system's temperature:</p>
<p>
</p>
<div style="text-align:center;margin-bottom:20px;">
$\begin{aligned}\large p(y|\mathbf {x}) = \large \frac{e^{-\frac{E(\mathbf x, y)}{T}}}{\int\limits_{y'} e^{-\frac{E(\mathbf x, y')}{T}}} = \large \frac{e^{-\frac{E(\mathbf x, y)}{T}}}{e^{-\frac{E(\mathbf x)}{T}}}
\end{aligned}\tag{1}$
&NewLine;
</div>
Where $T$ is the temperature and $E(\mathbf x): \mathbb{R}^D \rightarrow \mathbb{R}$ is the energy function that maps an input of dimension $D$ to a single scalar called the energy value. Based on the Helmholtz free energy definition, it can be expressed as:
&NewLine;
<div style="text-align:center;margin-top:20px;">
$\begin{aligned}\large E(\mathbf x) = - T \cdot \log\left(\int\limits_{y'} e^{-\frac{E(\mathbf x, y')}{T}}\right)
\end{aligned}\tag{2}$
&NewLine;
</div>
<p>
</p>
<h3 id="fromsoftmaxtoenergyfunction">From Softmax to Energy Function</h3>
<p>Now let's consider the softmax function, which maps the $K$ logits $f(\mathbf x)$ to $K$ real-valued numbers representing the likelihood of the input belonging to each of the $K$ classes:</p>
<p>
</p>
<div style="text-align:center;margin-bottom:20px;">
$\begin{aligned}\large p(y|\mathbf x) = \frac{e^{\frac{f_y(\mathbf x)}{T}}}{\sum_{i}^{K} e^{\frac{f_i(\mathbf x)}{T}}}\end{aligned}\tag{3}$
&NewLine;
</div>
&nbsp;
Where $f_{y}(\mathbf x)$ is the $y^{th}$ index of $f(\mathbf x)$, i.e. the logit representing the $y^{th}$ class.
Comparing Eq.1 with Eq.3, we get:
<p>
</p>
<div style="text-align:center;margin-bottom:20px;">
$\begin{aligned}\large E(\mathbf {x}, y) = - f_y (\mathbf{x})
\end{aligned}\tag{4}$
</div>
&NewLine;&NewLine;&NewLine;
Plug that into the definition of $E(\mathbf{x})$ in Eq.2:
&NewLine;
<div style="text-align:center;">
$\begin{aligned}\large E(\mathbf {x}; f) = - T \cdot \log\sum_{i}^{K} e^{\frac{f_i(\mathbf x)}{T}} \end{aligned}\tag{5}$
&NewLine;
</div>
<p>Eq.5 means that without any change in the trained neural network's configuration, we can compute the energy values of the input in terms of the denominator of the softmax function. Now let's see why this helps with the original model overconfidence issue discussed in the introduction.</p>
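<p>Eq.5 is cheap to compute from a model's logits. A sketch (using the standard log-sum-exp trick for numerical stability; the logits themselves are made-up numbers):</p>

```python
import numpy as np

def energy_score(logits, T=1.0):
    """Eq.5: E(x; f) = -T * log sum_i exp(f_i(x) / T), computed stably
    by factoring out the largest scaled logit before exponentiating."""
    z = logits / T
    m = np.max(z)
    return float(-T * (m + np.log(np.sum(np.exp(z - m)))))
```

<p>Note how a large, dominant logit drives the energy down, matching the claim that in-distribution inputs have lower energy.</p>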
<h3 id="usingenergyscoresinsteadofsoftmaxscoresatinferencetime">Using Energy scores instead of softmax scores at inference time</h3>
<p>The goal here is to be able to detect when an input is very different from all the inputs used during training. We can look at this as a binary classification problem and use energy functions to compute the density function of our original discriminative model, e.g. handwritten digits classifier:</p>
<div style="text-align:center;">
$\begin{aligned}\large p(\mathbf {x}) = \frac{e^{-\frac{E(\mathbf x; f)}{T}}}{\int\limits_{\mathbf x} e^{-\frac{E(\mathbf x; f)}{T}}}\end{aligned}\tag{6}$
&NewLine;
</div>
<p>Take the log of both sides and we get:<br>

</p>
<div style="text-align:center;margin-bottom:20px;">
$\begin{aligned}\large \log p(\mathbf {x}) = \frac{-E(\mathbf x; f)}{T} - \log Z\end{aligned}\tag{7}$
&NewLine;&NewLine;
</div>
The second term in the above equation is simply a constant normalisation factor from the denominator of Eq.6. This proves that the negative energy score of an input $\mathbf x$ is linearly aligned with its log density. In other words, the lower the energy score of an input, the higher the likelihood that it belongs to the input distribution. Fig.1 (from the paper) shows how an energy function as computed by Eq.5 can be applied to a pre-trained model to detect out-of-distribution samples. The energy threshold $\tau$ is set to the value that correctly detects the largest number of in-distribution data points (by sliding the negative energy value in the figure below and selecting the value that causes the least overlap between in- and out-of-distribution data).<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2020/10/Screen-Shot-2020-10-20-at-8.36.37-PM.png" class="kg-image" alt="Energy-based Out-of-distribution Detection" srcset="https://cherrypicked.dev/content/images/size/w600/2020/10/Screen-Shot-2020-10-20-at-8.36.37-PM.png 600w, https://cherrypicked.dev/content/images/size/w1000/2020/10/Screen-Shot-2020-10-20-at-8.36.37-PM.png 1000w, https://cherrypicked.dev/content/images/size/w1600/2020/10/Screen-Shot-2020-10-20-at-8.36.37-PM.png 1600w, https://cherrypicked.dev/content/images/2020/10/Screen-Shot-2020-10-20-at-8.36.37-PM.png 1864w" sizes="(min-width: 720px) 720px"><figcaption>Figure 1. Out-of-distribution detection using an energy function</figcaption></figure><!--kg-card-begin: markdown--><p>This linear relation with the log probability density is the main reason why energy scores are superior to softmax scores, which are not in direct relation with the density function. In fact, the paper shows that the softmax confidence score can be written as the sum of the energy score and the maximum value of $f(\mathbf{x})$.
Since $f$ tends to be maximal for seen data and $E(\mathbf {x}; f)$, as shown above, is lower for such data, softmax scores are not aligned with the density function.<br>
Fig.2 (from the paper) shows the comparison between softmax scores and negative energy scores for in- and out-of-distribution data points. The authors split a dataset into training, validation and test sets and trained a model using the train and validation sets. They then used the test set as in-distribution samples and a completely different dataset as out-of-distribution samples. Notice how the negative energy score of the in-distribution sample (11.19) is much higher than that of the out-of-distribution sample (7.11), whereas the softmax scores for the same samples are almost identical (1.0 vs 0.99).</p>
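<p>Putting the inference-time procedure together (a sketch; the threshold and logits here are arbitrary illustrative numbers, and in practice $\tau$ is tuned on in-distribution data as described above):</p>

```python
import numpy as np

def energy_score(logits, T=1.0):
    """Eq.5 via the log-sum-exp trick."""
    z = logits / T
    m = np.max(z)
    return float(-T * (m + np.log(np.sum(np.exp(z - m)))))

def is_in_distribution(logits, tau, T=1.0):
    """Detector from Fig.1: flag the input as in-distribution when its
    negative energy is at least the threshold tau."""
    return -energy_score(logits, T) >= tau

# Made-up logits: a peaked output versus a flat, uncertain one
peaked = np.array([12.0, 1.0, 0.5])
flat = np.array([0.2, 0.1, 0.0])
```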
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://cherrypicked.dev/content/images/2020/10/Screen-Shot-2020-10-20-at-8.30.20-PM.png" class="kg-image" alt="Energy-based Out-of-distribution Detection" srcset="https://cherrypicked.dev/content/images/size/w600/2020/10/Screen-Shot-2020-10-20-at-8.30.20-PM.png 600w, https://cherrypicked.dev/content/images/size/w1000/2020/10/Screen-Shot-2020-10-20-at-8.30.20-PM.png 1000w, https://cherrypicked.dev/content/images/size/w1600/2020/10/Screen-Shot-2020-10-20-at-8.30.20-PM.png 1600w, https://cherrypicked.dev/content/images/2020/10/Screen-Shot-2020-10-20-at-8.30.20-PM.png 1740w" sizes="(min-width: 720px) 720px"><figcaption>Figure 2. Softmax versus negative energy scores for two in and out-distribution samples</figcaption></figure><!--kg-card-begin: markdown--><h3 id="usingenergyfunctionsintraining">Using Energy functions in training</h3>
<p>The paper also investigates the benefits of energy-based learning since the gap between in- and out-of-distribution samples in models trained using softmax may not always be enough for accurate differentiation.<br>
The idea here is that including the energy function in the cost function during training allows for more flexibility to shape the energy surfaces of in- and out-of-distribution data points (blue and grey areas in the right image of Fig.1) and have them far from each other. Specifically, the model is trained using this objective function:</p>
<div style="text-align:center;margin-bottom:20px;">
$\begin{aligned}\large \min_{\theta}\; \mathbb{E}_{(\mathbf x, y) \sim D^{train}_{in}}\left[-\log F_y(\mathbf x)\right] + \lambda \cdot L_{energy}\end{aligned}\tag{8}$
&NewLine;
</div>
Where $F(\mathbf x)$ is the output of the softmax function, $D^{train}_{in}$ is the in-distribution training data and $L_{energy}$ is a regularisation term based on the energies of the in- and out-of-distribution samples used in training. Essentially, this term makes sure that the energy of OOD samples is not below a margin $m_{out}$ and that the energy of in-distribution points is not above a margin $m_{in}$. In other words, the regularisation loss term ensures a big enough gap between in- and out-of-distribution samples by penalising samples whose energy levels fall between $m_{in}$ and $m_{out}$.<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><p>
</p>
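<p>One common way to implement such margin-based penalties is with squared hinge terms, sketched below (the exact form and the margin values here are illustrative; see the paper for the precise definition of $L_{energy}$):</p>

```python
import numpy as np

def energy_regulariser(e_in, e_out, m_in, m_out):
    """Penalise in-distribution energies above m_in and OOD energies
    below m_out, pushing the two energy surfaces apart."""
    in_term = np.mean(np.maximum(0.0, e_in - m_in) ** 2)
    out_term = np.mean(np.maximum(0.0, m_out - e_out) ** 2)
    return float(in_term + out_term)
```

<p>When every in-distribution energy sits below $m_{in}$ and every OOD energy sits above $m_{out}$, the penalty is zero; samples in between are pushed out of the margin band.</p>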
<h3 id="experimentsandresults">Experiments and Results</h3>
<p>The authors used three image datasets, namely SVHN, CIFAR-10 and CIFAR-100, as in-distribution data and six datasets (Textures, SVHN, Places365, LSUN-Crop, LSUN-Resize, iSUN) as out-of-distribution data, and measured different metrics. One metric considered was the false positive rate (FPR) of OOD examples when the in-distribution true positive rate (TPR) is 95%.<br>
Using energy scores at inference on two pre-trained models, the FPR95 decreased by <strong>18.03%</strong> and <strong>8.96%</strong> compared to the FPR from the softmax confidence score.<br>
Using energy functions to shape the energy surfaces during training gives lower error rates compared to other methods (4.98% vs 5.32%).<br>
The parameter $T$ was shown to affect the gap between in- and out-of-distribution samples, with an increase bringing the distributions closer to each other. The authors suggest using a value of 1 to make the energy score parameter-free.</p>
<h3 id="conclusion">Conclusion</h3>
<p>This study has shown the shortcomings of the softmax scores currently used in practice for classification problems and suggests energy-based learning to improve models' OOD detection by reducing false positive rates.<br>
I hope you enjoyed learning about energy scores and their importance. If you like reading about machine learning, natural language processing and brain-computer interfaces, follow me on <a href="https://twitter.com/mary_flh">twitter</a> for more content!</p>
<!--kg-card-end: markdown-->]]></content:encoded></item></channel></rss>