
Model uncertainty through Monte Carlo dropout - PT1

This short series of blog posts aims to explain and illustrate Monte Carlo dropout for evaluating model uncertainty. The first part investigates model uncertainty in Deep Learning and how it can be handled, inspecting the pros and cons of different approaches. The second part then explains, step by step, the pipeline of a practical project (with code). I hope you'll enjoy them!

Model uncertainty

Artificial intelligence algorithms typically provide predictions without taking their certainty or uncertainty into account. However, when dealing with delicate outcomes, such as distinguishing benign from malignant tumors, it is important to deliver only confident predictions. Modern algorithms achieve great results on medical imaging applications, but high accuracy does not necessarily imply low model uncertainty. Ideally, we want an AI that achieves great performance but, at the same time, is able to ask for human supervision whenever it is not confident enough.

Uncertainty in Deep Learning

Uncertainty in Deep Learning represents one of the major obstacles during development. Uncertainty may arise from the observations and be reflected in the subsequent model predictions. Fields like biology, physics, or healthcare have very little tolerance for error, so there is a special need for dealing with uncertain predictions.

Starting from the definition, we may first distinguish two kinds of uncertainty: aleatoric and epistemic [1]. The intrinsic stochasticity of the data is referred to as aleatoric uncertainty, and by definition it cannot be reduced. On the other side, the inadequacy of the training observations is referred to as epistemic uncertainty. Simply put, the lack of data and knowledge is reflected in the epistemic uncertainty, which may be reduced by including additional training examples. A visual representation of both kinds of uncertainty is shown in the figure below. Epistemic uncertainty also accounts for model uncertainty, because it is the type of uncertainty that can be explained away given enough data.

epistemic-vs-aleatoric

Overview of Bayesian Deep Learning

Well-established methods for the evaluation of model uncertainty rely on Bayesian Neural Networks, where each weight is represented by a distribution rather than a single value. Going backwards, Bayesian statistics' capacity to genuinely quantify uncertainty is a key characteristic. As a result, rather than focusing on point estimates of the parameters, it specifies a probability distribution across the parameters. The hypothesis on the value of each parameter is represented by this distribution, known as the posterior distribution, and it is computed through Bayes' Theorem. Given a weight $w$ and a dataset $D$, it is defined as:

$$p(w|D) = \frac{p(D|w)\, p(w)}{p(D)}$$

where $p(w|D)$ represents the posterior distribution, $p(D|w)$ represents the likelihood (what we observed from the dataset), $p(w)$ is the prior distribution (our prior belief), and finally $p(D)$ is the evidence, also called the marginal likelihood.
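To make the formula concrete, here is a minimal sketch with a made-up discrete prior and likelihood (all numbers are purely illustrative): when the weight can only take a handful of values, the evidence $p(D)$ reduces to a plain sum and the posterior follows directly.

```python
# A discrete toy version of Bayes' Theorem. The "weight" w can take only
# three hypothetical values, so the evidence p(D) is a plain sum.
prior = {0.0: 0.5, 0.5: 0.3, 1.0: 0.2}        # p(w): our prior belief
likelihood = {0.0: 0.1, 0.5: 0.4, 1.0: 0.8}   # p(D|w): observed from the data

evidence = sum(likelihood[w] * prior[w] for w in prior)              # p(D)
posterior = {w: likelihood[w] * prior[w] / evidence for w in prior}  # p(w|D)
```

Note how the data shifts belief toward $w = 1.0$ (high likelihood) even though the prior favored $w = 0.0$; the posterior always sums to one thanks to the evidence term.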

In Bayesian Neural Networks, the goal is therefore to find a predictive posterior distribution, which slightly differs from the posterior distribution. In fact, the posterior distribution is the distribution of an unknown quantity (treated as a random variable), whereas the predictive posterior distribution is the distribution of future data given the observed ones. Given a test observation $x$ and a candidate class $y$, this is formalized as:

$$p(y|D,x) = \int p(y|w,x)\, p(w|D)\, dw$$

where the idea is to consider every possible parameter setting (hence the integral), weighted by its probability. However, as you may imagine, considering all possible parameter settings in very deep networks with millions or billions of parameters is computationally intractable. This is why exact Bayesian inference is rarely employed in practice.

If you want to know more about Bayesian Deep Learning, I suggest diving into this wonderful article [2]. My goal here was only to underline the limits of Bayesian Deep Learning.

Monte Carlo dropout

Gal et al. [3] proposed Monte Carlo dropout in 2015 as an approximation of Bayesian inference. Standard dropout randomly deactivates the neurons of a given layer with probability $p$, and it is usually applied during training in order to reduce overfitting and regularize the learning phase. Monte Carlo dropout, on the other hand, approximates Bayesian inference by keeping dropout active at inference time as well. This has been shown to be equivalent to drawing samples from a posterior distribution, therefore allowing a form of Bayesian inference. In fact, every dropout configuration $\Theta_t$ yields a new sample from an approximate posterior distribution $q(\Theta|D)$. The model likelihood then becomes:

$$p(y|x) \simeq \frac{1}{T} \sum_{t=1}^{T} p(y|x,\Theta_t) \quad \text{s.t.} \quad \Theta_t \sim q(\Theta|D)$$

where, for simplicity, the model likelihood can be assumed to follow a Gaussian distribution:

$$p(y|x,\Theta) = \mathcal{N}\big(f(x,\Theta),\, s^2(x,\Theta)\big)$$

where $f(x,\Theta)$ represents the mean and $s^2(x,\Theta)$ represents the variance.

pipeline

A set of $N$ inferences with dropout active provides $N$ different model configurations and slightly different outcomes. The uncertainty is then estimated through a statistical analysis of these outputs, called Monte Carlo samples.

Each dropout configuration yields a different output by randomly switching off neurons (the ones with a red cross) at each forward pass. Multiple forward passes with different dropout configurations yield a predictive distribution over the mean, $p(f(x,\Theta))$.
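The whole procedure can be sketched in a few lines of NumPy. This is a toy stand-in, not the actual pipeline: a tiny two-layer network with hypothetical "trained" weights, where the dropout mask is sampled at every forward pass, including at inference time (in a framework like PyTorch, the equivalent trick is keeping the dropout layers in training mode during evaluation).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "trained" weights of a tiny two-layer network.
W1 = rng.normal(size=(4, 16))
W2 = rng.normal(size=(16, 1))

def mc_forward(x, p=0.5):
    """One stochastic forward pass: the dropout mask stays ON at inference."""
    h = np.maximum(x @ W1, 0.0)            # ReLU hidden layer
    mask = rng.random(h.shape) >= p        # Bernoulli dropout mask
    h = h * mask / (1.0 - p)               # inverted-dropout rescaling
    logits = h @ W2
    return 1.0 / (1.0 + np.exp(-logits))   # sigmoid probability

x = rng.normal(size=(1, 4))                # one test observation
T = 100
samples = np.stack([mc_forward(x) for _ in range(T)])  # T Monte Carlo samples

mean_pred = samples.mean(axis=0)   # approximates the predictive mean f(x, Theta)
var_pred = samples.var(axis=0)     # spread of the samples -> uncertainty estimate
```

Averaging the `samples` gives the Monte Carlo estimate of the model likelihood from the formula above, while their variance is what the next section turns into an uncertainty metric.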

Uncertainty metrics

Unless we keep the dropout probability very low, we are likely to obtain slightly different Monte Carlo samples for every prediction. Therefore, depending on the task at hand, we have to define uncertainty metrics able to estimate the uncertainty from the differences between these Monte Carlo samples.

This is therefore strictly related to the task we are dealing with. Are we dealing with a classification task? Regression? Or even semantic segmentation? They produce different outputs, and they have to be treated differently.

For a classification or regression task, it is interesting to evaluate the variance across the Monte Carlo samples. For semantic segmentation, on the other hand, there are many possible ways of estimating the uncertainty. For example, in Bayesian QuickNAT [3] the authors proposed three different metrics: the coefficient of variation of the MC volumes, the Dice agreement between MC samples, and the Intersection over Union of the MC samples.
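To illustrate one of these segmentation metrics, here is a small sketch of the Dice-agreement idea with toy binary masks: each Monte Carlo sample is simulated as a common base mask with a few randomly flipped pixels (the mask shape, flip rate, and number of samples are all made up for illustration).

```python
import numpy as np

def dice(a, b):
    """Dice coefficient between two binary masks."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())

# Hypothetical MC segmentation output: T binary masks of the same image.
rng = np.random.default_rng(1)
base = rng.random((32, 32)) > 0.5
# Each MC sample perturbs ~5% of the pixels of a common mask (toy stand-in).
masks = [np.logical_xor(base, rng.random((32, 32)) > 0.95) for _ in range(10)]

# Dice agreement: mean pairwise Dice across all MC sample pairs.
T = len(masks)
pairs = [(i, j) for i in range(T) for j in range(i + 1, T)]
dice_agreement = float(np.mean([dice(masks[i], masks[j]) for i, j in pairs]))
# dice_agreement close to 1 -> the samples agree -> low uncertainty
```

The closer the agreement is to 1, the more the Monte Carlo samples coincide, which signals a confident segmentation; low agreement flags regions the model is unsure about.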

The following example is taken from my Master's Thesis [5], and it represents a Monte Carlo dropout pipeline employed on a tumor classification task. The goal was to classify whether a tumor was benign or malignant. We leveraged 100 Monte Carlo samples and estimated the uncertainty as the variance of the malignant probabilities, showing that a higher variance was correlated with lower classification performance.
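In code, the variance-based criterion boils down to a few lines. The numbers below are invented for illustration (5 passes instead of 100), and the variance threshold is an assumed, task-specific choice, not a value from the thesis:

```python
import numpy as np

# Hypothetical MC output: mc_probs[i, t] is the malignant probability of
# case i at Monte Carlo pass t (5 passes shown here for brevity).
mc_probs = np.array([
    [0.91, 0.93, 0.90, 0.94, 0.92],  # consistent samples -> confident
    [0.40, 0.75, 0.20, 0.85, 0.55],  # scattered samples -> uncertain
])

mean_prob = mc_probs.mean(axis=1)   # final malignant probability per case
variance = mc_probs.var(axis=1)     # per-case uncertainty estimate

# Cases above an assumed variance threshold are deferred to a human
# expert instead of being classified automatically.
THRESHOLD = 0.01
needs_review = variance > THRESHOLD
print(needs_review)  # [False  True]
```

This is exactly the "ask for human supervision when not confident" behavior described at the beginning of the post: the second case has a similar mean probability but a much larger spread, so it gets flagged.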

MC Classification pipeline

Conclusions

To conclude, it should be clear that exact Bayesian Deep Learning is prohibitive for all those modern networks with billions of parameters. Several approximations of Bayesian inference have been proposed over the years, and one of the most popular is Monte Carlo dropout, which we covered in this first article.

The second part of this small series proposes a practical example (with code), let's have a look!


References:

  1. A Deeper Look into Aleatoric and Epistemic Uncertainty Disentanglement, Valdenegro 2022
  2. A Comprehensive Introduction to Bayesian Deep Learning
  3. Bayesian QuickNAT: Model Uncertainty in Deep Whole-Brain Segmentation for Structure-wise Quality Control
  4. Code of the project for the second part of this blogpost
  5. My MSc thesis
