A Gentle Introduction to Active Learning

Arindam Dey
13 min read · Sep 2, 2023


A trail in Peora — Uttarakhand ( Canon EOS 760D, f/8, Exp. Time 1/250sec, ISO-200, 50mm )

I recently ran into the problem of re-training deep learning models. My specific use case had very scarce data, and it was very difficult to gather additional labelled data. While researching the problem, I came across Active Learning (AL). This article is the gist of my learning notes, in which I intend to keep a working prototype.

Though there's no dearth of theoretical material on the topic, I will attempt only a very brief introduction and jump straight to a prototype.

Note: you can access the relevant Python code and Jupyter notebooks in the git repository here

Background

There are two broad aspects that AL addresses.

  1. Labelled data is expensive: in most practical scenarios, gathering labelled data is expensive. If your model is hungry for more labelled data, it may not be feasible to simply go back to the customer and ask for more.
  2. Label the most meaningful samples: in Keras, for example, the usual model.fit method picks random samples from a pool in batches in every epoch. However, there is little point in picking samples that do not contribute much to the training. Say we are working on a binary classification problem: what if we could tell the algorithm to pick the samples that contribute the most to finding the decision boundary?

You could end up with a mix of different scenarios, such as:

  • There is a lot of unlabeled data available, but getting it labeled is just too expensive.
  • There is no more data available, and it will take considerable effort and cost to gather more data and then label it.

The entire AL concept revolves around these two points. In a nutshell: when in need of more labelled data, try to get the samples that contribute the most to training.

Which Samples to Label?

How do we measure the importance of unlabeled samples? Let's consider a very simple scenario: binary classification with two features, x1 and x2. Look at the following diagram (from the modAL library documentation), which shows some labelled samples and some unlabeled ones. Obviously, the points closest to the decision boundary will have the maximum impact on the training. On the other hand, if we keep supplying points that are far from the decision boundary, the model will find it hard to learn, or never learn at all. In other words, the points the model is most uncertain about are key to finding the boundary.

Decision Boundary (Source: modAL Documentation)

Thus, if we have some unlabeled data, we must strive to pick the most uncertain samples and label them. The question is, how do we measure that uncertainty? Let's take a small detour.

Measuring Uncertainty

Let us assume for a moment that we have trained a classifier on the very limited training set available to us. Obviously, if this model were to predict new samples, it would not produce very confident predictions, i.e. it would be uncertain about its predictions. At this point, we introduce a few metrics to measure uncertainty.

Least Confidence (LC): look at the prediction probabilities of two new samples. Sample-2's prediction is much more uncertain than Sample-1's. If we were to calculate the LC of both samples, it would be higher for the second sample.

Formally, LC(x) = 1 − max P(y|x), i.e. one minus the highest predicted class probability. For these two samples, sample-2 has a higher LC and therefore more uncertainty.

Margin of Confidence (MC): in this case, for each sample we take the two highest predicted probabilities and compute MC as their difference.

Once again, we notice that sample-2 has a much lower MC score and therefore more uncertainty.

Entropy: entropy is a subject in itself in machine learning and information theory, but let's not go there. For now, let's accept that entropy is high when the probabilities are most alike; it is computed as H = −Σ pᵢ log(pᵢ), summed over the classes. For the example below we've taken base-10 for the logarithm.

For sample-2, the probabilities are more alike, hence the higher entropy.
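To make these three measures concrete, here is a small sketch with two made-up probability vectors (stand-ins for Sample-1 and Sample-2, not the actual values from the figures above):

import numpy as np
from scipy.stats import entropy

#Hypothetical prediction probabilities of a binary classifier for two samples
sample_1 = np.array([0.90, 0.10])   #fairly confident
sample_2 = np.array([0.55, 0.45])   #much less confident
probas = np.vstack([sample_1, sample_2])

#Least Confidence: 1 minus the highest predicted probability
lc = 1 - probas.max(axis=1)

#Margin of Confidence: difference between the two highest probabilities
sorted_probas = np.sort(probas, axis=1)
mc = sorted_probas[:, -1] - sorted_probas[:, -2]

#Entropy (base-10, to match the example in the text)
ents = entropy(probas.T, base=10)

print(lc)    #higher for sample_2 -> more uncertain
print(mc)    #lower for sample_2  -> more uncertain
print(ents)  #higher for sample_2 -> more uncertain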

We now have formal ways of measuring uncertainty, and we will soon put them to use.

The General Approach

Coming back to our original problem, consider the following scenario. We are building a model that needs to meet certain evaluation criteria (recall, accuracy, F1-score, etc.). So we have the usual hold-out test set and a training set. We build an initial model and evaluate it on the hold-out test set. Alas! The metrics are below the evaluation criteria. We realize that we need more data. As luck would have it, there is abundant data, but all of it is unlabeled!

So we get hold of an expert who can label the data for us. However, as labelling is expensive, we choose the samples very carefully: we use one of the measures above to pick the samples with the highest uncertainty and pass them to the human annotator.

Once the human annotator labels these samples, we train the model again. We keep repeating this process until we reach acceptable model performance.
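To see the whole loop in miniature before we build the real thing, here is a tiny self-contained toy version with scikit-learn on synthetic data. Everything in it is purely illustrative (the dataset, the logistic regression model, and the "annotation" step, where the pool labels are in fact already known) and is not part of the actual experiment that follows.

#A toy active learning loop on synthetic data (illustrative only)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

#Start with a small labelled seed; the rest pretends to be an unlabelled pool
X_seed, X_pool, y_seed, y_pool = train_test_split(X_rest, y_rest, train_size=50, random_state=0)

model = LogisticRegression(max_iter=1000)
for iteration in range(10):
    model.fit(X_seed, y_seed)
    print(f"iter {iteration}: seed={len(X_seed)}, test acc={model.score(X_test, y_test):.3f}")

    #Query: pick the 20 pool samples with the least confident predictions
    proba = model.predict_proba(X_pool)
    uncert = 1 - proba.max(axis=1)
    idx = np.argsort(uncert)[-20:]

    #"Annotate" them (here the labels are already known) and move them to the seed
    X_seed = np.vstack([X_seed, X_pool[idx]])
    y_seed = np.concatenate([y_seed, y_pool[idx]])
    X_pool = np.delete(X_pool, idx, axis=0)
    y_pool = np.delete(y_pool, idx, axis=0)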

General AL Flow (Source: Keras Documentation)

We are now ready to perform some experiments.

Our Problem Statement

We will start with a relatively simple problem. For this demonstration, we will use the dogs-vs-cats training dataset from Kaggle. It contains 25,000 images, divided equally between cats and dogs. The task is to design a binary classifier based on a CNN architecture. Along the way, we will pretend that we do not have all these samples at our disposal.

The conventional approach would typically split the data into train, validation and test sets, followed by building a model. We'll take this approach first to set the benchmark.

Let’s Get Coding

Initially, we’ll use a conventional approach as follows.

  • Download the dogs-vs-cats.zip file from Kaggle and unzip only train.zip into your project folder. The code unzips the files from train.zip into a dataset folder.
  • Set aside 10% of the samples each for the validation and test sets.
  • Use the remaining samples for training.
  • Build a baseline model using the train and validation sets for a certain number of epochs.
  • Evaluate the model on the test set.
  • Measure uncertainty on the test set to understand the metrics.

This code snippet builds a two-dimensional numpy array called dataset from the image files, with the first column containing the file names and the second column containing the label (0 for cat and 1 for dog).

import tensorflow as tf
from tensorflow.image import resize
from tensorflow import keras
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import CategoricalAccuracy
from sklearn.model_selection import train_test_split

from tensorflow.keras.utils import to_categorical
import matplotlib.pyplot as plt
import numpy as np
import os
from scipy.stats import entropy

#A custom library for helper functions
from src.helper import *
np.random.seed(0)

path=os.path.join(os.getcwd(),'dataset')
label_dict={'cat':0,'dog':1}
dataset=np.array([(os.path.join(path,i),label_dict[i.split('.')[0]]) for i in os.listdir(path)])

At this point if you print the first 3 elements of the array dataset, you should get something like this. Nothing much going on here really.

dataset[0:3]
>>[['C:\\xxxx\\dataset\\cat.0.jpg' '0']
['C:\\xxxx\\dataset\\cat.1.jpg' '0']
['C:\\xxxx\\dataset\\cat.10.jpg' '0']]

This is followed by assigning the filenames to X and the labels to y. We shuffle them a bit and then split the data into train, validation and test sets.

X,y=dataset[::,0],dataset[::,1]
y = y.astype(int)
y = to_categorical(y)

#Shuffle the dataset
p = np.random.permutation(len(X))
X,y = X[p], y[p]

#Strip off 10% samples for hold out test set
test_idxs = np.random.choice(len(X), size=int(0.1*len(X)), replace=False, p=None)
x_test, y_test = X[test_idxs],y[test_idxs]

#Delete the test set samples from X,y
X = np.delete(X, test_idxs)
y = np.delete(y, test_idxs, axis = 0)

#usual train-val split
#the test_size = 0.11 is just a small maneuver to have
#almost same number val samples as test
x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=0.11, random_state=42)

Once again, nothing fancy, just three separate arrays to hold the X and y vectors. If we check the number of samples in each split, this is what we get: out of the 25,000 samples, roughly 10% each go to the validation and test sets, and the remaining ~80% to training.

print(f"Samples in Training set: {x_train.shape[0]}")
print(f"Samples in Training set: {x_val.shape[0]}")
print(f"Samples in Training set: {x_test.shape[0]}")

>>Samples in Training set: 20025
Samples in Validation set: 2475
Samples in Test set: 2500

We have a custom function build_dataset that returns batches of tensors which can be fed to the training process.

#The build_dataset is a custom function that returns tensor batches

val_dataset=build_dataset(x_val,y_val,repeat=False,batch=256)
test_dataset=build_dataset(x_test,y_test,repeat=False,batch=256)

BATCH_SIZE=16
STEPS_PER_EPOCH=len(x_train)//BATCH_SIZE

train_dataset=build_dataset(x_train,y_train,batch=BATCH_SIZE)
input_shape=train_dataset.element_spec[0].shape[1:]
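For reference, the real build_dataset lives in src/helper.py. A minimal sketch of what such a function might look like is given below; the image size, decoding and preprocessing steps here are my assumptions and may differ from the actual helper.

#Hypothetical sketch of a build_dataset helper (the actual implementation
#lives in src/helper.py and may differ)
IMG_SIZE = (128, 128)  #assumed image size

def load_image(path, label):
    img = tf.io.read_file(path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, IMG_SIZE)
    img = tf.cast(img, tf.float32) / 255.0
    return img, label

def build_dataset(x, y, batch=16, repeat=True, shuffle=True):
    ds = tf.data.Dataset.from_tensor_slices((x, y))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(x))
    ds = ds.map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.batch(batch)
    if repeat:
        ds = ds.repeat()
    return ds.prefetch(tf.data.AUTOTUNE)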

The function simple_model from the helper library builds a CNN architecture. It's not too big, roughly 684 thousand parameters.
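The exact architecture is defined in src/helper.py. Purely as an indication, a small CNN along the following lines would do the job; the real simple_model may well differ, and I make no claim about matching its parameter count.

#Indicative sketch of a small CNN; the actual simple_model is in src/helper.py
from tensorflow.keras import layers, models

def simple_model(input_shape):
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation='relu'),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation='relu'),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation='relu'),
        layers.MaxPooling2D(),
        layers.GlobalAveragePooling2D(),
        layers.Dense(128, activation='relu'),
        layers.Dense(2, activation='softmax'),  #two classes: cat / dog
    ])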

model = simple_model(input_shape)
model.compile(
    loss="binary_crossentropy",
    optimizer=Adam(),
    metrics=[keras.metrics.Recall(),
             keras.metrics.CategoricalAccuracy()]
)
model.summary()

Some basic callbacks, and we fit the model. Although the number of epochs is set to 200, when I ran it the model converged by the 34th epoch.

checkpoint = ModelCheckpoint(filepath='model/model_full.h5',
                             monitor='val_loss', save_best_only=True, verbose=1)

csv_logger = keras.callbacks.CSVLogger('logger/trainlog_full.csv',
                                       separator=',', append=False)

early_stopper = keras.callbacks.EarlyStopping(monitor='val_loss',
                                              min_delta=0.001,
                                              restore_best_weights=True,
                                              patience=10)

callbacks_list = [checkpoint, early_stopper, csv_logger]

model.fit(train_dataset, steps_per_epoch=STEPS_PER_EPOCH, epochs=200,
          validation_data=val_dataset, validation_steps=None,
          callbacks=callbacks_list)

What's important is to evaluate the baseline model's performance on the test dataset: it reaches about 88% accuracy after consuming the entire training set of roughly 20,000 samples.

test_loss, test_recall, test_acc = model.evaluate(test_dataset, verbose=0)

print("-" * 100)
print(model.evaluate(test_dataset, verbose=0,return_dict=True))

----------------------------------------------------------------------------------------------------
{'loss': 0.2882618308067322, 'categorical_accuracy': 0.8831999897956848}

The learning curves present no surprises. As a matter of fact, from the 23rd epoch the validation accuracy actually starts degrading.

Before we move on to the next section, let us measure the various uncertainty metrics on the test set.

Test Set Uncertainty: based on the uncertainty measures we learned above, let us try them out on the test set. We should be able to identify the most uncertain predictions in the test set. Let us start with the top 10 LC scores.

y_test_proba = model.predict(test_dataset)
#Calculate Least Confidence
y_test_uncert = 1 - y_test_proba.max(axis=1)
#Indices of the top 10 Least Confidence
y_test_top_lc = np.argsort(y_test_uncert)[-10:]
#Print the predictions for the top 10 least confidence
print(y_test_proba[y_test_top_lc])

>>[[0.49423876 0.50576127]
[0.49471268 0.5052873 ]
[0.50421554 0.4957844 ]
[0.50365025 0.49634972]
[0.503 0.49699995]
[0.497582 0.50241804]
[0.49760765 0.50239235]
[0.49801224 0.50198776]
[0.50127554 0.49872446]
[0.5001126 0.49988738]]

Clearly, these predictions aren't very confident. How about the MC scores for the test set? Once again, we notice that these prediction probabilities are almost identical for both classes, which means the model isn't very confident about these samples.

part = np.partition(-y_test_proba, 1, axis=1)
# margin calculation
margin = - part[:, 0] + part[:, 1]
# indices of the lowest margin scores
y_test_least_mc = np.argsort(margin)[:10]
#Print the predictions for the 10 least margins
print(y_test_proba[y_test_least_mc])

[[0.5001126 0.49988738]
[0.50127554 0.49872446]
[0.49801224 0.50198776]
[0.49760765 0.50239235]
[0.497582 0.50241804]
[0.503 0.49699995]
[0.50365025 0.49634972]
[0.50421554 0.4957844 ]
[0.49471268 0.5052873 ]
[0.49423876 0.50576127]]
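If the np.partition trick above looks opaque, an equivalent (if slightly slower) formulation with a full sort gives the same margins:

#Equivalent margin calculation using a full sort instead of np.partition
sorted_proba = np.sort(y_test_proba, axis=1)             #ascending per row
margin_alt = sorted_proba[:, -1] - sorted_proba[:, -2]   #top-1 minus top-2 probability
print(y_test_proba[np.argsort(margin_alt)[:10]])         #same ten least-margin samples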

Finally, let us look at the predictions with the top ten entropies. Note that scipy's entropy function computes along axis 0 by default, hence the transpose of the probability array.

#indices of the predictions with 10 largest entropies
y_test_max_ents = np.argsort(entropy(y_test_proba.T))[-10:]
#Print the 10 predictions with largest entropies
print(y_test_proba[y_test_max_ents])

>>[[0.49423876 0.50576127]
[0.49471268 0.5052873 ]
[0.50421554 0.4957844 ]
[0.50365025 0.49634972]
[0.503 0.49699995]
[0.497582 0.50241804]
[0.49760765 0.50239235]
[0.49801224 0.50198776]
[0.50127554 0.49872446]
[0.5001126 0.49988738]]

We have successfully established that we can use LC, MC and entropy to find the most uncertain predictions. So how do we use these concepts in active learning?

The Active Learning (AL) Loop

Using the entire training set, we were able to reach 88% accuracy on the test dataset. Now let us pretend that we do not have a large training set of 20,000 samples like the one above, and instead start with a much smaller training set to build an initial model. Say we have a seed dataset with only 5000 labelled samples, and we start training with whatever we have. The remaining ~15,000 samples are kept in a pool.

The code for the AL demo can be found in 02_AL_Training.ipynb. You will notice that most of the initial part is identical to the previous notebook, so I am not repeating it here.

initial_seed = 5000
x_seed , x_pool = x_train[0:initial_seed], x_train[initial_seed:]
y_seed , y_pool = y_train[0:initial_seed], y_train[initial_seed:]

print(f"Samples in Seed set: {x_seed.shape[0]}")
print(f"Samples in Pool: {x_pool.shape[0]}")
print(f"Samples in Validation set: {x_val.shape[0]}")
print(f"Samples in Test set: {x_test.shape[0]}")

>>Samples in Seed set: 5000
Samples in Pool: 15025
Samples in Validation set: 2475
Samples in Test set: 2500

After training the model on these initial samples , our performance on the test dataset is as follows.

print("-" * 100)
print(model.evaluate(test_dataset, verbose=0,return_dict=True))

>>{'loss': 0.47423428297042847, 'categorical_accuracy': 0.7764000296592712}

Obviously, this is nowhere near the 88% accuracy we achieved with the full training set.

We will now build the AL logic according to this flowchart.

Step_1: test the current model on the test set. If its accuracy meets or exceeds the baseline accuracy we obtained with the full training set, we exit the AL loop. Otherwise we proceed.

Step_2: measure the uncertainties in the pool dataset; in other words, we query the pool. For this experiment we will use the entropy measure: pick the 200 samples with maximum entropy, append them to the seed dataset and delete them from the pool.

In a real-world scenario, we would send these samples to a human annotator, or use some other labelling technique, before appending them to the seed dataset.

Step_3: re-compile the model to reset the optimizer state and fit again. Save the model if there is an improvement in validation loss. Go back to Step_1.

The code might appear a bit intimidating, but it's quite simple, I promise!

#num_iterations, sampling_size, acc_baseline, al_history, pool_dataset and
#callbacks_list are defined earlier in the notebook; clear_output comes from
#IPython.display and clear_session from tensorflow.keras.backend
for iteration in range(num_iterations):

    #Step_1
    loss, acc = model.evaluate(test_dataset, verbose=0)
    print(f"Test Set Accuracy after {iteration} iteration {acc}")
    al_history.append([loss, acc, x_seed.shape[0], x_pool.shape[0]])

    if acc >= acc_baseline:
        break

    #Step_2
    #Use the current model to predict the pool dataset
    y_pool_proba = model.predict(pool_dataset)

    #Pick the indices of the top entropy samples in the pool
    pool_max_ents = np.argsort(entropy(y_pool_proba.T))[-sampling_size:]

    #Acquire those samples from the pool
    x_sample = x_pool[pool_max_ents]
    y_sample = y_pool[pool_max_ents]

    #Add these samples to the seed dataset
    y_seed = np.concatenate((y_seed, y_sample), axis=0)
    x_seed = np.concatenate((x_seed, x_sample), axis=0)

    #Delete the acquired samples from the pool
    x_pool = np.delete(x_pool, pool_max_ents, 0)
    y_pool = np.delete(y_pool, pool_max_ents, 0)

    #Build the tensorflow dataset objects for this iteration
    pool_dataset = build_dataset(x_pool, y_pool, repeat=False, batch=256,
                                 shuffle=False)
    train_dataset = build_dataset(x_seed, y_seed, batch=BATCH_SIZE)

    print(f"Samples in seed dataset {x_seed.shape[0]} , in pool dataset {x_pool.shape[0]}")
    print("-" * 100)

    #Step_3
    model.compile(
        loss="binary_crossentropy",
        optimizer=Adam(),
        metrics=[CategoricalAccuracy()]
    )

    history = model.fit(train_dataset, steps_per_epoch=STEPS_PER_EPOCH, epochs=100,
                        validation_data=val_dataset, validation_steps=None,
                        callbacks=callbacks_list)

    #If the fit method produced a new best model, load it for the next iteration
    model = keras.models.load_model("model/model_al.h5")
    clear_output()
    clear_session()

On executing this, you should see something like the following after every iteration. For example, at the beginning of the 5th iteration, this is what happens:

  • Calculate the test accuracy with the last saved model.
  • Predict the samples in the pool, so that we can calculate the entropy of each sample currently in the pool.
  • Report the number of samples currently in the seed and pool sets.
Test Set Accuracy after 5 iteration 0.828000009059906
55/55 [==============================] - 4s 71ms/step
Samples in seed dataset 5800 , in pool dataset 14225
----------------------------------------------------------------------------------------------------
Epoch 1/100
1250/1251 [============================>.] - ETA: 0s - loss: 0.1595 - categorical_accuracy: 0.9388
Epoch 1: val_loss did not improve from 0.42407
1251/1251 [==============================] - 18s 13ms/step - loss: 0.1596 - categorical_accuracy: 0.9387 - val_loss: 0.5336 - val_categorical_accuracy: 0.8214
Epoch 2/100
1251/1251 [============================>.] - ETA: 0s - loss: 0.1512 - categorical_accuracy: 0.9407
Epoch 2: val_loss improved from 0.42407 to 0.40954, saving model to model\model_al.h5
1251/1251 [==============================] - 16s 13ms/step - loss: 0.1512 - categorical_accuracy: 0.9407 - val_loss: 0.4095 - val_categorical_accuracy: 0.8473

The loop then moves 200 samples from the pool to the seed and starts another round of training. If there is any improvement in validation loss, the model is saved.

Top : Baseline model training using the full training set. Bottom : model trained using active learning

After 29 iterations, we have reached almost the same level of accuracy as the baseline model built on the full training set. What's more important, we only used about 5800 additional samples and did not touch the remaining ~9200 samples in the pool at all!

The following chart shows the sizes of the seed and pool datasets, followed by the model performance on the test set after every re-training. It took 29 re-trainings, meaning we queried the pool 29 times; each time, we selected the 200 points with the highest entropy and added them to the seed set.

Starting from an initial labelled set of 5000 samples, we would have sent only around 5800 pool samples to a team of human annotators. That's a huge saving! Note that when you run the experiment you may end up with slightly different numbers; nevertheless, you will not need to exhaust the entire pool.

Before we conclude, let's visualize some of the results from the active learning phase. What happened in each re-training can be seen in the chart below. The green markers indicate iteration starts, i.e. where a new re-training began. Those 29 re-training attempts amount to 433 epochs.

Note the initial variation in accuracy. Early on, when we were adding highly uncertain samples from the pool to the seed, the model had a lot to learn from each batch of new samples. This "surprise element" fades as the training progresses.

Conclusion

We have seen that, when labelled data is scarce, it can make sense to gather a pool of unlabeled data and label it iteratively in batches. We "query" the most uncertain samples from the pool and have them labelled. This way, we do not spend time and money on labelling every unlabeled sample.

For the sake of experimentation, the querying from the pool based on uncertainty was done manually here. In a more practical scenario you can use libraries like modAL, which handle the querying from the pool for you: you set the preferred uncertainty measure and the package does the rest. Have a look at the documentation in the reference section; it has many features that you can explore for your projects.
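As a rough idea of what that looks like, here is a sketch of the same query-then-teach cycle with modAL and a scikit-learn estimator on synthetic data. The data and estimator below are stand-ins, not the cats-vs-dogs setup from this article; for the Keras model used above you would first wrap it in a scikit-learn compatible wrapper.

#Sketch of an active learning loop with modAL and a scikit-learn estimator
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from modAL.models import ActiveLearner
from modAL.uncertainty import entropy_sampling

#Synthetic stand-in data: numeric features and integer labels
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_seed, y_seed = X[:500], y[:500]
X_pool, y_pool = X[500:], y[500:]

learner = ActiveLearner(
    estimator=RandomForestClassifier(),
    query_strategy=entropy_sampling,   #or uncertainty_sampling / margin_sampling
    X_training=X_seed, y_training=y_seed,
)

for _ in range(10):
    #modAL picks the most uncertain pool samples for us
    query_idx, query_samples = learner.query(X_pool, n_instances=200)
    #in a real project the labels would come from a human annotator
    learner.teach(X_pool[query_idx], y_pool[query_idx])
    X_pool = np.delete(X_pool, query_idx, axis=0)
    y_pool = np.delete(y_pool, query_idx, axis=0)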

You will find in the associated literature that there are many other paradigms of active learning; we have only scratched the surface here.

That’s all folks !! Hope you enjoyed the article.

References:

Active Learning, Burr Settles

Human-in-the-Loop Machine Learning, Robert (Munro) Monarch

modAL in a nutshell: a Python active learning library developed by Tivadar Danka
