Region Proposal Network (RPN) (in Faster RCNN) from scratch in Keras

Akash Kewar
12 min read · May 17, 2021

Before you start, I highly recommend opening this Google Colab notebook in parallel to understand the concepts better.
Also, my implementation is heavily based on the Guide to build Faster RCNN in PyTorch article.

Region proposal network that powers Faster RCNN object detection algorithm

In this article, I will strictly discuss the implementation of stage one of a two-stage object detector: the region proposal network (in Faster RCNN).

Two-stage detectors consist of two stages (duh). The first stage (network) suggests regions of interest (regions of the image where an object might be), and these proposals are then sent to another network (stage two) for the actual classification of the proposals and offset regression (more on this later). One-stage detectors, on the other hand, are really fast (in terms of prediction and processing), but they usually have lower accuracy than two-stage detectors (there is always a trade-off between speed and correctness).

(a) one-stage detector, (b) two-stage detector (Image source)

The Basic Architecture

The RPN in Faster RCNN consists of a 3x3 convolution applied to the feature map produced by the backbone network (VGG16 in our case, 50x50x512). This convolution outputs 512 channels, which are then fed into two sibling 1x1 convolution layers, one for the objectiveness score (classification) and one for offset regression.
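Below is a minimal Keras sketch of this head (the layer names and the 9-anchors-per-location count are assumptions carried over from the rest of this article, not the notebook's exact code):

import tensorflow as tf
from tensorflow.keras import layers

n_anchors = 9  # anchor boxes per feature-map location

# 50x50x512 feature map from the VGG16 backbone
feature_map = layers.Input(shape=(50, 50, 512))
x = layers.Conv2D(512, 3, padding="same", activation="relu")(feature_map)

# sibling 1x1 convolutions: objectiveness score and box offsets per anchor
objectiveness = layers.Conv2D(n_anchors, 1, activation="sigmoid")(x)
deltas = layers.Conv2D(n_anchors * 4, 1, activation="linear")(x)

rpn = tf.keras.Model(inputs=feature_map, outputs=[deltas, objectiveness])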

Input Image And Reshaping

Input image (resized to 800 x 800) with ground truth boxes

The input image contains 2 objects that need to be detected. We first resize the image to 800 x 800. To resize the ground truth boxes, we multiply each box coordinate by the corresponding compression/expansion ratio.

*Note: The resizing step is not necessary, but for simplicity we will resize. We could feed an arbitrary image size, which would require a few changes in the code.

Reshape image and annotations​
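A minimal sketch of what this resize-and-rescale step can look like (OpenCV for resizing and a [y_min, x_min, y_max, x_max] box layout are assumptions here; the notebook's version may differ):

import cv2
import numpy as np

def resize_image_and_boxes(image, boxes, target_size=800):
    # boxes: float array of shape (num_boxes, 4) as [y_min, x_min, y_max, x_max]
    h, w = image.shape[:2]
    y_ratio, x_ratio = target_size / h, target_size / w
    resized = cv2.resize(image, (target_size, target_size))
    boxes = boxes.astype(np.float32).copy()
    boxes[:, [0, 2]] *= y_ratio  # scale y coordinates by the vertical ratio
    boxes[:, [1, 3]] *= x_ratio  # scale x coordinates by the horizontal ratio
    return resized, boxes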

Before proceeding we should understand and familiarize ourselves with a few keywords.

Backbone Network

The backbone network is a network (architecture) pre-trained on an image classification task, like VGG16, ResNet18, GoogLeNet (Inception v1) and so on. We use these networks as fully convolutional networks by removing the "top" (the fully connected layers). By doing so, each cell in the feature map contains information about a group of pixels in the original image. The backbone acts as a feature extractor whose output is fed into the RPN together with the anchor box offsets for each cell in the feature map. For instance, given an 800x800x3 image, after feeding it into the backbone (VGG16 in our case) we get a 50x50x512 feature map. We place n' anchor boxes at each location in the feature map, or equivalently on the original input with a stride of input_size_x / feature_map_size_x (800/50 = 16), so the total number of anchor boxes is n' * 50 * 50. We compute the offset of each anchor box to its nearest ground truth box (based on the IOU), which tells how far we should move the corresponding anchor box to match the ground truth box (more on this later).
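A minimal sketch of building such a feature extractor in Keras; taking the output of block5_conv3 (stride 16) is an assumption that matches the 50x50x512 size mentioned above:

from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model

# VGG16 without its fully connected "top"; for an 800x800x3 input the
# block5_conv3 activation is a 50x50x512 feature map (stride 16)
vgg = VGG16(include_top=False, input_shape=(800, 800, 3))
backbone = Model(inputs=vgg.input, outputs=vgg.get_layer("block5_conv3").output)

# feature_map = backbone.predict(image_batch)  # image_batch: a preprocessed (1, 800, 800, 3) array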

Anchor Boxes

Anchor boxes, also called priors, are predefined boxes of fixed sizes. We use boxes of different scales/areas and aspect ratios to capture objects of various shapes and sizes. For instance, a standing human has the shape of a vertical rectangle, while a ship or boat has the shape of a horizontal rectangle.

We place these anchor boxes on the input image at pixel locations given by the stride derived from the backbone network (we do this because we need a set of anchors for each cell in the feature map) and pre-process them; the result becomes the labels for the RPN model (more on this under the next heading).

Anchor boxes having the same area but different aspect ratios to capture different object shapes and sizes.

We generate anchor centres, and at each anchor centre we will place 9 anchor boxes.

Generating anchor centres
Anchor centres: this is where we will put all the anchor boxes (9 boxes per anchor centre)
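A minimal sketch of generating these centres (stride 16 for the VGG16 backbone, giving centres at 8, 24, ..., 792 on both axes):

import numpy as np

stride, feat_size = 16, 50  # 800 / 16 = 50 feature-map cells per side

# centre of each feature-map cell projected back onto the 800x800 image
ctr_y = np.arange(stride // 2, feat_size * stride, stride)  # 8, 24, ..., 792
ctr_x = np.arange(stride // 2, feat_size * stride, stride)
centres = np.array([(y, x) for y in ctr_y for x in ctr_x])  # shape (2500, 2)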

Assuming the backbone network is VGG16, we get a 50x50x512 feature map, so there are 50x50 = 2500 anchor locations. For each anchor centre, we place 9 anchor boxes of different scales and ratios, giving a total of 2500 * 9 = 22500 anchor boxes. We then pre-process these anchor boxes: discard all the anchor boxes which fall outside the image, assign a label (1 to an anchor box which contains an object, 0 if it contains background, and -1 otherwise, which we will ignore during training), and normalize/parameterize each anchor box based on some criteria to make it a proper label for the RPN network (more on this under the next heading).

Generating Anchor Box Coordinate Using Aspect Ratio And Scale

After getting the anchor centres, we generate anchor boxes based on the anchor centres, ratios and scales. Here we will use the following expressions to compute the height and width of an anchor box given its ratio and scale.

Computing height given scale (area) and aspect ratio
Computing width given scale (area) and aspect ratio

*NOTE: width should be 45.24 and not​ 49.24.

# Anchor box coordinates
[[ -37.254834 , -82.50966799, 53.254834 , 98.50966799],
[ -82.50966799, -173.01933598, 98.50966799, 189.01933598],
[-173.01933598, -354.03867197, 189.01933598, 370.03867197],
...,
[ 701.49033201, 746.745166 , 882.50966799, 837.254834 ],
[ 610.98066402, 701.49033201, 973.01933598, 882.50966799],
[ 429.96132803, 610.98066402, 1154.03867197, 973.01933598]]
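A minimal sketch that reproduces these coordinates (continuing from the stride and centres defined above; the scales [8, 16, 32], ratios [0.5, 1, 2] and the [y_min, x_min, y_max, x_max] ordering are inferred from the printed values and may differ slightly from the notebook):

ratios = [0.5, 1, 2]  # height:width aspect ratios
scales = [8, 16, 32]  # multiplied by the stride, giving base sizes of 128, 256 and 512 pixels

anchors = np.zeros((len(centres) * len(ratios) * len(scales), 4))
idx = 0
for cy, cx in centres:
    for ratio in ratios:
        for scale in scales:
            h = stride * scale * np.sqrt(ratio)        # height from area and ratio
            w = stride * scale * np.sqrt(1.0 / ratio)  # width from area and ratio
            anchors[idx] = [cy - h / 2, cx - w / 2, cy + h / 2, cx + w / 2]
            idx += 1
# anchors.shape -> (22500, 4)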

Let's visualize the 9 anchor boxes whose anchor centre is at the centre of the image.

9 anchor boxes whose anchor centre is located at the centre of the image

For more details look at the “Generating anchor boxes for each anchor location​​” section in the notebook.

IOU (Intersection Over Union)

IOU is a way to measure overlap between two rectangular boxes. It is a popular evaluation metric in object detection, which measures the overlap between the ground truth box and the predicted bounding box.
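A minimal sketch of the IOU computation between two boxes (the [y_min, x_min, y_max, x_max] corner format used above is assumed):

def iou(box_a, box_b):
    # intersection rectangle
    y1, x1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    y2, x2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, y2 - y1) * max(0.0, x2 - x1)
    # union = area_a + area_b - intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)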

We will use IOU to assign labels to each of the anchor boxes. Firstly:

  1. We will select all the boxes which fall inside the image.
  2. We will label the anchor box as "1" (contains object) or positive if it has an IOU/overlap greater than or equal to 0.7.
  3. We will label the anchor box as "0" or negative if it has an IOU/overlap less than or equal to 0.3.
  4. We will label the anchor box as “-1” otherwise which will be discarded and won’t be used as a label while training.
  5. We will also discard all the boxes which are going beyond the image size (outside the image).

Secondly:

  1. We will assign each anchor box to a ground truth box, i.e. we record which ground truth box the anchor overlaps the most. For instance, there might be multiple objects in an image, so we store a ground truth index for each anchor box denoting which ground truth box the current anchor box belongs to.

Thirdly:

  1. We will assign each ground truth box to an anchor box that has maximum IOU.

*NOTE: a valid anchor box is an anchor box that falls inside the image.

Assigning Each Valid Anchor Box To Ground Truth Box Based On The IOU

After getting all the anchor box coordinates, we compute the IOU of each anchor box with every ground truth box, giving a matrix with one row per anchor box and one column per ground truth box. We then assign each anchor box to the ground truth box with which it has the maximum IOU score (i.e. the ground truth box that overlaps the current anchor box the most).

IOU and best ground truth box assigned to each object/anchor box pair

*NOTE: we have two objects in our image which need to be detected.

Anchor boxes with highest IOU with ground truth boxes

Anchor boxes are already capturing the objects with only simple filtering and the IOU metric.

Assigning Labels To Each Anchor Box

Label preparation is a bit tricky in the context of the RPN because the RPN outputs anchor offsets and the corresponding objectiveness scores. After generating the anchors, we need to assign each anchor a label denoting whether it contains an object or background (because the RPN has 2 outputs: the objectiveness score and the anchor offsets). There are multiple cases to consider here. Remember, multiple anchors could overlap the same ground truth box, and those overlapping anchors are still valid anchors (containing the object).

label mapping = {object:1, background:0, ignore: -1}

Case 1: assign label "1" to the anchor(s) that have the highest overlap with a ground truth box (here we do not apply the threshold condition used in the next case, because we need at least one anchor box for each object).

Case 2: assign label "1" to anchors whose overlap with a ground truth box is greater than or equal to 0.7.

Case 3: assign label "0" to anchors whose overlap with every ground truth box is less than 0.3.

Case 4: the rest are assigned label "-1" and will be discarded (as are all the anchor boxes which fall outside the image).

Labelling each valid anchor box based on top anchors and the IOU threshold
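A minimal sketch of these four cases (here ious is assumed to be the (num_valid_anchors, num_ground_truth_boxes) IOU matrix from the previous section):

pos_thresh, neg_thresh = 0.7, 0.3

labels = np.full(ious.shape[0], -1, dtype=np.int32)  # case 4: ignore by default

max_iou_per_anchor = ious.max(axis=1)     # best overlap of each anchor with any ground truth box
best_anchor_per_gt = ious.argmax(axis=0)  # anchor with the highest IOU for each ground truth box

labels[max_iou_per_anchor < neg_thresh] = 0    # case 3: background
labels[max_iou_per_anchor >= pos_thresh] = 1   # case 2: object
labels[best_anchor_per_gt] = 1                 # case 1: best anchor for each object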

Balancing Anchor Labels And Creating Mini Batch Of 256 Anchors

As we would have many background-class instances (label 0), our model might get biased towards the background class. To mitigate this, we will use 256 anchors overall with a 1:1 ratio of positive to negative (128 instances of each class). If we don't have enough foreground instances (128), we pad the batch by sampling extra background examples (for instance, if we have 50 positive examples we lack 78 (128 - 50) instances, so we sample 78 more background samples to complete the 256 mini-batch).
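A minimal sketch of this balancing step, continuing from the labels array above:

n_sample, pos_ratio = 256, 0.5
n_pos = int(n_sample * pos_ratio)  # at most 128 positives

pos_idx = np.where(labels == 1)[0]
neg_idx = np.where(labels == 0)[0]

# disable surplus positives
if len(pos_idx) > n_pos:
    disable = np.random.choice(pos_idx, size=len(pos_idx) - n_pos, replace=False)
    labels[disable] = -1

# fill the rest of the 256-anchor mini-batch with negatives
n_neg = n_sample - min(n_pos, len(pos_idx))
if len(neg_idx) > n_neg:
    disable = np.random.choice(neg_idx, size=len(neg_idx) - n_neg, replace=False)
    labels[disable] = -1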

Computing Anchor Offset For Each Anchor Box To Assigned Ground Truth Box

As we have assigned a ground truth box to each anchor box based on the maximum IOU score, we want the model to adjust the current anchor box to match that ground truth box as closely as possible. As our anchor boxes are formatted as x_min, x_max, y_min, y_max, we need to convert them into height, width, centre x, centre y format for this calculation.

Coordinate conversion

We compute anchor offsets to the assigned ground truth box using the following equation:

We compute the horizontal (delta x) and vertical (delta y) difference from the ground truth centre point to the assigned anchor box centre point, scaled by the width/height of the anchor. In this way, our model learns to adjust the anchor box by predicting how "off" the anchor centre is from the ground truth box centre.

After adjusting the centre coordinates we also need to adjust the width and height of the anchor box; we do this by computing the log of the ratio of the ground truth box's width/height to the anchor box's width/height.

These will be our target values (regression targets), which we will use to train our model.

Our model will predict these offsets, and these expressions will be reversed to get the coordinates of the actual region proposals (adjusted anchor boxes).

Computing parameterized offset (will be our regression target)
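A minimal sketch of this parameterization (anchors and their assigned ground truth boxes are assumed to be aligned row by row, in the corner format used earlier):

def boxes_to_deltas(anchors, gt_boxes):
    # corner format -> centre/size format
    anc_h = anchors[:, 2] - anchors[:, 0]
    anc_w = anchors[:, 3] - anchors[:, 1]
    anc_cy = anchors[:, 0] + 0.5 * anc_h
    anc_cx = anchors[:, 1] + 0.5 * anc_w

    gt_h = gt_boxes[:, 2] - gt_boxes[:, 0]
    gt_w = gt_boxes[:, 3] - gt_boxes[:, 1]
    gt_cy = gt_boxes[:, 0] + 0.5 * gt_h
    gt_cx = gt_boxes[:, 1] + 0.5 * gt_w

    # centre offsets scaled by the anchor size, log-ratios for height/width
    dy = (gt_cy - anc_cy) / anc_h
    dx = (gt_cx - anc_cx) / anc_w
    dh = np.log(gt_h / anc_h)
    dw = np.log(gt_w / anc_w)
    return np.stack([dy, dx, dh, dw], axis=1)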

Have a look at this StackOverflow answer to understand why we parameterize the coordinates: "Coordinate prediction parameterization in object detection networks".

And this is how our data frame for the valid anchors would look.

Final data frame for training RPN network

*Note: we will be feeding all 22500 anchors to the model, but will ignore all the "-1" labels while computing the classification loss (i.e. binary cross-entropy) and will only consider the foreground anchors (label 1) for regression (we won't regress the background class).

Here is how the custom log loss would look:

RPN classification loss; note how we have ignored class label "-1"
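A minimal sketch of such a loss (the notebook's exact function may differ; labels are assumed to be flattened to one value per anchor, with -1 entries masked out):

import tensorflow as tf
from tensorflow.keras import backend as K

def rpn_class_loss(y_true, y_pred):
    # y_true holds anchor labels in {1, 0, -1}; ignore the -1 entries
    valid = tf.where(K.not_equal(y_true, -1))
    true_valid = tf.gather_nd(y_true, valid)
    pred_valid = tf.gather_nd(y_pred, valid)
    return K.mean(K.binary_crossentropy(true_valid, pred_valid))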

And here is how the L1 regression loss would look:

RPN delta regression loss; note how we have only considered class label "1"
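And a sketch of the regression side, written as a standalone function rather than a compiled Keras loss (a smooth L1 is used here, which is a common choice; the notebook may use a plain L1):

def rpn_regression_loss(true_deltas, pred_deltas, anchor_labels):
    # only anchors labelled 1 (foreground) contribute to the regression loss
    fg = tf.where(K.equal(anchor_labels, 1))
    diff = tf.gather_nd(true_deltas, fg) - tf.gather_nd(pred_deltas, fg)
    abs_diff = K.abs(diff)
    # smooth L1: quadratic for small errors, linear for large ones
    loss = tf.where(abs_diff < 1.0, 0.5 * K.square(abs_diff), abs_diff - 0.5)
    return K.mean(loss)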

*NOTE: After the above step, we will feed the feature maps output by the backbone network, along with the labels and offsets (deltas), to the RPN network. For more details look at the "Custom loss function" section in the notebook.

Adjusting Anchor Offsets (Predicted Deltas) Predicted By Our Region Proposal Network

Since we trained our model on offsets, the model predicts offsets. These predicted offsets are used to adjust the anchor boxes, which become our final regions of interest. To do this we again need to perform a coordinate format conversion (because the adjustment needs the height, width, centre_x, centre_y format).

Getting anchor offsets (deltas) and the objectiveness score from the RPN Keras model
Coordinate format conversion and adjusting by the deltas (getting regions of interest)

* NOTE: After getting the ROIs (adjusted anchor boxes) we again need to convert the centre coordinate format back to x_min, y_min, x_max, y_max (Pascal VOC format).
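A minimal sketch of this adjustment, which is simply the inverse of the boxes_to_deltas parameterization defined earlier:

def deltas_to_boxes(anchors, deltas):
    anc_h = anchors[:, 2] - anchors[:, 0]
    anc_w = anchors[:, 3] - anchors[:, 1]
    anc_cy = anchors[:, 0] + 0.5 * anc_h
    anc_cx = anchors[:, 1] + 0.5 * anc_w

    # undo the parameterization: shift the centre, rescale height and width
    cy = deltas[:, 0] * anc_h + anc_cy
    cx = deltas[:, 1] * anc_w + anc_cx
    h = np.exp(deltas[:, 2]) * anc_h
    w = np.exp(deltas[:, 3]) * anc_w

    # back to corner format
    return np.stack([cy - 0.5 * h, cx - 0.5 * w, cy + 0.5 * h, cx + 0.5 * w], axis=1)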

Clipping, Filtering And Non-max Suppression(NMS)

Non-max Suppression

NMS is an algorithm that suppresses nearby boxes with a non-maximum probability/objectiveness score, based on an IOU threshold. In other words, we sort the ROIs (adjusted anchor boxes) by the objectiveness score output by the RPN model, compute the IOU between the first ROI and the rest of the ROIs (all other ROIs have a score less than the first ROI, hence the suppressed boxes are non-max), remove all the ROIs whose IOU with it is greater than the given threshold (0.7 usually), and repeat with the next remaining ROI.

NMS: before and after​

Image Source
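A minimal sketch of greedy NMS, reusing the iou function from the IOU section (a vectorized version would be faster, but this keeps the idea clear):

def nms(boxes, scores, iou_thresh=0.7):
    order = scores.argsort()[::-1]  # indices sorted by objectiveness score, highest first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        if rest.size == 0:
            break
        overlaps = np.array([iou(boxes[best], boxes[i]) for i in rest])
        order = rest[overlaps <= iou_thresh]  # drop boxes that overlap the kept box too much
    return keep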

After adjusting all the anchor boxes with the offsets predicted by the model, we further filter the boxes in the following ways:

  1. Clip the boxes to the max of 800 and min of 0.
  2. Remove all the boxes whose area/scale is less than 256, i.e. whose either side is less than 16.
  3. Sort the ROIs (adjusted anchor boxes) on the objectiveness score predicted by our RPN model.
  4. Select top 12000 ROIs. (this is just another way of filtering ROIs)
  5. Apply NMS to eliminate overlapping boxes.
  6. After NMS we select the top 2000 ROIs (further decreasing the number of region proposals)
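A minimal sketch of this filtering pipeline (rois and scores are assumed to be the adjusted boxes and objectiveness scores from the previous step):

n_pre_nms, n_post_nms, min_size = 12000, 2000, 16

rois = np.clip(rois, 0, 800)                           # 1. clip to the image boundary
h = rois[:, 2] - rois[:, 0]
w = rois[:, 3] - rois[:, 1]
keep = np.where((h >= min_size) & (w >= min_size))[0]  # 2. drop boxes smaller than 16 per side
rois, scores = rois[keep], scores[keep]

order = scores.argsort()[::-1][:n_pre_nms]             # 3-4. sort by score, keep top 12000
rois, scores = rois[order], scores[order]

keep = nms(rois, scores, iou_thresh=0.7)[:n_post_nms]  # 5-6. NMS, then keep top 2000
rois = rois[keep]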

Have a look at Clipping, Filtering And Non-max Suppression(NMS) section in the notebook for the implementation details.

Final Post Processing (Before Feeding It Into ROI Pooling)

Just like in the section Assigning Labels To Each Anchor Box, we repeat this procedure for our 2000 region proposals. We also repeat the Balancing Anchor Labels And Creating Mini Batch Of 256 Anchors section, but this time we select 128 samples rather than 256, and the ratio of positive to negative is 25%:75% (earlier it was 1:1). Also, we assign actual class labels to the ROIs (for instance, 1 for dog, 2 for cat and so on) rather than just 0 and 1 as in the earlier case, because these will be fed into the Faster RCNN classifier head to classify the actual object.

We randomly sample 32 positive (assigned) ROIs and sample the rest from the negative/background (label 0) ROIs as background samples. If either class doesn't have enough samples, we sample the remaining ones from the majority class (which would be the background class, class 0).
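A rough sketch of this sampling step (the 0.5 foreground IOU threshold and the roi_max_ious / roi_gt_labels arrays are assumptions here; the notebook's thresholds may differ):

n_sample, pos_ratio = 128, 0.25
n_pos = int(n_sample * pos_ratio)                 # 32 positive ROIs per image

# roi_max_ious: best IOU of each of the 2000 ROIs with any ground truth box
# roi_gt_labels: actual class id (1 = dog, 2 = cat, ...) of that ground truth box
fg_idx = np.where(roi_max_ious >= 0.5)[0]
bg_idx = np.where(roi_max_ious < 0.5)[0]

fg_idx = np.random.choice(fg_idx, size=min(n_pos, len(fg_idx)), replace=False)
bg_idx = np.random.choice(bg_idx, size=n_sample - len(fg_idx), replace=False)

keep = np.append(fg_idx, bg_idx)
sample_rois = rois[keep]
sample_labels = roi_gt_labels[keep].copy()
sample_labels[len(fg_idx):] = 0                   # background ROIs get class 0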

Please check the Final Post Processing section in the notebook for full code.

Visualization of positive region proposals

Final foreground region proposals

All the proposed regions contain an object (or some part of one).

Visualization of negative (background, 0 class) region proposals

Final background region proposals

That is it! ( *For now)

In the next part we will continue our journey through the Faster RCNN network (stage two of the two-stage object detector) by feeding these selected proposals to ROI pooling, which standardizes the shape of the proposals so that they can be fed into the Faster RCNN head. This head is another network with two sibling output layers (just like the RPN): one for offset regression (on the proposed ROIs) and one for object classification (in the RPN we had an objectiveness score, i.e. binary classification; here we classify the actual objects, i.e. the classes in the dataset).

Again, here is a link to the Google Colab notebook: regionProposalNetworkInKeras

Thank you for reading this article. I hope you found it useful.

References, Sources and Citations
