mAP (mean average precision) for Recommender systems and Object detection algorithms

Akash Kewar
10 min read · Nov 28, 2021
Bison covered in snow (credit Tom Murphy)

Before we start, I would like to point out this blog is my understanding of the other blogs and source code that I have mentioned under the “Sources, Citations and References” section. I have used images and paragraphs from those sources. Please take a moment to appreciate them.

The mAP is a metric used to measure the performance of information retrieval systems (such as recommendation systems) and object detection algorithms. We could argue that object detection is itself an information retrieval problem: we sort the bounding boxes by their confidence score across all the images and mark each one as a TP (true positive) or FP (false positive) based on an IOU threshold (e.g., 50% overlap). This lets us penalize the score when we are highly confident about FP detections (think of them as irrelevant recommendations placed early in the recommendation list); in this way we impose a ranked-ordering constraint on the items/ detections. This will make more sense as you read through this blog.

What is precision?

Precision is the proportion (or percentage) of correct positive predictions out of all positive predictions made.

Why only precision in mAP?

Well, there are multiple formulations of mAP, ranging from ones that use only precision to ones that use a precision-recall curve, and they yield roughly the same results. So mAP is not only about “precision”; some formulations include recall as well.

How is it helpful in evaluating recommendation systems?

Vanilla precision alone doesn’t help in evaluating the quality of recommendations. Here is the issue with using precision.

Let’s say I am asked to recommend N = 5 products (this means I predict “positive” for five products). Out of all the possible products, only m = 3 are actually relevant to the user, and my successes and failures in my ranked list are [0, 1, 1, 0, 0]. Then:

# of items we recommended = 5

# of our recommendations that are relevant = 2

# of all the possible relevant items = 3

precision = 2/5

So the precision is 2/5 when we recommend 5 products and only two items relevant to the user make it into the recommendation set, at the 2nd and 3rd positions. The issue is that no matter how the recommendations are arranged (e.g., [0, 0, 0, 1, 1]), the precision stays the same.
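To make this concrete, here is a minimal sketch (relevance encoded as 1 for relevant and 0 for not relevant; the function name is mine) showing that plain precision gives the same score for both arrangements:

```python
def precision(recommendations):
    """Plain precision: fraction of recommended items that are relevant."""
    return sum(recommendations) / len(recommendations)

print(precision([0, 1, 1, 0, 0]))  # 0.4
print(precision([0, 0, 0, 1, 1]))  # 0.4 -- same score even though the relevant items moved to the end
```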

But why is this an issue? Admit it, humans are lazy and don’t like unnecessary interactions with computers, and users don’t have all the time in the world to scroll through every recommendation. Hence the first constraint: the recommendation set should be as brief as possible. The second constraint is the position of the relevant items: because of that same laziness and lack of time, we want to show the items a user is most likely to buy at the top of the recommendation set. So the ideal recommendation set contains all the relevant items, ordered by buying likelihood (the item the user is most likely to buy first, the 2nd most likely item in the second position, and so on).

Precision@K/ Precision at K

To enforce the position and relevance constraints we use Precision@K.

Precision at K is the proportion of recommended items in the top-k set that are relevant.

Suppose that my precision at 10 in a top-10 recommendation problem is 80%. This means that 80% of the recommendations I make are relevant to the user.

So Precision@K is all about focusing on the rank ordering of relevant items.

How do I compute Precision@K?

Precision@K is computed by calculating the precision at every index up to K. So we end up with K precisions (and this is the reason we later take the average of those precisions, to get a single value, BAM!).

The precision at index K is computed by considering all the values up to and including the value at index K.

Imagine K = 3; then Precision@3 = [precision up to 1, precision up to 2, precision up to 3].
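As a rough sketch (again encoding relevance as a 1/0 list; the function name is mine), Precision@K can be computed like this:

```python
def precision_up_to(recommendations, k):
    """Precision over the first k recommendations (1 = relevant, 0 = not)."""
    return sum(recommendations[:k]) / k

recs = [0, 1, 1]
print([precision_up_to(recs, k) for k in range(1, 4)])  # [0.0, 0.5, 0.666...]
```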

Let’s imagine recommending N=3 products (AP@3) to a user who actually added a total of m=3 products. Here are some examples of outcomes for our algorithm:

If you notice, we have high precision at index K when all (or most) of the previous recommendations are relevant. Also, notice that when we front-load the relevant items, precision is high at those early indexes.

KEY TAKEAWAYS

  • Precision is computed globally (considering the whole set) and hence is a single value; Precision@K is computed at each index (locally) up to K, so we end up with K precisions.
  • Precision@K differs for different arrangements of the recommendations (the order of the recommendations matters).
  • Precision@K is higher when relevant items are front-loaded (front-loading the recommendations matters).
  • Precision@K at a given index increases when relevant items appear consecutively (in a chain), e.g. [0, 0, 1, 1, 1] has three consecutive relevant items.

Average Precision at N (AP@N)

As Precision@K returns a list of K values, it would be better if we could boil that list down to a single value to make interpretation and comparison easy. And what better way to achieve this than taking a simple average? That is exactly what average precision is: the average of the precisions at all the indexes up to K. Consider the table below:

Again, AP will reward you for relevant recommendations and for front-loading them (because AP depends on Precision@K, everything that is true for Precision@K is also true for AP).

“If we are asked to recommend N items, the number of relevant items in the full space of items is m, then:”

AP@N = (1/m) * Σ_{k=1..N} P(k) * rel(k)

“where rel(k) is just an indicator that says whether that kth item was relevant (rel(k)=1) or not (rel(k)=0). I’d like to point out that instead of recommending N items we could have recommended, say, 2N, but the AP@N metric says we only care about the average precision up to the Nth item.”
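Here is a minimal sketch of that formula, AP@N = (1/m) * Σ P(k) * rel(k), with relevance again encoded as a 1/0 list (the function name is mine, not from the quoted source):

```python
def average_precision_at_n(recommendations, m):
    """AP@N = (1/m) * sum over k of P(k) * rel(k).

    recommendations: 1/0 relevance flags of the N recommended items, in ranked order.
    m: total number of relevant items in the full item space (assumed > 0 here).
    """
    score = 0.0
    for k in range(1, len(recommendations) + 1):
        rel_k = recommendations[k - 1]          # rel(k)
        p_k = sum(recommendations[:k]) / k      # P(k), precision up to position k
        score += p_k * rel_k
    return score / m

print(average_precision_at_n([0, 1, 1], m=3))  # (0 + 1/2 + 2/3) / 3 ≈ 0.389
```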

What is the mean of average precisions (mAP)?

Whatever we have computed so far is just for one object/ user, whereas to get the overall performance of the system/ algorithm we need to compute the AP for every object/ user. Now we have |U| APs, so to get a single measure we again take the average.

Average precision should really be called Average Precision@K, because we take the average of the precisions computed up to K; and we compute the mean of Average Precision@K over every object/ user to get the overall performance of the system/ algorithm.

mAP@N = (1/|U|) * Σ_{u=1..|U|} AP@N(u)

Here |U| is the number of users, and we sum AP@N over every user and normalize by the number of users (i.e., take the mean).
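A small sketch of that mean over users, reusing the average_precision_at_n function from the sketch above (the per-user lists below are made up just to exercise it):

```python
def mean_average_precision(per_user_recommendations, per_user_m):
    """mAP@N = (1 / |U|) * sum over users of AP@N for that user."""
    aps = [average_precision_at_n(recs, m)
           for recs, m in zip(per_user_recommendations, per_user_m)]
    return sum(aps) / len(aps)

# Two users, each shown N = 3 recommendations:
print(mean_average_precision([[1, 1, 0], [0, 1, 1]], per_user_m=[2, 3]))  # ≈ 0.69
```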

Variations of AP@N formula

  • AP when you have more possible relevant items than the number of recommendations you are asked to provide. In such cases, to keep the score from being unfairly suppressed by a large denominator, we normalize by the minimum of m (the total number of relevant items) and N (the number of recommendations to provide).
  • Another formulation of AP you may encounter drops the rel(k) function from the equation but states the condition explicitly, so P(k) does not contribute to the sum if item k is irrelevant.
  • When there are no relevant items to recommend to the user, m is 0 and so should be the AP. To keep the math in sync, we just add a conditional statement that sets AP to 0 when m is 0 (see the sketch after this list).
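A hedged sketch combining the three variations above (normalizing by min(m, N) and returning 0 when m = 0); the exact handling varies between libraries, so treat this as one reasonable reading:

```python
def average_precision_at_n_variant(recommendations, m):
    """AP@N variant: normalize by min(m, N), return 0 when there are no relevant items."""
    if m == 0:
        return 0.0
    score = 0.0
    for k in range(1, len(recommendations) + 1):
        if recommendations[k - 1]:                 # only relevant items contribute
            score += sum(recommendations[:k]) / k  # P(k)
    return score / min(m, len(recommendations))
```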

What is recall?

Recall is the proportion (or percentage) of all the possible relevant items that we actually recommended.

AP@N formulation using Precision and Recall

There are other formulations of Average Precision@N: one uses the “change in recall”, and another uses the precision X recall curve.

AP@N using change in recall

The idea here is to compute the recall at every step (item) up to K, just like Precision@K: consider the first recommendation to get the recall at position 1, the first two recommendations to get the recall at position 2, the first three to get the recall at position 3, and so on. To get the change in recall at the Kth location, we simply subtract the (K-1)th recall from the Kth recall.

AP@N = Σ_{k=1..N} P(k) * Δr(k)

Here Δr(k) is the change in recall at the kth location.

Notice how this formulation eliminates the 1/m normalization term that the other formulation uses: the change in recall is exactly 1/m when the recommendation is relevant and 0 otherwise. It also eliminates the conditional statement (or rel(k) indicator) of the other formulations, because the change in recall is 0 for an incorrect recommendation, which makes that term’s contribution to the sum 0.
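A quick sketch of the change-in-recall formulation, AP@N = Σ P(k) * Δr(k) (assuming m > 0, function name mine); note it gives the same value as the earlier formulation:

```python
def average_precision_delta_recall(recommendations, m):
    """AP@N = sum over k of P(k) * delta_r(k), where delta_r(k) is the change
    in recall between the top-(k-1) and top-k recommendation lists."""
    score = 0.0
    hits = 0
    for k, rel in enumerate(recommendations, start=1):
        prev_recall = hits / m
        hits += rel
        recall = hits / m            # recall up to position k
        precision_k = hits / k       # precision up to position k
        score += precision_k * (recall - prev_recall)
    return score

print(average_precision_delta_recall([0, 1, 1], m=3))  # ≈ 0.389, same as before
```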

AP@N using precision X recall curve

This technique uses the estimated area under the precision X recall curve to compute AP. It is commonly used to measure (and compare) the performance of object detection algorithms, where the number of relevant items (m) is the number of ground-truth boxes for a class (e.g. bird).

AP is the precision averaged across all recall values between 0 and 1.

If at the next index we get a correct recommendation, both precision and recall should increase, whereas if we get that recommendation wrong, precision will decrease while recall stays unchanged.

Before we jump into the precision X recall curve, let's look at how to compute precision and recall for object detection algorithms.

An illustrated example

There are 7 images with 15 ground-truth objects represented by the green bounding boxes and 24 detected objects represented by the red bounding boxes. Each detected object has a confidence level and is identified by a letter (A, B,…, Y).

The following table shows the bounding boxes with their corresponding confidences. The last column identifies each detection as a TP or an FP. In this example, a detection is considered a TP if its IOU with a ground truth is ≥ 30%; otherwise it is an FP. By looking at the images above we can roughly tell whether the detections are TPs or FPs.

In some images, there is more than one detection overlapping a ground truth (Images 2, 3, 4, 5, 6, and 7). For those cases, the first detection is considered TP while the others are FP. This rule is applied by the PASCAL VOC 2012 metric: “e.g. 5 detections (TP) of a single object is counted as 1 correct detection and 4 false detections”.
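For reference, a minimal IOU sketch assuming boxes are given as (x1, y1, x2, y2) corner coordinates (real implementations often add +1 for inclusive pixel coordinates):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.14 -> FP at a 30% threshold
```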

The Precision x Recall curve is plotted by calculating the precision and recall values of the accumulated TP and FP detections. To do this, we first order the detections by their confidences, then calculate the precision and recall for each accumulated detection, as shown in the table below. (Note that for the recall computation the denominator, “Acc TP + Acc FN” or “all ground truths”, is constant at 15, since the number of GT boxes does not depend on the detections.)

Example computation for the 2nd row (Image 7): Precision = TP/(TP+FP) = 1/2 = 0.5 and Recall = TP/(TP+FN) = 1/15 = 0.066
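The accumulation can be sketched like this (detections already sorted by descending confidence and flagged 1 for TP, 0 for FP; the two-element example mirrors the first two rows of the table):

```python
def precision_recall_points(tp_flags, total_ground_truths):
    """Turn confidence-sorted TP/FP flags into the points of the Precision x Recall curve."""
    precisions, recalls = [], []
    acc_tp = 0
    for i, is_tp in enumerate(tp_flags, start=1):
        acc_tp += is_tp
        precisions.append(acc_tp / i)                  # Acc TP / (Acc TP + Acc FP)
        recalls.append(acc_tp / total_ground_truths)   # Acc TP / all ground truths
    return precisions, recalls

p, r = precision_recall_points([1, 0], total_ground_truths=15)
print(p, r)  # [1.0, 0.5] [0.066..., 0.066...]
```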

Plotting the precision and recall values we have the following Precision x Recall curve:

As you can see, the plot is jagged, which makes it hard to compare two algorithms. We would like to make the curve monotonic, smoothing the plot so that the area under the curve is easy to estimate. There are mainly two techniques used to smooth the plot and make algorithms comparable: 11-point interpolation and N-point (every-point) interpolation.

11 point interpolated average precision

11-point interpolated average precision divides the recall axis into 11 equally spaced levels (0, 0.1, …, 1.0), and the interpolated precision at each level is obtained by taking the maximum precision whose recall value is greater than or equal to that level:

11 point interpolation average precision in python:
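The original gist is not reproduced here, but a minimal sketch of the idea looks like this (the inputs are the precision and recall points of the curve; the example values below are made up just to exercise the function):

```python
def eleven_point_ap(precisions, recalls):
    """11-point interpolated AP: average the interpolated precision at the recall
    levels 0.0, 0.1, ..., 1.0, where the interpolated precision at a level is the
    maximum precision observed at any recall >= that level (0 if none exists)."""
    ap = 0.0
    for level in (i / 10 for i in range(11)):
        candidates = [p for p, r in zip(precisions, recalls) if r >= level]
        ap += max(candidates) if candidates else 0.0
    return ap / 11

print(eleven_point_ap([1.0, 0.5, 0.66], [0.07, 0.07, 0.13]))  # hypothetical curve points
```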

​N point interpolated average precision

In this technique, we use all the unique recall values in the set rather than only 11 evenly spaced recall levels. The interpolated precision at a given recall value is the maximum precision observed at that recall or any higher recall (in practice, a running maximum taken while scanning the precisions from the last index backwards). The final AP for a class is the sum of the areas of the rectangles formed at each unique recall value.

N point interpolation average precision in python:
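Again, in place of the missing gist, a minimal sketch of the every-point version under the same assumptions (curve points sorted by increasing recall):

```python
def every_point_ap(precisions, recalls):
    """Every-point interpolated AP: make the precision curve monotonically
    decreasing with a right-to-left running maximum, then sum the areas of the
    rectangles between consecutive recall values."""
    interpolated = list(precisions)
    for i in range(len(interpolated) - 2, -1, -1):
        interpolated[i] = max(interpolated[i], interpolated[i + 1])
    ap, prev_recall = 0.0, 0.0
    for recall, precision in zip(recalls, interpolated):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

print(every_point_ap([1.0, 0.5, 0.66], [0.07, 0.07, 0.13]))  # same hypothetical points
```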

And again, to get a single global mAP we just take the mean of the AP across all the object classes.

Hopefully, now you should have a better understanding of mean Average Precision and how it can be used in the context of recommender systems and object detectors.

Thank you!

Sources, Citations, and References

Mean Average Precision (MAP) For Recommender Systems

Recall and Precision at k for Recommender Systems

Object-Detection-Metrics

C23 | Precision Recall Curves for Object Detection | Machine Learning | EvODN
