total # positive classes <<< total # negative classes
example: identifying fraudulent claims
There may not be many fraudulent claims, so the classifier will tend to classify fraudulent claims as genuine.
- Model 1: classified 7/10 fraudulent transactions as genuine. 10/10,000 genuine transactions as fraudulent = 17 "mistakes"
- Model 2: classified 2/10 fraudulent transactions as genuine. 100/10,000 genuine transactions as fraudulent = 102 "mistakes"
Since we want to minimize the number of fraudulent transactions classified as genuine, Model 2 actually performs better even though it made more "mistakes" overall. Therefore, it is better to base performance not on raw mistake counts, but on the true positive (TP) rate, true negative (TN) rate, FP rate, and FN rate.
| Formula | Performance |
|---|---|
| TP Rate = TP / (TP + FN) | Close to 1 = good |
| TN Rate = TN / (TN + FP) | Close to 1 = good |
| FP Rate = FP / (FP + TN) | Close to 0 = good |
| FN Rate = FN / (FN + TP) | Close to 0 = good |
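As a quick check, here is a minimal plain-Python sketch that computes these four rates for the two hypothetical fraud models above (the function name `rates` is illustrative, not from any library):

```python
def rates(tp, fn, fp, tn):
    """Compute the four rates from confusion-matrix counts."""
    return {
        "TPR": tp / (tp + fn),  # true positive rate: close to 1 = good
        "TNR": tn / (tn + fp),  # true negative rate: close to 1 = good
        "FPR": fp / (fp + tn),  # false positive rate: close to 0 = good
        "FNR": fn / (fn + tp),  # false negative rate: close to 0 = good
    }

# Model 1: missed 7/10 frauds (FN=7, TP=3), flagged 10 genuine (FP=10, TN=9990)
model1 = rates(tp=3, fn=7, fp=10, tn=9990)
# Model 2: missed 2/10 frauds (FN=2, TP=8), flagged 100 genuine (FP=100, TN=9900)
model2 = rates(tp=8, fn=2, fp=100, tn=9900)

print(model1["TPR"], model2["TPR"])  # 0.3 vs 0.8: Model 2 catches far more fraud
```

Despite its higher raw mistake count, Model 2's TPR (0.8 vs 0.3) shows it is the better fraud detector.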
Cost Function Based Approach
- Treat one false negative as worse than one false positive (weigh false negatives more).
- i.e. a claim classified as genuine that was actually fraudulent (a false negative) is weighted with a larger cost, while a claim classified as fraudulent that was actually genuine (a false positive) is less bad and therefore has a lower cost.
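A minimal sketch of scoring the two fraud models under asymmetric costs. The weights (`COST_FN = 50`, `COST_FP = 1`) are hypothetical, chosen only to illustrate how cost weighting can flip the ranking that raw mistake counts would give:

```python
# Hypothetical asymmetric costs: one missed fraud (FN) costs 50x a false alarm (FP)
COST_FN = 50
COST_FP = 1

def total_cost(fn, fp):
    """Total misclassification cost under the weights above."""
    return COST_FN * fn + COST_FP * fp

# Counts from the fraud example: Model 1 (FN=7, FP=10), Model 2 (FN=2, FP=100)
cost1 = total_cost(fn=7, fp=10)    # 50*7 + 10  = 360
cost2 = total_cost(fn=2, fp=100)   # 50*2 + 100 = 200
# Model 2 is cheaper despite making more raw "mistakes" (102 vs 17)
```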
Sampling Based Approach
- oversampling: adding more instances of the minority class; might have to deal with overfitting to the minority class
- undersampling: removing instances of the majority class; may risk removing representative instances of the majority class
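A minimal sketch of both strategies using random duplication and random removal (the simplest variants; function names are illustrative, not from a library):

```python
import random

def oversample(minority, target_size, seed=0):
    """Random oversampling: duplicate minority examples until target_size.
    Risk: the model may overfit to the repeated minority examples."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra

def undersample(majority, target_size, seed=0):
    """Random undersampling: keep a random subset of the majority class.
    Risk: representative majority examples may be thrown away."""
    rng = random.Random(seed)
    return rng.sample(majority, target_size)

fraud = ["f1", "f2", "f3"]                # minority class
genuine = [f"g{i}" for i in range(100)]   # majority class
balanced = oversample(fraud, 100) + undersample(genuine, 100)
```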
Downsampling
- Reduces the # of pixels in the image, i.e. shrinking the image. Then, when you want to make the image the same size as it was previously, you will need to increase the pixel size.
- Example: reducing a 512x512 image to 256x256 = factor-of-2 downsampling in the horizontal and vertical directions

Upsampling
- Increases the # of pixels in the image, i.e. enlarging the image. The added pixels are estimated from surrounding samples.
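A minimal sketch of factor-of-2 downsampling and upsampling on a nested-list "image". For upsampling it uses nearest-neighbor replication, the simplest scheme; a real resizer would estimate the added pixels from surrounding samples via interpolation, as described above:

```python
def downsample2x(img):
    """Factor-of-2 downsampling: keep every other pixel in each direction."""
    return [row[::2] for row in img[::2]]

def upsample2x(img):
    """Factor-of-2 nearest-neighbor upsampling: replicate each pixel as a
    2x2 block (interpolation would give smoother estimates)."""
    out = []
    for row in img:
        wide = [p for p in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

img = [[1, 2], [3, 4]]        # a tiny 2x2 "image"
small = downsample2x(img)     # 1x1: [[1]]
big = upsample2x(img)         # 4x4: each original pixel replicated
```

The same logic applied to a 512x512 image yields the 256x256 example above.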
- Used for recognizing objects at vastly different scales; scale-invariant because the object's scale change is offset by shifting its level in the pyramid.
- Feature maps close to the image layer are composed of low-level structures, which are not effective for accurate object detection.
Feature Pyramid Network (FPN) is composed of a bottom-up and a top-down pathway.
- The bottom-up pathway is useful for feature extraction (spatial resolution decreases as you go up to the top layers of the pyramid and view a smaller version of the object, i.e. the semantic value increases).
- FPN uses the top-down pathway to construct higher resolution layers from a semantically rich layer.
- The bottom-up pathway uses ResNet.
- Because a CNN has shared weights, it is not able to estimate the absolute position of an object in an image; anchor boxes make this possible, so the CNN only needs to predict the relative transformation for each anchor box (the anchor box is the bounding box reference).
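One common way to decode such a relative transformation is the Faster R-CNN-style box parameterization: shift the anchor center by a fraction of the anchor size, and scale width/height in log space. A sketch under that assumption (not RetinaNet's exact code):

```python
import math

def decode_anchor(anchor, deltas):
    """Decode predicted relative deltas (dx, dy, dw, dh) against an anchor
    box given as (cx, cy, w, h), Faster R-CNN-style."""
    cx, cy, w, h = anchor
    dx, dy, dw, dh = deltas
    return (cx + dx * w,        # center shift is relative to anchor width
            cy + dy * h,        # center shift is relative to anchor height
            w * math.exp(dw),   # width scaled in log space
            h * math.exp(dh))   # height scaled in log space

# Zero deltas reproduce the anchor itself; the network only learns offsets
box = decode_anchor(anchor=(100, 100, 32, 32), deltas=(0.5, 0.0, 0.0, 0.0))
```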
- RetinaNet can match the speed of one-stage detectors and surpass the accuracy of two-stage detectors.
- One-stage detectors have typically had worse accuracy than two-stage detectors. Why? -> the class imbalance problem.
- RetinaNet addresses the problem one-stage detectors have with class imbalance between foreground and background of the image during training of dense detectors. How? -> by reshaping the standard cross entropy loss, i.e. it down-weights the loss assigned to well-classified examples. (We want to minimize loss, and now well-classified examples don't contribute as much to the loss.)
- The loss will focus training on a sparse set of hard examples and prevent the large number of easy negatives from overwhelming the detector. This loss is called Focal Loss.
- Uses a dense sampling of object locations in an input image, an in-network feature pyramid, and anchor boxes.
- C_i denotes a convolution layer, for example, conv5 = 256 3x3 filters at stride 1, pad 1.
- In the top-down pathway, apply a 1x1 convolution filter.
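A 1x1 convolution is just a per-pixel linear projection across channels (no spatial mixing), which is how FPN's lateral connections unify channel depth across pyramid levels. A minimal pure-Python sketch (nested lists stand in for tensors):

```python
def conv1x1(fmap, weights):
    """Apply a 1x1 convolution: at every spatial position, linearly project
    the C_in input channels to C_out output channels.
    fmap: H x W x C_in nested lists; weights: C_out x C_in."""
    return [[[sum(w_c * px[c] for c, w_c in enumerate(w_out))
              for w_out in weights]
             for px in row]
            for row in fmap]

# 2x2 feature map with 3 channels, projected down to 1 channel
fmap = [[[1, 2, 3], [4, 5, 6]],
        [[7, 8, 9], [1, 1, 1]]]
weights = [[1, 0, -1]]          # C_out = 1, C_in = 3
out = conv1x1(fmap, weights)    # each pixel -> [ch0 - ch2]
```

Note that every output pixel depends only on its own input pixel's channels, which is why a 1x1 conv can change channel depth without touching spatial structure.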
- Well-classified examples: p_t > 0.5. The scaling factor (1 - p_t)^gamma decays to 0 as confidence in the correct class increases (loss is low for well-classified examples).
- gamma = 5, p_t = 0.1 (badly classified): -(1 - 0.1)^5 * log(0.1) = 1.36 loss
- gamma = 5, p_t = 0.9 (well classified): -(1 - 0.9)^5 * log(0.9) = 1.05E-6 loss ~ 0 loss
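The two worked values above can be reproduced with a few lines of Python (per-example loss only; a full detector would sum this over all anchors):

```python
import math

def focal_loss(p_t, gamma):
    """Focal loss for one example: -(1 - p_t)^gamma * log(p_t),
    where p_t is the model's probability for the true class."""
    return -((1 - p_t) ** gamma) * math.log(p_t)

def cross_entropy(p_t):
    """Standard cross entropy = focal loss with gamma = 0."""
    return -math.log(p_t)

# gamma = 5: a badly classified example keeps a large loss...
print(focal_loss(0.1, gamma=5))   # ~1.36
# ...while a well-classified example is down-weighted almost to zero
print(focal_loss(0.9, gamma=5))   # ~1.05e-6, vs ~0.105 for plain cross entropy
```

This is exactly the down-weighting described above: easy negatives (high p_t) contribute almost nothing, so the hard examples dominate training.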
- RetinaNet outperforms Faster R-CNN, a two-stage detector
- SSD does not select bottom layers of the pyramid for object detection, since their semantic value is not high enough to justify the significant reduction in speed their use would cause. (SSD uses upper layers for detection, so it performs worse on small objects.)
One-stage detectors
- Must process a much larger set of candidate object locations regularly sampled across an image (the background part of the image still dominates even when using a sampling heuristic)
- Examples: RetinaNet, YOLO, SSD
Two-stage detectors
- Stage 1: class imbalance is addressed through the proposal stage (Selective Search, Edge Boxes, DeepMask, RPN) to narrow down the # of candidate object locations, filtering out most background samples
- Stage 2: sampling heuristics, such as a fixed foreground-to-background ratio, are performed to maintain a balance between foreground and background
- Examples: Faster R-CNN, Mask R-CNN










