Training heuristics have been shown to significantly improve the accuracy of image classification models. However, object detection models present a more complex challenge due to their intricate neural network structures and diverse optimization targets. Training strategies and pipelines can vary dramatically between different object detection models. This work explores a collection of training tweaks, or "freebies," that can be applied to various models, including Faster R-CNN and YOLOv3, without altering the model architectures themselves. Consequently, these techniques do not affect inference costs. Empirical results indicate that these freebies can improve absolute precision by up to 5% compared to state-of-the-art baselines.
The field of object detection has seen numerous recent advances. Surveys of object detection in the deep learning era cover progress such as real-time object detectors, as well as work that borrows ideas from recurrent neural networks (RNNs) and generative adversarial networks (GANs). New datasets are also being introduced to complement existing ones such as COCO, both to test the generalization ability of object detectors and to evaluate the sources of their errors; for instance, two datasets complementary to COCO have been introduced for object detection. In a different direction, a unified neural network has been proposed for object detection, multiple object tracking, and vehicle re-identification. It unifies the detector and the re-identification model into an end-to-end network by adding a track branch to a Faster R-CNN architecture; reusing the Region of Interest (RoI) feature vector from the Faster R-CNN baseline reduces computation.
One of the key training freebies explored is data augmentation. Data augmentation is especially important for the Single Shot MultiBox Detector (SSD), which relies on it to detect objects at different scales. It has a much smaller effect on multi-stage pipelines such as Faster R-CNN: the authors argue that RoI pooling on feature maps already samples spatial regions, substituting for extensive augmentation such as random cropping, which is a core component of single-stage frameworks. Because single-stage detectors lack such spatial sampling operations, extensive spatial augmentation (random cropping, random expansion, and the like) is essential to their performance, as the SSD results confirm.
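To make the role of spatial augmentation concrete, the following is a minimal sketch of an SSD-style random crop that keeps only boxes whose centers survive the crop. The function name, the center-based keep rule, and the `min_scale` parameter are illustrative choices, not the exact procedure used in any of the cited works.

```python
import numpy as np

def random_crop(image, boxes, labels, min_scale=0.5, rng=None):
    """Randomly crop an image and keep only the boxes whose centers
    fall inside the crop (a common SSD-style heuristic).
    image: HxWxC array; boxes: Nx4 array of [x1, y1, x2, y2]."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[:2]
    new_w = int(w * rng.uniform(min_scale, 1.0))
    new_h = int(h * rng.uniform(min_scale, 1.0))
    x0 = rng.integers(0, w - new_w + 1)
    y0 = rng.integers(0, h - new_h + 1)
    crop = image[y0:y0 + new_h, x0:x0 + new_w]

    # keep boxes whose center lies inside the crop window
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    keep = (cx >= x0) & (cx < x0 + new_w) & (cy >= y0) & (cy < y0 + new_h)

    # shift surviving boxes into crop coordinates and clip to its bounds
    kept = boxes[keep] - np.array([x0, y0, x0, y0], dtype=boxes.dtype)
    kept[:, 0::2] = kept[:, 0::2].clip(0, new_w)
    kept[:, 1::2] = kept[:, 1::2].clip(0, new_h)
    return crop, kept, labels[keep]
```

Because the crop rescales the effective object size relative to the input resolution, repeatedly applying it exposes a single-stage detector to the scale diversity it cannot obtain from RoI sampling.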
Mixup is a prominent training technique examined. It is a data augmentation method that generates new training data by combining pairs of existing examples and their labels. While initially successful in image classification, its application to object detection is less straightforward. This work explores a mixup variant that preserves visual coherence for object detection. The original mixup literature draws the mixing ratio from a Beta(α=0.2, β=0.2) distribution; such small parameters push the ratio toward 0 or 1, so the blended image behaves mostly like a noise perturbation of one of its sources. Guided by heuristic experiments, more attention is instead paid to object representations that co-occur naturally in detection scenes. A semi-adversarial object patch transplantation method is also mentioned, which is not a traditional adversarial attack.
The effectiveness of mixup in object detection is a key finding. It is surprising that the mixup technique is useful in the object detection setting, and it merits further investigation. Empirical results show that mixing ratios near half-half (0.5:0.5) give the largest performance boost, and that randomly sampling the ratio from a beta distribution concentrated around 0.5 is slightly better than a fixed 0.5:0.5 even mixup. Furthermore, an object detector trained with mixup is more robust against "alien objects," as demonstrated in an "elephant in the room" test. This enhanced robustness to image alterations and an ability to decontextualize detections result in improved generalization power. Some research proposes leveraging the inherent region mapping structure of anchors to introduce a mixup-driven training regularization for region proposal-based object detectors.
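The detection variant of mixup described above can be sketched as follows: two images are blended on a shared canvas without rescaling (so box geometry is preserved), the box and label sets are concatenated, and each object carries its source image's mixing weight into the loss. The function name, the Beta(1.5, 1.5) default, and the per-object weight vector are illustrative assumptions, not an exact reproduction of the cited implementation.

```python
import numpy as np

def detection_mixup(img1, boxes1, labels1, img2, boxes2, labels2,
                    alpha=1.5, rng=None):
    """Geometry-preserving mixup for object detection: pad both images
    onto a shared canvas, blend pixels with a Beta-sampled ratio, and
    take the union of the two box/label sets."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)  # concentrated near 0.5, unlike Beta(0.2, 0.2)
    h = max(img1.shape[0], img2.shape[0])
    w = max(img1.shape[1], img2.shape[1])
    mixed = np.zeros((h, w, 3), dtype=np.float32)
    mixed[:img1.shape[0], :img1.shape[1]] += lam * img1
    mixed[:img2.shape[0], :img2.shape[1]] += (1.0 - lam) * img2
    boxes = np.concatenate([boxes1, boxes2], axis=0)
    labels = np.concatenate([labels1, labels2], axis=0)
    # per-object loss weights: each box keeps the weight of its source image
    weights = np.concatenate([np.full(len(boxes1), lam),
                              np.full(len(boxes2), 1.0 - lam)])
    return mixed, boxes, labels, weights
```

Because neither image is rescaled, every ground-truth box remains valid on the mixed canvas, which is what allows mixup to be applied without disturbing the localization targets.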
Beyond mixup, other training tweaks are discussed. A cosine learning rate schedule and class label smoothing are also very useful for boosting performance. A novel conceptual framework called Supervision Interpolation is introduced, offering a fresh perspective on interpolation-based augmentations by relaxing and generalizing Mixup. This framework leads to a simple yet versatile and effective regularization that enhances the performance and robustness of object detectors. For example, LossMix is proposed as a novel data augmentation approach that simplifies and generalizes Mixup for object detection and beyond, setting a new state of the art for cross-domain object detection.
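The two simplest tweaks mentioned above are easy to state precisely. Below is a minimal sketch of a cosine learning-rate schedule (with an optional linear warmup, an assumption commonly paired with it) and of class label smoothing; the function names and the `eps=0.1` default are illustrative, not values prescribed by the source.

```python
import math

def cosine_lr(step, total_steps, base_lr=0.01, warmup_steps=0):
    """Cosine learning-rate schedule: decay from base_lr to 0 along a
    half cosine, optionally preceded by a linear warmup phase."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))

def smooth_labels(one_hot, eps=0.1):
    """Class label smoothing: move eps of the probability mass from the
    one-hot target to a uniform distribution over all K classes."""
    k = len(one_hot)
    return [(1.0 - eps) * p + eps / k for p in one_hot]
```

The cosine schedule avoids the abrupt drops of step schedules, and smoothed targets discourage the classifier from producing over-confident logits; both change only the training procedure, so inference cost is untouched.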
In the context of long-tailed visual recognition, a comprehensive collection of existing tricks has been gathered, and extensive systematic experiments have been performed to provide a detailed experimental guideline and obtain an effective combination of these tricks. This is particularly relevant for object detection, where class imbalance is a common challenge.
For semi-supervised learning in object detection, a simple yet effective framework called STAC has been proposed. STAC deploys highly confident pseudo labels of localized objects from an unlabeled image and updates the model by enforcing consistency via strong augmentations. This approach demonstrates the importance of leveraging unlabeled data to improve model performance.
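The STAC loop described above can be sketched in a few lines. All interfaces here (`teacher`, `student`, `strong_augment`, `detection_loss`, and the 0.9 confidence threshold) are hypothetical stand-ins used to show the control flow, not the actual STAC API.

```python
def stac_step(teacher, student, unlabeled_image, strong_augment,
              detection_loss, tau=0.9):
    """One STAC-style semi-supervised step (hypothetical interfaces):
    1. the teacher predicts on the unlabeled image,
    2. only high-confidence detections become pseudo labels,
    3. the student is trained on a strongly augmented view against them."""
    detections = teacher(unlabeled_image)  # list of (box, label, score)
    pseudo = [(box, label) for box, label, score in detections if score >= tau]
    if not pseudo:
        return 0.0  # nothing confident enough; skip this image

    # strong augmentation transforms the image and its pseudo boxes together
    aug_image, aug_boxes = strong_augment(unlabeled_image,
                                          [box for box, _ in pseudo])
    targets = list(zip(aug_boxes, [label for _, label in pseudo]))
    return detection_loss(student(aug_image), targets)
```

The confidence threshold is what makes the pseudo labels trustworthy enough to train on, while the strong augmentation supplies the consistency signal: the student must reproduce the teacher's detections under a much harder view of the same image.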
In summary, a bag of freebies for training object detection neural networks includes a variety of techniques that can be applied without changing the model architecture. These include specific data augmentation strategies tailored for different model types (e.g., extensive augmentation for single-stage detectors like SSD), mixup with a focus on half-half ratios and random beta sampling for improved robustness and generalization, cosine learning rate schedules, class label smoothing, and supervision interpolation frameworks. These tweaks collectively contribute to significant improvements in model precision, with reported gains of up to 5% absolute precision over state-of-the-art baselines. The field continues to evolve with new methods like LossMix and STAC, further enhancing the capabilities of object detection models.
