ARTICLE

Object detection, in general, identifies objects as axis-aligned boxes in an image. The most successful object detectors enumerate a nearly exhaustive list of potential object locations and classify each one. Successful as these detectors are, this approach is wasteful, inefficient, and requires additional post-processing.

In the article "Objects as Points," written by Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl, a different approach is taken: an object is modeled as a single point, the center point of its bounding box. Their detector uses keypoint estimation to find the center points and then regresses to all other object properties (e.g., size, 3D location, orientation, pose) from features at that point. Their center-point-based approach, named CenterNet, is end-to-end differentiable. CenterNet is also simpler, faster, more efficient, and more accurate than corresponding bounding-box-based detectors.
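To make this concrete, the sketch below is a rough PyTorch-style approximation of the idea, not the authors' code: the channel sizes, class names, and the backbone it would sit on are our own assumptions. It shows the kind of heads CenterNet attaches to a backbone feature map, namely a per-class center-point heatmap plus size and offset regressions, with heatmap peaks kept by a simple max-pooling comparison rather than IoU-based suppression.

```python
import torch
import torch.nn as nn

class CenterNetHeads(nn.Module):
    """Illustrative CenterNet-style prediction heads (an assumption-laden
    sketch, not the reference implementation)."""

    def __init__(self, in_channels=64, num_classes=80):
        super().__init__()

        def head(out_channels):
            return nn.Sequential(
                nn.Conv2d(in_channels, 256, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(256, out_channels, 1),
            )

        self.heatmap = head(num_classes)  # per-class center-point heatmap
        self.wh = head(2)                 # box width and height regression
        self.offset = head(2)             # sub-pixel center offset regression

    def forward(self, features):
        return {
            "heatmap": torch.sigmoid(self.heatmap(features)),
            "wh": self.wh(features),
            "offset": self.offset(features),
        }

def decode_centers(heatmap, k=100):
    """Keep only local maxima of the heatmap (a 3x3 max-pooling comparison
    stands in for NMS) and return the top-k candidate centers with scores."""
    peaks = nn.functional.max_pool2d(heatmap, 3, stride=1, padding=1)
    heatmap = heatmap * (peaks == heatmap).float()
    scores, indices = torch.topk(heatmap.flatten(1), k)
    return scores, indices
```

At each surviving peak, a box can be recovered from the width/height and offset predictions at that location, which is why no IoU-based post-processing is needed.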

CenterNet achieves the best speed-accuracy trade-off on the MS COCO dataset, with 28.1% AP at 142 FPS, 37.4% AP at 52 FPS, and 45.1% AP with multi-scale testing at 1.4 FPS, according to the authors. They use the same approach to estimate 3D bounding boxes on the KITTI benchmark and human pose on the COCO keypoint dataset. Not only does their method perform competitively with sophisticated multi-stage methods, but it also runs in real time.

In summary, Zhou, Wang, and Krähenbühl provide a simpler and more efficient alternative for object detection tasks. Two-stage detectors recompute image features for each potential box and then classify those features. Post-processing, specifically non-maximum suppression, removes duplicate detections of the same instance by computing bounding-box IoU; this post-processing is hard to differentiate and train, which is why most conventional detectors are not end-to-end trainable.
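For contrast, the following is a minimal sketch of the IoU computation and greedy non-maximum suppression that conventional detectors rely on. It is a generic illustration we wrote ourselves, not code from any of the cited papers.

```python
import numpy as np

def iou(box, boxes):
    """Intersection-over-union between one box and an array of boxes,
    each given as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop boxes that overlap it above the threshold, and repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep
```

Because the keep/drop decisions are discrete, this step does not provide useful gradients, which is the training obstacle CenterNet sidesteps by reading detections directly off heatmap peaks.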

Other papers that are relevant include "Object Detection with Deep Learning: A Review" written by Zhong-Qiu Zhao, Peng Zheng, Shou-tao Xu, and Xindong Wu; "3D Bounding Box Estimation Using Deep Learning and Geometry" written by Arsalan Mousavian, Dragomir Anguelov, John Flynn, and Jana Kosecka; and "Weighted Unsupervised Learning for 3D Object Detection" written by Kamran Kowsari and Manal H. Alassaf.

All three of these papers discuss object detection at different levels, as well as different approaches, techniques, cameras, and image types. They touch on topics such as weighted clustering, K-weights, and neural networks. Collectively, these papers provide a deeper understanding of the main paper we have chosen to examine.

We evaluated and reproduced Zhou, Wang, and Krähenbühl's experiments by testing their models on our system and using CenterNet to identify objects, including their 3D bounding boxes, in images. We worked in a native Linux environment rather than our preliminary plan of using PyCharm CE or IDLE on Windows, and we used our own images as test data after first reproducing the results the authors report on the datasets used in their examples.