Previous Iterations:
Wav2Vec2[1] (+ Language Modeling)
HuBERT[2] (+ Language Modeling)
Limitations:
High word error rate (WER).
Sensitive to background noise.
Poor handling of accents.
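The WER metric used above is the word-level edit distance between a reference transcript and a hypothesis, divided by the reference length. A minimal sketch (illustrative only, not the evaluation code used in this work):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])  # substitution
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # del, ins
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("turn the light on", "turn light on"))  # 1 deletion / 4 words = 0.25
```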
Current Iteration:
Whisper[3]
Limitations:
Sentence-level transcription.
Registration failure (rare).
Model - YOLOv7[4]
Roboflow - Data annotation and Data augmentation[5]
Accuracy: 96%
Object detection from top view
Small object detection
Fast and accurate
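Detectors in the YOLO family produce many overlapping candidate boxes per object; non-maximum suppression (NMS) keeps the highest-scoring box and drops near-duplicates. A minimal sketch of the standard algorithm (not the YOLOv7 implementation itself):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedily keep the best-scoring boxes, dropping overlaps above `thresh`."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep

# Two near-duplicate boxes and one distant box: the duplicate is suppressed.
print(nms([(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)],
          [0.9, 0.8, 0.7]))  # [0, 2]
```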
We observed the user’s interaction in the following situations:
When objects are in close proximity
When objects are moving fast
When one object is on top of another
When objects are held in the hand
When detection is sensitive to object color
Collected data during the users' interactions
Applied data augmentation policies
Geometric transformations
Color transformations - to avoid biasing the model toward specific colors
Mosaic - to address the small-object detection problem
Cutout - to address the occlusion problem
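The Cutout policy above masks a random patch of the image so the detector learns to recognize partially occluded objects. A minimal sketch on a nested-list image (the training pipeline used Roboflow's implementation, not this code):

```python
import random

def cutout(image, size, seed=None):
    """Zero out a random size x size patch of `image` (a list of pixel rows),
    simulating occlusion. Returns a new image; the input is left unchanged."""
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    top = rng.randrange(h - size + 1)
    left = rng.randrange(w - size + 1)
    out = [row[:] for row in image]  # copy so the original stays intact
    for r in range(top, top + size):
        for c in range(left, left + size):
            out[r][c] = 0
    return out

img = [[1] * 4 for _ in range(4)]
masked = cutout(img, 2, seed=0)  # exactly one 2x2 block is zeroed
```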
[1] Baevski, Alexei, et al. "wav2vec 2.0: A framework for self-supervised learning of speech representations." Advances in Neural Information Processing Systems 33 (2020): 12449-12460.
[2] Hsu, Wei-Ning, et al. "HuBERT: Self-supervised speech representation learning by masked prediction of hidden units." IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021): 3451-3460.
[3] Radford, Alec, et al. "Robust speech recognition via large-scale weak supervision." arXiv preprint arXiv:2212.04356 (2022).
[4] Wang, Chien-Yao, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. "YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors." arXiv preprint arXiv:2207.02696 (2022).
[5] Cubuk, Ekin D., et al. "AutoAugment: Learning augmentation strategies from data." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.