Using pre-annotated and open-source datasets can be unreliable, and they may not fit your specific needs. With HITL, you can specify the properties you are looking for in a dataset, and the people labeling it can watch for those criteria on your behalf. Introducing a human into the process can also greatly improve the accuracy and quality of the datasets you create, because people are efficient at catching anomalies and other oddities. Sama, a company that offers HITL services, found that the accuracy of labels from open-source and pre-annotated datasets was roughly 50-75%, while datasets labeled with HITL reached accuracies of 95% and above.
There are several reasons why crowdsourcing may not be the best option for your company. Sama gives three:
1. The skill level of the people working on annotations isn't guaranteed to be sufficient.
2. The devices they work on could be vulnerable and insecure.
3. They might leave partway through and not continue, which lowers the quality of the dataset. Annotators who stay learn how to label properly and improve by learning from their mistakes.
While AI can process data faster than humans can, there is much that it can't understand, and this is where humans assist. Humans understand ethics and what individuals consider ethical, and they can look through the data and "clean up" issues that an AI would otherwise have ignored.
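As a rough illustration of how this division of labor can work, the sketch below routes low-confidence model predictions to a human review queue while accepting high-confidence ones automatically. It is a minimal, hypothetical example: the Prediction type, the split_for_review function, and the 0.9 threshold are assumptions made for illustration, not any particular vendor's workflow.

```python
# Minimal sketch of a human-in-the-loop review step, assuming a model that
# returns a label and a confidence score for each item. All names and the
# threshold are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Prediction:
    item_id: str
    label: str
    confidence: float  # between 0.0 and 1.0

def split_for_review(predictions, threshold=0.9):
    """Accept high-confidence labels automatically; send the rest to humans."""
    auto_accepted, needs_human_review = [], []
    for pred in predictions:
        if pred.confidence >= threshold:
            auto_accepted.append(pred)
        else:
            needs_human_review.append(pred)
    return auto_accepted, needs_human_review

# Example: three model outputs, one of which falls below the threshold
# and would be routed to a human annotator for correction.
preds = [
    Prediction("img_001", "cat", 0.97),
    Prediction("img_002", "dog", 0.62),  # ambiguous -> human review
    Prediction("img_003", "cat", 0.91),
]
accepted, review = split_for_review(preds)
print(len(accepted), "auto-accepted,", len(review), "sent to human review")
```

In a real pipeline, the items sent to review would be shown to annotators in a labeling tool, and their corrections fed back into the dataset.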
If this is necessary, what standards should be put in place?