One of the challenges holding machines back from approaching human-level speech recognition like Whisper has has been acquiring sufficiently large amounts of high-quality, labeled training data. “Labeled” in this case means audio of speech that has a corresponding text associated with it. With enough of these labeled data, a machine learning model can learn to take in speech audio as an input and then output the correct corresponding text.
Read More