ffmpeg is a package for converting audio (and video) files from one format to another. Its documentation is at: https://www.ffmpeg.org/ffmpeg.html#Synopsis
To convert all .opus files in a folder to MP3, the following command can be used (Linux/macOS):
for i in *.opus; do ffmpeg -i "$i" "${i%.*}.mp3"; done
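The same batch conversion can be driven from Python, which also works on Windows where the bash loop is not available. This is a minimal sketch that assumes the ffmpeg binary is on PATH; the helper names ffmpeg_command and convert_folder are hypothetical. Path.with_suffix replaces only the last extension, which mirrors what ${i%.*} does in the shell loop.

```python
import subprocess
from pathlib import Path


def ffmpeg_command(src: Path, dst_ext: str = ".mp3") -> list[str]:
    """Build `ffmpeg -i IN OUT`, swapping only the last extension (like ${i%.*})."""
    dst = src.with_suffix(dst_ext)
    return ["ffmpeg", "-i", str(src), str(dst)]


def convert_folder(folder, src_ext=".opus", dst_ext=".mp3", dry_run=False):
    """Run (or just print, if dry_run) one ffmpeg command per matching file."""
    for src in sorted(Path(folder).glob(f"*{src_ext}")):
        cmd = ffmpeg_command(src, dst_ext)
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if ffmpeg exits with an error
            subprocess.run(cmd, check=True)
```

Calling convert_folder("recordings", dry_run=True) prints the commands before anything is converted. ffmpeg asks before overwriting an existing output file; adding "-n" (never overwrite) or "-y" (always overwrite) to the command list avoids the prompt.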
Audacity is an open-source tool for editing audio: https://www.audacityteam.org/
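Python libraries: librosa (loading and analysing audio) and PyAudio (recording and playing audio through PortAudio).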
Voice activity detection (VAD) detects which parts of an audio signal contain speech. Tools include Silero VAD: https://thegradient.pub/one-voice-detector-to-rule-them-all/ and webrtcvad: https://github.com/wiseman/py-webrtcvad/
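To illustrate what a VAD produces (a list of speech intervals), here is a toy energy-threshold detector in NumPy. This is a swapped-in technique, not how Silero VAD (a neural network) or webrtcvad work; both use trained detectors that are far more robust to background noise. The 30 ms frame length (one of the frame sizes webrtcvad accepts) and the -35 dBFS threshold are arbitrary assumptions.

```python
import numpy as np


def energy_vad(signal, sr, frame_ms=30, threshold_db=-35.0):
    """Return (start_s, end_s) intervals whose frame RMS level exceeds threshold_db (dBFS).

    Toy illustration only: loud non-speech noise will also be marked as "speech".
    """
    signal = np.asarray(signal, dtype=np.float64)
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    # Split into non-overlapping frames (a trailing partial frame is dropped)
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    level_db = 20 * np.log10(rms + 1e-12)  # small offset avoids log10(0) on silence
    voiced = level_db > threshold_db

    # Merge consecutive voiced frames into (start, end) intervals in seconds
    segments = []
    start = None
    for i, is_voiced in enumerate(voiced):
        if is_voiced and start is None:
            start = i
        elif not is_voiced and start is not None:
            segments.append((start * frame_len / sr, i * frame_len / sr))
            start = None
    if start is not None:
        segments.append((start * frame_len / sr, n_frames * frame_len / sr))
    return segments
```

For 1 s of silence, 1 s of a tone and 1 s of silence, this returns a single interval of roughly (1.0, 2.0) seconds, snapped to the 30 ms frame grid.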
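MFCC (Mel-frequency cepstral coefficients): a compact description of the short-term spectrum of audio, commonly used as input features for speech models. librosa computes them with librosa.feature.mfcc.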
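A minimal NumPy sketch of the usual MFCC pipeline: split the signal into windowed frames, take each frame's power spectrum, sum it through triangular mel-spaced filters, take the log, and apply a DCT. The frame size (512), hop (160 samples, i.e. 10 ms at 16 kHz), 40 mel bands and the HTK mel formula are assumptions. In practice librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13) does this for you, but with different defaults, so its values will not match this sketch exactly.

```python
import numpy as np


def hz_to_mel(f):
    # HTK-style mel scale
    return 2595.0 * np.log10(1.0 + np.asarray(f) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m) / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters evenly spaced on the mel scale, shape (n_mels, n_fft//2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    fb = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def dct_ii_matrix(n_out, n_in):
    """Orthonormal DCT-II basis, shape (n_out, n_in)."""
    k = np.arange(n_out)[:, None]
    n = np.arange(n_in)[None, :]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_in)) * np.sqrt(2.0 / n_in)
    basis[0] /= np.sqrt(2.0)
    return basis


def mfcc(y, sr, n_mfcc=13, n_fft=512, hop=160, n_mels=40):
    """MFCC matrix of shape (n_mfcc, n_frames); y needs at least n_fft samples."""
    # 1. Overlapping frames, each multiplied by a Hann window
    n_frames = 1 + (len(y) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = y[idx] * np.hanning(n_fft)
    # 2. Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    # 3. Energy in each mel band, then log compression
    mel_energy = power @ mel_filterbank(sr, n_fft, n_mels).T
    log_mel = np.log(mel_energy + 1e-10)
    # 4. DCT decorrelates the log-mel energies; keep the first n_mfcc coefficients
    return dct_ii_matrix(n_mfcc, n_mels) @ log_mel.T
```

The result has one column per 10 ms frame, the same (coefficients, frames) layout that librosa returns.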
Whisper can be downloaded from here: https://github.com/openai/whisper
It can be installed with pip (pip install -U openai-whisper); it also needs ffmpeg installed on the system. Once installed:
import librosa
import pandas as pd
import streamlit as st
import whisper

whisper_model = whisper.load_model('small.en')  # 'base' and 'small.en' both work well
Method 1: transcribe an MP3/WAV file
result = whisper_model.transcribe('my_audio.wav', word_timestamps=True)
pd.DataFrame([w for seg in result['segments'] for w in seg['words']])  # start/end timestamp of every word, across all segments
result['text']  # full transcribed text
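result['segments'] holds one entry per segment, and with word_timestamps=True each segment carries a 'words' list of dicts with 'word', 'start', 'end' and 'probability' keys. The hypothetical helper find_word below walks every segment to report when a given word is spoken. Stripping the leading space and punctuation is an assumption about how Whisper formats word strings (it usually attaches punctuation to the preceding word).

```python
def find_word(result, target):
    """Return (start, end) times of every occurrence of `target` in a Whisper
    result produced with word_timestamps=True (case-insensitive)."""
    target = target.strip().lower()
    hits = []
    for segment in result["segments"]:
        for w in segment.get("words", []):
            # Whisper words usually look like " Hello" or " world." -> normalise to "hello"/"world"
            token = w["word"].strip().strip(".,!?;:\"'").lower()
            if token == target:
                hits.append((w["start"], w["end"]))
    return hits
```

For example, find_word(result, "hello") lists the start and end time (in seconds) of each "hello" in the recording.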
Method 2: convert a recording from st.audio_input directly to a NumPy array
audio_data = st.audio_input("Record a voice message")  # the label argument is required
if audio_data:
    audio_array, _ = librosa.load(audio_data, sr=16000)  # Whisper expects 16 kHz mono audio
    result = whisper_model.transcribe(audio_array, word_timestamps=True)
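Whisper models work on 16 kHz mono audio, which is why librosa.load is called with sr=16000. As a sketch of what that call does, the hypothetical helper below decodes WAV bytes with the standard-library wave module and NumPy, without librosa. It assumes the recording is 16-bit PCM WAV, and its linear-interpolation resampling is crude (no anti-aliasing filter) compared with librosa's resampler.

```python
import io
import wave

import numpy as np

WHISPER_SR = 16_000  # Whisper expects 16 kHz mono float32 input


def wav_bytes_to_whisper_array(data: bytes, target_sr: int = WHISPER_SR) -> np.ndarray:
    """Decode 16-bit PCM WAV bytes into a mono float32 array at target_sr."""
    with wave.open(io.BytesIO(data), "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV")
        sr = wf.getframerate()
        n_channels = wf.getnchannels()
        raw = wf.readframes(wf.getnframes())
    # Little-endian int16 samples -> float32 in [-1, 1)
    samples = np.frombuffer(raw, dtype="<i2").astype(np.float32) / 32768.0
    # Interleaved channels -> mono by averaging
    samples = samples.reshape(-1, n_channels).mean(axis=1)
    if sr != target_sr:
        # Crude linear-interpolation resampling onto the target time grid
        n_out = int(round(len(samples) / sr * target_sr))
        t_in = np.arange(len(samples)) / sr
        t_out = np.arange(n_out) / target_sr
        samples = np.interp(t_out, t_in, samples)
    return samples.astype(np.float32)
```

Assuming the object returned by st.audio_input behaves like io.BytesIO, usage would be audio_array = wav_bytes_to_whisper_array(audio_data.getvalue()), followed by the same whisper_model.transcribe call.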