How to use Voice Recognition

CasualTranscriber 2.7 has an experimental feature that uses macOS's built-in speech recognition for automatic transcription.

I am still testing the feature, but here are its known limitations and some notes on how it behaves:

Recognition accuracy depends on macOS itself. Clean English monologues are recognized with reasonable accuracy. Japanese is recognized less accurately, and Kanji in particular is often wrong. In conversations with multiple participants, or with loud background noise such as in TV dramas, accuracy drops significantly.
Recognition currently runs in offline mode, so no voice data is sent to Apple's servers. Server-side processing would be more accurate, but it can handle only up to one minute of audio at a time, and some recordings contain data that must not be sent outside. For these reasons the feature is offline-only. I might add an option for online recognition in the future, but the file would then need to be preprocessed into one-minute segments.
Recognition takes slightly less than real time on Big Sur and approximately one-third to one-half of the audio length on Ventura. Processing on M2 is slightly faster than on M1. Recognition also works on Intel Macs (at least on Ventura).
The error message "Siri and Dictation are disabled" appears when the speech recognition feature is turned off. Please enable the speech recognition feature in system settings as indicated below.
The error message "Failed to access assets" means that the language data for the selected recognition language is not available. Please download the language data for that language as indicated below.
If music such as a song is inserted in the middle of the audio, recognition seems to stop at that point. Even if the message indicates successful completion, the parts after the music are not recognized. Please cut out inserted songs in advance and process only the speech parts.
The error "kAFAssistantErrorDomain error - 203" seems to indicate that no speech was found after a certain point. If there is still speech beyond that point, please split the file and process the parts separately. In earlier trials on Big Sur, this error sometimes occurred at the end even though all of the speech had been recognized, whereas on Ventura the same processing completed without an error. If the result is complete, just ignore this message.
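As a rough illustration of the one-minute preprocessing mentioned above, the following Python sketch computes segment boundaries for a file of a given length. It is not part of CasualTranscriber. The function name and the fixed 60-second limit are assumptions made for this example, and cutting the audio itself is left to an audio editor.

```python
def split_into_segments(duration_s, max_len_s=60.0):
    """Return (start, end) pairs in seconds that cover the whole file.

    Hypothetical helper: every segment is at most max_len_s long.
    """
    duration_s = float(duration_s)
    if duration_s < 0 or max_len_s <= 0:
        raise ValueError("duration must be >= 0 and max_len_s must be > 0")
    segments = []
    start = 0.0
    while start < duration_s:
        # The last segment is shortened so it ends with the file.
        end = min(start + max_len_s, duration_s)
        segments.append((start, end))
        start = end
    return segments


if __name__ == "__main__":
    # A file of 2 minutes 30 seconds gives two full segments and one half segment.
    for start, end in split_into_segments(150):
        print(f"{start:6.1f} s to {end:6.1f} s")
```

Fixed boundaries can cut a word in half, so in practice it is probably better to move each cut to a nearby pause in the speech.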

This version is compatible with macOS 11.3 Big Sur or later. However, the way speech recognition is handled seems to differ between Big Sur and Monterey (or later). On Monterey and later, you need to enable Dictation and download the language data for the recognition language before using the feature. macOS versions earlier than Big Sur do not provide this capability, so voice recognition is not available in CasualTranscriber on those versions.

On macOS Monterey or later, enable Dictation in the Keyboard section of System Settings. Then, under the Language setting, choose "Customize" and download the language data for each language you want to recognize. The language of your locale is available by default. In the example below, English (United States) is available. Added languages are displayed in the list by name.

Once you have configured the Dictation settings, launch CasualTranscriber and select the Voice Recognition Language in the ADV1 section of the Preferences.

The list displays every language that can be recognized, but recognition only works for languages whose language data you have downloaded.

Now you can run recognition on the audio/video file open in the window. Go to Menu -> Misc -> Recognize Sound in File.

The text is segmented in the same way as the sections returned by macOS's speech recognition. Time stamps are inserted before and after each segment. Please note, however, that recognition runs independently of the control of the audio and video files, so there may be slight discrepancies (likely on the order of milliseconds) between the time stamps and the actual content.
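To make the idea of time stamps before and after each segment concrete, here is a small Python sketch. The `[hh:mm:ss.mmm]` notation and both function names are assumptions for illustration only. The actual time stamp format that CasualTranscriber writes may differ.

```python
def format_timestamp(seconds):
    """Format a time in seconds as hh:mm:ss.mmm (hypothetical notation)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def render_segments(segments):
    """Render (start_s, end_s, text) tuples, one segment per line,
    with a time stamp before and after the recognized text."""
    lines = []
    for start, end, text in segments:
        lines.append(f"[{format_timestamp(start)}] {text} [{format_timestamp(end)}]")
    return "\n".join(lines)


if __name__ == "__main__":
    example = [(0.0, 2.4, "Hello, everyone."), (2.4, 5.1, "Let's get started.")]
    print(render_segments(example))
```

Because the start and end of each segment come from the recognizer rather than from the player, comparing them with the playback position is the easiest way to spot the small discrepancies described above.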

With version 2.7.1 (20230617) or later, a batch processing function is available. It processes multiple audio/video files and saves the results as RTF files in CasualTranscriber format.

To use this feature, select "Batch Transcriber" from Main Menu -> Window. Once the window is open, drag and drop the files you want to process onto the table, select the recognition language, and click "Process." You will be prompted to choose a folder to save the files, so please select a folder accordingly.

If an error occurs, information about the error is recorded in the file, and processing of that file is terminated. As a known issue, and as with single-file recognition, on Big Sur an error may end the process even when all the audio in the file has been recognized. This can happen when the last portion contains no human voice, and sometimes even when it contains some. If you see this error, please verify that the last part was actually recognized.
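One simple way to decide which files need a closer look is to compare the last time stamp in each result with the length of the audio. The Python sketch below assumes you have noted these values by hand. The field names and the 5-second tolerance are hypothetical and are not produced by CasualTranscriber.

```python
def files_to_review(results, tolerance_s=5.0):
    """Return (name, gap_s) for files that ended with an error and whose
    last time stamp is more than tolerance_s before the end of the audio.

    Each result is a dict with hypothetical keys:
      name, duration_s, last_timestamp_s, error (None if no error).
    """
    flagged = []
    for result in results:
        if result["error"] is None:
            continue  # finished without an error message
        gap = result["duration_s"] - result["last_timestamp_s"]
        if gap > tolerance_s:
            flagged.append((result["name"], gap))
    return flagged


if __name__ == "__main__":
    batch = [
        {"name": "a.m4a", "duration_s": 600.0, "last_timestamp_s": 598.0, "error": "203"},
        {"name": "b.m4a", "duration_s": 600.0, "last_timestamp_s": 310.0, "error": "203"},
        {"name": "c.m4a", "duration_s": 300.0, "last_timestamp_s": 299.0, "error": None},
    ]
    for name, gap in files_to_review(batch):
        print(f"{name}: about {gap:.0f} s at the end may be unrecognized")
```

A file that is not flagged can still have gaps, for example after an inserted song, so this check only covers the end of the file.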
