by sil-ai
This model is a fine-tuned version of SpeechT5 associated with forced alignment and inference. It is described as an automatic speech recognition (ASR) model for transcribing speeches, generating subtitles, and analyzing audio content. However, its base checkpoint, microsoft/speecht5_tts, is the text-to-speech variant of SpeechT5, which synthesizes speech from text rather than transcribing audio, so the fine-tuned model most likely does the same. No speed or accuracy benchmarks are reported beyond the validation loss in the training log.
This model is a fine-tuned version of microsoft/speecht5_tts; the training dataset is not specified. The last logged evaluation (step 19000) reports a validation loss of 0.1760.
More information needed
The hyperparameters used during training are not listed. Training and validation loss were logged every 1000 steps:
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 0.1869 | 30.3030 | 1000 | 0.1695 |
| 0.1612 | 60.6061 | 2000 | 0.1583 |
| 0.1399 | 90.9091 | 3000 | 0.1664 |
| 0.1301 | 121.2121 | 4000 | 0.1640 |
| 0.1208 | 151.5152 | 5000 | 0.1699 |
| 0.1161 | 181.8182 | 6000 | 0.1746 |
| 0.1080 | 212.1212 | 7000 | 0.1673 |
| 0.0945 | 242.4242 | 8000 | 0.1804 |
| 0.1044 | 272.7273 | 9000 | 0.1787 |
| 0.0929 | 303.0303 | 10000 | 0.1756 |
| 0.0845 | 333.3333 | 11000 | 0.1701 |
| 0.0894 | 363.6364 | 12000 | 0.1739 |
| 0.0813 | 393.9394 | 13000 | 0.1667 |
| 0.0818 | 424.2424 | 14000 | 0.1740 |
| 0.0769 | 454.5455 | 15000 | 0.1719 |
| 0.0788 | 484.8485 | 16000 | 0.1780 |
| 0.0759 | 515.1515 | 17000 | 0.1745 |
| 0.0933 | 545.4545 | 18000 | 0.1754 |
| 0.0764 | 575.7576 | 19000 | 0.1760 |
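
The log shows training loss falling steadily while validation loss stays roughly flat. The Python sketch below makes that visible by finding the row with the lowest validation loss and estimating the number of optimizer steps per epoch. The rows are copied from the table. The helper names (`best_checkpoint`, `steps_per_epoch`, `generalization_gap`) are illustrative and not part of any library.

```python
# Rows copied from the training log: (training loss, epoch, step, validation loss).
LOG = [
    (0.1869, 30.3030, 1000, 0.1695),
    (0.1612, 60.6061, 2000, 0.1583),
    (0.1399, 90.9091, 3000, 0.1664),
    (0.1301, 121.2121, 4000, 0.1640),
    (0.1208, 151.5152, 5000, 0.1699),
    (0.1161, 181.8182, 6000, 0.1746),
    (0.1080, 212.1212, 7000, 0.1673),
    (0.0945, 242.4242, 8000, 0.1804),
    (0.1044, 272.7273, 9000, 0.1787),
    (0.0929, 303.0303, 10000, 0.1756),
    (0.0845, 333.3333, 11000, 0.1701),
    (0.0894, 363.6364, 12000, 0.1739),
    (0.0813, 393.9394, 13000, 0.1667),
    (0.0818, 424.2424, 14000, 0.1740),
    (0.0769, 454.5455, 15000, 0.1719),
    (0.0788, 484.8485, 16000, 0.1780),
    (0.0759, 515.1515, 17000, 0.1745),
    (0.0933, 545.4545, 18000, 0.1754),
    (0.0764, 575.7576, 19000, 0.1760),
]


def best_checkpoint(log):
    """Return the logged row with the lowest validation loss."""
    return min(log, key=lambda row: row[3])


def steps_per_epoch(log):
    """Estimate optimizer steps per epoch from the first logged row."""
    _, epoch, step, _ = log[0]
    return round(step / epoch)


def generalization_gap(row):
    """Validation loss minus training loss for one logged row."""
    train_loss, _, _, val_loss = row
    return val_loss - train_loss


best = best_checkpoint(LOG)
print(f"lowest validation loss: {best[3]:.4f} at step {best[2]}")
print(f"estimated steps per epoch: {steps_per_epoch(LOG)}")
print(f"gap at first row: {generalization_gap(LOG[0]):+.4f}")
print(f"gap at last row:  {generalization_gap(LOG[-1]):+.4f}")
```

On these rows the lowest validation loss is 0.1583 at step 2000, and the epoch column implies about 33 steps per epoch. The gap between validation and training loss widens after that point, which is consistent with overfitting and suggests an earlier checkpoint may generalize at least as well as the final one.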