Sox can now Babble, with Coqui STT and TTS on one Device

Sox has finally had some work done, and is now capable of babbling!

For the last month I have been working on Sox on and off, as I kept getting frustrated with library collisions within Python. Because the libraries were made at slightly different times, they need a very narrow range of library versions to work together. This, of course, is not published. After much guess and check, the following versions work with the current versions of Coqui. I WILL NOT BE UPDATING THIS PAGE AS TIME GOES ON! PLEASE LOOK ELSEWHERE FOR MORE UP-TO-DATE COMPATIBILITY.

  • torch==1.12.0
  • torchaudio==0.12.0
  • numpy==1.22.4
  • numba==0.56.4

These might not be the most correct, but they do work with Coqui-TTS, Coqui-STT and Silero-VAD.
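To confirm that an environment actually matches the pins above, a small check like the following can help. This is only a sketch: the helper name check_pins is made up for illustration, and it assumes that a local build tag such as "+cu116" on the installed version should still count as a match.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the list above.
PINNED = {
    "torch": "1.12.0",
    "torchaudio": "0.12.0",
    "numpy": "1.22.4",
    "numba": "0.56.4",
}


def check_pins(pins, get_version=version):
    """Return {package: (expected, found)} for every package that does not match.

    check_pins is a hypothetical helper, not part of any of the libraries.
    `found` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None
        # Assumption: a local build tag like "1.12.0+cu116" still matches "1.12.0".
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

Running the file directly prints one line per package that is missing or at a different version, and prints nothing when everything matches.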

The way Sox currently works is that it constantly listens for voice activity using Silero-VAD. When it finds some, it saves the previous second, any more voice that comes in, and a second of no voice activity afterwards. If this clip is less than 3 seconds long, it is sent to Coqui-STT. If the response contains "Hey Sox", Sox goes into active mode, where the same code is run again, but this time without the 3 second check. The new clip is then decoded by Coqui-STT and compared against any specific phrases. I do not currently have any specific phrases included, as I just got this working.
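The listening loop described above can be sketched roughly like this. It is not the actual Sox code: the names segment_speech and Listener are made up, the voice/no-voice flag for each audio frame is assumed to come from Silero-VAD, and transcribe stands in for whatever call hands the audio to Coqui-STT. The sketch also assumes that the 3 second limit counts the padding on both ends, and that Sox drops back to passive listening after one command, neither of which is settled above.

```python
from collections import deque


def segment_speech(frames, frame_seconds=0.1, pad_seconds=1.0):
    """Yield clips: the second before speech, the speech, and a second of quiet after.

    `frames` is an iterable of (frame, is_speech) pairs. In Sox the is_speech
    flag would come from Silero-VAD; here it is simply part of the input.
    A clip still open when the input ends is dropped.
    """
    pad_frames = round(pad_seconds / frame_seconds)
    pre_roll = deque(maxlen=pad_frames)  # rolling buffer of the most recent quiet frames
    segment = None                       # None while no speech is being recorded
    silence_run = 0
    for frame, is_speech in frames:
        if segment is None:
            if is_speech:
                # Speech started: begin the clip with the buffered pre-roll.
                segment = list(pre_roll) + [frame]
                pre_roll.clear()
                silence_run = 0
            else:
                pre_roll.append(frame)
        else:
            segment.append(frame)
            silence_run = 0 if is_speech else silence_run + 1
            if silence_run >= pad_frames:
                # A full second without voice: the clip is complete.
                yield segment
                segment = None
                silence_run = 0


class Listener:
    """Wake-phrase gate. `transcribe` is a placeholder for the Coqui-STT call."""

    def __init__(self, transcribe, frame_seconds=0.1,
                 wake_phrase="hey sox", max_wake_seconds=3.0):
        self.transcribe = transcribe
        self.frame_seconds = frame_seconds
        self.wake_phrase = wake_phrase
        self.max_wake_seconds = max_wake_seconds
        self.active = False

    def handle(self, segment):
        """Return the transcript of a clip heard in active mode, else None."""
        if not self.active:
            # Passive mode: only short clips are worth checking for the wake phrase.
            duration = len(segment) * self.frame_seconds
            if duration < self.max_wake_seconds:
                if self.wake_phrase in self.transcribe(segment).lower():
                    self.active = True
            return None
        # Active mode: no length check. Assumption: one command, then passive again.
        self.active = False
        return self.transcribe(segment)
```

Each clip from segment_speech would be passed to Listener.handle, and whatever text it returns is what would later be compared against the specific phrases.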

As for TTS, once the libraries worked, passing the string to the model works. I tried training my own voice model based on Sox in the movie, but I ran out of time on Google Colab, so it only got 1/1000th of the training it required. Because of this, it sounds like baby babbles, but the tone of the voice is already nearly there. I will need to find a way to have a Linux copy take over my main computer to run the real training to get the voice for Sox.

Next steps are to include specific action phrases, as well as to combine the TTS and STT components into one.

