Coran and his Olkari Cubes in Voltron, one of my inspirations
This idea started after I saw a video on YouTube about someone making a sound sculpture with a microcontroller that would generate an endless stream of music and sounds. I love the idea of having a set of glowing cubes that do various strange and specific things (like the Olkari Cubes in Voltron). Aside from generating music, one of the oddly specific tasks I wanted my cube to execute was taking a random article header from Wikipedia and speaking it out loud. This led me on a wild ride of experimenting with the DAC (digital to analog converter) capabilities of the ESP32, Wikipedia's very neat API, and PCB (printed circuit board) design.
I started out looking at different ways to output and modify sound on an ESP32. The DAC on the ESP32 is quite capable, as demonstrated by my Analog Screenshare project, and I came across a powerful library called Arduino Audio Tools. It allows for manipulation and generation of many different types of sound data. I originally looked at this library for music creation, but I quickly realized it had other capabilities, such as text to speech. To the right is my test platform: an ESP32-S2 mini, a small speaker, and an analog amplifier salvaged from some very cheap PC speakers, which I had been using for a CRT project (an attempt at a vectorscope).
After talking with the developer of the Audio Tools library, I was able to use an experimental version of his code that lets this version of the ESP32 output analog audio using a new protocol. There are still some bugs, but it was very rewarding to figure out how this library works. The first thing I did was test the classic "Hello world!" with a few different kinds of text to speech. First I tested a port of Flite. This worked well, but because it uses 2MB of flash storage (the ESP32 has 4MB total), I wouldn't have had enough room left to download and parse the Wikipedia data. The next thing I tried was sending a request to Google Translate TTS (text to speech) and receiving an MP3 file of the spoken text. This worked wonderfully and sounded very clear in my initial testing.
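Roughly, that request-and-play chain looks like the sketch below. This is a minimal example rather than my exact code: the WiFi credentials are placeholders, the output here is an I2S amp like the MAX98357 I ended up using (AnalogAudioStream would take its place for DAC output), and the include paths and the unofficial translate_tts parameters may differ depending on the library version.

```cpp
#include "AudioTools.h"
#include "AudioCodecs/CodecMP3Helix.h"   // MP3 decoder from arduino-libhelix

const char* ssid = "your-ssid";          // placeholder credentials
const char* password = "your-password";

URLStream url(ssid, password);           // connects to WiFi and streams the HTTP(S) response
I2SStream out;                           // I2S output (swap in AnalogAudioStream for the DAC)
EncodedAudioStream decoder(&out, new MP3DecoderHelix());  // decode MP3 -> PCM
StreamCopy copier(decoder, url);         // pump bytes from the URL into the decoder

void setup() {
  Serial.begin(115200);

  auto cfg = out.defaultConfig(TX_MODE);
  out.begin(cfg);
  decoder.begin();

  // Unofficial Google Translate endpoint; the text must be URL-encoded and
  // kept under the per-request character limit.
  url.begin("https://translate.google.com/translate_tts?ie=UTF-8&tl=en&client=tw-ob&q=Hello%20world",
            "audio/mp3");
}

void loop() {
  copier.copy();  // keep feeding audio until the stream ends
}
```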
This TTS engine is very compact compared to the others because it fits all of its sound data in about 3MB of the ESP32's flash storage. Because of that compactness, it's not very clear or expressive. (My cheap and probably cooked amp isn't helping.)
This TTS script visits a URL that sends text to a Google Translate voice server, which returns an MP3 file of the text being spoken. This method is very clear and expressive, but sending and receiving the request takes time and resources, and there is a limit of 300 characters per request, which is usually barely enough for two sentences. For some reason, the MP3 comes back at a different sample rate than expected, so it sounds high-pitched.
(video is from before I began using an I2S amp)
This is the final text to speech engine I tested, and it shows a lot of promise. It was ported to Arduino by Phil Schatzmann. I can switch between many different voices, and speed, pitch, volume, and other parameters can all be adjusted. It requires a lot of space on the microcontroller, but that isn't an issue because the ESP32-S2's flash storage is large enough to handle it. There are some skipping issues I still need to resolve before it is clear enough to use.
One of the most interesting parts of the project was getting the random article data. The ESP32 has built-in functions for fetching data from websites, but they need a URL to be plugged in. How do I get a random URL each time? My first thought was to use this link: https://en.wikipedia.org/wiki/Special:Random. It works well and brings you to a random page like I wanted, but unfortunately I need only text, with no images or other types of data. Luckily, the API Wikipedia is built on, MediaWiki, has query options for extracting specific types of content, as well as a randomizer. After some experimenting, I came up with this URL:
This URL returns a random Wikipedia page's intro as plain text, wrapped in a JSON document.
Using a very well-designed library called ArduinoJson, I can take the returned JSON document and "deserialize" it. This process simply splits the JSON into its individual parts, and once it's split, I can use a function to look up the text I want. The best part about this library is that it automatically decodes the text into a standard format, so I can send it to the text-to-speech engine with no strange issues or symbols!
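Put together, the fetch-and-parse step looks something like this. It's a sketch written against ArduinoJson 6, with a query URL that approximates mine and a helper name (fetchRandomIntro) made up for the example; it assumes WiFi is already connected.

```cpp
#include <WiFiClientSecure.h>
#include <HTTPClient.h>
#include <ArduinoJson.h>

// Approximation of the MediaWiki query: one random main-namespace article,
// plain-text intro only, returned as JSON.
const char* kRandomIntroUrl =
    "https://en.wikipedia.org/w/api.php?action=query&format=json"
    "&generator=random&grnnamespace=0&grnlimit=1"
    "&prop=extracts&exintro&explaintext";

String fetchRandomIntro() {
  WiFiClientSecure client;
  client.setInsecure();        // skip certificate checks to keep the sketch short
  HTTPClient http;
  http.useHTTP10(true);        // non-chunked body streams cleanly into ArduinoJson
  if (!http.begin(client, kRandomIntroUrl) || http.GET() != HTTP_CODE_OK) {
    http.end();
    return "";
  }

  DynamicJsonDocument doc(24576);                               // intros are usually a few KB
  DeserializationError err = deserializeJson(doc, http.getStream());
  http.end();
  if (err) return "";

  // generator=random returns a single page keyed by its (unknown) page id,
  // so grab the first entry in query.pages and read its "extract" field.
  for (JsonPair page : doc["query"]["pages"].as<JsonObject>()) {
    return page.value()["extract"].as<String>();
  }
  return "";
}
```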
After getting the initial code working, I shifted my focus to designing a PCB for this project (I got bored of programming, so I decided to teach myself PCB design). I started with a basic schematic put together from atomic14's mini ESP32-S3 and documentation from Espressif. The PCB includes a 3.7V battery charger with a power switcher so it can move seamlessly between USB power and the battery. It also includes a MAX98357 I2S amplifier for integrated mono audio output, plus 4 SK6812 RGBW LEDs for cool glowing effects inside my cube platform. The LEDs sit behind a logic level converter that shifts the 3.3V data signal up to 5V for the most stable performance. The headers at the bottom break out all of the GPIO pins I could fit, as well as the data out from the last LED in case I want to extend the LED chain at 5V logic.
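On the firmware side, that hardware only needs a little setup, roughly like the sketch below. Every pin number here is a placeholder for however the PCB actually routes things; it uses Adafruit_NeoPixel's RGBW mode for the SK6812s and points the Audio Tools I2S output at the MAX98357.

```cpp
#include <Adafruit_NeoPixel.h>
#include "AudioTools.h"

#define LED_PIN   18   // data line into the first SK6812 (placeholder)
#define LED_COUNT 4

Adafruit_NeoPixel leds(LED_COUNT, LED_PIN, NEO_GRBW + NEO_KHZ800);
I2SStream i2s;

void setup() {
  leds.begin();
  leds.setBrightness(64);
  leds.fill(leds.Color(0, 0, 0, 255));   // use the dedicated white channel
  leds.show();

  auto cfg = i2s.defaultConfig(TX_MODE);
  cfg.pin_bck  = 14;   // BCLK  -> MAX98357 (placeholder routing)
  cfg.pin_ws   = 15;   // LRCLK
  cfg.pin_data = 16;   // DIN
  i2s.begin(cfg);
}

void loop() {
  // slow "breathing" glow on the white channel
  uint8_t w = (uint8_t)(128 + 127 * sin(millis() / 1000.0));
  leds.fill(leds.Color(0, 0, 0, w));
  leds.show();
  delay(20);
}
```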
After designing the PCB, I began designing its enclosure to include everything needed for it to function: a battery, a button, a speaker, slots for the PCB, and a hole for the USB-C port. I considered many different designs using parts I have on hand as well as parts I could easily source. I seriously considered putting the PCB right in the middle of the cube, but that complicated speaker placement and would have required lights on the bottom of the PCB, making the construction more expensive and complicated.
Press Space on the 3D model to see inside!
This project was started November 2023