A Brief History of Vocal Synthesis

by Caleb Skelly and Quinne Houck
December 10th, 2025

Voice-Over • Music • Sound Design

From Hatsune Miku to Microsoft Sam to the IBM 7094, computers have been speaking since 1961. Learn about the different forms of vocal synthesis, how they are integrated, and how they are created!

In November, Wavyrn had the opportunity to present at an annual game development event where Caleb Skelly, one of our VO producers and sound designers, presented on the history of vocal synthesis.

Caleb is a VO producer, sound designer, and occasional biomedical researcher. Besides working with insects and crocheting in his spare time, Caleb is also passionate about vocal synthesis, his favorite aspect of audio design.

Below is an abridged summary of the presentation, highlighting the important history, questions and answers delivered at the panel.

————————————————

In this, we're gonna go through a brief history of vocal synthesis: what it is, how it's done, how it's implemented; all that fun stuff. I would like to emphasize that this is a brief synopsis. I could go on for days and days about all of this, so you're welcome to do your own deeper research!

What is a voice? It seems self-explanatory, but the answer is a little more complicated than people might believe. A human voice is a product of the human vocal tract. If you don't have a vocal tract, you don't have a voice, and the way the vocal tract operates is what gives you your voice. Voices are, for the most part, unique to each person—people can sound very similar, but even twins will have differences that change how a voice sounds.

What actually makes up a voice, though, breaks down into a lot of different parts, which we can then use to build up our own unique voice (we'll get into that later). First, though: the primary way the vocal tract produces voice is through vibration of the vocal cords. The cords, also called vocal folds, sit inside the larynx, or "voice box," and together those structures shape an individual's pitch range, timbre, and so on. The rate of vibration determines pitch, but there are also special unpitched types of phonation, like screaming and vocal fry, which we won't get into today.

At its base, the structure and length of the vocal cords are mostly what determine pitch. If you're familiar with instruments, you'll notice that usually the larger or longer something is, the lower its pitch can go, and it's exactly the same with humans. People with larger vocal tracts and cords have deeper voices, which is why people with higher testosterone levels tend to have lower voices: testosterone lengthens your vocal cords as part of hormonal development.

Articulation and prosody are how we control the air coming out, not just from the mouth but also the nose, to create language. This is the component that is super important for synthesis, because imitating pitch is easy, but to synthesize a voice you have to imitate language.

Articulation is created by each part of your face: your teeth, your tongue, your throat, your nasal passage. These all touch or affect parts of your vocal tract to cut off air, and with all these moving parts, there are a lot of articulations that can be made, most of which are imitated in synthesis. Prosody is about rhythm and inflection, and is often correlated with accents.

So what is Vocal Synthesis? We just want to copy a human voice, but that's actually incredibly difficult, and we have to use a lot of methods and technologies to do it. Some methods simply build a working model of the human vocal tract: there's a free one called Pink Trombone that you can look up online and play around with.

No matter what, all synthesis methods are fundamentally tied to the vocal tract, but they aren't always digital. The earliest known scientific example was from 1780, by physics professor Christian Gottlieb Kratzenstein of Copenhagen University. He created shaped reeds in tubes that, when he blew into them, approximated the sounds of the five pure vowels: A, E, I, O, and U. Even before that, since ancient times, we've been fascinated with the idea of copying other sounds.

“Modern” vocal synthesis was started by the IBM 7094 computer, which used a vocoder to sing “Daisy Bell” in 1961. It was the first time we could input text into a system and have it output as a discernible voice.

There are many different types of vocal synthesis, but we’re going to focus on three big ones that are relevant to game developers: formant, concatenative, and artificial intelligence.

Formant is the most ‘retro’ sounding: your Microsoft Sams, Speak & Spells, default computer voices. This method doesn't use human voice samples; it just approximates the ‘formants,’ the characteristic resonances, of a human voice. It then uses oscillators (digital or analog) to recreate those waveforms and reproduce the specific voice it was modeled on. Those formants are sonic representations of a very small and predictable unit of the human voice.

We have a lot of ways to extrapolate these, but generally we follow a chart that maps phonetic sounds onto a graph of two formant frequencies, F1 and F2. For example, if you create an oscillation with F2 at 2300 Hz and F1 at 250 Hz, it will sound like the human E ("ee") vowel. This method is old and not very common anymore, but formant voices are often kept as ‘legacy’ options or used in niche or retro systems.
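To make that concrete, here's a minimal, illustrative sketch of formant synthesis in Python: a buzzy pulse train standing in for the vocal cords is run through two resonant filters tuned to F1 = 250 Hz and F2 = 2300 Hz. The filter design and all the constants here are my own assumptions for demonstration, not any particular synth's internals.

```python
# A buzzy impulse train (the "vocal cords") filtered through two resonances
# (the "vocal tract") tuned to the F1/F2 pair for an "ee"-like vowel.
import numpy as np
from scipy.signal import butter, lfilter

SR = 16000           # sample rate, Hz
F0 = 120             # pitch of the voice source, Hz
F1, F2 = 250, 2300   # formant frequencies from the chart above, Hz

# 1. Source: one second of impulses at the fundamental frequency.
n = np.arange(SR)
source = (n % (SR // F0) == 0).astype(float)

# 2. Filter: each formant modeled as a narrow band-pass resonance.
def resonator(x, center_hz, bandwidth_hz=100):
    nyquist = SR / 2
    low = (center_hz - bandwidth_hz / 2) / nyquist
    high = (center_hz + bandwidth_hz / 2) / nyquist
    b, a = butter(2, [low, high], btype="band")
    return lfilter(b, a, x)

vowel = resonator(source, F1) + resonator(source, F2)
vowel /= np.abs(vowel).max()   # normalize; write to a WAV or play it to hear the vowel
```

Swap in different F1/F2 pairs from a vowel chart and the same source signal starts to read as different vowels.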

Concatenation was, up until recently, the primary way voices were synthesized. It works by selecting units of sound from a recorded library, then pitching and crossfading them together. Creating these libraries takes a long time: they require a full phonetic dictionary and connections between every consonant and vowel in a language, they need a lot of quality assurance and testing, and they can still end up buggy and unreliable. Even so, they were the standard for almost 15 years, so they're still very relevant.

The way concatenation transforms the human voice is through a method called the Fast Fourier Transform (FFT); other systems use their own algorithms, like Hidden Markov Models (HMM). The FFT is a formula that turns a slice of the voice signal into discrete frequency components called bins.
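As a rough illustration of what those bins are, here's a tiny NumPy sketch; the frame size and the test tone are arbitrary choices for demonstration, not anything a real engine specifies.

```python
import numpy as np

SR = 16000
N = 1024                                              # frame size in samples
frame = np.sin(2 * np.pi * 440 * np.arange(N) / SR)   # a 440 Hz test tone

spectrum = np.fft.rfft(frame * np.hanning(N))         # window the frame, then transform
freqs = np.fft.rfftfreq(N, d=1 / SR)                  # the center frequency of each bin

peak = np.argmax(np.abs(spectrum))
print(f"bin width: {SR / N:.1f} Hz, loudest bin is centered at {freqs[peak]:.1f} Hz")
```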

The most famous example of concatenation is Hatsune Miku, on the Vocaloid 2 engine. She was the first real synth that took off and kind of paved the way for vocal synthesis. Alexa and Siri also used to be concatenation, but they’re now AI, which we’ll talk about in a bit.

Vocaloid was developed by Yamaha with researchers at Pompeu Fabra University in Barcelona. Vocaloid 1 was formant, 2 through 5 were concatenation, and the current one, Vocaloid 6, is AI. Vocaloid 1 was actually a sort of hybrid of formant and concatenation, where it was based on formants from a singer but generated via concatenation methods. Once we got to Vocaloid 2, Hatsune Miku was introduced.

The way she works is, at its core, the basis for how most other concatenation synths work. First you have to pick a language: Japanese is very easy to synthesize, and English is one of the most difficult. You record your library, then it's analyzed and compiled into a database of specific pieces of sound corresponding to language, so that phonetic-symbol input can correctly trigger the right pieces of sound. X-SAMPA is a table of simplified ways to write out the phonetics of any language, a plain-text form of the IPA, the International Phonetic Alphabet. When the VA records this, they have to intone the recording list of phonemes carefully, including at multiple even pitches.

Next you have to set parameters. Overlap dictates how much a vowel and the preceding syllable overlap; pre-utterance is where the next consonant-and-vowel cluster enters. This helps avoid stretching consonants, so the synth only holds on vowels. There are also offset, consonant, and cutoff parameters, as in the sketch below.
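Here is a hypothetical sketch of how those parameters might be stored and used when two recorded units are joined. The field names mirror the parameters above, but the class and the crossfade logic are purely illustrative, not taken from Vocaloid, UTAU, or any real engine.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Unit:
    samples: np.ndarray   # the recorded audio for one phonetic unit, e.g. "ka"
    offset: int           # samples to skip at the start of the recording
    consonant: int        # fixed region that must not be stretched
    cutoff: int           # samples to drop from the end
    preutterance: int     # how early the consonant starts before the beat
    overlap: int          # crossfade length with the previous unit

def join(prev: np.ndarray, unit: Unit) -> np.ndarray:
    # Trim the raw recording down to its usable region.
    cur = unit.samples[unit.offset: len(unit.samples) - unit.cutoff]
    n = min(unit.overlap, len(prev), len(cur))
    fade = np.linspace(0.0, 1.0, n)
    # Crossfade the tail of the previous unit into the head of this one,
    # so the seam between recordings isn't an audible click.
    blended = prev[-n:] * (1 - fade) + cur[:n] * fade
    return np.concatenate([prev[:-n], blended, cur[n:]])
```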

Concatenation libraries sound unnatural outside their recorded range, so to extend that range you need to provide multiple pitches. F0, the fundamental frequency of the voice, is estimated and logged for each sample, and that is what gets analyzed and reproduced during synthesis.
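For a feel of how F0 might be estimated, here's a minimal autocorrelation sketch: the lag at which a frame best matches a shifted copy of itself is one pitch period. Real engines use far more robust estimators; the function name and constants here are assumptions for illustration.

```python
import numpy as np

def estimate_f0(frame: np.ndarray, sr: int, fmin=60, fmax=500) -> float:
    frame = frame - frame.mean()
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # autocorrelation, lags >= 0
    lo, hi = int(sr / fmax), int(sr / fmin)                          # plausible pitch-period range
    best_lag = lo + np.argmax(corr[lo:hi])
    return sr / best_lag

sr = 16000
t = np.arange(2048) / sr                              # one short analysis frame
print(estimate_f0(np.sin(2 * np.pi * 220 * t), sr))   # prints roughly 220
```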

Concatenation also has a lot of storage issues: I have one library that's 16 GB, while formant synths are tiny, around 60 kB, and even AI models are around 300 MB. Concatenation is also locked to a single language, and its pitch range is quite limited, which is why you have to record multiple pitches in a library. Samples are then processed in real time, which is resource-heavy, and they can still sound really unnatural.

Artificial Intelligence has rendered concatenation mostly obsolete, since it opens up possibilities that were impossible before. Kasane Teto was a synth character originally based in UTAU, a sort of free alternative to Vocaloid. There were a lot of attempts to port her to other engines, but nothing worked quite right until about 3 years ago, when she was released on an AI system, and now she's by far the most popular voice on that platform.

When I use “AI” in this context, it's not what most people are calling AI, meaning generative AI, large language models, and so on. Vocal synthesis “AI” is analytical AI, which is trained on a narrow data set and taught to break down a human voice sample and then recreate those sounds. It's more focused, it's more reliable, and it's very self-contained; it doesn't require massive data centers that eat up electricity and water.

There are a ton of forms, but the most common uses the mel spectrogram, where the units of voice are converted to a visual format, learned, and reproduced. True AI synthesis systems need a set of samples labeled with phonetic and pitch information, just like in concatenation, but the quantity of information needed is much smaller. The engine then learns this over and over until it's able to replicate it. False AI systems estimate the voice without labeling, such as voice-cloning systems, which just analyze a given sample and then alter your input to sound like that sample, rather than stitching sounds together.
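As a sketch of what that "visual format" looks like in practice, here's how a mel spectrogram could be computed with the librosa library. This assumes librosa is installed; the file name and the 80-band setting are placeholder choices, not any engine's actual configuration.

```python
import numpy as np
import librosa

y, sr = librosa.load("sample.wav", sr=22050)                  # any mono voice recording
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
mel_db = librosa.power_to_db(mel, ref=np.max)                 # log scale, like the plots you see
print(mel_db.shape)                                           # (80, number_of_frames)
```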

Because of how simple and effective it is, AI synthesis has boomed. The technology is evolving so fast that laws can't keep up, so it's fraught with ethical and legal issues. It can clone a voice nearly indistinguishably, and you don't need consent to do it. There are some new audio 'fingerprinting' technologies meant to track this sort of thing, but they're not entirely effective.

This violation of consent is even an issue with previously sampled actors: UNI was a Vocaloid 4 voice that came out several years ago, but recently her voice was updated to an AI version on Synthesizer V, which was never discussed or agreed upon. The voice actress for UNI is suing the company, since she never gave permission for that, and with the new technology the result is nearly indistinguishable from her real voice.

So, even though AI creates a lot of different possibilities, it also creates a lot of ethical and legal potholes. Even this company, which had worked with this VA before, did not do its due diligence in clearing the re-use of her samples and making sure the VA knew exactly what she was getting into.

©️ 2025 Wavyrn • All Rights Reserved