Text To Speech: 5 tips for video narration

May 5, 2021

Recording a voice-over for a video takes a lot of time, especially for people who are not voice-over professionals. With natural-sounding text to speech systems, you can turn your video script into a good voice-over in a fraction of the time.

In this article, we’ll explain how to use natural-sounding text to speech voices, and provide some key tips on getting the most out of modern text to voice generators.

What is text to speech?

As the name suggests, text to speech readers turn a textual script (for example a Word document or presenter notes in a PowerPoint document) into voice. Since the robotic voice generators of the early nineties, neural networks and machine learning systems have advanced so much that, for many purposes, computer-generated voices sound almost indistinguishable from humans. Machines are not fully there yet. For example, it’s difficult to convey emotion through text-to-speech. For informational videos, however, the benefits are significant. This is especially true when the author is not a native speaker of the target language, or has a strong accent. Text-to-speech voices bridge that gap easily. You just need to know how to write the script in the target language, or get a translator to do it for you, and the text-to-speech generator will do the rest.

Hiring a professional voice-over artist can cost hundreds of dollars for just a few minutes of good narration, and you will need to pay that again if you want to change a tiny fraction of the video in the future (even with the same actor and same equipment, two recordings done at different times might sound inconsistent). With text to speech engines, you get consistency even for later changes.

How to use text to speech?

With Narakeet, you can use natural text to speech, in 100 languages, with 800 voice generators (see Available voices for a quick text to speech demo). We use the best text to speech engines online, such as Google Text To Speech, IBM Watson Text to Speech, Amazon Polly and Yandex Text to Speech. Such text to speech software usually requires technical knowledge and programming skills, but Narakeet makes it easy to turn text to speech online - you just need to supply the script.

Here are some important things to consider when writing text-to-speech scripts:

Use full sentences

Deep Neural Network (DNN) speech generators can understand the context around the words, and produce natural-sounding narration for informational content. They work best with full sentences, rather than fragments. This allows the generator to insert the correct pauses between words, and deal with ambiguous pronunciation correctly.

Especially when narrating screencasts, authors often explain individual steps with just a few words, and take longer breaks between the words. For machine-generated voice, it is better to have a full sentence explaining each step.

This is also true for emphasis, and changing the voice speed. Instead of making a single word stand out in a sentence, create two short sentences and emphasise one of them.

You can add moderate emphasis by surrounding a sentence with underscores (_), one on each side.

For strong emphasis, surround a sentence with two asterisks (**).

For the opposite, reduced emphasis, surround a sentence with two tildes (~~).

This is my usual voice.
_This is more important._
**This is really important.**
~~This is not so much.~~
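If you generate scripts from a program, the emphasis markers above are easy to apply consistently. The following Python sketch is only an illustration: the `emphasise` function and the level names are hypothetical helpers, not part of Narakeet. It assumes the three markers work exactly as described above.

```python
# Hypothetical helper, not part of Narakeet: wraps a whole sentence
# in the emphasis markers described in this article.
MARKERS = {
    "moderate": "_",   # one underscore on each side
    "strong": "**",    # two asterisks on each side
    "reduced": "~~",   # two tildes on each side
}


def emphasise(sentence: str, level: str) -> str:
    """Surround a whole sentence with the marker for the given level."""
    marker = MARKERS[level]
    return f"{marker}{sentence.strip()}{marker}"


lines = [
    "This is my usual voice.",
    emphasise("This is more important.", "moderate"),
    emphasise("This is really important.", "strong"),
    emphasise("This is not so much.", "reduced"),
]
script = "\n".join(lines)
print(script)
```

Printing `script` reproduces the four-line example above. The helper wraps whole sentences rather than single words, in line with the advice to emphasise a short sentence instead of a word.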

Use short, meaningful paragraphs

To provide as much context as possible, Narakeet will generate voice-over for entire paragraphs when possible. This provides the best results in most situations, allowing text-to-speech translators to have the context of multiple related sentences when reading your script. To benefit from this approach, create relatively short paragraphs with a few sentences that are closely related. Split unrelated text into separate paragraphs.

This also helps create slightly longer breaks between paragraphs, so your audience can digest information.
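One rough way to check a script against this advice is to count the sentences in each paragraph. The sketch below is a hypothetical helper, not a Narakeet feature. It assumes paragraphs are separated by blank lines, and the limit of four sentences is an arbitrary choice for illustration, since "a few sentences" is a matter of judgement.

```python
import re

MAX_SENTENCES = 4  # arbitrary threshold chosen for this sketch


def split_paragraphs(script: str) -> list[str]:
    """Split a script into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", script) if p.strip()]


def count_sentences(paragraph: str) -> int:
    """Rough count: runs of '.', '!' or '?' followed by whitespace or the end."""
    return len(re.findall(r"[.!?]+(?:\s|$)", paragraph))


def long_paragraphs(script: str, limit: int = MAX_SENTENCES) -> list[str]:
    """Return the paragraphs that contain more sentences than the limit."""
    return [p for p in split_paragraphs(script) if count_sentences(p) > limit]


script = "One. Two. Three. Four. Five.\n\nA short paragraph."
for paragraph in long_paragraphs(script):
    print("Consider splitting:", paragraph)
```

The sentence count is deliberately crude and will overcount abbreviations that end in a full stop, so treat the result as a hint for review rather than a rule.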

Insert longer pauses where needed

Sometimes, it’s useful to take a slightly longer break between sentences or paragraphs. When you record the narration yourself, and have the context of the video you are doing a voice-over on, it’s easy to take pauses intuitively. But computer-generated voices don’t know what will be played as they speak.

You can add pauses by using a (pause) stage direction. Make sure to add it in a separate paragraph. For example, the following script adds a 2-second pause between the two sentences.

Think about that for a moment.

(pause: 2)

The correct answer is 5.
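When a script is assembled by a program, the same pause can be inserted between paragraphs automatically. This is a minimal Python sketch; `with_pauses` is a hypothetical helper, not part of Narakeet, and it assumes only the `(pause: N)` form with a whole number of seconds shown above.

```python
def with_pauses(paragraphs: list[str], seconds: int) -> str:
    """Join paragraphs, placing a pause stage direction between each pair.

    The stage direction is surrounded by blank lines so that it ends up
    in a separate paragraph.
    """
    separator = f"\n\n(pause: {seconds})\n\n"
    return separator.join(paragraphs)


script = with_pauses(
    ["Think about that for a moment.", "The correct answer is 5."],
    seconds=2,
)
print(script)
```

With these two paragraphs and `seconds=2`, the result matches the example script above.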

Avoid jokes

You can instruct text-to-speech voice generators to read certain parts faster or slower, or emphasise certain words, but the big limitation is that they cannot easily convey emotion. Jokes require precise timing, and spoken jokes often rely on word play, so it’s best to avoid them for now. If you want to include a joke, perhaps turn it into a dialogue between a computer-generated voice and a short recording of your own voice (or a professional artist).

With Narakeet scripts, you can mix text-to-speech and pre-recorded audio by including an (audio) stage direction. For example, the following snippet will use a computer-generated voice for the question, but play a pre-recorded answer.

Why did the chicken cross the road?

(audio: chicken-answer.mp3)

Vary the voice speed and volume

Computer-generated voices are very consistent, which is great when experimenting and replacing parts later, but they can sound monotonous over longer text. Humans adjust the speed and volume of their voice naturally, and you can get text-to-speech readers to do the same using the voice-speed and voice-volume stage directions. Here are a few examples:

This script will slow down the reader for the first sentence, and go back to normal speed for the second:

(voice-speed: slow)

I'm usually slow in the mornings.

(voice-speed: normal)

After a coffee I speak normally.

This script will raise the voice for the first sentence, and pronounce the second at normal volume:

(voice-volume: loud)

I sometimes get angry!

(voice-volume: normal)

Then I calm down.
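Both examples follow the same pattern: set a value, speak a paragraph, then reset to normal. If you build scripts programmatically, that pattern can be wrapped in a small function. The Python sketch below is hypothetical and not part of Narakeet. It uses only the directions and values shown in this article (`voice-speed: slow`, `voice-volume: loud` and `normal`); check the format reference for other supported values.

```python
def with_setting(paragraph: str, setting: str, value: str) -> str:
    """Apply a stage direction to one paragraph, then reset it to normal.

    Each stage direction goes in its own paragraph, separated by blank lines.
    """
    return "\n\n".join([
        f"({setting}: {value})",
        paragraph,
        f"({setting}: normal)",
    ])


script = "\n\n".join([
    with_setting("I'm usually slow in the mornings.", "voice-speed", "slow"),
    "After a coffee I speak normally.",
    with_setting("I sometimes get angry!", "voice-volume", "loud"),
    "Then I calm down.",
])
print(script)
```

Resetting straight after each paragraph keeps a changed speed or volume from carrying over into the rest of the script by accident.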

For some more advanced tips on how to control narration, see the format reference.

Cover photo by Kelly Sikkema on Unsplash
