Teaching Your Application to Really Talk

Friday, March 26th, 2010

Speech synthesis, otherwise known as text to speech (TTS), is a technology that synthesizes a human voice from text input. Speech synthesis is the default behavior for voice calls on the Tropo platform. The Tropo ‘say’ verb provides the TTS capability: it takes a string of text and speaks it back. The verb can also take a URL to a ‘wav’ or ‘mp3’ file, so that pre-recorded audio can be played as well.

When it comes to teaching your application to speak, we follow the Perl ethos of making “the simple things easy and difficult things possible”. Your application can speak well using nothing more than our simple APIs, or it can be as sophisticated and emotional as you like, because Tropo exposes powerful capabilities for giving your voices character.

For our first example we will simply say:

say 'I like squirrels!'

Which then renders this audio.

Next, we may choose a voice that speaks any of the languages supported by Tropo (US/UK English, Castilian/Mexican Spanish, French, German, Italian and Dutch). Let's give French a try for our next example:

say "J'aime les écureuils!", :voice => 'florence'

Which then renders this audio.

Those were the simple examples that anyone may use to add a little speech to their applications. But remember, we also make the difficult possible for those who really want to make their characters speak, because sometimes simply customizing the voice is not enough. There are cases when you’d also like control over pitch, volume and intonation. For those cases, Tropo natively supports a standard called the Speech Synthesis Markup Language (SSML).

SSML is a W3C standard for controlling the pace, tone, pitch and all-around sound of computer-generated voices. Here’s a Ruby script that repeats the same sentence four times, each time at a lower speed:

answer
# The first sentence plays at normal speed; each prosody element slows the next one further.
say "<speak> I like squirrels!
I <prosody rate='-10%'>like squirrels!</prosody>
I <prosody rate='-30%'>like squirrels!</prosody>
I <prosody rate='-50%'>like squirrels!</prosody>
</speak>"
hangup

Which renders this audio. The previous example used the rate attribute of the SSML prosody element to control the playback speed. There are many other elements and attributes you may use, including emphasis and phoneme. To learn more about SSML and related technologies, check out the W3C site at http://www.w3.org/TR/speech-synthesis/.
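If you would rather generate this markup than type it by hand, something like the following might do. It is only a sketch in plain Ruby: the helper names (xml_escape, prosody, slowing_ssml) are made up for this illustration and are not part of Tropo, and unlike the script above it wraps the whole sentence in each prosody element.

```ruby
# Build an SSML string that repeats one sentence at several speaking rates.
# Only plain Ruby string handling is used; nothing here calls Tropo.

# Escape the characters that are special in XML text and attribute values.
def xml_escape(text)
  text.gsub('&', '&amp;').gsub('<', '&lt;').gsub('>', '&gt;').gsub("'", '&apos;')
end

# Wrap `text` in a <prosody> element with the given rate, e.g. '-30%'.
def prosody(text, rate)
  "<prosody rate='#{xml_escape(rate)}'>#{xml_escape(text)}</prosody>"
end

# Hypothetical helper: one <speak> document with one line per rate.
def slowing_ssml(text, rates)
  lines = rates.map { |rate| prosody(text, rate) }
  "<speak>\n#{lines.join("\n")}\n</speak>"
end

puts slowing_ssml('I like squirrels!', ['-10%', '-30%', '-50%'])
```

In a Tropo script, the string returned by slowing_ssml would be handed to say in place of the hand-written literal.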

If you would like to call in and listen to these examples live, you may do so by dialing +990009369991429940 on Skype (free) or calling +1.408.940.5920 from any phone. What are you waiting for? Get started by signing up for an always-free developer account at Tropo.com.

Best Practices in Audio Files for Tropo

Friday, April 24th, 2009

You’ve got this great MP3 of your young child doing all the names for your brand-spankin’ new auto-attendant that you’ve created using Tropo, and when you go to call the application, it now sounds like some space alien has taken over your program! Dude, what happened?!?

Well, as it turns out, while cellphones and MP3 players can play audio files like stereo MP3s, the files must be “local” on your device. See, telephony standards have just not “kept up with the times.” In telephony the standard is 8bit, 8kHz u-law formatted WAV, and that standard is generations removed from that nice 44kHz, 32bit, stereo MP3 you just recorded. So, in order for your MP3 to play, it must be converted on the fly to the proper telephony standard, and unfortunately that usually leads to less than stellar results.

The Tropo platform supports a number of different audio formats. For optimum performance in your application, though, it is always best to have your files in 8bit, 8kHz u-law format from the start.

The supported sound formats (and their proper file extensions) for the Tropo platform are as follows:

  • 8kHz, 8bit u-law (wav or raw) (*.wav or *.ulaw)
  • 8kHz, 8bit a-law (wav or raw) (*.wav or *.alaw)
  • 8kHz, 8bit pcm (wav) (*.wav)
  • 8kHz, 16bit pcm (wav or raw) (*.wav or *.pcm)
  • MS-GSM (wav) (*.wav)
  • GSM 6.10 (raw) (*.gsm)
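
Since several of these formats share the .wav extension, the extension alone does not tell you which encoding is inside. The sketch below, in plain Ruby, reads the “fmt ” chunk of a WAV header and reports the encoding, sample rate and bit depth. The function names are hypothetical, the numeric codes are the standard RIFF/WAVE format tags, and the mono check is our assumption about telephony audio rather than something stated in the list above.

```ruby
require 'stringio'

# Standard RIFF/WAVE format tags for the encodings listed above.
FORMAT_NAMES = { 1 => 'pcm', 6 => 'a-law', 7 => 'u-law', 49 => 'gsm' }.freeze

# Read the "fmt " chunk of a RIFF/WAVE stream and return its key fields,
# or nil if the stream is not a WAVE file or has no usable "fmt " chunk.
def wav_format(io)
  riff, _size, wave = io.read(12).to_s.unpack('a4Va4')
  return nil unless riff == 'RIFF' && wave == 'WAVE'

  # Walk the chunk list until the "fmt " chunk turns up.
  while (header = io.read(8)) && header.bytesize == 8
    id, size = header.unpack('a4V')
    if id == 'fmt '
      body = io.read(size).to_s
      return nil if body.bytesize < 16
      tag, channels, rate, _byte_rate, _align, bits = body.unpack('vvVVvv')
      return { encoding: FORMAT_NAMES.fetch(tag, "unknown (#{tag})"),
               channels: channels, sample_rate: rate, bits: bits }
    end
    # Skip this chunk; chunks are padded to an even number of bytes.
    io.seek(size + (size.odd? ? 1 : 0), IO::SEEK_CUR)
  end
  nil
end

# True when the header matches 8kHz, 8bit u-law (mono is our assumption).
def telephony_ready?(fmt)
  !fmt.nil? && fmt[:encoding] == 'u-law' && fmt[:sample_rate] == 8000 &&
    fmt[:bits] == 8 && fmt[:channels] == 1
end

# Demo on a header built in memory, so no audio file is needed.
fmt_body = [7, 1, 8000, 8000, 1, 8].pack('vvVVvv')
sample = 'RIFF' + [4 + 8 + fmt_body.bytesize].pack('V') + 'WAVE' +
         'fmt ' + [fmt_body.bytesize].pack('V') + fmt_body
info = wav_format(StringIO.new(sample))
puts "#{info[:encoding]}, #{info[:sample_rate]} Hz, #{info[:bits]} bit"
puts "ready for telephony: #{telephony_ready?(info)}"
```

For a real file you might call wav_format(File.open('prompt.wav', 'rb')), where prompt.wav is a placeholder name. Raw files (*.ulaw, *.alaw, *.pcm, *.gsm) have no header, so this check does not apply to them.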

Recording your own prompts

  • If you don’t plan on mixing and editing the prompts you record, it’s best to just record at 8kHz/16bit and then save or convert to 8kHz/8bit u-law.
  • If you plan on mixing and editing the prompts you have recorded, it’s best to record at 48kHz/16bit rather than 44kHz, and then save or convert to 8kHz/8bit u-law when you are done mixing and editing your prompts. The down-sample to 8kHz works much better from a rate that is a multiple of 8k: 48,000 / 8,000 is exactly 6, while the usual 44.1kHz rate gives 44,100 / 8,000 = 5.5125.
  • Most definitely use the gain control tools in your audio software to adjust the RMS amplitude and peak amplitude of your audio recordings so that audio is loud enough to be heard by customers; but not so loud that prompt echoes will be heard by the recognition engine; and not so loud that pops and clicks are heard as a result of clipping.
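
To put numbers on “loud enough, but not too loud”, you can measure the peak and RMS level of a recording yourself. The following is a minimal sketch in plain Ruby, assuming the audio is already available as an array of 16bit signed PCM samples. The function names are made up for this illustration, and no target levels are suggested here because none are given above.

```ruby
# Measure peak and RMS level of 16bit signed PCM samples in dBFS
# (decibels relative to full scale; 0 dBFS is the loudest possible value).
# Both functions expect a non-empty array of integers.

FULL_SCALE = 32_768.0 # magnitude of the most negative 16bit sample

def peak_dbfs(samples)
  peak = samples.map(&:abs).max
  20 * Math.log10(peak / FULL_SCALE)
end

def rms_dbfs(samples)
  mean_square = samples.sum { |s| (s / FULL_SCALE)**2 } / samples.size
  10 * Math.log10(mean_square)
end

# Samples sitting at the 16bit limits are a hint that clipping occurred.
def clipped_count(samples)
  samples.count { |s| s >= 32_767 || s <= -32_768 }
end

# Demo: one second of a 440 Hz tone at half of full scale, sampled at 8kHz.
tone = Array.new(8000) { |n| (Math.sin(2 * Math::PI * 440 * n / 8000.0) * 16_384).round }
puts "peak:    #{peak_dbfs(tone).round(2)} dBFS"
puts "rms:     #{rms_dbfs(tone).round(2)} dBFS"
puts "clipped: #{clipped_count(tone)} samples"
```

For a headerless 16bit little-endian file, the samples could be loaded with File.binread(path).unpack('s<*'), where path is whatever file you are checking.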

Tools

A great, free, cross-platform tool for working with audio files is Audacity. Its wiki and forums offer outstanding support for folks using the application for the first time.

Another cross-platform, open-source audio tool is SoX. SoX is a command-line utility, but it has some outstanding capabilities for processing those “hard to fix” audio files.
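
As an illustration, here is how a conversion to 8kHz, 8bit u-law WAV might be driven from Ruby. This is a sketch, not a recipe: the helper name and file names are placeholders, mono output is our assumption, and the flags (-r, -c, -e, -b) are those of reasonably recent SoX releases, so check the manual for the version you have installed. The code only builds and prints the command; it does not run SoX.

```ruby
require 'shellwords'

# Build (but do not run) a SoX command line that converts `input`
# to an 8kHz, 8bit, mono u-law file at `output`.
def sox_ulaw_command(input, output)
  [
    'sox', input,
    '-r', '8000',   # output sample rate in Hz
    '-c', '1',      # one channel (mono), our assumption for telephony
    '-e', 'mu-law', # u-law encoding
    '-b', '8',      # 8 bits per sample
    output
  ]
end

cmd = sox_ulaw_command('prompt-48k.wav', 'prompt-8k-ulaw.wav')
puts Shellwords.join(cmd)

# To actually convert, with SoX installed and on the PATH:
#   system(*cmd) or abort('sox failed')
```

Passing the arguments as an array to system avoids shell quoting problems with file names that contain spaces.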

For those PC-only folks out there, GoldWave is another freeware audio editor that you may want to consider if neither Audacity nor SoX fits the bill for you.

We hope that these hints help you in preparing your audio files for successful use on Tropo.