Overview

SmartBody is responsible for acquiring the audio associated with a character utterance. The audio is preferably played back by the game engine, but SmartBody can also play it back itself. To acquire the audio, SmartBody can either:

  • read in an existing prerecorded speech audio file
  • send a request to a text-to-speech engine

This document focuses on the latter. 

SmartBody sends a RemoteSpeechCmd message to the TtsRelay module, requesting that a line of text be converted into audio. The message specifies which voice to use and where to put the generated file. TtsRelay sends back a RemoteSpeechReply message containing the exact file location, a viseme schedule with detailed timing information for lip-synching, and word boundary timing information for synchronizing nonverbal behavior as specified through BML.

TTS Engines

Rhetorical (RVoiceRelay)

Voice Codes:
set character doctor voice remote M021 <- Saso Doctor's voice
set character elder voice remote M009 <- Saso Elder's voice

Cerevoice (CerevoiceRelay)

Voice Codes:
set character doctor voice remote star
set character doctor voice remote katherine
set character doctor voice remote starconv

Cepstral (CepstralRelay)

MSSpeech (MSSpeechRelay)

Voice Codes:
set character doctor voice remote BradVoice

Festival (FestivalRelay)

Voice Codes:
set character doctor voice remote BradVoice

RemoteSpeech Interface

To trigger a TTS call:

sbm bml char doctor speech "Hello world.  Testing Text to Speech"

Sent by SmartBody to TTS Engine:

RemoteSpeechCmd speak doctor 1 M021 ../../data/cache/audio/utt_20110528_175743_doctor_1.aiff
<?xml version="1.0" encoding="UTF-8"?>
<speech type="text/plain">
   Hello world.  Testing Text to Speech
</speech>
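
The layout of the request above can be reproduced in a few lines of Python. This is a minimal sketch, not SmartBody's own code: the function name build_remote_speech_cmd and its parameter names are hypothetical, and the reading of the header fields (character, utterance id, voice code, output path) is inferred from the example rather than taken from a specification.

```python
from xml.sax.saxutils import escape


def build_remote_speech_cmd(character, utterance_id, voice, audio_path, text):
    """Assemble a RemoteSpeechCmd message shaped like the example above.

    Field meanings are inferred from that example, not from a specification.
    """
    header = f"RemoteSpeechCmd speak {character} {utterance_id} {voice} {audio_path}"
    body = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<speech type="text/plain">\n'
        f"   {escape(text)}\n"  # escape &, < and > so the body stays well-formed
        "</speech>"
    )
    return header + "\n" + body


message = build_remote_speech_cmd(
    "doctor", 1, "M021",
    "../../data/cache/audio/utt_20110528_175743_doctor_1.aiff",
    "Hello world.  Testing Text to Speech",
)
print(message)
```

Escaping the text is a precaution added here; the original example contains no characters that would need it.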

RVoiceRelay Example:

Actual message sent to Rhetorical:

<?xml version="1.0" encoding="UTF-8"?>
<speech type="text/plain">Hello world.  Testing Text to Speech</speech>

Sent by TTS Engine:

RemoteSpeechReply doctor 2 OK:
<?xml version="1.0" encoding="UTF-8"?>
<speak>
   <soundFile name="d:\edwork\saso\core\beavin\..\..\data\cache\audio\utt_20110528_180148_doctor_2.aiff"/>
   <viseme start="0.0" type="_"/>
   <word end="0.4049886621315193" start="0.049977324263038546">
      <viseme start="0.049977324263038546" type="Ih"/>
      <viseme start="0.14498866213151929" type="Ih"/>
      <viseme start="0.2" type="D"/>
      <viseme start="0.2549659863945578" type="OW"/>
   </word>
   <word end="0.8099773242630386" start="0.4049886621315193">
      <viseme start="0.4049886621315193" type="OO"/>
      <viseme start="0.5199546485260771" type="Er"/>
      <viseme start="0.5849886621315192" type="R"/>
      <viseme start="0.6649886621315193" type="D"/>
      <viseme start="0.7699773242630386" type="D"/>
   </word>
   <viseme start="0.8099773242630386" type="_"/>
   <viseme start="0.860498866213152" type="_"/>
   <viseme start="1.060498866213152" type="_"/>
   <word end="1.5854875283446712" start="1.1104761904761904">
      <viseme start="1.1104761904761904" type="D"/>
      <viseme start="1.1574603174603175" type="Ih"/>
      <viseme start="1.2354648526077097" type="Z"/>
      <viseme start="1.3304761904761904" type="D"/>
      <viseme start="1.3824943310657596" type="Ih"/>
      <viseme start="1.4374603174603175" type="NG"/>
   </word>
   <word end="1.8724716553287981" start="1.5854875283446712">
      <viseme start="1.5854875283446712" type="D"/>
      <viseme start="1.6424943310657596" type="Ih"/>
      <viseme start="1.7174603174603174" type="KG"/>
      <viseme start="1.7674829931972789" type="Z"/>
      <viseme start="1.8374603174603175" type="D"/>
   </word>
   <word end="1.927482993197279" start="1.8724716553287981">
      <viseme start="1.8724716553287981" type="D"/>
      <viseme start="1.9024943310657596" type="Ih"/>
   </word>
   <word end="2.408480725623583" start="1.927482993197279">
      <viseme start="1.927482993197279" type="Z"/>
      <viseme start="2.0224943310657597" type="BMP"/>
      <viseme start="2.1174603174603175" type="EE"/>
      <viseme start="2.207482993197279" type="j"/>
   </word>
   <viseme start="2.408480725623583" type="_"/>
   <viseme start="2.4584580498866213" type="_"/>
</speak>
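
A consumer of such a reply needs to separate the header line from the XML payload and then read the sound file, word boundaries and viseme schedule. The following is a sketch under assumptions: parse_remote_speech_reply is a hypothetical helper, the header is assumed to have the four fields seen above, and the sample is an abridged copy of the reply that uses a short file name instead of the full Windows path.

```python
import xml.etree.ElementTree as ET


def parse_remote_speech_reply(message):
    """Split a RemoteSpeechReply into header fields and timing data."""
    header, _, xml_rest = message.partition("<?xml")
    # Header looks like: "RemoteSpeechReply doctor 2 OK:"
    _, character, utterance_id, status = header.strip().split(None, 3)
    root = ET.fromstring(("<?xml" + xml_rest).encode("utf-8"))
    words = [(float(w.get("start")), float(w.get("end"))) for w in root.iter("word")]
    # iter() walks in document order, so nested and top-level visemes stay sorted
    visemes = [(float(v.get("start")), v.get("type")) for v in root.iter("viseme")]
    return {
        "character": character,
        "utterance_id": int(utterance_id),
        "ok": status.startswith("OK"),
        "sound_file": root.find("soundFile").get("name"),
        "words": words,
        "visemes": visemes,
    }


reply = """RemoteSpeechReply doctor 2 OK:
<?xml version="1.0" encoding="UTF-8"?>
<speak>
   <soundFile name="utt_20110528_180148_doctor_2.aiff"/>
   <viseme start="0.0" type="_"/>
   <word end="0.4049886621315193" start="0.049977324263038546">
      <viseme start="0.049977324263038546" type="Ih"/>
      <viseme start="0.2" type="D"/>
   </word>
   <viseme start="0.8099773242630386" type="_"/>
</speak>"""

parsed = parse_remote_speech_reply(reply)
print(parsed["sound_file"], parsed["words"])
```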

MSSpeechRelay Example:

Actual message sent to MSSpeech:

<speak version="1.0" xml:lang="en-US">Hello world.  Testing Text to Speech .</speak>

(note the added period at the end)

Sent by TTS Engine:

RemoteSpeechReply doctor 1 OK:
<?xml version="1.0" encoding="UTF-8"?>
<speak>
   <soundFile name="d:\edwork\vhtoolkit\data\cache\audio\utt_20110528_180527_doctor_1.wav"/>
   <viseme start="0" type="_"/>
   <viseme start="0.003" type="Oh"/>
   <viseme start="0.047" type="Ih"/>
   <viseme start="0.098" type="D"/>
   <viseme start="0.258" type="Oh"/>
   <viseme start="0.418" type="Oh"/>
   <viseme start="0.479" type="Er"/>
   <viseme start="0.54" type="R"/>
   <viseme start="0.601" type="D"/>
   <viseme start="0.695" type="D"/>
   <viseme start="0.745" type="_"/>
   <viseme start="1.367" type="_"/>
   <viseme start="1.37" type="D"/>
   <viseme start="1.461" type="Ih"/>
   <viseme start="1.546" type="Z"/>
   <viseme start="1.6" type="D"/>
   <viseme start="1.654" type="Ih"/>
   <viseme start="1.729" type="KG"/>
   <viseme start="1.804" type="D"/>
   <viseme start="1.9" type="Ih"/>
   <viseme start="2.022" type="KG"/>
   <viseme start="2.087" type="Z"/>
   <viseme start="2.16" type="D"/>
   <viseme start="2.233" type="D"/>
   <viseme start="2.297" type="Oh"/>
   <viseme start="2.341" type="Z"/>
   <viseme start="2.425" type="BMP"/>
   <viseme start="2.509" type="Ih"/>
   <viseme start="2.606" type="j"/>
   <viseme start="2.73" type="_"/>
</speak>
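
The MSSpeechRelay reply above lists visemes as a flat sequence with start times only. If one assumes that each viseme is held until the next one starts (an assumption, not something the reply states), durations can be derived as in this sketch. The function name viseme_durations is hypothetical.

```python
def viseme_durations(schedule):
    """Pair each viseme with the gap until the next viseme starts.

    The last entry has no successor, so its duration is left as None.
    """
    result = []
    for index, (start, viseme) in enumerate(schedule):
        if index + 1 < len(schedule):
            duration = round(schedule[index + 1][0] - start, 6)
        else:
            duration = None
        result.append((viseme, start, duration))
    return result


# First entries of the MSSpeechRelay reply above.
schedule = [(0.0, "_"), (0.003, "Oh"), (0.047, "Ih"), (0.098, "D"), (0.258, "Oh")]
for viseme, start, duration in viseme_durations(schedule):
    print(viseme, start, duration)
```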

CerevoiceRelay Example:

Actual text sent to the Cerevoice engine:

<?xml version="1.0" encoding="UTF-8"?>
<speech type="text/plain">Hello world.  Testing Text to Speech </speech>

(Note the trailing space. Also note that CerevoiceRelay removes punctuation because of an apparent bug in Cerevoice.)

Sent by TTS Engine (CerevoiceRelay Example) (hand-formatted):

RemoteSpeechReply doctor 1 OK:
<?xml version="1.0" encoding="UTF-8"?>
<speak>
   <soundFile name="d:\edwork\saso\data\cache\audio\utt_20110621_192933_doctor_1.wav"/>
   <viseme start="0.000000" type="_"/>
   <mark name="sp1:T0" time="0.010975"/>
   <mark name="sp1:T1" time="0.010975"/>
   <word end="2.468209" start="0.010975">
      <viseme start="0.010975" type="Ih"/>
      <viseme start="0.090975" type="Ih"/>
      <viseme start="0.120952" type="D"/>
      <viseme start="0.231157" type="Oh"/>
      <viseme start="0.430088" type="OO"/>
      <viseme start="0.527008" type="Er"/>
      <viseme start="0.663673" type="D"/>
      <viseme start="0.723719" type="D"/>
      <viseme start="0.768662" type="D"/>
      <viseme start="0.848662" type="Ih"/>
      <viseme start="0.948662" type="Z"/>
      <viseme start="1.113696" type="D"/>
      <viseme start="1.173651" type="Ih"/>
      <viseme start="1.223510" type="NG"/>
      <viseme start="1.357624" type="D"/>
      <viseme start="1.431655" type="Ih"/>
      <viseme start="1.511610" type="KG"/>
      <viseme start="1.566621" type="Z"/>
      <viseme start="1.636644" type="D"/>
      <viseme start="1.696644" type="Oh"/>
      <viseme start="1.833379" type="Z"/>
      <viseme start="1.958231" type="BMP"/>
      <viseme start="2.028209" type="EE"/>
      <viseme start="2.188209" type="j"/>
   </word>
   <mark name="sp1:T2" time="2.468209"/>
   <mark name="sp1:T3" time="2.468209"/>
   <viseme start="2.468209" type="_"/>
</speak>

New output (11/7/11):

RemoteSpeechReply doctor 1 OK: 
<?xml version="1.0" encoding="UTF-8"?>
<speak>
<soundFile name="d:\edwork\saso\core\TtsSpeechRelay\bin\data\cache\audio\utt_20110528_175743_doctor_1.wav.wav"/>
   <viseme start="0.000000" type="_"/>
   <mark name="sp1:T0" time="0.010975"/>
   <mark name="sp1:T1" time="0.010975"/>
   <word end="0.353100" start="0.010975">
      <viseme start="0.010975" type="Ih"/>
      <viseme start="0.099709" type="Ih"/>
      <viseme start="0.126943" type="D"/>
      <viseme start="0.252789" type="Oh"/>
   </word>
   <mark name="sp1:T2" time="0.353100"/>
   <mark name="sp1:T3" time="0.353100"/>
   <word end="0.762222" start="0.353100">
      <viseme start="0.353100" type="OO"/>
      <viseme start="0.446472" type="Er"/>
      <viseme start="0.532245" type="D"/>
      <viseme start="0.602222" type="D"/>
   </word>
   <mark name="sp1:T4" time="0.762222"/>
   <mark name="sp1:T5" time="0.762222"/>
   <viseme start="0.762222" type="_"/>
   <mark name="sp1:T6" time="0.000000"/>
   <mark name="sp1:T7" time="0.962222"/>
   <viseme start="0.000000" type="_"/>
   <mark name="sp1:T8" time="1.162222"/>
   <mark name="sp1:T9" time="1.162222"/>
   <word end="1.617595" start="1.162222">
      <viseme start="1.162222" type="D"/>
      <viseme start="1.254784" type="Ih"/>
      <viseme start="1.340280" type="Z"/>
      <viseme start="1.419229" type="D"/>
      <viseme start="1.479229" type="Ih"/>
      <viseme start="1.509215" type="NG"/>
   </word>
   <mark name="sp1:T10" time="1.617595"/>
   <mark name="sp1:T11" time="1.617595"/>
   <word end="2.077460" start="1.617595">
      <viseme start="1.617595" type="D"/>
      <viseme start="1.747483" type="Ih"/>
      <viseme start="1.827483" type="KG"/>
      <viseme start="1.927438" type="Z"/>
      <viseme start="2.037460" type="D"/>
   </word>
   <mark name="sp1:T12" time="2.077460"/>
   <mark name="sp1:T13" time="2.077460"/>
   <word end="2.227483" start="2.077460">
      <viseme start="2.077460" type="D"/>
      <viseme start="2.197460" type="Ih"/>
   </word>
   <mark name="sp1:T14" time="2.227483"/>
   <mark name="sp1:T15" time="2.227483"/>
   <word end="2.847438" start="2.227483">
      <viseme start="2.227483" type="Z"/>
      <viseme start="2.347483" type="BMP"/>
      <viseme start="2.427438" type="EE"/>
      <viseme start="2.587438" type="j"/>
   </word>
   <mark name="sp1:T16" time="2.847438"/>
   <mark name="sp1:T17" time="2.847438"/>
   <viseme start="2.847438" type="_"/>
</speak>
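
In the output above, mark sp1:T6 carries time 0.000000 even though the marks before it are already at 0.762222. A receiver that relies on increasing times may want to detect such entries before scheduling behavior. The check below is a minimal sketch; find_out_of_order_marks is a hypothetical name, and how SmartBody itself treats these entries is not covered here.

```python
import xml.etree.ElementTree as ET


def find_out_of_order_marks(speak_xml):
    """Return names of <mark> elements whose time is earlier than a previous mark."""
    root = ET.fromstring(speak_xml)
    latest = float("-inf")
    offenders = []
    for mark in root.iter("mark"):
        time = float(mark.get("time"))
        if time < latest:
            offenders.append(mark.get("name"))
        else:
            latest = time
    return offenders


# Excerpt of the marks around the pause in the output above.
excerpt = """<speak>
   <mark name="sp1:T4" time="0.762222"/>
   <mark name="sp1:T5" time="0.762222"/>
   <viseme start="0.762222" type="_"/>
   <mark name="sp1:T6" time="0.000000"/>
   <mark name="sp1:T7" time="0.962222"/>
   <mark name="sp1:T8" time="1.162222"/>
</speak>"""

print(find_out_of_order_marks(excerpt))
```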

FestivalRelay example:

Actual text sent to Festival:

<?xml version="1.0" encoding="UTF-8"?>
<speech type="text/plain">Hello world.  Testing Text to Speech </speech>

(Note that FestivalRelay edits this text, and it is eventually sent out as 'Helloworld.TestingTexttoSpeech'.)

Sent by TTS Engine (FestivalRelay Example) (hand-formatted):

RemoteSpeechReply doctor 7 OK:
<?xml version="1.0" encoding="UTF-8"?>
<speak>
   <soundFile name="d:\edwork\vhtoolkit\bin\FestivalRelay\data\cache\festival\utt_20110722_185051_doctor_7.wav"/>
   <viseme start="0.000000" type="_" />
   <mark name="T0" time="0.080000"/>
   <word end="0.640000" start="0.080000" >
      <viseme start="0.080000" type="Ih" />
      <viseme start="0.160000" type="Ih" />
      <viseme start="0.240000" type="D" />
      <viseme start="0.320000" type="Oh" />
      <viseme start="0.400000" type="Er" />
      <viseme start="0.440000" type="R" />
      <mark name="T1" time="0.480000"/>
   </word>
      <mark name="T2" time="0.080000"/>
      <word end="0.640000" start="0.080000" >
      <viseme start="0.480000" type="D" />
      <viseme start="0.560000" type="D" />
      <mark name="T3" time="0.640000"/>
   </word>
      <mark name="T4" time="0.640000"/>
      <word end="0.880000" start="0.640000" >
      <viseme start="0.640000" type="D" />
      <viseme start="0.720000" type="Ao" />
      <viseme start="0.800000" type="D" />
      <mark name="T5" time="0.880000"/>
   </word>
      <mark name="T6" time="0.880000"/>
      <word end="2.160000" start="0.880000" >
      <viseme start="0.880000" type="D" />
      <viseme start="0.960000" type="Ih" />
      <viseme start="1.040000" type="Z" />
      <viseme start="1.120000" type="D" />
      <viseme start="1.200000" type="Ih" />
      <viseme start="1.280000" type="NG" />
      <viseme start="1.360000" type="D" />
      <viseme start="1.440000" type="Ih" />
      <viseme start="1.520000" type="KG" />
      <viseme start="1.600000" type="Z" />
      <viseme start="1.680000" type="D" />
      <viseme start="1.760000" type="Ao" />
      <viseme start="1.840000" type="Z" />
      <viseme start="1.920000" type="BMP" />
      <viseme start="2.000000" type="EE" />
      <viseme start="2.080000" type="j" />
      <mark name="T7" time="2.160000"/>
   </word>
   <viseme start="2.160000" type="_" />
</speak>

New output (11/7/11):

RemoteSpeechReply doctor 1 OK: <?xml version="1.0" encoding="UTF-8"?>
<speak>
 <soundFile name="d:\edwork\saso\core\TtsSpeechRelay\bin\data\cache\audio\utt_20110528_175743_doctor_1.wav"/>
<mark name="T0" time="0.210000"/>
 <word end="0.795159" start="0.210000" >
 <viseme start="0.367043" type="D" />
 <viseme start="0.704177" type="D" />
 <viseme start="0.756153" type="D" />
 <mark name="T1" time="0.795159"/>
 </word>
 <mark name="T2" time="0.795159"/>
 <word end="1.013328" start="0.795159" >
 <viseme start="0.795159" type="D" />
 <viseme start="0.953081" type="D" />
 <mark name="T3" time="1.013328"/>
 </word>
 <mark name="T4" time="1.013328"/>
 <word end="2.455301" start="1.013328" >
 <viseme start="1.013328" type="D" />
 <viseme start="1.210314" type="Z" />
 <viseme start="1.282180" type="D" />
 <viseme start="1.358164" type="Ih" />
 <viseme start="1.394886" type="NG" />
 <viseme start="1.452691" type="D" />
 <viseme start="1.608044" type="KG" />
 <viseme start="1.690684" type="Z" />
 <viseme start="1.788436" type="D" />
 <viseme start="1.962315" type="Z" />
 <viseme start="2.065681" type="BMP" />
 <viseme start="2.312202" type="j" />
 <mark name="T5" time="2.455301"/>
 </word>
 </speak>

NPCEditor/NVBG Example

Utterance #20 in Toolkit

RemoteSpeechCmd sent by SBM

RemoteSpeechCmd speak brad 1 BradVoiceFestival ../../data/cache/audio/utt_20110809_151922_brad_1.aiff
<?xml version="1.0" encoding="utf-16"?>
<speech id="sp1" ref="tech_sapiTTS" type="application/ssml+xml">
         <mark name="T0" />SAPI
	<mark name="T1" /><mark name="T2" />is
	<mark name="T3" /><mark name="T4" />a
	<mark name="T5" /><mark name="T6" />speech
	<mark name="T7" /><mark name="T8" />and
	<mark name="T9" /><mark name="T10" />text
	<mark name="T11" /><mark name="T12" />to
	<mark name="T13" /><mark name="T14" />speech
	<mark name="T15" /><mark name="T16" />interface
	<mark name="T17" /><mark name="T18" />by
	<mark name="T19" /><mark name="T20" />Microsoft.
	<mark name="T21" /><mark name="T22" />I
	<mark name="T23" /><mark name="T24" />use
	<mark name="T25" /><mark name="T26" />it
	<mark name="T27" /><mark name="T28" />to
	<mark name="T29" /><mark name="T30" />be
	<mark name="T31" /><mark name="T32" />able
	<mark name="T33" /><mark name="T34" />to
	<mark name="T35" /><mark name="T36" />talk
	<mark name="T37" /><mark name="T38" />to
	<mark name="T39" /><mark name="T40" />you.
	<mark name="T41" />
</speech>

Festival example

RemoteSpeechReply brad 2 OK: <?xml version="1.0" encoding="UTF-8"?>
<speak>
   <soundFile name="d:\edwork\vhtoolkit\bin\FestivalRelay\data\cache\festival\utt_20110809_152521_brad_2.wav"/>
   <viseme start="0.000000" type="_" />
   <mark name="T0" time="0.080000"/>
   <word end="0.400000" start="0.080000" >
      <viseme start="0.080000" type="Z" />
      <viseme start="0.160000" type="Ao" />
      <viseme start="0.240000" type="BMP" />
      <viseme start="0.320000" type="EE" />
      <mark name="T1" time="0.400000"/>
   </word>
   <mark name="T2" time="0.400000"/>
   <word end="0.560000" start="0.400000" >
      <viseme start="0.400000" type="Ih" />
      <viseme start="0.480000" type="Z" />
      <mark name="T3" time="0.560000"/>
   </word>
   <mark name="T4" time="0.560000"/>
   <word end="0.640000" start="0.560000" >
      <viseme start="0.560000" type="Ih" />
      <mark name="T5" time="0.640000"/>
   </word>
   <mark name="T6" time="0.640000"/>
   <word end="0.960000" start="0.640000" >
      <viseme start="0.640000" type="Z" />
      <viseme start="0.720000" type="BMP" />
      <viseme start="0.800000" type="EE" />
      <viseme start="0.880000" type="j" />
      <viseme start="0.960000" type="_" />
      <mark name="T7" time="1.040000"/>
   </word>
   <mark name="T8" time="1.040000"/>
   <word end="1.280000" start="1.040000" >
      <viseme start="1.040000" type="Ih" />
      <viseme start="1.120000" type="NG" />
      <viseme start="1.200000" type="D" />
      <mark name="T9" time="1.280000"/>
   </word>
   <mark name="T10" time="1.280000"/>
   <word end="1.680000" start="1.280000" >
      <viseme start="1.280000" type="D" />
      <viseme start="1.360000" type="Ih" />
      <viseme start="1.440000" type="KG" />
      <viseme start="1.520000" type="Z" />
      <viseme start="1.600000" type="D" />
      <mark name="T11" time="1.680000"/>
   </word>
   <mark name="T12" time="1.680000"/>
   <word end="1.840000" start="1.680000" >
      <viseme start="1.680000" type="D" />
      <viseme start="1.760000" type="Ih" />
      <mark name="T13" time="1.840000"/>
   </word>
   <mark name="T14" time="1.840000"/>
   <word end="2.160000" start="1.840000" >
      <viseme start="1.840000" type="Z" />
      <viseme start="1.920000" type="BMP" />
      <viseme start="2.000000" type="EE" />
      <viseme start="2.080000" type="j" />
      <mark name="T15" time="2.160000"/>
   </word>
   <mark name="T16" time="2.160000"/>
   <word end="2.720000" start="2.160000" >
      <viseme start="2.160000" type="Ih" />
      <viseme start="2.240000" type="NG" />
      <viseme start="2.320000" type="D" />
      <viseme start="2.400000" type="Er" />
      <viseme start="2.440000" type="R" />
      <mark name="T17" time="2.480000"/>
   </word>
   <mark name="T18" time="2.160000"/>
   <word end="2.720000" start="2.160000" >
      <viseme start="2.480000" type="F" />
      <viseme start="2.560000" type="Ih" />
      <viseme start="2.640000" type="Z" />
      <mark name="T19" time="2.720000"/>
   </word>
   <mark name="T20" time="2.720000"/>
   <word end="2.880000" start="2.720000" >
      <viseme start="2.720000" type="BMP" />
      <viseme start="2.800000" type="Ih" />
      <mark name="T21" time="2.880000"/>
   </word>
   <mark name="T22" time="2.880000"/>
   <word end="3.599999" start="2.880000" >
      <viseme start="2.880000" type="BMP" />
      <viseme start="2.960000" type="Ih" />
      <viseme start="3.039999" type="KG" />
      <viseme start="3.119999" type="R" />
      <viseme start="3.199999" type="Oh" />
      <viseme start="3.279999" type="Z" />
      <viseme start="3.359999" type="Ao" />
      <viseme start="3.439999" type="F" />
      <viseme start="3.519999" type="D" />
      <viseme start="3.599999" type="_" />
      <mark name="T23" time="3.679999"/>
   </word>
   <mark name="T24" time="3.679999"/>
   <word end="3.759999" start="3.679999" >
      <viseme start="3.679999" type="Ih" />
      <mark name="T25" time="3.759999"/>
   </word>
   <mark name="T26" time="3.759999"/>
   <word end="3.999999" start="3.759999" >
      <viseme start="3.759999" type="OO" />
      <viseme start="3.839999" type="Oh" />
      <viseme start="3.919999" type="Z" />
      <mark name="T27" time="3.999999"/>
   </word>
   <mark name="T28" time="3.999999"/>
   <word end="4.159998" start="3.999999" >
      <viseme start="3.999999" type="Ih" />
      <viseme start="4.079998" type="D" />
      <mark name="T29" time="4.159998"/>
   </word>
   <mark name="T30" time="4.159998"/>
   <word end="4.319998" start="4.159998" >
      <viseme start="4.159998" type="D" />
      <viseme start="4.239998" type="Ih" />
      <mark name="T31" time="4.319998"/>
   </word>
   <mark name="T32" time="4.319998"/>
   <word end="4.479998" start="4.319998" >
      <viseme start="4.319998" type="BMP" />
      <viseme start="4.399998" type="EE" />
      <mark name="T33" time="4.479998"/>
   </word>
   <mark name="T34" time="4.479998"/>
   <word end="4.799998" start="4.479998" >
      <viseme start="4.479998" type="Ih" />
      <viseme start="4.559998" type="BMP" />
      <viseme start="4.639998" type="Ih" />
      <viseme start="4.719998" type="D" />
      <viseme start="4.799998" type="_" />
      <mark name="T35" time="4.879998"/>
   </word>
   <mark name="T36" time="4.879998"/>
   <word end="5.039998" start="4.879998" >
      <viseme start="4.879998" type="D" />
      <viseme start="4.959998" type="Ih" />
      <mark name="T37" time="5.039998"/>
   </word>
   <mark name="T38" time="5.039998"/>
   <word end="5.279997" start="5.039998" >
      <viseme start="5.039998" type="D" />
      <viseme start="5.119998" type="Ao" />
      <viseme start="5.199997" type="KG" />
      <mark name="T39" time="5.279997"/>
   </word>
   <mark name="T40" time="5.279997"/>
   <word end="5.439997" start="5.279997" >
      <viseme start="5.279997" type="D" />
      <viseme start="5.359997" type="Ih" />
      <mark name="T41" time="5.439997"/>
   </word>
   <mark name="T42" time="5.439997"/>
   <word end="5.599997" start="5.439997" >
      <viseme start="5.439997" type="OO" />
      <viseme start="5.519997" type="Oh" />
      <mark name="T43" time="5.599997"/>
   </word>
   <viseme start="5.599997" type="_" />
</speak>

MSSpeechRelay Example

Text sent to MSSpeech:

<speak version="1.0" xml:lang="en-US">
<mark name="sp1:T0" />SAPI
<mark name="sp1:T1" />
<mark name="sp1:T2" />is
<mark name="sp1:T3" />
<mark name="sp1:T4" />a
<mark name="sp1:T5" />
<mark name="sp1:T6" />speak
<mark name="sp1:T7" />
<mark name="sp1:T8" />and
<mark name="sp1:T9" />
<mark name="sp1:T10" />text
<mark name="sp1:T11" />
<mark name="sp1:T12" />to
<mark name="sp1:T13" />
<mark name="sp1:T14" />speak
<mark name="sp1:T15" />
<mark name="sp1:T16" />interface
<mark name="sp1:T17" />
<mark name="sp1:T18" />by
<mark name="sp1:T19" />
<mark name="sp1:T20" />Microsoft.
<mark name="sp1:T21" />
<mark name="sp1:T22" />I
<mark name="sp1:T23" />
<mark name="sp1:T24" />use
<mark name="sp1:T25" />
<mark name="sp1:T26" />it
<mark name="sp1:T27" />
<mark name="sp1:T28" />to
<mark name="sp1:T29" />
<mark name="sp1:T30" />be
<mark name="sp1:T31" />
<mark name="sp1:T32" />able
<mark name="sp1:T33" />
<mark name="sp1:T34" />to
<mark name="sp1:T35" />
<mark name="sp1:T36" />talk
<mark name="sp1:T37" />
<mark name="sp1:T38" />to
<mark name="sp1:T39" />
<mark name="sp1:T40" />you.
<mark name="sp1:T41" />.
</speak>

Reply:

RemoteSpeechReply brad 4 OK:
<?xml version="1.0" encoding="UTF-8"?>
<speak>
   <soundFile name="d:\edwork\vhtoolkit\data\cache\audio\utt_20110809_154741_brad_4.wav"/>
   <viseme start="0" type="_"/>
   <mark name="T0" time="0.003"/>
   <word end="0.347" start="0.003">
      <viseme start="0.003" type="Z"/>
      <viseme start="0.099" type="Ih"/>
      <viseme start="0.196" type="BMP"/>
      <viseme start="0.259" type="Ih"/>
      <mark name="T1" time="0.347"/>
   </word>
   <mark name="T2" time="0.347"/>
   <word end="0.465" start="0.347">
      <viseme start="0.347" type="Ih"/>
      <viseme start="0.416" type="Z"/>
      <mark name="T3" time="0.465"/>
   </word>
   <mark name="T4" time="0.465"/>
   <word end="0.527" start="0.465">
      <viseme start="0.465" type="Ih"/>
      <mark name="T5" time="0.527"/>
   </word>
   <mark name="T6" time="0.527"/>
   <word end="0.874" start="0.527">
      <viseme start="0.527" type="Z"/>
      <viseme start="0.605" type="BMP"/>
      <viseme start="0.683" type="Ih"/>
      <viseme start="0.795" type="KG"/>
      <mark name="T7" time="0.874"/>
   </word>
   <mark name="T8" time="0.874"/>
   <word end="1.053" start="0.874">
      <viseme start="0.874" type="Ih"/>
      <viseme start="0.957" type="D"/>
      <viseme start="1.04" type="D"/>
      <mark name="T9" time="1.053"/>
   </word>
   <mark name="T10" time="1.053"/>
   <word end="1.401" start="1.053">
      <viseme start="1.053" type="D"/>
      <viseme start="1.119" type="Ih"/>
      <viseme start="1.238" type="KG"/>
      <viseme start="1.295" type="Z"/>
      <viseme start="1.348" type="D"/>
      <mark name="T11" time="1.401"/>
   </word>
   <mark name="T12" time="1.401"/>
   <word end="1.47" start="1.401">
      <viseme start="1.401" type="D"/>
      <viseme start="1.442" type="Oh"/>
      <mark name="T13" time="1.47"/>
   </word>
   <mark name="T14" time="1.47"/>
   <word end="1.878" start="1.47">
      <viseme start="1.47" type="Z"/>
      <viseme start="1.547" type="BMP"/>
      <viseme start="1.624" type="Ih"/>
      <viseme start="1.736" type="KG"/>
      <mark name="T15" time="1.878"/>
   </word>
   <mark name="T16" time="1.878"/>
   <word end="2.523" start="1.878">
      <viseme start="1.878" type="Ih"/>
      <viseme start="1.955" type="D"/>
      <viseme start="2.032" type="D"/>
      <viseme start="2.075" type="Ih"/>
      <viseme start="2.11" type="R"/>
      <viseme start="2.145" type="F"/>
      <viseme start="2.257" type="Ih"/>
      <viseme start="2.399" type="Z"/>
      <mark name="T17" time="2.523"/>
   </word>
   <mark name="T18" time="2.523"/>
   <word end="2.665" start="2.523">
      <viseme start="2.523" type="D"/>
      <viseme start="2.554" type="Ih"/>
      <mark name="T19" time="2.665"/>
   </word>
   <mark name="T20" time="2.665"/>
   <word end="3.931" start="2.665">
      <viseme start="2.665" type="BMP"/>
      <viseme start="2.753" type="Ih"/>
      <viseme start="2.841" type="KG"/>
      <viseme start="2.913" type="R"/>
      <viseme start="2.943" type="Ih"/>
      <viseme start="2.973" type="Z"/>
      <viseme start="3.067" type="Ao"/>
      <viseme start="3.202" type="F"/>
      <viseme start="3.255" type="D"/>
      <viseme start="3.308" type="_"/>
      <viseme start="3.928" type="_"/>
      <mark name="T21" time="3.931"/>
   </word>
   <mark name="T22" time="3.931"/>
   <word end="4.067" start="3.931">
      <viseme start="3.931" type="Ih"/>
      <mark name="T23" time="4.067"/>
   </word>
   <mark name="T24" time="4.067"/>
   <word end="4.336" start="4.067">
      <viseme start="4.067" type="Ih"/>
      <viseme start="4.17" type="Oh"/>
      <viseme start="4.273" type="Z"/>
      <mark name="T25" time="4.336"/>
   </word>
   <mark name="T26" time="4.336"/>
   <word end="4.474" start="4.336">
      <viseme start="4.336" type="Ih"/>
      <viseme start="4.403" type="D"/>
      <mark name="T27" time="4.474"/>
   </word>
   <mark name="T28" time="4.474"/>
   <word end="4.54" start="4.474">
      <viseme start="4.474" type="D"/>
      <viseme start="4.515" type="Oh"/>
      <mark name="T29" time="4.54"/>
   </word>
   <mark name="T30" time="4.54"/>
   <word end="4.691" start="4.54">
      <viseme start="4.54" type="D"/>
      <viseme start="4.588" type="Ih"/>
      <mark name="T31" time="4.691"/>
   </word>
   <mark name="T32" time="4.691"/>
   <word end="5.051" start="4.691">
      <viseme start="4.691" type="Ih"/>
      <viseme start="4.847" type="D"/>
      <viseme start="4.901" type="Ih"/>
      <viseme start="4.976" type="D"/>
      <mark name="T33" time="5.051"/>
   </word>
   <mark name="T34" time="5.051"/>
   <word end="5.15" start="5.051">
      <viseme start="5.051" type="D"/>
      <viseme start="5.095" type="Oh"/>
      <mark name="T35" time="5.15"/>
   </word>
   <mark name="T36" time="5.15"/>
   <word end="5.469" start="5.15">
      <viseme start="5.15" type="D"/>
      <viseme start="5.244" type="Ao"/>
      <viseme start="5.404" type="KG"/>
      <mark name="T37" time="5.469"/>
   </word>
   <mark name="T38" time="5.469"/>
   <word end="5.64" start="5.469">
      <viseme start="5.469" type="D"/>
      <viseme start="5.584" type="Oh"/>
      <mark name="T39" time="5.64"/>
   </word>
   <mark name="T40" time="5.64"/>
   <word end="6.558" start="5.64">
      <viseme start="5.64" type="Ih"/>
      <viseme start="5.784" type="Oh"/>
      <viseme start="5.928" type="_"/>
      <viseme start="6.553" type="_"/>
      <mark name="T41" time="6.558"/>
   </word>
   <viseme start="6.558" type="_"/>
</speak>

Saso Agent Example

Start Saso (sbm, nvbg, nlu, Fake Recognizer, Agent 1), then click "hello gentlemen".

RemoteSpeechCmd speak doctor-perez 1 M021 ../../data/cache/audio/utt_20110809_193606_doctor-perez_1.aiff
<?xml version="1.0" encoding="UTF-8"?>
<speech id="sp1" ref="" type="application/ssml+xml">
   <mark name="T0" />hello
   <mark name="T1" />
   <mark name="T2" />captain
   <mark name="T3" />
</speech>
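
The request above follows a simple numbering pattern: word number i (counting from 0) is preceded by mark T(2i) and followed by mark T(2i+1). The sketch below generates the same layout from plain text. It only illustrates the pattern; mark_up_utterance is a hypothetical helper and not the code that NVBG or SmartBody uses to produce these messages.

```python
from xml.sax.saxutils import escape, quoteattr


def mark_up_utterance(text, speech_id="sp1", ref=""):
    """Wrap every word in a start mark T(2i) and an end mark T(2i+1)."""
    lines = [
        f"<speech id={quoteattr(speech_id)} ref={quoteattr(ref)} "
        'type="application/ssml+xml">'
    ]
    for index, word in enumerate(text.split()):
        lines.append(
            f'   <mark name="T{2 * index}" />{escape(word)}\n'
            f'   <mark name="T{2 * index + 1}" />'
        )
    lines.append("</speech>")
    return "\n".join(lines)


print(mark_up_utterance("hello captain"))
```

For the two-word input this yields marks T0 through T3, matching the speech element shown above.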

RVoiceRelay Example

Text sent to RVoice:

<?xml version="1.0" encoding="UTF-8"?>
<speech id="sp1" ref="" type="application/ssml+xml">
   <mark name="T0" />hello
   <mark name="T1" />
   <mark name="T2" />captain
   <mark name="T3" />
</speech>

Reply:

RemoteSpeechReply doctor-perez 1 OK:
<?xml version="1.0" encoding="UTF-8"?>
<speak>
   <soundFile name="d:\edwork\saso\core\beavin\..\..\data\cache\audio\utt_20110809_193606_doctor-perez_1.aiff"/>
   <viseme start="0.0" type="_"/>
   <viseme start="0.0" type="_"/>
   <mark name="T0" time="0.049977324263038546"/>
   <word end="0.33696145124716553" start="0.049977324263038546">
      <viseme start="0.049977324263038546" type="Ih"/>
      <viseme start="0.14498866213151929" type="Ih"/>
      <viseme start="0.2" type="D"/>
      <viseme start="0.24997732426303854" type="OW"/>
   </word>
   <mark name="T2" time="0.33696145124716553"/>
   <mark name="T1" time="0.33696145124716553"/>
   <word end="0.8029931972789116" start="0.33696145124716553">
      <viseme start="0.33696145124716553" type="KG"/>
      <viseme start="0.39696145124716553" type="Ih"/>
      <viseme start="0.4819954648526077" type="BMP"/>
      <viseme start="0.5419954648526077" type="D"/>
      <viseme start="0.6399546485260771" type="Ih"/>
      <viseme start="0.7029931972789115" type="NG"/>
   </word>
   <mark name="T3" time="0.8029931972789116"/>
   <viseme start="0.8029931972789116" type="_"/>
   <viseme start="0.8529705215419501" type="_"/>
</speak>
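
The mark times in a reply are what allow nonverbal behavior to be aligned with individual words. The sketch below collects the marks into a lookup table and resolves a reference to a time. It assumes references take the form of the speech id followed by the mark name (for example sp1:T2), which is inferred from the mark names in the Cerevoice and MSSpeech examples; both function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def mark_times(speak_xml):
    """Map each mark name in a reply to its time in seconds."""
    root = ET.fromstring(speak_xml)
    return {m.get("name"): float(m.get("time")) for m in root.iter("mark")}


def resolve_sync_point(reference, times):
    """Look up a reference such as 'sp1:T2'; a bare 'T2' is accepted as well."""
    name = reference.split(":", 1)[-1]
    return times[name]


# Marks from the RVoiceRelay reply above, with visemes omitted for brevity.
reply_xml = """<speak>
   <mark name="T0" time="0.049977324263038546"/>
   <mark name="T2" time="0.33696145124716553"/>
   <mark name="T1" time="0.33696145124716553"/>
   <mark name="T3" time="0.8029931972789116"/>
</speak>"""

times = mark_times(reply_xml)
print(resolve_sync_point("sp1:T2", times))
```

Using a dictionary keyed by name means the lookup does not depend on the order in which the relay emits the marks, which matters here because T2 appears before T1.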