Aldebaran documentation What's new in NAOqi 2.4.3?


NAOqi Audio - Overview | API

juju Pepper only

What it does

ALVoiceEmotionAnalysis identifies the emotion expressed by the speaker’s voice, independently of what is being said.

How it works

Underlying technology

ALVoiceEmotionAnalysis relies on the ST Emotion voice analysis technology by AGI.

The module also depends on ALSpeechRecognition for speech detection, so ALSpeechRecognition (or ALDialog) needs to be started correctly for it to work.

EmotionRecognized event

At the end of the utterance, the engine processes the emotion levels and raises the “ALVoiceEmotionAnalysis/EmotionRecognized” event, which is structured as follows:

  [ matched emotion index, matched emotion level ],
  emotion levels: [ calm, anger, joy, sorrow, laughter ],
  excitement level


  • “Matched emotion index” is the index of the dominant emotion selected by the engine in the emotion levels vector

    0 = unknown
    1 = calm
    2 = anger
    3 = joy
    4 = sorrow

    Laughter is not really considered an emotion, so it will never be the dominant one.

  • “Matched emotion level” is the level of the dominant emotion

  • “Emotion levels” is the vector of the emotions’ scores. Scores are integers between 0 and 5.

  • “Excitement level” is a separate value that measures the amount of excitement in the voice. High excitement is often linked with joy or anger.

Example: [ [4,3], [0,2,1,3,0], 1 ] Here, the matched emotion is sorrow with a level of 3. We can find this value in the emotion levels vector as well. Excitement is quite low (1), meaning the voice was quite monotone.

Getting started

Use the following Choregraphe behavior: VoiceEmotionAnalysis.crg

The minimum utterance duration can be set in the box parameters (see “MinSignalLength” parameter in ALVoiceEmotionAnalysisProxy::setParameter).

Note that there is a Speech Recognition box. It’s purpose is to start ALSpeechRecognition with dummy vocabulary so that the speech detection works (speech detection will not work if you merely subscribe to ALSpeechRecognition without having added vocabulary).