Reading non-ASCII text

<< return to examples index

Suppose you have your robot configure to speak in French, and you want it to say some sentences from a data file.

Doing so is a bit trickier than it sounds, because you have to take care of the encoding.

Example

First, download the following files and put them on the robot, in the same directory:

The coffee_en.txt file contains the string “I like coffee”, the files coffee_fr_utf-8.txt and coffee_fr_latin9.txt hold its French translation: J’aime le café, so it’s best if you robot can speak French in addition to English :)

Let’s have a closer look on the file

# -*- encoding: UTF-8 -*-
from naoqi import ALProxy
import codecs


def say_from_file(tts, filename, encoding):
    with codecs.open(filename, encoding=encoding) as fp:
        contents = fp.read()
        # warning: print contents won't work
        to_say = contents.encode("utf-8")
    tts.say(to_say)

def main():
    tts = ALProxy("ALTextToSpeech", "127.0.0.1", 9559)

    tts.setLanguage('French')
    say_from_file(tts, 'coffee_fr_utf-8.txt', 'utf-8')
    say_from_file(tts, 'coffee_fr_latin9.txt', 'latin9')

    tts.setLanguage('English')
    # the string "I like coffee" is encoded the exact same way in these three
    # encodings
    say_from_file(tts, 'coffee_en.txt', 'ascii')
    say_from_file(tts, 'coffee_en.txt', 'utf-8')
    say_from_file(tts, 'coffee_en.txt', 'latin9')

if __name__ == "__main__":
    main()

First, notice how we do not use open but codecs.open, specifying the encoding.

Also notice how we decode the result of the read from the file. The object returned by fp.read is a unicode object, and we need to encode it back to get a str object encoded in i’UTF-8’, usable the TTS proxy.

Trying to run print contents won’t work because Python will try to decode the string using the current locale of the robot, which is ‘ASCII’, leading to this error:

Traceback (most recent call last):
  File "non_ascii.py", line 22, in <module>
    main()
  File "non_ascii.py", line 18, in main
    say_from_file(filename)
  File "non_ascii", line 10, in say_from_file
    print contents
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position
13: ordinal not in range(128)

Notice at last that regardless of the file encoding, everything gets encoded to ‘UTF-8’ before being sent to the text-to-speech proxy.

Running the example

Open a SSH connection on the robot, and type

$ python non_ascii.py

Going further

If you are not sure whereas your file is UTF-8 encoded, you can use something like:

with codecs.open(filename, encoding="utf-8") as fp:
    try:
        contents = fp.read()
    except UnicodeDecodeError:
        print filename, "is not UTF-8 encoded"
        return