Letter Sound Examples

Overview

This page covers how to create targets when using the <sound-out> markup for:

Individual letter sounds
Multiple letter sounds
All the letter sounds in a word
Letter sounds that require 2 phonemes

Individual letter sounds

A common use case of the SoapBox voice engine is when a student repeats, reads, or calls out individual letter sounds (i.e., phonemes).

The <sound-out> markup is used to create these targets.

The pronunciation attribute is always required when using isolated phonemes as targets: e.g., <sound-out pronunciation="sh">sh</sound-out>.

See Working With Short Sounds for more details about the pronunciation attribute.

Example: “Tell me the SOUND of the letter G”

Input target

<sound-out pronunciation="g">g</sound-out>
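A target string like the one above could be assembled programmatically. The following is a minimal sketch: `isolated_sound_target` is a hypothetical helper name, not part of any SoapBox SDK, and how the target is then submitted to the engine is not covered here.

```python
from xml.sax.saxutils import escape, quoteattr


def isolated_sound_target(letter: str, phoneme: str) -> str:
    """Build a <sound-out> target for a single isolated phoneme.

    Hypothetical helper for illustration only. The pronunciation
    attribute is always written because this page requires it for
    isolated phonemes.
    """
    if not phoneme.strip():
        raise ValueError("isolated phonemes need a pronunciation value")
    # quoteattr adds the surrounding double quotes and escapes the value.
    attribute = quoteattr(phoneme)
    return f"<sound-out pronunciation={attribute}>{escape(letter)}</sound-out>"


print(isolated_sound_target("g", "g"))
# <sound-out pronunciation="g">g</sound-out>
```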

JSON output

One quality_score value for the letter “g”.
token_type: "sound-out"
Example [audio was /g/]:

"results": [{
    "hypothesis_score": 98.0,
    "duration": 4.2,
    "hypothesis_duration": 0.18,
    "category": "g",
    "end": 2.7,
    "start": 2.52,
    "word_breakdown": [{
      "duration": 0.18,
      "quality_score": 98.0,
      "token_type": "sound-out",
      "end": 2.7,
      "start": 2.52,
      "phone_breakdown": [{
        "duration": 0.18,
        "quality_score": 96.0,
        "end": 2.7,
        "start": 2.52,
        "phone": "g"
      }],
      "word": "g",
      "target_transcription": "g"
    }]
  }]
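As an illustration of reading this output, the sketch below pulls the quality_score for each sound-out token. It assumes the full response is a JSON object with a "results" key (the example above shows only that fragment); `sound_out_scores` is a hypothetical helper name, and the sample is trimmed to the fields it uses.

```python
import json

# Trimmed copy of the example above, wrapped in braces so that it parses.
# Assumption: the complete response is an object with a "results" key.
RESPONSE = json.loads("""
{"results": [{
  "hypothesis_score": 98.0,
  "category": "g",
  "word_breakdown": [{
    "quality_score": 98.0,
    "token_type": "sound-out",
    "phone_breakdown": [{"quality_score": 96.0, "phone": "g"}],
    "word": "g",
    "target_transcription": "g"
  }]
}]}
""")


def sound_out_scores(response: dict) -> list[tuple[str, float]]:
    """Return (word, quality_score) for every sound-out token."""
    scores = []
    for result in response.get("results", []):
        for word in result.get("word_breakdown", []):
            if word.get("token_type") == "sound-out":
                scores.append((word["word"], word["quality_score"]))
    return scores


print(sound_out_scores(RESPONSE))  # [('g', 98.0)]
```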

Working with multiple letter sounds as targets

There are two ways to input multiple letter sounds as targets:

When the order of the response does NOT matter
When the order of the response DOES matter

When the order of the response does NOT matter

Use this approach when the order of the student's response is not important and saying the sounds in any order is a valid response.

Example: “Tell me the sound of each of the letters: P S I L M”

Input target: use multiple targets:

<sound-out pronunciation="p">p</sound-out> 
<sound-out pronunciation="s">s</sound-out> 
<sound-out pronunciation="ih">i</sound-out> 
<sound-out pronunciation="l">l</sound-out> 
<sound-out pronunciation="m">m</sound-out>

JSON output:

Several quality_score values, one per letter.
token_type: "sound-out"
Example [audio was /p/ /ih/ /s/ /l/ /m/]:
Note: the student said /ih/ and /s/ in a different order from the prompt.

"results": [
    {
      "hypothesis_score": 89,
      "duration": 8.61,
      "hypothesis_duration": 0.45,
      "category": "p",
      "end": 0.99,
      "start": 0.54,
      "word_breakdown": [
        {
          "duration": 0.45,
          "quality_score": 89,
          "token_type": "sound-out",
          "end": 0.99,
          "start": 0.54,
          "phone_breakdown": [
            {
              "duration": 0.45,
              "quality_score": 85,
              "end": 0.99,
              "start": 0.54,
              "phone": "p"
            }
          ],
          "word": "p",
          "target_transcription": "p"
        }
      ]
    },
    {
      "hypothesis_score": 87,
      "duration": 8.61,
      "hypothesis_duration": 1.2,
      "category": "s",
      "end": 3.36,
      "start": 2.16,
      "word_breakdown": [
        {
          "duration": 1.2,
          "quality_score": 87,
          "token_type": "sound-out",
          "end": 3.36,
          "start": 2.16,
          "phone_breakdown": [
            {
              "duration": 1.2,
              "quality_score": 81,
              "end": 3.36,
              "start": 2.16,
              "phone": "s"
            }
          ],
          "word": "s",
          "target_transcription": "s"
        }
      ]
    },
    {
      "hypothesis_score": 78,
      "duration": 8.61,
      "hypothesis_duration": 0.33,
      "category": "i",
      "end": 4.71,
      "start": 4.38,
      "word_breakdown": [
        {
          "duration": 0.33,
          "quality_score": 78,
          "token_type": "sound-out",
          "end": 4.71,
          "start": 4.38,
          "phone_breakdown": [
            {
              "duration": 0.33,
              "quality_score": 70,
              "end": 4.71,
              "start": 4.38,
              "phone": "ih"
            }
          ],
          "word": "i",
          "target_transcription": "ih"
        }
      ]
    },
    {
      "hypothesis_score": 73,
      "duration": 8.61,
      "hypothesis_duration": 0.66,
      "category": "l",
      "end": 6.18,
      "start": 5.52,
      "word_breakdown": [
        {
          "duration": 0.66,
          "quality_score": 73,
          "token_type": "sound-out",
          "end": 6.18,
          "start": 5.52,
          "phone_breakdown": [
            {
              "duration": 0.66,
              "quality_score": 64,
              "end": 6.18,
              "start": 5.52,
              "phone": "l"
            }
          ],
          "word": "l",
          "target_transcription": "l"
        }
      ]
    },
    {
      "hypothesis_score": 77,
      "duration": 8.61,
      "hypothesis_duration": 1.32,
      "category": "m",
      "end": 8.34,
      "start": 7.02,
      "word_breakdown": [
        {
          "duration": 1.32,
          "quality_score": 77,
          "token_type": "sound-out",
          "end": 8.34,
          "start": 7.02,
          "phone_breakdown": [
            {
              "duration": 1.32,
              "quality_score": 69,
              "end": 8.34,
              "start": 7.02,
              "phone": "m"
            }
          ],
          "word": "m",
          "target_transcription": "m"
        }]
    }]
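Because each target produces its own result, the results can be indexed by category so that the spoken order does not affect how they are read. The sketch below uses values copied from the example above, trimmed to two fields. `scores_by_category` is a hypothetical helper, and treating an absent category as "missing" is an assumption, since this page does not show what the output looks like for a target that was not spoken.

```python
# Values copied from the example above, trimmed to the fields used here.
RESULTS = [
    {"category": "p", "hypothesis_score": 89},
    {"category": "s", "hypothesis_score": 87},
    {"category": "i", "hypothesis_score": 78},
    {"category": "l", "hypothesis_score": 73},
    {"category": "m", "hypothesis_score": 77},
]


def scores_by_category(results: list[dict]) -> dict[str, float]:
    """Index results by target so the spoken order is irrelevant."""
    return {result["category"]: result["hypothesis_score"] for result in results}


expected = ["p", "s", "i", "l", "m"]
scores = scores_by_category(RESULTS)
# Assumption: a target with no entry in the results counts as missing.
missing = [letter for letter in expected if letter not in scores]

print(scores)   # {'p': 89, 's': 87, 'i': 78, 'l': 73, 'm': 77}
print(missing)  # []
```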

When the order of the response DOES matter

Use this approach when the order of the student's response is important.

The sounds are expected to be produced in the order given in the target. If the student says them in a different order, misplaced sounds are marked as deleted.

Example: “Tell me the sounds of the letters P S I L M”

Input target: Use a single target:

<sound-out pronunciation="p">p</sound-out> <sound-out pronunciation="s">s</sound-out> <sound-out pronunciation="ih">i</sound-out> <sound-out pronunciation="l">l</sound-out> <sound-out pronunciation="m">m</sound-out>
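The difference between the two input styles comes down to whether the tags are kept as separate targets or joined into one. A minimal sketch, assuming the tags in a single target are separated by one space as in the line above; `tag` is a hypothetical helper, and how targets are submitted to the engine is outside the scope of this page.

```python
# (letter, phoneme) pairs taken from the P S I L M example.
LETTER_SOUNDS = [("p", "p"), ("s", "s"), ("i", "ih"), ("l", "l"), ("m", "m")]


def tag(letter: str, phoneme: str) -> str:
    """Format one <sound-out> tag (hypothetical helper)."""
    return f'<sound-out pronunciation="{phoneme}">{letter}</sound-out>'


# Order does NOT matter: one target per letter sound.
unordered_targets = [tag(letter, phoneme) for letter, phoneme in LETTER_SOUNDS]

# Order DOES matter: all tags joined into a single target.
ordered_target = " ".join(unordered_targets)

print(len(unordered_targets))  # 5
print(ordered_target)
```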

JSON output:

Several quality_score values, one per letter.
token_type: "sound-out"
Example [audio was /p/ /s/ /ih/ /l/ /m/]:

"results": [{
    "hypothesis_score": 79.0,
    "duration": 8.61,
    "hypothesis_duration": 7.8,
    "category": "p s i l m",
    "end": 8.34,
    "start": 0.54,
    "word_breakdown": [{
      "duration": 7.8,
      "quality_score": 87.0,
      "token_type": "sound-out",
      "end": 8.34,
      "start": 0.54,
      "phone_breakdown": [{
        "duration": 0.57,
        "quality_score": 82.0,
        "end": 1.11,
        "start": 0.54,
        "phone": "p"
      }],
      "word": "p",
      "target_transcription": "p"
    }, {
      "duration": 7.8,
      "quality_score": 86.0,
      "token_type": "sound-out",
      "end": 8.34,
      "start": 0.54,
      "phone_breakdown": [{
        "duration": 1.2,
        "quality_score": 81.0,
        "end": 3.36,
        "start": 2.16,
        "phone": "s"
      }],
      "word": "s",
      "target_transcription": "s"
    }, {
      "duration": 7.8,
      "quality_score": 77.0,
      "token_type": "sound-out",
      "end": 8.34,
      "start": 0.54,
      "phone_breakdown": [{
        "duration": 0.33,
        "quality_score": 68.0,
        "end": 4.71,
        "start": 4.38,
        "phone": "ih"
      }],
      "word": "i",
      "target_transcription": "ih"
    }, {
      "duration": 7.8,
      "quality_score": 73.0,
      "token_type": "sound-out",
      "end": 8.34,
      "start": 0.54,
      "phone_breakdown": [{
        "duration": 0.66,
        "quality_score": 64.0,
        "end": 6.18,
        "start": 5.52,
        "phone": "l"
      }],
      "word": "l",
      "target_transcription": "l"
    }, {
      "duration": 7.8,
      "quality_score": 73.0,
      "token_type": "sound-out",
      "end": 8.34,
      "start": 0.54,
      "phone_breakdown": [{
        "duration": 1.32,
        "quality_score": 64.0,
        "end": 8.34,
        "start": 7.02,
        "phone": "m"
      }],
      "word": "m",
      "target_transcription": "m"
    }]
  }]
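In this example, every word_breakdown entry carries the start and end of the whole utterance (0.54 to 8.34), so per-letter timing has to be read from phone_breakdown instead. The sketch below shows one way to do that with values copied from the example above; `letter_timeline` is a hypothetical helper name.

```python
# Values copied from the example above, trimmed to the fields used here.
RESULT = {
    "category": "p s i l m",
    "hypothesis_score": 79.0,
    "word_breakdown": [
        {"word": "p", "quality_score": 87.0,
         "phone_breakdown": [{"phone": "p", "start": 0.54, "end": 1.11}]},
        {"word": "s", "quality_score": 86.0,
         "phone_breakdown": [{"phone": "s", "start": 2.16, "end": 3.36}]},
        {"word": "i", "quality_score": 77.0,
         "phone_breakdown": [{"phone": "ih", "start": 4.38, "end": 4.71}]},
        {"word": "l", "quality_score": 73.0,
         "phone_breakdown": [{"phone": "l", "start": 5.52, "end": 6.18}]},
        {"word": "m", "quality_score": 73.0,
         "phone_breakdown": [{"phone": "m", "start": 7.02, "end": 8.34}]},
    ],
}


def letter_timeline(result: dict) -> list[tuple[str, float, float, float]]:
    """Return (word, quality_score, start, end) using phone-level timing."""
    rows = []
    for word in result["word_breakdown"]:
        phones = word["phone_breakdown"]
        rows.append(
            (word["word"], word["quality_score"],
             phones[0]["start"], phones[-1]["end"])
        )
    return rows


for row in letter_timeline(RESULT):
    print(row)
```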

All the letter sounds in a word

In a use case where a student is required to read or call out all the letter sounds (i.e., phonemes) in a word, the word can be used as the target.

The <sound-out> markup is used to create these targets: <sound-out>dog</sound-out>.

Note: The pronunciation attribute is required only when using a target that is not in the dictionary.

Example: “Tell me ALL the SOUNDS in the word dog”

Input target:

<sound-out>dog</sound-out>

JSON output:

Phonemes are scored as if they were produced in isolation (not as part of a word).
One quality_score value for the word “dog”, which is a combination of the isolated phoneme quality_score values.
token_type: "sound-out"
Example [audio was /d/ /ao/ /g/]:

"results": [
    {
      "hypothesis_score": 71,
      "duration": 3.48,
      "hypothesis_duration": 1.86,
      "category": "dog",
      "end": 2.55,
      "start": 0.69,
      "word_breakdown": [
        {
          "duration": 1.86,
          "quality_score": 71,
          "token_type": "sound-out",
          "end": 2.55,
          "start": 0.69,
          "phone_breakdown": [
            {
              "duration": 0.24,
              "quality_score": 64,
              "end": 0.93,
              "start": 0.69,
              "phone": "d"
            },
            {
              "duration": 0.3,
              "quality_score": 56,
              "end": 1.89,
              "start": 1.59,
              "phone": "ao"
            },
            {
              "duration": 0.15,
              "quality_score": 61,
              "end": 2.55,
              "start": 2.4,
              "phone": "g"
            }
          ],
          "word": "dog",
          "target_transcription": "d ao g"
        }
      ]
    }
  ]
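The phone_breakdown list makes it possible to see which sound in the word scored lowest. The sketch below uses values copied from the example above; `weakest_phone` is a hypothetical helper. It does not attempt to reproduce how the word-level quality_score is combined from the phoneme scores, since that formula is not given on this page.

```python
# Values copied from the example above, trimmed to the fields used here.
WORD = {
    "word": "dog",
    "quality_score": 71,
    "target_transcription": "d ao g",
    "phone_breakdown": [
        {"phone": "d", "quality_score": 64},
        {"phone": "ao", "quality_score": 56},
        {"phone": "g", "quality_score": 61},
    ],
}


def weakest_phone(word: dict) -> tuple[str, float]:
    """Return (phone, quality_score) for the lowest-scoring phoneme."""
    phone = min(word["phone_breakdown"], key=lambda p: p["quality_score"])
    return phone["phone"], phone["quality_score"]


# The phones listed in this example line up with the target transcription.
spoken = [p["phone"] for p in WORD["phone_breakdown"]]
print(spoken == WORD["target_transcription"].split())  # True
print(weakest_phone(WORD))                             # ('ao', 56)
```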

Working with letter sounds that require 2 phonemes

Some letter sounds, such as those for “x” and “q”, require 2 or more phonemes to be pronounced. The <custom-word> markup must be used in this case because no pause is expected between the phonemes.

Example: “Tell me the sound of the letter X”

Input target:

<custom-word pronunciation="k s"> x </custom-word>
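The choice between the two markups can be made from the number of phonemes in the pronunciation. This is a sketch only: `letter_sound_target` is a hypothetical helper, and it writes the letter without the padding spaces shown in the example above, on the assumption that the surrounding whitespace is not significant.

```python
from xml.sax.saxutils import escape, quoteattr


def letter_sound_target(letter: str, pronunciation: str) -> str:
    """Pick <sound-out> for one phoneme, <custom-word> for two or more.

    Hypothetical helper for illustration; not part of any SoapBox SDK.
    """
    phonemes = pronunciation.split()
    if not phonemes:
        raise ValueError("pronunciation must contain at least one phoneme")
    markup = "sound-out" if len(phonemes) == 1 else "custom-word"
    # Normalise to single spaces between phonemes, then quote the value.
    attribute = quoteattr(" ".join(phonemes))
    # Assumption: whitespace around the letter is not significant.
    return f"<{markup} pronunciation={attribute}>{escape(letter)}</{markup}>"


print(letter_sound_target("x", "k s"))
# <custom-word pronunciation="k s">x</custom-word>
print(letter_sound_target("g", "g"))
# <sound-out pronunciation="g">g</sound-out>
```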

JSON output:

One quality_score value for “x”.
token_type: "custom-word"
Example [audio was /k s/]:

"results": [{
    "category": "x",
    "hypothesis_score": 93,
    "word_breakdown": [{
        "quality_score": 93,
        "target_transcription": "k s",
        "word": "x",
        "token_type": "custom-word",
        "phone_breakdown": [{
            "phone": "k",
            "quality_score": 91,
            "end": 0.51,
            "start": 0.39,
            "duration": 0.12
          },{
            "phone": "s",
            "quality_score": 88,
            "end": 1.14,
            "start": 0.51,
            "duration": 0.63
          }],
        "end": 1.14,
        "start": 0.39,
        "pitch": {
          "values": []
        },
        "duration": 0.75
      }],
    "end": 1.14,
    "start": 0.39,
    "duration": 1.53,
    "hypothesis_duration": 0.75
}]

Short sounds guide and best practices

Additional details for working with Short Sounds can be found here: Working With Short Sounds.