Working With Short Sounds

Overview

This page contains best practices for creating input targets for short sounds, covering some common and less common use cases.

Pronunciation attribute

  • The pronunciation that the SoapBox Labs' voice engine verifies is controlled by the markup tag pronunciation attribute.

  • If multiple phonemes are inserted, they must be separated by a single space character.

  • If the pronunciation of a single letter sound needs to be scored (i.e., /w/), the pronunciation attribute must be provided (e.g., <sound-out pronunciation="w">w</sound-out>).

  • If a custom sound-out pronunciation is required for the word, the pronunciation attribute is added to override the standard dictionary pronunciation.

Using the pronunciation attribute with letter sounds

If the <sound-out> tag is used without the pronunciation attribute, the SoapBox system will score against the letter name and not the letter sound because the dictionary pronunciation is used.

Example: “Tell me the sound of the letter G.” 

Input target

<sound-out>g</sound-out>

Note: Without the pronunciation attribute it will default to the letter name.

JSON output

  • One quality_score value for “g”,

  • Note: the letter name pronunciation is returned. See “jh iy” in phone_breakdown

  • token_type : “sound-out”

  • Example [audio was /g/]:

"results": [{
    "hypothesis_score": 16.0,
    "duration": 3.36,
    "hypothesis_duration": 0.12,
    "category": "g",
    "end": 0.15,
    "start": 0.03,
    "word_breakdown": [{
      "duration": 0.12,
      "quality_score": 16.0,
      "token_type": "sound-out",
      "end": 0.15,
      "start": 0.03,
      "phone_breakdown": [{
        "duration": 0.09,
        "quality_score": 1.0,
        "end": 0.12,
        "start": 0.03,
        "phone": "jh"
      }, {
        "duration": 0.03,
        "quality_score": 19.0,
        "end": 0.15,
        "start": 0.12,
        "phone": "iy"
      }],
      "word": "g",
      "target_transcription": "jh iy"
    }]
  }]

Differentiating between letter names and letter sounds

A common scenario may be for an educator to ensure a student is saying a letter sound rather than a letter name. In cases where the phonemes of the letter name and the phonemes of the letter sound do not overlap, this is straightforward. Take the letter “c”, for example. The phonemes that make up the letter name “c” are s iy. The phoneme that makes up the letter sound “c” is usually k (unless asking for the “c” pronunciation in a word like “cease”). Therefore, if you’re verifying the child said the letter sound “c” and not the letter name, using a single target for the letter sound “c” (i.e., <sound-out pronunciation="k">c</sound-out>) will verify the user said the letter sound and not the letter name.

In less straightforward cases, the letter sound is contained within the letter name and more than one target should be used. For example, the letter “b”. The phonemes that make up this letter name are b iy.
The phoneme that makes up this letter sound is b.

As shown, the letter name and the letter sound both contain the b phoneme. As such, if the educator uses <sound-out pronunciation="b">b</sound-out> as a target and the child says the letter name, b iy, the target phoneme b will return a high score, making it seem that the child said the letter sound and not the letter name, as shown in img. 1.

img.1 child saying the letter name b

In these circumstances, it is best practice to use both the letter name and the letter sound as targets. Therefore, if the child says the letter name, both the letter name and the letter sound targets will return high scores. This allows the educator to surmise that the child said a letter name, not the letter sound. Similarly, if the letter sound returns a high score and the letter name returns a mediocre or low score, the educator can surmise that the child said only b, the letter sound, and not b plus iy, the letter name. This is shown in img. 2 and the accompanying JSON snippet below:

img. 2 child saying the letter sound b

"results": [
    {
      "hypothesis_score": 94,
      "duration": 2.55,
      "hypothesis_duration": 0.15,
      "category": "b",
      "end": 1.44,
      "start": 1.29,
      "word_breakdown": [
        {
          "duration": 0.15,
          "quality_score": 94,
          "token_type": "sound-out",
          "end": 1.44,
          "start": 1.29,
          "phone_breakdown": [
            {
              "duration": 0.15,
              "quality_score": 92,
              "end": 1.44,
              "start": 1.29,
              "phone": "b"
            }
          ],
          "word": "b",
          "target_transcription": "b"
        }
      ]
    },
    {
      "hypothesis_score": 11,
      "duration": 2.55,
      "hypothesis_duration": 0.18,
      "category": "b",
      "end": 1.47,
      "start": 1.29,
      "word_breakdown": [
        {
          "duration": 0.18,
          "quality_score": 11,
          "token_type": "letter",
          "end": 1.47,
          "start": 1.29,
          "phone_breakdown": [
            {
              "duration": 0.15,
              "quality_score": 92,
              "end": 1.44,
              "start": 1.29,
              "phone": "b"
            },
            {
              "duration": 0.03,
              "quality_score": 1,
              "end": 1.47,
              "start": 1.44,
              "phone": "iy"
            }
          ],
          "word": "b",
          "target_transcription": "b iy"
        }
      ]
    }
  ]
}

Using multiple targets can also be used to determine the inverse — that a letter name was said rather than a letter sound — but is usually unnecessary. In that scenario, the majority of the needed information will be available via phoneme breakdown of just the letter name target. If the only target provided is the phonemes b iy for the letter name “b”, and the child says the letter sound b instead, then only b in b iy will return a high score, indicating that a letter sound was said and not a letter name.

Differentiating between letter sounds and words

There is a need to cater for the various ways in which a child may respond to an assessment item.

Example: “Tell me the starting sound in the word Dog”

A child may say the word rather than the sound, or possibly say both the word and the sound, whilst working out their response.

In a phoneme isolation activity, for example, we want to hear the response “/d/” only. If a child says the word “dog” and does not say the isolated sound “/d/”, the system will still hear the individual sound “/d/” as it exists in the word and score that sound highly. So there is the possibility of the student scoring high for an incorrect response.

Therefore, to identify the correct response, we need to ask the system to listen for both “dog” and “/d/”.

The acceptance criteria then becomes:

a) If both targets scored greater than the threshold, this means the full word was said and the answer should be considered incorrect because the audio contains the word.

b) If only the isolated sound score is above the threshold and the word score is low, this means the isolated sound only was said and the answer should be considered correct.

Use cases and examples

Phoneme isolation

Here’s how to create the targets for a phoneme isolation use case (i.e., when a student is asked to isolate a single sound from within a word).

Example: “Tell me the starting sound in the word DOG.”

Input targets: 

  • <sound-out pronunciation="d">d</sound-out>

  • dog

JSON outputs: 

  • one quality_score value for “d”, 

  • token_type : “sound-out”

  • one quality_score value for “dog”, 

  • token_type : “word”

  • Example [audio was /d/]:

"results": [{
    "hypothesis_score": 61.0,
    "duration": 1.14,
    "hypothesis_duration": 0.36,
    "category": "d",
    "end": 0.36,
    "start": 0.0,
    "word_breakdown": [{
      "duration": 0.36,
      "quality_score": 61.0,
      "token_type": "sound-out",
      "end": 0.36,
      "start": 0.0,
      "phone_breakdown": [{
        "duration": 0.36,
        "quality_score": 49.0,
        "end": 0.36,
        "start": 0.0,
        "phone": "d"
      }],
      "word": "d",
      "target_transcription": "d"
    }]
  }, {
    "hypothesis_score": 10.0,
    "duration": 1.14,
    "hypothesis_duration": 0.75,
    "category": "dog",
    "end": 0.75,
    "start": 0.0,
    "word_breakdown": [{
      "duration": 0.75,
      "quality_score": 10.0,
      "token_type": "word",
      "end": 0.75,
      "start": 0.0,
      "phone_breakdown": [{
        "duration": 0.51,
        "quality_score": 60.0,
        "end": 0.51,
        "start": 0.0,
        "phone": "d"
      }, {
        "duration": 0.18,
        "quality_score": 12.0,
        "end": 0.69,
        "start": 0.51,
        "phone": "ao"
      }, {
        "duration": 0.06,
        "quality_score": 1.0,
        "end": 0.75,
        "start": 0.69,
        "phone": "g"
      }],
      "word": "dog",
      "target_transcription": "d ao g"
    }]
  }]

Phoneme blending

In this use case, a student repeats or reads blending of two or more phonic/phonemic sounds into one word (either real or nonsense words).

  • If the target word is in the dictionary, no markup tag is required to create the targets.

  • If the target word is not in the dictionary, the <custom-word> markup tag is required to create the targets.

Learn more about Custom Words and Pronunciations here: Custom Words and Pronunciations Examples

Example: “What word can we make when we put these sounds together?”

When a student is given a sequence of sounds (/b/ /aa/ /r/ /k/) to blend into a word (“bark“), typically, there are three possibilities:

  1. The student says the correct word bark ;

  2. The student says the sequence of sounds (/b/ /aa/ /r/ /k/) first, and then the target word bark;

  3. The student says the sequence of sounds (/b/ /aa/ /r/ /k/) only, instead of the target word.

A multi-target approach can be used to discriminate such similar targets. The target with the highest score is likely to be the one produced in the audio.

Input targets:

  • bark

  • <sound-out>bark</sound-out>

  • <sound-out>bark</sound-out> bark

Note: Any target word that is not in the dictionary should be entered with the <custom-word> tag.

JSON output: 

  • One quality_score value per target word.

  • token_type:  “word” or “sound-out“ (depending on the target)

  • Example [audio was “bark“]:

"results": [
    {
      "hypothesis_score": 93,
      "duration": 1.86,
      "hypothesis_duration": 0.81,
      "category": "bark",
      "end": 1.26,
      "start": 0.45,
      "word_breakdown": [
        {
          "duration": 0.81,
          "quality_score": 93,
          "token_type": "word",
          "end": 1.26,
          "start": 0.45,
          "phone_breakdown": [
            {
              "duration": 0.27,
              "quality_score": 96,
              "end": 0.72,
              "start": 0.45,
              "phone": "b"
            },
            {
              "duration": 0.09,
              "quality_score": 95,
              "end": 0.81,
              "start": 0.72,
              "phone": "aa"
            },
            {
              "duration": 0.27,
              "quality_score": 82,
              "end": 1.08,
              "start": 0.81,
              "phone": "r"
            },
            {
              "duration": 0.18,
              "quality_score": 86,
              "end": 1.26,
              "start": 1.08,
              "phone": "k"
            }
          ],
          "word": "bark",
          "target_transcription": "b aa r k"
        }
      ]
    },
    {
      "hypothesis_score": 11,
      "duration": 1.86,
      "hypothesis_duration": 0.81,
      "category": "bark",
      "end": 1.26,
      "start": 0.45,
      "word_breakdown": [
        {
          "duration": 0.81,
          "quality_score": 11,
          "token_type": "sound-out",
          "end": 1.26,
          "start": 0.45,
          "phone_breakdown": [
            {
              "duration": 0.27,
              "quality_score": 96,
              "end": 0.72,
              "start": 0.45,
              "phone": "b"
            },
            {
              "duration": 0.03,
              "quality_score": 4,
              "end": 1.08,
              "start": 1.05,
              "phone": "aa"
            },
            {
              "duration": 0.03,
              "quality_score": 1,
              "end": 1.14,
              "start": 1.11,
              "phone": "r"
            },
            {
              "duration": 0.09,
              "quality_score": 90,
              "end": 1.26,
              "start": 1.17,
              "phone": "k"
            }
          ],
          "word": "bark",
          "target_transcription": "b aa r k"
        }
      ]
    },
    {
      "hypothesis_score": 47,
      "duration": 1.86,
      "hypothesis_duration": 0.81,
      "category": "bark bark",
      "end": 1.26,
      "start": 0.45,
      "word_breakdown": [
        {
          "duration": 0,
          "quality_score": 0,
          "token_type": "sound-out",
          "end": -1,
          "start": -1,
          "phone_breakdown": [
            {
              "duration": 0,
              "quality_score": 0,
              "end": 0,
              "start": 0,
              "phone": "b"
            },
            {
              "duration": 0,
              "quality_score": 0,
              "end": 0,
              "start": 0,
              "phone": "aa"
            },
            {
              "duration": 0,
              "quality_score": 0,
              "end": 0,
              "start": 0,
              "phone": "r"
            },
            {
              "duration": 0,
              "quality_score": 0,
              "end": 0,
              "start": 0,
              "phone": "k"
            }
          ],
          "word": "bark",
          "target_transcription": "b aa r k"
        },
        {
          "duration": 0.81,
          "quality_score": 93,
          "token_type": "word",
          "end": 1.26,
          "start": 0.45,
          "phone_breakdown": [
            {
              "duration": 0.27,
              "quality_score": 96,
              "end": 0.72,
              "start": 0.45,
              "phone": "b"
            },
            {
              "duration": 0.09,
              "quality_score": 95,
              "end": 0.81,
              "start": 0.72,
              "phone": "aa"
            },
            {
              "duration": 0.27,
              "quality_score": 82,
              "end": 1.08,
              "start": 0.81,
              "phone": "r"
            },
            {
              "duration": 0.18,
              "quality_score": 86,
              "end": 1.26,
              "start": 1.08,
              "phone": "k"
            }
          ],
          "word": "bark",
          "target_transcription": "b aa r k"
        }
      ]
    }
  ]

Phoneme manipulation

Phoneme manipulation is when a student is asked to modify, change, or move the individual sounds in a word, often to create a new word.

Example: “What word can be formed by removing the first sound in HEAR?”

When a student is asked to remove the starting sound in a word (e.g., “hear”), typically, there are three potential responses:

  1. The student says the correct word “ear”;

  2. The student says the given word “hear”;

  3. The student says some phonemes of the given word and optionally the target word.

A multi-target approach can be used to discriminate such similar targets. The target with the highest score is likely to be the one produced in the audio. If both have high scores, it’s likely that the longest word was said in audio (the shorter word is always part of the longer one).

Input targets:

  • ear

  • hear ear

  • <sound-out>hear</sound-out> ear

Note: Any word in the targets above that is not in the dictionary should be entered with the <custom-word> tag.

JSON output: 

  • One quality_score value per target word.

  • token_type:  “word” or “sound-out“ (depending on the target)

  • Example [audio was “ear“]:

"results": [
    {
      "hypothesis_score": 87,
      "duration": 1.62,
      "hypothesis_duration": 0.48,
      "category": "ear",
      "end": 1.08,
      "start": 0.6,
      "word_breakdown": [
        {
          "duration": 0.48,
          "quality_score": 87,
          "token_type": "word",
          "end": 1.08,
          "start": 0.6,
          "phone_breakdown": [
            {
              "duration": 0.24,
              "quality_score": 74,
              "end": 0.84,
              "start": 0.6,
              "phone": "iy"
            },
            {
              "duration": 0.24,
              "quality_score": 91,
              "end": 1.08,
              "start": 0.84,
              "phone": "r"
            }
          ],
          "word": "ear",
          "target_transcription": "iy r"
        }
      ]
    },
    {
      "hypothesis_score": 28,
      "duration": 1.62,
      "hypothesis_duration": 0.45,
      "category": "hear",
      "end": 1.08,
      "start": 0.63,
      "word_breakdown": [
        {
          "duration": 0.45,
          "quality_score": 28,
          "token_type": "word",
          "end": 1.08,
          "start": 0.63,
          "phone_breakdown": [
            {
              "duration": 0.06,
              "quality_score": 2,
              "end": 0.69,
              "start": 0.63,
              "phone": "hh"
            },
            {
              "duration": 0.12,
              "quality_score": 94,
              "end": 0.81,
              "start": 0.69,
              "phone": "iy"
            },
            {
              "duration": 0.27,
              "quality_score": 95,
              "end": 1.08,
              "start": 0.81,
              "phone": "r"
            }
          ],
          "word": "hear",
          "target_transcription": "hh iy r"
        }
      ]
    },
    {
      "hypothesis_score": 44,
      "duration": 1.62,
      "hypothesis_duration": 0.48,
      "category": "hear ear",
      "end": 1.08,
      "start": 0.6,
      "word_breakdown": [
        {
          "duration": 0,
          "quality_score": 0,
          "token_type": "sound-out",
          "end": -1,
          "start": -1,
          "phone_breakdown": [
            {
              "duration": 0,
              "quality_score": 0,
              "end": 0,
              "start": 0,
              "phone": "hh"
            },
            {
              "duration": 0,
              "quality_score": 0,
              "end": 0,
              "start": 0,
              "phone": "iy"
            },
            {
              "duration": 0,
              "quality_score": 0,
              "end": 0,
              "start": 0,
              "phone": "r"
            }
          ],
          "word": "hear",
          "target_transcription": "hh iy r"
        },
        {
          "duration": 0.48,
          "quality_score": 87,
          "token_type": "word",
          "end": 1.08,
          "start": 0.6,
          "phone_breakdown": [
            {
              "duration": 0.24,
              "quality_score": 74,
              "end": 0.84,
              "start": 0.6,
              "phone": "iy"
            },
            {
              "duration": 0.24,
              "quality_score": 91,
              "end": 1.08,
              "start": 0.84,
              "phone": "r"
            }
          ],
          "word": "ear",
          "target_transcription": "iy r"
        }
      ]
    }
  ]

Phoneme segmentation

In a use case where a student is required to read or call out all the letter sounds (i.e., phonemes) in a word, the word can be used as the target.

The <sound-out> markup is used to create these targets: <sound-out>dog</sound-out>.

Note: The pronunciation attribute is required only when using a target that is not in the dictionary.

Example: “Tell me ALL SOUNDS in the word dog”

Input target:

<sound-out>dog</sound-out> 

JSON output:

  • Phonemes are scored, as they were produced singularly (not as part of a word)

  • One quality_score value for the word “dog” that is the combination of the isolated phoneme quality_score values.

  • token_type : “sound-out”

  • Example [audio was /d ao g/]:

"results": [
    {
      "hypothesis_score": 71,
      "duration": 3.48,
      "hypothesis_duration": 1.86,
      "category": "dog",
      "end": 2.55,
      "start": 0.69,
      "word_breakdown": [
        {
          "duration": 1.86,
          "quality_score": 71,
          "token_type": "sound-out",
          "end": 2.55,
          "start": 0.69,
          "phone_breakdown": [
            {
              "duration": 0.24,
              "quality_score": 64,
              "end": 0.93,
              "start": 0.69,
              "phone": "d"
            },
            {
              "duration": 0.3,
              "quality_score": 56,
              "end": 1.89,
              "start": 1.59,
              "phone": "ao"
            },
            {
              "duration": 0.15,
              "quality_score": 61,
              "end": 2.55,
              "start": 2.4,
              "phone": "g"
            }
          ],
          "word": "dog",
          "target_transcription": "d ao g"
        }
      ]
    }
  ]

Phonemes and letter names that (usually) overlap

The table below lists common letter sounds that may return a high score when the letter name is said instead, as the letter sound is part of the letter name.

Letter

Name in Arpabet Phonemes

Sound in Arpabet Phonemes

1

B

b iy

b

2

D

d iy

d

3

F

eh f

f

4

J

jh ey

jh

5

K

k ey

k

6

L

eh l

l

7

M

eh m

m

8

N

eh n

n

9

P

p iy

p

10

R

aa r

r

11

S

eh s

s

12

T

t iy

t

13

V

v iy

v

14

X

eh k s

k s

15

Z

z iy (US) / z eh d (IE/UK)

z

Targeting specific vowel sounds

In early literacy education, a common task is to ensure the student knows the various ways the English vowel letters can be pronounced. This is sometimes referred to as “long” and “short” sounds in English phonics.

Because English vowel letters represent a number of different sounds, translating the English orthography to SoapBox-Engine-compatible Arpabet symbols can be challenging. The best way around this is to think about the sounds rather than the orthography and use the Arpabet to IPA chart to determine what sound is intended.

There is great vowel variety between and within English dialects. Therefore, the best way to determine what Arpabet vowel to use, is via the IPA to Arpabet chart here: https://docs.soapboxlabs.com/resources/linguistics/arpabet-to-ipa/

If a more direct correlation between standard “long” and “short” sounds and Arpabet is needed, a general guide is provided below.

There are a number of phonics systems that have different methods of teaching the English consonants and vowels. The guide below may or may not adhere to the phonics system utilized by a given user. This is meant as a general guidance tool only. The https://docs.soapboxlabs.com/resources/linguistics/arpabet-to-ipa/ resource should be used for a more fine-grained approach.

Long Vowels

Word Examples

Arpabet Approximation

Short Vowels

Word Examples

Arpabet Approximation

A

late; aid

ey

A

ant; hat

ae

E

meet; eagle

iy

E

ebb; hem

eh

I

ride; kite

ay

I

igloo; hit

ih

O

go; know

ow

O

oxen; hot

aa

U

universe; use

y uw

U

up; hut

ah

A common vowel confusion is with the letter A, as it relates to up to 9 different sounds in English, depending on the specific dialect. Below is a table of different sounds that A can make, which can be used to assist in choosing the most suitable phoneme targets.

“A” Word(s)

Arpabet Approximation

IPA

ant; apple; fan

ae

æ

angel; apron; place

ey

asleep; zebra, umbrella

ah

ə

many; any; said

eh

e

village; pillage; cabbage

ih (sometimes reduced to ah)

ɪ (sometimes reduced to ə)

area; parent; fairy

eh

ɛə

father; palm

aa

ɑ

wash; swan

aa

ɒ (primarily in UK English)

ball; water

ao (some dialects use aa)

ɔ (some dialects use ɑ)