Fluency - Prosody Features
What is prosody?
Prosody is often described as the “music” of a language, or the way that a language “flows”. Linguistically, it’s a combination of many aspects of language, including pitch, intonation, and stress. In edtech, pauses and timing (sometimes referred to as “phrasing” or “chunking”) also fall under the category of prosody.
Why is prosody important for reading fluency?
Reading fluently is not just reading quickly. As a reader develops their fluency, how expressively they read is as important as how quickly they read. Fluency is broadly measured through:
accuracy
rate
expression
Expression is measured through prosody rubrics.
Prosody is linked to comprehension: when a student understands what they’re reading or saying, their emphasis and expression reflect this and add a layer of meaning to their words.
Prosody is a key data point in measuring fluency. Since it doesn’t develop at the same pace as other fluency skills, it requires independent progress reporting.
Why is prosody important for second language learning?
Prosody plays an important role in speech comprehension and production. It adds richness and meaning, and it differs from language to language. It is therefore key to sounding fluent and being understood.
Here’s how one leading English language learning provider describes the importance of prosody for language fluency: “Being understood in a language goes beyond vocabulary and pronunciation. A fluent speaker produces stretches of language with a fairly even tempo and some minor hesitations.”
What is pitch?
Pitch is one important aspect of prosody. In language, pitch is the “mental sensation, or perceptual correlate, of fundamental frequency” (Jurafsky & Martin, 2009, p. 234). This means pitch is the way we perceive a speaker’s fundamental frequency, which is also referred to as “F0”. F0 is understood as “[t]he frequency of the vocal fold vibration, or the frequency of the complex wave” (Jurafsky & Martin, 2009, p. 232).
Perceptually, this corresponds to the “tone” or the “note” at which someone speaks at a particular moment in time.
How is pitch measured?
Pitch (or fundamental frequency, F0) is typically measured in Hertz (Hz). Adult male voices fall between 80 and 175 Hz on average, while adult female voices fall between 160 and 270 Hz on average. Children’s voices fall between 215 and 400 Hz on average (although sources give varying numbers for the average children’s range).
How is pitch visualized?
In linguistic/speech research, pitch is often visualized by displaying its contour in relation to the spoken words. A pitch contour, also called a pitch track, is a plot of F0 over time, where the y-axis is F0 in Hz and the x-axis is time. However, pitch can be visualized in many ways, depending on the use case. In particular, the pitch visualization needs of edtech companies and teachers may differ from those of speech researchers in academia.
Why is pitch important for prosody?
When pitch is extracted over an entire audio file, we can see the ups and downs of a speaker’s tone throughout that utterance. This allows us, among other things, to see if a speaker raised their pitch at the end of a question, or dropped their pitch at the end of a statement. It also allows us to estimate how expressive the speaker’s speech was. Did their voice undulate as they spoke/read, which might indicate a certain level of interest, understanding, and fluency? Or did the speaker’s intonation stay relatively neutral, which might indicate less interest, understanding, and fluency?
To see how pitch correlates more broadly with phrasing, see here and here.
How is pitch represented in the JSON response?
Pitch is represented as a series of pitch values in Hertz at word level for each word in the audio file. Pitch values are returned at fixed intervals. The interval duration can be derived by dividing the duration of the word by the number of elements in the pitch array. For example, if 75 pitch values are returned for a word and its start and end times are 2.04 s and 2.79 s respectively, the word duration is 0.75 s, so each pitch value is separated by 0.75 s / 75 = 0.01 s = 10 ms. The default interval duration is 10 ms.
Pitch values range from 50 to 600 Hz in order to capture the pitch of typical adults and children, as well as values that fall outside the expected range for these demographics. See the JSON response below for an illustration of pitch in the JSON:
{ "confidence": 99.5146484375, "duration": 0.75, "end": 2.79, "phone_breakdown": [{ "duration": 0.21, "end": 2.25, "phone": "l", "score": 99.620590209961, "start": 2.04, "top10_phones": [] }, { "duration": 0.21, "end": 2.46, "phone": "ay", "score": 98.380950927734, "start": 2.25, "top10_phones": [] }, { "duration": 0.33, "end": 2.79, "phone": "k", "score": 99.893486022949, "start": 2.46, "top10_phones": [] }], "phonetic_transcription": "l ay k", "pitch": { # Pitch values are displayed here "values": [289, 299, 306, 306, 299, 288, 274, 266, 259, 253, 243, 244, 259, 253, 246, 242, 237, 232, 230, 230, 228, 226, 225, 223, 221, 221, 221, 222, 222, 222, 222, 220, 217, 214, 213, 213, 213, 205, 190, 190, 188, 196, 204, 211, 215, 219, 221, 222, 223, 223, 223, 223, 223, 224, 225, 226, 228, 229, 230, 231, 232, 233, 234, 236, 237, 238, 239, 240, 242, 243, 244, 245, 246, 248, 249] }, "start": 2.04, "time_since_previous": 0.24, "word": "like" }
How do I match pitch values to specific phonemes in a word?
Pitch values are returned at word-level, and there may be instances where it’s necessary to map pitch values to specific phonemes in a given word. The word and phoneme time stamps together with the pitch interval duration can help to link the pitch values to the text.
Here is some useful information regarding the pitch-value-to-phoneme mapping:
1000 ms = 1 s
1 pitch value per fixed interval (default 10 ms, or 0.01 s)
Using the JSON above as an example, if we want to determine which pitch values correlate to which phonemes in the word like, we can split the word-level pitch values proportionally to the phoneme durations. For example, we can determine the corresponding indexes in the pitch values array for a given phoneme by using its start and end keys with the following formula:
start_pitch_index = ((start_of_phoneme / pitch_interval) - (start_of_word / pitch_interval)) - 1
end_pitch_index = ((end_of_phoneme / pitch_interval) - (start_of_word / pitch_interval)) - 1
With this, we can get the start and end index of the phoneme ay:
start_pitch_index = ((2.25 / 0.01) - (2.04 / 0.01)) - 1
start_pitch_index = (225 - 204) - 1
start_pitch_index = 20

end_pitch_index = ((2.46 / 0.01) - (2.04 / 0.01)) - 1
end_pitch_index = (246 - 204) - 1
end_pitch_index = 41
With the start and end indexes found, we can get a subarray between pitch[20] and pitch[41], which contains the pitch values for ay.
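A minimal sketch of this mapping, assuming the word-level object (as in the “like” example above) has been parsed into a Python dict and the default 10 ms pitch interval applies:

Python code
def phoneme_pitch_values(word_object, target_phone, pitch_interval=0.01):
    # Apply the index formula above to return the slice of word-level pitch
    # values that covers the given phoneme.
    word_start = word_object["start"]
    values = word_object["pitch"]["values"]
    for phone in word_object["phone_breakdown"]:
        if phone["phone"] == target_phone:
            start_idx = round(phone["start"] / pitch_interval - word_start / pitch_interval) - 1
            end_idx = round(phone["end"] / pitch_interval - word_start / pitch_interval) - 1
            # Clamp at 0 for the word-initial phoneme, where the formula yields -1
            return values[max(start_idx, 0):end_idx + 1]
    return []

# For the "like" example above, phoneme_pitch_values(word_object, "ay")
# returns pitch[20] through pitch[41].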
Note: No pitch values are returned when there is silence and/or when the speaker is not speaking.
How do I account for pitch when there is silence?
Particularly when concatenating pitch values for several words or a whole utterance, customers may wish to have a pitch value for every moment of time throughout the sentence, utterance, or chunk of words. Having this allows for easier graphing and more fine-grained analysis. However, pitch values only exist when there is human speech and, therefore, are not returned when there are pauses in between words, as pause pitch values should not be included in the pitch statistics computation.
To account for this, zeroes can be inserted into the array wherever there is silence in the audio, since silence is typically represented by zeroes.
The length of the pause(s) should be determined using the time_since_previous variable. Note that this variable gives the time in seconds during which there is no speech between the current word and the previous word. The exact time in the audio where the pause occurred can be determined by using the end variable of the previous word and the start variable of the current word.
For example, in the JSON sample below, there is a 2.07 second pause in between Word 1 (annoying) and Word 2 (they), which starts at 2.13 seconds and ends at 4.2 seconds:
{ "normalised_word": "very", "reference_index": 0, "reference_word": "Very", "sub_types": {}, "transcription_details": { "confidence": 99.458404541016, "end": 1.56, "phone_breakdown": [{ "duration": 0.09, "end": 1.26, "phone": "v", "score": 0.09, "start": 1.17 }, { "duration": 0.12, "end": 1.38, "phone": "eh", "score": 0.12, "start": 1.26 }, { "duration": 0.09, "end": 1.47, "phone": "r", "score": 0.09, "start": 1.38 }, { "duration": 0.09, "end": 1.56, "phone": "iy", "score": 0.09, "start": 1.47 } ], "pitch": { "values": [263, 262, 260, 259, 258, 258, 259, 260, 262, 264, 267, 268, 270, 271, 271, 270, 266, 263, 260, 258, 256, 253, 248, 244, 240, 238, 236, 233, 231, 229, 226, 225, 224, 223, 222, 222, 221, 220, 220] }, "start": 1.17, "time_since_previous": 0 }, "transcription_index": 0, "transcription_word": "very", "type": "CORRECT" }, { "normalised_word": "annoying", "reference_index": 1, "reference_word": "annoying.", "sub_types": {}, "transcription_details": { "confidence": 93.530090332031, "end": 2.13, # the pause begins at 2.13 seconds "phone_breakdown": [{ "duration": 0.06, "end": 1.62, "phone": "ah", "score": 0.06, "start": 1.56 }, { "duration": 0.06, "end": 1.68, "phone": "n", "score": 0.06, "start": 1.62 }, { "duration": 0.24, "end": 1.92, "phone": "oy", "score": 0.24, "start": 1.68 }, { "duration": 0.06, "end": 1.98, "phone": "ih", "score": 0.06, "start": 1.92 }, { "duration": 0.15, "end": 2.13, "phone": "ng", "score": 0.15, "start": 1.98 } ], "pitch": { "values": [220, 220, 220, 221, 223, 225, 226, 226, 228, 229, 231, 233, 233, 231, 230, 232, 233, 233, 233, 232, 230, 229, 228, 225, 223, 220, 214, 207, 200, 193, 187, 183, 183, 182, 180, 177, 175, 174, 174, 175, 176, 178, 180, 182, 183, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194] }, "start": 1.56, "time_since_previous": 0 }, "transcription_index": 1, "transcription_word": "annoying", "type": "CORRECT" }, { "normalised_word": "they", "reference_index": 2, "reference_word": "They", "sub_types": {}, "transcription_details": { "confidence": 99.990058898926, "end": 4.38, "phone_breakdown": [{ "duration": 0.09, "end": 4.29, "phone": "dh", "score": 0.09, "start": 4.2 }, { "duration": 0.09, "end": 4.38, "phone": "ey", "score": 0.09, "start": 4.29 } ], "pitch": { "values": [278, 276, 275, 274, 276, 275, 271, 271, 271, 271, 270, 268, 267, 267, 270, 274, 275, 274] }, "start": 4.2, # the pause ends at 4.2 seconds check for duration of pause here "time_since_previous": 2.07 }, "transcription_index": 2, "transcription_word": "they", "type": "CORRECT" }
In order to assign pitch values of zero to the pause, we first need to determine how many pitch values equal 2.07 seconds when the pitch interval is the default 0.01 s. We can use the equation described here to determine how many pitch values we need:
2.07 s / 0.01 s = 207 pitch values
Therefore, we know that we want to insert 207 zeroes between the pitch values for the word “annoying” and the word “they” in order to create a timeline of pitch that covers the whole sentence, including pauses. Doing so creates the following pitch value array that can be used for further analysis/graphing:
sentence_values = [ #first word 263,262,260,259,258,258,259,260,262,264,267,268,270,271,271,270,266,263,260,258,256,253,248,244,240,238,236,233,231,229,226,225,224,223,222,222,221,220,220, #second word 220,220,220,221,223,225,226,226,228,229,231,233,233,231,230,232,233,233,233,232,230,229,228,225,223,220,214,207,200,193,187,183,183,182,180,177,175,174,174,175,176,178,180,182,183,183,184,185,186,187,188,189,190,191,192,193,194, # silence between second and third word 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # third word 278,276,275,274,276,275,271,271,271,271,270,268,267,267,270,274,275,274 ]
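The same logic can be applied programmatically. Below is a minimal sketch that walks the full text_score (assuming each entry carries its pitch and timing under transcription_details, as in the sample above) and inserts zeroes for every pause indicated by time_since_previous:

Python code
def full_pitch_timeline(results, pitch_interval=0.01):
    # Concatenate word-level pitch values for the whole response,
    # inserting zeroes wherever there is a pause between words.
    timeline = []
    for word_object in results["text_score"]:
        details = word_object["transcription_details"]
        pause = details.get("time_since_previous", 0)
        if pause > 0:
            timeline.extend([0] * round(pause / pitch_interval))  # e.g. 2.07 s -> 207 zeroes
        timeline.extend(details["pitch"]["values"])
    return timeline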
How do I get the pitch values for a sentence?
Getting the pitch values for a sentence is a straightforward process of sequentially concatenating the pitch values of all the words in the sentence. But, before this can be done, the sentence start and end must be determined. Typically, this is based on punctuation (see here for recommendations on how to do this). Once the sentence start and end boundaries are defined, iterate through every word within those defined limits and concatenate all relevant pitch: {values[]} objects in sequential order.
There may be pauses in between words. This should be checked by using the time_since_previous variable of each word, and accounted for as desired. See here for suggestions.
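One possible sketch, assuming sentence boundaries are taken from end punctuation in reference_word and pauses are filled with zeroes as described above:

Python code
SENTENCE_END_MARKS = (".", "?", "!")

def sentence_pitch_values(results, pitch_interval=0.01):
    # Split the text_score into sentences on end punctuation and return one
    # concatenated pitch array per sentence, with zeroes inserted for pauses.
    sentences, current = [], []
    for word_object in results["text_score"]:
        details = word_object["transcription_details"]
        pause = details.get("time_since_previous", 0)
        if current and pause > 0:
            current.extend([0] * round(pause / pitch_interval))
        current.extend(details["pitch"]["values"])
        if any(mark in word_object["reference_word"] for mark in SENTENCE_END_MARKS):
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences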
How do I get the pitch values for an entire audio file?
Getting the pitch values for an entire audio file is a simple matter of iterating through the text_score object and concatenating the pitch: {values[]} object in sequential order for each word. See here for what to do when there are pauses/silences throughout the utterance and you want to account for these in the concatenated pitch values array.
There may be pauses in between words. This should be checked by using the time_since_previous variable of each word, and accounted for as desired. See here for suggestions.
Advanced use case: How do I remove values for unvoiced phonemes?
It should be noted that in the JSON response, we return pitch values for both voiced and unvoiced consonants.
A voiced consonant is defined as a consonant where there is a vibration in the vocal folds.
An unvoiced consonant is defined as a consonant where there is no vibration in the vocal folds.
Vowels are always voiced.
As noted here, pitch is the perception of F0, which is the frequency at which the vocal folds vibrate. Therefore, since unvoiced consonants have no vibration, they have no F0 (i.e., pitch) values. The pitch values we return for unvoiced consonants are an estimate based on the surrounding voiced phonemes. We do this for several reasons:
Including estimates for unvoiced consonants allows for a smoother modelling of the pitch over time.
Including estimates for unvoiced consonants simplifies important calculations (described throughout this document), without making the calculations less accurate.
Most customers’ needs will be best served by not having zeroes in the pitch values object.
It provides flexibility in determining phoneme boundaries, allowing co-articulation and approximation effects from voiced to unvoiced phonemes to be accounted for, if desired.
However, there may be instances where a more fine-grained, linguistic analysis of the data is needed/desired, and a customer may wish to remove the pitch values for unvoiced consonants.
In order to do this, first identify which phonemes in the target word’s phone_breakdown object are unvoiced. Below is a list of all unvoiced phonemes in the SoapBox phoneme set:
Unvoiced Phonemes
ch | f | hh
k | p | s
sh | t | th
Next, align word-level pitch values to individual phonemes, as described here. For every pitch value associated with a phoneme in the unvoiced list above, change that pitch value to 0.
In this way, the pitch array of the word “her” from the earlier JSON example would change from:
# original pitch array
348, 336, 324, 313, 302, 294, 282, 267, 253, 242, 239, 240, 244, 248, 253, 258, 263, 267
to
# modified pitch array
0, 0, 0, 0, 0, 0, 0, 0, 0, 242, 239, 240, 244, 248, 253, 258, 263, 267
# the first 9 values have changed to zero where the phoneme is unvoiced;
# the remaining values are unchanged where the phoneme is voiced
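A minimal sketch of this procedure, assuming the word-level object shape shown earlier (top-level pitch, start, and phone_breakdown) and the default 10 ms pitch interval:

Python code
UNVOICED_PHONES = {"ch", "f", "hh", "k", "p", "s", "sh", "t", "th"}

def zero_unvoiced_pitch(word_object, pitch_interval=0.01):
    # Return a copy of the word's pitch values with the spans belonging to
    # unvoiced phonemes set to 0, using the same index formula as above.
    values = list(word_object["pitch"]["values"])
    word_start = word_object["start"]
    for phone in word_object["phone_breakdown"]:
        if phone["phone"] in UNVOICED_PHONES:
            start_idx = max(round(phone["start"] / pitch_interval - word_start / pitch_interval) - 1, 0)
            end_idx = round(phone["end"] / pitch_interval - word_start / pitch_interval) - 1
            for i in range(start_idx, min(end_idx + 1, len(values))):
                values[i] = 0
    return values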
What is phrasing?
“Phrasing” or “chunking” is a term used in literacy education that refers to the appropriate pauses and grouping of words within a sentence as one reads. Phrasing is the ability of a reader to understand which words go together as short phrases and which do not, such as "in the ocean," "the baby seal," "he runs," and "along came a spider." Additionally, it involves recognizing when to pause after a period, comma, semicolon, etc., and adjusting intonation accordingly, such as raising intonation for questions and lowering it for periods.
How is phrasing measured?
Phrasing is often measured intuitively by a teacher, typically using punctuation to determine where phrase breaks, pauses, and changes in intonation should be.
Phrasing information can be derived from the output time alignment by checking the position and duration of the pauses detected in the audio.
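For example, pauses can be listed together with their position and duration using a short sketch like the one below (the 0.2 second minimum is only an illustrative value):

Python code
def detect_pauses(results, min_pause=0.2):
    # Return (pause_start, pause_end, duration) tuples derived from the
    # word-level time alignment in the text_score object.
    pauses = []
    previous_end = None
    for word_object in results["text_score"]:
        details = word_object["transcription_details"]
        if previous_end is not None and details["start"] - previous_end >= min_pause:
            pauses.append((previous_end, details["start"],
                           round(details["start"] - previous_end, 2)))
        previous_end = details["end"]
    return pauses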
Why is phrasing important for prosody?
Good phrasing makes the meaning of a phrase clear by chunking words in a meaningful way. The correct use of pauses aligned to punctuation is a sign of text comprehension. Reading sounds like normal speech with appropriate pauses between phrases, clauses, and sentences.
For second language learners, phrasing means the language flows naturally and the meaning is clear to fluent speakers.
Use cases in prosody
SoapBox’s prosody feature offers a number of aspects that can be combined to create a comprehensive analysis of a read passage or used individually depending on a client’s specific needs. The most common use cases for prosody are described in this section.
The methods that are described in the following sections are meant to be examples of how SoapBox’s pitch and phrasing datapoints can be interpreted and used to derive useful insights on audio prosody quality. However, these methods are not intended as the best or only possible approaches. Numerous other analysis and data aggregation techniques can be used to report on prosody quality differently according to your needs.
Use case 1: Punctuation-based timing
One important aspect of prosody is ensuring that students pause at appropriate points in the reading passage. These appropriate points are typically demarcated by punctuation. Periods, commas, question marks, and exclamation marks are the most common demarcations.
In order to check if a learner paused appropriately at particular punctuation marks, these punctuation mark boundaries must first be identified. This can be done by checking the value of the reference_word variable in the JSON response.
For example, if a period is considered an appropriate end-of-sentence marker where a pause should follow, iterate through each reference_word in text_score in the JSON response and check if a “.” is contained in the reference_word. It is important to check the reference_word and not the normalised_word or the transcription_word, as the latter two fields have all punctuation stripped. Only reference_word retains the original punctuation from the text.
This method is reliant on the reference text containing desired punctuation.
Pseudocode:
for word_object in results["text_score"]:
    if "." in word_object["reference_word"]:
        # check prosody datapoints
After identifying a word with the target punctuation, whether or not the learner paused afterwards needs to be established. For this, the time_since_previous attribute of the word after the word with punctuation can be used. This establishes how much time, in seconds, passed between the word with the punctuation mark and the next word.
A threshold relevant to the task/situation can then be set to determine if the learner paused long enough, paused too long, or didn’t pause at all.
See the following JSON response for the variables that should be checked in the example “Very annoying. They”:
{ "normalised_word": "very", "reference_index": 0, "reference_word": "Very", #No punctuation in reference_word "sub_types": {}, "transcription_details": { "confidence": 99.458404541016, "end": 1.56, "phone_breakdown": [{ "duration": 0.09, "end": 1.26, "phone": "v", "score": 0.09, "start": 1.17 }, { "duration": 0.12, "end": 1.38, "phone": "eh", "score": 0.12, "start": 1.26 }, { "duration": 0.09, "end": 1.47, "phone": "r", "score": 0.09, "start": 1.38 }, { "duration": 0.09, "end": 1.56, "phone": "iy", "score": 0.09, "start": 1.47 } ], "pitch": { "values": [263, 262, 260, 259, 258, 258, 259, 260, 262, 264, 267, 268, 270, 271, 271, 270, 266, 263, 260, 258, 256, 253, 248, 244, 240, 238, 236, 233, 231, 229, 226, 225, 224, 223, 222, 222, 221, 220, 220] }, "start": 1.17, "time_since_previous": 0 }, "transcription_index": 0, "transcription_word": "very", "type": "CORRECT" }, { "normalised_word": "annoying", "reference_index": 1, "reference_word": "annoying.", #Punctuation in reference_word "sub_types": {}, "transcription_details": { "confidence": 93.530090332031, "end": 2.13, "phone_breakdown": [{ "duration": 0.06, "end": 1.62, "phone": "ah", "score": 0.06, "start": 1.56 }, { "duration": 0.06, "end": 1.68, "phone": "n", "score": 0.06, "start": 1.62 }, { "duration": 0.24, "end": 1.92, "phone": "oy", "score": 0.24, "start": 1.68 }, { "duration": 0.06, "end": 1.98, "phone": "ih", "score": 0.06, "start": 1.92 }, { "duration": 0.15, "end": 2.13, "phone": "ng", "score": 0.15, "start": 1.98 } ], "pitch": { "values": [220, 220, 220, 221, 223, 225, 226, 226, 228, 229, 231, 233, 233, 231, 230, 232, 233, 233, 233, 232, 230, 229, 228, 225, 223, 220, 214, 207, 200, 193, 187, 183, 183, 182, 180, 177, 175, 174, 174, 175, 176, 178, 180, 182, 183, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194] }, "start": 1.56, "time_since_previous": 0 }, "transcription_index": 1, "transcription_word": "annoying", "type": "CORRECT" }, { "normalised_word": "they", "reference_index": 2, "reference_word": "They", "sub_types": {}, "transcription_details": { "confidence": 99.990058898926, "end": 4.38, "phone_breakdown": [{ "duration": 0.09, "end": 4.29, "phone": "dh", "score": 0.09, "start": 4.2 }, { "duration": 0.09, "end": 4.38, "phone": "ey", "score": 0.09, "start": 4.29 } ], "pitch": { "values": [278, 276, 275, 274, 276, 275, 271, 271, 271, 271, 270, 268, 267, 267, 270, 274, 275, 274] }, "start": 4.2, # check the time_since_previous variable of the word after the word # with puncutation "time_since_previous": 2.07 }, "transcription_index": 2, "transcription_word": "they", "type": "CORRECT" }
See below for example pseudocode:
Pseudocode
for current_word_idx, word_object in enumerate(results["text_score"]):
    if "." in word_object["reference_word"]:
        next_word = results["text_score"][current_word_idx + 1]
        pause = next_word["transcription_details"]["time_since_previous"]
        if minimum_pause_threshold <= pause <= maximum_pause_threshold:
            Learner Paused
        else:
            Learner Did Not Pause (or Paused Too Long)
The pause thresholds can be set to whatever is appropriate for a given task. For demonstration purposes, SoapBox has used the following thresholds:
Punctuation | Minimum Threshold | Maximum Threshold |
Comma, semicolon | 0.2 seconds | 1 second |
Period, question mark, exclamation mark, colon | 0.2 seconds | 2 seconds |
Should customers choose to use these thresholds in their own implementations of phrasing, we highly recommend testing and validating the thresholds on relevant evaluation data to ensure desired performance.
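Putting the pieces together, the following sketch combines the pseudocode above with the demonstration thresholds in the table (illustrative values only, to be validated on your own data):

Python code
# Illustrative (minimum, maximum) pause thresholds in seconds, per the table above
PAUSE_THRESHOLDS = {
    ",": (0.2, 1.0), ";": (0.2, 1.0),
    ".": (0.2, 2.0), "?": (0.2, 2.0), "!": (0.2, 2.0), ":": (0.2, 2.0),
}

def check_punctuation_pauses(results):
    # For each word whose reference_word carries punctuation, report whether the
    # learner paused within the thresholds before the next word.
    words = results["text_score"]
    report = []
    for idx, word_object in enumerate(words[:-1]):
        marks = [m for m in PAUSE_THRESHOLDS if m in word_object["reference_word"]]
        if not marks:
            continue
        pause = words[idx + 1]["transcription_details"]["time_since_previous"]
        minimum, maximum = PAUSE_THRESHOLDS[marks[0]]
        report.append((word_object["reference_word"], pause, minimum <= pause <= maximum))
    return report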
Use case 2: Punctuation-based expression
Another important aspect of prosody is ensuring that students vary their intonation at appropriate points in the reading passage. These appropriate points are typically demarcated by punctuation, in particular question marks, where the intonation is expected to rise, and periods, where the intonation is expected to fall.
In order to check if a learner intoned appropriately at question marks and periods, these punctuation mark boundaries must first be identified. See Use Case 1 for how to do this.
Once a word with a question mark or period is identified, the pitch values can be used to check if the learner raised or lowered their intonation as expected. See the JSON response below for an example of relevant fields to check/use:
{ "normalised_word": "is", "reference_index": 64, "reference_word": "\"Is", # no question mark or period "sub_types": {}, "transcription_details": { "confidence": 99.950164794922, "end": 34.32, "phone_breakdown": [{ "duration": 0.18, "end": 34.23, "phone": "ih", "score": 0.18, "start": 34.05 }, { "duration": 0.09, "end": 34.32, "phone": "z", "score": 0.09, "start": 34.23 } ], "pitch": { "values": [221, 222, 223, 224, 225, 226, 228, 229, 231, 233, 232, 232, 232, 232, 233, 232, 230, 223, 215, 211, 210, 211, 214, 217, 220, 217, 210] }, "start": 34.05, "time_since_previous": 0.35 }, "transcription_index": 70, "transcription_word": "is", "type": "CORRECT" }, { "normalised_word": "it", "reference_index": 65, "reference_word": "it", # no question mark or period "sub_types": {}, "transcription_details": { "confidence": 99.66739654541, "end": 34.41, "phone_breakdown": [{ "duration": 0.06, "end": 34.38, "phone": "ih", "score": 0.06, "start": 34.32 }, { "duration": 0.03, "end": 34.41, "phone": "t", "score": 0.03, "start": 34.38 } ], "pitch": { "values": [198, 194, 183, 168, 160, 149, 143, 143, 151] }, "start": 34.32, "time_since_previous": 0 }, "transcription_index": 71, "transcription_word": "it", "type": "CORRECT" }, { "normalised_word": "mice", "reference_index": 66, "reference_word": "mice?\"", # question mark in reference_word "sub_types": {}, "transcription_details": { "confidence": 99.92845916748, "end": 35.13, "phone_breakdown": [{ "duration": 0.09, "end": 34.5, "phone": "m", "score": 0.09, "start": 34.41 }, { "duration": 0.27, "end": 34.77, "phone": "ay", "score": 0.27, "start": 34.5 }, { "duration": 0.36, "end": 35.13, "phone": "s", "score": 0.36, "start": 34.77 } ], "pitch": { # Use pitch values to see if there was a rise in intonation "values": [166, 178, 184, 189, 193, 195, 196, 195, 194, 194, 192, 191, 190, 189, 188, 187, 188, 189, 190, 192, 194, 197, 201, 206, 211, 219, 228, 243, 260, 283, 296, 336, 369, 384, 390, 392, 390, 388, 386, 386, 386, 384, 383, 381, 379, 377, 375, 373, 373, 373, 371, 369, 368, 366, 364, 362, 360, 360, 360, 360, 360, 360, 360, 360, 360, 360, 360, 360, 360, 360, 360, 360] }, "start": 34.41, "time_since_previous": 0 }, "transcription_index": 72, "transcription_word": "mice", "type": "CORRECT" }
One way to see if the pitch values for the word in question went appropriately up or down is to calculate the slope of the target word’s pitch values. The slope is a good reflection of how steep the incline or decline of the pitch was. In most coding languages, this is one or two lines of code.
Using Python and the numpy package, the slope of a pitch value array can be derived with the np.polyfit function:
Python code
import numpy as np

# 'values' contains the pitch values array for the target word from the JSON response
# 'timeStart' and 'timeEnd' are the start and end times of the target word
x_values = np.linspace(timeStart, timeEnd, len(values))
slope = np.polyfit(x_values, values, 1)[0]
Once the slope is calculated, a threshold can be used to determine if the incline or decline was sharp enough to qualify as a rise or fall in intonation:
Pseudo code
if slope >= rising_intonation_threshold:
    Learner Raised Intonation
elif slope <= falling_intonation_threshold:
    Learner Lowered Intonation
else:
    Intonation Did Not Rise or Fall
The rising/falling thresholds can be set to whatever is appropriate for a given task. For demonstration purposes, SoapBox has used the following thresholds:
Punctuation | Desired Result | Threshold |
Question mark | Inclining slope | 130 |
Period | Declining slope | -90 |
Should customers choose to use these thresholds in their own implementations of expression, we highly recommend testing and validating the thresholds on relevant evaluation data to ensure desired performance.
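As a worked sketch, the slope calculation and the demonstration thresholds above can be combined as follows (the slope is in Hz per second, since the x-axis is time in seconds; field names follow the JSON samples above):

Python code
import numpy as np

RISING_SLOPE_THRESHOLD = 130    # illustrative threshold for question marks
FALLING_SLOPE_THRESHOLD = -90   # illustrative threshold for periods

def pitch_slope(details):
    # Slope of a word's pitch values over its time span (Hz per second)
    values = details["pitch"]["values"]
    x_values = np.linspace(details["start"], details["end"], len(values))
    return np.polyfit(x_values, values, 1)[0]

def classify_intonation(word_object):
    # Check whether intonation rose at a question mark or fell at a period
    slope = pitch_slope(word_object["transcription_details"])
    if "?" in word_object["reference_word"]:
        return "rose as expected" if slope >= RISING_SLOPE_THRESHOLD else "did not rise"
    if "." in word_object["reference_word"]:
        return "fell as expected" if slope <= FALLING_SLOPE_THRESHOLD else "did not fall"
    return "no terminal punctuation"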
Use case 3: Estimating expressiveness
Pitch is often thought of as a measure of “expressiveness” in reading/speaking. Therefore, some customers may wish to use pitch to create an “expressiveness measure” for a whole audio file and/or individual sentences.
Passage-level expressiveness
To determine the level of expressiveness of a whole audio file, sequentially concatenate all pitch values in the JSON response.
Once all values are concatenated into a single array, calculate the standard deviation. In most coding languages, this is one line of code.
Using Python and the numpy package, standard deviation is computed as:
Python code
import numpy as np

std_deviation = np.std(concatenated_pitch_array)
After the standard deviation is calculated, a threshold can be used to determine the level of expressiveness. In general, if the standard deviation is a large number, then the learner varied their pitch substantially and was expressive. If there is low deviation, the learner did not vary their pitch much and was not very expressive.
Pseudo code
if std_dev > expressiveness_threshold:
    Learner Had High Expressiveness
elif std_dev <= expressiveness_threshold:
    Learner Had Low Expressiveness
The expressiveness thresholds can be set to whatever is appropriate for a given task. For demonstration purposes, SoapBox has used the following thresholds:
Expressiveness | Threshold |
Low | <= 52 |
High | > 52 |
Should customers choose to use these thresholds in their own implementations of expressiveness, we highly recommend testing and validating the thresholds on relevant evaluation data to ensure desired performance.
Sentence-level expressiveness
The methods for determining utterance-level expressiveness can also be extended to the sentence level, if desired. Instead of using the pitch values of the entire JSON response, the pitch values for a specific part of the JSON response will be used.
Sentence-level expressiveness is generally more subjective than file-level expressiveness. Therefore, SoapBox recommends using a file-level expressiveness metric rather than a sentence-level one to have a better correlation to human assessment.
In order to identify sentence-level expressiveness, sentence boundaries must first be determined. Typically, these are periods, question marks, and exclamation marks (see here for more on how to use punctuation as sentence boundaries).
We recommend only including sentences that are three or more words long. Anything shorter provides too little information to get an accurate expressiveness measure. If a sentence of fewer than three words must be checked for expressiveness, we recommend including the previous or following sentence’s values in the computations for the short sentence.
Once sentences are identified, the process for getting the standard deviation can be used on each sentence individually. However, the utterance-level thresholds may not be suitable for sentence-level evaluation.
The expressiveness thresholds can be set to whatever is appropriate for a given task. For demonstration purposes, SoapBox has used the following thresholds:
Expressiveness | Threshold |
Low | <= 26 |
High | > 26 |
Should customers choose to use this threshold in their own implementations of expressiveness, we highly recommend testing and validating the threshold on relevant evaluation data to ensure desired performance.
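A minimal sketch, assuming the sentence has already been split out as a list of text_score word objects (for example with the punctuation-based approach from Use case 1) and using the illustrative threshold above:

Python code
import numpy as np

SENTENCE_EXPRESSIVENESS_THRESHOLD = 26   # illustrative threshold from the table above

def sentence_expressiveness(sentence_words):
    # Classify expressiveness for one sentence, given its list of word objects.
    # Sentences shorter than three words are skipped, as recommended above.
    if len(sentence_words) < 3:
        return None
    values = []
    for word_object in sentence_words:
        values.extend(word_object["transcription_details"]["pitch"]["values"])
    std_dev = np.std(values)
    return "high" if std_dev > SENTENCE_EXPRESSIVENESS_THRESHOLD else "low"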
Use case 4: Minimum and maximum pitch values
One area of interest may be how high and/or low a learner’s pitch went either over the course of a whole utterance or a single sentence.
Determining the maximum and minimum pitch values involves concatenating all pitch values of interest (either all pitch values in a sentence or all pitch values in the whole utterance) and then checking for the smallest and greatest numbers in the sequence. (See here for recommendations on determining sentence boundaries.)
These values can then be reported as the minimum value and the maximum value of pitch for the utterance/sentence.
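A short sketch (assuming the pitch values of interest have already been concatenated, and that any zeroes inserted for silence should be ignored so that pauses do not register as the minimum):

Python code
def pitch_range(pitch_values):
    # Minimum and maximum pitch (Hz) of a concatenated pitch array,
    # ignoring zeroes that were inserted for silence.
    voiced = [value for value in pitch_values if value > 0]
    return (min(voiced), max(voiced)) if voiced else (None, None)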
Example of mapping prosody data to a rubric
Teachers want the most immediate and measurable data on student performance and progress. This includes ‘at a glance’ visualizations in dashboards and levelled rubrics.
In order to support customers integrating prosody into their learning and assessment, we have provided an example of mapping data to a rubric.
Read more here: