Example of mapping prosody data to a rubric

Overview

Prosody rubrics provide quick, clear, actionable information for educators. Trusted rubrics, such as the NAEP ORF score, are widely used and provide a standardized benchmark.

When building our prosody feature, we referenced a number of recognized rubrics to inform our use cases and create reference models of how our prosody JSON data can be used to generate student prosody scores. These reference models include guideline computations, which enable you to combine several data points to produce both individual prosody skill scores and an aggregate prosody score for a student. Used within a model, these scores can provide educators with levelled measures for their classes and students.

Sample rubric definition

Below is a sample rubric model that can be used to aggregate the various data points our prosody feature returns. This model shows how prosody reference points such as pitch, timing, and punctuation can be leveraged to generate a score mapped to a prosody rubric.

1

<25% of the words have high expressiveness.
Speech has mostly individual words.
The reading has low expressiveness in >75% of the passage.

2

25% to 50% of the words have high expressiveness.
Speech has some local word groupings.
The reading style is mostly arrhythmic or read with low expressiveness. 

3

>50% of the words have high expressiveness.
Speaking style follows the structure or meaning of words up to sentence level.
Intonation may reinforce grouping or be of low expressiveness.

4

>75% of the words have high expressiveness.
Speaking style expresses sentence structure and meaning.
Intonation is occasionally inconsistent, but not read with low expressiveness.

5

At least 90% of the words have high expressiveness.
Speaking style consistently expresses the structure and meaning of sentences, paragraphs, and passage.
May include some reading errors, but it is expressive throughout.

Definitions for this model:

High expressiveness: Indicates that pitch has sufficiently changed across a sentence or group of words.*

Low expressiveness: Indicates that pitch has not sufficiently changed across a sentence or group of words.*

Phrasing: Evaluates the number and the positions of pauses, hesitations, and punctuation.

Intonation: Evaluates the pitch variation across a passage.

*Sufficient changes: Variation above or below a pre-defined threshold as outlined in example models provided later in this document.

The engine output

As outlined in the Prosody section, the SoapBox engine can return several prosody-related data points, as demonstrated in this example JSON:

…	
	"confidence": 99.51,
	"duration": 0.75,
	"end": 2.79,
	"phone_breakdown": [{...}],
	"phonetic_transcription": "l ay k",
	"pitch": {
           "values": [289, 299, 306, 306, 299, 288, 274, 266, 259, 253, 243, 
           244, 259, 253, 246, 242, 237, 232, 230, 230, 228, 226, 225, 223, 
           221, 221, 221, 222, 222, 222, 222, 220, 217, 214, 213, 213, 213, 
           205, 190, 190, 188, 196, 204, 211, 215, 219, 221, 222, 223, 223, 
           223, 223, 223, 224, 225, 226, 228, 229, 230, 231, 232, 233, 234, 
           236, 237, 238, 239, 240, 242, 243, 244, 245, 246, 248, 249]
	},
	"start": 2.04,
	"time_since_previous": 0.24,
	"word": "like"
}

  • pitch: Each word object contains a pitch object. Its values array holds all the fundamental frequency (F0, or pitch) values for that word.

  • time_since_previous: The number of seconds since the previous spoken word. This value is used to measure phrasing.

  • punctuation: If detected in the reference text, it may imply that the current word token is at the end of a sentence or phrase, and it requires specific prosody rendering.
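As a minimal sketch, the three data points above could be read from a single word object as follows. The object is trimmed from the example JSON; the "punctuation" key and its value are assumptions made for illustration, because the example output does not show how punctuation is represented. The helper name read_prosody_fields is also hypothetical.

```python
import json

# Word object trimmed from the example above. The "punctuation" entry is
# an assumed representation, added only for illustration.
raw = """
{
  "word": "like",
  "start": 2.04,
  "end": 2.79,
  "time_since_previous": 0.24,
  "pitch": {"values": [289, 299, 306, 306, 299, 288]},
  "punctuation": "."
}
"""

def read_prosody_fields(word_obj):
    """Collect the prosody-related fields of one word object."""
    return {
        "word": word_obj["word"],
        "pitch_values": word_obj.get("pitch", {}).get("values", []),
        "pause_before": word_obj.get("time_since_previous", 0.0),
        "punctuation": word_obj.get("punctuation"),  # None when absent
    }

fields = read_prosody_fields(json.loads(raw))
print(fields)
```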

Sentence & passage-level expressiveness

  • Expressiveness can be measured at passage and sentence level using the F0 (pitch) values returned.

  • This is done by concatenating all relevant pitch: {values[]} arrays in sequential order, then measuring the variation of the combined values.

  • The expressiveness thresholds can be set to whatever is appropriate for a given task.
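The steps above can be sketched as follows. This is an illustration under assumptions rather than the engine's own method: it uses the population standard deviation as the measure of variation, and the threshold value and sample pitch values are hypothetical.

```python
from statistics import pstdev

def concatenate_pitch(words):
    """Join the pitch values of consecutive word objects, in order."""
    values = []
    for word in words:
        values.extend(word.get("pitch", {}).get("values", []))
    return values

def pitch_variation(words):
    """Standard deviation of all pitch values in a sentence or passage."""
    values = concatenate_pitch(words)
    return pstdev(values) if values else 0.0

# Hypothetical sentence made of two word objects.
sentence = [
    {"word": "I", "pitch": {"values": [200, 220]}},
    {"word": "like", "pitch": {"values": [240, 260]}},
]

THRESHOLD_HZ = 20.0  # hypothetical; set to whatever suits the task
variation = pitch_variation(sentence)
print(variation, variation > THRESHOLD_HZ)
```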

Phrasing & emphasis

  • Timestamps and the time_since_previous attribute make it possible to determine whether appropriate pauses have occurred (e.g., at punctuation marks) and whether there are incorrect pauses (e.g., between words within a phrase).

  • The pitch values in pitch: {values[]} can be used to check whether the student raised or lowered their intonation as expected (e.g., for expressive punctuation) by deriving the slope of the pitch value array.

  • Once the slope is calculated, a threshold can be used to determine if the incline or decline was sharp enough to qualify as a rise or fall in intonation.
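One way to derive the slope is a least-squares fit of the pitch values against their position in the array. The sketch below assumes that approach, and it expresses the slope in Hz per array element; the rise and fall thresholds are parameters chosen by the caller, and the values used here are purely illustrative.

```python
def pitch_slope(values):
    """Least-squares slope of pitch values against their array index."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

def classify_intonation(values, rise_threshold, fall_threshold):
    """Label a pitch array as a rise, a fall, or flat."""
    slope = pitch_slope(values)
    if slope > rise_threshold:
        return "rise"
    if slope < fall_threshold:
        return "fall"
    return "flat"

# Illustrative thresholds in Hz per array element (assumed values).
print(classify_intonation([100, 110, 120, 130], 5.0, -5.0))
print(classify_intonation([130, 120, 110, 100], 5.0, -5.0))
```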

Example of aggregate score calculation

By leveraging the JSON outputs as illustrated in Prosody Use Cases, computations can be used to generate rubric levels.

In the diagram below, we show a hierarchy of prosody scoring.

The Rubric Score refers to the overall single score for a student against the prosody model. This is generated from the two individual prosody skills of:

  1. Word & Passage Level Expressiveness

  2. Phrasing & Emphasis

These two prosody skills are scored using a combination of five measures derived from the data returned for pitch variation, pauses, and punctuation when reading or speaking.

Diagram of aggregate score calculation

These dimensions are not prescriptive, and they can be modified and adapted to each customer’s needs and specific use cases.

Rubric score

All individual dimension scores range between 1 and 5. The final Rubric Score is the average of the different dimension scores.

Rubric Score = average (Word & Passage Level Expressiveness, Phrasing & Emphasis)

These dimensions can also be grouped according to the prosody measures above.

Word & Passage Level Expressiveness = 
               average (Word Level Expressiveness, Passage Level Expressiveness)

and

Phrasing & Emphasis = average (Correct Pauses, Incorrect Pauses, Phrasal Intonation)

The computation of the individual score for each dimension is reported in the following sections.
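The averaging hierarchy above can be written as a short function. This is a minimal sketch; the function and parameter names are chosen for illustration, and each argument is assumed to be a dimension score between 1 and 5.

```python
from statistics import mean

def rubric_score(word_expr, passage_expr, correct_pauses,
                 incorrect_pauses, phrasal_intonation):
    """Average the two prosody skills, each an average of its measures."""
    expressiveness = mean([word_expr, passage_expr])
    phrasing = mean([correct_pauses, incorrect_pauses, phrasal_intonation])
    return mean([expressiveness, phrasing])

# Hypothetical dimension scores for one reading.
print(rubric_score(4, 2, 3, 5, 1))
```

Because the two skills are averaged after their own measures, each expressiveness measure carries more weight in the final score than each phrasing measure.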

Prosody criteria

Word level expressiveness

This score is derived purely from the word-level pitch values. It is the proportion of words with high expressiveness out of the total number of words.

As reported in the related use case 3 example, a word has high expressiveness when the standard deviation of its pitch vector is higher than a threshold (e.g., 26 Hz).

# of expressive words = # of words with std_dev > threshold

percentage = # of expressive words / # of words

score = 1 if 0 <= percentage <= 0.25
      = 2 if 0.25 < percentage <= 0.5
      = 3 if 0.5 < percentage <= 0.75
      = 4 if 0.75 < percentage <= 0.9
      = 5 if 0.9 < percentage <= 1
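A minimal sketch of this computation follows. It assumes the population standard deviation, counts words without pitch values as not expressive, and gives a proportion of exactly 0 a score of 1; the sample words are hypothetical.

```python
from bisect import bisect_left
from statistics import pstdev

BOUNDS = (0.25, 0.5, 0.75, 0.9)  # inclusive upper bounds of scores 1 to 4

def band_score(fraction, bounds=BOUNDS):
    """Map a proportion in [0, 1] to a score from 1 to 5."""
    return bisect_left(bounds, fraction) + 1

def word_level_fraction(words, threshold_hz=26.0):
    """Share of words whose pitch standard deviation exceeds the threshold."""
    if not words:
        return 0.0
    expressive = 0
    for word in words:
        values = word.get("pitch", {}).get("values", [])
        # Assumption: a word without pitch values is not expressive.
        if values and pstdev(values) > threshold_hz:
            expressive += 1
    return expressive / len(words)

words = [
    {"word": "I", "pitch": {"values": [200, 260]}},     # std dev 30 Hz
    {"word": "like", "pitch": {"values": [200, 210]}},  # std dev 5 Hz
]
fraction = word_level_fraction(words)
print(fraction, band_score(fraction))
```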

Passage level expressiveness

This score is derived purely from the word-level pitch values.

It measures the variation of the pitch values across the entire passage. As outlined in the related use case 3 example, passage expressiveness is measured by taking the standard deviation of all pitch values in the passage (overall_std_dev). Since this value has no fixed upper boundary, a sigmoid function is used to convert it into a percentage.

percentage = 1 / (1 + EXP(-14 * (overall_std_dev / 100 - 0.5)))

score = 1 if 0 <= percentage <= 0.4
      = 2 if 0.4 < percentage <= 0.5
      = 3 if 0.5 < percentage <= 0.6
      = 4 if 0.6 < percentage <= 0.7
      = 5 if 0.7 < percentage <= 1
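The sigmoid mapping and its score bands can be sketched as follows. The sketch keeps the sigmoid output as a proportion between 0 and 1, which is the scale the score bands use; the function names are illustrative.

```python
import math
from bisect import bisect_left

PASSAGE_BOUNDS = (0.4, 0.5, 0.6, 0.7)  # inclusive upper bounds of scores 1 to 4

def passage_fraction(overall_std_dev):
    """Squash an unbounded standard deviation (Hz) into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-14.0 * (overall_std_dev / 100.0 - 0.5)))

def passage_score(overall_std_dev):
    """Map the passage-level standard deviation to a score from 1 to 5."""
    return bisect_left(PASSAGE_BOUNDS, passage_fraction(overall_std_dev)) + 1

# The curve is centered on a standard deviation of 50 Hz.
for std_dev in (30, 50, 60):
    print(std_dev, round(passage_fraction(std_dev), 3), passage_score(std_dev))
```

With these constants the curve is steep around 50 Hz, so small changes in overall_std_dev near that value move the score by several levels.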

Phrasing and emphasis

Correct pauses

A correct pause is when a student pauses in the right place (i.e., at a punctuation mark).

This score is derived from the time_since_previous attribute and the punctuation symbols in the reference text.

As described in the use case 1 example, we can check, for each punctuation symbol, whether the duration of the pause that follows it falls within an appropriate range.

# of accepted pauses = punctuation pauses that are within the accepted range

accepted_percentage = # of accepted pauses / # of punctuations

score = 1 if 0 <= accepted_percentage <= 0.25
      = 2 if 0.25 < accepted_percentage <= 0.5
      = 3 if 0.5 < accepted_percentage <= 0.75
      = 4 if 0.75 < accepted_percentage <= 0.9
      = 5 if 0.9 < accepted_percentage <= 1 
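A sketch of the correct-pause computation is shown below. The accepted ranges are hypothetical placeholders, not the values from use case 1, and the "punctuation" key on a word object is an assumed representation. The pause after a punctuation symbol is taken from the time_since_previous of the following word.

```python
from bisect import bisect_left

# Hypothetical accepted pause ranges, in seconds, per punctuation symbol.
ACCEPTED_RANGES = {",": (0.1, 0.6), ".": (0.2, 1.5), "?": (0.2, 1.5)}
BOUNDS = (0.25, 0.5, 0.75, 0.9)  # inclusive upper bounds of scores 1 to 4

def correct_pause_fraction(words, ranges=ACCEPTED_RANGES):
    """Share of punctuation symbols followed by a pause of accepted length."""
    assessed = accepted = 0
    # The last word has no following pause, so it is never assessed.
    for current, following in zip(words, words[1:]):
        symbol = current.get("punctuation")
        if symbol not in ranges:
            continue
        assessed += 1
        low, high = ranges[symbol]
        if low <= following.get("time_since_previous", 0.0) <= high:
            accepted += 1
    return accepted / assessed if assessed else 0.0

words = [
    {"word": "Hi", "punctuation": ","},
    {"word": "there", "time_since_previous": 0.3, "punctuation": "."},
    {"word": "Go", "time_since_previous": 0.05},
]
fraction = correct_pause_fraction(words)
print(fraction, bisect_left(BOUNDS, fraction) + 1)
```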

Incorrect pauses

An incorrect pause is an instance of a pause where there shouldn’t be one.

This score is derived from the information in the time_since_previous attribute.

It is based on the ratio between the number of unusual pauses and the total number of words. The ratio is subtracted from 1 so that a higher value means fewer incorrect pauses.

The assumption is that a non-proficient speaker would decode rather than speak fluently. Therefore, very often, there will be a pause after each word.

Incorrect pauses can be detected by checking if time_since_previous is longer than a threshold (e.g., 200 ms, which is 0.2 in the seconds used by time_since_previous).

# of incorrect pauses = # of pauses longer than a threshold

incorrect_percentage = 1 - # of incorrect pauses / # of words

score = 1 if 0 <= incorrect_percentage <= 0.5
      = 2 if 0.5 < incorrect_percentage <= 0.7
      = 3 if 0.7 < incorrect_percentage <= 0.8
      = 4 if 0.8 < incorrect_percentage <= 0.95
      = 5 if 0.95 < incorrect_percentage <= 1
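The formula above can be sketched as follows. The sketch follows the formula as written, so it counts every pause longer than the threshold and does not exclude pauses that follow punctuation; the sample words are hypothetical.

```python
from bisect import bisect_left

INCORRECT_BOUNDS = (0.5, 0.7, 0.8, 0.95)  # inclusive upper bounds of scores 1 to 4

def incorrect_pause_value(words, threshold_s=0.2):
    """1 minus the share of words preceded by a pause above the threshold."""
    if not words:
        return 0.0
    incorrect = sum(
        1 for word in words
        if word.get("time_since_previous", 0.0) > threshold_s
    )
    return 1.0 - incorrect / len(words)

def incorrect_pause_score(value):
    """Map the value to a score from 1 to 5."""
    return bisect_left(INCORRECT_BOUNDS, value) + 1

words = [
    {"word": "I", "time_since_previous": 0.0},
    {"word": "like", "time_since_previous": 0.5},  # counted as incorrect
    {"word": "red", "time_since_previous": 0.1},
    {"word": "cars", "time_since_previous": 0.05},
]
value = incorrect_pause_value(words)
print(value, incorrect_pause_score(value))
```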

Phrasal intonation

This score is derived from the word-level pitch values and the punctuation label.

It measures the ratio between the number of points in the passage where the student’s intonation has appropriately changed (e.g., at the end of a sentence) and the total number of punctuation symbols that need to be assessed (e.g., . or ?).

As reported in the use case 2 example, intonation is accepted when the trend of the pitch over the word before the punctuation symbol has an acceptable slope (e.g., >130 or <-90 for ? and <-90 for .).

# of accepted intonation points = # of assessed punctuations where the
                                  trend of the pitch array on the preceding
                                  word is above or below a threshold

percentage = # of accepted intonation points / # of assessed punctuations

score = 1 if 0 <= percentage <= 0.25
      = 2 if 0.25 < percentage <= 0.5
      = 3 if 0.5 < percentage <= 0.75
      = 4 if 0.75 < percentage <= 0.9
      = 5 if 0.9 < percentage <= 1 
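A sketch of counting accepted intonation points is shown below. It uses the example thresholds from the text, but several details are assumptions: the slope is taken as a least-squares trend in Hz per second, pitch samples are treated as evenly spaced across the word's duration, and the "punctuation" key on a word object is an assumed representation.

```python
def pitch_trend(values, duration_s):
    """Least-squares slope in Hz per second, assuming evenly spaced samples."""
    n = len(values)
    if n < 2 or duration_s <= 0:
        return 0.0
    step = duration_s / (n - 1)
    mean_t = duration_s / 2
    mean_v = sum(values) / n
    num = sum((i * step - mean_t) * (v - mean_v) for i, v in enumerate(values))
    den = sum((i * step - mean_t) ** 2 for i in range(n))
    return num / den

# Acceptance rules built from the example thresholds in the text.
RULES = {
    "?": lambda slope: slope > 130 or slope < -90,
    ".": lambda slope: slope < -90,
}

def phrasal_intonation_fraction(words):
    """Share of assessed punctuation symbols with an accepted pitch trend."""
    assessed = accepted = 0
    for word in words:
        rule = RULES.get(word.get("punctuation"))
        if rule is None:
            continue
        assessed += 1
        values = word.get("pitch", {}).get("values", [])
        if rule(pitch_trend(values, word.get("duration", 0.0))):
            accepted += 1
    return accepted / assessed if assessed else 0.0

words = [
    {"word": "cars", "duration": 0.5, "punctuation": ".",
     "pitch": {"values": [200, 150, 100]}},   # falls by 200 Hz per second
    {"word": "blue", "duration": 0.5, "punctuation": "?",
     "pitch": {"values": [200, 201, 202]}},   # nearly flat
]
print(phrasal_intonation_fraction(words))
```

The resulting proportion can then be mapped to a score from 1 to 5 with the bands listed above.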

Examples

Good expressivity

Sample reading: Listen to audio file here

Taking the above audio sample, where the child read well, the rubric score would be calculated like this:

Diagram of aggregate score calculation for good expressivity

A further breakdown would look like this:

JSON Data | Details | Percentage | Score
Word level expressiveness | 75 of 76 words had high expressiveness | | 5
Passage level expressiveness | The variation of the pitch values on the entire passage | 88% | 5
Correct pauses | 10 of a potential 13 pauses were observed correctly | | 3
Incorrect pauses | 3 instances where the child paused where they shouldn’t have (out of a potential 76) | | 3
Phrasal intonation | 7 out of a potential 11 times where the intonation was appropriately changed | | 2
Rubric score | Rubric score is the average of the different dimension scores | | 4

Poor expressivity

Sample reading: Listen to audio file here

If we take the above audio sample, where the child read with poor expressivity, the rubric score would be calculated like this:

Diagram of aggregate score calculation for poor expressivity

JSON Data | Details | Percentage | Score
Word level expressiveness | 42 of 60 words had high expressiveness | | 3
Passage level expressiveness | The variation of the pitch values on the entire passage | 38% | 1
Correct pauses | 14 of a potential 26 pauses were observed correctly | | 3
Incorrect pauses | 14 instances where the child paused where they shouldn’t have (out of a potential 60) | | 3
Phrasal intonation | 2 out of a potential 8 times where the intonation was appropriately changed | | 2
Rubric score | Rubric score is the average of the different dimension scores | | 2
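As a quick check, the rubric scores of both examples can be reproduced from the dimension scores listed for them. The sketch below takes those scores as given and applies the averaging described under Rubric score; rounding the result to the nearest whole number is an assumption about how the final score is reported.

```python
from statistics import mean

def rubric_score(scores):
    """Average the two prosody skills, each an average of its measures."""
    expressiveness = mean([scores["word"], scores["passage"]])
    phrasing = mean([
        scores["correct_pauses"],
        scores["incorrect_pauses"],
        scores["phrasal_intonation"],
    ])
    return mean([expressiveness, phrasing])

# Dimension scores as listed for the two sample readings.
good = {"word": 5, "passage": 5, "correct_pauses": 3,
        "incorrect_pauses": 3, "phrasal_intonation": 2}
poor = {"word": 3, "passage": 1, "correct_pauses": 3,
        "incorrect_pauses": 3, "phrasal_intonation": 2}

print(round(rubric_score(good), 2), round(rubric_score(good)))  # 3.83 4
print(round(rubric_score(poor), 2), round(rubric_score(poor)))  # 2.33 2
```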

Resources

Audio files mentioned above

File for Good Expressivity example:

blueProsodyClip.wav

File for Poor Expressivity sample:

zamProsodyClip.wav