Fluency (Online)

Overview

The Fluency Web Service is built on top of our Speech to Text (STT) service. It provides an extra layer of analysis for a submitted audio file and accompanying reference text.

SoapBox Fluency is a speech solution designed specifically for children’s speech and assesses children's reading skills in both formal and informal settings.

Given an audio file and the text passage that the child has been asked to read, the Fluency API transcribes the speech to text and returns an analysis of the audio compared with the passage.

The Cloud Fluency v1 is officially released and is generally available from the https://api.soapboxlabs.com/v1/async/request/speech/fluency endpoint.

Getting Started


Best Practices

Make sure to review the documentation around Audio Encoding and Best Practices.

Data Privacy

Review our Data Privacy resources

RESTful Web Services

Review the SoapBox Cloud Web Service Interaction Methods documentation to familiarize yourself with the concepts involved.

Fluency Concepts

Familiarize yourself with the concepts of Fluency.

Custom Language Model (CLM)

You will require a CLM to use our Fluency service. More information is available in our documentation Understanding CLMs.

Interaction Method

Choose an interaction method here. Asynchronous requests are priced differently to synchronous requests. Please contact us for more information.

Authentication

Review the Web Service Authentication documentation.

Sending a Cloud Fluency Request


Once authenticated, send test Fluency requests via an HTTPS POST request to:

  • https://api.soapboxlabs.com/v1/async/request/speech/fluency

This endpoint handles requests asynchronously, as the CURL example further down this page shows. The following parameters should be specified for a successful response:

| Name | Type | Description |
|---|---|---|
| x-app-key | header | For test purposes. The app key used to authenticate with the Web Service. |
| file | field | The audio file to be analyzed. |
| reference_text | field | The reference text against which the speech contained in the audio file should be analyzed. This should be a plain text file using UTF-8 encoding. Please refer to further details on UTF-8 encoding here. General best practice information is here. |
| user_token | field | A unique ID that represents the speaker in the audio file. This should be a non-human-readable alphanumeric identifier (such as a UUID) that has meaning to you but not to SoapBox Labs. This token can be used to request deletion of a specific user's data in line with our data privacy commitments. |
| model_id | field | The alphanumeric ID of the CLM that represents your custom domain. Please refer to the section on CLMs for further detail. |
| last_word_type | field | [Optional] See Last Word Type. |

Example CURL Request

Using the app_key and model_id above, you can send requests to the Fluency Web Service. There are various tools and technologies available to accomplish this. One such tool is “CURL”, which is widely available on Mac and Linux and is also available on Windows. CURL allows you to send HTTP requests and display the response. 

The following sample CURL commands submit an audio file to the Fluency endpoint asynchronously and then retrieve the result. The request contains the app key, file, user_token, reference_text, model_id, and last_word_type parameters and should be run from the same directory as the audio file.

Step 1 - Submit Request (POST)

curl -X POST \
    -H "x-app-key:$APP_KEY" \
    -F "file=@$AUDIO_FILE" \
    -F "user_token=$USER_TOKEN" \
    -F "reference_text=@$REFERENCE_TEXT" \
    -F "model_id=$MODEL_ID" \
    -F "last_word_type=WHOLE" \
    https://api.soapboxlabs.com/v1/async/request/speech/fluency

If successful, this will return a JSON response similar to the following:

{
  "id":"abc123",
  "status_uri":"https://api.soapboxlabs.com/v1/async/results/abc123"
}

Step 2 - Retrieve Result (GET)

Extract the value of the status_uri field and use it to retrieve the result. Processing time depends on the length of the file, its complexity (e.g., audio quality), and connection speed. If the result is not yet available, you will receive an HTTP 404 status code. If you encounter an HTTP 404, wait for a period of time before retrying.

curl https://api.soapboxlabs.com/v1/async/results/abc123
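The retry behaviour described in Step 2 can be sketched in Python using only the standard library. This is an illustrative sketch rather than an official client: the function names, the number of attempts, and the delay are assumptions, and the GET request is sent without extra headers only because the CURL command for this step does so.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_json(url):
    """GET a URL and return (status_code, parsed_json_or_None)."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.status, json.load(resp)
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; surface the status code instead.
        return err.code, None


def poll_result(status_uri, fetch=fetch_json, attempts=10, delay_s=2.0,
                sleep=time.sleep):
    """Poll status_uri until a result is available.

    An HTTP 404 is treated as "not ready yet" and retried after delay_s
    seconds. attempts and delay_s are arbitrary illustrative defaults.
    """
    for attempt in range(attempts):
        status, body = fetch(status_uri)
        if status == 200:
            return body
        if status != 404:
            raise RuntimeError(f"unexpected HTTP status {status}")
        if attempt < attempts - 1:
            sleep(delay_s)
    raise TimeoutError(f"no result after {attempts} attempts")
```

Passing the status_uri value from the Step 1 response to poll_result returns the parsed JSON once it is ready. The fetch and sleep parameters exist only so the retry loop can be exercised without network access.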

Fluency API Response Structure

If the request is successful, the Fluency Web Service will return a JSON response containing the Fluency analysis. At the root of the results object are the following fields:

user_id

The user_id specified in the request

language_code

The language contained in the audio file being analyzed

result_id

A unique identifier for the request

time

The UTC time the request was processed at

results

The Fluency results object returned (see results below for further details)

audio_duration

The duration of the audio file in seconds

A breakdown is also provided for each word/token identified in the transcription and reference text, along with an indication of its type (i.e., insertion, deletion, etc.). This allows the source text to be marked up for review, for example by a teacher.

API Response Example

The following is an example of the JSON structure you can expect from Fluency. In the example, the reference text is “i love stripes too” while the child says “i like stripes” in the audio file.

Reference text: i love stripes too

Child says: i like stripes

{
	"audio_duration": 5.1363749504089355,
	"user_id": "abc123",
	"results": {
		"num_differences": 2,
		"substitution_count": 1,
		"insertion_count": 0,
		"correct_count": 2,
		"transcription_confidence": 87.155700683594,
		"deletion_count": 1,
		"reference_text": "i love stripes too",
		"last_word": {
			"end_timestamp": 5.1363749504089355,
			"text_score_index": 3,
			"type": "WHOLE"
		},
		"text_score": [{
			"transcription_index": 0,
			"reference_index": 0,
			"alignment_type": "CORRECT",
			"transcription_details": {
				"confidence": 86.57522583007812,
				"end": 1.8,
				"start": 1.17,
				"phone_breakdown": [{
					"duration": 0.63,
					"quality_score": 82.8630599975586,
					"end": 1.8,
					"start": 1.17,
					"phone": "ay"
				}]
			},
			"reference_word": "i",
			"normalised_word": "i",
			"sub_types": {

			},
			"transcription_word": "i"
		}, {
			"transcription_index": 1,
			"reference_index": 1,
			"alignment_type": "SUBSTITUTION",
			"transcription_details": {
				"confidence": 82.5213851928711,
				"end": 2.79,
				"start": 2.04,
				"phone_breakdown": [{
					"duration": 0.21,
					"quality_score": 91.41993713378906,
					"end": 2.25,
					"start": 2.04,
					"phone": "l"
				}, {
					"duration": 0.21,
					"quality_score": 69.66895294189453,
					"end": 2.46,
					"start": 2.25,
					"phone": "ay"
				}, {
					"duration": 0.33,
					"quality_score": 74.0502700805664,
					"end": 2.79,
					"start": 2.46,
					"phone": "k"
				}]
			},
			"reference_word": "love",
			"normalised_word": "love",
			"sub_types": {

			},
			"transcription_word": "like"
		}, {
			"transcription_index": 2,
			"reference_index": 2,
			"alignment_type": "CORRECT",
			"transcription_details": {
				"confidence": 92.3704833984375,
				"end": 4.62,
				"start": 3.0,
				"phone_breakdown": [{
					"duration": 0.21,
					"quality_score": 91.65348815917969,
					"end": 3.21,
					"start": 3.0,
					"phone": "s"
				}, {
					"duration": 0.09,
					"quality_score": 95.76628875732422,
					"end": 3.3,
					"start": 3.21,
					"phone": "t"
				}, {
					"duration": 0.09,
					"quality_score": 97.88199615478516,
					"end": 3.39,
					"start": 3.3,
					"phone": "r"
				}, {
					"duration": 0.12,
					"quality_score": 95.32052612304688,
					"end": 3.51,
					"start": 3.39,
					"phone": "ay"
				}, {
					"duration": 0.21,
					"quality_score": 98.26608276367188,
					"end": 3.72,
					"start": 3.51,
					"phone": "p"
				}, {
					"duration": 0.9,
					"quality_score": 66.78457641601562,
					"end": 4.62,
					"start": 3.72,
					"phone": "s"
				}]
			},
			"reference_word": "stripes",
			"normalised_word": "stripes",
			"sub_types": {

			},
			"transcription_word": "stripes"
		}, {
			"transcription_index": 3,
			"reference_index": 3,
			"alignment_type": "DELETION",
			"transcription_details": {

			},
			"reference_word": "too",
			"normalised_word": "too",
			"sub_types": {

			},
			"transcription_word": ""
		}],
		"transcription": "i like stripes",
		"word_count": 4,
		"repetition_count": 0
	},
	"language_code": "en-GB",
	"result_id": "abc123-282_1638878664258",
	"time": "2021-12-07T12:04:24.996Z"
}

JSON Breakdown

The following are snippets from the full JSON response above with some additional information for each key.

Results object

Within the JSON response, the results node/object contains the analysis of the audio file including data such as the number of insertions, deletions, and substitutions found and a text_score node/object that contains a breakdown of each token analyzed in the request.

"results": {
	"num_differences": 2,
	"substitution_count": 1,
	"insertion_count": 0,
	"correct_count": 2,
	"transcription_confidence": 87.155700683594,
	"deletion_count": 1,
	"reference_text": "i love stripes too",
	"last_word": {
		"end_timestamp": 5.1363749504089355,
		"text_score_index": 3,
		"type": "WHOLE"
	},
	"text_score": [],
	"transcription": "i like stripes",
	"word_count": 4,
	"repetition_count": 0
}

num_differences

The number of differences between the reference and transcription text (total number of substitutions + deletions + insertions).

substitution_count

The number of times a word has been substituted for another (i.e., the number of times the child said a different word than what was in the reference text).

insertion_count

The number of times a word has been inserted (i.e., the number of times the child said a word that was not present in the reference text).

correct_count

The number of times the child correctly said a word from the reference text.

transcription_confidence

The confidence of the transcription text transcribed from the audio file.

deletion_count

The number of times a word from the reference text was not said (i.e., the number of times the child omitted a word that's present in the reference text).

Depending on the last_word_type selected, the deletion_count will vary. See Last Word Feature for further details.

reference_text

The text that is expected to be read.

text_score

A breakdown of the individual words. See text_score for more information.

transcription

The computed transcription from the audio file.

word_count

The number of words in the reference_text.

repetition_count

The number of times the child repeated a word. (For example, if the child correctly said “stripes” but immediately afterwards said “stripes” again, the second stripes would be flagged as an insertion and repetition.)

last_word

This object stores information relating to Last Word Type.
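As a sanity check on the counts described above, they can be recomputed from the alignment_type of each text_score item. The sketch below is illustrative only: the function name recount is an assumption, and because the reported deletion_count depends on the last_word_type selected, a mismatch with the recomputed value is not necessarily an error.

```python
from collections import Counter


def recount(results):
    """Recompute the summary counts from the text_score items."""
    counts = Counter(item["alignment_type"] for item in results["text_score"])
    differences = (
        counts["SUBSTITUTION"] + counts["DELETION"] + counts["INSERTION"]
    )
    return {
        "correct_count": counts["CORRECT"],
        "substitution_count": counts["SUBSTITUTION"],
        "insertion_count": counts["INSERTION"],
        "deletion_count": counts["DELETION"],
        # Documented as substitutions + deletions + insertions.
        "num_differences": differences,
    }
```

For the “i love stripes too” example, this yields 2 correct words, 1 substitution, 1 deletion, and 2 differences, matching the results object shown above.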

text_score

"text_score": [{
	"transcription_index": 0,
	"reference_index": 0,
	"alignment_type": "CORRECT",
	"transcription_details": {},
	"reference_word": "i",
	"normalised_word": "i",
	"sub_types": {},
	"transcription_word": "i"
}, {
	"transcription_index": 1,
	"reference_index": 1,
	"alignment_type": "SUBSTITUTION",
	"transcription_details": {},
	"reference_word": "love",
	"normalised_word": "love",
	"sub_types": {},
	"transcription_word": "like"
}, {
	"transcription_index": 2,
	"reference_index": 2,
	"alignment_type": "CORRECT",
	"transcription_details": {},
	"reference_word": "stripes",
	"normalised_word": "stripes",
	"sub_types": {},
	"transcription_word": "stripes"
}, {
	"transcription_index": 3,
	"reference_index": 3,
	"alignment_type": "DELETION",
	"transcription_details": {},
	"reference_word": "too",
	"normalised_word": "too",
	"sub_types": {},
	"transcription_word": ""
}]

Word Breakdown

The word breakdowns contain indexes. These are based on splitting the text by white space and, as the API Response Example shows, start at 0. For example, “i like stripes” is split on its spaces into three words: i, like, and stripes.

transcription_index

The index the current word is pointing to in the transcription text.

reference_index

The index the current word is pointing to in the reference text.

alignment_type

CORRECT, INSERTION, DELETION, or SUBSTITUTION

transcription_details

An object containing more detailed results from the transcription. See transcription_details below for details.

reference_word

The original reference word from the reference text that the child is prompted to say.

Please note that when alignment_type is INSERTION, this field is empty.

normalised_word

The reference_word normalized so it can be matched with the transcription text.

Please note that when alignment_type is INSERTION, this field is empty.

sub_types

Contains information on if the word is also a repetition or if the word was flagged as a self-correction.

transcription_word

The word from the transcription (what was detected in the audio) that is matched against the reference_word.

Please note that when alignment_type is DELETION, this field is empty.
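To illustrate how the two indexes relate to the whitespace-split texts, the sketch below looks up the words a text_score item points to. It is a minimal example under stated assumptions: the function name lookup_words is hypothetical, indexes are taken to be zero-based as in the API Response Example, and the index on the empty side of an INSERTION or DELETION is not used (in the example response, the DELETION item carries a transcription_index of 3 although the transcription has only three words).

```python
def lookup_words(item, reference_text, transcription):
    """Return (reference word, transcribed word) for one text_score item.

    The side that does not exist for the item's alignment_type is None.
    """
    reference_words = reference_text.split()
    transcribed_words = transcription.split()
    kind = item["alignment_type"]

    # An INSERTION has no counterpart in the reference text.
    reference = (
        None if kind == "INSERTION"
        else reference_words[item["reference_index"]]
    )
    # A DELETION has no counterpart in the transcription.
    transcribed = (
        None if kind == "DELETION"
        else transcribed_words[item["transcription_index"]]
    )
    return reference, transcribed
```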

alignment_type

The alignment type specifies what we have detected the words in the audio file to be in relation to the reference text. There are four types: CORRECT, INSERTION, DELETION, and SUBSTITUTION. Each is described below.

| Key | Description |
|---|---|
| correct | "alignment_type": "CORRECT". The child said the word as it appears in the reference text. These words are counted in correct_count. Example: Reference: i like stripes. Transcription: i like stripes. |
| insertion | "alignment_type": "INSERTION". The audio file contains an extra word that is not present in the reference text. These words are counted in insertion_count. Example: Reference: i like stripes. Transcription: i like blue stripes. |
| deletions (omissions) | "alignment_type": "DELETION". The audio file is missing a word that is present in the reference text. These words are counted in deletion_count. Example: Reference: i like stripes. Transcription: i like. |
| substitutions | "alignment_type": "SUBSTITUTION". The child said a different word than the one expected in the reference text. These words are counted in substitution_count. Example: Reference: i like stripes. Transcription: i love stripes. |
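The alignment types make it possible to mark up the passage for review. The sketch below shows one way to do this in Python; the function name mark_up and the bracket notation are assumptions chosen for illustration, not part of the Fluency API.

```python
def mark_up(text_score):
    """Render text_score items as a single annotated line of text."""
    parts = []
    for item in text_score:
        kind = item["alignment_type"]
        reference = item.get("reference_word", "")
        said = item.get("transcription_word", "")
        if kind == "CORRECT":
            parts.append(reference)
        elif kind == "SUBSTITUTION":
            parts.append(f"[{reference} -> {said}]")  # expected -> heard
        elif kind == "DELETION":
            parts.append(f"[-{reference}-]")          # word was omitted
        elif kind == "INSERTION":
            parts.append(f"[+{said}+]")               # extra word was said
    return " ".join(parts)
```

For the “i love stripes too” example response, this produces: i [love -> like] stripes [-too-]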

transcription_details

This object contains the transcription details from the audio file. If empty, it means the word was not detected in the audio file.

"transcription_details": {
	"time_since_previous": 0.4,
	"confidence": 82.5213851928711,
	"end": 2.79,
	"start": 2.04,
	"phone_breakdown": []
}

confidence

Expressed as a percentage, a confidence score is how confident our voice engine is that a word was said.

start / end

The start/end times the word was detected in the audio file, measured in seconds.

time_since_previous

The time between the current word and the previous word, measured in seconds. Can be used for verifying self-corrections.

phone_breakdown

A phonemic breakdown of the word. Each phone in the transcription is also given a quality score (quality_score).
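When time_since_previous is not present, as in the API Response Example, the pause before each word can be estimated from the start and end times. This is a rough sketch under an assumption the documentation does not confirm: that the gap runs from the end of the previous detected word to the start of the current one. The function name is hypothetical.

```python
def gaps_between_words(text_score):
    """Return (word, gap_in_seconds) for each detected word after the first."""
    gaps = []
    previous_end = None
    for item in text_score:
        details = item.get("transcription_details") or {}
        if "start" not in details or "end" not in details:
            continue  # word was not detected in the audio (e.g., a DELETION)
        if previous_end is not None:
            gap = round(details["start"] - previous_end, 2)
            gaps.append((item["transcription_word"], gap))
        previous_end = details["end"]
    return gaps
```

Long gaps may be worth reviewing alongside the self-correction information, but any threshold for what counts as "long" is a product decision.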

phone_breakdown

"phone_breakdown": [{
	"quality_score": 91.41993713378906,
	"end": 2.25,
	"start": 2.04,
	"phone": "l"
}, {
	"quality_score": 69.66895294189453,
	"end": 2.46,
	"start": 2.25,
	"phone": "ay"
}, {
	"quality_score": 74.0502700805664,
	"end": 2.79,
	"start": 2.46,
	"phone": "k"
}]

phone

The constituent phone of the word.

quality_score

A measure of how similar to our pronunciation models the phoneme sound is.

start / end

The start/end times the phone was detected in the audio file, measured in seconds.
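One possible use of the phone breakdown is to flag phones whose quality_score falls below a chosen threshold. The sketch below is illustrative: the function name and the threshold of 70 are assumptions, not recommended values.

```python
def weak_phones(text_score, threshold=70.0):
    """List (word, phone, quality_score) for phones scoring below threshold."""
    flagged = []
    for item in text_score:
        details = item.get("transcription_details") or {}
        # Items with empty transcription_details have no phone_breakdown.
        for phone in details.get("phone_breakdown", []):
            if phone["quality_score"] < threshold:
                flagged.append(
                    (item["transcription_word"], phone["phone"],
                     phone["quality_score"])
                )
    return flagged
```

Applied to the example response with this threshold, the function flags the "ay" phone in like and the final "s" phone in stripes.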


sub_types

sub_types contains additional information about the word. These additional types are repetitions and self-corrections.

| Key | Description |
|---|---|
| is_repetition | A true or false flag stating whether the word is a repetition. If false, the key will not be present in the JSON. |
| repetition_of | If the word is a repetition, this index points to the corresponding text_score item that it repeated. |
| self_correction | If this word was flagged as a potential self-correction, this object will be present. See Fluency - Self Corrections Feature. |

Repetitions

A repetition is a sub-type of insertion in which the child says a word again. The repetition keys will not be present in the JSON unless the word was marked as a repetition.

Reference: i like stripes

Transcription: i like like stripes

In the example above, the second like has an alignment_type of INSERTION and a sub_types entry marking it as a repetition. This would be displayed as:

"sub_types": {
	"is_repetition": true,
	"repetition_of": 1
}

is_repetition

If the word is a repetition of another word.

repetition_of

This index points to the text_score_index of the word that is being repeated.
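The following sketch shows how repetition_of might be resolved back to the repeated word. The function name is hypothetical, and it assumes repetition_of is a zero-based index into the text_score array, as the description above suggests.

```python
def repetitions(text_score):
    """Return (index, repeated word, original word) for each repetition."""
    found = []
    for index, item in enumerate(text_score):
        sub_types = item.get("sub_types") or {}
        # is_repetition is absent when the word is not a repetition.
        if sub_types.get("is_repetition"):
            original = text_score[sub_types["repetition_of"]]
            found.append(
                (index, item["transcription_word"],
                 original["transcription_word"])
            )
    return found
```

For the “i like like stripes” example, the second like (the item at index 2) would be reported as a repetition of the item at index 1.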

Self-corrections

A self-correction (aka “reparandum”) is an immediate correction of an incorrectly read word. We provide information that lets customers see whether a potential self-correction was made, so they can choose to ignore the self-corrected INSERTION rather than count it as a mistake that would affect the accuracy score.

"sub_types": {
	"self_correction": {
		"reparandums": [1]
	}
}

More information on self-corrections can be found here: Fluency - Self Corrections Feature .

Details on how self-corrections can be used to calculate metrics such as accuracy and words correct per minute (WCPM) can be found here: Fluency - Example Calculations .


Features - Further Reading

Self-corrections

A self-correction (aka “reparandum”) is an immediate correction of an incorrectly read word. We provide information that lets customers see whether a self-correction was made, so they can choose to ignore the self-corrected INSERTION rather than count it as a mistake that would affect the accuracy score.

More information on self-corrections can be found here: Fluency - Self Corrections Feature .

Details on how self-corrections can be used to calculate metrics such as accuracy and words correct per minute (WCPM) can be found here: Fluency - Example Calculations .

Last Word Type

This feature is designed to enable customers to define the point at which Fluency data is calculated in an audio file. Choosing the point to stop calculating has an impact on accuracy ratings for the reader.

The user can choose between the types by adding the form-data field last_word_type and setting it to whole (the default if no type is specified), read, or heard.

More detailed information can be found here: Fluency - Last Word Feature .

Details on how last_word_type can be used in calculations can be found here: Fluency - Example Calculations .

Custom Markup

Most special cases arise from text normalization. For example, normalizing hElLo! gives hello: all lowercase, with special characters removed. The STT engine transcribes audio files into normalized words, so the reference_text must be sanitized in the same way to be compatible.

Generally, this is a straightforward, direct replacement. More details on special cases can be found here: Customizing Text Using Markup

Making calculations

Since accuracy is defined differently by clients across different products and use cases, the Fluency endpoint does not prescribe an ‘accuracy score’ as part of the response, but the data returned enables the customer to generate one client side.

Similarly, the WCPM score depends on your definition of a “reading error,” and it can be further adjusted depending on the last_word_type chosen.

The Fluency endpoint provides the data for clients to make these calculations on a case-by-case basis.

Key metrics like accuracy, read percentage, and WCPM can all be calculated from the JSON output.

For more details on these calculations, see the Fluency - Example Calculations page.
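As a rough illustration of a client-side calculation, the sketch below derives an accuracy figure and a WCPM figure from the response. The definitions used here are assumptions made for this example only (accuracy as correct_count divided by word_count, and WCPM as correct_count divided by the audio duration in minutes). They ignore self-corrections and last_word_type, and the Fluency - Example Calculations page remains the reference for the intended formulas.

```python
def example_metrics(response):
    """Compute illustrative accuracy and WCPM values from a Fluency response.

    Assumed definitions (not prescribed by the API):
      accuracy = correct_count / word_count
      wcpm     = correct_count / (audio_duration in minutes)
    """
    results = response["results"]
    correct = results["correct_count"]
    word_count = results["word_count"]
    minutes = response["audio_duration"] / 60.0

    accuracy = correct / word_count if word_count else 0.0
    wcpm = correct / minutes if minutes else 0.0
    return {"accuracy": accuracy, "wcpm": wcpm}
```

With the example response (2 correct words out of 4, in roughly 5.14 seconds of audio), these definitions give an accuracy of 0.5 and a WCPM of about 23.4.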