Fluency (Online)
Overview
The Fluency Web Service is built on top of our Speech to Text (STT) service. It provides an extra layer of analysis for a submitted audio file and accompanying reference text.
SoapBox Fluency is a speech solution designed specifically for children, assessing their reading skills in both formal and informal settings.
Given an audio file and a text passage that the child has been asked to read, the Fluency API will transcribe the speech to text and return an analysis of the audio file compared to the text passage.
Cloud Fluency v1 is officially released and is generally available from the https://api.soapboxlabs.com/v1/async/request/speech/fluency endpoint.
Getting Started
Best Practices
Make sure to review the documentation around Audio Encoding and Best Practices.
Data Privacy
Review our Data Privacy resources
RESTful Web Services
Review the SoapBox Cloud Web Service Interaction Methods documentation to familiarize yourself with the concepts involved.
Fluency Concepts
Familiarize yourself with the concepts of Fluency.
Custom Language Model (CLM)
You will require a CLM to use our Fluency service. More information is available in our documentation Understanding CLMs.
Interaction Method
Choose an interaction method here. Asynchronous requests are priced differently to synchronous requests. Please contact us for more information.
Authentication
Review the Web Service Authentication documentation.
Sending a Cloud Fluency Request
Once authenticated, test Fluency requests should be sent via HTTPS to:

https://api.soapboxlabs.com/v1/async/request/speech/fluency

The following CURL example interacts with the Fluency API in an asynchronous manner.
Requests should be sent to this endpoint via an HTTPS POST request. The following parameters should be specified for a successful response:
| Name | Type | Description |
|---|---|---|
| x-app-key | header | For test purposes. The app key used to authenticate with the Web Service. |
| file | field | The audio file to be analyzed. |
| reference_text | field | The reference text against which the speech contained in the audio file should be analyzed. This should be a plain text file using UTF-8 encoding. Please refer to further details on UTF-8 encoding here. General best practice information is here. |
| user_token | field | A unique ID that represents the speaker in the audio file. This should be a non-human-readable alphanumeric identifier (such as a UUID) that has meaning to you but not to SoapBox Labs. This token can be used to request deletion of a specific user's data in line with our data privacy commitments. |
| model_id | field | The alphanumeric ID of the CLM that represents your custom domain. Please refer to the section on CLMs for further detail. |
| last_word_type | field | [Optional] See Last Word Type. |
Example CURL Request
Using the app_key and model_id above, you can send requests to the Fluency Web Service. There are various tools and technologies available to accomplish this. One such tool is “CURL”, which is widely available on Mac and Linux and is also available on Windows. CURL allows you to send HTTP requests and display the response.
The following is a sample CURL command that can be used to submit an audio file to the Fluency endpoint asynchronously. It contains the app_key, model_id, file, and reference_text parameters and should be run from the same directory as the audio file.
Step 1 - Submit Request (POST)
```shell
curl -X POST \
  -H "x-app-key:$APP_KEY" \
  -F "file=@$AUDIO_FILE" \
  -F "user_token=$USER_TOKEN" \
  -F "reference_text=@$REFERENCE_TEXT" \
  -F "model_id=$MODEL_ID" \
  -F "last_word_type=WHOLE" \
  https://api.soapboxlabs.com/v1/async/request/speech/fluency
```
If successful, this will return a JSON response similar to the following:
```json
{
  "id": "abc123",
  "status_uri": "https://api.soapboxlabs.com/v1/async/results/abc123"
}
```
Step 2 - Retrieve Result (GET)
Extract the value of the status_uri field and use it to retrieve the result. The processing time depends on the length of the file, its complexity (e.g., audio quality), and connection speed. If the result is not yet available, you will receive an HTTP 404 status code; wait a period of time before retrying.
```shell
curl https://api.soapboxlabs.com/v1/async/results/abc123
```
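The submit-and-poll flow above can be sketched in Python. This is illustrative only: the helper names (`submit_and_poll`, `backoff_delays`), the exponential backoff schedule, and the use of the third-party `requests` library are our own choices, not part of the API.

```python
import time

SUBMIT_URL = "https://api.soapboxlabs.com/v1/async/request/speech/fluency"


def backoff_delays(initial=1.0, factor=2.0, retries=5):
    """Delays (in seconds) between polling attempts: 1, 2, 4, 8, 16, ..."""
    return [initial * factor ** i for i in range(retries)]


def submit_and_poll(app_key, audio_path, reference_path, user_token, model_id):
    """Submit an audio file, then poll status_uri until the result is ready."""
    # Third-party dependency imported here so the rest of the module
    # (e.g. backoff_delays) has no hard dependency on it.
    import requests

    with open(audio_path, "rb") as audio, open(reference_path, "rb") as ref:
        response = requests.post(
            SUBMIT_URL,
            headers={"x-app-key": app_key},
            files={"file": audio, "reference_text": ref},
            data={
                "user_token": user_token,
                "model_id": model_id,
                "last_word_type": "WHOLE",
            },
        )
    response.raise_for_status()
    status_uri = response.json()["status_uri"]

    # An HTTP 404 means the result is not yet available, so wait and retry.
    for delay in backoff_delays():
        result = requests.get(status_uri, headers={"x-app-key": app_key})
        if result.status_code != 404:
            result.raise_for_status()
            return result.json()
        time.sleep(delay)
    raise TimeoutError("Fluency result was not ready in time")
```

Waiting between retries (rather than polling in a tight loop) matters because, as noted above, processing time varies with file length and audio quality.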
Fluency API Response Structure
If the request is successful, the Fluency Web Service will return a JSON response containing the Fluency analysis. At the root of the results object are the following fields:
| Field | Description |
|---|---|
| user_id | The user_id specified in the request |
| language_code | The language contained in the audio file being analyzed |
| result_id | A unique identifier for the request |
| time | The UTC time the request was processed at |
| results | The Fluency results object returned (see results below for further details) |
| audio_duration | The duration of the audio file in seconds |
In addition, a breakdown is also provided for each word/token identified in the transcription and reference text, along with an indication of its type (i.e., insertion, deletion, etc.). This allows the source text to be marked up for review, for example by a teacher.
API Response Example
The following is an example of the JSON structure you can expect from Fluency. In the example, the reference text is “i love stripes too” while the child says “i like stripes” in the audio file.
| Reference Text | "i love stripes too" |
|---|---|
| Child says | "i like stripes" |
```json
{
  "audio_duration": 5.1363749504089355,
  "user_id": "abc123",
  "results": {
    "num_differences": 2,
    "substitution_count": 1,
    "insertion_count": 0,
    "correct_count": 2,
    "transcription_confidence": 87.155700683594,
    "deletion_count": 1,
    "reference_text": "i love stripes too",
    "last_word": {
      "end_timestamp": 5.1363749504089355,
      "text_score_index": 3,
      "type": "WHOLE"
    },
    "text_score": [{
      "transcription_index": 0,
      "reference_index": 0,
      "alignment_type": "CORRECT",
      "transcription_details": {
        "confidence": 86.57522583007812,
        "end": 1.8,
        "start": 1.17,
        "phone_breakdown": [
          { "duration": 0.63, "quality_score": 82.8630599975586, "end": 1.8, "start": 1.17, "phone": "ay" }
        ]
      },
      "reference_word": "i",
      "normalised_word": "i",
      "sub_types": {},
      "transcription_word": "i"
    }, {
      "transcription_index": 1,
      "reference_index": 1,
      "alignment_type": "SUBSTITUTION",
      "transcription_details": {
        "confidence": 82.5213851928711,
        "end": 2.79,
        "start": 2.04,
        "phone_breakdown": [
          { "duration": 0.21, "quality_score": 91.41993713378906, "end": 2.25, "start": 2.04, "phone": "l" },
          { "duration": 0.21, "quality_score": 69.66895294189453, "end": 2.46, "start": 2.25, "phone": "ay" },
          { "duration": 0.33, "quality_score": 74.0502700805664, "end": 2.79, "start": 2.46, "phone": "k" }
        ]
      },
      "reference_word": "love",
      "normalised_word": "love",
      "sub_types": {},
      "transcription_word": "like"
    }, {
      "transcription_index": 2,
      "reference_index": 2,
      "alignment_type": "CORRECT",
      "transcription_details": {
        "confidence": 92.3704833984375,
        "end": 4.62,
        "start": 3.0,
        "phone_breakdown": [
          { "duration": 0.21, "quality_score": 91.65348815917969, "end": 3.21, "start": 3.0, "phone": "s" },
          { "duration": 0.09, "quality_score": 95.76628875732422, "end": 3.3, "start": 3.21, "phone": "t" },
          { "duration": 0.09, "quality_score": 97.88199615478516, "end": 3.39, "start": 3.3, "phone": "r" },
          { "duration": 0.12, "quality_score": 95.32052612304688, "end": 3.51, "start": 3.39, "phone": "ay" },
          { "duration": 0.21, "quality_score": 98.26608276367188, "end": 3.72, "start": 3.51, "phone": "p" },
          { "duration": 0.9, "quality_score": 66.78457641601562, "end": 4.62, "start": 3.72, "phone": "s" }
        ]
      },
      "reference_word": "stripes",
      "normalised_word": "stripes",
      "sub_types": {},
      "transcription_word": "stripes"
    }, {
      "transcription_index": 3,
      "reference_index": 3,
      "alignment_type": "DELETION",
      "transcription_details": {},
      "reference_word": "too",
      "normalised_word": "too",
      "sub_types": {},
      "transcription_word": ""
    }],
    "transcription": "i like stripes",
    "word_count": 4,
    "repetition_count": 0
  },
  "language_code": "en-GB",
  "result_id": "abc123-282_1638878664258",
  "time": "2021-12-07T12:04:24.996Z"
}
```
JSON Breakdown
The following are snippets from the full JSON response above with some additional information for each key.
Results object
Within the JSON response, the results node/object contains the analysis of the audio file including data such as the number of insertions, deletions, and substitutions found and a text_score node/object that contains a breakdown of each token analyzed in the request.
```json
"results": {
  "num_differences": 2,
  "substitution_count": 1,
  "insertion_count": 0,
  "correct_count": 2,
  "transcription_confidence": 87.155700683594,
  "deletion_count": 1,
  "reference_text": "i love stripes too",
  "last_word": {
    "end_timestamp": 5.1363749504089355,
    "text_score_index": 3,
    "type": "WHOLE"
  },
  "text_score": [],
  "transcription": "i like stripes",
  "word_count": 4,
  "repetition_count": 0
}
```
| Key | Description |
|---|---|
| num_differences | The number of differences between the reference and transcription text (total number of substitutions + deletions + insertions). |
| substitution_count | The number of times a word has been substituted for another (i.e., the number of times the child said a different word than what was in the reference text). |
| insertion_count | The number of times a word has been inserted (i.e., the number of times the child said a word that was not present in the reference text). |
| correct_count | The number of times the child correctly said a word from the reference text. |
| transcription_confidence | The overall confidence for the transcription produced from the audio file. |
| deletion_count | The number of times a word from the reference text was not said (i.e., the number of times the child omitted a word that's present in the reference text). Depending on the last_word_type selected, the deletion_count will vary. See Last Word Feature for further details. |
| reference_text | The text that is expected to be read. |
| text_score | A breakdown of the individual words. See text_score for more information. |
| transcription | The computed transcription from the audio file. |
| word_count | The number of words in the reference_text. |
| repetition_count | The number of times the child repeated a word. (For example, if the child correctly said "stripes" but immediately afterwards said "stripes" again, the second "stripes" would be flagged as an insertion and repetition.) |
| last_word | This object stores information relating to Last Word Type. |
text_score
```json
"text_score": [{
  "transcription_index": 0,
  "reference_index": 0,
  "alignment_type": "CORRECT",
  "transcription_details": {},
  "reference_word": "i",
  "normalised_word": "i",
  "sub_types": {},
  "transcription_word": "i"
}, {
  "transcription_index": 1,
  "reference_index": 1,
  "alignment_type": "SUBSTITUTION",
  "transcription_details": {},
  "reference_word": "love",
  "normalised_word": "love",
  "sub_types": {},
  "transcription_word": "like"
}, {
  "transcription_index": 2,
  "reference_index": 2,
  "alignment_type": "CORRECT",
  "transcription_details": {},
  "reference_word": "stripes",
  "normalised_word": "stripes",
  "sub_types": {},
  "transcription_word": "stripes"
}, {
  "transcription_index": 3,
  "reference_index": 3,
  "alignment_type": "DELETION",
  "transcription_details": {},
  "reference_word": "too",
  "normalised_word": "too",
  "sub_types": {},
  "transcription_word": ""
}]
```
Word Breakdown
The word breakdowns contain indexes. These are based on splitting the text by white space, e.g. "i like stripes" is split by spaces into its 3 words: i, like, and stripes.
| Key | Description |
|---|---|
| transcription_index | The index the current word is pointing to in the transcription text. |
| reference_index | The index the current word is pointing to in the reference text. |
| alignment_type | CORRECT, INSERTION, DELETION, or SUBSTITUTION |
| transcription_details | An object containing more detailed results from the transcription. See transcription_details below for details. |
| reference_word | The original reference word from the reference text that the child is prompted to say. |
| normalised_word | The reference_word normalized so it can be matched with the transcription text. |
| sub_types | Contains information on whether the word is also a repetition or was flagged as a self-correction. |
| transcription_word | The word from the transcription that is matched against the reference word. |
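These per-word breakdowns are what allow the reference text to be marked up for review. As an illustration (not part of the API), the sketch below renders each text_score entry with a marker; the `annotate` helper and its markup conventions are our own choices.

```python
def annotate(text_score):
    """Render each word with a marker based on its alignment_type.

    Markup conventions here are illustrative: substitutions show what was
    said instead, deletions are struck through, insertions are bracketed.
    """
    parts = []
    for entry in text_score:
        word = entry["reference_word"]
        kind = entry["alignment_type"]
        if kind == "CORRECT":
            parts.append(word)
        elif kind == "SUBSTITUTION":
            parts.append(f"{word}->{entry['transcription_word']}")
        elif kind == "DELETION":
            parts.append(f"~~{word}~~")
        elif kind == "INSERTION":
            parts.append(f"[+{entry['transcription_word']}]")
    return " ".join(parts)
```

For the example response above ("i love stripes too" read as "i like stripes"), this yields `i love->like stripes ~~too~~`.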
alignment_type
The alignment type specifies how each word detected in the audio file relates to the reference text. There are four types: CORRECT, INSERTION, DELETION, and SUBSTITUTION. Each is described below.

| Key | Description |
|---|---|
| CORRECT | The child spoke a word correctly as it appears in the reference text. |
| INSERTION | The audio file contains an extra word not present in the reference text. |
| DELETION (omission) | The audio file is missing a word that is present in the reference text. |
| SUBSTITUTION | The child said a different word than the one expected in the reference text. |
transcription_details
This object contains the transcription details from the audio file. If empty, it means the word was not detected in the audio file.
```json
"transcription_details": {
  "time_since_previous": 0.4,
  "confidence": 82.5213851928711,
  "end": 2.79,
  "start": 2.04,
  "phone_breakdown": []
}
```
| Key | Description |
|---|---|
| confidence | Expressed as a percentage, a confidence score is how confident our voice engine is that a word was said. |
| start / end | The start/end times the word was detected in the audio file, measured in seconds. |
| time_since_previous | The time between the current word and the previous word, measured in seconds. Can be used for verifying self-corrections. |
| phone_breakdown | A phonemic breakdown of the word. Each phoneme in the transcription is also given a confidence score. |
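One illustrative use of time_since_previous is flagging unusually long pauses before a word. The `long_pauses` helper below and its one-second threshold are our own choices, not part of the API.

```python
def long_pauses(text_score, threshold=1.0):
    """Return (word, gap) pairs where the pause before the word exceeds threshold.

    Words that were not detected (empty transcription_details) are skipped,
    as are words with no time_since_previous (e.g. the first word).
    """
    pauses = []
    for entry in text_score:
        details = entry.get("transcription_details") or {}
        gap = details.get("time_since_previous")
        if gap is not None and gap > threshold:
            pauses.append((entry["transcription_word"], gap))
    return pauses
```

Lowering the threshold surfaces shorter hesitations; what counts as a meaningful pause is a client-side decision.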
phone_breakdown
```json
"phone_breakdown": [{
  "quality_score": 91.41993713378906,
  "end": 2.25,
  "start": 2.04,
  "phone": "l"
}, {
  "quality_score": 69.66895294189453,
  "end": 2.46,
  "start": 2.25,
  "phone": "ay"
}, {
  "quality_score": 74.0502700805664,
  "end": 2.79,
  "start": 2.46,
  "phone": "k"
}]
```
| Key | Description |
|---|---|
| phone | The constituent phone of the word. |
| quality_score | A measure of how similar the phoneme sound is to our pronunciation models. |
| start / end | The start/end times the phone was detected in the audio file, measured in seconds. |
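One way to use the phoneme quality scores is to surface the weakest-sounding phone in a word, e.g. to highlight pronunciation trouble spots. The `weakest_phone` helper below is illustrative only, not part of the API.

```python
def weakest_phone(phone_breakdown):
    """Return the phone entry with the lowest quality_score, or None if empty."""
    if not phone_breakdown:
        return None
    return min(phone_breakdown, key=lambda p: p["quality_score"])
```

Applied to the "like" breakdown above, it returns the "ay" entry (quality_score ≈ 69.7), the phoneme our models matched least confidently.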
sub_types
sub_types contains additional information about the word. These additional types are repetitions and self-corrections.
| Key | Description |
|---|---|
| is_repetition | A true or false flag stating whether the word is a repetition. If false, the key will not be present in the JSON. |
| repetition_of | If the word is a repetition, this index points to the text_score_index of the word being repeated. |
| self_correction | If this word was flagged as a potential self-correction, this object will be present. See Fluency - Self Corrections Feature. |
Repetitions
A repetition is a sub-type of insertion. It will not be present in the JSON unless the word was marked as a repetition.

| Reference | "i like stripes" |
|---|---|
| Transcription | "i like like stripes" |

In the example above, the second "like" has an alignment type of INSERTION and a sub_type of repetition. This would be displayed as:
```json
"sub_types": {
  "is_repetition": true,
  "repetition_of": 1
}
```
| Key | Description |
|---|---|
| is_repetition | Whether the word is a repetition of another word. |
| repetition_of | This index points to the text_score_index of the word that is being repeated. |
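To show how repetition_of resolves, here is a small sketch that pairs each repeated word with the word it repeats. The `repeated_words` helper is our own, and it assumes repetition_of indexes directly into the text_score array.

```python
def repeated_words(text_score):
    """Map each repeated word to the word it repeats, via repetition_of."""
    pairs = []
    for entry in text_score:
        sub_types = entry.get("sub_types") or {}
        if sub_types.get("is_repetition"):
            original = text_score[sub_types["repetition_of"]]
            pairs.append((entry["transcription_word"], original["transcription_word"]))
    return pairs
```

For the "i like like stripes" example above, this pairs the second "like" with the first.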
Self-corrections
A self-correction (aka "reparandum") is an immediate correction of an incorrectly read word. We provide information that enables customers to see whether a potential self-correction was made, so they can choose to ignore the self-corrected INSERTION and not count it as a mistake, which will affect the accuracy score.
```json
"sub_types": {
  "self_correction": {
    "reparandums": [1]
  }
}
```
More information on self-corrections can be found here: Fluency - Self Corrections Feature.
Details on how self-corrections can be used to calculate metrics such as accuracy and words correct per minute (WCPM) can be found here: Fluency - Example Calculations.
Features - Further Reading
Self-corrections
More information on self-corrections can be found here: Fluency - Self Corrections Feature.
Details on how self-corrections can be used to calculate metrics such as accuracy and words correct per minute (WCPM) can be found here: Fluency - Example Calculations.
Last Word Type
This feature is designed to enable customers to define the point at which Fluency data is calculated in an audio file. Choosing the point to stop calculating has an impact on accuracy ratings for the reader.
You can choose between the different types by adding the form-data field last_word_type and setting it to either whole (the default if no type is specified), read, or heard.

More detailed information can be found here: Fluency - Last Word Feature.

Details on how last_word_type can be used in calculations can be found here: Fluency - Example Calculations.
Custom Markup
Most special cases arise from normalization. For example, if we normalize hElLo! we get hello: all lower-cased with special characters removed. The STT engine transcribes audio files into words, and since the words from the STT engine are normalized versions, the reference_text has to be sanitized so it is compatible.
Generally, this is straightforward and a direct replacement. More details on special cases can be found here: Customizing Text Using Markup.
Making calculations
Since accuracy is defined differently by clients across different products and use cases, the Fluency endpoint does not prescribe an 'accuracy score' as part of the response, but the data returned enables you to generate one client-side.
Similarly, the WCPM score is dependent on your definition of a "reading error," and this can be further augmented depending on the last_word_type chosen.
The Fluency endpoint provides the data for clients to make these calculations on a case-by-case basis.
Key metrics like accuracy, read percentage, and WCPM can all be calculated from the JSON output.
For more details on these calculations, see the Fluency - Example Calculations page.
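As an illustration of the kind of client-side calculation the response enables, here is a minimal Python sketch. The `accuracy` and `wcpm` helpers, and the definitions they use (correct_count over word_count, and correct words per minute of audio), are just one possible choice, not a prescribed formula.

```python
def accuracy(results):
    """Correct words as a fraction of the reference word count.

    This is only one possible definition; how errors (e.g. self-corrections
    or repetitions) are counted is a client-side decision.
    """
    return results["correct_count"] / results["word_count"]


def wcpm(results, audio_duration):
    """Words correct per minute over the whole audio file."""
    return results["correct_count"] / (audio_duration / 60.0)
```

For the example response above (correct_count 2, word_count 4, audio_duration ≈ 5.14 s), this gives an accuracy of 0.5 and a WCPM of roughly 23.4.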