Speech-To-Text (Online)
This page demonstrates how to obtain a transcription of speech contained in an audio file using our Speech-To-Text (STT) Web Service.
Cloud Recognition v1 is officially released and generally available at the https://api.soapboxlabs.com/v1/speech/recognition endpoint.
Getting Started
Best Practices
Make sure to review the documentation around Audio Encoding and Best Practices.
Data Privacy
Review our Data Privacy resources.
RESTful Web Services
Review the SoapBox Cloud Web Service documentation to familiarize yourself with the concepts involved.
Speech-To-Text Concepts
Familiarize yourself with the concepts of Speech-To-Text.
Custom Language Model (CLM)
You will need a CLM to use our Fluency service. More information is available in Understanding CLMs.
Interaction Method
Choose an interaction method here. Asynchronous requests are priced differently to synchronous requests. Please contact us for more information.
Authentication
Review the Web Service Authentication documentation.
Sending a Cloud STT Request
Once authenticated, send STT requests as HTTPS POST requests. The CURL example on this page interacts with the Recognition API asynchronously, using the https://api.soapboxlabs.com/v1/async/request/speech/recognition endpoint. Specify the following parameters for a successful response:
Name | Type | Description |
---|---|---|
x-app-key | header | For test purposes. The App Key to use to authenticate with the Web Service |
file | field | The audio file to be analyzed |
user_token | field | A unique id that represents the speaker in the audio file. This should be a non-human readable alphanumeric identifier (such as a UUID) that has meaning to you but not SoapBox Labs. This token can be used to request deletion of a specific user's data in line with our Data Privacy commitments. |
model_id | field | The alphanumeric ID of the CLM that represents your custom domain. Please refer to the section on CLM for further detail. |
Example CURL Request
Using the app_key and model_id above, you can send requests to the STT Web Service. There are various tools and technologies available to accomplish this. One such tool is “CURL”, which is widely available on Mac and Linux and is also available on Windows. CURL allows you to send HTTP requests and display the response.
The following sample CURL commands submit an audio file to the STT endpoint asynchronously and then retrieve the result. The submit command contains the app_key, user_token, model_id and file parameters and should be run from the same directory as the audio file.
Step 1 - Submit Request (POST)
curl -X POST \
  -H "x-app-key:$APP_KEY" \
  -F "file=@$AUDIO_FILE" \
  -F "user_token=$USER_TOKEN" \
  -F "model_id=$MODEL_ID" \
  https://api.soapboxlabs.com/v1/async/request/speech/recognition
If successful this will return a JSON response similar to the following…
{ "id":"abc123", "status_uri":"https://api.northeurope.prod.nostromo.soapboxlabs.com/v1/async/results/abc123" }
The status_uri URL will differ based on your region.
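For readers who prefer a script to the command line, the Step 1 submission can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the function name `build_submit_request` is hypothetical, the `application/octet-stream` content type for the audio part is an assumption, and the endpoint, header and field names are copied from the CURL command above. The function only builds the request so that it can be inspected before anything is sent.

```python
import uuid
import urllib.request

# Endpoint copied from the CURL example above.
SUBMIT_URL = "https://api.soapboxlabs.com/v1/async/request/speech/recognition"


def build_submit_request(app_key, model_id, user_token, filename, audio_bytes):
    """Build (but do not send) a multipart/form-data POST for the STT endpoint."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain text form fields, mirroring the -F options of the CURL command.
    for name, value in (("user_token", user_token), ("model_id", model_id)):
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode("utf-8")
        )
    # The audio file part. The content type here is an assumption.
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        'Content-Type: application/octet-stream\r\n\r\n'.encode("utf-8")
        + audio_bytes
        + b"\r\n"
    )
    # Closing boundary marks the end of the multipart body.
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return urllib.request.Request(
        SUBMIT_URL,
        data=b"".join(parts),
        method="POST",
        headers={
            "x-app-key": app_key,
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

To send the request, pass the returned object to `urllib.request.urlopen` and decode the JSON body, which should contain the "id" and "status_uri" fields shown above.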
Step 2 - Retrieve Result (GET)
Extract the value of the “status_uri” field and use it to retrieve the result. Allow some time for processing before the first attempt. If the result is not yet available, you will receive an HTTP 404 status code; in that case, wait a period of time before retrying. The result retrieval URL will change depending on where the request is sent from.
curl https://api.northeurope.prod.nostromo.soapboxlabs.com/v1/async/results/abc123
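The wait-and-retry behaviour described in Step 2 can be sketched as a small polling loop. This is an illustration under stated assumptions: the names `http_get` and `wait_for_result` are hypothetical, the delay and attempt limit are arbitrary values to tune for your workload, and the sketch assumes that a ready result is returned with HTTP 200, which this page does not state explicitly.

```python
import json
import time
import urllib.error
import urllib.request


def http_get(url):
    """Return (status_code, body_text) for a GET request without raising on 404."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.status, resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8", errors="replace")


def wait_for_result(status_uri, fetch=http_get, delay=2.0, max_attempts=10,
                    sleep=time.sleep):
    """Poll status_uri until the result is available or the attempts run out."""
    for attempt in range(max_attempts):
        status, body = fetch(status_uri)
        if status == 200:  # assumption: 200 means the result is ready
            return json.loads(body)
        if status != 404:
            raise RuntimeError(f"unexpected HTTP status {status}")
        # 404 means the result is not available yet; wait before retrying.
        if attempt < max_attempts - 1:
            sleep(delay)
    raise TimeoutError(f"no result after {max_attempts} attempts")
```

The `fetch` and `sleep` parameters are injectable so the loop can be exercised without network access or real waiting.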
STT API Response Structure
If the request was successful, the STT Web Service will return a JSON response containing the STT analysis. At the root of the results object are the following fields:
Field | Description |
---|---|
user_id | The user_id specified in the request. |
language_code | The language contained in the audio file being analyzed. |
result_id | A unique identifier for the request. |
time | The UTC time at which the request was processed. |
results | The nbest list of results from the speech recognizer. See results for further details. |
audio_duration | The duration of the audio file in seconds. |
num_alternatives | The number of alternatives in the nbest list returned by the speech recognizer (currently set to 1). |
The following is an example of the JSON structure you can expect from STT.
{ "audio_duration": 5.1363749504089355, "user_id": "abc123", "results": [{ "confidence": 87.196960449219, "transcription": "i like stripes", "words": [{ "confidence": 86.65843963623047, "end": 1.8, "start": 1.17, "phone_breakdown": [{ "quality_score": 82.96692657470703, "end": 1.8, "start": 1.17, "phone": "ay" }], "word": "i" }, { "confidence": 82.57804870605469, "end": 2.79, "start": 2.04, "phone_breakdown": [{ "quality_score": 91.5013198852539, "end": 2.25, "start": 2.04, "phone": "l" }, { "quality_score": 69.76779174804688, "end": 2.46, "start": 2.25, "phone": "ay" }, { "quality_score": 74.07825469970703, "end": 2.79, "start": 2.46, "phone": "k" }], "word": "like" }, { "confidence": 92.35440063476562, "end": 4.62, "start": 3.0, "phone_breakdown": [{ "quality_score": 91.634033203125, "end": 3.21, "start": 3.0, "phone": "s" }, { "quality_score": 95.79405975341797, "end": 3.3, "start": 3.21, "phone": "t" }, { "quality_score": 97.87104034423828, "end": 3.39, "start": 3.3, "phone": "r" }, { "quality_score": 95.29679870605469, "end": 3.51, "start": 3.39, "phone": "ay" }, { "quality_score": 98.2538070678711, "end": 3.72, "start": 3.51, "phone": "p" }, { "quality_score": 66.72088623046875, "end": 4.62, "start": 3.72, "phone": "s" }], "word": "stripes" }] }], "num_alternatives": 1, "language_code": "en-GB", "result_id": "abc123-282_1638878309646", "time": "2021-12-07T11:58:30.257Z" }
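As a rough illustration of reading the root-level fields, the following Python sketch loads an abbreviated copy of the example response and collects the fields into a dictionary. The function name `summarize_response` and the output keys are hypothetical, and the "words" list is emptied for brevity. The timestamp format string is inferred from the single example value above and may need adjusting if the service returns other formats.

```python
import json
from datetime import datetime

# Abbreviated copy of the example response above ("words" emptied for brevity).
RESPONSE_TEXT = """
{
  "audio_duration": 5.1363749504089355,
  "user_id": "abc123",
  "results": [
    {"confidence": 87.196960449219, "transcription": "i like stripes", "words": []}
  ],
  "num_alternatives": 1,
  "language_code": "en-GB",
  "result_id": "abc123-282_1638878309646",
  "time": "2021-12-07T11:58:30.257Z"
}
"""


def summarize_response(response):
    """Collect the root-level fields of an STT response into a small dict."""
    # "%z" accepts the trailing "Z" (UTC) on Python 3.7 and later.
    processed_at = datetime.strptime(response["time"], "%Y-%m-%dT%H:%M:%S.%f%z")
    return {
        "result_id": response["result_id"],
        "user_id": response["user_id"],
        "language_code": response["language_code"],
        "audio_seconds": round(response["audio_duration"], 2),
        "processed_at": processed_at,
        "result_count": len(response["results"]),
    }


summary = summarize_response(json.loads(RESPONSE_TEXT))
print(summary["language_code"], summary["audio_seconds"], summary["result_count"])
```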
JSON Breakdown
The following are snippets from the full JSON response above with some additional information for each key.
results
Within the response JSON the “results” node contains a detailed analysis of the audio file including data such as the transcription, the confidence/score of that transcription and a breakdown of each word in the “words” node.
"results": [{ "confidence": 87.196960449219, "transcription": "i like stripes", "words": [] }]
Field | Description |
---|---|
transcription | The computed transcription from the audio file. |
confidence | The overall confidence score of the transcription. |
words | A further breakdown of each of the words contained in the transcription. |
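One way to consume the “results” list is to pick the entry with the highest confidence and reject it if it falls below a cut-off. The sketch below assumes nothing beyond the fields in the table; the function name `best_transcription` and the `min_confidence` parameter are hypothetical, and any threshold you choose is your own. The example values suggest a 0 to 100 scale, but this page does not define the range.

```python
# Snippet of the "results" node from the example response above.
RESULTS = [
    {"confidence": 87.196960449219, "transcription": "i like stripes", "words": []}
]


def best_transcription(results, min_confidence=0.0):
    """Return the transcription with the highest confidence, or None.

    None is returned when the list is empty or when the best entry scores
    below min_confidence (a hypothetical, caller-chosen cut-off).
    """
    if not results:
        return None
    best = max(results, key=lambda result: result["confidence"])
    if best["confidence"] < min_confidence:
        return None
    return best["transcription"]


print(best_transcription(RESULTS))
```

Since num_alternatives is currently set to 1, the list holds a single entry today; using `max` simply keeps the sketch valid if more alternatives are returned.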
results \ words
"words": [{ "confidence": 86.65843963623047, "end": 1.8, "start": 1.17, "phone_breakdown": [], "word": "i" }, { "confidence": 82.57804870605469, "end": 2.79, "start": 2.04, "phone_breakdown": [], "word": "like" }, { "confidence": 92.35440063476562, "end": 4.62, "start": 3.0, "phone_breakdown": [], "word": "stripes" }]
Field | Description |
---|---|
word | The word detected. |
confidence | The confidence of the detected word. |
start / end | The start and end times at which the word was detected in the audio file. |
phone_breakdown | A further phonetic breakdown of the word. Each phone in the transcription is also scored. |
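The word-level fields can be walked in a similar way. The sketch below uses the first two words of the example response, reports how long each word lasted, and lists any phones whose quality_score falls below a cut-off. The function name `word_report` and the threshold of 75 are hypothetical, and the sketch assumes that start and end are expressed in seconds, like audio_duration.

```python
# First two entries of the "words" node from the example response above.
WORDS = [
    {"confidence": 86.65843963623047, "end": 1.8, "start": 1.17, "word": "i",
     "phone_breakdown": [
         {"quality_score": 82.96692657470703, "end": 1.8, "start": 1.17, "phone": "ay"},
     ]},
    {"confidence": 82.57804870605469, "end": 2.79, "start": 2.04, "word": "like",
     "phone_breakdown": [
         {"quality_score": 91.5013198852539, "end": 2.25, "start": 2.04, "phone": "l"},
         {"quality_score": 69.76779174804688, "end": 2.46, "start": 2.25, "phone": "ay"},
         {"quality_score": 74.07825469970703, "end": 2.79, "start": 2.46, "phone": "k"},
     ]},
]


def word_report(words, phone_threshold):
    """Summarize each word: its duration and the phones scoring below the threshold."""
    report = []
    for entry in words:
        weak = [
            phone["phone"]
            for phone in entry["phone_breakdown"]
            if phone["quality_score"] < phone_threshold
        ]
        report.append({
            "word": entry["word"],
            # Assumes start/end are in seconds; rounded to hide float noise.
            "duration": round(entry["end"] - entry["start"], 2),
            "weak_phones": weak,
        })
    return report


for row in word_report(WORDS, phone_threshold=75):
    print(row["word"], row["duration"], row["weak_phones"])
```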
results \ words \ phone_breakdown
"phone_breakdown": [{ "quality_score": 91.634033203125, "end": 3.21, "start": 3.0, "phone": "s" }, { "quality_score": 95.79405975341797, "end": 3.3, "start": 3.21, "phone": "t" }, { "quality_score": 97.87104034423828, "end": 3.39, "start": 3.3, "phone": "r" }, { "quality_score": 95.29679870605469, "end": 3.51, "start": 3.39, "phone": "ay" }, { "quality_score": 98.2538070678711, "end": 3.72, "start": 3.51, "phone": "p" }, { "quality_score": 66.72088623046875, "end": 4.62, "start": 3.72, "phone": "s" }]
Field | Description |
---|---|
phone | The constituent phone of the word. |
quality_score | How well the phone was pronounced. |
start / end | The start/end times the phone was detected in |