Speech-To-Text (Online)

This page demonstrates how to obtain a transcription of speech contained in an audio file using our Speech-To-Text (STT) Web Service.

Cloud Recognition v1 is officially released and generally available at the https://api.soapboxlabs.com/v1/speech/recognition endpoint.

Getting Started

Best Practices

Make sure to review the documentation around Audio Encoding and Best Practices.

Data Privacy

Review our Data Privacy resources.

RESTful Web Services

Review the SoapBox Cloud Web Service documentation to familiarize yourself with the concepts involved.

Speech-To-Text Concepts

Familiarize yourself with the concepts of Speech-To-Text.

Custom Language Model (CLM)

You will need a CLM to use our Fluency service. More information is available in Understanding CLMs.

Interaction Method

Choose an interaction method here. Asynchronous requests are priced differently to synchronous requests. Please contact us for more information.

Authentication

Review the Web Service Authentication documentation.

Sending a Cloud STT Request

Once authenticated, STT requests should be sent via HTTPS to:

https://api.soapboxlabs.com/v1/async/request/speech/recognition

Requests should be sent to this endpoint via HTTPS POST. The CURL example in Step 1 uses the Recognition API asynchronously. The following parameters should be specified for a successful response:

Name	Type	Description
x-app-key	header	For test purposes. The App Key used to authenticate with the Web Service.
file	field	The audio file to be analyzed.
user_token	field	A unique ID that represents the speaker in the audio file. This should be a non-human-readable alphanumeric identifier (such as a UUID) that has meaning to you but not to SoapBox Labs. This token can be used to request deletion of a specific user's data in line with our Data Privacy commitments.
model_id	field	The alphanumeric ID of the CLM that represents your custom domain. Please refer to the section on CLM for further detail.
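As an illustration of the user_token guidance in the table above, the following Python sketch generates an opaque token per speaker and keeps the mapping on your side. The helper names (new_user_token, token_for) and the in-memory dictionary are assumptions made for this example and are not part of the SoapBox API; a real integration would persist the mapping in its own datastore.

```python
import uuid

# Hypothetical in-memory mapping from your own user IDs to opaque tokens.
# A real system would persist this so a token can be looked up later,
# for example when requesting deletion of a user's data.
_tokens = {}


def new_user_token():
    """Return a random 32-character alphanumeric token (a UUID4 in hex form)."""
    return uuid.uuid4().hex


def token_for(internal_user_id):
    """Return the stable token for one of your users, creating it on first use."""
    if internal_user_id not in _tokens:
        _tokens[internal_user_id] = new_user_token()
    return _tokens[internal_user_id]
```

The token carries no readable information about the speaker, yet the same speaker always maps to the same value.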

Example CURL Request

Using the app_key and model_id above, you can send requests to the STT Web Service. There are various tools and technologies available to accomplish this. One such tool is “CURL”, which is widely available on Mac and Linux and is also available on Windows. CURL allows you to send HTTP requests and display the response.

The following sample CURL commands submit an audio file to the STT endpoint asynchronously and then retrieve the result. The first command sends the x-app-key header together with the file, user_token and model_id fields, and should be run from the same directory as the audio file.

Step 1 - Submit Request (POST)

curl -X POST \
    -H "x-app-key:$APP_KEY" \
    -F "file=@$AUDIO_FILE" \
    -F "user_token=$USER_TOKEN" \
    -F "model_id=$MODEL_ID" \
    https://api.soapboxlabs.com/v1/async/request/speech/recognition

If successful, this returns a JSON response similar to the following:

{
  "id":"abc123",
  "status_uri":"https://api.northeurope.prod.nostromo.soapboxlabs.com/v1/async/results/abc123"
}

The status_uri URL will differ based on your region.
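For readers working in Python rather than CURL, the Step 1 request can be sketched with the standard library alone. This is a minimal sketch, not an official client: the function names (build_multipart, build_request, submit) are invented for this example, and the multipart body is assembled by hand to mirror what the -F flags of CURL produce. The endpoint, header name and field names are the ones listed on this page.

```python
import json
import os
import urllib.request
import uuid

# Endpoint taken from this page.
ENDPOINT = "https://api.soapboxlabs.com/v1/async/request/speech/recognition"


def build_multipart(fields, file_name, file_bytes, boundary=None):
    """Encode text fields plus one "file" part as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(f"--{boundary}\r\n".encode())
        parts.append(f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        parts.append(f"{value}\r\n".encode())
    parts.append(f"--{boundary}\r\n".encode())
    parts.append(
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'.encode()
    )
    parts.append(b"Content-Type: application/octet-stream\r\n\r\n")
    parts.append(file_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_request(app_key, audio_path, user_token, model_id):
    """Build (but do not send) the POST request for one audio file."""
    with open(audio_path, "rb") as fh:
        audio = fh.read()
    body, content_type = build_multipart(
        {"user_token": user_token, "model_id": model_id},
        os.path.basename(audio_path),
        audio,
    )
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        method="POST",
        headers={"x-app-key": app_key, "Content-Type": content_type},
    )


def submit(request):
    """Send the request and return the decoded JSON body (performs network I/O)."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

Calling submit(build_request(...)) should yield a dictionary with the id and status_uri keys shown in the sample response, from which status_uri can be read for Step 2.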

Step 2 - Retrieve Result (GET)

Extract the value of the “status_uri” field and use it to retrieve the result. Allow the service time to process the audio before the first retrieval attempt. If the result is not yet available, you will receive an HTTP 404 status code; in that case, wait for a period of time before retrying. The result retrieval URL will change depending on where the request is sent from.

curl https://api.northeurope.prod.nostromo.soapboxlabs.com/v1/async/results/abc123
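The wait-and-retry behaviour described in Step 2 can be sketched as a small polling loop. The initial delay, the doubling back-off, the 30-second cap and the attempt limit are all assumptions chosen for illustration; this page does not specify recommended values. The function names are likewise hypothetical.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_once(status_uri):
    """Return the parsed result, or None if the service answers HTTP 404."""
    try:
        with urllib.request.urlopen(status_uri) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:  # result not available yet
            return None
        raise  # any other HTTP error is unexpected, so surface it


def wait_for_result(status_uri, fetch=fetch_once, delay=1.0,
                    max_attempts=10, sleep=time.sleep):
    """Poll status_uri until a result arrives or max_attempts is exhausted."""
    for attempt in range(max_attempts):
        result = fetch(status_uri)
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            sleep(delay)
            delay = min(delay * 2, 30.0)  # assumed back-off policy
    raise TimeoutError(f"no result after {max_attempts} attempts")
```

The fetch and sleep parameters are injectable so the loop can be exercised without network access or real waiting.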

STT API Response Structure

If the request was successful, the STT Web Service will return a JSON response containing the STT analysis. At the root of the results object are the following fields:

user_id	The user identifier specified in the request (sent as the user_token field).
language_code	The language contained in the audio file being analyzed.
result_id	A unique identifier for the request.
time	The UTC time at which the request was processed.
results	The n-best list of results from the speech recognizer. See results for further details.
audio_duration	The duration of the audio file in seconds.
num_alternatives	The number of n-best alternatives returned by the speech recognizer (currently set to 1).

The following is an example of the JSON structure you can expect from STT.

{
  "audio_duration": 5.1363749504089355,
  "user_id": "abc123",
  "results": [{
    "confidence": 87.196960449219,
    "transcription": "i like stripes",
    "words": [{
      "confidence": 86.65843963623047,
      "end": 1.8,
      "start": 1.17,
      "phone_breakdown": [{
        "quality_score": 82.96692657470703,
        "end": 1.8,
        "start": 1.17,
        "phone": "ay"
      }],
      "word": "i"
    }, {
      "confidence": 82.57804870605469,
      "end": 2.79,
      "start": 2.04,
      "phone_breakdown": [{
        "quality_score": 91.5013198852539,
        "end": 2.25,
        "start": 2.04,
        "phone": "l"
      }, {
        "quality_score": 69.76779174804688,
        "end": 2.46,
        "start": 2.25,
        "phone": "ay"
      }, {
        "quality_score": 74.07825469970703,
        "end": 2.79,
        "start": 2.46,
        "phone": "k"
      }],
      "word": "like"
    }, {
      "confidence": 92.35440063476562,
      "end": 4.62,
      "start": 3.0,
      "phone_breakdown": [{
        "quality_score": 91.634033203125,
        "end": 3.21,
        "start": 3.0,
        "phone": "s"
      }, {
        "quality_score": 95.79405975341797,
        "end": 3.3,
        "start": 3.21,
        "phone": "t"
      }, {
        "quality_score": 97.87104034423828,
        "end": 3.39,
        "start": 3.3,
        "phone": "r"
      }, {
        "quality_score": 95.29679870605469,
        "end": 3.51,
        "start": 3.39,
        "phone": "ay"
      }, {
        "quality_score": 98.2538070678711,
        "end": 3.72,
        "start": 3.51,
        "phone": "p"
      }, {
        "quality_score": 66.72088623046875,
        "end": 4.62,
        "start": 3.72,
        "phone": "s"
      }],
      "word": "stripes"
    }]
  }],
  "num_alternatives": 1,
  "language_code": "en-GB",
  "result_id": "abc123-282_1638878309646",
  "time": "2021-12-07T11:58:30.257Z"
}
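To show how the root-level fields of the example above might be consumed, the following Python sketch copies them into a small dataclass. The SttSummary class and the summarize function are hypothetical names introduced for this example; only the JSON keys come from the response shown on this page.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class SttSummary:
    user_id: str
    language_code: str
    result_id: str
    processed_at: datetime
    audio_duration: float
    num_alternatives: int
    transcriptions: list


def summarize(response):
    """Pull the root-level fields out of a decoded STT response dict."""
    # The sample timestamp ends in "Z"; swap it for an explicit UTC offset so
    # datetime.fromisoformat accepts it on older Python versions too.
    processed_at = datetime.fromisoformat(response["time"].replace("Z", "+00:00"))
    return SttSummary(
        user_id=response["user_id"],
        language_code=response["language_code"],
        result_id=response["result_id"],
        processed_at=processed_at,
        audio_duration=response["audio_duration"],
        num_alternatives=response["num_alternatives"],
        transcriptions=[r["transcription"] for r in response["results"]],
    )
```

Passing the decoded example response to summarize gives an object whose transcriptions list holds one entry per item in results.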

JSON Breakdown

The following are snippets from the full JSON response above with some additional information for each key.

results

Within the response JSON, the “results” node contains a detailed analysis of the audio file, including the transcription, the confidence score of that transcription and a breakdown of each word in the “words” node.

"results": [{
    "confidence": 87.196960449219,
    "transcription": "i like stripes",
    "words": []
}]

transcription	The computed transcription from the audio file.
confidence	The overall confidence score of the transcription.
words	A further breakdown of each of the words contained in the transcription.
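As a short illustration of reading the results node, the sketch below picks the entry with the highest confidence. The best_result name is hypothetical, and the handling of an empty list is a defensive assumption; this page does not state whether results can be empty or whether the list is already ordered.

```python
def best_result(response):
    """Return (transcription, confidence) for the highest-confidence result,
    or None when the results list is missing or empty."""
    results = response.get("results", [])
    if not results:
        return None
    top = max(results, key=lambda r: r["confidence"])
    return top["transcription"], top["confidence"]
```

With num_alternatives currently set to 1, this simply returns the single entry in the list.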

results \ words

"words": [{
	"confidence": 86.65843963623047,
	"end": 1.8,
	"start": 1.17,
	"phone_breakdown": [],
	"word": "i"
}, {
	"confidence": 82.57804870605469,
	"end": 2.79,
	"start": 2.04,
	"phone_breakdown": [],
	"word": "like"
}, {
	"confidence": 92.35440063476562,
	"end": 4.62,
	"start": 3.0,
	"phone_breakdown": [],
	"word": "stripes"
}]

word	The word detected.
confidence	The confidence of the detected word.
start / end	The start and end times at which the word was detected in the audio file.
phone_breakdown	A further phonetic breakdown of the word. Each phone in the transcription is also scored.
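The word-level fields lend themselves to simple post-processing. The sketch below lists words whose confidence falls under a caller-supplied cutoff and measures the gap between consecutive words. Both function names are hypothetical, and the cutoff is an arbitrary illustration: this page does not define what counts as a low confidence.

```python
def low_confidence_words(words, threshold):
    """Return the words whose confidence is below the given threshold."""
    return [w["word"] for w in words if w["confidence"] < threshold]


def pauses(words):
    """Return the gap between each pair of consecutive words, in the same
    unit as the start/end fields, rounded to two decimal places."""
    return [
        round(nxt["start"] - cur["end"], 2)
        for cur, nxt in zip(words, words[1:])
    ]
```

Applied to the three words in the snippet above with a threshold of 85, low_confidence_words returns only "like", and pauses returns the two gaps between "i", "like" and "stripes".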

results \ words \ phone_breakdown

"phone_breakdown": [{
	"quality_score": 91.634033203125,
	"end": 3.21,
	"start": 3.0,
	"phone": "s"
}, {
	"quality_score": 95.79405975341797,
	"end": 3.3,
	"start": 3.21,
	"phone": "t"
}, {
	"quality_score": 97.87104034423828,
	"end": 3.39,
	"start": 3.3,
	"phone": "r"
}, {
	"quality_score": 95.29679870605469,
	"end": 3.51,
	"start": 3.39,
	"phone": "ay"
}, {
	"quality_score": 98.2538070678711,
	"end": 3.72,
	"start": 3.51,
	"phone": "p"
}, {
	"quality_score": 66.72088623046875,
	"end": 4.62,
	"start": 3.72,
	"phone": "s"
}]
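Each entry in the snippet above carries its own quality_score, so one possible use is to locate the phone that scored lowest within a word. This is a minimal sketch; the weakest_phone name is hypothetical and not part of the API.

```python
def weakest_phone(phone_breakdown):
    """Return the phone entry with the lowest quality_score, or None if the
    breakdown is empty."""
    if not phone_breakdown:
        return None
    return min(phone_breakdown, key=lambda p: p["quality_score"])
```

For the "stripes" breakdown shown above, this selects the final "s" entry, which has the lowest quality_score of the six phones.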

phone	The constituent phone of the word.
quality_score	How well the phone was pronounced.
start / end	The start/end times the phone was detected in