Speech-To-Text (Online)

This page demonstrates how to obtain a transcription of speech contained in an audio file using our Speech-To-Text (STT) Web Service.

Cloud Recognition v1 is officially released and generally available at the https://api.soapboxlabs.com/v1/speech/recognition endpoint.

Getting Started


Best Practices

Make sure to review the documentation around Audio Encoding and Best Practices.

Data Privacy

Review our Data Privacy resources.

RESTful Web Services

Review the SoapBox Cloud Web Service documentation to familiarize yourself with the concepts involved.

Speech-To-Text Concepts

Familiarize yourself with the concepts of Speech-To-Text.

Custom Language Model (CLM)

You will need a CLM to use our Fluency service. More information is available in Understanding CLMs.

Interaction Method

Choose an interaction method here. Asynchronous requests are priced differently to synchronous requests. Please contact us for more information.

Authentication

Review the Web Service Authentication documentation.

Sending a Cloud STT Request


Once authenticated, send STT requests via HTTPS to the Recognition API endpoint. The CURL example on this page interacts with the Recognition API in an asynchronous manner.

Requests should be sent to this endpoint via an HTTPS POST request. The following parameters should be specified for a successful response:

Name | Type | Description
x-app-key | header | For test purposes. The App Key to use to authenticate with the Web Service.
file | field | The audio file to be analyzed.
user_token | field | A unique id that represents the speaker in the audio file. This should be a non-human-readable alphanumeric identifier (such as a UUID) that has meaning to you but not SoapBox Labs. This token can be used to request deletion of a specific user's data in line with our Data Privacy commitments.
model_id | field | The alphanumeric ID of the CLM that represents your custom domain. Please refer to the section on CLM for further detail.
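As an illustration of the user_token guidance above, the following minimal sketch generates an opaque identifier from a random UUID. The function name new_user_token is hypothetical, and using the dash-free hex form is an assumption made to keep the value strictly alphanumeric; the source does not say whether dashed UUIDs are accepted.

```python
import uuid


def new_user_token() -> str:
    """Return a random 32-character hex string (a UUID4 without dashes)."""
    return uuid.uuid4().hex


# Generate one token per speaker and store the mapping on your side,
# so that the same speaker always uses the same token.
token = new_user_token()
print(token)
```

Because the value is random, it carries no personal information and only has meaning through the mapping you keep.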

Example CURL Request

Using the app_key and model_id above, you can send requests to the STT Web Service. There are various tools and technologies available to accomplish this. One such tool is “CURL”, which is widely available on Mac and Linux and is also available on Windows. CURL allows you to send HTTP requests and display the response. 

The following sample CURL commands submit an audio file to the STT endpoint asynchronously. The request contains the app_key, user_token, model_id and file parameters and should be run from the same directory as the audio file.

Step 1 - Submit Request (POST)

curl -X POST \
    -H "x-app-key:$APP_KEY" \
    -F "file=@$AUDIO_FILE" \
    -F "user_token=$USER_TOKEN" \
    -F "model_id=$MODEL_ID" \
    https://api.soapboxlabs.com/v1/async/request/speech/recognition

If successful this will return a JSON response similar to the following…

{
  "id":"abc123",
  "status_uri":"https://api.northeurope.prod.nostromo.soapboxlabs.com/v1/async/results/abc123"
}

The status_uri URL will differ based on your region.
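For readers who prefer a script to the command line, the Step 1 request can be sketched in Python using only the standard library. The helper names build_multipart and submit_audio are hypothetical, and the application/octet-stream content type for the file part is an assumption; the header, field names and endpoint are taken from the CURL command above.

```python
import json
import os
import urllib.request
import uuid

ENDPOINT = "https://api.soapboxlabs.com/v1/async/request/speech/recognition"


def build_multipart(fields, file_field, file_name, file_bytes):
    """Encode text fields plus one file as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode("utf-8")
        )
    # Assumption: a generic binary content type is acceptable for the audio part.
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="{file_field}"; filename="{file_name}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'.encode("utf-8")
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def submit_audio(app_key, audio_path, user_token, model_id, endpoint=ENDPOINT):
    """POST the audio file and return the parsed JSON body (id, status_uri)."""
    with open(audio_path, "rb") as handle:
        audio = handle.read()
    body, content_type = build_multipart(
        {"user_token": user_token, "model_id": model_id},
        "file",
        os.path.basename(audio_path),
        audio,
    )
    request = urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={"x-app-key": app_key, "Content-Type": content_type},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as submit_audio(app_key, "sample.wav", user_token, model_id) would return a dictionary whose "status_uri" value is used in Step 2; the file name here is only a placeholder.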


Step 2 - Retrieve Result (GET)

Extract the value of the “status_uri” field and use it to retrieve the result. Allow some time for processing before the first attempt. If the result is not yet available, you will receive an HTTP 404 status code; in that case, wait for a period of time before retrying. The result retrieval URL depends on where the request is sent from.

curl https://api.northeurope.prod.nostromo.soapboxlabs.com/v1/async/results/abc123
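The wait-and-retry behaviour described in Step 2 can be sketched as a small polling loop. The function names, the number of attempts and the delay are all hypothetical choices, not values recommended by the service; the only behaviour taken from the text is that HTTP 404 means the result is not ready yet.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_once(status_uri):
    """Return the parsed result, or None if the service answers HTTP 404."""
    try:
        with urllib.request.urlopen(status_uri, timeout=30) as response:
            return json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # result not available yet
        raise  # any other status is treated as a real error


def wait_for_result(status_uri, fetch=fetch_once, attempts=10, delay=2.0, sleep=time.sleep):
    """Poll status_uri until a result arrives or the attempts run out."""
    for attempt in range(attempts):
        result = fetch(status_uri)
        if result is not None:
            return result
        if attempt < attempts - 1:
            sleep(delay)  # wait before the next retry
    raise TimeoutError(f"no result after {attempts} attempts")
```

The fetch and sleep parameters are injectable so the loop can be exercised without network access; in normal use only status_uri needs to be passed.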

STT API Response Structure


If the request was successful, the STT Web Service will return a JSON response containing the STT analysis. At the root of the results object are the following fields:

user_id

The user_id specified in the request.

language_code

The language contained in the audio file being analysed.

result_id

A unique identifier for the request.

time

The UTC time at which the request was processed.

results

The nbest list of results from the speech recognizer. See results for further details.

audio_duration

The duration of the audio file in seconds.

num_alternatives

The number of nbest alternatives returned by the speech recognizer (currently set to 1).

The following is an example of the JSON structure you can expect from STT.

{
  "audio_duration": 5.1363749504089355,
  "user_id": "abc123",
  "results": [{
    "confidence": 87.196960449219,
    "transcription": "i like stripes",
    "words": [{
      "confidence": 86.65843963623047,
      "end": 1.8,
      "start": 1.17,
      "phone_breakdown": [{
        "quality_score": 82.96692657470703,
        "end": 1.8,
        "start": 1.17,
        "phone": "ay"
      }],
      "word": "i"
    }, {
      "confidence": 82.57804870605469,
      "end": 2.79,
      "start": 2.04,
      "phone_breakdown": [{
        "quality_score": 91.5013198852539,
        "end": 2.25,
        "start": 2.04,
        "phone": "l"
      }, {
        "quality_score": 69.76779174804688,
        "end": 2.46,
        "start": 2.25,
        "phone": "ay"
      }, {
        "quality_score": 74.07825469970703,
        "end": 2.79,
        "start": 2.46,
        "phone": "k"
      }],
      "word": "like"
    }, {
      "confidence": 92.35440063476562,
      "end": 4.62,
      "start": 3.0,
      "phone_breakdown": [{
        "quality_score": 91.634033203125,
        "end": 3.21,
        "start": 3.0,
        "phone": "s"
      }, {
        "quality_score": 95.79405975341797,
        "end": 3.3,
        "start": 3.21,
        "phone": "t"
      }, {
        "quality_score": 97.87104034423828,
        "end": 3.39,
        "start": 3.3,
        "phone": "r"
      }, {
        "quality_score": 95.29679870605469,
        "end": 3.51,
        "start": 3.39,
        "phone": "ay"
      }, {
        "quality_score": 98.2538070678711,
        "end": 3.72,
        "start": 3.51,
        "phone": "p"
      }, {
        "quality_score": 66.72088623046875,
        "end": 4.62,
        "start": 3.72,
        "phone": "s"
      }],
      "word": "stripes"
    }]
  }],
  "num_alternatives": 1,
  "language_code": "en-GB",
  "result_id": "abc123-282_1638878309646",
  "time": "2021-12-07T11:58:30.257Z"
}
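As a rough sketch of how the root-level fields might be consumed, the snippet below parses a trimmed copy of the example response and converts the "time" value into a timezone-aware datetime. The function name read_metadata is hypothetical, and the timestamp layout is inferred from the single example above, so it may not cover every value the service can return.

```python
import json
from datetime import datetime, timezone

# Root-level fields copied from the example response; "results" is emptied for brevity.
RAW = """{
  "audio_duration": 5.1363749504089355,
  "user_id": "abc123",
  "results": [],
  "num_alternatives": 1,
  "language_code": "en-GB",
  "result_id": "abc123-282_1638878309646",
  "time": "2021-12-07T11:58:30.257Z"
}"""


def read_metadata(response):
    """Collect the root-level fields and parse "time" into an aware datetime."""
    # Assumption: "time" always uses this exact layout with fractional seconds.
    processed_at = datetime.strptime(
        response["time"], "%Y-%m-%dT%H:%M:%S.%fZ"
    ).replace(tzinfo=timezone.utc)
    return {
        "result_id": response["result_id"],
        "user_id": response["user_id"],
        "language_code": response["language_code"],
        "audio_duration": response["audio_duration"],
        "processed_at": processed_at,
    }


metadata = read_metadata(json.loads(RAW))
print(metadata["language_code"], metadata["processed_at"].isoformat())
```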

JSON Breakdown

The following are snippets from the full JSON response above with some additional information for each key.


results

Within the response JSON the “results” node contains a detailed analysis of the audio file including data such as the transcription, the confidence/score of that transcription and a breakdown of each word in the “words” node.

"results": [{
    "confidence": 87.196960449219,
    "transcription": "i like stripes",
    "words": []
}]

transcription

The computed transcription from the audio file.

confidence

The overall confidence score of the transcription.

words

A further breakdown of each of the words contained in the transcription.
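A minimal sketch of reading the transcription from the "results" node follows. The function name best_result is hypothetical. Because the text does not state how the nbest list is ordered, the sketch picks the entry with the highest confidence rather than assuming the first entry is the best.

```python
def best_result(response):
    """Return (transcription, confidence) of the top-scoring result, or None."""
    results = response.get("results") or []
    if not results:
        return None
    top = max(results, key=lambda item: item["confidence"])
    return top["transcription"], top["confidence"]


# Trimmed copy of the "results" snippet shown above.
sample = {
    "results": [
        {"confidence": 87.196960449219, "transcription": "i like stripes", "words": []}
    ]
}
print(best_result(sample))
```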

results \ words

"words": [{
	"confidence": 86.65843963623047,
	"end": 1.8,
	"start": 1.17,
	"phone_breakdown": [],
	"word": "i"
}, {
	"confidence": 82.57804870605469,
	"end": 2.79,
	"start": 2.04,
	"phone_breakdown": [],
	"word": "like"
}, {
	"confidence": 92.35440063476562,
	"end": 4.62,
	"start": 3.0,
	"phone_breakdown": [],
	"word": "stripes"
}]

word

The word detected.

confidence

The confidence of the detected word.

start / end

The start and end times at which the word was detected in the audio file.

phone_breakdown

A further phonetic breakdown of the word. Each phone in the transcription is also scored.
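One possible use of the per-word fields is to flag words that scored below a chosen confidence. The sketch below is illustrative only: the function name low_confidence_words and the threshold of 85 are hypothetical, and the threshold would need tuning for a real application.

```python
def low_confidence_words(words, threshold=85.0):
    """Return (word, confidence, duration) for words scoring below threshold."""
    flagged = []
    for entry in words:
        if entry["confidence"] < threshold:
            duration = round(entry["end"] - entry["start"], 2)
            flagged.append((entry["word"], entry["confidence"], duration))
    return flagged


# The "words" snippet shown above, with phone_breakdown left empty.
WORDS = [
    {"confidence": 86.65843963623047, "end": 1.8, "start": 1.17, "phone_breakdown": [], "word": "i"},
    {"confidence": 82.57804870605469, "end": 2.79, "start": 2.04, "phone_breakdown": [], "word": "like"},
    {"confidence": 92.35440063476562, "end": 4.62, "start": 3.0, "phone_breakdown": [], "word": "stripes"},
]
print(low_confidence_words(WORDS))
```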

results \ words \ phone_breakdown

"phone_breakdown": [{
	"quality_score": 91.634033203125,
	"end": 3.21,
	"start": 3.0,
	"phone": "s"
}, {
	"quality_score": 95.79405975341797,
	"end": 3.3,
	"start": 3.21,
	"phone": "t"
}, {
	"quality_score": 97.87104034423828,
	"end": 3.39,
	"start": 3.3,
	"phone": "r"
}, {
	"quality_score": 95.29679870605469,
	"end": 3.51,
	"start": 3.39,
	"phone": "ay"
}, {
	"quality_score": 98.2538070678711,
	"end": 3.72,
	"start": 3.51,
	"phone": "p"
}, {
	"quality_score": 66.72088623046875,
	"end": 4.62,
	"start": 3.72,
	"phone": "s"
}]
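Given a phone_breakdown list like the one above, a short sketch can locate the phone with the lowest quality_score, which may help point to the part of a word that needs attention. The function name weakest_phone is hypothetical, and treating the lowest score as the weakest pronunciation is an interpretation of the field rather than documented behaviour.

```python
def weakest_phone(phone_breakdown):
    """Return the phone entry with the lowest quality_score, or None if empty."""
    if not phone_breakdown:
        return None
    return min(phone_breakdown, key=lambda entry: entry["quality_score"])


# The phone_breakdown for the word "stripes", copied from the snippet above.
PHONES = [
    {"quality_score": 91.634033203125, "end": 3.21, "start": 3.0, "phone": "s"},
    {"quality_score": 95.79405975341797, "end": 3.3, "start": 3.21, "phone": "t"},
    {"quality_score": 97.87104034423828, "end": 3.39, "start": 3.3, "phone": "r"},
    {"quality_score": 95.29679870605469, "end": 3.51, "start": 3.39, "phone": "ay"},
    {"quality_score": 98.2538070678711, "end": 3.72, "start": 3.51, "phone": "p"},
    {"quality_score": 66.72088623046875, "end": 4.62, "start": 3.72, "phone": "s"},
]
worst = weakest_phone(PHONES)
print(worst["phone"], worst["start"], worst["end"])
```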

phone

The constituent phone of the word.

quality_score

How well the phone was pronounced.

start / end

The start/end times the phone was detected in