Understanding Custom Language Models

Overview:

In order to achieve the highest levels of accuracy with our Speech to Text (STT) and Fluency products, the SoapBox team provides a level of customization for our customers.

Customers can request a Custom Language Model (CLM) which will enable the SoapBox team to generate a language model that is tailored to their domain / corpus. By doing so we can ensure that all corpus words, proper nouns and bespoke words are covered in our dictionaries and models.

To find out more about creating a CLM for your use case, please contact your Client Success Manager or support via our Service Desk or at api@soapboxlabs.com.

The Process

The process of generating a CLM is carried out by our team of experts and is outlined below.

  1. Step 1 - Licencing

  2. Step 2 - Domain Text

  3. Step 3 - Domain Audio

  4. Step 4 - CLM Generated

  5. Step 5 - CLM Evaluated

  6. Step 6 - Customer Integration

Step 1 - Licencing

Customer chooses which solution and licence model suits their needs and signs a licencing agreement in conjunction with their Business Development Representative.

Step 2 - Domain Text

Customer identifies the exact text to be used from their Domain or Corpus to generate their CLM.

In order to create and train a language model the customer must ensure that the text is of the best quality possible. Clean text will ensure that the process is smooth and the outcome is successful.

For example, for a corpus of levelled readers to be used to generate a fluency LM, the customer should observe the following guidelines.


  • Identify the exact text to be spoken or read by the learner

  • Conduct a quality control pass on the text

  • Remove any extraneous text such as teacher instructions, graphic captions, page numbers, footnotes etc.

  • Ensure that all formatting is correct, for example:

    • Carriage returns in the correct place

    • Extra spaces between words removed

    • All words have one space between them
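The spacing guidelines above can be partly automated before the text is submitted. The following is a minimal sketch, not SoapBox tooling: the function name `clean_domain_text` is hypothetical, and the placement of line breaks and the removal of extraneous text still need a manual quality control pass.

```python
import re


def clean_domain_text(raw: str) -> str:
    """Collapse runs of spaces/tabs to one space and trim every line."""
    cleaned_lines = []
    for line in raw.splitlines():
        # Replace any run of spaces or tabs with a single space, then
        # drop leading and trailing white-space on the line.
        cleaned_lines.append(re.sub(r"[ \t]+", " ", line).strip())
    return "\n".join(cleaned_lines)


print(clean_domain_text("The  cat   sat \n on the\tmat. "))
# The cat sat
# on the mat.
```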

Step 3 - Domain Audio

Where the customer has access to audio samples of their domain, these are supplied to the SoapBox team to support LM lab testing and tuning.


Audio must meet technical requirements such as:

  • audio files recorded in WAV format (not converted from a compressed format)

  • sample rate no lower than 16 kHz

  • no longer than 1 minute

  • matched to source text supplied

  • with annotations / transcriptions where they exist

Where audio samples do not exist, the SoapBox team will generate a sample set for lab testing and tuning.
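Two of the audio requirements above (sample rate and duration) can be checked locally before files are supplied. This is a rough sketch using Python's standard `wave` module; the function name `check_wav` and the constants are illustrative assumptions. It cannot detect whether a file was converted from a compressed format, nor whether it matches the source text.

```python
import wave

MIN_SAMPLE_RATE_HZ = 16_000  # "no lower than 16kHz"
MAX_DURATION_S = 60.0        # "not longer than 1min"


def check_wav(path: str) -> list[str]:
    """Return a list of problems; an empty list means both checks passed."""
    try:
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            duration = wav.getnframes() / rate
    except (wave.Error, EOFError) as exc:
        return [f"not a readable WAV file: {exc}"]

    problems = []
    if rate < MIN_SAMPLE_RATE_HZ:
        problems.append(f"sample rate {rate} Hz is below {MIN_SAMPLE_RATE_HZ} Hz")
    if duration > MAX_DURATION_S:
        problems.append(f"duration {duration:.1f} s exceeds {MAX_DURATION_S:.0f} s")
    return problems
```

Calling `check_wav("sample.wav")` on a compliant file returns an empty list; otherwise each string describes one failed requirement.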

Steps 4 & 5 - CLM Generated & Evaluated

The SoapBox team will analyse the corpus for words, proper nouns, etc. that are not in common usage or are specific to the customer domain.

The SoapBox team will then generate a Custom Language Model based on the text corpus provided. The CLM will be lab tested and tuned.

Audio samples may be used to evaluate the CLM performance to ensure that it meets accuracy requirements.


Step 6 - Customer Integration

Customer integrates the API with their product in order to run an evaluation or to release it in the product. See https://docs.soapboxlabs.com/technical-docs/online-technical-documentation/

Appendix I

Reference Text Guidelines

Reference text that is sent to the Fluency product is normalized to better enable comparison of reference text to that transcribed by our ASR engine. To ensure the best normalization occurs, all reference text should be sent in accordance with the following:


File Format

Plain text, e.g. .txt

Character Encoding

UTF-8

End-of-Line (CR, LF, EOL)

Carriage returns, Line Feeds, and End-Of-Line Characters are expected to be found …

Default Numbers

Numbers are converted to text. At present, only counting numbers are supported by default.

E.g. 1 becomes one, 94 becomes ninety four, 1445 becomes one thousand four hundred forty five
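To illustrate the default reading, here is a small sketch that reproduces the three examples above. It is not SBL's implementation: the function names are hypothetical, the supported range (0 to 999,999) is an arbitrary choice, and rendering 0 as "zero" is an assumption.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def _below_thousand(n: int) -> list[str]:
    """Words for 1..999 (empty list for 0), with no 'and' and no hyphens."""
    words = []
    if n >= 100:
        words += [ONES[n // 100], "hundred"]
        n %= 100
    if n >= 20:
        words.append(TENS[n // 10])
        n %= 10
    if n > 0:
        words.append(ONES[n])
    return words


def cardinal_to_words(n: int) -> str:
    if not 0 <= n < 1_000_000:
        raise ValueError("this sketch only covers 0..999999")
    if n == 0:
        return "zero"  # assumption: not stated in the guidelines
    words = []
    if n >= 1000:
        words += _below_thousand(n // 1000) + ["thousand"]
    words += _below_thousand(n % 1000)
    return " ".join(words)


print(cardinal_to_words(1445))  # one thousand four hundred forty five
```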

Years & Decades

Years and decades are supported if wrapped in the year markup (described under Years and Decades below). Otherwise, they will be normalized as default numbers.

Ordinals

Ordinals are supported if the ordinal indicator is attached to the number.

E.g.

✔ May 15th → may fifteenth

❌ May 15 → may fifteen
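The difference between the two cases above comes down to whether an ordinal indicator (st, nd, rd, th) is attached to the digits. The following sketch shows one way such a rule could work. The names are hypothetical, it only covers 1 to 99, and it does not lowercase the text or check that the indicator agrees with the number.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
IRREGULAR = {"one": "first", "two": "second", "three": "third",
             "five": "fifth", "eight": "eighth", "nine": "ninth",
             "twelve": "twelfth"}
ORDINAL = re.compile(r"\b(\d{1,2})(st|nd|rd|th)\b")


def ordinal_to_words(n: int) -> str:
    if not 1 <= n <= 99:
        raise ValueError("this sketch only covers 1..99")
    if n < 20:
        words = [ONES[n]]
    else:
        words = [TENS[n // 10]] + ([ONES[n % 10]] if n % 10 else [])
    last = words[-1]
    if last in IRREGULAR:
        last = IRREGULAR[last]
    elif last.endswith("y"):
        last = last[:-1] + "ieth"   # twenty -> twentieth
    else:
        last += "th"                # fifteen -> fifteenth
    return " ".join(words[:-1] + [last])


def expand_ordinals(text: str) -> str:
    def replace(match: re.Match) -> str:
        n = int(match.group(1))
        # Leave anything outside the supported range untouched.
        return ordinal_to_words(n) if 1 <= n <= 99 else match.group(0)
    return ORDINAL.sub(replace, text)


print(expand_ordinals("May 15th"))  # May fifteenth
print(expand_ordinals("May 15"))    # May 15 (no indicator, so not an ordinal)
```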

Other Numbers

Numbers used in cases like money, decimals, time, etc. are not currently supported.

In order to get the correct output, please transform numbers and currency symbols, percentage signs, etc to words prior to submitting your text to SBL.

Letters

Upper case letters are converted to lower case letters.

E.g.

She went to Macy’s → she went to macy’s

White-space

All white-space will be stripped from the text, and word boundaries are identified by where it occurred.


Note: To ensure the correct identification of word boundaries please take care to ensure that white-space follows …

Punctuation

All non-supported punctuation will be removed from the reference text.


For a list of supported and non-supported punctuation, see UTF-8 Character Normalization below.

Punctuation is expected in …

Introduction to Text Normalization


When using speech technology, a process called “text normalization” must be used on all input words, sentences, and paragraphs. When text is provided to SBL for the building of a CLM, or when a CURL request is sent to the Fluency Web Service, text normalization happens under the hood to ensure the text is compatible with SBL systems. Some portions of the text are pre-normalized by the customer (as per CLM guidelines), and other portions of the text are normalized automatically by SBL.

Typical examples of text normalization include:

  • stripping punctuation

  • lowercasing all letters

  • changing digits to words

Example:

“I have 2 you can borrow,” she said → i have two you can borrow she said

Text normalization helps to ensure the best possible performance of our speech technology.

UTF-8 Character Normalization

Reference text that is sent to the Fluency product is normalized to better enable comparison of reference text to that transcribed by our ASR engine. The following describes what normalization will take place:

| Punctuation Name(s) | Punctuation | Unicode Char (UTF-8) | Action | Notes |
|---|---|---|---|---|
| Curly Single Quote, Apostrophe | ' | | Curly single quotes are changed to apostrophes. Apostrophes are left as-is. | |
| Double Quote | | | Removed | |
| Comma, Full-Stop, Question Mark, Exclamation Mark | , . ? ! | | Removed | |
| Hyphen, En Dash, Em Dash | - | | Replaced with a single space | |
| Dollar Sign, Percentage, Ampersand, Hash Sign, At Sign | $ % & # @ | | Removed | |
| Parentheses, Curly Brackets, Straight Brackets | () {} [] | | Removed | |
| Brackets, Angled | <> | | Utilised for Markup | The angled brackets indicate markup to SBL's system, and should only be in the text for this purpose. |
| Semi-Colon, Colon | ; : | | Removed | |
| Front Slash, Back Slash, Pipe | / \ \| | | Removed | |
| Plus, Equals, Asterisk | + = * | | Removed | |
| Caret, Tilde, Underscore | ^ ~ _ | | Removed | |
| Whitespace | | | Removed | All words are differentiated by the use of white-space, i.e. white-space splitting is employed. Splitting is conducted before any punctuation normalization occurs. |
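The punctuation rules above, together with the lowercasing rule from the guidelines, can be approximated in a few lines. This is a sketch for checking text locally, not SBL's normalizer. The function name is hypothetical, and the specific code points chosen for curly quotes, curly double quotes and dashes are assumptions, since the Unicode column is not reproduced here. Angled brackets are deliberately left alone because they are reserved for markup.

```python
CURLY_SINGLE_QUOTES = "\u2018\u2019"   # assumed code points for curly single quotes
DASHES = "-\u2013\u2014"               # hyphen, en dash, em dash (assumed code points)
REMOVED = "\"\u201c\u201d,.?!$%&#@(){}[];:/\\|+=*^~_"

_TABLE = str.maketrans({
    **{ch: "'" for ch in CURLY_SINGLE_QUOTES},  # changed to apostrophes
    **{ch: " " for ch in DASHES},               # replaced with a single space
    **{ch: None for ch in REMOVED},             # removed
})


def normalize_reference_text(text: str) -> list[str]:
    """Split on white-space first, then normalize punctuation and case."""
    words = []
    for token in text.split():
        cleaned = token.translate(_TABLE).lower()
        # A dash replaced by a space can turn one token into two words;
        # a token made only of removed punctuation yields no words.
        words.extend(cleaned.split())
    return words


print(normalize_reference_text("Well-known: she went to Macy's!"))
# ['well', 'known', 'she', 'went', 'to', "macy's"]
```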

Default vs Custom Normalization

Standard normalization steps such as lowercasing and stripping punctuation are straightforward processes. However, some aspects of text have numerous ways they can be normalized. A key example is digits in text.

If we use “1991” as an example, we see that this can be said in the following ways:

  • nineteen ninety one

  • one thousand nine hundred ninety one

  • one thousand nine hundred and ninety one

  • one nine nine one

Because of this ambiguity, we give customers the option to customize the way certain ambiguous text is normalized. However, we maintain a default setting for all text in case this customization is not utilized or needed.

Using Markup for Custom Normalization


We have defined some markup tags in our engine which can be useful for specifying how some content should be normalized. This can be used both in the CLM text sent to SBL and the CURL request sent to the Fluency Web Service.

Here is a quick example which shows the difference between default and custom normalization:

In <year>1776</year>, the declaration of independence was signed.

If this markup is used, the sentence will be normalized to:

in seventeen seventy six the declaration of independence was signed

If this markup is not used, the sentence will be normalized to:

in one thousand seven hundred seventy six the declaration of independence was signed

Use Cases & Examples


We currently support mark-up for the following use cases:

  • Years and Decades

  • Dates

  • Email Addresses

  • Letters

  • Sounding Out

  • Custom Words

Years and Decades

Years and decades can both be normalized using the <year> tag. This ensures that years and decades in the text are not interpreted as cardinal numbers. The content must consist of digits only, but decades can have a trailing 's' at the end. There are no available attributes for this tag.

Correct Usage:

  • In <year>1913</year>, I sold 1913 apples.

    • in nineteen thirteen i sold one thousand nine hundred thirteen apples

  • In <year>1776</year>, the declaration of independence was signed.

    • in seventeen seventy six the declaration of independence was signed

  • Grandma’s birthday is on August 2nd, <year>1945</year>

    • grandmas birthday is on august second nineteen forty five

  • The Second World War was from <year>1939</year> to <year>1945</year>

    • the second world war was from nineteen thirty nine to nineteen forty five

  • 100 dollars in the <year>1800s</year> is equivalent to 1,904 dollars today.

    • one hundred dollars in the eighteen hundreds is equivalent to one thousand nine hundred four dollars today

  • The women’s suffrage movement in Canada began in the <year>1870s</year>

    • the womens suffrage movement in canada began in the eighteen seventies

  • If I could go to any decade, I’d go to the <year>80s</year>, or maybe the <year>50s</year>

    • if i could go to any decade id go to the eighties or maybe the fifties

Incorrect Usage:

  • <year>19 13</year>

    • Cannot have spaces in the middle of year content

  • <year>nineteen thirteen</year>

    • Content must consist of digits with an optional 's' at the end

  • <year>1913<year>

    • No / symbol in closing tag!
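The correct-usage examples above suggest that years are read as two pairs of digits and decades as a plural. The sketch below reproduces only those patterns. It is an illustration, not SBL's engine: the function names are hypothetical, and anything the examples do not demonstrate (for instance 2005, 1910s or three-digit years) is rejected rather than guessed.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def _pair(n: int) -> str:
    """Words for a two-digit value in 10..99."""
    if n < 20:
        return ONES[n]
    return TENS[n // 10] + (" " + ONES[n % 10] if n % 10 else "")


def read_year(content: str) -> str:
    match = re.fullmatch(r"(\d{2})?(\d{2})(s?)", content)
    if match is None:
        raise ValueError("content must be 2 or 4 digits with an optional trailing 's'")
    century, rest, is_decade = match.group(1), int(match.group(2)), match.group(3) == "s"
    words = []
    if century is not None:
        if int(century) < 10:
            raise ValueError("not covered by this sketch")
        words.append(_pair(int(century)))
    if is_decade:
        if century is not None and rest == 0:
            words.append("hundreds")                     # 1800s
        elif rest >= 20 and rest % 10 == 0:
            words.append(TENS[rest // 10][:-1] + "ies")  # 1870s, 80s
        else:
            raise ValueError("not covered by this sketch")
    else:
        if century is None or rest < 10:
            raise ValueError("not covered by this sketch")
        words.append(_pair(rest))                        # 1913
    return " ".join(words)


def expand_years(text: str) -> str:
    return re.sub(r"<year>(.*?)</year>", lambda m: read_year(m.group(1)), text)


print(expand_years("In <year>1776</year>, I sold apples in the <year>80s</year>"))
# In seventeen seventy six, I sold apples in the eighties
```

Lowercasing and punctuation stripping are separate steps and are not repeated here.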

Dates

Dates are normalized using the <date> tag. Currently we only support dates in US format (month/day/year), where each component is a numerical value. For instance, January can be written as 1, or 01 etc. The year value is optional. There are no available attributes for this tag.

Correct Usage:

  • It was <date>10/25/2021</date>

    • it was october twenty fifth twenty twenty one

  • John bought a new car on <date>1/9/1999</date>

    • john bought a new car on january ninth nineteen ninety nine

  • St. Patricks Day is always on <date>03/17</date>

    • st patricks day is always on march seventeenth

Incorrect Usage:

  • <date>13/04/21</date>

    • Month values must be between 1 and 12

  • <date>january/10</date>

    • Components must be numerical values

  • <date>01-01-2022</date>

    • Components must be separated by a “/” symbol
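The three incorrect-usage cases above can be caught before text is submitted. Here is a minimal validation sketch for the content of a date tag. The function name `parse_us_date` is hypothetical, and the day range check (1 to 31) is an assumption that the guidelines do not state.

```python
from typing import Optional, Tuple


def parse_us_date(content: str) -> Tuple[int, int, Optional[int]]:
    """Validate 'month/day' or 'month/day/year' and return the numeric parts."""
    parts = content.split("/")
    if len(parts) not in (2, 3):
        raise ValueError("components must be separated by '/' (month/day[/year])")
    if not all(part.isascii() and part.isdigit() for part in parts):
        raise ValueError("components must be numerical values")
    month, day = int(parts[0]), int(parts[1])
    year = int(parts[2]) if len(parts) == 3 else None  # the year is optional
    if not 1 <= month <= 12:
        raise ValueError("month values must be between 1 and 12")
    if not 1 <= day <= 31:  # assumption: not stated in the guidelines
        raise ValueError("day values must be between 1 and 31")
    return month, day, year


print(parse_us_date("10/25/2021"))  # (10, 25, 2021)
print(parse_us_date("03/17"))       # (3, 17, None)
```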

Email Addresses

Email addresses are normalized using the <email> tag. The content can consist of letters, numbers, and most special characters. To keep things simple, most symbols are ignored and numbers will be normalized as single digits. There are no available attributes for this tag.

Correct Usage:

  • My email address is <email>john_doe17@hotmail.com</email>.

    • my email address is john doe one seven at hotmail dot com

  • Please contact me at <email>jane42@test-mail.co.uk</email>

    • please contact me at jane four two at test mail dot co dot uk

Incorrect Usage:

  • <email>jane doe@hotmail.com</email>

    • Spaces are not allowed in email addresses!

  • <email>john.doe.hotmail.com</email>

    • No @ symbol detected

  • <email>john@doe@hotmail.com</email>

    • More than one @ symbol detected
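The behaviour shown in these examples can be sketched as follows: letters are grouped into words, each digit is read on its own, "@" becomes "at", "." becomes "dot", and other symbols only act as word breaks. This is an illustration inferred from the examples above, not SBL's implementation. The function name is hypothetical, and reading 0 as "zero" is an assumption.

```python
DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
               "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}


def normalize_email(content: str) -> str:
    if any(ch.isspace() for ch in content):
        raise ValueError("spaces are not allowed in email addresses")
    if content.count("@") != 1:
        raise ValueError("exactly one @ symbol is required")

    words = []
    current = ""  # letters collected for the word in progress
    for ch in content.lower():
        if ch.isalpha():
            current += ch
            continue
        if current:  # any non-letter ends the current word
            words.append(current)
            current = ""
        if ch in DIGIT_WORDS:
            words.append(DIGIT_WORDS[ch])
        elif ch == "@":
            words.append("at")
        elif ch == ".":
            words.append("dot")
        # Every other symbol is ignored and only acts as a word break.
    if current:
        words.append(current)
    return " ".join(words)


print(normalize_email("john_doe17@hotmail.com"))
# john doe one seven at hotmail dot com
```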

Letters

Letters are normalized using the <letter> tag. They have an optional pronunciation attribute which allows the user to specify the desired phoneme breakdown. Phonemes must be separated by a space. Only a single letter is allowed in the content.

Correct Usage:

  • <letter>X</letter>

    • x

  • <letter pronunciation = "ey">a</letter>

    • a

  • <letter pronunciation = "hh ey ch">H</letter>

    • h

Incorrect Usage:

  • <letter>ay</letter>

    • Content must consist of a single letter only

  • <letter pronunciation = "eych">h</letter>

    • Each phoneme must be 1 or 2 characters long

  • <letter pron = "k s">x</letter>

    • Invalid attribute name
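Because the markup is XML-like, a standard XML parser can be used to pre-check letter tags against the rules above. The sketch below uses Python's `xml.etree.ElementTree`. The function name `parse_letter_tag` is hypothetical, and it does not check whether the phonemes themselves are valid for the engine.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple


def parse_letter_tag(markup: str) -> Tuple[str, Optional[List[str]]]:
    """Return (lowercased letter, phoneme list or None) or raise ValueError."""
    try:
        element = ET.fromstring(markup)
    except ET.ParseError as exc:
        raise ValueError(f"malformed markup: {exc}") from exc
    if element.tag != "letter":
        raise ValueError("expected a <letter> tag")

    unknown = set(element.attrib) - {"pronunciation"}
    if unknown:
        raise ValueError(f"invalid attribute name(s): {sorted(unknown)}")

    letter = element.text or ""
    if len(letter) != 1 or not letter.isalpha() or len(element) > 0:
        raise ValueError("content must consist of a single letter only")

    phonemes = None
    if "pronunciation" in element.attrib:
        # Phonemes are separated by a single space.
        phonemes = element.attrib["pronunciation"].split(" ")
        if not all(1 <= len(p) <= 2 for p in phonemes):
            raise ValueError("each phoneme must be 1 or 2 characters long")
    return letter.lower(), phonemes


print(parse_letter_tag('<letter pronunciation = "hh ey ch">H</letter>'))
# ('h', ['hh', 'ey', 'ch'])
```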

Sounding Out

Sounding out a word can be achieved using the <sound-out> tag. Only single valid words are supported, and the content must not contain spaces or hyphens. There is an optional pronunciation attribute for specifying the phoneme breakdown. The phonemes must be separated by a space.

Correct Usage:

  • <sound-out>Tree</sound-out>

    • tree

  • <sound-out pronunciation = "b ao r k">bark</sound-out>

    • bark

  • <sound-out pronunciation = "k ae t">dog</sound-out>

    • dog

Incorrect Usage:

  • <sound-out>hello world</sound-out>

    • Content contains more than one word

  • <sound-out>round-trip</sound-out>

    • Content contains a hyphenated word

Custom Words

The <custom-word> tag is used to add custom or made-up words. Note that the pronunciation attribute is compulsory! This lets the engine know what the phoneme breakdown is, even if it's a valid word that exists in the language.

Correct Usage:

  • <custom-word pronunciation = "v uw s">voos</custom-word>

    • voos

  • <custom-word pronunciation = "m ae s">mas</custom-word>

    • mas

Incorrect Usage:

  • <custom-word>voos</custom-word>

    • Requires pronunciation attribute

  • <custom-word pronunciation = "m ae s k aa t">mas cot</custom-word>

    • Content contains more than one word
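The main difference between the two word-level tags is whether the pronunciation attribute is optional (sound-out) or compulsory (custom-word). A small sketch of that check is shown below, again using `xml.etree.ElementTree`. The names `WORD_TAGS` and `parse_word_tag` are hypothetical, and applying the hyphen rule only to sound-out is an assumption based on where the rule is stated.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# Tag name -> whether the pronunciation attribute is compulsory.
WORD_TAGS = {"sound-out": False, "custom-word": True}


def parse_word_tag(markup: str) -> Tuple[str, Optional[List[str]]]:
    """Return (lowercased word, phoneme list or None) or raise ValueError."""
    try:
        element = ET.fromstring(markup)
    except ET.ParseError as exc:
        raise ValueError(f"malformed markup: {exc}") from exc
    if element.tag not in WORD_TAGS:
        raise ValueError(f"unsupported tag: {element.tag}")

    word = element.text or ""
    if not word or any(ch.isspace() for ch in word):
        raise ValueError("content must be exactly one word")
    if element.tag == "sound-out" and "-" in word:
        raise ValueError("content must not be a hyphenated word")

    pronunciation = element.attrib.get("pronunciation")
    if pronunciation is None:
        if WORD_TAGS[element.tag]:
            raise ValueError("requires pronunciation attribute")
        return word.lower(), None
    return word.lower(), pronunciation.split()


print(parse_word_tag('<custom-word pronunciation = "v uw s">voos</custom-word>'))
# ('voos', ['v', 'uw', 's'])
```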