Understanding Custom Language Models
Overview:
In order to achieve the highest levels of accuracy with our Speech to Text (STT) and Fluency products, the SoapBox team provides a level of customization for our customers.
Customers can request a Custom Language Model (CLM) which will enable the SoapBox team to generate a language model that is tailored to their domain / corpus. By doing so we can ensure that all corpus words, proper nouns and bespoke words are covered in our dictionaries and models.
To find out more about creating a CLM for your use case, please contact your Client Success Manager or support via our Service Desk or at api@soapboxlabs.com.
The Process
The process of generating a CLM is carried out by our team of experts and is outlined below.
Step 1 - Licensing
Step 2 - Domain Text
Step 3 - Domain Audio
Step 4 - CLM Generated
Step 5 - CLM Evaluated
Step 6 - Customer Integration
Step 1 - Licensing
Customer chooses which solution and licence model suits their needs and signs a licensing agreement in conjunction with their Business Development Representative.
Step 2 - Domain Text
Customer identifies the exact text to be used from their Domain or Corpus to generate their CLM.
In order to create and train a language model the customer must ensure that the text is of the best quality possible. Clean text will ensure that the process is smooth and the outcome is successful.
For example, for a corpus of levelled readers to be used to generate a fluency LM, the customer should observe the following guidelines.
Identify the exact text to be spoken or read by the learner
Conduct a quality control pass on the text
Remove any extraneous text such as teacher instructions, graphic captions, page numbers, footnotes etc.
Ensure that all formatting is correct, for example:
Carriage returns in the correct place
Extra spaces between words removed
All words have one space between them
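The spacing guidelines above can be pre-checked with a short script before text is submitted. The following is a minimal sketch, not SoapBox tooling: the function name `clean_reference_text` is hypothetical, and dropping blank lines is an assumption that may not suit every corpus.

```python
import re

def clean_reference_text(raw: str) -> str:
    """Collapse runs of spaces/tabs to one space and trim each line."""
    cleaned_lines = []
    for line in raw.splitlines():
        # Replace any run of spaces or tabs with a single space, then trim the ends.
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:  # assumption: lines that are empty after trimming are dropped
            cleaned_lines.append(line)
    return "\n".join(cleaned_lines)

sample = "The  cat   sat on\tthe mat. \n\n  It was   happy.  "
print(clean_reference_text(sample))
```

A script like this only fixes spacing. Removing teacher instructions, captions and page numbers still needs a manual quality control pass.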
Step 3 - Domain Audio
Where the customer has access to audio samples of their domain, these are supplied to the SoapBox team to support LM lab testing and tuning.
Audio must meet technical requirements such as:
audio files recorded in WAV format (not converted from a compressed format)
sample rate no lower than 16 kHz
no longer than 1 minute
matched to source text supplied
with annotations / transcriptions where they exist
Where audio samples do not exist, the SoapBox team will generate a sample set for lab testing and tuning.
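The sample rate and duration requirements above can be checked locally before audio is supplied. This is a rough sketch using Python's standard `wave` module; the function name `check_wav` and the constant names are hypothetical, and the check cannot tell whether a file was converted from a compressed format.

```python
import wave

MIN_SAMPLE_RATE_HZ = 16_000    # "no lower than 16 kHz"
MAX_DURATION_SECONDS = 60.0    # "no longer than 1 minute"

def check_wav(path: str) -> list:
    """Return a list of problems found; an empty list means the basic checks passed."""
    try:
        with wave.open(path, "rb") as wav_file:
            sample_rate = wav_file.getframerate()
            duration = wav_file.getnframes() / sample_rate
    except (wave.Error, EOFError) as exc:
        # wave.open raises wave.Error for files that are not valid WAV data.
        return [f"not a readable WAV file: {exc}"]

    problems = []
    if sample_rate < MIN_SAMPLE_RATE_HZ:
        problems.append(f"sample rate {sample_rate} Hz is below {MIN_SAMPLE_RATE_HZ} Hz")
    if duration > MAX_DURATION_SECONDS:
        problems.append(f"duration {duration:.1f} s exceeds {MAX_DURATION_SECONDS:.0f} s")
    return problems
```

Calling `check_wav("sample.wav")` (a hypothetical path) returns an empty list when both checks pass.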
Steps 4 & 5 - CLM Generated & Evaluated
The SoapBox team will analyse the corpus for words, nouns etc. that are not in common usage or are specific to the customer domain.
The SoapBox team will generate a Custom Language Model based on the text corpus provided. The CLM will be lab tested and tuned.
Audio samples may be used to evaluate the CLM performance to ensure that it meets accuracy requirements.
Step 6 - Customer Integration
Customer integrates the API with their product in order to run an evaluation or to release it in the product. See https://docs.soapboxlabs.com/technical-docs/online-technical-documentation/
Appendix I
Reference Text Guidelines
Reference text that is sent to the Fluency product is normalized to better enable comparison of reference text to that transcribed by our ASR engine. To ensure the best normalization occurs, all reference text should be sent in accordance with the following:
File Format | Plain text, e.g. txt |
Character Encoding | UTF-8 |
End-of-Line (CR, LF, EOL) | Carriage returns, Line Feeds, and End-Of-Line Characters are expected to be found … |
Default Numbers | Numbers are converted to text. At present, only counting numbers are supported by default. E.g. 1 becomes one, 94 becomes ninety four, 1445 becomes one thousand four hundred forty five |
Years & Decades | Years and decades are supported if wrapped in the years markup (described under Using Markup for Custom Normalization). Otherwise, they will be normalized as default numbers. |
Ordinals | Ordinals are supported if the ordinal indicator is attached to the number. E.g: ✔ May 15th → may fifteenth ❌ May 15 → may fifteen |
Other Numbers | Numbers used in cases like money, decimals, time, etc. are not currently supported. To get the correct output, please convert numbers, currency symbols, percentage signs, etc. to words before submitting your text to SBL. |
Letters | Upper case letters are converted to lower case letters. E.g. She went to Macy’s → she went to macy’s |
White-space | White-space is used to identify word boundaries and is then stripped. Note: To ensure the correct identification of word boundaries please take care to ensure that white-space follows … |
Punctuation | All non-supported punctuation will be removed from the reference text. For a list of supported and non-supported punctuation, see UTF-8 Character Normalization below. Punctuation is expected in … |
Introduction to Text Normalization
When using speech technology, a process called “text normalization” must be applied to all input words, sentences, and paragraphs. When text is provided to SBL for the building of a CLM, or when a cURL request is sent to the Fluency Web Service, text normalization happens under the hood to ensure the text is compatible with SBL systems. Some portions of the text are pre-normalized by the customer (as per CLM guidelines), and other portions of the text are normalized automatically by SBL.
Typical examples of text normalization include:
stripping punctuation
lowercasing all letters
changing digits to words
Example:
“I have 2 you can borrow,” she said → i have two you can borrow she said
Text normalization helps to ensure the best possible performance of our speech technology.
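To make the three steps concrete, here is a minimal sketch that reproduces the example above. It is an illustration only and not SBL's normalizer: the names `cardinal` and `normalize` are hypothetical, the punctuation handling is simplified to "keep ASCII letters, digits and apostrophes", and only whole numbers of up to six digits are spelled out.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def cardinal(n: int) -> str:
    """Spell out 0..999999 as a counting number, without 'and'."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + (" " + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + cardinal(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return cardinal(thousands) + " thousand" + (" " + cardinal(rest) if rest else "")

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and spell out whole-number tokens."""
    words = []
    for token in text.lower().split():
        # Simplification: keep only ASCII letters, digits and apostrophes.
        token = re.sub(r"[^a-z0-9']", "", token)
        if token.isdigit() and len(token) <= 6:
            token = cardinal(int(token))
        if token:
            words.append(token)
    return " ".join(words)

print(normalize("\u201cI have 2 you can borrow,\u201d she said"))
# → i have two you can borrow she said
```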
UTF-8 Character Normalization
Reference text that is sent to the Fluency product is normalized to better enable comparison of reference text to that transcribed by our ASR engine. The following table describes what normalization will take place:
Punctuation Name(s) | Punctuation | Unicode Char (UTF-8) | Action | Notes |
---|---|---|---|---|
Curly Single Quote, Apostrophe | ‘ ' | | Curly single quotes are changed to apostrophes. Apostrophes are left as-is. | |
Double Quote | “ | | Removed | |
Comma, Full-Stop, Question Mark, Exclamation Mark | , . ? ! | | Removed | |
Hyphen, En Dash, Em Dash | - – — | | Replaced with a single space | |
Dollar Sign, Percentage, Ampersand, Hash Sign, At Sign | $ % & # @ | | Removed | |
Parentheses, Curly Brackets, Straight Brackets | () {} [] | | Removed | |
Angled Brackets | <> | | Utilised for Markup | The angled brackets indicate markup to SBL's system, and should only be in the text for this purpose |
Semi-Colon, Colon | ; : | | Removed | |
Front Slash, Back Slash, Pipe | / \ \| | | Removed | |
Plus, Equals, Asterisk | + = * | | Removed | |
Caret, Tilde, Underscore | ^ ~ _ | | Removed | |
Whitespace | | | Removed | All words are differentiated by the use of white-space, i.e. white-space splitting is employed. Splitting is conducted before any punctuation normalization occurs. |
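The actions in the table can be approximated with a character translation table. The sketch below is an illustration under assumptions, not the production behaviour: the name `normalize_punctuation` is hypothetical, closing curly quotes and the straight double quote are assumed to be treated like their listed counterparts, and angled brackets are left untouched because they are reserved for markup.

```python
# Characters the table marks as "Removed" (plus assumed quote variants).
REMOVED = "\u201c\u201d\",.?!$%&#@(){}[];:/\\|+=*^~_"
TO_SPACE = "-\u2013\u2014"        # hyphen, en dash, em dash
CURLY_SINGLE = "\u2018\u2019"     # curly single quotes become apostrophes

TABLE = str.maketrans({
    **dict.fromkeys(REMOVED, None),
    **dict.fromkeys(TO_SPACE, " "),
    **dict.fromkeys(CURLY_SINGLE, "'"),
})

def normalize_punctuation(text: str) -> str:
    """Split on white-space first, then apply the punctuation actions per token."""
    words = []
    for token in text.split():
        cleaned = token.translate(TABLE)
        # A token may become empty (e.g. "&") or gain a space (e.g. "well-known").
        words.extend(cleaned.split())
    return " ".join(words)

print(normalize_punctuation("Wait \u2014 it\u2019s a well-known fact (really)!"))
# → Wait it's a well known fact really
```

Lowercasing and number handling are deliberately left out so that only the punctuation rules are shown.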
Default vs Custom Normalization
Standard normalization steps such as lowercasing and stripping punctuation are straightforward. However, some aspects of text can be normalized in several different ways. A key example is digits in text.
If we use “1991” as an example, we see that this can be said in the following ways:
nineteen ninety one
one thousand nine hundred ninety one
one thousand nine hundred and ninety one
one nine nine one
Because of this ambiguity, we give customers the option to customize the way certain ambiguous text is normalized. However, we maintain a default setting for all text in case this customization is not used or needed.
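Two of the readings listed above can be produced mechanically from the same digit string, which shows why a single default cannot suit every case. This is a naive sketch with hypothetical function names; it does not special-case years such as 1905 or 2000, which a real year reading would need to handle.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def two_digits(n: int) -> str:
    """Spell out 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + (" " + ONES[ones] if ones else "")

def as_year_pairs(digits: str) -> str:
    """Read four digits as two two-digit numbers (naive: no handling of '05' or '00')."""
    return two_digits(int(digits[:2])) + " " + two_digits(int(digits[2:]))

def as_single_digits(digits: str) -> str:
    """Read each digit on its own."""
    return " ".join(ONES[int(d)] for d in digits)

print(as_year_pairs("1991"))     # → nineteen ninety one
print(as_single_digits("1991"))  # → one nine nine one
```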
Using Markup for Custom Normalization
We have defined some markup tags in our engine which can be used to specify how certain content should be normalized. These tags can be used both in the CLM text sent to SBL and in the cURL request sent to the Fluency Web Service.
Here is a quick example which shows the difference between default and custom normalization:
In <year>1776</year>, the declaration of independence was signed.
If this markup is used, the sentence will be normalized to:
in seventeen seventy six the declaration of independence was signed
If this markup is not used, the sentence will be normalized to:
in one thousand seven hundred seventy six the declaration of independence was signed
Use Cases & Examples
We currently support mark-up for the following use cases:
Years and Decades
Dates
Email Addresses
Letters
Sounding Out
Custom Words
Years and Decades
Years and decades can both be normalized using the <year> tag. This ensures that years and decades in the text are not interpreted as cardinal numbers. The content must consist of digits only, although decades can have a trailing 's'. There are no available attributes for this tag.
Correct Usage:
In <year>1913</year>, I sold 1913 apples. → in nineteen thirteen i sold one thousand nine hundred thirteen apples
In <year>1776</year>, the declaration of independence was signed. → in seventeen seventy six the declaration of independence was signed
Grandma’s birthday is on August 2nd, <year>1945</year> → grandmas birthday is on august second nineteen forty five
The Second World War was from <year>1939</year> to <year>1945</year> → the second world war was from nineteen thirty nine to nineteen forty five
100 dollars in the <year>1800s</year> is equivalent to 1,904 dollars today. → one hundred dollars in the eighteen hundreds is equivalent to one thousand nine hundred four dollars today
The women’s suffrage movement in Canada began in the <year>1870s</year> → the womens suffrage movement in canada began in the eighteen seventies
If I could go to any decade, I’d go to the <year>80s</year>, or maybe the <year>50s</year> → if i could go to any decade id go to the eighties or maybe the fifties
Incorrect Usage:
<year>19 13</year>: Cannot have spaces in the middle of year content
<year>nineteen thirteen</year>: Content must consist of digits with an optional 's' at the end
<year>1913<year>: No / symbol in closing tag!
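The rules above lend themselves to a quick pre-submission check. The sketch below is an assumption-laden illustration: the pattern and the function names `find_year_tags` and `has_malformed_year_tag` are hypothetical, and the number of digits allowed is left open because the guidelines do not state it.

```python
import re

# Digits only, with an optional trailing "s" for decades, inside a properly closed tag.
YEAR_TAG = re.compile(r"<year>(\d+)(s?)</year>")

def find_year_tags(text: str) -> list:
    """Return (digits, is_decade) for each well-formed <year> tag in the text."""
    return [(m.group(1), m.group(2) == "s") for m in YEAR_TAG.finditer(text)]

def has_malformed_year_tag(text: str) -> bool:
    """True if any <year> or </year> is left over once well-formed tags are removed."""
    remainder = YEAR_TAG.sub("", text)
    return "<year>" in remainder or "</year>" in remainder

print(find_year_tags("In <year>1913</year>, I sold 1913 apples."))
print(has_malformed_year_tag("<year>19 13</year>"))
```

The untagged `1913` in the first sentence is not reported, which matches the guideline that untagged numbers fall back to default normalization.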
Dates
Dates are normalized using the <date> tag. Currently we only support dates in US format (month/day/year), where each component is a numerical value. For instance, January can be written as 1 or 01. The year value is optional. There are no available attributes for this tag.
Correct Usage:
It was <date>10/25/2021</date> → it was october twenty fifth twenty twenty one
John bought a new car on <date>1/9/1999</date> → john bought a new car on january ninth nineteen ninety nine
St. Patricks Day is always on <date>03/17</date> → st patricks day is always on march seventeenth
Incorrect Usage:
<date>13/04/21</date>: Month values must be between 1 and 12
<date>january/10</date>: Components must be numerical values
<date>01-01-2022</date>: Components must be separated by a “/” symbol
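The content of a date tag can be validated against these rules before the text is sent. This is a sketch only: `parse_us_date` is a hypothetical helper, and the day-of-month check (which always allows 29 February because the year is optional) is an assumption not stated in the guidelines.

```python
import calendar
from typing import Optional, Tuple

def parse_us_date(content: str) -> Tuple[int, int, Optional[int]]:
    """Parse 'month/day' or 'month/day/year'; raise ValueError if a rule is broken."""
    parts = content.split("/")
    if len(parts) not in (2, 3):
        raise ValueError("components must be separated by '/' (month/day[/year])")
    if not all(p.isascii() and p.isdigit() for p in parts):
        raise ValueError("components must be numerical values")
    month, day = int(parts[0]), int(parts[1])
    year = int(parts[2]) if len(parts) == 3 else None
    if not 1 <= month <= 12:
        raise ValueError("month values must be between 1 and 12")
    # Assumption: use a leap year (2000) so that 29 February is always accepted.
    if not 1 <= day <= calendar.monthrange(2000, month)[1]:
        raise ValueError("day value is out of range for the month")
    return month, day, year

print(parse_us_date("10/25/2021"))  # → (10, 25, 2021)
print(parse_us_date("03/17"))       # → (3, 17, None)
```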
Email Addresses
Email addresses are normalized using the <email> tag. The content can consist of letters, numbers, and most special characters. To keep things simple, most symbols are ignored and numbers are normalized as single digits. There are no available attributes for this tag.
Correct Usage:
My email address is <email>john_doe17@hotmail.com</email>. → my email address is john doe one seven at hotmail dot com
Please contact me at <email>jane42@test-mail.co.uk</email> → please contact me at jane four two at test mail dot co dot uk
Incorrect Usage:
<email>jane doe@hotmail.com</email>: Spaces are not allowed in email addresses!
<email>john.doe.hotmail.com</email>: No @ symbol detected
<email>john@doe@hotmail.com</email>: More than one @ symbol detected
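The email behaviour shown in the examples can be imitated in a few lines. The following sketch is inferred from those examples rather than taken from SBL's engine: `normalize_email` is a hypothetical name, and treating every symbol other than "@" and "." as a silent word separator is an assumption.

```python
DIGITS = ["zero", "one", "two", "three", "four", "five", "six", "seven",
          "eight", "nine"]

def normalize_email(address: str) -> str:
    """Spell out an email address: '@' -> 'at', '.' -> 'dot', digits one by one."""
    if any(ch.isspace() for ch in address):
        raise ValueError("spaces are not allowed in email addresses")
    if address.count("@") == 0:
        raise ValueError("no @ symbol detected")
    if address.count("@") > 1:
        raise ValueError("more than one @ symbol detected")

    words = []
    letters = ""                      # run of letters collected so far
    for ch in address.lower():
        if ch.isalpha():
            letters += ch
            continue
        if letters:                   # any non-letter ends the current run
            words.append(letters)
            letters = ""
        if ch in "0123456789":
            words.append(DIGITS[int(ch)])
        elif ch == "@":
            words.append("at")
        elif ch == ".":
            words.append("dot")
        # assumption: any other symbol (e.g. "_" or "-") is ignored
    if letters:
        words.append(letters)
    return " ".join(words)

print(normalize_email("john_doe17@hotmail.com"))
# → john doe one seven at hotmail dot com
```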
Letters
Letters are normalized using the <letter> tag. The tag has an optional pronunciation attribute which allows the user to specify the desired phoneme breakdown. Phonemes must be separated by a space. Only a single letter is allowed in the content.
Correct Usage:
<letter>X</letter> → x
<letter pronunciation = "ey">a</letter> → a
<letter pronunciation = "hh ey ch">H</letter> → h
Incorrect Usage:
<letter>ay</letter>: Content must consist of a single letter only
<letter pronunciation = "eych">h</letter>: Each phoneme must be 1 or 2 characters long
<letter pron = "k s">x</letter>: Invalid attribute name
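A small checker can catch the three mistakes listed above before text is submitted. This is a sketch under assumptions: the regular expression and the name `check_letter_tag` are hypothetical, and the set of valid phoneme symbols is not checked because it is not listed here.

```python
import re

# Optional pronunciation attribute, then the content, then the closing tag.
LETTER_TAG = re.compile(
    r'<letter(?:\s+pronunciation\s*=\s*"([^"]*)")?\s*>([^<]*)</letter>'
)

def check_letter_tag(markup: str) -> list:
    """Return a list of problems; an empty list means the tag looks valid."""
    match = LETTER_TAG.fullmatch(markup.strip())
    if match is None:
        return ["not a well-formed <letter> tag (check tag and attribute names)"]
    pronunciation, content = match.group(1), match.group(2).strip()
    problems = []
    if len(content) != 1 or not content.isalpha():
        problems.append("content must consist of a single letter only")
    if pronunciation is not None:
        for phoneme in pronunciation.split():
            if not 1 <= len(phoneme) <= 2:
                problems.append(f"phoneme {phoneme!r} must be 1 or 2 characters long")
    return problems

print(check_letter_tag('<letter pronunciation = "hh ey ch">H</letter>'))  # → []
print(check_letter_tag('<letter pron = "k s">x</letter>'))
```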
Sounding Out
Sounding out a word can be achieved using the <sound-out> tag. Only single, valid words are supported, and the content must not contain spaces or hyphens. There is an optional pronunciation attribute for specifying the phoneme breakdown. The phonemes must be separated by a space.
Correct Usage:
<sound-out>Tree</sound-out> → tree
<sound-out pronunciation = "b ao r k">bark</sound-out> → bark
<sound-out pronunciation = "k ae t">dog</sound-out>
Incorrect Usage:
<sound-out>hello world</sound-out>: Content contains more than one word
<sound-out>round-trip</sound-out>: Content contains a hyphenated word
Custom Words
The <custom-word> tag is used to add custom or made-up words. Note that the pronunciation attribute is compulsory! This lets the engine know what the phoneme breakdown is, even if the content is a valid word that exists in the language.
Correct Usage:
<custom-word pronunciation = "v uw s">voos</custom-word> → voos
<custom-word pronunciation = "m ae s">mas</custom-word> → mas
Incorrect Usage:
<custom-word>voos</custom-word>: Requires pronunciation attribute
<custom-word pronunciation = "m ae s k aa t">mas cot</custom-word>: Content contains more than one word
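When reference text is generated programmatically, building the tag with a helper avoids both mistakes above. The sketch below uses a hypothetical function name, `custom_word`, and assumes phonemes are joined with single spaces as in the examples.

```python
def custom_word(word: str, phonemes: list) -> str:
    """Build <custom-word> markup; the pronunciation attribute is compulsory."""
    if not phonemes:
        raise ValueError("requires pronunciation attribute")
    if len(word.split()) != 1:
        raise ValueError("content must be exactly one word")
    pronunciation = " ".join(phonemes)  # phonemes separated by a single space
    return f'<custom-word pronunciation = "{pronunciation}">{word}</custom-word>'

print(custom_word("voos", ["v", "uw", "s"]))
# → <custom-word pronunciation = "v uw s">voos</custom-word>
```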