Customizing Text Using Markup

What is Markup?

Markup is simply a way to annotate text in a document. It is used to add instructions for specific content, such as a letter, word, paragraph, or even the entire page. An advantage of markup languages is that the original text and the markup instructions are intermixed in the same document. This can prevent errors and makes the experience easier for the user.

Markup is used widely across the IT industry and each form follows a similar structure where target content is placed between a pair of corresponding tags. Optional information can be added using specific attributes, but these are not always required.

Markup Element / Instruction

A markup instruction is also known as an element. Here is a diagram which shows the basic components of a markup element.

Diagram showing an annotated example of an element structure
example element structure

What is a Tag?

A markup tag is a keyword that instructs the engine to treat some content (letter, word, number etc) in a special way. Tags are enclosed inside angled-brackets, like <this>. They are pre-defined and have a specific meaning to the engine that parses them. Each markup element consists of an opening tag and a corresponding closing tag with some content in between. This tells the engine where the markup element begins and ends.

Every opening tag must have a corresponding closing tag!

For example, suppose we have an engine that parses a document and then displays the text on a web page. If we want to tell the engine that a certain word should be displayed in bold then our original text might look like the following:

I will follow up <bold>asap</bold>

This would then be displayed as:

I will follow up asap

In the example above we told the engine to display the content “asap” in bold by placing it between the relevant opening and closing tags. Note that we included a forward-slash, “/”, symbol at the start of the closing tag. This tells the engine that it is the end of our instruction. If we left out this symbol the engine would instead detect two opening tags and no closing tag. This would cause an error!

Summary:

  • A tag is a keyword that instructs the engine to treat content in a special way

  • Tags are enclosed in <…> brackets

  • Every opening tag must have a corresponding closing tag

  • Closing tags are detected when a forward-slash symbol is placed before the tag name, </…>

What is an Attribute?

Attributes are usually optional and are part of the opening tag, but never the closing tag! They are used to give the engine more information about the markup element. We assign the attribute a value using the = sign. This value should be inside double quotes, like this: name = "value". An opening tag can have more than one attribute, and the order doesn't matter.

For example, we want to tell our engine that a certain piece of text should be displayed as a hyperlink. Our original text would look like the following:

Please visit my website <link url = "mywebsite.me">here</link>

Which would then get displayed as

Please visit my website here

The word “here” becomes a hyperlink when the text is displayed, and when the user clicks on it they would be taken to the address provided by the “url” attribute.

Summary:

  • Attributes are usually optional

  • They are part of the opening tag, and never the closing tag

  • They follow the pattern: name = "value"

Introduction to Text Normalization


When using speech technology, a process called “text normalization” must be used on all input words, sentences, and paragraphs. When text is provided to SBL for the building of a CLM, or when a CURL request is sent to the Fluency Web Service, text normalization happens under the hood to ensure the text is compatible with SBL systems. Some portions of the text are pre-normalized by the customer (as per CLM guidelines), and other portions of the text are normalized automatically by SBL.

Typical examples of text normalization include:

  • stripping punctuation

  • lowercasing all letters

  • changing digits to words

Example:

“I have 2 you can borrow,” she saidi have two you can borrow she said

Text normalization helps to ensure the best possible performance of our speech technology.

Default vs Custom Normalization

Standard normalization such as lowercasing and stripping punctuation are straight-forward processes. However, some aspects of text have numerous ways they can be normalized. A key example is digits in text.

If we use “1991” as an example, we see that this can be said in the following ways:

  • nineteen ninety one

  • one thousand nine hundred ninety one

  • one thousand nine hundred and ninety one

  • one nine nine one

Because of this ambiguity, we give customers the option to customize the way certain opaque texts are normalized. However, we maintain a default setting for all text in case this customization is not utilized or needed.

Using Markup for Custom Normalization


We have defined some markup tags in our engine which can be useful for specifying how some content should be normalized. This can be used both in the CLM text sent to SBL and the CURL request sent to the Fluency Web Service.

Here is a quick example which shows the difference between default and custom normalization:

In <year>1776</year>, the declaration of independence was signed.

If this markup is used, the sentence will be normalized to:

in seventeen seventy six the declaration of independence was signed

If this mark is not used, the sentence will be normalized to:

in one thousand seven hundred seventy six the declaration of independence was signed

Use Cases & Examples


We currently support mark-up for the following use cases:

  • Years and Decades

  • Dates

  • Email Addresses

  • Letters

  • Sounding Out

  • Custom Word

Years and Decades

image of a diagram of years use case
year example

Years and decades can both be normalized using the year tag. It will ensure that years and decades in the text are not interpreted as cardinal numbers. The content must consist of digits only, but decades can have a trailing 's' at the end. There are no available attributes for this tag.

Correct Usage:

  • In <year>1913</year>, I sold 1913 apples.

    • in nineteen thirteen i sold one thousand nine hundred thirteen apples

  • In <year>1776</year>, the declaration of independence was signed.

    • in seventeen seventy six the declaration of independence was signed

  • Grandma’s birthday is on August 2nd, <year>1945</year>

    • grandmas birthday is on august second nineteen forty five

  • The Second World War was from <year>1939</year> to <year>1945</year>

    • the second world war was from nineteen thirty nine to nineteen forty five

  • 100 dollars in the <year>1800s</year> is equivalent to 1,904 dollars today.

    • one hundred dollars in the eighteen hundreds is equivalent to one thousand nine hundred four dollars today

  • The women’s suffrage movement in Canada began in the <year>1870s</year>

    • the womens suffrage movement in canada began in the eighteen seventies

  • If I could go to any decade, I’d go to the <year>80s</year>, or maybe the <year>50s</year>

    • if i could go to any decade id go to the eighties or maybe the fifties

Incorrect Usage:

  • <year>19 13</year>

    • Cannot have spaces in the middle of year content

  • <year>nineteen thirteen</year>

    • Content must consist of digits with an optional 's' at the end

  • <year>1913<year>

    • No / symbol in closing tag!

Dates

image of a diagram of a date use case
date example

Dates are normalized using the <date> tag. Currently we only support dates in US format (month/day/year), where each component is a numerical value. For instance, January can be written as 1, or 01 etc. The year value is optional. There are no available attributes for this tag.

Correct Usage:

  • It was <date>10/25/2021</date>

    • it was october twenty fifth twenty twenty one

  • John bought a new car on <date>1/9/1999</date>

    • john bought a new car on january ninth nineteen ninety nine

  • St. Patricks Day is always on <date>03/17</date>

    • st patricks day is always on march seventeenth

Incorrect Usage:

  • <date>13/04/21</date>

    • Month values must be between 1 and 12

  • <date>january/10</date>

    • Components must be numerical values

  • <date>01-01-2022</date>

    • Components must be separated by a “/” symbol

Email Addresses

image of a diagram of an email use case
email example

Email addresses are normalized using the <email> tag. The content can consist of letters, numbers, and most special characters. To keep things simple most symbols are ignored and numbers will be normalized as singular digits. There are no available attributes for this tag.

Correct Usage:

  • My email address is <email>john_doe17@hotmail.com</email>.

    • my email address is john doe one seven at hotmail dot com

  • Please contact me at <email>jane42@test-mail.co.uk</email>

    • please contact me at jane four two at test mail dot co dot uk

Incorrect Usage:

  • <email>jane doe@hotmail.com</email>

    • Spaces are not allowed in email addresses!

  • <email>john.doe.hotmail.com</email>

    • No @ symbol detected

  • <email>john@doe@hotmail.com</email>

    • More than one @ symbol detected

Letters

image of a diagram of a letter use case
letter example

Letters are normalized using the <letter> tag. They have an optional pronunciation attribute which allows the user to specify the desired phoneme breakdown. Phonemes must be separated by a space. Only a single letter is allowed in the content.

Correct Usage:

  • <letter>X</letter>

    • x

  • <letter pronunciation = "ey">a</letter>

    • a

  • <letter pronunciation = "hh ey ch">H</letter>

    • h

Incorrect Usage:

  • <letter>ay</letter>

    • Content must consist of a single letter only

  • <letter pronunciation = "eych">h</letter>

    • Each phoneme must be 1 or 2 characters long

  • <letter pron = "k s">x</letter>

    • Invalid attribute name

Sounding Out

image of a diagram of a sound out example
sound out example

Sounding out a word can be achieved using the sound-out tag. Only singular valid words are supported and must not contain spaces or hyphens. There is an optional pronunciation attribute for specifying the phoneme breakdown. The phonemes must be separated by a space. When the sound-out tag is used for isolated phonemes/phonics (e.g., hh or o), the pronunciation attribute is always required, because these entries are not words in the dictionary.

Correct Usage:

  • <sound-out>Tree</sound-out>

    • t r iy

  • <sound-out pronunciation = "b ao r k">bark</sound-out>

    • b ao r k

  • <sound-out pronunciation = "k ae t">dog</sound-out>

    • k ae t

  • <sound-out pronunciation = "hh">hh</sound-out>

    • hh

  • <sound-out pronunciation = "ao">o</sound-out>

    • ao

  • <sound-out pronunciation = "v uw s" > voos </sound-out>

    • v uw s

Incorrect Usage:

  • <sound-out>hello world</sound-out>

    • Content contains more than one word

  • <sound-out>round-trip</sound-out>

    • Content contains a hyphenated word

  • <sound-out pronunciation = "ks"> x </sound-out>

    • no space in the phoneme sequence

  • <sound-out pronunciation = "a"> a </sound-out>

    • the symbol “a” is not a valid phoneme

Custom Words

image of a diagram for a custom word use case
custom word example

The custom-word tag is used to add custom or made up words. Note that the pronunciation attribute is compulsory! This is to allow the engine know what the phoneme breakdown is, even if it's a valid word that exists in the language.

Correct Usage:

  • <custom-word pronunciation = "v uw s">voos</custom-word>

    • voos

  • <custom-word pronunciation = "m ae s">mas</custom-word>

    • mas

Incorrect Usage:

  • <custom-word>voos</custom-word>

    • Requires pronunciation attribute

  • <custom-word pronunciation = "m ae s k aa t">mas cot</custom-word>

    • Content contains more than one word

What is markup?

Markup is simply a way to annotate text in a document. It is used to add instructions for specific content, such as a letter, word, paragraph, or even the entire page. An advantage of markup languages is that the original text and the markup instructions are intermixed in the same document. This can prevent errors and makes the experience easier for the user.

Markup is used widely across the IT industry, and each form follows a similar structure where target content is placed between a pair of corresponding tags. Optional information can be added using specific attributes, but these are not always required.

Markup element / instruction

A markup instruction is also known as an element. Here is a diagram showing the basic components of a markup element.

Diagram showing an annotated example of an element structure
example element structure

What is a tag?

A markup tag is a keyword that instructs the engine to treat some content (letter, word, number, etc.) in a special way. Tags are enclosed inside angled-brackets, like <this>. They are pre-defined and have a specific meaning to the engine that parses them. Each markup element consists of an opening tag and a corresponding closing tag with some content in between. This tells the engine where the markup element begins and ends.

Every opening tag must have a corresponding closing tag!

For example, suppose we have an engine that parses a document and then displays the text on a web page. If we want to tell the engine that a certain word should be displayed in bold, then our original text might look like the following:

I will follow up <bold>asap</bold>

This would then be displayed as:

I will follow up asap

In the example above, we told the engine to display the content “asap” in bold by placing it between the relevant opening and closing tags. Note that we included a forward-slash, “/”, symbol at the start of the closing tag. This tells the engine that it is the end of our instruction. If we left out this symbol, the engine would instead detect two opening tags and no closing tag. This would cause an error!

Summary

  • A tag is a keyword that instructs the engine to treat content in a special way.

  • Tags are enclosed in <…> brackets.

  • Every opening tag must have a corresponding closing tag.

  • Closing tags are detected when a forward-slash symbol is placed before the tag name, </…>.

What is an attribute?

Attributes are usually optional and are part of the opening tag, but never the closing tag! They are used to give the engine more information about the markup element. We assign the attribute a value using the = sign. This value should be inside double quotes, like this: name = "value". An opening tag can have more than one attribute, and the order doesn't matter.

For example, we want to tell our engine that a certain piece of text should be displayed as a hyperlink. Our original text would look like this:

Please visit my website <link url = "mywebsite.me">here</link>

Which would then get displayed as

Please visit my website here

The word “here” becomes a hyperlink when the text is displayed, and when the user clicks on it they would be taken to the address provided by the “url” attribute.

Summary:

  • Attributes are usually optional.

  • They are part of the opening tag, and never the closing tag.

  • They follow the pattern: name = "value".

Introduction to text normalization


When using speech technology, a process called “text normalization” must be used on all input words, sentences, and paragraphs. When text is provided to SoapBox Labs for the building of a custom language model (CLM), or when a CURL request is sent to the Fluency Web Service, text normalization happens under the hood to ensure the text is compatible with SoapBox systems. Some portions of the text are pre-normalized by the customer (as per CLM guidelines), and other portions of the text are normalized automatically by SoapBox.

Typical examples of text normalization include:

  • stripping punctuation

  • lowercasing all letters

  • changing digits to words

Example:

“I have 2 you can borrow,” she said.i have two you can borrow she said

Text normalization helps to ensure the best possible performance of our speech technology.

Default normalization vs using markup for normalization

Standard normalization, such as lowercasing and stripping punctuation, are straight-forward processes. However, some aspects of text have numerous ways they can be normalized. A key example is digits in text.

If we use “1991” as an example, we see that this can be said in the following ways:

  • nineteen ninety one

  • one thousand nine hundred ninety one

  • one thousand nine hundred and ninety one

  • one nine nine one

Because of this ambiguity, we give customers the option to customize the way potentially messy pieces of text are normalized. However, we maintain a default setting for all text in case this customization is not utilized or needed.

Here are a few examples of the default normalization for years and dates:

Text

Default normalization

1991

one thousand nine hundred ninety one

2022

two thousand twenty two

10/25/2021

ten million two hundred fifty two thousand twenty one

And the same examples when using markup

Text

Using markup tag

Markup normalization

1991

<year>

nineteen ninety one

2022

<year>

twenty twenty two

10/25/2021

<date>

october twenty fifth twenty twenty one

Using markup for normalization


We have defined some markup tags in our voice engine, which can be useful for specifying how some content should be normalized. This can be used both in the CLM text sent to SoapBox and the CURL request sent to the Fluency Web Service.

Here is a quick example that shows the difference between default and custom normalization:

In <year>1776</year>, the declaration of independence was signed.

If this markup is used, the sentence will be normalized to:

in seventeen seventy six the declaration of independence was signed

If this mark is not used, the sentence will be normalized to:

in one thousand seven hundred seventy six the declaration of independence was signed

Use cases and examples


We currently support markup for the following use cases:

  • Years and decades

  • Dates

  • Email addresses

  • Letters

  • Sounding out

  • Custom words

Years and decades

image of a diagram of years use case
year example

Years and decades can both be normalized using the year tag. It will ensure that years and decades in the text are not interpreted as cardinal numbers. The content must consist of digits only, but decades can have a trailing 's' at the end. There are no available attributes for this tag.

Correct Usage:

  • In <year>1913</year>, I sold 1913 apples.

    • in nineteen thirteen i sold one thousand nine hundred thirteen apples

  • In <year>1776</year>, the declaration of independence was signed.

    • in seventeen seventy six the declaration of independence was signed

  • Grandma’s birthday is on August 2nd, <year>1945</year>.

    • grandmas birthday is on august second nineteen forty five

  • The Second World War was from <year>1939</year> to <year>1945</year>.

    • the second world war was from nineteen thirty nine to nineteen forty five

  • 100 dollars in the <year>1800s</year> is equivalent to 1,904 dollars today.

    • one hundred dollars in the eighteen hundreds is equivalent to one thousand nine hundred four dollars today

  • The women’s suffrage movement in Canada began in the <year>1870s</year>.

    • the womens suffrage movement in canada began in the eighteen seventies

  • If I could go to any decade, I’d go to the <year>80s</year>, or maybe the <year>50s</year>

    • if i could go to any decade id go to the eighties or maybe the fifties

Incorrect Usage:

  • <year>19 13</year>

    • Cannot have spaces in the middle of year content

  • <year>nineteen thirteen</year>

    • Content must consist of digits with an optional 's' at the end

  • <year>1913<year>

    • No / symbol in closing tag!

Dates

image of a diagram of a date use case
date example

Dates are normalized using the <date> tag. Currently, we only support dates in US format (month/day/year), where each component is a numerical value. For instance, January can be written as 1, or 01 etc. The year value is optional. There are no available attributes for this tag.

Correct Usage:

  • It was <date>10/25/2021</date>.

    • it was october twenty fifth twenty twenty one

  • John bought a new car on <date>1/9/1999</date>.

    • john bought a new car on january ninth nineteen ninety nine

  • St. Patricks Day is always on <date>03/17</date>.

    • st patricks day is always on march seventeenth

Incorrect Usage:

  • <date>13/04/21</date>

    • Month values must be between 1 and 12

  • <date>january/10</date>

    • Components must be numerical values

  • <date>01-01-2022</date>

    • Components must be separated by a “/” symbol

Email addresses

image of a diagram of an email use case
email example

Email addresses are normalized using the <email> tag. The content can consist of letters, numbers, and most special characters. To keep things simple, most symbols are ignored and numbers will be normalized as singular digits. There are no available attributes for this tag.

Here’s a summary of how symbols are normalized:

Symbol

Markup normalization

0 - 9

zero - nine

@

at

.

dot

_ - ! £ $ % ^ & * ( ) + = { }

[ ignored ]

Correct Usage:

  • My email address is <email>john_doe17@hotmail.com</email>.

    • my email address is john doe one seven at hotmail dot com

  • Please contact me at <email>jane42@test-mail.co.uk</email>.

    • please contact me at jane four two at test mail dot co dot uk

Incorrect Usage:

  • <email>jane doe@hotmail.com</email>

    • Spaces are not allowed in email addresses!

  • <email>john.doe.hotmail.com</email>

    • No @ symbol detected

  • <email>john@doe@hotmail.com</email>

    • More than one @ symbol detected

Letters

image of a diagram of a letter use case
letter example

Letters are normalized using the <letter> tag. They have an optional pronunciation attribute, which allows the user to specify the desired phoneme breakdown. Phonemes must be separated by a space. Only a single letter is allowed in the content.

Correct Usage:

  • <letter>X</letter>

    • x

  • <letter pronunciation = "ey">a</letter>

    • a

  • <letter pronunciation = "hh ey ch">H</letter>

    • h

Incorrect Usage:

  • <letter>ay</letter>

    • Content must consist of a single letter only

  • <letter pronunciation = "eych">h</letter>

    • Each phoneme must be 1 or 2 characters long, and phonemes must be separated by spaces

  • <letter pron = "k s">x</letter>

    • Invalid attribute name

Sounding out

image of a diagram of a sound out example
sound out example

Sounding out a word can be achieved by using the sound-out tag. Only singular valid words are supported and must not contain spaces or hyphens. There is an optional pronunciation attribute for specifying the phoneme breakdown. The phonemes must be separated by a space. When the sound-out tag is used for isolated phonemes/phonics (e.g., hh or o), the pronunciation attribute is always required, because these entries are not words in the dictionary.

Correct Usage:

  • <sound-out>Tree</sound-out>

    • t r iy

  • <sound-out pronunciation = "b aa r k">bark</sound-out>

    • b aa r k

  • <sound-out pronunciation = "hh">hh</sound-out>

    • hh

  • <sound-out pronunciation = "ao">o</sound-out>

    • ao

  • <sound-out pronunciation = "v uw s" >voos</sound-out>

    • v uw s

Incorrect Usage:

  • <sound-out>hello world</sound-out>

    • Content contains more than one word

  • <sound-out>round-trip</sound-out>

    • Content contains a hyphenated word

  • <sound-out pronunciation = "ks"> x </sound-out>

    • no space in the phoneme sequence

  • <sound-out pronunciation = "a"> a </sound-out>

    • the symbol “a” is not a valid phoneme

Custom words

image of a diagram for a custom word use case
custom word example

The custom-word tag is used to add custom or made-up words. Note that the pronunciation attribute is compulsory! This is so the engine knows what the phoneme breakdown is, even if it's a valid word that exists in the language.

Correct Usage:

  • <custom-word pronunciation = "v uw s">voos</custom-word>

    • voos

  • <custom-word pronunciation = "m ae s">mas</custom-word>

    • mas

Incorrect Usage:

  • <custom-word>voos</custom-word>

    • Requires pronunciation attribute

  • <custom-word pronunciation = "m ae s k aa t">mas cot</custom-word>

    • Content contains more than one word