Custom Language Model Text & Audio Checklist

This checklist covers the requirements for both your text and audio content handover.

Domain Text

This is the complete body of text that will be used to create your Custom Language Model (CLM). We will work with you to ensure that the process is smooth and the outcome is successful.

Required:

  • files are in plain text format (.txt)
  • all text prompts to be read by students are included
  • text files contain ONLY the text to be read by the student
    (please remove any extraneous text e.g. instructions to teachers, word counts, page numbers etc.)
  • characters have UTF-8 encoding
    (please use UTF-8, not UTF-8 with BOM. For more information: https://en.wikipedia.org/wiki/UTF-8)
  • ensure years and decades are wrapped in years markup; ensure ordinal numbers have ordinal markers
    (e.g. In <year>1985</year>; In the <year>1800s</year>; May 15th)
  • no HTML tags or entities (e.g. &nbsp;, <br>)
  • proper names and unique/rarer words in text are spelled correctly
  • 1 piece of content per text file
    (e.g. story1_speaker1.txt; story1_speaker2.txt; story1_speaker3.txt)
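
Several of the required items above can be checked automatically before handover. The following is a minimal Python sketch, not an official validation tool: the function names and problem messages are hypothetical, and it assumes <year> is the only markup tag permitted in the text, as in the example above. It checks the file extension, UTF-8 encoding without a BOM, and the absence of HTML tags and entities. It cannot check spelling or detect extraneous text such as teacher instructions.

```python
import re
from pathlib import Path

UTF8_BOM = b"\xef\xbb\xbf"
# Any tag other than <year> / </year> (assumed to be the only allowed markup).
HTML_TAG = re.compile(r"</?(?!year>)[A-Za-z][^<>]*>")
# Named or numeric entities that end in a semicolon, e.g. &nbsp; or &#160;
HTML_ENTITY = re.compile(r"&(?:[A-Za-z]+|#\d+);")


def check_text_bytes(name: str, data: bytes) -> list[str]:
    """Return a list of problems found in one text file (empty if none)."""
    problems = []
    if not name.lower().endswith(".txt"):
        problems.append("not a .txt file")
    if data.startswith(UTF8_BOM):
        problems.append("starts with a UTF-8 BOM")
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("not valid UTF-8")
        return problems
    if HTML_TAG.search(text) or HTML_ENTITY.search(text):
        problems.append("contains HTML tags or entities")
    return problems


def check_text_file(path) -> list[str]:
    """Convenience wrapper that reads the file from disk."""
    p = Path(path)
    return check_text_bytes(p.name, p.read_bytes())
```

Calling check_text_file("story1_speaker1.txt") returns an empty list when none of these problems is found.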


Suggested:

Unique words or phrases will need to be reviewed to ensure the best performance and to confirm that they are pronounced as expected.
Where possible, providing a list of the following content will greatly expedite the creation of the CLM:

  • a copy of your pronunciation scheme (if applicable)
  • character names or unique IP/brands
  • acronyms (e.g. WWII, YMCA)
  • Roman numerals (e.g. VIII, XII)
  • alphanumeric words (e.g. PS2)
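
A first draft of such a list can be pulled from the domain text automatically and then reviewed by hand. The Python sketch below is only a rough heuristic with hypothetical names: it flags all-caps tokens as acronyms, tokens mixing letters and digits as alphanumeric words, and well-formed Roman numerals of two or more letters. Decades and ordinals such as 1800s or 15th are skipped. Some tokens are ambiguous (for example, "CD" or "MIX" are also valid Roman numerals), and character or brand names are not detected at all, so the output is a starting point, not a finished list.

```python
import re

ROMAN = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")
DECADE_OR_ORDINAL = re.compile(r"\d+(?:s|st|nd|rd|th)")
YEAR_TAG = re.compile(r"</?year>")
TOKEN = re.compile(r"[A-Za-z0-9]+")


def classify(token: str):
    """Return a category name for a token, or None if it looks ordinary."""
    if DECADE_OR_ORDINAL.fullmatch(token):
        return None
    if len(token) >= 2 and ROMAN.fullmatch(token):
        return "roman numeral"
    if any(c.isdigit() for c in token) and any(c.isalpha() for c in token):
        return "alphanumeric"
    if len(token) >= 2 and token.isupper():
        return "acronym"
    return None


def find_review_candidates(text: str) -> dict[str, list[str]]:
    """Collect tokens worth a manual pronunciation review, by category."""
    found = {"acronym": set(), "roman numeral": set(), "alphanumeric": set()}
    # Drop <year> tags so the tag name itself is not treated as a word.
    for token in TOKEN.findall(YEAR_TAG.sub(" ", text)):
        category = classify(token)
        if category is not None:
            found[category].add(token)
    return {category: sorted(tokens) for category, tokens in found.items()}
```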

Audio


If you are providing audio recordings of your domain text, please ensure that:

  • they are recorded in WAV format (not converted from another format)
  • they are recorded at 16 kHz
  • each audio file is 2 minutes or less
  • they are in the same folder as the corresponding text
  • naming conventions for both audio and associated text are consistent
    (e.g. story1_speaker1.txt -> story1_speaker1.wav; story1_speaker2.txt -> story1_speaker2.wav)
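
Most of the audio requirements above can be verified from the WAV header. The following Python sketch uses only the standard-library wave module. The function name and messages are hypothetical, and it assumes uncompressed PCM WAV files. It reports the sample rate, the duration, and whether a text file with the same base name sits in the same folder. It cannot tell whether a file was originally recorded as WAV or converted from another format.

```python
import wave
from pathlib import Path

EXPECTED_RATE_HZ = 16_000
MAX_SECONDS = 120.0


def check_wav_file(path) -> list[str]:
    """Return a list of problems found for one audio file (empty if none)."""
    wav_path = Path(path)
    if wav_path.suffix.lower() != ".wav":
        return ["not a .wav file"]
    try:
        with wave.open(str(wav_path), "rb") as wav:
            rate = wav.getframerate()
            seconds = wav.getnframes() / rate
    except (wave.Error, EOFError):
        return ["not a readable PCM WAV file"]

    problems = []
    if rate != EXPECTED_RATE_HZ:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE_HZ} Hz")
    if seconds > MAX_SECONDS:
        problems.append("longer than 2 minutes")
    # The matching text is expected next to the audio, with the same base name.
    if not wav_path.with_suffix(".txt").is_file():
        problems.append("no matching .txt file in the same folder")
    return problems
```

Running check_wav_file over every .wav file in a handover folder gives a quick report of which recordings need attention.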