
Links to tools and our code for the extraction and analysis process.

Extraction Process

Utterance Alignments

CMU Wilderness

CMU Wilderness releases pre-computed utterance alignments for the chapter-level audio and text on bible.is. The original audio and text cannot be directly redistributed, so Wilderness provides scripts that download the audio and text from the source and apply the estimated utterance-segmentation indices. However, in many cases the links on bible.is have changed since the original Wilderness paper; we are working to resolve these and update the scripts where possible.

[GitHub repository] [paper]
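As a rough illustration of what "applying the indices" involves, the sketch below slices chapter-level audio into utterances from a list of boundaries. The `(utterance_id, start_sec, end_sec)` layout and the function name `cut_utterances` are assumptions made for this example; they are not the actual Wilderness index format or script interface.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical boundary record: (utterance id, start time in s, end time in s).
Boundary = Tuple[str, float, float]


def cut_utterances(
    samples: Sequence[int], sample_rate: int, boundaries: Sequence[Boundary]
) -> Dict[str, List[int]]:
    """Slice chapter-level audio samples into per-utterance sample lists."""
    utterances: Dict[str, List[int]] = {}
    for utt_id, start, end in boundaries:
        if not 0 <= start < end:
            raise ValueError(f"bad boundary for {utt_id}: {start}-{end}")
        first = int(round(start * sample_rate))
        # Clamp the end so a slightly overlong index cannot run past the audio.
        last = min(int(round(end * sample_rate)), len(samples))
        utterances[utt_id] = list(samples[first:last])
    return utterances


# Three seconds of dummy "audio" at 16 kHz, split into two utterances.
chapter = list(range(16000 * 3))
index = [("utt_001", 0.0, 1.25), ("utt_002", 1.25, 3.0)]
utts = cut_utterances(chapter, 16000, index)
```

In practice the samples would come from the downloaded chapter audio rather than a dummy list, and the boundaries from the released alignment files.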

Grapheme-to-Phoneme (G2P) Resources


Epitran

Epitran is a precision G2P system that combines hand-written rules with postprocessing for inadequate orthographies. Epitran supports both IPA and X-SAMPA. For languages in our corpus where Epitran has high-quality support for the orthography present on bible.is, we use it as our highest-quality G2P resource.

We encourage contributions to Epitran, for example to extend its language or orthography coverage. If you are comfortable with GitHub, you may submit a pull request; otherwise, please contact David Mortensen directly at dmortens@cs.cmu.edu with mappings or rules, and he will incorporate them.

[GitHub repository] [paper]

Wiktionary Annotations

Wiktionary contains word-level pronunciation annotations in IPA at both the phonemic and phonetic level, with more annotations available at the phonemic level. We use the WikiPron package to retrieve pronunciations for languages with coverage on Wiktionary, and use the phonemic annotations, mapped to X-SAMPA, for our phoneme alignments. We additionally use WikiPron-mined data from the SIGMORPHON 2020 task to tune our G2P models.

[GitHub repository] [paper]
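The step of turning mined pronunciations into an X-SAMPA lexicon can be sketched as follows. This assumes tab-separated input with one word per line and space-separated IPA segments; the mapping table covers only a handful of symbols for illustration and is not the table used in our pipeline. The function name `tsv_to_xsampa_lexicon` is hypothetical.

```python
from typing import Dict, List

# Tiny illustrative subset of the IPA to X-SAMPA correspondence.
IPA_TO_XSAMPA = {
    "ʃ": "S",
    "ŋ": "N",
    "ə": "@",
    "θ": "T",
    "ð": "D",
    "ɪ": "I",
    "ɛ": "E",
}


def tsv_to_xsampa_lexicon(text: str) -> Dict[str, List[List[str]]]:
    """Parse 'word<TAB>seg seg seg' lines into word -> list of pronunciations."""
    lexicon: Dict[str, List[List[str]]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        word, pron = line.split("\t", 1)
        # Simplification: segments missing from the table pass through unchanged.
        phones = [IPA_TO_XSAMPA.get(seg, seg) for seg in pron.split()]
        lexicon.setdefault(word, []).append(phones)
    return lexicon


sample = "ship\tʃ ɪ p\nthing\tθ ɪ ŋ\n"
lex = tsv_to_xsampa_lexicon(sample)
```

A word may have several pronunciations, so each entry holds a list of phone sequences rather than a single one.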

Previous work has used additional tools for Wiktionary such as wikt2pron, which supports both IPA and X-SAMPA.


Phonetisaurus

Phonetisaurus contains implementations of multiple G2P models. We use its WFST implementation to train G2P models on Wiktionary pronunciation annotations; these models supply pronunciations for word tokens in our vocabulary that have no Wiktionary pronunciation.

[GitHub repository] [paper]
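The supplementing logic described above amounts to a lookup with a fallback, roughly as sketched here. The `g2p` argument stands in for a trained model and `toy_g2p` is a placeholder that simply spells the word out; neither reflects the Phonetisaurus API, and all names are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def build_lexicon(
    vocab: Iterable[str],
    known: Dict[str, List[str]],
    g2p: Callable[[str], List[str]],
) -> Tuple[Dict[str, List[str]], Dict[str, str]]:
    """Prefer an existing pronunciation; fall back to a G2P model otherwise."""
    lexicon: Dict[str, List[str]] = {}
    source: Dict[str, str] = {}
    for word in vocab:
        if word in known:
            lexicon[word] = known[word]
            source[word] = "wiktionary"
        else:
            lexicon[word] = g2p(word)
            source[word] = "g2p"
    return lexicon, source


def toy_g2p(word: str) -> List[str]:
    # Placeholder for a trained model: one "phone" per letter.
    return list(word)


known_prons = {"cat": ["k", "{", "t"]}
lexicon, source = build_lexicon(["cat", "dog"], known_prons, toy_g2p)
```

Recording the source of each entry makes it possible to separate dictionary-backed pronunciations from model predictions in later analysis.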


Unitran

Unitran maps each individual grapheme to one or more phonemes in X-SAMPA, regardless of language and context. We discuss the clear caveats of this approach in detail in our paper. Nonetheless, it provides an initial G2P for generating first-pass alignments, which we hope will facilitate analysis.

We used the Unitran version in Festvox following the Wilderness procedure; we plan to push Python scripts of this Unitran version to our GitHub repository for greater accessibility.

Note that Alan W Black’s Unitran has diverged slightly from the version of Unitran in NLTK/ScriptTranscriber; he added mappings for many graphemes common in the Wilderness dataset that had none, and edited those whose expansions were incorrect.
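A context-free grapheme mapping of the kind Unitran performs can be illustrated with a minimal sketch. The table below is a toy example invented for illustration, not an excerpt of the Unitran tables, and `unitran_like` and the `<unk>` placeholder are hypothetical names.

```python
from typing import Dict, List

# Toy table: each grapheme maps to one or more X-SAMPA phonemes.
TOY_TABLE: Dict[str, List[str]] = {
    "b": ["b"],
    "o": ["o"],
    "x": ["k", "s"],  # a single grapheme may expand to several phonemes
    "ш": ["S"],
}


def unitran_like(word: str, table: Dict[str, List[str]], unk: str = "<unk>") -> List[str]:
    """Map each grapheme independently, ignoring language and context."""
    phones: List[str] = []
    for grapheme in word.lower():
        # Graphemes without a mapping are flagged rather than silently dropped.
        phones.extend(table.get(grapheme, [unk]))
    return phones


result = unitran_like("Box", TOY_TABLE)
missing = unitran_like("bq", TOY_TABLE)
```

Because every grapheme is looked up on its own, digraphs and context-dependent pronunciations are not handled, which is one reason this approach suits first-pass alignment only.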


Phoneme Alignments


We train multilingual ASR models for phoneme alignment in Kaldi, using lexicons created with the G2P methods above. Training data and model specifics are described in the ASRU 2019 paper below. We link to an older version of the Wilderness recipe in ESPnet below; we will push our Kaldi recipe soon.

[GitHub repository] [paper]

Wilderness Festvox synthesis

The CMU Wilderness paper trains language-specific acoustic models for phoneme alignment via speech synthesis using Festvox; we recreate these models here following their recipes.

[GitHub repository] [paper]

Phonetic Measures

Our Code

Our Praat and R scripts to extract vowels and sibilants, given phoneme alignments, are available on GitHub.

[GitHub repository]


Phonetic Uniformity and Phonetic Dispersion

Our Code

Our R scripts to perform analysis on vowel and sibilant phonetic measures are available on GitHub.

[GitHub repository]