Transcript: imported or typed into the text window
Playback controls
Visual representation of audio material
Linked layers: segmentation (pink), speaker (grey), transcript (blue), timecode (white)
DTD declaration
Transcriber (operator) identification
Topic (segmentation)
Speaker identification
Transcript text
Timecode
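The layers listed above can be sketched as a small XML builder. This is only an illustration: the element and attribute names (`transcript`, `operator`, `topic`, `turn`, `speaker`, `start`) and the DTD file name `transcript.dtd` are hypothetical, since the slides do not show the actual DTD.

```python
import xml.etree.ElementTree as ET


def build_transcript(operator, topic, turns):
    """Build a transcript tree.

    turns is a list of (speaker, start_seconds, text) tuples.
    All element and attribute names here are hypothetical.
    """
    root = ET.Element("transcript", {"operator": operator})  # transcriber identification
    section = ET.SubElement(root, "topic", {"title": topic})  # topic (segmentation)
    for speaker, start, text in turns:
        turn = ET.SubElement(
            section,
            "turn",
            {"speaker": speaker, "start": f"{start:.3f}"},  # speaker and timecode
        )
        turn.text = text  # transcript text
    return root


def serialize(root, dtd="transcript.dtd"):
    """Prepend an XML declaration and a DTD declaration (hypothetical DTD name)."""
    body = ET.tostring(root, encoding="unicode")
    return f'<?xml version="1.0"?>\n<!DOCTYPE transcript SYSTEM "{dtd}">\n{body}'


tree = build_transcript(
    "operator-01",
    "Introduction",
    [("Speaker A", 0.0, "Good evening."), ("Speaker B", 2.5, "Thank you.")],
)
document = serialize(tree)
```

Keeping speaker and timecode as attributes of each turn mirrors the linked layers of the editing window, but the real schema may organize them differently.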
Parser converts the XML to a QuickTime text track or RealText
XML becomes archival copy of transcript
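A minimal sketch of the conversion step, assuming a hypothetical XML layout with `turn` elements carrying `speaker` and `start` attributes. The output is a simplified bracketed-timestamp layout standing in for the real targets; the exact QuickTime text descriptor and RealText syntax are not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical archival XML; the real DTD is not shown in the slides.
SAMPLE = """<transcript operator="operator-01">
  <topic title="Introduction">
    <turn speaker="Speaker A" start="0.0">Good evening.</turn>
    <turn speaker="Speaker B" start="2.5">Thank you.</turn>
  </topic>
</transcript>"""


def read_cues(xml_text):
    """Return (start_seconds, speaker, text) for every turn, in document order."""
    root = ET.fromstring(xml_text)
    return [
        (float(turn.get("start")), turn.get("speaker"), (turn.text or "").strip())
        for turn in root.iter("turn")
    ]


def clock(seconds):
    """Format seconds as HH:MM:SS.mmm."""
    ms = round(seconds * 1000)
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def to_timed_text(cues):
    """Emit one timestamped line per cue (simplified, not an exact player format)."""
    return "\n".join(f"[{clock(start)}] {speaker}: {text}" for start, speaker, text in cues)


timed_text = to_timed_text(read_cues(SAMPLE))
```

Because the XML stays the archival copy, a second output writer for another player format would only need to replace `to_timed_text`.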
SPEAKER ADAPTATION
General motivation
Improve robustness of LVCSR systems for unseen speakers
Specific motivation
NSF digital voice library (DVL) project
10,000 hours of recorded library/historical speech
Collaboration between RSPL-CSLU, MSU, LDC
RSPL-CSLU: robust audio stream search engine
Samples: T. Edison (1930s), B. Clinton (1990s)
EXPERIMENTAL SETUP
Base system
Sphinx-II/III: continuous Gaussian mixture density HMM-based LVCSR system
Features: static + dynamic MFCC + energy
Triphone HMM, bigram/trigram language model
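To make "static + dynamic" features concrete, the sketch below appends delta coefficients to each static frame using the common regression formula. This is an assumption for illustration: the slides do not say how the base system derives its dynamic features, and the function names and the window size are hypothetical.

```python
def delta(frames, window=2):
    """Regression-based delta features.

    frames: list of equal-length feature vectors (lists of floats).
    Frames beyond either end are replaced by the nearest edge frame.
    """
    n_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, window + 1))
    out = []
    for t in range(n_frames):
        vec = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, window + 1):
                ahead = frames[min(t + n, n_frames - 1)][k]
                behind = frames[max(t - n, 0)][k]
                acc += n * (ahead - behind)
            vec.append(acc / denom)
        out.append(vec)
    return out


def add_dynamic(frames, window=2):
    """Concatenate each static vector with its delta vector."""
    return [static + dyn for static, dyn in zip(frames, delta(frames, window))]


# A linear ramp has a slope of one unit per frame away from the edges.
ramp = [[float(t)] for t in range(6)]
ramp_with_delta = add_dynamic(ramp)
```

Applying `delta` again to the delta stream would give acceleration features, if the front end uses them.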
Databases
Two sets of models: (1) trained from WSJ SI-284; (2) trained from Broadcast News (BN) database
Test set: 1996 BN development test data set (segmented into 7 focus conditions, about 2 hours in total)
EXPERIMENT RESULTS
Test results for BN96devtest set
Preliminary result for DVL data
WSJ models: WER = 81.6%
BN models: WER = 67.2%
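For reference, word error rate (WER) is conventionally the number of substitutions, deletions, and insertions in the best word alignment, divided by the number of reference words. The sketch below computes it with a standard edit-distance recurrence; it is an illustration only, not the scoring tool used for the figures above, and the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(
                min(
                    prev[j] + 1,  # deletion of a reference word
                    cur[j - 1] + 1,  # insertion of a hypothesis word
                    prev[j - 1] + (ref_word != hyp_word),  # match or substitution
                )
            )
        prev = cur
    return prev[-1] / len(ref)


wer_deletion = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.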