Transcript: imported or typed into the text window |
|
Playback controls |
|
Visual representation of audio material |
|
Linked layers: segmentation (pink), speaker (grey), transcript (blue), timecode (white) |
|
DTD declaration |
|
Transcriber (operator) identification |
|
Topic (segmentation) |
|
Speaker identification |
|
Transcript text |
|
Timecode |
|
Parser will convert XML to QuickTime text track and RealText formats |
|
XML becomes archival copy of transcript |
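
The XML-to-caption conversion described above can be sketched roughly as follows. This is a minimal illustration, not the project's parser. The element and attribute names (`Transcript`, `Section`, `Turn`, `Sync`, `operator`, `topic`, `speaker`, `time`) are assumptions chosen to mirror the operator, topic, speaker, transcript, and timecode layers; the actual DTD is not shown in these slides. The output is a plain list of timestamped lines; real QuickTime text tracks and RealText files need format-specific headers and markup that are omitted here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element and attribute names are assumptions.
SAMPLE_XML = """<Transcript operator="jd">
  <Section topic="inauguration">
    <Turn speaker="Speaker A">
      <Sync time="0.0"/>Good evening.
      <Sync time="2.5"/>Thank you all for coming.
    </Turn>
    <Turn speaker="Speaker B">
      <Sync time="65.25"/>It is an honor to be here.
    </Turn>
  </Section>
</Transcript>"""


def format_timecode(seconds):
    """Render a time in seconds as HH:MM:SS.mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def extract_captions(xml_text):
    """Return (time_in_seconds, speaker, text) tuples in document order."""
    root = ET.fromstring(xml_text)
    captions = []
    for turn in root.iter("Turn"):
        speaker = turn.get("speaker", "")
        for sync in turn.iter("Sync"):
            # Text that follows an empty Sync element is stored in its tail.
            text = " ".join((sync.tail or "").split())
            if text:
                captions.append((float(sync.get("time")), speaker, text))
    return captions


def to_caption_lines(captions):
    """Format captions as simple timestamped lines."""
    return [f"[{format_timecode(t)}] {spk}: {txt}" for t, spk, txt in captions]


if __name__ == "__main__":
    for line in to_caption_lines(extract_captions(SAMPLE_XML)):
        print(line)
```

Keeping the XML as the archival copy and generating each delivery format from it means a new caption format only needs a new formatter function, not a new transcript.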
|
General motivation |
|
Improve robustness of LVCSR systems for unseen speakers |
|
Specific motivation |
|
NSF digital voice library (DVL) project |
|
10,000 hours of recorded library/historical speech |
|
Collaboration among RSPL-CSLU, MSU, and LDC |
|
RSPL-CSLU: robust audio stream search engine |
|
Samples |
|
T. Edison (1930’s) |
|
B. Clinton (1990’s) |
|
Base system |
|
Sphinx-II/III: continuous Gaussian mixture density HMM-based LVCSR system |
|
Feature: static + dynamic MFCC + energy |
|
Triphone HMM, bigram/trigram language model |
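
As a rough illustration of the "static + dynamic" feature item above, dynamic (delta) coefficients are commonly derived from the static frames with a regression over neighbouring frames. The slides do not say how the dynamic features were computed in this system, so the formula, the window size of 2, and the function names below are assumptions for illustration only.

```python
def delta(frames, window=2):
    """Compute regression-based delta coefficients.

    frames: list of equal-length lists of floats (one list per frame).
    Edge frames are handled by repeating the first and last frame.
    """
    num_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, window + 1))
    deltas = []
    for t in range(num_frames):
        row = []
        for d in range(len(frames[t])):
            acc = 0.0
            for n in range(1, window + 1):
                ahead = frames[min(t + n, num_frames - 1)][d]
                behind = frames[max(t - n, 0)][d]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        deltas.append(row)
    return deltas


def add_dynamic_features(static):
    """Append delta coefficients to each static feature vector."""
    return [s + d for s, d in zip(static, delta(static))]


if __name__ == "__main__":
    # Toy one-dimensional "cepstral" track that rises by 1.0 per frame.
    ramp = [[0.0], [1.0], [2.0], [3.0], [4.0]]
    print(add_dynamic_features(ramp))
```

In the interior of a linear ramp the delta equals the slope, which is a quick sanity check for an implementation like this.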
|
Databases |
|
Two sets of models: (1) trained on WSJ SI-284; (2) trained on the Broadcast News (BN) database |
|
Test set: 1996 BN development test set (segmented into 7 focus conditions, about 2 hours in total) |
|
Test results for BN96devtest set |
|
Preliminary result for DVL data |
|
WSJ models: WER = 81.6% |
|
BN models: WER = 67.2% |
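
For readers unfamiliar with the metric, word error rate (WER) is conventionally the number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The sketch below computes it with a standard edit-distance recurrence. It is a minimal illustration only: the function name is hypothetical, the example sentences are made up, and the scoring tool actually used for the figures above is not stated in the slides.

```python
def word_error_rate(reference, hypothesis):
    """Return WER = (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    print(f"WER = {100 * word_error_rate(ref, hyp):.1f}%")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.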