Spectral Modeling Based on Gaussian Conditional Random Field for Statistical Parametric Speech Synthesis

Soheil Khorram, Hossein Sameti, Fahimeh Bahmaninezhad


This paper proposes a spectral modeling approach based on Gaussian conditional random field (GCRF) theory and incorporates it into a statistical parametric speech synthesis (SPSS) framework. Conventionally, SPSS systems rely on hidden Markov model (HMM)-based spectral modeling, which suffers from a problem known as the state independence assumption: within a state, HMM models the distributions of adjacent frames independently, whereas in reality they are highly dependent and correlated. The proposed model assumes that spectral trajectories form a left-to-right linear-chain conditional random field (CRF) with Gaussian potential functions. Therefore, instead of the inaccurate independence assumption, a Markov assumption is established for adjacent frames in a latent state. To train the proposed GCRF model, a Viterbi algorithm is applied together with a maximum likelihood (ML)-based parameter estimation procedure. The estimation leads to an optimization problem that is solved numerically with the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm. For the synthesis phase, an efficient parameter generation algorithm that optimizes the output probability is derived; it can exploit dynamic features as well as static features. Two sets of experiments are reported to assess the effectiveness of the proposed GCRF. In the first set, GCRF with heuristic context clusters and ML-based parameter estimation is compared against the predominant HMM-based method. Objective and subjective tests confirm that the proposed system with heuristic contextual clusters outperforms the standard HMM on small training databases (50, 100, and 200 sentences), whereas HMM performs better on large datasets. This is mainly due to the inability of the proposed system to adjust the number of model parameters to the size of the training database. In the second set of experiments, the performance of GCRF with decision tree-based clusters is investigated. This latter model can adapt its complexity to the size of the training database. All evaluation results in this experiment show a significant improvement of the proposed system over the conventional HMM.
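
To make the idea of a linear-chain random field with Gaussian potentials more concrete, the following is a minimal sketch and not the parameter generation algorithm of this paper. It assumes a one-dimensional trajectory, quadratic node potentials that pull each frame toward a target mean, and quadratic edge potentials that penalize differences between adjacent frames. Under these assumptions the most probable trajectory is the solution of a tridiagonal linear system. The function name map_trajectory and its parameters means, node_prec, and edge_prec are hypothetical and chosen only for illustration.

```python
def map_trajectory(means, node_prec, edge_prec):
    """Most probable trajectory of a toy 1-D linear-chain Gaussian field.

    Assumed energy (illustrative, not the paper's exact potentials):
        0.5 * sum_t node_prec[t] * (c[t] - means[t])**2
      + 0.5 * sum_t edge_prec[t] * (c[t+1] - c[t])**2
    The minimizer satisfies a tridiagonal system A c = r.
    """
    n = len(means)
    assert len(node_prec) == n and len(edge_prec) == n - 1

    # Main diagonal: node precision plus precisions of the touching edges.
    diag = [float(a) for a in node_prec]
    for t in range(n - 1):
        diag[t] += edge_prec[t]
        diag[t + 1] += edge_prec[t]
    # Off-diagonal entries couple frame t with frame t + 1.
    off = [-float(b) for b in edge_prec]
    rhs = [node_prec[t] * means[t] for t in range(n)]

    # Thomas algorithm, forward elimination.
    for t in range(1, n):
        w = off[t - 1] / diag[t - 1]
        diag[t] -= w * off[t - 1]
        rhs[t] -= w * rhs[t - 1]

    # Thomas algorithm, back substitution.
    c = [0.0] * n
    c[-1] = rhs[-1] / diag[-1]
    for t in range(n - 2, -1, -1):
        c[t] = (rhs[t] - off[t] * c[t + 1]) / diag[t]
    return c
```

For example, map_trajectory([0.0, 1.0], [1.0, 1.0], [1.0]) pulls the two frames toward each other and returns approximately [1/3, 2/3]. Setting every edge precision to zero removes the coupling between frames, so the result equals the means; this corresponds to the frame-independent case that the abstract contrasts with.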


Gaussian Conditional Random Field, GCRF, Hidden Markov Model, HMM, HMM-Based Speech Synthesis, Spectral Modeling, State Independence Assumption, Statistical Parametric Speech Synthesis