Farsi Speech Synthesis using Hidden Markov Model and Decision Trees

Mohammad Mehdi Homayounpour, Seyyed Mostafa Mousavi

Abstract

The CSI Journal on Computer Science and Engineering, Vol. 2, No. 1&3 (a), Spring & Fall 2004

The hidden Markov model (HMM), a suitable model for time-sequence modeling, is used in this project to estimate speech synthesis parameters. In our approach, HMMs generate cepstral coefficients and the pitch parameter, which are then fed to a speech synthesis filter named MLSA. To generate the synthesis parameters from the HMMs, an algorithm is used that exploits the context-dependent information of speech units provided by the cepstral coefficients and their first and second derivatives. A phone with known left and right context, called a triphone, is used as the speech unit. For speech unit modeling, the observations of each triphone in the database are aligned with its HMM using the Viterbi algorithm, which yields a sequence of HMM states. The average number of times each state of a triphone is occupied constitutes the duration model for that triphone. During synthesis, to obtain the parameters needed for a triphone, the HMM parameters of each state, such as its mean and variance vectors, are repeated according to the duration model. From these means and variances, the cepstral coefficients and pitch frequency are calculated and then converted to speech by the MLSA filter. To account for the effects of various parameters on the pronunciation of triphones, CART decision trees are also used; these trees generate the pitch and duration of phonemes. As an alternative for automatic generation of the pitch contour, we used the method proposed by Fujisaki, in which the pitch contour has a global component and some local components that model accents. To evaluate the performance of our speech synthesis system, MOS and DRT tests were conducted.
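
The duration model described above can be illustrated with a minimal sketch. It assumes the Viterbi alignments are already available as lists of state indices, and it only repeats state mean vectors; the parameter generation algorithm that also uses variances and derivative features is not reproduced here. The function names, the toy alignments, and the two-dimensional mean vectors are hypothetical and chosen only for illustration.

```python
from collections import Counter


def state_durations(alignments, n_states):
    """Average occupancy of each state over all aligned tokens of one triphone.

    alignments: list of state sequences, one per observed token, each a list
    of state indices such as a Viterbi alignment would produce.
    """
    totals = [0] * n_states
    for seq in alignments:
        counts = Counter(seq)
        for s in range(n_states):
            totals[s] += counts.get(s, 0)
    return [t / len(alignments) for t in totals]


def expand_means(state_means, durations):
    """Repeat each state's mean vector for its rounded duration (at least once)."""
    frames = []
    for mean, d in zip(state_means, durations):
        repeats = max(1, round(d))
        frames.extend(list(mean) for _ in range(repeats))
    return frames


# Hypothetical alignments of three tokens of one triphone with a 3-state HMM.
alignments = [
    [0, 0, 1, 1, 1, 2],
    [0, 1, 1, 1, 1, 2, 2],
    [0, 0, 0, 1, 1, 2],
]
durations = state_durations(alignments, n_states=3)

# Toy 2-dimensional mean vectors standing in for cepstral means (assumption).
state_means = [[1.0, 0.1], [2.0, 0.2], [3.0, 0.3]]
trajectory = expand_means(state_means, durations)
```

Repeating the means directly gives a piecewise-constant trajectory; smoothing it with the derivative statistics is the role of the parameter generation step mentioned in the abstract.
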
The MOS results were 3.8 for intelligibility, 3.9 for naturalness, and 3.5 for pleasantness when no decision tree was used for duration and pitch modeling. In a second MOS test, pitch and duration were modeled using decision trees; the scores for intelligibility, naturalness, and pleasantness were 4.2, 4.4, and 4.1 for sentences present in the training database, and 4.3, 4.2, and 3.4, respectively, for sentences outside the training database. The pitch contour was also modeled using the Fujisaki method. For this kind of pitch modeling, the MOS results were 4.6, 4.3, and 4.5 for sentences present in the training database, and 4.5, 4.0, and 4.4, respectively, for sentences outside the training database. The DRT result was 88% for word pairs synthesized using decision trees for both duration and pitch modeling. These results show the suitability of the method used in this project.
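
To make the Fujisaki-style pitch contour concrete, the following sketch evaluates the commonly cited form of the model, in which the logarithm of F0 is a baseline plus phrase (global) components and accent (local) components. It is an illustration under stated assumptions, not the configuration used in this work: the constants `alpha`, `beta`, and `gamma`, the baseline frequency, and the command timings and amplitudes are all hypothetical values.

```python
import math


def phrase_response(t, alpha):
    """Impulse response of the phrase (global) component; zero before onset."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0


def accent_response(t, beta, gamma):
    """Step response of the accent (local) component, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_f0(t, fb, phrases, accents, alpha=3.0, beta=20.0, gamma=0.9):
    """F0 in Hz at time t (seconds).

    fb: baseline frequency in Hz.
    phrases: list of (amplitude, onset_time) phrase commands.
    accents: list of (amplitude, start_time, end_time) accent commands.
    alpha, beta, gamma: illustrative constants (assumptions).
    """
    log_f0 = math.log(fb)
    for amp, t0 in phrases:
        log_f0 += amp * phrase_response(t - t0, alpha)
    for amp, t1, t2 in accents:
        log_f0 += amp * (accent_response(t - t1, beta, gamma)
                         - accent_response(t - t2, beta, gamma))
    return math.exp(log_f0)


# Hypothetical utterance: one phrase command at 0 s, one accent from 0.3 s to 0.6 s.
phrases = [(0.5, 0.0)]
accents = [(0.4, 0.3, 0.6)]
contour = [fujisaki_f0(0.01 * i, 100.0, phrases, accents) for i in range(150)]
```

Before any command begins, the contour sits at the baseline frequency; the phrase command raises it gradually over the utterance, and each accent command adds a local rise between its start and end times.
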

Keywords

text-to-speech, speech synthesis, synthesis filter, hidden Markov model, decision tree, cepstral coefficients
