<h1 id="hmm-acoustic-modeling">HMM acoustic modeling</h1>
<script type="text/javascript" async="" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
<p>Let’s now dive into acoustic modeling, with the historical approach: Hidden Markov Models and Gaussian Mixture Models (HMM-GMM).</p>
<h1 id="introduction-to-hmm-gmm-acoustic-modeling">Introduction to HMM-GMM acoustic modeling</h1>
<p>Rather than diving into the maths of HMMs and GMMs in this article, I would like to invite you to read the slides I made on Expectation Maximization for HMM-GMMs: they start from very basic concepts and build up to the full training procedure. Before going through the slides, let us recall what we are trying to do here.</p>
<p>We want to build the acoustic model, meaning that the HMM-GMM will model <script type="math/tex">P(X \mid W)</script> in the diagram below.</p>
<p><img src="https://maelfabien.github.io/assets/images/asr_21.png" alt="image" /></p>
<p>In the ASR course of the University of Edinburgh, this diagram illustrates where the HMM-GMM architecture fits:</p>
<p><img src="https://maelfabien.github.io/assets/images/asr_22.png" alt="image" /></p>
<p>From the utterance <script type="math/tex">W</script>, we break the sequence down into words, then into subword units (phonemes or phones), and we represent each subword unit as an HMM. Therefore, for each subword unit, we have an HMM with several hidden states, and at each state a GMM generates features; these features represent the acoustic features <script type="math/tex">X</script>.</p>
<p>Alright, let’s jump to the slides:</p>
<iframe width="700" height="500" src="https://www.youtube.com/embed/hxr-UijYbpk" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<p>If you follow the ASR course of the University of Edinburgh, the slides above correspond to:</p>
<ul>
<li><a href="http://www.inf.ed.ac.uk/teaching/courses/asr/2019-20/asr02-hmmgmm.pdf">ASR 02</a></li>
<li><a href="http://www.inf.ed.ac.uk/teaching/courses/asr/2019-20/asr03-hmm-algorithms.pdf">ASR 03</a></li>
</ul>
<h1 id="context-dependent-phone-models">Context-dependent phone models</h1>
<h2 id="overview">Overview</h2>
<p>As seen in the slides, there are several ways to model words with HMM models. We can either consider that a single phone is represented by several hidden states of a HMM:</p>
<p><img src="https://maelfabien.github.io/assets/images/asr_23.png" alt="image" /></p>
<p>Or that each phone is modeled by a single state of a HMM:</p>
<p><img src="https://maelfabien.github.io/assets/images/asr_24.png" alt="image" /></p>
<p>The <strong>acoustic phonetic context</strong> of a speech unit describes how articulation (acoustic realization) changes depending on the surrounding units. For example, “/n/” is not pronounced the same in “ten” (alveolar) and “tenth” (dental).</p>
<p>This context dependence violates the Markov assumption that the acoustic realization is independent of the previous states. But how can we model context?</p>
<ul>
<li>using pronunciations, hence leading to a pronunciation model</li>
<li>using subwords units with context:
<ul>
<li>use longer units that incorporate context, e.g. biphones or triphones, demisyllables or syllables</li>
<li>use multiple context-dependent models for each phone</li>
</ul>
</li>
</ul>
<p>For example, left biphones modeling would look like this:</p>
<p><img src="https://maelfabien.github.io/assets/images/asr_25.png" alt="image" /></p>
<p>And triphones can be represented this way:</p>
<p><img src="https://maelfabien.github.io/assets/images/asr_26.png" alt="image" /></p>
<p>Context-dependent models:</p>
<ul>
<li>are more specific</li>
<li>enlarge the state space, since we can define multiple context-dependent models per phone</li>
<li>compensate for the incorrectness of the Markov assumption</li>
<li>make each model responsible for a smaller region of the acoustic-phonetic space</li>
</ul>
<h2 id="triphones-models">Triphone models</h2>
<p>There are 2 main types of triphones:</p>
<ul>
<li>word-internal triphones: we only take triphones within a word</li>
<li>cross-word triphones: triphones can model the links between words too</li>
</ul>
<p>If we have a system with 40 phones, then the total number of triphones that can occur is: <script type="math/tex">40^3 = 64000</script>. In a cross-word system, typically 50’000 of them occur.</p>
<p>With 50’000 triphones, each modeled by a 3-state HMM with 10 Gaussian components per state, we get 1.5 million Gaussians. If features are 39-dimensional (12 MFCCs + energy, with deltas and accelerations), then each state’s GMM has 790 parameters (10 × (39 means + 39 variances + 1 weight)), leading to about 118 million parameters! We need a huge amount of training data to ensure that all combinations are covered. Otherwise, we can explore alternatives.</p>
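<p>The counting above can be verified in a few lines of Python (a sketch assuming diagonal covariances, following the figures quoted in this section):</p>

```python
# Sanity-checking the parameter counts quoted above (diagonal covariances).
n_phones = 40
n_possible_triphones = n_phones ** 3            # any left/right context
n_seen = 50_000                                 # triphones occurring in practice
states_per_hmm = 3
components_per_state = 10
feat_dim = 39                                   # 12 MFCC + energy, deltas, accelerations

n_gaussians = n_seen * states_per_hmm * components_per_state
params_per_state = components_per_state * (2 * feat_dim + 1)  # means + variances + weight
total_params = n_seen * states_per_hmm * params_per_state

print(n_possible_triphones)  # 64000
print(n_gaussians)           # 1500000
print(params_per_state)      # 790
print(total_params)          # 118500000
```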
<h2 id="modeling-infrequent-triphones">Modeling infrequent triphones</h2>
<p>There are several ways to handle infrequent triphones rather than relying on huge amounts of training data:</p>
<h1 id="conclusion">Conclusion</h1>
<p>If you want to improve this article or have a question, feel free to leave a comment below :)</p>
<p>References:</p>
<ul>
<li><a href="http://www.inf.ed.ac.uk/teaching/courses/asr/2019-20/asr02-hmmgmm.pdf">ASR 02, University of Edinburgh</a></li>
<li><a href="http://www.inf.ed.ac.uk/teaching/courses/asr/2019-20/asr03-hmm-algorithms.pdf">ASR 03, University of Edinburgh</a></li>
<li><a href="http://www.inf.ed.ac.uk/teaching/courses/asr/2019-20/asr06-cdhmm.pdf">ASR 06, University of Edinburgh</a></li>
</ul>

<h1 id="submitting-a-first-paper-to-arxiv">Submitting a first paper to ArXiv</h1>
<script type="text/javascript" async="" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
<h1 id="why-arxiv">Why ArXiv?</h1>
<p>I recently submitted my first conference paper. The paper is called “Speaker Identification Enhancement using Network Knowledge in Criminal Investigations”, and you can find it <a href="https://arxiv.org/abs/2006.02093">here</a> if you want.</p>
<p>I have not heard back from the conference yet. However, I decided to publish the paper on ArXiv for several reasons:</p>
<ul>
<li>It touches a wider audience</li>
<li>It contributes to making research open to all, and not only to those who pay subscriptions to some journals</li>
<li>I read most of the papers on ArXiv, so it’s cool to have mine there too :)</li>
</ul>
<p>Pay attention to the guidelines of the conference you submitted to. Some don’t require anything, some want you to mention “Submitted to MYCONFERENCE” as a comment, and some forbid publishing the paper anywhere else while it is still in review.</p>
<h1 id="from-overleaf-to-arxiv">From OverLeaf to ArXiv</h1>
<p>The process of submitting your first ArXiv paper might take longer than you expect. I thought that uploading the PDF downloaded from Overleaf would be enough, but no: ArXiv actually wants the source of your paper, including all of your figures, the .bib reference files and the .tex file…</p>
<p>Fortunately, if, like me, you use Overleaf, there is a magic “Submit” button that will help:</p>
<p><img src="https://maelfabien.github.io/assets/images/arx_0.png" alt="image" /></p>
<p>First things first, you need to clean the files in your project:</p>
<ul>
<li>remove <strong>any</strong> file that is not used</li>
<li>if you include a PDF file at some point, add to your TeX document: <code class="language-plaintext highlighter-rouge">\pdfoutput=1</code></li>
<li>correct any errors in the logs and output files (warnings, however, are fine)</li>
</ul>
<p>Now, once you click on the Submit button, you can select “ArXiv.org”, and then download the ZIP with submission files. You will end up with a ZIP file that you can now <a href="https://arxiv.org/help/submit">upload to ArXiv</a>.</p>
<p>Follow the procedure, and your paper will finally appear as submitted. A few hours/days later, it should be on ArXiv :)</p>
<p><img src="https://maelfabien.github.io/assets/images/arx_1.png" alt="image" /></p>

<h1 id="autohome-a-tool-to-find-your-dream-house">AutoHome, a tool to find your dream house</h1>
<p>My girlfriend and I were recently looking for a house to buy. Rather than spending time on each of the real-estate websites individually, I decided to build a web application that scrapes 5 of the most common real-estate agencies in the specific region of France we were looking at:</p>
<p><img src="https://maelfabien.github.io/assets/images/autohome.png" alt="image" /></p>
<p>It basically:</p>
<ul>
<li>scrapes 5 real-estate agencies in the North-West part of France</li>
<li>gathers the results in a single dataframe</li>
<li>displays and sorts the results on a Streamlit web application</li>
</ul>
<p>It mainly relies on:</p>
<ul>
<li>BeautifulSoup</li>
<li>Streamlit</li>
</ul>
<h2 id="features">Features</h2>
<p>The application:</p>
<ul>
<li>shows you details and pictures on houses from OuestFrance Immo and other real-estate agencies</li>
<li>has a filter on sea view</li>
<li>allows you to select a minimum and maximum budget</li>
<li>allows you to sort by date, price, self-determined score…</li>
<li>allows you to specify the amount of money you need to borrow, the interest rate, and computes your monthly payments</li>
<li>re-directs you to the source link with a simple click</li>
</ul>
<p>Cool things:</p>
<ul>
<li>you can click on the “Actualiser” button, and it will re-scrape the whole set of websites (about 1 minute)</li>
<li>otherwise, results are stored in a dataframe, which makes the navigation way faster</li>
</ul>
<h2 id="github">GitHub</h2>
<p>The Github repository can also be found here:</p>
<div class="github-card" data-github="maelfabien/AutoHome" data-width="100%" data-height="" data-theme="default"></div>
<script src="//cdn.jsdelivr.net/github-cards/latest/widget.js"></script>
<p>To run it, simply use:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip install -r requirements.txt
</code></pre></div></div>
<p>And launch the app via:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>streamlit run app.py
</code></pre></div></div>

<h1 id="introduction-to-automatic-speech-recognition-asr">Introduction to Automatic Speech Recognition (ASR)</h1>
<script type="text/javascript" async="" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
<p>This article provides a summary of the course <a href="http://people.irisa.fr/Gwenole.Lecorve/lectures/ASR.pdf">“Automatic speech recognition” by Gwénolé Lecorvé from the Research in Computer Science (SIF) master</a>, to which I added notes of the Statistical Sequence Processing course of EPFL, and from some tutorials/personal notes. All references are presented at the end.</p>
<h1 id="introduction-to-asr">Introduction to ASR</h1>
<h2 id="what-is-asr">What is ASR?</h2>
<blockquote>
<p>Automatic Speech Recognition (ASR), or Speech-to-text (STT) is a field of study that aims to transform raw audio into a sequence of corresponding words.</p>
</blockquote>
<p>Some of the speech-related tasks involve:</p>
<ul>
<li>speaker diarization: which speaker spoke when?</li>
<li>speaker recognition: who spoke?</li>
<li>spoken language understanding: what’s the meaning?</li>
<li>sentiment analysis: how does the speaker feel?</li>
</ul>
<p>The classical pipeline in an ASR-powered application involves the Speech-to-text, Natural Language Processing and Text-to-speech.</p>
<p><img src="https://maelfabien.github.io/assets/images/asr_0.png" alt="image" /></p>
<p>ASR is not easy since there are lots of variabilities:</p>
<ul>
<li>acoustics:
<ul>
<li>variability between speakers (inter-speaker)</li>
<li>variability for the same speaker (intra-speaker)</li>
<li>noise, reverberation in the room, environment…</li>
</ul>
</li>
<li>phonetics:
<ul>
<li>articulation</li>
<li>elisions (sounds or words that are dropped or merged in fluent speech)</li>
<li>words with similar pronunciations</li>
</ul>
</li>
<li>linguistics:
<ul>
<li>size of vocabulary</li>
<li>word variations</li>
<li>…</li>
</ul>
</li>
</ul>
<p>From a Machine Learning perspective, ASR is also really hard:</p>
<ul>
<li>very high dimensional output space, and a complex sequence to sequence problem</li>
<li>few annotated training data</li>
<li>data is noisy</li>
</ul>
<h2 id="how-is-speech-produced">How is speech produced?</h2>
<p>Let us first focus on how speech is produced. An excitation <script type="math/tex">e</script> is produced by the lungs. It takes the form of an initial waveform, described as an airflow over time.</p>
<p>Then, vibrations are produced by the vocal cords, and filters <script type="math/tex">f</script> are applied by the pharynx, the tongue…</p>
<p><img src="https://maelfabien.github.io/assets/images/asr_1.png" alt="image" /></p>
<p>The output signal produced can be written as <script type="math/tex">s = f * e</script>, a convolution between the excitation and the filter. Hence, assuming <script type="math/tex">f</script> is linear and time-invariant:</p>
<script type="math/tex; mode=display">s(t) = \int_{-\infty}^{+\infty} e(\tau) f(t-\tau)d \tau</script>
<p>From the initial waveform, we generate the glottal spectrum, right out of the vocal cords. A bit higher in the vocal tract, at the level of the pharynx, resonances shape the formants of the vocal tract. Finally, the <strong>output spectrum</strong> gives us the intensity over the range of frequencies produced.</p>
<p><img src="https://maelfabien.github.io/assets/images/asr_2.png" alt="image" /></p>
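<p>The source-filter convolution can be illustrated numerically with entirely synthetic signals (a sketch, not real speech; the sampling rate and filter shape below are made up):</p>

```python
import numpy as np

# Entirely synthetic source-filter sketch: the output s is the convolution
# of an excitation e (crude 100 Hz pulse train) with a short decaying
# impulse response f standing in for the vocal tract filter.
fs = 1000                                   # hypothetical 1 kHz sampling rate
t = np.arange(0, 1, 1 / fs)                 # 1 second of "speech"
e = np.sign(np.sin(2 * np.pi * 100 * t))    # excitation from the vocal cords
f = np.exp(-np.arange(50) / 10.0)           # toy vocal tract impulse response
s = np.convolve(e, f)                       # s = f * e

print(s.shape)  # (1049,): len(e) + len(f) - 1 samples in 'full' mode
```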
<h2 id="breaking-down-words">Breaking down words</h2>
<p>In automatic speech recognition, you do not train an Artificial Neural Network to make predictions on a set of 50’000 classes, each of them representing a word.</p>
<p>In fact, you take an input sequence, and produce an output sequence. And each word is represented as a <strong>phoneme</strong>, a set of elementary sounds in a language based on the International Phonetic Alphabet (IPA). To learn more about linguistics and phonetic, feel free to check <a href="https://scholar.harvard.edu/files/adam/files/phonetics.ppt.pdf">this course</a> from Harvard. There are around 40 to 50 different phonemes in English.</p>
<p><strong>Phones</strong> are speech sounds defined by their acoustics, and are potentially unlimited in number.</p>
<p>For example, the word “French” is written in the IPA as: / f ɹ ɛ n t ʃ /. The phoneme describes voicing (voiced/unvoiced) as well as the position of the articulators.</p>
<p>Phonemes are language-dependent, since the sounds produced in languages are not the same. We define a <strong>minimal pair</strong> as two words that differ by only one phoneme. For example, “kill” and “kiss”.</p>
<p>For the sake of completeness, here are the consonant and vowel phonemes in standard french:</p>
<p><img src="https://maelfabien.github.io/assets/images/asr_3.png" alt="image" /></p>
<p><img src="https://maelfabien.github.io/assets/images/asr_4.png" alt="image" /></p>
<p>There are several ways to see a word:</p>
<ul>
<li>as a sequence of phonemes</li>
<li>as a sequence of graphemes (mostly a written symbol representing phonemes)</li>
<li>as a sequence of morphemes (meaningful morphological unit of a language that cannot be further divided) (e.g “re” + “cogni” + “tion”)</li>
<li>as a part-of-speech (POS) in morpho-syntax: grammatical class, e.g noun, verb, … and flexional information, e.g singular, plural, gender…</li>
<li>as a syntax describing the function of the word (subject, object…)</li>
<li>as a meaning</li>
</ul>
<p>Therefore, labeling speech can be done at several levels:</p>
<ul>
<li>word</li>
<li>phones</li>
<li>…</li>
</ul>
<p>And the labels may be <strong>time-aligned</strong> if we know when they occur in the speech signal.</p>
<p>The <strong>vocabulary</strong> is defined as the set of words in a specific task, a language or several languages based on the ASR system we want to build. If we have a large vocabulary, we talk about <strong>Large vocabulary continuous speech recognition (LVCSR)</strong>. If some words we encounter in production have never been seen in training, we talk about <strong>Out Of Vocabulary</strong> words (OOV).</p>
<p>We distinguish 2 types of speech recognition tasks:</p>
<ul>
<li>isolated word recognition</li>
<li>continuous speech recognition, which we will focus on</li>
</ul>
<h2 id="evaluation-metrics">Evaluation metrics</h2>
<p>We usually evaluate the performance of an ASR system using Word Error Rate (WER). We take as a reference a manual transcript. We then compute the number of mistakes made by the ASR system. Mistakes might include:</p>
<ul>
<li>Substitutions, <script type="math/tex">N_{SUB}</script>, a word gets replaced</li>
<li>Insertions, <script type="math/tex">N_{INS}</script>, a word which was not pronounced is added</li>
<li>Deletions, <script type="math/tex">N_{DEL}</script>, a word is omitted from the transcript</li>
</ul>
<p>The WER is computed as:</p>
<script type="math/tex; mode=display">WER = \frac{N_{SUB} + N_{INS} + N_{DEL}}{\mid N_{words-transcript} \mid}</script>
<p>The WER should be as close to 0 as possible. The numbers of substitutions, insertions and deletions are computed using the Wagner-Fischer dynamic programming algorithm for word alignment.</p>
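<p>The Wagner-Fischer computation can be sketched in a few lines of Python (an illustrative implementation, not the one used by standard scoring tools):</p>

```python
def wer(reference, hypothesis):
    """Word Error Rate via the Wagner-Fischer edit-distance algorithm."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum number of edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                           # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                           # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + sub)   # substitution / match
    return d[-1][-1] / len(ref)

# 1 substitution ("sat" -> "sit") + 1 deletion ("the") over 6 reference words:
print(wer("the cat sat on the mat", "the cat sit on mat"))
```

<p>Note that, because insertions are counted too, the WER can exceed 1 when the hypothesis is much longer than the reference.</p>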
<h1 id="statistical-historical-approach-to-asr">Statistical historical approach to ASR</h1>
<p>Let us denote by <script type="math/tex">W^{\star}</script> the optimal word sequence over the vocabulary, and let the input sequence of acoustic features be <script type="math/tex">X</script>. Statistically, our aim is to identify the optimal sequence such that:</p>
<script type="math/tex; mode=display">W^{\star} = argmax_W P(W \mid X)</script>
<p>This is known as the “Fundamental Equation of Statistical Speech Processing”. Using Bayes’ rule, we can rewrite it as:</p>
<script type="math/tex; mode=display">W^{\star} = argmax_W \frac{P(X \mid W) P(W)}{P(X)}</script>
<p>Finally, since <script type="math/tex">P(X)</script> does not depend on <script type="math/tex">W</script>, we can drop it from the maximization. Hence, we can re-formulate our problem as:</p>
<script type="math/tex; mode=display">W^{\star} = argmax_W P(X \mid W) P(W)</script>
<p>Where:</p>
<ul>
<li><script type="math/tex">argmax_W</script> is the search space, a function of the vocabulary</li>
<li><script type="math/tex">P(X \mid W)</script> is called the acoustic model</li>
<li><script type="math/tex">P(W)</script> is called the language model</li>
</ul>
<p>The steps are presented in the following diagram:</p>
<p><img src="https://maelfabien.github.io/assets/images/asr_5.png" alt="image" /></p>
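<p>As a purely illustrative sketch of this decomposition (all scores below are made up), decoding picks the candidate transcription maximizing the sum of the acoustic and language model log-scores:</p>

```python
import math

# Toy decoding example with entirely made-up scores: log P(X|W) from a
# hypothetical acoustic model and log P(W) from a hypothetical language model.
candidates = {
    "recognize speech":   {"log_acoustic": -120.0, "log_lm": math.log(1e-4)},
    "wreck a nice beach": {"log_acoustic": -118.0, "log_lm": math.log(1e-7)},
}

# W* = argmax_W  [ log P(X|W) + log P(W) ]
best = max(candidates,
           key=lambda w: candidates[w]["log_acoustic"] + candidates[w]["log_lm"])
print(best)  # "recognize speech": the language model outweighs the small acoustic gap
```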
<h2 id="feature-extraction-x">Feature extraction <script type="math/tex">X</script></h2>
<p>From the speech analysis, we should extract features <script type="math/tex">X</script> which are:</p>
<ul>
<li>robust across speakers</li>
<li>robust against noise and channel effects</li>
<li>low dimension, at equal accuracy</li>
<li>non-redundant among features</li>
</ul>
<p>Features we typically extract include:</p>
<ul>
<li>Mel-Frequency Cepstral Coefficients (MFCC), as described <a href="https://maelfabien.github.io/machinelearning/Speech9/#6-mel-frequency-cepstral-coefficients-mfcc">here</a></li>
<li>Perceptual Linear Prediction (PLP)</li>
<li>…</li>
</ul>
<p>We should then normalize the features extracted to avoid mismatches across samples with mean and variance normalization.</p>
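<p>A minimal sketch of per-utterance mean and variance normalization, assuming the features are stored as a NumPy array of shape (frames, dims) — the feature values below are random stand-ins, not real MFCCs:</p>

```python
import numpy as np

# Per-utterance mean and variance normalization of a (frames x dims)
# feature matrix; the values here are random stand-ins for MFCCs.
def mvn(features, eps=1e-8):
    return (features - features.mean(axis=0)) / (features.std(axis=0) + eps)

rng = np.random.default_rng(0)
feats = rng.normal(loc=5.0, scale=3.0, size=(200, 39))  # fake 39-dim features
norm = mvn(feats)
print(norm.shape)  # (200, 39): each dimension now has ~0 mean and ~unit variance
```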
<h2 id="acoustic-model">Acoustic model</h2>
<h3 id="1-hmm-gmm-acoustic-model"><strong>1. HMM-GMM acoustic Model</strong></h3>
<p>The acoustic model is a complex model, usually based on Hidden Markov Models and Artificial Neural Networks, modeling the relationship between the audio signal and the phonetic units in the language.</p>
<p>In isolated word/pattern recognition, the acoustic features (denoted <script type="math/tex">Y</script> here) are used as an input to a classifier whose role is to output the correct word. However, in <em>continuous speech recognition</em>, we take an input sequence and must produce an output sequence too.</p>
<p><img src="https://maelfabien.github.io/assets/images/asr_6.png" alt="image" /></p>
<p>The acoustic model goes further than a simple classifier. It outputs a sequence of phonemes.</p>
<p><img src="https://maelfabien.github.io/assets/images/asr_7.png" alt="image" /></p>
<p>Hidden Markov Models are natural candidates for Acoustic Models since they are great at modeling sequences. If you want to read more on HMMs and HMM-GMM training, you can read <a href="https://maelfabien.github.io/machinelearning/GMM/">this article</a>. The HMM has underlying states <script type="math/tex">s_i</script>, and at each state, observations <script type="math/tex">o_i</script> are generated.</p>
<p><img src="https://maelfabien.github.io/assets/images/asr_8.png" alt="image" /></p>
<p>In HMMs, 1 phoneme is typically represented by a 3- or 5-state left-to-right HMM (generally modeling the beginning, middle and end of the phoneme).</p>
<p><img src="https://maelfabien.github.io/assets/images/asr_9.png" alt="image" /></p>
<p>The topology of HMMs is flexible by nature, and we can choose to have each phoneme being represented by a single state, or 3 states for example:</p>
<p><img src="https://maelfabien.github.io/assets/images/asr_10.png" alt="image" /></p>
<p>The HMM supposes observation independence, in the sense that:</p>
<script type="math/tex; mode=display">P(o_t = x \mid s_t = q_i, s_{t-1} = q_j, ...) = P(o_t = x \mid s_t = q_i)</script>
<p>The HMM can also output context-dependent phonemes, called triphones. Triphones are simply a group of 3 phonemes, the left one being the left context, and the right one, the right context.</p>
<p>The HMM is trained using the Baum-Welch algorithm, and learns to give the probability of each phoneme ending at time <script type="math/tex">t</script>. We usually suppose the observations are generated at each state by a mixture of Gaussians (Gaussian Mixture Model, GMM), i.e.:</p>
<script type="math/tex; mode=display">P(o_t = y \mid s_t = q_j) = \sum_{m=1}^{M} c_{jm} \mathcal{N}(y; \mu_{jm}, \Sigma_{jm})</script>
<p>The training of the HMM-GMM is solved by Expectation Maximization (EM): the GMM likelihoods <script type="math/tex">P(X \mid W)</script> are used to align frames to states, these alignments are used in turn to re-estimate the GMM parameters, and the Viterbi or Baum-Welch algorithm trains the HMM (i.e. identifies the transition matrices) to produce the best state sequence.</p>
<p>The full pipeline is presented below:</p>
<p><img src="https://maelfabien.github.io/assets/images/asr_11.png" alt="image" /></p>
<h3 id="2-hmm-dnn-acoustic-model"><strong>2. HMM-DNN acoustic model</strong></h3>
<p>More recent models focus on hybrid HMM-DNN architectures and approach the acoustic model in another way. In such an approach, we do not model the acoustic likelihood <script type="math/tex">P(X \mid W)</script>, but directly tackle <script type="math/tex">P(W \mid X)</script>, the probability of observing state sequences given <script type="math/tex">X</script>.</p>
<p>Hence, back to the first acoustic modeling equation, we target:</p>
<script type="math/tex; mode=display">W^{\star} = argmax_W P(W \mid X)</script>
<p>The aim of the DNN is to model the <strong>posterior probabilities</strong> over HMM states.</p>
<p><img src="https://maelfabien.github.io/assets/images/asr_12.png" alt="image" /></p>
<p>Some considerations on the HMM-DNN framework:</p>
<ul>
<li>we usually take a large number of hidden layers</li>
<li>the input features are typically extracted from large windows (up to 1-2 seconds) to have a large context</li>
<li>early stopping can be used</li>
</ul>
<p>You might have noticed that the training of the DNN produces posteriors, whereas the Viterbi / forward-backward algorithm requires likelihoods <script type="math/tex">P(X \mid W)</script> to identify the optimal sequence when training the HMM. Therefore, we use Bayes’ rule:</p>
<script type="math/tex; mode=display">P(X \mid W) = \frac{P(W \mid X) P(X)}{P(W)}</script>
<p>The probability of the acoustic features <script type="math/tex">P(X)</script> is not known, but it simply scales all the likelihoods by the same factor, and therefore does not modify the alignment.</p>
<p>The training of HMM-DNN architectures is based:</p>
<ul>
<li>either on the original hybrid HMM-DNN, using EM, where:
<ul>
<li>E-step keeps DNN and HMM parameters constant and estimates the DNN outputs to produce scaled likelihoods</li>
<li>M-step re-trains the DNN parameters on the new targets from E-step</li>
</ul>
</li>
<li>or using REMAP, with a similar architecture, except that the state priors are also given as inputs to the DNN</li>
</ul>
<h3 id="3-hmm-dnn-vs-hmm-gmm"><strong>3. HMM-DNN vs. HMM-GMM</strong></h3>
<p>Here is a brief summary of the pros and cons of HMM/DNN and HMM/GMM:</p>
<table>
<thead>
<tr>
<th>HMM/DNN</th>
<th>HMM/GMM</th>
</tr>
</thead>
<tbody>
<tr>
<td>Considers short term correlation</td>
<td>Assumes no correlation in inputs</td>
</tr>
<tr>
<td>No probability distribution function assumption</td>
<td>Assumes GMMs as PDFs</td>
</tr>
<tr>
<td>Discriminative training in the generated distributions</td>
<td>No discriminative training in the generated distributions (can be overlapping)</td>
</tr>
<tr>
<td>Discriminative acoustic model at frame level</td>
<td>Poor discrimination (Maximum Likelihood instead of Maximum A Posteriori)</td>
</tr>
<tr>
<td>Higher performance</td>
<td>Lower performance</td>
</tr>
</tbody>
</table>
<h3 id="4-end-to-end-models"><strong>4. End-to-end models</strong></h3>
<p>In End-to-end models, the steps of feature extraction and phoneme prediction are combined:</p>
<p><img src="https://maelfabien.github.io/assets/images/asr_13.png" alt="image" /></p>
<p>This concludes the part on acoustic modeling.</p>
<h2 id="pronunciation">Pronunciation</h2>
<p>In small-vocabulary tasks, it is quite easy to collect a lot of utterances for each word, and the HMM-GMM or HMM-DNN training is efficient. However, “statistical modeling requires a sufficient number of examples to get a good estimate of the relationship between speech input and the parts of words”. In large-vocabulary tasks, we might collect only 1 or even 0 training examples for a given word. Thus, it is not feasible to train a model for each word, and we need to share information across words, based on the pronunciation.</p>
<script type="math/tex; mode=display">W^{\star} = argmax_W P(W \mid X)</script>
<p>We consider words as being sequences of states <script type="math/tex">Q</script>.</p>
<script type="math/tex; mode=display">W^{\star} = argmax_W \sum_Q P(X \mid Q, W) P(Q \mid W) P(W)</script>
<script type="math/tex; mode=display">W^{\star} \approx argmax_W \sum_Q P(X \mid Q) P(Q \mid W) P(W)</script>
<script type="math/tex; mode=display">W^{\star} \approx argmax_W max_Q P(X \mid Q) P(Q \mid W) P(W)</script>
<p>Where <script type="math/tex">P(Q \mid W)</script> is the <strong>pronunciation model</strong>.</p>
<p>The pronunciation dictionary is written by human experts, and defined in the IPA. The pronunciation of words is typically stored in a lexical tree, a data structure that allows us to share histories between words in the lexicon.</p>
<p><img src="https://maelfabien.github.io/assets/images/asr_15.png" alt="image" /></p>
<p>When decoding a sequence in prediction, we must identify the most likely path in the tree based on the HMM-DNN output.</p>
<p>In ASR, most recent approaches are:</p>
<ul>
<li>either end to end</li>
<li>or at the character level</li>
</ul>
<p>In both approaches, we do not care about the full pronunciation of the words. Grapheme-to-phoneme (G2P) models try to learn automatically the pronunciation of new words.</p>
<h2 id="language-modeling">Language Modeling</h2>
<p>Let’s get back to our ASR base equation:</p>
<script type="math/tex; mode=display">W^{\star} = argmax_W P(W \mid X)</script>
<script type="math/tex; mode=display">W^{\star} = argmax_W P(X \mid W) P(W)</script>
<p>The language model is defined as <script type="math/tex">P(W)</script>. It assigns a probability estimate to word sequences, and defines:</p>
<ul>
<li>what the speaker may say</li>
<li>the vocabulary</li>
<li>the probability over possible sequences, by training on some texts</li>
</ul>
<p>The contraint on <script type="math/tex">P(W)</script> is that <script type="math/tex">\sum_W P(W) = 1</script>.</p>
<p>In statistical language modeling, we aim to disambiguate sequences such as:</p>
<p>“recognize speech”, “wreck a nice beach”</p>
<p>The maximum likelihood estimation of a sequence is given by:</p>
<script type="math/tex; mode=display">P(w_i \mid w_1, ..., w_{i-1}) = \frac{C(w_1, ..., w_i)}{\sum_v C(w_1, ..., w_{i-1} v)}</script>
<p>Where <script type="math/tex">C(w_1, ..., w_i)</script> is the observed count in the training data. For example:</p>
<p><img src="https://maelfabien.github.io/assets/images/asr_16.png" alt="image" /></p>
<p>We call this ratio the <strong>relative frequency</strong>. The probability of a whole sequence is given by the <strong>chain rule</strong> of probabilities:</p>
<script type="math/tex; mode=display">P(w_1, ..., w_N) = \prod_{k=1}^N P(w_k \mid w_1, ..., w_{k-1})</script>
<p>This approach seems logical, but the longer the sequence, the more likely it is that some counts are 0, bringing the probability of the whole sequence to 0.</p>
<p>What solutions can we apply?</p>
<ul>
<li>smoothing: redistribute the probability mass from observed to unobserved events (e.g Laplace smoothing, Add-k smoothing)</li>
<li>backoff: explained below</li>
</ul>
<h3 id="1-n-gram-language-model"><strong>1. N-gram language model</strong></h3>
<p>But one of the most popular solutions is the <strong>n-gram model</strong>. The idea behind the n-gram model is to truncate the word history to the last 2, 3, 4 or 5 words, and therefore approximate the history of the word:</p>
<script type="math/tex; mode=display">P(w_i \mid h) = P(w_i \mid w_{i-n+1}, ..., w_{i-1})</script>
<p>We take <script type="math/tex">n</script> as being 1 (unigram), 2 (bigram), 3 (trigram)…</p>
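<p>A minimal sketch of bigram relative-frequency estimation on a made-up toy corpus, using <code class="language-plaintext highlighter-rouge">&lt;s&gt;</code> and <code class="language-plaintext highlighter-rouge">&lt;/s&gt;</code> as sentence-boundary pseudo-words:</p>

```python
from collections import Counter

# Bigram relative-frequency estimation on a made-up toy corpus, with <s> and
# </s> as sentence-boundary pseudo-words.
corpus = [
    "<s> i want chinese food </s>",
    "<s> i want english food </s>",
    "<s> i want to eat </s>",
]
tokens = [w for sent in corpus for w in sent.split()]
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))  # note: chaining sentences this way
                                            # also counts the ("</s>", "<s>") pair

def p_bigram(w, prev):
    """Relative frequency P(w | prev) = C(prev, w) / C(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_bigram("want", "i"))        # 1.0: "i" is always followed by "want"
print(p_bigram("chinese", "want"))  # 1/3
```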
<p>Let us now discuss some practical implementation tricks:</p>
<ul>
<li>we compute the log of the probabilities, rather than the probabilities themselves (to avoid floating point approximation to 0)</li>
<li>for the first word of a sequence, we need to define <strong>pseudo-words</strong> as being the first 2 missing words for the trigram: <script type="math/tex">% <![CDATA[
P(I \mid <s><s>) %]]></script></li>
</ul>
<p>With N-grams, it is possible that we encounter unseen N-grams in prediction. There is a technique called <strong>backoff</strong> that states that if we miss the trigram evidence, we use the bigram instead, and if we miss the bigram evidence, we use the unigram instead…</p>
<p>Another approach is <strong>linear interpolation</strong>, where we combine different-order n-grams by linearly interpolating all the models:</p>
<script type="math/tex; mode=display">P(w_n \mid w_{n-2} w_{n-1}) = \lambda_1 P(w_n \mid w_{n−2} w_{n−1}) + \lambda_2 P(w_n \mid w_{n−1}) + \lambda_3 P(w_n)</script>
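<p>The interpolation above can be sketched as follows, with hypothetical <script type="math/tex">\lambda</script> weights (in practice they are tuned on held-out data and must sum to 1):</p>

```python
# Linearly interpolated trigram estimate; the lambda weights are hypothetical
# placeholders (tuned on held-out data in practice, summing to 1).
def interp_trigram(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    l1, l2, l3 = lambdas
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

# An unseen trigram (p_tri = 0) still receives probability mass from the
# lower-order models: 0.3 * P_bigram + 0.1 * P_unigram.
print(interp_trigram(0.0, 0.2, 0.05))
```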
<h3 id="2-language-models-evaluation-metrics"><strong>2. Language models evaluation metrics</strong></h3>
<p>There are 2 types of evaluation metrics for language models:</p>
<ul>
<li><em>extrinsic evaluation</em>, for which we embed the language model in an application and see by which factor the performance is improved</li>
<li><em>intrinsic evaluation</em> that measures the quality of a model independent of any application</li>
</ul>
<p>Extrinsic evaluations are often costly to implement. Hence, when focusing on intrinsic evaluations, we:</p>
<ul>
<li>split the dataset/corpus into train and test (and development set if needed)</li>
<li>learn transition probabilities from the training set</li>
<li>use the <strong>perplexity</strong> metric to evaluate the language model on the test set</li>
</ul>
<p>We could also use the raw probabilities to evaluate the language model, but the perplexity is defined as the inverse probability of the test set, normalized by the number of words. For example, for a bigram model, the perplexity (noted PP) is defined as:</p>
<script type="math/tex; mode=display">PP(W) = \sqrt[^N]{ \prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}</script>
<p>The lower the perplexity, the better.</p>
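In code, perplexity is best computed in log space, in line with the log-probability trick mentioned earlier; a sketch with made-up conditional probabilities:

```python
import math

def bigram_perplexity(cond_probs):
    """PP(W) = (prod_i 1/P(w_i | w_{i-1}))^(1/N), computed via the average log-probability."""
    N = len(cond_probs)
    log_prob = sum(math.log(p) for p in cond_probs)
    return math.exp(-log_prob / N)

# P(w_i | w_{i-1}) for each word of a hypothetical 4-word test sentence:
pp = bigram_perplexity([0.2, 0.5, 0.1, 0.25])
print(pp)  # lower is better; a uniform model over V words would give PP = V
```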
<h3 id="3-limits-of-language-models"><strong>3. Limits of language models</strong></h3>
<p>Language models are trained on a closed vocabulary. Hence, when a new unknown word is met, it is said to be <strong>Out of Vocabulary</strong> (OOV).</p>
<h3 id="4-deep-learning-language-models"><strong>4. Deep learning language models</strong></h3>
<p>More recently in Natural Language Processing, neural network-based language models have become more and more popular. Word embeddings project words into a continuous space <script type="math/tex">R^d</script>, and respect topological properties (semantics and morpho-syntaxic).</p>
<p>Recurrent neural networks and LSTMs are natural candidates when learning such language models.</p>
<h2 id="decoding">Decoding</h2>
<p>The training is now done. The final step to cover is the decoding, i.e. the predictions to make when we collect audio features and want to produce a transcript.</p>
<p>We need to find:</p>
<script type="math/tex; mode=display">W^{\star} = argmax_W P(X \mid W) P(W) I^{\mid W \mid}</script>
<p>However, exploring the whole search space, especially since the language model <script type="math/tex">P(W)</script> has a really large scale factor, can be incredibly slow.</p>
<p>One of the solutions is to use <strong>Beam Search</strong>. The Beam Search algorithm greatly reduces the scale factor within a language model (whether N-gram-based or neural-network-based). In Beam Search, we:</p>
<ul>
<li>identify the probability of each word in the vocabulary for the first position, and keep the top K ones (K is called the Beam width)</li>
<li>for each of the K words, we compute the conditional probability of observing each of the second words of the vocabulary</li>
<li>among all produced probabilities, we keep only the top K ones</li>
<li>and we move on to the third word…</li>
</ul>
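The steps above can be sketched as follows; `next_word_probs` and the toy model are hypothetical stand-ins for a real language model:

```python
import math

def beam_search(next_word_probs, vocab, beam_width=2, max_len=3):
    """Keep only the top-K partial sequences (by cumulative log-probability) at each step."""
    beams = [((), 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            # Expand each kept hypothesis with every possible next word.
            for w, p in next_word_probs(seq, vocab).items():
                candidates.append((seq + (w,), score + math.log(p)))
        # Among all expansions, keep only the K best ones.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

# Toy model: "a" is likely as a first word, "cat" is likely after "a".
def toy_probs(seq, vocab):
    if not seq:
        return {"a": 0.6, "the": 0.4}
    if seq[-1] == "a":
        return {"cat": 0.7, "dog": 0.3}
    return {"cat": 0.5, "dog": 0.5}

best = beam_search(toy_probs, ["a", "the", "cat", "dog"], beam_width=2, max_len=2)
print(best[0][0])
```

With a beam width of 2, only two hypotheses survive each step, so the search never enumerates the full vocabulary power set.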
<p>Let us illustrate this process the following way. We want to evaluate the sequence that is the most likely. We first compute the probability of the different words of the vocabulary to be the starting word of the sentence:</p>
<p><img src="https://maelfabien.github.io/assets/images/asr_18.png" alt="image" /></p>
<p>Here, we fix the beam width to 2, meaning that we only select the 2 most likely words to start with. Then, we move on to the next word, and compute the probability of observing it using the conditional probability given by the language model: <script type="math/tex">P(w_1, w_2 \mid X) = P(w_1 \mid X) P(w_2 \mid w_1, X)</script>. We might see that a potential candidate, e.g. “The”, is no longer a possible path once we select the top 2 candidate second words among all possible words. In that case, we narrow the search, since we know that the first word must be “a”.</p>
<p><img src="https://maelfabien.github.io/assets/images/asr_19.png" alt="image" /></p>
<p>And so on… Another approach to decoding relies on Weighted Finite State Transducers (I’ll write an article on that).</p>
<h2 id="summary-of-the-asr-pipeline">Summary of the ASR pipeline</h2>
<p>In their paper <a href="https://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/42543.pdf">“Word Embeddings for Speech Recognition”</a>, Samy Bengio and Georg Heigold present a good summary of a modern ASR architecture:</p>
<ul>
<li>Words are represented through lexicons as phonemes</li>
<li>Typically, for context, we cluster triphones</li>
<li>We then assume that these triphone states are in fact HMM states</li>
<li>And the observations each HMM state generates are produced by DNNs or GMMs</li>
</ul>
<p><img src="https://maelfabien.github.io/assets/images/asr_17.png" alt="image" /></p>
<h1 id="end-to-end-approach">End-to-end approach</h1>
<p>Alright, this article is already long, but we’re almost done. So far, we mostly covered historical statistical approaches. These approaches work very well. However, most recent papers and implementations focus on end-to-end approaches, where:</p>
<ul>
<li>we encode <script type="math/tex">X</script> as a sequence of contexts <script type="math/tex">C</script></li>
<li>we decode <script type="math/tex">C</script> into a sequence of words <script type="math/tex">W</script></li>
</ul>
<p>These approaches, also called encoder-decoder, are part of sequence-to-sequence models. Sequence to sequence models learn to map a sequence of inputs to a sequence of outputs, even though their length might differ. This is widely used in Machine Translation for example.</p>
<p>As illustrated below, the encoder reduces the input sequence to an encoder vector through a stack of RNNs, and the decoder uses this vector as its input.</p>
<p><img src="https://maelfabien.github.io/assets/images/asr_20.jpg" alt="image" /></p>
<p>I will write more about End-to-end models in another article.</p>
<h1 id="conclusion">Conclusion</h1>
<p>This is all for this quite long introduction to automatic speech recognition. After a brief introduction to speech production, we covered historical approaches to speech recognition with HMM-GMM and HMM-DNN approaches. We also mentioned the more recent end-to-end approaches. If you want to improve this article or have a question, feel free to leave a comment below :)</p>
<p>References:</p>
<ul>
<li><a href="http://people.irisa.fr/Gwenole.Lecorve/lectures/ASR.pdf">“Automatic speech recognition” by Gwénolé Lecorvé from the Research in Computer Science (SIF) master</a></li>
<li>EPFL Statistical Sequence Processing course</li>
<li><a href="https://www.youtube.com/watch?v=WSBZ0hBJn7E">Stanford CS224S</a></li>
<li><a href="https://mycourses.aalto.fi/pluginfile.php/426574/mod_folder/content/0/Rasmus_Robert_DNN.pdf?forcedownload=0">Rasmus Robert HMM-DNN</a></li>
<li><a href="https://link.springer.com/chapter/10.1007/978-3-540-45115-0_3">A Tutorial on Pronunciation Modeling for Large Vocabulary Speech Recognition</a></li>
<li><a href="https://web.stanford.edu/~jurafsky/slp3/3.pdf">N-gram Language Models, Stanford</a></li>
<li><a href="https://www.youtube.com/watch?v=RLWuzLLSIgw">Andrew Ng’s Beam Search explanation</a></li>
<li><a href="https://towardsdatascience.com/understanding-encoder-decoder-sequence-to-sequence-model-679e04af4346">Encoder Decoder model</a></li>
<li><a href="http://www.inf.ed.ac.uk/teaching/courses/asr/2019-20/asr01-intro.pdf">Automatic Speech Recognition Introduction, University of Edinburgh</a></li>
</ul>Maël FabienSpeech ProcessingIllustrating EM for GMMs and HMMs2020-05-09T00:00:00+00:002020-05-09T00:00:00+00:00https://maelfabien.github.io/project/EM<p>I recently gave a talk on EM for GMMs and HMMs at EPFL and published the slides <a href="https://maelfabien.github.io/machinelearning/GMM/">here</a>. For the sake of the presentation, I built an interactive web application using Dash, Plotly, scikit-learn, open-cv and hmm-learn. In the app, I included:</p>
<ul>
<li>GMM generated data exploration</li>
<li>K-Means vs. GMMs performance on overlapping clusters</li>
<li>Fitting EM on GMM data</li>
<li>EM-GMM for gender detection</li>
<li>Vector Quantization with k-Means</li>
<li>Background subtraction with GMMs</li>
<li>Breast cancer data clustering with GMMs</li>
<li>AIC-BIC over the number of components</li>
<li>HMM-GMM data generation</li>
<li>HMM-GMM training and visualization</li>
<li>HMM-GMM for spoken digit speech recognition</li>
</ul>
<iframe width="700" height="500" src="https://www.youtube.com/embed/hxr-UijYbpk" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<p><br /></p>
<p>I summarized the presentation mentioned above on Towards Data Science, right <a href="https://towardsdatascience.com/expectation-maximization-for-gmms-explained-5636161577ca">here</a>.</p>Maël FabienI recently gave a talk on EM for GMMs and HMMs at EPFL and published the slides here. For the sake of the presentation, I built an interactive web application using Dash, Plotly, scikit-learn, open-cv and hmm-learn. In the app, I included: GMM generated data exploration K-Means vs. GMMs performance on overlapping clusters Fitting EM on GMM data EM-GMM for gender detection Vector Quantization with k-Means Background substraction with GMMs Breast cancer data clustering with GMMs AIC-BIC over the number of components HMM-GMM data generation HMM-GMM training and visualization HMM-GMM for spoken digit speech recognitionExpectation Maximization for Gaussian Mixture Models and Hidden Markov Models2020-05-09T00:00:00+00:002020-05-09T00:00:00+00:00https://maelfabien.github.io/machinelearning/GMM<script type="text/javascript" async="" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
<p>For a course at EPFL, I recently gave a presentation on Expectation Maximization for Gaussian Mixture Models and Hidden Markov Models. The presentation received nice feedback, and I thought that including it here could be useful:</p>
<div style="width:100%; text-align:justify; align-content:left; display:inline-block;">
<embed src="https://maelfabien.github.io/assets/files/EM.pdf" type="application/pdf" width="100%" height="138px" />
</div>
<p><br /></p>
<p>I summarized this presentation on Towards Data Science, right <a href="https://towardsdatascience.com/expectation-maximization-for-gmms-explained-5636161577ca">here</a>.</p>
<p>I also made an interactive Dash web application. Here’s a preview of what you can find in it:</p>
<iframe width="700" height="500" src="https://www.youtube.com/embed/hxr-UijYbpk" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<p><br /></p>
<p>The Github of the project can be found <a href="https://github.com/maelfabien/EM_GMM_HMM">here</a>.</p>Maël FabienSpeech ProcessingBuilding a Dash Web application for Data Viz and ML2020-04-19T00:00:00+00:002020-04-19T00:00:00+00:00https://maelfabien.github.io/project/Dash<p>I recently had to build a Dash web application to illustrate what Dash-Plotly can do. I chose to present some capabilities regarding Data Viz and Machine Learning.</p>
<iframe width="700" height="500" src="https://www.youtube.com/embed/UggjszESuUw" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<p><br /></p>
<p>I chose to explore the well-known Iris dataset. I also chose to use only Plotly Express for the visualizations, since the library is a light and well-performing tool. There are 2 tabs, the other one is for Machine Learning. The user can select the column to predict, and the columns to use in training. Then, a Support Vector Machine algorithm is run on top. Since there are 3 classes, a 3D plot displays the probabilities of belonging to each class. The more separated the probabilities are, the easier it was for the algorithm to split the classes.</p>
<p>This could be a first step for a generic ML tool, a bit like MindsDB. It took a few lines of code and an afternoon, and was deployed in minutes with Heroku.</p>Maël FabienI recently had to build a Dash web application to illustrate what Dash-Plotly can do. I chose to present some capabilities regarding Data Viz and Machine Learning.A supervised learning approach to predicting nodes betweenness-centrality in time-varying networks2020-04-17T00:00:00+00:002020-04-17T00:00:00+00:00https://maelfabien.github.io/machinelearning/node_pred<p>I have recently been working on time-varying networks, i.e. networks for which we have timestamps of various interactions between the nodes. This is the case for social networks or criminal networks for example. When we analyze the centrality of the nodes of a graph, it gives us a snapshot of the structure of the graph at that exact moment.</p>
<p>However, these networks are time-varying by nature. They evolve, new interactions are being made, new nodes are created… And knowing which nodes are going to be central next month can be a key information in criminal investigations. For this reason, I wanted to spend some time and look at whether one can actually predict the central nodes in the future.</p>
<script type="text/javascript" async="" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
<h1 id="dataset">Dataset</h1>
<p>I am working on the Enron e-mail dataset, enriched by phone calls that I was able to match. I have overall 1264 events, each event being either an email or a phone call between 2 characters (or more). For each event, I create a row in the training dataset for each node. Overall, my final dataset is made of more than 50’000 rows.</p>
<p>The timestamps vary between 2000-08-03 09:10:00 and 2001-01-29 22:21:00. Thus, we have a time period of close to 6 months of events. The first thing that we should look at is the evolution of the centrality of the nodes over time.</p>
<p><img src="https://maelfabien.github.io/assets/images/node_evol.png" alt="image" /></p>
<p>There seem to be some changes over time in the order of the major nodes. There also seem to be some dates at which a lot of events were collected, typically because phone calls were not registered continuously.</p>
<h1 id="feature-extraction">Feature extraction</h1>
<p>I turned the node betweenness centrality score prediction into a supervised learning problem. My task will be to predict which node will be central in 1 month from now. This can be useful for police investigations since it does take time to plan when to arrest criminals for example.</p>
<p>To build my dataset, for each node, for each date, I am collecting:</p>
<ul>
<li>the conversation date</li>
<li>the betweenness centrality</li>
<li>the relative degree centrality</li>
<li>the clustering coefficient of the node</li>
<li>the eigenvector centrality</li>
<li>the Katz centrality</li>
<li>the closeness centrality</li>
<li>the load centrality</li>
<li>the harmonic centrality</li>
<li>if the node is in the max clique of the graph</li>
<li>the average clustering of the graph</li>
<li>if the node is in the minimum weighted dominating set</li>
</ul>
<p>In Python, I use NetworkX to implement this feature extraction.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">all_features</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">conv</span> <span class="ow">in</span> <span class="n">df</span><span class="p">.</span><span class="n">iterrows</span><span class="p">():</span>
<span class="c1"># If a least 2 characters in the conversation
</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">conv</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'characters'</span><span class="p">])</span> <span class="o">>=</span> <span class="mi">2</span><span class="p">:</span>
<span class="c1"># Add the edges
</span> <span class="k">for</span> <span class="n">elem</span> <span class="ow">in</span> <span class="nb">list</span><span class="p">(</span><span class="n">itertools</span><span class="p">.</span><span class="n">combinations</span><span class="p">(</span><span class="n">conv</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'characters'</span><span class="p">],</span> <span class="mi">2</span><span class="p">)):</span>
<span class="n">G</span><span class="p">.</span><span class="n">add_edge</span><span class="p">(</span><span class="n">elem</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">elem</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="c1"># Collect node features
</span> <span class="k">for</span> <span class="n">node</span> <span class="ow">in</span> <span class="n">G</span><span class="p">.</span><span class="n">nodes</span><span class="p">():</span>
<span class="n">feature</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">feature</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">conv</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'Date'</span><span class="p">])</span>
<span class="n">feature</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">node</span><span class="p">)</span>
<span class="n">feature</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">betweenness_centrality</span><span class="p">(</span><span class="n">G</span><span class="p">)[</span><span class="n">node</span><span class="p">])</span>
<span class="n">feature</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">G</span><span class="p">.</span><span class="n">degree</span><span class="p">[</span><span class="n">node</span><span class="p">]</span><span class="o">/</span><span class="nb">sum</span><span class="p">(</span><span class="nb">dict</span><span class="p">(</span><span class="n">G</span><span class="p">.</span><span class="n">degree</span><span class="p">).</span><span class="n">values</span><span class="p">()))</span>
<span class="n">feature</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">nx</span><span class="p">.</span><span class="n">clustering</span><span class="p">(</span><span class="n">G</span><span class="p">)[</span><span class="n">node</span><span class="p">])</span>
<span class="n">feature</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">nx</span><span class="p">.</span><span class="n">eigenvector_centrality</span><span class="p">(</span><span class="n">G</span><span class="p">)[</span><span class="n">node</span><span class="p">])</span>
<span class="n">feature</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">nx</span><span class="p">.</span><span class="n">katz_centrality</span><span class="p">(</span><span class="n">G</span><span class="p">)[</span><span class="n">node</span><span class="p">])</span>
<span class="n">feature</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">nx</span><span class="p">.</span><span class="n">closeness_centrality</span><span class="p">(</span><span class="n">G</span><span class="p">)[</span><span class="n">node</span><span class="p">])</span>
<span class="n">feature</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">nx</span><span class="p">.</span><span class="n">load_centrality</span><span class="p">(</span><span class="n">G</span><span class="p">)[</span><span class="n">node</span><span class="p">])</span>
<span class="n">feature</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">nx</span><span class="p">.</span><span class="n">harmonic_centrality</span><span class="p">(</span><span class="n">G</span><span class="p">)[</span><span class="n">node</span><span class="p">])</span>
<span class="k">if</span> <span class="n">node</span> <span class="ow">in</span> <span class="n">max_clique</span><span class="p">(</span><span class="n">G</span><span class="p">):</span>
<span class="n">feature</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">feature</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="n">feature</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">average_clustering</span><span class="p">(</span><span class="n">G</span><span class="p">))</span>
<span class="k">if</span> <span class="n">node</span> <span class="ow">in</span> <span class="n">min_weighted_dominating_set</span><span class="p">(</span><span class="n">G</span><span class="p">):</span>
<span class="n">feature</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">feature</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="n">all_features</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">feature</span><span class="p">)</span>
</code></pre></div></div>
<p>I then build a column “is top 5” if the node is within the 5 nodes with the highest centrality at the given date. In order to add some additional features, I also append for each node the features at the 5 previous states, and create features that reflect the differences between each state.</p>
<p>Then, knowing the situation of the network in 1 month from now, I collect the 5 nodes with the highest centrality at that time, and create a column “will be top 5”. I must drop the last month of my dataset since it typically would be the period I would need to predict on in real life.</p>
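Such lagged features and state-to-state differences can be sketched with pandas (the column names here are illustrative, not the ones actually used above):

```python
import pandas as pd

df = pd.DataFrame({
    "node": ["A", "A", "A", "B", "B", "B"],
    "betweenness": [0.10, 0.15, 0.30, 0.05, 0.06, 0.04],
})

# For each node, append the feature at the k previous states...
for k in (1, 2):
    df[f"betweenness_lag{k}"] = df.groupby("node")["betweenness"].shift(k)
# ...and a feature reflecting the difference between consecutive states.
df["betweenness_delta"] = df["betweenness"] - df["betweenness_lag1"]

print(df)
```

The `groupby(...).shift(k)` keeps each node's history separate, so node B's first rows get `NaN` lags instead of leaking values from node A.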
<h1 id="model-performance">Model Performance</h1>
<p>My dataset is now made of 48744 rows and 125 columns. To create my training and test sets, I must split the dataset in time, in order not to include information from the future. Since there is a class imbalance (the five nodes with the highest centrality vs. all the other nodes), I chose the F1-score metric. I then compare:</p>
<ul>
<li>the naive approach of predicting the current centrality</li>
<li>and the model output</li>
</ul>
<p>As for the model, I chose an XGBoost with 250 estimators and a max depth of 6. Since splitting at a single random point in time would not be reliable enough, I chose to split at 50 different points in time (every 1000 rows), and plot the results below:</p>
<p><img src="https://maelfabien.github.io/assets/images/evol_node.png" alt="image" /></p>
<h1 id="discussion">Discussion</h1>
<p>As we can see, when the XGBoost model sees few training examples, the naive approach clearly outperforms our model. However, with around 6-7’000 training samples, XGBoost clearly outperforms the naive approach. The F1-score of our prediction is ± 80%, with an accuracy of around 98%, well above the average F1-score of the naive approach of 50.6%.</p>
<p>We can plot the feature importance of the XGBoost model:</p>
<p><img src="https://maelfabien.github.io/assets/images/feat_imp.png" alt="image" /></p>
<p>We observe that:</p>
<ul>
<li>the current load centrality appears to be the most important feature for predicting betweenness centrality 1 month ahead</li>
<li>other features extracted from the current topology of the network are important</li>
<li>clustering coefficient and relative degree from previous steps are also important</li>
<li>the evolution of the relative degree from one state to another is not as useful a feature as expected</li>
</ul>
<blockquote>
<p>Overall, predicting the node centrality based on features extracted from the network in a supervised fashion seems to be feasible. The results seem encouraging and may suggest that investigators in a criminal investigation could use such an approach to predict node centrality one month ahead and plan their surveillance programs accordingly.</p>
</blockquote>Maël FabienCriminal NetworksDestabilizing Networks2020-04-15T00:00:00+00:002020-04-15T00:00:00+00:00https://maelfabien.github.io/machinelearning/disruption<p>In this article, I will summarize and discuss the paper <a href="http://www.casos.cs.cmu.edu/publications/protected/2000-2004/2000-2002/carley_2001_destabilizingnetworks.pdf">“Destabilizing Networks”</a> by Kathleen M. Carley, Ju-Sung Lee and David Krackhardt.</p>
<script type="text/javascript" async="" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
<h1 id="background">Background</h1>
<p>Classic network analysis tools are able to identify:</p>
<ul>
<li>individuals whose removal would alter the network significantly, usually due to a high centrality</li>
<li>individuals likely to act</li>
<li>individuals able to propagate information rapidly</li>
<li>individuals with more power</li>
<li>individuals providing redundancy in the network</li>
</ul>
<p>Tools are also able to identify patterns:</p>
<ul>
<li>basic structure</li>
<li>central tendency</li>
<li>coherency of the network</li>
<li>significantly different sub-networks</li>
<li>…</li>
</ul>
<p><img src="https://maelfabien.github.io/assets/images/var.png" alt="image" /></p>
<h1 id="discussion">Discussion</h1>Maël FabienCriminal NetworksSpeaker Verification using Gaussian Mixture Model (GMM-UBM)2020-04-09T00:00:00+00:002020-04-09T00:00:00+00:00https://maelfabien.github.io/machinelearning/PLDA<script type="text/javascript" async="" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
<p>The method introduced below is called GMM-UBM, which stands for Gaussian Mixture Model - Universal Background Model. This method has, for a long time, been a state-of-the-art approach.</p>
<p>I will use as a reference the paper “A Tutorial on Text-Independent Speaker Verification” by Frédéric Bimbot et al. Although from 2002, this tutorial describes this classical approach pretty well.</p>
<p>This article requires that you have understood the <a href="https://maelfabien.github.io/machinelearning/basics_speech/">basics of Speaker Verification</a>.</p>
<p>In this article, I will present the main steps of a Speaker Verification system.</p>
<p><img src="https://maelfabien.github.io/assets/images/bs_1.png" alt="image" /></p>
<p>The main steps of speaker verification are:</p>
<ul>
<li>Development: learn speaker-independent models using a large amount of data. This is a pre-training part, called the Universal Background Model (UBM). It can be gender-specific, in the sense that we have one model for males and one for females.</li>
<li>Enrollment: learn distinct characteristics of a speaker’s voice. This step typically creates one model per unique speaker considered. This is the training part.</li>
<li>Verification: distinct characteristics of a claimant’s voice are compared with previously enrolled claimed speaker models. This is the prediction part.</li>
</ul>
<p>The first step is to extract features from the development set, enrollment set and verification set.</p>
<h1 id="i-speech-acquisition-and-feature-extraction">I. Speech acquisition and Feature extraction</h1>
<p>We extract features from the signal to convert the raw signal into a sequence of acoustic feature vectors, which we will use to identify the speaker. We make the assumption that each audio sample contains only one speaker.</p>
<p>Most speech features used in speaker verification rely on a cepstral representation of speech.</p>
<h2 id="1-filterbank-based-cepstral-parameters-mfcc">1. Filterbank-based cepstral parameters (MFCC)</h2>
<h3 id="pre-emphasis">Pre-emphasis</h3>
<p>The first step is usually to apply a pre-emphasis of the signal to enhance the high frequencies of the spectrum, reduced by the speech production process:</p>
<script type="math/tex; mode=display">x_p(t) = x(t) - a x(t-1)</script>
<p>Where <script type="math/tex">a</script> takes values between 0.95 and 0.98.</p>
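In NumPy, this filter is a one-liner (here with a = 0.97, a hypothetical but typical choice inside the range given above):

```python
import numpy as np

def pre_emphasis(x, a=0.97):
    """x_p(t) = x(t) - a * x(t-1); the first sample is kept unchanged."""
    return np.append(x[0], x[1:] - a * x[:-1])

signal = np.array([1.0, 2.0, 3.0, 4.0])
print(pre_emphasis(signal))  # [1.   1.03 1.06 1.09]
```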
<h3 id="framing">Framing</h3>
<p>The signal is then split into successive frames. Most of the time, a length of frame of 20 milliseconds is used, with a shift of 10 milliseconds.</p>
<h3 id="windowing">Windowing</h3>
<p>Then, a windowing is applied. Indeed, when you cut your signal into frames, it is most likely that the end of a frame will not match the start of the next one. Therefore, a windowing function is needed. The Hamming window is one of the most common choices. Windowing also gives a more accurate idea of the original signal’s frequency spectrum, as it “cuts off” signals at their ends.</p>
<p>The Hamming window is given by:</p>
<script type="math/tex; mode=display">w[n]=a_{0}-\underbrace {(1-a_{0})} _{a_{1}}\cdot \cos \left({\tfrac {2\pi n}{N}}\right),\quad 0\leq n\leq N</script>
<p>Where <script type="math/tex">a_0 = 0.53836</script> is the optimal value.</p>
<p><img src="https://maelfabien.github.io/assets/images/hamming.png" alt="image" /></p>
<p>All windowing functions can be found on <a href="https://en.wikipedia.org/wiki/Window_function">Wikipedia</a>.</p>
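Framing and Hamming windowing can be sketched together in NumPy (20 ms frames with a 10 ms shift, as above; the 16 kHz sampling rate is an assumption for illustration, and note that `np.hamming` uses a₀ = 0.54, a common rounding of the 0.53836 above):

```python
import numpy as np

def frame_and_window(x, sr=16000, frame_ms=20, shift_ms=10):
    """Split the signal into overlapping frames and apply a Hamming window to each."""
    frame_len = int(sr * frame_ms / 1000)   # 320 samples at 16 kHz
    shift = int(sr * shift_ms / 1000)       # 160 samples
    n_frames = 1 + (len(x) - frame_len) // shift
    frames = np.stack([x[i * shift : i * shift + frame_len] for i in range(n_frames)])
    return frames * np.hamming(frame_len)   # tapers each frame's edges toward 0

x = np.random.randn(16000)                  # 1 second of signal
frames = frame_and_window(x)
print(frames.shape)                         # (99, 320)
```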
<h3 id="fast-fourier-transform-fft">Fast Fourier Transform (FFT)</h3>
<p>Then, an FFT algorithm is picked (most often Cooley–Tukey) to efficiently compute the Discrete Fourier Transform (DFT):</p>
<script type="math/tex; mode=display">X_{k}=\sum _{n=0}^{N-1}x_{n}e^{-i2\pi kn/N}\qquad k=0,\ldots ,N-1</script>
<p>We typically make the computation on 512 points.</p>
<h3 id="modulus">Modulus</h3>
<p>The absolute value of the FFT is then computed, which gives the magnitude. At that point, we have a <em>power spectrum</em> sampled over 512 points. However, since the spectrum is symmetric, only half of those points are useful.</p>
<h3 id="mel-filters">Mel Filters</h3>
<p>The spectrum at that point has lots of fluctuations, and we don’t need them. We need to apply a smoothing, which will reduce the size of the spectral vectors. We therefore multiply the spectrum by a filterbank, a series of bandpass frequency filters.</p>
<p>Filters can be central, right or left, and defined by their shape (triangular most often). A common choice is the Bark/Mel scale for the frequency localization, a scale similar to the frequency scale of the human ear. A Mel is a unit of measure based on the human ears perceived frequency. It does not correspond linearly to the physical frequency of the tone, as the human auditory system apparently does not perceive pitch linearly. The Mel scale is approximately a linear frequency spacing below 1 kHz and a logarithmic spacing above 1 kHz. See more <a href="https://link.springer.com/content/pdf/bbm%3A978-3-319-49220-9%2F1.pdf">here</a>.</p>
<script type="math/tex; mode=display">f_{MEL} = 2595 * \log_{10} ( {1 + \frac{f}{700}} )</script>
<p>Where <script type="math/tex">f</script> is the physical frequency in Hz, and <script type="math/tex">f_{MEL}</script> is the perceived frequency.</p>
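The Mel-scale formula translates directly to code; the inverse mapping, used in practice to place filterbank edges back on the Hz axis, is included for completeness:

```python
import math

def hz_to_mel(f):
    """f_mel = 2595 * log10(1 + f / 700)"""
    return 2595 * math.log10(1 + f / 700)

def mel_to_hz(m):
    """Inverse mapping of hz_to_mel."""
    return 700 * (10 ** (m / 2595) - 1)

print(hz_to_mel(1000))  # ≈ 1000 mel: roughly linear below 1 kHz
print(hz_to_mel(8000))  # far less than 8000: logarithmic above 1 kHz
```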
<p>We can now compute the Mel spectrum of the magnitude spectrum:</p>
<script type="math/tex; mode=display">s(m) = \sum_{k=0}^{N-1} [ {\mid X(k) \mid}^2 H_m(k) ]</script>
<p>Where <script type="math/tex">H_m(k)</script> is the weight given to the <script type="math/tex">k^{th}</script> energy spectrum bin contributing to the <script type="math/tex">m^{th}</script> output band.</p>
<h3 id="discrete-cosine-transform-dct">Discrete Cosine Transform (DCT)</h3>
<p>Finally, we take the log of the spectrum and a Discrete Cosine Transform is applied. We obtain the Mel-Frequency Cepstral Coefficients (MFCC), and since most of the information is gathered in the first few coefficients, we can keep only the first ones (usually 12 or 20).</p>
<script type="math/tex; mode=display">c_n = \sum_{m=0}^{M-1} \log_{10}(s(m)) \cos \left( \frac{\pi n}{M} \left( m + \frac{1}{2} \right) \right)</script>
<p>There we are, we obtained the MFCC coefficients describing the input signal window.</p>
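<p>The whole chain above (power spectrum, Mel filterbank, log, DCT) can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production extractor (it omits pre-emphasis, windowing, and liftering), and the filter edge placement is one common choice among several:</p>

```python
import numpy as np

def hz_to_mel(f):
    # f_MEL = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters H_m(k), equally spaced on the Mel scale."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):                 # rising edge
            fbank[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                # falling edge
            fbank[i, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc(frame, sr, n_filters=26, n_ceps=12):
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2           # |X(k)|^2, N/2+1 points
    s = mel_filterbank(n_filters, n_fft, sr) @ power  # Mel spectrum s(m)
    log_s = np.log10(np.maximum(s, np.finfo(float).eps))
    m = np.arange(n_filters)
    # DCT of the log Mel spectrum -> cepstral coefficients c_n
    return np.array([np.sum(log_s * np.cos(np.pi * n * (m + 0.5) / n_filters))
                     for n in range(n_ceps)])
```

<p>In practice one would rather use a tested implementation (e.g. librosa or python_speech_features), but the sketch makes the role of each step explicit.</p>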
<h3 id="cepstral-mean-substraction-cms">Cepstral Mean Subtraction (CMS)</h3>
<p>Finally, and especially in Speaker Verification tasks, the cepstral mean vector is subtracted from each vector. This step is called Cepstral Mean Subtraction (CMS) and removes slowly varying convolutive noises.</p>
<h3 id="cepstral-mean-variance-normalization-cmvn">Cepstral mean variance normalization (CMVN)</h3>
<p>Cepstral mean variance normalization (CMVN) reduces the distortion introduced by noise contamination, making feature extraction more robust: it linearly transforms the cepstral coefficients so that they share the same segmental statistics (mean 0, variance 1).</p>
<p>It is however known to degrade the performance of speaker verification tasks on short utterances.</p>
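<p>As a sketch, CMVN is simply a per-dimension standardization of the cepstral feature matrix over an utterance (or over a sliding window); the function below assumes features are stored as a (frames × coefficients) array:</p>

```python
import numpy as np

def cmvn(features):
    """Cepstral mean and variance normalization.

    features: (T, D) array of T cepstral vectors of dimension D.
    Returns features with per-dimension mean 0 and variance 1.
    """
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    # Guard against zero variance in a dimension
    return (features - mu) / np.maximum(sigma, 1e-10)
```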
<h2 id="2-lpc-based-cepstral-parameters">2. LPC-based cepstral parameters</h2>
<p>In Linear Predictive Coding (LPC) analysis, we represent the audio through the coefficients of a linear predictive model. We first decompose the speech production apparatus into its fundamental elements:</p>
<ul>
<li>the glottal source (the space between the vocal folds) produces the buzz, characterized by its intensity (loudness) and frequency (pitch)</li>
<li>the vocal tract (the throat and mouth) forms the tube, characterized by its resonances</li>
<li>the nasal tract</li>
<li>the lips, which generate hisses and pops</li>
</ul>
<p>We then model each of them with a filter on each window. More specifically:</p>
<ul>
<li>a lowpass filter for the glottal source</li>
<li>an AR filter for the vocal tract</li>
<li>an ARMA filter for the nasal tract</li>
<li>an MA filter for the lips</li>
</ul>
<p>Overall, the speech production process becomes an ARMA process, simplified in an AR process. We take each window and estimate the coefficients of an AR filter on the speech signal.</p>
<script type="math/tex; mode=display">c_0 = ln(\sigma^2)</script>
<script type="math/tex; mode=display">c_m = a_m + \sum_{k=1}^{m-1} (\frac{k}{m}) c_k a_{m-k}, 1 ≤ m ≤ p</script>
<script type="math/tex; mode=display">% <![CDATA[
c_m = \sum_{k=1}^{m-1} (\frac{k}{m}) c_k a_{m-k}, p < m %]]></script>
<p>Where <script type="math/tex">\sigma^2</script> is the gain term of the LPC model, <script type="math/tex">a_m</script> are the LPC coefficients, and <script type="math/tex">p</script> is the number of LPC coefficients computed.</p>
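<p>The recursion above translates directly into code. The sketch below assumes the LPC coefficients <script type="math/tex">a_1, ..., a_p</script> and the gain <script type="math/tex">\sigma^2</script> have already been estimated (e.g. with the Levinson-Durbin algorithm):</p>

```python
import numpy as np

def lpc_to_cepstrum(a, sigma2, n_ceps):
    """Convert LPC coefficients a_1..a_p and gain sigma2 to cepstral coefficients.

    Returns c_0..c_{n_ceps-1} following the standard LPC-to-cepstrum recursion.
    """
    p = len(a)
    c = np.zeros(n_ceps)
    c[0] = np.log(sigma2)                  # c_0 = ln(sigma^2)
    for m in range(1, n_ceps):
        acc = a[m - 1] if m <= p else 0.0  # the a_m term exists only for m <= p
        # sum_k (k/m) c_k a_{m-k}, keeping only terms with 1 <= m-k <= p
        for k in range(max(1, m - p), m):
            acc += (k / m) * c[k] * a[m - k - 1]
        c[m] = acc
    return c
```

<p>For a first-order model (<script type="math/tex">p=1</script>, coefficient <script type="math/tex">a</script>), the recursion reproduces the known closed form <script type="math/tex">c_m = a^m / m</script>, which is a handy sanity check.</p>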
<p>There are many other features one could extract, but MFCCs are by far the most common, and LPC-based features are sometimes used, so I won’t dive deeper into this.</p>
<h2 id="3-voice-activity-detection">3. Voice Activity Detection</h2>
<p>We might now want to discard useless information in the frames we extracted features for. We do so by removing frames that do not contain speech using Voice Activity Detection (VAD).</p>
<p>A common approach is the Gaussian-based VAD, but one can also use an energy-based VAD. The aim of a VAD is to acquire speech only when it occurs. I describe the concept and implementation of Voice Activity Detection in more detail in <a href="https://maelfabien.github.io/project/Speech_proj/#">this project</a>.</p>
<p>The main steps behind building a VAD are:</p>
<ul>
<li>Break audio signal into frames</li>
<li>Extract features from each frame</li>
<li>Train a classifier on a known set of speech and silence frames (it could be a Gaussian model or a rule-based decision)</li>
<li>Classify unseen frames as speech or silence</li>
</ul>
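<p>The steps above can be sketched with a simple energy rule, a crude stand-in for a trained Gaussian classifier. The frame length and threshold below are arbitrary illustration values:</p>

```python
import numpy as np

def energy_vad(signal, frame_len=400, threshold_db=-35.0):
    """Flag frames whose energy lies within threshold_db of the loudest frame."""
    n_frames = len(signal) // frame_len
    # Break the signal into non-overlapping frames
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    # Per-frame log energy (small epsilon avoids log(0) on silent frames)
    log_energy = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    # Classify: speech if close enough to the loudest frame
    return log_energy > log_energy.max() + threshold_db
```

<p>A real system would replace the fixed threshold with a model trained on labeled speech/silence frames, and smooth the decisions over time to avoid chopping words.</p>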
<p>VAD performs well on audio with a relatively high signal-to-noise ratio (SNR), the ratio that compares the level of the desired signal to the level of background noise.</p>
<h1 id="ii-statistical-modeling">II. Statistical Modeling</h1>
<p>The core of the speaker verification decision is the likelihood ratio. Say that we want to determine whether speech sample <script type="math/tex">Y</script> (from the verification set) was spoken by speaker <script type="math/tex">S</script>.</p>
<p>Then, the verification task is a basic hypothesis testing:</p>
<p><script type="math/tex">H_0</script> : Y is from speaker S
<script type="math/tex">H_1</script> : Y is not from speaker S</p>
<p>The test to decide whether to accept <script type="math/tex">H_0</script> or not is the Likelihood Ratio (LR):</p>
<script type="math/tex; mode=display">LR = \frac{p(Y \mid H_0)}{p(Y \mid H_1)}</script>
<p>If the Likelihood ratio is greater than the threshold <script type="math/tex">\theta</script>, we accept <script type="math/tex">H_0</script>, otherwise we accept <script type="math/tex">H_1</script>.</p>
<p>If we talk in terms of logs, the log-likelihood ratio is simply the difference between the logs of the two probability density functions:</p>
<script type="math/tex; mode=display">\log(LR) = \log(p(Y \mid H_0)) - \log(p(Y \mid H_1))</script>
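<p>Once the two log-likelihoods are available, the decision rule is a one-liner; here <code>theta</code> is the tunable decision threshold:</p>

```python
def accept_h0(log_p_h0, log_p_h1, theta=0.0):
    """Accept H0 (the claimed speaker) iff the log-likelihood ratio exceeds theta."""
    return (log_p_h0 - log_p_h1) > theta
```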
<p>We have a speaker to test for <script type="math/tex">H_0</script>, and we can build a model, say <script type="math/tex">\lambda_{hyp}</script>, being for example a Gaussian Distribution of the features extracted.</p>
<p>However, we do not have an alternative model for <script type="math/tex">H_1</script>. We must compute what is called a “Background Model”, which would be a Gaussian Model <script type="math/tex">\lambda_{\overline{hyp}}</script>.</p>
<p>There are 2 options for the background model:</p>
<ul>
<li>either consider the closed set of other speakers and compute: <script type="math/tex">p(X \mid \lambda_{\overline{hyp}}) = f ( p(X \mid \lambda_1), ..., p(X \mid \lambda_N))</script>, where <script type="math/tex">f</script> is an aggregative function like the mean or the max. It however requires a model per alternative hypothesis, i.e. per speaker</li>
<li>or consider a pool of several different speakers to train a single model, called the Universal Background Model (UBM)</li>
</ul>
<p>The main advantage of the UBM is that it is <em>universal</em> in the sense that it can be used by any of the speakers, without having to re-train a model.</p>
<p>The pipeline can be represented as such:</p>
<p><img src="https://maelfabien.github.io/assets/images/bs_2.png" alt="image" /></p>
<h2 id="1-universal-background-model--development">1. Universal Background Model : Development</h2>
<p>A UBM is a high-order Gaussian Mixture Model (usually 512 to 2048 mixtures over 24-dimensional features) trained on a large quantity of speech from a wide population. This step learns a speaker-independent distribution of features, used as the alternative hypothesis in the likelihood ratio.</p>
<p>For a D-dimensional feature vector <script type="math/tex">x</script>, the mixture density is:</p>
<script type="math/tex; mode=display">P(x \mid \lambda) = \sum_{k=1}^M w_k \times g(x \mid \mu_k, \Sigma_k)</script>
<p>Where:</p>
<ul>
<li><script type="math/tex">x</script> is a D-dimensional feature vector</li>
<li><script type="math/tex">w_k, k = 1, 2, ..., M</script> are the mixture weights, constrained to sum to 1</li>
<li><script type="math/tex">\mu_k, k = 1, 2, ..., M</script> is the mean of each Gaussian</li>
<li><script type="math/tex">\Sigma_k, k = 1, 2, ..., M</script> is the covariance of each Gaussian</li>
<li><script type="math/tex">g(x \mid \mu_k, \Sigma_k)</script> are the Gaussian densities such that:</li>
</ul>
<script type="math/tex; mode=display">g(x \mid \mu_k, \Sigma_k) = \frac{1}{(2 \pi)^{\frac{D}{2}} {\mid \Sigma_k \mid}^{\frac{1}{2}}} \exp \left( - \frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x-\mu_k) \right)</script>
<p>The parameters of the GMM are therefore : <script type="math/tex">\lambda = (w_k, \mu_k, \Sigma_k), k = 1, 2, 3, ..., M</script>.</p>
<p>We typically use a diagonal covariance-matrix rather than a full-covariance one since it is more computationally efficient and empirically works better.</p>
<p>The GMM is trained on a collection of training vectors. Its parameters are computed iteratively with the Expectation-Maximization (EM) algorithm, so there is no guarantee that two runs converge to the same solution: the result depends on the initialization.</p>
<p>Under the assumption of independent feature vectors, the (average) log-likelihood of a model <script type="math/tex">\lambda</script> for a sequence <script type="math/tex">X = (x_1, x_2, ..., x_T)</script> is simply the average over all feature vectors:</p>
<script type="math/tex; mode=display">\log p(X \mid \lambda) = \frac{1}{T} \sum_t \log p(x_t \mid \lambda)</script>
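<p>A diagonal-covariance version of these two formulas can be sketched with NumPy, using the log-sum-exp trick for numerical stability; array shapes are stated in the comments:</p>

```python
import numpy as np

def gmm_avg_loglik(X, weights, means, variances):
    """Average per-frame log-likelihood of X under a diagonal-covariance GMM.

    X: (T, D) feature vectors; weights: (M,); means, variances: (M, D).
    """
    D = X.shape[1]
    diff = X[:, None, :] - means[None, :, :]                   # (T, M, D)
    # log g(x | mu_k, Sigma_k) for every frame/component pair
    log_g = -0.5 * ((diff ** 2 / variances).sum(axis=2)
                    + np.log(variances).sum(axis=1)
                    + D * np.log(2.0 * np.pi))                 # (T, M)
    a = np.log(weights)[None, :] + log_g
    # log-sum-exp over components: log p(x_t | lambda)
    amax = a.max(axis=1, keepdims=True)
    log_p = amax[:, 0] + np.log(np.exp(a - amax).sum(axis=1))
    return log_p.mean()                                        # (1/T) sum_t log p(x_t)
```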
<h2 id="2-speaker-enrollment">2. Speaker Enrollment</h2>
<p>The last step before verification is speaker enrollment. The aim is to train one Gaussian Mixture Model on the extracted features of each speaker, resulting in 20 models if we have 20 speakers.</p>
<p>There are 2 approaches to model the speakers:</p>
<ul>
<li>train a lower dimensional GMM (64 to 256 mixtures) depending on the amount of enrollment data that we have</li>
<li>adapt the UBM GMM to the speaker model using Maximum a Posteriori Adaptation (MAP), usually the approach selected</li>
</ul>
<p>In MAP, we simply start the EM algorithm from the parameters learned by the UBM. In this step, we only adapt the means, not the covariances, since updating the covariances does not improve performance.</p>
<p>For the mean to update, we perform a <em>maximum a posteriori adaptation</em> :</p>
<script type="math/tex; mode=display">\mu_k^{MAP} = \alpha_k E_k(x) + (1 - \alpha_k) \mu_k^{UBM}</script>
<p>Where :</p>
<ul>
<li><script type="math/tex">E_k(x)</script> is the mean of the adaptation data assigned to the <script type="math/tex">k^{th}</script> component</li>
<li><script type="math/tex">\alpha_k = \frac{n_k}{n_k + \tau}</script> is the mean adaptation coefficient</li>
<li><script type="math/tex">n_k</script> is the soft count of adaptation frames assigned to component <script type="math/tex">k</script></li>
<li><script type="math/tex">\tau</script> is the relevance factor, typically between 8 and 32</li>
</ul>
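<p>Putting the adaptation equations together, a minimal means-only MAP update looks as follows. Responsibilities are computed under the UBM, and <code>tau</code> is the relevance factor; this is a sketch of one EM-style pass, not a full enrollment pipeline:</p>

```python
import numpy as np

def map_adapt_means(X, weights, means, variances, tau=16.0):
    """MAP-adapt the UBM means toward enrollment data X (means only).

    X: (T, D); weights: (M,); means, variances: (M, D).
    Returns the adapted (M, D) mean matrix.
    """
    D = X.shape[1]
    diff = X[:, None, :] - means[None, :, :]
    log_g = -0.5 * ((diff ** 2 / variances).sum(axis=2)
                    + np.log(variances).sum(axis=1)
                    + D * np.log(2.0 * np.pi))
    a = np.log(weights)[None, :] + log_g
    post = np.exp(a - a.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)          # responsibilities (T, M)
    n = post.sum(axis=0)                             # n_k: soft count per component
    E = post.T @ X / np.maximum(n, 1e-10)[:, None]   # data mean per component
    alpha = (n / (n + tau))[:, None]                 # adaptation coefficient alpha_k
    return alpha * E + (1.0 - alpha) * means         # mu_k^MAP
```

<p>Components that see little enrollment data (small <script type="math/tex">n_k</script>) stay close to the UBM means, which is exactly the behavior the relevance factor is meant to control.</p>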
<h2 id="3-speaker-verification">3. Speaker Verification</h2>
<p>For a test sample, we compute the log-likelihood score under the GMM of the claimed identity (from the enrollment set), subtract the score under the UBM, and obtain the log-likelihood ratio. We then compare this score to our threshold (usually 0) and accept or reject the claimed identity.</p>
<p>However, the scores are not always independent of the speaker, and there may also be a mismatch between the enrollment and the test data. For this reason, score normalization has been widely studied in the literature. Popular techniques include:</p>
<ul>
<li>cohort-based normalizations</li>
<li>centered impostor distribution</li>
<li>Znorm</li>
<li>Hnorm</li>
<li>Tnorm</li>
<li>HTnorm</li>
<li>Cnorm</li>
<li>Dnorm</li>
<li>WMAP</li>
</ul>
<h1 id="limits-of-gmm-ubm">Limits of GMM-UBM</h1>
<p>Nowadays, GMM-UBM systems are no longer state of the art. In particular, they require large amounts of training data. Better-performing approaches have been developed, such as:</p>
<ul>
<li>SVM-based methods</li>
<li>I-vector methods</li>
<li>Deep-learning based methods</li>
</ul>