In this article, we will build a language recognition app using Markov chains and a likelihood decoding algorithm. If you have not read my previous articles on this topic, I invite you to check them out :)

Language Recognition

We aim to build a small web app able to recognize the language of an input text. We will:

  • Build a transition probability matrix from Wikipedia articles
  • Find the most “likely” language by multiplying the transition probabilities along a given sequence
  • Return the language with the highest likelihood

Build the transition probability matrix

Start by importing the following packages:

import numpy as np
import matplotlib.pyplot as plt

# Parsing
import requests
from bs4 import BeautifulSoup
import re

Then, we’ll pick one of the longest articles on Wikipedia, South African Labour Law, and learn transition probabilities from it. Transition probabilities are simply the probabilities of moving from a given letter (say s) to another letter (say o).
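To make this concrete, here is a tiny sketch on a made-up toy string (not from the article) of how such a probability is estimated:

from collections import Counter

# Toy corpus: count letter-to-letter transitions
toy = 'so so sad'
pairs = Counter(zip(toy, toy[1:]))

# P(o | s) = count(s -> o) / count(s -> anything)
total_from_s = sum(n for (first, _), n in pairs.items() if first == 's')
print(pairs[('s', 'o')] / total_from_s)  # 2 of the 3 transitions from 's' go to 'o'

Let’s start by fetching the real article: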

url = 'https://en.wikipedia.org/wiki/South_African_labour_law'
res = requests.get(url)
html_page = res.content

We can then parse the content:

soup = BeautifulSoup(html_page, 'html.parser')
text = soup.find_all(text=True)

We will use the following code to keep only the visible text, filtering out blacklisted tags:

output = ''

blacklist = [
    '[document]',
    'noscript',
    'header',
    'html',
    'meta',
    'head', 
    'input',
    'script',
    '\n'
]

for t in text:
    if t.parent.name not in blacklist:
        output += '{} '.format(t)

The extracted text still contains formatting noise and elements specific to Wikipedia. We need to clean it a bit more:

def preprocess_text(text):
    # Drop newlines, Wikipedia '[ edit ]' markers and escaped quotes
    text = text.replace('\n', '').replace('[ edit ]', '').replace("\'", "'")
    # Lowercase everything and drop the digits
    text = ''.join(c.lower() for c in text if not c.isdigit())
    # Collapse anything that is not a letter into a single space
    text = re.sub('[^A-Za-z]+', ' ', text)
    return text

And apply the function to the extracted text:

text = preprocess_text(output)
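As a quick check on a made-up input (not from the article), the function keeps only lowercase letters and spaces:

print(preprocess_text("Hello, World 42!"))
# -> 'hello world '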

Then, define two dictionaries we’ll need later on: one mapping indices to letters, and its inverse:

letters = ' abcdefghijklmnopqrstuvwxyz'

# Index -> letter; index 0 is the space character, matching the matrix rows below
dic = {i: c for i, c in enumerate(letters)}

# Letter -> index, the exact inverse of dic
dic_2 = {c: i for i, c in enumerate(letters)}

Alright, we now need to go through the whole text and count the number of times we moved from each letter to the next. I have kept the implementation deliberately simple for explainability purposes; there are several ways to improve this part (see the more compact sketch after this block):

a = np.zeros(27)
b = np.zeros(27)
c = np.zeros(27)
d = np.zeros(27)
e = np.zeros(27)
f = np.zeros(27)
g = np.zeros(27)
h = np.zeros(27)
i = np.zeros(27)
j = np.zeros(27)
k = np.zeros(27)
l = np.zeros(27)
m = np.zeros(27)
n = np.zeros(27)
o = np.zeros(27)
p = np.zeros(27)
q = np.zeros(27)
r = np.zeros(27)
s = np.zeros(27)
t = np.zeros(27)
u = np.zeros(27)
v = np.zeros(27)
w = np.zeros(27)
x = np.zeros(27)
y = np.zeros(27)
z = np.zeros(27)
space = np.zeros(27)

prev = ' '

for char in text:
    if prev == ' ':
        space[dic_2[char]] += 1
    elif prev == 'a':
        a[dic_2[char]] += 1
    elif prev == 'b':
        b[dic_2[char]] += 1
    elif prev == 'c':
        c[dic_2[char]] += 1
    elif prev == 'd':
        d[dic_2[char]] += 1
    elif prev == 'e':
        e[dic_2[char]] += 1
    elif prev == 'f':
        f[dic_2[char]] += 1
    elif prev == 'g':
        g[dic_2[char]] += 1
    elif prev == 'h':
        h[dic_2[char]] += 1
    elif prev == 'i':
        i[dic_2[char]] += 1
    elif prev == 'j':
        j[dic_2[char]] += 1
    elif prev == 'k':
        k[dic_2[char]] += 1
    elif prev == 'l':
        l[dic_2[char]] += 1
    elif prev == 'm':
        m[dic_2[char]] += 1
    elif prev == 'n':
        n[dic_2[char]] += 1
    elif prev == 'o':
        o[dic_2[char]] += 1
    elif prev == 'p':
        p[dic_2[char]] += 1
    elif prev == 'q':
        q[dic_2[char]] += 1
    elif prev == 'r':
        r[dic_2[char]] += 1
    elif prev == 's':
        s[dic_2[char]] += 1
    elif prev == 't':
        t[dic_2[char]] += 1
    elif prev == 'u':
        u[dic_2[char]] += 1
    elif prev == 'v':
        v[dic_2[char]] += 1
    elif prev == 'w':
        w[dic_2[char]] += 1
    elif prev == 'x':
        x[dic_2[char]] += 1
    elif prev == 'y':
        y[dic_2[char]] += 1
    elif prev == 'z':
        z[dic_2[char]] += 1

    prev = char

At this point, we have raw counts, which we need to turn into probabilities by normalizing each row:

a = a / np.sum(a)
b = b / np.sum(b)
c = c / np.sum(c)
d = d / np.sum(d)
e = e / np.sum(e)
f = f / np.sum(f)
g = g / np.sum(g)
h = h / np.sum(h)
i = i / np.sum(i)
j = j / np.sum(j)
k = k / np.sum(k)
l = l / np.sum(l)
m = m / np.sum(m)
n = n / np.sum(n)
o = o / np.sum(o)
p = p / np.sum(p)
q = q / np.sum(q)
r = r / np.sum(r)
s = s / np.sum(s)
t = t / np.sum(t)
u = u / np.sum(u)
v = v / np.sum(v)
w = w / np.sum(w)
x = x / np.sum(x)
y = y / np.sum(y)
z = z / np.sum(z)
space = space / np.sum(space)

To obtain the final matrix, we can declare:

trans_eng = np.matrix([space, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z])
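As promised, here is one of those possible improvements: a more compact version of the counting and normalization above. This is a sketch equivalent in spirit to the code we just wrote, not the original implementation:

# Accumulate all transitions in a single 27 x 27 count matrix
counts = np.zeros((27, 27))
prev = ' '
for char in text:
    counts[dic_2[prev], dic_2[char]] += 1
    prev = char

# Row-normalize to get probabilities; the guard avoids dividing by zero
# for a letter that never occurs in the training text
row_sums = counts.sum(axis=1, keepdims=True)
trans_eng = counts / np.maximum(row_sums, 1)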

I have summarized all of this into a function. Here is the whole code needed:

# Letter/index lookups shared by the functions below
letters = ' abcdefghijklmnopqrstuvwxyz'
dic = {i: c for i, c in enumerate(letters)}    # index -> letter
dic_2 = {c: i for i, c in enumerate(letters)}  # letter -> index

def preprocess_text(text):
    # Drop newlines, Wikipedia '[ edit ]' markers and escaped quotes
    text = text.replace('\n', '').replace('[ edit ]', '').replace("\'", "'")
    # Lowercase everything and drop the digits
    text = ''.join(c.lower() for c in text if not c.isdigit())
    # Collapse anything that is not a letter into a single space
    text = re.sub('[^A-Za-z]+', ' ', text)
    return text

def compute_transition(url_input):

    url = url_input
    res = requests.get(url)
    html_page = res.content

    soup = BeautifulSoup(html_page, 'html.parser')
    text = soup.find_all(text=True)

    output = ''
    blacklist = [
        '[document]',
        'noscript',
        'header',
        'html',
        'meta',
        'head', 
        'input',
        'script',
        '\n'
    ]

    for t in text:
        if t.parent.name not in blacklist:
            output += '{} '.format(t)

    text = preprocess_text(output)

    a = np.zeros(27)
    b = np.zeros(27)
    c = np.zeros(27)
    d = np.zeros(27)
    e = np.zeros(27)
    f = np.zeros(27)
    g = np.zeros(27)
    h = np.zeros(27)
    i = np.zeros(27)
    j = np.zeros(27)
    k = np.zeros(27)
    l = np.zeros(27)
    m = np.zeros(27)
    n = np.zeros(27)
    o = np.zeros(27)
    p = np.zeros(27)
    q = np.zeros(27)
    r = np.zeros(27)
    s = np.zeros(27)
    t = np.zeros(27)
    u = np.zeros(27)
    v = np.zeros(27)
    w = np.zeros(27)
    x = np.zeros(27)
    y = np.zeros(27)
    z = np.zeros(27)
    space = np.zeros(27)

    prev = ' '

    for char in text:
        if prev == ' ':
            space[dic_2[char]] += 1
        elif prev == 'a':
            a[dic_2[char]] += 1
        elif prev == 'b':
            b[dic_2[char]] += 1
        elif prev == 'c':
            c[dic_2[char]] += 1
        elif prev == 'd':
            d[dic_2[char]] += 1
        elif prev == 'e':
            e[dic_2[char]] += 1
        elif prev == 'f':
            f[dic_2[char]] += 1
        elif prev == 'g':
            g[dic_2[char]] += 1
        elif prev == 'h':
            h[dic_2[char]] += 1
        elif prev == 'i':
            i[dic_2[char]] += 1
        elif prev == 'j':
            j[dic_2[char]] += 1
        elif prev == 'k':
            k[dic_2[char]] += 1
        elif prev == 'l':
            l[dic_2[char]] += 1
        elif prev == 'm':
            m[dic_2[char]] += 1
        elif prev == 'n':
            n[dic_2[char]] += 1
        elif prev == 'o':
            o[dic_2[char]] += 1
        elif prev == 'p':
            p[dic_2[char]] += 1
        elif prev == 'q':
            q[dic_2[char]] += 1
        elif prev == 'r':
            r[dic_2[char]] += 1
        elif prev == 's':
            s[dic_2[char]] += 1
        elif prev == 't':
            t[dic_2[char]] += 1
        elif prev == 'u':
            u[dic_2[char]] += 1
        elif prev == 'v':
            v[dic_2[char]] += 1
        elif prev == 'w':
            w[dic_2[char]] += 1
        elif prev == 'x':
            x[dic_2[char]] += 1
        elif prev == 'y':
            y[dic_2[char]] += 1
        elif prev == 'z':
            z[dic_2[char]] += 1

        prev = char

    a = a / np.sum(a)
    b = b / np.sum(b)
    c = c / np.sum(c)
    d = d / np.sum(d)
    e = e / np.sum(e)
    f = f / np.sum(f)
    g = g / np.sum(g)
    h = h / np.sum(h)
    i = i / np.sum(i)
    j = j / np.sum(j)
    k = k / np.sum(k)
    l = l / np.sum(l)
    m = m / np.sum(m)
    n = n / np.sum(n)
    o = o / np.sum(o)
    p = p / np.sum(p)
    q = q / np.sum(q)
    r = r / np.sum(r)
    s = s / np.sum(s)
    t = t / np.sum(t)
    u = u / np.sum(u)
    v = v / np.sum(v)
    w = w / np.sum(w)
    x = x / np.sum(x)
    y = y / np.sum(y)
    z = z / np.sum(z)
    space = space / np.sum(space)

    return np.matrix([space, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z])

We can then pick long articles in English, French and Italian, for example:

trans_eng = compute_transition('https://en.wikipedia.org/wiki/South_African_labour_law')
trans_fr = compute_transition('https://fr.wikipedia.org/wiki/Histoire_du_m%C3%A9tier_de_plombier')
trans_it = compute_transition('https://it.wikipedia.org/wiki/Storia_d%27Italia')
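As a quick sanity check, every row of a transition matrix should sum to (approximately) 1, since each row is a probability distribution over the next letter. A nan would reveal a letter that never occurs in the training article:

# Each of the 27 rows should sum to ~1
print(np.asarray(trans_eng).sum(axis=1))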

The transition matrices are now computed! Let’s move on to the language recognition part.

Identify the language

We will now identify the language based on the transition likelihood. For each language, we look up the transition probability from each letter to the next, multiply these probabilities along the whole sequence, and return the language with the highest product.

def rec_language(dic, dic_2, bi_eng, bi_fr, bi_it, seq):

    seq = preprocess_text(seq)

    key_0 = 0  # start from the space state
    trans_eng = 1
    trans_fra = 1
    trans_it = 1

    for letter in seq:
        # Skip any unknown character missed by the pre-processing
        try:
            key_1 = dic_2[letter]

            trans_eng = trans_eng * bi_eng[key_0, key_1]
            trans_fra = trans_fra * bi_fr[key_0, key_1]
            trans_it = trans_it * bi_it[key_0, key_1]

            key_0 = key_1
        except KeyError:
            continue

    if trans_eng > trans_fra and trans_eng > trans_it:
        print("It's English!")
    elif trans_fra > trans_eng and trans_fra > trans_it:
        print("It's French!")
    else:
        print("It's Italian!")

We can now try it on some sentences! First, a French one:

rec_language(dic, dic_2, trans_eng, trans_fr, trans_it, "Quel beau temps aujourd'hui !")

Returns:

It's French!

Then, an English one:

rec_language(dic, dic_2, trans_eng, trans_fr, trans_it, 'What a nice weather today !')

Returns:

It's English!

And an Italian one:

rec_language(dic, dic_2, trans_eng, trans_fr, trans_it, 'Che bello tempo fa oggi !')

Returns:

It's Italian!

Potential improvements

We estimated the transition probabilities from single Wikipedia articles. To develop a more robust solution, we should train on a much larger corpus.

The text pre-processing is not perfect, and we should extend it; for instance, accented letters (common in French and Italian) are currently stripped out entirely.

Finally, we only tested three languages, but the solution generalizes directly to others, as sketched below.
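For instance, a generalized scorer could take a dictionary of transition matrices and work in log-space, which also avoids the numerical underflow caused by multiplying many small probabilities on long inputs. Here is a sketch; the function name and the epsilon are mine, not part of the code above:

def rec_language_multi(matrices, seq):
    # matrices: dict mapping a language name to its transition matrix
    seq = preprocess_text(seq)
    scores = {lang: 0.0 for lang in matrices}
    key_0 = 0
    for letter in seq:
        try:
            key_1 = dic_2[letter]
        except KeyError:
            continue
        for lang, m in matrices.items():
            # Sum log-probabilities; the epsilon avoids log(0)
            scores[lang] += np.log(m[key_0, key_1] + 1e-12)
        key_0 = key_1
    return max(scores, key=scores.get)

# Hypothetical usage:
# rec_language_multi({'English': trans_eng, 'French': trans_fr,
#                     'Italian': trans_it}, 'What a nice day!')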

Standalone App with Voilà

You might have heard of Voilà, a tool that lets you run your Jupyter notebooks as standalone apps. Let’s try it out!

Start by installing Voilà:

 pip install voila

Then, in the notebook, create an interactive widget:

 from ipywidgets import interact
 
 def reco_interactive(x):
     return rec_language(dic, dic_2, trans_eng, trans_fr, trans_it, x)

And run the interactive cell:

 interact(reco_interactive, x='Hi there!');

You should see something like this:

[Image: the interact widget, a text box with the predicted language printed below]

Then, to create an app from it, simply run this from your terminal, in the notebook’s folder:

 voila notebook.ipynb

You’ll have access to a webpage where the interactive widget works as a standalone app!

[Image: the same widget served by Voilà as a standalone web page]

Conclusion

Although text embeddings and deep learning seem to be everywhere nowadays, simple approaches like likelihood decoding and Markov chains can bring real value when we need a light, fast and explainable solution (think about including this in a smartphone’s software, for example). I hope this was useful, and don’t hesitate to comment if you have any questions.