Markov chains in Java: Suggest what Narendra Modi will say using Markov chains


Recently, I read an article[1] on Markov chains. In the post, author showed how we can build autocomplete functionality using them. The article piqued my interest to learn more about Markov chain and I started looking for an example application that I can build using it. I decided to build a web application that will suggest me what Indian prime minister Narendra Modi[2] will say after a word/pair of words/triplet for words.

I am not a supporter of Narendra Modi style of leadership. The reason I chose him is because I could easily find text of all his speeches on the web [3].

This post is divided into three sections:

  1. What is Markov chain?
  2. Create dataset for the application
  3. Build the application that uses Markov chain

Before we start let’s look at the application video to better understand what we are trying to build.

What is Markov chain?

A markov chain is a mathematical system that transitions from one system to another according to certain probabilistic rules. The key idea behind Markov chain is that the next state of the process depends on the previous state and not sequence of events.They are widely employed in economics, game theory, communication theory, genetics and finance.

The example application url is https://desolate-meadow-59571.herokuapp.com/

The Github reposistory for the application is https://github.com/shekhargulati/what-will-X-say

Create dataset for the application

Now that we understand what Markov chain let’s talk about how we can create dataset for the application. For this application, we need to fetch text of all the web links from the website[3] and save them on filesystem. This is what we need to do:

  1. Find URLS of all web pages from the infinite scrollable web page
  2. Find text from the HTML web page
  3. Convert Hindi text to English
  4. Save each web page english text to a file

Find URLs of all the web pages from the infinite scrollable web page

To find all the URLs, I used the fact that the website fetches data by making an HTTP GET request to https://www.narendramodi.in/speech/loadspeeche?page=1&language=en for each page. I wrote a simple program using jsoup library[4] to fetch all the URLs by selecting all anchor tags with a particular CSS selector.

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public static void main(String[] args) throws Exception {
    List<String> urls = new ArrayList<>();
    for (int i = 1; i < 100; i++) {
        Connection connect = Jsoup.connect(String.format("https://www.narendramodi.in/speech/loadspeeche?page=%d&language=en", i));
        Document doc = connect.get();
        Elements elements = doc.select(".speeches .speechesBox .speechesItemLink a");
        for (Element element : elements) {
            urls.add(element.attr("href"));
        }
    }

    Path file = Files.write(Paths.get("urls.txt"), urls);
    System.out.println(file.toAbsolutePath());

}

The above code will make request to teh web page and find all anchor tags matching the CSS selector and add them to a list. Finally, we write list to a file named urls.txt.

The content of file will look like as shown below. A single speech URL per line.

https://www.narendramodi.in/pm-modi-inaugurates-jan-aushadhi-centres-through-video-conferencing
https://www.narendramodi.in/text-of-pm-s-speech-at-foundation-stone-laying-ceremony-of-shikshan-bhavan-vidyarthi-bhavan-in-adalaj-
https://www.narendramodi.in/text-of-pm-s-speech-at-inauguration-of-new-civil-cancer-eye-hospitals-in-ahmedabad-
https://www.narendramodi.in/text-of-pm-s-speech-at-foundation-stone-laying-ceremony-of-vishwa-umiyadham-temple-complex-in-jaspur-
https://www.narendramodi.in/text-of-pm-s-address-at-the-india-today-conclave-2019

I initially wanted to do this step using Puppeteer as mentioned in the post [5] but Puppeteer script was failing to scroll the page because GET request to https://www.narendramodi.in/speech/loadspeeche?page=2&amp;language=en were going to Pending status.

Find text from the HTML web page

Now that we have all the urls we can find article texts from them. To find text from HTML, I used Boilerpipe library [6].

import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;
import de.l3s.boilerpipe.sax.HTMLDocument;
import de.l3s.boilerpipe.sax.HTMLFetcher;

public static void main(String[] args) throws Exception {
    DatasetGenerator.writeTextToOutputDir(
            Paths.get("src", "main", "resources", "urls.txt"),
            Paths.get("src", "main", "resources", "speeches")
    );
}


public static void writeTextToOutputDir(Path inputFile, Path outputDir) throws IOException {
    List<String> urls = Files.readAllLines(inputFile);
    int totalUrls = urls.size();
    for (int i = 0; i < totalUrls; i++) {
        String url = urls.get(i);
        System.out.println(String.format("Downloading text for url %d - %s", i, url));
        String text = text(url);
        Files.write(
                outputDir.resolve(String.format("%d_%s.txt", i + 1, UUID.randomUUID().toString())),
                text.getBytes());
    }
}

private static String text(String url) {
    try {
        HTMLDocument htmlDocument = HTMLFetcher.fetch(new URL(url));
        final TextDocument doc = new BoilerpipeSAXInput(htmlDocument.toInputSource()).getTextDocument();
        return ArticleExtractor.INSTANCE.getText(doc);
    } catch (Exception e) {
        // skipping exception
        throw new RuntimeException(e);
    }
}

The code above does the following:

  1. We iterate over all the urls
  2. For each URL, we get the text of the document using the Boilerpipe API
  3. We write each file to output directory

Convert Hindi text to English

The text files that we have downloaded are in Hindi. I wanted to convert the text to English so I decided to use Google translate API. The problem with Translate API is that you need API keys. You can get API keys by registering your billing information with Google Cloud. Few years back I worked with textblob[7] library and I remember that I didn’t had to provide it Google Translate API key to work.

This made me look into source code of textblob library. I found the translate.py Python file that was reponsible for providing Google translate support. The textblob library programmatically submits the form that is loaded when you go to google.translate.com. To submit the form programmatically, you have to pass the CSRF token. textblob library uses the reverse engineered code to generate the CSRF token. You can see the code on Github here[8].

I couldn’t find a Java library that generates the CSRF token so I took the JavaScript code and using Java’s ScriptEngine evaluate called the JavaScript function from the Java code.

The code for GoogleTranslator is shown below.

import okhttp3.*;

import javax.script.Invocable;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;
import java.io.IOException;
import java.net.URLEncoder;

public class GoogleTranslator {

    private ScriptEngine engine = new ScriptEngineManager().getEngineByName("JavaScript");
    private Invocable invocable;
    private String url;
    private OkHttpClient client = new OkHttpClient();

    public GoogleTranslator() {
        this.url = "http://translate.google.com/translate_a/t?client=webapp&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&dt=at&ie=UTF-8&oe=UTF-8&otf=2&ssel=0&tsel=0&kc=1";
        evaluateScript();
    }

    private void evaluateScript() {
        String script = "function tk(a) {"
                + "var TKK = ((function() {var a = 561666268;var b = 1526272306;return 406398 + '.' + (a + b); })());\n"
                + "function b(a, b) { for (var d = 0; d < b.length - 2; d += 3) { var c = b.charAt(d + 2), c = 'a' <= c ? c.charCodeAt(0) - 87 : Number(c), c = '+' == b.charAt(d + 1) ? a >>> c : a << c; a = '+' == b.charAt(d) ? a + c & 4294967295 : a ^ c } return a }\n"
                + "for (var e = TKK.split('.'), h = Number(e[0]) || 0, g = [], d = 0, f = 0; f < a.length; f++) {"
                + "var c = a.charCodeAt(f);"
                + "128 > c ? g[d++] = c : (2048 > c ? g[d++] = c >> 6 | 192 : (55296 == (c & 64512) && f + 1 < a.length && 56320 == (a.charCodeAt(f + 1) & 64512) ? (c = 65536 + ((c & 1023) << 10) + (a.charCodeAt(++f) & 1023), g[d++] = c >> 18 | 240, g[d++] = c >> 12 & 63 | 128) : g[d++] = c >> 12 | 224, g[d++] = c >> 6 & 63 | 128), g[d++] = c & 63 | 128)"
                + "}"
                + "a = h;"
                + "for (d = 0; d < g.length; d++) a += g[d], a = b(a, '+-a^+6');"
                + "a = b(a, '+-3^+b+-f');"
                + "a ^= Number(e[1]) || 0;"
                + "0 > a && (a = (a & 2147483647) + 2147483648);"
                + "a %= 1E6;"
                + "return a.toString() + '.' + (a ^ h)\n"
                + "}";
        try {
            engine.eval(script);
        } catch (ScriptException e) {
            throw new RuntimeException(e);
        }
        invocable = (Invocable) engine;
    }

    public String translate(String source, String fromLang, String toLang) {
        try {
            String updatedUrl = String.format(
                    "%s&sl=%s&tl=%s&hl=%s&tk=%s",
                    this.url,
                    fromLang,
                    toLang,
                    toLang,
                    calculateTk(source)
            );

            String postParam = "q=" + URLEncoder.encode(source, "UTF-8");
            Request request = new Request.Builder()
                    .url(updatedUrl)
                    .header("Accept", "*/*")
                    .addHeader("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) ")
                    .addHeader("User-Agent", "AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168 Safari/535.19)")
                    .post(RequestBody.create(MediaType.get("application/x-www-form-urlencoded"), postParam))
                    .build();

            try (Response response = client.newCall(request).execute()) {
                ResponseBody body = response.body();
                return body != null ? body.string() : null;
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }

    }

    private String calculateTk(String str) {
        try {
            return (String) invocable.invokeFunction("tk", str);
        } catch (ScriptException | NoSuchMethodException e) {
            throw new RuntimeException(e);
        }
    }

In the code shown above, we make submit form by making HTTP POST request. We evaluate the JavaScript function to calculate the CSRF token and pass it during the form submission. Now, this API can be used for converting text from one language to another.

Save each web page english text to a file

This step is simple now. We have to update our code in step 2 to make use of the Google Translator. The updated code is shown below.

public static void writeTextToOutputDir(Path inputFile, Path outputDir) throws IOException {
    GoogleTranslator translator = new GoogleTranslator();
    List<String> urls = Files.readAllLines(inputFile);
    int totalUrls = urls.size();
    for (int i = 0; i < totalUrls; i++) {
        String url = urls.get(i);
        System.out.println(String.format("Downloading text for url %d - %s", i, url));
        String text = text(url);
        translator.translate(text, "en", "hi");
        Files.write(
                outputDir.resolve(String.format("%d_%s.txt", i + 1, UUID.randomUUID().toString())),
                text.getBytes());
    }
}

Build the application that uses Markov chain

Now, we will build a simple Spring Boot web application that will expose suggestion REST API.

We will start by writing a simple Java program that will build the 5 order Markov chain as shown below.

public class SuggestionService {


    Map<String, Map<String, Integer>> first = new HashMap<>();
    Map<String, Map<String, Integer>> second = new HashMap<>();
    Map<String, Map<String, Integer>> third = new HashMap<>();
    Map<String, Map<String, Integer>> fourth = new HashMap<>();
    Map<String, Map<String, Integer>> fifth = new HashMap<>();

    @Value("${app.speech.data.path}")
    private String speechDataPath;

    public SuggestionService() {
    }

    @PostConstruct
    public void init() throws Exception {
        Stream<Path> files = Files.list(Paths.get(speechDataPath));

        String data = files.flatMap(file -> {
            try {
                return Files.lines(file);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }).collect(Collectors.joining(" "));


        String[] sentences = data.toLowerCase()
                .replace("\\r", " ")
                .replace("\\n", " ")
                .replace("?", ".")
                .replace("\\", " ")
                .replace("/", " ")
                .replace("!", ".")
                .replace("“", ".")
                .replace("”", ".")
                .replace("\"", ".")
                .replace("‘", " ")
                .replace("-", " ")
                .replace("’", " ")
                .replace("\"", " ")
                .split("\\.");

        for (String sentence : sentences) {
            String[] wordsInSent = sentence.replace(",", " ").split(" ");
            String[] words = Arrays.stream(wordsInSent).filter(word -> !word.trim().isEmpty()).toArray(String[]::new);
            if (words.length == 0) {
                continue;
            }
            for (int i = 0; i < words.length; i++) {
                if (i >= 1) {
                    String seq = toSeq(words, i, 1);
                    if (seq != null) {
                        updateOcc(first, seq, words[i]);
                    }
                }
                if (i >= 2) {
                    String seq = toSeq(words, i, 2);
                    if (seq != null) {
                        updateOcc(second, seq, words[i]);
                    }
                }
                if (i >= 3) {
                    String seq = toSeq(words, i, 3);
                    if (seq != null) {
                        updateOcc(third, seq, words[i]);
                    }
                }
                if (i >= 4) {
                    String seq = toSeq(words, i, 4);
                    if (seq != null) {
                        updateOcc(fourth, seq, words[i]);
                    }
                }
                if (i >= 5) {
                    String seq = toSeq(words, i, 5);
                    if (seq != null) {
                        updateOcc(fifth, seq, words[i]);
                    }
                }
            }

        }
    }

    private String toSeq(String[] words, int i, int max) {
        if (i <= words.length) {
            return IntStream.range(i - max, i).mapToObj(j -> words[j]).collect(Collectors.joining(" "));
        }
        return null;
    }


    private void updateOcc(Map<String, Map<String, Integer>> dict, String seq, String w) {
        if (!dict.containsKey(seq)) {
            dict.put(seq, new HashMap<>());
        }
        Map<String, Integer> map = dict.get(seq);
        map.put(w, map.getOrDefault(w, 0) + 1);
    }
}

In the code shown above,

  1. We start by reading all the files in a data directory
  2. We clean out the text and convert them to an array of sentences
  3. For each sentence we update the Markov chain level map. Each Markov chain level is a Map of word with Map of next possible words and their occurrence count.

Next, we will add couple of more methods that our REST API will call to get suggestions from these Markov chains as shown below.

public List<String> suggestions(String input) {
    input = input.toLowerCase().trim();
    String[] words = Arrays.stream(input.split(" ")).filter(s -> s.trim().length() > 0).toArray(String[]::new);

    int length = words.length;
    int lastIdx = length - 1;
    String seq = toSeq(words, lastIdx + 1, length);
    Map<String, Map<String, Integer>> map = getMap(length);
    if (!map.containsKey(seq)) {
        return Collections.emptyList();
    }
    Map<String, Integer> t = map.get(seq);
    return sortSuggestions(input, t);
}


private List<String> sortSuggestions(String input, Map<String, Integer> suggestions) {
    int total = suggestions.values().stream().mapToInt(i -> i).sum();
    Comparator<Map.Entry<String, Integer>> comparator = Comparator.comparingInt(Map.Entry::getValue);
    List<Map.Entry<String, Integer>> entries = suggestions.entrySet().stream()
            .sorted(comparator.reversed())
            .collect(toList());

    return entries.stream()
            .map(e -> {
                String likelihood = decimalFormat.format(Double.valueOf(((double) e.getValue() / total) * 100));
                return input + " " + e.getKey() + " " + likelihood  + "%";
            })
            .collect(toList());
}

REST API is exposed using Spring Reactive support. You can look at the Github repository to learn more about it.

Our frontend is a simple web page that uses typeahead.js library to consume the REST API as shown below.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>What will X say?</title>
    <a href="https://ajax.googleapis.com/ajax/libs/jquery/1.12.4/jquery.min.js">https://ajax.googleapis.com/ajax/libs/jquery/1.12.4/jquery.min.js</a>
    <a href="https://cdnjs.cloudflare.com/ajax/libs/typeahead.js/0.11.1/typeahead.bundle.min.js">https://cdnjs.cloudflare.com/ajax/libs/typeahead.js/0.11.1/typeahead.bundle.min.js</a>

$(document).ready(function(){
    // Defining the local dataset

    // Constructing the suggestion engine
    var suggestions = new Bloodhound({
        datumTokenizer: Bloodhound.tokenizers.whitespace,
        queryTokenizer: Bloodhound.tokenizers.whitespace,
        remote: {
            url: '/api/v1/suggest?input=%QUERY',
            wildcard: '%QUERY'
        }
    });

    suggestions.initialize();

    // Initializing the typeahead
    $('.typeahead').typeahead({
        highlight: true
    },
    {
        limit: 20,
        source: suggestions
    });
});



    <style type="text/css">
.bs-example {
    font-family: sans-serif;
    position: relative;
    margin: 100px;
}
.typeahead, .tt-query, .tt-hint {
    border: 2px solid #CCCCCC;
    border-radius: 8px;
    font-size: 22px; /* Set input font size */
    height: 30px;
    line-height: 30px;
    outline: medium none;
    padding: 8px 12px;
    width: 396px;
}
.typeahead {
    background-color: #FFFFFF;
}
.typeahead:focus {
    border: 2px solid #0097CF;
}
.tt-query {
    box-shadow: 0 1px 1px rgba(0, 0, 0, 0.075) inset;
}
.tt-hint {
    color: #999999;
}
.tt-menu {
    background-color: #FFFFFF;
    border: 1px solid rgba(0, 0, 0, 0.2);
    border-radius: 8px;
    box-shadow: 0 5px 10px rgba(0, 0, 0, 0.2);
    margin-top: 12px;
    padding: 8px 0;
    width: 422px;
}
.tt-suggestion {
    font-size: 22px;  /* Set suggestion dropdown font size */
    padding: 3px 20px;
}
.tt-suggestion:hover {
    cursor: pointer;
    background-color: #0097CF;
    color: #FFFFFF;
}
.tt-suggestion p {
    margin: 0;
}


    </style>
</head>
<body>
<div class="bs-example">
    <h2>Type something and see what will modi say</h2>

</div>
</body>
</html>

References

  1. Autocomplete using Markov chains – Link
  2. Narendra Modi Wikipedia page – Link
  3. Narendra Modi Text Speeches – Link
  4. jsoup library – Link
  5. Using Puppeteer to scrape websites with infinite scrolling – Link
  6. Boilerpipe library – Link
  7. textblob library – Link
  8. textblob reverse engineered CSRF token code – Link

1 thought on “Markov chains in Java: Suggest what Narendra Modi will say using Markov chains

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s