Building a Spell Checker in Clojure using IntelliJ Idea


Today, I decided to play with a new language so I gave Clojure a try. I have previously played with Lisp so I expected some fun with brackets.

I decided to build a simple spell checker.

Setup

Install the Cursive plugin in IntelliJ. Refer to the docs.

Once installed start by creating a new project.

Next choose the location on the file system and press the create Finish.

It will install the Clojure and Leign. Once project is setup you will see the structure as shown below.

Now that we dev environment setup let’s start creating our spell checker.

There are two main files generated by the plugin

  • core.clj: It is the place where we will define our spell checker logic
  • project.clj: It is our build file
  • Rest are documentation and test files.

The core.clj has following code.

(ns spell-checker.core)

(defn foo
  "I don't do a whole lot."
  [x]
  (println x "Hello, World!"))

It only prints Hello, World!.

It is not executable so we first make it executable by adding the main function as shown below. And, then we call the foo function.

To run this using IntelliJ we have to define a new run configuration as shown below.After setting the values press Apply and then Ok.

Now, you will be able to run the Clojure program.

Let’s write the Psuedocode of what we to go to implement a simple and basic spell checker.

  • Find English word list on the web
  • Load English words in a Set. We will use since we care about uniqueness and order does not matter
  • Next, we will check if the given word exists in our Set if yes we will return the same word in a list
  • If word does not exist then we find words that are similar to the given word using levenshtein distance
  • Finally, we return a list of similar words

Step 1: Find English word list on the web

I did a quick Google search and found the following English word list https://www.mit.edu/~ecprice/wordlist.10000. It has 10,000 words.

Download this file to resources directory.

Step 2: Load English words in a Set

We used the Clojure slurp function to first load a file to a String. Then, we split the String using split-lines and lower case each word. And, finally we added it to a Set.

(ns spell-checker.core
  (:require [clojure.string :as str]))

(def words
  (set (map str/lower-case (str/split-lines (slurp "resources/words.txt")))))

(defn -main [] (println "Total number of words loaded: " (count words)))

Run the main program and you will get the following output.

Total number of words loaded:  10000

Step 3: Word exists in the Set

Now, we will define our spell-checker function that will check the Set for the word. We used the collection contains? function to check for existence of a word. If word is found we return the word in a list else we return an empty list.

(ns spell-checker.core
  (:require [clojure.string :as str]))

(def words
  (set (map str/lower-case (str/split-lines (slurp "resources/words.txt")))))

(defn spell-checker [input]
  (let [word (str/lower-case input)]
    (let [exactMatchFound? (contains? words word)]
      (if exactMatchFound?
        (list word)
        ()))))

(defn -main []
  (println (spell-checker "English"))
  (println (spell-checker "Englisx"))
  (println (spell-checker "nxsnk"))
)

When we run the code shown above we get following result.

(English)
()
()

The result was as expected. English is an English word so we returned that

Step 4: Find similar words using levenshtein distance

First, we have to update the dependencies section of project.clj

:dependencies [
                 [org.clojure/clojure "1.10.1"]
                 [clj-fuzzy "0.4.1"]
                 ]

We added [clj-fuzzy "0.4.1"] dependency.

First, we will implement the distance function that takes two words and returns their levenshtein distance as shown below.

(ns spell-checker.core
  (:require [clojure.string :as str])
  (:use clj-fuzzy.metrics)
  )

(def words
  (set (map str/lower-case (str/split-lines (slurp "resources/words.txt")))))

(defn distance [word1 word2]
  (levenshtein word1 word2))

Next, we will have to implement similar-words function that is called by spell-checker

(defn spell-checker [input]
  (let [word (str/lower-case input)]
    (let [exactMatchFound? (contains? words word)]
      (if exactMatchFound?
        (list word)
        (similar-words word words)))))

The similar-words function take the word and our English words set as input and return a list of similar words.

I wrote an inefficient version of similar-words function. Please don’t use it. It can be optimized. I am a Clojure newbie so this is my first implementation that works.

(defn similar-words [word words]
  (let [mapF (partial distance word)]
    (take 3
          (map second
               (sort-by first
                        (map list
                             (map mapF words) words))))))

The above code does following

  • We create a partial function mapF. It allows us to convert distance to one argument function. This is required map expects a function that take the collection element and return a response.
  • Next, we apply mapF on words. It will give us the distance between word and collection element.
  • Next, I zip distance list with original words
  • We sort the resultant tuple list by distance
  • Map the list to get only the collection element
  • Finally, take only the three elements from the result list

Now, if I run my code with three examples – English, Englisx, and nxsnk we get following response. The response is sorted by levenshtein distance.

(english)
(english england eagles)
(funk junk desk)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: