Finding all the indexes of a whole word in a given string using java

Intent

To find all the indexes of a whole word in a given searchable string.

Motivation

Today, I had to write a piece of code in which I had to find all the indexes of a particular keyword in a searchable string. Most of the times, when we have to find index of a keyword in a searchable string we use indexOf method of String class.

For example,


String searchableString = “Don’t be evil. Being evil is bad”;

String keyword = “be”;

So, we can find the index of keyword as


int index = searchableString.indexOf(keyword);

This will give us the index of first occurrence of keyword (“be”). Now suppose, we have to find all the indexes of keyword (“be”), we will have to loop over searchableString and find all the indexes of keyword “be”.

This can be done as follows

	public static void findIndexes(){
		String searchableString = "don’t be evil.being evil is bad";
		String keyword = "be";

		int index = searchableString.indexOf(keyword);
		while (index >=0){
			System.out.println("Index : "+index);
			index = searchableString.indexOf(keyword, index+keyword.length())	;
		}

	}

	public static void main(String[] args) {
		findIndexes();
	}

This program will print :

Index :  6

Index : 14

There is a problem in this code as we should not get “Index : 14” because we are looking for a keyword “be” and the index it has found is from word “being” . This just looks for any occurrence of keyword in the string it does not have to be a whole word. To solve this problem, we will be writing a solution using Regular expressions.

Solution

We have to find out the indexes which correspond to the whole word. For example, we should only get “Index : 6” as that corresponds to the keyword “be” not “Index : 14” as it corresponds to the word “being” .

This can be done as follows :-

WholeWordIndexFinder.java

package org.dailywtf.string;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WholeWordIndexFinder {

	private String searchString;

	public WholeWordIndexFinder(String searchString) {
		this.searchString = searchString;
	}

	public List<IndexWrapper> findIndexesForKeyword(String keyword) {
		String regex = "\\b"+keyword+"\\b";
		Pattern pattern = Pattern.compile(regex);
		Matcher matcher = pattern.matcher(searchString);

		List<IndexWrapper> wrappers = new ArrayList<IndexWrapper>();

		while(matcher.find() == true){
			int end = matcher.end();
			int start = matcher.start();
			IndexWrapper wrapper = new IndexWrapper(start, end);
			wrappers.add(wrapper);
		}
		return wrappers;
	}

	public static void main(String[] args) {
		WholeWordIndexFinder finder = new WholeWordIndexFinder("don’t be evil.being evil is bad");
		List<IndexWrapper> indexes = finder.findIndexesForKeyword("be");
		System.out.println("Indexes found "+indexes.size() +" keyword found at index : "+indexes.get(0).getStart());
	}

}

IndexWrapper.java

package org.dailywtf.string;

public class IndexWrapper {

	private int start;
	private int end;

	public IndexWrapper(int start, int end) {
		this.start = start;
		this.end = end;
	}

	public int getEnd() {
		return end;
	}

	public int getStart() {
		return start;
	}

	@Override
	public int hashCode() {
		final int prime = 31;
		int result = 1;
		result = prime * result + end;
		result = prime * result + start;
		return result;
	}

	@Override
	public boolean equals(Object obj) {
		if (this == obj)
			return true;
		if (obj == null)
			return false;
		if (getClass() != obj.getClass())
			return false;
		IndexWrapper other = (IndexWrapper) obj;
		if (end != other.end)
			return false;
		if (start != other.start)
			return false;
		return true;
	}

}

Explanation
First of all a regular expression is built using the keyword. Regex \b defines a word boundary. \b allows you to perform a “whole words only” search using a regular expression in the form of \bword\b.
More information on regex could be found at this link.  Next, we just iterate over the matcher till matcher can find any match. Whenever a match is found an wrapper object is created which contains the start and end position of the keyword.

Using the above code, we can find all the indexes of a keyword in a searchable string.

Advertisements