How MongoDB write/read speed varies with or without index on a field?

Last 3 weeks I have been busy working on a PoC where we are thinking of using MongoDB as our datastore. In this series of blog posts I will be sharing my finding with the community. Please take these experiments with grain of salt and try out these experiments on your dataset and hardware. Also share with me if I am doing something stupid.  In this blog I will be sharing my findings on how index affect the write speed.

Scenario

I will be inserting 60 million documents and will be noting the time taken to write each batch of 10 million records.  The average document size is 2400 bytes (Look at the document in under Document heading). The test will be run first without index on the name field and then with index on the name field.

Conclusion

Write speed with index dropped to 0.27 times of write speed without index after inserting 20 million documents. 

Setup 

Dell Vostro Ubuntu 11.04 box with 4 GB RAM and 300 GB hard disk.

Java 6

MongoDB 2.0.1

Spring MongoDB 1.0.0.M5 which internally uses MongoDB Java driver 2.6.5 version.

Document

The documents I am storing in MongoDB looks like as shown below. The average document size is 2400 bytes.  Please note the _id field also has an index. The index that I will be creating will be on name field.

{
"_id" : ObjectId("4ed89c140cf2e821d503a523"),
"name" : "Shekhar Gulati",
"someId1" : NumberLong(1000006),
"str1" : "U",
"date1" : ISODate("1997-04-10T18:30:00Z"),
"bio" : "I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a
Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java
Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I
am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java
Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I
am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java
Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I
am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java
Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I
am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java
Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I
am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java
Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I
am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java
Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I
am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java
Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I
am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java
Developer. I am a Java Developer. "
}

JUnit TestCases

The first JUnit test inserts 10 million record and after every 10 million records dumps the time taken to write batch of 10 million records. Perform  a find query on an unindexed field name and prints the time taken to perform the find operation. This tests runs for 6 batches so 60 million records are inserted.

@Configurable
@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration(locations = "classpath:/META-INF/spring/applicationContext*.xml")
public class Part1Test {

private static final String FILE_NAME = "/home/shekhar/dev/test-data/10mrecords.txt";

	private static final int TOTAL_NUMBER_OF_BATCHES = 6;

	private static final Logger logger = Logger
			.getLogger(SprintOneTestCases.class);

	@Autowired
	MongoTemplate mongoTemplate;

	@Before
	public void setup() {
		mongoTemplate.getDb().dropDatabase();
	}

	@Test
	public void shouldWrite60MillionRecordsWithoutIndex() throws Exception {

		for (int i = 1; i <= TOTAL_NUMBER_OF_BATCHES; i++) {
			logger.info("Running Batch ...." + i);
			long startTimeForOneBatch = System.currentTimeMillis();
		LineIterator iterator = FileUtils.lineIterator(new File(FILE_NAME));
			while (iterator.hasNext()) {
				String line = iterator.next();
				User user = convertLineToObject(line);
				mongoTemplate.insert(user);
			}

			long endTimeForOneBatch = System.currentTimeMillis();
			double timeInSeconds = ((double) (endTimeForOneBatch - startTimeForOneBatch)) / 1000;
			logger.info(String
					.format("Time taken to write %d batch of 10 million records is %.2f seconds",
							i, timeInSeconds));

			Query query = Query.query(Criteria.where("name").is(
					"Shekhar Gulati"));
			logger.info("Unindexed find query for name Shekhar Gulati");
			performFindQuery(query);
			performFindQuery(query);

			CommandResult collectionStats = mongoTemplate
					.getCollection("users").getStats();
			logger.info("Collection Stats : " + collectionStats.toString());

			logger.info("Batch finished running...." + i);
		}

	}

	private void performFindQuery(Query query) {
		long firstFindQueryStartTime = System.currentTimeMillis();
		List<User> query1Results = mongoTemplate.find(query, User.class);
		logger.info("Number of results found are " + query1Results.size());
		long firstFindQueryEndTime = System.currentTimeMillis();
		logger.info("Total Time Taken to do a find operation "
				+ (firstFindQueryEndTime - firstFindQueryStartTime) / 1000
				+ " seconds");
	}

	private User convertLineToObject(String line) {
		String[] fields = line.split(";");
		User user = new User();
		user.setFacebookName(toString(fields[0]));
		user.setSomeId1(toLong(fields[1]));
		user.setStr1(toString(fields[2]));
		user.setDate1(toDate(fields[3]));
		user.setBio(StringUtils.repeat("I am a Java Developer. ", 100));
		return user;
	}

	private long toLong(String field) {
		return Long.parseLong(field);
	}

	private Date toDate(String field) {
		SimpleDateFormat dateFormat = new SimpleDateFormat(
				"yyyy-MM-dd HH:mm:ss");
		Date date = null;
		try {
			date = dateFormat.parse(field);
		} catch (ParseException e) {
			date = new Date();
		}
		return date;
	}

	private String toString(String field) {
		if (StringUtils.isBlank(field)) {
			return "dummy";
		}
		return field;
	}

}

listing 1. 60 million records getting inserted and read without index

In the second test case I first created the index and then started inserting the records. This time find operations were performed on the indexed field name.

	@Test
	public void shouldWrite60MillionRecordsWithIndex()
			throws Exception {

		long startTime = System.currentTimeMillis();
		createIndex();

		for (int i = 1; i <= TOTAL_NUMBER_OF_BATCHES; i++) {
			logger.info("Running Batch ...." + i);
			long startTimeForOneBatch = System.currentTimeMillis();
		LineIterator iterator = FileUtils.lineIterator(new File(FILE_NAME));
			while (iterator.hasNext()) {
				String line = iterator.next();
				User obj = convertLineToObject(line);
				mongoTemplate.insert(obj);
			}

			long endTimeForOneBatch = System.currentTimeMillis();
			logger.info("Total Time Taken to write " + i
					+ " batch of Records in milliseconds : "
					+ (endTimeForOneBatch - startTimeForOneBatch));
	double timeInSeconds = ((double)(endTimeForOneBatch - startTimeForOneBatch))/1000;

                        logger.info(String.format("Time taken to write %d batch of 10 million records is %.2f seconds", i,timeInSeconds));

		Query query = Query.query(Criteria.where("name").is("Shekhar Gulati"));
			logger.info("Indexed find query for name Shekhar Gulati");
			performFindQuery(query);
			performFindQuery(query);

			CommandResult collectionStats = mongoTemplate.getCollection("name").getStats();
			logger.info("Collection Stats : " + collectionStats.toString());

			logger.info("Batch finished running...." + i);
		}

	}

	private void createIndex() {
		IndexDefinition indexDefinition = new Index("name", Order.ASCENDING)
				.named(" name_1");
		long startTimeToCreateIndex = System.currentTimeMillis();
		mongoTemplate.ensureIndex(indexDefinition, User.class);
		long endTimeToCreateIndex = System.currentTimeMillis();
		logger.info("Total Time Taken createIndex "
				+ (endTimeToCreateIndex - startTimeToCreateIndex) / 1000
				+ " seconds");
	}

Write Concern

WriteConcern value was NONE which is fire and forget. You can read more about write concerns here.

Write Resuts

After running the test cases shown above I found out that for the first 10 million i.e. from 0 to 10 million inserts write per second with index was 0.4 times of without index. The more surprising was that for the next batch of 10 million records the write speed with index was reduced to 0.27 times without index.

Looking at the table above you can see that the write speed when we don’t have index is remains consistent and does not degrades. But the write speed when we had index varied a lot from 3492 documents per second to 2281 documents per second. I was not able to complete the test after 20 million as it was taking way too much time to do next 10 million. This can lead to lot of problems in case you added index on a field after you have inserted first 10 million records without index. The write speed is not even consistent and you have to think of sharding to achieve the speed limits you want.

Read Results

Read results don’t show anything interesting except that you should have index on the field you would be querying on otherwise read performance will be very bad. This can be explained very easily because data will not be in RAM and you will be hitting disk. And when you hit disk performance will take a ride.

This is what all I have for this post. I am not making any judgement whether these numbers are good or bad. I think that should be governed by the use case, data, hardware you will working on. Please feel free to comment and share your knowledge.

IBM DeveloperWorks Introducing Spring Roo, Part 4: Rapid application development in cloud with Spring Roo and Cloud Foundry

Take the rapid development of Roo a step further by creating applications to work in the cloud with Cloud Foundry, the first open platform as a service project created by VMWare. Learn more about the environment and then deploy an application into Cloud Foundry using the Roo shell. Read about it here http://www.ibm.com/developerworks/opensource/library/os-springroo4/index.html

Introducing Spring Roo, Part 3: Developing Spring Roo add-ons

Spring Roo is a RAD tool that lets you build applications (mainly web) quickly and easily. Under the hood, Spring Roo is based on OSGI add-on architecture, which makes it easy to extend Spring Roo by adding add-ons. Spring Roo provides commands to create add-ons that can be very easily made available to the Spring Roo user community. In this article, we first talk about Spring Roo architecture, talking about how Spring Roo leverages its own add-on architecture to provide different features, then we will create add-ons using the Roo shell and modify them to suit our needs. Please read the third article at IBM DeveloperWorks.

Java Puzzler : They just find me !!

Couple of days back I wrote a piece of code which was behaving in an unexpected manner. I was confused what was happening. Take a look at the sample code below and predict its behavior

package com.shekhar;

public class JavaPuzzler {

	public static void main(String[] args) {
		JavaPuzzler javaPuzzler = new JavaPuzzler();
		javaPuzzler.doSth();
	}

	public void doSth() {
		float f = 1.2f;
		if (f >= 1.0) {
			f = 0.9999999999999f;
		}
		InnerClass innerClass = new InnerClass(f);
		System.out.println(innerClass.getValue());
	}

	private class InnerClass {

		private float value;

		public InnerClass(float value) {
			if (value >= 1.0f) {
				throw new IllegalArgumentException(
						"Value can't be greater than 1.0f");
			}

			this.value = value;
		}

		public float getValue() {
			return value;
		}
	}
}

My initial expectation was that I would get value 0.9999999999999f as answer.Try it and find the answer. Share your answer and reasoning in comments.

How to exclude a sub package from Spring component scanning?

Suppose you are using Spring component scanning and you don’t want to scan classes in the sub-package you should use <context:exclude-filter> subtag of <context:component-scan> tag as shown below

<context:component-scan base-package="com.test.ws">
	<context:exclude-filter type="aspectj" expression="com.test.ws.client.*" />
</context:component-scan>

Here we have defined a aspectj pointcut which will exclude all the classes in the sub-package.

Hadoop Maven Archetype

Today, I found out the easiest way to generate a maven based Hadoop project using a maven archetype. This will generate a sample Hadoop project which uses hadoop version 0.20.2. The sample project also contains the famous WordCount example. To generate the maven project type following on the command line

mvn archetype:generate -DarchetypeCatalog=http://dev.mafr.de/repos/maven2/ -DgroupId=com.hadoop.example -DartifactId=hadoop-example

You can also get more information about the archetype at http://blog.mafr.de/2010/08/01/maven-archetype-hadoop/

Spring Scoped Proxy Beans – An Alternative to Method Injection

Few months back, I wrote on JavaLobby about how you can use Method Injection in scenarios where bean lifecycles are different i.e. you want to inject a non-singleton bean inside a singleton bean. For those of you who are not aware of Method Injection, it allows you to inject methods instead of objects in your class. Method Injection is useful in scenarios where you need to inject a smaller scope bean in a larger scope bean. For example, you have to inject a prototype bean inside an singleton bean , on each method invocation of Singleton bean. Just defining your bean prototype, does not create new instance each time a singleton bean is called because container creates a singleton bean only once, and thus only sets a prototype bean once. So, it is completely wrong to think that if you make your bean prototype you will get  new instance each time prototype bean is called.

To make this problem more concrete, lets consider we have to inject a validator instance each time a RequestProcessor bean is called. I want validator bean to be prototype because Validator has a state like a collection of error messages which will be different for each request.

RequestProcessor class

@Component
public class RequestProcessor implements Processor {

    @Autowired(required = true)
    private Validator validator;

    public Response process(Request request) {
        List errorMessages = validator.validate(request);
        if (!errorMessages.isEmpty()) {
            return null;
        }
        return null;
    }

}

Validator class

@Component
@Scope(value = "prototype")
public class Validator {

    private List<String> errorMessages = new ArrayList<String>();

    public Validator() {
        System.out.println("New Instance");
    }

    public List<String> validate(Request request) {
        errorMessages.add("Validation Failed");
        return errorMessages;
    }

}

Because I am using Spring annotations, the only thing I need specify in xml is to activate annotation detection and classpath scanning for annotated components.

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:aop="http://www.springframework.org/schema/aop"
	xmlns:context="http://www.springframework.org/schema/context"
	xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-3.0.xsd
	http://www.springframework.org/schema/aop http://www.springframework.org/schema/aop/spring-aop-3.0.xsd
	http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context-3.0.xsd">

	<context:annotation-config />

	<context:component-scan base-package="com.shekhar.spring.scoped.proxy"></context:component-scan>
</beans>

Now lets write a test to see how many instances of prototype bean gets created when we call the processor 10 times.

@ContextConfiguration
@RunWith(SpringJUnit4ClassRunner.class)
public class RequestProcessorTest {

    @Autowired
    private Processor processor;

    @Test
    public void testProcess() {
        for (int i = 0; i < 10; i++) {
            Response response = processor.process(new Request());
            assertNull(response);
        }
    }

}

If you run this test, you will see that only one instance of validator bean gets created as “New Instance” will get printed only once. This is because container creates RequestProcessor bean only once, and thus only sets Validator bean once and reuses that bean on each invocation of process method.

One possible solution for this problem is to use Method Injection which I have already talked about in my JavaLobby article. I didn’t liked method injection approach because it imposes restriction on your class i.e. you have to make your class abstract. In this blog, I am going to talk about the second approach for handling such a problem. The second approach is to use Spring AOP Scoped proxies which injects a new validator instance each time RequestProcessor bean is called. To make it work, the only change you have to do is to specify proxyMode in Validator class.

@Component
@Scope(proxyMode = ScopedProxyMode.TARGET_CLASS, value = "prototype")
public class Validator {

    private List<String> errorMessages = new ArrayList<String>();

    public Validator() {
        System.out.println("New Instance");
    }

    public List<String> validate(Request request) {
        errorMessages.add("Validation Failed");
        return errorMessages;
    }

}

There are four possible values for proxyMode attribute.I have used TARGET_CLASS which creates a class based proxy. This mode requires that you have CGLIB jar in your classpath.If you run your test again you will see that 10 instances of Validator class will be created. I found this solution more cleaner as I don’t have to make my class abstract and it makes unit testing of RequestProcessor class easier.

Reduce Boilerplate Code for DAO’s — Hades Introduction

Most web applications will have DAO’s for accessing the database layer. A DAO provides an interface for some type of database or persistence mechanism, providing CRUD and finders operations without exposing any database details. So, in your application you will have different DAO’s for different entities. Most of the time, code that you have written in one DAO will get duplicated in other DAO’s because much of the functionality in DAO’s is same (like CRUD and finder methods).

One of way of avoiding this problem is to have generic DAO and have your domain classes inherit this generic DAO implementation. You can also add finders using Spring AOP; this approach is explained Per Mellqvist in this article. There is a problem with the approach: this boiler plate code becomes part of your application source code and you will have to maintain it. The more code you write, there are more chances of new bugs getting introduced in your application. So, to avoid writing this code in an application, we can use an open source framework called Hades.

Read More