shekhargulati

How does MongoDB write/read speed vary with and without an index on a field?

For the last 3 weeks I have been busy working on a PoC in which we are considering MongoDB as our datastore. In this series of blog posts I will share my findings with the community. Please take these experiments with a grain of salt and try them out on your own dataset and hardware. Also let me know if I am doing something stupid. In this post I will share my findings on how an index affects write speed.

Scenario

I will insert 60 million documents and note the time taken to write each batch of 10 million records. The average document size is 2400 bytes (see the sample document under the Document heading). The test is run first without an index on the name field and then with an index on the name field.
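To get a feel for the scale, a back-of-the-envelope estimate of the raw data volume can help. The sketch below multiplies the document count by the average document size from the scenario. The class and method names are made up for illustration, and the result ignores index size and MongoDB's own storage overhead, so treat it as a lower bound.

```java
public class DataSizeEstimate {

    // Figures taken from the scenario: 60 million documents of about 2400 bytes each.
    static final long TOTAL_DOCUMENTS = 60000000L;
    static final long AVG_DOCUMENT_BYTES = 2400L;
    static final long BYTES_PER_GIB = 1024L * 1024L * 1024L;

    // Raw payload size in bytes; ignores indexes and storage overhead.
    static long rawBytes(long documents, long avgDocumentBytes) {
        return documents * avgDocumentBytes;
    }

    // Converts a byte count to GiB (2^30 bytes).
    static double toGiB(long bytes) {
        return (double) bytes / BYTES_PER_GIB;
    }

    public static void main(String[] args) {
        long total = rawBytes(TOTAL_DOCUMENTS, AVG_DOCUMENT_BYTES);
        System.out.println(String.format(
                "Raw data: %d bytes (about %.1f GiB)", total, toGiB(total)));
    }
}
```

With these inputs the raw data comes to 144,000,000,000 bytes, roughly 134 GiB, which is far more than the 4 GB of RAM listed under Setup.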

Conclusion

With an index on the name field, write speed dropped to 0.27 times the write speed without the index by the time 20 million documents had been inserted.

Setup

Dell Vostro Ubuntu 11.04 box with 4 GB RAM and 300 GB hard disk.

Java 6

MongoDB 2.0.1

Spring MongoDB 1.0.0.M5, which internally uses version 2.6.5 of the MongoDB Java driver.

Document

The documents I am storing in MongoDB look like the one shown below. The average document size is 2400 bytes. Please note that the _id field also has an index. The index that I will be creating is on the name field.

{
"_id" : ObjectId("4ed89c140cf2e821d503a523"),
"name" : "Shekhar Gulati",
"someId1" : NumberLong(1000006),
"str1" : "U",
"date1" : ISODate("1997-04-10T18:30:00Z"),
"bio" : "I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a
Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java
Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I
am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java
Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I
am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java
Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I
am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java
Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I
am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java
Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I
am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java
Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I
am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java
Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I
am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java
Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I
am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java
Developer. I am a Java Developer. "
}

JUnit TestCases

The first JUnit test inserts records in batches of 10 million and, after each batch, logs the time taken to write that batch. It then performs a find query on the unindexed name field and logs the time taken by the find operation. The test runs for 6 batches, so 60 million records are inserted.

@Configurable
@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration(locations = "classpath:/META-INF/spring/applicationContext*.xml")
public class Part1Test {

	private static final String FILE_NAME = "/home/shekhar/dev/test-data/10mrecords.txt";

	private static final int TOTAL_NUMBER_OF_BATCHES = 6;

	// Logger is named after this test class.
	private static final Logger logger = Logger
			.getLogger(Part1Test.class);

	@Autowired
	MongoTemplate mongoTemplate;

	@Before
	public void setup() {
		mongoTemplate.getDb().dropDatabase();
	}

	@Test
	public void shouldWrite60MillionRecordsWithoutIndex() throws Exception {

		for (int i = 1; i <= TOTAL_NUMBER_OF_BATCHES; i++) {
			logger.info("Running Batch ...." + i);
			long startTimeForOneBatch = System.currentTimeMillis();
			LineIterator iterator = FileUtils.lineIterator(new File(FILE_NAME));
			while (iterator.hasNext()) {
				String line = iterator.next();
				User user = convertLineToObject(line);
				mongoTemplate.insert(user);
			}

			long endTimeForOneBatch = System.currentTimeMillis();
			double timeInSeconds = ((double) (endTimeForOneBatch - startTimeForOneBatch)) / 1000;
			logger.info(String
					.format("Time taken to write %d batch of 10 million records is %.2f seconds",
							i, timeInSeconds));

			Query query = Query.query(Criteria.where("name").is(
					"Shekhar Gulati"));
			logger.info("Unindexed find query for name Shekhar Gulati");
			performFindQuery(query);
			performFindQuery(query);

			CommandResult collectionStats = mongoTemplate
					.getCollection("users").getStats();
			logger.info("Collection Stats : " + collectionStats.toString());

			logger.info("Batch finished running...." + i);
		}

	}

	private void performFindQuery(Query query) {
		long firstFindQueryStartTime = System.currentTimeMillis();
		List<User> query1Results = mongoTemplate.find(query, User.class);
		logger.info("Number of results found are " + query1Results.size());
		long firstFindQueryEndTime = System.currentTimeMillis();
		logger.info("Total Time Taken to do a find operation "
				+ (firstFindQueryEndTime - firstFindQueryStartTime) / 1000
				+ " seconds");
	}

	private User convertLineToObject(String line) {
		String[] fields = line.split(";");
		User user = new User();
		user.setFacebookName(toString(fields[0]));
		user.setSomeId1(toLong(fields[1]));
		user.setStr1(toString(fields[2]));
		user.setDate1(toDate(fields[3]));
		user.setBio(StringUtils.repeat("I am a Java Developer. ", 100));
		return user;
	}

	private long toLong(String field) {
		return Long.parseLong(field);
	}

	private Date toDate(String field) {
		SimpleDateFormat dateFormat = new SimpleDateFormat(
				"yyyy-MM-dd HH:mm:ss");
		Date date = null;
		try {
			date = dateFormat.parse(field);
		} catch (ParseException e) {
			date = new Date();
		}
		return date;
	}

	private String toString(String field) {
		if (StringUtils.isBlank(field)) {
			return "dummy";
		}
		return field;
	}

}

Listing 1. 60 million records inserted and read without an index
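One detail worth noting in the listing: performFindQuery divides the elapsed milliseconds by 1000 using integer arithmetic, so any query that finishes in under a second is reported as 0 seconds. A minimal sketch of an alternative, assuming a hypothetical helper class named Timing (it is not part of the original test), uses System.nanoTime() and keeps the fractional part. The Thread.sleep call is only a stand-in for the real find.

```java
import java.util.concurrent.Callable;

public class Timing {

    // Converts elapsed nanoseconds to seconds without losing the fraction.
    static double toSeconds(long elapsedNanos) {
        return elapsedNanos / 1000000000.0;
    }

    // Runs the task once and returns how long it took, in seconds.
    static double timeInSeconds(Callable<?> task) throws Exception {
        long start = System.nanoTime();
        task.call();
        return toSeconds(System.nanoTime() - start);
    }

    public static void main(String[] args) throws Exception {
        double seconds = timeInSeconds(new Callable<Void>() {
            public Void call() throws Exception {
                // Stand-in for mongoTemplate.find(query, User.class).
                Thread.sleep(50);
                return null;
            }
        });
        System.out.println(String.format("Find took %.3f seconds", seconds));
    }
}
```

The size of the result list could then be logged after the timing call, so that logging does not count towards the measured time.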

In the second test case I first created the index and then started inserting the records. This time find operations were performed on the indexed field name.

	@Test
	public void shouldWrite60MillionRecordsWithIndex()
			throws Exception {

		long startTime = System.currentTimeMillis();
		createIndex();

		for (int i = 1; i <= TOTAL_NUMBER_OF_BATCHES; i++) {
			logger.info("Running Batch ...." + i);
			long startTimeForOneBatch = System.currentTimeMillis();
			LineIterator iterator = FileUtils.lineIterator(new File(FILE_NAME));
			while (iterator.hasNext()) {
				String line = iterator.next();
				User obj = convertLineToObject(line);
				mongoTemplate.insert(obj);
			}

			long endTimeForOneBatch = System.currentTimeMillis();
			logger.info("Total Time Taken to write " + i
					+ " batch of Records in milliseconds : "
					+ (endTimeForOneBatch - startTimeForOneBatch));
			double timeInSeconds = ((double) (endTimeForOneBatch - startTimeForOneBatch)) / 1000;

			logger.info(String.format("Time taken to write %d batch of 10 million records is %.2f seconds", i, timeInSeconds));

			Query query = Query.query(Criteria.where("name").is("Shekhar Gulati"));
			logger.info("Indexed find query for name Shekhar Gulati");
			performFindQuery(query);
			performFindQuery(query);

			// Stats come from the "users" collection, as in the first test ("name" is the field, not the collection).
			CommandResult collectionStats = mongoTemplate.getCollection("users").getStats();
			logger.info("Collection Stats : " + collectionStats.toString());

			logger.info("Batch finished running...." + i);
		}

	}

	private void createIndex() {
		// Ascending index on the name field (index name without the stray leading space).
		IndexDefinition indexDefinition = new Index("name", Order.ASCENDING)
				.named("name_1");
		long startTimeToCreateIndex = System.currentTimeMillis();
		mongoTemplate.ensureIndex(indexDefinition, User.class);
		long endTimeToCreateIndex = System.currentTimeMillis();
		logger.info("Total Time Taken createIndex "
				+ (endTimeToCreateIndex - startTimeToCreateIndex) / 1000
				+ " seconds");
	}

Write Concern

The WriteConcern was NONE, which is fire and forget: the driver does not wait for the server to acknowledge a write. You can read more about write concerns here.

Write Results

After running the test cases shown above, I found that for the first 10 million inserts (0 to 10 million) the writes per second with the index were 0.4 times those without the index. More surprising, for the next batch of 10 million records the write speed with the index fell to 0.27 times the speed without it.

Looking at the table above, you can see that the write speed without an index remains consistent and does not degrade. But the write speed with the index varied a lot, from 3492 documents per second down to 2281 documents per second. I was not able to complete the test beyond 20 million documents, as the next 10 million were taking far too long. This can cause a lot of problems if you add an index on a field after you have already inserted the first 10 million records without one. With the index, write speed is not consistent, and you may have to consider sharding to reach the write throughput you need.
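The documents-per-second figures and the "times" ratios quoted here are simple divisions. The sketch below shows the arithmetic with a hypothetical helper class. The two throughput values in main are the ones quoted above for the indexed run, on the assumption that they correspond to the first and second batches; everything else is illustrative.

```java
public class WriteThroughput {

    // Documents written per second for one batch.
    static double docsPerSecond(long documents, double seconds) {
        return documents / seconds;
    }

    // How one throughput compares to a baseline, e.g. 0.27 means 27% of the baseline.
    static double ratio(double throughput, double baseline) {
        return throughput / baseline;
    }

    public static void main(String[] args) {
        double firstBatch = 3492.0;  // docs/sec with index, assumed to be batch 1
        double secondBatch = 2281.0; // docs/sec with index, assumed to be batch 2
        System.out.println(String.format(
                "Second indexed batch ran at %.2f times the speed of the first",
                ratio(secondBatch, firstBatch)));
    }
}
```

The same ratio method, fed the with-index and without-index throughput of one batch, gives figures like the 0.4 and 0.27 mentioned earlier.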

Read Results

The read results don’t show anything interesting, except that you should have an index on any field you query on; otherwise read performance will be very bad. This is easy to explain: without an index the query has to scan the documents, the data will not all be in RAM, and you will be hitting the disk. And when you hit the disk, performance takes a dive.
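As a rough in-memory analogy (this is not how MongoDB is implemented, and all names here are made up), the difference between the two kinds of read can be sketched with plain Java collections: a query on an unindexed field has to look at every document, while an indexed query is a lookup in a sorted structure.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

public class IndexAnalogy {

    // Full scan: every document's name is compared, like a query on an unindexed field.
    static List<Integer> scan(List<String> names, String wanted) {
        List<Integer> positions = new ArrayList<Integer>();
        for (int i = 0; i < names.size(); i++) {
            if (names.get(i).equals(wanted)) {
                positions.add(i);
            }
        }
        return positions;
    }

    // Builds a sorted map from name to document positions, a stand-in for an index on name.
    static TreeMap<String, List<Integer>> buildIndex(List<String> names) {
        TreeMap<String, List<Integer>> index = new TreeMap<String, List<Integer>>();
        for (int i = 0; i < names.size(); i++) {
            List<Integer> positions = index.get(names.get(i));
            if (positions == null) {
                positions = new ArrayList<Integer>();
                index.put(names.get(i), positions);
            }
            positions.add(i);
        }
        return index;
    }

    // Indexed lookup: a tree search instead of touching every document.
    static List<Integer> lookup(TreeMap<String, List<Integer>> index, String wanted) {
        List<Integer> positions = index.get(wanted);
        return positions == null ? Collections.<Integer>emptyList() : positions;
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<String>();
        for (int i = 0; i < 100000; i++) {
            names.add(i % 1000 == 0 ? "Shekhar Gulati" : "user" + i);
        }
        TreeMap<String, List<Integer>> index = buildIndex(names);
        System.out.println("Scan found " + scan(names, "Shekhar Gulati").size());
        System.out.println("Index found " + lookup(index, "Shekhar Gulati").size());
    }
}
```

The same structure hints at why writes get slower with an index: every insert has to update the sorted structure as well as store the document.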

That is all I have for this post. I am not making any judgement on whether these numbers are good or bad. I think that should be governed by the use case, data, and hardware you will be working with. Please feel free to comment and share your knowledge.

A very bad Software Developer Job Advertisement by Infosys.

I don’t read the newspaper daily, but this morning I had some extra time, so I thought I would spend it reading the paper. While browsing it I saw a job advertisement from Infosys. In the recent past I have interviewed a lot of Java developers from companies like Infosys, TCS and other biggies, and found that most of them don’t know how to write code, have never heard of the new things happening in the software world, and are technically very weak overall. So I was very curious to read which technologies they are hiring for.

You can view the Job Opening here.

The first thing I read was …

Web Technologies

Java, J2EE, EJB, WebLogic, WebSphere Commerce Server, WebSphere Portal Server

Here I don’t get what they are trying to say when they list Java as a web technology. J2EE is a very old name; it is now Java EE. EJB?? I am sure they are working on EJB 2 or an even earlier version. And finally the big beasts from Oracle and IBM..

Second Thing which caught my eye…

Open Systems

C++, Unix

..WTF.. Please tell me what you mean by this.. Are you looking for developers who know C++ and can work on Unix boxes..??

Last Thing that I saw.

Others

Hadoop, Apache Cassandra, OpenLink, etc…

Now you are talking.. Hadoop and Cassandra are the latest technologies, and any good developer would love to work on them. Why are you putting these technologies under Others..?

I will be reviewing Spring Roo 1.1 Cookbook

I will be reviewing the Spring Roo 1.1 Cookbook over the next month. You can download the first chapter of the book free of cost.

Enable wireless on Dell Vostro 1720 with Ubuntu 11.04

Fire these commands

lsmod | grep dell

sudo rmmod -f dell_laptop
sudo rfkill unblock all

IBM DeveloperWorks Introducing Spring Roo, Part 4: Rapid application development in cloud with Spring Roo and Cloud Foundry

Take the rapid development of Roo a step further by creating applications that work in the cloud with Cloud Foundry, the first open platform as a service project, created by VMware. Learn more about the environment and then deploy an application into Cloud Foundry using the Roo shell. Read about it here: http://www.ibm.com/developerworks/opensource/library/os-springroo4/index.html

Blog created in couple of minutes using Spring Roo and Cloud Foundry

Yesterday I gave a session on Cloud Foundry at the SiliconIndia Cloud conference and showed the audience how they can create a blog in a couple of minutes. I developed the blog using Spring Roo and deployed it to the Cloud Foundry public cloud using the Roo Cloud Foundry add-on. The audience really liked the demo and saw the power of Spring Roo and Cloud Foundry. In case you want to create your own blog using Spring Roo and Cloud Foundry, open the Roo shell and fire these commands.

project --topLevelPackage com.xebia.blog --projectName xebiablog
mongo setup
entity mongo --class ~.domain.Blog
field string --fieldName title --class ~.domain.Blog --notNull
field string --fieldName body --notNull --sizeMax 200000
field date --type java.util.Date --fieldName publishDate --notNull
field boolean --fieldName publish --primitive --notNull
field string --fieldName author --notNull
entity mongo --class ~.domain.Comment
field string --fieldName email --notNull --class ~.domain.Comment
field string --fieldName comment --notNull --sizeMax 4000
field set --type ~.domain.Comment --fieldName comments --class ~.domain.Blog
field reference --type ~.domain.Blog --fieldName blog --class ~.domain.Comment --notNull
repository mongo --interface ~.repository.BlogRepository --entity ~.domain.Blog
repository mongo --interface ~.repository.CommentRepository --entity ~.domain.Comment
service --interface ~.service.BlogService --entity ~.domain.Blog
service --interface ~.service.CommentService --entity ~.domain.Comment
web mvc setup
web mvc all --package ~.web
mongo setup --cloudFoundry true
perform package
cloud foundry login --email <username> --password <password>
cloud foundry deploy --appName xebiablog --path /target/xebiablog-0.1.0.BUILD-SNAPSHOT.war --memory 512
cloud foundry create service --serviceName xebiablog-mongo --serviceType mongodb
cloud foundry bind service --appName xebiablog --serviceName xebiablog-mongo
cloud foundry start app --appName xebiablog
cloud foundry list apps

This blog will use MongoDB as its datastore. If you already have Spring Roo and the Cloud Foundry add-on installed, it will take less than a couple of minutes to have your blog created and deployed in the cloud.

SCP command for Copying files from remote linux machine to local machine

I just had to copy a zip file containing logs from a remote server to my local machine for some analysis.
I googled around and found that you can use the scp command to do that for you.

scp -r <username on remote machine>@<remote host>:<remote location> <location of local directory>

scp -r shekhar@dev.xebia.com:/home/xebia/application/logs/logs-1-09-2011.zip /home/shekhar/tmp

Enabling JGroups on an Ubuntu system

While using JGroups on Ubuntu you might get this exception:

INFO: JGroups version: 2.6.16.GA
org.jgroups.ChannelException: failed to start protocol stack
	at org.jgroups.JChannel.startStack(JChannel.java:1617)
	at org.jgroups.JChannel.connect(JChannel.java:366)
	at org.jgroups.demos.Draw.go(Draw.java:174)
	at org.jgroups.demos.Draw.main(Draw.java:144)
Caused by: java.lang.Exception: problem creating sockets (bind_addr=shekhar/127.0.1.1, mcast_addr=228.10.10.10:45588)
	at org.jgroups.protocols.UDP.start(UDP.java:389)
	at org.jgroups.stack.Configurator.startProtocolStack(Configurator.java:129)
	at org.jgroups.stack.ProtocolStack.startStack(ProtocolStack.java:402)
	at org.jgroups.JChannel.startStack(JChannel.java:1614)
	... 3 more
Caused by: java.net.SocketException: bad argument for IP_MULTICAST_IF: address not bound to any interface
	at java.net.PlainDatagramSocketImpl.socketSetOption(Native Method)
	at java.net.PlainDatagramSocketImpl.setOption(PlainDatagramSocketImpl.java:309)
	at java.net.MulticastSocket.setInterface(MulticastSocket.java:424)
	at org.jgroups.protocols.UDP.createSockets(UDP.java:527)
	at org.jgroups.protocols.UDP.start(UDP.java:385)

To solve this you need to make the JVM use the IPv4 stack instead of IPv6. To do that, pass -Djava.net.preferIPv4Stack=true when starting the JVM, as shown below.

java -Djava.net.preferIPv4Stack=true -classpath /home/shekhar/JB431/opt/jboss-soa-p-5/jboss-as/lib/concurrent.jar:/home/shekhar/JB431/opt/jboss-soa-p-5/jboss-as/common/lib/commons-logging.jar:/home/shekhar/JB431/opt/jboss-soa-p-5/jboss-as/server/all/lib/jgroups.jar org.jgroups.demos.Draw
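If changing the launch command is inconvenient, the same property can usually be set from code instead, provided this happens before any networking classes are initialized (otherwise the JVM may ignore it). A minimal sketch, with a hypothetical class name:

```java
public class PreferIPv4 {

    static final String PROPERTY = "java.net.preferIPv4Stack";

    // Must run before any networking class is touched, otherwise it may have no effect.
    static void configure() {
        System.setProperty(PROPERTY, "true");
    }

    public static void main(String[] args) {
        configure();
        // From here on, start the JGroups channel or other network code.
        System.out.println(PROPERTY + "=" + System.getProperty(PROPERTY));
    }
}
```

Passing the flag on the command line remains the safer option, since it takes effect before any application code runs.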

Introducing Spring Roo, Part 3: Developing Spring Roo add-ons

Spring Roo is a RAD tool that lets you build applications (mainly web) quickly and easily. Under the hood, Spring Roo is based on an OSGi add-on architecture, which makes it easy to extend Spring Roo by adding add-ons. Spring Roo provides commands to create add-ons that can be very easily made available to the Spring Roo user community. In this article, we first talk about the Spring Roo architecture and how Spring Roo leverages its own add-on architecture to provide different features; then we create add-ons using the Roo shell and modify them to suit our needs. Please read the third article at IBM DeveloperWorks.

Java Puzzler : They just find me !!

A couple of days back I wrote a piece of code that behaved in an unexpected manner, and I was confused about what was happening. Take a look at the sample code below and predict its behavior.

package com.shekhar;

public class JavaPuzzler {

	public static void main(String[] args) {
		JavaPuzzler javaPuzzler = new JavaPuzzler();
		javaPuzzler.doSth();
	}

	public void doSth() {
		float f = 1.2f;
		if (f >= 1.0) {
			f = 0.9999999999999f;
		}
		InnerClass innerClass = new InnerClass(f);
		System.out.println(innerClass.getValue());
	}

	private class InnerClass {

		private float value;

		public InnerClass(float value) {
			if (value >= 1.0f) {
				throw new IllegalArgumentException(
						"Value can't be greater than 1.0f");
			}

			this.value = value;
		}

		public float getValue() {
			return value;
		}
	}
}

My initial expectation was that I would get the value 0.9999999999999f as the answer. Try it and find the answer. Share your answer and reasoning in the comments.