Spring Roo 1.1 Cookbook Review

This review is about Spring Roo 1.1 Cookbook by Ashish Sarin from Packt Publishing. Before I get into the details, let me first tell you what Spring Roo is, in case you don't know.

Note: I am the author of the Spring Roo series on IBM DeveloperWorks.

What is Spring Roo?

Spring Roo is a lightweight productivity tool for Java™ technology that makes it fast and easy to develop Spring-based applications. Applications created using Spring Roo follow Spring best practices and are based on standards such as JPA, Bean Validation (JSR-303), and Dependency Injection (JSR-330). Roo offers a usable, context-aware, tab-completing shell for building applications. Spring Roo is extensible and allows add-ons, enhancing its capability.

Book Details

  1. 7 chapters covered in 460 pages. Each chapter contains a number of recipes, and each recipe is divided into four parts: Getting started, How to do it, How it works, and There's more. Each recipe therefore gives you all the details you need, from getting started through to how it works. I liked this approach.
  2. The book costs $44.99: http://www.packtpub.com/spring-roo-1-1-cookbook/book
  3. This book covers Spring Roo in detail.

Pros

  1. The recipes represent solutions to real-world problems. The book covers how to define composite keys, how to use a JNDI datasource, how to create an application that has to interact with multiple databases, and so on. These are problems you will face in the real world, and the book tries to answer them.
  2. The two chapters I liked the most are chapters 3 and 4, which cover advanced JPA and Spring MVC web applications. Chapter 3 covers in detail how to create many-to-one and one-to-many relations, create a mapped superclass, and so on. Chapter 4 gives you all the detail you need to create Spring MVC applications and customize them to your requirements.
  3. From the start, the book explains how different features are provided by Spring Roo through add-ons. I think it is very important to introduce the Spring Roo add-on architecture early, as it tells you that you can extend Spring Roo by writing your own add-ons.
  4. The book not only covers Spring Roo but also provides a good introduction to the technologies it uses.
  5. Chapter 6 gives a good introduction to technologies like Solr, GAE, and JMS. It really shows the power of Spring Roo.
  6. I think this book does justice to the power of Spring Roo.

Cons

  1. One thing I was looking forward to was customizing GWT applications. The book talks in detail about the GWT application created by Spring Roo but does not cover how to customize it.
  2. The same is true of chapter 7, which covers writing add-ons. The book covers the default add-on generated by Spring Roo in good detail but does not show how to customize it to write your own add-ons.
  3. The book is based on the Spring Roo 1.1.5 release, but Spring Roo 1.2 has since been released. Some commands discussed in the book are deprecated and no longer recommended, and a lot of new features, such as MongoDB support, JSF support, and multi-module projects, are not covered.

Conclusion

Overall I liked the book, and I think it is a good read if you are thinking of using Spring Roo. Yes, it is missing some newer features, but if you will not be using those features it is a good buy.

Using MongoDB Replica Set With Spring MongoDB 1.0.0.RC1

The primary purpose of replication is to ensure that data survives single or multiple machine failures. The more replicas you have, the more likely your data is to survive one or more hardware crashes. With three replicas, you can afford to lose two nodes and still serve the data. MongoDB supports two forms of replication, Replica Sets and Master-Slave. Replica Sets are the recommended way to do replication in MongoDB, and I will cover only Replica Sets in this post.

A couple of weeks back I was working on a PoC where we needed to set up MongoDB replication. As I am a Spring aficionado, I decided to use Spring MongoDB to interact with the replica set. We used Spring Roo to quickly bootstrap the project. All the project setup, the Spring MongoDB configuration, the JUnit test cases, even the Spring MVC UI was created in minutes thanks to Spring Roo. I am a big Spring Roo fan; I just love it. Thanks, SpringSource, for such an amazing project. Spring Roo uses Spring MongoDB version 1.0.0.M5, which has a bug: it does not support the WriteConcern value REPLICAS_SAFE. With the current release, 1.0.0.RC1, that issue has been fixed, and you can now use REPLICAS_SAFE, which is the recommended WriteConcern value when using replication. This is a step-by-step guide from creating the Spring project to a working MongoDB replica set.

  1. Create the project using Spring Roo. If you are not familiar with Spring Roo, you can read my Spring Roo series. I am using Spring Roo to quickly configure a Spring MongoDB project.
    project --topLevelPackage com.xebia.mongodb.replication --projectName mongodb-replication-demo --java 6
    mongo setup --databaseName bookshop --host localhost --port 27017
    entity mongo --class ~.domain.Book --testAutomatically --identifierType org.bson.types.ObjectId
    field string --fieldName title --notNull
    field string --fieldName author --notNull
    field number --type double --fieldName price --notNull
    repository mongo --interface ~.repository.BookRepository --entity ~.domain.Book
    

    This will create a Spring Maven project, configure MongoDB to work with Spring, create one collection, Book, and add three fields (title, author, and price) to it. All CRUD operations will be carried out through BookRepository.

  2. Start the MongoDB server using ./mongod, then run BookIntegrationTest and make sure all tests pass.
  3. Set up the replica set following the MongoDB documentation: http://www.mongodb.org/display/DOCS/Replica+Set+Tutorial.
  4. Update applicationContext-mongo.xml as shown below, but before that add the property mongo.replicaset, which lists all the replica set nodes (an example of this property is shown after the last step).
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <beans xmlns="http://www.springframework.org/schema/beans" xmlns:cloud="http://schema.cloudfoundry.org/spring" xmlns:context="http://www.springframework.org/schema/context" xmlns:mongo="http://www.springframework.org/schema/data/mongo" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context-3.0.xsd        http://www.springframework.org/schema/data/mongo        http://www.springframework.org/schema/data/mongo/spring-mongo-1.0.xsd        http://www.springframework.org/schema/beans        http://www.springframework.org/schema/beans/spring-beans-3.0.xsd        http://schema.cloudfoundry.org/spring http://schema.cloudfoundry.org/spring/cloudfoundry-spring-0.8.xsd">
    
        <mongo:db-factory dbname="${mongo.database}" id="mongoDbFactory" mongo-ref="mongo"/>
    
        <mongo:repositories base-package="com.xebia.mongodb.replication"/>
    
        <!-- To translate any MongoExceptions thrown in @Repository annotated classes -->
        <context:annotation-config/>
    
        <bean class="org.springframework.data.mongodb.core.MongoTemplate" id="mongoTemplate">
            <constructor-arg ref="mongoDbFactory"/>
        </bean>
    
    	<mongo:mongo id="mongo" replica-set="${mongo.replicaset}" write-concern="REPLICA_SAFE">
    		<mongo:options auto-connect-retry="true"/>
    	</mongo:mongo>
    </beans>
    

    If you run the tests again, all the tests will fail and you will see the following exception.

    Caused by: org.springframework.beans.factory.BeanDefinitionStoreException: Unexpected exception parsing XML document from file [/home/shekhar/dev/workspaces/writing/mongodb-replication-demo/target/classes/META-INF/spring/applicationContext-mongo.xml]; nested exception is java.lang.ArrayIndexOutOfBoundsException: 1
    at org.springframework.beans.factory.xml.XmlBeanDefinitionReader.
    doLoadBeanDefinitions(XmlBeanDefinitionReader.java:412)
    

    The reason for this exception is a bug in Spring MongoDB 1.0.0.M5, which is not able to parse the WriteConcern REPLICA_SAFE value.

  5. To make it work we have to move to the latest Spring MongoDB version, 1.0.0.RC1, which was released just 3 days back, on 7th December 2011. Update the pom.xml with 1.0.0.RC1.
     <dependency>
    	<groupId>org.springframework.data</groupId>
            <artifactId>spring-data-mongodb</artifactId>
            <version>1.0.0.RC1</version>
    </dependency>
    

    Run BookIntegrationTest again; the tests will fail and you will see the following exception stack trace.

    java.lang.NoSuchMethodError: org.springframework.core.annotation.AnnotationUtils
    .getAnnotation(Ljava/lang/reflect/AnnotatedElement;Ljava/lang/Class;)
    Ljava/lang/annotation/Annotation;
    at org.springframework.transaction.annotation.SpringTransactionAnnotationParser
    .parseTransactionAnnotation(SpringTransactionAnnotationParser.java:38)
    
  6. To make it run you have to use the latest Spring version, 3.1.0.RC2, in pom.xml.
    <spring.version>3.1.0.RC2</spring.version>
    
  7. The final change you need to make is in applicationContext-mongo.xml: change the value of write-concern to REPLICAS_SAFE.
    <mongo:mongo id="mongo" replica-set="${mongo.replicaset}" write-concern="REPLICAS_SAFE">
    	<mongo:options auto-connect-retry="true"/>
    </mongo:mongo>
    
  8. Run the tests and all the tests will pass.
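For reference, the mongo.replicaset property mentioned in step 4 is just a comma-separated list of the replica set members in host:port form. In the project's properties file, alongside the other mongo.* settings, it would look something like the line below (the hosts and ports are placeholders for your own members).

mongo.replicaset=localhost:27017,localhost:27018,localhost:27019

The XML configuration above is the declarative counterpart of handing the driver a seed list programmatically. The following plain Java driver sketch (not part of the Roo-generated project, just an illustration) shows roughly what it amounts to.

import java.util.Arrays;
import java.util.List;

import com.mongodb.Mongo;
import com.mongodb.ServerAddress;
import com.mongodb.WriteConcern;

public class ReplicaSetConnectionSketch {
	public static void main(String[] args) throws Exception {
		// Seed list of replica set members (placeholder hosts and ports);
		// the driver discovers the remaining members from the set itself.
		List<ServerAddress> seeds = Arrays.asList(
				new ServerAddress("localhost", 27017),
				new ServerAddress("localhost", 27018),
				new ServerAddress("localhost", 27019));
		Mongo mongo = new Mongo(seeds);
		// REPLICAS_SAFE blocks until the write has reached at least two members of the set.
		mongo.setWriteConcern(WriteConcern.REPLICAS_SAFE);
		mongo.close();
	}
}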

Are We Really Talking About Commodity Hardware When Working With MongoDB?

For the last couple of months I have been reading about, learning, and playing with MongoDB, and one thing I have read, and found myself, is that its performance depends largely on the amount of RAM in your system. As a general rule, the larger the RAM the better the performance, which is easy to understand: you are not hitting disk, so you get great performance. When we talk about commodity hardware, I think we mean boxes with 4 GB or at most 8 GB of RAM, which means that if your application's working set can fit in 4 GB or 8 GB of RAM you are good; otherwise your performance will suffer. Then you have two choices: either add more RAM or scale your system horizontally, i.e. sharding. To me, adding more RAM means you are moving away from commodity hardware and toward big, costly boxes. So we should scale our systems horizontally by adding more 4 GB or 8 GB RAM boxes. Correct?

I thought companies and people who are using MongoDB would be following this approach, i.e. using commodity boxes and scaling out. But I was wrong. Most of the presentations I saw (from companies like Craigslist and Foursquare) use big boxes with 64 GB or more RAM and faster disks. So where is the commodity hardware we keep talking about?

 

Installing HBase over HDFS on a Single Ubuntu Box

I faced some issues getting HBase to run over HDFS on my Ubuntu box. This is an informal step-by-step guide from setting up HDFS to running HBase on a single Ubuntu machine.

    1. Download Hadoop (hadoop-0.20.203.0rc1.tar.gz) and install it following this great tutorial: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/. I installed it under my own system user rather than creating an hduser. Make sure the four files (core-site.xml, hadoop-env.sh, hdfs-site.xml, mapred-site.xml) under the hadoop/conf folder have the values shown below. Check that Hadoop is working fine by running the wordcount example as mentioned in the tutorial, and also update your .bashrc with the required variables.

      core-site.xml
      <?xml version="1.0"?>
      <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
      
      <!-- Put site-specific property overrides in this file. -->
      
      <configuration>
      
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/shekhar/hadoop-data</value>
        <description>A base for other temporary directories.</description>
      </property>
      
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:54310</value>
        <description>The name of the default file system.  A URI whose
        scheme and authority determine the FileSystem implementation.  The
        uri's scheme determines the config property (fs.SCHEME.impl) naming
        the FileSystem implementation class.  The uri's authority is used to
        determine the host, port, etc. for a filesystem.</description>
      </property>
      
      </configuration>
      

      hdfs-site.xml
      
      <?xml version="1.0"?>
      <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
      
      <!-- Put site-specific property overrides in this file. -->
      
      <configuration>
      
      <property>
        <name>dfs.replication</name>
        <value>1</value>
        <description>Default block replication.
        The actual number of replications can be specified when the file is created.
        The default is used if replication is not specified in create time.
        </description>
      </property>
      
      </configuration>
      

      mapred-site.xml

      <?xml version="1.0"?>
      <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
      
      <!-- Put site-specific property overrides in this file. -->
      
      <configuration>
      
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:54311</value>
        <description>The host and port that the MapReduce job tracker runs
        at.  If "local", then jobs are run in-process as a single map
        and reduce task.
        </description>
      </property>
      
      </configuration>
      

      hadoop-env.sh

      # Set Hadoop-specific environment variables here.
      
      # The only required environment variable is JAVA_HOME.  All others are
      # optional.  When running a distributed configuration it is best to
      # set JAVA_HOME in this file, so that it is correctly defined on
      # remote nodes.
      
      # The java implementation to use.  Required.
      export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.26
      
      # Extra Java CLASSPATH elements.  Optional.
      # export HADOOP_CLASSPATH=
      
      # The maximum amount of heap to use, in MB. Default is 1000.
      # export HADOOP_HEAPSIZE=2000
      
      # Extra Java runtime options.  Empty by default.
      # export HADOOP_OPTS=-server
      
      # Command specific options appended to HADOOP_OPTS when specified
      export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_NAMENODE_OPTS"
      export HADOOP_SECONDARYNAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_SECONDARYNAMENODE_OPTS"
      export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_DATANODE_OPTS"
      export HADOOP_BALANCER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_BALANCER_OPTS"
      export HADOOP_JOBTRACKER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_JOBTRACKER_OPTS"
      # export HADOOP_TASKTRACKER_OPTS=
      # The following applies to multiple commands (fs, dfs, fsck, distcp etc)
      # export HADOOP_CLIENT_OPTS
      
      # Extra ssh options.  Empty by default.
      # export HADOOP_SSH_OPTS="-o ConnectTimeout=1 -o SendEnv=HADOOP_CONF_DIR"
      
      # Where log files are stored.  $HADOOP_HOME/logs by default.
      # export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
      
      # File naming remote slave hosts.  $HADOOP_HOME/conf/slaves by default.
      # export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves
      
      # host:path where hadoop code should be rsync'd from.  Unset by default.
      # export HADOOP_MASTER=master:/home/$USER/src/hadoop
      
      # Seconds to sleep between slave commands.  Unset by default.  This
      # can be useful in large clusters, where, e.g., slave rsyncs can
      # otherwise arrive faster than the master can service them.
      # export HADOOP_SLAVE_SLEEP=0.1
      
      # The directory where pid files are stored. /tmp by default.
      # export HADOOP_PID_DIR=/var/hadoop/pids
      
      # A string representing this instance of hadoop. $USER by default.
      # export HADOOP_IDENT_STRING=$USER
      
      # The scheduling priority for daemon processes.  See 'man nice'.
      # export HADOOP_NICENESS=10
      
    2. Download HBase (version hbase-0.90.4.tar.gz) and update hbase-site.xml in the hbase/conf folder with the required properties.
      hbase-site.xml

      <?xml version="1.0"?>
      <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
      <configuration>
      
      	<property>
      		<name>hbase.rootdir</name>
          		<value>hdfs://localhost:54310/hbase</value>
      	</property>
      
      	<property>
      		<name>dfs.replication</name>
      		<value>1</value>
      	</property>
      
      	<property>
      	      <name>hbase.zookeeper.property.clientPort</name>
      	      <value>2222</value>
      	      <description>Property from ZooKeeper's config zoo.cfg.
      	      The port at which the clients will connect.
      	      </description>
          	</property>
      	<property>
      	      <name>hbase.zookeeper.quorum</name>
      	      <value>localhost</value>
      	      <description>Comma separated list of servers in the ZooKeeper Quorum.
      	      For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com".
      	      By default this is set to localhost for local and pseudo-distributed modes
      	      of operation. For a fully-distributed setup, this should be set to a full
      	      list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set in hbase-env.sh
      	      this is the list of servers which we will start/stop ZooKeeper on.
      	      </description>
      	</property>
          <property>
            <name>hbase.zookeeper.property.dataDir</name>
            <value>/home/shekhar/zookeeper</value>
            <description>Property from ZooKeeper's config zoo.cfg.
            The directory where the snapshot is stored.
            </description>
          </property>
      
      </configuration>
      

      Update hbase-env.sh so that HBase manages its own ZooKeeper instance.

      # Tell HBase whether it should manage it's own instance of Zookeeper or not.
      export HBASE_MANAGES_ZK=true
      
    3. Run HBase using ./start-hbase.sh from the bin folder. You will see the following exceptions in the log file.
      2011-12-06 13:59:29,979 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
      java.io.IOException: Call to localhost/127.0.0.1:54310 failed on local exception: java.io.EOFException
      
      2011-12-06 13:59:30,577 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181
      2011-12-06 13:59:30,577 WARN org.apache.zookeeper.ClientCnxn: Session 0x134127deaaf0002 for server null, unexpected error, closing socket connection and attempting reconnect
      java.net.ConnectException: Connection refused
      at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
      at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
      at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
      

      Kill HBase using kill -9 <processid>.

    4. The exception in step 3 occurs because the Hadoop jar in the hbase lib directory is different from the one used by your Hadoop installation. Copy hadoop-core-0.20.203.0.jar from the hadoop folder to the hbase/lib folder.
    5. Start HBase again using ./start-hbase.sh and you will get another exception.
      2011-12-06 14:51:05,778 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
      java.lang.NoClassDefFoundError: org/apache/commons/configuration/Configuration
      

      Kill HBase using kill -9 <processid>.

    6. To fix this, copy commons-configuration-1.6.jar from the hadoop lib folder to the hbase lib folder.
    7. Start HBase again using ./start-hbase.sh. It should start fine now, and you should be able to see HBase running at http://localhost:60010/master.jsp. If you see a valid page, HBase has started fine. You can additionally verify the installation from the HBase shell, as sketched below.
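
      A quick way to verify the installation beyond the web UI is to create and scan a small table from the HBase shell. A minimal check could look like this (the table and column family names are just examples):

      ./hbase shell                          # from the hbase/bin folder
      create 'test', 'cf'                    # table 'test' with a single column family 'cf'
      put 'test', 'row1', 'cf:a', 'value1'   # write one cell
      scan 'test'                            # should print row1 with cf:a = value1
      disable 'test'
      drop 'test'                            # clean up the test table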

How Do MongoDB's Different Write Concern Values Affect Performance on a Single Node?

In the first post I talked about how indexes affect the write speed in MongoDB. In this second post I will share my findings on how different write concerns affect the write speed on a single node. Please refer to the first post for the setup-related information. A write concern controls the behavior of a write operation and gives developers the choice of a value matching their requirements. For instance, some documents are not very important, and if one of them gets lost your business will not suffer; for those you can choose a less strict write concern value. For objects you don't want to lose, you should choose a stricter value. Let's take a look at the different write concern values available in the Java driver. Please note that in this experiment I used the MongoDB Java driver version 2.7.2 instead of Spring MongoDB.

  1. Normal: This is the default option, where every write operation is fire and forget, which means the write is handed to the driver and the call returns immediately. It does not wait for the write to reach the server. So, if another thread tries to read the document just after it has been written, it might not find it. There is a very high probability of data loss with this option. I think it should not be considered in cases where data durability is important and you are using only a single MongoDB server. Even with replication you can lose data with this option (I will talk about that in a future post).
  2. None: This is almost the same as Normal, with one difference: with Normal, if the network goes down or there is some other network issue, you get an exception, but with None you don't get any exception for network issues. This makes it highly unreliable.
  3. Safe: As the name suggests, this is safer than the two above. The write operation waits for the MongoDB server to acknowledge the write, but the data is still not necessarily written to disk. With Safe you will not face the issue of another thread not finding the object you just wrote, so it guarantees that an object, once written, will be found. That's good, but you can still lose data, because the data is not yet on disk, and if the server dies for some reason the data will be lost.
  4. Journal Safe: Before we talk about this option, let's first talk about what journaling is in MongoDB. Journaling is a feature of MongoDB in which a write-ahead log of all operations is maintained. In scenarios where MongoDB is not shut down cleanly, for example with the kill -9 command, the data can be recovered from the journal files. By default, data is written to the journal files every 100 milliseconds; you can change this to anywhere between 2 ms and 300 ms. As of version 2.0, journaling is enabled by default on 64-bit MongoDB servers. With the Journal Safe write concern, your write waits until the journal file has been updated.
  5. Fsync: With the Fsync write concern, the write operation waits until the data has been written to disk. This is the safest option on a single node, as the only way you can lose data is if the hard disk crashes.

I have left out the other values, which are not applicable to a single node and make more sense when replication is enabled; I will cover them in future posts. The sketch below shows how a write concern can be set at different scopes with the Java driver.
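
Because the write concern can be chosen to match the importance of the data, the Java driver lets you set it per connection, per database, or per collection. The following is only a minimal sketch (the collection names are made up for illustration) of a connection-wide default with per-collection overrides.

import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.Mongo;
import com.mongodb.ServerAddress;
import com.mongodb.WriteConcern;

public class WriteConcernScopes {
	public static void main(String[] args) throws Exception {
		Mongo mongo = new Mongo(new ServerAddress("localhost", 27017));
		mongo.setWriteConcern(WriteConcern.NORMAL);           // driver-wide default

		DB db = mongo.getDB("play");
		DBCollection auditLog = db.getCollection("auditLog"); // hypothetical low-value collection
		auditLog.setWriteConcern(WriteConcern.NONE);          // fire and forget

		DBCollection orders = db.getCollection("orders");     // hypothetical critical collection
		orders.setWriteConcern(WriteConcern.SAFE);            // wait for server acknowledgement

		mongo.close();
	}
}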

Test Case

The test case was very simple: I do 1 million writes with each of the options except Fsync and work out the writes-per-second rate for each write concern value.

Document

The document is similar to the one used in the first post. It is 2395 bytes.

{
"_id" : ObjectId("4eda74ef84ae8b2410f5fa8e"),
"age" : "27",
"lName" : "Gulati1",
"fName" : "Shekhar1",
"bio" : "I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. ",
}

JUnit Test

The JUnit test case is shown below. In each test case it inserts one million records with a different value of write concern.

import java.util.HashMap;
import java.util.Map;

import org.apache.commons.lang.StringUtils;
import org.apache.log4j.Logger;
import org.junit.Test;

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.Mongo;
import com.mongodb.ServerAddress;
import com.mongodb.WriteConcern;

public class SingleNodeWriteConcernTests {

	private final static int ONE_MILLION = 1000000;

	private final Logger logger = Logger
			.getLogger(SingleNodeWriteConcernTests.class);

	@Test
	public void shouldInsertRecordsInNonCurrentMode() throws Exception {
		ServerAddress serverAddress = new ServerAddress("localhost", 27017);

		Mongo mongo = new Mongo(serverAddress);
		mongo.setWriteConcern(WriteConcern.NONE);
		runASingleTestCase(mongo, "NONE");

		mongo = new Mongo(serverAddress);
		mongo.setWriteConcern(WriteConcern.JOURNAL_SAFE);
		runASingleTestCase(mongo, "JOURNAL_SAFE");

		mongo = new Mongo(serverAddress);
		mongo.setWriteConcern(WriteConcern.NORMAL);
		runASingleTestCase(mongo, "NORMAL");

		mongo = new Mongo(serverAddress);
		mongo.setWriteConcern(WriteConcern.SAFE);
		runASingleTestCase(mongo, "SAFE");

	}

	private void runASingleTestCase(Mongo mongo, String name) throws Exception {
		DB db = mongo.getDB("play");
		DBCollection people = db.getCollection("people");
		if (db.collectionExists("people")) {
			people.drop();
		}
		insertRecords(mongo, name);

		mongo.dropDatabase("play");
	}

	private void insertRecords(Mongo mongo, final String name) throws Exception {

		DB db = mongo.getDB("play");
		final DBCollection collection = db.getCollection("people");
		collection.ensureIndex("fName");
		long startTime = System.currentTimeMillis();
		for (int i = 1; i <= ONE_MILLION; i++) {
			BasicDBObject obj = new BasicDBObject();
			Map<String, String> map = new HashMap<String, String>();
			map.put("fName", "Shekhar" + i);
			map.put("lName", "Gulati" + i);
			map.put("age", String.valueOf(i));
			map.put("bio", StringUtils.repeat("I am a Java Developer. ", 100));
			obj.putAll(map);
			collection.insert(obj);
		}
		long endTime = System.currentTimeMillis();
		double seconds = ((double) (endTime - startTime)) / (1000);
		double rate = ONE_MILLION / seconds;

		String message = String
				.format("WriteConcern %s inserted %d records in %.2f seconds at %.2f (rec/s)",
						name, ONE_MILLION, seconds, rate);
		logger.info(message);

	}

}

Results

As you might have expected, Normal and None are the fastest because of the way they work, i.e. fire and forget. Safe writes take 3.5 times longer than Normal writes. With the Journal Safe value you come down to 24 documents per second, which is very low. As you can see, as you move towards more write safety you lose a lot of write speed. This is again a decision you have to make depending on your use case.

Can something be done to increase the write speed of the Safe and Journal Safe options?

The results shown above are based on records being inserted sequentially, one at a time. I tried an experiment in which I divided the 1 million records into batches of 100,000 records each and let 10 threads write the 1 million records in parallel. The write speed for Safe and Journal Safe increased, while None and Normal decreased, as shown below.

The write speed for Safe with 10 threads is 1.4 times the write speed with one thread, and similarly the write speed for Journal Safe is 10 times the write speed with one thread. This is because while one thread is waiting, the other threads can work in parallel, which makes better use of the CPU. A rough sketch of the multi-threaded variant follows.
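
For what it is worth, here is a minimal sketch of how such a 10-thread test can be structured with the 2.x Java driver. It is not the exact code I used for the numbers above, just an illustration of sharing one Mongo instance (which is thread-safe) across an ExecutorService.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.commons.lang.StringUtils;

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.Mongo;
import com.mongodb.ServerAddress;
import com.mongodb.WriteConcern;

public class ParallelSafeWriteSketch {

	public static void main(String[] args) throws Exception {
		final Mongo mongo = new Mongo(new ServerAddress("localhost", 27017));
		mongo.setWriteConcern(WriteConcern.SAFE);
		final DBCollection people = mongo.getDB("play").getCollection("people");

		// Ten threads, each inserting its own batch of 100,000 documents (1 million in total).
		ExecutorService pool = Executors.newFixedThreadPool(10);
		List<Callable<Void>> tasks = new ArrayList<Callable<Void>>();
		for (int t = 0; t < 10; t++) {
			tasks.add(new Callable<Void>() {
				public Void call() {
					for (int i = 0; i < 100000; i++) {
						BasicDBObject doc = new BasicDBObject("fName", "Shekhar" + i)
								.append("bio", StringUtils.repeat("I am a Java Developer. ", 100));
						people.insert(doc);
					}
					return null;
				}
			});
		}
		pool.invokeAll(tasks); // blocks until all one million inserts complete
		pool.shutdown();
		mongo.close();
	}
}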

Spring Roo CompareTo Add-on Released

Over the weekend I created an advanced Spring Roo add-on that provides support for creating a compareTo method. This add-on is along the same lines as Spring Roo's own equals add-on, which provides implementations of the equals and hashCode methods. The add-on is released and available for the community to use; it is listed in the Spring Roo repository XML at http://spring-roo-repository.springsource.org/roobot/roobot.xml. It is built using the latest Spring Roo version, 1.2.0.RC1. This post will talk about how to install and use the add-on.

Installing the compareTo add-on

Open the operating system command line and type roo. In case you don't have Spring Roo set up, please refer to my IBM DeveloperWorks article on Spring Roo. After the Roo shell starts, type the following command to install the add-on.

addon install bundle --bundleSymbolicName org.xebia.roo.addon.compareto

If you don't have automatic PGP trust enabled, you will see a message on the Roo shell saying that you first have to trust my PGP key before you can install the add-on. To trust the key, type the following Roo command.

pgp trust --keyId 0x9B68220C

Once the key is trusted, re-run the addon install command and the add-on should get installed. To verify that the add-on is installed, type the osgi ps command; you should see the compareto add-on listed at the bottom of the command output.

[  78] [Active     ] [    1] spring-dw-roo-compareto-addon (0.1.1.BUILD)

This add-on exposes two commands, compareto setup and compareto add, which are enabled only after you have created an entity. This is because before you have created any entity it does not make sense to have a compareTo method.

Using the compareTo add-on

Let's first create a simple bookshop application. Type the following commands, which will create the project, set up the database, and create a simple Book entity.

project --topLevelPackage com.xebia.bookshop --projectName bookshop
jpa setup --database HYPERSONIC_IN_MEMORY --provider HIBERNATE
entity jpa --class ~.domain.Book --testAutomatically
field string --fieldName title --notNull
field string --fieldName author --notNull
field number --type double --fieldName price --notNull

After you have run the commands above, you can create a compareTo method for the Book entity. To create the compareTo method, type the following commands.

compareto setup
compareto add --type ~.domain.Book

The first command adds the required dependencies and the second creates the Book_Roo_Compareto.aj ITD, which contains the compareTo method. Conceptually, the generated method behaves roughly like the hand-written sketch below.
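
The following is only a hand-written approximation of what the generated compareTo does, comparing the entity's fields one by one; the actual ITD produced by the add-on may be structured differently.

public class Book implements Comparable<Book> {

	private String title;
	private String author;
	private double price;

	public int compareTo(Book other) {
		// Compare field by field, falling through to the next field on ties.
		int result = this.title.compareTo(other.title);
		if (result == 0) {
			result = this.author.compareTo(other.author);
		}
		if (result == 0) {
			result = Double.compare(this.price, other.price);
		}
		return result;
	}
}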

If you want to exclude some fields from the compareTo method, you can specify them, i.e. if you don't want the author and price fields to be used in the compareTo method, list them in the excludeFields attribute as shown below.

compareto add --type ~.domain.Book --excludeFields "author price"

Use this add-on and share feedback.

How Does MongoDB Write/Read Speed Vary With or Without an Index on a Field?

For the last 3 weeks I have been busy working on a PoC where we are thinking of using MongoDB as our datastore. In this series of blog posts I will be sharing my findings with the community. Please take these experiments with a grain of salt and try them out on your own dataset and hardware. Also, let me know if I am doing something stupid. In this post I will be sharing my findings on how an index affects the write speed.

Scenario

I will be inserting 60 million documents and noting the time taken to write each batch of 10 million records. The average document size is 2400 bytes (see the sample under the Document heading below). The test is run first without an index on the name field and then with an index on the name field.

Conclusion

Write speed with an index dropped to 0.27 times the write speed without an index after inserting 20 million documents.

Setup 

Dell Vostro Ubuntu 11.04 box with 4 GB RAM and 300 GB hard disk.

Java 6

MongoDB 2.0.1

Spring MongoDB 1.0.0.M5, which internally uses version 2.6.5 of the MongoDB Java driver.

Document

The documents I am storing in MongoDB look like the one shown below. The average document size is 2400 bytes. Please note that the _id field also has an index; the index I will be creating is on the name field.

{
"_id" : ObjectId("4ed89c140cf2e821d503a523"),
"name" : "Shekhar Gulati",
"someId1" : NumberLong(1000006),
"str1" : "U",
"date1" : ISODate("1997-04-10T18:30:00Z"),
"bio" : "I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a
Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java
Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I
am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java
Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I
am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java
Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I
am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java
Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I
am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java
Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I
am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java
Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I
am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java
Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I
am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java
Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I
am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java Developer. I am a Java
Developer. I am a Java Developer. "
}

JUnit TestCases

The first JUnit test inserts the records in batches of 10 million and, after every batch, logs the time taken to write that batch. It then performs a find query on the unindexed name field and prints the time taken for the find operation. The test runs for 6 batches, so 60 million records are inserted in total.

@Configurable
@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration(locations = "classpath:/META-INF/spring/applicationContext*.xml")
public class Part1Test {

	private static final String FILE_NAME = "/home/shekhar/dev/test-data/10mrecords.txt";

	private static final int TOTAL_NUMBER_OF_BATCHES = 6;

	private static final Logger logger = Logger
			.getLogger(Part1Test.class);

	@Autowired
	MongoTemplate mongoTemplate;

	@Before
	public void setup() {
		mongoTemplate.getDb().dropDatabase();
	}

	@Test
	public void shouldWrite60MillionRecordsWithoutIndex() throws Exception {

		for (int i = 1; i <= TOTAL_NUMBER_OF_BATCHES; i++) {
			logger.info("Running Batch ...." + i);
			long startTimeForOneBatch = System.currentTimeMillis();
			LineIterator iterator = FileUtils.lineIterator(new File(FILE_NAME));
			while (iterator.hasNext()) {
				String line = iterator.next();
				User user = convertLineToObject(line);
				mongoTemplate.insert(user);
			}

			long endTimeForOneBatch = System.currentTimeMillis();
			double timeInSeconds = ((double) (endTimeForOneBatch - startTimeForOneBatch)) / 1000;
			logger.info(String
					.format("Time taken to write %d batch of 10 million records is %.2f seconds",
							i, timeInSeconds));

			Query query = Query.query(Criteria.where("name").is(
					"Shekhar Gulati"));
			logger.info("Unindexed find query for name Shekhar Gulati");
			performFindQuery(query);
			performFindQuery(query);

			CommandResult collectionStats = mongoTemplate
					.getCollection("users").getStats();
			logger.info("Collection Stats : " + collectionStats.toString());

			logger.info("Batch finished running...." + i);
		}

	}

	private void performFindQuery(Query query) {
		long firstFindQueryStartTime = System.currentTimeMillis();
		List<User> query1Results = mongoTemplate.find(query, User.class);
		logger.info("Number of results found are " + query1Results.size());
		long firstFindQueryEndTime = System.currentTimeMillis();
		logger.info("Total Time Taken to do a find operation "
				+ (firstFindQueryEndTime - firstFindQueryStartTime) / 1000
				+ " seconds");
	}

	private User convertLineToObject(String line) {
		String[] fields = line.split(";");
		User user = new User();
		user.setFacebookName(toString(fields[0]));
		user.setSomeId1(toLong(fields[1]));
		user.setStr1(toString(fields[2]));
		user.setDate1(toDate(fields[3]));
		user.setBio(StringUtils.repeat("I am a Java Developer. ", 100));
		return user;
	}

	private long toLong(String field) {
		return Long.parseLong(field);
	}

	private Date toDate(String field) {
		SimpleDateFormat dateFormat = new SimpleDateFormat(
				"yyyy-MM-dd HH:mm:ss");
		Date date = null;
		try {
			date = dateFormat.parse(field);
		} catch (ParseException e) {
			date = new Date();
		}
		return date;
	}

	private String toString(String field) {
		if (StringUtils.isBlank(field)) {
			return "dummy";
		}
		return field;
	}

}

Listing 1. 60 million records inserted and read without an index

In the second test case I first created the index and then started inserting the records. This time the find operations were performed on the indexed name field.

	@Test
	public void shouldWrite60MillionRecordsWithIndex()
			throws Exception {

		long startTime = System.currentTimeMillis();
		createIndex();

		for (int i = 1; i <= TOTAL_NUMBER_OF_BATCHES; i++) {
			logger.info("Running Batch ...." + i);
			long startTimeForOneBatch = System.currentTimeMillis();
			LineIterator iterator = FileUtils.lineIterator(new File(FILE_NAME));
			while (iterator.hasNext()) {
				String line = iterator.next();
				User obj = convertLineToObject(line);
				mongoTemplate.insert(obj);
			}

			long endTimeForOneBatch = System.currentTimeMillis();
			logger.info("Total Time Taken to write " + i
					+ " batch of Records in milliseconds : "
					+ (endTimeForOneBatch - startTimeForOneBatch));
			double timeInSeconds = ((double) (endTimeForOneBatch - startTimeForOneBatch)) / 1000;
			logger.info(String
					.format("Time taken to write %d batch of 10 million records is %.2f seconds",
							i, timeInSeconds));

			Query query = Query.query(Criteria.where("name").is("Shekhar Gulati"));
			logger.info("Indexed find query for name Shekhar Gulati");
			performFindQuery(query);
			performFindQuery(query);

			CommandResult collectionStats = mongoTemplate.getCollection("users").getStats(); // same "users" collection as in the first test
			logger.info("Collection Stats : " + collectionStats.toString());

			logger.info("Batch finished running...." + i);
		}

	}

	private void createIndex() {
		IndexDefinition indexDefinition = new Index("name", Order.ASCENDING)
				.named("name_1");
		long startTimeToCreateIndex = System.currentTimeMillis();
		mongoTemplate.ensureIndex(indexDefinition, User.class);
		long endTimeToCreateIndex = System.currentTimeMillis();
		logger.info("Total Time Taken createIndex "
				+ (endTimeToCreateIndex - startTimeToCreateIndex) / 1000
				+ " seconds");
	}

Write Concern

The WriteConcern value used was NONE, which is fire and forget. You can read more about write concerns in my follow-up post on write concern values.

Write Results

After running the test cases shown above, I found that for the first 10 million inserts (i.e. from 0 to 10 million) the writes per second with an index were 0.4 times the rate without an index. More surprising was that for the next batch of 10 million records the write speed with an index dropped to 0.27 times the rate without an index.

Looking at the table above, you can see that the write speed without an index remains consistent and does not degrade, but the write speed with an index varied a lot, from 3492 documents per second down to 2281 documents per second. I was not able to continue the test beyond 20 million records, as the next 10 million were taking far too long. This can lead to a lot of problems if you add an index on a field after you have already inserted the first 10 million records without one. The write speed is not even consistent, and you have to think about sharding to achieve the throughput you want.

Read Results

The read results don't show anything interesting, except that you should have an index on the field you will be querying on; otherwise read performance will be very bad. This is easily explained: the data will not be in RAM, so you will be hitting disk, and when you hit disk, performance takes a dive. A quick way to check whether a query is using an index is sketched below.
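
One simple way to confirm whether a query is hitting the index is to look at the query plan. The sketch below uses the plain Java driver; the database name is a placeholder for whatever your applicationContext-mongo.xml configures, while the collection and field names match the tests above.

import com.mongodb.BasicDBObject;
import com.mongodb.DBObject;
import com.mongodb.Mongo;

public class ExplainCheck {
	public static void main(String[] args) throws Exception {
		Mongo mongo = new Mongo("localhost", 27017);
		DBObject plan = mongo.getDB("mydb").getCollection("users")
				.find(new BasicDBObject("name", "Shekhar Gulati")).explain();
		// A "cursor" value of "BtreeCursor name_1" means the index is being used;
		// "BasicCursor" means a full collection scan.
		System.out.println(plan);
		mongo.close();
	}
}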

That is all I have for this post. I am not making any judgement on whether these numbers are good or bad; I think that should be governed by the use case, data, and hardware you will be working on. Please feel free to comment and share your knowledge.