Last week, we (me and my team) started a new sprint of our software development project. We follow a two-week sprint cycle that starts on Wednesdays and continues until Tuesday. Following the planning session, I picked up a technical story that required me to update the JPA entity classes.
Just to give you some context, one of the components of the application is a Spring batch consumer. Every night, we receive a data dump in the form of CSV files, which we need to insert into the PostgreSQL database. The batch consumer application reads the data files, converts each row to the corresponding JPA entity object, and inserts them into the database using Spring Batch JpaWriter. Before writing the Spring batch consumer, we used PostgreSQL COPY command to insert the data. This had let us work on the data processing layer in parallel while Spring Batch application was still under development. Some columns in the CSV files are timestamps in the format
yyyy-MM-dd-HH.mm.ss . Since I was unable to tell PostgreSQL COPY command to use a specific timestamp format, I ended up marking the field as text in our DDL scripts. I was not sure if our data processing layer required to interact with timestamp fields. Therefore, I decided to postpone the dilemma/decision for the right time. I consider this as a form of YAGNI (You aren’t gonna need it) principle of extreme programming. YAGNI principle says
Always implement things when you actually need them, never when you just force that you need them – Ron Jeffries
I had completely forgotten that we had used text type for timestamps, until one fine day while doing refactoring, I came across a JPQL query that was using ORDER BY clause on one of the timestamp fields.
During the same time, the client added some new fields in the CSV files and changed the DDL scripts. It was an appropriate time to handle this work. So, I created a technical story. Following a discussion with the Product Owner, we prioritized this task to the top of the backlog.
The reason we used Spring batch for data ingestion instead of PostgreSQL COPY command was to do data processing before writing data to the database. Spring Batch allows us to do custom data transformation with its ItemProcessor and FieldSetMapper abstractions.
The batch consumer application takes around 25 minutes to write 6GB data to the PostgreSQL database. We have ideas on how to reduce this number further, but for the moment, the current number looks fine to us.
The process that we use to generate model class is following:
- Our client provides us with IBM DB2 DDL scripts.
- Next, we convert the DDL script to PostgreSQL syntax. Besides a few small changes, IBM DB2 script works seamlessly with PostgreSQL.
- Next, we create a temporary database and run the DDL scripts.
- After that, we use Netbeans reverse engineering capabilities to convert tables to JPA entities. For development, our team uses the IntelliJ Idea. We couldn’t find reverse engineering capabilities in IntelliJ Idea. I think they are not part of community edition. Please correct me if I am wrong.
- We copy the entity classes to our project source code. Then, we fix any compilation errors that might be caused by the new model classes.
Although the above is a manual process, we don’t have to do it repetitively. In the last 7 sprints, I had to do it only 4 times. For now, we have been told that the schema will not change. I am hopeful that it was for the last time I did the process manually. Personally, I prefer to automate things that usually require multiple doings. I had spent half a day automating the process. I couldn’t find the library Netbeans uses internally to create entity classes from database tables so I used jpa-entity-generator open source project. There were three main issues, which I found while using the jpa-entity-generator project:
- In our DDL scripts, we have many composite primary keys. Netbeans JPA entities generator created
@EmbeddableIdfor them where as jpa-entity-generator use
@IdClass. Since we started with Netbeans generator, in multiple places we were using embeddable keys. Using jpa-entity-generator would have led to more work.
- The second issue, which I found with jpa-entity-generator is that it uses Lombok. I personally dislike Lombok. My biggest concern against tools like Lombok is that you have to install a plugin in the IDE to make your code compile. This means a new developer can’t cleanly import a project until he or she has Lombok plugin installed. I prefer that in my projects it should be dead easy to import the project in the IDE. During this endeavor, I found that there is an option to delombok. This is available via the Lombok plugin. But the code it generates is ugly.
- The third issue is it does not value database char type. It ends up creating everything as String. There are many fields that use char type, so this would have led to a lot of unwanted refactoring.
If you are reading this blog and know which library the Netbeans uses internally, then please let me know.
After generating the updated JPA entity classes using the Netbeans, I decided to run batch processor against the latest data dump. The first issue, which I faced was related to the timestamp format. You have to tell Spring batch the timestamp pattern to use. We use custom
FieldSetMapper for converting Spring Batch FieldSet to our domain entities. The code snippet shown below shows how we set the timestamp field in an entity.
entity.setRowChgTs(fieldSet.readDate("rowChgTs", "yyyy-MM-dd-HH.mm.ss", defaultValue));
Spring Batch internally uses
java.text.SimpleDateFormat to parse String to date. We updated all our FieldSetMappers to use timestamp format for converting text to proper date. We don’t generate FieldSetMapper by hand. There are 35 entity classes and there could be as many as 250 fields in the entity. I wrote a small utility using the awesome JavaPoet library to generate
FieldSetMapper from the entity classes.
After fixing the timestamp format in the code, I reran the batch consumer job and it took ages to finish. The job that took 25 minutes was not finished in 2 hours. I got frustrated and decided to kill the job. There were still many records which were yet to be written to the database. I started thinking what has changed and why the job time has increased.
As you probably would have guessed, I realized it was the timestamp change that was causing the issue. As mentioned previously, Spring Batch used
java.text.SimpleDateFormat to convert String to
Date. I remember reading something about SimpleDateFormat performance issues. When you are writing millions of records, even 1 millisecond per record could lead to big performance issues. I googled “Java SimpleDateFormat Performance” and found the SimpleDateFormat is slow post on the first page. I knew I was onto something.
My random reading on the web habit saved me from hours of debugging. I looked around to find faster ways to parse dates. A few people suggested to use Joda-Time
org.joda.time.format.DateTimeFormatter. We already had joda-time as one of the dependencies of our project, so there was not much I had to do. I rewrote a custom method to parse String to date using the joda-time library. I reran the job and it still took ages to finish.
While I was holding SimpleDateFormat the culprit, one thought that continually kept coming to my mind was “Shekhar, try running the Spring Batch code without your changes and see what performance you are getting today?” For some reason, I was convinced that
SimpleDateFormat was the issue.
Finally, when nothing worked, I decided to rerun without timestamp changes. However, the result remained the same. The Spring Batch job was taking such a significant amount of time to finish. At the same time, I saw that my Mac machine started giving me the warning that my disk space is almost full. I was puzzled — how did I consume 250GB of disk space?
I used my favorite Disk Inventory X app to find the disk usage. In a few minutes, it told me that there is a directory called
~/dev/postgres that is consuming 70GB of database. In the last Sprint, while trying to log all the actions that happen in PostgreSQL, I changed postgresql.conf. One of the properties that I changed was
log_statement. I changed its value to
log_statement='all'. This means PostgreSQL was logging all the actions. I changed log directory to
~/dev/postgres as well. This made me realize that performance issue is caused by PostgreSQL logging all actions to the filesystem.
So, what is the key lesson learned from this story?
I drove this particular point home — we must validate our assumptions before building any theory around them. I agree, this is tough to do. Still validate your assumptions!
I found that testing, whether automated or manual, is the best way to validate our assumptions.