Improving Spring Data JPA/Hibernate Bulk Insert Performance by more than 100 times


This week I had to work on a performance issue. Performance issues are always fun to work on. They give us an opportunity to get into the depths of the technology we are using, and we learn how much we don’t know about the technology we use every day. These days we are too quick to think about changing the database or the underlying library when faced with a performance bottleneck, when what we really need to do is learn the technology we are already using in depth.

The performance issue I am talking about was related to bulk insertion of data into the database. We were using Spring Data JPA with SQL Server. In our use case, the end user uploaded an Excel file that the application code first parsed, then processed, and finally stored in the database. The NFR for this requirement was that we should be able to process and store 100,000 records in less than 5 minutes.

Before my changes we were processing 10,000 records in 47 minutes. This certainly looked bad.

After making the changes that I will discuss in this post, we were able to process 10,000 records in 20 seconds and 100,000 records in less than 4 minutes, which is well below our NFR.

Let’s start by looking at the high level code structure of the example use case.

Let’s assume we are getting an Excel sheet with a list of todo items. Each todo item is linked to a todolist as shown below. As we can see, items 1 and 3 should be linked to todolist t1.

todolist, item, done, date
t1, item 1, false, 20/10/2020
t2, item 2, true, 20/09/2020
t1, item 3, false, 20/11/2020

I am not sharing the real code for obvious reasons, but I think it is important to look at the structure of the code to understand the problem statement better.

public class TodoListService {
    @Transactional
    public FileResult processFile(File excelFile){
        // parse the excel file into the list of records. Record is a flat structure
        // The final structure is TodoList has List<TodoItem>

        List<Record> records = recordProcessor.process(excelFile);

        for(Record record : records){
            List<String> errors = recordValidator.validate(record);
            if(!errors.isEmpty()){
                // add to errors
            } else {
                Optional<TodoList> todolistOptional = todolistRepository
                .findByField1AndField2AndField3(record.getField1(),record.getField2(), record.getField3());

                if(todolistOptional.isPresent()){ // Todolist exists
                    TodoList todolist = todolistOptional.get();
                    if(duplicate){
                        // add as error
                    }else if(isNotUpdateable){
                        // add as error
                    }else {
                        // Add TodoItem to existing TodoList
                        TodoItem todoItem = recordToTodoItem(record);
                        todolist.addTodoItem(todoItem);
                        todolistRepository.save(todolist);
                    }
                } else { // New TodoList needs to be created
                    TodoList todolist = createNewTodoList(record);
                    todolistRepository.save(todolist);
                }
            }
        }

        // save FileResult into the database. FileResult has one to many FileRecord
        // It store statistics of the file and any associated errors
        return fileResultRepository.save(fileResult);
    }

}

The code shown above does the following:

  1. Parse the Excel sheet and convert it into a list of records. A Record is a flat structure with no relationships.
  2. Iterate over the records. For each record we:
  3. Run a validator that finds any validation errors. We use JSR 303 bean validation to validate record objects.
  4. If there are no validation errors, check whether the todolist corresponding to the item already exists.
  5. If it exists, check whether the record is a duplicate or not updatable. If it is, mark it as an error record; otherwise convert the record to a TodoItem, add it to the existing TodoList, and save it to the database.
  6. If it does not exist, create a new TodoList with one TodoItem and save it to the database.
  7. Finally, create a file statistics object FileResult with the total, processed, and error record counts and save it to the database.

I hope the above gives you enough understanding of the problem statement.

Now, I will walk you through all the changes I made to improve the performance of processFile operation.

Change 1: Do not create a new JSR 303 Validator for each record

I mentioned in passing that we used the JSR 303 bean validation library to validate the records. The validations are length checks, numeric checks, regex checks, etc.
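For illustration only, a Record with such constraints might look like the following sketch (the field names and rules here are hypothetical, not the real ones):

import javax.validation.constraints.NotBlank;
import javax.validation.constraints.Pattern;
import javax.validation.constraints.Size;

public class Record {

    // hypothetical field: identifies the todolist this item belongs to
    @NotBlank
    @Size(max = 50)
    private String todoListId;

    // hypothetical field: free-text task description with a length limit
    @NotBlank
    @Size(max = 255)
    private String item;

    // hypothetical field: date in dd/MM/yyyy form, checked with a regex
    @Pattern(regexp = "\\d{2}/\\d{2}/\\d{4}")
    private String date;

    // getters and setters omitted
}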

In the validator we were creating the Validator inside the validate method. This meant that for 10,000 records it would be created 10,000 times.

public List<String> validate(Record record) {
  ValidatorFactory factory = Validation.buildDefaultValidatorFactory();
  Validator validator = factory.getValidator();
  Set<ConstraintViolation<Record>> errors = validator.validate(record);
  // return response
}

As it turns out, Validation.buildDefaultValidatorFactory() is an expensive operation. It first finds the default bean validation provider, then configures it, and finally creates the factory object.

Since we are using Spring Boot, we didn’t have to create this in our validator. Spring Boot creates one at startup that we can inject into our validator.

@Autowired
Validator validator;

@Override
public List<String> validate(Record record) {
    Set<ConstraintViolation<Record>> errors = validator.validate(record);
     // return response
}

This change improved performance by 30%. For 10,000 records we reduced processing time from 47 mins to 33 mins.

We figured this out using the jvisualvm profiler.

Change 2: Prefetch data from the database in batches

The second change that I made was related to prefetching the todolist records from the database in batches, outside of the for loop. So, rather than making 10,000 read calls, we make around 10 database calls with a batch size of 1,000, as shown in the code below.

List<String> ids = records.stream()
        .map(Record::getTodoListId)
        .distinct().collect(Collectors.toList());

int batchSize = 1000;

int numberOfBatches = calculateNumberOfBatches(batchSize, ids.size());

List<TodoList> todolistsInDb = new ArrayList<>();
for (int i = 0; i < numberOfBatches; i++) {
    // create the batch of ids for this iteration
    List<String> idsInBatch = ids.subList(i * batchSize, Math.min((i + 1) * batchSize, ids.size()));
    todolistsInDb.addAll(todolistRepository.findByIdIn(idsInBatch));
}

Map<Triple<String, String, String>, List<TodoList>> cache = todolistsInDb.stream()
        .collect(groupingBy(d -> Triple.of(d.getField1(), d.getField2(), d.getField3())));

for(Record record : records){
    List<String> errors = recordValidator.validate(record);
    if(!errors.isEmpty()){
        // add to errors
    } else {
        Triple<String, String, String> key = Triple.of(record.getField1(), record.getField2(), record.getField3());

        Optional<TodoList> todolistOptional = Optional.ofNullable(cache.get(key))
                .map(todolists -> todolists.get(0));

        // ... rest of the handling stays the same as before
    }
}

So, as you can see in the code above, we first fetch the records from the database in batches, convert them into an in-memory cache, and finally look up each record’s todolist from the cache inside the loop.
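The calculateNumberOfBatches helper used above is not shown in the original snippet; a minimal sketch of it could be:

private int calculateNumberOfBatches(int batchSize, int totalIds) {
    // round up so that a partial final batch still gets its own query
    return (totalIds + batchSize - 1) / batchSize;
}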

Since we will not have more than 100,000 records, I am fine keeping them in memory. It is less than 100 MB of data.

After making this change we made another big improvement in the processing time. We reduced the 10,000-record processing time to less than 2 minutes. On repeated runs it was between 90 and 120 seconds.

For 20,000 records the processing time was around 300 seconds, i.e. roughly 5 minutes. We were still far away from our goal of processing 100,000 records in 5 minutes.

Change 3: Use the SEQUENCE id generation strategy instead of IDENTITY

In our JPA entities we were using GenerationType.IDENTITY as shown below.

@Id
@GeneratedValue(strategy = GenerationType.IDENTITY)
public Long id;

In my Google research I found that the IDENTITY generator cannot use JDBC batching:

The only drawback is that we can’t know the newly assigned value prior to executing the INSERT statement. This restriction is hindering the transactional write-behind strategy adopted by Hibernate. For this reason, Hibernate cannot use JDBC batching when persisting entities that are using the IDENTITY generator.

Until then I had assumed JDBC batching was working. This was the real find.

I changed our id generation strategy as shown below.

@Id
@GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "todoSeqGen")
@SequenceGenerator(name = "todoSeqGen", sequenceName = "todoSeq", initialValue = 1, allocationSize = 100)
public Long id;

As I learnt from Vlad Mihalcea’s blog on the different generators:

A SEQUENCE is a database object that generates incremental integers on each successive request. SEQUENCES are much more flexible than IDENTIFIER columns because:

  • A SEQUENCE is table free and the same sequence can be assigned to multiple columns or tables
  • A SEQUENCE may preallocate values to improve performance
  • A SEQUENCE may define an incremental step, allowing us to benefit from a “pooled” Hilo algorithm
  • A SEQUENCE doesn’t restrict Hibernate JDBC batching
  • A SEQUENCE doesn’t restrict Hibernate inheritance models

The impact of this will be covered in the next change.

Change 4: Use Hibernate JDBC Batching

To enable JDBC batching we have to set the following properties in our LocalContainerEntityManagerFactoryBean bean definition (a sketch of the full bean definition follows the property descriptions below):

properties.put("hibernate.jdbc.batch_size", 30);
properties.put("hibernate.order_inserts", "true");
properties.put("hibernate.order_updates", "true");

  • hibernate.jdbc.batch_size: A non-zero value enables use of JDBC2 batch updates by Hibernate (e.g. recommended values between 5 and 30)
  • hibernate.order_inserts: This is used to order insert statements so that they are batched together
  • hibernate.order_updates: This is used to order update statements so that they are batched together
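
For reference, here is a minimal sketch of such a bean definition, placed inside a @Configuration class (the packages-to-scan value is a placeholder; in a Spring Boot application the same keys can instead be set via spring.jpa.properties.*):

import java.util.Properties;
import javax.sql.DataSource;
import org.springframework.context.annotation.Bean;
import org.springframework.orm.jpa.LocalContainerEntityManagerFactoryBean;
import org.springframework.orm.jpa.vendor.HibernateJpaVendorAdapter;

@Bean
public LocalContainerEntityManagerFactoryBean entityManagerFactory(DataSource dataSource) {
    LocalContainerEntityManagerFactoryBean emf = new LocalContainerEntityManagerFactoryBean();
    emf.setDataSource(dataSource);
    emf.setPackagesToScan("com.example.todo"); // placeholder package for the JPA entities
    emf.setJpaVendorAdapter(new HibernateJpaVendorAdapter());

    Properties properties = new Properties();
    // group inserts and updates into JDBC batches of 30 statements
    properties.put("hibernate.jdbc.batch_size", "30");
    properties.put("hibernate.order_inserts", "true");
    properties.put("hibernate.order_updates", "true");
    emf.setJpaProperties(properties);

    return emf;
}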

One more change that I made as part of this step was to use the correct way of persisting one-to-many bidirectional relationships.

In our case one TodoList has many TodoItems.

@Entity
@Table(name = "todolists")
class TodoList {

    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "todoSeqGen")
    @SequenceGenerator(name = "todoSeqGen", sequenceName = "todoSeq", initialValue = 1, allocationSize = 100)
    public Long id;

    private String name;

    @OneToMany(
        mappedBy = "todolist",
        cascade = CascadeType.ALL,
        orphanRemoval = true
    )
    private List<TodoItem> todoItems = new ArrayList<>();

    public void addTodoItem(TodoItem todoItem) {
        todoItems.add(todoItem);
        todoItem.setTodoList(this);
    }

    public void removeTodoItem(TodoItem todoItem) {
        todoItems.remove(todoItem);
        todoItem.setTodoList(null);
    }
    // equals and hashcode

}

@Entity
@Table(name = "todos")
class TodoItem {
    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "todoSeqGen")
    @SequenceGenerator(name = "todoSeqGen", sequenceName = "todoSeq", initialValue = 1, allocationSize = 100)
    public Long id;

    @ManyToOne(fetch = FetchType.LAZY)
    private TodoList todolist;

    private String task;

    private boolean done;

    // equals and hashcode
}

Now, to save, we do the following:

TodoList todolist = new TodoList("List 1");
todolist.add(new TodoItem("Item 1"));
todolist.add(new TodoItem("Item 2"));
todolistRepository.save(todolist);

I also changed the logic to collect all the new or updated todolists in one list and save them in a single batch.

todolistRepository.saveAll(todolists);
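
In the processing loop, this roughly meant accumulating the entities instead of saving each one individually (a simplified sketch; createOrUpdateTodoList is a hypothetical helper standing in for the duplicate and update handling shown earlier):

List<TodoList> todolists = new ArrayList<>();
for(Record record : records){
    // validation, duplicate, and updatable checks as before
    todolists.add(createOrUpdateTodoList(record)); // hypothetical helper
}
// a single saveAll call lets Hibernate group the INSERTs into JDBC batches
todolistRepository.saveAll(todolists);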

With the combined impact of the sequence generator, JDBC batching, and the one-to-many persistence changes, the processing time was reduced to 20 seconds for 10,000 records.

As mentioned earlier, we were able to process 100,000 records in under 4 minutes.

Conclusion

I hope this helps anyone who is trying to achieve better performance in their bulk upload process. This exercise again reinforced the point that we need to understand the technology we are using in more depth. We don’t have to change the database or switch from Hibernate to plain JDBC or anything like that. We just need to apply the best practices that are already documented by many other people.

2 thoughts on “Improving Spring Data JPA/Hibernate Bulk Insert Performance by more than 100 times”

  1. Hi Shekhar,
    I would have used Spring Batch to solve this problem. It basically solves the same problem you solved, in a fresh way, by prefetching data in batches. We can define the batch reader, processor, and writer components and configure the chunk size.

    Nevertheless, it is nice that you solved the problem yourself without using an existing framework.

    Regards,
    Sunil
