What happens when teams integrate their work?

Integration is the real thing. It is where the rubber meets the road. Your teams can do their work in isolation, but that work has limited value until it is integrated with your application. The whole DevOps (CI/CD) movement is about integration. You want to integrate often and deliver value to the customer faster. You can’t deliver value if you don’t integrate.

It is sad that in 2022 we are still struggling to integrate our work. We have all the tools and processes that promote integration, yet I work with so many teams that struggle to integrate their work.

In this post I will cover the things that surface when you integrate. I am not giving advice on how to integrate. My only advice on the how is to make integration your top priority and do it. The sooner you do it, the better.

Let’s come back to the main topic of this post. The following is a list of things that I see happen when teams integrate their work.

#1. Your assumptions and expectations start to fail

We make assumptions about the systems we use and expect certain behaviours from them. We expect those assumptions and expectations to hold when we integrate. But more often than not, we are proven wrong. Dependent systems fail in ways we didn’t expect, and you only uncover those failures when you integrate. A few examples:

The downstream system starts returning error codes that you don’t understand. What should you do? Should you have generic exception handling for error codes you don’t understand?
You expect the downstream system to take less than 1 second to process a request, but it starts to take more than 5 seconds. Now what should you do? Should you time out? Should you time out a non-idempotent operation?
You assumed it was safe to retry a failed operation. But when you integrated, you realised that the downstream system was never designed to handle retries of non-idempotent operations. You assumed that their transactionRefId was some sort of idempotency key, but you were proven wrong.
You assumed the APIs were not rate limited, but it turns out they are.
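
To make the questions above concrete, here is a minimal sketch of a call policy that writes those assumptions down in code instead of leaving them implicit. Everything in it is hypothetical: the error codes, the `Result` type, and the `call_with_policy` helper are names I made up for illustration, and a real downstream system will define its own codes and semantics. The idea is simply that retries happen only for operations you know are idempotent, and unknown error codes go through one generic path instead of being guessed at.

```python
import time
from dataclasses import dataclass

# Hypothetical error codes; a real downstream system defines its own.
RETRYABLE_CODES = {"TIMEOUT", "RATE_LIMITED", "UNAVAILABLE"}
KNOWN_FATAL_CODES = {"INVALID_REQUEST", "UNAUTHORIZED"}


@dataclass
class Result:
    ok: bool
    code: str = "OK"
    detail: str = ""


def call_with_policy(operation, *, idempotent, max_attempts=3,
                     base_delay=0.1, sleep=time.sleep):
    """Call operation() and apply an explicit retry and error policy.

    operation is any callable that returns a Result. Non-idempotent
    operations are attempted exactly once, because a timeout does not
    tell us whether the downstream system applied the change.
    """
    attempts = max_attempts if idempotent else 1
    result = Result(ok=False, code="NOT_ATTEMPTED")
    for attempt in range(1, attempts + 1):
        result = operation()
        if result.ok:
            return result
        if result.code not in RETRYABLE_CODES:
            # Known-fatal or unknown code: retrying will not help.
            break
        if attempt < attempts:
            # Exponential backoff also gives a rate limiter room to recover.
            sleep(base_delay * 2 ** (attempt - 1))
    if result.code not in RETRYABLE_CODES | KNOWN_FATAL_CODES:
        # Generic handling for codes we do not understand; keep the
        # original code so someone can investigate it later.
        return Result(ok=False, code="UNKNOWN_DOWNSTREAM_ERROR",
                      detail=result.code)
    return result
```

A failed non-idempotent call comes back after a single attempt with its original code, so the caller has to decide how to reconcile it (for example, by asking the downstream system what happened) rather than blindly retrying.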

In any project we make such assumptions and expectations. They are only validated when we do the actual integration and testing. Hope is not the best strategy here.

#2. Hard dependencies become visible

Two types of dependencies become visible when you integrate:

Hard dependencies on other services: You learn that your whole system does not work without Redis, since the logged-in customer session state is maintained there. This was not apparent before.
Hard dependencies on people: Work is never divided and done equally. When you integrate, you find that the bus factor is low for certain areas. If those people are not available, nothing gets delivered and problems are not solved.
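
For the first kind of dependency, one way to make it visible earlier is to declare each dependency as hard or soft and let a readiness check report the difference. The sketch below is an illustration only: the `Dependency` type, the `readiness` function, and the probe callables are hypothetical names, and a real probe would make an actual network call to the service.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Dependency:
    name: str
    probe: Callable[[], bool]  # returns True when the dependency is reachable
    hard: bool                 # True if the system cannot serve without it


def readiness(dependencies):
    """Return (status, failed_names); status is 'ok', 'degraded', or 'down'."""
    failed = []
    status = "ok"
    for dep in dependencies:
        try:
            healthy = dep.probe()
        except Exception:
            # A probe that raises is treated the same as an unhealthy one.
            healthy = False
        if not healthy:
            failed.append(dep.name)
            if dep.hard:
                status = "down"
            elif status == "ok":
                status = "degraded"
    return status, failed
```

Writing the `hard` flag down forces the conversation about which services the system truly cannot live without, before integration forces it for you.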

#3. Communication anti-patterns become visible

Communication plays a key role in software engineering. To build real systems, communication and collaboration are critical, both within the team and with other teams. When you integrate, communication anti-patterns become visible. Examples of such anti-patterns include:

Siloed communication
- Teams communicate in silos. They focus more on 1-1 communication than group channels.
- Since people prefer, or feel more comfortable with, siloed 1-1 communication, information is not shared with the wider audience and knowledge sharing does not happen.
Lack of accountability
- When a bug requires multiple teams to solve, it is unclear who has to facilitate and how accountability will be shared.
- No one takes notes.
Confusing updates
- It is difficult to get proper updates from certain teams. It feels like you are moving in circles. Everything is great but nothing is working.
- People don’t give correct updates because they fear the updates can be used against them.

#4. Tools start to show their maturity

Installing tools has become easy now. With cloud managed services, package managers like Helm in the Kubernetes ecosystem, and a wealth of resources on Google, it is not difficult to install tools. But the problem is that installation is only the starting point. You still have to make those tools useful and make them perform at a certain availability level if you want people to rely on them. I have realised this is hard and it takes time.

A few examples:

As more and more CI and CD jobs run, your build and release agents start to become slow. More jobs are queued and perceived job time increases. You start to see connectivity issues often. Your bill for these resources starts to increase.
Your centralised log aggregation tool is not very useful because you have not created any useful dashboards in it. Logs are not parsed correctly, multi-line log messages are split across multiple documents, and exception stack traces are multiple documents long. This makes developers stop using your tools.
Deployment pipelines don’t handle rollbacks.
The resources (CPU, memory, storage) you have given to the tools are not sufficient. Tools go down and no alerts are sent. Now, when a developer needs to debug things, the tools are unavailable and the platform team is unaware.
Your tools are not made easily available to developers. They have to jump through multiple hoops to use them. This leads to poor adoption of the tools.
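
The multi-line log problem above is a good example of work that only starts after installation. Log shippers commonly offer a multiline setting for this, and the underlying idea can be sketched in a few lines. The sketch assumes that every new log event starts with a timestamp such as `2022-03-01 10:15:00`; that pattern and the `group_events` name are assumptions for illustration, so adjust them to whatever your applications actually emit.

```python
import re

# Assumption: each new log event starts with an ISO-like timestamp.
EVENT_START = re.compile(r"^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")


def group_events(lines):
    """Yield one string per log event.

    Lines that do not start with a timestamp (for example, stack trace
    frames) are folded into the event that precedes them.
    """
    current = []
    for line in lines:
        line = line.rstrip("\n")
        if EVENT_START.match(line) and current:
            # A new event begins, so emit the one collected so far.
            yield "\n".join(current)
            current = []
        current.append(line)
    if current:
        yield "\n".join(current)
```

With this grouping in place, an exception and its stack trace reach the log store as one document instead of several.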

#5. Lack of resourcefulness becomes visible

Things will go wrong during integration no matter what. To get back to a working state, teams need to be resourceful. When things go wrong, can you reach out to the right people for help? Do you know which books or websites can offer help? Can you think outside the box? Is there a different way to handle this right now? There are many ways to be resourceful and keep things moving. You have to look for answers in different ways and from different perspectives.

#6. Inefficiencies in the system become visible

You want a tool like Kibana to be accessible over VPN. In a distributed system you need these tools to debug the whole system. Making the tool accessible requires a firewall configuration change. You expected it to take less than an hour. But you discovered that it takes 3 days, because that’s how long your customer’s IT team takes for such requests. This issue was never visible before.
Team member onboarding takes multiple days since the customer’s IT department takes 5 days to create a login.

Conclusion

Not all teams are vocal. This is especially true when they are struggling to deliver. Until they integrate, they will tell you that everything is green and they are making good progress. But when they integrate, the real issues start to come to the surface. As I said before, integration has to be your top priority if you want to deliver on time.
