A conversation that continues to surprise me is the one that starts with “why don’t we just share the database”. I feel that, in the state of software development today, this is known to be a Bad Idea™ and that we should avoid it in the first place or, if we cannot avoid it, at least protect ourselves from the Bad Things That Can Happen. The evidence in front of me is that this is not universally believed, so I will endeavour to capture my thinking here. Hopefully this will help us discuss what the trade-offs look like for the situation we’re talking about, or at least give us the same frame of reference.
This turned out to be long. Let’s get started!
Why is database integration so appealing?
It is easy. The amount of effort to share your database with another application is trivial. It is that simple. One database connection string with authentication and we’re done. Business value delivered.
Breaking up a monolith into multiple services, sharing data between two apps, giving 3rd party tooling access to your data, or getting data into a data warehouse – all of these could be sped up by sharing your database as a quick solution to move forward.
This simplicity makes it very appealing. What can go wrong?
Writing from multiple places is not fun
There are many patterns that try to help avoid updates from disparate places. OO discourages direct updates through encapsulation – you cannot touch my data except through my interface. In DDD you can use the aggregate pattern in order to help reason about changes to data which must all go through the Aggregate Root. The functional solution is to make everything immutable. When using a pattern like Master Data – we’re trying to solve this problem by making an application / service the single source of truth. All of these mechanisms are intended to help us reason about software. Software that writes to the database from multiple different codebases can be very hard to reason about.
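As a rough illustration of what these patterns buy us – a hypothetical Order aggregate, not taken from any particular codebase – the point is that there is exactly one place in the code that is allowed to change the data:

```ruby
# Hypothetical aggregate root: every change to order data must go through
# this class, so validations and invariants live in exactly one place.
class Order
  attr_reader :id, :status

  def initialize(id)
    @id = id
    @lines = []
    @status = :draft
  end

  # Read access hands back a copy – callers cannot mutate internal state.
  def lines
    @lines.dup
  end

  # The only way to add a line item; the validation cannot be bypassed.
  def add_line(sku:, quantity:)
    raise ArgumentError, "quantity must be positive" unless quantity.positive?

    @lines << { sku: sku, quantity: quantity }
  end

  def submit!
    raise "cannot submit an empty order" if @lines.empty?

    @status = :submitted
  end
end
```

A second application writing straight to the same tables bypasses all of this – the rules live in code, and the database alone cannot enforce them.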
Writing data from multiple applications violates the Don’t Repeat Yourself principle. Knowledge about the tables, their defaults and their validations all needs to be shared across multiple applications. It is also hard for a developer to discover that a second application is writing to the table – a search in the repo they are working in will not find the access. This can result in applications writing different defaults, validations or data types, or not being aware of new columns or the meaning of new data.
A mitigation to this is to build a gem to share. This is great until we fail to keep all the applications on the same gem version, at which point we still have the same problems. We need to lock-step deploy every application using the gem to keep them in sync. We are now building a distributed monolith.
What about read only – surely that is fine?
If you’re doing it to experiment and learn if something (like a 3rd party tool) is valuable, then definitely learn with direct integration. After we’ve learnt, consider what the long term implementation is. The pain that grows with this solution is in the long term maintenance and changeability of the solution, not the short term win.
For a long term solution, if you never change the database schema at all, then this is the quickest, easiest and – if you’re 100% right – most painless mechanism. If you’re wrong, you’re probably in for pain. How much pain depends on how rapidly things need to change and potentially how much the consumer is actually interested in.
The biggest problem is that we don’t know what will happen in the future, so assuming that we will never change the database is almost always a naïve choice.
What’s the big deal?
When you share your database, you expose your entire schema to be coupled to. This is like exposing all of your data via an API where the API is every single SQL query that can be made. Any change made to the schema can impact the consumer. If you expose the entire schema, you may need to ask permission for any change to your database schema, because you do not know whether the change you are making will impact the consumer.
The database schema is the API contract
Any shared API is a shared contract that the team owning the API guarantees to keep for all consumers. When changing the API there needs to be negotiation and planning. When we provide our DB to another team to consume, the entire database becomes the contract that we guarantee to maintain. And any changes to that contract must be negotiated.
Why is that bad?
If we need to ask permission to change something it will introduce friction. We need to remember to talk to another team. We then need to actually talk to the other team. Other teams have their own priorities. Given the best will in the world, communication still takes non-zero time. And it will need to happen for every single change we make. This slows us down, particularly when we’re making a lot of changes.
Friction slows us down. Can we solve that?
The friction around needing to ask permission can provoke several responses:
- We could start to work really, really hard to not change data in the DB
- We could start to think really, really hard up front before we build anything
- We could start to think about mechanisms that can reduce the number of things we need to communicate about.
The first two options introduce negative feedback cycles. We spend more time in order to achieve them instead of less. And, given that things will change and we can’t know everything up front, we may make even more of a mess of the data when optimising for them.
Communication is important, so making sure the things that we’re communicating about are valuable is useful. It keeps the signal to noise ratio high. Starting to think of ways to reduce the number of things we are required to communicate about is useful, especially when that does not increase the time spent doing the new thing. Ideally we should only communicate when the consumer must change.
Can’t we mitigate these problems?
We can introduce mitigations.
We can introduce rules – like adding columns is always allowed. If we are allowed to add columns, that is one change we don’t need to communicate about. This is a net win, as consumers generally will not notice additive changes. We still need to make this rule jointly, as it is plausible that a consumer could notice and fail – a consumer doing SELECT * into a rigid downstream structure may be surprised by a new column.
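As a minimal sketch of what an “always allowed” additive change might look like – assuming a Rails app and a hypothetical customers table:

```ruby
# An additive change: existing rows get the default and existing queries
# keep returning the columns they always did.
class AddLoyaltyTierToCustomers < ActiveRecord::Migration[7.0]
  def change
    add_column :customers, :loyalty_tier, :string, default: "standard", null: false
  end
end
```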
The consumer could document what they are using. If we can find out what is being used, then we can self-serve. The only catch is – what if it isn’t up to date? Humans are fallible, so this may happen. This solution moves the work from the publishing team (needing to communicate every change) to the consuming team (needing to keep the documentation current). But we’re still doing that work. A failure may result in a negative feedback cycle – be better! Document more! Spend more time achieving something that will fail any time a human forgets.
Any mitigation that does not involve feedback from code / tests will potentially fail at some point. If we can introduce automated mechanisms that tell us when we’re breaking something, we can get ahead of that. Automation does not forget. If it is set up to run, it will run every time. If we can get automated feedback, then we know when we’re having an impact and can get ahead of the game.
A good mitigation might be to come up with solutions similar to those for HTTP APIs – perhaps some of the contract testing ideas could help. For contract tests, the consuming team writes tests that exercise the API in the way they use it. The team exposing the API cannot change these tests; their system needs to keep them passing. This is a great feedback cycle for exposed APIs. And then the two teams can negotiate about how their usage can change.
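Translated to a shared database, a consumer-owned contract test might look something like this – an RSpec sketch, with a hypothetical orders table and reporting consumer, written by the consuming team and run in the owning team’s CI:

```ruby
require "active_record"

# The consuming team describes exactly how they read our data; the owning
# team runs these in CI and must keep them green or start a conversation.
RSpec.describe "reporting team contract: orders table" do
  let(:connection) { ActiveRecord::Base.connection }

  it "exposes the columns the reporting app selects" do
    column_names = connection.columns("orders").map(&:name)

    expect(column_names).to include("id", "placed_at", "total_cents", "currency")
  end

  it "answers the daily totals query the reporting app runs" do
    result = connection.select_all(<<~SQL)
      SELECT date(placed_at) AS placed_on, SUM(total_cents) AS total_cents
      FROM orders
      GROUP BY date(placed_at)
    SQL

    expect(result.columns).to contain_exactly("placed_on", "total_cents")
  end
end
```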
The negative part about defining a contract on the base schema of a database is that the base schema represents the core model that we’re trying to build and invest in. This is where we’re experimenting and learning, internally to the team and the system. Having that coupled to an external consumer makes that experimentation harder.
Is there a better way?
Could we define the contract explicitly? Could we make it work like an HTTP API? At that point potentially all the rules and expectations that we have around HTTP APIs come into play.
An option could be to populate tables or provide views that the consumer uses. The internal team does not use these tables or views but signs up to them being the API contract for the consumer. This means that we can now apply all the standard mitigations around APIs that we are exposing outside of our system. This becomes a known space to work in. This does not make it easy, but it does define the deliberate place where we can support the integration to our database while allowing us to retain control and freely refactor the rest of the database as we wish.
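For example – a sketch assuming PostgreSQL and a Rails migration, with a hypothetical orders_for_reporting view – the contract is defined once, deliberately, and nothing internal depends on it:

```ruby
# A view that exists purely as the consumer's contract; the internal team
# never queries it and only changes it by agreement with the consumer.
class CreateOrdersForReportingView < ActiveRecord::Migration[7.0]
  def up
    execute <<~SQL
      CREATE VIEW orders_for_reporting AS
      SELECT o.id              AS order_id,
             o.placed_at::date AS placed_on,
             o.total_cents,
             o.currency
      FROM orders o
      WHERE o.status <> 'draft';
    SQL
  end

  def down
    execute "DROP VIEW orders_for_reporting;"
  end
end
```

The internal orders table can then be refactored however we like, as long as this view keeps producing the agreed columns.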
A key idea around refactoring is to build the simplest solution that we can safely refactor out of. Coupling to a defined, contained interface as a contract feels like a solution that can enable safe refactoring of the underlying code design while keeping the defined contract working.
What does this solution lead to?
Suddenly we have reduced the scope of questions we need to ask when we change the database. We now know we are fine if the change in the database does not relate to one of the contracted tables / views. Suddenly a whole category of friction is removed. We can write tests to give us feedback when we are accidentally changing those tables or views. We do not use them in our systems so we have no reason to change them unless we are changing them at the request of the consumer. Another radical idea would be to use the Dependency Inversion principle and allow the consumer to define what the data should look like. It is for them after all.
This sounds like a lot of work. Can’t we just use database refactoring techniques to solve this?
The mechanisms for database refactoring are awesome. They make your database more fluid just like code refactoring makes your code more fluid.
When you own all of the code, code refactoring flows in small steps that allow you to incrementally deploy software over and over again. It is an awesome engineering feat.
When you own all of the database, database refactoring allows changes to the database to happen in small steps, letting you incrementally deploy changes across the software and database over and over again. It is another awesome engineering feat.
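For instance, a column rename done as an expand/contract sequence rather than one breaking change – a sketch assuming Rails migrations and a hypothetical fullname column becoming full_name:

```ruby
# Step 1 – expand: the new column appears alongside the old one.
class AddFullNameToCustomers < ActiveRecord::Migration[7.0]
  def change
    add_column :customers, :full_name, :string
  end
end

# Step 2 – backfill (with the application dual-writing both columns in the
# interim), so every deploy along the way remains releasable.
class BackfillFullName < ActiveRecord::Migration[7.0]
  def up
    execute "UPDATE customers SET full_name = fullname WHERE full_name IS NULL"
  end

  def down
    # Nothing to undo – the old column still holds the data.
  end
end

# Step 3 – contract: once nothing reads the old column, remove it.
class RemoveFullnameFromCustomers < ActiveRecord::Migration[7.0]
  def change
    remove_column :customers, :fullname, :string
  end
end
```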
Code refactoring slows down and has friction when the edge of the code that you are refactoring is shared by another consumer – for instance, an HTTP API or a class exposed by a Ruby Gem. The steps to quickly change the system run into friction around communication, planning and the potential that you might need to wait a long time before you can finally remove the interim code that is helping make the progression from one form of the system to another.
Database refactoring slows down and has friction when the database that you are refactoring is shared by another consumer. The steps to quickly change the system run into friction around communication, planning and the potential that you might need to wait a long time before you can finally remove the interim code that is helping make the progression from one database design to another.
Database refactoring is really useful for helping understand how to change a legacy database shared by another consumer out of your control. It moves it from static and scary to change, to slow moving and more malleable.
Database refactoring is really useful for helping speed up your changes in data when you own the whole stack. It allows you to incrementally change the whole system in a far more fluid way.
Database refactoring doesn’t solve the communication and waiting overheads that come with co-ordinating with other consumers. You can only clean up your refactoring as fast as the consumers follow your changes. You still need to communicate about every change in case it has an impact, or introduce mitigations for that as described above.
Is not sharing the database an option?
The first prize would be to not share the database at all. If we do that we remove a category of conversation and design problem that we no longer need to think about. If we don’t take a dependency, we don’t have to deal with it.
But one must assume that the sharing of data is necessary, otherwise we would not be contemplating it in the first place. In which case we need to be able to build something to provide that data. Whether it is an API to be called; or we write data to another database and that is the contract; or we event it and other code reads it and makes it available in another database and that is the contract – something needs to write the data somewhere for the other application to gain access to it. What makes most sense will depend on what that application is. But something will need to be built to provide the data.
Another problem… data is not always meaningful without code
The data in an application’s database might need code to make it mean something to the business domain. Keeping this in sync across multiple applications accessing the database directly can also lead to pain and violates DRY. Deliberately providing data that has meaning is far more useful – whether that is written directly into database tables where the interpretation has already been applied, or published out via an API that uses the code to do the interpretation.
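A small, hypothetical illustration – the raw value in the column means very little until the owning application’s code interprets it:

```ruby
require "active_record"

class Subscription < ActiveRecord::Base
  # In the table, `state` is just a single character.
  STATES = { "a" => :active, "g" => :grace_period, "x" => :cancelled }.freeze

  def state_name
    STATES.fetch(state, :unknown)
  end

  # "Billable" to the business is a rule in code, not a column in the table.
  def billable?
    state_name == :active ||
      (state_name == :grace_period && grace_ends_on >= Date.today)
  end
end
```

A consumer reading the table directly either re-implements this logic (and drifts) or misreads the data.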
Some practical history and burnt fingers
I have learnt this lesson painfully in the past. The clearest example is an application which was exposing a new API to replace an old one in a large monolithic app. The desire was to build a model that better represented the domain. The implementer was going to build it in a new Rails app. It would be the new API and it would represent a better view of the system. At the same time, it was viewed that this API would not change after it was released. So when reviewing and conversing about the solution and how it would be maintained, the fateful “it is not going to change, so let’s move forward and not worry about maintainability now” was stated.
The decision in this instance centred around sharing the model as a Gem between the two apps, or otherwise sharing the data as views and having the new app consume the views. Having a model as a Gem for a concrete implementation of data is not a great idea: it means that all consumers must always have the latest gem deployed, as that is the latest representation of the DB. Hence any time a DB change happens, all consumers containing the gem should be redeployed – even if the change has no bearing on a given consumer, they need to keep in sync.
In this instance we chose to share views to share the data. We chose to make the primary app own the DB schema – as otherwise we had multiple apps owning the DB schema and we could get into trouble managing changes to the DB. But that meant any view change required for the new API needed to happen in the primary app first. It would also need to be deployed to live before the API code could consume it.
If you recall the “it’s not going to change” comment – it turned out six months later that we thought this API was useful and started to use it, both directly and from a search service. Long story short, we did change the data a lot and this design choice became a real pain. Changes needed only for the API’s data caused multiple repos to change and be deployed. Deploys could be painful, as sometimes the different database schemas were out of sync; deploying an older version would downgrade the views for the API consuming them. Testing could be painful, as we needed a matching test database based on the main application so that the views were up to date. There were lots and lots of little niggles due to the coupling of the implementation.
The worst part of this choice was that when people moved onto the project they struggled to understand the coupling. They would forget the process, as too many things were solved by “we should remember to do XX”. The solutions too often relied on humans being better, not on systems providing feedback when the humans inevitably did the wrong thing. And it is hard to provide automated feedback about a change in one repository that affects another one.
When we did this I was working with the best bunch of agile software developers I have worked with. It still caused us a lot of pain and we owned all parts of code accessing the database.
Change is inevitable, how fast you embrace it is up to you.
I work in fluid environments that accept that code will change. I accept that we don’t know everything. We will change the systems we build as new knowledge and designs emerge. Any team that needs to co-ordinate with another team to ask permission to make a change slows down. There will be friction in refactoring at the shared edge of our systems. I prefer to live in hope that we will be able to make more and more changes, to experiment and innovate with new products and ideas, simply because we choose not to embrace solutions that will slow us down. Solutions that create recurring work aren’t great. Solutions that stop us from having to do a certain category of work speed us up. Work not done is time available to do other useful work.
Prizes… or is that prices?
First prize – don’t integrate at the database. If we do not do it, we do not need to read this entire complex post that touches on many nuances and implications of doing this. And inevitably there are more subtleties I haven’t even touched on. If we don’t do it, we don’t have to think about what it means to do it.
Second prize – integrate at the database with a completely static data schema. This is probably both unlikely and not necessarily very useful for a company that is driving innovation. But that is the only time that integrating blindly is not going to cause pain.
Third prize – integrate, but do not give up control, or give up as little control as possible. Look at how you can reduce friction as much as possible without generating repetitive work, and especially without creating work that humans need to remember to do or that requires us to simply “be more better”. This is beneficial for both the data owner and consumers in the long run.
I default to first prize and hang on to it for as long as I can.