For 10 years I’ve worked on an application that has, as a major feature, the ability to import a customer’s data from another application we sell. Our customers can click a button to pull over information, review it, and ultimately display it to their customers. Nobody ever has to enter information twice, the system is set up with (sometimes custom) rules to determine what data our customers’ customers can see, and everyone is happy.
Well, of course not, but it does work rather well for the vast majority of our clients. However, some of our clients are very large and have appropriately large amounts of data they would like to import (sometimes daily). A multi-hour import is not great for their business flow.
The ‘problem’ with the import is that it is stupid. It, the import, gets everything from the System of Record (SOR) every time. Document whose title has recently changed? Imported. Fund value that hasn’t changed in 15 years? Imported.
At this point it seems like there are some obvious solutions:
- All the data is in the SOR’s DB, why not just get it straight from there?
- Why pull, why not have the SOR push data when it changes?
- Only import the entities that have changed.
If you’ve ever worked at a place where two applications share a DB, you probably know this is an infamous anti-pattern. Changing and tidying up a DB used by only one application is hard enough, but with two apps it is crazy easy to break everything by changing a column name or storing data in a different format. The natural outcome is that everyone is scared to change anything about the shared DB, so they don’t, and duplicate/unused columns and tables abound. Think of it this way: if a SQL database is global mutable state (and it is), then having a whole other application run by a different team that can change that state at any time is a recipe for disaster.
But that’s okay, we still have two perfectly valid solutions: only import the differences, or destroy the import altogether and rely on the SOR to push any changes.
The push proposal seems wonderful as the data would flow constantly and no one need ever again wait on a big import. Unfortunately there are a few problems with that. Our clients are publishing financial data to their customers and they really want to make sure that information is exactly what it should be before releasing it to their clients. One of the nice features of our preview and publish (PAP) app is the ability to see what the website will look like after they push the publish button. Our client can ‘preview’ as any of their customers and check to make sure they are seeing the correct documents and account data. If we adopted a ‘SOR pushes changes to the PAP’ strategy, the client would never be quite sure what they are publishing. Sure, our client could preview as one of their users and click around, but who knows if that data will change right before they press publish? The nice thing about a big dumb import is that it isn’t a moving target like push systems are. We have discussed having the SOR push changes to a ‘holding area’ that could then be imported, but then we’re back to an import. It would be a faster import because the holding area could store the data exactly how the PAP system likes it, but this plan is a pretty large project and we were looking for a quicker win.
We settled on only importing the differences. Which is, basically, caching, and as we all know, caching is one of the HARD PROBLEMS IN COMPUTERS. But really, how hard could it be?
Thusly tempting fate, we wrote 450 lines of Clojure that listens to a queue populated by the SOR with creations/updates/deletions, persists those changes, and can answer queries about what objects have changed over a given time period. Now the PAP system, at the beginning of an import, asks the microservice what has changed since the last import and only imports those changes. It is much, much faster and everyone was happy.
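To make the shape of that service concrete, here is a minimal sketch of the idea in Clojure. This is not our actual code; the names (`record-change!`, `changed-since`) are hypothetical, and the real service persists the change log rather than holding it in an atom.

```clojure
(ns change-tracker.core)

;; In-memory stand-in for the persisted change log: a vector of maps
;; like {:entity-id 42 :op :update :timestamp 1700000000}.
(defonce change-log (atom []))

(defn record-change!
  "Called for each creation/update/deletion message pulled off the queue."
  [change]
  (swap! change-log conj change))

(defn changed-since
  "Answer the importer's question: which entities changed after time `t`?
  Returns the latest op per entity so the importer knows whether to
  upsert or delete each one."
  [t]
  (->> @change-log
       (filter #(> (:timestamp %) t))
       (group-by :entity-id)
       (map (fn [[id changes]]
              [id (:op (last (sort-by :timestamp changes)))]))
       (into {})))
```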
Until the bugs came for us. First up was the code inside the SOR exploding because the serialization required to push changes out to the microservice would error out on certain objects only used by a small fraction of our customers, which is why this bug evaded testing. Then after that was fixed we found, of course, other ‘freak objects’ that didn’t respond well to serialization. That took our change service out of use in production a few times. Each time we’d put the clients back on the ‘dumb’ import, they would call in concerned that their import times had gone up, and we’d have to explain our issues. Embarrassing and not at all confidence inspiring.
After a few failures in front of our clients, the word came down that we needed to make these delta imports clever enough to know when the SOR wasn’t outputting proper diffs. We wrote code to automatically switch to a full, dumb import when we detected a bad state in the caching system. Good idea, right? Except that in a particular case our QA thought they were testing a cached import when they were really testing a full import, because parts of the testing environment were messed up and cached imports were magically becoming full imports. That bit of protection ended up hiding a not-small bug, which escaped into production.
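In hindsight, the fallback needed to be loud. Here is a sketch of that decision point, with hypothetical names (`cache-healthy?`, `import-deltas!`, and `import-everything!` are stubs), that at least announces which path it took:

```clojure
(ns importer.fallback)

;; Hypothetical stubs standing in for the real systems.
(defn cache-healthy? [client] true)
(defn import-deltas! [client] (println "importing deltas for" client))
(defn import-everything! [client] (println "importing everything for" client))

(defn run-import!
  "Use the fast delta path when the change cache looks sane; otherwise
  fall back to the big dumb import. Crucially, report *which* path ran.
  A silent fallback is exactly what let a broken delta import hide from QA."
  [client]
  (if (cache-healthy? client)
    (do (println "INFO: delta import for" client)
        (import-deltas! client))
    (do (println "WARN: cache unhealthy, falling back to FULL import for" client)
        (import-everything! client))))
```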
Our next bug had to do with determining which entities could be deleted during a differences-only import. The query to determine which objects to delete was complicated, particularly the part that figured out which linkages between objects to remove. It turns out it had been written in such a way that it unknowingly relied on a bug in how relationships between entities were removed. When we fixed the relationships bug, the query mistakenly destroyed a lot of links between objects. The data was there, but the missing links meant our system (the PAP) thought customers should not see that data and hid it from them. That issue was awful to debug, as no one had directly touched the delta handling code in quite some time, so it seemed impossible that it had stopped working because of a few releases that only added unrelated features and fixed what we thought were unrelated bugs.
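To give a flavor of why this is delicate (again a sketch with hypothetical names, not our actual query): one reasonably safe framing is a set difference, deleting only links that touch a changed entity and that the SOR no longer reports, rather than inferring deletions from the change feed alone.

```clojure
(ns delta.links
  (:require [clojure.set :as set]))

(defn links-to-delete
  "Given the links the PAP currently stores, the links the SOR says should
  exist, and the set of entity ids that changed since the last import,
  return the stale links: ones touching a changed entity that the SOR no
  longer knows about."
  [pap-links sor-links changed-entity-ids]
  (let [affected? (fn [{:keys [from to]}]
                    (or (changed-entity-ids from)
                        (changed-entity-ids to)))]
    (set/difference (set (filter affected? pap-links))
                    (set sor-links))))

;; Example: entity 2 changed; its link to 3 is gone from the SOR.
(links-to-delete #{{:from 1 :to 2} {:from 2 :to 3}}
                 #{{:from 1 :to 2}}
                 #{1 2})
;; => #{{:from 2 :to 3}}
```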
Now anytime we fix the speedy (difference-only) import system we have our client care people asking “Well, how do we know it is really working this time?” They are right to question us. This feature has produced way more bugs than any other feature in recent memory. Usually, after a feature has been out for a while, you can count on the bugs (if any) diminishing to zero over time. This speed-up-the-import feature, by adding a type of caching, is a permanent ‘sharp corner’ that all developers and testers need to be aware of. It will almost certainly be the direct or indirect cause of other bugs in the future, and we’re just going to have to live with that as we have embraced the speed and additional complexity of caching.
Coda: The Big Dumb Import stays in the picture.
We will always, if I have anything to say about it, have the big dumb import because it absolves all sins. It’s a state destroyer: the equivalent of turning it off and on again, which, as much as we make fun of the practice, is essential when dealing with global mutable state (aka a SQL DB). Not being the system of record has its perks.