Matthew Lancaster - using event-driven architecture to transform core banking

Written by Andrea Passwater

Emit is the conference on event-driven, serverless architectures.

Matthew pulled us out of the tech stack for a second to focus on what's beneath it, the foundational layer of the application pyramid: business drivers. This, he says, is what will turn everyone event-driven. Non-tech companies, even in old-school industries like banking, are increasingly becoming tech-centric.

Matt's been working with core banking at Accenture for a while, and it comes with lots of legacy challenges. You can't just wipe everything clean and start over in a greenfield project; you have to respect and learn to integrate with legacy systems, some of which are well over a decade old.

He's got some practical advice from his own forays into bringing banks into the future. Watch below or read the transcript for the juicy details.

#More videos:

The entire playlist of talks is available on our YouTube channel here: Emit Conf 2017

To stay in the loop about Emit Conf, follow us at @emitconf and/or sign up for the Serverless.com newsletter.

#Transcript

Matt: All righty, so I think we can take a couple of things for granted, just as industry trends. And I wanna take us a little bit outside of the technology for a second and just talk about the sort of business drivers that are behind a lot of this stuff, and why in many cases we need to move to an event-driven future in a lot of traditional industries, otherwise they're going to be disrupted, cannibalized, and something else is gonna come out of them, right? So there's this accelerating trend for every company to become a software company, especially in financial services where the product was already sort of at arm's-length, in many cases. Our friend from Capital One earlier can probably tell us quite a bit about that as well.

And then just, you know, creating more interactive systems, creating new financial products and actually getting those out to the marketplace quickly. Responding to regulators in an agile way, right? All of this stuff has become an increasing challenge as they have adopted what I like to call the architecture of the Gordian knot, right? But really, everything are these big monoliths that are all interdependent and intertwined and kind of gross. So we need to make sure that we shift to nicely decoupled event-driven systems that can be released quickly, you know, microservices, functions, all the good stuff we've all talked about, right?

The challenge in a pretty traditional industry, in that case, is that you have systems that not only are 30 years old or 40 years old in many cases, but that have been continuously developed for 30 or 40 years, right? A lot of the early adopters of computing in the '60s, '70s and '80s built a lot of these big mainframe systems, a lot of complex business rules and business logic, and logic that has been sort of updated slowly for an evolving regulatory environment and an evolving product environment, and an evolving way that customers expect to interact with the bank, right? And a lot of this logic has been there for a long time or has been updated, and you don't necessarily know what touches what, right? It's sort of our standard DevOps problem where, "What's your unit test coverage on this code?" We don't know. Somewhere close to 0%, right?

So how do we actually build really interesting, cool products on top of that, remain relevant in the marketplace, but still, you know, still have to deal with the anchor behind the speedboat, so to speak? So a couple of things to keep us grounded. We can't replace Legacy with Greenfield quickly, even though we'd all like to, but we do need the ability to build on top of what is already there to move fast. So that's gonna become one of our core business problems to think through and our core technical problems to think through.

In terms of, you know, customers demanding better experience, we all know that stuff. And the regulation's here to stay. We can't get around that. So, you know, when we talk about things like continuous deployment, continuous delivery, that doesn't necessarily exist in highly regulated industries because you have to literally have someone to sign off on certain things, right? So you have to work that into the process as well. So let's actually talk through this. So we have...you know, here's the business situation of nearly every bank on the planet right now. You have mainframes that are still running a large amount of the backend business logic. A large amount of the business is actually run through that, so trillions of dollars of transactions. Mainframe costs are going up every year, right, steadily, generally about 5% a year. That's a Gartner number. That's not mine.

Then we have an explosion of different devices and different channels to actually interact with your financial information through this. And not only through your phone, your tablet, your laptop, but also through different services, like a lot of the stuff that will help you find a credit score or help you plan for certain savings and stuff like that. Those are all accessing your banking information, right? Actually getting that stuff out of there means getting more information out of that mainframe environment in many cases. And we're kind of stuck in a catch-22 in the industry, which is write operations cost just as much as read operations. And we all pay the mainframe vendors for the privilege of using our own systems, right?

So that's one of those major cost areas that we can actually attack immediately and get a bit of breathing room to start to innovate on top. But how do we unlock the data from that environment? And then we also have very slow innovation in many cases. A lot of that stuff, like I said before, you know, the architecture's all put together, the teams are structured in such a way that it's very difficult to get anything done. And you end up having sort of security theater environments that don't necessarily make our infrastructure and architecture more secure, it just makes it more difficult to do our jobs. So I'm not gonna cover all of that right now. What I do wanna talk about is a particular business case.

So I have a team in Austria that...some of the technology's a little dated because it was 2014/2015, but I think it's a really interesting business case of how to move to microservices, how to move to a really event-driven streaming architecture and still sort of coexist and build the plane in the air with the existing systems. So one of the things we took for granted is that we're not gonna rewrite and extract all of that business logic right away that exists in the mainframe. So the write activity, at least in phase one of the overall program, the write activity needs to stay where it's at, right? So anytime we actually make a transaction, it needs to go back through the mainframe, it needs to go back through all that nasty business logic, and actually post somewhere, right?

But for read activity, we don't actually have to go back to DB2, right? We don't have to go back to the big database. We can do something more interesting. So what we ended up doing in that situation was put a little reader sitting on top of the commit log for DB2, right? So all databases are really only three things, right? They're the data and big binary blobs that are sitting in various storage partitions, the actual application logic of the database, and then there's a big, essentially glorified text file that is actually the single source of truth for the database, whether it's Oracle or DB2 or any of the old SQL databases, right? It's just a big insert, update, read, etc. All those operations are stored in that commit log. You play back from it when you roll back, etc., etc. So if we read the changes directly to the commit log and we sit it really close, and we re-replicate those changes out to, in this case, it was Hadoop. I probably would do it with something more interesting today, but re-replicate out to Hadoop. By the time the record was unlocked by DB2, most of the time it was already replicated out, a sub-second replication for inserts or for changes to the Legacy database.
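The log-reading approach Matt describes can be sketched roughly as follows. This is an illustration only, not the team's implementation: `LogEntry`, `ReadReplica`, and `replicate` are hypothetical names, the replica is a plain in-memory dictionary standing in for the Hadoop/HBase store, and the entry format is an assumption rather than DB2's actual log format.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class LogEntry:
    """One committed change, as a hypothetical log reader might emit it."""
    lsn: int                    # log sequence number: position in the commit log
    op: str                     # "insert", "update", or "delete"
    key: str                    # primary key of the affected row
    row: Optional[dict] = None  # new row image (None for deletes)


class ReadReplica:
    """In-memory stand-in for the replicated read store."""

    def __init__(self) -> None:
        self.rows: Dict[str, dict] = {}
        self.applied_lsn = 0  # highest log position applied so far

    def apply(self, entry: LogEntry) -> bool:
        # Skip entries already applied, so replaying after a restart is harmless.
        if entry.lsn <= self.applied_lsn:
            return False
        if entry.op in ("insert", "update"):
            self.rows[entry.key] = dict(entry.row or {})
        elif entry.op == "delete":
            self.rows.pop(entry.key, None)
        else:
            raise ValueError(f"unknown operation: {entry.op}")
        self.applied_lsn = entry.lsn
        return True


def replicate(entries: Iterable[LogEntry], replica: ReadReplica) -> int:
    """Apply entries in log order and return how many were new."""
    return sum(replica.apply(e) for e in sorted(entries, key=lambda e: e.lsn))
```

Tracking the last applied log position is one simple way to make replays safe; read traffic can then be served from `replica.rows` while writes continue to go through the mainframe.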

So now we have a full replica of that database that we can start to do interesting stuff to because we have it in a really fast data environment, right? So all of this stuff is sitting on top of HBase. We can build some microservices running off the same JVMs because HBase is pretty...it sips CPU from the nodes, so we can sort of have co-tenant architecture there. These folks were in their own data centers and in rented data centers because the regulations in the EU prevented them from being in the public cloud. That may no longer be true soon enough, and we'll be able to use a lot of cool AWS stuff. But for the time being, it was locked in their existing environment.

So one of the other interesting things here is that since we can stream this data out, suddenly all of those transactions we can feed to an event log, and then we can start to attach and do more and more interesting things with it. We'll get into one of those particular business cases in just a second. So when we did this, it actually took about seven months and eight people, right, to pull all of this out. I'll share some of the business results of it in a second, which were actually really, really interesting. One of the things that happened in the middle of this program was that a new EU regulation came down that had much more stringent fraud detection requirements for basically your commercial banking transactions, right? So most of the competition set for a client was running around with their...you know, running around like the world was on fire because they were only given seven months to implement this new fairly stringent fraud detection regulation.

And you can imagine in that world, that's actually at least to their traditional waterfall mindset with really big release cycles and five-month in many cases integration testing cycles for core system changes, that's a huge, huge deal. So we were actually able to do that in three-and-a-half weeks because we just read the changes on the new data environment, read in UI changes from the mobile apps and from the web, and then looked for fraudulent activity and then we were able to kick people out from there, right? So we used our read copy of the data to do a new business function that wouldn't have been possible without essentially sort of grappling a nice event-driven architecture, nice set of microservices on top of the Legacy architecture and sort of slowly starting to build value on top of that, right?

And it was actually kind of interesting because when you look at it, if somebody is accessing their account in Vienna and they're following sort of their normal patterns, they're probably just fine, right? If they're accessing it from Thailand and they're behaving really weirdly about filling out a mortgage application, you may actually want to engage...you know, pull the emergency brakes there, right? So actually being able to look at that data and look at where they're at, all of that is essentially data exhaust of write activity and then the read copy that we have, right? So we can suddenly start to do much more interesting things.
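As a rough illustration of the Vienna versus Thailand example, a rule of this kind could look like the sketch below. The talk does not describe the actual detection logic, so everything here is assumed: `AccessEvent`, `FraudScreen`, the country codes, and the idea of flagging only sensitive actions from unfamiliar locations are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class AccessEvent:
    customer_id: str
    country: str  # where the session originates
    action: str   # e.g. "view_balance" or "mortgage_application"


class FraudScreen:
    """Flags sensitive actions that come from a place the customer has not used before."""

    def __init__(self, sensitive_actions: Set[str]) -> None:
        self.sensitive_actions = sensitive_actions
        self.usual_countries: Dict[str, Set[str]] = {}

    def learn(self, event: AccessEvent) -> None:
        # Build up the customer's normal pattern from past activity in the read copy.
        self.usual_countries.setdefault(event.customer_id, set()).add(event.country)

    def is_suspicious(self, event: AccessEvent) -> bool:
        # A customer with no history counts as being in an unfamiliar place.
        known = self.usual_countries.get(event.customer_id, set())
        unusual_place = event.country not in known
        return unusual_place and event.action in self.sensitive_actions
```

For example, after `learn`-ing a customer's sessions from Austria, a mortgage application from another country is flagged while a balance check is not.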

So on top of that, some of these microservices started to extract business logic for new products, business logic for making modifications to existing products to make them a bit more customer-friendly and more user experience-friendly. We can slowly extract that stuff out of the mainframe because we've isolated a lot of the big mainframe components such that it becomes a routing problem to move around them as opposed to actually changing anything in COBOL that somebody who's now 90 and retired in Florida wrote, right? So some interesting stuff can go on there. And the happy accident of this, or actually the original business case, the speed to market and all that, was the happy accident. The original business case was just reducing mainframe cost. But within that first 7-month project, the mainframe cost was reduced by 50% because we reduced the CPU load on the mainframe by 50% just because we rerouted all of the read transactions that no longer needed to be there, right? So the first project paid for itself and it paid for the second project in just reduction in OPEX, right?

So the message that I wanna leave you with, with that, is that in any industry there's really clever things that we can do with the tools that we have at hand, with a lot of the patterns that we've been talking about all day, a lot of the technologies to not only create new, innovative things, not only to essentially, you know, sort of flip how we're doing the user experience and the digital engagement of a lot of these customer-facing systems, it can also be a cost play, and it can also be a time to market and sort of, you know, mature play, right? That becomes a very, very powerful thing when we're trying to negotiate for money for a few shekels from the business in many cases. Well, you know, if we do this, we can also reduce costs and have a 12-month payback period. That starts to become music to a CFO's ears, right?

So that got us to thinking, "What if we could replicate this kind of success on top of multiple mainframe-ish or big JEE, or big sort of traditional monolithic environments," right? So how would we actually design a reference architecture where we could have a repeatable process to build on top of that? And right now, it's mostly focused in the banking world. I have a few folks who we're working with to actually implement this, but it should look strikingly like a lot of the things that we've already discussed today. You can see a set of microservices that are handling REST transactions, but they're only communicating with the back-end through an event stream. It kind of sounds like an event-sourcing pattern, right? You have a set of utility services that are listening to the event stream and then acting upon the rest of the system, and then we're able to actually, you know, have real-time analytics, real-time, say, next best offer.

So if you can imagine the real-life scenarios of this, if you're filling out a mortgage form but we can tell halfway through that you don't actually qualify for what you're filling out, we can give you something that you do qualify for and potentially keep you as a customer because we're listening to the events that are coming off of you filling out that form. Or we can have a customer service agent literally share the same screen with you because we're just capturing the DOM events, right? Because all of this is...you know, they're synchronous transactions as far as submitting forms, what have you, but most of it is streamed over either WebSockets or MQTT or what have you.
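The "utility services listening to the event stream" idea can be sketched with a tiny in-process stream. This is only a stand-in for a real streaming platform, and the names (`EventStream`, `NextBestOffer`), the event shape, and the qualification rule are made up for illustration; the talk does not specify any of them.

```python
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


class EventStream:
    """In-process stand-in for a real event stream."""

    def __init__(self) -> None:
        self.log: List[Event] = []
        self._subscribers: List[Callable[[Event], None]] = []

    def subscribe(self, listener: Callable[[Event], None]) -> None:
        self._subscribers.append(listener)

    def publish(self, event: Event) -> None:
        self.log.append(event)  # keep the full history so it can be replayed
        for listener in self._subscribers:
            listener(event)


class NextBestOffer:
    """Hypothetical utility service that watches form events as they arrive."""

    MAX_INCOME_MULTIPLE = 4  # made-up qualification rule, for illustration only

    def __init__(self) -> None:
        self.forms: Dict[str, Dict[str, float]] = {}
        self.offers: Dict[str, float] = {}

    def on_event(self, event: Event) -> None:
        if event.get("type") != "field_changed":
            return
        customer = str(event["customer_id"])
        fields = self.forms.setdefault(customer, {})
        fields[str(event["field"])] = float(event["value"])
        if "income" in fields and "amount" in fields:
            limit = fields["income"] * self.MAX_INCOME_MULTIPLE
            if fields["amount"] > limit:
                self.offers[customer] = limit  # suggest an amount they do qualify for
            else:
                self.offers.pop(customer, None)
```

The form-handling service only publishes events; it never calls the offer service directly, so further listeners can be attached without changing it.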

So a couple of other interesting things have come out of that. So it's specifically in the financial services world, but I suspect in many other places as well there's a really strong, almost religious attachment to the concept of a session, right? You need that sort of transaction integrity, right? I see laughing from the other banking guy here. It's like, "Yeah." I suspect that some of the older VPs in many banks have an altar somewhere in their house where they sacrifice to the session gods, right? But they actually have a point, right? Because you need to be able to guarantee transaction integrity, you need to be able to play it back, you need to be able to send transactions in order to regulators so that they know everything's on the up and up. There's a lot of heavyweight behind that. And frankly, that's sort of, you know, it's how a lot of the back-end business processes work. So we need to have, you know, nice double handshake ACID-compliant transactions in many cases. But what does the concept of sticky sessions leave us with, with our services, with the rest of our architecture? It has a massive tradeoff, right? We find it very, very difficult to be scalable, to be distributed, to have developers work on isolated pieces, right?

So what we're doing here is a little bit different as we have this thing up at the top that we call the reactive API gateway. I've already been talking to the Serverless folks about doing a little bit of integration with what they're doing. But what we care about here is, number one, that we can apply standard sort of API gateway policies, RegEx security policies etc., to streaming connections as well as REST in a really, really lightweight way, and keep track of what customer or what node the particular set of transactions is connected to and be able to order those, play them back in order, send them out across the stream, etc., without having any concept of sticky session in the rest of the system. So we used probably one of my favorite sets of technologies that in the enterprise world they haven't got to use a whole lot. You know, I tend to be a PowerPoint engineer these days a lot and I got to actually write some Erlang code for the first time in like three years, and I was super happy about it.

And then my whole organization was like, "Hey, can you come up for air and actually answer our questions and do stuff?" It's like, "No, I'm coding. It's fun," right? But we wrote that in Elixir, built a lot of interesting stuff there. We're keeping track of the transactions via CRDTs so, you know, keeping them ordered, getting them tagged to a particular customer. I always like to call it a customer-centric architecture because everything's based around the customer transactions, whether it's commercial or whether it's, you know, personal commercial banking or whether it's, you know, business banking, etc.
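To make the "ordered, customer-tagged transactions without sticky sessions" idea concrete, here is a minimal sketch using a grow-only set, one of the simplest CRDTs, with a logical clock for ordering. It is written in Python rather than Elixir and is not the gateway's actual design; `Txn`, `TxnLog`, and the node names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True, order=True)
class Txn:
    counter: int      # logical clock value assigned by the receiving node
    node_id: str      # tie-breaker so the ordering is total
    customer_id: str
    payload: str


class TxnLog:
    """Grow-only set of transactions; merging two replicas is a plain set union."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.clock = 0
        self.txns: FrozenSet[Txn] = frozenset()

    def record(self, customer_id: str, payload: str) -> Txn:
        self.clock += 1
        txn = Txn(self.clock, self.node_id, customer_id, payload)
        self.txns = self.txns | {txn}
        return txn

    def merge(self, other: "TxnLog") -> None:
        # Union is commutative, associative, and idempotent, so replicas converge.
        self.txns = self.txns | other.txns
        # Move the clock past everything seen so later local events sort after it.
        self.clock = max([self.clock] + [t.counter for t in other.txns])

    def playback(self, customer_id: str) -> List[str]:
        mine = sorted(t for t in self.txns if t.customer_id == customer_id)
        return [t.payload for t in mine]
```

Because any node can accept a customer's next request and the logs still merge into the same ordered history, no request has to be pinned to the node that handled the previous one.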

The other interesting thing here is this re-replication system. We've sort of industrialized that a little bit so that we can get data out of the Legacy system. And oftentimes the communication pattern back with the mainframe is we'll drop...if it's queue-based we'll drop a single message on the queue and do sort of a micro-batch, or communicate with it over actually hooking into the transaction manager, which is actually fairly parallelized itself. It's just an internal communication mechanism, so we can do a couple of different things. But what that ultimately does for us is it separates out the different sort of core pieces of the back-end architecture and allows us to, you know, treat them separately, right? And when we start to pull out functionality and deliver incremental value to the business, right, when they ask for X to be done in X amount of time with different systems of record, etc., we can pull that stuff out, eventually sort of reroute away from the mainframe, and maybe three or four years down the road have the Holy Grail conversation in most traditional enterprises, which is maybe we can shut this thing off, right?

So one of the big things to kind of talk through there, I have another client where we're doing this in hospitality space and a couple of my colleagues back there are familiar with, where we're already two years down this journey and we've turned off half of the mainframe environment. And two years from now we'll be able to turn off the rest of it, right? But it was largely through the initial set of cost savings that we were able to bring in the initial sort of speed to market around some of their products, integrating new brands, doing new experiences for some of their hotel brands that were focused toward younger customers, where they...you know, we all expect a little bit more high-tech approach. So moving reservations and loyalty to a set of services, sitting in front of a big Kafka Stream, and then eventually hooking that into the rest of the customer-facing systems so we can make really interesting intelligent decisions, like maybe if you were coming close to the hotel and you were ready to check in that day, we can check you in and have somebody greet you by name rather than you walking up to the desk.

And it's always funny, in sort of the coded business speech that a lot of these folks have. When you walk up to the front desk and somebody says, "Hi, how are you doing? Can I see your ID?" It's really code for, "Who the hell are you and why are you here," right? Which is not a very hospitable interaction, if you think about it, right? It breaks the immersiveness and kind of the customer experience in that space. So if we can use all of the intelligence that we have to actually, you know, talk to you and know who you are, and already have you checked in, already have your rewards amenity ready, etc., etc., we're using the same event-driven paradigm that we were talking about in terms of the technology side of it to sort of hack the business side of it as well, right?

A colleague who's here, I think one of his favorite things to say around this is maybe we can get rid of the front desk entirely, right? Why do we need queues in real life when we can have parallel streams and just serve people quickly, get them what they need and get them on their way? So, you know, how would we think about hooking onto those back-end systems in other industries, in other places, using some of these concepts of the standard reference architecture, pulling out some of the business functionality piece by piece, right, putting it into whether it's functions or microservices.

I think in that case, most of the time it really doesn't matter, right? As long as we have a good set of patterns, and we've usually chosen a few technologies that we're gonna coalesce around. I see a lot of these efforts fail both in the startup space and in enterprise where when we move to microservices it becomes the Wild West, and there's 75 different technologies that everybody's picking up. You know, there's a bunch of stuff sitting on Lambda, there's a bunch of stuff that some guy that is in the basement of one of the offices wrote in Golang for some reason, and then there's a bunch of Node stuff and a bunch of legacy .NET stuff, and then suddenly there's 80 different pipelines for all of this, and you have Docker or you have, you know, your event gateway. You have all kinds of stuff to manage the spaghetti that you've just built for yourself. You know, there's good, solid patterns that we can attach a few technologies to, and then sort of industrialize and repeat, right?

And I think the hardest thing here is that moving in this direction, once we get the right technology sitting on top of the Legacy applications, and in many ways in the Legacy business, right, is to...you know, the technology is one leg of the three-legged stool. We have to figure out how to make sure that the delivery process and the engineering systems are in line, but I think more importantly, we need to work with the business folks and actually get them working in the same way where they're focused on the individual product areas rather than, you know, these big sort of integrations. It's not an easy thing, right?

But when we can initially deliver, "Okay, here's your cost savings. Here's your 12-month payback period and the initial project." Now, we can deliver in two weeks to a month for fairly large pieces of functionality that it used to take you years to get. You suddenly get folks that start to question their old religious beliefs, right? And then you slowly bring them into new projects, and then you get evangelists that can come down from the shining city on the hill that you just built and convert the rest of the masses. And suddenly, we're all living in a world that we want to live in where this technology isn't just the fringe stuff that we can all get excited about in the bleeding edge, it actually suddenly becomes the new core business systems that...you know, Sunday can become Sunday again because we don't have all the nasty outages and whatnot that the traditional error-prone systems are, you know, susceptible to. I mean you guys at Nordstrom, you probably never had a big outage of core retail systems anywhere around like Black Friday or anything like that, never happens.

One of the great things about moving in this direction is we take so much load off of the traditional systems that is cost savings, that's great, but we're also acting as a back pressure valve, right? When you have asynchronous load environments or you have services that you have to stand up in a short period of time, if you're building that on technology that was fundamentally designed to lock threads until it tips over, you're gonna be in for a bad experience if you have more customer influx than you want. If you build it on something that's naturally designed to scale out, then you're building for resiliency and sort of rather than protecting from failure, which is the old mindset, you're essentially embracing failure and then letting it only affect one or two folks, right? So it looks like I'm running out of time here, so any questions or comments or horror stories from anybody else?
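One simple way to picture the back pressure valve is a bounded buffer in front of a slow system: when it is full, new requests are turned away immediately instead of piling up until the whole thing tips over. The sketch below uses Python's standard `queue` module; `BoundedIntake` and its capacity are illustrative assumptions, not anything described in the talk.

```python
import queue
from typing import List


class BoundedIntake:
    """Accepts work up to a fixed capacity and pushes back on callers beyond it."""

    def __init__(self, capacity: int) -> None:
        self._pending: "queue.Queue[str]" = queue.Queue(maxsize=capacity)
        self.rejected = 0

    def submit(self, request: str) -> bool:
        try:
            self._pending.put_nowait(request)
            return True
        except queue.Full:
            # Only this caller is affected; work already accepted is untouched.
            self.rejected += 1
            return False

    def drain(self, limit: int) -> List[str]:
        """Hand at most `limit` requests to the slow downstream system."""
        done: List[str] = []
        while len(done) < limit:
            try:
                done.append(self._pending.get_nowait())
            except queue.Empty:
                break
        return done
```

A rejected caller can retry later or be served a degraded response, which is the "let failure affect only one or two folks" trade-off in miniature.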

Man 1: So you talked about how to read software from mainframe, what's your approach [inaudible 00:24:57]?

Matt: That's a longer conversation. But the first part of the answer, which is the most cop-out piece, I promise, then I can get into more of it, but is carefully. But I think one of the soapboxes I often get up on is that we've forgotten some of the brass tacks stuff that we used to do in technology really well, like domain-driven design and some of the other fundamentals. If we separate out those big, nasty pieces of the mainframe into core business areas, then we do have to do some code analysis and see where things are ticking. But we can usually rope it off to a pretty good degree of certainty, and then extract it piece by piece into new services.

And as we do that, rather than route those particular transactions back to the mainframe, we just intercept it and route it to the new services. And then it slowly gets replaced and strangled off, and it's replaced and strangled off with things that have all their unit test coverage, that have good definitions of done and can actually be completed, right? And so you get this sort of combinatory effect where you can start to tackle more of these a little bit more confidently, and then you can attack the really big ones that are spidered everywhere, right? And then eventually that knot, you cut it to death by strands, and then you can unhook it.
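The intercept-and-reroute step can be sketched as a small router that sends each transaction type to a new service once one exists and to the legacy system otherwise. This is a simplified illustration under assumed names (`StranglerRouter`, `migrate`, a `"type"` field on each transaction), not the routing layer used on the project.

```python
from typing import Callable, Dict

Handler = Callable[[dict], str]


class StranglerRouter:
    """Routes each transaction type to a new service if one exists, else to legacy."""

    def __init__(self, legacy: Handler) -> None:
        self._legacy = legacy
        self._migrated: Dict[str, Handler] = {}

    def migrate(self, txn_type: str, handler: Handler) -> None:
        # Called once an extracted service is tested and ready to take over.
        self._migrated[txn_type] = handler

    def handle(self, txn: dict) -> str:
        handler = self._migrated.get(txn["type"], self._legacy)
        return handler(txn)
```

Each `migrate` call moves one slice of functionality off the old system without touching its code, and callers keep using the same `handle` entry point throughout.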

Man 2: Do you feel like there's a maybe sort of like a doomsday clock on this kind of moving off the mainframes, or are you finding it hard to find people who can [inaudible 00:26:31]?

Matt: Oh, yeah, absolutely, absolutely. I have a client in the U.K. who they had a major system issue. They had to bring back an 85-year-old woman from retirement to look at the system and actually fix the problem because she was one of the original authors. She was the last one left from about the 30 people that had actually built it. And had they not been able to do that, they would have had to rebuild a good portion of it, and this is a big retailer. So you can imagine how screwed they would have been had they been down for more than two days, right, especially around the November timeframe. So absolutely, there is a big doomsday clock. And I think the hesitancy is that a lot of folks have tried to move to this a couple times and they failed because they try these big bang, big replacement projects where they can turn a key. What we're talking about is implementing the strangler pattern.

My marketing folks have a nicer term for it, which we call it hollowing out the core, right, because strangler patterns sound slightly serial killery. But, you know, so hopefully we'll all be able to move in this direction as we move forward. And keep in mind some of the business case stuff because it really can convince the folks who don't necessarily understand the nitty-gritty of event sourcing and what have you.

David: Any other questions? All right, thank you very much, Matt.

Matt: Thanks.

About Andrea Passwater

Andrea writes about tech at serverless and keeps her eyes on her growing cactus collection.
