
Making Software Last Forever

How many of us have bought a new home and moved because our prior home was not quite meeting our needs? Maybe we needed an extra bedroom, or wanted a bigger backyard? Now, as a thought experiment, assume you couldn't sell your existing home. If you bought a new home and moved, you'd have to abandon your prior home (and your investment in it). That changes your thinking, doesn't it?

Further, imagine you had a team of ten people constantly working on your home, improving it, and keeping it updated, for the last fifteen years. You'd have an investment of 150 person-years in your existing home (10 people x 15 years). If each person was paid the equivalent of a software developer (we'll use $200k to include benefits, office space, leadership, etc.), you'd have an investment of $30 million (150 person-years x $200,000). What would make you walk away from that investment?

When companies decide to re-write or replace an existing software application because it is no longer meeting their needs, they are making a similar decision. Existing software is abandoned (along with its cumulative investment) for "something new and better". Yet the belief that new code is better than old is patently absurd. Old code has weathered and withstood the test of time. It has been battle-tested. Bugs have been found (lots), and more importantly, fixed. New and better code means new and better bugs.

Joel Spolsky (of Fog Creek Software and Stack Overflow) describes system re-writes in "Things You Should Never Do, Part I" as “the single worst strategic mistake that any software company can make.” Why throw away a working system? We know from both research and experience that iterating and maintaining existing solutions is a less risky, and less expensive, way to improve software.

Why can't software last forever? It's not made of wood, concrete, or steel. It doesn't rot, weather, or rust. A working algorithm is a working algorithm. Technology doesn’t need to be beautiful or to impress other people to be effective, and all technologists are ultimately in the business of producing effective technology.

The World's Oldest Software Systems

In 1958 the United States Department of Defense launched a new computer-based contract management system. It was called "Mechanization of Contract Administration Services", or MOCAS (pronounced “MOH-cass”). In 2015, MIT Technology Review stated that MOCAS was the oldest computer program they could verify. At that time MOCAS managed about $1.3 trillion in obligations and 340,000 contracts.

According to the Guinness Book of World Records, the oldest software system in use today is either the SABRE Airline Reservation System (introduced in 1960), or the IRS Individual Master File (IMF) and Business Master File (BMF) systems introduced in 1962–63.

SABRE went online in 1960. It had cost $40 million to develop and install (about $400 million in 2022 dollars). The system took over all American Airlines booking functions in 1964, when the name officially changed to SABRE. The system was expanded to provide access to external travel agents in 1976.

What is the secret to the long lifespan of these systems? Shouldn't companies with long-lived products (annuities, life insurance, etc.) study these examples? After all, they need systems that last most of a human lifespan.

Shouldn't any company want to make their investments in software last?

Maintenance is About Making Something Last

Airlines recognize the value of maintenance. They created a system called "progressive maintenance" for their aircraft, and it works very, very well.

Commercial aircraft are inspected at least once every two days. Engines, hydraulics, environmental, and electrical systems all have additional maintenance schedules. A "heavy" maintenance inspection occurs once every few years. This process maintains the aircraft's service life over decades.

On average, an aircraft is operable for about 30 years before it must be retired. A Boeing 747 can endure 35,000 pressurization cycles — roughly 135,000 to 165,000 flight hours — before metal fatigue sets in. However, most older airplanes are retired for fuel-efficiency reasons, not because they're worn out.

The goal of maintenance is catching problems before they happen. That’s the difference between maintenance and repair. Repair is about fixing something that’s already broken. Maintenance is about making something last.

The most amazing story of the power of maintenance shows that even structures made of grass can last indefinitely. Inca rope bridges were simple suspension bridges constructed by the Inca Empire. The bridges, an integral part of the Inca road system, were constructed using ichu grass. The grass deteriorated rapidly. The bridge's strength and reliability came from the fact that each cable was replaced every June.

Inca Rope Bridge

Even though they were made of grass, these bridges were maintained with such regularity and attention that they lasted for centuries.

Unfortunately, Maintenance is Chronically Undervalued

There is a Chinese proverb about a discussion between a king and a famous doctor. The well-known doctor explains to the king that his brother (who is also a doctor) is superior at medicine, but he is unknown because he always successfully treats small illnesses, preventing them from evolving into terminal ones. So, people say "Oh he is a fine doctor, but he only treats minor illnesses". It's true: Nobody Ever Gets Credit for Fixing Problems that Never Happened.

Maintenance is one of the easiest things to cut when budgets get tight. Some legacy software systems have decades of underinvestment. This leads to the inevitable "we have to replace it" discussion, which somehow always sounds better (even though it’s more expensive and riskier) than arguing that we have to invest in system rehabilitation and deferred system maintenance.

Executives generally can't refuse "repair" work because the system is broken and must be fixed. However, maintenance is a tough sell. It’s not strictly necessary — or at least it doesn’t seem to be until things start falling apart. It is so easy to divert maintenance budget into a halo project that gets an executive noticed (and possibly promoted) before the long-term effects of underinvestment in maintenance become visible. Even worse, the executive is also admired for reducing the costs of maintenance and switching costs from "run" to "grow" - while they are torpedoing the company under the waterline.

The other more insidious problem is conflating enhancement work with maintenance work. Imagine you have $1,000 and you want to add a sunroof to your car, but you also need new tires (which coincidentally cost $1,000). You have to replace the tires every so often, but a sunroof is "forever" right? If you spend the money on the sunroof the tires could get replaced next month, or maybe the month after - they'll last a couple more months, won't they?

When we speak about software, users can't even see "the bald tires" - the only things they see, experience, and value are new features and capabilities. Can you imagine the pressure to add new features, compared to maintenance work, in an organization? Budget always swings away from maintenance work towards enhancements.

Finally, maintenance work is typically an operational cost, yet building a new system, or a large new feature, can often be capitalized and depreciated over many years - making the future costs someone else's problem in many cases.

"Most of the systems I work on rescuing are not badly built. They are badly maintained."

— Marianne Bellotti, Kill it With Fire

Once a system degrades it is an even bigger challenge to fund deferred maintenance (or "technical debt"). No one plans for it, no one wants to pay for it, and no one even wants to do it. Initiatives to repair and restore operational excellence gradually, much the way one would fix up an old house, tend to have few volunteers among engineering teams. No one gets noticed doing maintenance. No one gets promoted because of maintenance. Instead, these are the jobs that are either eliminated or offshored to less skilled, less knowledgeable, and of course cheaper labor.

Maintenance mostly happens out of sight, mysteriously. If we notice it, it’s a nuisance. When road crews block off sections of highway to fix cracks or potholes, we treat it as an obstruction, not a vital and necessary process. When your refrigerator needs maintenance you defer it, hoping maybe you can just buy a new one later. If you wait long enough and your refrigerator breaks down, you get upset at the cost of the repair bill - not at yourself for deferring the maintenance.

This is especially true in the public sector: it’s almost impossible to get governmental action on, or voter interest in, spending on preventive maintenance, yet governments make seemingly unlimited funds available once we have a disaster. We are okay spending a massive amount of money to fix a problem, but consistently resist spending a much smaller amount of money to prevent it; as a strategy this makes no sense.

Recent price increases for construction materials like lumber, drywall, and wiring (and frankly everything else) should, according to Economics 101, force us to treat our homes more dearly. Similarly, price increases for software engineers should force companies to treat existing software more dearly. This kind of pragmatism is sorely needed by companies that are trying to be frugal.

Risks of Replacing Software Systems

Engineers tend to prefer to replace or re-write a system rather than maintain it. They get to write a new story rather than edit someone else's. They will attempt to convince a senior executive to fund a project to replace a problematic system by describing all the new features and capabilities that could be added. As discussed earlier, maintenance is not rewarded, incentivized, or praised.

The incentives of individual praise aside, engineering teams tend to gravitate toward system rewrites because they incorrectly think of old systems as specs. They assume that since an old system works, all technical challenges and possible problems have been settled. The risks have been eliminated. They can focus on adding more features to the new system or make changes to the underlying architecture without worry. Either they do not perceive the ambiguity these changes introduce, or they see such ambiguity positively, imagining only gains in performance and the potential for greater innovation.

We are swapping a system that works and needs to be adjusted for an expensive and difficult migration to something unproven. It’s the minor adjustments to systems that have not been actively developed in a while that create the impression that failure is inevitable and push otherwise rational engineers toward doing rewrites when rewrites are not necessary.

Rewrite or replacement plans do not take the form of redesigning the system using the same language or merely fixing a well-defined structural issue. The goal of full replacements is to restore ambiguity and, therefore, engineering enthusiasm. They fail because the assumption that the old system can be used as a spec and be trusted to have diagnosed accurately and eliminated every risk is wrong.

Even so, eventually the "replacement" project will be funded (at a much higher expenditure than rehabilitating the existing system). Why not authorize that multimillion-dollar rewrite if the engineers convince management the existing system is doomed? Even if the executives are not listening to the engineers, they will be listening to external consultants telling them they have a "burning platform" and they are falling behind.

Everyone is excited about the new system that will do "all the things". But what do you do with the old system while you’re building the new one? Most organizations choose to put the old system on “life support” and give it only the resources for patches and fixes necessary to keep the lights on.

Who gets to work on the new system, and who takes on the maintenance tasks of the old system? If the old system is written in older technology that the company is actively abandoning, the team maintaining the old system is essentially sitting around waiting to be fired. And don’t kid yourself, they know it. So, if the people maintaining the old system are not participating in the creation of the new system, you should expect that they are also looking for new jobs. If they leave before your new system is operational, you lose both their expertise and their institutional knowledge.

If the new project falls behind schedule (and it almost certainly will), the existing system continues to degrade, and knowledge continues to walk out the door. If the new project fails and is subsequently canceled, the gap between the legacy system and operational excellence has widened significantly in the meantime.

This explains why executives are loath to cancel system replacement projects even when they are obviously years behind schedule and failing to live up to expectations. Stopping the replacement project seems unthinkable because the legacy system is now so degraded that restoring it to operational excellence looks impossible. Plus, politically, canceling a marquee project can be career suicide for the sponsoring executive(s). Much better to do "deep dives" and "assessments" on why the project is failing and soldier on than cancel it.

The interim state is not pretty. The company now has two systems to operate. The new system will have high costs, limited functionality, new and unique errors and issues, and lower volumes (so the "per unit cost" of the new system will be quite high). The older system, having lost its best engineers and subject matter experts, will still be running most of the business, yet its maintenance budget will have been whittled down to nothing to redirect spending to implement (save?) the new system. Neither system will exhibit operational excellence, and both put the organization at significant risk in addition to the higher costs and complexity of running two systems.

How do you know when you aren't spending enough on maintenance?

First, can you even quantify the maintenance budget and resources? How do you separate it from feature requests? How do you even define maintenance?

Second, how do you define a "reasonable" maintenance budget?

In the case of the Inca rope bridges what was the cost of maintenance annually? Let's assume some of the build work was site preparation and building the stone anchors on each side, but most of the work was constructing the bridge itself. Since the bridge was entirely replaced each year, the maintenance costs could be as much as 80% of the initial build effort, year after year.

In the case of a brand-new software system, that is well architected, well designed and built, and meets all reliability, scalability and performance needs (basically no software ever) it's conceivable that there is no maintenance necessary for some period of time. Maintenance budget = 0.

This is like buying a brand-new car. The maintenance costs are negligible in the first couple years, until they start to creep up. As the car ages the maintenance costs continue to increase until at some point it makes economic sense to buy another new car. Except none of us wait that long. Most of us buy new cars before our old one is completely worn out. Yet in Cuba cars have been maintained meticulously for 30-40 years and some even run better than new.

The challenges should be obvious. First, budgets in large organizations tend to be last year's budget plus 3%. If you start with a maintenance budget of zero on a new system, how do you ever get to the point of a healthy maintenance budget in the future? Second, maintenance costs are unpredictable, and organizations hate unpredictable costs. It's impossible to say when the next revolution in hardware, storage, or programming constructs will occur, or when the existing system will hit a performance or scalability inflection point.

Creating a Maintenance Fund

The costs of proper maintenance are unpredictable. In addition, there is some amount of discretion that can be applied. When your house needs a new roof it's reasonable to defer it through summer, but it probably needs to be done before winter.

Since businesses require predictability of costs, unpredictable maintenance costs are easy to defer. "We didn't budget for that, we'll have to put it in next year's budget." Except of course in the budget process it will compete with other projects and enhancement work, where it's highly likely to be deprioritized.

What's the solution? Could it be possible to create some type of maintenance fund where a predictable amount is budgeted each year, and spent "unpredictably" when/as needed?
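
As a rough illustration, here is a minimal sketch (in Python, with entirely made-up numbers) of how such a fund might behave: a predictable contribution goes into the budget every year, while the actual maintenance demand arrives unpredictably and is drawn from the fund as needed.

```python
import random

# Toy sketch of the "maintenance fund" idea: budget a predictable contribution
# every year, and draw on the fund when unpredictable maintenance actually lands.
# All numbers are invented for illustration.

random.seed(7)
annual_contribution = 500_000          # the predictable line item in every budget
fund = 0.0

for year in range(1, 11):
    fund += annual_contribution
    # Unpredictable maintenance demand: some years light, some years heavy.
    demand = random.choice([100_000, 250_000, 400_000, 900_000, 1_500_000])
    spent = min(fund, demand)
    fund -= spent
    print(f"year {year:2}: spent {spent:>9,.0f}, deferred {demand - spent:>9,.0f}, fund {fund:>9,.0f}")
```

The point is not the specific numbers but the shape: the budget line stays predictable while the spending does not, which is exactly what the deferral conversation above keeps tripping over.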

Third, how do you incentivize engineers to want to perform maintenance? (More on that in the "Incentives" section below.)

Maintaining Software to Last Forever

As I discussed in How Software Learns, software learns and adapts over time - it is continually refined and reshaped by maintenance. Maintenance is crucial to software's lifespan and human value.

When software systems are first developed, they are based on a prediction of the future - a prediction we know is wrong even as we create it. No set of requirements has ever been perfect. However, the system becomes "less wrong" as time, experience, and knowledge are continually added - that is, through maintenance. Maintenance is tailoring software to its true needs and use over time.

Futureproofing means constantly rethinking and iterating on the existing system. We know from both research and experience that iterating on and maintaining existing solutions is a less risky, and less expensive, way to extend software's lifespan and improve its functionality.

It's not the age of a system that causes it to fail but rather neglect. People never build software thinking that they will neglect it until it’s a huge liability for their organization. People fail to maintain software systems because they are not given the time, incentives, or resources to maintain them. So, how do you ensure proper maintenance over time?

Many times I have found the technology team and the business users arguing over "bug" vs. "enhancement". When I ask why, the answer is always the same - it is a way of assigning blame. "Bug" means it's engineering's fault; "enhancement" means the business missed a requirement. I always tell everyone they are both just maintenance and our only decision is which to fix first.

  • Dedicated team vs. integrated team?
  • What counts as "maintenance"?
  • Measuring the backlog: what is healthy?

Legacy System Modernization

Bad software is unmaintained software. Due to factors discussed above, software does not always receive the proper amount of maintenance to remain healthy. Eventually a larger modernization effort (or replacement) becomes necessary to restore a system to operational and functional excellence.

Legacy modernization projects start off feeling easy. The organization once had a reliable working system and kept it running for years. All the modernizing team should need to do is simply rebuild it using better technology, the benefit of hindsight, and improved tooling. It should be simple. But, because people do not see the hidden technical challenges they are about to uncover, they also assume the work will be boring. There’s little glory to be had re-implementing a solved problem.

Modernization projects take months, if not years of work. Keeping a team of engineers focused, inspired, and motivated from beginning to end is difficult. Keeping their senior leadership prepared to invest in what is, in effect, something they already have is a huge challenge. Creating momentum and sustaining it are where most modernization projects fail.

The hard part about legacy modernization is the "system around the system". The organization, its communication structures, its politics, and its incentives are all intertwined with the technical product in such a way that to improve the product, you must do it by turning the gears of this other, complex, undocumented system. Pay attention to politics and culture. Technology is at most only 50% of the legacy problem; ways of working, organization structure, and leadership are just as important to success.

An organization about to embark on such an undertaking craves new features, new functionality, and new benefits. Modernization projects are typically the ones organizations just want to get out of the way, so they usually launch into them unprepared for the time and resource commitments they require.

To succeed, you need to provide significant value as soon as possible, so that you overcome people’s natural skepticism and get them to buy in. The important word in the phrase "proof of concept" is proof. You need to prove to people that success is possible and worth doing. It can't be just an MVP, because MVPs are dangerous.

Large, expensive projects kicked off to fix things that are not obviously broken break trust with the nontechnical parts of the organization. They inconvenience colleagues, frustrate them, and sometimes confuse them. A modernization effort needs buy-in beyond engineering to be successful. Spending time and money on changes that have no immediate visible impact on the business or mission side of operations makes it hard to secure that buy-in in the future.

A big red flag is raised when people talk about the phases of their modernization plans in terms of which technologies they are going to use rather than what value they will add. Engineering needs to be reminded it's not about the technology. For all that people talk about COBOL dying off, it is good at certain tasks. The problem with most old COBOL systems is that they were designed at a time when COBOL was the only option. Start by sorting which parts of the system are in COBOL because COBOL is good at performing that task, and which parts are in COBOL because there were no other technologies available. Once you have that mapping, start by pulling the latter off into separate services that are written and designed using the technology we would choose for that task today.
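
The mapping itself does not need to be sophisticated. Here is a minimal sketch of the idea, with hypothetical module names and categories - nothing below comes from a real system:

```python
# Hypothetical sketch: classify legacy modules by whether COBOL is a genuine fit
# for the task, or merely the only option available when the module was written.
# Module names and categories are illustrative, not from any real system.

modules = {
    "batch_settlement":  "cobol_good_fit",    # high-volume, fixed-format batch processing
    "report_generation": "cobol_good_fit",
    "customer_web_api":  "cobol_by_default",  # only option at the time; better tools exist now
    "pdf_rendering":     "cobol_by_default",
}

# Candidates to pull out into separate services, built with tools chosen for the task today.
extract_first = [name for name, reason in modules.items() if reason == "cobol_by_default"]
print(extract_first)  # ['customer_web_api', 'pdf_rendering']
```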

Good measurable business problems and business value have to be aligned with problems that your engineers care about. In all likelihood, you won’t be able to say your database migration is going to do that, but engineers feel passionate about other things. Talk to them and figure out what those are.

Counterintuitively, SLAs/SLOs are valuable because they give people a "failure budget". When organizations stop aiming for perfection and accept that all systems will occasionally fail, they stop letting their technology rot for fear of change. Some organizations can’t be talked out of wanting five or six nines of availability. In those cases, mean time to recovery (MTTR) is a more useful statistic to push than reliability. MTTR tracks how long it takes the organization to recover from failure. Resilience in engineering is all about recovering stronger from failure. That means better monitoring, better documentation, and better processes for restoring services, but you can’t improve any of that if you don’t occasionally fail.
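
To make the "failure budget" idea concrete, here is a small sketch that converts an availability SLO into minutes of allowed downtime per month. The 30-day window and the specific targets are illustrative assumptions, not recommendations:

```python
# Minimal sketch: turning an availability SLO into a monthly "failure budget".

def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed downtime in the window for a given availability SLO."""
    return (1.0 - slo) * days * 24 * 60

for slo in (0.999, 0.9999, 0.99999):
    print(f"{slo:.5f} -> {error_budget_minutes(slo):6.1f} minutes/month")

# 0.99900 ->   43.2 minutes/month
# 0.99990 ->    4.3 minutes/month
# 0.99999 ->    0.4 minutes/month
```

At five nines there is less than half a minute of budget per month, which is why MTTR becomes the more useful number to argue about.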

Although a system that constantly breaks, or that breaks in unexpected ways without warning, will lose its users’ trust, the reverse isn’t necessarily true. A system that never breaks doesn’t necessarily inspire high degrees of trust.

"As an industry, we reflect on success but study failure."

— Marianne Bellotti, Kill it With Fire

People take systems that are too reliable for granted. Italian researchers Cristiano Castelfranchi and Rino Falcone have been advancing a general model of trust in which trust degrades over time, regardless of whether any action has been taken to violate that trust. Under Castelfranchi and Falcone’s model, maintaining trust doesn’t mean establishing a perfect record; it means continuing to rack up observations of resilience. If a piece of technology is so reliable it has been completely forgotten, it is not creating those regular observations. Through no fault of the technology, the user’s trust in it slowly deteriorates.

When both observability and testing are lacking on your legacy system, observability comes first. Tests tell you only what shouldn’t fail; monitoring tells you what is failing. Don’t forget: a perfect record will always be broken, but resilience is an accomplishment that lasts. Modern engineering teams use stats like service level objectives, error budgets, and mean time to recovery to move the emphasis away from avoiding failure and toward recovering quickly.

Organizational Design

Conway’s law tells us that the technical architecture and the organization’s structure are general equivalents, so organizational design matters. Nothing says you’re serious about accomplishing something more effectively than changing people's scenery, but don't assume that organizational change is necessary out of the gate.

Reorgs are incredibly disruptive. They are demoralizing. They send the message to rank-and-file engineers that something is wrong — they built the wrong thing, or the product they built doesn’t work, or the company is struggling. They increase workplace anxiety and decrease productivity. The fact that reorgs almost always end up with a few odd people out who are subsequently let go exacerbates the issue.

What you don’t want to do is draw a new organization chart based on your vision for how teams will be arranged with the new system. You don’t want to do this for the same reason that you don’t want to start product development with everything designed up front. Your concept of what the new system will look like will be wrong in some minor ways you can’t possibly foresee. You don’t want to lock in your team to a structure that will not fit their needs.

The only way to design communication pathways is to give people something to communicate about. In each case, we allow the vision for the new organization to reveal itself by designing structures that encourage new communication pathways to form in response to our modernization challenges. As the work continues, those communication pathways begin to solidify, and we can begin documentation and formalizing new teams or roles. In this way, we sidestep the anxiety of reorganizing. The workers determine where they belong based on how they adapt to problems; workers typically left out are given time and space to learn new skills or prove themselves in different roles, and by the time the new organization structure is ratified by leadership, everyone already has been working that way for a couple months.

Who needs to communicate with whom may not be clear when you get started. This is an exercise I use to help reveal where the communication pathways are or should be. I give everyone a piece of paper with a circle drawn on it. The instructions are to write down the names of the people whose work they are dependent on inside the circle (in other words, “If this person fell behind schedule, would you be blocked?”) and the names of people who give them advice outside the circle. If there’s no one specific person, they can write a group or team name or a specific role, like frontend engineer, instead. Then I compare the results across each team. In theory, those inside the circle are people with whom the engineer needs to collaborate closely. Each result should resemble that engineer’s actual team with perhaps a few additions or deletions based on current issues playing out.

Outside the circle should be all the other teams. Experts not on the team should be seen as interchangeable with other experts in the same field. Small variations will exist from person to person, but if the visualizations that people produce don’t look like their current teams, you know your existing structure does not meet your communication needs.
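
If you want to compare the answers to this in-group/out-group exercise systematically rather than by eye, a toy script like the following will do. The names, the team roster, and the use of a simple overlap score are my own illustrative assumptions, not part of the original exercise:

```python
# Compare who an engineer says they depend on ("inside the circle") with their
# actual team roster. All names below are hypothetical.

def overlap(circle: set[str], team: set[str]) -> float:
    """Jaccard similarity between reported dependencies and the actual team."""
    return len(circle & team) / len(circle | team)

actual_team = {"amira", "ben", "chen", "dana"}
reported_circle = {"ben", "chen", "erik", "frontend_guild"}  # one engineer's answers

print(f"overlap = {overlap(reported_circle, actual_team):.2f}")  # 0.33

# Low scores across many engineers suggest the org chart no longer matches
# the real communication needs.
```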

Incentives

Engineering loves new technology. It brings engineers attention and industry marketability. [Boring technology](https://engineering.atspotify.com/2013/02/in-praise-of-boring-technology/), on the other hand, is great for the company. The engineering cost is lower, and the skills are stickier, because these engineers are not being pulled out of your organization for double their salary by Amazon or Google.

To most software engineers, legacy systems may seem like torturous dead-end work, but the reality is that systems that are not used get turned off. Working on legacy systems means working on some of the most critical systems that exist — computers that govern millions of people’s lives in innumerable ways. This is not the work of technical janitors, but of battlefield surgeons.

Pay attention to how engineers are incentivized. What earns them the acknowledgment of their peers? What gets people seen is what they will ultimately prioritize, even if those behaviors are in open conflict with the official instructions they receive from management.

The specific form of acknowledgment also matters a lot. Positive reinforcement in the form of social recognition tends to be a more effective motivator than the traditional incentive structure of promotions, raises, and bonuses. Behavioral economist Dan Ariely attributes this to the difference between social markets and traditional monetary-based markets. Social markets are governed by social norms (read: peer pressure and social capital), and they often inspire people to work harder and longer than much more expensive incentives that represent the traditional work-for-pay exchange. In other words, people will work hard for positive reinforcement; they might not work harder for an extra thousand dollars.

In fact, the idea that one needs a financial reward to want to do a good job for an organization is cynical. It assumes bad faith on the part of the employee, which builds resentment. Traditional incentives have little positive influence, therefore, because they disrupt what was otherwise a personal relationship based on trust and respect. Behavioralist Alfie Kohn puts it this way: Punishment and rewards are two sides of the same coin. Rewards may have a negative effect because they, like outright punishment, are manipulative. “Do this and you’ll get that” is not really very different from “Do this or here’s what will happen to you.” In the case of incentives, the reward itself may be highly desired; but by making that bonus contingent on certain behaviors, managers manipulate their subordinates, and that experience of being controlled is perceived negatively.

People’s perception of risk is not static, and it’s often not connected to the probability of failure so much as it is the potential feeling of rejection and condemnation from their peers. Since social pressures and rewards are better incentives than money and promotions, you can improve your odds of success by learning how to manipulate an organization’s perception of risk.

The first task is to understand what behaviors get individuals within an organization acknowledged. Those are the activities that people will ultimately prioritize. If they are not the activities that you think will advance your modernization process, explore constructs, traditions, and — there’s no better word for it — rituals that acknowledge complementary activities.

The second task is to look at which part of the organization gets to determine when the operator can exercise discretion and deviate from defined procedure. Those are the people who set the ratio of blamelessness to accountability in handling failure. Those are the people who will grant air cover and who need to buy in to any breaking change strategy for it to be successful. Once that air cover is established, anxiety around failure tends to relax.

Once you know how to manipulate the organization’s perception of risk, successfully managing the break is all about preparation. While you will not be able to predict everything that could go wrong, you should be able to do enough research to give your team the ability to adapt to the unknown with confidence. At a minimum, everyone should know and understand the criteria for rolling back a breaking change.

Building Software to Last Forever

Stewart Brand, editor of the Whole Earth Catalog and author of How Buildings Learn, described a concept in which buildings have architectural "layers":

  • (INFRA)STRUCTURE: foundation and load-bearing elements; rarely changed
  • SKIN: exterior surfaces, each generation more air-tight and better insulated; changes every 20 years or so
  • SERVICES: plumbing, electrical, etc.; changes every 7-15 years
  • SPACE PLAN: interior layout - walls, ceilings, floors, doors; changes every 3-10 years

This concept can be applied to software systems in a similar way.
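
One possible mapping - my own rough analogy, not Brand's, with very approximate cadences - looks something like this:

```python
# Speculative analogy between Brand's building layers and the parts of a
# software system. The mapping and the cadences are illustrative guesses.

software_layers = {
    "structure":  ("data model, core domain logic",          "rarely changed"),
    "skin":       ("public APIs, integration contracts",     "every few years"),
    "services":   ("frameworks, databases, infrastructure",  "every 3-7 years"),
    "space plan": ("UI, feature code, configuration",        "months to a few years"),
}

for layer, (software_analog, cadence) in software_layers.items():
    print(f"{layer:>10}: {software_analog:<40} ({cadence})")
```

The useful implication is the same as Brand's: different layers change at different speeds and deserve maintenance on different schedules.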

In short, small organizations build monoliths because small organizations are monoliths.

Culturally, the engineering organization was flat, with teams formed on an ad hoc basis. Opportunities to work on interesting technical challenges were awarded based on personal relationships, so the organization’s regular hack days became critical networking events. Engineering wanted to build difficult and complex solutions to advertise their skills to the lead engineers who were assembling teams.

Well-designed, high-functioning software that is easy to understand usually blends in. Simple solutions do not do much to enhance one’s personal brand. They are rarely worth talking about. Therefore, when an organization provides limited pathways to promotion for software engineers, they are incentivized to make technical decisions that emphasize their individual contribution over integrating well into an existing system.

=====

The first step of legacy system replacement is to understand the desired outcomes, according to the authors of "Patterns of Legacy Displacement". They point out that "it is vital for an organization to agree the outcomes they want to achieve when tackling legacy. While this may seem obvious, all too often different parts of an organization can have quite different views on the desired outcomes." Some typical legacy replacement outcomes are reducing the cost of change, improving business processes, retiring old systems that are no longer supported, or overcoming market disruptions by new players. The authors add that introducing new technology should never be a goal by itself; it must support the current and predicted needs of the business instead.

The second step is to decompose the problem into smaller parts, the authors add. They note that "many applications are built to serve multiple logical products from the same physical system. Often this is driven by a desire for reuse. (...) A major problem we come across is that superficially the products look similar but they [are] very different when it comes to the detail." The Extract Product Lines pattern reduces risk and divides the legacy system into slices that can be (re)built simultaneously and delivered incrementally. Slices are extracted by discovering the product or product lines in the system and their business capabilities. The authors then prioritise the extraction of the products found by risk and begin with the second riskiest, maintaining the business’s attention but not risking a significant failure in case of problems.
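
The prioritisation logic described above is simple enough to sketch. The slice names and risk scores below are invented purely for illustration:

```python
# Sketch of "start with the second riskiest" slice prioritisation.
# Slice names and risk scores are made up for illustration.

slices = {
    "claims_processing": 0.9,   # riskiest: keeps the business's attention, but a failure here would hurt most
    "policy_renewals":   0.7,
    "quote_generation":  0.5,
    "document_archive":  0.2,
}

ranked = sorted(slices, key=slices.get, reverse=True)
first_extraction = ranked[1]   # the second-riskiest slice
print(first_extraction)        # policy_renewals
```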

Aiming for feature parity is often an underestimated goal and is not recommended except for a few use cases, the authors write. They even state that "if feature parity is a genuine requirement then this pattern describes what it might take to do well. It is not an easy path, nor one to be taken lightly." It requires running complete system surveys to understand user journeys and the system’s inputs and outputs, integrations, and batch processes. Such surveys often need to be complemented with systems archaeology to discover their inner workings, instrumentation to discover what is being used, and feature value mapping to understand which features may be dropped. Naturally, tests must be used to ensure that the new system behaves like the old one. The authors conclude that because this pattern is so expensive, it should only be used when all specifications are well known or when any changes would disrupt existing users too much.

===

In 1984, Charles Perrow coined the term "normal accidents" to describe a situation where systems are so prone to failure that no amount of safety procedures could eliminate accidents entirely. According to Perrow, normal accidents are not the product of bad technology or incompetent staff. Rather, systems that experience normal accidents display two important characteristics:

  1. They are tightly coupled. When two separate components are dependent on each other, they are said to be coupled. In tightly coupled situations, there’s a high probability that changes with one component will affect the other. Tightly coupled systems produce cascading effects.

  2. They are complex. Signs of complexity in software include the number of direct dependencies and the depth of the dependency tree. Computer systems naturally grow more complex as they age, because we tend to add more and more features to them over time.

Tightly coupled and complex systems are prone to failure because the coupling produces cascading effects, and the complexity makes the direction and course of those cascades impossible to predict.

In a simple example of coupling, imagine software that tightly links three components together. Each component is 99% reliable. The overall reliability of the system is 0.99 to the third power, or 0.99 x 0.99 x 0.99 = 0.97 (97% reliable). Now imagine that instead of three tightly linked components it requires ten. The system's reliability would be 0.99 to the 10th power, or only 90%. It would be out of service roughly 10% of the time, despite being made of highly reliable components.
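
The arithmetic is just the product of the component availabilities, which a couple of lines of Python confirm:

```python
# Availability of a chain of tightly coupled components is the product of
# the individual component availabilities.

def chain_reliability(component_reliability: float, n_components: int) -> float:
    return component_reliability ** n_components

print(f"{chain_reliability(0.99, 3):.3f}")   # 0.970 -> ~97% available
print(f"{chain_reliability(0.99, 10):.3f}")  # 0.904 -> ~90% available, down ~10% of the time
```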

=====

--

The planning isn't the issue, it's the conviction, resolve, determination...whatever you want to call it.

COBOL modernization is quite expensive and fraught with major risk. You're talking about multi-year efforts where the key players who initiate the effort are often not around to see it through. Limited legacy resources, poor-to-absent documentation, mission critical systems...it's a mess no one internally wants to touch or take the fall for. MicroFocus, Blu Age (now part of AWS), TSRI, and plenty of other companies are happy to take your money and tell you it can be done, and it certainly can be done technically.

The reality is the cost benefit is often years down the line, so keeping the right people around to sustain the effort and see it through financially and technically is the tallest of enterprise orders. So it's the old two-step: 1) Pay IBM for another three years of support and licensing and 2) Go through the motions of migration planning then repeat Step 1.

--

It's funny how if you ask people about forward-looking systems today, they have either learned mistruths or haven't learned enough about the history of computing to picture much that's reasonable. "Java!" used to be talked about as a good way to create software that you could run in the future, but anyone who has had to keep around old laptops / old VMs to run ancient Java to handle KVMs or device management tools knows how ridiculous an expectation about the stability of Java can be.

"Windows!" is funny, because I can't tell you the number of places that have an ancient Windows PC that has either been repaired or carefully replaced with new enough, yet old enough hardware to still run Windows 2000 or XP, because copy protection is stupid, and / or because drivers for specific hardware can't be updated or run on newer Windows, and / or because the software can't be compiled and run on newer Windows without tons of work.

On the other hand, you can take large, complicated programs from the 1980s written in C and compile them on a modern Unix computer without issues, even when the modern computer is running architectures which hadn't even been dreamt of when the software was written...

--

But I think you're forgetting how much code is written to C89. How much of bash, for instance, is ancient, with updates that avoid newer toolchain features so that it can be as portable as possible?

Yes, people don't often write stuff with portability as a goal at the beginning, but once something is relatively portable, it stays that way. Lots of code that wasn't poorly written made it from all-the-world's-a-VAX to i386, from i386 to amd64, and now ARM and aarch64, with a minimum of extra effort. There just had to be a little effort to NOT program like a jerk, which, as funny as it is, is still an issue now.

I'm running Pine as my daily email program, which was written in 1989 and hasn't had any real updates since 2005. New architecture? Compile it. Lots of modern software started out as C software from the 1980s.

  • Study the cadence, topics, and invite lists of meetings. Too often, meetings are maladapted attempts to solve problems. So if you want to know what parts of the project are suffering the most, pay attention to what the team is having meetings about, how often meetings are held, and who is being dragged into those meetings. In particular, look for meetings with long invite lists. Large meetings are less effective than small meetings, but they do convincingly spread the blame around by giving everyone the impression that all parties were consulted and all opinions were explored.

  • Meetings slow down work, which almost always leads to more meetings.

  • A similar but inverse brainstorming exercise to the critical factors exercise is asking your team to play saboteur. If you wanted to guarantee that the project fails, what would you do? How can you achieve the worst possible outcome? Once this list is generated, the team discusses if there are any behaviors either internally or from external partners that are close to items on the saboteur list.

  • Exercise: The 15 Percent. In Chapter 3, I talked about the value of making something 5 percent, 10 percent, or 20 percent better. This exercise asks team members to map out how much they can do on their own to move the project toward achieving its goals. What are they empowered to do? What blockers do they foresee, and when do they think they become relevant? How far can they go without approval, and who needs to grant that approval when the time comes?

  • Exercise: Probabilistic Outcome-Based Decision-Making. Probabilistic outcome-based decision-making is better known as betting. It’s a great technique for decisions that are hard to undo, have potentially serious impacts, and are vulnerable to confirmation bias. I tend to use it a lot to run hiring committees, for example. Firing people is difficult; making a wrong hire can destroy a team’s productivity, and people often see what they want to see in potential candidates. This is how it works: as a group, we make a list of potential outcomes from the decision that needs to be made. Outcomes like “We’re able to scale 2× by doing this” or “We will implement this new feature by this date.” You can mix both positive and negative outcomes if you like, but I find the conversation usually goes better if the list of outcomes is either positive or negative. Then team members place bets as to whether the outcome will come true. Traditionally, this exercise is run with imaginary money. Depending on the specific decision to be made, I sometimes ask them to bet with hours of their time instead of money.

Some of Conway’s other observations include the following: Individual incentives have a role in design choices. People will make design decisions based on how a specific choice — using a shiny new tool or process — will shape their future. Minor adjustments and rework are unflattering. They make the organization and its future look uncertain and highlight mistakes. To save face, reorgs and full rewrites become preferable solutions, even though they are more expensive and often less effective.

An organization’s size affects the flexibility and tolerance of its communication structure. When a manager’s prestige is determined by the number of people reporting up to her and the size of her budget, the manager will be incentivized to subdivide design tasks that in turn will be reflected in the efficiency of the technical design — or as Conway put it: “The greatest single common factor behind many poorly designed systems now in existence has been the availability of a design organization in need of work.” Conway’s observations are even more important in the maintaining of existing systems than they are in the building of new systems.

  • Our perception of risk shifts depending on whether we are trying to ensure success or avoid failure. When success seems certain, we gravitate toward more conservative, risk-averse solutions. When failure seems more likely, we switch mentalities completely. We go bold, take more risks. If we are judging odds correctly, this behavior makes sense. Why not authorize that multimillion-dollar rewrite if the existing system is doomed?

  • In terms of defining success, success criteria and diagnosis-policy-actions have different strengths and weaknesses. Success criteria connects modernization activities more directly to the value add they can demonstrate. It affords more flexibility in exactly what the team does by not prescribing a specific approach or set of tasks. It is an excellent exercise to run with bosses and any other oversight forces that might be inclined to micromanage a team. How to do something should be the decision of the people actually entrusted to do it.

  • Just as humans are terrible judges of probability, we’re also terrible judges of time. What feels like ages might only be a few days. By marking time, we can realign our emotional perception of how things are going. Find some way to record what you worked on and when so the team can easily go back and get a full picture of how far they’ve come.

  • Remember that we tend to think of failure as bad luck and success as skill. We do postmortems on failure because we’re likely to see them as complex scenarios with a variety of contributing factors. We assume that success happens for simple, straightforward reasons. In reality, success is no more or less complex than failure. You should use the same methodology to learn from success that you use to learn from failure.

  • Your timeline in a postmortem for success should be built around these questions: How did the organization execute on the original strategy, how did the strategy change, when did those changes happen, and what triggered them?

  • A postmortem’s key questions: What went well? What could have gone better? Where did you get lucky?

  • The people in the war room of our project were a working group. The people in the war room of the failed project were a committee.

  • Working groups relax hierarchy to allow people to solve problems across organizational units, whereas committees both reflect and reinforce those organizational boundaries and hierarchies.

  • Working groups are typically initiated and staffed by people on the implementation layer of the organization. A committee is formed to advise an audience external to the committee itself, typically someone in senior leadership. Whereas the working group is open to those who consider themselves operating in the same space as the working group’s topic area, committees are selected by the entity they will be advising and typically closed off to everyone else.

  • The important takeaway is working groups relax organizational boundaries while committees reinforce them.

  • Funny story: I would, from time to time, run war rooms filled with senior executives. It was a tactic I resorted to when the organization was in such a state of panic that senior leadership members were micromanaging their teams. Software engineers can’t fix problems if they spend all their time in meetings giving their managers updates about the problems. The boot has to be moved off their necks. In that situation, my team would typically run two war rooms: one where the engineers were solving problems together and one outfitted with lots of fancy dashboards where I was babysitting the senior executives. The most valuable skill a leader can have is knowing when to get out of the way.

  • Although this visual model might just look like an illustration, we can actually program it for real and use it to explore how our team manages its work in various conditions. Two tools popular with system thinkers for these kinds of models are Loopy (https://ncase.me/loopy/) and InsightMaker (https://insightmaker.com/). Both are free and open source, and both allow you to experiment with different configurations and interactions.

  • Organizations choose to keep the bus moving as fast as possible because they can’t see all the feedback loops. Shipping new code gets attention, while technical debt accrues silently and without fanfare. It’s not the age of a system that causes it to fail, but the pressure of what the organization has forgotten about it slowly building toward an explosion.

  • Engineering organizations that maintain a separation between operations and development, for example, inevitably find that their development teams design solutions so that when they go wrong, they impact the operations team first and most severely. Meanwhile, their operations team builds processes that throw barriers in front of development, passing the negative consequences of that to those teams. These are both examples of muted feedback loops. The implementers of a decision cannot feel the impacts of their decisions as directly as some other group. One of the reasons the DevOps and SRE movements have had such a beneficial effect on software development is that they seek to re-establish accountability. If product engineering teams play a role in running and maintaining their own infrastructure, they are the ones who feel the impact of their own decisions. When they build something that doesn’t scale, they are the ones who are awakened at 3 am with a page. Making software engineers responsible for the health of their infrastructure instead of a separate operations team unmutes the feedback loop.

  • Designing a modernization effort is about iteration, and iteration is about feedback. Therefore, the real challenge of legacy modernization is not the technical tasks, but making sure the feedback loops are open in the critical places and communication is orderly everywhere else.

  • More robust maintenance practices could help preserve modernity’s finest achievements, from artwork to public transit systems to power grids while at the same time saving money. But first maintenance must be valued, funded, and applied. Do we value maintenance enough to fund it properly?

Resilient Design

Software that is resilient is generally infrastructure agnostic.

  1. Use open standards, or widely adopted standards
  2. Avoid
  3. Be boring
  4. When both observability and testing are lacking on your legacy system, observability comes first. Tests tell you only what won’t fail; monitoring tells you what is failing.

References

https://news.ycombinator.com/item?id=32997102


Image Credit: Bill Gates, CEO of Microsoft, holds Windows 1.0 floppy discs.

(Photo by Deborah Feingold/Corbis via Getty Images) Software evolves. The underlying hardware - CPU, memory, and storage - evolves. The operating system evolves. Of course the software we use must evolve as well. This was the release of Windows 1.0. The beginning.

New York subway trains, nicknamed the Brightliners for their shiny unpainted exteriors, were introduced in 1964 and built to last 35 years. When they were finally retired from service in January 2022, they had been riding the rails of New York City for 58 years and were the oldest operating subway cars in the world. That’s 23 unplanned, extra years of use.

Replacement parts were impossible to find; the firms that made them — the Budd Company, Pullman-Standard, and Westinghouse — were long gone. With components well past their design life, mechanics plucked similar parts from retired vehicles or engineered substitutes. Maintenance is an improvisational process that relies on the know-how long-time employees build up over many years. In a way, it was a small miracle that they lasted so long, an “anomaly,” as one mechanic put it. Except it wasn’t.

Back in the 1980s MTA president David L. Gunn kicked off an ambitious overhaul of the crumbling, graffiti-covered subway. In 1990, this approach became the "scheduled maintenance system". Every two to three months, more than 7,000 railcars are taken in for inspection with the goal of catching problems before they happen. That’s the difference between maintenance and repair. Repair is fixing something that’s already broken. Maintenance is about making something last.

Notably, in 2020 Sabre signed a ten-year deal with Google to migrate SABRE to Google's cloud infrastructure.
