
Making Software Last Forever

How many of us have bought a new home and moved because our prior home was not quite meeting our needs? Maybe we needed an extra bedroom, or wanted a bigger backyard? Now, as a thought experiment, assume you couldn't sell your existing home. If you bought a new home you'd have to abandon your prior home (and your investment in it). Does that change your thinking?

Further, imagine after you bought your prior home you had a team of ten people maintaining it, improving it, and keeping it updated for the last fifteen years. You'd have a cumulative investment of 150 person-years in your existing home (10 people x 15 years) on top of the initial investment. If each person was paid the equivalent of a software developer (we'll use $200k to include benefits, office space, leadership, etc.) you'd have an investment of over $30 million (150 person-years x $200,000). Would you walk away from that investment?

When companies decide to re-write or replace an existing software application, they are making a similar decision. Existing software is abandoned (along with its cumulative investment) for "something new and better". Yet the belief that new code is always better than old is patently absurd. Old code has weathered and withstood the test of time. It has been battle-tested. Bugs have been found, and more importantly, fixed. New and better code means new and better bugs.

Why throw away a working system? Joel Spolsky (of Fog Creek Software and Stack Overflow) describes system re-writes in "Things You Should Never Do, Part I" as “the single worst strategic mistake that any software company can make.”

Recent price increases for construction materials like lumber, drywall, and wiring (and frankly everything else) should, according to Economics 101, make us treat our homes more dearly. Similarly, price increases for software engineers should make companies treat existing software more dearly. Frugality demands that kind of pragmatism.

Why can't software last forever? It's not made of wood, concrete, or steel. It doesn't rot, weather, or rust. A working algorithm is a working algorithm. Technology doesn’t need to be beautiful, or to impress other people, to be effective. All technologists are ultimately in the business of producing effective technology.

The World's Oldest Software Systems

In 1958, the United States Department of Defense launched a new computer-based contract management system called "Mechanization of Contract Administration Services", or MOCAS (pronounced “MOH-cass”). In 2015, MIT Technology Review stated that MOCAS was the oldest computer program in continuous use they could verify. At that time MOCAS managed about $1.3 trillion in government obligations and 340,000 contracts.

According to the Guinness Book of World Records, the oldest software system in use today is either the SABRE Airline Reservation System (introduced in 1960) or the IRS Individual Master File (IMF) and Business Master File (BMF) systems introduced in 1962-63.

SABRE went online in 1960. It had cost $40 million to develop and install (about $400 million in 2022 dollars). The system took over all American Airlines booking functions in 1964, and the system was expanded to provide access to external travel agents in 1976.

What is the secret to the long lifespan of these systems? Shouldn't companies with long-lived products (annuities, life insurance, etc.) study these examples? After all, they need systems to support products that last most of a human lifespan.

The Open Mainframe Project estimates that there are about 250 billion lines of COBOL code running today in the world economy, and nearly all COBOL code contains critical business logic. Shouldn't all companies want their investments in software to last as long as possible?

Maintenance is About Making Something Last

We spoke of SABRE above, and we know that airlines recognize the value of maintenance. Commercial aircraft are inspected at least once every two days. Engines, hydraulics, environmental, and electrical systems all have additional maintenance schedules. A "heavy" maintenance inspection occurs once every few years. This process maintains the aircraft's service life over decades.

On average, an aircraft is operable for about 30 years before it must be retired. A Boeing 747 can endure 35,000 pressurization cycles — roughly 135,000 to 165,000 flight hours — before metal fatigue sets in. However, most older airplanes are retired for fuel-efficiency reasons, not because they're worn out.

The most amazing story of the power of maintenance shows that even structures made of grass can last indefinitely. Inca rope bridges were simple suspension bridges constructed by the Inca Empire. The bridges were an integral part of the Inca road system and were constructed from ichu grass. The grass deteriorated rapidly; the bridge's strength and reliability came from the fact that each cable was replaced every June.

Inca Rope Bridge

Even though they were made of grass, these bridges were maintained with such regularity and attention that they lasted for centuries.

The goal of maintenance is catching problems before they happen. That’s the difference between maintenance and repair. Repair is about fixing something that’s already broken. Maintenance is about making something last.

Unfortunately, Maintenance is Chronically Undervalued

There is a Chinese proverb about a discussion between a king and a famous doctor. The well-known doctor explains to the king that his brother (who is also a doctor) is superior at medicine, but he is unknown because he always successfully treats small illnesses, preventing them from evolving into terminal ones. So, people say "Oh he is a fine doctor, but he only treats minor illnesses". It's true: Nobody Ever Gets Credit for Fixing Problems that Never Happened.

Maintenance is one of the easiest things to cut when budgets get tight. Some legacy software systems have endured decades of underinvestment. This leads to the inevitable "we have to replace it" discussion - which somehow always sounds more persuasive (even though it's more expensive and riskier) than arguing to invest in system rehabilitation and catching up on deferred maintenance.

Executives generally can't refuse "repair" work because the system is broken and must be fixed. However, maintenance is a tougher sell. It’s not strictly necessary — or at least it doesn’t seem to be until things start falling apart. It is so easy to divert maintenance budget into a halo project that gets an executive noticed (and possibly promoted) before the long-term effects of underinvestment in maintenance become visible. Even worse, the executive is also admired for reducing the costs of maintenance and switching costs from "run" to "grow" - while they are torpedoing the company under the waterline.

The other problem is conflating enhancement work with maintenance work. Imagine you have $1,000 and you want to add a sunroof to your car, but you also need new tires (which coincidentally cost $1,000). You have to replace the tires every so often, but a sunroof is "forever" right? If you spend the money on the sunroof the tires could get replaced next month, or maybe the month after - they'll last a couple more months, won't they?

With software, users can't even see "the bald tires" - the only things they see, experience, and value are new features and capabilities. Pressure is always present to cut costs and to add new features, so budget always swings away from maintenance work toward enhancements.

Finally, maintenance work is typically an operational cost, yet building a new system, or a significant new feature, can often be capitalized - making the future costs someone else's problem.

"Most of the systems I work on rescuing are not badly built. They are badly maintained."

— Marianne Bellotti, Kill it With Fire

Risks of Replacing Software Systems

It's not the age of a system that causes it to fail but rather neglect. People fail to maintain software systems because they are not given the time, incentives, or resources to maintain them.

Once a system degrades, it is an enormous challenge to fund deferred maintenance (or "technical debt"). No one plans for it, no one wants to pay for it, and no one even wants to do it. Initiatives to restore operational excellence, much the way one would fix up an old house, tend to have few volunteers among engineering teams. No one gets noticed doing maintenance. No one ever gets promoted because of maintenance.

Thus, engineers prefer to replace or re-write a system rather than maintain it. They get to "write a new story" rather than edit someone else's. They will attempt to convince a senior executive to fund a project to replace a problematic system by describing all the new features and capabilities that could be added as well as how "bad" the existing, unmaintained, system has become.

Incentives aside, engineering teams tend to gravitate toward system rewrites because they incorrectly think of old systems as specs. They assume that since an old system works, the functional risks have been eliminated. They can focus on adding more features to the new system or make changes to the underlying architecture without worry. Either they do not perceive the ambiguity these changes introduce, or they see such ambiguity positively, imagining only gains in performance and the potential for greater innovation.

Before choosing to replace a system that needs deferred maintenance, remember that it's the lack of maintenance that creates the impression that failure is inevitable and pushes otherwise rational engineers and executives toward rewrites when rewrites are not necessary. Even so, eventually a "replacement" project will be funded (typically at a much higher expenditure than rehabilitating the existing system). Why not authorize that multimillion-dollar rewrite if the engineers convince management the existing system is doomed? Even if the executives are not listening to the engineers, they will be listening to external consultants telling them they are falling behind.

Everyone starts off excited about the new system that will do "all the things". But what do you do with the old system while you’re building the new one? Most organizations choose to put the old system on “life support” and give it only the resources for patches and fixes necessary to keep the lights on. Of course this reduces maintenance even further, and becomes a self-fulfilling prophecy that the existing system will fail.

Who gets to work on the new system, and who takes on the maintenance tasks of the old system? If the old system is written in older technology that the company is actively abandoning, the team maintaining the old system is essentially sitting around waiting to be fired. And don’t kid yourself, they know it. So, if the people maintaining the old system are not participating in the creation of the new system, you should expect that they are also looking for new jobs. If they leave before your new system is operational, you lose both their expertise and their institutional knowledge.

If the new project falls behind schedule (and it almost certainly will), the existing system continues to degrade, and knowledge continues to walk out the door. If the new project fails and is subsequently canceled, the gap between the legacy system and operational excellence has widened significantly in the meantime.

This explains why executives are loath to cancel system replacement projects even when they are obviously years behind schedule and failing to live up to expectations. Stopping the replacement project seems impossible because the legacy system is now so degraded that restoring it to operational excellence looks equally hopeless. Plus, politically, canceling a marquee project can be career suicide for the sponsoring executive(s). Much better to do "deep dives" and "assessments" on why the project is failing and soldier on than cancel it.

The interim state is not pretty. The company now has two systems to operate. The new system will have high costs, limited functionality, new and unique errors and issues, and lower volumes (so the "per unit cost" of the new system will be quite high). The older system, having lost its best engineers and subject matter experts, will still be running most of the business, yet its maintenance budget will have been whittled down to nothing to redirect spending to implement (save?) the new system. Neither system will exhibit operational excellence, and both put the organization at significant risk, in addition to the higher costs and complexity of running two systems.

Maintaining Software to Last Forever

As I discussed in How Software Learns, software adapts over time - it is continually refined and reshaped by maintenance. Maintenance is crucial to software's lifespan and business relevance and value. When software systems are first developed, they are based on a prediction of the future - a prediction we know is wrong. No set of requirements has ever been perfect. However, any new system becomes "less wrong" as time, experience, and knowledge are continually added through maintenance.

Maintenance is tailoring software to its true needs and use over time. Futureproofing means constantly rethinking and iterating on the existing system. We know from both research and experience that iterating on and maintaining existing solutions is both more likely to succeed and less expensive as a way to improve software's lifespan and functionality.

More robust maintenance practices could preserve software for decades, but first maintenance must be valued, funded, and applied. To maintain software properly we have to consider:

  1. How do you measure the overall health of a system?
  2. How do you define maintenance work?
  3. How do you define a reasonable maintenance budget?
  4. How can you protect that budget?
  5. How can you incent engineers to perform maintenance?

1. How do you measure the overall health of a system?

Objective measures

  1. Maintenance Backlog — If you added up all the open work requests, including work the software engineers deem necessary to eliminate technical debt, what is the total amount of effort? Now, divide that by the team capacity. For example, imagine you have a total amount of work of 560 days, and one person assigned to support the system who works approximately 200 days annually. The backlog in days is 560, but in time it is 2.8 years (560 days / 200 days per year = 2.8 years). A quick sketch of this arithmetic follows the list.

  2. System Reliability/Downtime — If you added up all the time the system is down in a given period, what is the total amount? What is the user or customer impact of that downtime? Conversely, what would reducing that downtime be worth?
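
To make the backlog measure concrete, here is a minimal sketch in Python. The function name and the 200 working days per person per year are assumptions carried over from the example above, not a standard formula.

```python
# Minimal sketch of the maintenance-backlog measure described above.
# Assumes capacity = people assigned x working days per person per year.

def backlog_years(open_work_days: float, people: int,
                  days_per_person_per_year: int = 200) -> float:
    """Return the open maintenance backlog expressed in years of team capacity."""
    annual_capacity = people * days_per_person_per_year
    return open_work_days / annual_capacity

# Example from the text: 560 days of open work, one person at ~200 days/year.
print(backlog_years(560, people=1))  # -> 2.8
```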

Governance

Healthy application architecture isn't just about technology; it's also about having well-documented and well-understood governance documents that guide software selection and the right number of applications for your organization. Good governance is key to avoiding one of the subjective measures below: "shadow IT" applications.

Subjective measures

  1. User Satisfaction: User satisfaction includes both how happy your employees are with the applications and how well those applications meet your customers' needs.

  2. “Shadow IT” applications — If you used applications in the past that didn’t meet employees’ needs and didn’t have a good governance plan to address problems, you may have noticed those employees found other applications on their own.

  3. How up-to-date is your architecture? — Mobile applications, the cloud, big data, and the internet of things (IoT) are no longer "nice to haves." They are now "must haves." Your architecture needs to be ready to integrate these if it isn't already.

2. How do you define maintenance work?

There are four general types of software maintenance. The first two types take up the majority of most organizations' maintenance budgets. If you aren't able to address types three and four, then your maintenance budget is not adequate.

1. Corrective Software Maintenance (also called repair)

Corrective software maintenance is the typical, classic form of maintenance (for software and anything else for that matter). It is necessary when something goes wrong in a piece of software, including faults and errors. These can have a widespread impact on the functionality of the software and therefore must be addressed as quickly as possible. However, it is important to consider repair work separately from the other types of maintenance because repair work must get done - this is the only type of work that happens when a system is put on "life support".

2. Perfective Software Maintenance (also called enhancements)

Once software is released and is being used, new issues and ideas come to the surface. Users will think up new features or requirements that they would like to see in the software to make it the best tool available for their needs. This is when perfective software maintenance comes into play. Perfective software maintenance aims to adjust software by adding new features as necessary and removing features that are irrelevant or ineffective. This process keeps software relevant as the market, and user needs, change. Many software engineers consider this type of maintenance to be "enhancements".

3. Preventative Software Maintenance (true maintenance is catching problems before they happen.)

Preventative software maintenance looks into the future so that your software can keep working as desired for as long as possible. This includes making necessary changes, upgrades, and adaptations. Preventative software maintenance may address small issues which at the given time may lack significance but may turn into larger problems in the future. These are called latent faults, which need to be detected and corrected to make sure they won't turn into effective faults. This is the type of maintenance that is always underfunded.

4. Adaptive Software Maintenance

Adaptive software maintenance has to do with the changing technologies as well as policies and rules regarding your software. These include operating system changes, cloud storage, hardware, etc. When these changes are performed, your software must adapt in order to properly meet new requirements and continue to run well.

Is: Preventing future issues. Technical work to take advantage of new hardware?
Is Not: Adding new features.

3. How do you define a "reasonable" maintenance budget?

In the case of the Inca rope bridges what was the cost of maintenance annually? Let's assume some of the build work was site preparation and building the stone anchors on each side, but most of the work was constructing the bridge itself. Since the bridge was entirely replaced each year, the maintenance costs could be as much as 80% of the initial build effort, every year.

In the case of a brand-new software system, that is well architected, well designed and built, and meets all reliability, scalability and performance needs (basically no software ever) it's conceivable that there is no maintenance necessary for some period of time. So, in the beginning the maintenance budget = $0.

A general rule of thumb is that 20-40% of the initial build cost will be required annually for ongoing maintenance work. However, maintenance costs usually start off lower and increase over time. They are also "spiky" costs that are often hard to predict well.
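
As a rough, hypothetical illustration of that rule of thumb (the five-year ramp and the 30% midpoint rate are assumptions for the sketch, not data):

```python
# Hypothetical sketch: maintenance budget as a share of the initial build cost,
# ramping from ~0 at go-live toward a steady-state rate over a few years.

def annual_maintenance_budget(build_cost: float, years_since_golive: int,
                              steady_state_rate: float = 0.30,  # midpoint of 20-40%
                              ramp_years: int = 5) -> float:    # assumed ramp period
    ramp = min(years_since_golive / ramp_years, 1.0)
    return build_cost * steady_state_rate * ramp

build_cost = 5_000_000  # hypothetical initial build cost
for year in range(8):
    print(year, f"${annual_maintenance_budget(build_cost, year):,.0f}")
```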

This is like buying a brand-new car. The maintenance costs are negligible in the first couple years, until they start to creep up. Then things start to need maintenance, replacement or repair. As the car ages the maintenance costs continue to increase until at some point it makes economic sense to buy another new car. Except none of us wait that long. Most of us buy new cars before our old one is completely worn out. Yet in Cuba cars have been maintained meticulously for 30-40 years and some even run better than new. Food for thought.

The challenges should be obvious. First, budgets in large organizations tend to be last year's budget plus 2-3%. If you start with a maintenance budget of zero on a new system, how do you ever get to a healthy maintenance budget in the future? Second, maintenance costs are unpredictable, and organizations hate unpredictable costs. It's impossible to say when the next new hardware, storage, or programming construct will arrive, or when the existing system will hit a performance or scalability inflection point.

4. How can you protect that budget? Creating a Maintenance Fund

We know that maintenance costs increase over time and that the costs of proper maintenance are unpredictable. In addition, there is some amount of discretion that can be applied. When your house needs a new roof, it's reasonable to defer it through summer, but it probably needs to be done before winter.

Since businesses require predictability of costs, unpredictable maintenance costs are easy to defer. "We didn't budget for that; we'll have to put it in next year's budget." Except of course in the budget process it will compete with other projects and enhancement work, where it's again likely to be deprioritized.

What's the solution? Could it be possible to create some type of maintenance fund where a predictable amount is budgeted each year and spent "unpredictably" when and as needed? Could this also be a way to prevent executives from diverting maintenance budget into pet projects, by protecting the maintenance fund in some fashion?
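
One way to picture such a fund is a simple sinking-fund model: a fixed, predictable contribution each year, drawn down unpredictably as maintenance actually happens. The contribution level and the yearly draws below are invented purely for illustration.

```python
# Hypothetical maintenance fund: predictable contributions, unpredictable draws.

annual_contribution = 300_000                       # fixed, budgetable line item
actual_spend = [50_000, 120_000, 80_000, 700_000,   # spiky real-world maintenance
                150_000, 400_000]

balance = 0
for year, spend in enumerate(actual_spend, start=1):
    balance += annual_contribution - spend
    print(f"Year {year}: spent {spend:>7,}, fund balance {balance:>9,}")
```

Finance sees one flat number every year; engineering gets to spend it when the roof actually needs replacing.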

5. How can you incent engineers to perform maintenance?

This question is taken up in the Incentives section below.

Legacy System Modernization

Bad software is unmaintained software. Due to factors discussed above, software does not always receive the proper amount of maintenance to remain healthy. Eventually a larger modernization effort (or replacement) becomes necessary to restore a system to operational and functional excellence.

Legacy modernization projects start off feeling easy. The organization once had a reliable working system and kept it running for years. All the modernizing team should need to do is simply rebuild it using better technology, the benefit of hindsight, and improved tooling. It should be simple. But, because people do not see the hidden technical challenges they are about to uncover, they also assume the work will be boring. There’s little glory to be had re-implementing a solved problem.

Modernization projects take months, if not years of work. Keeping a team of engineers focused, inspired, and motivated from beginning to end is difficult. Keeping their senior leadership prepared to invest in what is, in effect, something they already have is a huge challenge. Creating momentum and sustaining it are where most modernization projects fail.

The hard part about legacy modernization is the "system around the system". The organization, its communication structures, its politics, and its incentives are all intertwined with the technical product in such a way that to improve the product, you must do it by turning the gears of this other, complex, undocumented system. Pay attention to politics and culture. Technology is at most only 50% of the legacy problem; ways of working, organization structure, and leadership are just as important to success.

An organization about to embark on such an undertaking craves new features, new functionality, and new benefits. Modernization projects are typically the ones organizations just want to get out of the way, so they usually launch into them unprepared for the time and resource commitments they require.

To build that momentum, you need to provide significant value right away, as soon as possible, so that you overcome people's natural skepticism and get them to buy in. The important word in the phrase "proof of concept" is proof. You need to prove to people that success is possible and worth doing. It can't be just an MVP; MVPs are dangerous here because proving minimal viability is not the same as proving value.

Large, expensive projects kicked off to fix things that are not obviously broken break trust with the nontechnical parts of the organization. They inconvenience colleagues, frustrate them, and sometimes confuse them. A modernization effort needs buy-in beyond engineering to be successful. Spending time and money on changes that have no immediate visible impact on the business or mission side of operations makes it hard to secure that buy-in in the future.

A big red flag is raised when people talk about the phases of their modernization plans in terms of which technologies they are going to use rather than what value they will add. Engineering needs to be reminded that it's not about the technology. For all that people talk about COBOL dying off, it is good at certain tasks. The problem with most old COBOL systems is that they were designed at a time when COBOL was the only option. Start by sorting which parts of the system are in COBOL because COBOL is good at that task, and which parts are in COBOL because there were no other technologies available. Once you have that mapping, pull the latter out into separate services written and designed using the technology you would choose for that task today.

Good measurable business problems and business value have to be aligned with problems that your engineers care about. In all likelihood, you won’t be able to say your database migration is going to do that, but engineers feel passionate about other things. Talk to them and figure out what those are.

Counterintuitively, SLAs/SLOs are valuable because they give people a "failure budget". When organizations stop aiming for perfection and accept that all systems will occasionally fail, they stop letting their technology rot for fear of change. Some organizations can’t be talked out of wanting five or six nines of availability. In those cases, mean time to recovery (MTTR) is a more useful statistic to push than reliability. MTTR tracks how long it takes the organization to recover from failure. Resilience in engineering is all about recovering stronger from failure. That means better monitoring, better documentation, and better processes for restoring services, but you can’t improve any of that if you don’t occasionally fail.
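
To make the "failure budget" idea concrete, here is a small sketch that converts an availability target into allowed downtime per 30-day window. The 30-day window is just a common convention assumed for the example.

```python
# Downtime allowed by an availability SLO over a 30-day window.

def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)

for slo in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{slo}: {allowed_downtime_minutes(slo):7.1f} minutes of downtime allowed")
```

Once the budget is visibly nonzero, teams can spend it deliberately on change instead of letting the system rot for fear of any failure at all.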

Although a system that constantly breaks, or that breaks in unexpected ways without warning, will lose its users’ trust, the reverse isn’t necessarily true. A system that never breaks doesn’t necessarily inspire high degrees of trust.

"As an industry, we reflect on success but study failure."

— Marianne Bellotti, Kill it With Fire

People take systems that are too reliable for granted. Italian researchers Cristiano Castelfranchi and Rino Falcone have been advancing a general model of trust in which trust degrades over time, regardless of whether any action has been taken to violate that trust. Under Castelfranchi and Falcone’s model, maintaining trust doesn’t mean establishing a perfect record; it means continuing to rack up observations of resilience. If a piece of technology is so reliable it has been completely forgotten, it is not creating those regular observations. Through no fault of the technology, the user’s trust in it slowly deteriorates.

When both observability and testing are lacking on your legacy system, observability comes first. Tests tell you only what shouldn’t fail; monitoring tells you what is failing. Don’t forget: a perfect record will always be broken, but resilience is an accomplishment that lasts. Modern engineering teams use stats like service level objectives, error budgets, and mean time to recovery to move the emphasis away from avoiding failure and toward recovering quickly.

Organizational Design

Conway's law tells us that technical architecture and organizational structure mirror one another, so organizational design matters. Nothing says you're serious about accomplishing something more than changing people's scenery, but don't assume that organizational change is necessary out of the gate.

Reorgs are incredibly disruptive. They are demoralizing. They send the message to rank-and-file engineers that something is wrong: they built the wrong thing, or the product they built doesn't work, or the company is struggling. They increase workplace anxiety and decrease productivity. The fact that reorgs almost always end up with a few odd people out who are subsequently let go exacerbates the issue.

What you don’t want to do is draw a new organization chart based on your vision for how teams will be arranged with the new system. You don’t want to do this for the same reason that you don’t want to start product development with everything designed up front. Your concept of what the new system will look like will be wrong in some minor ways you can’t possibly foresee. You don’t want to lock in your team to a structure that will not fit their needs.

The only way to design communication pathways is to give people something to communicate about. In each case, we allow the vision for the new organization to reveal itself by designing structures that encourage new communication pathways to form in response to our modernization challenges. As the work continues, those communication pathways begin to solidify, and we can begin documenting and formalizing new teams or roles. In this way, we sidestep the anxiety of reorganizing. The workers determine where they belong based on how they adapt to problems; workers who would otherwise be left out are given time and space to learn new skills or prove themselves in different roles, and by the time the new organization structure is ratified by leadership, everyone has already been working that way for a couple of months.

Who needs to communicate with whom may not be clear when you get started. This is an exercise I use to help reveal where the communication pathways are or should be. I give everyone a piece of paper with a circle drawn on it. The instructions are to write down the names of the people whose work they are dependent on inside the circle (in other words, “If this person fell behind schedule, would you be blocked?”) and the names of people who give them advice outside the circle. If there’s no one specific person, they can write a group or team name or a specific role, like frontend engineer, instead. Then I compare the results across each team. In theory, those inside the circle are people with whom the engineer needs to collaborate closely. Each result should resemble that engineer’s actual team with perhaps a few additions or deletions based on current issues playing out.

Outside the circle should be all the other teams. Experts not on the team should be seen as interchangeable with other experts in the same field. Small variations will exist from person to person, but if the visualizations that people produce don’t look like their current teams, you know your existing structure does not meet your communication needs.

Incentives

Engineering loves new technology. It brings engineers attention and industry marketability. Boring technology, on the other hand, is great for the company. The engineering cost is lower, and the skills are stickier, because these engineers are not being pulled out of your organization for double their salary by Amazon or Google.

To most software engineers, legacy systems may seem like torturous dead-end work, but the reality is that systems that are not used get turned off. Working on legacy systems means working on some of the most critical systems that exist, computers that govern millions of people's lives in innumerable ways. This is not the work of technical janitors, but of battlefield surgeons.

Pay attention to how engineers are incentivized. What earns them the acknowledgment of their peers? What gets people seen is what they will ultimately prioritize, even if those behaviors are in open conflict with the official instructions they receive from management. Shipping new code gets attention, while technical debt accrues silently and without fanfare.

The specific form of acknowledgment also matters a lot. Positive reinforcement in the form of social recognition tends to be a more effective motivator than the traditional incentive structure of promotions, raises, and bonuses. Behavioral economist Dan Ariely attributes this to the difference between social markets and traditional monetary-based markets. Social markets are governed by social norms (read: peer pressure and social capital), and they often inspire people to work harder and longer than much more expensive incentives that represent the traditional work-for-pay exchange. In other words, people will work hard for positive reinforcement; they might not work harder for an extra thousand dollars.

In fact, the idea that one needs a financial reward to want to do a good job for an organization is cynical. It assumes bad faith on the part of the employee, which builds resentment. Traditional incentives have little positive influence, therefore, because they disrupt what was otherwise a personal relationship based on trust and respect. Behavioralist Alfie Kohn puts it this way: Punishment and rewards are two sides of the same coin. Rewards may have a negative effect because they, like outright punishment, are manipulative. “Do this and you’ll get that” is not really very different from “Do this or here’s what will happen to you.” In the case of incentives, the reward itself may be highly desired; but by making that bonus contingent on certain behaviors, managers manipulate their subordinates, and that experience of being controlled is perceived negatively.

People’s perception of risk is not static, and it’s often not connected to the probability of failure so much as it is the potential feeling of rejection and condemnation from their peers. Since social pressures and rewards are better incentives than money and promotions, you can improve your odds of success by learning how to manipulate an organization’s perception of risk.

The first task is to understand what behaviors get individuals within an organization acknowledged. Those are the activities that people will ultimately prioritize. If they are not the activities that you think will advance your modernization process, explore constructs, traditions, and — there’s no better word for it — rituals that acknowledge complementary activities.

The second task is to look at which part of the organization gets to determine when the operator can exercise discretion and deviate from defined procedure. Those are the people who set the ratio of blamelessness to accountability in handling failure. Those are the people who will grant air cover and who need to buy in to any breaking change strategy for it to be successful. Once that air cover is established, anxiety around failure tends to relax.

Once you know how to manipulate the organization’s perception of risk, successfully managing the break is all about preparation. While you will not be able to predict everything that could go wrong, you should be able to do enough research to give your team the ability to adapt to the unknown with confidence. At a minimum, everyone should know and understand the criteria for rolling back a breaking change.

Building Software to Last Forever

Stewart Brand, the editor of the Whole Earth Catalog, described buildings as having architectural "layers" that change at different rates:

  • (INFRA)STRUCTURE: foundation & load bearing, rarely changed
  • SKIN: changes every 20 years, more air-tight, better insulated
  • SERVICES: Plumbing, electrical, etc. - changes every 7-15 years
  • SPACE PLAN: interior layout, walls, ceilings, floors, doors. changes every 3-10 years

This concept can be applied to software systems in a similar way.
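
For example (the mapping and cadences below are purely illustrative assumptions, not an established taxonomy), the layers might translate to a software system roughly like this:

```python
# Hypothetical translation of Brand's building layers to a software system.
software_layers = {
    "(INFRA)STRUCTURE": ("core data model, domain logic, protocols", "rarely changed"),
    "SKIN":             ("public APIs and integration contracts",    "changes every several years"),
    "SERVICES":         ("frameworks, databases, runtimes, hosting", "changes every 5-10 years"),
    "SPACE PLAN":       ("features, workflows, screen layouts",      "changes every few months or years"),
}

for layer, (examples, cadence) in software_layers.items():
    print(f"{layer:<18} {examples:<45} {cadence}")
```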

In short, small organizations build monoliths because small organizations are monoliths.

In one organization, the engineering culture was flat, with teams formed on an ad hoc basis. Opportunities to work on interesting technical challenges were awarded based on personal relationships, so the organization's regular hack days became critical networking events. Engineers wanted to build difficult and complex solutions to advertise their skills to the lead engineers who were assembling teams.

Well-designed, high-functioning software that is easy to understand usually blends in. Simple solutions do not do much to enhance one’s personal brand. They are rarely worth talking about. Therefore, when an organization provides limited pathways to promotion for software engineers, they are incentivized to make technical decisions that emphasize their individual contribution over integrating well into an existing system.

=====

The first step of legacy system replacement is to understand the desired outcomes, according to the authors. They point out that "it is vital for an organization to agree the outcomes they want to achieve when tackling legacy. While this may seem obvious, all too often different parts of an organization can have quite different views on the desired outcomes." Some typical legacy replacement outcomes are reducing the cost of change, improving business processes, retiring old systems that are no longer supported or overcoming market disruptions by new players. The authors add that introducing new technology should never be a goal by itself; it must support the current and predicted needs of the business instead.

The second step is to decompose the problem into smaller parts, the authors add. They note that "many applications are built to serve multiple logical products from the same physical system. Often this is driven by a desire for reuse. (...) A major problem we come across is that superficially the products look similar but they [are] very different when it comes to the detail." The Extract Product Lines pattern reduces risk and divides the legacy system into slices that can be (re)built simultaneously and delivered incrementally. Slices are extracted by discovering the product or product lines in the system and their business capabilities. The authors then prioritise the extraction of the products found by risk and begin with the second riskiest, maintaining the business’s attention but not risking a significant failure in case of problems.

Aiming for feature parity is often an underestimated goal and is not recommended except for a few use cases, the authors write. They even state that "if feature parity is a genuine requirement then this pattern describes what it might take to do well. It is not an easy path, nor one to be taken lightly." It requires running complete system surveys to understand user journeys and the system’s inputs and outputs, integrations, and batch processes. Such surveys often need to be complemented with systems archaeology to discover their inner workings, instrumentation to discover what is being used, and feature value mapping to understand which features may be dropped. Naturally, tests must be used to ensure that the new system behaves like the old one. The authors conclude that because this pattern is so expensive, it should only be used when all specifications are well known or when any changes would disrupt existing users too much.

===

In 1984, Charles Perrow coined the term "normal accidents" to describe systems so prone to failure that no amount of safety procedures could eliminate accidents entirely. According to Perrow, normal accidents are not the product of bad technology or incompetent staff. Rather, systems that experience normal accidents display two important characteristics:

  1. They are tightly coupled. When two separate components are dependent on each other, they are said to be coupled. In tightly coupled situations, there's a high probability that a change in one component will affect the other. Tightly coupled systems produce cascading effects.

  2. They are complex. Signs of complexity in software include the number of direct dependencies and the depth of the dependency tree. Computer systems naturally grow more complex as they age, because we tend to add more and more features to them over time.

Tightly coupled and complex systems are prone to failure because the coupling produces cascading effects, and the complexity makes the direction and course of those cascades impossible to predict.

In a simple example of coupling, imagine software that tightly links three components together. Each component is 99% reliable. The overall reliability of the system is 0.99 to the third power, or 0.99 x 0.99 x 0.99 = 0.97 (97% reliable). Now imagine that instead of three tightly linked components it requires ten. The system's reliability would be 0.99 to the 10th power, or only about 90%. It would be out of service roughly 10% of the time, despite being made of highly reliable components.
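
The arithmetic generalizes: when components are chained so that all of them must work, their reliabilities multiply. A two-line sketch:

```python
# Overall reliability of n tightly coupled components, each individually reliable.
def series_reliability(per_component: float, n: int) -> float:
    return per_component ** n

print(series_reliability(0.99, 3))   # ~0.970 -> about 97% reliable
print(series_reliability(0.99, 10))  # ~0.904 -> only about 90% reliable
```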

=====

--

The planning isn't the issue, it's the conviction, resolve, determination...whatever you want to call it.

COBOL modernization is quite expensive and fraught with major risk. You're talking about multi-year efforts where the key players who initiate the effort are often not around to see it through. Limited legacy resources, poor-to-absent documentation, mission critical systems...it's a mess no one internally wants to touch or take the fall for. MicroFocus, Blu Age (now part of AWS), TSRI, and plenty of other companies are happy to take your money and tell you it can be done, and it certainly can be done technically.

The reality is the cost benefit is often years down the line, so keeping the right people around to sustain the effort and see it through financially and technically is the tallest of enterprise orders. So it's the old two-step: 1) Pay IBM for another three years of support and licensing and 2) Go through the motions of migration planning then repeat Step 1.

--

It's funny how if you ask people about forward-looking systems today, they have either learned mistruths or haven't learned enough about the history of computing to picture much that's reasonable. "Java!" used to be talked about as a good way to create software that you could run in the future, but anyone who has had to keep around old laptops / old VMs to run ancient Java to handle KVMs or device management tools knows how ridiculous an expectation about the stability of Java can be.

"Windows!" is funny, because I can't tell you the number of places that have an ancient Windows PC that has either been repaired or carefully replaced with new enough, yet old enough hardware to still run Windows 2000 or XP, because copy protection is stupid, and / or because drivers for specific hardware can't be updated or run on newer Windows, and / or because the software can't be compiled and run on newer Windows without tons of work.

On the other hand, you can take large, complicated programs from the 1980s written in C and compile them on a modern Unix computer without issues, even when the modern computer is running architectures which hadn't even been dreamt of when the software was written...

--

But I think you're forgetting how much code is written to C89. How much of bash, for instance, is ancient, with updates that avoid newer toolchain features so that it can be as portable as possible?

Yes, people don't often write stuff with portability as a goal at the beginning, but once something is relatively portable, it stays that way. Lots of code that wasn't poorly written made it from all-the-world's-a-VAX to i386, from i386 to amd64, and now ARM and aarch64, with a minimum of extra effort. There just had to be a little effort to NOT program like a jerk, which, as funny as it is, is still an issue now.

I'm running Pine as my daily email program, which was written in 1989 and hasn't had any real updates since 2005. New architecture? Compile it. Lots of modern software started out as C software from the 1980s.

  • A useful "inverse" brainstorming exercise is asking your team to play saboteur. If you wanted to guarantee that the project fails, what would you do? How can you achieve the worst possible outcome? Once this list is generated, the team discusses whether any behaviors, either internal or from external partners, come close to items on the saboteur list.

Some of Conway's other observations include the following: individual incentives have a role in design choices. People will make design decisions based on how a specific choice (using a shiny new tool or process, say) will shape their future. Minor adjustments and rework are unflattering; they make the organization and its future look uncertain and highlight mistakes. To save face, reorgs and full rewrites become preferable solutions, even though they are more expensive and often less effective.

An organization’s size affects the flexibility and tolerance of its communication structure. When a manager’s prestige is determined by the number of people reporting up to her and the size of her budget, the manager will be incentivized to subdivide design tasks that in turn will be reflected in the efficiency of the technical design — or as Conway put it: “The greatest single common factor behind many poorly designed systems now in existence has been the availability of a design organization in need of work.” Conway’s observations are even more important in the maintaining of existing systems than they are in the building of new systems.

  • Whether we are trying to ensure success or avoid failure also shapes our choices. When success seems certain, we gravitate toward more conservative, risk-averse solutions. When failure seems more likely, we switch mentalities completely: we go bold and take more risks. If we are judging the odds correctly, this behavior makes sense. Why not authorize that multimillion-dollar rewrite if the existing system is doomed?

Resilient Design

Software that is resilient is generally infrastructure agnostic.

  1. Use open standards, or widely adopted standards
  2. Avoid
  3. Be boring
  4. When both observability and testing are lacking on your legacy system, observability comes first. Tests tell you only what won’t fail; monitoring tells you what is failing.

References

https://news.ycombinator.com/item?id=32997102


Image Credit: Bill Gates, CEO of Microsoft, holds Windows 1.0 floppy discs.

(Photo by Deborah Feingold/Corbis via Getty Images.) Software evolves. The underlying hardware, CPU, memory, and storage evolve. The operating system evolves. Of course the software we use must evolve as well. This was the release of Windows 1.0. The beginning.

New York subway trains, nicknamed the Brightliners for their shiny unpainted exteriors, were introduced in 1964 and built to last 35 years. When they were finally retired from service in January 2022, they had been riding the rails of New York City for 58 years and were the oldest operating subway cars in the world. That’s 23 unplanned, extra years of use.

Replacement parts were impossible to find; the firms that made them (the Budd Company, Pullman-Standard, and Westinghouse) were long gone. Components were well past their design life, so mechanics plucked similar parts from retired vehicles or engineered substitutes. Maintenance became an improvisational process that relied on the know-how long-time employees had built up over many years. In a way, it was a small miracle that they lasted so long, an "anomaly," as one mechanic put it. Except it wasn't.

Back in the 1980s MTA president David L. Gunn kicked off an ambitious overhaul of the crumbling, graffiti-covered subway. In 1990, this approach became the "scheduled maintenance system". Every two to three months, more than 7,000 railcars are taken in for inspection with the goal of catching problems before they happen. That’s the difference between maintenance and repair. Repair is fixing something that’s already broken. Maintenance is about making something last.

Notably, in 2020 Sabre signed a ten-year deal with Google to migrate SABRE to Google's cloud infrastructure.

Maintenance mostly happens out of sight, mysteriously. If we notice it, it’s a nuisance. When road crews block off sections of highway to fix cracks or potholes, we treat it as an obstruction, not a vital and necessary process. When your refrigerator needs maintenance you defer it, hoping maybe you can just buy a new one later. If you wait long enough and your refrigerator breaks down, you get upset at the cost of the repair bill - not at yourself for deferring the maintenance.

This is especially true in the public sector: it's almost impossible to get governmental action on, or voter interest in, spending on preventive maintenance, yet governments make seemingly unlimited funds available once we have a disaster. We are okay spending a massive amount of money to fix a problem, but consistently resist spending a much smaller amount of money to prevent it; as a strategy this makes no sense.

Many organizations say they want to be "data driven" but that requires critical thinking skills as well. In some places I see "being data driven" leading to poor decisions because data can be (and is) manipulated to support a desired outcome.

Many times I have found the technology team and the business users arguing over "bug" vs. "enhancement". When I ask why, the answer is always the same: it is a way of assigning blame. "Bug" means it's engineering's fault; "enhancement" means it was a missed requirement by the business. I always tell everyone they are both just maintenance, and our only decision is which to fix first.
