Making Software Last Forever
How many of us have bought a new home because our prior home was not quite meeting our needs? Maybe we needed an extra bedroom, or wanted a bigger backyard? Now, as a thought experiment, assume you couldn't sell your existing home. If you bought a new home, you'd have to "retire" or "decommission" your prior home (and your investment in it). Does that change your thinking?
Further, imagine you had a team of five people maintaining your prior home, improving it, and keeping it updated, for the last ten years. You'd have a cumulative investment of 50 person/years in your existing home (5 people x 10 years) just in maintenance, on top of the initial investment. If each person was paid the equivalent of a software developer (we'll use $200k to include benefits, office space, leadership, etc.) you'd have an investment just in labor of $10 million dollars (50 person/years x $200,000). Would you walk away from that investment?
When companies decide to re-write or replace an existing software application, they are making a similar decision. Existing software is "retired" or "decommissioned" (along with its cumulative investment). Yet the belief that new code is always better than old is patently absurd. Old code has weathered and withstood the test of time. It has been battle-tested. You know it's failure modes. Bugs have been found, and more importantly, fixed.
Joel Spolsky (of Fog Creek Software and Stack Overflow) describes system re-writes in "Things You Should Never Do, Part I" as “the single worst strategic mistake that any software company can make.”
Continuing our home analogy, recent price increases for construction materials like lumber, drywall, and wiring (and frankly everything else) should, according to Economics 101, cause us to treat our current homes more dearly. Similarly, price increases for quality software engineers should force companies to treat existing software more dearly.
Lots of current software started out as C software from the 1980s. Engineers don't often write software with portability as a goal at the beginning, but once something is relatively portable, it tends to stay that way. Code that was well designed and written often migrated from mini-computers to i386, from i386 to amd64, and now ARM and arch64, with a minimum of redesign or effort. You can take large, complicated programs from the 1980s written in C, and compile/run them on a modern Linux computer - even when the modern computer is running architectures which hadn't even been dreamt of when the software was originally written.
Why can't software last forever? It's not made of wood, concrete, or steel. It doesn't "wear out", rot, weather, or rust. A working algorithm is a working algorithm. Technology doesn’t need to be beautiful, or impress other people, to be effective. Aren't technologists ultimately in the business of producing cost effective technology?
I am going to attempt to convince you that maintaining your existing systems is one the most cost-effective technology investments you can make.
The World's Oldest Software Systems
In 1958, the United States Department of Defense launched a new computer-based contract management system called "Mechanization of Contract Administration Services", or MOCAS (pronounced “MOH-cass”). In 2015, MIT Technology Review stated that MOCAS was the oldest computer program in continuous use they could verify. At that time MOCAS managed about $1.3 trillion in government obligations and 340,000 contracts.
According to the Guinness Book of World Records, the oldest software system in use today is either the SABRE Airline Reservation System (introduced in 1960), or the IRS Individual Master File (IMF) and Business Master File (BMF) systems introduced in 1962–63.
SABRE went online in 1960. It had cost $40 million to develop and install (about $400 million in 2022 dollars). The system took over all American Airlines booking functions in 1964, and the system was expanded to provide access to external travel agents in 1976.
What is the secret to the long lifespan of these systems? Shouldn't companies with long-lived products (annuities, life insurance, etc.) study these examples? After all, they need systems to support products that last most of a human lifespan. However, shouldn't all companies want to their investments in software to last as long as possible?
Maintenance is About Making Something Last
We spoke of SABRE above, and we know that airlines recognize the value of maintenance. Commercial aircraft are inspected at least once every two days. Engines, hydraulics, environmental, and electrical systems all have additional maintenance schedules. A "heavy" maintenance inspection occurs once every few years. This process maintains the aircraft's service life over decades.
On average, an aircraft is operable for about 30 years before it must be retired. A Boeing 747 can endure 35,000 pressurization cycles — roughly 135,000 to 165,000 flight hours — before metal fatigue sets in. However, most older airframes are retired for fuel-efficiency reasons, not because they're worn out.
Even stuctures made of grass can last indefinitely. Inca rope bridges were simple suspension bridges constructed by the Inca Empire. The bridges were an integral part of the Inca road system were constructed using ichu grass.
Even though they were made of grass, these bridges were maintained with such regularity and attention they lasted centuries. The bridge's strength and reliability came from the fact that each cable was replaced every June.
The goal of maintenance is catching problems before they happen. That’s the difference between maintenance and repair. Repair is about fixing something that’s already broken. Maintenance is about making something last.
Unfortunately, Maintenance is Chronically Undervalued
Maintenance is one of the easiest things to cut when budgets get tight. Some legacy software systems have decades of underinvestment in maintenance. This leads up to the inevitable "we have to replace it" discussion - which somehow always sounds more persuasive (even though it’s more expensive and riskier) than arguing to invest in system rehabilitation and deferred system maintenance.
Executives generally can't refuse "repair" work because the system is broken and must be fixed. However, maintenance is a tougher sell. It’s not strictly necessary — or at least it doesn’t seem to be until things start falling apart. It is so easy to divert maintenance budget into a halo project that gets an executive noticed (and possibly promoted) before the long-term effects of underinvestment in maintenance become visible. Even worse, the executive is also admired for reducing the costs of maintenance and switching costs from "run" to "grow" - while they are torpedoing the company under the waterline.
The other challenge is conflating enhancement work with maintenance work. Imagine you have $1,000 and you want to add a sunroof to your car, but you also need new tires (which coincidentally also cost $1,000). You have to replace the tires every so often, but a sunroof is "forever" right? If you spend the money on the sunroof the tires could get replaced next month, or maybe the month after - they'll last a couple more months, won't they?
With software, users can't see "the bald tires" - they only thing they see, or experience (and value), are new features and capabilities. Pressure is always present to cut costs and to add new features. The result is budget always swings away from maintenance work towards enhancements.
Finally, maintenance work is typically an operational cost, yet building a new system, or a significant new feature, can often be capitalized - making the future costs someone else's problem.
Risks of Replacing Software Systems
It's usually not the design or the age of a system that causes it to fail but rather neglect. People fail to maintain software systems because they are not given the time, incentives, or resources to maintain them.
"Most of the systems I work on rescuing are not badly built. They are badly maintained."— Marianne Bellotti, Kill it With Fire
Once a system degrades it is an enormous challenge to fund deferred maintenance (or "technical debt"). No one plans for it, no one wants to pay for it, and no engineer wants to do it. Initiatives to restore operational excellence, much the way one would fix up an old house, tend to have few volunteers among engineering teams. No one gets noticed doing maintenance. No one ever gets promoted because of maintenance.
It should be clear why engineers prefer to re-write a system rather than maintain it. They get to "write a new story" rather than edit someone else's. They will attempt to convince a senior executive to fund a project to replace a problematic system by describing all the new features and capabilities that could be added as well as how "bad" the existing, unmaintained, system has become. Further, they will get to use modern technology that makes them much more valuable in the market.
Incentives aside, engineering teams tend to gravitate toward system rewrites because they incorrectly think of old systems as specs. They assume that since an old system works, the functional risks have been eliminated. They can focus on adding more features to the new system or make changes to the underlying architecture without worry. Either they do not perceive the ambiguity these changes introduce, or they see such ambiguity positively, imagining only gains in performance and the potential for innovation.
Why not authorize that multimillion-dollar replacement if the engineers convince management the existing system is doomed? Eventually a "replacement" project will be funded (typically at a much higher expenditure than rehabilitating the existing system). Even if the executives are not listening to the engineers, they will be listening to external consultants telling them they are falling behind.
What do you do with the old system while you’re building the new one? Most organizations put the old system on “life support” and give it only the resources for patches and fixes necessary to keep it running. This reduces maintenance even further and becomes a self-fulfilling prophecy that the existing system will eventually fail.
Who gets to work on the new system, and who takes on the maintenance tasks of the old system? If the old system is written in older technology that the company is actively abandoning, the team maintaining the old system is essentially sitting around waiting to be fired. And don’t kid yourself, they know it. If the people maintaining the old system are not participating in the creation of the new system, you should expect that they are also looking for new jobs. If they leave before your new system is operational, you lose both their expertise and their institutional knowledge.
If the new project falls behind schedule (and it almost certainly will), the existing system continues to degrade, and knowledge continues to walk out the door. If the new project fails and is subsequently canceled, the gap between the legacy system and operational excellence has widened significantly in the meantime.
This explains why executives are loathe to cancel system replacement projects even when they are obviously years behind schedule and failing to live up to expectations. Stopping the replacement project seems impossible because the legacy system is now so degraded that restoring it to operational excellence seems impossible. Plus, politically canceling a marquee project can be career suicide for the sponsoring executive(s). Much better to do "deep dives" and "assessments" on why the project is failing and soldier on than cancel it.
The interim state is not pretty. The company now has two systems to operate, much higher costs and new risks.
- The new system will have high costs, limited functionality, new and unique errors/issues, and lower volumes (so the "per unit cost" of the new system will be quite high).
- The older system will still be running most of the business, and usually all of the complex business, while having lost its best engineers and subject matter experts. Its maintenance budget will have been whittled down to nothing to redirect spending to implement (save?) the new system. This system will be in grave danger to significant system failure (which proponents of the new system will use to justify the investment in the new system, not admitting to a self-fulfilling prophecy).
Neither system will exhibit operational excellence, and both put the organization at significant risk in addition to the higher costs and complexity of running two systems.
Maintaining Software to Last Forever
As I discussed in How Software Learns, software adapts over time - as it is continually refined and reshaped by maintenance and enhancements. Maintenance is crucial to software's lifespan and business relevance/value. When software systems are first developed, they are based on a prediction of the future - a prediction of the future that we know is wrong even as we make it. No set of requirements have ever been perfect. However, all new systems become "less wrong" as time, experience, and knowledge are continually added (e.g., maintenance).
Futureproofing means constantly rethinking and iterating on the existing system. We know from both research and experience that iterating and maintaining existing solutions is a much more likely, and less expensive, way to improve software's lifespan and functionality.
Before choosing to replace a system that needs deferred maintenance remember it’s the lack of maintenance that create the impression that failure is inevitable, and pushes otherwise rational engineers and executives toward rewrites or replacements. What mechanisms will prevent lack of maintenance from eventually dooming the brand-new system? Has the true root problem been addressed?
Robust maintenance practices could preserve software for decades, but first maintenance must be valued, funded, and applied. To maintain software properly we have to consider:
- How do you measure the overall health of a system?
- How do you define and manage maintenance work?
- How do you define a reasonable maintenance budget? How can you protect that budget?
- How do you motivate engineers to perform maintenance?
1. How do you measure the overall health of a system?
Maintenance Backlog — If you added up all the open work requests, including work the software engineers deem necessary to eliminate technical debt, what is the total amount of effort? Now, divide that by the team capacity. For example, imagine you have a total amount of work of 560 days, and you have one person assigned to support the system - they work approximately 200 days annually. The backlog in days in 560, but in time it is 2.8 years (560 days / 200 days/year = 2.8 years). What is a reasonable amount of backlog time?
System Reliability/Downtime — If you added up all the time the system is down in a given period, what is the total amount? What is the user or customer impact of that downtime? Conversely, what would reducing that downtime be worth? What is the relationship of maintenance and downtime? In other words, does the system need to be taken down to maintain it (planned maintenance)? Does planned maintenance reduce unplanned downtime?
Capacity/Performance Constraints — Is the existing hitting capacity constraints that will prevent future growth of the business? How unpredictable are the system capacity demands? What is the customer experience when the system capacity is breached? What is relationship between hardware and software that constrains the system? Is the software performant? Can hardware solve the problem?
User Satisfaction: User satisfaction includes both how happy your employees are with the applications and/or how well those applications meet your customer's needs. Many times I have found the technology team and the business users arguing over "bug" vs. "enhancement". It is a way of assigning blame. "Bug" means its engineering's fault, "enhancement" means it was a missed requirement. When emotions run hot it means that the maintenance budget is insufficient. I always tell everyone they are both just maintenance and the only important decision is which to prioritize and fix first.
“Shadow IT” — If you used applications in the past that didn’t meet employees’ needs, and didn’t have a good governance plan to address problems, you may have noticed employees found other solutions on their own. This is an indication of underfunded maintenance.
Adaptable Architecture — "The cloud", API-based integration, and unlocking your data are no longer “nice to haves.” Your architecture needs to adapt. If these are challenges, then the architecture must be addressed.
Governance — Healthy application architecture isn’t just about technology—it’s also about having well-documented and well-understood governance documents that guide technology investments for your organization. Good governance helps create adaptable architecture and avoid “shadow IT” applications.
2. How do you define maintenance work?
There are four general types of software maintenance. The first two types take up the majority of most organizations' maintenance budget, and may not even be considered maintenance - however, all four types must be funded adequately for software to remain healthy. If you can't fully address types three and four your maintenance budget is inadequate.
1. Corrective Software Maintenance (more accurately called "repair")
Corrective software maintenance is necessary when something goes wrong in a piece of software including faults and errors. These can have a widespread impact on the functionality of the software in general and therefore must be addressed as quickly as possible. However, it is important to consider repair work separate from the other types of maintenance because repair work must get done. Note: this is generally the only type of work that happens when a system is put on "life support".
2. Perfective Software Maintenance (more accurately called "enhancements")
Once software is released and is being used new issues and ideas come to the surface. Users will think up new features or requirements that they would like to see. Perfective software maintenance aims to adjust software by adding new features as necessary (and removing features that are irrelevant or not effective). This process keeps software relevant as the market, and user needs, evolve. It there is funding beyond "life support" it usually is spent here.
3. Preventative Software Maintenance (true maintenance is catching problems before they happen.)
Preventative software maintenance is looking into the future so that your software can keep working as desired for as long as possible. This includes making necessary changes, upgrades, and adaptations. Preventative software maintenance may address small issues which at the given time may lack significance but may turn into larger problems in the future. These are called latent faults which need to be detected and corrected to make sure that they won’t turn into effective faults. This type of maintenance is generally underfunded.
4. Adaptive Software Maintenance (true maintenance adapts to changes)
Adaptive software maintenance is responding to the changing technology landscape, as well as new company policies and rules regarding your software. These include operating system changes, using cloud technology, security policies, hardware changes, etc. When these changes are performed, your software (and possibly architecture) must adapt to properly meet new requirements and meet current security and other policies.
3. How do you define a reasonable maintenance budget? How can you protect that budget?
In the case of the Inca rope bridges what was the cost of maintenance annually? Let's assume some of the build work was site preparation and building the stone anchors on each side, but most of the work was constructing the bridge itself. Since the bridge was entirely replaced each year, the maintenance costs could be as much as 80% of the initial build effort, every year.
Comparing to "software as a service" (SaaS) vendors is difficult because they have shifted to a subscription model that bundles infrastructure, enhancements, and ongoing maintenance. Prior to SaaS subscription-based pricing one would typically buy a perpetual license plus maintenance at ~20-30% annual cost of the license to obtain support and updates.
Side note: Now that the SaaS annual costs are commingled, some enterprises fall into the trap that “building it is cheaper because we pay up front but then it will cost less in the long run” assuming the "long run" almost always underprices infrastructure and assumes near zero maintenance cost. In the case of a brand-new, internally designed and developed software system - one that is well architected, well designed, well built, and meets all reliability, scalability, and performance needs (i.e., fantasy software) it's conceivable that there is no maintenance necessary for some period of time - but very unlikely.
So, maintenance costs can have a very wide range. A general rule of thumb is 20-30% of the initial build cost will be required for ongoing maintenance work annually. However, maintenance costs usually start off lower and increase over time. They are also unpredictable costs that are hard to budget.
The challenges should be obvious. First, budgets in large organizations tend be last year's budget plus 2-3%. If you start with a maintenance budget of zero on a new system, how do you ever get to the point of a healthy maintenance budget in the future? Second, maintenance costs are unpredictable, and organizations hate unpredictable costs. It's impossible to say when the next new hardware, or storage, or programming construct will occur, or when the existing system will hit a performance or scalability inflection point.
This is like buying a brand-new car. The maintenance costs are negligible in the first couple years, until they start to creep up. Then things start to need maintenance, replacement, or repair. As the car ages the maintenance costs continue to increase until at some point it makes economic sense to buy another new car. Except none of us wait that long. Most of us buy new cars before our old one is completely worn out. As a counter-example, in Cuba some cars have been maintained meticulously for 30-40 years and run better than new.
Protecting your maintenance budget - creating a "maintenance fund"
We know that maintenance cost increase over time, and the costs of proper maintenance are unpredictable. In addition, there is some amount of management discretion that can be applied. When your house needs a new roof it's reasonable to defer it through summer, but it probably needs to be done before winter.
Since business require predictability of costs, unpredictable maintenance costs are easy to defer. "We didn't budget for that; we'll have to put it in next year's budget." Except of course in the budget process it will compete with other projects and enhancement work, where it's again likely to be deprioritized.
What's the solution? Could it be possible to create some type of maintenance fund where a predictable amount is budgeted each year, and then spent "unpredictably" when/as needed? Could this also be a solution to preventing executives from diverting maintenance budget into pet projects by protecting this maintenance fund in some fashion?
4. How do you motivate software engineers to perform maintenance?
There is a Chinese proverb about a discussion between a king and a famous doctor. The well-known doctor explains to the king that his brother (who is also a doctor) is superior at medicine, but he is unknown because he always successfully treats small illnesses, preventing them from evolving into more serious or terminal ones. So, people say "Oh he is a fine doctor, but he only treats minor illnesses". It's true: Nobody Ever Gets Credit for Fixing Problems that Never Happened.
To most software engineers, legacy systems seem like torturous dead-end work, but the reality is systems that are not important get turned off. Working on "estate" systems means working on some of the most critical systems that exist — computers that govern millions of people’s lives in enumerable ways. This is not the work of technical janitors, but battlefield surgeons.
Engineering loves new technology. It gains the engineers attention and industry marketability. Boring technology on the other hand is great for the company. The engineering cost is lower, and the skills are easier to obtain and keep, because these engineers are not being pulled out of your organization for double their salary by Amazon or Google.
Well-designed, high-functioning software that is easy to understand usually blends in. Simple solutions do not do much to enhance one’s personal brand. Therefore, when an organization provides limited pathways to promotion for software engineers, they tend to make technical decisions that emphasize their individual contribution and technical prowess. You have to be very careful to reward what you want from your engineering team.
What earns them the acknowledgment of their peers? What gets people seen is what they will ultimately prioritize, even if those behaviors are in open conflict with the official direction they receive from management. In most organizations shipping new code gets attention, while technical debt accrues silently in the background.
The specific form of acknowledgment also matters a lot. Positive reinforcement in the form of social recognition tends to be a more effective motivator than the traditional incentive structure of promotions, raises, and bonuses. Behavioral economist Dan Ariely attributes this to the difference between social markets and traditional monetary-based markets. Social markets are governed by social norms (read: peer pressure and social capital), and they often inspire people to work harder and longer than much more expensive incentives that represent the traditional work-for-pay exchange. In other words, people will work really hard for positive reinforcement from their peers.
Legacy System Modernization
Unmaintained software will certainly die at some point. Due to factors discussed above, software does not always receive the proper amount of maintenance to remain healthy. Eventually a larger modernization effort may become necessary to restore a system to operational and functional excellence.
Legacy modernization projects start off feeling easy. The organization once had a reliable working system and kept it running for years. All the modernizing team should need to do is simply reshape it using better technology, better architecture, the benefit of hindsight, and improved tooling. It should be simple. But, because people do not see the hidden technical challenges they are about to uncover, they also assume the work will be boring. There’s little glory to be had re-implementing a solved problem.
Modernization projects are also typically the ones organizations just want to get out of the way, so they launch into them unprepared for the time and resource commitments they require. Modernization projects take months, if not years of work. Keeping a team of engineers focused, inspired, and motivated from beginning to end is difficult. Keeping their senior leadership prepared to invest in what is, in effect, something they already have is a huge challenge. Creating momentum and sustaining it are where most modernization projects fail.
The hard part about legacy modernization is the "system around the system". The organization, its communication structures, its politics, and its incentives are all intertwined with the technical product in such a way that to improve the product, you must do it by turning the gears of this other, complex, undocumented system. Pay attention to politics and culture. Technology is at most only 50% of the legacy problem, ways of working, organization structure and leadership/sponsorship are just as important to success.
To do this, you need to overcome people’s natural skepticism and get them to buy in. The important word in the phrase "proof of concept" is proof. You need to prove to people that success is possible and worth doing. It can't be just an MVP, because MVPs are dangerous.. A red flag is raised when companies talk about the phases of their modernization plans in terms of which technologies they are going to use rather than what value they will add.
For all that people talk about COBOL dying off, it is good at certain tasks. The problem with most old COBOL systems is that they were designed at a time when COBOL was the only option. Start by sorting which parts of the system are in COBOL because COBOL is good at performing that task, and which parts are in COBOL because there were no other technologies available. Once you have that mapping, start by pulling the latter off into separate services that are written and designed using the technology we would choose for that task today.
Going through the exercise of understanding what functionality is fit for use for specific languages/technologies not only gives engineers a way to keep building their skillsets but also is an opportunity to pair with other engineers who have different/complimentary skills. This exchange also has the benefit of diffusing the understanding of the system to a broader group of people without needing to solely rely on documentation (which never exists).
Counterintuitively, SLAs/SLOs are valuable because they provide a "failure budget". When organizations stop aiming for perfection and accept that all systems will occasionally fail, they stop letting their technology rot for fear of change. In most cases, mean time to recovery (MTTR) is a more useful statistic to push than reliability. MTTR tracks how long it takes the organization to recover from failure. Resilience in engineering is all about recovering stronger from failure. That means better monitoring, better documentation, and better processes for restoring services, but you can’t improve any of that if you don’t occasionally fail.
Although a system that constantly breaks, or that breaks in unexpected ways without warning, will lose its users’ trust, the reverse isn’t necessarily true. A system that never breaks doesn’t necessarily inspire high degrees of trust - and its maintenance budget is even easier to cut.
People take systems that are too reliable for granted. Italian researchers Cristiano Castelfranchi and Rino Falcone have been advancing a general model of trust that postulates trust naturally degrades over time, regardless of whether any action has been taken to violate that trust. Under Castelfranchi and Falcone’s model, maintaining trust doesn’t mean establishing a perfect record; it means continuing to rack up observations of resilience. If a piece of technology is so reliable it has been completely forgotten, it is not creating those regular observations. Through no fault of the technology, the user’s trust in it slowly deteriorates.
When both observability and testing are lacking on your legacy system, observability comes first. Tests tell you only what shouldn’t fail; monitoring tells you what is failing. Don’t forget: a perfect record will always be broken, but resilience is an accomplishment that lasts. Modern engineering teams use stats like service level objectives, error budgets, and mean time to recovery to move the emphasis away from avoiding failure and toward recovering quickly.
Maintenance mostly happens out of sight, mysteriously. If we notice it, it’s a nuisance. When road crews block off sections of highway to fix cracks or potholes, we treat it as an obstruction, not a vital and necessary process. This is especially true in the public sector: it’s almost impossible to get governmental action on, or voter interest in, spending on preventive maintenance, yet governments make seemly unlimited funds available once we have a disaster. We are okay spending a massive amount of money to fix a problem, but consistently resist spending a much smaller amount of money to prevent it; as a business strategy this makes no sense.
The Open Mainframe Project estimates that there about 250 billion lines of COBOL code running today in the world economy, and nearly all COBOL code contains critical business logic. Companies should maintain that software and make it last as long as possible.
- Things You Should Never Do, Part I
- Patterns of Legacy Displacement
- Kill It with Fire: Manage Aging Computer Systems (and Future Proof Modern Ones)
- Building software to last forever
- The Disappearing Art Of Maintenance
- Inca rope bridge
- How Often Do Commercial Airplanes Need Maintenance?
- Nobody Ever Gets Credit for Fixing Problems that Never Happened
- Boring Technology Club
- Open Mainframe Project 2021 Annual Report
- How Popular is COBOL?
Image Credit: Bill Gates, CEO of Microsoft, holds Windows 1.0 floppy discs.
(Photo by Deborah Feingold/Corbis via Getty Images) This was the release of Windows 1.0. The beginning. Computers evolve. The underlying hardware, CPU, memory, and storage evolves. The operating system evolves. Of course, the software we use must evolve as well.