Sunday, March 31, 2024

The IT Death March Project: A Case Study

Anyone who has worked in the Information Technology field in the last twenty years is intimately familiar with the concept of a death march project. A death march project is the inevitable result of combining every bad trend in management (lean staffing, outsourcing, matrix organizations) with every worst practice in the design and development of new, highly complex systems. This combination of inputs inevitably produces

  • improperly collected requirements
  • poorly synchronized goals between users and managers
  • poor technology selections likely imposed by out-of-touch middle managers
  • vendors and consulting firms selected by senior leaders based upon personal relationships rather than capabilities and value
  • due dates and budgets set before a single page of design is completed to properly gauge the scope of work, development dependencies, testing requirements and infrastructure needed for production launch

All death march projects bear a few key similarities:

  • they affect millions of expected users
  • they are typically EXPECTED to cost tens of millions of dollars (maybe $10-20 million, maybe $50-70 million)
  • they usually WIND UP costing three times to ten times the original estimates when rework, delays, and opportunity costs are properly counted
  • they are PROMISED to take only twelve to eighteen months to deliver
  • they usually wind up taking eighteen months just to deliver the first working proof of concept which triggers revolts from "the business" and users when they see the gap in understanding between actual needs and the "requirements" gathered
  • they often take over twenty-four months to deliver a release into production, which typically delivers between 50 and 70 percent of the originally promised functionality and often fails upon launch due to poor performance
  • they usually involve armies of consultants hiring armies of contractors to "assist" with requirements gathering, project management and coordination between technical teams and "the business" stakeholders, testing, and often the core design and development - thus explaining the bloated costs, delivery intervals, mis-communications and poor quality of the final delivery

One rule governing all death march projects is that the death march nature of the project becomes apparent to all participants IN the death march project -- with the exception of the senior leaders launching the project -- within weeks of the project beginning. That is typically the point when the handful of experts who ACTUALLY understand the problem and required solution first encounter the waves of invading consultants and contractors. Only then does the CHASM between leadership expectations and reality become apparent.

Given the fixation in American life on fostering a data driven world and webifying every possible social and economic interaction, one pattern would seem to merit serious research in engineering and business schools across the country: the consistent inability of business and technical professionals to build systems that meet proper specifications and sane minimum standards for usability and security within estimated budgets and time intervals.

Given this ongoing failure, it's apropos that one of the latest case studies in the phenomenon involves higher education. In a nutshell, legislation was passed to fund development of a system to simplify and modernize the financial aid application process. The project was managed by an agency within the Department of Education, began in 2022 and was targeted for launch in 2023. It barely made its committed launch date, has subsequently been found to have fallen far short of expectations for functionality and reliability, and maps to a cost structure already nearly three times the original development cost. The effort to develop and launch the new solution not only reflects all of the hallmarks of a classic IT death march project but appears to have prevented MILLIONS of potential college applicants from applying for aid, thus risking dramatic drops in enrollment at schools across the country.


Case Study: Modernizing Financial Aid

In response to complaints from millions of parents, Congress enacted legislation at the end of 2020 to modernize the financial aid application process and convert all of the old-fashioned paperwork to an online portal-based application. Given the date of the legislation (December 2020), it is likely funding for the new system was not available until the 2022 fiscal year beginning in October of 2021. The Education Department set a target for the new system to be online for use for the first round of college freshmen from the graduating high school class of 2024. That might SEEM like development teams had until April or May of 2024 to complete the system, right? Wrong. Colleges begin making admission decisions for Fall 2024 classes in roughly November to December 2023, which means financial information must be submitted by September or October 2023. That means the development teams had AT MOST only twenty-four months to design, code, test and deploy the new system.

To really appreciate the magnitude of the failure of the effort, the scope of the project should be reviewed from multiple perspectives, exactly like it SHOULD have been before finalizing functional requirements, design, delivery intervals and costs.

Key Project Attributes

Anyone familiar with the design, coding and operation of large scale systems would have noted the following attributes of an effort aimed at "modernizing" and "webifying" the application process for financial aid.

  • Applications to the legacy FAFSA process peaked at 22.5 million in 2012 but are still around 17.5 million in 2023
  • The application process is HIGHLY seasonal, with virtually all user actions concentrated between October and December of each year
  • By definition, nearly one quarter of the applicants are new to the process each year, representing freshman applications, and may be prone to procrastinating towards the end of the entry period.
  • As a financial system, the system will collect and house HIGHLY sensitive identity and financial information for students and parents, making network security and encryption of data in motion and at rest paramount.
  • As a system involving government and banking processes that carry criminal penalties for fraudulent submissions, RETENTION of data for extended periods (likely, the life of any loans granted based upon applications) is required with all of the same data protection expectations as the live system.
  • The project aimed not only to AUTOMATE an existing process, it was also established to CHANGE and STREAMLINE the existing application process. That means every school needing information FROM this system requires changes to THEIR systems. This means any new data structures resulting from the streamlined design must be mapped to application interfaces which must be shared AND IMPLEMENTED by schools in parallel with the core system development. If the federal system is completed on time but no downstream systems are ready to use data in the new format, the launch accomplishes nothing for the government, schools, parents, students or banks.
  • As a completely new system serving tens of millions of users, the system design should include support for a web portal and smartphone based "user experience." That "user experience" tier of the application should be rigidly isolated from a core tier used to provide integrations between school systems, government oversight systems and the end-user applications.

Key Technology Attributes

With only the high level characteristics listed above and without reading a word of written "user requirements", several technical decisions related to the solution design should have become immediately obvious:

  • The concentrated seasonal nature of user load on this system REQUIRES an implementation that uses dynamically scalable coding patterns at the presentation layer and core services layer. That doesn't mean it has to run in a commercial cloud like AWS, Google or Azure but it certainly requires cloud technologies like containers, auto-scaling, data mirroring and regional failover.
  • The browser portal layer of the solution should be built using a common JavaScript framework like React, Angular or Next.js, one that has existed for many years, has matured and stabilized and will likely remain in wide use for years after launch of the new system.
  • The core services layer of the solution should be written as standard web services in Java with JSON-encoded requests and responses. It's not the coolest language in 2024 but knowledge of the language is so ubiquitous that it will ALWAYS be possible to find competent developers to maintain the system indefinitely without paying inflated salaries. Every viable library for building browser and smartphone user interfaces uses JSON for passing data in requests and responses.
  • The core data design should separate "current" data (for the current application year) from "historical" data (from prior years) to keep the size of the primary databases as small as possible to keep them as efficient as possible and to reduce infrastructure logistics for handling failover. If one year of data adds up to 1 terabyte and the system has to retain data for ten years, a design that keeps all ten terabytes in the "current" database and requires all ten terabytes to be restored before recovering a system after a failure would be horribly flawed.
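As a back-of-the-envelope illustration of why the current / historical split matters, here is a small Python sketch. The restore rate of 0.5 terabytes per hour is purely an assumption for the example, as are the function names; real numbers depend entirely on the infrastructure chosen.

```python
def target_store(application_year, current_year):
    """Route a record to the small live store or the archive by application year."""
    return "current" if application_year == current_year else "historical"


def restore_hours(size_tb, restore_rate_tb_per_hour):
    """Hours needed to restore a database of `size_tb` terabytes at the given rate."""
    return size_tb / restore_rate_tb_per_hour


RESTORE_RATE = 0.5  # TB/hour, assumed for illustration only

# One database holding all ten years versus a "current" database holding one year.
combined = restore_hours(10, RESTORE_RATE)
split = restore_hours(1, RESTORE_RATE)

print(f"restore before service resumes, single database: {combined:.0f} hours")
print(f"restore before service resumes, current-year only: {split:.0f} hours")
```

Whatever the real restore rate turns out to be, the ratio is the point: the outage lasts ten times longer when a decade of history has to come back before the live system can.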

Mythical Man-Month Considerations

From one perspective, attempting to analyze financial budgets and timelines for a project in the absence of finalized user requirements and an architectural design is pointless. At the same time, actual projects for large systems DO set budgets and timelines before completing these tasks ALL THE TIME. With some familiarity with typical work units associated with measurable components in the full system, it is possible to use initial budgets and timelines to spot cases where the budget and timeline are not sufficient to implement all promised functionality with appropriate designs and technology.

Here is some typical math that might apply in sizing the dollars and days required for a large project.

  • typical loaded salary for a developer EMPLOYEE might be $170,000/year or $81.73/hour
  • typical loaded salary for a tester EMPLOYEE might be $130,000/year or $62.50/hour
  • typical loaded rate for an ON-SHORE developer CONTRACTOR might be $120.00 /hour
  • typical loaded rate for an ON-SHORE tester CONTRACTOR might be $90.00 /hour
  • typical loaded rate for an OFF-SHORE developer CONTRACTOR might be $90.00 /hour
  • typical loaded rate for an OFF-SHORE tester CONTRACTOR might be $50.00 /hour
  • a single web service might take 40 days (320 hours) to design and code and 20 days (160 hours) to integration test with other pieces
  • the application might need 100 different web services
  • a single page in the browser portal might take 90 days or 720 hours to design and code
  • a single page in browser portal might take 30 days or 240 hours for functional testing
  • the application might need 50 different pages

Even without details on exact code requirements, these numbers alone allow a crude sanity check on total dollar figures, total headcount and elapsed time being estimated for a project.

With just these numbers using EMPLOYEE wage rates, building the core app and browser portal app alone might involve these total hour and cost figures:

  • web service hours = 100 x (320 + 160) = 48,000 hours
  • web service cost = (100 x 320 x 81.73) + (100 x 160 x 62.50) = $3,615,360
  • portal hours = 50 x (720 + 240) = 48,000
  • portal cost = (50 x 720 x 81.73) + (50 x 240 x 62.50) = $3,692,280

Assume that those 96,000 total work hours are spread across 18 months due to sequential dependencies, etc. At roughly 184 working hours per person per month, that implies a team of 9.66 web service developers, 4.83 web service testers, 10.9 portal devs and 3.6 portal testers. About 30 full time workers. So even if an organization doesn't HAVE 30 employees to do this work, if they understand prevailing labor rates for such work, they have a skeleton cost structure that can be used to sanity check bids from vendors.
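For anyone who would rather not trust the arithmetic by eye, the sizing math can be reduced to a few lines of Python. This is only a sketch using the sample figures from the list above. The function names are made up for illustration, and the 184 working hours per month is an assumption chosen because it reproduces the headcounts just quoted (a 2,080-hour year would put the team closer to 31).

```python
# Sample figures from the list above; everything else here is illustrative.
DEV_RATE = 81.73    # $/hour, loaded employee developer ($170,000 / 2,080 hours)
TEST_RATE = 62.50   # $/hour, loaded employee tester ($130,000 / 2,080 hours)


def estimate(units, dev_hours_each, test_hours_each):
    """Return (dev_hours, test_hours, cost) for a batch of similar work units."""
    dev_hours = units * dev_hours_each
    test_hours = units * test_hours_each
    cost = round(dev_hours * DEV_RATE + test_hours * TEST_RATE, 2)
    return dev_hours, test_hours, cost


def headcount(hours, months, hours_per_month=184):
    """Full-time workers needed to burn `hours` in `months` (184 h/month is assumed)."""
    return hours / (months * hours_per_month)


services = estimate(100, 320, 160)   # 100 web services
pages = estimate(50, 720, 240)       # 50 portal pages

total_hours = sum(services[:2]) + sum(pages[:2])
total_cost = services[2] + pages[2]
# Sum the headcount for each of the four roles over an 18 month schedule.
team = sum(headcount(h, 18) for h in services[:2] + pages[:2])

print(f"total hours: {total_hours:,}")
print(f"total cost:  ${total_cost:,.2f}")
print(f"team size over 18 months: {team:.1f}")
```

Run as written, this reports 96,000 hours, a cost of roughly $7.3 million and a team of about 29. Changing one rate or one unit count shows immediately how sensitive the totals are to each guess.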

Here's where mythical man-month considerations come into play.

Vendors can claim they can do the same work in fewer hours with smarter talent.

Vendors can claim they can do the same work in the same hours at a lower labor rate.

Vendors can claim they can do the same work in less ELAPSED time by doing the same work in the same hours using MORE resources at lower labor rates in parallel to compress the schedule.

Vendors can claim they can do the same work in less ELAPSED time with MORE resources for MORE COST but at least they can meet your due date when you think you cannot.

The validity of each claim can be checked if the project owner has some estimate of work units that was used for internal estimates and to solicit external bids. It's not rocket science but it involves lots of tedious algebra that can be reduced into Excel for consistency. But few IT shops do this. Bidding out development work is like buying a new car with an old-car trade-in. Having too many offsetting numbers distracts most buyers from the overall picture. Instead, they get fixated on meeting the promised delivery date. And when a vendor says they have a team ready to go that can meet a date your management already set in stone, well, at least a miss on the date can be blamed on an external party.

If this project was bid out to vendors who came back with a total cost of $14,000,000, the first obvious reaction might be: My internal estimate with employees was $7.3 million, why are you $6.7 million higher?

The vendor might say: Well, you wanted it in twelve months so that's the premium for me to accelerate the work enough to meet your date.

The next obvious question might be: Are you pulling in the date by finding more productive developers who can do the work in fewer hours?

If yes, the next obvious question might be: Exactly where are you finding this team of developers with expertise in this specific business domain?

If no, the next obvious question might be: If you're adding more workers, how are you compressing dependencies to allow more parallel work?

The diligent internal manager might ask the consulting firm how many build cycles they expect it to take to create the version that passes user acceptance testing, load testing and deploys to production. If the answer is ONE, the consultant is LYING. No sufficiently complex system will be developed in a single iteration through each module of code and deployed to production. If the answer is N (where N is greater than 1), the next obvious question would be to verify how many iterations of functional testing are being quoted in the bid. If testing labor is only sufficient for ONE iteration, the consultant is again LYING or doesn't understand the structure of their own bid. These are equally common problems.
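The questions above boil down to a handful of ratios. As a rough sketch (the function names are hypothetical, the 184 working hours per month is an assumed figure, and the 28,000 testing hours in the usage lines is an invented vendor quote, not a real bid), the check might look like this in Python:

```python
def implied_rate(bid_total, hours):
    """Blended $/hour the bid implies if the work takes the same hours."""
    return bid_total / hours


def implied_headcount(hours, months, hours_per_month=184):
    """Workers needed to finish `hours` of work in `months` (assumed 184 h/month)."""
    return hours / (months * hours_per_month)


def iterations_covered(quoted_test_hours, test_hours_per_pass):
    """How many full functional-test iterations the quoted testing labor pays for."""
    return quoted_test_hours / test_hours_per_pass


# Internal baseline from the sizing math: 96,000 hours, about $7.3 million.
baseline_hours = 96_000
baseline_cost = 7_307_640
bid = 14_000_000

premium = bid - baseline_cost
rate = implied_rate(bid, baseline_hours)
staff = implied_headcount(baseline_hours, months=12)

# One full test pass = 100 services x 160 hours + 50 pages x 240 hours.
one_pass = 100 * 160 + 50 * 240
passes = iterations_covered(quoted_test_hours=28_000, test_hours_per_pass=one_pass)

print(f"premium over internal estimate: ${premium:,}")
print(f"implied blended rate: ${rate:.2f}/hour")
print(f"staff needed for a 12 month schedule: {staff:.1f}")
print(f"functional test iterations covered: {passes:.1f}")
```

A bid whose testing labor covers exactly one iteration, as in this made-up example, is the warning sign described above.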

Software development hours and spending are not linearly related to output. No application is designed sequentially, from the bottom up (data model, web services for objects, web services for transactions, user interface), even for "green field" applications like this FAFSA system. Work often proceeds in parallel but the ability to do that is limited. Sometimes, presentation layer developers MUST wait for a web service developer to finish who MUST wait for a database developer to design and build a data schema. Hiring more developers and testers beyond the limit that parallelization allows does nothing to meet project dates. As originally explained in Fred Brooks' The Mythical Man-Month from 1975, adding resources to a late project usually makes it later by adding confusion and starting work against incomplete specs, which requires MORE work for testing and bug fix releases.
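The limit on parallel work can be made concrete with a critical-path calculation. The Python sketch below is illustrative only: the task names and the dependency chain are hypothetical, the 30 days for the data schema is an invented number, and the other durations reuse the sample per-unit figures from the sizing list. It computes the shortest possible elapsed time for one slice of the system assuming unlimited staff.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Working days per task. "data_schema" is an invented figure; the rest reuse
# the sample per-unit estimates from the sizing list.
durations = {
    "data_schema": 30,
    "web_service": 40,
    "integration_test": 20,
    "portal_page": 90,
    "functional_test": 30,
}

# Each task maps to the tasks that must finish before it can start.
depends_on = {
    "data_schema": [],
    "web_service": ["data_schema"],
    "integration_test": ["web_service"],
    "portal_page": ["web_service"],
    "functional_test": ["portal_page"],
}


def minimum_elapsed_days(durations, depends_on):
    """Length of the longest dependency chain, i.e. the critical path."""
    finish = {}
    for task in TopologicalSorter(depends_on).static_order():
        # A task starts when its slowest prerequisite has finished.
        start = max((finish[dep] for dep in depends_on[task]), default=0)
        finish[task] = start + durations[task]
    return max(finish.values())


total_work = sum(durations.values())
floor = minimum_elapsed_days(durations, depends_on)

print(f"one person, working alone: {total_work} days")
print(f"unlimited staff:           {floor} days")
```

With these assumed numbers, an unlimited team finishes the slice in 190 days against 210 for a single worker. Everything beyond the second pair of hands buys nothing, because only integration testing can overlap with anything else.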

So How Did the FAFSA Project Fare?

Ummmmmmm.... Exactly as seasoned IT engineers would have predicted. In June 2023, the Government Accountability Office issued a report on its findings regarding the program's delivery status before it was ever launched and its findings were damning.

https://www.gao.gov/assets/gao-23-106376.pdf
  • the work was contracted out -- the Department of Education has no in-house IT shop capable of building a system this complex
  • the vendor wasn't selected until March 2022
  • the delivery date was set to December 1, 2023, completely ignoring prior history requiring collection of new applications beginning in October of each year
  • the contract was for $122 million
  • development of the "core" was completed in October 2022 -- only about seven months after the vendor was selected in March 2022
  • the application instance was first deployed to cloud hosting infrastructure in March 2023

All that sounds "good" but the fact that the GAO was already auditing the program in June 2023 six months before its due date clearly indicates disaster was looming and word was getting out. The GAO report attempted a "glass half full" narrative with this finding:

The AED project implemented four of five selected planning best practices that are intended to increase the likelihood that a project meets its objectives. For example, the project developed management plans and identified the skills needed to perform the work.

However, critical gaps exist in the fifth practice related to cost and schedule. Specifically, contrary to best practices, the project did not develop a life cycle cost estimate to inform the budget. Instead, officials roughly estimated that the office would need approximately $336 million to develop, deploy, and support AED. However, this estimate was incomplete since it did not include government labor costs.

This problem occurs in nearly EVERY large IT project in the Fortune 500. Many projects start with a delivery date and an initial cost estimate that remain frozen despite months of haggling between requestors and funders, haggling which inevitably inflates promised functionality without altering the original budget planning number or the due date. When the project is approved, the date and promised functionality are set in stone with no available internal resources to complete the work in that timeline.

So the work is outsourced to a consulting firm who promises they magically have 100 developers and 100 testers "on the bench" who can drop in next week and begin work immediately. But they'll need requirements to start. And there won't be enough long term employees to keep tabs on how development proceeds, including selection of key technologies like databases, web service frameworks and GUI frameworks. The system will be built to the CONSULTING company's "standards" or, just as likely, will be developed using whatever tools happen to be popular with the particular team of developers that are sitting on the bench waiting for their next assignment. It could be a new open-source framework that's all the rage but is abandoned by its community before the application is even launched. It could be a decades old framework tied to other infrastructure products that will incur $10 million in yearly licensing fees that weren't budgeted.

And there is a 99% chance those developers brought in from the consultant's "bench" know NOTHING about the "domain" of the application being built. They won't understand the data models, they won't understand business requirements and thus will have no insight to jump-start the project. At best, they will only deliver what is written down in "requirements." Oh, you don't have those? Well the consulting firm will be happy to have its requirements analysts sit down with your business owners and collect those requirements, with the meter and calendar running, even though the consulting firm has no prior expertise in your business domain. Having a vendor write their own requirements then certify their delivery against their own requirements is a virtual guarantee of a failed project.

So insiders at the FSA agency within the Department of Education suspected problems were looming and asked GAO to prepare a proactive post-mortem on the project. What happened afterwards?

The new system DID launch. It missed the desired December 1, 2023 launch date but DID launch prior to the end of December 2023 to meet the contractual obligation... Hmmmmm. But how did the system actually PERFORM?

Horribly. As a story on this topic in The Atlantic summarized: https://www.theatlantic.com/ideas/archive/2024/03/fafsa-fiasco-college-enrollment/677929/

The trouble began last fall. First, the Department of Education announced that the FAFSA, which usually launches October 1, wouldn’t be online until December. It went live on December 30, just days before the deadline set by Congress—then went dark less than an hour later. By the second week of January, the FAFSA was up around the clock, but that didn’t mean the problems were over. Students and parents reported being randomly locked out of the form. Because of some mysterious technical glitch, many students born in the year 2000 couldn’t submit it. And students whose parents don’t have a Social Security number couldn’t fill out the form. The department reported “extraordinary wait times” as its helpline was clogged with calls.

On January 30, the day before the department was set to transmit the completed forms to colleges, it announced that the forms actually wouldn’t go out until mid-March. It used the time to change its aid formulas to account for inflation (its failure to do so had left some $2 billion in awards on the table). “We always knew it was going to be rocky, because the changes were so big and significant,” Amy Laitinen, the director for higher education at the think tank New America, told me. “But I don’t think anybody could have imagined how rocky. I don’t even know if rocky is the right word at this point.” Other experts suggested alternatives: “nightmare,” “unprecedented,” and “a mess all around.”

Monday Morning Quarterbacking

A review of the GAO report and the system's actual performance at launch points out NUMEROUS problems with the system's ultimate design and reflects GROSS incompetence within the FSA agency that contracted for the system and the contractor(s) involved. First, the government agency owning the effort failed to understand the first rule of projects: Every day in a project is just as important as every other day in a project, including the early days of a project. When new projects are started, leaders lapse into a fog in which the due date seems (often) years away and getting requirements and resources to start NOW seems unimportant. Business owners are in no rush to document THEIR requirements. IT analysts are in no hurry to get THEIR work started if the client isn't ready with requirements. WEEKS go by. MONTHS go by. The budget hasn't changed and the due date hasn't changed, yet the available time to complete the work is dwindling relentlessly.

This project was triggered by legislation passed in December 2020. Assuming it wasn't funded until Fiscal Year 2022, it might have been impossible to start until October 2021. Yet the consulting firm wasn't signed until March 2022. Six months of time were lost with presumably little internal ramp-up work within FSA to gather requirements to be ready on March 1, 2022. Given that the application "launched" on December 30, 2023 but might not have all critical bugs fixed until June 2024, this initial six month delay before start could be viewed as the key reason for the failure to deliver on time.

The GAO audit notes the development team claimed to complete the build of the initial "core" system in October of 2022, only about seven months after starting work in March 2022. But the report also notes the development team claimed the application had been successfully deployed to cloud infrastructure in March 2023. Those two claims reviewed together raise serious questions. Was the application actually DESIGNED using cloud-oriented technologies and patterns like containers, micro-services, auto-scaling, load balancing and regional failover? If so, those traits would have been present from the developer's sandbox onward and deploying to "a" cloud should not have taken from October 2022 to March 2023.

If the core application was instead designed using more traditional hardware / software patterns, then the consulting firm's work constitutes a MASSIVE design failure. As stated in the prior section on Key Technology Attributes, this solution should have been designed from Day One using modern development techniques. It is absolutely clear the system does NOT reflect modern design practices because core elements of the new system are still written in COBOL. Practically no one is writing "drivers" to allow COBOL applications to write to modern data stores like Cassandra, Hadoop or MongoDB, or even to open-source relational databases like PostgreSQL or MariaDB. That means anything WRITTEN in COBOL is tied to other legacy platforms which are unable to leverage any of the modern features of cloud-hosting environments.

Clearly, the Department of Education is in the same boat as the IRS and dependent on systems which haven't been modernized in forty years. However, any competent software architect assigned to design this system would have defined a data interface point in the processes to completely isolate this new system from ANY legacy systems within the Department of Education. Anything COBOL based needing data from this system could have been supplied the data using technologies like Kafka that could mirror data in flight to external services which could transform the data and insert it into legacy systems without polluting the design of an entirely new green field application.
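The transform step at that interface point could be as small as the Python sketch below. It leaves Kafka itself out and only shows a JSON message being flattened into a fixed-width line of the kind legacy batch programs commonly read. The field layout and function name are hypothetical and are not taken from any real Department of Education record format.

```python
import json

# Hypothetical fixed-width layout as (field name, width) pairs.
LAYOUT = [("applicant_id", 10), ("last_name", 20), ("award_year", 4)]


def to_fixed_width(message_json):
    """Flatten one JSON message into a single fixed-width record line."""
    record = json.loads(message_json)
    parts = []
    for name, width in LAYOUT:
        value = str(record[name])
        if len(value) > width:
            # Refuse to truncate silently; a legacy reader would misparse the line.
            raise ValueError(f"{name} exceeds {width} characters")
        parts.append(value.ljust(width))  # pad with spaces on the right
    return "".join(parts)


message = '{"applicant_id": "A123", "last_name": "Rivera", "award_year": 2024}'
line = to_fixed_width(message)
print(repr(line))
```

The design point is that this adapter lives OUTSIDE the new application. The new system only ever emits its own modern format, and the quirks of the legacy record stay on the legacy side of the fence.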

The fact that the application didn't stay up at launch for more than an hour confirms that any testing the vendor claimed to perform for load testing was completely inadequate or fraudulent. Load testing of modern web applications is not easy by any means but tools do exist to "record" a human using the application then convert the captured clicks into a script which can fill in random test data like names, addresses, dollar amounts, etc. then simulate THOUSANDS of users from a fleet of "test drone" computers.
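The shape of such a test is not complicated. The following Python sketch is a toy, not a real load testing tool: the submit_application function is a stand-in that merely sleeps for a millisecond instead of calling a real endpoint, and all names and data ranges are invented for illustration. It does show the three pieces every load test needs: random test data, many simulated users running concurrently, and a summary of failures and response times.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

FIRST_NAMES = ["Alex", "Jordan", "Sam", "Taylor"]  # invented sample data


def random_applicant(rng):
    """Build one fake application from random test data."""
    return {
        "first_name": rng.choice(FIRST_NAMES),
        "birth_year": rng.randint(1998, 2007),
        "household_income": rng.randint(0, 250_000),
    }


def submit_application(applicant):
    """Stand-in for the recorded user script; a real test would issue HTTP requests."""
    time.sleep(0.001)
    return applicant["household_income"] >= 0


def simulate_user(seed):
    """Run one simulated user and return (succeeded, elapsed_seconds)."""
    rng = random.Random(seed)  # per-user generator keeps runs repeatable
    started = time.perf_counter()
    ok = submit_application(random_applicant(rng))
    return ok, time.perf_counter() - started


def run_load_test(users, concurrency):
    """Drive `users` simulated sessions with `concurrency` running at once."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(simulate_user, range(users)))
    failures = sum(1 for ok, _ in results if not ok)
    slowest = max(elapsed for _, elapsed in results)
    return {"users": users, "failures": failures, "slowest_seconds": slowest}


report = run_load_test(users=200, concurrency=20)
print(report)
```

A real effort would swap the stand-in for recorded browser sessions and spread the workers across many machines, but a vendor who cannot show even this much structure in its test plan has not load tested anything.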

The GAO audit mentions that the consulting firm was paid $122 million but that was only a fraction of the total development costs and resulting yearly operational costs. Those are already estimated to be over $336 million. That is a sign that the solution used too many legacy technologies with traditional expensive licensing costs. It may also reflect that the continued use of those obsolete technologies requires retaining consultants (SURPRISE!) at inflated hourly rates (SURPRISE!) because no current IT employees want to develop or maintain systems written on forty year old technology. That is also a sign that FSA or the Department of Education likely began no internal requirements or design analysis after the legislation passed or even after the start of the 2022 fiscal year to better prepare a Request For Proposal to solicit then analyze external bids. It seems highly likely they were completely snowed by the vendor they selected.

And the project has officially earned death march status. Workers within the Department of Education have been working 12-hour days since the December 2023 launch trying to code and test fixes to the system while others attempt to manually process information applicants were able to submit online rather than using some of the back-office automations of the new system that are not fully functional yet.

The crucial point about this FAFSA system example is that it reflects the cascading costs of massive incompetence in managing major changes to critical systems. The story in The Atlantic states that the chronic poor performance of the new system has actually resulted in 2.6 million fewer applications for financial aid being processed so far for the class of 2024 (that presumably includes renewal applications for existing college students). Since it is unlikely those students will just magically pay the difference on their own and remain enrolled, this failure may trigger a MASSIVE reduction in college enrollment in the Fall of 2024. That may trigger additional waves of cutbacks in colleges across the country. It will also result in a lasting reduction in opportunities for those affected -- whether the deferment of a degree for a year or two until an applicant tries again or a permanent loss of opportunity if other circumstances prevent coming back to college later.


It's a fitting coincidence that this fiasco occurred in the development of a system for the Department of Education. The lessons from this project should be an "education" (a "teachable moment") for EVERY agency at EVERY level of government. There are countless "business problems" just like the handling of financial aid applications that could be modernized to improve service and save taxpayers BILLIONS of dollars in operating costs. But not if the rebuilds are going to be managed like every other big IT project in America.


WTH