Sunday, March 31, 2024

The IT Death March Project: A Case Study

Anyone who has worked in the Information Technology field in the last twenty years is intimately familiar with the concept of a death march project. A death march project is the inevitable result of combining every bad trend in management (lean staffing, outsourcing, matrix organizations) with every worst practice in the design and development of new, highly complex systems. This combination of inputs inevitably produces

  • improperly collected requirements
  • poorly synchronized goals between users and managers
  • poor technology selections likely imposed by out-of-touch middle managers
  • vendor and consulting firm selections driven by senior leaders' personal relationships rather than capabilities and value
  • due dates and budgets set before a single page of design is completed to properly gauge the scope of work, development dependencies, testing requirements and infrastructure needed for production launch

All death march projects bear a few key similarities:

  • they affect millions of expected users
  • they are typically EXPECTED to cost tens of millions of dollars (maybe $10-20 million, maybe $50-70 million)
  • they usually WIND UP costing three times to ten times the original estimates when rework, delays, and opportunity costs are properly counted
  • they are PROMISED to take only twelve to eighteen months to deliver
  • they usually wind up taking eighteen months just to deliver the first working proof of concept, which triggers revolts from "the business" and users once they see the gap between actual needs and the "requirements" that were gathered
  • they often take over twenty four months to deliver a release into production which typically delivers between 50-70 percent of originally promised functionality and often fails upon launch due to poor performance
  • they usually involve armies of consultants hiring armies of contractors to "assist" with requirements gathering, project management and coordination between technical teams and "the business" stakeholders, testing, and often the core design and development - thus explaining the bloated costs, delivery intervals, mis-communications and poor quality of the final delivery

One rule governing all death march projects is that the death march nature of the project becomes apparent to all participants IN the death march project -- with the exception of the senior leaders launching the project -- within weeks of the project beginning. That is typically the point when the handful of experts who ACTUALLY understand the problem and required solution first encounter the waves of invading consultants and contractors. Only then does the CHASM between leadership expectations and reality become apparent.

Given the fixation in American life on fostering a data driven world and webifying every possible social and economic interaction, the consistent inability of business and technical professionals to build systems to proper specifications that meet sane minimum standards for usability and security, and to do so within estimated budgets and time intervals, would seem to merit serious research in engineering and business schools across the country.

Given this ongoing failure, it's apropos that one of the latest case studies in the phenomenon involves higher education. In a nutshell, legislation was passed to fund development of a system to simplify and modernize the financial aid application process. The project was managed by an agency within the Department of Education, began in 2022, was targeted for launch in 2023, barely made its committed launch date and has subsequently been found to have fallen far short of expectations for functionality and reliability and maps to a cost structure already nearly three times the original development cost. The effort to develop and launch the new solution not only reflects all of the hallmarks of a classic IT death march project but appears to have prevented MILLIONS of potential college applicants from applying for aid, thus risking dramatic drops in enrollment at schools across the country.


Case Study: Modernizing Financial Aid

In response to complaints from millions of parents, Congress enacted legislation at the end of 2020 to modernize the financial aid application process and convert all of the old-fashioned paperwork to an online portal-based application. Given the date of the legislation (December 2020), it is likely funding for the new system was not available until the 2022 fiscal year, which began in October of 2021. The Education Department set a target for the new system to be online for use by the first round of college freshmen from the graduating high school class of 2024. That might SEEM like development teams had until April or May of 2024 to complete the system, right? Wrong. Colleges begin making admission decisions for Fall 2024 classes in roughly November to December 2023, which means financial information must be submitted by September or October 2023. That means the development teams had AT MOST twenty-four months to design, code, test and deploy the new system.

To really appreciate the magnitude of the failure of the effort, the scope of the project should be reviewed from multiple perspectives, exactly like it SHOULD have been before finalizing functional requirements, design, delivery intervals and costs.

Key Project Attributes

Anyone familiar with the design, coding and operation of large scale systems would have noted the following attributes of an effort aimed at "modernizing" and "webifying" the application process for financial aid.

  • Applications to the legacy FAFSA process peaked at 22.5 million in 2012 but were still around 17.5 million in 2023
  • The application process is HIGHLY seasonal, with virtually all user actions concentrated between October and December of each year
  • By definition, nearly one quarter of the applicants are new to the process each year, representing freshman applications, and may be prone to procrastinating towards the end of the entry period.
  • As a financial system, the system will collect and house HIGHLY sensitive identity and financial information for students and parents, making network security and encryption of data in motion and at rest paramount.
  • As a system involving government and banking processes that carry criminal penalties for fraudulent submissions, RETENTION of data for extended periods (likely, the life of any loans granted based upon applications) is required with all of the same data protection expectations as the live system.
  • The project aimed not only to AUTOMATE an existing process; it was also established to CHANGE and STREAMLINE that process. That means every school needing information FROM this system requires changes to THEIR systems. Any new data structures resulting from the streamlined design must be mapped to application interfaces which must be shared AND IMPLEMENTED by schools in parallel with the core system development (a hypothetical example of such a record appears after this list). If the federal system is completed on time but no downstream systems are ready to use data in the new format, the launch accomplishes nothing for the government, schools, parents, students or banks.
  • As a completely new system serving tens of millions of users, the system design should include support for a web portal and a smartphone-based "user experience." That "user experience" tier of the application should be rigidly isolated from a core tier used to provide integrations between school systems, government oversight systems and the end-user applications.
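To make the downstream-interface point concrete, here is a minimal sketch of the kind of record a school's system might pull from the new federal system. It is written in Python purely for illustration; every field name here is invented rather than taken from any actual FAFSA specification, and a real interface would be far larger.

    import json

    # Hypothetical applicant record a school system might consume from the new
    # federal API. All field names are invented for illustration only.
    applicant_record = {
        "application_year": 2024,
        "applicant_id": "A-00012345",        # new key schools must map to their own student IDs
        "dependency_status": "dependent",
        "student_aid_index": 1250,           # assumed stand-in for the streamlined eligibility figure
        "schools_requested": ["001234", "005678"],
    }

    print(json.dumps(applicant_record, indent=2))

Every school consuming a record like this needs its own mapping, coding, testing and deployment work scheduled in parallel with the core build.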

Key Technology Attributes

With only the high level characteristics listed above and without reading a word of written "user requirements", several technical decisions related to the solution design should have become immediately obvious:

  • The concentrated seasonal nature of user load on this system REQUIRES implementation to use coding patterns at the presentation layer and core services layer that are dynamically scalable. That doesn't mean it has to run in a commercial cloud like AWS, Google or Azure but it certainly requires cloud technologies like containers, auto-scaling, data mirroring and regional failover.
  • The browser portal layer of the solution should be built using a common JavaScript framework like React, Angular or Next.js that has existed for many years, has matured and stabilized, and will likely remain in wide use for years after launch of the new system.
  • The core services layer of the solution should be written as standard web services in Java with JSON-encoded requests and responses. It's not the coolest language in 2024 but knowledge of the language is so ubiquitous that it will ALWAYS be possible to find competent developers to maintain the system indefinitely without paying inflated salaries. Every viable library for building browser and smartphone user interfaces uses JSON for passing data in requests and responses.
  • The core data design should separate "current" data (for the current application year) from "historical" data (from prior years) to keep the primary databases as small and efficient as possible and to reduce the infrastructure logistics of handling failover. If one year of data adds up to 1 terabyte and the system has to retain data for ten years, a design that keeps all ten terabytes in the "current" database and requires all ten terabytes to be restored before recovering the system after a failure would be horribly flawed. (A minimal sketch of this split follows the list.)
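As a toy illustration of that current/historical split, the sketch below routes a query to a small "hot" database or a larger archive based on application year. The database names and the cutover rule are assumptions for illustration, not a real design.

    from datetime import date

    # Toy routing for the current/historical data split described above.
    CURRENT_YEAR = date.today().year

    def pick_database(application_year: int) -> str:
        """Route a query to the small 'current' store or the larger archive."""
        if application_year == CURRENT_YEAR:
            return "fafsa_current"    # small, hot, restored first after any failure
        return "fafsa_archive"        # retained for the life of any loans, restored later

    assert pick_database(CURRENT_YEAR) == "fafsa_current"
    assert pick_database(CURRENT_YEAR - 5) == "fafsa_archive"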

Mythical Man-Month Considerations

From one perspective, attempting to analyze financial budgets and timelines for a project in the absence of finalized user requirements and an architectural design is pointless. At the same time, actual projects for large systems DO set budgets and timelines before completing these tasks ALL THE TIME. With some familiarity with the typical work units behind measurable components of the full system, it is possible to use initial budgets and timelines to spot cases where they are not sufficient to implement all promised functionality with appropriate designs and technology.

Here is some typical math that might apply in sizing the dollars and days required for a large project.

  • typical loaded salary for a developer EMPLOYEE might be $170,000/year or $81.73/hour
  • typical loaded salary for a tester EMPLOYEE might be $130,000/year or $62.50/hour
  • typical loaded rate for an ON-SHORE developer CONTRACTOR might be $120.00/hour
  • typical loaded rate for an ON-SHORE tester CONTRACTOR might be $90.00/hour
  • typical loaded rate for an OFF-SHORE developer CONTRACTOR might be $90.00/hour
  • typical loaded rate for an OFF-SHORE tester CONTRACTOR might be $50.00/hour
  • a single web service might take 40 days (320 hours) to design and code and 20 days (160 hours) to integration test with other pieces
  • the application might need 100 different web services
  • a single page in the browser portal might take 90 days or 720 hours to design and code
  • a single page in browser portal might take 30 days or 240 hours for functional testing
  • the application might need 50 different pages

Even without details on exact code requirements, these numbers alone allow a crude sanity check on total dollar figures, total headcount and elapsed time being estimated for a project.

With just these numbers using EMPLOYEE wage rates, building the core app and browser portal app alone might involve these total hour and cost figures:

  • web service hours = 100 x (320 + 160) = 48,000 hours
  • web service cost = (100 x 320 x 81.73) + (100 x 160 x 62.50) = $3,615,360
  • portal hours = 50 x (720 + 240) = 48,000
  • portal cost = (50 x 720 x 81.73) + (50 x 240 x 62.50) = $3,692,280

Assume that those 96,000 total work hours are spread across 18 months due to sequential dependencies, etc. That implies a team of roughly 9.66 web service developers, 4.83 web service testers, 10.9 portal developers and 3.6 portal testers: about 30 full-time workers (the sketch below reproduces this arithmetic). So even if an organization doesn't HAVE 30 employees to do this work, if it understands prevailing labor rates for such work, it has a skeleton cost structure that can be used to sanity check bids from vendors.
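Here is a minimal sketch of that arithmetic, assuming roughly 184 productive hours per person-month over the 18 months (an assumption chosen to reproduce the headcounts above); the rates and hour figures are the illustrative numbers from the bullets, not real bid data.

    # Crude sanity-check calculator using the illustrative figures above.
    DEV_RATE, TEST_RATE = 81.73, 62.50            # employee loaded rates, $/hour
    SERVICES, SVC_DEV_HRS, SVC_TEST_HRS = 100, 320, 160
    PAGES, PAGE_DEV_HRS, PAGE_TEST_HRS = 50, 720, 240
    MONTHS, HOURS_PER_MONTH = 18, 184             # assumed productive hours per person-month

    svc_cost = SERVICES * (SVC_DEV_HRS * DEV_RATE + SVC_TEST_HRS * TEST_RATE)
    page_cost = PAGES * (PAGE_DEV_HRS * DEV_RATE + PAGE_TEST_HRS * TEST_RATE)
    total_hours = SERVICES * (SVC_DEV_HRS + SVC_TEST_HRS) + PAGES * (PAGE_DEV_HRS + PAGE_TEST_HRS)
    team_size = total_hours / (MONTHS * HOURS_PER_MONTH)

    print(f"web services: ${svc_cost:,.0f}   portal: ${page_cost:,.0f}")
    print(f"total hours: {total_hours:,}   implied team size: {team_size:.0f}")

Swap in the contractor rates from the earlier bullets and the same skeleton prices out an outsourced bid for comparison.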

Here's where mythical man-month considerations come into play.

Vendors can claim they can do the same work in fewer hours with smarter talent.

Vendors can claim they can do the same work in the same hours at a lower labor rate.

Vendors can claim they can do the same work in less ELAPSED time by doing the same work in the same hours using MORE resources at lower labor rates in parallel to compress the schedule.

Vendors can claim they can do the same work in less ELAPSED time with MORE resources for MORE COST but at least they can meet your due date when you think you cannot.

The validity of each claim can be checked if the project owner has some estimate of work units that was used for internal estimates and to solicit external bids. It's not rocket science, but it involves lots of tedious algebra that can be reduced to an Excel spreadsheet for consistency. Few IT shops do this. Bidding out development work is like buying a new car with an old-car trade-in: having too many offsetting numbers distracts most buyers from the overall picture. Instead, they get fixated on meeting the promised delivery date, and when a vendor says it has a team ready to go that can meet a date your management already set in stone, well, at least a miss on the date can be blamed on an external party.

If this project was bid out to vendors who came back with a total cost of $14,000,000, the first obvious reaction might be: My internal estimate with employees was $7.3 million, why are you $6.7 million higher?

The vendor might say: Well, you wanted it in twelve months so that's the premium for me to accelerate the work enough to meet your date.

The next obvious question might be: Are you pulling in the date by finding more productive developers who can do the work in fewer hours?

If yes, the next obvious question might be: Exactly where are you finding this team of developers with expertise in this specific business domain?

If no, the next obvious question might be: If you're adding more workers, how are you compressing dependencies to allow more parallel work?

The diligent internal manager might ask the consulting firm how many build cycles they expect it to take to create the version that passes user acceptance testing, passes load testing and deploys to production. If the answer is ONE, the consultant is LYING. No sufficiently complex system will be developed in a single iteration through each module of code and then deployed to production. If the answer is N (where N is greater than 1), the next obvious question would be to verify how many iterations of functional testing are being quoted in the bid. If testing labor is only sufficient for ONE iteration, the consultant is again LYING or doesn't understand the structure of their own bid. These are equally common problems. (A toy version of that check follows.)
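A toy version of that last check, with hypothetical inputs: divide the quoted functional-testing hours by the cost of one full pass across all pages and see how many passes the bid actually funds.

    # Toy check: how many functional test passes does a bid's testing labor cover?
    # All inputs are hypothetical.
    def test_passes_covered(quoted_test_hours: float, pages: int,
                            hours_per_page_per_pass: float) -> float:
        return quoted_test_hours / (pages * hours_per_page_per_pass)

    passes = test_passes_covered(quoted_test_hours=12_000, pages=50,
                                 hours_per_page_per_pass=240)
    print(f"functional test passes funded by the bid: {passes:.1f}")
    if passes < 2:
        print("WARNING: the bid funds roughly one pass -- ask the vendor to explain")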

Software development hours and spending are not linearly related to output. No application is built purely sequentially from the bottom up (data model, then web services for objects, then web services for transactions, then user interface), even for "green field" applications like this FAFSA system. Work proceeds in parallel where it can, but the ability to do that is limited. Sometimes, presentation layer developers MUST wait for a web service developer to finish who MUST wait for a database developer to design and build a data schema. Hiring more developers and testers beyond this limit of available parallelism does nothing to pull in project dates. As originally explained in Fred Brooks' The Mythical Man-Month, published in 1975, adding resources to a late project usually makes it later by adding confusion and starting work against incomplete specs, which requires MORE work for testing and bug-fix releases.
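A crude model of that limit, with invented numbers (160 productive hours per person-month and a 12-month chain of strictly dependent work): elapsed time can never drop below the dependency chain no matter how many people are added. Brooks' point is that real projects do even worse than this toy model, because added people also add communication overhead.

    # Toy schedule-compression model. The 12-month critical path and 160 hours per
    # person-month are invented for illustration.
    def elapsed_months(total_hours: float, workers: int,
                       critical_path_months: float = 12,
                       hours_per_month: float = 160) -> float:
        staffing_limit = total_hours / (workers * hours_per_month)
        return max(staffing_limit, critical_path_months)

    for team in (20, 30, 60, 120):
        print(f"{team:>3} workers -> {elapsed_months(96_000, team):.1f} months")

Past roughly 50 workers in this toy model, extra staff buys nothing; in the real world it subtracts.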

So How Did the FAFSA Project Fare?

Ummmmmmm.... Exactly as seasoned IT engineers would have predicted. In June 2023, the Government Accountability Office issued a report on its findings regarding the program's delivery status before the system was ever launched, and its findings were damning.

https://www.gao.gov/assets/gao-23-106376.pdf
  • the work was contracted out -- the Department of Education has no in-house IT shop capable of building a system this complex
  • the vendor wasn't selected until March 2022
  • the delivery date was set to December 1, 2023, completely ignoring prior history requiring collection of new applications beginning in October of each year
  • the contract was for $122 million dollars
  • development of the "core" was completed in October 2022 -- only 8 months after project start
  • the application instance was first deployed to cloud hosting infrastructure in March 2023

All that sounds "good" but the fact that the GAO was already auditing the program in June 2023, six months before its due date, clearly indicates disaster was looming and word was getting out. The GAO report attempted a "glass half full" narrative with this finding:

The AED project implemented four of five selected planning best practices that are intended to increase the likelihood that a project meets its objectives. For example, the project developed management plans and identified the skills needed to perform the work.

However, critical gaps exist in the fifth practice related to cost and schedule. Specifically, contrary to best practices, the project did not develop a life cycle cost estimate to inform the budget. Instead, officials roughly estimated that the office would need approximately $336 million to develop, deploy, and support AED. However, this estimate was incomplete since it did not include government labor costs.

This problem occurs in nearly EVERY large IT project in the Fortune 500. Many projects start with a delivery date and an initial cost estimate that remain frozen despite months of haggling between requestors and funders, which inevitably inflates promised functionality without altering the original budget planning number or the due date. When the project is approved, the date and promised functionality are set in stone with no available internal resources to complete the work in that timeline.

So the work is outsourced to a consulting firm who promises they magically have 100 developers and 100 testers "on the bench" who can drop in next week and begin work immediately. But they'll need requirements to start. And there won't be enough long term employees to keep tabs on how development proceeds, including selection of key technologies like databases, web service frameworks and GUI frameworks. The system will be built to the CONSULTING company's "standards" or, just as likely, will be developed using whatever tools happen to be popular with the particular team of developers that are sitting on the bench waiting for their next assignment. It could be a new open-source framework that's all the rage but is abandoned by its community before the application is even launched. It could be a decades old framework tied to other infrastructure products that will incur $10 million in yearly licensing fees that weren't budgeted.

And there is a 99% chance those developers brought in from the consultant's "bench" know NOTHING about the "domain" of the application being built. They won't understand the data models, they won't understand business requirements and thus will have no insight to jump-start the project. At best, they will only deliver what is written down in "requirements." Oh, you don't have those? Well the consulting firm will be happy to have its requirements analysts sit down with your business owners and collect those requirements, with the meter and calendar running, even though the consulting firm has no prior expertise in your business domain. Having a vendor write their own requirements then certify their delivery against their own requirements is a virtual guarantee of a failed project.

So insiders at the FSA agency within the Department of Education suspected problems were looming and asked the GAO to prepare a proactive post-mortem on the project. What happened afterwards?

The new system DID launch. It missed the desired December 1, 2023 launch date but DID launch prior to the end of December 2023 to meet the contractual obligation... Hmmmmm. But how did the system actually PERFORM?

Horribly. As a story on this topic in The Atlantic summarized: https://www.theatlantic.com/ideas/archive/2024/03/fafsa-fiasco-college-enrollment/677929/

The trouble began last fall. First, the Department of Education announced that the FAFSA, which usually launches October 1, wouldn’t be online until December. It went live on December 30, just days before the deadline set by Congress—then went dark less than an hour later. By the second week of January, the FAFSA was up around the clock, but that didn’t mean the problems were over. Students and parents reported being randomly locked out of the form. Because of some mysterious technical glitch, many students born in the year 2000 couldn’t submit it. And students whose parents don’t have a Social Security number couldn’t fill out the form. The department reported “extraordinary wait times” as its helpline was clogged with calls.

On January 30, the day before the department was set to transmit the completed forms to colleges, it announced that the forms actually wouldn’t go out until mid-March. It used the time to change its aid formulas to account for inflation (its failure to do so had left some $2 billion in awards on the table). “We always knew it was going to be rocky, because the changes were so big and significant,” Amy Laitinen, the director for higher education at the think tank New America, told me. “But I don’t think anybody could have imagined how rocky. I don’t even know if rocky is the right word at this point.” Other experts suggested alternatives: “nightmare,” “unprecedented,” and “a mess all around.”

Monday Morning Quarterbacking

A review of the GAO report and the system's actual performance at launch points out NUMEROUS problems with the system's ultimate design and reflects GROSS incompetence within the FSA agency that contracted for the system and the contractor(s) involved. First, the government agency owning the effort failed to understand the first rule of projects: every day in a project is just as important as every other day, including the early days. When new projects are started, leaders lapse into a fog in which the due date seems (often) years away and getting requirements and resources to start NOW seems unimportant. Business owners are in no rush to document THEIR requirements. IT analysts are in no hurry to get THEIR work started if the client isn't ready with requirements. WEEKS go by. MONTHS go by. The budget hasn't changed and the due date hasn't changed, yet the available time to complete the work is dwindling relentlessly.

This project was triggered by legislation passed in December 2020. Assuming it wasn't funded until Fiscal Year 2022, it might have been impossible to start until October 2021. Yet the consulting firm wasn't signed until March 2022. Six months of time were lost with presumably little internal ramp-up work within FSA to gather requirements to be ready on March 1, 2022. Given that the application "launched" on December 30, 2023 but might not have all critical bugs fixed until June 2024, this initial six month delay before start could be viewed as the key reason for the failure to deliver on time.

The GAO audit notes the development team claimed to complete the build of the initial "core" system in October of 2022, roughly seven months after starting work in March 2022. But the report also notes the team claimed the application had been successfully deployed to cloud infrastructure in March 2023. Those two claims reviewed together raise serious questions. Was the application actually DESIGNED using cloud-oriented technologies and patterns like containers, micro-services, auto-scaling, load balancing and regional failover? If so, those traits would have been present from the developer's sandbox onward, and deploying to "a" cloud should not have taken from October 2022 to March 2023.

If the core application was instead designed using more traditional hardware / software patterns, then the consulting firm's work constitutes a MASSIVE design failure. As stated in the prior section on Key Technology Attributes, this solution should have been designed from Day One using modern development techniques. It is absolutely clear the system does NOT reflect modern design practices because core elements of the new system are still written in COBOL. No one is writing "drivers" to allow COBOL applications to write to modern data platforms like Cassandra, Hadoop or MongoDB, or even to modern relational databases like PostgreSQL or MariaDB. That means anything WRITTEN in COBOL is tied to other legacy platforms which are unable to leverage any of the modern features of cloud-hosting environments.

Clearly, the Department of Education is in the same boat as the IRS, dependent on systems which haven't been modernized in forty years. However, any competent software architect assigned to design this system would have defined a data interface point in the processes to completely isolate this new system from ANY legacy systems within the Department of Education. Anything COBOL-based needing data from this system could have been fed using technologies like Kafka, which can mirror data in flight to external services that transform the data and insert it into legacy systems without polluting the design of an entirely new green-field application.
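A minimal sketch of that isolation layer, assuming the kafka-python client and invented topic and field names: the new system publishes each applicant event to a topic, and a separate adapter (not shown) consumes the topic, reformats records and feeds the legacy COBOL-side systems.

    import json
    from kafka import KafkaProducer   # kafka-python client

    # The new system only ever writes to the topic; legacy integration logic lives
    # in downstream consumers, keeping the green-field design unpolluted.
    producer = KafkaProducer(
        bootstrap_servers="broker1:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    def publish_application_event(event: dict) -> None:
        """Mirror a completed application to downstream / legacy consumers."""
        producer.send("fafsa.application.events", event)

    publish_application_event({"applicant_id": "A-00012345", "status": "SUBMITTED"})
    producer.flush()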

The fact that the application didn't stay up at launch for more than an hour confirms that whatever load testing the vendor claimed to perform was completely inadequate or fraudulent. Load testing of modern web applications is not easy by any means, but tools do exist to "record" a human using the application, convert the captured clicks into a script which can fill in random test data like names, addresses and dollar amounts, then simulate THOUSANDS of users from a fleet of "test drone" computers.
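A crude sketch of the mechanics behind such tools, with a placeholder URL and payload (established tools like JMeter or Locust drive far larger loads from fleets of machines): replay a scripted submission with randomized data from many concurrent workers and count the failures.

    import random
    import concurrent.futures
    import requests

    # Placeholder endpoint -- never point a load generator at a system you don't own.
    TARGET = "https://test-environment.example.gov/api/applications"

    def submit_one(i: int) -> int:
        payload = {"applicant_id": f"LOADTEST-{i}", "income": random.randint(0, 150_000)}
        return requests.post(TARGET, json=payload, timeout=10).status_code

    with concurrent.futures.ThreadPoolExecutor(max_workers=200) as pool:
        status_codes = list(pool.map(submit_one, range(5_000)))

    print("non-200 responses:", sum(code != 200 for code in status_codes))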

The GAO audit mentions that the consulting firm was paid $122 million but that was only a fraction of the total development costs and resulting yearly operational costs, which are already estimated to be over $336 million. That is a sign that the solution used too many legacy technologies with traditional expensive licensing costs. It may also reflect that the continued use of those obsolete technologies requires retaining consultants (SURPRISE!) at inflated hourly rates (SURPRISE!) because no current IT employees want to develop or maintain systems written on forty year old technology. That is also a sign that FSA or the Department of Education likely began no internal requirements or design analysis after the legislation passed, or even after the start of the 2022 fiscal year, to better prepare a Request For Proposal to solicit then analyze external bids. It seems highly likely they were completely snowed by the vendor they selected.

And the project has officially earned death march status. Workers within the Department of Education have been working 12-hour days since the December 2023 launch trying to code and test fixes to the system, while others attempt to manually process information applicants were able to submit online because some of the back-office automations of the new system are not fully functional yet.

The crucial point about this FAFSA system example is that it reflects the cascading costs of massive incompetence in managing major changes to critical systems. The story in The Atlantic states that the chronic poor performance of the new system has resulted in 2.6 million fewer applications for financial aid being processed so far for the class of 2024 (a figure that presumably includes renewal applications for existing college students). Since it is unlikely those students will just magically pay the difference on their own and remain enrolled, this failure may trigger a MASSIVE reduction in college enrollment in the Fall of 2024. That may trigger additional waves of cutbacks in colleges across the country. It will also result in a permanent reduction in opportunities for those affected, ranging from the deferment of a degree for a year or two until an applicant tries again to a permanent loss of opportunity if other circumstances prevent a return to college later.


It's a fitting coincidence that this fiasco occurred in the development of a system for the Department of Education. The lessons from this project should be an "education" (a "teachable moment") for EVERY agency at EVERY level of government. There are countless "business problems" just like the handling of financial aid applications that could be modernized to improve service and save taxpayers BILLIONS of dollars in operating costs. But not if the rebuilds are going to be managed like every other big IT project in America.


WTH

Thursday, March 21, 2024

Judicial Conference Tackles Judge Shopping

The Judicial Conference is a body mandated by Congress, chaired by the Chief Justice and composed of judges from throughout the federal court system, that considers and implements administrative changes to federal court procedures. On March 15, 2024 the organization issued a policy aimed at minimizing or eliminating "judge shopping", the process by which lawyers carefully shop for plaintiffs, then carefully curate a set of facts affecting those plaintiffs, then carefully identify a judicial district and a judge likely to rule in the plaintiff's favor. In recent years, this strategy has been used consistently not just in the interest of a single plaintiff but in order to drive appeals all the way to the Supreme Court to alter existing, long-standing precedents or create brand new law.

The actual directive from the organization was published online and is available here:

https://s3.documentcloud.org/documents/24483622/judicial-conference-policy.pdf

The essence of the directive is quoted below:

District courts should apply district-wide assignment to:

a. civil actions seeking to bar or mandate statewide enforcement of a state law, including a rule, regulation, policy, or order of the executive branch or a state agency, whether by declaratory judgment and/or any form of injunctive relief; and
b. civil actions seeking to bar or mandate nationwide enforcement of a federal law, including a rule, regulation, policy, or order of the executive branch or a federal agency, whether by declaratory judgment and/or any form of injunctive relief.

In slightly clearer language, the directive is aimed at trying to maximize the randomness in assignment of a judge for a case by using the largest possible pool of judges associated with the largest logical district territory that will hear the case. Current practices start at a "division" layer below the district. Many of these divisions only have one or two judges. If a plaintiff and his lawyers believe a particular judge will rule in their favor, they can file the case in that division and virtually guarantee they will get the judge they want.

This story was interesting because one of the first references to it was in a story on Slate whose title implied this reflected Supreme Court Chief Justice John Roberts' attempt to crack down on this abuse of process that has been particularly popular with MAGA types going after abortion law and similar hot button topics. Roberts in theory provides top level direction to the organization. Here's the link to the Slate story:

https://slate.com/news-and-politics/2024/03/john-roberts-matthew-kacsmaryk-nationwide-injunctions-judge-shopping.html

Dropped the Hammer did he?

It's odd to me that Roberts would be described as being so fed up with this manipulation of the courts, as if this were the only issue affecting the integrity of decisions coming from the judicial system. There are multiple issues sitting directly on Roberts' own bench in the form of Justices with glaring conflicts of interest who refuse to recuse from landmark cases directly impacting those conflicts. Roberts hasn't just remained silent on those issues; in the case of the leak of the Dobbs decision weeks before its final release, Roberts directed a charade of an investigation that lasted eight and a half months and claimed to interview eighty people who had access to the draft, yet concluded the perpetrator could not be identified by a preponderance of the evidence and might never be known.

When the judicial system is being stressed as thoroughly as it has been for the last decade, every improvement to process is welcome but additional opportunities exist. Many of them are being tolerated by the man thought to be in charge of eliminating them.


WTH

Two Takes on Software and Society

The Atlantic published two distinct articles on the impact of software on society that are highly recommended reading. The first illustrates the unique impact the last fifteen years of technology (not just social media but tablets, touch screens, video) have had on the generation just entering the workforce and how those impacts are altering our social and economic arc. The second, written by a professor of computer science, addresses the structure of computer science programs and the possibility that their very design shortchanges those entering the field of the wider perspective required to apply computer science in ethical ways. It's a fortunate coincidence that the two articles were published so closely together; they are essentially addressing a common set of dynamics at work.


Generational Issues with Child Smartphone Use

The first article, written by Jonathan Haidt, is an abridged version of his recently published book The Anxious Generation: How the Great Rewiring of Childhood Is Causing an Epidemic of Mental Illness. The article is available here: https://www.theatlantic.com/technology/archive/2024/03/teen-childhood-smartphone-use-mental-health-effects/677722/. The magazine version and presumably the book present these key conclusions:

  • the tremendous share of "screen time" spent by children not only on classic "social media" apps like Facebook and TikTok but also on games and even educational software is a reflection of addictive behavior
  • addictive behaviors inevitably alter and conflict with the development of communication abilities in real-world, human-to-human scenarios
  • the "opportunity cost" of screen time is far greater than parents and children are recognizing, in the form of reduced sleep, loss of relationship forming / maintaining skills, horrible attention spans for any sort of work, etc.

Haidt starts his analysis with a more anecdotal observation made in an interview with none other than Sam Altman, head of OpenAI, and another entrepreneur, Patrick Collison.

Surveys show that members of Gen Z are shyer and more risk averse than previous generations, too, and risk aversion may make them less ambitious. In an interview last May, OpenAI co-founder Sam Altman and Stripe co-founder Patrick Collison noted that, for the first time since the 1970s, none of Silicon Valley’s preeminent entrepreneurs are under 30. “Something has really gone wrong,” Altman said. In a famously young industry, he was baffled by the sudden absence of great founders in their 20s.

It must be noted as a possible counter to this observation that the "sudden absence of great founders in their 20s" could also be explained in part by an absence of adequate anti-trust regulation against the top five to ten technology companies which results in them gobbling up any startup whose creation appears capable of growing exponentially to compete with an established monopoly. It takes a lot more willpower to hold out and stay independent to BECOME a billionaire twenty-something when the existing billionaires are buying you out at age 26 for $50 million when you still aren't sure your idea is worth $10 million.

Still, the larger point seems to fit this evolution in generational experience quite well. Haidt follows that anecdotal observation about the generational impact on innovation (measured at the billion-dollar level) with more sobering statistics about self-reported instances of various mental health issues in these new generations of students. Self-harm incidents requiring emergency room visits for girls ranged from 100 to 170 per 10,000 between 2000 and 2008 but, starting in 2009, climbed steadily to roughly 600 per 10,000 girls by 2020. The percentage of freshman college students reporting psychiatric disorders (beyond ADHD and learning disabilities, which were reported separately) jumped from about 3.9% in 2010 to 14% in 2018.

Haidt makes the insightful point that it isn't technically the design of social media apps ALONE that contributes to these problems. Facebook pre-dated the smartphone by six years but wasn't nearly as addictive because a computer had to be used for access, naturally limiting exposure time. Once smartphones became normal for children to have, the devices were available for uncontested use 24x7 and use of apps like Facebook skyrocketed in lockstep. "Screen time" for children with smartphones currently averages 5-7 hours per day. As Haidt points out, those 5-7 hours impose enormous opportunity costs in the form of other activities better suited for learning and social development that are skipped in order to "follow friends" and "track likes", etc.

Haidt's magazine piece goes on at length to explain the mechanisms by which he believes these new gadgets and apps interfere with brain development and makes a strong case for public efforts to immediately re-think the use of these devices, in particular by children. Haidt doesn't make the analogy but I will... Imagine you notice a precipitous drop in test scores and a jump in medical expenses for all children above the age of ten in your community, all starting in the same year. Rich families, middle class families, poor families. All races. All ethnicities. Imagine learning the date of that sea change corresponded EXACTLY to a change in ingredients in the pizza being served in the school cafeterias. Is it a good idea to continue serving that pizza? Even if the students like the pizza?

In this tortured analogy, the students don't even seem to like the pizza that much. Haidt notes that teens are not inexplicably miserable in their social-media-driven lives without a clue to the source of their perpetual angst. They are VISCERALLY aware of its source and toxicity, and a large share actively dislike it but nonetheless feel compelled to participate. It's exactly analogous to adults who despise LinkedIn but maintain a profile anyway, because a candidate with no LinkedIn presence looks highly suspect to future employers.

Haidt addresses immediate changes he recommends for individual parents, then maps those changes to the larger set of actions required to get government and society at large to learn from this problem and alter our collective course. Again, highly recommended reading.


How Should Computer Science Curriculums Be Designed?

The second Atlantic article regarding the impact of software on individuals was written by Ian Bogost, a professor of computer science at Washington University. The article is entitled Universities Have a Computer Science Problem and is available here: https://www.theatlantic.com/technology/archive/2024/03/computing-college-cs-majors/677792/. It addresses a topic that at first would seem more philosophical than pragmatic: where is the most appropriate place to house a computer science curriculum in a modern college or university?

As the field first developed in the 1960s, some universities chose to locate their new computer science department within their engineering school, since initial work in the field was so "close to the machine," using vacuum tubes and later transistors to implement the 0/1, true/false, on/off logic being built, and thus sat closest to electrical engineering. Other schools chose to treat the new discipline as an exercise in the analysis of pure logic and instead located it within a mathematics department, which traditionally delivered such coursework. (Interestingly, Ian Bogost himself started at USC in its Computer Science program but switched majors, eventually earning Bachelors, Masters and Doctorate degrees in Philosophy and Comparative Literature.)

Bogost ties this early confusion over the appropriate home of computer science as an academic field of study directly to problems in modern society. The existence of algorithmic trading in financial markets, applications and games (intentionally or unintentionally) designed to be addictive, or just important systems affecting public health and safety designed with an appalling lack of discipline regarding their accuracy and safety all illustrate the dangers of such a crucial discipline lacking any concrete mooring to other disciplines.

Bogost has been discussing this concern with schools across the country that are trying to adjust or re-orient computer science curriculums in their entirety to accommodate the demand for CS coursework, even from students not willing to pursue it as a major. As with any other discipline, there are nuances to the field that won't be covered in a CS 101 class or a class teaching Python to data scientists hoping to land a job running some Fortune 500 firm's analytics team looking for product ideas in web clicks. But Bogost points out the same concern works in both directions. There are dangers in students graduating with degrees in computer science with little training in other disciplines like accounting, law, finance, etc.

Left entirely to themselves, computer scientists can forget that computers are supposed to be tools that help people. Georgia Tech’s College of Computing worked “because the culture was always outward-looking. We sought to use computing to solve others’ problems,” Guzdial said. But that may have been a momentary success. Now, at Michigan, he is trying to rebuild computing education from scratch, for students in fields such as French and sociology. He wants them to understand it as a means of self-expression or achieving justice—and not just a way of making software, or money.

We live in a data-driven world with unprecedented economic power and social influence concentrated with a small number of corporations. Addressing these concerns and devising curriculums that ensure personnel working in these fields comprehend the impact of the technologies they help design and operate is vital to correcting many of the problems we face. Did the software engineers at Facebook initially comprehend the damage their application might create? Maybe, maybe not. Do those software engineers understand the damage now? They certainly do. So how are they responding? How is the public responding?


WTH

Thursday, March 14, 2024

The Economics and Mechanics of Ransomware

Recent stories about a ransomware attack on UnitedHealth and its extended impact on the company's daily operations for nearly a month generated conversation online and (likely) in many boardrooms about the real possibilities of harm from ransomware and the readiness of organizations to recover from such attacks. Such concerns are very appropriate. I suspect the directions provided by leaders in such organizations to prepare for such attacks are NOT wholly appropriate, because it is likely many IT and security experts within those organizations and their leaders lack a clear understanding of the design characteristics of their current systems and the operational impacts of those designs, especially in the scenario where all data has been purposely destroyed or made unavailable.


A Key Distinction

The key distinction to make in the realm of harmful software is the intent of the originator. Software termed malware is designed to harm its target in some way and generally includes no capability to "un-do" its damage. The creators' motivation might be spite, like a bunch of teenagers sitting in an online chatroom, or it could be an economic or political entity attempting to seriously harm an opponent. That harm could stem from possibly irreversibly scrambling data or using existing software for unintended purposes (like spinning centrifuges at twice their rated speed to destroy them and prevent refinement of uranium). If data is the only object damaged in a malware attack, the ability of the target to recover the data depends upon the competency of the malware creators. If they knew what they were doing and truly randomly scrambled or simply overwrote the data, there's no practical way to recover using the data in place. Backups are the only avenue of recovery.

Software termed ransomware isn't designed to permanently harm a target (though permanent harm CAN result). It is instead a tool used as part of a business model for extorting cash from victims. Creators of ransomware want to make money. It's impossible for them to make money if their ransomware mechanism proves to be significantly less than 100% reversible. If a target is attacked with a ransomware tool that has scrambled 50 other large companies and only five were able to recover using the creator's recovery process, few other firms will pay the ransom and the creator's business model will collapse. If they continue anyway, their tool ceases being ransomware in any meaningful sense and has the same effect as other malware.


Attack Architecture

Malware and ransomware ("hackware") are both very similar in one key area. Both adopt similar "architecture" in the layers of software used because both require the same appearance of an INSTANT attack to achieve their goal and avoid being disabled. Of course, if attacking a large corporation with 50,000 employees with 50,000 laptops and two data centers with 10,000 servers running internal and external e-commerce systems, it is IMPOSSIBLE to literally attack all 60,000 machines simultaneously. Most hackware is designed in layers that, for purposes of explanation, will be termed shim, full client and command & control.

The shim layer is the piece of software that exploits software the target is already running to make that software do something it wasn't INTENDED to do but is PERMITTED to do. Ideally, this additional action LOOKS like regular activity that "machine zero" might perform to avoid triggering alerts about an unexpected process running or an unexpected attempt to reach some other remote resource. Note that the software targeted by the shim is NOT necessarily the ultimate target of the hackware. That initial point of infection is only the weak link being exploited because the hackware creators learned how to corrupt it to do something else useful in their attack and the target company happens to run that software. In the SolarWinds attack of late 2020, data managed by SolarWinds within a target company was NOT the actual target of the attack. It was just a widely used piece of enterprise software with a vulnerability the hackers learned to exploit.

The exploit leveraged by the "shim" layer may not allow a large enough change in the software being corrupted to perform the real action to be invoked by the hackware. The shim may instead target OTHER installed software or install NEW software to actually implement the real bad action to be performed at the time of the eventual attack. That software is the real "client" of the attack. Since most PCs and servers run anti-virus software looking for unexpected binaries or new processes, the client layer of most hackware relies upon being able to masquerade as something already allowed or upon being able to interfere with those scanning processes and discard their alerts. The key concept to understand at this point in the narrative is that the time of initial infection (by the shim) or "full infection" (by the client) is NOT the time of the attack. The process of infecting dozens / hundreds / thousands of machines while evading security monitoring tools takes time. Not just hours or days. Weeks. Months. (This has huge cost impacts on mitigation strategies to be explained later.)

Since full infection can take an extended period yet the goal of the hackware is to appear to attack simultaneously, most large scale hackware attacks leverage an external "command and control" layer which performs multiple tasks. It tracks "pings" from each infected machine to trace the progress of the original "shim" infection or the "full client" infection. In many cases, the hackware creators aren't targeting a particular organization in advance, they are learning who they infected via this telemetry and deciding if they want to ALLOW the attack. Since this telemetry can disclose public IP addresses of the infected machines, those addresses can help the hackware creators confirm the size of the target and decide how long to wait for additional infections before triggering the actual attack onset. For example, if a PING comes from IP 201.14.92.52 and that is part of a block operated by Joe's Bait & Tackle, the originators may just skip him. If the block is operated by Gitwell Regional Hospital in Podunk, AR that operates 90 beds, they might wait for another 40 or 50 machines to PING before triggering attack. If the block belongs to Ford Motor Company and only 4000 machines have PINGed in, they may wait until they see 50,000 to 60,000 before pulling the trigger.

The process of "pulling the trigger" is also designed in a way to avoid detection. Obviously, a firm whose security software sees 60,000 laptops all continuously polling some IP address in Russia is likely to detect that and get a heads up that trouble is looming. Instead, the "full client" running on each infected machine may be written to "poll" for instructions on a random interval over DAYS to get the final green light date and time of attack. Since most laptops and servers in Corporate America use NTP (Network Time Protocol) to sync onboard clocks down to the millisecond, once thousands of infected systems learn the attack date and time, they all just wait for that time to arrive and do not have to sync with the mother ship or each other to yield a simultaneous onset of the attack. Included with the green light and attack date/time will be a cryptographic key each client should use to generate an internal key to encrypt the data. If the attacker actually does honor any ransom payment, the command and control system will also signal the "paid" status the clients can use to reverse the encryption.


Ransomware Recovery

As mentioned before, options for recovery from a malware attack are slim. If the infection actually reached the onset phase, there will usually be no available method of recovering data. The creator had no such intent for recovery to be possible and the victim will likely lack the time and expertise required to reverse-engineer the malware, determine HOW it worked and whether any recovery is possible and code a fix. The only path forward is to identify all infected machines, quarantine them, then wipe and re-load each infected machine with clean software. If any infected machine remains on the network with clean machines, the infected machine can re-infect newly cleaned machines, getting you nowhere.

For ransomware, victims have to approach recovery with a split brain. On one hand, because it is ransomware, a short-term restoration may be POSSIBLE but only if the victim's leadership and legal counsel can agree upon a ransom amount and only if the attacker's recovery process actually works. If the victim is among the first victims of a new ransomware variant and the recovery process cannot be verified before paying the ransom, the victim may be taking a huge risk. Even if the recovery appears to work, the victim will STILL need to literally wipe and reload EVERY machine touched by the ransomware, whether it triggered on that machine or not. Once a machine has been compromised, it will require a complete reload. This process can still require the victim to incur multiple outages, extended maintenance windows, etc. as the production applications are migrated to new, wiped machines while other infected machines are systematically taken offline, wiped, reloaded and brought back online. And the victim will need to audit every BYTE of affected data to ensure no data was altered intentionally or inadvertently by the ransomware process.

For victims, the other half of their split brain process requires proceeding as though the ransom WON'T be paid or WON'T WORK and they have to begin recovering from backups. At this point, the complexity factor and costs grow exponentially for the victim. No large corporation operates a single monolith application with a single database with contents reflecting the entire state of all customer / employee / vendor relationships at a single point in time. Functions are spread across DOZENS of systems with specific data elements acting as "keys" linking a record in SystemA to a record in SystemB which maps to a different table in SystemB with a key mapping SystemB to SystemC, etc. Each of these individual systems may house records for millions of customers over years of activity and may be terabytes in size.

For large systems, the databases housing their records support multiple approaches for "backing up" data in the event of hardware failures, mass deletion (by accident or malice) or corruption. A "full backup" is exactly what it sounds like. Making a copy of every database table, database index, database sequence, etc. involved with an application and moving that copy to some other storage. If the database is one terabyte in production, that full backup will also take up one terabyte. In most companies, a full backup is created monthly or weekly. An "incremental" backup uses the database's ability to identify all records / changes made AFTER an explicit point in time (a "checkpoint") and copy just those records to a separate set of files tagged with that checkpoint. Incremental backups are typically taken every week or every X days.

By performing FULL and INCREMENTAL backups, if data is completely lost in production, the newest FULL BACKUP can be restored first, then all INCREMENTAL backups performed AFTER that full backup can be restored atop the full backup to restore the system to a state as close to the present as possible. As an example, a firm making MONTHLY full backups and WEEKLY incremental backups should never lose more than one week of data if they have to restore a corrupted system. Narrowing that potential data loss window involves reducing the intervals between the full and incremental backups but doing that is not pain free or cost free. More frequent backups require more disk storage and more network capacity between the database servers and the SANs housing the storage. If backups are to be copied offsite for additional protection against corruption or acts of nature, the storage and network costs can easily double.
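A toy illustration of that restore ordering and the resulting data-loss window, with invented timestamps: restore the newest FULL backup, then apply every INCREMENTAL taken after it; anything newer than the last usable backup is simply gone.

    from datetime import datetime, timedelta

    # Invented backup catalog for one database.
    full_backups = [datetime(2024, 2, 1), datetime(2024, 3, 1)]
    incrementals = [datetime(2024, 3, 1) + timedelta(weeks=w) for w in (1, 2)]

    base = max(full_backups)
    to_apply = sorted(t for t in incrementals if t > base)
    restore_point = to_apply[-1] if to_apply else base

    print("restore FULL backup taken:", base.date())
    print("then apply incrementals:", [t.date().isoformat() for t in to_apply])
    print("worst case, all data after", restore_point.date(), "is lost")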

The REAL complexity with recovery of lost databases lies in the synchronization of data across systems and across those backup images. To illustrate the problem, imagine a company with ten million customers that gains 80,000 customers per month and divides its online commerce, billing, inventory, shipping, agent support and customer support functions across five systems. For each customer, there will be

* customer account numbers
* order numbers
* serial numbers in inventory
* order numbers, shipping numbers and serial numbers in shipping
* trouble ticket numbers in the agent support function
* customer account / login / serial number information in the customer support function

With that customer activity, one could imagine daily order activity reaching 2,667. If most of that activity is concentrated over 16 hours, that's 167 orders per hour. (Multiply these numbers by 1000 if they aren't sufficiently scary.)

Even if the firm is creating backups for these systems religiously on a fixed schedule, there is no way to synchronize the backups to start and complete at EXACTLY the same time. One backup might finish in 20 minutes, another might take 2 hours. When each of those backups completes, they might all reflect data as of 3/14/2024 but their actual content might vary by 100 minutes worth of online orders, etc. If the company becomes a ransomware victim and these systems are restored from full / incremental backups, it is possible for a system that ASSIGNS a key such as "ordernumber" to be BEHIND other systems which reflect ordernumber values used prior to the ransomware corruption. For example,

BILLING: newest values prior to ransomware attack:
* 1111122
* 1111123
* 1111124
* 1111125 <--- the absolute most recent assigned ORDERNUMBER

SHIPPING: newest values seen prior to ransomware attack:
* 1111122
* 1111123
* 1111124
* 1111125 <--- in sync with BILLING (good)

If these systems are restored from backup and the newest combination of (full + incremental) backup for BILLING winds up two hours BEHIND the newest combination of (full + incremental) backup for SHIPPING, the victim could wind up in this state:

BILLING: newest values after restoration from backup:
* 1110788
* 1110789
* 1110790
* 1110791 <--- missing 334 values from 1110792 thru 1111125 -- danger

SHIPPING: newest values after restoration from backup:
* 1111122
* 1111123
* 1111124
* 1111125 <--- correct but "ahead" of BILLING by 334 values (not good)

With this content, if the BILLING system isn't "advanced" to use 1111126 as the next assigned ORDERNUMBER before being restored to live use, it will generate NEW orders using 334 ordernumbers that have already been assigned to other customers. This scenario can create two possible problems. At best, it creates an obvious relational data integrity conflict that might cause the application to throw previously unseen errors that block an order from being processed. A far worse scenario is that the applications cannot detect the duplication of ORDERNUMBER values and assign them to a second customer. That could allow customers to begin seeing the order contents and account information of a different customer.

This is one scenario where all of the system backups were nominally targeted to run on the same full and incremental frequencies at approximately the same dates and times. What if March 14, 2024 at 10:00pm was chosen as everyone's restoration target but the incremental backup for the database behind SystemC for that date/time is corrupt for unrelated hardware reasons (hey, stuff happens...) and the next most recent backup is from two days prior? Now that data gap could be two days' worth of transactions, posing a much wider opportunity for duplicate keys that will either breach confidentiality or trigger low-level database faults in the application, causing more actions to fail for employees and customers.

It is possible to design new systems from scratch to synthesize random values for use in joining these types of records together, avoiding this synchronization problem entirely. However, most existing systems built over the last twenty years were not designed with these types of recovery scenarios in mind. They were designed for relational integrity in a perfect world where no one could corrupt the assigned keys and a new true state never had to be re-assembled from out-of-sync backups. Since the systems weren't DESIGNED with these potential problems in mind, few administrators or developers have contemplated the work required to analyze the data in the recovered systems, identify the newest keys present in each, then alter all other databases to skip ahead to those values while the gaps in the older records are filled back in by hand.
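As a thought experiment, that reconciliation step might look something like the sketch below: survey each restored system for the newest key it actually contains, then restart every key generator safely above the global maximum. Everything here is hypothetical -- the system names, the ordernumber_seq sequence name and the safety margin are illustrative, not drawn from any real product:

```python
# Newest ORDERNUMBER actually present in each restored system
# (values echo the BILLING / SHIPPING example above).
newest_key_by_system = {
    "BILLING":  1110791,   # restored from an older backup chain -- behind
    "SHIPPING": 1111125,   # restored closest to the moment of the attack
    "SUPPORT":  1111093,
}

SAFETY_MARGIN = 1_000  # skip extra values in case some systems could not be surveyed

global_high_water_mark = max(newest_key_by_system.values())
next_safe_value = global_high_water_mark + SAFETY_MARGIN + 1

for system, newest in sorted(newest_key_by_system.items()):
    gap = global_high_water_mark - newest
    print(f"{system}: newest restored key {newest} (behind global max by {gap})")
    # Illustrative only -- a PostgreSQL-style sequence bump an operator might review:
    print(f"  -- ALTER SEQUENCE ordernumber_seq RESTART WITH {next_safe_value};")
```

The hard part isn't the dozen lines above; it's that nothing in most legacy systems records these high water marks in the first place, so the "survey" step turns into days of ad hoc queries while the business waits.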


Ransomware Mitigation Strategies

AI as an answer? In a word, NO. There may be consulting firms willing to pocket millions in professional services and develop a slide presentation for executives stating an organization has covered its risks from ransomware. However, no honest firm can state confidently that an organization will a) AVOID a ransomware attack or b) RECOVER from it without affecting its mission and incurring exorbitant expenses in recovery. Why?

First, existing security software generalizes PRIOR attack schemes into heuristics applied to current telemetry to identify system behaviors characteristic of known threats. These systems to date cannot PREDICT a novel combination of NEW infection paths and behaviors to stop a new attack in its tracks before an organization becomes "victim zero" for that attack.

Second, the ultimate meta version of the first argument is the claim that artificial intelligence will overcome the limitation of only analyzing PRIOR attacks when automating discovery and prevention. Certainly, security firms will begin touting "AI" capabilities in pitches to executives, trying to convince them that AI-based intrusion detection systems will solve the rear-view mirror problem. By definition, applying AI to this problem requires ceding control of an organization's systems TO an AI system. By definition, if the AI system is worth its licensing fees, it has "understanding" of security vulnerabilities and protective measures that humans do NOT have or are incapable of explaining. But if the AI system's "understanding" is beyond human understanding, that same AI system could be compromised in ways which facilitate its exploitation in an attack, and the victim might have even less insight into the initial "infection phase" or fewer ways of recovering if the AI is in fact complicit.

Is this far-fetched? ABSOLUTELY NOT. The SolarWinds breach that occurred in late 2020 had nothing to do with AI technology. However, it was an example of a supply-chain breach. As more network monitoring and security systems are integrated with external data to provide more information to identify new threats, the instances of those systems within companies are trusted to update more of their software AUTOMATICALLY, in REAL TIME, from the vendor's systems. If an attacker wants to attack Companies A, B, C and D but knows they all use AI-based vendor X for intrusion detection, the attacker can instead work to compromise vendor X's systems to deliver their "shim" directly to A, B, C and D without raising an eyebrow. Administrators at A, B, C and D can diligently verify they are running the latest software from X, they can verify the digital signatures on that software match those on the vendor's support site, and they'll have zero clue the software from X is compromised. In the case of the SolarWinds attack, the shim had been embedded within a SolarWinds patch that had been installed at many customers for MONTHS prior to it waking up to perform its task.

Air-gapped systems? The value of air-gapped systems is complicated. Creating completely isolated environments to run backup instances of key applications or house the infrastructure managing backups is recommended by many security experts. Conceptually, these environments have zero electronic connectivity (Ethernet or Wi-Fi) to "regular" systems, hence the term air gap. This separation is required because of the long potential interval between "infection time" and "onset time." An attacker who is patient or is coordinating a very complicated, large scale attack may stage the infection phase over MONTHS. (Again, in the SolarWinds case, some victims found their systems had been compromised for three to six months.) Once a victim realizes their systems are corrupt, they need a minimum base of servers, networking, file storage and firewall hardware they "know" is not compromised to use to push clean operating system images, application images and database backups back to scrubbed machines. If this "recovery LAN" was exposed to the regular network that is infected, then the "recovery LAN" gear could be infected too and the victim has no safe starting point from which to rebuild.

Any organization implementing an air-gapped environment must be realistic about the costs involved to build and maintain it. An air-gapped environment can be pitched in two ways. The first is strictly as a "seed corn" environment -- one sized only to handle the servers and storage required to house backups of databases, backups of server images (easier to do with modern containerization) and the routers, switches and firewalls required to use the air-gapped environment as the starting point to venture back out into the rest of the data center as racks are quarantined, wiped, reloaded and put back online. The second way some organizations think of an air-gapped environment is as a second form of disaster recovery site -- one housing enough server, storage, networking and firewall resources to operate the most critical applications in a degraded, non-redundant state. Duplicating hardware for even this reduced set of capabilities is very expensive to build.

More importantly, regardless of which flavor (seed corn or mini DR) is pursued, this extra air-gap environment is a perpetual expense for hardware, licensing and personnel going forward. The likelihood of most senior management teams agreeing to this large uptick in expense and consistently funding it in future years is near zero. Part of the reason most systems in large corporations exhibit the flaws already described is because "business owners" who wanted the applications are only willing to support funding when they are new and delivering something new of value to "the business." Once that becomes the current state, extracting funding from those "business users" to MAINTAIN that current state and modernize it even if new functionality isn't provided involves rules of logic and persuasion which only function in the Somebody Else's Problem Field. (I will give the reader a moment to search that reference...)

Active / Archive Data Segmentation. One important strategy to minimize operational impacts of a ransomware attack and minimize recovery windows is to more explicitly and aggressively segment "active" data needed to process current actions for users / customers from "archival" data used to provide historical context for support or retention for regulatory compliance. An application serving ten million customers over the last five years may have a terabyte of data in total, but the space occupied by CURRENT customers and CURRENT pending / active orders might only be five percent of that terabyte. In this scenario, continuing to keep ALL of that data in a single database instance or (worse) in single database TABLES for each object means that if the database is lost or corrupted, a terabyte of data will need to be read from backup and written to a new, empty recovery database before ANY of the records are available. The same would be true if the database was lost due to innocent internal failures like the loss of a SAN housing the raw data.

Applications for large numbers of customers can be designed to archive data rows no longer considered "current" to parallel sets of tables in the same database or (preferably) in a separate database so the current data required to handle a new order or support an existing customer with a new problem can be restored in a much shorter period of time. Data considered archival can be treated as read-only and housed on database instances with no insert / update capabilities behind web services that are also read only. Network permissions on those servers can be further restricted to limit any "attack surface" leveraged by ransomware.
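A minimal sketch of the idea, using Python's built-in sqlite3 as a stand-in database and a hypothetical ORDERS table (a real deployment would archive into a separate, read-only instance rather than a sibling table in the same database):

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (ordernumber INTEGER PRIMARY KEY, status TEXT, closed_on TEXT)")
conn.execute("CREATE TABLE orders_archive (ordernumber INTEGER PRIMARY KEY, status TEXT, closed_on TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1111123, "CLOSED", "2022-05-01"),
    (1111124, "CLOSED", "2024-03-10"),
    (1111125, "OPEN",   None),
])

def archive_closed_orders(conn, keep_days=365):
    """Move CLOSED orders older than the cutoff into the archive table."""
    cutoff = (datetime.now() - timedelta(days=keep_days)).strftime("%Y-%m-%d")
    conn.execute(
        "INSERT INTO orders_archive "
        "SELECT * FROM orders WHERE status = 'CLOSED' AND closed_on < ?", (cutoff,))
    conn.execute("DELETE FROM orders WHERE status = 'CLOSED' AND closed_on < ?", (cutoff,))
    conn.commit()

archive_closed_orders(conn)
print("active rows:  ", conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])
print("archived rows:", conn.execute("SELECT COUNT(*) FROM orders_archive").fetchone()[0])
```

With the archive living in its own read-only instance, a ransomware recovery only has to restore the small "active" slice before customers can transact again; the historical bulk can follow later at leisure.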

From experience, virtually no home-grown or consultant-designed system in Corporate America takes this concept into account. The result is systems that require each "full backup" to clone the ENTIRE database, even when ninety percent of the rows haven't changed in two years. More importantly, when the system is DOWN awaiting recovery, a restoration that should have taken 1-2 hours might easily take 1-2 DAYS due to the I/O bottlenecks of moving a terabyte of data from backup media to the server and SAN. If there are six systems involved in a firm's "mission critical" function, these delays mean ZERO use of the function until ALL of the systems have completed their database restoration.
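A back-of-the-envelope calculation shows why the segmentation matters so much for recovery time. The 10 MB/s effective throughput below is purely an assumption standing in for the real-world drag of index rebuilds, validation and shared I/O contention; plug in your own number:

```python
def restore_hours(data_gb, effective_mb_per_sec):
    """Rough restore time: data volume divided by effective end-to-end throughput."""
    return data_gb * 1024 / effective_mb_per_sec / 3600

full_database_gb = 1024   # everything, current and historical, in one set of tables
active_slice_gb = 51      # the ~5 percent that is actually "current"
throughput_mb_s = 10      # assumed effective throughput, not raw disk speed

print(f"full database: {restore_hours(full_database_gb, throughput_mb_s):5.1f} hours")
print(f"active slice:  {restore_hours(active_slice_gb, throughput_mb_s):5.1f} hours")
```

Under that assumption the full terabyte takes roughly 29 hours per system while the active slice takes under two -- the difference between the 1-2 hour and 1-2 day outcomes described above.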

Designing for Synchronicity Issues. No application I encountered in thirty-plus years of Corporate America was designed with a coherent view of overall "state" as indicated by key values across the databases involved with the system. Most systems were organically designed to start from id=0 and grow from there as transactions arrived. They were NOT designed to allow startup by consulting a separate set of tables that identified a new starting point for ID values of key objects. As a result, recovery scenarios like those above, where not all objects can be recovered to the same EXACT point in time, create potentially fatal data integrity problems.

Going forward, software architects need to explicitly design systems with a clear "dashboard" of the application's referential integrity: the ID values of all key objects should be tracked continuously as "high water marks," and those high water marks should be adjustable upwards prior to restoring the system to service, avoiding conflicts if some databases had to revert to older points in time with older ID values.
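A sketch of what such a "dashboard" might look like in miniature -- a tiny, independent registry that every system feeds as it assigns keys, consulted before any restored system is allowed back online (the function names and the in-memory dict are purely illustrative; a real version would live in its own hardened store):

```python
# (system, object) -> highest key value that system has ever reported assigning
high_water_marks = {}

def record_assignment(system, obj, key_value):
    """Called (or batched) whenever a system assigns a new key."""
    current = high_water_marks.get((system, obj), 0)
    high_water_marks[(system, obj)] = max(current, key_value)

def minimum_safe_restart_value(obj):
    """Before restoring any system to service, start its key generation above every
    value ANY system has ever reported for this object."""
    peaks = [v for (_, o), v in high_water_marks.items() if o == obj]
    return max(peaks) + 1 if peaks else 1

record_assignment("BILLING",  "ordernumber", 1111125)
record_assignment("SHIPPING", "ordernumber", 1111125)
record_assignment("BILLING",  "ordernumber", 1110791)  # stale value after a restore -- ignored by max()
print(minimum_safe_restart_value("ordernumber"))        # 1111126
```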

Ideally, for an entirely new system with no integrations to legacy systems, any sequential or incrementing ID value in the system's data model should be avoided. Doing so allows a system to be resurrected after infection / corruption and begin creating new, clean records immediately, without needing to "pick up where it left off" from a point no one can accurately identify. This is a very difficult design habit to break since many business metrics rely on ID values being sequential to provide a quick gauge of activity levels.
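For a green-field system, the habit change is as small as the sketch below -- random identifiers carry no "position," so a system restored to an older point in time can resume creating records immediately without colliding with keys already handed out elsewhere (uuid4 shown as one common option, not a prescription):

```python
import uuid

def new_order_id():
    """Random, collision-resistant identifier with no embedded sequence position."""
    return str(uuid.uuid4())

# If the business still wants a quick gauge of activity levels, count rows or keep a
# separate metrics feed instead of inferring volume from the newest sequential ID.
print(new_order_id())   # e.g. '1b4e28ba-2fa1-4d3c-9c5a-...' -- different on every run
```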

Integrated Training for Architects, Developers and Administrators. The roles of architects, software developers and administrators are highly specialized and often rigidly separated within large organizations. It is safe to say most architects are unfamiliar with the nuts and bolts of managing and restoring backups across legacy SQL databases, distributed stores like Cassandra, caches like Redis or memcached, or "big table" style systems like Hadoop's HBase or Google's Bigtable. Most developers are never exposed to the application's entire data model and instead see small pieces in a vacuum, without context. Most administrators (database admins and server admins) know operational procedures inside and out but aren't involved at design or coding time to suggest model improvements or spot potential problems coming from the design.

Because of these behavioral and organizational silos, very few applications are designed with the goal of gracefully surviving a massive loss of data and restoring service within hours rather than days or weeks. Allowing for more coordination between these areas of expertise takes additional time, and these teams are already typically understaffed for the types of designs common ten years ago, much less the complexities involved with "modern" frameworks. One way to allot time for these types of discussions is to at least add time to new system projects for analysis of disaster recovery, failover from one environment to another, and data model considerations. However, past experience again indicates that attempting to delay the launch of a new app for "ticky dot" stuff is akin to building a dam then asking EVERYONE to wait while engineers sanity-test the turbines at the bottom. Even though a misunderstood design or a build flaw MIGHT cause the dam to implode once the water reaches the top, there will always be executives demanding that the floodgates be closed to declare the lake ready for boating as soon as possible -- even if a problem might be found that proves IMPOSSIBLE to solve with a full reservoir.

The More Sobering Takeaway from Ransomware

As mentioned at the beginning, there are few meaningful technical distinctions between the mechanisms malware and ransomware exploit to target victims. The primary difference lies in the motivation of the attacker. As described in the mechanics of these attacks, an INFECTION of a ransomware victim does not always lead to an actual loss of data by that victim. It is possible far more victims have been infected but were never "triggered" because either the attackers didn't see a big enough revenue opportunity, or they saw a victim so large in size or economic / political importance that the attackers didn't want to attract the law enforcement focus that would result. The economics of ransomware work best on relatively small, technically backward, politically unconnected victims who can pay six or seven figure ransoms and want to stay out of the news.

Ransomware creators have likely internalized another aspect of financial analysis in their strategy. The cost of creating any SINGLE application capable of operating at Fortune 500 scale is typically at least $10 million, and the quality of such applications is NOT good, whether measured by usability, functionality, operational costs or security. The cost of integrating MULTIPLE systems capable of operating at Fortune 500 scale to accomplish some function as a unit can readily approach $30-50 million, and since the quality of the pieces is typically poor, the quality of the integrated system is typically worse.

Leadership is typically convinced that if it costs serious dollars to get crap, it will cost even more to FIX crap and even more to design a system that ISN'T crap from the start. Since current systems for the most part "work" and the company isn't appearing on the front page of the Wall Street Journal, leaders adopt a "Get Shorty" mindset regarding spending money to fix or avoid flaws in their enterprise systems that will only surface once in a blue moon. "What's my motivation?"

Well, as long as they don't get hit, there is no motivation. If they DO get hit but the ransom is only (say) $1 million, they see themselves as sophisticated, rational risk takers and say, "I avoided spending $10 million to fix the flaw, I got burned but it only cost me $1 million and a couple of days of disruption? I'm a genius." Frankly, that is the mindset the attackers are betting on. If they started charging $20 or $30 million to return a victim's data, those victims would definitely be rethinking their IT strategy, vulnerabilities would decline somewhat, and fewer companies would pay.

As stated before, however, that mindset of rationalized complacency does NOTHING to protect an organization if the attacker actually wants to damage the company. This is a sobering point because the same attack mechanisms COULD be used at any point for far wider economic, social or military damage. These more drastic outcomes are NOT being avoided because the money spent on existing intrusion detection and mitigation tools is "working," or because large corporations are simply ready for these failures and are curing them quickly / silently as they arise. These more drastic outcomes are not yet happening solely because those with the expertise to initiate them are still choosing to limit the scope of their attacks and in many cases are in it for the money. Businesses and governments are NOT prepared to fend off or recover from the type of damage that can result if these same capabilities are leveraged more widely for sheer destruction. In a data-driven world, mass destruction of data can directly cause mass destruction and disruption at a scale not previously contemplated. Organizations that state with confidence they are ready for whatever comes their way aren't grasping the full picture and are likely in denial about the survivability of their systems.


WTH

Tuesday, March 12, 2024

God Does Not Call The Qualified

You can't make this stuff up. The Trump family takeover of the Republican National Committee is nearly complete. After the reins were officially handed over to former North Carolina GOP chair Michael Whatley and current Trump daughter-in-law Lara Trump, nearly sixty party leaders across the country were either pushed out outright or asked to resign and re-apply for their positions. These positions aren't necessarily all at the top. Many involve those operating at a "liaison" level between the national party and state parties. Of course, many state GOP organizations are just as scrambled as the national committee, both in terms of leadership and funds on hand. Further severing lines of communication between a national party becoming increasingly inward / Trump focused and the state organizations which would typically focus on get-out-the-vote efforts sounds like a REALLY bad strategy.

One GOP leader from Vermont, Paul Dame, appeared on MSNBC to describe his take on the change in leadership and what it means for the RNC financially and politically. He was one of several state GOP leaders who had attempted to get an agreement in writing with the Trump campaign that RNC funds would NOT be used for legal fees for Trump but found that consideration of the proposal was tabled via procedural maneuvers when an insufficient number of states had joined a meeting where the issue was raised. (Convenient, huh?) Subsequent to that meeting, Dame said he spoke personally with Lara Trump and was given assurances her stance was unchanged, meaning no attempt to use RNC funds for legal fees would be made. He voted FOR the new leaders in their internal meeting on March 8 and, as he put it,

(I) left thinking, alright, I'm gonna give these folks the benefit of the doubt and the first business day after that, there were major changes at the RNC, and it's really affecting people closest to the ground. A lot of the regional political directors and state directors were being affected by this and that's the people that state party people like mine interact with on a regular basis so it's kind of created some uncertainty about what's going to be happening at the RNC moving forward and what (sic) the state organizations will interact with the national organization.

He was then asked this blunt question (paraphrased):

You said you were in Houston Friday and VOTED for Lara Trump, you said you wanted to give them the benefit of the doubt, but did you really think putting someone like Lara Trump in a position like that, being a daughter-in-law of Donald Trump, did you really think that was going to be the best idea?

He answered saying he had "very frank and direct" conversations with both Lara Trump and Michael Watley before the vote on this issue and was given assurances from both that using RNC funds for legal purposes was out the window.

I'll leave it to the reader to parse that sentence CAREFULLY and resist interpreting it in any way other than the one Dame likely meant. He of course meant that the use of RNC funds for legal counsel expenses was supposedly out the window -- not that the use of RNC funds for any otherwise lawful purpose was out the window.

How naive are these remaining Republican Party operatives still clinging to the idea of a GOP apparatus not completely slimed to the core by the corruption of Trump? Well, one way to gauge the self-delusion is to actually hear the speeches being made in support of surrendering the party to the Trump syndicate. In that Houston meeting of the Republican Party leadership, the woman -- identified only as RNC member Beth Bloch -- who spoke prior to Lara Trump's installation actually said this:

https://au.news.yahoo.com/rncs-endorsement-lara-trump-co-211357254.html

"In a world where qualifications are often measured by titles and years of experience, we are reminded of a powerful truth: God does not call the qualified; he qualifies the called," she said. "Lara Trump is the embodiment of this truth."

Keep in mind these words are being spoken about a man and his family facing $455 million in penalties from a civil fraud judgment and another $93 million in civil defamation penalties, who has over $1 billion in mortgages coming due in the next five years or so, and who has had NUMEROUS companies file for bankruptcy, stiffing thousands of creditors and contractors.

When will they learn?

Well, according to Paul Dame, this current leadership crew is only filling out the term vacated by the resignation of Ronna McDaniel. The leadership positions will open up in January 2025, after the 2024 election results are in, and they'll have a chance to change direction if needed then.

Maybe. Maybe not.


WTH

Thursday, March 07, 2024

State of the Union 2024: Democracy for Sale

The judge in Donald Trump's New York civil defamation case has denied Trump's request to reduce or eliminate his obligation to post a bond for the full $93 million judgment of penalties and interest. It seems certain the judge in his larger civil fraud trial, with its whopping $455 million judgment, will come to the same decision shortly.

Trump CLEARLY does not have the cash, nor can he produce it even with fire sales of some of his assets. It seems pretty clear at this point that one of Trump's many, MANY problems is that he probably understands very little about accounting, even when he really needs to. He and his lawyers stated under oath and in court documents during these cases that he was worth $14 billion. His current judgments total about $554 million. That's 4.0% of $14 billion. Most smart billionaires like to keep some powder dry, some cash in their pocket in case some incredible buy comes up. Might be a new IPO. Might be a deal with a buddy with can't-miss potential. Might be some dumb-ass having to unload an asset to cover some other business failure, divorce settlement, etc.

Trump has ALWAYS been highly leveraged, from the 1970s through today. He maintains virtually zero cash because he is either a) spending it to maintain the appearance of a much wealthier person as part of his business charade or b) using it to pay transaction fees as he constantly rolls over existing debt that hasn't been paid down into new loans. That's one problem. The other problem is that Trump, with his ADHD-addled, narcissistic mind, likely does not understand the difference between the VALUE OF A PROPERTY he "owns" and the value of his SHARE of the property. Other than Mar-a-Lago, I would guess Trump owns 100% of virtually none of the properties he lists in his mind as his. I would estimate that on MANY of those properties, he only owns a twenty percent interest. Even if one believes his PROPERTIES might be worth $14 billion, his OWNERSHIP share of those assets is likely far below 100%, meaning he ISN'T worth $14 billion. He might only be worth twenty percent of that, or $2.8 billion, and NONE of it has ANY liquidity.

And everything up until now? The plotting? The riot? The doing nothing during the riot? The illegal relocation, housing and failure to return national security documents upon subpoena? The two civil defamation suits? The civil fraud trial? The four criminal indictments?

Those are NOTHING.

THIS is when the real problems start.

An individual who just cemented the nomination of a major party to run for President of the United States is literally broke, in every meaningful sense of the word, and owes over ONE HALF BILLION DOLLARS as adjudicated in the American legal system. The judgment in his civil fraud trial put his company under a microscope for all business transactions but did NOT subject his personal accounts to the same external review. He will be literally accepting any visitor to Mar-a-Lago willing to contribute to sustaining his lifestyle while he continues fending off four prosecutions involving crimes that strike at the heart of our democracy and threaten its very survival.

What do the potential conflicts of interest look like for America? Elon Musk already made the trek to Florida to discuss SOMETHING with Trump. Does anyone think they talked about battery charging standards for autos? Control system design for rocket systems? What else could the world's richest individual have to talk about with the world's neediest individual who needs tens of millions in cash per month to sustain efforts to keep him out of jail? What could Musk expect from someone so cash-strapped? Well, if that person becomes President again, he could throw a lot more money at Elon Musk for defense / spy satellite launches. He could manipulate auto and labor regulations in ways to aid a major car maker navigating uncertain markets for battery electric vehicles. He could manipulate regulation of social media providers in ways that could help Musk while harming the country.

March 7, 2024 happens to be the planned date for the State of the Union address of the President to Congress and the public. Biden will attempt to outline his accomplishments so far in steering the American economy out of a COVID funk better than any other western economy, his work to provide student debt relief to millions of college graduates and efforts to provide significant, sustained funding to beef up infrastructure of all types across the country to counter decades of decay and underfunding and evolve for a more climate-friendly electrified future. All very notable accomplishments and goals. However, the impact of those efforts will be completely undone if Donald Trump becomes beholden to more domestic and foreign billionaires in his efforts to climb out of his own legal and financial black hole.

If Trump regains power, it is one hundred percent certain he will do so on the shoulders of some of the most corrupt players on the world stage and even more corrupt players currently hiding backstage. And no one gives a half billion dollars to someone else without expecting something in return. They already know Trump is not good for the money. But no worries. A man that dumb and corrupt is good for many other things.


WTH