Wednesday, September 04, 2024

Are AI Vendors Cheating?

A YouTube Short from the WelchLabs channel appeared in my recommendations and raised an interesting insight regarding current claims of AI capabilities. After reading more on the issue, that finding serves as a useful prompt for explaining some other limitations of AI technology that engineers and business leaders are not fully comprehending.

Here is a link to the WelchLabs short with his take on the root issue:

https://www.youtube.com/watch?v=kRVrwjiflrU

A brief summary of the issue being discussed is provided here for context, followed by an analysis of its economic, legal and ethical implications.


How to Train your LLM Dragon

AI systems using Large Language Models (LLMs) operate by ingesting terabytes or petabytes of text data representing anything you can render in text – novels, news stories, computer source code, questions about computer source code, answers about questions about computer source code, etc. – then feeding that text through a multi-stage process. That process

  1. splits the entire input set into two subsets:
    • a training set – typically about 90% of the original input
    • a validation set – typically about 10% of the original input
  2. chops the training set into arrays of smaller tokens or letters
  3. uses powerful specialized chips to perform billions of matrix algebra calculations that derive probabilities of the next token to follow a prior set of tokens based on the training data
  4. tests the validity of the probabilities derived from the training data by feeding the LLM strings from the validation set and measuring how accurately it predicts the next token or phrase
  5. takes the results of that test and iteratively feeds them back into the model to adjust its probabilities

If a team building an LLM-based AI follows these procedures, obtains a data set whose content is reasonably uniform in nature, and properly splits it 90% / 10% into training and validation sets, they can confidently state that a large language model that completes a random string taken from the validation set with “acceptable output” 97.98% of the time is performing better than an LLM that only yields acceptable output 96.1% of the time.
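To make the split and the scoring a little more concrete, here is a minimal Python sketch of steps 1 and 4 above. The function names, the whitespace "tokenizer" and the model_predict callback are illustrative assumptions for the sketch, not any vendor's actual pipeline:

  import random

  def split_corpus(documents, train_fraction=0.9, seed=42):
      """Shuffle the corpus and split it into training and validation subsets."""
      rng = random.Random(seed)
      docs = list(documents)
      rng.shuffle(docs)
      cutoff = int(len(docs) * train_fraction)
      return docs[:cutoff], docs[cutoff:]

  def next_token_accuracy(model_predict, validation_docs):
      """Score a model by the fraction of validation positions where
      model_predict(prefix_tokens) returns the token that actually follows."""
      hits = total = 0
      for doc in validation_docs:
          tokens = doc.split()          # toy whitespace "tokenizer"
          for i in range(1, len(tokens)):
              if model_predict(tokens[:i]) == tokens[i]:
                  hits += 1
              total += 1
      return hits / total if total else 0.0

Two competing models can then be ranked by comparing the scores they earn on the same held-out validation set, and that single number is what ends up being quoted in benchmark tables.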

Engineers designing these models rely upon this scoring concept to evaluate their progress in improving the underlying algorithms. Corporate executives managing companies selling these LLMs rely upon these scores to promote their product over the competition.

This competition for engineering bragging rights and revenue makes performance on these tests crucial to everyone in the industry. Any "leak" of validation data into the training data, whether by innocent administrative faux pas or active attempt to manipulate results, has the potential to falsely raise the performance scores of the LLM. Experts who design independent benchmarks for evaluating LLMs have devised an approach referred to as a "canary" for detecting such issues. The "canary" is a long random string that would normally NEVER occur in "nature" (normal human literature, source code, etc.). Embedding that string ONLY in text to be used as validation data guarantees that if the LLM ever spits out that canary string, it had to have encountered it in its training.

The easiest way to produce canary strings is to use a cryptographic hash algorithm such as SHA-256 to generate a fixed-length hash of some subset of other validation data, then include that hash in the validation data. For example, the SHA-256 hash of this string

This is a random string of text supposedly never
occurring in all of the millennia of recorded history
that would be placed in the validation data when
training a Large Language Model.

is

a54cfcf5c13e22d19d52c59320cf7b337723b5e8f221a42dd905f85d6036e7bb

The benefit of a hash algorithm is that ANY change to the input being hashed results in an output hash value that is not remotely close to the prior hash value. For example, if the above string is altered by adding one additional word:

This is a long, random string of text supposedly
never occurring in all of the millennia of recorded
history that would be placed in the validation data 
when training a Large Language Model.

the output hash value looks totally different:

4da1dbe466ab22c43ae1ce712608354183e1fa03ca5e92a353ca0afdde36cab8
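Generating such a canary hash is a one-liner in most languages. Here is a minimal Python sketch using the standard hashlib module; the exact hex digest depends on the exact bytes hashed (line breaks and spacing included), so treat the digests above as illustrating the idea rather than values this snippet will reproduce verbatim:

  import hashlib

  def canary_hash(text: str) -> str:
      """Return the SHA-256 hex digest of the given text."""
      return hashlib.sha256(text.encode("utf-8")).hexdigest()

  original = ("This is a random string of text supposedly never "
              "occurring in all of the millennia of recorded history "
              "that would be placed in the validation data when "
              "training a Large Language Model.")
  altered = original.replace("a random string", "a long, random string")

  print(canary_hash(original))
  print(canary_hash(altered))   # one added word, a completely different digest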

If a “canary” is dropped into the last terabyte of a 10-terabyte data set, and that data set is split 90% / 10% into training and validation sets with the canary string landing in the validation set, there should be ZERO probability that an LLM trained solely on the 9 terabytes of canary-free training data could ever spit out that long random canary string as an answer to any prompt. If it does, it’s a certainty that the validation data was fed into the training data and the AI’s performance numbers are thus artificially higher than its core algorithms are capable of providing.
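The check itself is trivial; the hard part is everything upstream. A hedged sketch of the test, assuming some generate(prompt) function wrapping the model under test and a list of probe prompts chosen by the benchmark authors (both are placeholders, not a real evaluation harness):

  CANARY = "a54cfcf5c13e22d19d52c59320cf7b337723b5e8f221a42dd905f85d6036e7bb"

  def canary_leaked(generate, probe_prompts, samples_per_prompt=20):
      """Return True if the model ever emits the canary string.

      A model trained only on the canary-free training split should
      never reproduce CANARY, no matter how it is prompted."""
      for prompt in probe_prompts:
          for _ in range(samples_per_prompt):
              if CANARY in generate(prompt):
                  return True
      return False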


Are AI Vendors Cheating?

Cheating is a loaded word. It implies a selfish motivation for ignoring a rule as a means of improving one’s outcome over those of others in a competition. What CAN be said about the detection of canary strings in LLM output is that the engineers and companies designing and operating these LLM models have what could be termed a provenance problem, one that is both a technical problem and a legal / ethical problem, and it remains unsolved.

At a technical level, LLM engineers are aware these canary strings exist in public test data sets. However, the tools developed to ingest the terabytes of raw data needed for training and validation lack the controls required to explicitly include and exclude specific sources. That has profound practical consequences. If the engineers are simply turning a mindless “crawl script” loose on millions of web sites as a means of accumulating data with no thought as to the validity of the underlying data, the LLM content can be rigged by purposely exposing manipulated data.

Why? And more importantly, how?

The LLM training process doesn’t KNOW anything. For example, it doesn’t scientifically KNOW that

E = mc^2

It can only spit that out in relation to a discussion of physics because it has seen that sequence of characters in millions of references in physics papers, journals and textbooks dating back to 1905 and some dude named Albert Einstein. If someone wanted to hack an LLM to change its answer for “what is the relationship between energy and mass”, they could theoretically spin up millions of web sites faking other scientific papers that include

E = mc^3

and create fake links between those papers with fake publication dates, etc. No matter how much logic was added to the LLM training to try to “pre-scrub” different sources based on original discovery date, URL, reputation, etc., some of those fake E = mc^3 references would make it into the training data and eventually alter the probabilities used to answer the question… INCORRECTLY.
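A toy illustration of why that attack works: next-token probabilities are, at bottom, relative frequencies, so flooding the corpus with fake references shifts the estimate. The tiny “corpus” below is obviously a stand-in for real training data, and a real LLM does not count bigrams this literally, but the pressure on the probabilities is the same:

  from collections import Counter

  def continuation_probs(corpus_lines, prefix="E ="):
      """Estimate P(next token | prefix) by counting what follows the prefix."""
      prefix_tokens = prefix.split()
      counts = Counter()
      for line in corpus_lines:
          tokens = line.split()
          for i in range(len(tokens) - len(prefix_tokens)):
              if tokens[i:i + len(prefix_tokens)] == prefix_tokens:
                  counts[tokens[i + len(prefix_tokens)]] += 1
      total = sum(counts.values())
      return {tok: n / total for tok, n in counts.items()}

  honest = ["E = mc^2"] * 1000
  poisoned = honest + ["E = mc^3"] * 3000

  print(continuation_probs(honest))    # {'mc^2': 1.0}
  print(continuation_probs(poisoned))  # {'mc^2': 0.25, 'mc^3': 0.75}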

From a legal and ethical standpoint, detection of canary strings in LLM output confirms that AI vendors have yet to design their systems to provide a mechanism for data sources to opt out of having their data ingested into an AI system. Publication of content on a publicly accessible website is NOT legal approval for that content to be ingested into another system, abstracted, then used for financial gain or other purposes. The fact that major vendors in this space have been essentially crawling the web for four years to develop and train AI systems without the consent of content owners reflects a stunning leapfrog beyond existing legal concepts and existing intellectual property protections. The method by which these systems were launched amounts to some of the largest oligopolies in history telling their government and the citizenry, “Just try and stop us.”


Scrubbing Large Language Models

This canary problem points to a more fundamental flaw in large language models: the ability to correct the data within them. There IS no correction capability within a large language model. The summary above about training stated that LLMs function by parsing source data into smaller tokens and letters, then mathematically deriving the probabilities of the “next token” based on prior tokens. The training process itself involves loading the raw training data as tokens or characters into arrays with thousands of elements, doing the matrix algebra, then saving the calculated probabilities in output matrices and iterating the process with more training data. Once the training phase is completed, the final running model consists ONLY of those matrices of probabilities. None of the source data is referenced again when processing the requests of individual users.

The segmentation of modeled probabilities from the original data means that if the operators of the model are informed that their training data included text from 10,000,000 social media posts that were forwarding a meme that joked that E = mc^3, there is no way for those operators to go into their running LLM system and surgically purge the ten million references that threw off the model’s accuracy when prompted for the relationship between energy and mass. The mathematical models don’t preserve a backward “chain of provenance” between an output probability and the training data that produced it.

By analogy, this problem is equivalent to being handed a collection of one million integers, computing their average, then throwing away the one million integers and keeping only the average. If someone then tells you that ten thousand of the inputs were bogus and should not have been used, all you have is the average and maybe the count of values from the original data. The only way to correct your average is to re-gather the original input set, somehow identify and exclude the ten thousand bogus values, then calculate a new average.
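In code form, the dead end looks like this (toy numbers standing in for the million values):

  values = [10, 12, 11, 9, 500, 480]    # the last two are bogus readings
  count = len(values)
  average = sum(values) / count          # 170.33...
  del values                             # the original inputs are gone

  # Later: "two of the inputs were bogus, please remove their effect."
  # With only `average` and `count` left there is nothing to subtract;
  # you cannot tell how much the bogus values contributed without
  # re-gathering the originals and recomputing:
  #   corrected = sum(good_values) / len(good_values)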

But in the LLM world, the discovery of canary strings in LLM output seems to have proven that LLM designers and operators either don’t care that they lack control over which data is being ingested to generate their model or (equally likely) the logic doesn’t exist to provide the needed fine-grained control during training. And even if the operators HAD the tools to more diligently filter input, removing the flawed probabilities from a running model requires re-training the model from scratch.
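For what it is worth, the missing ingestion control is not conceptually exotic. A crude sketch of what filtering at crawl time could look like, where the canary-marker list and the blocked-domain set are hypothetical placeholders rather than anything vendors actually publish:

  from urllib.parse import urlparse

  KNOWN_CANARY_MARKERS = {
      "a54cfcf5c13e22d19d52c59320cf7b337723b5e8f221a42dd905f85d6036e7bb",
  }
  BLOCKED_DOMAINS = {"benchmark-test-sets.example.org"}

  def admissible(url: str, text: str) -> bool:
      """Reject documents carrying a known canary marker or coming from an excluded source."""
      if any(marker in text for marker in KNOWN_CANARY_MARKERS):
          return False
      if urlparse(url).hostname in BLOCKED_DOMAINS:
          return False
      return True

  def filter_crawl(crawled_docs):
      """crawled_docs: iterable of (url, text) pairs; yield only admissible ones."""
      for url, text in crawled_docs:
          if admissible(url, text):
              yield url, text

Even with a filter like this, the fix only applies to the NEXT training run; it does nothing for the probabilities already baked into a deployed model, which brings us back to the cost of re-training.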

Given that some of these models require multiple MONTHS of processing by thousands of dedicated servers, it is apparent those developing these models are unwilling to incur those costs simply to exclude data they weren’t authorized to use or data shown to be flawed. Their attitude seems to be, “Sucks to be you. Sorry, we’ll do better in the next release.” And since they still may lack the controls to exclude certain data, the only way the flawed probabilities resulting from intake of undesired data can be corrected is by loading MORE data from other sources to tip the calculations to a “truthier” state. That is not a logically or ethically sound foundation for any technology.


WTH