"theres this betting pool for the first year that there is a one-person billion-dollar company" - Sam Altman
Ever since the birth of AI at the Dartmouth Conference in 1956, researchers have been trying to create machines that can think and learn like humans. For most of the nearly 70 years since, this goal has remained tantalizingly out of reach, and many doubt that it will ever be achieved. Early efforts attempted to model the structures of the human brain - in "A Logical Calculus of the Ideas Immanent in Nervous Activity", published in "The Bulletin of Mathematical Biophysics" in 1943, Warren McCulloch and Walter Pitts introduced a simplified mathematical model of the neuron, a non-learning precursor to the perceptron. In 1958, Frank Rosenblatt published "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain" in "Psychological Review", introducing the learning neural network to the world.
In 1969, Marvin Minsky and Seymour Papert published "Perceptrons", shifting the focus of AI research away from neural networks and towards symbolic AI. Prolog, invented in 1972, is a programming language designed specifically for this style of AI: it is based on first-order logic and is an attempt to formalize the process of reasoning. AI researchers quickly realized the sheer magnitude of the task of encoding all of human "common sense" into formal logic. In 1984, Douglas Lenat started "Cyc", an ambitious project to create a comprehensive knowledge base and reasoning engine to form the foundation of all AI. Lenat died in 2023 without seeing his project fulfilled.
In 1997, IBM's Deep Blue defeated world chess champion Garry Kasparov, marking a significant milestone in the field of AI. This event is often seen as a turning point in the public perception of AI, demonstrating that machines can outperform humans in complex tasks. However, it also highlighted the limitations of AI at the time: Deep Blue was specifically designed for chess and could not generalize its knowledge to other domains. Not only was Deep Blue a purpose-built supercomputer, but it spent 12 years in development before it could defeat Kasparov.
In 2006, Geoffrey Hinton and his team at the University of Toronto revived interest in neural networks with their work on deep learning. This breakthrough led to significant advancements in computer vision, natural language processing, and other areas of AI. In 2012, a deep learning model developed by Hinton's team won the ImageNet competition, achieving a top-5 error rate of 15.3% and significantly outperforming traditional computer vision methods. This success marked the beginning of the deep learning revolution and set the stage for rapid advancements in AI technology.
With increasing computational power and the availability of large datasets thanks to the internet, AI research shifted its focus to massively parallelized statistical models. In 2014, Ian Goodfellow introduced Generative Adversarial Networks (GANs), demonstrating that machines can learn to generate novel, realistic data. Instead of trying to model the world directly, researchers began to use statistical models to extract information about the world from massive amounts of raw data. In 2017, the introduction of the Transformer architecture by Vaswani et al. revolutionized natural language processing and led to the development of large language models (LLMs) like BERT and GPT-3. These models demonstrated that machines can understand and generate human-like text, further blurring the line between human and machine intelligence.
Since 2022, the general public has had access to cutting-edge AI technology through platforms like OpenAI's ChatGPT and Meta's LLaMA. For the first time, AI systems are having a significant impact on everyday life, with applications in education, healthcare, and entertainment. The rapid advancements in AI have led to increased interest and investment in the field, with many companies and researchers racing to develop the next breakthrough technology. For those paying attention, it's clear that we're on the precipice of a fundamental shift in the way we interact with machines and the world around us. The question is no longer if AI will change our lives, but how quickly and in what ways it will do so.
Modern computing is built on the mathematical foundations of the Turing machine, implemented in the von Neumann architecture. Computers manipulate symbols and storage to perform calculations and process information, following rules encoded at varying levels of abstraction. Instructions to the electronic circuits that make up a computer are encoded in a language called machine code, represented by sequences of 1's and 0's. Specific operations are represented by specific sequences, and the computer looks up what to do based on the sequences it is given. Writing long sequences of specific 1's and 0's is time consuming and very error prone, so these sequences are themselves encoded in a higher-level language called assembly language. This allows programmers to tell the computer what to do in a more human-readable format: instructions are represented by commands, such as "add" or "mov", followed by registers or locations in memory. Manually telling a computer how to move data, how to operate on data, and how to store data is likewise tedious and error prone, so assembly language instructions are encoded again into higher-level languages, such as C, Python, or Java. These are the programming languages that programmers today use to write software.
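To make the layering concrete, here is a minimal sketch in Python, with Python bytecode standing in for assembly and machine code: the same one-line addition, seen once as the code a programmer writes and once as the lower-level instructions the interpreter actually executes.

```python
# The same addition viewed at two levels of the abstraction stack.
# Python bytecode stands in here for assembly; the layering principle
# is the same as for machine code on real hardware.
import dis

def add(a, b):
    return a + b

# High-level view: the programmer writes one readable line.
print(add(2, 3))  # -> 5

# Lower-level view: the interpreter executes a sequence of simple
# instructions ("load this value", "add", "return"), much like assembly.
dis.dis(add)
```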
At each level of abstraction, the programmer is encoding information in a way that is meaningful to the machine. The design of the computer encodes the process of computation itself, along with additional rules - what counts as valid mathematics (dividing by zero, for example) and what happens when a program operates on data that doesn't exist. Each layer of abstraction encodes more information, through rules and restrictions, on top of the abstraction below it. At the top of the abstraction stack sits a set of keywords and syntax rules that allows a programmer to perform any computation possible, limited only by the physical constraints of reality itself. The computer is a breathtaking testament to the power of abstraction and of encoding information. The machine encodes the process of computation, and the programmer encodes the process of solving a problem in a specific domain, on a specific set of data, into the machine. Unfortunately, the machine is only as intelligent as the programmer - the machine performs the computation, but the programmer must tell it what computations to perform. Writing software is the process of encoding domain-specific knowledge into a machine. On the surface, the goal of machine learning seems impossible - how can a machine do something that it hasn't been explicitly told how to do? It can't, and it doesn't.
All AI systems encode information. Early AI systems attempted to encode information manually, in the form of first-order logic. The "Cyc" project was an attempt to set up a "flywheel" of automated first-order logic discovery. Both old and modern neural network approaches encode information in the form of weights and biases. The goal of "training" an AI system is to encode information into the system - training a machine learning system to recognize handwritten digits is a process of encoding information about the digital representations of the Arabic numeral glyphs into the system. A first-order logic system would need to encode facts about both the digital representation of a number and the physical shape of its glyph through a series of logical rules. A neural network encodes the same information, but it does so by adjusting the weights and biases of the system to minimize the error between its output and the expected output. In both cases, the goal is to encode enough information into the system that it can recognize data it has never seen before as information it has seen before.
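As a toy sketch of what "adjusting weights to minimize error" means in practice - using invented data (y = 3x) rather than real digit images - here is a single-weight model nudged by gradient descent until the relationship in the data is encoded in its one weight:

```python
# A toy version of training-as-encoding: one weight, one pattern.
# Real digit recognizers do the same thing with millions of weights.
import numpy as np

xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = 3.0 * xs          # the "information" we want the model to encode
w = 0.0                # the model: predict y = w * x
learning_rate = 0.01

for step in range(200):
    predictions = w * xs
    error = predictions - ys
    gradient = 2 * np.mean(error * xs)   # d(mean squared error)/dw
    w -= learning_rate * gradient        # adjust the weight to reduce error

print(round(w, 3))  # ~3.0: the relationship is now encoded in the weight
```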
The strength of large language models (LLMs) such as OpenAI's ChatGPT and Anthropic's Claude models is primarily derived from two main aspects: the size of the model, in terms of the number of neurons and weights (parameters) it contains, and the size and quality of the dataset used to train the model, usually measured in token count. The more parameters a neural network has, the more complex the patterns it can recognize in the training data; and the better and more voluminous the training data, the more information there is to encode. These two aspects allow LLMs to encode vast, complex amounts of information from their datasets into their systems. LLMs have famously been trained on datasets like "The Pile", an 880-odd GB superset of other available datasets from a variety of sources, including Wikipedia, YouTube subtitles, Books3 (a dataset of published texts), and the Common Crawl dataset, which is representative of the internet at large. The variety of sources and the sheer volume of data in such datasets allow these LLMs to encode information across a broad swath of human output - arguably everything we've ever published on the internet.
Going back to the example of training a neural network to recognize handwritten digits: during training the network sees each image as a sequence of pixels, each pixel having one or more values - a single grayscale intensity, or red-green-blue-alpha (RGBA) channels for color images. From the specific pixel sequences, the network learns to recognize patterns in the data that a human wouldn't necessarily recognize. These patterns might include the ability to recognize specific curves or corners, and then the ability to recognize that a certain collection of shapes represents a digit. In neural networks, these patterns are referred to as hidden features. The more complex the neural network, the more "layers" of hidden features can be learned. The network learns a series of increasingly abstract patterns as data passes through it, and each layer builds on the output of the layer before it and operates at a higher level of abstraction.
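Here is a sketch of that layered structure, assuming PyTorch is available; the layer sizes are illustrative rather than tuned, and the "image" is random noise standing in for a real scan:

```python
# Each Linear + activation pair is one layer of "hidden features";
# later layers operate on the more abstract patterns produced by
# earlier ones. Sizes are illustrative, not tuned.
import torch
from torch import nn

digit_classifier = nn.Sequential(
    nn.Flatten(),            # 28x28 pixel grid -> 784 raw pixel values
    nn.Linear(784, 128),     # first hidden layer: low-level strokes/curves
    nn.ReLU(),
    nn.Linear(128, 64),      # second hidden layer: combinations of shapes
    nn.ReLU(),
    nn.Linear(64, 10),       # output layer: one score per digit 0-9
)

fake_image = torch.rand(1, 28, 28)       # a stand-in for one grayscale scan
scores = digit_classifier(fake_image)    # ten unnormalized class scores
print(scores.shape)                      # torch.Size([1, 10])
```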
The "infinite monkey theorem" states that a monkey randomly pressing keys on a typewriter for an infinite amount of time will eventually type out the complete works of Shakespeare. In an extension to this idea, "The Total Library", a story by Jorge Luis Borges, describes a library containing every possible book that could be written. The library contains every possible combination of letters, spaces, and punctuation marks, including every book that has ever been written, and every book that will ever be written. The library is infinite, and contains every possible book that could be written. Such a library would contain the entirety of human knowledge in a sea of gibberish, accessible to those who know how to search it. There's a map, but that's also in the library.
Consider an LLM trained on a dataset that contains every English word ever written down. The LLM has not only explicitly encoded each word itself, but also common English grammar and syntax rules, common misspellings, synonyms, antonyms, and - via the local relationships between groups of words - even, to some extent, the meanings of words, at least as represented by other words. None of this information was explicitly labelled in the training dataset, and yet it is encoded into the model. A generative transformer model with this training is analogous to Borges' Total Library. It is theoretically capable of generating every book that has ever been written and every book that will ever be written, and due to the specific algorithms implemented in the network, it generates plausible text - bypassing the gibberish problem of Borges' Total Library entirely.
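As a toy illustration of how meaning falls out of local word relationships - using a tiny invented corpus and raw co-occurrence counts rather than the dense learned embeddings a real model uses - consider the following sketch:

```python
# Count which words share sentences in a tiny invented corpus, then
# compare the resulting count vectors. Words used in similar contexts
# end up with similar vectors: a crude form of "meaning from company kept".
import numpy as np
from itertools import combinations

corpus = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "the dog ate the food",
    "the cat ate the food",
]

vocab = sorted({w for line in corpus for w in line.split()})
index = {w: i for i, w in enumerate(vocab)}
counts = np.zeros((len(vocab), len(vocab)))

for line in corpus:
    for a, b in combinations(line.split(), 2):   # words sharing a sentence
        counts[index[a], index[b]] += 1
        counts[index[b], index[a]] += 1

def similarity(w1, w2):
    v1, v2 = counts[index[w1]], counts[index[w2]]
    return v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))

# "cat" and "dog" occur in similar contexts, so their vectors end up
# more alike than those of "cat" and "food".
print(similarity("cat", "dog") > similarity("cat", "food"))  # True
```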
Now consider the same generative transformer model, but trained on topic-specific English datasets. That model will not only have encoded patterns of syntactically valid text, it will also have encoded topic-specific information. For example, a model trained on a dataset of medical journals will have encoded information about the human body, diseases, and treatments. A model trained on a dataset of legal documents will have encoded information about laws, regulations, and legal terminology. The model is capable of generating plausible text that is relevant to the specific topic it was trained on, and the generated text is capable of passing information out of the network. This introduces a new problem - information is encoded into the network as patterns, and though the network is capable of expressing that information, its pattern-matching nature means there is no guarantee that the information is correct in the context the machine is operating in.
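As a sketch of that gap between plausibility and correctness - assuming the Hugging Face transformers package is installed, and using the small, general-purpose GPT-2 checkpoint as a stand-in for a real domain-trained model - sampling a continuation looks like this:

```python
# The model continues a medical-sounding prompt with fluent, plausible
# text, but nothing in the sampling process checks whether the generated
# claims are actually true.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The recommended first-line treatment for hypertension is"
inputs = tokenizer(prompt, return_tensors="pt")

# Sampling draws likely-looking continuations from the learned patterns.
outputs = model.generate(
    **inputs,
    max_new_tokens=30,
    do_sample=True,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token by default
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```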
Borges' Total Library and ChatGPT differ in one crucial way - infinity. The conceptual Turing machine is capable of computing forever if given an infinite length of tape. If it were possible to build a Turing machine with infinite memory, a GPT running on such a machine would indeed be capable of generating all text conceivable. In practice, of course, computers are not infinite, and their computations are inherently bounded by time and memory. Therefore, GPTs have limits in terms of the amount of information they can both encode and express. These limits form edges to the infinite nature of the Total Library, and define what is referred to as latent space.
The latent space is the high-dimensional space of everything encoded in the model, bounded by the edges of the model's limits - the space from which all possible generated outputs are drawn. It theoretically contains all the information encoded in the model, and all the expressible information the model is capable of outputting. Assuming rich enough training data, the answer to any question is theoretically somewhere in the latent space of the model - it just needs to be found.
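One loose way to picture "searching" a latent space - assuming the sentence-transformers package and its small all-MiniLM-L6-v2 model as a stand-in for an LLM's much larger latent space, with candidate statements invented for illustration - is to embed a question and some candidates as vectors and see which candidate sits closest to the question:

```python
# A loose sketch of searching a latent space: a small embedding model's
# vector space stands in for an LLM's latent space.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

question = "What causes the seasons on Earth?"
candidates = [
    "The tilt of the Earth's axis relative to its orbit causes the seasons.",
    "Bread rises because yeast produces carbon dioxide.",
    "The stock market closed higher today on strong earnings reports.",
]

# Encode everything into the model's vector space.
vectors = model.encode([question] + candidates)
q, docs = vectors[0], vectors[1:]

# Cosine similarity: which candidate lies closest to the question?
scores = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
print(candidates[int(np.argmax(scores))])
```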
The race to consume all of the world's information is on. AI companies like OpenAI have partnered with social media companies to enrich the data they can train on, and are even rumored to be working on their own social network. The volume of information already encoded in these models is staggering, and the models are only getting larger and encoding more. Generative AI is already making meaningful contributions to the fields of meteorology, human biology, and chemistry. What contribution does the information encoded in these models make to the field of AI itself?
The answer is not clear, but the potential is there. Somewhere in that latent space there must be patterns in the encoded information that humans have been unable to find on their own. This means these models could be used to generate new ideas, hypotheses, and even research papers. The potential for AI to assist in AI research is enormous, and it is only a matter of time before we see significant advancements in this area. Indeed, people are already leveraging AI to assist in AI-related research, though the focus right now seems to be on using AI to efficiently extract information from text.
If it is possible for one person and AI to create a billion-dollar company, is it possible for one person and AI to make meaningful progress in the field of AI itself?
Could that person run an AI lab using AI?
Let's find out.