Part 1

Introduction to LLMs

Large Language Models (LLMs), such as ChatGPT and Gemini, have been center stage lately, and for good reason. These models can produce astonishing results in response to complex requests, such as writing Shakespearean-style oven cleaning instructions or outlining a book on the Buffalo Bills, with seemingly little effort.

These two silly examples would be difficult for the average person to write. We’d need to research the topics, consider the tone, decide on an organizational structure, and so on, but ChatGPT magically does all of this at once. How?!

You don’t need a technical background to understand the fundamentals of how LLMs work.

Let’s demystify this technology so you can effectively wield the power of LLMs and navigate this ever-changing landscape with confidence.

britneymuller · How Do LLMs (like ChatGPT) Work? [Part 1]

Summary

To oversimplify, LLMs learn from mountains of text, nearly all of the internet, to get good at next-word prediction. Given some input text (from you, the user), the model computes a probability distribution over possible next words and randomly picks from the most probable ones, one word after another, to build a confident, human-sounding response.
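To make the "probability distribution, then random pick" idea concrete, here is a minimal Python sketch. The prompt, the candidate words, and their probabilities are all invented for illustration; a real LLM computes a distribution over tens of thousands of tokens, and the helper name `sample_next_word` is hypothetical.

```python
import random

# Hypothetical next-word probabilities for the prompt "The cat sat on the".
# These numbers are invented for illustration; a real model computes them.
next_word_probs = {"mat": 0.55, "sofa": 0.25, "roof": 0.15, "moon": 0.05}


def sample_next_word(probs, rng):
    """Randomly pick one word, weighted by its probability."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded only so the demo is repeatable
samples = [sample_next_word(next_word_probs, rng) for _ in range(1000)]
counts = {w: samples.count(w) for w in next_word_probs}
print(counts)  # "mat" is picked most often, but not every time
```

Because the pick is weighted but random, the same prompt can produce different continuations on different runs, which is why LLM outputs are not deterministic.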

However, their skill isn’t limited to text predictions; they’re also excellent at question-answering, language translation, programming, etc., thanks to the many examples of these tasks in their training data.

If the information you seek via an LLM is available via Google, an LLM like ChatGPT has likely trained on those results, aiding in its brilliant word-calculating skills. This capability (parroting back information it’s seen in slightly different ways) fools people into believing LLMs are intelligent and can reason and use logic or common sense. Spoiler: They can’t, but they’re damn good at mimicking it! Language models don’t “understand” language as humans do; they merely simulate it through observed patterns in text.

LLMs are essentially aliens from a different universe: while they have access to all our world’s text, they lack genuine comprehension of languages, nuances of our reality, and the intricacies of human experience and knowledge.

This doesn’t mean LLMs aren’t useful. When harnessed responsibly and for the right tasks, LLMs hold the potential to revolutionize fields and empower people on an unprecedented scale. However, if misused (despite good intentions) or put in the wrong hands, this technology also poses significant risks.

Acquiring a fundamental understanding of how LLMs, like ChatGPT, work + basic data literacy is critical to successfully navigating the ever-evolving AI landscape.

While much about LLMs remains unknown, a basic grasp of their architecture will empower you to harness their strengths and avoid potential pitfalls.

Key Takeaways:

1. LLMs are not search engines! LLMs don’t perform information retrieval, nor are they deterministic (systems that involve no randomness). LLMs generate randomized outputs based on a probability distribution.

2. Versatility: LLMs can be used for a wide range of tasks. They can analyze product reviews for customer insights, support creative processes like writing or music composition, provide programming support, and summarize long legal documents, among many other things.

3. Encoded biases: These models amplify societal biases, such as racism, sexism, ableism, and power inequalities.

4. Availability: LLMs can work around the clock on various tasks, providing consistent support and service whenever needed.

5. Error-prone: It’s crucial not to view LLMs as a one-size-fits-all solution. They can inadvertently produce inaccuracies and are not appropriate for all tasks. An initial mistake can cascade, resulting in subsequent errors, given the way LLMs build on previous text.

6. Fine-tuning capabilities: LLMs can be tailored for specific industry tasks and datasets, improving their performance for specific use cases.

7. Environmental & computational costs: The environmental and financial costs of training LLMs are massive. The carbon footprint of training GPT-3 is comparable to a round trip to the moon in a car. Note: GPT-4 is at least 10x larger.

8. Integration: LLMs can (and likely will) be integrated into many popular platforms and tools, enhancing their capabilities.

9. Accountability: Does stating that an LLM is “an experiment” remove accountability for when things go wrong? In the event of misuse or real-world harm, defining who will be held accountable is essential.

10. Lack of interpretability or “black box”: The lack of interpretability in LLMs limits our understanding of why they generate specific outputs.

11. Human exams (like the Bar & SAT) are not good benchmarks of LLM performance: While LLMs are good at memorizing questions and answers and learning the laws of language, they struggle with generalizing information. For example, when standardized test questions are reworded but seek the same answer, LLMs don’t currently score well.

Dissecting Large, Language, Model:

1. ‘Large’ references both the massive dataset used for training & the size of the model itself

Massive pile of books in a warehouse with a computer

Source: Midjourney

You might be wondering why such large datasets are required to train LLMs. Great question! Training language models on larger and larger datasets led to performance never seen before. Essentially, the more text you feed a language model to train on, the better it is at providing human-like outputs across a large variety of domains. However, these massive datasets bring significant ethical and moral tradeoffs that we’ll dissect later.

Note: OpenAI has not disclosed the training data used to train ChatGPT, but we know that C4 (Colossal Clean Crawled Corpus) was used. While C4 is a commonly used dataset and only represents a fraction of all the text ChatGPT was trained on, its blocklist filter is problematic because it removes pages containing ‘bad words’ like ‘sex’ and others, which disproportionately erases text about minority groups and individuals (Dodge et al., 2021).

‘Large’ also refers to the size of the model itself. This is commonly in reference to the number of model parameters or weights. A model’s parameters or weights are essentially the tuning knobs assigning values within the model. The more parameters or weights within a model, the larger the latent multi-dimensional space and the more complex patterns and relationships a model can map and identify.
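As a rough illustration of what "number of parameters" means, the sketch below counts the weights and biases in a small fully connected network. This is a simplification: LLMs use Transformer layers rather than this plain layout, and the layer sizes and the function name `count_parameters` are assumptions chosen only to show how quickly the count grows.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection between layers
        total += n_out         # one bias per unit in the receiving layer
    return total


tiny = count_parameters([4, 8, 2])     # 4 inputs, 8 hidden units, 2 outputs
wider = count_parameters([4, 800, 2])  # same shape with a much wider hidden layer
print(tiny, wider)  # 58 5602
```

Each of those numbers is one "tuning knob" that training adjusts; large language models have billions of them.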

Source: Scale.com/guides/large-language-models

Note: Larger models don’t come without setbacks.

2. ‘Language’ references the linguistic nature of the model

Photo of a mother and daughter laughing. Communication is learned from our elders.

Source: Midjourney

Language, a crucial element in LLMs, is more complex than most people think and is often overlooked. As emphasized by Dr. Emily Bender during a DAIR Institute Talk, linguistics stands as an independent field, not just a subset of AI.

Linguistics dives into the comprehensive scientific study of language, dissecting its structure, historical evolution, and more. It incorporates subfields such as phonetics, morphology, syntax, and historical linguistics to unravel language intricacies akin to biological studies.

LLMs excel at identifying and mirroring the inherent statistical ‘laws of language,’ given sufficient text data. Text and speech aren’t random words but linguistic patterns used to structure phrases and convey meaning. Aspects like grammar, punctuation, and capital letters govern sentence structure.

While a language model is adept at identifying the statistical structure of language, it by no means understands the meaning or semantics of language as humans do. Instead, it generates text based on patterns it has identified in the training data. So, rather than understanding content the way humans do, it is more like an alien born in a dark cave, alone, with no understanding of the real world but with access to all of the world’s text. That’s closer to what a language model “knows”.

However, language is ambiguous. Its meaning is shaped by social situations, cultures, and current events, nuances that LLMs cannot fully navigate. Nevertheless, LLMs are trained to discern the statistical laws of language. Intriguingly, the exact mechanisms of how they do so remain a mystery even to researchers.

3. ‘Model’ references the virtual learning environment

“All Models are wrong, but some are useful” -George Box

Models have been instrumental in scientific progress for centuries, aiding understanding and facilitating research.

A model is a representation—physical, mathematical, or conceptual—of a system or ideas.

what are models?

In research, models often describe and explain phenomena that are not directly accessible, or they stand in for experiments to save resources. They offer simplified portrayals of diverse scenarios, primarily aiding research through predictive capabilities and pattern recognition. For example, physics frequently employs models to circumvent the cost and time of real-world experiments, enabling scalability. This is a plausible reason so many physicists have migrated to ML/AI.

Consider, for example, estimating the time it would take for an object to fall from a certain height. Sure, you could test this in the real world with variables like wind, temperature, etc. or you could simulate it thousands of times in a virtual environment under ideal conditions.

science model example

Once a model identifies the principle that force equals mass times acceleration, you no longer need all the previous virtual examples but can reconstruct the outcome using this law/model.
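The falling-object example can be sketched in a few lines of Python. The code below is a minimal illustration under ideal conditions (constant gravity, no air resistance); the function names and the 20-meter height are assumptions. It first simulates the fall step by step, then computes the same answer directly from the law of constant acceleration.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2 (ideal conditions, no air resistance)


def simulate_fall_time(height_m, dt=0.0001):
    """Step through time in small increments until the object has fallen height_m."""
    fallen, velocity, t = 0.0, 0.0, 0.0
    while fallen < height_m:
        velocity += G * dt       # gravity increases the speed
        fallen += velocity * dt  # the speed increases the distance fallen
        t += dt
    return t


def law_fall_time(height_m):
    """Closed-form answer from h = 0.5 * g * t^2, solved for t."""
    return math.sqrt(2 * height_m / G)


simulated = simulate_fall_time(20.0)
from_law = law_fall_time(20.0)
print(round(simulated, 2), round(from_law, 2))  # both are about 2.02 seconds
```

Once the law is known, the thousands of simulation steps are unnecessary: one formula reproduces the outcome.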

This same concept applies to LLMs that identify and reconstruct language laws.

However, instead of being explicitly programmed with language laws, LLMs learn linguistic patterns through training on massive datasets. This allows LLMs to mimic human-like output by predicting linguistic units, such as tokens (pieces of text), token sequences (sentences), and symbols, from a probability distribution.

Like physical phenomena, language has a statistical structure!

What LLMs are Good vs Bad at:

The observations listed below pertain to core LLMs — the “foundational models” without any integrations or ‘hookups’ that could potentially amplify their capabilities.

The current AI landscape changes quickly, and the capabilities of these models can expand when combined with other technologies or when supplemented with specialized datasets and algorithms. However, to grasp the fundamental strengths and limitations of LLMs we must consider them in their purest form. This allows us to appreciate their power and inherent constraints, setting a baseline of sorts.

My hope is that this section (and the entire guide) will help you make more informed judgements about their generated outputs.

LLMs are good at:

Language translation

Content summarization

Content generation

Writing support

Question answering

Correcting spelling and grammar

Programming support

Classification (spam detection)

Simplifying complex content

Stylized writing (applying Poe to x)

Personalization

Prompt engineering

Speech recognition

Mimicking dialogue

Sentiment analysis

LLMs are NOT good at:

Current events

Common sense

Math/counting

Handling uncommon scenarios

Humor

Consistency

High-level strategy

Being factual 100% of the time

Being environmentally friendly

Understanding context

Reasoning & logic

Emotional intelligence

Any data-driven research

Representing minorities

Extended recall/memory

LLM Capabilities & Limitations

LLMs excel at content generation, paraphrasing, and style transfer. However, their design brings limitations around current events, factual accuracy, bias, and privacy, and they carry massive environmental and computational costs.

LLMs serve as wonderful programming assistants because programming languages are intentionally designed to be unambiguous. This inherent structure enables LLMs to surface more accurate code and support. Interestingly, because LLM prompts are written in natural language, English could potentially become a dominant ‘programming language’ as LLMs take on more code generation.

In contrast, spoken languages are incredibly ambiguous. Words often carry different meanings based on regional, social and cultural contexts. Moreover, genuine ‘meaning’ is a product of shared life experiences and mutual understanding between people.

While LLMs are expected to improve their mathematical capabilities, other limitations they exhibit are likely to persist until further models or functionalities are integrated. This approach, often called “model stacking,” adds functionality to LLMs to improve their accuracy and performance.

One example of this is Retrieval Augmented Generation (RAG), which connects LLMs to external information retrieval resources (like Google) to provide an LLM with additional relevant, up-to-date, and accurate information to improve outputs. Bing & Google have incorporated this functionality into their LLM chatbots to make them more useful.

Retrieval Augmented Generation (RAG) workflow chart: 1. The user submits a query. 2. The query is sent to a search step that finds relevant information in knowledge sources. 3. That relevant information is added to the query as enhanced context. 4. Both the query and the enhanced context are fed into the LLM. 5. The LLM generates the text response.
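The workflow above can be sketched in Python as a toy. Everything here is an assumption made for illustration: the knowledge base is invented, retrieval is a naive word-overlap ranking rather than a real search engine, and `fake_llm` is a stand-in that does not call any actual model.

```python
def retrieve(query, documents, top_k=1):
    """Rank documents by how many words they share with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query, context_docs):
    """Combine the retrieved context and the user query into one prompt."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


def fake_llm(prompt):
    """Stand-in for a real LLM call; it only reports what it received."""
    return f"[model response based on {len(prompt)} characters of prompt]"


# Hypothetical knowledge source, invented for this example.
knowledge_base = [
    "Parking passes are renewed every January.",
    "The office cafeteria opens at 8 am on weekdays.",
    "The gym closes at 9 pm.",
]

query = "When does the cafeteria open?"
context = retrieve(query, knowledge_base)
prompt = build_prompt(query, context)
print(prompt)
print(fake_llm(prompt))
```

The key design point is that the model never has to "know" the cafeteria hours; the retrieval step puts the relevant text into the prompt before generation.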

Understanding the limitations of isolated LLMs is essential. While they might generate outputs confidently, it doesn’t necessarily mean the results are always accurate or socially acceptable. A notable illustration of this can be found in Microsoft’s experiment with their chatbot named “Tay.” Launched to mimic the language patterns of a 19-year-old American girl, Tay was designed to learn and adapt from user interactions on Twitter. However, within just 24 hours of its release, “Tay” began to generate inappropriate and racist comments, influenced by malicious users who took advantage of its learning mechanisms. This incident highlights the need for oversight and careful implementation of LLMs, particularly when exposed to unfiltered public input.

TayTweets Profile Photo (@TayandYou)

These early mishaps are quickly forgotten and easily overlooked, highlighting the moral responsibility of those providing LLM technology as a solution/tool to remind users of these harms and limitations.

A disturbing theme in the AI arms race is that what’s good for the general public and the larger field of AI is often at odds with corporate best interests (revenue).

Meg Mitchell, a Senior Researcher at AI startup Hugging Face and former co-lead of Google’s AI Ethics Team, spoke about AI harm, “A year ago, people probably wouldn’t believe that these systems could beg you to try to take your life…But now people see how that can happen.” –Bloomberg.

The core risk of this technology isn’t just the dissemination of false information, but the potential to emotionally manipulate and hurt people.

LLM Risk Mitigation Efforts:

1. Ethical AI Efforts: Many individuals have raised concerns about the risks of LLMs, and there are growing efforts to build LLMs more thoughtfully and better educate the general public about LLMs while promoting safe AI. Organizations like the DAIR Institute have launched AI research initiatives free from big tech’s corporate interest. Amplifying these experts’ concerns can help save people’s lives and minimize risks.

2. Renewable energy sources: Leveraging renewable resources to train these models is moving in the right direction, but still incurs an environmental cost and eliminates that resource from other applications. Efforts to report energy usage and deploy impact trackers are also being tested (Lottick et al. 2019 & Henderson et al. 2020).

3. Prioritize computationally efficient hardware: Efforts around more efficient hardware have existed for decades. Google announced its TPU (Tensor Processing Unit) in 2016 to power its machine learning initiatives more efficiently. Many large tech companies and chip providers are working hard to create more efficient chips for these ever-larger models. One of the most promising computational advancements is quantum computing. Green AI advocates are also promoting efficiency as an evaluation metric (Schwartz et al., 2020).

4. Data Documentation (Data Cards): To make AI more interpretable, Meg Mitchell, inspired by Timnit Gebru’s previous work around data sheets for datasets, initiated Model Cards or Data Cards. Like a nutritional label, it’s a resource for transparency in AI dataset documentation.

5. Accuracy Improvements: Integrating information retrieval and other models into an LLM (commonly referred to as stacking or ensemble algorithms) can help increase the probability of accurate output. Other improvement efforts you’ll learn more about later in this guide include prompt improvements, prompt chaining, fine-tuning, RLHF (reinforcement learning from human feedback), and multi-modal integration, all of which are being researched to improve LLM performance. It’s important to note that the inherent limitations of LLMs mentioned throughout this guide (due to their probabilistic nature and construction) will continue to exist and possibly leak through various improvement efforts.

What is ChatGPT?

Demonstration of ChatGPT

ChatGPT interacts with users in a conversational way and is built on OpenAI’s GPT (Generative Pre-trained Transformer) series of models. When a user enters text into ChatGPT, the model tries to produce a ‘reasonable continuation’ of that text based on everything the model has seen, which is pretty much all of the internet.

There’s a common misconception that LLMs copy the entire internet and refer back to it to generate outputs. The reality is that LLMs essentially compress all of the internet’s text into a multidimensional latent space. What’s a multidimensional latent space? It’s a mathematical space that maps out contextual relationships between words (meaning similar words are closer to each other, and each word’s connection to other relevant words is mapped out) based on everything the model has learned from the text. Note: This space has so many dimensions (in the thousands) that we humans cannot conceive of what it looks like. To comprehend these spaces, we need to smush them into 2D or 3D representations like the one below.

LDA t-SNE example of how similar words (like painter and sculptor) are closer together

t-SNE visualization
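The idea that "similar words are closer to each other" can be sketched with toy vectors. The three-dimensional vectors below are invented for illustration (real models learn vectors with thousands of dimensions), and cosine similarity is used here as one common way to measure closeness.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for illustration.
word_vectors = {
    "painter": [0.9, 0.8, 0.1],
    "sculptor": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Return a value near 1.0 when two vectors point in a similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


artists = cosine_similarity(word_vectors["painter"], word_vectors["sculptor"])
unrelated = cosine_similarity(word_vectors["painter"], word_vectors["banana"])
print(round(artists, 2), round(unrelated, 2))  # 0.99 0.3
```

With these made-up numbers, "painter" and "sculptor" score close to 1.0 while "painter" and "banana" score much lower, mirroring how related words cluster together in the visualization.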

Once an LLM has organized a language’s text into a multidimensional latent space, it uses those contextual relationships to generate text through randomized outputs based on a probability distribution.

Remember: LLMs have no grounding in human emotion or real-world experiences. They learn from a very narrow space much different from a human.

The introduction of the Transformer in 2017, a neural network architecture, paved the way for advanced models like ChatGPT. You’ll learn more about the significance of Transformers in Part 2.

Fun fact: Both GPT & BERT are Transformers!

Background: Understanding AI & NLP

To understand LLMs, we first need to understand what Artificial Intelligence is and how LLMs emerged. Let’s explore how machines have learned to decipher and interact with human language!

What is Artificial Intelligence (AI)?

AI refers to computer systems that mimic human-like (or above) capabilities and learn or improve over time. In essence, AI is the science of creating machines that can think, learn, and adapt in a human-like way, tackling a vast array of tasks with efficiency.

While AI is a broad field of research, the term “AI” receives a lot of criticism due to being so ill-defined. What is intelligence? When will we know if we’ve reached genuine artificial intelligence?

These probing questions are tough nuts to crack and answers remain elusive.

Two primary types of AI models:

Narrow intelligence is task-specific knowledge, like Google Translate, fraud detection, or customer support. General intelligence is generalized knowledge of the world. It isn’t available yet; it would have the ability to pass the Turing Test, possess common sense, and be able to plan and reason.

Narrow Intelligence or Artificial Narrow Intelligence (ANI or weak AI):

Models specifically programmed to solve one type of task. The likes of self-driving cars, Grammarly, recommendation engines (think YouTube), and Siri all serve as shining examples of narrow intelligence models. All AI we currently interact with today (as of 2023) falls under Narrow AI.

General Intelligence or Artificial General Intelligence (AGI or strong AI):

Here, a model boasts the general problem-solving capabilities of a human rather than being designed for a specific task. An AGI exhibits common sense, reasoning, creativity, and emotions and can apply knowledge across diverse contexts and modalities (even ones it has never seen before).

Many AI Researchers argue that we will never see AGI and others are working hard to make it a reality.

While AGI is a controversial topic, in large part due to its ambiguous definition (seriously, how will we determine if we’ve reached it?), many respected AI researchers note that we’re far from achieving true AGI. They also caution that an undue emphasis on AGI detracts from efforts in more immediately beneficial narrow applications, such as life-saving healthcare solutions.

Artificial Super Intelligence (ASI):

When an AI model spectacularly surpasses human capabilities.

Fun brain buster: How can we determine when these models have outperformed us at tasks? How can we gauge answers and outcomes to information out of our reach?! Warning: these thought-provoking questions lead to sleepless nights.

Historically, we’ve leaned on the Turing Test as a benchmark for determining “AI” or “AGI” (synonyms).

What is the Turing Test?

The iconic AI ‘Turing Test’ was invented by Alan Turing in 1950 and serves as an evaluation method to decide if a model has reached artificial intelligence.

How does the Turing Test work?

A questioner poses subject-matter questions that are routed to both a computer and a human, without knowing which is which. If the questioner mistakes the computer for the human in half of the test runs or more (or, in another common formulation, if the computer fools 30% of human interrogators during a five-minute conversation), the computer is considered to have achieved artificial intelligence.

For a deeper dive into the Turing Test and its variations, controversies, and limitations, do check out Stanford’s entry on the Turing Test.

The AI + LLM Landscape:

The AI Landscape

The term “AI” covers a broad array of sub-fields, so let’s tackle the ones most pertinent to LLMs:

Natural Language Processing

Ever wonder how your phone seems to “understand” you or why some apps can translate sentences into another language instantly? That’s all thanks to a special area of technology called Natural Language Processing (NLP). At its core, NLP is essentially a bridge between human language and computers, combining insights from linguistics, computer science, and machine learning.

For instance, think about determining whether someone’s words sound happy or sad, identifying the main topic of a chat, or even spotting a person’s name or a place in a book. While these seem like no-brainers for us, they’re actually big challenges for computers. Computers also need to translate between languages and fully grasp what people mean when they speak or write.

When you chat with voice assistants like Siri, Alexa, or Google Assistant, or when you’re using apps like Google Translate, you’re witnessing NLP in action. All of this is possible because of the work done to help computers understand language through NLP. Thanks to these efforts, we have many incredible tools that can communicate with us in ways we once only dreamed of!

Generative AI vs Discriminative AI

Generative and discriminative models represent two cornerstone approaches in AI, each with unique architecture and applications.

Let’s weave a hypothetical scenario involving the two AI models: one will generate entirely new/artificial images of dogs (NewDogAI) and another will be tasked with accurately identifying a dog’s breed from a photo (DogBreedAI). So which model would be generative and which would be discriminative?

Generative AI

Source: BoardPanda

Generative models create new outputs based on sequential data that unfolds over space/time (text, image, video, audio, etc.). They identify patterns within the training data and generate new outputs echoing these patterns and characteristics. NewDogAI would be an example of generative AI, creating new images like the one above.

Generative models are probabilistic rather than deterministic. This makes generative models more challenging to evaluate because the generated output in most applications is largely subjective.

Discriminative AI

Discriminative AI Example

Source: Dog-Breed-Classification.ipyn

Discriminative models aim to classify or categorize different kinds of data. Instead of learning to comprehend a dataset’s distribution, they differentiate between classes, learning the boundaries to make future predictions or decisions based on the training data. DogBreedAI would be an example of discriminative AI, classifying dog images into different categories (breeds).
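The NewDogAI versus DogBreedAI contrast can be sketched with a deliberately tiny example. The dog weights below are invented, a single number stands in for an entire photo, and both "models" are simplistic assumptions: the generative one fits a normal distribution per breed and samples from it, while the discriminative one learns only a midpoint boundary (which works here because the two groups do not overlap).

```python
import random
import statistics

# Hypothetical training data: dog weights in kg for two breeds (invented numbers).
weights = {
    "chihuahua": [1.8, 2.1, 2.5, 2.7, 3.0],
    "labrador": [27.0, 29.5, 31.0, 33.5, 36.0],
}


def fit_generative(data):
    """Generative: learn each breed's distribution (mean and spread)."""
    return {breed: (statistics.mean(v), statistics.stdev(v)) for breed, v in data.items()}


def generate(model, breed, rng):
    """Sample a brand-new, artificial weight for the given breed."""
    mean, std = model[breed]
    return rng.gauss(mean, std)


def fit_boundary(data):
    """Discriminative: learn only the boundary that separates the two classes."""
    return (max(data["chihuahua"]) + min(data["labrador"])) / 2


def classify(weight_kg, boundary):
    return "chihuahua" if weight_kg < boundary else "labrador"


rng = random.Random(0)  # seeded only so the demo is repeatable
generative_model = fit_generative(weights)
new_dog = generate(generative_model, "labrador", rng)
boundary = fit_boundary(weights)
print(boundary, classify(4.2, boundary), classify(30.0, boundary))
```

The generative model can produce examples that were never in the data, while the discriminative model can only answer "which side of the boundary is this?".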

History of Text Generation [Timeline]

The genesis of what we now understand as Natural Language Generation (NLG) harks back to the 1960s with the advent of ELIZA. Developed at MIT by Joseph Weizenbaum, ELIZA wasn’t a true NLG model in the contemporary sense. It simulated conversation with simple pattern-matching and substitution techniques but didn’t ‘understand’ or ‘generate’ language as modern NLG systems do. Nevertheless, it served as a prototype of what was to come.
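To give a feel for pattern-matching and substitution, here is a minimal sketch in the spirit of ELIZA. The rules below are invented for illustration and are not Weizenbaum’s original script; the function name `respond` is likewise hypothetical.

```python
import re

# A few hand-written rules in the spirit of ELIZA (illustrative only,
# not Weizenbaum's original script).
RULES = [
    (re.compile(r"\bi feel (.+)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\bmy (.+)", re.IGNORECASE), "Tell me more about your {0}."),
]


def respond(message):
    """Return the first matching template with the user's own words substituted in."""
    text = message.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(match.group(1))
    return "Please go on."  # fallback when no rule matches


print(respond("I feel tired today."))  # Why do you feel tired today?
print(respond("Hello"))                # Please go on.
```

Nothing here learns or predicts anything; the program simply reflects the user’s words back through fixed templates, which is why it differs so sharply from modern language models.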

The true era of commercial NLP applications didn’t arrive until the 1990s, and it has only become more prevalent recently with the explosion of more sophisticated models like ChatGPT.

Markov chains, rule-based algorithms, and Recurrent Neural Networks (RNNs) were some of the earliest technologies used for text generation. However, everything changed in 2017 when Google published the “Attention Is All You Need” paper, which introduced the Transformer model and would forever change the trajectory of NLG!

To give you a flavor of the journey, here’s a brief timeline:

History of Natural Language Generation [Timeline]

Like so many technologies today, these milestones have stood on the shoulders of giants. In 1948, Claude Shannon led research evaluating how well simple n-gram models could compress and predict natural language. An n-gram is a sequence of n consecutive words from a given text, and counts of n-grams are used to predict the likelihood of the next word in a given sentence. This is very useful for NLG, speech recognition, and spell check.
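A bigram model (an n-gram with n = 2) is small enough to sketch in full. The toy corpus and function names below are assumptions for illustration; the model simply counts which word follows which and turns those counts into next-word probabilities.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count how often each word is followed by each other word."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def next_word_probabilities(counts, word):
    """Turn the follower counts for one word into probabilities."""
    followers = counts[word.lower()]
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()} if total else {}


corpus = "the cat sat on the mat the cat ate the fish"  # toy training text
model = train_bigrams(corpus)
print(next_word_probabilities(model, "the"))
# {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
```

In this tiny corpus "the" is followed by "cat" twice and by "mat" and "fish" once each, so "cat" is the most likely next word. LLMs pursue the same goal, next-word prediction, with vastly more context and data.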

The Transformer’s impact on NLP is like the smartphone for personal communication, a revolutionary leap. Imagine how your first cellphone transformed your communication and access to information. The Transformer model did the same for NLP, but on a larger scale, evolving how computers understand and process human information.

The Game Changer: The Transformer

In 2017, a groundbreaking neural network architecture called the Transformer was introduced in the paper “Attention Is All You Need” by Google researchers. Before the Transformer, language models relied heavily on a type of neural network known as a Recurrent Neural Network (RNN). RNNs are great at predicting the next word in a sentence, understanding handwriting, and even speech recognition. However, they have limitations when it comes to handling long sequences of data efficiently.

This is where the Transformer architecture comes in, with a unique learning approach that outperforms most earlier NLP models. Think of the Transformer as a super-powered computer that can understand not just the words in a sentence, but the entire story, regardless of how long or complex! This capability extends its influence beyond text and into other fields like computer vision, audio, and speech processing. This versatility has also fostered collaboration and innovation across many different (usually siloed) industries. How cool is that?!

So, what’s the secret behind the Transformer’s remarkable power? Read Part 2 of this Large Language Model guide to find out:

What are Large Language Models? [Part 2]

Part 2

What You’ll Learn In Part 2:

Transformer Basics

Encoder-Decoder Structure

Types of Transformers

GPT, BERT, and more

Embeddings

Representing Text in High-Dimensional Vectors

The Power of Attention

Exploring Attention Mechanism in LLMs

Tokenization

Preparing Text for LLM Input

Decoding and Improving LLM Outputs

Learn About Common Decoding Strategies
