Technical Aspects of LLMs [Part 2]

In the previous chapter, we covered the revolutionary impact of the Transformer model, introduced in 2017, which has redefined the machine learning (or “AI”) landscape as we know it. Its ability to model complex sequences has put it at the forefront of the field. Let’s uncover the mechanics and principles that underpin the Transformer’s powerful capabilities.


But first, let’s begin with the bedrock of any Transformer model: the training data.

The Keystone of LLMs: Training Data

Quality training data is arguably the most important part of machine learning.

Imagine embarking on a journey to learn a new language. How would you go about this? You’d probably want to immerse yourself in that language’s environment, listen to native speakers, practice conversations, and make mistakes in order to learn. Each interaction you have with this new language, every piece of dialogue, is a step towards fluency in your new language.

The Transformer’s journey to language fluency requires similar exposure to language through a vast array of training data! This data acts as the curriculum for the model’s intensive language course.

Just as you would expose yourself to various dialects, texts, contexts, and tonalities to grasp your new language, the Transformer needs many examples to discern the patterns, relationships, and semantics of a particular language.

Training data is also referred to as ‘training set,’ ‘training dataset,’ and ‘learning set.’

Two Training Data Types

Training data is the foundation upon which machine learning models like LLMs (and ChatGPT) are built. This data can broadly be classified into two types, each with its own use cases and ethical considerations.

Sidenote: AI is like a tasty hotdog; you should know what kind of atrocities go into making it before taking a bite.

#1 Labeled Data

Labeled datasets come with annotations (or labels) that provide context to the machine. These labels help machines identify specific characteristics, classifications, properties, patterns, people, objects, etc. Just as a language learner uses labels to associate words with objects, labeled data aids in supervised learning, though creating these datasets can be very expensive and labor-intensive. 

The process of creating labeled data is not without its ethical quandaries. This work often involves human annotators (sometimes called ‘cleaners’) who painstakingly label vast amounts of data, some of it gruesome, and sometimes under less-than-ideal working conditions.

#2 Unlabeled Data

In contrast, unlabeled data are raw, untagged datasets (like entire websites) used in unsupervised learning. This is akin to a language learner overhearing conversations without context, relying on pattern recognition to make sense of a language.
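To make the difference concrete, here is a minimal sketch of what the two kinds of data might look like in code. The sentences, labels, and variable names are invented for illustration and don’t come from any real dataset.

```python
# Labeled data: each example is paired with a human-written annotation.
labeled_data = [
    ("I loved this movie!", "positive"),
    ("The plot made no sense.", "negative"),
]

# Unlabeled data: raw text with no annotations attached.
unlabeled_data = [
    "Welcome to my blog about hiking trails.",
    "Preheat the oven to 350 degrees.",
]

# Supervised learning reads (text, label) pairs...
for text, label in labeled_data:
    print(f"{label:>8}: {text}")

# ...while unsupervised learning only ever sees the text itself.
for text in unlabeled_data:
    print(f"{'(none)':>8}: {text}")
```

Notice that someone had to write every “positive” and “negative” by hand, which is exactly where the cost and labor come from.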

Because LLMs are trained on large portions of the public internet, much of what they learn from is raw, unlabeled website data.

Hybrid Data: Uses both labeled and unlabeled data for multi-modal models that integrate data across various formats such as text, images, and audio.

Datasets are critical for the development of AI models, but the process behind their creation must be scrutinized. There’s a growing call within the AI community for more ethical data collection and annotation practices ensuring fair compensation and working conditions for those involved in the creation of these datasets.

Sourcing Training Data

There’s no ‘download all’ button for the internet. People behind ‘base’ or ‘foundational’ LLMs had to decide which parts of the internet should go into the model’s training and which should not.

Addressing Bias

Datasets behind LLMs have grown dramatically in size because training on larger datasets led to performance gains never seen before. Essentially, the more text a model is trained on, the more adept it becomes at producing outputs that mimic human communication. However, quantity does not equate to quality.

Let’s take a peek at one of the most common text datasets used to help train LLMs:

The Colossal Clean Crawled Corpus

The Colossal Clean Crawled Corpus (C4), despite its size, comes with inherent biases because it is English-based and excludes pages containing certain terms. It’s important to note that several of the excluded terms are used positively by various minority groups, so filtering them out further erases already marginalized voices. “Sex,” for example, a useful term in many appropriate contexts, is on this omission list.
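Here is a minimal sketch of how a word-based filter can throw out perfectly good content. The blocklist and documents are hypothetical stand-ins, not C4’s actual list or pipeline. The point is only that dropping every page containing a listed term also drops pages that use the term appropriately.

```python
import re

# Hypothetical blocklist for illustration only; not the real C4 list.
BLOCKLIST = {"sex"}

documents = [
    "Comprehensive sex education improves health outcomes.",
    "Our hiking club meets every Saturday morning.",
]

def passes_filter(text: str) -> bool:
    """Return True when no blocklisted word appears in the text."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return words.isdisjoint(BLOCKLIST)

kept = [doc for doc in documents if passes_filter(doc)]
print(kept)  # only the hiking sentence survives
```

The health education sentence is removed even though nothing about it is inappropriate, because the filter has no notion of context.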

Here are the top TLDs (top-level domains, the ending of a URL such as .com or .org) and the top 25 websites within C4, along with the number of tokens (words or parts of words) each one contributes to the dataset:

Bar charts showing the top contributing websites to the Colossal Clean Crawled Corpus

The English Wikipedia is the #2 contributor to C4 and appears to be a pretty solid information resource, right?

It’s important to note that even highly regarded websites like Wikipedia come with their fair share of biases.

The following is self-reported data for Wikipedia contributors (or editors):

Self-reported contributors to Wikipedia are 87% male, have an average age of 27, and tend to have higher education, be single, and have no kids. This demographic homogeneity (not representative of a diverse population) leads to a limited perspective on topics of broader interest and importance.

This has resulted in the following (and other) well-documented issues: 

    • Racial Bias: The underrepresentation of Black history and the dominance of white, European, and North American editors results in a skewed portrayal of African topics. This also leads to perpetuating stereotypes and omitting significant historical and cultural information.
    • Geographical Bias: The concentration of editors from Western countries introduces a geographical bias, often leading to an overrepresentation of Western perspectives and a lack of coverage of other regions, particularly the Global South.
    • Gender Bias: With fewer topics of interest to women and fewer biographies about women, the content on Wikipedia leans towards male-centric perspectives and interests.

Ethical Implications

These kinds of dataset biases have profound implications beyond the websites themselves. As big tech companies continue to harness these skewed datasets to train AI models, there’s a risk of perpetuating and magnifying these prejudices on a much larger scale. Interactions with these AI systems, however passive, can feed back into later training and reinforce the same skewed perspectives. If unchecked, the embedded biases will continue to shape AI behavior and could significantly affect necropolitics and global societal dynamics.

Acknowledging that all datasets are biased and that we are all biased, is the first step. By shining a light on these implicit biases, we can raise awareness and take deliberate action. It’s important that we strive to develop AI models that are not just technologically advanced but are also shaped by a commitment to inclusivity, fairness, and a more equal representation of our global society. 

It’s interesting to think about LLMs’ large dataset requirements and consider which companies hold massive amounts of our data.

Fun fact: reCAPTCHA image challenges are labeling images for image recognition models.

11 Key Takeaways on LLM Training Datasets

In the development of LLMs, the datasets used for training are the cornerstones that dictate their capabilities and performance. Here are some key takeaways regarding their role and importance:

#1 Quality Over Quantity: The old saying “garbage in, garbage out” is true for machine learning. The quality of training data is crucial to a model’s performance.

#2 Consent and Authorship: Most content creators (writers, artists, or site owners) have not agreed to include their work in a model’s training data.

#3 Representation and Bias: There is an overrepresentation of individuals from privileged demographics in terms of race, gender, ability, age, religion, and nationality.

#4 Anglocentric Bias: English-speaking and Western perspectives often become the default, marginalizing the voices of the global majority. (Zhou et al., 2021)

#5 Gender Biases: Historical biases, such as the sexualization and objectification of women, are prevalent. They were well documented in search algorithms before the rise of LLMs, for example in the search results for “black girls.”

#6 Necropolitics: The implementation of LLMs in social services can lead to systematic service denial, disproportionately affecting specific populations (Eubanks, 2018).

#7 Privacy Concerns: The potential extraction of Personally Identifiable Information (PII) from trained models poses significant privacy risks (Carlini et al., 2021).

#8 Illusion of Objectivity: Large, uncurated datasets can present a “view from nowhere,” a false sense of objectivity that ignores nuances.

#9 Data Centralization Risks: A dependency on large datasets tends to consolidate AI power among companies capable of extensive data collection and processing. Accepting AI means implicitly endorsing pervasive surveillance and control (McQuillan, 2022, p. 14).

#10 Immeasurable Human Experience: AI cannot capture all facets of human experience and knowledge, a limitation that must be recognized.

#11 Assumption of Omniscience: These datasets often operate under the assumption that the sum of human knowledge is available online or in written format, a fundamentally flawed notion.

The Dark Side of Data Preparation

The process behind dataset aggregation is often non-transparent and contentious. Awareness of the ethical concerns associated with exploiting low-wage labor for data sanitization and decision-making is critical. For example, OpenAI’s use of Kenyan workers paid less than $2 per hour to detoxify ChatGPT’s outputs has sparked ethical debates.

Dr. Margaret Mitchell sums up a prevailing ethical stance: “People aren’t subject to non-discrimination policies when freely writing online, but models should be.”

Historical Context: RNNs to Transformers

Traditional NLP methods relied heavily on RNNs (recurrent neural networks). RNNs handle data that unfolds in sequence, like words in a sentence, and for tasks such as translation they were typically arranged in a simple encoder-decoder structure. Reading a book is like an RNN processing text: you take in one word at a time and update your understanding as you go along. But just like us humans, RNNs can struggle to remember details from the beginning of a long story.

However, RNNs face an “informational bottleneck” when dealing with longer text sequences: everything the model has read so far must be squeezed into one fixed-size hidden state. This means that as a sequence gets longer, RNNs find it difficult to retain information from earlier parts of the text.

It’s like trying to recall what happened at the beginning of a long story without forgetting any important details. 

Because of this limitation, the performance of RNNs tends to decrease as the sequences get longer. In other words, RNNs struggle to capture the full context and connections between sections of lengthy text. 
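A toy sketch can show this fading memory. It is not a trained network: the update rule, the 0.5 weights, and the numeric stand-ins for words are all assumptions chosen to keep the example tiny. A single hidden state is overwritten at every step, so the signal from the first token shrinks as more tokens arrive.

```python
import math

def rnn_step(hidden: float, token_value: float) -> float:
    """One toy recurrent update: mix the old state with the new input."""
    return math.tanh(0.5 * hidden + 0.5 * token_value)

# Hypothetical numeric stand-ins for the words of a sentence.
# Only the first "word" carries any signal.
token_values = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

hidden = 0.0
for step, value in enumerate(token_values, start=1):
    hidden = rnn_step(hidden, value)
    print(f"after token {step}: hidden = {hidden:.4f}")
```

Each printed value is smaller than the last: by the sixth token, very little of the first one is left in the backpack.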

Researchers developed new techniques (like attention) to help overcome this issue and improve the handling of longer text sequences.

I highly recommend ‘The Unreasonable Effectiveness of Recurrent Neural Networks’ by Karpathy to learn more about RNNs.

Transformer Architecture

When it comes to training a Transformer, think of it as a two-step dance. The first step is pre-training, where the model learns from a large corpus of text data, absorbing knowledge like a sponge. The second step is fine-tuning, where the model is further trained on a smaller, task-specific dataset, honing its skills to perfection.

Understanding Encoder-Decoder Framework

Let’s consider how baseball coaches create specific hand signals to communicate information to players at bat:

Note: Several international players on the team do not speak English. 


The team decides what information would be valuable to encode or communicate to players while up at bat. This information goes into a spreadsheet, assigning an abbreviated hand signal for each piece of information. These signs and corresponding information are communicated to each player in their preferred language.

While a Spanish-speaking player is at bat, she sees her coach pull at both ears. She decodes that hand signal back into the original information, “hit it to the moon,” which she processes in Spanish as “golpéalo a la luna.” The player then uses that information to strategically smash the ball out of the ballpark for a home run!


Encoder-decoder architecture works the same way: the encoder transforms language into numerical representations (the hand signals in our example) for processing, and the decoder turns them back into the desired language output.

Think of the encoder-decoder in the Transformer architecture like a dynamic duo—they don’t have to be identical twins. They can differ in structure or size, and different architectures can shine in different tasks. It’s all about finding the right fit for the task at hand.

Fun fact: GPT only uses the decoder part of Transformer architecture, and BERT only uses the encoder!

Three Common Types of Transformer Models:

  • Encoder-decoder: Great for language translation and text summarization. 
  • Encoder-only: Converts input text into rich numerical representations, which makes it well suited to text classification and other language-understanding tasks. 
  • Decoder-only: Good at predicting the next word in a sequence and commonly used for text generation. ChatGPT and all previous GPT models use a decoder-only architecture.
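One practical difference between these types is which tokens each token is allowed to look at. The sketch below is a simplification with a made-up function name: an encoder-style model lets every token see the whole input, while a decoder-style model only lets a token see itself and the tokens before it, which is what makes next-word prediction a fair test.

```python
def attention_mask(num_tokens: int, causal: bool) -> list[list[int]]:
    """Row i lists which positions token i may look at (1 = visible)."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(num_tokens)]
        for i in range(num_tokens)
    ]

encoder_style = attention_mask(4, causal=False)  # every token sees every token
decoder_style = attention_mask(4, causal=True)   # each token sees itself and earlier ones

for row in decoder_style:
    print(row)
```

The decoder-style mask prints as a triangle of 1s: the first token sees only itself, and the last token sees everything that came before it.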

While we’ve categorized three main types here, remember that these boundaries are not set. Many models are hybrids incorporating elements of all three types!

Attention Mechanism

The Attention mechanism is an innovative concept in machine learning to help solve the “informational bottleneck” problem.

The attention mechanism in machine learning is a true superstar. It was the significant innovation that paved the way for Transformers. Before Transformers, attention had a limited role in RNNs, but with the advent of self-attention in Transformers, this technique could be fully leveraged, propelling Transformers into the spotlight.

Hidden States

Imagine going on a hiking trip and trying to remember everything along the way. You come across memorable things on your hike: maybe you see a cute fox and find some unopened trail mix! You place each of those experiences into your backpack (or memory). However, there are long stretches of your hike that are quite boring. If you tried to put every single step into your backpack as a memory, it would overflow with things you don’t need.

This is kind of how attention and hidden states work! A hidden state is like the backpack that the Transformer carries. Each time it sees a new word, it puts something about that word into the backpack. And as it comes across new words, it looks back into its backpack to discover relevant information and context for the new word. 

Unlike a traditional RNN encoder-decoder, which squeezes the whole input sequence into a single final hidden state (backpack overload), an encoder in attention-based models exposes a hidden state for each step of the journey (like the magical Transformer backpack above). This allows the decoder to access multiple states and gather comprehensive information across an entire hike.

Attention Mechanisms

Imagine at the end of your hike if you had to go through every single item within your magical Transformer backpack, like where you saw the cute fox and each rock you saw. You’d be overwhelmed! The same thing happens when the decoder considers all states simultaneously. To address this, attention mechanisms come into play!

Attention plays a crucial role in determining which encoder states the decoder should prioritize. By assigning different weights or “attention” to each encoder state during each step, the decoder can selectively emphasize the relevant states (cute fox) while disregarding the less important ones (saw another rock). This targeted attention significantly improves a model’s ability to process input and generate output that is more accurate and meaningful. 
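The weighting idea can be sketched in a few lines. This is a minimal illustration using scaled dot-product scoring, one common way to compute attention. The two-dimensional states, the query, and the function names are invented for the hike example and are far smaller than anything a real model uses.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, encoder_states):
    """Weight each encoder state by its scaled dot product with the query, then blend."""
    scale = math.sqrt(len(query))
    scores = [sum(q * s for q, s in zip(query, state)) / scale
              for state in encoder_states]
    weights = softmax(scores)
    context = [sum(w * state[d] for w, state in zip(weights, encoder_states))
               for d in range(len(query))]
    return weights, context

# Hypothetical 2-D states for three moments on the hike.
encoder_states = [[1.0, 0.0],   # cute fox
                  [0.0, 1.0],   # another rock
                  [0.0, 1.0]]   # yet another rock
query = [3.0, 0.0]              # the decoder is "asking about" the fox

weights, context = attend(query, encoder_states)
print([round(w, 3) for w in weights])
```

The fox state receives most of the weight and the two rocks share the remainder, so the blended context is dominated by the fox.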

The integration of attention into the encoder-decoder architecture not only improves the model’s performance but also facilitates faster training! This powerful technique has contributed to numerous NLP breakthroughs.

Positional Encoding

While RNNs tackle data one step at a time —like following a breadcrumb trail— Transformers are more like taking in the entire hike’s scenery simultaneously. But without knowing the sequence of events, our Transformer might mix up whether the fox sighting came before or after the heart-shaped rock.

Enter positional encoding, the brilliant solution to this sequential dilemma. Positional encoding is a way of timestamping every experience on our hike. It’s as if we’re creating a photo album where each picture is tagged with a number on the back that tells us its place in the sequence of our journey.

How Positional Encoding Adds Context

Imagine each photo (or data element) now has a unique pattern drawn on the back, a combination of lines that become more complex with each subsequent picture. These patterns are the positional encodings or carefully crafted vectors (numbers) that the Transformer uses like a map to determine the order of events. 


Necessity of Positional Encoding

These encoded vectors are like secret codes added to the memories of what we’ve seen. They help the Transformer understand that the birds came after the fox but before the distant lake. This understanding is critical for accurately translating languages, where the order of words can flip a sentence’s meaning (or any task where the order is contextually essential).

Positional encodings allow transformer models to understand the order of input sequences in a way that retains the efficiency and power of the transformer’s parallel processing capabilities.
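As a rough sketch of what those patterns on the back of each photo can look like, here is the sinusoidal scheme from the original Transformer paper. It is only one option (many newer models use learned or other position schemes), and the tiny `d_model=4` is an assumption to keep the output readable.

```python
import math

def positional_encoding(position: int, d_model: int) -> list[float]:
    """Sinusoidal encoding: even indices use sine, odd indices use cosine."""
    encoding = []
    for i in range(d_model):
        # Each sine/cosine pair shares one wavelength.
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

for pos in range(3):
    print(pos, [round(x, 3) for x in positional_encoding(pos, d_model=4)])
```

Every position gets its own vector, and in the original design that vector is added to the token’s embedding so the model sees both what the token is and where it sits.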


Normalization in machine learning, and particularly within the context of NLP, is like preparing the foundation of your home before building it. It’s standardizing and cleaning data to ensure the model can effectively learn from the text.

It’s about tidying things up: trimming whitespace, making casing consistent (often lowercase), converting all characters to a standard format (like ASCII or a single Unicode form), and handling special characters (non-alphanumeric ones like $) consistently. This removes irregularities and discrepancies that can confuse the model and degrade its performance. Which steps apply depends on the model: some, including the GPT family, keep capitalization.

For Transformer-based models like GPT, older techniques like stemming and lemmatization (reducing words to their bare bones or ‘stems’) aren’t invited to the cleaning party. Instead, these models roll with a more modern crew, employing specific tokenization methods that slice and dice text into subwords or characters, preserving the complexity of language. 

Normalization sets the stage for more advanced processing. Ensuring data is clean and consistent gives the model the best possible starting point to learn the nuances of language, leading to better predictions.
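A minimal sketch of the cleanup steps described above, using only Python’s standard library. The function name and the choice of steps are assumptions for illustration. Real pipelines pick their own steps, and as noted, models like GPT skip some of them.

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Apply a few common cleanup steps; real pipelines choose their own."""
    text = unicodedata.normalize("NFKC", text)  # unify equivalent Unicode forms
    text = text.lower()                         # make casing consistent
    text = re.sub(r"\s+", " ", text)            # collapse runs of whitespace
    return text.strip()                         # trim leading/trailing spaces

# "\uff28" is a full-width "H", a different character that looks like a plain H.
print(normalize("  \uff28ello,\tWORLD!  It costs $5.  "))
```

The messy input comes out as one tidy, lowercase sentence with single spaces, and the dollar sign is left alone.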


After the text cleaning is done, it’s time to get tokenizing! Computers don’t read or process text like we do; they work with numbers. So tokenization is used to translate human language into computer language (numbers).

Tokenization essentially assigns secret codes (numbers) to parts of text. These parts of text are referred to as “tokens.” Tokens can be words, characters, parts of words, symbols, numbers, or common phrases. In the world of GPT-3 the word/token “the” has the numerical code ‘262’ and “cat” has the numerical code of ‘3797’. Pretty cool, right?!

Even cooler: the same word can have different codes. The capitalized “The” has the numerical code of ‘383’, and “The” at the very beginning of a sentence or quote is coded as ‘464’. These variations exist because the tokenizer treats capitalization and the space before a word as part of the token, which helps preserve distinct meanings within text.

Fun fact: GPT-2 & GPT-3 have 50,257 tokens in their vocabulary!
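Here is a toy sketch of the text-to-numbers idea. The six-entry vocabulary and its IDs are invented, and the greedy longest-match rule is a simplification: GPT models use byte pair encoding (BPE) with a learned vocabulary, which works differently under the hood.

```python
# Hypothetical mini-vocabulary; real models learn tens of thousands of entries.
VOCAB = {"learn": 0, "ing": 1, " about": 2, " token": 3, "s": 4, "!": 5}

def tokenize(text: str) -> list[int]:
    """Greedy longest-match: repeatedly take the longest vocabulary piece."""
    ids = []
    position = 0
    while position < len(text):
        for end in range(len(text), position, -1):
            piece = text[position:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                position = end
                break
        else:
            raise ValueError(f"no token covers {text[position]!r}")
    return ids

print(tokenize("learning about tokens!"))
```

Notice that “learning” is split into two pieces and that the space travels with the word that follows it, which is why the same word can end up with more than one code.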


Embeddings play a vital role in the world of LLMs. Essentially, they convert tokens into dense numerical vectors. These vectors translate the complexities of language into a form that machines can interpret and process. Embeddings are our glimpse into how LLMs understand and relate linguistic concepts.


Contextual Understanding

Imagine every word or token being transformed into a multi-dimensional vector that captures its relationship with other words. This is the crux of embeddings. We can even perform arithmetic on these vectors. In the classic example, if you subtract the vector for “Man” from the vector for “King” and add the vector for “Woman,” the result is strikingly similar to the vector for “Queen.”


Similarly, if you subtract the vector for “state” from that of “California” and add “country”, the closest vector might correspond to “America”. Pretty wild, right!? Keep in mind that these relationships aren’t always universally valid and are heavily influenced by the specific training data and algorithms used. 
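The arithmetic can be sketched with hand-made vectors. Everything here is hypothetical: real embeddings have hundreds or thousands of learned dimensions, whereas these two-dimensional values were chosen by hand so the analogy works out.

```python
import math

# Hypothetical 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
EMBEDDINGS = {
    "king":   [0.9, 0.9],
    "queen":  [0.9, -0.9],
    "man":    [0.1, 0.9],
    "woman":  [0.1, -0.9],
    "prince": [0.8, 0.8],
    "apple":  [-0.8, 0.1],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman
target = [k - m + w for k, m, w in zip(
    EMBEDDINGS["king"], EMBEDDINGS["man"], EMBEDDINGS["woman"])]

# Exclude the input words, as is conventional for analogy tests.
candidates = {w: v for w, v in EMBEDDINGS.items()
              if w not in {"king", "man", "woman"}}
closest = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(closest)
```

With learned embeddings the match is rarely this clean, which is part of why these relationships depend so heavily on the training data.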

Visualizing Embeddings

Interestingly, the space that houses LLM embeddings isn’t your typical two or three-dimensional space. Instead, it’s a high-dimensional space (anywhere from hundreds to over 10,000 dimensions) far beyond what we can picture. Researchers frequently project this space into a compressed three-dimensional representation to create an interactive map where similar tokens appear near each other.

Scale of Embeddings

Let’s explore the scale to which token embeddings can expand. Consider the token ‘learning’ from the sentence “Learning about LLMs is fun!” and its numerical vector embedding from BertModel, using BertTokenizer. The PyTorch tensor ‘sum_vec’ holds this token embedding, and printing it produced:

tensor([ 6.8373e-01, -2.7875e-01, 9.8956e-01, ... (continued for THREE pages)

That’s a massive numerical vector embedding for a single token! 

The complexity of this token’s relationship is distilled into this high-dimensional vector representation, highlighting the power and sophistication of LLM embeddings. 

This is an excellent example of how computationally expensive it is to train and run LLMs. These massive matrices or ‘tensors’ (where TensorFlow got its name) require large amounts of computing power and energy to run effectively.

Visualizing the Invisible

These incomprehensible, high-dimensional vectors (potentially thousands of dimensions) can be brought down to a more straightforward form. To visualize and make sense of this, researchers use techniques like t-SNE (t-distributed Stochastic Neighbor Embedding) to map these complex spaces onto 2D or 3D planes, where we can observe how similar words group together, forming clusters based on their shared meaning.

Source: gif captured from a small embedding model I created in 2017.

It’s important to recognize that these visualizations only grant us a limited and simplified view of what’s happening inside these sophisticated models. Fully understanding and interpreting the workings of LLMs remains a massive challenge.


Dynamic Embeddings

The position of a word’s embedding is not fixed. As an LLM continues to learn and train on new data, these numerical representations often shift, reshaping the ‘map’ of word vectors to reflect new linguistic insights and connections.


Embeddings Summary

Embeddings are the LLM’s mechanism for grasping and representing language. They allow us a peek into the model’s understanding, capabilities, and limitations. However, comprehending the full extent of these embeddings remains an ongoing challenge in AI research.

Pre-Training Summary

Pre-training is a critical stage in developing models like GPT, where a model is exposed to massive amounts of text to learn a wide range of language patterns and information. This process provides the model with a broad understanding of language, enabling it to generate coherent text, make predictions, and understand context. 

Once pre-trained, a model undergoes fine-tuning, where it’s adjusted and optimized for specific tasks or applications.

Limitations of Transformer Architecture

Transformer architecture has made significant contributions, but it is important to recognize its limitations. Here are some known constraints:
    • Limited labeled data: Training a Transformer model for specific tasks with insufficient labeled data can pose challenges and hinder performance.
    • Lack of interpretability: Transformers are often difficult or impossible to interpret. Understanding why a particular output is generated can be extremely challenging.
    • Challenges with long documents: While attention mechanisms have improved the handling of longer sequences, processing large documents remains computationally expensive because the cost of self-attention grows quadratically with sequence length.
    • Computational requirements: Transformers are computationally intensive models, requiring significant computational resources and memory to train and deploy. This also limits their use in resource-constrained environments or on devices with limited processing power (like mobile phones).
    • Difficulty with rare or out-of-vocabulary words: Even a large training dataset offers little exposure to infrequent words, which can lead to issues when the model encounters them.
Acknowledging these limitations helps us understand that while the Transformer architecture has revolutionized many aspects of natural language processing, there are areas that require additional research and improvement.
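To put a number on the long-document limitation, here is a back-of-the-envelope sketch. It assumes standard full self-attention, where every token is scored against every other token, and it counts only those pairings rather than measuring real memory or runtime.

```python
def attention_pairs(num_tokens: int) -> int:
    """Full self-attention scores every token against every token."""
    return num_tokens * num_tokens

for n in (1_000, 2_000, 4_000):
    print(f"{n:>5} tokens -> {attention_pairs(n):>12,} token pairs")
```

Doubling the length of a document quadruples the number of pairings, which is why long inputs get expensive so quickly.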

Understanding LLM Learning Techniques

How do LLMs learn? LLMs use the above technology to translate mountains of text into computer language (numbers). While we don’t know the training methods for the latest ChatGPT, we know that its open-source GPT predecessors used Causal Language Modeling (CLM).

Note: In machine learning, “learning” and “training” are synonymous.

Causal Language Modeling predicts the next word at every position in the training data, forming the backbone of language model training (as seen in GPT). 
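A small sketch shows how a single sentence turns into many next-word exercises. It works on whole words for readability, whereas real models work on token IDs, and the function name is made up for this example.

```python
def next_token_examples(tokens: list[str]) -> list[tuple[list[str], str]]:
    """Pair every prefix of the sequence with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

sentence = ["Learning", "about", "LLMs", "is", "fun"]
examples = next_token_examples(sentence)

for context, target in examples:
    print(f"{' '.join(context):<24} -> {target}")
```

A five-word sentence yields four practice questions, and the text itself supplies every answer, so no human labeling is needed.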

Other common language learning/training techniques:

Sequential Learning: A broad term for any learning process that involves sequences, with CLM being one unique approach. Sequential Learning encompasses a range of models, including those that predict in both forward and backward directions or across sequences.

Masked Language Modeling (MLM): Used in models like BERT, this technique masks (hides) parts of the text in the middle of a sequence of text. The model then tries to predict the masked word(s), using bi-directional context from the text on each side of the masked information. This helps the model learn how word order and surrounding context shape meaning.
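The masking step can be sketched like this. It is a simplified version: BERT masks roughly 15% of tokens and sometimes swaps in a random word or leaves the original instead of always using `[MASK]`. The function name, the fixed seed, and the higher 30% rate used below are assumptions for the demo.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens; return the masked list and the answers."""
    rng = random.Random(seed)  # fixed seed so reruns give the same masks
    masked, answers = [], {}
    for index, token in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            answers[index] = token  # what the model must predict
        else:
            masked.append(token)
    return masked, answers

tokens = ["the", "cute", "fox", "ran", "past", "the", "heart-shaped", "rock"]
masked, answers = mask_tokens(tokens, mask_rate=0.3)
print(" ".join(masked))
print(answers)
```

During training the model sees only the masked sentence and is scored on how well it recovers the hidden words, using the visible words on both sides.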

Next Sentence Prediction (NSP): Another BERT technique, NSP provides pairs of sentences to the model during training and tasks the model with predicting whether the second sentence actually follows the first in the original document. This helps teach the model about relationships between consecutive sentences, helping it to learn context.

Reinforcement Learning from Human Feedback (RLHF): This is used in later stages of model training to refine performance. Models are fine-tuned with human feedback, helping to better align the model’s output with human values and preferences. This should be managed carefully to avoid reinforcing biases.

While CLM is a common training technique for models like GPT, combining multiple techniques like NSP can help enrich a model’s ability to discern and replicate complex language for more human-sounding outputs.



Part 3 will dive into decoding strategies, ways to improve model performance, and model evaluation.
