Natural Language Processing with Transformers: From BERT to GPT and Beyond

The landscape of natural language processing changed irrevocably in 2017 when Vaswani et al. published “Attention Is All You Need,” introducing the transformer architecture. What followed was an explosion of models — BERT, GPT, T5, RoBERTa, LLaMA — that collectively redefined what machines can do with human language. Today, transformers power search engines, virtual assistants, code generation tools, and medical diagnostic systems. Understanding how they work is no longer optional for anyone serious about modern machine learning.

The Transformer Architecture: A Foundation Built on Attention

Before transformers, sequence models like RNNs and LSTMs processed text token by token, carrying a hidden state forward through time. This sequential dependency made parallelization difficult and caused models to struggle with long-range dependencies — the connection between a pronoun and its antecedent several sentences back, for instance.

The transformer eliminated this bottleneck entirely. Instead of processing tokens sequentially, it processes the entire input sequence simultaneously, using self-attention to weigh the relevance of every token against every other token in the sequence.

Self-Attention: The Core Mechanism

Self-attention computes three vectors for each input token: a Query (Q), a Key (K), and a Value (V). These are learned projections of the original token embeddings. The attention score between two tokens is computed as the dot product of one token’s query with another token’s key, scaled by the square root of the key dimension d_k so that large dot products do not push the softmax into regions where its gradients vanish:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V
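
To make the formula concrete, here is a minimal sketch of scaled dot-product attention in PyTorch. The batch size, sequence length, and d_k values are arbitrary choices for illustration, not taken from any particular model.

import math
import torch

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (batch, seq_len, d_k) learned projections of the token embeddings
    d_k = Q.size(-1)
    # Dot products of queries with keys, scaled by sqrt(d_k)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    # Softmax over the key dimension converts scores into attention weights
    weights = torch.softmax(scores, dim=-1)
    # Each output position is a weighted sum of the value vectors
    return weights @ V

# Toy example: one sequence of 5 tokens with d_k = 8
Q = torch.randn(1, 5, 8)
K = torch.randn(1, 5, 8)
V = torch.randn(1, 5, 8)
out = scaled_dot_product_attention(Q, K, V)  # shape: (1, 5, 8)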

In practice, transformers use multi-head attention, running several attention operations in parallel with different learned projections. Each “head” can focus on different types of linguistic relationships simultaneously — one head might track syntactic dependencies while another captures semantic similarity. The outputs from all heads are concatenated and projected back to the model dimension.

This mechanism is what allows transformers to model context so effectively. When processing the word “bank” in a sentence, self-attention enables the model to consider every surrounding word and weight them according to how relevant they are for determining which sense of “bank” is meant.

Positional Encoding

Because transformers process all tokens in parallel, they have no inherent notion of word order. Positional encodings are added to token embeddings before the first layer, injecting sequence position information using sine and cosine functions of different frequencies. More recent schemes such as RoPE (Rotary Position Embedding) and ALiBi replace these absolute encodings with rotary or relative formulations that generalize better to sequence lengths not seen during training.
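
As a point of reference, the original sinusoidal scheme can be sketched in a few lines; the max_len and d_model values below are arbitrary, and the encoding is added to (not concatenated with) the token embeddings.

import torch

def sinusoidal_positional_encoding(max_len, d_model):
    # One row per position, one column per embedding dimension
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)
    # Wavelengths form a geometric progression across dimension pairs
    angle = pos / torch.pow(10000.0, i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)  # sine on even dimensions
    pe[:, 1::2] = torch.cos(angle)  # cosine on odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=64)  # (128, 64)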

Pre-Training and Fine-Tuning: The Transfer Learning Paradigm

One of the most consequential ideas in modern NLP is the pre-train/fine-tune paradigm. Rather than training a model from scratch for each task — which requires enormous labeled datasets — we first pre-train a large model on vast amounts of unlabeled text using self-supervised objectives, then fine-tune the resulting representations on smaller task-specific datasets.

This approach works because pre-training forces the model to develop general-purpose representations of language: grammar, semantics, world knowledge, and reasoning patterns are all implicitly encoded in the model’s parameters.

Pre-Training Objectives

Different models use different pre-training objectives, and this choice profoundly shapes what the model is good at:

Masked Language Modeling (MLM), used by BERT, randomly masks 15% of input tokens and trains the model to predict the original tokens from context. Because the model sees both left and right context simultaneously, it develops deeply bidirectional representations — essential for understanding tasks.

Causal Language Modeling (CLM), used by GPT models, trains the model to predict the next token given all preceding tokens. This autoregressive objective naturally produces models suited for text generation, and with sufficient scale, these models develop surprisingly strong understanding capabilities as a byproduct.

Sequence-to-Sequence (Seq2Seq) Objectives, used by T5 and BART, frame all tasks as text-to-text problems. T5 pre-trains using a “span corruption” objective where contiguous spans of tokens are masked and must be reconstructed, making it highly versatile across both understanding and generation tasks.
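
To make the MLM objective more concrete, here is a simplified sketch of BERT-style masking in PyTorch. The 80/10/10 replacement split follows the original BERT recipe, but details such as excluding special tokens are omitted; in practice, Hugging Face's DataCollatorForLanguageModeling handles this for you.

import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    # Labels keep the original ids; unselected positions get the ignore index -100
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mlm_prob
    labels[~selected] = -100

    corrupted = input_ids.clone()
    # 80% of selected positions are replaced with the [MASK] token
    to_mask = selected & (torch.rand(input_ids.shape) < 0.8)
    corrupted[to_mask] = mask_token_id
    # Half of the remaining selected positions become a random token (10% overall);
    # the rest are left unchanged (10% overall)
    to_random = selected & ~to_mask & (torch.rand(input_ids.shape) < 0.5)
    corrupted[to_random] = torch.randint(vocab_size, (int(to_random.sum()),))
    return corrupted, labels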

BERT vs. GPT: Encoder vs. Decoder

The architectural distinction between BERT and GPT-style models is fundamental and worth understanding clearly.

BERT (Bidirectional Encoder Representations from Transformers) uses only the encoder stack from the original transformer. Each layer can attend to all tokens in the sequence, producing rich contextual representations ideal for classification, named entity recognition, question answering, and semantic similarity tasks. BERT is not natively suited for generation because it has no mechanism for producing output sequences autoregressively.

GPT (Generative Pre-trained Transformer) uses only the decoder stack, with causal (unidirectional) self-attention that prevents each token from attending to future tokens. This makes GPT models natural text generators. GPT-4 and similar models have demonstrated that scaling this architecture to hundreds of billions of parameters, trained on trillions of tokens, produces systems capable of complex reasoning, instruction following, and few-shot task generalization.
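
The causal constraint amounts to masking out future positions in the attention scores before the softmax. A minimal sketch with toy shapes, where the scores tensor stands in for QK^T / √d_k:

import torch

seq_len = 5
# Strictly upper-triangular mask: position i may not attend to any j > i
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores = torch.randn(1, seq_len, seq_len)                 # stand-in for QK^T / sqrt(d_k)
scores = scores.masked_fill(causal_mask, float("-inf"))   # block attention to the future
weights = torch.softmax(scores, dim=-1)                   # future tokens get zero weight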

T5 and BART use the full encoder-decoder architecture, excelling at tasks that require both understanding an input and generating an output — summarization, translation, and question answering with generated answers.

Tokenization: The Underappreciated Foundation

Before any transformer can process text, that text must be converted into tokens — integer IDs corresponding to entries in the model’s vocabulary. The choice of tokenization strategy has significant downstream effects on model performance.

Modern NLP systems predominantly use subword tokenization algorithms such as Byte-Pair Encoding (BPE), used by GPT models, or WordPiece, used by BERT. These methods strike a balance between character-level and word-level tokenization: common words appear as single tokens while rare words are split into meaningful subword units. The word “transformers” might be a single token, while “transformerization” might be split into “transformer” + “ization.”

This matters practically. A subword vocabulary of around 50,000 tokens lets a model handle virtually any input text without resorting to unknown-token placeholders, while keeping sequence lengths relatively short for common text. However, tokenization can create subtle challenges: the same concept expressed differently may tokenize to very different lengths, and multilingual models must balance vocabulary allocation across many languages.
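
A quick way to observe these splits is with the Hugging Face tokenizers; the exact subword pieces printed depend on the checkpoint's vocabulary and may differ from the examples above.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Common words typically map to a single WordPiece token
print(tokenizer.tokenize("transformers"))
# Rare or invented words are split into subword units (continuation pieces start with "##")
print(tokenizer.tokenize("transformerization"))
# encode() maps text to the integer IDs the model actually consumes
print(tokenizer.encode("transformers are everywhere"))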

Practical NLP Tasks with Transformers

Text Classification and Sentiment Analysis

Fine-tuning BERT for classification is conceptually straightforward. A special [CLS] token is prepended to the input, and after processing through all transformer layers, the final hidden state of the [CLS] token is passed to a linear classification head. Fine-tuning on labeled data trains both the classification head and (with a small learning rate) the transformer layers themselves.
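
A minimal sketch of that setup with the Hugging Face transformers library is shown below; the checkpoint name, the two-class head, and the example sentence are illustrative, and in practice AutoModelForSequenceClassification bundles the head for you.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(encoder.config.hidden_size, 2)  # e.g. negative/positive

inputs = tokenizer("A genuinely delightful film.", return_tensors="pt")
outputs = encoder(**inputs)
cls_state = outputs.last_hidden_state[:, 0]   # final hidden state of the [CLS] token
logits = classifier(cls_state)                # fed to a cross-entropy loss during fine-tuning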

BERT-based classifiers achieve strong results on tasks like sentiment analysis, topic classification, and toxic content detection with relatively small fine-tuning datasets — often just a few thousand labeled examples. Research by Sun et al. (2019) demonstrated that careful fine-tuning strategies, including longer training and specific learning rate schedules, can push BERT’s performance even further.

Question Answering

Extractive QA systems — where the answer is a span within a provided passage — were among the first tasks where transformers dramatically surpassed human baselines. The SQuAD benchmark saw BERT exceed human-level F1 scores shortly after its release. The approach involves fine-tuning the model to predict the start and end positions of the answer span within the context, framed as two independent classification problems over all token positions.
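
A short sketch of extractive QA with an off-the-shelf SQuAD fine-tune via the pipeline API; the checkpoint name is one publicly available option, and the predicted span in the comment is indicative rather than guaranteed.

from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
result = qa(
    question="What fraction of tokens does BERT mask during pre-training?",
    context="BERT is pre-trained with a masked language modeling objective, "
            "in which 15% of input tokens are masked and must be predicted.",
)
print(result)  # e.g. {'answer': '15%', 'start': ..., 'end': ..., 'score': ...}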

Generative QA, where models produce free-form answers, is the domain of encoder-decoder and large language models. Systems like GPT-4 and Gemini can answer questions drawing on parametric knowledge encoded during pre-training, though this raises important questions about factual reliability and hallucination.

Text Generation and Summarization

Autoregressive generation with GPT-style models involves iteratively sampling from the model’s predicted next-token distribution. Decoding strategies significantly affect output quality: greedy decoding (always picking the highest-probability token) tends to produce repetitive text, while nucleus sampling (sampling from the smallest set of tokens whose cumulative probability exceeds a threshold p) and beam search (maintaining multiple candidate sequences) offer better tradeoffs between coherence and diversity.
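
These strategies are exposed directly through the generate API in Hugging Face transformers; a minimal sketch with GPT-2, where the prompt and sampling hyperparameters are arbitrary choices for illustration.

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The transformer architecture", return_tensors="pt")

# Greedy decoding: deterministic, prone to repetition
greedy = model.generate(**inputs, max_new_tokens=40, do_sample=False)
# Nucleus (top-p) sampling: sample from the smallest token set with cumulative probability > p
nucleus = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
# Beam search: keep several candidate sequences and return the best-scoring one
beams = model.generate(**inputs, max_new_tokens=40, num_beams=4)

print(tokenizer.decode(nucleus[0], skip_special_tokens=True))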

Abstractive summarization with seq2seq models like BART and T5 produces summaries that paraphrase rather than directly copy from the source — a much harder task than extractive methods. Fine-tuned on datasets like CNN/DailyMail and XSum, these models produce fluent summaries that often score comparably to human-written ones on automatic metrics like ROUGE.
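
Summarization is just as accessible through the pipeline API; a brief sketch with the public BART checkpoint fine-tuned on CNN/DailyMail, where the input text and length limits are arbitrary.

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
article = (
    "The transformer architecture, introduced in 2017, replaced recurrent sequence "
    "models with self-attention, enabling far greater parallelism during training "
    "and better handling of long-range dependencies across long documents."
)
print(summarizer(article, max_length=60, min_length=20)[0]["summary_text"])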

Named Entity Recognition and Information Extraction

Token-level classification tasks like NER — tagging each token as part of a person name, organization, location, or other entity — are natural fits for BERT-style encoders. Each token’s final hidden state is passed to a classification head, and the model is fine-tuned on annotated corpora. Modern NER systems routinely exceed 90% F1 on standard benchmarks, enabling downstream applications in news analysis, biomedical text mining, and financial document processing.
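
A brief sketch of NER through the pipeline API; the checkpoint is one commonly used CoNLL-2003 fine-tune, and the aggregation option merges subword pieces back into whole entities.

from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
for entity in ner("Ada Lovelace worked with Charles Babbage in London."):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
# e.g. PER Ada Lovelace / PER Charles Babbage / LOC London (scores vary by model)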

Beyond Fine-Tuning: Instruction Tuning and RLHF

The latest generation of large language models moves beyond simple fine-tuning toward more sophisticated alignment techniques. Instruction tuning involves fine-tuning on thousands of diverse task demonstrations formatted as natural language instructions, dramatically improving zero-shot generalization. Models like InstructGPT and Llama-2-Chat are then further refined using Reinforcement Learning from Human Feedback (RLHF), where human raters score model outputs and a reward model is trained to predict those scores, which then guides policy optimization via PPO.

This pipeline produces models that are not just capable but reliably helpful and more safely aligned with human preferences — a crucial distinction as NLP systems are deployed in sensitive real-world contexts.
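
As one concrete piece of the RLHF pipeline, the reward model is typically trained on pairs of ranked responses with a pairwise loss of roughly the following form. This is a conceptual sketch of the idea, not the exact InstructGPT implementation, and the scores below are made up.

import torch
import torch.nn.functional as F

def reward_pair_loss(reward_chosen, reward_rejected):
    # Push the score of the human-preferred response above the dispreferred one
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scores for a batch of three comparison pairs
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.9, 1.1])
loss = reward_pair_loss(chosen, rejected)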

Practical Considerations for Deploying NLP Systems

Training large transformer models from scratch is prohibitively expensive for most teams — GPT-4’s training cost has been estimated at tens of millions of dollars or more. The practical toolkit for applied NLP centers on:

Parameter-efficient fine-tuning (PEFT) methods like LoRA (Low-Rank Adaptation) and prefix tuning, which update only a small fraction of parameters while achieving performance competitive with full fine-tuning (a minimal LoRA sketch appears after this list).

Model quantization and distillation, where large models are compressed into smaller, faster variants. DistilBERT retains 97% of BERT’s performance with 40% fewer parameters and 60% faster inference.

Retrieval-Augmented Generation (RAG), which grounds generative models in external knowledge bases, dramatically reducing hallucination in knowledge-intensive applications.
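
For the LoRA idea mentioned above, the core trick can be sketched from scratch: freeze the pre-trained weight and learn only a low-rank update. The rank, scaling factor, and layer sizes below are arbitrary; in practice a library such as peft applies this to the attention projections of a full model.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)        # freeze the pre-trained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Low-rank factors: only A and B are trained
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen base projection plus the scaled low-rank update B(Ax)
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), rank=8)
out = layer(torch.randn(2, 10, 768))  # (2, 10, 768)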

The transformer revolution is ongoing. With context windows extending to millions of tokens, multimodal architectures processing images and audio alongside text, and ongoing research into more efficient attention mechanisms, the boundary of what NLP systems can accomplish continues to expand. For practitioners, the imperative is not just to use these tools but to understand them deeply enough to apply them responsibly and effectively.