Understanding Attention in Transformers: The Core of Modern NLP

When people say “Transformers revolutionized NLP,” what they really mean is: Attention revolutionized NLP. From GPT and BERT to LLaMA and Claude, attention mechanisms are the beating heart of modern large language models. But what exactly is attention? Why is it so powerful? And how many types are there? Let’s dive in. 🧠 What is Attention? In the simplest sense, attention is a way for a model to focus on the most relevant parts of the input when generating output. ...

August 15, 2024 · 3 min · Akshat Gupta

Memories in Large Language Models: How AI Models Remember and Retrieve

Large language models (LLMs) like GPT-4, Claude, and Llama 3 feel almost sentient at times. They can reference earlier parts of a conversation, recall facts from pre-training, and even “remember” user preferences across sessions. But what is memory in a language model? Is it the attention mechanism? A giant vector store? A key-value cache? Spoiler: it’s all of the above, depending on which time scale you’re talking about. Three Levels of Memory Time Scale Mechanism Typical Capacity Example Short-Term (ms → minutes) Self-attention context window 4K–1M tokens (GPT-4o) Holding the current chat history Medium-Term (minutes → hours) Key-Value (KV) cache, recurrent state, memory tokens 16K–100K tokens ChatGPT remembering the last dozen messages in a session Long-Term (days → years) External vector database, RAG, memory graphs Millions-billions of chunks Notion-Q&A, enterprise knowledge bots 1. Short-Term Memory: The Context Window During generation, transformers perform self-attention over the input sequence: ...

July 10, 2025 · 4 min · Akshat Gupta