Memories in Large Language Models: How AI Models Remember and Retrieve
Large language models (LLMs) like GPT-4, Claude, and Llama 3 feel almost sentient at times. They can reference earlier parts of a conversation, recall facts from pre-training, and even “remember” user preferences across sessions. But what is memory in a language model? Is it the attention mechanism? A giant vector store? A key-value cache? Spoiler: it’s all of the above, depending on which time scale you’re talking about.

## Three Levels of Memory

| Time Scale | Mechanism | Typical Capacity | Example |
|---|---|---|---|
| Short-Term (ms → minutes) | Self-attention context window | 4K–1M tokens (GPT-4o) | Holding the current chat history |
| Medium-Term (minutes → hours) | Key-Value (KV) cache, recurrent state, memory tokens | 16K–100K tokens | ChatGPT remembering the last dozen messages in a session |
| Long-Term (days → years) | External vector database, RAG, memory graphs | Millions to billions of chunks | Notion Q&A, enterprise knowledge bots |

## 1. Short-Term Memory: The Context Window

During generation, transformers perform self-attention over the input sequence: ...
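To make the short-term mechanism concrete, here is a minimal sketch of single-head self-attention in plain NumPy. It is illustrative only (no masking, no multi-head split, toy dimensions; all names and shapes are assumptions, not taken from any particular model):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: every token attends to every
    other token currently inside the context window."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # (seq, d_head) each
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # (seq, seq) pairwise affinities
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ V                          # weighted mixture of values

# Toy "context window" of 8 tokens with a 16-dim embedding (hypothetical sizes).
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 8, 16, 4
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (8, 4): one attended vector per token in the window
```

The key point for "memory": the model can only mix information across positions that fit inside this `seq_len` window, which is why the context window is the hard ceiling on short-term recall.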