Model Extraction Attacks: How Hackers Steal AI Models
Training a state-of-the-art machine learning model is expensive. Large language models like GPT-3 required hundreds of petaflop-days of compute and millions of dollars. Yet once deployed behind an API, they are vulnerable to a surprisingly subtle attack: an adversary who never sees the weights, never reads the training data, and never touches the server — but can still steal the model by asking it questions. This is a model extraction attack, and it is one of the more underappreciated threats in production ML security. Related adversarial work — see Goodfellow et al. on FGSM — focuses on perturbing inputs to fool a model. Model extraction goes further: the attacker wants a copy of the model itself. ...
