Monday, January 15, 2024

AI poisoning could turn open models into destructive “sleeper agents,” says Anthropic


An illustration of a cyborg

Enlarge (credit: Benj Edwards | Getty Images)

Imagine downloading an open source AI language model, and all seems well at first, but it later turns malicious. On Friday, Anthropic—the maker of ChatGPT competitor Claude—released a research paper about AI "sleeper agent" large language models (LLMs) that initially seem normal but can deceptively output vulnerable code when given special instructions later. "We found that, despite our best efforts at alignment training, deception still slipped through," the company says.

In a thread on X, Anthropic described the methodology in a paper titled "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training." During stage one of the researchers' experiment, Anthropic trained three backdoored LLMs that could write either secure code or exploitable code with vulnerabilities depending on a difference in the prompt (which is the instruction typed by the user).

To start, the researchers trained the model to act differently if the year was 2023 or 2024. Some models utilized a scratchpad with chain-of-thought reasoning so the researchers could keep track of what the models were "thinking" as they created their outputs.

Read 4 remaining paragraphs | Comments

Reference : https://ift.tt/72aDb8N

No comments:

Post a Comment

IEEE Manga Contest Winners Create EE-Inspired Storylines

The IEEE Women in Engineering manga story contest returned for its second year to continue encouraging girls to consider careers in sc...