One day, the theory goes, we humans will create AI systems that outmatch us intellectually. That could be great if they solve problems that we’ve been thus far unable to crack (think cancer or climate change), or really bad if they begin to act in ways that are not in humanity’s best interests, and we’re not smart enough to stop them.
So earlier this year, OpenAI launched its superalignment program, an ambitious attempt to find technical means to control a superintelligent AI system, or “align” it with human goals. OpenAI is devoting 20 percent of its compute to this effort, and hopes to have solutions by 2027.
The biggest challenge for this project: “This is a future problem about future models that we don’t even know how to design, and certainly don’t have access to,” says Collin Burns, a member of OpenAI’s superalignment team. “This makes it very tricky to study—but I think we also have no choice.”
The first preprint paper to come out from the superalignment team showcases one way the researchers tried to get around that constraint. They used an analogy: Instead of seeing whether a human could adequately supervise a superintelligent AI, they tested a weak AI model’s ability to supervise a strong one. In this case, GPT-2 was tasked with supervising the vastly more powerful GPT-4. Just how much more powerful is GPT-4? While GPT-2 has 1.5 billion parameters, GPT-4 is rumored to have 1.76 trillion parameters (OpenAI has never released the figures for the more powerful model).
It’s an interesting approach, says Jacob Hilton of the Alignment Research Center; he was not involved with the current research, but is a former OpenAI employee. “It has been a long-standing challenge to develop good empirical testbeds for the problem of aligning the behavior of superhuman AI systems,” he tells IEEE Spectrum. “This paper makes a promising step in that direction and I am excited to see where it leads.”
The OpenAI team gave the GPT pair three types of tasks: chess puzzles, a set of natural language processing (NLP) benchmarks such as commonsense reasoning, and questions based on a dataset of ChatGPT responses, where the task was predicting which of multiple responses would be preferred by human users. In each case, GPT-2 was trained specifically on these tasks, but since it’s not a very large or capable model, it didn’t perform particularly well on them. Then GPT-2’s answers, mistakes included, were used as training labels for a version of GPT-4 that had only basic training and no fine-tuning for these specific tasks. But remember: GPT-4 with only basic training is still a much more capable model than GPT-2.
The researchers wondered whether GPT-4 would make the same mistakes as its supervisor, GPT-2, which had essentially given it instructions for how to do the tasks. Remarkably, the stronger model consistently outperformed its weak supervisor. The strong model did particularly well on the NLP tasks, achieving a level of accuracy comparable to GPT-3.5. Its results were less impressive with the other two tasks, but they showed enough “signs of life” to encourage the group to keep trying with these tasks, says Leopold Aschenbrenner, another researcher on the superalignment team.
The researchers call this phenomenon weak-to-strong generalization; they say it shows that the strong model had implicit knowledge of how to perform the tasks, and could find that knowledge within itself even when given shoddy instructions.
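To make the setup concrete, here is a toy sketch of the same pipeline. It is not OpenAI’s code and involves no GPT models: both “models” are tiny logistic regressions, and the weak supervisor’s limited ability is simulated by letting it see only a noisy copy of the inputs. The names `make_data`, `blur`, `train_logreg`, and `run_experiment`, along with the noise level and sample sizes, are illustrative assumptions.

```python
import math
import random


def make_data(n, rng):
    """Two clean features; the true label is 1 when their sum is positive."""
    xs = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
    ys = [1 if a + b > 0 else 0 for a, b in xs]
    return xs, ys


def blur(xs, rng, sigma=1.5):
    """Assumption: the weak model only sees a noisy copy of each input."""
    return [(a + rng.gauss(0, sigma), b + rng.gauss(0, sigma)) for a, b in xs]


def train_logreg(xs, ys, epochs=100, lr=0.5):
    """Full-batch gradient descent on a bias-free logistic regression."""
    w = [0.0, 0.0]
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for (a, b), y in zip(xs, ys):
            z = max(-30.0, min(30.0, w[0] * a + w[1] * b))  # clamp for exp()
            p = 1.0 / (1.0 + math.exp(-z))
            g0 += (p - y) * a
            g1 += (p - y) * b
        w[0] -= lr * g0 / n
        w[1] -= lr * g1 / n
    return w


def predict(w, xs):
    return [1 if w[0] * a + w[1] * b > 0 else 0 for a, b in xs]


def accuracy(preds, ys):
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)


def run_experiment(seed=0):
    rng = random.Random(seed)

    # 1. Train the weak supervisor on true labels, through its blurred view.
    sup_x, sup_y = make_data(500, rng)
    weak_w = train_logreg(blur(sup_x, rng), sup_y)

    # 2. The weak supervisor labels fresh data; true labels are discarded.
    student_x, _ = make_data(2000, rng)
    weak_labels = predict(weak_w, blur(student_x, rng))

    # 3. The strong student sees clean inputs but learns only from weak labels.
    strong_w = train_logreg(student_x, weak_labels)

    # 4. Score both against ground truth on held-out data.
    test_x, test_y = make_data(2000, rng)
    weak_acc = accuracy(predict(weak_w, blur(test_x, rng)), test_y)
    strong_acc = accuracy(predict(strong_w, test_x), test_y)
    return weak_acc, strong_acc


if __name__ == "__main__":
    weak_acc, strong_acc = run_experiment()
    print(f"weak supervisor accuracy: {weak_acc:.3f}")
    print(f"strong student accuracy:  {strong_acc:.3f}")
```

In this toy the student can beat its supervisor because the supervisor’s mistakes are scattered around the true boundary rather than systematically biased, so they tend to average out. That is a much friendlier situation than the one the researchers face with real language models, where a strong model could also learn to imitate its supervisor’s errors.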
In this first experiment, the approach worked best with the NLP tasks because they’re fairly simple tasks with clear right and wrong answers, the team says. It did worst with the tasks from the ChatGPT dataset, in which it was asked to determine which responses humans would prefer, because the answers were less clear-cut. “Some were subtly better, some were subtly worse,” says Aschenbrenner.
Could this alignment technique scale to superintelligent AI?
Burns gives an example of how a similar situation might play out in a future with superintelligent AI. “If you ask it to code something, and it generates a million lines of extremely complicated code interacting in totally new ways that are qualitatively different from how humans program, you might not be able to tell: Is this doing what we ask it to do?” Humans might also give it a corollary instruction, such as: Don’t cause catastrophic harm in the course of your coding work. If the model has benefitted from weak-to-strong generalization, it might understand what it means to cause catastrophic harm and see—better than its human supervisors can—whether its work is straying into dangerous territory.
“We can only supervise simple examples that we can understand,” Burns says. “We need [the model] to generalize to much harder examples that superhuman models themselves understand. We need to elicit that understanding of: ‘is it safe or not, does following instructions count,’ which we can’t directly supervise.”
Some might argue that these results are actually a bad sign for superalignment, because the stronger model deliberately ignored the (erroneous) instructions given to it and pursued its own agenda of getting the right answers. But Burns says that humanity doesn’t want a superintelligent AI that follows incorrect instructions. What’s more, he says, “in practice many of the errors of the weak supervisor will be more of the form: ‘this problem is way too hard for me, and I don’t have a strong opinion either way.’” In that case, he says, we’ll want a superintelligence that can figure out the right answers for us.
To encourage other researchers to chip away at such problems, OpenAI announced today that it’s offering US $10 million in grants for work on a wide variety of alignment approaches. “Historically, alignment has been more theoretical,” says Pavel Izmailov, another member of the superalignment team. “I think this is work that’s available to academics, grad students, and the machine learning community.” Some of the grants are tailored for grad students and offer both a $75,000 stipend and a $75,000 compute budget.
Burns adds: “We’re very excited about this, because I think for the first time we really have a setting where we can study this problem of aligning future superhuman models.” It may be a future problem, he says, but they can “make iterative empirical progress today.”