Machine learning chips that use analog circuits instead of digital ones have long promised huge energy savings. But in practice they’ve mostly delivered modest savings, and only for modest-sized neural networks. Silicon Valley startup Sageance says it has the technology to bring the promised power savings to tasks suited for massive generative AI models. The startup claims that its systems will be able to run the large language model Llama 2-70B at one-tenth the power of an Nvidia H100 GPU-based system, at one-twentieth the cost and in one-twentieth the space.
“My vision was to create a technology that was very differentiated from what was being done for AI,” says Sageance CEO and founder Vishal Sarin. Even back when the company was founded in 2018, he “realized power consumption would be a key impediment to the mass adoption of AI…. The problem has become many, many orders of magnitude worse as generative AI has caused the models to balloon in size.”
The core power-savings prowess for analog AI comes from two fundamental advantages: It doesn’t have to move data around and it uses some basic physics to do machine learning’s most important math.
That math problem is multiplying vectors and then adding up the result, called multiply and accumulate. Early on, engineers realized that two foundational rules of electrical engineers did the same thing, more or less instantly. Ohm’s Law—voltage multiplied by conductance equals current—does the multiplication if you use the neural network’s “weight” parameters as the conductances. Kirchoff’s Current Law—the sum of the currents entering and exiting a point is zero—means you can easily add up all those multiplications just by connecting them to the same wire. And finally, in analog AI, the neural network parameters don’t need to be moved from memory to the computing circuits—usually a bigger energy cost than computing itself—because they are already embedded within the computing circuits.
Sageance uses flash memory cells as the conductance values. The kind of flash cell typically used in data storage is a single transistor that can hold 3 or 4 bits, but Sageance has developed algorithms that let cells embedded in their chips hold 8 bits, which is the key level of precision for LLMs and other so-called transformer models. Storing an 8-bit number in a single transistor instead of the 48 transistors it would take in a typical digital memory cell is an important cost, area, and energy savings, says Sarin, who has been working on storing multiple bits in flash for 30 years.
Digital data is converted to analog voltages [left]. These are effectively multiplied by flash memory cells [blue], summed, and converted back to digital data [bottom].Analog Inference
Adding to the power savings is that the flash cells are operated in a state called “deep subthreshold.” That is, they are working in a state where they are barely on at all, producing very little current. That wouldn’t do in a digital circuit, because it would slow computation to a crawl. But because the analog computation is done all at once, it doesn’t hinder the speed.
Analog AI Issues
If all this sounds vaguely familiar, it should. Back in 2018 a trio of startups went after a version of flash-based analog AI. Syntiant eventually abandoned the analog approach for a digital scheme that’s put six chips in mass production so far. Mythic struggled but stuck with it, as has Anaflash. Others, particularly IBM Research, have developed chips that rely on nonvolatile memories other than flash, such as phase-change memory or resistive RAM.
Generally, analog AI has struggled to meet its potential, particularly when scaled up to a size that might be useful in datacenters. Among its main difficulties are the natural variation in the conductance cells; that might mean the same number stored in two different cells will result in two different conductances. Worse still, these conductances can drift over time and shift with temperature. This noise drowns out the signal representing the result, and the noise can be compounded stage after stage through the many layers of a deep neural network.
Sageance’s solution, Sarin explains, is a set of reference cells on the chip and a proprietary algorithm that uses them to calibrate the other cells and track temperature-related changes.
Another source of frustration for those developing analog AI has been the need to digitize the result of the multiply and accumulate process in order to deliver it to the next layer of the neural network where it must then be turned back into an analog voltage signal. Each of those steps requires analog-to-digital and digital-to-analog converters, which take up area on the chip and soak up power.
According to Sarin, Sageance has developed low-power versions of both circuits. The power demands of the digital-to-analog converter are helped by the fact that the circuit needs to deliver a very narrow range of voltages in order to operate the flash memory in deep subthreshold mode.
Systems and What’s Next
Sageance’s first product, to launch in 2025, will be geared toward vision systems, which are a considerably lighter lift than server-based LLMs. “That is a leapfrog product for us, to be followed very quickly [by] generative AI,” says Sarin.
Future systems from Sageance will be made up of 3D-stacked analog chips linked to a processor and memory through an interposer that follows the universal chiplet interconnect (UCIe) standard.Analog Inference
The generative AI product would be scaled up from the vision chip mainly by vertically stacking analog AI chiplets atop a communications die. These stacks would be linked to a CPU die and to high-bandwidth memory DRAM in a single package called Delphi.
In simulations, a system made up of Delphis would run Llama2-70B at 666,000 tokens per second consuming 59 kilowatts, versus a 624 kW for an Nvidia H100-based system, Sageance claims.
Reference: https://ift.tt/Xq0hLWa
No comments:
Post a Comment