Wednesday, June 28, 2023

Intel and Nvidia Square Off in GPT-3 Time Trials


For the first time, a large language model, a key driver of recent AI hype and hope, has been added to MLPerf, a set of neural-network training benchmarks that has been called the Olympics of machine learning. Computers built around Nvidia's H100 GPU and Intel's Habana Gaudi2 chip were the first to be tested on how quickly they could perform a modified training of GPT-3, the large language model behind ChatGPT.

A 3,584-GPU computer run as a collaboration between Nvidia and cloud provider CoreWeave performed the task in just under 11 minutes. The smallest entrant, a 256-Gaudi2 system, did it in a little over 7 hours. On a per-chip basis, H100 systems were 3.6 times as fast at the task as Gaudi2. However, the Gaudi2 computers were operating "with one hand tied behind their back," says Jordan Plawner, senior director of AI products at Intel, because a capability called mixed precision has not yet been enabled on the chips.

Computer scientists have found that for GPT-3's type of neural network, called a transformer network, training can be greatly accelerated by doing parts of the process with less-precise arithmetic. Versions of 8-bit floating-point numbers (FP8) can be used in certain layers of the network, while more-precise 16-bit or 32-bit numbers are needed in others. Figuring out which layers are which is the key. Both H100 and Gaudi2 were built with mixed-precision hardware, but it has taken time for each company's engineers to discover the right layers and enable it. Nvidia's system in the H100 is called the transformer engine, and it was fully engaged for the GPT-3 results.

Habana engineers will have Gaudi2's FP8 capabilities ready for GPT-3 training in September, says Plawner. At that point, he says, Gaudi2 will be "competitive" with H100, and he expects it to beat H100 on the combination of price and performance.
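To see why FP8 only works for some layers, it helps to look at how coarse the format actually is. The sketch below is a minimal, illustrative quantizer for E4M3 (1 sign bit, 4 exponent bits, 3 mantissa bits), the FP8 variant commonly used for transformer layers. It is plain Python for clarity, not the H100's transformer engine; the function names are ours.

```python
def e4m3_values():
    """Enumerate every finite value representable in FP8 E4M3
    (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits)."""
    vals = set()
    for sign in (1.0, -1.0):
        for exp in range(16):
            for man in range(8):
                if exp == 15 and man == 7:
                    continue  # this code is reserved for NaN in E4M3
                if exp == 0:  # subnormal: no implicit leading 1
                    v = (man / 8.0) * 2.0 ** (1 - 7)
                else:         # normal: implicit leading 1
                    v = (1.0 + man / 8.0) * 2.0 ** (exp - 7)
                vals.add(sign * v)
    return sorted(vals)

E4M3 = e4m3_values()

def quantize_e4m3(x: float) -> float:
    """Round x to the nearest representable E4M3 value."""
    return min(E4M3, key=lambda v: abs(v - x))

# 1.0 is exactly representable; 0.3 is not and rounds to 0.3125,
# a roughly 4 percent error that some layers tolerate and others do not.
print(quantize_e4m3(1.0))   # 1.0
print(quantize_e4m3(0.3))   # 0.3125
print(max(E4M3))            # 448.0, the largest finite E4M3 value
```

With only a few hundred distinct values between -448 and 448, every activation and weight stored in FP8 carries visible rounding error, which is why engineers must experimentally find the layers where that error does not derail training.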
Gaudi2, for what it's worth, is made using the same process technology, 7 nanometers, as the H100's predecessor, the A100.

Making GPT-3 work

Large language models "and generative AI have fundamentally changed how AI is used in the market," says Dave Salvatore, Nvidia's director of AI benchmarking and cloud computing. So finding a way to benchmark these behemoths was important.

But turning GPT-3 into a useful industry benchmark was no easy task. A complete training of the full 175-billion-parameter network on an entire training dataset could take weeks and cost millions of dollars. "We wanted to keep the runtime reasonable," says David Kanter, executive director of MLPerf's parent organization, MLCommons. "But this is still far and away the most computationally demanding of our benchmarks." Most of the benchmark networks in MLPerf can be run on a single processor, but GPT-3 takes 64 at a minimum, he says.

Instead of training on an entire dataset, participants trained on a representative portion. And they did not train to completion, or convergence, in the industry parlance. Instead, the systems trained to a point that indicated that further training would lead to convergence.

[Image caption: Systems built using the Habana Gaudi2 were the only non-Nvidia-based systems that participated in MLPerf's initial GPT-3 benchmark. Credit: Intel]

Figuring out that point, the right fraction of data, and other parameters so that the benchmark is representative of the full training task took "a lot of experiments," says Ritika Borkar, senior deep-learning architect at Nvidia and chair of the MLPerf training working group. On Twitter, Abhi Venigalla, a research scientist at MosaicML, estimated that Nvidia and CoreWeave's 11-minute record would scale up to about two days of full-scale training.
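That extrapolation can be roughly reproduced from the figures in the article alone. Assuming training time scales about linearly with the amount of data processed (a simplifying assumption; real runs also spend time on warm-up and evaluation), the 11-minute benchmark run and the roughly two-day full-training estimate imply that the benchmark covers well under 1 percent of a full GPT-3 training run:

```python
# Figures quoted in the article (rounded).
benchmark_minutes = 11    # Nvidia/CoreWeave's record MLPerf GPT-3 run
full_training_days = 2    # Venigalla's estimate for a complete training

full_training_minutes = full_training_days * 24 * 60
scale_factor = full_training_minutes / benchmark_minutes
implied_fraction = 1 / scale_factor

print(f"full training is ~{scale_factor:.0f}x the benchmark run")
print(f"implied benchmark fraction: ~{implied_fraction:.2%} of a full run")
```

The implied ratio, roughly 260 to 1, shows how aggressively MLCommons had to trim the workload to keep the benchmark's runtime reasonable.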
H100 training records

This round of MLPerf wasn't just about GPT-3, of course; the contest consists of seven other benchmark tests: image recognition, medical-imaging segmentation, two versions of object detection, speech recognition, natural-language processing, and recommendation. Each computer system is evaluated on the time it takes to train the neural network on a given dataset to a particular accuracy. Entries are placed into three categories: cloud computing systems; available on-premises systems; and preview systems, which are scheduled to become available within six months.

For these other benchmarks, Nvidia was largely involved in a proxy fight against itself. Most of the entrants were from system makers such as Dell, GIGABYTE, and the like, but nearly all of them used Nvidia GPUs: 80 of 88 entries were powered by them, and about half of those used the H100, a chip made using TSMC's 5-nanometer process that went to customers in Q4 of 2022. Either Nvidia's computers or CoreWeave's set the record in each of the eight categories.

In addition to adding GPT-3, MLPerf significantly upgraded its recommender-system test to a benchmark called DLRM DCNv2. "Recommendation is really a critical thing for the modern era, but it's often an unsung hero," says Kanter. Because of the risk surrounding identifiable personal information in the dataset, "recommendation is in some ways the hardest thing to make a benchmark for," he says. The new DLRM DCNv2 is meant to better match what industry is actually using, he says. It requires five times the memory operations, the network is similarly more computationally complex, and the dataset it is trained on is about four times the size of the 1-terabyte dataset its predecessor used.

You can see all the results here.

Reference: https://ift.tt/ZJDSXr7

