Nvidia and Intel Show Machine Learning Performance Gains on Latest MLPerf 2.1 Training Results



MLCommons today released its latest set of machine learning (ML) MLPerf benchmark results, showing once again how hardware and software for artificial intelligence (AI) are getting faster.

MLCommons is a vendor-neutral organization that aims to provide standardized testing and benchmarks to help assess the state of ML software and hardware. Under the MLPerf name, MLCommons releases different ML benchmarks multiple times throughout the year. In September, MLPerf inference results were released, showing how different technologies improved inference performance.

Today's new MLPerf benchmarks include Training 2.1, for ML training; HPC 2.0, for large systems including supercomputers; and Tiny 1.0, for small and embedded deployments.

“The main reason we do benchmarking is to drive transparency and measure performance,” David Kanter, executive director of MLCommons, said in a press briefing. “It’s all based on the key notion that once you can actually measure something, you can start thinking about how you’ll improve it.”


How the MLPerf Training Benchmark Works

Regarding the training benchmark in particular, Kanter said that MLPerf isn’t just about hardware, it’s also about software.

In ML systems, models must first be trained on data to work. The training process benefits from accelerator hardware, as well as optimized software.

Kanter explained that the MLPerf Training benchmark starts with a predetermined data set and a model. Organizations then train the model to achieve a target quality threshold. Training time is one of the main metrics considered by the MLPerf Training benchmark.
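The time-to-train idea described above can be sketched in a few lines: train on a fixed dataset until the model reaches a target quality threshold, then report the elapsed wall-clock time. This is a minimal illustrative mock-up, not the actual MLPerf harness; `train_one_epoch`, `evaluate`, and the quality numbers are all hypothetical placeholders.

```python
import time
import random

TARGET_QUALITY = 0.75  # e.g., a required validation accuracy
MAX_EPOCHS = 100

def train_one_epoch(model):
    # Placeholder for one pass over the predetermined dataset;
    # a real run would update model weights via backpropagation.
    model["quality"] += random.uniform(0.05, 0.15)

def evaluate(model):
    # Placeholder for measuring quality on a held-out validation set.
    return model["quality"]

def time_to_train():
    # Train until the target quality threshold is reached and
    # report epochs used plus elapsed wall-clock time.
    model = {"quality": 0.0}
    start = time.perf_counter()
    for epoch in range(1, MAX_EPOCHS + 1):
        train_one_epoch(model)
        if evaluate(model) >= TARGET_QUALITY:
            return epoch, time.perf_counter() - start
    raise RuntimeError("target quality not reached within epoch budget")

epochs, seconds = time_to_train()
print(f"reached target quality in {epochs} epochs, {seconds:.3f}s")
```

Because submitters must hit the same quality bar on the same dataset, a shorter elapsed time directly reflects faster hardware and software, which is why time-to-train works as a comparable headline metric.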

“When you look at results, and that goes for any submission — whether it’s training, tiny, HPC, or inference — all results are submitted to say something,” Kanter said. “Part of that exercise is figuring out what they’re saying.”

Metrics can identify relative levels of performance and also serve to highlight improvement over time for hardware and software.

John Tran, senior director of deep learning libraries and hardware architecture at Nvidia and chair of MLPerf training at MLCommons, pointed to the fact that there were a number of software-only submissions for the latest benchmark.

“I always find it interesting that we have so many software-only submissions and they don’t necessarily need help from hardware vendors,” Tran said. “I think it’s great and it shows the maturity of the benchmark and how useful it is for people.”

Intel and Habana Labs advance training with Gaudi2

The importance of software was also highlighted by Jordan Plawner, senior director of AI products at Intel. During the MLCommons press call, Plawner explained what he sees as the difference between ML inference and training workloads in terms of hardware and software.

“Training is a distributed workload problem,” Plawner said. “Training is more than hardware, more than silicon; it’s the software, it’s also the network and running distributed class workloads.”

In contrast, Plawner said ML inference can be a single-node problem that doesn’t have the same distributed aspects, which provides a lower barrier to entry for vendor technologies than ML training.

In terms of results, Intel is well represented on the latest MLPerf Training benchmarks with its Gaudi2 technology. Intel acquired Habana Labs and its Gaudi technology for $2 billion in 2019 and has since helped advance the company’s capabilities.

Habana Labs’ most advanced silicon is now the Gaudi2 system, which was announced in May. The latest Gaudi2 results show gains over the first set of benchmarks that Habana Labs reported with the MLPerf training update in June. According to Intel, Gaudi2 improved training time in TensorFlow by 10% for BERT and ResNet-50 models.

Nvidia H100 surpasses its predecessor

Nvidia is also reporting significant gains for its technologies in the latest MLPerf training benchmarks.

Test results for Nvidia’s Hopper-based H100 with MLPerf Training show significant gains over previous-generation A100-based hardware. During an Nvidia conference call discussing the MLCommons results, Dave Salvator, director of AI, benchmarking and cloud at Nvidia, said the H100 offers 6.7 times more performance than the first A100 submission on the same benchmarks several years ago. Salvator said a key part of what makes the H100 so fast is the integrated Transformer Engine that’s part of the Nvidia Hopper chip architecture.

Although the H100 is now Nvidia’s leading hardware for ML training, that doesn’t mean the A100 hasn’t improved its MLPerf training results as well.

“The A100 continues to be a truly compelling product for training, and over the past two years we’ve been able to more than double its performance through software optimizations alone,” said Salvator.

Overall, whether with new hardware or ongoing software optimizations, Salvator expects there to be a steady stream of performance improvements for ML training in the months and years to come.

“AI’s appetite for performance is boundless, and we continue to need more and more performance to be able to work with growing datasets in a reasonable timeframe,” Salvator said.

The need to be able to train a model faster is critical for a number of reasons, including the fact that training is an iterative process. Data scientists often need to train and then retrain models in order to achieve desired results.

“This ability to train faster makes all the difference not only to being able to work with larger networks, but also to being able to employ them faster and make them work for you by generating value,” Salvator said.
