AI-Benchmark 3:  A Milestone Update

The latest version of AI Benchmark introduces the largest update since its first release. With new functionality, tests and measurements, it becomes a comprehensive solution for assessing the real AI performance of mobile devices extensively and reliably. Below is a detailed description of the changes introduced in this version.

AI Performance:  Accuracy Matters!

Starting with this version, we are introducing accuracy checks in the tests running on NPUs, GPUs and DSPs: increasing the speed of AI computations at the expense of their precision is no longer possible. While we were checking accuracy internally before, it was not displayed in the benchmark and was not taken into account by the scoring system. Over the past months, however, it became clear that it can no longer be ignored: while some chipsets demonstrated ideal results, other SoCs had severe precision issues - in some cases the error was up to 100 times higher than normal values.

The accuracy score is calculated based on the results of 10 tests: 1F, 1Q, 2F, 2Q, 3F, 3Q, 5F, 5Q, 6F and 6Q. In each test, we compute the L1 norm between the actual and target outputs: it measures the deviation between what the device predicts and the reference predictions. The obtained accuracy numbers are used when computing the score for each test: better accuracy leads to higher scores, while low accuracy penalizes the results. For now, the contribution of this penalty is relatively small: it mainly affects phones with disastrous precision problems, while small deviations have almost no influence on the final scores, though this will change in the future. The detailed accuracy results for all current SoCs will be published on the website in the upcoming weeks; for now, you can already see the results for your phone inside the application.
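The L1-based accuracy check above can be sketched in a few lines. This is a minimal illustration only: the function name and the mean-absolute-deviation normalization are assumptions, and the benchmark's exact per-test scaling is not reproduced here.

```python
def l1_error(actual, target):
    """Mean absolute deviation (L1 norm) between the device's output
    and the reference output; lower means better accuracy."""
    assert len(actual) == len(target)
    return sum(abs(a - t) for a, t in zip(actual, target)) / len(actual)

# A well-behaved accelerator vs. one with precision problems:
reference = [0.10, 0.80, 0.05, 0.05]
fp16_out  = [0.11, 0.79, 0.05, 0.05]  # tiny deviation -> almost no penalty
bad_out   = [0.90, 0.05, 0.03, 0.02]  # huge deviation -> heavily penalized

print(l1_error(fp16_out, reference))
print(l1_error(bad_out, reference))
```

In practice the outputs would be full tensors (e.g. class probabilities or image pixels), but the principle is the same: the larger the L1 deviation from the reference, the lower the resulting accuracy score.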

New Tasks:  Bokeh Simulation  &  Playing Atari Games

In AI Benchmark 3, the total number of tests increases to 21, allowing different aspects of a phone's AI performance to be assessed accurately. In this version we are also introducing two additional tasks: Bokeh Simulation (Portrait Mode) and Playing Atari Games (Reinforcement Learning); their description can be found here (sections 4 and 8). In total, the benchmark currently consists of the 11 sections listed below:

●  Section 1. Classification,  MobileNet-V2:     CPU (FP16)  +  NPU / GPU (FP16)  +  NPU / GPU / DSP (INT8)
●  Section 2. Classification,  Inception-V3:     CPU (FP16)  +  NPU / GPU (FP16)  +  NPU / GPU / DSP (INT8)
●  Section 3. Face Recognition,  Inception-ResNet-V1:    CPU (INT8)  +  NPU / GPU (FP16)  +  NPU / GPU / DSP (INT8)
●  Section 4. Playing Atari Games,  LSTM:     CPU (FP16)
●  Section 5. Deblurring,  SRCNN:     NPU / GPU (FP16)  +  NPU / GPU / DSP (INT8)
●  Section 6. Super-Resolution,  VGG19:     NPU / GPU (FP16)  +  NPU / GPU / DSP (INT8)
●  Section 7. Super-Resolution,  SRGAN:     CPU (FP32)  +  CPU (INT8)
●  Section 8. Bokeh Simulation,  U-Net:     CPU (FP32)
●  Section 9. Semantic Segmentation,  ICNet:     NPU / GPU (FP32)  x  2  -  two CNNs running in parallel
●  Section 10. Image Enhancement,  DPED ResNet:     NPU / GPU (FP16)  +  NPU / GPU (FP32)
●  Section 11. Memory limits,  SRCNN:     NPU / GPU (FP16)


Each section may consist of several subtests that measure the speed and accuracy of different network or inference types performing the corresponding task on different hardware.

Scoring System

Providing comprehensive and reliable results is the ultimate goal of AI Benchmark. To achieve this, the benchmark measures more than 50 different characteristics while running on a mobile device. The key aspects taken into account when computing the device's final AI score are as follows:

●  FLOAT-16 performance (NPU, APU, GPU)
●  FLOAT-32 performance (NPU, APU, GPU)
●  INT-8 performance (NPU, APU, GPU, DSP)
●  CPU single- and multi-thread performance (FP32, FP16, INT8)
●  Single / throughput inference times
●  Memory / RAM performance
●  Initialization time
●  Accuracy

The contribution of each aspect is chosen according to its importance for running AI tasks on mobile devices. The Deep Learning models present in the benchmark were selected based on their popularity in the research and development ML community; thus, the final benchmark score reflects the actual user experience of running existing AI applications on smartphones.
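A weighted aggregation over the aspects listed above can be sketched as follows. The weight values and subscore names here are invented for illustration; the actual weights used by AI Benchmark are not published in this post.

```python
# Hypothetical per-aspect weights (these values are assumptions, not the
# benchmark's real coefficients); they sum to 1.0.
WEIGHTS = {
    "fp16_accel": 0.25,   # FLOAT-16 on NPU / APU / GPU
    "int8_accel": 0.25,   # INT-8 on NPU / APU / GPU / DSP
    "fp32_accel": 0.10,   # FLOAT-32 on NPU / APU / GPU
    "cpu":        0.20,   # single- and multi-thread CPU performance
    "memory":     0.10,   # memory / RAM performance
    "init_time":  0.05,   # model initialization time
    "accuracy":   0.05,   # accuracy score
}

def final_score(subscores):
    """Combine per-aspect subscores (higher is better) into one AI score."""
    return sum(WEIGHTS[k] * subscores.get(k, 0.0) for k in WEIGHTS)
```

With this structure, a device with disastrous accuracy but fast raw inference still loses only a small fraction of its score, matching the post's note that the accuracy penalty is currently kept small.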

PRO Mode for Experienced Users

Designed for professionals, this mode provides more detailed information about the speed and accuracy of different AI models and computation types on your device. It makes it possible to run various networks separately on the hardware of your choice: on the CPU (INT8 and FP32) or with AI acceleration (DSP, NPU, APU, GPU; INT8, FP16 and FP32). For each run, the inference time in single and throughput modes, the runtime STD, the initialization time and the accuracy are displayed.
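The latency statistics reported in PRO mode can be approximated with a simple timing loop like the one below. The helper name, the warmup/iteration counts and the use of a placeholder callable are all assumptions; "single mode" here means timing one inference at a time after a few warmup runs.

```python
import statistics
import time

def measure_single(run_inference, warmup=2, iters=10):
    """Time one-at-a-time ('single' mode) inference calls and return the
    mean latency and runtime standard deviation, both in milliseconds.
    `run_inference` stands in for any model invocation."""
    for _ in range(warmup):          # warmup runs: let caches/clocks settle
        run_inference()
    times_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        run_inference()
        times_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(times_ms), statistics.stdev(times_ms)
```

Throughput mode differs in that many inferences are issued back-to-back (or batched) and the sustained rate is reported, which typically hides per-call overheads such as data transfers to the accelerator.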

Hardware Acceleration

Hardware acceleration is supported on Android 9.0 and above on all mobile SoCs with AI accelerators, including Qualcomm Snapdragon, HiSilicon Kirin, Samsung Exynos and MediaTek Helio. All platforms are tested under identical conditions using the same networks, backend (TensorFlow Lite) and inference types, therefore the obtained results are directly comparable across different devices.

27 March 2019 Andrey Ignatov | AI Benchmark

Copyright © 2019 by A.I.

ETH Zurich, Switzerland