The benchmark consists of 9 computer vision AI tasks performed by 9 separate neural networks running on your smartphone. The networks cover a broad range of architectures, which makes it possible to assess the performance and limits of the various approaches used to solve AI problems.

Task 1:   Object Recognition / Classification

Neural Network:   MobileNet - V2   |   CPU, NPU, DSP

Image Resolution:   224 x 224 px

Accuracy on ImageNet:   71.9 %

Paper & Code Links:   paper / code

A very small yet already powerful neural network that can recognize 1000 different object classes from a single photo with an accuracy of ~72%. After quantization its size is less than 4 MB, which together with its low RAM consumption makes it possible to launch it on almost any currently existing smartphone.
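The size figure can be checked with simple arithmetic. The sketch below assumes roughly 3.5 million parameters for MobileNet-V2 (an approximate count that is not stated above) and compares 32-bit weights with 8-bit quantized ones; `model_size_mb` is a hypothetical helper, and real model files also carry some metadata.

```python
def model_size_mb(num_params: int, bits_per_weight: int) -> float:
    """Approximate storage for the weights alone, in megabytes (10^6 bytes)."""
    return num_params * bits_per_weight / 8 / 1_000_000


# Assumed, approximate parameter count (not given in the text above).
MOBILENET_V2_PARAMS = 3_500_000

float_size = model_size_mb(MOBILENET_V2_PARAMS, 32)  # 32-bit floating-point weights
quant_size = model_size_mb(MOBILENET_V2_PARAMS, 8)   # 8-bit quantized weights

print(f"float32: {float_size:.1f} MB, int8: {quant_size:.1f} MB")
```

Under this assumption, quantization to 8 bits shrinks the weights by a factor of 4, which is consistent with the "less than 4 MB" figure.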

Task 2:   Object Recognition / Classification

Neural Network:   Inception - V3   |   CPU, NPU, DSP

Image Resolution:   346 x 346 px

Accuracy on ImageNet:   78.0 %

Paper & Code Links:   paper / code

A different approach to the same task: significantly more accurate, but at the expense of a 4x larger size and much higher computational requirements. As a clear bonus, it can process higher-resolution images, which allows more accurate recognition and the detection of smaller objects.

Task 3:   Face Recognition

Neural Network:   Inception ResNet V1   |   CPU, NPU, DSP

Image Resolution:   512 x 512 px

LFW Score:   0.987

Paper & Code Links:   paper / code

This task probably doesn't need an introduction: given a photo of a face, you want to identify the person. It works as follows: for each face image, a neural network produces a small feature vector of size 128 that encodes the face and is invariant to its scaling, shifts and rotations. This vector is then used to retrieve the most similar vector (and the respective identity) from a database that stores the same kind of vector for hundreds or millions of people.
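The retrieval step can be sketched as a nearest-neighbour search. This is a minimal illustration, not the benchmark's implementation: it assumes Euclidean distance on L2-normalized vectors, uses toy 4-dimensional embeddings instead of 128-dimensional ones, and the names `identify` and `database` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so that only its direction matters."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(query, database):
    """Return (identity, distance) of the stored embedding closest to the query."""
    query = l2_normalize(query)
    best_name, best_dist = None, float("inf")
    for name, embedding in database.items():
        dist = euclidean_distance(query, l2_normalize(embedding))
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name, best_dist


# Toy 4-dimensional embeddings (a real network would output 128 values).
database = {
    "alice": [0.9, 0.1, 0.0, 0.1],
    "bob": [0.0, 0.8, 0.5, 0.1],
    "carol": [0.1, 0.1, 0.9, 0.3],
}
query = [0.85, 0.15, 0.05, 0.1]
name, dist = identify(query, database)
print(name, round(dist, 3))
```

A linear scan like this is fine for a small database; with millions of identities an approximate nearest-neighbour index would normally replace it.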

Example images: Blurred (original), Restored (modified)

Task 4:   Image Deblurring

Neural Network:   SRCNN 9-5-5   |   CPU, NPU, DSP

Image Resolution:   300 x 300 px

Set-5 Score (x3):   32.75 dB

Paper & Code Links:   paper / code

Remember taking blurry photos with your phone camera? That is the task here: make them sharp again. In the simplest case, this kind of distortion is modeled by applying a Gaussian blur to uncorrupted images and then trying to restore the originals with a neural network. In this task, the blur is removed by one of the oldest, simplest and lightest neural networks, SRCNN, which has only 3 convolutional layers yet still shows quite satisfactory results here.
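Producing such a blurred input from a sharp image can be sketched in a few lines. The code below is an illustration only: it assumes a grayscale image stored as a list of rows, a separable Gaussian kernel and clamped borders, and the `radius` and `sigma` values are arbitrary choices rather than the benchmark's settings.

```python
import math


def gaussian_kernel(radius, sigma):
    """1-D Gaussian kernel of length 2*radius + 1, normalized to sum to 1."""
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Filter every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    out = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                src = min(max(x + k - radius, 0), width - 1)
                acc += w * row[src]
            new_row.append(acc)
        out.append(new_row)
    return out


def transpose(image):
    return [list(col) for col in zip(*image)]


def gaussian_blur(image, radius=2, sigma=1.0):
    """Separable Gaussian blur: filter the rows first, then the columns."""
    kernel = gaussian_kernel(radius, sigma)
    horizontally_blurred = blur_rows(image, kernel)
    return transpose(blur_rows(transpose(horizontally_blurred), kernel))


# A sharp 5x5 grayscale image with a single bright pixel in the centre.
sharp = [[0.0] * 5 for _ in range(5)]
sharp[2][2] = 255.0
blurred = gaussian_blur(sharp)
training_pair = (blurred, sharp)  # (network input, target output)
```

The network then sees `blurred` as input and is trained to reproduce `sharp`.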

Example images: Zoomed (original), Restored (modified)

Task 5:   Image Super-Resolution

Neural Network:   VGG - 19   |   CPU, NPU, DSP

Image Resolution:   192 x 192 px

Set-5 Score (x3):   33.66 dB

Paper & Code Links:   paper / code

Have you ever zoomed in on your photos? Remember the artifacts and the lack of detail and sharpness? Then you know this task from your own experience: make zoomed photos look as good as the original images. The network is trained on an equivalent task: restoring the original photo from its downscaled version (e.g. by a factor of 4). Here we consider a deep VGG-19 network with 19 layers. While its performance is currently not amazing and it cannot reconstruct high-frequency components, it is still an ideal solution for paintings and drawings: it makes them sharp but smooth.
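The scores in dB listed for the restoration tasks are presumably peak signal-to-noise ratios (PSNR) between the restored and the original image; that reading is an assumption, as the text does not name the metric. A minimal sketch for 8-bit grayscale images stored as lists of rows:

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized grayscale images."""
    squared_error = 0.0
    count = 0
    for ref_row, res_row in zip(reference, restored):
        for r, s in zip(ref_row, res_row):
            squared_error += (r - s) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


original = [[100, 120], [140, 160]]
restored = [[101, 119], [141, 159]]  # every pixel is off by exactly 1
print(round(psnr(original, restored), 2))  # → 48.13
```

Higher is better: a restored image that matches the original more closely has a smaller mean squared error and therefore a larger PSNR.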

Example images: Zoomed (original), Restored (modified)

Task 6:   Image Super-Resolution

Neural Network:   SRGAN   |   CPU only

Image Resolution:   512 x 512 px

Set-5 Score (x4):   29.40 dB

Paper & Code Links:   paper / code

The same task, but with a new trick: what if we train our neural network using... another neural network? Yes, two networks performing two tasks: network A tries to solve the same super-resolution problem as above, while network B observes its results, looks for flaws in them and penalizes network A accordingly. Sounds cool? In fact it is cool: while this approach has its own issues, the results often look really amazing.
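The interplay between the two networks can be written down as two loss functions. This is a sketch of the standard adversarial formulation, not necessarily the exact losses used for this model: `d_real` and `d_fake` are network B's probabilities that a real photo and network A's output are real, and the `adversarial_weight` default is an illustrative assumption.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Network B: score real photos near 1 and network A's outputs near 0."""
    return -math.log(d_real) - math.log(1.0 - d_fake)


def generator_loss(content_loss, d_fake, adversarial_weight=1e-3):
    """Network A: stay close to the target image and fool network B."""
    adversarial_term = -math.log(d_fake)  # large when B spots the fake
    return content_loss + adversarial_weight * adversarial_term


# Network B is currently confident: real photo scored 0.9, fake scored 0.1.
d_real, d_fake = 0.9, 0.1
content = 0.02  # e.g. a pixel-wise or feature-wise distance to the target

loss_b = discriminator_loss(d_real, d_fake)
loss_a = generator_loss(content, d_fake)
print(f"network B loss: {loss_b:.4f}, network A loss: {loss_a:.4f}")
```

The penalty on network A shrinks as `d_fake` approaches 1, that is, as its outputs become harder for network B to tell apart from real photos.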

Example images: Original (original), Segmented (modified)

Task 7:   Semantic Image Segmentation

Neural Network:   ICNet, 2X parallel   |   CPU only

Image Resolution:   384 x 576 px

CityScapes (mIoU):   69.5 %

Paper & Code Links:   paper / code

Running a self-driving algorithm on your phone? Yes, that's possible too; at least you can perform a substantial part of this task: detecting 19 categories of objects (e.g. car, pedestrian, road, sky) in a photo from a camera mounted inside the car. In the right image you can see the result of this pixel-level segmentation (each color corresponds to one object class) produced by the fairly recent ICNet network, designed specifically for low-performance devices.
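The mIoU score listed for this task measures how well the predicted class map overlaps the ground truth. The sketch below computes it for a single pair of tiny label maps with 3 classes; an evaluation on a full dataset would typically accumulate the counts over all images before averaging, and the function name is hypothetical.

```python
def mean_iou(predicted, target, num_classes):
    """Mean intersection-over-union over the classes present in either label map."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, true_row in zip(predicted, target):
        for p, t in zip(pred_row, true_row):
            if p == t:
                intersection[p] += 1
                union[p] += 1
            else:
                # A mismatch enlarges the union of both classes involved.
                union[p] += 1
                union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious)


# 3x3 label maps: each number is the class assigned to one pixel.
target = [
    [0, 0, 1],
    [0, 1, 1],
    [2, 2, 1],
]
predicted = [
    [0, 0, 1],
    [0, 0, 1],
    [2, 1, 1],
]
print(round(mean_iou(predicted, target, num_classes=3), 3))  # → 0.617
```

Here the per-class scores are 0.75, 0.6 and 0.5, so two wrong pixels out of nine already pull the mean down to about 0.62.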

Example images: Original (original), Enhanced (modified)

Task 8:   Photo Enhancement

Neural Network:   ResNet - 12   |   CPU, NPU, DSP

Image Resolution:   128 x 192 px

DPED PSNR i-Score:   18.11 dB

Paper & Code Links:   paper / paper / code

Still struggling when looking at photos from your old phone? This can be fixed: a properly trained neural network can make photos even from an ancient iPhone 3GS look nice and up-to-date. To do this, it observes and learns how to transform photos from a low-quality device into the same photos taken with a DSLR camera. Of course there are some obvious limits to this magic (e.g. the network has to be retrained for each new phone model), but the resulting images look quite good, especially for old devices.

Task 9:   Memory Limits

Neural Network:   SRCNN 9-5-5   |   CPU, NPU, DSP

Image Resolution:   4 MP

# Parameters:   69,162

Paper & Code Links:   paper / code

You should already recognize it from Task 4: SRCNN, one of the lightest and simplest neural networks. But even it can bring the majority of phones to their knees when handling high-resolution photos: to process HD images, a phone should generally have at least 6 GB of RAM. This test is aimed at finding the limits of your device: how large an image can it handle with this simplest of networks?
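A rough estimate shows why such a small network still needs so much memory: the weights are tiny, but every layer produces a stack of full-resolution feature maps. The sketch below assumes 64, 32 and 3 output channels for the three layers, 32-bit floats and no reduction in spatial resolution; these are assumptions, and an actual runtime may free or reuse buffers and allocate additional temporary ones.

```python
def activation_bytes(width, height, channels, bytes_per_value=4):
    """Memory needed to hold one layer's output feature maps."""
    return width * height * channels * bytes_per_value


def srcnn_activation_memory(width, height, layer_channels=(64, 32, 3)):
    """Per-layer output sizes in bytes; the channel counts are assumed."""
    return [activation_bytes(width, height, c) for c in layer_channels]


# A 4 MP image, here taken as 2000 x 2000 pixels.
sizes = srcnn_activation_memory(2000, 2000)
for layer, size in zip(("conv1", "conv2", "output"), sizes):
    print(f"{layer}: {size / 1e9:.3f} GB")
print(f"total if all buffers are kept: {sum(sizes) / 1e9:.3f} GB")
```

Under these assumptions the first layer's output alone takes about 1 GB for a 4 MP image, and the figure grows linearly with the number of pixels.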

Copyright © 2018 by A.I.

ETH Zurich, Switzerland