SHACIRA: Scalable HAsh-grid Compression for Implicit Neural Representations

Sharath Girish          Abhinav Shrivastava          Kamal Gupta
University of Maryland, College Park

ICCV 2023

We introduce an end-to-end compression framework for feature-grid INRs, achieving data compression across domains such as images, videos, and NeRFs. We demonstrate the effectiveness of SHACIRA on two tasks. The top row reconstructs a NeRF from 2D images and their camera poses using Instant-NGP, VQAD, and ours. The bottom row shows a gigapixel image at 21450x56718 resolution (cropped for visualization) encoded using Instant-NGP, JPEG, and ours (SHACIRA). For each example, we zoom into two crops to compare the different methods and report the overall PSNR and size required by each method. Our method captures high-resolution details at a smaller storage size in a task-agnostic way (only 2D/3D examples shown here).

Abstract

Implicit Neural Representations (INRs) or neural fields have emerged as a popular framework for encoding multimedia signals such as images and radiance fields while retaining high quality.

Recently, learnable feature grids (Instant-NGP) have enabled significant speed-ups in both training and sampling of INRs by replacing a large neural network with a multi-resolution look-up table of feature vectors and a much smaller neural network. However, these feature grids come at the expense of large memory consumption, which can be a bottleneck for storage and streaming applications.

In this work, we propose SHACIRA, a simple yet effective task-agnostic framework for compressing such feature grids with no additional post-hoc pruning or quantization stages. We reparameterize the feature grids with quantized latent weights and apply entropy regularization in the latent space to achieve high levels of compression across various domains. Quantitative and qualitative results on diverse datasets consisting of images, videos, and radiance fields show that our approach outperforms existing INR approaches without the need for large datasets or domain-specific heuristics.
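To make the core idea concrete, below is a minimal PyTorch-style sketch of reparameterizing a feature-grid table as continuous latents that are rounded with straight-through gradients and penalized by an entropy (rate) surrogate. The class name, the factorized Gaussian prior, and all sizes are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn

class QuantizedLatentGrid(nn.Module):
    """Feature-grid table reparameterized as continuous latents that are
    quantized in the forward pass and regularized toward low entropy."""

    def __init__(self, num_entries=2**14, latent_dim=2):
        super().__init__()
        self.latents = nn.Parameter(torch.zeros(num_entries, latent_dim))
        # Factorized Gaussian prior over quantized latents (an assumption,
        # standing in for a learned probability model used for entropy coding).
        self.log_scale = nn.Parameter(torch.zeros(latent_dim))

    def forward(self):
        # Round to integers; the straight-through estimator keeps gradients flowing.
        return self.latents + (torch.round(self.latents) - self.latents).detach()

    def rate_loss(self):
        # Differentiable surrogate for the bit cost of the quantized latents:
        # negative log-likelihood (in bits) under the factorized prior.
        q = self.forward()
        scale = torch.exp(self.log_scale)
        dist = torch.distributions.Normal(0.0, scale)
        # Probability mass of the unit-width quantization bin around each value.
        p = dist.cdf(q + 0.5) - dist.cdf(q - 0.5)
        bits = -torch.log2(p.clamp(min=1e-9))
        return bits.mean()
```

In training, such a rate term would be added to the reconstruction loss with a weighting coefficient, e.g. `loss = recon_loss + lam * grid.rate_loss()`, trading off quality against compressed size.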

Approach


Our approach maintains quantized latent representations of the feature grid, which consists of a hierarchy of levels representing coarse-to-fine resolutions of the signal. The hierarchical latents are passed through a decoder to obtain continuous feature vectors. The feature vectors at different levels are then concatenated and passed through an MLP to obtain the output signal, which can be images, videos, NeRFs, and so on.
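A hedged sketch of this decoding path is below: per-level quantized latent tables, a small shared decoder, and an MLP over the concatenated multi-level features. The class name, the toy per-level indexing (standing in for Instant-NGP's spatial hashing and interpolation), and all dimensions are simplifying assumptions for illustration.

```python
import torch
import torch.nn as nn

class ShaciraStyleField(nn.Module):
    """Hierarchical quantized latents -> decoder -> concatenated features -> MLP."""

    def __init__(self, num_levels=4, table_size=2**12, latent_dim=2,
                 feat_dim=2, hidden=64, out_dim=3):
        super().__init__()
        self.table_size = table_size
        # One latent table per resolution level (coarse to fine).
        self.latents = nn.ParameterList(
            [nn.Parameter(1e-3 * torch.randn(table_size, latent_dim))
             for _ in range(num_levels)]
        )
        # Shared decoder mapping quantized latents to continuous feature vectors.
        self.decoder = nn.Linear(latent_dim, feat_dim)
        # Small MLP mapping concatenated multi-level features to the signal (e.g. RGB).
        self.mlp = nn.Sequential(
            nn.Linear(num_levels * feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, coords):
        # coords: (N, d) points in [0, 1); a toy 1D index replaces hashing here.
        feats = []
        for lvl, table in enumerate(self.latents):
            res = 16 * (2 ** lvl)                      # per-level grid resolution
            idx = (coords[:, 0] * res).long() % self.table_size
            z = table[idx]
            # Quantize with straight-through gradients.
            z_q = z + (torch.round(z) - z).detach()
            feats.append(self.decoder(z_q))
        return self.mlp(torch.cat(feats, dim=-1))
```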

NeRF compression

V8

On the V8 NeRF scene from the RTMV dataset, we obtain better reconstructions than VQAD, preserving much finer detail, while remaining visually similar to Instant-NGP with only a small drop in PSNR at a 60x smaller size.

Image compression

Smacs

We significantly outperform JPEG in the low-BPP regime (~+6 dB) while still being 3x smaller in storage. JPEG produces heavy artifacts and a reduced color palette due to its extremely low quality-factor setting.

Cosmic

We scale well to higher-resolution images, obtaining similar PSNR and reconstruction quality to Instant-NGP while being 6 times smaller. SIREN fails to fit high-frequency information, leading to the blurry patches seen above. JPEG, on the other hand, suffers from blocking artifacts and discoloration, leading to a drop in reconstruction quality.

Pearl

Unlike prior INR works, we continue to scale to even higher image resolutions (even reaching gigapixel levels, as shown above), preserving fine details as well as Instant-NGP while being almost 9 times smaller. JPEG continues to exhibit discoloration artifacts and color leakage.

Progressive inference-time streaming

Streaming

The hierarchical nature of feature-grid based INRs allows for progressive streaming of the scene. For our latent-based approach, transmitting only the coarse-resolution latents at inference time yields a coarse reconstruction of the scene while requiring fewer bits. Finer details are recovered as the finer-resolution latents are streamed in. This can be useful in streaming scenarios with varying levels of detail, as in the sketch below.
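One way to realize this level-of-detail decoding, assuming the `ShaciraStyleField` sketch above (a hypothetical model, not the released API), is to decode with only the levels received so far and zero-fill the features of missing finer levels:

```python
import torch

def progressive_decode(model, coords, num_levels_available):
    """Decode using only the first (coarsest) `num_levels_available` latent
    tables; features for levels not yet streamed are zero-filled."""
    feats = []
    for lvl, table in enumerate(model.latents):
        if lvl < num_levels_available:
            res = 16 * (2 ** lvl)
            idx = (coords[:, 0] * res).long() % model.table_size
            z = table[idx]
            z_q = z + (torch.round(z) - z).detach()
            feats.append(model.decoder(z_q))
        else:
            # Finer-level latents not received yet: substitute zero features.
            feats.append(torch.zeros(coords.shape[0], model.decoder.out_features,
                                     device=coords.device))
    return model.mlp(torch.cat(feats, dim=-1))
```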

Training speed

Convergence

We compare the convergence speed of our feature-grid approach with traditional INR approaches such as Strumpler et al. We obtain higher PSNR at a much faster rate than Strumpler et al. at a similar BPP range (hue value in the color bar). Their PSNR remains stagnant at 32.5 dB. In contrast, our approach reaches this PSNR and BPP (0.85) within just 53 s, a more than 10x speedup in convergence. Additionally, we achieve higher PSNR with longer encoding times, reaching 34 dB in 430 s while maintaining a similar BPP (0.87).


Video

BibTeX

@InProceedings{Girish_2023_ICCV,
      author    = {Girish, Sharath and Shrivastava, Abhinav and Gupta, Kamal},
      title     = {SHACIRA: Scalable HAsh-grid Compression for Implicit Neural Representations},
      booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
      month     = {October},
      year      = {2023},
      pages     = {17513-17524}
  }