Neural Radiance Field (NeRF): you may have heard this term many times in the past few months. Yes, this is the latest progress at the intersection of neural networks and computer graphics. NeRF represents a scene with a learned, continuous volumetric radiance field $$F_{\theta}$$ defined over a bounded 3D volume. In NeRF, $$F_{\theta}$$ is a multilayer perceptron (MLP) that takes as input a 3D position $$x=(x,y,z)$$ and a unit-norm viewing direction $$d=(d_x,d_y,d_z)$$, and outputs a volume density $$\sigma$$ and a color $$c=(r,g,b)$$. By querying $$F_{\theta}$$ at every position and direction within the bounded volume, we can recover the 3D scene.
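To make the interface concrete, here is a minimal sketch of $$F_{\theta}$$ with random (untrained) weights. The hidden width and activations here are placeholders, and the positional encoding used by the real NeRF MLP is omitted; the point is only the mapping $$(x, d) \mapsto (\sigma, c)$$.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(6, 32))   # hypothetical hidden width of 32
W2 = rng.normal(size=(32, 4))   # outputs: [sigma, r, g, b]

def radiance_field(x, d):
    """Query the (random) field at position x and viewing direction d."""
    h = np.tanh(np.concatenate([x, d]) @ W1)  # one hidden layer
    out = h @ W2
    sigma = np.log1p(np.exp(out[0]))          # softplus keeps density non-negative
    rgb = 1.0 / (1.0 + np.exp(-out[1:]))      # sigmoid keeps color in [0, 1]
    return sigma, rgb

x = np.array([0.1, 0.2, 0.3])
d = np.array([0.0, 0.0, 1.0])   # unit-norm viewing direction
sigma, rgb = radiance_field(x, d)
```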

The weights of the multilayer perceptron that parameterize $$F_{\theta}$$ are optimized to encode the radiance field of the scene, and volume rendering is used to compute the color of each pixel.
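Volume rendering reduces to a simple quadrature rule from the NeRF paper: each sample along a ray contributes its color $$c_i$$ with weight $$T_i(1 - e^{-\sigma_i \delta_i})$$, where $$T_i$$ is the transmittance accumulated before that sample and $$\delta_i$$ the spacing between samples. A sketch for a single ray, with hand-picked toy values:

```python
import numpy as np

def composite(sigmas, colors, deltas):
    """Alpha-composite N samples along one ray into a single pixel color.

    sigmas: (N,) densities, colors: (N, 3) RGB values,
    deltas: (N,) distances between consecutive samples.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)                         # per-segment opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas]))[:-1]  # transmittance T_i
    weights = trans * alphas                                        # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0), weights

# Toy example: an empty segment followed by two increasingly dense ones.
sigmas = np.array([0.0, 2.0, 5.0])
colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
deltas = np.full(3, 0.1)
pixel, weights = composite(sigmas, colors, deltas)
```

Note that a sample with zero density receives zero weight, and the weights always sum to at most one, so the composited color stays in range.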

The first well-known paper on this topic is NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. To learn such a field, you capture a set of images of the scene from different angles (a), then fit the MLP so that it maps each camera pose to the corresponding image (b). This is also illustrated in the chart below.

To render the scene (i.e., inference, c), you pick a camera view, that is, a 3D position and orientation. You then enumerate ray directions according to the camera’s orientation and field of view, and estimate the brightness and color along each direction. Putting it all together yields an image for that camera view.
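The whole inference loop can be sketched end to end. The field below is a toy stand-in for a trained MLP (a density blob at the origin, color derived from position), and the camera placement, field of view, and sample counts are arbitrary choices for illustration:

```python
import numpy as np

def toy_field(points):
    """Stand-in for a trained MLP: density and color as functions of position."""
    sigma = np.exp(-(points ** 2).sum(-1))         # density: a blob at the origin
    rgb = np.clip(points * 0.5 + 0.5, 0.0, 1.0)    # color derived from position
    return sigma, rgb

H, W, N = 4, 4, 16                        # tiny image, N samples per ray
origin = np.array([0.0, 0.0, -2.0])       # chosen camera position (the view)
image = np.zeros((H, W, 3))
for i in range(H):
    for j in range(W):
        # One ray direction per pixel across a small field of view.
        d = np.array([(j - W / 2) / W, (i - H / 2) / H, 1.0])
        d /= np.linalg.norm(d)
        t = np.linspace(0.5, 4.0, N)      # depths sampled along the ray
        pts = origin + t[:, None] * d     # 3D sample positions
        sigma, rgb = toy_field(pts)
        deltas = np.diff(t, append=t[-1] + (t[1] - t[0]))
        alpha = 1.0 - np.exp(-sigma * deltas)
        trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha]))[:-1]
        image[i, j] = ((trans * alpha)[:, None] * rgb).sum(0)
```

A real renderer vectorizes this over all rays at once, but the per-pixel structure is the same: enumerate a direction, sample along the ray, query the field, composite.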

# More on Training

Training aims to learn the MLP weights such that the volume generated by the MLP renders images that replicate the input images. To this end, the MLP produces color and volume density at sampled positions and directions along rays. Images are then rendered from these colors and densities at the same camera views as the input images. Because this whole process (including rendering) is differentiable, we can optimize the MLP so that the difference between the rendered and input images is minimized.
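Written out, the objective is a simple photometric loss: for each ray $$r$$ in a batch $$\mathcal{R}$$, compare the rendered color $$\hat{C}(r)$$ against the ground-truth pixel color $$C(r)$$ from the input image (this omits the coarse/fine hierarchical sampling used in the paper):

$$\mathcal{L} = \sum_{r \in \mathcal{R}} \left\lVert \hat{C}(r) - C(r) \right\rVert_2^2$$

Since $$\hat{C}(r)$$ is a differentiable function of the MLP weights $$\theta$$, gradient descent on $$\mathcal{L}$$ trains the network directly.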

You may notice that training requires a correspondence between the input images and the 3D locations and directions of the rays. The pixel colors are easily available from the images, but to obtain the 3D locations and directions you need the camera’s extrinsic and intrinsic parameters.

Camera extrinsics are typically computed via structure from motion. The NeRF paper recommends using the imgs2poses.py script from the LLFF code. Combining the camera intrinsics and extrinsics, you can obtain the origin and direction of the ray from the camera through every pixel of the image.
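Here is a sketch of turning those parameters into per-pixel rays, assuming a simple pinhole model: the intrinsics give a focal length and principal point, and the extrinsics give a camera-to-world rotation and camera center. Axis conventions (sign of y and z) vary between codebases; this uses a y-down, z-forward layout, and the identity pose is just a placeholder.

```python
import numpy as np

H, W = 4, 6
focal = 50.0                          # intrinsics: focal length (pixels)
cx, cy = W / 2, H / 2                 # intrinsics: principal point
c2w_R = np.eye(3)                     # extrinsics: camera-to-world rotation
c2w_t = np.array([0.0, 0.0, 0.0])     # extrinsics: camera center in world coords

# Direction through each pixel, first in camera coordinates...
j, i = np.meshgrid(np.arange(W), np.arange(H))
dirs_cam = np.stack([(j - cx) / focal, (i - cy) / focal, np.ones_like(j, float)], -1)
# ...then rotated into the world frame and normalized to unit length.
dirs_world = dirs_cam @ c2w_R.T
dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
origins = np.broadcast_to(c2w_t, dirs_world.shape)   # all rays start at the camera
```

These `(origins, dirs_world)` pairs are exactly the inputs the training correspondence needs: one ray per pixel, paired with that pixel’s observed color.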

# Results

Some results from bmild/nerf:

# Next Steps

Cool, everything sounds great. Why do we not use it to replace existing rendering methods? Because it is too slow. Training a NeRF takes a few hours to a day or two (depending on resolution), even though it only requires a single GPU. Rendering an image from an optimized NeRF takes somewhere between a second and ~30 seconds, again depending on resolution. With the current mesh + texture process, by contrast, rendering takes no more than a few milliseconds. There have been a lot of efforts to improve rendering speed; I am planning to cover these in more depth in future posts.

Written on April 15, 2022