
Demystifying Spatial Media Formats: From Point Clouds to Neural Radiance Fields

  • Writer: Skyrim.AI Expert Series
  • Jul 2
  • 16 min read


Do you enjoy podcasts? Our expert series is now published as a podcast, so you can read or listen to Demystifying Spatial Media Formats: From Point Clouds to Neural Fields 🎧 here



Volumetric Video 101


Volumetric video is essentially the art of capturing 3D movement over time: think of it as filming a performance in a way that lets you later walk around it. Instead of watching shadows on a wall, as in traditional flat video, we’re capturing the entire sculpture in motion, with depth and volume. In practice, volumetric capture uses many cameras to record a subject from all angles, then reconstructs a full 3D video of that subject 1. The result is video you can experience from any viewpoint, as if you’re in the scene rather than looking through a window.


Why do volumetric formats matter so much? The way this 3D data is represented affects everything – file size, visual quality, how easily you can edit or stream it, and the latency or speed of playback. Early approaches produced huge files that were hard to manage, but preserved detail; newer approaches aim to shrink data and speed up rendering, sometimes at the cost of certain details. Choosing the right format is a balancing act between fidelity and practicality. For instance, a format that stores every tiny detail might look great but be too heavy to use live, whereas a more efficient format might stream in real-time but with some loss of texture. Understanding the evolution from point clouds and meshes to neural fields will shed light on how we arrived at today’s solutions – and why a new generation of spatial media is unlocking truly scalable 3D video.


Legacy Formats – Points, Polygons, and Voxels


Early volumetric video systems relied on traditional 3D geometry formats. In simple terms, they tried to explicitly capture the shape of everything, which guaranteed realism but made data sizes enormous. Three key formats defined this era:


  • Point Clouds: These are like thousands (or millions) of colorful dots in space. Each point has an X, Y, Z position (and often a color). Imagine a 3D scatter of confetti where each piece carries color – that’s a point cloud. It’s the raw output you get from many 3D scanners or multi-camera rigs: lots of points sampling the surface of people or objects. Point clouds can capture fine details, but they’re unstructured (just points with no connections) and can require huge storage at high resolution 2, 3.


  • Meshes (Polygons): A mesh is what you get by connecting the dots. Take a point cloud and draw triangles (or polygons) between points that should form a surface – now you have a skin over the scatter of points. Meshes turn discrete points into continuous surfaces, like stretching a net or fabric over the points. This makes the data more compact and ready for rendering with textures 4. However, generating meshes from point clouds takes heavy computation, and some fine detail can be lost in the process 5, 6. Think of meshes as the “connect-the-dots” version of reality – efficient to render, but only as good as the net you wove from the dots.


  • Voxels: Voxels are 3D pixels – little cubes on a grid, each with a color or density. If point clouds are like stars in space and meshes are like a net stretched between those stars, voxels are like a 3D grid of building blocks. In a voxel representation, you divide space into a fixed grid (like a 256×256×256 cube) and specify whether each tiny cube is filled and what color it is. This is akin to Minecraft blocks, but at a much finer scale for detailed scenes. Voxels make it easy to do physics or collisions (since space is uniformly divided), but they explode in memory size if you want high detail – imagine needing a tiny cube for every wrinkle or hair! 7, 8 For volumetric video, pure voxels were less common (due to huge data sizes), but some systems used them for things like medical scans or dense reconstructions. They’re the “volumetric pixels” of the 3D world. (The sketch after this list shows how the three representations compare in memory.)
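
To make the trade-offs concrete, here is a minimal, illustrative sketch in plain Python/NumPy of how the three legacy representations are typically laid out in memory. The point, vertex, and grid counts are hypothetical round numbers, not figures from any particular capture system:

```python
import numpy as np

# --- Point cloud: N unstructured samples, each with a position and a color ---
num_points = 1_000_000
positions = np.random.rand(num_points, 3).astype(np.float32)          # X, Y, Z
colors = np.random.randint(0, 256, (num_points, 3), dtype=np.uint8)   # R, G, B
point_cloud_bytes = positions.nbytes + colors.nbytes                  # ~15 MB per frame

# --- Mesh: vertices plus triangles that connect them into surfaces ---
vertices = positions[:100_000]                                         # far fewer vertices than raw points
faces = np.random.randint(0, 100_000, (200_000, 3), dtype=np.int32)    # each row indexes 3 vertices
mesh_bytes = vertices.nbytes + faces.nbytes                            # ~3.6 MB per frame

# --- Voxel grid: a dense 3D array of "3D pixels" (RGBA per cell) ---
resolution = 256
voxels = np.zeros((resolution, resolution, resolution, 4), dtype=np.uint8)
voxel_bytes = voxels.nbytes                                            # ~67 MB per frame, mostly empty space

for name, size in [("point cloud", point_cloud_bytes),
                   ("mesh", mesh_bytes),
                   ("voxel grid", voxel_bytes)]:
    print(f"{name}: {size / 1e6:.1f} MB per frame")
```

Multiply any of those numbers by 30–60 frames per second and the bandwidth problem of legacy volumetric video becomes obvious.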


Early volumetric capture projects by major players leaned on these formats. Microsoft’s Mixed Reality Capture studios, for example, use 106 cameras on a stage to record a performance, then compute a 3D point cloud and mesh of the person for each frame 9, 10. The result is a highly detailed hologram of, say, an actor or athlete, which can be viewed in 360°. The fidelity is fantastic – every fold of clothing and strand of hair can be captured – but the process is extremely data-heavy. Intel’s True View, famous for “Be the Player” 360° replays in the NFL, similarly uses dozens of 5K cameras around a stadium to produce point cloud-based volumetric replays 11. Each 30-second clip can require up to 1 terabyte of data to process 11, and dedicated servers crunch that data to spit out a cohesive 3D video.


These legacy approaches proved volumetric video’s potential (e.g., letting fans replay a goal from any angle on a mobile app 12 ), but they came with serious limitations:


  • Massive Files & Bandwidth: Capturing every point or voxel means huge files. One volumetric clip might consist of millions of points or polygons per frame. Studios like Metastage (LA) or Dimension (London) generated raw captures so large that specialized codecs and compression tools were needed to stream them.


  • Complex, Compute-Heavy Pipelines: Converting multi-camera footage into point clouds and meshes is computationally intensive. It involves multi-view stereo algorithms, alignment, surface reconstruction, and texture mapping 4, 14 (a simplified sketch of the surface-reconstruction step follows this list). This typically isn’t real-time – it might take hours or days to process a few minutes of footage at high quality. Intel’s True View needed on-site servers to process the data for each highlight clip 11, and Microsoft’s system similarly requires substantial crunching to produce a finished hologram.


  • Limited Interactivity (Live Use): Because of the above two issues, these formats were mostly used for pre-recorded or slightly delayed playback. You could capture a volumetric replay and then play it back from any angle, but doing this fully live (with sub-second latency) was out of reach. The data was just too heavy to stream live without compression, and compressing it in real-time was hard. The capture setups (tens of cameras and controlled studios) also made it hard to scale to every venue or to move beyond a fixed stage.
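
To give a feel for the surface-reconstruction step mentioned above, here is a minimal sketch of turning one frame’s point cloud into a mesh. It assumes the open-source Open3D library and a hypothetical input file frame_000.ply; real pipelines add multi-view stereo, temporal alignment, and texture atlasing on top of this core step:

```python
# Minimal point-cloud -> mesh sketch using the open-source Open3D library.
# "frame_000.ply" is a hypothetical per-frame point cloud from a capture rig.
import open3d as o3d

pcd = o3d.io.read_point_cloud("frame_000.ply")

# Surface reconstruction needs per-point normals (the local surface orientation).
pcd.estimate_normals(
    search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=0.05, max_nn=30)
)

# Poisson reconstruction stitches the unstructured points into a continuous mesh.
mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=9)
mesh.compute_vertex_normals()

o3d.io.write_triangle_mesh("frame_000_mesh.obj", mesh)
print(f"{len(pcd.points)} points -> {len(mesh.triangles)} triangles")
```

This runs once per frame, so a few minutes of 30 fps footage means thousands of reconstructions, which is exactly why these pipelines lived on dedicated servers rather than running live.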


In short, Points, Polygons, and Voxels showed what volumetric video could do, but they were like hauling around raw marble sculptures – rich in detail but cumbersome. The industry began searching for smarter ways to represent 3D scenes, to lighten the load without losing the magic of free-viewpoint video.


Gaussian Splatting: Blobs Over Points

One of the intriguing innovations on the path to lighter 3D formats is Gaussian splatting. If point clouds are built from crisp, infinitesimal points, think of Gaussian splats as fuzzy balls or ellipsoids that blur together into surfaces. Instead of each sample being a pin-point, it’s a tiny Gaussian (a little 3D blob of color with a soft falloff). When you project millions of these fuzzy blobs into a scene, they blend into a smooth surface – much like an impressionist painting looks smooth from afar, even if it’s blobs of color up close 15.


Gaussian splats have a shape (an ellipse that can stretch or squish) and a color and transparency. You essentially “splat” these 3D blobs onto the screen. As Niantic’s engineers describe it, you give each blob a color and opacity and let them overlap until they form a continuous image 15. The big deal here is that rendering these splats can be extremely fast, because it leverages GPU-friendly operations (rasterizing blobs instead of doing complicated ray-marching through a neural network). In 2023, a research team led by INRIA’s George Drettakis showed that by representing scenes with 3D Gaussians, they could achieve real-time 3D view synthesis at 1080p, with frame rates over 100 fps 16,17. Their method, 3D Gaussian Splatting, won Best Paper at SIGGRAPH 2023 as it was the first to combine state-of-the-art visual quality with true real-time rendering for novel views 18, 16.
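
As a rough intuition for why splatting is so GPU-friendly, here is a toy sketch in plain NumPy (drastically simplified from the actual SIGGRAPH 2023 renderer, and with made-up splat values) of shading a single pixel by alpha-blending already-projected 2D Gaussian footprints front to back:

```python
import numpy as np

# Each splat, once projected to the screen, leaves a 2D Gaussian footprint.
# Fields: center (pixel coords), inverse 2x2 covariance, RGB color, opacity.
# The list is assumed to be sorted front-to-back by depth.
splats = [
    {"center": np.array([10.0, 12.0]), "inv_cov": np.eye(2) * 0.5,
     "color": np.array([0.9, 0.2, 0.2]), "opacity": 0.8},
    {"center": np.array([11.0, 11.0]), "inv_cov": np.eye(2) * 0.3,
     "color": np.array([0.2, 0.4, 0.9]), "opacity": 0.6},
]

def shade_pixel(pixel_xy, splats):
    """Alpha-composite overlapping Gaussian footprints at one pixel."""
    color = np.zeros(3)
    transmittance = 1.0                                  # how much light still passes through
    for s in splats:
        d = pixel_xy - s["center"]
        falloff = np.exp(-0.5 * d @ s["inv_cov"] @ d)    # Gaussian falloff from the splat center
        alpha = s["opacity"] * falloff
        color += transmittance * alpha * s["color"]
        transmittance *= 1.0 - alpha
        if transmittance < 1e-3:                         # early exit once the pixel is effectively opaque
            break
    return color

print(shade_pixel(np.array([10.5, 11.5]), splats))
```

The real renderer does this for millions of splats in parallel on the GPU, tile by tile, which is where the 100+ fps figures come from.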


The imagery from Gaussian splats is impressive – you can capture a scene with ordinary photos (like a building or a room) and the algorithm optimizes a set of millions of colorful ellipsoids that, when splatted, look photorealistic from any angle. It’s essentially a different way to store the scene: not as triangles or dense voxels, but as adaptive blobs that cover space efficiently. These splats preserve many advantages of continuous volumetric data (since each is like a tiny volume with smooth falloff) but avoid wasting effort on empty space 19. It’s a bit like painting an object with spray paint instead of placing bricks – you cover surfaces with just enough “spray” (blobs) to give the illusion of solidity.


However, Gaussian splatting comes with caveats. The initial methods were designed for static scenes – essentially capturing an environment or object that isn’t changing over time 20. If anything moves (like people or trees blowing in wind), the approach struggles unless you extend it with more complex dynamic handling. Researchers are actively working on dynamic extensions (there are papers on dynamic Gaussian splatting for things like people moving 21), but the core strength so far is quick rendering of a fixed scene. The Niantic team touts that “anyone with a decent smartphone” can scan an environment and produce a splatted 3D scene 22, which is a democratization of capture, but it remains a capture process that’s done offline or in the cloud, not on the fly for live events (yet). And though splats render fast, training or optimizing the splats from input images can still take time (though faster than neural networks).


In summary, Gaussian splatting is a clever middle ground: it ditches explicit geometry (no meshes needed) but still keeps an explicit representation of the scene (the blobs). It proved we could get real-time rendering of high-quality scenes 23, breaking the assumption that only hyper-optimized neural networks might ever manage that. But when it comes to full volumetric video with motion, splats face the same compression and streaming challenges as meshes or point clouds.


Neural Radiance Fields (NeRFs)


Neural Radiance Fields (NeRFs) are the technology that truly flipped the script on 3D representation when they emerged in 2020. If point clouds were storing every “brick” of the scene and Gaussian splats every “blob of paint,” NeRFs store nothing explicit about the scene’s geometry at all. Instead, a NeRF is like training a brain that can imagine the scene from any angle. A commonly used metaphor: instead of storing every brick of a castle, we train a brain to imagine the castle perfectly from any window. In practical terms, a NeRF encodes the scene in the weights of a neural network. You give it a 3D coordinate and a viewing direction, and it outputs the color and density at that point 24. Do this for many points along a camera ray, and you can render an image from any viewpoint via classical volume rendering 25.
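
In code terms, that “brain” is just a function from (position, view direction) to (color, density), and rendering a pixel means sampling it along a camera ray and accumulating the samples with the volume-rendering quadrature from the NeRF paper. Below is a minimal NumPy sketch; nerf_mlp is a dummy stand-in for the trained network:

```python
import numpy as np

def nerf_mlp(points, view_dir):
    """Placeholder for the trained network: (x, y, z) + view direction -> (RGB, density)."""
    rgb = np.full((len(points), 3), 0.5)       # dummy color
    sigma = np.full(len(points), 0.1)          # dummy volume density
    return rgb, sigma

def render_ray(origin, direction, near=0.0, far=4.0, n_samples=64):
    """Volume rendering: C = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i."""
    t_vals = np.linspace(near, far, n_samples)
    points = origin + t_vals[:, None] * direction        # sample points along the ray
    rgb, sigma = nerf_mlp(points, direction)

    deltas = np.diff(t_vals, append=t_vals[-1] + 1e10)   # spacing between samples
    alpha = 1.0 - np.exp(-sigma * deltas)                # opacity of each ray segment
    # Transmittance T_i: how much light survives to reach sample i.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    return np.sum(weights[:, None] * rgb, axis=0)        # final pixel color

pixel = render_ray(origin=np.zeros(3), direction=np.array([0.0, 0.0, 1.0]))
print(pixel)
```

Training is then just optimizing the network’s weights so that pixels rendered this way match the captured photographs.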


Put more simply, a NeRF learns to “paint” a 3D scene: you feed in a bunch of 2D photos of a scene, and it figures out how to produce any other photo of that scene from any new angle 26. It doesn’t explicitly create a mesh or point cloud; instead it’s like a black box that answers, “If a camera were here looking this way, what color would each pixel be?” The neural network inside is effectively modeling how light travels through the 3D scene (that’s why it’s called a radiance field).


This approach, introduced by Mildenhall et al. (2020), was revolutionary in the quality of novel views it produced 24. Scenes that were hard for traditional multi-view stereo to reconstruct (because of fine details, transparency, etc.) could now be learned by a neural network and rendered with stunning realism 27. The catch: NeRFs were originally very slow. Training a NeRF on a single scene could take hours to days on a GPU, and rendering an image could take several seconds or more. Essentially, NeRF was a proof-of-concept that if you throw enough computation at the problem, an optimized neural network can synthesize views that outperform prior methods in quality 27. But it wasn’t practical for interactive use yet – you wouldn’t use an early NeRF to do a live sports replay unless you could wait minutes for the result (which of course, you can’t in a live broadcast).


The last couple of years have seen astonishing progress in speeding up NeRFs. Researchers at NVIDIA, for instance, introduced Instant NeRF (part of Instant-NGP, 2022), which uses a clever hash-grid encoding to train a NeRF in a matter of seconds to minutes instead of hours 28. In fact, NVIDIA’s Instant NeRF was so fast and impactful that TIME named it one of the best inventions of 2022 29. What Instant NeRF and similar advances did was turn NeRF from a lab curiosity into something you could feasibly use in production. Suddenly, you could take a bunch of photos and get a reasonable 3D model out in seconds or minutes, not days. Likewise, Google, Meta, and others have been publishing increasingly faster and better NeRF variants 30, some focused on higher quality (even 3D portraits or city-scale captures) and others on speed or special features. A whole ecosystem of tools has popped up: Nerfstudio, an open-source framework, lets users train NeRFs and view results in real-time through a web viewer, making the tech much more accessible to non-researchers 31. These tools hide much of the complexity and let you plug in your own images or even video to get a 3D scene out. As a result, NeRF has gone from academic idea to something even hobbyists and artists are experimenting with, all in just a couple of years.
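
The “clever hash-grid encoding” works roughly like this: instead of pushing every sample through one large network, Instant-NGP looks up small learned feature vectors from hash tables at several spatial resolutions and feeds those features to a tiny network. The toy sketch below shows only the lookup idea, simplified to nearest-grid-corner lookup rather than the trilinear interpolation the real method uses; the spatial-hash primes follow the paper, while the table sizes and growth factor are illustrative:

```python
import numpy as np

# Toy multiresolution hash-grid lookup (nearest corner, no trilinear interpolation).
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)   # spatial-hash primes
N_LEVELS, TABLE_SIZE, FEAT_DIM = 8, 2**14, 2

# One small table of learned feature vectors per resolution level (random here).
tables = [np.random.randn(TABLE_SIZE, FEAT_DIM).astype(np.float32) for _ in range(N_LEVELS)]
resolutions = [int(16 * 1.5**level) for level in range(N_LEVELS)]  # coarse -> fine grids

def hash_encode(xyz):
    """Concatenate per-level features for a 3D point in the unit cube [0, 1)^3."""
    features = []
    for level, res in enumerate(resolutions):
        corner = np.floor(np.asarray(xyz) * res).astype(np.uint64)    # grid cell containing the point
        index = np.bitwise_xor.reduce(corner * PRIMES) % TABLE_SIZE   # spatial hash -> table slot
        features.append(tables[level][index])
    return np.concatenate(features)   # the real method feeds this into a tiny MLP

print(hash_encode([0.3, 0.7, 0.1]).shape)   # (N_LEVELS * FEAT_DIM,) = (16,)
```

Because most of the scene’s “knowledge” lives in these fast lookup tables rather than deep in a large network, optimization converges in seconds to minutes, which is the heart of the speed-up described above.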


Despite the speed-ups, classic NeRFs still have challenges for things like live sports. They traditionally need many images from all angles to train (a similarly dense capture to the other methods), and each new scene means training a brand-new network. There is no single model that handles everything; you retrain from scratch for, say, each new play or each new game environment. Also, rendering even a fast NeRF might still take on the order of tens of milliseconds per frame on good GPUs – which is okay, but if you want many viewers to interact at once, it adds up. And dynamic content (moving subjects) historically breaks NeRF’s core assumption of a static scene (though research on dynamic NeRFs is ongoing).


Still, the big concept NeRF brought is implicit, neural representation: you don’t capture the shape, you capture the essence in a neural network’s weights. This was the stepping stone to what we call the neural fields era – and the industry started thinking, what if we build a volumetric video system that outputs not geometry, but a neural field that can be efficiently streamed and rendered?


Generation 3 – Echo’s Approach


Echo demonstration – a volumetric 3D model of a basketball player mid-dunk, generated not as a point cloud or mesh, but via Echo’s neural reconstruction.

This brings us to Skyrim.AI’s Echo, which can be seen as a representative of the third generation of volumetric video formats. Unlike legacy systems that output point clouds, meshes, or voxel grids, Echo doesn’t produce a human-readable 3D model at all – instead it generates an efficient neural representation optimized for real-time use. Skyrim.AI refers to this new class of content as “Spatial Media”, to distinguish it from old-school volumetric video 32. The idea is that Echo’s output is natively neural and does not need to be converted from an intermediate point cloud. It’s as if the system speaks a new language for 3D video, rather than using the building blocks (voxels) or brush strokes (splats) we had before – a language designed from the ground up for streaming and interactivity.

So what does this mean concretely? The details of Echo’s format are proprietary, but based on descriptions, Echo’s AI-driven 3D reconstruction model is trained on a massive dataset of sports footage (tens of thousands of captures) and can infer a full 3D scene from much less input data than previous methods 33. In practice, they report that Echo can reduce the required camera count by up to two-thirds for volumetric capture 33. Imagine a stadium that would have needed 30 cameras around the field now needing maybe 10 to get the same (or better) volumetric output – that’s a game changer. Fewer cameras means easier deployment in sports venues and lower costs.


Crucially, Echo demonstrates real-time or near-real-time reconstruction. It positions Skyrim.AI on a path toward true live volumetric broadcasts 32, where a slam dunk or a goal can be reconstructed and streamed in 3D almost instantly. Denny Breitenfeld, the CEO, described the shift in dramatic terms: with Echo, we’ll soon realize we’ve been watching sports “as if it were in black and white” 35. The analogy suggests that once you can watch sports volumetrically (choosing your angle, experiencing it as if you’re on the field), flat 2D replays will feel antiquated.


To put it in metaphorical terms: older volumetric systems were building a digital twin brick by brick (or point by point), whereas Echo is dreaming the digital twin from a sparse signal, guided by its prior training. That’s the “new kind of language” – it isn’t assembling the replay from raw pixels alone, it’s imagining the replay with the help of AI, in a way that stays true to the actual footage we have. Of course, this isn’t to say it’s hallucinating new events (the AI isn’t inventing a different play than what happened), but it can infer depth and reconstruct parts of the action that cameras didn’t directly see, using its learned knowledge of physics and sports visuals.


The implications of Echo’s neural-native format are profound. It essentially breaks the scaling limitations that have hung over volumetric video. Need to cover a whole league of sports games volumetrically? You no longer need to outfit every stadium with a hundred cameras and petabyte servers – a handful of cameras and a cloud AI pipeline might do the trick. Want to stream to millions of users? You can multicast the lightweight neural scene data, which each user’s device can render, instead of having to either pre-render every possible angle (impossible) or stream raw 3D data (too heavy). This paves the way for truly interactive live broadcasts.


Implications for Sports Production

The evolution of these formats – from heavy geometry to lightweight neural fields – is not just a tech novelty; it’s the key to unlocking immersive sports experiences that were previously impractical. Here’s what this new generation of spatial media could mean for sports broadcast and production teams:


  • Real-Time Immersive Replays: With neural formats like Echo’s, a highlight can be captured and streamed in volumetric 3D seconds after it happens, or even live. Viewers could pause and fly around a pivotal play, seeing it from any angle in the stadium, all in real-time. This is the fulfillment of what True View began (the 360° replay) but on a new level – it could be available for every play, not just the highlight of the game, because the processing and bandwidth costs are so much lower now.


  • “Choose Your Angle” Broadcasts: Instead of a director choosing the camera cut that everyone sees, each viewer could become their own director. Imagine during a live basketball game, you can switch to a sideline view, or a bird’s-eye view, or even a player’s perspective on the fly. The broadcast could be a volumetric stream and the viewer’s app lets them pan the virtual camera. Legacy volumetric video hinted at this freedom (e.g., fans controlling highlights in a team app 12), but neural spatial video can make it mainstream. It’s the ultimate personalized viewing experience.


  • Interactive Second Screens & AR: Sports fans love multi-screen experiences (stats on the phone while the game is on TV, etc.). With volumetric video, a second-screen app or AR headset could let fans literally pull the game into their room. Picture viewing a tabletop AR model of the soccer pitch with live players running around, reconstructed via a neural field – you can walk around it and see plays develop from any vantage. Because neural representations are efficient, an iPad or AR glasses could handle rendering a live volumetric feed. This could transform how we do replays in-studio as well – sports analysts might use holographic replays they can walk around in a studio, rather than X’s and O’s on a flat telestrator.


  • Lower Bar for Capture: Broadcast teams can rethink their camera setups. Traditionally, to do anything holographic, you needed a massive camera array (think of the NBA’s True View installs, with 50+ cameras in the rafters). With neural reconstruction reducing camera needs by two-thirds 33, one could cover more of the field with fewer devices. Maybe you augment your normal broadcast cameras with a few extra ultra-wide 8K cameras, feed those into the AI, and you get a full 3D feed. This makes volumetric capture feasible in places it wasn’t before (smaller venues, or temporary installations for events, etc.). It also means less intrusiveness – fewer cameras to hide in a stadium means it’s easier to get buy-in to deploy.


  • Scalable Production and Distribution: For production teams, neural spatial video can streamline workflows. Instead of managing dozens of video feeds and a separate pipeline to stitch them into a 3D video, a trained AI model does the heavy lifting. The output is a format that can be edited like video (you can cut highlights, etc., except the “video” is now volumetric). Distribution-wise, it can fit into existing content delivery networks because it’s much lighter data. It might even be cloud-rendered into standard video for viewers who don’t have capable devices, meaning you can produce one volumetric feed and serve both interactive viewers and traditional TV viewers (the latter just see the director’s cut). The key is that the format is flexible – you get both interactive and traditional experiences out of one pipeline.


Overall, the message to sports broadcast teams and technologists is clear: Neural-native volumetric video is coming of age, and it’s time to start exploring and experimenting with it. The limitations that kept volumetric video on the fringe (massive rigs, high costs, slow turnaround) are being dismantled by AI-driven approaches. We’re moving into an era where “spatial media” could become as commonplace as instant replay. Just as the switch from analog to digital or SD to HD revolutionized sports viewing, the switch from 2D video to volumetric spatial video could redefine the fan experience.


It’s a call to rethink capture and storytelling. Producers can begin to imagine shots and camera moves that aren’t possible with physical cameras – like a frozen moment pan around the quarterback, or a view from just above the goalie’s shoulder during a penalty kick – executed on the fly. Editors will need tools to cut volumetric clips. Cameras might be planned not just for pretty shots, but to feed the AI the angles it needs for reconstruction (a new kind of cinematography). And importantly, standards and practices will need to evolve (enter organizations like the Volumetric Format Association, which are already discussing how to handle these new formats).


The bottom line: Volumetric video is evolving into spatial media, shedding its training wheels of point clouds and polygons. With neural fields at the helm, what was once bulky and slow is becoming sleek and live. Sports may be the first arena to truly showcase this leap, turning games into immersive experiences that fans can step inside. It’s an exciting time to be at the intersection of tech and sports – the playing field for volumetric innovation is wide open, and the ball is in our court. The teams that get ahead of this curve and embrace neural volumetric formats will be the ones defining the future of sports entertainment. So gear up and start experimenting – the era of scalable, live 3D sports is about to kick off.


Welcome to the new playing field of spatial media.


Sources: Skyrim.AI Echo launch announcement 32, 33; SIGGRAPH 2023 paper on 3D Gaussian Splatting 16; Mildenhall et al., 2020 (NeRF) 24; NVIDIA Instant NeRF blog 28; The Decoder (Nerfstudio) 31; Intel True View coverage 11; Microsoft Mixed Reality Capture (Metastage) 9; Niantic Labs on Gaussian splats 15, 23.



Footnotes


1, 37 - What is volumetric video? [2025], https://www.volumetricformat.org


2, 3, 5, 6, 7, 8 - A Beginner’s Guide to 3D Data: Understanding Point Clouds, Meshes, and Voxels | by Sanjiv Jha | Medium [May 11, 2024], https://medium.com/@sanjivjha/a-beginners-guide-to-3d-data-understanding-point-clouds-meshes-and-voxels-385e02108141


4, 14 - Volumetric Videos | Image Processing | Nikon About Us, https://www.nikon.com/company/technology/technology_fields/image_processing/volumetric


9, 10 - Metastage and Departure Lounge set up volumetric capture studio in Vancouver | VentureBeat [March 31, 2022], https://venturebeat.com/games/metastage-and-departure-lounge-set-up-volumetric-capture-studio-in-vancouver


11, 12 - Intel True View is a cool technology for immersive sports viewing | VentureBeat [September 19, 2019], https://venturebeat.com/business/intel-true-view-is-a-cool-technology-for-immersive-sports-viewing


15, 22, 23 - Splats Change Everything: Why a once-obscure technology is taking the 3D graphics world, and Niantic, by storm – Niantic Labs [Dec 10, 2024]


16, 17, 18, 19 - 3D Gaussian Splatting for Real-Time Radiance Field Rendering [July 2023], https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting


24, 25, 27 - NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis [arXiv:2003.08934, Aug 3, 2020], https://arxiv.org/abs/2003.08934


26, 28, 29 - Transform Images Into 3D Scenes With Instant NeRF | NVIDIA Blog [April 17, 2024], https://blogs.nvidia.com/blog/ai-decoded-instant-nerf


30, 31 - Nerfstudio makes it easier to get started with NeRFs | The Decoder [Oct 21, 2022], https://the-decoder.com/nerfstudio-makes-it-easier-to-get-started-with-nerfs


32, 33, 34, 35, 36 - Skyrim.AI Launches "Echo," the World's First AI-Based 3D Reconstruction Model for Broadcast-Quality Spatial Media in Sports | Skyrim.AI [May 7, 2025]
