Echo: The world's first AI model built for spatial media.
- Skyrim.AI Team
- May 7

Today, we're proud to announce Echo, the world's first AI model built from the ground up specifically for delivering broadcast-quality spatial media at scale. Echo is not an incremental improvement over existing technologies but a fundamental leap forward in volumetric video capture and streaming, particularly suited to the demands of live sports broadcasting.
Unlike previous volumetric video techniques, Echo does not output traditional 3D formats such as point clouds, meshes, voxels, or gaussian splats. While Echo can convert outputs to these formats when needed, the model itself is optimized to produce data ideal for streaming large-scale spatial content like sports events. Echo's output seamlessly integrates with advanced real-time ray tracing engines, enabling broadcasters and media rights holders to deliver immersive, interactive experiences at unprecedented quality.
Echo is not a repackaging of existing reconstruction techniques; it is a ground-up approach built specifically to scale high-fidelity volumetric capture.
Leveraging an advanced AI model with over 5 million proprietary parameters, trained on more than 58,000 unique captures, Echo is designed to reduce the number of cameras required in stadium-scale volumetric setups by up to two-thirds.
What is Generation 3 for the volumetric video industry?

At Skyrim.AI, our team has spent nearly a decade, starting in 2016, working hands-on with every generation of volumetric capture technology. From Kinect v1 and v2 to Intel RealSense, we’ve explored and pushed the limits of what’s possible. We’ve synchronized depth and color across multiple consumer and industrial camera types. We’ve optimized structured light systems by redesigning infrared emitters for smoother, less choppy coverage. We built one of the world’s most advanced stereo disparity stages. And when off-the-shelf calibration tools weren’t good enough, we invented our own.
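For readers newer to this space, here is a minimal sketch of the off-the-shelf calibration baseline a capture stage typically starts from: standard checkerboard intrinsic calibration with OpenCV. The pattern size, square size, and file paths are illustrative assumptions, and this is of course not our proprietary tooling.

```python
import glob

import cv2
import numpy as np

# Checkerboard with 9x6 inner corners and 25 mm squares (illustrative values).
PATTERN = (9, 6)
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * 0.025

obj_pts, img_pts, image_size = [], [], None
for path in glob.glob("calib/*.png"):  # hypothetical capture directory
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    image_size = gray.shape[::-1]
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if found:
        # Refine detected corners to sub-pixel accuracy.
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_pts.append(objp)
        img_pts.append(corners)

# Solve for the camera matrix K and lens distortion coefficients.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_pts, img_pts, image_size, None, None)
print(f"reprojection RMS: {rms:.3f} px")
```

A multi-camera stage repeats this per camera and then solves for the extrinsics relating the cameras, which is where off-the-shelf tools tend to run out of road.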
When it comes to reconstruction, we’ve built pipelines that cluster, scale, and process at production-grade levels. We've been deep in the trenches of volumetric capture, not just using the tools, but reinventing them.
Echo isn’t about repeating what we’ve done before. Echo is a clean slate.
It takes everything we’ve learned over the past nine years and places it on the shelf—not forgotten, but complete. Then it asks a new question: What is the root problem we’re really trying to solve? And the answer, every time, is quality. To understand what makes Echo different, it helps to understand the generations of technology that led up to it. That’s why we call Echo Generation 3.
Generation 1
This generation includes most traditional capture stages, typically 90% or more of them, built around arrays of industrial cameras. These setups often rely on structured light or photogrammetry techniques, and in some cases, stereo disparity. Several well-written books document these approaches, which have remained largely unchanged for the past decade. While there have been improvements in mesh clean-up and downstream processing, often useful for game development or VFX, those enhancements haven’t meaningfully improved the capture itself. And the reconstruction pipeline (from point clouds to meshes or voxels) remains fundamentally limited by the quality of the source data.
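To ground what "stereo disparity" means in practice, here is a minimal Generation 1-style sketch using OpenCV's semi-global block matcher; the focal length, baseline, and file names are illustrative assumptions, not values from any real stage.

```python
import cv2
import numpy as np

# Rectified frames from a synchronized stereo pair (assumed already undistorted).
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Semi-global block matching, a classic Generation 1 disparity estimator.
sgbm = cv2.StereoSGBM_create(
    minDisparity=0, numDisparities=128, blockSize=5,
    P1=8 * 5 * 5, P2=32 * 5 * 5)
disparity = sgbm.compute(left, right).astype(np.float32) / 16.0  # fixed point -> pixels

# Depth from disparity: Z = f * B / d (illustrative focal length and baseline).
f_px, baseline_m = 1200.0, 0.12
valid = disparity > 0
depth_m = np.zeros_like(disparity)
depth_m[valid] = f_px * baseline_m / disparity[valid]
```

Everything downstream, whether point clouds, meshes, or voxels, inherits whatever noise and holes survive this step, which is exactly the source-data ceiling described above.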
Generation 2
This was a resurgence of research from the late '90s and early 2000s, built around the idea of capturing real-world environments like offices or conference rooms using radiance-based methods. Back then, 3D graphics were still in their infancy, and the compute required to reconstruct anything realistic was a major bottleneck. While that early work laid a strong theoretical foundation, it lacked practical traction due to the technical constraints of the time. Now, thanks to advances in GPU hardware (especially from companies like NVIDIA) and deep learning, we’re seeing a new wave of interest in this space.
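For context, the radiance-based methods driving this resurgence (NeRF-style neural rendering among them) typically evaluate the classic volume rendering integral along each camera ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$:

```latex
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt,
\qquad
T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)
```

Here $\sigma$ is volume density, $\mathbf{c}$ is view-dependent color, and $T(t)$ is the transmittance surviving to depth $t$. Both $\sigma$ and $\mathbf{c}$ are fit to whatever observations exist, so when those observations are sparse or noisy, the rendering stays faithful to limited data, which is precisely the ceiling discussed next.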
However, much of the work we see in Generation 2 still begins with sparse, low-quality point clouds. Most of the innovation has gone into filling in or compressing that data—rather than solving the root problem. While methods like splatting and neural rendering have introduced exciting new visual formats, we believe they’re ultimately hitting a ceiling. Techniques like ray tracing on splats may be novel in format, but the fundamental problem remains the same: starting from limited source data. And we’ve had ray-traced meshes for over 20 years.
Generation 3
This is where Echo comes in. We’re defining Generation 3 as a fundamentally new approach to capturing the real world, starting not with how to fix broken data, but with how to avoid capturing it in a broken state in the first place.
We’re not building on point clouds, splats, meshes, or voxels. We believe those formats belong to earlier generations. Instead, Echo is rooted in capturing high-quality spatial data at the source and enabling new kinds of reconstruction and delivery pipelines downstream—without being constrained by legacy formats.
Our focus is end-to-end: capture to reconstruction to delivery. And while we fully respect the innovations from previous generations, we’re not reinventing their methods—we’re moving past them.
Echo is still early in its lifecycle. Over the next 12 to 18 months, our goal is to reach true broadcast quality. Shortly after that, real-time broadcast quality. But what sets Echo apart most is where it begins—not in a small, controlled capture stage, but in large-scale, dynamic outdoor environments. Echo is built to handle changing light conditions, diverse settings, and real-world complexity from the start.
We believe that if you get the foundation right—starting with better capture—everything else gets better from there.
Read more about the generations in our state of spatial media industry report here.
Echo Inputs and Workflow
Echo's primary input is RGB image sequences; it optimizes these widely available data streams to reconstruct volumetric experiences rapidly and efficiently. However, Echo is also designed to accept a wide variety of input formats:
Images (from a single camera or a camera array)
Point clouds (sparse or dense, with RGB or camera metadata)
Meshes (OBJ, FBX, glTF)
Voxels (GVDB only)
Over time, Echo will continue to expand compatibility, working closely with partners who wish to integrate Echo directly into their existing capture and production pipelines.
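As a purely hypothetical illustration of what a multi-format front end to such a pipeline could look like (Echo's actual API is not public, and every name below is invented), a format classifier might route capture files like this:

```python
from pathlib import Path

# Hypothetical mapping: file extensions -> the input categories listed above.
SUPPORTED = {
    ".png": "images", ".jpg": "images",   # single camera or camera array
    ".ply": "point_cloud",                # sparse or dense, RGB/camera metadata
    ".obj": "mesh", ".fbx": "mesh", ".gltf": "mesh",
    ".vdb": "voxels",                     # GVDB-style voxel grids (assumption)
}

def classify_inputs(capture_dir: str) -> dict[str, list[Path]]:
    """Group a capture directory's files by input category (illustrative only)."""
    groups: dict[str, list[Path]] = {}
    for path in sorted(Path(capture_dir).iterdir()):
        kind = SUPPORTED.get(path.suffix.lower())
        if kind is not None:
            groups.setdefault(kind, []).append(path)
    return groups

print(classify_inputs("captures/match_day"))  # hypothetical directory
```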
Beyond Sports: Future Applications
While Echo's initial deployment focuses heavily on the sports broadcast industry, its revolutionary AI approach opens pathways to future applications across other industries. Virtual production, cinematic visual effects, telepresence, remote training, and medical imaging are among potential fields that could significantly benefit from Echo's advanced volumetric capabilities.
Looking Ahead: Join the Echo Vanguard Initiative
We're actively inviting sports media rights holders, technology innovators, researchers, and volumetric content creators to join our exclusive early access program: the Echo Vanguard Initiative.
- Collaborate closely with our team to explore Echo's potential for cost reduction, quality improvement, and stadium infrastructure optimization for spatial media.
- Receive direct guidance on how Echo can enhance existing capture workflows involving structured light, photogrammetry, Gaussian splats, and other methodologies.
- Participate in alpha and beta testing phases, helping to shape the ongoing evolution of Echo.
You can join the Echo Vanguard Initiative here.