Garbage In, Garbage Out: The quality of the output is determined long before reconstruction begins.

This is the first post in our three-part series, Fail Fast in the World of Volumetric Video.
The goal of this series is simple: share what we have learned from 10 years of inventing, testing, breaking, rebuilding, and validating volumetric systems so others can move faster, avoid repeated mistakes, and help build a stronger ecosystem.
The quality of the output is determined long before reconstruction begins.
It is tempting to think the biggest breakthroughs will come from better rendering, better compression, better Gaussian splats, better neural reconstruction, or better playback. Those all matter. But if the capture is wrong, everything downstream becomes harder, slower, and more expensive.
That is what we mean by garbage in, garbage out.
In this post, we focus on the practical places where garbage enters the pipeline early: lenses, lens distortion, shutter type, sensors, raw camera data, multi-camera sync, timecode, color balance, calibration, and point cloud cleanup.
None of these topics are as flashy as a demo. But they are often the difference between something that works once and something that can scale.
If you are just getting started in volumetric video, our hope is that this helps you ask better questions before you build. If you have been in the industry for years, we welcome your feedback, pushback, and experience.
The goal is not to claim there is only one right way to build a volumetric pipeline.
The goal is to help the industry fail faster, learn faster, and reduce the amount of garbage we keep asking reconstruction to fix.
1. Lenses: Distortion Is Not a Small Detail
The lens is one of the first places where garbage enters the pipeline.
Every lens has distortion. Some lenses have predictable distortion. Some lenses have distortion that is harder to model. Some lenses report useful metadata. Some do not.
If you are doing multi-camera calibration, stereo reconstruction, structure from motion, neural reconstruction, or point cloud generation, lens distortion directly affects how accurately each camera agrees with the others.
There is almost always more distortion toward the edges of a lens. The simplest workaround is to crop the frame and avoid using the outer edges of the image for calibration, point cloud generation, or reconstruction. Another simple practice is to frame the subject as close to the center of the shot as possible. This also applies when you are using a multi-camera calibration chart.
Those workarounds can help, but they are still workarounds.
The better approach is to understand the distortion before the image enters the rest of the pipeline. If your lens can output useful metadata, and if your pipeline can read and use that metadata, you should use it. The goal is to correct the distorted pixels before multi-camera calibration, point cloud generation, or reconstruction begins.
That is why lens metadata matters.
If you are not familiar with lens metadata, think about the basic information a camera can read from a lens, such as focal length and f-stop. For example, when a Canon camera displays the focal length or aperture of a Canon lens, that information is metadata being passed from the lens to the camera.
But lens metadata can include much more than those basic characteristics.
Many lens manufacturers also have lens calibration data that describes how that specific lens behaves optically, including distortion characteristics. That is the metadata we are most interested in for volumetric video: not just what focal length or aperture the lens reports, but whether there is usable calibration data that can help correct the image before calibration, point cloud generation, or reconstruction begins.
Access to this calibration data varies widely across lens categories. Mobile phone lenses, industrial lenses, photography lenses, cinema lenses, and broadcast lenses all expose metadata differently. Some systems make useful data available through the camera body. Some require a specific mount, adapter, cable, software license, or manufacturer SDK. Some only expose basic lens characteristics. Some may not expose useful calibration data at all.
The goal is simple: get access to the lens calibration data if it exists and use it as early in the pipeline as possible.
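As a concrete illustration, here is a minimal sketch of that early correction step using OpenCV. The intrinsics, distortion coefficients, and file names below are hypothetical placeholders standing in for whatever your lens metadata or calibration process provides:

```python
import cv2
import numpy as np

# Hypothetical intrinsics and distortion coefficients for one camera/lens pair,
# taken from manufacturer calibration data or your own calibration step.
K = np.array([[2400.0,    0.0, 1920.0],
              [   0.0, 2400.0, 1080.0],
              [   0.0,    0.0,    1.0]])
dist = np.array([-0.12, 0.05, 0.0, 0.0, 0.0])  # k1, k2, p1, p2, k3

frame = cv2.imread("cam_07_frame_000123.png")  # placeholder file name
h, w = frame.shape[:2]

# Correct the distorted pixels before multi-camera calibration or reconstruction sees them.
new_K, roi = cv2.getOptimalNewCameraMatrix(K, dist, (w, h), alpha=0)
undistorted = cv2.undistort(frame, K, dist, None, new_K)
cv2.imwrite("cam_07_frame_000123_undistorted.png", undistorted)
```

The exact tool matters less than the ordering: calibration, point cloud generation, and reconstruction should see the corrected frames, not the raw distorted ones.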
If the lens does not expose that data, then you may need to work with the lens manufacturer, a third-party calibration provider, or build your own calibration process. We will go into those options later in this section.
1.1 Mobile Phone Lenses
Mobile phone lenses are the most unreliable lenses in this list, for several reasons.
You may not be receiving a clean optical image. Depending on the device, camera mode, operating system, and API, the image may already have been sharpened, stabilized, denoised, color processed, cropped, corrected, or otherwise modified before you ever get access to the frame.
Some phone manufacturers have world-class internal optics teams. Some outsource lens design. Some partner with well-known lens companies. Some do a combination of all three. The challenge is that you often do not get direct access to the lens metadata, distortion model, or image-processing pipeline.
Even when the phone has useful metadata internally, the operating system or camera API may not expose the information you need.
This makes lens calibration much harder.
When a phone manufacturer says, "This was shot on our phone" they may be using internal tools, private APIs, metadata access, or processing modes that are not available to the general public. That matters if you are trying to build a repeatable volumetric capture pipeline around mobile phones.
Mobile phone capture can still be useful, but you need to understand what control you actually have over the image and metadata. The mobile phone industry has no standards around lenses or lens metadata: which lenses a manufacturer uses, which suppliers they work with, whether the data is proprietary, and whether you can access it all vary from device to device. That makes it nearly impossible to compensate for the distortion coming out of a mobile phone lens. This is why single-phone captures used for photogrammetry or Gaussian splats tend to rely on far more coverage than other camera types would need. The more coverage you have when you cannot correct the distortion, the more you hide the distortion, because there are enough overlapping views to find the right point somewhere in all that coverage. So while mobile phone rigs seem portable, they are overcompensating, and this happens over and over again. That could change if a phone manufacturer disclosed which lens manufacturer they worked with, or if an internal lens team provided real calibration data and guaranteed that the image you receive has not been preprocessed.
1.2 Industrial Camera Lenses
Industrial cameras are designed for machine vision use cases, and the ecosystem around them has decades of calibration and distortion-correction experience. They can also be useful in specialized rigs, including multi-spectrum or IR-based systems.
Industrial camera systems are also often more customizable. In our experience, industrial camera and lens manufacturers are used to custom solutions. Many of these companies already work with specialized combinations of sensors, ASICs, camera bodies, lens mounts, and imaging pipelines.
Even when a lens does not output useful metadata directly, the manufacturer might have that data in their database and work with you. They might even help you build your own calibration wall / chart. Some manufacturers may recommend third-party calibration providers who can measure the lens against a known sensor size and calibration setup.
If you are building a serious industrial-camera-based volumetric rig, do not assume you have to guess. If the lens company does not respond, the camera manufacturer often has its own calibration system or can get you the information you need when you buy the lenses.
1.3 Photography Lenses
Photography lenses are often a better starting point, especially fixed focal length lenses.
A fixed focal length gives you consistency, and consistency is gold in volumetric capture.
Zoom photography lenses can work, but only if you know exactly what focal length you are using, whether the lens reports that focal length accurately, and whether your pipeline can use that information correctly. Many photography lenses actually sit at a fractional focal length rather than a whole number. For example, with an 18 to 72 mm lens, if you set one lens to 56 mm and another lens to 56 mm, the chances that both are exactly at 56 mm are very slim, and the metadata may only report a rounded value, so you might really be at 56.1 mm on one lens and 56.4 mm on another. If you are going to use zoom photography lenses, use a servo motor that you can control and program from your pipeline. Also calibrate where you decide 56 mm sits on one lens and make sure it matches the other lenses. There are several different techniques, but your pipeline should handle the basic characteristics the lenses report, and focal length plays a key role in a professional volumetric video pipeline.
If your stage can recalibrate throughout a live event, and you want to get more coverage from less equipment, zoom photography lenses may be an option. However, photography lenses can also introduce challenges around zoom and focus in a live stage environment. If your stage needs to support zooming, you may need a manual focus puller, or you may be forced to rely on autofocus.
We do not recommend using autofocus for volumetric capture because it can introduce inconsistency between cameras and frames.
1.4 Cinema and Broadcast Lenses
Cinema and broadcast lenses are usually the best option when the budget and workflow allow for them.
Higher-quality glass, better consistency, better metadata, and more predictable distortion all reduce the amount of guesswork the reconstruction system has to perform.
They also tend to handle zooming and focusing more reliably, especially when cameras are in fixed positions and you are zooming toward the action. For example, in a basketball game, the action may shift from one end of the court to the other. A lens system that can handle that movement reliably is a major advantage.
In our experience, lens metadata can often be retrieved from high-end cinema lens systems, depending on the camera and mount configuration. Zeiss cinema lenses, RED and Nikon workflows, and Canon lens systems are all examples where metadata access may be possible, depending on the exact setup.
The key is to verify the full chain: lens, mount, adapter, camera, capture software, and volumetric pipeline.
A lens may be capable of outputting useful metadata, but if your camera, adapter, or capture system does not pass that metadata through, you may not actually have access to it. Before you build the rig, confirm what data is available and where in the pipeline it can be read.
The point is simple: better lens data reduces the number of cameras you need, improves calibration, and lowers the amount of cleanup required later.
1.5 Calibration Charts and Lens Distortion Options
If you cannot access useful lens calibration metadata, you still have options.
You can work with the lens manufacturer. You can work with a third-party calibration provider. Or, if you have the right expertise, you can build your own lens calibration setup.
For time-of-flight and structured-light cameras, including Azure Kinect and Intel RealSense systems, we highly recommend building a lens distortion calibration wall. With these systems, distortion can vary significantly from unit to unit. In our experience, the difference can be dramatic.
A calibration wall can work well if it is extremely flat and stable. We recommend using industrial PCB material or high-quality pressboard as the wall surface. You can mount the board to studs behind a sheetrock wall, but you need to make sure the board is not skewed. For example, the left side of the board may sit slightly farther from the back wall than the right side, which can introduce error into the calibration process.
If possible, use aluminum framing to build a dedicated stand for the flat wall. This avoids many of the issues that come from trying to mount a calibration surface directly to a sheetrock wall.
A projector can also be used if it has accurate skew control and can project at very high resolution. This is different from simply having a high image resolution. The projected calibration pattern needs to be geometrically accurate.
The last element is an accurate way to mount the camera and measure the distance from the sensor to the wall. Aluminum framing can work well for this, and a motorized track can also be useful if you need repeatable camera positioning.
This calibration step is especially important for Intel RealSense systems and time-of-flight cameras, including the original Microsoft Kinect, Azure Kinect, and newer Kinect-style depth cameras. The lenses on these systems are inexpensive and can have significant distortion. If you do not characterize that distortion, your pipeline has to compensate for it without knowing how much distortion exists in one lens compared with another.
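If you do build a calibration wall, the per-lens solve itself can be done with standard tools. Here is a minimal sketch using OpenCV's checkerboard calibration; the pattern size, square size, and file paths are assumptions for illustration, not a prescription:

```python
import glob
import cv2
import numpy as np

# Assumed checkerboard on the flat wall: 9x6 inner corners, 50 mm squares.
pattern = (9, 6)
square_mm = 50.0
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square_mm

obj_points, img_points = [], []
for path in glob.glob("wall_captures/cam03_*.png"):  # placeholder path
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001))
        obj_points.append(objp)
        img_points.append(corners)

# Solve intrinsics and distortion for this specific camera and lens unit.
rms, K, dist, _, _ = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("RMS reprojection error:", rms)  # a quick sanity check on the solve
```

Characterizing each unit this way, and storing the result per serial number, is what lets the rest of the pipeline know how much distortion exists in one lens compared with another.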
That matters even more in real-time systems.
Many live depth or structured-light capture pipelines rely on GPU-based processing using OpenGL, Vulkan, DirectX, CUDA, or a similar approach. Real-time mesh generation has practical performance limits, especially when using desktop or workstation GPUs. Data center GPUs may offer more compute, but they often do not have display outputs, which can make graphics API workflows more complicated. Those workflows are possible, but they usually require a very different system design.
If you are building or evaluating a live system using Intel RealSense or time-of-flight cameras, one of the easiest ways to spot poor calibration is by looking for visible seams between camera views. You can sometimes hide those seams by adding more cameras, but that introduces its own problems. Because these systems often rely on infrared, adding more cameras can create interference or bleeding, especially with time-of-flight cameras. In some cases, adding more cameras can actually reduce quality.
The better path is to calibrate each camera and lens independently, characterize the distortion, and use that information before multi-camera calibration and reconstruction. In our experience, this produces a dramatic improvement, and we have seen it proven repeatedly.
For industrial cameras, we have built calibration charts, and they can work well. However, if you are doing this yourself, you should involve someone who understands optics. Industrial lenses have unique characteristics, and a poor calibration setup can create misleading distortion data.
The same type of calibration wall used for time-of-flight and structured-light systems can also work well for industrial cameras. Many industrial lens manufacturers publish recommended calibration charts, white papers, or technical guides for specific lens and sensor combinations. If you can print the chart accurately and build a proper calibration wall, these resources can help you calibrate the lens yourself.
That said, this should be treated as the fallback option if you cannot get calibration data from the lens manufacturer, camera manufacturer, or a third-party calibration provider.
For higher-end cinema and photography lenses, we have usually had better options available, either through lens metadata or direct support from lens manufacturers. Because of that, we have not needed to build our own calibration charts for those lenses.
In many cases, we have relied on the lens manufacturer to provide the data directly. With a lens serial number, manufacturers may be able to provide calibration information specific to that lens. Depending on where you are located, some cinema lens companies also have local facilities where you can bring in your camera and lens. They may have dedicated calibration rooms, precision charts, and tools that can provide much more accurate calibration data than a custom in-house setup.
In our experience, this manufacturer-supported path is usually the best option for high-end cinema and photography lenses when it is available.
1.6 Paired Lenses Are the Best Case
The best solution, when available, is to use paired lenses.
Some lens manufacturers can produce or select batches of lenses with closely matched characteristics. If every lens in your capture rig has the same or nearly the same distortion profile, your pipeline becomes much easier to control.
This is especially valuable if you cannot perform lens distortion correction before calibration and reconstruction.
With paired lenses, the distortion is at least consistent across the rig. That consistency matters. Even if distortion exists, it appears in the same way across the cameras. That gives you a much better starting point for cropping, calibration, stereo matching, multi-camera depth generation, and reconstruction.
If you are building a serious volumetric capture stage and have the budget to do it, paired lenses are one of the best investments you can make.
They reduce uncertainty.
And in volumetric video, reducing uncertainty is one of the fastest ways to reduce cost.
1.7 The Cost of Lens-Level Garbage
The biggest form of garbage introduced at the lens level is distortion.
If you do not address lens distortion up front, the rest of the pipeline has to compensate. That compensation can happen in several ways, but none of them are free.
The simplest workaround is to crop the image and avoid using the edges of the frame, where distortion is often most visible. That is not a perfect solution, but it is usually better than feeding the entire distorted frame into calibration, point cloud generation, or reconstruction.
The more expensive workaround is to add more cameras.
This is one reason you see rigs with many cameras covering the same area from multiple angles. The system is trying to compensate for uncertainty by adding more coverage. Even for something simple, like capturing a ball, poor lens distortion handling can force the pipeline to rely on redundant views to reconstruct the object correctly.
2. Global Shutter vs. Rolling Shutter
After lenses, one of the next capture decisions that can introduce garbage into the pipeline is shutter type.
If you are choosing a camera for volumetric capture, and you do not fully control the lighting environment, global shutter should be strongly considered. This becomes even more important when the subject is moving quickly: dance, boxing, sports, live music, fast gestures, stage performance, or anything with rapid motion.
Rolling shutter cameras can produce good results, but they require more care. You need to understand shutter speed, exposure, lighting, motion blur, and how the camera is processing the image. If you are not already experienced in controlling those variables, a rolling shutter camera can introduce artifacts that become very difficult to clean up later.
This is not meant to restart the global shutter versus rolling shutter debate. That debate has existed in the camera industry for a long time, and there are valid reasons to use either depending on the application.
For volumetric video, the practical guidance is simple: If you can use global shutter, use global shutter.
Global shutter gives you a cleaner way to capture fast motion because the entire frame is captured at the same moment. Rolling shutter captures the frame line by line, which means fast motion can introduce skew, wobble, distortion, or temporal inconsistencies across the image.
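To make the effect concrete, here is a rough back-of-the-envelope estimate of rolling-shutter skew. The sensor height, readout time, and subject speed are assumed numbers for illustration, not measurements of any specific camera:

```python
# Assumed values, not measured specs.
rows = 2160              # 4K UHD sensor height in rows
readout_s = 1 / 60       # time to read the frame top to bottom (~16.7 ms)
speed_px_per_s = 2000    # horizontal speed of a fast-moving subject in pixels/second

# The bottom row is exposed almost a full readout later than the top row,
# so the subject has moved horizontally by roughly this many pixels.
skew_px = speed_px_per_s * readout_s
print(f"top-to-bottom skew ~ {skew_px:.0f} px")  # ~33 px of lean across the frame
```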
Those inconsistencies matter in volumetric capture.
A 2D image with rolling shutter artifacts may still look acceptable to the eye. But in a volumetric pipeline, that same artifact can confuse calibration, stereo matching, point cloud generation, mesh reconstruction, splat generation, or texture alignment. The system may still produce an output, but it may require more cleanup, more processing time, or more downstream correction.
The challenge is that global shutter is not always easy to get in the camera category you want. High-resolution, full-frame, global-shutter cameras exist, but there are fewer options than with rolling shutter. If you are using mobile phones, you are usually working with rolling shutter. Some phones may compensate for rolling shutter artifacts through internal processing, but that also means the image may already be altered before your pipeline receives it.
Industrial cameras often provide more global shutter options, depending on the manufacturer, sensor, resolution, and frame rate. Photography, cinema, and broadcast cameras vary widely, so you need to evaluate the exact camera system you are considering.
You can absolutely build a volumetric pipeline with rolling shutter cameras.
Global shutter does not solve every problem.
3. Camera Sensors
The next major source of garbage is the sensor.
Every sensor has its own characteristics and flaws. If you do not understand those characteristics before building the pipeline, the work does not disappear. It moves downstream into color correction, calibration, point cloud cleanup, reconstruction, texture correction, or post-production.
That is why sensor choice matters so much in volumetric video.
A single-camera production can often work around sensor limitations. A volumetric capture system multiplies those limitations across every camera, every angle, and every frame.
3.1 Resolution
There are two ways to think about resolution.
The first is simple math.
One 8K camera contains roughly the same pixel count as sixteen 1080p cameras. A 4K camera contains roughly the same pixel count as four 1080p cameras. So if you are trying to cover the same capture area with lower-resolution cameras, you usually need more of them.
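The arithmetic behind that claim is easy to verify (using 7680 x 4320 for 8K UHD, 3840 x 2160 for 4K UHD, and 1920 x 1080 for 1080p):

```python
def pixels(w, h):
    return w * h

print(pixels(7680, 4320) / pixels(1920, 1080))  # 16.0 -> one 8K frame ~ sixteen 1080p frames
print(pixels(3840, 2160) / pixels(1920, 1080))  # 4.0  -> one 4K frame  = four 1080p frames
```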
In theory, you could align sixteen 1080p cameras perfectly to cover the same area as one 8K camera. In practice, that is nearly impossible, so if you use 16 cameras you end up with a lot of overlapping coverage that does not improve quality but does add to your reconstruction processing time. This kind of overlap is something we see all the time in volumetric rigs.
Resolution is one of the key drivers for reducing camera count. Reducing camera count reduces the number of places where garbage is added to the pipeline.
The second is that resolution is not only a number.
An 8K sensor in a cinema camera is not the same as an 8K sensor in a mobile phone. The same is true for 1080p, 4K, or any other resolution. Pixel count does not automatically equal image quality.
A higher-fidelity sensor, combined with access to cleaner image data, gives you a much better chance of avoiding sensor-level garbage. If the image coming from the sensor is already compressed, processed, denoised, sharpened, stabilized, or otherwise altered, you are starting with artifacts. Those artifacts may not matter much in a multi-camera broadcast setup, but they matter a lot in volumetric capture.
For example, if one camera sees the front of a person’s arm with one set of compression artifacts, and another camera sees the side of that same arm with a different set of artifacts, the reconstruction process has to reconcile that mismatch. You can see this problem in radiance field workflows, mesh reconstruction, and point cloud pipelines. The system does not magically know what is real and what is an artifact. It compensates by applying many different algorithms, including deep learning frameworks and computer vision methods, which tend to overfit (think far more splats generated than are really needed) or average the differences away (which is common in mesh generation).
This is what we call the sensor compounding effect. The more cameras you add with lower-quality sensors, the more the problem compounds across every camera with overlapping coverage. And so rigs add even more cameras to compensate.
In our experience, the difference can be dramatic. We have demonstrated systems where sixteen 8K cameras, using raw data from cinema full-frame sensors, captured two people in a 15-foot by 15-foot stage with no green screen and without perfectly even lighting. The two people were standing roughly half a foot apart, and the system produced a clean reconstruction with no artifacts and no manual cleanup. A version of that demonstration was presented through the Volumetric Format Association in 2022.
That type of result is not only about sensor resolution; it is about using sensors that do not add artifacts and that give you access to the raw data.
3.2 Raw Sensor Data
So let’s talk about raw sensor data. If you can access raw sensor data, we highly recommend trying to incorporate it into your volumetric pipeline.
Raw data gives you more control before the image is baked into a processed format. That matters for automation, real-time color balancing, exposure control, and reconstruction quality.
For example, imagine you are capturing outdoors and the sun starts to go down. Or clouds move over the capture area. Or stadium lights begin to dominate the scene. If you can access cleaner sensor data, your pipeline has a better chance of detecting those changes accurately and adjusting in real time across an entire camera array.
The more processed the image is before it reaches your pipeline, the less control you have.
There is no single perfect fallback if you cannot access raw sensor data, because the variables change from camera to camera. The issue could be dynamic range. It could be sensor noise. It could be the way the camera handles low light. It could be sharpening or denoising. It could be compression. It could be how the camera responds to patterns, skin tones, highlights, or shadows. It could be post-processing that happens before the image ever leaves the camera.
If you come from broadcast, film, or photography, use the same camera evaluation discipline you already know: test the camera under real conditions, look at dynamic range, noise, color, exposure behavior, motion, compression, and low-light performance.
But for volumetric video, do not evaluate the camera as a single angle. Evaluate what happens when you multiply that camera across 16, 32, 56, or 60 angles.
3.3 Color Across Cameras
The sensor also plays a major role in how a camera sees color.
If two cameras see the same color differently, the pipeline has to compensate. If that difference is happening during calibration, then your calibration is already being affected by garbage before reconstruction even begins.
This matters even more when you are not shooting under perfectly flat, controlled lighting.
Broadcast has understood this problem for a long time. Camera matching is part of the craft. Cameras are shaded, balanced, and monitored throughout production so they remain consistent. Field engineers or video engineers keep cameras matched during the event. There are also ways to automate parts of this process, although that is a larger discussion. In our experience, this profession is an absolute necessity when doing real-time volumetric video capture.
For volumetric capture, the same discipline matters.
The goal is not just to make each camera look good by itself. The goal is to make all cameras agree with each other.
If you cannot color match in real time, and if you do not have access to raw camera data, then you need to be very careful about the format you record or process. Avoid using streaming-first codecs such as H.264 or H.265 as your source for reconstruction if you can. Those codecs are designed for efficient delivery, not for preserving every detail needed for post-production or reconstruction.
A production codec such as ProRes is usually a much better option, even if the files are larger. The extra data can reduce artifacts and preserve more information for downstream processing.
At minimum, use a color chart at the start of capture. Ideally, each camera should see the same standardized color chart. In North America, that may be an NTSC-oriented chart. In other regions, a PAL-oriented chart may be appropriate. The key is to use a standard reference that gives you a consistent target across all cameras.
If you cannot use a standard chart, then create a consistent color reference of your own. It will not be as good as a standard chart, but it is better than having no reference at all.
If you are doing color correction as a post-processing step before reconstruction, that reference can help you match cameras before calibration and reconstruction. It may add work, but it removes garbage. If your goal is centimeter-level calibration accuracy, color consistency across cameras can make a significant difference.
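As one example of what a matching step can look like before reconstruction, here is a minimal sketch that solves a per-camera 3x3 color correction matrix from the chart patches by least squares. The variable names, the linear-light assumption, and the specifics of this approach are ours for illustration, not a particular product workflow:

```python
import numpy as np

def solve_ccm(measured, reference):
    """measured: Nx3 mean linear RGB of each chart patch as seen by one camera.
    reference: Nx3 target values for the same patches, shared by all cameras."""
    M, *_ = np.linalg.lstsq(measured, reference, rcond=None)
    return M  # 3x3 color correction matrix for this camera

def apply_ccm(image_linear, M):
    # Apply the per-camera matrix to every pixel of a linear-light frame.
    h, w, _ = image_linear.shape
    corrected = image_linear.reshape(-1, 3) @ M
    return corrected.reshape(h, w, 3).clip(0.0, 1.0)
```

Solving one matrix per camera against the same reference is what nudges the cameras toward agreeing with each other, which is the real goal.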
3.4 Sensor Garbage Compounds Quickly
The sensor is one of the easiest ways to add garbage into a volumetric pipeline because it is often treated as an afterthought.
Teams will grab phone cameras, action cameras, low-cost industrial cameras, or whatever is available and assume they can deal with the sensor problems later. That usually means adding more cameras, adding more post-production, adding more compute, and adding more cleanup.
The problem is that sensor garbage compounds.
If two camera angles do not match because the sensors are behaving differently, you have already doubled the problem. If that happens across 16 cameras, 32 cameras, or 56 cameras, the entire rig may be generating mismatched data at every frame from every angle.
The post-processing pipeline then has to clean it up.
The reconstruction pipeline has to compensate.
The texture pipeline has to hide it.
And the usual solution becomes: add more cameras.
That is not always a solution. Sometimes it is just a more expensive way to bury the original problem.
If you want to fail fast, test the sensor early. Test it under the lighting conditions you expect to use. Test multiple units of the same camera. Test color consistency across angles. Test compression artifacts. Test raw versus processed output. Test motion. Test low light. Test the actual content you plan to capture.
Do not ask only whether the image looks good.
Ask whether the image will still behave well when multiplied across an entire volumetric rig.
4. Multi-Camera Sync and Timecode
Multi-camera sync is one of the fastest ways to create garbage in a volumetric pipeline.
If your cameras are not truly synchronized, every frame can introduce error before reconstruction even begins.
So first, we need to define what we mean by sync.
In volumetric capture, sync does not simply mean that all cameras are recording at the same frame rate. It does not only mean that every camera is set to 29.97, 30, 60, or 120 frames per second.
What we really mean is that the shutters across all cameras open and close at the correct time, relative to the same source, for every frame.
That distinction matters.
If you are capturing something simple, like a cooking video, an educational video, or a scene with very little motion, basic frame-rate alignment may be good enough. But if you are capturing fast movement, such as sports, dance, boxing, live music, or stage performance, sync becomes much more important.
At 60 frames per second, and especially at 120 frames per second, you need to know more than whether each camera is recording at the same nominal frame rate. You need to know when each camera starts exposure, how long the shutter stays open, when exposure ends, and whether that timing is reliable across the entire rig.
This is one of the places where early depth-camera and structured-light systems could introduce problems. In some early Microsoft Kinect, Azure Kinect, and Intel RealSense workflows, sync could mean “trigger the camera to take a picture,” but it did not always guarantee that every device exposed the frame at the exact same moment. If the device needed to adjust exposure, process a depth frame, or handle lighting changes, that timing could vary.
For volumetric video, that variation becomes garbage.
The question is not just, “Did every camera capture a frame?”
The real questions are:
Did every camera start exposure at the right time?
Did every camera end exposure at the right time?
Did every camera stay disciplined to the same timing source?
Did that remain true across the full duration of the capture?
That is what we mean by multi-camera sync.
4.1 The Sync Source Matters
The next question is where the sync signal comes from.
We have worked with custom sync generators. We have used oscilloscopes as sync sources. We have tested GPU clocks, CPU clocks, FPGA-based timing, and other approaches. Some of these can work in controlled scenarios, and FPGA timing can work very well, but for most production environments, the best answer is usually an industry-standard sync generator.
Broadcast solved this problem a long time ago.
One of the most established approaches is tri-level sync. With a high-quality master clock, especially one with a stable crystal or GPS reference, you can maintain a sync source for very long periods without meaningful drift. That kind of reliability matters when you are building a serious capture system.
There are other ways to generate timing, but many of them are not ideal as the master source for a multi-camera volumetric rig. GPU clocks, CPU clocks, and software-based timing can drift or behave inconsistently over long durations. They may be useful inside parts of a pipeline, but they should not be treated as a replacement for a disciplined master clock unless you have tested the system deeply and understand the drift characteristics.
4.2 PTP and Modern IP-Based Sync
The other major industry-standard sync approach is PTP, or Precision Time Protocol.
PTP is especially important as media infrastructure moves toward IP-based workflows. Whether you are using a full SMPTE ST 2110 environment or simply using PTP as a timing source alongside SDI video, there is now enough infrastructure in switches, master clocks, and cameras to make PTP a practical option.
4.3 Timecode Is Not the Same as Sync
If you are not coming from the broadcast industry, timecode may seem like a frame number.
In some situations, that is a useful way to think about it. But in production workflows, timecode and frame count are not always the same thing.
For example, in 29.97 drop-frame workflows, the timecode count is designed to stay aligned with real clock time, even though the frame rate is not exactly 30 frames per second. That means the frame number and the timecode number do not always match in the simple way many people expect.
For a three-minute or five-minute capture, this may not matter much. But once you start capturing longer takes, especially 10 minutes or more, timecode behavior can become important.
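If you want to see why frame numbers and drop-frame timecode diverge, here is a small sketch of the standard 29.97 drop-frame counting rule: timecode values 00 and 01 are skipped at the start of every minute that is not a multiple of ten:

```python
def frames_to_dropframe_tc(frame_count, fps=30):
    """Convert a raw frame index into 29.97 drop-frame timecode (HH:MM:SS;FF)."""
    drop = 2                                  # timecode values skipped per drop minute
    frames_per_min = fps * 60 - drop          # 1798
    frames_per_10min = fps * 600 - drop * 9   # 17982
    d, m = divmod(frame_count, frames_per_10min)
    if m > drop:
        frame_count += drop * 9 * d + drop * ((m - drop) // frames_per_min)
    else:
        frame_count += drop * 9 * d
    ff = frame_count % fps
    ss = (frame_count // fps) % 60
    mm = (frame_count // (fps * 60)) % 60
    hh = frame_count // (fps * 3600)
    return f"{hh:02d}:{mm:02d}:{ss:02d};{ff:02d}"

print(frames_to_dropframe_tc(1799))  # 00:00:59;29
print(frames_to_dropframe_tc(1800))  # 00:01:00;02  (values 00 and 01 were skipped)
```

On a short take the offset is invisible; over a long take it is exactly the kind of mismatch that surprises people who assume timecode is just a frame counter.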
The key question is where the timecode is coming from.
If timecode is generated by a master clock and sent to every camera, then each camera can be aligned to the same timing source. When the master clock says a frame is at 01:15:16:05, every camera should agree.
If each camera generates its own timecode internally, even if you jam-sync the cameras at the start, you should expect drift over time. Camera manufacturers often publish drift specifications, but the important point is simple: if the cameras are not continuously locked to the same timing source, they will drift.
Once that happens, timecode becomes less reliable for aligning frames. You may need another sync point, such as a clap, flash, marker, or other reference event. That can work, but now you have introduced another manual or semi-manual step into the pipeline.
That is another form of garbage.
It is much better to invest in hardware that supports industry-standard sync and timecode from the start.
4.4 LTC, PTP, NTP, and Practical Timing Choices
In traditional production workflows, LTC, or linear timecode, is one of the most common ways to distribute timecode. Most professional timecode generators support LTC, and many professional cameras and recorders can receive it.
In IP-based workflows, PTP is the more precise timing approach and should be preferred when the infrastructure supports it.
NTP can be useful for general network time alignment, but it is not the same class of timing solution as PTP for frame-accurate capture. You may be able to use NTP for some parts of a pipeline, especially when supported by reliable network infrastructure, but it should not be treated as a replacement for proper sync in a serious volumetric capture system.
Network switches matter here. If you are building around PTP or even relying on stable NTP behavior, the switch infrastructure needs to support the timing requirements. High-quality enterprise switches, such as Cisco switch gear, can be very reliable in these workflows. Other switches may also work, but you need to verify their timing behavior instead of assuming it.
One useful note: even small devices such as Raspberry Pi systems can support PTP and NTP. If you are using them as microcontrollers or control nodes in a larger pipeline, they can participate in a disciplined timing environment as long as the network infrastructure is configured correctly.
4.5 Phones, Action Cameras, and Adapters
There are adapters and workflows that can add timecode, LTC, SDI, or genlock-like behavior to cameras that were not originally designed for professional synchronized capture. Some mobile phone workflows and action camera workflows have explored this path.
These can be useful, but they also add complexity.
If you are adding adapters to make a consumer device behave like a professional camera, ask whether you are solving the right problem. You may be adding hardware, synchronization complexity, calibration complexity, and operational risk to compensate for limitations that are already solved in professional camera systems.
This does not mean you cannot use phones or action cameras. It means you need to understand what kind of sync you actually have.
Are the cameras continuously locked?
Are they only jam-synced at the start?
Does the adapter provide true genlock, timecode, or just a reference point?
How long can the system run before drift becomes visible?
If you are capturing for 30 minutes, 40 minutes, or an hour, many action cameras and mobile devices will drift unless they are properly disciplined. Some systems can be re-jammed before each take, and that may be acceptable for certain workflows. But if you are trying to build a repeatable volumetric pipeline, you should know exactly how drift is handled.
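A quick back-of-the-envelope calculation shows why. Assuming a free-running camera clock accurate to about 20 parts per million, which is an assumed figure rather than a spec for any particular camera:

```python
ppm = 20e-6          # assumed clock accuracy of +/-20 ppm
capture_s = 40 * 60  # a 40-minute take
fps = 60

drift_s = capture_s * ppm
print(f"worst-case drift ~ {drift_s * 1000:.0f} ms ~ {drift_s * fps:.1f} frames at {fps} fps")
# ~48 ms, or nearly 3 frames at 60 fps, and two free-running cameras drifting in
# opposite directions can end up roughly twice that far apart by the end of the take.
```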
Do not assume that matching timestamps means frame-accurate synchronization.
4.6 Frame-Accurate Sync Is an Industry-Solved Problem
This is one of those problems that the broader media industry has already solved.
In small multi-camera setups, people often use HDMI capture cards, OBS, NDI, or similar tools to create an affordable production workflow. That can work very well for switching between cameras, recording podcasts, streaming events, or producing simple multicam content.
But that is not the same as frame-accurate sync.
For many traditional video workflows, frame-accurate sync may not be necessary. If one camera is a frame or two off, the audience may never notice. But volumetric capture is different. The system is not just showing one angle at a time. It is using multiple angles together to reconstruct geometry, depth, texture, or a 3D representation of the scene.
That means timing errors become spatial errors.
A person’s arm, foot, face, or body position may not line up across cameras. A fast-moving object may appear in different positions from different angles. The reconstruction system may interpret that as noise, geometry error, texture mismatch, or missing data.
The result is more cleanup, more processing, more artifacts, and less reliable reconstruction.
4.7 Real-Time Color Balance Is Not Optional
We have already talked about color consistency at the sensor level and how the camera sends frames into your pipeline. Now we need to talk about what the camera and pipeline need to do to prevent color from becoming garbage downstream.
The problem is simple: two cameras can look at the same object and see different colors.
That may sound minor, but in volumetric video it matters a lot. If one camera sees a shirt as one shade of red and another camera sees that same shirt as a slightly different red, the reconstruction system has to decide whether those two points are the same surface, different surfaces, noise, lighting variation, or texture variation.
The more cameras you add, the more this problem compounds.
This is why white balance and color balance need to be treated as separate steps.
White balance is about matching the camera to the color temperature of the light. If you are in a controlled studio, you may know the color temperature because you know the lighting setup.
Color balance is about making sure the cameras agree with each other across the full color range. Even with digital cameras, even with cameras from the same manufacturer, and even with cameras using the same sensor model, each camera can lean slightly differently. One may push a little red. Another may handle cyan differently. Another may respond differently to skin tones, shadows, highlights, or saturated colors.
A color chart gives every camera the same known reference, and a waveform monitor lets you see how each camera is interpreting that reference.
If you watch live sports, this is happening constantly. Cameras are being shaded, balanced, monitored, and adjusted so that the broadcast stays consistent as the lighting changes, the angle changes, or the action moves across the venue.
Volumetric capture needs the same discipline.
If the sun goes behind a cloud, if stadium lights turn on, if a performer moves from one lighting zone to another, or if the capture environment changes during the shoot, the pipeline needs a way to maintain color consistency. If it cannot, that inconsistency becomes garbage.
There are several ways to solve this.
The best option is to build color balancing into the live capture workflow. If you are using a professional SDI or broadcast-style pipeline, there are existing tools for camera shading, waveform monitoring, vectorscope analysis, color correction, and camera control. These tools were not invented for volumetric video, but they solve a problem volumetric video absolutely has.
Another option is to use a custom software pipeline that performs camera matching before reconstruction. This may be necessary if you are using industrial cameras, machine vision cameras, mobile devices, or a non-broadcast capture stack. If you go this route, you need to make sure the color correction step preserves sync and timecode, and that every corrected frame still maps back to the correct camera and capture window.
You can also correct color in post before reconstruction. This is not ideal for real-time workflows, but it may be acceptable for offline pipelines. In that case, every camera should capture a known color reference, ideally an industry-standard color chart, at the start of the shoot. You can then use that reference to match cameras before calibration and reconstruction.
What you want to avoid is capturing mismatched camera angles and assuming reconstruction will fix it.
It will not really fix it. It will compensate.
That compensation may show up as noisier geometry, unstable textures, more splats than necessary, longer processing times, or more cleanup. In radiance field and Gaussian splat workflows, color mismatch can cause the model to spend capacity explaining differences that should never have been introduced in the first place.
There are also lower-cost tools that can help. OBS and NDI workflows, for example, can provide plugins and software-based approaches to color correction. But you need to be careful. Those workflows may not preserve the same industry-standard sync and timecode discipline you get from a professional SDI or SMPTE ST 2110 pipeline. If you are using SDI capture cards from companies like Blackmagic Design, AJA, or similar vendors, make sure the plugin or software path preserves sync, timecode, and frame identity correctly.
This is the key point: color correction cannot be treated as separate from timing. A corrected frame still has to be the right frame, from the right camera, in the right sync window.
The best solution is to use a pipeline where sync, timecode, color balance, and camera control all work together. The broadcast industry already has decades of experience here, and volumetric video should take advantage of that work instead of treating color balance as an afterthought.
5. Multicamera Calibration: The First Non-Hardware Source of Garbage
Multicamera calibration is one of the first places where garbage enters the pipeline after the image leaves the camera.
Yes, sensors, cameras, and lenses may all involve software, firmware, ASICs, and internal processing. But for this section, we are talking about the first major non-hardware stage of the volumetric pipeline: the step where the system tries to understand where every camera is in space and how those cameras relate to one another.
For most volumetric reconstruction workflows, this calibration data is required. The system needs to know camera position, orientation, lens characteristics, focal length, and how one camera view maps to another. If that information is wrong, everything downstream becomes harder.
Bad calibration creates seams. It creates mismatched geometry. It creates poor depth maps. It creates bad point clouds. It creates texture errors. And then the reconstruction system has to compensate.
Today there are several ways the industry approaches multi-camera calibration.
5.1 Target-Based Calibration
This is the traditional computer-vision method. It has been used in volumetric capture for many years and is still used by some companies today.
The basic idea is simple: place a known target in the capture volume, capture it from multiple cameras, and use that known pattern to solve for the camera positions and relationships.
The target could be a checkerboard, a ChArUco board, a coded pattern, or another calibration object with identifiable features. The material matters more than people often realize. Glass, metal, wood, vinyl, printed boards, and projected charts can all behave differently depending on the cameras and lighting.
This is especially true for structured-light and time-of-flight systems. If the calibration process uses infrared data, the target material can significantly affect the result because infrared reflects differently across different surfaces and inks.
For RGB-based cameras, the material still matters, but the goal is often simpler: make sure the target is flat, clean, sharp, high contrast, and free of bumps, wrinkles, glare, or warping.
For a long time, target-based calibration provided some of the best results in multi-camera volumetric capture. There are plenty of textbooks, papers, and engineering references on how to do it well.
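For the multi-camera part of the solve, a common pattern is to calibrate each lens first and then solve pairwise extrinsics against the shared target. Here is a condensed sketch using OpenCV's stereoCalibrate; the inputs are assumed to come from a per-camera detection step like the checkerboard example earlier in this post:

```python
import cv2

def solve_pair_extrinsics(obj_points, img_points_a, img_points_b,
                          K_a, dist_a, K_b, dist_b, image_size):
    """Solve where camera B sits relative to camera A from shared target views.

    obj_points / img_points_*: per-frame board detections seen by both cameras
    in the same frames, built the same way as the per-lens example above.
    """
    flags = cv2.CALIB_FIX_INTRINSIC  # trust per-lens intrinsics, solve extrinsics only
    rms, _, _, _, _, R, T, E, F = cv2.stereoCalibrate(
        obj_points, img_points_a, img_points_b,
        K_a, dist_a, K_b, dist_b, image_size, flags=flags)
    return rms, R, T

# Chaining pairs (A-B, B-C, ...) and refining globally gives the full rig layout.
# The per-pair RMS reprojection error is an early warning that a detection or
# lens model is off before that error becomes reconstruction garbage.
```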
But it has limitations.
The first limitation is practical: the target has to be in the capture area.
Some studios solve this by placing calibration targets near the cameras, along the walls, or in parts of the stage where they remain visible. That can work in a controlled studio. It is much harder for concerts, sports, live events, or any environment where visible targets would distract the audience or interfere with the production.
The second limitation is that many target-based workflows primarily rely on 2D image observations. There may be some 3D reasoning involved, but the core solve is often based on how known 2D features appear across camera views.
That becomes a problem when the stage has vibration.
A camera does not vibrate only in the 2D image plane. It moves in 3D space. A point in the image may shift left or right, but the camera may also be moving forward, backward, rotating, or changing its relationship to the subject. Reconciling the difference between the original calibration and the real camera position during capture can be extremely difficult.
In theory, there are methods to compensate for this. In practice, doing it reliably and automatically is hard. We have tried. We have built automation. We have built manual tools. We have worked through the theory. The reality is that if your rig or stage vibrates, and your calibration process cannot account for that movement, you are going to introduce garbage downstream.
5.2 Natural-Feature Calibration and Structure from Motion
The second approach is natural-feature calibration, usually built around structure from motion and bundle adjustment.
This is the world of COLMAP and related workflows.
Instead of using a fixed calibration target, the system finds natural features in the images and uses those features to solve camera positions and scene structure. This approach has become especially common in the era of NeRFs, Gaussian splats, and radiance field workflows.
COLMAP has come a long way over the last several years. Variations, forks, and modified workflows built around similar ideas can now compete with many proprietary calibration systems in a wide range of scenarios.
The advantage is obvious: you do not need a calibration target in the scene. If the environment has enough visible features and enough overlap between camera views, the system can often solve the camera layout from the images themselves.
That makes this approach far more practical for many real-world captures.
It also has advantages when dealing with certain types of motion or vibration, although the details can become complex quickly. We are not going to get into those techniques here, but the key point is that natural-feature methods have become a major part of the modern volumetric toolkit.
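As a reference point, a basic natural-feature solve can be scripted around the COLMAP command-line tools. This sketch assumes COLMAP is installed and on the PATH, and the directory names are placeholders:

```python
import subprocess
from pathlib import Path

images = Path("capture/frames")   # e.g. one still per camera from the same instant
work = Path("capture/colmap")
work.mkdir(parents=True, exist_ok=True)
db = work / "database.db"
(work / "sparse").mkdir(exist_ok=True)

# Feature extraction, matching, then incremental mapping (camera poses + sparse points).
subprocess.run(["colmap", "feature_extractor",
                "--database_path", str(db), "--image_path", str(images)], check=True)
subprocess.run(["colmap", "exhaustive_matcher",
                "--database_path", str(db)], check=True)
subprocess.run(["colmap", "mapper",
                "--database_path", str(db), "--image_path", str(images),
                "--output_path", str(work / "sparse")], check=True)
```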
The downside is that these methods need features and overlap.
If the scene is visually plain, poorly lit, repetitive, reflective, motion-blurred, or lacking enough shared visual information between cameras, the solve can degrade. If the system has to guess focal length, camera position, or lens behavior in order to make the images align, that guess can become a new source of garbage.
5.3 3D-Based Calibration
A third approach is 3D-based calibration.
We do not see this used as widely today, partly because COLMAP-style workflows have become so common. But there is a deep body of research around solving camera relationships using 3D information directly.
There are many ways this can be implemented. You will find papers from universities, large technology companies, and computer-vision research groups exploring different forms of 3D alignment, geometric reconstruction, and camera pose estimation.
The accuracy has improved over time, although in many workflows it has not been as broadly adopted as natural-feature and bundle-adjustment approaches.
We have not focused heavily on this area recently, so it may have advanced further than we have seen. If anyone is doing high-quality 3D multi-camera calibration without relying primarily on 2D image features, we would be very interested to learn more.
5.4 Hybrid 2D and 3D Calibration
The fourth approach is a hybrid method that combines 2D and 3D calibration.
In this workflow, you may start with a 3D solve to establish the camera positions in space, then use 2D image data to refine alignment. Or you may use image-based alignment and reconcile it against a 3D model of the capture environment.
When done well, this can be one of the most accurate approaches to multi-camera calibration.
It can also help with vibration, drift, and stage movement because the system has more than one type of evidence. It is not only asking, “Where does this feature appear in the 2D image?” It is also asking, “Where is this camera in 3D space, and how does that position align with the observed image data?”
This is an area where we believe there is still a lot of room for innovation.
5.5 Where Calibration Garbage Enters
Once you understand the major approaches to calibration, the next question is where garbage gets in.
The first issue is camera position.
If the cameras are not solved correctly in world space, the reconstruction will not align properly. That often shows up as seams, texture mismatches, geometry errors, or unstable surfaces. Some reconstruction techniques can hide those seams, but hiding the error is not the same as removing it.
The second issue is focal length.
If the calibration system does not know the true focal length, it may have to estimate it. In some cases, the system may use an adjusted focal length simply to make points or images line up. That can produce a result, but the result may not match the physical reality of the capture rig.
This becomes especially important if you are using multiple focal lengths in the same stage.
Many volumetric systems avoid multiple focal lengths because they are harder to calibrate. But if you can support them, they can reduce the number of cameras you need. Different focal lengths can give you better coverage across different parts of the capture volume. This is especially important in sports, where you cannot always place cameras exactly where you want them, and some cameras may need to cover much longer distances.
The Volumetric Format Association has published work on stage layouts and the benefits of different focal lengths for different types of content. These concepts have been tested in both large and small capture environments.
The third issue is vibration and drift.
If the rig moves, if the stage vibrates, or if a camera shifts after calibration, the original calibration is no longer fully accurate. The pipeline then has to compensate. That compensation may happen during depth generation, mesh generation, splat training, texture cleanup, or post-processing, and all of it is working around garbage introduced by bad calibration.
5.6 How Pipelines Compensate
When calibration is not accurate enough, the volumetric pipeline usually compensates later. This compensation accounts for much of the research and development done in the world of volumetric video; it is where companies spend a lot of their time, and it is often the basis for claims that one volumetric reconstruction pipeline is better than another.
For example, in mesh-based workflows, that compensation may happen during depth map cleanup, mesh generation, mesh smoothing, hole filling, or texture blending. Some companies in the volumetric space have spent enormous time improving these cleanup steps because their capture hardware and calibration accuracy required it.
That work can produce better and better results over time. But it is important to understand what is happening: the pipeline is compensating for garbage that entered earlier.
In Gaussian splat, NeRF, and radiance field workflows, for example, the compensation often happens differently.
These methods usually benefit from many camera views and a lot of overlap. The more cameras that can see the same point, the more data the calibration and reconstruction systems have to work with.
If you only have six cameras looking front, back, left, right, top, and bottom, there may not be enough overlap to produce a strong solve. Add more cameras, and now multiple cameras can see the same surface from slightly different angles. Add more again, and the calibration and reconstruction may improve further.
That is one reason many radiance field and splat workflows use more cameras than time-of-flight or structured-light approaches traditionally required.
More cameras give the volumetric video pipeline more data to work with, but more cameras also add cost and more places for garbage to enter.
6. Point Cloud Cleanup
Point cloud cleanup is one of the last places where software can prevent garbage from moving deeper into the reconstruction pipeline.
It is not always “garbage in” in the same way as a bad lens, poor sensor choice, weak sync, color mismatch, or bad calibration. By this point, the system has already captured the images and generated some form of 3D data. But this step still matters because it determines whether you are passing a clean point cloud into the next stage, or whether you are asking the reconstruction system to fill, smooth, mesh, voxelize, or splat data that is already noisy, sparse, or unstable.
We do not see enough volumetric stages treating point cloud cleanup as a dedicated step.
In many workflows, the pipeline goes directly from point cloud generation to gap filling, meshing, voxel generation, Gaussian splats, or another reconstruction method. That can work, but it often means the next stage has to solve two problems at once:
What is missing?
What should not be there?
Those are different problems.
If your point cloud is sparse, which is common depending on the technique used to generate it, the usual answer is often to add more cameras. More cameras give you more coverage, more views, and more points. But they also increase cost, synchronization complexity, calibration complexity, color-matching requirements, bandwidth, storage, processing, heat, power, and failure points.
So once again, the “solution” becomes more infrastructure.
Point cloud cleanup gives you another path.
The goal is to improve the quality of the data before the next reconstruction step. That may mean removing bad points. It may mean identifying outliers. It may mean estimating where missing points should be. It may mean using the 2D image data, camera calibration, depth information, normals, temporal consistency, or semantic knowledge of the subject to make better decisions about what belongs in the point cloud.
There is a tremendous amount of research in this area. There are white papers, open-source libraries, and established techniques for point cloud denoising, outlier removal, normal estimation, surface reconstruction, point completion, and temporal filtering.
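As one small example of what a dedicated cleanup stage can look like, the sketch below uses the open-source Open3D library to remove statistical outliers from a single frame's point cloud. The file paths and thresholds are placeholders to tune for your own stage and subject, not recommendations.

```python
import open3d as o3d

# Load one frame's point cloud (placeholder path for your own data).
pcd = o3d.io.read_point_cloud("frame_0001.ply")

# Statistical outlier removal: drop points whose average distance to their
# nearest neighbours is far from the local norm. nb_neighbors and std_ratio
# are starting points to tune, not defaults we endorse.
cleaned, inlier_idx = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)

print(f"kept {len(inlier_idx)} of {len(pcd.points)} points")
o3d.io.write_point_cloud("frame_0001_clean.ply", cleaned)
```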
The important thing is to treat this as a stage in the pipeline, not as an afterthought.
For example, if you know where a face is, and you have a reasonable set of points plus the corresponding 2D image data, there are techniques that can help estimate where missing points should be. This does not always require deep learning. Traditional computer vision, geometry, depth estimation, and interpolation techniques can all help.
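Here is one minimal, hypothetical sketch of that idea: given a per-camera depth map with holes and a face bounding box from a 2D detector, plain interpolation can fill in missing depth inside the face region before points are generated from it. The file path and box coordinates are made up for illustration, and this is only one of many possible approaches.

```python
import numpy as np
from scipy.interpolate import griddata

# Hypothetical inputs: a per-camera depth map with NaN holes, plus a face
# bounding box found by a 2D face detector on the matching RGB frame.
depth = np.load("cam03_depth_0001.npy")      # H x W, metres, NaN where missing
x0, y0, x1, y1 = 820, 410, 1010, 640         # face box from the 2D detector

face = depth[y0:y1, x0:x1].copy()
valid_y, valid_x = np.nonzero(~np.isnan(face))
hole_y, hole_x = np.nonzero(np.isnan(face))

# Fill holes inside the face region by interpolating from valid neighbours.
# With method="linear", holes outside the convex hull of valid samples stay NaN.
face[hole_y, hole_x] = griddata(
    points=np.column_stack([valid_x, valid_y]),
    values=face[valid_y, valid_x],
    xi=np.column_stack([hole_x, hole_y]),
    method="linear",
)
depth[y0:y1, x0:x1] = face
```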
Faces are especially important because everyone notices when a face is wrong. The same is true for hands and fingers. For years, one of the easiest ways to evaluate a volumetric capture system was to ask: show me the fingers. If the fingers fell apart, melted together, disappeared, or turned into noisy geometry, you knew the system was struggling.
Point cloud cleanup can help address those kinds of problems before they become mesh problems, splat problems, voxel problems, or texture problems.
If you are working with a dense point cloud, cleanup is still important. In that case, the problem may not be missing data. It may be too much data, duplicate data, unstable data, or points that are close to the surface but not actually useful. You may need to remove redundant points, filter noisy clusters, smooth local neighborhoods, or identify points that do not agree with surrounding geometry.
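A minimal sketch of that kind of dense-cloud cleanup, again using Open3D, might merge near-duplicate points onto a regular grid and then drop isolated points with too few neighbours. The voxel size and radius below are placeholder values in metres, to be tuned for your capture, not recommendations.

```python
import open3d as o3d

pcd = o3d.io.read_point_cloud("frame_0001_dense.ply")   # placeholder path

# Merge near-duplicate points into voxel cells, then remove points that have
# fewer than nb_points neighbours within the given radius.
down = pcd.voxel_down_sample(voxel_size=0.002)
filtered, kept_idx = down.remove_radius_outlier(nb_points=16, radius=0.01)

print(f"{len(pcd.points)} -> {len(down.points)} -> {len(filtered.points)} points")
```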
Even if you are not creating a mesh, normal estimation can be valuable. Estimating point normals helps you understand the direction a surface is facing, even when you do not yet have explicit faces or polygons. That can improve later voxel generation, splat generation, surface reconstruction, texture projection, and other downstream steps.
This may sound unintuitive because point clouds do not have faces in the way meshes do. But knowing the likely surface orientation of points gives the pipeline more context. If you can combine point position, RGB information, the original 2D image data, calibration, and normal direction, you have a much richer representation than raw points alone.
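A short sketch of normal estimation with Open3D, assuming a cleaned single-frame point cloud and a known camera position from calibration (both placeholders here). For clouds fused from many cameras you would orient normals per source camera or use a consistency-based method instead.

```python
import numpy as np
import open3d as o3d

pcd = o3d.io.read_point_cloud("frame_0001_clean.ply")    # placeholder path

# Estimate a normal per point from its local neighbourhood, then orient the
# normals toward the contributing camera so "outward" is consistent.
pcd.estimate_normals(
    search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=0.01, max_nn=30)
)
pcd.orient_normals_towards_camera_location(camera_location=np.array([0.0, 0.0, 3.0]))

o3d.io.write_point_cloud("frame_0001_normals.ply", pcd)
```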
That can improve quality.
It can also reduce cost.
In some workflows, better point cloud cleanup can reduce the need for additional cameras. That is not guaranteed, and it depends heavily on the capture method, subject, lighting, camera layout, and reconstruction technique. But if cleanup gives the next stage better information, the system may need less redundant coverage to produce a stable result.
It can also help with speed.
If your reconstruction pipeline already takes minutes per frame, calculating normals, filtering outliers, or doing targeted point cleanup may be a relatively small cost compared with the time spent downstream. In many cases, that early cleanup can reduce the amount of work the later reconstruction stage has to do.
The lesson is simple: do not rush from point cloud generation directly into reconstruction without asking whether the point cloud is clean enough to trust.
A sparse point cloud does not always mean you need more cameras.
A noisy point cloud should not be treated as something the mesh or splat stage will magically fix.
And a dense point cloud is not automatically a good point cloud.
Point cloud cleanup is where you can remove uncertainty before it becomes more expensive.
In volumetric video, that is exactly what failing fast is supposed to do.
7. Garbage In, Garbage Out Starts with Your Hardware Choices
Start with the lenses. Use the best lenses you can reasonably get. If your lens can provide calibration metadata, make sure your pipeline can read it and use it for lens distortion correction. Even if you do not reduce the number of cameras, correcting distortion before reconstruction can improve calibration, point cloud quality, texture alignment, and overall output.
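As a rough illustration, here is what correcting distortion before the rest of the pipeline can look like with OpenCV, whether the coefficients come from lens metadata or from your own checkerboard calibration. The intrinsic matrix and distortion coefficients below are placeholder values, not real measurements.

```python
import cv2
import numpy as np

# Placeholder intrinsics and distortion coefficients (k1, k2, p1, p2, k3)
# for a 4K frame; in practice these come from lens metadata or calibration.
K = np.array([[2200.0, 0.0, 1920.0],
              [0.0, 2200.0, 1080.0],
              [0.0, 0.0, 1.0]])
dist = np.array([-0.12, 0.04, 0.0, 0.0, 0.0])

img = cv2.imread("cam07_frame_0001.png")      # placeholder path
h, w = img.shape[:2]

# alpha=0 crops to pixels that remain valid after undistortion.
new_K, roi = cv2.getOptimalNewCameraMatrix(K, dist, (w, h), 0)
undistorted = cv2.undistort(img, K, dist, None, new_K)
cv2.imwrite("cam07_frame_0001_undistorted.png", undistorted)
```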
Then look closely at your camera sensors. Not all sensors are equal. A 4K sensor in a mobile phone is not the same as a 4K full-frame sensor in a cinema camera. The image may also be processed before it ever reaches your pipeline, especially with mobile phones. If you can access raw camera data, or at least a lossless or production-quality image format instead of a heavily compressed one, your color calibration and multi-camera calibration will be much better.
For sync and timecode, choose industry-standard infrastructure whenever possible. If you are buying sync equipment, use hardware that supports established sync and timecode workflows. If you are using action cameras, mobile phones, or other devices without true frame-accurate sync, be honest about the drift risk and build a way to manage it. Drift is garbage, and it becomes expensive quickly.
Spend serious time on multi-camera calibration. If you are going to calibrate once and apply that calibration to every frame, check it. Then check it again. Bring the calibration data into a 3D viewer if you can. Look at the camera positions. Verify focal lengths. Measure distances from known points in the stage. Make sure the virtual camera layout matches the real-world camera layout.
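One simple sanity check, sketched below with placeholder extrinsics: recover each camera's position from its rotation and translation, then compare the solved camera-to-camera distances against distances you measured on the stage with a tape measure.

```python
import numpy as np

# Hypothetical world-to-camera extrinsics exported from a calibration tool:
# rotation R (3x3) and translation t (3,), so x_cam = R @ x_world + t.
extrinsics = {
    "cam01": (np.eye(3), np.array([0.0, 0.0, 3.5])),
    "cam02": (np.array([[0.0, 0.0, -1.0],
                        [0.0, 1.0,  0.0],
                        [1.0, 0.0,  0.0]]), np.array([0.0, 0.0, 3.5])),
}

def camera_center(R, t):
    """Camera position in world coordinates: C = -R^T t."""
    return -R.T @ t

centers = {name: camera_center(R, t) for name, (R, t) in extrinsics.items()}

# Compare the solved camera separation against a tape measurement on the stage.
solved = np.linalg.norm(centers["cam01"] - centers["cam02"])
print(f"cam01-cam02 solved distance: {solved:.2f} m  (tape measure: 4.95 m)")
```

If the solved layout does not match the physical layout to within your tolerance, stop and investigate before capturing.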
If the calibration is off, do not just move forward and hope reconstruction fixes it. Find out why. Maybe there is too much light. Maybe there is not enough light. Maybe there are not enough unique visual features in the scene. Maybe the lenses are not reporting correctly. Maybe the focal length is being guessed. Maybe something shifted. Maybe you need to add or move calibration references.
Even if calibration takes an extra half hour, it is worth it.
Do not shortcut multi-camera calibration.
And one final lesson from experience: avoid doing the capture first and then trying to calibrate after the fact. We have seen it happen, and we have been guilty of it ourselves. It is a recipe for adding garbage into the pipeline before reconstruction even begins.
In volumetric video, the best reconstruction pipeline is the one that has less garbage to fix.