Massive CoreML latency spike on live AVFoundation camera feed vs. offline inference (CPU+ANE)

Hello,

I’m experiencing a severe performance degradation when running CoreML models on a live AVFoundation video feed compared to offline or synthetic inference. This happens across multiple models I've converted (including SCI, RTMPose, and RTMW) and affects multiple devices.

The Environment

OS: macOS 26.3, iOS 26.3, iPadOS 26.3

Hardware: Mac14,6 (M2 Max), iPad Pro 11 M1, iPhone 13 mini

Compute Units: cpuAndNeuralEngine

The Numbers

When testing my SCI_output_image_int8.mlpackage model, the inference timings are drastically different:

Synthetic/Offline Inference: ~1.34 ms

Live Camera Inference: ~15.96 ms

Preprocessing is completely ruled out as the bottleneck: my profiling shows total preprocessing (nearest-neighbor resize + feature provider creation) takes only ~0.4 ms in camera mode, and no frames are being dropped.

What I've Tried

I am building a latency-critical app and have implemented almost every recommended optimization to try and fix this, but the camera-feed penalty remains:

  • Matched the AVFoundation camera output format exactly to the model input (640x480 at 30/60fps).

  • Used IOSurface-backed pixel buffers for everything (camera output, synthetic buffer, and resize buffer).

  • Enabled outputBackings.

  • Loaded the model once and reused it for all predictions.

  • Configured MLModelConfiguration with reshapeFrequency = .frequent and specializationStrategy = .fastPrediction.

  • Wrapped inference in ProcessInfo.processInfo.beginActivity(options: .latencyCritical, reason: "CoreML_Inference").

  • Set DispatchQueue to qos: .userInteractive.

  • Disabled the idle timer and enabled iOS Game Mode.

  • Exported models using coremltools 9.0 (deployment target iOS 26) with ImageType inputs/outputs and INT8 quantization.
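For reference, a minimal sketch of the load-time configuration described in the list above (`compiledModelURL` is a placeholder, not the actual path from this post):

```swift
import CoreML

// Sketch of the model-loading setup described above; `compiledModelURL`
// is a placeholder for your compiled model's location.
let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine

if #available(iOS 17.0, macOS 14.0, *) {
    var hints = MLOptimizationHints()
    hints.reshapeFrequency = .frequent
    hints.specializationStrategy = .fastPrediction
    config.optimizationHints = hints
}

// Load once and reuse for every frame.
let model = try MLModel(contentsOf: compiledModelURL, configuration: config)
```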

Reproduction

To completely rule out UI or rendering overhead, I wrote a standalone Swift CLI script that isolates the AVFoundation and CoreML pipeline. The script clearly demonstrates the ~15 ms latency on live camera frames versus the ~1 ms latency on synthetic buffers.
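The hot path in such a benchmark is essentially the capture delegate below. This is a hedged sketch, not the attached script: the input name "image" and the preloaded `model` are assumptions for illustration.

```swift
import AVFoundation
import CoreML
import QuartzCore

// Hedged sketch of the per-frame path: run inference directly in the
// capture callback and time only the prediction call. The input name
// "image" and the preloaded `model` are illustrative assumptions.
final class FrameProcessor: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
    let model: MLModel
    init(model: MLModel) { self.model = model }

    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
        guard let input = try? MLDictionaryFeatureProvider(
            dictionary: ["image": MLFeatureValue(pixelBuffer: pixelBuffer)]) else { return }

        let start = CACurrentMediaTime()
        _ = try? model.prediction(from: input)
        let latencyMs = (CACurrentMediaTime() - start) * 1000
        print(String(format: "inference: %.2f ms", latencyMs))
    }
}
```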

(I have attached camera_coreml_benchmark.swift and the CoreML model (a very light low-light enhancement model) to this repo on GitHub: https://github.com/pzoltowski/apple-coreml-camera-latency-repro.)

My Question: Is this massive overhead expected behavior for AVFoundation + Core ML on live feeds, or is this a framework/runtime bug? If expected, what is the Apple-recommended pattern to bypass this camera-only inference slowdown?

One thing I found interesting: when running in debug mode, inference was faster (not as fast as in the performance benchmark, but faster than 16 ms). Also, if I ran some dummy calculation on a different DispatchQueue, the model seemed to get slightly faster. So maybe it's related to an ANE power-state issue (jitter / SoC wake), with the ANE going to sleep too quickly and taking a long time to wake up? Doing a dummy calculation on a background thread is probably not a real solution, though.

Thanks in advance for any insights!

Experiencing significant performance degradation when running CoreML models on live AVFoundation video feeds compared to offline inference is a common issue, though the exact cause can vary based on several factors. Given the details you've provided, here are some insights and potential solutions to consider:

Potential Causes and Solutions

ANE Power State Management:

Observation: Your suspicion about ANE (Apple Neural Engine) power state issues seems plausible, especially with the noted variability in debug mode and when introducing dummy computations.

Solution: The ANE may enter a low-power state when idle, causing latency spikes when reactivated for inference. To mitigate this, consider keeping the ANE active by scheduling low-intensity tasks periodically. However, avoid unnecessary computations that could impact overall performance.

Approach: Use DispatchSourceTimer to run a very light task (e.g., a simple matrix multiplication) at regular intervals to keep the ANE primed.

AVFoundation and CoreML Synchronization:

Observation: Even with optimized preprocessing and model loading, synchronization overhead between AVFoundation and CoreML could contribute to latency.
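The keep-alive timer mentioned above could look roughly like this. This is a workaround sketch, not an Apple-recommended pattern; the interval and workload size are arbitrary starting points to tune, not recommendations.

```swift
import Foundation

// Workaround sketch: a cheap repeating task meant to keep utilization
// non-zero so the system keeps the power state raised. Interval and
// workload size are arbitrary starting points; measure both.
let keepAliveQueue = DispatchQueue(label: "keepalive", qos: .utility)
let keepAliveTimer = DispatchSource.makeTimerSource(queue: keepAliveQueue)
keepAliveTimer.schedule(deadline: .now(), repeating: .milliseconds(50))
keepAliveTimer.setEventHandler {
    // CPU-only busywork; to keep the ANE itself warm, substitute a tiny
    // CoreML prediction here instead.
    var acc = 0.0
    for i in 0..<1_000 { acc += Double(i).squareRoot() }
    _ = acc
}
keepAliveTimer.resume()
```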

Solution: Minimize the overhead by ensuring that video frame capture and model inference are tightly coupled. Consider running inference inside the AVCaptureVideoDataOutputSampleBufferDelegate callback, so frames are processed as soon as they are available rather than waiting on buffer completion.

Thread Management:

Observation: You've already set the dispatch queue to qos: .userInteractive, but further thread management might improve performance.

Solution: Ensure that all video processing and inference run off the main thread on a dedicated serial DispatchQueue (AVCaptureVideoDataOutput requires a serial queue for its sample-buffer delegate) at qos: .userInteractive.

Model Optimization:

Observation: While you've applied INT8 quantization, further model optimizations might yield better results.

Solution: Experiment with model pruning, quantization-aware training, or using smaller model architectures to reduce inference time. Additionally, ensure that the model's input/output shapes and data types are perfectly aligned with the AVFoundation feed.

Frame Rate Considerations:

Observation: Running inference at both 30fps and 60fps shows similar latency penalties, suggesting that frame rate is not the bottleneck.

Solution: Consider adjusting the frame rate to balance between quality and performance. Lowering the frame rate slightly might reduce the load on the ANE without significantly impacting user experience.

Debugging and Profiling:

Observation: Debug mode inference is faster, indicating potential runtime overhead in release mode.

Solution: Continue profiling in both modes to identify specific bottlenecks. Use Instruments to analyze CPU and ANE usage, focusing on any unexpected delays or power state transitions.

Frame Dropping and Latency:

Observation: No frames are being dropped, but latency remains high.

Solution: Ensure that the camera feed and inference pipeline are not overwhelming the system's resources. Monitor memory and CPU usage to prevent contention that could lead to increased latency.

Apple-Recommended Patterns

Preloading and Reusing Resources: You've already implemented these, but ensure that resources like pixel buffers and models are efficiently managed and reused throughout the session.

Adaptive Inference: Consider implementing adaptive inference strategies, where the model's complexity or resolution is adjusted based on real-time performance metrics or scene complexity.

Background Task Scheduling: Use UIApplication.shared.beginBackgroundTask(withName:expirationHandler:) to extend the time available for processing frames if necessary, ensuring that the task completes before the app is suspended.

If none of these solutions resolve the issue, it may be worthwhile to reach out to Apple Developer Support, providing them with your detailed profiling data and reproduction steps. They may be able to offer specific guidance or identify potential bugs in the frameworks.

AVCapture runs its own ML models on the neural engine. The neural engine is a single resource shared across all Core ML work on the system, so your own Core ML inferences can be delayed if they execute at the same time as work submitted by other frameworks/processes. Reaction gestures are one example where AVCapture may use the neural engine. You might be able to configure your AV session to disable unwanted effects.
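As a quick check (hedged sketch; availability annotations are per current SDKs), you can inspect at runtime whether these system camera effects are active while your session runs:

```swift
import AVFoundation

// Class properties reporting whether system camera effects are active;
// if true, AVCapture may run its own models on the neural engine while
// your session is live.
if #available(iOS 17.0, macOS 14.0, *) {
    print("Reaction gestures:", AVCaptureDevice.reactionEffectGesturesEnabled)
}
if #available(iOS 14.5, macOS 12.3, *) {
    print("Center Stage:", AVCaptureDevice.isCenterStageEnabled)
}
```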

The Core ML template in Instruments is a great way to profile and directly see how AVFoundation models are interacting with your own usage of Core ML. If two Core ML inferences are competing with each other, you will see their blocks of work stacked on top of each other in the timeline. Your code example exhibits this symptom when profiled in Instruments.

Thanks. I will only clarify that on my MacBook I disabled Apple's camera effects (manually on macOS; in the iOS app I set AVCaptureDevice.reactionEffectGesturesEnabled to NO in the plist) and checked ANE usage in mactop — it was never more than 7% total usage. I believe I also configured the session to disable all extra functionality I didn't need, like metadata.

It's also not clear to me why, in this situation, running a dummy computation loop on a different thread improves latency. I just wanted to avoid such a hack in my app.

It's quite disappointing if nothing can be done.

I am not familiar with the available AVCapture options, but if you find a case where there is no way to disable features that you are not using please file a feedback report using Feedback Assistant. You should be able to use the Core ML Instruments template to see what other work is executing on the neural engine and identify what feature it may be related to and if you have successfully disabled it.

Neural engine utilization

As you've noticed, the neural engine is not overloaded in total utilization (only 7%). The issue happens because both your code and AVFoundation are trying to use the neural engine at the same time. If AVFoundation runs first, your Core ML work will need to wait for the earlier inference to finish before it can execute. The first step is to try to avoid the neural engine work from AVFoundation (or as I mentioned, file a feedback report if you find this is not possible). Alternatively you could try to pipeline or offset your work to avoid the contention (although this may add undesirable artificial latency).

Why the dummy loop is improving latency

The reason is likely related to the neural engine power state (not on/off/asleep but what frequency it is operating at). With such a low utilization of the neural engine, the system has little reason to increase the neural engine power state and assumes it can run it at lower power states for efficiency reasons. A dummy loop (especially on ANE, but on some systems even only CPU work) will increase utilization and may cause the system to raise the power state.

Audio workloads can use the Workgroup API to communicate workload deadlines, which the system can use to optimally schedule work at the most energy efficient power states to meet the deadline. Since this API is currently only available for audio workloads, I would recommend filing a feedback report that describes your use case and how the existing APIs may not be sufficient.
