Using highly optimized Metal Shading Language (MSL) code, I pushed the MacBook Air M2 to its performance limits with the deformable_attention_universal kernel. The results demonstrate both the efficiency of the code and the exceptional power of Apple Silicon.
The total computational workload exceeded 8.455 quadrillion FLOPs, equivalent to processing 8,455 trillion operations. On average, the code sustained a throughput of 85.37 TFLOPS, showcasing the chip’s remarkable ability to handle massive workloads. Peak instantaneous performance reached approximately 673.73 TFLOPS, reflecting near-optimal utilization of the GPU cores.
Despite this intensity, the cumulative GPU runtime remained under 100 seconds, highlighting the code’s efficiency and time optimization. The fastest iteration achieved a record processing time of only 0.051 ms, demonstrating minimal bottlenecks and excellent responsiveness.
Memory management was equally impressive: peak GPU memory usage never exceeded 2 MB, reflecting efficient use of the M2’s Unified Memory. This minimizes data transfer overhead and ensures smooth performance across repeated workloads.
Overall, these results confirm that a well-optimized Metal implementation can unlock the full potential of Apple Silicon, delivering exceptional computational density, processing speed, and memory efficiency. The MacBook Air M2, often considered an energy-efficient consumer laptop, is capable of handling highly intensive workloads at performance levels typically expected from much larger GPUs. This test validates both the robustness of the Metal code and the extraordinary capabilities of the M2 chip for high-performance computing tasks.
Explore the power of machine learning and Apple Intelligence within apps. Discuss integrating features, share best practices, and explore the possibilities for your app here.
Selecting any option will automatically load the page
Post
Replies
Boosts
Views
Activity
We are developing Apple AI for foreign markets and adapting it for iPhone models 17 and above.
When the system language and Siri language are not the same—for example, if the system is in English and Siri is in Chinese—it can cause a situation where Apple AI cannot be used. So, may I ask if there are any other reasons that could cause Apple AI to be unavailable within the app, even if it has been enabled?
I'm running MacOs 26 Beta 5. I noticed that I can no longer achieve 100% usage on the ANE as I could before with Apple Foundations on-device model. Has Apple activated some kind of throttling or power limiting of the ANE? I cannot get above 3w or 40% usage now since upgrading. I'm on the high power energy mode. I there an API rate limit being applied?
I kave a M4 Pro mini with 64 GB of memory.
Topic:
Machine Learning & AI
SubTopic:
Foundation Models
I have been working on a small CV program, which uses fine-tuned U2Netp model converted by coremltools 8.3.0 from PyTorch.
It works well on my iPhone (with iOS version 18.5) and my Macbook (with MacOS version 15.3.1). But it fails to load after I upgraded Macbook to MacOS version 15.5.
I have attached console log when loading this model.
Unable to load MPSGraphExecutable from path /Users/yongzhang/Library/Caches/swiftmetal/com.apple.e5rt.e5bundlecache/24F74/E051B28C6957815C140A86134D673B5C015E79A1460E9B54B8764F659FDCE645/16FA8CF2CDE66C0C427F4B51BBA82C38ACC44A514CCA396FD7B281AAC087AB2F.bundle/H14C.bundle/main/main_mps_graph/main_mps_graph.mpsgraphpackage @ GetMPSGraphExecutable
E5RT: Unable to load MPSGraphExecutable from path /Users/yongzhang/Library/Caches/swiftmetal/com.apple.e5rt.e5bundlecache/24F74/E051B28C6957815C140A86134D673B5C015E79A1460E9B54B8764F659FDCE645/16FA8CF2CDE66C0C427F4B51BBA82C38ACC44A514CCA396FD7B281AAC087AB2F.bundle/H14C.bundle/main/main_mps_graph/main_mps_graph.mpsgraphpackage (13)
Unable to load MPSGraphExecutable from path /Users/yongzhang/Library/Caches/swiftmetal/com.apple.e5rt.e5bundlecache/24F74/E051B28C6957815C140A86134D673B5C015E79A1460E9B54B8764F659FDCE645/16FA8CF2CDE66C0C427F4B51BBA82C38ACC44A514CCA396FD7B281AAC087AB2F.bundle/H14C.bundle/main/main_mps_graph/main_mps_graph.mpsgraphpackage @ GetMPSGraphExecutable
E5RT: Unable to load MPSGraphExecutable from path /Users/yongzhang/Library/Caches/swiftmetal/com.apple.e5rt.e5bundlecache/24F74/E051B28C6957815C140A86134D673B5C015E79A1460E9B54B8764F659FDCE645/16FA8CF2CDE66C0C427F4B51BBA82C38ACC44A514CCA396FD7B281AAC087AB2F.bundle/H14C.bundle/main/main_mps_graph/main_mps_graph.mpsgraphpackage (13)
Failure translating MIL->EIR network: Espresso exception: "Network translation error": MIL->EIR translation error at /Users/yongzhang/CLionProjects/ImageSimilarity/models/compiled/u2netp.mlmodelc/model.mil:1557:12: Parameter binding for axes does not exist.
[Espresso::handle_ex_plan] exception=Espresso exception: "Network translation error": MIL->EIR translation error at /Users/yongzhang/CLionProjects/ImageSimilarity/models/compiled/u2netp.mlmodelc/model.mil:1557:12: Parameter binding for axes does not exist. status=-14
Failed to build the model execution plan using a model architecture file '/Users/yongzhang/CLionProjects/ImageSimilarity/models/compiled/u2netp.mlmodelc/model.mil' with error code: -14.
Topic:
Machine Learning & AI
SubTopic:
Create ML
We are trying to implement a custom encryption scheme for our Core ML models. Our goal is to bundle encrypted models, decrypt them into memory at runtime, and instantiate the MLModel without the unencrypted model file ever touching the disk.
We have looked into the native apple encryption described here https://developer.apple.com/documentation/coreml/encrypting-a-model-in-your-app but it has limitations like not working on intel macs, without SIP, and doesn’t work loading from dylib.
It seems like most of the Core ML APIs require a file path, there is MLModelAsset APIs but I think they just write a modelc back to disk when compiling but can’t find any information confirming that (also concerned that this seems to be an older API, and means we need to compile at runtime).
I am aware that the native encryption will be much more secure but would like not to have the models in readable text on disk.
Does anyone know if this is possible or any alternatives to try to obfuscate the Core ML models, thanks
Hi,
I am new to developing on Apple’s platform yet I want to familiarize myself with Core ML and Core ML Tools. I was watching the WWDC24: Bring your machine learning and AI models to Apple Silicon video and was trying to follow along. After multiple attempts and much reading up on documentation, I am still unable to get a coherent script running that will convert the Mistral model that the host used and convert it to a valid Core ML model.
here is a pastebin to what i have currently:
https://pastebin.com/04cVjF1v
if you require the output as well please let me know
Hi Apple Engineers,
I am experiencing a potential memory management bug with CoreML on M1 Mac (32GB Unified Memory).
When processing long video files (approx. 12,000 frames) using a CoreML execution provider, the system often completes the 'Analysing' phase but fails to transition into 'Processing'. It simply exits silently or hits an import error (scipy).
However, if I split the same task into small 20-frame segments, it works perfectly at high speeds (~40 FPS). This suggests the hardware is capable, but there is an issue with memory fragmentation or resource cleanup during long-running CoreML sessions.
Is there a way to force a VRAM/Unified Memory flush via CLI, or is this a known limitation for large frame indexing?
I’m trying to follow Apple’s “WWDC24: Bring your machine learning and AI models to Apple Silicon” session to convert the Mistral-7B-Instruct-v0.2 model into a Core ML package, but I’ve run into a roadblock that I can’t seem to overcome. I’ve uploaded my full conversion script here for reference:
https://pastebin.com/T7Zchzfc
When I run the script, it progresses through tracing and MIL conversion but then fails at the backend_mlprogram stage with this error:
https://pastebin.com/fUdEzzKM
The core of the error is:
ValueError: Op "keyCache_tmp" (op_type: identity) Input x="keyCache" expects list, tensor, or scalar but got state[tensor[1,32,8,2048,128,fp16]]
I’ve registered my KV-cache buffers in a StatefulMistralWrapper subclass of nn.Module, matching the keyCache and valueCache state names in my ct.StateType definitions, but Core ML’s backend pass reports the state tensor as an invalid input. I’m using Core ML Tools 8.3.0 on Python 3.9.6, targeting iOS18, and forcing CPU conversion (MPS wasn’t available). Any pointers on how to satisfy the handle_unused_inputs pass or properly declare/cache state for GQA models in Core ML would be greatly appreciated!
Thanks in advance for your help,
Usman Khan
Topic:
Machine Learning & AI
SubTopic:
Core ML
Tags:
Metal
Metal Performance Shaders
Core ML
tensorflow-metal
Optimal Precision
• Current Precision: Mixed (Float32, int32)
• Optimal Precision: Not specified in the image, but typically involves using the most efficient data type for the model's operations to balance speed and memory usage without significant loss of accuracy.
Comparison:
• Mixed Precision: Utilizes both Float32 and int32 to optimize performance. Float32 provides high precision, while int32 reduces memory usage and increases computational speed.
• Optimal Precision: Aimed at achieving the best trade-off between performance and accuracy, potentially using other data types like Float16 (bfloat16) for even greater efficiency in certain hardware environments.
Operation Distribution
• Current Distribution:
• iOS18.mul: 168
• iOS18.transpose: 126
• iOS18.linear: 98
• iOS18.add: 97
• iOS18.sliceByIndex: 96
• iOS18.expandDims: 74
• iOS18.concat: 72
• iOS18.squeeze: 72
• iOS18.reshape: 67
• iOS18.layerNorm: 49
• iOS18.matmul: 48
• iOS18.gelu: 26
• iOS18.softmax: 24
• Split: 24
• conv: 1
• iOS18.conv: 1
Comparison:
• Operation Count: Indicates how frequently each operation is executed. High counts for operations like mul, transpose, and linear suggest these are computationally intensive parts of the model.
• Optimization Opportunities: Reducing the count of high-frequency operations or optimizing their execution can improve performance. This might involve pruning unnecessary operations, optimizing algorithms, or leveraging hardware acceleration.
General Recommendations
• Precision Tuning: Experiment with different precision levels to find the best balance for your specific hardware and accuracy requirements.
• Operation Optimization: Focus on optimizing the most frequent operations. Techniques include using more efficient algorithms, parallelizing computations, or utilizing specialized hardware like GPUs or TPUs.
• Benchmarking: Regularly benchmark the model to assess the impact of changes and ensure that optimizations lead to meaningful performance improvements.
By focusing on these areas, you can potentially enhance the efficiency and performance of your ML model.
Topic:
Machine Learning & AI
SubTopic:
Core ML
I have an app that uses a couple of mlmodels (word tagger and gazetteer) and I’m trying to encrypt them before publishing.
The models are part of a package. I understand that Xcode can’t automatically handle the encryption for a model in a package the way it can within a traditional app structure.
Given that, I’ve generated the Apple MLModel encryption key from Xcode and am encrypting via the command line with:
xcrun coremlcompiler compile Gazetteer.mlmodel GazetteerENC.mlmodelc --encrypt Gazetteerkey.mlmodelkey
In the package manifest, I’ve listed the encrypted models as .copy resources for my target and have verified the URL to that file is good.
When I try to load the encrypted .mlmodelc file (on a physical device) with the line:
gazetteer = try NLGazetteer(contentsOf: gazetteerURL!)
I get the error:
Failed to open file: /…/Scanner.bundle/GazetteerENC.mlmodelc/coremldata.bin. It is not a valid .mlmodelc file.
So my questions are:
Does the NLGazetteer class support encrypted MLModel files?
Given that my models are in a package, do I have the right general approach?
Thanks for any help or thoughts.
Topic:
Machine Learning & AI
SubTopic:
Core ML
Hi all! Nice to meet you.,
I am planning to build an iOS application that can:
Capture an image using the camera or select one from the gallery.
Remove the background and keep only the detected main object.
Add a border (outline) around the detected object’s shape.
Apply an animation along that border (e.g., moving light or glowing effect).
Include a transition animation when removing the background — for example, breaking the background into pieces as it disappears.
The app Capword has a similar feature for object isolation, and I’d like to build something like that.
Could you please provide any guidance, frameworks, or sample code related to:
Object segmentation and background removal in Swift (Vision or Core ML).
Applying custom borders and shape animations around detected objects.
Recognizing the object name (e.g., “person”, “cat”, “car”) after segmentation.
Thank you very much for your support.
Best regards,
SINN SOKLYHOR
Hi i'm curently crating a model to identify car plates (object detection) i use asitop to monitor my macbook pro and i see that only the cpu is used for the training and i wanted to know why
Does anyone know if ExecuTorch is officially supported or has been successfully used on visionOS? If so, are there any specific build instructions, example projects, or potential issues (like sandboxing or memory limitations) to be aware of when integrating it into an Xcode project for the Vision Pro?
While ExecuTorch has support for iOS, I can't find any official documentation or community examples specifically mentioning visionOS.
Thanks.
When I ran the following code on a physical iPhone device that supports Apple Intelligence, I encountered the following error log.
What does this internal error code mean?
Image generation failed with NSError in a different domain: Error Domain=ImagePlaygroundInternal.ImageGeneration.GenerationError Code=11 “(null)”, returning a generic error instead
let imageCreator = try await ImageCreator()
let style = imageCreator.availableStyles.first ?? .animation
let stream = imageCreator.images(for: [.text("cat")], style: style, limit: 1)
for try await result in stream { // error: ImagePlayground.ImageCreator.Error.creationFailed
_ = result.cgImage
}
Hello fellow developers,
I'm the founder of a FinTech startup, Cent Capital (https://cent.capital), where we are building an AI-powered financial co-pilot.
We're deeply exploring the Apple ecosystem to create a more proactive and ambient user experience. A core part of our vision is to use App Intents and the Shortcuts app to surface personalized financial insights without the user always needing to open our app. For example, suggesting a Shortcut like, "What's my spending in the 'Dining Out' category this month?" or having an App Intent proactively surface an insight like, "Your 'Subscriptions' budget is almost full."
My question for the community is about the architectural and user experience best practices for this.
How are you thinking about the balance between providing rich, actionable insights via Intents without being overly intrusive or "spammy" to the user?
What are the best practices for designing the data model that backs these App Intents for a complex domain like personal finance?
Are there specific performance or privacy considerations we should be aware of when surfacing potentially sensitive financial data through these system-level integrations?
We believe this is the future of FinTech apps on iOS and would love to hear how other developers are thinking about this challenge.
Thanks for your insights!
I am follwing this tutorial:
https://apple.github.io/coremltools/docs-guides/source/convert-a-torchvision-model-from-pytorch.html
I have obtained simialr result using the python code.
However when I view it in Xcode, the preview prediction percentage confidence is way off I suspect it is due the the output of the model, which is in percentage already and in Xcode it multiply 100 again leading to this result. Please give me any feedback to fix this, thank you.
Hi,
I'm not sure whether this is the appropriate forum for this topic. I just followed a link from the JAX Metal plugin page https://developer.apple.com/metal/jax/
I'm writing a Python app with JAX, and recent JAX versions fail on Metal. E.g. v0.8.2
I have to downgrade JAX pretty hard to make it work:
pip install jax==0.4.35 jaxlib==0.4.35 jax-metal==0.1.1
Can we get an updated release of jax-metal that would fix this issue?
Here is the error I get with JAX v0.8.2:
WARNING:2025-12-26 09:55:28,117:jax._src.xla_bridge:881: Platform 'METAL' is experimental and not all JAX functionality may be correctly supported!
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
W0000 00:00:1766771728.118004 207582 mps_client.cc:510] WARNING: JAX Apple GPU support is experimental and not all JAX functionality is correctly supported!
Metal device set to: Apple M3 Max
systemMemory: 36.00 GB
maxCacheSize: 13.50 GB
I0000 00:00:1766771728.129886 207582 service.cc:145] XLA service 0x600001fad300 initialized for platform METAL (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1766771728.129893 207582 service.cc:153] StreamExecutor device (0): Metal, <undefined>
I0000 00:00:1766771728.130856 207582 mps_client.cc:406] Using Simple allocator.
I0000 00:00:1766771728.130864 207582 mps_client.cc:384] XLA backend will use up to 28990554112 bytes on device 0 for SimpleAllocator.
Traceback (most recent call last):
File "<string>", line 1, in <module>
import jax; print(jax.numpy.arange(10))
~~~~~~~~~~~~~~~~^^^^
File "/Users/florin/git/FlorinAndrei/star-cluster-simulator/.venv/lib/python3.13/site-packages/jax/_src/numpy/lax_numpy.py", line 5951, in arange
return _arange(start, stop=stop, step=step, dtype=dtype,
out_sharding=sharding)
File "/Users/florin/git/FlorinAndrei/star-cluster-simulator/.venv/lib/python3.13/site-packages/jax/_src/numpy/lax_numpy.py", line 6012, in _arange
return lax.broadcasted_iota(dtype, (size,), 0, out_sharding=out_sharding)
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/florin/git/FlorinAndrei/star-cluster-simulator/.venv/lib/python3.13/site-packages/jax/_src/lax/lax.py", line 3415, in broadcasted_iota
return iota_p.bind(dtype=dtype, shape=shape,
~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
dimension=dimension, sharding=out_sharding)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/florin/git/FlorinAndrei/star-cluster-simulator/.venv/lib/python3.13/site-packages/jax/_src/core.py", line 633, in bind
return self._true_bind(*args, **params)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/Users/florin/git/FlorinAndrei/star-cluster-simulator/.venv/lib/python3.13/site-packages/jax/_src/core.py", line 649, in _true_bind
return self.bind_with_trace(prev_trace, args, params)
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/florin/git/FlorinAndrei/star-cluster-simulator/.venv/lib/python3.13/site-packages/jax/_src/core.py", line 661, in bind_with_trace
return trace.process_primitive(self, args, params)
~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
File "/Users/florin/git/FlorinAndrei/star-cluster-simulator/.venv/lib/python3.13/site-packages/jax/_src/core.py", line 1210, in process_primitive
return primitive.impl(*args, **params)
~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/Users/florin/git/FlorinAndrei/star-cluster-simulator/.venv/lib/python3.13/site-packages/jax/_src/dispatch.py", line 91, in apply_primitive
outs = fun(*args)
jax.errors.JaxRuntimeError: UNKNOWN: -:0:0: error: unknown attribute code: 22
-:0:0: note: in bytecode version 6 produced by: StableHLO_v1.13.0
--------------------
For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.
I0000 00:00:1766771728.149951 207582 mps_client.h:209] MetalClient destroyed.
I’m considering creating an ILMessageFilterExtension using a mini LLM/SLM to detect fraud and I’ve read it has strict memory limits yet I can’t find it in the documentation. What’s the set limit or any other constraints impacting the feasibility of running 100-500mb model?
Topic:
Machine Learning & AI
SubTopic:
Core ML
Looking for help with or to help with, due to the pending document enhancement, the Vibe Coders edition of cml editor.
Also for more information on how to use the .mlkey
whether or not my model is suppose to say IOs18 when I am planning to use it on Mac
Apple Intelligence seems to think coreML is for iOS but are the capabilities extended when running NPU on the book?
How to use this graph.
coming in hot sorry.
btw. there are 100s of feedback and crash reports sent in form me for additional info?
I attached a image that might help with updating Tags
Topic:
Machine Learning & AI
SubTopic:
Core ML
In this online session, you can code along with us as we build generative AI features into a sample app live in Xcode. We'll guide you through implementing core features like basic text generation, as well as advanced topics like guided generation for structured data output, streaming responses for dynamic UI updates, and tool calling to retrieve data or take an action.
Check out these resources to get started:
Download the project files: https://developer.apple.com/events/re...
Explore the code along guide: https://developer.apple.com/events/re...
Join the live Q&A: https://developer.apple.com/videos/pl...
Agenda – All times PDT
10 a.m.: Welcome and Xcode setup
10:15 a.m.: Framework basics, guided generation, and building prompts
11 a.m.: Break
11:10 a.m.: UI streaming, tool calling, and performance optimization
11:50 a.m.: Wrap up
All are welcome to attend the session. To actively code along, you'll need a Mac with Apple silicon that supports Apple Intelligence running the latest release of macOS Tahoe 26 and Xcode 26.
If you have questions after the code along concludes please share a post here in the forums and engage with the community.
Topic:
Machine Learning & AI
SubTopic:
Foundation Models