Chapter 64: The Edge Compute Bottleneck

Цей контент ще не доступний вашою мовою.

Cast of characters

Name	Lifespan	Role
Andrew G. Howard et al.	—	Lead authors of MobileNets (2017); efficient mobile-vision model lineage before LLMs
Apple hardware/software teams	—	Institutional actor for A11 Neural Engine, Face ID, Apple Intelligence, AFM-on-device, Private Cloud Compute
Google Pixel / Android / Gemini Nano teams	—	Institutional actor for Gemini Nano on Pixel 8 Pro, AICore, and the on-device-vs-cloud capacity caveat
Qualcomm Snapdragon team	—	Institutional actor for the Snapdragon 8 Gen 3 product brief and phone-NPU framing
Keivan Alizadeh et al.	—	Authors of “LLM in a flash” (Apple, 2023); DRAM/flash hierarchy for models exceeding memory
Zechun Liu et al.	—	Authors of MobileLLM (2024); sub-billion-parameter language models for on-device tasks

Timeline (2017–2024)

timeline
    title Chapter 64 — The Edge Compute Bottleneck
    Apr 2017 : MobileNets frames neural-network architecture as a mobile and embedded deployment problem
    Sep 2017 : Apple announces iPhone X with A11 Bionic Neural Engine and on-device Face ID processing
    Oct 2023 : Qualcomm Snapdragon 8 Gen 3 product brief markets phone silicon for on-device multimodal generative AI up to 10B parameters
    Dec 2023 : Google ships Gemini Nano on Pixel 8 Pro while Video Boost shows cloud fallback
    Dec 2023 : Apple's "LLM in a flash" paper studies edge memory limits
    Feb 2024 : MobileLLM develops sub-billion-parameter language models for mobile use cases (June 2024 revision follows)
    Jun 2024 : Apple Intelligence introduced — ~3B on-device language model plus Private Cloud Compute for larger requests
    Jul 2024 : Apple publishes the Apple Intelligence Foundation Language Models report (AFM-on-device architecture, quantization, deployment)
    Oct 2024 : Android opens experimental Gemini Nano access through AICore; on-device models framed as smaller and less generalized than cloud equivalents

Plain-words glossary

Neural Engine / NPU — A piece of dedicated silicon on a phone for running neural-network inference efficiently. Apple’s A11 Neural Engine (2017) and Qualcomm’s Hexagon NPU on the Snapdragon 8 Gen 3 are the two product-era examples this chapter uses; both let a phone run ML workloads without leaning on the general-purpose CPU or GPU.

Depthwise separable convolution — the efficient convolution pattern MobileNets used to make mobile vision models cheaper to run.

Grouped-query attention (GQA) — An attention variant that shares key/value projections across groups of query heads, reducing the size of the KV cache the model has to hold during inference. Apple’s AFM-on-device and MobileLLM both use it specifically because cache memory is one of the tightest constraints on a phone.

Quantization (and bits-per-weight) — Storing model weights at lower numerical precision than full 16- or 32-bit floats to shrink memory and energy use. Apple’s AFM paper discusses quality around 3.5 bits per weight and identifies 3.7 bpw as the production deployment choice.

KV cache — The intermediate keys and values the Transformer accumulates while generating tokens. It grows with context length and is one of the dominant memory costs of running an LLM, on a phone or in a datacenter.

AICore (Android) — a system service for exposing Gemini Nano capabilities to Android apps on supported devices.

Private Cloud Compute (Apple) — Apple’s server-side fallback path for Apple Intelligence: when a request is judged larger than the on-device model can handle, it routes to Apple-operated server hardware. The architecture’s role in the chapter is to mark the boundary where local processing stops, not to certify the privacy model.

The obvious place to run a frontier model was the datacenter. It had power, cooling, accelerators, memory, network links, and operators who could tune the serving stack. The tempting place to run a product was the device in the user’s hand. It had immediacy, privacy advantages, offline access, and no round trip to a remote server.

The gap between those two places became the edge compute bottleneck.

AI did not move from the cloud to the phone in one clean migration. It split. Some work ran locally. Some work routed to server-side models. Some work became impossible, delayed, compressed, or redesigned because the device could not support the same resource envelope as a datacenter. The edge was not a smaller cloud. It was a different machine with different physics.

Battery is a constraint. Heat is a constraint. DRAM is a constraint. Flash bandwidth is a constraint. Network availability is a constraint. User patience is a constraint. Privacy is a constraint that becomes technical: where data is processed determines what has to move. If Chapter 63 asked how to serve many users cheaply in the datacenter, this chapter asks what happens when the user expects intelligence to live in a pocket-sized computer.

The device also has to share. A datacenter accelerator can be reserved for inference work. A phone is simultaneously a camera, modem, display driver, wallet, browser, messenger, and sensor hub. Its AI workload competes with everything else the user expects the device to do. A model that looks small in a paper can still be large inside a mobile operating system that has to preserve battery, stay cool, and keep the interface responsive.

The constraint is continuous, not episodic: every local AI feature has to coexist with the rest of the device each time it runs, including during ordinary background activity, low battery, active camera use, modem load, and imperfect thermal conditions.

This is also why edge AI has a strong product-management dimension. A local feature has to be scoped so the device can deliver it consistently. If the feature needs a long context, heavy media processing, or a general-purpose cloud-scale model, the honest design may be hybrid from the beginning.

That makes the edge a discipline of negative space. It is defined as much by what cannot be run as by what can. The best local feature is often one with a narrow task, short context, limited output, and clear privacy or latency benefit. The worst local feature tries to pretend that a phone is a small datacenter. The history of edge AI is the history of learning that difference.

The answer began before generative AI.

MobileNets, introduced by Howard and collaborators in 2017, were explicit about the mobile bargain. They were efficient convolutional neural networks for mobile and embedded vision applications. They used depthwise separable convolutions and model-width and input-resolution multipliers so designers could trade accuracy against latency and size. The point was not to make a cloud vision model smaller by accident. The point was to design the model around the device.

That pattern matters. A mobile model is not judged only by accuracy on an abstract task. It is judged by whether it fits into a latency budget, a memory budget, and a power budget. A model that is slightly more accurate but too slow or too large may be worse for the product. MobileNets made this visible by providing knobs. Width and resolution could be adjusted. Compute could be reduced. The architecture itself carried the edge constraint.

Depthwise separable convolutions were the architectural lesson. Instead of applying a full convolution that mixes spatial filtering and channel mixing in one expensive operation, MobileNets split the work into a depthwise convolution and a pointwise convolution. That reduced computation while preserving a useful vision model shape. The technique matters here because it shows a design mindset: edge AI is often not about taking the biggest known model and squeezing it afterward. It is about changing the architecture so the constraint is present from the beginning.

The width and resolution multipliers made that mindset practical. A product team could reduce the number of channels or the input resolution to fit a smaller or faster version of the model. The result was not one model but a family of trade-offs. This is the same pattern that would later reappear in language models as smaller parameter counts, quantization choices, grouped-query attention, and task-specific adapters. The names changed, but the bargain stayed familiar: accept a narrower envelope to make local execution possible.

The iPhone X showed the hardware side of the same bargain. Apple’s 2017 announcement introduced the A11 Bionic with a neural engine for real-time processing and tied it to Face ID and Animoji. The company said Face ID processing was done on device and not in the cloud. That was an early version of the product claim that would return in the generative era: local AI is not only about speed. It is also about where sensitive data goes.

Face ID was not an LLM. It was still historically important because it made machine learning hardware part of a mass-market device story. The phone was no longer only a client asking a server to think. It contained specialized silicon for a narrow, high-frequency, latency-sensitive AI task. The data was intimate. The response had to be fast. The computation had to fit inside the device’s thermal and power envelope.

The privacy claim was also architectural. If facial processing occurs on device, the feature can avoid sending that sensitive input to a remote server for the core recognition step. That does not make every device feature private by default. It does show how computation location can become part of the trust model. The user does not merely ask whether the model works. The user asks where the data goes while the model works.

This is the pre-generative edge pattern: redesign the model, add dedicated hardware, keep the task narrow, and make the user benefit immediate. The later language-model story did not replace that pattern. It inherited it and made the constraints more severe.

Generative models raised the ambition. Users no longer wanted the phone merely to unlock, classify, or beautify. They wanted it to summarize recordings, answer messages, rewrite text, reason about images, and act as a local assistant. That turned “on-device” from a feature detail into a strategic claim.

Google’s Gemini Nano was framed that way. In the December 2023 Pixel update, Google described Gemini Nano as its most efficient model for on-device tasks, running on Pixel 8 Pro for Recorder Summarize and Smart Reply. The same product framing emphasized that on-device processing could help keep sensitive data on the phone and could work without a network connection. Android’s Gemini Nano documentation then placed the model inside AICore, a system service that manages model updates, safety, access to hardware accelerators, and device-specific inference behavior.

That system-service detail matters. On-device AI is not only a model file embedded in an app. If many apps want local generative features, the operating system needs to manage updates, safety boundaries, accelerator access, and compatibility across hardware. AICore made the edge model part of the platform. The developer-facing promise was generative AI without a server call, with low-latency and privacy benefits where the feature actually runs locally. The caveat was also present: inference speed depends on device hardware.

This is a different deployment problem from a web API. A cloud provider can update a model behind an endpoint and keep the hardware fleet relatively hidden. An on-device model has to live with device fragmentation, app permissions, storage limits, hardware accelerators, and operating-system release cycles. AICore’s role in updates and accelerator access reflects that problem. The model is not floating in isolation. It is part of the mobile platform.

It also changes the developer contract. If an app can call a local generative model through a system service, developers may get lower latency and fewer server-side costs for suitable tasks. But they also inherit device variability. A feature that behaves well on a high-end phone may be unavailable, slower, or different on another device. The edge makes capability uneven across the installed base.

The Android developer blog made the trade-off more explicit. Gemini Nano was described as smaller and less generalized than cloud equivalents. That sentence is the honest center of edge AI. Local models can be useful precisely because they are narrower. They are optimized for device tasks. They avoid some network calls. They can respond under local latency and privacy constraints. But they are not simply frontier cloud models copied into a phone.

This is where the small model stops being a toy. Small can mean deliberate. A model designed for summarizing a recording, drafting a short reply, extracting information, or handling an app-specific action does not need the full open-ended breadth of a frontier datacenter model. It needs enough capability for the task, predictable behavior, small memory footprint, efficient inference, and integration with the device.

MobileLLM made that research program explicit. Liu and collaborators focused on language models under one billion parameters for mobile deployment. The work emphasized architecture choices such as deep and thin networks, embedding sharing, grouped-query attention, compatibility with W8A8 quantization, and evaluation on chat and API-calling tasks. The point was not to win a frontier-scale contest. It was to make small language models more useful for common on-device workloads.

The “deep and thin” idea is useful for readers because it breaks the assumption that small means shallow or crude. A sub-billion-parameter model can be shaped carefully. It can allocate capacity across layers, share embeddings, reduce cache pressure, and target the kinds of short, structured tasks that mobile apps need. API calling is especially revealing: an edge model may not need to write a long essay. It may need to select an action, fill a form, summarize a note, or invoke an app function.

Qualcomm’s Snapdragon 8 Gen 3 product brief showed the silicon vendor version of the same race. The brief claimed support for multimodal generative AI models, including LLMs, language-vision models, and speech recognition, up to 10 billion parameters solely on device; it also claimed up to 20 tokens per second for a 7B Llama 2 setup, plus faster and more efficient Hexagon NPU performance. These are vendor product claims, not independent benchmarks. Their historical role is still clear: phone chips were now marketed through generative AI capacity.

The Qualcomm example is useful because it shows how quickly the language of AI infrastructure entered consumer silicon. Parameter counts, tokens per second, multimodal models, and performance per watt were no longer only datacenter concerns. They became phone-platform claims. But the chapter has to keep the claim in its proper category. A product brief can show what a vendor wanted developers and manufacturers to believe about a platform; it cannot by itself prove broad real-world performance across apps, thermals, model choices, and user behavior.

The result was a three-way negotiation among model, platform, and chip. The model had to be small enough or compressed enough. The operating system had to provide a safe and updateable way to expose it. The chip had to supply specialized acceleration without draining the battery or overheating the device. Edge AI became less about “can a model run?” and more about “can this feature run repeatedly, locally, and acceptably for real users?”

The word “repeatedly” is doing work. A demo can run once under ideal conditions. A phone feature may run every day, under low battery, on cellular, while other apps are open, after the device has warmed up, and across years of OS updates. The edge bottleneck is therefore not only peak performance. It is sustained, predictable performance under ordinary device life.

That ordinary-life framing also explains why the edge is fragmented. One flagship phone may have a capable NPU, enough memory, and current OS support. Another device may lack the accelerator path or storage budget. A feature that is “on-device” in one product tier may be cloud-routed, unavailable, or degraded in another. Edge AI therefore moves the compatibility problem into AI itself. The same application may need multiple model sizes, fallbacks, and capability checks.

Apple Intelligence made the same split visible from the other side of the ecosystem. Apple’s June 2024 machine-learning research page described a roughly 3-billion-parameter on-device language model and a larger server-based model available through Private Cloud Compute. The company did not claim that the phone would do everything. It described a tiered system: some requests on device, larger needs routed to private cloud infrastructure.

The Apple Foundation Models paper gave the technical shape of the on-device tier. AFM-on-device was about 3B parameters. It used grouped-query attention to reduce the KV-cache footprint. It was pruned and distilled from larger models. It was compressed aggressively for deployment; the paper discussed quality around about 3.5 bits per weight and identified 3.7 bits per weight as the production deployment choice. It also discussed latency and power as deployment concerns, not afterthoughts.

This is the edge lesson in miniature. A 3B model is not chosen because the number is fashionable. It is chosen because the device has limits. Grouped-query attention is not an academic flourish here; it reduces cache memory pressure. Pruning and distillation are not merely training tricks; they transfer capability into a smaller envelope. Quantization is not only a compression statistic; it changes whether the model fits and how much energy inference consumes.

Distillation also carries a product philosophy. The larger model can serve as a teacher, but the deployed local model has to become a different artifact. It is smaller, more specialized, and optimized for the device path. Pruning removes capacity that the local use case can afford to lose. Quantization reduces the number of bits needed to represent weights. LoRA-style adaptation can specialize behavior without retraining everything. Each technique narrows or reshapes the model so it can live closer to the user.

The Apple split also helps avoid a common misconception. On-device AI is not a binary badge. It is one tier in a system. A request that fits the local model may stay on the device. A request that needs a larger model may route elsewhere. The product promise depends on deciding this boundary well. If too much routes to the cloud, the local story becomes thin. If too much is forced onto the device, quality, latency, or battery life suffers.

The user may see a writing tool, a summarizer, or an assistant action. Underneath is a stack of compromises: model scale, attention design, distillation, quantization, accelerator support, memory pressure, battery drain, and fallbacks. The edge model is a product of constraint.

Memory is the hardest constraint to hide.

DRAM on a phone is limited and shared with the rest of the system. Flash storage is larger but much slower and has different bandwidth and energy characteristics. A model’s weights, activations, and KV cache have to fit into a moving budget. If a model is too large for available DRAM, the system can try to stream pieces from flash, but now bandwidth and access patterns become central.

“Available” is the key word. A device may have a headline memory number, but the AI feature cannot assume it owns all of it. The operating system, apps, camera pipelines, browser tabs, and graphics workloads all compete for memory. The model has to coexist. That makes edge memory more adversarial than a simple spec sheet suggests.

“LLM in a flash” made this bottleneck explicit. Alizadeh and collaborators described running language models that exceed available DRAM by storing parameters in flash and loading them on demand. The paper reported models up to twice DRAM size and speedups over naive loading in its evaluated CPU and GPU settings. The details are technical, but the historical point is simple: on the edge, storage hierarchy becomes part of model execution.

This is different from the datacenter version of the memory problem. Datacenter inference also cares deeply about memory, as Chapter 63 showed through KV cache and serving throughput. But an edge device has tighter thermal, power, and capacity limits, and it is doing other work for the user at the same time. The phone cannot simply dedicate a rack of accelerators to one assistant session. It has to remain a phone.

LLM-in-a-flash also shows why “the model fits on storage” is not the same as “the model runs well.” Flash can hold more than DRAM, but the model has to access weights in patterns that avoid constant slow reads. The paper’s windowing and bundling ideas were attempts to make loading from flash less wasteful. The bottleneck was not just the number of parameters. It was where those parameters lived and how often they had to move.

The KV cache returns here too. A local assistant that handles longer prompts or longer outputs needs memory for the ongoing context. Grouped-query attention can reduce that footprint. Quantization can shrink weights. Distillation and pruning can shrink the model. But every technique is a trade-off. Smaller models may be less general. Lower precision may affect quality if pushed too far. More aggressive memory tricks may increase complexity or latency.

That is why long context is especially difficult on the edge. A user may expect a phone assistant to remember a thread, inspect a document, summarize a recording, and call an app. Each extra token or intermediate result has to live somewhere while the model works. Datacenter systems can buy more high-bandwidth memory and spread load across accelerators. A device has far less room to maneuver. Edge AI often has to summarize, window, truncate, or route rather than simply expand context.

This is why the edge model is not simply a moral victory over cloud inference. Local processing can reduce server calls and data movement for features that actually run locally. It can improve offline availability. It can lower round-trip latency. It can keep some sensitive inputs closer to the user. But it cannot erase the resource envelope. The device remains bounded.

The cloud therefore comes back.

Google’s Pixel Video Boost was a clear example. In the same Pixel feature-drop context that introduced Gemini Nano on Pixel 8 Pro, Google said Video Boost uploaded videos to the cloud for computational photography processing. That is not a contradiction. It is the hybrid reality. Some features fit locally. Some do not. Video is heavy, and cloud processing can offer capacity the phone cannot.

Video is an especially clean boundary because it combines large data, temporal structure, and heavy processing. A still image already stresses local models. A video adds frame sequences, motion, stabilization, lighting, and often audio. A phone can capture the media, preview it, and perform some local computation, but ambitious enhancement may still exceed the local budget. Pixel Video Boost shows that the edge can be the capture point while the cloud remains the heavy processor.

Apple made the same point through Private Cloud Compute. Its newsroom language said on-device processing was a cornerstone, but larger-than-pocket models would route to Private Cloud Compute, and requests would be analyzed to determine whether they could be processed on device. The phrase “larger-than-pocket” captures the boundary: some intelligence can be local, but not all of it can fit into the device.

The honest architecture is therefore a router. The product has to decide what runs on device, what runs in private or cloud compute, what waits for network availability, what degrades gracefully, and what should not be offered. That router may be visible to the user or hidden behind the assistant. Either way, it is a core design decision.

Routing also creates user-experience questions. Should the product tell the user when a request leaves the device? Should it offer a local-only mode with weaker capability? Should it delay a task until network returns, or fail fast? Should it send only a transformed representation rather than the raw input? Those questions are not answered by model architecture alone. They sit at the boundary of product design, privacy policy, and infrastructure.

The router also has to reason about trust. A local path may be preferable for sensitive data, but it may have weaker capability. A cloud path may give a better answer, but it changes the data boundary and depends on network and server availability. A refusal may be better than either when the local model is too weak and the remote route is inappropriate. Edge AI therefore forces a more nuanced product decision than “local good, cloud bad.” The useful question is which path is justified for this task.

Privacy becomes more complicated in this hybrid world. On-device processing can reduce data movement when it is truly local. But a feature that routes to a server has a different privacy boundary. First-party claims about private cloud architecture matter as product design and trust signals, but they do not make every phone AI feature local. The safer historical claim is narrower: edge AI made computation location part of the product promise.

The same is true of latency. Local inference avoids network round trips, but it may be slower if the local model is small, memory-starved, or running under thermal limits. Cloud inference may be more capable but adds network dependence and server scheduling. Neither side wins universally. The product architecture has to choose.

Offline use adds another dimension. A local model can still help when the network is weak, expensive, or unavailable, but only for tasks the local model can handle. That makes graceful degradation part of the architecture. The assistant may summarize a recording locally, defer a heavier media operation, or offer a shorter answer rather than pretending all capabilities are always present.

This is why edge AI became a design discipline rather than a deployment target. It forced teams to decide the model size, quantization, accelerator path, memory strategy, operating-system API, privacy boundary, cloud fallback, and user experience together. The device is not merely the endpoint. It is part of the model’s operating conditions.

The discipline also includes saying no. A task may be too broad for the local model, too private for a remote route, too slow for interactive use, or too battery-intensive to run repeatedly. The edge does not merely ask engineers to optimize harder. It asks product teams to define the task honestly enough that optimization has a chance.

The edge bottleneck also explains why open weights and small models became strategically important. A deployable model has different value from a model that exists only behind a remote API. If a model can be tuned, compressed, quantized, and shipped into constrained environments, it opens different product paths. That does not settle governance or licensing questions, which belong to the next chapter. It explains why the ability to run models locally became a serious axis of competition.

At the same time, the edge does not eliminate datacenters. The more users expect from AI, the more often products will face tasks too heavy for the device. Cloud fallback remains necessary for larger models, heavy video processing, long contexts, and workloads that exceed local memory or power limits. Edge AI changes the shape of datacenter demand; it does not make it disappear.

By the mid-2020s, the question was no longer whether AI could run on a phone. It already could, in narrow and increasingly useful ways. The harder question was what should run there. The phone learned to see before it learned to summarize. It gained neural silicon before it gained local language models. It gained small models before it gained full autonomy.

That sequence should make the history feel less sudden. The generative edge did not appear from nowhere in 2024. It grew from efficient mobile vision, biometric hardware, dedicated neural accelerators, operating-system model services, small-model research, quantization, and memory-hierarchy work. What changed was the ambition of the interface. The same device that once ran a narrow classifier was now expected to host a local assistant.

The edge compute bottleneck is the name for that sequence of compromises. Intelligence can move closer to the user, but only by becoming smaller, more specialized, more compressed, more carefully scheduled, and more honest about when the cloud must take over.