Inference is rapidly emerging as the next major frontier in artificial intelligence (AI). Historically, AI development and deployment have focused overwhelmingly on training, with approximately 80% of compute resources dedicated to it and only 20% to inference.
That balance is shifting fast. Within the next two years, the ratio is expected to reverse to 80% of AI compute devoted to inference and just 20% to training. This transition is opening a massive market opportunity with staggering revenue potential.
Inference has a fundamentally different profile: it demands low latency, greater energy efficiency, and predictable real-time responsiveness. Running it on training-optimized hardware instead brings excessive power consumption, underutilized compute, and inflated costs.
When deployed for inference, training-optimized computing resources drive the cost per query to one or even two orders of magnitude above the $0.002-per-query benchmark established by a 2023 McKinsey analysis, which was based on Google’s 2022 search activity, estimated at an average of 100,000 queries per second.
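To put that benchmark in perspective, here is a back-of-the-envelope Python sketch that uses only the figures cited above to show what a 10x or 100x cost gap means at Google-like scale:

```python
# Back-of-the-envelope math using the figures cited above: the $0.002/query
# McKinsey benchmark and an average of 100,000 queries per second.
COST_PER_QUERY = 0.002        # USD per query (McKinsey 2023 benchmark)
QUERIES_PER_SECOND = 100_000  # estimated average Google search volume, 2022

queries_per_year = QUERIES_PER_SECOND * 60 * 60 * 24 * 365
annual_cost = queries_per_year * COST_PER_QUERY
print(f"Annual cost at the benchmark: ${annual_cost:,.0f}")   # ~$6.3 billion

# Training-optimized hardware at 10x to 100x the benchmark cost per query:
for multiplier in (10, 100):
    print(f"At {multiplier}x the benchmark: ${annual_cost * multiplier:,.0f} per year")
```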
Today, the market is dominated by a single player whose quarterly results reflect its stronghold. While a competitor has made some inroads and is performing respectably, it has yet to gain meaningful market share.
One reason is architectural similarity; by taking a similar approach to the main player, rather than offering a differentiated, inference-optimized alternative, the competitor faces the same limitations. To lead in the inference era, a fundamentally new processor architecture is required. The most effective approach is to build dedicated, inference-optimized infrastructure, an architecture specifically tailored to the operational realities of processing generative AI models like large language models (LLMs).
This means rethinking everything from compute units and data movement to compiler design and LLM-driven architectures. By focusing on inference-first design, it’s possible to achieve significant gains in performance per watt, cost per query, time-to-first-token, output tokens per second, and overall scalability, especially for edge and real-time applications where responsiveness is critical.
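As a concrete reference for two of the metrics named above, the short Python sketch below shows how time-to-first-token and output tokens per second are typically measured; the streaming generator is a stand-in, not a real model or any vendor's API.

```python
import time

def fake_llm_stream(prompt: str, n_tokens: int = 64):
    """Stand-in for a streaming LLM endpoint; yields one token at a time."""
    for i in range(n_tokens):
        time.sleep(0.01)  # placeholder per-token latency, not a real model
        yield f"tok{i}"

def measure_inference_metrics(prompt: str):
    start = time.perf_counter()
    first_token_time = None
    n_tokens = 0
    for _ in fake_llm_stream(prompt):
        if first_token_time is None:
            first_token_time = time.perf_counter()  # time-to-first-token mark
        n_tokens += 1
    end = time.perf_counter()
    ttft = first_token_time - start
    tokens_per_second = n_tokens / (end - start)
    return ttft, tokens_per_second

ttft, tps = measure_inference_metrics("What is AI inference?")
print(f"time-to-first-token: {ttft * 1000:.1f} ms, output tokens/s: {tps:.1f}")
```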
This is where the next wave of innovation lies—not in scaling training further, but in making inference practical, sustainable, and ubiquitous.
The inference trinity
AI inference hinges on three critical pillars: low latency, high throughput and constrained power consumption, each essential for scalable, real-world deployment.
First, low latency is paramount. Unlike training, where latency is relatively inconsequential—a job taking an extra day or costing an additional million dollars is still acceptable as long as the model is successfully trained—inference operates under entirely different constraints.
Inference must happen in real time or near real time, with extremely low latency per query. Whether it’s powering a voice assistant, an autonomous vehicle or a recommendation engine, the user experience and system effectiveness hinge on sub-millisecond response times. The lower the latency, the more responsive and viable the application.
Second, high throughput at low cost is essential. AI workloads involve processing massive volumes of data, often in parallel. To support real-world usage—especially for generative AI and LLMs—AI accelerators must deliver high throughput per query while maintaining cost-efficiency.
Vendor-specified throughput often falls short of peak targets when processing AI workloads, a consequence of low-efficiency architectures such as GPUs, and the shortfall matters all the more now that the economics of inference are under intense scrutiny. These are high-stakes battles, where cost per query is not just a technical metric; it is a competitive differentiator.
Third, power efficiency shapes everything. Inference performance cannot come at the expense of runaway power consumption. This is not only a sustainability concern but also a fundamental limitation in data center design. Lower-power devices reduce the energy required for compute, and they ease the burden on the supporting infrastructure—particularly cooling, which is a major operational cost.
The trade-off can be viewed from the following two perspectives:
- A new inference device that delivers the same performance at half the energy consumption can dramatically reduce a data center’s total power draw.
- Alternatively, maintaining the same power envelope while doubling compute efficiency effectively doubles the data center’s performance capacity.
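A minimal numerical sketch of these two perspectives, using an assumed 1 MW accelerator power budget and illustrative per-device figures consistent with the numbers cited later in this article (roughly 1 kW for a training-class device versus 500 W for an inference-optimized one):

```python
# Illustrative numbers only: a hypothetical data center with a fixed power
# budget, comparing a ~1 kW training-class device against a ~500 W
# inference-optimized device of equal per-device throughput.
FACILITY_POWER_W = 1_000_000  # assumed 1 MW budget for accelerators

legacy_device_w = 1_000       # ~1 kW per device
new_device_w = 500            # ~500 W per device

# Perspective 1: same device count, same performance, half the total draw.
n_devices = FACILITY_POWER_W // legacy_device_w
print(f"Power draw with {n_devices} legacy devices: {n_devices * legacy_device_w / 1e6:.1f} MW")
print(f"Power draw with {n_devices} new devices:    {n_devices * new_device_w / 1e6:.1f} MW")

# Perspective 2: same power envelope, twice as many devices, ~2x capacity.
print(f"Devices per MW, legacy: {FACILITY_POWER_W // legacy_device_w}")
print(f"Devices per MW, new:    {FACILITY_POWER_W // new_device_w}")
```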
Bringing inference to where users are
A defining trend in AI deployment today is the shift toward moving inference closer to the user. Unlike training, inference is inherently latency-sensitive and often needs to occur in real time. This makes routing inference workloads through distant cloud data centers increasingly impractical—from both a technical and economic perspective.
To address this, organizations are prioritizing edge-based inference, processing data locally or near the point of generation. Shortening the network path between the user and the inference engine significantly improves responsiveness, reduces bandwidth costs, enhances data privacy, and ensures greater reliability, particularly in environments with limited or unstable connectivity.
This decentralized model is gaining traction across industry. Even AI giants are embracing the edge, as seen in their development of high-performance AI workstations and compact data center solutions. These innovations reflect a clear strategic shift: enabling real-time AI capabilities at the edge without compromising on compute power.
Inference acceleration from the ground up
One high-tech company, for example, is setting the engineering pace with a novel architecture designed specifically to meet the stringent demands of AI inference in data centers and at the edge. The architecture breaks away from legacy designs optimized for training workloads, delivering near-theoretical performance in latency, throughput, and energy efficiency. More entrants are certain to follow.
Below are some of the highlights of this inference technology revolution in the making.
Breaking the memory wall
The “memory wall” has challenged chip designers since the late 1980s. Traditional architectures attempt to mitigate the performance impact of data movement between external memory and processing units by layering memory hierarchies, such as multi-level caches, scratchpads, and tightly coupled memory, each offering tradeoffs between speed and capacity.
In AI acceleration, this bottleneck becomes even more pronounced. Generative AI models, especially those based on incremental transformers, must constantly reprocess massive amounts of intermediate state data. Conventional architectures struggle here. Every cache miss—or any operation requiring access outside in-memory compute—can severely degrade performance.
One approach collapses the traditional memory hierarchy into a single, unified memory stage: a massive SRAM array that behaves like a flat register file. From the perspective of the processing units, any register can be accessed anywhere, at any time, within a single clock cycle. This eliminates costly data transfers and removes the bottlenecks that hamper other designs.
Flexible computational tiles, each built around 16 high-performance processing cores that can be dynamically reconfigured at run time, execute either AI operations, such as multi-dimensional matrix operations (ranging from 2D to N-dimensional), or advanced digital signal processing (DSP) functions.
Precision is also adjustable on-the-fly, supporting formats from 8 bits to 32 bits in both floating point and integer. Both dense and sparse computation modes are supported, and sparsity can be applied on the fly to either weights or data—offering fine-grained control for optimizing inference workloads.
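As a software-level illustration of those two knobs, the generic PyTorch sketch below casts weights to a lower-precision format and applies 50% unstructured sparsity to the weights; it only mimics the concepts and is not the accelerator's own API.

```python
import torch

# Generic PyTorch sketch of reduced precision and on-the-fly weight sparsity.
weights = torch.randn(1024, 1024)      # FP32 weights
activations = torch.randn(8, 1024)     # a small batch of activations

# Precision: cast to a lower-precision format (bfloat16 here; FP8 would be
# the analogous step on hardware with native FP8 support).
w_low = weights.to(torch.bfloat16)
a_low = activations.to(torch.bfloat16)
out_low_precision = a_low @ w_low

# Sparsity: zero out the smallest-magnitude half of the weights
# (50% unstructured sparsity applied to the weights).
threshold = weights.abs().median()
w_sparse = torch.where(weights.abs() >= threshold, weights, torch.zeros_like(weights))
out_sparse = activations @ w_sparse

print(f"low-precision output dtype: {out_low_precision.dtype}")
print(f"weight sparsity applied:    {(w_sparse == 0).float().mean().item():.0%}")
```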
Each core features 16 million registers. While such a vast register file presents challenges for traditional compilers, two key innovations come to the rescue:
- Native tensor processing, which handles vectors, tensors, and matrices directly in hardware, eliminates the need to reduce them to scalar operations and manually implement nested loops, as required in GPU environments like CUDA.
- High-level abstraction lets developers interact with the system through PyTorch and ONNX for AI and Matlab-like functions for DSP, without writing low-level code or managing registers manually. This simplifies development and significantly boosts productivity and hardware utilization.
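As a concrete illustration of that entry point, the snippet below exports a small placeholder PyTorch model to ONNX, the kind of artifact such a high-level flow would ingest; the model and file name are illustrative only.

```python
import torch
import torch.nn as nn

# A placeholder model standing in for a real LLM; the point is only that the
# developer works at the PyTorch/ONNX level and never touches registers.
model = nn.Sequential(
    nn.Linear(512, 2048),
    nn.GELU(),
    nn.Linear(2048, 512),
).eval()

example_input = torch.randn(1, 512)

# Export to ONNX, the standard interchange format named above; a graph
# compiler for an inference accelerator would consume this file.
torch.onnx.export(
    model,
    example_input,
    "model.onnx",
    input_names=["hidden_in"],
    output_names=["hidden_out"],
    dynamic_axes={"hidden_in": {0: "batch"}, "hidden_out": {0: "batch"}},
)
print("Exported model.onnx")
```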
Chiplet-based scalability
A physical implementation leverages a chiplet architecture, with each chiplet comprising two computational cores. By combining chiplets with high-bandwidth memory (HBM) chiplet stacks, the architecture enables highly efficient scaling for both cloud and edge inference scenarios.
- Data-center-grade inference: the configuration pairs eight VSORA chiplets with eight HBM3e chiplets, delivering 3,200 TFLOPS of compute performance in FP8 dense mode, optimized for large-scale inference workloads in data centers.
- Edge AI configurations allow efficient tailoring of compute resources and lower memory requirements to suit edge constraints: two chiplets plus one HBM chiplet deliver 800 TFLOPS, while four chiplets plus one HBM chiplet deliver 1,600 TFLOPS.
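The configurations above imply roughly 400 TFLOPS (FP8 dense) per compute chiplet; the scaling arithmetic is easy to check:

```python
# Scaling implied by the configurations above: 2 chiplets -> 800 TFLOPS,
# 4 -> 1,600 and 8 -> 3,200, i.e. roughly 400 TFLOPS (FP8 dense) per chiplet.
TFLOPS_PER_CHIPLET_FP8_DENSE = 400

for n_chiplets, n_hbm in [(2, 1), (4, 1), (8, 8)]:
    tflops = n_chiplets * TFLOPS_PER_CHIPLET_FP8_DENSE
    print(f"{n_chiplets} compute chiplets + {n_hbm} HBM chiplet(s): {tflops:,} TFLOPS FP8 dense")
```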
Power efficiency as a side effect
The performance gains are clear, and so is the power efficiency. The architecture delivers twice the performance per watt of comparable solutions. In practical terms, the chip’s power draw tops out at just 500 watts, compared with more than one kilowatt for many competitors.
When combined, these innovations provide multiple times the delivered performance at less than half the power, for an overall advantage of 8 to 10 times over conventional implementations.
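One way to read that 8x-to-10x figure, using hypothetical numbers consistent with the claims above (roughly 4x to 5x delivered performance at about half the power):

```python
# Hypothetical figures consistent with the claims above: a conventional
# accelerator normalized to 1x performance at 1 kW, versus a device that
# delivers 4x to 5x the performance at roughly half the power.
conventional_perf, conventional_power_kw = 1.0, 1.0
new_power_kw = 0.5

for new_perf in (4.0, 5.0):
    gain = (new_perf / new_power_kw) / (conventional_perf / conventional_power_kw)
    print(f"{new_perf:.0f}x performance at half the power -> {gain:.0f}x performance per watt")
```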
CUDA-free compilation
One often-overlooked advantage of the architecture lies in its streamlined and flexible software stack. From a compilation perspective, the flow is simplified compared to traditional GPU environments like CUDA.
The process begins with a minimal configuration file—just a few lines—that defines the target hardware environment. This file enables the same codebase to execute across a wide range of hardware configurations, whether that means distributing workloads across multiple cores, chiplets, full chips, boards, or even across nodes in a local or remote cloud. The only variable is execution speed; the functional behavior remains unchanged. This makes on-premises and localized cloud deployments seamless and scalable.
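The vendor's actual file format is not public, so the snippet below is a purely hypothetical illustration of what "a few lines defining the target hardware environment" might carry, expressed here as a Python mapping with invented field names.

```python
# Hypothetical illustration only: the article describes a minimal configuration
# file that selects the target hardware; the field names below are invented.
target_config = {
    "target": "datacenter",   # or "edge", "workstation", ...
    "chiplets": 8,            # compute chiplets to spread the workload across
    "hbm_stacks": 8,          # memory stacks available to the workload
    "nodes": ["local"],       # could instead list remote or cloud nodes
}

# The same compiled model would run unchanged on any of these targets;
# only execution speed differs.
print(f"Deploying to {target_config['target']} with {target_config['chiplets']} chiplet(s)")
```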
A familiar flow without complexity
Unlike CUDA-based compilation processes, the flow is deliberately simple, replacing layers of manual tuning and complexity with a more automated, hardware-agnostic compilation approach.
The flow begins by ingesting standard AI inputs, such as models defined in PyTorch. These are processed by a proprietary graph compiler that automatically performs essential transformations such as layer reordering or slicing for optimal execution. It extracts weights and model structure and then outputs an intermediate C++ representation.
This C++ code is then fed into an LLVM-based backend, which identifies the compute-intensive portions of the code and maps them to the architecture. At this stage, the system becomes hardware-aware, assigning compute operations to the appropriate configuration, whether a single tile, an edge device, a full data center accelerator, a server, a rack, or even multiple racks in different locations.
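A conceptual sketch of those two stages is shown below; the function names are placeholders standing in for the proprietary graph compiler and LLVM-based backend described above, not a real API.

```python
# Conceptual sketch of the two stages described above. These functions are
# placeholders, not the vendor's real API.

def graph_compile(model_path: str) -> str:
    """Stage 1: graph compiler. Reorders/slices layers, extracts weights,
    and emits an intermediate C++ representation."""
    cpp_path = model_path.replace(".onnx", ".gen.cpp")
    # ... proprietary graph transformations would happen here ...
    return cpp_path

def llvm_backend(cpp_path: str, target: dict) -> str:
    """Stage 2: LLVM-based backend. Finds compute-intensive regions and maps
    them onto the selected hardware configuration."""
    binary_path = cpp_path.replace(".gen.cpp", ".bin")
    # ... hardware-aware code generation for `target` would happen here ...
    return binary_path

target = {"tiles": 1}                    # a single tile, edge device, rack, ...
cpp_file = graph_compile("model.onnx")   # output of the graph compiler
binary = llvm_backend(cpp_file, target)  # hardware-aware mapping
print(f"Compiled {binary} for target {target}")
```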
Invisible acceleration for developers
From a developer’s point of view, the accelerator is invisible. Code is written as if it targets the main processor. The compilation flow identifies the code segments best suited for acceleration and transparently handles their transformation and mapping to hardware, lowering the barrier to adoption and requiring no low-level register manipulation or specialized programming knowledge.
The instruction set is high-level and intuitive, carrying over capabilities from its origins in digital signal processing. The architecture supports AI-specific formats such as FP8 and FP16 as well as traditional DSP arithmetic, all handled automatically on a per-layer basis. Switching between modes is instantaneous and requires no manual intervention.
Pipeline-independent execution and intelligent data retention
A key architectural advantage is pipeline independence—the ability to dynamically insert or remove pipeline stages based on workload needs. This gives the system a unique capacity to “look ahead and behind” within a data stream, identifying which information must be retained for reuse. As a result, data traffic is minimized, and memory access patterns are optimized for maximum performance and efficiency, reaching levels unachievable in conventional AI or DSP systems.
Built-in functional safety
To support mission-critical applications such as autonomous driving, functional safety features are integrated at the architectural level. Cores can be configured to operate in lockstep mode or in redundant configurations, enabling compliance with strict safety and reliability requirements.
In the final analysis, a memory architecture that eliminates traditional bottlenecks, compute units tailored for tensor operations, and unmatched power efficiency together set a new standard for AI inference.
Lauro Rizzatti is a business advisor to VSORA, an innovative startup offering silicon IP solutions and silicon chips, and a noted verification consultant and industry expert on hardware emulation.