AI Inference Stack
A few weeks ago, I published the AI Inference Stack Market Map with Wing.VC and Jake Flomenberg. The center of gravity for new AI workloads is starting to shift toward inference.
Some additional thoughts on inference:
- If you're interested in inference, I'd recommend reading "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters." It was the first paper I read that systematically explored the effectiveness of test-time compute. (A minimal best-of-N sketch follows this list.)
- Sebastian Raschka has an excellent overview of inference-time compute techniques that is current as of March 2025. (The space never pauses, and innovations never slow.)
- I'm uncertain about the long-term viability of inference-as-a-service providers such as Together.AI and Fireworks. AWS pioneered the business model of adding value-added software (e.g., RDS, SQS) and monetizing it by selling compute at a slight premium to the market. I don't think these dynamics will work as well in the GPU market: if companies are spending eight or nine figures a year on GPUs, will they pay a 100 basis point premium for a hosted inference stack, or will they elect to get more GPU capacity for their budget and build their own? Maintaining that edge requires a lot of R&D.
- That said, as the market map illustrates, inference stacks are getting quite complex. So there is value in a single provider who has figured out how to optimize that stack to maximize GPU performance.
- Optimizing inference requires an understanding of hardware (GPUs, GPU memory, etc.). There are a lot of tradeoffs, and choosing the right hardware mix for the software stack is complicated. It seems there is room for simplification here. (See the memory-sizing sketch after this list.)
- The DeepSeek team's work open sourcing key elements of their inference stack and NVIDIA's Dynamo architecture and documentation are both worth reading. I'm also keeping an eye on llm-d.
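As promised above, here is a minimal sketch of the simplest form of test-time compute, best-of-N sampling: spend more inference FLOPs by drawing several candidate answers and keeping the one a verifier scores highest. This is my own illustration, not the method from the paper (which studies compute-optimal allocation across strategies like verifier-guided search and revisions); `generate` and `score` are hypothetical placeholders for your model and verifier.

```python
# Best-of-N sampling: the simplest test-time compute knob.
# `generate` and `score` are placeholders for a sampling call to your model
# and a verifier/reward model; `n` controls how much extra compute you spend.
from typing import Callable


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],      # samples one candidate completion
    score: Callable[[str, str], float],  # verifier: higher means better answer
    n: int = 8,                          # number of candidates (compute budget)
) -> str:
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))
```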
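And the memory-sizing sketch mentioned above: a back-of-the-envelope estimate of weight memory plus KV cache, which is usually the first hardware tradeoff you hit when picking a GPU mix. The config numbers are illustrative assumptions (roughly a Llama-style 70B model with grouped-query attention), not a recommendation.

```python
# Back-of-the-envelope memory sizing for serving an LLM: weights + KV cache.
# Config numbers below are illustrative, not a recommendation.

def weight_bytes(n_params: float, bytes_per_param: float = 2.0) -> float:
    """Memory for model weights (2 bytes/param for FP16/BF16, 1 for FP8/INT8)."""
    return n_params * bytes_per_param


def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,          # grouped-query attention: fewer KV heads than attention heads
    head_dim: int,
    context_len: int,
    batch_size: int,
    bytes_per_elem: float = 2.0,
) -> float:
    """KV cache = 2 (K and V) * layers * kv_heads * head_dim * tokens * batch."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * batch_size * bytes_per_elem


GiB = 1024 ** 3

weights = weight_bytes(70e9)                          # ~130 GiB in BF16
kv = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                    context_len=8192, batch_size=32)  # ~80 GiB
print(f"weights: {weights / GiB:.0f} GiB, KV cache: {kv / GiB:.0f} GiB")
print(f"80 GiB GPUs needed just to hold memory: {(weights + kv) / (80 * GiB):.1f}")
```

Batch size and context length dominate the KV cache term, which is part of why so many of the companies on the market map focus on quantization and KV cache management.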
Drop me a line if you're working in this space! I'd love to hear what you're up to.