Open source LLMs and open source inference software: building blocks of a commoditized LLM inference hosting market

As of early 2025, large language models (LLMs) are primarily accessed through web interfaces offered by companies like OpenAI (ChatGPT), Anthropic (Claude), and Google (Gemini). Alongside these proprietary offerings, a “second tier” of open-source models has emerged, including Meta’s Llama 3.1, Mistral, DeepSeek, and others. These open-source models are becoming increasingly viable for self-hosting, offering significant advantages in data sovereignty, confidentiality, and cost. For many use cases they are roughly on par with proprietary models, making them an appealing alternative.

While web interfaces are the most visible way to interact with LLMs, they are largely loss leaders, designed to promote application programming interface (API) services. APIs are the backbone of the LLM ecosystem, enabling developers to integrate LLM capabilities into their own software: through them, businesses can pass data and instructions to a model and retrieve outputs tailored to their needs. These APIs are central to the value proposition of LLMs, powering applications like retrieval-augmented generation (RAG) workflows that search document collections, automated form processing, and natural language interfaces to structured databases.

The growing market for LLM APIs

OpenAI was the first major player to offer an API for its LLMs, and its design has become a de facto standard, with many other LLM providers emulating its structure. This compatibility has paved the way for a competitive LLM inference hosting market: applications built on the API can often switch between providers with minimal effort, simply by changing the host address and API key. The result is a dynamic market for LLM inference in which cost, performance, and data privacy are the key differentiators.

Example of an LLM API call

Here’s an example of a basic API call using curl. This same structure is supported by most LLM APIs:

curl https://api.lemonfox.ai/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mixtral-chat",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Why is the sky blue?" }
    ]
  }'

This straightforward interface makes it easy for developers to integrate LLM capabilities into their applications, whether for natural language understanding, data extraction, or other advanced AI tasks.
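
Because so many providers implement the same chat completions protocol, switching back ends often amounts to nothing more than changing the host address and API key. As a rough sketch, the same request pointed at a self-hosted, OpenAI-compatible server looks almost identical; this example assumes a local vLLM instance listening on port 8000 with no API key configured, and the model name is illustrative and must match whatever the server is actually serving:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Why is the sky blue?" }
    ]
  }'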

Note: you may notice differences between this API call and the calls we used with Ollama and Open-WebUI in previous blog posts. Ollama and Open-WebUI use a simplified protocol built around a single prompt field, whereas the example above uses a messages array, following the chat completions API introduced by OpenAI and implemented by third parties such as lemonfox.ai.
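
For contrast, a prompt-style request against Ollama's native generate endpoint looks roughly like the sketch below. It assumes a local Ollama install with a llama3.1 model already pulled; "stream": false asks for a single JSON response rather than a stream of tokens:

curl http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1",
    "prompt": "Why is the sky blue?",
    "stream": false
  }'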

A historical parallel: LLM hosting and the web hosting market of the 2000s

The current trajectory of LLM inference hosting bears striking similarities to the early days of web hosting in the late 1990s and early 2000s. Back then, the advent of open-source technologies like Linux, Apache, MySQL, and PHP enabled hobbyists and businesses to build industrial-grade web servers on consumer hardware. While some opted to host websites themselves, most turned to professional web hosting providers, creating a competitive market that eventually drove down prices and established commoditized hosting as the norm.

Similarly, the LLM inference hosting market is evolving into a spectrum of options:

  • Self-hosting: Organizations can invest in high-performance hardware like NVIDIA’s H100 GPUs (priced at around US$30,000 each) or more modest setups built around consumer GPUs such as the RTX 4090 or RTX 5090 (a complete system in the region of US$5,000). This option offers full control but requires significant upfront investment and technical expertise.
  • Leased GPU services: Cloud providers offer GPU resources on an hourly basis, making it possible to run LLMs without committing to physical hardware. For example, renting an H100 GPU typically costs around US$3 per hour (a brief setup sketch follows this list).
  • Hosted inference services: Many providers offer LLM inference as a service, where customers pay per transaction or token. This model eliminates the need for infrastructure management, appealing to businesses that prioritize simplicity.
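
To make the leased-GPU option more concrete, here is a minimal sketch of bringing up an open-source model behind an OpenAI-compatible API on a rented instance. It assumes a Linux machine with working NVIDIA drivers and Python installed, and it uses vLLM's bundled server; the model name and port are illustrative:

# install the vLLM inference server
pip install vllm

# serve an open-weight model behind an OpenAI-compatible API on port 8000
vllm serve mistralai/Mistral-7B-Instruct-v0.3 --port 8000

Applications can then send their chat completions requests to http://<instance-ip>:8000/v1, exactly as in the earlier examples.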

The economics of LLM hosting

The emergence of open-source models and interoperable APIs is driving fierce competition in the LLM hosting market. This competition has already led to dramatic price differences between providers. For example:

  • OpenAI GPT-3.5-turbo: US$10 per 10 million tokens
  • lemonfox.ai Mistral 7B (an open-source model): US$5 per 10 million tokens

These disparities highlight the potential cost savings of opting for open-source models hosted by third-party providers or self-hosting solutions.
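
As a rough illustration of what those rates mean in practice (the traffic figures here are purely illustrative): an application processing 100 million tokens a month would pay about US$100 per month at the first rate and about US$50 at the second; at a billion tokens a month, the gap grows to roughly US$1,000 versus US$500.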

Renting GPUs vs. buying inference services

For businesses and developers, choosing between renting GPU time, self-hosting, or using inference services depends on several factors:

  • Scalability: Hosted inference services are ideal for unpredictable or spiky workloads, as they scale effortlessly.
  • Cost efficiency: For steady, high-volume workloads, self-hosting may be more economical in the long run (a rough comparison follows this list).
  • Data control: Organizations with strict confidentiality requirements may prefer self-hosting to ensure data never leaves their infrastructure.
  • Licensing costs: Open-source software is free as in freedom and free as in beer. GPU hardware remains a significant expense, but an enterprise can generally self-host LLMs without incurring software licensing fees.
  • Market pressure: Price competition from vendors built on open-source models also constrains the pricing power of closed-source vendors.
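
As a back-of-the-envelope comparison using the figures quoted earlier (and ignoring electricity, hosting, and staff time): a rented H100 at US$3 per hour comes to about US$2,200 per month if kept busy around the clock (3 × 24 × 30 ≈ 2,160), or roughly US$26,000 per year, which approaches the US$30,000 purchase price of the card itself. Workloads that keep a GPU saturated most of the time therefore tip toward ownership within a year or so, while intermittent workloads favor renting or per-token inference services.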

For example, a small startup building a chatbot might initially use an inference provider like lemonfox.ai to minimize costs and complexity. As their user base grows, they might transition to leased GPU services or invest in dedicated hardware to optimize expenses.

A law firm or medical practice may begin with an isolated cloud instance covered by non-disclosure (NDA) and data protection (DPA) agreements. At some point, the business case may justify bringing the service in-house, running a self-hosted inference server on its own GPU hardware.

Conclusion: the road ahead for LLM inference hosting

As LLMs continue to gain traction, the LLM inference hosting market will likely follow the trajectory of web hosting two decades ago—moving toward commoditization and low-margin competition. Businesses and individuals will increasingly weigh the trade-offs between cost, control, and convenience when deciding how to deploy LLM capabilities. The availability of open-source models and interoperable APIs ensures that options will continue to expand, empowering developers to choose the solution that best meets their needs.