Creating a proxy server to allow Open-WebUI to connect to a large language model (LLM) application programming interface (API) with chat completions and a message array

Open-WebUI, a web chat server for LLMs, is not compatible with some LLM APIs that implement the chat completions endpoint and use a message array. Although there are other tools available, I wanted to use Open-WebUI. I resolved this by creating a proxy server that acts as a translation layer between the Open-WebUI chat server and an LLM API server that implements the chat completions endpoint and uses a message array.

I wanted to use Open-WebUI as my chat server, but Open-WebUI is not compatible with the API of my remotely hosted LLM inference service

Open-WebUI was designed to be compatible with Ollama, a tool that hosts an LLM locally and exposes an API. However, instead of using a locally-hosted LLM, I want to use the LLM inference API service provided by lemonfox.ai, which emulates the OpenAI API, including the chat completions endpoint and the message array format.
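
For reference, the chat completions format carries the conversation as a message array. A request of roughly the following shape illustrates the format; this is a sketch that reuses the endpoint, model name, and placeholder API key that appear later in this post, and the prompt text is arbitrary:

curl https://api.lemonfox.ai/v1/chat/completions \
  -H "Authorization: Bearer xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mixtral-chat",
    "messages": [
      {"role": "user", "content": "Hello, can you hear me?"}
    ]
  }'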

Considering the value of a remote LLM inference API server over a locally-hosted solution

In this blog post, we create a proxy server that enables the Open-WebUI chat server to connect to an OpenAI-compatible API. Although self-hosting LLM inference is an interesting technical exercise, as a business case it does not make sense for long-term production. Certain kinds of LLM inference workloads can be handled by a CPU-only system using a tool like Ollama, but the performance is not sufficient for real-time interaction. Dedicated GPU-enabled hardware is a significant expense, whether that means purchasing a card such as an A30, H100, RTX 4090, or RTX 5090, or renting or leasing equivalent hardware, which costs even more over time. We seem to be heading into an era in which LLM inference itself is software as a service (SaaS), unless there are specific reasons why inference data cannot be shared with a public cloud, as in some legal or medical applications.

Using a proxy server as a translation layer between incompatible APIs

There are many chat user interfaces available, but Open-WebUI has been easier for me to deploy than the alternatives, and for the moment it is my preference. The need for proxy servers to translate between LLM API servers that speak slightly different protocols will likely be with us for some time, until LLM APIs mature and become more compatible.

Using a remotely-hosted LLM inference API with a toolchain of applications and proxies

At this time in 2025, most LLM inference APIs emulate the OpenAI protocol, with support for the chat completions endpoint and the use of a message array. In this exercise, we will connect the Open-WebUI chat server to an OpenAI-compatible LLM API. In the future, we may see more abstracted toolchains: for example, a retrieval-augmented generation (RAG) server offering an API that encapsulates local RAG functionality, enhanced by the remote LLM inference API, to which a chat server connects. In this case, the chat server is Open-WebUI, but in other applications it could be a web chat user interface embedded in a website.

Escalating to the root user with sudo

Enter the following command:

sudo su

Creating a virtual environment and installing dependencies

Enter the following commands:

cd ~
mkdir proxy_workdir
cd proxy_workdir
python3 -m venv proxy_env
source proxy_env/bin/activate
pip install fastapi uvicorn httpx python-dotenv
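
Optionally, because python-dotenv is installed, you can store the API key in a .env file in the working directory instead of leaving it hardcoded in the script. The variable name LEMONFOX_API_KEY is simply the name the proxy script below looks for; substitute your own key for the placeholder:

echo "LEMONFOX_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" > .env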

Creating the proxy.py file

Enter the following command:

nano proxy.py

Use the nano editor to add the following text:

# MIT license Gordon Buchan 2025
# see https://opensource.org/license/mit
# Some of this code was generated with the assistance of AI tools.

from fastapi import FastAPI, Request
import httpx
import logging
import json
import time
import asyncio
import os
from dotenv import load_dotenv

app = FastAPI()

# Enable logging for debugging
logging.basicConfig(level=logging.DEBUG)

# LemonFox API details.
# Read the API key from a .env file or the LEMONFOX_API_KEY environment variable
# if available, otherwise fall back to the placeholder value below.
load_dotenv()
LEMONFOX_API_URL = "https://api.lemonfox.ai/v1/chat/completions"
API_KEY = os.getenv("LEMONFOX_API_KEY", "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")

@app.get("/api/openai/v1/models")
async def get_models():
    return {
        "object": "list",
        "data": [
            {
                "id": "mixtral-chat",
                "object": "model",
                "owned_by": "lemonfox"
            }
        ]
    }

async def make_request_with_retry(payload):
    """Send request to LemonFox API with one retry in case of failure."""
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    
    for attempt in range(2):  # Try twice before failing
        async with httpx.AsyncClient() as client:
            try:
                response = await client.post(LEMONFOX_API_URL, json=payload, headers=headers)
                # Raise an exception for 4xx/5xx responses so the retry logic runs
                response.raise_for_status()
                response_json = response.json()
                
                # If response is valid, return it
                if "choices" in response_json and response_json["choices"]:
                    return response_json
                
                logging.warning(f"❌ Empty response from LemonFox on attempt {attempt + 1}: {response_json}")

            except httpx.HTTPError as e:
                # Covers connection errors as well as non-2xx HTTP status errors
                logging.error(f"❌ LemonFox API HTTP error: {e}")
            except json.JSONDecodeError:
                logging.error(f"❌ LemonFox returned an invalid JSON response: {response.text}")
        
        # Wait 1 second before retrying, without blocking the event loop
        await asyncio.sleep(1)

    # If we get here, both attempts failed—return a default response
    logging.error("❌ LemonFox API failed twice. Returning a fallback response.")
    return {
        "id": "fallback-response",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": "unknown",
        "choices": [
            {
                "index": 0,
                "message": {
                    "role": "assistant",
                    "content": "I'm sorry, but I couldn't generate a response. Try again."
                },
                "finish_reason": "stop"
            }
        ],
        "usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}
    }

@app.post("/api/openai/v1/chat/completions")
async def proxy_chat_completion(request: Request):
    """Ensure Open WebUI's request is converted and always return a valid response."""
    try:
        payload = await request.json()
        logging.debug("🟢 Open WebUI Request: %s", json.dumps(payload, indent=2))

        # Convert `prompt` into OpenAI's `messages[]` format
        if "prompt" in payload:
            payload["messages"] = [{"role": "user", "content": payload["prompt"]}]
            del payload["prompt"]
        elif "messages" not in payload or not isinstance(payload["messages"], list):
            logging.error("❌ Open WebUI sent an invalid request!")
            return {"error": "Invalid request format. Expected `messages[]` or `prompt`."}

        # Force disable streaming
        payload["stream"] = False

        # Set max tokens to a high value to avoid truncation
        payload.setdefault("max_tokens", 4096)

        # Call LemonFox with retry logic
        response_json = await make_request_with_retry(payload)

        # Ensure response follows OpenAI format
        if "choices" not in response_json or not response_json["choices"]:
            logging.error("❌ LemonFox returned an empty `choices[]` array after retry!")
            response_json["choices"] = [
                {
                    "index": 0,
                    "message": {
                        "role": "assistant",
                        "content": "I'm sorry, but I didn't receive a valid response."
                    },
                    "finish_reason": "stop"
                }
            ]

        logging.debug("🟢 Final Response Sent to Open WebUI: %s", json.dumps(response_json, indent=2))
        return response_json

    except Exception as e:
        logging.error("❌ Unexpected Error in Proxy: %s", str(e))
        return {"error": str(e)}

Save and exit the file.

Running the proxy server manually

Enter the following command:

uvicorn proxy:app --host 0.0.0.0 --port 8000
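
With the proxy running, you can confirm it is answering before configuring Open-WebUI. These sample requests (the test prompt is arbitrary) exercise the two endpoints defined in proxy.py:

curl http://localhost:8000/api/openai/v1/models

curl http://localhost:8000/api/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mixtral-chat", "messages": [{"role": "user", "content": "Say hello."}]}'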

Configuring Open-WebUI

Go to Open-WebUI Settings | Connections

Set API Base URL to:

http://localhost:8000/api/openai/v1

Ensure that the model name matches:

mixtral-chat

Testing Open-WebUI with a simple message

Enter some text in the chat window and see if you get a response from the LLM.

Creating the systemd service

Enter the following command:

nano /etc/systemd/system/open-webui-proxy.service

Use the nano editor to add the following text:

[Unit]
Description=open-webui Proxy for Open WebUI and LLM API
After=network.target

[Service]
Type=simple
# Change WorkingDirectory to your script's location
WorkingDirectory=/root/proxy_workdir
ExecStart=/usr/bin/env bash -c "source /root/proxy_workdir/proxy_env/bin/activate && uvicorn proxy:app --host 0.0.0.0 --port 8000"
Restart=always
RestartSec=5
User=root
Group=root
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

Save and exit the file.
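
Optionally, you can check the new unit file for obvious mistakes before enabling it:

systemd-analyze verify /etc/systemd/system/open-webui-proxy.service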

Enabling and starting the systemd service

Enter the following commands:

systemctl daemon-reload
systemctl enable open-webui-proxy.service
systemctl start open-webui-proxy.service

Checking the status of the systemd service

Enter the following command:

systemctl status open-webui-proxy.service
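
To follow the proxy's log output (including the debug logging enabled in proxy.py), you can also watch the journal:

journalctl -u open-webui-proxy.service -f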

Creating the systemd timer

Enter the following command:

nano /etc/systemd/system/open-webui-proxy.timer

Use the nano editor to add the following text:

[Unit]
Description=Periodic restart of open-webui Proxy

[Timer]
OnBootSec=1min
OnUnitActiveSec=1h
Unit=open-webui-proxy.service

[Install]
WantedBy=timers.target

Save and exit the file.

Enabling and starting the systemd timer

Enter the following commands:

systemctl daemon-reload
systemctl enable open-webui-proxy.timer
systemctl start open-webui-proxy.timer

Checking the systemd timer

Enter the following command:

systemctl list-timers --all | grep open-webui-proxy