{"id":5339,"date":"2025-02-01T20:46:16","date_gmt":"2025-02-01T20:46:16","guid":{"rendered":"https:\/\/blog.gordonbuchan.com\/blog\/?p=5339"},"modified":"2025-02-03T11:48:36","modified_gmt":"2025-02-03T11:48:36","slug":"creating-a-proxy-server-to-allow-open-webui-to-connect-to-a-large-language-model-llm-application-programming-interface-api-with-chat-completions-and-a-message-array","status":"publish","type":"post","link":"https:\/\/blog.gordonbuchan.com\/blog\/index.php\/2025\/02\/01\/creating-a-proxy-server-to-allow-open-webui-to-connect-to-a-large-language-model-llm-application-programming-interface-api-with-chat-completions-and-a-message-array\/","title":{"rendered":"Creating a proxy server to allow Open-WebUI to connect to a large language model (LLM) application programming interface (API) with chat completions and a message array"},"content":{"rendered":"\n<p>Open-WebUI, a web chat server for LLMs, is not compatible with some LLM APIs that support the chat completions API, and use a message array. Although there are other tools available, I wanted to use Open-WebUI. I resolved this by creating a proxy server that acts as a translation layer between the Open-WebUI chat server and an LLM API server that supports the chat completions API and uses a message array.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">I wanted to use Open-WebUI as my chat server, but Open-WebUI is not compatible with the API of my remotely hosted LLM API inference service<\/h1>\n\n\n\n<p>Open-WebUI was designed to be compatible with Ollama, a tool that hosts an LLM locally and exposes an API. However, instead of using a locally-hosted LLM, I would like to use an LLM inference API service provided by <a href=\"https:\/\/lemonfox.ai\">lemonfox.ai<\/a>, which emulates the OpenAI API including the chat completions API, and uses a message array.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">Considering the value of a remote LLM inference API server over a locally-hosted solution<\/h1>\n\n\n\n<p>In this blog post, we create a proxy server that enables the Open-WebUI chat server to connect to an OpenAI-compatible API. Although it is an interesting technical exercise to self-host, as a business case it does not make sense for long-term production. Certain kinds of LLM inference workloads can be handled by a CPU-only system, using a tool like Ollama, but the performance is not sufficient for real-time interaction. Dedicating GPU-enabled hardware is a significant expense, whether it be the acquisition of dedicated GPU hardware such as an A30, H100, RTX 4090, or RTX 5090 card. Renting or leasing this hardware is even more expensive. We seem to be heading into an era in which LLM inference itself is software as a service (SaaS), unless there are specific reasons why inference data cannot be shared with a public cloud, such as a legal or medical application.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">Using a proxy server as a translation layer between incompatible APIs<\/h1>\n\n\n\n<p>There are many chat user interfaces available, but Open-WebUI has been easier to deploy and for the moment is my preference. The need for proxy servers to translate between LLM API servers that have slightly different protocols will likely be with us for some time, until LLM APIs have matured and become more compatible.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">Using a remotely-hosted LLM inference API with a toolchain of applications and proxies<\/h1>\n\n\n\n<p>At this time in 2025, most LLM inference APIs emulate the OpenAI protocol, with support for the chat completions API and the use of a message array. In this exercise, we will be connecting the Open-WebUI chat server to an OpenAI compatible LLM API. In the future, we may see more abstracted toolchains, for example, a retrieval augmented generation (RAG) server offering an API that encapsulates the local RAG functionality and enhanced by the remote LLM inference API, to which a chat server will connect. In this case, the chat server will be Open-WebUI,  but in other applications it could be a web chat user interface embedded in a website.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">Escalating to the root user with sudo<\/h1>\n\n\n\n<p>Enter the following command:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nsudo su\n<\/pre><\/div>\n\n\n<h1 class=\"wp-block-heading\">Creating a virtual environment and installing dependencies<\/h1>\n\n\n\n<p>Enter the following commands:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\ncd ~\nmkdir proxy_workdir\ncd proxy_workdir\npython3 -m venv proxy_env\nsource proxy_env\/bin\/activate\npip install fastapi uvicorn httpx python-dotenv\n<\/pre><\/div>\n\n\n<h1 class=\"wp-block-heading\">Creating the proxy.py file<\/h1>\n\n\n\n<p>Enter the following command:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nnano proxy.py\n<\/pre><\/div>\n\n\n<p>Use the nano editor to add the following text:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\n# MIT license Gordon Buchan 2025\n# see https:\/\/opensource.org\/license\/mit\n# Some of this code was generated with the assistance of AI tools.\n\nfrom fastapi import FastAPI, Request\nimport httpx\nimport logging\nimport json\nimport time\n\napp = FastAPI()\n\n# Enable logging for debugging\nlogging.basicConfig(level=logging.DEBUG)\n\n# LemonFox API details\nLEMONFOX_API_URL = &quot;https:\/\/api.lemonfox.ai\/v1\/chat\/completions&quot;\nAPI_KEY = &quot;xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx&quot;\n\n@app.get(&quot;\/api\/openai\/v1\/models&quot;)\nasync def get_models():\n    return {\n        &quot;object&quot;: &quot;list&quot;,\n        &quot;data&quot;: &#x5B;\n            {\n                &quot;id&quot;: &quot;mixtral-chat&quot;,\n                &quot;object&quot;: &quot;model&quot;,\n                &quot;owned_by&quot;: &quot;lemonfox&quot;\n            }\n        ]\n    }\n\nasync def make_request_with_retry(payload):\n    &quot;&quot;&quot;Send request to LemonFox API with one retry in case of failure.&quot;&quot;&quot;\n    headers = {\n        &quot;Authorization&quot;: f&quot;Bearer {API_KEY}&quot;,\n        &quot;Content-Type&quot;: &quot;application\/json&quot;,\n    }\n    \n    for attempt in range(2):  # Try twice before failing\n        async with httpx.AsyncClient() as client:\n            try:\n                response = await client.post(LEMONFOX_API_URL, json=payload, headers=headers)\n                response_json = response.json()\n                \n                # If response is valid, return it\n                if &quot;choices&quot; in response_json and response_json&#x5B;&quot;choices&quot;]:\n                    return response_json\n                \n                logging.warning(f&quot;\u274c Empty response from LemonFox on attempt {attempt + 1}: {response_json}&quot;)\n\n            except httpx.HTTPStatusError as e:\n                logging.error(f&quot;\u274c LemonFox API HTTP error: {e}&quot;)\n            except json.JSONDecodeError:\n                logging.error(f&quot;\u274c LemonFox returned an invalid JSON response: {response.text}&quot;)\n        \n        # Wait 1 second before retrying\n        time.sleep(1)\n\n    # If we get here, both attempts failed\u2014return a default response\n    logging.error(&quot;\u274c LemonFox API failed twice. Returning a fallback response.&quot;)\n    return {\n        &quot;id&quot;: &quot;fallback-response&quot;,\n        &quot;object&quot;: &quot;chat.completion&quot;,\n        &quot;created&quot;: int(time.time()),\n        &quot;model&quot;: &quot;unknown&quot;,\n        &quot;choices&quot;: &#x5B;\n            {\n                &quot;index&quot;: 0,\n                &quot;message&quot;: {\n                    &quot;role&quot;: &quot;assistant&quot;,\n                    &quot;content&quot;: &quot;I&#039;m sorry, but I couldn&#039;t generate a response. Try again.&quot;\n                },\n                &quot;finish_reason&quot;: &quot;stop&quot;\n            }\n        ],\n        &quot;usage&quot;: {&quot;prompt_tokens&quot;: 0, &quot;completion_tokens&quot;: 0, &quot;total_tokens&quot;: 0}\n    }\n\n@app.post(&quot;\/api\/openai\/v1\/chat\/completions&quot;)\nasync def proxy_chat_completion(request: Request):\n    &quot;&quot;&quot;Ensure Open WebUI&#039;s request is converted and always return a valid response.&quot;&quot;&quot;\n    try:\n        payload = await request.json()\n        logging.debug(&quot;\ud83d\udfe2 Open WebUI Request: %s&quot;, json.dumps(payload, indent=2))\n\n        # Convert `prompt` into OpenAI&#039;s `messages&#x5B;]` format\n        if &quot;prompt&quot; in payload:\n            payload&#x5B;&quot;messages&quot;] = &#x5B;{&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: payload&#x5B;&quot;prompt&quot;]}]\n            del payload&#x5B;&quot;prompt&quot;]\n        elif &quot;messages&quot; not in payload or not isinstance(payload&#x5B;&quot;messages&quot;], list):\n            logging.error(&quot;\u274c Open WebUI sent an invalid request!&quot;)\n            return {&quot;error&quot;: &quot;Invalid request format. Expected `messages&#x5B;]` or `prompt`.&quot;}\n\n        # Force disable streaming\n        payload&#x5B;&quot;stream&quot;] = False\n\n        # Set max tokens to a high value to avoid truncation\n        payload.setdefault(&quot;max_tokens&quot;, 4096)\n\n        # Call LemonFox with retry logic\n        response_json = await make_request_with_retry(payload)\n\n        # Ensure response follows OpenAI format\n        if &quot;choices&quot; not in response_json or not response_json&#x5B;&quot;choices&quot;]:\n            logging.error(&quot;\u274c LemonFox returned an empty `choices&#x5B;]` array after retry!&quot;)\n            response_json&#x5B;&quot;choices&quot;] = &#x5B;\n                {\n                    &quot;index&quot;: 0,\n                    &quot;message&quot;: {\n                        &quot;role&quot;: &quot;assistant&quot;,\n                        &quot;content&quot;: &quot;I&#039;m sorry, but I didn&#039;t receive a valid response.&quot;\n                    },\n                    &quot;finish_reason&quot;: &quot;stop&quot;\n                }\n            ]\n\n        logging.debug(&quot;\ud83d\udfe2 Final Response Sent to Open WebUI: %s&quot;, json.dumps(response_json, indent=2))\n        return response_json\n\n    except Exception as e:\n        logging.error(&quot;\u274c Unexpected Error in Proxy: %s&quot;, str(e))\n        return {&quot;error&quot;: str(e)}\n\n<\/pre><\/div>\n\n\n<p>Save and exit the file.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">Running the proxy server manually<\/h1>\n\n\n\n<p>Enter the following command:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nuvicorn proxy:app --host 0.0.0.0 --port 8000\n<\/pre><\/div>\n\n\n<h1 class=\"wp-block-heading\">Configuring Open-WebUI<\/h1>\n\n\n\n<p>Go to Open-WebUI Settings | Connections<\/p>\n\n\n\n<p>Set API Base URL to:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nhttp:\/\/localhost:8000\/api\/openai\/v1\n<\/pre><\/div>\n\n\n<p>Ensure that model name matches:<\/p>\n\n\n\n<p>mixtral-chat<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Testing Open-WebUI with a simple message<\/h2>\n\n\n\n<p>Enter some text in the chat window and see if you get a response from the LLM.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">Creating the systemd service<\/h1>\n\n\n\n<p>Enter the following command: <\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nnano \/etc\/systemd\/system\/open-webui-proxy.service\n<\/pre><\/div>\n\n\n<p>Use the nano editor to add the following text:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\n&#x5B;Unit]\nDescription=open-webui Proxy for Open WebUI and LLM API\nAfter=network.target\n\n&#x5B;Service]\nType=simple\nWorkingDirectory=\/root\/proxy_workdir  # Change to your script&#039;s location\nExecStart=\/usr\/bin\/env bash -c &quot;source \/root\/proxy_workdir\/proxy_env\/bin\/activate &amp;amp;&amp;amp; uvicorn proxy:app --host 0.0.0.0 --port 8000&quot;\nRestart=always\nRestartSec=5\nUser=root\nGroup=root\nStandardOutput=journal\nStandardError=journal\n\n&#x5B;Install]\nWantedBy=multi-user.target\n<\/pre><\/div>\n\n\n<p>Save and exit the file.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">Enabling and starting the systemd service<\/h1>\n\n\n\n<p>Enter the following commands:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nsystemctl daemon-reload\nsystemctl enable open-webui-proxy.service\nsystemctl start open-webui-proxy.service\n<\/pre><\/div>\n\n\n<h1 class=\"wp-block-heading\">Checking the status of the systemd service<\/h1>\n\n\n\n<p>Enter the following command:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nsystemctl status open-webui-proxy.service\n<\/pre><\/div>\n\n\n<h1 class=\"wp-block-heading\">Creating the systemd timer<\/h1>\n\n\n\n<p>Enter the following command:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nnano open-webui-proxy.timer\n<\/pre><\/div>\n\n\n<p>Use the nano editor to add the following text:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\n&#x5B;Unit]\nDescription=Periodic restart of open-webui Proxy\n\n&#x5B;Timer]\nOnBootSec=1min\nOnUnitActiveSec=1h\nUnit=open-webui-proxy.service\n\n&#x5B;Install]\nWantedBy=timers.target\n<\/pre><\/div>\n\n\n<p>Save and exit the file.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">Enabling and starting the systemd timer<\/h1>\n\n\n\n<p>Enter the following commands:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nsystemctl daemon-reload\nsystemctl enable open-webui-proxy.timer\nsystemctl start open-webui-proxy.timer\n<\/pre><\/div>\n\n\n<h1 class=\"wp-block-heading\">Checking the systemd timer<\/h1>\n\n\n\n<p>Enter the following command:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nsystemctl list-timers --all | grep open-webui-proxy\n<\/pre><\/div>\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Open-WebUI, a web chat server for LLMs, is not compatible with some LLM APIs that support the chat completions API, and use a message array. Although there are other tools available, I wanted to use Open-WebUI. I resolved this by creating a proxy server that acts as a translation layer between the Open-WebUI chat server &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/blog.gordonbuchan.com\/blog\/index.php\/2025\/02\/01\/creating-a-proxy-server-to-allow-open-webui-to-connect-to-a-large-language-model-llm-application-programming-interface-api-with-chat-completions-and-a-message-array\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Creating a proxy server to allow Open-WebUI to connect to a large language model (LLM) application programming interface (API) with chat completions and a message array&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-5339","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/blog.gordonbuchan.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/5339","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.gordonbuchan.com\/blog\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.gordonbuchan.com\/blog\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.gordonbuchan.com\/blog\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.gordonbuchan.com\/blog\/index.php\/wp-json\/wp\/v2\/comments?post=5339"}],"version-history":[{"count":36,"href":"https:\/\/blog.gordonbuchan.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/5339\/revisions"}],"predecessor-version":[{"id":5381,"href":"https:\/\/blog.gordonbuchan.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/5339\/revisions\/5381"}],"wp:attachment":[{"href":"https:\/\/blog.gordonbuchan.com\/blog\/index.php\/wp-json\/wp\/v2\/media?parent=5339"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.gordonbuchan.com\/blog\/index.php\/wp-json\/wp\/v2\/categories?post=5339"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.gordonbuchan.com\/blog\/index.php\/wp-json\/wp\/v2\/tags?post=5339"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}