LLM Server
Aim: This page describes how to use the experimental Nikhef LLM service.
Target audience: We assume general familiarity with LLMs.
Purpose
There are many services that offer interaction with LLMs, such as OpenAI's ChatGPT or various Hugging Face spaces. There are several reasons why we decided that an experimental service for running LLMs on Nikhef hardware was a good idea:
- When you interact with an LLM as a service, it is often not clear what happens to the data that you send it. If this data is sensitive, you may not even be permitted to send it. Running models on Nikhef hardware ensures that your data doesn't leave Nikhef;
- Together with you, we choose which models to run;
- It allows experimentation with models;
- There is, in principle, no limit on which model features can be used, other than the availability of hardware.
Infrastructure
The service is based on the Ollama server and model library. To make the service more secure, Ollama has been extended with authentication based on JSON Web Tokens (JWT).
There are five things available as part of this service:
- A backend server that offers an authenticated API to interact with the models. The API is compatible with both the OpenAI API and the Ollama API;
- A simple terminal client;
- A web-based chat client;
- A web page that allows you to view your personal token;
- A server that generates tokens.
Models
The server currently has two AMD GPUs - an MI210 and a W6800 - that run inference for the models. Three models are currently available, all based on the Llama 3.1 set of base models from Meta.
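The exact set of hosted models may change over time, and you can always ask the server what is currently available. Below is a minimal sketch in Python that lists the models through the Ollama API; it assumes your token (see the next section) is accepted as a Bearer token in the Authorization header:

import requests

# YOUR_TOKEN is a placeholder; obtain your token from https://plofkip.nikhef.nl
token = "YOUR_TOKEN"

# Ollama's standard model-listing endpoint; the Bearer scheme is an
# assumption based on the JWT authentication described above.
resp = requests.get(
    "https://plofkip.nikhef.nl:11443/api/tags",
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()
for model in resp.json()["models"]:
    print(model["name"])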
Usage
Everything is running on plofkip.nikhef.nl, which is only accessible from inside Nikhef or by using the eduVPN Institute Access (IA) profile.
To allow the MI210 GPU - which has more memory and runs the larger model - to also be used for other tasks, the model is unloaded from the GPU after 15 minutes of inactivity. As a result, the first query after the model has been idle for 15 minutes takes longer, up to 30 seconds.
API server and Token
The server provides two APIs:
- Ollama API: https://plofkip.nikhef.nl:11443/
- OpenAI API: https://plofkip.nikhef.nl:11443/v1/
To access these APIs, you'll need a token. This token is similar to the API keys you may know from other AI services. You can view your token by visiting https://plofkip.nikhef.nl. An important difference between your token and an API key is that your token expires after 30 days. If your token has expired, a new one will be generated for you when you visit https://plofkip.nikhef.nl.
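As a quick illustration, the sketch below sends a chat request through the OpenAI-compatible endpoint using the openai Python package. The model name is taken from the Emacs example further down and may differ from what is currently deployed; replace YOUR_TOKEN with your own token:

from openai import OpenAI

# Point the client at the Nikhef server instead of api.openai.com;
# the token from https://plofkip.nikhef.nl takes the place of an API key.
client = OpenAI(
    base_url="https://plofkip.nikhef.nl:11443/v1/",
    api_key="YOUR_TOKEN",
)

response = client.chat.completions.create(
    model="llama3.1:70b-instruct-q4_K_M",  # one of the hosted models
    messages=[{"role": "user", "content": "What is a muon?"}],
)
print(response.choices[0].message.content)

Note that the first request after 15 minutes of inactivity can take up to 30 seconds while the model is loaded back onto the GPU.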
Web chat client
The web chat client is available at https://plofkip.nikhef.nl/chat. The model that you chat with can be selected by clicking on "Models" at the bottom left of the page. Past chats are saved under your email address (obtained from the SSO) and can be selected to continue them.
Terminal client
If you prefer a terminal client to chat with the models, you can log in to any of the interactive stoomboot nodes and run
This script will fetch your token and then start the oterm terminal chat client. For more information on how to use oterm, please visit its homepage.
VS Code
To integrate the LLM server with VS Code and its variants, you can use the Continue extension. It can be installed by opening the command palette (Ctrl+Shift+P) and running ext install Continue.continue. Once Continue is installed, open its settings (typically $HOME/.continue/config.json) and add the following section, replacing YOUR_TOKEN with the token you got from plofkip:
"models": [
{
"model": "AUTODETECT",
"title": "OpenAI",
"apiBase": "https://plofkip.nikhef.nl:11443/v1/",
"apiKey": "YOUR_TOKEN",
"provider": "openai"
}
],
You can then start a new session with Ctrl+Shift+L. The list of models should show up for selection. Have a look at the Continue documentation for more information.
Emacs
To integrate the LLM server with your Emacs editor, you can use the llm and ellama packages. You have to configure the available models by hand. The example below is based on the currently available models; replace YOUR_TOKEN with the token you obtain from plofkip.
(setq ellama-key "YOUR_TOKEN")
(use-package ellama
  :init
  ;; set up key bindings
  (setopt ellama-keymap-prefix "C-c e")
  ;; language you want ellama to translate to
  (setopt ellama-language "English")
  ;; the providers below use the OpenAI-compatible API
  (require 'llm-openai)
  (setopt ellama-providers
          '(("llama3.1-70b" . (make-llm-openai-compatible
                               :key ellama-key
                               :url "https://plofkip.nikhef.nl:11443/v1/"
                               :chat-model "llama3.1:70b-instruct-q4_K_M"
                               :embedding-model "mxbai-embed-large"))
            ("codestral" . (make-llm-openai-compatible
                            :key ellama-key
                            :url "https://plofkip.nikhef.nl:11443/v1/"
                            :chat-model "codestral:latest"
                            :embedding-model "mxbai-embed-large"))
            ("codellama-13b" . (make-llm-openai-compatible
                                :key ellama-key
                                :url "https://plofkip.nikhef.nl:11443/v1/"
                                :chat-model "codellama:13b-instruct-q8_0"
                                :embedding-model "mxbai-embed-large"))))
  (setopt ellama-naming-scheme 'ellama-generate-name-by-llm))
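Once configured, you can start a chat with M-x ellama-chat or explore the other commands under the C-c e prefix set above; see the ellama documentation for details.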