LLM Server
Aim: This page describes how to use the experimental Nikhef LLM service.
Target audience: We assume general familiarity with LLMs.
Purpose
There are many services that offer interaction with LLMs, such as OpenAI's ChatGPT or various Hugging Face spaces. There are several reasons why we decided that an experimental service for running LLMs on Nikhef hardware was a good idea:
- When you interact with an LLM as a service, it is often not clear what happens to the data that you send it. If this data is sensitive, you may not even be permitted to send it. Running models on Nikhef hardware ensures that your data doesn't leave Nikhef;
- Together with you, we choose which models to run;
- It allows experimentation with models;
- There is, in principle, no limit on which model features can be used, other than the availability of hardware.
Infrastructure
The service is based on the Ollama server and model library. To make the service more secure, Ollama has been extended with authentication based on JSON Web Tokens (JWT).
There are five things available as part of this service:
- A backend server that offers an authenticated API to interact with the models. The API is compatible with both the OpenAI API and the Ollama API;
- A simple terminal client;
- A web-based chat client;
- A web page that allows you to view your personal token;
- A server that generates tokens.
Models
The server currently has two AMD GPUs - an MI210 and a W6800 - that run inference for the models. Three models are currently available, all based on the Llama 3.1 set of base models from Meta.
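The exact set of hosted models may change over time, and you can always ask the server what is currently available. Below is a minimal sketch in Python that lists the models through the Ollama API; it assumes your token (see the next section) is accepted as a Bearer token in the Authorization header:

import requests

# YOUR_TOKEN is a placeholder; obtain your token from https://plofkip.nikhef.nl
token = "YOUR_TOKEN"

# Ollama's standard model-listing endpoint; the Bearer scheme is an
# assumption based on the JWT authentication described above.
resp = requests.get(
    "https://plofkip.nikhef.nl:11443/api/tags",
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()
for model in resp.json()["models"]:
    print(model["name"])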
Usage
Everything is running on plofkip.nikhef.nl, which is only accessible from inside Nikhef or by using the eduVPN Institute Access (IA) profile.
To allow the MI210 GPU - which has more memory and runs the larger model - to also be used for other tasks, the model is unloaded from the GPU after 15 minutes of inactivity. As a result, the first query after the model has been idle for 15 minutes takes longer, up to 30 seconds.
API server and Token
The server provides two APIs:
- Ollama API: https://plofkip.nikhef.nl:11443/
- OpenAI API: https://plofkip.nikhef.nl:11443/v1/
To access these APIs, you'll need a token. This token is similar to the API keys you may know from other AI services. You can view your token by visiting https://plofkip.nikhef.nl. An important difference between your token and an API key is that your token expires after 30 days. If your token has expired, a new one will be generated for you when you visit https://plofkip.nikhef.nl.
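As a quick illustration, the sketch below sends a chat request through the OpenAI-compatible endpoint using the openai Python package. The model name is taken from the Emacs example further down and may differ from what is currently deployed; replace YOUR_TOKEN with your own token:

from openai import OpenAI

# Point the client at the Nikhef server instead of api.openai.com;
# the token from https://plofkip.nikhef.nl takes the place of an API key.
client = OpenAI(
    base_url="https://plofkip.nikhef.nl:11443/v1/",
    api_key="YOUR_TOKEN",
)

response = client.chat.completions.create(
    model="llama3.1:70b-instruct-q4_K_M",  # one of the hosted models
    messages=[{"role": "user", "content": "What is a muon?"}],
)
print(response.choices[0].message.content)

Note that the first request after 15 minutes of inactivity can take up to 30 seconds while the model is loaded back onto the GPU.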
Web chat client
The web chat client is available at https://plofkip.nikhef.nl/chat. The model that you chat with can be selected by clicking on "Models" at the bottom left of the page. Past chats are saved under your email address (obtained from the SSO) and can be selected to continue them.
Terminal client
If you prefer a terminal client to chat with the models, you can log in to any of the interactive stoomboot nodes and run
This script will fetch your token and then start the oterm terminal chat client. For more information on how to use oterm, please visit its homepage.
VS Code
To integrate the LLM server with VS Code and its variants, you can use the Continue extension. It can be installed by opening the command palette (Ctrl+Shift+P) and running ext install Continue.continue. Once Continue is installed, open its settings (typically $HOME/.continue/config.json) and add the following section, replacing YOUR_TOKEN with the token you got from plofkip:
"models": [
{
"model": "AUTODETECT",
"title": "OpenAI",
"apiBase": "https://plofkip.nikhef.nl:11443/v1/",
"apiKey": "YOUR_TOKEN",
"provider": "openai"
}
],
You can then start a new session with Ctrl+Shift+L. The list of models should show up for selection. Have a look at the Continue documentation for more information.
Emacs
To integrate the LLM server with your Emacs editor, you can use the llm and ellama packages. You have to configure the available models by hand. The example below is based on the currently available models; replace YOUR_TOKEN with the token you obtain from plofkip.
(setq ellama-key "YOUR_TOKEN")
(use-package ellama
  :init
  ;; set up key bindings
  (setopt ellama-keymap-prefix "C-c e")
  ;; language you want ellama to translate to
  (setopt ellama-language "English")
  ;; the providers below use the OpenAI-compatible API
  (require 'llm-openai)
  (setopt ellama-providers
          '(("llama3.1-70b" . (make-llm-openai-compatible
                               :key ellama-key
                               :url "https://plofkip.nikhef.nl:11443/v1/"
                               :chat-model "llama3.1:70b-instruct-q4_K_M"
                               :embedding-model "mxbai-embed-large"))
            ("codestral" . (make-llm-openai-compatible
                            :key ellama-key
                            :url "https://plofkip.nikhef.nl:11443/v1/"
                            :chat-model "codestral:latest"
                            :embedding-model "mxbai-embed-large"))
            ("codellama-13b" . (make-llm-openai-compatible
                                :key ellama-key
                                :url "https://plofkip.nikhef.nl:11443/v1/"
                                :chat-model "codellama:13b-instruct-q8_0"
                                :embedding-model "mxbai-embed-large"))))
  (setopt ellama-naming-scheme 'ellama-generate-name-by-llm))
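Once configured, you can start a chat with M-x ellama-chat or explore the other commands under the C-c e prefix set above; see the ellama documentation for details.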