Linux Server Maintenance Policy - Onderhoudsbeleid

Linux Server Maintenance Policy

(Scroll down for the Dutch version.)

To keep our systems secure and their software up to date, we apply the following policy to all servers that we maintain and administer for Nikhef:

We continuously monitor all our systems for applicable software updates. These updates fall into the following categories, each with its own policy:

Standard updates

This covers updates of standard software, libraries and utility programs to their newest versions. We roll these out continuously, and users should not expect any interruption or other effect.

Standard updates of server software

This category covers updates of the specific software packages that provide our services, for example gitlab, our websites, webmail and email. These updates are also rolled out continuously. They require a restart of the respective services, which we perform at the first opportunity and which causes a short interruption (see the following point).

Critical security updates of server software

We are notified whenever a software update is required to patch a potential security vulnerability. When we receive such a notification, we install the update immediately and restart the respective service as soon as possible. We do this outside of office hours if we can, but depending on the availability of our personnel it may have to happen during the day. This leads to short interruptions of services, but no user work should be lost due to the restart.

Critical updates of the operating system which require a reboot

Security issues or critical software bugs are regularly found in the operating system code or in the Linux kernel itself. In such cases the whole operating system needs to be restarted, which can lead to longer interruptions in the availability of those systems. We always perform these restarts outside of office hours if there is a chance that they affect ongoing work of users. You should therefore always make sure that you properly conclude and save your work before leaving for the day. We only send notifications of such interruptions in specific cases of immediate service disruption (see the following item).

There are a few exceptions to this policy:

- The compute-cluster admins (stoomboot etc.) follow their own policy, which they communicate separately.
- The course environment (les-center.nikhef.nl) is only restarted during weekends after a critical update. It is therefore important to inform your course participants about this and to advise them to always conclude and save their work before the weekend.
- Specific servers that are used for computation, simulation or development (for example the computation servers of the Theory department or the design platforms for ET) are always restarted on the last Sunday of every month. Exceptions can only be requested from the CT department. A notification on the login shell of the respective server indicates whether it is nominated for a reboot. As an authenticated user of such a system you can pre-emptively reboot it yourself once you see the message on the command line that the host requires a reboot. This gives you control over the exact restart time. Be prepared for a reboot to interrupt running computations or simulations, and plan them accordingly.
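As an illustration of the monthly reboot schedule described above, the following Python sketch computes the last Sunday of a given month and checks whether a host has flagged a pending reboot. The function names are hypothetical. The flag file `/var/run/reboot-required` is an assumption based on the Debian/Ubuntu convention; this policy does not state which mechanism produces the login message on the servers concerned.

```python
import calendar
import datetime
from pathlib import Path


def last_sunday_of_month(year: int, month: int) -> datetime.date:
    """Return the date of the last Sunday in the given month."""
    # monthrange() returns (weekday of the 1st, number of days in the month).
    days_in_month = calendar.monthrange(year, month)[1]
    last_day = datetime.date(year, month, days_in_month)
    # date.weekday() counts Monday as 0 and Sunday as 6 (calendar.SUNDAY).
    days_since_sunday = (last_day.weekday() - calendar.SUNDAY) % 7
    return last_day - datetime.timedelta(days=days_since_sunday)


def reboot_pending(flag_path: str = "/var/run/reboot-required") -> bool:
    """Report whether the reboot flag file exists.

    Assumption: the host follows the Debian/Ubuntu convention of creating
    this file when an installed update needs a reboot. Other distributions
    use different mechanisms.
    """
    return Path(flag_path).exists()


if __name__ == "__main__":
    today = datetime.date.today()
    print("Scheduled reboot this month:", last_sunday_of_month(today.year, today.month))
    print("Reboot currently pending:", reboot_pending())
```

A user planning a long-running simulation could compare its expected end date with the result of `last_sunday_of_month` before starting the job.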

Critical updates against active exploits ("Red Alert")

In some extreme cases we receive notifications from international security advisors about critical vulnerabilities that are already being actively exploited by malicious parties around the world. We then have to patch and restart all affected systems immediately. We will notify everyone by email, but because of the immediate danger to our systems we cannot postpone any necessary restarts or reboots.

Larger maintenance of critical nature

At irregular intervals we need to take systems down for more extensive maintenance. We will always announce such maintenance in good time via status.io and email.

Linux Server Onderhoudsbeleid

Om de veilige werking van onze diensten te kunnen waarborgen, en om telkens de nieuwste software aan te kunnen bieden, handhaven wij het volgende beleid voor alle servers die wij voor Nikhef onderhouden en beheren:

Alle server systemen van Nikhef worden continu gecontroleerd op beschikbare software updates. Deze updates worden in de volgende categorieën ingedeeld waarbij telkens een ander beleid geldt:

Standaard updates:

Hierbij gaat het om nieuwere versies van code libraries, standaard software en hulp-programma's op de systemen. Deze voeren wij doorlopend uit en daarbij is voor de gebruikers geen verstoring van hun werk of processen te verwachten.

Standaard updates van server software:

Dit betreft de specifieke software pakketten die algemene diensten aanbieden. Bijvoorbeeld gitlab, de websites, webmail, email en dergelijke. Ook deze updates worden doorlopend uitgevoerd en worden bij de volgende herstart actief. (zie het volgende punt)

Kritieke veiligheidsupdates van server software:

Wij worden geïnformeerd als een update vereist is om een mogelijk veiligheidsrisico te voorkomen. Als wij een melding van deze aard ontvangen, dan wordt de betreffende update meteen uitgevoerd. Vervolgens wordt de betreffende dienst zo snel mogelijk herstart. Wij zullen dat zo veel mogelijk buiten kantooruren doen, maar afhankelijk van de beschikbaarheid van collega's kan het zo zijn dat de herstart van diensten overdag moet gebeuren. Dit betreft alleen het afsluiten en opnieuw opstarten van specifieke programma's die een dienst zoals boven aangegeven leveren. Het gaat daarbij om korte onderbrekingen waarbij geen werk van gebruikers verloren zal gaan.

Kritieke updates die een herstart van het gehele besturingssysteem vereisen:

Er worden met enige regelmaat serieuze veiligheidsproblemen of andere kritieke softwarefouten in de Linux kernel of andere onderdelen van het besturingssysteem gevonden. In dit geval moet het gehele systeem herstart worden. Dit kan tot langere onderbrekingen van de beschikbaarheid leiden (enkele minuten in plaats van een paar seconden). Wij voeren deze herstarts alleen buiten werktijden uit als daarmee mogelijk verlies van werk van gebruikers gepaard kan gaan. Zorg er daarom altijd voor dat je werk opgeslagen en afgesloten is zodra je klaar bent voor die dag. Wij zullen alleen in specifieke gevallen hiervan melding maken (zie het volgende punt). Voor dit beleid gelden een paar uitzonderingen:

- Rekenclusters (stoomboot etc.) volgen een apart beleid. Dit wordt door het Grid/PDP team gecommuniceerd.
- De werkomgeving voor cursussen (les-center.nikhef.nl) wordt tijdens het eerstvolgende weekend na een update herstart. Het is daarom belangrijk om de cursus deelnemers hierover in te lichten om zeker te stellen dat ze voor het weekend hun werk afsluiten.
- Specifieke servers die voor ontwikkeling, berekening en simulaties worden gebruikt (bijvoorbeeld de rekenservers van de Theorie afdeling, of de design platformen van ET) zullen altijd op de laatste zondag van elke maand herstart worden. Een incidentele uitzondering hierop is alleen mogelijk na aanvraag bij de CT afdeling. Het zal bij login op de server shell te zien zijn dat deze op de nominatie voor een herstart staat. Toegelaten gebruikers van deze systemen kunnen ze zelfstandig herstarten als ze bij inloggen op de command-line shell het bericht zien dat de server herstart dient te worden. Op die manier kan men de controle zelf in handen houden wanneer de herstart gebeurt. Wees dus altijd erop voorbereid dat hierdoor lopende rekenopdrachten of simulaties onderbroken kunnen worden en neem dat mee in de planning hiervan.

Kritieke updates die tegen actieve exploits nodig zijn ("Red Alerts")

In extreme gevallen kan het voorkomen dat wij van veiligheidsadviseurs melding ontvangen over een kritiek veiligheidslek dat op dit moment al actief misbruikt wordt door malafide partijen wereldwijd. In dit geval zijn wij verplicht onmiddellijk in actie te komen. Wij zullen hierover wel een melding versturen, maar er kan in dit soort gevallen niet gewacht worden met een herstart van welke aard ook.

Groter onderhoud van kritieke aard

Op onregelmatige tijden zullen wij groter onderhoud aan systemen moeten doen waarbij rekening met langere onderbrekingen moet worden gehouden. Deze zullen we altijd vooraf en breed aankondigen via status.io en email.