03 Mar 2024

For a while now I've been using Bottlerocket as the base OS (i.e. the AMI) for instances in the cloud, instead of Amazon Linux, Ubuntu, etc. Until recently our workload didn't have much use for local storage, so I finally invested some time in figuring out how to actually make use of the local SSDs where available (in AWS this type of storage is usually called instance store).

Some Bottlerocket concepts:

  • Minimalist, container-centric
  • There is no ssh or even a login shell by default
  • It provides 'host containers' that are not orchestrated (not part of Kubernetes)
    • Control container - this is the one that allows us to connect via SSM
    • Admin container - this is the one that allows us to interact with the host as root
    • Inside the admin container, the host's root file system is mounted at /.bottlerocket/rootfs, and there you can find /dev etc.
  • You can launch bootstrap containers if you need to run tasks, instead of using user-data scripts

Typically you don't connect to these instances, but if you need to you can enable the control and admin containers in the user-data settings, connect to the control container and then type enter-admin-container. To connect to the control container use SSM (easiest via the EC2 web console, but it also works with aws ssm start-session --target <instanceid>).
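
For example, from a workstation with the AWS CLI and the SSM session plugin installed, the flow looks roughly like this (the instance ID is a placeholder, and sudo sheltie assumes the admin container's default tooling):

# Open an SSM session; this lands in the control container
aws ssm start-session --target i-0123456789abcdef0

# From the control container, drop into the admin container
enter-admin-container

# From the admin container, get a root shell in the host's namespaces
sudo sheltie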

You can learn more on Bottlerocket's GitHub, in the concept overview and in the FAQ (https://bottlerocket.dev/en/faq/).

Bottlerocket is configured by providing settings via user data in TOML format; what you provide will be merged with the defaults.
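
The same settings tree can also be inspected or changed on a running node with apiclient (available from the control or admin container); a quick sketch, using the admin host container setting as the example:

# Read a subtree of the current settings
apiclient get settings.host-containers

# Change a setting at runtime; like user data, it is merged into the settings
apiclient set settings.host-containers.admin.enabled=true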

Configuring Bottlerocket in node groups

module "my-nodegroup" {
  source  = "terraform-aws-modules/eks/aws//modules/eks-managed-node-group"
  ...
  ami_type = "BOTTLEROCKET_ARM_64"
  platform = "bottlerocket"
  bootstrap_extra_args = <<-EOT
    [settings.host-containers.admin]
    enabled = true
    [settings.host-containers.control]
    enabled = true
    [settings.xxx.xxx]
    setting1 = "value"
    setting2 = "value"
    setting3 = true
  EOT
}

Configuring Bottlerocket in Karpenter

See the default values that Karpenter sets.

In recent Karpenter versions on AWS you define an EC2NodeClass, which basically describes the image to be used and where it will be placed, and a NodePool, which defines the hardware it will run on.

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: myproject
spec:
  amiFamily: Bottlerocket
  userData: |
    [settings.host-containers.admin]
    enabled = true
    [settings.host-containers.control]
    enabled = true
    [settings.xxx.xxx]
    setting1 = "value"
    setting2 = "value"
    setting3 = true
  subnetSelectorTerms:
    - tags:
        Name: "us-east-1a"
    - tags:
        Name: "us-east-1b"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"
  instanceProfile: eksctl-KarpenterNodeInstanceProfile-my-cluster

---

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: myproject
spec:
  limits:
    # limit how much capacity this pool can actually create; we limit by CPU only
    cpu: 100
  template:
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: myproject
      requirements:
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot"]
        - key: "kubernetes.io/arch"
          operator: In
          values: ["arm64"]
        - key: "karpenter.k8s.aws/instance-cpu"
          operator: In
          values: ["4", "8"]
        - key: "karpenter.k8s.aws/instance-generation"
          operator: Gt
          values: ["6"]
        - key: "karpenter.k8s.aws/instance-category"
          operator: In
          values: ["c", "m", "r"]
        # alternatively
        #- key: "node.kubernetes.io/instance-type"
        #  operator: In
        #  values:
        #  - "c6g.xlarge"
        #  - "m7g.xlarge"

This article discusses some options for pre-caching images (which we won't do here), but it also illustrates the architecture of Bottlerocket OS and how it can be combined with Karpenter.

Storage in Bottlerocket (permanent, ephemeral and local)

Bottlerocket operates with two default storage volumes: root and data. The root volume is read-only, and the data volume is used as persistent storage (EBS that survives reboots) for the non-Kubernetes containers that run on the instance. The data volume is a 20 GB EBS volume and the root device is around 4 GB.
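
You can see both volumes from the admin container (a sketch; exact device names vary by instance type, and lsblk assumes the admin container's Amazon Linux tooling):

# The host's devices are visible under the mounted root file system
ls -l /.bottlerocket/rootfs/dev/nvme*n1

# Sizes make the two EBS volumes easy to spot: ~4 GB root, 20 GB data
lsblk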

Check the Karpenter default volume configuration

Now, the whole point of this post is to show how you can use the local SSD disks that machines often have but that Amazon makes particularly cumbersome to use. Many instance types have local SSD storage that will show up inside the host as an extra unmounted device (e.g. /dev/nvme2n1). How do you make this available to Kubernetes in an automated way?

Kubernetes Storage local static provisioner

Learn more on the sig-storage-local-static-provisioner GitHub page and in the getting started guide.

The way we expose local disks to Kubernetes as resources is via a storage class called 'fast-disks'. The local provisioner creates persistent volumes of that type by discovering local storage:

  • It expects the host to have a folder called /mnt/fast-disks (configurable)
  • It expects that folder to contain symlinks to the device files of the drives we want exposed (as sketched below)
  • It is installed by generating a configuration file using helm and applying it with kubectl
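
Concretely, on each node the layout should end up looking something like this (the device name is just an example):

# One symlink per local disk we want exposed to Kubernetes
ls -l /mnt/fast-disks
# lrwxrwxrwx 1 root root 17 ... nvme2n1 -> ../../dev/nvme2n1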

Before going further we need to define a storage class that will be used to flag the new type of storage.

It looks like this and effectively does nothing (see the no-provisioner):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-disks
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
# Supported policies: Delete, Retain
reclaimPolicy: Delete

kubectl apply -f ./default_example_storageclass.yaml

Then set up the local provisioner daemonset that will handle it:

# Generate the configuration file
helm repo add sig-storage-local-static-provisioner https://kubernetes-sigs.github.io/sig-storage-local-static-provisioner
helm template --debug sig-storage-local-static-provisioner/local-static-provisioner --version 2.0.0 --namespace myproject > local-volume-provisioner.generated.yaml

# edit local-volume-provisioner.generated.yaml if necessary
# optional: kubectl diff -f local-volume-provisioner.generated.yaml
kubectl apply -f local-volume-provisioner.generated.yaml

See more about the installation procedure

This creates a daemonset that runs on every node.
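
You can check that one provisioner pod is running per node (the namespace matches the one used in the helm template command above):

kubectl -n myproject get daemonset,pods -o wide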

These are some excerpts of what it creates:

The provisioner config map - note that by default it uses shred.sh to wipe disks, which is rather slow; you can change it to other methods. The block cleaner runs when a released volume is reclaimed, not on every boot.

# Source: local-static-provisioner/templates/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: release-name-local-static-provisioner-config
  namespace: myproject
  labels:
    helm.sh/chart: local-static-provisioner-2.0.0
    app.kubernetes.io/name: local-static-provisioner
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/instance: release-name
data:
  storageClassMap: |
    fast-disks:
      hostDir: /mnt/fast-disks
      mountDir: /mnt/fast-disks
      blockCleanerCommand:
        - "/scripts/shred.sh"
        - "2"
      volumeMode: Filesystem
      fsType: ext4
      namePattern: "*"
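
Once the provisioner has created PVs, a pod consumes one through a normal persistent volume claim against the fast-disks class. A minimal sketch (the names and requested size are made up for illustration):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: scratch
spec:
  storageClassName: fast-disks
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: scratch-user
spec:
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "df -h /scratch && sleep 3600"]
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    persistentVolumeClaim:
      claimName: scratch
EOF

Because the storage class uses WaitForFirstConsumer, the PVC stays Pending until the pod is scheduled onto a node that has a matching local PV.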

Joining local static provisioner, AWS instance store and Bottlerocket

Bootstrap containers run with root access and with the host's root file system mounted at /.bottlerocket/rootfs.

You can create one like this, and publish to your own repository:

FROM alpine:latest
RUN apk add --no-cache util-linux
RUN echo '#!/bin/sh' > /script.sh \
    && echo 'mkdir -p /.bottlerocket/rootfs/mnt/fast-disks' >> /script.sh \
    && echo 'cd /.bottlerocket/rootfs/mnt/fast-disks' >> /script.sh \
    && echo 'for device in $(ls /.bottlerocket/rootfs/dev/nvme*n1); do' >> /script.sh \
    && echo '  base_device=$(basename $device)' >> /script.sh \
    && echo '  if ! mount | grep -q "$base_device"; then' >> /script.sh \
    && echo '    [ ! -e "./$base_device" ] && ln -s "../../dev/$base_device" "./$base_device"' >> /script.sh \
    && echo '  fi' >> /script.sh \
    && echo 'done' >> /script.sh \
    && echo 'ls -l /.bottlerocket/rootfs/mnt/fast-disks' >> /script.sh \
    && chmod +x /script.sh
CMD ["/script.sh"]

Then add this to the Bottlerocket settings:

[settings.bootstrap-containers.diskmounter]
essential = true
mode = "always"
source = "YOURACCOUNT.ecr.public.ecr.aws/YOURREPO:latest"

Caveat: node cleanup

By default with this setup, if an instance dies (which in AWS can happen at any minute) the persistent volume claim will remain.

The life cycle goes roughly like this:

  • A pod is created that has a persistent volume claim for a fast-disk
  • The claim is allocated on one of the instances
  • Once the claim is bound, the pod is pinned to that instance
  • If the instance dies, the pod remains pinned to that specific instance, and deleting the pod won't fix it

This will manifest as an error about the pod not being able to find nodeinfo for the now defunct node.
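
Until the cleanup controller described below is in place, you can spot and remove the stale objects by hand (the claim name is illustrative):

# Local PVs are pinned to nodes via nodeAffinity; list the fast-disks ones
kubectl get pv | grep fast-disks

# Delete the stale claim so the pod can be recreated elsewhere
kubectl delete pvc my-fast-disk-claim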

Cleaning up these stale objects is handled by the node-cleanup-controller.

To set it up:

apiVersion: apps/v1
kind: Deployment
...
    spec:
...
      containers:
      - name: local-volume-node-cleanup-controller
        image: gcr.io/k8s-staging-sig-storage/local-volume-node-cleanup:canary
        args:
          - "--storageclass-names=fast-disks"
          - "--pvc-deletion-delay=60s"
          - "--stale-pv-discovery-interval=10s"
        ports:
          - name: metrics
            containerPort: 8080

Now apply both the deployment and the permissions:

kubectl apply -f ./deployment.yaml
kubectl apply -f ./permissions.yaml

You will end up with:

  • A new storage class called 'fast-disks'
  • The CleanupController looks for Local Persistent Volumes that have a NodeAffinity to a deleted Node. When it finds such a PV, it starts a timer to wait and see if the deleted Node comes back up again. If, at the end of the timer, the Node is not back up, the PVC bound to that PV is deleted. The PV is deleted in the next step.
  • The Deleter looks for Local PVs with a NodeAffinity to deleted Nodes. When it finds such a PV it deletes the PV if (and only if) the PV's status is Available or if its status is Released and it has a Delete reclaim policy.
