For a while now I've been using Bottlerocket as the base OS for instances in the cloud (i.e. the AMI), instead of using Amazon Linux or Ubuntu etc. In our workload we don't really have much use for local storage until recently so I finally invested some time in figuring out how to actually make use of the local SSDs where available (usually this type of storage is called instance-store in AWS).
Some Bottlerocket concepts: the host's root filesystem is visible to containers at /.bottlerocket/rootfs, and there you can find /dev etc. Typically you don't connect to these instances, but if you need to you can enable the control and admin containers in the user-data settings, connect to the control container via SSM (easiest via the EC2 web console, but also with aws ssm start-session --target <instance-id>), and then type enter-admin-container.
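Assuming SSM is set up and both host containers are enabled, a session looks roughly like this (the instance ID is a placeholder):

```shell
# Open a shell in the control container via SSM
aws ssm start-session --target i-0123456789abcdef0

# From the control container, drop into the admin container
enter-admin-container

# Inside the admin container, sheltie gives a full root shell on the host
sudo sheltie
ls /dev/nvme*
```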
You can learn more on Bottlerocket's GitHub, their concept overview, and the FAQ (https://bottlerocket.dev/en/faq/).
Bottlerocket is configured by providing settings via user data in TOML format and what you provide will be merged with the defaults.
module "my-nodegroup" {
  source = "terraform-aws-modules/eks/aws//modules/eks-managed-node-group"
  ...
  ami_type = "BOTTLEROCKET_ARM_64"
  platform = "bottlerocket"

  bootstrap_extra_args = <<-EOT
    [settings.host-containers.admin]
    enabled = true

    [settings.host-containers.control]
    enabled = true

    [settings.xxx.xxx]
    setting1 = "value"
    setting2 = "value"
    setting3 = true
  EOT
}
See the default values that Karpenter sets.
In Karpenter these days for AWS you define an EC2NodeClass, which basically describes the image to be used and where it will be placed, and a NodePool, which defines the hardware it will run on.
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: myproject
spec:
  amiFamily: Bottlerocket
  userData: |
    [settings.host-containers.admin]
    enabled = true
    [settings.host-containers.control]
    enabled = true
    [settings.xxx.xxx]
    setting1 = "value"
    setting2 = "value"
    setting3 = true
  subnetSelectorTerms:
    - tags:
        Name: "us-east-1a"
    - tags:
        Name: "us-east-1b"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"
  instanceProfile: eksctl-KarpenterNodeInstanceProfile-my-cluster
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: myproject
spec:
  limits:
    # limit how many instances this can actually create; we limit by CPU only
    cpu: 100
  template:
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: myproject
      requirements:
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot"]
        - key: "kubernetes.io/arch"
          operator: In
          values: ["arm64"]
        - key: "karpenter.k8s.aws/instance-cpu"
          operator: In
          values: ["4", "8"]
        - key: "karpenter.k8s.aws/instance-generation"
          operator: Gt
          values: ["6"]
        - key: "karpenter.k8s.aws/instance-category"
          operator: In
          values: ["c", "m", "r"]
        # alternatively
        #- key: "node.kubernetes.io/instance-type"
        #  operator: In
        #  values:
        #    - "c6g.xlarge"
        #    - "m7g.xlarge"
This article discusses some options for pre-caching images that we won't do, but illustrates the architecture of Bottlerocket OS and how it can be combined with Karpenter.
Bottlerocket operates with two default storage volumes: root and data. The root volume is read-only, and the data volume is persistent storage (an EBS volume that survives reboots) used by non-Kubernetes containers that run on the instance. By default the data volume is a 20 GB EBS volume and the root volume is around 4 GB.
Check the Karpenter default volume configuration
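If those defaults don't fit, the EC2NodeClass accepts a blockDeviceMappings override. The sketch below mirrors what I understand Karpenter's Bottlerocket defaults to be; device names and sizes here are illustrative, so verify them against your Karpenter version:

```yaml
spec:
  blockDeviceMappings:
    # root volume (the read-only OS image)
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 4Gi
        volumeType: gp3
    # data volume (container images and persistent data)
    - deviceName: /dev/xvdb
      ebs:
        volumeSize: 20Gi
        volumeType: gp3
```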
Now the whole point of this post is to show how you can use the local SSD disks that machines often have but that Amazon makes particularly cumbersome to use. Many instances have local SSD storage that shows up inside the host as an extra unmounted device (e.g. /dev/nvme2n1). How do you make this available to Kubernetes in an automated way?
Learn more on the storage-local-static-provisioner github page and the getting started guide.
The way we expose local disks to Kubernetes as resources is via a storage class called 'fast-disks'. The local provisioner creates persistent volumes of that class by discovering local storage.
Before going further we need to define a storage class that will be used to flag the new type of storage.
It looks like this and effectively does nothing (see the no-provisioner):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-disks
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
# Supported policies: Delete, Retain
reclaimPolicy: Delete
kubectl apply -f ./default_example_storageclass.yaml
Then set up the local provisioner daemonset that will handle it:
# Generate the configuration file
helm repo add sig-storage-local-static-provisioner https://kubernetes-sigs.github.io/sig-storage-local-static-provisioner
helm template --debug sig-storage-local-static-provisioner/local-static-provisioner --version 2.0.0 --namespace myproject > local-volume-provisioner.generated.yaml
# edit local-volume-provisioner.generated.yaml if necessary
# optional: kubectl diff -f local-volume-provisioner.generated.yaml
kubectl apply -f local-volume-provisioner.generated.yaml
See more about the installation procedure
This creates a daemonset that runs in all instances.
These are some excerpts of what it creates:
The provisioner configuration - note that by default it will use shred.sh
to clean the disk, which is rather slow. You can change it to other methods.
# Source: local-static-provisioner/templates/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: release-name-local-static-provisioner-config
  namespace: myproject
  labels:
    helm.sh/chart: local-static-provisioner-2.0.0
    app.kubernetes.io/name: local-static-provisioner
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/instance: release-name
data:
  storageClassMap: |
    fast-disks:
      hostDir: /mnt/fast-disks
      mountDir: /mnt/fast-disks
      blockCleanerCommand:
        - "/scripts/shred.sh"
        - "2"
      volumeMode: Filesystem
      fsType: ext4
      namePattern: "*"
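If shredding is too slow for your use case, the provisioner image ships alternative cleaner scripts that can be swapped into the storageClassMap; for example (a sketch — check which scripts your provisioner version actually includes):

```yaml
data:
  storageClassMap: |
    fast-disks:
      hostDir: /mnt/fast-disks
      mountDir: /mnt/fast-disks
      # quick_reset.sh only wipes filesystem metadata,
      # much faster than a full shred of the device
      blockCleanerCommand:
        - "/scripts/quick_reset.sh"
```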
Bootstrap containers run at boot, have root access, and see the host's root file system mounted at /.bottlerocket/rootfs.
You can create one like this, and publish to your own repository:
FROM alpine:latest
RUN apk add --no-cache util-linux
RUN echo '#!/bin/sh' > /script.sh \
    && echo 'mkdir -p /.bottlerocket/rootfs/mnt/fast-disks' >> /script.sh \
    && echo 'cd /.bottlerocket/rootfs/mnt/fast-disks' >> /script.sh \
    && echo 'for device in /.bottlerocket/rootfs/dev/nvme*n1; do' >> /script.sh \
    && echo '  base_device=$(basename $device)' >> /script.sh \
    && echo '  if ! mount | grep -q "$base_device"; then' >> /script.sh \
    && echo '    [ ! -e "./$base_device" ] && ln -s "../../dev/$base_device" "./$base_device"' >> /script.sh \
    && echo '  fi' >> /script.sh \
    && echo 'done' >> /script.sh \
    && echo 'ls -l /.bottlerocket/rootfs/mnt/fast-disks' >> /script.sh \
    && chmod +x /script.sh
CMD ["/script.sh"]
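The symlinking logic can be sanity-checked outside Bottlerocket with stand-in paths; here a temp directory stands in for /.bottlerocket/rootfs and the device nodes are fake files:

```shell
#!/bin/sh
# Build a fake rootfs with two pretend local NVMe devices
ROOT=$(mktemp -d)
mkdir -p "$ROOT/dev" "$ROOT/mnt/fast-disks"
touch "$ROOT/dev/nvme8n1" "$ROOT/dev/nvme9n1"

# Same loop as the bootstrap script: symlink each unmounted device
cd "$ROOT/mnt/fast-disks"
for device in "$ROOT"/dev/nvme*n1; do
  base_device=$(basename "$device")
  if ! mount | grep -q "$base_device"; then
    [ ! -e "./$base_device" ] && ln -s "../../dev/$base_device" "./$base_device"
  fi
done
ls -l "$ROOT/mnt/fast-disks"
```

From $ROOT/mnt/fast-disks, the relative target ../../dev/<device> resolves to $ROOT/dev/<device>; on a real host the same relative link resolves from /mnt/fast-disks to /dev, which is why the script uses a relative rather than absolute target.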
Add this to the Bottlerocket settings (the source below uses public ECR placeholders; for a private ECR repository the form is ACCOUNT.dkr.ecr.REGION.amazonaws.com/REPO:TAG):
[settings.bootstrap-containers.diskmounter]
essential = true
mode = "always"
source = "public.ecr.aws/YOURALIAS/YOURREPO:latest"
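Once the bootstrap container has symlinked the devices into /mnt/fast-disks, the provisioner daemonset should discover them and create one PersistentVolume per disk. A workload can then claim one with an ordinary PVC; this is a sketch with made-up names and sizes:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: scratch
  namespace: myproject
spec:
  storageClassName: fast-disks
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi
---
# mount the claim in a pod as usual
apiVersion: v1
kind: Pod
metadata:
  name: scratch-user
  namespace: myproject
spec:
  containers:
    - name: app
      image: alpine:latest
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: scratch
          mountPath: /scratch
  volumes:
    - name: scratch
      persistentVolumeClaim:
        claimName: scratch
```

Because the storage class uses WaitForFirstConsumer, the PVC stays Pending until a pod is scheduled onto a node that actually has one of these local volumes.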
By default, when you set this up, if an instance dies (which in AWS can happen any minute) the persistent volume and its claim will remain. Basically the life cycle is: the node disappears, the PVs and PVCs that pointed at it stay behind, and pods that try to bind them get stuck. This will manifest as an error about the pod not being able to find nodeinfo
for the now defunct node.
This cleaning process is achieved by the node-cleanup-controller
To set it up:
apiVersion: apps/v1
kind: Deployment
...
spec:
  ...
  containers:
    - name: local-volume-node-cleanup-controller
      image: gcr.io/k8s-staging-sig-storage/local-volume-node-cleanup:canary
      args:
        - "--storageclass-names=fast-disks"
        - "--pvc-deletion-delay=60s"
        - "--stale-pv-discovery-interval=10s"
      ports:
        - name: metrics
          containerPort: 8080
Now apply both the deployment and the permissions:
kubectl apply -f ./deployment.yaml
kubectl apply -f ./permissions.yaml
You will end up with