- Runs a Bytewax dataflow on a Kubernetes cluster.
```shell
helm repo add bytewax https://bytewax.github.io/helm-charts
helm repo update
```
See helm repo for command documentation.
To install the chart with the release name `my-release`:

```shell
helm install my-release bytewax/bytewax
```
This version requires Helm >= 3.1.0.
These are the dependencies which are disabled by default:
- open-telemetry/opentelemetry-collector
- jaegertracing/jaeger
- prometheus-community/kube-prometheus-stack
For more details about this, read the Observability section.
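Each of them can be turned on with its `enabled` flag in your values. As a minimal sketch using the toggles documented in the parameters table below:

```yaml
# Enable the optional observability dependencies (all of them default to false)
opentelemetry-collector:
  enabled: true
jaeger:
  enabled: true
kubePrometheusStack:
  enabled: true
```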
To uninstall/delete the `my-release` deployment:

```shell
helm delete my-release
```
The command removes all the Kubernetes components associated with the chart and deletes the release.
Parameter | Description | Default |
---|---|---|
image.repository | Image repository | bytewax/bytewax |
image.tag | Image tag | 0.19.1-python3.9 |
image.pullPolicy | Image pull policy | Always |
imagePullSecrets | Image pull secrets | [] |
serviceAccount.create | Create service account | true |
serviceAccount.annotations | Annotations to add to the service account | {} |
serviceAccount.name | Service account name to use; when empty it is set to the created account if serviceAccount.create is true, otherwise to default | `` |
extralabels | Labels to add to common labels | {} |
podLabels | Statefulset/Job pod labels | {} |
podAnnotations | Statefulset/Job pod annotations | {} |
podSecurityContext | Statefulset/Job pod securityContext | {"runAsNonRoot": true, "runAsUser": 65532, "runAsGroup": 3000, "fsGroup": 2000} |
containerName | Statefulset/Job application container name | process |
securityContext | Statefulset/Job containers securityContext | {"allowPrivilegeEscalation": false, "capabilities": {"drop": ["ALL"], "add": ["NET_BIND_SERVICE"]}, "readOnlyRootFilesystem": true} |
service.port | Kubernetes port where the internal bytewax service is exposed | 9999 |
api.enabled | Create resources related to the dataflow API | true |
api.port | Kubernetes port where the dataflow API service is exposed | 3030 |
api.cacheport | Kubernetes port where the dataflow API cache service is exposed | 3033 |
resources | CPU/Memory resource requests/limits | {} |
nodeSelector | Node labels for pod assignment | {} |
tolerations | Toleration labels for pod assignment | [] |
affinity | Affinity settings for pod assignment | {} |
env | Extra environment variables passed to pods | {} |
envValueFrom | Environment variables from alternate sources. See the API docs on EnvVarSource for format details. | {} |
envFromSecret | Name of a Kubernetes secret (must be manually created in the same namespace) containing values to be added to the environment. Can be templated | "" |
envRenderSecret | Sensitive environment variables passed to pods and stored as a secret | {} |
extraSecretMounts | Secret mounts to get secrets from files instead of env vars | [] |
extraVolumeMounts | Additional volume mounts | [] |
configuration.pythonFileName | Path of the python file to run | simple.py |
configuration.dependencies | List of the python dependencies needed | [] |
configuration.processesCount | Number of concurrent processes to run | 1 |
configuration.workersPerProcess | Number of workers per process | 1 |
configuration.jobMode | Create a Kubernetes Job resource instead of a Statefulset (use this for batch processing); requires Kubernetes 1.24 or higher | false |
configuration.keepAlive | Keep the container process alive after the dataflow execution has ended to prevent a container restart by Kubernetes (ignored when .jobMode is true) | false |
configuration.configMap.create | Create a ConfigMap to store python file(s) | true |
configuration.configMap.customName | ConfigMap with python file(s) created manually | `` |
configuration.configMap.files.pattern | Files to store in the ConfigMap to be created | examples/* |
configuration.configMap.files.tarName | Tar file to store in the ConfigMap to be created | `` |
configuration.recovery.enabled | Enable recovery | false |
configuration.recovery.partsCount | Number of recovery parts | 1 |
configuration.recovery.snapshotInterval | System time duration in seconds to snapshot state for recovery | 30 |
configuration.recovery.backupInterval | System time duration in seconds to keep extra state snapshots around | 30 |
configuration.recovery.persistence.accessModes | Persistence access modes | [ReadWriteOnce] |
configuration.recovery.persistence.size | Size of persistent volume claim | 10Gi |
configuration.recovery.persistence.annotations | PersistentVolumeClaim annotations | {} |
configuration.recovery.persistence.finalizers | PersistentVolumeClaim finalizers | [ "kubernetes.io/pvc-protection" ] |
configuration.recovery.persistence.extraPvcLabels | Extra labels to apply to the PVC | {} |
configuration.recovery.persistence.storageClassName | Type of persistent volume claim | nil |
configuration.recovery.persistence.hostPath.enabled | Use hostPath instead of a PersistentVolumeClaim | false |
configuration.recovery.persistence.hostPath.path | Absolute path on the host to store recovery files | `` |
customOtlpUrl | OTLP endpoint URL | `` |
opentelemetry-collector.enabled | Install the OpenTelemetry Collector Helm chart | false |
jaeger.enabled | Install the Jaeger Helm chart | false |
kubePrometheusStack.enabled | Install Prometheus Operator, Kube-Metrics, and Grafana | false |
podMonitor.enabled | Use an existing Prometheus Operator instead of installing a new one with kubePrometheusStack.enabled | false |
podMonitor.selector | Labels to apply to the PodMonitor resource so the existing Prometheus Operator processes it | release: my-prometheus |
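As a quick orientation, here is a minimal sketch of a custom values file that overrides a handful of the parameters above; the file name, resource sizes, and process counts are illustrative, not chart defaults or recommendations:

```yaml
# values.yaml (illustrative): a few common overrides using keys from the table above
image:
  tag: "0.19.1-python3.9"
configuration:
  pythonFileName: "examples/k8s_cluster.py"
  processesCount: 2
  workersPerProcess: 2
  recovery:
    enabled: true
resources:
  requests:
    cpu: 250m
    memory: 256Mi
```

Pass it at install time with `helm install my-release bytewax/bytewax -f values.yaml`.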
To run one of the python examples included in the chart, for instance `k8s_basic.py`, set these values:

```yaml
configuration:
  pythonFileName: "k8s_basic.py"
  configMap:
    files:
      pattern: "examples/*"
      tarName:
```
The next example runs `examples/k8s_cluster.py` across five processes and loads the files from a tar archive instead:

```yaml
configuration:
  pythonFileName: "examples/k8s_cluster.py"
  processesCount: 5
  configMap:
    files:
      pattern:
      tarName: "examples.tar"
```
In this example, we store a tar file in the configmap. This is useful when your python script needs a tree of nested files and directories.
Following our example, the tar file has this content:
```
├── k8s_basic.py
├── k8s_cluster.py
└── sample_data
    └── cluster
        ├── partition-1.txt
        ├── partition-2.txt
        ├── partition-3.txt
        ├── partition-4.txt
        └── partition-5.txt
```
Since that tar file is extracted into the container's working directory, the container will have that directory tree available to work with.

Our `k8s_cluster.py` script opens files located in the `examples/sample_data/cluster` directory, as we can see in this portion of its code:

```python
read_dir = Path("./examples/sample_data/cluster/")
```
If you want to see the output produced by this example, you can run this (assuming that your Helm release name was `k8s`):

```shell
for PROCESS in {0..4}; do echo "$PROCESS.out:"; kubectl exec -it k8s-$PROCESS -cprocess -- cat /var/bytewax/cluster_out/$PROCESS.out; done
```
So far, the examples of the `configuration` block described how to use one of the python files already included in the chart. We included those files to show you how to use this chart, but of course, you will want to run your own code. You have two ways to accomplish that:
In this case, you will need to fetch the Bytewax chart to your machine and copy your python file(s) inside the chart directory. Then, you can use the chart settings to generate a ConfigMap storing your file(s) and run it in the container.

These are the steps to include `my-code.py` and execute it:
- Fetch the Bytewax chart and decompress it:

```shell
$ helm repo add bytewax https://bytewax.github.io/helm-charts
$ helm repo update
$ helm fetch bytewax/bytewax
$ tar -xvf bytewax-0.3.0.tgz
```
- Copy your file to the chart directory:

```shell
$ cp ./my-code.py ./bytewax/
```
- Install the Bytewax chart using your local copy:

```shell
$ helm upgrade --install my-dataflow ./bytewax \
    --set configuration.pythonFileName="my-code.py" \
    --set configuration.configMap.files.pattern="my-code.py"
```
In this option, you will need to provide a ConfigMap with your file(s) and then configure your chart values to use it.

These are the steps to create a ConfigMap with `my-code.py` and use it with the Bytewax chart:
- Create the ConfigMap:

```shell
$ kubectl create configmap my-configmap --from-file=my-code.py
```
- Install the Bytewax helm chart using `my-configmap`:

```shell
$ helm repo add bytewax https://bytewax.github.io/helm-charts
$ helm repo update
$ helm upgrade --install my-dataflow bytewax/bytewax \
    --set configuration.pythonFileName="my-code.py" \
    --set configuration.configMap.create=false \
    --set configuration.configMap.customName=my-configmap
```
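If you prefer a values file over `--set` flags, a minimal equivalent sketch (the file name values.yaml is illustrative) would be:

```yaml
# values.yaml (illustrative): run my-code.py from a manually created ConfigMap
configuration:
  pythonFileName: "my-code.py"
  configMap:
    create: false
    customName: my-configmap
```

Install it with `helm upgrade --install my-dataflow bytewax/bytewax -f values.yaml`.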
Bytewax is instrumented to offer observability of your dataflow. You can read more about it here.
The Bytewax helm chart can install OpenTelemetry Collector and Jaeger, both configured to work together to handle your dataflow traces.
With this simple configuration you will have an OpenTelemetry Collector receiving your dataflow traces and exporting them to a Jaeger instance:
```yaml
opentelemetry-collector:
  enabled: true
jaeger:
  enabled: true
```
Note that your dataflow will have an environment variable `BYTEWAX_OTLP_URL` filled with the corresponding OpenTelemetry Collector endpoint created by the helm chart installation.

Following that example, to see the dataflow traces in the Jaeger UI, you need to run this and then open http://localhost:8088 in a web browser:

```shell
kubectl port-forward svc/<YOUR_RELEASE_NAME>-jaeger-query 8088:80
```
You can change the OpenTelemetry Collector and Jaeger sub-charts' configuration by nesting their values under `opentelemetry-collector` or `jaeger` respectively. These are some examples:
OpenTelemetry Collector sending traces to an existing Jaeger installation running on the same Kubernetes cluster:

```yaml
opentelemetry-collector:
  enabled: true
  mode: deployment
  config:
    exporters:
      jaeger:
        endpoint: "jaeger-collector.common-infra.svc.cluster.local:14250"
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: []
          exporters: [logging, jaeger]
jaeger:
  enabled: false
```
OpenTelemetry Collector with default values and Jaeger storing traces in a one-node Elasticsearch cluster:

```yaml
opentelemetry-collector:
  enabled: true
jaeger:
  enabled: true
  elasticsearch:
    replicas: 1
    minimumMasterNodes: 1
```
In case you want to send the dataflow traces to an existing OTLP endpoint, you just need to define the `customOtlpUrl` field in your values, for example:

```yaml
customOtlpUrl: https://otlpcollector.myorganization.com:4317
```

In that case, you should keep `opentelemetry-collector.enabled` and `jaeger.enabled` at their default value `false` because they are unnecessary.
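Putting that together, the relevant values for the external-endpoint case look like this (the endpoint URL is the illustrative one from above; the two enabled flags simply restate the defaults):

```yaml
# Send traces to an existing OTLP endpoint; skip the bundled collector and Jaeger
customOtlpUrl: https://otlpcollector.myorganization.com:4317
opentelemetry-collector:
  enabled: false
jaeger:
  enabled: false
```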
The Bytewax helm chart can install the Prometheus Operator, Kube-Metrics, and Grafana, all configured to work together with your dataflow metrics.
With this simple configuration you will have your dataflow metrics in the Prometheus database, and you can check them in the Grafana UI:

```yaml
kubePrometheusStack:
  enabled: true
```
Following that example, to see the dataflow metrics in the Grafana UI, you need to run this and then open http://localhost:3000 in a web browser:

```shell
kubectl port-forward svc/<YOUR_RELEASE_NAME>-grafana 3000:80
```
You can change the Prometheus Operator, Kube-Metrics, and Grafana sub-charts' configuration by nesting their values under `kube-prometheus-stack`, `kube-prometheus-stack.kube-state-metrics`, and `kube-prometheus-stack.grafana` respectively. For example:
```yaml
kubePrometheusStack:
  enabled: true
kube-prometheus-stack:
  grafana:
    replicas: 2
```
In case you want to instruct an existing Prometheus Operator to scrape the dataflow metrics, you just need to set the `podMonitor.enabled` field to `true` and configure the labels in `podMonitor.selector` in your values.

For example, if the existing Prometheus Operator looks for PodMonitors with the label `release=my-prometheus`, your settings should be like this:
```yaml
podMonitor:
  enabled: true
  selector:
    release: my-prometheus
```
To check which labels your Prometheus Operator actually looks for, run this command:

```shell
kubectl get prometheuses.monitoring.coreos.com --all-namespaces -o jsonpath="{.items[*].spec.podMonitorSelector}"
```

The output will be something similar to this:

```
{"matchLabels":{"release":"my-prometheus"}}
```
In case you use `podMonitor.enabled=true`, you should keep `kubePrometheusStack.enabled` at its default value `false` because it is unnecessary.
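As a combined sketch of that scenario (the release label is the illustrative one from above):

```yaml
# Reuse an existing Prometheus Operator instead of installing kube-prometheus-stack
kubePrometheusStack:
  enabled: false
podMonitor:
  enabled: true
  selector:
    release: my-prometheus
```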
You can mount Kubernetes secrets as files and read them from your dataflow code instead of passing them as environment variables. In `my-workload.py`:

```python
# Read the mounted secret files from the dataflow script
with open("/etc/secrets/auth_generic_oauth/client_id") as f:
    client_id = f.read()
with open("/etc/secrets/auth_generic_oauth/client_secret") as f:
    client_secret = f.read()
```
Existing secret, or created along with helm:

```yaml
---
apiVersion: v1
kind: Secret
metadata:
  name: auth-generic-oauth-secret
type: Opaque
stringData:
  client_id: <value>
  client_secret: <value>
```
Include it in the `extraSecretMounts` configuration flag:

```yaml
- extraSecretMounts:
  - name: auth-generic-oauth-secret-mount
    secretName: auth-generic-oauth-secret
    defaultMode: 0440
    mountPath: /etc/secrets/auth_generic_oauth
    readOnly: true
```
This example uses a CSI driver, e.g. retrieving secrets using the Azure Key Vault Provider:

```yaml
- extraSecretMounts:
  - name: secrets-store-inline
    mountPath: /run/secrets
    readOnly: true
    csi:
      driver: secrets-store.csi.k8s.io
      readOnly: true
      volumeAttributes:
        secretProviderClass: "my-provider"
      nodePublishSecretRef:
        name: akv-creds
```