
[Merged by Bors] - Split up usage page & new index page #260


Closed
wants to merge 9 commits into from
9 changes: 6 additions & 3 deletions CHANGELOG.md
@@ -9,6 +9,7 @@
- Add the ability to load DAGs via git-sync ([#245]).
- Cluster status conditions ([#255])
- Extend cluster resources for status and cluster operation (paused, stopped) ([#257])
- Added more detailed landing page for the docs ([#260]).

### Changed

@@ -20,9 +21,10 @@
- `operator-rs` `0.31.0` -> `0.34.0` -> `0.39.0` ([#219]) ([#257]).
- Specified security context settings needed for OpenShift ([#222]).
- Fixed template parsing for OpenShift tests ([#222]).
- Revert openshift settings ([#233])
- Support crate2nix in dev environments ([#234])
- Fixed LDAP tests on Openshift ([#254])
- Revert openshift settings ([#233]).
- Support crate2nix in dev environments ([#234]).
- Fixed LDAP tests on Openshift ([#254]).
- Reorganized usage guide docs ([#260]).

### Removed

@@ -38,6 +40,7 @@
[#255]: https://github.com/stackabletech/airflow-operator/pull/255
[#257]: https://github.com/stackabletech/airflow-operator/pull/257
[#258]: https://github.com/stackabletech/airflow-operator/pull/258
[#260]: https://github.com/stackabletech/airflow-operator/pull/260

## [23.1.0] - 2023-01-23

4 changes: 4 additions & 0 deletions docs/modules/airflow/images/airflow_overview.drawio.svg
@@ -159,4 +159,4 @@ include::example$getting_started/code/getting_started.sh[tag=check-dag]

== What's next

Look at the xref:usage.adoc[Usage page] to find out more about configuring your Airflow cluster and loading your own DAG files.
Look at the xref:usage-guide/index.adoc[] to find out more about configuring your Airflow cluster and loading your own DAG files.
53 changes: 42 additions & 11 deletions docs/modules/airflow/pages/index.adoc
@@ -1,20 +1,51 @@
= Stackable Operator for Apache Airflow
:description: The Stackable Operator for Apache Airflow is a Kubernetes operator that can manage Apache Airflow clusters. Learn about its features, resources, dependencies and demos, and see the list of supported Airflow versions.
:keywords: Stackable Operator, Apache Airflow, Kubernetes, k8s, operator, engineer, big data, metadata, job pipeline, scheduler, workflow, ETL

This is an operator for Kubernetes that can manage https://airflow.apache.org/[Apache Airflow]
clusters.
The Stackable Operator for Apache Airflow manages https://airflow.apache.org/[Apache Airflow] instances on Kubernetes.
Apache Airflow is an open-source application for creating, scheduling, and monitoring workflows. Workflows are defined as code, with tasks that can be run on a variety of platforms, including Hadoop, Spark, and Kubernetes itself. Airflow is a popular choice to orchestrate ETL workflows and data pipelines.

WARNING: This operator is part of the Stackable Data Platform and only works with images from the
https://repo.stackable.tech/#browse/browse:docker:v2%2Fstackable%2Fairflow[Stackable] repository.
== Getting started

Get started using Airflow with the Stackable Operator by following the xref:getting_started/index.adoc[] guide. It guides you through installing the Operator alongside a PostgreSQL database and Redis instance, connecting to your Airflow instance and running your first workflow.

== Resources

The Operator manages two https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/[custom resources]: the _AirflowCluster_ and the _AirflowDB_. It creates a number of different Kubernetes resources based on these custom resources.

=== Custom resources

The AirflowCluster is the main resource for the configuration of the Airflow instance. The resource defines three xref:concepts:roles-and-role-groups.adoc[roles]: `webserver`, `worker` and `scheduler`. The various configuration options are explained in the xref:usage-guide/index.adoc[]. It helps you tune your cluster to your needs by configuring xref:usage-guide/storage-resources.adoc[resource usage], xref:usage-guide/security.adoc[security], xref:usage-guide/logging.adoc[logging] and more.
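
For orientation, a minimal AirflowCluster definition might look like the sketch below. The values are illustrative, and other required settings (such as the credentials Secret) are omitted:

[source,yaml]
----
apiVersion: airflow.stackable.tech/v1alpha1
kind: AirflowCluster
metadata:
  name: airflow
spec:
  image:
    productVersion: 2.4.1
    stackableVersion: 23.4.0-rc2
  # credentials and other required settings omitted for brevity
  webservers:
    roleGroups:
      default:
        replicas: 1
  workers:
    roleGroups:
      default:
        replicas: 2
  schedulers:
    roleGroups:
      default:
        replicas: 1
----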

When an AirflowCluster is first deployed, an AirflowDB resource is created. The AirflowDB resource is a wrapper resource for the metadata SQL database that Airflow uses to store information on users and permissions as well as workflows, task instances and their execution. The resource contains some configuration but also keeps track of whether the database has been initialized or not. It is not deleted automatically if an AirflowCluster is deleted, and so can be reused.

=== Kubernetes resources

Based on the custom resources you define, the Operator creates ConfigMaps, StatefulSets and Services.

image::airflow_overview.drawio.svg[A diagram depicting the Kubernetes resources created by the operator]

The diagram above depicts all the Kubernetes resources created by the operator, and how they relate to each other. The Job created for the AirflowDB is not shown.

For every xref:concepts:roles-and-role-groups.adoc#_role_groups[role group] you define, the Operator creates a StatefulSet with the number of replicas defined in the RoleGroup. Every Pod in the StatefulSet has two containers: the main container running Airflow and a sidecar container gathering metrics for xref:operators:monitoring.adoc[]. The Operator creates a Service per role group as well as a single Service for the whole `webserver` role called `<clustername>-webserver`.

// TODO configmaps?
ConfigMaps are created as well: one per RoleGroup and one for the AirflowDB. Each ConfigMap contains two files, `log_config.py` and `webserver_config.py`, which hold the logging and general Airflow configuration respectively.
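
As an illustration, for an AirflowCluster named `airflow` with a single `default` role group per role, the created objects would be named roughly as follows. Only the `airflow-webserver` Service name follows directly from the text above; the other names are assumptions based on the usual `<clustername>-<role>-<rolegroup>` naming pattern:

[source]
----
statefulset/airflow-webserver-default   # one StatefulSet per role group
statefulset/airflow-worker-default
statefulset/airflow-scheduler-default
service/airflow-webserver-default       # one Service per role group
service/airflow-webserver               # one Service for the whole webserver role
configmap/airflow-webserver-default     # one ConfigMap per role group
----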

== Dependencies

Airflow requires an SQL database in which to store its metadata. The Stackable platform does not have its own Operator for an SQL database, but the xref:getting_started/index.adoc[] guide walks you through installing an example database alongside an Airflow instance that you can use to get started.
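
The connection details for this database are passed to Airflow via a credentials Secret referenced by the AirflowCluster. The sketch below illustrates the idea; the key names are assumptions based on the getting started setup and may differ between releases, and all values are placeholders:

[source,yaml]
----
apiVersion: v1
kind: Secret
metadata:
  name: simple-airflow-credentials
type: Opaque
stringData:
  adminUser.username: airflow
  adminUser.firstname: Airflow
  adminUser.lastname: Admin
  adminUser.email: airflow@example.com
  adminUser.password: airflow
  connections.secretKey: thisISaSECRET_1234
  connections.sqlalchemyDatabaseUri: postgresql+psycopg2://airflow:airflow@airflow-postgresql/airflow
  connections.celeryResultBackend: db+postgresql://airflow:airflow@airflow-postgresql/airflow
  connections.celeryBrokerUrl: redis://:redis@airflow-redis-master:6379/0
----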

== Using custom workflows/DAGs

https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html[Directed acyclic graphs (DAGs) of tasks] are the core entities you will use in Airflow. Have a look at the page on xref:usage-guide/mounting-dags.adoc[] to learn about the different ways of loading your custom DAGs into Airflow.

== Demo

You can install the xref:stackablectl::demos/airflow-scheduled-job.adoc[] demo and explore an Airflow installation, as well as how it interacts with xref:spark-k8s:index.adoc[Apache Spark].

== Supported Versions

The Stackable Operator for Apache Airflow currently supports the following versions of Airflow:

include::partial$supported-versions.adoc[]

== Docker

[source]
----
docker pull docker.stackable.tech/stackable/airflow:<version>
----
@@ -0,0 +1,73 @@
= Applying Custom Resources

Airflow can be used to apply custom resources from within a cluster. An example of this could be a SparkApplication job that is to be triggered by Airflow. The steps below describe how this can be done.

== Define an in-cluster Kubernetes connection

An in-cluster connection can either be created from within the Webserver UI (note that the "in cluster configuration" box is ticked):

image::airflow_connection_ui.png[Airflow Connections]

Alternatively, the connection can be https://airflow.apache.org/docs/apache-airflow/stable/howto/connection.html[defined] by an environment variable in URI format:

[source]
AIRFLOW_CONN_KUBERNETES_IN_CLUSTER: "kubernetes://?__extra__=%7B%22extra__kubernetes__in_cluster%22%3A+true%2C+%22extra__kubernetes__kube_config%22%3A+%22%22%2C+%22extra__kubernetes__kube_config_path%22%3A+%22%22%2C+%22extra__kubernetes__namespace%22%3A+%22%22%7D"

This can be supplied directly in the custom resource for all roles (Airflow expects configuration to be common across components):

[source,yaml]
----
include::example$example-airflow-incluster.yaml[]
----
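
A condensed sketch of what such a definition might look like, using the `envOverrides` mechanism described in xref:usage-guide/overrides.adoc[] (the actual included example may differ):

[source,yaml]
----
webservers:
  envOverrides:
    AIRFLOW_CONN_KUBERNETES_IN_CLUSTER: "kubernetes://?__extra__=%7B%22extra__kubernetes__in_cluster%22%3A+true%2C+%22extra__kubernetes__kube_config%22%3A+%22%22%2C+%22extra__kubernetes__kube_config_path%22%3A+%22%22%2C+%22extra__kubernetes__namespace%22%3A+%22%22%7D"
  roleGroups:
    default:
      replicas: 1
# the same envOverrides entry is repeated for the workers and schedulers roles,
# since Airflow expects this configuration to be common across components
----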

== Define a cluster role for Airflow to create SparkApplication resources

Airflow cannot create or access SparkApplication resources by default - a cluster role is required for this:

[source,yaml]
----
include::example$example-airflow-spark-clusterrole.yaml[]
----

and a corresponding cluster role binding:

[source,yaml]
----
include::example$example-airflow-spark-clusterrolebinding.yaml[]
----
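
As a rough sketch, assuming the SparkApplication CRD is served under the `spark.stackable.tech` API group and that the Airflow Pods run under a ServiceAccount named `airflow` in the `default` namespace (both names are illustrative; the included examples are authoritative), the pair of manifests could look like this:

[source,yaml]
----
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: airflow-spark-clusterrole
rules:
  - apiGroups: ["spark.stackable.tech"]
    resources: ["sparkapplications"]
    verbs: ["create", "get", "list", "watch", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: airflow-spark-clusterrole-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: airflow-spark-clusterrole
subjects:
  - kind: ServiceAccount
    name: airflow        # the ServiceAccount used by the Airflow Pods (assumption)
    namespace: default
----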

== DAG code

Now for the DAG itself. The job to be started is a simple Spark job that calculates the value of pi:

[source,yaml]
----
include::example$example-pyspark-pi.yaml[]
----

This will be called from within a DAG by using the connection that was defined earlier. It will be wrapped by the `KubernetesHook` that the Airflow Kubernetes provider makes available https://github.com/apache/airflow/blob/main/airflow/providers/cncf/kubernetes/operators/spark_kubernetes.py[here]. There are two classes that are used to:

- start the job
- monitor the status of the job

These are written in-line in the Python code below, though this is just to make it clear how the code is used (the classes `SparkKubernetesOperator` and `SparkKubernetesSensor` will be used for all custom resources and thus are best defined in separate Python files that the DAG would reference).

[source,python]
----
include::example$example-spark-dag.py[]
----
<1> the wrapper class used for calling the job via `KubernetesHook`
<2> the connection that was created for in-cluster usage
<3> the wrapper class used for monitoring the job via `KubernetesHook`
<4> the start of the DAG code
<5> the initial task to invoke the job
<6> the subsequent task to monitor the job
<7> the jobs are chained together in the correct order

Once this DAG is xref:usage-guide/mounting-dags.adoc[mounted] in the DAG folder, it can be called and its progress viewed from within the Webserver UI:

image::airflow_dag_graph.png[Airflow DAG graph]

Clicking on the "spark_pi_monitor" task and selecting the logs shows that the status of the job has been tracked by Airflow:

image::airflow_dag_log.png[Airflow task log]
1 change: 1 addition & 0 deletions docs/modules/airflow/pages/usage-guide/index.adoc
@@ -0,0 +1 @@
= Usage guide
43 changes: 43 additions & 0 deletions docs/modules/airflow/pages/usage-guide/logging.adoc
@@ -0,0 +1,43 @@
= Log aggregation

The logs can be forwarded to a Vector log aggregator by providing a discovery
ConfigMap for the aggregator and by enabling the log agent:

[source,yaml]
----
spec:
vectorAggregatorConfigMapName: vector-aggregator-discovery
webservers:
config:
logging:
enableVectorAgent: true
containers:
airflow:
loggers:
"flask_appbuilder":
level: WARN
workers:
config:
logging:
enableVectorAgent: true
containers:
airflow:
loggers:
"airflow.processor":
level: INFO
schedulers:
config:
logging:
enableVectorAgent: true
containers:
airflow:
loggers:
"airflow.processor_manager":
level: INFO
databaseInitialization:
logging:
enableVectorAgent: true
----

Further information on how to configure logging can be found in
xref:home:concepts:logging.adoc[].
4 changes: 4 additions & 0 deletions docs/modules/airflow/pages/usage-guide/monitoring.adoc
@@ -0,0 +1,4 @@
= Monitoring

The managed Airflow instances are automatically configured to export Prometheus metrics. See
xref:home:operators:monitoring.adoc[] for more details.
51 changes: 51 additions & 0 deletions docs/modules/airflow/pages/usage-guide/mounting-dags.adoc
@@ -0,0 +1,51 @@
= Mounting DAGs

DAGs can be mounted by using a `ConfigMap` or `git-sync`. This is best illustrated with an example of each, shown in the sections below.

== via `ConfigMap`

[source,python]
----
include::example$example-configmap.yaml[]
----
<1> The name of the configuration map
<2> The name of the DAG (this is a renamed copy of the `example_bash_operator.py` from the Airflow examples)

[source,yaml]
----
include::example$example-airflow-dags-configmap.yaml[]
----
<3> The volume backed by the configuration map
<4> The name of the configuration map referenced by the Airflow cluster
<5> The name of the mounted volume
<6> The path of the mounted resource. Note that this should map to a single DAG.
<7> The resource has to be defined using `subPath`: this is to prevent the versioning of configuration map elements, which may cause a conflict with how Airflow propagates DAGs between its components.
<8> If the mount path described above is anything other than the standard location (the default is `$AIRFLOW_HOME/dags`), then the location should be defined using the relevant environment variable.
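
For orientation, a condensed sketch of the two pieces involved, a ConfigMap carrying a DAG and the volume wiring that mounts it, is shown below. Names, paths and the exact placement of the volume fields are assumptions; the included examples are authoritative:

[source,yaml]
----
apiVersion: v1
kind: ConfigMap
metadata:
  name: airflow-dags                     # referenced from the AirflowCluster fragment below
data:
  my_dag.py: |
    # a DAG definition would go here
    ...
---
# fragment of the AirflowCluster spec (placement of these fields is an assumption)
volumes:
  - name: dags
    configMap:
      name: airflow-dags
volumeMounts:
  - name: dags
    mountPath: /dags/my_dag.py           # must map to a single DAG
    subPath: my_dag.py                   # subPath avoids ConfigMap versioning conflicts
envOverrides:
  AIRFLOW__CORE__DAGS_FOLDER: "/dags"    # needed because the mount path is not $AIRFLOW_HOME/dags
----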

The advantage of this approach is that a DAG can be provided "in-line", as it were. This becomes cumbersome when multiple DAGs are to be made available in this way, as each one has to be mapped individually. For multiple DAGs it is probably easier to expose them all via a mounted volume, which is shown below.

== via `git-sync`

=== Overview

https://github.com/kubernetes/git-sync/tree/release-3.x[git-sync] is a command that pulls a git repository into a local directory and is supplied as a sidecar container for use within Kubernetes. The Stackable implementation is a wrapper around this such that the binary and image requirements are included in the Stackable Airflow product images and do not need to be specified or handled in the `AirflowCluster` custom resource. Internal details such as image names and volume mounts are handled by the operator, so that only the repository and synchronisation details are required. An example of this usage is given in the next section.

=== Example

[source,yaml]
----
include::example$example-airflow-gitsync.yaml[]
----

<1> A `Secret` used for accessing database and admin user details (included here to illustrate where different credential secrets are defined)
<2> The git-sync configuration block that contains a list of git-sync elements
<3> The repository that will be cloned (required)
<4> The branch name (defaults to `main`)
<5> The location of the DAG folder, relative to the synced repository root (required)
<6> The depth of syncing i.e. the number of commits to clone (defaults to 1)
<7> The synchronisation interval in seconds (defaults to 20 seconds)
<8> The name of the `Secret` used to access the repository if it is not public. This should include two fields: `user` and `password` (which can be either a password - which is not recommended - or a github token, as described https://github.com/kubernetes/git-sync/tree/v3.6.4#flags-which-configure-authentication[here])
<9> A map of optional configuration settings that are listed in https://github.com/kubernetes/git-sync/tree/v3.6.4#primary-flags[this] configuration section (and the ones that follow on that link)
<10> An example showing how to specify a target revision (the default is HEAD). The revision can also be a tag or a commit, though this assumes that the target hash is contained within the number of commits specified by `depth`. If a tag or commit hash is specified, then git-sync will recognise that and not perform further cloning.


IMPORTANT: The example above shows a _*list*_ of git-sync definitions, with a single element. This is to avoid breaking changes in future releases. Currently, only one such git-sync definition is considered and processed.
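
For orientation, a hypothetical sketch of such a list with a single element is shown below. The top-level key and field names are assumptions inferred from the callouts above and may not match the actual CRD schema; the included example is authoritative:

[source,yaml]
----
dagsGitSync:                                          # a list; only the first element is processed
  - repo: https://github.com/example/airflow-dags    # the repository to clone
    branch: main
    gitFolder: dags                                   # DAG folder relative to the repository root
    depth: 1
    wait: 20                                          # synchronisation interval in seconds
    credentialsSecret: git-credentials                # only needed for non-public repositories
    gitSyncConf:
      --rev: HEAD                                     # optional additional git-sync settings
----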
42 changes: 42 additions & 0 deletions docs/modules/airflow/pages/usage-guide/overrides.adoc
@@ -0,0 +1,42 @@

= Configuration & Environment Overrides

The cluster definition also supports overriding configuration properties and environment variables, either per role or per role group, where the more specific override (role group) has precedence over the less specific one (role).

IMPORTANT: Overriding certain properties which are set by the operator (such as the HTTP port) can interfere with the operator and lead to problems. Additionally, for Airflow it is recommended
that each component has the same configuration: not all components use each setting, but some things, such as external endpoints, need to be consistent for things to work as expected.

== Configuration Properties

Airflow exposes an environment variable for every Airflow configuration setting, a list of which can be found in the https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html[Configuration Reference].
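
For example, a setting from the `[webserver]` section of the configuration file maps to an environment variable like this:

[source]
----
# configuration setting                 corresponding environment variable
[webserver] auto_refresh_interval  ->   AIRFLOW__WEBSERVER__AUTO_REFRESH_INTERVAL
----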

Although these settings can be overridden in one of two ways (configuration overrides or environment variable overrides), the effect is the same,
and currently only the latter is implemented. This is described in the following section.

== Environment Variables

These can be set - or overwritten - at either the role level:

[source,yaml]
----
webservers:
envOverrides:
AIRFLOW__WEBSERVER__AUTO_REFRESH_INTERVAL: "8"
roleGroups:
default:
replicas: 1
----

Or per role group:

[source,yaml]
----
webservers:
roleGroups:
default:
envOverrides:
AIRFLOW__WEBSERVER__AUTO_REFRESH_INTERVAL: "8"
replicas: 1
----

In both examples above we are replacing the default value of the UI DAG refresh interval (3s) with 8s. Note that all override property values must be strings.
@@ -5,4 +5,4 @@ You can configure the Pod placement of the Airflow pods as described in xref:con
The default affinities created by the operator are:

1. Co-locate all the Airflow Pods (weight 20)
2. Distribute all Pods within the same role (worker, webserver, scheduler) (weight 70)
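
Expressed as Kubernetes affinity terms, the generated defaults roughly follow the pattern sketched below; the label keys and values are assumptions based on common Stackable labelling conventions, with the weights corresponding to the list above:

[source,yaml]
----
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 20                                    # co-locate all Airflow Pods of the cluster
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: airflow
              app.kubernetes.io/instance: airflow     # the cluster name
          topologyKey: kubernetes.io/hostname
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 70                                    # spread Pods of the same role, e.g. worker
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: airflow
              app.kubernetes.io/instance: airflow
              app.kubernetes.io/component: worker
          topologyKey: kubernetes.io/hostname
----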
67 changes: 67 additions & 0 deletions docs/modules/airflow/pages/usage-guide/security.adoc
@@ -0,0 +1,67 @@
= Security

== Authentication
Every user has to authenticate themselves before using Airflow and there are several ways of doing this.

=== Webinterface
The default setting is to view and manually set up users via the Webserver UI. Note the blue "+" button where users can be added directly:

image::airflow_security.png[Airflow Security menu]

=== LDAP

Airflow supports xref:nightly@home:concepts:authentication.adoc[authentication] of users against an LDAP server. This requires setting up an xref:nightly@home:concepts:authentication.adoc#authenticationclass[AuthenticationClass] for the LDAP server.
The AuthenticationClass is then referenced in the AirflowCluster resource as follows:

[source,yaml]
----
apiVersion: airflow.stackable.tech/v1alpha1
kind: AirflowCluster
metadata:
name: airflow-with-ldap
spec:
image:
productVersion: 2.4.1
stackableVersion: 23.4.0-rc2
[...]
authenticationConfig:
authenticationClass: ldap # <1>
userRegistrationRole: Admin # <2>
----

<1> The reference to an AuthenticationClass called `ldap`
<2> The default role that all users are assigned to

Users that log in with LDAP are assigned to a default https://airflow.apache.org/docs/apache-airflow/stable/security/access-control.html#access-control[Role] which is specified with the `userRegistrationRole` property.

You can follow the xref:nightly@home:tutorials:authentication_with_openldap.adoc[] tutorial to learn how to set up an AuthenticationClass for an LDAP server, and consult the xref:nightly@home:reference:authenticationclass.adoc[] reference for more details.

The users and roles can be viewed as before in the Webserver UI, but note that the blue "+" button is not available when authenticating against LDAP:

image::airflow_security_ldap.png[Airflow Security menu]

== Authorization
The Airflow Webserver delegates the https://airflow.apache.org/docs/apache-airflow/stable/security/access-control.html[handling of user access control] to https://flask-appbuilder.readthedocs.io/en/latest/security.html[Flask AppBuilder].

=== Webinterface
You can view, add to, and assign the roles displayed in the Airflow Webserver UI to existing users.

=== LDAP

Airflow supports assigning https://airflow.apache.org/docs/apache-airflow/stable/security/access-control.html#access-control[Roles] to users based on their LDAP group membership, though this is not yet supported by the Stackable operator.
All the users logging in via LDAP get assigned to the same role, which you can configure via the attribute `authenticationConfig.userRegistrationRole` on the `AirflowCluster` object:

[source,yaml]
----
apiVersion: airflow.stackable.tech/v1alpha1
kind: AirflowCluster
metadata:
name: airflow-with-ldap
spec:
[...]
authenticationConfig:
authenticationClass: ldap
userRegistrationRole: Admin # <1>
----

<1> All users are assigned to the `Admin` role