
[Merged by Bors] - Split up usage page & new index page #260


Closed
wants to merge 9 commits into from
9 changes: 6 additions & 3 deletions CHANGELOG.md
@@ -9,6 +9,7 @@
- Add the ability to load DAGs via git-sync ([#245]).
- Cluster status conditions ([#255])
- Extend cluster resources for status and cluster operation (paused, stopped) ([#257])
- Added more detailed landing page for the docs ([#260]).

### Changed

@@ -20,9 +21,10 @@
- `operator-rs` `0.31.0` -> `0.34.0` -> `0.39.0` ([#219]) ([#257]).
- Specified security context settings needed for OpenShift ([#222]).
- Fixed template parsing for OpenShift tests ([#222]).
- Revert openshift settings ([#233])
- Support crate2nix in dev environments ([#234])
- Fixed LDAP tests on Openshift ([#254])
- Revert openshift settings ([#233]).
- Support crate2nix in dev environments ([#234]).
- Fixed LDAP tests on Openshift ([#254]).
- Reorganized usage guide docs ([#260]).

### Removed

@@ -38,6 +40,7 @@
[#255]: https://github.com/stackabletech/airflow-operator/pull/255
[#257]: https://github.com/stackabletech/airflow-operator/pull/257
[#258]: https://github.com/stackabletech/airflow-operator/pull/258
[#260]: https://github.com/stackabletech/airflow-operator/pull/260

## [23.1.0] - 2023-01-23

4 changes: 4 additions & 0 deletions docs/modules/airflow/images/airflow_overview.drawio.svg
@@ -159,4 +159,4 @@ include::example$getting_started/code/getting_started.sh[tag=check-dag]

== What's next

Look at the xref:usage.adoc[Usage page] to find out more about configuring your Airflow cluster and loading your own DAG files.
Look at the xref:usage-guide/index.adoc[] to find out more about configuring your Airflow cluster and loading your own DAG files.
53 changes: 42 additions & 11 deletions docs/modules/airflow/pages/index.adoc
@@ -1,20 +1,51 @@
= Stackable Operator for Apache Airflow
:description: The Stackable Operator for Apache Airflow is a Kubernetes operator that can manage Apache Airflow clusters. Learn about its features, resources, dependencies and demos, and see the list of supported Airflow versions.
:keywords: Stackable Operator, Apache Airflow, Kubernetes, k8s, operator, engineer, big data, metadata, job pipeline, scheduler, workflow, ETL

This is an operator for Kubernetes that can manage https://airflow.apache.org/[Apache Airflow]
clusters.
The Stackable Operator for Apache Airflow manages https://airflow.apache.org/[Apache Airflow] instances on Kubernetes.
Apache Airflow is an open-source application for creating, scheduling, and monitoring workflows. Workflows are defined as code, with tasks that can be run on a variety of platforms, including Hadoop, Spark, and Kubernetes itself. Airflow is a popular choice to orchestrate ETL workflows and data pipelines.

WARNING: This operator is part of the Stackable Data Platform and only works with images from the
https://repo.stackable.tech/#browse/browse:docker:v2%2Fstackable%2Fairflow[Stackable] repository.
== Getting started

Get started using Airflow with the Stackable Operator by following the xref:getting_started/index.adoc[] guide. It guides you through installing the Operator alongside a PostgreSQL database and Redis instance, connecting to your Airflow instance and running your first workflow.

== Resources

The Operator manages two https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/[custom resources]: the _AirflowCluster_ and the _AirflowDB_. It creates a number of different Kubernetes resources based on these custom resources.

=== Custom resources

The AirflowCluster is the main resource for the configuration of the Airflow instance. The resource defines three xref:concepts:roles-and-role-groups.adoc[roles]: `webserver`, `worker` and `scheduler`. The various configuration options are explained in the xref:usage-guide/index.adoc[]. It helps you tune your cluster to your needs by configuring xref:usage-guide/storage-resources.adoc[resource usage], xref:usage-guide/security.adoc[security], xref:usage-guide/logging.adoc[logging] and more.
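
For orientation, a minimal AirflowCluster definition might look like the sketch below. The values are illustrative, and other required settings (such as the credentials Secret) are omitted:

[source,yaml]
----
apiVersion: airflow.stackable.tech/v1alpha1
kind: AirflowCluster
metadata:
  name: airflow
spec:
  image:
    productVersion: 2.4.1
    stackableVersion: 23.4.0-rc2
  # credentials and other required settings omitted for brevity
  webservers:
    roleGroups:
      default:
        replicas: 1
  workers:
    roleGroups:
      default:
        replicas: 2
  schedulers:
    roleGroups:
      default:
        replicas: 1
----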

When an AirflowCluster is first deployed, an AirflowDB resource is created. The AirflowDB resource is a wrapper resource for the metadata SQL database that Airflow uses to store information on users and permissions as well as workflows, task instances and their execution. The resource contains some configuration but also keeps track of whether the database has been initialized or not. It is not deleted automatically if an AirflowCluster is deleted, and so can be reused.

=== Kubernetes resources

Based on the custom resources you define, the Operator creates ConfigMaps, StatefulSets and Services.

image::airflow_overview.drawio.svg[A diagram depicting the Kubernetes resources created by the operator]

The diagram above depicts all the Kubernetes resources created by the operator, and how they relate to each other. The Job created for the AirflowDB is not shown.

For every xref:concepts:roles-and-role-groups.adoc#_role_groups[role group] you define, the Operator creates a StatefulSet with the number of replicas defined in the RoleGroup. Every Pod in the StatefulSet has two containers: the main container running Airflow and a sidecar container gathering metrics for xref:operators:monitoring.adoc[]. The Operator creates a Service per role group as well as a single Service for the whole `webserver` role called `<clustername>-webserver`.

// TODO configmaps?
ConfigMaps are created as well: one per RoleGroup and one for the AirflowDB. Each ConfigMap contains two files, `log_config.py` and `webserver_config.py`, which hold the logging and general Airflow configuration respectively.
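
As an illustration, for an AirflowCluster named `airflow` with a single `default` role group per role, the created objects would be named roughly as follows. Only the `airflow-webserver` Service name follows directly from the text above; the other names are assumptions based on the usual `<clustername>-<role>-<rolegroup>` naming pattern:

[source]
----
statefulset/airflow-webserver-default   # one StatefulSet per role group
statefulset/airflow-worker-default
statefulset/airflow-scheduler-default
service/airflow-webserver-default       # one Service per role group
service/airflow-webserver               # one Service for the whole webserver role
configmap/airflow-webserver-default     # one ConfigMap per role group
----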

== Dependencies

Airflow requires an SQL database in which to store its metadata. The Stackable platform does not have its own Operator for an SQL database, but the xref:getting_started/index.adoc[] guide walks you through installing an example database alongside an Airflow instance that you can use to get started.
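
The connection details for this database are passed to Airflow via a credentials Secret referenced by the AirflowCluster. The sketch below illustrates the idea; the key names are assumptions based on the getting started setup and may differ between releases, and all values are placeholders:

[source,yaml]
----
apiVersion: v1
kind: Secret
metadata:
  name: simple-airflow-credentials
type: Opaque
stringData:
  adminUser.username: airflow
  adminUser.firstname: Airflow
  adminUser.lastname: Admin
  adminUser.email: airflow@example.com
  adminUser.password: airflow
  connections.secretKey: thisISaSECRET_1234
  connections.sqlalchemyDatabaseUri: postgresql+psycopg2://airflow:airflow@airflow-postgresql/airflow
  connections.celeryResultBackend: db+postgresql://airflow:airflow@airflow-postgresql/airflow
  connections.celeryBrokerUrl: redis://:redis@airflow-redis-master:6379/0
----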

== Using custom workflows/DAGs

https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html[Directed acyclic graphs (DAGs) of tasks] are the core entities you will use in Airflow. Have a look at the page on xref:usage-guide/mounting-dags.adoc[] to learn about the different ways of loading your custom DAGs into Airflow.

== Demo

You can install the xref:stackablectl::demos/airflow-scheduled-job.adoc[] demo and explore an Airflow installation, as well as how it interacts with xref:spark-k8s:index.adoc[Apache Spark].

== Supported Versions

The Stackable Operator for Apache Airflow currently supports the following versions of Airflow:

include::partial$supported-versions.adoc[]

== Docker

[source]
----
docker pull docker.stackable.tech/stackable/airflow:<version>
----
@@ -0,0 +1,73 @@
= Applying Custom Resources

Airflow can be used to apply custom resources from within a cluster. An example of this could be a SparkApplication job that is to be triggered by Airflow. The steps below describe how this can be done.

== Define an in-cluster Kubernetes connection

An in-cluster connection can either be created from within the Webserver UI (note that the "in cluster configuration" box is ticked):

image::airflow_connection_ui.png[Airflow Connections]

Alternatively, the connection can be https://airflow.apache.org/docs/apache-airflow/stable/howto/connection.html[defined] by an environment variable in URI format:

[source]
AIRFLOW_CONN_KUBERNETES_IN_CLUSTER: "kubernetes://?__extra__=%7B%22extra__kubernetes__in_cluster%22%3A+true%2C+%22extra__kubernetes__kube_config%22%3A+%22%22%2C+%22extra__kubernetes__kube_config_path%22%3A+%22%22%2C+%22extra__kubernetes__namespace%22%3A+%22%22%7D"

This can be supplied directly in the custom resource for all roles (Airflow expects configuration to be common across components):

[source,yaml]
----
include::example$example-airflow-incluster.yaml[]
----
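
A condensed sketch of what such a definition might look like, using the `envOverrides` mechanism described in xref:usage-guide/overrides.adoc[] (the actual included example may differ):

[source,yaml]
----
webservers:
  envOverrides:
    AIRFLOW_CONN_KUBERNETES_IN_CLUSTER: "kubernetes://?__extra__=%7B%22extra__kubernetes__in_cluster%22%3A+true%2C+%22extra__kubernetes__kube_config%22%3A+%22%22%2C+%22extra__kubernetes__kube_config_path%22%3A+%22%22%2C+%22extra__kubernetes__namespace%22%3A+%22%22%7D"
  roleGroups:
    default:
      replicas: 1
# the same envOverrides entry is repeated for the workers and schedulers roles,
# since Airflow expects this configuration to be common across components
----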

== Define a cluster role for Airflow to create SparkApplication resources

Airflow cannot create or access SparkApplication resources by default - a cluster role is required for this:

[source,yaml]
----
include::example$example-airflow-spark-clusterrole.yaml[]
----

and a corresponding cluster role binding:

[source,yaml]
----
include::example$example-airflow-spark-clusterrolebinding.yaml[]
----
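
As a rough sketch, assuming the SparkApplication CRD is served under the `spark.stackable.tech` API group and that the Airflow Pods run under a ServiceAccount named `airflow` in the `default` namespace (both names are illustrative; the included examples are authoritative), the pair of manifests could look like this:

[source,yaml]
----
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: airflow-spark-clusterrole
rules:
  - apiGroups: ["spark.stackable.tech"]
    resources: ["sparkapplications"]
    verbs: ["create", "get", "list", "watch", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: airflow-spark-clusterrole-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: airflow-spark-clusterrole
subjects:
  - kind: ServiceAccount
    name: airflow        # the ServiceAccount used by the Airflow Pods (assumption)
    namespace: default
----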

== DAG code

Now for the DAG itself. The job to be started is a simple Spark job that calculates the value of pi:

[source,yaml]
----
include::example$example-pyspark-pi.yaml[]
----

This will be called from within a DAG by using the connection that was defined earlier. It will be wrapped by the `KubernetesHook` that the Airflow Kubernetes provider makes available https://github.com/apache/airflow/blob/main/airflow/providers/cncf/kubernetes/operators/spark_kubernetes.py[here]. There are two classes that are used to:

- start the job
- monitor the status of the job

These are written in-line in the Python code below, though this is just to make it clear how the code is used (the classes `SparkKubernetesOperator` and `SparkKubernetesSensor` will be used for all custom resources and thus are best defined in separate Python files that the DAG would reference).

[source,python]
----
include::example$example-spark-dag.py[]
----
<1> the wrapper class used for calling the job via `KubernetesHook`
<2> the connection that was created for in-cluster usage
<3> the wrapper class used for monitoring the job via `KubernetesHook`
<4> the start of the DAG code
<5> the initial task to invoke the job
<6> the subsequent task to monitor the job
<7> the jobs are chained together in the correct order

Once this DAG is xref:usage-guide/mounting-dags.adoc[mounted] in the DAG folder, it can be called and its progress viewed from within the Webserver UI:

image::airflow_dag_graph.png[Airflow DAG graph]

Clicking on the "spark_pi_monitor" task and selecting the logs shows that the status of the job has been tracked by Airflow:

image::airflow_dag_log.png[Airflow task log]
1 change: 1 addition & 0 deletions docs/modules/airflow/pages/usage-guide/index.adoc
@@ -0,0 +1 @@
= Usage guide
43 changes: 43 additions & 0 deletions docs/modules/airflow/pages/usage-guide/logging.adoc
@@ -0,0 +1,43 @@
= Log aggregation

The logs can be forwarded to a Vector log aggregator by providing a discovery
ConfigMap for the aggregator and by enabling the log agent:

[source,yaml]
----
spec:
vectorAggregatorConfigMapName: vector-aggregator-discovery
webservers:
config:
logging:
enableVectorAgent: true
containers:
airflow:
loggers:
"flask_appbuilder":
level: WARN
workers:
config:
logging:
enableVectorAgent: true
containers:
airflow:
loggers:
"airflow.processor":
level: INFO
schedulers:
config:
logging:
enableVectorAgent: true
containers:
airflow:
loggers:
"airflow.processor_manager":
level: INFO
databaseInitialization:
logging:
enableVectorAgent: true
----

Further information on how to configure logging can be found in
xref:home:concepts:logging.adoc[].
4 changes: 4 additions & 0 deletions docs/modules/airflow/pages/usage-guide/monitoring.adoc
@@ -0,0 +1,4 @@
= Monitoring

The managed Airflow instances are automatically configured to export Prometheus metrics. See
xref:home:operators:monitoring.adoc[] for more details.
51 changes: 51 additions & 0 deletions docs/modules/airflow/pages/usage-guide/mounting-dags.adoc
@@ -0,0 +1,51 @@
= Mounting DAGs

DAGs can be mounted by using a `ConfigMap` or `git-sync`. This is best illustrated with an example of each, shown in the sections below.

== via `ConfigMap`

[source,python]
----
include::example$example-configmap.yaml[]
----
<1> The name of the configuration map
<2> The name of the DAG (this is a renamed copy of the `example_bash_operator.py` from the Airflow examples)

[source,yaml]
----
include::example$example-airflow-dags-configmap.yaml[]
----
<3> The volume backed by the configuration map
<4> The name of the configuration map referenced by the Airflow cluster
<5> The name of the mounted volume
<6> The path of the mounted resource. Note that this should map to a single DAG.
<7> The resource has to be defined using `subPath`: this is to prevent the versioning of configuration map elements, which may cause a conflict with how Airflow propagates DAGs between its components.
<8> If the mount path described above is anything other than the standard location (the default is `$AIRFLOW_HOME/dags`), then the location should be defined using the relevant environment variable.
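
For orientation, a condensed sketch of the two pieces involved, a ConfigMap carrying a DAG and the volume wiring that mounts it, is shown below. Names, paths and the exact placement of the volume fields are assumptions; the included examples are authoritative:

[source,yaml]
----
apiVersion: v1
kind: ConfigMap
metadata:
  name: airflow-dags                     # referenced from the AirflowCluster fragment below
data:
  my_dag.py: |
    # a DAG definition would go here
    ...
---
# fragment of the AirflowCluster spec (placement of these fields is an assumption)
volumes:
  - name: dags
    configMap:
      name: airflow-dags
volumeMounts:
  - name: dags
    mountPath: /dags/my_dag.py           # must map to a single DAG
    subPath: my_dag.py                   # subPath avoids ConfigMap versioning conflicts
envOverrides:
  AIRFLOW__CORE__DAGS_FOLDER: "/dags"    # needed because the mount path is not $AIRFLOW_HOME/dags
----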

The advantage of this approach is that a DAG can be provided "in-line", as it were. This becomes cumbersome when multiple DAGs are to be made available in this way, as each one has to be mapped individually. For multiple DAGs it is probably easier to expose them all via a mounted volume, which is shown below.

== via `git-sync`

=== Overview

https://github.com/kubernetes/git-sync/tree/release-3.x[git-sync] is a command that pulls a git repository into a local directory and is supplied as a sidecar container for use within Kubernetes. The Stackable implementation is a wrapper around this such that the binary and image requirements are included in the Stackable Airflow product images and do not need to be specified or handled in the `AirflowCluster` custom resource. Internal details such as image names and volume mounts are handled by the operator, so that only the repository and synchronisation details are required. An example of this usage is given in the next section.

=== Example

[source,yaml]
----
include::example$example-airflow-gitsync.yaml[]
----

<1> A `Secret` used for accessing database and admin user details (included here to illustrate where different credential secrets are defined)
<2> The git-sync configuration block that contains a list of git-sync elements
<3> The repository that will be cloned (required)
<4> The branch name (defaults to `main`)
<5> The location of the DAG folder, relative to the synced repository root (required)
<6> The depth of syncing i.e. the number of commits to clone (defaults to 1)
<7> The synchronisation interval in seconds (defaults to 20 seconds)
<8> The name of the `Secret` used to access the repository if it is not public. This should include two fields: `user` and `password` (which can be either a password - which is not recommended - or a github token, as described https://github.com/kubernetes/git-sync/tree/v3.6.4#flags-which-configure-authentication[here])
<9> A map of optional configuration settings that are listed in https://github.com/kubernetes/git-sync/tree/v3.6.4#primary-flags[this] configuration section (and the ones that follow on that link)
<10> An example showing how to specify a target revision (the default is HEAD). The revision can also be a tag or a commit, though this assumes that the target hash is contained within the number of commits specified by `depth`. If a tag or commit hash is specified, then git-sync will recognise that and not perform further cloning.


IMPORTANT: The example above shows a _*list*_ of git-sync definitions, with a single element. This is to avoid breaking changes in future releases. Currently, only one such git-sync definition is considered and processed.
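
For orientation, a hypothetical sketch of such a list with a single element is shown below. The top-level key and field names are assumptions inferred from the callouts above and may not match the actual CRD schema; the included example is authoritative:

[source,yaml]
----
dagsGitSync:                                          # a list; only the first element is processed
  - repo: https://github.com/example/airflow-dags    # the repository to clone
    branch: main
    gitFolder: dags                                   # DAG folder relative to the repository root
    depth: 1
    wait: 20                                          # synchronisation interval in seconds
    credentialsSecret: git-credentials                # only needed for non-public repositories
    gitSyncConf:
      --rev: HEAD                                     # optional additional git-sync settings
----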
42 changes: 42 additions & 0 deletions docs/modules/airflow/pages/usage-guide/overrides.adoc
@@ -0,0 +1,42 @@

= Configuration & Environment Overrides

The cluster definition also supports overriding configuration properties and environment variables, either per role or per role group, where the more specific override (role group) has precedence over the less specific one (role).

IMPORTANT: Overriding certain properties which are set by the operator (such as the HTTP port) can interfere with the operator and lead to problems. Additionally, for Airflow it is recommended
that each component has the same configuration: not all components use each setting, but some things, such as external endpoints, need to be consistent for things to work as expected.

== Configuration Properties

Airflow exposes an environment variable for every Airflow configuration setting, a list of which can be found in the https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html[Configuration Reference].
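
For example, a setting from the `[webserver]` section of the configuration file maps to an environment variable like this:

[source]
----
# configuration setting                 corresponding environment variable
[webserver] auto_refresh_interval  ->   AIRFLOW__WEBSERVER__AUTO_REFRESH_INTERVAL
----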

Although these settings can be overridden in one of two ways (configuration overrides or environment variable overrides), the effect is the same,
and currently only the latter is implemented. This is described in the following section.

== Environment Variables

These can be set - or overwritten - at either the role level:

[source,yaml]
----
webservers:
envOverrides:
AIRFLOW__WEBSERVER__AUTO_REFRESH_INTERVAL: "8"
roleGroups:
default:
replicas: 1
----

Or per role group:

[source,yaml]
----
webservers:
roleGroups:
default:
envOverrides:
AIRFLOW__WEBSERVER__AUTO_REFRESH_INTERVAL: "8"
replicas: 1
----

In both examples above we are replacing the default value of the UI DAG refresh interval (3s) with 8s. Note that all override property values must be strings.
@@ -5,4 +5,4 @@ You can configure the Pod placement of the Airflow pods as described in xref:con
The default affinities created by the operator are:

1. Co-locate all the Airflow Pods (weight 20)
2. Distribute all Pods within the same role (worker, webserver, scheduler) (weight 70)
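
Expressed as Kubernetes affinity terms, the generated defaults roughly follow the pattern sketched below; the label keys and values are assumptions based on common Stackable labelling conventions, with the weights corresponding to the list above:

[source,yaml]
----
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 20                                    # co-locate all Airflow Pods of the cluster
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: airflow
              app.kubernetes.io/instance: airflow     # the cluster name
          topologyKey: kubernetes.io/hostname
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 70                                    # spread Pods of the same role, e.g. worker
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: airflow
              app.kubernetes.io/instance: airflow
              app.kubernetes.io/component: worker
          topologyKey: kubernetes.io/hostname
----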
67 changes: 67 additions & 0 deletions docs/modules/airflow/pages/usage-guide/security.adoc
@@ -0,0 +1,67 @@
= Security

== Authentication
Every user has to authenticate themselves before using Airflow and there are several ways of doing this.

=== Webinterface
The default setting is to view and manually set up users via the Webserver UI. Note the blue "+" button where users can be added directly:

image::airflow_security.png[Airflow Security menu]

=== LDAP

Airflow supports xref:nightly@home:concepts:authentication.adoc[authentication] of users against an LDAP server. This requires setting up an xref:nightly@home:concepts:authentication.adoc#authenticationclass[AuthenticationClass] for the LDAP server.
The AuthenticationClass is then referenced in the AirflowCluster resource as follows:

[source,yaml]
----
apiVersion: airflow.stackable.tech/v1alpha1
kind: AirflowCluster
metadata:
name: airflow-with-ldap
spec:
image:
productVersion: 2.4.1
stackableVersion: 23.4.0-rc2
[...]
authenticationConfig:
authenticationClass: ldap # <1>
userRegistrationRole: Admin # <2>
----

<1> The reference to an AuthenticationClass called `ldap`
<2> The default role that all users are assigned to

Users that log in with LDAP are assigned to a default https://airflow.apache.org/docs/apache-airflow/stable/security/access-control.html#access-control[Role] which is specified with the `userRegistrationRole` property.

You can follow the xref:nightly@home:tutorials:authentication_with_openldap.adoc[] tutorial to learn how to set up an AuthenticationClass for an LDAP server, and consult the xref:nightly@home:reference:authenticationclass.adoc[] reference for more details.

The users and roles can be viewed as before in the Webserver UI, but note that the blue "+" button is not available when authenticating against LDAP:

image::airflow_security_ldap.png[Airflow Security menu]

== Authorization
The Airflow Webserver delegates the https://airflow.apache.org/docs/apache-airflow/stable/security/access-control.html[handling of user access control] to https://flask-appbuilder.readthedocs.io/en/latest/security.html[Flask AppBuilder].

=== Webinterface
You can view, add to, and assign the roles displayed in the Airflow Webserver UI to existing users.

=== LDAP

Airflow supports assigning https://airflow.apache.org/docs/apache-airflow/stable/security/access-control.html#access-control[Roles] to users based on their LDAP group membership, though this is not yet supported by the Stackable operator.
All the users logging in via LDAP get assigned to the same role, which you can configure via the attribute `authenticationConfig.userRegistrationRole` on the `AirflowCluster` object:

[source,yaml]
----
apiVersion: airflow.stackable.tech/v1alpha1
kind: AirflowCluster
metadata:
name: airflow-with-ldap
spec:
[...]
authenticationConfig:
authenticationClass: ldap
userRegistrationRole: Admin # <1>
----

<1> All users are assigned to the `Admin` role