From ed13ca920181df0b57df9e94b298cd4d66c5b8ee Mon Sep 17 00:00:00 2001
From: Felix Hennig
Date: Wed, 23 Aug 2023 14:58:10 +0200
Subject: [PATCH 1/8] remove unneccessary text

---
 docs/modules/hdfs/pages/index.adoc | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/docs/modules/hdfs/pages/index.adoc b/docs/modules/hdfs/pages/index.adoc
index 34d4f18c..3f2ef9a0 100644
--- a/docs/modules/hdfs/pages/index.adoc
+++ b/docs/modules/hdfs/pages/index.adoc
@@ -2,8 +2,6 @@
 
 The Stackable Operator for https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html[Apache HDFS] is used to set up HFDS in high-availability mode. It depends on the xref:zookeeper:ROOT:index.adoc[] to operate a ZooKeeper cluster to coordinate the active and standby NameNodes.
 
-NOTE: This operator only works with images from the https://repo.stackable.tech/#browse/browse:docker:v2%2Fstackable%2Fhadoop[Stackable] repository
-
 == Roles
 
 Three xref:home:concepts:roles-and-role-groups.adoc[roles] of the HDFS cluster are implemented:
@@ -33,10 +31,3 @@ In the custom resource you can specify the number of replicas per role group (Na
 The Stackable Operator for Apache HDFS currently supports the following versions of HDFS:
 
 include::partial$supported-versions.adoc[]
-
-== Docker image
-
-[source]
-----
-docker pull docker.stackable.tech/stackable/hadoop:
-----

From 2bae289b9defae49cf5f9c4e6bb43578e72aad3c Mon Sep 17 00:00:00 2001
From: Felix Hennig
Date: Wed, 23 Aug 2023 15:03:13 +0200
Subject: [PATCH 2/8] New intro text

---
 docs/modules/hdfs/pages/index.adoc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/modules/hdfs/pages/index.adoc b/docs/modules/hdfs/pages/index.adoc
index 3f2ef9a0..33361fa3 100644
--- a/docs/modules/hdfs/pages/index.adoc
+++ b/docs/modules/hdfs/pages/index.adoc
@@ -1,6 +1,6 @@
 = Stackable Operator for Apache HDFS
 
-The Stackable Operator for https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html[Apache HDFS] is used to set up HFDS in high-availability mode. It depends on the xref:zookeeper:ROOT:index.adoc[] to operate a ZooKeeper cluster to coordinate the active and standby NameNodes.
+The Stackable Operator for https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html[Apache HDFS] (Hadoop Distributed File System) is used to set up HFDS in high-availability mode. HDFS is a distributed file system designed to store and manage massive amounts of data across multiple machines in a fault-tolerant manner. The Operator depends on the xref:zookeeper:index.adoc[] to operate a ZooKeeper cluster to coordinate the active and standby NameNodes.
 
 == Roles
 

From c502fce73d163cfb7f47c61a31c5292fa40b50a9 Mon Sep 17 00:00:00 2001
From: Felix Hennig
Date: Wed, 23 Aug 2023 15:07:02 +0200
Subject: [PATCH 3/8] Added getting started blurp

---
 docs/modules/hdfs/pages/index.adoc | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/docs/modules/hdfs/pages/index.adoc b/docs/modules/hdfs/pages/index.adoc
index 33361fa3..bae37571 100644
--- a/docs/modules/hdfs/pages/index.adoc
+++ b/docs/modules/hdfs/pages/index.adoc
@@ -2,7 +2,15 @@
 
 The Stackable Operator for https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html[Apache HDFS] (Hadoop Distributed File System) is used to set up HFDS in high-availability mode. HDFS is a distributed file system designed to store and manage massive amounts of data across multiple machines in a fault-tolerant manner. The Operator depends on the xref:zookeeper:index.adoc[] to operate a ZooKeeper cluster to coordinate the active and standby NameNodes.
 
-== Roles
+== Getting started
+
+Follow the xref:getting_started/index.adoc[Getting started guide] which will guide you through installing the Stackable HDFS and ZooKeeper Operators, setting up ZooKeeper and HDFS and writing a file to HDFS to verify that everything is set up correctly.
+
+Afterwards you can consult the xref:usage-guide/index.adoc[] to learn more about tailoring your HDFS configuration to your needs, or have a look at the <> for some example setups.
+
+== Operator model
+
+=== Roles
 
 Three xref:home:concepts:roles-and-role-groups.adoc[roles] of the HDFS cluster are implemented:
 
@@ -10,7 +18,7 @@ Three xref:home:concepts:roles-and-role-groups.adoc[roles] of the HDFS cluster a
 * JournalNode - responsible for keeping track of HDFS blocks and used to perform failovers in case the active NameNode fails. For details see: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
 * NameNode - responsible for keeping track of HDFS blocks and providing access to the data.
 
-== Kubernetes objects
+=== Kubernetes objects
 
 The operator creates the following K8S objects per role group defined in the custom resource.
 
@@ -26,6 +34,10 @@ In the custom resource you can specify the number of replicas per role group (Na
 * 1 JournalNode
 * 1 DataNode (should match at least the `clusterConfig.dfsReplication` factor)
 
+== [[demos]]Demos
+
+TODO
+
 == Supported Versions
 
 The Stackable Operator for Apache HDFS currently supports the following versions of HDFS:

From 7f0d0d6e3d09881345fe976f67ada3542971cc04 Mon Sep 17 00:00:00 2001
From: Felix Hennig
Date: Wed, 23 Aug 2023 15:19:47 +0200
Subject: [PATCH 4/8] Added demos

---
 docs/modules/hdfs/pages/index.adoc | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/docs/modules/hdfs/pages/index.adoc b/docs/modules/hdfs/pages/index.adoc
index bae37571..3039036b 100644
--- a/docs/modules/hdfs/pages/index.adoc
+++ b/docs/modules/hdfs/pages/index.adoc
@@ -36,7 +36,11 @@ In the custom resource you can specify the number of replicas per role group (Na
 
 == [[demos]]Demos
 
-TODO
+Two demos that use HDFS are available.
+
+**xref:stackablectl:demos/hbase-hdfs-load-cycling-data.adoc[]** loads a dataset of cycling data from S3 into HDFS and then uses HBase to analyse the data.
+
+**xref:stackablectl:demos/jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data.adoc[]** showcases the integration between HDFS and Jupyter. New York Taxi data is stored in HDFS and analyzed in a Jupyter notebook.
 
 == Supported Versions
 

From 9908ac5fd133afaa3b45db143674e85035921576 Mon Sep 17 00:00:00 2001
From: Felix Hennig
Date: Wed, 23 Aug 2023 16:09:35 +0200
Subject: [PATCH 5/8] Added operator model

---
 .../hdfs/images/hdfs_overview.drawio.svg | 4 ++++
 docs/modules/hdfs/pages/index.adoc | 19 +++++++++++++------
 2 files changed, 17 insertions(+), 6 deletions(-)
 create mode 100644 docs/modules/hdfs/images/hdfs_overview.drawio.svg

diff --git a/docs/modules/hdfs/images/hdfs_overview.drawio.svg b/docs/modules/hdfs/images/hdfs_overview.drawio.svg
new file mode 100644
index 00000000..e0ba5c59
--- /dev/null
+++ b/docs/modules/hdfs/images/hdfs_overview.drawio.svg
@@ -0,0 +1,4 @@
+
+
+
+
[SVG markup of hdfs_overview.drawio.svg not reproduced. Text labels in the diagram: HDFS Operator; HdfsCluster <name> (Custom Resource); "for each role (dataNode, journalNode, nameNode):" role <role> with Service <name>-<role>; role groups <rg1> and <rg2>, each with StatefulSet, Service and ConfigMap <name>-<role>-<rg>, and Pods <name>-<role>-<rg1>-0, <name>-<role>-<rg1>-1, <name>-<role>-<rg2>-0; discovery ConfigMap <name>; edge labels: create, read, references; legend: Operator, Resource, Custom Resource.]
\ No newline at end of file
diff --git a/docs/modules/hdfs/pages/index.adoc b/docs/modules/hdfs/pages/index.adoc
index 3039036b..edebbc46 100644
--- a/docs/modules/hdfs/pages/index.adoc
+++ b/docs/modules/hdfs/pages/index.adoc
@@ -1,4 +1,6 @@
 = Stackable Operator for Apache HDFS
+:description: The Stackable Operator for Apache HDFS is a Kubernetes operator that can manage Apache HDFS clusters. Learn about its features, resources, dependencies and demos, and see the list of supported HDFS versions.
+:keywords: Stackable Operator, Hadoop, Apache HDFS, Kubernetes, k8s, operator, engineer, big data, metadata, storage, cluster, distributed storage
 
 The Stackable Operator for https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html[Apache HDFS] (Hadoop Distributed File System) is used to set up HFDS in high-availability mode. HDFS is a distributed file system designed to store and manage massive amounts of data across multiple machines in a fault-tolerant manner. The Operator depends on the xref:zookeeper:index.adoc[] to operate a ZooKeeper cluster to coordinate the active and standby NameNodes.
 
@@ -10,15 +12,14 @@ Afterwards you can consult the xref:usage-guide/index.adoc[] to learn more about
 
 == Operator model
 
-=== Roles
-
-Three xref:home:concepts:roles-and-role-groups.adoc[roles] of the HDFS cluster are implemented:
+The Operator manages the _HdfsCluster_ custom resource. The cluster implements three xref:home:concepts:roles-and-role-groups.adoc[roles]:
 
 * DataNode - responsible for storing the actual data.
 * JournalNode - responsible for keeping track of HDFS blocks and used to perform failovers in case the active NameNode fails. For details see: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
 * NameNode - responsible for keeping track of HDFS blocks and providing access to the data.
 
-=== Kubernetes objects
+
+image::hdfs_overview.drawio.svg[A diagram depicting the Kubernetes resources created by the Stackable Operator for Apache HDFS]
 
 The operator creates the following K8S objects per role group defined in the custom resource.
 
@@ -34,13 +35,19 @@ In the custom resource you can specify the number of replicas per role group (Na
 * 1 JournalNode
 * 1 DataNode (should match at least the `clusterConfig.dfsReplication` factor)
 
+The Operator creates a xref:concepts:service_discovery.adoc[service discovery ConfigMap] for the HDFS instance. The discovery ConfigMaps contains the `core-site.xml` file and the `hdfs-site.xml` file.
+
+== Dependencies
+
+HDFS depends on ZooKeeper for coordination between nodes. You can run a ZooKeeper cluster with the xref:zookeeper:index.adoc[]. Additionally the xref:commons-operator:index.adoc[] and xref:secret-operator:index.adoc[] are needed.
+
 == [[demos]]Demos
 
 Two demos that use HDFS are available.
 
-**xref:stackablectl:demos/hbase-hdfs-load-cycling-data.adoc[]** loads a dataset of cycling data from S3 into HDFS and then uses HBase to analyse the data.
+**xref:stackablectl::demos/hbase-hdfs-load-cycling-data.adoc[]** loads a dataset of cycling data from S3 into HDFS and then uses HBase to analyse the data.
 
-**xref:stackablectl:demos/jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data.adoc[]** showcases the integration between HDFS and Jupyter. New York Taxi data is stored in HDFS and analyzed in a Jupyter notebook.
+**xref:stackablectl::demos/jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data.adoc[]** showcases the integration between HDFS and Jupyter. New York Taxi data is stored in HDFS and analyzed in a Jupyter notebook.
 
 == Supported Versions
 

From 0fb47077bee6c58545577e9c0abbc533cf15f1e3 Mon Sep 17 00:00:00 2001
From: Felix Hennig
Date: Wed, 23 Aug 2023 16:13:40 +0200
Subject: [PATCH 6/8] fixed typo

---
 docs/modules/hdfs/pages/index.adoc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/modules/hdfs/pages/index.adoc b/docs/modules/hdfs/pages/index.adoc
index edebbc46..9601f40c 100644
--- a/docs/modules/hdfs/pages/index.adoc
+++ b/docs/modules/hdfs/pages/index.adoc
@@ -45,7 +45,7 @@ HDFS depends on ZooKeeper for coordination between nodes. You can run a ZooKeepe
 
 Two demos that use HDFS are available.
 
-**xref:stackablectl::demos/hbase-hdfs-load-cycling-data.adoc[]** loads a dataset of cycling data from S3 into HDFS and then uses HBase to analyse the data.
+**xref:stackablectl::demos/hbase-hdfs-load-cycling-data.adoc[]** loads a dataset of cycling data from S3 into HDFS and then uses HBase to analyze the data.
 
 **xref:stackablectl::demos/jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data.adoc[]** showcases the integration between HDFS and Jupyter. New York Taxi data is stored in HDFS and analyzed in a Jupyter notebook.
 

From ac26db92ac469dcdc653f126baa8b7e70d5b43f8 Mon Sep 17 00:00:00 2001
From: Felix Hennig
Date: Thu, 24 Aug 2023 08:58:26 +0200
Subject: [PATCH 7/8] Update docs/modules/hdfs/pages/index.adoc

Co-authored-by: Malte Sander
---
 docs/modules/hdfs/pages/index.adoc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/modules/hdfs/pages/index.adoc b/docs/modules/hdfs/pages/index.adoc
index 9601f40c..7d8d455b 100644
--- a/docs/modules/hdfs/pages/index.adoc
+++ b/docs/modules/hdfs/pages/index.adoc
@@ -35,7 +35,7 @@ In the custom resource you can specify the number of replicas per role group (Na
 * 1 JournalNode
 * 1 DataNode (should match at least the `clusterConfig.dfsReplication` factor)
 
-The Operator creates a xref:concepts:service_discovery.adoc[service discovery ConfigMap] for the HDFS instance. The discovery ConfigMaps contains the `core-site.xml` file and the `hdfs-site.xml` file.
+The Operator creates a xref:concepts:service_discovery.adoc[service discovery ConfigMap] for the HDFS instance. The discovery ConfigMap contains the `core-site.xml` file and the `hdfs-site.xml` file.
 
 == Dependencies
 

From 86e911f08ae26856ec608aa0eeb50aa37103645b Mon Sep 17 00:00:00 2001
From: Felix Hennig
Date: Thu, 24 Aug 2023 08:58:35 +0200
Subject: [PATCH 8/8] Update docs/modules/hdfs/pages/index.adoc

Co-authored-by: Malte Sander
---
 docs/modules/hdfs/pages/index.adoc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/modules/hdfs/pages/index.adoc b/docs/modules/hdfs/pages/index.adoc
index 7d8d455b..a33eac4c 100644
--- a/docs/modules/hdfs/pages/index.adoc
+++ b/docs/modules/hdfs/pages/index.adoc
@@ -39,7 +39,7 @@ The Operator creates a xref:concepts:service_discovery.adoc[service discovery Co
 
 == Dependencies
 
-HDFS depends on ZooKeeper for coordination between nodes. You can run a ZooKeeper cluster with the xref:zookeeper:index.adoc[]. Additionally, the xref:commons-operator:index.adoc[] and xref:secret-operator:index.adoc[] are needed.
+HDFS depends on ZooKeeper for coordination between nodes. You can run a ZooKeeper cluster with the xref:zookeeper:index.adoc[]. Additionally, the xref:commons-operator:index.adoc[] and xref:secret-operator:index.adoc[] are needed.
 
 == [[demos]]Demos
 
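
Reviewer note: the overview diagram added in patch 5 labels the generated objects with the pattern `<name>-<role>-<rg>` (StatefulSet, Service, ConfigMap per role group), `<name>-<role>` (Service per role), `<name>` (discovery ConfigMap) and `<name>-<role>-<rg>-<ordinal>` (Pods). The following is only a rough sketch of that naming scheme for checking the diagram against a concrete cluster; it is not taken from the operator's code. The function name `expected_object_names`, the cluster name `simple-hdfs`, the lowercase role strings, the role group name `default` and the replica counts are all hypothetical.

```python
from typing import Dict, List


def expected_object_names(
    cluster: str, role_groups: Dict[str, Dict[str, int]]
) -> Dict[str, List[str]]:
    """Derive object names from the naming pattern shown in the diagram labels.

    role_groups maps a role to {role group name: replica count}.
    This is an illustration only, not the operator's implementation.
    """
    names: Dict[str, List[str]] = {
        "discovery_configmap": [cluster],  # ConfigMap <name>
        "role_services": [],               # Service <name>-<role>
        "role_group_objects": [],          # StatefulSet/Service/ConfigMap <name>-<role>-<rg>
        "pods": [],                        # Pod <name>-<role>-<rg>-<ordinal>
    }
    for role, groups in role_groups.items():
        names["role_services"].append(f"{cluster}-{role}")
        for group, replicas in groups.items():
            base = f"{cluster}-{role}-{group}"
            names["role_group_objects"].append(base)
            # StatefulSet Pods get a zero-based ordinal suffix.
            names["pods"].extend(f"{base}-{i}" for i in range(replicas))
    return names


if __name__ == "__main__":
    # Hypothetical cluster: values chosen for illustration only.
    example = expected_object_names(
        "simple-hdfs",
        {
            "namenode": {"default": 2},
            "journalnode": {"default": 1},
            "datanode": {"default": 1},
        },
    )
    for kind, object_names in example.items():
        print(kind, object_names)
```

Listing the names this way makes it easy to compare the diagram's placeholders with the output of `kubectl get statefulsets,services,configmaps,pods` on a test cluster.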