
Commit 8717d11

snocke and sbernauer committed
121 add documentation hbase hdfs demo (#122)
## Description

This adds documentation to the hdfs-hbase demo.

Co-authored-by: Sebastian Bernauer <[email protected]>
Co-authored-by: Simon Nocke <[email protected]>
1 parent 8a44d5a commit 8717d11

File tree

10 files changed, +175 -0 lines changed


docs/modules/ROOT/nav.adoc

+1
@@ -8,6 +8,7 @@
 ** xref:commands/stack.adoc[]
 * xref:demos/index.adoc[]
 ** xref:demos/airflow-scheduled-job.adoc[]
+** xref:demos/hbase-hdfs-load-cycling-data.adoc[]
 ** xref:demos/nifi-kafka-druid-earthquake-data.adoc[]
 ** xref:demos/nifi-kafka-druid-water-level-data.adoc[]
 ** xref:demos/trino-taxi-data.adoc[]
@@ -0,0 +1,174 @@
= hbase-hdfs-cycling-data

[NOTE]
====
This guide assumes you already have the demo `hbase-hdfs-load-cycling-data` installed.
If you don't have it installed, please follow the xref:commands/demo.adoc#_install_demo[documentation on how to install a demo].
To put it simply, you have to run `stackablectl demo install hbase-hdfs-load-cycling-data`.
====

This demo will:

* Install the required Stackable operators
* Spin up the following data products
** *HBase*: An open source, distributed, scalable big data store. This demo uses it to store the https://www.kaggle.com/datasets/timgid/cyclistic-dataset-google-certificate-capstone?select=Divvy_Trips_2020_Q1.csv[cyclistic dataset] and enable access to it
** *HDFS*: A distributed file system used to intermediately store the dataset before importing it into HBase
* Use https://hadoop.apache.org/docs/stable/hadoop-distcp/DistCp.html[distcp] to copy the https://www.kaggle.com/datasets/timgid/cyclistic-dataset-google-certificate-capstone?select=Divvy_Trips_2020_Q1.csv[cyclistic dataset] from an S3 bucket into HDFS
* Create HFiles, a file format for HBase consisting of sorted key/value pairs, where both keys and values are byte arrays
* Load the HFiles into an existing table via the `ImportTsv` utility, which loads data in `TSV` or `CSV` format into HBase
* Query data via the HBase shell, an interactive shell used to execute commands on the created table

You can see the deployed products as well as their relationship in the following diagram:

image::demo-hbase-hdfs-load-cycling-data/overview.png[]
24+
25+
== List deployed Stackable services
26+
To list the installed Stackable services run the following command:
27+
`stackablectl services list --all-namespaces`
[source,console]
----
PRODUCT    NAME       NAMESPACE  ENDPOINTS                                               EXTRA INFOS

hbase      hbase      default    regionserver                   172.18.0.5:32282
                                 ui                             http://172.18.0.5:31527
                                 metrics                        172.18.0.5:31081

hdfs       hdfs       default    datanode-default-0-metrics     172.18.0.2:31441
                                 datanode-default-0-data        172.18.0.2:32432
                                 datanode-default-0-http        http://172.18.0.2:30758
                                 datanode-default-0-ipc         172.18.0.2:32323
                                 journalnode-default-0-metrics  172.18.0.5:31123
                                 journalnode-default-0-http     http://172.18.0.5:30038
                                 journalnode-default-0-https    https://172.18.0.5:31996
                                 journalnode-default-0-rpc      172.18.0.5:30080
                                 namenode-default-0-metrics     172.18.0.2:32753
                                 namenode-default-0-http        http://172.18.0.2:32475
                                 namenode-default-0-rpc         172.18.0.2:31639
                                 namenode-default-1-metrics     172.18.0.4:32202
                                 namenode-default-1-http        http://172.18.0.4:31486
                                 namenode-default-1-rpc         172.18.0.4:31874

zookeeper  zookeeper  default    zk                             172.18.0.4:32469
----

[NOTE]
====
When a product instance has not finished starting yet, the service will have no endpoint.
Starting all of the product instances might take a considerable amount of time depending on your internet connectivity.
In case a product is not ready yet, a warning might be shown.
====

== The first Job

https://hadoop.apache.org/docs/stable/hadoop-distcp/DistCp.html[DistCp] (distributed copy) is a tool used for large inter- and intra-cluster copying. It uses MapReduce to effect its distribution, error handling, recovery, and reporting. It expands a list of files and directories into input for map tasks, each of which copies a partition of the files specified in the source list.
Therefore, the first Job uses DistCp to copy data from an S3 bucket into HDFS. Below you'll see parts of the logs.

65+
66+
[source]
67+
----
68+
Copying s3a://public-backup-nyc-tlc/cycling-tripdata/demo-cycling-tripdata.csv.gz to hdfs://hdfs/data/raw/demo-cycling-tripdata.csv.gz
69+
[LocalJobRunner Map Task Executor #0] mapred.RetriableFileCopyCommand (RetriableFileCopyCommand.java:getTempFile(235)) - Creating temp file: hdfs://hdfs/data/raw/.distcp.tmp.attempt_local60745921_0001_m_000000_0.1663687068145
70+
[LocalJobRunner Map Task Executor #0] mapred.RetriableFileCopyCommand (RetriableFileCopyCommand.java:doCopy(127)) - Writing to temporary target file path hdfs://hdfs/data/raw/.distcp.tmp.attempt_local60745921_0001_m_000000_0.1663687068145
71+
[LocalJobRunner Map Task Executor #0] mapred.RetriableFileCopyCommand (RetriableFileCopyCommand.java:doCopy(153)) - Renaming temporary target file path hdfs://hdfs/data/raw/.distcp.tmp.attempt_local60745921_0001_m_000000_0.1663687068145 to hdfs://hdfs/data/raw/demo-cycling-tripdata.csv.gz
72+
[LocalJobRunner Map Task Executor #0] mapred.RetriableFileCopyCommand (RetriableFileCopyCommand.java:doCopy(157)) - Completed writing hdfs://hdfs/data/raw/demo-cycling-tripdata.csv.gz (3342891 bytes)
73+
[LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(634)) -
74+
[LocalJobRunner Map Task Executor #0] mapred.Task (Task.java:done(1244)) - Task:attempt_local60745921_0001_m_000000_0 is done. And is in the process of committing
75+
[LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(634)) -
76+
[LocalJobRunner Map Task Executor #0] mapred.Task (Task.java:commit(1421)) - Task attempt_local60745921_0001_m_000000_0 is allowed to commit now
77+
[LocalJobRunner Map Task Executor #0] output.FileOutputCommitter (FileOutputCommitter.java:commitTask(609)) - Saved output of task 'attempt_local60745921_0001_m_000000_0' to file:/tmp/hadoop/mapred/staging/stackable339030898/.staging/_distcp-1760904616/_logs
78+
[LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(634)) - 100.0% Copying s3a://public-backup-nyc-tlc/cycling-tripdata/demo-cycling-tripdata.csv.gz to hdfs://hdfs/data/raw/demo-cycling-tripdata.csv.gz
79+
----
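
For reference, the copy seen in the logs corresponds to a DistCp invocation of roughly this shape. This is only a sketch, not the demo's exact Job definition; the demo may set additional S3A options:

[source,console]
----
# Sketch only: copy the dataset from the public S3 bucket into HDFS
# (source and target paths are taken from the log output above)
hadoop distcp \
  s3a://public-backup-nyc-tlc/cycling-tripdata/demo-cycling-tripdata.csv.gz \
  hdfs://hdfs/data/raw
----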

== The second Job

The second Job consists of two steps.

First, we use `org.apache.hadoop.hbase.mapreduce.ImportTsv` (see the https://hbase.apache.org/book.html#importtsv[ImportTsv docs]) to create a table and HFiles.
An HFile is HBase's dedicated file format, optimized for performance. It stores meta information about the data and thus increases the performance of HBase.
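
A hedged sketch of what such an `ImportTsv` call writing HFiles (instead of inserting rows directly) can look like; the column mapping and output path are illustrative assumptions, not the demo's exact invocation:

[source,console]
----
# Sketch only: parse the CSV and write HFiles for later bulk loading
# (column mapping and output path are illustrative assumptions)
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.separator=',' \
  -Dimporttsv.columns=HBASE_ROW_KEY,started_at:started_at,ended_at:ended_at \
  -Dimporttsv.bulk.output=hdfs://hdfs/data/hfile \
  cycling-tripdata hdfs://hdfs/data/raw/demo-cycling-tripdata.csv.gz
----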

When connecting to the HBase master, opening a `bin/hbase shell` and executing `list`, you will see the created table. However, it'll contain 0 rows at this point.
You can connect to the shell via:

[source]
----
kubectl exec -it hbase-master-default-0 -- bin/hbase shell
----

If you use k9s, you can open a shell in the `hbase-master-default-0` pod and execute `bin/hbase shell list`.

[source]
----
TABLE
cycling-tripdata
----

Secondly, we'll use `org.apache.hadoop.hbase.tool.LoadIncrementalHFiles` (see the https://hbase.apache.org/book.html#arch.bulk.load[bulk load docs]) to import the HFiles into the table and ingest the rows.
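
A minimal sketch of such a bulk load, assuming the HFiles were written to the (illustrative) `hdfs://hdfs/data/hfile` directory from the previous step:

[source,console]
----
# Sketch only: move the generated HFiles into the regions of the target table
hbase org.apache.hadoop.hbase.tool.LoadIncrementalHFiles \
  hdfs://hdfs/data/hfile cycling-tripdata
----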

You can now open the `bin/hbase shell` again and execute `count 'cycling-tripdata'`; see below for a partial result.

[source]
----
Current count: 1000, row: 02FD41C2518CCF81
Current count: 2000, row: 06022E151BC79CE0
Current count: 3000, row: 090E4E73A888604A
...
Current count: 82000, row: F7A8C86949FD9B1B
Current count: 83000, row: FA9AA8F17E766FD5
Current count: 84000, row: FDBD9EC46964C103
84777 row(s)
Took 13.4666 seconds
=> 84777
----

== The table

You can now use the table and the data. You are able to use all available HBase shell commands, a few of which are sketched below, followed by the table description.
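
For example, a couple of read commands in the HBase shell (the row key is taken from the `count` output above):

[source,console]
----
# Scan a single row, then fetch one row by its key
scan 'cycling-tripdata', { LIMIT => 1 }
get 'cycling-tripdata', '02FD41C2518CCF81'
----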

[source,console]
----
describe 'cycling-tripdata'
Table cycling-tripdata is ENABLED
cycling-tripdata
COLUMN FAMILIES DESCRIPTION
{NAME => 'end_lat', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'end_lng', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'end_station_id', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'end_station_name', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'ended_at', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'member_casual', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'rideable_type', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'start_lat', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'start_lng', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'start_station_id', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'start_station_name', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'started_at', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
----

== The HBase UI

The HBase web UI will give you information on the status and metrics of your HBase cluster.
If the UI is not available, please do a port-forward via `kubectl port-forward hbase-master-default-0 16010`.
See below for the start page.

image::demo-hbase-hdfs-load-cycling-data/hbase-ui-start-page.png[]

From the start page you can check more details, for example details on the created table.

image::demo-hbase-hdfs-load-cycling-data/hbase-table-ui.png[]

== The HDFS UI

[NOTE]
====
The HDFS services will be listed by `stackablectl services list --all-namespaces` starting with the next release, 22-11.
====

You can also see HDFS details via a UI. If its endpoint is not listed yet, you can port-forward the NameNode web port as sketched below; further down you will see the overview of your HDFS cluster.
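
A sketch of such a port-forward, assuming the NameNode pod name follows the services list above and that the NameNode serves its web UI on Hadoop's default port 9870:

[source,console]
----
# Sketch only: forward the NameNode web UI to localhost:9870
# (pod name and port are assumptions, adjust them to your cluster)
kubectl port-forward hdfs-namenode-default-0 9870
----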

image::demo-hbase-hdfs-load-cycling-data/hdfs-overview.png[]

The UI will give you information on the datanodes via the Datanodes tab.

image::demo-hbase-hdfs-load-cycling-data/hdfs-datanode.png[]

You can also browse the file system directories with the UI.

image::demo-hbase-hdfs-load-cycling-data/hdfs-data.png[]

The raw data from the DistCp job can be found here.

image::demo-hbase-hdfs-load-cycling-data/hdfs-data-raw.png[]
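
The same directory can also be listed from the command line. A sketch, assuming the pod and container names below (adjust them to your cluster):

[source,console]
----
# Sketch only: list the raw dataset copied by the DistCp job
# (pod and container names are assumptions)
kubectl exec hdfs-namenode-default-0 -c namenode -- bin/hdfs dfs -ls /data/raw
----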

The structure of the HFiles can be seen here.

image::demo-hbase-hdfs-load-cycling-data/hdfs-data-hfile.png[]
