= hbase-hdfs-load-cycling-data

[NOTE]
====
This guide assumes that you already have the demo `hbase-hdfs-load-cycling-data` installed.
If you don't have it installed, please follow the xref:commands/demo.adoc#_install_demo[documentation on how to install a demo].
In short, you have to run `stackablectl demo install hbase-hdfs-load-cycling-data`.
====

This demo will

* Install the required Stackable operators
* Spin up the following data products
** *HBase*: An open source, distributed, scalable big data store. This demo uses it to store the https://www.kaggle.com/datasets/timgid/cyclistic-dataset-google-certificate-capstone?select=Divvy_Trips_2020_Q1.csv[cyclistic dataset] and enable access to it
** *HDFS*: A distributed file system used to store the dataset temporarily before it is imported into HBase
* Use https://hadoop.apache.org/docs/stable/hadoop-distcp/DistCp.html[distcp] to copy the https://www.kaggle.com/datasets/timgid/cyclistic-dataset-google-certificate-capstone?select=Divvy_Trips_2020_Q1.csv[cyclistic dataset] from an S3 bucket into HDFS
* Create HFiles, the HBase file format consisting of sorted key/value pairs, where both keys and values are byte arrays
* Load the HFiles into an existing table via the `ImportTsv` utility, which loads data in TSV or CSV format into HBase
* Query the data via the `hbase shell`, an interactive shell for executing commands on the created table

You can see the deployed products as well as their relationships in the following diagram:

image::demo-hbase-hdfs-load-cycling-data/overview.png[]

== List deployed Stackable services
To list the installed Stackable services, run the following command:
`stackablectl services list --all-namespaces`

[source,console]
----
 PRODUCT    NAME       NAMESPACE  ENDPOINTS                                               EXTRA INFOS

 hbase      hbase      default    regionserver                   172.18.0.5:32282
                                  ui                             http://172.18.0.5:31527
                                  metrics                        172.18.0.5:31081

 hdfs       hdfs       default    datanode-default-0-metrics     172.18.0.2:31441
                                  datanode-default-0-data        172.18.0.2:32432
                                  datanode-default-0-http        http://172.18.0.2:30758
                                  datanode-default-0-ipc         172.18.0.2:32323
                                  journalnode-default-0-metrics  172.18.0.5:31123
                                  journalnode-default-0-http     http://172.18.0.5:30038
                                  journalnode-default-0-https    https://172.18.0.5:31996
                                  journalnode-default-0-rpc      172.18.0.5:30080
                                  namenode-default-0-metrics     172.18.0.2:32753
                                  namenode-default-0-http        http://172.18.0.2:32475
                                  namenode-default-0-rpc         172.18.0.2:31639
                                  namenode-default-1-metrics     172.18.0.4:32202
                                  namenode-default-1-http        http://172.18.0.4:31486
                                  namenode-default-1-rpc         172.18.0.4:31874

 zookeeper  zookeeper  default    zk                             172.18.0.4:32469
----

[NOTE]
====
When a product instance has not finished starting yet, the service will have no endpoint.
Depending on your internet connectivity, starting all of the product instances might take considerable time.
In case a product is not ready yet, a warning might be shown.
====

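To keep an eye on the startup progress, you can for example watch the pods of the demo's namespace (assumed to be `default` here) until everything reports `Running` and ready:

[source,console]
----
# Assumption: the demo was installed into the default namespace
kubectl get pods --namespace default --watch
----
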
== The first Job
https://hadoop.apache.org/docs/stable/hadoop-distcp/DistCp.html[DistCp] (distributed copy) is a tool for large inter-/intra-cluster copying. It uses MapReduce to effect its distribution, error handling, recovery, and reporting. It expands a list of files and directories into input for map tasks, each of which copies a partition of the files specified in the source list.
Therefore, the first Job uses DistCp to copy data from an S3 bucket into HDFS. Below you'll see parts of the logs.

[source]
----
Copying s3a://public-backup-nyc-tlc/cycling-tripdata/demo-cycling-tripdata.csv.gz to hdfs://hdfs/data/raw/demo-cycling-tripdata.csv.gz
[LocalJobRunner Map Task Executor #0] mapred.RetriableFileCopyCommand (RetriableFileCopyCommand.java:getTempFile(235)) - Creating temp file: hdfs://hdfs/data/raw/.distcp.tmp.attempt_local60745921_0001_m_000000_0.1663687068145
[LocalJobRunner Map Task Executor #0] mapred.RetriableFileCopyCommand (RetriableFileCopyCommand.java:doCopy(127)) - Writing to temporary target file path hdfs://hdfs/data/raw/.distcp.tmp.attempt_local60745921_0001_m_000000_0.1663687068145
[LocalJobRunner Map Task Executor #0] mapred.RetriableFileCopyCommand (RetriableFileCopyCommand.java:doCopy(153)) - Renaming temporary target file path hdfs://hdfs/data/raw/.distcp.tmp.attempt_local60745921_0001_m_000000_0.1663687068145 to hdfs://hdfs/data/raw/demo-cycling-tripdata.csv.gz
[LocalJobRunner Map Task Executor #0] mapred.RetriableFileCopyCommand (RetriableFileCopyCommand.java:doCopy(157)) - Completed writing hdfs://hdfs/data/raw/demo-cycling-tripdata.csv.gz (3342891 bytes)
[LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(634)) -
[LocalJobRunner Map Task Executor #0] mapred.Task (Task.java:done(1244)) - Task:attempt_local60745921_0001_m_000000_0 is done. And is in the process of committing
[LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(634)) -
[LocalJobRunner Map Task Executor #0] mapred.Task (Task.java:commit(1421)) - Task attempt_local60745921_0001_m_000000_0 is allowed to commit now
[LocalJobRunner Map Task Executor #0] output.FileOutputCommitter (FileOutputCommitter.java:commitTask(609)) - Saved output of task 'attempt_local60745921_0001_m_000000_0' to file:/tmp/hadoop/mapred/staging/stackable339030898/.staging/_distcp-1760904616/_logs
[LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(634)) - 100.0% Copying s3a://public-backup-nyc-tlc/cycling-tripdata/demo-cycling-tripdata.csv.gz to hdfs://hdfs/data/raw/demo-cycling-tripdata.csv.gz
----

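The copy shown in these logs corresponds to a DistCp invocation of roughly the following shape; the source and target paths are taken from the log above, while any additional S3 or HDFS options the demo's Job sets are omitted:

[source,console]
----
# Sketch reconstructed from the log output; the demo's Job may pass additional options
hadoop distcp \
  s3a://public-backup-nyc-tlc/cycling-tripdata/demo-cycling-tripdata.csv.gz \
  hdfs://hdfs/data/raw
----
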
== The second Job
The second Job consists of two steps.

First, we use `org.apache.hadoop.hbase.mapreduce.ImportTsv` (see the https://hbase.apache.org/book.html#importtsv[ImportTsv docs]) to create a table and HFiles.
HFile is HBase's dedicated file format and is optimized for HBase performance. It stores metadata about the data and thereby speeds up access in HBase.
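
For orientation, an ImportTsv call of the following shape produces the table and the HFiles. The column mapping is truncated and the HDFS paths are assumptions for illustration, not necessarily what the demo's Job uses:

[source,console]
----
# Sketch only: column list truncated, paths assumed for illustration
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.separator=',' \
  -Dimporttsv.columns=HBASE_ROW_KEY,started_at:started_at,ended_at:ended_at \
  -Dimporttsv.bulk.output=hdfs://hdfs/data/hfile \
  cycling-tripdata \
  hdfs://hdfs/data/raw
----
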
When you connect to the HBase master, open a `bin/hbase shell` and execute `list`, you will see the created table. However, it will contain 0 rows at this point.
You can connect to the shell via:
[source]
----
kubectl exec -it hbase-master-default-0 -- bin/hbase shell
----
If you use k9s, you can shell into `hbase-master-default-0`, run `bin/hbase shell` and then execute `list`.

[source]
----
TABLE
cycling-tripdata
----

Secondly, we'll use `org.apache.hadoop.hbase.tool.LoadIncrementalHFiles` (see the https://hbase.apache.org/book.html#arch.bulk.load[bulk load docs]) to import the HFiles into the table and ingest the rows.
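
A bulk load of the HFiles generated in the first step roughly corresponds to the call below; the HFile directory is the same assumption as in the ImportTsv sketch above:

[source,console]
----
# Sketch only: HFile directory assumed for illustration
hbase org.apache.hadoop.hbase.tool.LoadIncrementalHFiles \
  hdfs://hdfs/data/hfile \
  cycling-tripdata
----
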
You can now open the `bin/hbase shell` again and execute `count 'cycling-tripdata'`; see below for a partial result.

[source]
----
Current count: 1000, row: 02FD41C2518CCF81
Current count: 2000, row: 06022E151BC79CE0
Current count: 3000, row: 090E4E73A888604A
...
Current count: 82000, row: F7A8C86949FD9B1B
Current count: 83000, row: FA9AA8F17E766FD5
Current count: 84000, row: FDBD9EC46964C103
84777 row(s)
Took 13.4666 seconds
=> 84777
----

== The table
You can now use the table and the data. All available HBase shell commands are at your disposal. Below, you'll see the table description.

[source,console]
----
describe 'cycling-tripdata'
Table cycling-tripdata is ENABLED
cycling-tripdata
COLUMN FAMILIES DESCRIPTION
{NAME => 'end_lat', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'end_lng', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'end_station_id', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'end_station_name', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'ended_at', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'member_casual', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'rideable_type', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'start_lat', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'start_lng', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'start_station_id', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'start_station_name', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'started_at', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
----

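To look at actual rows rather than just the schema, a limited scan keeps the output manageable; the limit of 5 here is arbitrary:

[source,console]
----
scan 'cycling-tripdata', { LIMIT => 5 }
----
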
== The HBase UI
The HBase web UI gives you information on the status and metrics of your HBase cluster.
If the UI is not available, set up a port-forward with `kubectl port-forward hbase-master-default-0 16010`.
See below for the start page.

image::demo-hbase-hdfs-load-cycling-data/hbase-ui-start-page.png[]

From the start page you can check further details, for example details on the created table.

image::demo-hbase-hdfs-load-cycling-data/hbase-table-ui.png[]

== The HDFS UI
[NOTE]
====
The HDFS services will be available via `stackablectl services list --all-namespaces` with the next release 22-11.
====
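
Until then, a port-forward to the namenode pod should expose the UI locally; the pod name is inferred from the service listing above and 9870 is the Hadoop 3 default web port, so treat both as assumptions:

[source,console]
----
# Assumption: pod name follows the namenode-default-0 service listed above; 9870 is the Hadoop 3 default namenode web port
kubectl port-forward hdfs-namenode-default-0 9870
----
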
You can also inspect HDFS via its UI. Below you will see the overview of your HDFS cluster.

image::demo-hbase-hdfs-load-cycling-data/hdfs-overview.png[]

The Datanodes tab gives you information on the datanodes of the cluster.

image::demo-hbase-hdfs-load-cycling-data/hdfs-datanode.png[]

You can also browse the file system in the UI.

image::demo-hbase-hdfs-load-cycling-data/hdfs-data.png[]

The raw data from the DistCp job can be found here.

image::demo-hbase-hdfs-load-cycling-data/hdfs-data-raw.png[]

The structure of the HFiles can be seen here.

image::demo-hbase-hdfs-load-cycling-data/hdfs-data-hfile.png[]