
pyDKB/hdfs: use Python hdfs module instead of running shell command #118

Open
mgolosova opened this issue Apr 5, 2018 · 0 comments
Motivation

Currently we are using the default Hadoop command-line client, which is written in Java. Starting a Java application for every single HDFS operation adds significant overhead.

What to do

We can use the Python hdfs module instead.
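A minimal sketch of what this could look like with the third-party `hdfs` package (hdfscli); the host, port, and user below are placeholders, not project settings:

```python
def webhdfs_url(host, port=50070):
    """Build the WebHDFS endpoint URL expected by hdfs.InsecureClient.

    50070 is the common default WebHDFS port; it may differ per cluster.
    """
    return 'http://%s:%d' % (host, port)

# Client usage (needs a live cluster, so shown as comments only):
# from hdfs import InsecureClient
# client = InsecureClient(webhdfs_url('namenode.example.com'), user='dkb')
# client.list('/user/dkb')                        # replaces `hadoop fs -ls`
# with client.read('/user/dkb/file.txt') as reader:  # replaces `hadoop fs -cat`
#     data = reader.read()
```

This stays inside one Python process instead of forking a JVM per call.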

Difficulties

  1. Getting the client configuration.
    It can be taken from the system:
  • it is stored in /etc/hadoop/conf/hdfs-site.xml, in $HADOOP_CONF_DIR/hdfs-site.xml, or somewhere else;
  • to get information from the XML config file, we must know the filesystem name: it can be found in the same hdfs-site.xml (parameter fs.defaultFS) or, if absent, in core-site.xml;
  • the HTTP connection parameters may not be specified (parameter dfs.http.address), so they have to be reconstructed from the namenode settings (dfs.ha.namenodes.$FS_NAME + dfs.namenode.rpc-address.$FS_NAME.$NN_NAME).

Alternatively, it can be taken from the config file $HDFSCLI_CONFIG, created by hand (or generated automatically from the system configuration).

  2. Next, we need to somehow detect the active namenode among those found in the system configuration.
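The config-discovery steps from item 1 can be sketched with the standard library only; the parsing below assumes the usual Hadoop `<property><name>…</name><value>…</value></property>` layout, and the fallback order follows the description above:

```python
import os
import xml.etree.ElementTree as ET

def hadoop_properties(path):
    """Return {name: value} for all <property> entries in a Hadoop XML config."""
    props = {}
    for prop in ET.parse(path).getroot().iter('property'):
        name = prop.findtext('name')
        if name is not None:
            props[name] = prop.findtext('value')
    return props

def namenode_rpc_addresses(conf_dir=None):
    """Resolve namenode RPC addresses for an HA filesystem.

    Looks for fs.defaultFS in hdfs-site.xml first and falls back to
    core-site.xml, as described in the issue text.
    """
    conf_dir = conf_dir or os.environ.get('HADOOP_CONF_DIR', '/etc/hadoop/conf')
    props = hadoop_properties(os.path.join(conf_dir, 'hdfs-site.xml'))
    fs = props.get('fs.defaultFS')
    if fs is None:
        core = hadoop_properties(os.path.join(conf_dir, 'core-site.xml'))
        fs = core.get('fs.defaultFS')
    fs_name = fs.split('://', 1)[-1].rstrip('/')
    namenodes = props.get('dfs.ha.namenodes.%s' % fs_name, '').split(',')
    return [props['dfs.namenode.rpc-address.%s.%s' % (fs_name, nn)]
            for nn in namenodes if nn]
```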

Everything else looks just fine.
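For item 2, one possible probe (an assumption, not an agreed design): ask each namenode's WebHDFS endpoint for a trivial operation and treat a StandbyException error reply as "this one is the standby". The addresses passed in would have to be HTTP addresses, reconstructed as discussed above.

```python
import json
import urllib.error
import urllib.request

def is_standby_reply(body):
    """True if a WebHDFS JSON error body names a StandbyException."""
    try:
        exc = json.loads(body)['RemoteException']['exception']
    except (ValueError, KeyError, TypeError):
        return False
    return exc == 'StandbyException'

def active_namenode(http_addresses):
    """Return the first namenode that does not answer as a standby, else None."""
    for addr in http_addresses:
        url = 'http://%s/webhdfs/v1/?op=GETFILESTATUS' % addr
        try:
            urllib.request.urlopen(url)
            return addr                  # answered normally: assume active
        except urllib.error.HTTPError as err:
            if is_standby_reply(err.read()):
                continue                 # explicitly standby, try the next one
            return addr                  # answered, but not as standby
        except urllib.error.URLError:
            continue                     # unreachable, try the next one
    return None
```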
