
pyDKB/hdfs: use Python hdfs module instead of running shell command #118

Open
mgolosova opened this issue Apr 5, 2018 · 0 comments
Motivation

Currently we are using the default Hadoop command-line client, which is written in Java. Starting a Java application for every single HDFS operation adds significant overhead.

What to do

We can use the Python hdfs module instead.
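A minimal sketch of what this could look like with the third-party `hdfs` package (hdfscli); the host, port, and user below are placeholders, not project settings:

```python
def webhdfs_url(host, port=50070):
    """Build the WebHDFS endpoint URL expected by hdfs.InsecureClient.

    50070 is the common default WebHDFS port; it may differ per cluster.
    """
    return 'http://%s:%d' % (host, port)

# Client usage (needs a live cluster, so shown as comments only):
# from hdfs import InsecureClient
# client = InsecureClient(webhdfs_url('namenode.example.com'), user='dkb')
# client.list('/user/dkb')                        # replaces `hadoop fs -ls`
# with client.read('/user/dkb/file.txt') as reader:  # replaces `hadoop fs -cat`
#     data = reader.read()
```

This stays inside one Python process instead of forking a JVM per call.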

Difficulties

  1. Getting the client configuration.
    It can be taken from the system:
  • it is stored in /etc/hadoop/conf/hdfs-site.xml, in $HADOOP_CONF_DIR/hdfs-site.xml, or somewhere else;
  • to get information from the XML config file, we must know the filesystem name: it can be found in the same hdfs-site.xml (parameter fs.defaultFS) or, if absent, in core-site.xml;
  • the HTTP connection parameters may not be specified (parameter dfs.http.address), so they have to be reconstructed from the namenode settings (dfs.ha.namenodes.$FS_NAME + dfs.namenode.rpc-address.$FS_NAME.$NN_NAME).

Alternatively, it can be taken from the config file $HDFSCLI_CONFIG, created by hand (or generated automatically from the system configuration).

  2. Next, we need to somehow detect the active namenode among those found in the system configuration.
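The config-discovery steps from item 1 can be sketched with the standard library only; the parsing below assumes the usual Hadoop `<property><name>…</name><value>…</value></property>` layout, and the fallback order follows the description above:

```python
import os
import xml.etree.ElementTree as ET

def hadoop_properties(path):
    """Return {name: value} for all <property> entries in a Hadoop XML config."""
    props = {}
    for prop in ET.parse(path).getroot().iter('property'):
        name = prop.findtext('name')
        if name is not None:
            props[name] = prop.findtext('value')
    return props

def namenode_rpc_addresses(conf_dir=None):
    """Resolve namenode RPC addresses for an HA filesystem.

    Looks for fs.defaultFS in hdfs-site.xml first and falls back to
    core-site.xml, as described in the issue text.
    """
    conf_dir = conf_dir or os.environ.get('HADOOP_CONF_DIR', '/etc/hadoop/conf')
    props = hadoop_properties(os.path.join(conf_dir, 'hdfs-site.xml'))
    fs = props.get('fs.defaultFS')
    if fs is None:
        core = hadoop_properties(os.path.join(conf_dir, 'core-site.xml'))
        fs = core.get('fs.defaultFS')
    fs_name = fs.split('://', 1)[-1].rstrip('/')
    namenodes = props.get('dfs.ha.namenodes.%s' % fs_name, '').split(',')
    return [props['dfs.namenode.rpc-address.%s.%s' % (fs_name, nn)]
            for nn in namenodes if nn]
```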

Everything else looks just fine.
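For item 2, one possible probe (an assumption, not an agreed design): ask each namenode's WebHDFS endpoint for a trivial operation and treat a StandbyException error reply as "this one is the standby". The addresses passed in would have to be HTTP addresses, reconstructed as discussed above.

```python
import json
import urllib.error
import urllib.request

def is_standby_reply(body):
    """True if a WebHDFS JSON error body names a StandbyException."""
    try:
        exc = json.loads(body)['RemoteException']['exception']
    except (ValueError, KeyError, TypeError):
        return False
    return exc == 'StandbyException'

def active_namenode(http_addresses):
    """Return the first namenode that does not answer as a standby, else None."""
    for addr in http_addresses:
        url = 'http://%s/webhdfs/v1/?op=GETFILESTATUS' % addr
        try:
            urllib.request.urlopen(url)
            return addr                  # answered normally: assume active
        except urllib.error.HTTPError as err:
            if is_standby_reply(err.read()):
                continue                 # explicitly standby, try the next one
            return addr                  # answered, but not as standby
        except urllib.error.URLError:
            continue                     # unreachable, try the next one
    return None
```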
