[BAHIR-75] [WIP] Remote HDFS connector for Apache Spark using webhdfs protocol with support for Apache Knox #28
base: master
Conversation
A few high-level questions before jumping into a more detailed code review: Design: Can you elaborate on the differences/limitations/advantages over the Hadoop default "webhdfs" scheme? i.e.
Configuration: Some configuration parameters are specific to remote servers and should be specified per server, not at the connector level (some server-level settings may override connector-level ones), i.e.
Usability: Given that users need to know the remote Hadoop server configuration (security, gateway path, etc.) for WebHDFS access, would it be nicer if ...
import org.apache.bahir.webhdfs.RemoteHdfsConnector
// initialize connector
RemoteHdfsConnector.initialize(protocol = "remoteHdfs", serverConfigFile = "/servers.xml",
truststorePath = "/truststore.jks")
// add another connection configuration, provide optional nickname "bluemix1"
RemoteHdfsConnector.addServerConfiguration(nickname = "bluemix1",
host = "ehaasp-577-mastermanager.bi.services.bluemix.net", port = "8443",
gatewayPath = "gateway/default", // optional, default is "gateway/default"
userName = "biadmin", password = "*******")
// load remote file using form <protocol>://<nickname>/<resourcePath>
// instead of equivalent longer form <protocol>://<server>:<port>/<resourcePath>
val df = spark.read.format("csv").option("header", "true")
.load("remoteHdfs://bluemix1/my_spark_datasets/NewYorkCity311Service.csv")
... Security
Debugging
Testing: The outstanding unit tests should verify that the connector works with a ...
If I need to provide certificates for SSL verification, would I need to create truststores on the worker nodes? On some clusters I work with, I do not have access to the filesystem on the worker nodes. Also, I would like users to be able to specify a path to PEM certificates rather than having to make them create a truststore. Is the following command creating or reading from servers.xml?
If it is reading the file, will the user have access to this file on managed Spark environments that don't exist within a Hadoop environment?
@snowch the code snippet I put under usability in my comment was merely a suggestion for an alternative to using Hadoop configuration properties. I had intended the servers.xml file to contain all of the user's remote Hadoop connections with host, port, username, password, etc., so that this type of configuration would not have to be done in the Spark program. All configuration files and the truststore file would reside on the Spark driver (master node). In terms of SSL validation, you could opt to bypass certificate validation.
@ckadner Here are my responses to your comments:
Yes
This is automatically taken care of by Apache Knox, in my understanding. One of the key goals of Apache Knox is to relieve Hadoop clients from the nitty-gritty of a Hadoop cluster's internal security implementation. So we don't need to handle this in the client-side code if the webhdfs request passes through Apache Knox.
Say a remote Spark cluster needs to read a file of size 2 GB and spawns 16 connections in parallel to do so; 16 separate webhdfs calls are then made to the remote HDFS. Although each call starts reading from a different offset, each of them reads to the end of the file: the first connection creates an input stream from byte 0 to the end of the file, the second from 128 MB to the end, the third from 256 MB to the end, and so on. As a result, the amount of data prepared on the server side, transferred over the wire, and read on the client side can be much more than the original file size (for this 2 GB file it can be close to 17 GB). The overhead grows with the number of connections, and for larger files it grows further still. With the approach used in this PR, for the above example, the total volume of data read and transferred over the wire is always limited to 2 GB plus a few extra KBs (for record boundary resolution). That extra amount grows only slightly (still in the KB range) with more connections and does not depend on the file size. So even when a large file (GBs) is read with a high number of parallel connections, the data processed on the server side, transferred over the wire, and read on the client side stays bounded by the original file size plus a few KBs.
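To make the contrast concrete, here is a minimal sketch (not code from this PR; host, path, and sizes are illustrative) of how the naive and the bounded range requests differ, using the standard WebHDFS OPEN parameters offset and length:
// Illustrative sketch of the naive vs. bounded range requests described above
val fileSize = 2L * 1024 * 1024 * 1024 // 2 GB file
val partitions = 16
val chunk = fileSize / partitions // 128 MB per connection
val base = "https://ehaasp-577-mastermanager.bi.services.bluemix.net:8443" +
  "/gateway/default/webhdfs/v1/my_spark_datasets/NewYorkCity311Service.csv"
// Naive: each call reads from its offset to the end of the file, so the total bytes
// requested are roughly (partitions + 1) / 2 times the file size (about 17 GB here)
val naiveUrls = (0 until partitions).map(i => s"$base?op=OPEN&offset=${i * chunk}")
// Bounded (the idea in this PR): each call also passes a length, so the total bytes
// requested stay close to the file size regardless of the number of connections
val boundedUrls = (0 until partitions).map(i => s"$base?op=OPEN&offset=${i * chunk}&length=$chunk")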
You are right. However, I would put the two levels as Server level and File level. Some parameters won't change from file to file; they are specific to a remote HDFS server and are therefore Server-level parameters, whereas the values of other parameters can differ from file to file (File-level parameters). The Server-level parameters are gateway path, user name/password, webhdfs protocol version, and the certificate validation option (and the parameters associated with it). The File-level parameters are buffer sizes, file chunk sizes, etc., which can differ from file to file.
That's a good idea. We can have a set of default values for these parameters based on typical practice/convention, and those defaults can be overridden if specified by the user (see the sketch below).
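As an illustration only (these property names are hypothetical placeholders, not the keys actually used in this PR, except usrCredStr which appears in Step 2 below), the two levels with defaults could look like:
// Hypothetical property names for illustration; the PR's actual configuration keys may differ
// Server-level parameters (fixed per remote HDFS server)
sc.hadoopConfiguration.set("bahir.webhdfs.gateway.path", "gateway/default") // could default to "gateway/default"
sc.hadoopConfiguration.set("bahir.webhdfs.version", "v1") // could default to "v1"
sc.hadoopConfiguration.set("bahir.webhdfs.cert.validation", "true")
sc.hadoopConfiguration.set("usrCredStr", "biadmin:*******") // user name/password (see Step 2 below)
// File-level parameters (may vary per file; defaults apply when omitted)
sc.hadoopConfiguration.set("bahir.webhdfs.buffer.size", (4 * 1024 * 1024).toString)
sc.hadoopConfiguration.set("bahir.webhdfs.chunk.size", (128 * 1024 * 1024).toString)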
Right now this PR supports basic authentication at the Knox gateway level. Other authentication mechanisms supported by Apache Knox (SAML, OAuth, CAS, OpenID) are not supported yet.
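For reference, a basic-auth webhdfs call through the Knox gateway looks roughly like the following sketch (plain JDK HTTP client, illustrative host and path; not code from this PR):
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets
import java.util.Base64
// Knox authenticates the request via the HTTP Basic Authorization header
// before forwarding it to WebHDFS inside the cluster
val credentials = Base64.getEncoder.encodeToString(
  "biadmin:*******".getBytes(StandardCharsets.UTF_8))
val url = new URL("https://ehaasp-577-mastermanager.bi.services.bluemix.net:8443" +
  "/gateway/default/webhdfs/v1/my_spark_datasets/NewYorkCity311Service.csv?op=GETFILESTATUS")
val conn = url.openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("GET")
conn.setRequestProperty("Authorization", s"Basic $credentials")
println(s"HTTP ${conn.getResponseCode}")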
No. It is internally handled by Apache Knox.
On second thought, I'm with you.
Agreed
This PR focuses only on secured Hadoop clusters. An unsecured Hadoop cluster can be accessed using the existing webhdfs client library available from Hadoop, so we don't need this.
Yes
We need not, as it is more a feature of Apache Knox.
@sourav-mazumder -- do you have any updates on the progress of this PR?
This component implements the Hadoop FileSystem API (org.apache.hadoop.fs.FileSystem) to provide an alternate mechanism (instead of using the 'webhdfs' or 'swebhdfs' file URIs) for Spark to access (read/write) files from/to a remote Hadoop cluster using the webhdfs protocol.
This component takes care of the following requirements related to accessing files (read/write) from/to a remote enterprise Hadoop cluster from a remote Spark cluster -
This component is not a full-fledged implementation of the Hadoop FileSystem API. It implements only those interfaces that are needed by Spark for reading data from remote HDFS and writing data back to remote HDFS.
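As a rough sketch of what such a partial implementation entails (the method signatures below are the abstract methods of org.apache.hadoop.fs.FileSystem; the class name and bodies here are placeholders, and the actual BahirWebHdfsFileSystem in this PR will differ):
import java.net.URI
import org.apache.hadoop.fs._
import org.apache.hadoop.fs.permission.FsPermission
import org.apache.hadoop.util.Progressable
// Skeleton only: a real implementation issues webhdfs calls (via Knox) from these methods
class SketchWebHdfsFileSystem extends FileSystem {
  override def getUri: URI = ??? // the custom scheme, e.g. remoteHdfs://...
  override def open(f: Path, bufferSize: Int): FSDataInputStream = ??? // read path used by Spark loads
  override def create(f: Path, permission: FsPermission, overwrite: Boolean, bufferSize: Int,
      replication: Short, blockSize: Long, progress: Progressable): FSDataOutputStream = ??? // write path
  override def append(f: Path, bufferSize: Int, progress: Progressable): FSDataOutputStream = ???
  override def rename(src: Path, dst: Path): Boolean = ???
  override def delete(f: Path, recursive: Boolean): Boolean = ???
  override def listStatus(f: Path): Array[FileStatus] = ??? // directory listing for partition discovery
  override def setWorkingDirectory(newDir: Path): Unit = ???
  override def getWorkingDirectory: Path = ???
  override def mkdirs(f: Path, permission: FsPermission): Boolean = ???
  override def getFileStatus(f: Path): FileStatus = ??? // file size/metadata used to plan splits
}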
Example Usage -
Step 1: Set the Hadoop configuration to define a custom URI scheme of your choice and specify the class name BahirWebHdfsFileSystem. For example -
sc.hadoopConfiguration.set("fs.remoteHdfs.impl","org.apache.bahir.datasource.webhdfs.BahirWebHdfsFileSystem")
You can use any name (apart from the standard URI schemes like hdfs, webhdfs, file, etc. already used by Spark) instead of 'remoteHdfs'. However, the same name must subsequently be used when loading (or writing) a file.
Step 2: Set the user name and password as below -
val userid = "biadmin"
val password = "password"
val userCred = userid + ":" + password
sc.hadoopConfiguration.set("usrCredStr",userCred)
Step 3: Now you are ready to load any file from the remote Hadoop cluster using Spark's standard DataFrame/Dataset APIs. For example -
val filePath = "biginsights/spark-enablement/datasets/NewYorkCity311Service/311_Service_Requests_from_2010_to_Present.csv"
val srvr = "ehaasp-577-mastermanager.bi.services.bluemix.net:8443/gateway/default/webhdfs/v1"
val knoxPath = "gateway/default"
val webhdfsPath = "webhdfs/v1"
val prtcl = "remoteHdfs"
val fullPath = s"$prtcl://$srvr/$knoxPath/$webhdfsPath/$filePath"
val df = spark.read.format("csv").option("header", "true").load(fullPath)
Please note the use of 'gateway/default' and 'webhdfs/v1' for specifying the server-specific information in the path. The first is specific to Apache Knox and the second is specific to the webhdfs protocol.
Step 4: To write data back to remote HDFS, the following steps can be used (using Spark's standard DataFrame writer) -
val filePathWrite = "biginsights/spark-enablement/datasets/NewYorkCity311Service/Result.csv"
val srvr = "ehaasp-577-mastermanager.bi.services.bluemix.net:8443"
val knoxPath = "gateway/default"
val webhdfsPath = "webhdfs/v1"
val prtcl = "remoteHdfs"
val fullPath = s"$prtcl://$srvr/$knoxPath/$webhdfsPath/$filePathWrite"
df.write.format("csv").option("header", "true").save(fullPath)
We are still working on the following -