Benchmarking on s3 #67

Open
JulienPeloton opened this issue Apr 2, 2019 · 0 comments

I'm trying to benchmark spark-fits on S3 by repeatedly running the same piece of code in a loop:

import os
import time

path = "s3a://abucket/..."
fn = "afile.fits"  # 700 MB
N = 10  # number of iterations (example value)

for index in range(N):
    # Re-read the same FITS file from S3 at each iteration
    df = spark.read\
        .format("fits")\
        .option("hdu", 1)\
        .load(os.path.join(path, fn))

    # Time a full scan of the data
    start = time.time()
    df.count()
    elapsed = time.time() - start
    print("{} seconds".format(elapsed))

With the default S3 configuration, the job hangs after the first iteration and eventually fails with a timeout error. I found that increasing the parameter controlling the maximum number of simultaneous connections to S3 (fs.s3a.connection.maximum) from its default of 15 to 100 works around the problem. It is not clear exactly why or how, so it would be good to investigate further.
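For reference, a minimal sketch of how that connection-pool size can be raised when building the Spark session (the application name is a placeholder, and 100 is simply the value that worked here; the same key can also be passed via --conf on spark-submit):

from pyspark.sql import SparkSession

# fs.s3a.connection.maximum is a Hadoop configuration key, so it is passed
# with the "spark.hadoop." prefix when set through the Spark session builder.
spark = SparkSession.builder \
    .appName("spark-fits-s3-benchmark") \
    .config("spark.hadoop.fs.s3a.connection.maximum", "100") \
    .getOrCreate()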
