Benchmarking on s3 #67

Open
JulienPeloton opened this issue Apr 2, 2019 · 0 comments

I'm trying to benchmark spark-fits on S3 by repeatedly running the same piece of code in a loop:

import os
import time

path = "s3a://abucket/..."
fn = "afile.fits"  # 700 MB
N = 10  # number of iterations (example value)

for index in range(N):
    # Re-read the same FITS file from S3 at each iteration
    df = spark.read\
        .format("fits")\
        .option("hdu", 1)\
        .load(os.path.join(path, fn))

    # Time a full scan of the data
    start = time.time()
    df.count()
    elapsed = time.time() - start
    print("{} seconds".format(elapsed))

With the default S3 configuration, the job hangs after the first iteration and eventually fails with a timeout error. I found that increasing the parameter controlling the maximum number of simultaneous connections to S3 (fs.s3a.connection.maximum) from its default of 15 to 100 works around the problem. It is not clear exactly why or how, so it would be good to investigate further.
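For reference, a minimal sketch of how that connection-pool size can be raised when building the Spark session (the application name is a placeholder, and 100 is simply the value that worked here; the same key can also be passed via --conf on spark-submit):

from pyspark.sql import SparkSession

# fs.s3a.connection.maximum is a Hadoop configuration key, so it is passed
# with the "spark.hadoop." prefix when set through the Spark session builder.
spark = SparkSession.builder \
    .appName("spark-fits-s3-benchmark") \
    .config("spark.hadoop.fs.s3a.connection.maximum", "100") \
    .getOrCreate()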
