
Failed to evaluate job outputs - IOException: Could not read from s3... #4687

Open
doron-st opened this issue Feb 28, 2019 · 24 comments
Labels
Needs Triage: Ticket needs further investigation and refinement prior to moving to milestones

Comments

@doron-st

While testing cromwell-36 with AWS Batch, I was able to reproduce this error:

2019-02-25 09:38:52,508 cromwell-system-akka.dispatchers.engine-dispatcher-24 ERROR - WorkflowManagerActor Workflow b6b9322c-3929-4b72-9598-45d97dfb858d failed (during ExecutingWorkflowState): cromwell.backend.standard.StandardAsyncExecutionActor$$anon$2: Failed to evaluate job outputs:
Bad output 'print_nach_nachman_meuman.out': [Attempted 1 time(s)] - IOException: Could not read from s3://nrglab-cromwell-genomics/cromwell-execution/run_multiple_tests/b6b9322c-3929-4b72-9598-45d97dfb858d/call-test_cromwell_on_aws/shard-61/SingleTest.test_cromwell_on_aws/f8ecf673-ed61-4b06-b1d6-c20f7efe986e/call-print_nach_nachman_meuman/print_nach_nachman_meuman-stdout.log: Cannot access file: s3://s3.amazonaws.com/nrglab-cromwell-genomics/cromwell-execution/run_multiple_tests/b6b9322c-3929-4b72-9598-45d97dfb858d/call-test_cromwell_on_aws/shard-61/SingleTest.test_cromwell_on_aws/f8ecf673-ed61-4b06-b1d6-c20f7efe986e/call-print_nach_nachman_meuman/print_nach_nachman_meuman-stdout.log
        at cromwell.backend.standard.StandardAsyncExecutionActor.$anonfun$handleExecutionSuccess$1(StandardAsyncExecutionActor.scala:867)

The error occurs when running many sub-workflows within a single wrapping workflow.
The environment is configured correctly, and the test usually passes when running <30 subworkflows.
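
A quick way to check whether the object is actually missing at that moment (as opposed to merely unreadable) is a head-object call. A minimal boto3 sketch, with the bucket and key taken from the error above; everything else is illustrative:

import boto3
from botocore.exceptions import ClientError

# Bucket and key copied from the failing path in the stack trace above.
s3 = boto3.client("s3", region_name="us-east-1")
bucket = "nrglab-cromwell-genomics"
key = ("cromwell-execution/run_multiple_tests/b6b9322c-3929-4b72-9598-45d97dfb858d/"
       "call-test_cromwell_on_aws/shard-61/SingleTest.test_cromwell_on_aws/"
       "f8ecf673-ed61-4b06-b1d6-c20f7efe986e/call-print_nach_nachman_meuman/"
       "print_nach_nachman_meuman-stdout.log")

try:
    head = s3.head_object(Bucket=bucket, Key=key)
    print("object exists:", head["ContentLength"], "bytes")
except ClientError as e:
    print("object not readable:", e.response["Error"]["Code"])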

Here are the workflows:

run_multiple_test.wdl

import "three_task_sequence.wdl" as SingleTest

workflow run_multiple_tests {
    scatter (i in range(30)){
        call SingleTest.three_task_sequence{}
    }
}

three_task_sequence.wdl

workflow three_task_sequence {
    call print_nach

    call print_nach_nachman {
        input:
            previous = print_nach.out
    }

    call print_nach_nachman_meuman {
        input:
            previous = print_nach_nachman.out
    }

    output {
        Array[String] out = print_nach_nachman_meuman.out
    }
}

task print_nach {
    command {
        echo "nach"
    }
    output {
        Array[String] out = read_lines(stdout())
    }
    runtime {
        docker: "ubuntu:latest"
        maxRetries: 3
    }
}

task print_nach_nachman {
    Array[String] previous

    command {
        echo ${sep=' ' previous} " nachman"
    }
    output {
        Array[String] out = read_lines(stdout())
    }
    runtime {
        docker: "ubuntu:latest"
        maxRetries: 3
    }
}

task print_nach_nachman_meuman {
    Array[String] previous

    command {
        echo ${sep=' ' previous} " meuman"
    }
    output {
        Array[String] out = read_lines(stdout())
    }
    runtime {
        docker: "ubuntu:latest"
        maxRetries: 3
    }
}
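
For reference, this is roughly how the wrapper workflow gets submitted to the server (a Python sketch, not the exact command used; the port matches the webservice config below, and Cromwell's submit endpoint expects the imported WDL packaged as a workflowDependencies zip):

import io
import zipfile
import requests

# Package the imported WDL so Cromwell can resolve the "three_task_sequence.wdl" import.
deps = io.BytesIO()
with zipfile.ZipFile(deps, "w") as zf:
    zf.write("three_task_sequence.wdl")
deps.seek(0)

# Submit to the Cromwell server (port 8001 as in the config below).
resp = requests.post(
    "http://localhost:8001/api/workflows/v1",
    files={
        "workflowSource": open("run_multiple_test.wdl", "rb"),
        "workflowDependencies": ("deps.zip", deps.read(), "application/zip"),
    },
)
print(resp.status_code, resp.json())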

Here is the Cromwell config (aws.conf):

// aws.conf
include required(classpath("application"))

webservice {
  port = 8001
  interface = 0.0.0.0
}

aws {
  application-name = "cromwell"
  auths = [{
      name = "default"
      scheme = "default"
  }]
  region = "us-east-1"
}

engine {
  filesystems {
    s3 { auth = "default" }
  }
}

backend {
  default = "AWSBATCH"
  providers {
    AWSBATCH {
      actor-factory = "cromwell.backend.impl.aws.AwsBatchBackendLifecycleActorFactory"
      config {
        root = "s3://nrglab-cromwell-genomics/cromwell-execution"
        auth = "default"

        numSubmitAttempts = 3
        numCreateDefinitionAttempts = 3

        concurrent-job-limit = 100

        default-runtime-attributes {
          queueArn: "arn:aws:batch:us-east-1:66:job-queue/GenomicsDefaultQueue"
        }

        filesystems {
          s3 {
            auth = "default"
          }
        }
      }
    }
  }
}

system {
  job-rate-control {
    jobs = 1
    per = 1 second
  }
}

Would appreciate help on this.
I wonder whether Cromwell was ever tested with many parallel sub-workflows running on AWS.

Thanks!

@gemmalam added the Needs Triage label on Mar 4, 2019
@caaespin

Hey, did you ever manage to get a workaround for this error?

@geoffjentry
Contributor

@caaespin I'm assuming that means you still see this. Are you using a recent Cromwell version? (42+)

@caaespin

caaespin commented Jul 25, 2019

@geoffjentry yes. My current deployment is v42.

If you have access to the GATK forums, I put more details in my post there: https://gatkforums.broadinstitute.org/wdl/discussion/24268/aws-batch-randomly-fails-when-running-multiple-workflows/p1?new=1

@marpiech

marpiech commented Aug 1, 2019

One up. I have a similar error.

@caaespin

@geoffjentry From inspecting the logs and the AWS Batch console, I think what is happening is that the jobs fail because Cromwell shuts down the VMs earlier than expected. So one of the shards hasn't finished and is unable to upload to S3, hence the problem reported here. Anyway, this is a hypothesis based on what I saw; hopefully it's helpful.
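
A quick way to sanity-check this for a failing shard is to look at what Batch itself recorded for that job (a boto3 sketch; the job id is a placeholder for whatever Batch assigned to the failing shard):

import boto3

batch = boto3.client("batch", region_name="us-east-1")
job = batch.describe_jobs(jobs=["<failing-shard-job-id>"])["jobs"][0]

# If the container never reached exit code 0, its stdout log would never have been
# uploaded, which would line up with the "Could not read from s3://..." error.
print(job["status"], job.get("statusReason"))
print(job.get("container", {}).get("exitCode"), job.get("container", {}).get("reason"))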

@alexwaldrop

@geoffjentry Any movement on this? I'm having this same issue sporadically (v48 + AWS backend) with workflows that contain large scatter operations.

@geoffjentry
Contributor

@alexwaldrop NB that I don't work there anymore and sadly haven't had the energy to actively contribute. Perhaps @aednichols can chime in

@blindmouse

I am having the same error with the example "Using Data on S3" on https://docs.opendata.aws/genomics-workflows/orchestration/cromwell/cromwell-examples/. I changed the S3 bucket name in the .json file to my bucket name, but the run still failed. After the run was reported as failed, I got the same error message. I am using cromwell-48. The S3 bucket allows all public access, and I was logged in as the Admin in two terminal windows, one running the server and the other submitting the job. The previous two hello-world examples were successful. There is no log file in the bucket, and in the cromwell-execution directory the only file created was the script. No rc, stderr, or stdout was created.
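
In case it helps anyone, this is roughly how I checked what actually landed under the call directory (a boto3 sketch; the bucket and prefix are placeholders, not my real paths):

import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(
    Bucket="my-cromwell-bucket",
    Prefix="cromwell-execution/<workflow-name>/<workflow-id>/call-<task-name>/",
)
# Lists whatever Cromwell and the job wrote back; in my case only the script showed up.
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])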

@sripaladugu

sripaladugu commented Jul 21, 2020

@blindmouse Were you able to resolve your issue? I am encountering the same problem. Thanks.

@markjschreiber
Contributor

markjschreiber commented Jul 21, 2020 via email

@sripaladugu

This can happen if the job fails, meaning that an rc.txt file isn't created. It would be worth looking at the CloudWatch log for the batch job.

(That was in reply to my earlier question: "Is there any progress on this issue? I am getting the following exception: IOException: Could not read from s3:///results/ReadFile/5fec5c4a-2e3f-49ed-8f9e-6d9d2d759449/call-read_file/read_file-rc.txt Caused by: java.nio.file.NoSuchFileException: s3://s3.amazonaws.com/s3bucketname/results/ReadFile/5fec5c4a-2e3f-49ed-8f9e-6d9d2d759449/call-read_file/read_file-rc.txt")

CloudWatch logs contained the following message: "/bin/bash: /var/scratch/fetch_and_run.sh: Is a directory"
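
For anyone else digging in, the Batch job's CloudWatch log can also be pulled programmatically (a boto3 sketch; the job id is a placeholder, and Batch container jobs write to the /aws/batch/job log group by default):

import boto3

batch = boto3.client("batch", region_name="us-east-1")
logs = boto3.client("logs", region_name="us-east-1")

# Resolve the log stream Batch assigned to the job, then dump its messages.
job = batch.describe_jobs(jobs=["<job-id>"])["jobs"][0]
stream = job["container"]["logStreamName"]
events = logs.get_log_events(logGroupName="/aws/batch/job", logStreamName=stream)
for event in events["events"]:
    print(event["message"])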

@markjschreiber
Contributor

markjschreiber commented Aug 8, 2020 via email

@mderan-da

Hi @markjschreiber, I'm also running into this error. I am using Cromwell 53 with a custom CDK stack based on the CloudFormation infrastructure described here: https://docs.opendata.aws/genomics-workflows/

Are modifications needed for compatibility with newer versions of Cromwell? Are these documented somewhere?

@markjschreiber
Contributor

markjschreiber commented Sep 11, 2020 via email

@mderan-da

Hi @markjschreiber Thanks but it looks like the attachment didn't come through.

@yaomin

yaomin commented Sep 13, 2020

@markjschreiber I'm running into the same error for both v52 and v53.1. I am using the same CloudFormation stack @mderan-da mentioned. I'd appreciate the newer documentation on this.

@markjschreiber
Contributor

markjschreiber commented Sep 14, 2020 via email

@dfeinzeig

CloudWatch logs contained the following message: "/bin/bash: /var/scratch/fetch_and_run.sh: Is a directory"

Also have this error. Anyone figure out what the issue is?

@geertvandeweyer

Also have this error, using Cromwell 52, installed using this manual:

https://aws-genomics-workflows.s3.amazonaws.com/Installing+the+Genomics+Workflow+Core+and+Cromwell.pdf

Logs say: fetch_and_run.sh is a directory.

@geertvandeweyer


Extra info: cloning the job and resubmitting it through the AWS console runs fine, so it seems to be a transient issue.

@sscho

sscho commented May 13, 2021

Hmmm, still stuck on this - any updates from your guys' end? I tried cloning and resubmitting, still getting the same error.

@ptdtan

ptdtan commented Jun 8, 2021

Still getting this error today.

@alimayy

alimayy commented Sep 12, 2022

I'm getting this error almost every time I run workflows where more samples than usual (e.g. 96) are scattered.
Cromwell version: 60-6048d0e-SNAP.

Is there a workaround to this?

@rnaidu

rnaidu commented Jan 29, 2025

Hi all, are there any updates on a workaround for this error? I'm getting the same error using Cromwell v87.
