Terraform with S3 backend sporadically fails with "Error: RequestError: send request failed caused by: Post "https://sts.amazonaws.com/": net/http: TLS handshake timeout" #28714

Closed
Cajga opened this issue May 16, 2021 · 6 comments
Labels
backend/s3 bug new new issue not yet triaged

Comments

Cajga commented May 16, 2021

Terraform Version

$ terraform version
Terraform v0.15.3
on linux_amd64

Terraform Configuration Files

Because the issue occurs even during terraform init, here is how the remote backend is configured:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "=3.39.0"
    }
  }
  required_version = "= 0.15.3"
  backend "s3" {
    bucket = "remote-state-IDGOESHERE"
    key    = "tf-state/terraform.tfstate"
    region = "eu-central-1"

    dynamodb_table = "remote-state-lock-IDGOESHERE"
    encrypt        = true
  }
}

Debug Output

We are running in a CI environment that is configured to run terraform in different directories (sometimes at the same time, but inside a container, so they do not conflict). In every CI job, we run terraform init followed by terraform plan or terraform apply. The failures occur sporadically with all kinds of calls (roughly 1 in every 30 terraform calls).
Here you can see two trace outputs from two different runs:

Expected Behavior

Terraform calls work as expected

Actual Behavior

Terraform calls (init, plan and, on a few occasions, apply) fail

Steps to Reproduce

As mentioned, the issue happens randomly with all types of terraform calls

Additional Context

As mentioned above, terraform runs in a CI environment against multiple directories. The issue happens sporadically and with different types of calls.

References

Searching for the phrase "net/http: TLS handshake timeout", I found some very old tickets that were closed as "unable to reproduce", but I am not sure whether they are relevant (in some of them the issue was permanent rather than sporadic).

Author

Cajga commented May 16, 2021

Maybe relevant: manually restarting the same CI job (which means no change in config!) normally makes it succeed

@oyaaraas

I get this 99% of the time, but then once in a while it just works. Same version: 0.15.3. I will have to reconsider using terraform unless someone has a solution.

@oyaaraas

After further investigation, I think I have found the issue on my side: it is probably the firewall in my router that is causing it, as everything worked fine when I tried it over a 4G connection.

@gdavison
Contributor

Hi @Cajga. The error that you're seeing, "net/http: TLS handshake timeout", indicates a networking problem: Terraform is not able to establish a secure connection with AWS. The endpoint Terraform is trying to reach, https://sts.amazonaws.com/, belongs to the AWS Security Token Service (STS), which is used to authenticate sessions with the AWS API.

This may be caused by service issues at AWS, a poor network connection between your CI server and AWS, or problems with the TLS configuration.

Author

Cajga commented Jun 11, 2021

Hi @gdavison, thanks for looking into this.

As you correctly said, it could be that the STS API is down/slow during these failing requests. At the same time, this seems to be happening ONLY when we touch the S3 back-end and NEVER when we use the AWS provider to configure resources on AWS.

Would it be possible to make the back-end connection more reliable on the terraform client side? It seems that when people hit this issue with the golang net/http package, they either increase the TLS handshake timeout (here you can find an example of this) or, as a best practice, retry the request if the first connection attempt fails.
This would be really useful for CI environments where one may have slow hosts or a shared network.

P.S.: If you look at the second log that I included in the description, you can see that an init worked fine but, a few seconds later, the plan failed. This rules out a wrong TLS configuration (at least on our side).

@github-actions
Contributor

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 12, 2021