
[Bug]: NodeGroup MR is marked READY=true even if readinessCheck has not been satisfied #1228

Open
adamhouse opened this issue Mar 18, 2024 · 9 comments

@adamhouse

adamhouse commented Mar 18, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Affected Resource(s)

eks.aws.upbound.io/v1beta1 -- NodeGroup

Resource MRs required to reproduce the bug

    - name: eks-workers
      base:
        apiVersion: eks.aws.upbound.io/v1beta1
        kind: NodeGroup
        spec:
          deletionPolicy: Delete
          forProvider:
            clusterNameSelector:
              matchControllerRef: true
              matchLabels:
                name: cluster-main
              policy:
                resolution: Required
                resolve: Always
            nodeRoleArnSelector:
              matchControllerRef: true
              matchLabels:
                name: worker-role
              policy:
                resolution: Required
                resolve: Always
            amiType: AL2_x86_64
            scalingConfig:
              - maxSize: 3
                minSize: 1
                desiredSize: 3
      patches:
        - type: PatchSet
          patchSetName: aws-pc
        - type: PatchSet
          patchSetName: region
        - fromFieldPath: spec.vpcConfig.subnets
          toFieldPath: spec.forProvider.subnetIds
        - fromFieldPath: status.worker.launchTemplate.id
          toFieldPath: spec.forProvider.launchTemplate[0].id
          policy:
            fromFieldPath: Required
        - fromFieldPath: status.worker.launchTemplate.version
          toFieldPath: spec.forProvider.launchTemplate[0].version
          transforms:
            - type: convert
              convert:
                toType: string
          policy:
            fromFieldPath: Required
        - type: ToCompositeFieldPath
          fromFieldPath: status.atProvider.status
          toFieldPath: status.worker.status
      readinessChecks:
        - type: MatchString
          fieldPath: status.atProvider.status
          matchString: "ACTIVE"

Steps to Reproduce

If access permits, throttle API requests to IAM in the AWS account you're testing in. This should cause the NodeGroup creation to fail.
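
For reference, the status AWS reports for the node group can also be checked directly with the AWS CLI (the cluster and node group names below are placeholders):

aws eks describe-nodegroup --cluster-name <cluster-name> --nodegroup-name <nodegroup-name> --query nodegroup.status --output text

A failed creation returns CREATE_FAILED here, which is the same value the provider surfaces on the MR at status.atProvider.status.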

What happened?

If AWS is unable to create the NodeGroup above (resulting in a status of CREATE_FAILED on the AWS side, which is passed through to the resource at status.atProvider.status), Crossplane still marks the MR as READY=true despite a readinessCheck that explicitly matches the string ACTIVE.
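
A quick way to see the two values side by side (the resource name below is a placeholder):

kubectl get nodegroup.eks.aws.upbound.io <name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status} {.status.atProvider.status}{"\n"}'

On an affected resource this prints True CREATE_FAILED, i.e. the MR reports Ready=True even though status.atProvider.status is not ACTIVE.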

Relevant Error Output Snippet

Example MR.

Resource marked as READY=true

9:09:27 ~ kg nodegroup tgplat-sbx-4-bjfqh-mcxhq
NAME                       READY   SYNCED   EXTERNAL-NAME              AGE
tgplat-sbx-4-bjfqh-mcxhq   True    True     tgplat-sbx-4-bjfqh-mcxhq   9m19s

Value of status.atProvider.status != ACTIVE as defined by readinessCheck.

9:09:32 ~ kg nodegroup tgplat-sbx-4-bjfqh-mcxhq -oyaml | yq -r '.status'
atProvider:
  amiType: AL2_x86_64
  arn: <REMOVED>
  capacityType: ON_DEMAND
  clusterName: tgplat-sbx-4-bjfqh-q544l
  diskSize: 0
  id: tgplat-sbx-4-bjfqh-q544l:tgplat-sbx-4-bjfqh-mcxhq
  launchTemplate:
    - id: lt-06de386050da58058
      name: terraform-20240318135907606600000007
      version: "1"
  nodeRoleArn: <REMOVED>
  releaseVersion: 1.28.5-20240307
  resources:
    - remoteAccessSecurityGroupId: ""
  scalingConfig:
    - desiredSize: 3
      maxSize: 3
      minSize: 1
  status: CREATE_FAILED
  subnetIds:
    - <REMOVED>
    - <REMOVED>
    - <REMOVED>
  tags:
    crossplane-kind: nodegroup.eks.aws.upbound.io
    crossplane-name: tgplat-sbx-4-bjfqh-mcxhq
    crossplane-providerconfig: tgplat-sbx-4-bjfqh-xlc5d
  tagsAll:
    crossplane-kind: nodegroup.eks.aws.upbound.io
    crossplane-name: tgplat-sbx-4-bjfqh-mcxhq
    crossplane-providerconfig: tgplat-sbx-4-bjfqh-xlc5d
  updateConfig:
    - maxUnavailable: 1
      maxUnavailablePercentage: 0
  version: "1.28"

Crossplane Version

v1.15.0-up.1

Provider Version

v1.1.0

Kubernetes Version

v1.28.6-eks-508b6b3

Kubernetes Distribution

EKS

Additional Info

No response

adamhouse added the bug label Mar 18, 2024
@turkenf
Collaborator

turkenf commented May 8, 2024

Hi @adamhouse,

Thank you for raising this issue. Could you please try with provider version 1.4.0 and let us know?

@adamhouse
Author

@turkenf I'm still seeing the same behavior on provider version 1.8.0, but I'm not sure whether this is a real "problem" or I'm simply misunderstanding the intended use of readinessChecks.

My NodeGroup as defined in my Composition:

          - name: eks-worker
            base:
              apiVersion: eks.aws.upbound.io/v1beta1
              kind: NodeGroup
              spec:
                deletionPolicy: Delete
                forProvider:
                  amiType: AL2_x86_64
                  clusterNameSelector:
                    matchControllerRef: true
                    matchLabels:
                      name: cluster-main
                    policy:
                      resolution: Required
                      resolve: Always
                  nodeRoleArnSelector:
                    matchControllerRef: true
                    matchLabels:
                      name: worker-role
                    policy:
                      resolution: Required
                      resolve: Always
                  scalingConfig:
                    - desiredSize: 3
                      maxSize: 3
                      minSize: 1
            patches:
              - type: PatchSet
                patchSetName: aws-pc
              - type: PatchSet
                patchSetName: region
              - type: FromCompositeFieldPath
                fromFieldPath: spec.vpcConfig.subnets
                toFieldPath: spec.forProvider.subnetIds
              - type: FromCompositeFieldPath
                fromFieldPath: status.worker.launchTemplate.id
                toFieldPath: spec.forProvider.launchTemplate[0].id
                policy:
                  fromFieldPath: Required
              - type: FromCompositeFieldPath
                fromFieldPath: status.worker.launchTemplate.version
                toFieldPath: spec.forProvider.launchTemplate[0].version
                policy:
                  fromFieldPath: Required
                transforms:
                  - convert:
                      toType: string
                    type: convert
              - type: ToCompositeFieldPath
                fromFieldPath: status.atProvider.status
                toFieldPath: status.worker.status
            readinessChecks:
              - type: MatchString
                fieldPath: status.atProvider.status
                matchString: ACTIVE

The corresponding Managed Resource:

9:44:25 ~ kg nodegroup.eks vpc-cni-ue1-9c7t5-sq5nk
NAME                      SYNCED   READY   EXTERNAL-NAME             AGE
vpc-cni-ue1-9c7t5-sq5nk   True     True    vpc-cni-ue1-9c7t5-sq5nk   22m

The MR's status.atProvider output:

9:44:38 ~ kg nodegroup.eks vpc-cni-ue1-9c7t5-sq5nk -oyaml | yq -r .status.atProvider
amiType: AL2_x86_64
arn: <REMOVED>
capacityType: ON_DEMAND
clusterName: vpc-cni-ue1-sbx
diskSize: 0
id: vpc-cni-ue1-sbx:vpc-cni-ue1-9c7t5-sq5nk
launchTemplate:
  id: lt-0d97494af0df766d2
  name: terraform-20240628142005627000000003
  version: "1"
nodeRoleArn: <REMOVED>
releaseVersion: 1.29.3-20240625
scalingConfig:
  desiredSize: 3
  maxSize: 3
  minSize: 1
status: CREATING
subnetIds:
  - <REMOVED>
  - <REMOVED>
  - <REMOVED>
tags:
  crossplane-kind: nodegroup.eks.aws.upbound.io
  crossplane-name: vpc-cni-ue1-9c7t5-sq5nk
  crossplane-providerconfig: vpc-cni-ue1-9c7t5-pmrlf
tagsAll:
  crossplane-kind: nodegroup.eks.aws.upbound.io
  crossplane-name: vpc-cni-ue1-9c7t5-sq5nk
  crossplane-providerconfig: vpc-cni-ue1-9c7t5-pmrlf
updateConfig:
  maxUnavailable: 1
  maxUnavailablePercentage: 0
version: "1.29"

Note that the value of status.atProvider.status is CREATING, which conflicts with the readinessCheck I have defined. Given this scenario, should I expect the MR to show as READY=false?


This provider repo does not have enough maintainers to address every issue. Since there has been no activity in the last 90 days it is now marked as stale. It will be closed in 14 days if no further activity occurs. Leaving a comment starting with /fresh will mark this issue as not stale.

github-actions bot added the stale label Sep 27, 2024
@mitchelldavis44

mitchelldavis44 commented Sep 27, 2024

/fresh bump

github-actions bot removed the stale label Sep 28, 2024
@bobh66
Contributor

bobh66 commented Oct 14, 2024

I'm seeing this problem on 1.11.0, and I would expect the NodeGroup not to show condition Ready=True if it is in CREATE_FAILED.

@blakeromano
Contributor

Based on my understanding, I think this is a mismatch between what Terraform gives back, when it gives it back, and what Crossplane does with it. I've seen the same thing on CloudFront distributions. https://github.com/hashicorp/terraform-provider-aws/blob/main/internal/service/eks/node_group.go#L709-L728 is called, and I think Crossplane isn't accounting for the status that actually matters (for a NodeGroup that isn't the readiness signal, it's the status Terraform reports).

However, I do think this raises an interesting conversation: if we see this problem with several resources, does it make sense to support a custom ready/health check per resource?

@bobh66
Contributor

bobh66 commented Oct 15, 2024

It's interesting that the Cluster.eks resource stays in Ready=False until the cluster is available, which takes a lot longer than a typical NodeGroup. It could very well be a difference in the Terraform implementation.


This provider repo does not have enough maintainers to address every issue. Since there has been no activity in the last 90 days it is now marked as stale. It will be closed in 14 days if no further activity occurs. Leaving a comment starting with /fresh will mark this issue as not stale.

github-actions bot added the stale label Jan 14, 2025
@stevendborrelli
Contributor

/fresh

github-actions bot removed the stale label Jan 22, 2025