
[Bug]: NodeGroup MR is marked READY=true even if readinessCheck has not been satisfied #1228

Open
adamhouse opened this issue Mar 18, 2024 · 9 comments

@adamhouse

adamhouse commented Mar 18, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Affected Resource(s)

eks.aws.upbound.io/v1beta1 -- NodeGroup

Resource MRs required to reproduce the bug

    - name: eks-workers
      base:
        apiVersion: eks.aws.upbound.io/v1beta1
        kind: NodeGroup
        spec:
          deletionPolicy: Delete
          forProvider:
            clusterNameSelector:
              matchControllerRef: true
              matchLabels:
                name: cluster-main
              policy:
                resolution: Required
                resolve: Always
            nodeRoleArnSelector:
              matchControllerRef: true
              matchLabels:
                name: worker-role
              policy:
                resolution: Required
                resolve: Always
            amiType: AL2_x86_64
            scalingConfig:
              - maxSize: 3
                minSize: 1
                desiredSize: 3
      patches:
        - type: PatchSet
          patchSetName: aws-pc
        - type: PatchSet
          patchSetName: region
        - fromFieldPath: spec.vpcConfig.subnets
          toFieldPath: spec.forProvider.subnetIds
        - fromFieldPath: status.worker.launchTemplate.id
          toFieldPath: spec.forProvider.launchTemplate[0].id
          policy:
            fromFieldPath: Required
        - fromFieldPath: status.worker.launchTemplate.version
          toFieldPath: spec.forProvider.launchTemplate[0].version
          transforms:
            - type: convert
              convert:
                toType: string
          policy:
            fromFieldPath: Required
        - type: ToCompositeFieldPath
          fromFieldPath: status.atProvider.status
          toFieldPath: status.worker.status
      readinessChecks:
        - type: MatchString
          fieldPath: status.atProvider.status
          matchString: "ACTIVE"

Steps to Reproduce

If access permits, throttle API requests to IAM in the AWS account you're testing in. This should cause the NodeGroup creation to fail.
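
For reference, the status AWS reports for the node group can also be checked directly with the AWS CLI (the cluster and node group names below are placeholders):

aws eks describe-nodegroup --cluster-name <cluster-name> --nodegroup-name <nodegroup-name> --query nodegroup.status --output text

A failed creation returns CREATE_FAILED here, which is the same value the provider surfaces on the MR at status.atProvider.status.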

What happened?

If AWS is unable to create the NodeGroup above (resulting in a status of CREATE_FAILED on the AWS side, which is passed through to the resource at status.atProvider.status), Crossplane still marks the MR as READY=true despite a readinessCheck that explicitly matches the string ACTIVE.
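
A quick way to see the two values side by side (the resource name below is a placeholder):

kubectl get nodegroup.eks.aws.upbound.io <name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status} {.status.atProvider.status}{"\n"}'

On an affected resource this prints True CREATE_FAILED, i.e. the MR reports Ready=True even though status.atProvider.status is not ACTIVE.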

Relevant Error Output Snippet

Example MR.

Resource marked as READY=true

9:09:27 ~ kg nodegroup tgplat-sbx-4-bjfqh-mcxhq
NAME                       READY   SYNCED   EXTERNAL-NAME              AGE
tgplat-sbx-4-bjfqh-mcxhq   True    True     tgplat-sbx-4-bjfqh-mcxhq   9m19s

Value of status.atProvider.status != ACTIVE as defined by readinessCheck.

9:09:32 ~ kg nodegroup tgplat-sbx-4-bjfqh-mcxhq -oyaml | yq -r '.status'
atProvider:
  amiType: AL2_x86_64
  arn: <REMOVED>
  capacityType: ON_DEMAND
  clusterName: tgplat-sbx-4-bjfqh-q544l
  diskSize: 0
  id: tgplat-sbx-4-bjfqh-q544l:tgplat-sbx-4-bjfqh-mcxhq
  launchTemplate:
    - id: lt-06de386050da58058
      name: terraform-20240318135907606600000007
      version: "1"
  nodeRoleArn: <REMOVED>
  releaseVersion: 1.28.5-20240307
  resources:
    - remoteAccessSecurityGroupId: ""
  scalingConfig:
    - desiredSize: 3
      maxSize: 3
      minSize: 1
  status: CREATE_FAILED
  subnetIds:
    - <REMOVED>
    - <REMOVED>
    - <REMOVED>
  tags:
    crossplane-kind: nodegroup.eks.aws.upbound.io
    crossplane-name: tgplat-sbx-4-bjfqh-mcxhq
    crossplane-providerconfig: tgplat-sbx-4-bjfqh-xlc5d
  tagsAll:
    crossplane-kind: nodegroup.eks.aws.upbound.io
    crossplane-name: tgplat-sbx-4-bjfqh-mcxhq
    crossplane-providerconfig: tgplat-sbx-4-bjfqh-xlc5d
  updateConfig:
    - maxUnavailable: 1
      maxUnavailablePercentage: 0
  version: "1.28"

Crossplane Version

v1.15.0-up.1

Provider Version

v1.1.0

Kubernetes Version

v1.28.6-eks-508b6b3

Kubernetes Distribution

EKS

Additional Info

No response

adamhouse added the bug label Mar 18, 2024
@turkenf
Collaborator

turkenf commented May 8, 2024

Hi @adamhouse,

Thank you for raising this issue. Could you please try with provider version 1.4.0 and let us know?

@adamhouse
Author

@turkenf I'm still seeing the same behavior on provider version 1.8.0, but I'm not sure whether this is a real "problem" or I'm simply misunderstanding the intended use of readinessChecks.

My NodeGroup as defined in my Composition:

          - name: eks-worker
            base:
              apiVersion: eks.aws.upbound.io/v1beta1
              kind: NodeGroup
              spec:
                deletionPolicy: Delete
                forProvider:
                  amiType: AL2_x86_64
                  clusterNameSelector:
                    matchControllerRef: true
                    matchLabels:
                      name: cluster-main
                    policy:
                      resolution: Required
                      resolve: Always
                  nodeRoleArnSelector:
                    matchControllerRef: true
                    matchLabels:
                      name: worker-role
                    policy:
                      resolution: Required
                      resolve: Always
                  scalingConfig:
                    - desiredSize: 3
                      maxSize: 3
                      minSize: 1
            patches:
              - type: PatchSet
                patchSetName: aws-pc
              - type: PatchSet
                patchSetName: region
              - type: FromCompositeFieldPath
                fromFieldPath: spec.vpcConfig.subnets
                toFieldPath: spec.forProvider.subnetIds
              - type: FromCompositeFieldPath
                fromFieldPath: status.worker.launchTemplate.id
                toFieldPath: spec.forProvider.launchTemplate[0].id
                policy:
                  fromFieldPath: Required
              - type: FromCompositeFieldPath
                fromFieldPath: status.worker.launchTemplate.version
                toFieldPath: spec.forProvider.launchTemplate[0].version
                policy:
                  fromFieldPath: Required
                transforms:
                  - convert:
                      toType: string
                    type: convert
              - type: ToCompositeFieldPath
                fromFieldPath: status.atProvider.status
                toFieldPath: status.worker.status
            readinessChecks:
              - type: MatchString
                fieldPath: status.atProvider.status
                matchString: ACTIVE

The corresponding Managed Resource:

9:44:25 ~ kg nodegroup.eks vpc-cni-ue1-9c7t5-sq5nk
NAME                      SYNCED   READY   EXTERNAL-NAME             AGE
vpc-cni-ue1-9c7t5-sq5nk   True     True    vpc-cni-ue1-9c7t5-sq5nk   22m

The MR's status.atProvider output:

9:44:38 ~ kg nodegroup.eks vpc-cni-ue1-9c7t5-sq5nk -oyaml | yq -r .status.atProvider
amiType: AL2_x86_64
arn: <REMOVED>
capacityType: ON_DEMAND
clusterName: vpc-cni-ue1-sbx
diskSize: 0
id: vpc-cni-ue1-sbx:vpc-cni-ue1-9c7t5-sq5nk
launchTemplate:
  id: lt-0d97494af0df766d2
  name: terraform-20240628142005627000000003
  version: "1"
nodeRoleArn: <REMOVED>
releaseVersion: 1.29.3-20240625
scalingConfig:
  desiredSize: 3
  maxSize: 3
  minSize: 1
status: CREATING
subnetIds:
  - <REMOVED>
  - <REMOVED>
  - <REMOVED>
tags:
  crossplane-kind: nodegroup.eks.aws.upbound.io
  crossplane-name: vpc-cni-ue1-9c7t5-sq5nk
  crossplane-providerconfig: vpc-cni-ue1-9c7t5-pmrlf
tagsAll:
  crossplane-kind: nodegroup.eks.aws.upbound.io
  crossplane-name: vpc-cni-ue1-9c7t5-sq5nk
  crossplane-providerconfig: vpc-cni-ue1-9c7t5-pmrlf
updateConfig:
  maxUnavailable: 1
  maxUnavailablePercentage: 0
version: "1.29"

Note that the value of status.atProvider.status is CREATING, which conflicts with the readinessCheck I have defined. Given this scenario, should I expect the MR to show as READY=false?


This provider repo does not have enough maintainers to address every issue. Since there has been no activity in the last 90 days it is now marked as stale. It will be closed in 14 days if no further activity occurs. Leaving a comment starting with /fresh will mark this issue as not stale.

github-actions bot added the stale label Sep 27, 2024
@mitchelldavis44

mitchelldavis44 commented Sep 27, 2024

/fresh bump

github-actions bot removed the stale label Sep 28, 2024
@bobh66
Contributor

bobh66 commented Oct 14, 2024

I'm seeing this problem on 1.11.0, and I would expect the NodeGroup not to show condition Ready=True if it is in CREATE_FAILED.

@blakeromano
Contributor

Based on my understanding, I think this is a mismatch between what Terraform gives back, when it gives it back, and what Crossplane does with it. I've seen the same thing on CloudFront distributions. https://github.com/hashicorp/terraform-provider-aws/blob/main/internal/service/eks/node_group.go#L709-L728 is called, and I think Crossplane isn't accounting for the status that actually matters (for a NodeGroup that isn't the readiness signal, it's the status Terraform reports).

However, I do think this raises an interesting conversation: if we see this problem with several resources, does it make sense to support a custom ready/health check per resource?

@bobh66
Contributor

bobh66 commented Oct 15, 2024

It's interesting that the Cluster.eks resource stays in Ready=False until the cluster is available, which takes a lot longer than a typical NodeGroup. It could very well be a difference in the Terraform implementation.


This provider repo does not have enough maintainers to address every issue. Since there has been no activity in the last 90 days it is now marked as stale. It will be closed in 14 days if no further activity occurs. Leaving a comment starting with /fresh will mark this issue as not stale.

github-actions bot added the stale label Jan 14, 2025
@stevendborrelli
Contributor

/fresh

github-actions bot removed the stale label Jan 22, 2025