OCPBUGS-48469: GCP: Update /etc/hosts file when ClusterHostedDNS is enabled #4800

Open
wants to merge 3 commits into
base: master

Contributor

@sadasu sadasu commented Jan 22, 2025

Append entries to /etc/hosts to resolve the cluster API and API-Int URLs. /etc/hosts provides resolution for these URLs until kubelet joins the cluster and runs its CoreDNS pod, which then takes over resolution of those two URLs.

- What I did

- How to verify it

- Description for the changelog

@sadasu sadasu changed the title GCP: Update /etc/hosts file when ClusterHostedDNS is enabled WIP: OCPBUGS-48469: GCP: Update /etc/hosts file when ClusterHostedDNS is enabled Jan 22, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 22, 2025
@openshift-ci-robot openshift-ci-robot added jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Jan 22, 2025
@openshift-ci-robot
Contributor

@sadasu: This pull request references Jira Issue OCPBUGS-48469, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jianli-wei

The bug has been updated to refer to the pull request using the external bug tracker.

@sadasu sadasu force-pushed the gcp-update-etc-hosts branch 9 times, most recently from 1bc96ca to 3280ff8 Compare January 23, 2025 21:50
apiServerIntURL={{ .Infra.Status.APIServerInternalURL }}
# Add the /etc/hosts configuration file
mkdir -p /etc/hosts/conf.d
cat <<EOF | tee /etc/hosts/conf.d/etc-hosts.conf
Member

Could maybe name the file api.conf or something.

Contributor Author

I was hoping that naming it etc-hosts.conf would make it obvious that this file contains configuration for /etc/hosts. Happy to call it api.conf since it contains information about resolving the API and API-Int URLs.

{{ else }}
exit 0
{{ end }}
if [ -z "${apiIntLBIPs}" ]; then
Member

Should this be:

Suggested change
if [ -z "${apiIntLBIPs}" ]; then
if [ -z "{{$apiIntLBIPs}}" ]; then

?

We appear to be missing a line like:

  apiIntLBIPs={{$apiIntLBIPs}}

to get the template variable that we defined on line 19 into a bash variable.

mkdir -p /etc/hosts/conf.d
cat <<EOF | tee /etc/hosts/conf.d/etc-hosts.conf
# Added by OpenShift
${apiLBIPs[0]} ${apiServerURL}
Member

We are treating apiLBIPs as a bash array here. That's fine if it is, but bash arrays always seem like kind of a pain to set up to me. It might be simpler to do this in the template:

Suggested change
${apiLBIPs[0]} ${apiServerURL}
{{ index $apiLBIPs 0 }} ${apiServerURL}
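
To illustrate the array concern: the rendered value seen later in this thread looks like `[10.0.0.2]`, which bash stores as a plain string rather than an array, so `${apiLBIPs[0]}` returns the whole string with its brackets. The sketch below uses a hypothetical `first_ip` helper (not part of the PR) and assumes multiple addresses would be rendered space-separated, as Go's default slice formatting does.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): take the first address out of a
# rendered list such as "[10.0.0.2]" or "[10.0.0.2 10.0.0.3]".
first_ip() {
  local rendered="$1"
  rendered="${rendered#\[}"   # drop the leading "["
  rendered="${rendered%\]}"   # drop the trailing "]"
  # Unquoted on purpose: split on whitespace and keep the first word.
  set -- $rendered
  printf '%s\n' "${1-}"
}

apiLBIPs='[34.54.248.13]'        # a plain string, not a bash array
printf '%s\n' "${apiLBIPs[0]}"   # prints the whole string, brackets included
first_ip "${apiLBIPs}"           # prints 34.54.248.13
```

Doing the indexing in the template, as suggested above, avoids this parsing step entirely.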

@@ -0,0 +1,11 @@
mode: 0755
path: "/usr/local/bin/update-etc-hosts"
contents:
Member

Potentially a simpler way, avoiding the script and systemd service:

path: "/etc/hosts"
append:
  - inline: |
    {{ if and (eq .Infra.Status.PlatformStatus.Type "GCP") (.Infra.Status.PlatformStatus.GCP) (.Infra.Status.PlatformStatus.GCP.CloudLoadBalancerConfig) (eq .Infra.Status.PlatformStatus.GCP.CloudLoadBalancerConfig.DNSType "ClusterHosted") }}
    {{ $apiIntLBIPs := cloudPlatformAPIIntLoadBalancerIPs . }}
    {{ if gt (len $apiIntLBIPs) 0 }}
    {{ $apiLBIPs := cloudPlatformAPILoadBalancerIPs . }}
    {{ if gt (len $apiLBIPs) 0 }}{{ index $apiLBIPs 0 }}{{ else }}{{ index $apiIntLBIPs 0 }}{{ end }} {{ .Infra.Status.APIServerURL }}
    {{ index $apiIntLBIPs 0 }} {{ .Infra.Status.APIServerInternalURL }}
    {{ end }}
    {{ end }}

Contributor Author

@sadasu sadasu Jan 27, 2025

The systemd service gives us a way to schedule this script to run before kubelet.

Check whether the feature to run in-cluster DNS on GCP and AWS is enabled
by testing whether the value of
`PlatformStatus.GCP.CloudLoadBalancerConfig.DNSType` is set to
`ClusterHosted`.
@sadasu sadasu force-pushed the gcp-update-etc-hosts branch 2 times, most recently from 15c0227 to 3d451a6 Compare January 27, 2025 19:21
@sadasu sadasu changed the title WIP: OCPBUGS-48469: GCP: Update /etc/hosts file when ClusterHostedDNS is enabled OCPBUGS-48469: GCP: Update /etc/hosts file when ClusterHostedDNS is enabled Jan 27, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 27, 2025
@sadasu sadasu force-pushed the gcp-update-etc-hosts branch 5 times, most recently from 91d44ca to 1b13a9c Compare January 27, 2025 21:57
Contributor

@yuqi-zhang yuqi-zhang left a comment

Logically seems fine, although I am unsure how to test, so let me know if you'd like any QE pre-merge testing on this

@@ -777,6 +778,33 @@ func cloudPlatformIngressLoadBalancerIPs(cfg RenderConfig) (interface{}, error)
}
}

// cloudPlatformLBIPAvailable returns true when DNSType is set to `ClusterHosted`
Contributor

Curious, based on the comment I'd expect some check for clusterhosted in the function. I guess it's implicit since the service enablement is dependent on this field?

(I know we do the same elsewhere in the template rendering, so I'm fine with it as is)

Contributor

openshift-ci bot commented Jan 28, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sadasu, yuqi-zhang

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 28, 2025
@gpei
gpei commented Jan 28, 2025

@sadasu Hi, I just tried to use the latest commit to test the installation of GCP custom DNS. The gcp-update-etc-hosts.service couldn't be started on masters because of the following syntax error:

[core@gpei-0128-gcpdns-5zs6h-master-0 ~]$ journalctl -u gcp-update-etc-hosts.service --no-pager
Jan 28 05:58:05 gpei-0128-gcpdns-5zs6h-master-0 systemd[1]: Starting Update Default GCP /etc/hosts...
Jan 28 05:58:05 gpei-0128-gcpdns-5zs6h-master-0 bash[1233]: /bin/bash: -c: line 1: syntax error near unexpected token `then'
Jan 28 05:58:05 gpei-0128-gcpdns-5zs6h-master-0 bash[1233]: /bin/bash: -c: line 1: `    apiIntLBIPs=[10.0.0.2]      apiLBIPs=[34.54.248.13]    if [ -z $apiLBIPs ]; then    apiLBIPs=$apiIntLBIPs  fi  apiServerURL=https://api.gpei-0128-gcpdns.qe.gcp.devcluster.openshift.com:6443  apiServerIntURL=https://api-int.gpei-0128-gcpdns.qe.gcp.devcluster.openshift.com:6443  mkdir -p /etc/conf.d  cat <<EOF | tee /etc/conf.d/etc-hosts.conf              EOF  /usr/local/bin/update-etc-hosts'
Jan 28 05:58:05 gpei-0128-gcpdns-5zs6h-master-0 systemd[1]: gcp-update-etc-hosts.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jan 28 05:58:05 gpei-0128-gcpdns-5zs6h-master-0 systemd[1]: gcp-update-etc-hosts.service: Failed with result 'exit-code'.
Jan 28 05:58:05 gpei-0128-gcpdns-5zs6h-master-0 systemd[1]: Failed to start Update Default GCP /etc/hosts.
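
The journal above shows the whole script reaching bash as a single line. In that form `if` follows two variable assignments with no `;` in between, so bash treats it as an ordinary word and then fails on `then`. A minimal sketch of a pre-flight check follows, using `bash -n` (parse only, nothing is executed). The `parses_ok` helper and the sample strings are illustrative and not taken from the PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether bash can parse a one-line script.
# "bash -n" reads the commands and checks syntax without executing them.
parses_ok() {
  bash -n -c "$1" 2>/dev/null
}

# Statements joined with spaces only, as in the failing unit.
joined='a=1 b=2 if [ -z "$b" ]; then b=$a fi echo "$b"'
# The same statements with explicit ";" separators.
separated='a=1; b=2; if [ -z "$b" ]; then b=$a; fi; echo "$b"'

parses_ok "$joined"    || echo "joined: syntax error"
parses_ok "$separated" && echo "separated: parses"
```

Running a check like this against the rendered `ExecStart` string would flag the problem before the unit is ever started.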

@sadasu sadasu force-pushed the gcp-update-etc-hosts branch from 1b13a9c to 77c93b5 Compare January 29, 2025 04:34
@gpei
gpei commented Jan 29, 2025

Hi @sadasu, there is still something wrong with the bash syntax in gcp-update-etc-hosts.service:

[core@gpei-0129-gcpdns-lf995-master-0 ~]$ journalctl -u gcp-update-etc-hosts.service --no-pager
Jan 29 06:22:45 gpei-0129-gcpdns-lf995-master-0 systemd[1]: Starting Update Default GCP /etc/hosts...
Jan 29 06:22:45 gpei-0129-gcpdns-lf995-master-0 bash[1230]: /bin/bash: line 1: warning: here-document at line 1 delimited by end-of-file (wanted `EOF')
Jan 29 06:22:45 gpei-0129-gcpdns-lf995-master-0 bash[1231]: /bin/bash: line 1: if: command not found
Jan 29 06:22:45 gpei-0129-gcpdns-lf995-master-0 bash[1232]: tee: /etc/conf.d/etc-hosts.conf: No such file or directory
Jan 29 06:22:45 gpei-0129-gcpdns-lf995-master-0 bash[1232]: tee: EOF: Operation not permitted
Jan 29 06:22:45 gpei-0129-gcpdns-lf995-master-0 systemd[1]: gcp-update-etc-hosts.service: Main process exited, code=exited, status=1/FAILURE
Jan 29 06:22:45 gpei-0129-gcpdns-lf995-master-0 systemd[1]: gcp-update-etc-hosts.service: Failed with result 'exit-code'.
Jan 29 06:22:45 gpei-0129-gcpdns-lf995-master-0 systemd[1]: Failed to start Update Default GCP /etc/hosts.
[core@gpei-0129-gcpdns-lf995-master-0 ~]$ 
[core@gpei-0129-gcpdns-lf995-master-0 ~]$ 
[core@gpei-0129-gcpdns-lf995-master-0 ~]$ systemctl cat gcp-update-etc-hosts.service
# /etc/systemd/system/gcp-update-etc-hosts.service
[Unit]
Description=Update Default GCP /etc/hosts
# We don't need to do this on the firstboot
After=firstboot-osupdate.target
# Wait for NetworkManager to report it's online
After=NetworkManager-wait-online.service
# Run before kubelet
Before=kubelet-dependencies.target

[Service]
# Need oneshot to delay kubelet
Type=oneshot
ExecStart=/bin/bash -c " \
 \
apiIntLBIPs=[10.0.0.2] \
apiIntLBIP=10.0.0.2 \
 \
 \
apiLBIPs=[34.102.208.176] \
apiLBIP=34.102.208.176 \
 \
if [ -z "$apiLBIPs" ] \
then \
  # apiLBIPs will not be set on private clusters
  apiLBIPs=$apiIntLBIPs \
fi \
apiServerURL=https://api.gpei-0129-gcpdns.qe.gcp.devcluster.openshift.com:6443 \
apiServerIntURL=https://api-int.gpei-0129-gcpdns.qe.gcp.devcluster.openshift.com:6443 \
mkdir -p /etc/conf.d \
cat <<EOF | tee /etc/conf.d/etc-hosts.conf \
${apiLBIPs[0]}    ${apiServerURL} \
${apiIntLBIPs[0]}    ${apiServerIntURL} \
EOF \
# Update /etc/hosts \
/usr/local/bin/update-etc-hosts"

[Install]
RequiredBy=kubelet-dependencies.target
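
In the unit above every line ends in a backslash, which systemd treats as a line continuation, so the here-document body and its `EOF` terminator end up on the same line as `cat <<EOF`. That is consistent with the "delimited by end-of-file" warning in the journal. One way around it, sketched here with a hypothetical `write_hosts_conf` helper and placeholder addresses, is to drop the here-document and let `printf` emit one line per pair.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write "IP hostname" pairs without a here-document,
# so the command still works if it ends up on a single line.
write_hosts_conf() {
  local conf="$1"; shift
  # printf repeats the format for each pair of remaining arguments.
  printf '%s    %s\n' "$@" > "$conf"
}

conf="$(mktemp)"
write_hosts_conf "$conf" 192.0.2.10 api.example.test 10.0.0.2 api-int.example.test
cat "$conf"
rm -f "$conf"
```

Keeping this logic in a separate script file, rather than inline in `ExecStart`, sidesteps the continuation issue as well.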

@sadasu sadasu force-pushed the gcp-update-etc-hosts branch from 77c93b5 to 171e94a Compare January 30, 2025 03:55
@sadasu
Contributor Author

sadasu commented Jan 30, 2025

@gpei thanks for testing the 2 previous versions. I don't see why the bash script with correct syntax has failures when run within the systemd unit. I have reorganized my code and posted another version.

@gpei
gpei commented Jan 30, 2025

@sadasu thanks for the update. With the latest code we can move one step further, but the /usr/local/bin/update-etc-hosts script still doesn't seem to be working as expected.

Here are the contents of some key files on the master for your reference:

[root@gpei-0130-gcpdns-hq6q8-master-2 ~]# cat /etc/conf.d/etc-hosts.conf
    
    
[root@gpei-0130-gcpdns-hq6q8-master-2 ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
    
    
[root@gpei-0130-gcpdns-hq6q8-master-2 ~]# journalctl -u gcp-update-etc-hosts.service
Jan 30 06:11:27 gpei-0130-gcpdns-hq6q8-master-2 systemd[1]: Starting Update Default GCP /etc/hosts...
Jan 30 06:11:27 gpei-0130-gcpdns-hq6q8-master-2 bash[1235]:     
Jan 30 06:11:27 gpei-0130-gcpdns-hq6q8-master-2 bash[1235]:     
Jan 30 06:11:27 gpei-0130-gcpdns-hq6q8-master-2 bash[1229]: Done updating /etc/hosts
Jan 30 06:11:27 gpei-0130-gcpdns-hq6q8-master-2 systemd[1]: gcp-update-etc-hosts.service: Deactivated successfully.
Jan 30 06:11:27 gpei-0130-gcpdns-hq6q8-master-2 systemd[1]: Finished Update Default GCP /etc/hosts.
[root@gpei-0130-gcpdns-hq6q8-master-2 ~]# cat /usr/local/bin/update-etc-hosts
#!/bin/bash
apiLBIP=${1}
apiURL=${2}
apiIntLBIP=${3}
apiIntURL=${4}
if [ -z "$apiLBIP" ]; then
  # apiLBIPs are not expected to be set on private clusters
  apiLBIP=$apiIntLBIP
fi
mkdir -p /etc/conf.d
etc_hosts_config_filename="/etc/conf.d/etc-hosts.conf"
echo "${apiLBIP}    ${apiURL%:*}" >> ${etc_hosts_config_filename}
echo "${apiIntLBIP}    ${apiIntURL%:*}" >> ${etc_hosts_config_filename}
cat /etc/conf.d/etc-hosts.conf
cat /etc/conf.d/etc-hosts.conf >> /etc/hosts
echo "Done updating /etc/hosts"
[root@gpei-0130-gcpdns-hq6q8-master-2 ~]# systemctl cat gcp-update-etc-hosts.service
# /etc/systemd/system/gcp-update-etc-hosts.service
[Unit]
Description=Update Default GCP /etc/hosts
# We don't need to do this on the firstboot
After=firstboot-osupdate.target
# Wait for NetworkManager to report it's online
After=NetworkManager-wait-online.service
# Run before kubelet
Before=kubelet-dependencies.target

[Service]
# Need oneshot to delay kubelet
Type=oneshot
ExecStart=/bin/bash -c " \
 \
apiIntLBIP=10.0.0.2 \
 \
 \
apiLBIP=34.160.207.88 \
 \
apiServerURL=https://api.gpei-0130-gcpdns.qe.gcp.devcluster.openshift.com:6443 \
apiServerIntURL=https://api-int.gpei-0130-gcpdns.qe.gcp.devcluster.openshift.com:6443 \
/usr/local/bin/update-etc-hosts ${apiLBIP} ${apiServerURL} ${apiIntLBIP} ${apiServerIntURL}"

[Install]
RequiredBy=kubelet-dependencies.target

It looks like two empty lines have been added to /etc/conf.d/etc-hosts.conf: the values of ${apiLBIP}, ${apiServerURL}, ${apiIntLBIP} and ${apiServerIntURL} were not set when /usr/local/bin/update-etc-hosts ran. Maybe we also need to define them via "Environment" in the gcp-update-etc-hosts systemd service, or find some other way to make sure the variables are set correctly.
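
One explanation consistent with the empty lines: systemd performs its own `${VAR}` substitution on `ExecStart=` command lines (its documentation gives `$$` as the escape for a literal dollar sign), so the references would be replaced before bash has run the assignments. The sketch below reproduces the same effect with plain shell quoting as an analogy. It does not involve systemd, and the function names and address are made up.

```shell
#!/usr/bin/env bash
# Early expansion: inside double quotes the OUTER layer substitutes
# ${apiLBIP} before the inner bash runs, so the inner script sees "".
expand_early() {
  local apiLBIP=""   # the outer layer has no value for the variable
  bash -c "apiLBIP=192.0.2.10; echo \"[${apiLBIP}]\""
}

# Late expansion: single quotes pass the text through untouched, and the
# inner bash expands the variable after assigning it.
expand_late() {
  bash -c 'apiLBIP=192.0.2.10; echo "[${apiLBIP}]"'
}

expand_early   # prints []
expand_late    # prints [192.0.2.10]
```

Passing the values as arguments from a script file, or rendering them directly into the script, avoids depending on which layer expands the variable.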

fi
mkdir -p /etc/conf.d
etc_hosts_config_filename="/etc/conf.d/etc-hosts.conf"
echo "${apiLBIP} ${apiURL%:*}" >> ${etc_hosts_config_filename}

Another issue I'm seeing is, when I manually executed the following command

 [root@gpei-0130-gcpdns-hq6q8-master-0 ~]# /usr/local/bin/update-etc-hosts 34.160.207.88 https://api.gpei-0130-gcpdns.qe.gcp.devcluster.openshift.com:6443 10.0.0.2 https://api-int.gpei-0130-gcpdns.qe.gcp.devcluster.openshift.com:6443
    
    
34.160.207.88    https://api.gpei-0130-gcpdns.qe.gcp.devcluster.openshift.com
10.0.0.2    https://api-int.gpei-0130-gcpdns.qe.gcp.devcluster.openshift.com
Done updating /etc/hosts

It's adding records like
10.0.0.2 https://api-int.gpei-0130-gcpdns.qe.gcp.devcluster.openshift.com to the /etc/hosts file, but such records don't help resolve the hostname, so it looks like we also need to remove the leading "https://".

Something like below should work based on my test, just for your reference.

apiHostname=${apiURL#*//}
apiIntHostname=${apiIntURL#*//}
echo "${apiLBIP}    ${apiHostname%%:*}" >> ${etc_hosts_config_filename}
echo "${apiIntLBIP}    ${apiIntHostname%%:*}" >> ${etc_hosts_config_filename}
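
The two parameter expansions above can be wrapped in a small function, sketched here as a hypothetical `url_to_hostname` (the name and the `example.test` hosts are placeholders). It strips the scheme first and the port second. The order matters: applying `%:*` to a full URL that has no port would cut at the colon in `https://` and leave just `https`. IPv6 literals are not handled.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reduce a URL to the bare hostname that /etc/hosts needs.
url_to_hostname() {
  local url="$1"
  local hostport="${url#*//}"       # drop everything up to and including "//"
  hostport="${hostport%%/*}"        # drop any path after the host
  printf '%s\n' "${hostport%%:*}"   # drop ":port" if present
}

url_to_hostname "https://api.example.test:6443"      # prints api.example.test
url_to_hostname "https://api-int.example.test:6443"  # prints api-int.example.test
```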

Contributor Author

Thanks! I had removed the trailing port number previously. Now updated to also remove the leading https://.

@sadasu sadasu force-pushed the gcp-update-etc-hosts branch 2 times, most recently from f790dfb to 30ea55d Compare January 30, 2025 17:07
@sadasu
Contributor Author

sadasu commented Jan 30, 2025

@gpei thanks for your attention to this work. I ended up moving some simple bash code back to templates/common/gcp/units/gcp-update-etc-hosts.service.yaml. This is my last attempt before I move to using "Environment".

@sadasu sadasu force-pushed the gcp-update-etc-hosts branch from 30ea55d to b02b304 Compare January 30, 2025 19:15
@sadasu
Contributor Author

sadasu commented Jan 31, 2025

/retest-required

@gpei
gpei commented Jan 31, 2025

@sadasu Now a new error was raised in the gcp-update-etc-hosts.service

[core@gpei-0131a-gcpdns-chwdj-master-2 ~]$ journalctl -u gcp-update-etc-hosts.service --no-pager
Jan 31 01:15:37 gpei-0131a-gcpdns-chwdj-master-2 systemd[1]: Starting Update Default GCP /etc/hosts...
Jan 31 01:15:37 gpei-0131a-gcpdns-chwdj-master-2 bash[1224]: mkdir: cannot create directory ‘etc_hosts_config_filename=’: Operation not permitted
Jan 31 01:15:37 gpei-0131a-gcpdns-chwdj-master-2 bash[1224]: mkdir: cannot create directory ‘echo’: Operation not permitted
Jan 31 01:15:37 gpei-0131a-gcpdns-chwdj-master-2 systemd[1]: gcp-update-etc-hosts.service: Main process exited, code=exited, status=1/FAILURE
Jan 31 01:15:37 gpei-0131a-gcpdns-chwdj-master-2 systemd[1]: gcp-update-etc-hosts.service: Failed with result 'exit-code'.
Jan 31 01:15:37 gpei-0131a-gcpdns-chwdj-master-2 systemd[1]: Failed to start Update Default GCP /etc/hosts.

[core@gpei-0131a-gcpdns-chwdj-master-2 ~]$ systemctl cat gcp-update-etc-hosts.service
# /etc/systemd/system/gcp-update-etc-hosts.service
[Unit]
Description=Update Default GCP /etc/hosts
# We don't need to do this on the firstboot
After=firstboot-osupdate.target
# Wait for NetworkManager to report it's online
After=NetworkManager-wait-online.service
# Run before kubelet
Before=kubelet-dependencies.target

[Service]
# Need oneshot to delay kubelet
Type=oneshot
ExecStart=/bin/bash -c " \
 \
apiIntLBIP=10.0.0.2  \
 \
apiLBIP=34.117.235.251 \
 \
apiServerURL=https://api.gpei-0131a-gcpdns.qe.gcp.devcluster.openshift.com:6443 \
apiServerIntURL=https://api-int.gpei-0131a-gcpdns.qe.gcp.devcluster.openshift.com:6443 \
apiServerHostPort=${apiServerURL#*//} \
apiServerIntHostPort=${apiServerIntURL#*//} \
apiServerHostname=${apiServerHostPort%:*} \
apiIntServerHostname=${apiServerIntHostPort%:*} \
mkdir -p /etc/conf.d \
etc_hosts_config_filename="/etc/conf.d/etc-hosts.conf" \
echo "${apiLBIP}    ${apiServerHostname}" >> ${etc_hosts_config_filename} \
echo "${apiIntLBIP}    ${apiIntServerHostname}" >> ${etc_hosts_config_filename} \
/usr/local/bin/update-etc-hosts"

[Install]
RequiredBy=kubelet-dependencies.target

I tried moving the creation of the /etc/conf.d directory into ExecStartPre as follows. That part works, but still nothing was written to "/etc/conf.d/etc-hosts.conf".

[Service]
# Need oneshot to delay kubelet
Type=oneshot
ExecStartPre=/usr/bin/mkdir -p /etc/conf.d
ExecStart=/bin/bash -c " \
...

[root@gpei-0131a-gcpdns-chwdj-master-2 core]# ls -al /etc/conf.d/
total 12
drwxr-xr-x.   2 root root    6 Jan 31 01:15 .
drwxr-xr-x. 100 root root 8192 Jan 31 01:15 ..

Not sure if it's related to write permission on that directory, but even with User=root specified in the service file, the "/etc/conf.d/etc-hosts.conf" file is still not present.

@sadasu
Contributor Author

sadasu commented Jan 31, 2025

I tried moving the creation of the /etc/conf.d directory into ExecStartPre as follows. That part works, but still nothing was written to "/etc/conf.d/etc-hosts.conf".

What if we created the file in ExecStartPre too? For example:
ExecStartPre=/usr/bin/mkdir -p /etc/conf.d; /usr/bin/echo "#Created by OpenShift" >> /etc/conf.d/etc-hosts.conf
It might be worth a try.

I am working on another version, that moves all file operations back to templates/common/gcp/files/usr-local-bin-update-etc-hosts.yaml

Append /etc/hosts files with entries to resolve cluster api and
api-int URLs. /etc/hosts will provide resolution for these URLs
until kubelet joins the cluster and runs its CoreDNS pod, which
will then take over resolution of those 2 URLs.
Added tests to accommodate GCP in-cluster DNS config
@sadasu sadasu force-pushed the gcp-update-etc-hosts branch from 8a70084 to bd13be9 Compare January 31, 2025 22:03
Contributor

openshift-ci bot commented Feb 1, 2025

@sadasu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                                     Commit    Required   Rerun command
ci/prow/okd-scos-e2e-aws-ovn                  bd13be9   false      /test okd-scos-e2e-aws-ovn
ci/prow/bootstrap-unit                        bd13be9   false      /test bootstrap-unit
ci/prow/e2e-gcp-op-single-node                bd13be9   true       /test e2e-gcp-op-single-node
ci/prow/e2e-azure-ovn-upgrade-out-of-change   bd13be9   false      /test e2e-azure-ovn-upgrade-out-of-change

@gpei
gpei commented Feb 1, 2025

@sadasu I think it's working as expected this time 🎉

The IP address mappings for API and API-Int were both added to /etc/hosts, and the masters have joined the cluster.
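
A check like this can also be scripted. The sketch below is a hypothetical `hosts_lookup` helper that reads a hosts-format file and prints the address mapped to a name. The file contents and `example.test` names are placeholders. On a live node, `getent hosts <name>` queries the resolver configuration actually in use.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first address that a hosts-format file
# maps to the given name; prints nothing if the name is absent.
hosts_lookup() {
  local file="$1" name="$2"
  awk -v name="$name" '
    /^[[:space:]]*#/ { next }   # skip comment lines
    { for (i = 2; i <= NF; i++) if ($i == name) { print $1; exit } }
  ' "$file"
}

sample="$(mktemp)"
printf '%s\n' \
  '127.0.0.1   localhost' \
  '192.0.2.10    api.example.test' \
  '10.0.0.2    api-int.example.test' > "$sample"
hosts_lookup "$sample" api-int.example.test   # prints 10.0.0.2
rm -f "$sample"
```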

[root@gpei-0201-gcpdns-mfggf-master-0 ~]# cat /etc/conf.d/etc-hosts.conf
35.241.55.195    api.gpei-0201-gcpdns.qe.gcp.devcluster.openshift.com
10.0.0.2    api-int.gpei-0201-gcpdns.qe.gcp.devcluster.openshift.com
[root@gpei-0201-gcpdns-mfggf-master-0 ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
35.241.55.195    api.gpei-0201-gcpdns.qe.gcp.devcluster.openshift.com
10.0.0.2    api-int.gpei-0201-gcpdns.qe.gcp.devcluster.openshift.com
[root@gpei-0201-gcpdns-mfggf-master-0 ~]# 
[root@gpei-0201-gcpdns-mfggf-master-0 ~]# systemctl status gcp-update-etc-hosts.service
○ gcp-update-etc-hosts.service - Update Default GCP /etc/hosts
     Loaded: loaded (/etc/systemd/system/gcp-update-etc-hosts.service; enabled; preset: disabled)
     Active: inactive (dead) since Sat 2025-02-01 04:59:38 UTC; 41min ago
   Main PID: 1227 (code=exited, status=0/SUCCESS)
        CPU: 6ms

Feb 01 04:59:38 gpei-0201-gcpdns-mfggf-master-0 systemd[1]: Starting Update Default GCP /etc/hosts...
Feb 01 04:59:38 gpei-0201-gcpdns-mfggf-master-0 update-etc-hosts[1227]: Done updating /etc/hosts
Feb 01 04:59:38 gpei-0201-gcpdns-mfggf-master-0 systemd[1]: gcp-update-etc-hosts.service: Deactivated successfully.
Feb 01 04:59:38 gpei-0201-gcpdns-mfggf-master-0 systemd[1]: Finished Update Default GCP /etc/hosts.
[root@gpei-0201-gcpdns-mfggf-master-0 ~]# cat /usr/local/bin/update-etc-hosts
#!/bin/bash


apiIntLBIP=10.0.0.2 

apiLBIP=35.241.55.195

apiServerURL=https://api.gpei-0201-gcpdns.qe.gcp.devcluster.openshift.com:6443
apiServerIntURL=https://api-int.gpei-0201-gcpdns.qe.gcp.devcluster.openshift.com:6443
apiServerHostPort=${apiServerURL#*//}
apiServerIntHostPort=${apiServerIntURL#*//}
apiServerHostname=${apiServerHostPort%:*}
apiIntServerHostname=${apiServerIntHostPort%:*}
mkdir -p /etc/conf.d
etc_hosts_config_filename="/etc/conf.d/etc-hosts.conf"
echo "${apiLBIP}    ${apiServerHostname}" >> ${etc_hosts_config_filename}
echo "${apiIntLBIP}    ${apiIntServerHostname}" >> ${etc_hosts_config_filename}
if [ -f ${etc_hosts_config_filename} ]
then
  cat /etc/conf.d/etc-hosts.conf >> /etc/hosts
  echo "Done updating /etc/hosts"
fi
[root@preserve-gpei-worker k_files]# oc get node
NAME                              STATUS   ROLES                  AGE   VERSION
gpei-0201-gcpdns-mfggf-master-0   Ready    control-plane,master   52m   v1.32.1
gpei-0201-gcpdns-mfggf-master-1   Ready    control-plane,master   51m   v1.32.1
gpei-0201-gcpdns-mfggf-master-2   Ready    control-plane,master   51m   v1.32.1

[root@preserve-gpei-worker k_files]# oc get co
NAME                                       VERSION                                                AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.19.0-0.test-2025-02-01-015036-ci-ln-7f63wsk-latest   False       True          True       49m     APIServicesAvailable: PreconditionNotReady...
baremetal                                  4.19.0-0.test-2025-02-01-015036-ci-ln-7f63wsk-latest   True        False         False      47m     
cloud-controller-manager                   4.19.0-0.test-2025-02-01-015036-ci-ln-7f63wsk-latest   True        False         False      52m     
cloud-credential                           4.19.0-0.test-2025-02-01-015036-ci-ln-7f63wsk-latest   True        False         False      52m     
cluster-api                                4.19.0-0.test-2025-02-01-015036-ci-ln-7f63wsk-latest   True        False         False      48m     
cluster-autoscaler                                                                                True        False         True       47m     machine-api not ready
config-operator                            4.19.0-0.test-2025-02-01-015036-ci-ln-7f63wsk-latest   True        False         False      49m     
console                                                                                                                                        
control-plane-machine-set                  4.19.0-0.test-2025-02-01-015036-ci-ln-7f63wsk-latest   True        False         False      47m     
csi-snapshot-controller                    4.19.0-0.test-2025-02-01-015036-ci-ln-7f63wsk-latest   True        False         False      49m     
dns                                        4.19.0-0.test-2025-02-01-015036-ci-ln-7f63wsk-latest   True        False         False      48m     
etcd                                       4.19.0-0.test-2025-02-01-015036-ci-ln-7f63wsk-latest   True        True          False      47m     NodeInstallerProgressing: 1 node is at revision 0; 1 node is at revision 1; 1 node is at revision 3; 0 nodes have achieved new revision 7
image-registry                                                                                                                                 
ingress                                                                                           False       True          True       47m     The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.)
insights                                   4.19.0-0.test-2025-02-01-015036-ci-ln-7f63wsk-latest   True        False         False      48m     
kube-apiserver                             4.19.0-0.test-2025-02-01-015036-ci-ln-7f63wsk-latest   True        True          True       45m     GuardControllerDegraded: Missing operand on node gpei-0201-gcpdns-mfggf-master-2
kube-controller-manager                    4.19.0-0.test-2025-02-01-015036-ci-ln-7f63wsk-latest   True        True          False      45m     NodeInstallerProgressing: 3 nodes are at revision 4; 0 nodes have achieved new revision 5
kube-scheduler                             4.19.0-0.test-2025-02-01-015036-ci-ln-7f63wsk-latest   True        True          False      45m     NodeInstallerProgressing: 1 node is at revision 0; 2 nodes are at revision 5
kube-storage-version-migrator              4.19.0-0.test-2025-02-01-015036-ci-ln-7f63wsk-latest   True        False         False      49m     
machine-api                                                                                       False       True          True       47m     Operator is initializing
machine-approver                           4.19.0-0.test-2025-02-01-015036-ci-ln-7f63wsk-latest   True        False         False      48m     
machine-config                             4.19.0-0.test-2025-02-01-015036-ci-ln-7f63wsk-latest   True        False         False      47m     
marketplace                                4.19.0-0.test-2025-02-01-015036-ci-ln-7f63wsk-latest   True        False         False      47m     
monitoring                                                                                        False       True          True       32m     UpdatingPrometheusOperator: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: context deadline exceeded: got 2 unavailable replicas
network                                    4.19.0-0.test-2025-02-01-015036-ci-ln-7f63wsk-latest   True        False         False      51m     
node-tuning                                4.19.0-0.test-2025-02-01-015036-ci-ln-7f63wsk-latest   True        False         False      48m     
olm                                        4.19.0-0.test-2025-02-01-015036-ci-ln-7f63wsk-latest   False       True          True       49m     CatalogdDeploymentCatalogdControllerManagerAvailable: Waiting for Deployment...
openshift-apiserver                        4.19.0-0.test-2025-02-01-015036-ci-ln-7f63wsk-latest   False       True          True       49m     APIServicesAvailable: PreconditionNotReady
openshift-controller-manager               4.19.0-0.test-2025-02-01-015036-ci-ln-7f63wsk-latest   True        False         False      47m     
openshift-samples                                                                                                                              
operator-lifecycle-manager                 4.19.0-0.test-2025-02-01-015036-ci-ln-7f63wsk-latest   True        False         False      47m     
operator-lifecycle-manager-catalog         4.19.0-0.test-2025-02-01-015036-ci-ln-7f63wsk-latest   True        False         False      47m     
operator-lifecycle-manager-packageserver   4.19.0-0.test-2025-02-01-015036-ci-ln-7f63wsk-latest   True        False         False      41m     
service-ca                                 4.19.0-0.test-2025-02-01-015036-ci-ln-7f63wsk-latest   True        False         False      49m     
storage                                    4.19.0-0.test-2025-02-01-015036-ci-ln-7f63wsk-latest   True        False         False      48m     
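
One thing to watch in the update-etc-hosts script shown above: it appends with `>>` to both /etc/conf.d/etc-hosts.conf and /etc/hosts. If the unit runs on more than one boot (an assumption, not something verified in this thread), the same entries would accumulate. A minimal sketch of an idempotent append follows, using a hypothetical `append_once` helper and placeholder entries.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append a line to a file only if that exact line is
# not already present.
append_once() {
  local file="$1" line="$2"
  # -q: quiet, -x: match the whole line, -F: fixed string rather than regex
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

hosts="$(mktemp)"
append_once "$hosts" '10.0.0.2    api-int.example.test'
append_once "$hosts" '10.0.0.2    api-int.example.test'   # no-op the second time
cat "$hosts"
rm -f "$hosts"
```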

@gpei
gpei commented Feb 1, 2025

The new problem I can see now is that the workers are trying to fetch the worker ignition from the original MCS address "https://api-int.gpei-0201-gcpdns.qe.gcp.devcluster.openshift.com:22623/config/worker", which can't be resolved. It looks like we'll need to replace the URL with the internal LB IP address?

Here is the serial console output of the worker:

[*     ] A start job is running for Ignition (fetch) (50min 30s / no limit)
[ 3034.007742] ignition[823]: GET https://api-int.gpei-0201-gcpdns.qe.gcp.devcluster.openshift.com:22623/config/worker: attempt #607
[ 3034.044185] ignition[823]: GET error: Get "https://api-int.gpei-0201-gcpdns.qe.gcp.devcluster.openshift.com:22623/config/worker": dial tcp: lookup api-int.gpei-0201-gcpdns.qe.gcp.devcluster.openshift.com on 169.254.169.254:53: no such host

I believe we can create another bug to track this issue.
(Update: created https://issues.redhat.com/browse/OCPBUGS-49737 to track the worker ignition fetching separately.)

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

5 participants