tests: add ostree.sync test #3998

Merged
merged 1 commit into from
Jan 23, 2025

Conversation

HuijingHei
Member

@HuijingHei HuijingHei commented Dec 26, 2024

Add a test for https://issues.redhat.com/browse/OCPBUGS-15917 to verify that ostree works when the network volume is disconnected.

As we do not have Ceph for testing, we follow the suggestion from Colin and Joseph: use something like NFS; in theory we should see the same error if we disconnect the NFS volume and cannot sync the filesystem.

Xref to steps in comment.

openshift-ci bot commented Dec 26, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@HuijingHei
Member Author

HuijingHei commented Dec 26, 2024

There are still 2 issues:

  • It does not work on qemu: the NFS share cannot be mounted on the second machine through a host-forwarded port, as is done in tests: add kdump over NFS #3922. (This is not a must; we can skip the qemu platform if there is no better way to resolve it.)

  • On gcp, when rebooting, unmounting NFS sometimes hits "A stop job is running for /var/tmp/data6 (1min 30s / no limit)", which makes the running time longer than 6.5 mins (PASS: ostree.sync (393.69s)). Note: I tried adding a custom DefaultTimeoutStopSec=2s to /etc/systemd/system.conf, but it does not seem to work.

More debug info on qemu:
Add c.MustSSH(nfs_client, "false") and run with --ssh-on-test-failure

./bin/kola run --qemu-image ../data/rhcos-419.96.202412161243-0-qemu.x86_64.qcow2 ostree.sync --debug --ssh-on-test-failure

[root@qemu1 core]# mount -v -t nfs4 10.0.2.2:/var/nfs/share1 /var/tmp/data1
mount.nfs4: timeout set for Fri Dec 27 07:10:38 2024
mount.nfs4: trying text-based options 'vers=4.2,addr=10.0.2.2,clientaddr=10.0.2.15'
mount.nfs4: mount(2): No such file or directory
mount.nfs4: mounting 10.0.2.2:/var/nfs/share1 failed, reason given by server: No such file or directory

The "No such file or directory" error on qemu looks weird; there is no such issue when running on gcp.

@HuijingHei HuijingHei force-pushed the ostree-sync branch 5 times, most recently from 6d79dbc to 6f63a4a Compare December 27, 2024 01:49
@HuijingHei HuijingHei force-pushed the ostree-sync branch 11 times, most recently from 778121e to 6f821b7 Compare January 7, 2025 11:18
@HuijingHei
Member Author

HuijingHei commented Jan 7, 2025

Apart from the issues in #3998 (comment) above, the test fails with the unfixed ostree-2023.1-6.el9_2 in 414.92.202308081838-0 (which is expected).

Ready to review now, thanks!

@HuijingHei HuijingHei marked this pull request as ready for review January 7, 2025 11:54
@mike-nguyen
Member

mike-nguyen commented Jan 14, 2025

We may need to forward port 111 too for NFS but it's giving me an error when I try it. I'll have to dig in some more. The other test is using an NFS container. I wonder if it is configured differently than using NFS directly on the host.

@HuijingHei
Member Author

> We may need to forward port 111 too for NFS but it's giving me an error when I try it. I'll have to dig in some more. The other test is using an NFS container. I wonder if it is configured differently than using NFS directly on the host.

I tried JB's #3922; it ran locally and passed. Maybe I should try with an NFS container.

@HuijingHei
Member Author

> > We may need to forward port 111 too for NFS but it's giving me an error when I try it. I'll have to dig in some more. The other test is using an NFS container. I wonder if it is configured differently than using NFS directly on the host.
>
> I tried JB's #3922; it ran locally and passed. Maybe I should try with an NFS container.

I tried an NFS container: I made some changes to the Containerfile and pushed the image to quay.io/hhei/nfs, but the mount failed.

Also tried mounting manually in the kdump case:

  • mounting the actual share dir 10.0.2.2:/export failed
[root@qemu1 core]# mount -v 10.0.2.2:/export /mnt
mount.nfs: timeout set for Tue Jan 14 07:27:34 2025
mount.nfs: trying text-based options 'vers=4.2,addr=10.0.2.2,clientaddr=10.0.2.15'
mount.nfs: mount(2): No such file or directory
mount.nfs: trying text-based options 'addr=10.0.2.2'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: portmap query retrying: RPC: Unable to receive
mount.nfs: prog 100003, trying vers=3, prot=17
...
  • mounting 10.0.2.2:/ succeeded
[root@qemu1 core]# mount -t nfs -vvv 10.0.2.2:/ /mnt
mount.nfs: timeout set for Tue Jan 14 06:35:02 2025
mount.nfs: trying text-based options 'vers=4.2,addr=10.0.2.2,clientaddr=10.0.2.15'
[root@qemu1 core]# df -Th
10.0.2.2:/     nfs4       16G  2.7G   13G  17% /var/mnt

Not sure if there is any workaround for this.

@jbtrystram
Contributor

> Also tried manually mount in kdump case:
>
> * mount actual share dir `10.0.2.2:/export` failed
>
> [...]
>
> * mount 10.0.2.2:/ successfully

Yeah I noticed the same. I don't understand why this happens. In the kdump test that does not matter too much but in your case it's important, we should get to the bottom of this :)

@HuijingHei
Member Author

> We may need to forward port 111 too for NFS but it's giving me an error when I try it.

I added a host forward for port 111, like {Service: "portmapper", HostPort: 111, GuestPort: 111}, but it failed with sync.go:114: listen tcp :111: bind: permission denied
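
The "bind: permission denied" error is consistent with port 111 being a privileged port: on Linux, an unprivileged process by default cannot bind ports below 1024, and the host-side listener for the forward runs as the user who started kola. A minimal sketch of that check follows; the helper name needsPrivilege and the constant are assumptions made for illustration and are not part of kola.

```go
package main

import "fmt"

// defaultUnprivilegedPortStart is the usual Linux default. The effective
// value is the net.ipv4.ip_unprivileged_port_start sysctl and may differ
// on a given host, so treat this constant as an assumption.
const defaultUnprivilegedPortStart = 1024

// needsPrivilege reports whether binding hostPort would normally require
// root or CAP_NET_BIND_SERVICE. Hypothetical helper, not a kola API.
func needsPrivilege(hostPort int) bool {
	return hostPort > 0 && hostPort < defaultUnprivilegedPortStart
}

func main() {
	// 111 is the portmapper port tried above; 2049 is the NFS port.
	for _, port := range []int{111, 2049} {
		fmt.Printf("port %d needs privilege: %v\n", port, needsPrivilege(port))
	}
}
```

Under these assumptions, forwarding host port 2049 works for an unprivileged user while host port 111 does not, unless the host side of the forward uses a port at or above the threshold.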

@mike-nguyen
Member

mike-nguyen commented Jan 14, 2025

Looks like nfs3 requires port 111. nfs4 only requires port 2049, but it has the concept of a global root directory:

Note with remote NFS paths

    They don't work the way they did in NFSv3. NFSv4 has a global root directory and all exported directories are children to it. So what would have been nfs-server:/export/users on NFSv3 is nfs-server:/users on NFSv4, because /export is the root directory. 

While I was unable to do

[root@qemu1 ~]# mount -t nfs4 10.0.2.2:/var/nfs/shared1 /var/tmp/data1
mount.nfs4: mounting 10.0.2.2:/var/nfs/shared1 failed, reason given by server: No such file or directory

I was able to do

[root@qemu1 ~]# mount -t nfs4 10.0.2.2:/var/nfs/ /var/tmp/data1
[root@qemu1 ~]# ls /var/tmp/data1
share1  share2  share3  share4  share5  share6
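
The path translation described in the quoted note can be sketched as a small Go helper. This is only an illustration of the note's example: the function name nfs4Path and the idea of passing the server's root directory explicitly are assumptions, and nothing here queries a real NFS server.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// nfs4Path is a hypothetical helper. Given the directory the server treats
// as its NFSv4 root (e.g. "/export") and a path written the NFSv3 way, it
// returns the path relative to that root, which is what an NFSv4 client
// would mount. ok is false when v3Path lies outside the root.
func nfs4Path(root, v3Path string) (p string, ok bool) {
	root = path.Clean(root)
	full := path.Clean(v3Path)
	switch {
	case root == "/":
		return full, true // the root is the real filesystem root
	case full == root:
		return "/", true // the root itself
	case strings.HasPrefix(full, root+"/"):
		return strings.TrimPrefix(full, root), true
	default:
		return "", false
	}
}

func main() {
	// The example from the quoted note: /export/users becomes /users.
	p, ok := nfs4Path("/export", "/export/users")
	fmt.Println(p, ok)
}
```

The prefix check uses root+"/" so that a sibling directory such as /exported is not mistaken for a child of /export.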

@HuijingHei
Member Author

HuijingHei commented Jan 15, 2025

Thanks @mike-nguyen and @jbtrystram for the pointers! Though the CI failed, I ran the test manually 5 times and all runs passed.

$ bin/kola run -p qemu -b rhcos --qemu-image rhcos-419.96.202501100402-0-qemu.x86_64.qcow2 ostree.sync --debug --ssh-on-test-failure --multiply 5
=== RUN   ostree.sync4
=== RUN   ostree.sync0
=== RUN   ostree.sync1
=== RUN   ostree.sync2
=== RUN   ostree.sync3

--- PASS: ostree.sync4 (240.30s)
        sync.go:197: Got NFS mount.
        cluster.go:151: Running as unit: run-ra14dc52e6a1a4caa8e8c75ed7f9b0e8c.service
        sync.go:224: Set link down and rebooting.
        cluster.go:151: Running as unit: run-r531eacc6778741908ea53aa04c079ac9.service
        sync.go:243: Found test=1 in kernel argument after rebooted.

--- PASS: ostree.sync2 (244.50s)
--- PASS: ostree.sync3 (240.28s)
--- PASS: ostree.sync1 (240.65s)
--- PASS: ostree.sync0 (245.75s)

@HuijingHei HuijingHei force-pushed the ostree-sync branch 2 times, most recently from ecaa8ca to a8ab2af Compare January 15, 2025 14:31
@mike-nguyen
Member

/retest-required

@mike-nguyen
Member

Something strange is going on with CI. I created #4000 to see why we're getting a 403 to the f41-coreos-continous repos. If I add a curl before the yum -y distro-sync in build.sh it worked.

@HuijingHei
Member Author

> Something strange is going on with CI. I created #4000 to see why we're getting a 403 to the f41-coreos-continous repos. If I add a curl before the yum -y distro-sync in build.sh it worked.

Thank you for the debugging. It seems the 403 issue is fixed, but CI failed for another reason; I will look at it.

@HuijingHei HuijingHei force-pushed the ostree-sync branch 2 times, most recently from bd33db4 to 8c21f51 Compare January 17, 2025 08:45
@HuijingHei
Member Author

> > Something strange is going on with CI. I created #4000 to see why we're getting a 403 to the f41-coreos-continous repos. If I add a curl before the yum -y distro-sync in build.sh it worked.
>
> Thank you for the debugging. It seems the 403 issue is fixed, but CI failed for another reason; I will look at it.

CI is fixed by #4002. I rebased the patch and reran the tests; ready for review, thanks!

@HuijingHei
Member Author

Hold on, I found that this also passes on rhcos-414.92.202308081838-0 with the unfixed ostree.

@HuijingHei HuijingHei force-pushed the ostree-sync branch 2 times, most recently from f16dc32 to 21823c2 Compare January 20, 2025 09:18
@HuijingHei
Member Author

@jbtrystram I put nfs-random-write.sh back into a goroutine; the test now fails with the unfixed ostree (which is expected) and passes with the fixed ostree.

[coreos-assembler]$ bin/kola run -p qemu -b rhcos --qemu-image /srv/data/rhcos-414.92.202308081838-0-qemu.x86_64.qcow2 ostree.sync
=== RUN   ostree.sync
2025-01-20T09:29:31Z kola: Test timed out. Adding as candidate for rerun success: ostree.sync
--- FAIL: ostree.sync (604.65s)
        sync.go:202: Got NFS mount.
        sync.go:230: Set link down and rebooting.
        cluster.go:151: Running as unit: run-raf180caa0d2341b988e2f1f8df6e84e9.service
        harness.go:106: TIMEOUT[10m0s]: ssh: sudo sh /usr/local/bin/nfs-random-write.sh
        harness.go:106: TIMEOUT[10m0s]: ssh: cat /proc/cmdline
FAIL, output in _kola_temp/qemu-2025-01-20-0919-12374
Error: harness: test suite failed
2025-01-20T09:29:31Z cli: harness: test suite failed

[coreos-assembler]$ bin/kola run -p qemu -b rhcos --qemu-image /srv/data/rhcos-419.96.202501100402-0-qemu.x86_64.qcow2 ostree.sync
=== RUN   ostree.sync
--- PASS: ostree.sync (227.92s)
        sync.go:202: Got NFS mount.
        sync.go:230: Set link down and rebooting.
        cluster.go:151: Running as unit: run-rb5032bf91dee40b780f01ed9bd635d9b.service
        sync.go:249: Found test=1 in kernel argument after rebooted.
PASS, output in _kola_temp/qemu-2025-01-20-0930-12504
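
The final step in both runs reads /proc/cmdline and looks for test=1 after the reboot. A minimal sketch of such a check is shown here; the helper name hasKarg and the sample command line are assumptions made for illustration, and the actual code in sync.go may do this differently.

```go
package main

import (
	"fmt"
	"strings"
)

// hasKarg reports whether cmdline contains karg as a whole
// whitespace-separated argument, so "test=1" does not match "test=10".
// Hypothetical helper, not taken from sync.go.
func hasKarg(cmdline, karg string) bool {
	for _, arg := range strings.Fields(cmdline) {
		if arg == karg {
			return true
		}
	}
	return false
}

func main() {
	// Sample command line invented for illustration; real input would be
	// the contents of /proc/cmdline read after the reboot.
	cmdline := "BOOT_IMAGE=/vmlinuz root=/dev/sda4 rw test=1\n"
	fmt.Println(hasKarg(cmdline, "test=1"))
}
```

Comparing whole fields rather than using a substring search avoids a false positive when another argument merely starts with the same text.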

Add test for https://issues.redhat.com/browse/OCPBUGS-15917, to
verify ostree can sync the filesystem with the disconnected
network volume(NFS).

As we do not have ceph for testing, according to the suggestion
from Colin and Joseph: `use something like NFS, we should in
theory see the same error if we disconnected the NFS volume and
we could not sync the filesystem.`

Coworked with JB.
@HuijingHei
Member Author

@jmarrero could you help to review this PR when you are available? Thanks!

Member

@jmarrero jmarrero left a comment

Thanks for the extensive test @HuijingHei. It looks to me like it is doing what we need.
/lgtm

@mike-nguyen
Member

/lgtm

@HuijingHei
Member Author

Thank you all for the help, will merge this if no objection!

@HuijingHei HuijingHei merged commit 7dd4b4d into coreos:main Jan 23, 2025
5 checks passed
@HuijingHei HuijingHei deleted the ostree-sync branch January 23, 2025 02:19