Add v1 Deployment & Ops Skills Taxonomy #19400
f5ae0bd to 5d71071
Fixes DOC-12354
5d71071 to 9cab2b7
Minor comments mainly. I wonder if we should run this by someone from the PS team?
Thanks @mwang1026! Updated in the latest commit based on your feedback. Happy to have someone from the PS team take a look; who do you think we should tag?
Hi @BramGruneir! @mwang1026 suggested getting someone from the PS team to look at this docs PR. Are you the right person to ask for help finding a reviewer?
Context for this docs PR: at the January 2025 docs on-site, one of the things we discussed was a project called "Making CockroachDB Ops & Admin More Self-serve". Associated with that project was a list of tasks (aka a "skills taxonomy") that users need help learning how to do for themselves. This docs PR is an attempt to gather links to those tasks/skills in one place so that users can quickly find out how to do each of them.
LGTM. I think this is a really good thing to have in our docs.
TODO based on offline comment from another team member: we should add a link on how to get debug/tsdump
Left comments, mostly small nits/suggestions
Cockroach Labs offers [Professional Services](https://www.cockroachlabs.com/company/professional-services/) that can assist you with getting applications into production faster and more efficiently.
{{site.data.alerts.end}}

## Configuration
I think a one-liner description of each skill would be helpful under each header to introduce the following list, like:
"The configuration skill involves managing your CockroachDB monitoring and making informed configuration changes based on trends and alerts" or similar.
Similarly, "configuration" in this context seems very specifically geared towards configuration of the underlying deployment infrastructure rather than configuration of the cluster itself. I think the term "configuration" itself is ambiguous, and should maybe be "Infrastructure configuration" or similar?
- [Rolling upgrades]({% link {{ page.version.version }}/upgrade-cockroach-version.md %}#perform-a-patch-upgrade)
- Downgrade a cluster from a [patch version]({% link {{ page.version.version }}/upgrade-cockroach-version.md %}#roll-back-a-patch-upgrade)
- Downgrade a cluster from a [major version]({% link {{ page.version.version }}/upgrade-cockroach-version.md %}#roll-back-a-major-version-upgrade)
- [Change a cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}#change-a-cluster-setting)
See above. This one in particular feels like it belongs under the "configuration" skill, unless you specify that "configuration" is for infrastructure.
- [Shut down a node gracefully]({% link {{ page.version.version }}/node-shutdown.md %})
- [Handling unplanned node outages]({% link {{ page.version.version }}/recommended-production-settings.md %}#load-balancing)
- [Adding nodes]({% link {{ page.version.version }}/cockroach-start.md %}#add-a-node-to-a-cluster)
Lists contain a mix of gerund and imperative verb forms. Suggest "Add nodes"/"Remove nodes" rather than "Adding nodes"/"Removing nodes", etc.
- Cluster repaving involves the following individual skills, which are also used during [rolling upgrades]({% link {{ page.version.version }}/upgrade-cockroach-version.md %}#perform-a-patch-upgrade):
1. [Shut down a node gracefully]({% link {{ page.version.version }}/node-shutdown.md %})
1. Detach the [persistent volume]({% link {{ page.version.version }}/kubernetes-overview.md %}#kubernetes-terminology) (a.k.a. persistent disk) from the removed node's virtual machine (VM) (this step is optional but recommended)
1. Delete the removed node's VM
1. Start a new VM
1. Reattach the persistent disk to the new VM (necessary if you did step #2)
1. [Add a node to the cluster]({% link {{ page.version.version }}/cockroach-start.md %}#add-a-node-to-a-cluster) from the new VM
The sudden switch to an instructional list feels awkward. Is there a better place to link to that describes cluster repaving in more depth?
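For anyone checking the ordering in the repaving list quoted above, here is a minimal sketch of the one dependency it contains: the reattach step only applies when the optional detach step was performed. The function name `repave_steps` and the `detach_disk` flag are hypothetical, introduced only for illustration; they are not part of CockroachDB or of this PR, and the step strings paraphrase the list rather than quote it.

```python
from typing import List


def repave_steps(detach_disk: bool = True) -> List[str]:
    """Return the ordered repave steps for a single node.

    detach_disk mirrors the optional-but-recommended detach step (hypothetical
    flag). When it is False, both the detach step and the matching reattach
    step are skipped, since there is no disk to reattach.
    """
    steps = ["Shut down the node gracefully"]
    if detach_disk:
        steps.append("Detach the persistent disk from the node's VM")
    steps.append("Delete the node's VM")
    steps.append("Start a new VM")
    if detach_disk:
        steps.append("Reattach the persistent disk to the new VM")
    steps.append("Add a node to the cluster from the new VM")
    return steps


if __name__ == "__main__":
    # Print the full sequence, numbered the same way the list renders.
    for number, step in enumerate(repave_steps(), start=1):
        print(f"{number}. {step}")
```

Calling `repave_steps(detach_disk=False)` drops the detach and reattach steps together, which is the pairing the parenthetical "necessary if you did step #2" describes.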
- [Cluster instability: Dead/suspect nodes]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#node-liveness-issues)
- [Out of memory problems]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#out-of-memory-oom-crash)
- [Imbalanced cluster load]({% link {{ page.version.version }}/architecture/replication-layer.md %}#load-based-replica-rebalancing)
- [EOF errors]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#client-connection-issues)
Should spell this out, "End of file (EOF) errors"
1. Reattach the persistent disk to the new VM (necessary if you did step #2)
1. [Add a node to the cluster]({% link {{ page.version.version }}/cockroach-start.md %}#add-a-node-to-a-cluster) from the new VM

## Troubleshooting
Since this list describes problems rather than tasks, an introductory line for this section like I proposed above is definitely necessary to clarify.
- [Imbalanced cluster load]({% link {{ page.version.version }}/architecture/replication-layer.md %}#load-based-replica-rebalancing)
- [EOF errors]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#client-connection-issues)
- [Changefeed is falling behind]({% link {{ page.version.version }}/advanced-changefeed-configuration.md %}#lagging-ranges)
- [Get a "debug zip" file]({% link {{ page.version.version }}/cockroach-debug-zip.md %})
Is the term "debug zip" common jargon among our customers? I think we should spell it out more clearly in this context, like "Download an archive for debugging".
Unless support is regularly asking customers to download a "debug zip" so we know that's the terminology they're looking for.
- [EOF errors]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#client-connection-issues)
- [Changefeed is falling behind]({% link {{ page.version.version }}/advanced-changefeed-configuration.md %}#lagging-ranges)
- [Get a "debug zip" file]({% link {{ page.version.version }}/cockroach-debug-zip.md %})
- [Get a "tsdump" (timeseries dump) file]({% link {{ page.version.version }}/cockroach-debug-tsdump.md %})
Similar to above, how about "Collect timestamped diagnostic logs" or similar? This one I'm less inclined to describe because that's a mess.
- [Create S3 bucket for backup data]({% link {{ page.version.version }}/use-cloud-storage.md %}#amazon-s3-storage-classes)
- [Full cluster backup to S3]({% link {{ page.version.version }}/take-full-and-incremental-backups.md %}#full-backups)
- [Incremental backup to S3]({% link {{ page.version.version }}/take-full-and-incremental-backups.md %}#incremental-backups)
- [Cluster restore from AWS S3]({% link {{ page.version.version }}/restore.md %}#restore-a-cluster)
Do we have non-S3 storage topics for this?
- [Production Checklist]({% link {{ page.version.version }}/recommended-production-settings.md %})
- [Deploy CockroachDB Manually]({% link {{ page.version.version }}/manual-deployment.md %})
- [Deploy a Local Cluster from Binary (Secure)]({% link {{ page.version.version }}/secure-a-cluster.md %})
- [SQL Performance Best Practices]({% link {{ page.version.version }}/performance-best-practices-overview.md %})
- [Performance Tuning Recipes]({% link {{ page.version.version }}/performance-recipes.md %})
- [Troubleshoot Self-Hosted Setup]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %})
Should be sentence case
Fixes DOC-12354
Rendered preview: Deployment & ops skills taxonomy