finetune.cpp command-line arg #13873
Conversation
Perhaps no need to review until I have an actual SGD impl in a follow-on, @JohannesGaessler - but a few general questions about contributing:
You should keep that change, as it takes time to get more feedback/approval.
Any changes made to the ggml source in this repository will eventually be synced to the ggml repository and vice versa; it is completely fine. I think the issue of a git submodule was previously brought up and rejected.
My opinion is that people serious about training should be writing a program rather than using a command-line tool. Still, I think it's good to make things such as the learning rate configurable in the provided example program.
I don't remember whether those args were put in by me when I copy-pasted code or by Georgi when he later refactored it, but I myself definitely did not make an intentional choice to use these exact arguments.
I don't know, sorry.
None of the previous perplexity-specific arguments are needed.
For adding an SGD optimizer, add a new ggml op like …
Yes, will do. Should the actual SGD impl be a subsequent pull req (or several, e.g. starting first with just a CPU impl) or do you want it all in one pull req?
Either way would be fine with me, as long as there are no broken or unfinished features on master at any point.
Force-pushed from e752031 to e689af8
Looking forward to the next PR(s).
You should see frivolous clang-format changes (using the project's .clang-format) only on lines changed in the PR (using git-clang-format). If there's something undesirable, we could figure out what in the format config does it.
Don't autoformat code en masse unless it's done in a dedicated PR; it makes it unnecessarily difficult to track what was actually changed in a PR.
Sorry, I didn't read the … part.
Force-pushed from 7534bbf to 48a16bf
Hi @WilliamTambellini @JohannesGaessler, I think this is usable now; inviting code nitpicks etc. :)
The second (actually usable SGD) commit is 48a16bf (also shown above).
This mixes up two different projects: the CLI change/renaming and SGD. It needs to be split into two PRs.
@slaren ?
I think we should try to keep the number of divergent branches low if at all possible. How about this: spin out the changes from this PR to the command-line interface and the …
Should be out of scope. Seems like the WebGPU backend is missing a GGML_OP_SUM implementation.
Fine to either skip WebGPU or disable the tests again for this PR.
Pushed an attempt to just skip the WebGPU device but otherwise leave test-opt enabled. If that doesn't work, I will just comment out the CMake for test-opt as before.
Thanks, very nice.
@graehl when will this be merged to master?
I'm not aware of anything I can do on my end to get this merged (is someone waiting on me that I'm unaware of?). I just marked a 'conflict' above as resolved, but I think ultimately the permission is with the llama.cpp maintainers.
@JohannesGaessler this is completed and ready for final review/merge.
@graehl you may rebase onto master so JohannesGaessler can simply merge.
Don't think there's anything I can currently do (please be specific if I'm mistaken; I'm new).
From my side, I'm basically waiting for you to look at graehl#1 and merge it if it's fine. I see it has a merge conflict again; I'll fix it.
Rebase YOUR branch onto master (then force-push to your branch), look at 0cc4m's changes, and cherry-pick them (or rebase his changes onto yours).
Thanks for spelling this out, that was easy. I didn't squash, so we can keep occam's contribution separate, but it's all rebased and you should see it here.
As I said, please use the human-readable parameters, and only the human-readable parameters, as the ones being passed to …
Add unit-tested GGML_OPT_OPTIMIZER_SGD to ggml - avoids allocating the m, v tensors. Support the finetune.cpp arg -opt SGD (or sgd); the default is adamw as before.

llama 3.2-1b-F32 result: observed 11 GB GPU RAM (41 sec/epoch) when using SGD instead of 19 GB (55 sec/epoch) using adamw (wikipedia 100-line finetune). Using the same GPU memory, adamw can only do 512 batch/context before OOM, reaching:

train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00
val:   [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00

SGD is superior, though it converges more slowly, with a max of 1728 batch/context before OOM (esp. see the better validation perf):

train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00
val:   [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00

Note: when finetuning long enough (or with a high enough -lr), validation accuracy *eventually* drops ('catastrophic forgetting').

The -lr-half (half-life) option is useful for SGD to avoid oscillation or super-slow underdamped learning (it makes setting -lr more forgiving). The terminal -lr is for now set by -lr-halvings, i.e. if you want at most 1/8 the initial -lr you set -lr-halvings 3.

Note: objective loss may not be directly comparable between adamw and sgd - check perplexity or accuracy, or consider relative improvements for convergence.

New finetune args: -wd 1e-9 to enable weight decay in sgd or adamw, and max -epochs N (default 2 as before).

Caching (1 - wd*alpha) in the 'adamw' opt struct showed no noticeable perf benefit and is disabled (it is still done for the new SGD, though).

Since opt memory is pre-allocated, ggml_opt_get_optimizer_params would probably be able to switch between SGD and AdamW with each epoch, but would need to use adamw for the first (unconfirmed - no cmdline arg to set such a policy yet).

test-opt checks adamw as before and now sgd (except for a few tests disabled for sgd only; probably just needs logging the values and adding alternate reference values); tolerance on the 'regression' test is broader for sgd (so we don't need many more epochs).
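As a sketch of how -lr-half and -lr-halvings could interact as described above (this is an assumption about the schedule's shape, not the actual finetune.cpp code; the function name and the epoch-based half-life are made up for illustration):

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical sketch only - not the real finetune.cpp schedule.
// Assumes the half-life (`lr_half`) is measured in epochs; the learning rate
// decays exponentially and is floored at lr0 / 2^lr_halvings, matching the
// "at most 1/8 the initial -lr with -lr-halvings 3" example above.
static float scheduled_lr(float lr0, float lr_half, int lr_halvings, float epoch) {
    const float decayed  = lr0 * std::exp2(-epoch / lr_half);
    const float lr_floor = lr0 / std::exp2((float) lr_halvings);
    return std::max(decayed, lr_floor);
}
```

Under those assumptions, with lr0 = 1e-3, lr_half = 1 and lr_halvings = 3, the rate halves each epoch until it reaches 1.25e-4 and then stays there.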
This is fine and done now, but I cannot be confident the Vulkan end of things is correct after the change (I just haven't read up on how the Vulkan API works at all).
You can change what you want; once things are ready I'll do a proper review of the Vulkan parts and make sure they are okay.
From my end I would consider this PR now essentially good to merge. So unless there is something else that is left to do, I will make some cosmetic changes and rely on @0cc4m to fix Vulkan if necessary. After that I will approve and merge.
I didn't change anything at all in Vulkan - it's all Greek to me :) Do take a look. Perhaps the tests weren't really running on Vulkan (I had disabled them since I didn't have an impl). The change is that the op params tensor [1] is now sgd.wd instead of 1 - sgd.wd*sgd.alpha ([0] is just sgd.alpha).
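For reference, a minimal sketch (not the real ggml CPU kernel; the helper name is invented) of how a backend would consume that layout after the change, with the keep factor derived from alpha and wd inside the kernel rather than precomputed on the host:

```cpp
#include <cstdint>

// Illustrative only - not the actual ggml implementation.
// params[0] = alpha (learning rate), params[1] = wd (weight decay),
// per the op-params layout described above.
static void sgd_step_sketch(float * x, const float * grad, int64_t n, const float * params) {
    const float alpha = params[0];
    const float keep  = 1.0f - alpha * params[1];
    for (int64_t i = 0; i < n; ++i) {
        x[i] = x[i] * keep - alpha * grad[i];
    }
}
```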
Yeah, no worries. Here's a diff that does that change on the Vulkan shader, and removes two unnecessary preprocessor steps:

```diff
diff --git a/ggml/src/ggml-vulkan/vulkan-shaders/opt_step_sgd.comp b/ggml/src/ggml-vulkan/vulkan-shaders/opt_step_sgd.comp
index 3d5e1d98f..6426dedee 100644
--- a/ggml/src/ggml-vulkan/vulkan-shaders/opt_step_sgd.comp
+++ b/ggml/src/ggml-vulkan/vulkan-shaders/opt_step_sgd.comp
@@ -1,9 +1,6 @@
 #version 450
 
 #include "generic_head.comp"
-#include "types.comp"
-
-#extension GL_EXT_control_flow_attributes : enable
 
 layout(local_size_x = 512, local_size_y = 1, local_size_z = 1) in;
 
@@ -19,7 +16,7 @@ void main() {
     }
 
     const float alpha = data_params[0];
-    const float keep = data_params[1];
+    const float keep = 1.f - alpha * data_params[1];
 
     data_x[i] = data_x[i] * keep - alpha * data_grad[i];
 }
```

If you apply that the CI should pass again.
Add to ggml-opt a learning rate (adamw alpha) cmdline arg, and an optimizer enum defaulting to adamw, preparatory to work to support SGD.

These are added to the common args as a set of optimizer options active only for the new FINETUNE example (which includes all the previous finetune.cpp PERPLEXITY options as a precaution).

Perhaps breaking with precedent, the ggml_opt_optimizer_params struct is included directly as args - if desired, we can instead just add learning rate and optimizer type to a struct independent of ggml-opt.h, as proposed in #13835.
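As a rough illustration of the shape of this change (a sketch only; apart from GGML_OPT_OPTIMIZER_SGD, which is named elsewhere in this thread, the identifiers below are placeholders rather than the PR's actual declarations):

```cpp
// Hypothetical sketch, not the actual ggml-opt.h declarations.
enum ggml_opt_optimizer_type {
    GGML_OPT_OPTIMIZER_ADAMW, // default, matching previous behaviour
    GGML_OPT_OPTIMIZER_SGD,   // added by the follow-on SGD work
};

struct ggml_opt_optimizer_params_sketch {
    enum ggml_opt_optimizer_type optimizer; // selected via the new -opt arg
    float                        alpha;     // learning rate (adamw alpha), set via -lr
    float                        wd;        // weight decay, set via -wd
};
```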