Different generation in llama.cpp and llama-cpp-python #619

# Expected Behavior

I train a model with the [transformers](https://github.com/huggingface/transformers) library, then convert it to llama.cpp format using `convert.py` from [llama.cpp](https://github.com/ggerganov/llama.cpp). As a sanity check, I then compare the text generated by transformers, llama.cpp, and llama-cpp-python. I use f32 models in both llama.cpp and llama-cpp-python. I configure all three to decode greedily, picking the top-1 token at every step, by setting `top-k=1` and `repeat_penalty=1.0`.
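To make the intended setup concrete, here is a minimal pure-Python sketch of why `top-k=1` together with `repeat_penalty=1.0` should amount to plain greedy (argmax) decoding. It does not call any of the three libraries. The function names, the toy logits, and the CTRL-style form of the repetition penalty are assumptions made for illustration only.

```python
import math
import random


def apply_repeat_penalty(logits, seen_token_ids, penalty):
    """Assumed CTRL-style penalty: positive logits are divided, others multiplied."""
    out = list(logits)
    for t in set(seen_token_ids):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out


def sample_top_k(logits, k, rng):
    """Keep the k highest logits, softmax over them, and sample one token id."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


# Hypothetical logits over a 4-token vocabulary and a hypothetical history.
logits = [0.1, 2.5, -1.0, 2.4]
history = [1, 3]

# With penalty == 1.0 the logits are left unchanged.
penalized = apply_repeat_penalty(logits, history, penalty=1.0)

# With k == 1 only one candidate survives, so the random seed cannot matter.
picks = {sample_top_k(penalized, k=1, rng=random.Random(seed)) for seed in range(20)}
argmax_id = max(range(len(logits)), key=lambda i: logits[i])

print(penalized == logits, picks, argmax_id)
```

Under these assumptions every seed selects the same token as a direct argmax, which is the behaviour I expect from all three tools.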

I found that transformers and llama-cpp-python produce identical results, while the output of the llama.cpp binaries differs. Are there generation parameters whose default values differ between llama.cpp and llama-cpp-python? If not, what could cause this discrepancy?
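One way to narrow this down is to compare the generated token ids rather than the decoded text and find the first step at which two runs diverge. The helper below is a small sketch of that comparison. The function name and the token id lists are hypothetical and are not real outputs from my runs.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel so that a shorter sequence counts as a difference


def first_divergence(a, b):
    """Return the index of the first position where two sequences differ, or None."""
    for i, (x, y) in enumerate(zip_longest(a, b, fillvalue=_MISSING)):
        if x is _MISSING or y is _MISSING or x != y:
            return i
    return None


# Hypothetical token ids standing in for the outputs of two backends.
reference_ids = [1, 15043, 3186, 29991]
candidate_ids = [1, 15043, 3186, 29889]

print(first_divergence(reference_ids, candidate_ids))
```

Knowing the first differing step makes it possible to inspect the logits at exactly that position in each backend instead of comparing whole generations.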

# Environment and Context


* Physical (or virtual) hardware you are using, e.g. for Linux:

```
$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    1
Core(s) per socket:    64
Socket(s):             2
NUMA node(s):          2
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 49
Model name:            AMD EPYC 7662 64-Core Processor
Stepping:              0
CPU MHz:               2100.865
CPU max MHz:           2000,0000
CPU min MHz:           1500,0000
BogoMIPS:              3999.91
Virtualization:        AMD-V
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
L3 cache:              16384K
NUMA node0 CPU(s):     0-63
NUMA node1 CPU(s):     64-127
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca
$ nvidia-smi -L
GPU 0: A100-PCIE-40GB (UUID: GPU-1d02b89e-9be9-ece5-472c-8ec1790ffdbc)
```

* Operating System, e.g. for Linux:

```
$ uname -a
Linux hostname 5.4.164-1.el7.elrepo.x86_64 #1 SMP Mon Dec 6 12:28:33 EST 2021 x86_64 x86_64 x86_64 GNU/Linux
```

* SDK version, e.g. for Linux:

```
$ python3 --version
Python 3.10.12
$ make --version
GNU Make 4.2.1
Built for x86_64-pc-linux-gnu
$ g++ --version
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
```


