Commit dcc104d
fix: setting batch size correctly for language dataset (#314)
The map function for generating the language dataset takes batch_size as an argument, whose default value is 1000. This creates an issue during tokenization: the tokenizer adds pad tokens incorrectly. We set the batch size correctly here.

1 parent 627409d · commit dcc104d
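The fix concerns the default batching behavior of Dataset.map(). Below is a minimal sketch of the idea, assuming the Hugging Face datasets and transformers libraries; the dataset and tokenizer names are illustrative placeholders, since the changed code itself is not shown on this page.

```python
# Sketch only: assumes Hugging Face `datasets` and `transformers`; the
# dataset/tokenizer below are placeholders, not the project's actual ones.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token

raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def tokenize(batch):
    # Padding is applied per mapped batch, so the padded length depends on
    # how many examples map() hands to this function at a time.
    return tokenizer(batch["text"], padding="longest", truncation=True, max_length=128)

# Dataset.map() defaults to batch_size=1000 when batched=True; passing the
# intended batch size explicitly keeps padding consistent with training.
tokenized = raw.map(tokenize, batched=True, batch_size=32, remove_columns=["text"])
```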
2 files changed: +14 −7 lines changed

[Diff for the first changed file: code content not captured; changed hunks around lines 331–337 and 463–475.]
[Diff for the second changed file: code content not captured; changed hunks around lines 23–29, 50–68, 108–118, 243–249, and 272–279.]