Fix batch size handling in DatasetTorch to ensure at least one batch per file #66
Conversation
Codecov Report: ✅ All modified and coverable lines are covered by tests.
AlexanderFengler left a comment
Left a few thoughts, thanks @cpaniaguam.
Definitely wasn't in good shape (not robust) before.
Thanks for looking into it.
AlexanderFengler left a comment
Some of the batch-size choices in the tests seem a bit crazy now, but let's roll with it :). Thanks @copilot
@AlexanderFengler I've opened a new pull request, #68, to work on those changes. Once the pull request is ready, I'll request review from you.
This pull request introduces several improvements and bug fixes to the DatasetTorch class and its usage, focusing on batch size handling, error checking, and test consistency. The main changes enforce that the batch size must evenly divide the number of samples per file, update the logic for calculating batches per file, and update tests and configuration to use consistent, valid batch sizes.

Core logic and validation improvements (see the sketch after this list):

- Updated DatasetTorch to raise a ValueError if batch_size does not evenly divide the number of samples per file, ensuring data consistency and preventing subtle bugs during batching.
- Reworked the indexing logic in __getitem__, improving code clarity and reliability.
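The PR description does not include the code itself; the following is a minimal sketch of the validation and indexing behavior it describes. The class shape, the attribute names samples_per_file and batches_per_file, and the returned slice are assumptions for illustration, not the repository's actual implementation:

```python
class DatasetTorch:
    """Sketch of the batch-size validation described above (assumed names)."""

    def __init__(self, samples_per_file: int, batch_size: int):
        # Reject batch sizes that do not evenly divide the per-file sample
        # count, so every file yields a whole number of batches.
        if samples_per_file % batch_size != 0:
            raise ValueError(
                f"batch_size ({batch_size}) must evenly divide the number "
                f"of samples per file ({samples_per_file})"
            )
        self.samples_per_file = samples_per_file
        self.batch_size = batch_size
        # With the divisibility check above, every file contributes at
        # least one full batch.
        self.batches_per_file = samples_per_file // batch_size

    def __getitem__(self, idx: int):
        # Map a flat batch index to (file index, batch-within-file index).
        file_idx, batch_idx = divmod(idx, self.batches_per_file)
        start = batch_idx * self.batch_size
        return file_idx, slice(start, start + self.batch_size)
```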
Test updates and coverage:

- Updated tests/test_torch_mlp.py to use batch sizes that evenly divide the sample count, and added a new test to verify that an error is raised if this condition is not met (a sketch of such a test follows this list). [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]
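A sketch of what such an error test might look like, assuming the constructor signature from the sketch above; the real test in tests/test_torch_mlp.py may differ:

```python
import pytest

def test_invalid_batch_size_raises():
    # 7 does not evenly divide 100, so construction should fail loudly.
    # DatasetTorch here refers to the sketch above (assumed signature).
    with pytest.raises(ValueError, match="evenly divide"):
        DatasetTorch(samples_per_file=100, batch_size=7)
```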
Configuration and constants alignment:

- Updated configuration values and constants so that batch sizes remain consistent and valid with respect to the per-file sample counts.

Minor code cleanups:
- Replaced .keys() with direct in checks for dictionary membership throughout the code, for improved readability and Pythonic style (illustrated below). [1] [2] [3]
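For illustration, the before/after of this cleanup on a hypothetical dictionary (the key names are not from the repository):

```python
config = {"batch_size": 32, "learning_rate": 1e-3}

# Before: redundant call to .keys()
if "batch_size" in config.keys():
    print("found")

# After: a direct membership check is equivalent and more idiomatic
if "batch_size" in config:
    print("found")
```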