Feature/#4 In-house model #6
Merged
⭐Key Changes
Comparison results:
Quantization results:
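Refactoring to 4-bit quantization of KULLM3 and a batch inference structure improved average response speed by about 62% and cut GPU VRAM usage by about 50%. Because three or more news articles can now be processed in parallel, throughput increased by up to 8x. In particular, the wait until the first response, as perceived by the user, drops from 3 seconds to under 1 second, so a substantial UX improvement can also be expected.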
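As a rough illustration of the batch inference structure described above, the following is a minimal sketch rather than the code in this PR. The names `chunked`, `run_batched`, and `fake_generate_batch` are hypothetical, the batch size of 3 simply mirrors the "three or more news articles" figure, and the model call is replaced by a placeholder so the example runs without a GPU.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def run_batched(
    articles: Sequence[str],
    generate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 3,  # illustrative value, not taken from the PR code
) -> List[str]:
    """Call generate_batch once per batch and return outputs in input order."""
    outputs: List[str] = []
    for batch in chunked(articles, batch_size):
        results = generate_batch(batch)
        if len(results) != len(batch):
            raise RuntimeError("generate_batch must return one output per input")
        outputs.extend(results)
    return outputs


def fake_generate_batch(batch: List[str]) -> List[str]:
    # Hypothetical placeholder for the real quantized-model call:
    # it only tags each article so the control flow can be checked.
    return [f"summary of {text}" for text in batch]


if __name__ == "__main__":
    news = ["article A", "article B", "article C", "article D"]
    print(run_batched(news, fake_generate_batch))
```

In a real setup, `generate_batch` would tokenize the whole batch with padding and make a single generation call, which is where the throughput gain from batching would come from.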
📌 issue