[DP-554] Search keyword autocomplete API (ES → MySQL migration) #184
Conversation
ssosee
left a comment
Great work~ ^0^
Did you use COLLATE utf8mb4_bin to force binary comparison of the data (which improves accuracy(?))?
That's right! I used it so that Unicode code points are compared exactly when Hangul is split into consonant and vowel jamo!
);

// Expression used to compute the score
var jamoScore = Expressions.numberTemplate(Double.class,
Is there a reason you used the var type here?
Oops, a Kotlin habit... I'll fix it~!
private final JPQLQueryFactory query;

@Override
public List<TechKeyword> searchKeyword(String inputJamo, String inputChosung, Pageable pageable) {
I understood it to be roughly the following query. Is that right?
select tk.*
from tech_keyword tk
where match(tk.jamo_key) against('+ㅈㅏㅂㅏ' in boolean mode) > 0.0
or match(tk.chosung_key) against('+ㅈㅂ' in boolean mode) > 0.0
order by greatest(
match(tk.jamo_key) against('+ㅈㅏㅂㅏ' in boolean mode),
match(tk.chosung_key) against('+ㅈㅂ' in boolean mode)
) desc,
char_length(tk.keyword) asc
limit 20;
Exactly! When it is actually called, the following query is generated.
select
tk1_0.id,
tk1_0.chosung_key,
tk1_0.created_at,
tk1_0.jamo_key,
tk1_0.keyword,
tk1_0.last_modified_at
from
tech_keyword tk1_0
where
match (tk1_0.jamo_key) against (? in boolean mode)>0.0
or match (tk1_0.chosung_key) against (? in boolean mode)>0.0
order by
greatest(match (tk1_0.jamo_key) against (? in boolean mode),
match (tk1_0.chosung_key) against (? in boolean mode)) desc,
character_length(tk1_0.keyword)
limit
?
private static final int COMPAT_JAMO_START = 0x3130; // 'ㄱ'
private static final int COMPAT_JAMO_END = 0x318F; // 'ㆎ'

// Constants for Hangul decomposition
Is there a reason you defined separate constants instead of using the length of the CHOSUNG, JUNGSUNG, and JONGSUNG arrays?
They are used repeatedly in several places, so I extracted them into constants!
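For readers unfamiliar with the decomposition being discussed, here is a minimal sketch of how a precomposed Hangul syllable can be split into compatibility jamo using the standard Unicode arithmetic. The class name `HangulDecompositionSketch` and its method names are hypothetical and are not taken from the PR's `HangulUtils`. Passing non-Hangul characters through unchanged and keeping compound final consonants (such as ㅄ) as a single jamo are assumptions; the PR may handle both differently.

```java
// Hypothetical sketch, not the PR's HangulUtils.
final class HangulDecompositionSketch {
    private static final int SYLLABLE_BASE = 0xAC00; // '가', first precomposed syllable
    private static final int SYLLABLE_END = 0xD7A3;  // '힣', last precomposed syllable

    private static final char[] CHOSUNG = {
        'ㄱ', 'ㄲ', 'ㄴ', 'ㄷ', 'ㄸ', 'ㄹ', 'ㅁ', 'ㅂ', 'ㅃ', 'ㅅ',
        'ㅆ', 'ㅇ', 'ㅈ', 'ㅉ', 'ㅊ', 'ㅋ', 'ㅌ', 'ㅍ', 'ㅎ'
    };
    private static final char[] JUNGSUNG = {
        'ㅏ', 'ㅐ', 'ㅑ', 'ㅒ', 'ㅓ', 'ㅔ', 'ㅕ', 'ㅖ', 'ㅗ', 'ㅘ', 'ㅙ',
        'ㅚ', 'ㅛ', 'ㅜ', 'ㅝ', 'ㅞ', 'ㅟ', 'ㅠ', 'ㅡ', 'ㅢ', 'ㅣ'
    };
    // Index 0 means "no final consonant".
    private static final String[] JONGSUNG = {
        "", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ", "ㄻ", "ㄼ", "ㄽ", "ㄾ",
        "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"
    };

    private HangulDecompositionSketch() {
    }

    private static boolean isSyllable(char c) {
        return c >= SYLLABLE_BASE && c <= SYLLABLE_END;
    }

    /** Splits every Hangul syllable into its jamo; other characters pass through (assumption). */
    static String toJamo(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (!isSyllable(c)) {
                out.append(c);
                continue;
            }
            int offset = c - SYLLABLE_BASE;
            int chosungIndex = offset / (JUNGSUNG.length * JONGSUNG.length);
            int jungsungIndex = (offset / JONGSUNG.length) % JUNGSUNG.length;
            int jongsungIndex = offset % JONGSUNG.length;
            out.append(CHOSUNG[chosungIndex])
               .append(JUNGSUNG[jungsungIndex])
               .append(JONGSUNG[jongsungIndex]);
        }
        return out.toString();
    }

    /** Keeps only the initial consonant of every Hangul syllable. */
    static String toChosung(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (!isSyllable(c)) {
                out.append(c);
                continue;
            }
            out.append(CHOSUNG[(c - SYLLABLE_BASE) / (JUNGSUNG.length * JONGSUNG.length)]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toJamo("자바"));
        System.out.println(toChosung("자바"));
    }
}
```

This sketch reads the counts from the array lengths, which is the alternative the reviewer asked about; named constants such as the ones in the PR express the same values and can be easier to read when the numbers appear in many places.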
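From what I found, boolean mode is well suited to filtering and autocomplete, while natural language mode is better for ranking.
Is there a reason you used only boolean mode?
I thought it might be good(?) to use boolean mode in the where clause and natural language mode in the order by clause, so I wanted to ask! 🙌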
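It didn't occur to me to mix the two modes when I implemented this, so thank you for the suggestion!
When I tested it, the query that uses only boolean mode was on average very slightly faster. From what I read, this is because in the order by clause boolean mode only looks at a simple match weight, so its computation cost is relatively low, whereas natural language mode adds extra work such as normalization and is a little heavier.
The ordering of the results does differ, though, as follows:
- With boolean mode, the score only reflects whether the term matched (1: present, 0: absent), so when scores are equal only char_length(keyword) asc affects the order.
- With natural language mode, the score is computed on a TF-IDF(?) basis, reflecting the number of occurrences in the document, its length, and how rare the term is across the whole data set, so it should be able to produce a somewhat more fine-grained ranking.
There doesn't seem to be a big difference in performance, so mixing the two as Sewoong suggested looks like a good idea!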
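As an illustration of the mixed approach discussed in this thread, the sketch below assembles a native SQL string that filters with boolean mode and ranks with natural language mode. It is only a sketch under assumptions: the class and method names are hypothetical, the table and column names are copied from the query quoted earlier in the review, and passing the raw key (without the `+` operator) to the natural language expressions is an assumption, since that operator is specific to boolean mode. Whether the resulting ranking is actually better on this data set was not measured here.

```java
import java.util.List;

// Hypothetical sketch of the "boolean mode in WHERE, natural language mode in ORDER BY" idea.
final class MixedModeQuerySketch {
    private MixedModeQuerySketch() {
    }

    /** Native SQL with five positional parameters, in the order returned by parameters(). */
    static String sql() {
        return String.join("\n",
            "select tk.*",
            "from tech_keyword tk",
            "where match(tk.jamo_key) against(? in boolean mode) > 0",
            "   or match(tk.chosung_key) against(? in boolean mode) > 0",
            "order by greatest(",
            "           match(tk.jamo_key) against(? in natural language mode),",
            "           match(tk.chosung_key) against(? in natural language mode)",
            "         ) desc,",
            "         char_length(tk.keyword) asc",
            "limit ?");
    }

    /** Boolean-mode terms get the '+' (must match) operator; natural language terms stay raw. */
    static List<Object> parameters(String inputJamo, String inputChosung, int limit) {
        return List.of("+" + inputJamo, "+" + inputChosung, inputJamo, inputChosung, limit);
    }
}
```

The string and parameter list could then be bound to a JDBC `PreparedStatement` or a JPA native query; that wiring is left out because it depends on how the repository is set up.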
The HangulUtils class looks great!
Did you define it as abstract because it only has static methods?
That's right! Its static methods are called from outside, and I changed it to an abstract class to stay consistent with the other utility classes.
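For context on this design choice, the sketch below contrasts the abstract style described above with a common alternative, a final class with a private constructor. Both class names are hypothetical and neither is taken from the PR. One difference worth knowing: an abstract class cannot be instantiated directly but can still be subclassed, while the final variant rules out both.

```java
import java.lang.reflect.Modifier;

// Hypothetical sketch comparing two ways to write a static-only utility class.
final class UtilityClassStyleSketch {
    private UtilityClassStyleSketch() {
    }

    // Style described in the thread: abstract, so "new AbstractTextUtils()" does not compile.
    abstract static class AbstractTextUtils {
        static boolean isBlank(String s) {
            return s == null || s.trim().isEmpty();
        }
    }

    // Alternative: final with a private constructor, so it can be neither extended nor instantiated.
    static final class FinalTextUtils {
        private FinalTextUtils() {
            throw new AssertionError("no instances");
        }

        static boolean isBlank(String s) {
            return s == null || s.trim().isEmpty();
        }
    }

    static boolean isAbstract(Class<?> type) {
        return Modifier.isAbstract(type.getModifiers());
    }
}
```

Either style works for callers, since they only use the static methods; matching the convention already used by the project's other utility classes, as done in the PR, is a reasonable tie-breaker.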
@Getter
@NoArgsConstructor(access = AccessLevel.PROTECTED)
@Table(indexes = {
    @Index(name = "idx__ft__chosung_key", columnList = "chosung_key"),
A minor point, but I believe we agreed to name indexes in the form idx_tech_keyword_01!
Thanks to the MySQLTestContainer that Soyoung wrote, we are no longer dependent on h2!
Thank you!
Right now only the TechKeywordService class uses MySQL and the rest of the code uses h2. Would you like everything to use MySQL?
No, not at all!
Feel free to use it only where it's needed!
.withPassword("test")
.withCommand(
    "--character-set-server=utf8mb4",
    "--collation-server=utf8mb4_unicode_ci",
Our server's schema was created with utf8mb4_general_ci. That shouldn't cause any big problems, right..?
I'll change it to match the server.
Thank you~!
📝 Work details
🔗 References (optional)
💬 Review requests (optional)