feat: tistory blog thumbnail extraction #16

printSANO · 2025-03-11T08:09:31Z

Summary

Extract the thumbnail from a Tistory blog post when one exists.

Changes

Extract Tistory blog thumbnails
Update the test client
Add the x/net/html package

Notes

Summary by CodeRabbit

New Features
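- Post information now includes the description content, providing richer content.
- Messages are now published together with the post's category information, strengthening classification.
Chores
- Adopted a newer Go version (1.23.0) and updated networking dependencies to improve stability.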

coderabbitai · 2025-03-11T08:09:39Z

Walkthrough

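This change introduces image source extraction logic, data structure improvements, and an updated message publishing flow for the Tistory blog feature. A default case was added to the extractImageSrc function to improve token handling when parsing HTML content, and a Description field was added to the TistoryItem struct to carry the blog description data. The Go version was updated from 1.20 to 1.23.0, a new networking dependency was added, and PublishMessage now takes a blogCategory argument so that category information is included when a message is published.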

Changes

File	Change
`cmd/blogs/tistory.go`	Added a `default` case to the `extractImageSrc` function
`cmd/blogs/utils.go`	Added a `Description` field to the `TistoryItem` struct
`go.mod`	Updated the Go version from 1.20 to 1.23.0 and added the `golang.org/x/net v0.37.0` dependency
`test_client/client.go`	Changed the `PublishMessage` signature: added a `blogCategory` parameter and updated the call site; the `Type` field of `amqp091.Publishing` now carries the blogCategory value

Sequence Diagram(s)

sequenceDiagram
    participant Caller as getTistoryPosts
    participant ImageExtractor as extractImageSrc
    participant Tokenizer as HTML Tokenizer

    Caller->>ImageExtractor: Pass HTML content
    ImageExtractor->>Tokenizer: Request HTML token parsing
    Tokenizer-->>ImageExtractor: Provide token stream (including <img> tags)
    ImageExtractor-->>Caller: Return extracted src value or empty string

sequenceDiagram
    participant Main as main
    participant Publisher as PublishMessage
    participant AMQP as amqp.Channel

    Main->>Publisher: Pass userID, blogCategory, message
    Publisher->>AMQP: Create Publishing object (Type = blogCategory)
    AMQP-->>Publisher: Message handled
    Publisher-->>Main: Return result

Poem

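Like a spring day when rabbits leap,
new life is breathed into the code.
Finding the images hidden in HTML,
thumbnails sparkle and shine.
Categories tacked onto each message,
dancing on Go's fresh breeze.
All together, a feast of joyful change! 🐇🌸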

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

cmd/blogs/tistory.go (1)
62-64: Thumbnail extraction implemented

Logging the thumbnail extracted from each blog post helps during development and debugging. In production, consider adjusting the log level.
- log.Printf("Thumbnail: %s", thumbnail)
+ if thumbnail != "" {
+    log.Printf("Thumbnail extracted: %s", thumbnail)
+ }
test_client/client.go (1)
86-86: Remove unnecessary type conversions

string(msg.UserID) and string(msg.Category) are unnecessary type conversions. msg.UserID and msg.Category are already strings.
- err := PublishMessage(rabbitChan, queue.Name, messageBytes, string(msg.UserID), string(msg.Category))
+ err := PublishMessage(rabbitChan, queue.Name, messageBytes, msg.UserID, msg.Category)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0173fec and 3308736.

⛔ Files ignored due to path filters (1)

go.sum is excluded by !**/*.sum

📒 Files selected for processing (4)

cmd/blogs/tistory.go (3 hunks)
cmd/blogs/utils.go (1 hunks)
go.mod (1 hunks)
test_client/client.go (3 hunks)

🔇 Additional comments (7)

go.mod (2)

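8-8: golang.org/x/net package addition confirmed

Adding the x/net/html package for HTML parsing is appropriate. The package is needed to implement thumbnail extraction for Tistory blogs.

3-3: The Go version jumped significantly.

✅ Verification successful

Go was upgraded from 1.20 to 1.23.0, a large version jump. This can cause compatibility issues, so make sure every developer on the team has this version installed.

🏁 Script executed:
#!/bin/bash
# Check the Go version and compatibility
go version
Length of output: 44
Confirm the Go version and check team environments

Running go version on the current system shows that the installed Go version is go1.23.3. That is the same major and minor line as the go 1.23.0 declared in go.mod, so the patch-level difference is unlikely to matter. Because moving from Go 1.20 to 1.23 is a substantial upgrade, however, please make sure every development environment on the team uses the same version (or a compatible patch version).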
cmd/blogs/utils.go (1)

143-149: The field was added appropriately.

Adding the Description field to the TistoryItem struct is an appropriate change for extracting thumbnails from blog content. The XML and JSON tags are set correctly.
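
As a rough illustration of how a tagged Description field picks up the HTML body of an RSS item, here is a minimal sketch using only encoding/xml. The rssItem type, its Title and Link fields, the tag names, and the sample feed are assumptions for illustration; the review only confirms that TistoryItem gained a Description field with XML and JSON tags.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// rssItem is a hypothetical stand-in for TistoryItem. Only the existence of a
// Description field is confirmed by the review; the other fields and all tag
// names are assumptions.
type rssItem struct {
	Title       string `xml:"title" json:"title"`
	Link        string `xml:"link" json:"link"`
	Description string `xml:"description" json:"description"`
}

// parseItem decodes a single <item> element into an rssItem.
func parseItem(data []byte) (rssItem, error) {
	var item rssItem
	err := xml.Unmarshal(data, &item)
	return item, err
}

func main() {
	// Sample data; a real feed would come from the blog's RSS endpoint.
	raw := []byte(`<item>
  <title>Hello</title>
  <link>https://example.tistory.com/1</link>
  <description><![CDATA[<p><img src="https://example.com/a.png"/></p>]]></description>
</item>`)

	item, err := parseItem(raw)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	// The CDATA payload arrives as a plain string of HTML, which is the
	// kind of input an image extractor would then scan.
	fmt.Println(item.Description)
}
```

The decoded string is what a function like extractImageSrc would receive as its htmlContent argument.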

cmd/blogs/tistory.go (2)

11-12: x/net/html package import

The appropriate package was imported to extract image sources from Tistory blog HTML content.

71-71: Thumbnail field added

The thumbnail field was added appropriately to the Posts struct.

test_client/client.go (2)

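102-102: Function signature change is appropriate

Adding the blog category as a parameter is a good change that enriches the message metadata.

127-127: Message Type field set

Setting the blog category in the message's Type field is useful for classifying messages in RabbitMQ.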

coderabbitai · 2025-03-11T08:13:00Z

cmd/blogs/tistory.go

+func extractImageSrc(htmlContent string) string {
+	tokenizer := html.NewTokenizer(strings.NewReader(htmlContent))
+
+	for {
+		tt := tokenizer.Next()
+		token := tokenizer.Token()
+		switch tt {
+		case html.ErrorToken:
+			log.Printf("Error tokenizing HTML: %v", tokenizer.Err())
+			return ""
+		case html.SelfClosingTagToken:
+			if token.Data == "img" {
+				for _, attr := range token.Attr {
+					if attr.Key == "src" {
+						return attr.Val
+					}
+				}
+			}
+		}
+	}
+}


⚠️ Potential issue

Infinite loop in the image source extraction function

With the current implementation, the function never returns when the HTML content contains no image tag. It also does not detect ordinary img tags (those with a closing tag).

The following change is recommended:

func extractImageSrc(htmlContent string) string {
	tokenizer := html.NewTokenizer(strings.NewReader(htmlContent))

	for {
		tt := tokenizer.Next()
		token := tokenizer.Token()
		switch tt {
		case html.ErrorToken:
			log.Printf("Error tokenizing HTML: %v", tokenizer.Err())
			return ""
-		case html.SelfClosingTagToken:
+		case html.SelfClosingTagToken, html.StartTagToken:
			if token.Data == "img" {
				for _, attr := range token.Attr {
					if attr.Key == "src" {
						return attr.Val
					}
				}
			}
+			// Optionally cap the number of iterations
+			// to improve performance on large HTML content
		}
	}
}

Also consider adding the following:

Add a timeout mechanism

Convert src to an absolute path when it is a relative path

More error handling
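
The relative-to-absolute conversion mentioned above could be sketched with the standard net/url package. The function name resolveImageURL, its parameters, and the example URLs are hypothetical; the PR does not show how (or whether) the post URL is available at the call site.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveImageURL turns a possibly relative src into an absolute URL, using
// the post's own URL as the base. The name and parameters are hypothetical.
func resolveImageURL(postURL, src string) (string, error) {
	base, err := url.Parse(postURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(src)
	if err != nil {
		return "", err
	}
	// ResolveReference leaves absolute URLs untouched and resolves relative
	// ones against base, following RFC 3986.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	post := "https://example.tistory.com/entry/1" // placeholder post URL
	for _, src := range []string{"/images/a.png", "//cdn.example.com/b.png", "https://cdn.example.com/c.png"} {
		abs, err := resolveImageURL(post, src)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Println(abs)
	}
}
```

Resolving against the post URL rather than a hard-coded host keeps the helper usable for blogs on custom domains, under the assumption that the post URL is known when the thumbnail is extracted.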

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

func extractImageSrc(htmlContent string) string {
	tokenizer := html.NewTokenizer(strings.NewReader(htmlContent))

	for {
		tt := tokenizer.Next()
		token := tokenizer.Token()
		switch tt {
		case html.ErrorToken:
			log.Printf("Error tokenizing HTML: %v", tokenizer.Err())
			return ""
		case html.SelfClosingTagToken, html.StartTagToken:
			if token.Data == "img" {
				for _, attr := range token.Attr {
					if attr.Key == "src" {
						return attr.Val
					}
				}
			}
			// Optionally cap the number of iterations
			// to improve performance on large HTML content
		}
	}
}

coderabbitai

Actionable comments posted: 0

🔭 Outside diff range comments (1)

cmd/blogs/tistory.go (1)
76-76: ⚠️ Potential issue

The function does not return the blog response correctly.

The getTistoryPosts function is missing a return statement. It should return tistoryBlogResponse at the end of the function.
    })
  }
+ return tistoryBlogResponse, nil
}

♻️ Duplicate comments (1)

cmd/blogs/tistory.go (1)
119-139: ⚠️ Potential issue

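Infinite loop in the image source extraction function

With the current implementation, the function never returns when the HTML content contains no image tag. It also does not detect ordinary img tags (those with a closing tag).

The following change is recommended: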
func extractImageSrc(htmlContent string) string {
	tokenizer := html.NewTokenizer(strings.NewReader(htmlContent))

	for {
		tt := tokenizer.Next()
		token := tokenizer.Token()
		switch tt {
		case html.ErrorToken:
			log.Printf("Error tokenizing HTML: %v", tokenizer.Err())
			return ""
-		case html.SelfClosingTagToken:
+		case html.SelfClosingTagToken, html.StartTagToken:
			if token.Data == "img" {
				for _, attr := range token.Attr {
					if attr.Key == "src" {
						return attr.Val
					}
				}
			}
		}
	}
}
Also consider the following improvements:

Add logic to convert relative URLs to absolute URLs

Validate the image URL

Adjust the debug logging level
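
One possible reading of "validate the image URL" is a cheap syntactic check before the value is stored as a thumbnail. The sketch below is an assumption about what such a check might look like; isUsableImageURL is a hypothetical helper, and it does not confirm that the URL actually serves an image.

```go
package main

import (
	"fmt"
	"net/url"
)

// isUsableImageURL reports whether raw parses as an absolute http(s) URL with
// a host. This is a hypothetical helper: it is a syntactic check only and
// does not fetch the resource.
func isUsableImageURL(raw string) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return false
	}
	return (u.Scheme == "http" || u.Scheme == "https") && u.Host != ""
}

func main() {
	candidates := []string{
		"https://example.com/a.png", // placeholder values
		"/relative/a.png",
		"javascript:alert(1)",
		"",
	}
	for _, c := range candidates {
		fmt.Printf("%q usable: %v\n", c, isUsableImageURL(c))
	}
}
```

Relative paths are rejected here on purpose, so a check like this would run after any relative-to-absolute conversion rather than before it.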

🧹 Nitpick comments (2)

cmd/blogs/tistory.go (2)
62-63: Log output needs improvement.

Every thumbnail URL is currently logged. When many blog posts are processed, this can produce excessive logs. Unless the output is needed for debugging, adjust the log level or remove the logging.
 thumbnail := extractImageSrc(item.Description)
-log.Printf("Thumbnail: %s", thumbnail)
+// Log only in development, or adjust the log level
+if debug {
+    log.Printf("Thumbnail: %s", thumbnail)
+}
62-62: Improve the image extraction logic.

The current implementation extracts only the first image from the entire content of a Tistory post. Cases where there is no thumbnail, or where the first image is not the representative image, need handling.
-thumbnail := extractImageSrc(item.Description)
+// First check for a representative image (an Open Graph tag or an image with a specific class)
+thumbnail := extractThumbnailImage(item.Description)
+// Fall back to the first image when there is no representative image
+if thumbnail == "" {
+    thumbnail = extractImageSrc(item.Description)
+}
+// Set a default image
+if thumbnail == "" {
+    thumbnail = "https://example.com/default-thumbnail.jpg"
+}

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3308736 and 71d97c0.

📒 Files selected for processing (2)

cmd/blogs/tistory.go (3 hunks)
cmd/blogs/tistory.go (0 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

cmd/blogs/tistory.go

sonarqubecloud · 2025-03-11T08:16:16Z

Quality Gate failed

Failed conditions
0.0% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

coderabbitai

Actionable comments posted: 0

🔭 Outside diff range comments (2)

cmd/blogs/tistory.go (2)
140-140: ⚠️ Potential issue

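Syntax error: extra closing brace

The closing brace on line 140 is unnecessary. It has no matching opening brace, so it causes a syntax error.
- }
75-75: ⚠️ Potential issue

Missing return statement in getTistoryPosts

The getTistoryPosts function does not return a value. Its declaration specifies a (BlogResponse, error) return type, but there is no return statement at the end of the function.

The following change is recommended: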
- }
+ 	return tistoryBlogResponse, nil
+ }

♻️ Duplicate comments (1)

cmd/blogs/tistory.go (1)
119-139: ⚠️ Potential issue

Infinite loop and missing tag handling in the image source extraction function

With the current implementation, the function never returns when the HTML content contains no image tag. It also does not detect ordinary img tags (those with a closing tag).

The following change is recommended:
func extractImageSrc(htmlContent string) string {
	tokenizer := html.NewTokenizer(strings.NewReader(htmlContent))

	for {
		tt := tokenizer.Next()
		token := tokenizer.Token()
		switch tt {
		case html.ErrorToken:
			log.Printf("Error tokenizing HTML: %v", tokenizer.Err())
			return ""
-		case html.SelfClosingTagToken:
+		case html.SelfClosingTagToken, html.StartTagToken:
			if token.Data == "img" {
				for _, attr := range token.Attr {
					if attr.Key == "src" {
						return attr.Val
					}
				}
			}
+		default:
+			// Limit the maximum number of iterations, or
+			// add a fallback for when no image is found
		}
	}
}

🧹 Nitpick comments (1)

cmd/blogs/tistory.go (1)
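62-63: Thumbnail extraction logic added

Extracting the image source from the blog description is well implemented. Logging is useful for debugging but can become unnecessary noise in production.

In production, consider adjusting the log level or using conditional logging, for example: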
- log.Printf("Thumbnail: %s", thumbnail)
+ if debug {
+   log.Printf("Thumbnail: %s", thumbnail)
+ }
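
As an alternative to a hand-rolled debug flag, leveled logging from the standard log/slog package (available since Go 1.21, so compatible with the go 1.23.0 directive in go.mod) could express the same intent. This is a sketch under assumptions: newLogger and debugEnabled are hypothetical helpers, and the PR itself uses log.Printf.

```go
package main

import (
	"context"
	"log/slog"
	"os"
)

// newLogger returns a text logger that emits debug records only when debug is
// true. Hypothetical helper; where the flag comes from is left open.
func newLogger(debug bool) *slog.Logger {
	level := slog.LevelInfo
	if debug {
		level = slog.LevelDebug
	}
	return slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: level}))
}

// debugEnabled reports whether the logger would emit debug-level records.
func debugEnabled(l *slog.Logger) bool {
	return l.Enabled(context.Background(), slog.LevelDebug)
}

func main() {
	thumbnail := "https://example.com/a.png" // placeholder value

	dev := newLogger(true)
	dev.Debug("thumbnail extracted", "url", thumbnail) // written to stderr

	prod := newLogger(false)
	prod.Debug("thumbnail extracted", "url", thumbnail) // dropped by the level filter
}
```

With this shape the call sites stay unconditional and the decision about verbosity lives in one place, at logger construction.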

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 71d97c0 and a9b76c3.

📒 Files selected for processing (3)

cmd/blogs/tistory.go (3 hunks)
cmd/blogs/tistory.go (0 hunks)
cmd/blogs/tistory.go (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

cmd/blogs/tistory.go
cmd/blogs/tistory.go

🔇 Additional comments (2)

cmd/blogs/tistory.go (2)

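11-12: New dependency addition confirmed

The golang.org/x/net/html package for HTML parsing was added correctly. It provides the core functionality needed for HTML tokenization and parsing.

71-71: Thumbnail field added

The Thumbnail field was successfully added to the Posts struct.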

feat: tistory blog thumbnail extraction

3308736

printSANO requested a review from jungeunyooon March 11, 2025 08:09

jungeunyooon approved these changes Mar 11, 2025

coderabbitai bot reviewed Mar 11, 2025

printSANO added 2 commits March 11, 2025 01:13

fix: thumbnail log

71d97c0

fix: infinite loop

a9b76c3

coderabbitai bot reviewed Mar 11, 2025

printSANO merged commit 3b0cb55 into main Mar 11, 2025
4 of 5 checks passed
