
Conversation

qiancai
Owner

@qiancai qiancai commented Sep 25, 2025

pingcap#21819)

First-time contributors' checklist

What is changed, added or deleted? (Required)

Which TiDB version(s) do your changes apply to? (Required)

Tips for choosing the affected version(s):

By default, CHOOSE MASTER ONLY so your changes will be applied to the next TiDB major or minor releases. If your PR involves a product feature behavior change or a compatibility change, CHOOSE THE AFFECTED RELEASE BRANCH(ES) AND MASTER.

For details, see tips for choosing the affected versions.

  • master (the latest development version)
  • v9.0 (TiDB 9.0 versions)
  • v8.5 (TiDB 8.5 versions)
  • v8.1 (TiDB 8.1 versions)
  • v7.5 (TiDB 7.5 versions)
  • v7.1 (TiDB 7.1 versions)
  • v6.5 (TiDB 6.5 versions)
  • v6.1 (TiDB 6.1 versions)
  • v5.4 (TiDB 5.4 versions)

What is the related PR or file link(s)?

  • This PR is translated from:
  • Other reference link(s):

Do your changes match any of the following descriptions?

  • Delete files
  • Change aliases
  • Need modification after being applied to another branch
  • Might cause conflicts after being applied to another branch

Summary by CodeRabbit

  • New Features

    • Automatic synchronization of Chinese-to-English documentation updates for pull requests.
    • AI-assisted translation for added, modified, and deleted sections, including TOC updates.
    • Status comments on target PRs after sync completion.
  • Chores

    • Added a CI workflow to orchestrate doc syncing.
    • Introduced supporting scripts for PR analysis, section matching, file updates, and TOC processing.
    • Added required third-party dependencies for GitHub, AI, and tokenization integrations.


coderabbitai bot commented Sep 25, 2025

Walkthrough

Adds a GitHub Actions workflow to sync English docs from a Chinese PR and introduces a new Python-based pipeline under scripts/translate_doc_pr for analyzing PR diffs, matching sections, translating content via pluggable AI providers, and applying changes (add/update/delete/TOC) to target repos. Includes an orchestrating entry point, helpers, and dependencies.

Changes

Cohort / File(s) Summary
CI Workflow: PR sync orchestration
.github/workflows/sync-docs-cn-to-en.yml
New workflow to sync docs from a source PR to a target PR. Checks out repos, resolves target branch, runs translation workflow, commits/pushes if changes, and comments on the PR.
Packaging & Dependencies
scripts/translate_doc_pr/__init__.py, scripts/translate_doc_pr/requirements.txt
Adds package initializer exposing main and introduces dependencies: PyGithub, OpenAI-compatible, tiktoken, Google Generative AI.
Orchestrator
scripts/translate_doc_pr/main_workflow.py
New end-to-end entry point: config loading, AI provider abstraction, PR diff retrieval, change classification, and sequential processing of added/modified/deleted/TOC via submodules.
PR Analysis & Matching
scripts/translate_doc_pr/pr_analyzer.py, scripts/translate_doc_pr/section_matcher.py
Adds PR URL parsing, diff extraction, Markdown hierarchy building, section detection, and AI-assisted mapping of source sections to target locations with token-estimation and diagnostics.
File Operations: Update/Add/Delete
scripts/translate_doc_pr/file_updater.py, scripts/translate_doc_pr/file_adder.py, scripts/translate_doc_pr/file_deleter.py
Introduces AI-assisted section update/insert/delete utilities, batch translation for added files, and safe deletion of target files, with thread-safe logging and fallbacks.
TOC Processing
scripts/translate_doc_pr/toc_processor.py
New TOC-specific logic: link extraction, translation decisions, best-match line finding, grouping and applying added/modified/deleted TOC lines with AI translation where needed.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Dev as Developer
  participant GH as GitHub Actions
  participant MW as main_workflow.py
  participant GHAPI as GitHub API
  participant AI as UnifiedAIClient
  participant Repo as Target Repo (branch)

  Dev->>GH: Dispatch Workflow (source_pr_url, target_pr_url, ai_provider)
  GH->>MW: Run Python entrypoint with env/tokens
  MW->>GHAPI: Fetch target PR branch, source PR diff
  MW->>MW: Classify changes (added/modified/deleted/TOC)
  opt Added files
    MW->>AI: Translate in batches
    MW->>Repo: Write new files
  end
  opt Modified/Deleted sections
    MW->>GHAPI: Get source/target contents
    MW->>AI: Map/translate sections
    MW->>Repo: Apply updates/deletions
  end
  opt TOC
    MW->>AI: Translate required lines
    MW->>Repo: Apply TOC edits
  end
  MW->>Repo: Stage and commit if changes
  GH->>GHAPI: Push commit, comment on target PR
sequenceDiagram
  autonumber
  participant MW as main_workflow.py
  participant ANA as pr_analyzer.py
  participant MATCH as section_matcher.py
  participant UPD as file_updater.py
  participant ADD as file_adder.py
  participant DEL as file_deleter.py
  participant TOC as toc_processor.py
  participant AI as UnifiedAIClient

  MW->>ANA: get_pr_diff + build hierarchies
  ANA-->>MW: operations + section structures
  alt Added files
    MW->>ADD: process_added_files(...)
    ADD->>AI: translate batches
  else Modified sections
    MW->>MATCH: match_source_diff_to_target(...)
    MATCH->>AI: assist mapping
    MW->>UPD: process_modified_sections(...)
    UPD->>AI: get_updated_sections_from_ai(...)
  else Deleted sections/files
    MW->>DEL: process_deleted_files(...) or UPD.delete_sections(...)
  else TOC changes
    MW->>TOC: process_toc_files(...)
    TOC->>AI: translate_toc_lines(...)
  end

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120+ minutes

Poem

I thump my paws—commit by commit,
Diffs hop in lines, I nibble each bit.
Sections align, TOC finds its friend,
AI whispers where translations blend.
Push goes the branch—carrot-bright green,
Docs in duet: CN to EN, serene. 🥕✨

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Title Check: ⚠️ Warning. The title “Update system-variables” does not reflect the primary changes in this pull request, which introduce a new GitHub Actions workflow and a suite of translation scripts rather than modifications to system variable documentation. It is misleading and does not convey the core purpose of the changeset. Resolution: rename the pull request title to clearly summarize the main contribution, for example by referencing the addition of the documentation sync workflow and translation modules, so that reviewers can immediately understand the primary change.
  • Description Check: ⚠️ Warning. The pull request description only contains the contribution template with empty checkboxes and no substantive answers in the required sections, so it fails to explain what was changed, which versions are affected, or provide any related links. This makes it impossible for reviewers to assess the scope or impact of the changes. Resolution: complete the template by describing what files and functionality were added or modified, selecting the affected TiDB versions, and including any relevant PR or file links so that reviewers have the necessary context to evaluate the changes.
✅ Passed checks (1 passed)
  • Docstring Coverage: ✅ Passed. Docstring coverage is 89.22%, which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch sync

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


Summary of Changes

Hello @qiancai, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily introduces a new, modular Python-based tool for automating documentation synchronization and translation across different language repositories. It integrates AI capabilities for intelligent content matching and translation, significantly improving the efficiency and accuracy of documentation updates. Additionally, it includes minor but important updates to existing documentation regarding system variables and TiKV configuration.

Highlights

  • New Documentation Auto-Sync and Translation Tool: Introduced a comprehensive Python package scripts/translate_doc_pr designed for automated synchronization and AI-powered translation of documentation. This tool aims to streamline the process of keeping documentation consistent across different language versions.
  • Modular Architecture: The new auto-sync tool is built with a modular design, separating concerns into distinct modules for PR analysis, section matching, file addition/deletion/updating, TOC processing, and main workflow orchestration, enhancing maintainability and testability.
  • AI-Powered Translation and Matching: The tool leverages AI (DeepSeek and Gemini) for intelligent section matching and content translation, ensuring that changes in source documentation are accurately reflected and translated in target language versions. It includes logic for handling various content types, including system variables and configuration items.
  • Enhanced DDL Parameter Management: The system-variables.md file has been updated to reflect new capabilities in TiDB v8.5.0, allowing modification of tidb_ddl_reorg_batch_size and tidb_ddl_reorg_worker_cnt for running DDL jobs using ADMIN ALTER DDL JOBS.
  • TiKV In-Memory Engine Configuration Update: The tikv-configuration-file.md documentation now provides a more detailed description and updated default value for the capacity parameter of the TiKV MVCC in-memory engine.
Ignored Files
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/sync-docs-cn-to-en.yml
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a substantial new Python package for automating documentation translation, alongside some documentation updates. The new translation tool is a significant piece of work, but it has several critical issues that need to be addressed before it can be considered production-ready. The main problems are incorrect module imports that will cause runtime failures, a high degree of code duplication across modules, and several extremely large and complex functions that make the code difficult to maintain and debug. My review focuses on these structural and maintainability issues in the Python code. The documentation changes themselves are minor and look good.

"""

# Import main functionality for easy access
from main import main


critical

This import statement is incorrect and will cause a runtime error. There is no top-level main module in the standard library or in this project. Based on the project structure, you likely intended to import from main_workflow.py within this package. This should be a relative import: from .main_workflow import main.

This same incorrect import pattern (from main import ...) appears in several other files in this package and needs to be fixed throughout for the package to be importable and usable.
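
As a rough sketch of the suggested fix (assuming main is defined at module level in main_workflow.py), the package initializer could read:

```python
# scripts/translate_doc_pr/__init__.py (sketch of the reviewer's suggested fix)
# Import main functionality for easy access, using a relative import so the
# package works regardless of the caller's working directory.
from .main_workflow import main

__all__ = ["main"]
```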

Comment on lines +814 to +1051
def process_single_file(file_path, source_sections, pr_diff, pr_url, github_client, ai_client, repo_config, max_non_system_sections=120):
"""Process a single file - thread-safe function for parallel processing"""
thread_id = threading.current_thread().name
thread_safe_print(f"\n📄 [{thread_id}] Processing {file_path}")

try:
# Check if this is a TOC file with special operations
if isinstance(source_sections, dict) and 'type' in source_sections and source_sections['type'] == 'toc':
from toc_processor import process_toc_file
return process_toc_file(file_path, source_sections, pr_url, github_client, ai_client, repo_config)

# Check if this is enhanced sections
if isinstance(source_sections, dict) and 'sections' in source_sections:
if source_sections.get('type') == 'enhanced_sections':
# Skip all the matching logic and directly extract data
thread_safe_print(f" [{thread_id}] 🚀 Using enhanced sections data, skipping matching logic")
enhanced_sections = source_sections['sections']

# Extract target sections and source old content from enhanced sections
# Maintain the exact order from match_source_diff_to_target.json
from collections import OrderedDict
target_sections = OrderedDict()
source_old_content_dict = OrderedDict()

# Process in the exact order they appear in enhanced_sections (which comes from match_source_diff_to_target.json)
for key, section_info in enhanced_sections.items():
if isinstance(section_info, dict):
operation = section_info.get('source_operation', '')

# Skip deleted sections - they shouldn't be in the enhanced_sections anyway
if operation == 'deleted':
continue

# For source sections: use old_content for modified, new_content for added
if operation == 'added':
source_content = section_info.get('source_new_content', '')
else: # modified
source_content = section_info.get('source_old_content', '')

# For target sections: use target_content for modified, empty string for added
if operation == 'added':
target_content = "" # Added sections have no existing target content
else: # modified
target_content = section_info.get('target_content', '')

# Add to both dictionaries using the same key from match_source_diff_to_target.json
if source_content is not None:
source_old_content_dict[key] = source_content
target_sections[key] = target_content

thread_safe_print(f" [{thread_id}] 📊 Extracted: {len(target_sections)} target sections, {len(source_old_content_dict)} source old content entries")

# Update sections with AI (get-updated-target-sections.py logic)
thread_safe_print(f" [{thread_id}] 🤖 Getting updated sections from AI...")
updated_sections = get_updated_sections_from_ai(pr_diff, target_sections, source_old_content_dict, ai_client, repo_config['source_language'], repo_config['target_language'], file_path)
if not updated_sections:
thread_safe_print(f" [{thread_id}] ⚠️ Could not get AI update")
return False, f"Could not get AI update for {file_path}"

# Return the AI results for further processing
thread_safe_print(f" [{thread_id}] ✅ Successfully got AI translation results for {file_path}")
return True, updated_sections # Return the actual AI results

else:
# New format: complete data structure
actual_sections = source_sections['sections']

# Regular file processing continues here for old format
# Get target hierarchy and content (get-target-affected-hierarchy.py logic)
from pr_analyzer import get_target_hierarchy_and_content
target_hierarchy, target_lines = get_target_hierarchy_and_content(file_path, github_client, repo_config['target_repo'])
if not target_hierarchy:
thread_safe_print(f" [{thread_id}] ⚠️ Could not get target content")
return False, f"Could not get target content for {file_path}"
else:
# Old format: direct dict
actual_sections = source_sections

# Only do mapping if we don't have enhanced sections
if 'enhanced_sections' not in locals() or not enhanced_sections:
# Separate different types of sections
from section_matcher import is_system_variable_or_config
system_var_sections = {}
toplevel_sections = {}
frontmatter_sections = {}
regular_sections = {}

for line_num, hierarchy in actual_sections.items():
if line_num == "0" and hierarchy == "frontmatter":
# Special handling for frontmatter
frontmatter_sections[line_num] = hierarchy
else:
# Extract the leaf title from hierarchy
leaf_title = hierarchy.split(' > ')[-1] if ' > ' in hierarchy else hierarchy

if is_system_variable_or_config(leaf_title):
system_var_sections[line_num] = hierarchy
elif leaf_title.startswith('# '):
# Top-level titles need special handling
toplevel_sections[line_num] = hierarchy
else:
regular_sections[line_num] = hierarchy

thread_safe_print(f" [{thread_id}] 📊 Found {len(system_var_sections)} system variable/config, {len(toplevel_sections)} top-level, {len(frontmatter_sections)} frontmatter, and {len(regular_sections)} regular sections")

target_affected = {}

# Process frontmatter sections with special handling
if frontmatter_sections:
thread_safe_print(f" [{thread_id}] 📄 Processing frontmatter section...")
# For frontmatter, we simply map it to line 0 in target
for line_num, hierarchy in frontmatter_sections.items():
target_affected[line_num] = hierarchy
thread_safe_print(f" [{thread_id}] ✅ Mapped {len(frontmatter_sections)} frontmatter section")

# Process top-level titles with special matching
if toplevel_sections:
thread_safe_print(f" [{thread_id}] 🔝 Top-level title matching for {len(toplevel_sections)} sections...")
from section_matcher import find_toplevel_title_matches
toplevel_matched, toplevel_failed, toplevel_skipped = find_toplevel_title_matches(toplevel_sections, target_lines)

if toplevel_matched:
target_affected.update(toplevel_matched)
thread_safe_print(f" [{thread_id}] ✅ Top-level matched {len(toplevel_matched)} sections")

if toplevel_failed:
thread_safe_print(f" [{thread_id}] ⚠️ {len(toplevel_failed)} top-level sections failed matching")
for failed in toplevel_failed:
thread_safe_print(f" ❌ {failed['hierarchy']}: {failed['reason']}")

# Process system variables/config sections with direct matching
if system_var_sections:
thread_safe_print(f" [{thread_id}] 🎯 Direct matching {len(system_var_sections)} system variable/config sections...")
from section_matcher import find_direct_matches_for_special_files
direct_matched, failed_matches, skipped_sections = find_direct_matches_for_special_files(system_var_sections, target_hierarchy, target_lines)

if direct_matched:
target_affected.update(direct_matched)
thread_safe_print(f" [{thread_id}] ✅ Direct matched {len(direct_matched)} system variable/config sections")

if failed_matches:
thread_safe_print(f" [{thread_id}] ⚠️ {len(failed_matches)} system variable/config sections failed direct matching")
for failed in failed_matches:
thread_safe_print(f" ❌ {failed['hierarchy']}: {failed['reason']}")

# Process regular sections with AI mapping using filtered target hierarchy
if regular_sections:
thread_safe_print(f" [{thread_id}] 🤖 AI mapping {len(regular_sections)} regular sections...")

# Filter target hierarchy to only include non-system sections for AI mapping
from section_matcher import filter_non_system_sections
filtered_target_hierarchy = filter_non_system_sections(target_hierarchy)

# Check if filtered target hierarchy exceeds the maximum allowed for AI mapping
MAX_NON_SYSTEM_SECTIONS_FOR_AI = 120
if len(filtered_target_hierarchy) > MAX_NON_SYSTEM_SECTIONS_FOR_AI:
thread_safe_print(f" [{thread_id}] ❌ Too many non-system sections ({len(filtered_target_hierarchy)} > {MAX_NON_SYSTEM_SECTIONS_FOR_AI})")
thread_safe_print(f" [{thread_id}] ⚠️ Skipping AI mapping for regular sections to avoid complexity")

# If no system sections were matched either, return error
if not target_affected:
error_message = f"File {file_path} has too many non-system sections ({len(filtered_target_hierarchy)} > {MAX_NON_SYSTEM_SECTIONS_FOR_AI}) and no system variable sections were matched"
return False, error_message

# Continue with only system variable matches if available
thread_safe_print(f" [{thread_id}] ✅ Proceeding with {len(target_affected)} system variable/config sections only")
else:
# Proceed with AI mapping using filtered hierarchy
source_list = list(regular_sections.values())
target_list = list(filtered_target_hierarchy.values())

from section_matcher import get_corresponding_sections
ai_response = get_corresponding_sections(source_list, target_list, ai_client, repo_config['source_language'], repo_config['target_language'], max_tokens=20000)
if ai_response:
# Parse AI response and find matching line numbers in the original (unfiltered) hierarchy
from section_matcher import parse_ai_response, find_matching_line_numbers
ai_sections = parse_ai_response(ai_response)
ai_matched = find_matching_line_numbers(ai_sections, target_hierarchy) # Use original hierarchy for line number lookup

if ai_matched:
target_affected.update(ai_matched)
thread_safe_print(f" [{thread_id}] ✅ AI mapped {len(ai_matched)} regular sections")
else:
thread_safe_print(f" [{thread_id}] ⚠️ AI mapping failed for regular sections")
else:
thread_safe_print(f" [{thread_id}] ⚠️ Could not get AI response for regular sections")

# Summary of mapping results
thread_safe_print(f" [{thread_id}] 📊 Total mapped: {len(target_affected)} out of {len(actual_sections)} sections")

if not target_affected:
thread_safe_print(f" [{thread_id}] ⚠️ Could not map sections")
return False, f"Could not map sections for {file_path}"

thread_safe_print(f" [{thread_id}] ✅ Mapped {len(target_affected)} sections")

# Extract target sections (get-target-affected-sections.py logic)
thread_safe_print(f" [{thread_id}] 📝 Extracting target sections...")
from pr_analyzer import extract_affected_sections
target_sections = extract_affected_sections(target_affected, target_lines)

# Extract source old content from the enhanced data structure
thread_safe_print(f" [{thread_id}] 📖 Extracting source old content...")
source_old_content_dict = {}

# Handle different data structures for source_sections
if isinstance(source_sections, dict) and 'sections' in source_sections:
# New format: complete data structure with enhanced matching info
for key, section_info in source_sections.items():
if isinstance(section_info, dict) and 'source_old_content' in section_info:
source_old_content_dict[key] = section_info['source_old_content']
else:
# Fallback: if we don't have the enhanced structure, we need to get it differently
thread_safe_print(f" [{thread_id}] ⚠️ Source sections missing enhanced structure, using fallback")
# For now, create empty dict to avoid errors - this should be addressed in the calling code
source_old_content_dict = {}

# Update sections with AI (get-updated-target-sections.py logic)
thread_safe_print(f" [{thread_id}] 🤖 Getting updated sections from AI...")
updated_sections = get_updated_sections_from_ai(pr_diff, target_sections, source_old_content_dict, ai_client, repo_config['source_language'], repo_config['target_language'], file_path)
if not updated_sections:
thread_safe_print(f" [{thread_id}] ⚠️ Could not get AI update")
return False, f"Could not get AI update for {file_path}"

# Update local document (update-target-doc-v2.py logic)
thread_safe_print(f" [{thread_id}] 💾 Updating local document...")
success = update_local_document(file_path, updated_sections, target_affected, repo_config['target_local_path'])

if success:
thread_safe_print(f" [{thread_id}] 🎉 Successfully updated {file_path}")
return True, f"Successfully updated {file_path}"
else:
thread_safe_print(f" [{thread_id}] ❌ Failed to update {file_path}")
return False, f"Failed to update {file_path}"

except Exception as e:
thread_safe_print(f" [{thread_id}] ❌ Error processing {file_path}: {e}")
return False, f"Error processing {file_path}: {e}"


high

The function process_single_file is extremely long (over 200 lines) and handles many different cases (TOC files, enhanced sections, regular files with different section types). This violates the Single Responsibility Principle and makes the code very difficult to read, test, and maintain. Please refactor this function by breaking it down into smaller, more focused functions, each handling a specific part of the logic.
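
For illustration only, one way to slim it down is a thin dispatcher over per-format handlers; the _process_* helpers below are hypothetical names whose bodies would be extracted from the existing branches of the function:

```python
def process_single_file(file_path, source_sections, pr_diff, pr_url,
                        github_client, ai_client, repo_config,
                        max_non_system_sections=120):
    """Dispatch to a focused handler based on the shape of source_sections."""
    try:
        if isinstance(source_sections, dict) and source_sections.get('type') == 'toc':
            # TOC payload: {'type': 'toc', 'operations': [...]}
            return _process_toc_file(file_path, source_sections, pr_url,
                                     github_client, ai_client, repo_config)
        if isinstance(source_sections, dict) and source_sections.get('type') == 'enhanced_sections':
            # Pre-matched payload: skip matching and go straight to AI translation
            return _process_enhanced_sections(file_path, source_sections['sections'],
                                              pr_diff, ai_client, repo_config)
        # Legacy payload: map sections to the target file first, then translate
        return _process_legacy_sections(file_path, source_sections, pr_diff, pr_url,
                                        github_client, ai_client, repo_config,
                                        max_non_system_sections)
    except Exception as e:
        return False, f"Error processing {file_path}: {e}"
```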

Comment on lines +380 to +525
def process_regular_modified_file(source_file_path, file_sections, file_diff, pr_url, github_client, ai_client, repo_config, max_sections):
    """Process a regular markdown file that has been modified"""
    try:
        print(f" 📝 Processing as regular modified file: {source_file_path}")

        # Extract the actual sections from the file_sections structure
        # file_sections contains: {'sections': {...}, 'original_hierarchy': {...}, 'current_hierarchy': {...}}
        if isinstance(file_sections, dict) and 'sections' in file_sections:
            actual_sections = file_sections['sections']
        else:
            # Fallback: assume file_sections is already the sections dict
            actual_sections = file_sections

        print(f" 📊 Extracted sections: {len(actual_sections)} sections")

        # CRITICAL: Load the source-diff-dict.json and perform matching
        import json
        import os
        from section_matcher import match_source_diff_to_target
        from pr_analyzer import get_target_hierarchy_and_content

        # Load source-diff-dict.json with file prefix
        temp_dir = ensure_temp_output_dir()
        file_prefix = source_file_path.replace('/', '-').replace('.md', '')
        source_diff_dict_file = os.path.join(temp_dir, f"{file_prefix}-source-diff-dict.json")
        if os.path.exists(source_diff_dict_file):
            with open(source_diff_dict_file, 'r', encoding='utf-8') as f:
                source_diff_dict = json.load(f)
            print(f" 📂 Loaded source diff dict with {len(source_diff_dict)} sections from {source_diff_dict_file}")

            # Check source token limit before proceeding with processing
            print(f" 🔍 Checking source token limit...")
            within_limit, total_tokens, token_limit = check_source_token_limit(source_diff_dict_file)
            if not within_limit:
                print(f" 🚫 Skipping file processing: source content exceeds token limit")
                print(f" 📊 Total tokens: {total_tokens:,} > Limit: {token_limit:,}")
                print(f" ⏭️ File {source_file_path} will not be processed")
                return False

        else:
            print(f" ❌ {source_diff_dict_file} not found")
            return False

        # Get target file hierarchy and content
        target_repo = repo_config['target_repo']
        target_hierarchy, target_lines = get_target_hierarchy_and_content(source_file_path, github_client, target_repo)

        if not target_hierarchy or not target_lines:
            print(f" ❌ Could not get target file content for {source_file_path}")
            return False

        print(f" 📖 Target file: {len(target_hierarchy)} sections, {len(target_lines)} lines")

        # Perform source diff to target matching
        print(f" 🔗 Matching source diff to target...")
        enhanced_sections = match_source_diff_to_target(
            source_diff_dict,
            target_hierarchy,
            target_lines,
            ai_client,
            repo_config,
            max_sections,
            AI_MAX_TOKENS
        )

        if not enhanced_sections:
            print(f" ❌ No sections matched")
            return False

        print(f" ✅ Matched {len(enhanced_sections)} sections")

        # Save the match result for reference
        match_file = os.path.join(temp_dir, f"{source_file_path.replace('/', '-').replace('.md', '')}-match_source_diff_to_target.json")
        with open(match_file, 'w', encoding='utf-8') as f:
            json.dump(enhanced_sections, f, ensure_ascii=False, indent=2)
        print(f" 💾 Saved match result to: {match_file}")

        # Step 2: Get AI translation for the matched sections
        print(f" 🤖 Getting AI translation for matched sections...")

        # Create file data structure with enhanced matching info
        # Wrap enhanced_sections in the expected format for process_single_file
        file_data = {
            source_file_path: {
                'type': 'enhanced_sections',
                'sections': enhanced_sections
            }
        }

        # Call the existing process_modified_sections function to get AI translation
        results = process_modified_sections(file_data, file_diff, pr_url, github_client, ai_client, repo_config, max_sections)

        # Step 3: Update match_source_diff_to_target.json with AI results
        if results and len(results) > 0:
            file_path, success, ai_updated_sections = results[0]  # Get first result
            if success and isinstance(ai_updated_sections, dict):
                print(f" 📝 Step 3: Updating {match_file} with AI results...")

                # Load current match_source_diff_to_target.json
                with open(match_file, 'r', encoding='utf-8') as f:
                    match_data = json.load(f)

                # Add target_new_content field to each section based on AI results
                updated_count = 0
                for key, section_data in match_data.items():
                    operation = section_data.get('source_operation', '')

                    if operation == 'deleted':
                        # For deleted sections, set target_new_content to null
                        section_data['target_new_content'] = None
                    elif key in ai_updated_sections:
                        # For modified/added sections with AI translation
                        section_data['target_new_content'] = ai_updated_sections[key]
                        updated_count += 1
                    else:
                        # For sections not translated, keep original content
                        section_data['target_new_content'] = section_data.get('target_content', '')

                # Save updated match_source_diff_to_target.json
                with open(match_file, 'w', encoding='utf-8') as f:
                    json.dump(match_data, f, ensure_ascii=False, indent=2)

                print(f" ✅ Updated {updated_count} sections with AI translations in {match_file}")

                # Step 4: Apply updates to target document using update_target_document_from_match_data
                print(f" 📝 Step 4: Applying updates to target document...")
                from file_updater import update_target_document_from_match_data

                success = update_target_document_from_match_data(match_file, repo_config['target_local_path'], source_file_path)
                if success:
                    print(f" 🎉 Target document successfully updated!")
                    return True
                else:
                    print(f" ❌ Failed to update target document")
                    return False

            else:
                print(f" ⚠️ AI translation failed or returned invalid results")
                return False
        else:
            print(f" ⚠️ No results from process_modified_sections")
            return False

    except Exception as e:
        print(f" ❌ Error processing regular modified file {source_file_path}: {e}")
        return False


high

The function process_regular_modified_file is very large and contains complex logic for processing a single file. This function's logic is tightly coupled with file updating and matching, and it would be better placed in the file_updater.py or section_matcher.py module. Moving this out of main_workflow.py would make the main workflow clearer and improve the overall modularity of the package.

Comment on lines +979 to +1447
def analyze_source_changes(pr_url, github_client, special_files=None, ignore_files=None, repo_configs=None, max_non_system_sections=120, pr_diff=None):
"""Analyze source language changes and categorize them as added, modified, or deleted"""
# Import modules needed in this function
import os
import json
from toc_processor import process_toc_operations

owner, repo, pr_number = parse_pr_url(pr_url)
repository = github_client.get_repo(f"{owner}/{repo}")
pr = repository.get_pull(pr_number)

# Get repository configuration for target repo info
repo_config = get_repo_config(pr_url, repo_configs)

print(f"📋 Processing PR #{pr_number}: {pr.title}")

# Get markdown files
files = pr.get_files()
markdown_files = [f for f in files if f.filename.endswith('.md')]

print(f"📄 Found {len(markdown_files)} markdown files")

# Return dictionaries for different operation types
added_sections = {} # New sections that were added
modified_sections = {} # Existing sections that were modified
deleted_sections = {} # Sections that were deleted
added_files = {} # Completely new files that were added
deleted_files = [] # Completely deleted files
ignored_files = [] # Files that were ignored
toc_files = {} # Special TOC files requiring special processing

for file in markdown_files:
print(f"\n🔍 Analyzing {file.filename}")

# Check if this file should be ignored
if file.filename in ignore_files:
print(f" ⏭️ Skipping ignored file: {file.filename}")
ignored_files.append(file.filename)
continue

# Check if this is a completely new file or deleted file
if file.status == 'added':
print(f" ➕ Detected new file: {file.filename}")
try:
file_content = repository.get_contents(file.filename, ref=pr.head.sha).decoded_content.decode('utf-8')
added_files[file.filename] = file_content
print(f" ✅ Added complete file for translation")
continue
except Exception as e:
print(f" ❌ Error getting new file content: {e}")
continue

elif file.status == 'removed':
print(f" 🗑️ Detected deleted file: {file.filename}")
deleted_files.append(file.filename)
print(f" ✅ Marked file for deletion")
continue

# For modified files, check if it's a special file like TOC.md
try:
file_content = repository.get_contents(file.filename, ref=pr.head.sha).decoded_content.decode('utf-8')
except Exception as e:
print(f" ❌ Error getting content: {e}")
continue

# Check if this is a TOC.md file requiring special processing
if os.path.basename(file.filename) in special_files:
print(f" 📋 Detected special file: {file.filename}")

# Get target file content for comparison
try:
target_repository = github_client.get_repo(repo_config['target_repo'])
target_file_content = target_repository.get_contents(file.filename, ref="master").decoded_content.decode('utf-8')
target_lines = target_file_content.split('\n')
except Exception as e:
print(f" ⚠️ Could not get target file content: {e}")
continue

# Analyze diff operations for TOC.md
operations = analyze_diff_operations(file)
source_lines = file_content.split('\n')

# Process with special TOC logic
toc_results = process_toc_operations(file.filename, operations, source_lines, target_lines, "") # Local path will be determined later

# Store TOC operations for later processing
if any([toc_results['added'], toc_results['modified'], toc_results['deleted']]):
# Combine all operations for processing
all_toc_operations = []
all_toc_operations.extend(toc_results['added'])
all_toc_operations.extend(toc_results['modified'])
all_toc_operations.extend(toc_results['deleted'])

# Add to special TOC processing queue (separate from regular sections)
toc_files[file.filename] = {
'type': 'toc',
'operations': all_toc_operations
}

print(f" 📋 TOC operations queued for processing:")
if toc_results['added']:
print(f" ➕ Added: {len(toc_results['added'])} entries")
if toc_results['modified']:
print(f" ✏️ Modified: {len(toc_results['modified'])} entries")
if toc_results['deleted']:
print(f" ❌ Deleted: {len(toc_results['deleted'])} entries")
else:
print(f" ℹ️ No TOC operations found")

continue # Skip regular processing for TOC files

# Analyze diff operations
operations = analyze_diff_operations(file)
print(f" 📝 Diff analysis: {len(operations['added_lines'])} added, {len(operations['modified_lines'])} modified, {len(operations['deleted_lines'])} deleted lines")

lines = file_content.split('\n')
all_headers = {}

# Track code block state
in_code_block = False
code_block_delimiter = None

# First pass: collect all headers (excluding those in code blocks)
for line_num, line in enumerate(lines, 1):
original_line = line
line = line.strip()

# Check for code block delimiters
if line.startswith('```') or line.startswith('~~~'):
if not in_code_block:
# Entering a code block
in_code_block = True
code_block_delimiter = line[:3]
continue
elif line.startswith(code_block_delimiter):
# Exiting a code block
in_code_block = False
code_block_delimiter = None
continue

# Skip processing if we're inside a code block
if in_code_block:
continue

# Process headers only if not in code block
if line.startswith('#'):
match = re.match(r'^(#{1,10})\s+(.+)', line)
if match:
level = len(match.group(1))
title = match.group(2).strip()
all_headers[line_num] = {
'level': level,
'title': title,
'line': line
}

# Build complete hierarchy from HEAD (after changes)
all_hierarchy_dict = build_hierarchy_dict(file_content)

# For deletion detection, we also need the base file hierarchy
try:
base_file_content = repository.get_contents(file.filename, ref=f"{repository.default_branch}").decoded_content.decode('utf-8')
base_hierarchy_dict = build_hierarchy_dict(base_file_content)
except Exception as e:
print(f" ⚠️ Could not get base file content: {e}")
base_hierarchy_dict = all_hierarchy_dict
base_file_content = file_content # Fallback to current content

# Find sections by operation type with corrected logic
sections_by_type = find_sections_by_operation_type(lines, operations, all_headers, base_hierarchy_dict)

# Prioritize modified headers over added ones (fix for header changes like --host -> --hosts)
modified_header_lines = set()
for modified_line in operations['modified_lines']:
if modified_line['is_header']:
modified_header_lines.add(modified_line['line_number'])

# Remove modified header lines from added set
sections_by_type['added'] = sections_by_type['added'] - modified_header_lines

# Enhanced logic: check for actual content changes within sections
# This helps detect changes in section content (not just headers)
print(f" 🔍 Enhanced detection: checking for actual section content changes...")

# Get only lines that have actual content changes (exclude headers)
real_content_changes = set()

# Added lines (new content, excluding headers)
for added_line in operations['added_lines']:
if not added_line['is_header']:
real_content_changes.add(added_line['line_number'])

# Deleted lines (removed content, excluding headers)
for deleted_line in operations['deleted_lines']:
if not deleted_line['is_header']:
real_content_changes.add(deleted_line['line_number'])

# Modified lines (changed content, excluding headers)
for modified_line in operations['modified_lines']:
if not modified_line['is_header']:
real_content_changes.add(modified_line['line_number'])

print(f" 📝 Real content changes (non-header): {sorted(real_content_changes)}")

# Find sections that contain actual content changes
content_affected_sections = set()
for changed_line in real_content_changes:
# Find which section this changed line belongs to
containing_section = None
for line_num in sorted(all_headers.keys()):
if line_num <= changed_line:
containing_section = line_num
else:
break

if containing_section and containing_section not in sections_by_type['added']:
# Additional check: make sure this is not just a line number shift
# Only add if the change is within reasonable distance from the section header
# AND if the changed line is not part of a completely deleted section header
is_deleted_header = False
for deleted_line in operations['deleted_lines']:
if (deleted_line['is_header'] and
abs(changed_line - deleted_line['line_number']) <= 2):
is_deleted_header = True
print(f" ⚠️ Skipping change at line {changed_line} (deleted header near line {deleted_line['line_number']})")
break

# More precise filtering: check if this change is actually meaningful
# Skip changes that are part of deleted content or line shifts due to deletions
should_include = True

# Skip exact deleted headers
for deleted_line in operations['deleted_lines']:
if (deleted_line['is_header'] and
changed_line == deleted_line['line_number']):
should_include = False
print(f" ⚠️ Skipping change at line {changed_line} (exact deleted header)")
break

# Skip changes that are very close to deleted content AND far from their containing section
# This helps filter out line shift artifacts while keeping real content changes
if should_include:
for deleted_line in operations['deleted_lines']:
# Only skip if both conditions are met:
# 1. Very close to deleted content (within 5 lines)
# 2. The change is far from its containing section (likely a shift artifact)
distance_to_deletion = abs(changed_line - deleted_line['line_number'])
distance_to_section = changed_line - containing_section

if (distance_to_deletion <= 5 and distance_to_section > 100):
should_include = False
print(f" ⚠️ Skipping change at line {changed_line} (likely line shift: {distance_to_deletion} lines from deletion, {distance_to_section} from section)")
break

if should_include and changed_line - containing_section <= 30:
content_affected_sections.add(containing_section)
print(f" 📝 Content change at line {changed_line} affects section at line {containing_section}")
elif should_include:
print(f" ⚠️ Skipping distant change at line {changed_line} from section {containing_section}")

# Add content-modified sections to the modified set, but exclude sections that are already marked as added or deleted
for line_num in content_affected_sections:
if (line_num not in sections_by_type['modified'] and
line_num not in sections_by_type['added'] and
line_num not in sections_by_type['deleted']): # ✅ Critical fix: exclude deleted sections
sections_by_type['modified'].add(line_num)
print(f" 📝 Added content-modified section at line {line_num}")
elif line_num in sections_by_type['deleted']:
print(f" 🚫 Skipping content-modified section at line {line_num}: already marked as deleted")

# Prepare sections data for source_diff_dict
file_modified = {}
file_added = {}
file_deleted = {}

# Build modified sections
for line_num in sections_by_type['modified']:
if line_num in all_hierarchy_dict:
file_modified[str(line_num)] = all_hierarchy_dict[line_num]

# Build added sections
for line_num in sections_by_type['added']:
if line_num in all_hierarchy_dict:
file_added[str(line_num)] = all_hierarchy_dict[line_num]

# Build deleted sections
for line_num in sections_by_type['deleted']:
if line_num in base_hierarchy_dict:
file_deleted[str(line_num)] = base_hierarchy_dict[line_num]

# Check for frontmatter changes (content before first top-level header)
print(f" 🔍 Checking for frontmatter changes...")
frontmatter_changed = False

# Check if any changes occur before the first top-level header
first_header_line = None
for line_num in sorted(all_headers.keys()):
header_info = all_headers[line_num]
if header_info['level'] == 1: # First top-level header
first_header_line = line_num
break

print(f" 📊 First header line: {first_header_line}")
print(f" 📊 Real content changes: {sorted(real_content_changes)}")

if first_header_line:
# Check if any real content changes are before the first header
for line_num in real_content_changes:
#print(f" 🔍 Checking line {line_num} vs first header {first_header_line}")
if line_num < first_header_line:
frontmatter_changed = True
print(f" 📄 Frontmatter change detected: line {line_num} < {first_header_line}")
break

print(f" 📊 Frontmatter changed: {frontmatter_changed}")

if frontmatter_changed:
print(f" 📄 Frontmatter changes detected (before line {first_header_line})")
# Add frontmatter as a special section with line number 0
file_modified["0"] = "frontmatter"
print(f" ✅ Added frontmatter section to modified sections")

# Build source diff dictionary
source_diff_dict = build_source_diff_dict(
file_modified, file_added, file_deleted,
all_hierarchy_dict, base_hierarchy_dict,
operations, file_content, base_file_content
)

# Breakpoint: Output source_diff_dict to file for review with file prefix

# Ensure temp_output directory exists
script_dir = os.path.dirname(os.path.abspath(__file__))
temp_dir = os.path.join(script_dir, "temp_output")
os.makedirs(temp_dir, exist_ok=True)

file_prefix = file.filename.replace('/', '-').replace('.md', '')
output_file = os.path.join(temp_dir, f"{file_prefix}-source-diff-dict.json")
with open(output_file, 'w', encoding='utf-8') as f:
json.dump(source_diff_dict, f, ensure_ascii=False, indent=2)

print(f" 💾 Saved source diff dictionary to: {output_file}")
print(f" 📊 Source diff dictionary contains {len(source_diff_dict)} sections:")
for key, diff_info in source_diff_dict.items():
print(f" {diff_info['operation']}: {key} -> original_hierarchy: {diff_info['original_hierarchy']}")

# source-diff-dict.json generation is complete, continue to next step in main.py

# For modified headers, we need to build a mapping using original titles for matching
original_hierarchy_dict = all_hierarchy_dict.copy()

# Update hierarchy dict to use original content for modified headers when needed for matching
for line_num in sections_by_type['modified']:
if line_num in all_headers:
header_info = all_headers[line_num]
# Check if this header was modified and has original content
for op in operations['modified_lines']:
if (op['is_header'] and
op['line_number'] == line_num and
'original_content' in op):
# Create hierarchy path using original content for matching
original_line = op['original_content'].strip()
if original_line.startswith('#'):
# Build original hierarchy for matching
original_hierarchy = build_hierarchy_for_modified_section(
file_content, line_num, original_line, all_hierarchy_dict)
if original_hierarchy:
original_hierarchy_dict[line_num] = original_hierarchy
break

# Process added sections
if sections_by_type['added']:
file_added = {}
# Find insertion points using the simplified logic:
# Record the previous section hierarchy for each added section
insertion_points = find_previous_section_for_added(sections_by_type['added'], all_hierarchy_dict)

# Get actual content for added sections
for line_num in sections_by_type['added']:
if line_num in all_hierarchy_dict:
file_added[str(line_num)] = all_hierarchy_dict[line_num]

# Get source sections content (actual content, not just hierarchy)
if file_added:
source_sections_content = get_source_sections_content(pr_url, file.filename, file_added, github_client)
file_added = source_sections_content # Replace hierarchy with actual content

if file_added:
added_sections[file.filename] = {
'sections': file_added,
'insertion_points': insertion_points
}
print(f" ➕ Found {len(file_added)} added sections with {len(insertion_points)} insertion points")

# Process modified sections
if sections_by_type['modified']:
file_modified = {}
for line_num in sections_by_type['modified']:
if line_num in original_hierarchy_dict:
file_modified[str(line_num)] = original_hierarchy_dict[line_num]

if file_modified:
modified_sections[file.filename] = {
'sections': file_modified,
'original_hierarchy': original_hierarchy_dict,
'current_hierarchy': all_hierarchy_dict
}
print(f" ✏️ Found {len(file_modified)} modified sections")

# Process deleted sections
if sections_by_type['deleted']:
file_deleted = {}
for line_num in sections_by_type['deleted']:
# Use base hierarchy to get the deleted section info
if line_num in base_hierarchy_dict:
file_deleted[str(line_num)] = base_hierarchy_dict[line_num]

if file_deleted:
deleted_sections[file.filename] = file_deleted
print(f" ❌ Found {len(file_deleted)} deleted sections")

# Enhanced logic: also check content-level changes using legacy detection
# This helps detect changes in section content (not just headers)
print(f" 🔍 Enhanced detection: checking content-level changes...")
changed_lines = get_changed_line_ranges(file)
affected_sections = find_affected_sections(lines, changed_lines, all_headers)

legacy_modified = {}
for line_num in affected_sections:
if line_num in all_hierarchy_dict:
section_hierarchy = all_hierarchy_dict[line_num]
# Only add if not already detected by operation-type analysis
already_detected = False
if file.filename in modified_sections:
for existing_line, existing_hierarchy in modified_sections[file.filename].get('sections', {}).items():
if existing_hierarchy == section_hierarchy:
already_detected = True
break

if not already_detected:
legacy_modified[str(line_num)] = section_hierarchy

if legacy_modified:
print(f" ✅ Enhanced detection found {len(legacy_modified)} additional content-modified sections")
# Merge with existing modified sections
if file.filename in modified_sections:
# Merge the sections
existing_sections = modified_sections[file.filename].get('sections', {})
existing_sections.update(legacy_modified)
modified_sections[file.filename]['sections'] = existing_sections
else:
# Create new entry
modified_sections[file.filename] = {
'sections': legacy_modified,
'original_hierarchy': all_hierarchy_dict,
'current_hierarchy': all_hierarchy_dict
}

print(f"\n📊 Summary:")
#print(f" ✏️ Modified files: {} files")
print(f" 📄 Added files: {len(added_files)} files")
print(f" 🗑️ Deleted files: {len(deleted_files)} files")
print(f" 📋 TOC files: {len(toc_files)} files")
if ignored_files:
print(f" ⏭️ Ignored files: {len(ignored_files)} files")
for ignored_file in ignored_files:
print(f" - {ignored_file}")

return added_sections, modified_sections, deleted_sections, added_files, deleted_files, toc_files


high

The analyze_source_changes function is very long and complex, spanning hundreds of lines. It handles file analysis, diff parsing, hierarchy building, and categorizing changes. To improve readability and maintainability, this function should be refactored into several smaller, more specialized functions. Each smaller function should have a single, well-defined responsibility.

Comment on lines +572 to +689
def match_source_diff_to_target(source_diff_dict, target_hierarchy, target_lines, ai_client, repo_config, max_non_system_sections=120, max_tokens=20000):
    """
    Match source_diff_dict original_hierarchy to target file sections
    Uses direct matching for system variables/config and AI matching for others

    Returns:
        dict: Matched sections with enhanced information including:
            - target_line: Line number in target file
            - target_hierarchy: Target section hierarchy
            - insertion_type: For added sections only
            - source_original_hierarchy: Original hierarchy from source
            - source_operation: Operation type (modified/added/deleted)
            - source_old_content: Old content from source diff
            - source_new_content: New content from source diff
    """
    thread_safe_print(f"🔗 Starting source diff to target matching...")

    # Extract hierarchies from source diff dict
    source_hierarchies = extract_hierarchies_from_diff_dict(source_diff_dict)

    if not source_hierarchies:
        thread_safe_print(f"⚠️ No hierarchies to match")
        return {}

    # Process sections in original order to maintain consistency
    # Initialize final matching results with ordered dict to preserve order
    from collections import OrderedDict
    all_matched_sections = OrderedDict()

    # Categorize sections for processing strategy but maintain order
    direct_match_sections = OrderedDict()
    ai_match_sections = OrderedDict()
    added_sections = OrderedDict()
    bottom_sections = OrderedDict()  # New category for bottom sections

    for key, hierarchy in source_hierarchies.items():
        # Check if this is a bottom section (no matching needed)
        if hierarchy.startswith('bottom-'):
            bottom_sections[key] = hierarchy
        # Check if this is an added section
        elif key.startswith('added_'):
            added_sections[key] = hierarchy
        else:
            # Extract the leaf title from hierarchy for checking
            leaf_title = hierarchy.split(' > ')[-1] if ' > ' in hierarchy else hierarchy

            # Check if this is suitable for direct matching
            if (hierarchy == "frontmatter" or
                    leaf_title.startswith('# ') or  # Top-level titles
                    is_system_variable_or_config(leaf_title)):  # System variables/config
                direct_match_sections[key] = hierarchy
            else:
                ai_match_sections[key] = hierarchy

    thread_safe_print(f"📊 Section categorization:")
    thread_safe_print(f" 🎯 Direct matching: {len(direct_match_sections)} sections")
    thread_safe_print(f" 🤖 AI matching: {len(ai_match_sections)} sections")
    thread_safe_print(f" ➕ Added sections: {len(added_sections)} sections")
    thread_safe_print(f" 🔚 Bottom sections: {len(bottom_sections)} sections (no matching needed)")

    # Process each section in original order
    thread_safe_print(f"\n🔄 Processing sections in original order...")

    for key, hierarchy in source_hierarchies.items():
        thread_safe_print(f" 🔍 Processing {key}: {hierarchy}")

        # Determine processing strategy based on section type and content
        if hierarchy.startswith('bottom-'):
            # Bottom section - no matching needed, append to end
            thread_safe_print(f" 🔚 Bottom section - append to end of document")
            result = {
                "target_line": "-1",  # Special marker for bottom sections
                "target_hierarchy": hierarchy  # Keep the bottom-xxx marker
            }
        elif key.startswith('added_'):
            # Added section - find insertion point
            thread_safe_print(f" ➕ Added section - finding insertion point")
            result = process_added_section(key, hierarchy, target_hierarchy, target_lines, ai_client, repo_config, max_non_system_sections, max_tokens)
        else:
            # Modified or deleted section - find matching section
            operation = source_diff_dict[key].get('operation', 'unknown')
            thread_safe_print(f" {operation.capitalize()} section - finding target match")
            result = process_modified_or_deleted_section(key, hierarchy, target_hierarchy, target_lines, ai_client, repo_config, max_non_system_sections, max_tokens)

        if result:
            # Add source language information from source_diff_dict
            source_info = source_diff_dict.get(key, {})

            # Extract target content from target_lines
            target_line = result.get('target_line', 'unknown')
            target_content = ""
            if target_line != 'unknown' and target_line != '0':
                try:
                    target_line_num = int(target_line)
                    # For ALL operations, only extract direct content (no sub-sections)
                    # This avoids duplication when both parent and child sections have operations
                    target_content = extract_section_direct_content(target_line_num, target_lines)
                except (ValueError, IndexError):
                    target_content = ""
            elif target_line == '0':
                # For frontmatter, extract content from beginning to first header
                target_content = extract_frontmatter_content(target_lines)

            enhanced_result = {
                **result,  # Include existing target matching info
                'target_content': target_content,  # Add target section content
                'source_original_hierarchy': source_info.get('original_hierarchy', ''),
                'source_operation': source_info.get('operation', ''),
                'source_old_content': source_info.get('old_content', ''),
                'source_new_content': source_info.get('new_content', '')
            }
            all_matched_sections[key] = enhanced_result
            thread_safe_print(f" ✅ {key}: -> line {target_line}")
        else:
            thread_safe_print(f" ❌ {key}: matching failed")

    thread_safe_print(f"\n📊 Final matching results: {len(all_matched_sections)} total matches")
    return all_matched_sections


high

The match_source_diff_to_target function is very large and complex. It orchestrates direct matching, AI matching, and processes different types of sections. This makes it difficult to understand and maintain. Please consider breaking this function down into smaller, more focused helper functions to improve its structure and readability.
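
As one hedged example, the per-section strategy decision could be factored out so the main loop only orchestrates (the helper name below is hypothetical):

```python
def _matching_strategy(key, hierarchy):
    """Return which matching path a source section should take."""
    if hierarchy.startswith('bottom-'):
        return 'bottom'   # appended to the end of the document, no matching needed
    if key.startswith('added_'):
        return 'added'    # needs an insertion point
    leaf = hierarchy.split(' > ')[-1] if ' > ' in hierarchy else hierarchy
    if hierarchy == "frontmatter" or leaf.startswith('# ') or is_system_variable_or_config(leaf):
        return 'direct'   # frontmatter, top-level titles, system variables/config
    return 'ai'           # everything else goes through AI matching
```

match_source_diff_to_target would then reduce to a loop that looks up the strategy, calls the corresponding handler, and merges the source_diff_dict metadata into the result.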

Comment on lines +16 to +18
def thread_safe_print(*args, **kwargs):
    with print_lock:
        print(*args, **kwargs)


medium

The thread_safe_print function is duplicated across multiple files in this package (including file_deleter.py, file_updater.py, main_workflow.py, etc.). To improve maintainability and adhere to the DRY (Don't Repeat Yourself) principle, this function should be defined once in a shared utility module and imported where needed.
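
A minimal shared helper, say in a hypothetical scripts/translate_doc_pr/utils.py, might look like this; each module would then drop its local copy and import it instead:

```python
# utils.py (hypothetical shared module)
import threading

_print_lock = threading.Lock()


def thread_safe_print(*args, **kwargs):
    """Serialize console output so messages from worker threads do not interleave."""
    with _print_lock:
        print(*args, **kwargs)
```

Callers would switch to `from utils import thread_safe_print` (or a relative import once the package imports are fixed).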

Comment on lines +20 to +73
def create_section_batches(file_content, max_lines_per_batch=200):
    """Create batches of file content for translation, respecting section boundaries"""
    lines = file_content.split('\n')

    # Find all section headers
    section_starts = []
    for i, line in enumerate(lines):
        line = line.strip()
        if line.startswith('#'):
            match = re.match(r'^(#{1,10})\s+(.+)', line)
            if match:
                section_starts.append(i + 1)  # 1-based line numbers

    # If no sections found, just batch by line count
    if not section_starts:
        batches = []
        for i in range(0, len(lines), max_lines_per_batch):
            batch_lines = lines[i:i + max_lines_per_batch]
            batches.append('\n'.join(batch_lines))
        return batches

    # Create batches respecting section boundaries
    batches = []
    current_batch_start = 0

    for i, section_start in enumerate(section_starts):
        section_start_idx = section_start - 1  # Convert to 0-based

        # Check if adding this section would exceed the line limit
        if (section_start_idx - current_batch_start) > max_lines_per_batch:
            # Close current batch at the previous section boundary
            if current_batch_start < section_start_idx:
                batch_lines = lines[current_batch_start:section_start_idx]
                batches.append('\n'.join(batch_lines))
                current_batch_start = section_start_idx

        # If this is the last section, or the next section would create a batch too large
        if i == len(section_starts) - 1:
            # Add remaining content as final batch
            batch_lines = lines[current_batch_start:]
            batches.append('\n'.join(batch_lines))
        else:
            next_section_start = section_starts[i + 1] - 1  # 0-based
            if (next_section_start - current_batch_start) > max_lines_per_batch:
                # Close current batch at current section boundary
                batch_lines = lines[current_batch_start:section_start_idx]
                if batch_lines:  # Only add non-empty batches
                    batches.append('\n'.join(batch_lines))
                current_batch_start = section_start_idx

    # Clean up any empty batches
    batches = [batch for batch in batches if batch.strip()]

    return batches


medium

The logic in create_section_batches is quite complex and hard to follow, especially with the lookahead checks for the next section. This complexity increases the risk of bugs and makes the function difficult to maintain.

Consider refactoring this to a simpler, more straightforward algorithm. A simpler approach, sketched after this list, could be:

  1. Initialize an empty current_batch and a list of batches.
  2. Iterate through the sections (or lines if no sections).
  3. If adding the current section/line to current_batch does not exceed max_lines_per_batch, add it.
  4. Otherwise, add current_batch to batches, and start a new current_batch with the current section/line.
  5. After the loop, add the final current_batch if it's not empty.
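
A minimal sketch of that accumulation approach, assuming the same header-based section split as the current code (a single section longer than the limit simply stays in its own oversized batch):

```python
import re

def create_section_batches(file_content, max_lines_per_batch=200):
    """Pack whole sections into batches; start a new batch when the limit would be exceeded."""
    lines = file_content.split('\n')
    header_idx = [i for i, line in enumerate(lines)
                  if re.match(r'^#{1,10}\s+\S', line.strip())]

    # Cut points: start of file, every header, end of file.
    cuts = sorted({0, *header_idx, len(lines)})
    sections = [lines[a:b] for a, b in zip(cuts, cuts[1:])]

    batches, current = [], []
    for section in sections:
        if current and len(current) + len(section) > max_lines_per_batch:
            batches.append('\n'.join(current))
            current = []
        current.extend(section)
    if current:
        batches.append('\n'.join(current))
    return [batch for batch in batches if batch.strip()]
```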

Comment on lines +101 to +121
    try:
        from main import print_token_estimation
        print_token_estimation(prompt, "File addition translation")
    except ImportError:
        # Fallback if import fails - use tiktoken
        try:
            import tiktoken
            enc = tiktoken.get_encoding("cl100k_base")
            tokens = enc.encode(prompt)
            actual_tokens = len(tokens)
            char_count = len(prompt)
            print(f"   💰 File addition translation")
            print(f"      📝 Input: {char_count:,} characters")
            print(f"      🔢 Actual tokens: {actual_tokens:,} (using tiktoken cl100k_base)")
        except Exception:
            # Final fallback to character approximation
            estimated_tokens = len(prompt) // 4
            char_count = len(prompt)
            print(f"   💰 File addition translation")
            print(f"      📝 Input: {char_count:,} characters")
            print(f"      🔢 Estimated tokens: ~{estimated_tokens:,} (fallback: 4 chars/token approximation)")


medium

This block for token estimation is duplicated in several modules (file_updater.py, section_matcher.py, toc_processor.py). This duplicated code should be refactored into a single utility function in a shared module to avoid redundancy and make future changes easier.
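
One way to centralize it, mirroring the existing main_workflow.print_token_estimation (a sketch only; the shared `utils.py` module is hypothetical and the exact log wording is illustrative):

```python
# scripts/translate_doc_pr/utils.py (hypothetical shared module)
def print_token_estimation(prompt, label):
    """Log the input size of an AI call, using tiktoken when available."""
    char_count = len(prompt)
    try:
        import tiktoken
        enc = tiktoken.get_encoding("cl100k_base")
        token_line = f"Actual tokens: {len(enc.encode(prompt)):,} (using tiktoken cl100k_base)"
    except Exception:
        token_line = f"Estimated tokens: ~{char_count // 4:,} (fallback: 4 chars/token approximation)"
    print(f"   💰 {label}")
    print(f"      📝 Input: {char_count:,} characters")
    print(f"      🔢 {token_line}")
```

Callers then replace the whole try/except block with a single `print_token_estimation(prompt, "File addition translation")`.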

Comment on lines +251 to +273
def get_pr_diff(pr_url, github_client):
    """Get the diff content from a GitHub PR (from auto-sync-pr-changes.py)"""
    try:
        from pr_analyzer import parse_pr_url
        owner, repo, pr_number = parse_pr_url(pr_url)
        repository = github_client.get_repo(f"{owner}/{repo}")
        pr = repository.get_pull(pr_number)

        # Get files and their patches
        files = pr.get_files()
        diff_content = []

        for file in files:
            if file.filename.endswith('.md') and file.patch:
                diff_content.append(f"File: {file.filename}")
                diff_content.append(file.patch)
                diff_content.append("-" * 80)

        return "\n".join(diff_content)

    except Exception as e:
        thread_safe_print(f"   ❌ Error getting PR diff: {e}")
        return None


medium

The get_pr_diff function is duplicated in pr_analyzer.py. To adhere to the DRY (Don't Repeat Yourself) principle, this function should be defined in only one place (for example, in pr_analyzer.py) and imported here.

Comment on lines +20 to +37
def clean_title_for_matching(title):
    """Clean title for matching by removing markdown formatting and span elements"""
    if not title:
        return ""

    # Remove span elements like <span class="version-mark">New in v5.0</span>
    title = re.sub(r'<span[^>]*>.*?</span>', '', title)

    # Remove markdown header prefix (# ## ### etc.)
    title = re.sub(r'^#{1,6}\s*', '', title.strip())

    # Remove backticks
    title = title.replace('`', '')

    # Strip whitespace
    title = title.strip()

    return title


medium

The clean_title_for_matching function is duplicated from pr_analyzer.py. This and other shared utility functions should be centralized in a single utility module to avoid code duplication and ensure consistency.

@qiancai qiancai merged commit ab90fca into master Sep 25, 2025
6 of 7 checks passed

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 19

🧹 Nitpick comments (30)
.github/workflows/sync-docs-cn-to-en.yml (1)

46-70: Harden PR URL parsing.

cut -f7 is brittle (fails with trailing paths like /files). Use regex to extract owner/repo/PR number.

Apply:

       - name: Extract PR information
         id: extract_info
         run: |
           # Extract source repo info
           SOURCE_URL="${{ github.event.inputs.source_pr_url }}"
-          SOURCE_OWNER=$(echo $SOURCE_URL | cut -d'/' -f4)
-          SOURCE_REPO=$(echo $SOURCE_URL | cut -d'/' -f5)
-          SOURCE_PR=$(echo $SOURCE_URL | cut -d'/' -f7)
+          SOURCE_OWNER=$(echo "$SOURCE_URL" | sed -E 's#https?://github\.com/([^/]+)/([^/]+)/pull/([0-9]+).*#\1#')
+          SOURCE_REPO=$(echo "$SOURCE_URL" | sed -E 's#https?://github\.com/([^/]+)/([^/]+)/pull/([0-9]+).*#\2#')
+          SOURCE_PR=$(echo "$SOURCE_URL" | sed -E 's#https?://github\.com/[^/]+/[^/]+/pull/([0-9]+).*#\1#')
           
           # Extract target repo info
           TARGET_URL="${{ github.event.inputs.target_pr_url }}"
-          TARGET_OWNER=$(echo $TARGET_URL | cut -d'/' -f4)
-          TARGET_REPO=$(echo $TARGET_URL | cut -d'/' -f5)
-          TARGET_PR=$(echo $TARGET_URL | cut -d'/' -f7)
+          TARGET_OWNER=$(echo "$TARGET_URL" | sed -E 's#https?://github\.com/([^/]+)/([^/]+)/pull/([0-9]+).*#\1#')
+          TARGET_REPO=$(echo "$TARGET_URL" | sed -E 's#https?://github\.com/([^/]+)/([^/]+)/pull/([0-9]+).*#\2#')
+          TARGET_PR=$(echo "$TARGET_URL" | sed -E 's#https?://github\.com/[^/]+/[^/]+/pull/([0-9]+).*#\1#')
scripts/translate_doc_pr/file_adder.py (5)

8-12: Remove unused imports.

json, Github, OpenAI are unused here.

-import json
 import threading
-from github import Github
-from openai import OpenAI

20-74: Fix batching to respect section boundaries and max_lines reliably.

The current logic can drop content and create empty batches. Pack whole section ranges into batches first, then split any batch that still exceeds the maximum.

-def create_section_batches(file_content, max_lines_per_batch=200):
-    """Create batches of file content for translation, respecting section boundaries"""
-    lines = file_content.split('\n')
-    
-    # Find all section headers
-    section_starts = []
-    for i, line in enumerate(lines):
-        line = line.strip()
-        if line.startswith('#'):
-            match = re.match(r'^(#{1,10})\s+(.+)', line)
-            if match:
-                section_starts.append(i + 1)  # 1-based line numbers
-    
-    # If no sections found, just batch by line count
-    if not section_starts:
-        batches = []
-        for i in range(0, len(lines), max_lines_per_batch):
-            batch_lines = lines[i:i + max_lines_per_batch]
-            batches.append('\n'.join(batch_lines))
-        return batches
-    
-    # Create batches respecting section boundaries
-    batches = []
-    current_batch_start = 0
-    
-    for i, section_start in enumerate(section_starts):
-        section_start_idx = section_start - 1  # Convert to 0-based
-        
-        # Check if adding this section would exceed the line limit
-        if (section_start_idx - current_batch_start) > max_lines_per_batch:
-            # Close current batch at the previous section boundary
-            if current_batch_start < section_start_idx:
-                batch_lines = lines[current_batch_start:section_start_idx]
-                batches.append('\n'.join(batch_lines))
-                current_batch_start = section_start_idx
-        
-        # If this is the last section, or the next section would create a batch too large
-        if i == len(section_starts) - 1:
-            # Add remaining content as final batch
-            batch_lines = lines[current_batch_start:]
-            batches.append('\n'.join(batch_lines))
-        else:
-            next_section_start = section_starts[i + 1] - 1  # 0-based
-            if (next_section_start - current_batch_start) > max_lines_per_batch:
-                # Close current batch at current section boundary
-                batch_lines = lines[current_batch_start:section_start_idx]
-                if batch_lines:  # Only add non-empty batches
-                    batches.append('\n'.join(batch_lines))
-                current_batch_start = section_start_idx
-    
-    # Clean up any empty batches
-    batches = [batch for batch in batches if batch.strip()]
-    
-    return batches
+def create_section_batches(file_content, max_lines_per_batch=200):
+    """Batch content by Markdown sections, then enforce max_lines_per_batch."""
+    lines = file_content.split('\n')
+    header_idx = [i for i, line in enumerate(lines) if re.match(r'^\#{1,10}\s+\S', line.strip())]
+
+    # No headers: simple chunking
+    if not header_idx:
+        return [
+            '\n'.join(chunk)
+            for i in range(0, len(lines), max_lines_per_batch)
+            if (chunk := lines[i:i + max_lines_per_batch])
+        ]
+
+    # Build section slices
+    sections = []
+    for i, start in enumerate(header_idx):
+        end = header_idx[i + 1] if i + 1 < len(header_idx) else len(lines)
+        sections.append(lines[start:end])
+
+    # Pack sections into batches
+    batches, cur, cur_len = [], [], 0
+    for sec in sections:
+        sec_len = len(sec)
+        if cur and cur_len + sec_len > max_lines_per_batch:
+            batches.append('\n'.join(cur))
+            cur, cur_len = [], 0
+        cur.extend(sec)
+        cur_len += sec_len
+    if cur:
+        batches.append('\n'.join(cur))
+
+    # Split any oversized batch (single very large section)
+    final_batches = []
+    for batch in batches:
+        blines = batch.split('\n')
+        if len(blines) <= max_lines_per_batch:
+            if batch.strip():
+                final_batches.append(batch)
+        else:
+            for i in range(0, len(blines), max_lines_per_batch):
+                chunk = '\n'.join(blines[i:i + max_lines_per_batch]).strip()
+                if chunk:
+                    final_batches.append(chunk)
+    return final_batches

112-121: Drop f-prefix where no interpolation occurs.

Cleans up lints (F541) and keeps output unchanged.

-            print(f"   💰 File addition translation")
-            print(f"      📝 Input: {char_count:,} characters")
-            print(f"      🔢 Actual tokens: {actual_tokens:,} (using tiktoken cl100k_base)")
+            print("   💰 File addition translation")
+            print(f"      📝 Input: {char_count:,} characters")
+            print(f"      🔢 Actual tokens: {actual_tokens:,} (using tiktoken cl100k_base)")
@@
-            print(f"   💰 File addition translation")
-            print(f"      📝 Input: {char_count:,} characters")
-            print(f"      🔢 Estimated tokens: ~{estimated_tokens:,} (fallback: 4 chars/token approximation)")
+            print("   💰 File addition translation")
+            print(f"      📝 Input: {char_count:,} characters")
+            print(f"      🔢 Estimated tokens: ~{estimated_tokens:,} (fallback: 4 chars/token approximation)")
@@
-    thread_safe_print(f"\n✅ Completed processing all new files")
+    thread_safe_print("\n✅ Completed processing all new files")

Also applies to: 128-129, 193-193


135-140: Rename unused params to signal intent.

pr_url and github_client are unused.

-def process_added_files(added_files, pr_url, github_client, ai_client, repo_config):
+def process_added_files(added_files, _pr_url, _github_client, ai_client, repo_config):

131-134: Avoid blind excepts; narrow or log with type.

Catching bare Exception hides actionable errors.

Consider catching specific exceptions (e.g., httpx.HTTPError from AI client; OSError for file writes), or at least include exception type:

except Exception as e:
    thread_safe_print(f"   ❌ Batch translation failed: {type(e).__name__}: {e}")

Also applies to: 190-191

scripts/translate_doc_pr/file_deleter.py (4)

8-8: Remove unused import.

Github is not used.

-from github import Github

17-22: Rename unused param to underscore.

github_client isn’t used.

-def process_deleted_files(deleted_files, github_client, repo_config):
+def process_deleted_files(deleted_files, _github_client, repo_config):

38-39: Avoid blind except; log type.

Narrow to OSError or include type name in message.

-            except Exception as e:
-                thread_safe_print(f"   ❌ Error deleting file {target_file_path}: {e}")
+            except OSError as e:
+                thread_safe_print(f"   ❌ Error deleting file {target_file_path}: {type(e).__name__}: {e}")

43-44: Drop f-prefix without interpolation.

-    thread_safe_print(f"\n✅ Completed processing deleted files")
+    thread_safe_print("\n✅ Completed processing deleted files")
scripts/translate_doc_pr/main_workflow.py (6)

612-615: Remove duplicate re-imports.

process_toc_files and process_modified_sections are already imported at top.

-    # Import necessary functions
-    from file_updater import process_modified_sections, update_target_document_from_match_data
-    from toc_processor import process_toc_files
+    # Functions already imported at module top; no re-import needed

528-533: Drop redundant local import; parse_pr_url already imported.

-def get_workflow_repo_config(pr_url, repo_configs):
+def get_workflow_repo_config(pr_url, repo_configs):
     """Get repository configuration for workflow environment"""
-    from pr_analyzer import parse_pr_url
-    
     owner, repo, pr_number = parse_pr_url(pr_url)

115-122: Unify logging and remove f-prefix without placeholders.

Use thread_safe_print consistently and avoid f-strings without interpolation.

-            print(f"🧹 Cleaned existing temp_output directory")
+            thread_safe_print("🧹 Cleaned existing temp_output directory")
@@
-            print(f"🧹 Removed existing temp_output file")
+            thread_safe_print("🧹 Removed existing temp_output file")
@@
-    print(f"📁 Created temp_output directory: {temp_dir}")
+    thread_safe_print(f"📁 Created temp_output directory: {temp_dir}")

423-431: Prefer local target file content over GitHub API master ref.

Using remote master can mismatch the checked-out PR head branch and degrade matching accuracy. Read from repo_config['target_local_path'] instead, falling back to API if missing.

I can provide a patch if you confirm local path layout mirrors target repo paths.
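
A hedged sketch of the local-first read (the helper name and the `github_fallback` callable are illustrative; `target_local_path` is the repo_config key already used in this PR):

```python
import os

def read_target_file(repo_config, rel_path, github_fallback=None):
    """Prefer the checked-out PR head branch on disk; fall back to the API only if the file is missing."""
    local_path = os.path.join(repo_config['target_local_path'], rel_path)
    if os.path.isfile(local_path):
        with open(local_path, encoding='utf-8') as f:
            return f.read()
    return github_fallback(rel_path) if github_fallback else None
```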


251-274: Duplicate get_pr_diff logic; reuse pr_analyzer.get_pr_diff.

Avoid maintaining two implementations.

Option:

  • Remove local get_pr_diff.
  • Import and call pr_analyzer.get_pr_diff at the top.

Minimal change at call site:

-    pr_diff = get_pr_diff(SOURCE_PR_URL, github_client)
+    from pr_analyzer import get_pr_diff as analyzer_get_pr_diff
+    pr_diff = analyzer_get_pr_diff(SOURCE_PR_URL, github_client)

Also applies to: 592-596


56-57: Message text: avoid long strings in exceptions.

Minor: prefer concise exception messages or custom exception classes.

Also applies to: 159-160
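
For example, a tiny exception class keeps the raise sites short (names are illustrative, not part of this PR):

```python
class WorkflowConfigError(RuntimeError):
    """Raised when a required setting for the translation workflow is missing."""

def require_setting(name, value):
    if not value:
        raise WorkflowConfigError(name)  # the class carries the context; the message stays short
    return value
```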

scripts/translate_doc_pr/toc_processor.py (3)

10-12: Remove unused imports.

Github and OpenAI are not used.

-from github import Github
-from openai import OpenAI

241-241: Drop f-prefix where no interpolation occurs.

Cleans up F541 lints.

-        thread_safe_print(f"   ⏭️  No TOC lines need translation")
+        thread_safe_print("   ⏭️  No TOC lines need translation")
@@
-            print(f"   💰 TOC translation")
+            print("   💰 TOC translation")
@@
-            print(f"   💰 TOC translation")
+            print("   💰 TOC translation")
@@
-        thread_safe_print(f"   📝 AI translation response received")
+        thread_safe_print("   📝 AI translation response received")
@@
-    thread_safe_print(f"   ✅ All TOC files processed")
+    thread_safe_print("   ✅ All TOC files processed")

Also applies to: 286-286, 293-295, 303-304, 434-434


90-91: Rename unused params to underscore; avoid blind except.

  • file_path/target_local_path/pr_url/github_client are unused.
  • Catch specific exceptions or include type name in logs.
-def process_toc_operations(file_path, operations, source_lines, target_lines, target_local_path):
+def process_toc_operations(_file_path, operations, source_lines, target_lines, _target_local_path):
@@
-def process_toc_files(toc_files, pr_url, github_client, ai_client, repo_config):
+def process_toc_files(toc_files, _pr_url, _github_client, ai_client, repo_config):

And adjust excepts:

except Exception as e:
    thread_safe_print(f"   ❌ AI translation failed: {type(e).__name__}: {e}")

Also applies to: 421-423, 332-334

scripts/translate_doc_pr/pr_analyzer.py (8)

1-1: Remove non-executable shebang or mark file executable.

The shebang is present but the file isn’t executable in source control. Prefer removing it to avoid EXE001 lint, or ensure the file mode is +x.

Apply this diff to remove the shebang:

-#!/usr/bin/env python3

22-26: Harden PR URL parsing.

Splitting on "/" is brittle (trailing slashes, query params). Use a regex or urllib to reliably extract owner/repo/number.

Apply this diff:

 def parse_pr_url(pr_url):
-    """Parse PR URL to get repo info"""
-    parts = pr_url.split('/')
-    return parts[-4], parts[-3], int(parts[-1])  # owner, repo, pr_number
+    """Parse PR URL to get owner, repo, and PR number robustly"""
+    import re
+    m = re.search(r'https?://[^/]+/([^/]+)/([^/]+)/pull/(\d+)', pr_url.rstrip('/'))
+    if not m:
+        raise ValueError(f"Invalid PR URL: {pr_url}")
+    owner, repo, pr_number = m.group(1), m.group(2), int(m.group(3))
+    return owner, repo, pr_number

41-63: Narrow exception handling and use thread-safe logging.

Catching Exception swallows actionable errors; prefer GithubException and use thread_safe_print for consistency.

Apply this diff:

-from github import Github
+from github import Github
+from github.GithubException import GithubException
@@
 def get_pr_diff(pr_url, github_client):
@@
-    except Exception as e:
-        print(f"   ❌ Error getting PR diff: {e}")
+    except GithubException as e:
+        thread_safe_print(f"   ❌ GitHub API error getting PR diff: {e}")
+        return None
+    except Exception as e:
+        thread_safe_print(f"   ❌ Unexpected error getting PR diff: {e}")
         return None

228-289: Remove unused locals to reduce noise (Ruff F841).

original_line and title are never used.

Apply this diff:

-        original_line = line
         line = line.strip()
@@
-            if match:
-                level = len(match.group(1))
-                title = match.group(2).strip()
+            if match:
+                level = len(match.group(1))

742-744: Remove unused cross-import to avoid circular dependency risks.

This import isn’t used and introduces an unnecessary coupling.

Apply this diff:

-    from section_matcher import clean_title_for_matching

1311-1319: Gate local artifact writes behind an env flag and use a temp directory.

Always writing to the repo path is noisy and brittle in CI. Make it opt-in and write to a temp dir by default.

Apply this diff:

-        script_dir = os.path.dirname(os.path.abspath(__file__))
-        temp_dir = os.path.join(script_dir, "temp_output")
-        os.makedirs(temp_dir, exist_ok=True)
-        
-        file_prefix = file.filename.replace('/', '-').replace('.md', '')
-        output_file = os.path.join(temp_dir, f"{file_prefix}-source-diff-dict.json")
-        with open(output_file, 'w', encoding='utf-8') as f:
-            json.dump(source_diff_dict, f, ensure_ascii=False, indent=2)
-        
-        print(f"   💾 Saved source diff dictionary to: {output_file}")
+        if os.getenv("WRITE_SOURCE_DIFF", "0") == "1":
+            out_dir = os.getenv("SOURCE_DIFF_OUTDIR", os.path.join(tempfile.gettempdir(), "translate_doc_pr"))
+            os.makedirs(out_dir, exist_ok=True)
+            file_prefix = file.filename.replace('/', '-').replace('.md', '')
+            output_file = os.path.join(out_dir, f"{file_prefix}-source-diff-dict.json")
+            with open(output_file, 'w', encoding='utf-8') as f:
+                json.dump(source_diff_dict, f, ensure_ascii=False, indent=2)
+            thread_safe_print(f"   💾 Saved source diff dictionary to: {output_file}")

Note: add import tempfile at the top.


982-984: Remove redundant re-imports.

os and json are already imported at module scope.

Apply this diff:

-    import os
-    import json

671-690: Deduplicate helpers across modules.

clean_title_for_matching is duplicated in section_matcher.py. Extract into a shared utils module to avoid drift.

scripts/translate_doc_pr/section_matcher.py (3)

6-12: Remove unused imports.

Github, OpenAI, os, json aren’t used here.

Apply this diff:

-import os
 import re
-import json
 import threading
-from github import Github
-from openai import OpenAI

430-456: Deduplicate build_hierarchy_path with pr_analyzer.

Both modules implement identical logic. Extract into a shared utility to avoid divergence.


16-19: Unify logging with pr_analyzer.

Duplicate thread_safe_print exists across modules. Consider centralizing in a utils module.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7dbc37b and 88d9864.

📒 Files selected for processing (10)
  • .github/workflows/sync-docs-cn-to-en.yml (1 hunks)
  • scripts/translate_doc_pr/__init__.py (1 hunks)
  • scripts/translate_doc_pr/file_adder.py (1 hunks)
  • scripts/translate_doc_pr/file_deleter.py (1 hunks)
  • scripts/translate_doc_pr/file_updater.py (1 hunks)
  • scripts/translate_doc_pr/main_workflow.py (1 hunks)
  • scripts/translate_doc_pr/pr_analyzer.py (1 hunks)
  • scripts/translate_doc_pr/requirements.txt (1 hunks)
  • scripts/translate_doc_pr/section_matcher.py (1 hunks)
  • scripts/translate_doc_pr/toc_processor.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (7)
scripts/translate_doc_pr/__init__.py (1)
scripts/translate_doc_pr/main_workflow.py (1)
  • main (544-688)
scripts/translate_doc_pr/file_deleter.py (6)
scripts/translate_doc_pr/file_adder.py (1)
  • thread_safe_print (16-18)
scripts/translate_doc_pr/file_updater.py (1)
  • thread_safe_print (17-19)
scripts/translate_doc_pr/main_workflow.py (1)
  • thread_safe_print (94-96)
scripts/translate_doc_pr/pr_analyzer.py (1)
  • thread_safe_print (16-19)
scripts/translate_doc_pr/section_matcher.py (1)
  • thread_safe_print (16-18)
scripts/translate_doc_pr/toc_processor.py (1)
  • thread_safe_print (16-18)
scripts/translate_doc_pr/file_adder.py (1)
scripts/translate_doc_pr/main_workflow.py (4)
  • thread_safe_print (94-96)
  • main (544-688)
  • print_token_estimation (137-144)
  • chat_completion (165-198)
scripts/translate_doc_pr/section_matcher.py (2)
scripts/translate_doc_pr/main_workflow.py (4)
  • thread_safe_print (94-96)
  • main (544-688)
  • print_token_estimation (137-144)
  • chat_completion (165-198)
scripts/translate_doc_pr/pr_analyzer.py (5)
  • thread_safe_print (16-19)
  • clean_title_for_matching (672-689)
  • build_hierarchy_path (291-316)
  • extract_section_direct_content (454-479)
  • extract_frontmatter_content (481-494)
scripts/translate_doc_pr/toc_processor.py (1)
scripts/translate_doc_pr/main_workflow.py (4)
  • thread_safe_print (94-96)
  • main (544-688)
  • print_token_estimation (137-144)
  • chat_completion (165-198)
scripts/translate_doc_pr/main_workflow.py (6)
scripts/translate_doc_pr/pr_analyzer.py (6)
  • analyze_source_changes (979-1447)
  • get_repo_config (27-39)
  • get_target_hierarchy_and_content (603-616)
  • parse_pr_url (22-25)
  • thread_safe_print (16-19)
  • get_pr_diff (41-62)
scripts/translate_doc_pr/file_adder.py (2)
  • process_added_files (135-193)
  • thread_safe_print (16-18)
scripts/translate_doc_pr/file_deleter.py (2)
  • process_deleted_files (17-43)
  • thread_safe_print (13-15)
scripts/translate_doc_pr/file_updater.py (3)
  • process_modified_sections (574-605)
  • thread_safe_print (17-19)
  • update_target_document_from_match_data (1196-1357)
scripts/translate_doc_pr/toc_processor.py (2)
  • process_toc_files (421-434)
  • thread_safe_print (16-18)
scripts/translate_doc_pr/section_matcher.py (2)
  • match_source_diff_to_target (572-689)
  • thread_safe_print (16-18)
scripts/translate_doc_pr/file_updater.py (4)
scripts/translate_doc_pr/main_workflow.py (4)
  • thread_safe_print (94-96)
  • main (544-688)
  • print_token_estimation (137-144)
  • chat_completion (165-198)
scripts/translate_doc_pr/pr_analyzer.py (6)
  • thread_safe_print (16-19)
  • find_section_boundaries (361-400)
  • get_target_hierarchy_and_content (603-616)
  • clean_title_for_matching (672-689)
  • build_hierarchy_dict (228-289)
  • extract_affected_sections (497-514)
scripts/translate_doc_pr/section_matcher.py (9)
  • thread_safe_print (16-18)
  • find_direct_matches_for_special_files (138-298)
  • filter_non_system_sections (300-316)
  • get_corresponding_sections (318-379)
  • is_system_variable_or_config (39-80)
  • clean_title_for_matching (20-37)
  • parse_ai_response (381-397)
  • find_matching_line_numbers (399-428)
  • map_insertion_points_to_target (457-551)
scripts/translate_doc_pr/toc_processor.py (2)
  • thread_safe_print (16-18)
  • process_toc_file (332-419)
🪛 Ruff (0.13.1)
scripts/translate_doc_pr/__init__.py

1-1: Shebang is present but file is not executable

(EXE001)

scripts/translate_doc_pr/file_deleter.py

17-17: Unused function argument: github_client

(ARG001)


38-38: Do not catch blind exception: Exception

(BLE001)


43-43: f-string without any placeholders

Remove extraneous f prefix

(F541)

scripts/translate_doc_pr/file_adder.py

112-112: f-string without any placeholders

Remove extraneous f prefix

(F541)


115-115: Do not catch blind exception: Exception

(BLE001)


119-119: f-string without any placeholders

Remove extraneous f prefix

(F541)


128-128: f-string without any placeholders

Remove extraneous f prefix

(F541)


129-129: Consider moving this statement to an else block

(TRY300)


131-131: Do not catch blind exception: Exception

(BLE001)


135-135: Unused function argument: pr_url

(ARG001)


135-135: Unused function argument: github_client

(ARG001)


190-190: Do not catch blind exception: Exception

(BLE001)


193-193: f-string without any placeholders

Remove extraneous f prefix

(F541)

scripts/translate_doc_pr/section_matcher.py

88-88: f-string without any placeholders

Remove extraneous f prefix

(F541)


122-122: f-string without any placeholders

Remove extraneous f prefix

(F541)


138-138: Unused function argument: target_hierarchy

(ARG001)


169-169: f-string without any placeholders

Remove extraneous f prefix

(F541)


235-235: f-string without any placeholders

Remove extraneous f prefix

(F541)


242-242: f-string without any placeholders

Remove extraneous f prefix

(F541)


294-294: String contains ambiguous (INFORMATION SOURCE). Did you mean i (LATIN SMALL LETTER I)?

(RUF001)


326-332: Possible SQL injection vector through string-based query construction

(S608)


335-335: f-string without any placeholders

Remove extraneous f prefix

(F541)


337-337: f-string without any placeholders

Remove extraneous f prefix

(F541)


354-354: Do not catch blind exception: Exception

(BLE001)


371-371: f-string without any placeholders

Remove extraneous f prefix

(F541)


372-372: f-string without any placeholders

Remove extraneous f prefix

(F541)


374-374: f-string without any placeholders

Remove extraneous f prefix

(F541)


376-376: Consider moving this statement to an else block

(TRY300)


377-377: Do not catch blind exception: Exception

(BLE001)


430-430: Unused function argument: lines

(ARG001)


457-457: Unused function argument: file_path

(ARG001)


457-457: Unused function argument: pr_url

(ARG001)


457-457: Unused function argument: github_client

(ARG001)


480-480: Unpacked variable failed_matches is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


480-480: Unpacked variable skipped_sections is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


486-486: Prefer next(iter(matched_dict.keys())) over single element slice

Replace with next(iter(matched_dict.keys()))

(RUF015)


533-533: Prefer next(iter(ai_matched.keys())) over single element slice

Replace with next(iter(ai_matched.keys()))

(RUF015)


587-587: f-string without any placeholders

Remove extraneous f prefix

(F541)


593-593: f-string without any placeholders

Remove extraneous f prefix

(F541)


626-626: f-string without any placeholders

Remove extraneous f prefix

(F541)


629-629: String contains ambiguous (HEAVY PLUS SIGN). Did you mean + (PLUS SIGN)?

(RUF001)


633-633: f-string without any placeholders

Remove extraneous f prefix

(F541)


641-641: f-string without any placeholders

Remove extraneous f prefix

(F541)


648-648: f-string without any placeholders

Remove extraneous f prefix

(F541)


648-648: String contains ambiguous (HEAVY PLUS SIGN). Did you mean + (PLUS SIGN)?

(RUF001)


711-711: Prefer next(iter(matched_dict.keys())) over single element slice

Replace with next(iter(matched_dict.keys()))

(RUF015)


721-721: Unpacked variable failed_matches is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


721-721: Unpacked variable skipped_sections is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


725-725: Prefer next(iter(matched_dict.keys())) over single element slice

Replace with next(iter(matched_dict.keys()))

(RUF015)


726-726: Prefer next(iter(matched_dict.values())) over single element slice

Replace with next(iter(matched_dict.values()))

(RUF015)


762-762: Prefer next(iter(ai_matched.keys())) over single element slice

Replace with next(iter(ai_matched.keys()))

(RUF015)


763-763: Prefer next(iter(ai_matched.values())) over single element slice

Replace with next(iter(ai_matched.values()))

(RUF015)


811-811: Prefer next(iter(matched_dict.keys())) over single element slice

Replace with next(iter(matched_dict.keys()))

(RUF015)


821-821: Unpacked variable failed_matches is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


821-821: Unpacked variable skipped_sections is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


825-825: Prefer next(iter(matched_dict.keys())) over single element slice

Replace with next(iter(matched_dict.keys()))

(RUF015)


826-826: Prefer next(iter(matched_dict.values())) over single element slice

Replace with next(iter(matched_dict.values()))

(RUF015)


854-854: Prefer next(iter(ai_matched.keys())) over single element slice

Replace with next(iter(ai_matched.keys()))

(RUF015)


855-855: Prefer next(iter(ai_matched.values())) over single element slice

Replace with next(iter(ai_matched.values()))

(RUF015)


917-917: Local variable current_level is assigned to but never used

Remove assignment to unused variable current_level

(F841)


939-939: Loop control variable i not used within loop body

Rename unused i to _i

(B007)

scripts/translate_doc_pr/toc_processor.py

90-90: Unused function argument: file_path

(ARG001)


90-90: Unused function argument: target_local_path

(ARG001)


92-92: f-string without any placeholders

Remove extraneous f prefix

(F541)


126-126: String contains ambiguous (HEAVY PLUS SIGN). Did you mean + (PLUS SIGN)?

(RUF001)


241-241: f-string without any placeholders

Remove extraneous f prefix

(F541)


286-286: f-string without any placeholders

Remove extraneous f prefix

(F541)


289-289: Do not catch blind exception: Exception

(BLE001)


293-293: f-string without any placeholders

Remove extraneous f prefix

(F541)


303-303: f-string without any placeholders

Remove extraneous f prefix

(F541)


328-328: Do not catch blind exception: Exception

(BLE001)


332-332: Unused function argument: pr_url

(ARG001)


332-332: Unused function argument: github_client

(ARG001)


387-387: String contains ambiguous (HEAVY PLUS SIGN). Did you mean + (PLUS SIGN)?

(RUF001)


400-400: String contains ambiguous (HEAVY PLUS SIGN). Did you mean + (PLUS SIGN)?

(RUF001)


403-403: String contains ambiguous (HEAVY PLUS SIGN). Did you mean + (PLUS SIGN)?

(RUF001)


418-418: Do not catch blind exception: Exception

(BLE001)


434-434: f-string without any placeholders

Remove extraneous f prefix

(F541)

scripts/translate_doc_pr/main_workflow.py

56-56: Avoid specifying long messages outside the exception class

(TRY003)


115-115: f-string without any placeholders

Remove extraneous f prefix

(F541)


119-119: f-string without any placeholders

Remove extraneous f prefix

(F541)


132-132: Do not catch blind exception: Exception

(BLE001)


157-157: Avoid specifying long messages outside the exception class

(TRY003)


159-159: Avoid specifying long messages outside the exception class

(TRY003)


163-163: Avoid specifying long messages outside the exception class

(TRY003)


179-179: f-string without any placeholders

Remove extraneous f prefix

(F541)


188-188: f-string without any placeholders

Remove extraneous f prefix

(F541)


191-191: f-string without any placeholders

Remove extraneous f prefix

(F541)


195-195: Use explicit conversion flag

Replace with conversion flag

(RUF010)


197-197: f-string without any placeholders

Remove extraneous f prefix

(F541)


198-198: Use raise without specifying exception name

Remove exception name

(TRY201)


221-221: Loop control variable key not used within loop body

Rename unused key to _key

(B007)


235-235: f-string without any placeholders

Remove extraneous f prefix

(F541)


247-247: Do not catch blind exception: Exception

(BLE001)


271-271: Do not catch blind exception: Exception

(BLE001)


275-275: Unused function argument: target_sections

(ARG001)


312-312: Loop control variable key not used within loop body

Rename unused key to _key

(B007)


319-319: f-string without any placeholders

Remove extraneous f prefix

(F541)


370-370: Unused function argument: file_sections

(ARG001)


411-411: f-string without any placeholders

Remove extraneous f prefix

(F541)


414-414: f-string without any placeholders

Remove extraneous f prefix

(F541)


434-434: f-string without any placeholders

Remove extraneous f prefix

(F541)


446-446: f-string without any placeholders

Remove extraneous f prefix

(F541)


458-458: f-string without any placeholders

Remove extraneous f prefix

(F541)


474-474: Unpacked variable file_path is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


505-505: f-string without any placeholders

Remove extraneous f prefix

(F541)


510-510: f-string without any placeholders

Remove extraneous f prefix

(F541)


513-513: f-string without any placeholders

Remove extraneous f prefix

(F541)


517-517: f-string without any placeholders

Remove extraneous f prefix

(F541)


520-520: f-string without any placeholders

Remove extraneous f prefix

(F541)


523-523: Do not catch blind exception: Exception

(BLE001)


530-530: Redefinition of unused parse_pr_url from line 21

Remove definition: parse_pr_url

(F811)


536-536: Avoid specifying long messages outside the exception class

(TRY003)


556-556: f-string without any placeholders

Remove extraneous f prefix

(F541)


584-584: Do not catch blind exception: Exception

(BLE001)


591-591: f-string without any placeholders

Remove extraneous f prefix

(F541)


599-599: f-string without any placeholders

Remove extraneous f prefix

(F541)


600-600: Unpacked variable added_sections is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


600-600: Unpacked variable deleted_sections is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


610-610: f-string without any placeholders

Remove extraneous f prefix

(F541)


614-614: Redefinition of unused process_toc_files from line 25

Remove definition: process_toc_files

(F811)


620-620: f-string without any placeholders

Remove extraneous f prefix

(F541)


626-626: f-string without any placeholders

Remove extraneous f prefix

(F541)


632-632: f-string without any placeholders

Remove extraneous f prefix

(F541)


658-658: f-string without any placeholders

Remove extraneous f prefix

(F541)


683-683: f-string without any placeholders

Remove extraneous f prefix

(F541)


688-688: f-string without any placeholders

Remove extraneous f prefix

(F541)

scripts/translate_doc_pr/file_updater.py

51-51: Local variable diff_content is assigned to but never used

Remove assignment to unused variable diff_content

(F841)


102-102: f-string without any placeholders

Remove extraneous f prefix

(F541)


105-105: f-string without any placeholders

Remove extraneous f prefix

(F541)


109-109: f-string without any placeholders

Remove extraneous f prefix

(F541)


125-125: Do not catch blind exception: Exception

(BLE001)


138-138: f-string without any placeholders

Remove extraneous f prefix

(F541)


150-150: Consider moving this statement to an else block

(TRY300)


152-152: Do not catch blind exception: Exception

(BLE001)


164-164: f-string without any placeholders

Remove extraneous f prefix

(F541)


173-173: f-string without any placeholders

Remove extraneous f prefix

(F541)


176-176: f-string without any placeholders

Remove extraneous f prefix

(F541)


180-180: f-string without any placeholders

Remove extraneous f prefix

(F541)


200-200: Consider moving this statement to an else block

(TRY300)


202-202: f-string without any placeholders

Remove extraneous f prefix

(F541)


230-230: Consider moving this statement to an else block

(TRY300)


231-231: Do not use bare except

(E722)


231-232: try-except-continue detected, consider logging the exception

(S112)


235-235: Use raise without specifying exception name

Remove exception name

(TRY201)


237-237: f-string without any placeholders

Remove extraneous f prefix

(F541)


257-257: Do not catch blind exception: Exception

(BLE001)


417-417: Consider moving this statement to an else block

(TRY300)


419-419: Do not catch blind exception: Exception

(BLE001)


430-430: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


468-468: f-string without any placeholders

Remove extraneous f prefix

(F541)


568-568: Consider moving this statement to an else block

(TRY300)


570-570: Do not catch blind exception: Exception

(BLE001)


601-601: Do not catch blind exception: Exception

(BLE001)


633-633: Do not catch blind exception: Exception

(BLE001)


639-639: Unused function argument: pr_url

(ARG001)


684-684: Unpacked variable skipped_sections is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


688-688: Loop control variable hierarchy_string not used within loop body

Rename unused hierarchy_string to _hierarchy_string

(B007)


726-726: Loop control variable source_line not used within loop body

Rename unused source_line to _source_line

(B007)


768-768: Local variable target_hierarchy is assigned to but never used

Remove assignment to unused variable target_hierarchy

(F841)


808-808: Consider moving this statement to an else block

(TRY300)


810-810: Do not catch blind exception: Exception

(BLE001)


1049-1049: Do not catch blind exception: Exception

(BLE001)


1056-1056: String contains ambiguous (HEAVY PLUS SIGN). Did you mean + (PLUS SIGN)?

(RUF001)


1059-1059: String contains ambiguous (HEAVY PLUS SIGN). Did you mean + (PLUS SIGN)?

(RUF001)


1066-1066: String contains ambiguous (HEAVY PLUS SIGN). Did you mean + (PLUS SIGN)?

(RUF001)


1118-1118: Unused function argument: operation_type

(ARG001)


1183-1183: Do not catch blind exception: Exception

(BLE001)


1190-1190: f-string without any placeholders

Remove extraneous f prefix

(F541)


1219-1219: f-string without any placeholders

Remove extraneous f prefix

(F541)


1270-1270: Unpacked variable section_data is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


1270-1270: Unpacked variable line_num is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


1287-1287: f-string without any placeholders

Remove extraneous f prefix

(F541)


1296-1296: f-string without any placeholders

Remove extraneous f prefix

(F541)


1389-1389: f-string without any placeholders

Remove extraneous f prefix

(F541)


1414-1414: f-string without any placeholders

Remove extraneous f prefix

(F541)


1418-1418: f-string without any placeholders

Remove extraneous f prefix

(F541)


1423-1423: Use explicit conversion flag

Replace with conversion flag

(RUF010)


1429-1429: f-string without any placeholders

Remove extraneous f prefix

(F541)


1461-1461: f-string without any placeholders

Remove extraneous f prefix

(F541)


1471-1471: f-string without any placeholders

Remove extraneous f prefix

(F541)


1483-1483: f-string without any placeholders

Remove extraneous f prefix

(F541)


1502-1502: f-string without any placeholders

Remove extraneous f prefix

(F541)


1522-1522: f-string without any placeholders

Remove extraneous f prefix

(F541)


1553-1553: f-string without any placeholders

Remove extraneous f prefix

(F541)


1585-1585: f-string without any placeholders

Remove extraneous f prefix

(F541)


1622-1622: f-string without any placeholders

Remove extraneous f prefix

(F541)


1688-1688: f-string without any placeholders

Remove extraneous f prefix

(F541)

scripts/translate_doc_pr/pr_analyzer.py

1-1: Shebang is present but file is not executable

(EXE001)


33-33: Avoid specifying long messages outside the exception class

(TRY003)


60-60: Do not catch blind exception: Exception

(BLE001)


240-240: Local variable original_line is assigned to but never used

Remove assignment to unused variable original_line

(F841)


265-265: Local variable title is assigned to but never used

Remove assignment to unused variable title

(F841)


291-291: Unused function argument: lines

(ARG001)


328-328: Local variable original_title is assigned to but never used

Remove assignment to unused variable original_title

(F841)


341-341: Local variable title is assigned to but never used

Remove assignment to unused variable title

(F841)


368-368: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


402-402: Unused function argument: hierarchy_dict

(ARG001)


487-487: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


526-526: Unused function argument: lines

(ARG001)


547-547: Unused function argument: lines

(ARG001)


613-613: Consider moving this statement to an else block

(TRY300)


614-614: Do not catch blind exception: Exception

(BLE001)


644-644: Consider moving this statement to an else block

(TRY300)


645-645: Do not catch blind exception: Exception

(BLE001)


665-665: Consider moving this statement to an else block

(TRY300)


667-667: Do not catch blind exception: Exception

(BLE001)


746-746: Unused function argument: hierarchy_dict

(ARG001)


754-754: Unused function argument: base_hierarchy_dict

(ARG001)


922-922: Loop control variable base_line_str not used within loop body

Rename unused base_line_str to _base_line_str

(B007)


944-944: f-string without any placeholders

Remove extraneous f prefix

(F541)


979-979: Unused function argument: max_non_system_sections

(ARG001)


979-979: Unused function argument: pr_diff

(ARG001)


982-982: Redefinition of unused os from line 8

Remove definition: os

(F811)


983-983: Redefinition of unused json from line 7

Remove definition: json

(F811)


1021-1021: String contains ambiguous (HEAVY PLUS SIGN). Did you mean + (PLUS SIGN)?

(RUF001)


1025-1025: f-string without any placeholders

Remove extraneous f prefix

(F541)


1027-1027: Do not catch blind exception: Exception

(BLE001)


1034-1034: f-string without any placeholders

Remove extraneous f prefix

(F541)


1040-1040: Do not catch blind exception: Exception

(BLE001)


1053-1053: Do not catch blind exception: Exception

(BLE001)


1078-1078: f-string without any placeholders

Remove extraneous f prefix

(F541)


1080-1080: String contains ambiguous (HEAVY PLUS SIGN). Did you mean + (PLUS SIGN)?

(RUF001)


1086-1086: f-string without any placeholders

Remove extraneous f prefix

(F541)


1086-1086: String contains ambiguous (INFORMATION SOURCE). Did you mean i (LATIN SMALL LETTER I)?

(RUF001)


1142-1142: Do not catch blind exception: Exception

(BLE001)


1161-1161: f-string without any placeholders

Remove extraneous f prefix

(F541)


1202-1202: Local variable is_deleted_header is assigned to but never used

Remove assignment to unused variable is_deleted_header

(F841)


1270-1270: f-string without any placeholders

Remove extraneous f prefix

(F541)


1299-1299: f-string without any placeholders

Remove extraneous f prefix

(F541)


1371-1371: String contains ambiguous (HEAVY PLUS SIGN). Did you mean + (PLUS SIGN)?

(RUF001)


1402-1402: f-string without any placeholders

Remove extraneous f prefix

(F541)


1413-1413: Loop control variable existing_line not used within loop body

Rename unused existing_line to _existing_line

(B007)


1437-1437: f-string without any placeholders

Remove extraneous f prefix

(F541)

🪛 actionlint (1.7.7)
.github/workflows/sync-docs-cn-to-en.yml

37-37: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: tidb-check
🔇 Additional comments (1)
scripts/translate_doc_pr/main_workflow.py (1)

39-40: No changes needed: "gemini-2.0-flash" is a valid Google GenAI model ID.

Comment on lines +3 to +24
on:
  workflow_dispatch:
    inputs:
      source_pr_url:
        description: 'Source PR URL (Chinese docs repository)'
        required: true
        type: string
        default: ''
      target_pr_url:
        description: 'Target PR URL (English docs repository)'
        required: true
        type: string
        default: ''
      ai_provider:
        description: 'AI Provider to use for translation'
        required: false
        type: choice
        options:
          - deepseek
          - gemini
        default: 'gemini'


🛠️ Refactor suggestion | 🟠 Major

Declare least-privilege permissions for GITHUB_TOKEN.

Explicitly set permissions to enable contents and PR writes; current default may be read-only.

Apply:

 on:
   workflow_dispatch:
     inputs:
       source_pr_url:
         description: 'Source PR URL (Chinese docs repository)'
         required: true
         type: string
         default: ''
@@
         default: 'gemini'
 
+permissions:
+  contents: write
+  pull-requests: write
🤖 Prompt for AI Agents
.github/workflows/sync-docs-cn-to-en.yml around lines 3 to 24: the workflow
lacks an explicit permissions block for GITHUB_TOKEN, which can default to
read-only; add a top-level permissions section granting least-privilege write
access needed for the job (e.g., contents: write and pull-requests: write) so
the workflow can update repository contents and create/update PRs, ensuring no
broader permissions are granted.

Comment on lines +37 to +39
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

⚠️ Potential issue | 🟠 Major

Update setup-python to v5 (actionlint failure).

Use actions/setup-python@v5 to avoid runner errors.

Apply:

-      - name: Set up Python
-        uses: actions/setup-python@v4
+      - name: Set up Python
+        uses: actions/setup-python@v5
         with:
           python-version: '3.9'
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        uses: actions/setup-python@v4
-        with:
-          python-version: '3.9'
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.9'
🧰 Tools
🪛 actionlint (1.7.7)

37-37: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

🤖 Prompt for AI Agents
.github/workflows/sync-docs-cn-to-en.yml around lines 37 to 39: the workflow
pins actions/setup-python@v4 which causes actionlint/runner errors; update the
workflow to use actions/setup-python@v5 by changing the action reference to
actions/setup-python@v5 and ensure the existing with: python-version setting
remains unchanged (no other logic changes needed).

Comment on lines +71 to +83
      - name: Get target PR branch info
        id: target_branch
        run: |
          # Get target PR branch name
          TARGET_BRANCH=$(curl -s \
            -H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" \
            -H "Accept: application/vnd.github.v3+json" \
            "https://api.github.com/repos/${{ steps.extract_info.outputs.target_owner }}/${{ steps.extract_info.outputs.target_repo }}/pulls/${{ steps.extract_info.outputs.target_pr }}" \
            | jq -r '.head.ref')

          echo "target_branch=${TARGET_BRANCH}" >> $GITHUB_OUTPUT
          echo "Target branch: ${TARGET_BRANCH}"


⚠️ Potential issue | 🔴 Critical

Cross-repo PRs and forks: clone/push the PR head repository using a PAT; current flow will fail.

Issues:

  • You only fetch .head.ref but clone base repo; this breaks for forked PRs (branch won't exist in base).
  • GITHUB_TOKEN cannot push to another repo; use a PAT (or GitHub App) with contents:write on the head repo.
  • Commenting on the target PR also needs a token authorized on that repo.

Fix by retrieving head repo full name and cloning/pushing to it with a PAT, and by using the same PAT for API calls.

Suggested patch:

       - name: Get target PR branch info
         id: target_branch
         run: |
           # Get target PR branch name
-          TARGET_BRANCH=$(curl -s \
+          PR_JSON=$(curl -s \
             -H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" \
-            -H "Accept: application/vnd.github.v3+json" \
+            -H "Accept: application/vnd.github+json" \
             "https://api.github.com/repos/${{ steps.extract_info.outputs.target_owner }}/${{ steps.extract_info.outputs.target_repo }}/pulls/${{ steps.extract_info.outputs.target_pr }}" \
-            | jq -r '.head.ref')
+          )
+          TARGET_BRANCH=$(echo "$PR_JSON" | jq -r '.head.ref')
+          TARGET_HEAD_REPO=$(echo "$PR_JSON" | jq -r '.head.repo.full_name')
+          echo "target_branch=${TARGET_BRANCH}" >> $GITHUB_OUTPUT
+          echo "target_head_repo=${TARGET_HEAD_REPO}" >> $GITHUB_OUTPUT
-          echo "target_branch=${TARGET_BRANCH}" >> $GITHUB_OUTPUT
-          echo "Target branch: ${TARGET_BRANCH}"
+          echo "Target branch: ${TARGET_BRANCH}"
+          echo "Head repo: ${TARGET_HEAD_REPO}"

       - name: Clone target repository
         run: |
-          # Clone target repository with the PR branch
-          git clone https://x-access-token:${{ secrets.GITHUB_TOKEN }}@github.com/${{ steps.extract_info.outputs.target_owner }}/${{ steps.extract_info.outputs.target_repo }}.git target_repo
+          # Clone PR head repository (supports forks) with a PAT
+          git clone https://x-access-token:${{ secrets.SYNC_REPO_TOKEN }}@github.com/${{ steps.target_branch.outputs.target_head_repo }}.git target_repo
           cd target_repo
-          git checkout ${{ steps.target_branch.outputs.target_branch }}
+          git fetch origin ${{ steps.target_branch.outputs.target_branch }}
+          git checkout -B ${{ steps.target_branch.outputs.target_branch }} origin/${{ steps.target_branch.outputs.target_branch }}
           git config user.name "github-actions[bot]"
           git config user.email "github-actions[bot]@users.noreply.github.com"

       - name: Commit and push changes
         run: |
           cd target_repo
           git add .
           if git diff --staged --quiet; then
             echo "No changes to commit"
           else
             git commit -m "Auto-sync: Update English docs from Chinese PR ${{ github.event.inputs.source_pr_url }}
 
             Synced from: ${{ github.event.inputs.source_pr_url }}
             Target PR: ${{ github.event.inputs.target_pr_url }}
             AI Provider: ${{ github.event.inputs.ai_provider }}
 
             Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>"
 
-            git push origin ${{ steps.target_branch.outputs.target_branch }}
+            git push https://x-access-token:${{ secrets.SYNC_REPO_TOKEN }}@github.com/${{ steps.target_branch.outputs.target_head_repo }}.git ${{ steps.target_branch.outputs.target_branch }}
             echo "Changes pushed to target PR branch: ${{ steps.target_branch.outputs.target_branch }}"
           fi

       - name: Add comment to target PR
         run: |
           # Add a comment to the target PR about the sync
-          curl -X POST \
-            -H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" \
+          curl -X POST \
+            -H "Authorization: token ${{ secrets.SYNC_REPO_TOKEN }}" \
             -H "Accept: application/vnd.github.v3+json" \
             "https://api.github.com/repos/${{ steps.extract_info.outputs.target_owner }}/${{ steps.extract_info.outputs.target_repo }}/issues/${{ steps.extract_info.outputs.target_pr }}/comments" \
             -d "{
               \"body\": \"🤖 **Auto-sync completed**\\n\\n📥 **Source PR**: ${{ github.event.inputs.source_pr_url }}\\n🎯 **Target PR**: ${{ github.event.inputs.target_pr_url }}\\n✅ English documentation has been updated based on Chinese documentation changes.\\n\\n_This comment was generated automatically by the sync workflow._\"
             }"

Also add a repo secret SYNC_REPO_TOKEN (PAT) with repo:write access on the head repo (or use a GitHub App).

Also applies to: 84-93, 125-134

🤖 Prompt for AI Agents
.github/workflows/sync-docs-cn-to-en.yml lines 71-83 (and similarly update
84-93, 125-134): the workflow currently reads only .head.ref and clones/pushes
the base repo using GITHUB_TOKEN which fails for forked PRs and prevents pushing
to the head repo; change the flow to fetch the head repo full_name (owner/repo)
and head.ref from the PR API, switch cloning/push operations to use that head
repo full_name, and replace GITHUB_TOKEN with a PAT stored in a repo secret
(e.g. SYNC_REPO_TOKEN with repo:write/contents scope) for git clone/push and for
any API calls (comments or other PR updates) so the token has permission on the
head repo; ensure the workflow uses the head repo URL when adding remotes and
pushing branches and update the other referenced blocks (84-93, 125-134) to use
the same head-repo + PAT approach.

Comment on lines +19 to +22
from main import main

# Make main function available at package level
__all__ = ["main"]

⚠️ Potential issue | 🟠 Major

Fix package export import path

import scripts.translate_doc_pr currently fails with ModuleNotFoundError: No module named 'main' because the package init does an absolute from main import main. The callable lives in main_workflow.py, so the package entry point is unusable until the import points at the correct module.

-from main import main
+from .main_workflow import main
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-from main import main
+from .main_workflow import main
 # Make main function available at package level
 __all__ = ["main"]
🤖 Prompt for AI Agents
In scripts/translate_doc_pr/__init__.py around lines 19 to 22, the package init
does an absolute import from a non-existent module ("main"), causing
ModuleNotFoundError; replace the import with a relative import that points to
the actual callable (main_workflow.py) — i.e., import the main function using a
package-relative import and keep __all__ as ["main"] so importing
scripts.translate_doc_pr exposes the correct entrypoint.
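
Once the relative import is in place, a quick sanity check (illustrative, run from the repository root) confirms the package-level export:

```python
# Verifies that the package exposes the entry point after the fix;
# assumes the repository root is on sys.path.
from scripts.translate_doc_pr import main

print(callable(main))  # expected output: True
```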

Comment on lines +100 to +105
    # Add token estimation
    try:
        from main import print_token_estimation
        print_token_estimation(prompt, "File addition translation")
    except ImportError:
        # Fallback if import fails - use tiktoken

⚠️ Potential issue | 🟠 Major

Wrong module import: print_token_estimation is in main_workflow.py, not main.

This always triggers the ImportError fallback, hiding the token stats.

Apply:

-    try:
-        from main import print_token_estimation
+    try:
+        from main_workflow import print_token_estimation
         print_token_estimation(prompt, "File addition translation")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     # Add token estimation
     try:
-        from main import print_token_estimation
+        from main_workflow import print_token_estimation
         print_token_estimation(prompt, "File addition translation")
     except ImportError:
         # Fallback if import fails - use tiktoken
🤖 Prompt for AI Agents
In scripts/translate_doc_pr/file_adder.py around lines 100 to 105, the code
incorrectly imports print_token_estimation from module main (which causes an
ImportError and always triggers the fallback); update the import to load
print_token_estimation from main_workflow (from main_workflow import
print_token_estimation) and keep the existing try/except fallback behavior so
token estimation is used when available and falls back to tiktoken only if the
import or call fails.

Comment on lines +340 to +346
    try:
        from main import print_token_estimation
        print_token_estimation(prompt, f"Section mapping ({source_language} → {target_language})")
    except ImportError:
        # Fallback if import fails - use tiktoken
        try:
            import tiktoken

⚠️ Potential issue | 🟠 Major

Fix import of token estimation helper (wrong module name).

main_workflow.py defines print_token_estimation. Importing from main will always fail and suppress the feature.

Apply this diff:

-    try:
-        from main import print_token_estimation
+    try:
+        # Try local module name first
+        from main_workflow import print_token_estimation
     print_token_estimation(prompt, f"Section mapping ({source_language} → {target_language})")

Optional: add a second fallback try for package-style import if you plan to package this module.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     try:
-        from main import print_token_estimation
+        # Try local module name first
+        from main_workflow import print_token_estimation
         print_token_estimation(prompt, f"Section mapping ({source_language} → {target_language})")
     except ImportError:
         # Fallback if import fails - use tiktoken
         try:
             import tiktoken
🤖 Prompt for AI Agents
In scripts/translate_doc_pr/section_matcher.py around lines 340-346, the code
tries to import print_token_estimation from the wrong module name ("main"),
which always triggers the ImportError and disables the helper; change the import
to "from main_workflow import print_token_estimation" and call it as before;
additionally add a second fallback import attempt for package-style imports
(e.g., "from .main_workflow import print_token_estimation") before falling back
to the tiktoken-based implementation so the helper works both when run
standalone and when packaged.
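
A minimal sketch of that double fallback, assuming the module layout in this PR (the final None assignment is only illustrative of handing control to the tiktoken path):

```python
# Resolve print_token_estimation whether the script is run standalone or
# imported as part of the scripts.translate_doc_pr package.
try:
    from main_workflow import print_token_estimation        # standalone script run
except ImportError:
    try:
        from .main_workflow import print_token_estimation   # package-style import
    except ImportError:
        print_token_estimation = None                        # caller falls back to tiktoken
```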

Comment on lines +493 to +497
                    'insertion_after_line': insertion_after_line,
                    'target_hierarchy': target_hierarchy.get(str(target_line_num), ''),
                    'insertion_type': point_info['insertion_type'],
                    'new_sections': point_info['new_sections']
                }

⚠️ Potential issue | 🟠 Major

Key type mismatch when reading target_hierarchy.

target_hierarchy keys are ints, so looking up str(target_line_num) always misses and returns the empty-string default.

Apply this diff:

-                    'target_hierarchy': target_hierarchy.get(str(target_line_num), ''),
+                    'target_hierarchy': target_hierarchy.get(target_line_num, ''),
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
                     'insertion_after_line': insertion_after_line,
-                    'target_hierarchy': target_hierarchy.get(str(target_line_num), ''),
+                    'target_hierarchy': target_hierarchy.get(target_line_num, ''),
                     'insertion_type': point_info['insertion_type'],
                     'new_sections': point_info['new_sections']
                 }
🤖 Prompt for AI Agents
In scripts/translate_doc_pr/section_matcher.py around lines 493 to 497, the code
uses str(target_line_num) to read from target_hierarchy whose keys are integers,
causing lookups to return an empty string; change the lookup to use the integer
key (i.e., target_hierarchy.get(target_line_num, '')) so the correct value is
returned (and keep the existing fallback ''), ensuring types match between the
key and the lookup.
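
A tiny illustration of why the str() lookup always misses (the dictionary contents here are made up):

```python
# target_hierarchy is keyed by int line numbers, so a string key never matches.
target_hierarchy = {42: "Overview > Configuration"}

print(target_hierarchy.get(str(42), ''))  # '' - the string '42' is not the int 42
print(target_hierarchy.get(42, ''))       # 'Overview > Configuration'
```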

Comment on lines +866 to +875
"""Extract target section content from target_lines (includes sub-sections)"""
if target_line_num >= len(target_lines):
return ""

start_line = target_line_num - 1 # Convert to 0-based index

# Find the end of the section by looking for the next header
current_line = target_lines[start_line].strip()
if not current_line.startswith('#'):
return current_line

⚠️ Potential issue | 🟠 Major

Off-by-one: allow extracting content at the last line.

The guard should be a 1-based range check.

Apply this diff:

-    if target_line_num >= len(target_lines):
+    if target_line_num < 1 or target_line_num > len(target_lines):
         return ""
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     """Extract target section content from target_lines (includes sub-sections)"""
-    if target_line_num >= len(target_lines):
+    if target_line_num < 1 or target_line_num > len(target_lines):
         return ""
 
     start_line = target_line_num - 1  # Convert to 0-based index
 
     # Find the end of the section by looking for the next header
     current_line = target_lines[start_line].strip()
     if not current_line.startswith('#'):
         return current_line
🤖 Prompt for AI Agents
In scripts/translate_doc_pr/section_matcher.py around lines 866 to 875, the
current guard uses a 0-based check (if target_line_num >= len(target_lines))
which prevents extracting the last line when target_line_num is 1-based; change
the guard to a 1-based range check (e.g., if target_line_num < 1 or
target_line_num > len(target_lines): return "") so valid 1-based indices
including the last line are allowed, then keep start_line = target_line_num - 1
and the subsequent logic unchanged.
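
A small sketch of the corrected 1-based guard (the sample lines are made up):

```python
# With a 1-based line number, len(target_lines) is the last valid index.
def is_valid_line_num(target_line_num, target_lines):
    return 1 <= target_line_num <= len(target_lines)

target_lines = ["# Title", "Intro text", "## Last section"]
print(is_valid_line_num(3, target_lines))  # True  - the old `>= len` guard wrongly rejected this
print(is_valid_line_num(4, target_lines))  # False - past the end
print(is_valid_line_num(0, target_lines))  # False - line numbers start at 1
```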

Comment on lines +905 to +913
def extract_section_direct_content(target_line_num, target_lines):
    """Extract ONLY the direct content of a section (excluding sub-sections)"""
    if target_line_num >= len(target_lines):
        return ""

    start_line = target_line_num - 1  # Convert to 0-based index

    # Find the end of the section by looking for the next header
    current_line = target_lines[start_line].strip()

⚠️ Potential issue | 🟠 Major

Same off-by-one in direct-content extractor.

Apply this diff:

-    if target_line_num >= len(target_lines):
+    if target_line_num < 1 or target_line_num > len(target_lines):
         return ""
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 def extract_section_direct_content(target_line_num, target_lines):
     """Extract ONLY the direct content of a section (excluding sub-sections)"""
-    if target_line_num >= len(target_lines):
+    if target_line_num < 1 or target_line_num > len(target_lines):
         return ""
 
     start_line = target_line_num - 1  # Convert to 0-based index
 
     # Find the end of the section by looking for the next header
     current_line = target_lines[start_line].strip()
🤖 Prompt for AI Agents
In scripts/translate_doc_pr/section_matcher.py around lines 905 to 913, the
direct-content extractor has the same off-by-one bug: the guard checks
target_line_num >= len(target_lines) even though target_line_num is 1-based, so
content at the last line can never be extracted. Fix it by changing the guard to
a 1-based range check (if target_line_num < 1 or target_line_num >
len(target_lines): return "") and keep start_line = target_line_num - 1; leave
the subsequent current_line = target_lines[start_line].strip() and section end
detection logic unchanged.

Comment on lines +274 to +278
    # Add token estimation
    try:
        from main import print_token_estimation
        print_token_estimation(prompt, "TOC translation")
    except ImportError:

⚠️ Potential issue | 🟠 Major

Wrong module import: print_token_estimation is in main_workflow.py.

-    try:
-        from main import print_token_estimation
+    try:
+        from main_workflow import print_token_estimation
         print_token_estimation(prompt, "TOC translation")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     # Add token estimation
     try:
-        from main import print_token_estimation
+        from main_workflow import print_token_estimation
         print_token_estimation(prompt, "TOC translation")
     except ImportError:
🤖 Prompt for AI Agents
In scripts/translate_doc_pr/toc_processor.py around lines 274 to 278, the code
attempts to import print_token_estimation from main but that function lives in
main_workflow.py; change the import to "from main_workflow import
print_token_estimation" (keeping the existing try/except ImportError block) so
the correct module is referenced and token estimation is called successfully.
