Update system-variables #43
Conversation
Walkthrough
Adds a GitHub Actions workflow to sync English docs from a Chinese PR and introduces a new Python-based pipeline under scripts/translate_doc_pr for analyzing PR diffs, matching sections, translating content via pluggable AI providers, and applying changes (add/update/delete/TOC) to target repos. Includes an orchestrating entry point, helpers, and dependencies.
Sequence Diagram(s)
sequenceDiagram
autonumber
actor Dev as Developer
participant GH as GitHub Actions
participant MW as main_workflow.py
participant GHAPI as GitHub API
participant AI as UnifiedAIClient
participant Repo as Target Repo (branch)
Dev->>GH: Dispatch Workflow (source_pr_url, target_pr_url, ai_provider)
GH->>MW: Run Python entrypoint with env/tokens
MW->>GHAPI: Fetch target PR branch, source PR diff
MW->>MW: Classify changes (added/modified/deleted/TOC)
opt Added files
MW->>AI: Translate in batches
MW->>Repo: Write new files
end
opt Modified/Deleted sections
MW->>GHAPI: Get source/target contents
MW->>AI: Map/translate sections
MW->>Repo: Apply updates/deletions
end
opt TOC
MW->>AI: Translate required lines
MW->>Repo: Apply TOC edits
end
MW->>Repo: Stage and commit if changes
GH->>GHAPI: Push commit, comment on target PR
sequenceDiagram
autonumber
participant MW as main_workflow.py
participant ANA as pr_analyzer.py
participant MATCH as section_matcher.py
participant UPD as file_updater.py
participant ADD as file_adder.py
participant DEL as file_deleter.py
participant TOC as toc_processor.py
participant AI as UnifiedAIClient
MW->>ANA: get_pr_diff + build hierarchies
ANA-->>MW: operations + section structures
alt Added files
MW->>ADD: process_added_files(...)
ADD->>AI: translate batches
else Modified sections
MW->>MATCH: match_source_diff_to_target(...)
MATCH->>AI: assist mapping
MW->>UPD: process_modified_sections(...)
UPD->>AI: get_updated_sections_from_ai(...)
else Deleted sections/files
MW->>DEL: process_deleted_files(...) or UPD.delete_sections(...)
else TOC changes
MW->>TOC: process_toc_files(...)
TOC->>AI: translate_toc_lines(...)
end
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120+ minutes
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Summary of Changes
Hello @qiancai, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request primarily introduces a new, modular Python-based tool for automating documentation synchronization and translation across different language repositories. It integrates AI capabilities for intelligent content matching and translation, significantly improving the efficiency and accuracy of documentation updates. Additionally, it includes minor but important updates to existing documentation regarding system variables and TiKV configuration.
Code Review
This pull request introduces a substantial new Python package for automating documentation translation, alongside some documentation updates. The new translation tool is a significant piece of work, but it has several critical issues that need to be addressed before it can be considered production-ready. The main problems are incorrect module imports that will cause runtime failures, a high degree of code duplication across modules, and several extremely large and complex functions that make the code difficult to maintain and debug. My review focuses on these structural and maintainability issues in the Python code. The documentation changes themselves are minor and look good.
""" | ||
|
||
# Import main functionality for easy access | ||
from main import main |
This import statement is incorrect and will cause a runtime error. There is no top-level `main` module in the standard library or in this project. Based on the project structure, you likely intended to import from `main_workflow.py` within this package. This should be a relative import: `from .main_workflow import main`.
This same incorrect import pattern (`from main import ...`) appears in several other files in this package and needs to be fixed throughout for the package to be importable and usable.
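For reference, a minimal sketch of the corrected import, assuming this snippet is the package's `__init__.py` and the entry point is `main()` in `scripts/translate_doc_pr/main_workflow.py`:

```python
# scripts/translate_doc_pr/__init__.py
# Import main functionality for easy access (relative import within the package)
from .main_workflow import main

__all__ = ["main"]
```

Note that relative imports only work when the directory is imported as a package (or installed); they fail if a module inside it is executed directly as a script, which may be how the workflow currently invokes it.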
def process_single_file(file_path, source_sections, pr_diff, pr_url, github_client, ai_client, repo_config, max_non_system_sections=120): | ||
"""Process a single file - thread-safe function for parallel processing""" | ||
thread_id = threading.current_thread().name | ||
thread_safe_print(f"\n📄 [{thread_id}] Processing {file_path}") | ||
|
||
try: | ||
# Check if this is a TOC file with special operations | ||
if isinstance(source_sections, dict) and 'type' in source_sections and source_sections['type'] == 'toc': | ||
from toc_processor import process_toc_file | ||
return process_toc_file(file_path, source_sections, pr_url, github_client, ai_client, repo_config) | ||
|
||
# Check if this is enhanced sections | ||
if isinstance(source_sections, dict) and 'sections' in source_sections: | ||
if source_sections.get('type') == 'enhanced_sections': | ||
# Skip all the matching logic and directly extract data | ||
thread_safe_print(f" [{thread_id}] 🚀 Using enhanced sections data, skipping matching logic") | ||
enhanced_sections = source_sections['sections'] | ||
|
||
# Extract target sections and source old content from enhanced sections | ||
# Maintain the exact order from match_source_diff_to_target.json | ||
from collections import OrderedDict | ||
target_sections = OrderedDict() | ||
source_old_content_dict = OrderedDict() | ||
|
||
# Process in the exact order they appear in enhanced_sections (which comes from match_source_diff_to_target.json) | ||
for key, section_info in enhanced_sections.items(): | ||
if isinstance(section_info, dict): | ||
operation = section_info.get('source_operation', '') | ||
|
||
# Skip deleted sections - they shouldn't be in the enhanced_sections anyway | ||
if operation == 'deleted': | ||
continue | ||
|
||
# For source sections: use old_content for modified, new_content for added | ||
if operation == 'added': | ||
source_content = section_info.get('source_new_content', '') | ||
else: # modified | ||
source_content = section_info.get('source_old_content', '') | ||
|
||
# For target sections: use target_content for modified, empty string for added | ||
if operation == 'added': | ||
target_content = "" # Added sections have no existing target content | ||
else: # modified | ||
target_content = section_info.get('target_content', '') | ||
|
||
# Add to both dictionaries using the same key from match_source_diff_to_target.json | ||
if source_content is not None: | ||
source_old_content_dict[key] = source_content | ||
target_sections[key] = target_content | ||
|
||
thread_safe_print(f" [{thread_id}] 📊 Extracted: {len(target_sections)} target sections, {len(source_old_content_dict)} source old content entries") | ||
|
||
# Update sections with AI (get-updated-target-sections.py logic) | ||
thread_safe_print(f" [{thread_id}] 🤖 Getting updated sections from AI...") | ||
updated_sections = get_updated_sections_from_ai(pr_diff, target_sections, source_old_content_dict, ai_client, repo_config['source_language'], repo_config['target_language'], file_path) | ||
if not updated_sections: | ||
thread_safe_print(f" [{thread_id}] ⚠️ Could not get AI update") | ||
return False, f"Could not get AI update for {file_path}" | ||
|
||
# Return the AI results for further processing | ||
thread_safe_print(f" [{thread_id}] ✅ Successfully got AI translation results for {file_path}") | ||
return True, updated_sections # Return the actual AI results | ||
|
||
else: | ||
# New format: complete data structure | ||
actual_sections = source_sections['sections'] | ||
|
||
# Regular file processing continues here for old format | ||
# Get target hierarchy and content (get-target-affected-hierarchy.py logic) | ||
from pr_analyzer import get_target_hierarchy_and_content | ||
target_hierarchy, target_lines = get_target_hierarchy_and_content(file_path, github_client, repo_config['target_repo']) | ||
if not target_hierarchy: | ||
thread_safe_print(f" [{thread_id}] ⚠️ Could not get target content") | ||
return False, f"Could not get target content for {file_path}" | ||
else: | ||
# Old format: direct dict | ||
actual_sections = source_sections | ||
|
||
# Only do mapping if we don't have enhanced sections | ||
if 'enhanced_sections' not in locals() or not enhanced_sections: | ||
# Separate different types of sections | ||
from section_matcher import is_system_variable_or_config | ||
system_var_sections = {} | ||
toplevel_sections = {} | ||
frontmatter_sections = {} | ||
regular_sections = {} | ||
|
||
for line_num, hierarchy in actual_sections.items(): | ||
if line_num == "0" and hierarchy == "frontmatter": | ||
# Special handling for frontmatter | ||
frontmatter_sections[line_num] = hierarchy | ||
else: | ||
# Extract the leaf title from hierarchy | ||
leaf_title = hierarchy.split(' > ')[-1] if ' > ' in hierarchy else hierarchy | ||
|
||
if is_system_variable_or_config(leaf_title): | ||
system_var_sections[line_num] = hierarchy | ||
elif leaf_title.startswith('# '): | ||
# Top-level titles need special handling | ||
toplevel_sections[line_num] = hierarchy | ||
else: | ||
regular_sections[line_num] = hierarchy | ||
|
||
thread_safe_print(f" [{thread_id}] 📊 Found {len(system_var_sections)} system variable/config, {len(toplevel_sections)} top-level, {len(frontmatter_sections)} frontmatter, and {len(regular_sections)} regular sections") | ||
|
||
target_affected = {} | ||
|
||
# Process frontmatter sections with special handling | ||
if frontmatter_sections: | ||
thread_safe_print(f" [{thread_id}] 📄 Processing frontmatter section...") | ||
# For frontmatter, we simply map it to line 0 in target | ||
for line_num, hierarchy in frontmatter_sections.items(): | ||
target_affected[line_num] = hierarchy | ||
thread_safe_print(f" [{thread_id}] ✅ Mapped {len(frontmatter_sections)} frontmatter section") | ||
|
||
# Process top-level titles with special matching | ||
if toplevel_sections: | ||
thread_safe_print(f" [{thread_id}] 🔝 Top-level title matching for {len(toplevel_sections)} sections...") | ||
from section_matcher import find_toplevel_title_matches | ||
toplevel_matched, toplevel_failed, toplevel_skipped = find_toplevel_title_matches(toplevel_sections, target_lines) | ||
|
||
if toplevel_matched: | ||
target_affected.update(toplevel_matched) | ||
thread_safe_print(f" [{thread_id}] ✅ Top-level matched {len(toplevel_matched)} sections") | ||
|
||
if toplevel_failed: | ||
thread_safe_print(f" [{thread_id}] ⚠️ {len(toplevel_failed)} top-level sections failed matching") | ||
for failed in toplevel_failed: | ||
thread_safe_print(f" ❌ {failed['hierarchy']}: {failed['reason']}") | ||
|
||
# Process system variables/config sections with direct matching | ||
if system_var_sections: | ||
thread_safe_print(f" [{thread_id}] 🎯 Direct matching {len(system_var_sections)} system variable/config sections...") | ||
from section_matcher import find_direct_matches_for_special_files | ||
direct_matched, failed_matches, skipped_sections = find_direct_matches_for_special_files(system_var_sections, target_hierarchy, target_lines) | ||
|
||
if direct_matched: | ||
target_affected.update(direct_matched) | ||
thread_safe_print(f" [{thread_id}] ✅ Direct matched {len(direct_matched)} system variable/config sections") | ||
|
||
if failed_matches: | ||
thread_safe_print(f" [{thread_id}] ⚠️ {len(failed_matches)} system variable/config sections failed direct matching") | ||
for failed in failed_matches: | ||
thread_safe_print(f" ❌ {failed['hierarchy']}: {failed['reason']}") | ||
|
||
# Process regular sections with AI mapping using filtered target hierarchy | ||
if regular_sections: | ||
thread_safe_print(f" [{thread_id}] 🤖 AI mapping {len(regular_sections)} regular sections...") | ||
|
||
# Filter target hierarchy to only include non-system sections for AI mapping | ||
from section_matcher import filter_non_system_sections | ||
filtered_target_hierarchy = filter_non_system_sections(target_hierarchy) | ||
|
||
# Check if filtered target hierarchy exceeds the maximum allowed for AI mapping | ||
MAX_NON_SYSTEM_SECTIONS_FOR_AI = 120 | ||
if len(filtered_target_hierarchy) > MAX_NON_SYSTEM_SECTIONS_FOR_AI: | ||
thread_safe_print(f" [{thread_id}] ❌ Too many non-system sections ({len(filtered_target_hierarchy)} > {MAX_NON_SYSTEM_SECTIONS_FOR_AI})") | ||
thread_safe_print(f" [{thread_id}] ⚠️ Skipping AI mapping for regular sections to avoid complexity") | ||
|
||
# If no system sections were matched either, return error | ||
if not target_affected: | ||
error_message = f"File {file_path} has too many non-system sections ({len(filtered_target_hierarchy)} > {MAX_NON_SYSTEM_SECTIONS_FOR_AI}) and no system variable sections were matched" | ||
return False, error_message | ||
|
||
# Continue with only system variable matches if available | ||
thread_safe_print(f" [{thread_id}] ✅ Proceeding with {len(target_affected)} system variable/config sections only") | ||
else: | ||
# Proceed with AI mapping using filtered hierarchy | ||
source_list = list(regular_sections.values()) | ||
target_list = list(filtered_target_hierarchy.values()) | ||
|
||
from section_matcher import get_corresponding_sections | ||
ai_response = get_corresponding_sections(source_list, target_list, ai_client, repo_config['source_language'], repo_config['target_language'], max_tokens=20000) | ||
if ai_response: | ||
# Parse AI response and find matching line numbers in the original (unfiltered) hierarchy | ||
from section_matcher import parse_ai_response, find_matching_line_numbers | ||
ai_sections = parse_ai_response(ai_response) | ||
ai_matched = find_matching_line_numbers(ai_sections, target_hierarchy) # Use original hierarchy for line number lookup | ||
|
||
if ai_matched: | ||
target_affected.update(ai_matched) | ||
thread_safe_print(f" [{thread_id}] ✅ AI mapped {len(ai_matched)} regular sections") | ||
else: | ||
thread_safe_print(f" [{thread_id}] ⚠️ AI mapping failed for regular sections") | ||
else: | ||
thread_safe_print(f" [{thread_id}] ⚠️ Could not get AI response for regular sections") | ||
|
||
# Summary of mapping results | ||
thread_safe_print(f" [{thread_id}] 📊 Total mapped: {len(target_affected)} out of {len(actual_sections)} sections") | ||
|
||
if not target_affected: | ||
thread_safe_print(f" [{thread_id}] ⚠️ Could not map sections") | ||
return False, f"Could not map sections for {file_path}" | ||
|
||
thread_safe_print(f" [{thread_id}] ✅ Mapped {len(target_affected)} sections") | ||
|
||
# Extract target sections (get-target-affected-sections.py logic) | ||
thread_safe_print(f" [{thread_id}] 📝 Extracting target sections...") | ||
from pr_analyzer import extract_affected_sections | ||
target_sections = extract_affected_sections(target_affected, target_lines) | ||
|
||
# Extract source old content from the enhanced data structure | ||
thread_safe_print(f" [{thread_id}] 📖 Extracting source old content...") | ||
source_old_content_dict = {} | ||
|
||
# Handle different data structures for source_sections | ||
if isinstance(source_sections, dict) and 'sections' in source_sections: | ||
# New format: complete data structure with enhanced matching info | ||
for key, section_info in source_sections.items(): | ||
if isinstance(section_info, dict) and 'source_old_content' in section_info: | ||
source_old_content_dict[key] = section_info['source_old_content'] | ||
else: | ||
# Fallback: if we don't have the enhanced structure, we need to get it differently | ||
thread_safe_print(f" [{thread_id}] ⚠️ Source sections missing enhanced structure, using fallback") | ||
# For now, create empty dict to avoid errors - this should be addressed in the calling code | ||
source_old_content_dict = {} | ||
|
||
# Update sections with AI (get-updated-target-sections.py logic) | ||
thread_safe_print(f" [{thread_id}] 🤖 Getting updated sections from AI...") | ||
updated_sections = get_updated_sections_from_ai(pr_diff, target_sections, source_old_content_dict, ai_client, repo_config['source_language'], repo_config['target_language'], file_path) | ||
if not updated_sections: | ||
thread_safe_print(f" [{thread_id}] ⚠️ Could not get AI update") | ||
return False, f"Could not get AI update for {file_path}" | ||
|
||
# Update local document (update-target-doc-v2.py logic) | ||
thread_safe_print(f" [{thread_id}] 💾 Updating local document...") | ||
success = update_local_document(file_path, updated_sections, target_affected, repo_config['target_local_path']) | ||
|
||
if success: | ||
thread_safe_print(f" [{thread_id}] 🎉 Successfully updated {file_path}") | ||
return True, f"Successfully updated {file_path}" | ||
else: | ||
thread_safe_print(f" [{thread_id}] ❌ Failed to update {file_path}") | ||
return False, f"Failed to update {file_path}" | ||
|
||
except Exception as e: | ||
thread_safe_print(f" [{thread_id}] ❌ Error processing {file_path}: {e}") | ||
return False, f"Error processing {file_path}: {e}" |
The function `process_single_file` is extremely long (over 200 lines) and handles many different cases (TOC files, enhanced sections, regular files with different section types). This violates the Single Responsibility Principle and makes the code very difficult to read, test, and maintain. Please refactor this function by breaking it down into smaller, more focused functions, each handling a specific part of the logic.
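One possible shape for the refactor is sketched below; the `_process_*` helpers are hypothetical names for functions that would absorb the existing branches, not code that exists in the package:

```python
def process_single_file(file_path, source_sections, pr_diff, pr_url,
                        github_client, ai_client, repo_config,
                        max_non_system_sections=120):
    """Thin dispatcher: route each input shape to a focused handler."""
    try:
        if isinstance(source_sections, dict) and source_sections.get('type') == 'toc':
            return _process_toc_file(file_path, source_sections, pr_url,
                                     github_client, ai_client, repo_config)
        if isinstance(source_sections, dict) and source_sections.get('type') == 'enhanced_sections':
            return _process_enhanced_sections(file_path, source_sections['sections'],
                                              pr_diff, ai_client, repo_config)
        # Legacy shape: map sections (frontmatter / top-level / system vars / regular),
        # then extract target content, translate, and write the document back.
        return _process_legacy_sections(file_path, source_sections, pr_diff, pr_url,
                                        github_client, ai_client, repo_config,
                                        max_non_system_sections)
    except Exception as exc:
        return False, f"Error processing {file_path}: {exc}"
```

Each handler could then be unit-tested in isolation, and the legacy path could keep delegating to the existing `section_matcher` and `pr_analyzer` helpers.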
def process_regular_modified_file(source_file_path, file_sections, file_diff, pr_url, github_client, ai_client, repo_config, max_sections): | ||
"""Process a regular markdown file that has been modified""" | ||
try: | ||
print(f" 📝 Processing as regular modified file: {source_file_path}") | ||
|
||
# Extract the actual sections from the file_sections structure | ||
# file_sections contains: {'sections': {...}, 'original_hierarchy': {...}, 'current_hierarchy': {...}} | ||
if isinstance(file_sections, dict) and 'sections' in file_sections: | ||
actual_sections = file_sections['sections'] | ||
else: | ||
# Fallback: assume file_sections is already the sections dict | ||
actual_sections = file_sections | ||
|
||
print(f" 📊 Extracted sections: {len(actual_sections)} sections") | ||
|
||
# CRITICAL: Load the source-diff-dict.json and perform matching | ||
import json | ||
import os | ||
from section_matcher import match_source_diff_to_target | ||
from pr_analyzer import get_target_hierarchy_and_content | ||
|
||
# Load source-diff-dict.json with file prefix | ||
temp_dir = ensure_temp_output_dir() | ||
file_prefix = source_file_path.replace('/', '-').replace('.md', '') | ||
source_diff_dict_file = os.path.join(temp_dir, f"{file_prefix}-source-diff-dict.json") | ||
if os.path.exists(source_diff_dict_file): | ||
with open(source_diff_dict_file, 'r', encoding='utf-8') as f: | ||
source_diff_dict = json.load(f) | ||
print(f" 📂 Loaded source diff dict with {len(source_diff_dict)} sections from {source_diff_dict_file}") | ||
|
||
# Check source token limit before proceeding with processing | ||
print(f" 🔍 Checking source token limit...") | ||
within_limit, total_tokens, token_limit = check_source_token_limit(source_diff_dict_file) | ||
if not within_limit: | ||
print(f" 🚫 Skipping file processing: source content exceeds token limit") | ||
print(f" 📊 Total tokens: {total_tokens:,} > Limit: {token_limit:,}") | ||
print(f" ⏭️ File {source_file_path} will not be processed") | ||
return False | ||
|
||
else: | ||
print(f" ❌ {source_diff_dict_file} not found") | ||
return False | ||
|
||
# Get target file hierarchy and content | ||
target_repo = repo_config['target_repo'] | ||
target_hierarchy, target_lines = get_target_hierarchy_and_content(source_file_path, github_client, target_repo) | ||
|
||
if not target_hierarchy or not target_lines: | ||
print(f" ❌ Could not get target file content for {source_file_path}") | ||
return False | ||
|
||
print(f" 📖 Target file: {len(target_hierarchy)} sections, {len(target_lines)} lines") | ||
|
||
# Perform source diff to target matching | ||
print(f" 🔗 Matching source diff to target...") | ||
enhanced_sections = match_source_diff_to_target( | ||
source_diff_dict, | ||
target_hierarchy, | ||
target_lines, | ||
ai_client, | ||
repo_config, | ||
max_sections, | ||
AI_MAX_TOKENS | ||
) | ||
|
||
if not enhanced_sections: | ||
print(f" ❌ No sections matched") | ||
return False | ||
|
||
print(f" ✅ Matched {len(enhanced_sections)} sections") | ||
|
||
# Save the match result for reference | ||
match_file = os.path.join(temp_dir, f"{source_file_path.replace('/', '-').replace('.md', '')}-match_source_diff_to_target.json") | ||
with open(match_file, 'w', encoding='utf-8') as f: | ||
json.dump(enhanced_sections, f, ensure_ascii=False, indent=2) | ||
print(f" 💾 Saved match result to: {match_file}") | ||
|
||
# Step 2: Get AI translation for the matched sections | ||
print(f" 🤖 Getting AI translation for matched sections...") | ||
|
||
# Create file data structure with enhanced matching info | ||
# Wrap enhanced_sections in the expected format for process_single_file | ||
file_data = { | ||
source_file_path: { | ||
'type': 'enhanced_sections', | ||
'sections': enhanced_sections | ||
} | ||
} | ||
|
||
# Call the existing process_modified_sections function to get AI translation | ||
results = process_modified_sections(file_data, file_diff, pr_url, github_client, ai_client, repo_config, max_sections) | ||
|
||
# Step 3: Update match_source_diff_to_target.json with AI results | ||
if results and len(results) > 0: | ||
file_path, success, ai_updated_sections = results[0] # Get first result | ||
if success and isinstance(ai_updated_sections, dict): | ||
print(f" 📝 Step 3: Updating {match_file} with AI results...") | ||
|
||
# Load current match_source_diff_to_target.json | ||
with open(match_file, 'r', encoding='utf-8') as f: | ||
match_data = json.load(f) | ||
|
||
# Add target_new_content field to each section based on AI results | ||
updated_count = 0 | ||
for key, section_data in match_data.items(): | ||
operation = section_data.get('source_operation', '') | ||
|
||
if operation == 'deleted': | ||
# For deleted sections, set target_new_content to null | ||
section_data['target_new_content'] = None | ||
elif key in ai_updated_sections: | ||
# For modified/added sections with AI translation | ||
section_data['target_new_content'] = ai_updated_sections[key] | ||
updated_count += 1 | ||
else: | ||
# For sections not translated, keep original content | ||
section_data['target_new_content'] = section_data.get('target_content', '') | ||
|
||
# Save updated match_source_diff_to_target.json | ||
with open(match_file, 'w', encoding='utf-8') as f: | ||
json.dump(match_data, f, ensure_ascii=False, indent=2) | ||
|
||
print(f" ✅ Updated {updated_count} sections with AI translations in {match_file}") | ||
|
||
# Step 4: Apply updates to target document using update_target_document_from_match_data | ||
print(f" 📝 Step 4: Applying updates to target document...") | ||
from file_updater import update_target_document_from_match_data | ||
|
||
success = update_target_document_from_match_data(match_file, repo_config['target_local_path'], source_file_path) | ||
if success: | ||
print(f" 🎉 Target document successfully updated!") | ||
return True | ||
else: | ||
print(f" ❌ Failed to update target document") | ||
return False | ||
|
||
else: | ||
print(f" ⚠️ AI translation failed or returned invalid results") | ||
return False | ||
else: | ||
print(f" ⚠️ No results from process_modified_sections") | ||
return False | ||
|
||
except Exception as e: | ||
print(f" ❌ Error processing regular modified file {source_file_path}: {e}") | ||
return False |
The function `process_regular_modified_file` is very large and contains complex logic for processing a single file. This function's logic is tightly coupled with file updating and matching, and it would be better placed in the `file_updater.py` or `section_matcher.py` module. Moving this out of `main_workflow.py` would make the main workflow clearer and improve the overall modularity of the package.
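If the implementation does move, `main_workflow.py` could keep a thin re-export during the transition so existing callers are not broken (a sketch, assuming the body is relocated unchanged into `file_updater.py`):

```python
# main_workflow.py — temporary shim after moving the implementation
from file_updater import process_regular_modified_file  # noqa: F401
```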
def analyze_source_changes(pr_url, github_client, special_files=None, ignore_files=None, repo_configs=None, max_non_system_sections=120, pr_diff=None): | ||
"""Analyze source language changes and categorize them as added, modified, or deleted""" | ||
# Import modules needed in this function | ||
import os | ||
import json | ||
from toc_processor import process_toc_operations | ||
|
||
owner, repo, pr_number = parse_pr_url(pr_url) | ||
repository = github_client.get_repo(f"{owner}/{repo}") | ||
pr = repository.get_pull(pr_number) | ||
|
||
# Get repository configuration for target repo info | ||
repo_config = get_repo_config(pr_url, repo_configs) | ||
|
||
print(f"📋 Processing PR #{pr_number}: {pr.title}") | ||
|
||
# Get markdown files | ||
files = pr.get_files() | ||
markdown_files = [f for f in files if f.filename.endswith('.md')] | ||
|
||
print(f"📄 Found {len(markdown_files)} markdown files") | ||
|
||
# Return dictionaries for different operation types | ||
added_sections = {} # New sections that were added | ||
modified_sections = {} # Existing sections that were modified | ||
deleted_sections = {} # Sections that were deleted | ||
added_files = {} # Completely new files that were added | ||
deleted_files = [] # Completely deleted files | ||
ignored_files = [] # Files that were ignored | ||
toc_files = {} # Special TOC files requiring special processing | ||
|
||
for file in markdown_files: | ||
print(f"\n🔍 Analyzing {file.filename}") | ||
|
||
# Check if this file should be ignored | ||
if file.filename in ignore_files: | ||
print(f" ⏭️ Skipping ignored file: {file.filename}") | ||
ignored_files.append(file.filename) | ||
continue | ||
|
||
# Check if this is a completely new file or deleted file | ||
if file.status == 'added': | ||
print(f" ➕ Detected new file: {file.filename}") | ||
try: | ||
file_content = repository.get_contents(file.filename, ref=pr.head.sha).decoded_content.decode('utf-8') | ||
added_files[file.filename] = file_content | ||
print(f" ✅ Added complete file for translation") | ||
continue | ||
except Exception as e: | ||
print(f" ❌ Error getting new file content: {e}") | ||
continue | ||
|
||
elif file.status == 'removed': | ||
print(f" 🗑️ Detected deleted file: {file.filename}") | ||
deleted_files.append(file.filename) | ||
print(f" ✅ Marked file for deletion") | ||
continue | ||
|
||
# For modified files, check if it's a special file like TOC.md | ||
try: | ||
file_content = repository.get_contents(file.filename, ref=pr.head.sha).decoded_content.decode('utf-8') | ||
except Exception as e: | ||
print(f" ❌ Error getting content: {e}") | ||
continue | ||
|
||
# Check if this is a TOC.md file requiring special processing | ||
if os.path.basename(file.filename) in special_files: | ||
print(f" 📋 Detected special file: {file.filename}") | ||
|
||
# Get target file content for comparison | ||
try: | ||
target_repository = github_client.get_repo(repo_config['target_repo']) | ||
target_file_content = target_repository.get_contents(file.filename, ref="master").decoded_content.decode('utf-8') | ||
target_lines = target_file_content.split('\n') | ||
except Exception as e: | ||
print(f" ⚠️ Could not get target file content: {e}") | ||
continue | ||
|
||
# Analyze diff operations for TOC.md | ||
operations = analyze_diff_operations(file) | ||
source_lines = file_content.split('\n') | ||
|
||
# Process with special TOC logic | ||
toc_results = process_toc_operations(file.filename, operations, source_lines, target_lines, "") # Local path will be determined later | ||
|
||
# Store TOC operations for later processing | ||
if any([toc_results['added'], toc_results['modified'], toc_results['deleted']]): | ||
# Combine all operations for processing | ||
all_toc_operations = [] | ||
all_toc_operations.extend(toc_results['added']) | ||
all_toc_operations.extend(toc_results['modified']) | ||
all_toc_operations.extend(toc_results['deleted']) | ||
|
||
# Add to special TOC processing queue (separate from regular sections) | ||
toc_files[file.filename] = { | ||
'type': 'toc', | ||
'operations': all_toc_operations | ||
} | ||
|
||
print(f" 📋 TOC operations queued for processing:") | ||
if toc_results['added']: | ||
print(f" ➕ Added: {len(toc_results['added'])} entries") | ||
if toc_results['modified']: | ||
print(f" ✏️ Modified: {len(toc_results['modified'])} entries") | ||
if toc_results['deleted']: | ||
print(f" ❌ Deleted: {len(toc_results['deleted'])} entries") | ||
else: | ||
print(f" ℹ️ No TOC operations found") | ||
|
||
continue # Skip regular processing for TOC files | ||
|
||
# Analyze diff operations | ||
operations = analyze_diff_operations(file) | ||
print(f" 📝 Diff analysis: {len(operations['added_lines'])} added, {len(operations['modified_lines'])} modified, {len(operations['deleted_lines'])} deleted lines") | ||
|
||
lines = file_content.split('\n') | ||
all_headers = {} | ||
|
||
# Track code block state | ||
in_code_block = False | ||
code_block_delimiter = None | ||
|
||
# First pass: collect all headers (excluding those in code blocks) | ||
for line_num, line in enumerate(lines, 1): | ||
original_line = line | ||
line = line.strip() | ||
|
||
# Check for code block delimiters | ||
if line.startswith('```') or line.startswith('~~~'): | ||
if not in_code_block: | ||
# Entering a code block | ||
in_code_block = True | ||
code_block_delimiter = line[:3] | ||
continue | ||
elif line.startswith(code_block_delimiter): | ||
# Exiting a code block | ||
in_code_block = False | ||
code_block_delimiter = None | ||
continue | ||
|
||
# Skip processing if we're inside a code block | ||
if in_code_block: | ||
continue | ||
|
||
# Process headers only if not in code block | ||
if line.startswith('#'): | ||
match = re.match(r'^(#{1,10})\s+(.+)', line) | ||
if match: | ||
level = len(match.group(1)) | ||
title = match.group(2).strip() | ||
all_headers[line_num] = { | ||
'level': level, | ||
'title': title, | ||
'line': line | ||
} | ||
|
||
# Build complete hierarchy from HEAD (after changes) | ||
all_hierarchy_dict = build_hierarchy_dict(file_content) | ||
|
||
# For deletion detection, we also need the base file hierarchy | ||
try: | ||
base_file_content = repository.get_contents(file.filename, ref=f"{repository.default_branch}").decoded_content.decode('utf-8') | ||
base_hierarchy_dict = build_hierarchy_dict(base_file_content) | ||
except Exception as e: | ||
print(f" ⚠️ Could not get base file content: {e}") | ||
base_hierarchy_dict = all_hierarchy_dict | ||
base_file_content = file_content # Fallback to current content | ||
|
||
# Find sections by operation type with corrected logic | ||
sections_by_type = find_sections_by_operation_type(lines, operations, all_headers, base_hierarchy_dict) | ||
|
||
# Prioritize modified headers over added ones (fix for header changes like --host -> --hosts) | ||
modified_header_lines = set() | ||
for modified_line in operations['modified_lines']: | ||
if modified_line['is_header']: | ||
modified_header_lines.add(modified_line['line_number']) | ||
|
||
# Remove modified header lines from added set | ||
sections_by_type['added'] = sections_by_type['added'] - modified_header_lines | ||
|
||
# Enhanced logic: check for actual content changes within sections | ||
# This helps detect changes in section content (not just headers) | ||
print(f" 🔍 Enhanced detection: checking for actual section content changes...") | ||
|
||
# Get only lines that have actual content changes (exclude headers) | ||
real_content_changes = set() | ||
|
||
# Added lines (new content, excluding headers) | ||
for added_line in operations['added_lines']: | ||
if not added_line['is_header']: | ||
real_content_changes.add(added_line['line_number']) | ||
|
||
# Deleted lines (removed content, excluding headers) | ||
for deleted_line in operations['deleted_lines']: | ||
if not deleted_line['is_header']: | ||
real_content_changes.add(deleted_line['line_number']) | ||
|
||
# Modified lines (changed content, excluding headers) | ||
for modified_line in operations['modified_lines']: | ||
if not modified_line['is_header']: | ||
real_content_changes.add(modified_line['line_number']) | ||
|
||
print(f" 📝 Real content changes (non-header): {sorted(real_content_changes)}") | ||
|
||
# Find sections that contain actual content changes | ||
content_affected_sections = set() | ||
for changed_line in real_content_changes: | ||
# Find which section this changed line belongs to | ||
containing_section = None | ||
for line_num in sorted(all_headers.keys()): | ||
if line_num <= changed_line: | ||
containing_section = line_num | ||
else: | ||
break | ||
|
||
if containing_section and containing_section not in sections_by_type['added']: | ||
# Additional check: make sure this is not just a line number shift | ||
# Only add if the change is within reasonable distance from the section header | ||
# AND if the changed line is not part of a completely deleted section header | ||
is_deleted_header = False | ||
for deleted_line in operations['deleted_lines']: | ||
if (deleted_line['is_header'] and | ||
abs(changed_line - deleted_line['line_number']) <= 2): | ||
is_deleted_header = True | ||
print(f" ⚠️ Skipping change at line {changed_line} (deleted header near line {deleted_line['line_number']})") | ||
break | ||
|
||
# More precise filtering: check if this change is actually meaningful | ||
# Skip changes that are part of deleted content or line shifts due to deletions | ||
should_include = True | ||
|
||
# Skip exact deleted headers | ||
for deleted_line in operations['deleted_lines']: | ||
if (deleted_line['is_header'] and | ||
changed_line == deleted_line['line_number']): | ||
should_include = False | ||
print(f" ⚠️ Skipping change at line {changed_line} (exact deleted header)") | ||
break | ||
|
||
# Skip changes that are very close to deleted content AND far from their containing section | ||
# This helps filter out line shift artifacts while keeping real content changes | ||
if should_include: | ||
for deleted_line in operations['deleted_lines']: | ||
# Only skip if both conditions are met: | ||
# 1. Very close to deleted content (within 5 lines) | ||
# 2. The change is far from its containing section (likely a shift artifact) | ||
distance_to_deletion = abs(changed_line - deleted_line['line_number']) | ||
distance_to_section = changed_line - containing_section | ||
|
||
if (distance_to_deletion <= 5 and distance_to_section > 100): | ||
should_include = False | ||
print(f" ⚠️ Skipping change at line {changed_line} (likely line shift: {distance_to_deletion} lines from deletion, {distance_to_section} from section)") | ||
break | ||
|
||
if should_include and changed_line - containing_section <= 30: | ||
content_affected_sections.add(containing_section) | ||
print(f" 📝 Content change at line {changed_line} affects section at line {containing_section}") | ||
elif should_include: | ||
print(f" ⚠️ Skipping distant change at line {changed_line} from section {containing_section}") | ||
|
||
# Add content-modified sections to the modified set, but exclude sections that are already marked as added or deleted | ||
for line_num in content_affected_sections: | ||
if (line_num not in sections_by_type['modified'] and | ||
line_num not in sections_by_type['added'] and | ||
line_num not in sections_by_type['deleted']): # ✅ Critical fix: exclude deleted sections | ||
sections_by_type['modified'].add(line_num) | ||
print(f" 📝 Added content-modified section at line {line_num}") | ||
elif line_num in sections_by_type['deleted']: | ||
print(f" 🚫 Skipping content-modified section at line {line_num}: already marked as deleted") | ||
|
||
# Prepare sections data for source_diff_dict | ||
file_modified = {} | ||
file_added = {} | ||
file_deleted = {} | ||
|
||
# Build modified sections | ||
for line_num in sections_by_type['modified']: | ||
if line_num in all_hierarchy_dict: | ||
file_modified[str(line_num)] = all_hierarchy_dict[line_num] | ||
|
||
# Build added sections | ||
for line_num in sections_by_type['added']: | ||
if line_num in all_hierarchy_dict: | ||
file_added[str(line_num)] = all_hierarchy_dict[line_num] | ||
|
||
# Build deleted sections | ||
for line_num in sections_by_type['deleted']: | ||
if line_num in base_hierarchy_dict: | ||
file_deleted[str(line_num)] = base_hierarchy_dict[line_num] | ||
|
||
# Check for frontmatter changes (content before first top-level header) | ||
print(f" 🔍 Checking for frontmatter changes...") | ||
frontmatter_changed = False | ||
|
||
# Check if any changes occur before the first top-level header | ||
first_header_line = None | ||
for line_num in sorted(all_headers.keys()): | ||
header_info = all_headers[line_num] | ||
if header_info['level'] == 1: # First top-level header | ||
first_header_line = line_num | ||
break | ||
|
||
print(f" 📊 First header line: {first_header_line}") | ||
print(f" 📊 Real content changes: {sorted(real_content_changes)}") | ||
|
||
if first_header_line: | ||
# Check if any real content changes are before the first header | ||
for line_num in real_content_changes: | ||
#print(f" 🔍 Checking line {line_num} vs first header {first_header_line}") | ||
if line_num < first_header_line: | ||
frontmatter_changed = True | ||
print(f" 📄 Frontmatter change detected: line {line_num} < {first_header_line}") | ||
break | ||
|
||
print(f" 📊 Frontmatter changed: {frontmatter_changed}") | ||
|
||
if frontmatter_changed: | ||
print(f" 📄 Frontmatter changes detected (before line {first_header_line})") | ||
# Add frontmatter as a special section with line number 0 | ||
file_modified["0"] = "frontmatter" | ||
print(f" ✅ Added frontmatter section to modified sections") | ||
|
||
# Build source diff dictionary | ||
source_diff_dict = build_source_diff_dict( | ||
file_modified, file_added, file_deleted, | ||
all_hierarchy_dict, base_hierarchy_dict, | ||
operations, file_content, base_file_content | ||
) | ||
|
||
# Breakpoint: Output source_diff_dict to file for review with file prefix | ||
|
||
# Ensure temp_output directory exists | ||
script_dir = os.path.dirname(os.path.abspath(__file__)) | ||
temp_dir = os.path.join(script_dir, "temp_output") | ||
os.makedirs(temp_dir, exist_ok=True) | ||
|
||
file_prefix = file.filename.replace('/', '-').replace('.md', '') | ||
output_file = os.path.join(temp_dir, f"{file_prefix}-source-diff-dict.json") | ||
with open(output_file, 'w', encoding='utf-8') as f: | ||
json.dump(source_diff_dict, f, ensure_ascii=False, indent=2) | ||
|
||
print(f" 💾 Saved source diff dictionary to: {output_file}") | ||
print(f" 📊 Source diff dictionary contains {len(source_diff_dict)} sections:") | ||
for key, diff_info in source_diff_dict.items(): | ||
print(f" {diff_info['operation']}: {key} -> original_hierarchy: {diff_info['original_hierarchy']}") | ||
|
||
# source-diff-dict.json generation is complete, continue to next step in main.py | ||
|
||
# For modified headers, we need to build a mapping using original titles for matching | ||
original_hierarchy_dict = all_hierarchy_dict.copy() | ||
|
||
# Update hierarchy dict to use original content for modified headers when needed for matching | ||
for line_num in sections_by_type['modified']: | ||
if line_num in all_headers: | ||
header_info = all_headers[line_num] | ||
# Check if this header was modified and has original content | ||
for op in operations['modified_lines']: | ||
if (op['is_header'] and | ||
op['line_number'] == line_num and | ||
'original_content' in op): | ||
# Create hierarchy path using original content for matching | ||
original_line = op['original_content'].strip() | ||
if original_line.startswith('#'): | ||
# Build original hierarchy for matching | ||
original_hierarchy = build_hierarchy_for_modified_section( | ||
file_content, line_num, original_line, all_hierarchy_dict) | ||
if original_hierarchy: | ||
original_hierarchy_dict[line_num] = original_hierarchy | ||
break | ||
|
||
# Process added sections | ||
if sections_by_type['added']: | ||
file_added = {} | ||
# Find insertion points using the simplified logic: | ||
# Record the previous section hierarchy for each added section | ||
insertion_points = find_previous_section_for_added(sections_by_type['added'], all_hierarchy_dict) | ||
|
||
# Get actual content for added sections | ||
for line_num in sections_by_type['added']: | ||
if line_num in all_hierarchy_dict: | ||
file_added[str(line_num)] = all_hierarchy_dict[line_num] | ||
|
||
# Get source sections content (actual content, not just hierarchy) | ||
if file_added: | ||
source_sections_content = get_source_sections_content(pr_url, file.filename, file_added, github_client) | ||
file_added = source_sections_content # Replace hierarchy with actual content | ||
|
||
if file_added: | ||
added_sections[file.filename] = { | ||
'sections': file_added, | ||
'insertion_points': insertion_points | ||
} | ||
print(f" ➕ Found {len(file_added)} added sections with {len(insertion_points)} insertion points") | ||
|
||
# Process modified sections | ||
if sections_by_type['modified']: | ||
file_modified = {} | ||
for line_num in sections_by_type['modified']: | ||
if line_num in original_hierarchy_dict: | ||
file_modified[str(line_num)] = original_hierarchy_dict[line_num] | ||
|
||
if file_modified: | ||
modified_sections[file.filename] = { | ||
'sections': file_modified, | ||
'original_hierarchy': original_hierarchy_dict, | ||
'current_hierarchy': all_hierarchy_dict | ||
} | ||
print(f" ✏️ Found {len(file_modified)} modified sections") | ||
|
||
# Process deleted sections | ||
if sections_by_type['deleted']: | ||
file_deleted = {} | ||
for line_num in sections_by_type['deleted']: | ||
# Use base hierarchy to get the deleted section info | ||
if line_num in base_hierarchy_dict: | ||
file_deleted[str(line_num)] = base_hierarchy_dict[line_num] | ||
|
||
if file_deleted: | ||
deleted_sections[file.filename] = file_deleted | ||
print(f" ❌ Found {len(file_deleted)} deleted sections") | ||
|
||
# Enhanced logic: also check content-level changes using legacy detection | ||
# This helps detect changes in section content (not just headers) | ||
print(f" 🔍 Enhanced detection: checking content-level changes...") | ||
changed_lines = get_changed_line_ranges(file) | ||
affected_sections = find_affected_sections(lines, changed_lines, all_headers) | ||
|
||
legacy_modified = {} | ||
for line_num in affected_sections: | ||
if line_num in all_hierarchy_dict: | ||
section_hierarchy = all_hierarchy_dict[line_num] | ||
# Only add if not already detected by operation-type analysis | ||
already_detected = False | ||
if file.filename in modified_sections: | ||
for existing_line, existing_hierarchy in modified_sections[file.filename].get('sections', {}).items(): | ||
if existing_hierarchy == section_hierarchy: | ||
already_detected = True | ||
break | ||
|
||
if not already_detected: | ||
legacy_modified[str(line_num)] = section_hierarchy | ||
|
||
if legacy_modified: | ||
print(f" ✅ Enhanced detection found {len(legacy_modified)} additional content-modified sections") | ||
# Merge with existing modified sections | ||
if file.filename in modified_sections: | ||
# Merge the sections | ||
existing_sections = modified_sections[file.filename].get('sections', {}) | ||
existing_sections.update(legacy_modified) | ||
modified_sections[file.filename]['sections'] = existing_sections | ||
else: | ||
# Create new entry | ||
modified_sections[file.filename] = { | ||
'sections': legacy_modified, | ||
'original_hierarchy': all_hierarchy_dict, | ||
'current_hierarchy': all_hierarchy_dict | ||
} | ||
|
||
print(f"\n📊 Summary:") | ||
#print(f" ✏️ Modified files: {} files") | ||
print(f" 📄 Added files: {len(added_files)} files") | ||
print(f" 🗑️ Deleted files: {len(deleted_files)} files") | ||
print(f" 📋 TOC files: {len(toc_files)} files") | ||
if ignored_files: | ||
print(f" ⏭️ Ignored files: {len(ignored_files)} files") | ||
for ignored_file in ignored_files: | ||
print(f" - {ignored_file}") | ||
|
||
return added_sections, modified_sections, deleted_sections, added_files, deleted_files, toc_files |
The `analyze_source_changes` function is very long and complex, spanning hundreds of lines. It handles file analysis, diff parsing, hierarchy building, and categorizing changes. To improve readability and maintainability, this function should be refactored into several smaller, more specialized functions, each with a single, well-defined responsibility.
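For illustration only, a possible top-level shape after such a split; every `_get_*` / `_fetch_*` / `_collect_*` helper name here is hypothetical and would absorb one block of the current function body:

```python
def analyze_source_changes(pr_url, github_client, special_files=None, ignore_files=None,
                           repo_configs=None, max_non_system_sections=120, pr_diff=None):
    """Walk the PR's markdown files and delegate all per-file work to helpers."""
    added_sections, modified_sections, deleted_sections = {}, {}, {}
    added_files, deleted_files, toc_files = {}, [], {}

    for file in _get_markdown_files(pr_url, github_client):
        if file.filename in (ignore_files or []):
            continue  # ignored file
        elif file.status == 'added':
            added_files[file.filename] = _fetch_head_content(file)
        elif file.status == 'removed':
            deleted_files.append(file.filename)
        elif _is_special_toc_file(file, special_files or []):
            toc_files[file.filename] = _collect_toc_operations(file)
        else:
            # Header scan, diff classification, frontmatter check, diff-dict dump
            _collect_section_changes(file, added_sections, modified_sections, deleted_sections)

    return added_sections, modified_sections, deleted_sections, added_files, deleted_files, toc_files
```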
def match_source_diff_to_target(source_diff_dict, target_hierarchy, target_lines, ai_client, repo_config, max_non_system_sections=120, max_tokens=20000): | ||
""" | ||
Match source_diff_dict original_hierarchy to target file sections | ||
Uses direct matching for system variables/config and AI matching for others | ||
|
||
Returns: | ||
dict: Matched sections with enhanced information including: | ||
- target_line: Line number in target file | ||
- target_hierarchy: Target section hierarchy | ||
- insertion_type: For added sections only | ||
- source_original_hierarchy: Original hierarchy from source | ||
- source_operation: Operation type (modified/added/deleted) | ||
- source_old_content: Old content from source diff | ||
- source_new_content: New content from source diff | ||
""" | ||
thread_safe_print(f"🔗 Starting source diff to target matching...") | ||
|
||
# Extract hierarchies from source diff dict | ||
source_hierarchies = extract_hierarchies_from_diff_dict(source_diff_dict) | ||
|
||
if not source_hierarchies: | ||
thread_safe_print(f"⚠️ No hierarchies to match") | ||
return {} | ||
|
||
# Process sections in original order to maintain consistency | ||
# Initialize final matching results with ordered dict to preserve order | ||
from collections import OrderedDict | ||
all_matched_sections = OrderedDict() | ||
|
||
# Categorize sections for processing strategy but maintain order | ||
direct_match_sections = OrderedDict() | ||
ai_match_sections = OrderedDict() | ||
added_sections = OrderedDict() | ||
bottom_sections = OrderedDict() # New category for bottom sections | ||
|
||
for key, hierarchy in source_hierarchies.items(): | ||
# Check if this is a bottom section (no matching needed) | ||
if hierarchy.startswith('bottom-'): | ||
bottom_sections[key] = hierarchy | ||
# Check if this is an added section | ||
elif key.startswith('added_'): | ||
added_sections[key] = hierarchy | ||
else: | ||
# Extract the leaf title from hierarchy for checking | ||
leaf_title = hierarchy.split(' > ')[-1] if ' > ' in hierarchy else hierarchy | ||
|
||
# Check if this is suitable for direct matching | ||
if (hierarchy == "frontmatter" or | ||
leaf_title.startswith('# ') or # Top-level titles | ||
is_system_variable_or_config(leaf_title)): # System variables/config | ||
direct_match_sections[key] = hierarchy | ||
else: | ||
ai_match_sections[key] = hierarchy | ||
|
||
thread_safe_print(f"📊 Section categorization:") | ||
thread_safe_print(f" 🎯 Direct matching: {len(direct_match_sections)} sections") | ||
thread_safe_print(f" 🤖 AI matching: {len(ai_match_sections)} sections") | ||
thread_safe_print(f" ➕ Added sections: {len(added_sections)} sections") | ||
thread_safe_print(f" 🔚 Bottom sections: {len(bottom_sections)} sections (no matching needed)") | ||
|
||
# Process each section in original order | ||
thread_safe_print(f"\n🔄 Processing sections in original order...") | ||
|
||
for key, hierarchy in source_hierarchies.items(): | ||
thread_safe_print(f" 🔍 Processing {key}: {hierarchy}") | ||
|
||
# Determine processing strategy based on section type and content | ||
if hierarchy.startswith('bottom-'): | ||
# Bottom section - no matching needed, append to end | ||
thread_safe_print(f" 🔚 Bottom section - append to end of document") | ||
result = { | ||
"target_line": "-1", # Special marker for bottom sections | ||
"target_hierarchy": hierarchy # Keep the bottom-xxx marker | ||
} | ||
elif key.startswith('added_'): | ||
# Added section - find insertion point | ||
thread_safe_print(f" ➕ Added section - finding insertion point") | ||
result = process_added_section(key, hierarchy, target_hierarchy, target_lines, ai_client, repo_config, max_non_system_sections, max_tokens) | ||
else: | ||
# Modified or deleted section - find matching section | ||
operation = source_diff_dict[key].get('operation', 'unknown') | ||
thread_safe_print(f" {operation.capitalize()} section - finding target match") | ||
result = process_modified_or_deleted_section(key, hierarchy, target_hierarchy, target_lines, ai_client, repo_config, max_non_system_sections, max_tokens) | ||
|
||
if result: | ||
# Add source language information from source_diff_dict | ||
source_info = source_diff_dict.get(key, {}) | ||
|
||
# Extract target content from target_lines | ||
target_line = result.get('target_line', 'unknown') | ||
target_content = "" | ||
if target_line != 'unknown' and target_line != '0': | ||
try: | ||
target_line_num = int(target_line) | ||
# For ALL operations, only extract direct content (no sub-sections) | ||
# This avoids duplication when both parent and child sections have operations | ||
target_content = extract_section_direct_content(target_line_num, target_lines) | ||
except (ValueError, IndexError): | ||
target_content = "" | ||
elif target_line == '0': | ||
# For frontmatter, extract content from beginning to first header | ||
target_content = extract_frontmatter_content(target_lines) | ||
|
||
enhanced_result = { | ||
**result, # Include existing target matching info | ||
'target_content': target_content, # Add target section content | ||
'source_original_hierarchy': source_info.get('original_hierarchy', ''), | ||
'source_operation': source_info.get('operation', ''), | ||
'source_old_content': source_info.get('old_content', ''), | ||
'source_new_content': source_info.get('new_content', '') | ||
} | ||
all_matched_sections[key] = enhanced_result | ||
thread_safe_print(f" ✅ {key}: -> line {target_line}") | ||
else: | ||
thread_safe_print(f" ❌ {key}: matching failed") | ||
|
||
thread_safe_print(f"\n📊 Final matching results: {len(all_matched_sections)} total matches") | ||
return all_matched_sections |
The `match_source_diff_to_target` function is very large and complex. It orchestrates direct matching, AI matching, and processes different types of sections. This makes it difficult to understand and maintain. Please consider breaking this function down into smaller, more focused helper functions to improve its structure and readability.
def thread_safe_print(*args, **kwargs):
    with print_lock:
        print(*args, **kwargs)
The `thread_safe_print` function is duplicated across multiple files in this package (including `file_deleter.py`, `file_updater.py`, `main_workflow.py`, etc.). To improve maintainability and adhere to the DRY (Don't Repeat Yourself) principle, this function should be defined once in a shared utility module and imported where needed.
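A minimal sketch of the shared helper, assuming a new `scripts/translate_doc_pr/utils.py` module (the file name is a suggestion; no such module exists yet):

```python
# scripts/translate_doc_pr/utils.py
import threading

_print_lock = threading.Lock()


def thread_safe_print(*args, **kwargs):
    """Serialize stdout writes so messages from worker threads do not interleave."""
    with _print_lock:
        print(*args, **kwargs)
```

Each module would then replace its local copy with `from utils import thread_safe_print` (or the relative form once the package imports are fixed as noted above).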
def create_section_batches(file_content, max_lines_per_batch=200):
    """Create batches of file content for translation, respecting section boundaries"""
    lines = file_content.split('\n')

    # Find all section headers
    section_starts = []
    for i, line in enumerate(lines):
        line = line.strip()
        if line.startswith('#'):
            match = re.match(r'^(#{1,10})\s+(.+)', line)
            if match:
                section_starts.append(i + 1)  # 1-based line numbers

    # If no sections found, just batch by line count
    if not section_starts:
        batches = []
        for i in range(0, len(lines), max_lines_per_batch):
            batch_lines = lines[i:i + max_lines_per_batch]
            batches.append('\n'.join(batch_lines))
        return batches

    # Create batches respecting section boundaries
    batches = []
    current_batch_start = 0

    for i, section_start in enumerate(section_starts):
        section_start_idx = section_start - 1  # Convert to 0-based

        # Check if adding this section would exceed the line limit
        if (section_start_idx - current_batch_start) > max_lines_per_batch:
            # Close current batch at the previous section boundary
            if current_batch_start < section_start_idx:
                batch_lines = lines[current_batch_start:section_start_idx]
                batches.append('\n'.join(batch_lines))
                current_batch_start = section_start_idx

        # If this is the last section, or the next section would create a batch too large
        if i == len(section_starts) - 1:
            # Add remaining content as final batch
            batch_lines = lines[current_batch_start:]
            batches.append('\n'.join(batch_lines))
        else:
            next_section_start = section_starts[i + 1] - 1  # 0-based
            if (next_section_start - current_batch_start) > max_lines_per_batch:
                # Close current batch at current section boundary
                batch_lines = lines[current_batch_start:section_start_idx]
                if batch_lines:  # Only add non-empty batches
                    batches.append('\n'.join(batch_lines))
                current_batch_start = section_start_idx

    # Clean up any empty batches
    batches = [batch for batch in batches if batch.strip()]

    return batches
The logic in `create_section_batches` is quite complex and hard to follow, especially with the lookahead checks for the next section. This complexity increases the risk of bugs and makes the function difficult to maintain.
Consider refactoring this to a simpler, more straightforward algorithm (see the sketch after this list). A simpler approach could be:
- Initialize an empty `current_batch` and a list of `batches`.
- Iterate through the sections (or lines if no sections).
- If adding the current section/line to `current_batch` does not exceed `max_lines_per_batch`, add it.
- Otherwise, add `current_batch` to `batches`, and start a new `current_batch` with the current section/line.
- After the loop, add the final `current_batch` if it's not empty.
try:
    from main import print_token_estimation
    print_token_estimation(prompt, "File addition translation")
except ImportError:
    # Fallback if import fails - use tiktoken
    try:
        import tiktoken
        enc = tiktoken.get_encoding("cl100k_base")
        tokens = enc.encode(prompt)
        actual_tokens = len(tokens)
        char_count = len(prompt)
        print(f" 💰 File addition translation")
        print(f" 📝 Input: {char_count:,} characters")
        print(f" 🔢 Actual tokens: {actual_tokens:,} (using tiktoken cl100k_base)")
    except Exception:
        # Final fallback to character approximation
        estimated_tokens = len(prompt) // 4
        char_count = len(prompt)
        print(f" 💰 File addition translation")
        print(f" 📝 Input: {char_count:,} characters")
        print(f" 🔢 Estimated tokens: ~{estimated_tokens:,} (fallback: 4 chars/token approximation)")
```python
def get_pr_diff(pr_url, github_client):
    """Get the diff content from a GitHub PR (from auto-sync-pr-changes.py)"""
    try:
        from pr_analyzer import parse_pr_url
        owner, repo, pr_number = parse_pr_url(pr_url)
        repository = github_client.get_repo(f"{owner}/{repo}")
        pr = repository.get_pull(pr_number)

        # Get files and their patches
        files = pr.get_files()
        diff_content = []

        for file in files:
            if file.filename.endswith('.md') and file.patch:
                diff_content.append(f"File: {file.filename}")
                diff_content.append(file.patch)
                diff_content.append("-" * 80)

        return "\n".join(diff_content)

    except Exception as e:
        thread_safe_print(f"   ❌ Error getting PR diff: {e}")
        return None
```
```python
def clean_title_for_matching(title):
    """Clean title for matching by removing markdown formatting and span elements"""
    if not title:
        return ""

    # Remove span elements like <span class="version-mark">New in v5.0</span>
    title = re.sub(r'<span[^>]*>.*?</span>', '', title)

    # Remove markdown header prefix (# ## ### etc.)
    title = re.sub(r'^#{1,6}\s*', '', title.strip())

    # Remove backticks
    title = title.replace('`', '')

    # Strip whitespace
    title = title.strip()

    return title
```
Actionable comments posted: 19
🧹 Nitpick comments (30)
.github/workflows/sync-docs-cn-to-en.yml (1)
46-70: Harden PR URL parsing. `cut -f7` is brittle (fails with trailing paths like `/files`). Use regex to extract owner/repo/PR number.
Apply:
- name: Extract PR information id: extract_info run: | # Extract source repo info SOURCE_URL="${{ github.event.inputs.source_pr_url }}" - SOURCE_OWNER=$(echo $SOURCE_URL | cut -d'/' -f4) - SOURCE_REPO=$(echo $SOURCE_URL | cut -d'/' -f5) - SOURCE_PR=$(echo $SOURCE_URL | cut -d'/' -f7) + SOURCE_OWNER=$(echo "$SOURCE_URL" | sed -E 's#https?://github\.com/([^/]+)/([^/]+)/pull/([0-9]+).*#\1#') + SOURCE_REPO=$(echo "$SOURCE_URL" | sed -E 's#https?://github\.com/([^/]+)/([^/]+)/pull/([0-9]+).*#\2#') + SOURCE_PR=$(echo "$SOURCE_URL" | sed -E 's#https?://github\.com/[^/]+/[^/]+/pull/([0-9]+).*#\1#') # Extract target repo info TARGET_URL="${{ github.event.inputs.target_pr_url }}" - TARGET_OWNER=$(echo $TARGET_URL | cut -d'/' -f4) - TARGET_REPO=$(echo $TARGET_URL | cut -d'/' -f5) - TARGET_PR=$(echo $TARGET_URL | cut -d'/' -f7) + TARGET_OWNER=$(echo "$TARGET_URL" | sed -E 's#https?://github\.com/([^/]+)/([^/]+)/pull/([0-9]+).*#\1#') + TARGET_REPO=$(echo "$TARGET_URL" | sed -E 's#https?://github\.com/([^/]+)/([^/]+)/pull/([0-9]+).*#\2#') + TARGET_PR=$(echo "$TARGET_URL" | sed -E 's#https?://github\.com/[^/]+/[^/]+/pull/([0-9]+).*#\1#')scripts/translate_doc_pr/file_adder.py (5)
8-12: Remove unused imports. `json`, `Github`, `OpenAI` are unused here.
-import json import threading -from github import Github -from openai import OpenAI
20-74: Fix batching to respect section boundaries and max_lines reliably. The current logic can drop content and create empty batches. Pack section ranges first, then split any batch that exceeds the maximum.
-def create_section_batches(file_content, max_lines_per_batch=200): - """Create batches of file content for translation, respecting section boundaries""" - lines = file_content.split('\n') - - # Find all section headers - section_starts = [] - for i, line in enumerate(lines): - line = line.strip() - if line.startswith('#'): - match = re.match(r'^(#{1,10})\s+(.+)', line) - if match: - section_starts.append(i + 1) # 1-based line numbers - - # If no sections found, just batch by line count - if not section_starts: - batches = [] - for i in range(0, len(lines), max_lines_per_batch): - batch_lines = lines[i:i + max_lines_per_batch] - batches.append('\n'.join(batch_lines)) - return batches - - # Create batches respecting section boundaries - batches = [] - current_batch_start = 0 - - for i, section_start in enumerate(section_starts): - section_start_idx = section_start - 1 # Convert to 0-based - - # Check if adding this section would exceed the line limit - if (section_start_idx - current_batch_start) > max_lines_per_batch: - # Close current batch at the previous section boundary - if current_batch_start < section_start_idx: - batch_lines = lines[current_batch_start:section_start_idx] - batches.append('\n'.join(batch_lines)) - current_batch_start = section_start_idx - - # If this is the last section, or the next section would create a batch too large - if i == len(section_starts) - 1: - # Add remaining content as final batch - batch_lines = lines[current_batch_start:] - batches.append('\n'.join(batch_lines)) - else: - next_section_start = section_starts[i + 1] - 1 # 0-based - if (next_section_start - current_batch_start) > max_lines_per_batch: - # Close current batch at current section boundary - batch_lines = lines[current_batch_start:section_start_idx] - if batch_lines: # Only add non-empty batches - batches.append('\n'.join(batch_lines)) - current_batch_start = section_start_idx - - # Clean up any empty batches - batches = [batch for batch in batches if batch.strip()] - - return batches +def create_section_batches(file_content, max_lines_per_batch=200): + """Batch content by Markdown sections, then enforce max_lines_per_batch.""" + lines = file_content.split('\n') + header_idx = [i for i, line in enumerate(lines) if re.match(r'^\#{1,10}\s+\S', line.strip())] + + # No headers: simple chunking + if not header_idx: + return [ + '\n'.join(chunk) + for i in range(0, len(lines), max_lines_per_batch) + if (chunk := lines[i:i + max_lines_per_batch]) + ] + + # Build section slices + sections = [] + for i, start in enumerate(header_idx): + end = header_idx[i + 1] if i + 1 < len(header_idx) else len(lines) + sections.append(lines[start:end]) + + # Pack sections into batches + batches, cur, cur_len = [], [], 0 + for sec in sections: + sec_len = len(sec) + if cur and cur_len + sec_len > max_lines_per_batch: + batches.append('\n'.join(cur)) + cur, cur_len = [], 0 + cur.extend(sec) + cur_len += sec_len + if cur: + batches.append('\n'.join(cur)) + + # Split any oversized batch (single very large section) + final_batches = [] + for batch in batches: + blines = batch.split('\n') + if len(blines) <= max_lines_per_batch: + if batch.strip(): + final_batches.append(batch) + else: + for i in range(0, len(blines), max_lines_per_batch): + chunk = '\n'.join(blines[i:i + max_lines_per_batch]).strip() + if chunk: + final_batches.append(chunk) + return final_batches
112-121: Drop the `f` prefix where no interpolation occurs. Cleans up lints (F541) and keeps output unchanged.
```diff
-            print(f"   💰 File addition translation")
-            print(f"   📝 Input: {char_count:,} characters")
-            print(f"   🔢 Actual tokens: {actual_tokens:,} (using tiktoken cl100k_base)")
+            print("   💰 File addition translation")
+            print(f"   📝 Input: {char_count:,} characters")
+            print(f"   🔢 Actual tokens: {actual_tokens:,} (using tiktoken cl100k_base)")
@@
-            print(f"   💰 File addition translation")
-            print(f"   📝 Input: {char_count:,} characters")
-            print(f"   🔢 Estimated tokens: ~{estimated_tokens:,} (fallback: 4 chars/token approximation)")
+            print("   💰 File addition translation")
+            print(f"   📝 Input: {char_count:,} characters")
+            print(f"   🔢 Estimated tokens: ~{estimated_tokens:,} (fallback: 4 chars/token approximation)")
```

Also applies to: 128-129, 193-193
135-140: Rename unused params to signal intent. `pr_url` and `github_client` are unused.
-def process_added_files(added_files, pr_url, github_client, ai_client, repo_config): +def process_added_files(added_files, _pr_url, _github_client, ai_client, repo_config):
131-134: Avoid blind excepts; narrow or log with type. Catching bare `Exception` hides actionable errors.
Consider catching specific exceptions (e.g., httpx.HTTPError from AI client; OSError for file writes), or at least include exception type:
```python
    except Exception as e:
        thread_safe_print(f"   ❌ Batch translation failed: {type(e).__name__}: {e}")
```

Also applies to: 190-191
scripts/translate_doc_pr/file_deleter.py (4)
8-8: Remove unused import. `Github` is not used.
-from github import Github
17-22: Rename unused param to underscore. `github_client` isn't used.
-def process_deleted_files(deleted_files, github_client, repo_config): +def process_deleted_files(deleted_files, _github_client, repo_config):
38-39: Avoid blind except; log type. Narrow to `OSError` or include the exception type name in the message.
- except Exception as e: - thread_safe_print(f" ❌ Error deleting file {target_file_path}: {e}") + except OSError as e: + thread_safe_print(f" ❌ Error deleting file {target_file_path}: {type(e).__name__}: {e}")
43-44: Drop the `f` prefix without interpolation.

```diff
-    thread_safe_print(f"\n✅ Completed processing deleted files")
+    thread_safe_print("\n✅ Completed processing deleted files")
```

scripts/translate_doc_pr/main_workflow.py (6)
612-615: Remove duplicate re-imports. `process_toc_files` and `process_modified_sections` are already imported at the top.
- # Import necessary functions - from file_updater import process_modified_sections, update_target_document_from_match_data - from toc_processor import process_toc_files + # Functions already imported at module top; no re-import needed
528-533
: Drop redundant local import; parse_pr_url already imported.-def get_workflow_repo_config(pr_url, repo_configs): +def get_workflow_repo_config(pr_url, repo_configs): """Get repository configuration for workflow environment""" - from pr_analyzer import parse_pr_url - owner, repo, pr_number = parse_pr_url(pr_url)
115-122: Unify logging and remove the `f` prefix from strings without placeholders. Use `thread_safe_print` consistently and avoid f-strings without interpolation.
- print(f"🧹 Cleaned existing temp_output directory") + thread_safe_print("🧹 Cleaned existing temp_output directory") @@ - print(f"🧹 Removed existing temp_output file") + thread_safe_print("🧹 Removed existing temp_output file") @@ - print(f"📁 Created temp_output directory: {temp_dir}") + thread_safe_print(f"📁 Created temp_output directory: {temp_dir}")
423-431: Prefer local target file content over the GitHub API master ref. Using remote master can mismatch the checked-out PR head branch and degrade matching accuracy. Read from `repo_config['target_local_path']` instead, falling back to the API if missing.
I can provide a patch if you confirm local path layout mirrors target repo paths.
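In the meantime, a minimal sketch of that preference order, assuming `repo_config['target_local_path']` mirrors the target repo layout (function and parameter names are illustrative, not part of this PR):

```python
import os

def read_target_file(relative_path, repo_config, repository):
    """Prefer the locally checked-out PR branch; fall back to the GitHub API."""
    local_path = os.path.join(repo_config['target_local_path'], relative_path)
    if os.path.isfile(local_path):
        with open(local_path, encoding='utf-8') as f:
            return f.read()
    # Fallback: fetch the file from the remote repository via the API
    content_file = repository.get_contents(relative_path)
    return content_file.decoded_content.decode('utf-8')
```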
251-274: Duplicate `get_pr_diff` logic; reuse `pr_analyzer.get_pr_diff`. Avoid maintaining two implementations.
Option:
- Remove local get_pr_diff.
- Import and call pr_analyzer.get_pr_diff at the top.
Minimal change at call site:
```diff
-    pr_diff = get_pr_diff(SOURCE_PR_URL, github_client)
+    from pr_analyzer import get_pr_diff as analyzer_get_pr_diff
+    pr_diff = analyzer_get_pr_diff(SOURCE_PR_URL, github_client)
```

Also applies to: 592-596
56-57: Message text: avoid long strings in exceptions. Minor: prefer concise exception messages or custom exception classes.
Also applies to: 159-160
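As an illustration, a small custom exception class keeps the call site short (the class name below is hypothetical, not from this PR):

```python
class MissingTokenError(RuntimeError):
    """Raised when a required API token is not set in the environment."""

# Call site stays concise:
# raise MissingTokenError("GITHUB_TOKEN")
```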
scripts/translate_doc_pr/toc_processor.py (3)
10-12: Remove unused imports. `Github` and `OpenAI` are not used.
-from github import Github -from openai import OpenAI
241-241: Drop the `f` prefix where no interpolation occurs. Cleans up F541 lints.
```diff
-    thread_safe_print(f"   ⏭️ No TOC lines need translation")
+    thread_safe_print("   ⏭️ No TOC lines need translation")
@@
-    print(f"   💰 TOC translation")
+    print("   💰 TOC translation")
@@
-    print(f"   💰 TOC translation")
+    print("   💰 TOC translation")
@@
-    thread_safe_print(f"   📝 AI translation response received")
+    thread_safe_print("   📝 AI translation response received")
@@
-    thread_safe_print(f"   ✅ All TOC files processed")
+    thread_safe_print("   ✅ All TOC files processed")
```

Also applies to: 286-286, 293-295, 303-304, 434-434
90-91
: Rename unused params to underscore; avoid blind except.
- file_path/target_local_path/pr_url/github_client are unused.
- Catch specific exceptions or include type name in logs.
```diff
-def process_toc_operations(file_path, operations, source_lines, target_lines, target_local_path):
+def process_toc_operations(_file_path, operations, source_lines, target_lines, _target_local_path):
@@
-def process_toc_files(toc_files, pr_url, github_client, ai_client, repo_config):
+def process_toc_files(toc_files, _pr_url, _github_client, ai_client, repo_config):
```

And adjust excepts:

```python
    except Exception as e:
        thread_safe_print(f"   ❌ AI translation failed: {type(e).__name__}: {e}")
```

Also applies to: 421-423, 332-334
scripts/translate_doc_pr/pr_analyzer.py (8)
1-1: Remove the non-executable shebang or mark the file executable. The shebang is present but the file isn't executable in source control. Prefer removing it to avoid the EXE001 lint, or ensure the file mode is +x.
Apply this diff to remove the shebang:
-#!/usr/bin/env python3
22-26: Harden PR URL parsing. Splitting on "/" is brittle (trailing slashes, query params). Use a regex or urllib to reliably extract owner/repo/number.
Apply this diff:
def parse_pr_url(pr_url): - """Parse PR URL to get repo info""" - parts = pr_url.split('/') - return parts[-4], parts[-3], int(parts[-1]) # owner, repo, pr_number + """Parse PR URL to get owner, repo, and PR number robustly""" + import re + m = re.search(r'https?://[^/]+/([^/]+)/([^/]+)/pull/(\d+)', pr_url.rstrip('/')) + if not m: + raise ValueError(f"Invalid PR URL: {pr_url}") + owner, repo, pr_number = m.group(1), m.group(2), int(m.group(3)) + return owner, repo, pr_number
41-63: Narrow exception handling and use thread-safe logging. Catching `Exception` swallows actionable errors; prefer `GithubException` and use `thread_safe_print` for consistency.
Apply this diff:
-from github import Github +from github import Github +from github.GithubException import GithubException @@ def get_pr_diff(pr_url, github_client): @@ - except Exception as e: - print(f" ❌ Error getting PR diff: {e}") + except GithubException as e: + thread_safe_print(f" ❌ GitHub API error getting PR diff: {e}") + return None + except Exception as e: + thread_safe_print(f" ❌ Unexpected error getting PR diff: {e}") return None
228-289: Remove unused locals to reduce noise (Ruff F841). `original_line` and `title` are never used.
Apply this diff:
- original_line = line line = line.strip() @@ - if match: - level = len(match.group(1)) - title = match.group(2).strip() + if match: + level = len(match.group(1))
742-744: Remove the unused cross-import to avoid circular dependency risks. This import isn't used and introduces an unnecessary coupling.
Apply this diff:
- from section_matcher import clean_title_for_matching
1311-1319: Gate local artifact writes behind an env flag and use a temp directory. Always writing to the repo path is noisy and brittle in CI. Make it opt-in and write to a temp dir by default.
Apply this diff:
- script_dir = os.path.dirname(os.path.abspath(__file__)) - temp_dir = os.path.join(script_dir, "temp_output") - os.makedirs(temp_dir, exist_ok=True) - - file_prefix = file.filename.replace('/', '-').replace('.md', '') - output_file = os.path.join(temp_dir, f"{file_prefix}-source-diff-dict.json") - with open(output_file, 'w', encoding='utf-8') as f: - json.dump(source_diff_dict, f, ensure_ascii=False, indent=2) - - print(f" 💾 Saved source diff dictionary to: {output_file}") + if os.getenv("WRITE_SOURCE_DIFF", "0") == "1": + out_dir = os.getenv("SOURCE_DIFF_OUTDIR", os.path.join(tempfile.gettempdir(), "translate_doc_pr")) + os.makedirs(out_dir, exist_ok=True) + file_prefix = file.filename.replace('/', '-').replace('.md', '') + output_file = os.path.join(out_dir, f"{file_prefix}-source-diff-dict.json") + with open(output_file, 'w', encoding='utf-8') as f: + json.dump(source_diff_dict, f, ensure_ascii=False, indent=2) + thread_safe_print(f" 💾 Saved source diff dictionary to: {output_file}")Note: add
`import tempfile` at the top.
982-984: Remove redundant re-imports. `os` and `json` are already imported at module scope.
Apply this diff:
- import os - import json
671-690: Deduplicate helpers across modules. `clean_title_for_matching` is duplicated in `section_matcher.py`. Extract it into a shared utils module to avoid drift.
scripts/translate_doc_pr/section_matcher.py (3)
6-12: Remove unused imports. `Github`, `OpenAI`, `os`, `json` aren't used here.
Apply this diff:
-import os import re -import json import threading -from github import Github -from openai import OpenAI
430-456: Deduplicate `build_hierarchy_path` with `pr_analyzer`. Both modules implement identical logic. Extract it into a shared utility to avoid divergence.
16-19: Unify logging with `pr_analyzer`. A duplicate `thread_safe_print` exists across modules. Consider centralizing it in a utils module.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- .github/workflows/sync-docs-cn-to-en.yml (1 hunks)
- scripts/translate_doc_pr/__init__.py (1 hunks)
- scripts/translate_doc_pr/file_adder.py (1 hunks)
- scripts/translate_doc_pr/file_deleter.py (1 hunks)
- scripts/translate_doc_pr/file_updater.py (1 hunks)
- scripts/translate_doc_pr/main_workflow.py (1 hunks)
- scripts/translate_doc_pr/pr_analyzer.py (1 hunks)
- scripts/translate_doc_pr/requirements.txt (1 hunks)
- scripts/translate_doc_pr/section_matcher.py (1 hunks)
- scripts/translate_doc_pr/toc_processor.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (7)
scripts/translate_doc_pr/__init__.py (1)
- scripts/translate_doc_pr/main_workflow.py (1): main (544-688)

scripts/translate_doc_pr/file_deleter.py (6)
- scripts/translate_doc_pr/file_adder.py (1): thread_safe_print (16-18)
- scripts/translate_doc_pr/file_updater.py (1): thread_safe_print (17-19)
- scripts/translate_doc_pr/main_workflow.py (1): thread_safe_print (94-96)
- scripts/translate_doc_pr/pr_analyzer.py (1): thread_safe_print (16-19)
- scripts/translate_doc_pr/section_matcher.py (1): thread_safe_print (16-18)
- scripts/translate_doc_pr/toc_processor.py (1): thread_safe_print (16-18)

scripts/translate_doc_pr/file_adder.py (1)
- scripts/translate_doc_pr/main_workflow.py (4): thread_safe_print (94-96), main (544-688), print_token_estimation (137-144), chat_completion (165-198)

scripts/translate_doc_pr/section_matcher.py (2)
- scripts/translate_doc_pr/main_workflow.py (4): thread_safe_print (94-96), main (544-688), print_token_estimation (137-144), chat_completion (165-198)
- scripts/translate_doc_pr/pr_analyzer.py (5): thread_safe_print (16-19), clean_title_for_matching (672-689), build_hierarchy_path (291-316), extract_section_direct_content (454-479), extract_frontmatter_content (481-494)

scripts/translate_doc_pr/toc_processor.py (1)
- scripts/translate_doc_pr/main_workflow.py (4): thread_safe_print (94-96), main (544-688), print_token_estimation (137-144), chat_completion (165-198)

scripts/translate_doc_pr/main_workflow.py (6)
- scripts/translate_doc_pr/pr_analyzer.py (6): analyze_source_changes (979-1447), get_repo_config (27-39), get_target_hierarchy_and_content (603-616), parse_pr_url (22-25), thread_safe_print (16-19), get_pr_diff (41-62)
- scripts/translate_doc_pr/file_adder.py (2): process_added_files (135-193), thread_safe_print (16-18)
- scripts/translate_doc_pr/file_deleter.py (2): process_deleted_files (17-43), thread_safe_print (13-15)
- scripts/translate_doc_pr/file_updater.py (3): process_modified_sections (574-605), thread_safe_print (17-19), update_target_document_from_match_data (1196-1357)
- scripts/translate_doc_pr/toc_processor.py (2): process_toc_files (421-434), thread_safe_print (16-18)
- scripts/translate_doc_pr/section_matcher.py (2): match_source_diff_to_target (572-689), thread_safe_print (16-18)

scripts/translate_doc_pr/file_updater.py (4)
- scripts/translate_doc_pr/main_workflow.py (4): thread_safe_print (94-96), main (544-688), print_token_estimation (137-144), chat_completion (165-198)
- scripts/translate_doc_pr/pr_analyzer.py (6): thread_safe_print (16-19), find_section_boundaries (361-400), get_target_hierarchy_and_content (603-616), clean_title_for_matching (672-689), build_hierarchy_dict (228-289), extract_affected_sections (497-514)
- scripts/translate_doc_pr/section_matcher.py (9): thread_safe_print (16-18), find_direct_matches_for_special_files (138-298), filter_non_system_sections (300-316), get_corresponding_sections (318-379), is_system_variable_or_config (39-80), clean_title_for_matching (20-37), parse_ai_response (381-397), find_matching_line_numbers (399-428), map_insertion_points_to_target (457-551)
- scripts/translate_doc_pr/toc_processor.py (2): thread_safe_print (16-18), process_toc_file (332-419)
🪛 Ruff (0.13.1)
scripts/translate_doc_pr/__init__.py
1-1: Shebang is present but file is not executable
(EXE001)
scripts/translate_doc_pr/file_deleter.py
17-17: Unused function argument: github_client
(ARG001)
38-38: Do not catch blind exception: Exception
(BLE001)
43-43: f-string without any placeholders
Remove extraneous f
prefix
(F541)
scripts/translate_doc_pr/file_adder.py
112-112: f-string without any placeholders
Remove extraneous f
prefix
(F541)
115-115: Do not catch blind exception: Exception
(BLE001)
119-119: f-string without any placeholders
Remove extraneous f
prefix
(F541)
128-128: f-string without any placeholders
Remove extraneous f
prefix
(F541)
129-129: Consider moving this statement to an else
block
(TRY300)
131-131: Do not catch blind exception: Exception
(BLE001)
135-135: Unused function argument: pr_url
(ARG001)
135-135: Unused function argument: github_client
(ARG001)
190-190: Do not catch blind exception: Exception
(BLE001)
193-193: f-string without any placeholders
Remove extraneous f
prefix
(F541)
scripts/translate_doc_pr/section_matcher.py
88-88: f-string without any placeholders
Remove extraneous f
prefix
(F541)
122-122: f-string without any placeholders
Remove extraneous f
prefix
(F541)
138-138: Unused function argument: target_hierarchy
(ARG001)
169-169: f-string without any placeholders
Remove extraneous f
prefix
(F541)
235-235: f-string without any placeholders
Remove extraneous f
prefix
(F541)
242-242: f-string without any placeholders
Remove extraneous f
prefix
(F541)
294-294: String contains ambiguous ℹ
(INFORMATION SOURCE). Did you mean i
(LATIN SMALL LETTER I)?
(RUF001)
326-332: Possible SQL injection vector through string-based query construction
(S608)
335-335: f-string without any placeholders
Remove extraneous f
prefix
(F541)
337-337: f-string without any placeholders
Remove extraneous f
prefix
(F541)
354-354: Do not catch blind exception: Exception
(BLE001)
371-371: f-string without any placeholders
Remove extraneous f
prefix
(F541)
372-372: f-string without any placeholders
Remove extraneous f
prefix
(F541)
374-374: f-string without any placeholders
Remove extraneous f
prefix
(F541)
376-376: Consider moving this statement to an else
block
(TRY300)
377-377: Do not catch blind exception: Exception
(BLE001)
430-430: Unused function argument: lines
(ARG001)
457-457: Unused function argument: file_path
(ARG001)
457-457: Unused function argument: pr_url
(ARG001)
457-457: Unused function argument: github_client
(ARG001)
480-480: Unpacked variable failed_matches
is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
480-480: Unpacked variable skipped_sections
is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
486-486: Prefer next(iter(matched_dict.keys()))
over single element slice
Replace with next(iter(matched_dict.keys()))
(RUF015)
533-533: Prefer next(iter(ai_matched.keys()))
over single element slice
Replace with next(iter(ai_matched.keys()))
(RUF015)
587-587: f-string without any placeholders
Remove extraneous f
prefix
(F541)
593-593: f-string without any placeholders
Remove extraneous f
prefix
(F541)
626-626: f-string without any placeholders
Remove extraneous f
prefix
(F541)
629-629: String contains ambiguous ➕
(HEAVY PLUS SIGN). Did you mean +
(PLUS SIGN)?
(RUF001)
633-633: f-string without any placeholders
Remove extraneous f
prefix
(F541)
641-641: f-string without any placeholders
Remove extraneous f
prefix
(F541)
648-648: f-string without any placeholders
Remove extraneous f
prefix
(F541)
648-648: String contains ambiguous ➕
(HEAVY PLUS SIGN). Did you mean +
(PLUS SIGN)?
(RUF001)
711-711: Prefer next(iter(matched_dict.keys()))
over single element slice
Replace with next(iter(matched_dict.keys()))
(RUF015)
721-721: Unpacked variable failed_matches
is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
721-721: Unpacked variable skipped_sections
is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
725-725: Prefer next(iter(matched_dict.keys()))
over single element slice
Replace with next(iter(matched_dict.keys()))
(RUF015)
726-726: Prefer next(iter(matched_dict.values()))
over single element slice
Replace with next(iter(matched_dict.values()))
(RUF015)
762-762: Prefer next(iter(ai_matched.keys()))
over single element slice
Replace with next(iter(ai_matched.keys()))
(RUF015)
763-763: Prefer next(iter(ai_matched.values()))
over single element slice
Replace with next(iter(ai_matched.values()))
(RUF015)
811-811: Prefer next(iter(matched_dict.keys()))
over single element slice
Replace with next(iter(matched_dict.keys()))
(RUF015)
821-821: Unpacked variable failed_matches
is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
821-821: Unpacked variable skipped_sections
is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
825-825: Prefer next(iter(matched_dict.keys()))
over single element slice
Replace with next(iter(matched_dict.keys()))
(RUF015)
826-826: Prefer next(iter(matched_dict.values()))
over single element slice
Replace with next(iter(matched_dict.values()))
(RUF015)
854-854: Prefer next(iter(ai_matched.keys()))
over single element slice
Replace with next(iter(ai_matched.keys()))
(RUF015)
855-855: Prefer next(iter(ai_matched.values()))
over single element slice
Replace with next(iter(ai_matched.values()))
(RUF015)
917-917: Local variable current_level
is assigned to but never used
Remove assignment to unused variable current_level
(F841)
939-939: Loop control variable i
not used within loop body
Rename unused i
to _i
(B007)
scripts/translate_doc_pr/toc_processor.py
90-90: Unused function argument: file_path
(ARG001)
90-90: Unused function argument: target_local_path
(ARG001)
92-92: f-string without any placeholders
Remove extraneous f
prefix
(F541)
126-126: String contains ambiguous ➕
(HEAVY PLUS SIGN). Did you mean +
(PLUS SIGN)?
(RUF001)
241-241: f-string without any placeholders
Remove extraneous f
prefix
(F541)
286-286: f-string without any placeholders
Remove extraneous f
prefix
(F541)
289-289: Do not catch blind exception: Exception
(BLE001)
293-293: f-string without any placeholders
Remove extraneous f
prefix
(F541)
303-303: f-string without any placeholders
Remove extraneous f
prefix
(F541)
328-328: Do not catch blind exception: Exception
(BLE001)
332-332: Unused function argument: pr_url
(ARG001)
332-332: Unused function argument: github_client
(ARG001)
387-387: String contains ambiguous ➕
(HEAVY PLUS SIGN). Did you mean +
(PLUS SIGN)?
(RUF001)
400-400: String contains ambiguous ➕
(HEAVY PLUS SIGN). Did you mean +
(PLUS SIGN)?
(RUF001)
403-403: String contains ambiguous ➕
(HEAVY PLUS SIGN). Did you mean +
(PLUS SIGN)?
(RUF001)
418-418: Do not catch blind exception: Exception
(BLE001)
434-434: f-string without any placeholders
Remove extraneous f
prefix
(F541)
scripts/translate_doc_pr/main_workflow.py
56-56: Avoid specifying long messages outside the exception class
(TRY003)
115-115: f-string without any placeholders
Remove extraneous f
prefix
(F541)
119-119: f-string without any placeholders
Remove extraneous f
prefix
(F541)
132-132: Do not catch blind exception: Exception
(BLE001)
157-157: Avoid specifying long messages outside the exception class
(TRY003)
159-159: Avoid specifying long messages outside the exception class
(TRY003)
163-163: Avoid specifying long messages outside the exception class
(TRY003)
179-179: f-string without any placeholders
Remove extraneous f
prefix
(F541)
188-188: f-string without any placeholders
Remove extraneous f
prefix
(F541)
191-191: f-string without any placeholders
Remove extraneous f
prefix
(F541)
195-195: Use explicit conversion flag
Replace with conversion flag
(RUF010)
197-197: f-string without any placeholders
Remove extraneous f
prefix
(F541)
198-198: Use raise
without specifying exception name
Remove exception name
(TRY201)
221-221: Loop control variable key
not used within loop body
Rename unused key
to _key
(B007)
235-235: f-string without any placeholders
Remove extraneous f
prefix
(F541)
247-247: Do not catch blind exception: Exception
(BLE001)
271-271: Do not catch blind exception: Exception
(BLE001)
275-275: Unused function argument: target_sections
(ARG001)
312-312: Loop control variable key
not used within loop body
Rename unused key
to _key
(B007)
319-319: f-string without any placeholders
Remove extraneous f
prefix
(F541)
370-370: Unused function argument: file_sections
(ARG001)
411-411: f-string without any placeholders
Remove extraneous f
prefix
(F541)
414-414: f-string without any placeholders
Remove extraneous f
prefix
(F541)
434-434: f-string without any placeholders
Remove extraneous f
prefix
(F541)
446-446: f-string without any placeholders
Remove extraneous f
prefix
(F541)
458-458: f-string without any placeholders
Remove extraneous f
prefix
(F541)
474-474: Unpacked variable file_path
is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
505-505: f-string without any placeholders
Remove extraneous f
prefix
(F541)
510-510: f-string without any placeholders
Remove extraneous f
prefix
(F541)
513-513: f-string without any placeholders
Remove extraneous f
prefix
(F541)
517-517: f-string without any placeholders
Remove extraneous f
prefix
(F541)
520-520: f-string without any placeholders
Remove extraneous f
prefix
(F541)
523-523: Do not catch blind exception: Exception
(BLE001)
530-530: Redefinition of unused parse_pr_url
from line 21
Remove definition: parse_pr_url
(F811)
536-536: Avoid specifying long messages outside the exception class
(TRY003)
556-556: f-string without any placeholders
Remove extraneous f
prefix
(F541)
584-584: Do not catch blind exception: Exception
(BLE001)
591-591: f-string without any placeholders
Remove extraneous f
prefix
(F541)
599-599: f-string without any placeholders
Remove extraneous f
prefix
(F541)
600-600: Unpacked variable added_sections
is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
600-600: Unpacked variable deleted_sections
is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
610-610: f-string without any placeholders
Remove extraneous f
prefix
(F541)
614-614: Redefinition of unused process_toc_files
from line 25
Remove definition: process_toc_files
(F811)
620-620: f-string without any placeholders
Remove extraneous f
prefix
(F541)
626-626: f-string without any placeholders
Remove extraneous f
prefix
(F541)
632-632: f-string without any placeholders
Remove extraneous f
prefix
(F541)
658-658: f-string without any placeholders
Remove extraneous f
prefix
(F541)
683-683: f-string without any placeholders
Remove extraneous f
prefix
(F541)
688-688: f-string without any placeholders
Remove extraneous f
prefix
(F541)
scripts/translate_doc_pr/file_updater.py
51-51: Local variable diff_content
is assigned to but never used
Remove assignment to unused variable diff_content
(F841)
102-102: f-string without any placeholders
Remove extraneous f
prefix
(F541)
105-105: f-string without any placeholders
Remove extraneous f
prefix
(F541)
109-109: f-string without any placeholders
Remove extraneous f
prefix
(F541)
125-125: Do not catch blind exception: Exception
(BLE001)
138-138: f-string without any placeholders
Remove extraneous f
prefix
(F541)
150-150: Consider moving this statement to an else
block
(TRY300)
152-152: Do not catch blind exception: Exception
(BLE001)
164-164: f-string without any placeholders
Remove extraneous f
prefix
(F541)
173-173: f-string without any placeholders
Remove extraneous f
prefix
(F541)
176-176: f-string without any placeholders
Remove extraneous f
prefix
(F541)
180-180: f-string without any placeholders
Remove extraneous f
prefix
(F541)
200-200: Consider moving this statement to an else
block
(TRY300)
202-202: f-string without any placeholders
Remove extraneous f
prefix
(F541)
230-230: Consider moving this statement to an else
block
(TRY300)
231-231: Do not use bare except
(E722)
231-232: try
-except
-continue
detected, consider logging the exception
(S112)
235-235: Use raise
without specifying exception name
Remove exception name
(TRY201)
237-237: f-string without any placeholders
Remove extraneous f
prefix
(F541)
257-257: Do not catch blind exception: Exception
(BLE001)
417-417: Consider moving this statement to an else
block
(TRY300)
419-419: Do not catch blind exception: Exception
(BLE001)
430-430: Loop control variable i
not used within loop body
Rename unused i
to _i
(B007)
468-468: f-string without any placeholders
Remove extraneous f
prefix
(F541)
568-568: Consider moving this statement to an else
block
(TRY300)
570-570: Do not catch blind exception: Exception
(BLE001)
601-601: Do not catch blind exception: Exception
(BLE001)
633-633: Do not catch blind exception: Exception
(BLE001)
639-639: Unused function argument: pr_url
(ARG001)
684-684: Unpacked variable skipped_sections
is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
688-688: Loop control variable hierarchy_string
not used within loop body
Rename unused hierarchy_string
to _hierarchy_string
(B007)
726-726: Loop control variable source_line
not used within loop body
Rename unused source_line
to _source_line
(B007)
768-768: Local variable target_hierarchy
is assigned to but never used
Remove assignment to unused variable target_hierarchy
(F841)
808-808: Consider moving this statement to an else
block
(TRY300)
810-810: Do not catch blind exception: Exception
(BLE001)
1049-1049: Do not catch blind exception: Exception
(BLE001)
1056-1056: String contains ambiguous ➕
(HEAVY PLUS SIGN). Did you mean +
(PLUS SIGN)?
(RUF001)
1059-1059: String contains ambiguous ➕
(HEAVY PLUS SIGN). Did you mean +
(PLUS SIGN)?
(RUF001)
1066-1066: String contains ambiguous ➕
(HEAVY PLUS SIGN). Did you mean +
(PLUS SIGN)?
(RUF001)
1118-1118: Unused function argument: operation_type
(ARG001)
1183-1183: Do not catch blind exception: Exception
(BLE001)
1190-1190: f-string without any placeholders
Remove extraneous f
prefix
(F541)
1219-1219: f-string without any placeholders
Remove extraneous f
prefix
(F541)
1270-1270: Unpacked variable section_data
is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
1270-1270: Unpacked variable line_num
is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
1287-1287: f-string without any placeholders
Remove extraneous f
prefix
(F541)
1296-1296: f-string without any placeholders
Remove extraneous f
prefix
(F541)
1389-1389: f-string without any placeholders
Remove extraneous f
prefix
(F541)
1414-1414: f-string without any placeholders
Remove extraneous f
prefix
(F541)
1418-1418: f-string without any placeholders
Remove extraneous f
prefix
(F541)
1423-1423: Use explicit conversion flag
Replace with conversion flag
(RUF010)
1429-1429: f-string without any placeholders
Remove extraneous f
prefix
(F541)
1461-1461: f-string without any placeholders
Remove extraneous f
prefix
(F541)
1471-1471: f-string without any placeholders
Remove extraneous f
prefix
(F541)
1483-1483: f-string without any placeholders
Remove extraneous f
prefix
(F541)
1502-1502: f-string without any placeholders
Remove extraneous f
prefix
(F541)
1522-1522: f-string without any placeholders
Remove extraneous f
prefix
(F541)
1553-1553: f-string without any placeholders
Remove extraneous f
prefix
(F541)
1585-1585: f-string without any placeholders
Remove extraneous f
prefix
(F541)
1622-1622: f-string without any placeholders
Remove extraneous f
prefix
(F541)
1688-1688: f-string without any placeholders
Remove extraneous f
prefix
(F541)
scripts/translate_doc_pr/pr_analyzer.py
1-1: Shebang is present but file is not executable
(EXE001)
33-33: Avoid specifying long messages outside the exception class
(TRY003)
60-60: Do not catch blind exception: Exception
(BLE001)
240-240: Local variable original_line
is assigned to but never used
Remove assignment to unused variable original_line
(F841)
265-265: Local variable title
is assigned to but never used
Remove assignment to unused variable title
(F841)
291-291: Unused function argument: lines
(ARG001)
328-328: Local variable original_title
is assigned to but never used
Remove assignment to unused variable original_title
(F841)
341-341: Local variable title
is assigned to but never used
Remove assignment to unused variable title
(F841)
368-368: Loop control variable i
not used within loop body
Rename unused i
to _i
(B007)
402-402: Unused function argument: hierarchy_dict
(ARG001)
487-487: Loop control variable i
not used within loop body
Rename unused i
to _i
(B007)
526-526: Unused function argument: lines
(ARG001)
547-547: Unused function argument: lines
(ARG001)
613-613: Consider moving this statement to an else
block
(TRY300)
614-614: Do not catch blind exception: Exception
(BLE001)
644-644: Consider moving this statement to an else
block
(TRY300)
645-645: Do not catch blind exception: Exception
(BLE001)
665-665: Consider moving this statement to an else
block
(TRY300)
667-667: Do not catch blind exception: Exception
(BLE001)
746-746: Unused function argument: hierarchy_dict
(ARG001)
754-754: Unused function argument: base_hierarchy_dict
(ARG001)
922-922: Loop control variable base_line_str
not used within loop body
Rename unused base_line_str
to _base_line_str
(B007)
944-944: f-string without any placeholders
Remove extraneous f
prefix
(F541)
979-979: Unused function argument: max_non_system_sections
(ARG001)
979-979: Unused function argument: pr_diff
(ARG001)
982-982: Redefinition of unused os
from line 8
Remove definition: os
(F811)
983-983: Redefinition of unused json
from line 7
Remove definition: json
(F811)
1021-1021: String contains ambiguous ➕
(HEAVY PLUS SIGN). Did you mean +
(PLUS SIGN)?
(RUF001)
1025-1025: f-string without any placeholders
Remove extraneous f
prefix
(F541)
1027-1027: Do not catch blind exception: Exception
(BLE001)
1034-1034: f-string without any placeholders
Remove extraneous f
prefix
(F541)
1040-1040: Do not catch blind exception: Exception
(BLE001)
1053-1053: Do not catch blind exception: Exception
(BLE001)
1078-1078: f-string without any placeholders
Remove extraneous f
prefix
(F541)
1080-1080: String contains ambiguous ➕
(HEAVY PLUS SIGN). Did you mean +
(PLUS SIGN)?
(RUF001)
1086-1086: f-string without any placeholders
Remove extraneous f
prefix
(F541)
1086-1086: String contains ambiguous ℹ
(INFORMATION SOURCE). Did you mean i
(LATIN SMALL LETTER I)?
(RUF001)
1142-1142: Do not catch blind exception: Exception
(BLE001)
1161-1161: f-string without any placeholders
Remove extraneous f
prefix
(F541)
1202-1202: Local variable is_deleted_header
is assigned to but never used
Remove assignment to unused variable is_deleted_header
(F841)
1270-1270: f-string without any placeholders
Remove extraneous f
prefix
(F541)
1299-1299: f-string without any placeholders
Remove extraneous f
prefix
(F541)
1371-1371: String contains ambiguous ➕
(HEAVY PLUS SIGN). Did you mean +
(PLUS SIGN)?
(RUF001)
1402-1402: f-string without any placeholders
Remove extraneous f
prefix
(F541)
1413-1413: Loop control variable existing_line
not used within loop body
Rename unused existing_line
to _existing_line
(B007)
1437-1437: f-string without any placeholders
Remove extraneous f
prefix
(F541)
🪛 actionlint (1.7.7)
.github/workflows/sync-docs-cn-to-en.yml
37-37: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: tidb-check
🔇 Additional comments (1)
scripts/translate_doc_pr/main_workflow.py (1)
39-40: No changes needed: "gemini-2.0-flash" is a valid Google GenAI model ID.
```yaml
on:
  workflow_dispatch:
    inputs:
      source_pr_url:
        description: 'Source PR URL (Chinese docs repository)'
        required: true
        type: string
        default: ''
      target_pr_url:
        description: 'Target PR URL (English docs repository)'
        required: true
        type: string
        default: ''
      ai_provider:
        description: 'AI Provider to use for translation'
        required: false
        type: choice
        options:
          - deepseek
          - gemini
        default: 'gemini'
```
🛠️ Refactor suggestion | 🟠 Major
Declare least-privilege permissions for GITHUB_TOKEN.
Explicitly set permissions to enable contents and PR writes; current default may be read-only.
Apply:
```diff
 on:
   workflow_dispatch:
     inputs:
       source_pr_url:
         description: 'Source PR URL (Chinese docs repository)'
         required: true
         type: string
         default: ''
@@
         default: 'gemini'
+
+permissions:
+  contents: write
+  pull-requests: write
```
🤖 Prompt for AI Agents
.github/workflows/sync-docs-cn-to-en.yml around lines 3 to 24: the workflow
lacks an explicit permissions block for GITHUB_TOKEN, which can default to
read-only; add a top-level permissions section granting least-privilege write
access needed for the job (e.g., contents: write and pull-requests: write) so
the workflow can update repository contents and create/update PRs, ensuring no
broader permissions are granted.
```yaml
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'
```
Update setup-python to v5 (actionlint failure).
Use actions/setup-python@v5 to avoid runner errors.
Apply:
```diff
-      - name: Set up Python
-        uses: actions/setup-python@v4
+      - name: Set up Python
+        uses: actions/setup-python@v5
         with:
           python-version: '3.9'
```
🧰 Tools
🪛 actionlint (1.7.7)
37-37: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
🤖 Prompt for AI Agents
.github/workflows/sync-docs-cn-to-en.yml around lines 37 to 39: the workflow
pins actions/setup-python@v4 which causes actionlint/runner errors; update the
workflow to use actions/setup-python@v5 by changing the action reference to
actions/setup-python@v5 and ensure the existing with: python-version setting
remains unchanged (no other logic changes needed).
```yaml
      - name: Get target PR branch info
        id: target_branch
        run: |
          # Get target PR branch name
          TARGET_BRANCH=$(curl -s \
            -H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" \
            -H "Accept: application/vnd.github.v3+json" \
            "https://api.github.com/repos/${{ steps.extract_info.outputs.target_owner }}/${{ steps.extract_info.outputs.target_repo }}/pulls/${{ steps.extract_info.outputs.target_pr }}" \
            | jq -r '.head.ref')

          echo "target_branch=${TARGET_BRANCH}" >> $GITHUB_OUTPUT
          echo "Target branch: ${TARGET_BRANCH}"
```
Cross-repo PRs and forks: clone/push the PR head repository using a PAT; the current flow will fail.

Issues:

- You only fetch `.head.ref` but clone the base repo; this breaks for forked PRs (the branch won't exist in the base repo).
- GITHUB_TOKEN cannot push to another repo; use a PAT (or GitHub App) with `contents: write` on the head repo.
- Commenting on the target PR also needs a token authorized on that repo.

Fix by retrieving the head repo full name and cloning/pushing to it with a PAT, and by using the same PAT for API calls.
Suggested patch:
- name: Get target PR branch info
id: target_branch
run: |
# Get target PR branch name
- TARGET_BRANCH=$(curl -s \
+ PR_JSON=$(curl -s \
-H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" \
- -H "Accept: application/vnd.github.v3+json" \
+ -H "Accept: application/vnd.github+json" \
"https://api.github.com/repos/${{ steps.extract_info.outputs.target_owner }}/${{ steps.extract_info.outputs.target_repo }}/pulls/${{ steps.extract_info.outputs.target_pr }}" \
- | jq -r '.head.ref')
+ )
+ TARGET_BRANCH=$(echo "$PR_JSON" | jq -r '.head.ref')
+ TARGET_HEAD_REPO=$(echo "$PR_JSON" | jq -r '.head.repo.full_name')
+ echo "target_branch=${TARGET_BRANCH}" >> $GITHUB_OUTPUT
+ echo "target_head_repo=${TARGET_HEAD_REPO}" >> $GITHUB_OUTPUT
- echo "target_branch=${TARGET_BRANCH}" >> $GITHUB_OUTPUT
- echo "Target branch: ${TARGET_BRANCH}"
+ echo "Target branch: ${TARGET_BRANCH}"
+ echo "Head repo: ${TARGET_HEAD_REPO}"
- name: Clone target repository
run: |
- # Clone target repository with the PR branch
- git clone https://x-access-token:${{ secrets.GITHUB_TOKEN }}@github.com/${{ steps.extract_info.outputs.target_owner }}/${{ steps.extract_info.outputs.target_repo }}.git target_repo
+ # Clone PR head repository (supports forks) with a PAT
+ git clone https://x-access-token:${{ secrets.SYNC_REPO_TOKEN }}@github.com/${{ steps.target_branch.outputs.target_head_repo }}.git target_repo
cd target_repo
- git checkout ${{ steps.target_branch.outputs.target_branch }}
+ git fetch origin ${{ steps.target_branch.outputs.target_branch }}
+ git checkout -B ${{ steps.target_branch.outputs.target_branch }} origin/${{ steps.target_branch.outputs.target_branch }}
git config user.name "github-actions[bot]"
git config user.email "github-actions[bot]@users.noreply.github.com"
- name: Commit and push changes
run: |
cd target_repo
git add .
if git diff --staged --quiet; then
echo "No changes to commit"
else
git commit -m "Auto-sync: Update English docs from Chinese PR ${{ github.event.inputs.source_pr_url }}
Synced from: ${{ github.event.inputs.source_pr_url }}
Target PR: ${{ github.event.inputs.target_pr_url }}
AI Provider: ${{ github.event.inputs.ai_provider }}
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>"
- git push origin ${{ steps.target_branch.outputs.target_branch }}
+ git push https://x-access-token:${{ secrets.SYNC_REPO_TOKEN }}@github.com/${{ steps.target_branch.outputs.target_head_repo }}.git ${{ steps.target_branch.outputs.target_branch }}
echo "Changes pushed to target PR branch: ${{ steps.target_branch.outputs.target_branch }}"
fi
- name: Add comment to target PR
run: |
# Add a comment to the target PR about the sync
- curl -X POST \
- -H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" \
+ curl -X POST \
+ -H "Authorization: token ${{ secrets.SYNC_REPO_TOKEN }}" \
-H "Accept: application/vnd.github.v3+json" \
"https://api.github.com/repos/${{ steps.extract_info.outputs.target_owner }}/${{ steps.extract_info.outputs.target_repo }}/issues/${{ steps.extract_info.outputs.target_pr }}/comments" \
-d "{
\"body\": \"🤖 **Auto-sync completed**\\n\\n📥 **Source PR**: ${{ github.event.inputs.source_pr_url }}\\n🎯 **Target PR**: ${{ github.event.inputs.target_pr_url }}\\n✅ English documentation has been updated based on Chinese documentation changes.\\n\\n_This comment was generated automatically by the sync workflow._\"
}"
Also add a repo secret SYNC_REPO_TOKEN (PAT) with repo:write access on the head repo (or use a GitHub App).
Also applies to: 84-93, 125-134
🤖 Prompt for AI Agents
.github/workflows/sync-docs-cn-to-en.yml lines 71-83 (and similarly update
84-93, 125-134): the workflow currently reads only .head.ref and clones/pushes
the base repo using GITHUB_TOKEN which fails for forked PRs and prevents pushing
to the head repo; change the flow to fetch the head repo full_name (owner/repo)
and head.ref from the PR API, switch cloning/push operations to use that head
repo full_name, and replace GITHUB_TOKEN with a PAT stored in a repo secret
(e.g. SYNC_REPO_TOKEN with repo:write/contents scope) for git clone/push and for
any API calls (comments or other PR updates) so the token has permission on the
head repo; ensure the workflow uses the head repo URL when adding remotes and
pushing branches and update the other referenced blocks (84-93, 125-134) to use
the same head-repo + PAT approach.
```python
from main import main

# Make main function available at package level
__all__ = ["main"]
```
Fix package export import path.

`import scripts.translate_doc_pr` currently explodes with `ModuleNotFoundError: No module named 'main'` because the package init tries to do an absolute `from main import main`. The callable lives in `main_workflow.py`, so the package entry point is unusable until we point the import at the correct module.

```diff
-from main import main
+from .main_workflow import main
```
🤖 Prompt for AI Agents
In scripts/translate_doc_pr/__init__.py around lines 19 to 22, the package init
does an absolute import from a non-existent module ("main"), causing
ModuleNotFoundError; replace the import with a relative import that points to
the actual callable (main_workflow.py) — i.e., import the main function using a
package-relative import and keep __all__ as ["main"] so importing
scripts.translate_doc_pr exposes the correct entrypoint.
```python
    # Add token estimation
    try:
        from main import print_token_estimation
        print_token_estimation(prompt, "File addition translation")
    except ImportError:
        # Fallback if import fails - use tiktoken
```
Wrong module import: `print_token_estimation` is in `main_workflow.py`, not `main`. The import always raises ImportError and falls back, hiding token stats.

Apply:

```diff
-    try:
-        from main import print_token_estimation
+    try:
+        from main_workflow import print_token_estimation
         print_token_estimation(prompt, "File addition translation")
```
🤖 Prompt for AI Agents
In scripts/translate_doc_pr/file_adder.py around lines 100 to 105, the code
incorrectly imports print_token_estimation from module main (which causes an
ImportError and always triggers the fallback); update the import to load
print_token_estimation from main_workflow (from main_workflow import
print_token_estimation) and keep the existing try/except fallback behavior so
token estimation is used when available and falls back to tiktoken only if the
import or call fails.
```python
    try:
        from main import print_token_estimation
        print_token_estimation(prompt, f"Section mapping ({source_language} → {target_language})")
    except ImportError:
        # Fallback if import fails - use tiktoken
        try:
            import tiktoken
```
Fix import of token estimation helper (wrong module name).

`main_workflow.py` defines `print_token_estimation`. Importing from `main` will always fail and suppress the feature.

Apply this diff:

```diff
-    try:
-        from main import print_token_estimation
+    try:
+        # Try local module name first
+        from main_workflow import print_token_estimation
         print_token_estimation(prompt, f"Section mapping ({source_language} → {target_language})")
```

Optional: add a second fallback try for package-style import if you plan to package this module.
🤖 Prompt for AI Agents
In scripts/translate_doc_pr/section_matcher.py around lines 340-346, the code
tries to import print_token_estimation from the wrong module name ("main"),
which always triggers the ImportError and disables the helper; change the import
to "from main_workflow import print_token_estimation" and call it as before;
additionally add a second fallback import attempt for package-style imports
(e.g., "from .main_workflow import print_token_estimation") before falling back
to the tiktoken-based implementation so the helper works both when run
standalone and when packaged.
```python
        'insertion_after_line': insertion_after_line,
        'target_hierarchy': target_hierarchy.get(str(target_line_num), ''),
        'insertion_type': point_info['insertion_type'],
        'new_sections': point_info['new_sections']
    }
```
Key type mismatch when reading `target_hierarchy`.

`target_hierarchy` keys are ints; using `str(target_line_num)` means the lookup always falls back to the empty string.

Apply this diff:

```diff
-        'target_hierarchy': target_hierarchy.get(str(target_line_num), ''),
+        'target_hierarchy': target_hierarchy.get(target_line_num, ''),
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    'insertion_after_line': insertion_after_line,
    'target_hierarchy': target_hierarchy.get(target_line_num, ''),
    'insertion_type': point_info['insertion_type'],
    'new_sections': point_info['new_sections']
}
🤖 Prompt for AI Agents
In scripts/translate_doc_pr/section_matcher.py around lines 493 to 497, the code
uses str(target_line_num) to read from target_hierarchy whose keys are integers,
causing lookups to return an empty string; change the lookup to use the integer
key (i.e., target_hierarchy.get(target_line_num, '')) so the correct value is
returned (and keep the existing fallback ''), ensuring types match between the
key and the lookup.
"""Extract target section content from target_lines (includes sub-sections)""" | ||
if target_line_num >= len(target_lines): | ||
return "" | ||
|
||
start_line = target_line_num - 1 # Convert to 0-based index | ||
|
||
# Find the end of the section by looking for the next header | ||
current_line = target_lines[start_line].strip() | ||
if not current_line.startswith('#'): | ||
return current_line |
Off-by-one: allow extracting content at the last line.
The guard should be a 1-based range check.
Apply this diff:
- if target_line_num >= len(target_lines):
+ if target_line_num < 1 or target_line_num > len(target_lines):
return ""
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
"""Extract target section content from target_lines (includes sub-sections)""" | |
if target_line_num >= len(target_lines): | |
return "" | |
start_line = target_line_num - 1 # Convert to 0-based index | |
# Find the end of the section by looking for the next header | |
current_line = target_lines[start_line].strip() | |
if not current_line.startswith('#'): | |
return current_line | |
"""Extract target section content from target_lines (includes sub-sections)""" | |
if target_line_num < 1 or target_line_num > len(target_lines): | |
return "" | |
start_line = target_line_num - 1 # Convert to 0-based index | |
# Find the end of the section by looking for the next header | |
current_line = target_lines[start_line].strip() | |
if not current_line.startswith('#'): | |
return current_line |
🤖 Prompt for AI Agents
In scripts/translate_doc_pr/section_matcher.py around lines 866 to 875, the
current guard uses a 0-based check (if target_line_num >= len(target_lines))
which prevents extracting the last line when target_line_num is 1-based; change
the guard to a 1-based range check (e.g., if target_line_num < 1 or
target_line_num > len(target_lines): return "") so valid 1-based indices
including the last line are allowed, then keep start_line = target_line_num - 1
and the subsequent logic unchanged.
def extract_section_direct_content(target_line_num, target_lines):
    """Extract ONLY the direct content of a section (excluding sub-sections)"""
    if target_line_num >= len(target_lines):
        return ""

    start_line = target_line_num - 1  # Convert to 0-based index

    # Find the end of the section by looking for the next header
    current_line = target_lines[start_line].strip()
Same off-by-one in direct-content extractor.
Apply this diff:
- if target_line_num >= len(target_lines):
+ if target_line_num < 1 or target_line_num > len(target_lines):
return ""
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def extract_section_direct_content(target_line_num, target_lines):
    """Extract ONLY the direct content of a section (excluding sub-sections)"""
    if target_line_num < 1 or target_line_num > len(target_lines):
        return ""

    start_line = target_line_num - 1  # Convert to 0-based index

    # Find the end of the section by looking for the next header
    current_line = target_lines[start_line].strip()
🤖 Prompt for AI Agents
In scripts/translate_doc_pr/section_matcher.py around lines 905 to 913, the
direct-content extractor has the same off-by-one bug: the guard treats the
1-based target_line_num as a 0-based index, so the last line can never be
extracted. Fix it by changing the guard to a 1-based range check (if
target_line_num < 1 or target_line_num > len(target_lines): return "") and keep
start_line = target_line_num - 1; leave the subsequent current_line =
target_lines[start_line].strip() and section end detection logic unchanged.
# Add token estimation
try:
    from main import print_token_estimation
    print_token_estimation(prompt, "TOC translation")
except ImportError:
Wrong module import: print_token_estimation is in main_workflow.py.
- try:
-     from main import print_token_estimation
+ try:
+     from main_workflow import print_token_estimation
      print_token_estimation(prompt, "TOC translation")
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
# Add token estimation
try:
    from main_workflow import print_token_estimation
    print_token_estimation(prompt, "TOC translation")
except ImportError:
🤖 Prompt for AI Agents
In scripts/translate_doc_pr/toc_processor.py around lines 274 to 278, the code
attempts to import print_token_estimation from main but that function lives in
main_workflow.py; change the import to "from main_workflow import
print_token_estimation" (keeping the existing try/except ImportError block) so
the correct module is referenced and token estimation is called successfully.
pingcap#21819)
First-time contributors' checklist
What is changed, added or deleted? (Required)
Which TiDB version(s) do your changes apply to? (Required)
Tips for choosing the affected version(s):
By default, CHOOSE MASTER ONLY so your changes will be applied to the next TiDB major or minor releases. If your PR involves a product feature behavior change or a compatibility change, CHOOSE THE AFFECTED RELEASE BRANCH(ES) AND MASTER.
For details, see tips for choosing the affected versions.
What is the related PR or file link(s)?
Do your changes match any of the following descriptions?
Summary by CodeRabbit
New Features
Chores