
Conversation

qiancai
Owner

@qiancai qiancai commented Sep 25, 2025

pingcap#21819)

First-time contributors' checklist

What is changed, added or deleted? (Required)

Which TiDB version(s) do your changes apply to? (Required)

Tips for choosing the affected version(s):

By default, CHOOSE MASTER ONLY so your changes will be applied to the next TiDB major or minor releases. If your PR involves a product feature behavior change or a compatibility change, CHOOSE THE AFFECTED RELEASE BRANCH(ES) AND MASTER.

For details, see tips for choosing the affected versions.

  • master (the latest development version)
  • v9.0 (TiDB 9.0 versions)
  • v8.5 (TiDB 8.5 versions)
  • v8.1 (TiDB 8.1 versions)
  • v7.5 (TiDB 7.5 versions)
  • v7.1 (TiDB 7.1 versions)
  • v6.5 (TiDB 6.5 versions)
  • v6.1 (TiDB 6.1 versions)
  • v5.4 (TiDB 5.4 versions)

What is the related PR or file link(s)?

  • This PR is translated from:
  • Other reference link(s):

Do your changes match any of the following descriptions?

  • Delete files
  • Change aliases
  • Need modification after being applied to another branch
  • Might cause conflicts after being applied to another branch

Summary by CodeRabbit

  • New Features

    • Automatic synchronization of Chinese-to-English documentation updates for pull requests.
    • AI-assisted translation for added, modified, and deleted sections, including TOC updates.
    • Status comments on target PRs after sync completion.
  • Chores

    • Added a CI workflow to orchestrate doc syncing.
    • Introduced supporting scripts for PR analysis, section matching, file updates, and TOC processing.
    • Added required third-party dependencies for GitHub, AI, and tokenization integrations.


coderabbitai bot commented Sep 25, 2025

Walkthrough

Adds a GitHub Actions workflow to sync English docs from a Chinese PR and introduces a new Python-based pipeline under scripts/translate_doc_pr for analyzing PR diffs, matching sections, translating content via pluggable AI providers, and applying changes (add/update/delete/TOC) to target repos. Includes an orchestrating entry point, helpers, and dependencies.

Changes

Cohort / File(s) Summary
CI Workflow: PR sync orchestration
.github/workflows/sync-docs-cn-to-en.yml
New workflow to sync docs from a source PR to a target PR. Checks out repos, resolves target branch, runs translation workflow, commits/pushes if changes, and comments on the PR.
Packaging & Dependencies
scripts/translate_doc_pr/__init__.py, scripts/translate_doc_pr/requirements.txt
Adds package initializer exposing main and introduces dependencies: PyGithub, OpenAI-compatible, tiktoken, Google Generative AI.
Orchestrator
scripts/translate_doc_pr/main_workflow.py
New end-to-end entry point: config loading, AI provider abstraction, PR diff retrieval, change classification, and sequential processing of added/modified/deleted/TOC via submodules.
PR Analysis & Matching
scripts/translate_doc_pr/pr_analyzer.py, scripts/translate_doc_pr/section_matcher.py
Adds PR URL parsing, diff extraction, Markdown hierarchy building, section detection, and AI-assisted mapping of source sections to target locations with token-estimation and diagnostics.
File Operations: Update/Add/Delete
scripts/translate_doc_pr/file_updater.py, scripts/translate_doc_pr/file_adder.py, scripts/translate_doc_pr/file_deleter.py
Introduces AI-assisted section update/insert/delete utilities, batch translation for added files, and safe deletion of target files, with thread-safe logging and fallbacks.
TOC Processing
scripts/translate_doc_pr/toc_processor.py
New TOC-specific logic: link extraction, translation decisions, best-match line finding, grouping and applying added/modified/deleted TOC lines with AI translation where needed.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Dev as Developer
  participant GH as GitHub Actions
  participant MW as main_workflow.py
  participant GHAPI as GitHub API
  participant AI as UnifiedAIClient
  participant Repo as Target Repo (branch)

  Dev->>GH: Dispatch Workflow (source_pr_url, target_pr_url, ai_provider)
  GH->>MW: Run Python entrypoint with env/tokens
  MW->>GHAPI: Fetch target PR branch, source PR diff
  MW->>MW: Classify changes (added/modified/deleted/TOC)
  opt Added files
    MW->>AI: Translate in batches
    MW->>Repo: Write new files
  end
  opt Modified/Deleted sections
    MW->>GHAPI: Get source/target contents
    MW->>AI: Map/translate sections
    MW->>Repo: Apply updates/deletions
  end
  opt TOC
    MW->>AI: Translate required lines
    MW->>Repo: Apply TOC edits
  end
  MW->>Repo: Stage and commit if changes
  GH->>GHAPI: Push commit, comment on target PR
sequenceDiagram
  autonumber
  participant MW as main_workflow.py
  participant ANA as pr_analyzer.py
  participant MATCH as section_matcher.py
  participant UPD as file_updater.py
  participant ADD as file_adder.py
  participant DEL as file_deleter.py
  participant TOC as toc_processor.py
  participant AI as UnifiedAIClient

  MW->>ANA: get_pr_diff + build hierarchies
  ANA-->>MW: operations + section structures
  alt Added files
    MW->>ADD: process_added_files(...)
    ADD->>AI: translate batches
  else Modified sections
    MW->>MATCH: match_source_diff_to_target(...)
    MATCH->>AI: assist mapping
    MW->>UPD: process_modified_sections(...)
    UPD->>AI: get_updated_sections_from_ai(...)
  else Deleted sections/files
    MW->>DEL: process_deleted_files(...) or UPD.delete_sections(...)
  else TOC changes
    MW->>TOC: process_toc_files(...)
    TOC->>AI: translate_toc_lines(...)
  end

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120+ minutes

Poem

I thump my paws—commit by commit,
Diffs hop in lines, I nibble each bit.
Sections align, TOC finds its friend,
AI whispers where translations blend.
Push goes the branch—carrot-bright green,
Docs in duet: CN to EN, serene. 🥕✨

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Title Check: ⚠️ Warning. The title “Update system-variables” does not reflect the primary changes in this pull request, which introduce a new GitHub Actions workflow and a suite of translation scripts rather than modifications to system variable documentation. It is misleading and does not convey the core purpose of the changeset. Resolution: rename the pull request title to clearly summarize the main contribution, for example by referencing the addition of the documentation sync workflow and translation modules, so that reviewers can immediately understand the primary change.
  • Description Check: ⚠️ Warning. The pull request description only contains the contribution template with empty checkboxes and no substantive answers in the required sections, so it fails to explain what was changed, which versions are affected, or provide any related links. This makes it impossible for reviewers to assess the scope or impact of the changes. Resolution: complete the template by describing what files and functionality were added or modified, selecting the affected TiDB versions, and including any relevant PR or file links so that reviewers have the necessary context to evaluate the changes.
✅ Passed checks (1 passed)
  • Docstring Coverage: ✅ Passed. Docstring coverage is 89.22%, which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch sync

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


Summary of Changes

Hello @qiancai, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily introduces a new, modular Python-based tool for automating documentation synchronization and translation across different language repositories. It integrates AI capabilities for intelligent content matching and translation, significantly improving the efficiency and accuracy of documentation updates. Additionally, it includes minor but important updates to existing documentation regarding system variables and TiKV configuration.

Highlights

  • New Documentation Auto-Sync and Translation Tool: Introduced a comprehensive Python package scripts/translate_doc_pr designed for automated synchronization and AI-powered translation of documentation. This tool aims to streamline the process of keeping documentation consistent across different language versions.
  • Modular Architecture: The new auto-sync tool is built with a modular design, separating concerns into distinct modules for PR analysis, section matching, file addition/deletion/updating, TOC processing, and main workflow orchestration, enhancing maintainability and testability.
  • AI-Powered Translation and Matching: The tool leverages AI (DeepSeek and Gemini) for intelligent section matching and content translation, ensuring that changes in source documentation are accurately reflected and translated in target language versions. It includes logic for handling various content types, including system variables and configuration items.
  • Enhanced DDL Parameter Management: The system-variables.md file has been updated to reflect new capabilities in TiDB v8.5.0, allowing modification of tidb_ddl_reorg_batch_size and tidb_ddl_reorg_worker_cnt for running DDL jobs using ADMIN ALTER DDL JOBS.
  • TiKV In-Memory Engine Configuration Update: The tikv-configuration-file.md documentation now provides a more detailed description and updated default value for the capacity parameter of the TiKV MVCC in-memory engine.
Ignored Files
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/sync-docs-cn-to-en.yml
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a substantial new Python package for automating documentation translation, alongside some documentation updates. The new translation tool is a significant piece of work, but it has several critical issues that need to be addressed before it can be considered production-ready. The main problems are incorrect module imports that will cause runtime failures, a high degree of code duplication across modules, and several extremely large and complex functions that make the code difficult to maintain and debug. My review focuses on these structural and maintainability issues in the Python code. The documentation changes themselves are minor and look good.

"""

# Import main functionality for easy access
from main import main


critical

This import statement is incorrect and will cause a runtime error. There is no top-level main module in the standard library or in this project. Based on the project structure, you likely intended to import from main_workflow.py within this package. This should be a relative import: from .main_workflow import main.

This same incorrect import pattern (from main import ...) appears in several other files in this package and needs to be fixed throughout for the package to be importable and usable.
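
As a rough sketch of the suggested fix (assuming main is defined at module level in main_workflow.py), the package initializer could read:

```python
# scripts/translate_doc_pr/__init__.py (sketch of the reviewer's suggested fix)
# Import main functionality for easy access, using a relative import so the
# package works regardless of the caller's working directory.
from .main_workflow import main

__all__ = ["main"]
```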

Comment on lines +814 to +1051
def process_single_file(file_path, source_sections, pr_diff, pr_url, github_client, ai_client, repo_config, max_non_system_sections=120):
"""Process a single file - thread-safe function for parallel processing"""
thread_id = threading.current_thread().name
thread_safe_print(f"\n📄 [{thread_id}] Processing {file_path}")

try:
# Check if this is a TOC file with special operations
if isinstance(source_sections, dict) and 'type' in source_sections and source_sections['type'] == 'toc':
from toc_processor import process_toc_file
return process_toc_file(file_path, source_sections, pr_url, github_client, ai_client, repo_config)

# Check if this is enhanced sections
if isinstance(source_sections, dict) and 'sections' in source_sections:
if source_sections.get('type') == 'enhanced_sections':
# Skip all the matching logic and directly extract data
thread_safe_print(f" [{thread_id}] 🚀 Using enhanced sections data, skipping matching logic")
enhanced_sections = source_sections['sections']

# Extract target sections and source old content from enhanced sections
# Maintain the exact order from match_source_diff_to_target.json
from collections import OrderedDict
target_sections = OrderedDict()
source_old_content_dict = OrderedDict()

# Process in the exact order they appear in enhanced_sections (which comes from match_source_diff_to_target.json)
for key, section_info in enhanced_sections.items():
if isinstance(section_info, dict):
operation = section_info.get('source_operation', '')

# Skip deleted sections - they shouldn't be in the enhanced_sections anyway
if operation == 'deleted':
continue

# For source sections: use old_content for modified, new_content for added
if operation == 'added':
source_content = section_info.get('source_new_content', '')
else: # modified
source_content = section_info.get('source_old_content', '')

# For target sections: use target_content for modified, empty string for added
if operation == 'added':
target_content = "" # Added sections have no existing target content
else: # modified
target_content = section_info.get('target_content', '')

# Add to both dictionaries using the same key from match_source_diff_to_target.json
if source_content is not None:
source_old_content_dict[key] = source_content
target_sections[key] = target_content

thread_safe_print(f" [{thread_id}] 📊 Extracted: {len(target_sections)} target sections, {len(source_old_content_dict)} source old content entries")

# Update sections with AI (get-updated-target-sections.py logic)
thread_safe_print(f" [{thread_id}] 🤖 Getting updated sections from AI...")
updated_sections = get_updated_sections_from_ai(pr_diff, target_sections, source_old_content_dict, ai_client, repo_config['source_language'], repo_config['target_language'], file_path)
if not updated_sections:
thread_safe_print(f" [{thread_id}] ⚠️ Could not get AI update")
return False, f"Could not get AI update for {file_path}"

# Return the AI results for further processing
thread_safe_print(f" [{thread_id}] ✅ Successfully got AI translation results for {file_path}")
return True, updated_sections # Return the actual AI results

else:
# New format: complete data structure
actual_sections = source_sections['sections']

# Regular file processing continues here for old format
# Get target hierarchy and content (get-target-affected-hierarchy.py logic)
from pr_analyzer import get_target_hierarchy_and_content
target_hierarchy, target_lines = get_target_hierarchy_and_content(file_path, github_client, repo_config['target_repo'])
if not target_hierarchy:
thread_safe_print(f" [{thread_id}] ⚠️ Could not get target content")
return False, f"Could not get target content for {file_path}"
else:
# Old format: direct dict
actual_sections = source_sections

# Only do mapping if we don't have enhanced sections
if 'enhanced_sections' not in locals() or not enhanced_sections:
# Separate different types of sections
from section_matcher import is_system_variable_or_config
system_var_sections = {}
toplevel_sections = {}
frontmatter_sections = {}
regular_sections = {}

for line_num, hierarchy in actual_sections.items():
if line_num == "0" and hierarchy == "frontmatter":
# Special handling for frontmatter
frontmatter_sections[line_num] = hierarchy
else:
# Extract the leaf title from hierarchy
leaf_title = hierarchy.split(' > ')[-1] if ' > ' in hierarchy else hierarchy

if is_system_variable_or_config(leaf_title):
system_var_sections[line_num] = hierarchy
elif leaf_title.startswith('# '):
# Top-level titles need special handling
toplevel_sections[line_num] = hierarchy
else:
regular_sections[line_num] = hierarchy

thread_safe_print(f" [{thread_id}] 📊 Found {len(system_var_sections)} system variable/config, {len(toplevel_sections)} top-level, {len(frontmatter_sections)} frontmatter, and {len(regular_sections)} regular sections")

target_affected = {}

# Process frontmatter sections with special handling
if frontmatter_sections:
thread_safe_print(f" [{thread_id}] 📄 Processing frontmatter section...")
# For frontmatter, we simply map it to line 0 in target
for line_num, hierarchy in frontmatter_sections.items():
target_affected[line_num] = hierarchy
thread_safe_print(f" [{thread_id}] ✅ Mapped {len(frontmatter_sections)} frontmatter section")

# Process top-level titles with special matching
if toplevel_sections:
thread_safe_print(f" [{thread_id}] 🔝 Top-level title matching for {len(toplevel_sections)} sections...")
from section_matcher import find_toplevel_title_matches
toplevel_matched, toplevel_failed, toplevel_skipped = find_toplevel_title_matches(toplevel_sections, target_lines)

if toplevel_matched:
target_affected.update(toplevel_matched)
thread_safe_print(f" [{thread_id}] ✅ Top-level matched {len(toplevel_matched)} sections")

if toplevel_failed:
thread_safe_print(f" [{thread_id}] ⚠️ {len(toplevel_failed)} top-level sections failed matching")
for failed in toplevel_failed:
thread_safe_print(f" ❌ {failed['hierarchy']}: {failed['reason']}")

# Process system variables/config sections with direct matching
if system_var_sections:
thread_safe_print(f" [{thread_id}] 🎯 Direct matching {len(system_var_sections)} system variable/config sections...")
from section_matcher import find_direct_matches_for_special_files
direct_matched, failed_matches, skipped_sections = find_direct_matches_for_special_files(system_var_sections, target_hierarchy, target_lines)

if direct_matched:
target_affected.update(direct_matched)
thread_safe_print(f" [{thread_id}] ✅ Direct matched {len(direct_matched)} system variable/config sections")

if failed_matches:
thread_safe_print(f" [{thread_id}] ⚠️ {len(failed_matches)} system variable/config sections failed direct matching")
for failed in failed_matches:
thread_safe_print(f" ❌ {failed['hierarchy']}: {failed['reason']}")

# Process regular sections with AI mapping using filtered target hierarchy
if regular_sections:
thread_safe_print(f" [{thread_id}] 🤖 AI mapping {len(regular_sections)} regular sections...")

# Filter target hierarchy to only include non-system sections for AI mapping
from section_matcher import filter_non_system_sections
filtered_target_hierarchy = filter_non_system_sections(target_hierarchy)

# Check if filtered target hierarchy exceeds the maximum allowed for AI mapping
MAX_NON_SYSTEM_SECTIONS_FOR_AI = 120
if len(filtered_target_hierarchy) > MAX_NON_SYSTEM_SECTIONS_FOR_AI:
thread_safe_print(f" [{thread_id}] ❌ Too many non-system sections ({len(filtered_target_hierarchy)} > {MAX_NON_SYSTEM_SECTIONS_FOR_AI})")
thread_safe_print(f" [{thread_id}] ⚠️ Skipping AI mapping for regular sections to avoid complexity")

# If no system sections were matched either, return error
if not target_affected:
error_message = f"File {file_path} has too many non-system sections ({len(filtered_target_hierarchy)} > {MAX_NON_SYSTEM_SECTIONS_FOR_AI}) and no system variable sections were matched"
return False, error_message

# Continue with only system variable matches if available
thread_safe_print(f" [{thread_id}] ✅ Proceeding with {len(target_affected)} system variable/config sections only")
else:
# Proceed with AI mapping using filtered hierarchy
source_list = list(regular_sections.values())
target_list = list(filtered_target_hierarchy.values())

from section_matcher import get_corresponding_sections
ai_response = get_corresponding_sections(source_list, target_list, ai_client, repo_config['source_language'], repo_config['target_language'], max_tokens=20000)
if ai_response:
# Parse AI response and find matching line numbers in the original (unfiltered) hierarchy
from section_matcher import parse_ai_response, find_matching_line_numbers
ai_sections = parse_ai_response(ai_response)
ai_matched = find_matching_line_numbers(ai_sections, target_hierarchy) # Use original hierarchy for line number lookup

if ai_matched:
target_affected.update(ai_matched)
thread_safe_print(f" [{thread_id}] ✅ AI mapped {len(ai_matched)} regular sections")
else:
thread_safe_print(f" [{thread_id}] ⚠️ AI mapping failed for regular sections")
else:
thread_safe_print(f" [{thread_id}] ⚠️ Could not get AI response for regular sections")

# Summary of mapping results
thread_safe_print(f" [{thread_id}] 📊 Total mapped: {len(target_affected)} out of {len(actual_sections)} sections")

if not target_affected:
thread_safe_print(f" [{thread_id}] ⚠️ Could not map sections")
return False, f"Could not map sections for {file_path}"

thread_safe_print(f" [{thread_id}] ✅ Mapped {len(target_affected)} sections")

# Extract target sections (get-target-affected-sections.py logic)
thread_safe_print(f" [{thread_id}] 📝 Extracting target sections...")
from pr_analyzer import extract_affected_sections
target_sections = extract_affected_sections(target_affected, target_lines)

# Extract source old content from the enhanced data structure
thread_safe_print(f" [{thread_id}] 📖 Extracting source old content...")
source_old_content_dict = {}

# Handle different data structures for source_sections
if isinstance(source_sections, dict) and 'sections' in source_sections:
# New format: complete data structure with enhanced matching info
for key, section_info in source_sections.items():
if isinstance(section_info, dict) and 'source_old_content' in section_info:
source_old_content_dict[key] = section_info['source_old_content']
else:
# Fallback: if we don't have the enhanced structure, we need to get it differently
thread_safe_print(f" [{thread_id}] ⚠️ Source sections missing enhanced structure, using fallback")
# For now, create empty dict to avoid errors - this should be addressed in the calling code
source_old_content_dict = {}

# Update sections with AI (get-updated-target-sections.py logic)
thread_safe_print(f" [{thread_id}] 🤖 Getting updated sections from AI...")
updated_sections = get_updated_sections_from_ai(pr_diff, target_sections, source_old_content_dict, ai_client, repo_config['source_language'], repo_config['target_language'], file_path)
if not updated_sections:
thread_safe_print(f" [{thread_id}] ⚠️ Could not get AI update")
return False, f"Could not get AI update for {file_path}"

# Update local document (update-target-doc-v2.py logic)
thread_safe_print(f" [{thread_id}] 💾 Updating local document...")
success = update_local_document(file_path, updated_sections, target_affected, repo_config['target_local_path'])

if success:
thread_safe_print(f" [{thread_id}] 🎉 Successfully updated {file_path}")
return True, f"Successfully updated {file_path}"
else:
thread_safe_print(f" [{thread_id}] ❌ Failed to update {file_path}")
return False, f"Failed to update {file_path}"

except Exception as e:
thread_safe_print(f" [{thread_id}] ❌ Error processing {file_path}: {e}")
return False, f"Error processing {file_path}: {e}"


high

The function process_single_file is extremely long (over 200 lines) and handles many different cases (TOC files, enhanced sections, regular files with different section types). This violates the Single Responsibility Principle and makes the code very difficult to read, test, and maintain. Please refactor this function by breaking it down into smaller, more focused functions, each handling a specific part of the logic.
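
For illustration only, one way to slim it down is a thin dispatcher over per-format handlers; the _process_* helpers below are hypothetical names whose bodies would be extracted from the existing branches of the function:

```python
def process_single_file(file_path, source_sections, pr_diff, pr_url,
                        github_client, ai_client, repo_config,
                        max_non_system_sections=120):
    """Dispatch to a focused handler based on the shape of source_sections."""
    try:
        if isinstance(source_sections, dict) and source_sections.get('type') == 'toc':
            # TOC payload: {'type': 'toc', 'operations': [...]}
            return _process_toc_file(file_path, source_sections, pr_url,
                                     github_client, ai_client, repo_config)
        if isinstance(source_sections, dict) and source_sections.get('type') == 'enhanced_sections':
            # Pre-matched payload: skip matching and go straight to AI translation
            return _process_enhanced_sections(file_path, source_sections['sections'],
                                              pr_diff, ai_client, repo_config)
        # Legacy payload: map sections to the target file first, then translate
        return _process_legacy_sections(file_path, source_sections, pr_diff, pr_url,
                                        github_client, ai_client, repo_config,
                                        max_non_system_sections)
    except Exception as e:
        return False, f"Error processing {file_path}: {e}"
```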

Comment on lines +380 to +525
def process_regular_modified_file(source_file_path, file_sections, file_diff, pr_url, github_client, ai_client, repo_config, max_sections):
    """Process a regular markdown file that has been modified"""
    try:
        print(f" 📝 Processing as regular modified file: {source_file_path}")

        # Extract the actual sections from the file_sections structure
        # file_sections contains: {'sections': {...}, 'original_hierarchy': {...}, 'current_hierarchy': {...}}
        if isinstance(file_sections, dict) and 'sections' in file_sections:
            actual_sections = file_sections['sections']
        else:
            # Fallback: assume file_sections is already the sections dict
            actual_sections = file_sections

        print(f" 📊 Extracted sections: {len(actual_sections)} sections")

        # CRITICAL: Load the source-diff-dict.json and perform matching
        import json
        import os
        from section_matcher import match_source_diff_to_target
        from pr_analyzer import get_target_hierarchy_and_content

        # Load source-diff-dict.json with file prefix
        temp_dir = ensure_temp_output_dir()
        file_prefix = source_file_path.replace('/', '-').replace('.md', '')
        source_diff_dict_file = os.path.join(temp_dir, f"{file_prefix}-source-diff-dict.json")
        if os.path.exists(source_diff_dict_file):
            with open(source_diff_dict_file, 'r', encoding='utf-8') as f:
                source_diff_dict = json.load(f)
            print(f" 📂 Loaded source diff dict with {len(source_diff_dict)} sections from {source_diff_dict_file}")

            # Check source token limit before proceeding with processing
            print(f" 🔍 Checking source token limit...")
            within_limit, total_tokens, token_limit = check_source_token_limit(source_diff_dict_file)
            if not within_limit:
                print(f" 🚫 Skipping file processing: source content exceeds token limit")
                print(f" 📊 Total tokens: {total_tokens:,} > Limit: {token_limit:,}")
                print(f" ⏭️ File {source_file_path} will not be processed")
                return False

        else:
            print(f" ❌ {source_diff_dict_file} not found")
            return False

        # Get target file hierarchy and content
        target_repo = repo_config['target_repo']
        target_hierarchy, target_lines = get_target_hierarchy_and_content(source_file_path, github_client, target_repo)

        if not target_hierarchy or not target_lines:
            print(f" ❌ Could not get target file content for {source_file_path}")
            return False

        print(f" 📖 Target file: {len(target_hierarchy)} sections, {len(target_lines)} lines")

        # Perform source diff to target matching
        print(f" 🔗 Matching source diff to target...")
        enhanced_sections = match_source_diff_to_target(
            source_diff_dict,
            target_hierarchy,
            target_lines,
            ai_client,
            repo_config,
            max_sections,
            AI_MAX_TOKENS
        )

        if not enhanced_sections:
            print(f" ❌ No sections matched")
            return False

        print(f" ✅ Matched {len(enhanced_sections)} sections")

        # Save the match result for reference
        match_file = os.path.join(temp_dir, f"{source_file_path.replace('/', '-').replace('.md', '')}-match_source_diff_to_target.json")
        with open(match_file, 'w', encoding='utf-8') as f:
            json.dump(enhanced_sections, f, ensure_ascii=False, indent=2)
        print(f" 💾 Saved match result to: {match_file}")

        # Step 2: Get AI translation for the matched sections
        print(f" 🤖 Getting AI translation for matched sections...")

        # Create file data structure with enhanced matching info
        # Wrap enhanced_sections in the expected format for process_single_file
        file_data = {
            source_file_path: {
                'type': 'enhanced_sections',
                'sections': enhanced_sections
            }
        }

        # Call the existing process_modified_sections function to get AI translation
        results = process_modified_sections(file_data, file_diff, pr_url, github_client, ai_client, repo_config, max_sections)

        # Step 3: Update match_source_diff_to_target.json with AI results
        if results and len(results) > 0:
            file_path, success, ai_updated_sections = results[0]  # Get first result
            if success and isinstance(ai_updated_sections, dict):
                print(f" 📝 Step 3: Updating {match_file} with AI results...")

                # Load current match_source_diff_to_target.json
                with open(match_file, 'r', encoding='utf-8') as f:
                    match_data = json.load(f)

                # Add target_new_content field to each section based on AI results
                updated_count = 0
                for key, section_data in match_data.items():
                    operation = section_data.get('source_operation', '')

                    if operation == 'deleted':
                        # For deleted sections, set target_new_content to null
                        section_data['target_new_content'] = None
                    elif key in ai_updated_sections:
                        # For modified/added sections with AI translation
                        section_data['target_new_content'] = ai_updated_sections[key]
                        updated_count += 1
                    else:
                        # For sections not translated, keep original content
                        section_data['target_new_content'] = section_data.get('target_content', '')

                # Save updated match_source_diff_to_target.json
                with open(match_file, 'w', encoding='utf-8') as f:
                    json.dump(match_data, f, ensure_ascii=False, indent=2)

                print(f" ✅ Updated {updated_count} sections with AI translations in {match_file}")

                # Step 4: Apply updates to target document using update_target_document_from_match_data
                print(f" 📝 Step 4: Applying updates to target document...")
                from file_updater import update_target_document_from_match_data

                success = update_target_document_from_match_data(match_file, repo_config['target_local_path'], source_file_path)
                if success:
                    print(f" 🎉 Target document successfully updated!")
                    return True
                else:
                    print(f" ❌ Failed to update target document")
                    return False

            else:
                print(f" ⚠️ AI translation failed or returned invalid results")
                return False
        else:
            print(f" ⚠️ No results from process_modified_sections")
            return False

    except Exception as e:
        print(f" ❌ Error processing regular modified file {source_file_path}: {e}")
        return False


high

The function process_regular_modified_file is very large and contains complex logic for processing a single file. This function's logic is tightly coupled with file updating and matching, and it would be better placed in the file_updater.py or section_matcher.py module. Moving this out of main_workflow.py would make the main workflow clearer and improve the overall modularity of the package.

Comment on lines +979 to +1447
def analyze_source_changes(pr_url, github_client, special_files=None, ignore_files=None, repo_configs=None, max_non_system_sections=120, pr_diff=None):
"""Analyze source language changes and categorize them as added, modified, or deleted"""
# Import modules needed in this function
import os
import json
from toc_processor import process_toc_operations

owner, repo, pr_number = parse_pr_url(pr_url)
repository = github_client.get_repo(f"{owner}/{repo}")
pr = repository.get_pull(pr_number)

# Get repository configuration for target repo info
repo_config = get_repo_config(pr_url, repo_configs)

print(f"📋 Processing PR #{pr_number}: {pr.title}")

# Get markdown files
files = pr.get_files()
markdown_files = [f for f in files if f.filename.endswith('.md')]

print(f"📄 Found {len(markdown_files)} markdown files")

# Return dictionaries for different operation types
added_sections = {} # New sections that were added
modified_sections = {} # Existing sections that were modified
deleted_sections = {} # Sections that were deleted
added_files = {} # Completely new files that were added
deleted_files = [] # Completely deleted files
ignored_files = [] # Files that were ignored
toc_files = {} # Special TOC files requiring special processing

for file in markdown_files:
print(f"\n🔍 Analyzing {file.filename}")

# Check if this file should be ignored
if file.filename in ignore_files:
print(f" ⏭️ Skipping ignored file: {file.filename}")
ignored_files.append(file.filename)
continue

# Check if this is a completely new file or deleted file
if file.status == 'added':
print(f" ➕ Detected new file: {file.filename}")
try:
file_content = repository.get_contents(file.filename, ref=pr.head.sha).decoded_content.decode('utf-8')
added_files[file.filename] = file_content
print(f" ✅ Added complete file for translation")
continue
except Exception as e:
print(f" ❌ Error getting new file content: {e}")
continue

elif file.status == 'removed':
print(f" 🗑️ Detected deleted file: {file.filename}")
deleted_files.append(file.filename)
print(f" ✅ Marked file for deletion")
continue

# For modified files, check if it's a special file like TOC.md
try:
file_content = repository.get_contents(file.filename, ref=pr.head.sha).decoded_content.decode('utf-8')
except Exception as e:
print(f" ❌ Error getting content: {e}")
continue

# Check if this is a TOC.md file requiring special processing
if os.path.basename(file.filename) in special_files:
print(f" 📋 Detected special file: {file.filename}")

# Get target file content for comparison
try:
target_repository = github_client.get_repo(repo_config['target_repo'])
target_file_content = target_repository.get_contents(file.filename, ref="master").decoded_content.decode('utf-8')
target_lines = target_file_content.split('\n')
except Exception as e:
print(f" ⚠️ Could not get target file content: {e}")
continue

# Analyze diff operations for TOC.md
operations = analyze_diff_operations(file)
source_lines = file_content.split('\n')

# Process with special TOC logic
toc_results = process_toc_operations(file.filename, operations, source_lines, target_lines, "") # Local path will be determined later

# Store TOC operations for later processing
if any([toc_results['added'], toc_results['modified'], toc_results['deleted']]):
# Combine all operations for processing
all_toc_operations = []
all_toc_operations.extend(toc_results['added'])
all_toc_operations.extend(toc_results['modified'])
all_toc_operations.extend(toc_results['deleted'])

# Add to special TOC processing queue (separate from regular sections)
toc_files[file.filename] = {
'type': 'toc',
'operations': all_toc_operations
}

print(f" 📋 TOC operations queued for processing:")
if toc_results['added']:
print(f" ➕ Added: {len(toc_results['added'])} entries")
if toc_results['modified']:
print(f" ✏️ Modified: {len(toc_results['modified'])} entries")
if toc_results['deleted']:
print(f" ❌ Deleted: {len(toc_results['deleted'])} entries")
else:
print(f" ℹ️ No TOC operations found")

continue # Skip regular processing for TOC files

# Analyze diff operations
operations = analyze_diff_operations(file)
print(f" 📝 Diff analysis: {len(operations['added_lines'])} added, {len(operations['modified_lines'])} modified, {len(operations['deleted_lines'])} deleted lines")

lines = file_content.split('\n')
all_headers = {}

# Track code block state
in_code_block = False
code_block_delimiter = None

# First pass: collect all headers (excluding those in code blocks)
for line_num, line in enumerate(lines, 1):
original_line = line
line = line.strip()

# Check for code block delimiters
if line.startswith('```') or line.startswith('~~~'):
if not in_code_block:
# Entering a code block
in_code_block = True
code_block_delimiter = line[:3]
continue
elif line.startswith(code_block_delimiter):
# Exiting a code block
in_code_block = False
code_block_delimiter = None
continue

# Skip processing if we're inside a code block
if in_code_block:
continue

# Process headers only if not in code block
if line.startswith('#'):
match = re.match(r'^(#{1,10})\s+(.+)', line)
if match:
level = len(match.group(1))
title = match.group(2).strip()
all_headers[line_num] = {
'level': level,
'title': title,
'line': line
}

# Build complete hierarchy from HEAD (after changes)
all_hierarchy_dict = build_hierarchy_dict(file_content)

# For deletion detection, we also need the base file hierarchy
try:
base_file_content = repository.get_contents(file.filename, ref=f"{repository.default_branch}").decoded_content.decode('utf-8')
base_hierarchy_dict = build_hierarchy_dict(base_file_content)
except Exception as e:
print(f" ⚠️ Could not get base file content: {e}")
base_hierarchy_dict = all_hierarchy_dict
base_file_content = file_content # Fallback to current content

# Find sections by operation type with corrected logic
sections_by_type = find_sections_by_operation_type(lines, operations, all_headers, base_hierarchy_dict)

# Prioritize modified headers over added ones (fix for header changes like --host -> --hosts)
modified_header_lines = set()
for modified_line in operations['modified_lines']:
if modified_line['is_header']:
modified_header_lines.add(modified_line['line_number'])

# Remove modified header lines from added set
sections_by_type['added'] = sections_by_type['added'] - modified_header_lines

# Enhanced logic: check for actual content changes within sections
# This helps detect changes in section content (not just headers)
print(f" 🔍 Enhanced detection: checking for actual section content changes...")

# Get only lines that have actual content changes (exclude headers)
real_content_changes = set()

# Added lines (new content, excluding headers)
for added_line in operations['added_lines']:
if not added_line['is_header']:
real_content_changes.add(added_line['line_number'])

# Deleted lines (removed content, excluding headers)
for deleted_line in operations['deleted_lines']:
if not deleted_line['is_header']:
real_content_changes.add(deleted_line['line_number'])

# Modified lines (changed content, excluding headers)
for modified_line in operations['modified_lines']:
if not modified_line['is_header']:
real_content_changes.add(modified_line['line_number'])

print(f" 📝 Real content changes (non-header): {sorted(real_content_changes)}")

# Find sections that contain actual content changes
content_affected_sections = set()
for changed_line in real_content_changes:
# Find which section this changed line belongs to
containing_section = None
for line_num in sorted(all_headers.keys()):
if line_num <= changed_line:
containing_section = line_num
else:
break

if containing_section and containing_section not in sections_by_type['added']:
# Additional check: make sure this is not just a line number shift
# Only add if the change is within reasonable distance from the section header
# AND if the changed line is not part of a completely deleted section header
is_deleted_header = False
for deleted_line in operations['deleted_lines']:
if (deleted_line['is_header'] and
abs(changed_line - deleted_line['line_number']) <= 2):
is_deleted_header = True
print(f" ⚠️ Skipping change at line {changed_line} (deleted header near line {deleted_line['line_number']})")
break

# More precise filtering: check if this change is actually meaningful
# Skip changes that are part of deleted content or line shifts due to deletions
should_include = True

# Skip exact deleted headers
for deleted_line in operations['deleted_lines']:
if (deleted_line['is_header'] and
changed_line == deleted_line['line_number']):
should_include = False
print(f" ⚠️ Skipping change at line {changed_line} (exact deleted header)")
break

# Skip changes that are very close to deleted content AND far from their containing section
# This helps filter out line shift artifacts while keeping real content changes
if should_include:
for deleted_line in operations['deleted_lines']:
# Only skip if both conditions are met:
# 1. Very close to deleted content (within 5 lines)
# 2. The change is far from its containing section (likely a shift artifact)
distance_to_deletion = abs(changed_line - deleted_line['line_number'])
distance_to_section = changed_line - containing_section

if (distance_to_deletion <= 5 and distance_to_section > 100):
should_include = False
print(f" ⚠️ Skipping change at line {changed_line} (likely line shift: {distance_to_deletion} lines from deletion, {distance_to_section} from section)")
break

if should_include and changed_line - containing_section <= 30:
content_affected_sections.add(containing_section)
print(f" 📝 Content change at line {changed_line} affects section at line {containing_section}")
elif should_include:
print(f" ⚠️ Skipping distant change at line {changed_line} from section {containing_section}")

# Add content-modified sections to the modified set, but exclude sections that are already marked as added or deleted
for line_num in content_affected_sections:
if (line_num not in sections_by_type['modified'] and
line_num not in sections_by_type['added'] and
line_num not in sections_by_type['deleted']): # ✅ Critical fix: exclude deleted sections
sections_by_type['modified'].add(line_num)
print(f" 📝 Added content-modified section at line {line_num}")
elif line_num in sections_by_type['deleted']:
print(f" 🚫 Skipping content-modified section at line {line_num}: already marked as deleted")

# Prepare sections data for source_diff_dict
file_modified = {}
file_added = {}
file_deleted = {}

# Build modified sections
for line_num in sections_by_type['modified']:
if line_num in all_hierarchy_dict:
file_modified[str(line_num)] = all_hierarchy_dict[line_num]

# Build added sections
for line_num in sections_by_type['added']:
if line_num in all_hierarchy_dict:
file_added[str(line_num)] = all_hierarchy_dict[line_num]

# Build deleted sections
for line_num in sections_by_type['deleted']:
if line_num in base_hierarchy_dict:
file_deleted[str(line_num)] = base_hierarchy_dict[line_num]

# Check for frontmatter changes (content before first top-level header)
print(f" 🔍 Checking for frontmatter changes...")
frontmatter_changed = False

# Check if any changes occur before the first top-level header
first_header_line = None
for line_num in sorted(all_headers.keys()):
header_info = all_headers[line_num]
if header_info['level'] == 1: # First top-level header
first_header_line = line_num
break

print(f" 📊 First header line: {first_header_line}")
print(f" 📊 Real content changes: {sorted(real_content_changes)}")

if first_header_line:
# Check if any real content changes are before the first header
for line_num in real_content_changes:
#print(f" 🔍 Checking line {line_num} vs first header {first_header_line}")
if line_num < first_header_line:
frontmatter_changed = True
print(f" 📄 Frontmatter change detected: line {line_num} < {first_header_line}")
break

print(f" 📊 Frontmatter changed: {frontmatter_changed}")

if frontmatter_changed:
print(f" 📄 Frontmatter changes detected (before line {first_header_line})")
# Add frontmatter as a special section with line number 0
file_modified["0"] = "frontmatter"
print(f" ✅ Added frontmatter section to modified sections")

# Build source diff dictionary
source_diff_dict = build_source_diff_dict(
file_modified, file_added, file_deleted,
all_hierarchy_dict, base_hierarchy_dict,
operations, file_content, base_file_content
)

# Breakpoint: Output source_diff_dict to file for review with file prefix

# Ensure temp_output directory exists
script_dir = os.path.dirname(os.path.abspath(__file__))
temp_dir = os.path.join(script_dir, "temp_output")
os.makedirs(temp_dir, exist_ok=True)

file_prefix = file.filename.replace('/', '-').replace('.md', '')
output_file = os.path.join(temp_dir, f"{file_prefix}-source-diff-dict.json")
with open(output_file, 'w', encoding='utf-8') as f:
json.dump(source_diff_dict, f, ensure_ascii=False, indent=2)

print(f" 💾 Saved source diff dictionary to: {output_file}")
print(f" 📊 Source diff dictionary contains {len(source_diff_dict)} sections:")
for key, diff_info in source_diff_dict.items():
print(f" {diff_info['operation']}: {key} -> original_hierarchy: {diff_info['original_hierarchy']}")

# source-diff-dict.json generation is complete, continue to next step in main.py

# For modified headers, we need to build a mapping using original titles for matching
original_hierarchy_dict = all_hierarchy_dict.copy()

# Update hierarchy dict to use original content for modified headers when needed for matching
for line_num in sections_by_type['modified']:
if line_num in all_headers:
header_info = all_headers[line_num]
# Check if this header was modified and has original content
for op in operations['modified_lines']:
if (op['is_header'] and
op['line_number'] == line_num and
'original_content' in op):
# Create hierarchy path using original content for matching
original_line = op['original_content'].strip()
if original_line.startswith('#'):
# Build original hierarchy for matching
original_hierarchy = build_hierarchy_for_modified_section(
file_content, line_num, original_line, all_hierarchy_dict)
if original_hierarchy:
original_hierarchy_dict[line_num] = original_hierarchy
break

# Process added sections
if sections_by_type['added']:
file_added = {}
# Find insertion points using the simplified logic:
# Record the previous section hierarchy for each added section
insertion_points = find_previous_section_for_added(sections_by_type['added'], all_hierarchy_dict)

# Get actual content for added sections
for line_num in sections_by_type['added']:
if line_num in all_hierarchy_dict:
file_added[str(line_num)] = all_hierarchy_dict[line_num]

# Get source sections content (actual content, not just hierarchy)
if file_added:
source_sections_content = get_source_sections_content(pr_url, file.filename, file_added, github_client)
file_added = source_sections_content # Replace hierarchy with actual content

if file_added:
added_sections[file.filename] = {
'sections': file_added,
'insertion_points': insertion_points
}
print(f" ➕ Found {len(file_added)} added sections with {len(insertion_points)} insertion points")

# Process modified sections
if sections_by_type['modified']:
file_modified = {}
for line_num in sections_by_type['modified']:
if line_num in original_hierarchy_dict:
file_modified[str(line_num)] = original_hierarchy_dict[line_num]

if file_modified:
modified_sections[file.filename] = {
'sections': file_modified,
'original_hierarchy': original_hierarchy_dict,
'current_hierarchy': all_hierarchy_dict
}
print(f" ✏️ Found {len(file_modified)} modified sections")

# Process deleted sections
if sections_by_type['deleted']:
file_deleted = {}
for line_num in sections_by_type['deleted']:
# Use base hierarchy to get the deleted section info
if line_num in base_hierarchy_dict:
file_deleted[str(line_num)] = base_hierarchy_dict[line_num]

if file_deleted:
deleted_sections[file.filename] = file_deleted
print(f" ❌ Found {len(file_deleted)} deleted sections")

# Enhanced logic: also check content-level changes using legacy detection
# This helps detect changes in section content (not just headers)
print(f" 🔍 Enhanced detection: checking content-level changes...")
changed_lines = get_changed_line_ranges(file)
affected_sections = find_affected_sections(lines, changed_lines, all_headers)

legacy_modified = {}
for line_num in affected_sections:
if line_num in all_hierarchy_dict:
section_hierarchy = all_hierarchy_dict[line_num]
# Only add if not already detected by operation-type analysis
already_detected = False
if file.filename in modified_sections:
for existing_line, existing_hierarchy in modified_sections[file.filename].get('sections', {}).items():
if existing_hierarchy == section_hierarchy:
already_detected = True
break

if not already_detected:
legacy_modified[str(line_num)] = section_hierarchy

if legacy_modified:
print(f" ✅ Enhanced detection found {len(legacy_modified)} additional content-modified sections")
# Merge with existing modified sections
if file.filename in modified_sections:
# Merge the sections
existing_sections = modified_sections[file.filename].get('sections', {})
existing_sections.update(legacy_modified)
modified_sections[file.filename]['sections'] = existing_sections
else:
# Create new entry
modified_sections[file.filename] = {
'sections': legacy_modified,
'original_hierarchy': all_hierarchy_dict,
'current_hierarchy': all_hierarchy_dict
}

print(f"\n📊 Summary:")
#print(f" ✏️ Modified files: {} files")
print(f" 📄 Added files: {len(added_files)} files")
print(f" 🗑️ Deleted files: {len(deleted_files)} files")
print(f" 📋 TOC files: {len(toc_files)} files")
if ignored_files:
print(f" ⏭️ Ignored files: {len(ignored_files)} files")
for ignored_file in ignored_files:
print(f" - {ignored_file}")

return added_sections, modified_sections, deleted_sections, added_files, deleted_files, toc_files


high

The analyze_source_changes function is very long and complex, spanning hundreds of lines. It handles file analysis, diff parsing, hierarchy building, and categorizing changes. To improve readability and maintainability, this function should be refactored into several smaller, more specialized functions. Each smaller function should have a single, well-defined responsibility.

Comment on lines +572 to +689
def match_source_diff_to_target(source_diff_dict, target_hierarchy, target_lines, ai_client, repo_config, max_non_system_sections=120, max_tokens=20000):
    """
    Match source_diff_dict original_hierarchy to target file sections
    Uses direct matching for system variables/config and AI matching for others

    Returns:
        dict: Matched sections with enhanced information including:
            - target_line: Line number in target file
            - target_hierarchy: Target section hierarchy
            - insertion_type: For added sections only
            - source_original_hierarchy: Original hierarchy from source
            - source_operation: Operation type (modified/added/deleted)
            - source_old_content: Old content from source diff
            - source_new_content: New content from source diff
    """
    thread_safe_print(f"🔗 Starting source diff to target matching...")

    # Extract hierarchies from source diff dict
    source_hierarchies = extract_hierarchies_from_diff_dict(source_diff_dict)

    if not source_hierarchies:
        thread_safe_print(f"⚠️ No hierarchies to match")
        return {}

    # Process sections in original order to maintain consistency
    # Initialize final matching results with ordered dict to preserve order
    from collections import OrderedDict
    all_matched_sections = OrderedDict()

    # Categorize sections for processing strategy but maintain order
    direct_match_sections = OrderedDict()
    ai_match_sections = OrderedDict()
    added_sections = OrderedDict()
    bottom_sections = OrderedDict()  # New category for bottom sections

    for key, hierarchy in source_hierarchies.items():
        # Check if this is a bottom section (no matching needed)
        if hierarchy.startswith('bottom-'):
            bottom_sections[key] = hierarchy
        # Check if this is an added section
        elif key.startswith('added_'):
            added_sections[key] = hierarchy
        else:
            # Extract the leaf title from hierarchy for checking
            leaf_title = hierarchy.split(' > ')[-1] if ' > ' in hierarchy else hierarchy

            # Check if this is suitable for direct matching
            if (hierarchy == "frontmatter" or
                    leaf_title.startswith('# ') or  # Top-level titles
                    is_system_variable_or_config(leaf_title)):  # System variables/config
                direct_match_sections[key] = hierarchy
            else:
                ai_match_sections[key] = hierarchy

    thread_safe_print(f"📊 Section categorization:")
    thread_safe_print(f" 🎯 Direct matching: {len(direct_match_sections)} sections")
    thread_safe_print(f" 🤖 AI matching: {len(ai_match_sections)} sections")
    thread_safe_print(f" ➕ Added sections: {len(added_sections)} sections")
    thread_safe_print(f" 🔚 Bottom sections: {len(bottom_sections)} sections (no matching needed)")

    # Process each section in original order
    thread_safe_print(f"\n🔄 Processing sections in original order...")

    for key, hierarchy in source_hierarchies.items():
        thread_safe_print(f" 🔍 Processing {key}: {hierarchy}")

        # Determine processing strategy based on section type and content
        if hierarchy.startswith('bottom-'):
            # Bottom section - no matching needed, append to end
            thread_safe_print(f" 🔚 Bottom section - append to end of document")
            result = {
                "target_line": "-1",  # Special marker for bottom sections
                "target_hierarchy": hierarchy  # Keep the bottom-xxx marker
            }
        elif key.startswith('added_'):
            # Added section - find insertion point
            thread_safe_print(f" ➕ Added section - finding insertion point")
            result = process_added_section(key, hierarchy, target_hierarchy, target_lines, ai_client, repo_config, max_non_system_sections, max_tokens)
        else:
            # Modified or deleted section - find matching section
            operation = source_diff_dict[key].get('operation', 'unknown')
            thread_safe_print(f" {operation.capitalize()} section - finding target match")
            result = process_modified_or_deleted_section(key, hierarchy, target_hierarchy, target_lines, ai_client, repo_config, max_non_system_sections, max_tokens)

        if result:
            # Add source language information from source_diff_dict
            source_info = source_diff_dict.get(key, {})

            # Extract target content from target_lines
            target_line = result.get('target_line', 'unknown')
            target_content = ""
            if target_line != 'unknown' and target_line != '0':
                try:
                    target_line_num = int(target_line)
                    # For ALL operations, only extract direct content (no sub-sections)
                    # This avoids duplication when both parent and child sections have operations
                    target_content = extract_section_direct_content(target_line_num, target_lines)
                except (ValueError, IndexError):
                    target_content = ""
            elif target_line == '0':
                # For frontmatter, extract content from beginning to first header
                target_content = extract_frontmatter_content(target_lines)

            enhanced_result = {
                **result,  # Include existing target matching info
                'target_content': target_content,  # Add target section content
                'source_original_hierarchy': source_info.get('original_hierarchy', ''),
                'source_operation': source_info.get('operation', ''),
                'source_old_content': source_info.get('old_content', ''),
                'source_new_content': source_info.get('new_content', '')
            }
            all_matched_sections[key] = enhanced_result
            thread_safe_print(f" ✅ {key}: -> line {target_line}")
        else:
            thread_safe_print(f" ❌ {key}: matching failed")

    thread_safe_print(f"\n📊 Final matching results: {len(all_matched_sections)} total matches")
    return all_matched_sections


high

The match_source_diff_to_target function is very large and complex. It orchestrates direct matching, AI matching, and processes different types of sections. This makes it difficult to understand and maintain. Please consider breaking this function down into smaller, more focused helper functions to improve its structure and readability.
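
As one hedged example, the per-section strategy decision could be factored out so the main loop only orchestrates (the helper name below is hypothetical):

```python
def _matching_strategy(key, hierarchy):
    """Return which matching path a source section should take."""
    if hierarchy.startswith('bottom-'):
        return 'bottom'   # appended to the end of the document, no matching needed
    if key.startswith('added_'):
        return 'added'    # needs an insertion point
    leaf = hierarchy.split(' > ')[-1] if ' > ' in hierarchy else hierarchy
    if hierarchy == "frontmatter" or leaf.startswith('# ') or is_system_variable_or_config(leaf):
        return 'direct'   # frontmatter, top-level titles, system variables/config
    return 'ai'           # everything else goes through AI matching
```

match_source_diff_to_target would then reduce to a loop that looks up the strategy, calls the corresponding handler, and merges the source_diff_dict metadata into the result.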

Comment on lines +16 to +18
def thread_safe_print(*args, **kwargs):
    with print_lock:
        print(*args, **kwargs)


medium

The thread_safe_print function is duplicated across multiple files in this package (including file_deleter.py, file_updater.py, main_workflow.py, etc.). To improve maintainability and adhere to the DRY (Don't Repeat Yourself) principle, this function should be defined once in a shared utility module and imported where needed.
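
A minimal shared helper, say in a hypothetical scripts/translate_doc_pr/utils.py, might look like this; each module would then drop its local copy and import it instead:

```python
# utils.py (hypothetical shared module)
import threading

_print_lock = threading.Lock()


def thread_safe_print(*args, **kwargs):
    """Serialize console output so messages from worker threads do not interleave."""
    with _print_lock:
        print(*args, **kwargs)
```

Callers would switch to `from utils import thread_safe_print` (or a relative import once the package imports are fixed).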

Comment on lines +20 to +73
def create_section_batches(file_content, max_lines_per_batch=200):
    """Create batches of file content for translation, respecting section boundaries"""
    lines = file_content.split('\n')

    # Find all section headers
    section_starts = []
    for i, line in enumerate(lines):
        line = line.strip()
        if line.startswith('#'):
            match = re.match(r'^(#{1,10})\s+(.+)', line)
            if match:
                section_starts.append(i + 1)  # 1-based line numbers

    # If no sections found, just batch by line count
    if not section_starts:
        batches = []
        for i in range(0, len(lines), max_lines_per_batch):
            batch_lines = lines[i:i + max_lines_per_batch]
            batches.append('\n'.join(batch_lines))
        return batches

    # Create batches respecting section boundaries
    batches = []
    current_batch_start = 0

    for i, section_start in enumerate(section_starts):
        section_start_idx = section_start - 1  # Convert to 0-based

        # Check if adding this section would exceed the line limit
        if (section_start_idx - current_batch_start) > max_lines_per_batch:
            # Close current batch at the previous section boundary
            if current_batch_start < section_start_idx:
                batch_lines = lines[current_batch_start:section_start_idx]
                batches.append('\n'.join(batch_lines))
                current_batch_start = section_start_idx

        # If this is the last section, or the next section would create a batch too large
        if i == len(section_starts) - 1:
            # Add remaining content as final batch
            batch_lines = lines[current_batch_start:]
            batches.append('\n'.join(batch_lines))
        else:
            next_section_start = section_starts[i + 1] - 1  # 0-based
            if (next_section_start - current_batch_start) > max_lines_per_batch:
                # Close current batch at current section boundary
                batch_lines = lines[current_batch_start:section_start_idx]
                if batch_lines:  # Only add non-empty batches
                    batches.append('\n'.join(batch_lines))
                current_batch_start = section_start_idx

    # Clean up any empty batches
    batches = [batch for batch in batches if batch.strip()]

    return batches


medium

The logic in create_section_batches is quite complex and hard to follow, especially with the lookahead checks for the next section. This complexity increases the risk of bugs and makes the function difficult to maintain.

Consider refactoring this to a simpler, more straightforward algorithm. A simpler approach, sketched after this list, could be:

  1. Initialize an empty current_batch and a list of batches.
  2. Iterate through the sections (or lines if no sections).
  3. If adding the current section/line to current_batch does not exceed max_lines_per_batch, add it.
  4. Otherwise, add current_batch to batches, and start a new current_batch with the current section/line.
  5. After the loop, add the final current_batch if it's not empty.
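
A minimal sketch of that accumulation approach, assuming the same header-based section split as the current code (a single section longer than the limit simply stays in its own oversized batch):

```python
import re

def create_section_batches(file_content, max_lines_per_batch=200):
    """Pack whole sections into batches; start a new batch when the limit would be exceeded."""
    lines = file_content.split('\n')
    header_idx = [i for i, line in enumerate(lines)
                  if re.match(r'^#{1,10}\s+\S', line.strip())]

    # Cut points: start of file, every header, end of file.
    cuts = sorted({0, *header_idx, len(lines)})
    sections = [lines[a:b] for a, b in zip(cuts, cuts[1:])]

    batches, current = [], []
    for section in sections:
        if current and len(current) + len(section) > max_lines_per_batch:
            batches.append('\n'.join(current))
            current = []
        current.extend(section)
    if current:
        batches.append('\n'.join(current))
    return [batch for batch in batches if batch.strip()]
```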

Comment on lines +101 to +121
    try:
        from main import print_token_estimation
        print_token_estimation(prompt, "File addition translation")
    except ImportError:
        # Fallback if import fails - use tiktoken
        try:
            import tiktoken
            enc = tiktoken.get_encoding("cl100k_base")
            tokens = enc.encode(prompt)
            actual_tokens = len(tokens)
            char_count = len(prompt)
            print(f"   💰 File addition translation")
            print(f"      📝 Input: {char_count:,} characters")
            print(f"      🔢 Actual tokens: {actual_tokens:,} (using tiktoken cl100k_base)")
        except Exception:
            # Final fallback to character approximation
            estimated_tokens = len(prompt) // 4
            char_count = len(prompt)
            print(f"   💰 File addition translation")
            print(f"      📝 Input: {char_count:,} characters")
            print(f"      🔢 Estimated tokens: ~{estimated_tokens:,} (fallback: 4 chars/token approximation)")


medium

This block for token estimation is duplicated in several modules (file_updater.py, section_matcher.py, toc_processor.py). This duplicated code should be refactored into a single utility function in a shared module to avoid redundancy and make future changes easier.
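
One way to centralize it, mirroring the existing main_workflow.print_token_estimation (a sketch only; the shared `utils.py` module is hypothetical and the exact log wording is illustrative):

```python
# scripts/translate_doc_pr/utils.py (hypothetical shared module)
def print_token_estimation(prompt, label):
    """Log the input size of an AI call, using tiktoken when available."""
    char_count = len(prompt)
    try:
        import tiktoken
        enc = tiktoken.get_encoding("cl100k_base")
        token_line = f"Actual tokens: {len(enc.encode(prompt)):,} (using tiktoken cl100k_base)"
    except Exception:
        token_line = f"Estimated tokens: ~{char_count // 4:,} (fallback: 4 chars/token approximation)"
    print(f"   💰 {label}")
    print(f"      📝 Input: {char_count:,} characters")
    print(f"      🔢 {token_line}")
```

Callers then replace the whole try/except block with a single `print_token_estimation(prompt, "File addition translation")`.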

Comment on lines +251 to +273
def get_pr_diff(pr_url, github_client):
    """Get the diff content from a GitHub PR (from auto-sync-pr-changes.py)"""
    try:
        from pr_analyzer import parse_pr_url
        owner, repo, pr_number = parse_pr_url(pr_url)
        repository = github_client.get_repo(f"{owner}/{repo}")
        pr = repository.get_pull(pr_number)

        # Get files and their patches
        files = pr.get_files()
        diff_content = []

        for file in files:
            if file.filename.endswith('.md') and file.patch:
                diff_content.append(f"File: {file.filename}")
                diff_content.append(file.patch)
                diff_content.append("-" * 80)

        return "\n".join(diff_content)

    except Exception as e:
        thread_safe_print(f"   ❌ Error getting PR diff: {e}")
        return None


medium

The get_pr_diff function is duplicated in pr_analyzer.py. To adhere to the DRY (Don't Repeat Yourself) principle, this function should be defined in only one place (for example, in pr_analyzer.py) and imported here.

Comment on lines +20 to +37
def clean_title_for_matching(title):
    """Clean title for matching by removing markdown formatting and span elements"""
    if not title:
        return ""

    # Remove span elements like <span class="version-mark">New in v5.0</span>
    title = re.sub(r'<span[^>]*>.*?</span>', '', title)

    # Remove markdown header prefix (# ## ### etc.)
    title = re.sub(r'^#{1,6}\s*', '', title.strip())

    # Remove backticks
    title = title.replace('`', '')

    # Strip whitespace
    title = title.strip()

    return title


medium

The clean_title_for_matching function is duplicated from pr_analyzer.py. This and other shared utility functions should be centralized in a single utility module to avoid code duplication and ensure consistency.

@qiancai qiancai merged commit ab90fca into master Sep 25, 2025
6 of 7 checks passed

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 19

🧹 Nitpick comments (30)
.github/workflows/sync-docs-cn-to-en.yml (1)

46-70: Harden PR URL parsing.

cut -f7 is brittle (fails with trailing paths like /files). Use regex to extract owner/repo/PR number.

Apply:

       - name: Extract PR information
         id: extract_info
         run: |
           # Extract source repo info
           SOURCE_URL="${{ github.event.inputs.source_pr_url }}"
-          SOURCE_OWNER=$(echo $SOURCE_URL | cut -d'/' -f4)
-          SOURCE_REPO=$(echo $SOURCE_URL | cut -d'/' -f5)
-          SOURCE_PR=$(echo $SOURCE_URL | cut -d'/' -f7)
+          SOURCE_OWNER=$(echo "$SOURCE_URL" | sed -E 's#https?://github\.com/([^/]+)/([^/]+)/pull/([0-9]+).*#\1#')
+          SOURCE_REPO=$(echo "$SOURCE_URL" | sed -E 's#https?://github\.com/([^/]+)/([^/]+)/pull/([0-9]+).*#\2#')
+          SOURCE_PR=$(echo "$SOURCE_URL" | sed -E 's#https?://github\.com/[^/]+/[^/]+/pull/([0-9]+).*#\1#')
           
           # Extract target repo info
           TARGET_URL="${{ github.event.inputs.target_pr_url }}"
-          TARGET_OWNER=$(echo $TARGET_URL | cut -d'/' -f4)
-          TARGET_REPO=$(echo $TARGET_URL | cut -d'/' -f5)
-          TARGET_PR=$(echo $TARGET_URL | cut -d'/' -f7)
+          TARGET_OWNER=$(echo "$TARGET_URL" | sed -E 's#https?://github\.com/([^/]+)/([^/]+)/pull/([0-9]+).*#\1#')
+          TARGET_REPO=$(echo "$TARGET_URL" | sed -E 's#https?://github\.com/([^/]+)/([^/]+)/pull/([0-9]+).*#\2#')
+          TARGET_PR=$(echo "$TARGET_URL" | sed -E 's#https?://github\.com/[^/]+/[^/]+/pull/([0-9]+).*#\1#')
scripts/translate_doc_pr/file_adder.py (5)

8-12: Remove unused imports.

json, Github, OpenAI are unused here.

-import json
 import threading
-from github import Github
-from openai import OpenAI

20-74: Fix batching to respect section boundaries and max_lines reliably.

The current logic can drop content and create empty batches. Pack whole section ranges into batches first, then split any batch that still exceeds the maximum.

-def create_section_batches(file_content, max_lines_per_batch=200):
-    """Create batches of file content for translation, respecting section boundaries"""
-    lines = file_content.split('\n')
-    
-    # Find all section headers
-    section_starts = []
-    for i, line in enumerate(lines):
-        line = line.strip()
-        if line.startswith('#'):
-            match = re.match(r'^(#{1,10})\s+(.+)', line)
-            if match:
-                section_starts.append(i + 1)  # 1-based line numbers
-    
-    # If no sections found, just batch by line count
-    if not section_starts:
-        batches = []
-        for i in range(0, len(lines), max_lines_per_batch):
-            batch_lines = lines[i:i + max_lines_per_batch]
-            batches.append('\n'.join(batch_lines))
-        return batches
-    
-    # Create batches respecting section boundaries
-    batches = []
-    current_batch_start = 0
-    
-    for i, section_start in enumerate(section_starts):
-        section_start_idx = section_start - 1  # Convert to 0-based
-        
-        # Check if adding this section would exceed the line limit
-        if (section_start_idx - current_batch_start) > max_lines_per_batch:
-            # Close current batch at the previous section boundary
-            if current_batch_start < section_start_idx:
-                batch_lines = lines[current_batch_start:section_start_idx]
-                batches.append('\n'.join(batch_lines))
-                current_batch_start = section_start_idx
-        
-        # If this is the last section, or the next section would create a batch too large
-        if i == len(section_starts) - 1:
-            # Add remaining content as final batch
-            batch_lines = lines[current_batch_start:]
-            batches.append('\n'.join(batch_lines))
-        else:
-            next_section_start = section_starts[i + 1] - 1  # 0-based
-            if (next_section_start - current_batch_start) > max_lines_per_batch:
-                # Close current batch at current section boundary
-                batch_lines = lines[current_batch_start:section_start_idx]
-                if batch_lines:  # Only add non-empty batches
-                    batches.append('\n'.join(batch_lines))
-                current_batch_start = section_start_idx
-    
-    # Clean up any empty batches
-    batches = [batch for batch in batches if batch.strip()]
-    
-    return batches
+def create_section_batches(file_content, max_lines_per_batch=200):
+    """Batch content by Markdown sections, then enforce max_lines_per_batch."""
+    lines = file_content.split('\n')
+    header_idx = [i for i, line in enumerate(lines) if re.match(r'^\#{1,10}\s+\S', line.strip())]
+
+    # No headers: simple chunking
+    if not header_idx:
+        return [
+            '\n'.join(chunk)
+            for i in range(0, len(lines), max_lines_per_batch)
+            if (chunk := lines[i:i + max_lines_per_batch])
+        ]
+
+    # Build section slices
+    sections = []
+    for i, start in enumerate(header_idx):
+        end = header_idx[i + 1] if i + 1 < len(header_idx) else len(lines)
+        sections.append(lines[start:end])
+
+    # Pack sections into batches
+    batches, cur, cur_len = [], [], 0
+    for sec in sections:
+        sec_len = len(sec)
+        if cur and cur_len + sec_len > max_lines_per_batch:
+            batches.append('\n'.join(cur))
+            cur, cur_len = [], 0
+        cur.extend(sec)
+        cur_len += sec_len
+    if cur:
+        batches.append('\n'.join(cur))
+
+    # Split any oversized batch (single very large section)
+    final_batches = []
+    for batch in batches:
+        blines = batch.split('\n')
+        if len(blines) <= max_lines_per_batch:
+            if batch.strip():
+                final_batches.append(batch)
+        else:
+            for i in range(0, len(blines), max_lines_per_batch):
+                chunk = '\n'.join(blines[i:i + max_lines_per_batch]).strip()
+                if chunk:
+                    final_batches.append(chunk)
+    return final_batches

112-121: Drop f-prefix where no interpolation occurs.

Cleans up lints (F541) and keeps output unchanged.

-            print(f"   💰 File addition translation")
-            print(f"      📝 Input: {char_count:,} characters")
-            print(f"      🔢 Actual tokens: {actual_tokens:,} (using tiktoken cl100k_base)")
+            print("   💰 File addition translation")
+            print(f"      📝 Input: {char_count:,} characters")
+            print(f"      🔢 Actual tokens: {actual_tokens:,} (using tiktoken cl100k_base)")
@@
-            print(f"   💰 File addition translation")
-            print(f"      📝 Input: {char_count:,} characters")
-            print(f"      🔢 Estimated tokens: ~{estimated_tokens:,} (fallback: 4 chars/token approximation)")
+            print("   💰 File addition translation")
+            print(f"      📝 Input: {char_count:,} characters")
+            print(f"      🔢 Estimated tokens: ~{estimated_tokens:,} (fallback: 4 chars/token approximation)")
@@
-    thread_safe_print(f"\n✅ Completed processing all new files")
+    thread_safe_print("\n✅ Completed processing all new files")

Also applies to: 128-129, 193-193


135-140: Rename unused params to signal intent.

pr_url and github_client are unused.

-def process_added_files(added_files, pr_url, github_client, ai_client, repo_config):
+def process_added_files(added_files, _pr_url, _github_client, ai_client, repo_config):

131-134: Avoid blind excepts; narrow or log with type.

Catching bare Exception hides actionable errors.

Consider catching specific exceptions (e.g., httpx.HTTPError from AI client; OSError for file writes), or at least include exception type:

except Exception as e:
    thread_safe_print(f"   ❌ Batch translation failed: {type(e).__name__}: {e}")

Also applies to: 190-191

scripts/translate_doc_pr/file_deleter.py (4)

8-8: Remove unused import.

Github is not used.

-from github import Github

17-22: Rename unused param to underscore.

github_client isn’t used.

-def process_deleted_files(deleted_files, github_client, repo_config):
+def process_deleted_files(deleted_files, _github_client, repo_config):

38-39: Avoid blind except; log type.

Narrow to OSError or include type name in message.

-            except Exception as e:
-                thread_safe_print(f"   ❌ Error deleting file {target_file_path}: {e}")
+            except OSError as e:
+                thread_safe_print(f"   ❌ Error deleting file {target_file_path}: {type(e).__name__}: {e}")

43-44: Drop f-prefix without interpolation.

-    thread_safe_print(f"\n✅ Completed processing deleted files")
+    thread_safe_print("\n✅ Completed processing deleted files")
scripts/translate_doc_pr/main_workflow.py (6)

612-615: Remove duplicate re-imports.

process_toc_files and process_modified_sections are already imported at top.

-    # Import necessary functions
-    from file_updater import process_modified_sections, update_target_document_from_match_data
-    from toc_processor import process_toc_files
+    # Functions already imported at module top; no re-import needed

528-533: Drop redundant local import; parse_pr_url already imported.

-def get_workflow_repo_config(pr_url, repo_configs):
+def get_workflow_repo_config(pr_url, repo_configs):
     """Get repository configuration for workflow environment"""
-    from pr_analyzer import parse_pr_url
-    
     owner, repo, pr_number = parse_pr_url(pr_url)

115-122: Unify logging and remove f-prefix without placeholders.

Use thread_safe_print consistently and avoid f-strings without interpolation.

-            print(f"🧹 Cleaned existing temp_output directory")
+            thread_safe_print("🧹 Cleaned existing temp_output directory")
@@
-            print(f"🧹 Removed existing temp_output file")
+            thread_safe_print("🧹 Removed existing temp_output file")
@@
-    print(f"📁 Created temp_output directory: {temp_dir}")
+    thread_safe_print(f"📁 Created temp_output directory: {temp_dir}")

423-431: Prefer local target file content over GitHub API master ref.

Using remote master can mismatch the checked-out PR head branch and degrade matching accuracy. Read from repo_config['target_local_path'] instead, falling back to API if missing.

I can provide a patch if you confirm local path layout mirrors target repo paths.
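
A hedged sketch of the local-first read (the helper name and the `github_fallback` callable are illustrative; `target_local_path` is the repo_config key already used in this PR):

```python
import os

def read_target_file(repo_config, rel_path, github_fallback=None):
    """Prefer the checked-out PR head branch on disk; fall back to the API only if the file is missing."""
    local_path = os.path.join(repo_config['target_local_path'], rel_path)
    if os.path.isfile(local_path):
        with open(local_path, encoding='utf-8') as f:
            return f.read()
    return github_fallback(rel_path) if github_fallback else None
```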


251-274: Duplicate get_pr_diff logic; reuse pr_analyzer.get_pr_diff.

Avoid maintaining two implementations.

Option:

  • Remove local get_pr_diff.
  • Import and call pr_analyzer.get_pr_diff at the top.

Minimal change at call site:

-    pr_diff = get_pr_diff(SOURCE_PR_URL, github_client)
+    from pr_analyzer import get_pr_diff as analyzer_get_pr_diff
+    pr_diff = analyzer_get_pr_diff(SOURCE_PR_URL, github_client)

Also applies to: 592-596


56-57: Message text: avoid long strings in exceptions.

Minor: prefer concise exception messages or custom exception classes.

Also applies to: 159-160
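
For example, a tiny exception class keeps the raise sites short (names are illustrative, not part of this PR):

```python
class WorkflowConfigError(RuntimeError):
    """Raised when a required setting for the translation workflow is missing."""

def require_setting(name, value):
    if not value:
        raise WorkflowConfigError(name)  # the class carries the context; the message stays short
    return value
```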

scripts/translate_doc_pr/toc_processor.py (3)

10-12: Remove unused imports.

Github and OpenAI are not used.

-from github import Github
-from openai import OpenAI

241-241: Drop f-prefix where no interpolation occurs.

Cleans up F541 lints.

-        thread_safe_print(f"   ⏭️  No TOC lines need translation")
+        thread_safe_print("   ⏭️  No TOC lines need translation")
@@
-            print(f"   💰 TOC translation")
+            print("   💰 TOC translation")
@@
-            print(f"   💰 TOC translation")
+            print("   💰 TOC translation")
@@
-        thread_safe_print(f"   📝 AI translation response received")
+        thread_safe_print("   📝 AI translation response received")
@@
-    thread_safe_print(f"   ✅ All TOC files processed")
+    thread_safe_print("   ✅ All TOC files processed")

Also applies to: 286-286, 293-295, 303-304, 434-434


90-91: Rename unused params to underscore; avoid blind except.

  • file_path/target_local_path/pr_url/github_client are unused.
  • Catch specific exceptions or include type name in logs.
-def process_toc_operations(file_path, operations, source_lines, target_lines, target_local_path):
+def process_toc_operations(_file_path, operations, source_lines, target_lines, _target_local_path):
@@
-def process_toc_files(toc_files, pr_url, github_client, ai_client, repo_config):
+def process_toc_files(toc_files, _pr_url, _github_client, ai_client, repo_config):

And adjust excepts:

except Exception as e:
    thread_safe_print(f"   ❌ AI translation failed: {type(e).__name__}: {e}")

Also applies to: 421-423, 332-334

scripts/translate_doc_pr/pr_analyzer.py (8)

1-1: Remove non-executable shebang or mark file executable.

The shebang is present but the file isn’t executable in source control. Prefer removing it to avoid EXE001 lint, or ensure the file mode is +x.

Apply this diff to remove the shebang:

-#!/usr/bin/env python3

22-26: Harden PR URL parsing.

Splitting on "/" is brittle (trailing slashes, query params). Use a regex or urllib to reliably extract owner/repo/number.

Apply this diff:

 def parse_pr_url(pr_url):
-    """Parse PR URL to get repo info"""
-    parts = pr_url.split('/')
-    return parts[-4], parts[-3], int(parts[-1])  # owner, repo, pr_number
+    """Parse PR URL to get owner, repo, and PR number robustly"""
+    import re
+    m = re.search(r'https?://[^/]+/([^/]+)/([^/]+)/pull/(\d+)', pr_url.rstrip('/'))
+    if not m:
+        raise ValueError(f"Invalid PR URL: {pr_url}")
+    owner, repo, pr_number = m.group(1), m.group(2), int(m.group(3))
+    return owner, repo, pr_number

41-63: Narrow exception handling and use thread-safe logging.

Catching Exception swallows actionable errors; prefer GithubException and use thread_safe_print for consistency.

Apply this diff:

-from github import Github
+from github import Github
+from github.GithubException import GithubException
@@
 def get_pr_diff(pr_url, github_client):
@@
-    except Exception as e:
-        print(f"   ❌ Error getting PR diff: {e}")
+    except GithubException as e:
+        thread_safe_print(f"   ❌ GitHub API error getting PR diff: {e}")
+        return None
+    except Exception as e:
+        thread_safe_print(f"   ❌ Unexpected error getting PR diff: {e}")
         return None

228-289: Remove unused locals to reduce noise (Ruff F841).

original_line and title are never used.

Apply this diff:

-        original_line = line
         line = line.strip()
@@
-            if match:
-                level = len(match.group(1))
-                title = match.group(2).strip()
+            if match:
+                level = len(match.group(1))

742-744: Remove unused cross-import to avoid circular dependency risks.

This import isn’t used and introduces an unnecessary coupling.

Apply this diff:

-    from section_matcher import clean_title_for_matching

1311-1319: Gate local artifact writes behind an env flag and use a temp directory.

Always writing to the repo path is noisy and brittle in CI. Make it opt-in and write to a temp dir by default.

Apply this diff:

-        script_dir = os.path.dirname(os.path.abspath(__file__))
-        temp_dir = os.path.join(script_dir, "temp_output")
-        os.makedirs(temp_dir, exist_ok=True)
-        
-        file_prefix = file.filename.replace('/', '-').replace('.md', '')
-        output_file = os.path.join(temp_dir, f"{file_prefix}-source-diff-dict.json")
-        with open(output_file, 'w', encoding='utf-8') as f:
-            json.dump(source_diff_dict, f, ensure_ascii=False, indent=2)
-        
-        print(f"   💾 Saved source diff dictionary to: {output_file}")
+        if os.getenv("WRITE_SOURCE_DIFF", "0") == "1":
+            out_dir = os.getenv("SOURCE_DIFF_OUTDIR", os.path.join(tempfile.gettempdir(), "translate_doc_pr"))
+            os.makedirs(out_dir, exist_ok=True)
+            file_prefix = file.filename.replace('/', '-').replace('.md', '')
+            output_file = os.path.join(out_dir, f"{file_prefix}-source-diff-dict.json")
+            with open(output_file, 'w', encoding='utf-8') as f:
+                json.dump(source_diff_dict, f, ensure_ascii=False, indent=2)
+            thread_safe_print(f"   💾 Saved source diff dictionary to: {output_file}")

Note: add import tempfile at the top.


982-984: Remove redundant re-imports.

os and json are already imported at module scope.

Apply this diff:

-    import os
-    import json

671-690: Deduplicate helpers across modules.

clean_title_for_matching is duplicated in section_matcher.py. Extract into a shared utils module to avoid drift.

scripts/translate_doc_pr/section_matcher.py (3)

6-12: Remove unused imports.

Github, OpenAI, os, json aren’t used here.

Apply this diff:

-import os
 import re
-import json
 import threading
-from github import Github
-from openai import OpenAI

430-456: Deduplicate build_hierarchy_path with pr_analyzer.

Both modules implement identical logic. Extract into a shared utility to avoid divergence.


16-19: Unify logging with pr_analyzer.

Duplicate thread_safe_print exists across modules. Consider centralizing in a utils module.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7dbc37b and 88d9864.

📒 Files selected for processing (10)
  • .github/workflows/sync-docs-cn-to-en.yml (1 hunks)
  • scripts/translate_doc_pr/__init__.py (1 hunks)
  • scripts/translate_doc_pr/file_adder.py (1 hunks)
  • scripts/translate_doc_pr/file_deleter.py (1 hunks)
  • scripts/translate_doc_pr/file_updater.py (1 hunks)
  • scripts/translate_doc_pr/main_workflow.py (1 hunks)
  • scripts/translate_doc_pr/pr_analyzer.py (1 hunks)
  • scripts/translate_doc_pr/requirements.txt (1 hunks)
  • scripts/translate_doc_pr/section_matcher.py (1 hunks)
  • scripts/translate_doc_pr/toc_processor.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (7)
scripts/translate_doc_pr/__init__.py (1)
scripts/translate_doc_pr/main_workflow.py (1)
  • main (544-688)
scripts/translate_doc_pr/file_deleter.py (6)
scripts/translate_doc_pr/file_adder.py (1)
  • thread_safe_print (16-18)
scripts/translate_doc_pr/file_updater.py (1)
  • thread_safe_print (17-19)
scripts/translate_doc_pr/main_workflow.py (1)
  • thread_safe_print (94-96)
scripts/translate_doc_pr/pr_analyzer.py (1)
  • thread_safe_print (16-19)
scripts/translate_doc_pr/section_matcher.py (1)
  • thread_safe_print (16-18)
scripts/translate_doc_pr/toc_processor.py (1)
  • thread_safe_print (16-18)
scripts/translate_doc_pr/file_adder.py (1)
scripts/translate_doc_pr/main_workflow.py (4)
  • thread_safe_print (94-96)
  • main (544-688)
  • print_token_estimation (137-144)
  • chat_completion (165-198)
scripts/translate_doc_pr/section_matcher.py (2)
scripts/translate_doc_pr/main_workflow.py (4)
  • thread_safe_print (94-96)
  • main (544-688)
  • print_token_estimation (137-144)
  • chat_completion (165-198)
scripts/translate_doc_pr/pr_analyzer.py (5)
  • thread_safe_print (16-19)
  • clean_title_for_matching (672-689)
  • build_hierarchy_path (291-316)
  • extract_section_direct_content (454-479)
  • extract_frontmatter_content (481-494)
scripts/translate_doc_pr/toc_processor.py (1)
scripts/translate_doc_pr/main_workflow.py (4)
  • thread_safe_print (94-96)
  • main (544-688)
  • print_token_estimation (137-144)
  • chat_completion (165-198)
scripts/translate_doc_pr/main_workflow.py (6)
scripts/translate_doc_pr/pr_analyzer.py (6)
  • analyze_source_changes (979-1447)
  • get_repo_config (27-39)
  • get_target_hierarchy_and_content (603-616)
  • parse_pr_url (22-25)
  • thread_safe_print (16-19)
  • get_pr_diff (41-62)
scripts/translate_doc_pr/file_adder.py (2)
  • process_added_files (135-193)
  • thread_safe_print (16-18)
scripts/translate_doc_pr/file_deleter.py (2)
  • process_deleted_files (17-43)
  • thread_safe_print (13-15)
scripts/translate_doc_pr/file_updater.py (3)
  • process_modified_sections (574-605)
  • thread_safe_print (17-19)
  • update_target_document_from_match_data (1196-1357)
scripts/translate_doc_pr/toc_processor.py (2)
  • process_toc_files (421-434)
  • thread_safe_print (16-18)
scripts/translate_doc_pr/section_matcher.py (2)
  • match_source_diff_to_target (572-689)
  • thread_safe_print (16-18)
scripts/translate_doc_pr/file_updater.py (4)
scripts/translate_doc_pr/main_workflow.py (4)
  • thread_safe_print (94-96)
  • main (544-688)
  • print_token_estimation (137-144)
  • chat_completion (165-198)
scripts/translate_doc_pr/pr_analyzer.py (6)
  • thread_safe_print (16-19)
  • find_section_boundaries (361-400)
  • get_target_hierarchy_and_content (603-616)
  • clean_title_for_matching (672-689)
  • build_hierarchy_dict (228-289)
  • extract_affected_sections (497-514)
scripts/translate_doc_pr/section_matcher.py (9)
  • thread_safe_print (16-18)
  • find_direct_matches_for_special_files (138-298)
  • filter_non_system_sections (300-316)
  • get_corresponding_sections (318-379)
  • is_system_variable_or_config (39-80)
  • clean_title_for_matching (20-37)
  • parse_ai_response (381-397)
  • find_matching_line_numbers (399-428)
  • map_insertion_points_to_target (457-551)
scripts/translate_doc_pr/toc_processor.py (2)
  • thread_safe_print (16-18)
  • process_toc_file (332-419)
🪛 Ruff (0.13.1)
scripts/translate_doc_pr/__init__.py

1-1: Shebang is present but file is not executable

(EXE001)

scripts/translate_doc_pr/file_deleter.py

17-17: Unused function argument: github_client

(ARG001)


38-38: Do not catch blind exception: Exception

(BLE001)


43-43: f-string without any placeholders

Remove extraneous f prefix

(F541)

scripts/translate_doc_pr/file_adder.py

112-112: f-string without any placeholders

Remove extraneous f prefix

(F541)


115-115: Do not catch blind exception: Exception

(BLE001)


119-119: f-string without any placeholders

Remove extraneous f prefix

(F541)


128-128: f-string without any placeholders

Remove extraneous f prefix

(F541)


129-129: Consider moving this statement to an else block

(TRY300)


131-131: Do not catch blind exception: Exception

(BLE001)


135-135: Unused function argument: pr_url

(ARG001)


135-135: Unused function argument: github_client

(ARG001)


190-190: Do not catch blind exception: Exception

(BLE001)


193-193: f-string without any placeholders

Remove extraneous f prefix

(F541)

scripts/translate_doc_pr/section_matcher.py

88-88: f-string without any placeholders

Remove extraneous f prefix

(F541)


122-122: f-string without any placeholders

Remove extraneous f prefix

(F541)


138-138: Unused function argument: target_hierarchy

(ARG001)


169-169: f-string without any placeholders

Remove extraneous f prefix

(F541)


235-235: f-string without any placeholders

Remove extraneous f prefix

(F541)


242-242: f-string without any placeholders

Remove extraneous f prefix

(F541)


294-294: String contains ambiguous (INFORMATION SOURCE). Did you mean i (LATIN SMALL LETTER I)?

(RUF001)


326-332: Possible SQL injection vector through string-based query construction

(S608)


335-335: f-string without any placeholders

Remove extraneous f prefix

(F541)


337-337: f-string without any placeholders

Remove extraneous f prefix

(F541)


354-354: Do not catch blind exception: Exception

(BLE001)


371-371: f-string without any placeholders

Remove extraneous f prefix

(F541)


372-372: f-string without any placeholders

Remove extraneous f prefix

(F541)


374-374: f-string without any placeholders

Remove extraneous f prefix

(F541)


376-376: Consider moving this statement to an else block

(TRY300)


377-377: Do not catch blind exception: Exception

(BLE001)


430-430: Unused function argument: lines

(ARG001)


457-457: Unused function argument: file_path

(ARG001)


457-457: Unused function argument: pr_url

(ARG001)


457-457: Unused function argument: github_client

(ARG001)


480-480: Unpacked variable failed_matches is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


480-480: Unpacked variable skipped_sections is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


486-486: Prefer next(iter(matched_dict.keys())) over single element slice

Replace with next(iter(matched_dict.keys()))

(RUF015)


533-533: Prefer next(iter(ai_matched.keys())) over single element slice

Replace with next(iter(ai_matched.keys()))

(RUF015)


587-587: f-string without any placeholders

Remove extraneous f prefix

(F541)


593-593: f-string without any placeholders

Remove extraneous f prefix

(F541)


626-626: f-string without any placeholders

Remove extraneous f prefix

(F541)


629-629: String contains ambiguous (HEAVY PLUS SIGN). Did you mean + (PLUS SIGN)?

(RUF001)


633-633: f-string without any placeholders

Remove extraneous f prefix

(F541)


641-641: f-string without any placeholders

Remove extraneous f prefix

(F541)


648-648: f-string without any placeholders

Remove extraneous f prefix

(F541)


648-648: String contains ambiguous (HEAVY PLUS SIGN). Did you mean + (PLUS SIGN)?

(RUF001)


711-711: Prefer next(iter(matched_dict.keys())) over single element slice

Replace with next(iter(matched_dict.keys()))

(RUF015)


721-721: Unpacked variable failed_matches is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


721-721: Unpacked variable skipped_sections is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


725-725: Prefer next(iter(matched_dict.keys())) over single element slice

Replace with next(iter(matched_dict.keys()))

(RUF015)


726-726: Prefer next(iter(matched_dict.values())) over single element slice

Replace with next(iter(matched_dict.values()))

(RUF015)


762-762: Prefer next(iter(ai_matched.keys())) over single element slice

Replace with next(iter(ai_matched.keys()))

(RUF015)


763-763: Prefer next(iter(ai_matched.values())) over single element slice

Replace with next(iter(ai_matched.values()))

(RUF015)


811-811: Prefer next(iter(matched_dict.keys())) over single element slice

Replace with next(iter(matched_dict.keys()))

(RUF015)


821-821: Unpacked variable failed_matches is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


821-821: Unpacked variable skipped_sections is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


825-825: Prefer next(iter(matched_dict.keys())) over single element slice

Replace with next(iter(matched_dict.keys()))

(RUF015)


826-826: Prefer next(iter(matched_dict.values())) over single element slice

Replace with next(iter(matched_dict.values()))

(RUF015)


854-854: Prefer next(iter(ai_matched.keys())) over single element slice

Replace with next(iter(ai_matched.keys()))

(RUF015)


855-855: Prefer next(iter(ai_matched.values())) over single element slice

Replace with next(iter(ai_matched.values()))

(RUF015)


917-917: Local variable current_level is assigned to but never used

Remove assignment to unused variable current_level

(F841)


939-939: Loop control variable i not used within loop body

Rename unused i to _i

(B007)

scripts/translate_doc_pr/toc_processor.py

90-90: Unused function argument: file_path

(ARG001)


90-90: Unused function argument: target_local_path

(ARG001)


92-92: f-string without any placeholders

Remove extraneous f prefix

(F541)


126-126: String contains ambiguous (HEAVY PLUS SIGN). Did you mean + (PLUS SIGN)?

(RUF001)


241-241: f-string without any placeholders

Remove extraneous f prefix

(F541)


286-286: f-string without any placeholders

Remove extraneous f prefix

(F541)


289-289: Do not catch blind exception: Exception

(BLE001)


293-293: f-string without any placeholders

Remove extraneous f prefix

(F541)


303-303: f-string without any placeholders

Remove extraneous f prefix

(F541)


328-328: Do not catch blind exception: Exception

(BLE001)


332-332: Unused function argument: pr_url

(ARG001)


332-332: Unused function argument: github_client

(ARG001)


387-387: String contains ambiguous (HEAVY PLUS SIGN). Did you mean + (PLUS SIGN)?

(RUF001)


400-400: String contains ambiguous (HEAVY PLUS SIGN). Did you mean + (PLUS SIGN)?

(RUF001)


403-403: String contains ambiguous (HEAVY PLUS SIGN). Did you mean + (PLUS SIGN)?

(RUF001)


418-418: Do not catch blind exception: Exception

(BLE001)


434-434: f-string without any placeholders

Remove extraneous f prefix

(F541)

scripts/translate_doc_pr/main_workflow.py

56-56: Avoid specifying long messages outside the exception class

(TRY003)


115-115: f-string without any placeholders

Remove extraneous f prefix

(F541)


119-119: f-string without any placeholders

Remove extraneous f prefix

(F541)


132-132: Do not catch blind exception: Exception

(BLE001)


157-157: Avoid specifying long messages outside the exception class

(TRY003)


159-159: Avoid specifying long messages outside the exception class

(TRY003)


163-163: Avoid specifying long messages outside the exception class

(TRY003)


179-179: f-string without any placeholders

Remove extraneous f prefix

(F541)


188-188: f-string without any placeholders

Remove extraneous f prefix

(F541)


191-191: f-string without any placeholders

Remove extraneous f prefix

(F541)


195-195: Use explicit conversion flag

Replace with conversion flag

(RUF010)


197-197: f-string without any placeholders

Remove extraneous f prefix

(F541)


198-198: Use raise without specifying exception name

Remove exception name

(TRY201)


221-221: Loop control variable key not used within loop body

Rename unused key to _key

(B007)


235-235: f-string without any placeholders

Remove extraneous f prefix

(F541)


247-247: Do not catch blind exception: Exception

(BLE001)


271-271: Do not catch blind exception: Exception

(BLE001)


275-275: Unused function argument: target_sections

(ARG001)


312-312: Loop control variable key not used within loop body

Rename unused key to _key

(B007)


319-319: f-string without any placeholders

Remove extraneous f prefix

(F541)


370-370: Unused function argument: file_sections

(ARG001)


411-411: f-string without any placeholders

Remove extraneous f prefix

(F541)


414-414: f-string without any placeholders

Remove extraneous f prefix

(F541)


434-434: f-string without any placeholders

Remove extraneous f prefix

(F541)


446-446: f-string without any placeholders

Remove extraneous f prefix

(F541)


458-458: f-string without any placeholders

Remove extraneous f prefix

(F541)


474-474: Unpacked variable file_path is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


505-505: f-string without any placeholders

Remove extraneous f prefix

(F541)


510-510: f-string without any placeholders

Remove extraneous f prefix

(F541)


513-513: f-string without any placeholders

Remove extraneous f prefix

(F541)


517-517: f-string without any placeholders

Remove extraneous f prefix

(F541)


520-520: f-string without any placeholders

Remove extraneous f prefix

(F541)


523-523: Do not catch blind exception: Exception

(BLE001)


530-530: Redefinition of unused parse_pr_url from line 21

Remove definition: parse_pr_url

(F811)


536-536: Avoid specifying long messages outside the exception class

(TRY003)


556-556: f-string without any placeholders

Remove extraneous f prefix

(F541)


584-584: Do not catch blind exception: Exception

(BLE001)


591-591: f-string without any placeholders

Remove extraneous f prefix

(F541)


599-599: f-string without any placeholders

Remove extraneous f prefix

(F541)


600-600: Unpacked variable added_sections is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


600-600: Unpacked variable deleted_sections is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


610-610: f-string without any placeholders

Remove extraneous f prefix

(F541)


614-614: Redefinition of unused process_toc_files from line 25

Remove definition: process_toc_files

(F811)


620-620: f-string without any placeholders

Remove extraneous f prefix

(F541)


626-626: f-string without any placeholders

Remove extraneous f prefix

(F541)


632-632: f-string without any placeholders

Remove extraneous f prefix

(F541)


658-658: f-string without any placeholders

Remove extraneous f prefix

(F541)


683-683: f-string without any placeholders

Remove extraneous f prefix

(F541)


688-688: f-string without any placeholders

Remove extraneous f prefix

(F541)

scripts/translate_doc_pr/file_updater.py

51-51: Local variable diff_content is assigned to but never used

Remove assignment to unused variable diff_content

(F841)


102-102: f-string without any placeholders

Remove extraneous f prefix

(F541)


105-105: f-string without any placeholders

Remove extraneous f prefix

(F541)


109-109: f-string without any placeholders

Remove extraneous f prefix

(F541)


125-125: Do not catch blind exception: Exception

(BLE001)


138-138: f-string without any placeholders

Remove extraneous f prefix

(F541)


150-150: Consider moving this statement to an else block

(TRY300)


152-152: Do not catch blind exception: Exception

(BLE001)


164-164: f-string without any placeholders

Remove extraneous f prefix

(F541)


173-173: f-string without any placeholders

Remove extraneous f prefix

(F541)


176-176: f-string without any placeholders

Remove extraneous f prefix

(F541)


180-180: f-string without any placeholders

Remove extraneous f prefix

(F541)


200-200: Consider moving this statement to an else block

(TRY300)


202-202: f-string without any placeholders

Remove extraneous f prefix

(F541)


230-230: Consider moving this statement to an else block

(TRY300)


231-231: Do not use bare except

(E722)


231-232: try-except-continue detected, consider logging the exception

(S112)


235-235: Use raise without specifying exception name

Remove exception name

(TRY201)


237-237: f-string without any placeholders

Remove extraneous f prefix

(F541)


257-257: Do not catch blind exception: Exception

(BLE001)


417-417: Consider moving this statement to an else block

(TRY300)


419-419: Do not catch blind exception: Exception

(BLE001)


430-430: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


468-468: f-string without any placeholders

Remove extraneous f prefix

(F541)


568-568: Consider moving this statement to an else block

(TRY300)


570-570: Do not catch blind exception: Exception

(BLE001)


601-601: Do not catch blind exception: Exception

(BLE001)


633-633: Do not catch blind exception: Exception

(BLE001)


639-639: Unused function argument: pr_url

(ARG001)


684-684: Unpacked variable skipped_sections is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


688-688: Loop control variable hierarchy_string not used within loop body

Rename unused hierarchy_string to _hierarchy_string

(B007)


726-726: Loop control variable source_line not used within loop body

Rename unused source_line to _source_line

(B007)


768-768: Local variable target_hierarchy is assigned to but never used

Remove assignment to unused variable target_hierarchy

(F841)


808-808: Consider moving this statement to an else block

(TRY300)


810-810: Do not catch blind exception: Exception

(BLE001)


1049-1049: Do not catch blind exception: Exception

(BLE001)


1056-1056: String contains ambiguous (HEAVY PLUS SIGN). Did you mean + (PLUS SIGN)?

(RUF001)


1059-1059: String contains ambiguous (HEAVY PLUS SIGN). Did you mean + (PLUS SIGN)?

(RUF001)


1066-1066: String contains ambiguous (HEAVY PLUS SIGN). Did you mean + (PLUS SIGN)?

(RUF001)


1118-1118: Unused function argument: operation_type

(ARG001)


1183-1183: Do not catch blind exception: Exception

(BLE001)


1190-1190: f-string without any placeholders

Remove extraneous f prefix

(F541)


1219-1219: f-string without any placeholders

Remove extraneous f prefix

(F541)


1270-1270: Unpacked variable section_data is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


1270-1270: Unpacked variable line_num is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


1287-1287: f-string without any placeholders

Remove extraneous f prefix

(F541)


1296-1296: f-string without any placeholders

Remove extraneous f prefix

(F541)


1389-1389: f-string without any placeholders

Remove extraneous f prefix

(F541)


1414-1414: f-string without any placeholders

Remove extraneous f prefix

(F541)


1418-1418: f-string without any placeholders

Remove extraneous f prefix

(F541)


1423-1423: Use explicit conversion flag

Replace with conversion flag

(RUF010)


1429-1429: f-string without any placeholders

Remove extraneous f prefix

(F541)


1461-1461: f-string without any placeholders

Remove extraneous f prefix

(F541)


1471-1471: f-string without any placeholders

Remove extraneous f prefix

(F541)


1483-1483: f-string without any placeholders

Remove extraneous f prefix

(F541)


1502-1502: f-string without any placeholders

Remove extraneous f prefix

(F541)


1522-1522: f-string without any placeholders

Remove extraneous f prefix

(F541)


1553-1553: f-string without any placeholders

Remove extraneous f prefix

(F541)


1585-1585: f-string without any placeholders

Remove extraneous f prefix

(F541)


1622-1622: f-string without any placeholders

Remove extraneous f prefix

(F541)


1688-1688: f-string without any placeholders

Remove extraneous f prefix

(F541)

scripts/translate_doc_pr/pr_analyzer.py

1-1: Shebang is present but file is not executable

(EXE001)


33-33: Avoid specifying long messages outside the exception class

(TRY003)


60-60: Do not catch blind exception: Exception

(BLE001)


240-240: Local variable original_line is assigned to but never used

Remove assignment to unused variable original_line

(F841)


265-265: Local variable title is assigned to but never used

Remove assignment to unused variable title

(F841)


291-291: Unused function argument: lines

(ARG001)


328-328: Local variable original_title is assigned to but never used

Remove assignment to unused variable original_title

(F841)


341-341: Local variable title is assigned to but never used

Remove assignment to unused variable title

(F841)


368-368: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


402-402: Unused function argument: hierarchy_dict

(ARG001)


487-487: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


526-526: Unused function argument: lines

(ARG001)


547-547: Unused function argument: lines

(ARG001)


613-613: Consider moving this statement to an else block

(TRY300)


614-614: Do not catch blind exception: Exception

(BLE001)


644-644: Consider moving this statement to an else block

(TRY300)


645-645: Do not catch blind exception: Exception

(BLE001)


665-665: Consider moving this statement to an else block

(TRY300)


667-667: Do not catch blind exception: Exception

(BLE001)


746-746: Unused function argument: hierarchy_dict

(ARG001)


754-754: Unused function argument: base_hierarchy_dict

(ARG001)


922-922: Loop control variable base_line_str not used within loop body

Rename unused base_line_str to _base_line_str

(B007)


944-944: f-string without any placeholders

Remove extraneous f prefix

(F541)


979-979: Unused function argument: max_non_system_sections

(ARG001)


979-979: Unused function argument: pr_diff

(ARG001)


982-982: Redefinition of unused os from line 8

Remove definition: os

(F811)


983-983: Redefinition of unused json from line 7

Remove definition: json

(F811)


1021-1021: String contains ambiguous (HEAVY PLUS SIGN). Did you mean + (PLUS SIGN)?

(RUF001)


1025-1025: f-string without any placeholders

Remove extraneous f prefix

(F541)


1027-1027: Do not catch blind exception: Exception

(BLE001)


1034-1034: f-string without any placeholders

Remove extraneous f prefix

(F541)


1040-1040: Do not catch blind exception: Exception

(BLE001)


1053-1053: Do not catch blind exception: Exception

(BLE001)


1078-1078: f-string without any placeholders

Remove extraneous f prefix

(F541)


1080-1080: String contains ambiguous (HEAVY PLUS SIGN). Did you mean + (PLUS SIGN)?

(RUF001)


1086-1086: f-string without any placeholders

Remove extraneous f prefix

(F541)


1086-1086: String contains ambiguous (INFORMATION SOURCE). Did you mean i (LATIN SMALL LETTER I)?

(RUF001)


1142-1142: Do not catch blind exception: Exception

(BLE001)


1161-1161: f-string without any placeholders

Remove extraneous f prefix

(F541)


1202-1202: Local variable is_deleted_header is assigned to but never used

Remove assignment to unused variable is_deleted_header

(F841)


1270-1270: f-string without any placeholders

Remove extraneous f prefix

(F541)


1299-1299: f-string without any placeholders

Remove extraneous f prefix

(F541)


1371-1371: String contains ambiguous (HEAVY PLUS SIGN). Did you mean + (PLUS SIGN)?

(RUF001)


1402-1402: f-string without any placeholders

Remove extraneous f prefix

(F541)


1413-1413: Loop control variable existing_line not used within loop body

Rename unused existing_line to _existing_line

(B007)


1437-1437: f-string without any placeholders

Remove extraneous f prefix

(F541)

🪛 actionlint (1.7.7)
.github/workflows/sync-docs-cn-to-en.yml

37-37: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: tidb-check
🔇 Additional comments (1)
scripts/translate_doc_pr/main_workflow.py (1)

39-40: No changes needed: "gemini-2.0-flash" is a valid Google GenAI model ID.

Comment on lines +3 to +24
on:
  workflow_dispatch:
    inputs:
      source_pr_url:
        description: 'Source PR URL (Chinese docs repository)'
        required: true
        type: string
        default: ''
      target_pr_url:
        description: 'Target PR URL (English docs repository)'
        required: true
        type: string
        default: ''
      ai_provider:
        description: 'AI Provider to use for translation'
        required: false
        type: choice
        options:
          - deepseek
          - gemini
        default: 'gemini'


🛠️ Refactor suggestion | 🟠 Major

Declare least-privilege permissions for GITHUB_TOKEN.

Explicitly set permissions to enable contents and PR writes; current default may be read-only.

Apply:

 on:
   workflow_dispatch:
     inputs:
       source_pr_url:
         description: 'Source PR URL (Chinese docs repository)'
         required: true
         type: string
         default: ''
@@
         default: 'gemini'
 
+permissions:
+  contents: write
+  pull-requests: write
🤖 Prompt for AI Agents
.github/workflows/sync-docs-cn-to-en.yml around lines 3 to 24: the workflow
lacks an explicit permissions block for GITHUB_TOKEN, which can default to
read-only; add a top-level permissions section granting least-privilege write
access needed for the job (e.g., contents: write and pull-requests: write) so
the workflow can update repository contents and create/update PRs, ensuring no
broader permissions are granted.

Comment on lines +37 to +39
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

⚠️ Potential issue | 🟠 Major

Update setup-python to v5 (actionlint failure).

Use actions/setup-python@v5 to avoid runner errors.

Apply:

-      - name: Set up Python
-        uses: actions/setup-python@v4
+      - name: Set up Python
+        uses: actions/setup-python@v5
         with:
           python-version: '3.9'
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        uses: actions/setup-python@v4
-        with:
-          python-version: '3.9'
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.9'
🧰 Tools
🪛 actionlint (1.7.7)

37-37: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

🤖 Prompt for AI Agents
.github/workflows/sync-docs-cn-to-en.yml around lines 37 to 39: the workflow
pins actions/setup-python@v4 which causes actionlint/runner errors; update the
workflow to use actions/setup-python@v5 by changing the action reference to
actions/setup-python@v5 and ensure the existing with: python-version setting
remains unchanged (no other logic changes needed).

Comment on lines +71 to +83
      - name: Get target PR branch info
        id: target_branch
        run: |
          # Get target PR branch name
          TARGET_BRANCH=$(curl -s \
            -H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" \
            -H "Accept: application/vnd.github.v3+json" \
            "https://api.github.com/repos/${{ steps.extract_info.outputs.target_owner }}/${{ steps.extract_info.outputs.target_repo }}/pulls/${{ steps.extract_info.outputs.target_pr }}" \
            | jq -r '.head.ref')

          echo "target_branch=${TARGET_BRANCH}" >> $GITHUB_OUTPUT
          echo "Target branch: ${TARGET_BRANCH}"


⚠️ Potential issue | 🔴 Critical

Cross-repo PRs and forks: clone/push the PR head repository using a PAT; current flow will fail.

Issues:

  • You only fetch .head.ref but clone base repo; this breaks for forked PRs (branch won't exist in base).
  • GITHUB_TOKEN cannot push to another repo; use a PAT (or GitHub App) with contents:write on the head repo.
  • Commenting on the target PR also needs a token authorized on that repo.

Fix by retrieving head repo full name and cloning/pushing to it with a PAT, and by using the same PAT for API calls.

Suggested patch:

       - name: Get target PR branch info
         id: target_branch
         run: |
           # Get target PR branch name
-          TARGET_BRANCH=$(curl -s \
+          PR_JSON=$(curl -s \
             -H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" \
-            -H "Accept: application/vnd.github.v3+json" \
+            -H "Accept: application/vnd.github+json" \
             "https://api.github.com/repos/${{ steps.extract_info.outputs.target_owner }}/${{ steps.extract_info.outputs.target_repo }}/pulls/${{ steps.extract_info.outputs.target_pr }}" \
-            | jq -r '.head.ref')
+          )
+          TARGET_BRANCH=$(echo "$PR_JSON" | jq -r '.head.ref')
+          TARGET_HEAD_REPO=$(echo "$PR_JSON" | jq -r '.head.repo.full_name')
+          echo "target_branch=${TARGET_BRANCH}" >> $GITHUB_OUTPUT
+          echo "target_head_repo=${TARGET_HEAD_REPO}" >> $GITHUB_OUTPUT
-          echo "target_branch=${TARGET_BRANCH}" >> $GITHUB_OUTPUT
-          echo "Target branch: ${TARGET_BRANCH}"
+          echo "Target branch: ${TARGET_BRANCH}"
+          echo "Head repo: ${TARGET_HEAD_REPO}"

       - name: Clone target repository
         run: |
-          # Clone target repository with the PR branch
-          git clone https://x-access-token:${{ secrets.GITHUB_TOKEN }}@github.com/${{ steps.extract_info.outputs.target_owner }}/${{ steps.extract_info.outputs.target_repo }}.git target_repo
+          # Clone PR head repository (supports forks) with a PAT
+          git clone https://x-access-token:${{ secrets.SYNC_REPO_TOKEN }}@github.com/${{ steps.target_branch.outputs.target_head_repo }}.git target_repo
           cd target_repo
-          git checkout ${{ steps.target_branch.outputs.target_branch }}
+          git fetch origin ${{ steps.target_branch.outputs.target_branch }}
+          git checkout -B ${{ steps.target_branch.outputs.target_branch }} origin/${{ steps.target_branch.outputs.target_branch }}
           git config user.name "github-actions[bot]"
           git config user.email "github-actions[bot]@users.noreply.github.com"

       - name: Commit and push changes
         run: |
           cd target_repo
           git add .
           if git diff --staged --quiet; then
             echo "No changes to commit"
           else
             git commit -m "Auto-sync: Update English docs from Chinese PR ${{ github.event.inputs.source_pr_url }}
 
             Synced from: ${{ github.event.inputs.source_pr_url }}
             Target PR: ${{ github.event.inputs.target_pr_url }}
             AI Provider: ${{ github.event.inputs.ai_provider }}
 
             Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>"
 
-            git push origin ${{ steps.target_branch.outputs.target_branch }}
+            git push https://x-access-token:${{ secrets.SYNC_REPO_TOKEN }}@github.com/${{ steps.target_branch.outputs.target_head_repo }}.git ${{ steps.target_branch.outputs.target_branch }}
             echo "Changes pushed to target PR branch: ${{ steps.target_branch.outputs.target_branch }}"
           fi

       - name: Add comment to target PR
         run: |
           # Add a comment to the target PR about the sync
-          curl -X POST \
-            -H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" \
+          curl -X POST \
+            -H "Authorization: token ${{ secrets.SYNC_REPO_TOKEN }}" \
             -H "Accept: application/vnd.github.v3+json" \
             "https://api.github.com/repos/${{ steps.extract_info.outputs.target_owner }}/${{ steps.extract_info.outputs.target_repo }}/issues/${{ steps.extract_info.outputs.target_pr }}/comments" \
             -d "{
               \"body\": \"🤖 **Auto-sync completed**\\n\\n📥 **Source PR**: ${{ github.event.inputs.source_pr_url }}\\n🎯 **Target PR**: ${{ github.event.inputs.target_pr_url }}\\n✅ English documentation has been updated based on Chinese documentation changes.\\n\\n_This comment was generated automatically by the sync workflow._\"
             }"

Also add a repo secret SYNC_REPO_TOKEN (PAT) with repo:write access on the head repo (or use a GitHub App).

Also applies to: 84-93, 125-134

🤖 Prompt for AI Agents
.github/workflows/sync-docs-cn-to-en.yml lines 71-83 (and similarly update
84-93, 125-134): the workflow currently reads only .head.ref and clones/pushes
the base repo using GITHUB_TOKEN which fails for forked PRs and prevents pushing
to the head repo; change the flow to fetch the head repo full_name (owner/repo)
and head.ref from the PR API, switch cloning/push operations to use that head
repo full_name, and replace GITHUB_TOKEN with a PAT stored in a repo secret
(e.g. SYNC_REPO_TOKEN with repo:write/contents scope) for git clone/push and for
any API calls (comments or other PR updates) so the token has permission on the
head repo; ensure the workflow uses the head repo URL when adding remotes and
pushing branches and update the other referenced blocks (84-93, 125-134) to use
the same head-repo + PAT approach.

Comment on lines +19 to +22
from main import main

# Make main function available at package level
__all__ = ["main"]

⚠️ Potential issue | 🟠 Major

Fix package export import path

import scripts.translate_doc_pr currently fails with ModuleNotFoundError: No module named 'main' because the package init does an absolute from main import main. The callable lives in main_workflow.py, so the package entry point is unusable until the import points at the correct module.

-from main import main
+from .main_workflow import main
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-from main import main
+from .main_workflow import main
 # Make main function available at package level
 __all__ = ["main"]
🤖 Prompt for AI Agents
In scripts/translate_doc_pr/__init__.py around lines 19 to 22, the package init
does an absolute import from a non-existent module ("main"), causing
ModuleNotFoundError; replace the import with a relative import that points to
the actual callable (main_workflow.py) — i.e., import the main function using a
package-relative import and keep __all__ as ["main"] so importing
scripts.translate_doc_pr exposes the correct entrypoint.
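
Once the relative import is in place, a quick sanity check (illustrative, run from the repository root) confirms the package-level export:

```python
# Verifies that the package exposes the entry point after the fix;
# assumes the repository root is on sys.path.
from scripts.translate_doc_pr import main

print(callable(main))  # expected output: True
```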

Comment on lines +100 to +105
    # Add token estimation
    try:
        from main import print_token_estimation
        print_token_estimation(prompt, "File addition translation")
    except ImportError:
        # Fallback if import fails - use tiktoken

⚠️ Potential issue | 🟠 Major

Wrong module import: print_token_estimation is in main_workflow.py, not main.

This always triggers the ImportError fallback, hiding the token stats.

Apply:

-    try:
-        from main import print_token_estimation
+    try:
+        from main_workflow import print_token_estimation
         print_token_estimation(prompt, "File addition translation")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     # Add token estimation
     try:
-        from main import print_token_estimation
+        from main_workflow import print_token_estimation
         print_token_estimation(prompt, "File addition translation")
     except ImportError:
         # Fallback if import fails - use tiktoken
🤖 Prompt for AI Agents
In scripts/translate_doc_pr/file_adder.py around lines 100 to 105, the code
incorrectly imports print_token_estimation from module main (which causes an
ImportError and always triggers the fallback); update the import to load
print_token_estimation from main_workflow (from main_workflow import
print_token_estimation) and keep the existing try/except fallback behavior so
token estimation is used when available and falls back to tiktoken only if the
import or call fails.

Comment on lines +340 to +346
    try:
        from main import print_token_estimation
        print_token_estimation(prompt, f"Section mapping ({source_language} → {target_language})")
    except ImportError:
        # Fallback if import fails - use tiktoken
        try:
            import tiktoken

⚠️ Potential issue | 🟠 Major

Fix import of token estimation helper (wrong module name).

main_workflow.py defines print_token_estimation. Importing from main will always fail and suppress the feature.

Apply this diff:

-    try:
-        from main import print_token_estimation
+    try:
+        # Try local module name first
+        from main_workflow import print_token_estimation
     print_token_estimation(prompt, f"Section mapping ({source_language} → {target_language})")

Optional: add a second fallback try for package-style import if you plan to package this module.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     try:
-        from main import print_token_estimation
+        # Try local module name first
+        from main_workflow import print_token_estimation
         print_token_estimation(prompt, f"Section mapping ({source_language} → {target_language})")
     except ImportError:
         # Fallback if import fails - use tiktoken
         try:
             import tiktoken
🤖 Prompt for AI Agents
In scripts/translate_doc_pr/section_matcher.py around lines 340-346, the code
tries to import print_token_estimation from the wrong module name ("main"),
which always triggers the ImportError and disables the helper; change the import
to "from main_workflow import print_token_estimation" and call it as before;
additionally add a second fallback import attempt for package-style imports
(e.g., "from .main_workflow import print_token_estimation") before falling back
to the tiktoken-based implementation so the helper works both when run
standalone and when packaged.
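
A minimal sketch of that double fallback, assuming the module layout in this PR (the final None assignment is only illustrative of handing control to the tiktoken path):

```python
# Resolve print_token_estimation whether the script is run standalone or
# imported as part of the scripts.translate_doc_pr package.
try:
    from main_workflow import print_token_estimation        # standalone script run
except ImportError:
    try:
        from .main_workflow import print_token_estimation   # package-style import
    except ImportError:
        print_token_estimation = None                        # caller falls back to tiktoken
```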

Comment on lines +493 to +497
                    'insertion_after_line': insertion_after_line,
                    'target_hierarchy': target_hierarchy.get(str(target_line_num), ''),
                    'insertion_type': point_info['insertion_type'],
                    'new_sections': point_info['new_sections']
                }

⚠️ Potential issue | 🟠 Major

Key type mismatch when reading target_hierarchy.

target_hierarchy keys are ints, so looking up str(target_line_num) always misses and returns the empty-string default.

Apply this diff:

-                    'target_hierarchy': target_hierarchy.get(str(target_line_num), ''),
+                    'target_hierarchy': target_hierarchy.get(target_line_num, ''),
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
                     'insertion_after_line': insertion_after_line,
-                    'target_hierarchy': target_hierarchy.get(str(target_line_num), ''),
+                    'target_hierarchy': target_hierarchy.get(target_line_num, ''),
                     'insertion_type': point_info['insertion_type'],
                     'new_sections': point_info['new_sections']
                 }
🤖 Prompt for AI Agents
In scripts/translate_doc_pr/section_matcher.py around lines 493 to 497, the code
uses str(target_line_num) to read from target_hierarchy whose keys are integers,
causing lookups to return an empty string; change the lookup to use the integer
key (i.e., target_hierarchy.get(target_line_num, '')) so the correct value is
returned (and keep the existing fallback ''), ensuring types match between the
key and the lookup.
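
A tiny illustration of why the str() lookup always misses (the dictionary contents here are made up):

```python
# target_hierarchy is keyed by int line numbers, so a string key never matches.
target_hierarchy = {42: "Overview > Configuration"}

print(target_hierarchy.get(str(42), ''))  # '' - the string '42' is not the int 42
print(target_hierarchy.get(42, ''))       # 'Overview > Configuration'
```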

Comment on lines +866 to +875
"""Extract target section content from target_lines (includes sub-sections)"""
if target_line_num >= len(target_lines):
return ""

start_line = target_line_num - 1 # Convert to 0-based index

# Find the end of the section by looking for the next header
current_line = target_lines[start_line].strip()
if not current_line.startswith('#'):
return current_line

⚠️ Potential issue | 🟠 Major

Off-by-one: allow extracting content at the last line.

The guard should be a 1-based range check.

Apply this diff:

-    if target_line_num >= len(target_lines):
+    if target_line_num < 1 or target_line_num > len(target_lines):
         return ""
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     """Extract target section content from target_lines (includes sub-sections)"""
-    if target_line_num >= len(target_lines):
+    if target_line_num < 1 or target_line_num > len(target_lines):
         return ""
 
     start_line = target_line_num - 1  # Convert to 0-based index
 
     # Find the end of the section by looking for the next header
     current_line = target_lines[start_line].strip()
     if not current_line.startswith('#'):
         return current_line
🤖 Prompt for AI Agents
In scripts/translate_doc_pr/section_matcher.py around lines 866 to 875, the
current guard uses a 0-based check (if target_line_num >= len(target_lines))
which prevents extracting the last line when target_line_num is 1-based; change
the guard to a 1-based range check (e.g., if target_line_num < 1 or
target_line_num > len(target_lines): return "") so valid 1-based indices
including the last line are allowed, then keep start_line = target_line_num - 1
and the subsequent logic unchanged.
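
A small sketch of the corrected 1-based guard (the sample lines are made up):

```python
# With a 1-based line number, len(target_lines) is the last valid index.
def is_valid_line_num(target_line_num, target_lines):
    return 1 <= target_line_num <= len(target_lines)

target_lines = ["# Title", "Intro text", "## Last section"]
print(is_valid_line_num(3, target_lines))  # True  - the old `>= len` guard wrongly rejected this
print(is_valid_line_num(4, target_lines))  # False - past the end
print(is_valid_line_num(0, target_lines))  # False - line numbers start at 1
```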

Comment on lines +905 to +913
def extract_section_direct_content(target_line_num, target_lines):
    """Extract ONLY the direct content of a section (excluding sub-sections)"""
    if target_line_num >= len(target_lines):
        return ""

    start_line = target_line_num - 1  # Convert to 0-based index

    # Find the end of the section by looking for the next header
    current_line = target_lines[start_line].strip()

⚠️ Potential issue | 🟠 Major

Same off-by-one in direct-content extractor.

Apply this diff:

-    if target_line_num >= len(target_lines):
+    if target_line_num < 1 or target_line_num > len(target_lines):
         return ""
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 def extract_section_direct_content(target_line_num, target_lines):
     """Extract ONLY the direct content of a section (excluding sub-sections)"""
-    if target_line_num >= len(target_lines):
+    if target_line_num < 1 or target_line_num > len(target_lines):
         return ""
 
     start_line = target_line_num - 1  # Convert to 0-based index
 
     # Find the end of the section by looking for the next header
     current_line = target_lines[start_line].strip()
🤖 Prompt for AI Agents
In scripts/translate_doc_pr/section_matcher.py around lines 905 to 913, the
direct-content extractor has the same off-by-one bug: the guard checks
target_line_num >= len(target_lines) even though target_line_num is 1-based, so
content at the last line can never be extracted. Fix it by changing the guard to
a 1-based range check (if target_line_num < 1 or target_line_num >
len(target_lines): return "") and keep start_line = target_line_num - 1; leave
the subsequent current_line = target_lines[start_line].strip() and section end
detection logic unchanged.

Comment on lines +274 to +278
    # Add token estimation
    try:
        from main import print_token_estimation
        print_token_estimation(prompt, "TOC translation")
    except ImportError:

⚠️ Potential issue | 🟠 Major

Wrong module import: print_token_estimation is in main_workflow.py.

-    try:
-        from main import print_token_estimation
+    try:
+        from main_workflow import print_token_estimation
         print_token_estimation(prompt, "TOC translation")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     # Add token estimation
     try:
-        from main import print_token_estimation
+        from main_workflow import print_token_estimation
         print_token_estimation(prompt, "TOC translation")
     except ImportError:
🤖 Prompt for AI Agents
In scripts/translate_doc_pr/toc_processor.py around lines 274 to 278, the code
attempts to import print_token_estimation from main but that function lives in
main_workflow.py; change the import to "from main_workflow import
print_token_estimation" (keeping the existing try/except ImportError block) so
the correct module is referenced and token estimation is called successfully.
