
Add Modal orchestrator with step operator and orchestrator flavors #3733


Open · wants to merge 43 commits into base: develop

Conversation

htahir1
Contributor

@htahir1 htahir1 commented Jun 12, 2025

Describe changes

Add Modal Orchestrator Integration

This PR adds a new Modal orchestrator to the ZenML integrations, enabling users
to run complete ML pipelines on Modal's serverless cloud infrastructure with
optimized performance and cost efficiency.

What does this PR do?

  • Adds Modal Orchestrator: New orchestrator flavor that executes entire ZenML
    pipelines on Modal's cloud platform
  • Optimized Execution Modes: Two execution modes for different use cases:
    • pipeline (default): Runs entire pipeline in single Modal function for
      maximum speed
    • per_step: Runs each step separately for granular control and debugging
  • Persistent Apps: Implements warm container reuse with 2-hour refresh cycles
    for faster execution
  • Resource Configuration: Full support for GPU, CPU, memory settings with
    intelligent defaults
  • Authentication: Modal token support with fallback to default Modal auth

Implementation Details

File Structure

src/zenml/integrations/modal/
├── orchestrators/                     # New directory
│   ├── __init__.py
│   └── modal_orchestrator.py          # Main orchestrator implementation
├── flavors/
│   ├── modal_orchestrator_flavor.py   # New flavor definition
│   └── __init__.py                    # Updated exports
└── __init__.py                        # Updated with orchestrator registration

Key Features

  • Pipeline-First Design: Uses PipelineEntrypointConfiguration for optimal
    performance
  • Smart Resource Management: Automatic fallbacks and intelligent defaults (32
    CPU, 64GB RAM)
  • Cost Optimization: Persistent apps with warm containers to minimize cold
    starts
  • Stack Validation: Ensures remote artifact store and container registry
    compatibility
  • Comprehensive Error Handling: Detailed logging and error reporting

Usage Example

  from zenml import pipeline, step
  from zenml.integrations.modal.flavors import ModalOrchestratorSettings

  # Configure Modal orchestrator
  @pipeline(
      settings={
          "orchestrator": ModalOrchestratorSettings(
              execution_mode="pipeline",  # Run entire pipeline in one function
              cpu_count=16,
              memory_mb=32768,
              gpu="A100",
              region="us-east-1"
          )
      }
  )
  def my_pipeline():
      # Your pipeline steps here
      pass
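The description above also names a `per_step` mode; for completeness, a sketch of how it might be configured (assuming the same `ModalOrchestratorSettings` fields shown in this PR; the resource values are illustrative, not recommendations):

```python
from zenml import pipeline
from zenml.integrations.modal.flavors import ModalOrchestratorSettings

# Sketch: per-step execution trades some speed for granular control,
# since each step gets its own Modal function call.
@pipeline(
    settings={
        "orchestrator": ModalOrchestratorSettings(
            execution_mode="per_step",  # One Modal function per step
            cpu_count=4,
            memory_mb=8192,
        )
    }
)
def my_debug_pipeline():
    # Steps here run in separate Modal functions
    pass
```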

Why this approach?

  1. Performance: Running entire pipelines in single containers eliminates
    inter-step overhead
  2. Cost Efficiency: Fewer container spawns = lower costs on Modal's platform
  3. Simplicity: Clean API with just two execution modes for distinct use cases
  4. ZenML Native: Leverages ZenML's PipelineEntrypointConfiguration for optimal
    integration

Breaking Changes

None - this is a new integration that doesn't affect existing functionality.

Dependencies

  • Adds modal>=0.64.49,<1 requirement to Modal integration
  • No new dependencies for core ZenML

Note: This orchestrator follows the same patterns as other ZenML orchestrators
(GCP Vertex, Kubernetes) and integrates seamlessly with the existing ZenML
stack architecture.

Note: I also updated the step operator logic to unify it

Pre-requisites

Please ensure you have done the following:

  • I have read the CONTRIBUTING.md document.
  • I have added tests to cover my changes.
  • I have based my new branch on develop and the open PR is targeting develop. If your branch wasn't based on develop read Contribution guide on rebasing branch to develop.
  • IMPORTANT: I made sure that my changes are reflected properly in the following resources:
    • ZenML Docs
    • Dashboard: Needs to be communicated to the frontend team.
    • Templates: Might need adjustments (that are not reflected in the template tests) in case of non-breaking changes and deprecations.
    • Projects: Depending on the version dependencies, different projects might get affected.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Other (add details above)

@htahir1 htahir1 requested review from strickvl and safoinme June 12, 2025 09:36
Contributor

coderabbitai bot commented Jun 12, 2025

Important

Review skipped

Auto reviews are disabled on this repository.


@github-actions github-actions bot added internal To filter out internal PRs and issues enhancement New feature or request labels Jun 12, 2025
- Adds new Modal orchestrator flavor for serverless pipeline execution
- Implements optimized execution modes: pipeline (default) and per_step
- Supports GPU/CPU resource configuration with intelligent defaults
- Features persistent apps with warm containers for fast execution
- Includes comprehensive documentation and examples
- Simplifies execution model by removing redundant single_function mode

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Contributor

github-actions bot commented Jun 12, 2025

Documentation Link Check Results

Absolute links check passed
Relative links check passed
Last checked: 2025-06-24 20:28:46 UTC

Contributor

✅ Branch tenant has been deployed! Access it at: https://staging.cloud.zenml.io/workspaces/feature-modal-orchestrator/projects

Comment on lines 580 to 585
elif (
    log_age < 300
):  # Only show logs from last 5 minutes
    # This log is recent enough to likely be ours
    logger.info(f"{log_msg}")
# Else: skip old logs
Contributor

This bit seems maybe like it could use more attention. Also are you sure that we wouldn't have time zone mismatches here etc?
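One way to sidestep the time zone concern is to compare epoch timestamps, which are zone-independent, instead of wall-clock datetimes. A minimal sketch (the `is_recent` helper is hypothetical, not code from this PR):

```python
from datetime import datetime, timezone
from typing import Optional

def is_recent(log_epoch: float, now_epoch: Optional[float] = None,
              window_seconds: int = 300) -> bool:
    """Return True if a log entry falls inside the recency window.

    Epoch seconds are implicitly UTC, so comparing them avoids the
    clock/zone mismatch that naive datetime comparison invites.
    """
    if now_epoch is None:
        now_epoch = datetime.now(timezone.utc).timestamp()
    return (now_epoch - log_epoch) < window_seconds
```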

else:
    # Fallback to first step's resource settings if no pipeline-level resources
    if deployment.step_configurations:
        first_step = list(deployment.step_configurations.values())[0]
Contributor

what if the first step is unrepresentative, though?

Contributor

We should check this, or at least make it really explicit in the docs that this is the assumption.
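One explicit alternative to trusting the first step is to take the highest requirement seen across all steps (which is the fallback a later commit in this PR adopts). A sketch with plain dicts, where `merge_step_resources` and the baseline defaults are hypothetical:

```python
def merge_step_resources(step_resources: list) -> dict:
    """Fallback: take the highest requirement across all steps, so one
    unrepresentative first step cannot under-provision the run."""
    merged = {"cpu_count": 1, "memory_mb": 1024}  # assumed baseline defaults
    for resources in step_resources:
        for key in merged:
            value = resources.get(key)
            if value is not None and value > merged[key]:
                merged[key] = value
    return merged
```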

Comment on lines 69 to 74
# Register the orchestrator with explicit credentials
zenml orchestrator register <ORCHESTRATOR_NAME> \
--flavor=modal \
--token=<MODAL_TOKEN> \
--workspace=<MODAL_WORKSPACE> \
--synchronous=true
Contributor

I think this should use --token-id and --token-secret separately as per the code?

Comment on lines 304 to 306
### Authentication with different environments

For production deployments, you can specify different Modal environments:
Contributor

Maybe add a little info box in this section (or even above, linking down here) to say that you might want two different stacks, each associated with a different Modal environment: one for production and one for development.

log_stream_active.set()
start_time = time.time()

def stream_logs() -> None:
Contributor

Function-in-function smells a bit wrong, and I'm also wondering if we should instead use their Python SDK to stream the logs? https://github.com/modal-labs/modal-client/blob/4177d0b994ac69e01ada7d7a96655c9dcaae570e/modal/cli/utils.py#L24

Possibly something for down the line, though the func-in-func seems off.

Comment on lines 62 to 74
if TYPE_CHECKING:
    from zenml.integrations.modal.flavors.modal_orchestrator_flavor import (
        ModalOrchestratorConfig,
        ModalOrchestratorSettings,
    )

from zenml.integrations.modal.flavors.modal_orchestrator_flavor import (
    ModalExecutionMode,
)

if TYPE_CHECKING:
    from zenml.models import PipelineDeploymentResponse, PipelineRunResponse
    from zenml.models.v2.core.pipeline_deployment import PipelineDeploymentBase
Contributor

combine the 'if TYPE_CHECKING' parts?


The Modal orchestrator supports two execution modes:

1. **`pipeline` (default)**: Runs the entire pipeline in a single Modal function for maximum speed and cost efficiency
Contributor

Not sure I understand why this pipeline option is max speed. Isn't it running everything sequentially in the same container? Wouldn't running things in parallel in separate Modal function calls run faster?

Comment on lines 7 to 10
Using the ZenML `modal` integration, you can orchestrate and scale your ML pipelines on [Modal's](https://modal.com/) serverless cloud platform with minimal setup and maximum efficiency.

The Modal orchestrator is designed for speed and cost-effectiveness, running entire pipelines in single serverless functions to minimize cold starts and optimize resource utilization.

Contributor

Maybe some representative screenshot of the Modal UI in here to make the docs a bit friendlier?

Contributor Author

i think its fine without

htahir1 and others added 5 commits June 13, 2025 14:35
- Extract nested log streaming function into ModalLogStreamer class for better code organization
- Remove unreliable timezone-based log filtering that could miss logs due to clock skew
- Implement smarter resource fallback: use highest requirements across all steps instead of potentially unrepresentative first step
- Add logging for resource selection decisions to improve debugging
- Fix function-in-function code smell identified in PR review

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Combine duplicate TYPE_CHECKING blocks into single import section
- Improve import organization and reduce redundancy
- Maintain all existing functionality while improving code structure

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Import MODAL_ORCHESTRATOR_FLAVOR constant from central location to avoid duplication
- Update requirements to modal>=1 after testing compatibility with both orchestrator and step operator
- Remove unnecessary utils import that was only for mypy discovery
- Maintain consistent import patterns across Modal integration files

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Based on PR review feedback:

- Fix token authentication examples to use --token-id and --token-secret
- Add "When NOT to use it" section with clear tradeoffs and alternatives
- Add info boxes for environment separation best practices and cost implications
- Document Modal vs Step Operator differences with usage recommendations
- Add GPU base image requirements and CUDA compatibility warnings
- Clarify execution modes: "pipeline" mode reduces overhead vs enables parallelism
- Document resource fallback behavior and warming window defaults
- Add container warming cost implications with specific guidance
- Remove tracking pixel per review request
- Improve overall documentation clarity and completeness

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@htahir1 htahir1 requested a review from strickvl June 13, 2025 14:25
Contributor

@strickvl strickvl left a comment

LGTM, assuming tests pass.

token_id: Optional[str] = SecretField(default=None)
token_secret: Optional[str] = SecretField(default=None)
workspace: Optional[str] = None
environment: Optional[str] = None
Contributor

Already exists on superclass

Returns:
True since the orchestrator waits for completion.
"""
return True
Contributor

This is wrong. All other orchestrators have the synchronous attribute in the config and return that value here I think?
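The convention the reviewer describes can be sketched with simplified stand-in classes (these are not the real ZenML base classes):

```python
from dataclasses import dataclass

@dataclass
class OrchestratorConfig:
    synchronous: bool = True  # user-configurable, as on other orchestrators

class ModalOrchestratorSketch:
    def __init__(self, config: OrchestratorConfig) -> None:
        self.config = config

    @property
    def is_synchronous(self) -> bool:
        # Defer to the configured value instead of hard-coding True.
        return self.config.synchronous
```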

token_id: Optional[str] = SecretField(default=None)
token_secret: Optional[str] = SecretField(default=None)
workspace: Optional[str] = None
environment: Optional[str] = None
Contributor

Defined on super class

try:
    import modal
except ImportError:
    modal = None  # type: ignore
Contributor

? And then it crashes with a worse error below?

Contributor Author

agreed

logger.info(
    f"Running step '{step_name}' remotely (pipeline run: {pipeline_run_id})"
)
sys.stdout.flush()
Contributor

?

Exception: If step execution fails.
"""
# Get pipeline run ID for debugging/logging
pipeline_run_id = os.environ.get("ZENML_PIPELINE_RUN_ID", "unknown")
Contributor

What is this, and who sets it?

)

# Create the configuration and run the step
config = StepEntrypointConfiguration(arguments=args)
Contributor

This class is not meant to be used that way; no idea whether this will lead to issues.

"""
# Use the standard containerized orchestrator build logic
# This ensures ZenML builds the image with all pipeline code
return super().get_docker_builds(deployment)
Contributor

Remove

Comment on lines 303 to 306
if modal is None:
    raise RuntimeError(
        "Required dependencies not installed. Please install with: pip install modal"
    )
Contributor

Suggested change
- if modal is None:
-     raise RuntimeError(
-         "Required dependencies not installed. Please install with: pip install modal"
-     )
+ if modal is None:
+     raise ImportError(
+         "Modal is not installed. Please install with: pip install 'modal>=1.0'"
+     )

Contributor Author

No this is intentional


# Pass pipeline run ID for proper isolation (following other orchestrators' pattern)
if placeholder_run:
    environment["ZENML_PIPELINE_RUN_ID"] = str(placeholder_run.id)
Contributor Author

@schustmi can i do this? Does this work?

Labels
enhancement New feature or request internal To filter out internal PRs and issues staging-workspace
4 participants