Conversation


@Jmevorach Jmevorach commented Nov 26, 2025

Refactors the OpenEMR 7.0.5 Docker image for significantly improved startup performance and reduced memory footprint. This is a work-in-progress that requires community testing to verify all container functionality before merging.

Performance Improvements

Benchmarking against the current Docker Hub image (openemr/openemr:7.0.5) shows:

| Metric        | Optimized   | Original    | Improvement   |
|---------------|-------------|-------------|---------------|
| Startup Time  | 15.0s       | 73.1s       | 4.9x faster   |
| Memory (Avg)  | 92.8 MB     | 304.2 MB    | 69% reduction |
| Memory (Peak) | 117.1 MB    | 326.6 MB    | 64% reduction |
| Throughput    | 114.9 req/s | 117.2 req/s | Equivalent    |

Technical Changes

Build-Time Permission Optimization (Primary)

  • Moved file permission operations (chmod/chown) from runtime to build time
  • The original container scanned and modified ~15,000 files on every startup (40-60s)
  • Now permissions are baked into the image; only setup-modified files need runtime adjustment
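
A minimal sketch of the build-time approach, for illustration only; the base image, paths, and ownership shown here are assumptions rather than the exact contents of this PR's Dockerfile:

```dockerfile
# Illustrative only: fix ownership and permissions once, at build time.
FROM alpine:3.20
# ... install Apache/PHP and copy the OpenEMR tree (paths are assumptions) ...
COPY --chown=apache:apache openemr/ /var/www/localhost/htdocs/openemr/
RUN find /var/www/localhost/htdocs/openemr -type d -exec chmod 0755 {} + \
 && find /var/www/localhost/htdocs/openemr -type f -exec chmod 0644 {} +
# At runtime, only files touched by setup (e.g. the site configuration) still need adjusting.
```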

Startup Script Modernization

  • Rewrote openemr.sh from POSIX sh to bash with set -euo pipefail
  • Added comprehensive inline documentation and clear section organization
  • Improved database wait logic with exponential backoff (see the sketch after this list)
  • Simplified permission handling for runtime efficiency
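
For the database wait logic, here is a minimal sketch of an exponential backoff loop; it assumes a MySQL/MariaDB client and a MYSQL_HOST variable are available in the container and is not the exact code in openemr.sh:

```bash
# Sketch only: back off 1s, 2s, 4s ... capped at 30s, up to an overall timeout.
wait_for_database() {
  local delay=1 max_delay=30 elapsed=0 timeout="${1:-300}"
  until mysqladmin ping -h "${MYSQL_HOST}" --silent; do
    if (( elapsed >= timeout )); then
      echo "Database did not become ready within ${timeout}s" >&2
      return 1
    fi
    sleep "${delay}"
    (( elapsed += delay ))
    (( delay = delay * 2 > max_delay ? max_delay : delay * 2 ))
  done
}
```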

Dockerfile Documentation

  • Added extensive comments explaining each build stage
  • Documented package purposes, permission schemes, and build targets

Testing Needed

This PR needs help verifying that all OpenEMR container functions work correctly:

  • Fresh installation (auto-configure)
  • Manual setup mode
  • SSL/TLS certificate configuration
  • Redis session handling
  • Kubernetes mode (admin/worker roles)
  • Swarm mode coordination
  • Upgrade path from previous versions
  • XDebug configuration
  • Let's Encrypt integration
  • Document upload/storage
  • Multi-site configurations

How to Test

Build and test the optimized image:

```bash
cd docker/openemr/7.0.5
docker build -t openemr:7.0.5-optimized .
docker compose up -d
```

Or use the benchmarking utility to compare performance:

```bash
cd utilities/container_benchmarking
./benchmark.sh
```

Related Files

  • docker/openemr/7.0.5/* - Optimized container files
  • utilities/container_benchmarking/ - Benchmarking suite for performance validation

Summary of the Container Benchmarking Utility

The Container Benchmarking Utility (utilities/container_benchmarking/) is a valuable development tool for the OpenEMR project:

What It Does

  • Compares two container images side-by-side (local build vs. Docker Hub reference)
  • Measures key performance metrics:
    • Startup time: How long until the container is healthy
    • Throughput: Requests per second under load (Apache Bench)
    • Resource usage: CPU and memory during operation

Why It's Useful for OpenEMR Container Development

  1. Quantifies optimizations: Instead of guessing, you get hard numbers (e.g., "4.9x faster startup")
  2. Prevents regressions: Compare any changes against the baseline Docker Hub image
  3. Quick feedback loop: Full benchmark suite runs in 3-5 minutes
  4. Reproducible results: Documented methodology with timestamp-organized output
  5. Analysis tools: Includes summary.sh, compare_results.sh, and CSV export for deeper analysis

Recommended Use Cases

  • Before PRs: Verify performance impact of container changes
  • CI/CD integration: Automated regression detection in GitHub Actions
  • Version comparisons: Test across OpenEMR versions (7.0.3, 7.0.4, 7.0.5, etc.)
  • Optimization validation: Prove that changes actually improve performance

The utility democratizes performance testing—any contributor can run benchmarks locally and share quantified results, making it easier to evaluate and merge container improvements.

Fixes shellcheck issues in compare_results.sh

@kojiromike kojiromike left a comment


This is great. I'm looking forward to getting more benchmarking data out of the gate.
I hope you don't mind my reviewing it early despite the WIP marker. I just don't know when I'll get another chance.

Mainly my comments are around taking better advantage of bash. If we're not going to try to be POSIX compliant then we should take full advantage of bash features, particularly integer arithmetic and shopts like nullglob, globstar and maybe extglob.
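
For readers unfamiliar with those features, a tiny illustration (not code from this PR) of the globbing shopts plus built-in integer arithmetic:

```bash
#!/bin/bash
# nullglob: unmatched globs expand to nothing; globstar: ** recurses into subdirectories.
shopt -s nullglob globstar
count=0
for f in ./results/**/benchmark_*.txt; do
  (( count += 1 ))   # integer arithmetic without spawning expr or bc
done
printf 'found %d benchmark result files\n' "${count}"
```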

The other stuff is about taking better advantage of current docker buildtime features like caching.

And maybe it's not a conversation for today, but maybe we don't need to have bash, python and php on the image. Some day, we should consider rewriting these tools in a single language. We can take advantage of Symfony CLI and other libs already included in openemr to avoid adding more overhead or dependency management.

RUN apk add --no-cache build-base \
# Clone OpenEMR repository (shallow clone to reduce image size)
&& git clone https://github.com/openemr/openemr.git --depth 1 \
&& rm -rf openemr/.git \

If we're cloning master and nuking .git, we could fetch the archive from GitHub and unpack it instead.
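
One possible shape of that suggestion (a sketch, not a tested replacement for the existing RUN step), fetching GitHub's auto-generated tarball for the branch and unpacking it:

```dockerfile
# Sketch: download and unpack the source tarball instead of git clone + rm -rf .git
RUN wget -qO- https://github.com/openemr/openemr/archive/refs/heads/master.tar.gz \
      | tar -xz \
 && mv openemr-master openemr
```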

# Install PHP dependencies (production mode, no dev dependencies)
&& composer install --no-dev \
# Install Node.js dependencies for frontend build
&& npm install --unsafe-perm \

npm install/build could be done in a previous stage (docker multistage) and only the assets copied over.
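
A rough multi-stage sketch of that idea; the stage names, Node version, build command, and asset paths are all assumptions, not OpenEMR's actual build layout:

```dockerfile
# Sketch: build frontend assets in their own stage, copy only the output.
FROM node:20-alpine AS assets
WORKDIR /build
COPY package.json package-lock.json ./
RUN npm ci --unsafe-perm
COPY . .
RUN npm run build   # placeholder for whatever the real asset build step is

FROM alpine:3.20 AS final
COPY --from=assets /build/public/assets /var/www/localhost/htdocs/openemr/public/assets
```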

&& cd ../ \
# Clean up vendor and asset files using Phing build tool
&& composer global require phing/phing \
&& /root/.composer/vendor/bin/phing vendor-clean \

Doing composer install in a stage and copying only the needed bits could also reduce the need for manual cleaning applications.
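
Sketch of the same pattern for Composer (paths and the final-stage layout are assumptions):

```dockerfile
# Sketch: resolve PHP dependencies in a throwaway stage, copy only vendor/ into the image.
FROM composer:2 AS php-deps
WORKDIR /build
COPY composer.json composer.lock ./
RUN composer install --no-dev --optimize-autoloader

FROM alpine:3.20 AS final
COPY --from=php-deps /build/vendor /var/www/localhost/htdocs/openemr/vendor
```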

&& composer dump-autoload --optimize --apcu \
# Clear Composer and npm caches to reduce image size
&& composer clearcache \
&& npm cache clear --force \

We should take advantage of docker's support for build-time caching instead of disabling or nuking them at build time.

OpenCoreEMR can share how we're doing this for our own builds some time next week. Cc: @msummers42
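
For reference, the BuildKit feature being alluded to looks roughly like this (requires BuildKit/buildx; the cache paths are assumptions and depend on the base image and tool configuration):

```dockerfile
# Sketch: cache mounts persist package caches across builds without baking them into layers,
# which removes the need for `composer clearcache` or `npm cache clear --force`.
RUN --mount=type=cache,target=/root/.composer/cache \
    composer install --no-dev
RUN --mount=type=cache,target=/root/.npm \
    npm ci --unsafe-perm
```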

# Install kcov dependencies
# Install kcov build dependencies
# kcov is a code coverage tool that requires compilation from source
RUN apk add --no-cache bash \

We have bash above, now.

local files=()
while IFS= read -r file; do
files+=("${file}")
done < <(find "${RESULTS_DIR}" -name "benchmark_*.txt" -type f 2>/dev/null | sort || true)

Suggest using globstar/nullglob here instead of find.
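
Roughly what that would look like (a sketch; ** needs bash with globstar enabled, and glob expansion is already lexically sorted):

```bash
# Sketch: let the shell expand the pattern instead of spawning find | sort
shopt -s nullglob globstar
files=( "${RESULTS_DIR}"/**/benchmark_*.txt )   # empty array when nothing matches
```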


for file in "${recent_files[@]}"; do
local timestamp
timestamp=$(basename "${file}" | sed 's/benchmark_//' | sed 's/\.txt$//')

I suggest using parameter expansion here instead of basename/sed/sed
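
For example, something along these lines (sketch):

```bash
# Sketch: pure parameter expansion, no basename/sed subprocesses
timestamp=${file##*/}               # strip the directory part
timestamp=${timestamp#benchmark_}   # drop the prefix
timestamp=${timestamp%.txt}         # drop the extension
```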

RESULTS_DIR="${RESULTS_DIR:-./results}"

# Colors for output
GREEN='\033[0;32m'

I suggest using tput in all these cases.
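
i.e. something like this (sketch), which degrades gracefully when there is no capable terminal:

```bash
# Sketch: derive colors from terminfo instead of hard-coded escape sequences
GREEN=$(tput setaf 2 2>/dev/null || true)
BLUE=$(tput setaf 4 2>/dev/null || true)
NC=$(tput sgr0 2>/dev/null || true)
```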

NC='\033[0m' # No Color

log_info() {
echo -e "${BLUE}${NC} $*"

I suggest using printf instead of echo -e
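
For instance (sketch): %b expands the backslash escapes currently stored in the color variables; if the colors came from tput instead, a plain %s would do.

```bash
# Sketch: printf behaves consistently across shells, unlike echo -e
log_info() {
  printf '%b %s\n' "${BLUE}${NC}" "$*"
}
```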

extract_metric() {
local file="${1}"
local metric="${2}"
grep "^${metric}=" "${file}" | cut -d'=' -f2 | sed 's/s$//' | sed 's/ms$//' || echo ""

This can be replaced with a single awk command, see above.
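
Something like the following sketch would do it in one process (it also strips a trailing "ms" in one go, which the chained seds don't quite do):

```bash
# Sketch: single awk call instead of grep | cut | sed | sed
extract_metric() {
  awk -F= -v m="${2}" '$1 == m { sub(/m?s$/, "", $2); print $2 }' "${1}"
}
```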


@Jmevorach
Contributor Author

Thank you for taking the time to go through all this, @kojiromike; I really appreciate it. I'll be busy for a little bit, but I should have time in ~2 weeks to start going through all the comments you left and making edits. If anyone wants to step in before then and start making edits themselves, please feel free; I welcome all the help I can get on this PR.

The time from container start to when it passes healthchecks is really important to me, because the shorter that time is, the better our autoscaling will be when running multiple horizontal replicas. If it takes 5 minutes to spin up a new node, it doesn't mean you can't autoscale, but it does mean you won't be able to autoscale as well as if it took 1 minute. There are things you can do with Karpenter and other autoscalers to autoscale performantly even when startup time is relatively long (like keeping pre-spun-up hosts ready to serve traffic rather than spinning up hosts from scratch every time there's an autoscaling event), but reducing the "time from start to healthy" will improve horizontal autoscaling no matter where OpenEMR is run.

While reducing the size of the container image is good in general (especially because it helps reduce the hardware specs you'd need to run OpenEMR using Docker and the time it takes to pull the container), the rise of container caching technologies (specifically for Kubernetes and basically anywhere else you'd host this in a large public cloud provider) means this is something I'm planning to prioritize less than reducing the number of operations the container has to perform when it starts, because I think startup work impacts startup time far more than container image size does.

I feel the same way about reducing the build times of the containers. Reducing build times is never a bad thing (especially because it allows developers to test more easily), but I think that minimizing the operations performed on container start, and performing those operations as quickly as possible (especially when they're the same each time and can be accomplished at Docker build time or by configuring the repositories that build materials are pulled from), is going to give us the greatest "value for effort" for the time being.

I think the fastest way to startup would basically be to ...

  1. Elect a leader.
  2. Have the leader launch multiple parallel processes (as many as we can manage without sacrificing reliability) to complete setup and collect the PIDs of each so that when they complete the leader can make sure they all did so successfully.
  3. As much as possible while waiting for the leader to finish have follower containers also perform their own setup tasks (and parallelize them to the greatest extent possible).
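
A very rough bash sketch of steps 2 and 3, just to make the shape concrete; the task names below are placeholders, not functions that exist in openemr.sh:

```bash
# Sketch: leader runs setup tasks in parallel and verifies each one finished successfully.
configure_database()    { sleep 1; }   # placeholder task
generate_certificates() { sleep 1; }   # placeholder task
warm_caches()           { sleep 1; }   # placeholder task

declare -A pids
for task in configure_database generate_certificates warm_caches; do
  "${task}" &
  pids["${task}"]=$!
done

failed=0
for task in "${!pids[@]}"; do
  if ! wait "${pids[${task}]}"; then
    echo "setup task failed: ${task}" >&2
    failed=1
  fi
done
(( failed == 0 )) || exit 1
```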

What I'd really like to do is introduce automated CI/CD for container builds that verifies that all of this functionality works as it's supposed to:

  • Manual setup mode
  • SSL/TLS certificate configuration
  • Redis session handling
  • Kubernetes mode (admin/worker roles)
  • Swarm mode coordination
  • Upgrade path from previous versions
  • XDebug configuration
  • Let's Encrypt integration
  • Multi-site configurations

I've managed to make a "high availability" docker compose setup ("high availability" is in quotes because even if you have two containers running, running both of them on the same host is not really high availability) that launches two containers behind a load balancer, which can be used to test that Swarm node configuration works. So I'm making some progress in this regard, but, again, I'll happily take all the help I can get on pushing this forward.

The benchmarking is a good first step because it allows us to quickly verify the performance impact of a change to one of the containers, but as long as we have to more or less manually verify that the functionality above works as expected, there's a limit to how fast we'll be able to iterate. By the way, thanks for all your work introducing the unit tests you've added so far. I'm a big fan of what's going on in the "tests" folder in the "openemr/openemr" repo and everything going on in the ".github" folder in this repo.

If we have CI/CD that verifies all the functionality the container should have and uses the benchmarking utilities then the process to find optimizations like this becomes much easier. Anyone would be able to fork the repository, make a change, and then know in minutes whether or not that change improved the container and maintained the functionality we need to properly run OpenEMR.

@kojiromike
Contributor

The time from container start to when it passes healthchecks is something really important…

Agreed

I think the fastest way to startup would basically be to ...

1. Elect a leader.
2. Have the leader launch multiple parallel processes (as many as we can manage without sacrificing reliability) to complete setup and collect the PIDs of each so that when they complete the leader can make sure they all did so successfully.
3. As much as possible while waiting for the leader to finish have follower containers also perform their own setup tasks (and parallelize them to the greatest extent possible).

Let me see if I understand. Is openemr.sh slow when the application is already set up and ready to go, and we're just scaling it horizontally by adding another node in basically OPERATOR mode? Or is the problem that openemr.sh is slow starting up when it has a bunch of setup tasks to do? Or both?

@Jmevorach
Contributor Author

@kojiromike

Both.

Most of my contributions to the OpenEMR containers come from a perspective similar to an operator of a cluster (be it Kubernetes or otherwise, in the case of ECS) who runs OpenEMR on behalf of a larger organization; sort of like a DevOps engineer hired by a hospital and tasked with operating a cluster running OpenEMR. The difference is that rather than being concerned about a specific installation, I'm basically trying to make machines that print out highly reliable setups for other operators. Most of the time I try to do that by acting like I'm an operator myself and stress testing the setups I'm printing out to see what I need to improve in the machine that makes them.

When we first started making OpenEMR on ECS we noticed that the deployment was failing because the healthchecks were timing out. At first we thought this was because the containers had failed some setup task and were just sitting idle until they failed healthchecks and were terminated, but then we realized the containers were simply taking a really long time to start up.

When we run on ECS we use Fargate and run on a collection of Graviton 2 and Graviton 3 processors in whatever AWS region you decide to use. When we run on EKS we tend to use (although auto-mode defines node-pools with multiple types of instances) Graviton 3 or 4. I ended up deciding on Graviton because it was 1/ cheaper than other comparable instance types 2/ There's been a bunch of optimization work done for running PHP on Graviton and I found it generally performed better than comparable x86 based instances. Also AWS has a lot of ARM chips so from the perspective of autoscaling I think the chance that one of our clusters is going to ask an AWS data center for another ARM vCPU and for that data center to say "sorry we're all out" is pretty low.

This allows us to operate OpenEMR at an incredibly large scale and to do so relatively affordably. The initial load testing results we got from the ECS version alone (which is arguably less performant overall than the EKS setup) show us comfortably serving the equivalent of ~20K virtual users or ~4,000 req/sec with the equivalent of $0.079/hour worth of Fargate compute that never went beyond 19% avg. CPU utilization or 31% avg. RAM utilization. I truly believe that both the ECS and EKS setups as they exist today could be configured to meet the demands of even national or global scale deployments and serve many, many concurrent users.

I'd say it takes about ~9-12 minutes for a leader pod to complete configuration on ECS and pass healthchecks and about ~6-8 minutes for new follower pods to spin up. There's some variation here because it's luck of the draw whether or not you end up running on Graviton 2 vs 3 and you can notice the difference in testing.

I'd say it takes about ~7-9 minutes for a leader pod to complete configuration on EKS and pass healthchecks and about ~4-6 minutes for a follower pod to spin up. There's some variation here because it's luck of the draw as to what specific processor you end up running on but in general I think the processors we end up running on in our EKS architecture are a little bit more performant than those we get allocated on average when we run on ECS.

The ECS version is cheaper and easier to manage than the EKS version while the EKS version offers integration paths to a lot more tools and a lot more functionality around really granular observability out of the box so I think it makes sense to offer both. In general I'd advise using the ECS based one for smaller to mid-size installations and the EKS based one for mid-size to large installations but you can definitely use either the ECS or EKS one at any scale.

While it's improved over time as we've figured out ways to ...

  1. Run OpenEMR better on EKS and ECS

  2. Improve the startup and overall performance of the docker containers (i.e. introducing Opcache, etc. etc.)

we've still had to set pretty generous healthchecks to get OpenEMR to run in a container orchestrator on AWS. I imagine that as other architectures are created in the future to run OpenEMR on container orchestrators elsewhere we'll probably run into similar issues.

It's super normal for applications to take a few minutes to start, either as replicas or, especially, when they boot for the first time. However, applications that take a while to boot and are going to be scheduled on Kubernetes or a similar orchestrator generally try to start responding to healthchecks as soon as is reasonably possible, on some endpoint that says "this application might not be ready to serve traffic, but it is operating normally". An example would be an application that starts responding to healthchecks at /health almost immediately, but if you go to the website while it's setting up you see some sort of landing page that says "Application is Starting" or something to that effect. In our case right now, though, openemr.sh starts Apache last, after all the other setup has completed.

If we can get the containers to reliably respond to healthchecks in <2 minutes when they boot, then from the perspective of a user or an operator we'd be able to map our autoscaling capacity very closely to the amount of compute we need at any one time to serve traffic when running in a cloud, and basically all of the nodes would be healthy at any one time, because the longest an unhealthy node would be allowed to serve traffic would be <2 minutes. This would make the system cheaper and more reliable overall.

You can get around the long startup times today by setting the autoscaling values lower than you'd probably want to. Instead of scheduling a scaling event at 70-90% average CPU utilization and worrying about running up against more traffic than you can serve while new instances spin up, you can set it at 40-60% or something similar, so that you're more conservative about how much compute you have ready to serve traffic at any one time.

Aside from requiring you to spend more money, you can get away with using this method to reliably serve even large volumes of OpenEMR traffic when running in a public cloud.

What's perhaps more concerning than cost is the case where a node fails and we don't have many nodes up at the time, because we can serve a lot of traffic with just a few nodes. Because we've had to configure the healthchecks to be really lenient to accommodate a ~5-7 minute window where a container is running but not responding to healthchecks, this can hypothetically lead to a scenario where a user experiences degraded functionality over a period of a few minutes and doesn't really know why.

Let's imagine a hypothetical situation where 1 of 4 hosts is doing nothing but serving 5xx responses to users. Neither of our setups needs to use sticky sessions, and most of the time we're just evenly distributing traffic to our replicas behind a load balancer. That means about 1 in 4 requests sent by a user doesn't work, so from their perspective they're clicking around in OpenEMR, 1 in 4 clicks doesn't work correctly in the UI, and they're left scratching their head as to what's going on; it may take a few minutes for that to resolve. It's not the biggest deal in the world, but if we were to lower the "time to healthy" by even a few minutes, it would create a scenario in which basically all of the nodes, minus a minute or two here and there, were healthy at all times.

The faster a container starts responding to healthchecks, the faster we can have the orchestrator remove unhealthy nodes and replace them with healthy ones, which greatly improves the reliability of the system overall. We'd also save money on autoscaling, because we could set higher CPU and RAM thresholds for triggering scale-out events.

In most orchestrators (including ECS and EKS) you can set a startup grace period that says something to the effect of "this container might take 5 minutes to boot initially, but after that first 5-minute grace period do more aggressive healthchecks", which is something we do; so it definitely doesn't take us ~9-11 minutes to identify and remove unhealthy nodes, but the window is still pretty long relative to most applications you'd run in a cluster, so it's something I'm still thinking about.
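
For what it's worth, Docker itself has an equivalent knob; the values below are illustrative, and it assumes curl exists in the image:

```dockerfile
# Sketch: be lenient during startup, aggressive afterwards
HEALTHCHECK --start-period=5m --interval=30s --timeout=5s --retries=3 \
  CMD curl -fsS http://localhost/ || exit 1
```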

This PR aims to address three things I've noticed when running OpenEMR as a horizontal replica:

  1. The time to configure the MySQL database the first time can take a while.

  2. The file operations like chmod/chown performed by each container on startup can take a while, and unless you're doing development work I don't think it's necessary to keep them in the production container, which is basically just meant to reliably serve traffic.

  3. When starting OpenEMR in a swarm, if the leader failed for whatever reason, all of the remaining containers would sit idle waiting for the leader (which no longer exists because it crashed and got swapped out) to create a "docker-completed" file, and eventually the whole deployment would fail. You'd end up with three or four containers doing nothing but printing "Waiting for docker leader to finish configuration" to the logs over and over. This PR's new "heartbeat" system makes it so that a set of containers can recover from this by a) detecting that the leader has failed and b) electing a new leader to pick up where configuration left off, as sketched below.
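
To make item 3 concrete, here is a very rough sketch of a file-based heartbeat on a shared volume; the path, interval, and staleness threshold are assumptions, not the exact logic in this PR:

```bash
# Sketch: leader refreshes a heartbeat file; followers treat a stale file as a dead leader.
HEARTBEAT=/var/www/localhost/htdocs/openemr/sites/default/docker-leader-heartbeat
STALE_AFTER=60   # seconds without a refresh before followers assume the leader died

leader_heartbeat_loop() {      # run in the background by the current leader
  while true; do
    date +%s > "${HEARTBEAT}"
    sleep 10
  done
}

leader_is_alive() {            # checked by followers while they wait for docker-completed
  [[ -f "${HEARTBEAT}" ]] || return 1
  local last now
  last=$(<"${HEARTBEAT}")
  now=$(date +%s)
  (( now - last < STALE_AFTER ))
}
```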

I think we're actually pretty close to the point where any more startup optimizations would be nice to have but not as critical as they are today. That's all while acknowledging that the current setups can perform pretty well today and host OpenEMR with autoscaling for even incredibly large deployments.

What I'd like to introduce in a subsequent PR (or I'd be happy to support anyone who'd like to lead this) would be to modify the containers so that they basically start passing some basic healthcheck as quickly as possible. We can still layer more aggressive and thorough sets of healthchecks on top of that (and even give them grace periods if need be) but having an indication as early as possible that the container is progressing normally even if it's not finished with everything it needs to do would be super helpful.

I think this PR plus maybe one or two more is all we'd need to do to ensure that startup times were fast enough in the production container (Flex containers wouldn't be in scope for this effort) to pass healthchecks quickly and then everything else beyond that would be more of a "nice to have" rather than something that would substantially impact the overall functioning of the system.
