
Add Bitaxe cooling control #33

Open
jayrmotta wants to merge 1 commit into 256foundation:main from jayrmotta:feature/fan-controller

Conversation

@jayrmotta

@jayrmotta jayrmotta commented Mar 3, 2026

Summary

Currently, the fan controller in Mujina sets the fan speed to 100% when the system initializes and leaves it there for the entire firmware lifecycle. Issue #9 describes this behavior and points to esp-miner's PID controller implementation for temperature control.

Changes

  • Add fan control for Bitaxe boards
  • Introduce temperature filtering to smooth sensor readings
  • Integrate cooling control into existing Bitaxe board lifecycle

Goal

Improve thermal management and protect Bitaxe hardware under sustained load by actively controlling the fan based on filtered temperature readings.

Disclaimer

I'm currently unable to stabilize the Bitaxe with the fan controller alone. In a previous experiment I was also controlling the ASIC frequency, which did help, but for simplicity (under @rkuester's guidance) we decided to introduce the fan control first and maybe add frequency control later.

I got a second Bitaxe, thinking it might not heat up as much as the first one, but I'm struggling to flash and run bitaxe-raw on it.

Testing

  • Verified that the temperature filter accepts realistic readings into a bounded sliding window while rejecting values that fall outside the reasonable operating range;
  • Checked that sudden, unrealistic jumps in temperature are treated as noise and discarded, without preventing subsequent, more plausible readings from being used;
  • Confirmed that the control loop combines proportional and integral terms correctly, accumulating error over time while keeping that accumulated value within sensible bounds to avoid runaway behavior;
  • Ensured the controller can be reset between runs so that past error does not leak into new control cycles;
  • Tested that no fan speed changes are requested until a valid temperature reading is actually available;
  • Verified that, once a valid temperature is present, the controller periodically computes and publishes fan speed updates based on the difference between the current temperature and the configured target.

Fixes: #9


@johnnyasantoss johnnyasantoss left a comment

Tested it thoroughly and tweaked the parameters a lot before realizing that the problem wasn't the PI controller's parameters: the output was being added to something else and then clamped.

This made the PI "fight" against the base 25% fan speed.

With the changes I proposed here, I've been running it for hours now under 65°C. When it gets really hot (by tweaking core freq + voltage), the fan actually goes to 100%.

I tested this on a Bitaxe Gamma (601) but had to lower the frequency steps and the voltage (it would get too hot with the stock settings). The settings I'm using are freq 412.5 MHz (74 ramp steps) and 1.01 V.

This may be unrelated but I can run much higher frequencies on ESP-Miner with the same hardware.

}

impl PiController {
    fn new(kp: f32, ki: f32, integral_min: f32, integral_max: f32) -> Self {

Maybe add a log like debug!(kp, ki, integral, "Initializing PI Controller");

Should this allow initializing the integral at an arbitrary value? On restarts the chip is already hot, and going back to zero and waiting for the ramp-up can be problematic.

Author

That's why I chose to have the fan at 100% by default.

If passing an initial integral were possible, what would keep track of the latest known value? Continuously writing it to persistent storage such as the filesystem?

Comment on lines +39 to +57
if !(-20.0..=100.0).contains(&temp) {
    return None;
}

if !self.window.is_empty() {
    let avg = self.window.iter().sum::<f32>() / self.window.len() as f32;
    let deviation = (temp - avg).abs();
    if deviation > self.max_deviation_c {
        return None;
    }
}

if self.window.len() == self.window_size as usize {
    self.window.pop_front();
}

self.window.push_back(temp);

Some(temp)

I was testing this PR this morning with @jayrmotta and it seems that the temp filter is too aggressive. On restarts it can reject all temperatures because the chip is hot from the start.

I would make this more lenient or just remove it. The integral would fluctuate a bit but then go back to equilibrium. Maybe just sanity check for values outside of what the hardware can REALLY read?

Author

@jayrmotta jayrmotta Mar 12, 2026

In my case I often get ~5 wrong readings in a row before I start getting valid ones, so this filter has been really useful. All it does is protect against unrealistic values or sudden changes that cross what I called a noise threshold.

I think the noise threshold can be tweaked so it's not too sensitive, as we did in the call. That, in conjunction with filtering out values like 127.875 or 121 that we've been seeing, makes for a reliable source of temperature readings.

Collaborator

If you look at the esp-miner source, you can see we do this too. The issue is that the temp readings are invalid until the ASIC core is fully powered.
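
To make the trade-off in this thread concrete, here is a minimal, self-contained sketch of the filter logic quoted from the diff. The struct layout, the constructor, and the `consider` name (borrowed from the later review comment) are assumptions rather than the PR's actual code. The first helper shows out-of-range values such as 127.875 and 121 being rejected before plausible readings arrive. The second shows one way the "rejects everything" symptom can arise with this logic: if a low but in-range reading seeds the window, later hot readings all exceed the deviation threshold, and because rejected readings never enter the window, the average never catches up.

```rust
use std::collections::VecDeque;

/// Hypothetical reconstruction of the filter; field names follow the quoted diff.
struct TemperatureFilter {
    window: VecDeque<f32>,
    window_size: usize,
    max_deviation_c: f32,
}

impl TemperatureFilter {
    fn new(window_size: usize, max_deviation_c: f32) -> Self {
        Self { window: VecDeque::new(), window_size, max_deviation_c }
    }

    /// Returns the reading if it is accepted, `None` if it is rejected.
    fn consider(&mut self, temp: f32) -> Option<f32> {
        // Reject values outside the plausible operating range.
        if !(-20.0..=100.0).contains(&temp) {
            return None;
        }
        // Reject sudden jumps relative to the window average.
        if !self.window.is_empty() {
            let avg = self.window.iter().sum::<f32>() / self.window.len() as f32;
            if (temp - avg).abs() > self.max_deviation_c {
                return None;
            }
        }
        // Keep the window bounded.
        if self.window.len() == self.window_size {
            self.window.pop_front();
        }
        self.window.push_back(temp);
        Some(temp)
    }
}

/// Invalid start-up readings followed by plausible ones and one spike.
fn filter_startup_noise() -> Vec<Option<f32>> {
    let mut filter = TemperatureFilter::new(4, 10.0);
    let readings: [f32; 6] = [127.875, 121.0, 62.0, 63.5, 95.0, 64.0];
    readings.iter().map(|&t| filter.consider(t)).collect()
}

/// A low in-range first reading (illustrative value) followed by a hot chip.
fn filter_stuck_after_bad_seed() -> Vec<Option<f32>> {
    let mut filter = TemperatureFilter::new(4, 10.0);
    let readings: [f32; 4] = [25.0, 70.0, 71.0, 72.0];
    readings.iter().map(|&t| filter.consider(t)).collect()
}
```

With a window of 4 and a 10 °C threshold (both illustrative values), the first helper accepts 62.0, 63.5, and 64.0 and rejects the rest. The second accepts only the 25.0 seed. A more lenient threshold, or seeding the window only once the ASIC is powered, would avoid the stuck state.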

@rkuester
Collaborator

This may be unrelated but I can run much higher frequencies on ESP-Miner with the same hardware.

I'd be quite interested in getting to the bottom of any differences like this between Mujina and esp-miner. Do you mean esp-miner runs a lot cooler than Mujina for the same frequency? If so, have you compared the chip voltage in both cases?

(FWIW, I'm working up a larger review of this PR.)

@rkuester
Collaborator

Thanks for working on this, Jayr! It sounds like you guys are getting the PI controller math and the temperature filter dialed in. I have some architectural feedback that I think will make this simpler and easier to build on.

The core job here is: read temperature, compute fan speed, set fan speed. That's a sequential operation, but right now the PR spreads it across four async tasks—fan worker dispatch loop, temperature reader, PI controller, and fan speed listener—connected by five channels. That's a lot of moving parts. Let me walk through what I'd suggest instead.

Remove fan worker actor

The FanWorkerCommand enum, oneshot response channels, dispatch loop, and fan_worker_request helper add up to ~150 lines that serialize access to the EMC2101. But look at how the TPS546 voltage regulator is handled a few fields away in the same struct:

regulator: Option<Arc<Mutex<Tps546<BitaxeRawI2c>>>>,

The regulator is shared between the stats monitor and the board lifecycle with a simple Arc<Mutex<>>.

But actually, we can do even better. Instead of sharing the EMC2101 behind a mutex, let's give it a single owner.

One owner for the EMC2101

The stats monitor needs temperature, fan setpoint, and RPM. The fan controller needs temperature. Both need the EMC2101. Rather than sharing it, let one task own it exclusively and publish a cached snapshot for everyone else:

/// Cached readings, published via watch channel.
struct ThermalReadings {
    temperature: Option<f32>,
    fan_percent: Option<u8>,
    fan_rpm: Option<u32>,
}

That single task does everything in one loop:

loop {
    // Read hardware (only task that touches the EMC2101)
    let temp = emc2101.get_external_temperature().await.ok();
    let rpm = emc2101.get_rpm().await.ok();
    let pct = emc2101.get_fan_speed().await.ok().map(u8::from);

    // Run control math, set speed
    if let Some(temp) = temp {
        if let Some(speed) = controller.update(temp, dt) {
            emc2101.set_fan_speed(speed).await;
        }
    }

    // Publish snapshot for stats monitor
    snapshot_tx.send(ThermalReadings {
        temperature: temp,
        fan_percent: pct,
        fan_rpm: rpm,
    });

    sleep(interval).await;
}

The stats monitor holds a watch::Receiver<ThermalReadings> and calls .borrow() whenever it needs the latest readings—it can call it as often as it likes, it just gets a read-only snapshot of whatever the thermal task last published. No I2C, no contention.

This loop lives in a single spawned task—similar to how spawn_stats_monitor works today—replacing the current four. Shutdown is straightforward: cancel the one task, and set the fan to a safe speed.
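
The single-owner idea can be tried out without any async runtime. The sketch below is a standard-library stand-in, not the proposed design itself: it uses a thread and an `Arc<RwLock<...>>` where the comment above suggests a spawned task and a watch channel. `FakeSensor` and the threshold-based speed rule are placeholders invented for the example. What it illustrates is that only one piece of code touches the sensor, while readers see only the last published snapshot.

```rust
use std::sync::{Arc, RwLock};
use std::thread;

#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct ThermalReadings {
    temperature: Option<f32>,
    fan_percent: Option<u8>,
    fan_rpm: Option<u32>,
}

/// Hypothetical stand-in for the EMC2101 driver: replays canned temperatures.
struct FakeSensor {
    temps: Vec<f32>,
    fan_percent: u8,
}

/// The only code that touches the sensor. It reads, decides, writes the fan
/// speed, then publishes a snapshot for everyone else.
fn run_thermal_task(mut sensor: FakeSensor, snapshot: Arc<RwLock<ThermalReadings>>) {
    for temp in sensor.temps.clone() {
        // Placeholder control law (assumption): a real task would call the
        // fan controller here.
        sensor.fan_percent = if temp > 60.0 { 100 } else { 25 };

        let readings = ThermalReadings {
            temperature: Some(temp),
            fan_percent: Some(sensor.fan_percent),
            fan_rpm: None, // the fake sensor has no tachometer
        };
        *snapshot.write().unwrap() = readings;
    }
}

/// Runs the owner to completion, then reads the snapshot like a stats monitor would.
fn latest_snapshot_after_run() -> ThermalReadings {
    let snapshot = Arc::new(RwLock::new(ThermalReadings::default()));
    let sensor = FakeSensor { temps: vec![55.0, 61.0, 66.0], fan_percent: 0 };

    let writer = {
        let snapshot = Arc::clone(&snapshot);
        thread::spawn(move || run_thermal_task(sensor, snapshot))
    };
    writer.join().unwrap();

    // Readers never talk to the hardware; they copy the last published value.
    let latest = *snapshot.read().unwrap();
    latest
}
```

A watch channel gives the same "latest value wins" semantics with change notification on top, which is why it fits the async version better than a lock.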

This also leaves room for a future API override of the fan speed. The thermal task is the single writer to the EMC2101, so an API "hold fan at X%" command could just tell that task to pause the PI controller and hold a fixed speed until the API releases the constraint.

The fan controller should be a plain object

The PI controller and temperature filter are pure math—they don't need to be async, and they don't need channels. A synchronous update method makes them composable by any caller:

pub struct FanController {
    filter: TemperatureFilter,
    pi: PiController,
    target_temp: f32,
}

impl FanController {
    /// Feed a raw sensor reading; get back a fan speed if the
    /// reading passes the noise filter.
    /// `dt` is the time since the last call. The integral term
    /// needs reasonably stable intervals, but that's a given
    /// since the caller runs on a `time::interval`.
    pub fn update(
        &mut self,
        reading: f32,
        dt: Duration,
    ) -> Option<Percent> {
        let temp = self.filter.consider(reading)?;
        let error = temp - self.target_temp;
        let output = self.pi.update(error, dt);
        let speed = (MIN + output).clamp(MIN, MAX);
        Some(Percent::new_clamped(speed as u8))
    }
}

The caller provides a temperature, gets back a fan speed. That's the whole interface. The board decides how and when to read the sensor and how to apply the result.

The PI gains are tuned assuming a roughly stable tick interval, so dt matters for tuning; but since the caller runs on a fixed timer, it's stable in practice. The target temperature and PI gains (Kp, Ki, integral bounds) should be constructor parameters rather than module-level constants, so different boards can tune for their own thermal characteristics. Something like:

pub struct FanControllerConfig {
    pub target_temp: f32,
    pub kp: f32,
    pub ki: f32,
    pub integral_bounds: (f32, f32),
}

impl FanController {
    pub fn new(config: FanControllerConfig) -> Self { ... }
}
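
For illustration, the two fragments above can be combined into one compilable sketch. Everything not shown in the review comment is an assumption: `Percent` is a stand-in for the project's type, `MIN_SPEED` and `MAX_SPEED` are guessed bounds (25% is the base speed mentioned earlier in the thread), the filter is reduced to a range check, and the gains in the usage helper are arbitrary.

```rust
use std::time::Duration;

/// Hypothetical stand-in for the project's `Percent` type.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Percent(u8);

impl Percent {
    pub fn new_clamped(value: u8) -> Self {
        Percent(value.min(100))
    }
}

pub struct FanControllerConfig {
    pub target_temp: f32,
    pub kp: f32,
    pub ki: f32,
    pub integral_bounds: (f32, f32),
}

pub struct FanController {
    config: FanControllerConfig,
    integral: f32,
}

// Assumed speed limits, in percent.
const MIN_SPEED: f32 = 25.0;
const MAX_SPEED: f32 = 100.0;

impl FanController {
    pub fn new(config: FanControllerConfig) -> Self {
        Self { config, integral: 0.0 }
    }

    /// Feed a raw reading; returns a fan speed if the reading is plausible.
    pub fn update(&mut self, reading: f32, dt: Duration) -> Option<Percent> {
        // Simplified filter (assumption): range check only, no sliding window.
        if !(-20.0..=100.0).contains(&reading) {
            return None;
        }
        let error = reading - self.config.target_temp;

        // Integrate the error, clamped to the configured bounds (anti-windup).
        let (lo, hi) = self.config.integral_bounds;
        self.integral = (self.integral + error * dt.as_secs_f32()).clamp(lo, hi);

        let output = self.config.kp * error + self.config.ki * self.integral;
        let speed = (MIN_SPEED + output).clamp(MIN_SPEED, MAX_SPEED);
        Some(Percent::new_clamped(speed as u8))
    }
}

/// Usage: one invalid reading, then below, slightly above, and far above target.
fn fan_speeds_for_sample_readings() -> Vec<Option<Percent>> {
    let mut controller = FanController::new(FanControllerConfig {
        target_temp: 60.0,
        kp: 4.0,
        ki: 1.0,
        integral_bounds: (0.0, 30.0),
    });
    let dt = Duration::from_secs(1);
    let readings: [f32; 4] = [127.875, 58.0, 65.0, 80.0];
    readings.iter().map(|&t| controller.update(t, dt)).collect()
}
```

With these illustrative gains the four readings produce no output, 25%, 50%, and 100% respectively. Because the configuration is passed in, a board with different thermal behaviour only needs a different `FanControllerConfig`.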

Where things should live

For now, keeping the fan controller in board/bitaxe/ is good. Once it's a synchronous object with no bitaxe-specific dependencies, it'll be easy to lift into a shared module later if another board needs something similar.

The temperature filter is an implementation detail of the controller—I'd fold it into fan_controller.rs rather than giving it a separate module.

Summary

  1. Drop the fan worker actor. One task owns the EMC2101, reads it periodically, runs the fan controller, and publishes a ThermalReadings via watch channel.
  2. Make FanController synchronous. A plain update(temp, dt) -> Option<Percent> method, composing the filter and PI math internally. No channels, no async.
  3. Refactor the stats monitor to read from the snapshot. Just .borrow() on the watch receiver. No I2C access needed.

This should cut the non-test code size significantly and be easier to follow and extend. The PI tuning, filter, logic, and test coverage you've built are good—it's really just the plumbing around them that I'd like to see simplified.

I may be oversimplifying in places; if you encounter details I'm glossing over, let's talk through it. Reaching too eagerly for tasks and channels is a common anti-pattern in async Rust. The right idea is usually to start with synchronous code and only introduce concurrency when you have a concrete reason.

@rkuester rkuester marked this pull request as draft March 13, 2026 00:51
@rkuester
Collaborator

Because this is still a work in progress, I've marked it as a draft PR.

@jayrmotta jayrmotta force-pushed the feature/fan-controller branch from 51457c0 to c721d54 March 13, 2026 02:38
@johnnyasantoss

I'd be quite interested in getting to the bottom of any differences like this between Mujina and esp-miner.

@rkuester continuing it here: #50

@jayrmotta jayrmotta force-pushed the feature/fan-controller branch from c721d54 to d17da6b March 16, 2026 19:47
@jayrmotta jayrmotta marked this pull request as ready for review March 16, 2026 19:52
@jayrmotta
Author

Indeed, that's a lot simpler @rkuester. Please take a look when you have the time.

@Nickamoto
Contributor

Tested the revised architecture on a Bitaxe Gamma 601 (BM1370) running bitaxe-raw over USB — combined with PR #39 locally for macOS IOKit discovery. ~30 minute run: discovery → ASIC init → pool connect → shares submitted → thermal control active throughout. Fan settled at 47% (~4,600 RPM) with ASIC stable at 70°C. The single-task EMC2101 owner + watch channel approach works well in practice.

A few things I noticed that weren't covered in the earlier discussion:

Panic if EMC2101 init fails

init_fan_controller handles a failed fan.init() with a warning and Ok(()), leaving self.fan as None. Then spawn_thermal_task calls .expect("Fan controller must be initialized before spawning thermal monitor") on it. If the hardware isn't there, that's a panic rather than a graceful degradation. Might be worth either propagating the error or skipping the thermal task spawn when self.fan is None, depending on whether the fan is considered required.
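
Both options mentioned here can be expressed without `.expect()`. The sketch below uses invented stand-in types (`Board`, `Fan`, `ThermalStart`, `BoardError`, and a `fan_required` flag), so it says nothing about how the PR actually structures this. It only shows a missing fan turning into either an explicit error or an explicit skip, never a panic.

```rust
/// Hypothetical stand-in for the fan driver handle.
struct Fan;

#[derive(Debug, PartialEq)]
enum ThermalStart {
    Running,
    SkippedNoFan,
}

#[derive(Debug, PartialEq)]
enum BoardError {
    FanRequired,
}

/// Hypothetical board state: `fan` is `None` when fan init failed earlier.
struct Board {
    fan: Option<Fan>,
    fan_required: bool,
}

impl Board {
    fn spawn_thermal_task(&mut self) -> Result<ThermalStart, BoardError> {
        match self.fan.take() {
            // A real implementation would move `_fan` into the spawned task.
            Some(_fan) => Ok(ThermalStart::Running),
            // Fan is mandatory on this board: surface the failure to the caller.
            None if self.fan_required => Err(BoardError::FanRequired),
            // Fan is optional: degrade gracefully and report that we skipped.
            None => Ok(ThermalStart::SkippedNoFan),
        }
    }
}
```

Returning a distinct `SkippedNoFan` value rather than a bare `Ok(())` lets the caller log or react to running without thermal control.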

Dangling doc reference on FanPIController

Line 104: "See module-level documentation for rationale." The module doc just describes what the controller does — the rationale for omitting the derivative term (noise amplification from thermal diodes) isn't actually written down anywhere. It's a good reason and worth capturing; right now the reference just points to nothing.

No unit tests for FanPIController

TemperatureFilter has solid coverage. If the PI update logic, windup clamping, and output clamping ever get touched during a future tuning pass, having a few fixed-input tests there would help catch regressions.

Add fan control and temperature filtering for Bitaxe boards to improve
thermal management and protect hardware under high load.

Co-authored-by: Johnny Santos <johnnyadsantos@gmail.com>
@jayrmotta jayrmotta force-pushed the feature/fan-controller branch from d17da6b to 918dbcc Compare March 18, 2026 18:56
@jayrmotta
Author

jayrmotta commented Mar 18, 2026

Hey @Nickamoto, I appreciate the review, tests, and feedback 💪

I re-added the FanPiController tests. I had removed them recently because of the "Test Behaviors, Not Implementation Details" (TEST.behavior) coding guideline: I couldn't come up with a way to test its logic without tying the test to the implementation, since it's a math-heavy construct. Do you think it makes sense to keep them?

I also added an early return to the thermal task, so if the fan fails to initialize the firmware won't panic. On the other hand, it might then not set the fan to 100% speed, which could potentially put the hardware at risk. I've observed that bitaxe-raw intermittently boots with the fan on or off.

Thanks for the shout about the docs as well; it slipped my mind. I think it's okay now.

@skot skot changed the title Add BitAxe cooling control Add Bitaxe cooling control Mar 18, 2026
@Nickamoto
Contributor

Hey @Nickamoto, I appreciate the review, tests, and feedback 💪

I re-added the FanPiController tests, I had removed them recently because of the Test Behaviors, Not Implementation Details TEST.behavior coding guideline. I couldn't come up with a way to test its logic without tying the test to the implementation since it's a math heavy construct. Do you think it makes sense to keep it?

I also added an early return to the thermal task, so if the fan fails to initialize the firmware won't panic. But on the other hand it might also not set the fan to 100% speed, which could potentially risk the hw? I've observed that bitaxe-raw is intermittent with booting with the fan on/off.

Thanks for the shout with the docs as well, it slipped my mind, I think it's okay now.

On the test question, I think the guideline is on your side for keeping them. TEST.behavior is aimed at tests that break when you refactor internals while preserving the contract, like asserting a hardcoded specificity score. A PI controller is a pure function: given an error and a timestep, you get a control output. Testing that relationship is testing the behavioral contract, not the implementation. If you swap out the math for a different approach someday and the output changes, the test should catch that. I'd keep them.
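
As a rough illustration of what contract-level checks could look like, the sketch below defines a small PI controller and three checks phrased in terms of observable behaviour rather than internal arithmetic. The `PiController` here is a hypothetical reconstruction (the constructor signature matches the one quoted from the diff, everything else is assumed), and the gains are arbitrary.

```rust
/// Hypothetical PI controller; only the `new` signature comes from the diff.
struct PiController {
    kp: f32,
    ki: f32,
    integral: f32,
    integral_min: f32,
    integral_max: f32,
}

impl PiController {
    fn new(kp: f32, ki: f32, integral_min: f32, integral_max: f32) -> Self {
        Self { kp, ki, integral: 0.0, integral_min, integral_max }
    }

    /// `error` is measured minus target (°C); `dt_secs` is time since the last call.
    fn update(&mut self, error: f32, dt_secs: f32) -> f32 {
        // Accumulate, then clamp so the integral cannot wind up.
        self.integral = (self.integral + error * dt_secs)
            .clamp(self.integral_min, self.integral_max);
        self.kp * error + self.ki * self.integral
    }

    fn reset(&mut self) {
        self.integral = 0.0;
    }
}

fn sample() -> PiController {
    PiController::new(2.0, 0.5, 0.0, 20.0)
}

/// A larger temperature error must ask for more cooling.
fn check_hotter_means_more_output() {
    assert!(sample().update(2.0, 1.0) < sample().update(8.0, 1.0));
}

/// Under a sustained error the output must stop growing (anti-windup).
fn check_output_saturates_under_sustained_error() {
    let mut pi = sample();
    let mut last = 0.0;
    for _ in 0..1000 {
        last = pi.update(5.0, 1.0);
    }
    assert_eq!(pi.update(5.0, 1.0), last);
}

/// After a reset the controller must behave like a fresh one.
fn check_reset_forgets_history() {
    let mut used = sample();
    for _ in 0..10 {
        used.update(5.0, 1.0);
    }
    used.reset();
    assert_eq!(used.update(3.0, 1.0), sample().update(3.0, 1.0));
}
```

In a real crate each `check_*` function would carry `#[test]`. None of them pins an exact output value, so a retuning pass changes the numbers without breaking the tests, while a regression in clamping or reset would still be caught.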

On the fan init failure, the early return fixes the panic. This is good, but I'm not sure Ok(()) is the right outcome when thermal protection can't start. The miner continues hashing with no fan control, and the fan stays at whatever state the hardware defaulted to rather than a known safe speed. For a Bitaxe at 70°C that might be fine, but as this code path gets reused on higher-power boards it becomes riskier. Worth considering whether a failed fan init should be a hard error that stops the board from starting rather than a silent degradation. The intermittent boot behavior on bitaxe-raw is worth noting, but I wouldn't let hardware quirks drive the error handling strategy for all boards.


Development

Successfully merging this pull request may close these issues.

feat(bitaxe): implement closed-loop fan control based on temperature

5 participants