Conversation
3db71c8 to
5dd129b
Compare
johnnyasantoss
left a comment
There was a problem hiding this comment.
Tested it throughly and tweaked a lot the parameters before realizing that the problem wasn't that the PI controller didn't have the correct parameters, but the output was being added to something else and then clamped.
This made the PI "fight" back the base 25% fan speed.
With the changes I proposed here I'm Running it for hours now under 65c. When it gets really hot (by tweaking core freq + voltage), the fan actually goes to 100%.
I tested this on a BitAxe Gamma (601) but had to lower the frequency steps and lower the voltage (it would get too hot with the stock settings). The settings I'm using freq 412.5Mhz (74 ramp steps) and 1.01v.
This may be unrelated but I can run much higher frequencies on ESP-Miner with the same hardware.
| } | ||
|
|
||
| impl PiController { | ||
| fn new(kp: f32, ki: f32, integral_min: f32, integral_max: f32) -> Self { |
There was a problem hiding this comment.
maybe add a log like debug!(kp, ki, integral, "Initializing PI Controller");
Should this allow to init integral at any arbitrary value? In case of restarts the chip is already hot and going back to zero to wait the ramp up can be problematic.
There was a problem hiding this comment.
That's why I chose to have the fan by default at 100%
If passing initial integral was possible what would keep memory of the latest known value? Continuously write that to some persistent memory like filesystem?
| if !(-20.0..=100.0).contains(&temp) { | ||
| return None; | ||
| } | ||
|
|
||
| if !self.window.is_empty() { | ||
| let avg = self.window.iter().sum::<f32>() / self.window.len() as f32; | ||
| let deviation = (temp - avg).abs(); | ||
| if deviation > self.max_deviation_c { | ||
| return None; | ||
| } | ||
| } | ||
|
|
||
| if self.window.len() == self.window_size as usize { | ||
| self.window.pop_front(); | ||
| } | ||
|
|
||
| self.window.push_back(temp); | ||
|
|
||
| Some(temp) |
There was a problem hiding this comment.
I was testing this PR this morning with @jayrmotta and it seems that the temp filter is too aggressive. On restarts it can reject all temperatures because the chip is hot from the start.
I would make this more lenient or just remove it. The integral would fluctuate a bit but then go back to equilibrium. Maybe just sanity check for values outside of what the hardware can REALLY read?
There was a problem hiding this comment.
In my case I often get ~5 wrong readings in a row before I start getting valid ones, so this filter has been really useful. The only thing it does is protect against unrealistic values or sudden changes that crosses what I called a noise threshold.
I think the noise threshold can be tweaked so it's not too sensitive as we did in the call, then that in conjunction with filtering values like 127.875 or 121 we've been seeing makes for a reliable source of temperature readings.
There was a problem hiding this comment.
if you look at the esp-miner source you can see we do this too. The issue is that the temp readings are invalid until the ASIC core is fully powered.
5dd129b to
51457c0
Compare
I'd be quite interested in getting to the bottom of any differences like this between Mujina and esp-miner. Do you mean esp-miner runs a lot cooler than Mujina for the same frequency? If so, have you compared the chip voltage in both cases? (FWIW, I'm working up a larger review of this PR.) |
|
Thanks for working on this, Jayr! It sounds like you guys are getting the PI controller math and the temperature filter dialed in. I have some architectural feedback that I think will make this simpler and easier to build on. The core job here is: read temperature, compute fan speed, set fan speed. That's a sequential operation, but right now the PR spreads it across four async tasks—fan worker dispatch loop, temperature reader, PI controller, and fan speed listener—connected by five channels. That's a lot of moving parts. Let me walk through what I'd suggest instead. Remove fan worker actorThe regulator: Option<Arc<Mutex<Tps546<BitaxeRawI2c>>>>,The regulator is shared between the stats monitor and the board lifecycle with a simple But actually, we can do even better. Instead of sharing the EMC2101 behind a mutex, let's give it a single owner. One owner for the EMC2101The stats monitor needs temperature, fan setpoint, and RPM. The fan controller needs temperature. Both need the EMC2101. Rather than sharing it, let one task own it exclusively and publish a cached snapshot for everyone else: /// Cached readings, published via watch channel.
struct ThermalReadings {
temperature: Option<f32>,
fan_percent: Option<u8>,
fan_rpm: Option<u32>,
}That single task does everything in one loop: loop {
// Read hardware (only task that touches the EMC2101)
let temp = emc2101.get_external_temperature().await.ok();
let rpm = emc2101.get_rpm().await.ok();
let pct = emc2101.get_fan_speed().await.ok().map(u8::from);
// Run control math, set speed
if let Some(temp) = temp {
if let Some(speed) = controller.update(temp, dt) {
emc2101.set_fan_speed(speed).await;
}
}
// Publish snapshot for stats monitor
snapshot_tx.send(ThermalReadings { temp, pct, rpm });
sleep(interval).await;
}The stats monitor holds a This loop lives in a single spawned task—similar to how This also leaves room for a future API override of the fan speed. The thermal task is the single writer to the EMC2101, so an API "hold fan at X%" command could just tell that task to pause the PI controller and hold a fixed speed until the API releases the constraint. The fan controller should be a plain objectThe PI controller and temperature filter are pure math—they don't need to be async, and they don't need channels. A synchronous pub struct FanController {
filter: TemperatureFilter,
pi: PiController,
target_temp: f32,
}
impl FanController {
/// Feed a raw sensor reading; get back a fan speed if the
/// reading passes the noise filter.
/// `dt` is the time since the last call. The integral term
/// needs reasonably stable intervals, but that's a given
/// since the caller runs on a `time::interval`.
pub fn update(
&mut self,
reading: f32,
dt: Duration,
) -> Option<Percent> {
let temp = self.filter.consider(reading)?;
let error = temp - self.target_temp;
let output = self.pi.update(error, dt);
let speed = (MIN + output).clamp(MIN, MAX);
Some(Percent::new_clamped(speed as u8))
}
}The caller provides a temperature, gets back a fan speed. That's the whole interface. The board decides how and when to read the sensor and how to apply the result. The PI gains are tuned assuming a roughly stable tick interval, so pub struct FanControllerConfig {
pub target_temp: f32,
pub kp: f32,
pub ki: f32,
pub integral_bounds: (f32, f32),
}
impl FanController {
pub fn new(config: FanControllerConfig) -> Self { ... }
}Where things should liveFor now, keeping the fan controller in The temperature filter is an implementation detail of the controller—I'd fold it into Summary
This should cut the non-test code size significantly and be easier to follow and extend. The PI tuning, filter, logic, and test coverage you've built are good—it's really just the plumbing around them that I'd like to see simplified. I may be oversimplifying in places; if you encounter details I'm glossing over, let's talk through it. Reaching too eagerly for tasks and channels is a common anti-pattern in async Rust. The right idea is usually to start with synchronous code and only introduce concurrency when you have a concrete reason. |
|
Because this is still a work in progress, I've marked it as a draft PR. |
51457c0 to
c721d54
Compare
c721d54 to
d17da6b
Compare
|
Indeed that's a lot simpler @rkuester, please take a look when you got the time |
|
Tested the revised architecture on a Bitaxe Gamma 601 (BM1370) running bitaxe-raw over USB — combined with PR #39 locally for macOS IOKit discovery. ~30 minute run: discovery → ASIC init → pool connect → shares submitted → thermal control active throughout. Fan settled at 47% (~4,600 RPM) with ASIC stable at 70°C. The single-task EMC2101 owner + watch channel approach works well in practice. A few things I noticed that weren't covered in the earlier discussion: Panic if EMC2101 init fails init_fan_controller handles a failed fan.init() with a warning and Ok(()), leaving self.fan as None. Then spawn_thermal_task calls .expect("Fan controller must be initialized before spawning thermal monitor") on it. If the hardware isn't there, that's a panic rather than a graceful degradation. Might be worth either propagating the error or skipping the thermal task spawn when self.fan is None, depending on whether the fan is considered required. Dangling doc reference on FanPIController Line 104: "See module-level documentation for rationale." The module doc just describes what the controller does — the rationale for omitting the derivative term (noise amplification from thermal diodes) isn't actually written down anywhere. It's a good reason and worth capturing; right now the reference just points to nothing. No unit tests for FanPIController TemperatureFilter has solid coverage. If the PI update logic, windup clamping, and output clamping ever get touched during a future tuning pass, having a few fixed-input tests there would help catch regressions. |
Add fan control and temperature filtering for Bitaxe boards to improve thermal management and protect hardware under high load. Co-authored-by: Johnny Santos <johnnyadsantos@gmail.com>
d17da6b to
918dbcc
Compare
|
Hey @Nickamoto, I appreciate the review, tests, and feedback 💪 I re-added the I also added an early return to the thermal task, so if the fan fails to initialize the firmware won't panic. But on the other hand it might also not set the fan to 100% speed, which could potentially risk the hw? I've observed that bitaxe-raw is intermittent with booting with the fan on/off. Thanks for the shout with the docs as well, it slipped my mind, I think it's okay now. |
On the test question, I think the guideline is on your side for keeping them. TEST.behavior is aimed at tests that break when you refactor internals while preserving the contract, like asserting a hardcoded specificity score. A PI controller is a pure function: given an error and a timestep, you get a control output. Testing that relationship is testing the behavioral contract, not the implementation. If you swap out the math for a different approach someday and the output changes, the test should catch that. I'd keep them. On the fan init failure, the early return fixes the panic. This is good, but I'm not sure Ok(()) is the right outcome when thermal protection can't start. The miner continues hashing with no fan control, and the fan stays at whatever state the hardware defaulted to rather than a known safe speed. For a Bitaxe at 70°C that might be fine, but as this code path gets reused on higher-power boards it becomes riskier. Worth considering whether a failed fan init should be a hard error that stops the board from starting rather than a silent degradation. The intermittent boot behavior on bitaxe-raw is worth noting, but I wouldn't let hardware quirks drive the error handling strategy for all boards. |
Summary
Currently the fan controller available on Mujina sets the fan speed to 100% as the system initializes, it sits at that value through the entire firmware lifecycle. Issue #9 describes this behavior and points to esp-miner's implementation of a PID controller to control the temperature.
Changes
Goal
Improve thermal management and protect Bitaxe hardware under sustained load by actively controlling the fan based on filtered temperature readings.
Disclaimer
I'm currently unable to stabilize the Bitaxe with the fan controller alone, in a previous experiment I was also controlling the ASIC frequency which did help but for simplicity (under @rkuester's guidance) we decided to first introduce the fan control and then maybe later introduce frequency control.
I got a second Bitaxe thinking it would potentially not heat as much as the first one but I'm struggling to flash and run bitaxe-raw on it.
Testing
Fixes: #9