Enterprise servers are loud and hot, and while a couple of 13th-generation Dell servers kept a room nice and toasty in the cold of winter, they're a double whammy to your electric bill in the spring and summer.
The noise can be problematic as well, especially since I no longer have a basement, where I relied on my servers to circulate and dehumidify the stuffy air. Now I actually have to sit in a room with the rack. Which is why I moved it to the bedroom. Strange, I know.
So why the bedroom? Well, the white noise (when under control) helps me sleep, and I can turn down the heat in the winter and not freeze at night. During the day, when I power everything up, I am in my home office (or my second home office, the dining room), which is already loud enough with my workstations and old Cisco switches. That tiny home office, which is technically a small bedroom for your least favorite child, can get downright uncomfortably warm.
Why So Much Noise
As anyone who has worked in a datacenter can tell you, servers are loud. Let's compare the fans in a Dell R630 to those in a Dell R730.
| Server | Form Factor | Effective fan size | Fan Count | Notes |
|---|---|---|---|---|
| Dell R630 | 1U | ~40 mm blower | 7 | Very high RPM, loud, high static pressure |
| Dell R730 | 2U | ~60 mm blower | 6 | Still loud, but more efficient airflow |
This boils down to the fact that a 1U server must spin its fans faster than a 2U server to move the same amount of air through the chassis, measured in CFM (cubic feet per minute).
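To put rough numbers on that, a common rule of thumb is that a fan's airflow scales with RPM times the cube of its diameter. The sketch below uses that rule; the exact exponent and constants vary with blade and duct design, so treat the numbers as illustrative only:

```python
def rpm_for_same_cfm(ref_rpm: float, ref_diameter_mm: float,
                     diameter_mm: float) -> float:
    """RPM a fan of diameter_mm needs to roughly match the reference fan's
    airflow, assuming CFM ~ RPM * diameter^3 (rule-of-thumb scaling only)."""
    return ref_rpm * (ref_diameter_mm / diameter_mm) ** 3

# A ~40 mm 1U fan matching a ~60 mm 2U fan spinning at 8,000 RPM:
print(round(rpm_for_same_cfm(8000, 60, 40)))  # 27000
```

Which is why 1U fans scream: under this scaling, matching a 60 mm fan's airflow takes a 40 mm fan roughly 3.4× the RPM.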
Additionally, if you install a PCIe card that the server itself does not recognize (like a Tesla T4 GPU), you may find your fans spinning at full bore, as the server would rather take flight than overheat.
Enter the Dell Fan Shusher
Luckily, you can use ipmitool to override a Dell server's fan speed and quiet things down a great deal. However, you need to keep an eye on temperatures. For this I created Dell-Server-Fan-Shusher.
It's a Python script that monitors system temps (via sensors) and NVIDIA GPU temps (via nvidia-smi, when present), and sets fan speeds accordingly. It currently has seven threshold levels for temps and fan speeds, all of which can easily be modified.
Installation is one command, and it's scheduled via either cron or systemd.
sudo ./install.sh
The deployed fan_control.py reads system temps in the following order:
- Sysfs (/sys/class/hwmon)
- sensors
- ipmitool sdr list (last fallback)
It also gets GPU temps via nvidia-smi and system temps via IPMI (e.g. Inlet, Exhaust, CPU packages, etc.).
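A minimal sketch of that fallback order is shown below. This is not the actual fan_control.py; the helper names and the parsing are mine:

```python
import glob
import subprocess

def sysfs_temps() -> list:
    """Read temps (°C) from /sys/class/hwmon; tempN_input files are millidegrees."""
    temps = []
    for path in glob.glob("/sys/class/hwmon/hwmon*/temp*_input"):
        try:
            with open(path) as f:
                temps.append(int(f.read().strip()) / 1000.0)
        except (OSError, ValueError):
            continue
    return temps

def parse_sensors_temps(output: str) -> list:
    """Pull temps out of `sensors` text output, e.g. 'Package id 0: +28.0°C ...'."""
    return [float(tok.rstrip("C°").lstrip("+"))
            for line in output.splitlines()
            for tok in line.split()
            if tok.startswith("+") and tok.endswith("°C")]

def read_system_temps() -> list:
    """Try sysfs first, then `sensors`; `ipmitool sdr list` is the last resort."""
    temps = sysfs_temps()
    if temps:
        return temps
    try:
        out = subprocess.run(["sensors"], capture_output=True, text=True).stdout
        temps = parse_sensors_temps(out)
    except FileNotFoundError:
        temps = []
    return temps  # caller falls back to `ipmitool sdr list` if still empty
```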
It's Getting Hot in Here
The shusher has been working fine for months, but as I mentioned, it's no longer winter, and the ambient air in my “server room” has been rising. See the output from my NetBotz 450 below. It's starting to get hotter than a grandparent's Florida condo.
So why am I posting all this? This morning, when I attempted to log into my main hypervisor to start on the 3rd installment of my HPC GPU Cluster journey, I found the host unresponsive and throwing the errors below on the console. Time to reboot.

Overheating NIC?
So here is what happened: the ixgbe driver detected that the adapter overheated and disabled it to protect the hardware. When that happens, all ports on that card stop working. In the console, kernel logs are reporting a thermal shutdown of the Intel 10GbE network adapter. Specifically, the Intel X540-AT2, which is a dual-port 10GbE copper (RJ45) adapter with a large heatsink (some variants of this card have a fan).
# lspci -s 0000:82:00.0
82:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
A few minutes after a reboot the error returns, the NIC goes offline again, and the system is accessible only via the iDRAC console. At this point I cannot keep the system online long enough to do any meaningful troubleshooting, so we need to shush the shusher. From the console we disable it and then reboot again.
# systemctl disable --now dell-r730-fan-control.timer
Removed '/etc/systemd/system/timers.target.wants/dell-r730-fan-control.timer'.
From our workstation we connect to the server's iDRAC, enable manual fan control, and set fan speeds to 100%. We do this while the target system is in the process of rebooting.
ipmitool -I lanplus -H 10.1.10.20 -U root -P calvin raw 0x30 0x30 0x01 0x00
ipmitool -I lanplus -H 10.1.10.20 -U root -P calvin raw 0x30 0x30 0x02 0xff 100
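If you would rather drive these from Python (as the shusher does), a small subprocess wrapper keeps the raw bytes in one place. This is a sketch, not the project's actual code; the host and credentials are the ones used above, and `0x30 0x30` is the commonly used Dell IPMI raw fan interface:

```python
import subprocess

IPMI_BASE = ["ipmitool", "-I", "lanplus", "-H", "10.1.10.20",
             "-U", "root", "-P", "calvin", "raw"]

def fan_commands(percent: int) -> list:
    """Build the argv lists: enable manual mode, then set all fans to percent.
    Returned without executing, so callers can log or dry-run them."""
    pct = max(0, min(100, percent))
    return [
        IPMI_BASE + ["0x30", "0x30", "0x01", "0x00"],                  # manual fan control on
        IPMI_BASE + ["0x30", "0x30", "0x02", "0xff", f"0x{pct:02x}"],  # all fans -> pct%
    ]

def set_fans(percent: int) -> None:
    for argv in fan_commands(percent):
        subprocess.run(argv, check=True)
```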
Hopefully now we can keep the system online long enough to see what is really going on, as the server itself is not reporting any temperature issues in the iDRAC.
Troubleshooting
First, let's check the system logs and see how many times this has occurred.
grep -r "over heat\|overheat\|stopped because" /var/log/ 2>/dev/null | tail -50
We see two events today, and a 3rd event a few days ago (probably patient zero).
| Date | Time | Event |
|---|---|---|
| Mar 22, 2026 | 12:04:56 | Both Intel 10GbE NICs (enp130s0f0, enp130s0f1) stopped due to overheat |
| Mar 22, 2026 | 13:17:24 | Same overheat again shortly before reboot at 13:24:59 |
| Mar 15, 2026 | 11:39:31 | Same overheat on both NICs |
Digging into the logs a bit further, we see the following at around 12:04:
- Fan speed: 15% (~3800 RPM)
- GPU temp: 35°C
- System temps: max 47°C (sensors 24–25 and 39 at 41–47°C)
- Action: Fan control had just set fans to 15% for the “LOW” threshold.
Intel X540-AT2 thermal specs
A quick internet search yields the following; this NIC needs to get pretty hot to experience a thermal shutdown.
| Specification | Value | Notes |
|---|---|---|
| Operating temp (ambient) | 0°C to 55°C | Marketing spec for 200 LFM airflow |
| Extended ambient | 0°C to 70°C | With 300 LFM airflow and adequate heatsink |
| Tcase max | 107°C | Max case temp at heat spreader (die-level limit) |
NIC temperature visibility
So the Intel ixgbe driver does not expose temperature to Linux:

- No hwmon temperature sensor under /sys/...
- ethtool does not show NIC temperature

So there is no NIC temperature in the system logs. The “over heated” message comes from the NIC's internal thermal protection. The driver only reports the event; the actual NIC temperature is not logged. At this point we do not have a method to pull the temp from the NIC.
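One quick way to confirm this on a given box is to list which drivers actually register hwmon devices; if ixgbe is not among them, there is nothing for sensors to find. A small sketch (the function name is mine):

```python
import glob
import os

def hwmon_names() -> dict:
    """Map each /sys/class/hwmon device to the driver name it reports."""
    names = {}
    for hw in glob.glob("/sys/class/hwmon/hwmon*"):
        try:
            with open(os.path.join(hw, "name")) as f:
                names[os.path.basename(hw)] = f.read().strip()
        except OSError:
            continue
    return names

# On this R730, 'ixgbe' never shows up -- only coretemp and friends.
print("ixgbe" in hwmon_names().values())
```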
Additionally, there are no log messages like thermal warnings, throttling, or high-temperature notices before the shutdown. So there is really nothing that we can check for in the logs and use as a trigger for our Dell Fan Controller (Shusher).
| Event | Last ixgbe message before overheat | Overheat |
|---|---|---|
| Mar 22, 12:04:56 | 11:58:02 – NIC Link is Up 10 Gbps | 12:04:56 – overheat (about 7 minutes later) |
| Mar 22, 13:17:24 | 13:11:28 – NIC Link is Up 10 Gbps (after reboot) | 13:17:24 – overheat (about 6 minutes later) |
| Mar 15, 11:39:31 | No ixgbe messages in preserved logs before this time | 11:39:31 – overheat |
Additionally, none of our reported temps from lm-sensors show any temperature issues, and we can see that the fan-shusher had recently adjusted fans down to 15% due to relatively mild temps.
LM-Sensors
Currently sensors detects the following temperatures on this R730.
| Chip | Sensor | Current | Limits |
|---|---|---|---|
| coretemp-isa-0000 (CPU Package 0) | Package id 0 | 28°C | high 83°C, crit 93°C |
| coretemp-isa-0000 | Core 0–28 (16 cores) | 22–25°C | high 83°C, crit 93°C |
| coretemp-isa-0001 (CPU Package 1) | Package id 1 | 29°C | high 83°C, crit 93°C |
| coretemp-isa-0001 | Core 0–28 (16 cores) | 22–25°C | high 83°C, crit 93°C |
Let's run sensors-detect and see if we can add any additional sensors that might help us get a better picture of temps across the system.
# sensors-detect
We answer “Yes” at each prompt. Any new modules/sensors are added to /etc/sysconfig/lm_sensors.
Now we reload all the sensors:
. /etc/sysconfig/lm_sensors; for m in $HWMON_MODULES $BUS_MODULES; do [ -n "$m" ] && sudo modprobe -r $m 2>/dev/null; done; for m in $BUS_MODULES $HWMON_MODULES; do [ -n "$m" ] && sudo modprobe $m; done
However, no new modules/temps are output by the sensors command, so there are no additional system temps to feed to the shusher.
So What Now?
At this point we have a few options, besides just cranking down the AC.
We know that our overheat events occurred when system temps were ~47°C with fans at 15%. So, let's start with an overhaul of the shusher: increase fan speeds by 10% for each of the 7 temp thresholds configured in our .env file, and redeploy.
Updated Fan speed levels (7 tiers)
| Level | Temp threshold | Fan speed | Trigger |
|---|---|---|---|
| Very-Low | < 20°C | 20% | Below LOW |
| Low | ≥ 20°C | 25% | GPU LOW / System LOW |
| Medium-Low | ≥ 35°C | 35% | MED_LOW |
| Medium | ≥ 50°C | 45% | MED |
| Medium-High | ≥ 60°C | 60% | MED_HIGH |
| High | ≥ 60°C (system) / 70°C (GPU) | 75% | HIGH |
| Very-High | ≥ 75°C | 90% | Auto mode → iDRAC |
Our minimum fan speed is now 20% instead of 10%, which should improve airflow over the NICs and reduce overheat risk.
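For reference, the system-temp side of that mapping boils down to a hottest-first threshold scan, something like the sketch below. The names are mine, and since the Medium-High and High tiers share a 60°C system threshold in the table, this sketch keeps only the higher fan speed at that temperature; the real script also folds in the separate GPU thresholds:

```python
# (system threshold °C, fan %) from the table above, checked hottest-first.
SYSTEM_TIERS = [(75, 90), (60, 75), (50, 45), (35, 35), (20, 25)]

def system_fan_percent(temp_c: float) -> int:
    """Return the fan duty cycle for a given max system temperature."""
    for threshold, pct in SYSTEM_TIERS:
        if temp_c >= threshold:
            return pct
    return 20  # Very-Low floor

print(system_fan_percent(47))  # 35 -- what the new tiers give at overheat-day temps
```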
Inlet/Exhaust Differential Logging
For the NIC overheat events (Mar 22 12:04 and 13:17, and Mar 15 11:39), inlet/exhaust temps were not logged, so let's start logging them on each scheduled run of fan_control.py. That way, if/when something goes wrong, we can see how hot the air going in was and how much it warmed up inside the chassis.
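A sketch of how those readings could be pulled on each run, assuming the typical `ipmitool sdr type temperature` line format on Dell iDRACs (the parsing and function name are mine):

```python
import re

def parse_inlet_exhaust(sdr_output: str):
    """Pull Inlet/Exhaust temps (°C) from `ipmitool sdr type temperature` output.
    Lines look roughly like: 'Inlet Temp | 04h | ok | 7.1 | 26 degrees C'.
    Returns (inlet, exhaust); either may be None if the reading is missing."""
    temps = {}
    for line in sdr_output.splitlines():
        m = re.match(r"\s*(Inlet|Exhaust) Temp.*?(\d+) degrees C", line)
        if m:
            temps[m.group(1)] = float(m.group(2))
    return temps.get("Inlet"), temps.get("Exhaust")

inlet, exhaust = parse_inlet_exhaust(
    "Inlet Temp   | 04h | ok  |  7.1 | 26 degrees C\n"
    "Exhaust Temp | 01h | ok  |  7.1 | 38 degrees C\n")
print(exhaust - inlet)  # 12.0 -- how much the chassis warmed the air
```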
Safety Floor
Additionally, let's set up logic in fan_control.py to enforce a minimum fan speed if the inlet or exhaust gets too high:
- Inlet ≥ 40°C → minimum fan speed set to 35%
- Exhaust ≥ 50°C → minimum fan speed set to 35%
Even if GPU/CPU temps look fine, we still ramp fans to protect things like the NIC when chassis air is hot.
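In fan_control.py terms, the floor is just a max() applied after the tier lookup. A minimal sketch (the function name is mine):

```python
# Safety-floor thresholds from the list above.
INLET_MAX_C = 40
EXHAUST_MAX_C = 50
FLOOR_PCT = 35

def apply_safety_floor(requested_pct: int, inlet_c: float, exhaust_c: float) -> int:
    """Raise the requested fan speed to a 35% floor when chassis air runs hot."""
    if inlet_c >= INLET_MAX_C or exhaust_c >= EXHAUST_MAX_C:
        return max(requested_pct, FLOOR_PCT)
    return requested_pct

print(apply_safety_floor(15, 42.0, 30.0))  # 35 -- hot inlet overrides the 15% tier
```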
Next Steps
For now, we are going to let things ride. We have adjusted our fan speeds up, and we have set up our safety floor.
If we continue to see issues, we may need to take additional action. This R730 is running a number of virtual machines which I use as “lab infrastructure,” so I want it running 24/7. And while I could just keep jacking up fan speeds, I would rather take a more proactive approach.
- Remove the heatsink and apply fresh thermal paste. This card is long in the tooth; the paste could be dried up.
- Move to an earlier revision of the Intel X540-AT2, which came with active fans. (Cheap)
- Move to a card whose driver can expose temperatures, like the Broadcom NetXtreme-E (e.g. BCM57416 – not as cheap).
In theory, I like the 3rd option, as it would be nice to pull temperatures from the NIC and allow fan_control.py to actually adjust fan speeds intelligently. In practice, however, replacing the NIC is probably overkill unless I run into a dual-port, actively cooled Intel card with both long and short brackets (I like to keep my options open slot-wise). Although I do like getting packages in the mail.
