The Homelab Dilemma: Living With Enterprise Servers

Enterprise servers are loud and hot, and while a couple of 13th-generation Dell servers kept a room nice and toasty in the cold of winter, they are a double whammy on your electric bill in the spring and summer.

The noise can be problematic as well, especially since I no longer have a basement, where I relied on my servers to circulate and dehumidify the stuffy air. Now I actually have to sit in a room with the rack, which is why I moved it to the bedroom. Strange, I know.

So why the bedroom? Well, the white noise (when under control) helps me sleep, and I can turn down the heat in the winter and not freeze at night. During the day, when I power everything up, I am in my home office (or my second home office, the dining room), which is already loud enough with my workstations and old Cisco switches. That tiny home office, which is technically a small bedroom for your least favorite child, can get downright uncomfortably warm.


Why So Much Noise

As anyone who has worked in a datacenter can tell you, servers are loud. Let's compare the fans in a Dell R630 to those in a Dell R730.

| Server | Form Factor | Effective fan size | Fan count | Notes |
|---|---|---|---|---|
| Dell R630 | 1U | ~40 mm blower | 7 | Very high RPM, loud, high static pressure |
| Dell R730 | 2U | ~60 mm blower | 6 | Still loud, but more efficient airflow |

This boils down to: the 1U server must spin its fans faster than the 2U to move the same volume of air through the server chassis, measured in CFM (cubic feet per minute).

Additionally, if you install a PCIe card that the server itself does not recognize (like Tesla T4 GPUs), you may find your fans spinning at full bore, as the server would rather take flight than overheat.


Enter the Dell Fan Shusher

Luckily, you can use ipmitool to override a Dell server’s fan speed and quiet things down a great deal. However, you need to keep an eye on temperatures. For this I created Dell-Server-Fan-Shusher.

It’s a Python script that monitors system temperatures (via sensors) and NVIDIA GPU temperatures (via nvidia-smi, when present) and sets fan speeds accordingly. It currently has seven threshold levels for temperatures and fan speeds, all of which can easily be modified.
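The core idea can be sketched in a few lines. This is a minimal sketch with hypothetical names and values; the real script reads its thresholds from a .env file and tracks GPU and system temperatures separately.

```python
# Hypothetical threshold-to-fan-speed mapping (illustrative values only;
# the actual fan_control.py loads its seven tiers from a .env file).
THRESHOLDS = [
    (75, 90),  # >= 75°C -> 90% fan speed
    (60, 60),  # >= 60°C -> 60%
    (50, 45),  # >= 50°C -> 45%
    (35, 35),  # >= 35°C -> 35%
    (20, 25),  # >= 20°C -> 25%
]
MIN_SPEED = 20  # floor when everything is cool

def pick_fan_speed(temp_c: float) -> int:
    """Return a fan speed percentage for the hottest observed temperature."""
    for threshold, speed in THRESHOLDS:
        if temp_c >= threshold:
            return speed
    return MIN_SPEED
```

The script runs this kind of lookup against the hottest sensor it found, then pushes the resulting duty cycle to the iDRAC.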

Installation is one command, and the script is scheduled via either cron or a systemd timer.

sudo ./install.sh

The deployed fan_control.py reads system temperatures in the following order:

  • Sysfs (/sys/class/hwmon)
  • sensors
  • ipmitool sdr list (last fallback)

It also gets GPU temps via nvidia-smi and system temps via IPMI (e.g. Inlet, Exhaust, CPU packages, etc.).
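The fallback chain above can be sketched roughly like this. This is a simplified sketch with assumed helper names; the deployed script's internals and parsers differ (in particular, the naive parser below only handles `sensors -u` style output, not `ipmitool sdr list`).

```python
import glob
import re
import subprocess

def read_hwmon_temps(base="/sys/class/hwmon"):
    """Read temperatures (°C) from sysfs; values are stored in millidegrees."""
    temps = []
    for path in glob.glob(f"{base}/hwmon*/temp*_input"):
        try:
            with open(path) as f:
                temps.append(int(f.read().strip()) / 1000.0)
        except (OSError, ValueError):
            continue
    return temps

def parse_temps(text):
    """Naive parser for `sensors -u` style lines like `temp1_input: 41.500`."""
    return [float(v) for v in re.findall(r"_input:\s*([\d.]+)", text)]

def read_system_temps():
    """Sysfs first, then `sensors -u`, then `ipmitool sdr list` as a last resort."""
    temps = read_hwmon_temps()
    if temps:
        return temps
    for cmd in (["sensors", "-u"], ["ipmitool", "sdr", "list"]):
        try:
            out = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
        except (OSError, subprocess.TimeoutExpired):
            continue
        if out.returncode == 0 and out.stdout:
            return parse_temps(out.stdout)  # real script parses SDR output separately
    return []
```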


It’s Getting Hot in Here

The Shusher has been working fine for months, but as I mentioned, it’s no longer winter, and the ambient air in my “server room” has been rising. See the output from my NetBotz 450 below. It’s starting to get hotter than a grandparent’s Florida condo.

[Image: Line graph of NetBotz sensor-pod temperature over time; y-axis 71–77°F, max 75.4°F, min 72.3°F.]

So why am I posting all this? This morning, when I attempted to log into my main hypervisor to start the third installment of my HPC GPU Cluster journey, I found the host unresponsive and throwing the errors below on the console. Time to reboot.

[Image: Console screen on Red Hat Enterprise Linux showing kernel error messages that the network adapter stopped due to overheating, suggesting a restart or adapter replacement.]

Overheating NIC?

So here is what happened: the ixgbe driver detected that the adapter overheated and disabled it to protect the hardware. When that happens, all ports on the card stop working. In the console, kernel logs report a thermal shutdown of the Intel 10GbE network adapter; specifically, the Intel X540-AT2, a dual-port 10GbE copper (RJ45) adapter with a large heatsink (some variants of this card have a fan).

lspci -s 0000:82:00.0
82:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)

A few minutes after a reboot, the error returns; the NIC goes offline again, and the host is accessible only via the iDRAC console. At this point I cannot keep the system online long enough to do any meaningful troubleshooting, so we need to shush the Shusher. From the console, we disable it and then reboot again.

# systemctl disable --now dell-r730-fan-control.timer
Removed '/etc/systemd/system/timers.target.wants/dell-r730-fan-control.timer'.


From our workstation, we connect to the server’s iDRAC, enable manual fan control, and set fan speeds to 100%. We do this while the target system is in the process of rebooting.

ipmitool -I lanplus -H 10.1.10.20 -U root -P calvin raw 0x30 0x30 0x01 0x00
ipmitool -I lanplus -H 10.1.10.20 -U root -P calvin raw 0x30 0x30 0x02 0xff 100
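For reference, these are Dell OEM IPMI commands: `0x30 0x30 0x01 0x00` disables automatic fan control, and `0x30 0x30 0x02 0xff <duty>` sets all fans (`0xff`) to a duty cycle of 0–100 (ipmitool accepts decimal arguments, so the `100` above is equivalent to `0x64`). If you are scripting this, a small helper keeps the byte-fiddling in one place (a sketch; the function name is mine):

```python
def fan_speed_cmd(percent: int) -> list[str]:
    """Build the Dell iDRAC raw-command arguments to set all fans to `percent`.

    0x30 0x30 0x02 is the Dell OEM "set fan speed" command; 0xff targets
    all fans; the final byte is the duty cycle (0-100, i.e. 0x00-0x64).
    """
    if not 0 <= percent <= 100:
        raise ValueError("fan speed must be 0-100%")
    return ["raw", "0x30", "0x30", "0x02", "0xff", f"0x{percent:02x}"]
```

These arguments would then be appended to the usual `ipmitool -I lanplus -H <idrac-ip> -U <user> -P <password>` invocation.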

Hopefully now we can keep the system online long enough to see what is really going on, as the server itself is not reporting any temperature issues in the iDRAC.


Troubleshooting

First, let’s check the system logs and see how many times this has occurred.

grep -r "over heat\|overheat\|stopped because" /var/log/ 2>/dev/null | tail -50

We see two events today, plus a third event a few days ago (probably patient zero).

| Date | Time | Event |
|---|---|---|
| Mar 22, 2026 | 12:04:56 | Both Intel 10GbE NICs (enp130s0f0, enp130s0f1) stopped due to overheat |
| Mar 22, 2026 | 13:17:24 | Same overheat again, shortly before reboot at 13:24:59 |
| Mar 15, 2026 | 11:39:31 | Same overheat on both NICs |

Digging into the logs a bit further, we see the following at around 12:04:

  • Fan speed: 15% (~3800 RPM)
  • GPU temp: 35°C
  • System temps: max 47°C (sensors 24–25 and 39 at 41–47°C)
  • Action: fan control had just set fans to 15% for the “LOW” threshold.

Intel X540-AT2 thermal specs

A quick internet search yields the following: this NIC needs to get pretty hot to experience a thermal shutdown.

| Specification | Value | Notes |
|---|---|---|
| Operating temp (ambient) | 0°C to 55°C | Marketing spec at 200 LFM airflow |
| Extended ambient | 0°C to 70°C | With 300 LFM airflow and an adequate heatsink |
| Tcase max | 107°C | Max case temp at the heat spreader (die-level limit) |

NIC temperature visibility

The Intel ixgbe driver does not expose the NIC temperature to Linux:

  • No hwmon temperature sensor under /sys/...
  • ethtool does not show NIC temperature

So there is no NIC temperature in the system logs. The “over heated” message comes from the NIC’s internal thermal protection; the driver only reports the event, and the actual NIC temperature is not logged. At this point we do not have a method to pull the temperature from the NIC.
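You can confirm this on your own system: if the driver registered a temperature sensor, a hwmon device would appear under the NIC's PCI path in sysfs. A quick check (a sketch; the PCI address comes from the lspci output above, and `sysfs_root` is parameterized only so the function can be tested off-box):

```python
import os

def nic_has_hwmon(pci_addr: str = "0000:82:00.0",
                  sysfs_root: str = "/sys/bus/pci/devices") -> bool:
    """Return True if the kernel driver registered a hwmon sensor for this PCI device."""
    hwmon_dir = os.path.join(sysfs_root, pci_addr, "hwmon")
    return os.path.isdir(hwmon_dir) and bool(os.listdir(hwmon_dir))
```

For the X540-AT2 under ixgbe, this check comes back empty, matching what we see in the logs.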

Additionally, there are no log messages like thermal warnings, throttling, or high-temperature notices before the shutdown. So there is really nothing we can check for in the logs and use as a trigger for our Dell Fan Controller (Shusher).

| Event | Last ixgbe message before overheat | Overheat |
|---|---|---|
| Mar 22, 12:04:56 | 11:58:02 – NIC Link is Up 10 Gbps | 12:04:56 – overheat (about 7 minutes later) |
| Mar 22, 13:17:24 | 13:11:28 – NIC Link is Up 10 Gbps (after reboot) | 13:17:24 – overheat (about 6 minutes later) |
| Mar 15, 11:39:31 | No ixgbe messages in preserved logs before this time | 11:39:31 – overheat |

Additionally, none of the temperatures reported by lm-sensors show any issues, and we can see that the Fan Shusher had recently adjusted the fans down to 15% due to the relatively mild temps.


LM-Sensors

Currently, sensors detects the following temperatures on this R730:

| Chip | Sensor | Current | Limits |
|---|---|---|---|
| coretemp-isa-0000 (CPU Package 0) | Package id 0 | 28°C | high 83°C, crit 93°C |
| | Core 0–28 (16 cores) | 22–25°C | high 83°C, crit 93°C |
| coretemp-isa-0001 (CPU Package 1) | Package id 1 | 29°C | high 83°C, crit 93°C |
| | Core 0–28 (16 cores) | 22–25°C | high 83°C, crit 93°C |

Let’s run sensors-detect and see if we can add any additional sensors that might help us get a better picture of temperatures across the system:

# sensors-detect

We answer “Yes” at each prompt. Any new modules/sensors are added to /etc/sysconfig/lm_sensors.

Now we reload all the sensor modules:

. /etc/sysconfig/lm_sensors
for m in $HWMON_MODULES $BUS_MODULES; do
    [ -n "$m" ] && sudo modprobe -r $m 2>/dev/null
done
for m in $BUS_MODULES $HWMON_MODULES; do
    [ -n "$m" ] && sudo modprobe $m
done

However, the sensors command outputs no new modules or temperatures, so there are no additional system temps to feed to the Shusher.


So What Now?

At this point we have a few options, besides just cranking down the AC.

We know that our overheat events occurred when system temps were ~47°C with fans at 15%. So let’s start with an overhaul of the Shusher: increase fan speeds by 10% for each of the seven temp thresholds configured in our .env file, and redeploy.

Updated Fan speed levels (7 tiers)

| Level | Temp threshold | Fan speed | Trigger |
|---|---|---|---|
| Very-Low | < 20°C | 20% | Below LOW |
| Low | ≥ 20°C | 25% | GPU LOW / System LOW |
| Medium-Low | ≥ 35°C | 35% | MED_LOW |
| Medium | ≥ 50°C | 45% | MED |
| Medium-High | ≥ 60°C | 60% | MED_HIGH |
| High | ≥ 60°C (system) / 70°C (GPU) | 75% | HIGH |
| Very-High | ≥ 75°C | 90% | Auto mode → iDRAC |

Our minimum fan speed is now 20% instead of 10%, which should improve airflow over the NICs and reduce overheat risk.

Inlet/Exhaust Differential Logging

For the NIC overheat events (Mar 22 12:04 and 13:17, and Mar 15 11:39), inlet/exhaust temperatures were not logged. Let’s start logging them on each scheduled run of fan_control.py, so if/when something goes wrong, we can see how hot the air was going in and how much it warmed up inside the chassis.
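A sketch of how the script could pull those two readings out of `ipmitool sdr type temperature` output. The sensor names “Inlet Temp” and “Exhaust Temp” and the sample line format are typical iDRAC naming, but verify against your own hardware; the function name is mine.

```python
import re

def parse_inlet_exhaust(sdr_output: str) -> dict:
    """Extract Inlet/Exhaust temperatures (°C) from `ipmitool sdr type temperature`.

    Expected line format (typical iDRAC naming, may vary by model):
      Inlet Temp       | 04h | ok  |  7.1 | 24 degrees C
    """
    temps = {}
    for line in sdr_output.splitlines():
        m = re.match(r"\s*(Inlet|Exhaust) Temp.*?(\d+)\s*degrees C", line)
        if m:
            temps[m.group(1).lower()] = int(m.group(2))
    return temps
```

fan_control.py can then log both values (and their difference) on every run, giving us a chassis-heating trail to look back on after the next event.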

Safety Floor

Additionally, let’s set up logic in fan_control.py to enforce a minimum fan speed if the inlet or exhaust temperature gets too high:

  • Inlet ≥ 40°C → minimum fan speed set to 35%
  • Exhaust ≥ 50°C → minimum fan speed set to 35%

Even if GPU/CPU temps look fine, we still ramp fans to protect things like the NIC when chassis air is hot.
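The floor logic itself is only a few lines (a sketch with assumed names, using the thresholds listed above):

```python
# Safety-floor thresholds, per the plan above (names are illustrative).
INLET_LIMIT_C = 40
EXHAUST_LIMIT_C = 50
FLOOR_PERCENT = 35

def apply_safety_floor(computed_speed: int, inlet_c: float, exhaust_c: float) -> int:
    """Never let fans drop below FLOOR_PERCENT when chassis air is hot,
    even if CPU/GPU temps alone would allow a lower speed."""
    if inlet_c >= INLET_LIMIT_C or exhaust_c >= EXHAUST_LIMIT_C:
        return max(computed_speed, FLOOR_PERCENT)
    return computed_speed
```

This runs after the normal threshold lookup, so the floor only ever raises the result, never lowers it.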


Next Steps

For now, we are going to let things ride. We have adjusted our fan speeds upward, and we have set up our safety floor.

If we continue to see issues, we may need to take additional action. This R730 runs a number of virtual machines that I use as “lab infrastructure,” so I want it running 24/7. And while I could just keep jacking up fan speeds, I would rather take a more proactive approach.

  1. Remove the heatsink and apply fresh thermal paste. This card is long in the tooth; the paste could be dried up.
  2. Move to an earlier revision of the Intel X540-AT2, which came with an active fan. (Cheap.)
  3. Move to a card whose driver can expose temperatures, like the Broadcom NetXtreme-E (e.g. BCM57416 – not as cheap).

In theory, I like the third option, as it would be nice to pull temperatures from the NIC and allow fan_control.py to adjust fan speeds intelligently. In practice, however, replacing the NIC is probably overkill unless I run into a dual-port, actively cooled Intel card with both long and short brackets (I like to keep my options open slot-wise). Although I do like getting packages in the mail.
