ECC memory errors

Yes, they do happen, and you may not even know about it. ECC memory automatically corrects 1-bit errors, but each correction should still raise an alarm, e.g. via SNMP. Depending on your hardware and your setup, it may or may not. Sometimes a server will start misbehaving without having given any indication of a memory issue.

In a recent case, an HP ProLiant server had gone through an episode of random restarts about a year earlier, but then apparently recovered and showed no further symptoms. Until last week, when it became really unstable: it would run fine for a few hours, then lapse into a series of restarts, and only a cold start would get it running again. We were preparing to replace it when one of my engineers noticed that it would also fail during the POST memory check. Not consistently, just every now and then. We had no window for running a full memory test, so we just yanked the complete set of memory modules and replaced it. Problem fixed.

Lesson learned: ECC memory errors are rare, but they can cause havoc without any clear indication. Keep it in mind.
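
If the hardware does not raise an alarm by itself, the operating system may still be counting the corrections. What follows is a minimal sketch, not what we ran on the server above: it assumes a Linux host with the kernel's EDAC driver loaded, which exposes per-memory-controller counters as ce_count (corrected errors) and ue_count (uncorrected errors) under /sys/devices/system/edac/mc. The function names read_ecc_counts and ecc_warnings are made up for this example.

```python
from pathlib import Path

# Standard Linux EDAC sysfs location; present only if an EDAC driver is loaded.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_ecc_counts(root=EDAC_ROOT):
    """Return {controller: (corrected, uncorrected)} for each memory controller.

    Hypothetical helper for illustration. Returns an empty dict when the
    directory is missing (no EDAC support, or not a Linux host).
    """
    root = Path(root)
    counts = {}
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (ce, ue)
    return counts


def ecc_warnings(counts):
    """Build one message per controller that has logged any ECC error."""
    return [
        f"{name}: {ce} corrected, {ue} uncorrected ECC errors"
        for name, (ce, ue) in sorted(counts.items())
        if ce or ue
    ]


if __name__ == "__main__":
    for line in ecc_warnings(read_ecc_counts()):
        print("WARNING", line)
```

Run from cron or a monitoring agent, a script like this prints nothing while the counters stay at zero. The counters reset on reboot, so a server that keeps restarting may need them logged somewhere persistent.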