Homelab Chronicles 06 – “Hey Google…” “I’m Sorry, Something Went Wrong”

I woke up early today, on a Saturday, to my alarm clock(s) going off. I was planning to go to a St. Patrick’s Day Parade and post-parade party with a friend. After turning off my phone alarm(s), I told my Google Nest Mini to stop the alarm that was blaring.

Unfortunately, it informed me that something went wrong. Though it did turn off. Usually when my Google Nest Mini has issues, it’s because WiFi messed up. So I stumbled out of bed, still half-asleep, to the guest bedroom, where the network “rack”—a small metal bookshelf—and the Unifi AP was at. My main 24-port switch had lights blinking. I looked up at the AP high up on the wall and saw the steady ring of blue light, indicating everything was working. OK, so not a WiFi problem, nor a network problem. Probably.

In the hallway, I passed by my Ecobee thermostat to turn the heat up a little and then noticed a “?” mark on the button for local weather. Ah, so I didn’t have Internet. Back in my room and I picked up my phone: 5G, instead of WiFi. On my computer, the Formula 1 livestream of the Bahrain track test, which I fell asleep to, had stopped. And reloading the page simply displayed a “No connection” error. I opened a command prompt and ran ipconfig /all and ping 8.8.8.8. The ping didn’t go anywhere, but I still had a proper internal IP in the subnet. Interesting. Guess the DHCP lease was still good.

Only one last place to check: the living room where the Google Fiber Jack and my Unifi Secure Gateway router were. Maybe there was a Fiber outage. Or maybe my cat had accidentally knocked the AC adapter off messing around in places he shouldn’t. Sunlight was streaming in from the balcony sliding door, making it hard to see the LED on the Jack. I covered the LED on the Fiber Jack with my hands as best as I could: it was blue. Which meant this wasn’t an outage. Uh oh. Only one other thing it could be.

Next to the Fiber Jack, surrounding my TV, I have some shelving with knickknacks and little bits of artwork. Hidden behind one art piece is my USG and an 8-port switch. I removed the art to see the devices. The switch was blinking normally. But on the USG, the console light was blinking with periodicity, while the WAN and LAN lights were out. Oh no, please don’t tell me the “magic smoke” escaped from the USG.

On closer inspection, it looked like the USG was trying to boot up repeatedly. It was even making a weird sound like a little yelp in time with the console LED going on and off. So I traced the power cable to the power strip and unplugged it, waited 15 seconds, and plugged it in again. Same thing happened. I really didn’t want to have to buy a new USG; they’re not terribly expensive, but they’re not inexpensive, either.

I tried plugging it into a different outlet on the power strip, but it kept quickly boot-looping. I then brought it to a different room and plugged it into a power outlet; no change. Great.

But then I noticed that there was a little green LED on the power brick. And it was flashing at the same frequency as the USG’s console light when plugged in. Hmm, maybe the power adapter went bad. I could deal with that, provided I had a spare lying around.

The Unifi power brick said “12V, 1 amp” for the output. So I started looking around. On my rack, I had an external HDD that was cold. I looked at its AC adapter and saw “12V, 2 amps.” That was promising, but could I use a 2 amp power supply on a device that only wants 1 amp? I looked online, via my phone, and the Internet said, “Yes.” Perfect.

I swapped the AC adapter on the USG. The little barrel connector that goes into the USG seemed to fit, if not just a smidge loose. Then I plugged it back into the wall.

It turned on and stayed on! Ha!

I brought it back to the shelf and reconnected everything. It took about 5 minutes for it to fully boot up. Afterwards, I went back to my computer and waited for an Internet connection to come back, and it did.

All in all, it was a 15-20 minute troubleshooting adventure. Not what I preferred to do straight out of bed on a Saturday morning, but it got fixed. I already ordered a new AC adapter from Amazon that should arrive in a few days.

Afterwards, I got ready and went to the parade. A bit nippy at about 25°F (about -3°C), but at least it was bright and sunny with barely any wind. I went to the party and had a couple beers. It definitely made up for the morning IT sesh.

Homelab Chronicles 05: Dead Disk Drive

Sometime around the new year, one of the drives in my server appeared to have died. I had some issues with it in the past, but usually unseating and reseating it seemed to fix whatever problems it was presenting. But not this time.

My server has 7 500GB HDDs, set-up in a RAID 5 configuration. It gives me about 2.7TB of storage space. These are just consumer-level WD Blue 7200rpm drives. It’s a Homelab that’s mainly experimental; I’m not into spending big money on it. Not yet, anyway.

While I’ve since heard that RAID 5 isn’t great, I’m OK with this since this is just a Homelab. Anyway, in RAID5, one drive can die and the array will still function. Which is exactly what happened here.

However, I began tempting fate by not immediately swapping the failed drive. I didn’t have any spares at home, but more importantly, I was being cheap. So I let it run in a degraded state for a month or two months. This was very dangerous as I don’t backup the VMs or ESXi. I only backup my main Windows Server instance via Windows Server backup to an external HDD. Even then, I’ve committed the common cardinal sin of backups: I’ve yet to test a single WS backup. So using something like Veeam is probably worth looking into for backing up full VMs. And of course testing my Windows Server full bare-metal backups.

Luckily, fates were on my side and no other drive failures were reported. I finally got around to replacing the drive about month ago. Got my hands on a similar WD Blue 500GB drive; a used one at that. It was pretty straightforward. I swapped the drives, went into the RAID configuration in the system BIOS, designated the drive as part of the array, and then had it rebuild. I think it took at least 10hrs.

While it was rebuilding, everything else was down. All VMs were down, because ESXi was down. Thought it best to rebuild while nothing else was happening. Who knows how long it would’ve taken otherwise and if I’d run into other issues. I wonder how this is done on real-life production servers.

Afterwards, the RAID controller reported that everything was in tip-top shape.

But of course, I wanted more. More storage, that is. I ended up getting two of the 500GB WD Blue HDDs: one for the replacement and the other as an additional disk to the array.

Unfortunately, Dell does not make it easy to add additional drives to an existing array. I couldn’t do it directly on the RAID controller (pressing Ctrl+R during boot), nor in Dell’s GUI-based BIOS or Lifecycle Controller. IDRAC didn’t allow it either.

Looking around online, it seemed that the only way to do it would be via something called OpenManage, some kind of remote system controller from Dell. But I couldn’t get it to work no matter what I did. The instructions on what I needed to install, how to install it once I figured out what to install, nor how to actually use it once I determined how to install it, were poor. Thanks, Dell.

In the end, after spending at least a few hours researching and experimenting, it didn’t seem worth it for 500GB more of storage space. I did add the 8th drive in, but as a hot spare. I may even take it out and use it as cold spare.

But yeah, I can now say that I’ve dealt with a failed drive in a RAID configuration. Hopefully it never goes further than that.