Server Degradation - [solved]

Posted: Fri Dec 11, 2020 3:09 am
by Red Squirrel
Everything is good and we are now back to full disk redundancy.

----
Just a heads up, I have 2 drives that have failed in one of the RAID arrays on the NAS, and it just so happens both the shard's VM and the database server are on that array.

I have 4 drives on order from 2 retailers to maximize the chance of getting them sooner, and to have 2 spares after, since a couple of the other drives are throwing errors too. The drives in this array have around 60 thousand power-on hours on them (close to 7 years of continuous running), so I think it's just a matter of time until they all need to be swapped out.

In order to minimize load on this degraded array and reduce the chance of another failure, I have decided to turn off the database server. The shard will continue to run, but if anything happens such as a crash, it will result in a revert to around Dec 11 2:45am ET. (no longer the case)

You can continue to play as normal, and if all goes well there will be no revert.

I'm on night shifts right now so I don't want to do anything too drastic at this point as I can't dedicate my focus 100% to it, but once I'm off again, I want to look at migrating the database server to another array. In theory I should be able to do that while the shard continues to run, and when I bring it back up, it will start to save again.

The shard itself does not really produce much disk IO, so it will stay on the degraded array for now.

For your viewing pleasure, this is a shot of the carnage:

Image

If I'm understanding this right, any of the B drives can fail and we will be safe, but if any of the A ones fail, then the entire array is lost. I do have backups but hope I don't need to use them as it's still a pain to rebuild everything.
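
For anyone curious about the A/B reasoning, here is a minimal Python sketch of it. It assumes the array is built from mirrored pairs (RAID 10 style), which the post does not actually confirm, and the drive names and the 8-drive layout are made up for illustration. Under that assumption the array survives as long as every pair still has one working drive, so the "A" drives are the surviving partners of the two dead ones.

```python
from typing import List, Set


def array_survives(mirror_pairs: List[Set[str]], failed: Set[str]) -> bool:
    """True if every mirror pair still has at least one working drive."""
    return all(pair - failed for pair in mirror_pairs)


def critical_drives(mirror_pairs: List[Set[str]], failed: Set[str]) -> Set[str]:
    """Healthy drives whose loss alone would take the whole array down."""
    healthy = set().union(*mirror_pairs) - failed
    return {
        drive
        for drive in healthy
        if not array_survives(mirror_pairs, failed | {drive})
    }


# Hypothetical layout: 4 mirrored pairs, with one drive dead in each of two pairs.
pairs = [{"d0", "d1"}, {"d2", "d3"}, {"d4", "d5"}, {"d6", "d7"}]
dead = {"d0", "d2"}

a_drives = critical_drives(pairs, dead)                 # partners of the dead drives
b_drives = set().union(*pairs) - dead - a_drives        # drives in still-intact pairs
print("A (array lost if one fails):", sorted(a_drives))
print("B (can lose one safely):", sorted(b_drives))
```

Note that losing one B drive is safe, but that drive's partner would then become an A drive too, which is why keeping load off the array matters until the replacements are in.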

Archived topic from AOV, old topic ID:6847, old post ID:39583

Posted: Sat Dec 12, 2020 2:51 am
by Red Squirrel
It's quiet tonight at work and I got everything done that I needed to.

Currently migrating the database VM to another LUN. This is kinda critical as that very act is putting a lot of strain on the array, but it's mostly reads and not writes, so it should be fine...

Once it's on the new LUN I will fire the VM back up and turn the server back on and start to sync the shard back up.

Archived topic from AOV, old topic ID:6847, old post ID:39584

Posted: Sat Dec 12, 2020 3:25 am
by Red Squirrel
Database server now on new array and running. Server is now synced with database and there is no longer a risk of revert.

However the shard VM itself remains on the degraded array so there is still a risk of downtime, but no data loss.

I still have no ETA for the new hard drives, and with the weekend they won't really move until Monday. According to the tracking number from one retailer, the drives are in Richmond Hill, which is here in Ontario, so once they do ship it should only be a few days.

Archived topic from AOV, old topic ID:6847, old post ID:39585

Posted: Sat Dec 12, 2020 4:00 pm
by ggkthx
This is quite the adventure. :o:

Archived topic from AOV, old topic ID:6847, old post ID:39586

Posted: Sat Dec 12, 2020 7:23 pm
by Red Squirrel
Lol yeah, quite the adventure. I can't wait for those drives to come in... it's one of my higher performance RAID arrays so I have a lot on there.

I'm actually due for an overall upgrade to increase capacity since most of my arrays are running low on space and are on fairly old drives, but costs of living keep going up so don't really have money to buy server stuff anymore these days.

Archived topic from AOV, old topic ID:6847, old post ID:39588

Posted: Fri Dec 18, 2020 8:49 pm
by Red Squirrel
So the two replacement drives came in. I pulled out the 2 dead drives and put the replacements in. Running some tests on them to make sure they're good, then will insert them into the array and let it rebuild.

At this point the shard's data is NOT at risk as per my last post about migrating it to another LUN, but the possibility of downtime is still a risk should the array get more drive failures.

I am not too worried though and I think everything will go smoothly. This should be over within 1-2 days.

Archived topic from AOV, old topic ID:6847, old post ID:39644

Posted: Fri Dec 18, 2020 11:51 pm
by ggkthx
Cool cool cool. Hope all goes smoothly!

Archived topic from AOV, old topic ID:6847, old post ID:39645

Posted: Sat Dec 19, 2020 12:21 am
by Red Squirrel
First round of testing (long SMART test) completed without error on both drives.

Doing a full write test now, then will do a full read-back test. This makes sure there are no bad sectors.
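
For reference, the write-then-read-back idea looks roughly like this in Python. This is only a sketch and not the tool used here (the post doesn't say which one was): it writes a known pattern to an ordinary file instead of the raw drive, and the function names and chunk size are my own choices. A real surface test would target the whole device and bypass the OS cache, which this does not do.

```python
import hashlib
import os

CHUNK = 1024 * 1024  # 1 MiB per chunk (arbitrary choice for this sketch)


def pattern_chunk(index: int, size: int = CHUNK) -> bytes:
    """Deterministic pseudo-random data derived from the chunk index."""
    return hashlib.shake_128(index.to_bytes(8, "little")).digest(size)


def write_pass(path: str, chunks: int) -> None:
    """Write the pattern for every chunk, then force it out to disk."""
    with open(path, "wb") as f:
        for i in range(chunks):
            f.write(pattern_chunk(i))
        f.flush()
        os.fsync(f.fileno())


def read_pass(path: str, chunks: int) -> list:
    """Read everything back; return indices of chunks that don't match."""
    bad = []
    with open(path, "rb") as f:
        for i in range(chunks):
            if f.read(CHUNK) != pattern_chunk(i):
                bad.append(i)
    return bad
```

Usage would be something like `write_pass("/mnt/newdrive/test.bin", 1024)` followed by `read_pass("/mnt/newdrive/test.bin", 1024)`, where the path is hypothetical. An empty list back means every chunk read back exactly as written.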

It's so odd looking at the stats and seeing a drive with only a few power-on hours compared to like 60 thousand lol. The old drives had a pretty good run.

Archived topic from AOV, old topic ID:6847, old post ID:39646

Posted: Sat Dec 19, 2020 3:53 am
by Red Squirrel
All tests were good. Rebuild in progress!

Image

Archived topic from AOV, old topic ID:6847, old post ID:39647

Posted: Sat Dec 19, 2020 7:59 pm
by Red Squirrel
Everything is good now. The RAID array is nominal.

I have 2 other drives on the way which I'll keep as spares as I do have more drives showing errors.

Archived topic from AOV, old topic ID:6847, old post ID:39650