Server Degradation - [solved]

Red Squirrel
Posts: 29209
Joined: Wed Dec 18, 2002 12:14 am
Location: Northern Ontario

Post by Red Squirrel »

Everything is good and we are now back to full disk redundancy

----
Just a heads up: 2 drives have failed in one of the RAID arrays on the NAS, and it just so happens that both the shard's VM and the database server are on that array.

I have 4 drives on order from 2 retailers to maximize the chance of getting them sooner, and to have 2 spares afterward, since a couple of the other drives are throwing errors too. The drives in this array have around 60 thousand hours on them, so I think it's just a matter of time until they all need to be swapped out.

In order to minimize load on this degraded array and reduce the chance of another failure, I have decided to turn off the database server. The shard will continue to run, but if anything happens, such as a crash, it will result in a revert to around Dec 11 2:45am ET. (no longer the case)

You can continue to play as normal, and if all goes well there will be no revert.

I'm on night shifts right now so I don't want to do anything too drastic at this point as I can't dedicate my focus 100% to it, but once I'm off again, I want to look at migrating the database server to another array. In theory I should be able to do that while the shard continues to run, and when I bring it back up, it will start to save again.

The shard itself does not really produce much disk IO so that will remain.

For your viewing pleasure, this is a shot of the carnage:

Image

If I'm understanding this right, any of the B drives can fail and we will be safe, but if any of the A ones fail, then the entire array is lost. I do have backups but hope I don't need to use them as it's still a pain to rebuild everything.
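
To spell out that reasoning, here is a rough Python sketch. It assumes the array is a RAID 10 style layout built from mirrored pairs, which the post above does not actually confirm; the drive names and the 8-drive layout are made up for illustration. Under that assumption the array survives as long as every mirror pair still has at least one healthy drive, so the partners of the dead drives are the critical ones.

```python
def array_survives(mirror_pairs, failed):
    """True if every mirror pair still has at least one healthy drive."""
    return all(any(d not in failed for d in pair) for pair in mirror_pairs)


def classify_survivors(mirror_pairs, failed):
    """Split the healthy drives into (critical, safe).

    critical: losing this one drive next would take the whole array down.
    safe:     the array would still survive if this one drive failed next.
    """
    critical, safe = [], []
    for pair in mirror_pairs:
        for drive in pair:
            if drive in failed:
                continue
            if array_survives(mirror_pairs, failed | {drive}):
                safe.append(drive)
            else:
                critical.append(drive)
    return critical, safe


# Hypothetical layout: 4 mirror pairs, with one drive already dead in two of them.
pairs = [("d0", "d1"), ("d2", "d3"), ("d4", "d5"), ("d6", "d7")]
failed = {"d1", "d2"}

critical, safe = classify_survivors(pairs, failed)
print("critical:", critical)
print("safe:", safe)
```

In this made-up layout the critical drives are the lone survivors of the two damaged pairs, and any single drive from the untouched pairs can still fail without losing the array.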

Archived topic from AOV, old topic ID:6847, old post ID:39583
Honk if you love Jesus, text if you want to meet Him!

Post by Red Squirrel »

It's quiet tonight at work and I got everything done that I needed to.

Currently migrating the database VM to another LUN. This is kinda critical as that very act is putting a lot of strain on the array, but it's mostly reads and not writes, so it should be fine...

Once it's on the new LUN I will fire the VM back up and turn the server back on and start to sync the shard back up.

Archived topic from AOV, old topic ID:6847, old post ID:39584

Post by Red Squirrel »

Database server now on new array and running. Server is now synced with database and there is no longer a risk of revert.

However the shard VM itself remains on the degraded array so there is still a risk of downtime, but no data loss.

I still have no ETA for the new hard drives, and with the weekend they won't really move until Monday. According to the tracking number from one retailer, the drives are in Richmond Hill, which is here in Ontario, so once they do ship it should only be a few days.

Archived topic from AOV, old topic ID:6847, old post ID:39585
ggkthx
Posts: 943
Joined: Mon Jan 12, 2009 7:55 pm

Post by ggkthx »

This is quite the adventure. :o:

Archived topic from AOV, old topic ID:6847, old post ID:39586
I didn't choose the Fel life, the Fel life chose me.

Post by Red Squirrel »

Lol yeah, quite the adventure. I can't wait for those drives to come in... it's one of my higher performance RAID arrays so I have a lot on there.

I'm actually due for an overall upgrade to increase capacity since most of my arrays are running low on space and are on fairly old drives, but the cost of living keeps going up so I don't really have money to buy server stuff anymore these days.

Archived topic from AOV, old topic ID:6847, old post ID:39588

Post by Red Squirrel »

So the two replacement drives came in. I pulled out the 2 dead drives and put the replacements in. Running some tests on them to make sure they're good, then will insert them into the array and let it rebuild.

At this point the shard's data is NOT at risk as per my last post about migrating it to another LUN, but the possibility of downtime is still a risk should the array get more drive failures.

I am not too worried though and I think everything will go smoothly. This should be over within 1-2 days.

Archived topic from AOV, old topic ID:6847, old post ID:39644

Post by ggkthx »

Cool cool cool. Hope all goes smoothly!

Archived topic from AOV, old topic ID:6847, old post ID:39645

Post by Red Squirrel »

First round of testing (long SMART test) completed without error on both drives.

Doing a full write test now, then will do a full read-back test. This makes sure there are no bad sectors.
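
For anyone wondering what a write test followed by a read-back test looks like, below is a minimal Python sketch of the idea. It is not the tool used on these drives, and `CHUNK`, `pattern_chunk`, `write_pass` and `read_back_pass` are all names invented for this example. It runs against an ordinary temporary file rather than a real disk; pointing something like this at a block device would destroy everything on it, and a real test also has to bypass the OS cache so the reads actually come from the platters, which this sketch does not attempt.

```python
import hashlib
import os
import tempfile

CHUNK = 1024 * 1024  # test in 1 MiB blocks (arbitrary choice for the sketch)


def pattern_chunk(index, size=CHUNK):
    """Deterministic block of bytes derived from the chunk index."""
    seed = hashlib.sha256(index.to_bytes(8, "little")).digest()
    return (seed * (size // len(seed) + 1))[:size]


def write_pass(path, total_bytes):
    """Fill the target with the known pattern and flush it to storage."""
    with open(path, "wb") as f:
        written = 0
        index = 0
        while written < total_bytes:
            n = min(CHUNK, total_bytes - written)
            f.write(pattern_chunk(index, n))
            written += n
            index += 1
        f.flush()
        os.fsync(f.fileno())


def read_back_pass(path, total_bytes):
    """Re-read the target; return byte offsets of chunks that do not match."""
    bad = []
    with open(path, "rb") as f:
        offset = 0
        index = 0
        while offset < total_bytes:
            n = min(CHUNK, total_bytes - offset)
            if f.read(n) != pattern_chunk(index, n):
                bad.append(offset)
            offset += n
            index += 1
    return bad


# Demo on a throwaway file standing in for the drive.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "fake_drive.img")
    size = 3 * CHUNK + 123
    write_pass(target, size)
    bad_offsets = read_back_pass(target, size)

print("mismatched chunk offsets:", bad_offsets)
```

Any offset that comes back in the list marks a region that did not return what was written to it, which is what a bad sector would look like. On a real drive a purpose-built utility such as `badblocks` in write mode is the more usual way to do this.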

It's so odd looking at the stats and seeing a drive with only a few power-on hours compared to like 60 thousand lol. The old drives had a pretty good run.

Archived topic from AOV, old topic ID:6847, old post ID:39646

Post by Red Squirrel »

All tests were good. Rebuild in progress!

Image

Archived topic from AOV, old topic ID:6847, old post ID:39647

Post by Red Squirrel »

Everything is good now. The RAID array is nominal.

I have 2 other drives on the way which I'll keep as spares as I do have more drives showing errors.

Archived topic from AOV, old topic ID:6847, old post ID:39650