Ongoing Database Issue [solved]

Posted: Sat Feb 17, 2018 4:29 pm
by Red Squirrel
Looks like the issue we've been facing for the past few days may be related to a failing hard drive in one of the RAID arrays. I've ordered a new drive. That particular RAID array is not actually the one that the server VM is on, but I think when it gets stuck it causes the whole file server to block, which then affects other stuff too. This is just a guess, though, as the issue actually looks more like a network issue, so it's kinda weird. But there is a drive failing, so I may as well tackle that and see if it helps. It has to be done anyway.

As a side note, I do have a code issue with the SQL system, as it should not be corrupting the way it is, even with DB write issues. The system was designed so that situations like this do not actually cause database corruption; the pending data should simply wait until the DB is available again. So I will have to look at that.
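To illustrate the "pending data waits until the DB is available" idea, here is a minimal sketch in Python. It is not the shard's actual code: `PendingWriteQueue`, `DatabaseUnavailable`, and `write_fn` are hypothetical names made up for this example, and the real system may work quite differently.

```python
from collections import deque


class DatabaseUnavailable(Exception):
    """Raised by the (hypothetical) writer when the DB cannot be reached."""


class PendingWriteQueue:
    """Holds records in memory, in order, until the database accepts them."""

    def __init__(self, write_fn):
        # write_fn is an assumed callable that writes one record and
        # raises DatabaseUnavailable if the database is down.
        self._write_fn = write_fn
        self._pending = deque()

    def submit(self, record):
        """Queue a record, then try to flush everything that is waiting."""
        self._pending.append(record)
        return self.flush()

    def flush(self):
        """Write queued records oldest first; stop at the first failure.

        Returns the number of records written during this call.
        """
        written = 0
        while self._pending:
            record = self._pending[0]
            try:
                self._write_fn(record)
            except DatabaseUnavailable:
                break  # leave the record queued and try again later
            self._pending.popleft()  # drop it only after a successful write
            written += 1
        return written

    def __len__(self):
        return len(self._pending)
```

The key point is that a record is only removed from the queue after the write succeeds, so an outage delays data instead of losing or half-writing it.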

I am reluctant to increase the frequency of the backups for now, as it will simply put more strain on the server, so I will keep the backups at the same rate while I wait for the replacement drive to arrive. Shard DB backups already run several times per day.

If by chance you play and do something such as get an artifact and are worried about the crash happening, just shoot me a PM and I can run a backup manually.

Archived topic from AOV, old topic ID:6744, old post ID:39218

Posted: Fri Feb 23, 2018 11:20 pm
by Red Squirrel
The new drive came in, so I will be replacing the failing one in the next few days.

Archived topic from AOV, old topic ID:6744, old post ID:39228

Posted: Tue Feb 27, 2018 12:31 am
by Red Squirrel
Drive replaced and RAID rebuilt.

Will give it a few days to see whether or not that fixes it. I have a feeling it's not the cause, but we'll see. Had to be done anyway.

Archived topic from AOV, old topic ID:6744, old post ID:39229

Posted: Sun Mar 04, 2018 9:24 pm
by Red Squirrel
This may be solved, but I'm not too sure yet. It's related to an overall issue with my storage system where, if there is too much load, things start to crash. I was never able to figure out that issue and it kind of went away on its own, but then it resurfaced and now it's hitting the DB server instead of other VMs. I figured the failing drive was probably not helping, though.

Will continue to monitor.

It's still safe to play; the worst thing that can happen is losing a couple of hours of progress if it does crash. I do need to look at redesigning part of the DB system, as it should not corrupt like this regardless of whether the DB becomes unavailable in the middle of saving (which is what seems to happen during high loads), so I will look at fixing that at some point.
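One common way to keep a save from leaving half-written data when the DB drops out midway is to wrap the whole save in a single transaction. The sketch below shows the idea using Python's built-in `sqlite3` module purely as a stand-in; the shard's actual SQL server, schema, and save code are not shown in this thread, so the `items` table, `save_snapshot`, and the `fail_after` switch (which simulates the outage) are all assumptions for illustration.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical items table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (name TEXT, owner TEXT)")
    return conn


def save_snapshot(conn, items, fail_after=None):
    """Replace the table contents with `items` in one transaction.

    fail_after simulates the database becoming unavailable after that
    many rows have been written. Returns True if the save committed.
    """
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("DELETE FROM items")
            for i, (name, owner) in enumerate(items):
                if fail_after is not None and i >= fail_after:
                    raise sqlite3.OperationalError("simulated: database unavailable")
                conn.execute(
                    "INSERT INTO items (name, owner) VALUES (?, ?)",
                    (name, owner),
                )
        return True
    except sqlite3.OperationalError:
        return False  # previous snapshot is still intact


def load_items(conn):
    """Return all saved items in insertion order."""
    return conn.execute("SELECT name, owner FROM items ORDER BY rowid").fetchall()
```

With this shape, an interrupted save leaves the previous snapshot in place rather than a mix of old and new rows, so the worst case is lost progress since the last good save, not corruption.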

Archived topic from AOV, old topic ID:6744, old post ID:39233

Posted: Sat Mar 31, 2018 12:18 pm
by Red Squirrel
Unfortunately the drive I replaced is not the cause of this. I really don't know what it is at this point. This issue just started randomly with no explanation.

Basically it looks like the network randomly drops for no reason and then it causes the shard to crash hard instead of just trying again later to write whatever it is it's trying to write. The crash logs don't give line numbers because I think it's the core that's crashing and not the main part, so this makes it extremely hard to troubleshoot.
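For reference, the "just try again later" behaviour could look roughly like the Python sketch below. This is only an illustration of the general retry-with-backoff pattern, not the shard's code: `write_with_retry`, `write_fn`, and the default attempt count and delays are hypothetical choices.

```python
import time


def write_with_retry(write_fn, payload, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Try a write several times, doubling the wait between attempts.

    write_fn is an assumed callable that raises ConnectionError or
    TimeoutError when the network drops. Returns True once a write
    succeeds, or False if every attempt fails, so the caller can keep
    the payload and try again later instead of crashing.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            write_fn(payload)
            return True
        except (ConnectionError, TimeoutError):
            if attempt == attempts:
                return False  # give up for now; do not raise
            sleep(delay)
            delay *= 2  # back off so a struggling server is not hammered
    return False
```

Passing `sleep` in as a parameter is only there so the function can be exercised without actually waiting; in normal use the default `time.sleep` applies.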

Given the shard is pretty much dead, I'm just going to keep restoring backups for the time being every time this happens. If you log in and everything is missing, assume that the issue happened and that it will get restored. I do get alerts on my phone when it crashes, so chances are I already know about it when it happens; I might just be at work or sleeping or something.

Archived topic from AOV, old topic ID:6744, old post ID:39251

Posted: Sun Jul 29, 2018 3:59 pm
by Red Squirrel
Happened again.

Restored backup from Fri Jul 27 01:00:27 EDT 2018.

I have summer projects I've been working on, but I do seriously want to get back into coding a bit for the shard at some point, mostly back end fixes though, and this is one of them.

Archived topic from AOV, old topic ID:6744, old post ID:39294

Posted: Sun Sep 02, 2018 12:30 pm
by Red Squirrel
Had another crash.

Restored DB from: Sun Sep 2 06:00:08 EDT 2018

I still have to figure out why these crashes keep happening, but I also have an idea to redesign the DB system to be more efficient, so I might just do that, and it might by chance fix the crash issue too. Not that the shard is active enough nowadays to justify putting this kind of work into it, but the whole idea is that I want it to be set and forget... and right now it's not.

As always, let me know if you see any major issues, but everything should be normal as of the backup date.

Archived topic from AOV, old topic ID:6744, old post ID:39302

Posted: Wed Sep 05, 2018 4:55 am
by Red Squirrel
I made some changes to the file system/program files of the shard and restructured a few things. Long story short, I made it so the shard's data files (executables) are local to the VM instead of on an SMB share. One hunch I have is that when disk IO on my network grinds to a halt during backup jobs (I still don't know why it does that), it causes some SMB faults, which in turn crash the whole server.

The database issue is not fixed, but if I can at least fix the crashes then the database issue will stop happening.

So I will leave it at that for the time being and hopefully the crashes stop. Either way, my next step is to redesign the snapshot portion of the database system. I actually implemented it quite poorly and it could be done better, so I'll want to do that. It will also put less strain on the SQL server, so that will be a bonus.

The shard should be running as normal now in its new environment setup. Let me know if there are any weird issues.

Archived topic from AOV, old topic ID:6744, old post ID:39303