Home raid failure
- Red Squirrel
- Posts: 29209
- Joined: Wed Dec 18, 2002 12:14 am
- Location: Northern Ontario
I just lost my entire home raid array, which means I lost my entire home infrastructure. This will greatly delay the release date, as I will need to buy all new drives: I don't know which ones failed, since they are all showing up as OK now but can't be trusted. I will then have to restore backups and reconfigure everything. The dev and test environments as well as all the code for the shard are on this raid array... well, were. Fortunately, I have multiple backups; it will just take a while before I can get everything together.
I will post as I progress. I'm still trying to wrap my mind around this disaster and trying to figure out the best approach to take, but it does look like it's lost given 2 drives dropped at the same time. Next time I am going with a raid 6 or 61.
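For context on the raid 6 remark, here is a minimal sketch of how many simultaneous drive failures each layout is guaranteed to survive. The names are hypothetical, "61" is assumed to mean two raid 6 sets mirrored against each other, and the thread does not say which level the failed array used, so raid 5 appears only as a single-parity comparison.

```python
# Guaranteed number of simultaneous drive failures each layout survives.
# "61" is assumed here to mean a mirror of two RAID 6 sets: each set only
# fails once it has lost 3 drives, and both sets must fail before data is
# lost, so any 5 failures are survivable.
GUARANTEED_TOLERANCE = {"5": 1, "6": 2, "61": 5}


def survives(level: str, dropped: int) -> bool:
    """Return True if this level is guaranteed to survive `dropped` failed drives."""
    return dropped <= GUARANTEED_TOLERANCE[level]


if __name__ == "__main__":
    for level in ("5", "6", "61"):
        print(f"raid {level}: survives 2 dropped drives -> {survives(level, 2)}")
```

Under these assumptions, two drives dropping at once is fatal for a single-parity array but not for raid 6, which matches the reasoning in the post.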
Also never buy Hitachi drives. They are the biggest pieces of shit ever. I actually have 5 of them arriving from an RMA, and I will be RMAing 2 more. They are pure suck. They are cheap for a reason.
Archived topic from AOV, old topic ID:5921, old post ID:35821
Honk if you love Jesus, text if you want to meet Him!
- Red Squirrel
UPDATE
Good news, I managed to recover the array. The only thing is I do not trust it, so my goal is to get a full snapshot backup of it, and then I will wait until the new drives arrive. Still debating what drives to get.
At the end of the day, the best news of all is that I do have backups, and the last job ran August 16th. I will run another right now while the data is available.
So nothing has been lost, this is just going to cause a delay in the release date.
And this is only affecting development and testing, not the live shard.
Archived topic from AOV, old topic ID:5921, old post ID:35822
- Red Squirrel
One drive dropped out, but the backup is still running. The array is degraded right now, but as long as another drive does not drop, I'm OK. I'm backing up to separate media that held an older backup; that way the backup that is 2 days old is still safe in case this run fails or gets corrupted.
I will try to do a rebuild overnight to see what happens, then I'll do an fsck of the array.
The new drives have been ordered. I also ordered two new backplanes in case it turns out it's that, and not the drives.
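Since the server uses Linux raid (as mentioned later in the thread), a degraded array can be spotted from the text of /proc/mdstat. The following is a minimal sketch, not the check actually used here: the function name and the sample text are illustrative assumptions, and the sample only imitates the usual layout of that file.

```python
import re

# Illustrative sample only; it imitates the usual /proc/mdstat layout.
SAMPLE_MDSTAT = """\
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdb1[0] sdc1[1] sdd1[2](F) sde1[3]
      2930279808 blocks level 5, 64k chunk, algorithm 2 [4/3] [UU_U]

unused devices: <none>
"""


def degraded_arrays(mdstat_text: str) -> dict:
    """Map each degraded md array name to its member status string (e.g. 'UU_U')."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)  # remember which array the next status line belongs to
            continue
        # Status looks like "[4/3] [UU_U]": expected members / active members.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if current and status:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                degraded[current] = status.group(3)
            current = None
    return degraded


if __name__ == "__main__":
    print(degraded_arrays(SAMPLE_MDSTAT))
```

On a live system the same function could be fed the contents of /proc/mdstat instead of the sample string.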
Archived topic from AOV, old topic ID:5921, old post ID:35824
- Red Squirrel
The backup was successful. I tried to re-add the drive afterwards and it crapped out with tons of I/O errors. I moved the drive to another bay and it seems fine; I'll know when I get back home.
Suspecting the backplane. Going to wait till the new parts arrive and go from there.
Archived topic from AOV, old topic ID:5921, old post ID:35829
- Red Squirrel
I did a fsck on the raid volume as there were a lot of errors. Corrected all FS errors. In the crash I did lose all my VMs (including DEV1 and TC1), which I am currently restoring from a backup. I may have lost some torrents that were active (big deal), but other than that things seem to be good so far now that I moved that drive.
I'm hoping the replacement drives and backplanes will arrive soon as I still want to change that out completely before I put any kind of stress on the system.
Archived topic from AOV, old topic ID:5921, old post ID:35834
- Red Squirrel
I am restoring VMs from backup and so far things are looking good. No errors yet or anything. I will monitor things for a while but it's looking good. Hopefully the new backplanes and drives solve the issue completely.
TC1 is back up as of now, assuming nothing crashes again.
Archived topic from AOV, old topic ID:5921, old post ID:35839
- Red Squirrel
So far so good, no signs of errors. Leaving it alone until the new hardware arrives. I'm still going to replace the backplanes and all the drives and cables, just to be on the safe side.
Archived topic from AOV, old topic ID:5921, old post ID:35889
- Red Squirrel
The parts just came in today, that was surprisingly fast.
I will be bringing the environment offline shortly to begin the physical rebuild of the raid array. Will most likely be down for a few hours.
Archived topic from AOV, old topic ID:5921, old post ID:35904
Home raid failure
Whenever I am having a hard time falling asleep I come to the forums and check threads like this one in hopes of new updates to read. x1000 better than counting sheep.
Archived topic from AOV, old topic ID:5921, old post ID:35911
- Red Squirrel
Haha, it's always been my goal to keep everyone up to date, even with minor details. I find a lot of shards and other services fail to do this. Even web hosts, for example: when there are issues they try to hide everything, instead of being straight to the point about what's happening, the progress, etc.
That said, when I went to open the box for the backplane I realized one of em got backordered. Though I did install one of em, and decided to slightly mod the case as I don't like how on one side the bays are vertical and the other side they are horizontal. Maybe I just have OCD, but I like things to be symmetrical, at least to a certain point.
Right now the server is booted up with one backplane. These ones hold 5 drives instead of 4, so it ended up nicely. Only have one drive dangling inside off a cable. When the other backplane arrives I'll bring it down again to set it up properly.
I am hoping heat won't be an issue, but I still have some tweaking to do on the airflow. It will involve caulk and be very dirty, but it will get the job done.
Services are coming back up now, monitoring to ensure I don't get any IO errors. I plugged one of the bays into the controller that was connected to the other bay that was failing. I want to rule out the controller.
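When watching for I/O errors while swapping bays and controllers, it can help to count errors per device, since errors that stay with one drive point at the drive while errors that stay with a bay or port point at the backplane or controller. A minimal sketch of that tally follows. The function name and the sample log lines are illustrative assumptions, and the exact kernel message wording varies between kernel versions.

```python
import re
from collections import Counter

# Matches the common "I/O error, dev <name>, sector <n>" fragment of kernel messages.
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+)")

# Illustrative sample lines only, not taken from this server.
SAMPLE_LOG = [
    "end_request: I/O error, dev sdc, sector 123456",
    "md: bind<sdb1>",
    "end_request: I/O error, dev sdc, sector 123464",
    "end_request: I/O error, dev sdd, sector 98765",
]


def io_errors_by_device(log_lines) -> Counter:
    """Count kernel I/O error lines per block device name."""
    counts = Counter()
    for line in log_lines:
        match = IO_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    for device, count in io_errors_by_device(SAMPLE_LOG).most_common():
        print(device, count)
```

Comparing the tally before and after moving a drive to a different bay or controller is one way to narrow down which part is at fault.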
Archived topic from AOV, old topic ID:5921, old post ID:35913
Home raid failure
In all seriousness it's nice to get the updates simply because it shows you're working on the shard. I'm just giving you a hard time because most of the time I have no idea what you're talking about technology-wise.
Archived topic from AOV, old topic ID:5921, old post ID:35914
- Red Squirrel
OK, so it crapped out again. So I guess it was not the backplanes. It could still be the drives, but I'm starting to suspect the controller. That means more money to spend. Ugh. I'm already over 1k in debt, this is really not looking good. I just want this stupid thing to work.
Archived topic from AOV, old topic ID:5921, old post ID:35919
- Red Squirrel
Been continuing to troubleshoot this issue.
All brand new drives and one new backplane (the other is still backordered). Most of the drives are in the new backplane; the remaining one is just sitting inside the case, directly attached.
Still random crashes/errors. Doing a RAM test overnight, maybe it's the memory that's bad.
Next step is a new SATA controller.... this is getting expensive.
If the SATA controller does not do it, then I'm at a loss. I'll have to try reinstalling Linux, maybe something got corrupted, but that's a good week+ of solid work, reconfiguring everything.
I'll let this memtest go overnight and see what happens. Memtest86 would not even start, it just locked up, so I'm using a different utility. I'm kinda hoping it's the RAM as that is an easy and cheap fix.
Archived topic from AOV, old topic ID:5921, old post ID:35972
- Red Squirrel
RAM tested OK. Ordered a new SATA controller to give that a try...
Stuff is back up right now, just not sure for how long. It usually crashes overnight, so we'll see. I'm really hoping the new controller will get the job done. I'm sick of spending so much money on this thing, and waiting for orders to come in sucks. The 2nd backplane is STILL backordered. Guessing it's coming from China. Could be another few weeks.
Archived topic from AOV, old topic ID:5921, old post ID:36003
- Red Squirrel
So far she's holding up....
The second backplane and the controller have arrived today, so I will install those tonight. That should hopefully be the end of these outages and the server should be stable again.
Archived topic from AOV, old topic ID:5921, old post ID:36042
- Red Squirrel
The backplane has been installed. Unfortunately the SATA card is not compatible... it does not have a "no raid" function. I just assumed all raid cards had that function... apparently not. (The server uses Linux raid for easier compatibility between hardware.)
That said, the server with the existing controllers (motherboard + two small PCIe cards) has been holding up for over 3 days. I think replacing the Hitachi drives may have done the trick. Going to be returning the controller and calling this a done deal if it continues to hold up.
At this moment the backplane is installed and my whole environment is fully operational, including TC1. Further development will continue. This issue has been one of the delays in the artifact scaling, which will follow the release.
I won't call this a done deal yet, but I'm putting it to sleep for now, as I think all is good even without a new controller. Time will tell.
Archived topic from AOV, old topic ID:5921, old post ID:36054