Distributed Proofreaders Outage Update: Monday, March 3, 2014

March 3, 2014

Unfortunately we have not been able to bring the server (prod6) back online remotely. After evaluation, we have decided to prepare a new server and migrate DP to that machine earlier than originally planned.

As noted previously, the new server is one of two identical machines graciously donated by Tom Kowal.

Prod6 has (literally) served DP well since 2006 and was slated for replacement in summer 2014 but, given the current situation, making that transition now should minimize the total disruption of service.

The new machine (prod7) has been in preparation since early this morning. It already has the upgraded OS installed and is now being configured with the necessary software. In addition to being more recent hardware, prod7 gives us two additional drive bays, a battery-backup unit on the drive controller, twice the RAM, and twice as many CPU cores as prod6. Those factors, combined with a move to the ext4 filesystem, should make the server a bit more responsive under heavy load.

Brief stats: 2x Xeon @ 2.00GHz, 4 cores each; 8 GB RAM; 8x 300 GB drives (2 as a RAID mirror for the OS, 4 as RAID-5 for the data, and 2 hot spares).
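
For anyone curious what that drive layout works out to, here is a rough back-of-the-envelope sketch. It assumes each drive is a nominal 300 GB and ignores filesystem and controller overhead, so the real usable figures will be somewhat lower. The function name is made up for illustration and is not part of any DP tooling.

```python
def usable_capacity_gb(level, drives, size_gb):
    """Nominal usable capacity of a RAID array, ignoring all overhead."""
    if level == "raid1":
        # A mirror stores the same data on every drive.
        if drives < 2:
            raise ValueError("RAID-1 needs at least 2 drives")
        return size_gb
    if level == "raid5":
        # RAID-5 spends one drive's worth of space on parity.
        if drives < 3:
            raise ValueError("RAID-5 needs at least 3 drives")
        return (drives - 1) * size_gb
    raise ValueError(f"unsupported RAID level: {level}")


os_gb = usable_capacity_gb("raid1", 2, 300)    # OS mirror
data_gb = usable_capacity_gb("raid5", 4, 300)  # data array
print(f"OS: {os_gb} GB, data: {data_gb} GB")
```

The two hot spares add no capacity. They sit idle until the controller needs one to replace a failed drive.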

Once the build-out is complete, the new hardware will be transported on-site and we will begin transferring the databases, project files, etc. over. This may take an extended time, as we do not yet know exactly what conditions we will be working under.

Transport and on-site work are tentatively scheduled for tomorrow (Tuesday, March 4).

Once again, thanks for your patience. We’re working hard to bring you an improved experience.

David (donovan)

Distributed Proofreaders – Extended Outage

March 1, 2014

Yes, Distributed Proofreaders is experiencing an extended downtime.

The March 1, 2014, OS upgrade took place as scheduled at 10am and went smoothly, completing in just under one hour. However, upon reboot we encountered a boot configuration issue that is preventing the server from getting past the bootloader and back online. Since then, we have tried several approaches to correct this problem remotely (and have come tantalizingly close) but have not yet succeeded. I’ll skip the technical details, but note that it has been incredibly frustrating to be able to see the drives on the controller and the filesystem contents, and yet be unable to fully boot the system.

We do have console access to the machine and have verified that all data is present.

We will continue to work at least part of tonight and tomorrow (Sunday, March 2) to resolve the issue remotely, but we are close to exhausting those options.

If we cannot resolve the issue remotely, then I will have to travel to the server, or the server will have to be shipped to me. In either case, we are probably looking at being down for approximately a week.

It’s worth noting that the shipping option does open up the possibility for us to migrate to one of the two servers which were donated to DP by Tom Kowal. They are underutilized in their current role providing an additional layer of off-site backups.

Please check back here for further updates on our progress. I apologize for the inconvenience and thank everyone for their patience during this outage. Please wish us luck!

Thanks to Casey and Brian for their assistance today, and special thanks to the staff at Interserver for their hands-on help.

David (donovan)
