frankd

Silent Write Errors On Firewire (1394a) Drives

Has anyone out there intensively tested a firewire-based drive to see if they are getting silent write errors?

We have found that while WinXP is not reporting any errors, we are getting a substantial number of bad bits written to disk, both hard drive (WD1200JB) and DVD-RAM (Panasonic SW9571). Our external enclosure is Oxford OXFW911-based, and we have brought the firmware up to the current rev (27 Jan 2003).

Ordinary file usage of things like jpegs, mpegs, mp3's, and even zip files doesn't demonstrate any problems, but the bit errors are there as demonstrated by checksum. (We are using HashCalc to compute checksums.) The simple test is to copy a large file (we've been using 4gig) from the system drives to a firewire drive and then compare checksums -- they should (MUST) match, but don't!
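For anyone who wants to script the same check instead of running HashCalc by hand, here is a minimal sketch in Python. The function names and the choice of MD5 as the default are assumptions for illustration, not part of the original test setup. It hashes both files in chunks so a 4 GB file does not need to fit in memory.

```python
import hashlib


def file_digest(path, algorithm="md5", chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so multi-gigabyte files are never loaded whole."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(source_path, copy_path, algorithm="md5"):
    """Return True when the source file and its copy hash to the same value."""
    return file_digest(source_path, algorithm) == file_digest(copy_path, algorithm)
```

Calling `copies_match` with the path on the system drive and the path on the firewire drive gives a yes/no answer. One caveat: a read-back done right after the copy may be served partly from the operating system's file cache rather than from the drive, so disconnecting and reconnecting the drive before checking is the safer test.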

If anyone else out there has an Oxford-911 based enclosure on firewire, I'd appreciate hearing if your enclosure is error-free... (I have a really hard time believing that this problem is somehow related to this specific system; we're running the correct drivers, have eliminated all other devices, and are using quality cables throughout, even from the bridge board to the drive.)

Thanks!

-frank


Very interesting.

I have not manually checked checksums, but I do use DVDs for Disk Image backups.

Restoring from these DVDs has never caused an error. I know Drive Image uses checksum verification to an extent... but I'm not sure if it tolerates errors (I would assume it does).

I'll have to try looking myself. :)

DogEared


I don't know if this relates, but my older BX system produced the same kind of corruption on the hard drive once I added some memory. I was overclocking, and the computer was stable. When I copied large files (I had a few large .zips ranging from 50 MB to 1600 MB), they all got corrupted randomly. No error messages, just a CRC error when I tried to unzip them. Same thing with .exe files (installing an AV program worked fine from the source drive, but was a no go from the copied version). Smaller files worked fine.

Solution (for me): Drop the OC. If you are not overclocking, I'd suggest running MemTest86 to check the memory.

Cheers,

Jan


About 2-3 years back, when I was testing and breaking in my two HDD enclosures, I had two recurring data-corruption issues. Testing was done with RAR volumes (IIRC 10MiB) and multi-GiB video (~4-10GiB). I used WinRAR and hashes to check. I forget whether I used HashCalc, but I ran CRC32 and MD5s. I also used some kind of transfer-rate-limiting program to test Explorer copies.
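A rough sketch of that kind of check for a whole set of files (for example a folder of RAR volumes) might look like the following. The function names are made up for illustration, and the sketch only looks at files directly inside the two folders. It computes CRC32 and MD5 in a single pass per file and lists the files whose copy is missing or different.

```python
import hashlib
import os
import zlib


def crc32_and_md5(path, chunk_size=1024 * 1024):
    """Return (crc32_hex, md5_hex) for one file, reading it only once."""
    crc = 0
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)  # running CRC over all chunks so far
            md5.update(chunk)
    return format(crc & 0xFFFFFFFF, "08x"), md5.hexdigest()


def compare_directories(source_dir, copy_dir):
    """Return the sorted names of files whose copy is missing or differs."""
    bad = []
    for name in sorted(os.listdir(source_dir)):
        source_path = os.path.join(source_dir, name)
        if not os.path.isfile(source_path):
            continue  # this sketch does not descend into subdirectories
        copy_path = os.path.join(copy_dir, name)
        if not os.path.isfile(copy_path):
            bad.append(name)
        elif crc32_and_md5(source_path) != crc32_and_md5(copy_path):
            bad.append(name)
    return bad
```

An empty list from `compare_directories` means every file in the source folder has an identical copy; any names returned are the ones worth re-copying or inspecting.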

One: Buggy chipsets (KT133/A) -> required patches and playing with the PCI slots of the firewire cards. Hopefully the modern chipsets aren't as bad.

Two: Bad cables or connectors in the enclosure/bridge, or drive sensitivity -> required dropping the bus speed down.

I have pushed ~2.5TiB through both since I installed them. They are cheap little 5.25in, oxsemi 911, random USB1.1 ASIC, 3x1394a connector enclosures from compgeeks. IBM 60GiB 60GXP drives are bracket mounted, and powered on ~24x7 through a non-AVR UPS until this past October.


Is this truly silent? Do you see any errors in the system logs?

Overall, this does not surprise me, as I have been seeing quite a lot of "non-silent" write errors with firewire drives. There seem to be a number of places where this is documented - not on manufacturer sites - and the general assumption seems to place the known issues in the general category of "if the transfer is too fast, the system will probably not be able to properly handle it."

There is a site that discusses this a bit and has a utility to check for one of the known firewire issues:

http://www.bustrace.com/products/delayedwrite.htm

I recently sent the following email to the folks at Granite (who use the Oxfords), though am still awaiting a reply:

We have 3 of your external firewire enclosures with removable bays (a total of 7 drives, including the extra trays we've ordered). These are not of the S.M.A.R.T. variety. We also have two external fixed enclosures. All are of the standard 1394-"A" variant. All connected via your FireVue cables, either directly to controllers or via one of our three FireVue 6 port hubs.

The drives used are one 40GB Fujitsu 5400RPM, one Maxtor of about 80GB at 5400RPM, two 80GB IBM 7200RPM drives with 2MB cache, one WD 120GB drive with 2MB cache, and the rest WD 120GB drives with 8MB cache ("Special Edition" drives). The drives are about half and half FAT32 and NTFS.

Computers used: Two IBM M Pro Intellistation Workstations (Dual 1.8GHz Xeon) with Adaptec 4300 Cards - using either TI or standard drivers, One IBM Single CPU Intellistation with Creative Labs Audigy 1394, One HP/Compaq xw8000 Dual Xeon 3GHz with built-in 1394, One Toshiba Laptop with built-in 1394, One Dell Laptop with Adaptec PCMCIA 1394, Three Tyan Motherboard P4 2.8GHz with both Creative Labs Audigy 1394 and Adaptec 4300 Cards - using either TI or standard drivers.

The HP and the Toshiba are XP Pro systems; the rest are W2K Pro. All are 100% up to date with OS, drivers, and BIOS, and, using your utilities, the Granite devices are also flashed to the latest firmware (by the way, I was surprised that even the devices we recently ordered from you did not have up-to-date firmware).

All of the above is mentioned because what I am about to describe happens with 100% repeatability under any and all possible combinations of the hardware above.

What happens? Well, if we use the drives in a "standard" firewire method - e.g., connected not to a computer, but for example to a video device such as any of the Videonics FireStore products, they record and playback perfectly (the FAT32 devices anyway, Videonics does not support NTFS, which is the only reason they are not all NTFS). If we use the drives under more or less "low stress" conditions connected to computers, they also are fine.

If we use the drives under more "high stress" conditions we get the following (example):

"Windows - Delayed Write Failed" Windows was unable to save all of the data for the file \Device\HarddiskVolume2\$Mft. The data has been lost. This error may be caused by a failure of your computer hardware or network connection. Please try to save this file elsewhere.

There is no drive activity at the time of the error - the transfer has aborted. The system at that point is more or less useless: very few actions work, and it is nearly impossible even to reboot. Even after several HOURS of trying to shut down and closing many, many instances of these dialogs, it always requires a hard reboot. Merely clicking on the file in question locks the system even after the reboot - to remove it you must either delete the directory where it is found or do an "invert selection" and hit the delete key.

The high-stress conditions that always create the error seem to be BOTH of the following: (1) dealing with larger files (2+GB, often 10-60GB) and (2) a fast source. For example, trying to copy a 40GB video file off the HP with the source being a Rourke Dual Channel SCSI RAID rated for 4 uncompressed HD video streams. It need not be that extreme - just copying 4GB files off of one of the Tyan systems' internal 7200 RPM "Special Edition" IDE drives to the firewire device will also do it. Working at "DV" speeds or less is always fine and error-free, no matter how large the file (e.g., rendering a 40GB file overnight is fine).

The system log records a huge number of "Controller Errors" during these failures.

I have tried, but it seems that it is not possible to disable the write cache on these devices. I un-check the box and click "OK", but when I go back it is still checked.

This is not anti-virus related, as the IBM systems are non-networked, dedicated video rendering systems that do not have AV software, because AV software screws up the video renders.

Any suggestions? Not being able to transfer files to these devices makes them rather limited in their utility, and generally tends to make me view 1394 drives as nowhere near ready for prime time (though truly I even tend to think that of IDE).
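For readers who want to reproduce the "large file, written as fast as possible" condition without a big source file, a rough sketch in Python follows. Everything here is an assumption for illustration: the function names, the 1 MiB block size, and the use of SHAKE-256 to generate a repeatable pattern. It writes a known pattern to the target drive, then reads it back and reports which blocks came back wrong.

```python
import hashlib
import os


def pattern_block(index, block_size):
    """Deterministic pseudo-random block derived only from its index."""
    seed = index.to_bytes(8, "big")
    return hashlib.shake_256(seed).digest(block_size)


def write_pattern(path, total_blocks, block_size=1024 * 1024):
    """Write total_blocks pattern blocks to path with no pause between writes."""
    with open(path, "wb") as handle:
        for index in range(total_blocks):
            handle.write(pattern_block(index, block_size))
        handle.flush()
        os.fsync(handle.fileno())  # ask the OS to push cached data to the device


def verify_pattern(path, total_blocks, block_size=1024 * 1024):
    """Return the indices of blocks that do not match the expected pattern."""
    bad_blocks = []
    with open(path, "rb") as handle:
        for index in range(total_blocks):
            if handle.read(block_size) != pattern_block(index, block_size):
                bad_blocks.append(index)
    return bad_blocks
```

With the default block size, `write_pattern(path, 4096)` produces a 4 GiB file. Generating the pattern takes CPU time, so this may not push data as fast as a SCSI RAID source would, and the read-back is best done after reconnecting the drive so it is not answered from the file cache.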


Have you heard back from Granite yet regarding your problem?

After having done some extensive troubleshooting of my own Oxford-911 based firewire enclosures (from different manufacturers), along with some help from the bustrace folks, I've managed to get my enclosures working error-free. And they have been for the past month or so, even under heavy load.

At first I tried lowering the DMA mode on the enclosures from UDMA 5 to something lower. At one point I thought I had a reliable alternative when I had it set to UDMA 2 (not good for performance at all) and I didn't get the Delayed Write errors when I did my test. After a few days under heavy load, the Delayed Write errors reappeared, so scrap that for a solution. I didn't want something that wasn't reliable AND had bad performance.

Then I tried the Max Block size setting. Both my enclosures were set to 1024. The max they could be set to for the Oxford 911 was 2048. Experimenting with different values, I found that if I had the Max Block size set to 2048, the Delayed Write errors disappeared, even under VERY HEAVY load. I could then set the UDMA back to UDMA 5, and I got the performance and reliability that I had expected when I first bought the enclosures.

Note: To play around with the settings on the enclosure, I used the Firmware Uploader tool from Oxford Semiconductor. You can download it from many places, but the place I downloaded it from was this (http://www.newmotiontech.com/new/download/index.htm). It's the one labelled Oxford 922 update, but it can be used for Oxford 911 chips as well. There is also a user guide for the uploader at the Oxford website (http://www.oxsemi.co.uk/cgi-bin/general/home.cgi).

Important: If you don't have to, don't actually upload any new or old firmware to the enclosure. If the current firmware on the enclosure was specially written or has special functions, you will lose them. At worst, your enclosure may be rendered inoperable until you reflash with the original (which is easier said than done). I managed to resolve the Delayed Write errors without updating any firmware, just by changing the settings.

Let me know if this has helped anyone with the Delayed Write errors.

Note that Delayed Write errors occur for a multitude of reasons, some of which are:

1) incompatibilities between firewire chipset on computer and enclosure

2) virtual memory settings

3) actual bad or corrupted hard disk

4) bad power connection on the enclosure

5) other misc reasons

It's basically a message that says that something bad happened while accessing the hard drive, but nothing specific.


Thank you, I think I will give this a try. I may wait until the end of the week until I see if Granite comes up with anything, though, as they have indicated that if I use the Oxford software that will negate their custom firmware.

I do appreciate your follow up. Given how widely the Oxfords are used - ADS uses them now, and WD came back with "Oxford 911" as their answer to my question to them about their bridges - I am surprised that this is not a more widely known issue. It must be that the usage patterns are just not that intense for most folks and their 1394 drives.


Well, the Oxford utilities will not configure 911s with the Granite firmware - they don't even recognize the devices. I could treat them as "blank" and use the utility to build up from scratch, but at the moment that seems a wee bit more drastic than waiting for Granite to respond.

Thanks for the info, though - it may help others.


Well,

How do you get the checksum?

You have to check what the box is getting and compare it to what you wanted to send.

If the data is corrupted before getting to the external drive, the drive will return the corrupted data without any error!

You will need an IEEE 1394 bus analyzer...

Good luck!

MEJV
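Short of a bus analyzer, a byte-level comparison of the source file and the copy can at least show what the corruption looks like (isolated single-bit flips, whole runs of bytes, regular spacing), even though it cannot say where along the path the damage happened. The following is a minimal sketch with made-up function names, not a tool anyone in this thread used.

```python
def find_differences(source_path, copy_path, chunk_size=1024 * 1024, limit=100):
    """Return up to `limit` tuples (offset, source_byte, copy_byte) where the files differ.

    Only the overlapping part is compared; the scan stops if one file is shorter.
    """
    differences = []
    offset = 0
    with open(source_path, "rb") as source, open(copy_path, "rb") as copy:
        while len(differences) < limit:
            a = source.read(chunk_size)
            b = copy.read(chunk_size)
            if not a and not b:
                break  # both files fully read
            if a != b:
                for i in range(min(len(a), len(b))):
                    if a[i] != b[i]:
                        differences.append((offset + i, a[i], b[i]))
                        if len(differences) >= limit:
                            break
            if len(a) != len(b):
                break  # one file ended early
            offset += len(a)
    return differences


def flipped_bits(differences):
    """Count how many individual bits differ across the reported bytes."""
    return sum(bin(src ^ dst).count("1") for _, src, dst in differences)
```

If every reported byte differs by a single bit, or the offsets fall at regular intervals, that is a useful clue to pass along to whoever is debugging the bridge, cable, or host controller.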


This is the reply from Granite. I objected, based on the notion that if 9 computers, four devices and about 3 dozen cables all exhibited the exact same failure, then it is tough to blame the cable:

Yes, the Granite products set the max block size to 2048. This is true both for the OXFW911-based products as well as the OXUF922-based products. Granite has been using the 2048 block size setting since the original pilot production in early 2001.

Note: Setting the block size to a smaller value reduces the maximum throughput which can be achieved.

Regarding the "delayed write failed" errors, my guess, and this is just a guess, is that there is a data integrity problem on the cable. This could be the result of a poor connection or a poor cabling setup. By changing the max block size to 2048, you halve the number of packets which need to be sent in order to move the same amount of data. It's possible that this statistically reduces the number of failed transfers. Keep in mind that there are automatic retries for corrupted packets in FireWire. The automatic retries may be masking the presence of a high packet failure rate, so failed packets might be happening more frequently than one realizes. Only those transfers which have more failures than the maximum number of retries will appear as "delayed write failures". So sending fewer packets (i.e., larger packet sizes), or any reduction in the chance of a given packet being corrupted, may result in a system that appears to be working better.

Again, just a guess. But, I would look at cabling and the quality of connections.
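The retry argument in the reply above can be put into numbers with a toy model. This is only a sketch under stated assumptions: packet errors are independent, each packet has the same chance of being corrupted regardless of its size, and a packet is abandoned after a fixed number of retries. The function name, the error rate, and the retry count are all hypothetical, not values from Granite or the 1394 specification.

```python
import math


def transfer_failure_probability(total_bytes, packet_bytes, packet_error_rate, max_retries):
    """Chance that at least one packet in the transfer exhausts all of its retries."""
    packets = -(-total_bytes // packet_bytes)  # ceiling division
    # A packet is lost only if the first attempt and every retry are all corrupted.
    packet_gives_up = packet_error_rate ** (max_retries + 1)
    if packet_gives_up >= 1.0:
        return 1.0
    # 1 - (1 - q)^n, written with log1p/expm1 to stay accurate for very small q.
    return -math.expm1(packets * math.log1p(-packet_gives_up))


four_gib = 4 * 1024 ** 3
small_packets = transfer_failure_probability(four_gib, 1024, 1e-3, 3)
large_packets = transfer_failure_probability(four_gib, 2048, 1e-3, 3)
print(small_packets / large_packets)  # close to 2 under these assumptions
```

Under these assumptions, halving the packet count roughly halves the chance of a visible failure, which matches the reasoning in the reply. If corruption instead happens at a fixed rate per bit, a 2048-byte packet is about twice as likely to be corrupted as a 1024-byte one, and the same function then no longer favors the larger packets.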


The following is evidently the final word from Granite on the subject. Back to SCSI....

First, let me say you have done a good job of investigating the Microsoft problem. Cabling is certainly not the only issue... just possibly one, and if you are using our boxes and our cables you will find that we produce the best there is. Another thing I have seen that might reduce problems is the use of a hub. We have also had some added luck with the new FW800 1394B Host Adapters. Microsoft has still not issued a 1394B driver, but most of the B hosts still work. This new TI chip has a bigger buffer and seems to eliminate some problems.

As you mentioned there are a variety of items that can cause this error and until we get some new input from Microsoft there will continue to be some mystery associated with it.

Best regards, frank
