On Mar 03, 2004 10:20 PM, "McAllister, Andrew" <McAllisterA@(protected)> wrote:
> > -----Original Message-----
> > From: Michael Hasenstein [mailto:mha@(protected)]
> snip
> > Miquel Colom wrote:
> snip
> > > 1-Hangs reported due to FC cards. SOLVED with firmware upgrade.
> We don't have FC cards and our Perc4/DCs are at 3.28/1.05 firmware etc.
>
> > > 2-Hangs suspected to be due to using framebuffer. DISABLING fmb not
> > > tested.
> We never had a problem when using the console.
>
> > > 3-Hangs due to asynch io. SOLVED by applying a RHEL 3.0 (on SLES8,
> > > is that correct)?.
> >
> > Not RH specific, it is an Oracle bug with the stub libraries.
> > BUG 3016968 - ASYNCIO FUNCTIONALITY IS NOT WORKING
> We haven't applied this patch but are researching now.
>
> > > 4-Hangs due to bad reiserfs filesystem. SOLVED with reiserfs fsck.
> > > Correct?
> Someone reported this. We'll be fscking next reboot.
>
> > > 5-Finally, Andrew is experiencing hangs due to high load. There is
> > > here no FC card, but there is async io and reiserfs. Also there is
> > > a note that taking out the broadcom cards contributes to a better
> > > uptime. This can be a driver problem or an IO-APIC issue. Hangs not
> > > reproducible on test system, only in production (sigh).
> We can't use ext2 as our database files are all 2 gig and some of our
> nightly data loads come in as files > 2 gig (like 6 gig).
>
> > >
> > > Do I miss something?
> >
> > I'd be interested to see this with ext2, I'd like to know if reiserfs
> > is involved, directly or indirectly (triggering a bug somewhere else)
> > doesn't matter.
> Snip
>
> Our open TAR number with Oracle was sent to Michael off-list.
>
> Our production system is in its last stages before meltdown as of right
> now. We're limping along until after business hours. One of the two
> listeners is dying every couple of hours (this is also a symptom that we
> thought was gone with the pro100 card install). And now a user is
> reporting corrupt database block errors. We'll run the fsck on next
> reboot.
>
> Any other ideas as to what to look for? If this thing hangs between 8-5
> US/Central we'll have to bring it back up immediately, if after 5 we may
> have an hour or so to tweak or check settings. I'd be happy to run any
> non-destructive test or check of settings while the machine is on its
> last legs. Obviously if it is a true hang, we'll have to power cycle.
>
> Another interesting development...
> We set up a test 2650 with SLES 8 SP3 and 2.4.21-190-smp. We've also
> built a stress test database and set of data files with load scripts
> that will continuously load data and check for errors.
>
> This test database on our standby 6650 will produce "ORA-12599:
> TNS:cryptographic checksum mismatch" and "ORA-03113 end-of-file on
> communications channel" after about 15 minutes. These Oracle errors are
> the first sign of impending doom. We normally get them after 24 hours of
> running on our production box. On the production side, eventually this
> ORA-... error rate goes up and the box will exhibit other symptoms like
> hung pipes, then file systems, then complete hangs.
>
> On the test 2650 we went through 2700 loads last night without any
> problem. The test 2650 has ASYNC IO turned OFF!
>
> Differences between the 6650 and 2650: number and speed of CPUs,
> chipset?, 6650s have PERC4/DC with megaraid2 drivers, 2650s have the
> internal Adaptec RAID (aacraid driver). Async IO was off on the 2650.
> Both use reiserfs and LVM.
>
> We've just disabled async IO on the standby 6650 and are restarting that
> stress test. Will report results back.
>
> We just turned async IO on for the 2650 and relinked all and are starting
> the stress test over again to see if we get errors. If so, we know it
> is async IO. If not, then something in the kernel or megaraid drivers?
>
> Tom reported problems with max open files:
> oracle@(protected):/proc/sys/fs> cat file-nr
> 5676 1495 131072
> Should be no problem for us here.
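
For anyone reading the file-nr line above: on a 2.4 kernel the three fields are, as far as I know, the allocated file handles, the allocated-but-unused handles, and the system-wide maximum. Here is a minimal sketch in Python that turns the quoted figures into headroom numbers. The function name `parse_file_nr` and the returned keys are made up for illustration; only the three numbers come from the post.

```python
def parse_file_nr(text):
    """Parse the three whitespace-separated fields of /proc/sys/fs/file-nr.

    Assumes the 2.4-kernel meaning of the fields:
    allocated handles, allocated-but-unused handles, maximum handles.
    """
    allocated, free, maximum = (int(field) for field in text.split())
    return {
        "allocated": allocated,
        "free": free,
        "max": maximum,
        # Handles actually in use right now.
        "in_use": allocated - free,
        # How close the allocation is to the file-max ceiling.
        "pct_of_max": 100.0 * allocated / maximum,
    }


# Figures quoted in the post.
stats = parse_file_nr("5676 1495 131072")
print(stats["in_use"], "handles in use,",
      round(stats["pct_of_max"], 2), "% of max allocated")
```

On a live box the same function could be fed the contents of /proc/sys/fs/file-nr directly. With the figures quoted above, the allocation sits well under 5% of the limit, which matches the "no problem for us here" reading.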
>
> Thus I think we are narrowing down the problems to megaraid,
> chipset/hardware, or something in the kernel that is aggravated by our
> hardware.
>
> Andy
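
Since ORA-12599 and ORA-03113 are described above as the first warning sign, one way to compare the async-on and async-off stress runs is to tally just those two codes from whatever output the load scripts write. This is only a sketch: the log format, the sample lines, and the name `count_ora_errors` are assumptions, not anything from Andy's actual scripts.

```python
import re
from collections import Counter

# Matches Oracle error codes such as ORA-12599 or ORA-03113.
ORA_CODE = re.compile(r"\bORA-(\d{5})\b")


def count_ora_errors(lines, watch=("12599", "03113")):
    """Count occurrences of the watched ORA codes in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for code in ORA_CODE.findall(line):
            if code in watch:
                counts["ORA-" + code] += 1
    return counts


# Hypothetical log excerpt, for illustration only.
sample = [
    "load 0001 ok",
    "ORA-12599: TNS:cryptographic checksum mismatch",
    "ORA-03113: end-of-file on communication channel",
    "load 0002 ok",
    "ORA-12599: TNS:cryptographic checksum mismatch",
]
print(count_ora_errors(sample))
```

Running the same tally over each run's log (for example by passing an open file object as `lines`) gives one count per configuration, which makes the "errors after 15 minutes" versus "2700 clean loads" comparison easy to repeat.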
--
To unsubscribe, email: suse-oracle-unsubscribe@(protected)
For additional commands, email: suse-oracle-help@(protected)
Please see http://www.suse.com/oracle/ before posting