|
#1
|
| Sorry for the multi-post; I ended up deciding I should have cross-posted.

Folks,

I have an Oracle EBS system that has hung a couple of days in a row. At first I saw new references in /var/log/messages and followed them to power management, so I planned on working in the BIOS the next time it happened.

The problem is that it is now starting to hang again, and the only entries near all 3 hangs are about pam_unix. So I started digging into new sessions authenticating. Sure enough, new cron jobs stop running several hours before anyone notices a problem. Existing Oracle sessions keep running SQL, but new ones hang at the creation point. New ssh sessions hang just after asking for my password, but existing ones work. I have a few root shells running on the console, so I can do any sort of debugging I want.

The load average is consistent with a guess at the number of hung sessions: load around 12-15 but CPU utilization no more than 10%. Neither netstat nor lsof suggests there are hung network sessions.

The very puzzling part is that /etc/nsswitch.conf has

passwd: files

as its only option. No NIS, LDAP or any other sort of network authentication. No sign of disk problems during this hang or the previous one, though there was a 30-second burp on the FCAL line to the array an hour before the first hang was noticed.

The only processes that used network authentication were smbd and nmbd, so I did chkconfig smbd off and service smbd stop to remove that possibility.

At this point I figure I'll need to reboot the box by the end of the day, and I really want a plan of action by then to keep it from having the same problem again tomorrow.

Has anyone seen pam cause a hang with passwd: files?

Thanks in advance! |
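A load of 12-15 with CPU under 10% is what one would expect if the hung sessions are sitting in uninterruptible sleep (state D), since Linux counts those tasks in the load average. Below is a minimal sketch for listing them from one of the surviving root shells. The helper name `blocked_procs` is made up for illustration, and the `ps` column list in the comment assumes a procps-style `ps`.

```shell
# Sketch only: filter `ps` output (read on stdin) down to processes in
# uninterruptible sleep. blocked_procs is a hypothetical helper name.
# Expected columns: PID STAT WCHAN COMMAND...
blocked_procs() {
    # Skip the header line, keep rows whose STAT field starts with D
    awk 'NR > 1 && $2 ~ /^D/'
}

# Assumed usage on a live system (procps-style ps):
#   ps -eo pid,stat,wchan:32,args | blocked_procs
```

If the list is dominated by login-related processes (sshd, crond, new Oracle shadow processes), the WCHAN column may hint at what they are all waiting on.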
|
#2
|
| Doug Freyburger wrote:
> Has anyone seen pam cause a hang with passwd: files?

It's time to install some OS patches.
A possible workaround is to disable nscd/pwgrd (caching), or to disable some line in /etc/pam.conf

--
echo imhcea\.lophc.tcs.hmo | sed 's2\(....\)\(.\{5\}\)2\2\122;s1\(.\)\(.\)1\2\11g;1 s;\.;::;2' |
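Before disabling a PAM line it helps to see which lines are actually active for a given module. The following is a rough sketch, not a recommendation of which line to remove. `active_pam_lines` is a hypothetical helper, and the path in the usage comment is an assumption: Red Hat systems normally keep per-service files under /etc/pam.d/ rather than a single /etc/pam.conf.

```shell
# Sketch only: print the non-comment lines of a PAM config file that
# mention a given module. active_pam_lines is a hypothetical helper name.
#   $1 = PAM config file
#   $2 = module name, matched as a fixed string (e.g. pam_unix)
active_pam_lines() {
    grep -v '^[[:space:]]*#' "$1" | grep -F -- "$2"
}

# Assumed usage on a Red Hat style layout:
#   active_pam_lines /etc/pam.d/system-auth pam_unix
```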
|
#3
|
| On Aug 1, 7:04 pm, Michael Tosch wrote:
> Doug Freyburger wrote:
> > Has anyone seen pam cause a hang with passwd: files?
>
> It's time for installing some OS-patches.
> A work-around is maybe to disable nscd/pwgrd (caching),
> or disable some line in /etc/pam.conf

Check the limits on the number of processes (threads) and on the maximum number of open files with:

ulimit -a

There may be a resource problem, e.g. too many client sessions to Oracle. Count the number of processes with:

ps -ef | wc -l

Also check with top whether CPU consumption or I/O wait is high. |
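The two checks above can be combined into one comparison. This is a minimal sketch under assumptions: `near_limit` is a made-up helper, the 80% default is arbitrary, and `ulimit -u` (a bash option) is a per-user limit, so comparing it with a system-wide process count only gives a rough indication.

```shell
# Sketch only: succeed (exit 0) if CURRENT is at or above PERCENT% of LIMIT.
# near_limit is a hypothetical helper name.
#   $1 = current count, $2 = limit (number or "unlimited"), $3 = percent (default 80)
near_limit() {
    current=$1 limit=$2 pct=${3:-80}
    # An unlimited resource can never be close to its limit
    [ "$limit" = "unlimited" ] && return 1
    [ $((current * 100)) -ge $((limit * pct)) ]
}

# Assumed usage from bash; `ps -ef | wc -l` includes the header line,
# so the count is one too high:
#   near_limit "$(ps -ef | wc -l)" "$(ulimit -u)" && echo "close to max user processes"
```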
|
#4
|
| Michael Tosch wrote:
> Doug Freyburger wrote:
> > Has anyone seen pam cause a hang with passwd: files?
>
> It's time for installing some OS-patches.
> A work-around is maybe to disable nscd/pwgrd (caching),
> or disable some line in /etc/pam.conf

Thank you to everyone who sent suggestions on the group or by e-mail!

Here's the bug that apparently caused the problem: the standard behavior of "audit" is to stop logging when its logging directory hits 80% full. On Red Hat Enterprise Linux 3 there's a bug where it stops logins rather than logging. Turn off the "audit" subsystem and you stop tickling the bug.

On this particular system /var isn't separated from /. Separating /var and /tmp, to isolate them from disk-full problems that cause hangs, is one of my first health check items on any production system, but I'd only been taking care of this particular client for negative one week when this happened. They rushed to authorize the hours starting a week early.

So now I need to:

1) Confirm that's the problem by waiting a week without seeing the problem come back.
2) Trim logs to get / well under 80%. Install canned log trimming scripts that use Red Hat utilities as well as find.
3) Get a new LUN on the SAN ready to go as /mnt and migrate to it as /var at the next maintenance reboot in a couple of weeks.
4) Use up2date to get all kernel and non-kernel modules up to date at the next maintenance reboot in a couple of weeks.
5) Then and only then turn audit back on. Note to self: read any white papers on the UNIX auditing systems. I started in engineering support, and much of my production experience has been on small-drive systems or high-load systems that needed to turn audit off, so I do not know it well enough.
6) Do what I consider good health check audits on the hosts at this client and start working on recommendations. |
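Step 2 of the plan could start from something like the sketch below. It is not the canned Red Hat script mentioned above, just an illustration with find and df. `over_threshold` and `old_logs` are hypothetical helper names, the 30-day age in the usage comment is an arbitrary choice, and the 80% figure is the one quoted in the post.

```shell
# Sketch only. Both helper names are hypothetical.

# over_threshold PERCENT: read `df -P` output on stdin and print the mount
# points whose use% is at or above PERCENT.
over_threshold() {
    awk -v max="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)          # strip the trailing percent sign
        if (use + 0 >= max) print $6
    }'
}

# old_logs DIR DAYS: list regular files under DIR last modified more than
# DAYS days ago. It only lists; review the output before deleting anything.
old_logs() {
    find "$1" -type f -mtime +"$2" -print
}

# Assumed usage:
#   df -P / | over_threshold 80
#   old_logs /var/log 30
```

Listing first and deleting in a separate, reviewed step keeps a typo in the directory argument from removing live files.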