From: AIX Service Mail Server (aixserv austin.ibm.com)
Date: Tue Jan 23 2001 - 02:15:01 CST
APAR: IY12053 COMPID: 5765D5100 REL: 311
ABSTRACT: NODECOND TIMEOUT WAITING FOR OK PROMPT
PROBLEM DESCRIPTION:
When netbooting a node, the nodecond_chrp script can fail with
a "timeout waiting for ok prompt" message in the nodecond log
in the /var/adm/SPlogs/spmon/nc directory. It fails to enter the
"8" at the right time after seeing the line
memory keyboard network scsi
This has been seen so far on some of the Winterhawk-2 nodes.
LOCAL FIX:
Do a manual nodecondition to netboot the node until the
nodecond_chrp script has been changed to address this timing
problem.
PROBLEM SUMMARY:
Network boot may sometimes fail on a Winterhawk-2 node with
the message, 'timeout waiting for "Welcome to AIX"'.
PROBLEM CONCLUSION:
The node conditioning script has been changed to require
less interaction with the service processor menu. Instead
of waiting a fixed number of seconds before sending an "8"
prompt, it will select a firmware option that allows it to
wait for the "ok" prompt upon booting.
------
APAR: IY12819 COMPID: 5765D5101 REL: 111
ABSTRACT: HAGSD MEMORY LEAK - ASSERT SUBROUTINE FAILED IN
PROBLEM DESCRIPTION:
Customer's hags daemon goes down intermittently, with the
following error in the hags log:
The assert subroutine failed: message !=0, file ../../../../../s
rc/rsct/pgs/pgsd/pbs/PBContainer.C, line 63.
Also, the core dump that is created is well over 100M. These
2 things together suggest hags is encountering a memory leak
on assert. There were 2 defects in 3.2 64941 and 66277, which
should fix this problem, and will be retrofitted backwards to
this release.
PROBLEM SUMMARY:
Memory leaks in hags caused it to core dump.
PROBLEM CONCLUSION:
Several memory leaks have been plugged.
------
APAR: IY12873 COMPID: 5765D5100 REL: 311
ABSTRACT: VSDNOISE_DEBUG DEFAULT SETTING IMPACTS VSD PERFORMANCE
PROBLEM DESCRIPTION:
Currently when vsd is running on a system the vsdnoise_debug
default setting is set to trace level VSDTFLOW which is
impacting vsd performance during read/writes. The default trace
level needs to be reduced so that it does not significantly
impact vsd performance.
PROBLEM SUMMARY:
The default values for vsdnoise_debug in PSSP 3.1.1 include
VSDTFLOW. That causes an entry to be made in lookvsd_debug
whenever the vsdd driver changes its flow of control. This
can impact VSD performance.
PROBLEM CONCLUSION:
The vsdd driver has been changed so that VSDTFLOW is no
longer set by default. Instead, a new option, VSDTSHIP, has
been added to cause a trace entry to be made only when a
major operation is undertaken by vsdd (for example, sending
or receiving a packet.) This replaces VSDTFLOW as a tracing
default.
------
APAR: IY13317 COMPID: 5765D6100 REL: 210
ABSTRACT: LLSTATUS INFO MISSING FROM API DATA
PROBLEM DESCRIPTION:
LLSTATUS INFO MISSING FROM API DATA
PROBLEM SUMMARY:
The LoadLeveler API does not have information for
the Drain and Draining classes.
PROBLEM CONCLUSION:
The LoadLeveler API ll_get_data now has two new
specifications, LL_MachineDrainingClassList and
LL_MachineDrainClassList, to get the draining and drain
class lists on the machine.
------
APAR: IY13614 COMPID: 5765D5100 REL: 311
ABSTRACT: PMANDEF FAILS WTH HACMP WHEN INITIAL_HOSTNAME NOT IN SDR ADAPTER
PROBLEM DESCRIPTION:
The failure occurs when the initial_hostname in the SDR for a
node resolves to an IP address that is not listed for one of the
adapters in the SDR Adapter class. This can be seen around line
1204 in function get_node_list():
push(
node_num_list, $adapter_list{&host_to_ipaddr($node)});
host_to_ipaddr($node) returns the IP address of the initial
hostname of the node. Because there is no entry for that address
in adapter_list, the node gets left out. This results in what
amounts to a syntax error in the node range, which causes the
failure.
The initial_hostname is placed in an associative array
(node_list) of node numbers to hostnames via build_node_list.
This array is used deep in the target processing path, which
ultimately turns the target list into a list of hostnames from
the SDR initial_hostname field.
PROBLEM SUMMARY:
If the initial_hostname of a target node of a Problem
Management subscription does not correspond to an
Adapter in the SDR, pmandef will fail with an Invalid
node range message.
PROBLEM CONCLUSION:
Obsolete processing in pmandef resulted in an error when
the initial_hostname of a target node of a Problem
Management subscription did not correspond to an
Adapter in the SDR. This processing was removed.
------
APAR: IY13791 COMPID: 5765D5101 REL: 111
ABSTRACT: TOPSVCS REC'S 2523-055 NODE DUPLICATION ERROR BECAUSE 2 OFFSETS
PROBLEM DESCRIPTION:
Customer has a 5-node HACMP cluster with ttys across all 5. Each
node is connected to 2 neighbors. When the machines.lst is
generated, 2 offsets are being created instead of 3. This
results in topsvcs not starting and failing with a 2523-055 node
duplication error.
This is a duplicate of defect 64717 in r120. This fix needs to
be retrofitted back to r111.
PROBLEM SUMMARY:
In rare situations, the ordering of non-IP networks might
cause a conflict in Topology Services configuration file,
which in turn will cause Topology Services daemon to exit.
PROBLEM CONCLUSION:
A new algorithm has been developed for ordering non-IP
networks in Topology Services configuration file. This
algorithm avoids conflict by assigning networks to the
first offset that fits.
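
A first-fit assignment of this kind can be sketched in Python. This is
only an illustration, not the Topology Services code: it assumes that a
network "fits" an offset when neither of its two nodes already appears
at that offset (an inference from the 2523-055 node duplication error),
and the node names and the assign_offsets helper are hypothetical.

```python
def assign_offsets(networks):
    """Assign each non-IP network (a pair of node names) to the first
    offset in which neither of its nodes already appears (assumed rule)."""
    offsets = []     # offsets[i] is the set of nodes already used at offset i
    assignment = []  # assignment[k] is the offset chosen for networks[k]
    for node_a, node_b in networks:
        for index, used in enumerate(offsets):
            if node_a not in used and node_b not in used:
                used.update((node_a, node_b))
                assignment.append(index)
                break
        else:
            # No existing offset fits, so open a new one.
            offsets.append({node_a, node_b})
            assignment.append(len(offsets) - 1)
    return assignment


# Hypothetical 5-node ring: each node has a tty link to 2 neighbors.
ring = [("n1", "n2"), ("n2", "n3"), ("n3", "n4"), ("n4", "n5"), ("n5", "n1")]
print(assign_offsets(ring))  # [0, 1, 0, 1, 2]
```

Under that assumption the 5-node ring needs 3 offsets, which matches
the count given in the problem description.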
------
APAR: IY13923 COMPID: 5765D5100 REL: 311
ABSTRACT: UPDATEVSDTAB SCRIPT DOES NOT TAKE INTO ACCOUNT MIRRORING WHEN
PROBLEM DESCRIPTION:
The /usr/lpp/csd/bin/updatevsdtab script updates the size of the
vsd incorrectly in the SDR if there is mirroring.
PROBLEM SUMMARY:
The problem arose when the customer needed to update
size of the vsd (size_in_MB) in the VSD_Table class
because the logical volume had been extended.
There is a script called updatevsdtab in /usr/lpp/csd/bin
which will, using sysctl, go out to the node and determine
(via "lslv <logical-volume>") information about the
logical volume and will update the "size_in_MB" object
in the VSD_Table class. The "work" is done in the
/usr/lpp/csd/sysctl/updatevsdtab2.perl script.
When the customer did this the command worked, however,
the size of the vsd was updated to be double the actual
size. Upon investigation it was found that the logical
volume was mirrored and the updatevsdtab2.perl script
was not checking for how many copies there were and
taking that into account when calculating the total size
of the vsd.
PROBLEM CONCLUSION:
The solution to this problem is to extract from the
LVS array one additional bit of information (COPIES:)
and modify the calculation of the current size of the
vsd ($curr_size) to always take into account the number
of copies ($curr_size = ($pps / $copies) * $ppsize, where
$pps=number of partitions, $copies is the number of copies
that are active, and $ppsize, the size of each partition).
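
The corrected calculation can be illustrated with a short Python
sketch. It mirrors the formula above rather than the actual
updatevsdtab2.perl code; the function name and the sample values are
hypothetical.

```python
def vsd_size_mb(pps, copies, ppsize_mb):
    """Size of the vsd in MB: (pps / copies) * ppsize.

    pps       -- number of physical partitions reported for the LV
    copies    -- number of active mirror copies
    ppsize_mb -- size of each partition in MB
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return (pps / copies) * ppsize_mb


# Hypothetical mirrored logical volume: 200 partitions, 2 copies, 16 MB each.
print(vsd_size_mb(pps=200, copies=2, ppsize_mb=16))  # 1600.0
```

Without the division by the number of copies, the same input would give
3200 MB, which is the doubling described in the problem summary.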
------
APAR: IY14043 COMPID: 5765D6100 REL: 210
ABSTRACT: REQUESTING TOO MANY US WINDOWS CRASHES THE NEGOTIATOR
PROBLEM DESCRIPTION:
The customer uses an external scheduler and the negotiator
began crashing on a regular basis (about once per week). The
problem was traced back to several jobs that were requesting 8
and 16 tasks per node in US, but this machine only has 4 US
windows per node.
LOCAL FIX:
Maui provides a filtering mechanism which can be programmed to
prevent it from trying to start jobs requesting too many
windows. I do not know if any other external schedulers have a
similar feature.
PROBLEM SUMMARY:
The negotiator crashes when an external scheduler passes
in a job that uses more US windows than are currently
available on a given node.
PROBLEM CONCLUSION:
The negotiator can not depend on the scheduler to verify
that sufficient US windows are available to run a job.
Therefore, code will be added to reject a job that tries
to load the max + 1st US window. The job will be left
in Idle state and llq -s already knows why the job should
not have been started.
TEMPORARY FIX:
Some external schedulers have filters or other programmable
interfaces that can be used to prevent bad jobs from being
started.
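
The kind of check involved, whether in the negotiator or in an external
scheduler's filter, can be sketched in Python. This is an illustration
only: it assumes one US window per task, and the function and parameter
names are hypothetical.

```python
def fits_us_windows(tasks_per_node, windows_in_use, windows_per_node):
    """Return True if a node has enough free US windows for the job.

    Assumes each task on the node needs exactly one US window.
    """
    return windows_in_use + tasks_per_node <= windows_per_node


# Jobs like those in the problem description: 8 or 16 tasks per node
# on a machine with 4 US windows per node.
for tasks in (4, 8, 16):
    state = "start" if fits_us_windows(tasks, 0, 4) else "leave Idle"
    print(tasks, state)
```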
------
APAR: IY14100 COMPID: 5765D5100 REL: 311
ABSTRACT: SDRCREATEFILE SHOWS 0025-062 "SDR FILENAME NOT FOUND" USING
PROBLEM DESCRIPTION:
If an SDR file contains the character string "/../", you get
the error message:
SDRCreateFile: 0025-062 SDR filename not found.
when using, for example, the SDRCreateFile <filename> <sdr class>
command. The problem is probably in sdrd_file.c:
if (strstr(buf, "/../") != NULL) return(62); /* .. not allowed*/
PROBLEM SUMMARY:
SDRCreateFile and SDRCreateSystemFile are failing with
the message "0025-062 SDR filename not found.", if the
file they try to create in the SDR contains the string
/../ as part of its contents.
PROBLEM CONCLUSION:
Modified the SDR commands to allow files containing the
string /../ within the file, to be added to the SDR.
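
The intended behavior can be sketched in Python: the "/../" check
applies to the name of the SDR file, not to the data stored in it. This
is not the sdrd code; the function name is hypothetical, and only the
return value 62 is taken from the problem description.

```python
SDR_FILENAME_NOT_FOUND = 62  # value returned by the check quoted above


def validate_sdr_create(sdr_name, contents):
    """Return 0 if the file may be added to the SDR, else an error code.

    Only the SDR file name is checked for "/../"; the contents are
    deliberately not inspected.
    """
    if "/../" in sdr_name:
        return SDR_FILENAME_NOT_FOUND
    return 0


print(validate_sdr_create("myconfig", b"cd /usr/../tmp\n"))  # 0
print(validate_sdr_create("a/../b", b"harmless\n"))          # 62
```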
------
APAR: IY14157 COMPID: 5765D6100 REL: 210
ABSTRACT: HARD WALLCLOCK LIMIT NOT ENFORCED WHEN SOFT LIMIT IS TRAPPED FOR
PROBLEM DESCRIPTION:
When a LoadLeveler job traps a soft limit signal the job should
continue until it hits the hard wallclock limit or completes.
The problem is that the hard wallclock limit is not being
enforced once the soft limit is trapped.
PROBLEM SUMMARY:
When a LoadLeveler job traps a soft limit signal the job
should continue until it hits the hard wallclock limit or
completes. The problem is that the hard wallclock limit is
not being enforced once the soft limit is trapped. When
the soft limit is reached the job is marked removed. Then
when the hard limit is reached it checks to see if the job
is marked removed; if so, it does not send the kill signal.
PROBLEM CONCLUSION:
LoadL_starter has been changed so that the job does not get
marked removed when the soft limit is reached. Therefore
the job will get the kill signal when the hard limit is
reached after the soft limit was trapped.
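
The difference between the old and new behavior can be modeled with a
small Python sketch. It is a simplified illustration of the logic
described above, not LoadL_starter code; the function name and the
signal labels are hypothetical.

```python
def signals_sent(mark_removed_on_soft_limit):
    """List the signals a job receives when it traps the soft limit
    and keeps running until the hard wallclock limit."""
    removed = False
    sent = []

    # Soft limit reached: the job gets the soft limit signal and traps it.
    sent.append("soft limit signal")
    if mark_removed_on_soft_limit:
        removed = True  # old behavior

    # Hard limit reached: the kill signal is skipped for removed jobs.
    if not removed:
        sent.append("kill signal")
    return sent


print(signals_sent(mark_removed_on_soft_limit=True))   # before the fix
print(signals_sent(mark_removed_on_soft_limit=False))  # after the fix
```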
------
APAR: IY14205 COMPID: 5765D5100 REL: 311
ABSTRACT: SYSCTLD HANGS AFTER LARGE BATCH JOBS
PROBLEM DESCRIPTION:
Sysctld hangs after large batch jobs. Apparently, the reason is
a bug in the svc_reaper function, which gets into an infinite
loop with only a few ways to break out of it. Within the
svc_reaper function, we are using == when we should be using
just a single =. This prevents a variable from getting changed,
which can leave us in an infinite for loop.
After changing this, no more problems were experienced.
PROBLEM SUMMARY:
sysctld hangs after large batch jobs. The processing that
attempts to clean up resources of defunct child processes
ends up in an infinite loop, which hangs the daemon.
PROBLEM CONCLUSION:
Modified the section of code in the sysctl daemon that
cleans up resources of defunct child processes, to no
longer end up in an infinite loop when the child
processes are not cleaned up properly.
------
APAR: IY14272 COMPID: 5765D5100 REL: 311
ABSTRACT: LLSTATUS INFO MISSING FROM API DATA
PROBLEM DESCRIPTION:
LLSTATUS INFO MISSING FROM API DATA
PROBLEM SUMMARY:
The LoadLeveler API does not have information for
the Drain and Draining classes.
PROBLEM CONCLUSION:
The LoadLeveler API ll_get_data now has two new
specifications, LL_MachineDrainingClassList and
LL_MachineDrainClassList, to get the draining and drain
class lists on the machine.
------
APAR: IY14339 COMPID: 5765D5101 REL: 111
ABSTRACT: HAGS NEEDS TO HAVE MORE DESCRIPTIVE ERRORS WHEN IT EXITS.
PROBLEM DESCRIPTION:
Hags needs to be more descriptive when it exits.
1- Guard possible coredump if currDirectory is NULL.
2- Write 'program name' in the place of sockFd
3- Try to Change the format of the log output from multiple line
to a single line.
PROBLEM SUMMARY:
Group Services is currently writing a log
message with the internal token number
whenever a client process dies (or stops).
Unfortunately, there is no easy way to know
which provider (or process) died by just
reading the number.
Therefore, adding the program name to the
log message should help with the problem.
PROBLEM CONCLUSION:
With this fix, it should be easier to identify
which processes (or providers) are failing.
------
APAR: IY14385 COMPID: 5765D5101 REL: 111
ABSTRACT: GSAPI CLIENT GENS CORE AT HA_GS_DISPATCH
PROBLEM DESCRIPTION:
gsapi client gens core at ha_gs_dispatch
PROBLEM SUMMARY:
GSAPI ha_gs_dispatch() causes a core dump because
of invalid access to the uninitialized internal
memory which was allocated by GSAPI, especially
related to ha_gs_change_attribute() function call.
Although this symptom may not always be visible
externally, it may cause memory handling to
misbehave.
PROBLEM CONCLUSION:
With the memory initialization problem fixed,
GSAPI should no longer core dump.
------
APAR: IY14403 COMPID: 5765D5100 REL: 311
ABSTRACT: CHGCSS DOES NOT ACCEPT MULTIPLE ATTRIBUTES FROM CHDEV
PROBLEM DESCRIPTION:
chdev -l css0 -a rpoolsize=xxxxxxx -a spoolsize=xxxxxx passes
-l css0 -a rpoolsize=xxxxxxx spoolsize=xxxxxx to chgcss
But chgcss is unable to see this as multiple parameters, when
all other change methods do.
PROBLEM SUMMARY:
The chgcss command does not handle changing both rpoolsize
and spoolsize, at the same time, when chdev is the command
being used to specify the changes.
PROBLEM CONCLUSION:
The chgcss command was changed to check for multiple
attributes in a single, quoted, -a argument, which is the
way that chdev passes change requests to the various chgxxx
methods.
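
The argument handling described above can be sketched in Python. This
is an illustration, not the chgcss source: it assumes the attributes
inside a single -a argument are separated by whitespace, and the
function name and sample values are hypothetical.

```python
def parse_attributes(argv):
    """Collect name=value pairs from every -a argument.

    A single -a argument may hold several pairs separated by
    whitespace, which is how chdev is described as passing them.
    """
    attrs = {}
    i = 0
    while i < len(argv):
        if argv[i] == "-a" and i + 1 < len(argv):
            for pair in argv[i + 1].split():
                name, sep, value = pair.partition("=")
                if not sep or not name:
                    raise ValueError("malformed attribute: %r" % pair)
                attrs[name] = value
            i += 2
        else:
            i += 1
    return attrs


# What chgcss would receive from chdev (sizes are hypothetical).
args = ["-l", "css0", "-a", "rpoolsize=1048576 spoolsize=2097152"]
print(parse_attributes(args))
```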
------
APAR: IY14431 COMPID: 5765D5101 REL: 111
ABSTRACT: HAEMD DIES (CALLS ABORT) IF NIS+ IS USED
PROBLEM DESCRIPTION:
Refer to defect 47729.
haemd dies (calls abort() to create a core)
if NIS+ is used.
errpt -a shows:
LABEL: HA002_ER
IDENTIFIER: 12081DC6
Resource Name: haemd
Detail Data
DETECTING MODULE
LPP=PSSP,Fn=emd_rvo.c,SID=1.24,L#=749,
DIAGNOSTIC EXPLANATION
haemd(ach03): 2521-006 System call "shmat" failed with
error 22 - A system call received a parameter that is
not valid..
LABEL: CORE_DUMP
SIGNAL NUMBER
6
PROGRAM NAME
haemd
dbx stacktrace:
raise.raise(??) at 0xd017ad28
abort() at 0xd0174450
emd_exit(??) at 0x10001138
obsv_vars(??, ??) at 0x10012304
rvo_immediate() at 0x10012748
ctrl_loop() at 0x10000814
main(??, ??) at 0x10000ffc
PROBLEM SUMMARY:
On a NIS system, once pman was started, haemd would
abort due to a shared memory segment problem.
PROBLEM CONCLUSION:
The haemd daemon has been assigned a higher memory
segment which will enable the daemon to stay up
once pman is started on a NIS system.
------
APAR: IY14440 COMPID: 5765D5101 REL: 111
ABSTRACT: ORACLE EXPECTS A NETWORK RESPONSE FROM ALL NODES IN A CLUSTER
PROBLEM DESCRIPTION:
Oracle calls ha_em_receive_response(), which returns information
only for the local node. A network response is expected from all
the nodes in the cluster.
PROBLEM SUMMARY:
After one or more nodes in an HACMP/ES cluster go down, a query
request entered through the Event Management API
may take up to two minutes to complete, if the
request is directed to one or more of the nodes that
are down.
PROBLEM CONCLUSION:
After this fix is applied, query commands that target
other nodes should complete with no appreciable delay.
------
APAR: IY14470 COMPID: 5765D5100 REL: 311
ABSTRACT: VSD HANGS DUE TO BAD REQUEST COUNT
PROBLEM DESCRIPTION:
vsd hangs due to bad request count
PROBLEM SUMMARY:
VSD has the potential to hang during "suspendvsd" if the
internal counter of requests targeted to a specific vsd
never decrements to zero. This code area has been
problematic in the past.
PROBLEM CONCLUSION:
The methodology to maintain the vsd request counter was
improved in VSD 3.2. This code is being backfitted to VSD 3.1.1.
------
APAR: IY14520 COMPID: 5765B9501 REL: 310
ABSTRACT: MMRPLDISK FAILS WHEN DISK DESCRIPTOR IS NOT FOR A VSD DISK.
PROBLEM DESCRIPTION:
The command fails when the operand disk descriptor contains a
disk that is not a vsd disk. The mmrpldisk script sets the vsdsiz
parameter only when the vsd exists. The relevant line number in
mmrpldisk is 469.
Customer issued the command and it failed with error:
"fslow+ :0403-053 Expression is not complete;
more tokens expected".
LOCAL FIX:
This line needs to be added after line 469:
vsdsiz=$(cat $mkvsdrtn | awk -F, '{print $2}')
PROBLEM SUMMARY:
Incorrect error given on mmrpldisk
PROBLEM CONCLUSION:
Correct handling of disk size for mmrpldisk
------
APAR: IY14743 COMPID: 5765D5100 REL: 311
ABSTRACT: NODECOND.CHRP FAILING
PROBLEM DESCRIPTION:
nodecond.chrp failing
PROBLEM SUMMARY:
The code in APAR IY12053 to address timing problems in the
conditioning of model 9076-270 nodes must be adjusted to
accommodate the 9076-260 as well. Otherwise, the nodecond
program will go into an infinite loop.
PROBLEM CONCLUSION:
The nodecond_chrp script has been changed to:
1. allow for nodes which do not have the "Boot to Open
Firmware" option in their Pre-installation Menu. In these
cases, the timings that are used to maintain the
installation dialog will be estimated.
2. wait 90 seconds before timing out on failure to start the
firmware setup menus. It had been two minutes.
3. correctly handle the Language Selector Menu. It had
previously caused the process to loop to a timeout.
------
APAR: IY14804 COMPID: 5765D5100 REL: 311
ABSTRACT: SSP.BASIC 2.4.0.18 FAILS TO APPLY WITH SYSCK: 3001-038
PROBLEM DESCRIPTION:
When applying ssp.basic 2.4.0.18 on an AIX 433 system, the
following errors may be seen:
sysck: 3001-038 The name imnadm is not a known group for
entry /usr/lpp/ssp/bin/cshutdown.
sysck: 3001-003 A value must be specified for group for
entry /usr/lpp/ssp/bin/cshutdown.
sysck: 3001-038 The name imnadm is not a known group for
entry /usr/lpp/ssp/bin/cstartup.
sysck: 3001-003 A value must be specified for group for
entry /usr/lpp/ssp/bin/cstartup.
sysck: 3001-017 Errors were detected validating the files
for package ssp.basic.
LOCAL FIX:
Create the group imnadm as group 200 and then ssp.basic will
install. Then run chgrp shutdown on /usr/lpp/ssp/bin/cshutdown
and /usr/lpp/ssp/bin/cstartup.
PROBLEM SUMMARY:
Installation of ssp.basic on an AIX 4.3.3 system will
fail, if imnadm is not defined as a group in /etc/group.
The install will fail with message 3001-038 from sysck that
The name imnadm is not a known group.
PROBLEM CONCLUSION:
Corrected the packaging of ssp.basic, so that there is
no dependency on the imnadm group existing in /etc/group.
------
APAR: IY14846 COMPID: 5765D5100 REL: 311
ABSTRACT: DOC: SA22-7351-01 NEEDS TO ADD THE SUPPLEMENTARY RESTRICTION OF
PROBLEM DESCRIPTION:
There is insufficient explanation of the psyslclr command in
the following book in the electronic library.
Title: Command and Technical Reference, Volume 1
Document Number: SA22-7351-01 and SA22-7351-02
The psyslclr still stops and starts syslogd during trimming.
Found the "Note:" in the PSSP Commands Vol 1 for PSSP3.1.0
( SA22-7351-00 ) under the psyslclr command near the very
bottom of the command description:
Note: The syslogd daemon does not log the year in record
timestamps. The comparisons for start and end times are done on a
per record basis and could cause unexpected results if the log
file is allowed to span more than one year. The syslogd daemon
is stopped during this process so trimming activity should be
planned accordingly. It is then restarted using the default or
alternate syslog configuration file.
PROBLEM SUMMARY:
PSSP for AIX Command and Technical Reference, Volume 1
Chapter 1 - psyslclr command
Information was missing for the psyslclr command, indicating
that syslogd is stopped and restarted during the log
trimming process.
PROBLEM CONCLUSION:
PSSP for AIX Command and Technical Reference, Volume 1
Chapter 1 - psyslclr command
At the end of the Description section for psyslclr, the
following note will be added:
Note: The syslogd daemon is stopped during this process so
trimming activity should be planned accordingly. It
is then restarted using the default or alternate
syslog configuration file.
------
APAR: IY14924 COMPID: 5765D5100 REL: 311
ABSTRACT: NODECOND TIMEOUT WAITING FOR DIAG CONSOLE
PROBLEM DESCRIPTION:
nodecond timeout waiting for diag console
PROBLEM SUMMARY:
When attempting to network boot a node in "diag" mode, the
node conditioning program can abort with a timeout
condition. It is waiting for the console message "please
define the system console."
The problem only occurs on node types that support the chrp
interface, and have a great many I/O adapters and disk
units.
PROBLEM CONCLUSION:
The timeout value in the nodecond_chrp process had been
hard-coded at 360 seconds for any node. This has been
changed to reflect the actual node type.
There is, already defined, an SDR object, named NC_timeout
which varies from node to node. The wait for the Diagnostics
Menu to appear will be a function of NC_timeout.
------
APAR: IY15017 COMPID: 5765C3403 REL: 430
ABSTRACT: LINUX: LIBICE NOT BEHAVING CORRECTLY
PROBLEM DESCRIPTION:
The libICE.a library may not behave as expected.
PROBLEM CONCLUSION:
Changed to enable the BSD44SOCKETS compatibility flag when
building libICE.
------
APAR: IY15850 COMPID: 5765D5100 REL: 311
ABSTRACT: GSAPI CLIENT GENS CORE AT HA_GS_DISPATCH
PROBLEM DESCRIPTION:
gsapi client gens core at ha_gs_dispatch
PROBLEM SUMMARY:
GSAPI ha_gs_dispatch() causes a core dump because
of invalid access to the uninitialized internal
memory which was allocated by GSAPI, especially
related to ha_gs_change_attribute() function call.
Although this symptom may not always be visible
externally, it may cause memory handling to
misbehave.
PROBLEM CONCLUSION:
With the memory initialization problem fixed,
GSAPI should no longer core dump.
------
APAR: IY15876 COMPID: 5765D5100 REL: 311
ABSTRACT: LATEST PSSP 3.1.1 FIXES AS OF JANUARY 2001
PROBLEM DESCRIPTION:
This is the latest PSSP ptf as of January 2001.
Order this apar to get all of the ptfs as of January 2001.
PROBLEM SUMMARY:
This is a packaging apar for PSSP 3.1.1 fixes
as of January 2001.
PROBLEM CONCLUSION:
This is a packaging apar for PSSP 3.1.1
fixes as of January 2001.
------