From: AIX Service Mail Server (aixserv_at_austin.ibm.com)
Date: Wed Sep 04 2002 - 02:44:21 CDT
has requested a copy or has subscribed to the document named "New_AIXV4_Fixes".
If you would like to be removed from this mailing list, send e-mail to
aixserv@austin.ibm.com with a subject of "unsubscribe New_AIXV4_Fixes", or
send a note to owner-aixserv@austin.ibm.com with your request.
APAR: IY21028 COMPID: 5765D9300 REL: 310
ABSTRACT: POE NEEDS KERBEROS, THOUGH PSSP3.2 IS INSTALLED TO HAVE BASIC
PROBLEM DESCRIPTION:
POE on an SP needs at least KERBEROS running, although
with PSSP 3.2 the SP could be installed without KERBEROS or
DCE. The dependency on Kerberos should be removed from POE.
PROBLEM SUMMARY:
In ppe.poe 3.1, POE requires the "compatibility" security
method defined by the PSSP SP Security services in order to
run parallel jobs. This implies that the SP security
services
must be installed and configured with the "compatibility"
method. As a result, in order to configure the SP security
services, at least Kerberos Version 4 is required.
If SP security services are not installed and configured,
this implies a case where no security methods are defined
(the value returned by lsauthts is blank).
POE currently treats the case of no methods as a closed
system and it will not allow parallel jobs to run.
PROBLEM CONCLUSION:
POE 3.1 will be changed to treat "no methods" the same
as the case where the "compatibility" method was defined,
which will allow parallel jobs to run with AIX .rhosts
based user authentication. In this case, POE will not
depend on the installation and configuration of PSSP SP
security services, and in turn will not require
Kerberos.
------
APAR: IY30030 COMPID: 5765D5100 REL: 320
ABSTRACT: SOME NODES WILL HAVE WORM DIED AFTER REBOOT AND HENCE DO NOT
PROBLEM DESCRIPTION:
Error description: After rebooting all nodes, some of them will
not have Worm process up (there will be
/var/adm/SPlogs/css/rc.switch.log.extra
file which complains about more than one rc.switch
processes...).
rc.switch failure is related to an extremely small timing
hole that has to do with how ksh implements pipes.
LOCAL FIX:
Manually start the Worm with rc.switch; after a while (about 2 min),
the switch_responds for those nodes will come up.
PROBLEM SUMMARY:
The rc.switch script can fail to start the fault service
daemon because it may falsely detect that a daemon is
already running.
PROBLEM CONCLUSION:
The rc.switch script has been changed to prevent it
from falsely detecting a running fault service
daemon.
------
APAR: IY31661 COMPID: 5765D5100 REL: 320
ABSTRACT: S70D DAEMON GETS ERRORS "CANNOT COMMUNICATE WITH REMOTE NODE" IF
PROBLEM DESCRIPTION:
On SP systems using a 128-port RAN for the S7A tty connection, the
following entries occur in the system errlog of the CWS: "cannot
communicate with remote node". The problem is definitely associated
with a timing issue between the s70d daemon and the H/W. When using
an 8-port adapter the connection is direct between the 8-port
box and the CWS. With a 16-port box there is only a direct
connection between the initial box and the CWS (all other boxes
making up the 128-way adapter do not have direct connection
which is where the problem lies).
PROBLEM SUMMARY:
SP_attached S70s and S7As connected using more than
When an S70 or S7A is connected using more than an
8-port RAN, SPMON_EMSG101_ER entries may be made in the
errpt. This indicates a communication problem, which
is not the case. The s70d daemon needs to be more
tolerant of non-responses from the SAMI.
PROBLEM CONCLUSION:
The s70d daemon has been modified to be more tolerant of
non-responses of any kind from SAMI. The allowable number
of non-responses has been increased to prevent
SPMON_EMSG101_ER entries being made in the errpt indicating
the Supervisor is not responding.
------
APAR: IY31853 COMPID: 5765D5100 REL: 320
ABSTRACT: EXCESSIVE MPI/MPCI REXMIT MESSAGE IN ERRORLOG
PROBLEM DESCRIPTION:
MPCI_REXMIT_STALL and MPCI_REXMIT_RECOVER informational messages
are filling up the errpt making it impossible to see if any real
errors are occurring on the system that need administrative
attention. All programs seem to be running within expectations.
PROBLEM SUMMARY:
Excessive REXMIT STALL/RECOVER errors in the errpt.
PROBLEM CONCLUSION:
There are two items that will be looked at, at this
time. The first is extending a previous performance fix
that was only done for US to help IP performance as well.
The second is to institute a message relief timer so that
each process can only log one MPCI_REXMIT_STALL error
message per MP_TIMEOUT period.
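The message relief timer described above can be sketched as a simple per-process rate limiter. This is a minimal illustration only, not the PSSP implementation: the type `relief_timer`, both function names, and the use of seconds for the period are assumptions made for the example.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical per-process limiter: at most one log entry per period. */
typedef struct {
    time_t period_secs;  /* stand-in for the MP_TIMEOUT period */
    time_t last_logged;  /* time the last message was emitted */
    bool   has_logged;   /* false until the first message is emitted */
} relief_timer;

void relief_timer_init(relief_timer *t, time_t period_secs)
{
    t->period_secs = period_secs;
    t->last_logged = 0;
    t->has_logged = false;
}

/* Returns true if the caller may log now, false if the message
 * should be suppressed because one was already logged this period. */
bool relief_timer_allow(relief_timer *t, time_t now)
{
    if (t->has_logged && now - t->last_logged < t->period_secs)
        return false;
    t->last_logged = now;
    t->has_logged = true;
    return true;
}
```

A caller would check `relief_timer_allow(&t, time(NULL))` before writing each stall message and skip the write when it returns false.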
------
APAR: IY32038 COMPID: 5765D5100 REL: 320
ABSTRACT: 0509-036 AND 0509-130 IN PMANRMD.LOG FILE WHEN LIBDCE.A IS ON
PROBLEM DESCRIPTION:
The pmanrmd.log file shows a repeatable pattern of the
following entries.
0509-036 Cannot load program spsec_ldmod because of the
following errors
0509-130 Symbol resolution failed for
/usr/lpp/ssp/bin/spsec_ldmod because:
0509-136 Symbol GSS_MECH_MIT_KRB5 is not exported from dependent
module /usr/lib/libdce.a(shr.o).
/usr/lpp/ssp/bin/SDRGetObjects: 0025-004 Item specified for
query, insertion or deletion was not found.
The problem is triggered by the pman daemon logic finding that
the libdce.a file is on this system before checking to see if
DCE authentication is in use. DCE is not in use on this system
and the file remains for other reasons.
Other apars with similar symptoms IY17070, IY21195, IY23021 and
IY22203 either have the fix on or do not apply. There does not
seem to be any impact to the system other than the error entries
in the log.
LOCAL FIX:
The only impact is the messages and they can be ignored. If the
libdce.a file is removed the messages stop.
PROBLEM SUMMARY:
dsrvtgt was calling spsec_start before it was determining if
dce authentication was being used. If there is an older
/usr/lib/libdce.a you will get the load errors seen in the
/var/adm/SPlogs/pman/pmanrmd.log.
The -m in the SDRGetObjects call has been changed to -q.
PROBLEM CONCLUSION:
dsrvtgt has been modified to determine if dce authentication
is being used before calling spsec_start. If it determines
dce authentication is not being used it just exits without
calling spsec_start.
The SDRGetObjects option list has been fixed.
------
APAR: IY32351 COMPID: 5765D6100 REL: 220
ABSTRACT: TIMING-HOLE IN LOADL PROCESS TRACKING EXTENSION MAY LEAD TO
PROBLEM DESCRIPTION:
There is a timing hole in the Loadl_pte kernel extension
that might lead to a node crash.
During llctl stop the extension gets unloaded, and
it may still be accessed after that during the stop
process.
PROBLEM SUMMARY:
The LoadLeveler process tracking kernel
extension can cause a node crash if process tracking
is enabled. Node crashes have been seen when
reconfiguring or restarting LoadLeveler.
PROBLEM CONCLUSION:
The LoadLeveler process tracking kernel
extension has been changed to prevent a node crash
from occurring when restarting/reconfiguring
LoadLeveler with process tracking enabled.
------
APAR: IY32428 COMPID: 5765D5100 REL: 320
ABSTRACT: DOUBLE FREE OF SERVICE PACKET STORAGE IN HAL_RECV_HNDLR()
PROBLEM DESCRIPTION:
In hal_recv_hndlr(), there are cases where the storage for a
service packet is freed after the packet has been placed on a
port's virtual receive FIFO. The port thread will also free the
storage after reading the service packet from the FIFO. The
free in hal_recv_hndlr() is erroneous. The results are
indeterminate, because it depends on if/when the doubly-freed
storage is reused. One possible result is a fault-service daemon
core dump.
LOCAL FIX:
Restart the fault-service daemon after a core dump with
/usr/lpp/ssp/css/rc.switch.
PROBLEM SUMMARY:
Under some conditions, the hal_recv_hndlr() function will
free the storage used for service packets; the port thread
may later free this same storage. The results are
indeterminate; there may be data corruption or the fault
service daemon may core dump.
PROBLEM CONCLUSION:
Once a service packet is placed on the port's virtual
receive FIFO queue, the hal_recv_hndlr() function
will no longer try to free it.
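The ownership rule in this conclusion (once a packet is queued, only the reader frees it) can be illustrated with a small sketch. The names `packet_fifo`, `fifo_put`, `recv_service_packet` and `port_thread_consume` are hypothetical and are not the actual CSS code; the fixed-size array stands in for the port's virtual receive FIFO.

```c
#include <stdbool.h>
#include <stdlib.h>

#define FIFO_SLOTS 8

/* Hypothetical stand-in for a port's virtual receive FIFO. */
typedef struct {
    void *slot[FIFO_SLOTS];
    int   count;
} packet_fifo;

/* Queues pkt; on success the FIFO (and its reader) owns the storage. */
bool fifo_put(packet_fifo *f, void *pkt)
{
    if (f->count == FIFO_SLOTS)
        return false;
    f->slot[f->count++] = pkt;
    return true;
}

/* Receive side: frees pkt only if it was NOT handed to the FIFO. */
bool recv_service_packet(packet_fifo *f, void *pkt)
{
    if (fifo_put(f, pkt))
        return true;   /* ownership transferred; do not free here */
    free(pkt);         /* never queued, so this side still owns it */
    return false;
}

/* Reader side: removes the oldest packet and frees it exactly once. */
bool port_thread_consume(packet_fifo *f)
{
    if (f->count == 0)
        return false;
    void *pkt = f->slot[0];
    for (int i = 1; i < f->count; i++)
        f->slot[i - 1] = f->slot[i];
    f->count--;
    free(pkt);
    return true;
}
```

Each packet has exactly one owner at any time, so there is a single `free` on every path.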
------
APAR: IY32694 COMPID: 5765B8100 REL: 220
ABSTRACT: NUMERIC OPS OUTPUT WRONG RESULT IN 3270 IF RESULT > 2147483647
PROBLEM DESCRIPTION:
DirectTalk 3270 script won't handle values over 2 billion.
Bad result when adding or subtracting numbers over 2 billion.
LOCAL FIX:
Multiply any of the input parameters by 1.0.
Output will then be correct, but will also have extra
decimal places, which may require modifications to the
3270 server.
PROBLEM SUMMARY:
DirectTalk 3270 script won't handle values over 2
billion.
Bad result when adding or subtracting numbers over 2 billion.
PROBLEM CONCLUSION:
Changed sprintf output from signed int to
float with no decimal places. Also added a test that
generates error 24507 and aborts the script if the limits
"1e15 > result > -1e15" are exceeded.
Note: Exceeding these limits will cause loss of precision in
calculations and hence are trapped.
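As an illustration of the change described above, the following sketch formats a result as a float with no decimal places and rejects values outside the stated limits. The function name `format_result`, the macro `RESULT_LIMIT`, and the -1 return code are assumptions made for this example; the real script engine reports error 24507.

```c
#include <stdio.h>

#define RESULT_LIMIT 1e15  /* limit taken from "1e15 > result > -1e15" */

/* Writes result into buf with no decimal places.
 * Returns 0 on success, or -1 if result is outside the limits
 * (the sketch's stand-in for trapping the error). */
int format_result(double result, char *buf, size_t len)
{
    /* Written as a negated range check so NaN is rejected too. */
    if (!(result < RESULT_LIMIT && result > -RESULT_LIMIT))
        return -1;
    snprintf(buf, len, "%.0f", result);
    return 0;
}
```

With this approach, 2147483647 + 1 formats as "2147483648" instead of wrapping around as a 32-bit signed int would.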
------
APAR: IY32751 COMPID: 5765D5100 REL: 320
ABSTRACT: SRVSUPPWD NEEDS UNIQUE TMP PW FILENAME
PROBLEM DESCRIPTION:
srvsuppwd needs unique tmp pw filename
PROBLEM SUMMARY:
The srvsuppwd process creates a temporary file that is
not unique to the process. Since there may be multiple
srvsuppwd processes running at the same time, this could
result in an updsuppwd process on a node having to try
to obtain the supman password file multiple times.
PROBLEM CONCLUSION:
srvsuppwd has been modified to create a temporary file that
is unique to the process that creates it.
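One common way to get a temporary file that is unique to the creating process is POSIX mkstemp(). The sketch below shows that technique only; it is not taken from srvsuppwd, and the function name `create_unique_tmp` and the template path are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>   /* mkstemp() */
#include <string.h>   /* memcpy() */
#include <unistd.h>   /* close(), unlink() for callers */

/* Creates a uniquely named temporary file.
 * path_out receives the generated name; returns an open fd or -1. */
int create_unique_tmp(char *path_out, size_t len)
{
    /* Hypothetical template; mkstemp() replaces the trailing XXXXXX. */
    static const char tmpl[] = "/tmp/pwfile.XXXXXX";

    if (len < sizeof tmpl)
        return -1;
    memcpy(path_out, tmpl, sizeof tmpl);
    return mkstemp(path_out);
}
```

Because mkstemp() creates and opens the file atomically, two processes calling it at the same time get different files.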
------
APAR: IY32760 COMPID: 5765D5100 REL: 320
ABSTRACT: CHANGE DEFAULT PERMS ON TAR FILE
PROBLEM DESCRIPTION:
change default perms on tar file
PROBLEM SUMMARY:
css.snap tar file was world readable.
PROBLEM CONCLUSION:
css.snap tar file is now readable only
by root.
------
APAR: IY32788 COMPID: 5765D5100 REL: 320
ABSTRACT: S70D DAEMON DIES UNEXPECTED. HARDMON MUST BE STOPPED AND RESTART
PROBLEM DESCRIPTION:
SP attached server S80/S85. s70d dies unexpectedly, with the following msgs
in /var/adm/SPlogs/spmon/s70/s70d.3.log.xxx :
s70d 3 : 0026-500I s70d daemon started on device"/dev/tty7" (Frame 3) at Sat May 18 09:59:29 2002
s70d 3 : 0026-507I Entered main processing loop
SAMI Firmware Level (mm/dd/yy): 8/31/99
s70d 3 : 0026-522 ioctl() was unsuccessful: Resource temporarily
unavailable (11)
s70d 3 : 0026-502I s70d daemon ended (2) on device "/dev/tty7"
PROBLEM SUMMARY:
An ioctl failure is causing the s70d to terminate. In the
log file /var/adm/SPlogs/spmon/s70/s70d.x.log.yyy will be
the messages:
0026-522 ioctl() was unsuccessful:
0026-502I s70d daemon ended (x) on device "/dev/ttyx"
The s70d should be modified to not terminate if there
is an ioctl failure.
PROBLEM CONCLUSION:
The s70d has been modified to not issue message 0026-522
when a call to ioctl is unsuccessful and to not
terminate. The ioctl will either succeed on a subsequent
retry, or will cause another terminating error to occur.
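The retry behavior described above can be sketched as a bounded loop that tolerates "Resource temporarily unavailable" (EAGAIN, errno 11 in the log above). This is an illustration under assumptions, not the s70d code: `dev_op`, `retry_transient` and the stub `flaky_op` are hypothetical names, and the real daemon's retry policy is not stated in the APAR.

```c
#include <errno.h>

/* Hypothetical device operation: returns 0 on success,
 * or -1 with errno set on failure (like ioctl()). */
typedef int (*dev_op)(void *ctx);

/* Retries op up to max_tries times while it fails with EAGAIN.
 * Returns 0 on success, -1 on any other error or when tries run out. */
int retry_transient(dev_op op, void *ctx, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        if (op(ctx) == 0)
            return 0;
        if (errno != EAGAIN)
            return -1;   /* not transient: let the caller handle it */
    }
    return -1;
}

/* Demonstration stub: fails with EAGAIN until *ctx counts down to 0. */
int flaky_op(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
        errno = EAGAIN;
        return -1;
    }
    return 0;
}
```

In a real daemon the operation would wrap the ioctl() call on the tty device, and a failed retry sequence would be reported rather than ending the process.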
------
APAR: IY32969 COMPID: 5765D5100 REL: 320
ABSTRACT: SP SWITCH2 WORM RUNS SLOW UNDER HEAVY PAGING LOAD
PROBLEM DESCRIPTION:
The current SP Switch2 Worm uses popen() to invoke the sum
command on the current compressed topology file. The result of
the sum command is used to determine if an updated copy of the
topology file needs to be sent to the node. Under heavy paging
load, the time necessary for popen() to do a fork to invoke ksh,
and then ksh to do a fork to invoke sum, can be excessive. If
the Worm does not report back fast enough, when it receives
a NODE_INIT packet, the primary will drop the node off of the
switch.
LOCAL FIX:
None, really. The node is normally still okay. There shouldn't
be any problem bringing it back on the switch, via Eunfence.
But, the damage is already done.
PROBLEM SUMMARY:
Nodes can drop off the SP Switch 2 when they are under
a heavy load (e.g. high levels of paging). The time
taken to call the AIX sum command to calculate the
switch topology file checksum may be too long under high
load conditions, causing the primary node to drop the
slow responding node off the switch.
PROBLEM CONCLUSION:
The fault_service_Worm_RTG_CS code has been changed to
calculate the checksum of the switch topology file
directly instead of calling the AIX sum command.
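Computing a file checksum in-process avoids the two fork() calls that popen() plus an external command require. The sketch below shows the general idea using a 16-bit rotating checksum in the style of the BSD sum algorithm; the algorithm actually used by the Worm or by the AIX sum command is not stated here, and the names `rot16_update` and `file_checksum` are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/* Folds len bytes into a running 16-bit rotating checksum. */
unsigned rot16_update(unsigned sum, const unsigned char *p, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        sum = (sum >> 1) + ((sum & 1u) << 15);  /* rotate right one bit */
        sum = (sum + p[i]) & 0xFFFFu;           /* add byte, keep 16 bits */
    }
    return sum;
}

/* Checksums a file without starting any child process.
 * Returns 0 and stores the result in *out, or -1 on error. */
int file_checksum(const char *path, unsigned *out)
{
    unsigned char buf[4096];
    unsigned sum = 0;
    size_t n;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        sum = rot16_update(sum, buf, n);
    int err = ferror(fp);
    fclose(fp);
    if (err)
        return -1;
    *out = sum;
    return 0;
}
```

The only cost left is reading the file, so the time to answer no longer depends on how quickly a paging-bound node can fork new processes.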
------
APAR: IY32972 COMPID: 5765D6100 REL: 220
ABSTRACT: LOADL CANNOT REMOVE A RP JOB
PROBLEM DESCRIPTION:
One machine in the LL pool crashed, which left the two jobs
that had been running on the machine in the LL queue.
LL on the machine was back after reboot, but llstatus showed
that resources were in use and no new jobs would start. An llcancel
put the jobs in RP, and the resources on the machine were not
freed. One job had been issued from this machine; this job
disappeared from the system after deleting the job_queue files
in spool/ and recycling LL on this machine. The second job, issued
from another machine, persists in the queue as RP. Resources remain blocked.
PROBLEM SUMMARY:
When LoadLeveler came back up after a crash,
the job previously in suspended state was gone
but llq still showed it as running.
Doing an llcancel could only set the job
state to RP without truly removing it.
PROBLEM CONCLUSION:
When LoadLeveler came back up after a crash,
the job previously in suspended state is now
able to run. And llcancel will be able to
kill the job.
------
APAR: IY32973 COMPID: 5765D5100 REL: 320
ABSTRACT: REGATTA_H:SP ATTACH, UCFGCOR WILL UNCONFIGURE A TBCPCI ADAPTER
PROBLEM DESCRIPTION:
regatta_h: spattach, ucfgcor will unconfigure a tb3pci adapter
PROBLEM SUMMARY:
The unconfig method for certain css0 devices would attempt
and sometimes fail to unconfigure any css0 device.
PROBLEM CONCLUSION:
The css0 unconfig method will no longer attempt to
unconfigure invalid device instances.
------
APAR: IY32977 COMPID: 5765D5100 REL: 320
ABSTRACT: SWITCH CLOCK 75MHZ 75.2MHZ PROBLEM
PROBLEM DESCRIPTION:
fsd calculation used to determine switch clock is incorrect
75mhz 75.2Mhz
MPI_WTIME MPI_CLOCK_SOURCE
PROBLEM SUMMARY:
It's possible for the switch fault service
daemon to incorrectly detect the clock frequency
of the switch, which can result in getting invalid
results when making a call to read the switch clock
from a program.
PROBLEM CONCLUSION:
The switch fault service daemon has been changed
to accurately determine the switch clock frequency.
------
APAR: IY33006 COMPID: 5765D5100 REL: 320
ABSTRACT: EXCESSIVE US MPCI REXMIT MESSAGES IN ERRORLOG
PROBLEM DESCRIPTION:
excessive us mpci rexmit messages in errorlog
PROBLEM SUMMARY:
Customers are seeing an excessive amount of
MPCI_REXMIT_STALL errors in the error report when using
PSSP 3.2
PROBLEM CONCLUSION:
Two future release defects improved the processing of
packets which will reduce the number of these messages.
A fix tested by one customer showed a good decrease in
the messages with these changes.
------
APAR: IY33582 COMPID: 5765D5100 REL: 320
ABSTRACT: REGATTA_H:SPATTACH: AFTER CORSAIR HOT PLUG, SDR INCORRECT
PROBLEM DESCRIPTION:
regatta_H:SPattach:after corsair hot plug, sdr incorrect
PROBLEM SUMMARY:
After adapter configuration on the SP
Switch2, the SDR adapter_config_status attribute was updated
in the SDR switch_responds class but the SP Switch2 support
code expects to find this attribute in the SDR Adapter class.
PROBLEM CONCLUSION:
Modify sdr_acs_update to change adapter
status in SDR Adapter class vs. switch_responds, only for the
SP Switch2.
------
APAR: IY33935 COMPID: 5724C3505 REL: 310
ABSTRACT: ALLOW JAVA APPS TO BE CALLED FROM STATE TABLE
PROBLEM DESCRIPTION:
Allow Java apps to be called from state table
PROBLEM SUMMARY:
Allow Java apps to be called from state table
PROBLEM CONCLUSION:
Control of incoming calls is passed to DTBE
from action InvokeStateTable rather than at the end of the
internal Action Ringing. This means state table Incoming_Call
is now executed first
------
APAR: IY34104 COMPID: 5765D5101 REL: 121
ABSTRACT: HAGSGLSM HANGS AFTER GLOBAL REBOOT
PROBLEM DESCRIPTION:
hagsglsm hangs after global reboot
PROBLEM SUMMARY:
When HAGSGLSM fails to connect to Group Services (typically
when Group Services is not ready to serve clients),
it first cleans up the connection to make sure it is
disconnected, and then retries the connection to Group
Services.
However, due to a problem in the cleanup routine, HAGSGLSM may
go into an indefinite wait (deadlock). The only workaround
is to restart the hagsglsm
daemon (but not hags) until the fix is applied.
PROBLEM CONCLUSION:
By removing the double locking, the deadlock situation will
be solved
------
APAR: IY34144 COMPID: 5765D5100 REL: 320
ABSTRACT: LATEST PSSP 3.2.0 FIXES AS OF AUGUST 2002
PROBLEM DESCRIPTION:
This is the latest PSSP ptf as of August 2002.
Order this apar to get all of the ptfs as of August 2002.
PROBLEM SUMMARY:
This is a packaging apar for PSSP 3.2.0 fixes
as of August 2002
PROBLEM CONCLUSION:
This is a packaging apar for PSSP 3.2.0
fixes as of August 2002
------
APAR: IY34232 COMPID: 5724C3505 REL: 310
ABSTRACT: NUMERIC OPS OUTPUT WRONG RESULT IN 3270 IF RESULT > 2147483647
PROBLEM DESCRIPTION:
DirectTalk 3270 script won't handle values over 2 billion.
Bad result when adding or subtracting numbers over 2 billion.
LOCAL FIX:
Multiply any of the input parameters by 1.0.
Output will then be correct, but will also have extra
decimal places, which may require modifications to the
3270 server.
PROBLEM SUMMARY:
DirectTalk 3270 script won't handle values over 2
billion.
Bad result when adding or subtracting numbers over 2 billion.
PROBLEM CONCLUSION:
Changed sprintf output from signed int to
float with no decimal places. Also added a test that
generates error 24507 and aborts the script if the limits
"1e15 > result > -1e15" are exceeded.
Note: Exceeding these limits will cause loss of precision in
calculations and hence are trapped.
------
APAR: IY34330 COMPID: 5724C3505 REL: 310
ABSTRACT: SUPPORT FOR VIAVOICE CUSTOM SERVER DURING CLEAN UP
PROBLEM DESCRIPTION:
SUPPORT FOR VIAVOICE CUSTOM SERVER DURING CLEAN UP via state
table exit
PROBLEM SUMMARY:
SUPPORT FOR VIAVOICE CUSTOM SERVER DURING CLEAN
UP via state table exit
PROBLEM CONCLUSION:
Support added
------