From: AIX Service Mail Server (aixservaustin.ibm.com)
Date: Wed Jun 26 2002 - 02:39:53 CDT


    APAR: IY28660 COMPID: 5765D6100 REL: 220
    ABSTRACT: LLSUMMARY -R THROUGHPUT/MAXQUEUED AND REAL TIME MULTIPLE OF

    PROBLEM DESCRIPTION:
    llsummary -r throughput reports incorrect Queue and Real times.
    For parallel jobs, the value is multiplied by the number of nodes used.

    PROBLEM SUMMARY:
    The throughput reports, produced by the LoadLeveler
    llsummary command, can produce higher than appropriate
    Queue Time and Real Time numbers for parallel jobs that ran
    on multiple nodes.

    PROBLEM CONCLUSION:
    The llsummary command had been adding the Queue Time and
    Real Time numbers from each node that was used to execute
    a parallel job. That made the resulting numbers too high
    by the number of nodes that was used for the job.
    The command was changed to first determine if the job was a
    serial job or a parallel job, and to do the correct
    calculations, after that.
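
    The corrected accounting can be sketched as follows. This is only an
    illustration of the idea, not the llsummary source: the NodeRecord type
    and both function names are hypothetical, and the sketch assumes every
    node of a parallel job reports the same job-level Queue Time and Real
    Time.

```python
from dataclasses import dataclass


@dataclass
class NodeRecord:
    """One accounting record per node that ran part of a job (hypothetical)."""
    job_id: str
    queue_time: float  # seconds the job waited in the queue
    real_time: float   # wall-clock seconds the job ran


def job_times_buggy(records):
    """Old behaviour: add the times reported by every node of the job."""
    return (sum(r.queue_time for r in records),
            sum(r.real_time for r in records))


def job_times_fixed(records):
    """Fixed behaviour: decide serial vs. parallel first, then calculate."""
    if len(records) > 1:
        # Parallel job: count the job-level times once, not once per node.
        first = records[0]
        return (first.queue_time, first.real_time)
    # Serial job: a single record, so the plain sum is already correct.
    return job_times_buggy(records)
```

    With four nodes each reporting 60 seconds queued and 600 seconds of run
    time, the old calculation yields 240 and 2400, while the corrected one
    yields 60 and 600.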

    ------

    APAR: IY28683 COMPID: 5765D6100 REL: 220
    ABSTRACT: MAXSLOTS AND FREESLOTS IN LLCLASS OUTPUT CALCULATE WRONG WHEN

    PROBLEM DESCRIPTION:
    llclass output calculates maxslots and freeslots
    correctly as long as maxjobs is not set.
    If it is set, the value of maxjobs limits the slots;
    however, if jobs are running, freeslots is reduced
    by the number of tasks.
    As a result, a single job can appear to occupy all the slots
    even though there are still CPUs left over.

    PROBLEM SUMMARY:
    In LoadLeveler 2.2,
    the MAXSLOTS and FREESLOTS values in the llclass output
    are incorrect when the maxjobs value is set.

    PROBLEM CONCLUSION:
    In LoadLeveler 2.2,
    the MAXSLOTS and FREESLOTS values from
    the llclass output are
    now calculated based on the number
    of machines that have tasks running
    as well as the max_starter and maxjobs
    values, if set.

    ------

    APAR: IY29118 COMPID: 5765D6100 REL: 220
    ABSTRACT: MACHPRIO DOES WRONG CALCULATION OVER TIME

    PROBLEM DESCRIPTION:
    MACHPRIO is very often based on a computation involving
    LoadAvg. LoadLeveler adjusts the LoadAvg by the
    value of NEGOTIATOR_LOADAVG_INCREMENT when a job is
    started on that node.
    Unfortunately, this increment can persist longer than intended,
    and if multiple jobs are started on the node in question
    the increments accumulate to very strange values (values of up to
    -1240.00 were observed).
    If the machine is idle for a certain amount of time,
    the MACHPRIO value recovers on its own.

    LOCAL FIX:
    Recycling the Negotiator or Startd on the problem node
    recovers immediately.
    Alternatively, a sequence of "llctl resume" commands to that node
    can eventually recover as well.

    PROBLEM SUMMARY:
    If Loadavg is used in your calculation for MACHPRIO in the
    LoadL_config file, the MACHPRIO value can sometimes get
    values well out of the range of what it should be. The
    problem can happen if a parallel job starts more than two
    tasks on the same node. Newer hardware, with increased
    numbers of CPUs, are most susceptible to this problem.
    That is based on the assumption that the MAX_STARTERS value
    is set equal to the number of CPUs on the machine.

    PROBLEM CONCLUSION:
    The Negotiator internally adjusts a machine's loadavg when
    it starts a new job on that machine. As part of that
    adjustment, the Negotiator could sometimes keep adjusting
    the adjusted value instead of adjusting the real load value
    that it received from the machine. The code that
    determined which value to adjust was modified to correct
    the problem.
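
    The difference between the two behaviours can be sketched as below.
    This is an illustration only: the function names are hypothetical, and
    the increment of 0.5 merely stands in for whatever value
    NEGOTIATOR_LOADAVG_INCREMENT has in a given configuration.

```python
def adjust_buggy(reported_load, previous_adjusted, increment):
    """Faulty path: keeps adjusting the already adjusted value."""
    return previous_adjusted + increment


def adjust_fixed(reported_load, previous_adjusted, increment):
    """Corrected path: adjusts the real load value received from the machine."""
    return reported_load + increment


def simulate(adjust, reported_load, jobs_started, increment=0.5):
    """Apply one adjustment per started job and return the final value."""
    adjusted = reported_load
    for _ in range(jobs_started):
        adjusted = adjust(reported_load, adjusted, increment)
    return adjusted
```

    Starting three jobs on a machine that reports a load of 1.0 leaves the
    faulty path at 2.5 and the corrected path at 1.5, which shows how the
    error grows with the number of jobs started on one node.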

    ------

    APAR: IY29205 COMPID: 5765D9300 REL: 310
    ABSTRACT: FAILURE IN CREATING CORE DIRECTORIES AND FILES USING MP_COREDIR

    PROBLEM DESCRIPTION:
      (a) MP_COREDIR must be set to a directory where the user has
    permission to write files in the parent of the specified
    directory. For example, MP_COREDIR=/tmp will not work because
    the user does not have permission to create files in "/", the
    parent of /tmp.
      (b) If MP_COREDIR=/tmp/abc then core directories called abc.0,
    abc.1, etc, will be created in /tmp and these directories will
    contain the core files (light weight or regular). Note that
    the coredir name is not abc with subdirectories abc.0, etc, as
    expected.
      (c) If MP_COREDIR is unset then the core directories will be
    created in the directory where the user's job is run. This is
    as expected.
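
    The naming rule in (b) and the permission requirement in (a) can be
    sketched as follows. The helper names are hypothetical and this is not
    POE code; it only restates the behaviour described above.

```python
import os
import posixpath


def core_dir_names(mp_coredir, task_ids):
    """Per-task core directories are siblings named <MP_COREDIR>.<task>."""
    base = mp_coredir.rstrip("/")
    return [f"{base}.{task}" for task in task_ids]


def parent_is_writable(mp_coredir):
    """The user must be able to create entries in the parent of MP_COREDIR."""
    parent = posixpath.dirname(mp_coredir.rstrip("/")) or "."
    return os.access(parent, os.W_OK)
```

    For MP_COREDIR=/tmp/abc and tasks 0 and 1 the first helper returns
    /tmp/abc.0 and /tmp/abc.1. For MP_COREDIR=/tmp the parent is "/", which
    an ordinary user normally cannot write to.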

    PROBLEM SUMMARY:
    If the user sets MP_COREDIR=/tmp then the core file created
    will be in a file called core and if there are multiple
    tasks then only one core file is created called core.
    The following error message is received by the user:
    ERROR: 0031-144 error creating directory for core files,
    reason: <The file access permissions do not allow the
    specified action.>

    PROBLEM CONCLUSION:
    MP_COREDIR must be set to a directory where the user has
    permission to write files in the parent of the specified
    directory. For example, MP_COREDIR=/tmp will not work
    because the user does not have permission to create files in
    "/", the parent of /tmp.

    ------

    APAR: IY29786 COMPID: 5765D5100 REL: 320
    ABSTRACT: MEMORY EXHAUSTED W/ NON-CONTIGUOUS USER DEFINED DATA TYPES

    PROBLEM DESCRIPTION:
    Memory leak sending MPI non-contiguous user defined types with
    MPI_GATHER. After a few thousand iterations, the job will end
    with ERROR: 0032-171 Communication subsystem error: Memory is
    exhausted. in MPI_Gather, task 0. For 32-bit US it runs out at
    1,043,000 cycles, for 32-bit IP it runs out at 38,000 cycles, and
    for 64-bit PRPQ over US it runs out at 521,000 cycles.

    PROBLEM SUMMARY:
    Memory leak sending MPI non-contiguous user defined types
    with MPI_GATHER. After a few thousand iterations, the job
    will end with ERROR: 0032-171 Communication subsystem error:
    Memory is exhausted. The fix frees memory
    allocations that are no longer in use.

    PROBLEM CONCLUSION:
    The fix is effective. Unused memory is now freed.

    ------

    APAR: IY29853 COMPID: 5765D5100 REL: 320
    ABSTRACT: CHANGES FOR AIX51 HEADERS/LIBS

    PROBLEM DESCRIPTION:
    changes for aix51 headers/libs

    PROBLEM SUMMARY:
    Needed to define local storage for errno.

    PROBLEM CONCLUSION:
    Defined local storage for errno.

    ------

    APAR: IY29924 COMPID: 5765D5100 REL: 320
    ABSTRACT: RAS: IMPROVE CSS.SNAP DATA COLLECTION

    PROBLEM DESCRIPTION:
    The css.snap command needs to improve the data collected on some
    'soft' snaps.

    PROBLEM SUMMARY:
    A 'soft' css.snap command needs to collect a switch adapter
    microcode dump under certain conditions.

    PROBLEM CONCLUSION:
    The css.snap script has been changed to improve RAS.
    The changes allow for a soft css.snap to collect a switch
    adapter microcode dump in certain cases.

    ------

    APAR: IY30011 COMPID: 5765D5100 REL: 320
    ABSTRACT: UNINITIALIZED VARIABLE IN COPY MACRO

    PROBLEM DESCRIPTION:
    uninitialized variable in copy macro

    PROBLEM SUMMARY:
    A variable is used without being initialized. This may
    cause unpredictable problems such as data corruption.

    PROBLEM CONCLUSION:
    Initialize the variable before using it.

    ------

    APAR: IY30028 COMPID: 5765D5100 REL: 320
    ABSTRACT: WITH RRA ON, THE .KLOGIN ON THE NODE SHOULD ONLY HAVE THE

    PROBLEM DESCRIPTION:
    The root.admin or K4 admin id is always added to
    /.klogin, but it should only be added if RRA is not on, or if
    RRA is on and the machine is the cws.
    There should be no admin entry in /.klogin on the nodes if RRA is on.

    LOCAL FIX:
    Workaround: change the updauthfiles script at line 880
       from
    if (defined($k4_admin)) { print KLOGIN_F "$k4_admin\n"; }
    to
    if ((defined($k4_admin)) &&
       (($local_node_number == 0) || ( $restrict_root_rcmd ne "true" ))){
       print KLOGIN_F "$k4_admin\n";
    }
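
    The condition in the workaround can be restated as a small truth
    function. This is a Python paraphrase of the Perl test above, written
    only to make the three cases explicit; the function name is
    hypothetical, and node number 0 is taken to be the control workstation,
    as the description implies.

```python
def should_add_admin(k4_admin, local_node_number, restrict_root_rcmd):
    """Return True when the K4 admin entry belongs in /.klogin."""
    if k4_admin is None:
        return False  # no admin principal defined, nothing to add
    # Add the entry on the control workstation (node 0), or anywhere
    # when restricted root access is not enabled.
    return local_node_number == 0 or restrict_root_rcmd != "true"
```

    A node with a non-zero node number and restrict_root_rcmd set to "true"
    is the only case in which a defined admin id is left out.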

    PROBLEM SUMMARY:
    ***********************************************************
    * USERS AFFECTED: Users with ssp.basic 3.2.0.15 *
    * or greater, installed on a node with *
    * the restrict_root_rcmd attribute of *
    * the SP_Restricted class set to true, *
    * that are using Kerberos 4. *
    * *
    ***********************************************************
    * PROBLEM DESCRIPTION: *
    * *
    * Running /usr/lpp/ssp/bin/updauthfiles on a node when *
    * the restrict_root_rcmd attribute of the SP_Restricted *
    * class is true and Kerberos 4 is being used as the *
    * authentication method for a partition, an entry is *
    * being made in .klogin for root.admin. *
    * *
    ***********************************************************
    * RECOMMENDATION: *
    * *
    * Install APAR IY30028, currently targeted for *
    * ssp.basic 3.2.0.20 on PTF Set 20, when available. *
    * *
    * Until APAR IY30028 is available, after running *
    * updauthfiles edit .klogin to remove the root.admin *
    * entry if the restrict_root_rcmd attribute of the *
    * SP_Restricted class is true. *
    * *
    ***********************************************************

    ------

    APAR: IY30039 COMPID: 5765D5100 REL: 320
    ABSTRACT: CLEANUP.LOGS.WS FAILS AT PSSP-3.2 PTF18

    PROBLEM DESCRIPTION:
    The last line of cleanup.logs.ws is a tilde (~). This causes the
    error: /root: 0403-006 Execute permission denied.

    LOCAL FIX:
    Remove the line with the tilde.

    PROBLEM SUMMARY:
    An extra line was inadvertently added to the end of
    cleanup.logs.ws. The extra line contained one character,
    the tilde (~). This would cause an execution error like:
    /usr/lpp/ssp/bin/cleanup.logs.ws 240 : /:
    0403-006 Execute permission denied.
    The command still works properly.
    The error message can be ignored.

    PROBLEM CONCLUSION:
    The extra line has been removed.

    TEMPORARY FIX:
    Edit /usr/lpp/ssp/bin/cleanup.logs.ws and remove the last
    line.

    ------

    APAR: IY30050 COMPID: 5765D5100 REL: 320
    ABSTRACT: SYSLOGD NOT RESTARTED BY CLEANUP.LOGS.NODES

    PROBLEM DESCRIPTION:
    The syslogm routine in logmgt.cmds, called by cleanup.logs.nodes
    (via psyslclr) stops syslogd after trimming a log. If trimming
    another log results in an error, the routine exits without
    restarting syslogd.

    PROBLEM SUMMARY:
    When psyslclr is invoked to trim multiple logs and it
    successfully trims the first log, but does not have enough
    space to trim subsequent logs, syslogd is stopped
    but not restarted.

    PROBLEM CONCLUSION:
    Modified code in logmgt.cmds so that if syslogd is stopped,
    it is always restarted.
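
    The "always restarted" guarantee is the familiar try/finally pattern.
    The sketch below is a Python analogy, not the logmgt.cmds code; the
    function and its callable parameters are hypothetical stand-ins for
    stopping syslogd, trimming one log, and starting syslogd again.

```python
def trim_logs(logs, trim, stop_syslogd, start_syslogd):
    """Trim each log with syslogd stopped, restarting it no matter what."""
    stop_syslogd()
    try:
        for log in logs:
            trim(log)  # may raise, for example when space runs out
    finally:
        # Runs on success and on error alike, so syslogd never stays down.
        start_syslogd()
```

    If trimming the second of three logs fails, the error still reaches the
    caller, but only after syslogd has been started again.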

    ------

    APAR: IY30184 COMPID: 5765D5100 REL: 320
    ABSTRACT: IP_RESET(IP_INIT) ERRORS ON SP SWITCH

    PROBLEM DESCRIPTION:
    ip_reset(IP_INIT) errors on sp switch

    PROBLEM SUMMARY:
    When a new set of switch routes must be
    downloaded to a node on the SP switch, it's possible for an
    ip_reset(IP_INIT) error to occur. This error can only be
    addressed by rebooting the affected node.

    PROBLEM CONCLUSION:
    The SP Switch IP driver and microcode have
    been changed to prevent ip_reset (IP_INIT) errors from
    occurring.

    ------

    APAR: IY30196 COMPID: 5765B9500 REL: 140
    ABSTRACT: GPFS MINOR NUMBERS NOT SYNCHRONIZED BETWEEN HACMP NODES CAN

    PROBLEM DESCRIPTION:
    GPFS is not picky about the minor numbers it assigns to its
    filesystem entries in /dev. Basically it just starts at 100 and
    increments until it finds a free number.
    The problem that occurs on hacmp clusters when clients are NFS
    mounting the GPFS filesystems, is that NFS receives a filehandle
    based, in part, on the minor number of the filesystem.
    If different clients are accessing the same filesystem from two
    different gpfs nodes using differing device minor numbers (and
    thus different filehandles), when a failover occurs, the node
    now handling all the clients will not recognize the other node's
    client requests.

    LOCAL FIX:
    Manually synchronize the device minor numbers when the file-
    systems are created, and monitor them periodically in case one
    gets deleted (which will result in gpfs recreating it in the
    original manner).

    PROBLEM SUMMARY:
    Fixed /dev minor numbers are needed for NFS
    failover of gpfs server nodes.

    PROBLEM CONCLUSION:
    Start assigning permanent minor numbers
    to all new file systems. The minor numbers will be in the
    range 150-maxminornumber (65535 or 255).

    TEMPORARY FIX:
    Manually synchronize the device minor numbers
    when the file-systems are created, and monitor them
    periodically in case one gets deleted (which will result in
    gpfs recreating it in the original manner).
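
    The periodic monitoring suggested above amounts to comparing the minor
    number of each file system device across nodes. A minimal sketch,
    assuming the minor numbers have already been collected into a dictionary
    per node (how they are gathered is outside this sketch; the function
    name and data layout are hypothetical):

```python
def mismatched_minors(minors_by_node):
    """Return device names whose minor number differs between nodes.

    minors_by_node maps a node name to {device name: minor number}.
    """
    seen = {}
    for devices in minors_by_node.values():
        for device, minor in devices.items():
            seen.setdefault(device, set()).add(minor)
    # More than one distinct minor number means the nodes disagree.
    return sorted(dev for dev, minors in seen.items() if len(minors) > 1)
```

    Any device this reports would hand out different NFS filehandles
    depending on which node serves it, which is the failover problem
    described in this APAR.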

    ------

    APAR: IY30224 COMPID: 5765B9501 REL: 320
    ABSTRACT: MMCRLV FAILS YET RETURNS ERROR CODE OF 0

    PROBLEM DESCRIPTION:
    mmcrlv fails yet returns error code of 0

    PROBLEM SUMMARY:
    mmcrlv and mmcrvsd updated to ensure a nonzero return code on a
    failure related to an hdisk that is already part of a VG.

    PROBLEM CONCLUSION:
    mmcrlv and mmcrvsd: fix bug in which conditional call to
    unlockSDR() clobbered rc
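
    The class of bug described here, a cleanup call overwriting an earlier
    failure code, can be sketched as below. This is a Python analogy with
    hypothetical names; the callables stand in for the create step and for
    unlockSDR(), and the exact fix in mmcrlv and mmcrvsd may differ.

```python
def create_buggy(do_create, unlock_sdr, holds_lock):
    """The unlock result overwrites rc, hiding a failed create."""
    rc = do_create()
    if holds_lock:
        rc = unlock_sdr()
    return rc


def create_fixed(do_create, unlock_sdr, holds_lock):
    """Keep the first failure; only report the unlock result if create succeeded."""
    rc = do_create()
    if holds_lock:
        unlock_rc = unlock_sdr()
        if rc == 0:
            rc = unlock_rc
    return rc
```

    When the create step returns 1 and the unlock succeeds, the first
    version returns 0 and the second returns 1.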

    ------

    APAR: IY30247 COMPID: 5765D5100 REL: 320
    ABSTRACT: KFSERVER_TIMEOUT DEFAULT SHOULD BE 200

    PROBLEM DESCRIPTION:
    kfserver_timeout default should be 200

    PROBLEM SUMMARY:
    The value of the kfserver_timeout attribute in the SP class
    is currently set to 600. There is no longer a reason for
    the value to be this high. It should be lowered to 200.

    PROBLEM CONCLUSION:
    The value of the kfserver_timeout attribute in the SP class
    has been lowered to 200.

    ------

    APAR: IY30383 COMPID: 5765B9501 REL: 330
    ABSTRACT: MEMORY LEAK WHILE USING DMAPI

    PROBLEM DESCRIPTION:
    Memory leak while using DMAPI.

    PROBLEM SUMMARY:
    Fixed memory leak in DMAPI.

    PROBLEM CONCLUSION:
    sfsdmgetdirattrs: not freeing inode
    buffer.

    ------

    APAR: IY30392 COMPID: 5765D5100 REL: 320
    ABSTRACT: UPDSUPPWD SHOULD ONLY UPDATE THE SUPMAN PASSWORD IF CHANGED

    PROBLEM DESCRIPTION:
    updsuppwd should only update the supman password if changed

    PROBLEM SUMMARY:
    The updsuppwd routine will compare the checksum files of the
    current password and the previous password. If the checksum
    files match then the password will not be transferred over
    the s1term and updated on the node.

    PROBLEM CONCLUSION:
    The updsuppwd routine must be changed to not request the
    password for the supman id over the s1term if the password
    has not changed.
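
    The comparison can be sketched as follows. The APAR does not say which
    checksum updsuppwd uses, so SHA-256 here is purely an assumption for
    illustration, and both function names are hypothetical.

```python
import hashlib


def checksum(password: bytes) -> str:
    """Checksum of a password (SHA-256 is an assumption, not the real tool)."""
    return hashlib.sha256(password).hexdigest()


def needs_update(current_sum: str, previous_sum: str) -> bool:
    """Transfer the password over the s1term only when the checksums differ."""
    return current_sum != previous_sum
```

    Comparing stored checksums rather than the passwords themselves means an
    unchanged password never has to be sent at all.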

    ------

    APAR: IY30436 COMPID: 5765B9501 REL: 320
    ABSTRACT: CATCH RUNAWAY QUOTA INDOUBT VALUES

    PROBLEM DESCRIPTION:
    catch runaway quota indoubt values

    PROBLEM SUMMARY:
    Debug code added to capture runaway quota condition.

    PROBLEM CONCLUSION:
    Add a trigger that asserts on "run-away" inDoubt values in
    update() routine, so that the stripe group manager gets log
    assert and trace data can be collected.

    ------

    APAR: IY30595 COMPID: 5765B9501 REL: 320
    ABSTRACT: MMDELDISK -C STOPS ON EMEDIA ERROR

    PROBLEM DESCRIPTION:
    mmdeldisk -c stops on emedia error

    PROBLEM SUMMARY:
    Fixed mmdeldisk stopping on EMEDIA error

    PROBLEM CONCLUSION:
    Check for both EIO and EMEDIA errors on reads only on
    copyReplicas when deciding to 'break' disk addresses
    pointing to bad stripes.

    ------

    APAR: IY30596 COMPID: 5765B9501 REL: 330
    ABSTRACT: MMDELDISK -C STOPS ON EMEDIA ERROR

    PROBLEM DESCRIPTION:
    mmdeldisk -c stops on emedia error

    PROBLEM SUMMARY:
    Fixed mmdeldisk stopping on EMEDIA error

    PROBLEM CONCLUSION:
    Check for both EIO and EMEDIA errors on reads only on
    copyReplicas when deciding to 'break' disk addresses
    pointing to bad stripes.

    ------

    APAR: IY30597 COMPID: 5765B9501 REL: 320
    ABSTRACT: MEMORY LEAK WHILE USING DMAPI

    PROBLEM DESCRIPTION:
    Memory leak while using DMAPI.

    PROBLEM SUMMARY:
    Fixed memory leak in DMAPI.

    PROBLEM CONCLUSION:
    sfsdmgetdirattrs: not freeing inode
    buffer.

    ------

    APAR: IY30600 COMPID: 5765B9501 REL: 320
    ABSTRACT: ASSERT --IBDP1->INDDIRTY && IBDP2->INDDIRTY, METADATA.C, LINE 98

    PROBLEM DESCRIPTION:
    assert --ibdp1->inddirty && ibdp2->inddirty, metadata.c line 98

    PROBLEM SUMMARY:
    Fixed Assert condition: ibdP1->indDirty && ibdP2->indDirty

    PROBLEM CONCLUSION:
    In doubleUpdateDiskAddr, don't assert that the indirect
    blocks are dirty. The update might have already been done by
    another node which failed after logging the indirect block
    changes. Also, don't deallocate the old addresses unless
    they changed since the deallocation might have also happened
    before the node crashed.

    ------

    APAR: IY30603 COMPID: 5765B9501 REL: 320
    ABSTRACT: PANIC--FETCH-VFS-KX.C

    PROBLEM DESCRIPTION:
    panic--fetch-vfs-kx.c

    PROBLEM SUMMARY:
    Fixed panic condition in fetch-vfs-kx.C::bdP->whichBufList
    != w

    PROBLEM CONCLUSION:
    Prefetch list mutex does not need to be dropped across the
    call to cacheObjRele, since the hold count cannot go to
    zero.

    ------

    APAR: IY30604 COMPID: 5765B9501 REL: 330
    ABSTRACT: PANIC--FETCH-VFS-KX.C

    PROBLEM DESCRIPTION:
    panic--fetch-vfs-kx.c

    PROBLEM SUMMARY:
    Fixed panic condition in fetch-vfs-kx.C::bdP->whichBufList
    != w

    PROBLEM CONCLUSION:
    Prefetch list mutex does not need to be dropped across the
    call to cacheObjRele, since the hold count cannot go to
    zero.

    ------

    APAR: IY30610 COMPID: 5765B9501 REL: 320
    ABSTRACT: MMCHECKQUOTA PRODUCES NEGATIVE NUMBERS

    PROBLEM DESCRIPTION:
    mmcheckquota sometimes produces negative numbers when GPFS is
    under heavy load.

    PROBLEM SUMMARY:
    mmcheckquota sometimes produces negative numbers for disk
    usage.

    PROBLEM CONCLUSION:
    Do not update server's shadow entries at ComputeShare and
    Relinquish routines since the quota usage and quota share
    accounting in this case is done through regular quota
    entries.

    ------

    APAR: IY30651 COMPID: 5765D5100 REL: 320
    ABSTRACT: PMAN ARRAY LIMIT MEANS THAT WHEN AN EVENT HAPPENS, A MESSAGE MAY

    PROBLEM DESCRIPTION:
    Pman internal array default of 16 adapters per node may not be
    enough and can overwrite the pman definitions, causing the
    nodes not to see any pman definitions!

    LOCAL FIX:
    pmand uses an internal array to read the SDR Adapter info. into.
    This array is hard coded to 16 members (for each node).
    If you have more than 16, it writes beyond the end of the
    array, stepping on the PMAN_Subscription variable.
    The array size has been set to 32 in a new version of pmand.
    Until this new pmand is used, try to reduce the amount of SDR info.

    PROBLEM SUMMARY:
    The code uses the PMAN_subscription variable to remember if
    the SDR file is the new PMAN_Subscription file or the old
    pmandConfig file.
    Because the variable got stepped on when the array
    overflowed, the code was incorrectly looking for a
    pmandConfig file. The result is it does not find
    any events, because it is looking in the wrong
    place for them.

    PROBLEM CONCLUSION:
    Increased the size of the array (number of adapters per
    node) from 16 to 64. This will prevent the array from
    being overrun and the PMAN_Subscription variable from
    getting stepped on.

    ------

    APAR: IY30692 COMPID: 5765D5100 REL: 320
    ABSTRACT: DUPLICATE CALLS TO FREE() CAUSING CORE DUMP IN SDR

    PROBLEM DESCRIPTION:
    free() is being called twice on the same memory block,
    causing sdrd to core dump.

    PROBLEM SUMMARY:
    A duplicate call to free() was causing sdrd to core dump.

    PROBLEM CONCLUSION:
    The duplicate call has been removed.

    ------

    APAR: IY30693 COMPID: 5765D5100 REL: 320
    ABSTRACT: CSS.SNAP.LOG FILE CAN BE OVERWRITTEN

    PROBLEM DESCRIPTION:
    css.snap.log file can be overwritten

    PROBLEM SUMMARY:
    If the contents of the css log directories in
    /var/adm/SPlogs/css occupy more than 30% of /var, the
    css.snap utility will try to free space by deleting old
    css.snap files. If there are no files with names ending
    in "....css.snap.tar.Z", the css.snap.log file will be
    overwritten.

    PROBLEM CONCLUSION:
    The output of the "ls" command to list the css.snap tar
    files is appended to the end of the css.snap.log file.

    ------

    APAR: IY31117 COMPID: 5765B9501 REL: 330
    ABSTRACT: PROBLEMS MOUNTING GPFS FS AFTER DELETING DISKS. DISK DESCRIPTOR

    PROBLEM DESCRIPTION:
    Problems mounting gpfs fs after deleting disks. The error
    6027-711 was received which indicated that the disk or fs
    does not exist. It mentioned the deleted disks. The mmsdrfs2
    file in the SDR and /var/mmfs/gen were updated and did not show
    the disks. The problem is that the disk descriptor areas on
    some vsd's are not updated. By chance, the ones that are not
    updated are the first one gpfs uses in attempting to mount the
    fs causing the failure.

    PROBLEM SUMMARY:
    After the mmdeldisk command, some filesystems could not be
    remounted because of old replica data.

    PROBLEM CONCLUSION:
    When migrating the stripe group descriptor to a new replica
    set, update the copy of the descriptor on all other disks in
    the stripe group as well. This is necessary to prevent
    future attempts to read from disks in the old replica set in
    case these disks have since been deleted from the stripe
    group.

    ------

    APAR: IY31130 COMPID: 5765B9501 REL: 320
    ABSTRACT: PROBLEMS MOUNTING GPFS FS AFTER DELETING DISKS. DISK DESCRIPTOR

    PROBLEM DESCRIPTION:
    Problems mounting gpfs fs after deleting disks. The error
    6027-711 was received which indicated that the disk or fs
    does not exist. It mentioned the deleted disks. The mmsdrfs2
    file in the SDR and /var/mmfs/gen were updated and did not show
    the disks. The problem is that the disk descriptor areas on
    some vsd's are not updated. By chance, the ones that are not
    updated are the first one gpfs uses in attempting to mount the
    fs causing the failure.

    PROBLEM SUMMARY:
    After the mmdeldisk command, some filesystems
    could not be remounted because of old replica data.

    PROBLEM CONCLUSION:
    When migrating the stripe group descriptor
    to a new replica set, update the copy of the descriptor on all
    other disks in the stripe group as well. This is necessary
    to prevent future attempts to read from disks in the old
    replica set in case these disks have since been deleted from
    the stripe group.

    ------

    APAR: IY31150 COMPID: 5765B9501 REL: 330
    ABSTRACT: MMFS: FCNTL LOCK LOOPING ON A NODE

    PROBLEM DESCRIPTION:
    mmfs hanging in fcntl lock on one node while trying to revoke
    from another node that had already relinquished that token, but
    had forgotten to tell the token manager.

    PROBLEM SUMMARY:
    fixed multi-node fcntl token locking condition.

    PROBLEM CONCLUSION:
    always relinquish down to nl in revoke
    handler when byte range tokens are unknown.

    ------

    APAR: IY31172 COMPID: 5765D6100 REL: 220
    ABSTRACT: TIMING EXPOSURE IN LOADL_NEGOTIATOR CAUSES DEADLOCK

    PROBLEM DESCRIPTION:
    Timing exposures, between a job completion and a Negotiator
    Cycle can cause a deadlock condition in the Negotiator.

    PROBLEM SUMMARY:
    A timing exposure in the LoadLeveler Negotiator could make
    it think that there were jobs running, on a machine, when
    they had already finished. That wrong assumption could
    cause the Negotiator to try to get the same lock, for
    write, a second time. The Negotiator would hang, after
    that.

    PROBLEM CONCLUSION:
    The LoadLeveler Negotiator added a second verification
    that there were jobs running, on a machine, before trying
    to use certain data about the jobs on that machine.

    ------

    APAR: IY31173 COMPID: 5765D6100 REL: 220
    ABSTRACT: A CANCELED INTERACTIVE JOB MIGHT CAUSE THE NEGOTIATOR TO QUIT

    PROBLEM DESCRIPTION:
    If an Interactive job is Ctrl-C'd at the same time that the
    Negotiator decides that it cannot schedule the job to run, the
    LoadL_negotiator daemon may fail to handle the job correctly
    and will intentionally terminate itself.

    PROBLEM SUMMARY:
    An interactive poe job can cause the LoadLeveler Negotiator
    to get confused, and decide to terminate itself, if the
    interactive job is terminated at just the right time during
    the negotiation cycle.

    PROBLEM CONCLUSION:
    The LoadLeveler Negotiator was modified to keep better
    track of interactive jobs, during the negotiation cycle.

    ------

    APAR: IY31238 COMPID: 5765D5100 REL: 320
    ABSTRACT: SETUP_SERVER SHOULD IGNORE PPP CONNECTIONS

    PROBLEM DESCRIPTION:
    If a pp0 adapter is present, setup_server fails.
    setup_server : host: 0827-803 Cannot find address 0.0.0.0.
    setup_CWS: 0016-338 Kerberos setup was bypassed for network
    interfaces that could not be resolved
    setup_server ends with rc = 0, but the node you are installing
    does not receive a Kerberos ticket.
    Circumventing this problem by detaching pp0 means that
    svcagent cannot be active and running during the setup_server
    action.

    LOCAL FIX:
    A good workaround is to add an entry to /etc/hosts like:
    zero 0.0.0.0 # dummy ppp entry to prevent setup_server problems

    PROBLEM SUMMARY:
    When the Point-to-Point Protocol (PPP) is being used on
    a Control Workstation, setup_CWS will terminate processing
    with the messages:
    host: 0827-803 Cannot find address 0.0.0.0.
    setup_CWS: 0016-338 Kerberos setup was bypassed for
               network interfaces that could not be resolved.
    Since the Point-to-Point Protocol is being displayed in
    the netstat -in data, setup_CWS tries to determine the
    IP addresses for these interfaces and fails. The data
    from the Point-to-Point Protocol should be ignored
    by setup_CWS.

    PROBLEM CONCLUSION:
    setup_CWS has been modified to skip lines of data from
    netstat -in which refer to the Point-to-Point Protocol.
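
    The filtering step can be sketched as below. This is not the setup_CWS
    code: the function name is hypothetical, the sample output is
    illustrative only, and identifying PPP lines by an interface name that
    starts with "pp" (as in the pp0 adapter mentioned above) is an
    assumption of this sketch.

```python
# Illustrative netstat -in style output; the values are made up.
SAMPLE = """Name  Mtu   Network    Address       Ipkts Ierrs Opkts Oerrs Coll
en0   1500  192.168.1  192.168.1.10  1000  0     900   0     0
pp0   1500  0.0.0.0    0.0.0.0       0     0     0     0     0
lo0   16896 127        127.0.0.1     50    0     50    0     0"""


def non_ppp_lines(netstat_output):
    """Return the lines that do not describe a PPP interface."""
    kept = []
    for line in netstat_output.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("pp"):
            continue  # skip PPP interfaces such as pp0
        kept.append(line)
    return kept
```

    Applied to the sample, the header, en0 and lo0 lines are kept and the
    pp0 line with address 0.0.0.0 is dropped, so no lookup of 0.0.0.0 is
    ever attempted.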

    ------

    APAR: IY31239 COMPID: 5765D5100 REL: 320
    ABSTRACT: SP SWITCH 2 WINDOW SUSPEND FAILURE AFTER LOADLEVELER HAS TRIED

    PROBLEM DESCRIPTION:
    On the SP Switch 2, a failure may occur suspending windows if
    a job fails to respond to the suspend request that is issued
    during switch recovery. This problem can happen if a job has
    a SIGKILL pending (having been killed by LoadLeveler) but has
    not yet fully processed the SIGKILL because a thread is in a
    system call with signals blocked. When the windows fail to
    suspend because of a non-responsive job, switch recovery will
    fail on the affected node, and switch responds will be lost on
    the affected switch plane.

    PROBLEM SUMMARY:
    Nodes can drop off the SP Switch 2 when the switch device
    driver fails to suspend jobs that are running. The
    adapter.log will show the following error:
    QUERY SUSPEND WINDOW_COMPLETION ioctl failed

    PROBLEM CONCLUSION:
    The device driver for the SP Switch 2 has been changed
    to allow suspend requests to be properly handled for
    jobs that are starting or stopping.

    ------

    APAR: IY31245 COMPID: 5765D5100 REL: 320
    ABSTRACT: RC.SP SETS THE WRONG BOOTLIST IF TOTAL BOOTDISKS NOT

    PROBLEM DESCRIPTION:
    rc.sp sets the wrong bootlist if total bootdisks
    not equivalent to total install disks

    PROBLEM SUMMARY:
    On the reboot of a node, the bootlist was being reset to
    include all of the physical volumes listed for the selected
    volume group of the node. Even the physical volumes that
    did not contain boot logical volumes were included in
    the bootlist. If there was a high number of physical
    volumes it could cause a subsequent reboot to fail.

    PROBLEM CONCLUSION:
    spboot, which is called by /etc/rc.sp, was modified to only
    set the bootlist to physical volumes that contain boot
    logical volumes.
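
    The corrected selection is a simple filter. In the sketch below the
    function name and the hdisk names in the usage note are hypothetical,
    and how spboot determines which physical volumes hold a boot logical
    volume is not covered.

```python
def bootlist_for(volume_group_pvs, pvs_with_boot_lv):
    """Keep only the physical volumes that contain a boot logical volume.

    The order of the volume group listing is preserved.
    """
    bootable = set(pvs_with_boot_lv)
    return [pv for pv in volume_group_pvs if pv in bootable]
```

    For a volume group listing hdisk0, hdisk1 and hdisk2 where only hdisk0
    and hdisk2 hold boot logical volumes, the resulting bootlist is hdisk0,
    hdisk2.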

    ------

    APAR: IY31249 COMPID: 5765B9501 REL: 320
    ABSTRACT: MMFS: FCNTL LOCK LOOPING ON A NODE

    PROBLEM DESCRIPTION:
    mmfs hanging in fcntl lock on one node while trying to revoke
    from another node that had already relinquished that token, but
    had forgotten to tell the token manager.

    PROBLEM SUMMARY:
    fixed multi-node fcntl token locking condition.

    PROBLEM CONCLUSION:
    always relinquish down to nl in revoke
    handler when byte range tokens are unknown.

    ------

    APAR: IY31253 COMPID: 5765B9501 REL: 330
    ABSTRACT: PROBLEMS WITH NOSUID FLAG IN GPFS FILESYSTEMS

    PROBLEM DESCRIPTION:
    problems with nosuid flag in gpfs filesystems

    PROBLEM SUMMARY:
    Security Problem.

    PROBLEM CONCLUSION:
    Security Problem Resolved.

    ------

    APAR: IY31357 COMPID: 5697E3000 REL: 220
    ABSTRACT: WNN6 HUNG-UP BY CTRL + Y

    PROBLEM DESCRIPTION:
    Wnn6 hangs when Ctrl + Y is pressed.

    LOCAL FIX:
    Update xwnmo.

    ------

    APAR: IY31372 COMPID: 5765B9501 REL: 330
    ABSTRACT: FSCK DOES NOT FIX CORRUPTED ALLOC MAP CHAINS

    PROBLEM DESCRIPTION:
    mmfsck does not fix FSSTRUCT errors of type 114 (corrupted
    allocation maps).

    PROBLEM SUMMARY:
    Fixed mmfsck to repair FSSTRUCT errors of type 114
    (corrupted allocation maps)

    PROBLEM CONCLUSION:
    Fix relinkAllChunks which computed an incorrect allocation
    map magic number for a disk. Provide new functionality to
    verify allocation map chunk list head bitmap chain.
    Recognize chunk list head loops and unlinked chunks.

    ------

    APAR: IY31376 COMPID: 5765D5100 REL: 320
    ABSTRACT: DIAGS FAILING INVALIDLY

    PROBLEM DESCRIPTION:
    When cfgmgr runs diags against the SP-Switch2 adapter, the diags
    routine may fail or not complete. This leaves the status of the
    adapter in the "diag_fail" state. Later rc.switch will fail
    and the adapter will not join the switch.

    PROBLEM SUMMARY:
    There is a potential for adapter reset to be run
    concurrently during diagnostics.

    PROBLEM CONCLUSION:
    Added locking to device driver calls to adapter reset to
    prevent simultaneous resets during diagnostics.

    ------

    APAR: IY31379 COMPID: 5765B9500 REL: 130
    ABSTRACT: GPFS:6027-848 CONFIG MANAGER 35 FAILED UPDATING NEW NODE STATUS

    PROBLEM DESCRIPTION:
    gpfs:6027-848 config manager 35 failed updating new node status

    PROBLEM SUMMARY:
    fixed sysctl locking condition with
    mmconfig.

    PROBLEM CONCLUSION:
    in the sp environment, do not use the
    output of hostname as a lock identifier. If hostname on a
    node is set to be the same as the switch adapter name, locks
    cannot be reclaimed (sysctl cannot talk to the node).

    ------

    APAR: IY31380 COMPID: 5765B9501 REL: 330
    ABSTRACT: GPFS:6027-848 CONFIG MANAGER 35 FAILED UPDATING NEW NODE STATUS

    PROBLEM DESCRIPTION:
    gpfs:6027-848 config manager 35 failed updating new node status

    PROBLEM SUMMARY:
    Fixed sysctl locking condition with mmconfig.

    PROBLEM CONCLUSION:
    In the sp environment, do not use the output of hostname as
    a lock identifier. If hostname on a node is set to be the
    same as the switch adapter name, locks cannot be reclaimed
    (sysctl cannot talk to the node).

    ------

    APAR: IY31382 COMPID: 5765B9501 REL: 330
    ABSTRACT: FCNTL LOCKS NOT CLEANED UP ON MMFS DEATH

    PROBLEM DESCRIPTION:
    fcntl locks not cleaned up on mmfs death

    PROBLEM SUMMARY:
    Fixed GPFS recovery condition

    PROBLEM CONCLUSION:
    kxRecLockReset should process all filesystems even if they
    have been marked unmounted previously during shutdown.
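
    A minimal sketch of the rule stated above, assuming a simplified
    model: the FileSystem class and reset_record_locks function are
    hypothetical and do not reflect the actual kxRecLockReset code. The
    point is that the cleanup loop does not skip file systems already
    flagged as unmounted.

```python
from dataclasses import dataclass, field


@dataclass
class FileSystem:
    """Hypothetical per-file-system state seen by the recovery code."""
    name: str
    unmounted: bool = False
    record_locks: list = field(default_factory=list)


def reset_record_locks(filesystems):
    """Release record locks on every file system, mounted or not.

    Returns the number of locks released.
    """
    released = 0
    for fs in filesystems:
        # Deliberately no "if fs.unmounted: continue" here. Skipping file
        # systems marked unmounted during shutdown would leave their
        # locks behind.
        released += len(fs.record_locks)
        fs.record_locks.clear()
    return released
```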

    ------

    APAR: IY31386 COMPID: 5765B9501 REL: 320
    ABSTRACT: TSCTL MAXFCNTLRANGESPERFILE DOES NOT CHANGE VALUE

    PROBLEM DESCRIPTION:
    tsctl maxfcntlrangesperfile does not change value

    PROBLEM SUMMARY:
    Fixed tsctl maxFcntlRangesPerFile does not change value

    PROBLEM CONCLUSION:
    Fix incorrect setting of maxMBpS when tsctl
    maxFcntlRangesPerFile specified.

    ------

    APAR: IY31387 COMPID: 5765B9501 REL: 330
    ABSTRACT: TSCTL MAXFCNTLRANGESPERFILE DOES NOT CHANGE VALUE

    PROBLEM DESCRIPTION:
    tsctl maxfcntlrangesperfile does not change value

    PROBLEM SUMMARY:
    Fixed tsctl maxFcntlRangesPerFile does not change value

    PROBLEM CONCLUSION:
    Fix incorrect setting of maxMBpS when tsctl
    maxFcntlRangesPerFile specified.

    ------

    APAR: IY31388 COMPID: 5765B9501 REL: 330
    ABSTRACT: ASSERT !"NEW_DELETE_DEBUG", NEWDEBUG.C, LINE 176

    PROBLEM DESCRIPTION:
    assert !"new_delete_debug", newdebug.c, line 176

    PROBLEM SUMMARY:
    Fixed failure in mmrestripefs

    PROBLEM CONCLUSION:
    Realloc code needs to verify the configuration is correct
    before updating the disk effort counters.

    ------

    APAR: IY31572 COMPID: 5765B9501 REL: 320
    ABSTRACT: MMREPQUOTA SHOWS NEGATIVE USAGE AFTER MMRESTRIPEFS

    PROBLEM DESCRIPTION:
    mmrepquota shows negative usage after mmrestripefs

    PROBLEM SUMMARY:
    Fixed mmrestripefs causing mmrepquota to show incorrect
    usage.

    PROBLEM CONCLUSION:
    During restripe and defrag, when deallocating unused blocks
    do not decrement quota usage count if these blocks were not
    allocated with allocBlock.
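
    The accounting rule above can be sketched as follows. This is an
    illustration under stated assumptions, not GPFS code: QuotaAccount,
    Block, and the charged_to_quota flag are hypothetical names. A block
    that was never charged to the quota must not be credited back when it
    is freed, otherwise usage drifts below the true value and can go
    negative.

```python
from dataclasses import dataclass


@dataclass
class Block:
    size_kb: int
    charged_to_quota: bool  # True only if allocation charged the quota


class QuotaAccount:
    """Hypothetical per-user quota usage counter."""

    def __init__(self):
        self.usage_kb = 0

    def alloc_block(self, size_kb):
        # Normal allocation path: charge the quota and mark the block.
        self.usage_kb += size_kb
        return Block(size_kb, charged_to_quota=True)

    def dealloc_block(self, block):
        # Credit the quota back only for blocks that were charged.
        if block.charged_to_quota:
            self.usage_kb -= block.size_kb
```

    Freeing a block created outside alloc_block (for example a temporary
    block used while moving data) leaves usage_kb unchanged.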

    ------

    APAR: IY31576 COMPID: 5765B9501 REL: 330
    ABSTRACT: MMREPQUOTA SHOWS NEGATIVE USAGE AFTER MMRESTRIPEFS

    PROBLEM DESCRIPTION:
    mmrepquota shows negative usage after mmrestripefs

    PROBLEM SUMMARY:
    Fixed mmrestripefs causing mmrepquota to show incorrect
    usage.

    PROBLEM CONCLUSION:
    During restripe and defrag, when deallocating unused blocks
    do not decrement quota usage count if these blocks were not
    allocated with allocBlock.

    ------

    APAR: IY31579 COMPID: 5765B9501 REL: 330
    ABSTRACT: NODE PANICKED BY RUNNING GPFS_STAT()

    PROBLEM DESCRIPTION:
    node panicked by running gpfs_stat()

    PROBLEM SUMMARY:
    kpathname being traced after it is freed in kernel.

    PROBLEM CONCLUSION:
    Fixed trace path which could cause gpfs_stat() to panic
    node.

    ------

    APAR: IY31637 COMPID: 5724C3505 REL: 310
    ABSTRACT: UNEXPECTED BEHAVIOUR AFTER RETURNING FROM INVOKEAPPLICATION

    PROBLEM DESCRIPTION:
    DTBE applications behave erratically after a call to
    invokeApplication returns. The cause is that the internal
    representation of the call status may not be accurate. This can
    result in unexpected errors.

    PROBLEM SUMMARY:
    Unexpected behaviour of app after return
    from invoke application call

    PROBLEM CONCLUSION:
    Code modified to handle the call status correctly.

    ------

    APAR: IY31768 COMPID: 5765D5100 REL: 320
    ABSTRACT: SDRCHANGEATTRVALUES KFSERVER_TIMEOUT FAILS ON REJECT

    PROBLEM DESCRIPTION:
    sdrchangeattrvalues kfserver_timeout fails on reject

    PROBLEM SUMMARY:
    Enhancements were required to packaging files for ssp.basic
    for the setting of the kfserver_timeout attribute in the
    SP class.

    PROBLEM CONCLUSION:
    Enhancements were made to packaging files for ssp.basic
    for the setting of the kfserver_timeout attribute in the
    SP class.

    ------

    APAR: IY31802 COMPID: 5765B9501 REL: 330
    ABSTRACT: ASSERT AFTER METANODE RELINQUISH

    PROBLEM DESCRIPTION:
    assert after metanode relinquish

    PROBLEM SUMMARY:
    Fixed an Assert after metanode relinquish

    PROBLEM CONCLUSION:
    Test for turning off the newMnode flag was in the wrong
    place

    ------

    APAR: IY31993 COMPID: 5765B8100 REL: 220
    ABSTRACT: 3270 SESSIONS DO NOT ALWAYS RECOVER WHEN HOST GOES DOWN

    PROBLEM DESCRIPTION:
    Sometimes if the Host goes down then the 3270 Sessions do
    not always recover when the host comes back again.
    This is more likely to be seen if some of the sessions are
    on a host which stays up, but other sessions are on a host
    which goes down.

    PROBLEM SUMMARY:
    Sometimes if the Host goes down then the 3270
    Sessions do not always recover when the host comes back again.
    This is more likely to be seen if some of the sessions are on a
    host which stays up, but other sessions are on a host which
    goes down.

    PROBLEM CONCLUSION:
    If scripts were running when DT was shut down,
    both CTRL3270 and EXEC3270 tried to deactivate sessions
    using E32DACT and E32DACTA. This caused havoc with the TPS
    library and could result in some of the sessions being broken
    in SNA when DT was restarted. The fix was to streamline the
    shutdown so that only E32DACT is used and only once.
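
    A "deactivate through one path, and only once" shutdown can be
    sketched as follows. The SessionManager class and its methods are
    hypothetical and do not model the real E32DACT interface. They only
    show a guard flag that makes a second shutdown request a no-op.

```python
class SessionManager:
    """Hypothetical owner of a set of 3270 sessions."""

    def __init__(self, sessions):
        self.active = set(sessions)
        self.deactivate_calls = []   # record of every deactivation issued
        self._shutdown_done = False  # guard against repeated shutdowns

    def _deactivate(self, session):
        # The single deactivation path used by every caller.
        self.deactivate_calls.append(session)
        self.active.discard(session)

    def shutdown(self):
        # A second caller finds the flag set and does nothing.
        if self._shutdown_done:
            return
        self._shutdown_done = True
        for session in sorted(self.active):
            self._deactivate(session)
```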

    ------

    APAR: IY32157 COMPID: 5765D5100 REL: 320
    ABSTRACT: LATEST PSSP 3.2.0 FIXES AS OF JUNE 2002

    PROBLEM DESCRIPTION:
    This is the latest PSSP ptf as of June 2002.
    Order this apar to get all of the ptfs as of June 2002.

    PROBLEM SUMMARY:
    This is a packaging apar for PSSP 3.2.0 fixes
    as of June 2002

    PROBLEM CONCLUSION:
    This is a packaging apar for PSSP 3.2.0
    fixes as of June 2002

    ------