From: AIX Service Mail Server (aixservaustin.ibm.com)
Date: Tue Jan 23 2001 - 02:15:01 CST

    APAR: IY12053 COMPID: 5765D5100 REL: 311
    ABSTRACT: NODECOND TIMEOUT WAITING FOR OK PROMPT

    PROBLEM DESCRIPTION:
    When netbooting a node the nodecond_chrp script can fail with
    a "timeout waiting for ok prompt" message in the nodecond log
    in the /var/adm/SPlogs/spmon/nc directory. It fails to enter the
    "8" at the right time after seeing the line
      memory keyboard network scsi
    This has been seen so far on some of the winterhawk2 nodes.

    LOCAL FIX:
    Do a manual nodecondition to netboot the node until the
    nodecond_chrp script has been changed to address this timing
    problem.

    PROBLEM SUMMARY:
    Network boot may sometimes fail on a Winterhawk-2 node with
    the message, 'timeout waiting for "Welcome to AIX"'.

    PROBLEM CONCLUSION:
    The node conditioning script has been changed to require
    less interaction with the service processor menu. Instead
    of waiting a fixed number of seconds before sending an "8"
    prompt, it will select a firmware option that allows it to
    wait for the "ok" prompt upon booting.

    ------

    APAR: IY12819 COMPID: 5765D5101 REL: 111
    ABSTRACT: HAGSD MEMORY LEAK - ASSERT SUBROUTINE FAILED IN

    PROBLEM DESCRIPTION:
    Customer's hags daemon goes down intermittently, with the
    following error in the hags log:
    The assert subroutine failed: message !=0, file
    ../../../../../src/rsct/pgs/pgsd/pbs/PBContainer.C, line 63.
    Also, the core dump that is created is well over 100 MB. Together,
    these two things suggest that hags is encountering a memory leak
    on assert. Two defects in 3.2, 64941 and 66277, should fix this
    problem and will be retrofitted to this release.

    PROBLEM SUMMARY:
    Memory leaks in hags caused it to core dump.

    PROBLEM CONCLUSION:
    Several memory leaks have been plugged.

    ------

    APAR: IY12873 COMPID: 5765D5100 REL: 311
    ABSTRACT: VSDNOISE_DEBUG DEFAULT SETTING IMPACTS VSD PERFORMANCE

    PROBLEM DESCRIPTION:
    Currently when vsd is running on a system the vsdnoise_debug
    default setting is set to trace level VSDTFLOW which is
    impacting vsd performance during read/writes. The default trace
    level needs to be reduced so that it does not significantly
    impact vsd performance.

    PROBLEM SUMMARY:
    The default values for vsdnoise_debug in PSSP 3.1.1 include
    VSDTFLOW. That causes an entry to be made in lookvsd_debug
    whenever the vsdd driver changes its flow of control. This
    can impact VSD performance.

    PROBLEM CONCLUSION:
    The vsdd driver has been changed so that VSDTFLOW is no
    longer set by default. Instead, a new option, VSDTSHIP, has
    been added to cause a trace entry to be made only when a
    major operation is undertaken by vsdd (for example, sending
    or receiving a packet.) This replaces VSDTFLOW as a tracing
    default.

    ------

    APAR: IY13317 COMPID: 5765D6100 REL: 210
    ABSTRACT: LLSTATUS INFO MISSING FROM API DATA

    PROBLEM DESCRIPTION:
    LLSTATUS INFO MISSING FROM API DATA

    PROBLEM SUMMARY:
    The LoadLeveler API does not have information for
    the Drain and Draining classes.

    PROBLEM CONCLUSION:
    The LoadLeveler API ll_get_data now has two new
    specifications, LL_MachineDrainingClassList and
    LL_MachineDrainClassList, which return the draining and
    drain class lists on the machine.

    ------

    APAR: IY13614 COMPID: 5765D5100 REL: 311
    ABSTRACT: PMANDEF FAILS WTH HACMP WHEN INITIAL_HOSTNAME NOT IN SDR ADAPTER

    PROBLEM DESCRIPTION:
    The failure occurs when the initial_hostname in the SDR for a
    node resolves to an IP address that is not listed for one of the
    adapters in the SDR Adapter class. This can be seen around line
    1204 in function get_node_list():
     push(node_num_list, $adapter_list{&host_to_ipaddr($node)});
    host_to_ipaddr($node) returns the IP address of the initial
    hostname of the node, but there is no entry for that address in
    adapter_list, so the node gets left out. This causes what amounts
    to a syntax error in the node range, which causes the failure.
    The initial_hostname gets into an associative array (node_list) of
    node numbers to hostnames via build_node_list. This array is used
    deep in the target processing path, which ultimately turns the
    target list into a list of hostnames from the SDR
    initial_hostname field.

    PROBLEM SUMMARY:
    If the initial_hostname of a target node of a Problem
    Management subscription does not correspond to an
    Adapter in the SDR, pmandef will fail with an Invalid
    node range message.

    PROBLEM CONCLUSION:
    Obsolete processing in pmandef resulted in an error when
    the initial_hostname of a target node of a Problem
    Management subscription did not correspond to an
    Adapter in the SDR. This processing was removed.
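
    The lookup failure described for get_node_list() can be sketched in
    a few lines. This is only an illustration written in Python, not the
    pmandef Perl code; the host names, addresses and node numbers are
    hypothetical.

```python
# Hypothetical data: adapter_list maps adapter IP address -> node number.
adapter_list = {"192.0.2.1": 1, "192.0.2.3": 3}

# Hypothetical name resolution: node2's initial_hostname resolves to an
# address that has no entry in the Adapter class.
host_to_ipaddr = {
    "node1": "192.0.2.1",
    "node2": "198.51.100.2",
    "node3": "192.0.2.3",
}


def get_node_list(nodes):
    """Look up each node's number by the IP address of its hostname."""
    node_num_list = []
    for node in nodes:
        # dict.get returns None for a missing key, much as a Perl hash
        # lookup yields undef.
        node_num_list.append(adapter_list.get(host_to_ipaddr[node]))
    return node_num_list


node_nums = get_node_list(["node1", "node2", "node3"])
node_range = ",".join("" if n is None else str(n) for n in node_nums)
print(node_nums)    # [1, None, 3]
print(node_range)   # 1,,3
```

    The empty slot in the joined string stands in for the malformed
    node range that the description mentions.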

    ------

    APAR: IY13791 COMPID: 5765D5101 REL: 111
    ABSTRACT: TOPSVCS REC'S 2523-055 NODE DUPLICATION ERROR BECAUSE 2 OFFSETS

    PROBLEM DESCRIPTION:
    Customer has a 5-node hacmp cluster with ttys across all 5. Each
    node is connected to 2 neighbors. When machines.lst is
    generated, 2 offsets are created instead of 3. This results in
    topsvcs not starting and failing with a 2523-055 node
    duplication error.
    This is a duplicate of defect 64717 in r120. This fix needs to
    be retrofitted back to r111.

    PROBLEM SUMMARY:
    In rare situations, the ordering of non-IP networks might
    cause a conflict in the Topology Services configuration file,
    which in turn causes the Topology Services daemon to exit.

    PROBLEM CONCLUSION:
    A new algorithm has been developed for ordering non-IP
    networks in the Topology Services configuration file. This
    algorithm avoids conflicts by assigning networks to the
    first offset that fits.
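
    A first-fit assignment of this kind might look like the following
    Python sketch. It is not the Topology Services code. It assumes that
    a non-IP network "fits" at an offset when neither of its two nodes
    already has a network at that offset; that rule and the function
    name assign_offsets are assumptions made for illustration.

```python
def assign_offsets(networks):
    """Assign each two-node network to the first offset that fits.

    Assumption: a network fits at an offset when neither endpoint node
    already has a network at that offset.
    """
    used = {}      # node -> set of offsets already taken on that node
    result = {}    # network -> chosen offset
    for a, b in networks:
        offset = 0
        while offset in used.get(a, set()) or offset in used.get(b, set()):
            offset += 1
        used.setdefault(a, set()).add(offset)
        used.setdefault(b, set()).add(offset)
        result[(a, b)] = offset
    return result


# Five nodes, each connected by tty to its two neighbors (a ring).
ring = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)]
offsets = assign_offsets(ring)
print(offsets)
print("offsets needed:", len(set(offsets.values())))   # offsets needed: 3
```

    Under this assumption a ring of five nodes needs three offsets,
    which matches the count given in the problem description.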

    ------

    APAR: IY13923 COMPID: 5765D5100 REL: 311
    ABSTRACT: UPDATEVSDTAB SCRIPT DOES NOT TAKE INTO ACCOUNT MIRRORING WHEN

    PROBLEM DESCRIPTION:
    The /usr/lpp/csd/bin/updatevsdtab script updates the size of the
    vsd incorrectly in the SDR if there is mirroring.

    PROBLEM SUMMARY:
    The problem arose when the customer needed to update the
    size of the vsd (size_in_MB) in the VSD_Table class
    because the logical volume had been extended.
    There is a script called updatevsdtab in /usr/lpp/csd/bin
    which will, using sysctl, go out to the node and determine
    (via "lslv <logical-volume>") information about the
    logical volume and will update the "size_in_MB" object
    in the VSD_Table class. The "work" is done in the
    /usr/lpp/csd/sysctl/updatevsdtab2.perl script.
    When the customer did this, the command worked, but
    the size of the vsd was updated to double the actual
    size. Investigation found that the logical volume was
    mirrored and that the updatevsdtab2.perl script did not
    check how many copies there were or take that into
    account when calculating the total size of the vsd.

    PROBLEM CONCLUSION:
    The solution to this problem is to extract from the
    LVS array one additional bit of information (COPIES:)
    and modify the calculation of the current size of the
    vsd ($curr_size) to always take into account the number
    of copies ($curr_size = ($pps / $copies) * $ppsize, where
    $pps=number of partitions, $copies is the number of copies
    that are active, and $ppsize, the size of each partition).
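
    The corrected calculation can be checked with a small worked
    example. The Python function below is a sketch of the formula only,
    not the updatevsdtab2.perl code, and the partition counts and sizes
    are made-up values.

```python
def vsd_size_mb(pps, copies, ppsize_mb):
    """Size of the vsd in MB: (pps / copies) * ppsize."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return (pps / copies) * ppsize_mb


# Hypothetical mirrored logical volume: 200 partitions in total,
# 2 copies, 16 MB per partition.
old_size = 200 * 16                  # copies ignored
new_size = vsd_size_mb(200, 2, 16)   # copies taken into account
print(old_size)   # 3200
print(new_size)   # 1600.0
```

    Ignoring the copy count gives exactly double the real size for a
    two-copy mirror, which is the symptom the customer saw.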

    ------

    APAR: IY14043 COMPID: 5765D6100 REL: 210
    ABSTRACT: REQUESTING TOO MANY US WINDOWS CRASHES THE NEGOTIATOR

    PROBLEM DESCRIPTION:
    The customer uses an external scheduler and the negotiator
    began crashing on a regular basis (about once per week). The
    problem was traced back to several jobs that were requesting 8
    and 16 tasks per node in US, but this machine only has 4 US
    windows per node.

    LOCAL FIX:
    Maui provides a filtering mechanism which can be programmed to
    prevent it from trying to start jobs requesting too many
    windows. I do not know if any other external schedulers have a
    similar feature.

    PROBLEM SUMMARY:
    The negotiator crashes when an external scheduler passes
    in a job that uses more US windows than are currently
    available on a given node.

    PROBLEM CONCLUSION:
    The negotiator cannot depend on the scheduler to verify
    that sufficient US windows are available to run a job.
    Therefore, code will be added to reject a job that tries
    to load the max + 1st US window. The job will be left
    in Idle state and llq -s already knows why the job should
    not have been started.

    TEMPORARY FIX:
    Some external schedulers have filters or other programmable
    interfaces that can be used to prevent bad jobs from being
    started.

    ------

    APAR: IY14100 COMPID: 5765D5100 REL: 311
    ABSTRACT: SDRCREATEFILE SHOWS 0025-062 "SDR FILENAME NOT FOUND" USING

    PROBLEM DESCRIPTION:
    If an SDR file contains the character string "/../", you get
    the error message:
     SDRCreateFile: 0025-062 SDR filename not found.
    when using, for example, the SDRCreateFile <filename> <sdr class>
    command. The problem is probably in sdrd_file.c:
    if (strstr(buf, "/../") != NULL) return(62); /* .. not allowed*/

    PROBLEM SUMMARY:
    SDRCreateFile and SDRCreateSystemFile fail with the
    message "0025-062 SDR filename not found." if the file
    they try to create in the SDR contains the string /../
    as part of its contents.

    PROBLEM CONCLUSION:
    Modified the SDR commands so that files whose contents
    include the string /../ can be added to the SDR.
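
    One way to picture the change is a check that looks only at the
    file name and never at the file contents. The Python sketch below
    is not the sdrd code. That the name is still checked after the fix
    is an assumption, and validate_sdr_file and its arguments are
    hypothetical names.

```python
SDR_FILENAME_NOT_FOUND = 62   # corresponds to message 0025-062


def validate_sdr_file(name, contents):
    """Return 0 if the file may be created, 62 if the name is rejected.

    Only the name is inspected for "/../"; the contents are
    deliberately not examined (assumed behavior after the fix).
    """
    if "/../" in name:
        return SDR_FILENAME_NOT_FOUND
    return 0


ok = validate_sdr_file("hosts.cfg", b"include /usr/lpp/../etc/extra\n")
bad = validate_sdr_file("cfg/../hosts.cfg", b"")
print(ok, bad)   # 0 62
```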

    ------

    APAR: IY14157 COMPID: 5765D6100 REL: 210
    ABSTRACT: HARD WALLCLOCK LIMIT NOT ENFORCED WHEN SOFT LIMIT IS TRAPPED FOR

    PROBLEM DESCRIPTION:
    When a loadleveler job traps a soft limit signal the job should
    continue until it hits the hard wallclock limit or completes.
    The problem is that the hard wallclock limit is not being
    enforced once the soft limit is trapped.

    PROBLEM SUMMARY:
    When a LoadLeveler job traps a soft limit signal the job
    should continue until it hits the hard wallclock limit or
    completes. The problem is that the hard wallclock limit is
    not being enforced once the soft limit is trapped. When
    the soft limit is reached the job is marked removed. Then
    when the hard limit is reached it checks to see if the job
    is marked removed, if so it does not send the kill signal.

    PROBLEM CONCLUSION:
    LoadL_starter has been changed so that the job does not get
    marked removed when the soft limit is reached. Therefore
    the job will get the kill signal when the hard limit is
    reached after the soft limit was trapped.
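
    The interaction between the two limits can be shown with a toy
    model. The Python sketch below is not LoadL_starter code. It
    assumes one-second ticks and assumes that an untrapped soft-limit
    signal ends the job; run_job and its parameters are hypothetical
    names.

```python
def run_job(soft, hard, runtime, traps_soft, mark_removed_on_soft):
    """Return the second at which the job ends, under a toy model."""
    removed = False
    for t in range(1, runtime + 1):
        if t == soft:
            if mark_removed_on_soft:
                removed = True       # behavior before the fix
            if not traps_soft:
                return t             # untrapped soft-limit signal ends the job
        if t == hard and not removed:
            return t                 # kill signal sent at the hard limit
    return runtime                   # job completed on its own


before_fix = run_job(soft=10, hard=20, runtime=100,
                     traps_soft=True, mark_removed_on_soft=True)
after_fix = run_job(soft=10, hard=20, runtime=100,
                    traps_soft=True, mark_removed_on_soft=False)
print(before_fix, after_fix)   # 100 20
```

    In this model the job that traps the soft limit runs past the hard
    limit only when it is marked removed at the soft limit.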

    ------

    APAR: IY14205 COMPID: 5765D5100 REL: 311
    ABSTRACT: SYSCTLD HANGS AFTER LARGE BATCH JOBS

    PROBLEM DESCRIPTION:
    Sysctld hangs after large batch jobs. Apparently, the reason is
    a bug in the svc_reaper function, which gets into an infinite loop
    with only a few ways to break out of it. Within the svc_reaper
    function, == is used where a single = should be used. This
    prevents a variable from being changed, which can leave the
    function in an infinite for loop.
    After changing this, no more problems were experienced.

    PROBLEM SUMMARY:
    sysctld hangs after large batch jobs. The processing that
    attempts to clean up resources of defunct child processes
    ends up in an infinite loop, which hangs the daemon.

    PROBLEM CONCLUSION:
    Modified the section of code in the sysctl daemon that
    cleans up resources of defunct child processes, to no
    longer end up in an infinite loop when the child
    processes are not cleaned up properly.
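
    The effect of writing == where = was intended can be reproduced in
    a few lines. The Python sketch below is not the svc_reaper code; the
    loop, the variable names and the step cap that stands in for a hang
    are all hypothetical.

```python
def reap(children, buggy, max_steps=1000):
    """Walk the child list; return the step count, or None on a 'hang'."""
    i = 0
    steps = 0
    while i < len(children):
        steps += 1
        if steps > max_steps:
            return None       # stand-in for the daemon hanging
        if buggy:
            i == i + 1        # comparison: result discarded, i unchanged
        else:
            i = i + 1         # assignment: the loop makes progress
    return steps


fixed = reap(["pid1", "pid2", "pid3"], buggy=False)
hung = reap(["pid1", "pid2", "pid3"], buggy=True)
print(fixed, hung)   # 3 None
```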

    ------

    APAR: IY14272 COMPID: 5765D5100 REL: 311
    ABSTRACT: LLSTATUS INFO MISSING FROM API DATA

    PROBLEM DESCRIPTION:
    LLSTATUS INFO MISSING FROM API DATA

    PROBLEM SUMMARY:
    The LoadLeveler API does not have information for
    the Drain and Draining classes.

    PROBLEM CONCLUSION:
    The LoadLeveler API ll_get_data now has two new
    specifications, LL_MachineDrainingClassList and
    LL_MachineDrainClassList, which return the draining and
    drain class lists on the machine.

    ------

    APAR: IY14339 COMPID: 5765D5101 REL: 111
    ABSTRACT: HAGS NEEDS TO HAVE MORE DESCRIPTIVE ERRORS WHEN IT EXITS.

    PROBLEM DESCRIPTION:
    Hags needs to be more descriptive when it exits.
    1- Guard against a possible core dump if currDirectory is NULL.
    2- Write 'program name' in the place of sockFd.
    3- Try to change the format of the log output from multiple
    lines to a single line.

    PROBLEM SUMMARY:
    Group Services currently writes a log message with
    the internal token number whenever a client process
    dies (or stops). Unfortunately, there is no easy way
    to know which provider (or process) died just by
    reading the number. Adding the program name to the
    log message should help with this problem.

    PROBLEM CONCLUSION:
    With this fix, it should be easier to identify
    which processes (or providers) are failing.

    ------

    APAR: IY14385 COMPID: 5765D5101 REL: 111
    ABSTRACT: GSAPI CLIENT GENS CORE AT HA_GS_DISPATCH

    PROBLEM DESCRIPTION:
    gsapi client gens core at ha_gs_dispatch

    PROBLEM SUMMARY:
    GSAPI ha_gs_dispatch() causes a core dump because
    of invalid access to uninitialized internal memory
    that was allocated by GSAPI, especially in relation
    to the ha_gs_change_attribute() function call.
    Although this symptom may not always be visible
    externally, it may cause memory handling to
    misbehave.

    PROBLEM CONCLUSION:
    With the memory initialization problem fixed,
    GSAPI should no longer core dump.

    ------

    APAR: IY14403 COMPID: 5765D5100 REL: 311
    ABSTRACT: CHGCSS DOES NOT ACCEPT MULTIPLE ATTRIBUTES FROM CHDEV

    PROBLEM DESCRIPTION:
    chdev -l css0 -a rpoolsize=xxxxxxx -a spoolsize=xxxxxx passes
    -l css0 -a rpoolsize=xxxxxxx spoolsize=xxxxxx to chgcss,
    but chgcss is unable to see this as multiple parameters,
    whereas all other change methods do.

    PROBLEM SUMMARY:
    The chgcss command does not handle changing both rpoolsize
    and spoolsize, at the same time, when chdev is the command
    being used to specify the changes.

    PROBLEM CONCLUSION:
    The chgcss command was changed to check for multiple
    attributes in a single, quoted, -a argument, which is the
    way that chdev passes change requests to the various chgxxx
    methods.
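
    Handling several attributes inside one -a argument could be
    sketched as follows. This is Python for illustration only, not the
    chgcss source; parse_attrs is a hypothetical name and the pool
    sizes are made-up values standing in for the x's shown earlier.

```python
def parse_attrs(argv):
    """Collect attr=value pairs from every -a argument.

    A single -a argument may hold several space-separated pairs.
    """
    attrs = {}
    i = 0
    while i < len(argv):
        if argv[i] == "-a" and i + 1 < len(argv):
            for pair in argv[i + 1].split():
                name, sep, value = pair.partition("=")
                if not sep:
                    raise ValueError(f"expected attr=value, got {pair!r}")
                attrs[name] = value
            i += 2
        else:
            i += 1
    return attrs


# One quoted -a argument carrying both attributes.
args = ["-l", "css0", "-a", "rpoolsize=1048576 spoolsize=2097152"]
parsed = parse_attrs(args)
print(parsed)   # {'rpoolsize': '1048576', 'spoolsize': '2097152'}
```

    The same function also accepts one attribute per -a argument, so
    both calling styles produce the same result.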

    ------

    APAR: IY14431 COMPID: 5765D5101 REL: 111
    ABSTRACT: HAEMD DIES (CALLS ABORT) IF NIS+ IS USED

    PROBLEM DESCRIPTION:
    Refer to defect 47729.
    haemd dies (calls abort() to create a core)
    if NIS+ is used.
    errpt -a shows:
    LABEL: HA002_ER
    IDENTIFIER: 12081DC6
    Resource Name: haemd
    Detail Data
    DETECTING MODULE
    LPP=PSSP,Fn=emd_rvo.c,SID=1.24,L#=749,
    DIAGNOSTIC EXPLANATION
    haemd(ach03): 2521-006 System call "shmat" failed with
    error 22 - A system call received a parameter that is
    not valid..
    LABEL: CORE_DUMP
    SIGNAL NUMBER
               6
    PROGRAM NAME
    haemd
    dbx stacktrace:
    raise.raise(??) at 0xd017ad28
    abort() at 0xd0174450
    emd_exit(??) at 0x10001138
    obsv_vars(??, ??) at 0x10012304
    rvo_immediate() at 0x10012748
    ctrl_loop() at 0x10000814
    main(??, ??) at 0x10000ffc

    PROBLEM SUMMARY:
    On a NIS system, once pman was started, haemd would
    abort due to a shared memory segment problem.

    PROBLEM CONCLUSION:
    The haemd daemon has been assigned a higher memory
    segment which will enable the daemon to stay up
    once pman is started on a NIS system.

    ------

    APAR: IY14440 COMPID: 5765D5101 REL: 111
    ABSTRACT: ORACLE EXPECTS A NETWORK RESPONSE FROM ALL NODES IN A CLUSTER

    PROBLEM DESCRIPTION:
    Oracle calls ha_em_receive_response(), which returns information
    only for the local node. A network response is expected from all
    the nodes in the cluster.

    PROBLEM SUMMARY:
    After one or more nodes in an HACMP/ES cluster go down, a
    query request entered through the Event Management API
    may take up to two minutes to complete if the request
    is directed to one or more of the nodes that are down.

    PROBLEM CONCLUSION:
    After this fix is applied, query commands that target
    other nodes should complete with no appreciable delay.

    ------

    APAR: IY14470 COMPID: 5765D5100 REL: 311
    ABSTRACT: VSD HANGS DUE TO BAD REQUEST COUNT

    PROBLEM DESCRIPTION:
    vsd hangs due to bad request count

    PROBLEM SUMMARY:
    VSD has the potential to hang during "suspendvsd" if the
    internal counter of requests targeted to a specific vsd
    never decrements to zero. This code area has been
    problematic in the past.

    PROBLEM CONCLUSION:
    The methodology to maintain the vsd request counter was
    improved in VSD 3.2. This code is being backfitted to VSD 3.1.1.

    ------

    APAR: IY14520 COMPID: 5765B9501 REL: 310
    ABSTRACT: MMRPLDISK FAILS WHEN DISK DESCRIPTOR IS NOT FOR A VSD DISK.

    PROBLEM DESCRIPTION:
    The command fails when the operand disk descriptor contains a
    disk that is not a vsd disk. The mmrpldisk script sets the
    parameter vsdsiz only when the vsd exists. The relevant line
    number in mmrpldisk is 469.
    The customer issued the command and it failed with the error:
    "fslow+ :0403-053 Expression is not complete;
    more tokens expected".

    LOCAL FIX:
    This line needs to be added after line 469:
    vsdsiz=$(cat $mkvsdrtn | awk -F, '{print $2}')

    PROBLEM SUMMARY:
    Incorrect error given on mmrpldisk

    PROBLEM CONCLUSION:
    Correct handling of disk size for mmrpldisk

    ------

    APAR: IY14743 COMPID: 5765D5100 REL: 311
    ABSTRACT: NODECOND.CHRP FAILING

    PROBLEM DESCRIPTION:
    nodecond.chrp failing

    PROBLEM SUMMARY:
    The code in APAR IY12053 to address timing problems in the
    conditioning of model 9076-270 nodes must be adjusted to
    accommodate the 9076-260 as well. Otherwise, the nodecond
    program will go into an infinite loop.

    PROBLEM CONCLUSION:
    The nodecond_chrp script has been changed to:
    1. allow for nodes which do not have the "Boot to Open
     Firmware" option in their Pre-installation Menu. In these
     cases, the timings that are used to maintain the
     installation dialog will be estimated.
    2. wait 90 seconds before timing out on failure to start the
     firmware setup menus. It had been two minutes.
    3. correctly handle the Language Selector Menu. It had
     previously caused the process to loop to a timeout.

    ------

    APAR: IY14804 COMPID: 5765D5100 REL: 311
    ABSTRACT: SSP.BASIC 2.4.0.18 FAILS TO APPLY WITH SYSCK: 3001-038

    PROBLEM DESCRIPTION:
    When applying ssp.basic 2.4.0.18 on an AIX 433 system, the
    following errors may be seen:
    sysck: 3001-038 The name imnadm is not a known group for
     entry /usr/lpp/ssp/bin/cshutdown.
    sysck: 3001-003 A value must be specified for group for
     entry /usr/lpp/ssp/bin/cshutdown.
    sysck: 3001-038 The name imnadm is not a known group for
     entry /usr/lpp/ssp/bin/cstartup.
    sysck: 3001-003 A value must be specified for group for
     entry /usr/lpp/ssp/bin/cstartup.
    sysck: 3001-017 Errors were detected validating the files
            for package ssp.basic.

    LOCAL FIX:
    Create the group imnadm as group 200 and then ssp.basic will
    install. Then, do a chgrp shutdown on
    /usr/lpp/ssp/bin/cshutdown and cstartup.

    PROBLEM SUMMARY:
    Installation of ssp.basic on an AIX 4.3.3 system will
    fail, if imnadm is not defined as a group in /etc/group.
    The install will fail with message 3001-038 from sysck:
    The name imnadm is not a known group.

    PROBLEM CONCLUSION:
    Corrected the packaging of ssp.basic, so that there is
    no dependency on the imnadm group existing in /etc/group.

    ------

    APAR: IY14846 COMPID: 5765D5100 REL: 311
    ABSTRACT: DOC: SA22-7351-01 NEEDS TO ADD THE SUPPLEMENTARY RESTRICTION OF

    PROBLEM DESCRIPTION:
    There is insufficient explanation of the psyslclr command in
    the following books in the electronic library:
     Title: Command and Technical Reference, Volume 1
     Document Number: SA22-7351-01 and SA22-7351-02
    The psyslclr still stops and starts syslogd during trimming.
    Found the "Note:" in the PSSP Commands Vol 1 for PSSP3.1.0
    ( SA22-7351-00 ) under the psyslclr command near the very
    bottom of the command description:
     Note: The syslogd daemon does not log the year in record
    timestamps. The comparisons for start and end times are done on a
    per record basis and could cause unexpected results if the log
    file is allowed to span more than one year. The syslogd daemon
    is stopped during this process so trimming activity should be
    planned accordingly. It is then restarted using the default or
    alternate syslog configuration file.

    PROBLEM SUMMARY:
    PSSP for AIX Command and Technical Reference, Volume 1
    Chapter 1 - psyslclr command
    Information was missing for the psyslclr command, indicating
    that syslogd is stopped and restarted during the log
    trimming process.

    PROBLEM CONCLUSION:
    PSSP for AIX Command and Technical Reference, Volume 1
    Chapter 1 - psyslclr command
    At the end of the Description section for psyslclr, the
    following note will be added:
    Note: The syslogd daemon is stopped during this process so
          trimming activity should be planned accordingly. It
          is then restarted using the default or alternate
          syslog configuration file.

    ------

    APAR: IY14924 COMPID: 5765D5100 REL: 311
    ABSTRACT: NODECOND TIMEOUT WAITING FOR DIAG CONSOLE

    PROBLEM DESCRIPTION:
    nodecond timeout waiting for diag console

    PROBLEM SUMMARY:
    When attempting to network boot a node in "diag" mode, the
    node conditioning program can abort with a timeout
    condition. It is waiting for the console message "please
    define the system console."
    The problem only occurs on node types that support the chrp
    interface, and have a great many I/O adapters and disk
    units.

    PROBLEM CONCLUSION:
    The timeout value in the nodecond_chrp process had been
    hard-coded at 360 seconds for any node. This has been
    changed to reflect the actual node type.
    An SDR object named NC_timeout, which varies from node to
    node, is already defined. The wait for the Diagnostics
    Menu to appear will be a function of NC_timeout.

    ------

    APAR: IY15017 COMPID: 5765C3403 REL: 430
    ABSTRACT: LINUX: LIBICE NOT BEHAVING CORRECTLY

    PROBLEM DESCRIPTION:
    The libICE.a library may not behave as expected.

    PROBLEM CONCLUSION:
    Changed to enable the BSD44SOCKETS compatibility flag when
    building libICE.

    ------

    APAR: IY15850 COMPID: 5765D5100 REL: 311
    ABSTRACT: GSAPI CLIENT GENS CORE AT HA_GS_DISPATCH

    PROBLEM DESCRIPTION:
    gsapi client gens core at ha_gs_dispatch

    PROBLEM SUMMARY:
    GSAPI ha_gs_dispatch() causes a core dump because
    of invalid access to uninitialized internal memory
    that was allocated by GSAPI, especially in relation
    to the ha_gs_change_attribute() function call.
    Although this symptom may not always be visible
    externally, it may cause memory handling to
    misbehave.

    PROBLEM CONCLUSION:
    With the memory initialization problem fixed,
    GSAPI should no longer core dump.

    ------

    APAR: IY15876 COMPID: 5765D5100 REL: 311
    ABSTRACT: LATEST PSSP 3.1.1 FIXES AS OF JANUARY 2001

    PROBLEM DESCRIPTION:
    This is the latest PSSP ptf as of January 2001.
    Order this apar to get all of the ptfs as of January 2001.

    PROBLEM SUMMARY:
    This is a packaging apar for PSSP 3.1.1 fixes
    as of January 2001.

    PROBLEM CONCLUSION:
    This is a packaging apar for PSSP 3.1.1
    fixes as of January 2001.

    ------