2002-04-04 Dave Wallace
- Fixed a bug in gmond which caused big-endian architectures to incorrectly store multicast data and therefore misreport on the XML port. Solaris and Alpha/Linux users should be happy to hear that news!
- Added a key value check of the XDR multicast data in ./gmond/listen.c
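To illustrate the kind of byte-order problem described above, here is a minimal Python sketch. It is not gmond's code: the key value and the valid-key range are made up for the example. It only relies on the fact that XDR encodes integers as 4 bytes in big-endian order, and shows how a simple range check on the decoded key can catch data that was read with the wrong byte order.

```python
import struct

# Hypothetical metric key, used only for illustration.
key = 7

# XDR encodes an unsigned integer as 4 bytes in big-endian (network) order.
wire = struct.pack(">I", key)

# Decoding with an explicit byte order gives the same value on every host.
decoded_portably = struct.unpack(">I", wire)[0]

# Decoding the same bytes as little-endian shows what a byte-order mix-up
# looks like: the value is wildly out of range.
decoded_wrongly = struct.unpack("<I", wire)[0]


def key_is_valid(value, num_known_keys=32):
    """Reject keys outside the expected range (the range is an assumption)."""
    return 0 <= value < num_known_keys


print(decoded_portably, key_is_valid(decoded_portably))
print(decoded_wrongly, key_is_valid(decoded_wrongly))
```

A check like `key_is_valid()` does not fix byte-order handling, but it stops a corrupted key from being used as an index.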
2002-04-04 Matt Massie
- Updated ./lib/gexec_func.c gexec_cluster() function to handle hosts with no domainname correctly as pointed out by David Wallace
- Updated the program command-line options to obey the GNU Coding Standards
- Added a --debug_level parameter to gmond to allow debugging the daemon without recompiling it
- Updated the --trusted_host option to allow multiple instances of the option
- Added a --mcast_ttl gmond option to allow you to modify the Time-To-Live (TTL) of the outgoing multicast messages
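As a rough sketch of how the three options above fit together, the Python below parses a repeatable trusted_host option and applies a multicast TTL to a UDP socket. gmond itself is written in C, so this is only an illustration: the program name, the default values, and the helper names `parse_args` and `make_multicast_sender` are assumptions, not part of ganglia.

```python
import argparse
import socket
import struct


def parse_args(argv):
    """Parse options named after the gmond options listed above."""
    parser = argparse.ArgumentParser(prog="gmond-sketch")  # hypothetical name
    parser.add_argument("--debug_level", type=int, default=0)
    parser.add_argument("--mcast_ttl", type=int, default=1)
    # action="append" lets the option be given several times.
    parser.add_argument("--trusted_host", action="append", default=[])
    return parser.parse_args(argv)


def make_multicast_sender(ttl):
    """Create a UDP socket whose outgoing multicast packets use the given TTL."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # A single unsigned byte is accepted for this option on common platforms.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL,
                    struct.pack("B", ttl))
    return sock
```

For example, `parse_args(["--trusted_host", "a", "--trusted_host", "b"])` yields a two-element list, and the parsed `mcast_ttl` can be passed straight to `make_multicast_sender()`.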
2002-04-02 Neil Spring
- Updated the gmond Makefile to clean up the machine.c link on “make clean”
2002-04-02 Doc Schneider
- Updated the way the number of CPUs is collected in order to work around a bug on AMD-based systems.
2002-03-26 Matt Massie
- Removed the use of streams (via fdopen) on the XML socket and replaced it with write()s on the socket descriptor to work around a Linux bug under high-stress conditions
2002-03-25 Matt Massie
- Added thread barriers to gmond initialization to ensure listening threads exist before the threads which multicast are created
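The barrier idea above can be sketched in Python as follows. This is not gmond's implementation (gmond is C and uses its own thread code): the thread counts and function names are assumptions, and the threads only record that they ran. The point is that the main thread waits on the same barrier as the listeners, so the multicasting thread cannot be created before every listener exists.

```python
import threading

NUM_LISTENERS = 2  # assumed number of listening threads


def start():
    """Start listeners, wait for all of them, then start the multicaster."""
    events = []
    lock = threading.Lock()
    # Parties: every listener plus the main thread.
    ready = threading.Barrier(NUM_LISTENERS + 1)

    def record(message):
        with lock:
            events.append(message)

    def listener(index):
        record(f"listener {index} ready")
        ready.wait()  # announce that this listener is up
        # A real listener would now loop, receiving multicast data.

    def multicaster():
        record("multicaster started")

    listeners = [threading.Thread(target=listener, args=(i,))
                 for i in range(NUM_LISTENERS)]
    for thread in listeners:
        thread.start()

    ready.wait()  # blocks until all listeners have reached the barrier

    sender = threading.Thread(target=multicaster)
    sender.start()
    for thread in listeners + [sender]:
        thread.join()
    return events


print(start())
```

Whatever order the listeners start in, the multicaster entry is always recorded last.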
2002-03-24 Matt Massie
- Use XML_ParseBuffer() in gexec_cluster() in order to avoid double copying of buffers (XML input from gmond). Faster.
2002-03-22 Matt Massie
- Updated gexec_cluster_free() to ensure no memory leaks even when cluster.num_nodes and cluster.num_dead_nodes equal zero
- Used setvbuf to ensure that gmond and libganglia use line buffering
- Updated net.h to include netinet/in.h
- Changed sockaddr_in_new function in net.c to plug a potential memory leak
- Added an XML_ParserFree() call to the gexec_cluster() lib call to plug a memory leak
- Modified ./gmond/server.c to prevent crashes under heavy stress conditions
- Added fclose() calls to gexec_cluster() to plug a memory leak
- Added gexec_cluster_free() call to gstat for good measure
The command-line client (/usr/sbin/ganglia) is a small utility which is best used as an example – guiding the development of other python-based ganglia tools. The command-line client instantiates the ganglia python class (Gang). ‘Gang’ (/usr/lib/python1.5/site-packages/gmon/ganglia.py) is where all the “heavy-lifting” occurs, as it contains the methods to attach to a local gmond server and parse gmond’s XML output. Other methods are included that output specific metrics and display a help message (by dynamically examining the metrics in the XML output!).
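As a rough idea of what the parsing side of such a class does, here is a small sketch in present-day Python. It is not the code in ganglia.py: the sample XML is invented for the example, the HOST and METRIC element layout is an assumption about gmond's output, and the function names are hypothetical. It also skips the step of connecting to gmond and reads the XML from a string instead.

```python
import xml.etree.ElementTree as ET

# Invented sample standing in for the XML that gmond would send.
SAMPLE_XML = """<GANGLIA_XML>
  <CLUSTER NAME="demo" LOCALTIME="1017900000">
    <HOST NAME="node1">
      <METRIC NAME="load_one" VAL="0.25"/>
      <METRIC NAME="cpu_num" VAL="2"/>
    </HOST>
    <HOST NAME="node2">
      <METRIC NAME="load_one" VAL="1.50"/>
      <METRIC NAME="cpu_num" VAL="4"/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>"""


def parse_cluster(xml_text):
    """Return (cluster name, {host name: {metric name: value string}})."""
    root = ET.fromstring(xml_text)
    cluster = root.find("CLUSTER")
    hosts = {}
    for host in cluster.findall("HOST"):
        hosts[host.get("NAME")] = {
            metric.get("NAME"): metric.get("VAL")
            for metric in host.findall("METRIC")
        }
    return cluster.get("NAME"), hosts


def metric_names(hosts):
    """List every metric seen in the XML, e.g. to build a help message."""
    return sorted({name for metrics in hosts.values() for name in metrics})


name, hosts = parse_cluster(SAMPLE_XML)
print(name, metric_names(hosts))
```

Collecting the metric names from the XML itself is what makes a dynamically generated help message possible: a new metric shows up without any change to the client.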
- Preston Smith fixed the wrong sysctl that was used to get free memory on FreeBSD machines
- Added mute and deaf mode for gmond
- Created a new ganglia-monitor-core-lib distribution for the libganglia library. Ganglia has the start of a C API now.
- Moved all the documentation to DocBook, added much more information and updated/removed what was there. Output docs to ./docs directory of the distribution in both HTML and PDF form. Also installed Doxygen to document libganglia.
- Added the Ganglia Status Tool (gstat) which allows you to check the status of your cluster from the command line. Hosts are sorted with least-loaded nodes at the top of the list.
- Removed the need for the POSIX mutex in pre_process_node() allowing for faster processing of incoming multicast data
- Changed gmond to not count itself as a running process when reporting the number of running processes.
- Doc Schneider changed the way ganglia builds RPMs to support non-root builds as suggested by an anonymous SourceForge user
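The gstat ordering mentioned in the list above (least-loaded nodes first) amounts to a sort on a load figure. A minimal sketch follows; the choice of one-minute load divided by CPU count as the sort key is an assumption made for the example, not necessarily what gstat uses, and the host data is invented.

```python
def sort_least_loaded_first(hosts):
    """Sort (name, load_one, cpu_num) tuples by load per CPU, lowest first."""
    return sorted(hosts, key=lambda host: host[1] / host[2])


# Invented host data: (name, one-minute load, number of CPUs).
hosts = [("node1", 1.8, 2), ("node2", 0.4, 4), ("node3", 0.5, 1)]
ranked = sort_least_loaded_first(hosts)
print([name for name, _, _ in ranked])
```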
I’ve put all the Ganglia documentation in DocBook format in order to index and organize the ganglia documentation and also create a manual in PDF which can easily be printed. The documentation is incomplete but I’m updating it daily.
- Added a new CLUSTER element to the XML with two attributes: NAME and LOCALTIME. Necessary for monitoring clusters in many different timezones.
- Fixed the getopt_long() call in gmond.c thanks to feedback from Meik Hellmund. The getopt_long() parameters didn’t match the switch() statement, which broke the “trusted_host” option.
- Fixed a bug in gmond where connections from untrusted hosts caused segfaults. Error caused by passing datum_free() a NULL pointer in server_thread() of ./gmond/server.c.
- Changed the way transient nameservice errors are handled by pre_process_node() in ./gmond/listen.c. Previously, transient errors were retried but now they are treated as errors (although gmond will continue trying to resolve the host when it gets a new multicast packet from it)
- Updated the ganglia.spec file to merge gmond and gmetric into a single RPM, fixed some small bugs, and updated the RPM information.
- Preston Smith updated the FreeBSD monitoring code to include all metrics which are monitored under Linux except number of running processes, absolute CPU idle time, and shared memory. SMP users may find that FreeBSD’s cp_time sysctl is not completely accurate under FreeBSD-STABLE, meaning CPU percentages might be inaccurate. However, it works under FreeBSD-CURRENT.
- Changed the gmetric options to also support long options and updated the help output (from -h, --help) to be much more descriptive
Added the getopt source to the ganglia library for systems (Solaris, FreeBSD) which don’t have it installed by default. Tested on Solaris 8, FreeBSD 4.5 and Alpha/Linux.
- Completely rewrote the underlying hash library because the original hash functions were over-engineered and had a memory bug on certain platforms. New hash functions are superlight and fast. Built test program and profiled/traced all memory functions using mpatrol. No leaks. Special thanks to Mike Howard for letting me test gmond on his cluster which displayed the memory bug. Also thanks to Alan Hagg and Rod Hernandez for patiently answering my questions about the memory bug on their clusters. Your help was appreciated!
- Updated code to catch when transient nameservice errors occur and retry. Correctly handle hosts that don’t resolve instead of treating them as an error
- Added a patch submitted by Joshua J England for gmond to correctly report the number of CPUs and their speed on alpha architectures
- Added a patch submitted by Eirikur Hallgrimsson and written by Yaroslav Klyukin for gmetric which allows users to choose the network interface on which gmetric multicasts metric data
- Changed the “safe_host” option to “trusted_host” to make it clearer. Also added the “num_nodes” and “num_custom_metrics” options for more efficient in-memory cluster image creation
- Reduced the number of total threads by one by removing the for(;;)pause() spin and having the main thread do server work
- Created the my_inet_ntop() function in libganglia to deal with the limitations of inet_ntoa in a multi-threaded environment
- Changed the self-organizing behavior of gmond to recognize when a transient error occurred on a remote gmond process
- Added verbose error checking of gethostbyaddr() in listen.c
- Fixed a bug in ganglia-rdd.pl where stale hosts were not being removed from the in-memory hash (to match the XML output). No changes to the underlying databases were necessary, only to the data that is being put into them.
- Changed the Y-Axis on the hosts in the cluster overview to have the same range (min/max) in order to make better host comparisons. (Thanks to Tim Cera for making the suggestion)
- Changed the list of hosts that are down to a drop-down box. Previously, when a large number of machines went down the top right corner table cell would swell. The host list is also sorted in order from the most recent crash to the oldest.
See a demo at http://ganglia.sourceforge.net/demo/!
Download it now, from the ganglia download site
I’ve had requests from users to integrate ganglia into their MPI installations. As a first step I’m releasing a Perl script which creates a dynamic load-balanced MPI machinefile.
Any nodes in your cluster that are down are not included in the list and the number of CPUs for each node is listed as well.
To get this great script, go to the download page
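For readers who want to see the idea before downloading, the logic can be sketched in a few lines of Python. This is not the released Perl script: the host data is invented, the `host:cpus` line format is an assumption (it follows a common MPICH machinefile convention), and ordering by one-minute load is one plausible reading of "load-balanced".

```python
def build_machinefile(hosts):
    """Build machinefile text from host records.

    Each record is a dict with the keys "name", "up", "cpu_num" and
    "load_one" (all hypothetical names chosen for this sketch).
    """
    # Nodes that are down are left out entirely.
    live = [host for host in hosts if host["up"]]
    # Least-loaded nodes come first so they receive work first.
    live.sort(key=lambda host: host["load_one"])
    return "".join(f'{host["name"]}:{host["cpu_num"]}\n' for host in live)


# Invented cluster state for the example.
cluster = [
    {"name": "node1", "up": True, "cpu_num": 2, "load_one": 1.20},
    {"name": "node2", "up": False, "cpu_num": 2, "load_one": 0.00},
    {"name": "node3", "up": True, "cpu_num": 4, "load_one": 0.10},
]
print(build_machinefile(cluster), end="")
```

In practice the host records would come from gmond's XML output rather than a hard-coded list.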