27 March 2012

115. Very Simple Python Queue Manager

I suppose we can call it the VSPQM, which sounds a bit like a Roman initialism, akin to SPQR.

I've spent the past few days trying to get to grips with the Sun Grid Engine (SGE) but have given up for now. While it seems capable, it's overkill for my purposes, especially given how difficult it is just to configure. It's a bit like my experience with OpenDX, a very capable plotting program which I couldn't make work to my satisfaction in spite of being one of the lucky few in possession of the "Open DX -- Paths to Visualisation" book.

Long story short -- I wrote a small script in python. It
- reads a file, list, containing the names of shell scripts
- executes the shell scripts, job1.sh..jobn.sh, sequentially: when one script finishes, the next one is started
- lets you add jobs to and remove jobs from list during execution

It's a 'dumb' script -- it does not try to balance jobs across nodes or look for idle cpus/cores. It just executes one job after the other, and marks jobs as done after execution.
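
The marking scheme can be sketched in a few lines of Python 3 (the actual script further down targets Python 2.4-2.7). This is only an illustration of the logic, assuming a handled job is flagged with a leading asterisk as vspqm.py does; pick_next_job is a made-up name and skipping blank lines is my own assumption:

```python
def pick_next_job(lines):
    """Return (job, updated_lines): the first unmarked entry and the list
    with that entry marked by a leading '*'. job is None if nothing is left."""
    job = None
    updated = []
    for line in lines:
        entry = line.strip()
        if not entry:
            continue  # assumption: blank lines carry no job
        if job is None and not entry.startswith("*"):
            job = entry
            entry = "*" + entry
        updated.append(entry)
    return job, updated


queue = ["pi40.sh", "pi400.sh", "pi2000.sh"]
first, queue = pick_next_job(queue)
second, queue = pick_next_job(queue)
print(first, second, queue)
```

Because the list is re-read before every job, anything appended to it while a job is running is picked up on the next pass.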

To test it:
create a file called list and put the following lines in it:
pi40.sh
pi400.sh
pi2000.sh
The scripts are the following:

pi40.sh:
echo "pi to 40 decimals"
echo "scale=40; 4*a(1)" | bc -l -q
echo "done"
pi400.sh:
echo "scale=400; 4*a(1)" | bc -l -q
pi2000.sh:
echo "scale=2000; 4*a(1)" | bc -l -q
The python code for vspqm.py is below

I've aliased my vspqm (edit ~/.bashrc):
alias vspqm='/home/me/work/vspqm/vspqm.py'
Then sourced ~/.bashrc

Launch in the directory where you keep your list file using
me@beryllium:~/work/vspqm/jobs$ vspqm list > log &
[1] 23925
me@beryllium:~/work/vspqm/jobs$ cat log
pi to 40 decimals
3.1415926535897932384626433832795028841968
done
3.141592653589793238462643383279502884197169399375105820974944592307\
[..]
3.141592653589793238462643383279502884197169399375105820974944592307\
81640628620899862803482534211706798214808651328230664709384460955058\
[..]

An nwchem example would be
list:
ac.sh
bn.sh
ac.sh:
cd acetone/
mpirun -n 4 nwchem ac.nw>ac.out
cd ../
bn.sh:
cd benzene/
mpirun -n 4 nwchem bn.nw>bn.out
cd ../
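
If you have many such calculations, the job scripts can be generated rather than typed. Here is a rough Python 3 sketch; nwchem_job is a hypothetical helper, the one-directory-per-molecule layout is simply copied from the example above, and none of this is part of vspqm itself:

```python
def nwchem_job(directory, stem, nprocs=4):
    """Build the text of a job script that runs one nwchem input via mpirun."""
    return (
        "cd %s/\n"
        "mpirun -n %d nwchem %s.nw>%s.out\n"
        "cd ../\n" % (directory, nprocs, stem, stem)
    )


# script name -> (directory, input file stem), as in the example above
jobs = {"ac.sh": ("acetone", "ac"), "bn.sh": ("benzene", "bn")}
for script, (directory, stem) in sorted(jobs.items()):
    print(script + ":")
    print(nwchem_job(directory, stem), end="")
```

Writing each string to its script file and the sorted script names to list would give the same setup as the hand-written version.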


Our python queue manager (which we'll call vspqm.py and chmod +x to make executable) is below. Don't forget to change #!/usr/bin/python2.4 if necessary -- I use 2.4 on ROCKS and 2.7 on Debian testing/wheezy

#!/usr/bin/python2.4
# rudimentary queue manager. Handles a single node,
# submitting a series of jobs in sequence. use python v2.4-2.7
import os
import time
import sys
infile=sys.argv[1]
print "pyqm v 0.0.3"
def launchjob(job):
        i=0
        print "######"
        job=job.rstrip('\n')
     
        i=os.system("sh "+job)
        if i==0:
                print "Job successful"
        else:
                print "Job failed"
        print "######"
        return i
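def remake_list(infile):
        # copy the marked-up backup over the job list
        qfile=open(infile,"w")
        bakfile=open(infile+".bak",'r')
        for i in bakfile:
                qfile.write(i)
        qfile.close()
        bakfile.close()
        return 0
def rewind(infile):
        # strip the first character (the '*' marker) from every line
        qfile=open(infile,"w")
        bakfile=open(infile+".bak",'r')
        for i in bakfile:
                qfile.write(i[1:])
        qfile.close()
        bakfile.close()
        return 0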
def get_next_job(infile):
        qfile=open(infile,"r")
        bakfile=open(infile+".bak",'w')
        lines=""
        job=""
        for line in qfile:
                if line[0]=="*":
                        print "Marked as done: ",line[1:]
                if line[0]!="*" and job=="":
                        print "Launching: ", line
                        job=line
                        line="*"+line
                lines+=line
        bakfile.write(lines)
        qfile.close()
        bakfile.close()
        return job
def main(infile):
        jobs=1
        while (jobs==1):
                newjob=get_next_job(infile)
                remake_list(infile)
                if newjob!="":
                        jobs=1
                        echojob=launchjob(newjob)
                else:
                        print "No more jobs found at "+str(time.asctime())    
                        jobs=0
        return 0

if __name__ == "__main__":
        main(infile)
        rewind(infile)
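
One thing to be aware of: rewind() drops the first character of every line, whether or not that character is an asterisk. A slightly more defensive variant, sketched here in Python 3 with a hypothetical helper name, strips the marker only where it is present:

```python
def unmark(lines):
    """Strip a single leading '*' from marked entries; leave others untouched."""
    return [line[1:] if line.startswith("*") else line for line in lines]


restored = unmark(["*pi40.sh\n", "*pi400.sh\n", "pi2000.sh\n"])
print("".join(restored), end="")
```

This only matters if an unmarked job is still in the list when the rewind happens, e.g. one added at the very last moment.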

20 March 2012

114. Nwchem 6.0 with openmpi support on debian testing

I still haven't managed to compile a working version of Nwchem 6.1 on 64-bit Debian, regardless of whether I'm using mpich or openmpi. The number of posts relating to compiling nwchem is steadily growing, but I'd rather have posts which are almost, but not quite, identical if that makes it unambiguous for the average user how to build and use nwchem.

Anyway, since I'm using openmpi on my rocks cluster(s), I figure I might as well start using openmpi on debian too. In addition, the only way you can get nwchem 6.0 to work with mpich2 on debian seems to be by using the old v1.2 package which causes problems of its own (see apt-pinning).

Note: See here for information about python support: http://verahill.blogspot.com.au/2012/04/adding-python-support-to-nwchem-under.html

Long story short -- nwchem with openmpi:
mkdir ~/tmp
sudo apt-get install openmpi-bin libopenmpi-dev
wget http://www.nwchem-sw.org/images/Nwchem-6.0.tar.gz
tar -xvf Nwchem-6.0.tar.gz
cd nwchem-6.0/

export LARGE_FILES=TRUE
export TCGRSH=/usr/bin/ssh
export NWCHEM_TOP=/home/me/tmp/nwchem-6.0
export NWCHEM_TARGET=LINUX64
export NWCHEM_MODULES=all
export USE_MPI=y
export USE_MPIF=y
export MPI_LOC=/usr/lib/openmpi/lib
export MPI_INCLUDE=/usr/lib/openmpi/include
export LIBRARY_PATH=$LIBRARY_PATH:/usr/lib/openmpi/lib
export LIBMPI="-lmpi -lopen-rte -lopen-pal -ldl -lmpi_f77 -lpthread"
cd $NWCHEM_TOP/src
make clean
make nwchem_config
make FC=gfortran

This will take a good 20-30 minutes.


Your binary will be in nwchem-6.0/bin/LINUX64/

Finally, see whether openmpi is already in your LD_LIBRARY_PATH

echo $LD_LIBRARY_PATH
/lib/openmm:/usr/lib/nvidia-cuda-toolkit:/usr/lib/nvidia
If not, edit ~/.bashrc and add
export LD_LIBRARY_PATH=/usr/lib/openmpi/lib:$LD_LIBRARY_PATH
export PATH=$PATH:/home/me/tmp/nwchem-6.0/bin/LINUX64
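
If you'd rather not eyeball the colon-separated list, the check can be scripted. A minimal Python 3 sketch, where in_search_path is a made-up helper and the sample value is the echo output shown above:

```python
import os


def in_search_path(directory, path_value):
    """True if directory is one of the colon-separated entries of path_value."""
    wanted = os.path.normpath(directory)
    entries = [p for p in path_value.split(":") if p]
    return any(os.path.normpath(p) == wanted for p in entries)


current = "/lib/openmm:/usr/lib/nvidia-cuda-toolkit:/usr/lib/nvidia"
print(in_search_path("/usr/lib/openmpi/lib", current))
# The same test against the live environment of this process
print(in_search_path("/usr/lib/openmpi/lib", os.environ.get("LD_LIBRARY_PATH", "")))
```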


113. Using ECCE to run nwchem jobs

EDIT: This post is getting messier as I'm hammering things out...but I've gotten everything to work in the end, so please persist.  The workflow described below is not the ideal one, but it'll get you started. I'll link here when I put up a newer, more reasonable tutorial.

EDIT2: I'm really warming to ECCE as I'm learning more about it. I still think it'd be nice if it was open source, and I can't understand why it has to be reliant on csh (which is pretty much broken on ROCKS, and uncomfortable at the best of times), but it's pretty neat once you've got all the details ironed out. Error feedback/report could be better though.

EDIT 3: ECCE is going open source in the (northern) summer of 2012! As users we no longer have any excuses to complain.

Here's a quick introduction to getting started with using ECCE as the interface to nwchem, similar to how gaussview can be used to set up gaussian jobs.

This presumes that you've set up ECCE and preferably compiled your own version of nwchem:
http://verahill.blogspot.com.au/2012/03/ecce-on-debian-but-not-on-rockscentos.html
http://verahill.blogspot.com.au/2012/03/nwchem-61-with-openmpi-on-rocks.html
http://verahill.blogspot.com.au/2012/01/debian-testing-64-wheezy-nwhchem.html


##Important##
Once I had figured all of this out I rebuilt nwchem and re-installed ecce in the proper locations. You might want to do the same.

A. If you're going to use several nodes you should put nwchem in the same position in the file system hierarchy on all nodes e.g.
/opt/nwchem/nwchem-6.0/bin/LINUX64/nwchem

Also, make sure you share a folder (see how to use NFS) between the nodes which you can use for run time files e.g. /work

EDIT 4: This (probably) isn't necessary. In fact, using NFS in the wrong way will slow things down.

Set the permissions right (chown your user and set to 777 -- 755 is enough for nfs sharing between debian nodes, but between ROCKS and Debian you seem to need 777), and open your firewall on all ports for communication between the nodes.

B. Make sure that ECCE_HOME has been set in ~/.bashrc e.g.
export ECCE_HOME=/opt/ecce/apps

and in ~/.cshrc
setenv ECCE_HOME /opt/ecce/apps

C.
edit /opt/ecce/apps/siteconfig/submit.site (location depends on where you install ecce)
Change lines 65+ from
#NWChemCommand {
#  $nwchem $infile > $outfile
#}
to (for multiple nodes)
NWChemCommand {
mpirun -hostfile /work/hosts.list -n $totalprocs --preload-binary /opt/nwchem/nwchem-6.0/bin/LINUX64/nwchem $infile > $outfile
}
to use mpirun for parallel job submissions and assuming you have a hosts file in /work. For running on a single node you can use


NWChemCommand {
mpirun  -n $totalprocs $nwchem  $infile > $outfile
}

Use either --preload-binary /opt/nwchem/nwchem-6.0/bin/LINUX64/nwchem or $nwchem -- see what works for you. You probably can't do preload if you're running different linux distros (e.g. debian and centos).

My hosts.list looks like this:

tantalum slots=4 max_slots=4
beryllium slots=4 max_slots=5

Make sure that you don't accidentally put 2 jobs on node 0, then 2 jobs on node 1, then another 2 jobs on node 0, since they won't be consecutively numbered and will crash armci. You can avoid this by setting slots and max_slots to the same number.
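
A quick way to check a hosts file for this is to parse it and compare the two numbers per host. The following Python 3 sketch assumes the simple "name key=value ..." layout shown above; both function names are hypothetical:

```python
def parse_hostfile(text):
    """Parse 'name key=value ...' lines into {name: {key: int}}."""
    hosts = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # ignore comments and blanks
        if not fields:
            continue
        hosts[fields[0]] = {
            key: int(value)
            for key, value in (f.split("=", 1) for f in fields[1:])
        }
    return hosts


def mismatched(hosts):
    """Names of hosts whose slots and max_slots differ."""
    return sorted(
        name for name, opts in hosts.items()
        if opts.get("slots") != opts.get("max_slots")
    )


hosts = parse_hostfile("tantalum slots=4 max_slots=4\nberyllium slots=4 max_slots=5\n")
print(mismatched(hosts), sum(h["slots"] for h in hosts.values()))
```

Any host it lists has room for oversubscription, which is where the non-consecutive numbering can creep in.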


D.
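You may have to edit /etc/openmpi/openmpi-mca-params.conf if you have several (real or virtual) interfaces and add e.g. the following (note that Open MPI treats btl_tcp_if_include and btl_tcp_if_exclude as mutually exclusive, so keep only one of the two):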


btl=tcp,sm,self
btl_tcp_if_include=eth1,eth2
btl_tcp_if_exclude=eth0,virtbr0


Start ECCE:
First start the server
csh /home/me/tmp/ecce/ecce-v6.2/server/ecce-utils/start_ecce_server
then launch ecce

ecce

This will launch what the ecce people call the 'gateway':
The Gateway

0. Make sure you've got your machine set up
Click on Machine browser
Make sure that you can connect to the node e.g. by clicking on disk usage

Set the application paths. Don't fiddle with nodes -- just change number of processors to the total for all nodes.



1. Draw SiCl4 
Click on the Builder in the Gateway, which gives you the following:
The builder window

Click on More to get the periodic table which gives you access to Si

Select Geometry -- here, Tetrahedral

Si -- with four 'nubs' (yup, that's what the ecce ppl call them)

Time to attach Cl atoms to the nubs. Select Cl and pick Terminal geometry.

Click on a 'nub' to replace it with a Cl

And do it until you've replaced all 'nubs'. Hold down right mouse button to rotate

Click on the broom next to the bond menu on the right to pre-optimize  the structure using MM

And save. You will probably be limited to saving your jobs in folders below the ecce  folder.


2. Set up your job
Click on the Organizer icon in the 'gateway', which takes you here:

Click on the first icon, Editor

Focus on selecting Theory and Run type. Here we'll do a geometry optimisation.

Click on Details for Theory

Click on Details for Run type

Constraints are optional

In the organizer, click on the third icon to set the basis set. Defined atoms for a particular basis set are indicated by an orange lower right corner.

You can get Details about the basis set

If you don't have a Navy Triangle you can't run. Click on Editor and see what might be wrong.

Ready to run. Click on Launch.

4. Running
I'm still working on enabling more than a single core...
Once you've clicked on launch you'll get

 If you click on viewer you can monitor the job

Optimization in progress

5. Re-launch a job at higher theory
In the Organizer, select your last job and then click on Edit, Duplicate Setup with Last Geometry
You then get a copy to edit

Change the basis set, save, then click on Final Edit

This is the nwchem input file in a vim instance

Add a line to the end, saying task scf freq to calculate the vibrations (there's another job option called geovib which does optim+freq , but here we do it by hand)

Launch

Running...

You can now look at the vibrations

And you can visualise MOs -- here's the HOMO which looks like all isolated p orbitals on the chlorine

You can also calculate 'properties'

These include GIAO shielding

Performance:
Here's phenol (scf/6-31g*) across three gigabit-linked nodes. The dotted line denotes node boundaries.


Here's a number of alkanes (scf/6-31g) on 4 cores on a single node:


19 March 2012

112. Kernel 3.3.x on debian testing

Compiling a kernel on debian is easy. Kernel 3.3 came out today, and here's how to build it for debian testing:
sudo apt-get install kernel-package fakeroot
wget http://www.kernel.org/pub/linux/kernel/v3.0/linux-3.3.tar.bz2
tar -xvf linux-3.3.tar.bz2 
cd linux-3.3/

cat /boot/config-`uname -r`>.config
make oldconfig

EDIT: 3.3.1 is here: http://www.kernel.org/pub/linux/kernel/v3.0/linux-3.3.1.tar.bz2 -- the build instructions are the same
EDIT 2: This has been tried with 3.3.4 as well. All is fine.
EDIT 3: And it's fine with 3.3.5

You'll also be asked about the new inclusions in the kernel. You can pick the default if you don't know: Yes means compile into the kernel, m means provide as a module, and No means don't support. Some drivers are better provided as modules -- see e.g. http://justlinux.com/forum/archive/index.php/t-127876.html
"Some things obviously have to be compiled into the kernel - file system support for your / filesys, stuff like that. Most everything else can be modules, if you desire." and "If you hot-swap different usb peripherals then I would recommend compiling those drivers as modules". Same goes for network drivers.

You can also look up the different options here: http://cateee.net/lkddb/web-lkddb/.
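
Whatever you answer ends up in .config, where built-in options read CONFIG_X=y, modules read CONFIG_X=m and disabled options appear as "# CONFIG_X is not set". To get a feel for the balance between built-in and modular, you can count them. A small Python 3 sketch; summarize_config is a made-up name and the sample lines are invented for illustration, not taken from my actual config:

```python
def summarize_config(text):
    """Count options built in (=y), built as modules (=m) and explicitly unset."""
    counts = {"y": 0, "m": 0, "n": 0}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(" is not set"):
            counts["n"] += 1
        elif line.startswith("CONFIG_") and "=" in line:
            value = line.split("=", 1)[1]
            if value in ("y", "m"):
                counts[value] += 1  # numeric and string values are ignored
    return counts


sample = """\
CONFIG_EXT4_FS=y
CONFIG_UNIX_DIAG=m
# CONFIG_OPENVSWITCH is not set
CONFIG_IP_VS_SH_TAB_BITS=8
"""
print(summarize_config(sample))
```

To run it on a real tree, pass in open(".config").read() from the kernel source directory.
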
Seems like there's a lot of network and touch/screen drivers. And the fabled android drivers are available now:
Make audit loginuid immutable (AUDIT_LOGINUID_IMMUTABLE) [N/y/?] (NEW) Yes
Memory Resource Controller Kernel Memory accounting (EXPERIMENTAL) (CGROUP_MEM_RES_CTLR_KMEM) [N/y/?] (NEW) No
EFI stub support (EFI_STUB) [N/y/?] (NEW) No
UNIX: socket monitoring interface (UNIX_DIAG) [N/m/y/?] (NEW) m
UDP: socket monitoring interface (INET_UDP_DIAG) [N/m/?] (NEW) m
Netfilter NFACCT over NFNETLINK interface (NETFILTER_NETLINK_ACCT) [N/m/y/?] (NEW) m
Supply CT list in procfs (OBSOLETE) (NF_CONNTRACK_PROCFS) [Y/n/?] (NEW) Yes
"nfacct" match support (NETFILTER_XT_MATCH_NFACCT) [N/m/?] (NEW) No
IPVS source hashing table size (the Nth power of 2) (IP_VS_SH_TAB_BITS) [8] (NEW)
"rpfilter" reverse path filter match support (IP_NF_MATCH_RPFILTER) [N/m/?] (NEW) No
"rpfilter" reverse path filter match support (IP6_NF_MATCH_RPFILTER) [N/m/?] (NEW) No
Open vSwitch (OPENVSWITCH) [N/m/y/?] (NEW) No
Network priority cgroup (NETPRIO_CGROUP) [N/m/y/?] (NEW) m
ISA Bus based legacy SJA1000 driver (CAN_SJA1000_ISA) [N/m/?] (NEW) No
Bosch CC770 and Intel AN82527 devices (CAN_CC770) [N/m] (NEW) No
NFC LLCP support (EXPERIMENTAL) (NFC_LLCP) [N/y/?] (NEW) No
Block Device Driver for Micron PCIe SSDs (BLK_DEV_PCIESSD_MTIP32XX) [N/m/?] (NEW) No
NVM Express block device (BLK_DEV_NVME) [N/m/y/?] (NEW) No
Ethernet team driver support (EXPERIMENTAL) (NET_TEAM) [N/m/y/?] (NEW) Yes
Calxeda 1G/10G XGMAC Ethernet driver (NET_CALXEDA_XGMAC) [N/m/y/?] (NEW) No
Micrel KS8995MA 5-ports 10/100 managed Ethernet switch (MICREL_KS8995MA) [N/m/y] (NEW) No
Atheros ath9k bluetooth coexistence support (ATH9K_BTCOEX_SUPPORT) [Y/n/?] (NEW) Yes
Hardware support that overlaps with the brcmsmac driver (B43_BCMA_EXTRA) [Y/n] (NEW) Yes
Broadcom IEEE802.11n PCIe SoftMAC WLAN driver (BRCMSMAC) [N/m/?] (NEW) No
iwlwifi experimental P2P support (IWLWIFI_P2P) [N/y/?] (NEW) No
Enable full debugging output in iwlegacy (iwl 3945/4965) drivers (IWLEGACY_DEBUG) [N/y/?] (NEW) No
TCA8418 Keypad Support (KEYBOARD_TCA8418) [N/m/?] (NEW) No
AUO in-cell touchscreen using Pixcir ICs (TOUCHSCREEN_AUO_PIXCIR) [N/m/?] (NEW) No
EETI eGalax multi-touch panel support (TOUCHSCREEN_EGALAX) [N/m/?] (NEW) No
PIXCIR I2C touchscreens (TOUCHSCREEN_PIXCIR) [N/m/?] (NEW) No
Sharp GP2AP002A00F I2C Proximity/Opto sensor driver (INPUT_GP2A) [N/m/?] (NEW) No
Polled GPIO tilt switch (INPUT_GPIO_TILT_POLLED) [N/m/y/?] (NEW) No
SBS Compliant gas gauge (BATTERY_SBS) [N/m/?] (NEW) No
National Semiconductor LP8727 charger driver (CHARGER_LP8727) [N/m/?] (NEW) No
Battery charger manager for multiple chargers (CHARGER_MANAGER) [N/y/?] (NEW) No?
VIA Watchdog Timer (VIA_WDT) [N/m/y/?] (NEW) No
Support STMicroelectronics STMPE (MFD_STMPE) [N/y/?] (NEW) No
Support Dialog Semiconductor DA9052/53 PMIC variants with SPI (MFD_DA9052_SPI) [N/y/?] (NEW) No
Enable IR raw decoder for the Sanyo protocol (IR_SANYO_DECODER) [M/n/?] (NEW) No
JL2005B/C/D USB V4L2 driver (USB_GSPCA_JL2005BCD) [N/m/?] (NEW) m
V4L PCI(e) devices (V4L_PCI_DRIVERS) [Y/n/?] (NEW) Yes
V4L ISA and parallel port devices (V4L_ISA_PARPORT_DRIVERS) [N/y/?] (NEW) No
V4L platform devices (V4L_PLATFORM_DRIVERS) [N/y/?] (NEW) No
Intel GMA5/600 KMS Framebuffer (DRM_GMA500) [N/m/?] (NEW) No
Roccat Isku keyboard support (HID_ROCCAT_ISKU) [N/m/?] (NEW) No
Microsoft Hyper-V mouse driver (HID_HYPERV_MOUSE) [N/m/?] (NEW) No
EHCI support for Marvell on-chip controller (USB_EHCI_MV) [N/y/?] (NEW) No
Inventra Highspeed Dual Role Controller (TI, ADI, ...) (USB_MUSB_HDRC) [N/m/?] (NEW) No
Marvell USB2.0 Device Controller (USB_MV_UDC) [N/m/?] (NEW) No
LED Support for TCA6507 I2C chip (LEDS_TCA6507) [N/m/?] (NEW) No
LED support for the Bachmann OT200 (LEDS_OT200) [N/m/y/?] (NEW) No
InfiniBand SCSI RDMA Protocol target support (INFINIBAND_SRPT) [N/m/?] (NEW) No
Support for rtllib wireless devices (RTLLIB) [N/m/?] (NEW)
Android Drivers (ANDROID) [N/y/?] (NEW) No
Fujitsu Tablet Extras (FUJITSU_TABLET) [N/m/y/?] (NEW) No
Fujitsu-Siemens Amilo rfkill support (AMILO_RFKILL) [N/m/?] (NEW) No
AMD IOMMU Version 2 driver (EXPERIMENTAL) (AMD_IOMMU_V2) [N/m/y/?] (NEW) No
Btrfs with integrity check tool compiled in (DANGEROUS) (BTRFS_FS_CHECK_INTEGRITY) [N/y/?] (NEW) No
NFS server manual fault injection (NFSD_FAULT_INJECTION) [N/y/?] (NEW) No
Kernel memory leak detector (DEBUG_KMEMLEAK) [N/y/?] (NEW) No
NMI Selftest (DEBUG_NMI_SELFTEST) [N/y/?] (NEW) No
Serpent cipher algorithm (x86_64/SSE2) (CRYPTO_SERPENT_SSE2_X86_64) [N/m/y/?] (NEW) No

Continuing,
make-kpkg clean
fakeroot make-kpkg -j7 --initrd --revision=3.3.0 --append-to-version=-amd64 kernel_image kernel_headers

The build takes a LONG time. Once it's done:

mv ../linux*3.3.0*.deb .
sudo dpkg -i *.deb

Done.

Linux tantalum 3.3.0-amd64 #1 SMP Tue Mar 20 06:29:46 EST 2012 x86_64 GNU/Linux



111. Ecce (nwchem) on Debian, and ROCKS/Centos

If you're using nwchem chances are that you've considered using ECCE to parse the output:
http://ecce.emsl.pnl.gov/

First of all you'll need to register at https://eus.emsl.pnl.gov/Portal/ -- and you can only do that if you're faculty. Postdocs and PhD students need not apply. Other than that, it's free, but you'll have to wait a couple of days to get your registration approved.

As much as I like nwchem owing to the clear syntax, I feel less warmly about ecce. Don't get me wrong -- it's pretty. It just feels archaic and cobbled together. Even worse is that it's not open source and that its workings feel a bit opaque at times. Still, there's no better program for visually parsing nwchem output at this point. Anyway...

--start here --
Debian:
Download the install_ecce.v6.2.rhel5-gcc3.2.3-m32.csh file to ~/tmp/ecce

There's no md5sum supplied but here's what I got:
2ee70cc817dee9f80b11be5eac6e53e5
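
If md5sum isn't at hand, the same check can be done with Python's hashlib. A minimal Python 3 sketch; md5_of_file is a made-up helper, and the demonstration hashes a throwaway file rather than the installer, so point it at your own download to compare against the value above:

```python
import hashlib
import os
import tempfile

# The digest I got for the installer (see above)
EXPECTED = "2ee70cc817dee9f80b11be5eac6e53e5"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a temporary file; replace tmp.name with the installer path
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_digest = md5_of_file(tmp.name)
os.remove(tmp.name)
print(demo_digest, demo_digest == EXPECTED)
```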

If you haven't already
sudo apt-get install csh 

OK, moving on...
cd ~/tmp/ecce
chmod +x  install_ecce.v6.2.rhel5-gcc3.2.3-m32.csh
./install_ecce.v6.2.rhel5-gcc3.2.3-m32.csh


Main ECCE installation menu
===========================
0) Help on main menu options
1) Full install
2) Full upgrade
3) Application software install
4) Application software upgrade
5) Server install
6) Server upgrade

Pick 1 if you're installing on your desktop and there's no server that you know of. 

Once the installation is over you get:
***************************************************************
!! You MUST perform the following steps in order to use ECCE !!
-- Unless only the user 'me' will be running ECCE,
   start the ECCE server as 'me' with:
     /home/me/tmp/ecce/ecce-v6.2/server/ecce-utils/start_ecce_server
-- To register machines to run computational codes, please see
   the installation and compute resource registration manuals
   at http://ecce.pnl.gov/using/installguide.shtml
-- To run ECCE each user must source either the runtime_setup
   (csh/tcsh) or runtime_setup.sh (sh/bash/ksh) script in the
   directory /home/me/tmp/ecce/ecce-v6.2/apps/scripts
   from their shell environment setup script.  For example,
   with csh or tcsh, add the following to ~/.cshrc:
     if (-e /home/me/tmp/ecce/ecce-v6.2/apps/scripts/runtime_setup) then
       source /home/me/tmp/ecce/ecce-v6.2/apps/scripts/runtime_setup
     endif
***************************************************************
Which translates to:
1. sh  /home/me/tmp/ecce/ecce-v6.2/server/ecce-utils/start_ecce_server
2. Sourcing that file makes no sense. Instead, add the following to your ~/.bashrc
export ECCE_HOME=/home/me/tmp/ecce/ecce-v6.2/apps
export PATH=${ECCE_HOME}/scripts:${PATH}

Assuming you've sourced your ~/.bashrc, start ecce by typing
ecce

...which takes an unreasonably long time (ca 1 min) after which you're greeted by
Press Any Key
Type in a password -- any password -- which will be your password from now on.
You're then taken to
Click on Viewer (assuming you've got something to look at)
Pay attention to the fine print
Have a look at the text box in the bottom right corner..and pay attention. In my particular case I have 6 cores and an mpi aware nwchem 6.0 version compiled. I bet that's better than whatever comes bundled with ecce. Also, the

To change you go to the machine browser (see screen shot #2), click on set up remote access and make sure that everything is working by clicking on e.g. processes:

Then click on the Machine menu (top left), select Register Machine while your machine is selected.
You can now change your options.

Running:
So, before using ecce you always need to
sh  /home/me/tmp/ecce/ecce-v6.2/server/ecce-utils/start_ecce_server
first. The server will run until you stop it or reboot.
Next, start ecce
ecce

Integration with nwchem
Most people would probably set up their nwchem jobs by hand, because it's so simple. All you need to do is to include the statement
ecce_print ecce.out
in the beginning, and you'll get an ecce.out file which you can then IMPORT (not open regularly, but import) into ecce.
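
If you have a pile of existing input files, adding the statement can be automated. A rough Python 3 sketch, assuming that putting the line first in the file is acceptable; add_ecce_print is a made-up name and the two-line input is a stub, not a complete nwchem job:

```python
def add_ecce_print(nw_input, outfile="ecce.out"):
    """Prepend an ecce_print directive unless the input already has one."""
    for line in nw_input.splitlines():
        if line.strip().lower().startswith("ecce_print"):
            return nw_input
    return "ecce_print %s\n%s" % (outfile, nw_input)


example = "start benzene\ntask scf optimize\n"
print(add_ecce_print(example), end="")
```

Running it twice on the same text leaves the input unchanged the second time.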

Click on Viewer, Import Calculation From Output File, select your ecce.out and voilà:
ECCE: homo (benzene)
If you're running debian, you're done now.



ROCKS 5.4.3/Centos 5.6:
This isn't a fix as much as a rant. The problem with ROCKS 5.4.3 is that csh is so broken that it's a struggle just to install ecce. I mean, I do show how to get ecce running in the end, but ROCKS feels like an unfinished piece of work compared to a normal debian install.

--Demonstration only -- don't do --
First back up ssh-key.sh and ssh-key.csh in /etc/profile.d

So...you start by
chmod +x install_ecce.v6.2.rhel5-gcc3.2.3-m32.csh
./install_ecce.v6.2.rhel5-gcc3.2.3-m32.csh
...and nothing's happening.

You then try just typing in
csh

/etc/profile.d/ssh-key.sh: line 211: return: can only `return' from a function or sourced script
It appears that you have not set up your ssh key.
This process will make the files:
     /export/home/me/.ssh/id_rsa.pub
     /export/home/me/.ssh/id_rsa
     /export/home/me/.ssh/authorized_keys
Generating public/private rsa key pair.
/export/home/me/.ssh/id_rsa already exists.
Overwrite (y/n)? 

Turns out there's a bug in ROCKS 5.4.3.  You can fix that by:
rpm -Uvh ftp://www.rocksclusters.org/pub/rocks/updates/5.4.3/x86_64/RPMS/rocks-config-server-5.4.3-1.x86_64.rpm

So far so good.
csh
...and nothing. It just exits. Or so you think. But the problem is bigger than that --  try opening a new terminal in e.g. gnome (gnome-terminal or xterm) -- it exits immediately. No error message or anything.

You can get csh to start by moving /etc/csh.cshrc out of the way, but you're still screwed as to opening a new terminal. The only way to get back a working system is to restore ssh-key.sh and ssh-key.csh.

--- Demonstration over ---

--Start here --
 You could also get around all this by running
csh -f
But then you don't have any env. variables loading and it can lead to problems of its own.

Anyway:
csh -f install_ecce.v6.2.rhel5-gcc3.2.3-m32.csh

The install starts. Just follow the instructions.

After installation, start the server:
csh -f ecce-v6.2/server/ecce-utils/start_ecce_server

Hit enter until you get a workable prompt back...
Edit your ~/.bashrc and add

export ECCE_HOME=/home/me/tmp/ecce/ecce-v6.2/apps
export PATH=${ECCE_HOME}/scripts:${PATH}

Don't bother sourcing your ~/.bashrc. It's easier to just open a new terminal.
Type
ecce
and you should be up and running...sort of. Under ROCKS I had problems importing ecce.out files since I had problems actually connecting to the server. Don't know why, but it came down to not being able to open a remote shell on the host.

NOTE:
this worked fine on one box, but not on another one which I was setting up remotely. On that one I had to edit

ecce/apps/siteconfig/Dataservers
and
ecce/apps/siteconfig/jndi.properties 

In particular, I had to change references to eccetera.emsl.pnl.gov.

18 March 2012

110. Compiling, installing Gnuplot 4.6 on Debian

A new version of gnuplot doesn't come out very often, and this one has an interesting new feature: brace-delimited blocks for control structures such as if/else, while and do for.
http://www.gnuplot.info/announce.4.6.0

Building gnuplot 4.6 is similar to building 4.4.4 and is pretty straightforward:

sudo apt-get install libgd2-xpm-dev checkinstall

wget http://sourceforge.net/projects/gnuplot/files/latest/download?source=files
mv download\?source\=files gnuplot-4.6.tar.gz
tar -xvf gnuplot-4.6.tar.gz
cd gnuplot-4.6.0/

./configure --with-linux-vga
make
checkinstall --install=no
sudo rm /usr/local/share/info/dir -rf
sudo dpkg -i gnuplot_4.6.0-1_amd64.deb

You may get an error if trying to install on a system with a home-compiled version of octave (see below).

The problem with handling small numbers is not present in this version (http://verahill.blogspot.com.au/2012/02/debian-testing-wheezy-64-bug-in-debian.html).


Error:
Selecting previously unselected package gnuplot.
(Reading database ... 258722 files and directories currently installed.)
Unpacking gnuplot (from gnuplot_4.6.0-1_amd64.deb) ...
dpkg: error processing gnuplot_4.6.0-1_amd64.deb (--install):
 trying to overwrite '/usr/local/share/info/dir', which is also in package octave 3.6.1-1
dpkg-deb: error: subprocess paste was killed by signal (Broken pipe)
Errors were encountered while processing:
 gnuplot_4.6.0-1_amd64.deb
Solution:

 sudo dpkg --force-overwrite -i gnuplot_4.6.0-1_amd64.deb

17 March 2012

109. Building Thunderbird 11 on Debian testing

The build is fairly straightforward and pretty much identical to building 10.0.2 (earlybird): http://verahill.blogspot.com.au/2012/02/debian-testing-wheezy-64-building.html

As always, uninstall existing versions before installing a new one.

--start here --
First install the dependencies:
sudo apt-get install libdbus-glib-1-dev gir1.2-notify-0.7 libnotify-dev yasm checkinstall libzip-dev zip 


Download the sources  and untar:
mkdir ~/tmp
cd ~/tmp

wget ftp://ftp.mozilla.org/pub/mozilla.org/thunderbird/releases/11.0/source/thunderbird-11.0.source.tar.bz2
tar -xvf thunderbird-11.0.source.tar.bz2
cd comm-release/

Start the build
./configure --disable-necko-wifi
make -j3

3 is the number of cores +1. If you have a quadcore CPU substitute 3 with 5. The build takes a while so you will probably want to do a parallel build.
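
If you don't remember how many cores a box has, the number can be worked out for you. A tiny Python 3 sketch of the cores + 1 rule of thumb; make_jobs is a made-up name:

```python
import os


def make_jobs(cores):
    """Number of parallel make jobs using the cores + 1 rule of thumb."""
    return cores + 1


detected = os.cpu_count() or 1  # cpu_count() can return None
print("make -j%d" % make_jobs(detected))
```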

Finally, to install
sudo make install



checkinstall is segfaulting for me.

Error:

/home/me/tmp/comm-release/mozilla/js/src/config/nsinstall -R -m 644 ../mozilla-config.h ../../../config/nsStaticComponents.h  ../../../dist/include
make[5]: /home/me/tmp/comm-release/mozilla/js/src/config/nsinstall: Command not found
make[5]: *** [export] Error 127
make[5]: Leaving directory `/home/me/tmp/comm-release/mozilla/js/src/config'
make[4]: *** [export] Error 2
make[4]: *** Waiting for unfinished jobs....
make[4]: Leaving directory `/home/me/tmp/comm-release/mozilla/js/src'
make[3]: *** [export_tier_js] Error 2
make[3]: Leaving directory `/home/me/tmp/comm-release/mozilla'
make[2]: *** [tier_js] Error 2
make[2]: Leaving directory `/home/me/tmp/comm-release/mozilla'
make[1]: *** [default] Error 2
make[1]: Leaving directory `/home/me/tmp/comm-release/mozilla'
Solution:
I got this error because I accidentally untared the new sources into an existing directory with an older version of thunderbird. The solution was to delete the directory and untar the sources again.


15 March 2012

108. Building local version of sinfo without root/sudo on ROCKS/CentOS

Edit 04/04/2012: there were several errors and omissions. These have been fixed now.

Because I don't want to mess up a cluster which is on a different continent I'm trying to use my superuser powers as little as possible.

Here's how to make a local version of sinfo -- you'll still need to make sure sinfod runs as a service on all the nodes.

There's no reason the instructions here shouldn't work on most linux distros, including Debian.

boost:
cd ~/tmp
wget -O boost_1_49_0.tar.gz http://sourceforge.net/projects/boost/files/boost/1.49.0/boost_1_49_0.tar.gz/download
tar -xvf boost_1_49_0.tar.gz
cd boost_1_49_0/
./bootstrap.sh --prefix=/export/home/me/.libboost


Edit tools/build/user-config.jam and add
using mpi ;
The space between mpi and ; is needed.

Start installation:
./b2 install

cd /export/home/me/.libboost/lib
ln -s libboost_signals.so libboost_signals-mt.so
ln -s libboost_serialization.so libboost_serialization-mt.so
ln -s libboost_date_time.so libboost_date_time-mt.so
ln -s libboost_wserialization.so libboost_wserialization-mt.so
ln -s libboost_regex.so libboost_regex-mt.so


asio:
cd ~/tmp
wget -O asio-1.5.3.tar.bz2 "http://downloads.sourceforge.net/project/asio/asio/1.5.3%20%28Development%29/asio-1.5.3.tar.bz2?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Fasio%2F&ts=1331441086&use_mirror=aarnet"
tar -xvf asio-1.5.3.tar.bz2
cd asio-1.5.3/

./configure --prefix=/export/home/me/.asio --with-boost=/export/home/me/.libboost/include
make
make install

sinfo/d:
wget http://www.ant.uni-bremen.de/whomes/rinas/sinfo/download/sinfo-0.0.45.tar.gz
tar -xvf sinfo-0.0.45.tar.gz
cd sinfo-0.0.45/

export LIBS=-L/export/home/me/.libboost/lib
export LDFLAGS=$LIBS
export CPPFLAGS="-I/export/home/me/.libboost/include -I/export/home/me/.asio/include/"
./configure --prefix=/export/home/me/.sinfo --disable-IPv6
make

make install 

Getting started:
In order to make something happen at boot you need sudo/root access. However, HPC clusters are rarely rebooted, so even if you launch something as a user it will persist for a long time. If you're lucky the right ports are open -- and they should be open between nodes.

You also need to add this to your ~/.bashrc:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/export/home/me/.libboost/lib

Start sinfod (the daemon) using:
~/.sinfo/sbin/./sinfod --quiet

ps aux |grep sinfod 
will show whether it's running

And check that everything is ok using
~/.sinfo/bin/./sinfo



13 March 2012

107. Fun with gnu screen -- setting up a screenrc

This is going to be a short post.

The problem:
If I have three nodes on a cluster, each with htop installed, how do I set up gnu screen so that it automatically signs in to each node, starts a copy of htop there, and displays the output in a split window? Basically, how do you open screen instances and inject commands into them?

...like so. Guess which computer is running a full gnome-shell environment...


The solution:
create a file called multitop.screenrc:

screen ssh 192.168.1.101
stuff htop\015
title node01
split
focus
screen ssh 192.168.1.102
title node02
stuff htop\015
split
focus
screen ssh 192.168.1.103
title node03
stuff htop\015
stuff string injects string into the active window. \015 is the octal code for carriage return, so it basically injects <enter>.

start screen using
screen -c multitop.screenrc 
and as usual, don't exit using exit or q, but use Ctrl+a d to detach and leave it running.
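
With more than a handful of nodes, typing the screenrc by hand gets tedious. Here is a minimal Python sketch that prints the same kind of file for a list of hosts. It is not part of the original setup: build_screenrc is a made-up helper name, and the node01, node02... titles simply follow the numbering used above.

```python
def build_screenrc(hosts, command="htop"):
    """Return screenrc text that opens one split region per host."""
    lines = []
    for index, host in enumerate(hosts, start=1):
        if index > 1:
            # Every node after the first needs a new region, with focus moved to it.
            lines += ["split", "focus"]
        lines += [
            f"screen ssh {host}",
            f"title node{index:02d}",
            # \015 is octal for carriage return, i.e. pressing <enter>.
            f"stuff {command}\\015",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    nodes = ["192.168.1.101", "192.168.1.102", "192.168.1.103"]
    print(build_screenrc(nodes), end="")
```

Redirect the output to multitop.screenrc and start screen with -c as before.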


To make things a bit prettier and to enable users to reconnect to a running session, edit /etc/screenrc:

multiuser on
acladd me
defscrollback 5432
termcapinfo xterm|xterms|xs|rxvt ti@:te@
caption     always        "%{+b rk}%H%{gk} |%c %{yk}%d.%m.%Y | %72=Load: %l %{wk}"
hardstatus alwayslastline "%?%{yk}%-Lw%?%{wb}%n*%f %t%?(%u)%?%?%{yk}%+Lw%?"
and
sudo chmod +s /usr/bin/screen
sudo chmod 755 /var/run/screen


If you start a session using e.g.
screen -S mytest
as user me

then other users can connect using
screen -x me/mytest

106. htop 1.0.1 and sinfo-0.0.45 on ROCKS 5.4.3/CentOS 5.6

There are a number of performance monitor tools in the debian repos. ROCKS 5.4.3/Centos doesn't seem quite as well-equipped.

First up, htop:

htop:
wget http://downloads.sourceforge.net/project/htop/htop/1.0.1/htop-1.0.1.tar.gz
tar -xvf htop-1.0.1.tar.gz
cd htop-1.0.1/
./configure --prefix=/home/me/.htop
make
make install

It's as simple as that.
Add e.g.
alias htop='/home/me/.htop/bin/htop'
to your ~/.bashrc
Note: this works on Scientific Linux (boron) 5.4 as well.

sinfo:
Update 13/03/2012:
Sinfo <=0.0.44 has IPv6 enabled by default.
On sinfo >=0.0.45 you can disable IPv6 using ./configure --disable-IPv6

Sinfo is probably the snazziest cluster monitoring tool that I know of. Sure, ganglia etc. are nice too, but they run as a web service. Sinfo is a 'simple' curses program, but building it on CentOS was a bit of a challenge.

Be aware that sinfo versions prior to 0.0.45 expect IPv6 to work -- by default ROCKS disables IPv6, so use sinfo 0.0.45 or above.
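
Since 0.0.45 is easy to misread as 0.045, here is a small Python sketch of the version check, assuming plain dotted-integer version strings. Both function names are made up for illustration.

```python
def parse_version(text):
    """Turn a dotted version string such as '0.0.45' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def can_disable_ipv6(version):
    """True for sinfo versions that accept ./configure --disable-IPv6."""
    return parse_version(version) >= (0, 0, 45)


if __name__ == "__main__":
    for candidate in ("0.0.44", "0.0.45"):
        print(candidate, can_disable_ipv6(candidate))
```

Comparing tuples of integers avoids the trap of comparing the strings, or of reading the version as a decimal fraction.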





First boost:
(yum install boost-devel didn't do anything for me)
cd ~/tmp
wget -O boost_1_49_0.tar.gz http://sourceforge.net/projects/boost/files/boost/1.49.0/boost_1_49_0.tar.gz/download
tar -xvf boost_1_49_0.tar.gz
cd boost_1_49_0/
./bootstrap.sh --prefix=/usr

Edit Jamroot and add
using mpi ;
The space between mpi and ; is needed.

Symlink to your mpic++, e.g. if your mpic++ is in /opt/openmpi:
sudo ln -s /opt/openmpi/bin/mpic++ /usr/bin/mpic++

The following step takes a long time:
sudo ./b2 -a install --layout=versioned --build-type=complete

These days all the libboost libs are multithread aware (or so I hear), and in debian it turns out that the -mt.so libs are just symbolic links to the 'regular' libs.
sudo ln -s /usr/lib/libboost_signals.so /usr/lib/libboost_signals-mt.so
sudo ln -s /usr/lib/libboost_date_time.so /usr/lib/libboost_date_time-mt.so
sudo ln -s /usr/lib/libboost_serialization.so /usr/lib/libboost_serialization-mt.so
sudo ln -s /usr/lib/libboost_wserialization.so /usr/lib/libboost_wserialization-mt.so
sudo ln -s /usr/lib/libboost_regex.so /usr/lib/libboost_regex-mt.so

sudo ln -s /usr/lib/libboost_signals.so.1.49.0 /usr/lib64/libboost_signals.so.1.49.0
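
The -mt links can also be made in one go. This is a rough Python sketch, assuming the unversioned libboost_*.so files already exist in the target directory. link_mt_variants is a hypothetical helper, and it creates relative links rather than the absolute ones above.

```python
import os

# The libraries that get a -mt alias in this post.
LIBS = ["signals", "date_time", "serialization", "wserialization", "regex"]


def link_mt_variants(libdir, names=LIBS):
    """Create libboost_<name>-mt.so -> libboost_<name>.so where missing."""
    created = []
    for name in names:
        target = f"libboost_{name}.so"
        link = os.path.join(libdir, f"libboost_{name}-mt.so")
        if not os.path.exists(os.path.join(libdir, target)):
            continue  # library was not built; nothing to link to
        if os.path.lexists(link):
            continue  # leave existing files or links alone
        os.symlink(target, link)  # relative link, like running ln -s inside libdir
        created.append(link)
    return created
```

For the system-wide install it would be run as root with link_mt_variants("/usr/lib"). It returns the list of links it created, so a second run returns an empty list.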

Then asio
cd ~/tmp
wget -O asio-1.5.3.tar.bz2 "http://downloads.sourceforge.net/project/asio/asio/1.5.3%20%28Development%29/asio-1.5.3.tar.bz2?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Fasio%2F&ts=1331441086&use_mirror=aarnet"
tar -xvf asio-1.5.3.tar.bz2
cd asio-1.5.3/
./configure
make
sudo make install

Then sinfo
cd ~/tmp
wget http://www.ant.uni-bremen.de/whomes/rinas/sinfo/download/sinfo-0.0.45.tar.gz
tar -xvf sinfo-0.0.45.tar.gz
cd sinfo-0.0.45/
./configure --disable-IPv6
make
sudo make install

The build should be fine.

Configuration:
you'll end up with
/usr/local/sbin/sinfod
/usr/local/bin/sinfo
You may want to make sure there are paths to them by adding the following to your ~/.bashrc:
export PATH=$PATH:/usr/local/bin:/usr/local/sbin
The changes take effect next time you log in to a shell, or just run
source ~/.bashrc
for immediate effect.

Also, create a file called /etc/default/sinfo with the following in it:
OPTS="--quiet --bcastaddress=192.168.1.255"

Start sinfod with
sinfod --quiet --bcastaddress=192.168.1.255

then check that it's running
ps aux | grep sinfod

If it's not running, then try
sinfod -F

If it gives something along the lines of
exception:open:address family not supported
you most likely
1) haven't enabled IPv6 for your interface and
2) didn't disable IPv6 during compilation and/or
3) used version <0.0.45

Check by doing ifconfig -- does it return both an ipv4 and an ipv6 address?
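
Another way to check is to try what sinfod itself does, which is to open an IPv6 UDP socket. Here is a minimal Python sketch, with ipv6_udp_available as a made-up name. It only tells you whether the kernel accepts the address family, not whether your interface has a usable IPv6 address.

```python
import socket


def ipv6_udp_available():
    """Return True if an IPv6 UDP socket can be created on this machine."""
    if not socket.has_ipv6:
        return False  # this Python build has no IPv6 support at all
    try:
        sock = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    except OSError:
        # Typically EAFNOSUPPORT: address family not supported.
        return False
    sock.close()
    return True


if __name__ == "__main__":
    print("IPv6 UDP sockets available:", ipv6_udp_available())
```

If this prints False, a sinfod built with IPv6 enabled can be expected to fail with the exception above.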

Enabling ipv6
Unless you know what you're doing, don't fiddle with the network interfaces on a production cluster -- network interfaces on a multinode cluster are typically highly tuned to minimise latency, so don't mess it up.

Anyway. First check your /etc/modules.conf and - if present - comment out
alias ipv6 off
options ipv6 disable=1
Edit your /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1:0
IPADDR=192.168.1.111
NETMASK=255.255.255.0
BOOTPROTO=none
MTU=1500
TYPE=Ethernet
GATEWAY=192.168.1.1
USERCTL=no
IPV6INIT=yes
PEERDNS=yes
ONPARENT=yes
IPV6ADDR=fe80::2f0:4dff:f383:b44/64
IPV6_DEFAULTGW=fe80::2f0:4dff:fe83:a48/64
I just made up the IPV6ADDR, and took the IPV6_DEFAULTGW from my gateway machine (running debian, so IPv6 is enabled by default).

Assuming that your firewall allows traffic on port 60003 and free traffic in and out on 192.168.1.255, things should work fine.
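
The 192.168.1.255 passed to --bcastaddress is just the broadcast address of the subnet defined by IPADDR and NETMASK. If your network differs, this small Python sketch works it out using the standard ipaddress module. broadcast_address is a made-up helper name.

```python
import ipaddress


def broadcast_address(ipaddr, netmask):
    """Return the IPv4 broadcast address for an interface address and netmask."""
    interface = ipaddress.ip_interface(f"{ipaddr}/{netmask}")
    return str(interface.network.broadcast_address)


if __name__ == "__main__":
    # Values from the ifcfg-eth1 example in this post.
    print(broadcast_address("192.168.1.111", "255.255.255.0"))  # 192.168.1.255
```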



Errors


Error (boost):
MPI auto-detection failed: unknown wrapper compiler mpic++
Please report this error to the Boost mailing list: http://www.boost.org
You will need to manually configure MPI support.
Solution:
make sure you've symlinked to your mpic++ instance in /usr/bin
e.g. if your mpic++ is in /opt/openmpi/bin/mpic++
sudo ln -s /opt/openmpi/bin/mpic++ /usr/bin/mpic++


Error (sinfo):
message.cc: In member function 'void Message::popFrontMemory(void*, size_t)':
message.cc:183: error: 'memory' was not declared in this scope
message.cc:193: error: 'boost' has not been declared
message.cc:193: error: expected primary-expression before 'char'
message.cc:193: error: expected `;' before 'char'
message.cc:196: error: 'newMemory' was not declared in this scope
message.cc:196: error: 'memory' was not declared in this scope
make[2]: *** [message.lo] Error 1
make[2]: Leaving directory `/state/partition1/home/me/tmp/sinfo-0.0.44/libmessage'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/state/partition1/home/me/tmp/sinfo-0.0.44/libmessage'
make: *** [all-recursive] Error 1
Solution:
You need to make sure that the libs are found -- either symlink manually between your build directory and /usr/lib, or use bootstrap.sh --prefix=/usr. See above for how to do it.

Error (sinfo):
udpmessagereceiver.h:14: error: 'asio' has not been declared
udpmessagereceiver.h:14: error: ISO C++ forbids declaration of 'endpoint' with no type
udpmessagereceiver.h:14: error: expected ';' before 'sender_endpoint'
udpmessagereceiver.h:16: error: 'asio' has not been declared
udpmessagereceiver.h:16: error: ISO C++ forbids declaration of 'io_service' with no type
udpmessagereceiver.h:16: error: expected ';' before '&' token
udpmessagereceiver.h:17: error: 'asio' has not been declared
udpmessagereceiver.h:17: error: ISO C++ forbids declaration of 'socket' with no type
udpmessagereceiver.h:17: error: expected ';' before 'sock'
udpmessagereceiver.h:20: error: expected ',' or '...' before '::' token
udpmessagereceiver.h:20: error: ISO C++ forbids declaration of 'asio' with no type
udpmessagereceiver.h:23: error: 'asio' has not been declared
udpmessagereceiver.h:23: error: expected `)' before '&' token
udpmessagereceiver.cc:5: error: 'asio' has not been declared
udpmessagereceiver.cc:5: error: expected `)' before '&' token
make[1]: *** [udpmessagereceiver.lo] Error 1
make[1]: Leaving directory `/state/partition1/home/me/tmp/sinfo-0.0.44/libmessageio'
make: *** [all-recursive] Error 1

Solution: you've only got boost::asio installed, not the independent asio. See above for how to compile and install asio.

Error (sinfo):

/usr/bin/ld: cannot find -lboost_signals-mt
collect2: ld returned 1 exit status
make[2]: *** [sinfod] Error 1
make[2]: Leaving directory `/state/partition1/home/me/tmp/sinfo-0.0.44/sinfod'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/state/partition1/home/me/tmp/sinfo-0.0.44/sinfod'
make: *** [all-recursive] Error 1
Solution:
You need a symlink pointing from /usr/lib/libboost_signals-mt.so to /usr/lib/libboost_signals.so
sudo ln -s /usr/lib/libboost_signals.so /usr/lib/libboost_signals-mt.so

Error (sinfod):
sinfod --quiet --bcastaddress=192.168.1.255 gives nothing and sinfod exits silently immediately
sinfod -F gives
exception:open:address family not supported
Here's the relevant strace output:
[..]
 socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 6
[..]
 socket(PF_INET6, SOCK_DGRAM, IPPROTO_UDP) = -1 EAFNOSUPPORT (Address family not supported by protocol)
futex(0x333a40d350, FUTEX_WAKE_PRIVATE, 2147483647) = 0
close(6)                                = 0
close(3)                                = 0
close(4)                                = 0
close(5)                                = 0
write(2, "Exception: ", 11)             = 11
write(2, "open: Address family not support"..., 46) = 46
write(2, "\n", 1)                       = 1
exit_group(0)                           = ?

Solution: enable IPv6 (see above), or rebuild sinfo >=0.0.45 with ./configure --disable-IPv6