17 October 2013

520. New node: AMD FX 8350/32 Gb RAM/990 FX

Update 5 Nov 2013: Note that the motherboard doesn't support the CPU and this leads to spontaneous reboots under certain conditions. Make sure to check the list of supported CPUs for the motherboard you use (in retrospect, obvious -- but as a linux person you get used to ignoring those things since everything's for just OSX or Win).

See here for the troubleshooting thread:
 http://verahill.blogspot.com.au/2013/10/523-random-reboots-troubleshooting-in.html

Also see this thread: http://www.techpowerup.com/forums/showthread.php?t=184061
I'll need to read up on...stuff...but the bottom line seems to be that one would expect issues with this board/cpu combo:

"Still only a 4+1 phase board the FX chips pull a bit more power than that can put out comfortably and stable. [..] Those would be your three best to choose from all are the better 8+2 phase designs..."
and
"my opinion is to stay away from the asus FX ive seen many people asking why their boards are throttling at full load, vrm protection causes voltages to drop at full load when vrms hit a certain temp."

and it seemed that low (CPU) voltages precipitated crashes.

Original post:
So I built a new node at the beginning of October 2013, using the following parts:
  • AMD FX 8350 CPU
  • 4*8 Gb GSkill RAM
  • ASRock 990FX Extreme3 motherboard
  • 1 Tb Seagate Barracuda HDD
  • MSI N210 graphics card
  • ASUS NX1101 Gigabit NIC
  • Corsair GS700 PSU
  • Antec GX700 case
NOTE that I'm having issues with spontaneous reboots during extended periods (days) of heavy load (100% CPU) which do not appear to be associated with faulty RAM, so you might want to think twice before using the exact same permutation of parts as is listed above. Most likely there's a power issue -- either the PSU isn't supplying enough juice to the Mobo, or the Mobo isn't supplying enough power to the CPU. Note also that the CPU isn't listed as an officially supported CPU by the motherboard manufacturer.*

See here for my troubleshooting thread: http://verahill.blogspot.com.au/2013/10/523-random-reboots-troubleshooting-in.html
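
Since the suspicion is that low CPU voltage under load precedes the reboots, one way to collect evidence is to log sensor readings continuously and look at the last entries after a crash. Below is a minimal sketch, assuming the `sensors` command from lm-sensors (installed further down) reports something useful for this board. The function name, log path, interval and sample count are my own placeholders, not something taken from the troubleshooting thread.

```shell
# Append timestamped sensor readings to a log file at a fixed interval.
# SENSORS_CMD defaults to `sensors` from lm-sensors; the function name and
# all arguments are illustrative assumptions.
log_sensors() {
    logfile=$1      # file to append readings to
    interval=$2     # seconds between samples
    count=$3        # number of samples to take
    i=0
    while [ "$i" -lt "$count" ]; do
        {
            date '+%Y-%m-%d %H:%M:%S'
            ${SENSORS_CMD:-sensors}
        } >> "$logfile"
        sync        # flush to disk so the last sample survives a hard reboot
        i=$((i + 1))
        [ "$i" -lt "$count" ] && sleep "$interval"
    done
    return 0
}
```

For example, `log_sensors ~/sensors.log 10 8640` would take one sample every 10 seconds for roughly a day. Run it inside screen so that it keeps going after you log out.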

So the main value of this post is the photos, which show how easy it is to build a computer. Just don't...well...build this particular computer -- use a different motherboard.

The other value of the post is purely personal -- I just wanted to write down the steps to take whenever installing a new node in my little cluster.

There will be a post later on troubleshooting (and hopefully fixing) the issue of the spontaneous reboots.

*I've built a number of computers for myself as well as for other people and haven't had any issues (other than bad RAM) before. I got lazy this time and am paying for it.

The first step was assembly:

I like the case -- it's metal and feels robust. Having two fans on top is a definite plus, as it works well with my home-built rack.

Note that the case doesn't come with a printed manual -- to get the manual you need to go online. And it still falls short -- there's no guide as to how to use the many, many cables it comes with. However, it's not rocket science either. Turns out that the case has a molex plug for powering the case fans. So plug the molex plug into the PSU, then plug the fans into the four weird plug/cables that the case comes with. Note that the mobo has no connector for the internal USB 3 cable that the case comes with.

See here for more details re the case: http://www.hardocp.com/article/2013/09/12/antec_gx700_atx_computer_case_review
Case, closed

The glorious innards of The Case.

The GX700 does not come with a mobo panel -- not that they tend to be useful anyway

Luckily all (most?) mobos come with their own panels -- push it in place before doing anything else. It can need a bit of negotiation in order to snap in properly.

The case came with riser nuts, four PSU screws and lots of screws for the mobo.

Put the riser nuts in the case -- they are the golden thingies

And here's the motherboard. I don't know if there are universally accepted recommendations, but I prefer to install the CPU and RAM onto the motherboard before installing it in the case -- you have more space to maneuver and the risk of breaking the board is smaller.

The heatsink (left) and CPU (right)

The heatsink comes with thermal paste pre-applied. Don't touch it -- you want it to be as smooth and even as possible.

Get the CPU out

Note the yellow triangle in the bottom right corner in the picture

That should match up with the triangle in the bottom left of this picture. Note the raised lever on the right side of the CPU socket.

The CPU in place. Note the raised lever. There should be no pushing -- the CPU should fit perfectly without any force whatsoever. If you bend a pin...then good luck.

The lever is in the locked position.

Next put the heatsink on. Give this a bit of thought as you won't want to have to reseat it several times (in the worst case you'll have to go buy some thermal paste, clean the heatsink and reapply the paste). So make sure you line up the fasteners before pushing the heatsink in place.

Everything is locked down.

The motherboard with the processor and heatsink in place. In the picture the CPU fan is attached to the WRONG connector. Look for a connector saying 'CPU FAN' (in the picture it's attached to POWER FAN)

Open the RAM slot fasteners, and push the RAM sticks in place firmly, but without excessive violence. Once the fasteners snap shut by themselves the sticks are properly seated. Improperly seated RAM sticks tend to prevent you from booting and lead to a lot of noise.

All four RAM sticks in place, and the motherboard attached to the case via seven screws that screw into the riser nuts.

The PSU is in place.

Main power and auxiliary power cables attached.

This particular case has a special tray for the hard drives.

Hard drive in place

SATA data and power cables attached

The other end of the SATA data cable attaches to the motherboard (SATA 1)

After a bit of rewiring. 

PCI NIC and PCI-E graphics cards in place.
And below is a picture of the cluster -- each node is connected to a gigabit WAN (192.168.2.0/24) router and a gigabit LAN switch (192.168.1.0/24). 8/32 means 8 cores, 32 Gb RAM. The cluster 'runs' on the LAN. Each of the four nodes in the picture (there are two three-core nodes in addition) is connected to a KVM. Jobs are managed using SGE.

It's questionable whether one can really call it a cluster though since I run each job on a single node for performance reasons. It still attracts attention from visitors to my office though.



Software:
I then installed Debian Wheezy on it. During the installation I was notified that I might want to consider enabling non-free to get the firmware for r8169 and tg3.

So after enabling non-free in the sources I did:
sudo apt-get install firmware-realtek firmware-linux-nonfree

Didn't seem to change anything though -- everything was working fine before too.

I also installed amd64-microcode which, if I understand things correctly, should obviate the need for some of the full BIOS updates.

Other little housekeeping things:
I first sorted out
INIT: Id "co" respawning too fast: disabled for 5 minutes
as shown here: http://verahill.blogspot.com.au/2012/01/debian-testing-64-wheezy-small-fixes.html

I then installed a few basic things:
sudo apt-get install vim screen sinfo gawk lm-sensors

and made a ~/.vimrc:
set number
set pastetoggle=<f3>
nnoremap <f4> :set nonumber!<CR>

And set vim as the default editor in lieu of nano:
sudo update-alternatives --config editor

I edited /etc/default/sinfo to make it use the correct network:
OPTS="${OPTS} --quiet --bcastaddress=192.168.1.255"
I set up 'static' dhcp on the WAN router.

On the node, I then sorted out /etc/network/interfaces  to use dhcp on eth1 and 192.168.1.180 on eth0, and to route everything properly (i.e. local traffic over eth0, and everything else over eth1):

auto lo
iface lo inet loopback

auto eth1
iface eth1 inet dhcp

auto eth0
iface eth0 inet static
address 192.168.1.180
gateway 192.168.1.1
netmask 255.255.255.0

post-up ip route flush all
post-up route add default eth1
post-up route add -net 192.168.1.0 netmask 255.255.255.0 gw 192.168.1.1 eth0
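
To check that the routing ends up as intended after bringing the interfaces up, you can ask the kernel which interface it would use for a given destination. This is a minimal sketch. The helper name `route_dev` is my own, and 8.8.8.8 is just an arbitrary non-local address.

```shell
# Print the interface name that follows "dev" in `ip route get` output
# read from stdin. The helper name is an illustrative assumption.
route_dev() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# Usage (addresses are examples; 192.168.1.1 is the LAN gateway above):
#   ip route get 192.168.1.1 | route_dev    # local traffic, should be eth0
#   ip route get 8.8.8.8     | route_dev    # everything else, should be eth1
```

If either command prints the other interface, the post-up lines did not take effect and `ip route show` is the next thing to look at.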

SGE won't work properly unless you edit /etc/hosts:
127.0.0.1       localhost
#127.0.1.1      oxygen
192.168.1.180   oxygen
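
The point of the edit is that the node name must map to its LAN address rather than to a loopback address. Here is a small sketch for checking that on a node. The function name `check_hosts` is hypothetical, and it only inspects the hosts file, not DNS.

```shell
# Succeed only if NAME has an entry in the hosts file and no active
# (uncommented) entry maps it to a 127.x.x.x loopback address.
# The function name is an illustrative assumption.
check_hosts() {
    name=$1
    file=${2:-/etc/hosts}
    awk -v name="$name" '
        /^[[:space:]]*#/ { next }                 # skip commented-out lines
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break              # ignore trailing comments
                if ($i == name) {
                    found = 1
                    if ($1 ~ /^127\./) bad = 1
                }
            }
        }
        END { exit (found && !bad) ? 0 : 1 }
    ' "$file"
}

# Usage: check_hosts oxygen && echo "hosts entry looks fine for SGE"
```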

The way my cluster works is that every node has its own shared folder.
mkdir ~/oxygen
mkdir ~/scratch
chmod 777 ~/oxygen

Export it as shown here: http://verahill.blogspot.com.au/2012/02/debian-testing-wheezy-64-sharing-folder.html
Set up ssh key login in both directions:
ssh-keygen
vim ~/.ssh/authorized_keys
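
Instead of pasting keys into authorized_keys by hand in vim, the same step can be scripted. This is a sketch, assuming the other node's public key has already been copied over as a file. The function name `add_key` is my own. It appends the key only if it isn't already present and sets the permissions sshd expects.

```shell
# Append a public key to an authorized_keys file unless it is already there.
# The function name and default path handling are illustrative assumptions.
add_key() {
    pubkey_file=$1
    auth_file=${2:-$HOME/.ssh/authorized_keys}
    mkdir -p "$(dirname "$auth_file")"
    chmod 700 "$(dirname "$auth_file")"
    touch "$auth_file"
    chmod 600 "$auth_file"
    key=$(cat "$pubkey_file")
    # -x: match the whole line, -F: fixed string, -q: quiet
    grep -qxF -e "$key" "$auth_file" || printf '%s\n' "$key" >> "$auth_file"
}

# Usage: add_key /path/to/other_node_id_rsa.pub
```

Where password login still works between the nodes, `ssh-copy-id user@host` does much the same in a single step.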

Then add the new node to the cluster: http://verahill.blogspot.com.au/2013/08/501-briefly-adding-new-node-to-sge.html
Build nwchem as shown here: http://verahill.blogspot.com.au/2013/05/424-nwchem-63-on-debian-wheezy.html
Set up gaussian as shown here: http://verahill.blogspot.com.au/2012/05/settiing-up-gaussian-g09-on-debian.html
Fix shmem: http://verahill.blogspot.com.au/2012/10/shmmax-revisited-and-shmall-shmmni.html

Finally, to address this issue regarding corrupt packets during SSH sessions, I added the following line to /etc/rc.local (it has to go before the final exit 0):
/sbin/ethtool -K eth1 rx off tx off
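
To confirm after a reboot that the ethtool setting actually took effect, the lowercase `-k` option lists the current offload settings. The sketch below parses that output. The function name is my own, and it assumes the driver reports the usual `rx-checksumming` and `tx-checksumming` lines.

```shell
# Read `ethtool -k IFACE` output on stdin; succeed only if both rx and tx
# checksum offloading are reported as off.
# The function name is an illustrative assumption.
checksum_offload_off() {
    awk '
        $1 == "rx-checksumming:" && $2 == "off" { rx = 1 }
        $1 == "tx-checksumming:" && $2 == "off" { tx = 1 }
        END { exit (rx && tx) ? 0 : 1 }
    '
}

# Usage: /sbin/ethtool -k eth1 | checksum_offload_off && echo "offload disabled"
```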
