23 July 2013

483. MS data, part III: generating a matrix by combining several spectra, and plotting it in gnuplot

This post is primarily intended for two particular students. However, the problem it addresses is something that a lot of spectrometrists/scopists who want to take their data presentation to the next level have encountered.

My presumption:
You're running linux.

You've already exported your data as csv files as shown in this post: http://verahill.blogspot.com.au/2013/07/474-exporting-data-from-wsearch32-and.html

In addition, for the specifics in the commands below I will presume that this data is based on a cone voltage sweep from 0 to 300 in 10 volt steps. I thus have a series of files named: 0.csv, 10.csv, 20.csv..190,300.csv.

You should be able to easily customize the approach to e.g. time or concentration dependent data.

Let's get started:
0. Pre-reqs
Make sure you have gawk, sed, xargs, gnuplot, paste, python installed. On debian do
sudo apt-get install gawk sed xargs gnuplot paste python

1. Convert the csv files to dat files
Create the following script and call it csv2dat.sh
#!/bin/bash for e in 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 do tail -n +8 $e.csv | sed 's/\,/\t/g'| gawk '{print $2,$4}' > $e.dat done
and run it
sh csv2dat.sh

If all went well, you'll have a series of tab-separated .dat files which contain the m/z and the relative abundance (not absolute).

2. Extract ALL m/z values from all files
Create a file called homogenize.sh and put the following in it:
#!/bin/bash for e in {0..200..10} {210..300..10} do cat $e.dat | gawk '{print $1}' echo "" done
We'll run the homogenize script (it does nothing of the sort though), and then use the unix tools unique and sort to get rid of all non-unique m/z values, and to sort them in reverse numerical order:
sh homogenize.sh > allmz.dat
uniq allmz.dat temp.dat
sort -gr temp.dat > mz.dat

3. Pad the data with zeroes
Create a file called makelist.py, and put the following in it. Watch out for tab lengths etc. It's written for python 2.x, and probably won't work under python 3. It was also hacked together from an earlier script which didn't quite work the way I hoped it would.

#!/usr/bin/env python
import sys
from numpy import linspace
infile=sys.argv[1]

f=open(infile,'r')
arr=[]

print "Read %s" %infile
for line in f:
 line=line.rstrip('\n')
 try:
  arr+=[round(float(line),3)]
 except:
  pass
  #print line
f.close

mylist=arr
mylist.sort(reverse=True)

print "Calculating spacing"
spacing=1.0
old=max(mylist)
for i in range(0,len(mylist)):
 if round(abs(old-mylist[i]),3)<spacing and not (abs(old-mylist[i])==0):
  spacing=round(abs(old-mylist[i]),3)  
 old=mylist[i]

values=1+(max(mylist)-min(mylist))/spacing
print "Max, min, resolution: ",max(mylist),min(mylist),spacing
completelist=linspace(max(mylist),min(mylist),values).tolist()
mylist=completelist

voltages=[0,10,20,30,40,50,60,70,80,90,100,110,120,130,140,150,160,170,180,190,200,210,220,230,240,250,260,270,280,290,300]
myys=[0]*len(mylist)

for n in voltages:
 print "voltage: ",n,'\n'
 f=open(str(n)+'.dat','r')
 g=open(str(n)+'pad.dat','w')
 arrx=[]
 arry=[]

 for line in f:
  line=line.rstrip('\n')
  line=line.split(' ')
  try:
   line[0]=round(float(line[0]),3)
   line[1]=float(line[1])
   arrx+=[line[0]]
   arry+=[line[1]]
  except:
   pass

 for i in range(0,len(arrx)-1):
  try:
   myys[mylist.index(arrx[i])]=arry[i]
  except:
   a=0   

 for i in range(0,len(myys)-1):
  g.write(str(myys[i])+'\n')
f.close
g.close

h=open('mz.x','w')

for i in range(0,len(mylist)-1):
 h.write(str(mylist[i])+'\n')
h.close

Run
python makelist.py allmz.dat

Getting 'fail' messages is ok -- most likely it's due to an empty line. You can check that everything worked out by doing e.g.
wc 0pad.dat 220pad.dat

The numbers in the first column should be the same if the files have the same number of lines.

4. Make a matrix
Paste all the ms data side-by-side.
paste 0pad.dat 10pad.dat 20pad.dat 30pad.dat 40pad.dat 50pad.dat 60pad.dat 70pad.dat 80pad.dat 90pad.dat 100pad.dat 110pad.dat 120pad.dat 130pad.dat 140pad.dat 150pad.dat 160pad.dat 170pad.dat 180pad.dat 190pad.dat 200pad.dat 210pad.dat 220pad.dat 230pad.dat 240pad.dat 250pad.dat 260pad.dat 270pad.dat 280pad.dat 290pad.dat 300pad.dat > allpad.dat

5. Rotate the matrix
Create a script called rotate.sh:
gawk ' { for (i=1; i<=NF; i++) { a[NR,i] = $i } } NF>p { p = NF } END { for(j=1; j<=p; j++) { str=a[1,j] for(i=2; i<=NR; i++){ str=str" "a[i,j]; } print str } }' $1
and run
sh rotate.sh allpad.dat > matrix.rot.dat

6. Plot using gnuplot
See the following script for an example. Note that plotting in gnuplot using 'matrix' you don't get the benefit of proper axes labels. Instead we do a bit of on-the-fly maths to get the axes right. Specifically:
using (2999.3-(($1-1)/10)):(($2-1)*10):($3)

means that for the m/z axes ($1) we take the highest value (in our case 2999.3) and remove 0.1 m/z (our resolution) for each data point. This data is in each row. For the CV axes ($2), which goes down the columns in our matrix.rot.dat, we have thirty values. Each one corresponds to an increase in 10V starting at 0V, hence we multiply by 10. $3 is the intensity, which we don't need to fiddle with.

Save the following as cntr.gplt
set term png size 1000,1000 set output 'map.png' set zrange [-10:110] set yrange [0:300] unset surface set contour base set cntrparam levels 15 set view 0,0 unset ztics unset key splot 'matrix.rot.dat' matrix using (2999.3-($1/10)):($2*10):($3) with lines palette

Running
gnuplot cntr.gplt

VERY SLOWLY (in my case I had 0.9 M data points) gives us

If you're confused as to why the data doesn't go beyond 250 Volt (y axis) it's because I made a mistake at one point.

Changing the ranges a bit we get
And even more zoomed in:

Soon to come as a separate post:
 the same data, but as a stacked plot. Here's what it looks like though:

No comments:

Post a Comment