Saturday 29 June 2019

CCP4 lcf (labelled column format) files were phased-out in the early 90's but some interesting data still lingers on half-inch tapes so, if they can be read, the lcf files can be converted to ASCII or other formats for safe-keeping, or further use. For the record, some of the lcf routines are kept here but it was too difficult for me to use them or understand all the fortran, so I thought I would try to read some lcf files with a Hex editor. 

Fortunately, the data are stored as two-byte signed integers (integer*2) and both VAX and Intel computers are little-endian, so the bit-order is the same. However, I have not worked out how to read all of the header information, so this method assumes that the unit cell and symmetry have been recorded separately. The only parts of the header that can be read easily, even with a normal editor, are the title and column labels, so make a note of these before starting. Since writing the above, I have (with some help) worked-out how to read the unit cell and this is explained towards the end of this page

If we take the example of the old VAX file H256C30.LCF, the best way to read the header info on linux is to say: strings H256C30.LCF. With a bit of interpreting, this tells us that the column labels are: H K L S FOH256 SOH256 FCH256 PHIH256, so we are looking at recovering 8 columns of data. The first three items should be smallish signed reflection integers (hkl) and the rest can be bigger numbers, although the last item (the phase) will be in the range 0-360 or +/-180 degrees

If we open the file with gHex and look at the bottom of the data, this is what we see. 



After a few frustrating failures, I got the hang of reading the information. You need to look at the 16-bit signed integer value of alternate bytes and try to spot the sort of numbers you are looking for. In the example shown, the pairs of bytes with yellow lines underneath read-off as: -7     8    22  2455   108    29   106    -6 which makes an awful lot of sense, so try going backwards and indeed we can read-off a few more lines as follows: 

    -9     7    22  2418    95    29   118   -99
    -8     7    22  2419   106    29   110   131
   -11     8    22  2468    95    31   154    47
   -10     8    22  2453    93    31   172   176
    -9     8    22  2446   144    35   104   -19
    -7     8    22  2455   108    29   106    -6

These look like very sensible values for h, k, l, 10000*S^2, Fo, sigma(Fo), Fc and Phi(calc), so can we try to find where the data starts? Back at the beginning of the file, this is how it looks. 



In the window on the right, we can see the title records, so aim for a dozen or so bytes to the right of this and we can indeed read-off our first reflection as:

     3     0     0    36   131    30   719     0

and others follow:

     4     0     0    63   342    77   646     0
     5     0     0    99   348    78   369     0
     6     0     0   142   244    55   200   180
     7     0     0   193   322    73   379     0

Note the index sort-order, the fact that S increases with index, that sigma(F) is generally smaller than F and we can see the values of 0 and 180 for centric phases. Hence, this all makes sense. 

To write a little script to read all the data in the file, we need to know the byte-offset at the beginning, i.e. the length of the header record in bytes so that we can tell the program to ignore that many bytes at the start of the file. To get this number we can simply count the squares of bytes making up the header, but a more streamlined way is to look at the offset of the first byte we want to read which is shown above in the lower left corner of the gHex window as 0x1E0. This is in hex so to convert it to decimals we go 16^2+14*16, or just type "0x1e0 to decimal" into Google, or change the gHex 'Preferences' so that it shows the offset in decimal. However you do it, the result is 480 which means that there are 480 bytes in the header preceding the first reflection item that we want to recover. Knowing the offset and the number of columns we want to read, we can enter these values in a little python3 script which is below:

file1 = open("H256C30.LCF", "rb")
file2 = open("h256edit.ref", "w")
Offset=480    # Number of bytes in header
Ncol=8        # Number of columns of data
header = file1.read(Offset)
print(header)
n=0
while True:
      twobyte=file1.read(2)
      if not twobyte: break
      twobyteinteger=int.from_bytes(twobyte,"little",signed=True)
      charint="{0:6d}".format(twobyteinteger)
      file2.write(charint)
      n+=1
      if n==Ncol:
              file2.write("\n")
              n=0
file1.close()
file2.close()

By typing python3 scriptname.py we can indeed convert the whole file to readable form and after a bit of tidying-up at the end (i.e. deleting junk lines), this can be converted to MTZ format using the current CCP4 suite (remember to miss out the S column since this is not needed). A bit more CCP4 work and, voila, we have a 2.2 Angstrom difference map that has not been seen since 1987. 




And another 2.1 Angstrom map from early 70's Hilger and Watts diffractometer data!



Footnote: I found that one or two of my VAX files were corrupted by the insertion of the hex characters "0D 0A" (which stand for carriage-return and line-feed) and one had these at exactly 1024 byte intervals, possibly from ftp. These can be removed by a global search and replace in gHex. Should there be a genuine "0D 0A" in the data, I think this will show-up in the general layout of the data items produced by the script, and the stray characters would instead have to be removed individually or by revising the script such that it ignored them at the requisite interval (e.g. after reading each byte-pair check if Offset/(1024+2) is an integer). 


Reading the unit cell dimensions 

The unit cell parameters are stored in the header before the column labels as six regularly-spaced 32-bit floats which can be found as just about the only meaningful numbers in this part of the header, as shown below.


These are shown in the 'Float 32 bit' box in exponential notation e.g. the first one is 2.13e+02 i.e. 2.13 x 10^2 or more simply 213.0. However, I was expecting a value close to 54 and was therefore confused by the fact that the stored number is almost 4 times bigger than it should be but, with help from the CCP4 bulletin board, an explanation was found. Apparently there is a difference in the way that VAXes stored floats compared with Intel computers such that reading a VAX float as an Intel float means that you have to divide it by 4 to get the true number. This effect is strikingly revealed by the orthogonal unit cell angles which come out as 360 degrees instead of 90 degrees. However, after division by 4, the unit cell parameters work out to be: a=53.25 Angstroms, b=73.5 Angstroms, c=45.5 Angstroms, alpha=90 degrees, beta=109.5 degrees and gamma=90 degrees, which are more as I was expecting. However, these are only accurate to within about 0.25 Angstroms or degrees and a fully accurate way of extracting the true values has been worked-out by Kay Diederichs who kindly posted his excellent solution in fortran here and a linux binary can be obtained here. Using Kay's program, the cell parameters work out to be a=53.4 A, b=73.8 A, c=45.6 A, alpha =90 deg, beta=109.5 deg and gamma=90 deg which are spot-on, so if you want the most accurate values, that is the way to do it. Kay's program will automatically read all the header and reflection data items, so there is no need to work out the header offset or specify the number of columns. 

What Kay has worked out helps to account for the only remaining cluster of bytes in the header that looked meaningful. I was thinking that these might contain information on the space group, but unfortunately they don't. The relevant bytes are in the first line shown below.



Apparently, the first byte-pair is related to the number of columns (-2 x Ncol, hence -16) and the second one is related to the combined length of the column labels and the title. Hence one could revise the python script to read these parameters from the LCF file and completely automate the conversion process, as Kay has done in his fortran program. 

Follow-up on reading the unit cell dimensions accurately

As mentioned above, using gHex to read the cell dimensions is only approximate. If we take the example of the first cell dimension shown two figures back which appears as 213.0 when we click on the 01 byte of the following sequence. 

... 01   00    55 43     9A 99 ...

However 213 is only roughly 4*a and the other bytes give nonsense when you click on them. The key thing I failed to realise is that the actual data is made up of the '55 43 9A 99' bytes. If you swap them so that they go in order '9A 99 55 43' then this is little-endian for the exact value of a*4 (i.e. 213.6). You can get the bits from gHex as below and convert them to decimal using the bitstring package in python:

    55       43        9A       99
01010101 01000011  10011010 10011001     Sign-bit underlined

Swap the 2-byte words over:
    9A      99        55       43
10011010 10011001  01010101 01000011     This is LE for 213.6

Convert from little endian to big.
01000011 01010101 10011001 10011010      This is BE for 213.6
   43       55        99      9A

In python3 say: 

import bitstring
from bitstring import BitArray
a=BitArray(bin='01000011 01010101 10011001 10011010') or a=BitArray(hex='4355999A')
a.float    (or a.floatle for LE)
213.60000610351562  

This can now be divided by 4 to give the exact cell parameter 53.4 Angstroms. If we consider the ideal big-endian byte-order as being 1, 2, 3, 4 where byte 1 contains the sign bit and most significant bit on the left, it appears that the Vax has them in the order 2, 1, 4, 3. Hence on swapping the two halves to give 4, 3, 2, 1 we get the little endian representation of the number we want. Saying in python3:

BitArray(hex='9a995543').floatle/4       gives

53.400001525878906

Overall, this is a pretty lengthy discussion of getting the cell dimensions, but at least it is possible to extract all of the information! 

Reading MTZ files

It turns out that we can use a similar Python script for reading MTZ files produced on machines with different endianness and byte-swapping. All of the data in these files are held as 4-byte floats and, with some adaptation, this sort of thing will do the job:

import struct
file1 = open("god1.mtz", "rb")  # Thanks to Harry Powell for this VAX MTZ file.
file2 = open("god1.ref", "w")
Offset=80    # Number of bytes in header
Ncol=6        # Number of columns of data
header = file1.read(Offset)
print(header)
n=0
while True:
      byte2=file1.read(1)  # byte swap from VAX 2143 order.
      byte1=file1.read(1)
      byte4=file1.read(1)
      byte3=file1.read(1)
      fourbyte=byte1+byte2+byte3+byte4  # big endian (BE) order
      n+=1
      fourbytefloat=struct.unpack('>f', fourbyte)[0]/4  # > BE, < LE, VAX/4
      if n<=3:
             fourbyteint=int(fourbytefloat)
             charint="{0:6d}".format(fourbyteint)
      else: charint="{0:8.2f}".format(fourbytefloat)
      file2.write(charint)
      if n==Ncol:
              file2.write("\n")
              n=0
file1.close()
file2.close()                       

The only thing that needs to be specified is the number of columns. MTZ files have the header information at the bottom (!) where the column labels, Ncol and unit cell, etc., can be read easily with gHex or by typing strings filename.mtz. The reflection data always start at byte 81, i.e. the byte with an offset of 80 (note, the documentation says its byte 21, but I think they mean word 21, where each word is 4 bytes) so I believe that the offset never needs to be determined. 
The script is not very sophisticated, so it will crash when it reaches the end of the input file and the output reflection file will have 50-100 junk lines at the bottom, which will need to be deleted. Apart from that it should be OK as long as the endianness and, if necessary, the VAX division by 4 have been sorted out. 

Having used the bitstring package, it crossed my mind that we should be able to use it instead of struct in the above script and indeed you can, as below.


import bitstring
from bitstring import BitArray
file1 = open("god1.mtz", "rb")
file2 = open("god1.ref", "w")
Offset=80    # Number of bytes in header
Ncol=6       # Number of columns of data
header = file1.read(Offset)
print(header)
n=0
while True:
      byte2=file1.read(1)  # byte swap from VAX 2143 order.
      byte1=file1.read(1)
      byte4=file1.read(1)
      byte3=file1.read(1)
      fourbytes=byte1+byte2+byte3+byte4  # big endian (BE) order
      fourbyte=BitArray(fourbytes)
      n+=1
      fourbytefloat=fourbyte.float/4  # use floatle if LE, VAX/4
      if n<=3:
             fourbyteint=int(fourbytefloat)
             charint="{0:6d}".format(fourbyteint)
      else: charint="{0:8.2f}".format(fourbytefloat)
      file2.write(charint)
      if n==Ncol:
              file2.write("\n")
              n=0
file1.close()
file2.close()

It works, but it is a bit slower.