SpmWithPentium4 - MRC CBU Imaging Wiki

Configuring Pentium 4 machines for SPM

A well-tuned pentium 4 machine is a very fast platform for running SPM. However, there are a couple of flies in this sweet-smelling ointment. The first fly is of moderate size, and I will refer to this as the NaN problem, for reasons that will soon be obvious. The second is a little more obscure; this is the Sparse Matrix problem. The NaN problem means that pentiums can be about 30 times slower than they should be for some aspects of SPM processing, such as writing F contrast images. The problem can be solved by recompiling some SPM and matlab functions. The sparse matrix problem results in moderately slow performance on statistical estimation. The solution is a very small change to an SPM matlab function.

Lastly, you might consider editing the SPM memory setting to match your system.

The end result of all this work is shown in the SPM benchmarks on a Pentium 4 system before and after optimization.

The short simple version

To cut to the chase, you will need a new version of a matlab math library, and the SPM mex files. If you don't want to wade through the rest of this document, you could follow this simple recipe.

For Linux:

  • Download the Linux precompiled library (http://imaging.mrc-cbu.cam.ac.uk/downloads/Intel_p4/atlas_P4SSE2_gcc.so.gz) to the [matlab] bin/glnx86 directory, and uncompress with gzip.

  • Download the SPM99 mex files (http://imaging.mrc-cbu.cam.ac.uk/downloads/Intel_p4/mex_spm99_p4_gcc.tgz) or SPM2 mex files (http://imaging.mrc-cbu.cam.ac.uk/downloads/Intel_p4/mex_spm2_p4_gcc.tgz) to your spm directory, and uncompress with gzip and tar. Note that these files have been compiled for matlab 6.5.

For Windows (sigh):

  • Download the Windows precompiled library (http://imaging.mrc-cbu.cam.ac.uk/downloads/Intel_p4/atlas_P4SSE2_win32.zip) to the [matlab] bin/win32 directory, and uncompress with WinZip or similar.

  • Note that SPM2 is distributed with Windows mex files for the Pentium 4, compiled with gcc; these are contained in a zip file in the distribution; to use them you should unzip the contained .dll files into a directory which is earlier on your matlab path than the SPM directory. If you are using SPM99, you may need to do one of the following:
  • Either: download the Windows gcc precompiled mex files (http://imaging.mrc-cbu.cam.ac.uk/downloads/Intel_p4/mex_spm99_p4_win32.zip) to your spm directory, and uncompress with WinZip or similar. Compiled for matlab 6.5 again.

  • or: download the Windows Intel compiler precompiled mex files (http://imaging.mrc-cbu.cam.ac.uk/downloads/Intel_p4/icl_spm_dlls.zip), kindly provided by Chris Rorden. Download to your spm directory, and uncompress with WinZip or similar. There are versions for spm99 and spm2b, and the files were compiled for matlab 5.3 - so they should work for any version >= 5.3.

For both:

  • Tell matlab about the new libraries, as described below.

  • For SPM99, apply the sparse/full and max memory edits, described below, to the SPM99 file spm_spm.m.

That's it. Your analyses will be quicker, your love life will be better, and your foreign policy objectives will be achievable by diplomacy.

Here is the long complex version:

The NaN problem

The default floating point calculations on pentiums 3 and 4 are very slow with not-a-number values (NaNs). This is important because SPM uses NaN values as markers for voxels for which it has not calculated any statistics. The problem can be solved by reconfiguring SPM and matlab. As a side-effect, the new configuration will run faster than the old across the board. This page collects various messages and links I found useful for this process. As the links will show, the solutions were worked out by Satra Ghosh, Chris Rorden and Tom Womack. To see the problem in action, try the matlab test program in the appendix. The output from the test program shows that matrix multiplication is over seven hundred times slower for a matrix of NaNs than a matrix of ones. This explains why the pentium can be so slow to write F contrast images. F contrasts involve matrix multiplication on the output from SPM statistics, which are full of NaNs.

You can show that this problem is not restricted to matlab, using simple programs written in C (see the appendix) or Delphi (http://www.cla.sc.edu/psyc/faculty/rorden/simd.html). Luckily the problem with NaNs does not occur with some optimized floating point operations on the pentium 4. Specifically, the SSE2 instructions (http://www.tom.womack.net/x86FAQ/faq_features.html) on the pentium 4 are the same speed for NaNs as they are for other floating point values. So, the NaN problem can be solved by recompiling the matlab math libraries and SPM C files to use SSE2 instructions. As a pleasant side-effect, the recompiled SPM C files are quicker than the standard versions - see the benchmarks for different configurations below.

Recompiling the Matlab math routines

Since version 6, Matlab has used highly optimized external math libraries for most computations. These libraries contain BLAS (http://www.netlib.org/blas/) and LAPACK (http://www.netlib.org/lapack/) routines that provide dramatic speed gains. In fact the libraries that matlab uses are from the open source ATLAS project (http://math-atlas.sourceforge.net/). So, we can solve the NaN matrix multiplication problem by providing Matlab with versions of these libraries that use the appropriate pentium instructions to do the multiplication. There are two options:

  1. Recompile the ATLAS libraries
  2. Use the Intel Math Kernel library

Recompile the ATLAS libraries

ATLAS, God bless it, is open-source. On Linux, ATLAS compiles with gcc by default. Using the standard configuration, this gives very fast routines, but leaves the libraries considerably slower for NaNs. This is because, although the code doing the main computational work uses SSE2 instructions, some of the error checking does not. In any case, you can solve this by recompiling ATLAS to make full use of SSE2. Either recompile with gcc, using appropriate flags for SSE2, or use the Intel C compiler, which is free for academic use. Satra Ghosh worked out how to do this with the Intel compiler; the following recipe is based on his method, Darren Gitelman's helpful update, and my own experience. If you are still stuck with Windows, the instructions below will also work, using the gcc within Cygwin (http://cygwin.com) - see the ATLAS / Windows page (http://www.kevinsheppard.com/research/MatlabAtlas/MatlabAtlas.htm).

To recompile with gcc, you will need gcc >= 3.1, which was (I think) the first version to implement the relevant SSE2 flags. I have only used 3.2 and 3.3 myself.

For the Intel C++ compiler for Linux (icc), download icc from Intel (http://www.intel.com/software/products/compilers/clin/noncom.htm) and install. I installed into /usr/local/intel. Don't use the Intel Fortran compiler with ATLAS; there is nothing to gain, as the Fortran compiler is only used to provide Fortran linking, and not for the maths. Before you start the compilation, source the relevant script to set up environment variables for the Intel compiler; in my case (using bash) this was with the command ". /usr/local/intel/compiler70/bin/iccvars.sh", as in the sketch below.
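
For example, in a bash session (assuming the /usr/local/intel install path above; adjust to your own):

# Source the compiler environment into the current shell. Note the
# "dot space": running the script as ./... would leave the calling
# shell's environment unchanged.
. /usr/local/intel/compiler70/bin/iccvars.sh
# Check that icc is now on the PATH
which icc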

First download the ATLAS source. The current stable version is 3.6, the development version is 3.7. The instructions below are for ATLAS 3.6. Unpack the source into a handy directory. You should now have a directory named "ATLAS"; cd to that directory.

Getting ready to configure ATLAS

Configure ATLAS

To make things a bit easier, I have written a script to configure ATLAS with some good defaults (see the appendix). This script fixes the L1 cache size as described below, as well as setting a flag giving the CPU cycles per second; this is useful when you are the only user of the CPU, and there are timing problems in compiling ATLAS. To run the script for gcc, use "sh atlas_sse_config.sh gcc"; for icc use "sh atlas_sse_config.sh icc". The scripts set many of the defaults to the values you will need below. If you are not running the script above, just run "make config" from the command line.

Here is a list of the questions and suggested answers for a config run. ATLAS puts the default answer in square brackets at the end of the question and before the colon. The ATLAS defaults here are the standard defaults; the defaults using the script above will contain many of the suggested answers. If there is nothing after the colon, this means you can accept the default. For more information, see the comments printed on screen during the configure process and the ATLAS website.

The questions below are from a linux system; there are a few extra questions using Cygwin and Windows, but the answers to these should be clear.

Enter number at top left of screen [0]:
Have you scoped the errata file? [y]:
Are you ready to continue? [y]:
enable Posix threads support? [y]:
use express setup? [y]: n
Enter Architecture name (ARCH) [Linux_P4SSE2]:
Enter Maximum cache size (KB) [512]: 1024

This last number should be double your actual cache size. In my case the cache was 512KB.

Enter File creation delay in seconds [0]:
Enter Top level ATLAS directory [/home/mb312/installs/ATLAS]:
Enter Directory to build libraries in [$(TOPdir)/lib/$(ARCH)]:
Enter Library name for ATLAS primitives [libatlas.a]:
Enter Library name for the C interface to BLAS [libcblas.a]:
Enter Library name for the Fortran77 interface to BLAS [libf77blas.a]:
Enter Library name for the C interface to BLAS [libptcblas.a]:
Enter Library name for the Fortran77 interface to BLAS [libptf77blas.a]:
Enter F77 Linker  [$(F77)]:
Enter F77 Link Flags  [$(F77FLAGS)]:

Here the configurations for gcc and icc diverge. Here are the next few steps for gcc:

Enter ANSI C compiler(CC) [/usr//bin/gcc]:
Enter C Flags (CCFLAGS) [-fomit-frame-pointer -O3 -funroll-all-loops]: -fomit-frame-pointer -O3 -funroll-all-loops -march=pentium4 -mfpmath=sse
Enter C compiler for generated code (MCC) [/usr//bin/gcc]:
Enter C FLAGS (MMFLAGS) [-fomit-frame-pointer -O]: -fomit-frame-pointer -O -march=pentium4 -mfpmath=sse

For icc these steps would be:

Enter ANSI C compiler(CC) [/usr//bin/gcc]: icc
Enter C Flags (CCFLAGS) [-fomit-frame-pointer -O3 -funroll-all-loops]: -xN -O3 -mp1 -static
Enter C compiler for generated code (MCC) [/usr//bin/gcc]: icc
Enter C FLAGS (MMFLAGS) [-fomit-frame-pointer -O]: -xN -O3 -mp1 -static -fno-alias

After this the configure continues:

Enter C Linker  [$(CC)]:
Enter C Link Flags  [$(CCFLAGS)]:
Enter Archiver  [ar]:
Enter Archiver flags  [r]:
Enter Ranlib  [echo]:
Enter BLAS library []:
Enter General and system libs [-lpthread -lm]:

The next question asks whether to accept the supplied architecture defaults. If you choose the default, 'y', your life will be a lot easier, and you will get a compiled version of ATLAS that will probably be well tuned to your hardware. If you choose 'n', the compile will be much longer, may break down (see below), but has the chance of being ideally tuned to your hardware. The supplied defaults are unlikely to be far from the ideal, so the easier choice here is 'y'. If you choose 'y', you get one last question:

Tune the Level 1 BLAS? [y]:

Compile ATLAS

If you didn't use my configuration script, there is a tweak to do before starting to compile ATLAS. During the compile, ATLAS will try to find the size of the processor L1 cache. For the pentium 4s I have tried, ATLAS seems to end up with a random L1 cache size. Clint Whaley, the ATLAS maintainer, suggested an L1 cache size of 128KB as optimum for the pentium 4, even though the actual L1 cache is 8KB. The L1 size can be set by hard-coding the value; if you use the script to configure ATLAS, explained below, this will happen automatically. Otherwise you can replace the ATLAS file [ATLAS-root]/tune/sysinfo/L1CacheSize.c with the version in the appendix.

When I chose not to accept the architecture defaults, I had a lot of difficulties compiling ATLAS. The problem was that I had many "timing outside tolerance" errors; see the relevant section in the ATLAS errata file (http://math-atlas.sourceforge.net/errata.html). In fact the script to configure ATLAS mentioned above has a setting in it that tries to deal with this problem, again suggested by Clint Whaley, which is to define a compile variable giving the actual value of your CPU speed in megahertz; for example, for a 2.6 GHz machine this value would be set to 2600. You can do this via a configuration script like mine, or by editing the Make.[arch] file (or whatever the configuration run created). This fix is likely to work well if you are the only user of the machine, so that all the CPU load comes from the ATLAS compile itself when it starts. If not, you may not want to set this value; in that case, if you are using my configuration script, you will need to remove the lines setting this variable from the script.

Another way to try and reduce the timing errors, if you get them, is suggested in the ATLAS errata file. In fact, on occasion I found it useful to preemptively apply a version of a fix suggested in the errata file, changing line 117 of [ATLAS-root]/tune/sysinfo/GetSysSym.c from:

fprintf(fpout, "#define ATL_nkflop %d\n", nkflop);

to:

fprintf(fpout, "#define ATL_nkflop %d\n", nkflop*5);

Now run the compilation as the config output suggests; here the command would be:

make install arch=Linux_P4SSE2

On my 3GHz Pentium, this process took a few hours.

Tune ATLAS

You might consider trying to fine-tune ATLAS. Here are some tips for tuning ATLAS (see the AtlasTips page on this wiki).

Make into matlab library

Now you need to make the compiled libraries into a matlab ATLAS library; here are the scripts I use to do this for Linux / ATLAS / gcc, and Linux / ATLAS / icc (in the appendices). For Windows, use the script on the ATLAS / Windows page (http://www.kevinsheppard.com/research/MatlabAtlas/MatlabAtlas.htm).

You're done for the ATLAS recompile. If you don't want to recompile ATLAS, the other option is to:

Use the Intel Math Kernel libraries

The Intel Math Kernel library is a set of math routines, including BLAS and LAPACK, that are optimised for Intel processors. They are available from Intel, who provide a 30-day evaluation trial, after which you must pay for a license. Intel offer a discount for academic users. To use the MKL, you can either reassemble a new library for matlab (see the shell script in the appendix), or point Matlab to the MKL directly. Results of the Matlab bench command and some SPM timings suggest that the MKL runs at a similar speed to the ATLAS library recompiled with gcc or icc.

Telling matlab about the new libraries

Lastly, you will need to tell matlab to use the newly created ATLAS libraries. You can do this by editing the file "blas.spec" in the "bin/[arch]" subdirectory in the matlab root directory, where [arch] is "glnx86" for Linux and "win32" for Windows. Change the line like:

GenuineIntel Family 15 Model * atlas_P4.so # Pentium IV (Foster)

to point to the new library, which should be in the same directory. For example, if you used the script in the appendix to write the gcc recompiled libraries, you should have a file "atlas_P4SSE2_gcc.so" in the matlab bin/glnx86 directory. In this case you can change the above line to:

GenuineIntel Family 15 Model * atlas_P4SSE2_gcc.so # Pentium IV (Foster)

Don't forget to save the old blas.spec as a backup.

Otherwise you can point matlab at the relevant library using the BLAS_VERSION environment variable - see the Mathworks help on BLAS (http://www.mathworks.com/access/helpdesk/help/base/install/unix/chap3_10.shtml).
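
For example, a minimal sketch (the matlab root path here is illustrative; substitute your own):

# Point matlab at the recompiled ATLAS library for this session only,
# without editing blas.spec (bash syntax)
export BLAS_VERSION=/usr/local/matlab6p5/bin/glnx86/atlas_P4SSE2_gcc.so
matlab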

Recompiling the SPM mex routines

SPM uses compiled C routines for some rate-limiting steps like image resampling. As images may contain NaNs, these need to be recompiled to avoid the Pentium NaN slowdown. Again, this can be done by giving gcc the appropriate flags to use SSE2, or using the Intel compiler. The Intel compiler gives faster times in the benchmarks I have run. Tom Womack and Chris Rorden have long pointed out the benefits of the Intel compiler for the SPM mex files (http://www.psychology.nottingham.ac.uk/staff/cr1/spm99.html).

Compiling mex files for SPM99

To recompile the spm99 mex files you will need to change the spm_MAKE.sh file, and edit / create a mex options file. For gcc or icc, make the suggested edits to spm_MAKE.sh, which is in your SPM directory. For icc, copy the iccopts.sh appendix to a file "iccopts.sh" in your SPM directory. For gcc on Linux, edit your gccopts.sh option file, find the section for "glnx86)", and change the COPTIMFLAGS, CXXOPTIMFLAGS variables to be:

-O3 -fomit-frame-pointer -funroll-loops -march=pentium4 -mfpmath=sse -DNDEBUG
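
In context, the edited section of gccopts.sh would look something like this (a sketch only - surrounding lines abbreviated, and your copy may differ in detail):

glnx86)
    # ... other glnx86 settings left as they were ...
    COPTIMFLAGS='-O3 -fomit-frame-pointer -funroll-loops -march=pentium4 -mfpmath=sse -DNDEBUG'
    CXXOPTIMFLAGS='-O3 -fomit-frame-pointer -funroll-loops -march=pentium4 -mfpmath=sse -DNDEBUG'
    ;;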

For gcc on Windows (via gnumex, http://gnumex.sourceforge.net), make and then edit your mexopts.bat file; change the OPTIMFLAGS variable to be:

-O3 -fomit-frame-pointer -funroll-loops -march=pentium4 -mfpmath=sse

Then change directory to the SPM directory, and recompile with:

./spm_MAKE.sh gcc

for Linux gcc, and

./spm_MAKE.sh icc

for icc, and

./spm_MAKE.sh windows

for Windows gcc.

You can also try using the Intel C compiler for Windows (icl), which does give some speed gains over gcc. icl is not free for academic use, and requires Microsoft Visual C++ to be installed. You also need to edit the SPM code to make the compilation work. For those brave souls who want to give it a try, I have packaged up an icl/SPM archive (http://imaging.mrc-cbu.cam.ac.uk/downloads/Intel_p4/icl_mex.zip), with edited versions of the SPM code, a mexopts.bat file and a version of the spm_MAKE.bat file that works with icl.

If you want to avoid recompiling these files, Chris Rorden and I have packaged some precompiled versions - see the short and simple section above for links.

For reasons that are beyond me, the icc/icl compiled mex file - spm_resels_vol.mexglx / spm_resels_vol.dll - was causing an intermittent matlab crash. This may well have been fixed for more recent versions of the compiler. If you run into this problem, just use the gcc version of that file instead of the Intel compiled version. (If you get this problem and want to try to track it down: the crash only occurs when the SSE2 extensions are used (-xW) to compile the vol_utils library; the file never crashes when running under the debugger.)

Compiling mex files for SPM2

gcc and Linux

To compile for gcc under Linux, find the following lines in the Makefile in the spm2 main directory:

Linux:
        make all SUF=mexglx  CC="gcc -O3 -funroll-loops"    MEX="mex COPTIMFLAGS='-O3 -funroll-loops'"

Just below this, add the following lines:

Linux.P4_gcc:
        make all SUF=mexglx  CC="gcc -O3 -funroll-loops -march=pentium4 -mfpmath=sse"    MEX="mex COPTIMFLAGS='-O3 -funroll-loops -march=pentium4 -mfpmath=sse'"

then recompile the mex files from the command line with "make Linux.P4_gcc".

icc and Linux

First, copy the iccopts.sh appendix to a file "iccopts.sh" in your SPM directory. Next, add the following to the Makefile:

Linux.P4_icc:
        make all SUF=mexglx  CC="icc -xN -O3 -mp1 -static"    MEX="mex -f iccopts.sh COPTIMFLAGS='-xN -O3 -mp1 -static'"

and compile with "make Linux.P4_icc".

gcc and Windows

The default compile for Windows uses the P4 optimization flags.

Checking mex file execution time

You can use the mex file speed check program (see the appendix) to see if your current mex files have a problem with NaNs. If they do, the output will show that an image with NaNs takes considerably longer to resample than the same image without NaNs. This is what the output looks like on my laptop, running SPM with the default mex files under Windows on a P3:

Simple read:  0.88
     Linear resample:  2.58
       Sinc resample: 19.53
 NaN linear resample:  8.40
              Smooth: 11.60

Here you see that the NaN resample is taking more than three times longer than the standard linear resample.

The sparse matrix problem

On my pentium 4 system I found that SPM99 model estimation was not as fast as I expected. After a little searching, this turned out to be due to great slowness during the application of the high and low pass temporal filtering. The filtering involves multiplying a matrix of data by a sparse matrix. A sparse matrix is a matrix which is likely to have many zeros, and matlab implements these differently from other matrices. The multiplication of a sparse matrix by a standard (full) matrix is usually slow. For some reason, on my Linux systems, it was much slower than for other systems, at more than a hundred times slower than a full by full multiplication. The appendix contains a sparse/full test script to evaluate the problem on your system. This does not seem to be a Pentium 4 problem in particular; others have not found the same slowdown.

The sparse matrix problem can be solved by converting the sparse matrix format of the SPM99 filters to full matrix format; this gives the same result from the filtering, but is much quicker on all the machines that I have tested. You only need to add a few lines to the SPM99 file "spm_spm.m" - see the appendix for details. I have referred to the use of this edited SPM file as the Optimized model - see below.
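
For a quick feel for the effect on your system, here is a minimal sketch along the same lines as the test script (the matrix sizes are arbitrary, and K just stands in for an SPM-like filter matrix):

% Compare full*full against sparse*full multiplication times
K = spdiags(ones(200,1) * [1 -2 1], -1:1, 200, 200);  % sparse, filter-like matrix
Kf = full(K);                                         % same matrix, full storage
Y = randn(200, 5000);                                 % data matrix
tic; Y1 = Kf * Y; t(1) = toc;                         % full by full
tic; Y2 = K * Y;  t(2) = toc;                         % sparse by full
fprintf('Full: %8.4f; sparse: %8.4f; ratio: %8.4f\n', t(1), t(2), t(2)/t(1));

On an affected system, the second (sparse matrix) timing, t(2), comes out very considerably longer than the first (full matrix) timing, t(1).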

Telling SPM how much memory to use

A last possible edit you may want to make is to tell SPM how much memory it should use. This is nothing to do with Intel machines specifically, and applies to any platform.

By default, SPM99 is careful, if possible, to use a small amount of memory during model estimation. If you have a large amount of memory, you can speed up SPM by telling it to use more memory. Edit the file spm_spm.m in the SPM99 distribution, and look around line 340 for something like:

maxMem   = 2^30;        %-Max data chunking size, in bytes

Here, I have already edited this line to tell SPM to use (up to) all of my system's 1GB (2^30 bytes) of memory. Edit this line to suit your system. Consider trying a model estimation with different settings: when the setting is too small, you will get little use of swap space but rather slow performance; when it is too large, there will be slow performance and lots of swap use.
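
For example (an illustrative value, not from the SPM distribution), on a machine with 512MB of RAM you might use:

maxMem   = 2^29;        %-Max data chunking size, in bytes (here 512MB)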

General Linux tuning

Last but not least, there may be gains from a general Linux tune up: for details try these links: Post-install tuning (http://www.geocities.com/lunatech3007/how-to/html/howto_17.html); Tuning disks with hdparm (http://linux.oreillynet.com/pub/a/linux/2000/06/29/hdparm.html); Linux network tuning (http://koushik.fateback.com/tune.html).

Results - timings

After all that, here are some sample timings with the various options described here. The benchmarks used a 2.5GHz Pentium 4 machine with 1GB RAM, running SPM99 under Matlab 6.5 and Mandrake Linux 9.0. Values are times in minutes to complete various SPM99 jobs, as described in more detail in the SPM benchmarks page. Note the last column, which gives times for writing a long list of contrast images, including 41 F contrast images. The "default" configuration is the standard SPM mex files, with the atlas library that matlab ships with. "icc atlas" is the same jobs after recompiling ATLAS with icc. "icc atlas + mex" uses the ATLAS library and mex files both recompiled with icc. "gcc SSE atlas + mex" is for the ATLAS and mex files recompiled with gcc as described here. "W2000 gcc SSE atlas + icl mex" refers to timings from the same machine running Windows 2000, and using the ATLAS libraries recompiled using SSE, and SPM mex files recompiled using the Intel C compiler for Windows (icl). This was the Windows configuration that gave the fastest times.

Configuration                 Realign  Smooth  Model  Optimized model  Contrast
Default                          22.0     5.9   24.2              6.7      94.0
icc atlas                        22.7     5.8   24.2              5.4      20.1
icc atlas, mex                   16.2     5.4   24.2              5.7       3.2
gcc SSE atlas, mex               18.2     5.5   23.6              5.8       3.1
W2K, gcc SSE atlas, icl mex      23.0     5.8   23.4              5.5       3.0

Note the huge speed increase for contrast generation after optimizing for the P4. See the SPM benchmarks page for some comments on Linux and Windows.

Good luck...

Matthew Brett 11/4/2003

Appendix - matlab NaN test program

A test program to show the problem with NaNs in matlab:

sz = 500;
A = ones(sz);
B = ones(sz);
tic
C = A*B;                % multiply with ordinary values
t1 = toc;
A = A * NaN;            % fill A with NaNs
tic
C = A*B;                % the same multiply, now with NaNs
t2 = toc;
[tmp msg] = unix('uname -a');
fprintf('System: %s\nMatlab: %s\nOnes: %8.4f; NaNs:  %8.4f; Propn: %8.4f\n',...
        msg, version, t1, t2, t2/t1);

with some example output:

System: Linux ... 2.4.21pre4-5mdkenterprise #1 SMP Sat Feb 8 22:27:52 CET 2003 i686 unknown unknown GNU/Linux
Matlab version: 6.5.0.180913a (R13)
Ones:   0.1447; NaNs:  108.7452; Propn: 751.7246

Appendix - C NaN test program

Here is a little C program to show the NaN problem:

#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>

#ifndef RUSAGE_SELF
#define RUSAGE_SELF      0     /* calling process */
#endif

#define SZ 10000000

#define STIME(X) ((X.ru_utime.tv_sec +     \
                    X.ru_stime.tv_sec) +   \
                   (X.ru_utime.tv_usec +   \
                    X.ru_stime.tv_usec) / 1.0E6)

int main( int argc, char *argv [ ] ) {
  double i, mval;
  struct rusage t0, t1, t2;
  double a,b,c;
  char op;

  op = '*';
  getrusage(RUSAGE_SELF,&t0);
  mval = 1.0; 
  for (i=0,c=1;i<SZ;i++) c*=mval;
  getrusage(RUSAGE_SELF,&t1);
  mval = 0.0 / 0.0; 
  for (i=0,c=1;i<SZ;i++) c*=mval;
  getrusage(RUSAGE_SELF,&t2);
  a = STIME(t1) - STIME(t0);
  b = STIME(t2) - STIME(t1);
  c = b/a;
  printf("%c time %8.4f, %c NaN time %8.4f, proportion %8.4f\n", 
         op, a, op, b, c);

  op = '+';
  getrusage(RUSAGE_SELF,&t0);
  mval = 1.0;
  for (i=0,c=1;i<SZ;i++) c+=mval;
  getrusage(RUSAGE_SELF,&t1);
  mval = 0.0 / 0.0; 
  for (i=0,c=1;i<SZ;i++) c+=mval;
  getrusage(RUSAGE_SELF,&t2);
  a = STIME(t1) - STIME(t0);
  b = STIME(t2) - STIME(t1);
  c = b/a;
  printf("%c time %8.4f, %c NaN time %8.4f, proportion %8.4f\n", 
         op, a, op, b, c);

  return(0);
}

Compile without optimization (e.g. gcc -O0).

Appendix - configuration script for ATLAS

Save this as [ATLAS-root]/atlas_sse_config.sh, cd to the [ATLAS-root] directory, and run the script with "sh atlas_sse_config.sh gcc" for gcc, or "sh atlas_sse_config.sh icc" for the Intel C compiler.

# Script to configure ATLAS for P4 SSE flags

case "$1" in
    gcc)
        ;;
    icc)
        ;;
    *)
        echo "Usage: $0 {gcc|icc}"
        exit 1
        ;;
esac

# reset L1CacheSize
l1file='tune/sysinfo/L1CacheSize.c'
mv $l1file ${l1file}.bak
cat <<EOF > $l1file
#include <stdio.h>
#include <stdlib.h>

main(int nargs, char *args [] )
{
  int L1Size=128;
   FILE *L1f;
   fprintf(stderr, "Assumed L1 cache size = %dkb\n",
           L1Size);
   L1f = fopen("res/L1CacheSize", "w");
   fprintf(L1f, "%d\n",L1Size);
   fclose(L1f);
   exit(0);
}
EOF

# System parameters
cpu_mhz=`grep "MHz" /proc/cpuinfo | gawk 'BEGIN { FS = " " } {print $4 }'`;
sys_name=`uname`;
arch_name="${sys_name}_P4SSE2_sse_$1"

# Make and run config options
make xconfig
case "$1" in
    gcc)
        ./xconfig \
            -F c "-fomit-frame-pointer -O3 -funroll-all-loops -march=pentium4 -mfpmath=sse" \
            -F m "-fomit-frame-pointer -O -march=pentium4 -mfpmath=sse" \
            -a $arch_name \
            -C "-D PentiumCPS=$cpu_mhz"
        ;;
    icc)
        ./xconfig \
            -c icc \
            -m icc \
            -a $arch_name \
            -C "-D PentiumCPS=$cpu_mhz"
        ;;
esac

# to run a lengthy make; uncomment this next line
# make install arch=$arch_name

Appendix - replacement for L1CacheSize.c

Save this as a replacement for [ATLAS-root]/tune/sysinfo/L1CacheSize.c:

#include <stdio.h>
#include <stdlib.h>

main(int nargs, char *args [] )
{
  int L1Size=128;
   FILE *L1f;
   fprintf(stderr, "Assumed L1 cache size = %dkb\n",
           L1Size);
   L1f = fopen("res/L1CacheSize", "w");
   fprintf(L1f, "%d\n",L1Size);
   fclose(L1f);
   exit(0);
}

Appendix - make matlab ATLAS for gcc

Script to make the matlab ATLAS library compiled with gcc. Copy this text into a new file, say "make_atlas.sh". Edit the file to change the paths and ATLAS architecture name to match your system. Run with "bash make_atlas.sh".

# Script to make matlab ATLAS library for gcc

# matlab, atlas paths; change these as necessary
MATLAB=/usr/local/matlab6p5
ATLAS=/root/installs/ATLAS
ARCH=P4SSE2_gcc

ATLAS_LIB=$ATLAS/lib/Linux_$ARCH
GCC_LIB=/usr/lib/gcc-lib/i586-mandrake-linux-gnu/3.2/

LIB_NAME=$MATLAB/bin/glnx86/atlas_$ARCH.so

ld -shared -o $LIB_NAME --whole-archive \
 $ATLAS_LIB/libatlas.a \
 $ATLAS_LIB/libcblas.a \
 $ATLAS_LIB/liblapack.a \
 $ATLAS_LIB/libf77blas.a \
 --no-whole-archive -L $GCC_LIB \
 -lg2c -lpthread -lm

Appendix - make matlab ATLAS for icc

Script to make the matlab ATLAS library compiled with icc. Copy this text into a new file, say "make_atlas.sh". Edit the file to change the paths and ATLAS architecture name to match your system. Run with "sh make_atlas.sh".

# Script to make matlab ATLAS library for icc

# matlab, intel, atlas paths: change as necessary
MATLAB=/usr/local/matlab6p5
INTEL_LIB=/usr/local/intel
ATLAS=/root/installs/ATLAS
ARCH=P4SSE2_icc

ATLAS_LIB=$ATLAS/lib/Linux_$ARCH
GCC_LIB=/usr/lib/gcc-lib/i586-mandrake-linux-gnu/3.2/

LIB_NAME=$MATLAB/bin/glnx86/atlas_$ARCH.so
ICC_LIB=$INTEL_LIB/compiler70/ia32/lib

ld -shared -o $LIB_NAME --whole-archive \
 $ATLAS_LIB/libatlas.a \
 $ATLAS_LIB/libcblas.a \
 $ATLAS_LIB/liblapack.a \
 $ATLAS_LIB/libf77blas.a \
 --no-whole-archive -L $GCC_LIB -L $ICC_LIB \
 -lg2c -lpthread -lm \
 -dn -limf -lirc

Appendix - make matlab ATLAS from MKL

Script to write the Math Kernel Library static library routines into a matlab ATLAS-like library (e.g. atlas_mkl_p4.so in the matlab bin/glnx86/ directory). Change the paths for your system.

# CPU can be one of "def", "p3", "p4"
CPU=p4

# matlab and intel paths 
MATLAB=/usr/local/matlab6p5
INTEL_LIB=/usr/local/intel

LIB_NAME=$MATLAB/bin/glnx86/atlas_mkl_$CPU.so
MKL_LIB=$INTEL_LIB/mkl/lib/32
ICC_LIB=$INTEL_LIB/compiler70/ia32/lib

cp $MKL_LIB/libmkl_lapack.a tmp_lapack.a
ar -d tmp_lapack.a mkl_lsame.o

ld -shared -o $LIB_NAME --whole-archive \
 $MKL_LIB/libmkl_$CPU.a tmp_lapack.a  $MKL_LIB/libguide.a \
--no-whole-archive -L $ICC_LIB -L $MKL_LIB \
-dn -lF90 -limf -lirc

\rm tmp_lapack.a

Appendix - spm_MAKE.sh for pentium 4 mex files

Replace the sections from "windows)" through "gcc)" in spm_MAKE.sh with the following (note that this also adds a new "icc)" section):

windows)
        echo "Windows compile with Cygwin/MingW gcc"
        echo "see http://www.mrc-cbu.cam.ac.uk/Imaging/gnumex20.shtml"
        echo "for instructions about installing gcc for"
        echo "compiling Mex files."
        echo ""
        deff=-DSPM_WIN32
        CC="gcc -mno-cygwin -march=pentium4 -mfpmath=sse -O3 -fomit-frame-pointer -fno-exceptions -funroll-loops $deff"
        cmex5="mex.bat $deff "
        cmex4="mex.bat $deff -V4 "
        # Windows added utility files
        $CC -c -o win32mmap.o win32mmap.c
        $cmex5 spm_win32utils.c
        added_objs="win32mmap.o spm_mapping.obj";;
    gcc)
        echo "Optimised standard unix compile for gcc"
        echo "This should work on Sun, Linux etc"
        echo "Note that the path to the gccopts.sh file may need"
        echo "changing."
        echo ""
        CC="gcc -O3 -funroll-loops -fomit-frame-pointer -march=pentium4 -mfpmath=sse"
        cmex5="mex     -f gccopts.sh"
        cmex4="mex -V4 -f gccopts.sh";;
    icc)
        echo "Optimised Linux compile using Intel C compiler"
        echo ""
        CC="icc -O3 -tpp7 -xW -unroll"
        cmex5="mex     -f iccopts.sh"
        cmex4="mex -V4 -f iccopts.sh"

Appendix - mex option file for icc

Save the following as iccopts.sh:

# iccopts.sh    Shell script for configuring MEX-file creation script
# Don't forget to change the root path to the compiler, below
TMW_ROOT="$MATLAB"
MFLAGS=''
if [ "$ENTRYPOINT" = "mexLibrary" ]; then
MLIBS="-L$TMW_ROOT/bin/$Arch -lmx -lmex -lmatlb -lmat -lmwservices -lut -lm"
else  
MLIBS="-L$TMW_ROOT/bin/$Arch -lmx -lmex -lmat -lm"
fi

# change path to match your system
# The compiler itself will be in $INTEL_PATH/bin
INTEL_PATH="/usr/local/intel"
ILIBS="-L$INTEL_PATH/lib/ -lcprts -lcxa  -lguide -lirc -lircmt -lompstub  -lsvml -lunwind -limf"
RPATH="-Wl,--rpath-link,$TMW_ROOT/extern/lib/$Arch,--rpath-link,$TMW_ROOT/bin/$Arch"
#   icc -V
#   Intel(R) C++ Compiler for 32-bit applications, Version 8.0   Build 20040318Z Package ID: l_cc_pc_8.0.058_pe063.1
CC='icc'
CFLAGS='-fPIC -D_GNU_SOURCE'
CLIBS="$RPATH $ILIBS $MLIBS"
COPTIMFLAGS="-O3 -tpp7 -xW -unroll -DNDEBUG"
CDEBUGFLAGS='-g'

CXX='icc'
CXXFLAGS="$CFLAGS"
CXXLIBS="$CLIBS"
CXXOPTIMFLAGS="$COPTIMFLAGS"
CXXDEBUGFLAGS='-g'
#   ifc -V
# Intel(R) Fortran Compiler for 32-bit applications, Version 7.1   Build 20030307Z
FC='ifc'
FFLAGS='-fPIC -r8 -w -cm -q'
FLIBS="$RPATH $ILIBS $MLIBS"
FOPTIMFLAGS="$COPTIMFLAGS"
FDEBUGFLAGS='-g'
#
LD="icc"
LDFLAGS="-pthread -shared -static -Wl,--version-script,$TMW_ROOT/extern/lib/$Arch/$MAPFILE"
LDOPTIMFLAGS="$COPTIMFLAGS -fPIC"
LDDEBUGFLAGS='-g'
#
POSTLINK_CMDS=':'

Appendix - mex speed test

% Script to test speed of compiled SPM files

% Tests defined
tests = {'Simple read','Linear resample', 'Sinc resample', ...
         'NaN linear resample', 'Smooth'};

% Make test image
dim = [128 128 128];
img = randn(dim);
V    = struct(...
    'fname',    'test.img',...
    'dim',      [dim spm_type('float')],...
    'mat',      eye(4),...
    'pinfo',    [1 0 0]',...
    'descrip',  'test image');
V = spm_write_vol(V, img);

% Read to load cache
img = spm_read_vols(V);

% write with NaNs
V2 = V;
V2.fname = 'nan_test.img';
img(img < 3.5) = NaN;
V2 = spm_write_vol(V2, img);

% Start tests
t = [];

% simple read
tic
img = spm_read_vols(V);
t = [t toc];

% resampling
startp = 2;
grain = 1;
dims = V.dim(1:3) - 2;
[X Y Z] = ndgrid(startp:grain:dims(1), ...
                 startp:grain:dims(2), ...
                 startp:grain:dims(3));
X = X+0.5; Y = Y+0.5; Z = Z + 0.5;

tic
p = spm_sample_vol(V, X, Y, Z, 1); % trilinear
t = [t toc];

tic 
p = spm_sample_vol(V, X, Y, Z, 8); % sinc
t = [t toc];

tic 
p = spm_sample_vol(V2, X, Y, Z, 1); % trilinear, NaN
t = [t toc];

% smoothing
tic
spm_smooth(V, 'sm_test.img', 8);
t = [t toc];

% report
for i = 1:length(tests)
  fprintf('%20s: %5.2f\n', tests{i}, t(i));
end

Appendix - showing the sparse matrix problem

Simply run this little program in matlab:

Kd = 300;              % filter matrix size
Vd = 2000;             % number of data columns
K = ones(Kd);
y = ones(Kd, Vd);
tic
y = K*y;               % full by full multiply
t = toc;
K = sparse(K);
tic
y = K*y;               % sparse by full multiply
t(2) = toc

On my system, the second (sparse matrix) timing, t(2), is very considerably longer than the first (full matrix) timing, t(1).

Appendix - fixing the sparse matrix problem

Edit the file spm_spm.m in your SPM99 directory. Around line 420, you will see something like:

%-Initialise design space
%=======================================================================
fprintf('%-40s: %30s','Initialising design space','...computing')    %-#

Just after these lines, add the following:

if iscell(xX.K)                   % one filter structure per session
  for s = 1:length(xX.K)
    xX.K{s}.KH=full(xX.K{s}.KH);  % convert filter matrices to full storage
    xX.K{s}.KL=full(xX.K{s}.KL);
  end
else
  xX.K = full(xX.K);              % single filter matrix
end

This converts the filter matrices from sparse to full, and greatly speeds later execution time.
