Scientific Computing at the CBU
In contrast to other, smaller research centres, computing facilities at the CBU are provided centrally, and operate as a shared resource for all scientists. This offers the advantages of a well-maintained, high-throughput computing facility designed for the demands of cognitive neuroscience.
Here we lay out some basic information on how the three main elements of the scientific computing facility (file storage, compute servers and software) are structured. We also offer some recommended "best practices" and link to pages and people that can provide further help and support.
Since this system is shared across all users, we're keen to hear comments and suggestions on how the system could be improved in future. Suggestions can be sent to the Scientific Computing Management Committee.
1. File storage
Neuroscience data sets are large. Our system is designed to deal with this, and so we are fortunate to rarely run into disk space limitations. If we manage things well, you should never see the dreaded "disk full" message. This is accomplished by combining hard drive space (what most of us think of as disk space) with tape storage drives, integrated into a single resource known as a "Hierarchical Storage Management" (HSM) system. The Unit's HSM was provided by Sun and is called SAM.
Files on a hard drive are quick to access, but it is not practical or efficient to purchase and maintain enough hard drives for all of the files (over 100 million!) that we want to store. As a solution, files that haven't been accessed in a long time are moved off onto tape storage, a little like automatically archiving your old data. These files still appear when you list files or browse in Windows Explorer or similar; physically, however, they are in a different location. When you attempt to access a file that has been moved off to tape, a robot must physically locate the tape the requested file is stored on, retrieve it, and copy the file to some free space on the hard drive. This process can take several minutes or longer, depending on how many other files other users are requesting. When accessing older files, this delay is normal, and simply the price we pay for never having to delete files from our system.

If you are re-running an fMRI analysis on a large number of subjects, each of whom has a large number of functional files that have gone off to tape, it may take a few hours to retrieve all of the data. Although it may appear that SPM is freezing, it may in fact just be the time taken to access the old files.
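On many HSM-backed filesystems (including SAM), a file that has been released to tape keeps its full apparent size but occupies few or no disk blocks, so comparing the reported size with the allocated blocks is a rough way to spot files that will trigger a tape retrieval before you start an analysis. Here is a minimal sketch of that heuristic; the function names and the slack threshold are illustrative assumptions, not a documented SAM interface:

```python
import os

def likely_on_tape(path, slack=4096):
    """Heuristic: a non-empty file whose allocated bytes are far smaller
    than its apparent size has probably been released to tape."""
    st = os.stat(path)
    allocated = st.st_blocks * 512  # st_blocks is counted in 512-byte units
    return st.st_size > slack and allocated < st.st_size - slack

def offline_candidates(directory):
    """List files under `directory` that look like they live on tape."""
    hits = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            try:
                if likely_on_tape(path):
                    hits.append(path)
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return hits
```

Running something like this over a study directory before launching SPM gives you a rough idea of how much data will need to come back from tape, and therefore how long the first pass may take.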
Once files have been retrieved back to disk, they will remain there for the time being, because they have been accessed recently. Of course, when enough old files get accessed, hard drive space needs to be cleared up, and so files (hopefully ones that aren't being actively used) are also constantly being moved from the hard drive to tape.
The file system is mirrored, meaning that all of the information is stored in two physically different places, so that it is protected against a disk failure, or even one of the server rooms burning down.
2. Cluster computing and load sharing
Imaging computing is performed on a Linux cluster (see Cbu Computing Resources): a large set of machines named "l" (for Linux) followed by a number from 02 to 42. These are (mostly) identically configured machines set up for running imaging analyses. Information on the different machines in the cluster can be found here.
The preferred way to access these machines is to use scripts that have load sharing built-in: that is, they attempt to evenly distribute the computing jobs over the whole cluster automatically, to make the most efficient use of our computing resources.
Any time you start a new VNC session (see Using VNC), you are allocated a session on one of these cluster machines. If you start SPM from a terminal using the load-sharing SPM command, the SPM session is started on a new machine, specifically one that currently has a relatively low load.
If you are running a Matlab program (but not SPM), you can use the load-sharing Matlab command, which will start a Matlab session (without SPM) on one of the cluster machines.
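The launcher scripts themselves are not shown here, but the basic idea behind them is simple: query each cluster node for its load average and start the job on the least-loaded one. A hypothetical sketch of the selection step follows; the host names and the `uptime`-style load strings are made up for illustration, and the real CBU scripts may work differently:

```python
def parse_load(uptime_line):
    """Extract the 1-minute load average from a line like
    '... load average: 0.52, 0.40, 0.31'."""
    loads = uptime_line.split("load average:")[1].split(",")
    return float(loads[0])

def least_loaded(host_loads):
    """Pick the host with the lowest 1-minute load.
    `host_loads` maps a host name to its uptime output line."""
    return min(host_loads, key=lambda h: parse_load(host_loads[h]))
```

A real launcher would collect the uptime lines remotely (e.g. over ssh) and then start Matlab or SPM on whichever host this selection returns.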
Several other software packages have load-sharing capabilities when run in the CBU environment. These are documented on their individual software pages, and include Freesurfer, MNE, and FSL. Parallel AA scripts for Matlab also perform load sharing, distributing individual analysis jobs to whichever machine has the lowest workload at that time.
One challenge that the current system does not cope with particularly elegantly is the case where more jobs are being run than there are computers available. In this case, all the processes on a machine compete for resources (processing power and RAM), which makes every job run extremely slowly. The solution to this is a job queuing system. We are currently choosing and installing one, and will provide information on it when it is ready. In the meantime, we should each try to limit the number of jobs we run at a time; if the total number stays below the number of available processors, things should continue to run relatively smoothly.
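Until a queuing system is in place, one way to cap your own footprint is to push your jobs through a fixed-size worker pool rather than launching them all at once. A generic Python sketch, where the commands are placeholders for real per-subject analysis steps:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_job(cmd):
    """Run one external analysis command and return its exit code."""
    return subprocess.run(cmd, shell=True).returncode

def run_all(commands, max_jobs=4):
    """Launch every command, but keep at most `max_jobs` running at once;
    each worker blocks on its subprocess, so the pool size caps concurrency."""
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        return list(pool.map(run_job, commands))
```

With `max_jobs` set to a modest number, a hundred subjects still get processed, just never more than a few at a time, which keeps the shared machines responsive for everyone else.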
3. Available software
All of the Linux machines are identically configured with core software for processing, analysing and displaying functional imaging and E/MEG data.
Actively maintained software packages, and the people responsible for keeping them updated, are listed on the AvailableSoftware page. Most of these packages have an associated Wiki page, which is the best place to start if you have questions.
Best practices (aka "Suggestions to keep everything working smoothly for everyone")
- Regularly exit unused Matlab sessions (including overnight when possible). This frees up licenses and computing resources for people actively running jobs. (In fact, if you leave a Matlab session idle for a while, the license will automatically be revoked, but it's probably easier to just get in the habit of closing sessions you're not actively using.)
- Having a small number of large files is generally more efficient than having a large number of small files. So, for example, use 4D fMRI files when possible (support for this will be added to AA).
- Delete files you don't need. Although thanks to tape storage we have a great deal of storage space, it still helps things run smoothly to get rid of files you really don't need (especially if you can recreate them). For example, you may be able to delete the intermediate stages from old fMRI analyses.
- Use loadsharing versions of software whenever possible.
- Avoid running too many simultaneous jobs (which is especially easy to do with parallel computing toolboxes). It's hard to define "too many", but keep in mind that when the system is overloaded, no one's jobs run efficiently.
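To make the "fewer, larger files" point concrete, here is a toy sketch contrasting the two storage layouts: with per-volume files, every access involves a separate open (and, on the HSM, potentially a separate tape retrieval), whereas a single 4D file is one object to stage and open. The file names and sizes are made up for illustration:

```python
import os

def write_3d_volumes(directory, n_volumes, volume_bytes):
    """Write each volume as its own file (the many-small-files layout)."""
    for i in range(n_volumes):
        path = os.path.join(directory, f"vol_{i:04d}.bin")
        with open(path, "wb") as f:
            f.write(b"\0" * volume_bytes)

def write_4d_volume(path, n_volumes, volume_bytes):
    """Write all volumes back-to-back in one file (the 4D layout)."""
    with open(path, "wb") as f:
        for _ in range(n_volumes):
            f.write(b"\0" * volume_bytes)
```

Both layouts hold the same bytes, but a 500-volume session stored the first way is 500 objects for the HSM to track and retrieve, versus one object stored the second way.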
Scientific computing management committee
If you have feedback or suggestions, please pass them on to one of the members of the Computing Management Committee.