All of the Human Genetics faculty groups use computers to process data and carry out analyses. While some analyses involve data sets small enough to be analysed on a laptop or desktop computer, many involve vast amounts of data and intensive processing requirements that would take years (or even centuries) to complete on a single machine. To stay at the cutting edge of scientific research in this field, our researchers utilise large computational clusters: hundreds of individual computers with the collective processing power of tens of thousands of individual laptops or desktops, allowing analyses to run on many machines simultaneously. Work that might have taken 10 years on a single machine can then be completed within a single day.
The Informatics Support Group maintains the large-scale computational clusters that we use to run these sorts of analyses, while the Human Genetics Informatics (HGI) team looks after the computational needs that are shared across the Human Genetics faculty groups. For example, we:
- install and maintain specialised analysis software used by researchers to carry out their analyses.
- manage shared data storage.
- develop and operate computational workflows for pre-analysis processing of human genetics data sets.
The computational workflows that we run can be very complicated, involving hundreds or even many thousands of individual steps. Each step needs to be able to access its input data and pass its output along to the next step, and it is important not to overload any of the individual computers involved by giving them more work than they can handle (overloading them tends to make them less efficient). Because we share our computers with other groups at the institute, the computational resources available to us also vary depending on what researchers in other groups are running.

At the same time, we would ideally like to reliably recreate the same output data each time we repeat a particular analysis, so that we can avoid storing every piece of data we generate (in science it is generally important to be able to refer back to the data that was used to support a result, so if we cannot regenerate it we have no choice but to store it indefinitely). This means isolating the software environment that runs each step in our workflows as much as possible, so that it can be run in exactly the same way every time. It is these competing concerns that ultimately make running large-scale computational workflows a non-trivial task, and for this reason it is useful to have HGI develop expertise in handling this work centrally rather than distributing it out to individual researchers within the faculty teams.
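To give a flavour of what "each step passes its output along to the next" means in practice, here is a toy sketch (not our actual workflow system, and the file names are invented for illustration) of how a workflow can be modelled as a dependency graph and its steps ordered so that nothing runs before its inputs exist:

```python
# Toy illustration of workflow scheduling: each step declares the inputs
# it needs, and we compute an order in which no step runs before the
# steps that produce its inputs. Real workflow engines add much more
# (retries, resource limits, isolated software environments per step).
from graphlib import TopologicalSorter

# Hypothetical steps, keyed by the output file each one produces.
steps = {
    "aligned.bam":  ["reads.fastq", "reference.fa"],
    "variants.vcf": ["aligned.bam", "reference.fa"],
    "report.txt":   ["variants.vcf"],
}

# Raw input files are not produced by any step.
raw_inputs = {"reads.fastq", "reference.fa"}

# Dependency graph: each output depends on the steps producing its inputs.
graph = {
    out: [i for i in ins if i not in raw_inputs]
    for out, ins in steps.items()
}

# A valid execution order: every step appears after its dependencies.
order = list(TopologicalSorter(graph).static_order())
print(order)
```

A real engine would also run independent steps in parallel across the cluster, which is where the scheduling and resource-sharing concerns described above come in.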