PLNT4610/PLNT7690 Bioinformatics
Lecture 9, part 4 of 4

4. High Performance Computing (HPC)

Biology is increasingly becoming a producer and consumer of big data. The problems that even small labs face are often beyond the capabilities of desktop computers.

High Performance Computing refers to computer systems that integrate hardware and software solutions to address problems requiring resources beyond those found in desktop computers. These resources include large numbers of processor cores, large amounts of RAM, high-capacity storage, and fast networks connecting compute nodes.
Thus, HPC systems can often run bioinformatics tasks more quickly than desktop computers, and in many cases, handle large problems that are not possible on desktops.

Terms used in HPC

Units of Data

Unit             Value          Examples
Megabyte (MB)    10^6 bytes     prokaryotic genome ~ 10^6 - 10^7 bp
Gigabyte (GB)    10^9 bytes     human genome = 3 x 10^9 bp
                                total length of reads to sequence a human genome = 1.4 x 10^11 nt
                                combined nucleotide sequences in NCBI GenBank release 209.0 = 1.99 x 10^11 bp
Terabyte (TB)    10^12 bytes    daily output of 157 DNA sequencers at the Beijing Genomics Institute [1] = 6 x 10^12 nt
Petabyte (PB)    10^15 bytes    2013 European Bioinformatics Institute databases = 2.0 x 10^16 bytes
                                annual data output of the Large Hadron Collider = 1.5 x 10^16 bytes
                                Library of Congress, including multimedia = 2.0 x 10^16 bytes
Exabyte (EB)     10^18 bytes    all words ever spoken by human beings (as written text) [2] = 5 x 10^18 bytes
                                2013 estimate of Google disk storage = 1.0 x 10^19 bytes

[1] Marx, V (2013) Biology: The big challenges of big data. Nature 498:255-260. doi:10.1038/498255a
[2] Exabyte - Wikipedia. https://en.wikipedia.org/wiki/Exabyte

RAM - Random Access Memory - All data used in computation resides in RAM. To work on data from a disk, it is necessary for a program to read a copy of the data into RAM.

CPU and core - A Central Processing Unit performs operations on data in RAM. Originally, a CPU contained a single processor. Today, the vast majority of CPUs, even in low-end PCs, have 2 or more cores, each of which can process information independently. The terms CPU and core, while not synonymous, are often used interchangeably. Strictly speaking, core is the correct term when counting processing units.

compute node - An individual computer belonging to a cluster. The term can also refer to almost any computer in a cloud.

cluster - A group of computers that functions as a single computer, typically connected by a fast network and sharing storage.

Shared memory architecture [3] - A system in which each processing unit (for example, each core of a CPU) can access the entire memory space. This is typically the case within a single traditional computer, where two processes can easily share memory to exchange information quickly.

MPI [3] - Message Passing Interface - A communication standard for nodes that run parallel programs on distributed-memory systems such as HPC clusters.

cloud - Not really an HPC term per se. The cloud refers to a large array of computers that can provide computing capability as needed. While the cloud does not imply HPC capabilities, HPC systems can be offered as part of a cloud.

Serial computing - Working on a problem step by step until it is complete, on a single CPU. Most computer programs fall into this category.

Parallel computing - The art of breaking large computational problems into smaller problems that can each be solved simultaneously by a large number of CPUs. Parallel programming requires specialized strategies for re-thinking computational problems so that they better lend themselves to parallelization. Some languages such as C++ and Fortran have extensive libraries that handle common tasks in parallel computing.

[3] HPC glossary. https://www.ovh.ie/hpc/lexicon.xml
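
To make the distinction between serial and parallel computing concrete, here is a minimal sketch in Python (not part of the original notes) that runs the same toy task both ways using the standard multiprocessing module. The count_gc() function is a hypothetical stand-in for any CPU-bound computation on one piece of data.

# Minimal sketch: the same work done serially and in parallel.
# count_gc() is a hypothetical placeholder for a CPU-bound task.

from multiprocessing import Pool

def count_gc(seq):
    """Count G and C bases in one sequence (a toy CPU-bound task)."""
    return sum(1 for base in seq if base in "GC")

if __name__ == "__main__":
    # Pretend these are many large sequences read from a file.
    sequences = ["ATGCGC", "GGGCCC", "ATATAT", "CGCGTA"] * 1000

    # Serial computing: one core works through the list step by step.
    serial_counts = [count_gc(s) for s in sequences]

    # Parallel computing: the list is split among 4 worker processes,
    # each of which can run on its own core; results are collected in order.
    with Pool(processes=4) as pool:
        parallel_counts = pool.map(count_gc, sequences)

    # Both approaches give the same answer; only the wall-clock time differs.
    assert serial_counts == parallel_counts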

Examples of HPC in Bioinformatics

a) Running many serial jobs at the same time - You don't need to be an expert in HPC to take advantage of HPC capabilities. A crude form of parallel computing can be done in cases where you know that a problem can easily be divided into many parts that can be processed simultaneously. Simply break the dataset into as many parts as you have cores, and launch the jobs at the same time. The operating system will take care of assigning each instance of the task to a core, and the net effect will be parallel computing. With a little knowledge of programming, you could write a script that breaks up the dataset, runs the jobs, and combines the output into a single file when all jobs have completed.

Normally, to construct a parsimony tree from bootstrap replicates, you would create a file containing your bootstrapped datasets (e.g. 100 replicates) by running a single instance of seqboot. Next, you would run dnapars to make a tree from each dataset, and then consense to create a consensus tree from the 100 trees produced by dnapars. This is an example of running a serial job.

A script could instead leverage the many cores on the system by breaking the problem into many jobs that run at the same time.


In this example, the script would run several seqboot jobs to generate the same total number of replicate datasets, but divided among several files. After producing the datasets, dnapars would be run once on each file, generating several tree files. When all dnapars jobs have finished, the tree files would be concatenated into a single file to be read by consense, which would create the consensus tree.

The operating system would automatically take care of scheduling different jobs to run on cores, which would also be shared with other jobs on the system.
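
Here is a rough Python sketch (not from the original notes) of that strategy. The run_phylip() helper and the dnapars.cmd response file are hypothetical: the real PHYLIP programs (seqboot, dnapars, consense) are menu-driven, so in practice each job needs its own working directory and its own prepared menu responses. The point of the sketch is only the pattern: launch several independent dnapars jobs at once, wait for all of them, then merge their tree files for consense.

# Rough sketch of running several dnapars jobs concurrently.
# Assumes directories job0/ .. job3/ already exist, each holding one
# replicate dataset written by seqboot, plus a hypothetical response
# file dnapars.cmd containing the menu answers for dnapars.

import subprocess

N_JOBS = 4   # should not exceed the number of cores available

def run_phylip(command, workdir):
    """Hypothetical wrapper: start one PHYLIP program in its own directory
    (each PHYLIP job expects its own infile/outfile/outtree)."""
    return subprocess.Popen(command, shell=True, cwd=workdir)

# Launch one dnapars job per replicate file, all at the same time.
jobs = [run_phylip("dnapars < dnapars.cmd", f"job{i}") for i in range(N_JOBS)]

# Wait until every dnapars job has finished.
for job in jobs:
    job.wait()

# Concatenate the tree files into a single input file for consense.
with open("intree", "w") as merged:
    for i in range(N_JOBS):
        with open(f"job{i}/outtree") as treefile:
            merged.write(treefile.read())

# consense would then be run once, serially, on the merged tree file.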

b) Parallel processes that use multiple CPUs concurrently.

Example: ScalaBLAST, from Kalyanaraman A., Introduction to BLAST
http://www.shodor.org/media/content/petascale/materials/dataIntensive/BLAST/BLAST_Intro_ppt.ppt
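
ScalaBLAST is a parallel implementation of BLAST that distributes a search across many processors. As a loose illustration of the message-passing (MPI) pattern such programs use, here is a minimal sketch using the mpi4py package; it assumes mpi4py and an MPI implementation are installed, and search_one() is only a placeholder, not a real BLAST search.

# Minimal illustration of the MPI pattern used by truly parallel programs.
# Run with something like:  mpirun -np 4 python mpi_sketch.py
# search_one() is a placeholder, not an actual BLAST search.

from mpi4py import MPI

def search_one(query):
    """Placeholder for searching one query sequence against a database."""
    return (query, len(query))          # pretend result

comm = MPI.COMM_WORLD
rank = comm.Get_rank()                  # this process's ID (0 .. size-1)
size = comm.Get_size()                  # total number of MPI processes

queries = ["ATGC", "GGCCTA", "TTTAGC", "CCGA", "ATATGC", "GGG"]

# Each rank takes every size-th query, so the work is divided evenly
# with no communication needed until the results are collected.
my_results = [search_one(q) for q in queries[rank::size]]

# Gather everyone's partial results back on rank 0 and merge them.
all_results = comm.gather(my_results, root=0)

if rank == 0:
    merged = [r for part in all_results for r in part]
    print(len(merged), "results collected from", size, "processes")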



Considerations when running HPC jobs

1. # jobs <= number of cores - It does no good to run more jobs than there are cores. In fact, overloading the cores actually slows down overall performance, because of the added overhead of swapping jobs on and off of the CPUs.

2. The total RAM needed by all jobs running at a given time should not exceed the RAM installed on the system. If the total memory used by all jobs exceeds the available RAM, parts of the data may need to be swapped out to disk and re-read later. This can drastically slow down a system.

Use the top command on your system to get an idea of load average and memory being used at a given time.

load average - the average number of processes that are either running or waiting to run. When this number exceeds the number of cores, system performance begins to degrade.

Memory/Swap: The system will usually try to fill the available memory. When jobs are not actively running, their memory is sometimes written out to an area of disk called swap. For best performance, swap usage should remain a small fraction of total memory.

top - 13:56:29 up 64 days, 13:16, 109 users,  load average: 68.56, 67.43, 66.59
Tasks: 3015 total,  62 running, 2952 sleeping,   0 stopped,   1 zombie
Cpu(s):  0.0%us, 12.9%sy, 77.4%ni,  9.5%id,  0.0%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:  264498644k total, 251730180k used, 12768464k free,   587932k buffers
Swap:  8191996k total,    11840k used,  8180156k free, 222026532k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND           
17696 umjoona7  24   4  679m 186m  26m R 101.6  0.1  80:33.99 wrf.exe          
17794 umjoona7  24   4  680m 187m  26m R 101.6  0.1  93:41.31 wrf.exe          
17886 umjoona7  24   4  679m 186m  26m R 101.6  0.1  94:20.41 wrf.exe          
17962 umjoona7  24   4  679m 179m  24m R 101.6  0.1  94:33.43 wrf.exe          
18089 umjoona7  24   4  661m 164m  26m R 101.6  0.1  94:35.67 wrf.exe          
17611 umjoona7  24   4  681m 189m  27m R 101.2  0.1  92:09.04 wrf.exe          
17668 umjoona7  24   4  665m 169m  28m R 101.2  0.1  89:14.75 wrf.exe          
17912 umjoona7  24   4  683m 183m  24m R 101.2  0.1  94:22.77 wrf.exe          
17981 umjoona7  24   4  677m 178m  24m R 101.2  0.1  94:36.04 wrf.exe          
17996 umjoona7  24   4  677m 178m  24m R 101.2  0.1  94:35.98 wrf.exe          
18024 umjoona7  24   4  664m 169m  27m R 101.2  0.1  94:26.51 wrf.exe          
18034 umjoona7  24   4  649m 154m  26m R 101.2  0.1  94:29.06 wrf.exe          
18051 umjoona7  24   4  662m 166m  26m R 101.2  0.1  94:29.71 wrf.exe          
18066 umjoona7  24   4  661m 164m  26m R 101.2  0.1  94:36.35 wrf.exe    
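
As a quick programmatic check of the considerations above, a short Python snippet (assuming a Unix-like system, which the cc nodes are) can report the number of cores, the current load averages, and the memory still available; os.getloadavg() returns the same 1-, 5- and 15-minute values shown in the header of top.

# Quick check of how busy a machine is before launching more jobs.

import os

cores = os.cpu_count()
load1, load5, load15 = os.getloadavg()   # same values as in the top header

print("cores:", cores)
print("load averages (1/5/15 min): %.2f %.2f %.2f" % (load1, load5, load15))

if load1 >= cores:
    print("System is already fully loaded; extra jobs will degrade performance.")
else:
    print("Roughly", int(cores - load1), "cores appear to be free.")

# On Linux, /proc/meminfo reports how much memory is still available (in kB).
try:
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                print("available memory:", line.split()[1], "kB")
                break
except FileNotFoundError:
    pass   # not a Linux system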

HPC systems available to U of M Researchers

1) CC Unix/Linux Compute Nodes - Information Services and Technology maintains a set of servers that may be used by any student or staff member.

Configuration as of Fall 2017

venus, mars, jupiter - Login servers for routine Linux sessions; log in by ssh or ThinLinc.

cc01, cc02, cc03 ... cc12 - Linux compute nodes. These are configured identically to venus, mars and jupiter, but should only be used for long-running, CPU-intensive jobs. Normally, users log in only by ssh, but you can also start a vncserver job and run a full desktop session using a VNC viewer.

2) WestGrid http://westgrid.ca


Compute Canada is migrating to a more centralized infrastructure for High Performance Computing. Much of WestGrid will be defunded beginning in 2018, and its users will be migrated to systems operated by four national centres at the University of Victoria, Simon Fraser University, the University of Waterloo, and the University of Toronto.

WestGrid is a consortium of academic centers providing HPC services, infrastructure and software to Canadian Researchers. Access is free of charge, but researchers must be affiliated with a Canadian research institution to obtain an account.

(WestGrid overview figure from https://www.westgrid.ca/support/training/overview_201010)





Unless otherwise cited or referenced, all content on this page is licensed under the Creative Commons License Attribution Share-Alike 2.5 Canada
