KEY CONCEPTS
This section introduces some concepts that will help you leverage the power of Unix to work more efficiently.
A video covering these topics can be viewed at: What most biologists probably don't know about computers (57:39)
1. The Computer: What's under the hood, and when does it matter
2. The Home Directory: Do everything from the comfort of your $HOME
3. Organizing your files: A place for everything, and everything in its place
4. Text files: It's actually quite simple
5. Screen Real Estate: Why one window should not own the screen.
6. Network-centric Computing - Any user can do anything from anywhere
1. The Computer: What's under the hood, and when does it matter
1.1 What is Unix/Linux?
Unix is an operating system, that is, an environment that provides commands for creating, manipulating, and examining datafiles, and for running programs. But behind the scenes, an operating system also manages system resources, and orchestrates the running of anywhere from dozens to hundreds of programs that may be running at the same time. Some other operating systems with which you may be familiar are MS-Windows and Macintosh OSX. Despite their differences, all of these operating systems do essentially the same thing, which is to act as the unifying framework within which all tasks are performed.
Unix is usually the system of choice for scientific and mathematical work, as well as for enterprise-level systems and servers. This is because Unix was designed as a multitasking, multiuser, networked system that had to be reliable and responsive under heavy loads, have 24/7 availability, and be highly secure.
MS-Windows was designed as a single-user desktop system, primarily for running one program at a time. Higher-level capabilities such as networking, multitasking, support for several simultaneous users, and server functions have all been retrofitted into Windows. Security has long been, and still is, a serious problem on the Windows platform.
The Unix family of operating systems includes commercial Unix systems such as Sun's Solaris, the many different distributions of Linux, most of which are free, as well as Apple's proprietary OSX.
Linux vs. Unix - Strictly speaking, Unix is a proprietary operating system owned by AT&T. Linux is a Unix-like operating system, written from scratch to function like Unix. Linux is Open Source software. There are numerous Linux distributions, most of them freely available, as well as value-added commercial distributions. Many people use the terms Unix and Linux interchangeably.
1.2 Beyond the standalone PC: The network is the computer
1.2.1 Every PC is a special case
- Each computer is a bit different from every other
- Everything happens on your PC
- Your data tends to be spread out among a number of machines
- Different programs on different machines
- No way to log in remotely to most Windows PCs
- How many PCs actually get backed up?
1.2.2 The network is the computer
The standalone PC is only one of many ways of using computer resources. This figure illustrates the three main functions of computers: File Services, Processing, and Display. The figure is meant to be generic. On a PC, all three functions occur in a single machine. For this reason, a PC is sometimes referred to as a "fat client".
However, there is no reason that these functions have to be on the same machine. For example, on distributed Unix systems, files reside on a file server, processing is done on login hosts, and you can run a desktop session on any login host, with the desktop displayed on a "thin client". Because the thin client does nothing but display the desktop, it doesn't matter what kind of machine is doing the display. A thin client can be a specialized machine or just a PC running thin client software.
A compromise between a thin client and a fat client is the "lean client". Essentially, a lean client is a computer that carries out both the Display and Processing functions, but remotely mounts filesystems from the fileserver, which behave as if they were on the machine's own hard drive. Many computer labs are configured in this way to save on system administration work, at the expense of extra network traffic.
Advantages of network-centric computing:
- You can access your data from anywhere
- Full access to the resources of a datacenter from anywhere
- Protection from obsolescence, when you use thin clients
- High availability, because all components are redundant
- There is nothing to lose (eg. memory stick), nothing that can be stolen (eg. laptop)
- Once software is installed, it works for everyone
- Automated backups
One example of network-centric computing is Google Docs. Google Docs lets you maintain documents, spreadsheets, and presentations online, using any web browser. Your documents stay on the server, so you can work on them from any browser on any computer, anywhere.
More and more resources reside on the network. This is now referred to as "cloud computing":
- Databases - Remote databases return data in response to queries from local clients.
- Application servers - The application runs on a remote host, but displays on the local client.
- Web services - The local client sends data to a web service; the service returns the result of the computation.
- Computing Grid - Analogous to an electrical grid. A community of servers on a high-speed backbone shares computing resources, including CPU time. Different parts of a job may be done on different machines, transparently to the client.
All of network-centric computing can be summarized in a single sentence:
Any user can do any task from anywhere
1.3 File systems - share or serve?
Unix systems typically include many machines, all of which remotely mount files from a file server. From the user's point of view, it looks as if the files are on their own hard drive. There are many advantages to using a file server. First, all machines on a LAN will have the same files and directories available, regardless of which desktop machine you use. Secondly, a file server makes it possible to standardize best practices which contribute to data integrity, including security protocols and scheduled automated backups. Finally, file servers typically store data redundantly using RAID protocols, protecting against loss of data due to disk failure.
Many LANs support peer-to-peer file sharing. In file sharing, each PC on the LAN may have some files or directories that are permitted to be shared with others. Again, from each user's perspective, it looks as if the file is on their own hard drive. However, file sharing also invites many potential security problems. As well, data integrity is only as good as the hard drive a file is actually on, and whatever steps the owner of that PC may or may not have taken to back up files.
more: http://en.wikipedia.org/wiki/Shared_disk_access
1.4 The Unix command line - Sometimes, typing is WAY easier than point and click.
One of the strengths of Unix is the wealth of commands available. While typing commands might seem like a stone-age way to use a computer, commands are essential for automating tasks, as well as for working with large sets of files, or extracting data from files. For example, when you use a DNA sequence to search the GenBank database for similar sequences, the best matching sequences are summarized, as excerpted below:
gb|EU920048.1| Vicia faba clone 042 D02 defensin-like protein mR... 143 1e-32
gb|EU920047.1| Vicia faba clone 039 F05 defensin-like protein mR... 143 2e-32
gb|EU920044.1| Vicia faba clone 004 C04 defensin-like protein mR... 143 2e-32
gb|FJ174689.1| Pisum sativum pathogenesis-related protein mRNA, ... 139 3e-31
gb|L01579.1|PEADRR230B Pisum sativum disease resistance response... 132 4e-29
There are often dozens of hits. If you wanted to retrieve all matching sequences from NCBI, you would need the accession numbers, found between the pipe characters "|". Rather than having to copy and paste each accession number to create a list for retrieval, a file containing that list could be created in a single Unix command:
grep 'gb|' AY313169.blast | cut -f2 -d '|' > AY313169.acc
would cut out the accession numbers from AY313169.blast and write them to a file called AY313169.acc:
EU920048.1
EU920047.1
EU920044.1
FJ174689.1
L01579.1
This list could now be used to retrieve all sequences in one step.
Explanation: The grep command searches for the string 'gb|' in the file AY313169.blast, and writes all lines matching that string to the output. The pipe character then sends that output to the cut command. The cut command splits each line into several fields, using '|' as a delimiter between fields. Field 2 from each line is written to a file called AY313169.acc.
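As a sketch of that one-step retrieval, assuming NCBI's EDirect utilities (epost and efetch) are installed, the accession list could be posted to NCBI and the matching sequences saved in FASTA format:

# Post the accessions to NCBI, then fetch the corresponding sequences
cat AY313169.acc | epost -db nuccore | efetch -format fasta > AY313169.fasta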
If you learn the commands listed below, you will be able to do the vast
majority of what you need to do on the computer, without having to learn
the literally thousands of other commands that are present on the system.
cat      Write and concatenate files
cd       Move to a new working directory
chmod    Change read, write, execute permissions for files
cp       Copy files
cut      Cut out one or more columns of text from a file
grep     Search a file for a string
less     View files a page at a time
logout   Terminate a Unix session
lpr      Send files to the printer
ls       List files and directories
man      Read or find Unix manual pages
mkdir    Make a new directory
mv       Move files
passwd   Change password
rm       Remove files
rmdir    Remove a directory
ps       List processes
top      List the most CPU-intensive processes
kill     Kill a process
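A short session illustrating a few of these commands in combination (the file and directory names are hypothetical):

mkdir project1            # make a new directory for the project
cd project1               # move into it
cp ~/data/seqs.fasta .    # copy a datafile into the current directory
less seqs.fasta           # page through the file
ls -l                     # list the directory contents in long format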
more: UsingUnix
1.5 What do programs actually do?
The cell is a good analogy for how a computer works. An enzyme takes a substrate and modifies it to produce a product. In turn, any product might be used as a substrate by another enzyme, to produce yet another product. From these simple principles, elaborate biochemical pathways can be described. Similarly, computer programs take input and produce output. For example, program 1 might read a genomic DNA sequence and write the mRNA sequence to the output. Program 2 might translate the RNA to protein, and program 3 might predict secondary structural characteristics for the protein. Alternatively, program 4 might predict secondary structures from the mRNA.
The process of chaining together several programs to perform a complex task is known as 'data pipelining'.
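On the Unix command line, such a pipeline can be written directly, with the output of each program becoming the input of the next. The program names below are hypothetical stand-ins for programs 1, 2, and 3 described above:

transcribe genomic.dna | translate | predict2d > protein_structure.txt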
One subtlety that is sometimes missed about computers has to do with the roles of random access memory (RAM) and the hard drive. Programs don't actually work directly on files that are on the hard drive. When you open a file in a program, a copy of that file is read from disk and written into memory. All changes that you make to the file occur on the copy in memory. The original copy of the file on disk is not changed until you save the file. At that time, the modified copy in memory is copied back to disk, overwriting the original copy.
2. The Home Directory*: Do everything from the comfort of your $HOME
One of the features of Unix that contributes to its reliability and security, and to its ease of system administration, is the compartmentalization of user and system data. The figure below shows the highest-level directories of the directory tree. To cite a few examples, /bin contains binary executables, /etc contains system configuration files, and /usr contains most of the installed application programs.
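You can list these top-level directories yourself with the ls command. A typical listing (the exact set of directories varies by system) might look like:

$ ls /
bin  boot  dev  etc  home  lib  tmp  usr  var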
One of the most important directories is /home, the directory in which each user has their own home directory. Rather than having data for each user scattered across the directory tree, all files belonging to each user are found in their home directory. For example, all files belonging to a user named 'homer' are found in /home/homer. Subdirectories such as 'beer', 'doughnuts', and 'nuclear_waste' organize his files into topics. Similarly, the home directory for 'bart' is /home/bart, and is organized according to bart's interests.
Most importantly, the only place that homer or bart can create, modify or delete files is in their home directories. They can neither read nor write files anywhere else on the system, unless permissions are specifically set to allow them to do this. Thus, the worst any user can do is damage their own files, and the files of each user are protected, as illustrated below.
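A minimal illustration of this protection, assuming an ordinary user account without special privileges (the filenames are hypothetical):

$ echo test > /etc/myfile     # fails: ordinary users cannot write in system directories
bash: /etc/myfile: Permission denied
$ echo test > ~/myfile        # succeeds: users can always write in their own $HOME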
* In Unix, the term directory is synonymous with folder. The two can be used interchangeably.
- usually work in home directory
- all your data is in your home dir. and nowhere else!
- system directories are world-readable
- each user can only read/write their own home directories
3. Organizing your files: A place for everything, and everything in its place
Most people know about organizing their files into a tree-structured hierarchy of folders. On Unix you can organize your files using a file manager such as Nautilus.
Some good guidelines to follow:
- Organize your files by topic, not by type. It makes no sense to put all presentations in one folder, all images in another folder, and all documents in another folder. Any given task or project will generate files of many kinds, so it makes sense to put all files related to a particular task into a single folder or folder tree.
- Each time you start a new task or project or experiment, create a new folder (see the example after this list).
- Your home directory should be mostly composed of subdirectories. Leave individual files there only on a temporary basis.
- Directory organization is for your convenience. Whenever a set of files all relate to the same thing, dedicate a directory to them.
- If a directory gets too big (eg. more files than will fit on the screen when you type 'ls -l'), it's time to split it into two or more subdirectories.
- On Unix/Linux, a new account will often have a Documents directory, which is confusing and makes no sense, since your HOME directory already serves the purpose that a Documents directory serves in Windows. It is best to just delete the Documents directory and work directly from your HOME directory.
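A minimal sketch of starting a new project from the command line, using bash's brace expansion (the project and subdirectory names are hypothetical):

mkdir -p ~/vicia_project/{data,results,docs}   # one folder tree per project, organized by topic
cd ~/vicia_project/data                        # work inside the project tree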
4. Text files: It's actually quite simple
A text editor is a program that lets you enter data into files, and modify it, with a minimal amount of fuss. Text editors are distinct from word processors in two crucial ways. First, the text editor is a much simpler program, providing none of the formatting features (eg. footnotes, special fonts, tables, graphics, pagination) that word processors provide. This means that the text editor is simpler to learn, and what it can do is adequate for the task of entering a sequence, changing a few lines of text, or writing a quick note to send by electronic mail. For these simple tasks, it is easier and faster to use a text editor.
Two of the most commonly used text editors with graphic interfaces are Nedit and gedit. Both are available on most Unix and Linux systems.
[Figure: a text editor editing a computer-readable file specifying an alternative genetic code used in flatworm mitochondria.]
The second important difference between word processors and text editors is the way in which the data is stored. The price you pay for having underlining, bold face, multiple columns, and other features in word processors is the embedding of special computer codes within your file. If you used a word processor to enter data, your datafile would thus also contain these same codes. Consequently, only the word processor can directly manipulate the data in that file.
Text editors offer a way out of this dilemma, because files produced by a text editor contain only the characters that appear on the screen, and nothing more. These files are sometimes referred to as ASCII files, since they only contain standard ASCII characters.
Generally, files created by Unix or by other programs are ASCII files. This seemingly innocuous fact is of great importance, because it implies a certain universality of files. Thus, regardless of which program or Unix command was used to create a file, it can be viewed on the screen ('cat filename'), sent to the printer ('lpr filename'), appended to another file ('cat filename1 >> filename2'), or used as input by other programs. More importantly, all ASCII files can be edited with any text editor.
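A quick way to see the difference is the standard 'file' command, which reports how a file's contents are encoded. In this hypothetical example (the filenames are made up, and the exact wording of the output varies by system), a text file and a word processor file are compared:

$ file notes.txt report.doc
notes.txt:  ASCII text
report.doc: Composite Document File V2 Document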
If you plan to do a lot of work at the command line, you will need a text editor that does not require a graphic interface. Several common editors include:
- nano - A very simple but not very powerful editor
- vi - The universal screen editor, available with all Unix implementations
- emacs - A text editor with many advanced capabilities for programming; it also has a steep learning curve
5. Screen Real Estate: Why one window should not own the screen.
The so-called "desktop metaphor" of today's workstations is instead an "airplane-seat" metaphor. Anyone who has shuffled a lap full of papers while seated between two portly passengers will recognize the difference -- one can see only a very few things at once. - Fred Brooks, Jr.
One of the most counter-productive legacies from the early PC era is that "one window owns the screen". Many applications start up taking the entire screen. This made sense when PC monitors were small, with 800x600 pixel resolution. It makes no sense today, when the trend is toward bigger monitors with higher resolution. The image below shows a typical Unix screen, in which each window takes just the space it needs, and no more.
Particularly in bioinformatics, you will be working on a number of
different datafiles, or using several different programs at the
same time. The idea is that by keeping your windows small, you can
easily move from one task to another by moving to a different
window.
Most Unix desktops today give you a second way to add more real estate to your screen. The toolbar at the lower right hand corner of the figure shows the Workspace Switcher. If the current screen gets too cluttered with windows, the workspace switcher lets you move back and forth between several virtual screens at the click of a button. This is a great organizational tool when you have a number of independent jobs going on at the same time.
6. Network-centric Computing - Any user can do anything from anywhere
6.1 Running remote Unix sessions at home or when traveling
Since all Unix and Linux systems are servers, you can always run a Unix session from any computer, anywhere.
see Using Unix from Anywhere
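For example, from any machine with an ssh client, a remote session can be opened with a single command (the userid is a placeholder; the hostname is one of the servers mentioned in section 6.2):

ssh myuserid@mars.cc.umanitoba.ca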
6.2 Uploading and downloading files across the network
Email is usually not the best way to move files across a network. There are better tools for this purpose. On Unix and Linux systems, one of the best tools is Filezilla. Filezilla gives you two panels, one for viewing files on the local system, and the other for viewing files on the remote system. In the example below, the left panel shows folders in the user's local home directory. The right panel shows the user's files on the server. Copying files, or entire directory trees, from one system to the other is as easy as selecting them in one panel and clicking on the appropriate green arrow button. For security, Filezilla uses ssh to encrypt all network traffic, so that no one can eavesdrop on your upload or download. Filezilla is freely available for download at https://filezilla-project.org.
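For quick transfers without a graphical tool, the scp command, which like Filezilla encrypts its traffic using ssh, can copy a file in a single line (the filename, userid, and remote path here are hypothetical):

scp results.txt myuserid@mars.cc.umanitoba.ca:~/project1/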
In the example at right, the user logs into the remote server by filling in the boxes as shown:
Host - name of the remote server (eg. at Univ. of Manitoba, these are mars.cc.umanitoba.ca, venus.cc.umanitoba.ca, jupiter.cc.umanitoba.ca)
Username - your Unix userid
Password - your password
Port - can probably be left blank, but if necessary type '22' to indicate a secure connection on port 22 using ssh.
Click on Quickconnect to login.