KEY CONCEPTS
This section
introduces some concepts that will help you leverage the power of
Unix
to work more efficiently.
1. The Computer: What's under the hood, and when does it matter
2. The Home Directory: Do everything from the comfort of your $HOME
3. Organizing your files: A place for everything, and everything in its place
4. Text files: It's actually quite simple
5. Screen Real Estate: Why one window should not own the screen
6. Network-centric Computing - Any user can do anything from anywhere
1. The Computer: What's under the hood, and when does it matter
1.1 What is Unix/Linux?
Unix is an operating system, that is, an environment that provides
commands for creating, manipulating, and examining datafiles, and for
running programs. Behind the scenes, an operating system also manages
system resources and orchestrates the running of anywhere from dozens
to hundreds of programs that may be running at the same time.
Other operating systems with which you may be familiar include
MS-Windows and Macintosh OS X. Despite their differences, all of these
operating systems do essentially the same thing, which is to act as the
unifying framework within which all tasks are performed.
Unix is usually the system of choice for scientific and mathematical
work, as well as for enterprise-level systems and servers. This is
because Unix was designed as a multitasking, multiuser, networked
system that had to be reliable and responsive under heavy loads, have
24/7 availability, and be highly secure.
MS-Windows was designed as a single-user desktop system, primarily for
running one program at a time. Higher-level capabilities such as
networking, multitasking, support for several simultaneous users, and
server functions have all been retrofitted into Windows. Security has
long been, and still is, a serious problem on the Windows platform.
The Unix family of operating systems includes commercial Unix systems
such as Sun's Solaris, the many different distributions of Linux, most
of which are free, and Apple's proprietary OS X.
1.2 Beyond the standalone PC: The network is the computer
1.2.1 Every PC is a special case

- Each computer is a bit different from every other
- Everything happens on your PC
- Your data tends to be spread out among a number of machines
- Different programs are installed on different machines
- There is no way to log in remotely to most Windows PCs
- How many PCs actually get backed up?
1.2.2. The network is the computer
The standalone PC is only one of many ways of using computer
resources. This figure illustrates the three main functions of
computers: File Services, Processing, and Display. The figure is meant
to be generic. On a PC, all three functions occur in a single machine.
For this reason, a PC is sometimes referred to as a "fat client".

However,
there
is no reason that these functions have to be on the same machine.
For
example, on distributed Unix systems, files reside on a file
server,
processing is done on login hosts, and you can run a desktop
session on
any login host, and the desktop will display on a "thin client".
Because
the thin client does nothing but display the desktop, it
doesn't matter what kind of machine is doing the display. A thin
client
can be a specialized machine, like a SunRay
terminal, or just a PC running thin client software.
A compromise between a thin client and a fat client is the "lean client".
Essentially, a lean client is a computer that carries out both the
Display and Processing functions, but remotely mounts filesystems
from
the fileserver, which behave as if they were on the machine's own
hard
drive. Many computer labs are configured in this way to save on
system
administration work, at the expense of extra network traffic.
Advantages of network-centric computing:
- You can access your data from anywhere
- Full access to the resources of a datacenter from anywhere
- Protection from obsolescence, when you use thin clients
- High availability, because all components are redundant
- There is nothing to lose (e.g. a memory stick) and nothing that can
be stolen (e.g. a laptop)
- Once software is installed, it works for everyone
- Automated backups
One example of network-centric computing is Google Docs. Google Docs
lets you maintain documents, spreadsheets, and presentations online,
using any web browser. Your documents stay on the server, so you can
work on them from any browser on any computer anywhere.
More
and more resources reside on the network. This is now referred to
as "cloud computing":
- Databases - Remote
databases return
data in response to queries from local clients.
- Applications servers -
application
runs on remote host, but displays on local client
- Web services - local
client sends
data to web service; service returns the result of the
computation.
- Computing Grid - Analogous
to an
electrical grid. A community of servers on a high-speed
backbone share
computing resources, including CPU time. Different parts of a
job may
be done on different machines, transparently to the client.
All of network-centric computing can be summarized in a single
sentence:
Any user can do any task from anywhere
1.3 File systems - share or serve?
Unix systems typically include many machines, all of which remotely
mount files from a file server. From the user's point of view, it
looks as if the files are on their own hard drive. There are many
advantages to using a file server. First, all machines on a LAN will
have the same files and directories available, regardless of which
desktop machine you use. Second, a file server makes it possible to
standardize best practices that contribute to data integrity,
including security protocols and scheduled automated backups. Finally,
file servers typically store data redundantly using RAID, protecting
against loss of data due to disk failure.
Many LANs support peer-to-peer file sharing. In file sharing, each PC
on the LAN may have some files or directories that are permitted to be
shared with others. Again, from each user's perspective, it looks as
if the file is on their own hard drive. However, file sharing also
invites many potential security problems. In addition, data integrity
is only as good as the hard drive a file is actually on, and whatever
steps the owner of that PC may or may not have taken to back up files.
more: http://en.wikipedia.org/wiki/Shared_disk_access
1.4 The Unix command line - Sometimes, typing is WAY easier than point and click.
One of the
strengths of Unix is the wealth of commands available. While
typing
commands might seem like a stone-age way to use a computer,
commands
are essential for automating tasks, as well as for working with
large
sets of files, or extracting data from files. For example, when
you use
a DNA sequence to search the
GenBank database for similar sequences, the best matching
sequences are
summarized, as excerpted below:
gb|EU920048.1| Vicia faba clone 042 D02 defensin-like protein mR... 143 1e-32
gb|EU920047.1| Vicia faba clone 039 F05 defensin-like protein mR... 143 2e-32
gb|EU920044.1| Vicia faba clone 004 C04 defensin-like protein mR... 143 2e-32
gb|FJ174689.1| Pisum sativum pathogenesis-related protein mRNA, ... 139 3e-31
gb|L01579.1|PEADRR230B Pisum sativum disease resistance response... 132 4e-29
There are
often
dozens of hits. If you wanted to retrieve all matching sequences
from
NCBI, you would need the accession numbers, found between
the pipe characters "|". Rather than having to copy and paste each
accession number to create a list for retrieval, a file containing
that
list could be created in a single Unix command:
grep 'gb|' AY313169.blast | cut -f2 -d '|' > AY313169.acc
would cut out the accession numbers from AY313169.blast and write them
to a file called AY313169.acc:
EU920048.1
EU920047.1
EU920044.1
FJ174689.1
L01579.1
This list could now be used to retrieve all sequences in one step.
Explanation:
The grep command searches for the string 'gb|' in the file
AY313169.blast, and writes all lines matching that string to
the
output. The next pipe character sends that output to the cut
command.
The cut command splits each line into several fields, using
'|' as a
delimiter between fields. Field 2 from each line is written
to a file
called AY313169.acc.
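To try the command without running a real database search, the sketch below builds a small stand-in for AY313169.blast from three of the hit lines excerpted above, then runs the same grep and cut pipeline on it. The scratch directory, the variable names blast_file and acc_file, and the non-hit first line of the file are assumptions made for illustration only.

```shell
# Work in a scratch directory so that none of your real files are touched.
workdir=$(mktemp -d)
blast_file="$workdir/AY313169.blast"   # hypothetical stand-in for the real report
acc_file="$workdir/AY313169.acc"

# A made-up first line that is not a hit, followed by three hit lines.
cat > "$blast_file" <<'EOF'
(some report text that does not describe a hit)
gb|EU920048.1| Vicia faba clone 042 D02 defensin-like protein mR... 143 1e-32
gb|FJ174689.1| Pisum sativum pathogenesis-related protein mRNA, ... 139 3e-31
gb|L01579.1|PEADRR230B Pisum sativum disease resistance response... 132 4e-29
EOF

# grep keeps only the lines containing 'gb|';
# cut then prints field 2, using '|' as the field delimiter.
grep 'gb|' "$blast_file" | cut -f2 -d '|' > "$acc_file"

# Show the resulting list of accession numbers, one per line.
cat "$acc_file"
```

The first line of the stand-in file is dropped by grep because it does not contain 'gb|', so only the three accession numbers reach the output file.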
If you learn the commands listed below, you will be able to do the vast
majority of
what you need to do on the computer, without having to learn
the
literally thousands of other commands that are present on the system.
| cat | Write and concatenate files |
| cd | Move to new working directory |
| chmod | Change read, write, execute permissions for files |
| cp | Copy files |
| cut | Cut out one or more columns of text from a file |
| grep | Search a file for a string |
| less | View files a page at a time |
| logout | Terminate Unix session |
| lpr | Send files to line printer |
| ls | List files and directories |
| man | Read or find Unix manual pages |
| mkdir | Make a new directory |
| mv | Move files |
| passwd | Change password |
| rm | Remove files |
| rmdir | Remove a directory |
| ps | List processes |
| top | List most CPU-intensive processes |
| kill | Kill a process |
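As a rough illustration of how several of these commands fit together, here is a short practice session. It is only a sketch: the scratch directory and the file names (demo, notes.txt, old_notes.txt, combined.txt) are invented for the example, and everything it creates is removed again at the end.

```shell
# Start in a scratch directory (hypothetical; it stands in for a project folder).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

mkdir demo                              # make a new directory
cd demo || exit 1                       # move into it
printf 'first line\n' > notes.txt       # create a small text file
cp notes.txt backup.txt                 # copy it
mv backup.txt old_notes.txt             # rename (move) the copy
chmod 600 notes.txt                     # only the owner may read and write it
ls                                      # list the files in this directory

# Concatenate both files into a third one, then count the lines
# in the result that contain the string 'first'.
cat notes.txt old_notes.txt > combined.txt
match_count=$(grep -c 'first' combined.txt)
echo "lines containing 'first': $match_count"

# Clean up: remove the files, step back out, and remove the empty directory.
rm notes.txt old_notes.txt combined.txt
cd ..
rmdir demo
```

Note that rmdir only removes a directory that is already empty, which is why the files are deleted with rm first.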
more: UsingUnix
1.5 What do programs actually do?
The
cell is a good analogy for how a computer works. An enzyme
takes a
substrate and modifies it to produce a product. In turn,
any product
might be used as a substrate by another enzyme, to produce
yet another
product. From these simple principles, elaborate
biochemical pathways
can be described.
Similarly, computer programs take input and produce
output. For
example, program 1 might read a genomic DNA sequence and
write the mRNA
sequence to the output. Program 2 might translate the RNA
to protein,
and Program 3 might predict secondary structural
characteristics for
the protein. Alternatively, program 4 might predict
secondary
structures from the mRNA.
The process of chaining together several programs to
perform a complex
task is known as 'data
pipelining'.
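On the Unix command line, a pipeline is written by joining commands with the pipe character, so that the output of one program becomes the input of the next. The sketch below uses standard text utilities as toy stand-ins for the programs described above: the 12-base sequence is made up, and replacing T with U is only a crude imitation of transcription, not a real bioinformatics tool.

```shell
dna="ATGGCCTTTAAG"   # made-up DNA sequence, for illustration only

# Stage 1 (tr):   toy "transcription", replacing every T with U.
rna=$(printf '%s\n' "$dna" | tr 'T' 'U')

# The same first stage, now chained to two more programs:
# Stage 2 (fold): split the RNA into three-letter codons, one per line.
# Stage 3 (wc):   count the codon lines (tr -d strips padding some wc versions add).
codon_count=$(printf '%s\n' "$dna" | tr 'T' 'U' | fold -w 3 | wc -l | tr -d ' ')

echo "RNA:    $rna"
echo "codons: $codon_count"
```

Each stage knows nothing about the others. It simply reads text and writes text, which is what makes it possible to rearrange the stages or swap one program for another.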
One
subtlety that is sometimes missed about computers has to
do with the
roles of random access memory (RAM) and the hard drive. Programs
don't
actually work directly on files that are on the hard
drive. When
you open a file in a program, a copy of that file is read
from disk and
written into memory. All changes that you make to the file
occur on the
copy in memory.
The original copy of the file on disk is not changed until
you save the
file. At that time, the modified copy in memory is
copied back to
disk, overwriting the original copy.
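The same open, edit, and save cycle can be imitated at the command line. In this sketch a shell variable plays the role of the copy in memory. The scratch directory, the file name seq.txt, and its contents are assumptions made for the example.

```shell
workdir=$(mktemp -d)
printf 'ATGC\n' > "$workdir/seq.txt"          # the original file on disk

# "Open": read a copy of the file into memory (here, a shell variable).
contents=$(cat "$workdir/seq.txt")

# "Edit": change the copy in memory. The file on disk is not touched.
contents="${contents}GGTT"
on_disk_before_save=$(cat "$workdir/seq.txt")

# "Save": write the modified copy back, overwriting the original on disk.
printf '%s\n' "$contents" > "$workdir/seq.txt"
on_disk_after_save=$(cat "$workdir/seq.txt")

echo "on disk before save: $on_disk_before_save"
echo "on disk after save:  $on_disk_after_save"
```

Until the save step runs, the file on disk still holds the original four bases, which is why unsaved changes are lost if a program exits unexpectedly.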
2. The Home Directory*: Do everything from the comfort of your $HOME
One of the features of Unix that contributes to its reliability and
security, and to its ease of system administration, is the
compartmentalization of user and system data. The figure below shows
the highest-level directories of the directory tree. To cite a few
examples, /bin contains binary executables, /etc contains system
configuration files, and /usr contains most of the installed
application programs.
One of the most important directories is /home, the directory in which
each user has their own home directory. Rather than having data for
each user scattered across the directory tree, all files belonging to
each user are found in their home directory. For example, all files
belonging to a user named 'homer' are found in /home/homer.
Subdirectories such as 'beer', 'doughnuts', and 'nuclear_waste'
organize his files into topics. Similarly, the home directory for
'bart' is /home/bart, and is organized according to bart's interests.
Most importantly, the only place that homer or bart can create, modify
or delete files is in their own home directories. They cannot change
files anywhere else on the system, and cannot read other users' files,
unless permissions are specifically set to allow them to do this.
Thus, the worst any user can do is to damage their own files, and the
files of every other user are protected.
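Permissions are what enforce this separation, and you can inspect and change them on your own files with ls -l and chmod. The following sketch works on a throwaway file in a scratch directory rather than in a real home directory. The file name private.txt and the variable names are invented for the example.

```shell
# The shell keeps the path of your home directory in the HOME variable.
printf 'Home directory: %s\n' "$HOME"

workdir=$(mktemp -d)
printf 'my private notes\n' > "$workdir/private.txt"

# 600: the owner may read and write; group and others get no access.
chmod 600 "$workdir/private.txt"
private_mode=$(ls -l "$workdir/private.txt" | cut -c1-10)

# 644: the owner may read and write; everyone else may only read.
chmod 644 "$workdir/private.txt"
shared_mode=$(ls -l "$workdir/private.txt" | cut -c1-10)

echo "private: $private_mode"
echo "shared:  $shared_mode"
```

The first ten characters of each ls -l line give the file type followed by the read (r), write (w), and execute (x) permissions for the owner, the group, and all other users, in that order.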
* In Unix, the term directory is synonymous with folder.
The two
can be used interchangeably.
- You usually work in your home directory
- All your data is in your home directory, and nowhere else!
- System directories are world-readable
- Each user can only read/write their own home directory
3. Organizing your files: A place for everything, and everything in its place
Most people
know
about organizing their files into a tree-structured hierarchy of
folders. On Unix you can organize your files using a file manager
such
as Nautilus.

Some
good guidelines to follow:
- Organize your files by topic, not by type. It makes no
sense
to put all presentations in one folder, all images in another
folder,
and all documents in another folder. Any given task or project
will
generate files of many kinds, so it makes sense to put all
files
related to a particular task into a single folder or folder
tree.
- Each time you start a new task or project or experiment,
create a new folder.
- Your home
directory should be mostly composed of subdirectories. Leave
individual
files there only on a temporary basis.
- Directory
organization is for your convenience. Whenever a set of files
all
relate to the same thing, dedicate a directory to them.
- If a directory
gets too big (eg. more files than will fit on the screen when
you type
'ls
-l'), it's time to split it into two or more subdirectories.
- On Unix/Linux, a new account will often have a Documents
directory, which is confusing and makes no sense, since your
HOME
directory already serves the purpose of a Documents directory
in
Windows. It is best to just delete the Documents directory and
work
directly from your HOME directory.
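The same kind of organization can be set up from the command line with mkdir. The sketch below creates a small topic-based folder tree. The project and folder names (defensin_project, blast_search, alignment) are hypothetical, and a scratch directory stands in for your home directory so that nothing real is changed.

```shell
# A scratch directory stands in for $HOME in this example.
base=$(mktemp -d)

# One folder per project, with one subfolder per task within the project.
# The -p option creates any missing parent directories along the way.
mkdir -p "$base/defensin_project/blast_search"
mkdir -p "$base/defensin_project/alignment"

# Files of different types that belong to the same task live side by side.
touch "$base/defensin_project/blast_search/AY313169.blast"
touch "$base/defensin_project/blast_search/notes.txt"

# List the whole tree to check the layout.
ls -R "$base"
```

Because the search results and the notes about them sit in the same folder, everything needed to pick the task up again later is found in one place.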
4. Text files: It's actually quite simple
A text editor is a
program that lets you enter data into
files, and modify it, with a minimal amount of fuss. Text
editors are
distinct from word processors in two crucial ways. First,
the text
editor is a much simpler program, providing none of the
formatting
features (eg. footnotes, special fonts, tables, graphics,
pagination)
that word processors provide. This means that the text
editor is
simpler to learn, and what it can do
is adequate for the task of entering a sequence, changing
a few lines
of
text, or writing a quick note to send by electronic mail.
For these
simple
tasks, it is easier and faster to use a text editor.
Two of the most commonly used text editors with graphic
interfaces are Nedit
and gedit.
Both are available
on most Unix and Linux systems.

Example of a text editor editing a computer-readable file specifying
an alternative genetic code used in flatworm mitochondria.
The second
important difference between word processors and
text editors is the way in which the data is stored. The price you
pay
for
having underlining, bold face, multiple columns, and other
features in
word processors is the embedding of special computer codes within
your
file. If you used a word processor to enter data, your datafile
would
thus also contain these same codes. Consequently, only the word
processor
can directly manipulate the data in that file.
Text editors
offer
a way out of this dilemma, because files
produced by a text editor contain only the characters that appear
on
the screen, and nothing more. These files are sometimes referred
to as
ASCII files,
since they only contain standard ASCII characters.
Generally,
files created by Unix or by other programs are
ASCII files. This seemingly innocuous fact is of great
importance,
because it implies a certain universality of files. Thus,
regardless of
which program or Unix command was used to create a file, it can
be
viewed on the screen ('cat
filename'),
sent to the printer ('lpr filename'), appended
to another file ('cat
filename1 >> filename2'),
or used as input by other programs. More importantly, all ASCII
files
can be edited with any text editor.
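To see this universality in practice, the sketch below creates two small text files and then handles them with unrelated commands. The file contents are made up, and the names filename1 and filename2 simply follow the placeholders used above. Printing with lpr is left out because it needs a configured printer.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create two plain text files.
printf 'line one\nline two\n' > filename1
printf 'existing line\n' > filename2

# View a file on the screen.
cat filename1

# Append the first file to the end of the second.
cat filename1 >> filename2

# Any other text tool can now read the result:
# wc counts its lines (tr -d strips padding some wc versions add),
# and grep counts the lines that contain the string 'line'.
line_count=$(wc -l < filename2 | tr -d ' ')
word_hits=$(grep -c 'line' filename2)

echo "filename2 has $line_count lines, $word_hits of them containing 'line'"
```

None of these commands needs to know which program created the files, only that they contain plain text.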
If you plan to
do a
lot of work at the command line, you will need a text editor that
does
not require a graphic interface. Several common editors include:
- nano - A very simple but not very powerful editor
- vi - The vi editor is the universal screen editor available with
all UNIX implementations.
- emacs - A text editor with many advanced capabilities for
programming; it also has a steep learning curve
5. Screen Real Estate: Why one window should not own the screen.
One of the most counter-productive legacies from the early PC era is
the idea that "One window owns the screen". Many applications start up
taking the entire screen. This made sense when PC monitors were small,
with 800x600 pixel resolution. It makes no sense today, when the trend
is toward bigger monitors with high resolution. The image below shows
a typical Unix screen, in which each window takes just the space it
needs, and no more. Particularly in bioinformatics, you will be
working on a number of different datafiles, or using several different
programs at the same time. The idea is that by keeping your windows
small, you can easily move from one task to another by moving to a
different window.

Most Unix desktops today give you a second way to add more real estate
to your screen. The toolbar at the lower right-hand corner of the
figure shows the Workspace Switcher. If the current screen gets too
cluttered with windows, the Workspace Switcher lets you move back and
forth between several virtual screens at the click of a button. This
is a great organizational tool when you have a number of independent
jobs going on at the same time.
6. Network-centric Computing - Any user can do anything from anywhere
6.1. Running remote Unix sessions at home or when traveling
Since all Unix and Linux systems are servers, you can always run a
Unix
session from any computer, anywhere.
see Using Unix from
Anywhere
6.2. Uploading and downloading files across the network
Email is usually not the best way to move files across a network.
There are better tools for this purpose. On Unix and Linux systems,
one of the best tools is gFTP. gFTP gives you two panels, one for
viewing files on the local system, and the other for viewing files on
the remote system. In the example below, the left panel shows folders
in the user's local home directory. The right panel shows the user's
files on the coe01 server at the University of Calgary. Copying files
or entire directory trees from one system to the other is as easy as
selecting them in one panel and clicking on the appropriate green
arrow button. For security, gFTP can use ssh (its SSH2 protocol
setting) to encrypt all network traffic, so that no one can eavesdrop
on your upload or download.
