Copyright 1995-2004 The Institute for Genomic Research.
All rights reserved.

----------------------------------------
Installation

1. To install Lucy, type "make" in this directory. Fix any compiler or
   makefile incompatibilities, then move the executable "lucy" to your
   local binary directory, such as "/usr/local/bin".

   If your operating system does not support the standard POSIX Pthread
   library, which this version of lucy requires for multi-threading,
   use version 1.16s of lucy, which is also included in this release.
   Just "cd" to its directory and "make" there. It is functionally
   identical to 1.16p except that it won't take advantage of multiple
   CPUs on your machine.

2. Also, move the manual page file "lucy.1" to your local man page
   directory, such as "/usr/local/man/man1". Don't forget to rebuild
   your manual page index, or you will need to type "man -F lucy" each
   time you want to see the Lucy manual page. You can also type
   "man ./lucy.1" in this directory to quickly view the manual page
   and/or make a printout.

3. A Postscript version of the manual page has been included as well.
   To print the manual page from this file, send the Postscript file
   "lucy.ps" to your Postscript-capable printer.

----------------------------------------
Testing

1. To test the correctness of the installed Lucy, type the command

       lucy -v PUC19 PUC19splice atie.seq atie.qul atie.2nd -debug lucy.info

   in this directory. Check the generated files lucy.seq and lucy.qul
   to see if they are reasonably correct (what does that mean? :).
   Also, compare the content of the information file "lucy.info" with
   the included "lucy.debug" file to see if they are the same (using
   the Unix "diff" command). If not, something may be wrong. Note that
   the CLZ fields may differ somewhat when you run lucy on a platform
   other than a Linux PC, but such differences usually won't influence
   the outcome of lucy trimming. See the FAQ below if you are curious.

   (If you have multiple CPUs in your computer, type the same command
   with the additional option "-x CPU_count". You should see a
   dramatic speedup, and you should obtain the same output from lucy
   with or without this option.)

2. Use Lucy on your own data and see if it works as expected.

3. For more information, please see the manual page. I hope this
   program is useful to you.

----------------------------------------
Quality trimming parameters

Note: do NOT turn on Phred trimming if you intend to feed its output
directly to lucy. Trimming in Phred shortens the sequences and can
prevent lucy from seeing the vector fragments at the ends, resulting
in untrimmed vector fragments. Keeping as much data as possible for
lucy is your best strategy. Lucy does a decent job of both quality and
vector trimming when you give it enough data to see the whole picture
of a sequence!

The quality trimming parameters that we use at TIGR depend on a couple
of factors: was the sequence run on an ABI 377 or a 3700, and is the
project a BAC-end project or a non-BAC project. Here are the 4 cases
(actually, the quality trimming is the same for all BAC-end projects
regardless of machine type):

    NonBAC3700="-error 0.025 0.02 -bracket 10 0.02 -window 50 0.03"
    NonBAC377="-error 0.025 0.02 -bracket 10 0.02 -window 50 0.08 10 0.3"
    BAC3700="-error 0.025 0.9 -bracket 10 0.02 -window 50 0.07"
    BAC377="-error 0.025 0.9 -bracket 10 0.02 -window 50 0.07"

The real difference between 377 and 3700 is the issue of whether the
quality values were produced by phred or by TraceTuner. TraceTuner is
a 3rd party product that calls quality values on sequences run on a
3700.

I think that the best parameters to use would be lucy's default set:

    LucyDefault="-error 0.025 0.02 -bracket 10 0.02 -window 50 0.08 10 0.3"

----------------------------------------
Frequently asked questions

1. Why does lucy only provide the coordinates of good quality regions
   instead of directly removing the bad regions of sequences?

   We do not recommend physically removing the bad regions from each
   sequence, because many sequence assembly programs can still benefit
   from these so-called "overhang" regions to improve the chance of
   making a successful assembly. If you must remove those bad regions
   for your purposes, you can use the included simple AWK script
   "zapping.awk" to do it, or write your own scripts.

2. Can I run lucy on my XYZ type machine?

   At TIGR, we run Lucy on a Sun workstation running the Solaris
   operating system (version 5.5.1). Lucy has also been compiled and
   run successfully under the Linux operating system on a PC. Although
   there are no known problems with Lucy under Linux, it has been
   exercised much less under Linux than under Solaris.

   We use the Sun Workshop C compiler for compilation under Solaris,
   and the gcc (Gnu C) compiler under Linux. I do not mean to imply
   that other compilers will not work, but these are the ones that
   have been tried here.

   We have not run Lucy on MacOS or Windows, although we believe
   porting it to these two platforms should not be too difficult,
   since the source code is included. It is very likely that lucy can
   run without any modification under a Windows command shell.

3. What are lucy's memory requirements?

   Lucy's memory requirement is very moderate. The memory usage does
   increase with the number of sequences being trimmed. However, Lucy
   does not read all of the sequence and quality data into memory at
   one time, but rather reads the data from disk as it is needed. For
   detailed information about lucy's memory requirement, see the
   manual page.

4. How can I make lucy talk to my internal database server?

   Lucy does not access (nor depend on) a database server. It reads
   its input from ASCII text files and writes its output to ASCII
   text files as well. The input sequence and quality files are in
   multi-FASTA format, as are the output sequence and quality files.
   It is a design decision to keep lucy free of any site-specific
   assumptions.

   At TIGR, we use a separate program, ricky, to drive lucy and
   provide input/output between lucy and our database infrastructure.
   You may need to design a similar driver program if you wish to
   automatically upload lucy's output to your database.

5. Which base calling software should I use?

   Currently, we are using phred version 0.990722.g as our base caller
   for sequences from the ABI 377 sequencer, and TraceTuner (from
   Paracel) for sequences from the ABI 3700.

   I recommend that you use phred version 0.990722.g or later for 377
   sequences. Some earlier versions of phred produced non-zero quality
   values no smaller than 15. Older versions of lucy tend to overtrim
   sequences from those earlier versions of phred. The latest version
   of phred can be obtained directly from its authors.

6. I've downloaded and installed 'lucy', but I can't get it to produce
   the same debug info as in the distributed 'lucy.debug' file. That,
   according to the READ_ME, means something is wrong, right?

   Short answer: it's not a bug in lucy. The difference in lucy output
   is caused by the different random number sequences generated on the
   two platforms. You can safely use lucy on either platform and it
   should produce correct output.

   Long answer: in lucy's secondary sequence extension module, lucy
   calls a random number generator to determine a real base when it
   sees letters such as N or B in a sequence, which can stand for more
   than one kind of nucleotide. This is safe, since if the sequence is
   of high quality there won't be any N's in its ABI base call anyway.
   It just gives lower quality sequences at the borderline of being
   dropped by lucy a chance to be salvaged if their randomly
   determined DNA sequences match the Phred sequence well. Therefore,
   if the random number generators on two different platforms produce
   different numbers, the converted ABI sequence will be somewhat
   different (at those N bases) and the match result may be different.

   If you look at the diff output between lucy.debug and lucy.info,
   you will notice that most matches reported in the CLZ fields are
   too short to cause any real difference in lucy's final trimming
   output (i.e., CLR). You can think of them as just random noise.

   However, there are indeed three differences in the final lucy
   output file lucy.seq between the two platforms (Linux and Solaris):

   >ATIEG51TR was dropped on the Solaris side but included on the
   Linux side.
   >ATIEO52TF was included on the Solaris side but dropped on the
   Linux side.
   >ATIEO93TF was included on both sides, but its reported good
   regions are different between the two sides:

       < >ATIEO93TF 0 0 0 227 384       Solaris
       ---
       > >ATIEO93TF 0 0 0 49 179        Linux

   These three are the only differences between the outputs of lucy
   running on the two different platforms with the atie test suite.
   If you look at the ABI sequence file (atie.2nd) and find the above
   three sequences in it, you will see a lot of N's scattered around
   their sequences. This is the reason for the difference.

   If you run lucy without the secondary sequence extension step, you
   will get exactly the same output on both platforms (i.e., drop
   atie.2nd from the argument list, then run lucy again). So this is
   really a user choice: do you want to have more usable data included
   by comparing against the ABI sequence, salvaging some data at the
   risk of including some junk, or do you want to have just higher
   quality data at the risk of losing some data that is still useful?

   Perhaps an answer to this is to run lucy with the -inform_me option
   and double check those sequences lucy reports as 'salvaged'; you
   will find the three sequence names mentioned above in the lucy
   report, meaning that lucy knows they are right at the borderline.
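
   If you want to check your own pair of runs in the same way, the
   comparison can be scripted. The following is a minimal Python
   sketch, not part of the lucy distribution. It assumes that each
   header line in lucy.seq looks like the ATIEO93TF lines in the diff
   above, with the sequence name first and the good region given by
   the last two numbers. The function names read_good_regions and
   compare_runs, and the file names in the usage comment, are made up
   for illustration.

```python
def read_good_regions(path):
    """Map each sequence name to its (left, right) good-region coordinates.

    Assumption: header lines look like '>ATIEO93TF 0 0 0 227 384', i.e. the
    name comes first and the last two integers give the good region.
    """
    regions = {}
    with open(path) as handle:
        for line in handle:
            if not line.startswith(">"):
                continue  # sequence data, not a header
            fields = line[1:].split()
            if len(fields) < 3:
                continue  # header without coordinates; skip it
            regions[fields[0]] = (int(fields[-2]), int(fields[-1]))
    return regions


def compare_runs(path_a, path_b):
    """Return (only_in_a, only_in_b, changed) for two lucy.seq files.

    only_in_a / only_in_b are sorted lists of names kept by one run only;
    changed maps a name to its ((left, right) in a, (left, right) in b).
    """
    a = read_good_regions(path_a)
    b = read_good_regions(path_b)
    only_in_a = sorted(set(a) - set(b))
    only_in_b = sorted(set(b) - set(a))
    changed = {
        name: (a[name], b[name])
        for name in sorted(set(a) & set(b))
        if a[name] != b[name]
    }
    return only_in_a, only_in_b, changed


# Example call (hypothetical file names for the two platforms' outputs):
#   only_a, only_b, changed = compare_runs("lucy.seq.solaris", "lucy.seq.linux")
```

   Names that appear in only one of the two lists, or in the changed
   table, are the ones worth checking against the 'salvaged' entries
   in the lucy report.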
   :)

----------------------------------------
File list

Information:
    Copyright       - copyright notice
    HISTORY         - lucy modification history
    README.FIRST    - this file you are reading now
    lucy.1          - lucy's manual page in standard Unix man page format
    lucy.ps         - lucy's manual page in Postscript format

Source files:       - source code used to build the lucy program
    Makefile
    abi.c
    lucy.c
    poly.c
    qual_trim.c
    splice.c
    vector.c
    lucy-1.16s/     - same set of source files for non-parallel lucy

Test files:         - files for testing lucy as mentioned earlier
    PUC19
    PUC19splice
    PUC19splice.for
    PUC19splice.rev
    atie.seq
    atie.qul
    atie.2nd
    pSPORT1splice   - these four files were mentioned in HISTORY
    pSPORT1vector
    ARMTM40TR.seq
    ARMTM40TR.qul