//=====================================================================
//  File:    ABILaneFilter.java
//  Class:   ABILaneFilter
//  Package: AFLPcore
//
//  Author:  James J. Benham
//  Date:    August 10, 1998
//  Contact: james_benham@hmc.edu
//
//  Genographer v1.0 - Computer assisted scoring of gels.
//  Copyright (C) 1998 Montana State University
//
//  This program is free software; you can redistribute it and/or
//  modify it under the terms of the GNU General Public License
//  as published by the Free Software Foundation; version 2
//  of the License.
//
//  This program is distributed in the hope that it will be useful,
//  but WITHOUT ANY WARRANTY; without even the implied warranty of
//  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
//  GNU General Public License for more details.
//
//  You should have received a copy of the GNU General Public License
//  along with this program; if not, write to the Free Software
//  Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
//
//  The GNU General Public License is distributed in the file GPL
//=====================================================================
package AFLPcore;

import java.io.File;
import java.io.RandomAccessFile;
import java.io.IOException;
import java.util.NoSuchElementException;

/**
 * This class reads data from a lane file produced by extracting lanes
 * from a gel run on an ABI377. It has been tested with lanes extracted
 * by GeneScan 2.0. It will probably work with the ABI373 as well, but
 * this has not been tested. This class reads in the processed data, so
 * the lane must be processed as well as simply extracted. Also, it relies
 * on the ABI software to find the peaks in the size standard.
 *

 * <p>
 * It will extract the following pieces of information from the file:
 * <ul>
 *   <li>the data trace, after color separation</li>
 *   <li>the name of the sample</li>
 *   <li>the name of the gel it was run on</li>
 *   <li>the number of the lane on the original gel</li>
 *   <li>the color of the lane size standard</li>
 *   <li>the peaks, as called by the ABI software, of the size standard</li>
 * </ul>
 *

 * <p>
 * This information will be stored in a Lane object,
 * which is used by the program. The peaks read in will be passed to
 * a SizeFunction which will use them to calculate the sizing
 * information for the data. Since the ABI software also calls peaks that
 * are not part of the size standard, the program compares all of the
 * peaks to an internal SizeStandard and uses only the sizes
 * it finds in that internal size standard. For example, the peaks with
 * locations of 50.00, 100.00, and 150.00 bp would be used, but 54.23 would
 * not. (Unless 54.23 was defined as part of the size standard, which can't
 * really happen since the size standard must contain whole values.)
 *

 * <p>
 * The filter has three options that must be set before it can run:
 * <ul>
 *   <li>Data Color</li>
 *   <li>Size Function to use</li>
 *   <li>Size Standard to use</li>
 * </ul>
 * These can be manipulated using getOptions() and
 * setOptions(Option[]). All three options are a list of
 * choices, one of which must be selected. The possible values for the
 * color option are red, blue, green, yellow, and all (which reads every
 * color channel found in the file). The size function and
 * the size standard can be the name of any size function/standard known
 * to the program. This class uses the FeatureList class to
 * retrieve the known functions. Once the options have been set, the
 * readLane method can be called to read the actual file.
 *

 * <h3>The File Format</h3>
 *
 * <p>
 * The first 4 bytes contain the value "ABIF", which indicates
 * that the file is an ABI lane file (I think). The file contains a
 * record structure. Each record is 28 bytes long. The number of records
 * is given in a 32-bit integer at byte 18 (indexed from 0), and the offset
 * from the beginning of the file to the start of the first record is given
 * as a 32-bit integer at byte 26. A record has the following structure:
 *

 *    struct{
 *      byte[4] name;      Four ASCII character name, like "DATA"
 *      int tagNumber;     Distinguishes fields with the same name for 
 *                         example: DATA1, DATA2, ... , DATA12
 *      short data_type;   Denotes the type of data: 4 = integer,
 *                         7 = float, 10 = mm/dd/yy, 11 = hh/mm/ss,
 *                         18 = pascal string, 1024 = some sort of structure
 *      short elementSize; The size of each element.
 *      int numElements;   The number of elements.
 *      int recordLength;  The length of the whole record.
 *      int dataOffset;    The offset from the beginning of the file to
 *                         the start of the record's data, unless the
 *                         recordLength is 4 bytes or less, in which case
 *                         it contains the actual data.
 *      int unknown;       Usually 0, but seems to change with the editing
 *                         of the file.
 *    }
 *
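The record layout above can be sketched as a small decoder. This is only an
illustration under stated assumptions: the class AbiRecordSketch, its field
names, and the made-up sample record are hypothetical and are not part of
this filter. It assumes big-endian byte order, which matches how
RandomAccessFile.readInt() reads the file elsewhere in this class.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of ABILaneFilter.
class AbiRecordSketch {
    static final int RECORD_BYTES = 28;

    final String name;
    final int tagNumber;
    final short dataType;
    final short elementSize;
    final int numElements;
    final int recordLength;
    final int dataOffset;

    // Decodes the record that starts at byte 'start' of 'buf'.
    AbiRecordSketch(ByteBuffer buf, int start) {
        byte[] raw = new byte[4];
        for (int i = 0; i < 4; i++) {
            raw[i] = buf.get(start + i);
        }
        name = new String(raw, StandardCharsets.US_ASCII);
        tagNumber = buf.getInt(start + 4);
        dataType = buf.getShort(start + 8);
        elementSize = buf.getShort(start + 10);
        numElements = buf.getInt(start + 12);
        recordLength = buf.getInt(start + 16);
        dataOffset = buf.getInt(start + 20);
        // The last 4 bytes (the "unknown" field) are ignored.
    }

    // Builds one made-up record so the sketch can run without a real file.
    static ByteBuffer sample() {
        ByteBuffer buf = ByteBuffer.allocate(RECORD_BYTES); // big-endian by default
        buf.put("DATA".getBytes(StandardCharsets.US_ASCII));
        buf.putInt(9);            // tagNumber
        buf.putShort((short) 4);  // data_type
        buf.putShort((short) 2);  // elementSize
        buf.putInt(5000);         // numElements
        buf.putInt(10000);        // recordLength
        buf.putInt(1234);         // dataOffset
        buf.putInt(0);            // unknown
        return buf;
    }

    public static void main(String[] args) {
        AbiRecordSketch r = new AbiRecordSketch(sample(), 0);
        System.out.println(r.name + " tag " + r.tagNumber + " at offset " + r.dataOffset);
    }
}
```

The field widths add up to the 28 bytes stated above (4 + 4 + 2 + 2 + 4 + 4 + 4 + 4),
which is a quick consistency check on the layout.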

 * Most of this information was obtained from Clark Tibbetts's paper
 * (Tibbetts, Clark. "Raw Data File Formats and the Digital and Analog
 * Raw Data Streams of the ABI PRISM DNA Sequencer(c)." 1995).
 *

 * <p>
 * The following records are of interest:
 * <ul>
 *   <li>DATA: This contains the trace data as a sequence of 16-bit integers.
 *       The file can contain up to 12 DATA fields. The first 8 are always
 *       present. The first four represent the raw color data from the
 *       machine. The fifth through eighth represent values for the gel
 *       voltage, gel current, electrophoretic power, and the gel
 *       temperature. The last four are the ones of interest to this
 *       program. They contain the color data after it has been processed
 *       and separated. Note that these will not always exist. For example,
 *       often only certain colors are extracted from a lane. The color
 *       numbers match the constants in this class: blue = 0, green = 1,
 *       yellow = 2, red = 3. For the processed data, the correct tag
 *       number is simply given by 9 + colorNumber.</li>
 *   <li>GELN: This is a pascal-type string representing the gel name.</li>
 *   <li>LANE: This contains the lane number on the original gel. It is
 *       stored as a 16-bit integer in the first 2 bytes of the dataOffset
 *       field.</li>
 *   <li>LANS: This probably contains lots of information, but I don't know
 *       what it is. However, the third and fourth bytes in this structure
 *       give the number of the color for the size standard, as a short
 *       integer. This value is numbered from 1, so if the value there is
 *       stdColor, the standard trace is in DATA(8+stdColor) and
 *       PEAK(stdColor) contains the size standard peaks.</li>
 *   <li>PEAK: This contains a number of peaks as called by the ABI
 *       software. This filter uses it for the size standard information.
 *       See below for a description of the peak data structure.</li>
 *   <li>SpNm: This contains the name of the sample, as a pascal string.</li>
 *   <li>StdF: In some cases, this contains the name of the size standard,
 *       but it seems to be missing in some files, so it is not used by
 *       this filter. (Stored as a pascal string.)</li>
 *   <li>OFFS: This is not used by the program, and seems to be 1000 in most
 *       cases. 1000 is also the difference between the scan number
 *       displayed and the number stored in the peaks. This may have
 *       something to do with where the software thinks the zero point is,
 *       or it may not. It appears to be a single 16-bit integer.</li>
 * </ul>
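Several of the records above (LANE, and very short GELN or SpNm strings)
keep their data inside the 4-byte dataOffset field itself. The sketch below
shows one way to unpack both cases. The class and method names are
hypothetical, and clamping the string length to 3 is a defensive assumption
added here, not something the filter does.

```java
// Hypothetical helper for illustration; not part of ABILaneFilter.
class AbiInlineDataSketch {
    // A 16-bit value (such as LANE) sits in the first two, high-order,
    // bytes of the 32-bit dataOffset field.
    static int inlineShort(int dataOffset) {
        return (dataOffset >>> 16) & 0xFFFF;
    }

    // A short Pascal string: the high-order byte is the length and the
    // remaining three bytes hold the characters in order.
    static String inlinePString(int dataOffset) {
        int length = Math.min((dataOffset >>> 24) & 0xFF, 3); // clamp: assumption
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append((char) ((dataOffset >>> ((2 - i) * 8)) & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(inlineShort(0x00070000));    // lane number 7
        System.out.println(inlinePString(0x02414200));  // length 2, then 'A', 'B'
    }
}
```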

 *
 * <p>
 * A peak in the ABI file is 96 bytes long. The first 4 bytes are used
 * to store the scan number as a 32-bit integer. This scan number is
 * different from the one displayed by the ABI programs. It is 1000 less,
 * but the number 1000 could vary. 1000 is also the value stored in OFFS.
 * The next two bytes are the height, as a 16-bit integer. I don't know
 * what the next 12 bytes are. After that, the peak area is stored as a
 * 32-bit integer. Skip four bytes again. We then have the size of the
 * peak, in bp. This is an IEEE 754 single-precision float.
 *

 *   Value     Start   Length(bytes)    Type
 *   scan        0           4           integer (1000 + this value)
 *   height      4           2           integer
 *   area       18           4           integer
 *   size       26           4           IEEE 754 single-precision float
 *
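The peak layout in the table can be sketched as follows. This is a minimal
illustration, not the filter's own code: the class AbiPeakSketch and the
sample values are made up, and the displayedScan helper simply assumes that
the displayed scan number is the stored value plus the OFFS value, as
described above.

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration; not part of ABILaneFilter.
class AbiPeakSketch {
    static final int PEAK_BYTES = 96;

    final int scan;    // stored scan number (displayed value minus OFFS)
    final int height;  // unsigned 16-bit height
    final int area;
    final float size;  // fragment size in bp

    // Decodes the peak that starts at byte 'start' of 'buf'.
    AbiPeakSketch(ByteBuffer buf, int start) {
        scan = buf.getInt(start);
        height = buf.getShort(start + 4) & 0xFFFF;
        area = buf.getInt(start + 18);
        size = buf.getFloat(start + 26);
    }

    // Scan number as the ABI programs would show it, assuming the offset
    // is the OFFS value (usually 1000).
    int displayedScan(int offs) {
        return scan + offs;
    }

    // Builds one made-up peak so the sketch can run without a real file.
    static ByteBuffer sample() {
        ByteBuffer buf = ByteBuffer.allocate(PEAK_BYTES); // big-endian by default
        buf.putInt(0, 2500);
        buf.putShort(4, (short) 812);
        buf.putInt(18, 40000);
        buf.putFloat(26, 100.0f);
        return buf;
    }

    public static void main(String[] args) {
        AbiPeakSketch p = new AbiPeakSketch(sample(), 0);
        System.out.println(p.size + " bp at displayed scan " + p.displayedScan(1000));
    }
}
```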

 *
 * @see SizeFunction
 * @see SizeStandard
 *
 * @author James J. Benham
 * @version 1.0.0
 * @date August 10, 1998
 */
public class ABILaneFilter extends ImportFilter
{
  // Variables from parent class
  //private protected int filetype;    // the type, see constants above
  //private protected String name;     // the name of this filter
  //private protected String descript; // a brief description
  //private protected File helpFile;   // represents the file that contains
  //                                   // the help info for this filter.

  // Used to identify the different entries of interest in the ABI file and
  // store the index into an array that contains the info.
  private static final int NUM_ENTRIES = 10; // the number of entries.
  private static final int DATA = 0;
  private static final int GELN = 1;
  private static final int LANE = 2;
  private static final int LANS = 3;
  private static final int PEAK1 = 4;  // Keep these 4 in order!
  private static final int PEAK2 = 5;
  private static final int PEAK3 = 6;
  private static final int PEAK4 = 7;
  private static final int SpNm = 8;
  private static final int StdF = 9;

  private ABIIndexEntry entries[];

  /** color channel */
  public static final int YELLOW = 2;
  /** color channel */
  public static final int RED = 3;
  /** color channel */
  public static final int BLUE = 0;
  /** color channel */
  public static final int GREEN = 1;
  /** selects every color channel found in the file */
  public static final int ALL = 4;

  private int colorChannel=0;
  private int stdColorChannel;
  private String standardName;
  private SizeFunction sizeFn;

  /**
   * Creates a new filter to read in ABI lane files.
   */
  public ABILaneFilter()
  {
    // Initialize the variables for this filter.
    // Qualify the constant: the private index constant LANE declared above
    // hides the file-type constant inherited from ImportFilter.
    filetype = ImportFilter.LANE;
    name = "ABI Trace";
    descript = "Reads lane files from ABI 377, not gel files.";
    helpFile = "abitrace.html";

    // Options must be set.
    options = null;
    standardName = "not set";
    sizeFn = null;
  }

  /**
   * Access the name of the filter.
   *
   * @return name of the import filter
   */
  public String getName()
  {
    return name;
  }

  /**
   * Returns the type of input file supported by this filter. In this case
   * ImportFilter.LANE, since the filter reads in lane data.
   *
   * @return constant LANE.
   */
  public int getFileType()
  {
    return filetype;
  }

  /**
   * Retrieves a short, approximately one sentence, description of the filter.
   *
   * @return the description
   */
  public String getDescription()
  {
    return descript;
  }

  /**
   * The help file describes which files the filter reads and the options
   * that this filter accepts.
   *
   * @return the name of the file that contains the help information,
   *         either html or plaintext.
   */
  public String getHelpFile()
  {
    return helpFile;
  }

  /**
   * Returns the options for this filter, which include the color of the
   * data, the size function to use, and the size standard. The first
   * option is the color to read, which can be one of five possible
   * values: Red, Blue, Green, Yellow, or All. The color choice is given as
   * an Option of type CHOICE. The second
   * option is also of type CHOICE. It tells which size
   * method should be used to compute the size of the fragments. Please
   * see the help files and the code for the size functions for a
   * description of how they work. The third option describes the size
   * standard to use. This simply gives the program a list of values.
   * These are stored in a file called "standards.cfg". Possible values
   * for all of these options are read in from the
   * FeatureList class.
   *
   * @return an array containing the options described above.
   *
   * @see Option
   * @see FeatureList
   * @see SizeFunction
   * @see SizeStandard
   */
  public Option[] getOptions()
  {
    Option[] returnOpts = new Option[3];

    // Pick the color
    String[] colors = new String[5];
    colors[RED] = "Red";
    colors[BLUE] = "Blue";
    colors[GREEN] = "Green";
    colors[YELLOW] = "Yellow";
    colors[ALL] = "All";

    Option param = new Option("Color", Option.CHOICE, true, colors, "Blue");
    returnOpts[0] = param;

    // The size function option, possibilities retrieved from the
    // feature list.
    param = new Option("Size Method", Option.CHOICE, true,
                       FeatureList.getSizeMgr().getNames(),
                       FeatureList.getSizeMgr().getDefaultName());
    returnOpts[1] = param;

    // the size standards defined
    try {
      param = new Option("Size Standard", Option.CHOICE, true,
                         FeatureList.getStandardMgr().getNames());
    }
    catch(IOException e) {
      throw new MissingParameterError("Error accessing standards file. " +
                                      e.getMessage());
    }
    returnOpts[2] = param;

    return returnOpts;
  }

  /**
   * Sets the parameters for the filter to the specified values, including
   * color. The color must be set before this filter can run. The option
   * representing the color should have a string value naming the color.
   * The size function must also be set for the filter to work. It
   * must contain the name of a valid SizeFunction. Note that
   * the name is not the class name of the SizeFunction, but
   * the name each SizeFunction stores internally. The
   * third option must also be set.
   *
   * @param opts  an array of length 3 which contains the options
   *     mentioned above and described in getOptions().
   *     The order must be: color, size function, size standard.
   *
   * @exception MissingParameterError occurs when the filter fails to
   *     extract a string from the first option in opts.
   * @exception IllegalArgumentException occurs when a string is found but
   *     cannot be matched to one of the colors: Red, Blue, Green, Yellow,
   *     or All.
   *     It also occurs if an array with length not equal to 3 is given as
   *     opts, or if the specified size function, the second
   *     option, could not be matched to a defined size function.
   */
  public void setOptions(Option[] opts)
  {
    // Check the length.
    if(opts.length != 3)
      throw new IllegalArgumentException("Invalid options for ABI Lane " +
                                         "Filter. 3 options expected, but " +
                                         opts.length + " were provided.");

    // extract the option
    String value = opts[0].getStringValue();

    // store the options
    options = opts;

    // check to make sure we have a string
    if (value == null)
      throw new MissingParameterError("Color not provided as parameter to " +
                                      "ABI Lane Filter.");

    if(value.equalsIgnoreCase("Red"))
      colorChannel = RED;
    else if(value.equalsIgnoreCase("Blue"))
      colorChannel = BLUE;
    else if(value.equalsIgnoreCase("Green"))
      colorChannel = GREEN;
    else if(value.equalsIgnoreCase("Yellow"))
      colorChannel = YELLOW;
    else if(value.equalsIgnoreCase("All"))
      colorChannel = ALL;
    else {
      // didn't match a color, so something is wrong.
      // set the options back to null since the ones we got were no good.
      options = null;

      // and complain
      throw new IllegalArgumentException("Invalid color specified for ABI" +
                                         " Lane Filter.");
    }

    // Next should be the size function
    String sizeFnName = opts[1].getStringValue();
    try {
      sizeFn = (SizeFunction) FeatureList.getSizeMgr().get(sizeFnName);
    }
    catch(NoSuchElementException e) {
      options = null;
      throw new IllegalArgumentException("Invalid sizing function specified" +
                                         " for ABI Lane Filter. ");
    }

    // The final option is the size standard definition
    standardName = opts[2].getStringValue(); // this will be checked later
  }

  /**
   * This is the method that is called to perform the actual reading of the
   * file. The data in the file represents data from a single lane. The
   * options/parameters required for the filter should be set using
   * setOptions, and if they are not, an exception will be
   * thrown.
   *
   * @param inputFile  The file that contains the lane data.
   *
   * @return an array of Lane objects with all of the appropriate
   *     information. It holds a single lane, unless the color option is
   *     "All", in which case it has four slots and unused slots are null.
   *
   * @exception MissingParameterError occurs if the options are not
   *     set. Since this includes the required color, the filter cannot
   *     read in the lane.
   * @exception IOException If an error is encountered in the file,
   *     then this exception will be thrown.
   */
  public Lane [] readLane(File inputFile) throws IOException
  {
    Lane newLane;
    Lane [] laneArray;
    int numOfLanes;
    boolean allChannels;
    long indexOffset;
    long indexLength;
    DataList stdPoints;
    int peakIndex;
    SizeStandard sizeStd;
    SizeFunction sizeFn;

    entries = null;

    // Make sure we have options set, including the color channel
    if(options == null)
      throw new MissingParameterError("The color for the filter must be " +
                                      "set before the filter can work.");

    // Open the file. Set the mode to read only.
    RandomAccessFile in = new RandomAccessFile(inputFile, "r");

    // Check the file type. They all seem to start with "ABIF", which
    // becomes 0x41424946 in hex.
    int magicNum = in.readInt();
    if( magicNum != 0x41424946)
      throw new IOException("This does not appear to be an ABI lane file." +
                            " See help for more info.");

    // Get the length of the index of types.
    in.seek(18);
    indexLength = (long) in.readInt();

    // Get the location of the index.
    in.seek(26);
    indexOffset = (long) in.readInt();

    //Added 6/25/2001 by Philip DeCamp
    if(colorChannel == ALL){
      laneArray = new Lane[4];
      for(int i =0 ; i < 4; i++)
        laneArray[i] = null;
      allChannels = true;
      colorChannel = -1;
    }
    else{
      laneArray = new Lane[1];
      allChannels = false;
    }

    for(int i = 0; i < 4; i++) {
      if(allChannels){
        //Goes through and finds a valid color channel
        for(;;){
          colorChannel++;
          if(colorChannel > 3)
            break;

          entries = readRecords(indexOffset, indexLength, in);
          try{
            checkForColor();
            in.seek(entries[LANS].dataOffset + 2);
            stdColorChannel = in.readUnsignedShort() - 1;
            if(stdColorChannel != colorChannel)
              break;
          }
          catch(Exception e){
            // This color is not present in the file, skip it
            // and take no other action.
          }
        }

        // If the colorChannel is this high, it means that all channels
        // have been checked.
        if(colorChannel > 3)
          break;
      }
      else{
        entries = readRecords(indexOffset, indexLength, in);
        checkForColor();

        // Read in the color channel of the size standard. It is located
        // at bytes 3 and 4 of the entry pointed to by LANS. It is in the
        // form of an unsigned short.
        in.seek(entries[LANS].dataOffset + 2);
        stdColorChannel = in.readUnsignedShort() - 1;
      }

      // Move to the location of the Data
      in.seek(entries[DATA].dataOffset);
      int traceSize = (int) entries[DATA].numElements;
      double [] trace = new double[traceSize];
      for(int j= 0; j < entries[DATA].numElements; j++)
        trace[j] = (double) in.readUnsignedShort();
      newLane = new Lane(trace);

      // Read in the Gel name.
      if(entries[GELN].numElements > 4)
        newLane.setGelName(readPString(entries[GELN].dataOffset, in));
      else
        newLane.setGelName(readPString(entries[GELN].dataOffset));

      // Read in the Sample name.
      if(entries[SpNm].numElements > 4)
        newLane.setName(readPString(entries[SpNm].dataOffset, in));
      else
        newLane.setName(readPString(entries[SpNm].dataOffset));

      // Read in the Lane number
      // In this case, the offset is actually the data since the values
      // are so small. The number is stored two bytes up from the end of
      // the long, so shift it so that we get the correct value.
      newLane.setLaneNumber( (int)(entries[LANE].dataOffset >> 16) );

      // This doesn't seem to work for every file, so just let the user pick
      // it for now.
      // Read in the name of the size standard used.
      // if(entries[StdF].numElements > 4)
      //   standardName = readPString(entries[StdF].dataOffset, in);
      // else
      //   standardName = readPString(entries[StdF].dataOffset);

      // Select the correct peak entry.
      peakIndex = PEAK1 + stdColorChannel;

      //=========== Read in the standard peaks===========
      // get the size standard.
      try{
        sizeStd = ((SizeStandard)
                   FeatureList.getStandardMgr().get(standardName));
      }
      catch(NoSuchElementException e) {
        throw new IOException("Unknown size standard! '" + standardName + "'");
      }

      stdPoints = new DataList();
      Peak pk;
      for(int j=0; j < entries[peakIndex].numElements; j++) {
        in.seek(entries[peakIndex].dataOffset + j*96);
        pk = readPeak(in);
        if(sizeStd.contains(pk.getLocation())){
          stdPoints.addData(pk);
        }
      }

      // Set the color channel
      newLane.setColor(colorChannel);

      //================= set the size function ==============
      String sizeName = options[1].getStringValue();
      sizeFn = (SizeFunction) FeatureList.getSizeMgr().get(sizeName);
      sizeFn = (SizeFunction) sizeFn.clone();
      sizeFn.init(stdPoints);
      sizeFn.setMaxScan(newLane.getNumPoints() - 1);
      newLane.setSizeFunction(sizeFn);

      laneArray[i] = newLane;

      if(!allChannels)
        break;
    }

    //==================clean up=============================
    in.close();

    /*=================DEBUG===================*/
    //System.out.println("Gel Name is: " + newLane.getGelName());
    //System.out.println("Sample Name is: " + newLane.getName());
    //System.out.println("Lane number is: " + newLane.getLaneNumber());
    //System.out.println("Standard name is: " + standardName);
    //System.out.println("std color is: " + stdColorChannel);
    /*=================DEBUG===================*/

    if(allChannels){
      for(int i = 0; i < 4; i++)
        if(laneArray[i] != null){
          allChannels = false;
          break;
        }
      if(allChannels)
        throw new IOException("No Color Channels Found");
    }

    return laneArray;
  }

  /**
   * This filter does not read gels.
   *
   * @return Always null
   */
  public Gel readGel(File inputFile) throws IOException
  {
    return null;
  }

  /**
   * Parses the records portion of the ABI file to gather information about
   * several different data structures in the file. The important pieces of
   * information are where the data structure is in the file and how big
   * it is. All of the records start with a four character ASCII value.
   * The records of interest are DATA, which stores the trace information;
   * SpNm, which is the name of the sample; GELN, which is the name of the
   * gel on which the sample was run; and LANE, which stores the lane
   * number of the sample.
   * In some cases, these identifiers are repeated.
   * For example, a file can have up to 12 DATA entries, but each one has
   * a tag number to separate it. Only one of these contains the information
   * we want.
   *
   * The parser works by converting strings like "DATA" into a long value,
   * which is the ASCII representation of the string. It then compares
   * this to the first four bytes of each record. On a match, it will
   * look at the rest of the record and decide to either keep it or discard
   * it. More details are in the code for those interested.
   *
   * @param indexOffset  the location from the beginning of the file to
   *     the start of the index of records, i.e., the location of the first
   *     record
   * @param indexLength  the number of records in the index
   * @param in           the ABI trace file.
   *
   * @return the records of interest stored in ABIIndexEntry(s). Only
   *     the portions of the records that are needed are returned.
   *
   * @exception IOException could come from RandomAccessFile methods or
   *     from the method itself MODIFY after completion
   */
  private ABIIndexEntry[] readRecords(long indexOffset, long indexLength,
                                      RandomAccessFile in)
    throws IOException
  {
    // Create the structure to hold the records for the entries.
    ABIIndexEntry record[] = new ABIIndexEntry[NUM_ENTRIES];

    // Add the values that we are looking for.
    // In some cases, we are looking for certain records. For example:
    // the correct color channel's data entry is 9 + colorChannel
    // which represents the matrix corrected data in the file.
    // The file actually contains anywhere from 8-12 DATA records,
    // but the tag number determines which one we want.
    record[DATA] = new ABIIndexEntry("DATA", 9 + colorChannel);
    // Go for the raw data...
    //record[DATA] = new ABIIndexEntry("DATA", 1 + colorChannel);
    record[GELN] = new ABIIndexEntry("GELN");
    record[LANE] = new ABIIndexEntry("LANE");
    record[LANS] = new ABIIndexEntry("LANS");
    record[PEAK1] = new ABIIndexEntry("PEAK", 1);
    record[PEAK2] = new ABIIndexEntry("PEAK", 2);
    record[PEAK3] = new ABIIndexEntry("PEAK", 3);
    record[PEAK4] = new ABIIndexEntry("PEAK", 4);
    record[SpNm] = new ABIIndexEntry("SpNm");
    record[StdF] = new ABIIndexEntry("StdF");

    // Variables to temporarily hold the record info while we decide if the
    // record is valid.
    long nameKey;
    long tag;
    long numElem;
    long offset;

    // Go to the start of the index
    in.seek(indexOffset);

    // Look for records that begin with the name that we're interested in
    // and then look at that record more carefully. If it doesn't look valid,
    // we won't copy the temporary values to anything permanent.
    for(int count=0; count < indexLength; count++) {
      // read in the name
      nameKey = (long) in.readInt();

      // Read in the other info
      tag = (long) in.readInt();
      in.skipBytes(4);        // skip to the next interesting part
      numElem = (long) in.readInt();
      in.skipBytes(4);        // skip this too
      offset = (long) in.readInt();

      // Now see if the name matches any of the ones we're looking for by
      // comparing it to each entry in the array.
      for(int i=0; i < NUM_ENTRIES; i++) {
        if(nameKey == record[i].nameKey) {
          // Make sure we have the data record we want.
          // If the tag is set, check to see if it matches.
          if( (offset != 0) &&
              !((record[i].matchTagNumber()) &&
                (tag != record[i].tagNumber))) {
            // Make sure the data points to something
            // good. There seem to be a lot of entries that
            // don't point to anything. For example, some
            // files contain multiple SMPL entries, but only
            // one of these has a non-null data pointer. If
            // the record points to something, store the
            // temporary values.
            record[i].tagNumber = tag;
            record[i].numElements = numElem;
            record[i].dataOffset = offset;

            // we only need to store it once, so don't go through
            // the inner loop extra times.
            break;
          }
        } //if (name matches one we're interested in.
      } // for(every entry in the record array)

      // Move to the next record. Record (count + 1) starts 28 bytes after
      // record count; seeking to count*28 would re-read the record just
      // processed and never reach the last one.
      in.seek(indexOffset + (count + 1)*28);
    } // for every record

    return record;
  }

  /**
   * Check to make sure we found the color channel. If we didn't, throw an
   * exception. This could happen because not every file has every channel
   * for the processed color data. In this case, the offset will be zero
   * since it was never assigned a value.
   *
   * @exception IOException occurs when the filter cannot find the color
   *     channel specified with setOptions in the file.
   */
  private void checkForColor() throws IOException
  {
    if( entries[DATA].dataOffset == 0) {
      String errorMsg="";
      switch(colorChannel) {
        case RED:
          errorMsg = "red";
          break;
        case BLUE:
          errorMsg = "blue";
          break;
        case GREEN:
          errorMsg = "green";
          break;
        case YELLOW:
          errorMsg = "yellow";
          break;
      }

      errorMsg = "Could not find the color " + errorMsg + " in the file.";
      throw new IOException(errorMsg);
    }
  }

  /**
   * Read in a peak from the file. A peak in the ABI file is 96 bytes
   * long. The first 4 bytes are used to store the scan number as a 32-bit
   * integer. This scan number is different from the one displayed by the
   * ABI programs. It is 1000 less, but the number 1000 could vary. 1000 is
   * also the value stored in OFFS. The next two bytes are the height, as
   * a 16-bit integer. I don't know what the next 12 bytes are. After that,
   * the peak area is stored as a 32-bit integer. Skip four bytes again.
   * We then have the size of the peak, in bp. This is an IEEE 754
   * single-precision float.
   *

   *   Value     Start   Length(bytes)    Type
   *   scan        0           4           integer (1000 + this value)
   *   height      4           2           integer
   *   area       18           4           integer
   *   size       26           4           IEEE 754 single-precision float
   *

   *
   * @param in  the input source
   *
   * @return a peak, with the size/location and height read from the file
   *     and the area set as the scan number, not the area.
   *
   * @exception IOException occurs if the file cannot be read.
   */
  public Peak readPeak(RandomAccessFile in) throws IOException
  {
    int scan;
    int height;
    int area;
    double size;

    scan = in.readInt();
    height = in.readUnsignedShort();
    in.skipBytes(12);
    area = in.readInt();   // read to advance the file; not stored in the Peak
    in.skipBytes(4);
    size = in.readFloat();

    return new Peak(size, (double) height, (double) scan);
  }

  /**
   * Read in a Pascal type string, where the first byte is the length of
   * the string, and the rest are the characters. This is accomplished by
   * reading in the bytes (unsigned) and converting them into a character
   * array of the correct length, and then turning the character array
   * into a String.
   *
   * @param location  where the string is in the file, relative to the
   *     beginning of the file.
   * @param in        the file with the information.
   *
   * @return the string as a String object.
   *
   * @exception IOException occurs if location can not
   *     be reached for some reason.
   */
  String readPString(long location, RandomAccessFile in) throws IOException
  {
    // Move to the correct location
    in.seek(location);

    // Read in the length and set up the array
    int length = in.readUnsignedByte();
    char gelname[] = new char[length];

    // fill the array.
    for(int i=0; i < length; i++)
      gelname[i] = (char) in.readUnsignedByte();

    return new String(gelname);
  }

  /**
   * This converts a long integer into a string. This is used when the
   * dataOffset contains the actual data. This will happen if it is
   * an extremely short string, 3 characters or fewer. The format is as
   * follows: bits 24-31 contain the length of the string, and the bits
   * following contain the characters in sequence. It is perhaps easier to
   * think of it as 8 bytes. The four high-order bytes are not used. A long
   * is used to store the original 32-bit data so sign wrapping can be
   * avoided. Number the lower 4 bytes 3, 2, 1 and 0.
   * With 0 being the low-order byte,
   * the length of the string is stored in byte 3, while the characters are
   * stored in 2, 1, and 0 as necessary.
   *
   * @param stringBits  a data structure matching that specified above.
   *
   * @return the string contained in the bits
   */
  String readPString(long stringBits)
  {
    // Mask to one byte so a sign-extended value cannot inflate the length.
    int length = (int) ((stringBits >>> 24) & 0xff);
    char name[] = new char[length];

    // fill the array.
    for(int i=0; i < length; i++)
      name[i] = (char) ((stringBits >>> ( (2 - i)*8 )) & 0x00000000000000ff);

    return new String(name);
  }
}