textsearch Wiki The master copies of EMBOSS documentation are available at http://emboss.open-bio.org/wiki/Appdocs on the EMBOSS Wiki. Please help by correcting and extending the Wiki pages. Function Search the textual description of sequence(s) Description textsearch searches for words (specified as a regular expression) in the description text of one or more input sequences. It writes an output file with optional contents such as the name, description and accession number of any sequence whose description line from the annotation matches the search term. Optionally, the search is case-sensitive and the results output as an HTML table. textsearch is convenient for small input files but will be slow for larger files and databases; you should use use SRS or Entrez instead. Algorithm textsearch searches only the description line, not the full sequence annotation. Usage Here is a sample session with textsearch Search for 'lactose': % textsearch "tsw:*" "lactose" Search the textual description of sequence(s) Output file [12s1_arath.textsearch]: ajSeqxrefNewDbS '1-I' 'FT025' Go to the output files for this example Example 2 Search for 'lactose' or 'permease' in E.coli proteins: % textsearch "tsw:*_ecoli" "lactose | permease" Search the textual description of sequence(s) Output file [bgal_ecoli.textsearch]: Go to the input files for this example Go to the output files for this example Example 3 Output a search for 'lacz' formatted with HTML to a file: % textsearch "tembl:*" "lacz" -html -outfile embl.lacz.html Search the textual description of sequence(s) Go to the output files for this example Command line arguments Search the textual description of sequence(s) Version: EMBOSS:6.4.0.0 Standard (Mandatory) qualifiers: [-sequence] seqall (Gapped) sequence(s) filename and optional format, or reference (input USA) [-pattern] string The search pattern is a regular expression. Use a | to indicate OR. For example: human|mouse will find text with either 'human' OR 'mouse' in the text (Any string) [-outfile] outfile [*.textsearch] Output file name Additional (Optional) qualifiers: -casesensitive boolean [N] Do a case-sensitive search -html boolean [N] Format output as an HTML table Advanced (Unprompted) qualifiers: -only boolean [N] This is a way of shortening the command line if you only want a few things to be displayed. Instead of specifying: '-nohead -noname -nousa -noacc -nodesc' to get only the name output, you can specify '-only -name' -heading boolean [@(!$(only))] Display column headings -usa boolean [@(!$(only))] Display the USA of the sequence -accession boolean [@(!$(only))] Display 'accession' column -name boolean [@(!$(only))] Display 'name' column -description boolean [@(!$(only))] Display 'description' column Associated qualifiers: "-sequence" associated qualifiers -sbegin1 integer Start of each sequence to be used -send1 integer End of each sequence to be used -sreverse1 boolean Reverse (if DNA) -sask1 boolean Ask for begin/end/reverse -snucleotide1 boolean Sequence is nucleotide -sprotein1 boolean Sequence is protein -slower1 boolean Make lower case -supper1 boolean Make upper case -sformat1 string Input sequence format -sdbname1 string Database name -sid1 string Entryname -ufo1 string UFO features -fformat1 string Features format -fopenfile1 string Features file name "-outfile" associated qualifiers -odirectory3 string Output directory General qualifiers: -auto boolean Turn off prompts -stdout boolean Write first file to standard output -filter boolean Read first file from standard input, write first file to standard output -options boolean Prompt for standard and additional values -debug boolean Write debug output to program.dbg -verbose boolean Report some/full command line options -help boolean Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose -warning boolean Report warnings -error boolean Report errors -fatal boolean Report fatal errors -die boolean Report dying program messages -version boolean Report version number and exit Input file format textsearch reads one or more nucleotide or protein sequences. The input is a standard EMBOSS sequence query (also known as a 'USA'). Major sequence database sources defined as standard in EMBOSS installations include srs:embl, srs:uniprot and ensembl Data can also be read from sequence output in any supported format written by an EMBOSS or third-party application. The input format can be specified by using the command-line qualifier -sformat xxx, where 'xxx' is replaced by the name of the required format. The available format names are: gff (gff3), gff2, embl (em), genbank (gb, refseq), ddbj, refseqp, pir (nbrf), swissprot (swiss, sw), dasgff and debug. See: http://emboss.sf.net/docs/themes/SequenceFormats.html for further information on sequence formats. Input files for usage example 2 '|' is a sequence entry in the example protein database '|' Output file format Output files for usage example File: 12s1_arath.textsearch # Search for: lactose tsw-id:LACI_ECOLI LACI_ECOLI P03023 Lactose operon repressor tsw-id:LACY_ECOLI LACY_ECOLI P02920 Lactose permease (Lactose-proton symport ) Output files for usage example 2 File: bgal_ecoli.textsearch # Search for: lactose | permease tsw-id:LACI_ECOLI LACI_ECOLI P03023 Lactose operon repressor tsw-id:LACY_ECOLI LACY_ECOLI P02920 Lactose permease (Lactose-proton symport ) Output files for usage example 3 File: embl.lacz.html Search for: lacz tembl-id:J01636 J01636 J01636 E.coli lactose operon with lacI, lacZ, lacY and lacA genes. tembl-id:V00296 V00296 V00296 E. coli gene lacZ coding for beta-galactosidase (EC 3.2.1.23). The first column in the name or ID of each sequence. The remaining text is the description line of the sequence. When the -html qualifier is specified, then the output will be wrapped in HTML tags, ready for inclusion in a Web page. Note that tags such as , , and are not output by this program as the table of databases is expected to form only part of the contents of a web page - the rest of the web page must be supplier by the user. The lines of out information are guaranteed not to have trailing white-space at the end. So if '-nodesc' is used, there will not be any whitespace after the ID name. Data files None. Notes This is a rather slow way to search for text in databases. If you are searching for text in public databases, you should consider using either SRS (http://srs.ebi.ac.uk/) or Entrez ( http://www.ncbi.nlm.nih.gov/Entrez/) References None. Warnings Diagnostic Error Messages None. Exit status It always exits with status 0 Known bugs None. See also Program name Description drtext Get data resource entries complete text entret Retrieves sequence entries from flatfile databases and files ontotext Get ontology term(s) original full text textget Get text data entries Author(s) Gary Williams formerly at: MRC Rosalind Franklin Centre for Genomics Research Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SB, UK Please report all bugs to the EMBOSS bug team (emboss-bug (c) emboss.open-bio.org) not to the original author. History Target users This program is intended to be used by everyone and everything, from naive users to embedded scripts. Comments None