notseq

 

Function

Excludes a set of sequences and writes out the remaining ones

Description

When you have a set of sequences (for example, a file containing multiple sequences) and you wish to remove one or more of them from the set, use notseq.

This program was written for the case where a file containing several sequences is being used as a small database, but some of the sequences are no longer required and must be deleted from the file.

notseq splits the input sequences into those that you wish to keep and those you wish to exclude.

notseq takes a set of sequences as input together with a list of sequence names or accession numbers. It also takes the name of a new file to which the sequences you want to keep are written and, optionally, the name of a file that will receive the sequences you want excluded from the set.

notseq then reads in the input sequences. It outputs the ones that match one of the sequence names or accession numbers to the file of excluded sequences, and those that don't match are output to the file of sequences to be kept.

Note that the names of the sequences to be excluded are not standard EMBOSS USAs. Only the name or accession number should be specified, not the database or file that these entries may occur in. These excluded sequence names are matched against the names of the input sequences. Wildcarded names may be specified by using '*'. Any specified names of sequences to be excluded that are not found are simply ignored.
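The matching rule described above can be illustrated with a minimal Python sketch. This is not the notseq source code: the function name split_by_name and the (name, sequence) record layout are hypothetical, matching against accession numbers is left out, and the standard-library fnmatch module is used as a stand-in for notseq's own wildcard handling (fnmatch also understands '?' and '[...]', which this document does not claim for notseq).

```python
from fnmatch import fnmatchcase


def split_by_name(records, patterns):
    """Partition (name, sequence) records into (kept, excluded) lists.

    A record is excluded when its name matches any pattern. Names are
    compared case-independently, and patterns that match nothing are
    simply ignored.
    """
    lowered = [p.lower() for p in patterns]
    kept, excluded = [], []
    for name, seq in records:
        if any(fnmatchcase(name.lower(), p) for p in lowered):
            excluded.append((name, seq))
        else:
            kept.append((name, seq))
    return kept, excluded


# Hypothetical, truncated records used only for illustration.
records = [
    ("HBB_HUMAN", "VHLTPEEK"),
    ("MYG_PHYCA", "VLSEGEWQ"),
    ("LGB2_LUPLU", "GALTESQA"),
]
kept, excluded = split_by_name(records, ["hb*", "no_such_name"])
print([name for name, _ in kept])      # ['MYG_PHYCA', 'LGB2_LUPLU']
print([name for name, _ in excluded])  # ['HBB_HUMAN']
```

The unmatched pattern 'no_such_name' has no effect, which mirrors the statement that names not found are ignored.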

Usage

Here is a sample session with notseq

In this case the excluded sequences (myg_phyca and lgb2_luplu) are not saved to any file:


% notseq 
Excludes a set of sequences and writes out the remaining ones
Input sequence(s): ../../data/globins.fasta
Sequence names to exclude: myg_phyca,lgb2_luplu
Output sequence [hbb_human.fasta]: mydata.seq


Example 2

Here is an example where the sequences to be excluded are saved to another file:


% notseq -junkout hb.seq 
Excludes a set of sequences and writes out the remaining ones
Input sequence(s): ../../data/globins.fasta
Sequence names to exclude: hb*
Output sequence [hbb_human.fasta]: mydata.seq


Command line arguments

   Mandatory qualifiers:
  [-sequence]          seqall     Sequence database USA
  [-exclude]           string     Enter a list of sequence names or accession
                                  numbers to exclude from the sequences read
                                  in. The excluded sequences will be written
                                  to the file specified in the 'junkout'
                                  parameter. The remainder will be written out
                                  to the file specified in the 'outseq'
                                  parameter.
                                  The list of sequence names can be separated
                                  by either spaces or commas.
                                  The sequence names can be wildcarded.
                                  The sequence names are case independent.
                                  An example of a list of sequences to be
                                  excluded is:
                                  myseq, hs*, one two three
                                  a file containing a list of sequence names
                                  can be specified by giving the file name
                                  preceeded by a '@', eg: '@names.dat'
  [-outseq]            seqoutall  Output sequence(s) USA

   Optional qualifiers:
   -junkout            seqoutall  This file collects the sequences which you
                                  have excluded from the main output file of
                                  sequences.

   Advanced qualifiers: (none)
   General qualifiers:
  -help                boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose


Mandatory qualifiers

[-sequence] (Parameter 1)
    Description:    Sequence database USA
    Allowed values: Readable sequence(s)
    Default:        Required

[-exclude] (Parameter 2)
    Description:    Enter a list of sequence names or accession numbers to exclude from the sequences read in. The excluded sequences will be written to the file specified in the 'junkout' parameter. The remainder will be written out to the file specified in the 'outseq' parameter. The list of sequence names can be separated by either spaces or commas. The sequence names can be wildcarded. The sequence names are case independent. An example of a list of sequences to be excluded is: myseq, hs*, one two three. A file containing a list of sequence names can be specified by giving the file name preceded by a '@', e.g. '@names.dat'.
    Allowed values: Any string is accepted
    Default:        An empty string is accepted

[-outseq] (Parameter 3)
    Description:    Output sequence(s) USA
    Allowed values: Writeable sequence(s)
    Default:        <sequence>.format

Optional qualifiers

-junkout
    Description:    This file collects the sequences which you have excluded from the main output file of sequences.
    Allowed values: Writeable sequence(s)
    Default:        /dev/null

Advanced qualifiers

(none)

Input file format

notseq reads normal sequence USAs.

Input files for usage example

File: ../../data/globins.fasta

>HBB_HUMAN Sw:Hbb_Human => HBB_HUMAN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>HBB_HORSE Sw:Hbb_Horse => HBB_HORSE
VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGNPKV
KAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDPENFRLLGNVLVVVLARHFGK
DFTPELQASYQKVVAGVANALAHKYH
>HBA_HUMAN Sw:Hba_Human => HBA_HUMAN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>HBA_HORSE Sw:Hba_Horse => HBA_HORSE
VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHGK
KVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTPA
VHASLDKFLSSVSTVLTSKYR
>MYG_PHYCA Sw:Myg_Phyca => MYG_PHYCA
VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED
LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP
GDFGADAQGAMNKALELFRKDIAAKYKELGYQG
>GLB5_PETMA Sw:Glb5_Petma => GLB5_PETMA
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
ADQLKKSADVRWHAERIINAVNDAVASMDDTEKMSMKLRDLSGKHAKSFQVDPQYFKVLA
AVIADTVAAGDAGFEKLMSMICILLRSAY
>LGB2_LUPLU Sw:Lgb2_Luplu => LGB2_LUPLU
GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSEVPQNNPEL
QAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVADAHFPVVKEAILKTIKE
VVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA

The names (or accession numbers) of the sequences to be excluded can be entered as a file of such names by specifying an '@' followed by the name of the file containing the sequence names. For example: '@names.dat'.
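As a rough illustration of how such an exclusion string might be interpreted, here is a small Python sketch. It is not taken from notseq itself: parse_exclude is a hypothetical helper, and the layout of the list file (names separated by whitespace or commas, for example one per line) is an assumption, since this document does not specify it.

```python
import re
from pathlib import Path


def parse_exclude(spec):
    """Return the names given in an exclusion string or an '@file' reference."""
    spec = spec.strip()
    if spec.startswith("@"):
        # Assumption: the list file uses the same separators as the
        # command-line string (whitespace or commas).
        text = Path(spec[1:]).read_text()
    else:
        text = spec
    # Split on any run of commas and/or whitespace, dropping empty items.
    return [name for name in re.split(r"[,\s]+", text) if name]


print(parse_exclude("myseq, hs*, one two three"))
# ['myseq', 'hs*', 'one', 'two', 'three']
```

Calling parse_exclude('@names.dat') would read the names from the file names.dat in the current directory under the same assumption.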

The names or accession numbers of the sequences to be excluded are not standard EMBOSS USAs. Only the ID name or accession number can be specified; you cannot specify the sequences as 'database:ID', 'file:accession', 'format::file', etc.

Output file format

notseq writes a normal sequence file.

Output files for usage example

File: mydata.seq

>HBB_HUMAN Sw:Hbb_Human => HBB_HUMAN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>HBB_HORSE Sw:Hbb_Horse => HBB_HORSE
VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGNPKV
KAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDPENFRLLGNVLVVVLARHFGK
DFTPELQASYQKVVAGVANALAHKYH
>HBA_HUMAN Sw:Hba_Human => HBA_HUMAN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>HBA_HORSE Sw:Hba_Horse => HBA_HORSE
VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHGK
KVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTPA
VHASLDKFLSSVSTVLTSKYR
>GLB5_PETMA Sw:Glb5_Petma => GLB5_PETMA
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
ADQLKKSADVRWHAERIINAVNDAVASMDDTEKMSMKLRDLSGKHAKSFQVDPQYFKVLA
AVIADTVAAGDAGFEKLMSMICILLRSAY

Output files for usage example 2

File: hb.seq

>HBB_HUMAN Sw:Hbb_Human => HBB_HUMAN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>HBB_HORSE Sw:Hbb_Horse => HBB_HORSE
VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGNPKV
KAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDPENFRLLGNVLVVVLARHFGK
DFTPELQASYQKVVAGVANALAHKYH
>HBA_HUMAN Sw:Hba_Human => HBA_HUMAN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>HBA_HORSE Sw:Hba_Horse => HBA_HORSE
VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHGK
KVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTPA
VHASLDKFLSSVSTVLTSKYR

File: mydata.seq

>MYG_PHYCA Sw:Myg_Phyca => MYG_PHYCA
VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED
LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP
GDFGADAQGAMNKALELFRKDIAAKYKELGYQG
>GLB5_PETMA Sw:Glb5_Petma => GLB5_PETMA
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
ADQLKKSADVRWHAERIINAVNDAVASMDDTEKMSMKLRDLSGKHAKSFQVDPQYFKVLA
AVIADTVAAGDAGFEKLMSMICILLRSAY
>LGB2_LUPLU Sw:Lgb2_Luplu => LGB2_LUPLU
GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSEVPQNNPEL
QAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVADAHFPVVKEAILKTIKE
VVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA

Data files

None.

Notes

Note that the names or accession numbers of the sequences to be excluded are not standard EMBOSS USAs. Only the ID name or accession number can be specified; you cannot specify the sequences as 'database:ID', 'file:accession', 'format::file', etc.

References

None.

Warnings

None.

Diagnostic Error Messages

If no matches are found to any of the specified sequence names, the message "This is a warning: No matches found." is displayed.

Exit status

It exits with a status of 0 unless no matches are found to any of the specified sequence names, in which case it exits with a status of -1.
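A script that wraps notseq could use the exit status to detect the "no matches" case. The following Python sketch shows one way to do this. The helper names are hypothetical, only the qualifiers listed under "Command line arguments" are used, and it assumes that notseq is on the PATH and does not prompt when all mandatory qualifiers are given. On POSIX systems an exit status of -1 is usually reported to the caller as 255, so the sketch simply treats any non-zero status as "no matches or failure".

```python
import subprocess


def build_notseq_command(sequence, exclude, outseq, junkout=None):
    """Assemble the notseq argument list from the documented qualifiers."""
    cmd = ["notseq", "-sequence", sequence, "-exclude", exclude, "-outseq", outseq]
    if junkout is not None:
        # Optional file that collects the excluded sequences.
        cmd += ["-junkout", junkout]
    return cmd


def matches_found(returncode):
    """Interpret the exit status: 0 means at least one name matched."""
    return returncode == 0


def run_notseq(sequence, exclude, outseq, junkout=None):
    """Run notseq and report whether any sequence name matched."""
    result = subprocess.run(build_notseq_command(sequence, exclude, outseq, junkout))
    return matches_found(result.returncode)


# Only builds and prints the command; nothing is executed here.
print(build_notseq_command("globins.fasta", "hb*", "mydata.seq", "hb.seq"))
```

Passing the arguments as a list rather than a shell string means the '*' in a wildcarded name is not expanded by the shell.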

Known bugs

None.

See also

Program name  Description
biosed        Replace or delete sequence sections
cutseq        Removes a specified section from a sequence
degapseq      Removes gap characters from sequences
descseq       Alter the name or description of a sequence
entret        Reads and writes (returns) flatfile entries
extractfeat   Extract features from a sequence
extractseq    Extract regions from a sequence
listor        Writes a list file of the logical OR of two sets of sequences
maskfeat      Mask off features of a sequence
maskseq       Mask off regions of a sequence
newseq        Type in a short new sequence
noreturn      Removes carriage return from ASCII files
nthseq        Writes one sequence from a multiple set of sequences
pasteseq      Insert one sequence into another
revseq        Reverse and complement a sequence
seqret        Reads and writes (returns) sequences
seqretsplit   Reads and writes (returns) sequences in individual files
skipseq       Reads and writes (returns) sequences, skipping the first few
splitter      Split a sequence into (overlapping) smaller sequences
trimest       Trim poly-A tails off EST sequences
trimseq       Trim ambiguous bits off the ends of sequences
union         Reads sequence fragments and builds one sequence
vectorstrip   Strips out DNA between a pair of vector sequences
yank          Reads a sequence range, appends the full USA to a list file

Author(s)

This application was written by Gary Williams (gwilliam@hgmp.mrc.ac.uk)

History

Written (9 Jan 2001) - Gary Williams

Added ability to specify names to exclude as a list file (June 2002) - Gary Williams

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments