NAME

gbenchmacro - Genome Workbench Macro

SYNOPSIS

gbenchmacro [-h] [-help] [-p Input Directory] [-i InFile] [-d Output Directory] [-o OutFile] [-r] [-a a] [-b] [-f] [-m InFile] [-e Macro1[,...,MacroN] | all] [-y InFile] [-s String] [-x String] [-l SeqId1[,...,SeqIdN]] [-logfile File_Name] [-version] [-dryrun]

DESCRIPTION

gbenchmacro is a command-line tool to run macros on annotated biological sequence data represented according to NCBI's ASN.1 specifications. The input files can be either text or binary ASN.1 files, containing one of the following structures (see parameter -a): Seq-entry, Bioseq, Bioseq-set or Seq-submit.

A complete list of ASN.1 types and their definitions can be found at: https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/asn_spec/

The program executes macro(s) on ASN.1 file(s) (see parameter -e). The macros are defined in an external file. If a macro file is not specified (using parameter -m), the program uses the default macro library, list_of_macros.txt. Use the macros in that file as samples for writing your own macros.

The macros can perform various editing actions. Typical uses are to correct typos in the input file, fix date formats, convert from one object to another, remove duplicate features and protein products, etc. Below is the list of functions that can be used in a macro.

For all types of qualifiers:

Description Function Example
Apply action – uses a single field, an input value to be applied to the field, and information about how to handle existing text present in the field (replace, prepend with a delimiter, append with a delimiter, no action).
SetStringQual(field_name, new_value, existing_text, delimiter);

Existing_text can be one of:
"eLeaveOld"
"ePrepend"
"eAppend"
"eReplace"

Delimiter may be omitted, or can be one of:
" "  ";"  ","  ":"
MACRO Uncultured_Taxname "Apply uncultured to taxname"
VAR
 value = "Uncultured"
 handle_existing_text = "ePrepend"
 delimiter = %;%
FOR EACH BioSource
DO
 SetStringQual("org.taxname", value, handle_existing_text, delimiter);
DONE
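The four existing_text options can be pictured with a small Python model. This is only a sketch of the behavior described above, not the tool's implementation: the function name apply_value is hypothetical, and the handling of an empty field (the new value is simply stored) is an assumption.

```python
def apply_value(old: str, new: str, existing_text: str = "eReplace",
                delimiter: str = "") -> str:
    """Combine a new value with the text already present in a field."""
    if not old:
        # Assumption: with no existing text, every option stores the new value.
        return new
    if existing_text == "eLeaveOld":
        return old                      # no action
    if existing_text == "ePrepend":
        return new + delimiter + old    # new value goes first
    if existing_text == "eAppend":
        return old + delimiter + new    # new value goes last
    if existing_text == "eReplace":
        return new                      # overwrite
    raise ValueError(f"unknown existing_text option: {existing_text!r}")


print(apply_value("Bacillus sp.", "Uncultured", "ePrepend", " "))
```

With a space as the delimiter, prepending "Uncultured" to "Bacillus sp." yields "Uncultured Bacillus sp.".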
To set subsource/orgmod qualifiers:
SetModifier(object, modifier_path, modifier_subtype, new_value, existing_text, delimiter);

Resolve(field_name) – returns the field of interest corresponding to the field_name.
MACRO Apply_modifier "Apply 4 to segment (overwrite existing text)"
VAR
 qual_name = "subtype"
 modifier = "segment"
 new_value = "4"
 existing_text = "eReplace"
FOR EACH BioSource
DO
 obj = Resolve(qual_name) WHERE obj.subtype = modifier;
 SetModifier(obj, qual_name, modifier, new_value, existing_text);
DONE
To set non-string fields:
SetQual(field_name, new_value);
MACRO FixPopSets "Convert pop sets to phy sets when taxnames are inconsistent"
FOR EACH SeqSet
WHERE class = "pop-set" AND INCONSISTENT_TAXA()
DO
 SetQual("class", "phy-set");
DONE
Edit action – uses a single field and four additional input values: a string to find in the field, a string to replace the found text, a location type that specifies where to search for the string, and the case sensitivity of the search.
EditStringQual(field_name, find_text, replace_text, location, case_sensitive);

Location can be one of:
"at the beginning"
"at the end"
"anywhere"
MACRO Fix_GeneLocus "Fix a misspelling of cytochrome in gene locus"
VAR
 find_text = "cytochorme"
 repl_text = "cytochrome"
 location = "anywhere"
 case_sensitive = false
FOR EACH Gene
DO
 EditStringQual("data.gene.locus", find_text, repl_text, location, case_sensitive);
DONE
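The interplay of location and case_sensitive can be modeled in a few lines of Python. This is an illustrative sketch only: edit_string is a hypothetical name, and replacing every occurrence when the location is "anywhere" is an assumption about the tool's behavior.

```python
import re


def edit_string(text: str, find_text: str, replace_text: str,
                location: str = "anywhere", case_sensitive: bool = True) -> str:
    """Replace find_text with replace_text at the requested location."""
    if not find_text:
        return text
    pattern = re.escape(find_text)
    if location == "at the beginning":
        pattern = r"\A" + pattern
    elif location == "at the end":
        pattern = pattern + r"\Z"
    elif location != "anywhere":
        raise ValueError(f"unknown location: {location!r}")
    flags = 0 if case_sensitive else re.IGNORECASE
    # A function replacement keeps backslashes in replace_text literal.
    return re.sub(pattern, lambda match: replace_text, text, flags=flags)


print(edit_string("Cytochorme b", "cytochorme", "cytochrome", "anywhere", False))
```

Note that in this model a case-insensitive match is replaced by replace_text exactly as given, so the capital "C" of the input is not preserved.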
Copy action – uses a source and a destination field, and information about how to handle existing text present in the destination field. It does not remove the source qualifier.
CopyStringQual(source_field, dest_field, existing_text, delimiter);

Delimiter may be omitted.
MACRO Copy "Copy taxname to common name"
VAR
  src_qual = "org.taxname"
  dest_qual = "org.common"
  handle_existing_text = "ePrepend"
FOR EACH BioSource
DO
 CopyStringQual(src_qual, dest_qual, handle_existing_text, ",");
DONE
Convert action – uses a source field and a destination field, conversion options (capitalization change and stripping prefix field names) and information about how to handle existing text present in the destination field.
ConvertStringQual(source_field, dest_field, cap_change, strip_name, existing_text, delimiter);

Delimiter may be omitted.
This function alone does not remove the source qualifier; to remove it, call RemoveModifier afterwards, as the example does.

Capitalization change can be one of:
"none"
"toupper"
"tolower"
"firstcap"
"firstcap-restnochange"
"firstlower-restnochange"
"cap-word-space"
"cap-word-space-punct"
MACRO Convert_isolate_to_common "Convert source isolate to common name (overwrite existing text), capitalize all letters, and remove source isolate"
VAR
 dest_field = "org.common"
 cap_change = "toupper"
 strip_name = false
 existing_text = "eReplace"
FOR EACH BioSource
DO
 src = Resolve("org.orgname.mod") WHERE src.subtype = "isolate";
 ConvertStringQual("src.subname", dest_field, cap_change, strip_name, existing_text);
 RemoveModifier(src);
DONE
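The capitalization options are not defined in detail here, so the following Python sketch shows one plausible reading of each name. All of the semantics below are assumptions inferred from the option names (for example, that "firstcap" lowercases the rest of the text), and change_caps is a hypothetical helper, not part of the macro language.

```python
import re


def change_caps(text: str, cap_change: str) -> str:
    """Apply one capitalization option; semantics are guessed from the names."""
    if cap_change == "none":
        return text
    if cap_change == "toupper":
        return text.upper()
    if cap_change == "tolower":
        return text.lower()
    if cap_change == "firstcap":
        return text[:1].upper() + text[1:].lower()
    if cap_change == "firstcap-restnochange":
        return text[:1].upper() + text[1:]
    if cap_change == "firstlower-restnochange":
        return text[:1].lower() + text[1:]
    if cap_change == "cap-word-space":
        # Capitalize the first character of each whitespace-separated word.
        return re.sub(r"(^|\s)(\S)",
                      lambda m: m.group(1) + m.group(2).upper(), text)
    if cap_change == "cap-word-space-punct":
        # Also treat punctuation as a word boundary.
        return re.sub(r"(^|\W)(\w)",
                      lambda m: m.group(1) + m.group(2).upper(), text)
    raise ValueError(f"unknown cap_change option: {cap_change!r}")


print(change_caps("hello world-wide", "cap-word-space-punct"))
```

Under these assumptions, "cap-word-space" would leave "world-wide" as "World-wide", while "cap-word-space-punct" would produce "World-Wide".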
Remove action – requires a single field that needs to be removed.
RemoveQual(field_name);

To remove subsource/orgmod qualifiers:
RemoveModifier(qualifier_object);
MACRO Remove_technique "Remove technique both where sequence type is protein and where technique is both"
FOR EACH MolInfo
WHERE biomol = "peptide" AND tech = "both"
DO
 RemoveQual("tech");
DONE
Swap action – uses two fields whose values will be swapped. Both fields need to be present.
SwapStringQual(field_name1, field_name2);
MACRO Swap_Gene_Map_With_Descr "Swap gene map with gene description"
VAR
 qual_src = "data.gene.map"
 qual_dest = "data.gene.descr"
FOR EACH Gene
DO
 SwapStringQual(qual_src, qual_dest);
DONE

For descriptors:

Description Function Example
Remove descriptor action – removes the descriptor selected by the iterator.
RemoveDescriptor();
MACRO Remove_Defline "Remove title descriptor"
FOR EACH Seqdesc
WHERE CHOICETYPE() = "title"
DO
 RemoveDescriptor();
DONE
Fix capitalization in source qualifiers
FixSourceQualCaps(field_name);
MACRO FixCountry "Fix source country"
FOR EACH BioSource
DO
 obj = Resolve("subtype") WHERE obj.subtype = "country";
 FixSourceQualCaps(obj);
DONE
Fix capitalization in mouse strain
FixMouseStrains(field_name);
MACRO FixMouse "Fix cap in common Mus musculus strains"
FOR EACH BioSource
DO
 obj = Resolve("org.orgname.mod") WHERE obj.subtype = "strain";
 FixMouseStrains(obj);
DONE
Trim "junk" strings from forward and reverse primer sequences
TrimJunkFromPrimerSeq(field_name);
MACRO Remove_from_primers "Trim junk strings in primer sequences"
VAR
 qual_name_fwd = "pcr-primers..forward..seq"
 qual_name_rev = "pcr-primers..reverse..seq"
FOR EACH BioSource
DO
 TrimJunkFromPrimerSeq(qual_name_fwd);
 TrimJunkFromPrimerSeq(qual_name_rev);
DONE
Remove subsource/orgmod note if the note is exclusively constructed from special phrases or from words that appear in the lineage or in the taxname. Matching is case insensitive.
RemoveLineageSourceNotes();
MACRO RemoveLineageSrcNotes "Remove lineage source notes"
FOR EACH BioSource
DO
 RemoveLineageSourceNotes();
DONE
Reorder structured comments
ReorderStructuredComment();
MACRO ReorderStrComm "Reorder structured comment fields"
FOR EACH StructComment
DO
 ReorderStructuredComment();
DONE
Remove duplicate structured comments. If there are two identical structured comments, one at the sequence level and one at the set level, the one at the sequence level will be removed.
RemoveDuplicateStructComments();
MACRO RemoveDuplicates "Remove duplicate structured comments"
FOR EACH Seq
WHERE NOT inst.mol = "aa"
DO
 RemoveDuplicateStructComments();
DONE
Fix format action – applied to collection date, lat-lon or altitude.
FixFormat(field_name);
MACRO Fix_collectiondate "Fix format of collection date"
FOR EACH BioSource
DO
 o = Resolve("subtype") WHERE o.subtype = "collection-date";
 FixFormat(o);
DONE
Fix format action – applied to primers.
FixIInPrimerSeq(field_name);
MACRO Fix_primers "Fix i in primer sequences"
VAR
 qual_name_fwd = "pcr-primers..forward..seq"
 qual_name_rev = "pcr-primers..reverse..seq"
FOR EACH BioSource
DO
 FixIInPrimerSeq(qual_name_fwd);
 FixIInPrimerSeq(qual_name_rev);
DONE

For publications:

Description Function Example
Fix capitalization in authors' last names
FixCapsAuthorLastName(field_name);
MACRO FixLastNames "Fix cap in author last names where last name is all caps"
FOR EACH Pubdesc
DO
 obj = PUB_AUTHORS("last") WHERE ISUPPER(obj);
 FixCapsAuthorLastName(obj);
DONE
Fix USA and state abbreviations in publications
FixUSAAndStateAbbreviations();
MACRO FixUSA "Fix USA and state abbreviations in publications"
FOR EACH Pubdesc
DO
 FixUSAAndStateAbbreviations();
DONE
Fix capitalization in publications – fix in all fields of affiliation
FixPubCapsAffiliation(field, punct_only);

Punctuation_only – if true, fixes apply only to the punctuation. May be omitted.
MACRO FixCapsAffil "Fix pub affiliation (punctuation only)"
VAR
 punct_only = true
FOR EACH Pubdesc
DO
 obj = PUB_AFFIL();
 FixPubCapsAffiliation(obj, punct_only);
DONE
Fix capitalization in publication – fix title
FixPubCapsTitle(field, punct_only);

Punctuation_only – if true, fixes apply only to the punctuation. May be omitted.
MACRO FixCapsTitle "Fix pub title where title is all caps"
FOR EACH Pubdesc
DO
 obj = PUB_TITLE() WHERE ISUPPER(obj);
 FixPubCapsTitle(obj);
DONE
Fix capitalization in publication – authors
FixPubCapsAuthors(field, punct_only);

Punctuation_only – if true, fixes apply only to the punctuation. May be omitted.
MACRO FixCapsAuthors "Fix pub authors where each author name is all caps"
FOR EACH Pubdesc
DO
 obj = PUB_AUTHORS() WHERE IS_ALL_UPPER(PUB_AUTHORS());
 FixPubCapsAuthors(obj);
DONE
Fix capitalization in publication – affiliation country
FixPubCapsAffilCountry(field, punct_only);

Punctuation_only – if true, fixes apply only to the punctuation. May be omitted.
MACRO FixCapsAffilCountry "Fix pub affiliation country"
FOR EACH Pubdesc
DO
 obj =PUB_AFFIL();
 FixPubCapsAffilCountry(obj);
DONE
Fix capitalization in publication – affiliation except institution and department
FixPubCapsAffiliation_NOInstDept(field, punct_only);

Punctuation_only – if true, fixes apply only to the punctuation. May be omitted.
MACRO FixCapsAffil_NOInstDept "Fix pub caps in affiliation except institution and department"
FOR EACH Pubdesc
DO
 obj =PUB_AFFIL();
 FixPubCapsAffiliation_NOInstDept(obj);
DONE
Truncate middle initials
TruncateMiddleInitials();
MACRO TruncateMI "Truncate middle name initials"
FOR EACH Pubdesc
DO
 TruncateMiddleInitials();
DONE
Remove author suffix
RemoveAuthorSuffix();
MACRO RemoveSuffixes "Remove author suffix"
FOR EACH Pubdesc
DO
 RemoveAuthorSuffix();
DONE
Move middle name to first name
MoveMiddleToFirstName();
MACRO MoveMiddle "Move middle name to first name"
FOR EACH Pubdesc
DO
 MoveMiddleToFirstName();
DONE

For coding regions:

Description Function Example
Synchronize partials for a coding region and for its protein product sequence and protein feature
SynchronizeCDSPartials();
MACRO Synchronize_CDS_partials "Synchronize coding region partials"
FOR EACH Cdregion
DO
 SynchronizeCDSPartials();
DONE
Adjust internal intervals of a coding region so that they abut the consensus splice sites, if such a change can be made without changing the translation of the protein.
AdjustCDSForConsensusSplice();
MACRO Adjust_for_consensus_splice "Adjust internal intervals of a coding region for consensus splice sites (GT-AG)"
FOR EACH Cdregion
DO
 AdjustCDSForConsensusSplice();
DONE
Retranslate coding regions
RetranslateCDS(obey_stop_codon);
MACRO Retranslate "Retranslate coding regions while obeying stop codons"
VAR
 obey_stop_codon = true
FOR EACH Cdregion
DO
 RetranslateCDS(obey_stop_codon);
DONE
Remove stop codon from complete coding region
TrimStopsFromCompleteCDS();
MACRO Trim_stop_from_complete_CDS "Remove trailing * from complete coding regions"
FOR EACH Cdregion
DO
 TrimStopsFromCompleteCDS();
DONE
Replace stops with selenocysteines
ReplaceSelenocysteineStops();
MACRO ReplaceSelenocysteine "Replace stops with selenocysteines"
FOR EACH Cdregion
DO
 ReplaceSelenocysteineStops();
DONE

For features:

Description Function Example
Remove invalid EC numbers
RemoveInvalidECNumbers();
MACRO RemoveEC "Remove invalid EC numbers"
FOR EACH Protein
DO
 RemoveInvalidECNumbers();
DONE
Update replaced EC numbers – accepts three parameters that specify whether to delete improperly formatted EC numbers, unrecognized EC numbers, and EC numbers that have been replaced by more than one number.
UpdateReplacedECNumbers(delete_improper_format, delete_unrecognized, delete_multiple_replacement);
MACRO UpdateReplEC "Update replaced EC numbers"
VAR
 del_improper = true
 del_unrecog = true
 del_mult_repl = true
FOR EACH Protein
DO
 UpdateReplacedECNumbers(del_improper, del_unrecog, del_mult_repl);
DONE
Fix format action – applied to protein name.
FixProteinFormat(field_name);
MACRO Fix_proteins "Remove organism names from protein names"
FOR EACH Protein
DO
 FixProteinFormat("data.prot.name");
DONE
Edit feature location – extend 5’ end of feature to the end of sequence
ExtendFeatToSeqStart();
MACRO ExtendFeatToSeqStart "Extend 5' end of feature to end of sequence for CDS features that are 5' partial and where sequence type is mRNA"
FOR EACH Cdregion
WHERE MOLINFO_FOR_SEQFEAT("biomol") = "mRNA"
DO
 ExtendFeatToSeqStart();
DONE
Edit feature location – extend 3’ end of feature to the end of sequence
ExtendFeatToSeqStop();
MACRO ExtendFeatToSeqStop "Extend 3' end of feature to end of sequence for CDS features that are 3' partial"
FOR EACH Cdregion
WHERE ISPARTIALSTOP()
DO
 ExtendFeatToSeqStop();
DONE
Edit feature location – set both ends to partial
SetBothEndsPartial(type, extend);

Type can be "all" or "at-end". The latter is used for features both of whose ends are at the ends of the sequence. When the "extend" variable is true, the features are extended to the ends of the sequence if partials are set.
MACRO SetPartialEnds "Set both ends of genes to partial when both ends of location are at the end of sequence"
VAR
 extend = false
FOR EACH Gene
DO
 SetBothEndsPartial("at-end", extend);
DONE
Add gene Xref to features
AddGeneXref();
MACRO AddGeneXref "Add gene Xref from overlapping gene feature for mRNA features"
FOR EACH mRNA
DO
 AddGeneXref();
DONE
Remove gene Xrefs from features.
RemoveGeneXref(suppressing_type, necessary_type);

Suppressing_type can be one of:
"any"
"suppressing"
"nonsuppressing"

Necessary_type can be one of:
"any"
"necessary"
"unnecessary"
MACRO RemoveXref_suppressing "Remove suppressing gene Xref from CDS features"
VAR
 suppr_type = "suppressing"
 necessary_type = "any"
FOR EACH Cdregion
DO
 RemoveGeneXref(suppr_type, necessary_type);
DONE
Make BOLD Xrefs – creates BARCODE dbXrefs on the sequences.
MakeBOLDXrefs();
MACRO AddBarcodeXrefs "Make BARCODE Xrefs"
FOR EACH Seq
DO
 MakeBOLDXrefs();
DONE

For sets, sequences, alignment:

Description Function Example
Remove gaps in alignments – remove any segment from an alignment in which each row has a gap.
RemoveSegGaps();
MACRO RemoveSegGaps "Remove seg-gaps"
FOR EACH SeqAlign
DO
 RemoveSegGaps();
DONE
Remove sequences
RemoveSequence();
MACRO RemoveSequences "Remove sequence with a specified local id"
FOR EACH Seq
WHERE id.local.str = "Seq_10"
DO
 RemoveSequence();
DONE
Remove single-sequence pop, phy, mut or eco wrapper set without alignment
RemoveSingleItemSet();
MACRO RemoveSingleItemSet "Remove single-sequence pop, phy, mut, or eco wrapper set without alignment"
FOR EACH TSEntry
DO
 RemoveSingleItemSet();
DONE
Renormalize nuc-prot sets
RenormalizeNucProtSet();
MACRO RenormalizeNucProtSet "Renormalize nuc-prot sets"
FOR EACH TSEntry
DO
 RenormalizeNucProtSet();
DONE
Convert pop sets to phy sets when taxnames are inconsistent
SetQual(field_name, new_value);
MACRO FixPopSets "Convert pop sets to phy sets when taxnames are inconsistent"
FOR EACH SeqSet
WHERE class = "pop-set" AND INCONSISTENT_TAXA()
DO
 SetQual("class", "phy-set");
DONE
Fix common misspellings that may occur anywhere in the entry.
FixSpelling();
MACRO FixSpelling "Fix spelling"
FOR EACH TSEntry
DO
 FixSpelling();
DONE
Perform discrepancy autofix for a given discrepancy test
PerformDiscrAutofix(test_name);
MACRO DiscrAutofix "Perform autofix for EC_NUMBER_ON_UNKNOWN_PROTEIN discrepancy test"
FOR EACH TSEntry
DO
 PerformDiscrAutofix("EC_NUMBER_ON_UNKNOWN_PROTEIN");
DONE
Autodef – generate definition lines for all nucleotide sequences, using biosource modifiers to ensure that the definition lines are unique.
Autodef(list_feat_rule, misc_feat_rule, modifier1, …);

List_feat_rule may be one of:
"List All Features"
"Complete Sequence"
"Complete Genome"
"Partial Sequence"
"Partial Genome"
"Sequence"

Misc_feat_rule may be one of:
"Delete"
"CommentFeat"
"NonCodingProductFeat"

Options are case insensitive. The list of modifiers may be omitted.
MACRO Autodefmacro "Autodef list all features with modifier strain, country, use misc_feat comment before first semicolon"
VAR
 list_feat_rule = "List All Features"
 misc_feat_rule = "CommentFeat"
FOR EACH TSEntry
DO
 Autodef(list_feat_rule, misc_feat_rule, "strain", "country");
DONE
Perform taxonomy lookup and extended cleanup, correct genetic codes of coding regions
DoTaxLookup();
MACRO TaxLookup "Do tax lookup"
FOR EACH TSEntry
DO
 DoTaxLookup();
DONE
All functions are case sensitive. The ones present in the WHERE section need to be written in capital letters.

The program can process a single file or multiple files located in one directory and its sub-directories. The output files can be stored in either text or binary form in the specified output folder. If the output folder is not specified, the output files are stored in the folder containing the input file(s). By default, gbenchmacro appends ".processed" to the output file name; this can be overridden with the -s parameter. By default, the program logs to the console. The log information can be forwarded to a file using the -logfile parameter. The -dryrun parameter can be used to check the correctness of the program arguments and the syntax of the input macro file without making any modifications to the data.
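The output naming rule can be illustrated with a short Python sketch. It only models the description above: output_path is a hypothetical helper, and it assumes the suffix is appended to the full input file name rather than replacing its extension.

```python
from pathlib import PurePosixPath
from typing import Optional


def output_path(input_file: str, output_dir: Optional[str] = None,
                suffix: str = ".processed") -> str:
    """Return the path where the processed copy of input_file would be stored."""
    src = PurePosixPath(input_file)
    # Without an output folder, the result stays next to the input file.
    folder = PurePosixPath(output_dir) if output_dir else src.parent
    return str(folder / (src.name + suffix))


print(output_path("data/seq_submit.asn"))
print(output_path("input_files/sample.bin", "output_files", ".asn"))
```

The first call models the defaults; the second models an explicit output folder combined with a custom suffix.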

The options are as follows:

Option Description
-h Print USAGE and DESCRIPTION; ignore all other parameters.
-help Print USAGE, DESCRIPTION and ARGUMENTS; ignore all other parameters.
-p <File_In> Path to a folder containing the input ASN.1 files. The program filters the files by extension (see the -x parameter). Incompatible with -i, -o.
-i <File_In> Path to the input file. Incompatible with -p.
-d <File_Out> Path to a folder where the output ASN.1 files will be stored. Incompatible with -o.
-o <File_Out> Path to the output ASN.1 file. Incompatible with -d, -p.
-r Flag, indicating whether to process directories recursively.
-a <String> Specify the ASN.1 type contained in the input file:
a Automatic – the application determines the type automatically
e Seq-entry
b Bioseq
s Bioseq-set
m Seq-submit
Default value is 'a' (automatic).
-b Flag, indicating that the input is in binary format. Used only if the program cannot determine the format of the input file(s).
-f Flag, indicating that the output file(s) should be in binary format.
-m <File_In> Macro (library) file that contains the definitions of macros. Default is '<std>/etc/list_of_macros.txt'. <std> is the standard path for this file, relative to the program's executable.
-e <String> This option may be omitted if the macro library contains only one macro. Otherwise:
Macro1[,...,MacroN] List of comma-separated macros to execute on the input files.
all Execute all macros defined in the macro library.
-y <File_In> Path to the file that contains synonyms. This is mostly used in string constraints. The default is '<std>/etc/synonyms.txt'. <std> is the standard path for this file, relative to the program's executable.
-s <String> Use this parameter to change the extension of the output file(s). Default is '.processed'.
-x <String> File extension for input files specified by -p. Default is '.asn'.
-l <String> List of comma-separated sequence identifiers (accession numbers or local identifiers) to filter sequences from the input file. The macro(s) will only be executed on the specified sequences.
-logfile <File_Out> Path to the file which the program log will be redirected to. If not specified the log will be displayed in the console.
-version Print version number; ignore other arguments
-dryrun Dry run the application: do nothing, only test all preconditions. Verifies the parameters and parses the macro library without actually executing the macro(s). Useful for syntax checking of the macros and for checking program arguments’ correctness.
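When gbenchmacro is driven from a script, the incompatibilities listed above can be checked before the program is launched. The following Python sketch assembles an argument list using the option letters from the SYNOPSIS; build_command is a hypothetical helper that covers only a subset of the options, and it does not run the program.

```python
from typing import List, Optional, Sequence


def build_command(macros: Sequence[str],
                  input_file: Optional[str] = None,
                  input_dir: Optional[str] = None,
                  output_file: Optional[str] = None,
                  output_dir: Optional[str] = None,
                  logfile: Optional[str] = None,
                  dry_run: bool = False) -> List[str]:
    """Build a gbenchmacro argument list, rejecting incompatible options."""
    if input_file and input_dir:
        raise ValueError("-i is incompatible with -p")
    if output_file and output_dir:
        raise ValueError("-o is incompatible with -d")
    if output_file and input_dir:
        raise ValueError("-o is incompatible with -p")
    cmd = ["gbenchmacro"]
    if input_dir:
        cmd += ["-p", input_dir]
    if input_file:
        cmd += ["-i", input_file]
    if output_dir:
        cmd += ["-d", output_dir]
    if output_file:
        cmd += ["-o", output_file]
    if macros:
        cmd += ["-e", ",".join(macros)]  # comma-separated macro names, or "all"
    if logfile:
        cmd += ["-logfile", logfile]
    if dry_run:
        cmd.append("-dryrun")
    return cmd


print(" ".join(build_command(["TestDebug1", "TestDebug2"],
                             input_file="seq_submit.asn",
                             logfile="seq_submit.log")))
```

The resulting list could be passed to subprocess.run; adding dry_run=True first is a cheap way to validate the arguments and the macro syntax.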

EXAMPLES

gbenchmacro -i seq_submit.asn -e TestDebug1,TestDebug2 -logfile seq_submit.log

Execute macros TestDebug1 and TestDebug2 from the standard macro library on file seq_submit.asn. Forward the log information to file seq_submit.log.

gbenchmacro -p input_files -x .bin -d output_files -s .asn -e TestDebug1

Execute macro TestDebug1 from the standard macro library on all files in folder "input_files", where the extension of the input files is ".bin". Store the output files in folder "output_files", appending ".asn" to the names of the original files.

MACRO QUERY LANGUAGE

EBNF language description:

Command Description
<SCRIPT>::= "MACRO" <macro-name> <macro-title>
[<vars-section>]
<macro-body>
<vars-section>::= "VAR"
<var-name> "=" (<var-value> | <ask-statement> | <choice>)
<macro-body>::=

"FOR EACH" <asn-sel-list>
["WHERE" <condition-clause>]
"DO"
<function-spec>
{<function-spec>}
"DONE"

<macro-name>::= <identifier>
<macro-title>::= <string>
<var-name>::= <identifier>
<var-value>::= <number> | <string>
<asn-sel-list>::= <asn-selector> ["," <asn-sel-list>]
<asn-selector>::= ('*' | <digit> | <letter> | ".") {('*' | <digit> | <letter> | ".")}
<condition-clause>::= ("+", "-", && and TBD) | (any parsable lexems \ keywords) {<condition-clause>}
<function-spec>::= <func-name>
"(" [(<var-name>|<var-value>){","(<var-name>|<var-value>)}] ")"
<func-name>::= <identifier>
<number>::= [("+" | "-")] <digit> {<digit>}
<string>::= "\"" {*} "\""
<ask-statement>::= "%" {*} "%"
<choice>::= "CHOICE" "{"<var-value> {"," <var-value>} "}"
<identifier>::= <letter> {(<letter> | <digit> | "_")}
<digit>::= "0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9"
<letter>::= "a"|"b"|"c"|"d"|"e"|"f"|"g"|"h"|"i"|"j"|"k"|"l"|"m"|"n"|"o"|"p"|"q"|"r"|"s"|"t"|"u"|"v"|"w"|"x"|"y"|"z"
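The overall shape of a macro (header, optional variable section, iterator, optional WHERE clause, DO ... DONE body) can be checked with a rough Python sketch. This is a regular-expression approximation written for illustration, not the parser used by gbenchmacro; it follows the keywords used in the examples of this document, and parse_macro is a hypothetical name.

```python
import re
from typing import Dict, List, Optional, Union

MACRO_RE = re.compile(
    r'\s*MACRO\s+(?P<name>[A-Za-z][A-Za-z0-9_]*)\s+"(?P<title>[^"]*)"\s*'
    r'(?:VAR\s+(?P<vars>.*?))?'
    r'FOR\s+EACH\s+(?P<selector>\S+)\s*'
    r'(?:WHERE\s+(?P<where>.*?))?'
    r'DO\s+(?P<body>.*?)\s*DONE\s*',
    re.DOTALL,
)

Parsed = Dict[str, Union[str, None, List[str]]]


def parse_macro(text: str) -> Optional[Parsed]:
    """Split a macro into its sections; return None if the shape is wrong."""
    match = MACRO_RE.fullmatch(text)
    if match is None:
        return None
    where = match.group("where")
    body = match.group("body")
    return {
        "name": match.group("name"),
        "title": match.group("title"),
        "selector": match.group("selector"),
        "where": where.strip() if where else None,
        # Names followed by "(" inside the DO ... DONE body.
        "functions": re.findall(r"([A-Za-z_]\w*)\s*\(", body),
    }


EXAMPLE = '''MACRO Remove_Defline "Remove title descriptor"
FOR EACH Seqdesc
WHERE CHOICETYPE() = "title"
DO
 RemoveDescriptor();
DONE
'''

print(parse_macro(EXAMPLE))
```

A check like this only catches gross structural mistakes, such as a missing DONE; the -dryrun option remains the authoritative syntax check.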


Last updated: 2017-11-04T00:41:41Z