NCBI C++ ToolKit
CSequenceAmbigTrimmer Class Reference

This trims ambiguous bases from the start and/or end of sequences, using customizable rules.

#include <objmgr/util/sequence.hpp>

Classes

struct  SAmbigCount
 This holds the output of x_CountAmbigInRange.

struct  STrimRule
 For example, if bases_to_check is 10 and max_bases_allowed_to_be_ambig is 5, then on each iteration we check the 10 terminal bases and trim off those 10 if there are more than 5 ambiguous bases there.

Public Types

enum  EMeaningOfAmbig { eMeaningOfAmbig_OnlyCompletelyUnknown, eMeaningOfAmbig_AnyAmbig }
 This enum is used to set what is meant by "ambiguous".

enum  EFlags { fFlags_DoNotTrimBeginning = (1 << 0), fFlags_DoNotTrimEnd = (1 << 1), fFlags_DoNotTrimSeqGap = (1 << 2) }

enum  EResult { eResult_SuccessfullyTrimmed, eResult_NoTrimNeeded }
 This indicates what happened with the trim.

typedef int TFlags

typedef vector< STrimRule > TTrimRuleVec
 Multiple STrimRules are allowed, which are applied from smallest bases_to_check to largest bases_to_check, and redundant rules are automatically removed.

- Public Types inherited from CObject
enum  EAllocFillMode { eAllocFillNone = 1, eAllocFillZero, eAllocFillPattern }
 Control filling of newly allocated memory.

typedef CObjectCounterLocker TLockerType
 Default locker type for CRef.

typedef CAtomicCounter TCounter
 Counter type is CAtomicCounter.

typedef TCounter::TValue TCount
 Alias for value type of counter.
 

Public Member Functions

 CSequenceAmbigTrimmer (EMeaningOfAmbig eMeaningOfAmbig, TFlags fFlags=0, const TTrimRuleVec &vecTrimRules=GetDefaultTrimRules(), TSignedSeqPos uMinSeqLen=50)
 This sets up the parameters for how this trimmer will act.

virtual ~CSequenceAmbigTrimmer ()
 Do-nothing destructor just to allow inheritance.

virtual EResult DoTrim (CBioseq_Handle &bioseq_handle)
 This trims the given bioseq, using params set in the CSequenceAmbigTrimmer constructor.

- Public Member Functions inherited from CObject
 CObject (void)
 Constructor.

 CObject (const CObject &src)
 Copy constructor.

virtual ~CObject (void)
 Destructor.

CObject & operator= (const CObject &src) THROWS_NONE
 Assignment operator.

bool CanBeDeleted (void) const THROWS_NONE
 Check if object can be deleted.

bool IsAllocatedInPool (void) const THROWS_NONE
 Check if object is allocated in memory pool (not system heap).

bool Referenced (void) const THROWS_NONE
 Check if object is referenced.

bool ReferencedOnlyOnce (void) const THROWS_NONE
 Check if object is referenced only once.

void AddReference (void) const
 Add reference to object.

void RemoveReference (void) const
 Remove reference to object.

void ReleaseReference (void) const
 Remove reference without deleting object.

virtual void DoNotDeleteThisObject (void)
 Mark this object as not allocated in heap – do not delete this object.

virtual void DoDeleteThisObject (void)
 Mark this object as allocated in heap – object can be deleted.

void * operator new (size_t size)
 Define new operator for memory allocation.

void * operator new[] (size_t size)
 Define new[] operator for 'array' memory allocation.

void operator delete (void *ptr)
 Define delete operator for memory deallocation.

void operator delete[] (void *ptr)
 Define delete[] operator for memory deallocation.

void * operator new (size_t size, void *place)
 Define new operator.

void operator delete (void *ptr, void *place)
 Define delete operator.

void * operator new (size_t size, CObjectMemoryPool *place)
 Define new operator using memory pool.

void operator delete (void *ptr, CObjectMemoryPool *place)
 Define delete operator.

virtual void DebugDump (CDebugDumpContext ddc, unsigned int depth) const
 Define method for dumping debug information.

- Public Member Functions inherited from CDebugDumpable
 CDebugDumpable (void)

virtual ~CDebugDumpable (void)

void DebugDumpText (ostream &out, const string &bundle, unsigned int depth) const

void DebugDumpFormat (CDebugDumpFormatter &ddf, const string &bundle, unsigned int depth) const

void DumpToConsole (void) const
 

Static Public Member Functions

static const TTrimRuleVec & GetDefaultTrimRules (void)
 This returns a reasonable default for trimming rules.

- Static Public Member Functions inherited from CObject
static NCBI_NORETURN void ThrowNullPointerException (void)
 Define method to throw null pointer exception.

static NCBI_NORETURN void ThrowNullPointerException (const type_info &type)

static EAllocFillMode GetAllocFillMode (void)

static void SetAllocFillMode (EAllocFillMode mode)

static void SetAllocFillMode (const string &value)
 Set mode from configuration parameter value.

- Static Public Member Functions inherited from CDebugDumpable
static void EnableDebugDump (bool on)
 

Protected Types

typedef bool TAmbigLookupTable[26]
 

Protected Member Functions

bool x_TestFlag (TFlags fFlag)
 Test if a given flag is set.

virtual void x_NormalizeVecTrimRules (TTrimRuleVec &vecTrimRules)
 This prepares the vector of trimming rules to be used by the trimming algorithm.

virtual EResult x_TrimToNothing (CBioseq_Handle &bioseq_handle)
 The bioseq is trimmed to size 0.

virtual TSignedSeqPos x_FindWhereToTrim (const CSeqVector &seqvec, const TSignedSeqPos iStartPosInclusive_arg, const TSignedSeqPos iEndPosInclusive_arg, TSignedSeqPos iTrimDirection)
 This returns the last good base that won't be trimmed (note: "last" really means "first" when we're starting from the end).

virtual void x_EdgeSeqMapGapAdjust (const CSeqVector &seqvec, TSignedSeqPos &in_out_uStartOfGoodBasesSoFar, const TSignedSeqPos uEndOfGoodBasesSoFar, const TSignedSeqPos iTrimDirection, const TSignedSeqPos uChunkSize)
 This adjusts in_out_uStartOfGoodBasesSoFar if we're at a CSeqMap gap.

virtual void x_CountAmbigInRange (SAmbigCount &out_result, const CSeqVector &seqvec, const TSignedSeqPos iStartPosInclusive_arg, const TSignedSeqPos iEndPosInclusive_arg, const TSignedSeqPos iTrimDirection)
 This counts the number of ambiguous bases in the range from iStartPosInclusive_arg to iEndPosInclusive_arg.

TSignedSeqPos x_SegmentGetBeginningInclusive (const CSeqMap_CI &segment, const TSignedSeqPos iTrimDirection)
 This returns the (inclusive) position at the beginning of the segment.

TSignedSeqPos x_SegmentGetEndInclusive (const CSeqMap_CI &segment, const TSignedSeqPos iTrimDirection)
 This returns the (inclusive) position at the end of the given segment.

CSeqMap_CI & x_SeqMapIterDoNext (CSeqMap_CI &in_out_segment_it, const TSignedSeqPos iTrimDirection)
 Returns the "next" segment.

void x_SliceBioseq (TSignedSeqPos leftmost_good_base, TSignedSeqPos rightmost_good_base, CBioseq_Handle &bioseq_handle)

- Protected Member Functions inherited from CObject
virtual void DeleteThis (void)
 Virtual method "deleting" this object.
 

Protected Attributes

EMeaningOfAmbig m_eMeaningOfAmbig
 This holds the current interpretation for "ambiguous".

TFlags m_fFlags
 This holds the flags that affect the behavior of this class.

TTrimRuleVec m_vecTrimRules
 This holds the trimming rules that will be applied.

TSignedSeqPos m_uMinSeqLen
 When the bioseq gets trimmed down to less than this size, we halt the trimming.
 
TAmbigLookupTable m_arrNucAmbigLookupTable
 
TAmbigLookupTable m_arrProtAmbigLookupTable
 

Additional Inherited Members

- Static Public Attributes inherited from CObject
static const TCount eCounterBitsCanBeDeleted = 1 << 0
 Define possible object states.

static const TCount eCounterBitsInPlainHeap = 1 << 1
 Heap signature was found.

static const TCount eCounterBitsPlaceMask
 Mask for 'in heap' state flags.

static const int eCounterStep = 1 << 2
 Skip over the "in heap" bits.

static const TCount eCounterValid = TCount(1) << (sizeof(TCount) * 8 - 2)
 Minimal value for valid objects (reference counter is zero). Must be a single bit value.

static const TCount eCounterStateMask
 Valid object, and object in heap.
 

Detailed Description

This trims ambiguous bases from the start and/or end of sequences, using customizable rules.

Definition at line 1225 of file sequence.hpp.

Member Typedef Documentation

typedef bool CSequenceAmbigTrimmer::TAmbigLookupTable[26]
protected

Definition at line 1504 of file sequence.hpp.

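typedef int CSequenceAmbigTrimmer::TFlags

Definition at line 1250 of file sequence.hpp.

typedef vector< STrimRule > CSequenceAmbigTrimmer::TTrimRuleVec

Multiple STrimRules are allowed, which are applied from smallest bases_to_check to largest bases_to_check, and redundant rules are automatically removed.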

When a rule is applied, we start over at the first sorted rule again.

Definition at line 1263 of file sequence.hpp.
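
The rule ordering and restart behavior described above can be sketched without the toolkit. The following is only an illustration on a plain std::string, not the code in seq_trimmer.cpp: TrimRule and FindLeftTrimPoint are hypothetical names, only 'N' is treated as ambiguous, trimming is from the left only, and the handling of min_seq_len is an assumption based on the description of m_uMinSeqLen.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for STrimRule; the field names follow the documentation.
struct TrimRule {
    std::size_t bases_to_check;
    std::size_t max_bases_allowed_to_be_ambig;
};

// Returns the index of the first base that is kept when trimming from the left.
// Assumptions: `rules` is already sorted by ascending bases_to_check, and only
// 'N' counts as ambiguous.
std::size_t FindLeftTrimPoint(const std::string& seq,
                              const std::vector<TrimRule>& rules,
                              std::size_t min_seq_len)
{
    std::size_t start = 0;
    bool rule_fired = true;
    // Halt once no rule fires or the remaining sequence drops below min_seq_len.
    while (rule_fired && seq.size() - start >= min_seq_len) {
        rule_fired = false;
        for (const TrimRule& rule : rules) {
            assert(rule.bases_to_check > 0);
            if (seq.size() - start < rule.bases_to_check) {
                continue;  // not enough bases left for this rule's window
            }
            const auto window_begin = seq.begin() + static_cast<std::ptrdiff_t>(start);
            const auto window_end =
                window_begin + static_cast<std::ptrdiff_t>(rule.bases_to_check);
            const auto num_ambig =
                static_cast<std::size_t>(std::count(window_begin, window_end, 'N'));
            if (num_ambig > rule.max_bases_allowed_to_be_ambig) {
                start += rule.bases_to_check;  // trim the whole window
                rule_fired = true;
                break;  // start over at the first (smallest) rule
            }
        }
    }
    return start;
}
```

With the rule {10, 5} from the STrimRule example, a sequence whose first 10 bases contain 6 'N's loses those 10 bases, while one with exactly 5 'N's in that window is left untouched.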

Constructor & Destructor Documentation

CSequenceAmbigTrimmer::CSequenceAmbigTrimmer ( EMeaningOfAmbig  eMeaningOfAmbig,
TFlags  fFlags = 0,
const TTrimRuleVec &  vecTrimRules = GetDefaultTrimRules(),
TSignedSeqPos  uMinSeqLen = 50
)

This sets up the parameters for how this trimmer will act.

Parameters
eMeaningOfAmbig  This indicates exactly what ambiguous means (e.g. just "N", or do all ambiguous symbols count?).
fFlags  Miscellaneous flags that control the trimming. See TFlags.
vecTrimRules  This indicates how trimming will occur. See TTrimRuleVec.
uMinSeqLen  Trimming tries to halt if the sequence becomes smaller than this size. It is possible for the resulting sequence to be below the uMinSeqLen size (or even trimmed to nothing), but the trimmer will at least try not to do that.

Definition at line 163 of file seq_trimmer.cpp.

References _ASSERT, ArraySize(), eMeaningOfAmbig_AnyAmbig, eMeaningOfAmbig_OnlyCompletelyUnknown, m_arrNucAmbigLookupTable, m_arrProtAmbigLookupTable, m_eMeaningOfAmbig, m_vecTrimRules, NCBI_USER_THROW_FMT, and x_NormalizeVecTrimRules().

virtual CSequenceAmbigTrimmer::~CSequenceAmbigTrimmer ( )
inline virtual

Do-nothing destructor just to allow inheritance.

Definition at line 1289 of file sequence.hpp.

Member Function Documentation

CSequenceAmbigTrimmer::EResult CSequenceAmbigTrimmer::DoTrim ( CBioseq_Handle &  bioseq_handle )
virtual

This trims the given bioseq, using params set in the CSequenceAmbigTrimmer constructor.

It will properly handle the annots and descs inside the bioseq, too, if requested.

Parameters
bioseq_handle  The bioseq to trim.
Returns
This returns how the trimming went. On error, an exception is thrown and the bioseq may be in an undefined state.

Definition at line 213 of file seq_trimmer.cpp.

References _ASSERT, CBioseq_Handle::eCoding_Iupac, eResult_NoTrimNeeded, eResult_SuccessfullyTrimmed, fFlags_DoNotTrimBeginning, fFlags_DoNotTrimEnd, CBioseq_Handle::GetBioseqLength(), x_FindWhereToTrim(), x_SliceBioseq(), x_TestFlag(), and x_TrimToNothing().

Referenced by CTrimN::apply().

const CSequenceAmbigTrimmer::TTrimRuleVec & CSequenceAmbigTrimmer::GetDefaultTrimRules ( void  )
static

This returns a reasonable default for trimming rules.

Definition at line 156 of file seq_trimmer.cpp.

References CSafeStatic< T, Callbacks >::Get(), and NULL.

Referenced by CTrimN::apply().

void CSequenceAmbigTrimmer::x_CountAmbigInRange ( SAmbigCount &  out_result,
const CSeqVector &  seqvec,
const TSignedSeqPos  iStartPosInclusive_arg,
const TSignedSeqPos  iEndPosInclusive_arg,
const TSignedSeqPos  iTrimDirection
)
protected virtual

This counts the number of ambiguous bases in the range from iStartPosInclusive_arg to iEndPosInclusive_arg.

Note that both ends of the range are inclusive.

Parameters
out_result  This will store the result. Pass in a struct initialized by the default constructor.
seqvec  This is used to get the bases.
iStartPosInclusive_arg  This is where we start our count.
iEndPosInclusive_arg  This is where we end our count. Note that it can be < or > iStartPosInclusive_arg, depending on trim direction.
iTrimDirection  1 to trim from left to right, -1 to trim from right to left.

Definition at line 551 of file seq_trimmer.cpp.

References abs, CSeqMap::eSeqData, CSeqMap::eSeqGap, fFlags_DoNotTrimSeqGap, CSeqMap::FindSegment(), CSeqVector::GetScope(), CSeqVector::GetSeqMap(), CSeqVector::GetSequenceType(), CSeqMap_CI::GetType(), CSeqVector::IsNucleotide(), CSeqVector::IsProtein(), m_arrNucAmbigLookupTable, m_arrProtAmbigLookupTable, m_fFlags, max(), min(), NCBI_USER_THROW_FMT, NULL, CSequenceAmbigTrimmer::SAmbigCount::num_ambig_bases, CSequenceAmbigTrimmer::SAmbigCount::pos_after_last_gap, x_SegmentGetBeginningInclusive(), x_SegmentGetEndInclusive(), and x_SeqMapIterDoNext().

Referenced by x_FindWhereToTrim().
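
The direction-dependent inclusive range can be illustrated with a small stand-alone sketch. It is not the toolkit implementation: CountAmbigInRange is a hypothetical free function operating on a std::string, it treats only 'N' as ambiguous, and it ignores CSeqMap gaps entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts 'N' characters between start_inclusive and end_inclusive, walking in
// `direction` (1 = left to right, -1 = right to left). Both ends are included.
int CountAmbigInRange(const std::string& seq,
                      int start_inclusive,
                      int end_inclusive,
                      int direction)
{
    assert(direction == 1 || direction == -1);
    // The end must be reachable from the start when stepping in `direction`.
    assert((end_inclusive - start_inclusive) * direction >= 0);
    assert(start_inclusive >= 0 && end_inclusive >= 0);
    assert(static_cast<std::size_t>(start_inclusive) < seq.size());
    assert(static_cast<std::size_t>(end_inclusive) < seq.size());

    int num_ambig = 0;
    // Stop one step past end_inclusive so that the end position is counted.
    for (int pos = start_inclusive; pos != end_inclusive + direction; pos += direction) {
        if (seq[static_cast<std::size_t>(pos)] == 'N') {
            ++num_ambig;
        }
    }
    return num_ambig;
}
```

For the string "NACGN", counting from 0 to 4 with direction 1 and from 4 to 0 with direction -1 both visit the same five positions, so both report two ambiguous bases.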

void CSequenceAmbigTrimmer::x_EdgeSeqMapGapAdjust ( const CSeqVector &  seqvec,
TSignedSeqPos &  in_out_uStartOfGoodBasesSoFar,
const TSignedSeqPos  uEndOfGoodBasesSoFar,
const TSignedSeqPos  iTrimDirection,
const TSignedSeqPos  uChunkSize
)
protected virtual

This adjusts in_out_uStartOfGoodBasesSoFar if we're at a CSeqMap gap.

It does not notice ambiguous bases that are inside a normal sequence.

Parameters
seqvec  This is used to access information about the sequence.
in_out_uStartOfGoodBasesSoFar  This is the start of where we check for a gap. It will be updated to be past the gap, if a gap is found.
uEndOfGoodBasesSoFar  This limits how far this function may search (inclusive) when looking for the end of a gap segment.
iTrimDirection  1 to trim from left to right, -1 to trim from right to left.
uChunkSize  The gap size that we chop off must be a multiple of uChunkSize. We will chop off less if we would go more than 1 past the uEndOfGoodBasesSoFar. A uChunkSize of 1 means no chunking, since every size is a multiple of 1.

Definition at line 477 of file seq_trimmer.cpp.

References abs, CSeqMap::eSeqData, CSeqMap::eSeqGap, fFlags_DoNotTrimSeqGap, CSeqMap::FindSegment(), CSeqVector::GetScope(), CSeqVector::GetSeqMap(), CSeqMap_CI::GetType(), CSeqVector::IsNucleotide(), CSeqVector::IsProtein(), m_arrNucAmbigLookupTable, m_arrProtAmbigLookupTable, m_fFlags, NCBI_USER_THROW, NULL, and x_SegmentGetEndInclusive().

Referenced by x_FindWhereToTrim().

TSignedSeqPos CSequenceAmbigTrimmer::x_FindWhereToTrim ( const CSeqVector &  seqvec,
const TSignedSeqPos  iStartPosInclusive_arg,
const TSignedSeqPos  iEndPosInclusive_arg,
TSignedSeqPos  iTrimDirection
)
protected virtual

This returns the last good base that won't be trimmed (note: "last" really means "first" when we're starting from the end).

Parameters
seqvec  This lets us explore the Bioseq to find out where to trim.
iStartPosInclusive_arg  This is where we start our trimming. Depending on direction, this could be < or > iEndPosInclusive_arg.
iEndPosInclusive_arg  This is where the trimming ends (inclusive). Analogous to iStartPosInclusive_arg.
iTrimDirection  1 to trim from left to right, -1 to trim from right to left.
Returns
The last good base (remember: last means "lower number" when we're checking from the end). If trimming would trim off the entire sequence, it returns a position past the end of the sequence.

Definition at line 342 of file seq_trimmer.cpp.

References _ASSERT, abs, CSequenceAmbigTrimmer::STrimRule::bases_to_check, ITERATE, m_uMinSeqLen, m_vecTrimRules, max(), CSequenceAmbigTrimmer::STrimRule::max_bases_allowed_to_be_ambig, min(), CSequenceAmbigTrimmer::SAmbigCount::num_ambig_bases, CSequenceAmbigTrimmer::SAmbigCount::pos_after_last_gap, s_IsValidDirection(), x_CountAmbigInRange(), and x_EdgeSeqMapGapAdjust().

Referenced by DoTrim().

void CSequenceAmbigTrimmer::x_NormalizeVecTrimRules ( TTrimRuleVec &  vecTrimRules )
protected virtual

This prepares the vector of trimming rules to be used by the trimming algorithm.

For example, it eliminates duplicates and puts the rules in the correct order.

Parameters
vecTrimRules  Input and output.

Definition at line 264 of file seq_trimmer.cpp.

References CSequenceAmbigTrimmer::STrimRule::bases_to_check, ITERATE, CSequenceAmbigTrimmer::STrimRule::max_bases_allowed_to_be_ambig, NCBI_USER_THROW_FMT, and remove_if().

Referenced by CSequenceAmbigTrimmer().
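
As a rough picture of what "normalizing" can mean here, the sketch below sorts rules by bases_to_check and drops exact duplicates. It is an assumption-laden illustration, not the toolkit code: TrimRule and NormalizeTrimRules are hypothetical names, invalid rules are rejected with assert, and only exact duplicates are treated as redundant, whereas the real function may apply stricter redundancy criteria.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for STrimRule; the field names follow the documentation.
struct TrimRule {
    int bases_to_check;
    int max_bases_allowed_to_be_ambig;
};

// Sorts the rules from fewest to most bases_to_check and removes exact duplicates.
void NormalizeTrimRules(std::vector<TrimRule>& rules)
{
    for (const TrimRule& rule : rules) {
        // Assumed validity conditions for this sketch.
        assert(rule.bases_to_check > 0);
        assert(rule.max_bases_allowed_to_be_ambig >= 0);
    }

    std::sort(rules.begin(), rules.end(),
              [](const TrimRule& a, const TrimRule& b) {
                  if (a.bases_to_check != b.bases_to_check) {
                      return a.bases_to_check < b.bases_to_check;
                  }
                  return a.max_bases_allowed_to_be_ambig <
                         b.max_bases_allowed_to_be_ambig;
              });

    // After sorting, identical rules are adjacent, so std::unique can drop them.
    rules.erase(std::unique(rules.begin(), rules.end(),
                            [](const TrimRule& a, const TrimRule& b) {
                                return a.bases_to_check == b.bases_to_check &&
                                       a.max_bases_allowed_to_be_ambig ==
                                           b.max_bases_allowed_to_be_ambig;
                            }),
                rules.end());
}
```

Given the rules {10, 5}, {1, 0}, {10, 5}, this leaves {1, 0} followed by {10, 5}.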

TSignedSeqPos CSequenceAmbigTrimmer::x_SegmentGetBeginningInclusive ( const CSeqMap_CI &  segment,
const TSignedSeqPos  iTrimDirection
)
inline protected

This returns the (inclusive) position at the beginning of the segment.

Parameters
segment  This is the segment we're trying to find the beginning of.
iTrimDirection  This is the direction in which we're trimming. The beginning will be in the opposite direction.
Returns
This returns the (inclusive) position at the beginning of the given segment. As always, the definition of "beginning" depends on iTrimDirection.

Definition at line 1459 of file sequence.hpp.

Referenced by x_CountAmbigInRange().

TSignedSeqPos CSequenceAmbigTrimmer::x_SegmentGetEndInclusive ( const CSeqMap_CI &  segment,
const TSignedSeqPos  iTrimDirection
)
protected

This returns the (inclusive) position at the end of the given segment.

Parameters
segment  This is the segment we're trying to find the end of.
iTrimDirection  This is the direction in which we're trimming. The end of the segment will be found by looking in that direction.
Returns
This returns the (inclusive) position at the end of the given segment. As always, the definition of "end" depends on iTrimDirection.

Definition at line 654 of file seq_trimmer.cpp.

References _ASSERT, CSeqMap_CI::GetEndPosition(), CSeqMap_CI::GetPosition(), and s_IsValidDirection().

Referenced by x_CountAmbigInRange(), and x_EdgeSeqMapGapAdjust().
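
How "beginning" and "end" swap with the trim direction can be shown with a toy segment type. This is a sketch under stated assumptions: Segment is a hypothetical stand-in for CSeqMap_CI that stores a half-open range [position, end_position), which is how the values of CSeqMap_CI::GetPosition() and CSeqMap_CI::GetEndPosition() are assumed to relate, and the two helper functions are illustrative rather than the toolkit's own.

```cpp
#include <cassert>

// Hypothetical stand-in for a CSeqMap_CI segment covering the half-open
// range [position, end_position).
struct Segment {
    int position;      // first position inside the segment
    int end_position;  // one past the last position inside the segment
};

// Inclusive position where the segment begins, relative to the trim direction.
int SegmentBeginningInclusive(const Segment& seg, int direction)
{
    assert(direction == 1 || direction == -1);
    return direction == 1 ? seg.position : seg.end_position - 1;
}

// Inclusive position where the segment ends, relative to the trim direction.
int SegmentEndInclusive(const Segment& seg, int direction)
{
    assert(direction == 1 || direction == -1);
    return direction == 1 ? seg.end_position - 1 : seg.position;
}
```

For a segment covering positions 10 through 19, trimming left to right gives a beginning of 10 and an end of 19; trimming right to left gives a beginning of 19 and an end of 10.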

CSeqMap_CI & CSequenceAmbigTrimmer::x_SeqMapIterDoNext ( CSeqMap_CI &  in_out_segment_it,
const TSignedSeqPos  iTrimDirection
)
protected

Returns the "next" segment.

The definition of "next" depends on iTrimDirection.

Parameters
in_out_segment_it  Caller gives the current CSeqMap_CI, which will be returned adjusted in the trim direction.
iTrimDirection  The direction in which to increment. 1 means normal incrementing and -1 really means decrementing.
Returns
Reference to in_out_segment_it.

Definition at line 672 of file seq_trimmer.cpp.

References _ASSERT, and s_IsValidDirection().

Referenced by x_CountAmbigInRange().

void CSequenceAmbigTrimmer::x_SliceBioseq ( TSignedSeqPos  leftmost_good_base,
TSignedSeqPos  rightmost_good_base,
CBioseq_Handle &  bioseq_handle
)
protected

bool CSequenceAmbigTrimmer::x_TestFlag ( TFlags  fFlag )
inline protected

Test if a given flag is set.

Definition at line 1330 of file sequence.hpp.

Referenced by DoTrim().
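
The EFlags values are single bits, so several can be combined in one TFlags value and tested individually. The sketch below copies the enumerator values from the EFlags declaration on this page; the free function TestFlag is a hypothetical stand-in for x_TestFlag, whose actual body is not shown here.

```cpp
#include <cassert>

// Enumerator values as listed for CSequenceAmbigTrimmer::EFlags.
enum EFlags {
    fFlags_DoNotTrimBeginning = (1 << 0),
    fFlags_DoNotTrimEnd       = (1 << 1),
    fFlags_DoNotTrimSeqGap    = (1 << 2)
};
typedef int TFlags;

// Hypothetical stand-in for x_TestFlag: true if `flag` is set in `flags`.
inline bool TestFlag(TFlags flags, TFlags flag)
{
    assert(flag != 0);  // testing for "no flag" would always be false
    return (flags & flag) != 0;
}
```

For instance, a flags value of fFlags_DoNotTrimEnd | fFlags_DoNotTrimSeqGap has the end and gap bits set but not the beginning bit.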

CSequenceAmbigTrimmer::EResult CSequenceAmbigTrimmer::x_TrimToNothing ( CBioseq_Handle &  bioseq_handle )
protected virtual

Member Data Documentation

TAmbigLookupTable CSequenceAmbigTrimmer::m_arrNucAmbigLookupTable
protected

TAmbigLookupTable CSequenceAmbigTrimmer::m_arrProtAmbigLookupTable
protected

EMeaningOfAmbig CSequenceAmbigTrimmer::m_eMeaningOfAmbig
protected

This holds the current interpretation for "ambiguous".

For example, it indicates whether just 'N' is ambiguous or if any non-ACGT letter is ambiguous. Works for amino acids, too (e.g. 'X' for completely unknown, etc.)

Definition at line 1318 of file sequence.hpp.

Referenced by CSequenceAmbigTrimmer().
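
One way to picture how the two interpretations could map onto a 26-entry table such as TAmbigLookupTable is sketched below for nucleotides. This is an assumption, not the constructor's actual code: MeaningOfAmbig, AmbigLookupTable, BuildNucAmbigTable and IsAmbig are hypothetical names, and "any ambiguous" is taken literally from the description above as "any letter other than A, C, G or T".

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical mirror of EMeaningOfAmbig.
enum class MeaningOfAmbig { OnlyCompletelyUnknown, AnyAmbig };

// One entry per uppercase letter, indexed by (letter - 'A').
using AmbigLookupTable = std::array<bool, 26>;

// Builds a nucleotide lookup table for the chosen interpretation.
AmbigLookupTable BuildNucAmbigTable(MeaningOfAmbig meaning)
{
    AmbigLookupTable table{};  // value-initialized: every entry starts as false
    if (meaning == MeaningOfAmbig::OnlyCompletelyUnknown) {
        table[static_cast<std::size_t>('N' - 'A')] = true;
    } else {
        // Assumption: every letter except A, C, G and T counts as ambiguous.
        table.fill(true);
        for (char unambiguous : {'A', 'C', 'G', 'T'}) {
            table[static_cast<std::size_t>(unambiguous - 'A')] = false;
        }
    }
    return table;
}

// Looks up an uppercase base in the table.
bool IsAmbig(const AmbigLookupTable& table, char base)
{
    assert(base >= 'A' && base <= 'Z');
    return table[static_cast<std::size_t>(base - 'A')];
}
```

Under this sketch, 'R' is ambiguous only in the AnyAmbig table, while 'N' is ambiguous in both. A protein table would follow the same pattern with 'X' as the completely unknown residue.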

TFlags CSequenceAmbigTrimmer::m_fFlags
protected

This holds the flags that affect the behavior of this class.

Definition at line 1320 of file sequence.hpp.

Referenced by x_CountAmbigInRange(), and x_EdgeSeqMapGapAdjust().

TSignedSeqPos CSequenceAmbigTrimmer::m_uMinSeqLen
protected

When the bioseq gets trimmed down to less than this size, we halt the trimming.

Definition at line 1327 of file sequence.hpp.

Referenced by x_FindWhereToTrim().

TTrimRuleVec CSequenceAmbigTrimmer::m_vecTrimRules
protected

This holds the trimming rules that will be applied.

It should be normalized by the constructor to eliminate duplicates and to sort it from fewest to most bases_to_check.

Definition at line 1324 of file sequence.hpp.

Referenced by CSequenceAmbigTrimmer(), and x_FindWhereToTrim().


The documentation for this class was generated from the following files:
Modified on Sat Sep 24 15:21:04 2016 by modify_doxy.py rev. 506947