What is the Conserved Domain Architecture Retrieval Tool (CDART)?
Given a query sequence, CDART shows the functional domains that make up a protein and then lists proteins with a similar domain architecture. The functional domains for a sequence are found by RPS-BLAST, which defines each domain by a PSSM (position-specific scoring matrix), a set of probabilities of each amino acid occurring at each position of the domain. RPS-BLAST is a "profile" search, which is a sensitive way to look for sequence homologues.

Why doesn't my protein have any domain hits?
Our set of domain definitions, while large, is not complete. Also, existing domain definitions may not be sensitive enough in some cases. We are actively working on both issues.

How are domains for each sequence in the related sequences list calculated?
We take all protein sequences in the non-redundant subset of Entrez proteins (a.k.a. nr) and run RPS-BLAST on them with an expectation value cutoff of 0.01, low-complexity filtering, and multiple hits 1-pass mode. These parameters are not adjustable. No two hits are allowed to overlap by more than 50% of the length of either hit. If two hits overlap by more than that, the higher-scoring hit is kept.

How is the list of similar proteins ranked?
By the number of non-redundant hits to the same domains as in the query protein. This means a listed protein only needs to contain one or more of the domains in the query. However, if you requery using the form at the bottom of the results page, the similar proteins returned must include all of the domains you have selected.

Why does the ranking appear incorrect when searching all domain databases?
Why do proteins with fewer unique hits rank higher than those with more unique hits?
If you are searching all domain databases, there are likely hits that do not appear in the graphical display because they are overlapped by higher-scoring redundant hits from one of the other databases. To see these hits, click on the "RPS>>" link.
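The overlap rule and the ranking described above can be illustrated with a short Python sketch. This is not CDART's actual code. It assumes that "overlap by more than 50% of either hit" means the shared region divided by the length of the shorter hit, that hits are resolved greedily from the highest score down, and that ranking counts every retained hit to a query domain, repeats included. The names `Hit`, `filter_overlaps`, and `rank_similar` are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    domain: str   # name of the domain definition that was hit
    start: int    # first residue of the hit (1-based, inclusive)
    end: int      # last residue of the hit (inclusive)
    score: float  # alignment score; higher is better


def overlap_fraction(a: Hit, b: Hit) -> float:
    """Shared residues as a fraction of the shorter hit (assumed reading of the rule)."""
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    if shared <= 0:
        return 0.0
    shorter = min(a.end - a.start + 1, b.end - b.start + 1)
    return shared / shorter


def filter_overlaps(hits: list[Hit], max_overlap: float = 0.5) -> list[Hit]:
    """Keep hits greedily by score, dropping any that overlap a kept hit too much."""
    kept: list[Hit] = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if all(overlap_fraction(hit, k) <= max_overlap for k in kept):
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)


def rank_similar(
    query_domains: set[str], candidates: dict[str, list[str]]
) -> list[tuple[str, int]]:
    """Rank candidate proteins by how many of their hits are to query domains."""
    counts = {
        name: sum(domain in query_domains for domain in domains)
        for name, domains in candidates.items()
    }
    # A protein needs at least one shared domain to be listed at all.
    ranked = [(name, n) for name, n in counts.items() if n > 0]
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```

For example, given hits at residues 1-100 (score 50) and 40-120 (score 30), the second hit shares 61 of its 81 residues with the first and would be dropped under these assumptions.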

How are the hits clustered together? What does the number of sequences in the hit line mean?
Many proteins have very similar domain architectures. To reduce the size of the similar proteins list, we take advantage of this fact and cluster proteins by the order of their hits to domain definitions, ignoring repeats. Each cluster is labelled with the number of sequences it contains. If you click on a cluster, you will see a complete list of the proteins in it.

Why do some domains share the same graphical logo?
Some domain definitions are largely redundant. To simplify the display and increase sensitivity, we group redundant domains together. This is done by examining the domain hits on all sequences and looking for significant overlap that occurs consistently across the entire set of non-redundant sequences.

Why do some of the domains appear redundant?
Many of our domain definitions are imported from various groups, who may create nearly redundant definitions. We attempt to cluster these domains together (see the previous question), but this process is not foolproof. Redundancy does mean that some similar proteins are not shown as such, since the query sequence and the similar sequence may hit different redundant domains.

Why do some of the domains run together or overlap?
Some domain definitions contain multiple domains. If any of the subdomains are also defined separately, overlap can occur. We are working to redefine domains so that they either do not contain multiple domains or contain them in a hierarchical manner.

Why do some domains appear broken?
Domain hits are defined by matching a multiple sequence alignment to the query sequence. In some cases, only part of the match is statistically significant. You can adjust the expectation value of the original search to try to discover more of the domain. We are working to improve the domain definitions to increase their sensitivity and specificity.

How do I get more information on the sequences that are similar?
How do I find the same sequences with different annotations?
If you click on the protein name/accession on the results page, you will go to the Entrez record for the protein. To find related sequences, click on the related sequences link. Some of the sequences at the top of this list will be identical to the original sequence but with different annotations.

How do I see the sequence alignment itself in detail, along with expectation values?
Click on the RPS button next to each sequence. This runs RPS-BLAST on the sequence. Note that the RPS-BLAST alignment and the alignment you see in CDART may be slightly different, because the CDART alignments are recorded at an earlier time and the RPS-BLAST algorithm or the domain definitions may have changed in the meantime.

How do I query for sequences containing only the domains I am interested in?
At the bottom of the page, check off the domains you are interested in and then press the search button. The results are sequences that contain all of the domains you selected.

How do I search for a domain not found in your database?
You can use PSI-BLAST to define and iteratively search for your domain.

How do I get more information about the domains?
Click on the symbol or hyperlink for the domain. This will show a page containing a short description, MEDLINE references, taxonomy, and sequence alignments.

Is there a command line version of this program?
Not presently, but we are considering such a program.
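
As a rough illustration of the clustering described under "How are the hits clustered together?", the following Python sketch groups proteins by the order of their domain hits. It is not CDART's implementation. It assumes that "ignoring repeats" means collapsing consecutive hits to the same domain into one, and the function names and accessions (`P1`, `P2`, `P3`) are hypothetical.

```python
from collections import defaultdict
from itertools import groupby


def architecture_key(domains: list[str]) -> tuple[str, ...]:
    """Ordered domain names with consecutive repeats collapsed (assumed rule)."""
    return tuple(name for name, _ in groupby(domains))


def cluster_by_architecture(
    proteins: dict[str, list[str]]
) -> dict[tuple[str, ...], list[str]]:
    """Group protein accessions that share the same collapsed domain order."""
    clusters: dict[tuple[str, ...], list[str]] = defaultdict(list)
    for accession, domains in proteins.items():
        clusters[architecture_key(domains)].append(accession)
    return dict(clusters)


# Illustrative input: domain hits listed in N- to C-terminal order.
example = {
    "P1": ["SH3", "SH2", "Kinase"],
    "P2": ["SH3", "SH3", "SH2", "Kinase"],  # repeated SH3 collapses to one
    "P3": ["Kinase", "SH2"],                # different order, separate cluster
}
clusters = cluster_by_architecture(example)
```

Under these assumptions, `P1` and `P2` fall into one cluster of two sequences, and `P3` forms its own cluster because the order of its hits differs.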