What are "pseudo documents" and why do we need them?

Pseudo documents convert the information in short- and long-form karyotypes and in SKY, M-FISH, and CGH (automatically machine-generated as well as human-curated) profiles into one common set of terms. A pseudo document is composed of words created for breakpoints, junctions, amplifications, deletions, abnormal terminals (the terminal ends of chromosomes), and normal and abnormal chromosomes, in addition to the clinical textual content. Representing cases this way lets us reuse the statistical comparison that is used to detect related textual documents. See the discussion Computation of Related Articles for how we detect related textual documents.
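To illustrate why a common vocabulary makes text-style comparison possible, the following minimal sketch treats each pseudo document as a bag of words and scores the overlap with cosine similarity. Cosine similarity is only a simple stand-in chosen for illustration; it is not the Related Articles computation referred to above. The function names and the three shortened sample cases are hypothetical.

```python
from collections import Counter
from math import sqrt


def term_counts(pseudo_document: str) -> Counter:
    """Count how often each whitespace-separated pseudoword occurs."""
    return Counter(pseudo_document.split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors (illustrative stand-in only)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = sqrt(sum(count * count for count in a.values()))
    norm_b = sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical, heavily shortened pseudo documents using pseudowords
# of the kind shown in the native-form example.
case_1 = term_counts("A9 A22 BP9q34 BP22q11 9q34J22q11 22q11J9q34")
case_2 = term_counts("A9 A22 BP9q34 BP22q11 9q34J22q11 22q11J9q34 gainN8")
case_3 = term_counts("A17 BP17q10 17q10J17q10")

print(cosine_similarity(case_1, case_2))  # high: the cases share a t(9;22)
print(cosine_similarity(case_1, case_3))  # 0.0: no pseudowords in common
```

Because every case is reduced to the same kind of tokens, cases that come from different sources (karyotype, SKY, CGH) can be compared with one method.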

Which databases are converted to pseudo documents?

Pseudo documents are generated for all cases present in the NCI Mitelman Database of Chromosome Aberrations in Cancer, the NCI Recurrent Chromosome Aberrations in Cancer Database, and the NCI/NCBI Spectral Karyotyping (SKY) and Comparative Genomic Hybridization (CGH) Database.

How are the cases in the NCI Recurrent Chromosome Aberrations in Cancer Database converted to pseudo documents?

To convert the recurrent Mitelman cases to pseudo documents, we generate three pseudo documents. The Mitelman recurrent cases record the number of times each particular aberration was observed; hence we record one cell with the frequently found numerical aberrations, one cell with the frequently found structural aberrations, and another cell that comprises all structural aberrations for the recurrent case. The pseudowords generated for each cell are described below.

What does a pseudo document look like?

The following is an example of a pseudo document derived from the short-form karyotype 47,XX,+8,t(9;22)(q34;q11),i(17)(q10), which is listed in the Mitelman Database of Chromosome Aberrations in Cancer. The first section shows the pseudo document as displayed on the Entrez Cancer Chromosomes website, and the second section shows it in its native form. In the native form, some terms are repeated for statistical reasons. This repetition is removed for display on the Entrez Cancer Chromosomes website.
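The relationship between the two forms can be sketched in a few lines. The sketch assumes the native form is a whitespace-separated list of terms and that display keeps the first occurrence of each term; the website's actual display logic is not described here, and the function names are hypothetical.

```python
from collections import Counter


def term_weights(native_form: str) -> Counter:
    """Count repetitions; in the native form a repeated term carries more weight."""
    return Counter(native_form.split())


def display_terms(native_form: str) -> list[str]:
    """Drop repeated terms, keeping the first occurrence of each (assumed display rule)."""
    seen = set()
    unique = []
    for term in native_form.split():
        if term not in seen:
            seen.add(term)
            unique.append(term)
    return unique


# A short fragment in the style of the native-form breakpoint section.
native = "BP9U BP9U BP9q34 BP22q11 BP9q BP22q BP9q34 BP22q11"

print(term_weights(native)["BP9q34"])  # 2
print(display_terms(native))
# ['BP9U', 'BP9q34', 'BP22q11', 'BP9q', 'BP22q']
```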

Pseudo document (Human-readable format):
Pseudo document (Native form):

//Ploidy of cell
 Ploidy@@2
 
//Structurally abnormal chromosomes
A9 
A17 
A22 

//Numerical abnormalities
gainN8 
lossN9 
lossN17 
lossN22 

//Breakpoints
 BP9U  BP9U  BP9U  BP9U  BP9U  BP9U  BP9U  BP9U  BP9U  BP9U 
 BP22U  BP22U  BP22U  BP22U  BP22U  BP22U  BP22U  BP22U  BP22U  BP22U 
BP9q34 BP22q11 
BP9q BP22q 
BP9q34 BP22q11 
BP9q BP22q 
BP9q34 BP22q11 
BP9q BP22q 
BP9q34 BP22q11 
BP9q BP22q 
BP9q34 BP22q11 
BP9q BP22q 
BP9q34 BP22q11 
BP9q BP22q 
BP9q34 BP22q11 
BP9q BP22q 
BP9q34 BP22q11 
BP9q BP22q 
BP9q34 BP22q11 
BP9q BP22q 
BP9q34 BP22q11 
BP9q BP22q 

 BP17U  BP17U  BP17U  BP17U  BP17U  BP17U  BP17U  BP17U  BP17U  BP17U 
 BP17U  BP17U  BP17U  BP17U  BP17U  BP17U  BP17U  BP17U  BP17U  BP17U 
BP17q10 BP17q10 
BP17q BP17q 
BP17q10 BP17q10 
BP17q BP17q 
BP17q10 BP17q10 
BP17q BP17q 
BP17q10 BP17q10 
BP17q BP17q 
BP17q10 BP17q10 
BP17q BP17q 
BP17q10 BP17q10 
BP17q BP17q 
BP17q10 BP17q10 
BP17q BP17q 
BP17q10 BP17q10 
BP17q BP17q 
BP17q10 BP17q10 
BP17q BP17q 
BP17q10 BP17q10 
BP17q BP17q 

 BP22U  BP22U  BP22U  BP22U  BP22U  BP22U  BP22U  BP22U  BP22U  BP22U 
 BP9U  BP9U  BP9U  BP9U  BP9U  BP9U  BP9U  BP9U  BP9U  BP9U 
BP22q11 BP9q34 
BP22q BP9q 
BP22q11 BP9q34 
BP22q BP9q 
BP22q11 BP9q34 
BP22q BP9q 
BP22q11 BP9q34 
BP22q BP9q 
BP22q11 BP9q34 
BP22q BP9q 
BP22q11 BP9q34 
BP22q BP9q 
BP22q11 BP9q34 
BP22q BP9q 
BP22q11 BP9q34 
BP22q BP9q 
BP22q11 BP9q34 
BP22q BP9q 
BP22q11 BP9q34 
BP22q BP9q 

//Junctions
 9UJ22U  9UJ22U  9UJ22U  9UJ22U  9UJ22U  9UJ22U  9UJ22U  9UJ22U  9UJ22U  9UJ22U 
9q34J22q11 
9qJ22q 
9q34J22q 
9qJ22q11 
9q34J22q11 
9qJ22q 
9q34J22q 
9qJ22q11 
9q34J22q11 
9qJ22q 
9q34J22q 
9qJ22q11 
9q34J22q11 
9qJ22q 
9q34J22q 
9qJ22q11 
9q34J22q11 
9qJ22q 
9q34J22q 
9qJ22q11 
9q34J22q11 
9qJ22q 
9q34J22q 
9qJ22q11 
9q34J22q11 
9qJ22q 
9q34J22q 
9qJ22q11 
9q34J22q11 
9qJ22q 
9q34J22q 
9qJ22q11 
9q34J22q11 
9qJ22q 
9q34J22q 
9qJ22q11 
9q34J22q11 
9qJ22q 
9q34J22q 
9qJ22q11 

 17UJ17U  17UJ17U  17UJ17U  17UJ17U  17UJ17U  17UJ17U  17UJ17U  17UJ17U  17UJ17U  17UJ17U 
17q10J17q10 
17qJ17q 
17q10J17q 
17qJ17q10 
17q10J17q10 
17qJ17q 
17q10J17q 
17qJ17q10 
17q10J17q10 
17qJ17q 
17q10J17q 
17qJ17q10 
17q10J17q10 
17qJ17q 
17q10J17q 
17qJ17q10 
17q10J17q10 
17qJ17q 
17q10J17q 
17qJ17q10 
17q10J17q10 
17qJ17q 
17q10J17q 
17qJ17q10 
17q10J17q10 
17qJ17q 
17q10J17q 
17qJ17q10 
17q10J17q10 
17qJ17q 
17q10J17q 
17qJ17q10 
17q10J17q10 
17qJ17q 
17q10J17q 
17qJ17q10 
17q10J17q10 
17qJ17q 
17q10J17q 
17qJ17q10 

 22UJ9U  22UJ9U  22UJ9U  22UJ9U  22UJ9U  22UJ9U  22UJ9U  22UJ9U  22UJ9U  22UJ9U 
22q11J9q34 
22qJ9q 
22q11J9q 
22qJ9q34 
22q11J9q34 
22qJ9q 
22q11J9q 
22qJ9q34 
22q11J9q34 
22qJ9q 
22q11J9q 
22qJ9q34 
22q11J9q34 
22qJ9q 
22q11J9q 
22qJ9q34 
22q11J9q34 
22qJ9q 
22q11J9q 
22qJ9q34 
22q11J9q34 
22qJ9q 
22q11J9q 
22qJ9q34 
22q11J9q34 
22qJ9q 
22q11J9q 
22qJ9q34 
22q11J9q34 
22qJ9q 
22q11J9q 
22qJ9q34 
22q11J9q34 
22qJ9q 
22q11J9q 
22qJ9q34 
22q11J9q34 
22qJ9q 
22q11J9q 
22qJ9q34 


//Bands gained/lost with respect to ploidy

gain8pter gain8p gain8p23 gain8p22 gain8p21 gain8p12 gain8p11P2 gain8p11 
gain8p11P1 gain8p11 gain8cen gain8q11P1 gain8q11 gain8q gain8q11P2 gain8q11 
gain8q12 gain8q13 gain8q21P1 gain8q21 gain8q21P2 gain8q21 gain8q21P3 gain8q21 
gain8q22 gain8q23 gain8q24P1 gain8q24 gain8q24P2 gain8q24 gain8q24P3 gain8q24 
gain8qter 

gain9q34 gain9q 

loss17pter loss17p loss17p13 loss17p12 loss17p11P2 loss17p11 gain17cen loss17p11P1 loss17p11 
gain17q11P1 gain17q11 gain17q gain17q11P2 gain17q11 gain17q12 gain17q21 gain17q22 gain17q23 gain17q24 gain17q25 gain17qter 

gain22q11P1 gain22q11 gain22q gain22q11P2 gain22q11 


//Base Pair Positions for the Bands specified above.

gain8@@0M gain8@@0M gain8@@5M gain8@@10M  gain8@@15M gain8@@20M
gain8@@25M gain8@@30M gain8@@35M gain8@@40M  gain8@@45M gain8@@50M gain8@@55M gain8@@60M gain8@@65M gain8@@70M 
gain8@@75M gain8@@80M  gain8@@85M gain8@@90M gain8@@95M
gain8@@100M gain8@@105M gain8@@110M gain8@@115M  gain8@@120M
gain8@@125M gain8@@130M gain8@@135M gain8@@140M gain8@@145M
gain8@@150M gain8@@155M gain8@@155M

gain9@@130M gain9@@135M gain9@@140M gain9@@145M

loss17@@0M loss17@@0M loss17@@5M loss17@@10M loss17@@15M 
loss17@@20M loss17@@25M loss17@@30M
gain17@@35M gain17@@40M gain17@@45M gain17@@50M gain17@@55M
gain17@@60M gain17@@65M gain17@@70M gain17@@75M gain17@@80M
gain17@@85M gain17@@90M gain17@@95M

gain22@@15M gain22@@20M gain22@@25M
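The breakpoint and junction terms in the native form above follow a visible pattern: each breakpoint appears at chromosome level (the `U` suffix), at band level, and at arm level, and each junction appears in the corresponding combinations. The following sketch reproduces that pattern for a single rearrangement. The pattern is inferred from this one example only, so it should be read as an assumption rather than as the generator actually used; the function names are hypothetical, and neither the repetition counts nor sub-band terms such as `8p11P2` are handled.

```python
def breakpoint_terms(chrom: str, band: str) -> list[str]:
    """Breakpoint pseudowords at chromosome, band, and arm level."""
    arm = band[0]  # 'p' or 'q'
    return [f"BP{chrom}U", f"BP{chrom}{band}", f"BP{chrom}{arm}"]


def junction_terms(chrom_a: str, band_a: str, chrom_b: str, band_b: str) -> list[str]:
    """Junction pseudowords joining two breakpoints at mixed levels of resolution."""
    arm_a, arm_b = band_a[0], band_b[0]
    return [
        f"{chrom_a}UJ{chrom_b}U",                # chromosome to chromosome
        f"{chrom_a}{band_a}J{chrom_b}{band_b}",  # band to band
        f"{chrom_a}{arm_a}J{chrom_b}{arm_b}",    # arm to arm
        f"{chrom_a}{band_a}J{chrom_b}{arm_b}",   # band to arm
        f"{chrom_a}{arm_a}J{chrom_b}{band_b}",   # arm to band
    ]


# t(9;22)(q34;q11) from the example karyotype
print(breakpoint_terms("9", "q34"))
# ['BP9U', 'BP9q34', 'BP9q']
print(junction_terms("9", "q34", "22", "q11"))
# ['9UJ22U', '9q34J22q11', '9qJ22q', '9q34J22q', '9qJ22q11']
print(junction_terms("22", "q11", "9", "q34"))
# ['22UJ9U', '22q11J9q34', '22qJ9q', '22q11J9q', '22qJ9q34']
```

Emitting terms at several levels of resolution means that two cases can still share terms when one reports only the arm and the other reports the exact band.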


Credits :