PedHunter
PedHunter is a software package that facilitates creation and
verification of pedigrees within large genealogies.
-
Agarwala R, Biesecker LG, Hopkins KA, Francomano CA,
Schäffer AA:
Software for constructing and verifying pedigrees within
large genealogies and an application to the Old Order Amish of
Lancaster County.
Genome Research 8:211-221, 1998.
[PubMed]
[pedhunter.ps]
-
Lee W-J, Pollin TI, O'Connell JR, Agarwala R, Schäffer AA:
PedHunter 2.0 and its usage to characterize the founder structure of the
Old Order Amish of Lancaster County. BMC Medical Genetics 11:68, 2010.
[PubMed]
[pedhunter2_appeared.pdf]
Linkage analysis requires describing pedigrees for a set of
people exhibiting a specific trait and verifying relationships
in pedigrees. PedHunter uses methods from graph theory to solve
two versions of the pedigree connection problem for genealogies
as well as other pedigree analysis problems. The pedigrees are
produced by PedHunter as files in LINKAGE format ready for
linkage analysis and for drawing with a variety of drawing
programs, such as PEDDRAW and cranefoot.
In 2009, we completed version 2.0 of PedHunter, which includes
improved engineering of the code and many new queries.
Queries provided in PedHunter 2.0 can be divided into four categories:
-
testing relationships: is_father, is_mother,
is_child, is_sibling,
is_half_sibling, is_first_cousin,
is_ancestor, is_descendant,
is_founder
-
finding people satisfying a certain relation:
spouses, spouses_file, father,
mother, children, siblings,
half_siblings, uncles_aunts,
first_cousins, ancestors,
ancestors_file, descendants,
descendants_file, lca, lca_file,
birth_death, age, living,
lifespan, founder, founder_birth,
founder_descendant, count_descendant
-
printing information: person_info,
children_info, children_couple_info,
family_info
-
complex queries: family, subset,
all_shortest_paths,
all_shortest_paths_count, kinship,
inbreeding, ancestors_ped,
descendants_ped, all_relatives,
acp, asp, average_r,
calculate_r, minimal
Some queries are briefly described below:
-
ancestors_ped: Finds the ancestors of a given person up to a
specified number of generations. Output is a pedigree in
pre-makeped LINKAGE format. By using non-zero values of the
key, it is possible to restrict attention to male or female
ancestors, which is useful for Y chromosome and mitochondrial
studies.
-
lca (short for lowest common ancestors): Finds the
most recent common ancestors of two persons. For each such
ancestor, the program also prints the length of the shortest
paths from the two persons in the input to that ancestor.
-
lca_file: Given a list of people, find all persons
P such that each person p in P is
an ancestor of everyone in the list, but none of the children
of p are ancestors of everyone in the list.
-
subset: Find a maximal subset of a list of people
that has a common ancestor. The subset returned is "maximal"
in the sense that it cannot be enlarged, but not necessarily
of "maximum" size.
-
acp (short for all common paths pedigree): Find the all
common paths pedigree for the persons in an input file. Output is
a pedigree in pre-makeped LINKAGE format that includes all paths
that link more than one person in the input file.
-
asp (short for all shortest paths pedigree): Find the all
shortest paths pedigree for a given list of people, if any.
The typical use of asp is to find a pedigree to connect
several persons with the same phenotype. The program first
finds the minimal ancestors (as in the lca_file
query) of all persons in the file. Then for each minimal
ancestor all shortest (in number of generations) paths are
found to each person listed. The collected set of people in
the pedigree is output in pre-makeped LINKAGE format. When
there are multiple minimal ancestors, multiple pedigrees are
output. The justification for the "all shortest paths"
pedigree is in the paper cited above.
-
all_shortest_paths: Print all shortest paths from an
ancestor to a descendant. This function can be used to help
understand the output of asp.
-
lifespan: Find all persons who lived for a given
time. In order to deal with missing birth or death dates,
there are five pertinent options as follows.
-
all - If a person has a missing date and has
age ≥ queried age, we include that person.
-
pessimistic - Include only those individuals
that have both birth date and death date specified and age
≥ queried age.
-
optimistic - Include the subset of people who have a birth
date specified; among those who do not have a death date specified,
include only those who could not have lived beyond LIMIT
(default is 96) years, in order to throw out those who are
definitely missing a death date and are probably not living at
present.
-
optimistic living - Include those who qualify under the
optimistic criterion (the third option) and do not have a death
date specified.
-
oldest person - Consider all people who have both
death dates and birth dates. Print the birth date and death date
of the oldest person among all considered.
The computed "age" is approximate, since only the birth and
death years are considered; months and days are ignored.
-
all_relatives: Find all relatives of a person. Output
is a pedigree in pre-makeped LINKAGE format. Here "relative"
means either the person given as the argument or anyone
connected to that person via any combination of parent, spouse
or child links.
-
inbreeding: Compute the inbreeding coefficients of a
list of people with respect to the entire genealogy.
-
kinship: Compute the kinship coefficients of a list
of pairs of people with respect to the entire genealogy.
-
calculate_r: Calculate the relative representation of
each founder in each given descendant. We define the relative
founder representation by a given founder in a given
descendant as the expected proportion of alleles in the
descendant that were inherited identical-by-descent (IBD)
from the founder.
-
average_r: Calculate the mean relative founder
representation for each founder over all study descendants.
-
minimal: Print a minimal tree connecting the given list
of people who have the given asp pedigree. This
function can be used to find a small pedigree when the
asp pedigree is too big for your purpose. When there
are multiple minimal pedigrees, one of them is chosen
arbitrarily. Unfortunately, the performance of minimal
degrades rapidly as the given list of individuals grows.
Fortunately, other researchers have developed better software
for the general Steiner tree problem in graphs as described in
the following paper:
Koch T and Martin A: Solving Steiner Tree Problems in
Graphs to Optimality. Networks 32:207-232, 1998.
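To make the lca_file definition above concrete, here is a minimal Python sketch that computes the "minimal ancestors" of a list of people directly from that definition. It is not PedHunter's code. The `parents` dictionary (child identifier mapped to a tuple of known parent identifiers) and both function names are hypothetical, and the sketch assumes that a person is not counted as their own ancestor.

```python
from collections import deque


def ancestors(person, parents):
    """Return the set of all ancestors of `person` (the person is excluded)."""
    seen = set()
    queue = deque(parents.get(person, ()))
    while queue:
        p = queue.popleft()
        if p not in seen:
            seen.add(p)
            queue.extend(parents.get(p, ()))
    return seen


def minimal_common_ancestors(people, parents):
    """Common ancestors of everyone in `people`, none of whose children
    is also a common ancestor of everyone in `people`."""
    people = list(people)
    if not people:
        return set()
    common = set.intersection(*(ancestors(x, parents) for x in people))
    # Invert the parent map to obtain each person's children.
    children = {}
    for child, ps in parents.items():
        for p in ps:
            children.setdefault(p, set()).add(child)
    return {a for a in common if not (children.get(a, set()) & common)}


# Toy genealogy (hypothetical): 1 and 2 are the parents of 3;
# 3 and 4 are the parents of 5 and 6.
toy_parents = {3: (1, 2), 5: (3, 4), 6: (3, 4)}
print(sorted(minimal_common_ancestors([5, 6], toy_parents)))  # prints [3, 4]
```

In the toy example, persons 1 and 2 are also common ancestors of 5 and 6, but they are dropped because their child 3 is a common ancestor as well.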
Utility programs include linkage2tables,
verify_tables, generations, subped,
print_pedigree, renumber_pedigree and
trim_pedigree.
Some utility programs are briefly described below:
-
renumber_pedigree: Renumber IDs in an input pedigree
file, so that parent IDs are smaller than the child IDs and/or
add missing parents. Adding missing parents is necessary for
some packages, such as LINKAGE, that assume that each person
has either 0 or 2 parents shown. Renumbering to make parent
IDs smaller than child IDs is useful for some methods that
compute kinship coefficients.
-
trim_pedigree: Trim the parents of a nuclear family
that is at the top of the pedigree and has only one child.
Trimmed pedigrees are useful for studying some aspects of
inheritance such as the average_r.
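The renumbering step described for renumber_pedigree can be pictured as a topological ordering of the pedigree. The Python sketch below is an illustration under stated assumptions, not the utility's actual implementation: the function name `renumber` and the input layout (a dictionary mapping each person to a `(father, mother)` pair, with 0 meaning "unknown parent") are hypothetical, and the pedigree is assumed to contain no cycles.

```python
def renumber(pedigree):
    """Map old IDs to new IDs 1..n so that every parent gets a smaller
    new ID than each of its children.

    pedigree: dict person -> (father, mother); 0 marks an unknown parent.
    Assumes the pedigree is acyclic (nobody is their own ancestor).
    """
    new_id = {}
    for start in sorted(pedigree):
        stack = [start]
        while stack:
            person = stack[-1]
            if person in new_id:
                stack.pop()
                continue
            # Parents that still need a number must be handled first.
            pending = [p for p in pedigree.get(person, (0, 0))
                       if p != 0 and p not in new_id]
            if pending:
                stack.extend(pending)
            else:
                new_id[person] = len(new_id) + 1
                stack.pop()
    return new_id


# Hypothetical input in which a child (1) has a smaller ID than its parents.
old = {1: (5, 6), 5: (0, 0), 6: (0, 0), 2: (1, 7), 7: (0, 0)}
mapping = renumber(old)
for child, (father, mother) in old.items():
    for parent in (father, mother):
        if parent:
            assert mapping[parent] < mapping[child]
```

An explicit stack is used instead of recursion so that deep genealogies do not run into Python's recursion limit.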
The genealogy data to be used by PedHunter can be stored as a
relational database in Sybase or as column delimited ASCII
plaintext files. There are two required tables (the person
information table and the parent-child relationship table) and
two optional tables (the id table and the generation table):
-
Person table: This table has information specific to
a person; it has fields:
program identifier (required),
name (optional),
birth date (optional),
death date (optional),
address (optional),
gender (required for married couples),
special status (used to encode twins, adoptions, optional),
other information (optional).
-
Relationship table: This table encodes parent-child
relationships; it has fields:
program identifier of father (required),
program identifier of mother (required),
marriage date (optional),
delimited program identifiers for children
(with these two parents, required but can be empty).
-
Id table: If you have a system of identifiers for
your genealogy that you find convenient and these identifiers
are not integers, then an id table with columns for program
identifier and your identifier that expresses the 1-to-1
correspondence between them is required.
-
Generation table: This table can be generated
automatically using code provided in PedHunter and is needed
only if you are using the queries inbreeding and
kinship. This table lists the maximum generation
level for each person in the database.
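As a rough illustration of how the relationship table determines a generation table, the Python sketch below derives a generation level for every person from relationship records. It is not the code shipped with PedHunter. The text above does not define "maximum generation level" precisely, so the sketch assumes one plausible reading: people with no recorded parents are at level 0, and everyone else is one level below their deepest parent. The function name `generation_levels` and the in-memory record layout `(father, mother, children)` are hypothetical.

```python
def generation_levels(relationships):
    """Compute an assumed 'maximum generation level' per person.

    relationships: iterable of (father, mother, children) records, where
    `children` is a (possibly empty) list of program identifiers.
    Assumption: founders are level 0; otherwise level = 1 + max over parents.
    """
    parents = {}
    people = set()
    for father, mother, children in relationships:
        people.update((father, mother), children)
        for child in children:
            parents[child] = (father, mother)

    level = {}
    for start in people:
        stack = [start]
        while stack:
            person = stack[-1]
            if person in level:
                stack.pop()
                continue
            ps = parents.get(person, ())
            pending = [p for p in ps if p not in level]
            if pending:
                stack.extend(pending)  # resolve the parents first
            else:
                level[person] = 1 + max(level[p] for p in ps) if ps else 0
                stack.pop()
    return level


# Hypothetical records: couple (1, 2) has children 3 and 4;
# couple (3, 5) has child 6; couple (6, 4) has child 7.
records = [(1, 2, [3, 4]), (3, 5, [6]), (6, 4, [7])]
levels = generation_levels(records)
print(levels[7])  # prints 3
```

Person 7 ends up at level 3 because the longest chain of parent links to a founder (7, 6, 3, 1) has three steps, even though the chain through the other parent (7, 4, 1) has only two.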
Send comments, questions, and suggestions to
Richa Agarwala and
Alejandro Schäffer.