NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Gliklich RE, Dreyer NA, Leavy MB, editors. Registries for Evaluating Patient Outcomes: A User's Guide [Internet]. 3rd edition. Rockville (MD): Agency for Healthcare Research and Quality (US); 2014 Apr.

Chapter 6. Data Sources for Registries

1. Introduction

Identification and evaluation of suitable data sources should be completed within the context of the registry purpose and availability of the data of interest. A single registry may have multiple purposes and integrate data from various sources. While some data in a registry are collected directly for registry purposes (primary data collection), important information also can be transferred into the registry from existing databases. Examples include demographic information from a hospital admission, discharge, and transfer system; medication use from a pharmacy database; and disease and treatment information, such as details of the coronary anatomy and percutaneous coronary intervention from a catheterization laboratory information system, electronic medical record, or medical claims databases. In addition, observational studies can generate as many hypotheses as they test, and secondary sources of data can be merged with the primary data collection to allow for analyses of questions that were unanticipated when the registry was conceived.

This chapter will review the various sources of both primary and secondary data, comment on their strengths and weaknesses, and provide some examples of how data collected from different sources can be integrated to help answer important questions.

2. Types of Data

The types of data to be collected are guided by the registry design and data collection methods. The form, organization, and timing of required data are important components in determining appropriate data sources. Data elements can be grouped into categories identifying the specific variable or construct they are intended to describe. One framework for grouping data elements into categories follows:

  • Patient identifiers—Some registries may use patient identifiers to link data. In these registries, data elements are linked to the specific patient through a unique patient identifier or registry identification number. The use of patient identifiers may not be possible in all registries due to the additional legal requirements that usually apply to the use and disclosure of such data. (See Chapter 7.)
  • Patient selection criteria—The eligibility criteria in a registry protocol or study plan determine the group that will be included in the registry. These criteria may be very broad or restrictive, depending on the purpose. Criteria often include demographics (e.g., target age group), a disease diagnosis, a treatment, or diagnostic procedures and laboratory tests. Health care provider, health care facility or system, and insurance criteria may also be included in certain types of registries (e.g., following care patterns of specific conditions at large medical centers compared with small private clinics).
  • Treatments and tests—Treatments and tests are necessary to describe the natural history of patients. Treatments can include pharmaceutical, biotechnology, or device therapies, or procedures such as surgery or radiation. Evaluation of the treatment itself is often a primary focus of registries (e.g., treatment safety and effectiveness over 5 years). Results of laboratory testing or diagnostic procedures may be included as registry outcomes and may also be used in defining a diagnosis or condition of interest.
  • Confounders—Confounders are elements or factors that have an independent association with the outcomes of interest. These are particularly important because patients are typically not randomized to therapies in registries. Confounders such as comorbidities (disease diagnoses and conditions) can confuse analysis results and interpretation of causality. Information on the health care provider, treatment facility, concomitant therapies, or insurance may also be considered. Unknown confounders, or those not recorded in the registry, pose particular challenges for the analysis of patient outcomes. If external, or linked, data sources may provide values for these confounder variables otherwise not in the registry, they may ultimately help reduce bias in the analysis and interpretation of patient outcomes.
  • Outcomes—The focus of this document is on patient outcomes. Outcomes are end results and are defined for each condition. Outcomes may include patient-reported outcomes (PROs). In some registries, surrogate markers, such as biomarkers or other interim outcomes (e.g., hemoglobin A1c levels in diabetes) that are highly reflective of the longer term end results are used.

Before considering the potential sources for registry data, it is important to understand the types of data that may be collected in a registry. Several types of data that may be gathered from other sources in some registries are described below. The types of data elements included in this framework are further described in Chapter 4 and below with respect to their source or the utility of the data for linking to other sources. Many of these may be available through data sources outside of the registry system.

Cost/resource utilization—Cost and/or resource utilization data may be necessary to examine the cost-effectiveness of a treatment. Resource utilization data reflect the resources consumed (both services and products), while cost data reflect a monetary value assigned to those resources. Examples include the actual cost of the treatment (e.g., medication, screening, procedure) and the associated costs of the intervention (e.g., treatment of side effects, expenses incurred traveling to and from clinicians' appointments). Costs that are avoided due to the treatment (e.g., the cost to treat the avoided disease) and costs related to lost workdays may also be important to collect, depending on the objectives of the study. Registries that collect cost data over long periods of time (i.e., many years) may need to adjust costs for inflation during the analysis phase of the study.
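The inflation adjustment mentioned for long-running cost data can be sketched in Python as follows. This is a minimal illustration, assuming a yearly price index is available to the analyst. The function name `adjust_for_inflation` and the index values are hypothetical and are not real price-index figures.

```python
def adjust_for_inflation(cost, cost_year, base_year, price_index):
    """Express a cost recorded in cost_year in base_year currency.

    price_index maps a year to an index level; only the ratio between
    the two years matters, not the absolute scale of the index.
    """
    return cost * price_index[base_year] / price_index[cost_year]


# Hypothetical index levels for illustration only (not real figures).
price_index = {2010: 100.0, 2012: 105.0, 2014: 110.0}

# A procedure that cost 2,000 in 2010, restated in 2014 currency.
adjusted = adjust_for_inflation(2000.0, 2010, 2014, price_index)
print(round(adjusted, 2))
```

Restating all costs in a single base year before summing or comparing them keeps costs from early and late registry years on the same scale.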

Patient identifiers—Depending on the data sources required, some registries may use certain personal identifiers for patients in order to locate them in other databases and link the data. For example, Social Security numbers (SSNs) in combination with other personal identifiers can be used to identify individuals in the National Death Index (NDI). Patient contact information, such as address and phone numbers, may be collected to support tracking of participants over time. Information for additional contacts (e.g., family members) may be collected to support followup in cases where the patient cannot be reached. In many cases, patient informed consent and appropriate privacy authorizations are required so that personal identifiers can be used for registry purposes, and the use of personal identifiers may not be possible in some registries; Chapter 7 discusses the legal requirements for including patient identifiers. Systems and processes must be in place to manage security and confidentiality of these data. Confidentiality can be enhanced by assigning a registry-specific identifier via a crosswalk algorithm, as discussed below. Demographics, such as date of birth (to calculate age at any time point), gender, and ethnicity, are typically collected and may be used to stratify the registry population.
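One possible way to assign a registry-specific identifier through a crosswalk is sketched below. This is an assumption-laden illustration rather than a prescribed method: the class name `Crosswalk`, the `REG-` prefix, and the use of random UUIDs are all hypothetical choices. The point is that the registry identifier carries no personal information and that the mapping is held separately from the analytic data.

```python
import uuid


class Crosswalk:
    """Maps a personal identifier to a random registry-specific ID.

    In practice the mapping table would be stored securely and apart
    from the registry database; here it is an in-memory dictionary.
    """

    def __init__(self):
        self._to_registry_id = {}

    def registry_id_for(self, personal_identifier):
        # Reuse the existing ID so one patient always maps to one ID.
        if personal_identifier not in self._to_registry_id:
            new_id = "REG-" + uuid.uuid4().hex[:12]
            self._to_registry_id[personal_identifier] = new_id
        return self._to_registry_id[personal_identifier]


crosswalk = Crosswalk()
first = crosswalk.registry_id_for("patient-A-identifier")
again = crosswalk.registry_id_for("patient-A-identifier")
other = crosswalk.registry_id_for("patient-B-identifier")
print(first == again, first != other)
```

Because the identifier is random rather than derived from the personal identifier, it cannot be reversed without access to the crosswalk table itself.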

Disease/condition—Disease or condition data include those related to the disease or condition of focus for the registry and may incorporate comorbidities. Elements of interest related to the confirmation of a diagnosis or condition could be date of diagnosis and the specific diagnostic results that were used to make the diagnosis, depending on the purpose of the registry. Disease or condition is often a primary eligibility or outcome variable in registries, whether the intent is to answer specified treatment questions (e.g., measure effectiveness or safety) or to describe the natural history. This information may also be collected in constructing a medical history for a patient. In addition to “yes” or “no” to indicate presence or absence of the diagnosis, it may be important to capture responses such as “missing” or “unknown.”

Treatment/therapy—Treatment or therapy data include specific identifying information for the primary treatment (e.g., drug name or code, biologic, device product or component parts, or surgical intervention, such as organ transplant or coronary artery bypass graft) and may include information on concomitant treatments. Dosage (or parameters for devices), route of administration, and prescribed exposure time, such as daily or 3 times weekly for 4 weeks, should be collected. Pharmacy data may include dispensing information, such as the primary date of dispensation and subsequent refill dates. Data in device registries can include the initial date of dispensation or implantation and subsequent dates and specifics of required evaluations or modifications. Compliance data may also be collected if pharmacy representatives or clinic personnel are engaged to conduct and report pill counts or volume measurements on refill visits or return visits for device evaluations and modifications.

Laboratory/procedures—Laboratory data include a broad range of testing, such as blood, tissue, catheterization, and radiology. Specific test results, units of measure, and laboratory reference ranges or parameters are typically collected. Laboratory databases are becoming increasingly accessible for electronic transfer of data, whether through a system-wide institutional database or a private laboratory database. Diagnostic testing or evaluation may include procedures such as psychological or behavioral assessments. Results of these procedures and clinician exam procedures may be difficult to obtain through data sources other than the patient medical record.

Biosamples—The increased collection, testing, and storage of biological specimens as part of a registry (or independently as a potential secondary data source such as those described further below) provides another source of information that includes both information from genetic testing (such as genetic markers) and actual specimens.

Health care provider characteristics—Information on the health care provider (e.g., physician, nurse, or pharmacist) may be collected, depending on the purpose of the registry. Training, education, or specialization may account for differences in care patterns. Geographic location has also been used as an indicator of differences in care or medical practice.

Hospital/clinic/health plan—System interactions include office visits, outpatient clinic visits, emergency room visits, inpatient hospitalizations, procedures, and pharmacy visits, as well as associated dates. Data on all procedures as defined by the registry protocol or plan (e.g., physical exam, psychological evaluation, chest x-ray, CAT scan), including measurements, results, and units of measure where applicable, should be collected. Cost accounting data may also be available to match these interactions and procedures. Descriptive information related to the points of care may be useful in capturing differences in care patterns and can also be used to track patterns of referral of care (e.g., outpatient clinic, inpatient hospital, academic center, emergency room, pharmacy).

Insurance—The insurance system or payer claims data can provide useful information on interactions with the health care systems, including visits, procedures, inpatient stays, and costs associated with these events. When using these data, it is important to understand what services were covered under the various insurance plans at the time the data were collected, as this may affect utilization patterns.

3. Data Sources

Data sources are classified as primary or secondary based on the relationship of the data to the registry purpose. Primary data sources incorporate data collected for direct purposes of the registry (i.e., primarily for the registry). Primary data sources are typically used when the data of interest are not available elsewhere or, if available, are unlikely to be of sufficient accuracy and reliability for the planned analyses and uses. Primary data collection increases the probability of completeness, validity, and reliability because the registry drives the methods of measurement and data collection. (See Chapter 4.) These data are prospectively planned and collected under the direction of a protocol or study plan, using common procedures and the same format across all registry sites and patients. The data are readily integrated for tracking and analyses. Since the data entered can be traced to the individual who collected them, primary data sources are more readily reviewed through automated checks or followup queries from a data manager than is possible with many secondary data sources.

Secondary data sources consist of data originally collected for purposes other than the registry under consideration (e.g., standard medical care, insurance claims processing). Data that are collected as primary data for one registry are considered secondary data from the perspective of a second registry to which they are linked. These data are often stored in electronic format and may be available for use with appropriate permissions. Data from secondary sources may be used in two ways: (1) the data may be transferred and imported into the registry, becoming part of the registry database, or (2) the secondary data and the registry data may be linked to create a new, larger data set for analysis. This chapter primarily focuses on the first use for secondary data, while Chapters 16, 17, and 18 discuss the complexities of linking registries with other databases.

When considering secondary data sources, it is important to note that health professionals are accustomed to entering the data for defined purposes, and additional training and support for data collection are not required. Often, these data are not constrained by a data collection protocol and they represent the diversity observed in real-world practice. However, there may be increased probability of errors and underreporting because of inconsistencies in measurement, reporting, and collection. Staff changes can further complicate data collection and may affect data quality. There may also be increased costs for linking the data from the secondary source to the primary source and dealing with any potential duplicate or unmatched patients.

Sufficient identifiers are also necessary to accurately match data between the secondary sources and registry patients. The potential for mismatch errors and duplications must be managed. (See Case Example 40.) The complexity and obligations inherent in the collection and handling of personal identifiers have previously been mentioned (e.g., obligations for informed consent, appropriate data privacy, and confidentiality procedures).

Some of the secondary data sources do not collect information at a specific patient level but are anonymous and intended to reflect group or population estimates. For example, census tract or ZIP-Code-level data are available from the Census Bureau and can be merged with registry data. These data can be used as “ecological variables” to support analyses of income or education when such socioeconomic data are missing from registry primary data collection. The intended use of the data elements will determine whether patient-level information is required.
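Merging area-level values into patient records as ecological variables might look like the following Python sketch. The field names, the ZIP Codes, and the income and education values are all hypothetical placeholders, not Census Bureau data. Patients whose ZIP Code has no area-level entry are kept, with the area variables left empty.

```python
def attach_area_variables(patients, area_data):
    """Return copies of patient rows with ZIP-Code-level variables added.

    The added values describe the area, not the individual patient.
    """
    merged = []
    for patient in patients:
        area = area_data.get(patient.get("zip_code"), {})
        row = dict(patient)  # copy so the input rows are not modified
        row["area_median_income"] = area.get("median_income")
        row["area_pct_college"] = area.get("pct_college")
        merged.append(row)
    return merged


# Placeholder area-level table keyed by ZIP Code (illustrative values).
area_data = {
    "00001": {"median_income": 52000, "pct_college": 0.31},
    "00002": {"median_income": 78000, "pct_college": 0.48},
}
patients = [
    {"registry_id": "R1", "zip_code": "00001"},
    {"registry_id": "R2", "zip_code": "99999"},  # no area data available
]

for row in attach_area_variables(patients, area_data):
    print(row)
```

Prefixing the merged columns with `area_` is one way to keep analysts from mistaking a group-level estimate for a patient-level measurement.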

The potential for data completeness, variation, and specificity must be evaluated in the context of the registry and intended use of the data. It is advisable to have a solid understanding of the original purpose of the secondary data collection, including processes for collection and submission, and verification and validation practices. Questions to ask include: Is data collection passive or active? Are standard definitions or codes used in reporting data? Are standard measurement criteria or instruments used (e.g., diagnoses, symptoms, quality of life)? The existence and completeness of claims data, for example, will depend on insurance company coverage policies. One company may cover many preventive services, whereas another may have more restricted coverage. One company may cover a treatment without restriction, while another may require prior authorization by the physician or require that the patient must have first failed on a previous, less expensive treatment. Also, coverage policies can change over time. These variations must be known and carefully documented to prevent misinterpretation of use rates. Additionally, secondary data may not all be collected in the format (e.g., units of measure) required for registry purposes and may require transformation for integration and analyses.
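The unit transformation mentioned above can be illustrated with a small Python sketch. It assumes, purely for illustration, that the registry stores blood glucose in mmol/L while a secondary source reports some values in mg/dL; the function name is hypothetical. The conversion factor follows from the molar mass of glucose (about 180.16 g/mol).

```python
# 1 mmol/L of glucose corresponds to about 18.016 mg/dL.
GLUCOSE_MG_DL_PER_MMOL_L = 18.016


def glucose_to_mmol_l(value, unit):
    """Convert a glucose result to mmol/L; reject units not handled here."""
    normalized = unit.strip().lower()
    if normalized == "mmol/l":
        return value
    if normalized == "mg/dl":
        return value / GLUCOSE_MG_DL_PER_MMOL_L
    raise ValueError(f"unrecognized glucose unit: {unit!r}")


print(round(glucose_to_mmol_l(126.0, "mg/dL"), 2))
print(glucose_to_mmol_l(5.5, "mmol/L"))
```

Raising an error on an unrecognized unit, rather than passing the value through, keeps unconverted results from silently entering the registry.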

An overview of some secondary data sources that may be used for registries is given below. Table 6–1 identifies some key strengths and limitations of the identified data sources.

Table 6–1. Key data sources—strengths and limitations.

Medical chart abstraction—Medical charts primarily contain information collected as a part of routine medical care. These data reflect the practice of medicine or health care in general and at a specific level (e.g., geographical, by specialty care provider). Charts also reflect uncontrolled patient behavior (e.g., noncompliance). Collection of standard medical practice data is useful in looking at treatments and outcomes in the real world, including all of the confounders that affect the measurement of effectiveness (as distinguished from efficacy) and safety outside of the controlled conditions of a clinical trial. Chart documentation is often much poorer than one might expect, and there may be more than one patient-specific medical record (e.g., hospital and clinical records). A pilot collection is recommended for this labor-intensive method of data collection to explore the availability and reproducibility of the data of interest. It is important to recognize that physicians and other clinicians do not generally use standardized data definitions in entering information into medical charts, meaning that one clinician's documented diagnosis of “chronic sinusitis” or “osteoarthritis” or description of “pedal edema” may differ from that of another clinician.

Electronic health records—The use of electronic health records (EHRs), sometimes called electronic medical records (EMRs), is increasing. EHRs have an advantage over paper medical records because the data in some EHRs can be readily searched and integrated with other information (e.g., laboratory data). The ease with which this is accomplished depends on whether the information is in a relational database or exists as scanned documents. An additional challenge relates to terminology and relationships. For example, including the term “fit” in a search for patients with epilepsy can yield a record for someone who was noted as “fit,” meaning “healthy.” Relationships can also be difficult to identify through searches (e.g., “Patient had breast cancer” vs. “Patient's mother had breast cancer”). The quality of the information has the same limitations as described in the paragraph above. Both the availability and standardization of EHR data have grown significantly in recent years, and this trend is expected to continue. As of 2009, some data suppliers cited individual data sets exceeding 10 million lives.1 More recently, data suppliers are reporting 20 million2 to 35 million3 patients in their data sets. Further, it is anticipated that more significant standardization of EHR data will result from the “EHR certification” requirements being developed in phases under the American Recovery and Reinvestment Act of 2009 (ARRA). Such standardization should increase not only the availability and utility of EHR records, but also the ability to aggregate them into larger data sources.
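The terminology and relationship problems described above can be reproduced with a naive keyword search. The Python sketch below uses three invented free-text notes (not real patient data) and a hypothetical helper, `keyword_hits`, to show how a plain text match returns records that a reviewer would reject.

```python
import re

# Invented free-text notes for illustration only.
notes = {
    "A": "Patient had a fit last week; started on anticonvulsant.",
    "B": "Patient is fit and well.",
    "C": "Patient's mother had breast cancer.",
}


def keyword_hits(notes, term):
    """Return the IDs of notes containing the term as a whole word/phrase."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return sorted(key for key, text in notes.items() if pattern.search(text))


# Note B matches "fit" although it describes a healthy patient.
print(keyword_hits(notes, "fit"))
# Note C matches although the diagnosis belongs to a relative.
print(keyword_hits(notes, "breast cancer"))
```

Hits from a search like this are better treated as candidates for manual or more sophisticated review than as confirmed cases.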

Institutional or organizational databases— Institutional or organizational databases may be evaluated as potential sources of a wide variety of data. System-wide institutional or hospital databases are central data repositories, or data warehouses, that are highly variable from institution to institution. They may include a portion of everything from admission, discharge, and transfer information to data reflecting diagnoses and treatment, pharmacy prescriptions, and specific laboratory tests. Laboratory test data might be chemistry or histology laboratory data, including patient identifiers with associated dates of specimen collection and measurement, results, and standard “normal” or reference ranges. Catheterization laboratory data for cardiac registries may be accessible and may include details on the coronary anatomy and percutaneous coronary intervention. Other organizational examples are computerized order entry systems, pharmacies, blood banks, and radiology departments.

Administrative databases—Private and public medical insurers collect a wealth of information in the process of tracking health care, evaluating coverage, and managing billing and payment. Information in the databases includes patient-specific information (e.g., insurance coverage and copays; identifiers such as name, demographics, SSN or plan number, and date of birth) and health care provider descriptive data (e.g., identifiers, specialty characteristics, locations). Typically, private insurance companies organize health care data by physician care (e.g., physician office visits) and hospital care (e.g., emergency room visits, hospital stays). Data include procedures and associated dates, as well as costs charged by the provider and paid by the insurers. Amounts paid by insurers are often considered proprietary and unavailable. Standard coding conventions are used in the reporting of diagnoses, procedures, and other information. Coding conventions include the Current Procedural Terminology (CPT) for physician services and International Classification of Diseases (ICD) for diagnoses and hospital inpatient procedures. The databases serve the primary function of managing and implementing insurance coverage, processing, and payment. (See Case Example 12.)

Medicare and Medicaid claims files are two examples of commonly used administrative databases. The Medicare program covers over 43 million people in the United States, including almost everyone over the age of 65, people under the age of 65 who qualify for Social Security Disability, and people with end-stage renal disease.4 The Medicaid program covers low-income children and their mothers; pregnant women; and blind, aged, or disabled people. As of 2007, approximately 40 million people were covered by Medicaid.5 Medicare and Medicaid claims files, maintained by the Centers for Medicare & Medicaid Services (CMS), can be obtained for inpatient, outpatient, physician, skilled nursing facility, durable medical equipment, and hospital services. As of 2006, Medicare claim files for prescription drugs can also be obtained. The claims files generally contain person-specific data on providers, beneficiaries, and recipients, including individual identifiers that would permit the identity of a beneficiary or physician to be deduced. Data with personal identifiers are clearly subject to privacy rules and regulations. As such, the information is confidential and to be used only for reasons compatible with the purpose(s) for which the data are collected. The Research Data Assistance Center (ResDAC), a CMS contractor at the University of Minnesota, provides assistance to academic, government, and nonprofit researchers interested in using Medicare and/or Medicaid data for their research.6

Death and birth records—Death indexes are national databases tracking population death data (e.g., the NDI7 and the Death Master File [DMF] of the Social Security Administration [SSA]8). Data include patient identifiers, date of death, and attributed causes of death. These indexes are populated through a variety of sources. For example, the DMF includes death information on individuals who had an SSN and whose death was reported to the SSA. Reports may come in to the SSA by different paths, including from survivors or family members requesting benefits or from funeral homes. Because of the importance of tracking Social Security benefits, all States, nursing homes, and mortuaries are required to report all deaths to the SSA. Prior to 2011, the DMF contained virtually 100-percent complete mortality ascertainment for those eligible for SSA benefits. As of November 2011, however, the DMF no longer includes protected State death records. In practical terms, this means that approximately 4.2 million records were removed from the historical public DMF (which contained 89 million records), and some 1 million fewer records will be added to the DMF each year.9 The NDI can be used to provide both fact of death and cause of death, as recorded on the death certificate. Cause-of-death data in the NDI are relatively reliable (93–96 percent) compared with death certificates.10, 11 Time delays in death reporting should be considered when using these sources, and vital status should not be assumed to be “alive” by the absence of information at a recent point in time. These indexes are valuable sources of data for death tracking. Of course, mortality data can be accessed directly through queries of State vital statistics offices and health departments when targeting information on a specific patient or within a State. Likewise, birth certificates are available through State departments and may be useful in registries of children or births.
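The caution about reporting delays can be made concrete with a short Python sketch. It assumes the analyst knows a date through which the death index is considered reasonably complete; that date, the function name, and the status labels are hypothetical. Even the "no death record found" label is not proof that the patient was alive, given the coverage gaps described above.

```python
from datetime import date


def vital_status(death_date, as_of, index_complete_through):
    """Classify vital status on the as_of date from a death index search.

    death_date is the date of a matched death record, or None if the
    search found no match.
    """
    if death_date is not None:
        return "deceased" if death_date <= as_of else "alive"
    if as_of > index_complete_through:
        # Too recent: a death may simply not have been reported yet.
        return "unknown"
    return "no death record found"


complete_through = date(2012, 6, 30)  # hypothetical cutoff
print(vital_status(date(2012, 3, 1), date(2013, 1, 1), complete_through))
print(vital_status(None, date(2013, 1, 1), complete_through))
print(vital_status(None, date(2012, 1, 1), complete_through))
```

Keeping "unknown" as a distinct value lets the analysis censor follow-up at the completeness cutoff instead of counting recent non-matches as survivors.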

Area-level databases—Two sources of area-level data are the U.S. Census and the Area Health Resources Files (AHRF). The U.S. Census Bureau databases12 provide population-level data utilizing survey sampling methodology. The Census Bureau conducts many different surveys, the main one being the population census. The primary use of the data is to determine the number of seats assigned to each State in the House of Representatives, although the data are used for many other purposes. These surveys calculate estimates through statistical processing of the sampled data. Estimates can be provided with a broad range of granularity, from population numbers for large regions (e.g., specific States), to ZIP Codes, all the way down to a household level (e.g., neighborhoods identified by street addresses). Information collected includes demographic, gender, age, education, economic, housing, and work data. The data are not collected at an individual level but may serve other registry purposes, such as understanding population numbers in a specific region or by specific demographics. The AHRF is maintained by the Health Resources and Services Administration, which is part of the Department of Health and Human Services. The AHRF includes county-level data on health facilities, health professions, measures of resource scarcity, health status, economic activity, health training programs, and socioeconomic and environmental characteristics.13

Provider-level databases—Data on medical facilities and physicians may be important for categorizing registry data or conducting subanalyses. Two sources of such data are the American Hospital Association's Annual Survey Data and the American Medical Association's Physician Masterfile Data Collection. The Annual Survey Data is a longitudinal database that collects 700 data elements, covering organizational structure, personnel, hospital facilities and services, and financial performance, from more than 6,000 hospitals in the United States.14 Each hospital in the database has a unique ID, allowing the data to be linked to other sources; however, there is a data lag of about 2 years, and the data may not provide enough nuanced detail to support some analyses of cost or quality of care. The Physician Masterfile Data Collection contains current and historic data on nearly one million physicians and residents in the United States. Data on physician professional medical activities, hospital and group affiliations, and practice specialties are collected each year.

Encounter-level databases—Databases of individual patient encounters (e.g., physician office visits, emergency department visits, hospital inpatient stays) generally do not contain individual patient identifiers and thus may not be linkable to patient registries, but they nevertheless provide valuable insight into the makeup of the registry's target population. This is particularly true for data from nationally representative surveys, such as AHRQ's Healthcare Cost and Utilization Project (HCUP) Nationwide Inpatient Sample (NIS) and the suite of surveys by the Centers for Disease Control and Prevention (CDC) and the National Center for Health Statistics (NCHS), including the National Ambulatory Medical Care Survey (NAMCS), the National Hospital Ambulatory Medical Care Survey (NHAMCS), and the National Hospital Discharge Survey (NHDS).

Existing registry and other databases—There are numerous national and regional registries and other databases that may be leveraged for incorporation into other registries (e.g., disease-specific registries managed by nonprofit organizations, professional societies, or other entities). An example is the National Marrow Donor Program (NMDP),15 a global database of cord blood units and volunteers who have consented to donate marrow and blood cells. Databases maintained by the NMDP include identifiers and locators in addition to information on the transplants, such as samples from the donor and recipient, histocompatibility, and outcomes. NMDP actively encourages research and utilization of registry data through a data application process and submission of research proposals.

The Registry of Patient Registries (RoPR) may become a useful resource for finding existing registries. RoPR is a database of registry-specific information intended to promote collaboration, reduce redundancy, and improve transparency in registry-based research. The database contains information on existing registries, such as the registry description, classification, and purpose, as well as the registry sponsor's interest in collaboration opportunities. Registry planners may be able to use RoPR to identify relevant registries to contact about data sharing or research collaborations.

In accessing data from one registry for the purposes of another, it is important to recognize that data may have changed during the course of the source registry, and this may or may not have been well documented by the providers of the data. For example, in the United States Renal Data System (USRDS),16 a vital part of personal identification is CMS 2728, an enrollment form that identifies the incident data for each patient as well as other pertinent information, such as the cause of renal failure, initial therapy, and comorbid conditions. Originally created in 1973, this form is in its third version, having been revised in 1995 and again in 2005. Consequently, there are data elements that exist in some versions and not others. In addition, the coding for some variables has changed over time. For example, race has been redefined to correspond with Office of Management and Budget directives and Census Bureau categories. Furthermore, form CMS 2728 was optional in the early years of the registry, so until 1983 it was filled out for only about one-half of the subjects. Since 1995, it has been mandatory for all people with end-stage renal disease. These changes in form content, data coding, and completeness would not be evident to most researchers trying to access the data.
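One way to handle coding that changes between form versions is to keep an explicit, documented map per version and translate every value to a common category before analysis. The Python sketch below is illustrative only: the version labels, codes, and category names are hypothetical and do not reproduce the actual coding of form CMS 2728.

```python
# Hypothetical per-version code maps. Real codes must be taken from the
# source registry's documentation for each form version.
CODE_MAPS = {
    "v1": {"1": "category_a", "2": "category_b"},
    "v2": {"A": "category_a", "B": "category_b", "C": "category_c"},
}


def harmonize(value, form_version, code_maps=CODE_MAPS):
    """Translate a version-specific code to a common category.

    Returns None when the element was not recorded and "unmapped" when
    the code is not documented for that form version.
    """
    if value is None:
        return None
    version_map = code_maps.get(form_version)
    if version_map is None:
        raise ValueError(f"undocumented form version: {form_version!r}")
    return version_map.get(value, "unmapped")


print(harmonize("1", "v1"), harmonize("C", "v2"), harmonize("9", "v1"))
```

Flagging undocumented codes as "unmapped" rather than guessing makes gaps in the source registry's documentation visible to the analyst.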

4. Other Considerations for Secondary Data Sources

The discussion below focuses on logistical and data issues to consider when incorporating data from other sources. Chapter 11 fully explores data collection, management, and quality assurance for registries.

Before incorporating a secondary data source into a registry, it is critical to consider the potential impact of the data quality of the secondary data source on the overall data quality of the registry. The potential impact of quality issues in the secondary data sources depends on how the data are used in the primary registry. For example, quality would be significant for secondary data that are intended to be populated throughout the registry (i.e., used to populate specific data elements in the entire registry over time), particularly if these populated data elements are critical to determining a primary outcome. Quality of the secondary data will have less effect on overall registry quality if the secondary data are to be linked to registry data only for a specific analytic study (see Chapter 18). For more information on data quality, see Chapter 11.

The importance of patient identifiers for linking to secondary data sources cannot be overstated. Multiple patient identifiers should be used, and primary data for these identifiers should not be entered into the registry unless the identifying information is complete and clear. While an SSN is very useful, high-quality probabilistic linkages can be made to secondary data sources using various combinations of such information as name (last, middle initial, and first), date of birth, and gender. For example, the NDI will make possible matches when at least one of seven matching conditions is met (e.g., one matching condition is “exact month and day of birth, first name, and last name”). However, the degree of success in such probabilistic and deterministic matching generally is enhanced by having many identifiers to facilitate matching. As noted earlier, the various types of data (e.g., personal history, adverse events, hospitalization, and drug use) have to be linked through a common identifier. A discussion of both statistical and privacy issues in linkage is provided in Chapter 16, and a discussion of managing patient identity across systems is provided in Chapter 17.
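The idea of accepting a possible match when at least one of several matching conditions is met can be sketched in a few lines of Python. This is a minimal, rule-based illustration, not the NDI's actual procedure and not a probabilistic linkage model. The `Person` type, the field names, and the `same_ssn` condition are assumptions made for the example; only the month, day, first name, and last name condition is taken from the text above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Person:
    # Hypothetical record layout; a real registry would carry more identifiers.
    first_name: str
    last_name: str
    birth_date: date
    ssn: Optional[str] = None  # may be missing in either source


def _norm(text: str) -> str:
    """Compare names without regard to case or surrounding whitespace."""
    return text.strip().casefold()


def same_name_month_day(a: Person, b: Person) -> bool:
    """Condition quoted in the text: exact month and day of birth, first and last name."""
    return (
        (a.birth_date.month, a.birth_date.day) == (b.birth_date.month, b.birth_date.day)
        and _norm(a.first_name) == _norm(b.first_name)
        and _norm(a.last_name) == _norm(b.last_name)
    )


def same_ssn(a: Person, b: Person) -> bool:
    """Assumed extra condition for illustration: both SSNs present and identical."""
    return a.ssn is not None and a.ssn == b.ssn


# A possible match needs only one condition to hold.
CONDITIONS = [same_name_month_day, same_ssn]


def is_possible_match(a: Person, b: Person) -> bool:
    return any(condition(a, b) for condition in CONDITIONS)
```

A pair flagged this way is only a candidate. Having more identifiers available allows more conditions to be defined, which is one reason the number of identifiers collected affects linkage success.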

The best identifier is one that is not only unique but has no embedded personal identification, unless that information is scrambled and the key for unscrambling it is stored remotely and securely. The group operating the registry should have a process by which each new entry to the registry is assigned a unique code and there is a crosswalk file to enable the system to append this identifier to all new data as they are accrued. The crosswalk file should not be accessible by people or entities outside the management group.
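The crosswalk process described above might look like the following sketch. It assumes incoming records are dictionaries with a `patient_id` field holding the source identifier; the class name, field names, and the use of random hexadecimal codes are illustrative choices, not a prescribed design. The point is that the assigned code carries no embedded personal information and that the same person receives the same code as new data accrue.

```python
import secrets


class Crosswalk:
    """Maps source identifiers to random study codes.

    In practice this mapping would be stored securely and kept
    inaccessible to anyone outside the registry management group.
    """

    def __init__(self):
        self._by_source_id = {}

    def study_id_for(self, source_id: str) -> str:
        # Reuse the existing code so new data attach to the same person.
        if source_id not in self._by_source_id:
            self._by_source_id[source_id] = self._new_code()
        return self._by_source_id[source_id]

    def _new_code(self) -> str:
        existing = set(self._by_source_id.values())
        while True:
            code = secrets.token_hex(8)  # random, no embedded identification
            if code not in existing:
                return code


def deidentify(records, crosswalk, id_field="patient_id"):
    """Replace the source identifier with the study code on each record."""
    out = []
    for record in records:
        row = {k: v for k, v in record.items() if k != id_field}
        row["study_id"] = crosswalk.study_id_for(record[id_field])
        out.append(row)
    return out
```

Records passed through `deidentify` in later batches pick up the same `study_id` for a returning patient, while the source identifier stays only in the crosswalk.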

In addition, consideration should be given to the fact that a registry may need to accept and link data sets from more than one outside organization. Each institution contributing data to the registry will have unique requirements for patient data, access, privacy, and duration of use. While having identical agreements with all institutions would be ideal, this may not always be possible in practice. Because all registries have resource constraints, decisions about including certain institutions must take into account the resources needed to negotiate specialized agreements or to maintain specialized requirements. Agreements should be coordinated as much as possible so that the function of the registry is not greatly impaired by variability among agreements. All organizations participating in the registry should have a common understanding of the rules regarding access to the data. Although exceptions can be made, it should be agreed that access to data will be based on independent assessment of research protocols and that participating organizations will not have individual veto power over access.

When data from secondary sources are used, agreements should specify ownership of the source data and clearly permit data use by the recipient registry. The agreements should also specify the roles of each institution, its legal responsibilities, and any oversight issues. It is critical that these issues and agreements be put in place before data are transferred so that there are no ambiguities or unforeseen restrictions on the recipient registry later on.

Some registries may wish to incorporate data from more than one country. In these cases, it is important to ensure that the data are being collected in the same manner in each country or to plan for any necessary conversion. For example, height and weight data collected from sites in Europe will likely be in different units than height and weight data collected from sites in the United States. Laboratory test results may also be reported in different units, and there may be variations in the types of pharmaceutical products and medical devices that are approved for use in the participating countries. Understanding these issues prior to incorporating secondary data sources from other countries is extremely important to maintain the integrity and usefulness of the registry database.
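As an illustration of planning for unit conversion, the sketch below normalizes height and weight to metric units at the point of import. It assumes each incoming record states its units explicitly in `height_unit` and `weight_unit` fields; those field names and the choice of metric as the target are assumptions for the example. Records with an unrecognized unit are rejected rather than guessed at.

```python
IN_TO_CM = 2.54        # inches to centimeters, exact by definition
LB_TO_KG = 0.45359237  # pounds to kilograms, exact by definition

HEIGHT_TO_CM = {"cm": 1.0, "in": IN_TO_CM}
WEIGHT_TO_KG = {"kg": 1.0, "lb": LB_TO_KG}


def to_metric(record):
    """Return height in cm and weight in kg for one incoming record."""
    try:
        height_factor = HEIGHT_TO_CM[record["height_unit"]]
        weight_factor = WEIGHT_TO_KG[record["weight_unit"]]
    except KeyError as exc:
        # Refuse to store a value whose unit is missing or unknown.
        raise ValueError(f"unrecognized or missing unit: {exc}") from exc
    return {
        "height_cm": record["height"] * height_factor,
        "weight_kg": record["weight"] * weight_factor,
    }


us_site = {"height": 70, "height_unit": "in", "weight": 154, "weight_unit": "lb"}
eu_site = {"height": 178, "height_unit": "cm", "weight": 70, "weight_unit": "kg"}
normalized = [to_metric(r) for r in (us_site, eu_site)]
```

The same pattern extends to laboratory results, where the conversion factor depends on the analyte as well as the unit.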

When incorporating other data sources, consideration should also be given to the registry update schedule. A mature registry will usually have a mix of data update schedules. The registry may receive an annual update of large amounts of data, or there could be monthly, weekly, or even daily transfers of data. Regardless of the schedule of data transfer, routine data checks should be in place to ensure proper transfer of data. These should include simple counts of records as well as predefined distributions of key variables. Conference calls or even routine meetings to go over recent transfers will help avoid mistakes that might not otherwise be picked up until much later.

An example of the need for regular communication is a situation that arose with the United States Renal Data System a few years ago. The United Network for Organ Sharing (UNOS) changed the coding for donor type in their transplant records. This resulted in an apparent 100-percent loss of living donors in a calendar year. The change was not conveyed to USRDS and was not detected by USRDS staff. After USRDS learned about the change, standard analysis files that had been sent to researchers with the errors had to be replaced.
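The routine checks described earlier (record counts and predefined distributions of key variables) are the kind of safeguard that could flag a recoding like this one. The sketch below is illustrative only: the function name, the `donor_type` field, the code values, and the expected shares are all made up for the example and do not reflect actual USRDS or UNOS data or procedures.

```python
from collections import Counter


def check_transfer(records, expected_count, field, expected_shares, tolerance=0.10):
    """Return a list of human-readable problems found in one data transfer.

    expected_shares maps each agreed code for `field` to its usual
    proportion of records; `tolerance` is the allowed absolute deviation.
    """
    problems = []

    # Check 1: simple record count.
    if len(records) != expected_count:
        problems.append(f"record count {len(records)} != expected {expected_count}")

    # Check 2: distribution of a key variable.
    counts = Counter(r.get(field) for r in records)
    total = len(records)
    for value, share in expected_shares.items():
        observed = counts[value] / total if total else 0.0
        if abs(observed - share) > tolerance:
            problems.append(
                f"{field}={value!r}: observed share {observed:.2f}, expected about {share:.2f}"
            )

    # Check 3: codes that are not in the agreed code list.
    for value in counts:
        if value not in expected_shares:
            problems.append(f"{field}={value!r}: code not in the agreed code list")

    return problems


# Hypothetical batch in which the sender silently recoded "living" as "LIV".
usual_shares = {"living": 0.4, "deceased": 0.6}
batch = [{"donor_type": "LIV"}] * 4 + [{"donor_type": "deceased"}] * 6
for problem in check_transfer(batch, 10, "donor_type", usual_shares):
    print(problem)
```

Automated checks complement, rather than replace, the regular communication between data provider and registry that the example calls for.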

Distributed data networks are another model for sharing data. In a distributed data network, data sharing may be limited to the results of analyses or aggregated data only. There is much interest in the potential of distributed data networks, particularly for safety monitoring or public health surveillance (see Chapter 15, Section 11). However, the complexities of data sharing within a distributed data network are still being addressed, and it is premature to discuss good practice for this area.
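To make the idea of sharing aggregated data only more concrete, here is a deliberately simple sketch. Each site runs a summary function on its own records and returns counts; a coordinating center pools the counts without ever receiving patient-level rows. The function names and the `outcome` field are assumptions for illustration, and the sketch leaves aside the governance, small-cell, and harmonization issues that real networks must address.

```python
from collections import Counter


def site_summary(local_records, outcome_field="outcome"):
    """Runs inside a site: returns counts only, never patient-level rows."""
    return Counter(record[outcome_field] for record in local_records)


def pool(summaries):
    """Runs at the coordinating center: adds up the site-level counts."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total


# Hypothetical data held separately at two sites.
site_a = [{"outcome": "event"}, {"outcome": "no_event"}, {"outcome": "no_event"}]
site_b = [{"outcome": "event"}, {"outcome": "event"}]

pooled = pool([site_summary(site_a), site_summary(site_b)])
```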

5. Summary

In summary, a registry is not a static enterprise. The management of registry data sources requires attention to detail, constant feedback to all participants, and a willingness to make adjustments to the operation as dictated by changing times and needs.

Case Example for Chapter 6

Case Example 12. Using claims data along with patient-reported data to identify patients

Description: The National Amyotrophic Lateral Sclerosis (ALS) Registry is a rare disease registry created by the Agency for Toxic Substances and Disease Registry (ATSDR) within the U.S. Department of Health and Human Services (HHS). The purpose of the registry is to quantify the incidence and prevalence of ALS in the United States, describe the demographics of people with ALS, and examine potential risk factors for the disease.
Sponsor: U.S. Department of Health and Human Services and Agency for Toxic Substances and Disease Registry, through funding from the “ALS Registry Act” (U.S. Congress Public Law 110-373).
Year Started: 2010
Year Ended: Ongoing
No. of Sites: All 50 States, including U.S. territories; data from national administrative databases are combined with patient self-enrollment data.
No. of Patients: The first registry report is anticipated for release in spring 2014.


Challenge

Amyotrophic lateral sclerosis (ALS) is a progressive, fatal neurodegenerative disorder of both the upper and lower motor neurons. Many knowledge gaps exist in the understanding of ALS, including uncertainty about the disease's incidence and prevalence, misdiagnosis of ALS in patients with other motor neuron disorders, and the role of environmental exposures in the etiology of ALS. Because ALS is a non-reportable disease in the United States (except for the Commonwealth of Massachusetts), previous attempts to estimate ALS incidence and prevalence using nonspecific mortality data have faced many challenges and at best overestimated disease frequency. Identifying patients through site recruitment for research purposes poses additional challenges, as access to patient medical records can be limited, costly, and time-consuming to obtain. Patient recruitment issues are compounded by the complexities of this rare disease, in which the average timeframe from diagnosis to death is 2–5 years. U.S. governmental agencies acknowledged that a national, structured data collection program for ALS was greatly needed, and that alternative data sources and recruitment strategies would need to be identified.

Proposed Solution

In 2008, President Bush signed the ALS Registry Act into law, allowing ATSDR to create the National ALS Registry. The registry is the only Congressionally mandated population-based ALS registry in the United States. As a first step in developing the registry, a workshop of international experts in neurological and autoimmune conditions was convened to discuss approaches to creating a national database. Based on feedback from these experts, the registry uses a two-pronged approach to identify all U.S. cases of ALS. The first approach uses national administrative databases, including those of Medicare, Medicaid, the Veterans Health Administration, and the Veterans Benefits Administration, to identify prevalent cases based on an algorithm developed through pilot projects. These administrative databases cover approximately 90 million Americans, and the algorithm identifies 80 to 85 percent of all true ALS cases when applied to these databases. The second approach uses a secure Web portal to allow patients to self-enroll voluntarily. Data from the two approaches are combined into the registry database, and duplicate patients are identified and removed so that each person with ALS is counted only once in the registry.
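The combine-and-deduplicate step can be illustrated with a short sketch. The case example does not describe how the registry actually identifies duplicates, so the match key used here (normalized last name, date of birth, and last four digits of the SSN) and all field and function names are assumptions chosen only to show the general pattern of counting each person once while recording which sources found them.

```python
from datetime import date


def match_key(case):
    """Assumed duplicate-detection key; a real registry may use a different rule."""
    return (
        case["last_name"].strip().casefold(),
        case["birth_date"],
        case["ssn_last4"],
    )


def combine_sources(admin_cases, portal_cases):
    """Merge cases from both approaches so each person appears once."""
    merged = {}
    sources = (("administrative", admin_cases), ("self_enrolled", portal_cases))
    for source, cases in sources:
        for case in cases:
            entry = merged.setdefault(match_key(case), {"sources": set()})
            entry["sources"].add(source)
    return merged


# Hypothetical cases: the first person appears in both sources.
admin = [
    {"last_name": "Rivera", "birth_date": date(1948, 5, 2), "ssn_last4": "1234"},
    {"last_name": "Chen", "birth_date": date(1955, 11, 30), "ssn_last4": "5678"},
]
portal = [
    {"last_name": " rivera", "birth_date": date(1948, 5, 2), "ssn_last4": "1234"},
    {"last_name": "Okafor", "birth_date": date(1962, 1, 15), "ssn_last4": "9012"},
]

registry = combine_sources(admin, portal)
```

Keeping the set of contributing sources for each person also shows how much each approach adds beyond the other.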


Results

The registry data will support several research projects. The Web portal for self-enrolled participants contains brief surveys that collect information on potential risk factors, such as socio-demographic characteristics, occupational history, military history, cigarette smoking, alcohol consumption, physical activity, family history of neurodegenerative diseases, and disease progression. ATSDR is also currently implementing active surveillance projects that will allow population-based case estimates of ALS in certain smaller geographic areas (i.e., at the State and metropolitan levels) to help ATSDR evaluate the completeness of the registry. In addition, ATSDR has developed a system to inform people with ALS about new research (e.g., clinical trials, epidemiological studies) for which they may be eligible. Lastly, ATSDR is funding a feasibility study for the creation of a national biospecimen repository that would be open to all U.S. residents with ALS who are enrolled in the registry. This proposed biorepository will help researchers better understand the disease because it will pair biospecimens (e.g., blood, brain tissue) with existing risk-factor data from patients.

Key Point

Combining multiple data sources, such as administrative databases and patient-reported information, is a novel approach and can be an effective way to successfully identify patients with a rare disease and to better understand the prevalence, incidence, and etiology of the disease. However, using alternative approaches requires a strong understanding of the nuances of the individual data sources; pilot testing is also helpful to identify potential issues with data sources prior to registry launch.

References for Chapter 6

1. Federal Coordinating Council for Comparative Effectiveness Research. Report to the President and the Congress. U.S. Department of Health and Human Services; Jun 30, 2009. [August 14, 2012]. http://www.med.upenn.edu/sleepctr/documents/FederalCoordinatingCouncilforCER_2009.pdf.
2. GE Healthcare. Medical Quality Improvement Consortium. [August 15, 2012]. http://www3.gehealthcare.com/en/Products/Categories/Healthcare_IT/Clinical_Knowledge_Solutions/MQIC.
3. Practice Fusion. Practice Fusion Releases EMR Data set, Launches Health Data Challenge with Kaggle. [August 15, 2012]. http://www.practicefusion.com/pages/pr/health-data-initiative-forum-challenge-2012.html.
4. Centers for Medicare and Medicaid Services. Medicare Coverage – General Information. [August 15, 2012]. http://www/Coverage/CoverageGenInfo/index.html.
5. DeNavas-Walt C, Proctor BD, Smith JC. Current Population Reports, P60-235, Income, Poverty, and Health Insurance Coverage in the United States: 2007. Washington, DC: U.S. Government Printing Office; 2008. [July 16, 2013]. http://www/prod/2008pubs/p60-235.pdf.
6. Research Data Assistance Center. [August 15, 2012]. http://www
7. National Center for Health Statistics. National Death Index. [July 16, 2013]. http://www
8. Social Security Administration. Death Master File. National Technical Information Service; [August 15, 2012]. http://www.aspx.
9. National Technical Information Service. Important Notice: Change in Public Death Master File Records. [August 14, 2012]. http://www/import-change-dmf.pdf.
10. Doody MM, Hayes HM, Bilgrad R. Comparability of national death index plus and standard procedures for determining causes of death in epidemiologic studies. Ann Epidemiol. 2001 Jan;11(1):46–50. [PubMed: 11164119]
11. Sathiakumar N, Delzell E, Abdalla O. Using the National Death Index to obtain underlying cause of death codes. J Occup Environ Med. 1998 Sep;40(9):808–13. [PubMed: 9777565]
12. U.S. Bureau of the Census. [August 15, 2012]. http://www
13. Health Resources and Services Administration. Area Health Resources Files (AHRF). [August 15, 2012]. http://arf
14. American Hospital Association. AHA Data and Directories. [August 15, 2012]. http://www/rc/stat-studies/data-and-directories.shtml.
15. National Marrow Donor Program. [August 15, 2012]. http://www
16. United States Renal Database. [August 15, 2012]. http://www



In a relational database, information is presented in tables with rows and columns. Data within a table may be related by a common concept, and the related data may be retrieved from the database. See: A Relational Database Overview. http://docs​​/javase/tutorial/jdbc​/overview/database.html. Accessed July 16, 2013.

