Source: Surveillance, Epidemiology, and End Results (SEER) Program (www.seer.cancer.gov).2
The Agency for Healthcare Research and Quality (AHRQ), through its Evidence-Based Practice Centers (EPCs), sponsors the development of evidence reports and technology assessments to assist public- and private-sector organizations in their efforts to improve the quality of health care in the United States. This report was requested and funded by the Division of Cancer Prevention and Control, National Center for Chronic Disease Prevention and Health Promotion at the Centers for Disease Control and Prevention (CDC). The reports and assessments provide organizations with comprehensive, science-based information on common, costly medical conditions and new health care technologies. The EPCs systematically review the relevant scientific literature on topics assigned to them by AHRQ and conduct additional analyses when appropriate prior to developing their reports and assessments.
To bring the broadest range of experts into the development of evidence reports and health technology assessments, AHRQ encourages the EPCs to form partnerships and enter into collaborations with other medical and research organizations. The EPCs work with these partner organizations to ensure that the evidence reports and technology assessments they produce will become building blocks for health care quality improvement projects throughout the Nation. The reports undergo peer review prior to their release.
AHRQ expects that the EPC evidence reports and technology assessments will inform individual health plans, providers, and purchasers as well as the health care system as a whole by providing important information to help improve health care quality.
We welcome comments on this evidence report. They may be sent by mail to the Task Order Officer named below at: Agency for Healthcare Research and Quality, 540 Gaither Road, Rockville, MD 20850, or by e-mail to epc@ahrq.gov.
Carolyn M. Clancy, M.D.
Director
Agency for Healthcare Research and Quality
Eddie Reed, M.D.
Director, Division of Cancer Prevention and Control
National Center for Chronic Disease Prevention and Health Promotion
Centers for Disease Control and Prevention
Jean Slutsky, P.A., M.S.P.H.
Director, Center for Outcomes and Evidence
Agency for Healthcare Research and Quality
Beth A. Collins Sharp, R.N., Ph.D.
Acting Director, EPC Program
Agency for Healthcare Research and Quality
Susan Meikle, M.D., M.S.P.H.
EPC Program Task Order Officer
Agency for Healthcare Research and Quality
The authors gratefully acknowledge Jane Kolimaga and Alison Lee for assistance in study initiation; Margaret Jamison, PhD, for assistance with analysis of the Nationwide Inpatient Sample data set; and Karen Hoffman, MD, for initial work on developing the ovarian cancer natural history model. They also thank Mona Saraiya, MD, MPH, and Christie Eheman, PhD, from the Centers for Disease Control and Prevention for their valuable input.
Objectives: To assess diagnostic strategies for distinguishing benign from malignant adnexal masses.
Data Sources: MEDLINE® and reference lists of recent reviews; discharge data from the Nationwide Inpatient Sample.
Review Methods: The major diagnostic methods evaluated were bimanual pelvic examination, ultrasound (morphology and Doppler velocimetry), MRI, CT, FDG-PET, CA-125, and scoring systems that incorporated multiple clinical, laboratory, and radiologic findings. Meta-analysis using a random-effects model was used to estimate pooled sensitivity and specificity for discriminating benign from malignant. We reviewed evidence for followup strategies for masses considered benign, and for adverse outcomes of diagnostic surgery. We also reviewed published models of the natural history of ovarian cancer and compared the impact of assumptions about natural history on outcomes.
Results: The majority of studies did not describe whether patients presented with asymptomatic masses detected through screening or with symptoms. Prevalence of malignant masses in a U.S. postmenopausal screening population was approximately 0.1 percent, while benign masses were found in 0.8 to 1.8 percent of women. Pooled (a) sensitivity and (b) specificity were: bimanual exam (a) 0.45, (b) 0.90; ultrasound morphology scores (a) 0.86 to 0.91, (b) 0.68 to 0.83; Doppler resistive index (a) 0.72, (b) 0.90; pulsatility index (a) 0.80, (b) 0.73; maximum systolic velocity (a) 0.74, (b) 0.81; presence of vessels (a) 0.88, (b) 0.78; combined morphology and Doppler (a) 0.86, (b) 0.91; MRI (a) 0.91, (b) 0.88; CT (a) 0.90, (b) 0.75; FDG-PET (a) 0.67, (b) 0.79; and CA-125 (a) 0.78, (b) 0.78. Both sensitivity and specificity of CA-125 were better in postmenopausal than in premenopausal women. In modeled outcomes, combinations of imaging and CA-125 were both more sensitive and more specific than either alone. Performance of scoring systems in validation studies was consistently worse than in development studies; the highest demonstrated specificity observed was 0.91, with a concurrent sensitivity of 0.74. Evidence on followup strategies was sparse, although one large study provided good evidence for safely following unilocular cysts less than 10 cm in diameter. Overall complication rates in studies of surgically managed adnexal masses were low, but important clinical information was not reported.
Conclusions: All diagnostic modalities showed trade-offs between sensitivity and specificity, but the available literature does not provide sufficient detail on relevant characteristics of study populations to allow confident estimation of the results of alternative diagnostic strategies. Although modeling studies may prove useful in evaluating diagnostic algorithms, further work is needed to explore the implications of uncertainty about the natural history of ovarian cancer.
Ovarian cancer is the leading cause of cancer death from gynecologic malignancies in the United States, with an annual incidence of over 25,000 and an annual mortality of approximately 14,000. Cancer incidence increases dramatically with age.
The high case-fatality rate has largely been attributed to the fact that most ovarian cancers are diagnosed in advanced stages (Stage III, where the cancer has spread beyond the pelvis to organs of the upper abdominal cavity, and Stage IV, where the cancer has spread outside of the peritoneal cavity), when survival is poor. Stage I cancer (limited to the ovaries) has a survival rate of over 90 percent. Thus, there has long been an emphasis on early detection of ovarian cancer in the belief that detection in early stages will lead to decreases in morbidity and mortality. The detection of a mass in the area of the ovaries and fallopian tubes (the uterine adnexae) raises the possibility of ovarian cancer, which necessitates further study to rule out malignancy.
There are two main clinical routes by which an adnexal mass may be detected: (1) women with symptoms may have an adnexal mass detected as part of their evaluation for those symptoms, either by physical exam or radiographic imaging; (2) the mass may be detected during bimanual pelvic examination or radiologic imaging as part of a routine health maintenance examination.
For the purposes of this evidence report, we define an adnexal mass as an enlarged structure in the uterine adnexa that can either be palpated on a bimanual pelvic examination or visualized using radiographic imaging.
There are a number of conditions that can be associated with an adnexal mass. These include malignancies arising from the ovary and fallopian tube, or metastatic disease from another site (such as the breast or gastrointestinal tract), as well as a wide range of benign conditions. For the purposes of this evidence report, “management” of the adnexal mass refers to the process by which a mass is ultimately classified as benign or malignant.
The clinical significance of discriminating benign from malignant masses differs depending on the clinical setting in which the mass is initially detected. For women with symptoms, in whom surgical management may be appropriate whether or not the mass is malignant, the main reason to discriminate between benign and malignant lesions is to facilitate referral and management by clinicians who have specialized training and experience in managing ovarian malignancy, with improved outcomes. For asymptomatic women, discriminating benign from malignant disease is important not only to ensure appropriate management in the setting of malignancy, but also to avoid unnecessary diagnostic procedures, including surgery, in women with asymptomatic, nonmalignant conditions.
The prevalence of malignancy may differ between women with symptomatic and asymptomatic masses, which may in turn affect the positive and negative predictive value of a test, and, potentially, sensitivity and specificity as well. Prevalence also varies with age and with family history.
This report focuses on the evidence relevant to establishing the most appropriate way to distinguish benign from malignant adnexal masses in both symptomatic and asymptomatic women. A key consideration throughout the report will be the underlying likelihood of malignancy in the populations studied, and the impact of this prevalence on the interpretation of the results of the reviewed studies. The results of this report are intended primarily to (a) provide a resource for clinicians and policymakers developing guidelines on management of adnexal masses, and (b) provide a resource for researchers and funding agencies in identifying gaps in our knowledge and research priorities.
Working with the Agency for Healthcare Research and Quality (AHRQ), the Centers for Disease Control and Prevention (CDC), and members of the technical expert panel, we developed seven questions to be addressed, using an analytic framework which incorporated prior probability of disease, test results, and outcomes of diagnostic surgery.
We searched MEDLINE® (1966-September 2004) and the Cochrane Database of Systematic Reviews. Searches of these databases were supplemented by reviews of reference lists contained in all included articles and in relevant review articles and meta-analyses. The searches yielded a total of 1,023 citations. Pairs of readers reviewed each abstract and selected 445 articles for full text review. Specific inclusion criteria were developed for each question, and both readers were required to agree on inclusion.
We developed abstraction tables and quality criteria for each question. For studies of diagnostic tests, 2-by-2 tables were constructed for each included article, and sensitivity, specificity, and positive and negative predictive values, with 95 percent confidence intervals (CIs) for each, were calculated. Where they were not provided, we also calculated 95 percent CIs for articles about prevalence and adverse event rates during diagnostic surgery. For diagnostic tests, pooled estimates of sensitivity and specificity were calculated using a random-effects model.
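The 2-by-2 calculation described above can be sketched as follows. This is a minimal illustration, not the report's actual procedure: the report does not state which interval method was used, so the Wilson score interval here is an assumption, and the cell counts are hypothetical rather than drawn from any reviewed study.

```python
import math


def wilson_ci(successes, total, z=1.96):
    """Wilson score interval for a proportion (about 95 percent when z = 1.96).

    Assumption: the report does not name its CI method; Wilson is used
    here only as a common, well-behaved choice.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def two_by_two(tp, fp, fn, tn):
    """Point estimates and intervals from the four cells of a 2-by-2 table."""
    measures = {
        "sensitivity": (tp, tp + fn),  # positives among women with malignancy
        "specificity": (tn, tn + fp),  # negatives among women without malignancy
        "ppv": (tp, tp + fp),          # malignancies among positive results
        "npv": (tn, tn + fn),          # non-malignancies among negative results
    }
    return {name: (k / n, wilson_ci(k, n)) for name, (k, n) in measures.items()}


# Hypothetical counts, for illustration only.
results = two_by_two(tp=45, fp=10, fn=5, tn=90)
for name, (estimate, (low, high)) in results.items():
    print(f"{name}: {estimate:.3f} ({low:.3f} to {high:.3f})")
```

Note that the predictive values computed this way reflect the prevalence of malignancy in the study sample, which is why the report treats prevalence as a key consideration when interpreting individual studies.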
We performed three supplemental analyses. First, we used the Nationwide Inpatient Sample (NIS), a nationally representative database containing discharge data from approximately 20 percent of U.S. hospitals. Using International Classification of Diseases, Ninth Revision (ICD-9) codes and the provided corrections for sample weighting, we estimated the number of cases of women 15 and older undergoing diagnostic laparoscopy and exploratory laparotomy in 2000 and 2001 for diagnoses consistent with an adnexal mass. Mortality and morbidity rates for each type of procedure within each diagnosis were also estimated.
Second, we constructed a simple decision model based on serial or parallel testing, using the pooled sensitivity and specificity of various tests to predict outcomes.
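As a rough illustration of how serial and parallel strategies combine two tests, the sketch below applies the standard textbook rules under the assumption that the tests are conditionally independent given disease status. It is not the report's decision model, which may define the strategies differently, and its output should not be expected to match the modeled outcomes the report describes. The test characteristics are the pooled estimates quoted in the abstract; the cohort size and prevalence are hypothetical.

```python
def serial_testing(sens_a, spec_a, sens_b, spec_b):
    """Call the result positive only if test A and then test B are both positive.

    Assumes conditional independence of the two tests given disease status.
    """
    sensitivity = sens_a * sens_b
    specificity = 1 - (1 - spec_a) * (1 - spec_b)
    return sensitivity, specificity


def parallel_testing(sens_a, spec_a, sens_b, spec_b):
    """Call the result positive if either test is positive (same assumption)."""
    sensitivity = 1 - (1 - sens_a) * (1 - sens_b)
    specificity = spec_a * spec_b
    return sensitivity, specificity


def expected_outcomes(sensitivity, specificity, prevalence, cohort=10_000):
    """Expected missed cancers and false positives in a cohort of women with masses."""
    malignant = cohort * prevalence
    benign = cohort - malignant
    return {
        "missed_cancers": malignant * (1 - sensitivity),
        "false_positives": benign * (1 - specificity),
    }


# Pooled estimates from the abstract: combined morphology and Doppler
# ultrasound (0.86, 0.91) and CA-125 (0.78, 0.78).
imaging = (0.86, 0.91)
ca125 = (0.78, 0.78)
HYPOTHETICAL_PREVALENCE = 0.10  # illustrative only, not from the report

serial = serial_testing(*imaging, *ca125)
parallel = parallel_testing(*imaging, *ca125)
for label, (sens, spec) in (("serial", serial), ("parallel", parallel)):
    outcomes = expected_outcomes(sens, spec, HYPOTHETICAL_PREVALENCE)
    print(label, round(sens, 4), round(spec, 4), outcomes)
```

Under independence, requiring both tests to be positive trades sensitivity for specificity, while accepting either positive does the reverse; correlated test errors would weaken both effects.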
Finally, we used a previously developed Markov model of the natural history of ovarian cancer to explore the implications of alternative possible pathways in the development of advanced disease - specifically, that some cancers limited to the ovaries (Stage I) may spread to the upper abdomen (Stage III) without first spreading to other pelvic organs (Stage II).
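The structural question raised here can be illustrated with a toy cohort model. This is only a sketch: it is not the previously developed Markov model used in the report, every transition probability is hypothetical, and it ignores detection, treatment, death, and Stage IV. It simply compares a strictly sequential pathway (Stage I to II to III) with one that also allows direct progression from Stage I to Stage III, holding the total rate of leaving Stage I constant.

```python
def run_cohort(p_1_to_2, p_1_to_3, p_2_to_3, cycles):
    """Discrete-time cohort model over undetected Stages I, II, and III.

    All transition probabilities are per cycle and purely hypothetical.
    Returns the fraction of the cohort in each stage after `cycles` steps.
    """
    if p_1_to_2 + p_1_to_3 > 1:
        raise ValueError("probabilities of leaving Stage I exceed 1")
    stage_1, stage_2, stage_3 = 1.0, 0.0, 0.0
    for _ in range(cycles):
        stage_1, stage_2, stage_3 = (
            stage_1 * (1 - p_1_to_2 - p_1_to_3),
            stage_2 * (1 - p_2_to_3) + stage_1 * p_1_to_2,
            stage_3 + stage_1 * p_1_to_3 + stage_2 * p_2_to_3,
        )
    return stage_1, stage_2, stage_3


# Same overall probability of leaving Stage I (0.10 per cycle) in both scenarios.
sequential = run_cohort(p_1_to_2=0.10, p_1_to_3=0.00, p_2_to_3=0.10, cycles=12)
direct = run_cohort(p_1_to_2=0.05, p_1_to_3=0.05, p_2_to_3=0.10, cycles=12)

print("sequential:", [round(x, 3) for x in sequential])
print("direct:    ", [round(x, 3) for x in direct])
```

With these illustrative inputs, the same proportion of the cohort remains in Stage I under both assumptions, but more of it has reached Stage III when direct spread is allowed, which is the kind of difference that could change the apparent window for early detection.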
Question 1: What is the prevalence of various tumor types among women with an adnexal mass, stratified by cancer status (malignant vs. benign), age, menopausal status, and size of tumor?
Question 2: What are the sensitivity, specificity, and reliability of the bimanual pelvic examination?
Question 3: Among women with a palpable adnexal mass on exam or a mass identified by ultrasound/imaging, what is the sensitivity/specificity of various evaluation modalities including ultrasound (transvaginal ultrasound, transabdominal ultrasound, color Doppler, two-dimensional [2D] vs. three-dimensional [3D] ultrasound), computer tomography (CT) scan, magnetic resonance imaging (MRI) scan, and CA-125 levels for distinguishing benign from malignant masses?
Question 4: What is the accuracy of explicit scoring systems which incorporate various combinations of imaging findings, patient risk factors, and/or CA-125 levels for detecting malignancy? Have these scoring systems been applied to a population of women before laparoscopy or laparotomy?
Question 5: Among women with suspected benign masses on initial investigation, what are the sensitivity and specificity of monitoring with periodic CA-125 and/or interval ultrasound examinations for detecting malignant masses? How does the interval of testing/definition of change affect sensitivity and predictive value?
Question 6: Among women with adnexal masses, what are the morbidity and mortality from diagnostic surgery (laparoscopy or laparotomy)? At what point does the risk of surgery outweigh the risk of detecting malignancy?
Question 7: What are the estimated trade-offs resulting from various strategies for evaluation of the adnexal mass?
The main limitation in the literature was the failure to adequately describe relevant patient characteristics, including the presence or absence of symptoms, and variable reporting of menopausal status. Inadequate sample size, lack of blinding, and failure to account for observer variability were also common limitations.
The report did not include non-English publications. We did not include non-U.S. studies in our review of the prevalence of different types of adnexal mass. Given the heterogeneity of studies, pooled estimation of sensitivity and specificity may not be appropriate. The NIS does not include outpatient procedures, and our coding algorithm may have missed some complications.
Research priorities include: a minimal consensus data set on key patient characteristics (with results presented stratified by those characteristics); better estimates of prevalence and surgical outcomes using data sources that capture inpatient and outpatient encounters, such as Medicare or health maintenance organizations; better characterization of patient characteristics in all studies; better evidence on the value of the pelvic exam as part of routine health maintenance; and development of additional models for simulating the natural history of ovarian cancer and evaluating screening, diagnosis, and treatment strategies.
Developing an effective and efficient algorithm for the evaluation of any condition requires good evidence on the prevalence of the condition at the first diagnostic encounter, and the sensitivity and specificity of the potential diagnostic tests to be used. Unfortunately, the overwhelming majority of the literature we reviewed did not provide sufficient detail on important patient characteristics to allow estimation of the outcomes of different diagnostic strategies, either in the context of detecting adnexal masses or distinguishing benign from malignant masses.
All of the diagnostic tests and scoring systems we evaluated exhibited a trade-off between sensitivity and specificity - studies of a given test that reported higher sensitivity had lower specificity, and vice versa. The bimanual pelvic examination has low sensitivity for both detection of adnexal masses and discriminating benign from malignant masses, raising doubts about its utility as a screening test in asymptomatic women. In pooled analysis, the combination of ultrasound morphology and Doppler blood flow had the best combination of sensitivity and specificity, with MRI comparable. In a preliminary model, serial testing with imaging followed by CA-125 was both more sensitive and more specific than either test alone; parallel testing using both tests incorporated into the Risk of Malignancy Index resulted in fewer missed cancers (greater sensitivity) but more surgeries (lower specificity), with twice as many tests.
Studies of surgical management suffered from the same limitations in terms of description of patient characteristics, making estimation of the risks of false positive diagnostic testing impossible.
Ultimately, evaluation of potential strategies for reducing morbidity and mortality from ovarian cancer may require use of simulation models, a technique that has proven helpful in evaluating prevention strategies for other cancers. Because the natural history of ovarian cancer is relatively unknown, testing of alternative models is critical. Although a few sophisticated models exist, development of additional models would be helpful, especially in the context of evaluating results from ongoing trials of screening. If any of these trials show a benefit from screening, then the need for better evidence on the diagnostic evaluation of adnexal masses will become even more critical.
| | White | African-American | Asian/Pacific Islander | Native American | Hispanic |
|---|---|---|---|---|---|
| Incidence | 15.1 | 10.3 | 10.4 | 8.9 | 11.9 |
| Mortality | 9.3 | 7.6 | 4.8 | 5.1 | 6.2 |
Source: Surveillance, Epidemiology, and End Results (SEER) Program (www.seer.cancer.gov).2
Malignant tumors of the ovary can either arise in the ovary (primary ovarian cancer) or be the result of metastasis from another site, such as the breast or colon. Primary ovarian tumors, whether benign or malignant, can arise from three broad types of cells: the cells on the surface (epithelial cells); the cells that form eggs (germ cells); and the cells surrounding the eggs, including the cells that produce ovarian hormones (sex cord-stromal cells). Epithelial tumors are the most common type, accounting for 60 percent of all ovarian tumors and up to 90 percent of primary cancers. Sex cord-stromal tumors account for 10 to 15 percent of all tumors, while germ cell tumors account for 25 percent of tumors. In general, sex cord-stromal tumors and germ cell tumors are relatively more common in younger premenopausal women. Thus, although ovarian cancer is relatively rare in younger women, when it does occur it is more likely to be a non-epithelial cancer than cancers in postmenopausal women.3
Within the broad classification of epithelial, sex cord-stromal, and germ cell tumors, tumors are further classified by the individual cell types from which they are derived. For example, the most common epithelial tumors are serous and mucinous tumors, the most common sex cord-stromal tumors are fibromas (arising from the connective tissue surrounding eggs), and the most common germ cell tumors are teratomas. Within each histological class, tumors can be benign or malignant, based on their ability to metastasize.3
Some epithelial tumors are classified as “borderline” or “low malignant potential” (LMP) tumors. These are tumors in which there is no invasion into the ovarian stroma, but for which histologic evidence of proliferation exists (increased cell division, changes in the appearance of the cell nucleus). There is controversy over whether these tumors represent pre-invasive cancer, and, if untreated, would go on to become a cancer, or whether they represent a subtype of tumor that has a relatively small chance of becoming a cancer.3 In estimating the diagnostic accuracy of tests for determining whether a mass is benign or malignant, whether LMP tumors are classified as benign or malignant can have an effect on the estimates of test performance, as we will discuss later in the report.
Ovarian cancer spreads primarily by dissemination throughout the peritoneal cavity; common sites of metastasis are the small and large bowel, the omentum, the liver, and the diaphragm. Spread to retroperitoneal lymph nodes is also common.
Treatment for ovarian cancer consists of surgical removal of the ovaries, fallopian tubes, and uterus (if present), along with as much metastatic disease as possible; if there is no obvious spread beyond the ovaries, the lymph nodes are sampled to determine if there has been lymphatic metastasis. Surgery is followed by chemotherapy, with responsiveness to chemotherapy depending on the amount of tumor left after surgical removal and the cell type of tumor, among other factors.3
The high case-fatality rate observed in ovarian cancer has largely been attributed to the fact that most ovarian cancers are diagnosed in advanced stages (Stage III, where the cancer has spread beyond the pelvis to organs of the upper abdominal cavity, and Stage IV, where the cancer has spread outside of the peritoneal cavity), when survival is poor. Stage I cancer (limited to the ovaries) has a survival rate of over 90 percent. Thus, there has long been an emphasis on early detection of ovarian cancer, in the belief that detection in early stages will lead to decreases in morbidity and mortality, just as cervical cancer screening has resulted in substantial reductions in morbidity and mortality from cervical cancer. The detection of a mass in the area of the ovaries and fallopian tubes (the uterine adnexae) raises the possibility of ovarian cancer, which necessitates further study to rule out malignancy.
This evidence report was prepared by the Duke Evidence-based Practice Center, in partnership with the Centers for Disease Control and Prevention (CDC) and the Agency for Healthcare Research and Quality (AHRQ). The purpose of the report is to provide followup data regarding key issues identified at two conferences sponsored by CDC, one in November 2000 on broad issues in preventing morbidity and mortality from ovarian cancer,4 and one in May 2002 on the use of ultrasound in the diagnosis of ovarian cancer.5
For the purposes of this report, we define an adnexal mass as an enlarged structure in the uterine adnexa which can either be palpated on a bimanual pelvic examination or visualized using radiographic imaging. The normal ovary is approximately 3 cm in length, decreasing in size after menopause.6 In terms of physical examination, the precise size definition used in the literature is quite variable and, in practice, may also vary depending on the ease with which the examination is performed, the patient's body habitus, the examiner's experience, the time taken during the exam, and the presence of other abnormalities such as uterine fibroids. Historically, because of the decrease in size after menopause, any palpable mass in a postmenopausal woman has been considered abnormal (the “palpable postmenopausal ovary syndrome”).7 As discussed below, some masses may ultimately prove to not be ovarian in origin.
The definition of an abnormal structure on radiologic imaging is also quite variable. Small fluid-filled cysts are quite common in both pre- and postmenopausal women. For the purposes of this report, we consider any structure observed during radiologic imaging that prompts additional evaluation (such as measurement of serologic markers or further imaging) as a mass.
There are three main clinical routes by which an adnexal mass may be detected. First, women with symptoms may have an adnexal mass detected as part of their evaluation for those symptoms, either by physical exam or radiographic imaging. Because ovarian cancer often presents with vague abdominal symptoms, we consider any woman whose mass was found during an evaluation for symptoms to be symptomatic. Second, the mass may be detected as part of a routine health maintenance examination. Finally, it is possible that an asymptomatic mass could be detected during imaging done for another indication. In premenopausal women, the most likely scenario where this would occur would be during ultrasound evaluation during pregnancy. Another common scenario in peri- or postmenopausal women would be evaluation for uterine bleeding; because uterine bleeding is not a common symptom of ovarian cancer, a finding of an adnexal mass during evaluation for bleeding could be considered an incidental finding. Because malignancy is rare during pregnancy, and because the technical considerations for both diagnosis and management are different, the most appropriate management of masses detected during pregnancy, especially if detected serendipitously by ultrasound, is outside of the scope of this report.
We did not identify any literature that would allow an estimate of the proportions of women with adnexal masses presenting by each route; as we will discuss, this is a major deficiency of the literature. The proportions are likely to vary by setting, referral patterns, patient thresholds for seeking care, physician thresholds for diagnostic tests, and other factors. For example, one gynecologic oncologist estimated that well over half of the referrals for evaluation in a large health maintenance organization were for incidentally detected masses (W. Kinney, personal communication).
Conditions that can present as an adnexal mass include:
Benign primary ovarian tumors - epithelial, sex cord-stromal, and germ cell;
Borderline and malignant ovarian tumors - epithelial, sex cord-stromal, and germ cell;
Metastatic malignant tumors - most commonly breast and gastrointestinal tract;
Masses arising from the fallopian tube - most commonly benign, including hydrosalpinx (a large, fluid-filled fallopian tube) and pyosalpinx (an infected, pus-filled fallopian tube); primary fallopian tube malignancies can occur, but are relatively rare;
Masses arising from the uterus - most commonly benign leiomyomas (fibroids);
Masses arising from the gastrointestinal tract - diverticula of the colon, large colonic tumors, tumors of the appendix;
Masses arising from the urinary tract - pelvic kidneys, diverticula of the ureter;
Masses arising from remnants of embryological development;
Endometriosis;
Pelvic inflammatory disease;
Cysts arising from normal ovarian functions, such as development of eggs (follicular cysts) and ovulation (corpus luteum cysts).
With such a wide range of potential causes, and with a wide range of appropriate therapeutic options, precise diagnosis of a mass, especially in symptomatic women, is important. Once diagnosed, a mass may be managed in a variety of ways, ranging from observation to surgical removal and chemotherapy. However, a review of the test characteristics of various methods for obtaining precise diagnoses of specific conditions, and of the range of medical and surgical treatment options for each condition, is beyond the scope of this report. For our purposes, “management” of the adnexal mass refers to the process by which a mass is ultimately classified as benign or malignant.
The clinical significance of discriminating benign from malignant masses differs depending on the clinical setting in which the mass is initially detected.
In women who initially present with symptoms, diagnosis of the underlying cause of the mass is important since it may help define available treatment options. Although medical therapy may relieve symptoms in some cases, surgical management is the treatment of choice for many conditions. Because surgery may ultimately be the most appropriate management for symptomatic adnexal masses, the main reason to discriminate between benign and malignant lesions is to facilitate referral and management by clinicians with specialized training and experience in managing ovarian malignancy, with improved outcomes.8–10
The other main group of women with adnexal masses consists of those without symptoms who have a mass detected through either physical examination or imaging. No organization currently recommends routine screening with serum markers or imaging for ovarian cancer.11, 12 The U.S. Preventive Services Task Force gives screening (including serum markers, imaging, or pelvic examination) a “D” recommendation (fair evidence against screening).13 However, because an annual pelvic examination continues to be recommended by professional organizations such as the American College of Obstetricians and Gynecologists (ACOG),11, 14 many asymptomatic women may have an adnexal mass detected during a periodic health maintenance examination. In this setting, discriminating benign from malignant disease is important not only to ensure appropriate management in the setting of malignancy, but also to avoid unnecessary diagnostic procedures, including surgery, and anxiety in women with asymptomatic, nonmalignant conditions. In some cases, there may be a rationale for removing certain asymptomatic benign lesions, including prevention of malignant transformation; prevention of ovarian torsion (a condition where the ovary twists and occludes its blood supply, causing abdominal pain and possibly resulting in loss of ovarian function); prevention of rupture, which might lead to acute symptoms or a worse prognosis (for example, in the case of endometriosis); prevention of more advanced or complicated surgery for a larger mass or more extensive pathologic process after the development of symptoms; and, for premenopausal women, possible enhancement of fertility. A review of the evidence (or lack of evidence) supporting these rationales is beyond the scope of this report.
As discussed above, the results of tests used to distinguish benign from malignant disease have different implications depending on whether the patient is symptomatic or asymptomatic. However, clinical presentation also has implications for interpretation of test results.
Diagnostic or screening tests are most commonly characterized by their sensitivity and specificity. The sensitivity of a test is the probability that, given the underlying presence of the disease, the test result will be positive; 100 percent minus the sensitivity is commonly called the false negative rate. The specificity of the test is the probability that, given the underlying absence of disease, the test result will be negative; 100 percent minus the specificity is commonly called the false positive rate. In an ideal evaluation, the sensitivity and specificity of the test are independent of the underlying probability, or prevalence, of disease.
Clinically, the more common scenario is that the clinician is aware of the test result and needs to know the probability of the presence or absence of disease. In this setting, the positive and negative predictive values of the test are more important.
The negative predictive value of a test is the probability that, given a negative test result, the patient truly does not have disease. It is a function of three parameters: the pretest probability of the disease, the sensitivity of the test, and the specificity of the test:
$$\mathrm{NPV} = \frac{(1 - \text{Prevalence}) \times \text{Specificity}}{[(1 - \text{Prevalence}) \times \text{Specificity}] + [\text{Prevalence} \times (1 - \text{Sensitivity})]}$$
As can be seen in the equation, the negative predictive value is much more dependent on test sensitivity than test specificity. Negative predictive value will be high when test sensitivity is high, and when prevalence is low (i.e., disease is rare).
Similarly, the positive predictive value is the probability that, given a positive test result, the patient actually has the disease. It is also a function of prevalence, sensitivity, and specificity:
$$\mathrm{PPV} = \frac{\text{Prevalence} \times \text{Sensitivity}}{[\text{Prevalence} \times \text{Sensitivity}] + [(1 - \text{Prevalence}) \times (1 - \text{Specificity})]}$$
Positive predictive value is high when a test has high specificity, or when prevalence is high (disease is common).
For any given test, the positive predictive value will be higher and the negative predictive value lower when used in populations where the disease is common compared to populations where the disease is rare, while the positive predictive value will decrease and the negative predictive value increase as the disease becomes less common. This effect of prevalence on predictive values is independent of test sensitivity and specificity. The significance of the prevalence of disease in the population in which test characteristics are being evaluated is even more critical because, under some types of study design, disease prevalence can also affect estimates of sensitivity and specificity.15
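The dependence of predictive values on prevalence can be illustrated with a short calculation. The sketch below applies the two definitions given above; the function name and the 90 percent sensitivity and specificity are hypothetical values chosen only for illustration and are not taken from any study in this report.

```python
def predictive_values(prevalence, sensitivity, specificity):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    true_neg = (1 - prevalence) * specificity
    false_neg = prevalence * (1 - sensitivity)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Hypothetical test with 90 percent sensitivity and 90 percent specificity,
# evaluated in populations where disease ranges from rare to common.
for prevalence in (0.001, 0.01, 0.10, 0.50):
    ppv, npv = predictive_values(prevalence, 0.90, 0.90)
    print(f"prevalence={prevalence:.3f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

With sensitivity and specificity held fixed, the positive predictive value rises and the negative predictive value falls as prevalence increases, which is the pattern described in the paragraph above.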
Therefore, variations in the prevalence of malignancy among women with different clinical presentations will affect at least predictive values, and possibly sensitivity and specificity estimates. The prevalence of ovarian cancer clearly rises with age, so age and/or menopausal status are important considerations in evaluating management strategies in both the symptomatic and asymptomatic patient with an adnexal mass.
The prevalence of malignancy among asymptomatic women with an adnexal mass will be a function of the underlying prevalence or incidence of malignancy and the test characteristics of the initial test used to detect the mass. Evaluation of the different screening tests and strategies for early detection of ovarian cancer is beyond the scope of this report, especially since there are at least three large trials still ongoing.16–18 However, in order to properly interpret the results of tests performed in asymptomatic women with pelvic masses, some estimate of the underlying probability of malignancy among these women is needed. Since many of these women are likely identified through a bimanual pelvic examination, deriving this estimate requires an assessment of the sensitivity and specificity of the pelvic examination. Symptomatic patients may be more likely to have an underlying adnexal malignancy, especially among postmenopausal women.19 In any series of women with adnexal masses, the proportion of women who are symptomatic and asymptomatic will likely determine the prevalence, and thus the predictive values of the diagnostic tests used to evaluate the mass.
In summary, this report focuses on the evidence relevant to establishing the most appropriate way to distinguish benign from malignant adnexal masses in both symptomatic and asymptomatic women. A key consideration throughout the report will be the underlying likelihood of malignancy in the populations studied, and the impact of this prevalence on the interpretation of the results of the reviewed studies. The results of this report are intended primarily to (a) provide a resource for clinicians and policymakers developing guidelines on management of adnexal masses, and (b) provide a resource for researchers and funding agencies in identifying gaps in our knowledge and research priorities.
This section of the report describes the basic methodology used to develop the evidence report, including topic assessment and refinement, analytic framework, literature search strategies and results, literature screening and grading process and criteria, data abstraction and analysis methods, and quality control procedures.
The Centers for Disease Control and Prevention (CDC) and the Agency for Healthcare Research and Quality (AHRQ) originally identified five key questions to be addressed by the report, focused on management of adnexal masses in peri- and postmenopausal women. The Duke research team clarified and refined the overall research objectives and key questions by first consulting with the two study sponsors, AHRQ and CDC, at which time two questions were added, and then by convening a panel of national experts who would serve as advisors to the project. These experts were selected to represent relevant specialties including radiology, obstetrics-gynecology, and gynecologic oncology, as well as national professional societies, including the American College of Obstetricians and Gynecologists (ACOG), the Society of Gynecologic Oncologists (SGO) and the American College of Radiology (ACR). Members of the technical expert panel were:
Susan Ascher, MD; Department of Radiology, Georgetown University Hospital; Washington, DC (ACR)
Michael L. Berman, MD; Division of Gynecologic Oncology, UCI Medical Center; Orange, CA (SGO)
Barry B. Goldberg, MD; Department of Radiology, Thomas Jefferson University Hospital; Philadelphia, PA (ACR)
Edward E. Partridge, MD; Department of Obstetrics and Gynecology, University of Alabama, Birmingham; Birmingham, AL (American Cancer Society)
George F. Sawaya, MD; Department of Obstetrics and Gynecology, University of California, San Francisco; San Francisco, CA
Howard T. Sharp, MD; University of Utah Hospitals and Clinics; Salt Lake City, UT (ACOG)
Stanley Zinberg, MD, MS; ACOG; Washington, DC
As a result of an initial conference call with the technical experts, AHRQ, and CDC, the Duke research team modified the key research questions originally proposed in the Task Order in two fundamental ways: (1) The questions were expanded to include women of all ages, and (2) Question 6 would include laparotomy data, where available. After review of a draft version of the report by the technical experts and additional reviewers, the order of the questions was also changed to allow a more logical flow.
The key questions addressed by this report are:
Question 1: What is the prevalence of various tumor types among women with an adnexal mass, stratified by cancer status (malignant vs. benign), age, menopausal status, and size of tumor?
Question 2: What are the sensitivity, specificity, and reproducibility of the bimanual pelvic examination?
Question 3: Among women with a palpable adnexal mass on exam or a mass identified by ultrasound/imaging, what is the sensitivity/specificity of various evaluation modalities including ultrasound (transvaginal ultrasound, transabdominal ultrasound, color Doppler, two-dimensional [2D] vs. three-dimensional [3D] ultrasound), computed tomography (CT) scan, magnetic resonance imaging (MRI) scan, and cancer antigen 125 (CA-125) levels for diagnosing malignant masses?
Question 4: What is the accuracy of explicit scoring systems which incorporate various combinations of imaging findings, patient risk factors, and/or CA-125 levels for detecting malignancy? Have these scoring systems been applied to a population of women before laparoscopy or laparotomy?
Question 5: Among women with suspected benign masses on initial investigation, what are the sensitivity and specificity of monitoring with periodic CA-125 and/or interval ultrasound examinations for detecting malignant masses? How does the interval of testing/definition of change affect sensitivity and predictive value?
Question 6: Among women with adnexal masses, what are the morbidity and mortality from diagnostic surgery (laparoscopy or laparotomy)? At what point does the risk of surgery outweigh the risk of detecting malignancy?
Question 7: What are the estimated trade-offs resulting from various strategies for evaluation of the adnexal mass?
Based on the original proposal and discussions with CDC, AHRQ, and the technical expert panel, we developed the following analytic framework to structure our review and synthesis (Figure 2).
Comments on this analytic framework are as follows:
Separate consideration of age or menopausal status is important, since several factors that may affect the probability that a given adnexal mass is malignant may vary with age and/or menopausal status: the underlying incidence of various conditions that result in an adnexal mass, the frequency of contact with clinicians, the type and length of followup, and the prevalence of other conditions that may cause symptoms similar to those caused by ovarian malignancy or other symptomatic pelvic pathology. Race/ethnicity may also play a role, both in the relative likelihood of malignancy and the likelihood of other conditions and contact with clinicians.
A variety of conditions, both benign and malignant, can cause a mass in the adnexa. The underlying prevalence of each type of condition, along with the sensitivity and specificity of the initial diagnostic test, will determine the proportion of patients with a given test result who are truly disease-free, or who truly have disease. The evidence on the prevalence of these conditions is reviewed in Question 1.
Women can present with an adnexal mass in one of two ways: through presentation with symptoms and subsequent detection of a mass through a physical examination, or through detection of a mass in an asymptomatic woman during physical examination or an imaging study. The ultimate probability of malignancy may vary based on how an adnexal mass is initially detected, since the prevalence of malignancy at this stage will drive the positive and negative predictive values of all subsequent tests. Because many women will initially have their masses detected through a bimanual pelvic examination, we review the evidence on the sensitivity and specificity of this component of the physical examination in Question 2.
After the initial diagnosis of an adnexal mass, the choice of the next test will provide a revised estimate of the probability of a given disease. Although determining this probability is important in the symptomatic patient so that she may receive appropriate therapy, it is even more important in the asymptomatic patient, who runs the risk of undergoing unnecessary surgery for a benign condition if the test is falsely positive. Question 3 addresses the sensitivity and specificity of tests commonly used as “next step” diagnostic procedures.
Frequently, a combination of various test results and patient characteristics can provide better discrimination between diseased and non-diseased, or benign and malignant, than any single test parameter. Question 4 addresses the performance of various multivariate scoring systems in discriminating benign from malignant masses.
Because 100 percent sensitivity is difficult to achieve, some tests will be falsely negative. One strategy to minimize the consequences of a false negative test would be to monitor the patient with a specified test or tests, at a specified frequency, for a specified duration. Question 5 addresses the evidence for the effectiveness of such an approach, and which combination of test, test frequency, and duration of followup offers optimal performance.
The ultimate diagnosis of ovarian malignancy requires surgical exploration, either through laparoscopy or laparotomy. Although an adverse outcome of surgery is not desirable under any circumstances, patients who undergo surgery because of a symptomatic mass have the possibility of improvement in symptoms, while, for patients who ultimately prove to have an ovarian malignancy, surgical management with adequate staging and reduction in tumor bulk appears to improve outcomes. However, for patients with some asymptomatic benign masses, the benefits of surgery may be less clear while providing substantial risks. Question 6 addresses the risks of diagnostic surgery, both laparoscopy and laparotomy, for women with adnexal masses.
Finally, estimating the benefits, harms, and costs of various management strategies, including screening, for ovarian cancer is complex. Synthesizing the wide range of data and incorporating uncertainty, as well as missing data, can often be done using simulation models. Question 7 presents an initial attempt at summarizing the likely outcomes of several different diagnostic strategies. Because modeling the natural history of ovarian cancer will ultimately be important for comprehensive analyses of different screening and diagnostic strategies, we also review existing models for the natural history of ovarian cancer with special attention paid to underlying assumptions.
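The idea running through the comments above, that each test revises the probability of malignancy and that the revised probability becomes the starting point for the next test, can be sketched with likelihood ratios. This is an illustration only: the function name, the pretest probability, and the test characteristics are hypothetical, and chaining two results in this way assumes the tests are conditionally independent given disease status, which the report does not claim.

```python
def post_test_probability(pretest, sensitivity, specificity, positive):
    """Revise a pretest probability of disease after one test result."""
    if positive:
        likelihood_ratio = sensitivity / (1 - specificity)
    else:
        likelihood_ratio = (1 - sensitivity) / specificity
    pretest_odds = pretest / (1 - pretest)
    post_odds = pretest_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


# Hypothetical sequence: a mass with a 10 percent pretest probability of
# malignancy, followed by two positive tests (assumed independent).
p = 0.10
p = post_test_probability(p, 0.90, 0.90, positive=True)   # first test
p = post_test_probability(p, 0.80, 0.95, positive=True)   # second test
print(f"probability after two positive tests: {p:.3f}")
```

A single application of this function gives the same answer as the predictive value formulas when prevalence is used as the pretest probability.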
The primary sources of literature were MEDLINE® (1966-September 2004) and the Cochrane Database of Systematic Reviews. Searches of these databases were supplemented by reviews of reference lists contained in all included articles and in relevant review articles and meta-analyses.
The basic search strategy used the National Library of Medicine's Medical Subject Headings (MeSH) key word nomenclature developed for MEDLINE® and was adapted for use in the other databases. The searches were limited to the English language. The texts of the three major search strategies are given in Appendix A.* The searches yielded a total of 677 citations, whose records are maintained in a ProCite20 database.
Paired researchers from the Duke research team independently reviewed a set of abstracts and classified each as “include” or “exclude” according to study-specific criteria, which they developed. An abstract was included if at least one of the paired reviewers recommended that it be included. A total of 445 abstracts were included for the further “full-text review” stage. Interrater reliability for include/exclude decisions was tested by having 10 pairs of readers review 138 abstracts. Agreement was good to excellent (kappa 0.66 to 0.95).
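For readers unfamiliar with the kappa statistic, the following sketch shows how chance-corrected agreement between two reviewers could be computed, together with the "include if either reviewer says include" rule described above. It assumes Cohen's kappa for two raters; the report does not state which variant was used, and the ten decisions shown are invented for illustration.

```python
def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who classified the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty item list")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    categories = set(ratings_a) | set(ratings_b)
    # Agreement expected by chance, from each rater's marginal proportions.
    expected = sum(
        (ratings_a.count(c) / n) * (ratings_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1 - expected)


# Hypothetical decisions by two reviewers on ten abstracts.
reviewer_1 = ["include"] * 5 + ["exclude"] * 5
reviewer_2 = ["include"] * 4 + ["exclude"] * 5 + ["include"]

kappa = cohens_kappa(reviewer_1, reviewer_2)

# An abstract advances if at least one reviewer recommends inclusion.
advanced = [a == "include" or b == "include"
            for a, b in zip(reviewer_1, reviewer_2)]
print(f"kappa={kappa:.2f}, abstracts advanced={sum(advanced)}")
```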
At the full-text review stage, the paired researchers independently reviewed a set of the articles, and indicated a decision to “include” or “exclude” the article for the data abstraction stage. When a pair of reviewers arrived at a different opinion about whether to include an article, they were asked to reconcile the difference. Detailed inclusion and exclusion screening criteria were developed by research question and are listed below.
Initially, the patient population was limited to peri- and postmenopausal women, and only articles that provided data specifically by age or menopausal status were included. After initial discussion with the expert panel, the search was expanded to include premenopausal women.
Question 1. Background clarifications were as follows:
The search should be limited to (a) screening studies and (b) case series of women with an undiagnosed mass (not just women who went to laparoscopy/path diagnosis).
Pathology list:
Benign
Uterine leiomyoma
Nonneoplastic cysts, such as:
Follicular (functional) cysts
Corpus luteal (functional) cysts
Theca lutein cysts
Simple cysts
Peritoneal inclusion cysts
Paraovarian cysts
Hemorrhagic cysts
Endometrial cyst
Polycystic ovary disease
Cystic teratoma (dermoid cyst)
Hydrosalpinx
Cystadenoma
Fibroma
Malignant ovarian neoplasms
Adenocarcinoma
Others
Tumors of low malignant potential
Screening criteria for Question 1 were:
undiagnosed mass (regardless of whether symptomatic or asymptomatic; detected by palpation or ultrasound imaging);
exclude if n < 50; if n ≥ 50, write n on decision sheets;
histology diagnosis;
screened women without mass (case series or cohort) or women with adnexal mass (case series).
Question 2. Screening criteria were as follows:
comparison of bimanual pelvic examination to a reference standard;
n ≥ 20;
able to construct 2-by-2 table for test characteristics.
Question 3. Screening criteria were as follows:
undiagnosed mass (regardless of whether symptomatic or asymptomatic; detected by palpation or ultrasound imaging) or screening population;
disease status distinguishes malignant from non-malignant;
must have 20 or more subjects;
disease status must be verified by histology or negative surgery (laparoscopy/laparotomy);
test is ultrasound, CT, MRI, PET, serum CA-125, or bimanual pelvic exam;
able to construct 2-by-2 table for test characteristics.
Question 4. Screening criteria were as follows:
patients with cancer;
studies with scoring, risk score, combined modality approach;
assesses predictive value of two or more variables (radiographic, patient characteristics or CA-125) using multivariable model;
screening studies;
n ≥ 50.
Question 5. Screening criteria were as follows:
n ≥ 50;
histology or followup interval = at least 9 months;
outcome = continued negative test with no clinical evidence of developing ovarian cancer.
Question 6. Screening criteria were as follows:
procedure = operative laparoscopy for adnexal mass, with or without biopsy;
addresses complications of procedure (morbidity or mortality);
n ≥ 100 for morbidity.
Question 7. Screening criterion was as follows: article described mathematical or computer model of natural history of ovarian cancer.
| Stage | Number of articles |
|---|---|
| Articles identified | 1,023 |
| Abstracts reviewed | 1,023 |
| Included | 445 |
| Excluded | 578 |
| Full-text articles reviewed | 445† |
| Included | 204 |
| Excluded | 269 |

† The combined number of included (204) and excluded (269) articles exceeds the total of 445 reviewed at the full-text level because 28 articles were excluded for one question but included for another.
| Question | Number of articles |
|---|---|
| Question 1: Prevalence of tumor types | 20 |
| Question 2: Bimanual pelvic examination | 14 |
| Question 3: Single modality tests | 153 |
| Question 4: Explicit scoring systems | 36 |
| Question 5: Monitoring women with suspected benign masses | 9 |
| Question 6: Surgical morbidity and mortality | 24 |
| Question 7: Modeling diagnostic strategies | 4 |
| Total number of included articles | 204† |
† Some articles were included for more than one question.
The Duke research team developed and piloted evidence table formats for abstracting data to answer each of the seven research questions (see Appendix C *). Based on clinical expertise, a pair of researchers was assigned to one of the seven research questions to abstract the data from the eligible articles. One of the paired researchers abstracted the data into the evidence tables, and the second researcher over-read the article and accompanying evidence table to check for accuracy and completeness. The completed evidence tables are provided in Appendix D.*
At the data abstraction stage, the researcher was asked to evaluate each included article for factors affecting internal and external validity. The quality assessment criteria varied by question and are listed below. Researchers were instructed to assign a + or - to each item, and provide a brief rationale for each decision.
Quality criteria were as follows:
Question 1: What is the prevalence of various tumor types among women with an adnexal mass, stratified by cancer status (malignant vs. benign), age, menopausal status, and size of tumor?
Question 2: What are the sensitivity, specificity, and reliability of the bimanual pelvic examination?
Question 3: Among women with a palpable adnexal mass on exam or a mass identified by ultrasound/imaging, what is the sensitivity/specificity of various evaluation modalities including ultrasound (transvaginal ultrasound, transabdominal ultrasound, color Doppler, 2D vs. 3D ultrasound), CT scan, MRI scan, and CA-125 levels for diagnosing malignant masses?
Question 4: What is the accuracy of explicit scoring systems which incorporate various combinations of imaging findings, patient risk factors, and/or CA-125 levels for detecting malignancy? Have these scoring systems been applied to a population of women before laparoscopy or laparotomy?
Question 5: Among women with suspected benign masses on initial investigation, what are the sensitivity and specificity of monitoring with periodic CA-125 and/or interval ultrasound examinations for detecting malignant masses? How does the interval of testing/definition of change affect sensitivity and predictive value?
Question 6: Among women with adnexal masses, what are the morbidity and mortality from diagnostic surgery (laparoscopy or laparotomy)? At what point does the risk of surgery outweigh the risk of detecting malignancy?
Question 7: What are the estimated trade-offs resulting from various strategies for evaluation of the adnexal mass?
For test characteristics, a Microsoft Excel® spreadsheet was developed which calculated appropriate test characteristics (sensitivity, specificity, negative predictive value, positive predictive value) for individual studies if studies provided enough data to input (a) values for individual cells of a 2-by-2 table, (b) the prevalence of disease and values for sensitivity and specificity, or (c) sufficient data to solve for two equations involving sensitivity, specificity, or predictive values. Ninety-five percent confidence intervals were automatically estimated using the approximate formula for proportions:

For Questions 1 and 6, prevalence of different mass types, and morbidity and mortality rates, were also calculated using the above formula. For studies where the numerator of a particular proportion was 0, the upper bound was estimated using the formula:
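The formulas themselves are not reproduced in this text, so the following sketch should be read as an illustration of the kind of calculation described rather than as the spreadsheet's actual logic. It assumes the normal (Wald) approximation for the 95 percent confidence interval of a proportion and an exact one-sided bound for a zero numerator (close to the familiar "rule of three," 3/n). All function names and the 2-by-2 counts are hypothetical.

```python
import math


def wald_interval(events, n, z=1.96):
    """Approximate 95 percent CI for a proportion (assumed normal approximation)."""
    p = events / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


def zero_numerator_upper_bound(n, alpha=0.05):
    """Upper bound for a proportion when 0 events are seen in n subjects.

    Assumed here: the exact one-sided bound 1 - alpha**(1/n), which is
    roughly 3/n for moderate n.
    """
    return 1 - alpha ** (1 / n)


def characteristics_from_2x2(tp, fp, fn, tn):
    """Point estimates and intervals from the four cells of a 2-by-2 table."""
    return {
        "sensitivity": (tp / (tp + fn), wald_interval(tp, tp + fn)),
        "specificity": (tn / (tn + fp), wald_interval(tn, tn + fp)),
        "ppv": (tp / (tp + fp), wald_interval(tp, tp + fp)),
        "npv": (tn / (tn + fn), wald_interval(tn, tn + fn)),
    }


# Hypothetical study: 50 malignant and 200 benign masses.
results = characteristics_from_2x2(tp=45, fp=20, fn=5, tn=180)
for name, (estimate, (low, high)) in results.items():
    print(f"{name}: {estimate:.3f} (95% CI {low:.3f} to {high:.3f})")

print(f"upper bound, 0 events in 100: {zero_numerator_upper_bound(100):.4f}")
```

The Wald interval is known to behave poorly for proportions near 0 or 1 and for small samples, which is one reason a separate rule is needed when the numerator is 0.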

For Questions 2, 3, and 4, we used two complementary methods for assessing diagnostic test performance: (1) summary receiver operating characteristic (ROC) analysis; and (2) independently combined sensitivity and specificity values. We calculated pooled sensitivity and specificity estimates, along with 95 percent confidence intervals and summary ROC curves, using Meta-Stat 0.6, a shareware program for performing meta-analyses of diagnostic tests.22 In this software, logits of sensitivity and specificity values are pooled, using a random-effects model weighted by the inverse of the variance.23
We combined the sensitivity and specificity values of the tests across studies using a random-effects model to estimate the average values. A random-effects model incorporates both the within-study variation (sampling error) and between-study variation (true treatment-effect differences) into the overall treatment estimate. It gives a wider confidence interval than the fixed-effect model (which considers only within-study variability) when estimates are based on heterogeneous results.
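The pooling described above can be sketched as follows. This is not the Meta-Stat implementation: it assumes the DerSimonian-Laird estimator for the between-study variance and a 0.5 continuity correction added to each count so that logits remain finite, and the function name and study counts are hypothetical.

```python
import math


def pool_proportions_random_effects(events, totals, z=1.96):
    """Pool proportions (e.g., sensitivities) on the logit scale.

    Returns (pooled, lower, upper, tau2), where tau2 is the estimated
    between-study variance (DerSimonian-Laird, assumed here).
    """
    logits, variances = [], []
    for e, n in zip(events, totals):
        a, b = e + 0.5, n - e + 0.5          # continuity-corrected counts
        logits.append(math.log(a / b))
        variances.append(1 / a + 1 / b)      # within-study variance of the logit

    w = [1 / v for v in variances]           # fixed-effect (inverse-variance) weights
    fixed = sum(wi * yi for wi, yi in zip(w, logits)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, logits))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    k = len(logits)
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0

    # Random-effects weights add the between-study variance to each study.
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, logits)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))

    def inv_logit(x):
        return 1 / (1 + math.exp(-x))

    return inv_logit(pooled), inv_logit(pooled - z * se), inv_logit(pooled + z * se), tau2


# Hypothetical sensitivities: true positives out of all malignant cases.
pooled, lower, upper, tau2 = pool_proportions_random_effects(
    events=[45, 30, 18, 52], totals=[50, 40, 20, 70]
)
print(f"pooled sensitivity {pooled:.3f} (95% CI {lower:.3f} to {upper:.3f}), tau^2={tau2:.3f}")
```

When the studies agree closely, the estimated between-study variance is 0 and the result matches a fixed-effect analysis; when they disagree, the interval widens, as the paragraph above notes.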
When each is combined separately, sensitivity and specificity tend to underestimate the true test sensitivity and specificity; however, they can provide an indication of the approximate test operating point for most of the studies.
Summary ROC curves are a potentially useful graphical summary of the diagnostic test performance data. In brief, each study provides a pair of sensitivity and specificity values to the analysis. After logistic transformation of data, a linear model is fitted to the observed studies using regression analysis. This best-fit model can then be transformed back to ROC space and plotted as a curve. A summary ROC curve can be thought of as an ROC curve that describes joint changes in sensitivity and specificity with changes in cutoff values. The ideal position of an ROC curve is near the upper left corner. The area under the curve (AUC) is another summary measure of the degree of discrimination of a test.
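The transform, fit, and back-transform steps described above can be sketched as follows. The sketch assumes the commonly used Moses-Littenberg formulation, with unweighted least squares on D = logit(TPR) - logit(FPR) against S = logit(TPR) + logit(FPR); the report does not specify the exact regression used by its software, and the function names and study values here are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def fit_summary_roc(sens_spec_pairs):
    """Fit D = intercept + slope * S by ordinary least squares.

    Each pair is (sensitivity, specificity), both strictly between 0 and 1.
    """
    d, s = [], []
    for sensitivity, specificity in sens_spec_pairs:
        tpr, fpr = sensitivity, 1 - specificity
        d.append(logit(tpr) - logit(fpr))   # log diagnostic odds ratio
        s.append(logit(tpr) + logit(fpr))   # proxy for the test threshold
    n = len(d)
    mean_s, mean_d = sum(s) / n, sum(d) / n
    sxx = sum((si - mean_s) ** 2 for si in s)
    if sxx == 0:
        raise ValueError("studies must differ in S to estimate a slope")
    slope = sum((si - mean_s) * (di - mean_d) for si, di in zip(s, d)) / sxx
    intercept = mean_d - slope * mean_s
    return intercept, slope


def summary_roc_sensitivity(fpr, intercept, slope):
    """Back-transform the fitted line to ROC space: expected TPR at a given FPR."""
    x = (intercept + (1 + slope) * logit(fpr)) / (1 - slope)
    return 1 / (1 + math.exp(-x))


# Hypothetical (sensitivity, specificity) pairs from four studies.
studies = [(0.85, 0.90), (0.90, 0.80), (0.95, 0.70), (0.80, 0.92)]
intercept, slope = fit_summary_roc(studies)
for fpr in (0.05, 0.10, 0.20, 0.30):
    tpr = summary_roc_sensitivity(fpr, intercept, slope)
    print(f"FPR={fpr:.2f}  summary TPR={tpr:.3f}")
```

A slope near 0 means the diagnostic odds ratio is roughly constant across thresholds, in which case the curve is symmetric about the anti-diagonal of ROC space.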
The summary ROC method assumes that the variability in the reported sensitivity and specificity values from different studies is due to different cutoff values (explicit or implicit) being applied.24 However, the summary ROC curve can summarize studies whose variability may be due to other sources of variation, since the summary ROC curve no longer ties specific cutoff values to specific intervals of the curve. One can think of a summary ROC curve as an overall estimate of the discrimination ability of a test.
When there is little variability in the test results (i.e., when studies appear to be operating at similar thresholds and report similar results), summary ROC analysis provides little additional information. In this case, separately averaged sensitivity and specificity values across studies will give similarly useful summary information. However, where there is substantial variability in test results, the separately averaged sensitivity and specificity values tend to have wide confidence intervals and have means that do not characterize any of the studies. In this case, summary ROC curves provide a more suitable analysis framework.
The Nationwide Inpatient Sample (NIS) is a public access database maintained by AHRQ. The NIS represents a stratified sample of approximately 20 percent of all discharges from U.S. hospitals; data for the year 2000 contain administrative discharge data from hospitals in 28 states, while 2001 contains data from 33 states.25 Weights are provided in order to allow estimation of national data based on this sample. We used data from 2000 and 2001 to provide supplemental data on the frequency of diagnostic laparoscopy and exploratory laparotomy for Question 6. Because previous work has shown that administrative data may lack sufficient clinical detail to compare outcomes,26 we did not attempt to directly compare complication rates between these procedures, or between diagnoses.
The search was limited to women 15 years and older who had one of the following International Classification of Diseases, Ninth Revision (ICD-9) diagnostic codes: 183.x (malignant neoplasm of the ovary and other uterine adnexa); 220.x (benign neoplasms of the ovary); 620.x (ovarian cysts); 752.11 (para-ovarian cysts); 614.0, 614.1, 614.2, 614.6 (adnexal masses secondary to pelvic inflammatory disease); 789.33, 789.34, 789.39 (abdominal masses arising in the left or right lower quadrant, or other nonspecified site); and V65.5 (normal findings after diagnostic evaluation).
In order to avoid overestimation of complication rates due to other procedures, we then excluded patients who had an ICD-9 procedure code for hysterectomy (68.x). Procedures were then classified as laparoscopy only (54.21), laparoscopy with conservative ovarian surgery (65.3x, 65.4x, 65.5x, 65.6x), laparoscopy with oophorectomy (65.0x, 65.2x), or laparotomy (54.11) alone, with conservative ovarian surgery (same codes), or with oophorectomy (same codes).
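The classification rules in the preceding paragraph can be expressed as a small function. This sketch uses the code groupings exactly as listed in the text. Several details are assumptions not stated in the report: codes are written with decimal points, a discharge carrying both approach codes is treated as laparotomy, and the oophorectomy group takes precedence when codes from both ovarian groups are present. The function name is hypothetical.

```python
# Code groupings as listed in the text above.
CONSERVATIVE_PREFIXES = ("65.3", "65.4", "65.5", "65.6")
OOPHORECTOMY_PREFIXES = ("65.0", "65.2")
HYSTERECTOMY_PREFIX = "68."
LAPAROSCOPY_CODE = "54.21"
LAPAROTOMY_CODE = "54.11"


def classify_discharge(procedure_codes):
    """Return a procedure category for one discharge, or None if excluded."""
    codes = [code.strip() for code in procedure_codes]

    # Exclude any discharge that includes a hysterectomy.
    if any(code.startswith(HYSTERECTOMY_PREFIX) for code in codes):
        return None

    # Assumption: laparotomy takes precedence if both approaches are coded.
    if LAPAROTOMY_CODE in codes:
        approach = "laparotomy"
    elif LAPAROSCOPY_CODE in codes:
        approach = "laparoscopy"
    else:
        return None  # neither diagnostic approach was coded

    # Assumption: the oophorectomy group takes precedence over conservative surgery.
    if any(code.startswith(OOPHORECTOMY_PREFIXES) for code in codes):
        return f"{approach} with oophorectomy"
    if any(code.startswith(CONSERVATIVE_PREFIXES) for code in codes):
        return f"{approach} with conservative ovarian surgery"
    return f"{approach} only"


print(classify_discharge(["54.21"]))
print(classify_discharge(["54.11", "65.01"]))
print(classify_discharge(["54.21", "68.4"]))
```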
A discharge status of “Dead” indicated in-hospital mortality. Complications of surgery or hospitalization were indicated by diagnosis codes of E870 through E876.
We used a Markov state-transition model to explore the impact of alternate assumptions about the natural history of ovarian cancer. The original model was developed as a graduate school project by Karen Hoffman, MD, and further refined in collaboration with two of the authors of this report (Drs. Kulasingam and Myers).
| Variable description | Model abbreviation of variable | Value | Range varied |
|---|---|---|---|
| Probability of clinical diagnosis for each stage (I, II, III, or IV) if no screening test or if screening produces a false negative | pClinDxStageI | 0.261 | Calibrated |
| | pClinDxStageII | 0.446 | |
| | pClinDxStageIII | 0.837 | |
| | pClinDxStageIV | 0.950 | |
| Probability of dying from diagnostic exploratory laparotomy | pLapDeath | 0.00023 | 0.00 to 0.0010 |
| Probability of dying from each stage of cancer, based on 5-year survival rates | pDieStageI | 0.051 | Not varied |
| | pDieStageII | 0.187 | Not varied |
| | pDieStageIII | 0.691 | Not varied |
| | pDieStageIV | 0.691 | Not varied |
| Probability of developing Stage I cancer, based on ovarian cancer incidence rates | tCompInc | Varies with age | |
| Probability of dying from a cause other than ovarian cancer | tMortCaAdj | Varies with age | |
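To show how parameters of this kind enter a Markov state-transition model, the following is a deliberately simplified cohort sketch. It is not the model used in this report: the set of states, the annual cycle length, the conversion of 5-year death probabilities to annual probabilities, and the constant values for incidence, other-cause mortality, and stage progression are all assumptions made for illustration. Only the pClinDx and pDie values are taken from the table above.

```python
STAGES = ["I", "II", "III", "IV"]

# Values from the table above.
P_CLIN_DX = {"I": 0.261, "II": 0.446, "III": 0.837, "IV": 0.950}
P_DIE_5YR = {"I": 0.051, "II": 0.187, "III": 0.691, "IV": 0.691}

# Hypothetical placeholders (the report's values vary with age or are not given here).
P_INCIDENCE = 0.0005    # stands in for tCompInc
P_OTHER_DEATH = 0.01    # stands in for tMortCaAdj
P_PROGRESS = 0.5        # assumed yearly progression to the next stage


def annual_probability(five_year_probability):
    """Convert a 5-year probability to an annual one, assuming a constant rate."""
    return 1 - (1 - five_year_probability) ** (1 / 5)


def step(cohort):
    """Advance the cohort distribution by one (assumed annual) cycle."""
    new = dict.fromkeys(cohort, 0.0)
    new["dead_cancer"] = cohort["dead_cancer"]
    new["dead_other"] = cohort["dead_other"]
    for state, mass in cohort.items():
        if state.startswith("dead"):
            continue
        new["dead_other"] += mass * P_OTHER_DEATH
        alive = mass * (1 - P_OTHER_DEATH)
        if state == "well":
            new["undx_I"] += alive * P_INCIDENCE
            new["well"] += alive * (1 - P_INCIDENCE)
        elif state.startswith("undx_"):
            stage = state[len("undx_"):]
            diagnosed = alive * P_CLIN_DX[stage]
            new["dx_" + stage] += diagnosed
            remaining = alive - diagnosed
            index = STAGES.index(stage)
            if index < len(STAGES) - 1:
                new["undx_" + STAGES[index + 1]] += remaining * P_PROGRESS
                new[state] += remaining * (1 - P_PROGRESS)
            else:
                new[state] += remaining
        else:  # diagnosed cancer, state name "dx_<stage>"
            stage = state[len("dx_"):]
            deaths = alive * annual_probability(P_DIE_5YR[stage])
            new["dead_cancer"] += deaths
            new[state] += alive - deaths
    return new


states = (["well"] + ["undx_" + s for s in STAGES]
          + ["dx_" + s for s in STAGES] + ["dead_cancer", "dead_other"])
cohort = {state: 0.0 for state in states}
cohort["well"] = 1.0
for _ in range(30):
    cohort = step(cohort)
print({state: round(mass, 5) for state, mass in cohort.items()})
```

Each cycle redistributes the cohort among mutually exclusive states, so the proportions always sum to 1; changing an assumption about natural history means changing one of the transition probabilities and rerunning the cohort.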
We employed internal and external quality-monitoring checks through every phase of the study to reduce bias, enhance consistency, and verify accuracy. Examples of internal monitoring procedures include three progressively stricter screening opportunities for each article (abstract screening, full-text article review, data abstraction review); involvement of three individuals (two clinicians and a copy editor) in each data abstraction; and agreement of at least two clinicians on all included studies.
Our principal external quality-monitoring device was the peer-review process. Nominations for peer reviewers were solicited from several sources, including the technical expert panel and interested federal agencies. The list of nominees was forwarded to the Agency for Healthcare Research and Quality (AHRQ) for vetting and approval. A final list of peer reviewers is provided in Appendix E.*
Question 1 is: What is the prevalence of various tumor types among women with an adnexal mass, stratified by cancer status (malignant vs. benign), age, menopausal status, and size of tumor?
We included studies in the U.S. population with more than 50 women and limited the literature search to screening studies and case series where results were provided for all women with an undiagnosed mass, not just those with subsequent positive additional tests.21 Studies of adnexal mass in which the gold standard is applied only to those with positive test results would underestimate the prevalence of disease and cause a substantial bias.
| Study | N | % Menopausal | Malignant | Borderline | Benign |
|---|---|---|---|---|---|
| DePriest et al., 199336 | 3,220 | 100; most had positive family history of breast, ovarian, or colorectal cancer | 0.09% | Not reported | 1.3% |
| DePriest et al., 199734 | 6,470 | Either menopausal or had positive family history of breast (30%), ovarian (24%), or colorectal cancer (15%) | 0.11% | Not reported | 1.2% |
| Modesitt et al., 200340 | 15,106 | 100 | 0.18% | Not reported | 0.8% |
| Van Nagell et al., 200049 | 14,469 | Either menopausal or had positive family history of breast (34%), ovarian (23%), or colorectal cancer (23%) | 0.1% | 0.02% | 1.1% |
Note: All four publications represent the same screening study at different times.
The most common malignant tumor types include primary ovarian carcinoma, such as serous and mucinous cystadenocarcinoma, granulosa cell tumors, and undifferentiated adenocarcinoma. Borderline tumors, such as serous tumors of low malignant potential (0.02 percent), were less common. The most common benign tumors were serous cystadenoma (0.4 to 0.7 percent), paratubal cyst (0.1 to 0.16 percent), endometrioma (0.03 to 0.3 percent), and mature teratoma (0.02 to 0.08 percent).
| Study | Denominator | Location | Age, menopausal status, race | Malignant | Borderline | Benign |
|---|---|---|---|---|---|---|
| Childers et al., 199632 | 138 | AZ | 52 | 13.8% | Not reported | 86.2% |
| Dottino et al., 199937 | 160 | NY | 52.2; 53% post; 91% white | 8.1% | 5% | 86.9% |
| Fleischer et al., 199638 | 62 | TN | 50; >50% post | 50% | Not reported | 50% |
| Lin et al., 199339 | 80 | NY | 56; 76% post; 90% white | 57.5% | 2.5% | 40% |
| Parker et al., 199441 | 61 | Multi-site | 65; 100% post | None | None | 100% |
| Roman et al., 199742 | 226 | CA | 20% post | 11.5% | 7.5% | 81% |
| Schneider et al., 199343 | 55 | AZ | 53; 60% post | 25.5% | 3.6% | 70.9% |
| Scoutt et al., 199444 | 109 | CT | 40 | 20.2% | Not reported | 79.8% |
| Shen-Gunther et al., 200245 | 125 | OK/NV | 58; 82% white | 44.8% | 9.6% | 45.6% |
| Smikle et al., 199546 | 195 | TX | 40% post | 13.3% | Not reported | 86.7% |
| *Chalas et al., 199231 | 241 | NY | Not reported | 50.2% | 7.5% | 42.3% |
| Cohen et al., 200133 | 71 | IL | 22–80; 44% post | 18.3% | 1.4% | 80.3% |
| DePriest et al., 199335 | 121 | KY | 3–74; 49% post | 10.7% | Not reported | 89.3% |
| Troiano, 199747 | 144 | CT | 45; 29% post | 11.8% | 2.1% | 86.1% |
| Twickler et al., 199948 | 244 | TX | 38.6 | 5.7% | 6.6% | 87.7% |
| Vasilev et al., 198850 | 182 | CA | Not reported | 8.2% | 1.6% | 90.1% |

*Retrospective chart review
Estimating the age-specific prevalence of specific adnexal tumor types from the available literature is difficult. The best data come from a series of reports from a large screening study; overall prevalence of masses was 1 to 2 percent, with benign masses outnumbering malignant by 4- to 10-fold. Because patients with negative screening test results did not undergo definitive diagnostic procedures in these studies, the prevalence estimates are dependent on the sensitivity of the screening tests used (and the completeness of followup among test negatives). In addition, there is a potential bias in that premenopausal women enrolling in the screening study were at higher risk than average because of family history, and postmenopausal women may have been more likely to enroll because of concerns based on family history, vague symptoms, or other reasons, any of which would affect prevalence relative to the general population.
Estimates of prevalence in studies with 100 percent histologic diagnosis are inevitably biased by the clinical factors that determine which patients ultimately undergo surgery. These can include the presence and nature of symptoms (patients with symptoms referable to a mass would likely undergo surgery sooner than those with asymptomatic masses, all other things being equal); other findings (for example, the presence of ascites); patient anxiety; the diagnostic algorithms used (for example, the duration of followup for persistence); and the nature of the practice (malignancies will be more frequent in a gynecologic oncology practice compared to a general gynecology practice).
As mentioned previously, we did not include studies from outside the United States. Given differences in ethnic backgrounds (affecting genetic risks), observed differences in cancer incidence, differences in clinical practice between countries, and the almost universal failure of studies to describe the clinical history leading to the diagnosis of adnexal mass, inclusion of these studies would not have allowed a more precise estimate of the prevalence of different types of adnexal masses in the U.S. population.
In four reports from a large U.S. screening study, the prevalence of adnexal masses detected by ultrasound among postmenopausal women was 0.8 to 1.3 percent, and the prevalence of malignancy 0.09 to 0.18 percent (i.e., 9 to 18 per 10,000). Prevalence of different pathologies varies widely among case series. There are no data on the relative prevalence of different pathologies among women with asymptomatic masses compared to women with symptomatic masses.
Question 2 is: What are the sensitivity, specificity, and reliability of the bimanual pelvic examination?
Articles were sought which evaluated the ability of the bimanual examination to detect adnexal masses, and/or to discriminate benign from malignant masses. Preference was given to studies where there was histological confirmation of the diagnosis, but an alternative reference standard (such as followup) was allowed for screening studies. Data allowing calculation of sensitivity and specificity had to be provided.
Our rationale for including the pelvic examination was based on its role in the initial evaluation of adnexal masses. Some asymptomatic women will have a mass detected as part of a “routine” physical examination; others will have a mass detected as part of an examination performed because of symptoms. The postexamination probability of malignancy is a function of the prevalence of cancer and the sensitivity and specificity of the bimanual examination; these probabilities, in turn, will affect the positive and negative predictive values of additional tests such as cancer antigen 125 (CA-125) and imaging studies. Because the pelvic examination will be the first test performed, either as a screening test or as a diagnostic test, knowledge of its test characteristics is important for evaluating subsequent diagnostic tests.
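The dependence of the postexamination probability on prevalence, sensitivity, and specificity can be made concrete with Bayes' rule. The following is a minimal sketch and is not part of the original analysis; the function name is hypothetical, and the example inputs (a 1 percent prevalence of adnexal masses, with the pooled examination sensitivity of 0.45 and specificity of 0.90 reported later in this section) are chosen only for illustration.

```python
def post_test_probabilities(prevalence, sensitivity, specificity):
    """Return (P(condition | positive exam), P(condition | negative exam)).

    All inputs are proportions between 0 and 1. This is plain Bayes' rule,
    offered as an illustration rather than the report's own method.
    """
    if not all(0.0 <= x <= 1.0 for x in (prevalence, sensitivity, specificity)):
        raise ValueError("inputs must be proportions between 0 and 1")

    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    true_neg = specificity * (1.0 - prevalence)

    # Guard against division by zero when a test result can never occur.
    p_given_pos = true_pos / (true_pos + false_pos) if (true_pos + false_pos) > 0 else float("nan")
    p_given_neg = false_neg / (false_neg + true_neg) if (false_neg + true_neg) > 0 else float("nan")
    return p_given_pos, p_given_neg


# Illustrative inputs only (assumed for this sketch).
after_positive, after_negative = post_test_probabilities(0.01, 0.45, 0.90)
print(f"After a positive exam: {after_positive:.3f}")
print(f"After a negative exam: {after_negative:.4f}")
```

Under these assumed inputs, a positive examination raises the probability of a mass from 1 percent to only about 4 percent, which is the starting prevalence for any subsequent test.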
Types of data incorporated. Two of the studies54, 56 included history or clinical impression as part of the “test;” results were not provided separately for examination alone.
Types of study population. Ten of the 14 studies were performed prior to surgery for an adnexal mass, while four were from screening studies.51, 52, 57, 58
Reporting of study populations. Of the screening studies, Andolf et al.52 was performed in women over 40 considered at high risk of ovarian cancer because of symptoms or risk factors; Grover and Quinn57 was performed in asymptomatic volunteers 25 and older, but described menopausal status; Adonakis et al.51 was performed in women over 45; and Jacobs et al.58 was done entirely in a postmenopausal population.
Seven of the 10 preoperative studies reported menopausal status, but only two reported on test characteristics specifically by menopausal status.55, 56 None reported race/ethnicity, and none reported the clinical route by which patients had come to surgery (detection of an asymptomatic mass, symptoms, etc.).
Methodology. The methodological quality of the included studies was as follows:
Reference standard. Of the preoperative studies, all but one42 had operative confirmation of findings. Ultrasound was used as the reference standard in the four screening studies, with 12-month followup examinations or questionnaires.
Verification bias. In the study by Roman et al.,42 26 women with non-palpable masses did not undergo definitive diagnosis.
Test reliability. Only one study60 provided direct data on test reliability. Grover and Quinn,57 Ong et al.,59 Schutter et al.,63 and Buckshee et al.54 used a single examiner. The other studies did not address the issue of test reliability.
Sample size. None of the reports had a priori sample size calculations.
Use of appropriate statistical tests. All reports used appropriate techniques for calculating test characteristics.
Blinding. Only two studies54, 60 explicitly stated whether examiners were blinded to prior history or other findings.
Definition of positive and negative test. Nine of 14 studies reported their definitions of a positive test, although the precision of the definitions was quite variable (from “a mass 5 cm or more in diameter” to “larger than normal”); others relied on “clinical impression.”
| Study | N | Sensitivity (95% CI) | Specificity (95% CI) | % with confirmed mass | Notes |
|---|---|---|---|---|---|
| Jacobs et al., 198858 | 1,010 | 84.6% (65.0 to 100%) | 98.3% (97.5 to 99.1%) | 1.3% (0.1% malignant) | Reference standard: ultrasound |
| Screening study | |||||
| Andolf et al., 199052 | 801 | 33.7% (26.5 to 41.0%) | 92.0% (89.9 to 94.1%) | 20% (0.1% malignant) | Reference standard: ultrasound by midwife |
| Screening in women considered at high risk for ovarian cancer; no ovarian cancers detected: 2 endometrial cancers, 1 LMP detected | |||||
| Padilla et al., 200561 | 252 | 15.6% (8.1 to 23.0%) | 93.8% (90.1 to 97.5%) | 35.7% (unclear if any malignancies) | Exam under anesthesia prior to surgery for pelvic mass; examiners blinded to radiology findings |
| Likelihood of not detecting an adnexal mass increased with less experience (OR for resident 1.13, student 1.36 compared to attending, although 95% CIs cross 1). | |||||
| Statistically significant increase in missed diagnosis if subject with BMI > 30 (OR 2.57; 95% CI, 1.36 to 4.87), and significant decrease in presence of enlarged uterus (OR 0.48; 95% CI, 0.25 to 0.93). | |||||
| Final diagnoses not presented, reasons for surgery not systematically presented | |||||
| Padilla et al., 200060 | 140 (82 masses) | Left adnexa (attending exam): 32.7% (19.5 to 45.8%) | Left adnexa (attending exam): 88.5% (81.4 to 95.6%) | 58% (0 malignancies) | Exam under anesthesia prior to surgery for pelvic mass; examiners blinded to radiology findings; no clear relationship to experience |
| Right adnexa (attending exam): 21.2% (7.3 to 35.2%) | Right adnexa (attending exam): 78.7% (70.4 to 87.0%) | ||||
| Ong et al., 199659 | 86 | 71.9% (60.9 to 82.9%) | 59.1% (38.5 to 78.6%) | 74.4% (0 malignant) | Pre-surgical exam |
Abbreviations: BMI = body mass index; CI = confidence interval; LMP = low malignant potential tumor; OR = odds ratio
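The confidence intervals in the table above can be approximately reproduced with a normal-approximation (Wald) interval truncated to the range 0 to 1. The sketch below is an illustration under stated assumptions: the report does not name its interval method, and the counts used for Jacobs et al. (11 masses detected out of 13) are inferred from the reported N of 1,010, the 1.3 percent with a confirmed mass, and the 84.6 percent sensitivity, not taken from the source.

```python
import math


def wald_interval(successes, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion, clipped to [0, 1].

    Assumed method for illustration; the source report does not state
    which interval it used.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need 0 <= successes <= n and n > 0")
    p = successes / n
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Inferred counts (assumption): 11 of 13 masses detected by examination.
lower, upper = wald_interval(11, 13)
print(f"Sensitivity {11 / 13:.1%}, 95% CI {lower:.1%} to {upper:.1%}")
```

With so few masses, the interval is wide and reaches the upper bound of 100 percent; Wald intervals are known to behave poorly at small counts, so the reported bounds are best read as rough guides.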
When sensitivity and specificity were combined separately using a random-effects model, the pooled sensitivity was 0.45 (95% confidence interval [CI], 0.28 to 0.68), and the pooled specificity was 0.90 (0.80 to 0.96).
| Study | N | Sensitivity (95% CI) | Specificity (95% CI) | % Malignant | Notes |
|---|---|---|---|---|---|
| Adonakis et al., 199651 | 2,000 | 66.7% (13.3 to 100%) | 97.2% (96.5 to 97.9%) | 0.15% | Screening study; threshold of “abnormal or ambiguous exam;” CA-125 used in conjunction to proceed to ultrasound |
| Grover et al., 199557 | 2,623 | 0% (0 to 100%) | 98.5% (98.0 to 98.9%) | 0.05% | Screening study; ultrasound and clinical followup |
| Jacobs et al., 198858 | 1,010 | 100% (0 to 100%) | 97.3% (96.3 to 98.3%) | 0.1% | Screening study; followup with ultrasound |
| Roman et al., 199742 | 200 | 51.2% (36.3 to 66.1%) | 83.6% (77.8 to 89.4%) | 21% | Results for 26 patients with non-palpable masses not included; no substantial difference based on menopausal status |
| Buckshee et al., 199854 | 34 | 77.8% (50.6 to 100%) | 88.9% (77.0 to 100%) | 25% | One examiner; non-consecutive patients prior to surgery |
| Balbi et al., 200153 | 72 | 90% (77.5 to 100%) | 74% (61.8 to 86.2%) | 31% | 18 patients with “clearly benign masses” and 2 with “clearly malignant” excluded; clinical impression |
| Finkler et al., 198856 | 106 | 43.2% (27.3 to 59.2%) | 90.8% (83.7 to 97.8%) | 36% | “Clinical impression” included exam plus history; results not calculated for exam alone |
| Premenopausal: 16.7% (0 to 33.9%) | Premenopausal: 92.3% (85.1 to 99.6%) | Premenopausal: 26% | |||
| Postmenopausal: 68.4% (47.5 to 89.3%) | Postmenopausal: 84.6% (65.0 to 100%) | Postmenopausal: 59% | |||
| Schutter et al., 199863 | 155 | 91.5% (84.4 to 98.6%) | 73.9% (64.9 to 82.9%) | 39% | All postmenopausal; high prevalence of cancer; single examiner; inclusion/exclusion criteria not described |
| Schutter et al., 199462 | 222 | 92.6% (87.4 to 97.9%) | 63.0% (54.6 to 71.4%) | 43% | Preoperative patients |
| Dowd et al., 199355 | 225 | 51.0% (41.7 to 60.3%) | 87.0% (80.8 to 93.2%) | 49% | Preoperative patients |
| Premenopausal: 31% | Premenopausal: 95% | ||||
| Postmenopausal: 59% | Postmenopausal: 75% | ||||
Abbreviations: CA-125 = cancer antigen 125; CI = confidence interval
For both types of studies, there appears to be a trend towards decreased specificity as prevalence increases, although the number of studies is small and the confidence intervals are wide. The extreme differences in sensitivity in the two largest studies (0 and 100 percent) prevent even a qualitative assessment of any relationship between prevalence and sensitivity.
Despite the common recommendation for routine pelvic examination, we found surprisingly little literature on its accuracy. Based on the literature we did identify, its sensitivity for detecting adnexal masses appears fairly low. Sensitivity for detecting normal adnexa is also low, as demonstrated in a recent study of examinations under anesthesia.64 Although sensitivity for distinguishing a malignant mass from a benign one is somewhat better, these results need to be interpreted with caution, since most of the studies were done in preoperative patients, who would already have a higher probability of having a malignancy. In the four large screening studies, there was a total of only five malignancies, with the bimanual examination detecting 0 percent, 66 percent, and 100 percent in the three individual studies where ovarian cancer was detected; the fourth had one case of a low malignant potential tumor and two endometrial cancers. Pooled sensitivity for the three screening studies that addressed discrimination between benign and malignant masses was considerably lower than for all studies combined (and was similar to the pooled sensitivity of the studies that examined the ability to detect any adnexal mass).
Both types of studies show a trend toward decreased specificity as the prevalence of abnormality increases; this may reflect a greater degree of suspicion on the part of the examiner, based on other findings, and a greater likelihood of calling an examination abnormal. This is supported by the findings of the two studies that stratified results by menopausal status, which found higher sensitivity and lower specificity in postmenopausal women compared to premenopausal women.55, 56 Because examiners were unblinded, and were likely aware of the higher prevalence of malignancy among postmenopausal women, they may have been more likely to assign a diagnosis of malignancy among those patients. Future studies need to pay stricter attention to blinding examiners to other information. In theory, this bias should also result in higher sensitivity as prevalence increases, although, because of the small number of studies, the small numbers of subjects in most studies, and the diametrically opposed findings of the two largest studies, we were unable to recognize any relationship.
In the two studies that addressed the effect of experience on test characteristics,60, 61 there appeared to be a relationship between increasing experience and increased sensitivity (specificity did not change); however, even attending physicians achieved a sensitivity of only 28 percent. Based upon the available literature, the bimanual examination does not appear to be a sensitive test for detecting the presence of adnexal masses and appears to have limited ability to discriminate benign from malignant masses. Although specificity was somewhat better, positive predictive values will still be quite low in low prevalence settings, as discussed under Question 7. This will, in turn, lower the positive predictive value of diagnostic tests performed in patients referred on the basis of a pelvic examination. These tests are discussed in detail in the next section.
Question 3 is: Among women with a palpable adnexal mass on exam or a mass identified by ultrasound/imaging, what is the sensitivity/specificity of various evaluation modalities including ultrasound (transvaginal ultrasound [TVUS], transabdominal ultrasound, color Doppler, two-dimensional [2D] versus three-dimensional [3D] ultrasound), computed tomography (CT) scan, magnetic resonance imaging (MRI) scan, and CA-125 levels for distinguishing benign from malignant masses?
This section considers the various evaluation modalities that are described in the literature and would be available to a clinician to aid in the work-up of an adnexal mass after it has been diagnosed. We focused our search on articles whose primary reference standard was histopathology. Ideally this reference standard would be applied to all test negatives. However, we accepted a repeat negative test (such as imaging) conducted at least 6 months later as an acceptable alternative. We did include some studies that were from population-based screening samples, and these will be considered in a separate section below. The evaluation modalities investigated can be divided into several general categories. Imaging studies will be divided by technological mode (ultrasound, MRI, etc.). Ultrasound studies will be divided into those that evaluate adnexal morphology (either by an explicit scoring system or by descriptive standards), those that measure vascular flow in the mass (Doppler), and those that evaluate these modalities in combination. Serum studies will focus primarily on CA-125, as this is the most common marker in both the literature and in clinical practice. However, other serum markers will be discussed as well. Finally, the studies for which it was possible to stratify by menopausal status will be discussed where appropriate.
Conventional grey scale ultrasonography is the most common imaging modality used to differentiate benign from malignant adnexal masses. Especially with the advent of high-frequency transvaginal probes, the quality of the images allows description of the gross anatomic features of the lesion. This is, however, limited by the great variability of macroscopic characteristics of both benign and malignant masses. Furthermore, the technique is operator dependent. To overcome these limitations, morphologic scoring systems have been developed. Such scoring systems are based on specific ultrasound parameters each with several scores according to determined features and with a cutoff value to categorize masses as either malignant or benign.
| Scoring system | Score | ||||
|---|---|---|---|---|---|
| Sassone et al., 1991159 | |||||
| Morphology | 1 | 2 | 3 | 4 | 5 |
| Inner wall structure | Smooth | Irregularities ≤ 3 mm | Papillarities > 3 mm | Not applicable, mostly solid | - |
| Wall thickness (mm) | Thin (≤ 3) | Thick (> 3) | Not applicable, mostly solid | - | - |
| Septa (mm) | None | Thin (≤ 3) | Thick (> 3) | - | |
| Echogenicity | Sonolucent | Low echogenicity | Low echogenicity with echogenic core; mixed echogenicity | - | High echogenicity |
| DePriest et al., 199336 | |||||
| Morphology | 0 | 1 | 2 | 3 | 4 |
| Cystic wall structure | Smooth (< 3 mm thick) | Smooth (> 3 mm thick) | Papillary projection (< 3 mm) | Papillary projection (≥ 3 mm) | Predominately solid |
| Volume (cm3) | < 10 | 10–50 | > 50–200 | > 200–500 | > 500 |
| Septum structure | No septa | Thin septa (< 3 mm) | Thick septa (3 mm to 1 cm) | Solid area (≥ 1 cm) | Predominately solid |
| Ferrazzi et al., 199793 | |||||
| Morphology | 1 | 2 | 3 | 4 | 5 |
| Wall | ≤ 3 mm | > 3 mm | - | Irregular, mostly solid | Irregular, not applicable |
| Septa | None | ≤ 3 mm | > 3 mm | ||
| Vegetations | None | - | - | ≤ 3 mm | > 3 mm |
| Echogenicity | Sonolucent | Low echogenicity | - | With echogenic areas | With heterogeneous echogenic areas, solid |
| Lerner et al., 1994131 | |||||
| Morphology | 0 | 1 | 2 | 3 | |
| Wall structure | Smooth or small irregularities < 3 mm | - | Solid or not applicable | Papillarities ≥ 3 mm | |
| Shadowing | Yes | No | - | - | |
| Septa | None or thin (< 3 mm) | Thick (≥ 3 mm) | - | - | |
| Echogenicity | Sonolucent or low-level echo or echogenic core | - | - | Mixed or high | |
Reproducibility of tests. Timmerman et al.196 evaluated the subjective assessment of ultrasonographic images for discriminating between malignant and benign masses. Three hundred consecutive patients were evaluated with TVUS by six different operators, and both diagnostic accuracy and interassessor agreement were calculated. The operators had varied experience in TVUS, from approximately 300 to 15,000 scans. The two most experienced operators agreed 92 percent of the time. The accuracy of the least experienced operators ranged from 82 to 87 percent (p = 0.0001). Overall, 65 percent of all the masses were correctly classified by all six operators. Interassessor agreement was greater between the most experienced operators as well (kappa = 0.852). When comparing experienced with less experienced operators, the kappa ranged from 0.581 to 0.737. This is similar to the kappa reported by Yamashita et al.192 among five operators, 0.62 (± 0.02) with TVUS. Interassessor agreement was not calculated between the less experienced operators. None of the included articles described operator experience, and only a few addressed interobserver variability. Although operator experience appears to correlate with accuracy, the specialty training of the ultrasonographer does not. In a meta-analysis of both morphologic and color Doppler tests in the evaluation of adnexal masses, Kinkel et al.197 found no difference between radiologists and gynecologists in the performance of ultrasound.
TVUS versus abdominal ultrasound. Of the 122 articles that evaluated adnexal masses via ultrasound (through either ultrasound morphology or Doppler measurements), only five articles exclusively used transabdominal imaging.52, 58, 116, 133, 198 Fifty-nine articles used TVUS exclusively and 51 used a combination of TVUS and abdominal ultrasound. There were seven articles for which the ultrasound modality was unknown. In the majority of the articles that used a combination of TVUS and abdominal ultrasound, TVUS was the “method of choice.” The most common reasons cited for also including abdominal ultrasound were patient refusal of transvaginal scans, virginity, poor image quality, and very large masses. Although a few articles reported how many women had which type of ultrasound, none reported their results in a way that permitted stratification by TVUS or abdominal ultrasound. We therefore elected to group all ultrasound studies together regardless of TVUS or abdominal imaging.
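For readers unfamiliar with the kappa statistic cited above, the following minimal sketch shows how agreement between two operators is corrected for chance. It assumes the unweighted Cohen's kappa for two raters (the report does not specify which variant the cited studies used), and the ten benign/malignant classifications are invented solely for illustration.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters classifying the same cases."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must classify the same non-empty set of cases")
    n = len(ratings_a)

    # Observed agreement: fraction of cases given the same label by both raters.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum((freq_a[label] / n) * (freq_b[label] / n)
                   for label in set(freq_a) | set(freq_b))

    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical classifications ("M" = malignant, "B" = benign); not study data.
operator_1 = ["M", "M", "B", "B", "B", "M", "B", "B", "B", "B"]
operator_2 = ["M", "B", "B", "B", "B", "M", "B", "B", "M", "B"]
print(f"kappa = {cohens_kappa(operator_1, operator_2):.3f}")
```

In this invented example the operators agree on 8 of 10 cases, yet kappa is only about 0.52, because much of that raw agreement would be expected by chance when most masses are benign.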
| Scoring system | Pooled sensitivity (95% CI) | Pooled specificity (95% CI) | Range of sensitivity in individual studies | Range of specificity in individual studies | References |
|---|---|---|---|---|---|
| Sassone | 0.86 (0.79 to 0.91) | 0.77 (0.73 to 0.81) | 0.65 to 1.00 | 0.65 to 0.93 | 43,54,68,69,83,93,130,131,154,159,160,163,179,193,199 |
| DePriest | 0.91 (0.84 to 0.95) | 0.68 (0.49 to 0.82) | 0.88 to 1.00 | 0.40 to 0.81 | 35,36,69,83,93,115 |
| Ferrazzi | 0.87 (0.80 to 0.92) | 0.81 (0.62 to 0.91) | 0.84 to 0.87 | 0.67 to 0.88 | 69,75,93 |
| Finkler | 0.82 (0.65 to 0.91) | 0.78 (0.59 to 0.91) | 0.52 to 0.88 | 0.55 to 0.70 | 56,62,63 |
| Other (note: significant heterogeneity in criteria used for diagnosis - see ROC curve) | 0.86 (0.82 to 0.89) | 0.83 (0.76 to 0.88) | 0.43 to 1.00 | 0.29 to 1.00 | 33,34,39,42,43,67,69,74,76–80,87,90,95,97,101,102,104,106,108,112,117,118,122,124–127,133–135,138–140,142,144,146,147,155,161,166,168,169,171,180,181,185,187,188,192,195 |
Abbreviations: CI = confidence interval; ROC = receiver operating characteristic
Three articles compared different scoring systems within the same study population. Caruso et al.83 examined 112 women with adnexal masses, comparing Sassone, DePriest, and Valentin scores. All performed similarly, displaying a sensitivity and NPV of 1.00, a range of specificity of 0.61 to 0.75, and a range of PPV of 0.35 to 0.48. Alcazar et al.69 also compared the performance of Sassone, DePriest, and Ferrazzi. There were no significant differences between these scoring systems when receiver operating characteristic (ROC) curves were compared. The area under the curve (AUC) was 0.89 for Sassone, 0.92 for DePriest, and 0.90 for Ferrazzi. Ferrazzi et al.93 evaluated 261 masses collected in three different centers. They compared ROC curves for scores based on the Sassone, Granberg, DePriest, and Lerner criteria with that of a scoring system they developed. The AUC ranged from 0.72 to 0.75 for the previously established systems. Their new scoring system (Ferrazzi) performed better, with an AUC of 0.84 (p < 0.0001). However, subsequent comparisons have not reaffirmed its superior performance. When the Ferrazzi scoring system was compared to both Sassone and DePriest,69 its performance was almost identical.
In spite of different designs, all the scoring systems performed similarly when compared within the same study population. It has been suggested that the poor performance of scoring systems with regard to their PPV is due to the misclassification of dermoid tumors.197 Dermoids share many of the features that are characterized as “malignant” in scoring systems. The Alcazar study proposes a scoring system that was developed in part to correct this. Although this scoring system does perform well in its initial application, it has not been independently verified. The authors conclude, “a completely reliable differentiation of malignant masses cannot be obtained by sonographic imaging alone.”69
| Study | Scoring System | Premenopausal | Postmenopausal | ||||||
|---|---|---|---|---|---|---|---|---|---|
| Sens | Spec | PPV | NPV | Sens | Spec | PPV | NPV | ||
| Finkler et al., 198856 | Finkler | 0.50 | 0.96 | 0.50 | 0.77 | 0.78 | 0.92 | 0.94 | 0.75 |
| Franchi et al., 199595 | Descriptive | 0.73 | 0.86 | 0.44 | 0.95 | 0.89 | 0.75 | 0.82 | 0.83 |
| Guerriero et al., 2002105 | Descriptive | 0.98 | 0.89 | 0.44 | 1.00 | 1.00 | 0.51 | 0.52 | 1.00 |
| Reles et al., 1997155 | Modified score | 1.00 | 0.79 | 0.46 | 1.00 | 0.87 | 0.89 | 0.77 | 0.94 |
| Roman et al., 199742 | Descriptive | 0.93 | 0.92 | 0.66 | 0.99 | 0.81 | 0.62 | 0.54 | 0.86 |
| Schelling et al., 2000161 | Descriptive | 0.91 | 0.84 | 0.29 | 0.99 | 1.00 | 0.73 | 0.62 | 1.00 |
| Alcazar et al., 200369 | Sassone | 1.00 | 0.88 | 0.50 | 1.00 | 0.61 | 0.88 | 0.81 | 0.73 |
| DePriest | 1.00 | 0.80 | 0.38 | 1.00 | 1.00 | 0.82 | 0.82 | 1.00 | |
| Ferrazzi | 1.00 | 0.84 | 0.43 | 1.00 | 0.82 | 0.82 | 0.79 | 0.85 | |
| Alcazar | 1.00 | 0.96 | 0.75 | 1.00 | 1.00 | 0.94 | 0.93 | 1.00 | |
| Menon et al., 2000145 | Descriptive | - | - | - | - | 1.00 | 0.94 | 0.24 | 1.00 |
| Schutter et al., 199462 | Finkler | - | - | - | - | 0.88 | 0.64 | 0.65 | 0.88 |
| Bromley et al., 199476 | Unique scoring | - | - | - | - | 0.91 | 0.52 | 0.52 | 0.92 |
| Schutter et al., 199863 | Finkler | - | - | - | - | 0.86 | 0.70 | 0.65 | 0.89 |
| Luxman et al., 1991133 | Descriptive | - | - | - | - | 0.93 | 0.55 | 0.45 | 0.95 |
| Kurjak et al., 1992126 | Unique scoring | - | - | - | - | 0.48 | 0.98 | 0.93 | 0.78 |
Abbreviations: NPV = negative predictive value; PPV = positive predictive value; Sens = sensitivity; Spec = specificity
Color Doppler scanning allows the assessment of tumor vascularity. Malignant neoplasms have active blood vessel creation (angiogenesis) compared to normal or benign neoplasms due, in part, to their increased metabolic activity. Overall, malignancies display an increased vascularity with decreased peripheral blood flow resistance and increased blood flow velocity compared with benign tissue.152, 200 Doppler signal analysis can separate high-resistance and low-resistance vessels and has therefore been investigated as a separate test modality, as well as in combination with ultrasound morphological evaluation in the evaluation of adnexal masses.
The most common flow criteria are the resistance index (RI), the pulsatility index (PI), and the maximum systolic velocity. RI is defined as the difference between peak systolic and maximum end-diastolic flow velocity, divided by peak systolic flow velocity. Usually the lowest RI from a series of measurements in different arteries is reported. PI is defined as the difference between peak systolic and end-diastolic flow velocity, divided by the time-averaged flow velocity. The maximum systolic velocity is the maximum flow recorded in any visualized artery.
In order to make a measurement of RI, PI, or maximum systolic velocity, an artery must be identified on ultrasound. The inability to identify an artery in the mass means that the test cannot be performed. Therefore, not every individual included in the study population is captured with the assessment of these color Doppler modalities. Another limitation of these measurements is that the range observed in malignant masses overlaps with that observed in benign masses. For example, in Lin et al.,132 discussed in more detail below, the RI for malignant masses ranged from 0.23 to 0.82. Although they did not report a range for the benign masses, there were eight benign tumors with an RI < 0.4. This overlap limits the effectiveness of any threshold and, perhaps, contributes to the different thresholds reported in the literature.
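The definitions of RI and PI given above translate directly into two short functions. The sketch below is illustrative only: the function names and the velocity values (in cm/s) are hypothetical, and taking the minimum RI across arteries follows the reporting convention described in the text.

```python
def resistance_index(peak_systolic, end_diastolic):
    """RI = (peak systolic - end-diastolic) / peak systolic."""
    if peak_systolic <= 0:
        raise ValueError("peak systolic velocity must be positive")
    return (peak_systolic - end_diastolic) / peak_systolic


def pulsatility_index(peak_systolic, end_diastolic, time_averaged):
    """PI = (peak systolic - end-diastolic) / time-averaged velocity."""
    if time_averaged <= 0:
        raise ValueError("time-averaged velocity must be positive")
    return (peak_systolic - end_diastolic) / time_averaged


# Hypothetical measurements from three arteries in one mass:
# (peak systolic, end-diastolic, time-averaged), all in cm/s.
arteries = [(30.0, 18.0, 22.0), (25.0, 10.0, 15.0), (40.0, 12.0, 20.0)]

ri_values = [resistance_index(psv, edv) for psv, edv, _ in arteries]
pi_values = [pulsatility_index(psv, edv, tav) for psv, edv, tav in arteries]

# The lowest RI across the sampled arteries is the value usually reported.
print(f"Lowest RI: {min(ri_values):.2f}")
print(f"PI per artery: {[round(v, 2) for v in pi_values]}")
```

Both indices are ratios of velocities, so they are dimensionless and do not depend on the angle-corrected absolute velocity in the way that maximum systolic velocity does.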
Reproducibility of tests. Timmerman et al.196 (discussed above under ultrasound morphology) included Doppler measurements in its analysis of interobserver variability and experience. In short, operators with more experience (15,000 versus 300 scans) had greater accuracy (92 percent versus 82 to 87 percent, p = 0.0001). Interassessor agreement was also greater between the most experienced operators (kappa = 0.852) compared with the less experienced operators (range 0.581 to 0.737). None of the articles evaluating color Doppler described operator experience, nor did any address interobserver variability specifically with regard to Doppler measurement.
Trials identified. Fifty-six articles were identified that described color Doppler analysis, comprising a description of 65 tests. Thirty-two articles evaluated RI, 20 PI, and six the maximum systolic velocity. These are the most common flow criteria measured in the literature and presumably in clinical practice as well. Other Doppler parameters were described in the literature, sometimes in conjunction with RI, PI, or maximum systolic velocity, but were not included in this table. The other articles included 10 that involved the visualization of flow within the mass,70, 71, 104, 105, 119, 137, 160, 161, 168, 182 two that involved counting the total number of arteries (either > 4 arteries152 or > 3 arteries199), and one that measured the absence of a diastolic notch.137
| Doppler method | Pooled sensitivity (95% CI) | Pooled specificity (95% CI) | Range of sensitivity in individual studies | Range of specificity in individual studies | References |
|---|---|---|---|---|---|
| Resistance index | 0.76 (0.68 to 0.73) | 0.89 (0.84 to 0.92) | 0.19 to 1.00 | 0.53 to 1.00 | 43,68,70,75,76,79,81,86,88,95,106,107,117,124–126,128,130,132,141,146,152,168,172,175,176,179,184,190,193,199,201 |
| Pulsatility index | 0.79 (0.73 to 0.83) | 0.74 (0.64 to 0.81) | 0.57 to 0.95 | 0.32 to 0.97 | 73,79,81,94,103,109,115,120,154,155,158,163,168,169,179,182,184,188,199,201 |
| Maximum systolic velocity | 0.76 (0.61 to 0.86) | 0.83 (0.66 to 0.93) | 0.48 to 0.94 | 0.43 to 0.97 | 68,79,107,109,152,199 |
Lin et al.132 evaluated 370 women with adnexal masses who were scheduled for surgery at a single institution. They reported outcomes based on RI cutpoints of 0.4, 0.5, and 0.6. For RI < 0.4, the sensitivity, specificity, PPV, and NPV were 0.69, 0.97, 0.89, and 0.91, respectively. For RI < 0.5, they were 0.79, 0.92, 0.77, and 0.93. For RI < 0.6, they were 0.91, 0.86, 0.68, and 0.98. The authors conclude that the 0.4 cutpoint yields the highest concordance rate between Doppler prediction and histopathologic diagnosis. This conclusion, however, is based more on clinical impression, as ROC curve analysis was not performed.
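One way to see how the choice of cutpoint depends on the summary measure is to tabulate the three reported cutpoints side by side. The sketch below uses the sensitivity and specificity values quoted above; Youden's J (sensitivity + specificity − 1) is an assumption introduced here for illustration and was not used by Lin et al. or in this report.

```python
# Sensitivity and specificity at each RI cutpoint, as quoted from Lin et al.
cutpoints = {
    0.4: {"sensitivity": 0.69, "specificity": 0.97},
    0.5: {"sensitivity": 0.79, "specificity": 0.92},
    0.6: {"sensitivity": 0.91, "specificity": 0.86},
}


def youden_j(sensitivity, specificity):
    """Youden's J statistic; an assumed summary, not the authors' criterion."""
    return sensitivity + specificity - 1.0


j_by_cutpoint = {
    cutpoint: round(youden_j(v["sensitivity"], v["specificity"]), 2)
    for cutpoint, v in cutpoints.items()
}

for cutpoint, j in sorted(j_by_cutpoint.items()):
    print(f"RI < {cutpoint}: J = {j:.2f}")

best_by_j = max(j_by_cutpoint, key=j_by_cutpoint.get)
print(f"Cutpoint with the highest J: RI < {best_by_j}")
```

A measure that weights sensitivity and specificity equally favors a different cutpoint than the concordance rate the authors relied on, which depends on the proportion of malignancies in the study population. This illustrates why the preferred threshold can vary between reports.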
| Study (N) | Test | Sensitivity | Specificity |
|---|---|---|---|
| Prompeler et al., 1996152 (212) | Total number of arteries > 4 (postmenopausal women only) | 0.82 | 0.92 |
| Valentin, 1997182 (151) | Color lakes visible on Doppler | 0.88 | 0.67 |
| Maly et al., 1995137 (102) | Demonstrable blood vessels | 0.95 | 0.30 |
| Schelling et al., 2000161 (257) | Central vascularity on Doppler in solid component | 0.93 | 0.94 |
| Stein et al., 1995168 (170 masses) | Internal flow within solid component or septation | 0.77 | 0.69 |
| Guerriero et al., 2002105 (826 masses) | Arterial flow visualized in an echogenic structure or irregular solid portion | 0.95 | 0.92 |
| Anandakumar et al., 199670 (146) | “Continuously fluctuating” vessels with turbulent flow | 0.77 | 0.68 |
| Antonic and Rakar, 199571 (71) | Color flow present | 0.89 | 0.47 |
| Guerriero et al., 2005104 (424) | Color flow present in “echogenic structure” | 1.00 | 0.91 |
| Juhasz et al., 1990119 (147) | Color flow present in mass | 0.96 | 0.84 |
| Study (N) | Test | Premenopausal Sens | Premenopausal Spec | Premenopausal PPV | Premenopausal NPV | Postmenopausal Sens | Postmenopausal Spec | Postmenopausal PPV | Postmenopausal NPV |
|---|---|---|---|---|---|---|---|---|---|
| Franchi et al., 199595 (129) | RI < 0.65 | 0.82 | 0.72 | 0.31 | 0.96 | 0.86 | 0.75 | 0.82 | 0.83 |
| Guerriero et al., 2002105 (826 masses) | Arterial flow visualized in echogenic structure or irregular solid portion | 0.94 | 0.96 | 0.67 | 1.00 | 0.96 | 0.77 | 0.69 | 0.97 |
| Reles et al., 1997155 (98) | PI ≤ 1.1 | 0.80 | 0.67 | 0.36 | 0.93 | 0.93 | 0.83 | 0.76 | 0.91 |
| Schelling et al., 2000161 (257) | Presence of central vascularization on Doppler | 0.91 | 0.94 | 0.53 | 0.99 | 0.93 | 0.92 | 0.84 | 0.97 |
| Prompeler et al., 1996152 (212) | Total number of arteries > 4 | 0.85 | 0.71 | 0.36 | 0.96 | 0.82 | 0.82 | 0.76 | 0.86 |
| | RI > 0.5 | 0.84 | 0.47 | 0.23 | 0.94 | 0.82 | 0.69 | 0.66 | 0.84 |
| | Maximum systolic velocity > 30 cm/s | 0.92 | 0.65 | 0.33 | 0.98 | 0.76 | 0.88 | 0.82 | 0.84 |
| Strigini et al., 1996169 (109) | PI < 1 | 0.83 | 0.73 | 0.21 | 0.98 | 0.85 | 0.81 | 0.73 | 0.90 |
| Salem et al., 1994158 (109 masses) | PI < 1 | 1.00 | 0.84 | 0.20 | 1.00 | 0.73 | 0.71 | 0.47 | 0.88 |
| Szpurek et al., 2004170 (464) | Doppler subjective index ≥ 4 | 0.82 | 0.93 | 0.79 | 0.94 | 0.92 | 1.00 | 1.00 | 0.82 |
| Kurjak et al., 1992126 (83) | RI < 0.41 | - | - | - | - | 0.96 | 0.95 | 0.90 | 0.98 |
| | randomly separate vessels | - | - | - | - | 0.90 | 0.98 | 0.96 | 0.95 |
| Bromley et al., 199476 (33) | RI < 0.6 | - | - | - | - | 0.66 | 0.81 | 0.67 | 0.81 |
| Antonic and Rakar, 199571 (71) | Presence of color flow | 1.00 | 0.36 | 0.11 | 1.00 | 0.87 | 0.79 | 0.81 | 0.85 |
| Guerriero et al., 1998103 (192 masses) | PI ≤ 1 | 0.86 | 0.46 | 0.08 | 0.98 | 0.88 | 0.52 | 0.66 | 0.81 |
Abbreviations: NPV = negative predictive value; PI = pulsatility index; PPV = positive predictive value; RI = resistance index; Sens = sensitivity; Spec = specificity
A limiting feature of ultrasound morphologic assessment is thought to be its high rate of false positive test results.196 Color Doppler, in contrast, has displayed a slightly higher PPV, especially in the earlier studies.197 There have, therefore, been attempts to combine ultrasound morphology and Doppler studies in a single test.
Trials identified. Of all the articles that investigated the use of either ultrasound morphology or color Doppler in the evaluation of an adnexal mass, nine articles containing a total of 13 tests described a combination ultrasound morphology and Doppler modality.65, 79, 91, 100, 123–125, 130, 201
Stratification by menopausal status. Two studies analyzed combined ultrasound morphology and Doppler in populations consisting entirely of postmenopausal patients. Kurjak et al.126 reported a combined sensitivity, specificity, PPV, and NPV of 0.90, 0.94, 0.90, and 0.94, respectively. Their combined test consisted of RI < 0.41 and an ultrasound morphology scoring system unique to them. Vuento et al.186 in a population-based screening study reported a sensitivity, specificity, PPV, and NPV of 1.00, 0.83, 0.006, and 1.00, respectively. Given that these two studies are of greatly different design, it is hard to compare them directly. Comparing Kurjak et al. to the range of combined ultrasound and Doppler studies, it appears that the test performs better in the postmenopausal group. However, this test performance may reflect patient selection criteria for the study that were not clearly explained. Combination modalities as a screening tool for ovarian cancer had a high false positive rate (as seen in the PPV of 0.006 reported by Vuento et al.186).
| Study (number of persons) | Test | Sensitivity | Specificity | PPV | NPV |
|---|---|---|---|---|---|
| Alcazar et al., 200367 (41 masses) | 2D | 0.90 | 0.61 | 0.68 | 0.88 |
| | 3D | 1.00 | 0.78 | 0.81 | 1.00 |
| | Presence of one of the following fulfilled criteria for mass: > 3 mm wall, > 3 mm septum, > 3 mm papillary projections, solid areas or echogenicity | | | | |
| Kurjak and Kupesic, 1999123 (120) | 2D | 0.91 | 0.97 | 0.77 | 0.99 |
| | 3D | 1.00 | 0.99 | 0.92 | 1.00 |
| | Both used a unique scoring system that included Doppler measurements | | | | |
| Kurjak et al., 2000124 (90) | 2D morphology | 0.67 | 0.94 | 0.55 | 0.96 |
| | 2D Doppler | 0.89 | 0.95 | 0.67 | 0.99 |
| | 2D combined | 0.89 | 0.98 | 0.80 | 0.99 |
| | 3D morphology | 0.78 | 0.98 | 0.78 | 0.98 |
| | 3D Doppler | 0.89 | 0.98 | 0.80 | 0.99 |
| | 3D combined | 1.00 | 0.99 | 0.99 | 1.00 |
| | Both used a unique scoring system for morphological assessment. Doppler for 2D was RI ≤ 0.42, for 3D it was “complex” “chaotic” vessel arrangement | | | | |
| Alcazar and Castillo, 200565 (69 masses) | 2D | 0.98 | 0.88 | 0.94 | 0.96 |
| | 3D | 0.98 | 0.79 | 0.90 | 0.95 |
| | Presence of at least one of the following fulfilled criteria for “complex mass”: > 3 mm wall, > 3 mm papillary projection, solid areas or purely solid echogenicity. Doppler flow in mass also used in test, but unclear how | | | | |
Abbreviations: 2D = two-dimensional; 3D = three-dimensional; NPV = negative predictive value; PPV = positive predictive value
Although ultrasound remains the most common imaging modality in the evaluation and diagnosis of adnexal masses, newer technologies such as MRI, CT, and positron emission tomography (PET) have been studied as well. These modalities may not be as readily available to the clinician as ultrasound, and there is less literature devoted to them than to ultrasound; however, they are included in this review because of growing clinical and research interest in their use. Further, despite refinements in ultrasound morphology scoring systems or Doppler measurements, the overall performance of ultrasound in the evaluation of the adnexal mass may be relatively fixed by the technology itself. Therefore it is necessary to investigate other imaging modalities and see how they compare with ultrasound.
Reproducibility of tests. Unlike ultrasound images, MRI, CT, and PET images are not operator dependent in their acquisition. There is, however, the potential for interobserver variability in their analysis. There are no standardized morphological scoring systems for any imaging modality other than ultrasound. We identified two articles that directly addressed the issue of test reproducibility for MRI and/or CT in the evaluation of adnexal masses. Buist et al.78 reported a series of 64 women who were evaluated by both MRI and CT, with the images reviewed by two different radiologists. They reported a kappa value for the interobserver reliability for distinguishing between benign and malignant disease of 0.28 for CT and 0.41 for MRI. Yamashita et al.192 also calculated kappa values for interobserver variability among five radiologists. They showed far greater agreement: for precontrast MRI, kappa = 0.71 (± 0.02); for contrast-enhanced MRI, kappa = 0.73 (± 0.02).
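The kappa statistic cited above corrects the raw agreement between two readers for the agreement expected by chance. The following is a minimal sketch of Cohen's kappa for two readers making a benign/malignant call. The function name and the counts in the usage line are hypothetical illustrations and are not taken from Buist et al. or Yamashita et al. (the latter involved five radiologists, for which a multi-rater statistic would be needed).

```python
def cohen_kappa(both_malignant, a_only_malignant, b_only_malignant, both_benign):
    """Cohen's kappa for two readers classifying masses as benign or malignant.

    Arguments are the four cells of the reader-A by reader-B agreement table.
    """
    n = both_malignant + a_only_malignant + b_only_malignant + both_benign
    if n == 0:
        raise ValueError("agreement table is empty")

    # Observed proportion of cases on which the two readers agree.
    observed = (both_malignant + both_benign) / n

    # Marginal proportion of "malignant" calls for each reader.
    a_malignant = (both_malignant + a_only_malignant) / n
    b_malignant = (both_malignant + b_only_malignant) / n

    # Agreement expected by chance if the readers were independent.
    expected = a_malignant * b_malignant + (1 - a_malignant) * (1 - b_malignant)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")

    return (observed - expected) / (1 - expected)


# Hypothetical counts for illustration only (50 masses read by two radiologists).
kappa = cohen_kappa(both_malignant=20, a_only_malignant=5,
                    b_only_malignant=10, both_benign=15)
print(f"kappa = {kappa:.2f}")
```

With these illustrative counts the readers agree on 70 percent of cases, but half of that agreement is expected by chance, which is why kappa is considerably lower than the raw agreement.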
Trials identified. We identified 17 articles comprising 22 tests. There were 15 articles for MRI, three for CT, three for PET, and one that used a combined CT/MRI test. There were two articles that investigated nuclear medicine technologies in the evaluation of adnexal masses. These, however, were not included in the review given the experimental nature of such tests at this time. The PET studies were all performed using the tracer 18-Fluorodeoxyglucose (FDG), with the test measuring uptake of FDG in the lesion.
| Imaging modality | Pooled sensitivity (95% CI) | Pooled specificity (95% CI) | Range of sensitivity in individual studies | Range of specificity in individual studies | References |
|---|---|---|---|---|---|
| MRI | 0.91 (0.86 to 0.94) | 0.87 (0.83 to 0.90) | 0.67 to 1.00 | 0.77 to 1.00 | 44,78,91,100,106,111,112,118,121,122,129,144,156,166,192 |
| CT | 0.90 (0.83 to 0.94) | 0.75 (0.36 to 0.94) | 0.86 to 0.96 | 0.35 to 0.89 | 39,78,129 |
| FDG-PET | 0.67 (0.52 to 0.79) | 0.79 (0.70 to 0.85) | 0.58 to 0.78 | 0.76 to 1.00 | 91,100,121 |
Abbreviations: CI = confidence interval; CT = computed tomography; FDG = 18-Fluorodeoxyglucose; MRI = magnetic resonance imaging; PET = positron emission tomography
| Study (N) | Test | Sensitivity | Specificity | PPV | NPV |
|---|---|---|---|---|---|
| Medl et al., 1995144 (73) | Ultrasound morphology (descriptive) | 0.81 | 0.73 | 0.79 | 0.76 |
| | MRI descriptive | 0.97 | 0.83 | 0.88 | 0.96 |
| Yamashita et al., 1995192 (72 women, 80 masses) | Ultrasound morphology (unique score) | 0.89 | 0.84 | 0.63 | 0.96 |
| | MRI precontrast | 0.78 | 0.93 | 0.79 | 0.93 |
| | MRI contrast enhanced | 0.91 | 0.93 | 0.81 | 0.97 |
| Fenchel et al., 200291 (99) | Ultrasound combined morphology and Doppler | 0.92 | 0.60 | 0.24 | 0.98 |
| | MRI | 0.83 | 0.83 | 0.40 | 0.97 |
| | FDG-PET | 0.58 | 0.76 | 0.25 | 0.93 |
| Jain et al., 1993118 (32) | Ultrasound morphology (descriptive) | 1.00 | 0.60 | 0.18 | 1.00 |
| | MRI | 0.67 | 1.00 | 1.00 | 0.97 |
| Kawahara et al., 2004121 (38) | MRI descriptive | 0.91 | 0.87 | 0.91 | 0.87 |
| | FDG-PET | 0.78 | 1.00 | 1.00 | 0.75 |
| Komatsu et al., 1996122 (82) | Ultrasound morphology (unique score) | 1.00 | 0.46 | 0.57 | 1.00 |
| | MRI descriptive (n = 59) | 0.91 | 0.88 | 0.91 | 0.88 |
| Lin et al., 199339 (80) | Ultrasound morphology (descriptive) | 0.83 | 0.50 | 0.58 | 0.79 |
| | CT descriptive | 0.86 | 0.36 | 0.74 | 0.56 |
| Buist et al., 199478 (64) | CT reviewer a | 0.96 | 0.44 | 0.72 | 0.89 |
| | CT reviewer b | 0.89 | 0.83 | 0.89 | 0.83 |
| | MRI reviewer a | 0.96 | 0.33 | 0.68 | 0.86 |
| | MRI reviewer b | 0.96 | 0.94 | 0.96 | 0.94 |
| | Ultrasound morphology (NR) | 0.89 | 0.44 | 0.71 | 0.73 |
| Grab et al., 2000100 (101) | Ultrasound combination morphology and Doppler | 0.92 | 0.60 | 0.23 | 0.98 |
| | MRI descriptive | 0.83 | 0.84 | 0.42 | 0.97 |
| | FDG-PET | 0.58 | 0.80 | 0.28 | 0.93 |
| Hata et al., 1992106 (63) | Ultrasound (NR) | 0.85 | 0.69 | 0.68 | 0.86 |
| | MRI score | 0.67 | 0.97 | 0.95 | 0.80 |
| Huber et al., 2002112 (93) | Ultrasound morphology (descriptive) | 0.85 | 0.73 | 0.87 | 0.71 |
| | MRI descriptive | 0.89 | 0.86 | 0.93 | 0.79 |
| Reuter et al., 1998156 (65) | Ultrasound morphology (descriptive) | 1.00 | 0.66 | 0.40 | 1.00 |
| | MRI descriptive | 1.00 | 0.78 | 0.50 | 1.00 |
| Sohaib et al., 2005166 (72) | Ultrasound morphology (descriptive) | 1.00 | 0.40 | 0.53 | 1.00 |
| | MRI descriptive | 0.97 | 0.84 | 0.80 | 0.97 |
Abbreviations: CT = computed tomography; FDG = 18-Fluorodeoxyglucose; MRI = magnetic resonance imaging; NR = not reported; PET = positron emission tomography
Stratification by menopausal status. None of the studies describing MRI, CT, or PET reported results either by menopausal status or in data that would allow menopausal status to be stratified.
The concept of using tumor markers as either screening or diagnostic tests for ovarian cancer depends upon identifying an abnormal level of a particular marker in serum, reflecting a systemic effect of disease in the ovary. The most extensively investigated ovarian cancer-associated antigen is CA-125. This antigen is recognized by a murine monoclonal antibody produced using an ovarian cancer cell line as an immunogen. Elevated levels are detected in approximately 80 percent of ovarian carcinomas at the time of diagnosis;136, 167 however, elevated serum levels have also been reported in a variety of benign conditions, potentially affecting specificity. In addition, CA-125 is not as commonly elevated in non-epithelial ovarian cancers. Because these stromal and germ cell tumors are proportionately more common in premenopausal women, CA-125 may not be as sensitive in premenopausal women.3
Reproducibility of tests. Only one study included specific information regarding the inter- and intra-assay coefficients of variation.66 They were < 7.5 percent and < 5.3 percent, respectively. The sensitivity of the assay in this study was < 5 U/ml.
Trials identified. We identified 66 studies that investigated the use of CA-125 as a serum marker in the evaluation of an adnexal mass. One study was a population-based screening study that employed CA-125 as part of the screening triage.51 Forty-six studies in total used 35 U/ml as a threshold; in 37 it was the only threshold used, whereas in five, both 35 U/ml and another threshold were reported for the same patient population. There were 24 studies that reported a threshold other than 35 U/ml, ranging from > 20 U/ml to > 100 U/ml. In addition to the five studies that reported 35 U/ml and an additional level, there were four other studies that reported two threshold levels within the same study population. All but one of the studies were case series. Although there were a few studies that compared CA-125 results from operative cases with normal controls, only the data from the operative series were included in the 2-by-2 tables. The clinical presentation of the cases was rarely described. Some of the series were drawn from oncology clinics.
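For readers less familiar with the 2-by-2 tables referred to throughout this chapter, the four test characteristics reported in the evidence tables follow directly from the four cell counts. The sketch below shows the standard definitions; the function name and the counts in the usage line are hypothetical and do not correspond to any included study.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV, and NPV from a 2-by-2 table.

    tp: test positive, malignant on histology
    fp: test positive, benign on histology
    fn: test negative, malignant on histology
    tn: test negative, benign on histology
    A metric is returned as None when its denominator is zero.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),  # malignant cases correctly flagged
        "specificity": ratio(tn, tn + fp),  # benign cases correctly cleared
        "ppv": ratio(tp, tp + fp),          # positives that are truly malignant
        "npv": ratio(tn, tn + fn),          # negatives that are truly benign
    }


# Hypothetical counts for illustration only.
print(diagnostic_metrics(tp=40, fp=10, fn=10, tn=40))
```

Sensitivity and specificity are computed within the disease columns, whereas PPV and NPV are computed within the test-result rows, which is why the predictive values shift with the proportion of malignant cases in a series.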
The only screening study identified for CA-125 in our literature search51 included 2000 women. The sensitivity in this study was 1.00, specificity 0.99, PPV 0.17, and NPV 1.00. Few of the other studies achieved this degree of sensitivity, specificity, or NPV, although overall the PPV was higher. In the presence of an adnexal mass, the false negative rate increases compared with a screened population reflecting the fact that benign gynecologic disease can cause elevation of CA-125.
The most common threshold other than 35 U/ml was 65 U/ml. Most of the studies using 65 U/ml as a threshold were from Asia. The probable heterogeneity of study populations limits comparisons between these levels. Looking at the studies that reported results for different levels of CA-125 within the same study population,87, 98, 134, 136, 147, 148, 162, 167, 180 at the higher threshold the specificity and PPV are higher, the sensitivity is lower, and the NPV is only slightly lower.
| Study | Threshold (U/ml) | Premenopausal Sens | Premenopausal Spec | Premenopausal PPV | Premenopausal NPV | Postmenopausal Sens | Postmenopausal Spec | Postmenopausal PPV | Postmenopausal NPV |
|---|---|---|---|---|---|---|---|---|---|
| Malkasian et al., 1988136 | > 100 | 0.60 | 0.95 | 0.67 | 0.93 | 0.77 | 0.97 | 0.98 | 0.72 |
| | > 35 | 0.60 | 0.73 | 0.29 | 0.91 | 0.81 | 0.91 | 0.94 | 0.74 |
| Gadducci et al., 199696 | > 65 | 0.67 | 0.91 | 0.67 | 0.91 | 0.80 | 1.00 | 1.00 | 0.69 |
| Gadducci et al., 199298 | > 64 | 0.50 | 0.26 | 0.05 | 0.86 | 0.81 | 0.86 | 0.88 | 0.78 |
| Franchi et al., 199595 | > 39 | 0.73 | 0.64 | 0.24 | 0.94 | 0.77 | 0.85 | 0.87 | 0.74 |
| Patsner and Mann, 1988151 | > 35 | 0.63 | 0.78 | 0.66 | 0.76 | 0.77 | 0.81 | 0.85 | 0.72 |
| Dowd et al., 199355 | > 35 | 0.74 | 0.73 | 0.60 | 0.84 | 0.86 | 0.82 | 0.90 | 0.76 |
| Finkler et al., 198856 | > 35 | 0.50 | 0.69 | 0.35 | 0.81 | 0.84 | 0.92 | 0.94 | 0.80 |
| Schutter et al., 199863 | > 35 | - | - | - | - | 0.69 | 0.84 | 0.73 | 0.81 |
| Antonic and Rakar, 199571 | > 35 | 0.67 | 0.92 | 0.40 | 0.97 | 0.87 | 0.93 | 0.93 | 0.87 |
Abbreviations: CA-125 = cancer antigen 125; NPV = negative predictive value; PPV = positive predictive value; Sens = sensitivity; Spec = specificity
The incidence of ovarian cancer is higher in postmenopausal women relative to benign gynecologic conditions, which also increase CA-125 levels. This should translate into a greater accuracy of CA-125 test performance in this population. Indeed, all test parameters except NPV are both higher and the range narrower in postmenopausal women. The lowest PPV was 0.73, with the remaining above 0.85, which is significantly higher than the range of PPV observed in studies that did not stratify their results by menopausal status. The NPV is lower in the postmenopausal population, despite the higher sensitivity, because of a greater prevalence of cancer in this population. CA-125 is consistently more helpful in discriminating benign from malignant lesions in postmenopausal women compared with premenopausal women.
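The role of prevalence in this pattern can be illustrated by back-calculating the proportion of malignant cases implied by a reported sensitivity, specificity, and PPV. The sketch below rearranges Bayes' rule for that purpose. It assumes the three published values come from the same 2-by-2 table and are internally consistent; because they are rounded to two decimals, the result is only a rough estimate, and the function name is a hypothetical label.

```python
def implied_prevalence(sens, spec, ppv):
    """Prevalence implied by sensitivity, specificity, and PPV.

    From PPV = sens*p / (sens*p + (1 - spec)*(1 - p)), the odds of disease are
    p / (1 - p) = PPV * (1 - spec) / ((1 - PPV) * sens).
    """
    if not 0 < ppv < 1:
        raise ValueError("PPV must lie strictly between 0 and 1")
    if not 0 < sens <= 1 or not 0 <= spec <= 1:
        raise ValueError("sensitivity must be in (0, 1] and specificity in [0, 1]")
    odds = ppv * (1 - spec) / ((1 - ppv) * sens)
    return odds / (1 + odds)


# Values for CA-125 > 35 U/ml from the Malkasian et al. row of the table above.
pre = implied_prevalence(sens=0.60, spec=0.73, ppv=0.29)
post = implied_prevalence(sens=0.81, spec=0.91, ppv=0.94)
print(f"implied prevalence, premenopausal:  {pre:.2f}")
print(f"implied prevalence, postmenopausal: {post:.2f}")
```

Under these assumptions, the implied proportion of malignant cases is roughly 0.16 in the premenopausal group and roughly 0.64 in the postmenopausal group, which is consistent with the higher PPV and lower NPV reported for postmenopausal women.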
The fact that CA-125 is < 35 U/ml in 20 percent of women with early stage ovarian cancer has motivated research into other serum-based tests. We identified 13 articles that described a total of 17 different sera studies in women with an adnexal mass. Some studies investigated the performance of other tumor-associated antigens such as tumor-associated glycoprotein 72 (TAG-72) or CA-19-9. Although most of the tumor-associated antigens achieved specificities of approximately 0.82 to 0.92, the sensitivity, PPV, and NPV were overall lower than those reported for CA-125. Two studies investigated carcinoembryonic antigen (CEA),114, 157 and although they employed slightly different thresholds, the sensitivities reported (0.16 and 0.22) were so poor that both sets of authors concluded that assessment of CEA in the evaluation of an adnexal mass is not helpful. Roman et al.42 investigated whether the addition of human chorionic gonadotropin (hCG), alpha-fetoprotein (AFP), and lactate dehydrogenase (LDH) to CA-125 improved the test performance. In their series the sensitivity of CA-125 alone was 0.67, the specificity was 0.71, PPV 0.35, and NPV 0.90. The addition of the other three tests changed the results very little. For the combined test (defined as any of the markers positive), the sensitivity was 0.72, the specificity was 0.70, PPV 0.36, and NPV 0.94. AFP, hCG, and LDH do not appear to improve the diagnostic performance of CA-125.
Gadducci et al. investigated the role of D-Dimer in a series of 121 women with adnexal masses.96 The sensitivity for D-Dimer alone was 0.91, the specificity was 0.83, the PPV 0.82, and the NPV 0.92, making D-Dimer one of the best performing tests identified in our review. Stratifying by menopausal status showed better performance in premenopausal women, where the sensitivity, specificity, PPV, and NPV were 1.00, 0.91, 0.75, and 1.00, respectively (n = 57). For postmenopausal women they were 0.89, 0.65, 0.85, and 0.72, respectively. Chalas et al. investigated the role of elevated platelets in 241 women.31 The specificity and PPV were similar to those reported for D-Dimer (0.84 and 0.83, respectively), but the sensitivity and NPV were substantially lower (0.56 and 0.59). These two studies are intriguing, but the results need to be confirmed in future studies to better assess their possible contribution to the evaluation of adnexal masses.
Aside from D-Dimer, none of the studies contained information making stratification by menopausal status possible. In conclusion, none of the sera markers investigated in this review appears to perform better than CA-125, with the possible exception of D-Dimer in the premenopausal population.
| Study | N | Test | Sensitivity | Specificity | PPV | NPV |
|---|---|---|---|---|---|---|
| Marchetti et al., 2002140 | 4350 | Ultrasound screening: criteria | ||||
| NR | 1.00 | 0.37 | 0.07 | 1.00 | ||
| Operative cases only (n = 45) | ||||||
| Assuming all negatives were truly negative (n = 4359) | 1.00 | 0.96 | 0.01 | 1.00 | ||
| Menon et al., 2000145 | 1027 | Ultrasound | ||||
| Volume > 8.8 ml | 0.90 | 0.94 | 0.21 | 1.00 | ||
| Abnormal morphology | 1.00 | 0.94 | 0.24 | 1.00 | ||
| Complex morphology | 0.84 | 0.97 | 0.37 | 0.98 | ||
| Vuento et al., 1995186 | 1364 | Combined ultrasound morphology and Doppler (PI < 1.0) | 1.00 | 0.88 | 0.006 | 1.00 |
| DePriest et al., 199336 | 24/3220 | Ultrasound morphology (DePriest) | 1.00 | 0.71 | 0.33 | 1.00 |
| Operative cases only (n = 24) | ||||||
| Kurjak et al., 1992126 | 83/1000 | RI < 0.41 | 0.96 | 0.95 | 0.90 | 0.98 |
| Ultrasound morphology (unique score) | 0.48 | 0.98 | 0.93 | 0.78 | ||
| Presence of random vessels | 0.90 | 0.98 | 0.96 | 0.95 | ||
| Combined ultrasound and Doppler | 0.90 | 0.94 | 0.90 | 0.94 | ||
| Kurjak et al., 1994127 | 32/5013 | Ultrasound “persistent mass” | 1.00 | 0.97 | 0.80 | 1.00 |
| Ultrasound assuming all test negatives true negatives | 1.00 | 0.99 | 0.80 | 1.00 | ||
| Kurjak et al., 1991128 | 680/ 8620 | RI < 0.4 | 0.96 | 0.99 | 0.98 | 1.00 |
| DePriest et al., 199734 | 90/6470 | Ultrasound morphology (DePriest) (n = 90) | 1.00 | 0.59 | 0.17 | 1.00 |
| Assuming all test negatives true negatives (n = 6470) | 0.86 | 0.99 | 0.07 | 1.00 | ||
| Adonakis et al., 199651 | 2000/ 2000 | CA-125 > 35 | 1.00 | 0.99 | 0.17 | 1.00 |
| PE “palpable mass” | 0.67 | 0.97 | 0.03 | 1.00 | ||
| Andolf et al., 199052 | 801 | Combined ultrasound and BME (both positive for test to be positive) | 1.00 | 0.94 | 0.11 | 1.00 |
| Ultrasound and BME criteria not well described | ||||||
| Jacobs et al., 198858 | 1010 | CA-125 > 30 U/ml | 1.00 | 0.97 | 0.03 | 1.00 |
| BME | 1.00 | 0.97 | 0.04 | 1.00 | ||
| Ultrasound (ovarian volume > 8.8ml) (n = 58 for ultrasound) | 1.00 | 0.74 | 0.08 | 1.00 | ||
| Tailor et al., 2003171 | 2500 | Ultrasound morphology (descriptive) | 0.86 | 0.97 | 0.07 | 1.00 |
| N = 2500 | 1.00 | 0.99 | 0.21 | 1.00 | ||
| Ultrasound for second screening episode (n = 998) | 1.00 | 0.99 | 0.25 | 1.00 | ||
| Ultrasound for >= third screening episode (n = 733) | ||||||
| van Nagell et al., 200049 | 14469 | Ultrasound (ovarian volume > 20 cm3 for premenopausal women, > 10 cm3 for postmenopausal women) | 0.81 | 0.99 | 0.09 | 1.00 |
Abbreviations: BME = bimanual examination; CA-125 = cancer antigen 125; NR = not reported; PE = pelvic examination; PI = pulsatility index
In reviewing the literature on evaluation modalities, numerous methodological problems consistently reduced our ability to draw conclusions about the performance of various tests both individually and in comparison with each other. Some of these problems concerned study design, others related to statistical issues.
Patient population. With the exception of the 13 population-based screening studies, all of the articles were case series. Some were consecutive and others non-consecutive. Some were based on operative cases within a specific time frame at one or several institutions, whereas others were referral series, often located in oncology clinics. The path to diagnosis was almost never described, making it difficult to assess the generalizability of the results. Further, age was the only patient characteristic that was reliably documented. Other characteristics, such as family history, were almost never included. This has several implications. The overrepresentation of operative cases, especially from academic facilities, likely inflates the prevalence of malignancy in the study populations when compared with the population of women with adnexal masses in general. It also exaggerates the performance of the evaluative modalities, especially with regard to sensitivity and PPV. Finally, it limits the generalizability of the evidence.
| Study | Test | LMP classified as malignant: Sens | LMP classified as malignant: Spec | LMP classified as malignant: PPV | LMP classified as malignant: NPV | LMP classified as benign: Sens | LMP classified as benign: Spec | LMP classified as benign: PPV | LMP classified as benign: NPV |
|---|---|---|---|---|---|---|---|---|---|
| Roman et al., 1998157 | CEA | 0.16 | 0.93 | 0.35 | 0.83 | 0.19 | 0.93 | 0.25 | 0.90 |
| Wakahara et al., 2001187 | Ultrasound morphology | 0.82 | 0.82 | 0.65 | 0.92 | 0.86 | 0.78 | 0.54 | 0.95 |
| | CA-125 | 0.45 | 0.86 | 0.74 | 0.63 | 0.77 | 0.61 | 0.37 | 0.90 |
| Timmerman et al., 1999178 | CA-125 | 0.80 | 0.82 | 0.63 | 0.91 | 0.77 | 0.79 | 0.56 | 0.91 |
Abbreviations: CA-125 = cancer antigen 125; CEA = carcinoembryonic antigen; LMP = low malignant potential (tumors); NPV = negative predictive value; PPV = positive predictive value; Sens = sensitivity; Spec = specificity
Variability in test criteria. Of the 69 articles that evaluated ultrasound morphology, only 31 used established scoring criteria; 38 used a novel method. This resulted in a great heterogeneity of tests for ultrasound morphology and contributed to the range in performance noted. Many of the studies employed purely descriptive analysis to arrive at a benign versus malignant diagnosis. This limits the reproducibility of those results. Many of the scoring systems and descriptive categories had never been independently verified, and the paucity of details regarding what constituted a positive test makes such verification impossible. In terms of ultrasound evaluation by color Doppler, there was also a range of reported thresholds. Some of the variability in test criteria reflects the limitations of ultrasound technology. However, such differences limited the comparability between studies.
Masses as numerator. While most studies examined persons as the unit of 2-by-2 analysis, there were many studies that analyzed their data by masses. Even though the number of persons in the study was usually reported, it was often impossible to reconfigure the 2-by-2 table to refer to persons not masses. This was especially true in the radiology literature. This influenced the comparability between studies.
Menopausal status. Most of the studies did describe the patient population in terms of age. We were able to calculate the proportion of menopausal patients in most studies. However, the results were rarely reported in a way that allowed stratification by menopausal status. Where stratification was possible, a difference in test performance was seen. The heterogeneity in test performance was magnified by the different proportions of pre- and postmenopausal women in the different study populations.
Sample size. Few studies discussed sample size issues, potentially leading to inappropriate conclusions, especially regarding comparability of test characteristics.
Failure to account for observer variability. No studies attempted to account for the effects of observer variation on the precision of estimates, although a few did calculate interobserver coefficients. For tests where the thresholds for normal and abnormal were based on either qualitative assessments (such as descriptions of ultrasound morphology) or quantitative measures (such as ultrasound morphology scores), this variability will have implications for the precision of sensitivity and specificity.
Prevalence and predictive value. We did not limit our analysis of test characteristics to studies from the United States. As the incidence of ovarian cancer is different in different countries, this influences the range of predictive values reported in the literature. Locations with low disease prevalence will have low PPVs compared with higher prevalence areas. The heterogeneity of study locations influenced the range of reported test characteristics and somewhat limits the comparability of the results.
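The dependence of predictive values on prevalence follows from Bayes' rule and can be sketched as below. The sensitivity and specificity used are the pooled estimates for CA-125 at a threshold of > 35 (0.78 and 0.78) from the summary table in this chapter; the three prevalence values and the function name are hypothetical, chosen only to span a screening-like setting and a referral-like setting.

```python
def predictive_values(sens, spec, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    if not 0 < prevalence < 1:
        raise ValueError("prevalence must lie strictly between 0 and 1")

    # Expected proportion of the population in each cell of the 2-by-2 table.
    tp = sens * prevalence
    fn = (1 - sens) * prevalence
    tn = spec * (1 - prevalence)
    fp = (1 - spec) * (1 - prevalence)

    if tp + fp == 0 or tn + fn == 0:
        raise ValueError("predictive value undefined: no positive or no negative results")
    return tp / (tp + fp), tn / (tn + fn)


# Pooled CA-125 estimates from this report; prevalence values are illustrative.
for prevalence in (0.01, 0.10, 0.30):
    ppv, npv = predictive_values(sens=0.78, spec=0.78, prevalence=prevalence)
    print(f"prevalence {prevalence:.2f}: PPV {ppv:.2f}, NPV {npv:.2f}")
```

With sensitivity and specificity held fixed, the PPV in this sketch rises from roughly 0.03 at 1 percent prevalence to roughly 0.60 at 30 percent prevalence, which illustrates why predictive values from referral series cannot be carried over directly to lower prevalence settings.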
| Diagnostic Test | Pooled Sensitivity (95% CI) | Pooled Specificity (95% CI) |
|---|---|---|
| ULTRASOUND: MORPHOLOGY | ||
| Scoring system: Sassone | 0.86 (0.79 to 0.91) | 0.77 (0.73 to 0.81) |
| Scoring system: DePriest | 0.91 (0.84 to 0.95) | 0.68 (0.49 to 0.82) |
| Scoring system: Ferrazzi | 0.87 (0.80 to 0.92) | 0.81 (0.62 to 0.91) |
| Scoring system: Finkler | 0.82 (0.65 to 0.91) | 0.78 (0.59 to 0.91) |
| Other | 0.86 (0.82 to 0.89) | 0.83 (0.76 to 0.88) |
| ULTRASOUND: DOPPLER | ||
| Resistive index | 0.72 (0.61 to 0.82) | 0.90 (0.84 to 0.94) |
| Pulsatility index | 0.80 (0.74 to 0.85) | 0.73 (0.62 to 0.81) |
| Maximum systolic velocity | 0.74 (0.56 to 0.86) | 0.81 (0.59 to 0.83) |
| Presence of vessels | 0.88 (0.80 to 0.92) | 0.78 (0.65 to 0.87) |
| MORPHOLOGY PLUS DOPPLER | 0.86 (0.79 to 0.91) | 0.91 (0.80 to 0.97) |
| MRI | 0.91 (0.86 to 0.94) | 0.87 (0.83 to 0.90) |
| CT | 0.90 (0.83 to 0.94) | 0.75 (0.36 to 0.94) |
| FDG-PET | 0.67 (0.52 to 0.79) | 0.79 (0.70 to 0.85) |
| CA-125 (threshold > 35) | 0.78 (0.75 to 0.81) | 0.78 (0.71 to 0.82) |
Abbreviations: CA-125 = cancer antigen 125; CI = confidence interval; CT = computed tomography; FDG = 18-Fluorodeoxyglucose; MRI = magnetic resonance imaging; PET = positron emission tomography
The use of established scoring systems in the evaluation of an adnexal mass by ultrasound morphology appears to perform slightly better than simple descriptive assessment. However, there does not appear to be a benefit of one scoring system over another. Based on small numbers of studies, 3D ultrasound shows some improvement over 2D. Although the pooled sensitivity and specificity of MRI were the highest of any imaging modality, its performance was less consistent in studies where it was directly compared to other modalities such as CT and ultrasound.
Color Doppler assessment by RI, PI, and maximum systolic velocity is not superior to the simpler assessment of the presence or absence of arterial vessels within the mass. The efficacy of RI, PI, and maximum systolic velocity is hampered by the overlap in values of these measurements between benign and malignant masses.
Combined ultrasound morphology and color Doppler assessments have higher sensitivity and specificity compared to either alone. Although ultrasound morphologic evaluation by a gynecologist appears to be as reliable as that performed by a radiologist, there was no evidence of Doppler measurements done outside of the context of a radiology referral.
In postmenopausal women, an elevated CA-125 is useful for helping rule in ovarian cancer.
Qualitatively, there was a consistent trade-off across all tests between sensitivity and specificity.
The relatively low PPVs in all of the tests are particularly striking given that many of the included studies were done in preoperative patients; the likely “screening” done prior to a decision for surgery suggests that the PPV of a particular test in the initial evaluation of an adnexal mass is likely to be even lower.
Question 4 is: What is the accuracy of explicit scoring systems which incorporate various combinations of imaging findings, patient risk factors, and/or CA-125 levels for detecting malignancy? Have these scoring systems been applied to a population of women before laparoscopy or laparotomy?
Explicit scoring systems were sought in the medical literature from among all studies of diagnostic assessment of adnexal or pelvic masses. We considered only scoring systems that combined data from more than one category of the following types of information: (1) imaging findings; (2) patient risk factors; and (3) laboratory data. Clinical prediction rules that utilized data entirely from only one category (for example, ultrasound based morphological indices56) are described as part of Question 3.
Imaging findings could include: (1) ultrasound based tests, such as transabdominal or transvaginal 2D ultrasound or Doppler ultrasound; (2) radiographic tests, such as CT; or (3) other imaging studies, such as MRI or PET scans.
Patient risk factors include menopausal status, age, or other risk factors.
Laboratory data was primarily CA-125, but we recorded data on other serum tumor markers as well.
Scoring systems identified. The scoring systems were of several types. The most common were models developed using statistical modeling techniques such as logistic regression (or artificial neural networks) to develop estimates for predicted probability of malignancy. Such estimates were then used to construct clinical prediction rules (e.g., the Risk of Malignancy Index [RMI], which calculates a numeric score based on CA-125 level multiplied by a menopausal score and an ultrasound morphology score) and decision thresholds (e.g., for RMI, the most common threshold is 200). Other scoring systems used simple criteria based on individual modalities, which were then combined using a Boolean “and” or “or” (e.g., CA-125 > 65 U/ml and ultrasound morphology score > 10 points). Some models were validated in populations separate from the data set used to develop the scoring system, either as part of the initial development or in subsequent publications by the original developers or others.
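The two kinds of rule described above can be sketched as follows. The multiplicative form of the RMI, the threshold of 200, and the example Boolean criteria come from the text; everything else is an assumption made for illustration: the function names, the score values passed in the usage lines, and the choice to treat a score strictly above the threshold as positive (individual studies may define the cutoff and the component scores differently).

```python
def risk_of_malignancy_index(ca125_u_per_ml, menopausal_score, ultrasound_score):
    """RMI as described in the text: CA-125 level x menopausal score x ultrasound score."""
    return ca125_u_per_ml * menopausal_score * ultrasound_score


def rmi_positive(rmi, threshold=200):
    """Assumed convention: the test is positive when the RMI exceeds the threshold."""
    return rmi > threshold


def boolean_rule(ca125_u_per_ml, morphology_score, combine="and",
                 ca125_cutoff=65, morphology_cutoff=10):
    """Combine two single-modality criteria with a Boolean "and" or "or"."""
    ca125_abnormal = ca125_u_per_ml > ca125_cutoff
    morphology_abnormal = morphology_score > morphology_cutoff
    if combine == "and":
        return ca125_abnormal and morphology_abnormal
    if combine == "or":
        return ca125_abnormal or morphology_abnormal
    raise ValueError('combine must be "and" or "or"')


# Hypothetical patient values; the component scores are placeholders, not a published coding.
rmi = risk_of_malignancy_index(ca125_u_per_ml=50, menopausal_score=3, ultrasound_score=3)
print(rmi, rmi_positive(rmi))
print(boolean_rule(ca125_u_per_ml=70, morphology_score=5, combine="and"))
print(boolean_rule(ca125_u_per_ml=70, morphology_score=5, combine="or"))
```

An “and” rule requires every criterion to be abnormal and therefore tends to favor specificity, whereas an “or” rule is positive when any criterion is abnormal and tends to favor sensitivity.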
Types of data incorporated. The most common scoring systems used ultrasound, CA-125 and menopausal status. Some type of ultrasound data was used in all 36 publications; studies varied with regard to the type of ultrasound technology that was used. All used 2D ultrasound to evaluate morphology, some using transabdominal and many using transvaginal probes. Studies that used Doppler ultrasound used a variety of parameters, including measures as simple as detection of flow, or as complex as specific indices derived from Doppler-measured flow rates, such as the RI or PI. Many described scoring rules based on combinations of features of morphology (Finkler score) or combined morphology and blood flow.
CA-125 was a component of the scoring system in 30 reports; other serum tumor markers included CA-72-4, incorporated into two reports,53, 63 and the markers AFP, LDH, and hCG, which were used in one report.42 All studies that used these other serum markers also used CA-125.
Menopausal status was incorporated into scoring systems of 19 reports. The definition of menopausal status varied across studies, and in a few cases age was used as a proxy for clinically determined menopausal status. Three studies included only postmenopausal women,62, 63, 135 and thus could not use this variable in the scoring system.
Physical examination was a component of scoring systems in six reports.42, 51–53, 62, 63
Type of study populations. Most study populations were case series assembled at the time of referral for surgery and collected either at the point of preoperative ultrasound imaging or preoperative surgical evaluation. No studies were based in primary care clinical populations. One study described evaluation of adnexal masses detected during an ovarian cancer screening program.51
Reporting of study populations. Menopausal status of the study populations was described in 28 of the 36 reports; three reports included only postmenopausal women.62, 63, 135
Age was reported for the study population as a mean or median in 18 of 36 studies; it was reported in categories in one additional study. Symptom status was seldom described in the candidate reports.
Race/ethnicity was not reported in any of the studies.
Risk factors for ovarian cancer (besides menopausal status and age, described above) were not reported, except in one study that reported the proportion of the study population that was nulliparous versus multiparous.138
Methodology. The methodological quality of the included studies may be described as follows:
Reference standard (handling of borderline). Some studies, particularly those assembled at the time of ultrasound investigation rather than surgery, encountered women with masses due to simple cysts with low risk of malignancy. Two studies allowed use of an operative report in lieu of histopathology as a reference standard,87, 116 and one used clinical followup without surgery as an alternate reference standard.48
Verification bias. Fourteen studies failed to verify disease status for all or a significant sample of test-negative women.
Test reliability. Only nine studies provided data on the reliability of test assessments.
Sample size. Only 11 of the reports described a priori recruitment targets or sample size calculations. We excluded studies with fewer than 50 women; however, some studies report subgroup analyses with fewer than 50 women, for example, the subset of postmenopausal women in Strigini et al.169
Use of appropriate statistical tests. The majority of reports (n = 28) used appropriate statistical analysis of the diagnostic data; however, seven reports used inadequate analyses.
Blinding. None of the reports described the use of techniques to blind investigators to the disease status of study patients.
Definition of positive and negative test. Most studies (n = 24) provided a priori definitions of a positive and negative test result; studies that failed to meet this criterion most often set no explicit threshold a priori and instead derived the threshold from the study data.
Explicit validation method. Half of the reports (18/36) used some explicit validation method; many of the reports replicated previously described scoring systems in a new population. In many cases, these studies described new scoring systems which were not always validated.
The most common validation method was replication in a separate population. Two studies used validation techniques within a single study population: one split-sample,209 and one bootstrap.205
This section considers the diagnostic accuracy of the RMI (Jacobs 1990) and subsequent replications and refinements (RMI2, RMI3, Jacobs 1993, and Timmerman models).
The RMI is a clinical prediction rule based on ultrasound, CA-125, and menopausal status data defined as follows:
RMI = U × M × CA-125
where ultrasound (transabdominal) is scored 1 point for each of the following characteristics: multilocular cyst, evidence of solid areas, evidence of metastases, presence of ascites, and bilateral lesions.
U = 0 for ultrasound score of 0
U = 1 for ultrasound score of 1
U = 3 for ultrasound score ≥ 2
CA-125 = serum CA-125 in U/ml
M (menopausal status) = 1 if premenopausal
M = 3 if postmenopausal
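The RMI definition above can be sketched in a few lines of Python. This is a minimal illustration rather than a validated clinical tool; the function names and the example patient are hypothetical, and only the weights come from the definition above.

```python
def ultrasound_weight(score: int) -> int:
    """Map the ultrasound feature count (0 to 5) to the RMI weight U."""
    if score == 0:
        return 0
    if score == 1:
        return 1
    return 3  # two or more of the five features present


def rmi(ultrasound_score: int, postmenopausal: bool, ca125_u_per_ml: float) -> float:
    """RMI = U x M x CA-125, using the original RMI weights."""
    if not 0 <= ultrasound_score <= 5:
        raise ValueError("ultrasound score counts between 0 and 5 features")
    m = 3 if postmenopausal else 1
    return ultrasound_weight(ultrasound_score) * m * ca125_u_per_ml


# Hypothetical patient: two ultrasound features, postmenopausal, CA-125 of 40 U/ml.
print(rmi(2, True, 40.0))  # 3 * 3 * 40
```

Because U is 0 when no ultrasound feature is present, the original RMI is 0 for such a patient regardless of the CA-125 level.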
When sensitivity and specificity are combined separately using a random-effects model, the pooled sensitivity is 0.78 (95% CI, 0.72 to 0.84) and the pooled specificity is 0.90 (0.81 to 0.95).
RMI2. In 1996, Tingulstad et al.180 reported a refinement to the original RMI scoring system, commonly referred to as RMI2. RMI2 is defined identically to RMI except that new weights were used for the ultrasound and menopause components as follows:
U = 1 for ultrasound score of 0–1
U = 4 for ultrasound score ≥ 2
M = 1 if premenopausal
M = 4 if postmenopausal
RMI3. Subsequently, a further refinement to the RMI and RMI2 was reported by Tingulstad et al.211 This third scoring system is defined identically to RMI and RMI2 except that new weights were used for the ultrasound and menopause components as follows:
U = 1 for ultrasound score of 0–1
U = 3 for ultrasound score ≥ 2
M = 1 if premenopausal
M = 3 if postmenopausal
A cutoff value of 200 was also recommended for RMI3. The RMI3 scoring system has been replicated in one additional study.139 The original report of RMI3 found sensitivity of 0.71 and specificity of 0.92, while the validation study reported very similar performance, with sensitivity of 0.74 (0.65 to 0.83) and specificity of 0.91 (0.83 to 0.99).
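The three weighting schemes differ only in their U and M values, so they can be compared side by side for the same patient. The sketch below is illustrative: the table layout, function names, and example values are assumptions, and applying the cutoff of 200 to all three variants is done here only for comparison.

```python
# Weights taken from the RMI, RMI2, and RMI3 definitions in the text.
# "u" is indexed by ultrasound score (0, 1, >= 2); "m" by (pre, post) menopause.
WEIGHTS = {
    "RMI": {"u": (0, 1, 3), "m": (1, 3)},
    "RMI2": {"u": (1, 1, 4), "m": (1, 4)},
    "RMI3": {"u": (1, 1, 3), "m": (1, 3)},
}
CUTOFF = 200  # applied to every variant here purely for illustration


def score_variant(variant: str, ultrasound_score: int,
                  postmenopausal: bool, ca125: float) -> float:
    """Compute U x M x CA-125 under the named weighting scheme."""
    if ultrasound_score < 0:
        raise ValueError("ultrasound score cannot be negative")
    weights = WEIGHTS[variant]
    u = weights["u"][min(ultrasound_score, 2)]
    m = weights["m"][1 if postmenopausal else 0]
    return u * m * ca125


def compare(ultrasound_score: int, postmenopausal: bool, ca125: float) -> dict:
    """Return the score under each variant for one patient."""
    return {name: score_variant(name, ultrasound_score, postmenopausal, ca125)
            for name in WEIGHTS}


# Hypothetical postmenopausal patient with no ultrasound features, CA-125 of 60 U/ml.
scores = compare(0, True, 60.0)
flagged = {name: value >= CUTOFF for name, value in scores.items()}
print(scores, flagged)
```

For this hypothetical patient the original RMI is 0, RMI3 is 180, and RMI2 is 240, so only RMI2 reaches 200. The difference arises because RMI2 and RMI3 assign U = 1 rather than 0 to an ultrasound score of 0.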
| Initial description | Subsequent validation | Sensitivity (95% CI): initial development | Sensitivity (95% CI): replication | Specificity (95% CI): initial estimate | Specificity (95% CI): replication |
|---|---|---|---|---|---|
| Timmerman LR1210 | Valentin 2001185 | 0.87 (0.79 to 0.97) | 0.62 (0.44 to 0.80) | 0.92 (0.87 to 0.97) | 0.79 (0.68 to 0.90) |
| Timmerman AAN1178 | Mol et al. 2001207 | 0.94 (0.87 to 1.0) | 0.90 (0.79 to 1.0) | 0.90 (0.85 to 0.96) | 0.60 (0.52 to 0.68) |
| Timmerman AAN2178 | Mol et al. 2001207 | 0.96 (0.90 to 1.0) | 0.90 (0.79 to 1.0) | 0.94 (0.89 to 0.98) | 0.46 (0.38 to 0.54) |
| Timmerman LR2178 | Mol et al. 2001207 | 0.96 (0.90 to 1.0) | 0.90 (0.79 to 1.0) | 0.86 (0.79 to 0.92) | 0.56 (0.48 to 0.64) |
| Jacobs 1993212 | Mol et al. 2001207 | 0.85 (0.74 to 0.96) | 0.90 (0.79 to 1.0) | 0.97 (0.94 to 1.0) | 0.61 (0.53 to 0.69) |
Thirteen further reports describe the diagnostic performance of simple rules for combining single tests or single modalities into a decision rule.42, 51, 52, 62, 63, 66, 86, 97, 103, 105, 135, 138, 169 None of these criteria has been validated in another population. Each of these studies used dichotomous rules for two or more tests (or modalities) and then combined them using a simple rule like “malignant if any test positive” (Boolean or) or “malignant if all tests positive” (Boolean and). Some of the studies reported diagnostic performance of several different simple rules.
Twelve of these studies used ultrasound and CA-125, five incorporated physical exam, two included other serum tumor markers42, 63 and one used age over 50 years.138
The results show a wide range of sensitivity and specificity. This variation reflects differences in decision thresholds (e.g., CA-125 > 35 U/ml versus CA-125 > 65 U/ml) and in the rules for combining tests (e.g., use of Boolean or versus and when combining results of two or more tests).
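The effect of the decision threshold and of the combining rule can be illustrated with a short sketch. The function names and the example patient are hypothetical; only the two CA-125 thresholds and the Boolean or/and rules come from the text.

```python
def combine(results: list, rule: str) -> bool:
    """Combine dichotomous test results with a Boolean 'or' or 'and' rule."""
    if rule == "or":
        return any(results)   # malignant if any test positive
    if rule == "and":
        return all(results)   # malignant if all tests positive
    raise ValueError("rule must be 'or' or 'and'")


def classify(ca125: float, ultrasound_positive: bool,
             ca125_threshold: float = 35.0, rule: str = "or") -> bool:
    """Dichotomize CA-125 at the threshold, then combine with ultrasound."""
    tests = [ca125 > ca125_threshold, ultrasound_positive]
    return combine(tests, rule)


# Hypothetical patient: CA-125 of 50 U/ml with a negative ultrasound.
print(classify(50.0, False, 35.0, "or"))   # positive: CA-125 exceeds 35
print(classify(50.0, False, 35.0, "and"))  # negative: ultrasound is negative
print(classify(50.0, False, 65.0, "or"))   # negative: CA-125 below 65
```

The same patient is classified differently under each configuration. In general, the "or" rule and the lower threshold favor sensitivity, while the "and" rule and the higher threshold favor specificity.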
No scoring systems were both developed and validated expressly for evaluating adnexal masses in postmenopausal women. Existing scoring systems that have been validated have all been developed in mixed pre- and postmenopausal populations. Those scoring systems that have been described in populations of postmenopausal women were neither rigorously developed (they consist of simple combination rules) nor validated in other populations.
The highest demonstrated specificity obtained with these scoring systems appears to be in the range of 90 to 95 percent and, at this range of specificity, the sensitivity appears to be in the range of 65 to 80 percent. However, as suggested by the performance in the few populations of postmenopausal women studied, the same degree of sensitivity and specificity is unlikely to be possible. Reliable estimates of the diagnostic performance of scoring systems cannot be determined from these studies.
Question 5 is: Among women with suspected benign masses on initial investigation, what are the sensitivity and specificity of monitoring with periodic CA-125 and/or interval ultrasound examinations for detecting malignant masses? How does the interval of testing/definition of change affect sensitivity and predictive value?
For each study we sought to identify a population of patients with a screening abnormality which was “probably benign” and which the authors felt did not meet criteria for immediate surgical intervention. We then attempted to define the outcomes of further testing in the defined population, including the results of subsequent testing and final clinical outcome as defined by a pathology report or extended clinical followup. The interpretation of results is limited by the narrow scope of Question 5. Specifically, it is often difficult to identify a subgroup of patients with a screening abnormality which could be defined as a “suspected benign lesion” within larger screening studies. Often, results are not stratified with respect to these sub-populations, making it difficult to calculate sensitivity and specificity of the followup testing. In addition, by definition, it is also difficult to estimate the “sensitivity” of a followup regimen. We assumed that this refers to detection of cancer as part of the followup regimen, and that women with cancer diagnosed outside of the followup were “false negatives.”
| Study | Population | N | Followup interval | Length of followup | Loss to followup | True/false positives detected during followup | Cancers missed |
|---|---|---|---|---|---|---|---|
| Population-based studies (followup of “benign” masses identified in screening) | |||||||
| Menon et al., 200145 | Followup of scans considered “equivocal” | 17 | “Equivocal” scans followed every 6 weeks until clearly normal or abnormal; normal scans followed with CA-125 every 3 months | Median 6.8 years | Not reported | 1 cancer/5 benign lesions | 0 (1 within 6 weeks of initial test, before first followup scan) |
| Modesitt et al., 200340 | Followup of simple cysts < 10 cm | 2,763 | TVUS every 3–6 months for simple cysts | Mean 6.3 years | Not reported | 7 cancers/0 benign lesions | 3 cancers, none developed in the original cyst |
| Schincaglia et al., 1994217 | Followup of post-menopausal ovaries > 9 cc, or with simple cyst | 347 | If cyst: followed with ultrasound every 6 months; if change, referred; others: referral if unchanged at 3 and 6 months | “At least” 1 year | Not reported, but all had “at least 1 year” | 2 cancers/96 benign lesions | None in 249 not referred |
| Kurjak et al., 1994127 | Followup of post-menopausal women with simple cyst > 2.5 cm but < 5 cm, resistive index ≥ 0.41 | 88 (of 404 with simple cysts) | Repeat ultrasound every 6 months | 6 months | Not reported | 1/17 with benign lesions | 0 |
| Castillo et al., 2004214 | Followup of post-menopausal women with simple cyst < 10 cm | 215 | Repeat ultrasound and CA-125 in 3 months, then every 6 months | Median 27 months | 30.6% | 0/44 benign masses | 1 |
| Case series (clinical history prior to identification of mass not routinely described) | |||||||
| Valentin and Akrawi, 2002218 | Followup of post-menopausal women with low score on ultrasound malignancy risk scale | 162 | Repeat ultrasound 3, 6, 9, and 12 months, then every 12 months; test positive if increase in size or cyst more complex | Median 3 years | 0 (cancer and mortality tracked through registry) | 0 cancers/7 patients underwent surgery for change | 0 |
| Maggino et al., 1994135 | Followup of post-menopausal women with cysts < 5 cm, thin wall, no septae, no free fluid | 45 | Details on followup strategy not reported | Not reported | 4.4% | 0/0 | 0 |
| Levine et al., 1992216 | Followup of voluntary screening of post-menopausal women with unilocular simple cyst | 32 | Repeat ultrasound every 3 months × 1 year, then every 6 months | “Over half at least one year” | 22.2% | 0/0 | 0 |
| Goldstein et al., 1989215 | Followup of post-menopausal women with simple cysts ≤ 5 cm | 16 | Repeat ultrasound (abdominal) | Mean 29 months | 6 (12% of original 48) | 0/2 with benign lesions | 0 |
Abbreviations: CA-125 = cancer antigen 125; TVUS = transvaginal ultrasound
Menon et al.145 performed a large prospective screening study of 22,000 postmenopausal women older than 45 years. Initial screening consisted of CA-125; patients with CA-125 ≥ 30 underwent endovaginal ultrasound evaluation. Results were interpreted as normal (ovarian volume < 8.8 ml/normal morphology), equivocal (volume < 8.8 ml, abnormal morphology), or abnormal (volume ≥ 8.8 ml). Normal morphology was defined as uniform hypoechogenicity and smooth outline. Abnormal morphology was defined as simple cysts or complex lesions. Patients with normal scans were triaged to repeat CA-125 every 3 months for a year and subsequently returned to yearly screening; median followup was 6.8 years, with loss to followup not reported. Patients with abnormal scans were referred to a gynecologist for consideration of surgical intervention. Patients with equivocal scans were triaged to repeat ultrasound at 6-week intervals until a scan could be classified as normal or abnormal. Of 741 patients who were triaged to ultrasound, 20 (2.7 percent) index cancers were identified. We focused on the group of patients with “equivocal” scans who were triaged to interval testing in an attempt to answer the study question. There were 17 equivocal scans. Of these, nine had simple cysts which were followed and did not result in a cancer diagnosis (true negatives). One patient died of pneumonia prior to her first repeat ultrasound, and one died of advanced ovarian cancer prior to her first repeat ultrasound; this cancer death could possibly be considered a false negative for the followup strategy, although it could also be considered a false negative from the original study since the death occurred within 6 weeks of the initial scan. Six patients were scheduled for surgery following an equivocal scan, presumably due to abnormal followup ultrasound. One of these had ovarian cancer (true positive), and the other five had benign disease (false positive). 
Because the number of equivocal scans was so small, and because the classification “equivocal” does not necessarily imply that the lesions were felt to be “suspected benign” as designated in Question 5, it is not possible to calculate the sensitivity and specificity of prolonged monitoring strategies using this study. The authors do not draw any conclusions regarding the appropriateness of interval testing.
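The scan classification used in the Menon et al. study, as described above, amounts to a two-variable decision rule. The sketch below encodes it for illustration; the function and dictionary names are hypothetical, and the thresholds and followup pathways are those given in the text.

```python
def classify_scan(volume_ml: float, normal_morphology: bool) -> str:
    """Classify an ultrasound result as normal, equivocal, or abnormal."""
    if volume_ml >= 8.8:
        return "abnormal"
    # Volume below 8.8 ml: morphology decides between normal and equivocal.
    return "normal" if normal_morphology else "equivocal"


# Followup pathway for each classification, as described in the text.
FOLLOWUP = {
    "normal": "repeat CA-125 every 3 months for a year, then yearly screening",
    "equivocal": "repeat ultrasound at 6-week intervals until normal or abnormal",
    "abnormal": "referral to a gynecologist for consideration of surgery",
}

# Outcomes of the 17 equivocal scans reported in the text.
equivocal_outcomes = {
    "simple cysts followed, no cancer": 9,
    "died of pneumonia before repeat scan": 1,
    "died of ovarian cancer before repeat scan": 1,
    "scheduled for surgery": 6,
}

result = classify_scan(6.0, normal_morphology=False)
print(result, "->", FOLLOWUP[result])
print(sum(equivocal_outcomes.values()))
```

The tally confirms that the four outcome groups account for all 17 equivocal scans.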
Modesitt et al.40 performed a large screening study of 15,106 asymptomatic women at least 50 years old without a history of ovarian cancer. Patients were screened with TVUS. Criteria for abnormality were ovarian volume > 10 ml and any morphologic abnormality, including simple or complex cysts. Patients with abnormal TVUS were triaged to repeat TVUS in 4 to 6 weeks, with Doppler flow ultrasound, CA-125 level, and tumor morphology indexing performed at the second visit. Patients with simple unilocular cysts which were considered likely benign were triaged to repeat TVUS every 3 to 6 months. Mean followup was 6.3 years. A total of 2,763 women were diagnosed with 3,259 unilocular cysts. Spontaneous resolution occurred in 2,261 (69.4 percent) of these lesions. Ten patients subsequently developed ovarian cancer. Seven of these had additional abnormal areas which subsequently developed on TVUS (considered true positives because they were identified by interval testing). Two developed ovarian cancer after the cyst in question had resolved on sonographic followup (these might be considered false negatives). One patient developed cancer in the ovary opposite the cyst being followed (this might also be considered a false negative). Calculated on a per-patient basis, the sensitivity and specificity of followup testing in the population with a simple unilocular ovarian cyst are 70 percent (95% CI, 41.6 to 98.4 percent) and 100 percent (99.9 to 100 percent), respectively. Because none of the unilocular cysts subsequently developed into a cancer, the sensitivity and specificity improve to 100 percent (57.1 to 100 percent) and 100 percent (99.9 to 100 percent), respectively, when calculated on a per-lesion basis. Followup time is a major strength of this study. The authors conclude that unilocular ovarian cysts are associated with a very low risk of malignancy and can be safely followed with serial ultrasound.
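The per-patient estimates for the Modesitt et al. study can be reproduced from the counts given above. The sketch assumes a normal-approximation confidence interval with z = 1.96, which appears consistent with the reported sensitivity interval; the report does not state which method was used, and the intervals for the 100 percent estimates are not reproduced here. Function and variable names are hypothetical.

```python
import math


def proportion_with_normal_ci(successes: int, total: int, z: float = 1.96):
    """Return a proportion and its normal-approximation confidence limits."""
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


women_with_cysts = 2763
cancers = 10            # women who subsequently developed ovarian cancer
true_positives = 7      # detected by interval TVUS
false_negatives = cancers - true_positives
false_positives = 0     # the summary table lists 0 benign lesions
true_negatives = women_with_cysts - cancers - false_positives

sensitivity, sens_low, sens_high = proportion_with_normal_ci(true_positives, cancers)
specificity = true_negatives / (true_negatives + false_positives)

print(f"sensitivity {sensitivity:.0%} ({sens_low:.1%} to {sens_high:.1%})")
print(f"specificity {specificity:.0%}")
```

With only 10 cancers, the interval around the sensitivity estimate is very wide, which is why the per-patient figure should be read cautiously.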
Schincaglia et al.217 performed a screening study of 3,541 asymptomatic postmenopausal patients. All patients underwent transabdominal ultrasonography with assessment of ovarian volume and morphology. Patients were divided into four groups based on the results of the initial ultrasound. All patients with ovarian volume > 15 ml (Group 4) were referred for repeat “level II” ultrasonography for morphologic assessment and fine needle aspiration (FNA) when feasible. Patients with ovarian volume between 9 and 15 ml (Group 3) were triaged to followup ultrasound at 3 and 6 months. Patients with ovarian volume < 9 ml but a cystic appearance (Group 2) were triaged to followup ultrasound in 6 months. Patients with ovarian volume < 9 ml and homogeneous appearance (Group 1) were considered negative and had no further intervention. Clinical followup at 1 year and pathology results if surgery was performed were considered the reference standard. A total of 283 patients (Groups 2 and 3) were deemed appropriate for followup using repeat ultrasound at 3- to 6-month intervals without the need for immediate referral for FNA/surgery. Of these 283 patients, 34 subsequently developed concerning ultrasound findings and were referred for a level II scan and/or possible FNA. The clinical results of this group of 34 are not given separately. Of the 249 who had non-concerning followup scans, none developed cancer with followup of at least 1 year (“true negatives”). Therefore, the specificity of ultrasound followup is 100 percent (95% CI, 98.8 to 100 percent) for patients with an initial abnormal but “probably benign” ultrasound. Sensitivity within this group cannot be calculated with the information given in the publication. The ability to answer Question 5 would be enhanced if specific outcomes of each of the four groups defined by the authors had been given.
The study was also limited by the fairly short followup interval and the lack of prior or concurrent validation of the ultrasonographic groups defined in the study.
Kurjak et al.127 screened 5,013 women 40 years old or older (30.6% postmenopausal), of whom 404 had simple cysts with a diameter between 2.5 and 5 cm and a resistive index greater than or equal to 0.41. These women received a followup scan in 6 months. Investigators reported the results of 88 women for whom the 6-month scan results were available. The definition of change prompting further diagnosis was not explicitly described. Of the 88 women, 18 ultimately underwent surgery based on the findings at 6 months, with one cancer detected and 17 benign lesions. Results stratified by menopausal status were not provided. This study was limited by lack of detail on clinical decision rules and by short followup.
Castillo et al.214 screened 8,794 postmenopausal women; 215 had simple unilocular cysts less than 10 cm in diameter. Twelve percent of these masses were asymptomatic. These women underwent repeat ultrasound and CA-125 in 3 months, with subsequent followup studies every 6 months. Progression was defined as an increase in diameter of 1 cm or more, regression as a decrease of 1 cm, and resolution as absence of the cyst at 2 consecutive visits 6 months apart. Median followup was 27 months. There was one interval ovarian cancer between studies, and 44 women had benign masses removed. Although this study was among the highest quality studies in terms of reporting of relevant data, it is limited by the relatively small size and the high loss to followup (30.6%).
There are limited data available to support a global definition of “probably benign” ovarian lesions or to support a specific method of interval testing to identify ovarian malignancy among patients in whom such lesions have been identified. For the most part, studies are limited by small size, variable length of followup, variable definitions of significant change and thresholds for intervention, and differing methods of followup.
The question of how best to define and evaluate “sensitivity” of followup regimens is a difficult one. Several factors need to be considered. First, interval cancers presenting between the initial study and the first followup visit may well be considered false negatives of the initial study; alternatively, they may reflect a too-long followup interval. Second, given the lack of data on the natural history of ovarian cancer, it is unclear whether cancers developing in benign-appearing lesions represent subclinical cancers present at the time of the initial diagnosis, or new cancers representing malignant transformation of a benign cyst. If the latter, then the ultimate success of any followup regimen may depend as much on the natural history of a given malignancy as on the sensitivity and specificity of the tests used for followup. Finally, cancers identified during followup should ideally have high survival rates (although whether such high survival rates would reflect the efficacy of the followup or the biology of cancers which are associated with benign-appearing cysts is unclear). The number of cancers identified in the reviewed studies was too small to draw any inferences about relative survival.
Overall, only two interval cancers occurred during followup in the studies identified (one prior to the first followup scan), and 10 cancers were identified during followup. As noted, an additional three cancers developed after resolution of a cyst or in the contralateral ovary. The highest quality study40 provides good evidence for the safety of prolonged followup with interval TVUS at 3- to 6-month intervals for patients with unilocular ovarian cysts of up to 10 cm in diameter, and the findings of the other studies are consistent with this conclusion.
Question 6 is: Among women with adnexal masses, what are the morbidity and mortality from diagnostic surgery (laparoscopy or laparotomy)? At what point does the risk of surgery outweigh the risk of detecting malignancy?
We searched the literature for studies that reported the morbidity and mortality of surgical management of adnexal masses. We also used the Nationwide Inpatient Sample (NIS) discharge database, maintained by the Agency for Healthcare Research and Quality (AHRQ), to obtain estimates of morbidity and mortality associated with diagnostic laparoscopy or exploratory laparotomy for a range of diagnoses associated with adnexal masses. The NIS is limited to inpatient procedures and does not cover ambulatory surgical centers, where some adnexal masses are likely to be managed, especially those masses thought to have a low likelihood of cancer. In addition to surgical complications, we also examined articles that provided data on the test characteristics of frozen section pathologic diagnosis. Especially in the setting of minimally invasive procedures, false negative results on frozen section might lead to suboptimal surgical management and delayed therapy, while false positive results might lead to more extensive surgery than necessary, with possible implications for increased surgical morbidity and effects on ovarian function.
Size of population. None of the papers provided a description of the referral base; two32, 37 were limited to gynecologic oncology practices. Lack of information on the referral base prevents assessment of generalizability. Since all of these studies were performed in centers experienced in laparoscopic surgery, the generalizability may well be limited.
Number of cases. Five studies had fewer than 200 cases, with correspondingly wide confidence intervals for reported event rates. Two studies had larger numbers of cases: Marana et al.230 with 683 and Canis et al.219 with 757. However, the study by Marana et al. was limited to women under 40.
Patient selection. None of the studies reported how patients were referred to the surgical practices. All described how patients were selected for laparoscopic management, using various criteria to suggest high or low risk of malignancy. We found two trials where patients were randomized to laparoscopy or laparotomy,224, 225 but randomization methods were not well described.
Application of reference standard. In this sense, “reference standard” refers to the method by which a complication was diagnosed. Only two studies described followup beyond 8 weeks, but they did not detail whether all patients underwent similar followup protocols.
There were three deaths in one study of 146 patients (all undergoing laparoscopy), and none in any of the other studies (a total of 5,599 patients). Pooling all patients, the mortality was 0.05 percent, with a 95% CI of 0.01 to 0.17 percent.
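The pooled mortality figure is a simple ratio of total deaths to total patients. The sketch below computes it under two readings of the denominator, since the text does not make clear whether 5,599 includes the 146 patients in the study with deaths; both readings are shown as assumptions, and the confidence interval is not reproduced. Names are hypothetical.

```python
def pooled_rate(events: list, totals: list) -> float:
    """Pool event rates by dividing total events by total patients."""
    return sum(events) / sum(totals)


deaths_in_one_study = 3
patients_in_that_study = 146
patients_in_other_studies = 5599

# Reading 1: 5,599 counts only the studies without deaths.
rate_additive = pooled_rate(
    [deaths_in_one_study, 0],
    [patients_in_that_study, patients_in_other_studies],
)

# Reading 2: 5,599 is the grand total across all studies.
rate_inclusive = pooled_rate([deaths_in_one_study], [patients_in_other_studies])

print(f"{rate_additive:.2%} or {rate_inclusive:.2%}")
```

Both readings round to the 0.05 percent reported above, so the ambiguity does not affect the point estimate.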
| Study | N | Patient population | Complication rate (95% CI) | Notes |
|---|---|---|---|---|
| Randomized trials of laparoscopy versus laparotomy | ||||
| Deckardt et al., 1994224 | 192 | 22.4% laparoscopy, 26.4% laparotomy postmenopausal | Laparotomy: 30.3% (21.8 to 42.3%); Laparoscopy: 11.2% (6.8 to 18.7%) | “Randomized,” but some differences between two arms; 3.5% conversion |
| Fanfani et al., 2004225 | 100 | Laparoscopy: 10% postmenopausal; Laparotomy: 20% postmenopausal | Laparotomy: 6% (1.8 to 17.5%); Laparoscopy: 0% (0 to 10.6%) | No malignancies; small sample size |
| Non-randomized comparisons | ||||
| Hidlebaugh et al., 1997227 | 405 | 199 laparoscopy; 206 laparotomy; 20.2% postmenopausal | Laparotomy: 27.2% (21.8 to 34.0%); Laparoscopy: 2.5% (1.0 to 6.0%) | Selection criteria for laparoscopy not defined; potential other risk factors for complications not described |
| Yuen et al., 1997239 | 110 | Laparotomy: 6% postmenopausal; Laparoscopy: 3.8% postmenopausal | Laparotomy: 28% (18.5 to 43.1%); Laparoscopy: 9.6% (4.2 to 21.8%) | Difference between complication rates attributable to higher number of postoperative complications in laparotomy group |
| Carley et al., 2002221 | 106 | 44 laparotomy; 62 laparoscopy; menopausal status not reported | Laparotomy: 4.6% (0.7 to 16.7%); Laparoscopy: 0% (0 to 8.6%) | |
| Chapron et al., 1997222 | 186 | 121 laparoscopy, 65 laparotomy; 43% postmenopausal | Laparotomy: 15.4% (8.9 to 27.0%); Laparoscopy: 8.3% (4.6 to 15.0%) | Patients with high suspicion of malignancy went directly to laparotomy; results not analyzed by “intention to treat” (19 of laparotomy patients started as laparoscopy); 13.6% of laparoscopies converted to laparotomy |
| Laparoscopy only | ||||
| Childers et al., 199632 | 138 | Not described in detail; age range 9–91 | 10.1% (6.2 to 16.7%) | Length of followup not given for benign cases; gynecologic oncology service; results not stratified by age or menopausal status; 8.0% conversion to laparotomy |
| Canis et al., 1994219 | 757 | 11.4% postmenopausal | 1.1% (0.53 to 2.1%) | Mean followup 42 months (range 3–153 months) |
| Dottino et al., 199937 | 160 | 53% postmenopausal | 7.5% (4.3 to 12.9%) | Gynecologic oncology service |
| Marana et al., 2004230 | 620 | All less than 40 years old | 0.9% (0.4 to 2.0%) | Mean followup 30 months; single surgeon |
| Parker et al., 199441 | 61 | 100% postmenopausal | 3.3% (0.4 to 12.3%) | Masses “presumptively benign” based on imaging, exam, clinical history; 4.9% conversion |
| Sadik et al., 1999232 | 220 | 3.2% postmenopausal | 0.9% (0.06 to 3.5%) | Malignant masses “excluded from study” |
| Chi et al., 2004223 | 146 | Menopausal status not reported; median age 54 | Mortality 2.5% (0.5 to 6.3%); Morbidity 22.1% (15.1 to 32.7%) | Clinical history not described; not clear if other conditions besides adnexal mass included |
| Havrilesky et al., 2003226 | 396 | 37.2% postmenopausal | Laparoscopy 8.3% (6.0 to 11.6%) | Risk of complication increased with concurrent hysterectomy |
| Lok et al., 2000228 | 513 | 5.5% postmenopausal | Laparoscopy 13.3% (10.6 to 16.6%) | No malignancies; 75.% symptomatic |
| Mann and Reich, 1992229 | 44 | 100% postmenopausal | Laparoscopy 4.6% (0.7 to 16.7%) | 1/44 had cancer |
| Parker and Proietto, 1997231 | 86 | Menopausal status not reported | Laparoscopy 22.1% (15.1 to 32.7%) | 1/86 had cancer |
| Serur et al., 2001233 | 100 | 49% postmenopausal | Laparoscopy 10% (5.6 to 19.0%) | - |
| Shalev et al., 1994234 | 55 | 100% postmenopausal | Laparoscopy 10.9% (5.2 to 22.9%) | - |
| Tarik and Fehmi, 2004237 | 1478 | Menopausal status not reported (but mean age 30) | Laparoscopy: diagnostic procedures 1.8% (0.8 to 3.8%); minor procedures 1.4% (0.8 to 2.3%) | Proportion with preoperative diagnosis of adnexal mass not reported |
| Van Herendael et al., 1995238 | 121 | Menopausal status not reported | Laparoscopy: 1.7% (0.1 to 6.4%) | - |
Abbreviation: CI = confidence interval
| Diagnosis and procedure | Number of discharges | Died | Mortality rate | Complications | Complication rate |
|---|---|---|---|---|---|
| OVARIAN CANCER | 118,042 | 7099 | 6.0% | 515 | 0.4% |
| Laparoscopy (no ovarian procedures) | 222 | 5 | 2.3% | 0 | 0.0% |
| Laparoscopy plus conservative ovarian procedure | 27 | 0 | 0.0% | 0 | 0.0% |
| Laparoscopy plus oophorectomy | 16 | 0 | 0.0% | 0 | 0.0% |
| Laparotomy (no ovarian procedure) | 566 | 11 | 1.9% | 5 | 0.9% |
| Laparotomy plus conservative ovarian procedure | 68 | 0 | 0.0% | 0 | 0.0% |
| Laparotomy plus oophorectomy | 36 | 0 | 0.0% | 0 | 0.0% |
| OTHER ADNEXAL CANCER | 780 | 15 | 1.9% | 5 | 0.6% |
| Laparoscopy (no ovarian procedures) | 0 | 0 | 0.0% | 0 | 0.0% |
| Laparoscopy plus conservative ovarian procedure | 0 | 0 | 0.0% | 0 | 0.0% |
| Laparoscopy plus oophorectomy | 0 | 0 | 0.0% | 0 | |
| Laparotomy (no ovarian procedure) | 15 | 15 | 100.0% | 0 | 0.0% |
| Laparotomy plus conservative ovarian procedure | 0 | 0 | 0% | 0 | |
| Laparotomy plus oophorectomy | 0 | 0 | 0% | 0 | |
| BENIGN OVARIAN NEOPLASM | 145,024 | 255 | 0.2% | 964 | 0.7% |
| Laparoscopy (no ovarian procedures) | 1,560 | 5 | 0.3% | 35 | 2.2% |
| Laparoscopy plus conservative ovarian procedure | 75 | 0 | 0.0% | 0 | 0.0% |
| Laparoscopy plus oophorectomy | 24 | 0 | 0.0% | 0 | 0.0% |
| Laparotomy (no ovarian procedure) | 700 | 4 | 0.6% | 16 | 2.3% |
| Laparotomy plus conservative ovarian procedure | 72 | 0 | 0.0% | 0 | 0.0% |
| Laparotomy plus oophorectomy | 31 | 0 | 0.0% | 0 | 0.0% |
| PELVIC MASS | 13,625 | 30 | 0.2% | 60 | 0.4% |
| Laparoscopy (no ovarian procedures) | |||||
| Laparoscopy plus conservative ovarian procedure | 41 | ||||
| Laparoscopy plus oophorectomy | |||||
| Laparotomy (no ovarian procedure) | 35 | 5 | 14.3% | ||
| Laparotomy plus conservative ovarian procedure | |||||
| Laparotomy plus oophorectomy | |||||
| OVARIAN CYSTS | 474,485 | 376 | 0.08% | 3045 | 0.6% |
| Laparoscopy (no ovarian procedures) | 5,508 | | 0.00% | 65 | 1.2% |
| Laparoscopy plus conservative ovarian procedure | 274 | | 0.00% | | 0.0% |
| Laparoscopy plus oophorectomy | 173 | | 0.00% | | 0.0% |
| Laparotomy (no ovarian procedure) | 1,429 | 79 | 5.53% | 19 | 1.3% |
| Laparotomy plus conservative ovarian procedure | 99 | | 0.00% | | 0.0% |
| Laparotomy plus oophorectomy | 86 | | 0.00% | | 0.0% |
| PARA-OVARIAN CYST | 21,807 | 5 | 0.0% | 92 | 0.4% |
| Laparoscopy (no ovarian procedures) | 271 | | 0.0% | | 0.0% |
| Laparoscopy plus conservative ovarian procedure | 24 | 0 | 0.0% | 0 | 0.0% |
| Laparoscopy plus oophorectomy | 9 | 0 | 0.0% | 0 | 0.0% |
| Laparotomy (no ovarian procedure) | 61 | 10 | 16.4% | 0 | 0.0% |
| Laparotomy plus conservative ovarian procedure | 5 | 0 | 0.0% | 0 | 0.0% |
| Laparotomy plus oophorectomy | 5 | | 0.0% | | 0.0% |
| PELVIC INFLAMMATORY DISEASE | 430,027 | 439 | 0.1% | 4793 | 1.1% |
| Laparoscopy (no ovarian procedures) | 7,184 | 4 | 0.1% | 150 | 2.1% |
| Laparoscopy plus conservative ovarian procedure | 445 | 0 | 0.0% | 9 | 2.0% |
| Laparoscopy plus oophorectomy | 159 | 0 | 0.0% | 5 | 3.1% |
| Laparotomy (no ovarian procedure) | 2,129 | 10 | 0.5% | 53 | 2.5% |
| Laparotomy plus conservative ovarian procedure | 160 | 0 | 0.0% | 0 | 0.0% |
| Laparotomy plus oophorectomy | 45 | 0 | 0.0% | 0 | 0.0% |
| NORMAL PELVIS | 108.8 | 0 | 0 | 0 | 0 |
We identified two studies that reported on the sensitivity and specificity of intraoperative frozen section done to determine pathologic diagnosis.220, 236 They reported similar findings. Both studies defined low malignant potential tumors as cancer. Canis et al.220 reported a sensitivity of 92.2 percent and a specificity of 92.2 percent in 141 women (29.8 percent postmenopausal, 35 percent with cancer or low malignant potential tumors). Tangjitgamol et al.236 estimated similar values, with a reported sensitivity of 91.3 percent and specificity of 93.3 percent in 212 women (menopausal status not reported, cancer prevalence 77 percent). Defining low malignant potential cancers as benign decreased sensitivity in both cases.
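Although the two studies report similar sensitivity and specificity, the very different cancer prevalence in their populations implies different predictive values. The sketch below applies Bayes' rule to the reported figures; it is an illustrative calculation, not a result reported by either study, and it assumes the stated prevalence can be treated as the pretest probability. Function and variable names are hypothetical.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Return (PPV, NPV) from sensitivity, specificity, and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Figures as reported in the text for the two frozen section studies.
studies = {
    "Canis et al.": (0.922, 0.922, 0.35),
    "second study": (0.913, 0.933, 0.77),
}

for name, (sens, spec, prev) in studies.items():
    ppv, npv = predictive_values(sens, spec, prev)
    print(f"{name}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

Under these assumptions, a negative frozen section is considerably less reassuring in the higher-prevalence population, even though the test characteristics are nearly identical.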
We identified only one article that addressed the potential impact of surgical management of benign cysts on fertility. Somigliana et al.235 followed 32 women who received ovarian stimulation after removal of an endometriotic cyst. The mean number of follicles observed in the ovary where the cyst had been removed (2.0 ± 1.5) was significantly lower than in the contralateral ovary (4.2 ± 2.5), suggesting that the surgical procedure may have led to decreased ovarian reserve. An alternative explanation is that the cyst itself had an adverse effect on ovarian reserve.
Ideally, reports of adverse outcomes of diagnostic surgery for adnexal masses would be divided into four separate categories, based on preoperative symptoms and postoperative findings: (1) women with symptomatic masses which ultimately proved malignant; (2) women with symptomatic masses which ultimately proved benign; (3) women with asymptomatic masses which ultimately proved malignant; and (4) women with asymptomatic masses which ultimately proved benign. For the first three groups, the operative procedure could be considered appropriate even in the event of morbidity, since there is some benefit (primary surgical therapy for malignancy, or management of symptomatic nonmalignant adnexal pathology) to be gained from surgical diagnosis and treatment. For women with asymptomatic benign masses, there are theoretical benefits for detecting some benign masses, including (1) prevention of subsequent malignant transformation, (2) avoidance of rupture which, for certain benign masses (endometrioma and mature teratoma) could cause acute symptoms, (3) easier surgical management, with fewer complications, compared to management of a larger symptomatic mass, (4) avoidance of torsion (twisting of the adnexa) and emergent surgical management and (5) avoidance of effects on fertility, either from the underlying condition itself or from more extensive surgery for a larger mass. However, we did not identify any evidence for these benefits; the probabilities of these potential benefits also would differ widely depending on the underlying pathology and natural history of a particular mass, the patient's age and reproductive status, and other comorbidities.
Unfortunately, neither the literature nor available discharge data allow estimates of the probabilities of outcomes based on initial presentation. In the case of the literature, this is because of a lack of reporting of the clinical path by which patients come to undergo surgery. In the case of discharge data, it is because of the inherent limitations of the International Classification of Diseases, Ninth Revision (ICD-9) coding. Even if more recent data on ambulatory surgery were available, it would still be limited by coding.
Mortality for laparoscopic management of adnexal masses at experienced centers appears to be quite low, although the upper bound of this low rate is unclear.
Patient characteristics that determine risk of morbidity are unclear, although the need for more extensive procedures appears to increase the risk. Laparoscopy may have a lower morbidity rate than laparotomy, but this appears to be due, at least in part, to different patient selection criteria and surgical procedures performed.
Two small studies suggest that the false negative rate of intraoperative frozen section diagnosis is approximately eight percent, and the false positive rate is approximately five to seven percent. Whether either type of false result has a significant impact on outcome is unclear.
There is suggestive evidence that removal of a cyst in premenopausal women may affect ovarian reserve, potentially affecting fertility and/or age of menopause, but the underlying pathologic process may also play a role. More data are needed.
There are no data to allow estimation of the risks of a diagnostic procedure in the patient with an asymptomatic mass, or to assess the benefits of surgery in that patient compared to the risk of malignancy.
Question 7 is: What are the estimated trade-offs resulting from various strategies for evaluation of the adnexal mass?
A formal decision analytic approach is often quite helpful for synthesizing evidence coming from a range of sources, of varying quality, and of varying degrees of precision in estimates. Such models are also helpful in identifying which parameters are most important, in order to prioritize future research. Ideally, the underlying natural history of the disease in question can be modeled, with the impact of subsequent clinical interventions estimated based on test characteristics, effectiveness and morbidity from treatment, patient preferences, etc. In addition, the effect of varying both the incidence and natural history of ovarian cancer based on risk factors such as genetic predisposition can also be taken into account if adequate data are available. For example, such models have been quite helpful in exploring the impact of various interventions for cervical cancer prevention.240 In addition, data from currently ongoing trials of ovarian cancer screening will also provide valuable data on natural history. 241
Because of the methodological limitations of the literature on management of adnexal masses cited in the previous sections, a formal decision analysis does not seem appropriate at this time. In order to illustrate some of the key areas for future research, we did a simple estimate of the expected outcomes of several strategies for evaluation of the adnexal mass based on the findings of this review. Because models will ultimately need to incorporate the natural history of ovarian cancer, either to evaluate screening or to estimate the consequences of false negative diagnoses, we also performed a literature review of existing models of the natural history of ovarian cancer and the impact of screening or testing and developed an alternative model.
| Parameter | Value |
|---|---|
| Prevalence of adnexal masses in postmenopausal women (Question 1) | Malignant: 0.1% |
| | Benign: 1.0% |
| Sensitivity of the pelvic examination to detect adnexal masses (Question 2) | 0.45 |
| Specificity of the pelvic examination to detect adnexal masses (Question 2) | 0.90 |
| Sensitivity of combined morphology and Doppler (Question 3) | 0.86 |
| Specificity of combined morphology and Doppler (Question 3) | 0.91 |
| Note: We assumed that the specificity of ultrasound for determining the absence of pelvic mass was 100%. | |
| Sensitivity of CA-125 in postmenopausal women (Question 3) | 0.80 |
| Specificity of CA-125 in postmenopausal women (Question 3) | 0.87 |
| Sensitivity of RMI (Question 4) | 0.74 |
| Specificity of RMI (Question 4) | 0.91 |
Abbreviation: RMI = Risk of Malignancy Index
At the initial pelvic examination, the probability of detecting a mass equals:
Probability of true positive test + Probability of false positive test, or (Prevalence of mass * Test sensitivity) + (1-Prevalence of mass)*(1-Test Specificity)
Similarly, the probability of a negative test equals:
Probability of true negative + Probability of false negative, or (1-Prevalence)*Test Specificity + Prevalence*(1-Sensitivity)
At the time of ultrasound, the “prevalence” of disease is equal to the positive predictive value of the preceding test, the pelvic examination, or:
Probability of true positive pelvic/(Probability of true positive pelvic + Probability of false positive pelvic)
Similar calculations were made for each test or combination of tests.
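As a minimal sketch of these calculations, the following Python applies the formulas to the pelvic examination using the values in the parameter table. The function and variable names are illustrative assumptions, not part of the original analysis. As in the report's tables, sensitivity is applied to any mass (malignant or benign) and specificity to women without a mass.

```python
# Parameter values from the table above; names are illustrative only.
COHORT = 100_000
PREV_MALIGNANT = 0.001  # 0.1%
PREV_BENIGN = 0.010     # 1.0%
SENS_PELVIC = 0.45
SPEC_PELVIC = 0.90


def p_positive(prevalence, sensitivity, specificity):
    """Probability of a positive test: true positives plus false positives."""
    return prevalence * sensitivity + (1 - prevalence) * (1 - specificity)


def p_negative(prevalence, sensitivity, specificity):
    """Probability of a negative test: true negatives plus false negatives."""
    return (1 - prevalence) * specificity + prevalence * (1 - sensitivity)


def ppv(prevalence, sensitivity, specificity):
    """Positive predictive value: true positives over all positives."""
    return prevalence * sensitivity / p_positive(prevalence, sensitivity, specificity)


# Prevalence of any adnexal mass (malignant or benign).
prev_mass = PREV_MALIGNANT + PREV_BENIGN

pos = p_positive(prev_mass, SENS_PELVIC, SPEC_PELVIC)
neg = p_negative(prev_mass, SENS_PELVIC, SPEC_PELVIC)
positives = COHORT * pos  # women with a positive pelvic examination
negatives = COHORT * neg  # women with a negative pelvic examination

# "Prevalence" of a mass carried forward to the next test.
ppv_mass = ppv(prev_mass, SENS_PELVIC, SPEC_PELVIC)

# Malignancy among test positives and among test negatives.
cancers_detected = COHORT * PREV_MALIGNANT * SENS_PELVIC
cancers_missed = COHORT * PREV_MALIGNANT * (1 - SENS_PELVIC)
malignancy_among_positives = cancers_detected / positives
npv_malignancy = 1 - cancers_missed / negatives

print(f"positive exams: {positives:,.0f}; negative exams: {negatives:,.0f}")
print(f"malignancy among positives: {malignancy_among_positives:.1%}")
print(f"negative predictive value for malignancy: {npv_malignancy:.2%}")
```

Under these inputs, roughly 10 percent of examinations are positive, and fewer than 1 in 200 of those positives is a cancer.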
| | Cancer (underlying pathology) | Benign mass (underlying pathology) | Normal (underlying pathology) | Total | Prevalence of malignancy among test positives | Proportion of all tests positive | Missed cancers |
|---|---|---|---|---|---|---|---|
| Baseline cases | 100 | 1,000 | 98,900 | 100,000 | 0.1% | ||
| Pelvic exam | |||||||
| Positive | 45 | 450 | 9,890 | 10,385 | |||
| Negative | 55 | 550 | 89,010 | 89,615 | 0.4% | 10.4% | 55 |
| STRATEGY: CA-125 only | |||||||
| CA-125 | |||||||
| Positive | 36 | 59 | 1,286 | 1,380 | |||
| Negative | 9 | 392 | 8,604 | 9,005 | 2.6% | 15.3% | 9 |
| Surgery | |||||||
| Positive | 36 | 36 | |||||
| Negative | 59 | 1,286 | 1,345 | 2.6% | |||
| STRATEGY: Morphology/Doppler only | |||||||
| Morphology/Doppler | |||||||
| Positive | 39 | 41 | 0 | 80 | |||
| Negative | 6 | 410 | 9,890 | 10,306 | 49.8% | 0.8% | 6 |
| Surgery | |||||||
| Positive | 39 | 0 | 0 | 39 | |||
| Negative | 0 | 41 | 0 | 41 | 49.8% | ||
Some numbers may not add up correctly because of rounding. Abbreviation: CA-125 = cancer antigen 125
| | Cancer (underlying pathology) | Benign mass (underlying pathology) | Normal (underlying pathology) | Total | Prevalence of malignancy among positive tests | Proportion of all tests positive | Missed cancers |
|---|---|---|---|---|---|---|---|
| Baseline cases | 100 | 1,000 | 98,900 | 100,000 | 0.1% | ||
| Pelvic exam | |||||||
| Positive | 45 | 450 | 9,890 | 10,385 | |||
| Negative | 55 | 550 | 89,010 | 89,615 | 0.4% | 10.4% | 55 |
| STRATEGY: CA-125, followed by morphology/Doppler | |||||||
| CA-125 | |||||||
| Positive | 36 | 59 | 1,286 | 1,380 | |||
| Negative | 9 | 392 | 8,604 | 9,005 | 2.6% | 13.2% | 9 |
| Morphology/Doppler | |||||||
| Positive | 32 | 5 | 0 | 37 | |||
| Negative | 4 | 53 | 1,286 | 1,343 | 86.5% | 2.7% | 4 |
| Surgery | |||||||
| Positive | 32 | 0 | 0 | 32 | |||
| Negative | 0 | 5 | 0 | 5 | |||
| STRATEGY: Morphology/Doppler followed by CA-125 | |||||||
| Morphology/Doppler | |||||||
| Positive | 40 | 41 | 0 | 81 | |||
| Negative | 5 | 410 | 9,890 | 10,305 | 49.4% | 0.8% | 5 |
| CA-125 | |||||||
| Positive | 32 | 5 | 0 | 37 | |||
| Negative | 8 | 35 | 0 | 43 | 86.5% | 45.7% | 8 |
| Surgery | |||||||
| Positive | 32 | 32 | |||||
| Negative | 0 | 5 | 0 | 5 | 86.5% | ||
| STRATEGY: RMI (morphology + CA-125 + menopausal status) | |||||||
| RMI | |||||||
| Positive | 33 | 41 | 0 | 74 | |||
| Negative | 12 | 410 | 9,890 | 10,312 | 44.6% | 13.2% | 9 |
| Surgery | |||||||
| Positive | 33 | 0 | 0 | 33 | |||
| Negative | 0 | 41 | 0 | 41 | 44.6% | ||
Some numbers may not add up correctly because of rounding. Abbreviations: CA-125 = cancer antigen 125; RMI = Risk of Malignancy Index
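The cascade for one serial strategy in the table above (CA-125 followed by morphology/Doppler) can be sketched in Python as follows. The helper `apply_test` and the dictionary layout are illustrative assumptions. The counts are left unrounded, so they can differ by about one case from the published rows.

```python
# Women with a positive pelvic examination, per 100,000 examined (table above).
after_pelvic = {"cancer": 45.0, "benign": 450.0, "normal": 9890.0}

# Summary estimates from the parameter table.
CA125 = {"sens": 0.80, "spec": 0.87}
ULTRASOUND = {"sens": 0.86, "spec": 0.91}


def apply_test(group, test, normal_spec=None):
    """Split a group into test-positive and test-negative counts.

    Sensitivity is applied to cancers and specificity to benign masses.
    Women with no mass use `normal_spec` if given, else the same specificity.
    """
    spec_normal = test["spec"] if normal_spec is None else normal_spec
    positive = {
        "cancer": group["cancer"] * test["sens"],
        "benign": group["benign"] * (1 - test["spec"]),
        "normal": group["normal"] * (1 - spec_normal),
    }
    negative = {key: group[key] - positive[key] for key in group}
    return positive, negative


# Step 1: CA-125 for every woman with a positive pelvic examination.
ca_pos, ca_neg = apply_test(after_pelvic, CA125)

# Step 2: morphology/Doppler only for CA-125 positives. Ultrasound is assumed
# never to call a mass-free pelvis positive (specificity 100%, per the note
# in the parameter table).
us_pos, us_neg = apply_test(ca_pos, ULTRASOUND, normal_spec=1.0)

total_tests = sum(after_pelvic.values()) + sum(ca_pos.values())
surgeries = sum(us_pos.values())
missed_cancers = ca_neg["cancer"] + us_neg["cancer"]

print(f"total tests: {total_tests:,.0f}")
print(f"surgeries: {surgeries:.1f}; cancers missed after pelvic exam: {missed_cancers:.1f}")
```

The unrounded results (about 36 surgeries and 14 missed cancers) are close to, but not identical with, the tabulated 37 and 13, which may reflect intermediate rounding in the original calculations.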
| Strategies | CA-125 (single test) | Ultrasound* (single test) | CA-125 followed by ultrasound (serial tests) | Ultrasound followed by CA-125 (serial tests) | Risk of Malignancy Index (parallel tests) |
|---|---|---|---|---|---|
| Total tests | 10,385 | 10,385 | 11,765 | 10,466 | 20,770 |
| Total missed cancers | 9 | 9 | 13 | 13 | 9 |
| Total surgeries | 1,380 | 80 | 37 | 37 | 74 |
Abbreviation: CA-125 = cancer antigen 125
| Strategies | CA-125 (single test) | Ultrasound* (single test) | CA-125 followed by ultrasound (serial tests) | Ultrasound followed by CA-125 (serial tests) | Risk of Malignancy Index (parallel tests) |
|---|---|---|---|---|---|
| Total tests | 1,100 | 1,100 | 1,317 | 1,287 | 2,200 |
| Total missed cancers | 20 | 15 | 32 | 32 | 26 |
| Total surgeries | 197 | 184 | 90 | 90 | 155 |
Abbreviation: CA-125 = cancer antigen 125
This simple “model” illustrates several key points:
The prevalence of malignancy increases as additional diagnostic tests are performed. This is certainly clinically appropriate and reflects the effects of sequential testing strategies. However, specificity and, to some extent, sensitivity for many of the tests reviewed appear to vary with underlying disease prevalence. Thus, estimates for test characteristics calculated at one point in the clinical pathway may not be appropriate for other points.
Despite a poor sensitivity of 45 percent, the negative predictive value of a negative pelvic examination for malignancy is quite high (99.94 percent). The reassurance provided by a “normal” exam reflects the low prevalence of ovarian cancer in the population, rather than the intrinsic value of the test in discriminating benign from malignant. Conversely, the positive predictive value is only 0.4 percent, despite a specificity of 90 percent.
In order to judge the trade-offs between detection of masses that ultimately prove malignant compared with the risks of diagnostic surgery, we would need better estimates of morbidity and mortality within different diagnostic categories - as noted previously, these do not exist.
The most “efficient” strategy in terms of number of tests and surgeries is serial testing with ultrasound followed by CA-125; however, this misses four more cancers than parallel testing using the RMI (13 versus 9). Parallel testing, in turn, doubles the number of tests to be performed. A formal cost-effectiveness analysis requires significantly more data on test characteristics and ovarian cancer natural history, as well as the morbidity of surgical management.
Modeling parallel testing beyond the data in scoring systems is difficult. Besides requiring specific assumptions about how results that were positive for one test but negative for another would be managed, one would also need to know if the sensitivity and specificity of each test were independent or correlated in some way. For example, it seems likely that the sensitivity of both ultrasound and CA-125 would be greater for larger masses than for smaller masses.
In addition, for any screening modality, there needs to be evidence that early detection reduces disease-specific morbidity and mortality. Judging the impact of false negative results also requires data on the natural history of ovarian cancer. Since data from large trials are still pending, one way to examine the potential impact of different testing strategies for both initial screening and subsequent testing is through the development of simulation models.
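To make the point about combining tests concrete, the sketch below shows how combined sensitivity and specificity behave if the two tests are assumed to be conditionally independent given disease status. As noted above, that assumption is unlikely to hold for ultrasound and CA-125, so this is an illustration rather than an estimate. The "either positive" rule is only one hypothetical way of handling discordant results and is not the RMI; the function names are illustrative.

```python
# (sensitivity, specificity) pairs from the parameter table.
ULTRASOUND = (0.86, 0.91)
CA125 = (0.80, 0.87)


def both_positive(test_a, test_b):
    """Combined test is positive only if both tests are positive.

    Assumes conditional independence of the two tests given disease status.
    """
    sensitivity = test_a[0] * test_b[0]
    specificity = 1 - (1 - test_a[1]) * (1 - test_b[1])
    return sensitivity, specificity


def either_positive(test_a, test_b):
    """Combined test is positive if at least one test is positive.

    Assumes conditional independence of the two tests given disease status.
    """
    sensitivity = 1 - (1 - test_a[0]) * (1 - test_b[0])
    specificity = test_a[1] * test_b[1]
    return sensitivity, specificity


serial_sens, serial_spec = both_positive(ULTRASOUND, CA125)
parallel_sens, parallel_spec = either_positive(ULTRASOUND, CA125)

print(f"both positive:   sensitivity {serial_sens:.3f}, specificity {serial_spec:.3f}")
print(f"either positive: sensitivity {parallel_sens:.3f}, specificity {parallel_spec:.3f}")
```

Requiring both tests to be positive trades sensitivity for specificity, and the either-positive rule does the reverse. Positive correlation between the tests would shrink both effects.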
We next review published models of the natural history of ovarian cancer.
| Parameter | Value | Range | Source |
|---|---|---|---|
| Prevalence of ovarian cancer | 28.6/100,000 | 20 to 200/100,000 | NCI monograph No. 41; 1975 |
| Percentage of prevalent cases in early stage | 50% | 20 to 80% | Assumed |
| Percentage of early stage disease diagnosed clinically | 25% | 20 to 80% | ACS Cancer Statistics 1990 |
| Sensitivity of CA-125 and TVUS (combined) for early stage disease | 45% | 20 to 80% | Literature review |
| Sensitivity of CA-125 and TVUS (combined) for late stage disease | 81% | 50 to 100% | Literature review |
| Specificity of CA-125 and TVUS | 99.95% | 96 to 100% | Literature review |
| Probability of post-laparotomy death | 0.23% | 0 to 10% | National Halothane Study JAMA 1966 |
Abbreviations: ACS = American Cancer Society; NCI = National Cancer Institute; TVUS = transvaginal ultrasound
Assumptions in the model were that survival time for clinically and screen-detected early stage disease is the same; morbidity and mortality rates associated with diagnostic laparotomy are the same for people with and without the disease; and there is no benefit gained from identifying benign disease.
The results of the analysis suggested that use of the combined strategy would result in a gain in life expectancy (compared to no screening) of one third of a day of life. No screening was preferred if the postoperative mortality rate exceeded 7.32 percent or the specificity of the test was less than 98.53 percent. An additional analysis, examining the use of testing for women aged 65+ suggested that the combined strategy would result in an average gain in life expectancy of approximately 3/4 of a day of life.
Skates and Singer244 developed a stochastic model to evaluate screening with CA-125. Key assumptions in this model included:
Stepwise progression from Stage I through Stage II through Stage III through Stage IV;
Log-normal distributions of progression rates;
Stage at clinical detection independent of duration of disease;
The coefficient of variation in stage length is constant across all stages;
Estimates for the duration of each stage were provided by two gynecologic oncologists.
In the base case, the model predicted that screening would save 3.4 years of life per detected case; of note, estimates for the gains in life expectancy for the entire population undergoing screening were not provided.
| Parameter | Estimate | Source |
|---|---|---|
| Stage of ovarian cancer | FIGO | |
| Relative stage lengths (relative to Stage 1) | 0.5, 1.333, 0.333 | Skates et al.244 FIGO stages III and IV assumed to comprise SEER stage 3 |
| Geometric mean stage length in months | 9; 4.5, 12 and 3 months | |
| Probability of disease during testing period | 0.0121 | Not stated |
| Probability of age at clinical detection | Age 50–54 - 0.153 | SEER |
| Age 55–59 - 0.184 | ||
| Age 60–64 - 0.202 | ||
| Age 65–69 - 0.179 | ||
| Age 70–74 - 0.150 | ||
| Age 75–80 - 0.132 | ||
| Probability of stage at clinical detection | Stage 1 - 0.223 | SEER |
| Stage 2 - 0.153 | ||
| Stage 3/4 - 0.624 | ||
| Point in stage at clinical detection | 0.5 of stage length | Assumed |
| Stage length distribution | Log normal (9, 4.5) | Assumed |
| TVUS sensitivity | 100% | van Nagell, CA 1990 |
| van Nagell, CA 1991 | ||
| TVUS - false positive | 1st screen 0.019; | Campbell, Br J Obstet and Gynecol 1990 |
| 2nd screen 0.010; | ||
| 3rd screen 0.006 | ||
| CA-125 level in cases | Refer to page 254 of article for formula | Skates et al.244 Einhorn, Proc Am Soc Clin Oncol 1990 |
| % of false negatives for CA-125 | 5% | Assumption |
| CA-125 specificity in women with false positive TVUS | 0.85 | Bast, Gyn Onc 1985 |
| Woolas, JNCI, 1993 | ||
| Return to normal life-expectancy post-diagnosis | 15 years | Assumption |
| Probability of death in surgery among false-positive | 0.001 | Assumption |
Abbreviations: FIGO = International Federation of Gynecology and Obstetrics; SEER = Surveillance, Epidemiology, and End Results; TVUS = transvaginal ultrasound
Six screening strategies using TVUS and CA-125, either alone or in combination, were evaluated: annual TVUS; annual CA-125, elevated (35 U/ml used for referral to laparoscopy); annual CA-125, rising or elevated (rising defined as a CA-125 level that has doubled since the last screen); annual TVUS conditional on rising or elevated CA-125; 6-month TVUS conditional on rising or elevated CA-125; and 2-year TVUS conditional on rising or elevated CA-125. Of these, the strategy of annual TVUS conditional on rising or elevated CA-125 was identified as efficient, meaning it saved an equal or greater amount of life at lower cost than the other strategies. The model was especially sensitive to assumptions about the duration of Stage I disease.
Secondary prevention of cancer mortality through screening has been remarkably effective in the case of cervical cancer. Mammography has also reduced mortality from breast cancer, although there remains some controversy. To date, although survival in early stage ovarian cancer is considerably higher than survival in later stage cancers, trials of screening have not yet demonstrated reduction in disease-specific mortality. Although the relative lack of effectiveness of ovarian cancer screening to date may reflect the lack of an appropriate test, differences in the biology and natural history of the different cancers may also result in some of the differences.
As outlined in a recent review,246 the most critical criterion for an effective screening strategy for ovarian cancer is that there is a time of sufficient duration during the development of ovarian cancer when cancer is detectable but in a stage when treatment effectiveness is high. As shown in the two most sophisticated models reviewed, estimates of the effectiveness of screening are highly dependent on assumptions about the duration of Stage I cancer. The basis for the estimates used in both models was the opinion of two clinicians; the methods used to derive these estimates were not described.
Cervical cancer is, in the majority of cases, a squamous carcinoma, which spreads primarily through direct extension and secondarily through lymphatic invasion. The most common type of ovarian cancer, on the other hand, is typically an adenocarcinoma, which spreads by dissemination of tumor cells throughout the peritoneal cavity.
One assumption commonly made in the models of ovarian cancer we identified is that ovarian cancer staging represents the natural history. Figure 24
Although this stepwise progression through stages is the case for cervical cancer, there is no evidence to suggest that tumors limited to the ovary (Stage I) must necessarily spread first to adjacent pelvic organs (Stage II) prior to spread throughout the peritoneal cavity (Stage III). Although staging systems represent the extent of disease, they are developed to help with prognosis, and to allow comparison of treatment effectiveness - there is no explicit assumption that each stage necessarily must be preceded by the next lowest one. Figure 25
Using the Markov model described in Chapter 2, we performed sensitivity analyses on progression rates and type of progression to determine if this second “model” of progression could result in similar stage distributions to observed data.
Figure 26
| | Model 1 (Stage I must progress through Stage II) | Model 2 (some Stage I can progress directly to Stage III) | Stage distribution: FIGO (local data from Skates et al.244) | Stage distribution: SEER (1995-2001) |
|---|---|---|---|---|
| Parameter estimate | ||||
| Annual probability of presenting with symptoms: Stage I | 0.095 | 0.1 | ||
| Annual probability of presenting with symptoms: Stage II | 0.095 | 0.15 | ||
| Annual probability of presenting with symptoms: Stage III | 0.7 | 0.9 | ||
| Annual probability of presenting with symptoms: Stage IV | 1 | 1 | ||
| Proportion of Stage I progressing directly to Stage III | 0 | 0.25 | ||
| Model output: stage distribution | ||||
| FIGO: | ||||
| Stage I | 19.1% | 19.6% | 25% | |
| Stage II | 8.2% | 9.3% | 8% | |
| Stage III | 54.2% | 65.2% | 52% | |
| Stage IV | 18.6% | 5.9% | 15% | |
| SEER/WHO: | ||||
| Local | 19.1% | 19.6% | 25% | 19% |
| Regional | 8.2% | 9.3% | 8% | 7% |
| Distant and unstaged | 72.8% | 71.1% | 67% | 75% |
Abbreviations: FIGO = International Federation of Gynecology and Obstetrics; SEER = Surveillance, Epidemiology, and End Results; WHO = World Health Organization
With relatively small changes in the probability of presenting with symptoms, a model that allows 25 percent of Stage I tumors to progress directly to Stage III results in stage distributions similar to observed data, and results in similar lifetime risk of ovarian cancer as the Urban model.243 In a model with multiple input parameters, a huge number of combinations of parameters can result in similar outputs. Given that estimations of the duration of the different stages of ovarian cancer are based on little empirical data, and that there are no empirical data on the natural history of ovarian cancer, further exploration of the implications for screening, and the evaluation of masses detected through screening, is warranted.
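The structural difference between the two progression models can be illustrated with a very small annual-cycle sketch. This is not the Markov model described in Chapter 2. The annual progression probability (`P_PROGRESS`) is a hypothetical placeholder, and death from other causes is ignored, so the output is not expected to reproduce the table above. Only the symptom probabilities and the 25 percent direct-progression share are taken from the table.

```python
# HYPOTHETICAL annual probability that an undetected tumor advances a stage.
# This value is an assumption for illustration; it is not from the report.
P_PROGRESS = 0.5

# Annual probability of presenting with symptoms, Stages I-IV (table above).
MODEL_1_SYMPTOMS = [0.095, 0.095, 0.7, 1.0]
MODEL_2_SYMPTOMS = [0.1, 0.15, 0.9, 1.0]


def stage_distribution(p_symptoms, p_progress, direct_to_stage3=0.0, tol=1e-12):
    """Stage distribution at clinical detection for a simple annual-cycle model.

    Each year an undetected tumor may present with symptoms; if it does not,
    it may progress. `direct_to_stage3` is the share of progressing Stage I
    tumors that skip Stage II.
    """
    undetected = [1.0, 0.0, 0.0, 0.0]  # all tumors start in Stage I
    detected = [0.0, 0.0, 0.0, 0.0]
    while sum(undetected) > tol:
        nxt = [0.0, 0.0, 0.0, 0.0]
        for stage, mass in enumerate(undetected):
            found = mass * p_symptoms[stage]
            detected[stage] += found
            remaining = mass - found
            if stage == 3:
                nxt[3] += remaining  # Stage IV cannot progress further
                continue
            moving = remaining * p_progress
            nxt[stage] += remaining - moving
            if stage == 0:
                nxt[2] += moving * direct_to_stage3
                nxt[1] += moving * (1 - direct_to_stage3)
            else:
                nxt[stage + 1] += moving
        undetected = nxt
    total = sum(detected)
    return [share / total for share in detected]


model_1 = stage_distribution(MODEL_1_SYMPTOMS, P_PROGRESS)
model_2 = stage_distribution(MODEL_2_SYMPTOMS, P_PROGRESS, direct_to_stage3=0.25)

for name, dist in (("Model 1", model_1), ("Model 2", model_2)):
    print(name, " ".join(f"{share:.1%}" for share in dist))
```

In a sketch like this, varying `P_PROGRESS` together with the symptom probabilities lets many parameter combinations produce similar stage distributions, which is the identifiability concern raised above.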
The evidence is insufficient to develop a comprehensive model to estimate the relative benefits and risks of different management strategies for evaluating the adnexal mass.
Based on summary estimates of pooled sensitivity and specificity, management strategies that use imaging as the first step for evaluating an adnexal mass detected on examination (as opposed to CA-125) are more efficient, since they exclude false positive results from further examination. Serial testing with imaging followed by CA-125 results in the fewest number of surgeries, but misses more cancers than parallel testing. Parallel testing greatly increases the number of tests required, but results in fewer missed cancers. Additional data are needed to evaluate cost-effectiveness.
Alternative assumptions about the natural history of ovarian cancer can result in modeled outcomes similar to those of published models; the implications of these assumptions should be explored further.
There are several limitations to this evidence report:
We did not review articles published in languages other than English because of a lack of resources for translation. It is possible that this led to the failure to include some relevant articles.
For our review of prevalence studies (Question 1), we excluded studies performed outside the United States. Because the report was requested by the Centers for Disease Control and Prevention (CDC) to help with development of their policies and research agenda into ovarian cancer prevention strategies, we focused on U.S. populations and reasoned that the underlying prevalence of different conditions in women with adnexal masses could well differ in potentially important ways due to differences in racial/ethnic distribution and/or environmental exposures. As discussed in Chapter 2, this is supported by wide international variation in the incidence of cancer. Variations in screening, diagnosis, and surgical management could also lead to differences in the prevalence of various conditions among women with adnexal masses. It is possible that this reasoning was incorrect, and that some relevant articles were excluded. However, some non-U.S.-based articles were reviewed for other questions, and the majority shared the same biases as U.S.-based studies (i.e., most were done immediately preoperatively).
There was considerable heterogeneity in design and patient populations among studies, and our use of a random-effects model to perform meta-analyses for some questions may have led to inaccurate estimates of pooled sensitivity and specificity. We also did not weight the results by anything other than sample size; it is possible that different results might have been obtained by weighting for study quality, for example.
In our review of data from the Nationwide Inpatient Sample, we used only specific International Classification of Diseases, Ninth Revision (ICD-9) “E” class codes to identify complications. A more exhaustive strategy (e.g., identifying procedures not typically performed at the time of diagnostic surgery, identifying blood transfusions through procedure or charge codes, including patients with cancer who underwent hysterectomy) might have revealed more complications,26 but would have required additional assumptions about the original indication for the surgery and the likely potential contribution of different aspects of the procedure to the complication (e.g., hysterectomy vs. oophorectomy).
Our exploration of alternative models for the natural history of ovarian cancer did not directly compare estimated outcomes of screening strategies to other models. However, a comprehensive evaluation of screening for ovarian cancer was beyond the scope of this report. We are currently developing the model further to conduct these analyses.
The main shortcoming of many of the papers reviewed was a failure to adequately describe the patient population, including the manner in which the adnexal mass was originally detected and subsequent evaluation. In Chapter 1, we described the importance of understanding the clinical presentation of the subjects in studies of management of adnexal masses. Because prevalence directly affects predictive values and may indirectly affect estimates of sensitivity and specificity, the probability that a patient is a true or false positive, or true or false negative, is dependent on the prevalence. In addition, the presence or absence of symptoms can affect the probability that a patient will undergo surgery if test findings indicate a benign mass, since surgery may still be the treatment of choice for the underlying condition. We were disappointed that the overwhelming majority of the studies we reviewed, relevant to all of the questions, did not adequately describe their population, so that the proportions of patients who presented with asymptomatic masses versus those with symptoms could be compared.
To be fair, there is an inherent feasibility issue in studies of diagnostic test accuracy for ovarian cancer - the ideal reference standard is histological confirmation, yet this confirmation requires surgery. Although this is a limitation of all cancer screening tests, the surgery required for a definitive diagnosis of ovarian cancer is more extensive than that for many cancers (for example, cervical, breast, and colon cancer can all be diagnosed without a requirement for general anesthesia). Especially with screening, or early in the diagnostic evaluation, the risks of surgery may be difficult to justify (especially since the low prevalence of malignancy makes the positive predictive value of tests early in the diagnostic evaluation quite low). From a research ethics perspective, it is certainly reasonable to limit diagnostic test studies to patients already scheduled for surgery. However, readers of these studies should recognize that the prevalence of malignancy will be substantially higher in preoperative patients than in patients at the time of the initial diagnosis of adnexal mass. Because test performance may be affected by prevalence, the outcomes (in terms of true and false test results) may be quite different in these two patient populations.
The same caveats hold for studies of the outcomes of surgery. Morbidity and mortality related to surgical diagnosis are influenced by the underlying diagnosis, as well as the extent of the disease (such as size of the mass, presence of adhesions from the disease process or prior unrelated surgery, or cancer stage). Interpreting surgical outcomes from studies that do not provide relevant clinical information is difficult; at the least, generalizability is a major concern. Lack of relevant clinical information is a particular problem with administrative databases, which otherwise have the attraction of large sample size and better generalizability.26
An even more basic shortcoming was the failure to describe potential differences in study results stratified by age or menopausal status. Given the clear and widely recognized relationship between age and ovarian cancer risk, all studies in this area should present results in a way that allows separate estimation of outcome by age/menopausal status.
Few of the studies we reviewed included a priori sample size calculations. Use of confidence intervals for parameter estimates was uncommon. In studies of scoring systems, there were often too few cases of cancer for the number of variables included in the original models.
Relatively few of the diagnostic studies reported whether those interpreting test results were blinded to either clinical presentation or ultimate diagnosis. This could clearly have an impact, particularly in studies of the bimanual pelvic examination; the finding that specificity decreased as prevalence increased suggests that the threshold for identifying a mass as cancer is lower if the clinical suspicion - based on other factors such as patient age, menopausal status, or history - is higher. Although this may be appropriate clinically, it results in biased estimates of test performance.
Few studies addressed the potential impact of observer variability on the precision of test characteristics.
As discussed in more detail in the section on Question 7, ovarian cancer has been implicitly assumed to progress through a series of stages in a way analogous to cervical cancer. Alternative models are biologically plausible, and mathematical models can be “fitted” to match reported data under a variety of scenarios. Since existing models already show that the effectiveness of screening is dependent on assumptions about the length of Stage I, further exploration of the impact of varying assumptions about natural history is warranted.
The most important parameter in these models, stage duration, is inherently unknowable; however, the source for the parameter estimate in the two most sophisticated models was “personal communications” with two gynecologic oncologists. At the least, more formal methods of eliciting expert opinion are probably warranted for future modeling studies.
The prevalence of malignancy, even in postmenopausal women, is low - approximately 0.1 percent (1 in 1,000) in large screening studies in the United States. The potential for screening to reduce morbidity and mortality is currently being tested in at least three large trials; these trials should also provide valuable data on disease prevalence and the effectiveness of various followup strategies.
Until the results of the large screening trials are available, many, if not most, women with asymptomatic adnexal masses will have had the mass detected as part of a routine health maintenance examination.
The bimanual pelvic examination appears to have a sensitivity of less than 60 percent, whether for detecting adnexal masses in general or for distinguishing benign from malignant masses. Based on the best pooled estimate of sensitivity (45 percent) and a prevalence of 0.1 percent, a normal-risk, asymptomatic, postmenopausal woman with a normal pelvic examination has a 99.94 percent chance of not having cancer, even though over half of the cancers would be missed. This is due to the low prevalence of ovarian cancer, since, even without the test, her probability of not having cancer is 99.9 percent. Given these test characteristics, the value of the pelvic examination in reducing ovarian cancer morbidity and mortality appears to be extremely limited, at best. Although there may be some rationales for an annual bimanual examination (discussed in Chapter 5), ovarian cancer screening is not one of them.
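The post-test probability quoted above follows from Bayes' theorem. The sketch below reproduces the arithmetic using the sensitivity and prevalence stated in the text; the specificity is not given in this passage, so the value of 90 percent used here is an assumption made only for illustration.

```python
def negative_predictive_value(prevalence, sensitivity, specificity):
    """Probability of no disease given a negative test result."""
    true_negatives = (1 - prevalence) * specificity
    false_negatives = prevalence * (1 - sensitivity)
    return true_negatives / (true_negatives + false_negatives)


prevalence = 0.001   # 0.1 percent, from the text
sensitivity = 0.45   # pooled estimate, from the text
specificity = 0.90   # ASSUMED for illustration; not stated in this passage

npv = negative_predictive_value(prevalence, sensitivity, specificity)
print(f"probability of no cancer before the exam: {1 - prevalence:.2%}")
print(f"probability of no cancer after a normal exam: {npv:.2%}")
```

Because the prevalence is so low, the result is insensitive to the assumed specificity: a normal examination moves the probability of no cancer only in the second decimal place.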
Of the various diagnostic imaging modalities, either a combination of ultrasound morphology and Doppler velocimetry, or magnetic resonance imaging (MRI), had the best combination of sensitivity and specificity for distinguishing benign from malignant disease. If confirmed by direct comparison, cost-effectiveness might be the most important determinant of which would be the optimal diagnostic procedure. Because the specificity of cancer antigen 125 (CA-125) is high in postmenopausal women, it is helpful in ruling in disease.
Additional validation of scoring systems in new populations is required before widespread adoption can be recommended.
The most effective and efficient method for following patients who have been classified as having a benign mass is unclear, although unilocular cysts less than 10 cm appear to have a very low risk of malignancy.
The risks of diagnostic laparoscopy or laparotomy, particularly in asymptomatic women who ultimately prove to have a benign lesion, are unclear. Overall morbidity appears to be low in reported series, but these are subject to numerous biases, particularly regarding selection for laparoscopy. Two small randomized trials suggest higher short-term morbidity with laparotomy compared to laparoscopy, but differences between the two groups raise the possibility of confounding.
Based on our pooled estimates of sensitivity and specificity, for postmenopausal women with an adnexal mass detected by pelvic examination, serial testing with either ultrasound morphology plus Doppler imaging or MRI (which had similar sensitivities and specificities), followed by CA-125, resulted in the most efficient combination of number of tests, missed cancers, and surgeries. Parallel testing and using a scoring system such as the Risk of Malignancy Index resulted in fewer missed cancers than serial testing, but more overall tests and more surgeries. Additional data are needed to refine these estimates, to include the morbidities of the tests and surgeries, and to perform cost-effectiveness analyses. Either combined strategy is preferable to using imaging alone or CA-125 alone.
We cannot directly compare these results to the joint guidelines of the Society of Gynecologic Oncologists (SGO) and American College of Obstetricians and Gynecologists (ACOG) on which patients to refer to a gynecologic oncologist (247) because the data were not available to replicate their findings. However, our results are consistent with the guidelines, which recommend referral of postmenopausal women with a CA-125 level above 35 U/mL, the presence of ascites, or evidence of adnexal or distant metastasis.
Alternative assumptions and parameter estimates can be used to generate predicted cancer incidences similar to those seen in published models of the natural history of ovarian cancer. In order to better estimate the potential impact of different strategies for ovarian cancer screening, and for managing masses detected through screening or presenting with symptoms, additional models that explore the implications of alternative natural history assumptions are needed. Data from ongoing screening trials may provide estimates of many of the currently unknown parameters.
This section outlines research priorities identified through the review, both in terms of fundamental gaps in knowledge and in addressing methodological issues of existing studies.
Our ability to stratify results by relevant patient characteristics, or to compare the potential effect of patient characteristics on different results from different studies, was limited by the lack of information in most studies. We would suggest that future studies relevant to the diagnosis and management of adnexal masses provide data on, and present results stratified by, the following minimum characteristics:
Patient age and/or menopausal status
Patient body mass index
Patient race and ethnicity
Presence or absence of risk factors for ovarian cancer, particularly family history
Means by which the adnexal mass was initially diagnosed—pelvic examination or imaging
Reason for the initial examination which led to diagnosis of mass: symptoms referable to pelvic mass or ovarian cancer, examination for other symptoms, asymptomatic screening for ovarian cancer, or asymptomatic screening for other conditions
Large scale screening trials will provide some data on the prevalence of different types of masses.
Administrative data from surgical procedures may provide crude estimates, but some important information (like stage and grade of cancer, or histologic subtype) will likely be missing. In addition, relevant clinical data on presence or absence of symptoms and the diagnostic pathway leading to diagnosis will likely be missing. The best resource for obtaining the necessary data would likely be a large health maintenance organization (HMO) or third-party payer, which would allow comparison of inpatient and outpatient records, and followup of patients after diagnosis. Medicare data would provide similar information for women 65 and older.
Separate reporting of the prevalence of different types of masses among women with and without symptoms would be helpful for clinical decisionmaking.
Ideally, tests would be evaluated at the stage in the clinical pathway in which they are to be used.
Since this means that many women who have a negative test will not undergo the reference standard, careful attention should be paid to development of alternative reference standards, including definitions of appropriate length of followup.
More direct comparisons of alternative tests should be performed; existing studies are frequently underpowered to detect clinically meaningful differences, or to establish equivalence. Based on pooled analyses, both magnetic resonance imaging (MRI) and combined ultrasound evaluation of morphology and Doppler velocimetry have attractive sensitivity and specificity. Only two studies, with a total of 200 subjects, have directly compared these modalities in the same patient population (91, 100). In both of these studies, MRI was less sensitive but more specific than combined morphology/Doppler. More precise comparative estimates should be obtained.
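A rough sense of the numbers involved comes from the standard normal-approximation sample size formula for comparing two independent proportions. The sketch below is a simplification: it treats the two sensitivities as coming from separate groups of women with cancer, whereas a paired design (both tests in the same women) would need a different calculation. The sensitivities, significance level, and power are hypothetical inputs.

```python
from math import ceil, sqrt
from statistics import NormalDist


def cases_per_group(p1, p2, alpha=0.05, power=0.80):
    """Women with cancer needed per group to detect p1 vs p2 (two-sided test)."""
    if p1 == p2:
        raise ValueError("the two proportions must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


# Hypothetical sensitivities of 85% and 95%.
print(cases_per_group(0.85, 0.95))
```

Because sensitivity is estimated only among women who have cancer, the total number of women enrolled would need to be several times larger than the figure returned, depending on the prevalence of malignancy in the study population.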
There is a paucity of studies on positron emission tomography (PET) compared to other imaging modalities. Given that the Centers for Medicare & Medicaid Services (CMS) is now reimbursing for PET scans done within the setting of a clinical trial, there is an excellent opportunity for high-quality studies which avoid the deficiencies outlined in this report.
Although discriminating between benign and malignant lesions is the highest priority in most clinical situations, estimates of the sensitivity and specificity of various imaging modalities for specific nonmalignant lesions (endometriomas, mature teratomas, etc.) would be helpful for developing comprehensive management strategies, particularly in conjunction with good data on prevalence in premenopausal women. We identified multiple articles relevant to this question during our search, which were excluded because they were not relevant to the main study questions. Although many of the methodological issues identified here would be issues with these studies, a systematic review of this literature would have value.
New tumor markers should continue to undergo evaluation as diagnostic tests as they are identified, using appropriate methodological standards.
Validation studies in new populations are needed.
Attention should be paid to adequate sample size.
Additional studies, with clear definitions for “benign” lesions and clear protocols for followup, with documentation of loss to followup, are needed. Because by definition these types of studies will not have histological confirmation of all test results, estimates of test performance from such studies may have some bias.
As with studies of prevalence, both currently published studies (mostly case series) and administrative data have significant deficiencies. Case series would be improved by clearer description of the clinical pathway by which patients ended up undergoing surgery, as well as by providing relevant clinical data (such as body mass index, history of prior surgeries, and extent of disease).
Data on outcomes from a variety of settings, including community settings, are needed.
Again, as with studies of prevalence, data from sources able to provide both inpatient and outpatient data over time, such as HMOs, third-party payers, and Medicare, are likely to provide the best combination of sample size, generalizability, and clinical detail.
The annual bimanual pelvic examination appears to have little, if any, benefit for reducing ovarian cancer morbidity and mortality in asymptomatic women. Given that many organizations now recommend less frequent cervical cancer screening in many women, that no screening test has ever been shown to reduce morbidity and mortality from endometrial cancer, and that other gynecological cancers are too rare to justify population-based screening, it would appear that annual bimanual pelvic exams do not have a substantial benefit in reducing mortality. Evidence on other potential benefits of the exam would therefore be helpful for patients, clinicians, and policymakers. Possible research areas include:
Many clinicians argue that the annual exam provides a “cue” for women to interact with a clinician and receive other preventive services.
Would women be less likely to see a health professional on a regular basis if they would not get a pelvic examination?
If the exam does provide a “cue” for some women, what is its effectiveness and cost-effectiveness compared to alternative methods of improving adherence to periodic health maintenance schedules?
Are there some women who do not regularly see a health professional because of embarrassment/fear/discomfort regarding a pelvic exam who would be more likely to see one if they could be assured they would not get an exam?
Others have argued that, after long experience, women expect to receive a pelvic examination (and Pap test) on an annual basis and will continue to demand the examination, despite evidence that the test has little benefit, or does not need to be performed on an annual basis.
How have patients reacted to other changes or paradigm shifts in medicine? Can patient expectations be changed in the face of new evidence? Do patient responses differ between changes in which one intervention is replaced by another, versus changes in which an intervention is no longer performed at all?
Although the pelvic examination does not appear to have significant benefit as a screening test, does it have more value as a diagnostic test?
Assuming the pelvic examination does have value as a diagnostic test, is there a relationship between volume/experience and test accuracy, as suggested by two of the studies we reviewed? If so, can routine examinations in asymptomatic women be justified as a method for maintaining exam skills?
If there is a relationship between volume and accuracy, what are the implications for the performance of diagnostic bimanual examinations by generalists (e.g., internists, pediatricians, family practitioners, generalist nurse practitioners) versus specialists (e.g., obstetrician/gynecologists, nurse-midwives)?
Our modeling of the likely outcomes of different screening strategies was limited by the quantity and quality of data available for key parameters. These limitations restricted direct comparison of different testing strategies and precluded a comprehensive analysis. The lack of data on patient characteristics, particularly symptom status, also prevented extensive analysis of the effects of different strategies in different clinical scenarios. Improving the evidence base for the other questions considered in the evidence report will substantially improve the ability to meaningfully model outcomes.
Data on relevant patient preferences for different outcomes are needed.
Data on relevant cost parameters are needed for cost-effectiveness analysis.
Data on relative test reproducibility can help determine the effect of observer variability on effectiveness and cost-effectiveness.
We identified only three models, one of which was an updated version of another. Having several groups working on simulation modeling, using different assumptions, software, model structure, etc., has proven quite helpful in the case of cervical cancer. Additional work should be strongly encouraged.
In particular, models should explore alternative disease natural history parameters, and the implications for various strategies, including screening and primary prevention.
Developing an effective and efficient algorithm for the evaluation of any condition requires good evidence on the prevalence of the condition at the first diagnostic encounter, and the sensitivity and specificity of the potential diagnostic tests to be used. With this information, one can estimate the outcomes, in terms of true and false positive and negative results, of each test. Various combinations of tests can be compared, and, ideally, the consequences of each test's results in terms of benefits, harms, and costs can be estimated.
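The calculation described above can be sketched as follows. The cohort size, prevalence, sensitivity, and specificity in the example are hypothetical values chosen for easy arithmetic, and the function name is illustrative.

```python
def expected_counts(n, prevalence, sensitivity, specificity):
    """Expected 2x2 table for one test applied to a cohort of n women."""
    diseased = n * prevalence
    healthy = n - diseased
    true_pos = diseased * sensitivity
    false_neg = diseased - true_pos       # cancers missed by the test
    true_neg = healthy * specificity
    false_pos = healthy - true_neg        # benign masses called malignant
    return {
        "true_pos": true_pos,
        "false_pos": false_pos,
        "false_neg": false_neg,
        "true_neg": true_neg,
    }


# Hypothetical cohort: 10,000 women, 10% prevalence, 80% sens, 90% spec.
for name, count in expected_counts(10_000, 0.10, 0.80, 0.90).items():
    print(f"{name:>9}: {count:,.0f}")
```

Attaching a benefit, harm, or cost to each cell is then the step that turns the table into a comparison of strategies.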
In the setting of an adnexal mass, the primary issue is discriminating benign from malignant masses; ideally, all women with an underlying ovarian malignancy would receive appropriate surgical management (perfect sensitivity), and no woman with an asymptomatic benign mass would undergo unnecessary surgery (perfect specificity). The optimal strategy may well differ based on whether or not the patient presents with symptoms, both because the prevalence of disease is likely to be higher in the patient with symptoms (making the positive predictive value higher and the negative predictive value lower), and because surgical management may ultimately be appropriate for a symptomatic patient, and some asymptomatic patients, even if the mass is benign. Age and/or menopausal status are also important considerations, primarily because ovarian cancer is rare prior to age 50, but also because some of the risks of surgery may increase with age.
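The effect of prevalence on predictive values can be illustrated with a short calculation. The sensitivity, specificity, and prevalence values below are hypothetical and serve only to show the direction of the change.

```python
def predictive_values(prevalence, sensitivity, specificity):
    """Return (PPV, NPV) for a test at a given disease prevalence."""
    tp = prevalence * sensitivity
    fp = (1 - prevalence) * (1 - specificity)
    fn = prevalence * (1 - sensitivity)
    tn = (1 - prevalence) * specificity
    return tp / (tp + fp), tn / (tn + fn)


# Same hypothetical test (85% sens, 85% spec) at increasing prevalence,
# as might be seen moving from asymptomatic to symptomatic populations.
for prevalence in (0.001, 0.01, 0.10, 0.30):
    ppv, npv = predictive_values(prevalence, 0.85, 0.85)
    print(f"prevalence {prevalence:6.1%}: PPV {ppv:6.1%}, NPV {npv:7.2%}")
```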
Unfortunately, the overwhelming majority of the literature we reviewed did not provide sufficient detail on these important patient characteristics to allow confident estimation of the outcomes of different diagnostic strategies, so we are unable to conclude that any of the strategies achieve the aims of maximizing appropriate treatment and minimizing unnecessary surgery. Outside of studies that were explicitly designed to evaluate screening, few articles described whether patients were symptomatic or asymptomatic, or what testing had been done prior to the diagnostic test being evaluated. Surprisingly few studies reported results separately for premenopausal and postmenopausal women. Future studies need to provide this information.
All of the diagnostic tests and scoring systems we evaluated exhibited a trade-off between sensitivity and specificity - studies of a given test that reported higher sensitivity had lower specificity, and vice versa. In pooled analysis, either the combination of ultrasound morphology and Doppler blood flow, or magnetic resonance imaging (MRI), had the best combination of sensitivity and specificity. Simple modeling of series and parallel tests suggests that, in postmenopausal women, imaging using ultrasound morphology and Doppler blood flow, or MRI, followed by CA-125, is both more sensitive (misses fewer cancers) and more specific (avoids more surgery) than either test alone. A strategy in which both tests were performed and used in a scoring system, the Risk of Malignancy Index, detected additional cancers but with twice as many tests and more surgeries. More data on key parameters are needed to determine if, in certain settings, alternative combinations of tests, performed in parallel or series, might have better outcomes or be more efficient.
Studies of surgical management suffered from the same limitations in terms of description of patient characteristics, making estimation of the risks of false positive diagnostic testing impossible. Similarly, administrative data that include only discharge information do not provide important clinical information.
The bimanual pelvic examination has low sensitivity for both detection of adnexal masses and discriminating benign from malignant masses, raising doubts about its utility as a screening test in asymptomatic women.
Ultimately, evaluation of potential strategies for reducing morbidity and mortality from ovarian cancer may require use of simulation models, a technique that has proven helpful in evaluating prevention strategies for other cancers. Because the natural history of ovarian cancer is relatively unknown, testing of alternative models is critical. Although a few sophisticated models exist, development of additional models would be helpful, especially in the context of evaluating results from ongoing trials of screening. If any of these trials show a benefit from screening, then the need for better evidence on the diagnostic evaluation of adnexal masses will become even more critical.
| 2D | Two-dimensional |
| 3D | Three-dimensional |
| ACOG | American College of Obstetricians and Gynecologists |
| ACR | American College of Radiology |
| AFP | Alpha-fetoprotein |
| AHRQ | Agency for Healthcare Research and Quality |
| AUC | Area under the curve |
| CA-125 | Cancer antigen 125 |
| CDC | Centers for Disease Control and Prevention |
| CEA | Carcinoembryonic antigen |
| CI | Confidence interval |
| CMS | Centers for Medicare & Medicaid Services |
| CT | Computed tomography |
| FDG | 18-Fluorodeoxyglucose |
| FIGO | International Federation of Gynecology and Obstetrics |
| FNA | Fine needle aspiration |
| hCG | Human chorionic gonadotropin |
| ICD-9 | International Classification of Diseases, Ninth Revision |
| LDH | Lactate dehydrogenase |
| LMP | Low malignant potential |
| MeSH | Medical Subject Heading |
| MRI | Magnetic resonance imaging |
| NIS | Nationwide Inpatient Sample |
| NPV | Negative predictive value |
| PET | Positron emission tomography |
| PI | Pulsatility index |
| PPV | Positive predictive value |
| RI | Resistance index |
| RMI | Risk of Malignancy Index |
| ROC | Receiver operating characteristic |
| SEER | Surveillance, Epidemiology, and End Results |
| SGO | Society of Gynecologic Oncologists |
| TAG-72 | Tumor-associated glycoprotein 72 |
| TVUS | Transvaginal ultrasound |
Search Strategy 1: pelvic exam performance
(developed and run by McCrory and Myers on September 10, 2004)
Database: Ovid MEDLINE(R) <1966 to September Week 1 2004>
Search Strategy:
————————————————————————————————————————
1. pelvic exam.mp. (53)
2. (bimanual adj pelvic).mp. [mp=title, original title, abstract, name of substance, mesh subject heading] (25)
3. (physical exam and pelvis).mp. (7)
4. “diagnostic techniques, obstetrical and gynecological”/ or culdoscopy/ or laparoscopy/ or physical examination/ (45383)
5. physical examination/ (18265)
6. Ovarian Cysts/ or Ovarian Neoplasms/ or Genital Neoplasms, Female/ or Adnexal Diseases/ or adnexal mass.mp. (48599)
7. exp Ovarian Cysts/ or exp Ovarian Neoplasms/ or Genital Neoplasms, Female/ or Adnexal Diseases/ or adnexal mass.mp. (53879)
8. exp fallopian tube diseases/ (4449)
9. 5 and (7 or 8) (124)
10. (or/1–3) and (or/7–8) (18)
11. 9 and 10 (5)
12. “diagnostic techniques, obstetrical and gynecological”/ and (or/7–8) (8)
13. culdoscopy/ and (or/7–8) (52)
14. or/1–3,9–10 (204)
15. limit 14 to (human and english language and yr=1980 - 2004) (147)
16. from 15 keep 1–147 (147)
***************************
Search Strategy 2: test performance
Developed and run by McCrory on September 28, 2004
Database: Ovid MEDLINE(R) <1966 to September Week 3 2004>
Search Strategy:
————————————————————————————————————————
1. (vagin$ adj ultraso$).mp. [mp=title, original title, abstract, name of substance, mesh subject heading] (1391)
2. (adnex$ adj2 mas$).mp. (873)
3. (pelvi$ adj mas$).mp. (1537)
4. (ovar$ adj mas$).mp. (1479)
5. or/2–4 (3696)
6. “sensitivity and specificity”/ (121128)
7. 6 and 1 (132)
8. 6 and 5 (316)
9. 7 or 8 (431)
10. limit 9 to (human and english language) (387)
11. from 10 keep 1–387 (387)
12. (ovar$ adj tumo$).mp. (11435)
13. 12 and 6 (405)
14. ROC Curve/ (7282)
15. 13 and 14 (27)
16. from 15 keep 4,7,9,15,19–20,22–23,27 (9)
17. from 15 keep 22–23,27 (3)
18. 16 not 11 (4)
19. 11 or 18 (391)
20. limit 19 to yr=1980 - 2004 (391)
21. from 20 keep 1–391 (391)
***************************
Search Strategy 3: predictive models
(strategy developed and run by McCrory on September 29, 2004)
Database: Ovid MEDLINE(R) <1966 to September Week 3 2004>
Search Strategy:
————————————————————————————————————————
1. (vagin$ adj ultraso$).mp. [mp=title, original title, abstract, name of substance, mesh subject heading] (1391)
2. (adnex$ adj2 mas$).mp. (873)
3. (pelvi$ adj mas$).mp. (1537)
4. (ovar$ adj mas$).mp. (1479)
5. or/2–4 (3696)
6. “sensitivity and specificity”/ (121128)
7. 6 and 1 (132)
8. 6 and 5 (316)
9. 7 or 8 (431)
10. limit 9 to (human and english language) (387)
11. predictive value of tests/ (56850)
12. Risk Assessment/ (47548)
13. roc curve/ (7282)
14. “Multivariate Analysis”/ (31714)
15. or/11–14 (136223)
16. 15 and 5 (260)
17. 16 not 9 (142)
18. limit 17 to (human and english language) (131)
19. from 18 keep 1–131 (131)
***************************
All excluded studies listed below were reviewed in their full text version. Following each reference, in italics, is the reason(s) for exclusion and the Question (Q) for which the article was considered. If no Q is indicated, then the article was excluded a priori from the study for the reason given. An article can be considered (and therefore excluded) for more than one question, and all questions for which the article was excluded are identified. Reasons for exclusion signify only the usefulness of the articles for this study and are not intended as criticisms of the articles.
For reference, the questions are:
Question 1: What is the prevalence of various tumor types among women with an adnexal mass, stratified by cancer status (malignant vs. benign), age, menopausal status, and size of tumor?
Question 2: What are the sensitivity, specificity, and reliability of the bimanual examination?
Question 3: Among women with a palpable adnexal mass on exam or a mass identified by ultrasound/imaging, what is the sensitivity/specificity of various evaluation modalities including ultrasound (transvaginal ultrasound, transabdominal ultrasound, color Doppler, 2D vs. 3D ultrasound, CT scan, MRI scan, and CA-125 levels) for diagnosing malignant masses?
Question 4: What is the accuracy of explicit scoring systems which incorporate various combinations of imaging findings, patient risk factors, and/or CA-125 levels for detecting malignancy? Have these scoring systems been applied to a population of women before laparoscopy?
Question 5: Among women with suspected benign lesions on initial investigation, what are the sensitivity and specificity of monitoring with periodic CA-125 and/or interval ultrasound examinations for detecting malignant masses? How does the interval of testing/definition of change affect sensitivity and predictive value?
Question 6: Among women with adnexal masses, what are the morbidity and mortality from diagnostic surgery (laparoscopy or laparotomy)? At what point does the risk of laparoscopy outweigh the risk of detecting malignancy?
Question 7: What are the estimated trade-offs resulting from various strategies for evaluation of the adnexal mass?
| Study | Study Design | Patients | Clinical Presentation | Results | Comments/Quality Scoring |
|---|---|---|---|---|---|
| StudyID | Geographical location: | Age: | Symptomatic (n [%]): | [Proportion of each type of finding, stratified by cancer status, age/menopausal status (<45, 45–55, >55 or pre-peri-post-menopausal), and size of tumor. Include individual tumor types where possible.] | [IF ARTICLE SHOULD BE EXCLUDED, PLEASE EXPLAIN WHY HERE] |
| Dates: | Mean (SD): | Detected by exam (n [%]): | Use Excel spreadsheet to calculate confidence intervals for prevalence data from screening studies | [COMMENT ON BIASES, ETC. AFFECTING CLINICAL INTERPRETATION] | |
| Size of population: | Median: | Detected by imaging (n [%]): | 1) | Quality assessment: | |
| [num/denom for screening studies] | Range: | Combination (n [%]): | 2) | [assign + or - to each item, and provide a brief rationale] | |
| Screening study Registry | Menopausal status (n [%]): | Additional data used for diagnosis: | 3) | Size of population from which sample drawn: | |
| Other | Pre (< 45): | 4) | Number of cases: | ||
| [delete all but one; please specify “Other”] | Peri (45–55): | 5) | Patient selection: | ||
| Post (> 55): | Application of reference standard: | ||||
| Race/ethnicity (n [%]): | This article is also relevant to: [delete as appropriate] | ||||
| Risk factors (n [%]): | Question 2 | ||||
| Family history: | Question 3 | ||||
| Genotype: | Question 4 | ||||
| Other [specify]: | Question 5 | ||||
| Question 6 | |||||
| Question 7 |
| Study | Study Design | Patients | Clinical Presentation | Clinical Setting of Exam | Results | Comments/Quality Scoring |
|---|---|---|---|---|---|---|
| StudyID | Geographical location: | Age: | Symptomatic (n [%]): | [Please provide brief description of clinical setting in which bimanual exam was performed] | [For bimanual exam, provide reported sensitivity/specificity and provide 2×2 tables (if possible). If possible and appropriate, stratify by age or menopausal status. If data are available on reliability/ reproducibility, report these as well. Include kappa scores if these are reported or can be calculated.] | [IF ARTICLE SHOULD BE EXCLUDED, PLEASE EXPLAIN WHY HERE] |
| Dates: | Mean (SD): | Detected by exam (n [%]): | 1) [Use this space to provide information needed for reader to interpret Test +, Test -, Disease +, and Disease - headings in following table.] | [COMMENT ON BIASES, ETC. AFFECTING CLINICAL INTERPRETATION] | ||
| Size of population: | Median: | Detected by imaging (n [%]): | Quality assessment: | |||
| [num/denom for screening studies] | Range: | Combination (n [%]): | [assign + or - to each item, and provide a brief rationale] | |||
| Screening study Registry | Menopausal status (n [%]): | Additional data used for diagnosis: | Reference standard: | |||
| Other | Pre (< 45): | Verification bias: | ||||
| [delete all but one; please specify “Other”] | Peri (45–55): | Test reliability/variability: | ||||
| Reference standard: | Post (> 55): | Sample size: | ||||
| Reference standard applied to all test negatives?: | Race/ethnicity (n [%]): | Statistical tests: | ||||
| Test reliability established?: | Risk factors (n [%]): | Blinding: | ||||
| Statistical tests used: | Family history: | Definition of +/- on screening test: | ||||
| Blinding: | Genotype: | This article is also relevant to: [delete as appropriate] | ||||
| Definition of positive and negative on screening test: | Other [specify]: | 2) | Question 1 | |||
| Inclusion criteria: | Question 2 | |||||
| Exclusion criteria: | Question 3 | |||||
| Question 4 | ||||||
| Question 5 | ||||||
| Question 7 |
| Study | Study Design | Patients | Clinical Presentation | Results | Comments/Quality Scoring |
|---|---|---|---|---|---|
| StudyID | Geographical location: | Age: | Symptomatic (n [%]): | [For each test reported, please provide a 2×2 table and report or calculate sensitivity, specificity, NPV, and PPV (all with confidence intervals). If possible and appropriate, stratify by age or menopausal status.] | [IF ARTICLE SHOULD BE EXCLUDED, PLEASE EXPLAIN WHY HERE] |
| Dates: | Mean (SD): | Detected by exam (n [%]): | 1) [Use this space to provide information needed for reader to interpret Test +, Test -, Disease +, and Disease - headings in following table.] | [COMMENT ON BIASES, ETC. AFFECTING CLINICAL INTERPRETATION] | |
| Size of population: | Median: | Detected by imaging (n [%]): | Quality assessment: | ||
| [num/denom for screening studies] | Range: | Combination (n [%]): | [assign + or - to each item, and provide a brief rationale] | ||
| Screening study Registry | Menopausal status (n [%]): | Additional data used for diagnosis: | Reference standard: | ||
| Other | Pre (< 45): | Verification bias: | |||
| [delete all but one; please specify “Other”] | Peri (45–55): | Test reliability/variability: | |||
| Reference standard: | Post (> 55): | Sample size: | |||
| Reference standard applied to all test negatives?: | Race/ethnicity (n [%]): | Statistical tests: | |||
| Test reliability established?: | Risk factors (n [%]): | Blinding: | |||
| Statistical tests used: | Family history: | Definition of +/- on screening test: | |||
| Blinding: | Genotype: | 2) | This article is also relevant to: [delete as appropriate] | ||
| Definition of positive and negative on screening test: | Other [specify]: | Question 1 | |||
| Inclusion criteria: | Question 3 | ||||
| Exclusion criteria: | Question 4 | ||||
| Question 5 | |||||
| Question 6 | |||||
| Question 7 |
| Study | Study Design | Patients | Clinical Presentation | Items Included in Scoring System | Results | Comments/Quality Scoring |
|---|---|---|---|---|---|---|
| StudyID | Geographical location: | Age: | Symptomatic (n [%]): | 1) | [For each reported scoring system (and individual components, if reported), provide reported sensitivity/specificity and provide 2×2 table; if multivariate analysis, provide area under ROC curve or c-statistic, if reported. If possible and appropriate, stratify by age or menopausal status.] | [IF ARTICLE SHOULD BE EXCLUDED, PLEASE EXPLAIN WHY HERE] |
| Dates: | Mean (SD): | Detected by exam (n [%]): | 2) | 1) [Use this space to provide information needed for reader to interpret Test +, Test -, Disease +, and Disease - headings in following table.] | [COMMENT ON BIASES, ETC. AFFECTING CLINICAL INTERPRETATION] | |
| Size of population: | Median: | Detected by imaging (n [%]): | 3) | 2) | Quality assessment: | |
| [num/denom for screening studies] | Range: | Combination (n [%]): | 4) | Results were reported, but have not been abstracted, for the following combinations: [list] | [assign + or - to each item, and provide a brief rationale] | |
| Screening study Registry | Menopausal status (n [%]): | Additional data used for diagnosis: | 5) | Reference standard: | ||
| Other | Pre (< 45): | 6) | Verification bias: | |||
| [delete all but one; please specify “Other”] | Peri (45–55): | 7) | Test reliability/variability: | |||
| Reference standard: | Post (> 55): | 8) | Sample size: | |||
| Reference standard applied to all test negatives?: | Race/ethnicity (n [%]): | 9) | Statistical tests: | |||
| Statistical tests used: | Risk factors (n [%]): | 10) | Blinding: | |||
| Blinding: | Family history: | Definition of +/- on screening test: | ||||
| Definition of positive and negative on screening test: | Genotype: | Explicit validation method?: | ||||
| Other [specify]: | This article is also relevant to: [delete as appropriate] | |||||
| Inclusion criteria: | Question 1 | |||||
| Exclusion criteria: | Question 2 | |||||
| Question 4 | ||||||
| Question 5 | ||||||
| Question 6 | ||||||
| Question 7 |
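
The Results column of the form above asks abstractors to report sensitivity and specificity together with the underlying 2×2 table. As a minimal sketch of how those quantities relate to the four cells, the Python below derives Se, Sp, PPV, and NPV from the counts. The class name `TwoByTwo`, its field names, and the example counts are hypothetical illustrations and are not part of the abstraction form or of any study in this report.

```python
from dataclasses import dataclass


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


@dataclass(frozen=True)
class TwoByTwo:
    """Counts from a 2x2 table of test result against the reference standard."""
    tp: int  # Test +, Disease +
    fp: int  # Test +, Disease -
    fn: int  # Test -, Disease +
    tn: int  # Test -, Disease -

    def sensitivity(self) -> float:
        # Proportion of diseased patients with a positive test.
        return _ratio(self.tp, self.tp + self.fn)

    def specificity(self) -> float:
        # Proportion of non-diseased patients with a negative test.
        return _ratio(self.tn, self.tn + self.fp)

    def ppv(self) -> float:
        # Proportion of positive tests that are true positives.
        return _ratio(self.tp, self.tp + self.fp)

    def npv(self) -> float:
        # Proportion of negative tests that are true negatives.
        return _ratio(self.tn, self.tn + self.fn)


# Hypothetical counts, for illustration only (not taken from any study).
table = TwoByTwo(tp=40, fp=30, fn=10, tn=120)
print(f"Se = {table.sensitivity():.2f}, Sp = {table.specificity():.2f}")
print(f"PPV = {table.ppv():.2f}, NPV = {table.npv():.2f}")
```

Predictive values computed this way depend on the prevalence of disease in the study sample, so they should be interpreted alongside the "Size of population" and verification bias entries recorded on the form.
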
| Study | Study Design | Patients | Clinical Presentation | Monitoring Strategy | Results | Comments/Quality Scoring |
|---|---|---|---|---|---|---|
| StudyID | Geographical location: | Age: | Symptomatic (n [%]): | Monitoring test: | [For each reported monitoring strategy, provide reported sensitivity/specificity and provide 2×2 table; if multivariate analysis, provide area under ROC curve or c-statistic, if reported. If possible and appropriate, stratify by age or menopausal status.] | [IF ARTICLE SHOULD BE EXCLUDED, PLEASE EXPLAIN WHY HERE] |
| Dates: | Mean (SD): | Detected by exam (n [%]): | Interval of testing: | 1) [Use this space to provide information needed for reader to interpret Test +, Test -, Disease +, and Disease - headings in following table.] | [COMMENT ON BIASES, ETC. AFFECTING CLINICAL INTERPRETATION] | |
| Size of population: | Median: | Detected by imaging (n [%]): | Definition of change: | 2) | Quality assessment: | |
| [num/denom for screening studies] | Range: | Combination (n [%]): | 3) | [assign + or - to each item, and provide a brief rationale] | ||
| Screening study Registry | Menopausal status (n [%]): | Additional data used for diagnosis: | Reference standard: | |||
| Other | Pre (< 45): | Verification bias: | ||||
| [delete all but one; please specify “Other”] | Peri (45–55): | Test reliability/variability: | ||||
| Reference standard: | Post (> 55): | Sample size: | ||||
| Reference standard applied to all test negatives?: | Race/ethnicity (n [%]): | Statistical tests: | ||||
| Test reliability established?: | Risk factors (n [%]): | Blinding: | ||||
| Statistical tests used: | Family history: | Definition of +/- on screening test: | ||||
| Blinding: | Genotype: | Explicit validation method?: | ||||
| Definition of positive and negative on screening test: | Other [specify]: | This article is also relevant to: [delete as appropriate] | ||||
| Length of follow-up: | Inclusion criteria: | Question 1 | ||||
| Type of follow-up: | Exclusion criteria: | Question 2 | ||||
| Follow-up interval: | Loss to follow-up: | Question 3 | ||||
| Question 5 | ||||||
| Question 6 | ||||||
| Question 7 |
| Study | Study Design | Patients | Clinical Presentation | Results | Comments/Quality Scoring |
|---|---|---|---|---|---|
| StudyID | Geographical location: | Age: | Symptomatic (n [%]): | [For each, provide reported rate and 95% CI, if appropriate. If possible and appropriate, stratify results by age or menopausal status.] | [IF ARTICLE SHOULD BE EXCLUDED, PLEASE EXPLAIN WHY HERE] |
| Dates: | Mean (SD): | Detected by exam (n [%]): | Use Excel spreadsheet to calculate confidence intervals for morbidity/mortality | [COMMENT ON BIASES, ETC. AFFECTING CLINICAL INTERPRETATION] | |
| Size of population: | Median: | Detected by imaging (n [%]): | 1) Mortality: | Quality assessment: | |
| [num/denom for screening studies] | Range: | Combination (n [%]): | 2) Morbidity (total all complications): | [assign + or - to each item, and provide a brief rationale] | |
| Single center Registry | Menopausal status (n [%]): | Additional data used for diagnosis: | 3) Specific complications: | Size of population from which sample drawn: | |
| [delete one] | Pre (< 45): | 4) Rate of conversion to laparotomy: | Number of cases: | ||
| Morbidity definitions: | Peri (45–55): | 5) | Patient selection: | ||
| Length of follow-up after surgery: | Post (> 55): | 6) | Application of reference standard: | ||
| Race/ethnicity (n [%]): | This article is also relevant to: [delete as appropriate] | ||||
| Risk factors (n [%]): | Question 1 | ||||
| Family history: | Question 2 | ||||
| Genotype: | Question 3 | ||||
| Other [specify]: | Question 4 | ||||
| Loss to follow-up: | Question 6 | ||||
| Question 7 |
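
The form above directs abstractors to calculate 95% confidence intervals for mortality and morbidity rates in a spreadsheet, but it does not state which interval method that spreadsheet uses. As one possible sketch, the Python below computes a Wilson score interval for a binomial proportion. The choice of the Wilson method, the function name `wilson_interval`, and the example counts are assumptions made for illustration only.

```python
import math
from typing import Tuple


def wilson_interval(events: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.959964 gives about 95%)."""
    if n <= 0 or not 0 <= events <= n:
        raise ValueError("require n > 0 and 0 <= events <= n")
    p = events / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at the bounds.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical example: 5 complications among 100 procedures.
low, high = wilson_interval(events=5, n=100)
print(f"Rate = {5 / 100:.3f}, 95% CI {low:.3f} to {high:.3f}")
```

The Wilson interval is used here because it stays within 0 and 1 and behaves reasonably when event counts are small, as is common for surgical mortality; an exact (Clopper-Pearson) interval would be another defensible choice.
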
| Study | Study Design | Study Outcomes | Sources for Model Probabilities | Sources for Model Outcomes | Results | Comments |
|---|---|---|---|---|---|---|
| StudyID | Type of model: | [Life expectancy, quality of life, cancer incidence, cancer death, etc. Include costs, but we will not be using them here] | [In particular, sources for transition probabilities between different stages of pre-cancer/cancer] | [For each strategy compared, compare results for different outcomes; also, report results of significant sensitivity analyses.] | [IF ARTICLE SHOULD BE EXCLUDED, PLEASE EXPLAIN WHY HERE] | |
| Population modeled (age, range): | Simplifying assumptions: | 1) | [COMMENT ON BIASES, ETC. AFFECTING CLINICAL INTERPRETATION] | |||
| Strategies compared: | 2) | This article is also relevant to: [delete as appropriate] | ||||
| 3) | Question 1 | |||||
| 4) | Question 2 | |||||
| 5) | Question 3 | |||||
| 6) | Question 4 | |||||
| Question 5 | ||||||
| Question 6 |

| Abbreviation | Definition |
|---|---|
| 2D | Two-dimensional |
| 3D | Three-dimensional |
| AFP | Alpha-fetoprotein |
| AHRQ | Agency for Healthcare Research and Quality |
| AUC | Area under the curve |
| BME | Bimanual examination |
| BMI | Body mass index |
| CA-19-9 | Cancer antigen 19-9 |
| CA-72-4 | Cancer antigen 72-4 |
| CA-125 | Cancer antigen 125 |
| CEA | Carcinoembryonic antigen |
| CI | Confidence interval |
| CPP | Chronic pelvic pain |
| CT | Computed tomography |
| 18F-FDG | 18F-Fluorodeoxyglucose |
| FNA | Fine needle aspiration |
| FSH | Follicle-stimulating hormone |
| GI | Gastrointestinal |
| hCG | Human chorionic gonadotropin |
| ICD-9 | International Classification of Diseases, Ninth Revision |
| LDH | Lactate dehydrogenase |
| LMP | Low malignant potential |
| MRI | Magnetic resonance imaging |
| NIS | Nationwide Inpatient Sample |
| NA | Not applicable |
| NPV | Negative predictive value |
| NR | Not reported |
| OR | Odds ratio |
| PE | Pelvic examination |
| PET | Positron emission tomography |
| PI | Pulsatility index |
| PID | Pelvic inflammatory disease |
| PPS | Papillary projection score |
| PPV | Positive predictive value |
| PSV | Peak systolic velocity |
| RI | Resistance index |
| RMI | Risk of Malignancy Index |
| ROC | Receiver operating characteristic |
| SD | Standard deviation |
| Se | Sensitivity |
| SEM | Standard error of the mean |
| Sp | Specificity |
| TAG-72 | Tumor-associated glycoprotein 72 |
| TAMXV | Time-averaged maximum velocity |
| TATI | Tumor-associated trypsin inhibitor |
| TVUS | Transvaginal ultrasound |
| US | Ultrasound |
| UTI | Urinary tract infection |
The Duke Evidence-based Practice Center is grateful to the following peer reviewers who read and commented on a draft version of this report:
Susan M. Ascher, MD; Department of Radiology, Georgetown University Hospital; Washington, DC
Andrew Berchuck, MD; Division of Gynecologic Oncology, Duke University Medical Center; Durham, NC
Michael L. Berman; Division of Gynecologic Oncology, UCI Medical Center; Orange, CA
Christie R. Eheman, MD; Centers for Disease Control and Prevention; Atlanta, GA
Barbara Goff, MD; University of Washington School of Medicine; Seattle, WA
Walter Kinney, MD; Department of Obstetrics and Gynecology, The Permanente Medical Group; Sacramento, CA
Ann Kolker; Ovarian Cancer National Alliance; Washington, DC
Herschel Lawson, MD; Centers for Disease Control and Prevention; Atlanta, GA
Saralyn Mark, MD; Department of Health and Human Services; Washington, DC
Susan Meikle, MD, MSPH; Agency for Healthcare Research and Quality; Rockville, MD
Valerie McGuire, PhD; Department of Health Research and Policy, Stanford University; Stanford, CA
Edward E. Partridge, MD; Department of Obstetrics and Gynecology, University of Alabama; Birmingham, AL
Mona Saraiya, MD, MPH; Centers for Disease Control and Prevention; Atlanta, GA
George F. Sawaya, MD; Department of Obstetrics and Gynecology, UCSF; San Francisco, CA
Howard T. Sharp, MD; University of Utah Hospitals and Clinics; Salt Lake City, UT
Edward L. Trimble, MD, MPH; National Cancer Institute; Rockville, MD
John R. van Nagell Jr., MD; Department of Obstetrics and Gynecology, University of Kentucky Medical Center; Lexington, KY
Nominations for peer reviewers were solicited from several sources, including the project's technical expert panel and interested federal agencies. The list of nominees was vetted and approved by the Agency for Healthcare Research and Quality (AHRQ).