Evaluation of a Prototype Search and Visualization System for Exploring Scientific Communities

Michael E. Bales, MPH, M.Phil, David R. Kaufman, PhD, and Stephen B. Johnson, PhD

Abstract

Searches of bibliographic databases generate lists of articles but do little to reveal connections between authors, institutions, and grants. As a result, search results cannot be fully leveraged. To address this problem we developed Sciologer, a prototype search and visualization system. Sciologer presents the results of any PubMed query as an interactive network diagram of the above elements. We conducted a cognitive evaluation with six neuroscience and six obesity researchers. Researchers used the system effectively. They used geographic, color, and shape metaphors to describe community structure and made accurate inferences pertaining to a) collaboration among research groups; b) prominence of individual researchers; and c) differentiation of expertise. The tool confirmed certain beliefs, disconfirmed others, and extended their understanding of their own discipline. The majority indicated the system offered information of value beyond a traditional PubMed search and that they would use the tool if available.

Keywords: PubMed, Medline, search, visualization, social networks, translational research

Introduction

Interdisciplinary collaboration is widely recognized as a key factor for enhancing scientific productivity[1]. An important first step towards facilitating collaboration is to gain a high-level view of collaboration patterns in any given community. Interest is growing in the use of social network analysis (SNA) to understand such patterns. A main barrier is that SNA is often viewed as a time-consuming process requiring highly specialized skills. The biomedical literature indexed in Medline is a significant resource for understanding relationships not only among authors but also among articles, journals, institutions, substances, grants, and keywords. However, interfaces to these data, such as PubMed, are not designed to allow researchers to understand the emergent community structure hidden in lists of articles. In response to this problem we have developed Sciologer, a fully automated prototype search system for the interactive visual exploration of scientific communities. The system is designed to allow researchers to find patterns in scientific communities; identify experts and potential collaborators; understand how a field is organized; and learn about results in less familiar fields. The aims of this formative evaluation are to determine: 1) whether the system shows promise as a useful tool for understanding scientific communities; 2) whether researchers are able to interpret the visualizations meaningfully; and 3) how satisfied researchers are with the prototype.

Background

SNA has long been an important technique in sociology and organizational studies[2]. The techniques of SNA, which include graph theory and matrix mathematics, have been used to analyze and visualize a variety of complex systems including transportation networks and the Internet. Recently, SNA has been used increasingly in informatics; for example, to study the structure of organizations[3] and terminologies[4] and to promote translational research by identifying synergistic properties among specific researchers[5]. Because network visualization is an effective way to leverage the visual cortex to help people comprehend complex data, there has also been significant work in using network models to map knowledge domains[6]. There are a number of freely available tools for visualizing and analyzing existing network data sets. CiteSpace[7] is a freely available tool for generating networks of authors, articles, and journals from bibliographic data.

There are several challenges involved in generating network representations of scientific communities. These include name heterogeneity, simplifying networks, assigning appropriate labels, and automating and scripting processes. Currently available systems are designed for expert users, require specific data formats, and often have complex user interfaces which result in steep learning curves. Our system was developed iteratively, with multiple successive prototypes and continual feedback from users. Its design was informed by a number of interrelated design principles. These include Gestalt principles of visual perception pertaining to color, comparison, proximity, grouping of elements, and uniform connectedness[8]. Icons were developed in consultation with an expert in iconic representation. Default sizes and distances were determined by trial and error in appreciation of Fitts’ Law[9], which states that the time required to navigate to a target is a function of the target’s size and displacement.

In response to a user’s PubMed query entered through a web browser, the system generates network images by piping data through a series of independent modules (Figure 1).

Figure 1.
Automated steps for generating network visualizations.

After parsing elements from individual citations, the system assigns links according to a customizable linking schema (Figure 2). By default, articles are linked to grants, substances, journals, MeSH terms, and authors. The first author of a paper is linked to the institutional affiliation, and authors are linked to their co-authors.

Figure 2.
Iconic elements and linking schema.
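
The default linking schema described above can be illustrated with a minimal Python sketch. This is not the Sciologer implementation: the dictionary keys, the function name links_for_citation, and the example citation are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def links_for_citation(citation):
    """Return the set of undirected links implied by one parsed citation.

    Nodes are (type, label) tuples. The field names used here are
    hypothetical; the source does not specify the parsed record format.
    """
    article = ("article", citation["pmid"])
    links = set()

    # Articles are linked to grants, substances, journals, MeSH terms, and authors.
    for node_type in ("grant", "substance", "journal", "mesh", "author"):
        for label in citation.get(node_type, []):
            links.add((article, (node_type, label)))

    authors = citation.get("author", [])

    # The first author is linked to the institutional affiliation.
    if authors and citation.get("affiliation"):
        links.add((("author", authors[0]), ("institution", citation["affiliation"])))

    # Authors are linked to their co-authors.
    for first, second in combinations(authors, 2):
        links.add((("author", first), ("author", second)))

    return links


# Made-up citation used only to exercise the sketch.
example_citation = {
    "pmid": "0000001",
    "author": ["Author A", "Author B", "Author C"],
    "journal": ["Journal X"],
    "mesh": ["Multiple Sclerosis"],
    "substance": [],
    "grant": ["Grant 1"],
    "affiliation": "University Y",
}
example_links = links_for_citation(example_citation)
```

Because the schema is described as customizable, a fuller version would presumably read the list of linked element types from configuration rather than hard-coding it as done here.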

Network layout is performed by CCVisu, a freely available tool developed by Dirk Beyer[10]. CCVisu uses a weighted edge-repulsion LinLog energy model (a standard technique for force-directed layout) to determine node placement. After CCVisu calculates node placement data, Sciologer calculates sizes of icons based on node degree (number of links). It integrates icon type, placement, and size data, determines colors based on node position, prepares link data, and writes output to any of several specified formats. Formats include X3D (the ISO standard XML-based file format for representing 3D computer graphics), POV-Ray (for textbook-quality graphics), and KML (Keyhole Markup Language, for visualization in Google Earth software, the output modality used in this study). Google Earth software has native controls for zooming, panning, and rotating the view, clicking icons for details, and selectively painting and labeling elements. These features allow for a dynamic interaction between the user and the network: the user may zoom in to understand local structure, and zoom out to get a sense of global structure. The software also has a find command for identifying elements of interest within a network. Response time to generate a network from 10 Medline records is approximately five seconds. Response time for 100 Medline records is approximately 20 seconds. The prototype has not yet been optimized for speed; response times are expected to be faster in a production system.
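
As a rough illustration of the steps that follow layout, the Python sketch below computes node degree from a list of links, maps degree to an icon scale, and writes a small KML document. The linear scaling, its parameter values, the function names, and the use of layout coordinates directly as longitude and latitude are all assumptions made for this example; the source does not state how Sciologer performs these mappings.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def node_degrees(links):
    """Count the number of links attached to each node."""
    degree = Counter()
    for u, v in links:
        degree[u] += 1
        degree[v] += 1
    return degree


def icon_scale(degree, min_scale=0.5, max_scale=3.0, max_degree=50):
    """Hypothetical linear mapping from degree to icon scale, capped at max_degree."""
    fraction = min(degree, max_degree) / max_degree
    return min_scale + fraction * (max_scale - min_scale)


def placemark(name, lon, lat, scale):
    """Build one KML Placemark whose icon size reflects node degree."""
    pm = ET.Element("Placemark")
    ET.SubElement(pm, "name").text = name
    icon_style = ET.SubElement(ET.SubElement(pm, "Style"), "IconStyle")
    ET.SubElement(icon_style, "scale").text = f"{scale:.2f}"
    point = ET.SubElement(pm, "Point")
    # KML coordinates are written as longitude,latitude,altitude.
    ET.SubElement(point, "coordinates").text = f"{lon},{lat},0"
    return pm


def build_kml(nodes):
    """Serialize (name, lon, lat, degree) tuples as a KML document string."""
    root = ET.Element("kml", xmlns="http://www.opengis.net/kml/2.2")
    document = ET.SubElement(root, "Document")
    for name, lon, lat, degree in nodes:
        document.append(placemark(name, lon, lat, icon_scale(degree)))
    return ET.tostring(root, encoding="unicode")


# Made-up links and positions used only to exercise the sketch.
degrees = node_degrees([("a", "b"), ("a", "c"), ("a", "d")])
kml_text = build_kml([("Author A", -73.96, 40.81, 25)])
```

Capping the scale at max_degree is one way to keep highly connected nodes from dominating the image; a logarithmic mapping would be another reasonable choice.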

The prototype differs from existing tools in several ways. First, while most tools use a control panel-based user interface, Sciologer is based on a search paradigm and uses a generative approach for network creation. In addition, while existing tools output mono- and bipartite networks, Sciologer creates networks with many node types. Another distinguishing feature is the system’s method of assigning colors to nodes using a 3D color space. This method gives proximal nodes similar (but not identical) colors.
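
One simple way to give proximal nodes similar but not identical colors is to normalize each of the three layout coordinates and use the results as color channels. The sketch below assumes an RGB color space and a function named position_to_rgb; both are illustrative assumptions, since the source says only that a 3D color space is used.

```python
def position_to_rgb(position, bounds):
    """Map an (x, y, z) layout position to an (r, g, b) tuple of 0-255 integers.

    bounds holds one (low, high) pair per axis, describing the extent of the layout.
    """
    rgb = []
    for value, (low, high) in zip(position, bounds):
        span = high - low
        # Use mid-gray on a degenerate axis to avoid division by zero.
        fraction = 0.5 if span == 0 else (value - low) / span
        # Clamp so that positions outside the bounds still yield valid colors.
        fraction = min(max(fraction, 0.0), 1.0)
        rgb.append(round(fraction * 255))
    return tuple(rgb)


# Two nearby positions in a made-up 10 x 10 x 10 layout.
layout_bounds = [(0.0, 10.0), (0.0, 10.0), (0.0, 10.0)]
base_color = position_to_rgb((2.0, 0.0, 10.0), layout_bounds)
near_color = position_to_rgb((2.1, 0.0, 10.0), layout_bounds)
```

Under this mapping, nodes that the layout places close together differ only slightly in each channel, while nodes in distant clusters receive visibly different hues.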

Evaluation Methods

We conducted a cognitive evaluation with six neuroscience and six obesity researchers. Prospective participants were identified via Medline searches and recruited by e-mail. Each participant completed an individual session consisting of a brief training exercise followed by a series of scenario-based tasks. For example:

Imagine you are preparing a grant application to improve interdisciplinary collaboration among multiple sclerosis researchers. What institutions in the state of New York have the most active researchers in multiple sclerosis? Please enter this query: “multiple sclerosis” AND “New York” [AD] AND 2003:2008 [DP]

  1. Do you see any patterns in this image? What patterns do you see?
  2. Which institutions in New York are most active in multiple sclerosis research?
  3. Which research groups do and do not collaborate most closely?
  4. What might you conclude about the groups most involved in multiple sclerosis research?

Participants were asked to think aloud while working through the tasks. Participants then completed a paper-based survey to indicate their level of agreement or disagreement with ten statements pertaining to the system. Each session concluded with a semi-structured interview. Video, audio, and screen activity were recorded for each session using Morae[11], a software package for video-analytic usability testing and analysis. Events captured during the task-based scenarios were given a time-stamp (Table 1).

Table 1.
Coding framework for task-based scenarios. The table includes a selection of time-stamped events (markers) captured during one scenario with one participant. Markers pertain to difficulties using the interface, inferences by the participant, and prompts ...

Results

We report results pertaining to users’ interactions with the system, followed by results obtained in the semi-structured interviews.

Interaction with system:

Participants were able to use the system in a productive way. They reported a high likelihood of encountering similar tasks in their work, and all participants were able to carry out the tasks effectively. Participants used geographic, color, and shape metaphors to describe community structure, and made inferences about collaboration between groups. Table 1 contains excerpts of a researcher’s session with the system. The session is segmented into a series of tasks or questions. The table illustrates the process of coding and demonstrates the facility with which the participant was able to make inferences from the visual representations. When interacting with the output, the participant was able to discern visual patterns in the image and make inferences about collaboration among groups of researchers. For example, he identified two universities that did not appear to collaborate on the topic, and noticed a “triangle of collaboration” between three researchers.

Figure 3 shows how the same researcher used a geographic metaphor to interpret multiple sclerosis research groups in New York State and made an inference about collaboration between two research groups. In this example, the participant stated that there was some collaboration between two distant groups, noting the presence of a number of lines between the groups. This inference may have been motivated in part by links that occurred because papers published by researchers in each group shared a common journal or substance. Such links do not, in fact, reflect direct collaboration between researchers. Other errors of inference could be traced to misperceptions about the Medline database or to the use of Boolean operators when querying PubMed. Errors of inference were infrequent, and all were easily corrected via prompting from the study coordinator. With the exception of examples such as these, participants made accurate inferences pertaining to collaboration among (and sharp separation between) research groups; prominence of individual researchers; and differentiation of expertise. In several cases participants reported a correspondence between the images and their knowledge of the domain. For example, after seeing the label of a visually prominent institution icon, one participant remarked, “They have, indeed, a good department.”

Figure 3.
Participant interprets image generated by Sciologer. The query was: “multiple sclerosis” AND “New York” [AD] and 2003:2008[DP]. Some links between these two groups occurred because papers shared a common substance or journal, ...
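
The distinction between direct and indirect links in this example can be made concrete with a short Python sketch. It separates co-authored articles, which do indicate collaboration between two groups, from journals and substances that the groups merely share. The record format, the function name collaboration_evidence, and the example data are hypothetical and are not part of the Sciologer prototype.

```python
def collaboration_evidence(articles, group_a, group_b):
    """Compare direct and indirect connections between two author groups.

    articles: list of dicts with "authors", "journal", and "substances" keys.
    group_a, group_b: sets of author names.
    """
    coauthored, a_only, b_only = [], [], []
    for article in articles:
        authors = set(article["authors"])
        in_a, in_b = bool(authors & group_a), bool(authors & group_b)
        if in_a and in_b:
            coauthored.append(article)  # direct collaboration
        elif in_a:
            a_only.append(article)
        elif in_b:
            b_only.append(article)

    def journals(items):
        return {item["journal"] for item in items}

    def substances(items):
        return {s for item in items for s in item["substances"]}

    return {
        "coauthored_articles": len(coauthored),
        # Shared journals or substances draw lines between the groups in the
        # diagram without any co-authorship.
        "shared_journals": journals(a_only) & journals(b_only),
        "shared_substances": substances(a_only) & substances(b_only),
    }


# Made-up records: two groups that publish in the same journal on the same
# substance but never co-author.
example_articles = [
    {"authors": ["Author A1", "Author A2"], "journal": "Journal X", "substances": ["Substance S"]},
    {"authors": ["Author B1"], "journal": "Journal X", "substances": ["Substance S"]},
]
evidence = collaboration_evidence(example_articles, {"Author A1", "Author A2"}, {"Author B1"})
```

In this made-up case the two groups would be joined by lines through the shared journal and substance icons even though the count of co-authored articles is zero.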

Paper-based survey:

The majority of participants (11 of 12, 91.7%) stated they were confident in their ability to use the system. Most (10 of 12, 83.3%) agreed that the system offered useful information beyond a traditional PubMed search, and the majority (9 of 12, 75.0%) stated they would have a reason to use it in their own work in the next six months. Although participants unanimously agreed that the system shows promise as a useful tool for researchers, most participants (9 of 12, 75.0%) agreed that the prototype needs significant modifications.

Semi-structured interview:

Participants referred to the output most frequently as “good” and “comprehensive”. The most prominent concern voiced by participants pertained to the amount of visual information in the output, with some referring to the output as busy or crowded. When asked to agree or disagree with the statement “I found the output difficult to understand”, two participants agreed, six disagreed, and four neither agreed nor disagreed. Some participants found it difficult to discern individual icons, and some expressed that dense areas of links made the images more difficult to interpret. Several participants stated that although the amount of information in the images was sometimes overwhelming, the ability to zoom in on dense areas partially alleviated the complexity.

When asked how they would use the system, the majority of participants indicated that they would use it to identify experts (11 of 12, 91.7%); to understand groups of researchers (11 of 12, 91.7%); to identify potential collaborators (9 of 12, 75.0%); and to understand the body of work of a specific researcher (9 of 12, 75.0%). Fewer than half (5 of 12, 41.7%) stated they would use the system to identify journals in which to consider publishing; several reported this to be less useful because they already know the journals in their field. Only four of the 12 participants (33.3%) stated that they would use the tool to identify peripheral groups of researchers. Most participants reported that they were less interested in peripheral groups or considered them less important.

When prompted, participants described several additional possible uses. These uses pertained to marketing one’s research or research program; writing grants; finding collaborators; understanding journals, substances, research groups, and institutions; exploring an unfamiliar area of research; recruiting researchers for various purposes; and searching for jobs. One participant also suggested the system might be useful to the general public for identifying hospitals that are strong in a given area of clinical research.

When asked about the validity of the information provided by the system, some participants indicated that the results were valid or that they had no reason to question validity. Others identified aspects of the system that could lead participants to misinterpret the information. Some stated that they were unable to assess validity, citing a lack of sufficient time using the system. Modifications recommended by the participants pertained most frequently to identifying ways to reduce the system’s visual complexity to diminish visual noise and reduce cognitive load. Participants also suggested a variety of additional enhancements that had not been considered by the developers, such as including additional element types (e.g., diseases), providing hyperlinks to researcher biography pages on scientific networking web sites, and incorporating additional bibliometric data such as impact factors and reverse citations.

Discussion

The prototype system shows promise as a useful tool for researchers. However, several participants felt overwhelmed by the visual complexity of the networks. A number of approaches can be used to alleviate this problem, such as grouping elements at broader zoom levels and disclosing additional detail progressively at finer zoom levels. In addition to guiding the development of future prototypes, the results of this evaluation highlight ways in which instructional material should be clarified. It will be important for the system to be operable “out of the box” by novice users, with detailed instructional material available for novice and expert users alike. Finally, based on self-report, some people are highly visual while others are not. As such, further research is needed to determine how to make the system appealing both to visual thinkers and to those more inclined towards text and other representations.

Conclusion

Network visualization is a highly effective modality for understanding relational information. Based on the results of this formative evaluation, the prototype system shows promise as a tool for exploring the biomedical literature. The priority for the next round of development will be to implement data filtering, grouping, and automated labeling features to limit and manage visual complexity. Further research is needed to understand how users will interact with the system in their workplace environment. The results of this evaluation provide rich data to inform the development of next-generation visual interfaces to explore relational semantic data.

Acknowledgments

The authors thank Jessica Hullman at the University of Michigan for her assistance with the development of icons. Support for this research was provided by NLM training grant 5T15LM07079.

Article information

AMIA Annu Symp Proc. 2009; 2009: 24–28.
Published online 2009 Nov 14.
PMCID: PMC2815483
PMID: 20351816
Department of Biomedical Informatics, Columbia University, New York, NY
This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose

References

1. Committee on Facilitating Interdisciplinary Research; National Academy of Sciences; National Academy of Engineering; Institute of Medicine. Facilitating Interdisciplinary Research. Washington, D.C.: The National Academies Press; 2004.
2. Borgatti SP, Mehra A, Brass DJ, Labianca G. Network analysis in the social sciences. Science. 2009;323(5916):892–5.
3. Merrill J, Hripcsak G. Using social network analysis within a department of biomedical informatics to induce a discussion of academic communities of practice. J Am Med Inform Assoc. 2008;15(6):780–2.
4. Bales ME, Lussier YA, Johnson SB. Topological analysis of large-scale biomedical terminology structures. J Am Med Inform Assoc. 2007;14:788–97.
5. Bahr NJ, Cohen AM. Discovering synergistic qualities of published authors to enhance translational research. Proc Am Med Inform Assoc Annu Symp. 2008:31–5.
6. Shiffrin RM, Borner K. Mapping knowledge domains. Proc Natl Acad Sci USA. 2004 Apr;101:5183–5.
7. Chen CM. CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature. J Am Soc Inf Sci Technol. 2006;57(3):359–77.
8. Rosson MB, Carroll JM. Usability Engineering: Scenario-Based Development of Human Computer Interaction. San Francisco, CA: Morgan Kaufmann; 2002.
9. Fitts PM. The information capacity of the human motor system in controlling amplitude of movement. J Exp Psychol. 1954;47:381–91.
10. Beyer D. CCVisu: Automatic visual software decomposition. International Conference on Software Engineering; 2008.
11. Techsmith. Morae: Usability Testing Solution for Web Sites and Software. 2007.

Figure 3.

Participant interprets image generated by Sciologer. The query was: “multiple sclerosis” AND “New York” [AD] and 2003:2008[DP]. Some links between these two groups occurred because papers shared a common substance or journal, and therefore do not reflect direct collaboration between researchers.

Table 1.

Coding framework for task-based scenarios. The table includes a selection of time-stamped events (markers) captured during one scenario with one participant. Markers pertain to difficulties using the interface, inferences by the participant, and prompts by the study coordinator.

Time | Marker type | Marker label | Participant statement
02:54.3 | Start task | Patterns in image |
03:01.1 | Observation | Inference (visual pattern) | The image is... like a square... diamond shape.
03:06.7 | Observation | Inference (clusters) | There’s clearly four corners of intense research activity.
03:16.7 | Observation | Inference (clusters) | Each corner is associated with a different institution.
03:41.5 | Prompt | Zoom in |
04:05.7 | Error | Difficulty clicking on tethered icon |
04:10.7 | Readout | SUNY | SUNY.
04:17.5 | Readout | Columbia | And Columbia.
04:27.2 | Observation | Inference (clusters) | So we’ve got the Long Islanders.
04:28.7 | Observation | Inference (output) | So it’s interesting, so it’s geographic.
04:40.3 | Observation | Inference (collaboration) | Some of the upstate universities are in close collaboration.

06:32.4 | Start task | Which groups collaborate |
06:43.9 | Observation | Inference (collaboration) | You can tell which research groups collaborate most closely.
06:52.6 | Observation | Inference (collaboration) | It looks like Columbia and NYU are very exclusive.
07:14.6 | Observation | Inference (collaboration) | It looks like there’s some collaboration between these two... a lot of lines going directly between these two groups.
07:23.3 | Quote/comment | Not many links between two groups | And <laughs> there aren’t that many lines going between these two groups.

08:01.9 | Start task | Which people collaborate |
08:05.5 | Observation | Inference | It looks like people at the same institution tend to collaborate.
08:30.3 | Observation | Inference | It looks like a triangle of collaboration between Weinstock-Guttman, Zivadinov, and Benedict.