NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Ariel P. UltraMicroscope II – A User Guide [Internet]. Chapel Hill (NC): University of North Carolina at Chapel Hill, University Libraries; 2018 Sep 20-.

Chapter 11. Data Transfer

Using the experience in the UNC Core as a case study

When acquiring large datasets, data transfer can be a rate-limiting step in the workflow, so it is beneficial to identify the potential problems, constraints and possible solutions.

11.1. Defining the problem

Transferring data is slow. “Slow” is anything that annoys users of the microscope. In the UNC Core, users get annoyed when:

  • Transferring data from an acquisition computer to an analysis workstation, network storage or external hard drive takes longer than 5 minutes.
  • The core charges them for usage on an instrument when they are just transferring data.

11.2. Defining other problems that the UNC Core cannot solve quickly, cheaply or at all

  • Long-term storage and backup of data.
  • Speed of internet cables and switches in the building and across campus.
  • Inability to access workstation visualization/analysis software from an individual lab.

11.3. Constraints

  • Writing directly to a network while acquiring a dataset is too risky. If there is any hiccup in the network, the acquisition could fail.
  • Data write speeds under good conditions are:
    • 10 Gb Ethernet: 500 MB/s
    • 1 Gb Ethernet: 50-100 MB/s
    • USB3: 50 MB/s
  • Data write times are shown in Table 11.1.
  • A single uncropped image from the LaVision BioTec light sheet is around 10 MB. Multichannel stacks over 5 mm at 2.5 μm steps are common, so typical datasets are 20-60 GB. With tiling, this can rapidly scale to hundreds of GB or even 1 TB+.
  • Typical labs have access to USB3 hard drives and 1 Gb Ethernet.
  • People want to do three things when they acquire data:
    • Visualize data
    • Analyze data
    • Store data somewhere accessible
  • The UNC Core cannot provide permanent storage of data.
  • The software to work with data is on a core workstation, not on the network nor on the acquisition PC.
  • When transferring data to/from acquisition or workstation computers, these computers cannot be used for other tasks without compromising performance.
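
The dataset sizes quoted in the list above follow from simple arithmetic, which can be sketched as below. This is only an illustration: it assumes decimal units (1 GB = 1000 MB) and reads the image size as 10 megabytes, and the channel and tile counts are hypothetical examples rather than values from the UNC Core.

```python
# Rough dataset-size estimate from the figures quoted in the constraints list.
# Assumptions (not from the guide): decimal units (1 GB = 1000 MB) and
# hypothetical channel and tile counts.

IMAGE_MB = 10          # one uncropped image, read as ~10 megabytes
STACK_DEPTH_UM = 5000  # 5 mm stack
Z_STEP_UM = 2.5        # spacing between planes


def dataset_size_gb(channels: int, tiles: int = 1) -> float:
    """Return the estimated dataset size in GB."""
    planes = round(STACK_DEPTH_UM / Z_STEP_UM)  # 2000 planes per stack
    return planes * IMAGE_MB * channels * tiles / 1000


for channels, tiles in [(1, 1), (3, 1), (3, 4), (3, 20)]:
    size = dataset_size_gb(channels, tiles)
    print(f"{channels} channel(s), {tiles} tile(s): {size:,.0f} GB")
```

One channel gives about 20 GB and three channels about 60 GB, which matches the typical range given above. Tiling multiplies the total directly.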

Table 11.1. Data write times for different amounts of data and transfer methods. Note that these are best-case values which are not always achieved in practice.
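
Best-case write times of this kind can be estimated from the write speeds listed in the constraints. The sketch below is not a reproduction of Table 11.1. It assumes the quoted speeds are in megabytes per second, uses decimal units, and picks three example dataset sizes for illustration.

```python
# Best-case write times computed from the speeds in the constraints list.
# Assumptions (not from the guide): speeds are in MB/s, 1 GB = 1000 MB,
# and the dataset sizes below are illustrative.

SPEEDS_MB_PER_S = {
    "10 Gb Ethernet": 500,
    "1 Gb Ethernet": 100,  # upper end of the quoted 50-100 range
    "USB3": 50,
}
SIZES_GB = [10, 100, 1000]


def transfer_minutes(size_gb: float, speed_mb_per_s: float) -> float:
    """Return the time in minutes to write size_gb at a constant speed."""
    return size_gb * 1000 / speed_mb_per_s / 60


for size in SIZES_GB:
    for name, speed in SPEEDS_MB_PER_S.items():
        minutes = transfer_minutes(size, speed)
        print(f"{size:>5} GB over {name:<15}: {minutes:7.1f} min")
```

Real transfers are usually slower than these figures, since sustained speeds depend on the drives at both ends and on how many small files are involved.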

11.4. Narrowing the problem

There are two data transfers that need to take place:

  • From the acquisition computer to the analysis workstation.
  • From the analysis workstation to a destination outside the core, which could be the network (University IT, commercial cloud) or an external hard drive.

The objective is to make those two steps as fast as possible.

11.5. Current configuration in the UNC Core

The transfer from acquisition to analysis workstation can be sped up by connecting those two computers via a 10 Gb Ethernet connection. We set this up by installing a second, 10 Gb Ethernet card in each of those computers and running a direct cable between them (this can be up to around 100 m long). In addition, each of those computers keeps its standard 1 Gb Ethernet card to access the internet and the University’s network. A more elegant alternative would be wiring the lab for 10 Gb Ethernet, but this was too expensive and slow to get off the ground at UNC. The direct computer-to-computer 10 Gb connection solves the problem of the first transfer (from acquisition to workstation PC) for datasets up to 100 GB, but not the second (from workstation to network or external hard drives). Transfer of TB-sized datasets is too slow, even with the 10 Gb connection (see below).
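
One way to see why the 10 Gb link is enough for datasets of around 100 GB but not for TB-sized ones is to ask how much data each connection can move within the 5 minute limit from section 11.1. The sketch below assumes the best-case speeds from section 11.3, read as megabytes per second, and decimal units.

```python
# Largest dataset that fits within the 5 minute patience limit, per link.
# Assumptions (not from the guide): speeds are in MB/s and 1 GB = 1000 MB.

PATIENCE_S = 5 * 60  # users get annoyed after 5 minutes (section 11.1)


def max_dataset_gb(speed_mb_per_s: float, limit_s: float = PATIENCE_S) -> float:
    """Return the largest dataset (GB) that transfers within limit_s."""
    return speed_mb_per_s * limit_s / 1000


for name, speed in [("10 Gb Ethernet", 500), ("1 Gb Ethernet", 100), ("USB3", 50)]:
    print(f"{name:<15}: up to {max_dataset_gb(speed):.0f} GB in 5 min")
```

Under these assumptions the 10 Gb link tops out at about 150 GB in 5 minutes, so a 1 TB dataset would take more than half an hour even in the best case.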

To further speed up data transfer, we have three 2 TB SSDs in a RAID 0 array on the analysis workstation, which is the fastest configuration we could afford. I don’t know exactly how much faster this is than a single hard drive. RAID 0 is also much riskier than other RAID configurations, because if one drive fails, the whole array is lost. However, we went this route because we are not providing final storage, so resilience to failure is less important than speed.
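
The speed difference between a RAID array and a single drive can be measured rather than guessed. The following is a minimal sketch of a sequential write test, not a tool used in the UNC Core. The function name and parameters are hypothetical, and it only measures large sequential writes, which is the pattern that matters when copying image stacks.

```python
import os
import tempfile
import time


def write_speed_mb_per_s(directory: str, size_mb: int = 1024, chunk_mb: int = 16) -> float:
    """Write about size_mb of random data into directory and return MB/s."""
    chunk = os.urandom(chunk_mb * 1_000_000)
    n_chunks = size_mb // chunk_mb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as handle:
            for _ in range(n_chunks):
                handle.write(chunk)
            handle.flush()
            os.fsync(handle.fileno())  # wait until the data has reached the drive
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # clean up the temporary test file
    return n_chunks * chunk_mb / elapsed


if __name__ == "__main__":
    # Small demo size so the example runs quickly; caches inflate short runs,
    # so use several GB for a realistic measurement.
    speed = write_speed_mb_per_s(tempfile.gettempdir(), size_mb=64)
    print(f"Sequential write: {speed:.0f} MB/s")
```

Running it once with `directory` pointing at the RAID volume and once at a single drive gives a rough comparison on the same machine.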

For datasets of 100 GB and larger, we installed hot-swap drive bays in both our acquisition and analysis workstations and use an SSD in one of those bays (specifically, an Icy Dock ToughArmor MB994SP-4SB-1 bay, and a Samsung 860 EVO MZ-76E2T0E 2 TB SSD). This is low-tech, cheap and the fastest possible option for moving data around. For example, for a 1 TB dataset, we can just walk the drive over from the acquisition computer to the workstation, which takes around 20 s in the UNC Core. Add another minute or so for getting drives in and out of bays and waiting for the computer to recognize them, and our transfer rate is roughly 1 TB per 2 min, or 8 GB/s. This is 80 times higher than 1 Gb Ethernet at its best.
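
The arithmetic behind the walking transfer rate can be checked with a few lines. The sketch assumes decimal units (1 TB = 1000 GB), rounds the total time to 2 minutes as in the text, and takes 100 MB/s as the best case for 1 Gb Ethernet.

```python
# Effective throughput of carrying a drive between computers.
# Assumptions: 1 TB = 1000 GB, total time rounded to 2 minutes,
# and 1 Gb Ethernet at its best delivers 100 MB/s (0.1 GB/s).

DATASET_GB = 1000            # 1 TB dataset on the SSD
TOTAL_S = 2 * 60             # walking plus swapping drives, rounded up
ETHERNET_1GB_GB_PER_S = 0.1  # best case for 1 Gb Ethernet


def effective_gb_per_s(size_gb: float, total_s: float) -> float:
    """Return the effective transfer rate in GB/s."""
    return size_gb / total_s


rate = effective_gb_per_s(DATASET_GB, TOTAL_S)
ratio = rate / ETHERNET_1GB_GB_PER_S
print(f"Walking: {rate:.1f} GB/s, about {ratio:.0f}x 1 Gb Ethernet")
```

The result is a little over 8 GB/s, or roughly 80 times the best-case 1 Gb Ethernet rate.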

Different distances, SSD sizes and walking speeds give different numbers, but they are all much higher than what cables deliver, no matter how fast the connections. The fact that this low-tech scheme is so much faster than cables has been well known for some time (see https://en.wikipedia.org/wiki/Sneakernet and https://what-if.xkcd.com/31/ ) but is often surprising to people with a background in biology (myself included).

11.6. Thoughts about even bigger datasets

For datasets that are tens of terabytes, an important principle is to get the data to final storage as soon as possible and to have that final storage be very close (computationally speaking) to where the high-power computing on those datasets can take place (the “cluster”). Ideally, we would only want to move the data once, from the acquisition computer to the cluster. However, if we want to do this via cables, we are stuck with the bandwidth limitations of those cables, which make the transfer slow enough that it might eat into usage time on the instrument itself. An idea I’ve discussed with Eric Wait, a staff member at Janelia Farm’s Advanced Imaging Center who is developing data analysis options, is to have a server connected to the acquisition computer via something called “InfiniBand.” Links of this type can run at 100 Gb/s or more, but the hardware is very expensive, making it prohibitive to wire the whole building with it. The data could get dumped to that server after an experiment ends and then could go from the server to the cluster while the next (long) experiment runs. This two-tiered system would keep the experimental rig’s uptime as high as possible, which is critical for a system like the Janelia SIMView, which is running multi-day experiments. For the UltraMicroscope II, this arrangement might be useful if it were being used continuously to tile large organs at maximal resolution.
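
Whether such a two-tiered arrangement keeps up depends on one condition: the staging server must finish sending a dataset to the cluster before the next experiment ends. A minimal sketch of that check follows. All numbers in it are hypothetical, and the function name is made up for illustration.

```python
# Check whether a staging server can drain one dataset to the cluster
# while the next experiment is still running.
# All values are hypothetical; 1 TB = 1000 GB is assumed.


def drain_hours(dataset_tb: float, link_gb_per_s: float) -> float:
    """Return hours needed to move dataset_tb over a link of link_gb_per_s."""
    return dataset_tb * 1000 / link_gb_per_s / 3600


def server_keeps_up(dataset_tb: float, link_gb_per_s: float, next_experiment_h: float) -> bool:
    """True if the dataset is fully drained before the next experiment ends."""
    return drain_hours(dataset_tb, link_gb_per_s) <= next_experiment_h


# Example: a 20 TB dataset and a 48 hour follow-up experiment.
for link in (0.1, 0.5):  # GB/s from server to cluster
    hours = drain_hours(20, link)
    ok = server_keeps_up(20, link, 48)
    print(f"{link} GB/s: {hours:.1f} h to drain, keeps up: {ok}")
```

With these example values, a 0.1 GB/s link to the cluster falls behind, while a 0.5 GB/s link drains the server with time to spare.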

Copyright 2018 Pablo Ariel.

This work and associated files are licensed under the Creative Commons Attribution-NonCommercial 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

Bookshelf ID: NBK536389
