Mapping the Human Genome: an Influx of Data That Needs to be Managed


Over the last 30 years, the human genome has been the lock that scientists have been trying to pick in order to understand why certain demographics are more prone to disease while others are not. In that timeframe, biologists have cracked the code and successfully mapped a human genome. The initial breakthrough took 13 years and carried a price tag of $2.7 billion. But progress has since accelerated dramatically, with costs falling on a Moore's Law-like curve. Today we have over 1 million genomes fully mapped, at a new price tag of a mere $1,000 per genome.

Data Overload

As biology, science, and technology converge on this project, and on many others like it, the amount of big data generated has become staggering.

Jim Sullivan, AbbVie’s global VP of Discovery, put the size of this big data in physical terms: if a single human genome were printed on paper, it would create a book 150 feet tall. If that is how much data one genome generates, the full collection would surpass the size of a library in no time.
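A quick back-of-envelope check shows how a printed genome reaches a stack of that order. The base-pair count, characters per page, and paper thickness below are illustrative assumptions, not figures from the article:

```python
# Back-of-envelope: height of a printed human genome.
# All constants are illustrative assumptions.
BASES = 3_000_000_000     # ~3 billion base pairs in one human genome
CHARS_PER_PAGE = 3_000    # dense text, one printed letter per base
SHEET_MM = 0.1            # thickness of one sheet of office paper
PAGES_PER_SHEET = 2       # printed double-sided

pages = BASES / CHARS_PER_PAGE        # 1,000,000 pages
sheets = pages / PAGES_PER_SHEET      # 500,000 sheets
height_m = sheets * SHEET_MM / 1000   # 50 meters
height_ft = height_m / 0.3048         # ~164 feet

print(f"{pages:,.0f} pages, stack ~{height_ft:.0f} feet tall")
```

Depending on the paper weight and page layout assumed, the estimate lands in the same ballpark as Sullivan's 150-foot figure.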

Managing and Moving this Big Data

The genome project has created, and will continue to create, massive amounts of data with every new genome that gets mapped. And as more organizations across different industries and disciplines become involved in these projects, sharing and collaborating on that data becomes harder.

Mapping a genome also requires High-Performance Computing (HPC) to analyze and visualize these datasets. This is usually done offsite, and the HPC center may not always be next door. One hundred genome samples alone can represent 15 TB of data (30 TB with a backup).
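To see why moving these datasets to an offsite HPC center is hard, here is a naive transfer-time sketch. The only input from the article is the 15 TB figure for 100 samples; the link speeds are illustrative, and real-world throughput is usually lower than line rate:

```python
# Naive transfer-time estimate for a 100-genome (15 TB) dataset.
DATASET_TB = 15
DATASET_BITS = DATASET_TB * 1e12 * 8   # decimal terabytes to bits

def transfer_hours(link_gbps: float) -> float:
    """Hours to move the dataset at full line rate (no protocol overhead)."""
    return DATASET_BITS / (link_gbps * 1e9) / 3600

for gbps in (0.1, 1, 10):
    print(f"{gbps:>5} Gbps link: {transfer_hours(gbps):8.1f} hours")
```

Even at a full gigabit per second with zero overhead, 15 TB takes well over a day; at 100 Mbps it takes roughly two weeks, which is why protocol overhead and link speed dominate these projects' logistics.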

This has created an environment where software and hardware vendors are jumping to action to try and fill the need for solutions help move and manage this data. Sure, shipping physical mediums and traditional FTP-based file transfer solutions are available. But in the new age of the petabyte and zettabyte, they are quickly becoming unable to handle the transfer tasks required for these projects. And with the End of Globus Toolkit making grid computing a less feasible option, it is time to move away from the FTP protocol as a whole.