Layers of Data Infrastructure 3: Storage

Design decisions for how your systems and pipelines store data.

Dec 09, 2020


This post is the third in a three-part series exploring the high-level design decisions you need to make for each stage in each use case category. The Storage layer defines how data is stored between, during and after all the stages in a pipeline or system.

The quick-read version:

Storage options are defined by the structure of the data and the ways you’ll write and query it.
Local files - Your hard drive. Simple to implement, but mostly for local compute.
Remote files - Shared hard drive. Scales well. Built-in backup & security. Limited read/write compared to local files, and it’s still just files. AKA “Data Lake”
Relational database - Excel-like data, read and write a few entries at a time, or query with SQL-like languages. Up to millions of rows, but not billions.
Analytics database - Similar structure to relational. Write in large batches, query at any scale. AKA “Data Warehouse”
KV-store - Write anything, with or without structure, one entry at a time or in large batches. Scales well, but querying is limited. AKA “NoSQL”
Graph database - Similar data structure to relational, but viewed as a graph/network. Slow for SQL-like queries, but the only option for graph-like queries.
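To make the differences in write and query patterns a little more concrete, here is a minimal sketch that stores the same handful of records three ways using only the Python standard library. It is an illustration under stated assumptions, not a recommendation of specific tools: an in-memory SQLite table stands in for a relational database, a plain dict stands in for a KV-store, and an adjacency map stands in for a graph database. The sample records and field names (`id`, `kind`, `parent`) are hypothetical.

```python
import sqlite3
from collections import deque

# Hypothetical records: lab samples, each optionally derived from a parent sample.
samples = [
    {"id": "s1", "kind": "tissue", "parent": None},
    {"id": "s2", "kind": "cells", "parent": "s1"},
    {"id": "s3", "kind": "rna", "parent": "s2"},
    {"id": "s4", "kind": "rna", "parent": "s2"},
]

# Relational: fixed columns, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id TEXT PRIMARY KEY, kind TEXT, parent TEXT)")
conn.executemany("INSERT INTO samples VALUES (:id, :kind, :parent)", samples)
rna_count = conn.execute(
    "SELECT COUNT(*) FROM samples WHERE kind = ?", ("rna",)
).fetchone()[0]

# KV-store (a dict as a stand-in): lookup by key is direct,
# but any other question means scanning every entry.
kv = {s["id"]: s for s in samples}
s3 = kv["s3"]
rna_ids = sorted(key for key, value in kv.items() if value["kind"] == "rna")

# Graph (an adjacency map as a stand-in): parent -> list of children.
children = {}
for s in samples:
    if s["parent"] is not None:
        children.setdefault(s["parent"], []).append(s["id"])

def descendants(root):
    """Breadth-first walk collecting everything derived from `root`."""
    found, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            found.append(child)
            queue.append(child)
    return found

all_from_s1 = descendants("s1")
```

The counting question is a one-line SQL query in the relational version but a full scan in the KV version, while the "everything derived from this sample" question is a short traversal in the graph version and would need a recursive query or repeated joins in SQL.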

For Further Consideration

How many of these options is your organization currently using? Do you store the same data in different places for different purposes?
What types of queries do your users currently rely on? Could you improve performance by switching to a more or less structured form of storage?

Up Next

My next post will attempt to demystify some aspects of data governance.
That will be followed by a series of case studies exploring design options for specific use cases.
Then I want to take a step back and examine what it means to have a coherent, integrated data platform, and why you might want to invest in one.

3 Comments

en zyme


Dec 17, 2022·edited Jan 1, 2023

The myriad ways of organizing data are confusing. The articles you've linked to are good reads. However, designing the data structures for a given industry is more than an afternoon's work. It feels far removed from the day to day of biotech. Any strategies to avoid "paralysis by analysis"?

https://miro.medium.com/max/720/0*h60AcWEOy-5Qdmr2

https://www.sqlshack.com/wp-content/uploads/2018/05/word-image-281.png


1 reply by Jesse Johnson

Dec 17, 2022

For storage options I like to consider capacity, cost, convenience, and latency. Over the years there have been many expensive high tech solutions such as tape libraries and data closets.

The ETL vs ELT analysis you mentioned is a good place to start. Understanding scale and scope is hard to do in advance, so it's important to leverage lessons learned. Data Lakes and Graph Databases require an understanding of the broader objectives, significant planning, and a commitment of resources. Biologists grapple with the layering of biochemical, cellular, organ, system, and behavioural levels. A haphazard storage strategy will be as temperamental as a hyena and as sluggish as, well, as sluggish as a slug.

https://media.sciencephoto.com/image/c0049078/400wm/C0049078-Computer_Tape_Library.jpg

https://images.computerhistory.org/revonline/images/500004392-03-01.jpg


Scaling Biotech