Layers of Data Infrastructure 3: Storage

Design decisions for how your systems and pipelines store data.

This post is the third in a three-part series exploring the high-level design decisions you need to make for each stage in each use case category. The Storage layer defines how data is stored between, during and after all the stages in a pipeline or system.

The quick-read version:

  • Storage options are defined by the structure of the data and the ways you’ll write and query it.

  • Local files - Your hard drive. Simple to implement, but mostly for local compute.

  • Remote files - Shared hard drive. Scales well. Built-in backup & security. Limited read/write compared to local files, and it’s still just files. AKA “Data Lake”

  • Relational database - Excel-like data, read and write a few entries at a time, or query with SQL-like languages. Up to millions of rows, but not billions.

  • Analytics database - Similar structure to relational. Write in large batches, query at any scale. AKA “Data Warehouse”

  • KV-store - Write anything, with or without structure, one entry at a time or in large batches. Scales well, but querying is limited. AKA “NoSQL”

  • Graph Database - Similar data structure to relational, but viewed as a graph/network. Slow for SQL-like queries, but the only option for graph-like queries.
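
To make the trade-offs in the list above concrete, here is a minimal Python sketch that stores the same tiny dataset three ways. Everything in it is an illustrative assumption rather than something from the series: the `EMPLOYEES` sample data, the in-memory SQLite table standing in for a relational database, the plain dict standing in for a KV-store, and the adjacency list standing in for a graph database. Real systems differ a great deal in scale and features; the point is only how the shape of the query changes.

```python
import sqlite3

# Hypothetical sample data: (id, name, manager_id)
EMPLOYEES = [
    (1, "Ada", None),
    (2, "Ben", 1),
    (3, "Cy", 1),
    (4, "Di", 2),
]

# Relational: fixed columns, a few rows at a time, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER)"
)
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", EMPLOYEES)
direct_reports = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM employees WHERE manager_id = ? ORDER BY name", (1,)
    )
]

# KV-store stand-in: a dict keyed by id; values can have any shape.
kv = {
    emp_id: {"name": name, "manager_id": manager_id}
    for emp_id, name, manager_id in EMPLOYEES
}
kv[5] = {"name": "Eve", "notes": ["contractor"]}  # no schema to satisfy
by_key = kv[2]["name"]  # lookup by key is the fast path
# Any other question means scanning every entry.
scan = sorted(v["name"] for v in kv.values() if v.get("manager_id") == 1)

# Graph view: the same rows as an adjacency list (manager -> direct reports).
reports = {}
for emp_id, _name, manager_id in EMPLOYEES:
    reports.setdefault(manager_id, []).append(emp_id)


def all_reports(manager_id):
    """Return the ids of everyone below manager_id, at any depth."""
    found = []
    stack = list(reports.get(manager_id, []))
    while stack:
        emp = stack.pop()
        found.append(emp)
        stack.extend(reports.get(emp, []))
    return sorted(found)
```

The "everyone below this manager, at any depth" question is a short traversal in the graph view, while the single SQL query above only reaches one level down. That difference is the kind of graph-like query the last bullet refers to.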

For Further Consideration

  • How many of these options is your organization currently using? Do you store the same data in different places for different purposes?

  • What types of queries do your users currently rely on? Could you improve performance by switching to a more or less structured form of storage?

Further Reading

The rise of graph databases is a relatively recent trend in the storage layer.

Up Next

  • My next post will attempt to demystify some aspects of data governance.

  • After that comes a series of case studies exploring design options for specific use cases.

  • Then I want to take a step back and examine what it means to have a coherent, integrated data platform, and why you might want to invest in one.