Database Tech Jargon for Product Leaders - Sketched & Explained
If you want the bigger picture before diving into databases specifically, check out our post on full-stack system design to understand how architecture, APIs, databases, and cloud all fit together.
Looking for a way to understand technical jargon around data storage and databases?
Here’s how I’d sketch it out and explain it to non-technical product leaders.
Architecture – Data storage and database concepts you must know
Data is core to powering helpful products, whether it's used for real-time operational functionality or for providing deep analytical insights into your business.
Data is stored in different ways depending on how structured and how large it is.
Relational (SQL) databases are used to store highly structured data where each record has the same shape and fields, for example, pets' profiles with details like their name, age, and breed.
Non-relational (NoSQL) databases, such as document databases, are used to store unstructured or semi-structured data.
Document databases are a form of schemaless NoSQL database used to store data in flexible structures (like JSON) so that each record can hold different details without needing the same fields per record.
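To make the difference in shape concrete, here's a tiny Python sketch (illustrative only – the pet names and fields are made up):

```python
# Relational (SQL): every record must match the same fixed columns.
# e.g. CREATE TABLE pets (name TEXT, age INTEGER, breed TEXT);
relational_row = ("Biscuit", 3, "Beagle")  # always the same shape and fields

# Document (NoSQL): each record is a flexible structure (like JSON),
# so different pets can carry different details.
document_records = [
    {"name": "Biscuit", "age": 3, "breed": "Beagle"},
    {"name": "Mittens", "temperament": "shy", "notes": ["prefers quiet rooms"]},
]
```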
Vector databases are a specialised type of database designed to identify the degree of similarity across data in large datasets for scenarios like finding related images or powering recommendation engines.
Vector databases are specialised to store and search high-dimensional vectors – which are numerical representations of data like images, text, and combinations of data like an entire pet's profile – to quickly find records that are most similar based on their embedded traits.
For example, shelter volunteers' comments and notes about each pet might be stored in a document database, while a vector database could power a recommendation engine that suggests similar pets based on the underlying profile data that's been converted into vector representations – which may reflect traits like their energy level, temperament, or size – depending on how the embeddings were generated.
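If you're curious what "finding the most similar record" boils down to, here's a toy Python sketch using cosine similarity – the numbers are made up, and real embeddings have hundreds or thousands of dimensions generated by a model:

```python
import math

# Each pet's profile has already been converted into a numerical vector (an "embedding").
pet_embeddings = {
    "Biscuit": [0.9, 0.1, 0.4],   # made-up values, e.g. high energy, small size
    "Rex":     [0.8, 0.2, 0.5],
    "Mittens": [0.1, 0.9, 0.2],
}

def cosine_similarity(a, b):
    """Higher means 'more similar' – the kind of maths a vector database runs at scale."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = pet_embeddings["Biscuit"]
ranked = sorted(
    ((name, cosine_similarity(query, vec))
     for name, vec in pet_embeddings.items() if name != "Biscuit"),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)  # Rex comes out as most similar to Biscuit
```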
Object storage is used to cheaply store media like photos and videos as binary blobs (1s and 0s), for example, pet photos and videos taken by volunteers.
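As a rough sketch of what "storing a blob" looks like in practice, here's how an app might upload a pet photo to an S3-compatible object store using Python's boto3 library – assuming credentials are already configured, and with a made-up bucket, key, and file name:

```python
import boto3  # AWS SDK; assumes an S3-compatible object store and configured credentials

s3 = boto3.client("s3")

# Store the photo as a binary blob; the database would only keep a small
# reference (the bucket + key), not the heavy media itself.
with open("biscuit.jpg", "rb") as photo:
    s3.put_object(
        Bucket="pet-adoption-media",           # hypothetical bucket name
        Key="shelters/42/pets/biscuit.jpg",    # hypothetical key
        Body=photo,
    )
```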
Operational data provides applications with real-time information that enables real-time functionality within the product. It's optimised for many small, concurrent interactions with fast response times.
Streaming processes operational data continuously in real time, record by record, where immediate insight with minimal latency is critical.
Analytical data provides businesses with deep insight into trends, patterns, and anomalies that can be used to enable data-driven decision making. It's optimised for processing complex queries rather than for fast response times.
Batching processes analytical data in large chunks periodically – such as weekly, monthly, or quarterly – where the output of intensive data crunching is more valuable than immediate insight.
For example, quarterly reporting might show that pet adoptions increase during warmer months, which can help you plan volunteer capacity and outreach campaigns. In this case, the value of the report is in the insights even if it takes hours to crunch overnight once per quarter.
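Here's a toy Python sketch of the two processing styles side by side – streaming reacts to each adoption event immediately, while batching crunches a whole quarter's worth in one go (the events and helper logic are made up for illustration):

```python
from collections import Counter

# Streaming: react to each adoption event the moment it arrives, record by record.
def on_adoption_event(event):
    # In a real system this might push a notification or update a live dashboard.
    print(f"{event['pet']} was just adopted - update the live availability list")

# Batching: crunch a whole quarter's worth of events in one go, later.
def quarterly_adoption_report(events):
    return Counter(event["month"] for event in events)

events = [
    {"pet": "Biscuit", "month": "July"},
    {"pet": "Rex", "month": "July"},
    {"pet": "Mittens", "month": "September"},
]
for e in events:
    on_adoption_event(e)                      # streaming: one at a time, immediately
print(quarterly_adoption_report(events))      # batching: all at once, periodically
```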
Communication – Build your technical fluency around data storage and databases
Encryption at rest
Encryption scrambles data into a mess of characters so that it appears random and meaningless until it is decrypted again into its original format using a decryption key.
There are two types of encryption: encryption at rest and encryption in transit.
Encryption at rest protects stored data from exposure so that it cannot be read and understood without decoding it.
This means that if a malicious actor accesses the encrypted data, they won't be able to make sense of it without the decryption key.
For example, volunteers' personal details, such as their names, may be stored in an encrypted format using encryption at rest so that if a malicious attacker gains access to the database, they cannot read the volunteers' names.
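As a simplified, application-level sketch of what encrypting a value looks like, here's a few lines of Python using the cryptography package's Fernet recipe – in reality, databases often handle encryption at rest transparently, and keys live in a dedicated key-management service rather than alongside the data:

```python
from cryptography.fernet import Fernet  # assumes the 'cryptography' package is installed

# Generate a secret key - in practice this would live in a key-management service,
# never next to the encrypted data itself.
key = Fernet.generate_key()
cipher = Fernet(key)

encrypted_name = cipher.encrypt(b"Alex the volunteer")
print(encrypted_name)                    # looks like random, meaningless bytes at rest
print(cipher.decrypt(encrypted_name))    # only readable again with the decryption key
```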
Encryption in transit protects data moving over networks from eavesdropping and is handled as part of overall software security at an infrastructure level rather than being a data storage and database specific concern.
Scalability and architecture
Databases can be dynamically scaled during periods of higher or lower than usual usage to ensure that they perform optimally and meet predefined thresholds like response times while minimising cost.
Scaling up (vertically) increases the computational power, such as CPU or RAM, of a single database instance to handle higher load.
Scaling out (horizontally) increases the number of database instances to distribute the higher loads across them.
Vertically scaling a database in cloud platforms is generally easier since you can just increase the tier to add CPU or RAM to a single database instance.
However, vertical scaling becomes expensive quickly, since it costs more to run a single, more powerful database than it does to run several smaller databases in parallel.
Horizontal scaling leverages this cost difference by spinning up additional database instances as needed to distribute the load.
However, horizontal scaling increases a system's complexity, since loads need to be distributed and their results processed and combined in the correct sequence.
Load balancers distribute incoming requests to different database instances in a horizontally scaled system to improve response times.
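Here's a toy round-robin sketch of what a load balancer does conceptually – the instance names are made up, and real load balancers are dedicated infrastructure rather than application code:

```python
from itertools import cycle

# A toy round-robin "load balancer": requests are spread across several
# database instances in turn.
database_instances = cycle(["db-replica-1", "db-replica-2", "db-replica-3"])

def route_query(query):
    instance = next(database_instances)
    return f"running '{query}' on {instance}"

for q in ["list pets", "get pet 7", "list shelters", "get pet 12"]:
    print(route_query(q))  # each request lands on the next instance in turn
```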
For example, the pet adoption app's shelter admin portal may use vertical scaling to quickly increase the computational resources available during occasional spikes, like when bulk importing a newly onboarded shelter's list of available pets.
In this case of occasional spikes, it may be more cost-effective to temporarily upgrade a single database than to invest in setting up and managing multiple distributed instances.
On the other hand, the pet adoption app's mobile, adopter-facing apps may use horizontal scaling to cater for recurring, predictable increases in usage during certain periods, such as around the holiday season if that's when the number of pet adoptions increases every year.
Building resilient systems
Data replication and data redundancy are two techniques used to increase the resilience of software systems, making them more tolerant of faults and allowing them to continue functioning uninterrupted even when things go wrong. This is especially important for mission-critical workflows in industries like healthcare and finance.
Data replication creates synced copies of the same data and introduces data redundancy in a system to increase reliability if there are issues like services becoming unavailable in a region.
Running more than one database instance at a time with redundant data helps create high availability systems for critical workloads but increases cost and complexity.
Distributed systems trade off between strong consistency – where data is always in sync, as needed for critical workloads – and high availability – where some data may be slightly out of date, which is often acceptable for general consumer applications. This trade-off, which becomes unavoidable when parts of the system can't reach each other (a network partition), is known as the CAP theorem.
For example, if it were mission critical for the pet adoption app to always be available for potential adopters to view pets – which, granted, is unlikely to be the case in practice – then data replication across different regions of a cloud provider's offering could be used to build redundancy so that the app can continue to function as normal if one region experiences downtime.
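As a toy sketch of that redundancy idea, here's a few lines of Python where reads simply fall back to another region's copy of the data if one region is down (the regions and data are made up for illustration):

```python
# The same data is replicated across regions; reads fall back to another
# synced copy if one region is unavailable.
replicas = {
    "eu-west": {"available": False, "pets": ["Biscuit", "Rex"]},   # region having downtime
    "us-east": {"available": True,  "pets": ["Biscuit", "Rex"]},   # synced copy
}

def list_pets():
    for region, replica in replicas.items():
        if replica["available"]:
            return replica["pets"]          # the app keeps working despite the outage
    raise RuntimeError("no replica available")

print(list_pets())
```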
Extracting, transforming, and loading data
Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT) are two alternative approaches to moving data from a source system to a destination system.
ETL pulls data from a source, transforms it in transit, then stores the refined results in the destination system.
In this case, the destination system contains only the cleaned data and not a copy of the source data. This reduces data storage costs but it can make troubleshooting or data restoration more difficult if there are issues found in the transformed data.
ELT pulls data from a source, stores that raw data as-is, then transforms that data in-place in the destination system.
In this case, the destination system can be set up to either contain only the cleaned data or to also include the raw data. Keeping the raw data too increases storage costs but can make troubleshooting easier and allows you to re-run the transformation if requirements change or there are bugs in the transformation.
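Here's a toy Python sketch of the difference in ordering – the "warehouse" is just a dictionary standing in for the destination system, and real pipelines use dedicated tooling:

```python
def extract():
    return [{"name": " biscuit ", "age": "3"}]  # messy raw data from the source

def transform(records):
    # Clean up names and convert ages to numbers.
    return [{"name": r["name"].strip().title(), "age": int(r["age"])} for r in records]

warehouse = {}  # stand-in for the destination system

# ETL: transform while the data is in transit, store only the clean result.
warehouse["pets_clean"] = transform(extract())

# ELT: land the raw data first, then transform it in place in the destination.
warehouse["pets_raw"] = extract()
warehouse["pets_clean"] = transform(warehouse["pets_raw"])

print(warehouse)
```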
Planning – Build your technical intuition around data storage and databases
Different types of data are better suited to different storage methods. Databases store text details like pet names and breeds well, while object storage excels at cheaply storing media like pet photos and videos at scale.
Understanding these differences can help you plan more intuitively and earn engineers' trust when you're cognisant that it takes extra time to set up new infrastructure.
For example, if you're adding pet photos for the first time, there'll be extra complexity to set up object storage across your different environments like Dev, QA, and Production.
But future features, like including pet videos, may have lower complexity since the infrastructure would already be in place.
For mission-critical systems, be sure to understand and clearly communicate what acceptable thresholds and SLAs are.
For example, if your product and company offer a certain level of uptime and availability for enterprise customers' data, ensure you and your engineering team are proactively aligned on how this will be met rather than reactively firefighting when something goes wrong.
Understand that increasing reliability by introducing redundancy increases direct computational costs and architectural complexity but ensures users always have access to critical data.
Work together with your engineering team to plan, monitor, iterate, and optimise on this.
Be clear about whether users need access to always-up-to-date information or whether certain functionality is valuable even with a delay, such as weekly or monthly reporting.
Batch processing monthly reports during low-usage hours can help you optimise your infrastructure costs by balancing load better, but it trades off real-time data accuracy.
For example, you could save on your data storage and database bills by running quarterly reports over the weekend if that's when your app is used the least.
Encrypting users' personal information (PI) helps protect them while enabling you to create rich functionality within your product.
But you can take it a step further by minimising what personal information you store in the first place.
For example, it may be sufficient to only store volunteers' names since that's shown in the shelter admin portal when creating rosters, but it may not be necessary to store information like their gender.
Make sure your data retention policies are clear – and that everyone building new features knows how to follow them.
Okay, and that’s a wrap.
If you liked this style of sketched & explained knowledge share, either check out our post on API technical jargon sketched and explained, or download your FREE copy of our no-fluff ebook System design, APIs, and databases sketched & explained before lunch.
Whenever you're ready, here's how I can help you
System design, APIs, and databases sketched and explained before lunch - FREE
Download your free copy of our ebook and build your intuitive technical fluency around 3 core concepts of building software products to earn engineers' trust FAST
Product Tech Sense™ Framework - PREMIUM
Request an invite to our limited-seats, premium, online live cohorts to master technical fluency, build technical intuition, and earn engineers' trust with our guided, high-touch sessions