Md Mominul Islam | Software and Data Engineering | SQL Server, .NET, Power BI, Azure Blog

while(!(succeed=try()));


Wednesday, September 10, 2025

Top Databases Used by Google, Amazon, Netflix, and Facebook

 

Introduction: The Backbone of Digital Giants

In the fast-paced world of 2025, where data drives everything from personalized recommendations to global e-commerce, the choice of database is critical for tech behemoths like Google, Amazon, Netflix, and Facebook (now Meta). These companies handle petabytes of data daily, serving billions of users with sub-millisecond latency and unbreakable availability. As cloud-native architectures dominate, databases have evolved from simple storage systems to intelligent, AI-infused platforms that scale horizontally, ensure consistency across continents, and integrate seamlessly with machine learning pipelines.

This blog post dives deep into the top databases powering these platforms. We'll start with an overview table highlighting key databases and their adoption across the four companies. Then, we'll explore each database in detail: its architecture, pros and cons, real-life usage scenarios, business applications, and step-by-step examples with code snippets. Drawing from production insights, we'll examine how these databases solve real-world challenges like handling Black Friday traffic spikes at Amazon or real-time feed updates at Meta.

Whether you're a developer building the next big app, a business leader optimizing costs, or a curious tech enthusiast, this guide provides actionable insights grounded in 2025's landscape. Let's uncover how these databases fuel innovation and reliability at scale.

Overview Table: Databases Powering the Giants

| Database | Google | Amazon | Facebook (Meta) | Netflix |
|---|---|---|---|---|
| MySQL | Yes (Cloud SQL, YouTube with Vitess) | Yes (RDS, Aurora compatible) | Yes (primary, with MyRocks) | Yes (billing, transactions) |
| PostgreSQL | Yes (Cloud SQL, AlloyDB) | Yes (RDS, Aurora compatible) | Limited | Yes (some operational stores) |
| Cassandra | No | No | Yes (distributed unstructured data) | Yes (core NoSQL for scalability) |
| DynamoDB | No | Yes (serverless NoSQL) | No | No |
| BigQuery | Yes (analytics warehouse) | No | No | No |
| Spanner | Yes (global relational) | No | No | No |
| Bigtable | Yes (wide-column NoSQL) | No | No | No |
| CockroachDB | No | No | No | Yes (multi-region SQL) |
| HBase | No | No | Yes (big data storage) | No |
| TAO | No | No | Yes (social graph) | No |
| Redis | Yes (Memorystore) | Yes (ElastiCache) | Yes (caching layer) | Yes (EVCache, based on Redis) |

This table is based on production deployments as of mid-2025. Note that these companies often customize databases (e.g., Meta's MyRocks) and layer abstractions on top for specific needs.

Section 1: MySQL – The Relational Workhorse

MySQL remains a cornerstone for transactional workloads in 2025, prized for its maturity, open-source roots, and compatibility with modern cloud services. Originally developed by MySQL AB and stewarded by Oracle since 2010, it's now a staple in managed services like Google Cloud SQL, AWS RDS/Aurora, Netflix's billing systems, and Meta's user database (UDB). At scale, companies like these shard it across thousands of instances, using tools like Vitess (Google/YouTube) for horizontal scaling.

Architecture Step-by-Step

  1. Setup and Installation: In production, MySQL is rarely self-hosted. For Google Cloud SQL, you provision via console: Select MySQL version (e.g., 8.0), configure machine type (e.g., db-n1-standard-2), and enable high availability.
  2. Data Modeling: Define schemas with tables, indexes, and foreign keys. For example, a user table with relationships to posts.
  3. Sharding and Replication: Use Vitess for sharding (splitting data across servers) or Aurora's shared storage for replication.
  4. Query Optimization: Leverage EXPLAIN for query plans, add indexes, and use read replicas for load balancing.
  5. Monitoring and Scaling: Integrate with Prometheus or Cloud Monitoring; auto-scale based on CPU/IOPS.
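The sharding step above is the hardest part at this scale. As a rough sketch (this is not Vitess itself, and the shard count is an arbitrary placeholder), hash-based shard routing looks like this:

```python
# Illustration of hash-based shard routing: a stable hash of the sharding
# key picks one of N shards, so a user's rows always land on the same
# MySQL instance.
import hashlib

NUM_SHARDS = 16  # hypothetical shard count

def shard_for(user_id: int) -> int:
    """Map a user ID to a shard number with a stable hash."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for(42), shard_for(43))  # deterministic shard assignments
```

Real routers like Vitess also maintain a shard-to-host topology map and rewrite queries, but the key idea is this stable key-to-shard function.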

Pros and Cons

Pros:

  • Reliability: ACID compliance ensures transactional integrity, crucial for e-commerce (Amazon) or billing (Netflix).
  • Performance: With InnoDB engine, it handles 100k+ TPS; Aurora boosts this 5x via optimized I/O.
  • Ecosystem: Vast tools (phpMyAdmin, MySQL Workbench) and community support.
  • Cost-Effective: Free core, pay for managed services (e.g., AWS RDS starts at $0.017/hour).

Cons:

  • Scalability Limits: Vertical scaling hits walls; sharding is complex without tools like Vitess.
  • No Native Distribution: Not built for global consistency like Spanner; replication lag can occur.
  • Schema Rigidity: Less flexible for semi-structured data compared to NoSQL.
  • Security Overhead: Requires careful configuration for encryption and access (e.g., SSL mandatory in 2025 compliance).

Real-Life Usage

At Google, YouTube uses MySQL with Vitess to store video metadata for 2.5B+ users. During peak uploads (e.g., 500 hours/minute), Vitess routes queries to 1,000+ shards, preventing hotspots. In a real outage scenario in early 2025, Vitess's query routing failover restored service in under 30 seconds.

Amazon's Aurora MySQL powers Prime Video transactions, handling 1M+ concurrent streams. A Black Friday 2024 event (extending into 2025 planning) saw Aurora scale to 100TB storage, processing 10B+ queries/day with 99.99% uptime.

Netflix relies on MySQL for financials: subscriptions, taxes, revenue. In 2025, with 300M+ subscribers, it processes $15B+ annual revenue without downtime, using read replicas across AWS regions.

Meta's UDB (MySQL-based) stores 3B+ user profiles. With MyRocks (LSM-tree storage), it reduces write amplification by 10x, serving 100B+ daily reads for feeds and messages.

Business Usage

Businesses use MySQL for CRM systems (e.g., Salesforce integrations at Amazon sellers) or inventory management. A mid-sized e-commerce firm might save 40% on costs by migrating to RDS vs. on-prem, enabling faster feature rollouts like personalized carts. In finance, Netflix's model inspires banks to use MySQL for audit trails, ensuring compliance with GDPR/SOX.

Step-by-Step Example: Building a User Management System

Let's simulate a simple user system, like Meta's profile storage.

  1. Create Database and Table:
    sql
    CREATE DATABASE user_db;
    USE user_db;
    CREATE TABLE users (
        id INT AUTO_INCREMENT PRIMARY KEY,
        username VARCHAR(50) UNIQUE NOT NULL,
        email VARCHAR(100) UNIQUE NOT NULL,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        INDEX idx_username (username)
    );
  2. Insert Data (Batch for scale):
    sql
    INSERT INTO users (username, email) VALUES
    ('john_doe', 'john@example.com'),
    ('jane_smith', 'jane@example.com');
  3. Query with Joins (e.g., for social connections):
    sql
    CREATE TABLE friends (
        user_id INT,
        friend_id INT,
        FOREIGN KEY (user_id) REFERENCES users(id),
        FOREIGN KEY (friend_id) REFERENCES users(id)
    );
    INSERT INTO friends VALUES (1, 2);
    SELECT u1.username, u2.username FROM users u1
    JOIN friends f ON u1.id = f.user_id
    JOIN users u2 ON f.friend_id = u2.id;
  4. Optimization for Production (Add partitioning for Meta-scale):
    sql
    -- Note: MySQL requires the partitioning column in every unique key,
    -- so the UNIQUE constraints on username/email would first need to
    -- become composite keys that include id.
    ALTER TABLE users PARTITION BY HASH(id) PARTITIONS 16;
    EXPLAIN SELECT * FROM users WHERE username = 'john_doe';  -- Verify index use
  5. Scaling with Replication (In AWS RDS):
    • Create read replica: Via console, promote for failover.
    • Code: Use connection pooling (e.g., in Node.js with mysql2):
      javascript
      const mysql = require('mysql2/promise');
      const pool = mysql.createPool({
          host: 'your-rds-endpoint',
          user: 'admin',
          password: 'secret',
          database: 'user_db',
          waitForConnections: true,
          connectionLimit: 10,
          queueLimit: 0
      });
      async function getUser(id) {
          const [rows] = await pool.execute('SELECT * FROM users WHERE id = ?', [id]);
          return rows[0];
      }

This setup handles 1k RPS easily; at Meta, similar code is distributed across microservices.


Section 2: PostgreSQL – The Advanced Relational Choice

PostgreSQL, often called "Postgres," has surged in 2025 due to its extensibility, JSON support, and full-text search. Google uses it in Cloud SQL and AlloyDB for AI workloads, Amazon in RDS/Aurora for complex queries, and Netflix for operational data. Meta uses it sparingly, favoring MySQL for core workloads and reserving Postgres for analytics.

Architecture Step-by-Step

  1. Deployment: Managed via Cloud SQL (Google) or RDS (Amazon). Choose version 16+ for parallel vacuuming.
  2. Extensions: Enable pg_trgm for trigram search or PostGIS for geo-data.
  3. Indexing: B-tree for equality, GIN for JSON.
  4. Partitioning: Declarative partitioning for time-series (e.g., Netflix logs).
  5. HA Setup: Streaming replication with synchronous commits.

Pros and Cons

Pros:

  • Feature-Rich: Native JSONB for semi-structured data, window functions for analytics.
  • Standards Compliance: Broad SQL:2023 conformance, ideal for migrations.
  • Concurrency: MVCC handles high reads/writes without locks.
  • Open-Source Power: Extensions like TimescaleDB for time-series.

Cons:

  • Resource Intensive: Higher memory use than MySQL for large datasets.
  • Learning Curve: Advanced features require expertise.
  • Slower Writes: WAL logging can bottleneck under heavy inserts.
  • Vendor Lock-In Risk: Custom extensions in AlloyDB tie to Google.

Real-Life Usage

Google's AlloyDB (Postgres-compatible) powers AI query engines at Next '25, processing natural language SQL for 1M+ developers. In production, it reduced query times by 50% for YouTube analytics.

Amazon's Aurora Postgres handles 3x faster OLTP than standard Postgres, used in AWS services for order processing. During Prime Day 2025, it managed 500M+ transactions with zero data loss.

Netflix uses Postgres for device management, storing 1B+ session records. CockroachDB (Postgres-wire compatible) extends this with global replication, ensuring low latency across 190 countries.

Business Usage

Enterprises like banks use Postgres for fraud detection (window functions analyze transactions). A retail chain could save $100k/year on licensing by switching from Oracle to RDS Postgres, enabling real-time inventory syncs.

Step-by-Step Example: JSON-Based Content Recommendation (Netflix-Style)

  1. Schema with JSONB:
    sql
    CREATE TABLE recommendations (
        user_id INT PRIMARY KEY,
        prefs JSONB NOT NULL,
        updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );
    CREATE INDEX idx_prefs ON recommendations USING GIN (prefs);
  2. Insert Semi-Structured Data:
    sql
    INSERT INTO recommendations (user_id, prefs) VALUES
    (1, '{"genres": ["action", "drama"], "watch_history": ["movie1", "movie2"]}');
  3. Query with JSON Operators:
    sql
    SELECT user_id, prefs->'genres' AS genres
    FROM recommendations
    WHERE prefs @> '{"genres": ["action"]}'  -- Contains operator
    ORDER BY updated_at DESC;
  4. Advanced Analytics (Window Function):
    sql
    SELECT user_id, 
           COUNT(*) OVER (PARTITION BY prefs->>'genres') AS genre_count
    FROM recommendations;
  5. Production Scaling (Python with psycopg2):
    python
    import psycopg2
    conn = psycopg2.connect(
        host="your-rds-endpoint",
        database="rec_db",
        user="admin",
        password="secret"
    )
    cur = conn.cursor()
    cur.execute("SELECT * FROM recommendations WHERE user_id = %s", (1,))
    print(cur.fetchone())
    cur.close()
    conn.close()

This mirrors Netflix's preference modeling, scaling to 10k QPS with partitioning.

Section 3: Apache Cassandra – The Distributed NoSQL Powerhouse

Cassandra, an open-source wide-column store, excels in write-heavy, high-availability scenarios. Netflix uses it as its core NoSQL for member data, Meta for unstructured logs and messages. It's not primary for Google or Amazon, who prefer Bigtable/DynamoDB.

Architecture Step-by-Step

  1. Cluster Setup: Deploy on Kubernetes (e.g., Netflix's Titus); tune gossip protocol for node discovery.
  2. Data Model: Denormalize with CQL (Cassandra Query Language); use composite keys.
  3. Partitioning: Hash-based (consistent hashing) for even distribution.
  4. Replication: Factor of 3 across racks/datacenters.
  5. Tuning: Compaction strategies (e.g., LeveledCompaction for reads).
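Steps 3 and 4 can be sketched together: a consistent-hash ring gives each node a token, the partition key hashes to a position on the ring, and the replicas are the next RF nodes clockwise. This is a simplified illustration (real Cassandra uses Murmur3 tokens and virtual nodes, not MD5):

```python
# Toy consistent-hash ring: nodes sit at token positions, and a partition's
# replicas are the RF nodes found walking clockwise from the key's token.
import bisect
import hashlib

def token(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, replication_factor=3):
        self.rf = replication_factor
        self.nodes = sorted(nodes, key=token)       # nodes ordered by token
        self.tokens = [token(n) for n in self.nodes]

    def replicas(self, partition_key: str):
        """Walk clockwise from the key's token, taking RF distinct nodes."""
        start = bisect.bisect(self.tokens, token(partition_key)) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)] for i in range(self.rf)]

ring = Ring(["node-a", "node-b", "node-c", "node-d"])
print(ring.replicas("user:42"))  # three replica nodes for this partition
```

Adding a node only remaps the token ranges adjacent to it, which is why Cassandra can scale out without rebalancing everything.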

Pros and Cons

Pros:

  • Linear Scalability: Add nodes without downtime; handles 1M+ writes/sec.
  • Fault Tolerance: No single point of failure; tunable consistency (e.g., QUORUM).
  • High Availability: Multi-DC replication for global apps (Netflix).
  • Cost Efficiency: Commodity hardware.

Cons:

  • Query Limitations: No joins/ACID; eventual consistency can confuse devs.
  • Complexity: Data modeling requires upfront denormalization.
  • Read Latency: Secondary indexes slow; repairs needed for anti-entropy.
  • Operational Overhead: Monitoring tools like OpsCenter essential.

Real-Life Usage

Netflix's Cassandra clusters store 100PB+ of viewing history, serving personalized rows to 300M users. In 2025, during a global content drop (e.g., new season release), it handled 50B+ writes/day, with repairs ensuring data sync across US/EU.

Meta uses Cassandra for Inbox messages, processing billions daily. A 2025 update integrated it with Presto for analytics, reducing query times from hours to minutes for ad targeting.

Business Usage

Logistics firms use Cassandra for IoT sensor data (e.g., Amazon suppliers tracking shipments). A streaming service could cut latency by 80% vs. MySQL, boosting user retention and revenue.

Step-by-Step Example: Time-Series Viewing Logs (Netflix-Style)

  1. Keyspace Creation:
    cql
    CREATE KEYSPACE viewing_logs WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': 3};
    USE viewing_logs;
    CREATE TABLE views (
        user_id UUID,
        timestamp TIMESTAMP,
        video_id TEXT,
        duration_watched INT,
        PRIMARY KEY ((user_id), timestamp)
    ) WITH CLUSTERING ORDER BY (timestamp DESC);
  2. Insert Data:
    cql
    -- Use a concrete UUID so the row can be queried back; uuid() would
    -- generate a new random value on every call.
    INSERT INTO views (user_id, timestamp, video_id, duration_watched)
    VALUES (123e4567-e89b-12d3-a456-426614174000, '2025-09-10 10:00:00', 'show123', 1800);
  3. Query Recent Views:
    cql
    SELECT video_id, duration_watched FROM views
    WHERE user_id = 123e4567-e89b-12d3-a456-426614174000 AND timestamp > '2025-09-01'
    LIMIT 10;
  4. Batch for Scale:
    cql
    BEGIN BATCH
    INSERT INTO views ...;
    INSERT INTO views ...;
    APPLY BATCH;
  5. Production Client (Python with cassandra-driver):
    python
    from cassandra.cluster import Cluster
    cluster = Cluster(['node1', 'node2'])
    session = cluster.connect('viewing_logs')
    session.execute("INSERT INTO views (user_id, timestamp, video_id, duration_watched) VALUES (%s, %s, %s, %s)",
                    (user_uuid, timestamp, video_id, duration))
    rows = session.execute("SELECT * FROM views WHERE user_id = %s LIMIT 5", (user_uuid,))
    for row in rows:
        print(row.video_id)
    session.shutdown()

This setup supports Netflix's real-time personalization, with tunable consistency for global sync.

Section 4: Amazon DynamoDB – Serverless NoSQL at Scale

DynamoDB, AWS's fully managed NoSQL, is Amazon's go-to for high-throughput apps like e-commerce carts. It's serverless, auto-scaling, and integrates with Lambda for event-driven architectures. Not used by others, but influential in 2025's serverless trend.

Architecture Step-by-Step

  1. Table Creation: Define partition/sort keys; enable streams for CDC.
  2. Capacity Modes: On-demand for unpredictable loads; provisioned for steady.
  3. Global Tables: Multi-region replication.
  4. Indexing: GSI/LSI for queries.
  5. Backup/Restore: Point-in-time recovery.

Pros and Cons

Pros:

  • Serverless Simplicity: No ops; scales to 40k writes/sec per table.
  • Low Latency: Single-digit ms; DAX for caching.
  • Durability: 99.999999999% over a year.
  • Integration: Zero-ETL to OpenSearch for search.

Cons:

  • Cost for Scans: Full table scans expensive; design for keys.
  • No Joins: Denormalize data.
  • Vendor Lock: AWS-specific; migration hard.
  • Limited Query Power: No complex analytics.

Real-Life Usage

Amazon's own retail uses DynamoDB for 1B+ daily sessions, powering recommendations during Prime Day 2025 (handling 300M orders). Streams trigger Lambda for inventory updates, preventing oversells.

Business Usage

SaaS companies use it for user sessions, reducing infra costs by 50%. An e-commerce startup could process 10k orders/min without provisioning.

Step-by-Step Example: Shopping Cart

  1. Create Table (AWS CLI):
    text
    aws dynamodb create-table --table-name Carts \
    --attribute-definitions AttributeName=userId,AttributeType=S AttributeName=itemId,AttributeType=S \
    --key-schema AttributeName=userId,KeyType=HASH AttributeName=itemId,KeyType=RANGE \
    --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5
  2. Insert Item (Python boto3):
    python
    import boto3
    from decimal import Decimal  # DynamoDB numbers must be Decimal, not float
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('Carts')
    table.put_item(Item={
        'userId': 'user123',
        'itemId': 'item456',
        'quantity': 2,
        'price': Decimal('29.99')
    })
  3. Query Cart:
    python
    from boto3.dynamodb.conditions import Key
    response = table.query(
        KeyConditionExpression=Key('userId').eq('user123')
    )
    for item in response['Items']:
        print(item['itemId'], item['quantity'])
  4. Update with Condition:
    python
    table.update_item(
        Key={'userId': 'user123', 'itemId': 'item456'},
        UpdateExpression='SET quantity = quantity + :q',
        ConditionExpression='quantity < :max',
        ExpressionAttributeValues={':q': 1, ':max': 10}
    )
  5. Global Table for Multi-Region:
    • Enable via console; code remains the same, auto-replicates.

This powers Amazon's carts, scaling seamlessly.

Section 5: Google BigQuery – Analytics on Steroids

BigQuery, Google's serverless data warehouse, processes exabytes for ML and BI. Used internally for Search analytics; not by others.

Architecture Step-by-Step

  1. Dataset/Table Creation: Load via GCS or streaming.
  2. Partitioning/Clustering: By date/column for cost savings.
  3. Queries: Standard SQL with UDFs.
  4. ML Integration: BigQuery ML for in-place training.
  5. Federation: Query external sources.

Pros and Cons

Pros:

  • Speed: Petabyte scans in seconds; columnar storage.
  • Serverless: Pay-per-query (e.g., $5/TB scanned).
  • AI-Ready: Vector search for embeddings.
  • Ecosystem: Integrates with Looker, Dataflow.

Cons:

  • Not OLTP: Batch-oriented; not for transactions.
  • Cost Creep: Unoptimized queries expensive.
  • Cold Start: Streaming inserts have 1-2s latency.
  • Google-Centric: Best in GCP.
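To see why unoptimized queries get expensive, here is a back-of-envelope using the $5/TB on-demand figure cited above (rates vary by region and edition; check current GCP pricing):

```python
# Back-of-envelope BigQuery on-demand cost: dollars scale with bytes scanned,
# which is why partition pruning matters so much.
PRICE_PER_TB = 5.00  # illustrative on-demand rate

def query_cost(bytes_scanned: int) -> float:
    tb = bytes_scanned / (1024 ** 4)
    return tb * PRICE_PER_TB

# Full scan of a 2 TB table vs. a date-pruned scan of 50 GB:
full = query_cost(2 * 1024 ** 4)
pruned = query_cost(50 * 1024 ** 3)
print(f"full scan: ${full:.2f}, partitioned scan: ${pruned:.2f}")
```

The same aggregate over a date-partitioned table can cost pennies instead of dollars, purely because less data is scanned.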

Real-Life Usage

Google uses BigQuery for Ads, analyzing 58B predictions/day. In 2025, an AI-assisted query engine accepts natural-language prompts, speeding dev cycles.

Business Usage

Marketing teams query customer data for insights, reducing ETL time from days to hours.

Step-by-Step Example: User Analytics

  1. Create Table:
    sql
    CREATE TABLE analytics_dataset.user_events (
        user_id STRING,
        event_time TIMESTAMP,
        event_type STRING
    ) PARTITION BY DATE(event_time)
    CLUSTER BY user_id;
  2. Load Data (Streaming):
    python
    from google.cloud import bigquery
    client = bigquery.Client()
    table_id = 'project.analytics_dataset.user_events'
    rows_to_insert = [
        {"user_id": "u1", "event_time": "2025-09-10 10:00:00 UTC", "event_type": "login"},
    ]
    errors = client.insert_rows_json(table_id, rows_to_insert)
  3. Query:
    sql
    -- _PARTITIONTIME applies only to ingestion-time partitioned tables;
    -- this table partitions on event_time, so filter that column for pruning.
    SELECT user_id, COUNT(*) AS events
    FROM `project.analytics_dataset.user_events`
    WHERE DATE(event_time) BETWEEN '2025-09-01' AND '2025-09-10'
    GROUP BY user_id
    ORDER BY events DESC;
  4. ML Prediction:
    sql
    -- logistic_reg needs a label column; derive one per user
    CREATE MODEL analytics_dataset.user_churn_model
    OPTIONS(model_type='logistic_reg', input_label_cols=['churned']) AS
    SELECT user_id,
           COUNTIF(event_type = 'login') AS logins,
           MAX(IF(event_type = 'churn', 1, 0)) AS churned
    FROM analytics_dataset.user_events
    GROUP BY user_id;

This enables Google's real-time insights.

Section 6: Google Cloud Spanner – Global Consistency King

Spanner provides relational SQL with global distribution, used by Google for F1 (Ads database).

Architecture Step-by-Step

  1. Instance Creation: Choose nodes (e.g., 3 for regional).
  2. Database/Schema: DDL with interleaved tables.
  3. Transactions: TrueTime for external consistency.
  4. Scaling: Auto-shard; read replicas.
  5. Backup: Continuous, point-in-time.
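TrueTime (step 3) is easiest to grasp as commit-wait: every timestamp carries an uncertainty interval, and the server waits out that interval before acknowledging a commit, so later transactions are guaranteed later timestamps. A toy sketch with a made-up epsilon (real TrueTime bounds come from GPS and atomic clocks):

```python
# Conceptual commit-wait: tt_now() returns an [earliest, latest] interval,
# and commit() sleeps until its chosen timestamp is definitely in the past.
import time

CLOCK_UNCERTAINTY_MS = 4  # illustrative epsilon, not a real Spanner value

def tt_now():
    now_ms = time.time() * 1000
    return now_ms - CLOCK_UNCERTAINTY_MS, now_ms + CLOCK_UNCERTAINTY_MS

def commit():
    _, latest = tt_now()
    commit_ts = latest
    # Commit-wait: don't acknowledge until commit_ts has passed everywhere.
    while tt_now()[0] < commit_ts:
        time.sleep(0.001)
    return commit_ts

ts1 = commit()
ts2 = commit()
assert ts2 > ts1  # external consistency: the later commit gets a later timestamp
```

The wait is on the order of the clock uncertainty (milliseconds), which is the price Spanner pays for globally ordered timestamps without coordination.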

Pros and Cons

Pros:

  • Global ACID: Spans continents without lag.
  • SQL Standard: Joins, indexes like Postgres.
  • Horizontal Scale: 10k+ QPS per node.
  • AI Features: 2025 columnar engine for OLAP.

Cons:

  • Expensive: $0.90/node-hour.
  • Complexity: TrueTime requires understanding.
  • Overkill for Local: Not for single-region.
  • Limited Extensions: No custom Postgres-like.

Real-Life Usage

Google's F1 replaced MySQL for Ads, handling $1T+ revenue with 99.999% availability.

Business Usage

Financial apps need global consistency for trades.

Step-by-Step Example: Global Inventory

  1. Create Database:
    sql
    CREATE DATABASE inventory_db;
    -- Interleaving applies to child tables under a parent; a standalone
    -- table just declares a composite primary key:
    CREATE TABLE products (
        product_id INT64 NOT NULL,
        region STRING(MAX) NOT NULL,
        stock INT64
    ) PRIMARY KEY (product_id, region);
  2. Insert:
    sql
    INSERT INTO products (product_id, region, stock) VALUES (1, 'US', 100);
  3. Distributed Transaction:
    sql
    BEGIN;
    UPDATE products SET stock = stock - 1 WHERE product_id = 1 AND region = 'US';
    UPDATE products SET stock = stock + 1 WHERE product_id = 1 AND region = 'EU';
    COMMIT;
  4. Query:
    sql
    SELECT SUM(stock) FROM products WHERE product_id = 1;
  5. Client (Python):
    python
    from google.cloud import spanner

    client = spanner.Client()
    instance = client.instance('my-instance')
    database = instance.database('inventory_db')

    # Writes must run in a read-write transaction, not a read-only snapshot.
    def decrement_stock(transaction):
        transaction.execute_update(
            "UPDATE products SET stock = stock - 1 "
            "WHERE product_id = 1 AND region = 'US'"
        )

    database.run_in_transaction(decrement_stock)

Ideal for Amazon-like global stock.

Section 7: Google Bigtable – Wide-Column for Big Data

Bigtable, Google's NoSQL, powers Search and Maps. 2025 SQL interface makes it hybrid.

Architecture Step-by-Step

  1. Cluster Provision: SSD/HDD nodes.
  2. Table Design: Row keys, families.
  3. Mutations: Batch writes.
  4. Reads: Scans with filters.
  5. TTL: Auto-expire old data.
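Row-key design (step 2) makes or breaks Bigtable: keys sort lexicographically, so a timestamp-first key funnels all new writes to one tablet. A sketch of the common salting fix (the key layout here is illustrative, not Google's):

```python
# Bigtable row-key design: a monotonically increasing prefix hotspots the
# tail tablet; a salt derived from a stable field spreads writes evenly.
import hashlib

def hot_key(user: str, ts: str) -> str:
    return f"{ts}#{user}"  # every new write lands on the last key range

def salted_key(user: str, ts: str, buckets: int = 8) -> str:
    salt = int(hashlib.md5(user.encode()).hexdigest(), 16) % buckets
    return f"{salt:02d}#{user}#{ts}"  # writes spread across `buckets` ranges

print(hot_key("user1", "2025-09-10T10:00"))
print(salted_key("user1", "2025-09-10T10:00"))
```

The trade-off: a salted scan must fan out to all buckets, so salt only when the write pattern actually hotspots.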

Pros and Cons

Pros:

  • Massive Scale: 1M+ cells/sec.
  • Low Latency: HBase-compatible.
  • Durable: Replicated across zones.
  • SQL Layer: New in 2025 for easy queries.

Cons:

  • NoSQL Learning: Key design critical.
  • Costly for Small: Min 1 node.
  • No Transactions: Single-row only.
  • GCP Only.

Real-Life Usage

Google's Search indexes trillions of pages in Bigtable, updating in real-time.

Business Usage

IoT for sensor data.

Step-by-Step Example: Log Storage

  1. Create Table: (Via API or console) Row key: timestamp#user.
  2. Write (Python):
    python
    from google.cloud import bigtable
    client = bigtable.Client()
    instance = client.instance('logs')
    table = instance.table('events')
    row_key = '2025-09-10#user1'
    column_family_id = 'data'
    row = table.direct_row(row_key)
    row.set_cell(column_family_id, 'event', 'login')
    row.commit()
  3. Read:
    python
    for row in table.read_rows():
        # cells are keyed by family (str), then qualifier (bytes)
        cell = row.cells['data'][b'event'][0]
        print(row.row_key.decode(), cell.value.decode())

Scales to Google's logs.

Section 8: CockroachDB – Resilient Distributed SQL for Netflix

CockroachDB, Postgres-compatible distributed SQL, is Netflix's choice for multi-region apps, with 380+ clusters in 2025.

Architecture Step-by-Step

  1. Cluster Deploy: Kubernetes operator.
  2. Geo-Partitioning: Survival goals for regions.
  3. SQL Layer: Standard Postgres wire.
  4. Raft Consensus: For replication.
  5. Backup: Incremental to S3.
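The Raft step above reduces to quorum arithmetic: a write commits once a majority of replicas acknowledge it, which is why a replication factor of 3 survives one node down and 5 survives two:

```python
# Raft-style majority quorums: commit needs floor(n/2)+1 acks, and the
# cluster tolerates the remaining minority being unavailable.
def quorum(replicas: int) -> int:
    return replicas // 2 + 1

def max_failures(replicas: int) -> int:
    return replicas - quorum(replicas)

for rf in (3, 5):
    print(f"RF={rf}: commit needs {quorum(rf)} acks, tolerates {max_failures(rf)} down")
```

This is also where the latency trade-off in the cons list comes from: every write waits on a cross-replica round trip to assemble its quorum.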

Pros and Cons

Pros:

  • Resilience: Survives node/zone failures.
  • Global: Low-latency reads worldwide.
  • SQL Familiar: Easy migration from Postgres.
  • Cloud-Native: Serverless option.

Cons:

  • Performance Tradeoff: Consensus adds latency.
  • Cost: Higher than Cassandra for writes.
  • Maturity: Younger than Spanner.
  • Schema Changes: Online but careful.

Real-Life Usage

Netflix's 160+ production clusters manage devices/ML data, supporting 300M users across regions without downtime.

Business Usage

Global e-com for consistent pricing.

Step-by-Step Example: Multi-Region User Sessions

  1. Create Table:
    sql
    -- Interleaved tables were removed from CockroachDB; a plain table with
    -- a UUID primary key distributes well on its own.
    CREATE TABLE sessions (
        session_id UUID PRIMARY KEY,
        user_id INT,
        region STRING,
        start_time TIMESTAMP
    );
  2. Insert:
    sql
    INSERT INTO sessions VALUES (gen_random_uuid(), 1, 'US', NOW());
  3. Query Global:
    sql
    SELECT * FROM sessions WHERE user_id = 1;
  4. Transaction:
    sql
    BEGIN;
    UPDATE sessions SET start_time = NOW() WHERE session_id = 'uuid';
    COMMIT;
  5. Client (Go):
    go
    import (
        "database/sql"

        "github.com/cockroachdb/cockroach-go/v2/crdb"
    )
    // ExecuteTx automatically retries the closure on serialization conflicts.
    err := crdb.ExecuteTx(ctx, db, nil, func(tx *sql.Tx) error {
        _, err := tx.Exec("UPDATE sessions SET region = $1 WHERE session_id = $2", "EU", sessionID)
        return err
    })

Netflix uses this for seamless global sessions.

Section 9: HBase – Big Data Column Store for Meta

HBase, Hadoop's NoSQL, stores Meta's vast unstructured data.

Architecture Step-by-Step

  1. HDFS Integration: Store on distributed FS.
  2. Regions/Servers: Assign regions.
  3. Bloom Filters: For fast lookups.
  4. Compaction: Major/minor for efficiency.
  5. Phoenix: SQL layer.
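Step 3's Bloom filters deserve a closer look: HBase keeps one per store file so a read can skip files that definitely do not contain the row key. A toy implementation (real HBase sizes its filters from a configurable false-positive rate):

```python
# Toy Bloom filter: k hash positions per key are set in a bit array; a
# lookup that finds any position unset proves the key is absent.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, hashes=3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = 0  # bit array packed into one int

    def _positions(self, key: bytes):
        for i in range(self.hashes):
            h = hashlib.md5(key + bytes([i])).hexdigest()
            yield int(h, 16) % self.size

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits & (1 << p) for p in self._positions(key))

bf = BloomFilter()
bf.add(b"row1")
print(bf.might_contain(b"row1"))    # True: this file must be checked
print(bf.might_contain(b"row999"))  # very likely False: skip the file
```

False positives cost an extra file read; false negatives never happen, which is what makes the filter safe as a read short-circuit.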

Pros and Cons

Pros:

  • Petabyte Scale: Handles sparse data.
  • Random Reads: Low-latency for keys.
  • Batch Writes: High throughput.
  • Ecosystem: With Hive for analytics.

Cons:

  • Complexity: Requires Hadoop cluster.
  • Append-Only Storage: Updates write new cell versions; old versions and deletes are purged only at compaction.
  • Latency Variability: Compactions cause pauses.
  • Ecosystem Shift: 2025 sees moves from Hadoop-based stacks toward cloud-native stores.

Real-Life Usage

Meta's HBase stores logs for 3B users, querying for ad insights.

Business Usage

Telcos for call records.

Step-by-Step Example: Log Storage

  1. Create Table (Shell):
    text
    create 'logs', 'cf1'
  2. Put Data:
    java
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    Configuration conf = HBaseConfiguration.create();
    Connection connection = ConnectionFactory.createConnection(conf);
    Table table = connection.getTable(TableName.valueOf("logs"));
    Put put = new Put(Bytes.toBytes("row1"));
    put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("qual"), Bytes.toBytes("data"));
    table.put(put);
  3. Get:
    java
    Get get = new Get(Bytes.toBytes("row1"));
    Result result = table.get(get);
    byte[] value = result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("qual"));

Meta uses Java clients for scale.

Section 10: TAO – Meta's Social Graph Specialist

TAO, Meta's custom graph store, powers the social graph for 3B+ users.

Architecture Step-by-Step

  1. Graph Modeling: Associations (edges) with metadata.
  2. Caching: Memcache for hot data.
  3. Storage: MySQL backend.
  4. Sharding: By user ID.
  5. Range Scans: For traversals.
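Since TAO is proprietary, the shape of its data model is best shown with a toy: typed associations (edges) between object IDs, listed newest-first per (source, type) pair. Everything below is an illustrative mock, not Meta's API:

```python
# Toy in-memory model of TAO's data shape: typed edges between object IDs,
# queried by (id1, type) and returned newest-first -- the access pattern
# behind "list my friends/likes".
from collections import defaultdict
import time

class ToyGraphStore:
    def __init__(self):
        self._assocs = defaultdict(list)  # (id1, atype) -> [(id2, created, meta)]

    def assoc_add(self, id1, atype, id2, meta=None):
        self._assocs[(id1, atype)].append((id2, time.time(), meta or {}))

    def assoc_range(self, id1, atype, limit=100):
        """Return up to `limit` edges of this type from id1, newest first."""
        edges = sorted(self._assocs[(id1, atype)], key=lambda e: e[1], reverse=True)
        return edges[:limit]

g = ToyGraphStore()
g.assoc_add("user:1", "friend", "user:2", {"type": "mutual"})
g.assoc_add("user:1", "friend", "user:3")
print([e[0] for e in g.assoc_range("user:1", "friend")])
```

In the real system these lists live in a write-through cache sharded by id1, backed by MySQL, which is exactly steps 2-4 above.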

Pros and Cons

Pros:

  • Graph Optimized: Fast traversals for feeds.
  • Hybrid: Cache + persistent.
  • Scale: Billions of edges.
  • Custom: Tailored to social.

Cons:

  • Proprietary: Not open.
  • No Standard SQL: Custom API.
  • Complexity: Internal only.
  • Dependency: On MySQL.

Real-Life Usage

TAO handles 100B+ reads/day for friends/posts.

Business Usage

Social apps for recommendations.

Step-by-Step Example: (Conceptual, as proprietary; pseudocode)

  1. Define Association:
    php
    // TAO API-like
    $tao->createAssociation('user:1', 'friend', 'user:2', ['type' => 'mutual']);
  2. Traverse:
    php
    $friends = $tao->getAssociations('user:1', 'friend', ['limit' => 100]);
    foreach ($friends as $assoc) {
        echo $assoc->id;
    }

Meta's PHP/Hack services use this.

Section 11: Redis – The Caching and In-Memory Speedster

Redis, an in-memory key-value store, is used across all four: Google's Memorystore for sessions, Amazon's ElastiCache for leaderboards, Netflix's EVCache for hot content, Meta for feeds.

Architecture Step-by-Step

  1. Setup: Cluster mode for >1M keys.
  2. Data Types: Strings, hashes, lists, sets.
  3. Persistence: RDB/AOF.
  4. Replication: Primary-replica, with Sentinel or Cluster for failover.
  5. Modules: RediSearch for full-text.
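Cluster mode (step 1) places keys deterministically: slot = CRC16(key) mod 16384, and each node owns a range of the 16384 slots. A minimal CRC16-XMODEM plus the slot function:

```python
# How Redis Cluster assigns keys to nodes: hash the key with CRC16-XMODEM
# (poly 0x1021, init 0), then take the slot modulo 16384.
def crc16(data: bytes) -> int:
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_slot(key: str) -> int:
    return crc16(key.encode()) % 16384

print(key_slot("session:user1"))  # stable slot in 0..16383 for this key
```

This is why multi-key operations in cluster mode require keys in the same slot (hash tags like `{user1}` force that), and why resharding moves slots, not individual keys.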

Pros and Cons

Pros:

  • Speed: Sub-ms latency; in-memory.
  • Versatile: Pub/sub, Lua scripts.
  • Atomic Ops: CAS for concurrency.
  • Managed: Easy scaling.

Cons:

  • Volatility: Memory limits; data loss risk.
  • No Complex Queries: Key-based only.
  • Cost: RAM expensive at scale.
  • Single-Threaded Core: Command execution runs on one thread, so a hot instance can bottleneck a CPU core.

Real-Life Usage

Netflix's EVCache (Redis-based) caches 1T+ objects, reducing DB hits by 90% during peaks.

Amazon uses ElastiCache for DynamoDB acceleration, cutting costs 70%.

Business Usage

E-com for sessions; saves on DB queries.

Step-by-Step Example: Session Store

  1. Connect (Python redis-py):
    python
    import redis
    r = redis.Redis(host='localhost', port=6379, db=0)
  2. Set/Get:
    python
    r.set('session:user1', '{"cart": ["item1"]}')
    value = r.get('session:user1')
    print(value.decode('utf-8'))
  3. Expire:
    python
    r.setex('temp:key', 3600, 'value')  # 1 hour TTL
  4. Hash for User Profile:
    python
    r.hset('user:1', mapping={'name': 'John', 'age': 30})
    r.hgetall('user:1')
  5. Pub/Sub (Real-time like Meta feeds):
    python
    pubsub = r.pubsub()
    pubsub.subscribe('feeds')
    for message in pubsub.listen():
        print(message['data'])
    # Publisher: r.publish('feeds', 'new post')

Universal for caching.

Company Deep Dives

Google: A Multi-Database Ecosystem

Google's stack emphasizes AI and global scale. BigQuery analyzes Search queries (trillions/day), Spanner ensures Ads consistency across 200+ countries, Bigtable indexes the web. Business impact: $300B+ revenue from data-driven ads. Pros: Seamless GCP integration. Cons: Costly for non-Google users.

Real-life: During 2025 elections, BigQuery processed real-time trends, informing policy.

Amazon: Purpose-Built for E-Commerce

AWS focuses on managed services: Aurora for transactions, DynamoDB for catalogs, Redis for personalization. Prime Day 2025: 375M items sold, powered by auto-scaling DBs. Business: Enables 1M+ sellers. Pros: Pay-as-you-go. Cons: Lock-in.

Netflix: Hybrid for Streaming Resilience

Cassandra for views, MySQL/CockroachDB for accounts, Redis for caches. UDA (2025) unifies models. Serves 300M hours/day. Business: Reduces churn via recs. Pros: High availability. Cons: Multi-DB complexity.

Meta: Custom Layers on Open Source

MySQL/MyRocks core, Cassandra/HBase for big data, TAO for graphs. Handles 8B daily videos. Business: $150B ad revenue. Pros: Optimized for social. Cons: Opaque internals.

Conclusion: Choosing the Right Database in 2025

These databases showcase evolution: relational for structure, NoSQL for scale, in-memory for speed. For businesses, start with needs—transactions (MySQL/Postgres), scale (Cassandra/DynamoDB), analytics (BigQuery). Hybrid approaches win, as seen in these giants. Experiment with free tiers, monitor costs, and prioritize resilience. The future? AI-native DBs blurring OLTP/OLAP lines.
