What is DBMS? A Complete Guide to Database Management System

Akash_Halder • October 27, 2025

Learn what DBMS (Database Management System) is, how it works, its types, advantages, and real-world applications.

#dbms#sql#no-sql#interview-topics

In today’s data-driven world, managing information efficiently is the backbone of every software system — from small startups to enterprise-scale applications like banking, social media, and e-commerce. This is where DBMS (Database Management System) comes into play.

This article condenses essential DBMS concepts you’ll encounter in engineering interviews and real-world system design: SQL vs NoSQL, partitioning & sharding, the CAP theorem, Master–Slave architecture, and scaling patterns used by production systems (e.g., a cab-booking app). Each section is written to be clear for students while remaining useful for practitioners — with short code examples and a concise list of the most-asked interview topics at the end.

NOTE: This article does not cover SQL itself, since SQL is a skill in its own right. If you want to learn SQL, especially MySQL, check out this article for a better understanding of the topic: Mastering MySQL.



Module 1: Database System Architecture


Database System Architecture

Data Abstraction

  • Data abstraction is the process of hiding the complex internal details of the database from the user and providing simplified views.
  • It makes database usage more efficient by showing only what is necessary to the user.
  • There are three levels of abstraction in DBMS.
  • Physical or Internal Level describes how the data is physically stored using files, indices, hashing, or B+ trees.
  • Logical or Conceptual Level describes the logical structure of the entire database including tables, attributes, and relationships.
  • View or External Level describes the user-specific view of data, hiding irrelevant details and ensuring security.

Data Independence

  • Data independence ensures that schema changes at one level do not affect the higher levels of abstraction.
  • Physical Data Independence means changes in physical storage such as changing indexing methods, partitioning, or compression should not affect the logical schema.
  • Logical Data Independence means changes in the logical schema such as adding a new column or modifying relationships should not affect the user’s view or application programs.
  • This property makes databases flexible and easier to maintain over time.

Database Languages

  • Data Definition Language (DDL) is used to define and modify database schema. Examples are CREATE, ALTER, DROP, TRUNCATE, RENAME, COMMENT.
CREATE TABLE Student (StudentID INT PRIMARY KEY, Name VARCHAR(50), Age INT);
ALTER TABLE Student ADD Email VARCHAR(100);
DROP TABLE Student;
  • Data Manipulation Language (DML) is used for storing, modifying, deleting, and retrieving data. Examples are SELECT, INSERT, UPDATE, DELETE, MERGE.
INSERT INTO Student VALUES (1, 'Ravi', 20);
UPDATE Student SET Age = 21 WHERE StudentID = 1;
DELETE FROM Student WHERE Age < 18;
SELECT Name, Age FROM Student WHERE Age > 19;
  • Data Control Language (DCL) is used for permissions. Examples are GRANT and REVOKE.
GRANT SELECT ON Student TO user1;
REVOKE UPDATE ON Student FROM user1;
  • Transaction Control Language (TCL) manages transactions. Examples are COMMIT, ROLLBACK, SAVEPOINT.
START TRANSACTION;
UPDATE Student SET Age=22 WHERE StudentID=1;
ROLLBACK;   -- Undo the change
COMMIT;     -- Save the change (a transaction ends with either ROLLBACK or COMMIT, not both)

Data Models

Entity-Relationship (E-R) Model

  • The ER model is a high-level data model that represents real-world entities and their relationships.
  • An entity is an object in the real world such as Student or Course.
  • An attribute is a property of an entity such as Name, Age, or CourseID.
  • A relationship represents associations among entities such as Student enrolls in Course.
  • ER diagrams provide a blueprint of the database design.

Example Table Representation

Entity Attributes
Student StudentID (PK), Name, Age
Course CourseID (PK), Title, Credits

Relationship: Student – Enrolls – Course (Many-to-Many).

Types of ER relationships

Network Model

  • The network model organizes data using records and links (pointers).
  • It represents data as nodes and relationships as edges.
  • Each record can have multiple parent and child relationships.
  • Example: A Department has multiple Employees, and Employees may belong to multiple Projects.

Relational Model

  • The relational model organizes data into tables also called relations.
  • Each table has rows called tuples and columns called attributes.
  • Properties include unique table names, atomic attribute values, no duplicate rows, and no importance of row or column order.
  • SQL is based on the relational model.

Example Relation: Student

StudentID Name Age Course
1 Ravi 20 CSE
2 Meena 21 AIML

Object-Oriented Data Model

  • The object-oriented data model combines database concepts with object-oriented programming.
  • Data is represented as objects with attributes and methods.
  • It supports inheritance, encapsulation, and polymorphism.
  • Useful for complex applications such as multimedia systems, CAD/CAM, and scientific research.

Example in Java Style

class Student {
    int id;
    String name;
    void enroll(Course c) { ... }
}

Integrity Constraints

  • Domain Constraint ensures attribute values must come from a defined domain such as Age between 18 and 60.
  • Entity Integrity ensures that the primary key of a table cannot be NULL.
  • Referential Integrity ensures that foreign key values must match an existing primary key in another table or be NULL.
  • Key Constraints ensure that each row in a table is uniquely identified.

Example SQL

CREATE TABLE Course (
    CourseID INT PRIMARY KEY,
    Title VARCHAR(50)
);
 
CREATE TABLE Student (
    StudentID INT PRIMARY KEY,
    Name VARCHAR(50),
    CourseID INT,
    FOREIGN KEY (CourseID) REFERENCES Course(CourseID)
);
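A domain constraint can be enforced with a CHECK clause. Here is a minimal sketch; the Applicant table and its columns are assumed purely for illustration:

-- Hypothetical table: Age restricted to the domain 18–60
CREATE TABLE Applicant (
    ApplicantID INT PRIMARY KEY,              -- entity integrity: primary key cannot be NULL
    Age INT CHECK (Age BETWEEN 18 AND 60)     -- domain constraint on Age
);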

Data Manipulation Operations

  • Insertion is used to add new records into a table.
  • Deletion is used to remove records from a table.
  • Update is used to modify existing records.
  • Retrieval is used to query data using SELECT statements.

Example SQL

INSERT INTO Course VALUES (101, 'DBMS');
SELECT Name FROM Student WHERE CourseID = 101;

Comparisons of Data Models

| Feature | ER Model | Network Model | Relational Model | Object-Oriented Model |
| --- | --- | --- | --- | --- |
| Representation | Entities & relationships | Records and links (graph) | Tables with rows and columns | Objects with attributes & methods |
| Schema flexibility | Moderate | Low (complex pointers) | High (SQL-based) | High (OOP concepts) |
| User friendliness | High (visual) | Low | High | Moderate |
| Common usage | Database design stage | Legacy applications | Most modern DBMS (MySQL, Oracle) | Complex apps (CAD, multimedia) |









Module 2: Relational Query Languages & Relational Database Design


Relational Query Languages

Relational Algebra

  • Relational algebra is a procedural query language used to retrieve data from relations (tables).
  • It defines a sequence of operations to obtain a required result.
  • Each operation takes one or two relations as input and produces a new relation as output.

Core Operators

  • Selection (sigma σ): Selects tuples that satisfy a condition. Example: σ Age > 20 (Student).
  • Projection (pi ∏): Selects specific attributes (columns). Example: ∏ Name, Age (Student).
  • Union (U): Combines tuples from two relations, removes duplicates. Example: Student U Teacher.
  • Intersection (∩): Returns common tuples between two relations. Example: Student ∩ Alumni.
  • Difference (−): Returns tuples in one relation but not in another. Example: Student − Alumni.
  • Cartesian Product (×): Combines each tuple of one relation with each tuple of another. Example: Student × Course.
  • Join (⨝): Combines related tuples from two relations based on a condition. Example: Student ⨝ Student.CourseID = Course.CourseID Course.
  • Rename (rho ρ): Renames the relation or its attributes. Example: ρ(S1, Student).

Example Input Table: STUDENT

StudentID Name Age CourseID
1 Ravi 20 101
2 Meena 21 102
3 Arjun 19 101

Query: ∏ Name (σ Age > 20 (STUDENT)) Result: {Meena}
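For reference, the same query written in SQL (a direct translation of the projection-over-selection above):

SELECT Name FROM STUDENT WHERE Age > 20;   -- returns Meena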


Tuple Relational Calculus (TRC)

  • TRC is a non-procedural query language where queries specify what to retrieve instead of how.
  • It uses tuple variables to represent rows of a relation.

General Form {t | P(t)} where t = tuple variable and P(t) = condition applied on t.

Example {t.Name | Student(t) AND t.Age > 20} This retrieves names of students older than 20.


Domain Relational Calculus (DRC)

  • DRC is another non-procedural query language.
  • It uses domain variables that represent attribute values instead of whole tuples.

General Form {<a1, a2, …, an> | P(a1, a2, …, an)} where ai = attribute values and P = condition.

Example {<Name, Age> | ∃ CourseID (Student(Name, Age, CourseID) AND Age > 20)} This retrieves name and age of students older than 20.


SQL3 (Extended SQL)

  • SQL3 is an extension of SQL with object-oriented and advanced features.
  • It supports recursive queries, triggers, user-defined types, and inheritance.

Example Recursive Query

WITH RECURSIVE EmployeeHierarchy AS (
    SELECT EmpID, ManagerID FROM Employee WHERE ManagerID IS NULL
    UNION
    SELECT e.EmpID, e.ManagerID
    FROM Employee e, EmployeeHierarchy h
    WHERE e.ManagerID = h.EmpID
)
SELECT * FROM EmployeeHierarchy;

This query retrieves employees in a recursive management hierarchy.


Open Source and Commercial DBMS

| DBMS | Type | Strengths | Limitations |
| --- | --- | --- | --- |
| MySQL | Open-source | Fast, reliable, widely used in web apps, community-driven | Fewer enterprise-level features |
| Oracle | Commercial | High availability, clustering, scalability, advanced transaction support | Expensive licensing, complex administration |
| IBM DB2 | Commercial | Optimized for large-scale analytics and transactions, strong AI integration | Requires specialized skills, costly |
| SQL Server | Commercial | Excellent integration with Windows, Azure, Power BI; user-friendly | Licensing cost, Windows-centric |

Relational Database Design

Domain and Data Dependency

  • Domain dependency means attribute values must come from a specific domain. Example: Age must be an integer between 18 and 60.
  • Data dependency means one attribute depends on another. Example: EmpID determines EmpName.

Armstrong’s Axioms

  • Armstrong’s axioms are sound and complete inference rules used to derive all functional dependencies implied by a given set. The first three rules below are the axioms themselves; the remaining rules can be derived from them.
  • Reflexive Rule: If Y is a subset of X, then X determines Y.
  • Augmentation Rule: If X determines Y, then XZ determines YZ.
  • Transitive Rule: If X determines Y and Y determines Z, then X determines Z.
  • Union Rule: If X determines Y and X determines Z, then X determines YZ.
  • Decomposition Rule: If X determines YZ, then X determines Y and X determines Z.
  • Pseudotransitivity Rule: If X determines Y and WY determines Z, then WX determines Z.

Normal Forms

| Normal Form | Condition | Example |
| --- | --- | --- |
| 1NF | Values must be atomic (no repeating groups) | Name = "Ravi, Meena" violates 1NF |
| 2NF | In 1NF and no partial dependency (every non-key attribute depends on the full key) | StudentID, CourseID → Marks, but CourseName depends only on CourseID |
| 3NF | In 2NF and no transitive dependency | StudentID → AdvisorID, AdvisorID → AdvisorName |
| BCNF | Every determinant must be a candidate key | DeptID → HOD holds, but DeptID is not a candidate key |
| 4NF | In BCNF and no multi-valued dependencies | A teacher teaches multiple subjects and multiple classes |
| 5NF | In 4NF and no join dependency | Decomposition without loss of information |
| 6NF | Further decomposition into irreducible tables (temporal databases) | Used in advanced systems |
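To make the 3NF row concrete, here is a minimal decomposition sketch (table and column names are assumed) that removes the transitive dependency StudentID → AdvisorID → AdvisorName:

-- Before: Student(StudentID, AdvisorID, AdvisorName); AdvisorName depends on AdvisorID, not on the key
-- After: two relations, each free of transitive dependencies
CREATE TABLE Advisor (
    AdvisorID INT PRIMARY KEY,
    AdvisorName VARCHAR(50)
);
CREATE TABLE Student (
    StudentID INT PRIMARY KEY,
    AdvisorID INT,
    FOREIGN KEY (AdvisorID) REFERENCES Advisor(AdvisorID)
);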

Dependency Preservation

  • Ensures that functional dependencies are preserved after decomposition.
  • If a dependency is lost, additional joins may be required to enforce it.

Lossless Design

  • Decomposition is lossless if no information is lost after splitting a relation.
  • Condition: If R is decomposed into R1 and R2, then R1 ∩ R2 must be a key of either R1 or R2.

Query Processing and Optimization

Evaluation of Relational Algebra Expressions

  • Expressions are evaluated using relational algebra trees.
  • Example: σ Age > 20 (∏ Name, Age (Student)) → first project Name and Age, then select tuples where Age > 20.

Query Equivalence

  • Two queries are equivalent if they produce the same result for all databases.

  • Example:

    • σ Age > 20 (∏ Name, Age (Student))
    • ∏ Name, Age (σ Age > 20 (Student)) Both are equivalent since selection and projection commute.

Join Strategies

  • Nested Loop Join: Compare each tuple in R with each tuple in S. Simple but costly.
  • Sort-Merge Join: Sort relations on join key, then merge. Efficient for sorted inputs.
  • Hash Join: Partition both relations using a hash function, then match partitions. Fast for equality joins.

Query Optimization

  • Query optimization selects the most efficient execution plan.
  • Goals include reducing disk I/O, CPU usage, and memory consumption.
  • Techniques include pushing selections before joins, using indexes, reordering joins, and using cost-based optimization.
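Most engines let you inspect the plan the optimizer picked. A minimal MySQL-style sketch (table and index names are assumed):

-- An index on Age typically turns a full table scan into an index range scan
CREATE INDEX idx_student_age ON Student(Age);

-- EXPLAIN shows the chosen execution plan without running the query
EXPLAIN SELECT Name, Age FROM Student WHERE Age > 20;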




Module 3: Storage Strategies


1. Indexing & Hashing


In a database, information is stored as records. Each record contains a key field that uniquely identifies it (like a roll number identifying a student).

  • To find records efficiently — especially when dealing with large amounts of data — databases use a technique called indexing.

  • Indexing acts like the index section of a book, allowing us to quickly locate the required information without scanning every page.


Indices

An index is a data structure (like an index in a book) that allows faster retrieval of records from a database table.

Its main purpose is to speed up read operations such as SELECT queries by minimizing disk access.

Each index entry contains:

  • Search Key: Copy of the attribute being indexed (e.g., Employee_ID).
  • Data Reference: Pointer to the location of the actual record in the disk block.

Types of Indexing:

  • Primary Index: Primary index is defined on an ordered data file. The data file is ordered on a key field. The key field is generally the primary key of the relation.

  • Secondary (Non Clustered) Index: Secondary index may be generated from a field which is a candidate key and has a unique value in every record, or a non-key with duplicate values.

  • Clustering Index: Clustering index is defined on an ordered data file. The data file is ordered on a non-key field.
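In practice, a secondary index is created explicitly on the column(s) you query by. A minimal MySQL-style sketch (table and column names are assumed):

-- Secondary (non-clustered) index on a non-key column
CREATE INDEX idx_employee_dept ON Employee(DeptID);

-- This query can now be answered via the index instead of scanning the whole table
SELECT * FROM Employee WHERE DeptID = 10;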

Types of Indexing

Ordered Indexing:

  • Dense Index: In a dense index, there is an index record for every search key value in the database. This makes searching faster but requires more space to store the index records themselves. Each index record contains a search key value and a pointer to the actual record on the disk.

  • Sparse Index: In a sparse index, index records are not created for every search key; each entry contains a search key and a pointer to a location in the data file. To find a record, we follow the index entry with the largest key that does not exceed the target, jump to that location on disk, and then scan sequentially until the desired record is found.

Ordered Indexing

Multi-Level Indexing:

When databases grow large, even the index itself can become too big to store in memory. To solve this, multilevel indexing is used.

Concept:

  • The index is divided into multiple smaller indexes.
  • The lowest level points to actual data.
  • The upper levels act as indexes of indexes.
  • The topmost level (root index) is small enough to fit in main memory.

Benefits:

  • Reduces disk access time.
  • Makes searching faster and scalable for large databases.

B-Trees

In database systems, dynamic indexing is essential for efficiently locating and managing records that are frequently inserted, deleted, or updated. The B-Tree (Balanced Tree) is one of the most efficient data structures designed for this purpose. It maintains balance dynamically while allowing quick data retrieval, insertion, and deletion operations.


What is a B-Tree?

A B-Tree is a self-balancing multi-way search tree where data is stored in nodes that can have multiple keys and pointers. Unlike binary search trees, each node in a B-Tree may contain multiple keys, which helps reduce the height of the tree and thereby minimizes disk I/O operations during searches.

Key Characteristics:

  • Records (or pointers to records) are stored in the nodes.
  • All leaf nodes exist at the same level, ensuring balance.
  • Nodes split or merge automatically during insertions and deletions to maintain uniform height.
B-Tree

Working of a B-Tree

1. Nodes and Keys

Each node can have multiple keys and pointers. The number of keys and pointers is determined by the order (p) of the B-Tree.

  • A node can have up to p pointers and p−1 keys.
  • Keys are arranged in ascending order.

2. Balanced Structure

The B-Tree is always balanced, meaning the path length from root to any leaf node is the same.

3. Search Operation

  • The search starts from the root node.
  • Based on key comparisons, appropriate pointers are followed down to lower levels.
  • The process continues until the desired record is found or confirmed absent.

Key Properties of B-Trees

Property Description
Order (p) Maximum number of child pointers per node
Keys per Node Up to p−1 keys arranged in ascending order
Balance All leaf nodes appear at the same level
Minimum Occupancy Each node is at least half full (except the root)
Linked Leaves Leaf nodes may be linked for sequential access

Example Structure of a B-Tree

For a database with:

  • Order (p): 23
  • Average fan-out: 16 (assuming each node is roughly 69% full on average)

The growth pattern is as follows:

| Level | Nodes | Keys | Pointers |
| --- | --- | --- | --- |
| Root (Level 0) | 1 | 15 | 16 |
| Level 1 | 16 | 240 | 256 |
| Level 2 | 256 | 3,840 | 4,096 |
| Leaf Level | 4,096 | 61,440 | — |

This structure holds roughly 65,535 keys in a tree only four levels deep, resulting in very fast search operations.


B+ Trees: The Improved Version

A B+ Tree is an enhanced form of a B-Tree, optimized specifically for indexing. The primary difference is that data records are stored only in the leaf nodes, while internal nodes store only keys and pointers.

Structure:

  • Internal Nodes: Contain only keys and pointers to child nodes.
  • Leaf Nodes: Contain actual data records or pointers to them.
  • Linked Leaves: Leaf nodes are linked to allow range queries and sequential access.
B+ Tree

Key Properties of B+ Trees

Property Description
Internal Nodes Contain only keys and child pointers
Leaf Nodes Contain actual records or record pointers
Order (p) Each internal node can have up to p pointers and p−1 keys
Linked Leaves Enables fast range queries
Balanced Automatically maintains balance during insertions/deletions

Example Structure of a B+ Tree

Assume:

  • Key size: 9 bytes
  • Pointer size: 7 bytes (record), 6 bytes (block)
  • Block size: 512 bytes

Derived capacity:

  • Internal Node: up to 34 pointers and 33 keys
  • Leaf Node: up to 31 record pointers

Assuming each node is roughly 69% full on average, the levels grow as follows:

| Level | Nodes | Keys | Pointers |
| --- | --- | --- | --- |
| Root | 1 | 22 | 23 |
| Level 1 | 23 | 506 | 529 |
| Level 2 | 529 | 11,638 | 12,167 |
| Leaf Level | 12,167 | 255,507 | — |

Thus, the structure can manage over 255,000 records with only three levels of internal nodes above the leaves, offering extremely efficient access.


Advantages of Dynamic Multilevel Indexing

Advantage Explanation
Automatic Balancing Nodes split or merge automatically, maintaining a balanced structure.
Efficient Searches Reduced tree height leads to faster searches.
Quick Updates Insertion and deletion are handled efficiently.
Scalability Supports large datasets without performance degradation.

Real-world Applications

Domain Use Case
DBMS Indexing tables for fast query performance
File Systems Efficient file access and directory management
Search Engines Optimized keyword and document indexing
Operating Systems Directory and metadata indexing

Difference Between B-Trees and B+ Trees

| Feature | B-Tree | B+ Tree |
| --- | --- | --- |
| Data storage | In all nodes | Only in leaf nodes |
| Data retrieval | Slower for range queries | Faster due to linked leaves |
| Tree depth | Deeper | Shallower |
| Range queries | Less efficient | Highly efficient |
| Use cases | General-purpose indexing | Database and file system indexing |

Conclusion

B-Trees and B+ Trees play a critical role in dynamic multilevel indexing by maintaining balanced, efficient search structures.

  • B-Trees store data in all nodes and provide fast direct access.
  • B+ Trees, optimized for indexing, store records only at leaf nodes, ensuring faster sequential and range-based searches.

Hashing

In large database systems, searching through multilevel indexes can be time-consuming and storage-intensive. Hashing offers an efficient alternative by providing direct access to data records using a hash function. Instead of traversing index structures, hashing computes the storage address of a record directly based on its key value, leading to constant-time data retrieval.


Concept of Hashing

Hashing uses a hash function (h) that takes a search key (K) as input and returns an address or bucket number where the record is stored. This allows the database to quickly locate records without performing multi-level index lookups.

Components of Hashing

  • Hash Function (h): Maps a search key to a specific bucket address.
  • Bucket: The unit of storage that typically corresponds to a disk block. Each bucket can store one or more records.

Formula:

Bucket address = h(K)

Hash Organization

Component Description
Bucket Logical storage unit that holds one or more data records. Usually corresponds to a physical disk block.
Hash Function A deterministic mapping function h(K) that converts search keys into bucket addresses.

Example: If h(K) = K % 4, possible bucket addresses are 0, 1, 2, 3. Records are placed into one of these four buckets.
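Many relational engines expose hashing as a physical storage option. A minimal MySQL-style sketch (table and column names are assumed) that spreads rows across four buckets, analogous to h(K) = K % 4 above:

-- Rows are assigned to one of 4 partitions by hashing the key
CREATE TABLE SessionLog (
    session_id INT NOT NULL,
    payload VARCHAR(255)
)
PARTITION BY HASH(session_id)
PARTITIONS 4;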


Static Hashing

In static hashing, the number of buckets remains fixed. The same hash function always produces the same address for a given key. While efficient for stable data sizes, static hashing struggles when the database grows or shrinks.

Operations in Static Hashing

  1. Insertion Compute the bucket address using the hash function and store the record in that bucket.

    address = h(K)
    
  2. Search Use the same hash function to locate the bucket and retrieve the desired record.

  3. Deletion Locate the record using the hash function and remove it from the corresponding bucket.


Bucket Overflow and Collision Handling

Bucket overflow occurs when multiple records hash to the same bucket and it becomes full. This situation is known as a collision. Two primary strategies handle collisions:

1. Overflow Chaining (Closed Hashing)

  • When a bucket is full, an overflow bucket is created.
  • The new bucket is linked to the original one.
  • The chain continues as needed.

This mechanism is suitable when data access patterns are unpredictable but not excessively dense.

2. Linear Probing (Open Hashing)

  • When a collision occurs, the system searches for the next available bucket sequentially.
  • The record is placed in the nearest free slot.

This method keeps all data within the main hash table, avoiding external overflow buckets.


Dynamic Hashing

Dynamic hashing (also called extendible hashing) resolves the limitations of static hashing. It allows the hash structure to expand and shrink dynamically as data grows or decreases.

Unlike static hashing, dynamic hashing uses a hash function that can produce a large number of potential values, but initially utilizes only a subset.


Organization of Dynamic Hashing

Dynamic hashing uses a directory and a hash function that produces binary hash values.

  • The prefix bits of the hash value determine the bucket address.
  • Each directory entry maintains a depth value, representing the number of bits used.
  • When a bucket overflows, the system increases the depth and doubles the directory size, allocating new buckets accordingly.

Example

If a hash value uses 3 bits, it can address 2³ = 8 buckets. If all 8 buckets are filled, the system increases the depth to 4 bits, allowing 16 buckets.


Operations in Dynamic Hashing

Operation Description
Querying Use prefix bits (as defined by the depth) to compute the bucket address and fetch the record.
Insertion Compute bucket address using h(K). If the bucket is full: add new buckets, increase depth, and recompute hash values.
Deletion Locate and remove the record. If a bucket becomes empty, the system may merge it with another to reduce space.
Update Locate record using the hash function and update in place.

Comparison: Static vs Dynamic Hashing

| Feature | Static Hashing | Dynamic Hashing |
| --- | --- | --- |
| Bucket count | Fixed | Grows/shrinks dynamically |
| Performance | Degrades as the database grows | Maintains efficiency as size changes |
| Collision handling | Requires chaining or probing | Handled through directory expansion |
| Flexibility | Less flexible | Highly adaptive |
| Use case | Suitable for small or stable databases | Ideal for large, dynamic databases |

Limitations of Hashing

  • Not efficient for range queries (e.g., finding all records between two values) because hash functions do not preserve key ordering.
  • Works best when data keys are random and uniformly distributed.
  • Hash function design is crucial; poor hashing may lead to clustering and performance degradation.
  • Has a higher implementation complexity compared to traditional indexing.

Advantages of Hashing

Advantage Description
Fast Access Constant-time data retrieval (O(1)) for direct lookups.
Dynamic Adaptability Dynamic hashing grows with the database.
Reduced Disk I/O Minimizes index traversal.
Efficient Storage Buckets are reused efficiently, minimizing space waste.

Conclusion

Hashing is a powerful data organization technique that enables direct record access without traversing hierarchical indexes.

  • Static hashing provides constant-time lookup but lacks scalability.
  • Dynamic hashing overcomes this by expanding and contracting in response to data volume.

Although hashing is not suitable for ordered or range-based queries, it is exceptionally efficient for random access patterns, making it widely used in database indexing, symbol tables, and hash-based file systems.



2. Transaction Process


A transaction can be defined as a group of tasks. A single task is the minimum processing unit which can't be divided further.

Or, In Other words we can say:

  • A transaction is a unit of work done against the DB in a logical sequence; the order of the statements matters.

  • It is a logical unit of work that contains one or more SQL statements. Either all of these statements complete successfully (all the changes made to the database become permanent) or, if any failure happens at any point, the transaction is rolled back (all the changes made so far are undone).

Let's take an example of a simple transaction. Suppose a bank employee transfers Rs 500 from A's account to B's account. This simple, small transaction involves several low-level tasks.

A's Account

Open_Account(A)
   Old = A.balance # Read A's Balance
   New = Old - 500 # Process A's Balance
   A.balance = New # Write Balance if Transaction Successful
Close_Account(A)

B's Account

Open_Account(B)
   Old = B.balance # Read B's Balance
   New = Old + 500 # Process B's Balance
   B.balance = New # Write Balance if Transaction Successful
Close_Account(B)
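The same transfer expressed as a single SQL transaction (a sketch; the accounts table and its columns are assumed):

START TRANSACTION;
UPDATE accounts SET balance = balance - 500 WHERE account_id = 'A';
UPDATE accounts SET balance = balance + 500 WHERE account_id = 'B';
COMMIT;   -- make both updates permanent; on any failure, ROLLBACK undoes both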

ACID properties

A transaction is a very small unit of a program and it may contain several low-level tasks. A transaction in a database system must maintain Atomicity, Consistency, Isolation, and Durability − commonly known as ACID properties − in order to ensure accuracy, completeness, and data integrity.

  • Atomicity: Either all operations of transaction are reflected properly in the DB, or none are.

  • Consistency: Integrity constraints must be maintained before and after transaction. DB must be consistent after transaction happens.

  • Isolation: Even though multiple transactions may execute concurrently, the system guarantees that, for every pair of transactions Ti and Tj, it appears to Ti that either Tj finished execution before Ti started, or Tj started execution after Ti finished. Thus, each transaction is unaware of other transactions executing concurrently in the system; multiple transactions can run in isolation, without interfering with each other.

  • Durability: After transaction completes successfully, the changes it has made to the database persist, even if there are system failures.

Transaction states

  • Active State: The first state in the life cycle of a transaction; all read and write operations are performed here. If they execute without error, the transaction moves to the partially committed state; if any error occurs, it moves to the failed state.

  • Partially committed state: After the transaction executes, its changes are held in a buffer in main memory. If the changes are made permanent on the DB, the transaction moves to the committed state; if a failure occurs, it moves to the failed state.

  • Committed state: When the updates are made permanent on the DB, the transaction is said to be committed. A rollback cannot be performed from the committed state; a new consistent state is reached at this stage.

  • Failed state: Reached when a failure occurs while the transaction is executing, making it impossible to continue execution.

  • Aborted state: When a transaction reaches the failed state, all changes made in the buffer are reversed and the transaction is rolled back completely. The transaction reaches the aborted state after rollback, restoring the DB to its state prior to the transaction.

  • Terminated state: A transaction is said to have terminated once it has either committed or aborted.

Transaction States

Concurrency Control

Definition: Concurrency control in DBMS ensures that when multiple transactions execute simultaneously, the consistency and isolation of the database are maintained. It prevents conflicts like lost updates, temporary inconsistency, and uncommitted data access.

Purpose: To coordinate concurrent transactions so that the result is equivalent to executing them serially, i.e., one after another.

Common Problems in Concurrent Transactions:

Problem Description
Lost Update Two transactions overwrite each other’s changes.
Temporary Inconsistency One transaction reads uncommitted changes from another.
Unrepeatable Read A value read twice by a transaction differs because another transaction modified it in between.
Deadlock Two or more transactions wait indefinitely for each other to release locks.

Techniques for Concurrency Control:

  1. Lock-based Protocols: Use locks to control access to data items.
  2. Timestamp-based Protocols: Assign timestamps to transactions for ordering.
  3. Optimistic Concurrency Control: Transactions execute freely and check for conflicts at commit time.
  4. Validation-based Schemes: Ensure data consistency through validation before commit.

Serializability of Scheduling

Definition: A schedule is the sequence in which transactions are executed. A schedule is said to be serializable if its outcome is the same as that of some serial execution of those transactions.

Types of Serializability:

  1. Conflict Serializability: If a non-serial schedule can be transformed into a serial one by swapping non-conflicting operations, it is conflict serializable.

    • Two operations conflict if they belong to different transactions, act on the same data item, and one of them is a write.
  2. View Serializability: If a schedule preserves the read and write dependencies of transactions in the same way as a serial schedule, it is view serializable.

Importance: Serializability ensures isolation in ACID properties, maintaining the logical correctness of concurrent transactions.


Locking and Timestamp-based Schedulers

(a) Lock-based Schedulers

These control access to data items using locks:

  • Shared Lock (S): For read operations; multiple transactions can share it.
  • Exclusive Lock (X): For write operations; no other lock is allowed.

Common Locking Protocols:

  1. Two-Phase Locking (2PL):

    • Growing phase: Transaction acquires locks.
    • Shrinking phase: Once a lock is released, no new lock can be acquired. Ensures conflict serializability but may cause deadlocks.
  2. Strict 2PL: Locks are released only after commit or abort, preventing cascading rollbacks.
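As a concrete illustration, row-level locks can be taken explicitly inside a transaction in most SQL engines. A MySQL/InnoDB-style sketch (the accounts table is assumed):

START TRANSACTION;
-- Acquire an exclusive (X) lock on the selected row; other writers must wait
SELECT balance FROM accounts WHERE account_id = 'A' FOR UPDATE;
UPDATE accounts SET balance = balance - 500 WHERE account_id = 'A';
COMMIT;   -- under strict 2PL, all locks are released only at commit/abort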


(b) Timestamp-based Schedulers

Each transaction is assigned a unique timestamp when it starts.

  • Each data item maintains Read_TS(X) and Write_TS(X).
  • A transaction’s operations are executed in timestamp order.

Rules:

  • If a transaction tries to read a value written by a later transaction, it is rolled back (to maintain order).
  • If a transaction tries to write after a later read/write, it is rolled back.

Advantages:

  • Deadlock-free.
  • Ensures serializability based on timestamps.

Database Recovery

Definition: Database recovery is the process of restoring the database to a consistent state after a failure. Failures can be due to system crash, transaction error, or disk damage.

Types of Failures:

  1. Transaction Failure – Logical or input errors within a transaction.
  2. System Crash – Power failure or OS crash.
  3. Disk Failure – Physical damage to storage media.

Recovery Techniques:

1. Shadow Copy Scheme:

  • A db-pointer points to the current database copy.
  • Before updating, a new copy (shadow copy) is created.
  • If the transaction succeeds, the db-pointer is updated to the new copy.
  • If it fails, the new copy is discarded.
  • Ensures Atomicity and Durability, but is inefficient for large databases.

2. Log-based Recovery:

  • Every transaction operation is logged before execution in stable storage.
  • Each log record contains transaction ID, data item, old value, and new value.

Types:

  • Deferred Update (Deferred DB Modification): All updates are recorded in the log first and written to the database only after a transaction commits.
  • Immediate Update (Immediate DB Modification): Updates occur as the transaction executes, but old values are logged to support undo if failure occurs.

Recovery Actions:

Situation Action
Transaction fails before commit UNDO using old values in the log.
System crashes after commit REDO using new values in the log.

These mechanisms ensure the Atomicity and Durability components of the ACID properties.




Module 4: Authentication & Authorization


1. Authentication

Authentication is the process of verifying the identity of a user or system trying to access a database. It ensures that only legitimate users with valid credentials can connect to the DBMS.

Purpose

  • To verify the user’s identity before granting access.
  • To prevent unauthorized access and protect sensitive data.
  • To maintain accountability, ensuring every action in the DB can be traced to a verified user.

How Authentication Works

  1. User Identification: The user provides a username (or user ID).

  2. Verification: The system checks the provided password, token, or biometric data against stored credentials.

  3. Access Decision:

    • If credentials match → Access is granted.
    • If credentials fail → Access is denied and possibly logged.

Common Authentication Methods

Method Description
Password-based Most common; user enters username & password stored securely (hashed/salted).
Token-based Uses one-time passwords (OTPs), smart cards, or security tokens.
Biometric Uses fingerprint, iris, or face recognition systems.
Multi-Factor (MFA) Combines two or more methods for stronger security (e.g., password + OTP).
Database-level Authentication Managed by DBMS itself (e.g., MySQL CREATE USER, GRANT).
External Authentication Delegated to OS or LDAP systems (e.g., Kerberos, Active Directory).

In DBMS Context

Database systems like MySQL, Oracle, and PostgreSQL support multiple authentication types:

-- Example in MySQL
CREATE USER 'akash'@'localhost' IDENTIFIED BY 'secure_pass123';
GRANT ALL PRIVILEGES ON portfolio_db.* TO 'akash'@'localhost';
FLUSH PRIVILEGES;

This ensures only authenticated users can connect and perform operations.


2. Authorization and Access Control

Authorization is the process that determines what an authenticated user is allowed to do within the database. It defines permissions and access levels for different users or roles.

Access Control enforces these permissions by restricting database actions like reading, writing, or deleting data.

Purpose

  • To protect database objects (tables, views, schemas, etc.).
  • To ensure users perform only authorized operations.
  • To implement least privilege principle, minimizing damage from accidental misuse or attacks.

Types of Access Control

Type Description
Discretionary Access Control (DAC) Access rights are granted or revoked by the data owner using commands like GRANT or REVOKE.
Mandatory Access Control (MAC) Access is controlled by a central authority based on predefined policies (e.g., government-level databases).
Role-Based Access Control (RBAC) Permissions are assigned to roles, and users are assigned roles (common in enterprise DBMS).

SQL Authorization Commands

  1. GRANT: To assign privileges.

    GRANT SELECT, INSERT ON student TO 'teacher'@'localhost';
  2. REVOKE: To remove privileges.

    REVOKE INSERT ON student FROM 'teacher'@'localhost';

Common Privileges

Privilege Description
SELECT Read data from a table/view.
INSERT Add new records.
UPDATE Modify existing records.
DELETE Remove records.
CREATE / DROP Create or delete database objects.
EXECUTE Run stored procedures or functions.

Access Control Levels

  1. User Level: Access to database or schema.
  2. Object Level: Access to specific tables, views, or functions.
  3. Column Level: Restrict visibility to specific columns (useful for sensitive data like salary or password).
  4. Row Level: Access limited to certain rows based on user identity (via WHERE clauses or policies).
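Column- and row-level restrictions are often implemented through views. A minimal sketch (table, view, and column names are assumed):

-- Column-level: expose only non-sensitive columns through a view
CREATE VIEW employee_public AS
SELECT emp_id, name, department FROM employee;   -- salary is hidden

-- Row-level: expose only the rows belonging to the connected user
CREATE VIEW my_accounts AS
SELECT * FROM accounts WHERE owner = CURRENT_USER();

GRANT SELECT ON employee_public TO 'teller'@'localhost';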

Relationship Between Authentication and Authorization

| Aspect | Authentication | Authorization |
| --- | --- | --- |
| Purpose | Verifies who the user is | Determines what the user can do |
| Stage | Performed before authorization | Follows successful authentication |
| Based on | Identity verification (passwords, tokens) | Access policies, roles, and privileges |
| Outcome | Grants or denies login | Grants or denies specific actions |

DBMS Example Scenario

Suppose a banking database has two roles:

  • Admin: Can view and modify all customer data.
  • Teller: Can only view customer balances.

Implementation:

CREATE ROLE teller;
CREATE ROLE admin;
 
GRANT SELECT ON accounts TO teller;
GRANT ALL PRIVILEGES ON accounts TO admin;
 
CREATE USER 'rahul'@'localhost' IDENTIFIED BY 'pass123';
GRANT teller TO 'rahul'@'localhost';

Here, authentication verifies Rahul’s identity, and authorization restricts his access only to viewing data.


Summary Table

| Concept | Definition | Key Mechanism | Example |
| --- | --- | --- | --- |
| Authentication | Verifies user identity | Passwords, MFA, tokens | CREATE USER |
| Authorization | Determines access level | Roles, GRANT, REVOKE | GRANT SELECT |
| Access Control | Enforces authorization rules | DAC, MAC, RBAC | Role-based privileges |




Module 5: Data Analysis & Security


1. Introduction to Data Warehousing

A Data Warehouse (DW) is a centralized repository that stores integrated data from multiple heterogeneous sources such as databases, transactional systems, and external files. It is designed specifically for querying and analysis, not for transaction processing.

Purpose

  • To support decision-making and business intelligence (BI).
  • To provide a historical, subject-oriented, integrated, and time-variant view of data.

Key Characteristics

Property Description
Subject-oriented Organized around major business subjects like sales, finance, or customers.
Integrated Data is collected from various sources and standardized.
Time-variant Maintains historical data (e.g., yearly, quarterly trends).
Non-volatile Once data is stored, it is not updated or deleted frequently.

Architecture of a Data Warehouse

  1. Data Sources: Operational databases, flat files, web logs, etc.

  2. ETL Process (Extract, Transform, Load):

    • Extract: Retrieve data from various sources.
    • Transform: Clean, filter, and format data.
    • Load: Store transformed data into the warehouse.
  3. Data Storage Area: The central repository storing both current and historical data.

  4. OLAP Engine (Online Analytical Processing): Supports complex queries, multidimensional analysis, and data summarization.

  5. Front-end Tools: Dashboards, reports, and data visualization tools for business analysis.

Schemas Used

Schema Structure Description
Star Schema Central fact table linked to dimension tables. Simplest and widely used.
Snowflake Schema Dimension tables are normalized into multiple related tables. Reduces redundancy.
Galaxy Schema Multiple fact tables share dimension tables. Used for complex data marts.
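A minimal star-schema sketch in SQL (fact and dimension table and column names are assumed):

-- Dimension tables
CREATE TABLE dim_date    (date_id INT PRIMARY KEY, full_date DATE, month INT, year INT);
CREATE TABLE dim_product (product_id INT PRIMARY KEY, name VARCHAR(100), category VARCHAR(50));

-- Central fact table referencing the dimensions
CREATE TABLE fact_sales (
    sale_id    INT PRIMARY KEY,
    date_id    INT,
    product_id INT,
    quantity   INT,
    revenue    DECIMAL(10,2),
    FOREIGN KEY (date_id)    REFERENCES dim_date(date_id),
    FOREIGN KEY (product_id) REFERENCES dim_product(product_id)
);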

Benefits

  • Enhances data analysis and reporting.
  • Improves decision-making and forecasting.
  • Provides data consistency across the organization.
  • Supports historical and trend analysis.

2. Introduction to Data Mining

Data Mining is the process of extracting hidden patterns, trends, and knowledge from large datasets using statistical, mathematical, and machine learning techniques.

It is often considered as “Knowledge Discovery in Databases (KDD)”.

Steps in Data Mining (KDD Process)

  1. Data Cleaning: Remove noise and inconsistencies.
  2. Data Integration: Combine data from multiple sources.
  3. Data Selection: Retrieve relevant data for analysis.
  4. Data Transformation: Convert data into appropriate forms for mining.
  5. Data Mining: Apply algorithms to extract patterns (e.g., clustering, classification).
  6. Pattern Evaluation: Identify interesting and useful patterns.
  7. Knowledge Presentation: Visualize the results for decision-making.

Techniques of Data Mining

Technique Description Example
Classification Assigns data into predefined categories. Spam vs. Non-spam email.
Clustering Groups similar data items together. Customer segmentation.
Association Rule Mining Discovers relationships between items. “If customer buys bread, they also buy butter.”
Regression Predicts continuous values. Predicting sales revenue.
Anomaly Detection Identifies outliers or unusual data points. Fraud detection.

Applications

  • Business Intelligence: Market basket analysis, sales forecasting.
  • Finance: Credit scoring, risk analysis.
  • Healthcare: Disease prediction, patient profiling.
  • Web Mining: User behavior analysis, recommendation systems.

Difference between Data Warehousing and Data Mining

| Aspect | Data Warehousing | Data Mining |
| --- | --- | --- |
| Purpose | Store and organize data | Analyze and discover patterns |
| Process type | Data storage | Data analysis |
| Tools used | OLAP, ETL | ML algorithms, statistics |
| Outcome | Historical data repository | Actionable insights and predictions |

3. SQL Injection

SQL Injection (SQLi) is a security vulnerability in which an attacker inserts or “injects” malicious SQL statements into a query to manipulate or access a database unlawfully.

It targets applications that dynamically build SQL queries using unvalidated user inputs.

Example

Vulnerable Code:

// User input: username=admin' --  and password=anything
$query = "SELECT * FROM users WHERE username = '$username' AND password = '$password'";

Here, the injected input admin' -- comments out the rest of the query, allowing unauthorized login.

Resulting Query:

SELECT * FROM users WHERE username = 'admin' -- ' AND password = 'anything';

The attacker logs in without knowing the real password.


Types of SQL Injection

Type Description
Classic SQL Injection Inject malicious queries to manipulate database logic.
Blind SQL Injection Attacker infers data from the application's responses (true/false).
Union-based SQL Injection Uses the UNION keyword to extract data from other tables.
Error-based SQL Injection Exploits error messages to reveal database structure.
Time-based Blind SQL Injection Uses delays (like SLEEP()) to detect vulnerabilities indirectly.

Preventive Measures

  1. Input Validation: Always validate and sanitize user inputs.

  2. Parameterized Queries (Prepared Statements):

    $stmt = $conn->prepare("SELECT * FROM users WHERE username=? AND password=?");
    $stmt->bind_param("ss", $username, $password);
    $stmt->execute();

    Prevents query manipulation.

  3. Stored Procedures: Encapsulate SQL logic within the database to avoid dynamic queries.

  4. Least Privilege Principle: Give minimal permissions to application-level database users.

  5. Use ORM Frameworks: Frameworks like Sequelize, Hibernate, or Prisma automatically handle query sanitization.

  6. Error Handling: Avoid exposing SQL error messages to users.


Consequences of SQL Injection

  • Unauthorized data access or modification.
  • Data theft or deletion.
  • Compromise of entire database server.
  • Loss of business integrity and trust.

Summary Table

| Concept | Definition | Key Focus | Tools/Methods |
| --- | --- | --- | --- |
| Data Warehousing | Centralized data storage for analytics | Data integration and OLAP | ETL, star schema |
| Data Mining | Discovering patterns and insights from data | Knowledge discovery | Clustering, classification |
| SQL Injection | Attacking via malicious SQL input | Database security | Input validation, prepared statements |





Interview Topics

1. SQL vs NoSQL Databases

Introduction to SQL Databases

SQL (Structured Query Language) databases are relational in nature. They store data in tables (rows and columns) and use predefined schemas to enforce structure and relationships.

Characteristics

  • Data is structured and relational.
  • Follows ACID properties (Atomicity, Consistency, Isolation, Durability).
  • Requires a fixed schema.
  • Data is accessed and manipulated using SQL queries.

Examples

  • MySQL
  • PostgreSQL
  • Oracle
  • Microsoft SQL Server

To learn more about SQL, see this MySQL Tutorial / Download as PDF.


Introduction to NoSQL Databases

NoSQL (Not Only SQL) databases are non-relational. They store data in formats such as documents, key-value pairs, graphs, or columns, and offer flexible schemas for unstructured or semi-structured data.

Characteristics

  • Schema-free or dynamic schema.
  • Horizontal scalability (scale-out).
  • High availability and replication support.
  • Best suited for Big Data and real-time applications.
  • Often used in cloud, IoT, and microservice architectures.

Common Types of NoSQL Databases

Type Description Example
Document Store Stores data as JSON-like documents MongoDB, CouchDB
Key-Value Store Stores data as key-value pairs Redis, DynamoDB
Column Store Stores data by columns for analytics Cassandra, HBase
Graph Store Stores nodes and edges for relationships Neo4j, Amazon Neptune

Example Code Comparison

SQL Example (MySQL / PostgreSQL)

Table Structure:

CREATE TABLE users (
  id INT PRIMARY KEY AUTO_INCREMENT,
  name VARCHAR(50),
  email VARCHAR(100),
  age INT
);
 
INSERT INTO users (name, email, age) 
VALUES ('Akash', 'akash@example.com', 22);
 
SELECT * FROM users WHERE age > 18;
  • Requires a predefined schema.
  • Data is normalized into tables.
  • JOINs are used to relate multiple tables.

NoSQL Example (MongoDB – Document-Based)

Document Structure (JSON-like):

{
  "_id": 1,
  "name": "Akash",
  "email": "akash@example.com",
  "age": 22
}

MongoDB Commands:

// Insert a document
db.users.insertOne({ name: "Akash", email: "akash@example.com", age: 22 });
 
// Query documents
db.users.find({ age: { $gt: 18 } });
  • Schema is flexible; fields can vary across documents.
  • No need for JOINs; nested structures store related data together.
  • Ideal for agile, real-time, and distributed systems.

Historical Background

| Aspect | SQL | NoSQL |
| --- | --- | --- |
| Origin | 1970s, developed to minimize data duplication and enforce relational consistency | Late 2000s, emerged due to cloud computing, unstructured data, and the need for horizontal scaling |
| Primary focus | Data integrity and consistency | Developer productivity, scalability, and flexibility |

Advantages of NoSQL Databases

  1. Flexible Schema: No predefined schema; structure can change dynamically.
  2. Horizontal Scaling: Easily distribute data across multiple nodes.
  3. High Availability: Automatic replication ensures data redundancy.
  4. Fast Reads/Inserts: Optimized for read-heavy workloads.
  5. Cloud-Native: Suited for distributed and cloud-based systems.
  6. Handles Unstructured Data: Can manage JSON, logs, documents, and multimedia.
  7. Developer Friendly: Reduces complex migrations during development.
  8. No Complex Joins: Data is denormalized for performance.
  9. Caching Capability: Excellent for caching real-time data.
  10. Ideal for Big Data: Efficient for large-scale analytics and storage.

Common Misconceptions about NoSQL

Misconception Reality
“NoSQL databases can’t handle relationships.” They can, using embedded documents or graph models.
“NoSQL databases don’t support ACID transactions.” Modern NoSQL systems like MongoDB support ACID transactions.
“NoSQL replaces SQL.” It complements SQL; hybrid systems are common in modern architectures.

Major Disadvantages of NoSQL

  1. Data Redundancy: Data duplication is common due to denormalization.
  2. Complex Updates: Updating redundant data across multiple documents is difficult.
  3. Limited Querying Capability: Lacks powerful query languages like SQL.
  4. Weaker Consistency: Many systems prefer availability over strict consistency.
  5. Lack of Standardization: Query syntax differs across databases.
  6. No Universal ACID Support: Some NoSQL databases trade consistency for scalability.
  7. Difficult for Complex Transactions: Multi-document transactions can be limited.

When to Use Each

Use Case Recommended Database Type
Highly structured data with relationships SQL
Rapid development, schema flexibility NoSQL
Real-time analytics and high scalability NoSQL
Strict data integrity (banking, accounting) SQL
Microservices and cloud-native apps NoSQL
Complex reporting and JOIN operations SQL

SQL vs NoSQL

| Feature | SQL Databases | NoSQL Databases |
| --- | --- | --- |
| Data model | Relational (tables, rows, columns) | Non-relational (documents, key-value, graph, column) |
| Schema | Fixed, predefined | Dynamic or schema-less |
| Scalability | Vertical (scale-up by upgrading hardware) | Horizontal (scale-out across servers) |
| Transactions | Fully ACID compliant | BASE (Basically Available, Soft state, Eventually consistent) |
| Joins | Supports JOIN operations | Typically avoided; data is nested |
| Query language | Structured Query Language (SQL) | Database-specific APIs or JSON queries |
| Performance | Slower for large-scale, distributed workloads | Optimized for high-volume and low-latency workloads |
| Storage type | Row-based storage | Key-value, document, column, or graph formats |
| Examples | MySQL, PostgreSQL, Oracle | MongoDB, Redis, Cassandra, Neo4j |
| Use case | Banking, ERP, structured data | Social media, IoT, real-time analytics |

Conclusion

  • SQL databases emphasize structure, integrity, and consistency, making them ideal for applications like banking or inventory management.
  • NoSQL databases emphasize flexibility, scalability, and speed, making them suitable for modern web apps, IoT, real-time systems, and Big Data analytics.

In practice, most modern systems use both, where SQL handles transactional data and NoSQL handles large, unstructured, or rapidly changing datasets.




2. Partitioning and Sharding in DBMS


Modern databases store massive volumes of data that grow rapidly with time. Managing such large databases can become inefficient, affecting query performance, scalability, and availability. To overcome these challenges, database engineers use partitioning and sharding, both of which divide large datasets into smaller, more manageable units.

The principle is simple:

“A big problem can be solved efficiently when broken into smaller parts.”


What is Partitioning?

Partitioning is a database optimization technique used to divide a large database table into smaller, manageable segments known as partitions. Each partition contains a subset of the table’s data and can be stored on separate disks or servers.

Despite partitioning, the table remains logically one single entity — SQL queries can access it as usual.

Key Idea

Instead of processing a giant table, queries now operate on smaller, partitioned tables, improving:

  • Performance
  • Manageability
  • Scalability

Purpose of Partitioning

Purpose Description
Performance Optimization Queries scan fewer rows within a partition rather than the entire table.
Improved Manageability Easier to maintain and backup smaller data slices.
Parallelism Multiple partitions can be queried in parallel.
High Availability Even if one partition/server fails, others continue to function.
Cost Efficiency Reduces the need for expensive hardware upgrades (scale-out instead of scale-up).

Types of Partitioning

Partitioning can be implemented in two main ways:

A. Vertical Partitioning

  • Data is divided column-wise.
  • Each partition stores different columns of a table.

Example:

ID Name Email Salary
1 Akash akash@example.com 60000

Vertical partitioning could split this into:

  • Partition 1: (ID, Name, Email)
  • Partition 2: (ID, Salary)

Pros:

  • Useful when applications frequently access only specific columns.
  • Reduces I/O load for partial data access.

Cons:

  • To reconstruct a full record, data from multiple partitions must be joined (higher latency).
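A minimal sketch of vertical partitioning as two physical tables sharing the same key (table and column names are assumed):

CREATE TABLE employee_profile (id INT PRIMARY KEY, name VARCHAR(50), email VARCHAR(100));
CREATE TABLE employee_payroll (id INT PRIMARY KEY, salary DECIMAL(10,2),
                               FOREIGN KEY (id) REFERENCES employee_profile(id));

-- Reconstructing a full record requires a join across the partitions
SELECT p.name, p.email, s.salary
FROM employee_profile p JOIN employee_payroll s ON p.id = s.id;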

B. Horizontal Partitioning

  • Data is divided row-wise (tuple-wise).
  • Each partition contains different rows of the same table.

Example:

ID Name Age City
1 Akash 22 Delhi
2 Priya 24 Mumbai
3 Raj 25 Chennai

Horizontal partitioning:

  • Partition 1 → Rows 1–1000
  • Partition 2 → Rows 1001–2000
  • Partition 3 → Rows 2001–3000

Pros:

  • Queries can be routed only to relevant partitions.
  • Ideal for distributed systems and scalability.

Cons:

  • Slight complexity in determining partition logic.

When Partitioning is Applied

Partitioning becomes essential when:

  1. Dataset size grows enormously, making query performance slow.
  2. Server load increases, causing high response times.
  3. Backup, restore, or maintenance operations become cumbersome.
  4. Distributed or Cloud databases are used (requiring data segmentation).

Advantages of Partitioning

Advantage Description
Parallelism Queries can run on multiple partitions simultaneously.
High Availability Failure of one partition/server does not affect the entire system.
Performance Queries and indexes operate on smaller data subsets.
Manageability Easier to back up, migrate, or maintain data.
Scalability Supports distributed data storage without affecting structure.
Cost Reduction Enables horizontal scaling, avoiding costly hardware upgrades.

Distributed Database

A distributed database is a single logical database stored across multiple physical locations (servers). Each site manages its data locally but is connected by a network and appears as a single unified system to the user.

Partitioning, Clustering, and Sharding are optimization techniques used to implement distributed databases.



Sharding

Sharding is a specific implementation of horizontal partitioning across multiple servers or database instances. It involves splitting a large dataset into smaller subsets called shards, each hosted on a separate database instance.

A routing layer determines which shard contains the requested data.


Example of Sharding

Scenario:

A global e-commerce platform stores user data by region.

Shard Region Database Server
Shard 1 Asia DB_Server_1
Shard 2 Europe DB_Server_2
Shard 3 America DB_Server_3

When a user from India logs in, the routing layer directs the query to Shard 1 (Asia), instead of searching all data globally.


How Sharding Works

  1. Sharding Key: A field (like user ID or region) determines which shard stores the data.
  2. Routing Layer: Routes queries to the appropriate shard using the key.
  3. Independent Shards: Each shard behaves as a separate database instance.

Advantages of Sharding

Benefit Description
Scalability Allows horizontal scaling across multiple machines.
Availability Failure in one shard does not affect others.
Performance Queries operate on smaller datasets.
Load Distribution Distributes data and workload evenly.

Disadvantages of Sharding

Limitation Description
Complexity Requires routing layer and consistent partition mapping.
Re-Sharding Issues If data grows unevenly, rebalancing shards is challenging.
Non-uniform Data Distribution May cause load imbalance.
Scatter-Gather Problem Analytical queries need to fetch data from multiple shards, increasing latency.

Difference Between Partitioning and Sharding

| Aspect | Partitioning | Sharding |
|---|---|---|
| Definition | Logical division of a large table into smaller pieces within a single database. | Physical division of data across multiple database instances. |
| Implementation Scope | Within one DB server. | Across multiple servers or databases. |
| Data Distribution | Logical (can exist on same hardware). | Physical (distributed across nodes). |
| Used For | Performance and management optimization. | Scalability and distributed storage. |
| Routing Layer | Not required. | Required to direct queries to shards. |
| Failure Impact | Failure affects the same DB instance. | Isolated; failure of one shard doesn’t affect others. |
| Complexity | Easier to manage. | More complex (requires mapping, balancing). |
| Best Suited For | Large single-server databases. | Distributed, cloud-scale applications. |

Practical Examples

Partitioning Example (Horizontal)

-- Table Partition Example in MySQL
CREATE TABLE orders (
  order_id INT,
  order_date DATE,
  customer_id INT
)
PARTITION BY RANGE (YEAR(order_date)) (
  PARTITION p2022 VALUES LESS THAN (2023),
  PARTITION p2023 VALUES LESS THAN (2024)
);

Divides data by year into multiple partitions within the same database.


Sharding Example (Conceptual)

// Pseudocode: route a user to a shard based on userID (modulo sharding)
function shardFor(userID: number): string {
  const servers = ["DB_Server_1", "DB_Server_2", "DB_Server_3"];
  return servers[userID % 3];
}
connect(shardFor(42)); // 42 % 3 === 0 → DB_Server_1

Distributes user data across multiple databases (shards) based on user ID.


Summary Table

| Concept | Definition | Example | Purpose |
|---|---|---|---|
| Partitioning | Divides a table into smaller segments within the same database. | Year-wise orders partition | Performance and manageability |
| Sharding | Distributes database segments across multiple servers. | Region-wise user data | Scalability and load balancing |
| Vertical Partitioning | Split columns of a table. | User info vs Salary data | Optimize column access |
| Horizontal Partitioning | Split rows of a table. | Orders by region | Optimize data access by key |

Conclusion

  • Partitioning enhances query performance and manageability within a single database.
  • Sharding ensures scalability and high availability across distributed systems.
  • Both are fundamental database optimization techniques in modern large-scale systems like Google, Amazon, and Netflix, where billions of records must be managed efficiently.



3. CAP Theorem and Master–Slave Database Architecture


CAP Theorem

The CAP Theorem (also known as Brewer’s Theorem) is one of the most important principles in Distributed Database Systems. It defines the trade-offs a distributed database must make between Consistency, Availability, and Partition Tolerance — the three essential properties of distributed systems. Understanding this theorem helps engineers design efficient, fault-tolerant, and scalable distributed databases based on the specific requirements of an application.


Breaking Down CAP

Consistency (C)

In a consistent distributed system, all nodes reflect the same data at any given time. When a write operation occurs, the update must be propagated to all replicas so that any subsequent read operation returns the most recent value.

Example: If a user updates their email ID in one node, any subsequent read from another node should immediately return the new email ID.

Key Point: All clients see the same data simultaneously, ensuring synchronized state across the cluster.


Availability (A)

A system is available if it remains operational and responsive even when some nodes fail. Every request — read or write — receives a response, though it may not always reflect the most recent data.

Example: Even if one server is down, other nodes continue serving requests without downtime.

Key Point: The system guarantees responsiveness but not necessarily the latest consistency.


Partition Tolerance (P)

Partition tolerance refers to the system’s ability to continue functioning despite communication failures or network partitions between nodes. If a network partition occurs (e.g., some nodes can’t communicate), a partition-tolerant system remains operational by replicating and distributing data.

Example: If Node A cannot communicate with Node B due to network issues, both can still serve requests independently until communication is restored.

Key Point: When a partition occurs, a partition-tolerant system must sacrifice either consistency or availability; it can preserve one of them, but not both.


The CAP Theorem Statement

The CAP Theorem states that:

In a distributed system, it is impossible to guarantee Consistency, Availability, and Partition Tolerance simultaneously. A system can only provide two of these three properties at any given time.

Therefore, distributed databases must choose between consistency and availability when a partition occurs.

CAP Theorem

CAP in NoSQL Databases

NoSQL databases are designed to operate across distributed networks and are optimized for scalability and high availability. Depending on business needs, each NoSQL database prioritizes different combinations of CAP properties.


CA Databases (Consistency + Availability)

  • These databases maintain data consistency across all nodes and ensure high availability as long as no network partition occurs.
  • They are not fault-tolerant because partitions (network failures) can cause system unavailability.

Example: Relational databases like MySQL or PostgreSQL configured with replication can function as CA systems in non-partitioned environments.

Use Case: Enterprise systems requiring consistent transactions but deployed in stable network environments.


CP Databases (Consistency + Partition Tolerance)

  • These systems maintain data consistency and tolerance to network partitions, but sacrifice availability during a partition.
  • If a partition occurs, inconsistent nodes are taken offline until synchronization is restored.

Example: MongoDB is a CP database. It uses a replica set with one primary node and multiple secondary nodes. Only the primary handles write requests; if it fails, a secondary becomes the new primary. During partitioning, unavailable nodes may reject requests to preserve consistency.

Use Case: Banking and financial applications where data accuracy is critical, and temporary unavailability is acceptable.


AP Databases (Availability + Partition Tolerance)

  • These databases guarantee availability and partition tolerance, but may return stale data during partitions.
  • All nodes remain operational and eventually synchronize after partitions are resolved (eventual consistency).

Example: Apache Cassandra and Amazon DynamoDB are AP databases. They have no primary node, and all nodes can handle read/write operations. Data updates propagate asynchronously across the cluster.

Use Case: Social media platforms, messaging systems, and e-commerce websites where availability and speed matter more than strict consistency.


Summary Table – CAP in Databases

| Database Type | Properties Ensured | Compromised Property | Example | Typical Use Case |
|---|---|---|---|---|
| CA | Consistency + Availability | Partition Tolerance | MySQL, PostgreSQL | Enterprise data centers with stable networks |
| CP | Consistency + Partition Tolerance | Availability | MongoDB | Banking, financial systems |
| AP | Availability + Partition Tolerance | Consistency | Cassandra, DynamoDB | Social media, real-time applications |

Design Implications of CAP

When designing distributed systems:

  • Choose CP for accuracy-sensitive systems (e.g., banking).
  • Choose AP for availability-first systems (e.g., social media, streaming).
  • CA systems work well only when network partitions are rare or negligible.

No single database satisfies all three properties simultaneously; instead, engineers must balance them based on application goals.




Master–Slave Database Architecture


Overview

Master–Slave Architecture is a database scaling pattern used to optimize I/O performance in systems with high read and write workloads. It applies the core idea behind Command Query Responsibility Segregation (CQRS): separating write and read operations so each path can be optimized and scaled independently.

In this setup:

  • The Master Database handles write (insert, update, delete) operations.
  • The Slave Databases handle read (select) operations.

How It Works

  1. Master Node:

    • The primary source of truth.
    • All updates (writes) are applied here.
    • Data from the master is replicated to slave databases.
  2. Slave Nodes:

    • Read-only replicas of the master database.
    • Serve user read requests to reduce load on the master.
    • Updated through database replication mechanisms.
  3. Replication:

    • Can be synchronous (instant consistency) or asynchronous (eventual consistency).
    • Ensures all slave nodes reflect master data periodically.
Master Slave Architecture

Example Workflow

  • A user submits an update query → Directed to Master DB.
  • Another user fetches data → Served from a Slave DB.
  • Slaves synchronize with the master automatically to stay updated.
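
The same workflow can be sketched in application code. This is only an illustrative sketch, assuming the Node.js node-postgres (pg) driver and hypothetical hosts master-db and replica-db:

import { Pool } from "pg";

// Writes go to the master; reads are served by a read-only replica
const master = new Pool({ host: "master-db", database: "app", user: "app", password: "secret" });
const replica = new Pool({ host: "replica-db", database: "app", user: "app", password: "secret" });

// Update query → directed to the Master DB
async function updateEmail(userId: number, email: string): Promise<void> {
  await master.query("UPDATE users SET email = $1 WHERE id = $2", [email, userId]);
}

// Read query → served from a Slave/replica DB (may lag slightly behind the master)
async function getUser(userId: number) {
  const result = await replica.query("SELECT id, name, email FROM users WHERE id = $1", [userId]);
  return result.rows[0];
}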

Advantages

| Benefit | Description |
|---|---|
| Load Balancing | Distributes read and write loads efficiently. |
| Improved Performance | Reduces read latency for high-traffic systems. |
| High Availability | System remains operational even if a slave fails. |
| Reliability | Master’s data is backed up across replicas. |
| Scalability | Additional slave nodes can be added easily. |

Example Implementation (MySQL)

-- On Master DB
CREATE USER 'replica'@'%' IDENTIFIED BY 'replica_pass';
GRANT REPLICATION SLAVE ON *.* TO 'replica'@'%';
 
-- On Slave DB
CHANGE MASTER TO
MASTER_HOST='master_host',
MASTER_USER='replica',
MASTER_PASSWORD='replica_pass',
MASTER_LOG_FILE='mysql-bin.000001',
MASTER_LOG_POS=107;
START SLAVE;

This setup replicates everything written to the Master to the Slave(s), asynchronously by default, so replicas stay in near real-time sync with the master. Note that recent MySQL releases (8.0.23+) also provide the equivalent commands CHANGE REPLICATION SOURCE TO and START REPLICA.


Use Cases

  • E-commerce platforms: High concurrent read/write activity.
  • Analytics systems: Heavy read workloads on replicated slaves.
  • Content delivery applications: Fast global read performance.

Key Points

  • Master → Handles all writes (true source of data).
  • Slave → Handles reads (replicated copies).
  • Replication ensures data consistency across nodes.
  • Ideal for systems emphasizing reliability, availability, and scalability under heavy traffic.

Conclusion

Both the CAP Theorem and Master–Slave Architecture form the foundation of modern distributed databases:

  • CAP Theorem defines the trade-offs among consistency, availability, and partition tolerance.
  • Master–Slave Architecture practically implements scaling and reliability by distributing read and write operations.

Together, they help design robust, efficient, and high-performance distributed systems suited to real-world scalability demands.




4. Scaling Patterns in DBMS

Scaling a database means increasing its capacity to handle higher loads — more users, more requests, and more data — without degrading performance. In this case study, we’ll explore how a Cab Booking App evolves from a tiny startup system to a globally scalable platform, applying a series of database scaling patterns to solve real-world performance challenges.


The Beginning: The Startup Phase

At the initial stage:

  • The app serves ~10 customers.
  • A single small database server stores all information — customers, trips, locations, bookings, and trip history.
  • Average load: ~1 booking every 5 minutes.

This architecture is simple and efficient for early growth but not scalable as traffic increases.


The Problem Begins

As the app gains popularity:

  • Bookings rise to 30 per minute and beyond.

  • The single database server begins to struggle with:

    • High API latency
    • Transaction deadlocks and starvation
    • Frequent timeouts
    • Slow user experience

To fix this, the company must apply database optimization and scaling techniques.


Pattern 1: Query Optimization & Connection Pooling

Concept

Before scaling infrastructure, optimize what already exists.

Techniques

  • Query Optimization:

    • Rewrite inefficient SQL queries.
    • Use proper indexes, query plans, and denormalization for frequent lookups.
  • Caching:

    • Cache frequently accessed data (like user profiles, trip history, and payment data) using Redis or Memcached.
  • Connection Pooling:

    • Use libraries like HikariCP or pgBouncer to reuse database connections.
    • Allows multiple application threads to share connections efficiently.
  • Database Redundancy:

    • Add replicas or use NoSQL for less-structured data.
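
A rough sketch of the caching and connection-pooling ideas above, assuming the node-postgres (pg) driver and a simple in-process cache (in production the cache would typically be Redis or Memcached):

import { Pool } from "pg";

// Connection pool: requests reuse a bounded set of DB connections instead of opening new ones
const pool = new Pool({ host: "db-host", database: "cab_app", user: "app", password: "secret", max: 20 });

// Cache-aside: check the cache first, fall back to the database on a miss
const profileCache = new Map<number, { value: unknown; expiresAt: number }>();
const TTL_MS = 60_000;

async function getUserProfile(userId: number) {
  const cached = profileCache.get(userId);
  if (cached && cached.expiresAt > Date.now()) return cached.value; // cache hit, no DB round trip

  const { rows } = await pool.query("SELECT * FROM users WHERE id = $1", [userId]); // uses a pooled connection
  profileCache.set(userId, { value: rows[0], expiresAt: Date.now() + TTL_MS });
  return rows[0];
}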

Outcome

  • Query latency drops significantly.
  • The app handles up to 100 bookings per minute smoothly.

Pattern 2: Vertical Scaling (Scale-Up)

Concept

Upgrade the existing server hardware instead of changing architecture.

Implementation

  • Increase RAM, CPU cores, and SSD storage.
  • Optimize database configuration (buffer size, cache limits, I/O throughput).

Advantages

  • Simple and quick to implement.
  • Cost-effective up to a point.

Limitations

  • Beyond a certain capacity, costs grow exponentially.
  • Hardware upgrades cannot overcome single-node limitations.

Outcome

  • Performance improves temporarily, handling 300 bookings per minute.

Pattern 3: Command Query Responsibility Segregation (CQRS)

Concept

Separate read and write operations to reduce contention on a single database.

Implementation

  • Introduce Master–Slave Replication:

    • Primary (Master) handles write operations.
    • Replicas (Slaves) handle read operations.
  • Use read replicas to balance query loads across multiple machines.

Advantages

  • Improved read performance.
  • Better load distribution across replicas.

Challenges

  • Replication lag between master and replicas.
  • Heavy write traffic can still overload the primary database.
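
Replication lag is often softened at the application layer with a read-your-own-writes rule: right after a user writes, route that user's reads to the primary for a short window. A hypothetical sketch (the lag window is an assumed bound, not a measured value):

// Remember when each user last wrote, and read from the primary shortly afterwards
const lastWriteAt = new Map<number, number>();
const LAG_WINDOW_MS = 2_000; // assumed upper bound on replication lag

function recordWrite(userId: number): void {
  lastWriteAt.set(userId, Date.now());
}

function readTargetFor(userId: number): "primary" | "replica" {
  const wroteAt = lastWriteAt.get(userId) ?? 0;
  // Within the lag window a replica may still be stale, so read from the primary
  return Date.now() - wroteAt < LAG_WINDOW_MS ? "primary" : "replica";
}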

Outcome

  • Efficient up to 500–600 bookings per minute, but write scalability remains limited.

Pattern 4: Multi-Primary Replication

Concept

Allow all nodes to act as both read and write replicas.

Implementation

  • Each node can process both read and write queries.
  • Nodes form a logical circular ring to replicate changes to one another.

Advantages

  • High write scalability — all servers accept writes.
  • Fault-tolerant, since each node can serve independently.

Challenges

  • Increased system complexity.
  • Requires synchronization and conflict resolution.
  • Risk of data inconsistency under heavy writes.
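
Conflict resolution is usually the hardest part of multi-primary setups. One common (but lossy) strategy is last-write-wins based on update timestamps; a minimal sketch:

// A replicated row version carrying the wall-clock time of its last update
interface VersionedRow {
  id: number;
  email: string;
  updatedAt: number; // epoch milliseconds on the node that wrote it
}

// Last-write-wins: when two primaries updated the same row, keep the newer version.
// Wall clocks can drift between nodes, which is why real systems often prefer
// vector clocks or CRDTs for stronger guarantees.
function resolveConflict(a: VersionedRow, b: VersionedRow): VersionedRow {
  return a.updatedAt >= b.updatedAt ? a : b;
}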

Outcome

  • Performance increases, but at ~50 requests/sec, latency and replication delays appear again.

Pattern 5: Partitioning by Functionality

Concept

Split data by functional domains into different databases or schemas.

Implementation

  • Separate large tables by business functions:

    • CustomerDB → customer details.
    • BookingDB → bookings and payments.
    • LocationDB → routes and geo-data.
  • Each DB can use its own replication or scaling pattern (single-primary or multi-primary).
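
After functional partitioning, cross-domain queries move into the application layer. A sketch assuming the node-postgres (pg) driver and hypothetical hosts customer-db and booking-db; notice that the "join" between customers and bookings now happens in code:

import { Pool } from "pg";

// One pool per functional database
const customerDb = new Pool({ host: "customer-db", database: "customers" });
const bookingDb = new Pool({ host: "booking-db", database: "bookings" });

// Cross-database "join" performed by the application, not the DBMS
async function getCustomerWithBookings(customerId: number) {
  const customer = (
    await customerDb.query("SELECT id, name FROM customers WHERE id = $1", [customerId])
  ).rows[0];
  const bookings = (
    await bookingDb.query("SELECT id, fare, trip_date FROM bookings WHERE customer_id = $1", [customerId])
  ).rows;
  return { ...customer, bookings };
}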

Advantages

  • Improved modularity and fault isolation.
  • Easier to manage specific functionalities independently.

Challenges

  • The application layer must handle cross-database joins.
  • Increases complexity in backend logic.

Outcome

  • The system supports multi-country operations efficiently.

Pattern 6: Horizontal Scaling (Scale-Out)

Concept

Distribute data horizontally across multiple servers — known as Sharding.

Implementation

  • Divide data across multiple shards based on a shard key (e.g., customer ID or region).
  • Each shard has the same schema but holds a subset of total data.
  • Optionally, each shard has replicas for fault tolerance.
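
For non-numeric shard keys (for example string customer IDs), the routing layer typically hashes the key and takes it modulo the shard count. A simplified sketch with hypothetical shard hosts; real systems often use consistent hashing so that adding a shard does not remap almost every key:

const shards = ["shard-0.db.internal", "shard-1.db.internal", "shard-2.db.internal"]; // hypothetical hosts

// Simple FNV-1a-style string hash; production systems use a stable, well-tested hash
function hashKey(key: string): number {
  let h = 2166136261;
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return h >>> 0;
}

// Shard key (customer ID) → shard host
function shardHostFor(customerId: string): string {
  return shards[hashKey(customerId) % shards.length];
}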

Advantages

  • Virtually unlimited scalability.
  • Reduced query latency (localized data access).
  • Fault isolation — failure in one shard doesn’t affect others.

Challenges

  • Complex to implement and maintain (requires routing layer).
  • Re-sharding (redistributing data) is difficult.
  • Cross-shard queries are slower (Scatter-Gather problem).

Outcome

  • Enables global scalability across multiple continents.

Pattern 7: Data Centre-wise Partitioning

Concept

Deploy region-specific data centers to minimize network latency and improve reliability.

Implementation

  • Set up multiple data centers across continents (e.g., Asia, Europe, America).
  • Implement cross–data center replication for synchronization and disaster recovery.
  • Route user requests to the nearest data center using Geo-DNS or load balancers.
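
Geo-routing itself is usually handled by Geo-DNS or a global load balancer, but the idea reduces to a region-to-endpoint lookup. A toy sketch with hypothetical regional endpoints:

// Hypothetical regional endpoints, one per data center
const dataCenters: Record<string, string> = {
  asia: "https://api.asia.cab-app.example",
  europe: "https://api.eu.cab-app.example",
  america: "https://api.us.cab-app.example",
};

// Route each user to the nearest (home) data center, with a default fallback
function endpointFor(userRegion: string): string {
  return dataCenters[userRegion] ?? dataCenters["america"];
}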

Advantages

  • Reduced latency for geographically distributed users.
  • Improved availability and disaster recovery.
  • Enables data locality and compliance with regional regulations.

Outcome

  • System becomes globally resilient, capable of serving millions of requests per minute.

Summary Table of Scaling Patterns

| Pattern | Technique | Goal | Key Benefit | Challenge |
|---|---|---|---|---|
| 1 | Query Optimization & Connection Pooling | Performance tuning | Low-cost improvement | Limited scalability |
| 2 | Vertical Scaling (Scale-Up) | Hardware upgrade | Quick performance boost | Expensive long-term |
| 3 | CQRS (Master–Slave) | Separate reads/writes | Load balancing | Replication lag |
| 4 | Multi-Primary Replication | Distributed writes | High write throughput | Data conflicts |
| 5 | Functional Partitioning | Data by domain | Manageability | Complex joins |
| 6 | Horizontal Scaling (Sharding) | Split data by key | Massive scalability | Complex management |
| 7 | Data Centre Partitioning | Geo-distribution | Low latency, disaster recovery | High operational cost |

Conclusion

In the Cab Booking App’s journey:

  • The system evolved from a single small database to a globally distributed architecture.
  • Each scaling pattern introduced a new layer of performance, availability, and fault tolerance.
  • The trade-off was increasing system complexity.

Ultimately:

“Database scaling is not a single step — it’s an evolutionary process that balances performance, cost, and complexity as the business grows.”

These seven scaling patterns represent the progressive path of DB optimization, used by real-world companies like Uber, Ola, and Lyft to manage massive, global-scale systems efficiently.


...