Master your next Database interview with our comprehensive collection of questions and expert-crafted answers. Get prepared with real scenarios that top companies ask.
A trigger in a database is a special type of stored procedure that automatically runs, or triggers, in response to certain events on a particular table or view. These events could be changes like an update, insert, or delete made on a table's data.
Triggers can be used for a variety of purposes. They're often used to enforce business rules or data integrity, automate system tasks, log changes or even replicate data. For example, you might create a trigger that updates the 'last_modified' field in a record every time a certain field in the record is updated.
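To make the 'last_modified' example concrete, here is a minimal sketch using SQLite through Python's sqlite3 module. The table and trigger names are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE records (
        id INTEGER PRIMARY KEY,
        name TEXT,
        last_modified TEXT
    )
""")
# AFTER UPDATE trigger: stamp last_modified whenever the name column changes
conn.execute("""
    CREATE TRIGGER touch_last_modified
    AFTER UPDATE OF name ON records
    BEGIN
        UPDATE records
        SET last_modified = datetime('now')
        WHERE id = NEW.id;
    END
""")
conn.execute("INSERT INTO records (id, name) VALUES (1, 'draft')")
conn.execute("UPDATE records SET name = 'final' WHERE id = 1")
stamp = conn.execute("SELECT last_modified FROM records WHERE id = 1").fetchone()[0]
print(stamp)  # filled in automatically by the trigger
```

The application never touched last_modified; the database filled it in on its own, which is exactly the kind of bookkeeping triggers are good at.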
Triggers are commonly classified by timing: BEFORE triggers fire before the triggering operation's changes are applied, and AFTER triggers fire after the operation completes (some systems also support INSTEAD OF triggers on views). Independently of timing, a trigger can be defined at row level, firing once for each affected row, or at statement level, firing once per statement.
But while triggers can be useful, they can also make debugging more complex because they run automatically in the background, and can have a performance impact if not used judiciously. Therefore, it's important to be mindful of when and where you're using them.
ACID stands for Atomicity, Consistency, Isolation, and Durability. They are a set of properties that ensure reliable processing of data transactions in a database system.
Atomicity means that a transaction (a sequence of operations) is treated as a single, indivisible unit. So, all the operations within a transaction are either completed successfully, or if any operation fails, the entire transaction fails and the database remains unchanged - it's all or nothing.
Consistency refers to the database staying in a consistent state before and after the transaction. This means that any transaction will take the database from one valid state to another, while maintaining the predefined rules.
Isolation ensures that transactions running concurrently do not affect each other; each transaction runs as if it were the only one in the system.
Finally, Durability guarantees that once a transaction is committed, it remains so, even in the case of a system failure. This means that the database remembers every change made in a successful transaction and those changes persist even if there's a power loss or crash right after.
These properties are crucial for ensuring the integrity of data in any system that works with sensitive or important transactions - like banking or financial systems.
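Atomicity in particular is easy to demonstrate. The sketch below uses SQLite via Python's sqlite3 module (the accounts table is invented for illustration): a CHECK constraint fails partway through a transaction, and the whole transaction is rolled back, leaving the data untouched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY,"
    " balance INTEGER CHECK (balance >= 0))")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        # This update violates the CHECK constraint (balance would go negative)
        conn.execute("UPDATE accounts SET balance = balance - 200 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 200 WHERE id = 2")
except sqlite3.IntegrityError:
    pass

# Atomicity: neither update survived, balances are unchanged
balances = conn.execute("SELECT balance FROM accounts ORDER BY id").fetchall()
print(balances)  # [(100,), (50,)]
```

It was all or nothing: because one statement failed, the database looks as if the transaction never ran.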
There are three main types of relationships in a database: one-to-one, one-to-many, and many-to-many.
In a one-to-one relationship, each row in one database table is related to one and only one other row in another table. For example, in a system where you're storing social security numbers of employees, each employee has a unique SSN, and each SSN belongs to a single employee.
A one-to-many relationship is the most common type. Here, one row in a table can relate to many rows in another table. Take a book store: one publisher publishes many books, but each book is published by only one publisher.
The last type is many-to-many, where one or more rows in a table are associated with one or more rows in another table. Let's go back to the book store. One book can be written by multiple authors and each author can write multiple books. This is often handled by using a third, linking table, to tie together the books and authors in a many-to-many relationship.
Understanding these relationships is crucial for organizing and structuring data within a relational database.
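The book store example above can be sketched as a small schema. This uses SQLite through Python's sqlite3 module, and all table names are invented for illustration; the book_authors linking table is what makes the many-to-many relationship work:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- one-to-many: one publisher, many books
    CREATE TABLE publishers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (
        id INTEGER PRIMARY KEY,
        title TEXT,
        publisher_id INTEGER REFERENCES publishers(id)
    );
    -- many-to-many: books <-> authors via a linking table
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book_authors (
        book_id   INTEGER REFERENCES books(id),
        author_id INTEGER REFERENCES authors(id),
        PRIMARY KEY (book_id, author_id)
    );

    INSERT INTO publishers VALUES (1, 'Acme Press');
    INSERT INTO books VALUES (1, 'SQL Basics', 1), (2, 'Advanced SQL', 1);
    INSERT INTO authors VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO book_authors VALUES (1, 1), (1, 2), (2, 1);  -- book 1 has two authors
""")
# Resolve the many-to-many relationship through the linking table
rows = conn.execute("""
    SELECT a.name FROM authors a
    JOIN book_authors ba ON ba.author_id = a.id
    JOIN books b ON b.id = ba.book_id
    WHERE b.title = 'SQL Basics'
    ORDER BY a.name
""").fetchall()
print(rows)  # [('Ada',), ('Grace',)]
```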
Data normalization is the process of structuring a relational database so data is stored logically, with as little duplication as possible.
At a practical level, it helps with two things: reducing duplicated data, and preventing the update, insert, and delete anomalies that duplication causes.
Instead of keeping everything in one big table, you split data into related tables and connect them with keys.
Simple example:
If you store customer_name, customer_address, product_name, and product_price in every order row, the same customer and product details get repeated again and again.
A normalized design would split that into tables like:
- Customers
- Products
- Orders
- OrderItems

Then the Orders table stores references like customer_id, and OrderItems stores references like product_id.
Why that matters: each fact lives in one place, so an update happens once instead of in every duplicated row, and the data cannot drift out of sync.
Normalization is typically discussed in forms like 1NF, 2NF, and 3NF.
In short, it is about organizing data so each fact is stored in the right place, once, and relationships are handled through keys rather than repetition.
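The split described above can be sketched as a concrete schema. This is a minimal illustration using SQLite via Python's sqlite3 module; the table and column names follow the example in the text:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers   (id INTEGER PRIMARY KEY, name TEXT, address TEXT);
    CREATE TABLE products    (id INTEGER PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE orders      (id INTEGER PRIMARY KEY,
                              customer_id INTEGER REFERENCES customers(id));
    CREATE TABLE order_items (
        order_id   INTEGER REFERENCES orders(id),
        product_id INTEGER REFERENCES products(id),
        quantity   INTEGER,
        PRIMARY KEY (order_id, product_id)
    );
    INSERT INTO customers VALUES (1, 'Dana', '12 Elm St');
    INSERT INTO products  VALUES (10, 'Widget', 9.99);
    INSERT INTO orders    VALUES (100, 1);
    INSERT INTO order_items VALUES (100, 10, 3);
""")
# Customer and product details each live in one row; orders only carry references
row = conn.execute("""
    SELECT c.name, p.name, oi.quantity
    FROM orders o
    JOIN customers   c  ON c.id = o.customer_id
    JOIN order_items oi ON oi.order_id = o.id
    JOIN products    p  ON p.id = oi.product_id
""").fetchone()
print(row)  # ('Dana', 'Widget', 3)
```

If Dana's address changes, only one row in customers needs updating, no matter how many orders she has placed.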
DDL, or Data Definition Language, and DML, or Data Manipulation Language, are both subsets of SQL, but they're used for different purposes.
DDL is used for defining, altering, or dropping data structures in the database. For example, you use DDL when you want to create, alter, or delete tables, or when you want to define indices or relationships among tables. The DDL commands include CREATE, ALTER, DROP, TRUNCATE, and others.
On the other hand, DML is used for manipulating the data within these structures. So if you want to insert data into a table, update existing data in a table, or delete data from a table, you would use DML commands. These commands include INSERT, UPDATE, DELETE, and SELECT.
In summary: use DDL when you need to work on the structure of the database, and use DML when you need to work on the data within the database.
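A quick side-by-side, sketched with SQLite via Python's sqlite3 module (the employees table is invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define and change the structure
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("ALTER TABLE employees ADD COLUMN title TEXT")

# DML: work with the data inside that structure
conn.execute("INSERT INTO employees (id, name, title) VALUES (1, 'Sam', 'Analyst')")
conn.execute("UPDATE employees SET title = 'Senior Analyst' WHERE id = 1")
result = conn.execute("SELECT name, title FROM employees").fetchall()
print(result)  # [('Sam', 'Senior Analyst')]

# DDL again: drop the structure entirely (the data goes with it)
conn.execute("DROP TABLE employees")
```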
OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing) are two key types of data processing systems and they each serve different purposes.
OLTP systems are commonly interactive and designed for real-time business operations, such as financial transactions in which data must be captured, stored, and updated in real time. They cover routine business operations like sales, receipts, and updates. The main emphasis in such systems is on fast query processing, maintaining data integrity in multi-access environments, and effectiveness measured in transactions per second.
OLAP, on the other hand, focuses on providing complex, in-depth, multi-dimensional, data analysis. It doesn't handle day-to-day transactions. Instead, it's used for forecasting, planning, and decision-support tasks, drawing from the data stored in databases and processed by OLTP systems. The goal here is to analyze data for trends over time or other data analytics tasks. It won't be updating as often, but when it does, there'll be large batch loads of new data.
Simply put, OLTP is more about the daily operation of a business focusing on transactional or data integrity, whereas OLAP is about strategic business decisions using a wide set of data for in-depth analysis.
A primary key serves a vital role in a relational database. It's a unique identifier for each record in a table: each table row (a record) has a distinct primary key that allows you to pinpoint that specific record. No two rows can share a primary key, and it cannot be null. For example, in an employee table, the Employee ID could serve as a primary key.
A foreign key, on the other hand, is a column (or collection of columns) in a table that provides a link to the primary key of another table. The purpose of the foreign key is to ensure referential integrity within your data. It points to the primary key in another table, creating a link between the two. Taking the earlier example, in an additional table that stores Project Allocation details for each employee, the Employee ID could be used as a foreign key to link the project with the specific employee.
So in simple terms, while the primary key uniquely identifies a record in a table, a foreign key in a table refers to the primary key of another table.
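The referential-integrity side of this is worth showing. Here is a minimal sketch using SQLite via Python's sqlite3 module (note that SQLite only enforces foreign keys when the pragma is enabled; table names follow the example in the text):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE project_allocations (
        project     TEXT,
        employee_id INTEGER NOT NULL REFERENCES employees(employee_id)
    );
    INSERT INTO employees VALUES (7, 'Priya');
    INSERT INTO project_allocations VALUES ('Migration', 7);  -- valid: employee 7 exists
""")
try:
    # Invalid: there is no employee 99 to point at
    conn.execute("INSERT INTO project_allocations VALUES ('Migration', 99)")
except sqlite3.IntegrityError as e:
    fk_error = str(e)
    print("rejected:", fk_error)
```

The database itself refuses the dangling reference, which is what "ensuring referential integrity" means in practice.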
I’d explain it simply like this:
Database indexing is a way to help the database find data faster, without scanning every row in a table.
Think of it like the index in a book. If I need one specific topic, I do not flip through every page. I go straight to the right section. A database index works the same way.
What an index does:
- It creates a structured lookup path for one or more columns
- It helps the database quickly locate matching rows
- It is especially useful for WHERE, JOIN, ORDER BY, and sometimes GROUP BY operations
Why it matters:
- Faster reads and query performance
- Better response times for large tables
- Less full table scanning, which saves resources
But there is a tradeoff:
- Indexes take extra storage
- They add overhead on INSERT, UPDATE, and DELETE
- Too many indexes can actually hurt performance
So the real goal is not to index everything. It is to index the right columns, based on how the application queries the data.
For example:
- Good candidates are columns frequently used in filters, joins, or sorting
- Poor candidates are columns with low selectivity, like a boolean field, unless there is a very specific use case
In practice, I think of indexing as a performance optimization tool. It is one of the easiest ways to improve read performance, but it has to be used intentionally and balanced against write cost.
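You can actually watch an index change the access path. This sketch uses SQLite via Python's sqlite3 module and its EXPLAIN QUERY PLAN output (the orders table and index name are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, i % 1000, float(i)) for i in range(10_000)])

def plan(sql):
    # The fourth column of EXPLAIN QUERY PLAN output describes the access path
    return " ".join(r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT total FROM orders WHERE customer_id = 42"
before = plan(query)
print(before)  # a SCAN: the whole table is read

# Index the column used in the WHERE clause
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan(query)
print(after)   # a SEARCH using idx_orders_customer
```

Same query, same data: the planner switches from scanning every row to seeking directly through the index.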
SQL Injection is a common web hacking technique where attackers insert malicious SQL code into input fields or manipulate parameters in the URL directly. This can trick the system into executing unintended SQL commands, possibly leading to data leakage, data manipulation, or even full system compromise.
To prevent SQL injection, there are several strategies:
One common method is using parameterized queries or prepared statements. In these, SQL commands are defined with placeholders for the inputs, and actual values are then provided separately. This ensures that the data cannot be interpreted as SQL commands, no matter what inputs are provided.
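The contrast is easy to demonstrate. This sketch uses Python's sqlite3 module; the users table and the injected string are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.execute("INSERT INTO users VALUES ('alice', 0)")

malicious = "alice' OR '1'='1"

# Vulnerable pattern: string concatenation lets the input become SQL.
# The WHERE clause turns into: name = 'alice' OR '1'='1' -- always true
unsafe = conn.execute(
    "SELECT count(*) FROM users WHERE name = '" + malicious + "'"
).fetchone()[0]
print(unsafe)  # 1 -- the injected OR clause matched every row

# Parameterized query: the value is bound as data, never parsed as SQL
safe = conn.execute(
    "SELECT count(*) FROM users WHERE name = ?", (malicious,)
).fetchone()[0]
print(safe)    # 0 -- no user is literally named "alice' OR '1'='1"
```

With the placeholder, the attacker's string is just an ordinary value that happens not to match anything.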
Another strategy is to use stored procedures, which can limit what SQL code can be run.
Input validation and sanitization are crucial too. Never trust user input blindly. Validate input to make certain it's what you expect in terms of type, length, format, and range.
Lastly, using the principle of least privilege for database access can mitigate the damage if an SQL injection attack does occur. Only grant the minimum permissions necessary for a user or a process, so even if an attack is successful, it can't cause extensive damage.
I handle database failures with a simple framework:
A solid answer should show both sides:
- how you react in the moment
- what you put in place beforehand so recovery is smooth
In practice, my approach looks like this:
Is this a primary database issue, storage issue, network issue, or corruption issue?
Then, prioritize recovery
I keep stakeholders updated with clear status, ETA, and business impact.
Protect data integrity
I validate consistency before declaring the system healthy.
Investigate root cause
Common causes are disk failures, memory pressure, bad queries, locking issues, corruption, or application changes.
After recovery, I do a post-incident review
Example:
In one environment, a production database became unavailable because the primary server ran into storage issues overnight.
The key for me is this: recover service quickly, avoid unnecessary data loss, verify integrity, then fix the underlying issue so it does not repeat.
The best way to answer this kind of question is:
Early in my career, I made a classic database mistake.
I needed to remove a specific set of rows from a table, and I ran a DELETE statement without a proper WHERE clause. Instead of deleting a small subset, I wiped the whole table.
I caught it right away and escalated immediately. I let my lead know, stopped any follow-up activity, and we moved into recovery mode.
What I did next:
- Confirmed exactly what was deleted
- Helped restore the table from backup
- Validated the restored data against application records and logs
- Identified what recent transactions were missing and worked with the team to recreate what we could
There was some impact. We lost a small amount of recent data that happened after the last backup, and it caused extra work for the team. It was painful, but it permanently changed how I handle production data.
After that, I became much more disciplined:
- I always preview destructive changes with a SELECT first
- I use transactions whenever possible, so I can roll back if something looks wrong
- I avoid running direct production deletes without peer review for sensitive tables
- I make sure backups and recovery steps are verified before high-risk changes
- I test the statement in lower environments first
- For larger deletes, I prefer batch operations and safeguards instead of one big command
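The preview-then-delete discipline above can be sketched in a few lines. This uses SQLite via Python's sqlite3 module; the sessions table is invented for illustration, and the rowcount check is one possible safeguard, not the only one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (id INTEGER PRIMARY KEY, expired INTEGER)")
conn.executemany("INSERT INTO sessions VALUES (?, ?)", [(i, i % 2) for i in range(10)])
conn.commit()

# 1. Preview: a SELECT with the same WHERE clause shows what the DELETE would touch
to_delete = conn.execute("SELECT count(*) FROM sessions WHERE expired = 1").fetchone()[0]
print("will delete:", to_delete)  # 5

# 2. Run the DELETE inside a transaction and sanity-check the row count
cur = conn.execute("DELETE FROM sessions WHERE expired = 1")
if cur.rowcount == to_delete:
    conn.commit()
else:
    conn.rollback()  # something unexpected matched; keep the data

remaining = conn.execute("SELECT count(*) FROM sessions").fetchone()[0]
print(remaining, "rows remain")  # 5
```

If the DELETE had touched more rows than the preview predicted, the rollback would have restored everything.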
That mistake taught me a lot about operational discipline. Since then, I've been very careful with change validation, recovery planning, and putting guardrails in place before touching production data.
I’d answer this by showing a clear security mindset, not just listing tools.
A good structure is:
Then give a practical example of what you’d actually do.
For me, securing a database starts with limiting who can get in and what they can do.
After that, I focus on reducing exposure.
Then I look at application-layer risks, especially SQL injection.
Encryption is another core piece.
Monitoring and auditing matter because prevention is not enough.
I’d also make sure the environment stays hardened over time.
And finally, I always include recovery.
A concrete example would be something like this:
If I were securing a new production PostgreSQL environment, I’d:
That’s usually how I think about it: layered security, least privilege, visibility, and recoverability.
A data warehouse is a large, central repository of data that has been collected and integrated from various sources. Data is stored there over long periods, enabling users to access historical data. It is primarily used for reporting and analysis.
The key difference between a database and a data warehouse lies in the type of data they store and how they store it. A database is used for day-to-day operations and transactions like CRUD operations (Create, Read, Update, Delete). This is often referred to as Online Transaction Processing (OLTP).
On the other hand, a data warehouse is structured to make reporting and data analysis faster and easier. It's not used for day-to-day transactions but rather to guide business decisions by allowing data analysts to gather insights from the data. This process is often referred to as Online Analytical Processing (OLAP).
Another major difference is that while data in a database is usually normalized, in a data warehouse, data is often denormalized. This means you'll usually have more redundancy in a data warehouse, but data retrieval for complex queries can be faster. In simple terms, a database is optimal for read-write activities, while a data warehouse is designed for read-heavy operations.
Database denormalization is the process of combining tables to reduce the degree of normalization in a database. It's essentially the opposite of normalization, where data is broken into smaller tables to avoid redundancy and improve data integrity.
Denormalization, on the other hand, aims to improve the read performance of a database at the cost of losing some write performance by adding redundant copies of data. It reduces the amount of table joins needed to collect the data, leading to faster data retrieval.
It can be useful in specific scenarios such as when read performance is critical, like in large databases handling mostly read operations or with real-time systems that require very fast response times. It's also suited for systems where data updates are infrequent, thus the drawbacks of maintaining consistency amidst redundancy don't pose a huge challenge. But care needs to be taken as denormalization can lead to anomalies and data inconsistency if not managed properly. Also, it's not typically the first step in database design, only to be considered when there's a specific performance issue that needs addressing.
A database transaction is a logical unit of work that contains one or more SQL statements. It's a sequence of operations performed as a single logical unit of work. A transaction has a clear beginning and end, usually encapsulated within the BEGIN TRANSACTION, COMMIT, and ROLLBACK commands.
A key aspect of transactions is that they are meant to create a reliable unit of work. They adhere to the four ACID properties: Atomicity, Consistency, Isolation, and Durability. That means they're designed to be completely done or not done at all (Atomicity), they ensure the database remains in a legal state before and after the transaction (Consistency), they operate separately from one another (Isolation), and once done, their results are permanent (Durability).
For example, in a banking application, transferring money from one account to another involves subtracting the amount from the first account and adding it to the second. These two SQL operations together would form a single transaction, and both need to succeed for the transaction to be successful. If either fails, the entire transaction would be rolled back. This helps maintain the integrity of the database.
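The banking example can be sketched as follows, using SQLite via Python's sqlite3 module (the accounts table and transfer function are invented for illustration; `with conn:` commits on success and rolls back on any exception):

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move money between accounts as one atomic transaction."""
    with conn:  # commit on success, roll back the whole unit on any exception
        cur = conn.execute(
            "UPDATE accounts SET balance = balance - ? "
            "WHERE id = ? AND balance >= ?", (amount, src, amount))
        if cur.rowcount != 1:
            raise ValueError("insufficient funds or unknown account")
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                     (amount, dst))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 0)")

transfer(conn, 1, 2, 60)           # succeeds: both updates commit together
try:
    transfer(conn, 1, 2, 500)      # fails: the whole transaction rolls back
except ValueError:
    pass
final = conn.execute("SELECT balance FROM accounts ORDER BY id").fetchall()
print(final)  # [(40,), (60,)]
```

The failed transfer left no half-applied debit behind, which is exactly the rollback behavior described above.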
SQL views are essentially saved queries. They're a virtual or logical table made up of the result set from a query. Even though they're based on tables, views don't store data themselves - they just reflect data present in tables upon which they're defined.
Views can be really useful for a variety of reasons. For one, they can simplify complex queries. If you're regularly running a complicated query, you can save that as a view and then just select the data from the view, like you would from a regular table.
They're also handy for security purposes. If you want users to be able to access only certain parts of the data in a table, you can create a view that only includes the acceptable fields and give the users access to that view, instead of the entire table.
Additionally, views can provide a consistent, unchanged interface even if the underlying data changes. For instance, if you change table names or move things around, views referencing those tables will still work, which can be a big advantage if you're dealing with lots of database changes.
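The security use case is easy to show. This sketch uses SQLite via Python's sqlite3 module; the table, view, and column names are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER);
    INSERT INTO employees VALUES (1, 'Ana', 90000), (2, 'Ben', 70000);

    -- A view exposing only the non-sensitive columns
    CREATE VIEW employee_directory AS
        SELECT id, name FROM employees;
""")
# Query the view like a table; salary is not reachable through it
visible = conn.execute("SELECT * FROM employee_directory ORDER BY id").fetchall()
print(visible)  # [(1, 'Ana'), (2, 'Ben')]

# The view stores no data of its own: new rows in the table show up immediately
conn.execute("INSERT INTO employees VALUES (3, 'Caro', 80000)")
count = conn.execute("SELECT count(*) FROM employee_directory").fetchone()[0]
print(count)  # 3
```

Granting users access to employee_directory instead of employees hides the salary column while keeping the data live.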
I usually think about database performance in layers. Start by finding the real bottleneck, then fix the highest-impact issue first.
My approach is pretty simple:
I do not like guessing, because a lot of performance issues look similar from the outside.
Fix the query path
Check joins carefully, especially on large tables.
Tune indexes
Balance read performance with write cost, because too many indexes hurt inserts and updates.
Look at schema and data distribution
For very large tables, consider partitioning, archiving, or summary tables.
Watch concurrency and operational factors
A few concrete examples:
Slow reporting query
I added a composite index matching the WHERE and JOIN pattern, and rewrote the query to avoid SELECT *. Runtime dropped to under 2 seconds.
Write-heavy table with too many indexes
Insert throughput improved noticeably, and read performance stayed within SLA.
Parameter-sensitive query issue
That made response times much more consistent.
Large historical table
If I had to boil it down, I optimize databases by using evidence, not assumptions. I look at workload patterns, tune queries and indexes for the real access path, and make sure the design still holds up as data volume grows.
A stored procedure is basically a named set of SQL statements that lives inside the database and can be executed whenever you need it.
Think of it like packaging a repeatable database task into one callable unit.
What it usually includes:
- SQL logic for SELECT, INSERT, UPDATE, DELETE
- Input parameters, and sometimes output parameters
- Control flow like IF, loops, and error handling
- Business rules that should run close to the data
Why teams use stored procedures:
- Reuse: write the logic once and call it many times
- Consistency: everyone uses the same database logic
- Performance: less back-and-forth between the app and the database
- Security: users can be given permission to run the procedure without direct access to underlying tables
- Maintainability: changes can be made centrally in the database
Simple example:
- Instead of an application sending 5 separate SQL statements to process an order, you create a procedure like ProcessOrder(order_id)
- The procedure validates the order, updates inventory, writes audit records, and commits the transaction
- The app just calls that one procedure
One important nuance:
- Stored procedures are powerful, but you do not want to cram all application logic into them
- They work best for data-heavy operations, validations, transactional workflows, and tasks that benefit from being close to the database
So in plain terms, a stored procedure is a reusable, server-side database program for handling common or complex data operations efficiently and securely.
The easiest way to explain it is this:
Here’s the difference in practical terms:
Clustered index
The index defines the physical order of the rows in the table, so a table can have only one, often on a key like CustomerID. Usually a good fit for primary keys or columns used in range searches.

Non-clustered index
A separate structure that stores the indexed key values plus pointers back to the actual rows, so a table can have many of them.
A simple example:
- If EmployeeID is a clustered index, the employee rows are physically organized by EmployeeID
- If LastName is a non-clustered index, SQL Server keeps a separate structure sorted by LastName, with references back to the full employee rows

A few interview-friendly points to mention:
- Clustered indexes work well for range queries using BETWEEN, sorting, and sequential access
- Every extra index adds write overhead on INSERT, UPDATE, and DELETE, because each index also has to be maintained

If I were answering in an interview, I’d keep it tight:
“A clustered index defines the physical order of rows in the table, so there can only be one. A non-clustered index is a separate structure that stores key values and pointers to the actual rows, so you can have many of them. Clustered indexes are ideal for range-based access, while non-clustered indexes are useful for fast lookups on other search columns.”
A deadlock in a database is a situation where two or more transactions permanently block each other by each holding a lock on a resource that the other transactions are trying to lock. This creates a circular chain of transactions where every transaction in the chain is waiting for a resource that is locked by the next transaction in the chain.
Let's take an example: let's say transaction A has locked resource R1 and is waiting for resource R2, which is locked by transaction B. Meanwhile, transaction B is waiting for resource R1, which is locked by transaction A. Neither transaction can proceed, and they're stuck in a deadlock.
Deadlocks are problematic because they leave resources unavailable to other transactions, potentially bringing parts of the system to a halt. To deal with deadlocks, Database Management Systems usually have a mechanism to detect and handle them, often by aborting one of the transactions and rolling back its changes. It then restarts the aborted transaction, once a deadlock situation is resolved.
A good way to answer this is to pick 2 to 3 real challenges, explain the impact, then show what you did and the result.
A simple structure is:
Here is how I’d answer it:
A few come up pretty regularly.
• Performance at scale
As data grows, queries that used to be fine can suddenly become slow and expensive.
In one environment, we had a reporting workload starting to affect transactional performance. I dug into execution plans, looked at missing and unused indexes, and found a couple of queries doing full scans when they did not need to. I rewrote those queries, added targeted indexes, and cleaned up part of the schema that was causing unnecessary joins.
That brought response times down a lot and reduced pressure on the server during peak hours.
• Data integrity and consistency
This gets harder when multiple systems are writing to the same database or when business rules live in too many places.
I usually handle that by pushing as much consistency as possible into the database design, things like proper keys, constraints, validation rules, and solid transaction handling. On one project, we were seeing occasional data mismatches between upstream and downstream systems. I helped tighten the schema rules and added reconciliation checks so bad data was caught early instead of surfacing later in reporting.
That made the data more trustworthy and cut down on manual cleanup.
• High availability and recovery
A database is only as good as your ability to recover it when something goes wrong.
I’ve worked on backup and restore planning, replication setups, and failover testing. The big lesson for me is that backups are not enough unless you actually test recovery. In one case, we improved the recovery process by documenting runbooks, validating backups regularly, and running recovery drills. That gave the team a lot more confidence and reduced recovery risk.
• Keeping up with changing database tech
The tooling changes fast, and it is easy to chase trends instead of solving the actual problem.
I try to stay practical. I keep up with new features and platforms, but I evaluate them based on workload, consistency needs, scaling patterns, and operational overhead. That helps me choose the right tool instead of the newest one.
What I’ve learned is that most database challenges are manageable if you stay proactive, monitor closely, and build with reliability in mind from the start.
I’d frame this around the workload first, then the operational reality.
A simple way to answer it is:
In practice, I look at a few key areas:
If the schema changes often, or the data is document-heavy, key-value, or graph-shaped, I’d consider a NoSQL option.
Query patterns
A database that looks good on paper can still be a bad fit if it struggles with the real query patterns.
Consistency and transactions
If eventual consistency is acceptable, that opens up more distributed options.
Scale
It’s not just data volume, it’s concurrency, throughput, and whether the system needs vertical scaling, horizontal scaling, sharding, or multi-region support.
Performance requirements
Some systems are optimized for heavy reads, others for write throughput, analytics, or time-series workloads.
Availability and recovery
A fast database is not useful if recovery is painful.
Operational complexity
A powerful database that the team can’t run, tune, monitor, or troubleshoot well may be the wrong choice.
Ecosystem and cost
For example, if I were building an internal business application with structured data, reporting needs, and transactional workflows, I’d probably lean toward PostgreSQL because it gives strong consistency, solid SQL support, and good flexibility without adding much operational complexity.
If I were designing something like a high-scale event ingestion platform with massive write volume and flexible schema requirements, I’d evaluate a NoSQL or distributed datastore based on write throughput, partitioning strategy, and consistency tradeoffs.
So my general rule is, don’t choose based on popularity. Choose based on data shape, access patterns, reliability needs, and what the team can realistically operate well.
I think about database tuning as a simple loop:
A good answer should show that tuning is not random tweaking. It is a structured process driven by data.
In practice, my process looks like this:
Slow queries and execution plans
Understand the workload
Focus on the queries that matter most to the business
Identify the bottleneck
Poor schema design or over-fetching?
Apply targeted fixes
Archive or purge old data if table size is hurting performance
Test and validate
For example, if users report slowness in an order management system, I would first pull the top slow queries and check wait events. If I find a report query doing full table scans on a large orders table, I would review the execution plan, see whether indexes are missing or not being used, and check if statistics are stale.
From there, I might:
If the query drops from 12 seconds to under 1 second and overall I/O comes down, that is a successful tuning change.
The main thing is to be methodical. Tune based on evidence, fix the highest-impact issue first, and always verify results.
I treat database changes like a controlled release, not a quick edit.
A simple way to structure the answer is:
In practice, my approach looks like this:
For critical changes, I also note recovery steps and rollback time
Review the impact before touching production
Could it lock large tables or affect performance?
Test in lower environments first
I check for bad joins, missing WHERE clauses, data truncation, type mismatches, and long-running locks
Use change scripts, not manual edits
That also makes peer review easier
Make changes in small, safe steps
I avoid risky all-at-once changes when there is a safer phased approach
Use transactions where appropriate
For very large operations, I may batch them carefully to avoid long locks and log growth
Have a rollback plan
That could mean restoring from backup, reversing a script, or switching traffic away temporarily
Validate after the change
Example:
At one job, I had to modify a customer table and backfill a new status field for millions of rows.
The result was a clean rollout with no data loss, no unexpected downtime, and an easy rollback path if we had needed it.
I usually think about database bottlenecks in a few buckets, because it helps narrow down the fix fast:
Then I work from evidence, not guesses. I check query latency, execution plans, wait events, lock stats, cache hit rates, and storage I/O before deciding what to change.
A few common bottlenecks and how I’d handle them:
Fixes might include rewriting the query, filtering earlier, returning fewer columns, batching large operations, or avoiding patterns like functions on indexed columns.
Missing or weak indexing
First I look at which columns appear in WHERE, JOIN, ORDER BY, and GROUP BY clauses. Then I add or adjust indexes, sometimes composite indexes, sometimes covering indexes, and I also remove unused ones because too many indexes hurt writes.
Bad cardinality estimates or stale statistics
In some cases, query hints or plan forcing can help short term, but I’d rather fix the root cause.
Schema design issues
Depending on the use case, I might denormalize selectively, partition large tables, archive old data, or split workloads between OLTP and reporting paths.
Disk I/O bottlenecks
Fixes could include better indexes, more memory, faster storage, query tuning, or reducing unnecessary reads.
Memory pressure
Depending on the platform, I may tune memory settings, reduce query memory usage, or scale the instance.
Locking and blocking
Common fixes are shorter transactions, better indexing, batching writes, changing isolation strategy where appropriate, or moving heavy reads to replicas.
Write-heavy workload pressure
Fixes might include reducing unnecessary indexes, tuning checkpoint or vacuum settings, batching writes, or partitioning data.
Maintenance issues
Regular maintenance matters, whether that’s VACUUM and ANALYZE in PostgreSQL, index maintenance in SQL Server, or similar tasks in other systems.
Configuration or architecture mismatch
In practice, I prioritize fixes like this:
For example, if an API endpoint is slow, I wouldn’t jump straight to adding hardware. I’d first check whether the query is scanning millions of rows because of a missing composite index. If that solves it, great. If not, then I’d look at memory, storage, locking, or whether the workload needs a broader architecture change.
Normalization is about structuring a database so the same fact is stored in one place, not repeated all over the system.
The main goals are:
A simple example:
If you store customer info on every order row, you end up repeating the same customer name, email, and address many times. Normalization would separate that into tables like Customers, Orders, and maybe OrderItems.
The common normal forms are:
1NF, atomic values, no repeating groups

Example:
- Bad: one phone_numbers column containing 123, 456
- Better: one phone number per row, or a separate child table
2NF, no partial dependency on part of a composite key

Example:
- In an OrderItems table with key (order_id, product_id), a column like order_date depends only on order_id, so it belongs in Orders, not OrderItems
3NF, no transitive dependencies between non-key columns

Example:
- If a table has employee_id, department_id, and department_name, then department_name depends on department_id, not directly on employee_id
- So department_name should live in a Departments table
BCNF and 4NF

This usually comes up in edge cases where 3NF still allows certain anomalies.
Example:
- If a person can have many skills and many certifications, storing both in one table can create unnecessary combinations
In practice:
A solid interview answer is usually: "Normalization is the process of organizing data to reduce redundancy and prevent anomalies. 1NF makes data atomic, 2NF removes partial dependency, and 3NF removes transitive dependency. Beyond that, BCNF and 4NF handle more complex dependency issues."
Denormalization is the process of intentionally introducing redundancy into a database by merging tables or adding redundant data to improve read performance. It's the opposite of normalization, which focuses on reducing redundancy and avoiding update anomalies. Denormalization is often used in scenarios where read-heavy workloads require improved access times, such as in OLAP systems or data warehousing, where quick reads are more critical than the overhead of maintaining perfectly normalized data.
For instance, in a reporting database, you might combine frequently accessed tables into one to avoid expensive join operations. This can significantly speed up read queries at the cost of more complex write operations and increased storage requirements.
A primary key is a unique identifier for a record in a database table, ensuring that each entry is distinct. It's a column or a set of columns whose values uniquely identify a row. Primary keys cannot contain NULL values.
A foreign key, on the other hand, is a column or a set of columns in one table that establishes a link between data in two tables. It points to the primary key of another table, ensuring referential integrity by restricting the actions that can cause data inconsistencies.
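Both ideas can be sketched with Python's built-in sqlite3 (note that SQLite only enforces foreign keys when the pragma is enabled); the table and column names here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only with this on
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id)
    )
""")
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (100, 1)")  # valid: customer 1 exists

try:
    conn.execute("INSERT INTO orders VALUES (101, 999)")  # no such customer
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print("orphan order rejected:", rejected)
```

The primary key guarantees each customer row is unique and non-NULL; the foreign key is what turns "an order with customer 999" from silent bad data into a hard error.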
Transactions in a database context are sequences of one or more operations that are executed as a single unit of work. They are essential for maintaining data integrity and consistency, especially in environments where multiple users are accessing and modifying data simultaneously. Transactions follow the ACID properties, which stand for Atomicity, Consistency, Isolation, and Durability. Atomicity ensures that all operations within a transaction are fully completed or none at all. Consistency ensures that the database moves from one valid state to another. Isolation keeps transactions independent from each other until they're complete. Durability guarantees that once a transaction is committed, it remains so, even in the case of a system failure. This structure ensures reliable and error-free database operations.
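Atomicity is the easiest property to demonstrate concretely. A minimal sketch using Python's sqlite3, with a made-up accounts table: two updates run inside one transaction, the second one fails a constraint, and both are rolled back together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

try:
    with conn:  # one transaction: commit on success, roll back on any error
        conn.execute("UPDATE accounts SET balance = balance + 200 WHERE name = 'bob'")
        # This debit would make alice negative, so the CHECK constraint fires...
        conn.execute("UPDATE accounts SET balance = balance - 200 WHERE name = 'alice'")
except sqlite3.IntegrityError:
    pass  # ...and BOTH updates are rolled back, not just the failing one

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # alice and bob are both unchanged
```

Bob's credit had already executed when the debit failed, yet neither survives: that is the "all or nothing" behavior the paragraph describes.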
An index is a separate data structure the database uses to find rows faster, without scanning the whole table.
The simplest way to think about it:
- A table scan is like reading every page in a book
- An index is like using the book’s index to jump straight to the right section
Why it helps:
- Speeds up WHERE lookups
- Helps with JOIN conditions
- Can improve ORDER BY and sometimes GROUP BY
- Reduces the amount of data the database has to read
In practice, an index usually stores:
- The indexed column values in an organized structure, often a B-tree
- A reference to the actual row location
So if I query by customer_id, the database can use the index to go straight to matching rows instead of checking every record.
The trade-off:
- Reads get faster
- Writes can get slower
That’s because every INSERT, UPDATE, or DELETE may also need to update the index.
So the goal is not to index everything. It’s to index the columns that are queried often and provide the most value.
A stored procedure is basically a named set of SQL logic saved inside the database.
You use one when you want to package database work into something reusable, controlled, and easy to call.
Why they’re useful:
They can take parameters, like a CustomerId or date ranges.

A simple example: a CreateOrder procedure that wraps the INSERT and UPDATE statements for an order into a single call.

I’d use stored procedures when:
One thing I’d add in an interview, they’re useful, but I wouldn’t put every bit of business logic in stored procedures. It depends on the team, the application architecture, and how much logic belongs in the database versus the app layer.
An Entity-Relationship Diagram (ERD) is a visual representation of the database's structure, showcasing how entities like tables and their attributes are related to each other. It's like a blueprint for your database, helping you plan and communicate your database design clearly.
ERDs are important because they help ensure that all stakeholders—like developers, analysts, and business users—have a common understanding of the data. They also help identify potential issues early in the design phase, like redundancy or poor relationships, which can save a lot of time and resources down the line. Overall, they make the database design process more efficient and robust.
INNER JOIN retrieves records that have matching values in both tables involved in the join. If there is no match, the row is not included in the result. It essentially finds the intersection of the two datasets.
OUTER JOIN, on the other hand, extends the functionality to include unmatched rows as well. There are three types of OUTER JOINs: LEFT OUTER JOIN, which returns all records from the left table and the matched records from the right table (plus nulls for non-matches); RIGHT OUTER JOIN, which does the opposite by returning all records from the right table and matched records from the left; and FULL OUTER JOIN, which returns all records when there is a match in either left or right table. If there is no match, the result is NULL on the side that lacks the match.
I’d answer this in a simple structure:
Then I’d give a practical example.
For me, query optimization starts with understanding why the query is slow, not jumping straight to adding indexes.
My approach is usually:
Check the execution plan, review the indexes involved, and avoid SELECT * so the query returns only the columns it needs.

Example:
I had a reporting query that was taking several seconds during peak hours.
Here’s how I worked through it:
The query filtered on customer_id and order_date, but the existing index only covered customer_id, so I added a composite index on both columns. The result was a big drop in logical reads, and the runtime went from seconds to well under a second.
So my mindset is, diagnose first, optimize based on evidence, and always validate with metrics.
I’d explain it in two parts, backup and recovery, then tie both back to the business goal: recover fast, with as little data loss as possible.
A clean way to structure it:
Start by defining RPO and RTO.

Example answer:
Database backup is the process of creating restorable copies of the data, schema, and often transaction logs, so we can recover from failure, corruption, or accidental deletes.
The main backup types are:
Full backup

A complete copy of the database. Usually the most storage and time intensive
Differential backup
Restore is usually full backup plus latest differential
Incremental backup
Recovery can be more involved because you may need multiple backup sets
Transaction log backup, in systems that support it
In practice, a backup strategy is based on recovery requirements:
- RPO: how much data loss is acceptable
- RTO: how quickly the system must be restored

For example:
That gives a balance between storage, backup window, and recovery speed.
Recovery is the process of restoring those backups in the right order to bring the database back to a consistent state.
Typical recovery flow:
A strong answer should also mention that backup is not enough unless it’s tested.
So in a real environment, I’d also include:
If I were answering in an interview, I’d keep it practical:
“Backup is about creating recoverable copies of the database, and recovery is about restoring them to a clean, consistent point after a failure. I usually think in terms of full, differential, incremental, and log backups. The exact mix depends on RPO and RTO. For recovery, you typically restore the full backup first, then apply the differential or incrementals, and then transaction logs if point-in-time recovery is needed. The most important part is making sure backups are actually restorable, so regular recovery drills and validation are part of the process.”
I’d start by choosing the right level of search for the use case.
For basic keyword search inside a relational database, I’d use the database’s native full-text features. That usually means:
For example, in MySQL I’d create a FULLTEXT index and query it with MATCH(...) AGAINST(...). That works well when:
If the search experience needs to be richer, I’d move search into a dedicated engine like Elasticsearch or OpenSearch. That’s usually the better option when you need:
In that setup, the database stays the system of record, and the search engine holds a searchable index of the same data.
The implementation usually looks like this:
My rule of thumb is simple:
That gives you a solution that is fast, maintainable, and matched to the actual complexity of the problem.
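The "native full-text" tier can be sketched with SQLite's FTS5 extension as a stand-in for MySQL's FULLTEXT/MATCH...AGAINST (FTS5 availability depends on how SQLite was built; the table and data here are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 virtual table: SQLite's analogue of a full-text index
conn.execute("CREATE VIRTUAL TABLE articles USING fts5(title, body)")
conn.executemany("INSERT INTO articles VALUES (?, ?)", [
    ("Indexing basics", "B-tree indexes speed up lookups"),
    ("Backup strategy", "Full and incremental backups with log shipping"),
    ("Search engines", "Elasticsearch handles relevance ranking and typo tolerance"),
])

# MATCH consults the inverted index instead of scanning every row;
# rank orders results by relevance
hits = conn.execute(
    "SELECT title FROM articles WHERE articles MATCH ? ORDER BY rank",
    ("backups",),
).fetchall()
print(hits)
```

This is exactly the "good enough" tier: word-level matching with basic relevance. The moment you need fuzzy matching, typo tolerance, or facets, that is the signal to move to a dedicated engine.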
A subquery is essentially a query nested inside another query. It allows you to perform more complex operations by breaking them down into smaller, manageable queries. You might use a subquery when you need to filter or aggregate data in a way that can't be done in a single query. For example, if you want to find employees who earn more than the average salary of their department, you could use a subquery to first calculate the average salary and then use that result in your main query. It helps keep your main query more readable and organized.
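The salary example from the paragraph can be written as a correlated subquery. A minimal sketch in Python's sqlite3, with made-up employee data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", [
    ("ann", "eng", 120), ("ben", "eng", 80),
    ("cal", "ops", 60), ("dee", "ops", 90),
])

# Correlated subquery: each employee is compared to the average
# of their OWN department (the inner query references e.dept)
rows = conn.execute("""
    SELECT name FROM employees e
    WHERE salary > (SELECT AVG(salary) FROM employees WHERE dept = e.dept)
    ORDER BY name
""").fetchall()
print([r[0] for r in rows])
```

Here eng averages 100 and ops averages 75, so only ann and dee qualify; the subquery computes the comparison value that a single flat query could not express.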
A self-join is when a table is joined with itself. This can be useful when you need to compare rows within the same table or when the table contains hierarchical data. For example, if you have an employee table where each row contains an employee and their manager, a self-join can help you list employees alongside their managers by joining the table on the manager ID. Essentially, you treat the single table as if it were two separate tables, enabling comparison or combination of rows.
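The employee/manager case reads like this in SQL: the same table gets two aliases and is joined on the manager ID. A small sqlite3 sketch with made-up names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", [
    (1, "Grace", None),   # Grace has no manager
    (2, "Alan", 1),
    (3, "Edsger", 1),
])

# Same table aliased twice: e plays "employee", m plays "manager"
pairs = conn.execute("""
    SELECT e.name, m.name
    FROM employees e
    JOIN employees m ON e.manager_id = m.id
    ORDER BY e.name
""").fetchall()
print(pairs)
```

Note that Grace drops out because the inner join finds no manager row for her; a LEFT JOIN would keep her with a NULL manager.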
A data warehouse is a centralized repository designed to store large amounts of structured data from different sources. It is optimized for query and analysis rather than transaction processing. While traditional databases, like relational databases, are designed to manage day-to-day operations, such as processing and recording transactions, data warehouses are built for read-heavy tasks, specifically for complex queries and reporting.
The main difference lies in their use-cases and design. Traditional databases are optimized for Online Transaction Processing (OLTP), which involves a large number of short online transactions (like INSERT, UPDATE, DELETE). On the other hand, data warehouses focus on Online Analytical Processing (OLAP), which includes fewer but more complex queries that often require aggregating large data sets. This makes data warehouses ideal for business analytics and decision-making.
I handle concurrency by starting with the business requirement, then choosing the lightest control that still keeps the data correct.
A simple way to structure the answer is:
In practice, I usually think about it like this:
The main concurrency patterns I use are:
Pessimistic locking

Useful for things like inventory decrements or financial balance updates.
Optimistic concurrency
Read the row along with a version number or timestamp, and update only if the version still matches. If the row changed, retry or reject cleanly.
MVCC

Many databases use multi-version concurrency control, so readers don’t block writers and writers don’t block readers.
A concrete example:
Say I am handling ticket inventory for a high-demand event.
I also pay attention to operational issues:
Deadlocks

Add safe retry logic.
Hotspots
Consider sharding counters or batching writes.
Long-running reads

Move them to replicas where possible so they don’t slow or block the primary.
So my overall approach is, use transactions deliberately, keep them short, choose the right isolation and conflict strategy, and design the data access pattern to avoid unnecessary contention.
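The optimistic pattern for the ticket example can be sketched with a version column: the UPDATE only succeeds if nobody changed the row since it was read. A minimal sqlite3 sketch with made-up names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (event_id INTEGER PRIMARY KEY, available INTEGER, version INTEGER)")
conn.execute("INSERT INTO tickets VALUES (1, 10, 0)")

def reserve(expected_version):
    """Optimistic update: succeeds only if the row is unchanged since we read it."""
    cur = conn.execute(
        "UPDATE tickets SET available = available - 1, version = version + 1 "
        "WHERE event_id = 1 AND version = ? AND available > 0",
        (expected_version,),
    )
    return cur.rowcount == 1  # 0 rows touched means a conflict: re-read and retry

available, version = conn.execute(
    "SELECT available, version FROM tickets WHERE event_id = 1"
).fetchone()
first = reserve(version)   # succeeds: version still matches
second = reserve(version)  # fails: version has moved on, caller must re-read
print(first, second)
```

No lock is held between the read and the write; the WHERE clause on `version` is what detects the conflict, which is why this scales well when contention is rare.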
Database replication is just keeping the same data in more than one database server.
In practice, one server writes the data, and one or more other servers get copies of it. Those copies can be updated almost instantly, or with a small delay, depending on the setup.
Why teams use it:
A simple example:
One important detail, replication is not the same as backup.
So the short version is, replication is about copying data across database servers to improve availability, scalability, and resilience.
A cursor is a way to process query results one row at a time.
Think of it like this:
Typical cursor flow:
- DECLARE the cursor for a SELECT query
- OPEN the cursor
- FETCH each row into variables
- CLOSE and DEALLOCATE it when done

When would you use one?
Important caveat:
In most cases, set-based UPDATE, MERGE, window functions, or CTEs are better choices.

A solid interview answer would be:
"A cursor in SQL is a database object used to iterate through a result set one row at a time. You usually declare it for a query, open it, fetch rows in a loop, process each row, then close and deallocate it. Cursors are helpful for procedural, row-by-row logic, but they are generally less efficient than set-based SQL, so I only use them when there is a real need for sequential processing."
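The DECLARE/OPEN/FETCH/CLOSE rhythm can be sketched with Python's DB-API cursor. This is an application-side analogue rather than a server-side SQL cursor, but the row-at-a-time flow is the same; table and data are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO jobs (status) VALUES (?)", [("new",), ("new",), ("done",)])

# execute() plays the role of DECLARE + OPEN,
# fetchone() is FETCH, close() is CLOSE/DEALLOCATE
cur = conn.cursor()
cur.execute("SELECT id, status FROM jobs")
processed = []
while True:
    row = cur.fetchone()   # fetch exactly one row per iteration
    if row is None:        # no more rows to fetch
        break
    processed.append(row)  # row-by-row logic would go here
cur.close()
print(len(processed))
```

The caveat from the answer applies here too: looping row by row is the slow path, so reach for it only when each row genuinely needs sequential, procedural handling.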
A foreign key constraint is used to link two tables together. It ensures that the value in one table matches a value in another table to maintain referential integrity. For instance, if you have a table for orders and another for customers, you might have a foreign key in the orders table that relates to the primary key in the customers table. This prevents you from inserting an order with a non-existent customer ID, ensuring the data remains consistent and meaningful across the database.
A composite key is a combination of two or more columns in a table that together uniquely identify a row. Unlike a primary key that can be a single column, a composite key is necessary when a single column is not sufficient to ensure row uniqueness. For example, in an order detail table, you might use a combination of order_id and product_id as a composite key because neither of them alone is unique, but together they always are. This helps in maintaining the integrity of the data.
UNION combines the results of two queries and removes duplicate records, so you end up with a list of unique rows. UNION ALL also combines the results of two queries but includes all duplicates. So, if you want to see every record, including repeating ones, you’d go with UNION ALL. It's also faster because it doesn't have to check for and remove duplicates.
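The difference is easy to see side by side. A minimal sqlite3 sketch with two made-up tables that share one overlapping value:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (x INTEGER)")
conn.execute("CREATE TABLE b (x INTEGER)")
conn.executemany("INSERT INTO a VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO b VALUES (?)", [(2,), (3,)])

union = conn.execute("SELECT x FROM a UNION SELECT x FROM b ORDER BY x").fetchall()
union_all = conn.execute("SELECT x FROM a UNION ALL SELECT x FROM b ORDER BY x").fetchall()
print([r[0] for r in union])      # duplicates removed
print([r[0] for r in union_all])  # duplicates kept
```

The value 2 appears in both tables: UNION returns it once, UNION ALL returns it twice, and skipping the deduplication step is exactly where UNION ALL's speed advantage comes from.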
A materialized view is a precomputed, stored result of a query.
Unlike a regular view, which runs the query every time, a materialized view actually saves the data on disk. That means reads are much faster, especially for:
The tradeoff is freshness.
Because the data is stored, it has to be refreshed to stay in sync with the underlying tables. Depending on the database, that refresh can happen:
A simple way to explain it in an interview:
I usually mention the main benefit and tradeoff together:
- Benefit: better read performance
- Tradeoff: extra storage and refresh overhead
Example: If a dashboard keeps querying total sales by region from millions of rows, a materialized view can store that pre-aggregated result so the dashboard loads much faster.
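SQLite has no materialized views, so this sketch simulates one with a summary table that must be refreshed explicitly, which makes the freshness trade-off visible; the sales data and names are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [("east", 100), ("east", 50), ("west", 70)])

def refresh():
    # Recompute and store the aggregate, like REFRESH MATERIALIZED VIEW
    conn.execute("DROP TABLE IF EXISTS sales_by_region")
    conn.execute(
        "CREATE TABLE sales_by_region AS "
        "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
    )

refresh()
east = conn.execute("SELECT total FROM sales_by_region WHERE region = 'east'").fetchone()

conn.execute("INSERT INTO sales VALUES ('east', 25)")
stale = conn.execute("SELECT total FROM sales_by_region WHERE region = 'east'").fetchone()
# stale still shows the old total until the next refresh
refresh()
fresh = conn.execute("SELECT total FROM sales_by_region WHERE region = 'east'").fetchone()
print(east, stale, fresh)
```

The dashboard reads the tiny precomputed table instead of aggregating the raw rows, and the stale read between the insert and the refresh is the cost you accept for that speed.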
OLTP (Online Transaction Processing) systems are designed to manage day-to-day transaction data. They focus on speed, efficiency, and ensuring data integrity in multi-access environments. Think of these as your everyday activities like banking transactions, online purchases, or booking flights. They usually involve a lot of short, fast queries.
OLAP (Online Analytical Processing), on the other hand, is built for querying and reporting rather than handling transactions. It's used for complex queries that involve aggregating large datasets, and is optimized for read-heavy operations. This makes OLAP systems ideal for data analysis, business reporting, and forecasting. They're the go-to for people needing to pull insights from large amounts of archived data, such as business trends or sales analytics.
Data types define the kind of value a column in a table can hold, such as integers, floating-point numbers, strings, or dates. They're crucial because they ensure data integrity and efficient use of storage. For instance, storing dates as text could lead to inconsistent formats or invalid dates, whereas using a DATE data type ensures the format and validity are consistent. Also, appropriate use of data types optimizes query performance since the database management system (DBMS) knows exactly what kind of data it's dealing with.
I treat data migration like a small project, not just a copy-paste job.
A clean way to answer this is:
In practice, my approach looks like this:
Understand the scope

Volume of data, downtime limits, and performance risks
Map the data
Identify transformations needed, like date formats, null handling, code value changes, or denormalized to normalized structures
Choose the migration method
The choice depends on data size, cutover window, and whether this is a one-time move or phased migration
Test first
Check for load failures, broken foreign keys, duplicate data, and performance issues
Execute with controls in place
Log errors so bad records can be reviewed without stopping everything
Validate after migration
Application-level testing to confirm the data behaves correctly, not just that it exists
Plan for rollback
Example:
I once migrated data from an older SQL Server system into PostgreSQL for a reporting platform.
The main thing is balancing accuracy, downtime, and recoverability. A successful migration is not just moving data, it is making sure the new system is trustworthy on day one.
A relational database is a database that stores data in tables, with rows and columns, and connects those tables through defined relationships.
The easiest way to think about it is:
- A table represents one kind of thing, like Customers, Orders, or Products
- Columns represent attributes, like CustomerName or OrderDate

What makes it "relational" is the link between tables.
- A primary key uniquely identifies a row in a table
- A foreign key points to a related row in another table

For example, in an online store:
- Customers stores customer info
- Orders stores order info
- Products stores product info

The Orders table might contain a CustomerID, which links each order back to the right customer. That relationship makes it easy to answer questions like:
Relational databases are popular because they give you:
So in simple terms, it is a structured way to store data that is related, and to query those relationships efficiently.
I’d answer this by doing two things:
A clean way to say it:
There are a few major types of databases:
In practice, most of my experience has been with relational and NoSQL systems.
What I’ve used most:
How I think about them:
For example, I’ve used MySQL for systems where normalized data, reporting, and transactional accuracy were important. I’ve used MongoDB in cases where storing nested JSON-like data made development faster and the schema changed more often.
So while I’m familiar with the broader database landscape, the platforms I’ve worked with most directly are MySQL and MongoDB.
I’ve used a pretty wide range of SQL beyond the basic CRUD stuff. When I answer this kind of question, I usually keep it simple:
A strong answer would sound like this:
Some of the more advanced SQL features I’ve used regularly are:
JOINs, especially INNER, LEFT, and sometimes SELF JOIN

For example, combining orders, customers, payments, and shipment data into one reporting query
GROUP BY with aggregate functions like COUNT, SUM, AVG, MIN, and MAX
Things like total sales by region, failed jobs by day, or customer activity by month
Window functions such as ROW_NUMBER(), RANK(), DENSE_RANK(), and SUM() OVER(...)
A common example is identifying the most recent status per customer or calculating cumulative revenue over time
Common Table Expressions, WITH clauses
Especially helpful when breaking business logic into steps instead of writing one huge nested query
CASE expressions
I use CASE a lot for categorization and conditional logic directly in SQL. For example, turning raw values into business-friendly labels like high, medium, and low priority
Subqueries and correlated subqueries
Useful when filtering based on derived results, like finding customers whose spend is above the average for their segment
Set operators like UNION and UNION ALL
I usually prefer UNION ALL when duplicates are acceptable because it performs better
Stored procedures, functions, and views
I’ve worked with these to standardize logic, simplify reporting access, and reduce repeated query code
Temporary tables and sometimes table variables
Helpful for ETL steps, staging intermediate results, or improving readability in multi-step transformations
Transaction control, things like BEGIN, COMMIT, and ROLLBACK
One example, I worked on a reporting query for operations where I had to combine data from several tables, classify records with CASE, aggregate results by business unit, and use ROW_NUMBER() to pull the latest status per item. The first version worked, but it was slow. I rewrote parts of it using a CTE and adjusted the joins, and after that the query was much easier to maintain and performed much better.
So for me, advanced SQL is not just knowing the commands, it’s knowing when to use them to make the query accurate, readable, and efficient.
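The "latest status per customer" pattern from that example combines a CTE with ROW_NUMBER(). A minimal sqlite3 sketch (window functions need SQLite 3.25 or newer; the log data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE status_log (customer_id INTEGER, status TEXT, logged_at TEXT)")
conn.executemany("INSERT INTO status_log VALUES (?, ?, ?)", [
    (1, "new", "2024-01-01"), (1, "active", "2024-02-01"),
    (2, "new", "2024-01-05"), (2, "churned", "2024-03-01"),
])

# The CTE ranks each customer's rows newest-first,
# then the outer query keeps only rank 1 per customer
latest = conn.execute("""
    WITH ranked AS (
        SELECT customer_id, status,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id
                   ORDER BY logged_at DESC
               ) AS rn
        FROM status_log
    )
    SELECT customer_id, status FROM ranked WHERE rn = 1
    ORDER BY customer_id
""").fetchall()
print(latest)
```

This is the kind of query that would otherwise need a correlated subquery per row; the window function does the per-customer ranking in one pass and the CTE keeps the steps readable.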
I’d answer this in layers, because data integrity is not just one feature, it’s a combination of design, rules, and operational discipline.
My approach would be:
- PRIMARY KEY constraints so every row is uniquely identifiable
- FOREIGN KEY constraints to enforce valid relationships
- UNIQUE constraints to prevent duplicate business-critical values
- NOT NULL and CHECK constraints to block bad data at the database level
Design tables carefully
Only denormalize when there’s a clear performance reason and the tradeoff is understood
Protect writes with transactions
Pick the right isolation level based on the workload and concurrency needs
Validate at multiple layers
The database should still enforce core business rules, because every app, script, or integration touching the data needs the same protection
Control change and access
Apply change management for schema updates, migrations, and data fixes
Monitor and recover
A concrete example:
If I’m designing an order system, I’d make sure:
- customers.customer_id is the primary key
- orders.customer_id has a foreign key to customers
- orders.status has a check constraint so it only allows valid values like pending, paid, or shipped
- order_items.quantity cannot be zero or negative
- Creating an order and its line items happens in one transaction, so I don’t end up with an order header but no items
That way, even if a buggy app or manual script tries to insert invalid data, the database still protects itself.
I’d explain it in a simple compare-and-contrast way:
Example answer:
SQL and NoSQL solve different problems.
SQL databases are relational. They store data in tables with rows and columns, and the relationships between tables are a big part of the design. They usually have a defined schema, so the structure is set upfront.
That makes SQL a strong choice when you need:
A classic example is banking, order processing, or ERP systems, where accuracy and transactional integrity really matter.
NoSQL is a broader category. It includes document stores, key-value databases, column-family databases, and graph databases. These systems are usually more flexible with schema, so you can evolve the data model more easily.
NoSQL is a good fit when you need:
The short version is:
It’s not really about one being better than the other. It’s about choosing based on the workload, consistency needs, and how the data is shaped. In a lot of real systems, you end up using both.
A join is how you pull related data from multiple tables into one result set.
The idea is simple:
- You have a common key, like customer_id or order_id
- SQL matches rows across tables based on that key
- The join type decides what to do when a match exists, or does not exist
The main join types I use are:
Inner join

Returns only the rows that match in both tables.

Example:
- orders joined to customers
- You only get orders that have a matching customer record
Left outer join

Keeps all rows from the left table; where there is no match, the right-side columns come back as NULL.

Example:
- Start with customers
- Left join orders
- You get all customers, even the ones who have never placed an order
Right outer join

I do not use this much in practice, because you can usually rewrite it as a left join, which tends to be easier to read
Full outer join
Keeps everything from both sides; unmatched sides show NULL.

Example:
- Useful when comparing two datasets and finding what exists in one but not the other
A couple of other join concepts that are worth knowing:
Cross join

Every row from one table paired with every row from the other. Useful occasionally, but dangerous on large tables because it can explode row counts fast
Self join

A table joined to itself, useful for hierarchies like employees and their managers
In interviews, I usually explain joins as:
- INNER = matched rows only
- LEFT = everything from the left, matched where possible
- RIGHT = everything from the right, matched where possible
- FULL = everything from both sides
That gives the definition, the behavior, and the practical use case quickly.
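The INNER versus LEFT distinction, the one that comes up most, can be shown in a few lines of sqlite3 with made-up customers and orders:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Lin")])
conn.execute("INSERT INTO orders VALUES (10, 1)")  # Lin has placed no orders

inner = conn.execute("""
    SELECT c.name, o.id FROM customers c
    JOIN orders o ON o.customer_id = c.id
    ORDER BY c.name
""").fetchall()
left = conn.execute("""
    SELECT c.name, o.id FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    ORDER BY c.name
""").fetchall()
print(inner)  # matched rows only
print(left)   # Lin kept, with NULL (None) on the order side
```

Same tables, same join condition; only the join type decides whether Lin, the customer with no orders, appears in the result.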
I’d handle large data sets by starting with one thing first, access patterns.
If I know:
- what queries matter most,
- how fast they need to be,
- how much data is growing,
- and the read vs write mix,
then the design decisions get a lot clearer.
My approach would usually look like this:
Start with a sound normalized model, then denormalize only where it clearly helps performance.
Partition early when growth is predictable
It also improves things like archiving and index management.
Be intentional with indexing
I’d validate indexes against real query plans, not just assumptions.
Separate hot and cold data
That keeps operational tables smaller and more efficient.
Plan for scale beyond one box if needed
I treat that as a later-stage decision, because it adds operational complexity.
Design for reporting separately when needed
A concrete example:
On a high-volume transaction system, I’d likely:
- partition the main transaction table by month,
- index account_id and transaction date for the most common lookups,
- keep the last 12 to 18 months in the primary operational path,
- archive older records for compliance and historical access,
- and use summary tables for dashboards instead of aggregating billions of raw rows live.
That gives you a design that stays maintainable, performs well, and can keep scaling as data grows.
A clean way to answer this is:
Here is a strong version:
Sharding is a way to split a large database into smaller pieces called shards.
Each shard holds only part of the data, but usually keeps the same table structure. Instead of one database server storing everything, the data is spread across multiple servers or instances.
A common example is splitting customer data by region or by customer ID range:
- Shard 1 handles users 1 to 1,000,000
- Shard 2 handles users 1,000,001 to 2,000,000
Why teams do it:
- It helps scale horizontally
- It reduces the amount of data each server has to manage
- It can improve query performance and distribute traffic better
The catch is that it adds complexity:
- Choosing the right shard key is critical
- Cross-shard joins and transactions get harder
- Rebalancing data later can be painful
So in simple terms, sharding is a scalability strategy for very large databases, where you divide the data across multiple database nodes instead of keeping it all in one place.
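The routing side of this can be sketched as a hash-based shard key. The shard count and key scheme below are made up for illustration; real systems also need a directory or rebalancing story on top:

```python
import hashlib

NUM_SHARDS = 4  # hypothetical number of shard nodes

def shard_for(customer_id: int) -> int:
    """Hash-based shard routing: stable per key, spreads load evenly,
    and avoids the hotspots that naive range-based splits can create."""
    digest = hashlib.sha256(str(customer_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The same customer always routes to the same shard
assert shard_for(12345) == shard_for(12345)

# And a large key space spreads roughly evenly across shards
counts = [0] * NUM_SHARDS
for cid in range(10_000):
    counts[shard_for(cid)] += 1
print(counts)
```

This also makes the trade-off visible: a query for one customer hits exactly one shard, but a query across all customers now has to fan out to every shard.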
I usually answer this in two parts, process first, example second.
How I approach it:
1. Define the recovery goal.
- What are the RPO and RTO?
- How much data loss is acceptable?
- How fast do we need to recover?
2. Pick the backup types.
- Full, differential, or incremental backups on a schedule
- Transaction log or binlog backups for point-in-time recovery, if the platform supports it
3. Automate and secure it.
Set retention policies
4. Validate it.
A practical answer:
- I use native tools or platform-specific tooling, depending on the database.
- For MySQL, that might be mysqldump for logical backups, or physical backups like Percona XtraBackup for larger systems.
- For SQL Server, I would use full, differential, and transaction log backups.
- For PostgreSQL, I would use pg_dump for logical backups, or base backups plus WAL archiving for point-in-time recovery.
For restores, I keep it simple:
- Identify the right backup set
- Restore the full backup first
- Apply differential or incremental backups if needed
- Apply logs/binlogs/WAL to recover to the desired point in time
- Validate the data and application connectivity after restore
What matters most in production:
- Backups should be automated, not manual
- Restore testing is just as important as taking backups
- I always want at least one copy outside the primary environment
- The backup plan should match business requirements, not just technical preference
Example: In one environment, we had a critical MySQL database with a tight recovery target. I set up nightly full backups, frequent binlog retention for point-in-time recovery, and stored encrypted copies in cloud storage. We also did monthly restore drills in a staging environment. That gave us confidence that if something failed, we could recover quickly and with minimal data loss.
I treat dynamic SQL as something to use carefully, not casually.
My approach is pretty simple:
A few practical rules I follow:
For dynamic ORDER BY or optional columns, map user input to an approved list instead of passing it through directly.

For example, if I am building a search endpoint with optional filters:
I build WHERE clauses only for the filters the user actually supplied, and I whitelist sortable columns to known names like name, created_at, or status.

So the mindset is flexibility with guardrails. Dynamic SQL is useful, but only when it is controlled, parameterized, and easy to audit.
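That filter-building pattern can be sketched in sqlite3. The table, columns, and whitelist here are made up for illustration; the two guardrails are that values always travel as placeholders and the ORDER BY column must come from an approved list:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, status TEXT, created_at TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", [
    ("ann", "active", "2024-01-01"), ("ben", "disabled", "2024-02-01"),
])

ALLOWED_SORT = {"name", "created_at", "status"}  # whitelist, never raw input

def search(filters: dict, sort: str = "name"):
    if sort not in ALLOWED_SORT:
        raise ValueError(f"unsupported sort column: {sort}")
    clauses, params = [], []
    for col in ("status", "name"):       # only known columns become filters
        if col in filters:
            clauses.append(f"{col} = ?")  # value is always parameterized
            params.append(filters[col])
    where = " WHERE " + " AND ".join(clauses) if clauses else ""
    sql = f"SELECT name FROM users{where} ORDER BY {sort}"
    return [r[0] for r in conn.execute(sql, params)]

print(search({"status": "active"}))
print(search({}, sort="created_at"))
```

Anything a user controls either binds as a parameter or maps through the whitelist, so an input like `"name; DROP TABLE users"` can only ever raise an error, never reach the SQL string.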
I’d explain it like this:
These are two different ideas that solve different problems.
Clustering

What it’s for:
- High availability
- Fault tolerance
- Sometimes load distribution
- Better scalability, depending on the setup
How to think about it:
- You have multiple database nodes
- If one node goes down, another can take over
- In some architectures, reads or workloads can be spread across nodes
Example:
- A bank runs its database on a clustered setup with a primary node and standby nodes
- If the primary fails, a standby is promoted so the application keeps running
Important note:
- This is different from a clustered index, which is about how rows are physically organized inside a table
Partitioning

What it’s for:
- Faster queries
- Easier maintenance
- Better handling of very large tables
- Archiving and purging data more efficiently
How it works:
- Data is divided based on a rule
- Common partitioning methods include:
  - Range, like partitioning orders by month
  - List, like partitioning by region
  - Hash, like distributing rows evenly by customer ID
Example:
- A sales table with billions of rows is partitioned by order date
- A query for last month only scans that partition instead of the whole table
Key difference:
- Clustering = multiple servers working together
- Partitioning = one large dataset split into smaller chunks
In short:
- Use clustering when you care about uptime and server-level scalability
- Use partitioning when you care about managing and querying large volumes of data efficiently
A JOIN is how you pull related data from multiple tables into one result set.
Think of it like this:
You have a customers table and an orders table, and they share a common key like customer_id.

Common types:
INNER JOIN

Returns only the rows that match in both tables.

Example: customers who actually have orders
LEFT JOIN
Keeps all rows from the left table; unmatched rows from the right show as NULL.

Example: all customers, even the ones with no orders
RIGHT JOIN
Like a LEFT JOIN, but keeps all rows from the right table; unmatched rows show as NULL. Less commonly used, because most people just swap table order and use LEFT JOIN
FULL OUTER JOIN
If one side has no match, the missing side shows NULL
CROSS JOIN
One practical point: the join condition matters a lot. If you join on the wrong column, or forget the condition entirely, you can get duplicate rows or a huge unintended result set.
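A quick runnable illustration of the INNER vs LEFT JOIN difference, using Python's sqlite3 with made-up customers and orders:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben');
INSERT INTO orders VALUES (10, 1, 25.0);   -- only Ana has an order
""")

inner = con.execute("""
    SELECT c.name, o.order_id
    FROM customers c
    INNER JOIN orders o ON o.customer_id = c.customer_id
""").fetchall()

left = con.execute("""
    SELECT c.name, o.order_id
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.customer_id
""").fetchall()

print(inner)  # [('Ana', 10)]            -- only matching rows
print(left)   # [('Ana', 10), ('Ben', None)]  -- Ben kept, with NULL
```

Dropping the ON condition here would turn the join into a Cartesian product, which is exactly the "forgot the condition" failure mode mentioned above.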
There are a few common index types, and each one is good at a different kind of query.
B-tree index
The default in most databases.
Works well for =, <, >, BETWEEN, ORDER BY, and prefix searches
Hash index
Great for exact equality lookups, like WHERE id = 100
Not useful for range searches or sorting
Clustered index
Usually only one per table
Non-clustered index
Useful for speeding up different access patterns on the same table
Bitmap index
Best suited to low-cardinality columns, like status flags, and common in data warehouses
A few other important ones often come up too:
Composite index
Covers multiple columns; column order matters, because queries benefit most when they filter on the leading columns
Unique index
Improves lookup speed and also protects data quality
Full-text index
Designed for searching words and phrases inside large text columns
If I were answering in an interview, I would usually group them like this:
That keeps the answer clear and shows you understand both the structure and the use case.
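As a small demonstration of two of these ideas, the sketch below (Python's sqlite3, hypothetical users table) creates a unique index, then shows that the query planner uses it for an equality lookup and that it rejects duplicate values:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
con.execute("CREATE UNIQUE INDEX idx_users_email ON users(email)")
con.execute("INSERT INTO users VALUES (1, 'ana@example.com')")

# The plan should report a search via idx_users_email, not a full table scan.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM users WHERE email = ?",
    ("ana@example.com",),
).fetchall()
print(plan[0][-1])

# The unique index also protects data quality: duplicates are rejected.
duplicate_rejected = False
try:
    con.execute("INSERT INTO users VALUES (2, 'ana@example.com')")
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

The exact plan wording varies by SQLite version, but it names the index instead of a scan, which is the point: the right index changes how the query executes.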
A trigger is database logic that runs automatically when something happens to a table or view.
Think of it like an event listener inside the database.
Common trigger events:
- INSERT
- UPDATE
- DELETE
Triggers usually fire:
- BEFORE the change
- AFTER the change
- In some systems, INSTEAD OF the change
What they’re used for:
- Enforcing business rules
- Validating data
- Writing audit logs
- Preventing certain changes
- Automatically updating related data

Example:
- If someone updates an employee salary, a trigger can automatically write the old and new values to an audit table.
- Or, if someone tries to delete a critical record, the trigger can block it.

Why they’re useful:
- The logic lives in the database, so it runs no matter which app or user makes the change.
- They help keep data consistent.

One thing to watch out for:
- Triggers can make behavior less obvious and harder to debug if overused, so I use them for rules that truly belong at the database level.
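The salary-audit example above can be sketched with Python's sqlite3. The table and column names are made up; the point is that the audit row appears automatically, with no application code writing it:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employees (id INTEGER PRIMARY KEY, salary INTEGER);
CREATE TABLE salary_audit (employee_id INTEGER, old_salary INTEGER, new_salary INTEGER);

-- Fires after any update to the salary column; OLD and NEW expose
-- the row's values before and after the change.
CREATE TRIGGER log_salary_change
AFTER UPDATE OF salary ON employees
BEGIN
    INSERT INTO salary_audit VALUES (OLD.id, OLD.salary, NEW.salary);
END;

INSERT INTO employees VALUES (1, 50000);
UPDATE employees SET salary = 55000 WHERE id = 1;
""")

print(con.execute("SELECT * FROM salary_audit").fetchall())  # [(1, 50000, 55000)]
```

No matter which application or user issues the UPDATE, the audit row is written, which is exactly why this kind of rule belongs in the database.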
I’d explain it by comparing how they model data, how strict they are, and where each one fits best.
SQL databases are relational, built around structured schemas, tables, and relationships. They’re strong when you need joins, complex queries, and reliable transactions.
NoSQL databases are non-relational, with flexible schemas, and come in several flavors: document, key-value, column-family, and graph stores.
A simple way to think about it:
When I’d use SQL:
- Financial systems
- Order management
- Inventory
- Any system where ACID transactions and data integrity really matter

When I’d use NoSQL:
- User activity feeds
- Large-volume event data
- Content platforms
- Applications with rapidly changing or semi-structured data

Examples:
- SQL: PostgreSQL, MySQL, SQL Server, Oracle
- NoSQL: MongoDB, Cassandra, Redis, Neo4j
One important nuance: this is not about one being better than the other. It’s really about tradeoffs:
- SQL gives you strong consistency and rich querying
- NoSQL gives you schema flexibility and easier horizontal scaling in many use cases
In practice, a lot of modern systems use both, depending on the workload.
I usually think about data integrity in layers, not just one feature.
A clean way to answer this is:
In practice, I’d say something like this:
I start with good schema design: normalized tables where it makes sense.
Then I enforce integrity in the database itself, not just in the application.
- PRIMARY KEY for uniqueness
- FOREIGN KEY for referential integrity
- UNIQUE constraints to prevent duplicates
- CHECK constraints for valid ranges or formats
- Defaults for predictable values
I also rely on transactions for write operations.
That protects against partial updates and keeps data consistent
Beyond that, I pay attention to how data enters the system.
Role-based access, so only the right users or services can modify critical data
Finally, I treat integrity as something you continuously verify.
Example: If I’m designing an order management database, I’d make sure every order has a valid customer ID through a foreign key, prevent duplicate order numbers with a unique constraint, and use a transaction so creating the order and its line items happens together. If any step fails, the whole transaction rolls back, which keeps the data clean and reliable.
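The order-management example can be sketched with Python's sqlite3; the schema below is hypothetical, but it shows the foreign key, unique constraint, and CHECK constraint each doing their job:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
con.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY);
CREATE TABLE orders (
    id           INTEGER PRIMARY KEY,
    customer_id  INTEGER NOT NULL REFERENCES customers(id),  -- referential integrity
    order_number TEXT UNIQUE,                                -- no duplicate order numbers
    amount       REAL CHECK (amount > 0)                     -- valid range
);
INSERT INTO customers VALUES (1);
""")

con.execute("INSERT INTO orders VALUES (1, 1, 'ORD-001', 25.0)")  # valid row

violations = []
try:
    con.execute("INSERT INTO orders VALUES (2, 99, 'ORD-002', 10.0)")  # no such customer
except sqlite3.IntegrityError:
    violations.append("missing customer")
try:
    con.execute("INSERT INTO orders VALUES (3, 1, 'ORD-001', 10.0)")   # duplicate number
except sqlite3.IntegrityError:
    violations.append("duplicate order number")

print(violations)  # ['missing customer', 'duplicate order number']
```

Because these rules live in the schema, every application and every user hits the same guardrails.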
I’d explain ACID as the core rules that make database transactions trustworthy.
Think of a transaction like transferring money between two bank accounts. ACID is what makes sure that transfer is safe, correct, and permanent.
Atomicity
All the operations in a transaction succeed together or fail together.
Example: if money is deducted from Account A but adding it to Account B fails, the deduction should also be undone.
Consistency
Every transaction moves the database from one valid state to another, respecting its rules and constraints.
Example: a transaction should not leave an order with a customer ID that does not exist.
Isolation
Concurrent transactions do not see each other's intermediate state. This prevents issues like dirty reads or lost updates.
Durability
Once a transaction is committed, its changes are permanent, even if the system crashes right afterward.
A simple way to say it in an interview:
"ACID is what makes transactions reliable. Atomicity means all or nothing. Consistency means the data always follows defined rules. Isolation means concurrent transactions do not corrupt each other. Durability means committed data survives crashes."
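The bank-transfer idea can be demonstrated with Python's sqlite3. The account names and amounts are made up; a CHECK constraint makes the second step fail, and the whole transaction rolls back, which is atomicity in action:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE accounts (
        name TEXT PRIMARY KEY,
        balance INTEGER CHECK (balance >= 0)  -- consistency rule: no overdrafts
    )
""")
con.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 50)])
con.commit()

try:
    with con:  # transaction: commits on success, rolls back on exception
        con.execute("UPDATE accounts SET balance = balance + 200 WHERE name = 'B'")  # applied...
        con.execute("UPDATE accounts SET balance = balance - 200 WHERE name = 'A'")  # ...then fails the CHECK
except sqlite3.IntegrityError:
    print("transfer failed, rolled back")

# Atomicity: B's credit was undone along with the failed debit.
print(con.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
# [('A', 100), ('B', 50)]
```

Even though the credit to B had already executed, the rollback undoes it: all or nothing.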
Views are basically saved queries that behave like tables.
They do not usually store the data themselves. Instead, they pull from underlying tables whenever you query them.
How I’d explain them in an interview:
Common ways views are used:
Simplifying complex SQL
If a report always needs 5 tables joined together, I can put that logic in a view so nobody has to rewrite it every time.
Security and access control
I can expose only specific columns, like hiding salary or PII fields, or filter rows so a team only sees its own data.
Consistency
If multiple analysts need the same business logic, like "active customers" or "monthly revenue", a view keeps everyone using the same definition.
Backward compatibility
If the underlying schema changes, a view can help preserve the old interface for downstream users.
One practical example:
A reporting view that joins customers, orders, and payments, so analysts can query one object instead of rebuilding the join every time.

One thing I’d also mention is that standard views are virtual, but some databases also support materialized views, which actually store the results for faster reads and are refreshed on a schedule or by process. That is useful when performance matters.
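A minimal illustration of the security use case, using Python's sqlite3 with a hypothetical employees table: the view exposes names but hides the salary column and filters out inactive rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER, active INTEGER);
INSERT INTO employees VALUES (1, 'Ana', 90000, 1), (2, 'Ben', 80000, 0);

-- The view is a saved query: no data is copied, and salary never appears.
CREATE VIEW active_employees AS
    SELECT id, name FROM employees WHERE active = 1;
""")

print(con.execute("SELECT * FROM active_employees").fetchall())  # [(1, 'Ana')]
```

Consumers query active_employees like a table, and if the "active" business rule ever changes, it changes in one place.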
SQL commands usually get grouped into 5 buckets:
DDL, Data Definition Language
This is about database structure.
- CREATE
- ALTER
- DROP
- TRUNCATE

You use these to create or change tables, indexes, schemas, and other objects.
DML, Data Manipulation Language
This is about changing the data inside tables.
- INSERT
- UPDATE
- DELETE
- MERGE, in some systems
DQL, Data Query Language
This is for reading data.
- SELECT
Some people fold SELECT into DML, but in interviews it is usually clearer to call it out separately as DQL.
DCL, Data Control Language
This handles access and permissions.
- GRANT
- REVOKE

Used to control who can read, write, or manage database objects.
TCL, Transaction Control Language
This manages transactions.
- COMMIT
- ROLLBACK
- SAVEPOINT

These are important when you want changes to be applied safely and consistently.
A simple way to remember it: DDL defines structure, DML changes data, DQL reads data, DCL controls access, and TCL manages transactions.
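Here is a tiny session in Python's sqlite3 exercising four of the five buckets (SQLite has no DCL; GRANT and REVOKE belong to server databases). The table and data are made up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

cur.execute("CREATE TABLE pets (id INTEGER PRIMARY KEY, name TEXT)")  # DDL
cur.execute("INSERT INTO pets (name) VALUES ('Rex')")                 # DML
con.commit()                                                          # TCL: persist the insert

cur.execute("UPDATE pets SET name = 'Max'")                           # DML
con.rollback()                                                        # TCL: undo the update

rows = cur.execute("SELECT name FROM pets").fetchall()                # DQL
print(rows)  # [('Rex',)] -- the committed insert survived, the rolled-back update did not
```

The COMMIT/ROLLBACK pair is the part interviewers usually probe: only committed changes are visible afterward.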
I usually think about database error handling in layers.
In practice, that means:
Using TRY...CATCH in SQL Server, or the equivalent construct in other platforms.

A few things I pay attention to:
- Wrapping multi-step changes in explicit transactions so failures roll back cleanly
- Logging enough context, like error details and the affected keys, to diagnose the failure
- Never swallowing errors silently
For stored procedures, my pattern is pretty simple:
At the application layer, I handle database exceptions there too, because not every error should be exposed directly to the user. The database should preserve integrity and provide diagnostics. The application should translate that into something useful and safe for the user.
Example:
If I am loading orders into a table and one step fails halfway through, I would not leave partial data behind. I would wrap the whole operation in a transaction, roll it back in the catch block, log the failed order ID and error details, and return a clear status to the application. If it is a transient issue like a deadlock, I might allow a retry strategy in the app layer. If it is a data quality issue, I would fail fast and send it to a support or data remediation queue.
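At the application layer, the transient-versus-fatal distinction might look like the sketch below (Python's sqlite3; the helper name and retry policy are illustrative, not a standard API). Lock errors get retried with backoff, anything else rolls back and propagates:

```python
import sqlite3
import time

def run_with_retry(con, work, retries=3):
    # Run `work(con)` in a transaction: commit on success, roll back on failure.
    # Retry only transient errors (e.g. lock contention); fail fast otherwise.
    for attempt in range(retries):
        try:
            with con:  # rolls back automatically if work() raises
                return work(con)
        except sqlite3.OperationalError:  # e.g. "database is locked"
            if attempt == retries - 1:
                raise
            time.sleep(0.1 * (attempt + 1))  # simple backoff before retrying

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")
run_with_retry(con, lambda c: c.execute("INSERT INTO orders VALUES (1)"))
print(con.execute("SELECT COUNT(*) FROM orders").fetchone())  # (1,)
```

Data-quality errors like IntegrityError deliberately fall outside the retry branch: retrying bad data never fixes it, so it should fail fast and be routed to remediation.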
A good way to answer this is:
Example answer:
I’d use NoSQL when the data model is changing frequently, the traffic is very high, and I care more about fast reads and writes at scale than complex relational queries.
A common case is something like a social platform or event-driven application.
For example:
- user activity is coming in constantly, likes, comments, views, reactions, session events
- the data is semi-structured, and different event types may have different fields
- the system needs to scale horizontally as traffic grows
- most queries are simple, like fetching a user feed, recent activity, or aggregated counters

In that kind of setup, a NoSQL database like MongoDB, Cassandra, or DynamoDB can be a better fit because:
- schema flexibility makes it easier to handle changing event formats
- high write throughput supports large volumes of incoming data
- horizontal scaling is usually more straightforward
- you can design records around access patterns, which helps avoid expensive joins
I’d also call out the tradeoff. If the application needs strong relational integrity, complex joins, or multi-row transactional logic, I’d lean back toward SQL. So for me, NoSQL is the right choice when scale, flexibility, and access speed matter more than relational complexity.
Sharding is basically splitting one big database into multiple smaller databases, and each one holds only part of the data.
A simple way to think about it: instead of one server holding every customer, you split them across several servers, and a shard key decides where each row lives.

For example: customers with IDs 1 to 1,000,000 might live on shard 1, the next million on shard 2, and so on.
Why teams do it:
- A single server eventually runs out of CPU, memory, or storage
- Write throughput grows past what one primary can handle
- Smaller databases are faster to query and easier to back up

The tradeoff is that it adds complexity:
- Cross-shard queries and joins get harder
- Rebalancing data when shards fill up is painful
- Transactions that span shards are difficult to keep consistent
So in one line, sharding is a horizontal scaling strategy where you partition data across multiple database instances so no single database has to handle everything.
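The routing piece of that idea can be sketched in a few lines of Python. The shard count and key are hypothetical; the important property is that the same customer always maps to the same shard, using a stable hash rather than Python's randomized built-in hash():

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count

def shard_for(customer_id: int) -> int:
    """Route a customer to a shard by hashing the shard key.

    sha256 is stable across processes and machines, so every service
    instance routes the same customer to the same shard.
    """
    digest = hashlib.sha256(str(customer_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Every lookup for a given customer lands on one shard.
print(shard_for(12345) == shard_for(12345))  # True
```

Real systems often add a directory service or consistent hashing on top, so that changing NUM_SHARDS does not force every row to move, which is part of the rebalancing complexity mentioned above.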
I’d answer this by showing that database security is not just one control, it’s a layered approach.
I usually break it into a few key areas:
In practice, that looks like this:
Strong authentication
Enforce strong password policies, MFA for admin access, and remove shared accounts.
Tight access control
Restrict access to sensitive tables, columns, or rows when needed.
Encryption
Encrypt data in transit with TLS and at rest, and manage and rotate encryption keys properly.
Auditing and monitoring
Send logs to a SIEM or central monitoring platform so alerts are not missed.
Hardening
Use network segmentation, firewalls, and IP allowlists.
Operational discipline
Patch regularly, take tested and encrypted backups, and review privileges on a schedule.
A concrete example:
In a production environment, I’d give application services only the exact permissions they need, for example read/write on specific schemas, but no admin rights. Admin access would be limited to a small DBA group with MFA. All connections would use TLS, backups would be encrypted, and audit logging would track login attempts and permission changes. On top of that, I’d schedule periodic access reviews to catch privilege creep before it becomes a problem.
That’s usually how I think about it, layered controls, least privilege, encryption, and strong monitoring.
A clean way to answer this is:
Here’s how I’d say it:
There are four standard isolation levels:
- Read Uncommitted
- Read Committed
- Repeatable Read
- Serializable

The easiest way to think about them is: as you go up the list, transactions are more isolated, but the cost is usually more contention or lower throughput.
Read Uncommitted
- Lowest isolation level
- A transaction can see changes made by another transaction even if they have not been committed yet
- This can cause dirty reads
- Rarely used in systems that care about correctness
Read Committed
- Prevents dirty reads
- You only see committed data
- But if you run the same query twice in one transaction, you might get different results
- That means non-repeatable reads can still happen
Repeatable Read
- Prevents dirty reads and non-repeatable reads
- If you read the same row twice in the same transaction, you get the same result
- Depending on the database, you can still run into phantom reads, where a later query returns extra rows that were inserted by another transaction
- Some databases, like MySQL InnoDB, handle this differently than others
Serializable
- Highest isolation level
- Makes transactions behave as if they ran one at a time, in sequence
- Prevents dirty reads, non-repeatable reads, and phantom reads
- Safest option for consistency, but usually the most expensive in terms of performance, locking, or transaction retries
I’d also call out one practical point: most databases do not default to the strictest level. SQL Server and PostgreSQL default to Read Committed, while MySQL InnoDB defaults to Repeatable Read, so it is worth knowing your engine's default.
If I wanted to make it very practical in an interview, I’d finish with:
- Read Committed is a common default for general OLTP workloads
- Repeatable Read is useful when a transaction needs a stable view of data
- Serializable is for cases where correctness matters more than throughput, like financial or conflict-sensitive workflows

Knowing the questions is just the start. Work with experienced professionals who can help you perfect your answers, improve your presentation, and boost your confidence.
Comprehensive support to help you succeed at every stage of your interview journey
We've already delivered 1-on-1 mentorship to thousands of students, professionals, managers and executives. Even better, they've left an average rating of 4.9 out of 5 for our mentors.
Find Database Interview Coaches