Linux Interview Questions

Master your next Linux interview with our comprehensive collection of questions and expert-crafted answers. Get prepared with real scenarios that top companies ask.



1. How do systemd services, targets, and unit files work together?

systemd manages the system with units, and a service is just one unit type. A unit file is the config that tells systemd what to start, how to start it, what it depends on, and when it should run. For a service, that lives in a .service file with sections like [Unit], [Service], and [Install].

Targets are grouping units, similar to the old runlevels but more flexible. A target does not usually run a process itself; it pulls in other units. For example, multi-user.target brings up core system services, and graphical.target adds the GUI stack. Services and targets connect through dependencies like WantedBy=multi-user.target in the service file, which makes systemctl enable create the right symlink so that the target starts the service automatically.

2. How do Linux file permissions, ownership, and umask interact?

They work together to decide who can access a file, and what the default access looks like when it is created.

  • Ownership sets the identities involved: each file has a user owner and a group owner, viewed with ls -l, changed with chown and chgrp.
  • Permissions define access for three classes, user, group, others, with r, w, x bits on files and directories.
  • New files start from a base mode, usually 666, and new directories from 777; umask then masks permission bits out of that base, it never adds them.
  • Example: umask 022 gives files 644 and directories 755; umask 027 gives files 640 and directories 750.
  • Effective access depends on whether you are the owner, in the file’s group, or fall into others; root can bypass most checks.

One subtlety: the execute bit is usually not granted to regular files by default, because the base mode for files is 666, so even a permissive umask will not produce executable files.
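
The umask arithmetic described above can be checked directly in a throwaway directory. This is a sketch using GNU coreutils stat; on BSD/macOS the stat flags differ.

```shell
# Demonstrate umask masking the base modes 666 (files) and 777 (dirs).
tmp=$(mktemp -d) && cd "$tmp"

umask 022
touch f022 && mkdir d022
stat -c '%a %n' f022 d022    # 644 f022, then 755 d022

umask 027
touch f027 && mkdir d027
stat -c '%a %n' f027 d027    # 640 f027, then 750 d027
```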

3. What is the difference between setuid, setgid, and the sticky bit?

They’re special permission bits that change how execution or deletion works beyond normal rwx permissions.

  • setuid on an executable makes it run with the file owner’s effective UID, not the user launching it. Example, passwd runs as root so it can update /etc/shadow.
  • setgid on an executable makes it run with the file’s group ID. On a directory, new files inherit the directory’s group, which is useful for shared team folders.
  • sticky bit on a directory means users can only delete or rename files they own, or that root owns, even if the directory is world-writable. Classic example, /tmp.

You’ll see them as s or t in ls -l, like rws, rwxr-s, or drwxrwxrwt. Numeric examples are 4755, 2755, and 1777.
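
A quick way to see those bits appear, using scratch directories (GNU stat assumed; the directory names are arbitrary):

```shell
tmp=$(mktemp -d) && cd "$tmp"

mkdir shared  && chmod 2775 shared    # setgid, like a shared team dir
mkdir scratch && chmod 1777 scratch   # sticky, like /tmp

stat -c '%A %a %n' shared scratch
# drwxrwsr-x 2775 shared
# drwxrwxrwt 1777 scratch
```

Note the s in the group triplet and the t in the others triplet, matching the numeric 2775 and 1777.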

No strings attached, free trial, fully vetted.

Try your first call for free with every mentor you're meeting. Cancel anytime, no questions asked.

Nightfall illustration

4. What is SELinux or AppArmor, and how have you worked with one of them in practice?

SELinux and AppArmor are Linux Mandatory Access Control systems. They add a layer beyond normal Unix permissions by restricting what a process can read, write, execute, or connect to, even if it gets compromised. SELinux is label based and more common on RHEL, CentOS, Rocky. AppArmor is path based and often seen on Ubuntu, Debian.

In practice, I have worked more with SELinux:

  • I usually start with getenforce, sestatus, and audit logs in /var/log/audit/audit.log.
  • When an app breaks, I check denials with ausearch -m AVC or sealert -a.
  • For quick validation, I may switch to permissive mode temporarily, not as a fix.
  • I fix issues by setting the right context, like semanage fcontext plus restorecon, or enabling a needed boolean.
  • Example, I allowed Nginx to connect to a backend by enabling httpd_can_network_connect.

5. What is the difference between su and sudo, and how do you manage sudoers safely?

su switches you to another user, usually root, and gives you that user’s shell. You authenticate with the target user’s password. sudo runs a single command as another user, usually root, using your own password and policy rules.

  • su is broader, it changes identity for a session; sudo is more controlled and auditable.
  • sudo logs command usage, supports per-user and per-group permissions, and follows least privilege better.
  • su - gives a full login shell with target user environment; plain su keeps more of the current environment.
  • Safest sudoers practice is editing with visudo, because it does syntax checking and prevents concurrent edits.
  • Prefer entries in /etc/sudoers.d/ over modifying /etc/sudoers directly, use groups, and grant only required commands.
  • Validate with visudo -c, test in a second session, and avoid broad NOPASSWD: ALL unless tightly justified.
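
To make that concrete, here is a sketch of a narrow drop-in rule. The deploy group and the exact command list are hypothetical, and the file is written to a temp path here rather than installed under /etc/sudoers.d/.

```shell
# Hypothetical least-privilege rule: the deploy group may restart nginx
# and read its logs, nothing else. On a real host, check syntax with
# `visudo -cf <file>` before placing it in /etc/sudoers.d/.
cat > /tmp/deploy-nginx <<'EOF'
%deploy ALL=(root) /usr/bin/systemctl restart nginx, /usr/bin/journalctl -u nginx
EOF
cat /tmp/deploy-nginx
```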

6. What steps would you take to investigate high CPU usage on a Linux system?

I’d start broad, then narrow from system level to process, then to thread, syscall, or I/O behavior.

  • Confirm the spike with uptime, top, mpstat, and check load average versus CPU percent.
  • Identify the culprit process using top -H, ps -eo pid,ppid,cmd,%cpu --sort=-%cpu, and note if it is user or kernel heavy.
  • Check process state and threads, top -H -p <pid>, ps -Lp <pid> -o pid,tid,%cpu,psr,comm.
  • Distinguish CPU from I/O wait, look at %us, %sy, %wa in vmstat 1 or iostat.
  • If needed, trace deeper with strace -p <pid> for syscall loops, or perf top and perf record for hotspots.
  • Review recent changes, logs, cron jobs, deploys, container limits, and noisy neighbors.
  • Mitigate safely, renice, taskset, restart service, or scale out after finding the cause.
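
The identification steps above reduce to a couple of one-liners, assuming a procps-style ps is installed:

```shell
# Top CPU consumers, sorted; header row plus the four busiest processes.
ps -eo pid,ppid,comm,%cpu --sort=-%cpu | head -n 5

# Per-thread view of one process; $$ (this shell) stands in for the PID
# you are investigating.
ps -Lp $$ -o pid,tid,%cpu,psr,comm
```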

7. How would you create, enable, and troubleshoot a custom systemd service?

I’d keep it simple: create a unit file, validate it, enable it, then use systemctl and journalctl to debug.

  • Create /etc/systemd/system/myapp.service with [Unit], [Service], and [Install] sections.
  • In [Service], define ExecStart=/path/to/app, set User=appuser, WorkingDirectory=..., and usually Restart=on-failure.
  • Reload units with systemctl daemon-reload, start it using systemctl start myapp, then enable at boot with systemctl enable myapp.
  • Check status via systemctl status myapp, and logs with journalctl -u myapp -xe or -f for live output.
  • Common issues: wrong path in ExecStart, missing permissions, bad environment variables, wrong service type like Type=simple vs forking, or SELinux blocking execution.

If it will not start, I also run systemd-analyze verify /etc/systemd/system/myapp.service to catch unit syntax problems fast.
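
Putting those steps together, a minimal unit file might look like the sketch below. The service name myapp, its paths, and appuser are placeholders, and the file is written to a temp location here rather than /etc/systemd/system.

```shell
cat > /tmp/myapp.service <<'EOF'
[Unit]
Description=My example app
After=network.target

[Service]
Type=simple
User=appuser
WorkingDirectory=/opt/myapp
ExecStart=/opt/myapp/bin/myapp
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# On a real host, the follow-up would be:
#   systemctl daemon-reload && systemctl enable --now myapp
#   journalctl -u myapp -f
grep -E 'ExecStart|WantedBy' /tmp/myapp.service
```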

8. How do you troubleshoot a Linux server that suddenly becomes unreachable over SSH?

I’d work it layer by layer, network first, then SSH, then system health, so I do not waste time debugging the wrong thing.

  • Confirm basic reachability: ping, traceroute, test port 22 with nc -zv host 22 from another machine.
  • Check if the server is up through console, hypervisor, cloud serial console, or out-of-band access like iLO/DRAC.
  • Verify network on the host: ip a, ip r, NIC status, firewall rules in iptables or nft, security groups, and recent route or DNS changes.
  • Inspect SSH service: systemctl status sshd, ss -tlnp | grep :22, review /var/log/auth.log or /var/log/secure.
  • Rule out resource issues: high CPU, memory pressure, disk full, hung filesystem, using top, free, df -h, dmesg.
  • Check for lockouts, fail2ban, config errors, or a changed sshd_config, then restart sshd carefully.
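
As a fallback when nc is not installed on the machine you are testing from, bash's built-in /dev/tcp gives a rough port probe. Here localhost is a stand-in for the unreachable server.

```shell
host=localhost   # replace with the server you cannot reach
if timeout 3 bash -c "echo > /dev/tcp/$host/22" 2>/dev/null; then
  echo "port 22 open on $host"
else
  echo "port 22 closed or filtered on $host"
fi
```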

9. How do you diagnose a server that is running out of memory?

I start by confirming whether it is true RAM pressure, swap thrashing, or a leak, then narrow it to the process, workload, or kernel behavior.

  • Check overall memory and swap with free -h, vmstat 1, sar -r, and /proc/meminfo.
  • Look for OOM events in dmesg -T or journalctl -k, they often name the killed process.
  • Find top consumers with top, htop, ps aux --sort=-%mem, and compare RSS vs VIRT.
  • If swap is active and CPU iowait is high, suspect memory pressure causing thrash.
  • Separate cache from real pressure, Linux uses RAM for page cache, so low "free" alone is not bad.
  • For leaks, watch growth over time with smem, pmap, app metrics, or heap profilers.
  • In containers, check cgroup limits with systemctl status, docker stats, or /sys/fs/cgroup.

Then I correlate with recent deploys, traffic spikes, or config changes, and either tune limits, fix the leaking service, or add memory.
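
Reading /proc/meminfo directly illustrates the cache point: MemAvailable, not MemFree, is the better headroom estimate because page cache is reclaimable.

```shell
grep -E '^(MemTotal|MemFree|MemAvailable|SwapTotal|SwapFree):' /proc/meminfo
```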

10. What is the difference between the kernel, init system, and shell?

They sit at different layers of a Linux system.

  • The kernel is the core of the OS, it talks to hardware, manages CPU, memory, devices, filesystems, networking, and exposes system calls.
  • The init system is the first userspace process, usually PID 1. It boots the rest of userspace, starts and supervises services, and handles shutdowns. Examples are systemd and sysvinit.
  • The shell is a user interface, usually a command interpreter like bash or zsh. It reads commands, launches programs, and supports scripting.

A simple way to explain it in an interview is: kernel runs the machine, init brings the system up, shell lets the user interact with it. The shell depends on the kernel, and usually starts after init has set up userspace.
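
The layering is visible on any live system; note that inside a container, PID 1 may be a shell or the app itself rather than systemd.

```shell
echo "PID 1 is: $(cat /proc/1/comm)"        # the init process
echo "this shell is PID $$, child of $PPID" # a descendant of PID 1
```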

11. What logs and tools do you use first when troubleshooting a failed service on Linux?

I start with systemd, then the journal, then the app logs. That gives me the fastest path to whether it is a unit issue, a dependency problem, or the service itself crashing.

  • systemctl status <service>, shows state, exit code, recent errors, and dependency failures.
  • journalctl -u <service> -b, gets service logs from the current boot, often enough to spot config or permission errors.
  • Check app specific logs in /var/log, or wherever the service writes, like Nginx, MySQL, or custom app logs.
  • systemctl cat <service> and systemctl show <service>, verify unit file, overrides, environment, restart policy, and exec path.
  • If needed, ss -lntp, ps aux, top, df -h, free -m, and dmesg, rule out port conflicts, dead processes, resource exhaustion, or kernel issues.

12. Explain the purpose of /etc/fstab and what can go wrong if it is misconfigured.

/etc/fstab is the static filesystem table. It tells Linux which filesystems, swap spaces, and mount points to use, plus options like defaults, noatime, or nofail. At boot, systemd or the init process reads it to mount disks automatically. It is also used by mount -a, so it is the central place for persistent mounts.

If it is misconfigured, a few bad things can happen:

  • Wrong device or UUID, the system may fail to mount a needed filesystem.
  • Bad mount point, services may break because expected paths are empty.
  • Incorrect options, you can lose write access, get permission issues, or hurt performance.
  • Broken root or /boot entry, the machine may drop into emergency mode or fail to boot.
  • Bad swap entry, memory pressure handling gets worse.

Best practice is to use UUIDs, test with mount -a, and keep recovery access handy.

13. How do you identify disk space issues versus inode exhaustion?

I check both capacity and inode usage, because a filesystem can look "full" for either reason.

  • Use df -h to see disk space by filesystem, size, used, available, mount point.
  • Use df -i to see inode usage, if IUse% is 100%, you have inode exhaustion.
  • Symptom difference, disk full means large files consumed blocks, inode exhaustion means too many small files.
  • To find space hogs, use du -sh /* 2>/dev/null | sort -h or drill into the busy mount.
  • To find inode hogs, count files with find /path -xdev | wc -l, then inspect directories with lots of small files like logs, cache, mail spools.

If apps report "No space left on device" but df -h looks okay, I immediately check df -i, because that error happens with inode exhaustion too.
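
Side by side, the two df views make the distinction obvious:

```shell
df -h /    # block usage: Size, Used, Avail
df -i /    # inode usage: Inodes, IUsed, IFree, IUse%
```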

14. What is the difference between a hard link and a symbolic link, and when would you use each?

A hard link is another directory entry pointing to the same inode as the original file. A symbolic link is a separate file that stores a pathname to another file or directory.

  • Hard link: same inode, same data, deleting one name does not remove the file until all links are gone.
  • Symlink: different inode, points by path, can become broken if the target moves or is deleted.
  • Hard links usually cannot span filesystems, and typically do not link directories.
  • Symlinks can span filesystems and can point to directories.

I use hard links when I want two filenames to behave like the exact same file, often for space efficiency. I use symlinks when I want a flexible pointer, like current -> /opt/app/v2, shared config paths, or shortcuts across filesystems.
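
A quick demo in a scratch directory shows the inode behavior (GNU stat assumed):

```shell
tmp=$(mktemp -d) && cd "$tmp"
echo data > original
ln original hardlink       # second name, same inode
ln -s original symlink     # separate file holding a path

stat -c '%i %n' original hardlink symlink   # first two inodes match

rm original
cat hardlink                               # prints "data": the inode survives
cat symlink 2>/dev/null || echo "symlink is now broken"
```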

15. Describe a situation where you had to balance speed, stability, and security while managing Linux infrastructure.

I’d answer this with a quick STAR flow: situation, tradeoff, actions, result. Pick an example where business pressure was real, but you still showed good engineering judgment.

At a previous job, we had to roll out urgent OpenSSL updates across production Linux web servers after a high severity advisory. Speed mattered because of exposure, stability mattered because these boxes handled customer traffic, and security mattered because delaying patching increased risk. I split the rollout into phases, patched and reboot tested a staging group first, then a small production canary behind the load balancer. I used config management to keep changes consistent, verified service health with monitoring and smoke tests, and scheduled the wider rollout during a low traffic window. We finished the same day with no customer impact, and I documented the runbook so the next emergency patch cycle was faster.

16. Which Linux distributions have you worked with most, and how do their package management and service management approaches differ?

I’ve worked most with Ubuntu and Debian, RHEL/CentOS and Rocky, and a bit of SUSE and Alpine.

  • Debian/Ubuntu use apt with .deb packages, very dependency-friendly, huge repos, and common in cloud and app hosting.
  • RHEL family uses yum or dnf with .rpm packages, stronger enterprise tooling, predictable lifecycles, and tighter vendor support.
  • SUSE also uses .rpm, but package management is usually through zypper, which is solid for patching and repo handling.
  • Alpine uses apk, very lightweight, musl-based, and popular for containers where small image size matters.
  • For service management, modern Debian, Ubuntu, RHEL, Rocky, and SUSE all use systemd, so systemctl is the standard.
  • Older systems might use SysV init or Upstart, so the main difference there is service scripts versus systemd unit files and targets.

17. Explain the Linux boot process from power-on to a user-space login prompt.

At a high level, firmware initializes the machine, hands off to a bootloader, the kernel starts hardware and mounts a root filesystem, then init brings up user space until you get a login prompt.

  • Power on, CPU jumps to firmware, BIOS or UEFI runs POST, initializes basic hardware, finds a bootable device.
  • Firmware loads a bootloader like GRUB, which presents a menu, loads the Linux kernel and usually an initramfs into memory.
  • The kernel decompresses, sets up memory, CPU features, drivers, and mounts the temporary root from initramfs.
  • Early userspace in initramfs loads needed modules, finds the real root filesystem, then does switch_root or pivot_root.
  • Kernel starts PID 1, usually systemd, which mounts filesystems, starts services, networking, logging, and targets.
  • systemd starts a getty on a TTY, or a display manager for GUI, and that gives you the login prompt.

18. Walk me through your experience administering Linux systems in production environments.

I’ve administered Linux in production across web, API, and batch workloads, mostly on Ubuntu, RHEL, and Amazon Linux. My focus has been reliability, security, and making systems easy to operate at scale.

  • Built and maintained VMs and cloud instances, handled patching, kernel updates, and lifecycle management.
  • Managed systemd services, users, SSH hardening, sudo, firewalls, SELinux basics, and backup/restore processes.
  • Did performance and incident work with top, vmstat, iostat, sar, journalctl, and logs in /var/log.
  • Automated provisioning and config with Bash, Ansible, and some Terraform, which reduced drift and manual errors.
  • Supported Nginx, Docker, cron jobs, EBS or LVM storage, and monitoring with Prometheus, CloudWatch, and alerting.

A solid example, I helped stabilize a high traffic API cluster by tuning file descriptor limits, fixing log rotation, and identifying disk latency during peak load. That cut recurring incidents and improved recovery time.

19. How do you troubleshoot a permission denied error when traditional file permissions appear correct?

I’d widen the check beyond basic rwx bits, because “Permission denied” often comes from something else in the access path.

  • Confirm the exact user and groups with id, and test the path with namei -l /path/file.
  • Check parent directories, you need execute permission on every directory in the path.
  • Look for ACLs with getfacl, they can override what ls -l suggests.
  • Verify ownership and special bits like sticky, setgid, or immutable flags via lsattr.
  • Check SELinux or AppArmor, getenforce, ls -Z, and audit logs often expose denials.
  • On NFS or CIFS, review mount options like root_squash, noexec, or UID mapping issues.
  • Use strace on the failing command to see the exact syscall and object being denied.

In interviews, I’d say I follow layers, identity, path traversal, extended controls, then security modules and mounts.

20. Explain how processes are created and managed in Linux.

Linux creates a process mainly with fork(), which copies the parent process, then often exec() replaces that copy with a new program. Modern systems may also use clone() for finer control, especially for threads and containers. Every process gets a PID and starts with resources like memory mappings, file descriptors, and environment variables inherited from the parent.

Management is handled by the kernel scheduler and process states. A process can be running, runnable, sleeping, stopped, or zombie. The kernel schedules CPU time based on priority and policy, like normal CFS or real-time classes. Parents can monitor children with wait() and receive signals such as SIGCHLD. Admins manage processes with tools like ps, top, kill, nice, and systemd, which also tracks services using cgroups for resource limits and isolation.

21. Explain the difference between load average and CPU utilization.

They measure different things. CPU utilization is the percentage of time the CPU is busy doing work. Load average is the average number of tasks that are either running, runnable (waiting for a CPU), or in uninterruptible sleep, usually I/O wait, over 1, 5, and 15 minutes.

  • CPU utilization answers, "How busy are my CPUs right now?"
  • Load average answers, "How much demand is there for CPU and certain blocked work?"
  • High CPU with low load can mean CPUs are busy, but there is not much queueing.
  • High load with low CPU often points to I/O bottlenecks, stuck disk or NFS, not pure CPU pressure.
  • On a 4 core system, load 4 means roughly full saturation, 8 means sustained queueing.

So, utilization is a percentage, load is a queue depth style signal.
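
Both numbers are cheap to read directly; dividing the 1-minute load by the core count gives the per-core saturation view:

```shell
# /proc/loadavg holds the three averages; nproc gives the core count.
read one five fifteen _ < /proc/loadavg
echo "load: $one (1m) $five (5m) $fifteen (15m) across $(nproc) cores"
```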

22. How do you diagnose disk I/O bottlenecks on Linux?

I’d work top-down: confirm the symptom, find the busy device, then decide whether it’s throughput, latency, queueing, or filesystem related.

  • Start with iostat -xz 1, check %util, await, r/s, w/s, and queue depth (avgqu-sz, or aqu-sz in newer sysstat; svctm is deprecated there). High await and queue depth usually signal contention.
  • Use iotop or pidstat -d 1 to find which processes are generating I/O.
  • Check memory pressure with vmstat 1, high wa plus swapping can look like a disk issue.
  • Inspect per-device stats in sar -d, lsblk, and /proc/diskstats, confirm whether one disk, LVM layer, or RAID device is the hotspot.
  • For filesystem impact, use df -h, df -i, mount, and dmesg for ext4/xfs errors or controller resets.
  • If needed, go deeper with blktrace, fio, or smartctl to separate workload limits from failing hardware.

23. How would you expand a filesystem on a running Linux server with minimal downtime?

I’d answer it as a layered process: confirm what sits under the filesystem, expand the block device, then grow the filesystem online if supported.

  • Identify the stack with lsblk, df -hT, and pvs/vgs/lvs if LVM is involved.
  • Expand the underlying disk first, for example hypervisor disk, SAN LUN, or cloud volume, then rescan the device.
  • If there is a partition, grow it with growpart or parted; if using LVM, run pvresize, then extend the LV with lvextend.
  • Grow the filesystem online: xfs_growfs for XFS, resize2fs for ext4, usually mounted and live.
  • Validate with df -h, lsblk, and check logs for errors.

Minimal downtime comes from using online resize paths. I’d still take a snapshot or backup first, confirm filesystem type, and have a rollback plan if the storage layer does not rescan cleanly.

24. How would you explain the difference between virtualization and containerization in a Linux context?

In Linux, virtualization and containerization solve isolation differently.

  • Virtualization uses a hypervisor to run full virtual machines, each with its own kernel and virtual hardware.
  • Containers share the host Linux kernel, but isolate processes with namespaces and control resources with cgroups.
  • VMs are heavier, slower to boot, and use more RAM and disk, but they give stronger isolation and can run different OSes.
  • Containers are lightweight, start fast, and pack densely, but they require a compatible Linux kernel.
  • Use VMs when you need hard isolation or mixed operating systems. Use containers for microservices, CI/CD, and fast scaling.

A simple way to say it in an interview is, a VM is like a full house, a container is like an apartment in the same building.

25. How would you securely grant a user limited administrative access on a Linux system?

The standard way is sudo, not sharing the root password. You give the user only the commands they need, ideally via a group or a small rule in /etc/sudoers, and you edit it with visudo so syntax errors do not lock you out.

  • Add the user to an admin group like wheel or sudo, if broad admin rights are acceptable.
  • For limited access, create command-specific rules, like allowing only systemctl restart nginx or journalctl.
  • Use least privilege, full command paths, and avoid wildcards unless you really trust the user.
  • Set defaults like requiring the user’s own password, logging sudo activity, and optionally restricting TTY or environment inheritance.
  • Test with sudo -l -U username to verify exactly what they can run.

If I wanted tighter control, I’d prefer a dedicated group plus a narrow sudoers rule over full root access.

26. What is the difference between a process, a thread, a daemon, and a zombie process?

Here is the clean way to explain it in an interview:

  • A process is a running program with its own memory space, PID, file descriptors, and resources.
  • A thread is a lightweight execution unit inside a process. Threads share the same memory and resources of that process, but each has its own stack and CPU state.
  • A daemon is a background process, usually started at boot or managed by systemd, that provides a service like sshd or cron.
  • A zombie process is a process that has finished execution, but still has an entry in the process table because its parent has not collected its exit status with wait().

The key distinction is ownership and lifecycle. Processes are isolated, threads are shared within a process, daemons are service-style background processes, and zombies are dead processes waiting to be reaped.
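
A zombie is easy to manufacture for a demo: give a child a parent that never calls wait(). In this sketch the parent execs into plain sleep, which reaps nothing, and the child's state is read straight from /proc so no extra tools are needed.

```shell
# The parent becomes `sleep 2` via exec; its child exits after 0.2s and
# sits in the process table in state Z until the parent dies.
sh -c 'sleep 0.2 & exec sleep 2' &
parent=$!
sleep 0.5
for p in /proc/[0-9]*; do
  set -- $(cat "$p/stat" 2>/dev/null)   # fields: pid (comm) state ppid ...
  [ "$4" = "$parent" ] && echo "child $1 state=$3"
done
```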

27. What tools do you use to analyze network connectivity issues on a Linux system?

I usually start by isolating the layer where it breaks, local stack, routing, DNS, or remote reachability.

  • ip addr, ip link, ethtool, check interface state, IP, duplex, errors.
  • ip route, ss -tulpn, verify routing, default gateway, listening sockets, active connections.
  • ping, traceroute or tracepath, test reachability and where packets stop.
  • dig, nslookup, resolvectl, separate DNS problems from raw network problems.
  • curl -v, nc, telnet, test specific ports, TLS handshakes, HTTP behavior.
  • tcpdump and sometimes wireshark, inspect packets, retransmits, resets, ARP issues.
  • journalctl, dmesg, NetworkManager logs, catch driver, DHCP, or link flap errors.
  • iptables or nft, firewall-cmd, verify firewall or NAT isn’t blocking traffic.

If it’s intermittent, I’ll also use mtr for ongoing path quality and latency loss patterns.

28. How would you investigate repeated authentication failures on a Linux host?

I’d work from scope, source, and timing, then confirm whether it’s a user issue, a service issue, or active abuse.

  • Check auth logs first: /var/log/auth.log on Debian/Ubuntu, /var/log/secure on RHEL, or journalctl -u sshd -u sudo -u sssd.
  • Identify the pattern: failed user, source IP, service, time window, and whether it is local, SSH, sudo, or LDAP/AD related.
  • Verify account state with passwd -S, chage -l, faillock --user, and look for expired, locked, or disabled accounts.
  • Review PAM, SSH, and identity configs: /etc/pam.d/*, sshd_config, nsswitch.conf, SSSD, LDAP, Kerberos.
  • Correlate with security controls like fail2ban, firewall logs, IDS, and check if the source is internal automation using stale credentials.
  • If suspicious, contain it: block the IP, rotate creds, preserve logs, and check for broader brute force activity.

29. How do signals work in Linux, and when would you use SIGTERM versus SIGKILL?

Signals are an async way for the kernel or another process to notify a process that something happened. Each signal has a number and default action, like terminate, stop, continue, or ignore. A process can catch or ignore many signals with handlers, but some, like SIGKILL and SIGSTOP, cannot be caught, blocked, or ignored.

  • SIGTERM is the polite shutdown signal, use it first.
  • It lets the app clean up, flush files, close sockets, and exit gracefully.
  • SIGKILL is the force option, the kernel kills the process immediately.
  • Use SIGKILL only if the process is hung, ignoring SIGTERM, or stuck in a bad state.
  • Typical flow is kill -TERM pid, wait, then kill -KILL pid if needed.

In interviews, I’d mention that systemd, containers, and orchestration tools usually send SIGTERM first for graceful shutdown.
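
A small bash sketch shows the graceful path: the handler runs on SIGTERM and picks its own exit code, something SIGKILL would never allow.

```shell
bash -c 'trap "echo caught SIGTERM, exiting cleanly; exit 0" TERM; sleep 5 & wait' &
pid=$!
sleep 0.3
kill -TERM "$pid"
wait "$pid"
echo "exit status: $?"   # 0, chosen by the handler
```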

30. How would you troubleshoot a DNS resolution problem on a Linux server?

I’d work from the client outward: confirm the symptom, check local resolver config, then test the DNS path step by step.

  • Verify whether it’s DNS only, ping 8.8.8.8 vs ping google.com.
  • Check resolver settings in /etc/resolv.conf, search domains, nameserver entries, and whether NetworkManager or systemd-resolved manages it.
  • Test resolution directly with dig or nslookup, for example dig google.com and dig @8.8.8.8 google.com.
  • Inspect local name service flow in /etc/nsswitch.conf, plus cache or resolver status with resolvectl status or systemctl status systemd-resolved.
  • Look for connectivity or firewall issues to DNS servers, usually port 53 UDP/TCP, using ss, tcpdump, or nc.
  • If it’s intermittent, check logs with journalctl, compare multiple DNS servers, and rule out stale cache or split-DNS/VPN issues.
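
One extra tool worth naming in the nsswitch step is getent, because it resolves through the same libc path most applications use, unlike dig, which queries DNS servers directly:

```shell
getent hosts localhost   # resolves the way apps do, per /etc/nsswitch.conf
```

If getent succeeds but dig fails (or vice versa), that immediately tells you which layer is broken.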

31. What is the difference between TCP and UDP, and how does that affect Linux service troubleshooting?

TCP is connection-oriented, UDP is connectionless. TCP does a handshake, tracks sequence numbers, retransmits lost packets, and guarantees ordered delivery. UDP just sends datagrams with no delivery guarantee, no ordering, and much less overhead.

That changes troubleshooting a lot in Linux:

  • For TCP, check if the port is listening and whether the handshake completes, using ss -lntp, telnet, nc, or tcpdump.
  • Common TCP symptoms are connection refused, timeouts, resets, backlog issues, or firewall blocks.
  • For UDP, a port can look "open" but you still may not get replies, because silence is normal.
  • UDP troubleshooting relies more on packet capture, app logs, counters, and checking both directions with tcpdump -ni any udp port <port>.
  • Examples: web, SSH, databases are usually TCP; DNS, syslog, SNMP, and many streaming/VoIP workloads often use UDP.

32. Explain routing on a Linux host and how you would debug an incorrect route.

Routing on a Linux host is the kernel deciding where to send packets based on the routing table. It looks at the destination IP, picks the most specific match, then forwards traffic either directly to a local subnet or to a next-hop gateway through an interface. If nothing more specific exists, it uses the default route. Policy routing can add extra logic with multiple tables and rules.

  • Start with ip addr, ip route, and ip rule to see interfaces, routes, and policy rules.
  • Test the kernel’s decision with ip route get <destination>.
  • Check if the wrong interface, gateway, subnet mask, or metric is winning.
  • Verify neighbor resolution with ip neigh, and confirm packets with tcpdump -i <iface>.
  • Look for route sources, static config, DHCP, NetworkManager, or cloud-init overwriting changes.

If fixing it, I’d add or replace the route with ip route add or ip route replace, test, then make it persistent in the network config.

33. How do you find which process is listening on a port and determine whether it should be?

I’d answer it in two steps, identify the process, then validate whether it belongs there.

  • Use ss -ltnp for TCP or ss -lunp for UDP, then filter with grep :PORT.
  • Alternatives are lsof -i :PORT or netstat -tulpn on older systems.
  • Note the PID and process name, then inspect with ps -fp PID and systemctl status <service>.
  • Check whether the port is expected by comparing against app configs, service docs, change records, and firewall policy.
  • Validate the bind address too, 127.0.0.1 may be fine internally, 0.0.0.0 is broader exposure.

If I’m unsure whether it should be listening, I ask: what service owns it, is it required in this environment, is the exposure intentional, and does it match our hardening baseline. If not, I’d stop or disable it, then verify impact first.

34. What considerations do you take into account when hardening a Linux server?

I think about hardening in layers, starting with reducing attack surface, then tightening access, then improving detection and recovery.

  • Minimize what runs: install only needed packages, disable unused services, close unnecessary ports, remove default accounts.
  • Patch aggressively: keep OS and apps updated, subscribe to security advisories, test and automate updates where possible.
  • Lock down access: use SSH keys, disable password and root login, enforce MFA if available, least privilege with sudo.
  • Secure the network: host firewall like nftables or firewalld, restrict management access, segment sensitive services.
  • Harden the system: set correct file permissions, mount options like noexec where appropriate, enable SELinux or AppArmor.
  • Improve visibility: centralize logs, audit with auditd, monitor file integrity and suspicious auth events.
  • Protect data and recovery: encrypt sensitive data, manage secrets properly, take tested backups, document a baseline and review regularly.
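
As one concrete slice of the access layer, a hedged example of sshd_config directives commonly tightened during hardening; the exact values depend on your baseline, and the group name is hypothetical:

```
# /etc/ssh/sshd_config fragment -- illustrative hardening values
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
MaxAuthTries 3
AllowGroups ssh-admins    # hypothetical group allowed to log in
```

After editing, validate with sshd -t and reload the service before closing your current session, so a typo cannot lock you out.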

35. How do you verify that a Linux system complies with security baselines or internal standards?

I’d answer this as a mix of policy mapping, automated validation, and evidence collection.

  • Start by identifying the benchmark, CIS, DISA STIG, or internal hardening standard, and map each control to technical checks.
  • Use compliance tools like OpenSCAP, Lynis, or vendor tools to scan the host against a profile and generate reports.
  • Manually verify high risk items, things like SSH settings, password policy, firewall rules, auditd, logging, kernel params, file permissions, and running services.
  • Compare actual state to the approved baseline using config management, for example Ansible, Puppet, or Chef, to detect drift.
  • Review exceptions separately, document compensating controls, then keep artifacts like scan results, command output, and remediation tickets for audit evidence.

36. How do you capture and inspect network traffic on Linux during an incident?

I start broad, then narrow fast so I do not lose volatile evidence.

  • Identify the interface and scope with ip a, ip route, ss -tupna, and recent logs.
  • Capture safely with tcpdump, for example tcpdump -i eth0 -nn -s0 -C 100 -W 10 -w /tmp/incident.pcap, which rotates files and avoids DNS lookups.
  • Filter aggressively if needed, like host 10.0.0.5, port 443, or net 192.168.1.0/24, to reduce noise.
  • Inspect the capture with tcpdump -r incident.pcap, tshark, or Wireshark on a copy, looking for unusual egress, beaconing, scans, resets, and odd DNS.
  • Correlate packets with processes using ss, lsof -i, conntrack -L, firewall counters, and timestamps from syslog or EDR.

If the box is sensitive, I prefer writing captures to a separate disk and hashing the pcap for chain of custody.

37. How would you investigate intermittent application slowness on a Linux host?

I’d work top down and time-correlate symptoms first, because “intermittent” usually means I need evidence during the bad window.

  • Confirm scope, app only or whole host, and line up timestamps from app logs, journalctl, deploys, cron jobs, backups, or traffic spikes.
  • Check load vs real bottleneck with uptime, top, vmstat 1, iostat -xz 1, sar, looking at CPU steal, run queue, memory pressure, swap, and disk latency.
  • Verify memory issues, free -m, sar -B, dmesg for OOM, page reclaim, or THP and NUMA-related noise.
  • Look at process level hotspots, pidstat -durh 1 -p <pid>, strace -p, perf top, and open files or socket pressure with lsof, ss -tpn.
  • Rule out network and DNS, packet loss, retransmits, slow upstreams, ss, sar -n DEV, tcpdump, dig.

If I need a concrete example, I once traced periodic slowness to iowait spikes from logrotate plus compression on the same volume as the app.

38. What experience do you have configuring firewalls with iptables, nftables, or firewalld?

I have hands-on experience with all three, mostly on Linux servers in production and lab environments. My strongest background is with iptables and firewalld, and I have also worked with nftables during migrations on newer distros like RHEL 8 and Debian 11.

  • With iptables, I have built host-based rules for SSH, web traffic, NAT, port forwarding, and basic rate limiting.
  • With firewalld, I usually manage zones, services, rich rules, and both runtime and permanent changes on RHEL-based systems.
  • With nftables, I have created simple rulesets and translated older iptables logic into tables, chains, and sets.
  • I always validate changes carefully, using staged rules, console access, and testing to avoid locking myself out.
  • I also document rule intent clearly, so future troubleshooting is easier and audits go more smoothly.

39. Describe a shell script you wrote to automate a repetitive Linux administration task.

I’d answer this with a quick STAR flow: task, what the script did, how I made it safe, and the result.

At a previous job, I wrote a Bash script to automate log cleanup and disk space alerts across several Linux app servers. We had teams manually checking df -h, deleting old rotated logs, and restarting one service if a filesystem filled up. The script checked usage thresholds, archived and removed logs older than a set number of days, compressed large files, and sent an email and Slack alert with the server name and top directories from du -sh. I added safeguards like set -e, a dry-run flag, logging, and excluded critical paths. Then I scheduled it with cron. It cut manual cleanup work from a few hours a week to almost nothing and helped prevent repeat disk-full incidents.
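
A condensed, self-contained sketch of that cleanup logic, not the original script; the paths, thresholds, and demo files here are made up for illustration:

```shell
#!/usr/bin/env bash
# Illustrative sketch: disk-usage warning plus age-based log purge.
set -Eeuo pipefail

LOG_DIR="$(mktemp -d)"    # stand-in for the real app log directory
MAX_AGE_DAYS=14
THRESHOLD_PCT=80
DRY_RUN=0

# Demo data: one fresh log and one "old" rotated log
touch "$LOG_DIR/app.log"
touch -d '30 days ago' "$LOG_DIR/app.log.1"

# 1. Warn when the filesystem is filling up
usage_pct="$(df --output=pcent "$LOG_DIR" | tail -n1 | tr -dc '0-9')"
if [ "$usage_pct" -ge "$THRESHOLD_PCT" ]; then
    echo "WARN: ${LOG_DIR} at ${usage_pct}% (threshold ${THRESHOLD_PCT}%)"
fi

# 2. Purge rotated logs past retention; dry-run prints instead of deleting
if [ "$DRY_RUN" -eq 1 ]; then
    find "$LOG_DIR" -name '*.log.*' -mtime +"$MAX_AGE_DAYS" -print
else
    find "$LOG_DIR" -name '*.log.*' -mtime +"$MAX_AGE_DAYS" -delete
fi

ls "$LOG_DIR"    # app.log remains; the old app.log.1 is gone
```

The real version added email and Slack notifications, an excluded-paths list, and logging of every deletion, but the find-based retention check above is the core of it.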

40. How do you make shell scripts safer, more maintainable, and easier to troubleshoot?

I treat shell scripts like production code: fail early, validate inputs, and make behavior obvious.

  • Start with #!/usr/bin/env bash, then set -Eeuo pipefail to catch common failures.
  • Quote variables, prefer $(...), and use arrays to avoid word-splitting bugs.
  • Validate arguments up front, show usage(), and check dependencies with command -v.
  • Use functions with clear names, local variables, and a main function for flow.
  • Add trap for cleanup and useful error context, like line number and command.
  • Log key steps with timestamps; enable set -x selectively for debugging.
  • Run shellcheck, format with shfmt, and test happy path plus failure cases.

For maintainability, keep scripts idempotent, avoid hardcoded paths, mark constants with readonly, and document assumptions near the code that depends on them.
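
A small skeleton putting those practices together; the "work" it does (counting directories under a target path) is a trivial placeholder:

```shell
#!/usr/bin/env bash
# Skeleton applying the checklist above; the task itself is a placeholder.
set -Eeuo pipefail

readonly RETENTION_DAYS=7    # constant, documented where it is used
tmpfile=""

usage() { echo "usage: $0 [target-dir]" >&2; exit 2; }

# Cleanup always runs; the ERR trap reports where a failure happened.
cleanup() { [ -z "$tmpfile" ] || rm -f "$tmpfile"; }
trap cleanup EXIT
trap 'echo "error: line $LINENO: $BASH_COMMAND" >&2' ERR

log() { printf '%s %s\n' "$(date -u +%FT%TZ)" "$*"; }

main() {
    local target="${1:-/tmp}"
    [ -d "$target" ] || usage
    command -v find >/dev/null || { echo "missing dependency: find" >&2; exit 1; }

    tmpfile="$(mktemp)"
    log "scanning ${target} (retention ${RETENTION_DAYS}d)"
    find "$target" -maxdepth 1 -type d | wc -l >"$tmpfile"
    log "done: $(cat "$tmpfile") directories seen"
}

main "$@"
```

The same shape scales: real work goes in main, every external dependency is checked up front, and failures print enough context to troubleshoot without rerunning under set -x.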

41. What is the purpose of standard input, output, and error, and how do pipes and redirection help in administration?

In Linux, processes get three default data streams: stdin for input, stdout for normal output, and stderr for errors. Keeping output and errors separate matters in admin work, because you can log clean results while still seeing failures. By default, input comes from the keyboard, and output and errors go to the terminal.

  • Redirection changes where a stream goes, like > to a file, >> to append, < for input.
  • 2> redirects only errors, which is useful for troubleshooting or clean scripting.
  • Pipes, |, send stdout from one command into another, like ps aux | grep nginx.
  • This lets admins chain simple tools, filter data, automate reports, and avoid temp files.
  • Together, they make scripts more reliable, composable, and easy to debug.
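
A runnable illustration of the three streams; the file paths are throwaway examples:

```shell
# stdout and stderr are separate streams, so they can be routed apart.
ls /etc /nonexistent >/tmp/out.txt 2>/tmp/err.txt || true

head -n3 /tmp/out.txt    # the normal listing landed here
cat /tmp/err.txt         # the error message landed here, separately

# A pipe carries only stdout; 2>&1 merges stderr in first so grep sees it.
ls /etc /nonexistent 2>&1 | grep -c nonexistent
```

This is why `command > log 2> errors` is a common pattern in cron jobs: clean results in one file, failures in another.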

42. How do cron and systemd timers differ, and when would you choose one over the other?

Both schedule recurring jobs, but systemd timers are more integrated and easier to manage on modern Linux.

  • cron is simple, lightweight, and portable. Good for basic time-based jobs like 0 2 * * *.
  • systemd timers pair with service units, so you get logging in journald, dependencies, resource controls, retries, and better visibility with systemctl.
  • Timers can do calendar schedules like cron, but also relative triggers such as "10 minutes after boot".
  • systemd supports Persistent=true, which runs missed jobs after downtime. Cron usually skips them unless you use anacron.

I’d choose cron for quick, universal scripts or older systems. I’d choose systemd timers on modern distros when I want observability, service management, boot-aware scheduling, or tighter control over execution.
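
For illustration, a hypothetical timer/service pair (the unit names and script path are made up) that replaces a 0 2 * * * cron entry:

```ini
# /etc/systemd/system/cleanup.service  -- hypothetical job
[Unit]
Description=Nightly cleanup job

[Service]
Type=oneshot
ExecStart=/usr/local/bin/cleanup.sh

# /etc/systemd/system/cleanup.timer
[Unit]
Description=Run cleanup nightly at 02:00

[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true

[Install]
WantedBy=timers.target
```

You enable the timer, not the service: systemctl enable --now cleanup.timer, then check schedules with systemctl list-timers and job output with journalctl -u cleanup.service.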

43. How do you search large log files efficiently and identify the relevant events?

For large logs, I start broad, then narrow fast. The goal is to reduce volume, anchor on time, and correlate by IDs like request ID, PID, user, or IP.

  • Use grep -i, zgrep, or rg for fast keyword searches, and add -n for line numbers.
  • Filter by time window first, with awk, sed, or journalctl --since and --until.
  • Chain tools, for example grep "ERROR" app.log | grep "request_id=123" to cut noise quickly.
  • Sort and summarize patterns with sort | uniq -c | sort -nr to spot spikes or repeated failures.
  • Follow live activity with tail -f or journalctl -f, then reproduce the issue if possible.

If logs are structured, I prefer jq for JSON. In practice, I usually find one known bad event, grab its timestamp and correlation ID, then trace backward and forward a few minutes to identify the root cause.
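
A self-contained sketch of that workflow against a tiny fake log; the log contents and request ID are invented:

```shell
# Build a tiny demo log; real logs would be far larger.
log="$(mktemp)"
printf '%s\n' \
  '2024-05-01T10:00:01Z INFO  request_id=122 ok' \
  '2024-05-01T10:00:02Z ERROR request_id=123 upstream timeout' \
  '2024-05-01T10:00:03Z ERROR request_id=123 retry failed' \
  '2024-05-01T10:00:04Z INFO  request_id=124 ok' > "$log"

# 1. Keyword first, with line numbers for jumping into context
grep -n 'ERROR' "$log"

# 2. Correlate on an ID to follow one request end to end
grep 'request_id=123' "$log"

# 3. Summarize repeats to spot spikes or a single noisy request
grep 'ERROR' "$log" | awk '{print $3}' | sort | uniq -c | sort -nr
```

On compressed rotated logs the same pipelines work with zgrep in place of grep, which avoids decompressing files by hand.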

44. What backup and recovery practices have you implemented for Linux systems?

I usually answer this by covering policy, tooling, validation, and recovery time.

  • I’ve used rsync, tar, borg, and snapshot-based backups with LVM or storage arrays, depending on RPO and retention needs.
  • I separate file-level backups from system-state backups, configs in /etc, app data, databases, and boot-critical items like /boot and EFI.
  • For databases, I prefer app-consistent backups, like mysqldump, xtrabackup, or pg_dump, not just filesystem copies.
  • I automate with cron or systemd timers, encrypt backups, ship copies offsite, and follow a 3-2-1 strategy.
  • The big one is restore testing: I regularly verify checksum integrity and do test restores to confirm RTO actually works.

For example, I set up Borg with daily incremental backups, weekly prune policies, and quarterly restore drills for a web stack, which cut recovery from hours to under 30 minutes.
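
The restore-testing point can be shown with a minimal tar round trip; the paths are scratch examples, and a real setup would use borg, snapshots, or database-aware tooling as described above:

```shell
# Minimal backup-and-restore drill: archive, checksum, restore, compare.
src="$(mktemp -d)"; dst="$(mktemp -d)"
echo 'listen_port=8080' > "$src/app.conf"    # stand-in for real config

tar -czf /tmp/backup.tgz -C "$src" .
sha256sum /tmp/backup.tgz > /tmp/backup.tgz.sha256

# Restore test: verify integrity first, then actually extract and compare
sha256sum -c /tmp/backup.tgz.sha256
tar -xzf /tmp/backup.tgz -C "$dst"
diff -r "$src" "$dst" && echo 'restore verified'
```

A backup that has never been restored is an assumption, not a backup; automating exactly this kind of drill is what makes the RTO number trustworthy.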

45. What experience do you have with containers on Linux, and how do namespaces and cgroups relate to them?

I’ve worked with containers mainly through Docker, containerd, and Kubernetes on Linux, plus some lower-level troubleshooting with runc and Podman. In practice, that meant building images, tuning resource limits, debugging networking and filesystem issues, and investigating why a container could see or not see certain processes, mounts, or devices.

  • Namespaces provide isolation, they give a container its own view of PIDs, network, mounts, users, IPC, and hostname.
  • Cgroups provide control, they limit and account for CPU, memory, I/O, and process counts.
  • Containers are basically regular Linux processes wrapped with namespaces and constrained by cgroups.
  • I’ve used this knowledge to debug OOM kills, CPU throttling, and cases where PID or mount isolation behaved unexpectedly.
  • The runtime, like runc, sets up those kernel features; Docker or Kubernetes mostly orchestrate around them.
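
Both kernel features are visible from any process, container or not; a quick look via procfs (the paths are standard, the output varies by system):

```shell
# Every process belongs to a set of namespaces, exposed as symlinks:
ls -l /proc/self/ns/      # pid, net, mnt, uts, ipc, user, cgroup, ...

# And to a cgroup, where resource limits are applied and accounted:
cat /proc/self/cgroup

# A runtime like runc creates fresh namespace and cgroup entries for each
# container; comparing /proc/<pid>/ns inside vs outside makes this visible.
```

This is also a practical debugging tool: if two processes share a namespace symlink target, they share that view of the system.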

46. What experience do you have with log rotation and centralized logging on Linux systems?

I’ve worked with both local log rotation and centralized logging in production Linux environments, mostly using logrotate, rsyslog, journald, and ELK or Graylog stacks.

  • For rotation, I’ve configured /etc/logrotate.d/ policies with size- and time-based rotation, compression, retention, and copytruncate or service reloads depending on the app.
  • I usually verify ownership, permissions, and post-rotate actions, because that’s where logging often breaks.
  • On centralized logging, I’ve forwarded logs with rsyslog or Filebeat into Elasticsearch, and used structured logs when possible to improve searchability.
  • I’ve also tuned filtering, rate limiting, and disk usage so noisy services don’t flood storage or hide useful events.
  • In troubleshooting, I check dropped messages, timestamp consistency, TLS for secure forwarding, and whether apps reopen log files correctly after rotation.
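
A representative /etc/logrotate.d/ policy of the kind described; the app name, paths, and values are illustrative:

```
# /etc/logrotate.d/myapp  -- illustrative values
/var/log/myapp/*.log {
    daily
    rotate 14
    maxsize 100M
    compress
    delaycompress
    missingok
    notifempty
    create 0640 myapp adm
    postrotate
        systemctl kill -s HUP myapp.service >/dev/null 2>&1 || true
    endscript
}
```

The create line and the postrotate signal are the two places rotation most often breaks: wrong ownership blocks writes, and an app that never reopens its log file keeps writing to the deleted inode.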

47. Describe a time you recovered a Linux system from a serious outage or misconfiguration.

I’d answer this with a tight STAR format: situation, actions, outcome, and what I changed to prevent it happening again.

At a previous job, a production Linux VM stopped accepting app traffic after a firewall change. I confirmed the app was healthy locally with ss, curl localhost, and systemctl status, so I narrowed it to networking. I used the cloud serial console because SSH was blocked, reviewed nft rules, and found a bad default drop policy applied before the allow rules loaded. I rolled back the ruleset, restored access, and validated from both the load balancer and host. Afterward, I added a staged firewall deployment, automated rule validation, and out-of-band access checks. That turned a high pressure outage into a 15 minute recovery with a cleaner process afterward.

48. How do you approach patching Linux systems while minimizing business impact?

I treat patching as a risk management and change management exercise, not just package updates.

  • Start with an asset inventory, classify systems by criticality, owner, OS version, and maintenance window.
  • Test first in dev or staging, validate app compatibility, kernel updates, agents, and rollback steps.
  • Prioritize by risk, internet-facing, known CVEs, compliance deadlines, and vendor advisories.
  • Use phased rollout, small pilot group, then broader deployment, with canaries for critical services.
  • Minimize downtime with live patching where possible, clustering, load balancer draining, and reboots only in approved windows.
  • Automate with tools like Ansible, Satellite, Landscape, or unattended workflows, but keep approvals for high-risk hosts.
  • Monitor before and after, service health, logs, performance, and have backups or snapshots ready for rollback.

49. What Linux performance metrics do you monitor routinely, and why?

I usually group them by saturation, errors, and user impact. The goal is to catch bottlenecks early and tie low level signals to application symptoms.

  • CPU, %user, %system, %iowait, run queue, load average, to spot contention vs blocked work.
  • Memory, free vs available, page faults, swap in or out, OOM events, because pressure hurts latency fast.
  • Disk, IOPS, throughput, latency, queue depth, %util, to find storage bottlenecks, especially under write bursts.
  • Network, bandwidth, drops, errors, retransmits, connection states, because packet loss and retries kill performance.
  • Process level stats, top CPU or RSS users, thread counts, FD usage, to catch noisy neighbors or leaks.
  • App metrics, p95 or p99 latency, error rate, QPS, backlog, because infra metrics only matter if users feel it.

Tools I’d reach for are top, vmstat, iostat, sar, ss, and Prometheus plus Grafana.

50. What is LVM, and what advantages and risks does it introduce?

LVM, Logical Volume Manager, sits between disks and filesystems and gives you flexible storage. Instead of carving one disk into fixed partitions, you create physical volumes, group them into a volume group, then carve out logical volumes that can be resized or moved more easily.

  • Advantages: easy online resizing, simpler storage pooling, snapshots for backups, and cleaner disk replacement or migration.
  • It is great when disk needs change over time, especially on servers and VMs.
  • Risks: added complexity, harder troubleshooting, and if the volume group metadata is damaged, multiple logical volumes can be affected.
  • Snapshots can hurt performance and fill up if not sized well.
  • It is not a backup, and bad commands like lvremove can be destructive across a larger storage pool.

Get Interview Coaching from Linux Experts

Knowing the questions is just the start. Work with experienced professionals who can help you perfect your answers, improve your presentation, and boost your confidence.

Complete your Linux interview preparation

Comprehensive support to help you succeed at every stage of your interview journey

Still not convinced? Don't just take our word for it

We've already delivered 1-on-1 mentorship to thousands of students, professionals, managers and executives. Even better, they've left an average rating of 4.9 out of 5 for our mentors.

Find Linux Interview Coaches