Understanding DNS and Name Resolution: A DevOps Guide

How DNS and /etc/hosts affect your services

The Phone Book of the Internet

Remember when you had to memorize phone numbers? I still remember my childhood friend’s number: 555-0123. But imagine if you had to remember every friend’s number, every restaurant, every business. That’s essentially what the early internet was like.

The Problem We Needed to Solve

Back in the day, if you wanted to visit a website, you had to type its numeric address, something like 93.184.216.34, into your browser. Try remembering that for every site you visit! As the internet grew, this became impossible.

Enter DNS: The Internet’s Phone Book

DNS (Domain Name System) solved this by creating a massive, distributed phone book. Instead of remembering 172.217.14.196, you could just type google.com. Much easier, right?

Think of it this way: when you want to call your friend Sarah, you don’t memorize her number. You save it in your contacts as “Sarah” and let your phone do the lookup. DNS works the same way for websites and services.

A Day in the Life of a DNS Query

Let’s follow what happens when you type github.com in your browser:

  1. You type: github.com
  2. Your computer asks: “Hey, what’s the IP address for github.com?”
  3. DNS responds: “That’s 140.82.112.3”
  4. Your browser connects: to 140.82.112.3
  5. You see: GitHub’s homepage

This happens in milliseconds, thousands of times a day, without you even thinking about it.
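
You can watch this exchange yourself with dig (the address you get back may differ from the one above, since GitHub publishes several):

# Ask your resolver for github.com's A record
dig +short github.com A
# Example output (yours may differ): 140.82.112.3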

Why DevOps Engineers Care

As a DevOps engineer, you’re not just a user of DNS – you’re an architect of it. Your services need to find each other reliably. Your monitoring needs to reach your applications. Your load balancers need to distribute traffic. DNS is the foundation that makes all of this possible.

The Great DNS Treasure Hunt

Ever wondered how your computer finds the right server when you type a domain name? It’s like a treasure hunt with clues scattered across the globe. Let me walk you through this adventure.

The Cast of Characters

Before we start our treasure hunt, let’s meet the players:

  • You (the client): The one asking “Where is example.com?”
  • Recursive resolver: Your helpful librarian who does the research
  • Root nameservers: The wise elders who know who handles each top-level domain
  • TLD nameservers: The specialists (.com, .org, .net experts)
  • Authoritative nameservers: The final authorities with the answer

The Hunt Begins

Imagine you’re trying to find blog.example.com. Here’s the journey:

Step 1: Check Your Memory

Your computer first checks its cache. “Did I ask about blog.example.com recently?” If yes, great! We’re done. If not, the hunt continues.

Step 2: Ask Your Librarian

Your computer asks the recursive resolver (usually your ISP’s DNS server): “Do you know where blog.example.com is?”

The resolver might say: “Let me check my notes… nope, I don’t have that one. But don’t worry, I’ll find it for you!”

Step 3: Start at the Top

The resolver goes to the root nameservers and asks: “Where can I find information about blog.example.com?”

Root nameserver: “I don’t know about blog.example.com specifically, but I know who handles all .com domains. Go ask the .com nameservers.”

Step 4: Get More Specific

The resolver asks the .com nameservers: “Where can I find blog.example.com?”

.com nameserver: “I don’t have blog.example.com, but I know who’s in charge of example.com. Here are the nameservers for example.com.”

Step 5: Find the Expert

Finally, the resolver asks example.com’s nameservers: “Where is blog.example.com?”

Authoritative nameserver: “Ah yes! blog.example.com is at 192.0.2.1. Here you go!”

Step 6: Treasure Found!

The resolver brings this information back to your computer, and voilà! Your browser can now connect to 192.0.2.1 to load the blog.
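
You can replay this whole treasure hunt from your terminal. dig +trace skips your resolver’s cache, starts at the root servers, and follows each referral down to the authoritative answer:

# Follow the referrals: root -> .com -> example.com's nameservers
dig +trace example.com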

Why This Matters for DevOps

Understanding this process helps you troubleshoot when things go wrong. If your service isn’t reachable, you can check each step:

  • Is it cached incorrectly?
  • Is the recursive resolver working?
  • Are the authoritative nameservers responding?
  • Is the DNS record correct?

The Plot Twist: TTL (Time To Live)

Here’s where it gets interesting. Each DNS record comes with a TTL – essentially an expiration date. When the resolver caches the answer, it remembers it for that long.

Set TTL too high (like 24 hours), and changes take forever to propagate. Set it too low (like 30 seconds), and you’re putting unnecessary load on the DNS system.

It’s like deciding how long to keep milk in your fridge. Too long, and it goes bad. Too short, and you’re constantly going to the store.
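
You can read a record’s TTL straight out of a dig answer; the second field is the number of seconds a resolver may keep it (the values below are just illustrative):

dig +noall +answer example.com A
# example.com.    3600    IN    A    93.184.216.34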

The Local Shortcut: /etc/hosts Explained

Let me tell you about the time I spent three hours debugging why my application couldn’t connect to a database, only to discover someone had added an entry to /etc/hosts that was pointing to the wrong server. It was like having a sticky note on your phone that said “Sarah: 555-WRONG” while Sarah had changed her number months ago.

What is /etc/hosts?

The /etc/hosts file is your computer’s private phone book. Before your system even thinks about asking DNS servers, it checks this local file first. It’s like having speed dial numbers that override everything else.

The File That Time Forgot

Here’s what a typical /etc/hosts file looks like:

127.0.0.1       localhost
::1             localhost
192.168.1.100   myapp.local
10.0.0.50       database.internal

Each line says: “When someone asks for this name, send them to this IP address.”

Why /etc/hosts Exists

Back in the early internet days (we’re talking 1970s), there was no DNS. Instead, there was a single file called HOSTS.TXT that everyone downloaded from a central server. As the internet grew, this became impossible to maintain.

But the /etc/hosts file stuck around for local overrides and quick fixes. It’s like having a personal notepad that you check before looking in the phone book.

The Order of Operations

When you type myapp.local in your browser, here’s what happens:

  1. Check /etc/hosts first: “Is myapp.local in my local file?”
  2. If found: Use that IP address, skip DNS entirely
  3. If not found: Proceed with normal DNS resolution

This is why that database connection issue took me three hours to debug – the /etc/hosts entry was silently overriding DNS!
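
A quick way to catch this kind of override on Linux: getent hosts resolves names the way your applications do (checking /etc/hosts first), while dig goes straight to DNS and ignores the file. If the two disagree, something local is interfering. Using the database.internal entry from the file above:

# What your applications will actually get (honours /etc/hosts)
getent hosts database.internal

# What DNS itself says (/etc/hosts is ignored; an empty answer means the name is purely local)
dig +short database.internal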

Common DevOps Use Cases

Development Environment Setup

127.0.0.1       api.myapp.local
127.0.0.1       db.myapp.local
127.0.0.1       cache.myapp.local

This lets you run everything locally while using production-like domain names.

Testing Different Environments

# Production
# 203.0.113.10    api.example.com

# Staging (currently active)
198.51.100.5    api.example.com

Just comment/uncomment lines to switch between environments.

Emergency Workarounds

When DNS is broken or you need to bypass a problematic server, /etc/hosts can be a lifesaver:

192.0.2.100     problematic-service.com

The Dark Side of /etc/hosts

While powerful, /etc/hosts can be dangerous:

  • Silent Overrides: It can mask real DNS issues
  • Forgotten Entries: Old entries can cause mysterious problems
  • No TTL: Changes never expire on their own and require manual updates
  • Local Only: Only affects the current machine

Best Practices for DevOps

  1. Document Everything: Comment your entries

# Temporary fix for staging deployment - Remove after 2024-01-15
192.168.1.50    api.staging.com

  2. Use Version Control: Keep your /etc/hosts in Git for team projects

  3. Automate When Possible: Use configuration management tools

  4. Regular Cleanup: Review and remove old entries

The Great Debugging Checklist

When DNS seems broken, check /etc/hosts first:

# On Linux/Mac
cat /etc/hosts | grep -v "^#" | grep -v "^$"

# On Windows
type C:\Windows\System32\drivers\etc\hosts

A Real-World Story

I once worked on a team where a developer added 127.0.0.1 payment-gateway.com to their /etc/hosts for testing. They forgot about it, and weeks later they couldn’t figure out why the payment system wasn’t working in their development environment. Something was responding (their own machine, via localhost), but only with a 404 error.

The lesson? Always check your local overrides first!

DNS in the Container World

The first time I deployed a microservice to Kubernetes, I was amazed. My service could talk to user-service, payment-service, and notification-service by name – no IP addresses, no configuration files, just simple names. It felt like magic.

Then I had to debug a service discovery issue at 2 AM, and I realized I needed to understand how this magic actually worked.

The Old World vs. The New World

Traditional deployments: Everything runs on known servers with predictable IP addresses. You might have web1.company.com, db1.company.com, and so on.

Container world: Services come and go. IP addresses change. A service might be running on any of dozens of nodes. payment-service could be at 10.244.1.15 now, and 10.244.3.8 after a restart.

Docker’s DNS Magic

When you run Docker containers on a user-defined network, Docker provides an embedded DNS server for that network (the default bridge network doesn’t get name resolution). Here’s what happens:

# Create a user-defined network (name-based resolution only works on these)
docker network create app-net

# Start two containers on that network
docker run -d --name web-app --network app-net nginx
docker run -d --name database --network app-net -e POSTGRES_PASSWORD=secret postgres

# From inside web-app, you can reach the database by name
docker exec web-app getent hosts database   # prints the database container's IP

Docker’s built-in DNS server resolves database to whatever IP the database container currently has. It’s like having a very smart /etc/hosts file that updates itself.

Docker Compose: The Neighborhood Directory

With Docker Compose, this gets even nicer:

version: '3'
services:
  web:
    image: nginx
  api:
    image: my-api
  database:
    image: postgres

Now web can talk to api, api can talk to database, and everyone knows everyone else by name. It’s like living in a small neighborhood where everyone knows each other.
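
You can check the neighborhood directory from inside any of the services; a minimal sketch, assuming the Compose file above is up and running:

# Resolve the api service by name from inside the web container
docker compose exec web getent hosts api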

Kubernetes: The Smart City

Kubernetes takes this concept and scales it up. In Kubernetes, you have:

  • Services: Stable names for groups of pods
  • Pods: The actual containers running your application
  • DNS: Automatic service discovery

Here’s a simple example:

apiVersion: v1
kind: Service
metadata:
  name: user-service
spec:
  selector:
    app: user-api
  ports:
  - port: 80

Now any pod in your cluster can reach this service at user-service.default.svc.cluster.local (or just user-service if you’re in the same namespace).
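
A quick way to confirm the name is live is a throwaway busybox pod (this assumes the Service above has already been applied):

# Resolve the service from inside the cluster using a one-off pod
kubectl run -it --rm dns-check --image=busybox --restart=Never -- nslookup user-service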

The DNS Hierarchy in Kubernetes

Kubernetes creates a DNS hierarchy that makes sense:

  • user-service (same namespace)
  • user-service.production (different namespace)
  • user-service.production.svc.cluster.local (fully qualified)

It’s like having addresses: “John” (same house), “John in Apartment 3B” (same building), “John at 123 Main St, Apartment 3B” (full address).

When Things Go Wrong: A Debugging Story

I once had a service that couldn’t connect to a database. The error was confusing: “Could not resolve hostname ‘db’”. Here’s how I debugged it:

  1. Check the obvious: Was the database service running?
  2. Test DNS resolution:
    
    kubectl exec -it my-pod -- nslookup db
    
  3. Check the full name:
    
    kubectl exec -it my-pod -- nslookup db.default.svc.cluster.local
    
  4. Found the issue: The pod was in a different namespace!

The fix was simple – use the full service name or move the pod to the right namespace.

Mixing Local and Cluster DNS

Here’s where it gets interesting. Sometimes you need your containerized application to talk to external services. You might have:

  • Internal services (handled by Kubernetes DNS)
  • External services (handled by regular DNS)
  • Local overrides (using ConfigMaps or environment variables)

Best Practices I’ve Learned

  1. Use service names, not IP addresses

    
    # Good
    DATABASE_URL: postgresql://user:pass@database:5432/mydb
    
    # Bad (IP will change)
    DATABASE_URL: postgresql://user:pass@10.244.1.15:5432/mydb
    
  2. Understand your DNS hierarchy

    • Same namespace: service-name
    • Different namespace: service-name.namespace
    • External: api.external.com
  3. Test DNS resolution

    
    # Quick test
    kubectl exec -it pod-name -- nslookup service-name
    
    # Detailed debugging
    kubectl exec -it pod-name -- dig service-name
    
  4. Use health checks: Services come and go in containers. Make sure your application handles DNS failures gracefully, as sketched below.
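
For that last point, even a crude readiness gate in a container’s entrypoint helps. A minimal sketch, assuming the dependency is called user-service:

# Wait until the dependency's name resolves before starting the app
until getent hosts user-service > /dev/null 2>&1; do
    echo "Waiting for user-service DNS..."
    sleep 2
done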

The Plot Twist: ExternalName Services

Kubernetes has a neat trick for bridging internal and external DNS:

apiVersion: v1
kind: Service
metadata:
  name: external-api
spec:
  type: ExternalName
  externalName: api.partner.com

Now your internal services can call external-api instead of api.partner.com. If the external API changes, you only update one place.
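
Under the hood, an ExternalName Service is just a CNAME record in cluster DNS, so there is no cluster IP to manage. You can see this for yourself once the Service above is applied:

# The Service has no cluster IP; in-cluster DNS simply hands back a CNAME to api.partner.com
kubectl get service external-api -o wide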

Troubleshooting DNS: A Detective’s Guide

It’s 3 AM. Your monitoring system is screaming. Users can’t access your application. The error logs are full of “DNS resolution failed” messages. Sound familiar?

I’ve been there more times than I’d like to admit. Over the years, I’ve developed a systematic approach to DNS troubleshooting that’s saved me countless hours. Let me share it with you.

The DNS Detective’s Toolkit

Before we dive into specific problems, let’s gather our tools:

# Basic DNS lookup
nslookup example.com

# More detailed information
dig example.com

# Test specific DNS servers
dig @8.8.8.8 example.com

# Trace the full DNS path
dig +trace example.com

# Check local DNS cache (varies by OS)
systemctl status systemd-resolved  # Linux
dscacheutil -q host -a name example.com  # macOS
ipconfig /displaydns  # Windows

Case Study 1: The Mysterious Slow Website

Problem: “Our website is really slow, but the server is fine.”

Investigation:

# Check DNS response time
dig example.com | grep "Query time"
# Result: Query time: 5000 msec

Root Cause: DNS queries were taking 5 seconds! The DNS server was overloaded.

Solution:

  1. Temporarily switched to faster DNS servers (8.8.8.8, 1.1.1.1)
  2. Implemented DNS caching
  3. Contacted the DNS provider about the performance issue

Lesson: Slow DNS can make fast websites feel slow. Always check DNS response times.

Case Study 2: The Vanishing Microservice

Problem: “Payment service can’t connect to user service in Kubernetes.”

Investigation:

# From payment service pod
kubectl exec -it payment-pod -- nslookup user-service
# Result: NXDOMAIN

# Check if service exists
kubectl get svc user-service
# Result: No resources found

Root Cause: The user service was deployed but the Kubernetes Service object wasn’t created.

Solution:

# Create the missing service
kubectl expose deployment user-service --port=80

Lesson: In Kubernetes, deployments create pods, but you need services for DNS resolution.

Case Study 3: The Cached Nightmare

Problem: “I updated the DNS record 6 hours ago, but some users still see the old site.”

Investigation:

# Check the TTL of the DNS record (+noall +answer prints just the answer line)
dig +noall +answer example.com A
# Result: example.com. 86400 IN A 192.0.2.1

Root Cause: The TTL was set to 86400 seconds (24 hours).

Solution:

  1. Lowered TTL to 300 seconds (5 minutes) for future changes
  2. Waited for the old cache to expire
  3. Asked users to flush their DNS cache or restart their routers

Lesson: Plan DNS changes in advance. Lower TTL before making changes, raise it after they’re stable.
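
A useful trick while you wait is to compare what the authoritative nameserver says with what your resolver is still caching (ns1.example.com below is a stand-in for your zone’s actual nameserver):

# Fresh answer straight from the authoritative server (no resolver cache involved)
dig +short @ns1.example.com example.com A

# What your local resolver is still handing out
dig +short example.com A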

Case Study 4: The Split-Brain DNS

Problem: “The application works from the office but not from home.”

Investigation:

# From office
dig api.company.com
# Result: 192.168.1.100 (internal IP)

# From home
dig api.company.com
# Result: 203.0.113.100 (external IP)

Root Cause: Split-brain DNS setup where internal users got internal IPs, external users got external IPs. The internal server was down.

Solution:

  1. Fixed the internal server
  2. Implemented health checks
  3. Added monitoring for both internal and external endpoints

Lesson: Test your services from different network locations.

The Ultimate DNS Troubleshooting Checklist

When DNS problems strike, follow this systematic approach:

Step 1: Confirm It’s Actually DNS

# Try connecting by IP address (send the Host header in case the server uses name-based virtual hosts)
curl -H "Host: example.com" http://192.0.2.1
# If this works but the domain doesn't, it's DNS

Step 2: Check Local Overrides

# Check /etc/hosts
grep -v "^#" /etc/hosts | grep -v "^$"

# Check if you're in a container with custom DNS
cat /etc/resolv.conf

Step 3: Test DNS Resolution

# Basic test
nslookup problematic-domain.com

# Test different record types
dig problematic-domain.com A
dig problematic-domain.com AAAA
dig problematic-domain.com CNAME

Step 4: Test Different DNS Servers

# Your current DNS server
dig problematic-domain.com

# Google's DNS
dig @8.8.8.8 problematic-domain.com

# Cloudflare's DNS
dig @1.1.1.1 problematic-domain.com

Step 5: Trace the DNS Path

# See the full resolution path
dig +trace problematic-domain.com

Step 6: Check DNS Propagation

Use online tools like whatsmydns.net to see how DNS looks from different locations worldwide.
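
You can also do a rough propagation check from the command line by asking several public resolvers directly:

# Compare answers from a few well-known public resolvers
for ns in 8.8.8.8 1.1.1.1 9.9.9.9; do
    echo -n "$ns: "
    dig +short @"$ns" example.com A
done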

Prevention: Building Resilient DNS

Monitor DNS Performance:

#!/bin/bash
# Simple monitoring script: measure query time against a few resolvers
for server in 8.8.8.8 1.1.1.1 your-dns-server; do
    time=$(dig @"$server" example.com | grep "Query time" | awk '{print $4}')
    echo "DNS server $server: ${time}ms"
done

Implement DNS Caching:

# Install local DNS cache
sudo apt install dnsmasq  # Ubuntu
brew install dnsmasq      # macOS
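
Once a local cache is running, a repeated query should confirm it is working; a quick check, assuming dnsmasq is listening on 127.0.0.1 (its default) and forwarding to your usual upstream servers:

# The second query should come back from the cache with a near-zero query time
dig @127.0.0.1 example.com | grep "Query time"
dig @127.0.0.1 example.com | grep "Query time"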

Use Multiple DNS Servers:

# /etc/resolv.conf
nameserver 8.8.8.8
nameserver 1.1.1.1
nameserver 208.67.222.222

Document Your DNS Setup:

  • Keep records of all DNS changes
  • Document TTL values and why they’re set that way
  • Maintain a diagram of your DNS hierarchy

Final Thoughts: DNS is Infrastructure

DNS isn’t just about resolving domain names – it’s critical infrastructure that everything else depends on. When DNS breaks, everything breaks.

The key to mastering DNS is understanding that it’s not magic. It’s a distributed system with caching, redundancy, and failure modes just like any other system. When you approach DNS problems with the same systematic thinking you’d use for any other infrastructure issue, they become much more manageable.

Remember: DNS problems are often symptoms of larger issues. A DNS failure might indicate network problems, server overload, or configuration drift. Always look at the bigger picture.

And finally, the golden rule of DNS: when in doubt, check your /etc/hosts file first. You’d be surprised how often that’s where the problem is hiding.

We’ve journeyed from the basics of DNS to real-world troubleshooting scenarios. You now understand why DNS exists and how it works, the step-by-step process of DNS resolution, how /etc/hosts provides local overrides, DNS in containerized environments, and systematic approaches to DNS troubleshooting.

The next time you encounter a DNS issue, you’ll have the knowledge and tools to tackle it confidently. Remember, DNS problems are solvable – it just takes patience and systematic thinking.

Happy troubleshooting!