### 1. Primary Index (Clustered Index)
In storage engines with clustered tables (e.g., MySQL's InnoDB, SQL Server), the primary index determines the physical order of rows on disk. (PostgreSQL stores tables as heaps, so every index there is technically secondary.)
```sql
-- Creating a table with a primary key (automatically creates the clustered index)
CREATE TABLE users (
    id INT PRIMARY KEY,
    username VARCHAR(50) NOT NULL,
    email VARCHAR(100) NOT NULL,
    status VARCHAR(20) DEFAULT 'active',  -- referenced by later examples
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```
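Because rows are kept in primary-key order, range scans on the key are cheap. A minimal sketch (the id range is illustrative):

```sql
-- Rows with adjacent ids are physically adjacent, so this reads
-- a contiguous range of pages rather than scattered ones:
SELECT * FROM users WHERE id BETWEEN 1000 AND 2000;
```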
### 2. Secondary Index (Non-Clustered Index)
A secondary index stores pointers to row locations (or, in InnoDB, the primary-key values) without changing the physical order of the table.
```sql
-- Creating secondary indexes
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_users_username ON users(username);
CREATE INDEX idx_users_created_at ON users(created_at);
```
### 3. Composite Index
An index on multiple columns, useful for queries that filter on several fields at once.
```sql
-- Composite index for queries filtering by status and created_at
CREATE INDEX idx_users_status_created ON users(status, created_at);

-- This index is effective for queries like:
-- SELECT * FROM users WHERE status = 'active' AND created_at > '2025-01-01';
-- SELECT * FROM users WHERE status = 'active'; -- uses the leftmost prefix
```
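The flip side of the leftmost-prefix rule is worth showing: a query that skips the leading column generally cannot use the index.

```sql
-- Filtering on created_at alone skips the leading status column,
-- so idx_users_status_created is typically not usable here:
SELECT * FROM users WHERE created_at > '2025-01-01'; -- likely a sequential scan
```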
### 4. Unique Index
Ensures uniqueness while providing fast lookups.
```sql
-- Unique index on email
CREATE UNIQUE INDEX idx_users_email_unique ON users(email);

-- This prevents duplicate emails and speeds up email lookups
```
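A quick sketch of the constraint in action (the ids are illustrative and the error text below is PostgreSQL-style):

```sql
INSERT INTO users (id, username, email) VALUES (1, 'alice', 'alice@example.com');
INSERT INTO users (id, username, email) VALUES (2, 'bob',   'alice@example.com');
-- ERROR: duplicate key value violates unique constraint "idx_users_email_unique"
```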
### 5. Partial Index
Indexes only rows that meet certain conditions (PostgreSQL, SQLite).
```sql
-- Index only active users
CREATE INDEX idx_active_users ON users(username) WHERE status = 'active';

-- This index is smaller and more efficient for queries on active users
```
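Note that the planner can only use a partial index when the query's WHERE clause implies the index predicate, as in this sketch:

```sql
-- The status = 'active' condition matches the index predicate,
-- so idx_active_users is a candidate here:
SELECT id, username FROM users
WHERE status = 'active' AND username = 'alice';
```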
## Indexing Strategies
### 1. Analyze Query Patterns
Before creating indexes, understand your application's query patterns:
```sql
-- Common query patterns to analyze
SELECT * FROM users WHERE email = ?;                              -- single-column lookup
SELECT * FROM users WHERE status = ? AND created_at > ?;          -- multi-column filter
SELECT * FROM orders WHERE user_id = ? ORDER BY created_at DESC;  -- filter + sort
SELECT COUNT(*) FROM users WHERE status = ?;                      -- aggregation
```
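Each pattern suggests a candidate index. The filter + sort pattern in particular maps to a composite index whose trailing column matches the ORDER BY (the orders table here is assumed, not defined above):

```sql
-- Lets the engine walk the index in output order instead of sorting:
CREATE INDEX idx_orders_user_created ON orders(user_id, created_at DESC);
```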
### 2. Index Selectivity
Create indexes on columns with high selectivity (many unique values):
```sql
-- Check the selectivity of a candidate column (email shown as the example)
SELECT
    COUNT(DISTINCT email) AS unique_values,
    COUNT(*) AS total_rows,
    COUNT(DISTINCT email) * 100.0 / COUNT(*) AS selectivity_percentage
FROM users;

-- High selectivity (good for indexing): email, username, id
-- Low selectivity (poor candidates): status, gender, boolean flags
```
### 3. Covering Indexes
Include all columns needed by a query in the index to avoid table lookups:
```sql
-- Covering index for a specific query
-- (INCLUDE is supported by PostgreSQL 11+ and SQL Server; in MySQL,
-- add the extra columns to the index key instead)
CREATE INDEX idx_users_covering ON users(status, created_at)
INCLUDE (username, email);

-- This query can be satisfied entirely from the index:
SELECT username, email FROM users
WHERE status = 'active' AND created_at > '2025-01-01';
```
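To verify the index actually covers the query, check the execution plan; in PostgreSQL an index-only scan confirms no heap lookups were needed (sketch):

```sql
EXPLAIN SELECT username, email FROM users
WHERE status = 'active' AND created_at > '2025-01-01';
-- Expect: Index Only Scan using idx_users_covering on users
```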
## Advanced Indexing Techniques
### 1. Function-Based Indexes
Index the result of functions or expressions:
```sql
-- Index for case-insensitive searches
CREATE INDEX idx_users_email_lower ON users(LOWER(email));

-- Index for date-based queries
CREATE INDEX idx_orders_year_month ON orders(EXTRACT(YEAR FROM created_at), EXTRACT(MONTH FROM created_at));

-- Usage: the query must use the same expression as the index
SELECT * FROM users WHERE LOWER(email) = LOWER('user@example.com');
```
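MySQL (8.0.13+) supports the same idea as functional indexes, with slightly different syntax: the expression must be wrapped in its own parentheses. A sketch:

```sql
-- MySQL functional index: note the extra parentheses around the expression
CREATE INDEX idx_users_email_lower ON users ((LOWER(email)));
```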
### 2. Hash Indexes
Hash indexes serve equality comparisons only. PostgreSQL supports them natively (and they are crash-safe since version 10); in MySQL, only the MEMORY engine honors `USING HASH`, while InnoDB silently builds a B-tree instead:
```sql
-- Hash index for exact matches (PostgreSQL syntax)
CREATE INDEX idx_users_email_hash ON users USING HASH (email);

-- Good for:     WHERE email = 'specific@email.com'
-- Not good for: WHERE email LIKE '%@gmail.com' or any range/ordering predicate
```
### 3. GIN/GiST Indexes (PostgreSQL)
For complex data types like arrays, JSON, and full-text search:
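The examples below assume a posts table and a JSONB preferences column on users, neither of which appears in the schema above; a minimal assumed setup:

```sql
-- Hypothetical schema assumed by the GIN examples below
CREATE TABLE posts (
    id INT PRIMARY KEY,
    tags TEXT[],      -- array column for the tags index
    content TEXT      -- full-text searched below
);
ALTER TABLE users ADD COLUMN preferences JSONB;  -- @> requires jsonb, not json
```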
```sql
-- GIN index for JSON data
CREATE INDEX idx_users_preferences ON users USING GIN(preferences);

-- GIN index for array data
CREATE INDEX idx_posts_tags ON posts USING GIN(tags);

-- Full-text search index
CREATE INDEX idx_posts_content_fts ON posts USING GIN(to_tsvector('english', content));

-- Usage examples
SELECT * FROM users WHERE preferences @> '{"theme": "dark"}';
SELECT * FROM posts WHERE tags @> ARRAY['javascript', 'react'];
SELECT * FROM posts WHERE to_tsvector('english', content) @@ to_tsquery('database & optimization');
```
## Index Maintenance and Monitoring
### 1. Monitor Index Usage
```sql
-- PostgreSQL: check index usage
SELECT
    schemaname,
    relname AS table_name,
    indexrelname AS index_name,
    idx_scan AS index_scans,
    idx_tup_read AS tuples_read,
    idx_tup_fetch AS tuples_fetched
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;
```
```sql
-- MySQL: check index cardinality (a proxy for selectivity, not usage)
SELECT
    TABLE_SCHEMA,
    TABLE_NAME,
    INDEX_NAME,
    CARDINALITY
FROM information_schema.STATISTICS
WHERE TABLE_SCHEMA = 'your_database'
ORDER BY CARDINALITY DESC;
```
### 2. Identify Unused Indexes
```sql
-- PostgreSQL: find indexes that have never been scanned
SELECT
    schemaname,
    relname AS table_name,
    indexrelname AS index_name,
    idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
  AND indexrelname NOT LIKE '%_pkey';
```
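MySQL exposes a similar view through the sys schema (available in 5.7+), provided performance_schema is enabled:

```sql
-- MySQL: indexes with no recorded reads since server start
SELECT * FROM sys.schema_unused_indexes;
```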
### 3. Index Maintenance
```sql
-- Rebuild fragmented indexes (PostgreSQL)
REINDEX INDEX idx_users_email;
REINDEX TABLE users;
-- PostgreSQL 12+: REINDEX INDEX CONCURRENTLY avoids blocking writes

-- Update planner statistics
ANALYZE users;

-- MySQL: rebuild the table and update statistics
OPTIMIZE TABLE users;
```
## Performance Testing
### 1. Query Execution Plans
```sql
-- PostgreSQL
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'test@example.com';

-- MySQL
EXPLAIN FORMAT=JSON SELECT * FROM users WHERE email = 'test@example.com';

-- SQL Server
SET STATISTICS IO ON;
SELECT * FROM users WHERE email = 'test@example.com';
```
### 2. Benchmark Queries
```python
import time

import psycopg2


def benchmark_query(connection, query, iterations=1000):
    """Run a query repeatedly and report the average execution time."""
    cursor = connection.cursor()
    start_time = time.time()
    for _ in range(iterations):
        cursor.execute(query)
        cursor.fetchall()
    end_time = time.time()
    avg_time = (end_time - start_time) / iterations
    print(f"Average query time: {avg_time:.4f} seconds")
    cursor.close()
    return avg_time


# Usage: run once, create the index, then run again to compare
conn = psycopg2.connect("dbname=test user=postgres")
benchmark_query(conn, "SELECT * FROM users WHERE email = 'test@example.com'")
conn.close()
```
## Best Practices
### 1. Index Design Guidelines
```sql
-- ✅ Good: index on a frequently queried column
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- ✅ Good: composite index with column order matching query patterns
CREATE INDEX idx_orders_status_date ON orders(status, created_at);

-- ❌ Bad: many overlapping single-column indexes
CREATE INDEX idx_users_first_name ON users(first_name);
CREATE INDEX idx_users_last_name ON users(last_name);
CREATE INDEX idx_users_age ON users(age);

-- ✅ Better: one composite index (serves leftmost-prefix queries:
-- first_name alone, first_name + last_name, or all three columns)
CREATE INDEX idx_users_name_age ON users(first_name, last_name, age);
```
### 2. Avoid Over-Indexing
```sql
-- PostgreSQL: monitor per-table index count and size
SELECT
    table_name,
    COUNT(*) AS index_count,
    pg_size_pretty(SUM(size_bytes)) AS total_index_size
FROM (
    SELECT
        relname AS table_name,
        pg_relation_size(indexrelid) AS size_bytes
    FROM pg_stat_user_indexes
) index_sizes
GROUP BY table_name
ORDER BY SUM(size_bytes) DESC;
```
### 3. Regular Maintenance Schedule
```bash
#!/bin/bash
# Daily index maintenance script

# Update planner statistics
psql -d mydb -c "ANALYZE;"

# Rebuild large, actively used indexes (example size threshold: ~1 MB).
# Generate the REINDEX statements, then pipe them back into psql to run them.
psql -d mydb -t -A -c "
    SELECT 'REINDEX INDEX ' || quote_ident(schemaname) || '.' || quote_ident(indexrelname) || ';'
    FROM pg_stat_user_indexes
    WHERE idx_scan > 0
      AND pg_relation_size(indexrelid) > 1000000;
" | psql -d mydb
```
## Common Pitfalls
### 1. Wrong Column Order in Composite Indexes
```sql
-- ❌ Wrong: low-selectivity column first
CREATE INDEX idx_users_status_email ON users(status, email);

-- ✅ Right: high-selectivity column first
CREATE INDEX idx_users_email_status ON users(email, status);
```
### 2. Indexing Small Tables
```sql
-- Indexing very small tables (roughly under a thousand rows) is usually
-- not worth the overhead: the planner tends to scan them anyway
SELECT COUNT(*) FROM small_lookup_table; -- returns 50 rows
-- No need for indexes here
```
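You can confirm this with an execution plan (the code column here is hypothetical):

```sql
EXPLAIN SELECT * FROM small_lookup_table WHERE code = 'X';
-- Expect a Seq Scan: the whole table fits in a page or two,
-- so an index lookup would not save any I/O
```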
### 3. Ignoring Index Maintenance
```sql
-- Set up automated statistics updates

-- PostgreSQL: ensure autovacuum keeps statistics fresh
ALTER TABLE users SET (autovacuum_enabled = true);

-- MySQL: enable automatic statistics recalculation
SET GLOBAL innodb_stats_auto_recalc = ON;
```
## Conclusion
Effective database indexing requires understanding your data, query patterns, and performance requirements. Key takeaways:
- **Analyze before indexing**: Understand your query patterns
- **Choose the right index type**: B-tree, hash, GIN, etc.
- **Consider composite indexes**: For multi-column queries
- **Monitor and maintain**: Regular performance reviews
- **Avoid over-indexing**: Balance query speed with write performance
Remember: indexes speed up reads but slow down writes. Always measure the impact and find the right balance for your specific use case.