
# Introduction to Machine Learning with Python

Dive into the world of machine learning with Python, covering supervised learning, data preprocessing, and model evaluation techniques.

*Sasank - BTech CSE Student · January 5, 2025 · 15 min read*
Tags: Python, ML, Data Science, AI


Machine Learning has revolutionized how we solve complex problems across industries. Python, with its rich ecosystem of libraries, has become the go-to language for ML practitioners. This guide will take you through the fundamentals of machine learning using Python.

## What is Machine Learning?

Machine Learning is a subset of artificial intelligence that enables computers to learn and make decisions from data without being explicitly programmed for every scenario.
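
To see what "without being explicitly programmed" means in practice, here is a minimal sketch (the spam example and toy data are invented for illustration) contrasting a hand-written rule with a model that learns the same kind of decision from examples:

```python
from sklearn.linear_model import LogisticRegression

# Hard-coded rule: we pick the threshold ourselves
def is_spam_rule(num_links):
    return num_links > 3  # threshold chosen by hand

# Learned rule: the model infers a decision boundary from labeled examples
X = [[0], [1], [2], [5], [7], [9]]   # number of links per email (toy data)
y = [0, 0, 0, 1, 1, 1]               # 0 = not spam, 1 = spam (toy labels)

model = LogisticRegression().fit(X, y)
print(model.predict([[4]]))  # the model decides, based on what it learned
```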

### Types of Machine Learning

1. **Supervised Learning**: Learning with labeled data
2. **Unsupervised Learning**: Finding patterns in unlabeled data
3. **Reinforcement Learning**: Learning through interaction and feedback
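
As a quick illustration of the first two types (a sketch with toy data; reinforcement learning needs an interactive environment and doesn't fit in a few lines):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

X = np.array([[1], [2], [3], [10], [11], [12]])

# Supervised: we provide labels and the model learns the mapping
y = np.array([2, 4, 6, 20, 22, 24])
print(LinearRegression().fit(X, y).predict([[4]]))  # roughly 8

# Unsupervised: no labels; the model finds structure on its own
print(KMeans(n_clusters=2, n_init=10).fit_predict(X))  # two clusters
```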

## Setting Up Your Environment

### Essential Libraries

```python
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
```
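
All of these are available from PyPI; if any are missing from your environment, they can be installed with `pip install pandas numpy matplotlib seaborn scikit-learn`.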


## Data Preprocessing

### Loading and Exploring Data

```python
# Load dataset
df = pd.read_csv('housing_data.csv')

# Basic information
print(df.info())
print(df.describe())
print(df.head())

# Check for missing values
print(df.isnull().sum())
```


### Handling Missing Data

```python
# Fill missing values (assign back rather than calling fillna with
# inplace=True on a column selection, which breaks under pandas copy-on-write)
df['age'] = df['age'].fillna(df['age'].median())

# Drop rows with missing target values
df.dropna(subset=['price'], inplace=True)

# Or use advanced imputation
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
df_numeric = df.select_dtypes(include=[np.number])
df_imputed = pd.DataFrame(
    imputer.fit_transform(df_numeric),
    columns=df_numeric.columns
)
```


### Feature Engineering

```python
# Create new features
df['price_per_sqft'] = df['price'] / df['sqft']
df['age_category'] = pd.cut(df['age'], bins=[0, 10, 30, 100],
                            labels=['New', 'Medium', 'Old'])

# Encode categorical variables
df_encoded = pd.get_dummies(df, columns=['neighborhood', 'age_category'])

# Scale numerical features (in a real project, fit the scaler on the
# training split only, so test-set statistics don't leak into training)
scaler = StandardScaler()
numerical_features = ['sqft', 'bedrooms', 'bathrooms']
df[numerical_features] = scaler.fit_transform(df[numerical_features])
```


## Building Your First Model

### Linear Regression Example

```python
# Prepare features and target
X = df[['sqft', 'bedrooms', 'bathrooms', 'age']]
y = df['price']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse:.2f}')
print(f'R² Score: {r2:.2f}')
```


### Classification Example

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Binary classification: expensive vs affordable housing
df['expensive'] = (df['price'] > df['price'].median()).astype(int)

X = df[['sqft', 'bedrooms', 'bathrooms', 'age']]
y = df['expensive']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predictions
y_pred = rf_model.predict(X_test)

# Evaluation
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
```


## Model Evaluation and Validation

### Cross-Validation

```python
from sklearn.model_selection import cross_val_score

# Cross-validate the regression model on the regression target
# (X and y currently hold the classification data from the previous section)
X_reg = df[['sqft', 'bedrooms', 'bathrooms', 'age']]
y_reg = df['price']

# Perform 5-fold cross-validation
cv_scores = cross_val_score(model, X_reg, y_reg, cv=5, scoring='r2')
print(f'Cross-validation scores: {cv_scores}')
print(f'Average CV score: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})')
```
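
Reporting the mean plus or minus two standard deviations gives a rough 95% interval for the score, which is far more informative than a single train/test split.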


### Hyperparameter Tuning

```python
from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Grid search
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)
print(f'Best parameters: {grid_search.best_params_}')
print(f'Best cross-validation score: {grid_search.best_score_:.3f}')
```
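
After the search completes, the model refit on all training data with the best parameters is available as `best_estimator_`; a short usage sketch:

```python
# Evaluate the tuned model on the held-out test set
best_model = grid_search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
print(f'Test accuracy with tuned parameters: {test_accuracy:.3f}')
```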


## Visualization and Interpretation

### Feature Importance

```python
# Get feature importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

# Plot feature importance
plt.figure(figsize=(10, 6))
sns.barplot(data=feature_importance, x='importance', y='feature')
plt.title('Feature Importance')
plt.xlabel('Importance Score')
plt.tight_layout()
plt.show()
```


### Learning Curves

```python
from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    rf_model, X, y, cv=5, n_jobs=-1,
    train_sizes=np.linspace(0.1, 1.0, 10)
)

plt.figure(figsize=(10, 6))
plt.plot(train_sizes, np.mean(train_scores, axis=1), 'o-', label='Training score')
plt.plot(train_sizes, np.mean(val_scores, axis=1), 'o-', label='Validation score')
plt.xlabel('Training Set Size')
plt.ylabel('Accuracy Score')
plt.legend()
plt.title('Learning Curves')
plt.grid(True)
plt.show()
```
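
If both curves converge at a high score, collecting more data is unlikely to help; a persistent gap between them is the classic signature of overfitting.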


## Advanced Topics

### Ensemble Methods

```python
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# Create individual models
rf = RandomForestClassifier(n_estimators=100, random_state=42)
svm = SVC(probability=True)
nb = GaussianNB()

# Create ensemble
ensemble = VotingClassifier(
    estimators=[('rf', rf), ('svm', svm), ('nb', nb)],
    voting='soft'
)

ensemble.fit(X_train, y_train)
ensemble_score = ensemble.score(X_test, y_test)
print(f'Ensemble accuracy: {ensemble_score:.3f}')
```
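
With `voting='soft'`, the ensemble averages each model's predicted class probabilities rather than taking a majority vote, which is why `SVC` is created with `probability=True` above.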


## Best Practices

1. **Always split your data** before any preprocessing (see the pipeline sketch after this list)
2. **Use cross-validation** for reliable model evaluation
3. **Scale your features** when using distance-based algorithms
4. **Handle missing data** appropriately
5. **Validate on unseen data** to check for overfitting
6. **Document your experiments** and keep track of results
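
A convenient way to honor the first and third points together is a scikit-learn `Pipeline`, which refits the scaler on each training fold so nothing leaks from validation data (a sketch reusing the `X_reg` and `y_reg` regression data from the cross-validation section):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# The scaler is fit inside each CV fold, so no information
# from the validation portion influences preprocessing
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])

scores = cross_val_score(pipeline, X_reg, y_reg, cv=5, scoring='r2')
print(f'Pipeline CV R²: {scores.mean():.3f}')
```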

## Conclusion

Machine Learning with Python offers powerful tools for solving real-world problems. Start with simple algorithms, understand your data thoroughly, and gradually move to more complex models.

Key takeaways:
- Data preprocessing is crucial for model performance
- Start simple and gradually increase complexity
- Always validate your models properly
- Feature engineering can significantly improve results
- Ensemble methods often provide better performance

The journey in machine learning is iterative: keep experimenting, learning, and improving your models!