## Types of Machine Learning
1. **Supervised Learning**: Learning with labeled data
2. **Unsupervised Learning**: Finding patterns in unlabeled data
3. **Reinforcement Learning**: Learning through interaction and feedback
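To make the distinction concrete, here is a minimal sketch contrasting the first two paradigms on synthetic data (generated with scikit-learn's `make_blobs`, an assumption made purely for illustration). Reinforcement learning needs an interactive environment (e.g., a Gymnasium simulation) rather than a static dataset, so it is not shown.

```python
# Minimal sketch: supervised vs. unsupervised learning on synthetic data.
# (Toy data only; reinforcement learning requires an environment and is omitted.)
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=200, centers=2, random_state=42)

# Supervised: the labels y guide the model during training
clf = LogisticRegression().fit(X, y)
print('Supervised accuracy:', clf.score(X, y))

# Unsupervised: only the features X are used; the model discovers groups on its own
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print('Cluster assignments for first 5 points:', km.labels_[:5])
```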
## Setting Up Your Environment
### Essential Libraries
```python
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
```
## Data Preprocessing
### Loading and Exploring Data
```python
# Load dataset
df = pd.read_csv('housing_data.csv')

# Basic information
print(df.info())
print(df.describe())
print(df.head())

# Check for missing values
print(df.isnull().sum())
```
### Handling Missing Data
```python
# Fill missing values (assign back rather than using chained inplace=True)
df['age'] = df['age'].fillna(df['age'].median())

# Drop rows with missing target values
df = df.dropna(subset=['price'])

# Or use advanced imputation
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
df_numeric = df.select_dtypes(include=[np.number])
df_imputed = pd.DataFrame(
    imputer.fit_transform(df_numeric),
    columns=df_numeric.columns
)
```
### Feature Engineering
```python
# Create new features
df['price_per_sqft'] = df['price'] / df['sqft']
df['age_category'] = pd.cut(df['age'], bins=[0, 10, 30, 100],
                            labels=['New', 'Medium', 'Old'])

# Encode categorical variables
df_encoded = pd.get_dummies(df, columns=['neighborhood', 'age_category'])

# Scale numerical features
# (Fitting the scaler on the full dataset is done here for brevity; in practice,
# fit it on the training set only to avoid data leakage; see Best Practices below.)
scaler = StandardScaler()
numerical_features = ['sqft', 'bedrooms', 'bathrooms']
df[numerical_features] = scaler.fit_transform(df[numerical_features])
```
## Building Your First Model
### Linear Regression Example
```python
# Prepare features and target
X = df[['sqft', 'bedrooms', 'bathrooms', 'age']]
y = df['price']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse:.2f}')
print(f'R² Score: {r2:.2f}')
```
### Classification Example
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Binary classification: expensive vs. affordable housing
df['expensive'] = (df['price'] > df['price'].median()).astype(int)

X = df[['sqft', 'bedrooms', 'bathrooms', 'age']]
y = df['expensive']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Random forest classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predictions
y_pred = rf_model.predict(X_test)

# Evaluation
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
```
## Model Evaluation and Validation
### Cross-Validation
```python
from sklearn.model_selection import cross_val_score

# Perform 5-fold cross-validation of the regression model on the price target
cv_scores = cross_val_score(model, X, df['price'], cv=5, scoring='r2')
print(f'Cross-validation scores: {cv_scores}')
print(f'Average CV score: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})')
```
### Hyperparameter Tuning
```python
from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Grid search
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)

print(f'Best parameters: {grid_search.best_params_}')
print(f'Best cross-validation score: {grid_search.best_score_:.3f}')
```
## Visualization and Interpretation
### Feature Importance
```python
# Get feature importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

# Plot feature importance
plt.figure(figsize=(10, 6))
sns.barplot(data=feature_importance, x='importance', y='feature')
plt.title('Feature Importance')
plt.xlabel('Importance Score')
plt.tight_layout()
plt.show()
```
### Learning Curves
```python
from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    rf_model, X, y, cv=5, n_jobs=-1,
    train_sizes=np.linspace(0.1, 1.0, 10)
)

plt.figure(figsize=(10, 6))
plt.plot(train_sizes, np.mean(train_scores, axis=1), 'o-', label='Training score')
plt.plot(train_sizes, np.mean(val_scores, axis=1), 'o-', label='Validation score')
plt.xlabel('Training Set Size')
plt.ylabel('Accuracy Score')
plt.legend()
plt.title('Learning Curves')
plt.grid(True)
plt.show()
```
## Advanced Topics
### Ensemble Methods
```python
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# Create individual models
rf = RandomForestClassifier(n_estimators=100)
svm = SVC(probability=True)
nb = GaussianNB()

# Create ensemble
ensemble = VotingClassifier(
    estimators=[('rf', rf), ('svm', svm), ('nb', nb)],
    voting='soft'
)

ensemble.fit(X_train, y_train)
ensemble_score = ensemble.score(X_test, y_test)
print(f'Ensemble accuracy: {ensemble_score:.3f}')
```
## Best Practices
1. **Always split your data** before fitting any preprocessing (scalers, imputers) so information from the test set never leaks into training; see the sketch after this list
2. **Use cross-validation** for reliable model evaluation
3. **Scale your features** when using distance-based or gradient-based algorithms (e.g., k-NN, SVMs, regularized linear models)
4. **Handle missing data** appropriately
5. **Validate on unseen data** to check for overfitting
6. **Document your experiments** and keep track of results
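As a concrete sketch of practices 1 through 3, the snippet below keeps imputation and scaling inside a `Pipeline` so they are refit on training data only during each cross-validation fold (it assumes the housing DataFrame `df` and the column names from the earlier examples are still in scope).

```python
# Sketch: split first, then keep all preprocessing inside a Pipeline to avoid leakage.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

X = df[['sqft', 'bedrooms', 'bathrooms', 'age']]
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
])

# Cross-validate on the training set only; the test set stays untouched
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='r2')
print(f'CV R²: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})')

# Final check on held-out data
pipeline.fit(X_train, y_train)
print(f'Test R²: {pipeline.score(X_test, y_test):.3f}')
```

Because the imputer and scaler live inside the pipeline, each cross-validation fold refits them on its own training portion, which is exactly what prevents leakage from the validation data.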
## Conclusion
Machine Learning with Python offers powerful tools for solving real-world problems. Start with simple algorithms, understand your data thoroughly, and gradually move to more complex models.
Key takeaways:
- Data preprocessing is crucial for model performance
- Start simple and gradually increase complexity
- Always validate your models properly
- Feature engineering can significantly improve results
- Ensemble methods often provide better performance
The journey in machine learning is iterative—keep experimenting, learning, and improving your models!