Classification with the California Housing dataset

Objective: Tune the hyperparameters of a decision tree using Grid Search.
Steps:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.metrics import accuracy_score
import pandas as pd

# Load the data and discretize the target into categories (classification)
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile').fit_transform(
    data.target.reshape(-1, 1)
).ravel()

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the model and the parameter grid for the search
model = DecisionTreeClassifier(random_state=42)
param_grid = {
    'max_depth': [3, 5, 10],
    'min_samples_split': [2, 5, 10]
}

# Run GridSearchCV
grid = GridSearchCV(model, param_grid, cv=5)
grid.fit(X_train, y_train)

# Evaluate
print("Best parameters:", grid.best_params_)
y_pred = grid.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```
Regression with the scikit-learn Diabetes dataset

Objective: Identify the most important features for the prediction.
Steps:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
import pandas as pd
import matplotlib.pyplot as plt

# Load the data
data = load_diabetes()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the model
model = DecisionTreeRegressor(max_depth=5, random_state=42)
model.fit(X_train, y_train)

# Evaluate: take the square root of the MSE to get the RMSE
# (the `squared=False` argument was removed in recent scikit-learn versions)
y_pred = model.predict(X_test)
print("RMSE:", mean_squared_error(y_test, y_pred) ** 0.5)

# Feature importances
importances = model.feature_importances_
features_importance = pd.Series(importances, index=data.feature_names).sort_values(ascending=False)

# Visualize
features_importance.plot(kind='bar', title="Feature importance", figsize=(10, 5))
plt.ylabel('Importance')
plt.show()
```
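Feature importances show which variables the tree relied on, but not how it used them. As an optional complement, the sketch below, assuming the fitted `model` and the `data` object from the block above, draws the top levels of the learned tree with scikit-learn's `plot_tree`, truncated at depth 2 for readability.

```python
# Optional: visualize the top levels of the fitted regression tree.
# Assumes `model` and `data` from the previous block.
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(14, 6))
plot_tree(model, max_depth=2, feature_names=data.feature_names, filled=True, fontsize=8)
plt.show()
```

The features chosen for the first splits are typically among the highest-ranked entries of `features_importance`, which serves as a quick consistency check between the two views.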