The array features2 is: (5074382, 82) dtype('float64') numpy.ndarray
The array features_encoded is: (5074382, 9276434) dtype('float64') scipy.sparse.csr.csr_matrix
features_final = np.column_stack((features2, features_encoded))
The line above raises this error:
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 5074382 and the array at index 1 has size 1
I need to join the two arrays above. How can I do that?
The dataset I am using has some categorical columns, to which I applied OneHotEncoder. I then tried to join the array of numeric features with the array produced by OneHotEncoder, so as to get a single array with all the features.
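One possible way to do the join, as a minimal sketch: `np.column_stack` does not understand SciPy sparse matrices and appears to treat `features_encoded` as a single object, which would explain the "size 1" in the error. `scipy.sparse.hstack` stacks matrices column-wise and keeps the result sparse. The toy arrays `numeric` and `categorical` below are made-up stand-ins for `features2` and the categorical columns, not the real data.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for features2 and the categorical columns
numeric = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
categorical = np.array([["a", "x"], ["b", "x"], ["a", "y"]])

# OneHotEncoder returns a SciPy sparse matrix by default
encoded = OneHotEncoder().fit_transform(categorical)

# Convert the dense block to sparse, then stack column-wise without densifying
combined = sparse.hstack([sparse.csr_matrix(numeric), encoded], format="csr")

print(combined.shape)  # (3, 6): 2 numeric columns + 4 one-hot columns
```

With the real data this would be `sparse.hstack([sparse.csr_matrix(features2), features_encoded], format="csr")`. Calling `features_encoded.toarray()` and then `np.column_stack` is not practical here, since a dense (5074382, 9276434) float64 array would not fit in memory.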
Here is the complete code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pd.options.display.max_columns = None #Display all dataframe columns in a Jupyter Python Notebook
pd.set_option('display.max_rows', 1000)
get_ipython().run_line_magic('matplotlib', 'inline')
CIC2019 = pd.read_csv(r"DrDoS_DNS.csv")
remove = lambda x: x.strip()  # strip surrounding blanks from a column name
columns = list(CIC2019.columns)
new_columns = list(map(remove, columns))  # column names with blanks removed
CIC2019 = pd.read_csv(r"DrDoS_DNS.csv", names =new_columns, header = None, skiprows=1,nrows=None)
CIC2019.rename(columns={"Unnamed: 0": "ID"}, inplace=True)
CIC2019 = CIC2019.dropna()
CIC2019.isna().sum()
features = CIC2019.drop("Label", axis =1)
# # Handling categorical attributes
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
CIC2019["Label"]
Label_encoded = encoder.fit_transform(CIC2019["Label"].to_numpy().reshape(-1, 1))  # one column, one row per sample
features[["Flow ID","Source IP","Timestamp","SimillarHTTP","Destination IP"]]
features2 = features.drop(["Flow ID","Source IP","Timestamp","Destination IP","SimillarHTTP"], axis =1)
features2 = features2.to_numpy()
features_encoded = encoder.fit_transform(features[["Flow ID", "Source IP", "Timestamp", "Destination IP"]].to_numpy())
# "SimillarHTTP" is left out here: including it raised an error
# # Training - Linear Regression
features_final = np.column_stack((features2, features_encoded))