classifica_email_limpo.py
C:\Python27\lib\site-packages\sklearn\cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
Traceback (most recent call last):
  File "classifica_email_limpo.py", line 21, in <module>
    classificacoes = pd.read_csv('emails.csv', encoding = "utf8")
  File "C:\Python27\lib\site-packages\pandas\io\parsers.py", line 709, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Python27\lib\site-packages\pandas\io\parsers.py", line 455, in _read
    data = parser.read(nrows)
  File "C:\Python27\lib\site-packages\pandas\io\parsers.py", line 1069, in read
    ret = self._engine.read(nrows)
  File "C:\Python27\lib\site-packages\pandas\io\parsers.py", line 1839, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 902, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 924, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas\_libs\parsers.pyx", line 1001, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 1130, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas\_libs\parsers.pyx", line 1182, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas\_libs\parsers.pyx", line 1281, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas\_libs\parsers.pyx", line 1300, in pandas._libs.parsers.TextReader._string_convert
  File "pandas\_libs\parsers.pyx", line 1596, in pandas._libs.parsers._string_box_decode
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
Part of the code I typed:
# -*- coding: utf-8 -*-
from collections import Counter
# used to compute the score of each algorithm
from sklearn.cross_validation import cross_val_score
# pd = pandas, imported under a short alias
import pandas as pd
# np = numpy, used to compute the mean of an array
import numpy as np
# nltk is used to remove words that add no information about the content of the text (a "blacklist")
import nltk
# nltk.download('stopwords')
# nltk.download('rslp')  # RSLP: suffix remover (stemmer) for Portuguese
stemmer = nltk.stem.RSLPStemmer()
stopwords = nltk.corpus.stopwords.words('portuguese')
classificacoes = pd.read_csv('emails.csv', encoding = "utf8")
textosPuros = classificacoes['email']
marcas = classificacoes['classificacao']
textosQuebrados = textosPuros.str.lower().str.split(' ')
dicionario = set()
for lista in textosQuebrados:
    validas = [stemmer.stem(palavra) for palavra in lista if palavra not in stopwords and len(palavra) > 0]
    dicionario.update(validas)
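
The traceback above is cut off before its final line, so the exact exception is not shown. Since it ends inside `encodings\utf_8.py`, one plausible reading (an assumption, not something the output confirms) is that `emails.csv` is not actually saved as UTF-8 and the decode fails with a `UnicodeDecodeError`. Under that assumption, the sketch below tries a few candidate encodings on the raw bytes and reports the first one that decodes cleanly. The helper names `guess_encoding` and `guess_file_encoding` and the candidate list are hypothetical and chosen only for illustration. A clean decode does not prove the encoding is correct (`latin-1` accepts any byte sequence), so treat the result as a hint.

```python
# -*- coding: utf-8 -*-
import io

# Candidate encodings, tried in order. This list is an assumption;
# adjust it to whatever your files are likely to use.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def guess_encoding(raw_bytes, candidates=CANDIDATE_ENCODINGS):
    """Return the first candidate that decodes raw_bytes without error, else None."""
    for name in candidates:
        try:
            raw_bytes.decode(name)
            return name
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding; try the next one.
            continue
    return None


def guess_file_encoding(path, candidates=CANDIDATE_ENCODINGS):
    """Read the file as raw bytes and apply guess_encoding to its contents."""
    with io.open(path, "rb") as handle:
        return guess_encoding(handle.read(), candidates)
```

If the guess turns out to be, say, `cp1252`, it can be passed to the existing call as `pd.read_csv('emails.csv', encoding='cp1252')`. Re-saving the file as UTF-8 in a text editor is another option that keeps the original `encoding = "utf8"` argument valid.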