Hello Guilherme,
What you say about discarding data that was used to help make a decision about a model or algorithm makes complete sense. Reusing the data in that way is what the literature calls "data snooping".
In video 8 of lesson 4, you suggest using the same data to train different algorithms. This seems to be a somewhat controversial point. Although it is fair and intuitive to train/evaluate different models on the same data, in a situation where a large amount of data is available, wouldn't it make sense to use different sets? Allow me to add to the discussion by transcribing below an excerpt from a book by a Machine Learning professor (Learning from Data, by Yaser Abu-Mostafa). (First of all, I neither agree nor disagree with what is said in the excerpt; I have not yet reached a conclusion in practice, since the same author advocates using cross-validation on a single dataset for model selection.)
"One of the most common occurrences of data snooping is the reuse of the same data set. If you try learning using first one model and then another and then another on the same data set, you will eventually 'succeed'. As the saying goes, if you torture the data long enough, it will confess. If you try all possible dichotomies, you will eventually fit any data set; this is true whether we try the dichotomies directly (using a single model) or indirectly (using a sequence of models). The effective VC dimension for the series of trials will not be that of the last model that succeeded, but of the entire union of models that could have been used depending on the outcomes of different trials.

Sometimes the reuse of the same data set is carried out by different people. Let's say that there is a public data set that you would like to work on. Before you download the data, you read about how other people did with this data set using different techniques. You naturally pick the most promising techniques as a baseline, then try to improve on them and introduce your own ideas. Although you haven't even seen the data set yet, you are already guilty of data snooping. Your choice of baseline techniques was affected by the data set, through the actions of others. You may find that your estimates of the performance will turn out to be too optimistic, since the techniques you are using have already proven well-suited to this particular data set.

(...)

- Avoid data snooping: A strict discipline in handling the data is required. Data that is going to be used to evaluate the final performance should be 'locked in a safe' and only brought out after the final hypothesis has been decided. If intermediate tests are needed, separate data sets should be used for that. Once a data set has been used, it should be treated as contaminated as far as testing the performance is concerned.

- Account for data snooping: If you have to use a data set more than once, keep track of the level of contamination and treat the reliability of your performance estimates in light of this contamination. The bounds (1.6) and (2.12) can provide guidelines for the relative reliability of different data sets that have been used in different roles within the learning process."
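If I remember correctly, the bound (1.6) the author refers to is Hoeffding's inequality combined with a union bound over the M hypotheses that were tried. A minimal Python sketch of that bound (my reconstruction, not code from the book) shows how the generalization guarantee degrades when the same data set is reused to select among many models:

```python
import math

def hoeffding_union_bound(eps, n, m):
    """Upper bound on P[|E_in - E_out| > eps] when the final hypothesis is
    chosen among m candidates evaluated on the same n examples: the single-
    hypothesis Hoeffding bound 2*exp(-2*eps^2*n), multiplied by m via the
    union bound, and capped at 1 (a probability cannot exceed 1)."""
    return min(1.0, 2 * m * math.exp(-2 * eps**2 * n))

# One fixed model evaluated once on 1000 examples: a meaningful guarantee.
single = hoeffding_union_bound(eps=0.05, n=1000, m=1)

# Trying 100 models on those same 1000 examples inflates the bound by 100x,
# here making it vacuous (capped at 1) -- the "contamination" in the excerpt.
many = hoeffding_union_bound(eps=0.05, n=1000, m=100)
```

The point of the sketch is the comparison, not the exact numbers: the bound grows linearly in the number of models that *could* have been selected, which is why the author says the effective capacity is that of the union of models tried, not of the one that happened to win.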