COMPLETE IMPLEMENTATION
1️⃣ Load and transform the JSON
import json

with open("qa_dataset.json", "r", encoding="utf-8") as f:
    raw_data = json.load(f)

evaluation_dataset = []
for item in raw_data:
    evaluation_dataset.append({
        "query": item["question"],
        "answer": item["ground_truth"]
    })

print(len(evaluation_dataset))
Now we have just:
[
    {"query": "...", "answer": "..."}
]
A clean structure for evaluation.
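If any record in the JSON is missing one of the two expected keys, the loop above raises a KeyError. A defensive variant is sketched below (`build_evaluation_dataset` is a hypothetical helper, not part of the original pipeline) that skips and counts incomplete records:

```python
def build_evaluation_dataset(raw_data):
    # Keep only records that carry both expected keys,
    # counting anything skipped so the dataset size stays auditable.
    dataset, skipped = [], 0
    for item in raw_data:
        if "question" in item and "ground_truth" in item:
            dataset.append({
                "query": item["question"],
                "answer": item["ground_truth"],
            })
        else:
            skipped += 1
    return dataset, skipped

# Example with one malformed record:
raw = [
    {"question": "What is RAG?", "ground_truth": "Retrieval-Augmented Generation"},
    {"question": "incomplete record"},  # missing ground_truth
]
dataset, skipped = build_evaluation_dataset(raw)
```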
2️⃣ Configure the GPT model
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0
)
3️⃣ Pipeline WITHOUT RAG
def generate_without_rag(question):
    response = llm.invoke(question)
    return response.content
4️⃣ Pipeline WITH RAG
Assuming you already have:
vectorstore
retriever
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
prompt = ChatPromptTemplate.from_template("""
Answer only based on the provided context.

Context:
{context}

Question:
{question}
""")

rag_chain = (
    {
        "context": retriever,
        "question": RunnablePassthrough()
    }
    | prompt
    | llm
    | StrOutputParser()
)

def generate_with_rag(question):
    return rag_chain.invoke(question)
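Both generate_without_rag and generate_with_rag expose the same string-in, string-out interface, which is what lets the evaluation function in step 5 accept either one as generator_function. A pure-Python stub (no API calls; generate_stub is a hypothetical stand-in) makes that contract explicit:

```python
def generate_stub(question):
    # Same signature as generate_without_rag / generate_with_rag:
    # takes a question string, returns an answer string.
    return "Stubbed answer to: " + question

answer = generate_stub("What is RAG?")
```

A stub like this is also handy for testing the evaluation loop without spending API calls.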
5️⃣ Corrected evaluation function
The usual sources of the error:
incorrect indexing
wrong keys
a structure incompatible with the evaluator
A robust implementation:
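The wrong-keys failure is easy to reproduce in isolation: with QAEvalChain's default keys, each prediction must carry its answer under "result" and each reference under "answer". A minimal sketch of the mismatch, using plain dicts:

```python
# A prediction stored under the wrong key ("answer" instead of "result"):
bad_prediction = {"query": "What is RAG?", "answer": "Retrieval-Augmented Generation"}

try:
    # The evaluator effectively performs this lookup for each prediction.
    bad_prediction["result"]
    error_key = None
except KeyError as e:
    error_key = str(e)
```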
from langchain.evaluation.qa import QAEvalChain

eval_chain = QAEvalChain.from_llm(llm)

def evaluate_pipeline(dataset, generator_function):
    predictions = []
    references = []

    for item in dataset:
        try:
            prediction = generator_function(item["query"])
            predictions.append({
                "query": item["query"],
                "result": prediction
            })
            references.append({
                "query": item["query"],
                "answer": item["answer"]
            })
        except Exception as e:
            print("Generation error:", e)

    graded_outputs = eval_chain.evaluate(
        references,
        predictions
    )
    return graded_outputs
6️⃣ Run the evaluation
Without RAG
results_no_rag = evaluate_pipeline(
    evaluation_dataset,
    generate_without_rag
)
With RAG
results_rag = evaluate_pipeline(
    evaluation_dataset,
    generate_with_rag
)
7️⃣ Measure accuracy
def compute_accuracy(results):
    correct = 0
    for r in results:
        verdict = r["results"].strip().upper()
        # A bare `"CORRECT" in verdict` would also match "INCORRECT",
        # so rule out the negative verdict explicitly.
        if "CORRECT" in verdict and "INCORRECT" not in verdict:
            correct += 1
    return correct / len(results)
print("Accuracy without RAG:", compute_accuracy(results_no_rag))
print("Accuracy with RAG:", compute_accuracy(results_rag))
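A quick sanity check of the accuracy step, using hand-built graded outputs (hypothetical verdict strings mimicking the {"results": ...} shape the evaluator is assumed to return; the helper is included inline so the snippet runs on its own):

```python
def compute_accuracy(results):
    correct = 0
    for r in results:
        verdict = r["results"].strip().upper()
        # Rule out "INCORRECT" before accepting a "CORRECT" substring match.
        if "CORRECT" in verdict and "INCORRECT" not in verdict:
            correct += 1
    return correct / len(results)

# Hand-built verdicts: 2 of 3 graded correct.
mock_results = [
    {"results": "CORRECT"},
    {"results": "INCORRECT"},
    {"results": "CORRECT"},
]
acc = compute_accuracy(mock_results)
```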