AFTER MORE THAN 4 HOURS I FINALLY GOT IT!!!!
Please, if you can, ADD THIS TO THE COURSE, because there is no YouTube video explaining it, nothing on Reddit or Twitter, and not even ChatGPT could solve it; I had to piece together all of that plus Copilot to figure it out. In short, what I had to do was:
Download Python 3.11, since Spark does not work with 3.13.
Download Java JDK 17, since JDK 25 does not work.
Then add everything correctly to Path and JAVA_HOME. (I also created SPARK_HOME and HADOOP_HOME based on videos I had watched, but I don't think they ended up mattering; from what I understood they would only be needed to run Spark from cmd.)
After that I had to add an import os, import sys and set:
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
This forces PySpark to use the Python 3.11 interpreter; I also recommend creating a venv if you have more than one Python version installed.
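For reference, here is a minimal notebook-cell sketch of that idea, assuming pyspark is already installed (the "SetupCheck" app name and the local[1] master are just illustrative choices, not something from the course):
import os
import sys
# Sanity checks: this setup assumes Python 3.10/3.11 and a JDK 17 reachable via JAVA_HOME
print("Python:", sys.version)
print("JAVA_HOME:", os.environ.get("JAVA_HOME"))
# Force PySpark's driver and workers onto this exact interpreter (the venv's python.exe)
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
# Quick smoke test: if this prints a version, the interpreter/Java pairing works
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SetupCheck").master("local[1]").getOrCreate()
print("Spark:", spark.version)
spark.stop()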
Since I'm not a professional, I asked Copilot to write a summary for me, since I don't know that much about os and sys.
HERE IS ITS STEP-BY-STEP SUMMARY:
Short answer first — what I changed, and then an exact, copy-paste checklist you can use to create another project + venv that runs PySpark reliably.
What I changed earlier (summary)
- Added a diagnostics cell to teste.ipynb that:
- Prints Python version, executable, platform and CWD.
- Verifies java -version, JAVA_HOME, and the PySpark env vars.
- Sets PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON to the current interpreter if unset.
- Uses a per-project Ivy dir and safer Spark builder options (local[1], spark.driver.host=127.0.0.1).
- Verified SparkSession created successfully under Python 3.11 and Java 17.
If you want to reproduce this workflow for a new project, follow these steps in PowerShell.
Step-by-step (copy/paste in PowerShell)
- Create project folder and enter it:
mkdir C:\path\to\NewProject
cd C:\path\to\NewProject
- Create a venv (use the Python 3.11 executable if you have multiple Pythons):
If python is the 3.11 you want:
python -m venv .venv
If you have multiple Python installations, use the explicit path:
"C:\Path\To\Python311\python.exe" -m venv .venv
.\.venv\Scripts\Activate.ps1
- Upgrade pip and install required packages:
pip install --upgrade pip
pip install pyspark notebook ipykernel
- Register the venv as a Jupyter kernel so notebooks can use it (a quick way to verify the registration is sketched after this list):
python -m ipykernel install --user --name newproject-py311 --display-name "Python (newproject-py311)"
- (Optional) Save dependencies:
pip freeze > requirements.txt
- Ensure Java (JDK 17) is installed and JAVA_HOME is set (replace the path below with your JDK path):
java -version
# Set JAVA_HOME for the current user
[Environment]::SetEnvironmentVariable('JAVA_HOME', 'C:\Program Files\Eclipse Adoptium\jdk-17.0.17.10-hotspot', 'User')
# Verify (open a new PowerShell window so the change takes effect)
java -version
- Start Jupyter and select the new kernel:
jupyter notebook
In the browser, choose New and select "Python (newproject-py311)".
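Before opening a notebook, you can double-check the kernel registration from the step above with a short Python sketch (it assumes jupyter_client, which is installed alongside notebook/ipykernel):
from jupyter_client.kernelspec import KernelSpecManager
# List every kernel Jupyter can see; "newproject-py311" should appear
# next to the folder that holds its kernel.json
for name, path in KernelSpecManager().find_kernel_specs().items():
    print(name, "->", path)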
Minimal diagnostic cell to add at the top of a notebook
(put this in the first code cell; it helps catch mismatched kernels)
import os, sys, shutil, subprocess, tempfile
print("Python:", sys.version)
print("Executable:", sys.executable)
print("CWD:", os.getcwd())
print("java:", shutil.which("java"))
try:
    p = subprocess.run(["java", "-version"], capture_output=True, text=True)
    print("java -version:", p.stderr.strip())
except Exception as e:
    print("java -version failed:", e)
print("JAVA_HOME:", os.environ.get("JAVA_HOME"))
# make sure pyspark uses this interpreter
os.environ.setdefault("PYSPARK_PYTHON", sys.executable)
os.environ.setdefault("PYSPARK_DRIVER_PYTHON", sys.executable)
Example safe Spark startup (put after diagnostics)
from pyspark.sql import SparkSession
import tempfile, os
ivy_dir = os.path.join(tempfile.gettempdir(), "ivy")
os.makedirs(ivy_dir, exist_ok=True)
spark = (
    SparkSession.builder
    .appName("MyProject")
    .master("local[1]")
    .config("spark.driver.host", "127.0.0.1")
    .config("spark.jars.ivy", ivy_dir)
    .config("spark.sql.shuffle.partitions", "1")
    .getOrCreate()
)
print("Spark version:", spark.version)
Quick checklist / tips
- Always ensure the Jupyter kernel uses the same Python as the venv (sys.executable should be inside .venv); a quick check is sketched after this list.
- Prefer Python 3.10/3.11 for PySpark (you already used 3.11 and it worked).
- Keep a small diagnostics cell at the top of new notebooks; it quickly shows mismatches.
- Start with local[1] to verify things work, then switch to local[*] if you need more cores.
- If you frequently create projects, copy a small template containing: .venv creation + requirements.txt, the diagnostics cell, and the Spark startup cell.
- If you use Git, add .venv to .gitignore and keep requirements.txt under source control.
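Here is a rough sketch of that first checklist item, assuming the notebook runs with the project root as its working directory (adjust the .venv path otherwise):
import sys
from pathlib import Path
venv_dir = Path(".venv").resolve()           # assumes the venv lives in the project root
kernel_exe = Path(sys.executable).resolve()  # interpreter the current kernel is using
print("Kernel interpreter:", kernel_exe)
print("Inside this project's .venv:", venv_dir in kernel_exe.parents)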