
[Question] Error creating a parquet file on my local machine.

I am reading a CSV file and converting it to parquet:

Reading the CSV:

empresas2 = spark.read.csv(
    r'C:\Users\guilherme.kanai\Desktop\arquivos\any.csv', 
    sep=';', 
    inferSchema=True,
    header=True)

Writing to parquet:

empresas2.write.parquet(
    path=r'C:\Users\guilherme.kanai\Desktop\arquivos\parquet\novo.parquet',
    # OR: r'C:\Users\guilherme.kanai\Desktop\arquivos\parquet\'
    mode='overwrite',
)

Both give the same error:

Py4JJavaError                             Traceback (most recent call last)
c:\Users\guilherme.kanai\Downloads\aula6_projeto_spark.ipynb Cell 12 in ()
----> 1 empresas2.write.parquet(
      2     path= r'C:\Users\guilherme.kanai\Desktop\arquivos\parquet\novo.parquet',
      3     mode='overwrite',
      4 )

File c:\Users\guilherme.kanai\AppData\Local\Programs\Python\Python39\lib\site-packages\pyspark\sql\readwriter.py:1140, in DataFrameWriter.parquet(self, path, mode, partitionBy, compression)
   1138     self.partitionBy(partitionBy)
   1139 self._set_opts(compression=compression)
-> 1140 self._jwrite.parquet(path)

File c:\Users\guilherme.kanai\AppData\Local\Programs\Python\Python39\lib\site-packages\py4j\java_gateway.py:1321, in JavaMember.__call__(self, *args)
   1315 command = proto.CALL_COMMAND_NAME +\
   1316     self.command_header +\
   1317     args_command +\
   1318     proto.END_COMMAND_PART
   1320 answer = self.gateway_client.send_command(command)
-> 1321 return_value = get_return_value(
   1322     answer, self.gateway_client, self.target_id, self.name)
   1324 for temp_arg in temp_args:
   1325     temp_arg._detach()

File c:\Users\guilherme.kanai\AppData\Local\Programs\Python\Python39\lib\site-packages\pyspark\sql\utils.py:190, in capture_sql_exception.<locals>.deco(*a, **kw)
    188 def deco(*a: Any, **kw: Any) -> Any:
    189     try:
--> 190         return f(*a, **kw)
    191     except Py4JJavaError as e:
    192         converted = convert_exception(e.java_exception)

File c:\Users\guilherme.kanai\AppData\Local\Programs\Python\Python39\lib\site-packages\py4j\protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
    324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325 if answer[1] == REFERENCE_TYPE:
--> 326     raise Py4JJavaError(
    327         "An error occurred while calling {0}{1}{2}.\n".
    328         format(target_id, ".", name), value)
    329 else:
    330     raise Py4JError(
    331         "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
    332         format(target_id, ".", name, value))

Py4JJavaError: An error occurred while calling o34.parquet.
: org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:651)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:288)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
1 answer

Hi Guilherme,

You are trying to run the project locally, which makes it hard to pinpoint the problem and help you. It is most likely related to the configuration of your environment variables, and on that front the post by our colleague Eduardo may help you.
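For reference, one common cause of this error when writing parquet locally on Windows is Spark not finding Hadoop's winutils.exe. Below is a minimal sketch of that environment setup, assuming winutils.exe (and hadoop.dll) were placed under a hypothetical C:\hadoop\bin folder; adjust the path to wherever they actually live on your machine:

import os

# Hypothetical install location: the folder that contains bin\winutils.exe
os.environ['HADOOP_HOME'] = r'C:\hadoop'
# Add the bin folder to PATH so hadoop.dll can be loaded by the JVM
os.environ['PATH'] += os.pathsep + r'C:\hadoop\bin'

from pyspark.sql import SparkSession

# The variables must be set before the SparkSession (and its JVM) is created
spark = SparkSession.builder.master('local[*]').getOrCreate()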

That said, prefer running the project on Colab. With the newer versions, all you need to do is install pyspark and follow the course; there is no need to configure environment variables, download Spark, and so on.

Just run the following code and continue with the rest of the course:

!pip install pyspark

from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local[*]').getOrCreate()
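From there, the same read and write from your post work unchanged; you only need to point the paths at the Colab filesystem. A quick sketch, assuming the CSV was uploaded to a hypothetical /content/any.csv:

empresas2 = spark.read.csv(
    '/content/any.csv',   # hypothetical path after uploading the file to Colab
    sep=';',
    inferSchema=True,
    header=True)

empresas2.write.parquet(
    path='/content/parquet/novo.parquet',   # Spark creates this folder with the part files
    mode='overwrite')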

I hope this helps. Happy studying!