At my company I built a pipeline that reads data from a SQL Server database and lands it in an S3 bucket. The table has 2,907,375 rows. The PySpark DataFrame is apparently created without problems, but when writing it as CSV to the S3 bucket the job fails with "Error code is: -9".
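For context, the read stage is a plain JDBC read along these lines (jdbc_url, user and password below are placeholder names, not the exact code from my script):

# Illustrative sketch of the read step; jdbc_url, user and password are placeholders
df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)  # e.g. jdbc:sqlserver://<host>:1433;databaseName=<db>
    .option("dbtable", "lancamento")
    .option("user", user)
    .option("password", password)
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load()
)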
To write the file I am using:
# Write the DataFrame as ';'-separated CSV to the raw tier prefix on S3
df.write.csv(
    path=path_raw_tier,
    mode="overwrite",
    sep=";",
)
This lives in a .py script that is launched by the SparkSubmitOperator.
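For reference, the task is defined in the DAG roughly like this (the application path, jars, name and arguments are reconstructed from the spark-submit command in the log below; conn_id is an assumption):

# Sketch of the DAG task, inside the DAG definition; conn_id shown is an assumption
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

spark_load_lancamento_db03 = SparkSubmitOperator(
    task_id="spark_load_lancamento_db03",
    application="/usr/local/airflow/dags/spark_scripts/spark_load_table_igc_ativy_db03.py",
    application_args=["lancamento", "ano_dt_lancamento", "mes_dt_lancamento"],
    jars=",".join([
        "/usr/local/airflow/jars/aws-java-sdk-dynamodb-1.11.534.jar",
        "/usr/local/airflow/jars/aws-java-sdk-core-1.11.534.jar",
        "/usr/local/airflow/jars/aws-java-sdk-s3-1.11.534.jar",
        "/usr/local/airflow/jars/hadoop-aws-3.2.2.jar",
        "/usr/local/airflow/jars/mssql-jdbc-12.4.0.jre8.jar",
    ]),
    name="arrow-spark",
    conn_id="spark_default",  # assumption: this connection resolves to --master local[1]
)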
The Airflow log does not make it clear which error actually occurred, so I don't know what to fix. Any help is welcome. The relevant part of the log is below.
[2023-11-09, 00:47:58 UTC] {spark_submit.py:495} INFO - Read **2907375** rows in DBA table lancamento.
[2023-11-09, 00:47:58 UTC] {spark_submit.py:495} INFO - **Writing csv file into s3a://analytics-production-data-lake-tier-1/airflow/raw_data**
[2023-11-09, 00:57:32 UTC] {taskinstance.py:1935} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/providers/apache/spark/operators/spark_submit.py", line 157, in execute
self._hook.submit(self._application)
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/providers/apache/spark/hooks/spark_submit.py", line 426, in submit
raise AirflowException(
airflow.exceptions.AirflowException: Cannot execute: spark-submit --master local[1] --jars /usr/local/airflow/jars/aws-java-sdk-dynamodb-1.11.534.jar,/usr/local/airflow/jars/aws-java-sdk-core-1.11.534.jar,/usr/local/airflow/jars/aws-java-sdk-s3-1.11.534.jar,/usr/local/airflow/jars/hadoop-aws-3.2.2.jar,/usr/local/airflow/jars/mssql-jdbc-12.4.0.jre8.jar --name arrow-spark /usr/local/airflow/dags/spark_scripts/spark_load_table_igc_ativy_db03.py lancamento ano_dt_lancamento mes_dt_lancamento. **Error code is: -9**.
[2023-11-09, 00:57:32 UTC] {taskinstance.py:1398} INFO - Marking task as FAILED. dag_id=igc_ativy_lancamento_elt, task_id=spark_load_lancamento_db03, execution_date=20231108T063546, start_date=20231109T004309, end_date=20231109T005732