Hi everyone,
If anyone has been through this and can help, I'd really appreciate it. It's a problem I'm facing at work, and the fixes suggested online haven't solved it.
I'm connecting to a SQL Server database. Running the code locally in a Jupyter Notebook works fine, but when I execute the same code inside the Airflow container I get this error:
```
py4j.protocol.Py4JJavaError: An error occurred while calling o44.load.
java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.SQLServerDriver
```
I downloaded the latest JDBC driver from the Microsoft site, placed it in the jars folder, and reference that same path when creating the SparkSession:
```python
spark = (SparkSession
    .builder
    .appName("spark_pandas_load_table")
    .config("spark.jars", "/usr/local/airflow/jars/mssql-jdbc-12.4.0.jre8.jar")
    .config("spark.hadoop.fs.s3a.access.key", "AKIASZCDLCHKDF2LD6OA")
    .config("spark.hadoop.fs.s3a.secret.key", "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")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
    .getOrCreate()
)
```
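For reference, here is an alternative session config I'm considering based on what I've read, in case `spark.jars` alone isn't enough when the JVM is launched by spark-submit. The jar path is the same one from my setup; the Maven coordinates in Option B are what I believe match this driver version, so treat them as an assumption:

```python
from pyspark.sql import SparkSession

# Path to the driver jar inside the Airflow container (same as above)
jar = "/usr/local/airflow/jars/mssql-jdbc-12.4.0.jre8.jar"

spark = (SparkSession
    .builder
    .appName("spark_pandas_load_table")
    # Option A: put the jar explicitly on the driver and executor classpaths
    .config("spark.driver.extraClassPath", jar)
    .config("spark.executor.extraClassPath", jar)
    # Option B (alternative): let Spark resolve the driver from Maven Central
    # .config("spark.jars.packages", "com.microsoft.sqlserver:mssql-jdbc:12.4.0.jre8")
    .getOrCreate()
)
```

I haven't confirmed yet whether either option changes the behavior inside the container.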
```
[2023-11-15, 13:59:10 UTC] {spark_submit.py:495} INFO - spark session: <pyspark.sql.session.SparkSession object at 0x7fde59b45880>
[2023-11-15, 13:59:10 UTC] {spark_submit.py:495} INFO - jdbc_url: jdbc:sqlserver://10.100.105.106:50324;databaseName=DBA;user=datalakeuser;password=***;encrypt=false;SocketTimeout=10000
[2023-11-15, 13:59:11 UTC] {spark_submit.py:495} INFO - 23/11/15 13:59:11 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
[2023-11-15, 13:59:11 UTC] {spark_submit.py:495} INFO - 23/11/15 13:59:11 INFO SharedState: Warehouse path is 'file:/usr/local/airflow/spark-warehouse'.
[2023-11-15, 13:59:12 UTC] {spark_submit.py:495} INFO - Traceback (most recent call last):
[2023-11-15, 13:59:12 UTC] {spark_submit.py:495} INFO - File "/usr/local/airflow/dags/spark_scripts/spark_pandas_load_table.py", line 56, in <module>
[2023-11-15, 13:59:12 UTC] {spark_submit.py:495} INFO - df = ps.read_sql_table(schema_name, con=jdbc_url, options={"Driver":"com.microsoft.sqlserver.jdbc.SQLServerDriver"})
[2023-11-15, 13:59:12 UTC] {spark_submit.py:495} INFO - File "/home/airflow/.local/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/pandas/namespace.py", line 1441, in read_sql_table
[2023-11-15, 13:59:12 UTC] {spark_submit.py:495} INFO - sdf = reader.format("jdbc").load()
[2023-11-15, 13:59:12 UTC] {spark_submit.py:495} INFO - File "/home/airflow/.local/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 314, in load
[2023-11-15, 13:59:12 UTC] {spark_submit.py:495} INFO - return self._df(self._jreader.load())
[2023-11-15, 13:59:12 UTC] {spark_submit.py:495} INFO - File "/home/airflow/.local/lib/python3.9/site-packages/pyspark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
[2023-11-15, 13:59:12 UTC] {spark_submit.py:495} INFO - return_value = get_return_value(
[2023-11-15, 13:59:12 UTC] {spark_submit.py:495} INFO - File "/home/airflow/.local/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 179, in deco
[2023-11-15, 13:59:12 UTC] {spark_submit.py:495} INFO - return f(*a, **kw)
[2023-11-15, 13:59:12 UTC] {spark_submit.py:495} INFO - File "/home/airflow/.local/lib/python3.9/site-packages/pyspark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
[2023-11-15, 13:59:12 UTC] {spark_submit.py:495} INFO - raise Py4JJavaError(
[2023-11-15, 13:59:12 UTC] {spark_submit.py:495} INFO - py4j.protocol.Py4JJavaError: An error occurred while calling o44.load.
[2023-11-15, 13:59:12 UTC] {spark_submit.py:495} INFO - : java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.SQLServerDriver
```
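Since the log shows the job going through `spark_submit.py`, the script is apparently launched via Airflow's SparkSubmitOperator, so the jar might need to be passed at submit time instead of inside the session builder. A sketch of what I understand that would look like in the DAG; the `task_id` and `conn_id` values are placeholders, since I haven't shown my actual DAG code:

```python
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

load_table = SparkSubmitOperator(
    task_id="spark_pandas_load_table",          # placeholder task id
    application="/usr/local/airflow/dags/spark_scripts/spark_pandas_load_table.py",
    conn_id="spark_default",                    # placeholder Spark connection id
    # Ship the JDBC driver with the job and put it on the driver classpath,
    # so it is on the JVM classpath before the SparkSession is created.
    jars="/usr/local/airflow/jars/mssql-jdbc-12.4.0.jre8.jar",
    driver_class_path="/usr/local/airflow/jars/mssql-jdbc-12.4.0.jre8.jar",
)
```

Does anyone know if that is the right way to get the driver picked up in this setup, or if I'm missing something else in the container?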