I ran:
import pyspark.sql.functions as f
df.select(f.explode("created_at")).printSchema()
and got:
[DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "explode(created_at)" due to data type mismatch: The first parameter requires the ("ARRAY" or "MAP") type, however "created_at" has the type "STRING". SQLSTATE: 42K09;
'Project [unresolvedalias(explode(created_at#755))]
+- Relation [author_id#753,conversation_id#754,created_at#755,edit_history_tweet_ids#756,id#757,in_reply_to_user_id#758,lang#759,public_metrics#760,text#761] json
From this I understood that the problem is the type: explode needs an array (or map), and my created_at column is a string.
So I tried:
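A quick way to confirm that reading before calling explode is to look at the column types. The sketch below is only an illustration: `can_explode` and `sample_dtypes` are hypothetical names, and the list stands in for what `df.dtypes` returns (pairs of column name and type string). Only the string type of created_at comes from the error above; the array type shown for edit_history_tweet_ids is an assumption.

```python
def can_explode(dtypes, column):
    """Return True if `column` has an array or map type.

    `dtypes` is a list of (name, type_string) pairs, the shape that
    PySpark's DataFrame.dtypes returns.
    """
    type_by_name = dict(dtypes)
    if column not in type_by_name:
        raise KeyError(f"no column named {column!r}")
    type_string = type_by_name[column]
    # Spark writes complex types as e.g. "array<string>" or "map<string,int>".
    return type_string.startswith(("array<", "map<"))


# Hypothetical stand-in for df.dtypes. Only created_at being a string is
# confirmed by the error message; the array type is assumed.
sample_dtypes = [
    ("created_at", "string"),
    ("edit_history_tweet_ids", "array<string>"),
]

print(can_explode(sample_dtypes, "created_at"))              # False
print(can_explode(sample_dtypes, "edit_history_tweet_ids"))  # True
```

With a real DataFrame the call would be `can_explode(df.dtypes, "created_at")`.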
from pyspark.sql.functions import try_to_timestamp

df = df.withColumn(
    "created_at",
    try_to_timestamp("created_at", "yyyy-MM-dd'T'HH:mm:ss.SSSSSSZ")
)
df.show()
and this came back:
Cell In[24], line 3
      1 from pyspark.sql.functions import try_to_timestamp
----> 3 df = df.withColumn(
      4     "created_at",
      5     try_to_timestamp("created_at", "yyyy-MM-dd'T'HH:mm:ss.SSSSSSZ")
      6 )
      7 df.show()

File ~/.local/lib/python3.10/site-packages/pyspark/sql/classic/dataframe.py:1647, in DataFrame.withColumn(self, colName, col)
   1642 if not isinstance(col, Column):
   1643     raise PySparkTypeError(
   1644         errorClass="NOT_COLUMN",
   1645         messageParameters={"arg_name": "col", "arg_type": type(col).__name__},
   1646     )
-> 1647 return DataFrame(self._jdf.withColumn(colName, col._jc), self.sparkSession)

File ~/.local/lib/python3.10/site-packages/py4j/java_gateway.py:1362, in JavaMember.__call__(self, *args)
   1356 command = proto.CALL_COMMAND_NAME +\
   1357     self.command_header +\
   1358     args_command +\
   1359     proto.END_COMMAND_PART
   1361 answer = self.gateway_client.send_command(command)
-> 1362 return_value = get_return_value(
...
+- Project [author_id#292, conversation_id#293, to_timestamp(created_at#353, Some(yyyy-MM-dd'T'HH:mm:ss.SSS'Z'), TimestampType, Some(America/Sao_Paulo), true) AS created_at#354, edit_history_tweet_ids#295, id#296, in_reply_to_user_id#297, lang#298, public_metrics#299, text#300]
   +- Project [author_id#292, conversation_id#293, to_timestamp(created_at#352, Some(yyyy-MM-dd'T'HH:mm:ss.SSS'Z'), TimestampType, Some(America/Sao_Paulo), true) AS created_at#353, edit_history_tweet_ids#295, id#296, in_reply_to_user_id#297, lang#298, public_metrics#299, text#300]
      +- Project [author_id#292, conversation_id#293, to_timestamp(created_at#294, Some(yyyy-MM-dd'T'HH:mm:ss.SSS'Z'), TimestampType, Some(America/Sao_Paulo), true) AS created_at#352, edit_history_tweet_ids#295, id#296, in_reply_to_user_id#297, lang#298, public_metrics#299, text#300]
         +- Relation [author_id#292,conversation_id#293,created_at#294,edit_history_tweet_ids#295,id#296,in_reply_to_user_id#297,lang#298,public_metrics#299,text#300] json