VectorType for StructType in Pyspark Schema

I'm reading a parquet file that has the following schema:
df.printSchema()
root
 |-- time: integer (nullable = true)
 |-- amountRange: integer (nullable = true)
 |-- label: integer (nullable = true)
 |-- pcaVector: vector (nullable = true)
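The vector entry presumably corresponds to Spark ML's VectorUDT; a quick check along these lines should confirm what Python type actually backs the field (assuming the DataFrame is loaded as df, as above):
from pyspark.ml.linalg import VectorUDT

# Inspect the Python type behind the 'vector' column reported by printSchema()
field_type = df.schema['pcaVector'].dataType
print(type(field_type))                   # expected: <class 'pyspark.ml.linalg.VectorUDT'>
print(isinstance(field_type, VectorUDT))  # expected: True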
Now I want to test PySpark Structured Streaming against the same Parquet files. The closest schema I was able to come up with uses ArrayType, but it doesn't work:
from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType, FloatType

schema = StructType(
    [
        StructField('time', IntegerType()),
        StructField('amountRange', IntegerType()),
        StructField('label', IntegerType()),
        StructField('pcaVector', ArrayType(FloatType()))
    ]
)

df_stream = spark.readStream \
    .format("parquet") \
    .schema(schema) \
    .load("/home/user/test_arch/data/fraud/")
This fails with:
Caused by: java.lang.ClassCastException: Expected instance of group converter but got "org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter"
at org.apache.parquet.io.api.Converter.asGroupConverter(Converter.java:37)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RepeatedGroupConverter.<init>(ParquetRowConverter.scala:659)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter.org$apache$spark$sql$execution$datasources$parquet$ParquetRowConverter$$newConverter(ParquetRowConverter.scala:308)
How can I create a schema with a vector type for the StructType in PySpark? VectorType seems to exist only for Scala.
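One direction that looks promising is pyspark.ml.linalg.VectorUDT, the user-defined type behind the vector column; as far as I can tell it can be used directly as a field type in a StructType. A sketch, reusing the same path and SparkSession as above:
from pyspark.sql.types import StructType, StructField, IntegerType
from pyspark.ml.linalg import VectorUDT

# Same schema as before, but with VectorUDT for the vector column,
# which should match how Spark ML wrote the Parquet files
schema = StructType(
    [
        StructField('time', IntegerType()),
        StructField('amountRange', IntegerType()),
        StructField('label', IntegerType()),
        StructField('pcaVector', VectorUDT())
    ]
)

df_stream = spark.readStream \
    .format("parquet") \
    .schema(schema) \
    .load("/home/user/test_arch/data/fraud/")
If downstream code actually needs a plain array rather than a vector, on Spark 3.0+ pyspark.ml.functions.vector_to_array should be able to convert the column after loading.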