VectorType for StructType in Pyspark Schema

I'm reading a parquet file that has the following schema:
df.printSchema()
root
 |-- time: integer (nullable = true)
 |-- amountRange: integer (nullable = true)
 |-- label: integer (nullable = true)
 |-- pcaVector: vector (nullable = true)
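The vector entry presumably corresponds to Spark ML's VectorUDT; a quick check along these lines should confirm what Python type actually backs the field (assuming the DataFrame is loaded as df, as above):
from pyspark.ml.linalg import VectorUDT

# Inspect the Python type behind the 'vector' column reported by printSchema()
field_type = df.schema['pcaVector'].dataType
print(type(field_type))                   # expected: <class 'pyspark.ml.linalg.VectorUDT'>
print(isinstance(field_type, VectorUDT))  # expected: True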
Now I want to test PySpark Structured Streaming against the same Parquet files. The closest schema I was able to come up with uses ArrayType, but it doesn't work:
from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType, FloatType

schema = StructType(
    [
        StructField('time', IntegerType()),
        StructField('amountRange', IntegerType()),
        StructField('label', IntegerType()),
        StructField('pcaVector', ArrayType(FloatType()))
    ]
)

df_stream = spark.readStream \
    .format("parquet") \
    .schema(schema) \
    .load("/home/user/test_arch/data/fraud/")
This fails with:
Caused by: java.lang.ClassCastException: Expected instance of group converter but got "org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter"
at org.apache.parquet.io.api.Converter.asGroupConverter(Converter.java:37)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RepeatedGroupConverter.<init>(ParquetRowConverter.scala:659)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter.org$apache$spark$sql$execution$datasources$parquet$ParquetRowConverter$$newConverter(ParquetRowConverter.scala:308)
How can I create a schema with a vector type for the StructType in PySpark? VectorType seems to exist only for Scala.
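One direction that looks promising is pyspark.ml.linalg.VectorUDT, the user-defined type behind the vector column; as far as I can tell it can be used directly as a field type in a StructType. A sketch, reusing the same path and SparkSession as above:
from pyspark.sql.types import StructType, StructField, IntegerType
from pyspark.ml.linalg import VectorUDT

# Same schema as before, but with VectorUDT for the vector column,
# which should match how Spark ML wrote the Parquet files
schema = StructType(
    [
        StructField('time', IntegerType()),
        StructField('amountRange', IntegerType()),
        StructField('label', IntegerType()),
        StructField('pcaVector', VectorUDT())
    ]
)

df_stream = spark.readStream \
    .format("parquet") \
    .schema(schema) \
    .load("/home/user/test_arch/data/fraud/")
If downstream code actually needs a plain array rather than a vector, on Spark 3.0+ pyspark.ml.functions.vector_to_array should be able to convert the column after loading.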