Distributed computation for large-scale data processing


I have a huge time-series dataset and I want to process it using Spark's parallel processing / distributed computation.
The requirement is to look at the data row by row to determine the groups, as specified below under the desired output section. I can't really get Spark to distribute this without some kind of coordination between the executors.

For instance, taking a small part of the sample dataset to explain the case:


 t   lat   long
 0   27    28
 5   27    28
10   27    28
15   29    49
20   29    49
25   27    28
30   27    28



The desired output should be:


lat-long    interval
(27,28)     (0,10)
(29,49)     (15,20)
(27,28)     (25,30)



I am able to get the desired result using this piece of code:


import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local").getOrCreate()

import spark.implicits._

val df = Seq(
  (0, 27, 28),
  (5, 27, 28),
  (10, 27, 28),
  (15, 29, 49),
  (20, 29, 49),
  (25, 27, 28),
  (30, 27, 28)
).toDF("t", "lat", "long")

val dfGrouped = df.withColumn("lat-long", struct($"lat", $"long"))

// Empty partitionBy() pulls every row into a single partition, which is
// exactly the part that does not scale.
val wAll = Window.partitionBy().orderBy($"t".asc)

dfGrouped
  .withColumn("lag", lag("lat-long", 1, null).over(wAll))                // previous position
  .withColumn("detector", when($"lat-long" === $"lag", 0).otherwise(1))  // 1 whenever the position changes
  .withColumn("runningTotal", sum("detector").over(wAll))                // running group id
  .groupBy("runningTotal", "lat-long")
  .agg(struct(min("t"), max("t")).as("interval"))
  .drop("runningTotal")
  .show()
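For the sample above this shows the three groups from the desired output; the exact struct rendering and the row order of show() can vary between Spark versions, but roughly:

+--------+--------+
|lat-long|interval|
+--------+--------+
|[27, 28]| [0, 10]|
|[29, 49]|[15, 20]|
|[27, 28]|[25, 30]|
+--------+--------+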



But if the data gets split across two executors, it will look like this:


 t   lat   long
 0   27    28
 5   27    28
10   27    28
15   29    49
20   29    49
25   27    28


 t   lat   long
30   27    28




How should I get the desired output for a large amount of data? There must be a smarter way to do this.
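One direction I have been considering (just a sketch, not validated at scale; the bucket column, bucketSize value and other names below are hypothetical, not part of my real schema): bucket the rows by a coarse time range so the change detection runs per bucket in parallel, then merge the much smaller set of per-bucket intervals across bucket boundaries in a second pass.

// Sketch only: bucketSize and the helper column names are made up.
val bucketSize = 1000L

val bucketed = dfGrouped.withColumn("bucket", floor($"t" / bucketSize))

// Pass 1: detect position changes inside each bucket (parallel, one window per bucket).
val wBucket = Window.partitionBy($"bucket").orderBy($"t".asc)

val perBucket = bucketed
  .withColumn("lag", lag($"lat-long", 1).over(wBucket))
  .withColumn("detector", when($"lat-long" === $"lag", 0).otherwise(1))
  .withColumn("localRun", sum($"detector").over(wBucket))
  .groupBy($"bucket", $"localRun", $"lat-long")
  .agg(min($"t").as("start"), max($"t").as("end"))

// Pass 2: perBucket holds one row per candidate interval, which is tiny compared
// to the raw data, so a single-partition window over it should be cheap.
// Merge neighbouring intervals that carry the same lat-long across bucket borders.
val wIntervals = Window.orderBy($"start")

val merged = perBucket
  .withColumn("prev", lag($"lat-long", 1).over(wIntervals))
  .withColumn("newGroup", when($"lat-long" === $"prev", 0).otherwise(1))
  .withColumn("groupId", sum($"newGroup").over(wIntervals))
  .groupBy($"groupId", $"lat-long")
  .agg(min($"start").as("start"), max($"end").as("end"))
  .drop("groupId")

merged.orderBy($"start").show()

The second pass only sees one row per candidate interval, so its single-partition window should be far cheaper than running one over the raw rows, but I am not sure whether this is the idiomatic way to do it.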



Please point me in the right direction; I have researched this but have not been able to land on a solution.



PS: This is just a sample example.





Are you able to create a single cluster with many machines? Then the data will be visible as one dataset and it will be easier to operate on it.
– wind
yesterday





@wind Can you please elaborate?
– experiment
yesterday





As I understand your description, you have 2 separate data clusters (HDFS or whatever) and you have a problem with joining this data. Am I correct?
– wind
yesterday





No, it's more like this: I have time-series data in Cassandra and I need to do some processing in a parallel manner, so the data would be distributed into different partitions and the processing applied to each partition, but at the end I aim to get the above-mentioned result.
– experiment
22 hours ago





OK, I got it. In this case you shouldn't work on partitions directly. Spark is designed to apply operations like map, reduce, group by, etc. to the entire dataset. If you want to apply an operation to a specific group of rows you can use e.g. reduceByKey.
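For instance (only a rough sketch: it assumes the rows already carry a group key, e.g. the runningTotal column from your code, read here from a placeholder taggedDf):

// taggedDf is a placeholder for a DataFrame that already has the runningTotal key.
// Collapsing each group to its (min t, max t) interval is then a plain reduce per key.
val intervals = taggedDf
  .select($"runningTotal", $"lat", $"long", $"t")
  .as[(Long, Int, Int, Int)]
  .rdd
  .map { case (key, lat, lng, t) => ((key, lat, lng), (t, t)) }
  .reduceByKey { case ((lo1, hi1), (lo2, hi2)) => (math.min(lo1, lo2), math.max(hi1, hi2)) }

intervals.collect().foreach { case ((_, lat, lng), (start, end)) =>
  println(s"($lat,$lng) ($start,$end)")
}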
– wind
10 hours ago








