Distributed computation for large-scale data processing


I have a huge time-series dataset and I want to process it using Spark's parallel processing / distributed computation.
The requirement is to look at the data row by row to determine the groups, as specified below under the desired output section. I can't really get Spark to distribute this without some kind of coordination between the executors.

For instance, taking a small part of the sample dataset to explain the case:


 t   lat   long
 0   27    28
 5   27    28
10   27    28
15   29    49
20   29    49
25   27    28
30   27    28



The desired output should be:


lat-long    interval
(27,28)     (0,10)
(29,49)     (15,20)
(27,28)     (25,30)



I am able to get the desired result using this piece of code:


import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local").getOrCreate()

import spark.implicits._

val df = Seq(
  (0, 27, 28),
  (5, 27, 28),
  (10, 27, 28),
  (15, 29, 49),
  (20, 29, 49),
  (25, 27, 28),
  (30, 27, 28)
).toDF("t", "lat", "long")

val dfGrouped = df.withColumn("lat-long", struct($"lat", $"long"))

// Empty partitionBy() pulls every row into a single partition, which is
// exactly the part that does not scale.
val wAll = Window.partitionBy().orderBy($"t".asc)

dfGrouped
  .withColumn("lag", lag("lat-long", 1, null).over(wAll))                // previous position
  .withColumn("detector", when($"lat-long" === $"lag", 0).otherwise(1))  // 1 whenever the position changes
  .withColumn("runningTotal", sum("detector").over(wAll))                // running group id
  .groupBy("runningTotal", "lat-long")
  .agg(struct(min("t"), max("t")).as("interval"))
  .drop("runningTotal")
  .show()
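For the sample above this shows the three groups from the desired output; the exact struct rendering and the row order of show() can vary between Spark versions, but roughly:

+--------+--------+
|lat-long|interval|
+--------+--------+
|[27, 28]| [0, 10]|
|[29, 49]|[15, 20]|
|[27, 28]|[25, 30]|
+--------+--------+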



But if the data gets split across two executors, it will look like this:


 t   lat   long
 0   27    28
 5   27    28
10   27    28
15   29    49
20   29    49
25   27    28


 t   lat   long
30   27    28




How should I get the desired output for a large amount of data? There must be a smarter way to do this.
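One direction I have been considering (just a sketch, not validated at scale; the bucket column, bucketSize value and other names below are hypothetical, not part of my real schema): bucket the rows by a coarse time range so the change detection runs per bucket in parallel, then merge the much smaller set of per-bucket intervals across bucket boundaries in a second pass.

// Sketch only: bucketSize and the helper column names are made up.
val bucketSize = 1000L

val bucketed = dfGrouped.withColumn("bucket", floor($"t" / bucketSize))

// Pass 1: detect position changes inside each bucket (parallel, one window per bucket).
val wBucket = Window.partitionBy($"bucket").orderBy($"t".asc)

val perBucket = bucketed
  .withColumn("lag", lag($"lat-long", 1).over(wBucket))
  .withColumn("detector", when($"lat-long" === $"lag", 0).otherwise(1))
  .withColumn("localRun", sum($"detector").over(wBucket))
  .groupBy($"bucket", $"localRun", $"lat-long")
  .agg(min($"t").as("start"), max($"t").as("end"))

// Pass 2: perBucket holds one row per candidate interval, which is tiny compared
// to the raw data, so a single-partition window over it should be cheap.
// Merge neighbouring intervals that carry the same lat-long across bucket borders.
val wIntervals = Window.orderBy($"start")

val merged = perBucket
  .withColumn("prev", lag($"lat-long", 1).over(wIntervals))
  .withColumn("newGroup", when($"lat-long" === $"prev", 0).otherwise(1))
  .withColumn("groupId", sum($"newGroup").over(wIntervals))
  .groupBy($"groupId", $"lat-long")
  .agg(min($"start").as("start"), max($"end").as("end"))
  .drop("groupId")

merged.orderBy($"start").show()

The second pass only sees one row per candidate interval, so its single-partition window should be far cheaper than running one over the raw rows, but I am not sure whether this is the idiomatic way to do it.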



Please point me in the right direction; I have researched this but have not been able to land on a solution.



PS: This is just a sample example.





Are you able to create a single cluster with many machines? Then the data will be visible as one dataset and it will be easier to operate on it.
– wind
yesterday





@wind Can you please elaborate?
– experiment
yesterday





As I understand your description, you have 2 separate data clusters (HDFS or whatever) and you have a problem with joining this data. Am I correct?
– wind
yesterday





No, it's more like this: I have time-series data in Cassandra and I need to do some processing in a parallel manner, so the data would be distributed into different partitions and the processing applied to each partition, but at the end I aim to get the above-mentioned result.
– experiment
22 hours ago





OK, I got it. In this case you shouldn't work on partitions directly. Spark is designed to apply operations like map, reduce, group by, etc. to the entire dataset. If you want to apply an operation to a specific group of rows you can use e.g. reduceByKey.
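For instance (only a rough sketch: it assumes the rows already carry a group key, e.g. the runningTotal column from your code, read here from a placeholder taggedDf):

// taggedDf is a placeholder for a DataFrame that already has the runningTotal key.
// Collapsing each group to its (min t, max t) interval is then a plain reduce per key.
val intervals = taggedDf
  .select($"runningTotal", $"lat", $"long", $"t")
  .as[(Long, Int, Int, Int)]
  .rdd
  .map { case (key, lat, lng, t) => ((key, lat, lng), (t, t)) }
  .reduceByKey { case ((lo1, hi1), (lo2, hi2)) => (math.min(lo1, lo2), math.max(hi1, hi2)) }

intervals.collect().foreach { case ((_, lat, lng), (start, end)) =>
  println(s"($lat,$lng) ($start,$end)")
}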
– wind
10 hours ago








