
Data Science in Drilling - Episode 30

Writer: Zeyu Yan

RDDs in Spark - Part II


written by Zeyu Yan, Ph.D., Head of Data Science from Nvicta AI


Data Science in Drilling is a multi-episode series written by the technical team members at Nvicta AI. Nvicta AI is a startup that helps drilling service companies increase their value offering by providing them with advanced AI and automation technologies and services. The goal of this Data Science in Drilling series is to give both data engineers and drilling engineers insight into the state-of-the-art techniques that combine drilling engineering and data science.


In today's blog post, we will continue our focus on Resilient Distributed Datasets (RDDs) in Spark.


Recall that in the previous episode, we imported the necessary dependencies and created a SparkContext object:

from pyspark import SparkContext, SparkConf

# Run Spark locally with 4 worker threads
conf = SparkConf().setAppName("testApp").setMaster("local[4]")
sc = SparkContext.getOrCreate(conf)

union()


The union method returns the union of two RDDs as a new RDD. Take a look at the following example:

rdd1 = sc.parallelize([1, 2, 3, 4])
rdd2 = sc.parallelize([2, 2, 3, 4, 5, 5])
rdd3 = rdd1.union(rdd2)
rdd3.collect()

The result is:

[1, 2, 3, 4, 2, 2, 3, 4, 5, 5]
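
Note that union does not remove duplicates; it simply concatenates the two RDDs. If a set-style union without duplicates is needed, distinct can be chained after union. A minimal sketch, assuming the rdd1 and rdd2 defined above:

# Chain distinct() after union() to drop duplicate elements
rdd1.union(rdd2).distinct().collect()

The output should contain each value exactly once, e.g. [1, 2, 3, 4, 5], although the order may vary across partitions.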

intersection()


The intersection method returns the intersection of two RDDs as a new RDD; duplicates are removed from the output even if they exist in the inputs. Take a look at the following example:

rdd1 = sc.parallelize([1, 2, 3, 4])
rdd2 = sc.parallelize([2, 2, 3, 4, 5, 5])
rdd3 = rdd1.intersection(rdd2)
rdd3.collect()

The result is:

[2, 3, 4]

subtract()


The subtract method removes from one RDD all elements that also appear in another RDD. Take a look at the following example:

rdd1 = sc.parallelize([1, 2, 3, 4])
rdd2 = sc.parallelize([2, 2, 3, 4, 5, 5])
rdd3 = rdd1.subtract(rdd2)
rdd3.collect()

The result is:

[1]

countByValue()


The countByValue method counts the number of occurrences of each element in an RDD and returns the result as a Python dictionary. Take a look at the following example:

rdd = sc.parallelize([2, 2, 3, 4, 5, 5])
rdd.countByValue()

The result is:

defaultdict(<class 'int'>, {2: 2, 3: 1, 4: 1, 5: 2})
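
Since countByValue is an action, the returned dictionary lives on the driver and can be manipulated like any other Python dict. A minimal sketch, assuming the rdd defined above, which sorts the elements by how often they occur:

# Sort elements by their occurrence counts, most frequent first
counts = rdd.countByValue()
sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

This should produce something like [(2, 2), (5, 2), (3, 1), (4, 1)].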

take()


The take method returns the first given number of elements from the RDD, without applying any ordering. Take a look at the following example:

rdd = sc.parallelize([2, 2, 3, 4, 5, 5])
rdd.take(2)

The result is:

[2, 2]

top()

The top method returns the given number of largest elements from the RDD, sorted in descending order. Take a look at the following example:

rdd = sc.parallelize([2, 2, 3, 4, 5, 5])
rdd.top(3)

The result is:

[5, 5, 4]

takeOrdered()


The takeOrdered method returns the given number of smallest elements from an RDD, sorted in ascending order. Take a look at the following example:

rdd = sc.parallelize([2, 3, 2, 4, 5, 5])
rdd.takeOrdered(3)

The result is:

[2, 2, 3]
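
takeOrdered also accepts an optional key function that controls the ordering. For example, negating each element makes it return the largest values instead. A minimal sketch, assuming the same rdd as above:

# The key function reverses the ordering, so the largest elements come first
rdd.takeOrdered(3, key=lambda x: -x)

The result should be [5, 5, 4], matching the earlier top example.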

Key-Value Operations


Take a look at the following example:

kvRDD = sc.parallelize([(3, 4),(3, 6),(5, 6),(1, 2)])
print(kvRDD.keys().collect())
print(kvRDD.values().collect())

The results are:

[3, 3, 5, 1]
[4, 6, 6, 2]

It can be seen that when we create an RDD from a list of two-element tuples, the first element of each tuple is treated as the key and the second element as the value. We can then use the mapValues method to apply a map to the values only:

kvRDD.mapValues(lambda x:x**2).collect()

The result is:

[(3, 16), (3, 36), (5, 36), (1, 4)]

We can also sort the RDD by key through the sortByKey method:

print(kvRDD.sortByKey(True).collect())
print(kvRDD.sortByKey(False).collect())

The results are:

[(1, 2), (3, 4), (3, 6), (5, 6)]
[(5, 6), (3, 4), (3, 6), (1, 2)]

The default value of sortByKey's ascending option is True, which means ascending order; passing False sorts in descending order.
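
For readability, the same calls can also be written with the ascending keyword argument spelled out explicitly; a minimal sketch:

# Equivalent calls with the ascending keyword argument spelled out
print(kvRDD.sortByKey(ascending=True).collect())
print(kvRDD.sortByKey(ascending=False).collect())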


Similarly, we can count the number of elements for each key using the countByKey method. Like countByValue, countByKey is an action that returns a Python dictionary directly, so there is no need to call collect on the result:

kvRDD.countByKey()

The result is:

defaultdict(<class 'int'>, {1: 1, 3: 2, 5: 1})

Lastly, we can look up the values associated with a given key through the lookup method:

kvRDD.lookup(3)

The result is:

[4, 6]
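
If a key is not present in the RDD, lookup simply returns an empty list rather than raising an error; a minimal sketch:

# Looking up a key that does not exist returns an empty list
kvRDD.lookup(7)

The result should be [].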

Conclusions


In this article, we continued to cover the basics of Spark RDDs. We will go into greater detail about Spark in the upcoming episodes of this tutorial series. Stay tuned!


Get in Touch


Thank you for reading! Please let us know if you like this series or if you have critiques. If this series was helpful to you, please follow us and share it with your friends.


If you or your company needs any help with projects related to drilling automation and optimization, AI, and data science, please get in touch with us at Nvicta AI. We are here to help. Cheers!



