Creating RDDs

The simplest way to create an RDD is to pass an existing collection in your program to SparkContext's parallelize() method.

The parallelize() method

Python:

>>> lines = sc.parallelize(["pandas", "i like pandas"])
>>> lines.first()
'pandas'

Scala:

scala> val lines = sc.parallelize(List("pandas", "i like pandas"))
lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24
   
scala> lines.first()
res0: String = pandas

Java:

// sc is an existing JavaSparkContext; Arrays is java.util.Arrays
JavaRDD<String> lines = sc.parallelize(Arrays.asList("pandas", "i like pandas"));
System.out.println(lines.first()); // prints "pandas"
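
Note that parallelize() requires the whole dataset to be held in memory on one machine, so it is mostly useful for prototyping and testing. It also accepts an optional second argument, the number of partitions to split the data into. A minimal Scala sketch, assuming the same sc as above (the collection and partition count are illustrative, not from the examples):

val nums = sc.parallelize(1 to 100, 4)  // second argument: number of partitions
println(nums.getNumPartitions)          // 4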

The textFile() method

A more common way to create an RDD is to load data from external storage. textFile() reads a text file and returns an RDD of strings, with one element per line.

Python:

>>> lines = sc.textFile('pyspark')
>>> lines.count()
77

Scala:

scala> val lines = sc.textFile("spark-shell")
lines: org.apache.spark.rdd.RDD[String] = spark-shell MapPartitionsRDD[2] at textFile at <console>:24

scala> lines.count()
res1: Long = 95

Java:

// sc is an existing JavaSparkContext
JavaRDD<String> lines = sc.textFile("README.md");
System.out.println(lines.count()); // number of lines in README.md
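
Like parallelize(), textFile() takes an optional second argument, a hint for the minimum number of partitions of the resulting RDD. For completeness, here is a minimal self-contained Scala sketch tying both methods together outside the shell; the app name, local[*] master, and README.md path are assumptions for illustration:

import org.apache.spark.{SparkConf, SparkContext}

object CreateRDDExample {
  def main(args: Array[String]): Unit = {
    // In a standalone app the SparkContext is built by hand;
    // in spark-shell and pyspark it is provided as sc.
    val conf = new SparkConf().setAppName("CreateRDDExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // parallelize(): turn a local collection into an RDD
    val words = sc.parallelize(List("pandas", "i like pandas"))
    println(words.first()) // pandas

    // textFile(): one RDD element per line of the file
    val lines = sc.textFile("README.md", 4) // 4 = minimum-partition hint
    println(lines.count())

    sc.stop()
  }
}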