-
Download Spark from the official site (here via the Tsinghua mirror):
wget http://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
-
Extract it:
tar -xf spark-2.4.4-bin-hadoop2.7.tgz
-
Enter the directory:
cd spark-2.4.4-bin-hadoop2.7
-
Run pyspark interactively with IPython as the driver:
PYSPARK_DRIVER_PYTHON=ipython ./bin/pyspark
-
Alternatively, install via pip:
pip install pyspark
-
then launch it directly:
pyspark
- But I use the first method anyway, so I didn't bother figuring out how to start ipython this way
-
Adjusting the log level
-
cp conf/log4j.properties.template conf/log4j.properties
-
vi conf/log4j.properties
-
Find
log4j.rootCategory=INFO, console
-
and change it to
log4j.rootCategory=WARN, console
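-
As a quick alternative (a sketch, assuming an interactive pyspark session is already running), the log level can also be changed at runtime through the SparkContext:
sc.setLogLevel("WARN")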
-
The SparkContext object is created automatically in the shell and is called sc.
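-
A quick sanity check inside the shell (just a sketch; the exact output depends on your build):
sc           # the SparkContext created by the shell
sc.version   # e.g. '2.4.4'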
-
Create an RDD:
lines = sc.textFile("README.md")
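-
textFile is not the only way to create an RDD; for quick experiments one can also be built from an in-memory list with parallelize (standard SparkContext API), for example:
nums = sc.parallelize([1, 2, 3, 4])
nums.count()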
-
The driver program (e.g. the shell) can direct the executors to perform operations:
lines.count()
-
Functions can be passed to the executors:
lines.filter(lambda line: "Python" in line)
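-
Note that filter is a transformation and is lazy; nothing actually runs on the executors until an action is called. A minimal sketch:
pythonLines = lines.filter(lambda line: "Python" in line)
pythonLines.first()   # action: this triggers the actual computation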
-
Run a Python script:
./bin/spark-submit my_script.py
from pyspark import SparkConf, SparkContext

print(123)  # simple marker to confirm the script is actually running

# Outside the shell, the SparkContext has to be created manually
conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf=conf)

lines = sc.textFile("README.md")
print(lines.count())
print(lines.first())

pythonLines = lines.filter(lambda line: "Python" in line)
print(pythonLines.first())

sc.stop()
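-
For reference, spark-submit can also take the master URL on the command line via the standard --master flag, e.g. to run locally with two worker threads:
./bin/spark-submit --master local[2] my_script.py
Note that values set directly on the SparkConf in code (like setMaster above) take precedence over spark-submit flags, so you would drop setMaster from the script for --master to take effect.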