- Download Spark from the official site: wget http://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
- Extract it: tar -xf spark-2.4.4-bin-hadoop2.7.tgz
- Enter the directory: cd spark-2.4.4-bin-hadoop2.7
- Start an interactive pyspark shell with IPython: PYSPARK_DRIVER_PYTHON=ipython ./bin/pyspark
- Alternatively: pip install pyspark (a quick import check is sketched below)
- Then just run pyspark directly
- But I use the first approach anyway, so I haven't bothered to figure out how to start IPython with this one
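- For the pip route, a quick sanity check that the package is importable (a minimal sketch, not from the original notes; it only imports the package and prints its version):

  ```python
  # Verify that the pip-installed pyspark package can be imported
  import pyspark

  print(pyspark.__version__)  # prints the bundled Spark version string, e.g. 2.4.x
  ```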
- Adjusting the log level
- cp conf/log4j.properties.template conf/log4j.properties
- vi conf/log4j.properties
- Find log4j.rootCategory=INFO, console
- Change it to log4j.rootCategory=WARN, console
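- Editing log4j.properties changes the default for the whole installation; a minimal alternative sketch is to set the level from the driver program instead (the app name below is just a placeholder):

  ```python
  from pyspark import SparkContext

  # Stand-alone example; in the pyspark shell an sc already exists,
  # so only the setLogLevel() call would be needed there
  sc = SparkContext(master="local", appName="log-level-demo")  # placeholder app name
  sc.setLogLevel("WARN")  # suppress INFO output from this point on
  sc.stop()
  ```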
 
- The SparkContext object is created automatically in the shell and is called sc
- Create an RDD: lines = sc.textFile("README.md")
- The driver program (e.g. the shell) can have the executors carry out operations: lines.count()
- Functions can also be passed to the executors: lines.filter(lambda line: "Python" in line) (a combined session sketch follows below)
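- Putting the pieces above together, a minimal session (meant to be typed into the pyspark shell started earlier, where sc already exists, and assuming a README.md in the working directory):

  ```python
  # Build an RDD from a local text file
  lines = sc.textFile("README.md")

  # Actions are computed by the executors and their results returned to the driver
  print(lines.count())          # number of lines in the file
  print(lines.first())          # first line of the file

  # Transformations such as filter() ship the lambda to the executors
  pythonLines = lines.filter(lambda line: "Python" in line)
  print(pythonLines.first())    # first line containing "Python"
  ```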
 
- Run a Python script with ./bin/spark-submit my_script.py; the script creates an RDD and works with it:

      # my_script.py
      print(123)  # sanity check that the script started
      from pyspark import SparkConf, SparkContext

      conf = SparkConf().setMaster("local").setAppName("My App")
      sc = SparkContext(conf=conf)

      lines = sc.textFile("README.md")
      print(lines.count())
      print(lines.first())

      pythonLines = lines.filter(lambda line: "Python" in line)
      print(pythonLines.first())

      sc.stop()
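- For reference, the same job can be written against the Spark 2.x SparkSession entry point instead of building SparkConf/SparkContext by hand (a sketch, not from the original notes):

  ```python
  from pyspark.sql import SparkSession

  # SparkSession wraps the SparkContext in Spark 2.x
  spark = (SparkSession.builder
           .master("local")
           .appName("My App")
           .getOrCreate())
  sc = spark.sparkContext  # the underlying SparkContext is still available

  lines = sc.textFile("README.md")
  print(lines.count())

  spark.stop()
  ```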