Spark SQL Series: Getting Started with Datasets and DataFrames

Datasets and DataFrames

A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine.

A DataFrame is a Dataset organized into named columns.

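A DataFrame can be regarded as a special case of a Dataset: in the Java API it is a Dataset<Row>. The main difference is that each record of a typed Dataset holds a strongly typed value (a Java object) rather than a Row. The following creates a DataFrame (a Dataset<Row>):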

// creates a DataFrame based on the content of a JSON file
Dataset<Row> df = spark.read().json("file:///usr/local/spark/people.json");
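
To make the contrast with Dataset<Row> concrete, here is a minimal sketch of a typed Dataset. The Person bean, its fields, the sample values, and the class name TypedDatasetSketch are all assumptions made for illustration, and master("local[*]") is set only so the sketch can run without spark-submit.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class TypedDatasetSketch {

    // Hypothetical JavaBean; Encoders.bean needs a public no-arg
    // constructor plus getters and setters for each field
    public static class Person implements Serializable {
        private String name;
        private int age;

        public Person() { }

        public Person(String name, int age) {
            this.name = name;
            this.age = age;
        }

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }

    // Returns the names of people older than minAge, using typed lambdas
    // instead of column expressions such as df.col("age").gt(21)
    public static List<String> namesOlderThan(SparkSession spark, List<Person> people, int minAge) {
        Encoder<Person> personEncoder = Encoders.bean(Person.class);
        Dataset<Person> ds = spark.createDataset(people, personEncoder);
        return ds
                .filter((FilterFunction<Person>) p -> p.getAge() > minAge)
                .map((MapFunction<Person, String>) Person::getName, Encoders.STRING())
                .collectAsList();
    }

    public static void main(String[] args) {
        // local[*] is an assumption so the sketch runs on a single machine
        SparkSession spark = SparkSession.builder()
                .appName(TypedDatasetSketch.class.getSimpleName())
                .master("local[*]")
                .getOrCreate();

        // Sample values are made up for illustration
        List<Person> people = Arrays.asList(new Person("Andy", 30), new Person("Justin", 19));
        System.out.println(namesOlderThan(spark, people, 21));

        spark.stop();
    }
}
```

Because each record is a Person, a mistyped getter name fails at compile time, whereas a mistyped column name on a Dataset<Row> only fails when the query is analyzed at run time.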

Must-read sections of the official documentation

Creating Datasets

Converting between RDDs and Datasets

DataFrames example

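package com.example.spark.sql;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

/**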
 * DataFrames
 *
 * bash:
 * bin/spark-submit --class "com.example.spark.sql.DataFrameApp" spark-example-1.0-SNAPSHOT.jar
 *
 * @author xuan
 * @since 1.0.0
 */
public class DataFrameApp {

    public static void main(String[] args) {
        // The entry point into all functionality in Spark is the SparkSession class
        SparkSession spark = SparkSession
                .builder()
                .appName(DataFrameApp.class.getSimpleName())
                .getOrCreate();

        // creates a DataFrame based on the content of a JSON file
        Dataset<Row> df = spark.read().json("file:///usr/local/spark/people.json");

        // Displays the content of the DataFrame to stdout
        df.show();

        // Select only the "name" column
        df.select("name").show();

        // Select everybody, but increment the age by 1
        df.select(df.col("name"), df.col("age").plus(1)).show();

        // Select people older than 21
        df.filter(df.col("age").gt(21)).show();

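        // Save the rows as text; Spark writes a directory of part files at this
        // path and fails if the path already exists
        df.javaRDD().saveAsTextFile("file:///usr/local/spark/people.txt");

        // Stop the session and release its resources
        spark.close();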
    }

}

DataFrame is the Spark SQL-facing interface.
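
One way to see this is to register a DataFrame as a temporary view and query it with SQL. The following is a rough sketch, not the only approach. The class name SqlViewSketch, the view name people, and the inline JSON lines (standing in for people.json, whose contents are not shown here) are assumptions. Reading JSON from a Dataset<String> assumes Spark 2.2 or later.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SqlViewSketch {

    // Registers df under the (assumed) view name "people" and returns
    // the names of everyone older than 21, selected with plain SQL
    public static List<String> adultNames(SparkSession spark, Dataset<Row> df) {
        df.createOrReplaceTempView("people");
        Dataset<Row> adults = spark.sql("SELECT name FROM people WHERE age > 21");
        // The result has a single string column, so it can be read as Dataset<String>
        return adults.as(Encoders.STRING()).collectAsList();
    }

    public static void main(String[] args) {
        // local[*] is an assumption so the sketch runs without spark-submit
        SparkSession spark = SparkSession.builder()
                .appName(SqlViewSketch.class.getSimpleName())
                .master("local[*]")
                .getOrCreate();

        // Made-up JSON lines standing in for the contents of people.json
        Dataset<String> jsonLines = spark.createDataset(Arrays.asList(
                "{\"name\":\"Andy\",\"age\":30}",
                "{\"name\":\"Justin\",\"age\":19}"), Encoders.STRING());
        Dataset<Row> df = spark.read().json(jsonLines);

        System.out.println(adultNames(spark, df));

        spark.stop();
    }
}
```

The SQL query here expresses the same filter as df.filter(df.col("age").gt(21)) followed by a select on name.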