Spark Case Study: Analyzing User Information from Nginx Access Logs


This post analyzes some user information based on nginx access logs.

Browser share

All that is needed is the user-agent; the browser information can be extracted from it.

Approach

Look at the structure of a single log line:

10.211.99.57 - - [07/Feb/2018:03:12:18 +0000] "GET / HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"

Split out the user-agent in a way that fits the actual log format:

String[] split = line.split("\"");
int length = split.length;
// the user-agent is the last element
String userAgent = split[length - 1];
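
As a quick sanity check (a standalone sketch, not part of the Spark job; the class name is made up for illustration), splitting the sample line above on the double quote and taking the last element does yield the user-agent, because Java's String.split drops trailing empty strings:

public class UserAgentExtractCheck {

    public static void main(String[] args) {
        // the sample access.log line from above
        String line = "10.211.99.57 - - [07/Feb/2018:03:12:18 +0000] "
                + "\"GET / HTTP/1.1\" 200 612 \"-\" "
                + "\"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                + "AppleWebKit/537.36 (KHTML, like Gecko) "
                + "Chrome/63.0.3239.132 Safari/537.36\"";

        String[] split = line.split("\"");
        // prints: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36
        System.out.println(split[split.length - 1]);
    }
}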

Then write the program on top of Spark RDDs.

Code

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

// plus the import for the UserAgentParser class of whichever user-agent parsing library is in use

public class NginxUserAgentApp {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName(NginxUserAgentApp.class.getSimpleName());
        JavaSparkContext sc = new JavaSparkContext(conf);

        String readPath = args[0];
        String savePath = args[1];

        sc.textFile(readPath)
                // extract the browser information from one line of the nginx access.log
                .map(NginxUserAgentApp::getBrowserFromLine)
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b)
                .map(tuple -> tuple._1 + ": " + tuple._2)
                .saveAsTextFile(savePath);
    }

    /**
     * Get the browser.
     *
     * @param line one line of the log
     * @return the browser information extracted from the user-agent
     */
    private static String getBrowserFromLine(String line) {
        return new UserAgentParser()
                .parse(userAgentExtract(line))
                .getBrowser();
    }

    /**
     * Adapt this user-agent extraction rule to your own log format.
     *
     * @param line one line of the log
     * @return the user-agent string
     */
    private static String userAgentExtract(String line) {
        String[] split = line.split("\"");
        int length = split.length;
        return split[length - 1];
    }

}

Result

MSIE: 129
Safari: 14905
Unknown: 21986
Chrome: 239642
Apache HTTP Client: 1
Firefox: 2319

Limitation

The counts are not deduplicated by IP; with a large data set this has little impact on the result.

Improvement

Approach

The processing can follow this approach: map each line to a browser:ip string, call distinct() to drop repeat visits from the same IP with the same browser, and then count the remaining entries per browser.
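
To see why distinct() on the browser:ip strings removes repeat visits, here is a minimal local-mode sketch on toy data (the class name, the local[*] master, and the sample values are assumptions for illustration; the real job below reads the log from a path argument):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class DistinctBrowserIpSketch {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName(DistinctBrowserIpSketch.class.getSimpleName())
                .setMaster("local[*]"); // local mode, for illustration only
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            sc.parallelize(Arrays.asList(
                    // three Chrome requests from the same IP collapse into one entry
                    "Chrome:10.211.99.57",
                    "Chrome:10.211.99.57",
                    "Chrome:10.211.99.57",
                    "Firefox:10.211.96.75",
                    "Chrome:10.211.96.75"))
                    .distinct()
                    .mapToPair(browserIp -> new Tuple2<>(browserIp.split(":")[0], 1))
                    .reduceByKey((a, b) -> a + b)
                    .collect()
                    // prints (in some order): Chrome: 2, Firefox: 1
                    .forEach(tuple -> System.out.println(tuple._1 + ": " + tuple._2));
        }
    }
}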

Improved code

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

// plus the import for the UserAgentParser class of whichever user-agent parsing library is in use

public class NginxIpUserAgentApp {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName(NginxIpUserAgentApp.class.getSimpleName());
        JavaSparkContext sc = new JavaSparkContext(conf);

        String readPath = args[0];
        String savePath = args[1];

        sc.textFile(readPath)
                // transform each line into the form `browser:ip`
                .map(line -> getBrowserFromLine(line) + ":" + getIpFromLine(line))
                // deduplicate identical browser:ip entries
                .distinct()
                // count each remaining entry as 1 for its browser
                .mapToPair(browserIp -> new Tuple2<>(browserIp.split(":")[0], 1))
                // sum the counts per browser
                .reduceByKey((a, b) -> a + b)
                .map(tuple -> tuple._1 + ": " + tuple._2)
                .saveAsTextFile(savePath);
    }

    /**
     * Get the IP.
     *
     * @param line one line of the log
     * @return the IP address
     */
    private static String getIpFromLine(String line) {
        String[] split = line.split(" ");
        return split[0];
    }

    /**
     * Get the browser.
     *
     * @param line one line of the log
     * @return the browser information extracted from the user-agent
     */
    private static String getBrowserFromLine(String line) {
        return new UserAgentParser()
                .parse(userAgentExtract(line))
                .getBrowser();
    }

    /**
     * Adapt this user-agent extraction rule to your own log format.
     *
     * @param line one line of the log
     * @return the user-agent string
     */
    private static String userAgentExtract(String line) {
        String[] split = line.split("\"");
        int length = split.length;
        return split[length - 1];
    }

}

Result

After deduplication, each count is the number of distinct IPs that used the browser:

MSIE: 2
Safari: 4
Unknown: 2
Chrome: 24
Apache HTTP Client: 1
Firefox: 3

Counting IPs

Count how many times each IP appears; this is fairly simple.

Code

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class NginxIpApp {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName(NginxIpApp.class.getSimpleName());
        JavaSparkContext sc = new JavaSparkContext(conf);

        String readPath = args[0];
        String savePath = args[1];

        sc.textFile(readPath)
                // (ip1, 1), (ip1, 1), (ip2, 1)...
                .mapToPair(line -> new Tuple2<>(getIpFromLine(line), 1))
                // (ip1, 2), (ip2, 1)...
                .reduceByKey((a, b) -> a + b)
                .map(tuple -> tuple._1 + ": " + tuple._2)
                .saveAsTextFile(savePath);
    }

    /**
     * Get the IP.
     *
     * @param line one line of the log
     * @return the IP address
     */
    private static String getIpFromLine(String line) {
        String[] split = line.split(" ");
        return split[0];
    }

}

Result

10.211.96.208: 9
10.221.0.54: 1
10.211.98.205: 1948
172.20.0.108: 234364
10.221.0.85: 7
10.211.96.247: 1835
10.211.107.182: 94
10.211.98.125: 7
10.211.107.58: 5
10.221.0.65: 19
10.211.96.131: 169
10.211.96.75: 171
10.211.98.134: 125
10.211.99.57: 725
10.211.96.86: 5
10.211.104.0: 24
10.211.109.42: 37421
10.211.108.27: 91
10.211.109.130: 446
10.211.109.169: 14
10.221.0.50: 1
10.211.100.159: 99
10.221.0.53: 7
10.221.0.83: 11
10.211.102.82: 1206
10.211.118.142: 85
10.211.99.254: 93
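
The counts above come out unordered. As an optional variation that is not in the original program (the class name NginxIpSortedApp is made up), the same pipeline could swap each tuple to (count, ip) and sort by count in descending order before saving:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class NginxIpSortedApp {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName(NginxIpSortedApp.class.getSimpleName());
        JavaSparkContext sc = new JavaSparkContext(conf);

        sc.textFile(args[0])
                // (ip, 1) per request, taking the ip from the first space-separated field
                .mapToPair(line -> new Tuple2<>(line.split(" ")[0], 1))
                .reduceByKey((a, b) -> a + b)
                // swap to (count, ip) so the count becomes the sort key
                .mapToPair(tuple -> new Tuple2<>(tuple._2, tuple._1))
                // sort by count, descending
                .sortByKey(false)
                .map(tuple -> tuple._2 + ": " + tuple._1)
                .saveAsTextFile(args[1]);
    }

}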