1. Scenario
    Suppose the input consists of lines of words like

    diyishuai hello hi hadoop
    spark kafka flume zookeeper
    ...

We want to split them on whitespace and count how many times each word occurs.
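The counting logic amounts to a map phase (split each line into `(word, 1)` pairs) and a reduce phase (sum the counts per word). Below is a minimal, self-contained Java sketch of that logic using only standard collections, with no Hadoop dependency; the class and method names are illustrative and are not the actual `WordcountDriver` code:

```java
import java.util.*;
import java.util.stream.*;

public class WordCountSketch {
    // Map phase: split a line on whitespace and emit one (word, 1) pair per word.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.trim().split("\\s+"))
                     .filter(w -> !w.isEmpty())
                     .map(w -> Map.entry(w, 1))
                     .collect(Collectors.toList());
    }

    // Reduce phase: sum the counts for each distinct word (sorted for readability).
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "diyishuai hello hi hadoop",
            "spark kafka flume zookeeper");
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) pairs.addAll(map(line));
        reduce(pairs).forEach((w, c) -> System.out.println(w + "\t" + c));
    }
}
```

In the real job, Hadoop runs the map phase on splits of the input files, shuffles the pairs by key, and feeds each key's values to the reducer, but the per-word arithmetic is the same as above.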

  1. Write the code
  2. Build the jar and upload it to a datanode client
  3. Start HDFS and YARN (skip if they are already running)

    start-dfs.sh
    start-yarn.sh
  4. Create the target directory in HDFS and upload the files to analyze

    hadoop fs -mkdir -p /wordcount/input
    hadoop fs -put LICENSE.txt NOTICE.txt README.txt /wordcount/input
  5. Verify the uploaded files at http://server01:50070; the wordcount job you are about to run can then be monitored at http://server01:8088

  6. Run wordcount

    hadoop jar wordcount.jar com.diyishuai.hadoop.mr.wcdemo.WordcountDriver /wordcount/input /wordcount/output
  7. The results can be viewed under /wordcount/output in HDFS
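Assuming the job completed successfully with the default single reducer, the output in step 7 can also be inspected from the shell; `part-r-00000` is MapReduce's default output file name for the first reducer:

```shell
# List the job's output files (a _SUCCESS marker plus part-r-* files)
hadoop fs -ls /wordcount/output
# Print the tab-separated word counts from the (here, only) reducer
hadoop fs -cat /wordcount/output/part-r-00000
```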

  8. Troubleshooting
    If you see an error like this:
    Container [pid=3058,containerID=container_1515314973658_0001_01_000005] is running beyond virtual memory limits. Current usage: 107.9 MB of 1 GB physical memory used; 2.1 GB of 2.1 GB virtual memory used. Killing container.

On every node, add the following to the hadoop-2.x.x/etc/hadoop/mapred-site.xml configuration file:

<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1536</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1024M</value>
</property>

and restart YARN.
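The 2.1 GB cap in the error message comes from YARN multiplying the container's physical memory allocation (1 GB) by `yarn.nodemanager.vmem-pmem-ratio`, which defaults to 2.1. Raising `mapreduce.map.memory.mb` as above raises that cap proportionally. Alternatively, you can raise the ratio itself, or (if you trust the jobs) disable the virtual memory check entirely, in yarn-site.xml on every NodeManager; both properties below are standard YARN settings, and the value 4 is just an illustrative choice:

```xml
<!-- yarn-site.xml: raise the virtual-to-physical memory ratio -->
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>4</value>
</property>
<!-- ...or skip the virtual memory check altogether -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
```

As with the mapred-site.xml change, restart YARN for the new settings to take effect.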