Hadoop实践之MapReduce

What


分布式离线计算框架 MapReduce On YARN:MRv2

  • 移动计算而不是移动数据
  • 一主多从架构
    • JobTracker
    • TaskTracker
  • NameNode
  • DateNode
  • ResourceManager 负责整个集群的资源管理和`调度
    • container
      • 一个container一个进程
  • ApplicationMaster:负责应用程序相关的事务,比如任务调度、任务监控和容错等。
将MapReduce作业直接运行在YARN上,而不是由JobTracker和TaskTracker构建的MRv1系统中
  • 基本功能模块

    • YARN:负责资源管理和调度
    • MRAppMaster:负责任务切分、任务调度、任务监控和容错等
    • MapTask/ReduceTask:任务驱动引擎,与MRv1一致
  • 每个MapRduce作业对应一个MRAppMaster任务调度

    • YARN将资源分配给MRAppMaster
    • MRAppMaster进一步将资源分配给内部的任务
  • MRAppMaster容错

    • 失败后,由YARN重新启动
    • 任务失败后,MRAppMaster重新申请资源
  • Zookeeper Failover Controller: 监控NameNode健康状态,并向Zookeeper注册NameNode,NameNode挂掉后,ZKFC为NameNode竞争锁,获得ZKFC 锁的NameNode变为active

细节
  1. input split 对应一个 block 128M
  2. output 可能大于 一个block 128M
  3. map 到 reduce 方法之间 叫 shuffle
  4. map 在相应的datanode上执行,reduce找空闲的datanode执行,不一定在block上
  5. split后的大小不一定是一个block大小
  6. map的输出物是经过快速排序的

<div id="flowchart-0" class="flow-chart"></div>

partition

  • hash(key) mod r
    • 有几个reduce 就有几个partition

shuffle

map task <div id="flowchart-1" class="flow-chart"></div>

Why


海量数据计算

Where


http://hadoop.apache.org/docs/r2.5.2/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html http://www.cnblogs.com/luogankun/p/4019303.html https://note.youdao.com/share/?id=95d458d779f8e9f391d6ea06b6c6d122&type=note#/ https://note.youdao.com/share/?id=86ca5c96d13413f789164ff92f9ab4f9&type=note#/ https://note.youdao.com/share/?id=212e4a69d7bf8fc30979f1e4fc39ff0f&type=note#/ https://note.youdao.com/share/?id=a518dfed10d824b0995380669ddd28c9&type=note#/

How


  1. 配置文件yarn-site.xml

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
      <configuration>
    <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
    </property>
    <property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>appcity</value>
    </property>
    <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
    </property>
    <property>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>sj-node3</value>
    </property>
    <property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>sj-node4</value>
    </property>
    <property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>sj-node1:2181,sj-node2:2181,sj-node3:2181</value>
    </property>
    <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
    </property>
    </configuration>

  2. 配置mapred-site.xml使用yarn来调度 cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml vi etc/hadoop/mapred-site.xml

    1
    2
    3
    4
    5
    6
    <configuration>
    <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
    </property>
    </configuration>

  3. 将配置文件拷贝到其它机器上

    1
    2
    3
    4
    5
    6
    scp etc/hadoop/yarn-site.xml  sj-node2:/opt/soft/hadoop-2.5.1/etc/hadoop/
    scp etc/hadoop/yarn-site.xml sj-node3:/opt/soft/hadoop-2.5.1/etc/hadoop/
    scp etc/hadoop/yarn-site.xml sj-node4:/opt/soft/hadoop-2.5.1/etc/hadoop/
    scp etc/hadoop/mapred-site.xml sj-node2:/opt/soft/hadoop-2.5.1/etc/hadoop/
    scp etc/hadoop/mapred-site.xml sj-node3:/opt/soft/hadoop-2.5.1/etc/hadoop/
    scp etc/hadoop/mapred-site.xml sj-node4:/opt/soft/hadoop-2.5.1/etc/hadoop/

  4. 由于配置了3,4节点为yarn的主节点,所以进入3或4执行命令 ./sbin/start-yarn.sh

  5. 如果某个节点故障 kill掉之后 执行单独启动命令./sbin/yarn-daemon.sh start resourcemanager

  6. 检查 http://sj-node3:8088/cluster 能够访问代表主节点启动成功 http://sj-node4:8088/cluster 能够访问并自动跳转到node3代表热备机启动成功

<script src="https://cdnjs.cloudflare.com/ajax/libs/raphael/2.2.7/raphael.min.js"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/flowchart/1.6.5/flowchart.min.js"></script>