
How Can MapReduce Beginners Get Started Quickly with Introductory Examples?

Introductory MapReduce examples: learn the MapReduce programming model and pick up the basic concepts and skills of distributed data processing.

MapReduce is a parallel programming model for processing large-scale datasets. It was proposed by Google and has an open-source implementation in Apache Hadoop. This article walks through three introductory MapReduce examples to help beginners understand the basic concepts and workflow.

I. Objectives

1. WordCount: count the number of occurrences of each word in a text file.

2. File merge and deduplication: merge two files and remove any duplicate content.

3. Sorting input files: read several input files and output the integers they contain in ascending order.

II. Environment

Operating system: Linux (Ubuntu 16.04 or Ubuntu 18.04 recommended)

Hadoop version: 3.1.3
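
Before starting, make sure the Hadoop daemons are running. A typical check (standard commands shipped with the Hadoop distribution; your installation paths may differ) looks like this:

hadoop version   # should report 3.1.3
start-dfs.sh     # start HDFS (NameNode, DataNode, SecondaryNameNode)
start-yarn.sh    # start YARN (ResourceManager, NodeManager)
jps              # list the running Java daemons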

III. Contents and Steps

Example 1: WordCount

1.1 Create a test file

Create a file named test.txt with the following content:

hello world
hello hadoop

1.2 Write the code

Create a new Java project and add the following classes:

MyWC (driver class)

MyMapper

MyReducer

1.3 The MyWC driver class

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MyWC {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(MyWC.class);
        job.setMapperClass(MyMapper.class);
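        // Using the reducer as a combiner is valid here because summing counts is associative and commutative.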
        job.setCombinerClass(MyReducer.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

1.4 The MyMapper class

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Split each line on whitespace and emit (word, 1) for every token.
        String line = value.toString();
        String[] words = line.split("\\s+");
        for (String w : words) {
            if (w.isEmpty()) continue; // skip the empty token produced by leading whitespace
            word.set(w);
            context.write(word, one);
        }
    }
}

1.5 The MyReducer class

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

1.6 Package as a jar and run

Package the code above into a jar file, upload the input data to HDFS, and then run the following command on the Hadoop cluster:

hadoop jar /path/to/your/jarfile.jar MyWC /user/hadoop/input /user/hadoop/output

You can track the job and its result through the ResourceManager web UI on port 8088.
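
For completeness, a typical command sequence for preparing the input and inspecting the output might look like the following (the directory names match those used above and are only illustrative):

hdfs dfs -mkdir -p /user/hadoop/input
hdfs dfs -put test.txt /user/hadoop/input
hadoop jar /path/to/your/jarfile.jar MyWC /user/hadoop/input /user/hadoop/output
hdfs dfs -cat /user/hadoop/output/part-r-00000

For the test.txt above, the output should be:

hadoop	1
hello	2
world	1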

Example 2: File Merge and Deduplication

2.1 Prepare test data

Create two text files, A.txt and B.txt, as follows:

A.txt:

hello world
hello hadoop

B.txt:

world hello
hadoop hello

2.2 Write the code

Create a new Java project and add the following classes:

FileMergeDriver (driver class)

MyMapper

MyReducer

2.3 The FileMergeDriver class

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class FileMergeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "file merge and deduplicate");
        job.setJarByClass(FileMergeDriver.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        // args[0] is an input directory containing A.txt and B.txt; args[1] is the output directory.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

2.4 The MyMapper class

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MyMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Emit each line as the key; duplicate lines collapse into a single key during the shuffle.
        context.write(value, NullWritable.get());
    }
}

2.5 The MyReducer class

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MyReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
    @Override
    public void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        // Each distinct line arrives exactly once as a key, so writing only the key deduplicates the merged input.
        context.write(key, NullWritable.get());
    }
}

2.6 Package as a jar and run

Package the code above into a jar file, upload A.txt and B.txt to the HDFS input directory, and then run the following command on the Hadoop cluster:

hadoop jar /path/to/your/jarfile.jar FileMergeDriver /user/hadoop/input /user/hadoop/output

If the output matches the expected result, the experiment is successful.
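
With the line-based deduplication code above, the merged output (viewable with hdfs dfs -cat /user/hadoop/output/part-r-00000, using a fresh output directory) contains each distinct line exactly once, sorted lexicographically:

hadoop hello
hello hadoop
hello world
world hello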

Example 3: Sorting Input Files

3.1 Prepare test data

Create three text files, 1.txt, 2.txt, and 3.txt, as follows:

1.txt:

33 37 12 40

2.txt:

4 16 39 5

3.txt:

1 45 25

3.2 Write the code

Create a new Java project and add the following classes:

SortIntegers (driver class)

SortIntegersMapper

SortIntegersReducer

3.3 The SortIntegers driver class

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class SortIntegers {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "sort integers");
        job.setJarByClass(SortIntegers.class);
        job.setMapperClass(SortIntegersMapper.class);
        job.setReducerClass(SortIntegersReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        // A single reducer produces one globally sorted output file.
        job.setNumReduceTasks(1);
        // args[0] is an input directory containing 1.txt, 2.txt and 3.txt; args[1] is the output directory.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

3.4 The SortIntegersMapper class

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class SortIntegersMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private IntWritable number = new IntWritable();
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Each input line holds whitespace-separated integers; emit (number, 1) so that
        // the shuffle phase sorts the numbers for us (IntWritable keys sort numerically).
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            number.set(Integer.parseInt(token));
            context.write(number, one);
        }
    }
}
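
3.5 The SortIntegersReducer class

The original post does not show this class; the sketch below is one common way to complete the example, assuming the output format "rank value" with ranks starting at 1:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;
public class SortIntegersReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
    // Keys arrive in ascending numeric order; each number is written once per occurrence,
    // preceded by its rank in the sorted sequence.
    private IntWritable rank = new IntWritable(1);
    @Override
    public void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        for (IntWritable ignored : values) {
            context.write(rank, key);
            rank.set(rank.get() + 1);
        }
    }
}

3.6 Package as a jar and run

As in the previous examples, package the classes into a jar, upload 1.txt, 2.txt, and 3.txt to an HDFS input directory, and run a command of the same form (paths illustrative):

hadoop jar /path/to/your/jarfile.jar SortIntegers /user/hadoop/input /user/hadoop/output

With the sample data, the sketch above would produce:

1	1
2	4
3	5
4	12
5	16
6	25
7	33
8	37
9	39
10	40
11	45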

That concludes this detailed walkthrough of introductory MapReduce examples. I hope it has cleared up some of your questions; feel free to leave a comment with any feedback, and thanks for reading.
