Hadoop Multi-Table Join


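1. Example Description

A multi-table join, like a single-table join, processes the raw data to extract the information of interest. The example below shows how.

The input consists of two files. One is the factory table, with a factory name column and an address ID column; the other is the address table, with an address ID column and an address name column. The task is to find which address name each factory name corresponds to and to output a factory name-address name table.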

Sample input:

factory:

    factoryname addressID
    Beijing Red Star 1
    Shenzhen Thunder 3
    Guangzhou Honda 2
    Beijing Rising 1
    Guangzhou Development Bank 2
    Tencent 3
    Bank of Beijing 1

address:

    addressID addressname
    1 Beijing
    2 Guangzhou
    3 Shenzhen
    4 Xian

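Sample output (joining the sample input on the address ID; the order of rows within one address may vary between runs):

    factoryname                   addressname
    Beijing Red Star              Beijing
    Beijing Rising                Beijing
    Bank of Beijing               Beijing
    Guangzhou Honda               Guangzhou
    Guangzhou Development Bank    Guangzhou
    Shenzhen Thunder              Shenzhen
    Tencent                       Shenzhen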
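
As a cross-check on the expected result, the same join can be reproduced in a few lines of plain Java, outside Hadoop. This is only an illustrative sketch: the class name `InMemoryJoinCheck` and its `join` method are hypothetical, and it assumes each address ID appears at most once in the address table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the MapReduce job: a simple hash join.
final class InMemoryJoinCheck {

    // factories: rows of {factoryname, addressID}
    // addresses: rows of {addressID, addressname}; IDs are assumed unique
    static List<String[]> join(List<String[]> factories, List<String[]> addresses) {
        Map<String, String> nameById = new HashMap<>();
        for (String[] a : addresses) {
            nameById.put(a[0], a[1]);
        }
        List<String[]> out = new ArrayList<>();
        for (String[] f : factories) {
            String addressName = nameById.get(f[1]);
            if (addressName != null) {
                // Keep only factories whose address ID exists in the address table.
                out.add(new String[] {f[0], addressName});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> factories = Arrays.asList(
            new String[] {"Beijing Red Star", "1"},
            new String[] {"Shenzhen Thunder", "3"},
            new String[] {"Guangzhou Honda", "2"});
        List<String[]> addresses = Arrays.asList(
            new String[] {"1", "Beijing"},
            new String[] {"2", "Guangzhou"},
            new String[] {"3", "Shenzhen"},
            new String[] {"4", "Xian"});
        for (String[] row : join(factories, addresses)) {
            System.out.println(row[0] + "\t" + row[1]);
        }
    }
}
```

Address 4 (Xian) has no factory, so it contributes no output row, which matches the behaviour of a natural join.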

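2. Design Approach

A multi-table join resembles a single-table join; both are similar to a natural join in a database. Compared with the single-table case, the left table, the right table, and the join column are more clearly defined here, so the same processing approach applies. Map identifies which table each input line belongs to, splits the line, stores the join column's value in the key and the other column plus a left/right table tag in the value, and emits the pair. After the shuffle has grouped the values by key, Reduce parses each value, stores the left-table and right-table contents separately according to the tag, computes their Cartesian product, and writes it out directly.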
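
The tag, group, and Cartesian-product steps can be tried without a cluster. The following is a minimal plain-Java sketch of that flow, not the Hadoop job itself: `ReduceSideJoinSketch`, `mapLine`, `reduceGroup`, and `join` are hypothetical names, a `TreeMap` stands in for the shuffle, and lines are assumed to be space-separated with the address ID as a run of digits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of a tagged reduce-side join; no Hadoop classes involved.
final class ReduceSideJoinSketch {

    // "Map" step: returns {joinKey, taggedValue}, or null for header, blank, or malformed lines.
    static String[] mapLine(String line) {
        line = line.trim();
        if (line.isEmpty() || line.startsWith("factoryname") || line.startsWith("addressID")) {
            return null;
        }
        if (Character.isDigit(line.charAt(0))) {
            // Right table: "<addressID> <addressname>"
            int sep = line.indexOf(' ');
            if (sep < 0) return null;
            return new String[] {line.substring(0, sep), "2+" + line.substring(sep + 1)};
        }
        // Left table: "<factoryname> <addressID>"; the name may contain spaces.
        int sep = line.lastIndexOf(' ');
        if (sep < 0) return null;
        return new String[] {line.substring(sep + 1), "1+" + line.substring(0, sep)};
    }

    // "Reduce" step for one key: separate values by tag, then form the Cartesian product.
    static List<String> reduceGroup(List<String> taggedValues) {
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();
        for (String v : taggedValues) {
            (v.charAt(0) == '1' ? left : right).add(v.substring(2));
        }
        List<String> out = new ArrayList<>();
        for (String factory : left) {
            for (String address : right) {
                out.add(factory + "\t" + address);
            }
        }
        return out;
    }

    // Stand-in for the shuffle: group tagged values by join key, then reduce each group.
    static List<String> join(List<String> lines) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String line : lines) {
            String[] kv = mapLine(line);
            if (kv != null) {
                groups.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
            }
        }
        List<String> out = new ArrayList<>();
        for (List<String> group : groups.values()) {
            out.addAll(reduceGroup(group));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "factoryname addressID", "Beijing Red Star 1", "Tencent 3",
            "addressID addressname", "1 Beijing", "3 Shenzhen", "4 Xian");
        for (String row : join(lines)) {
            System.out.println(row);
        }
    }
}
```

A key that has values from only one table (address 4 here) yields an empty product, so the reduce step emits nothing for it.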

For a detailed analysis of this example, see the Hadoop single-table join post. The code follows.

3. Program Code

The program code is as follows:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;


public class MTjoin {

    public static class Map extends Mapper<Object, Text, Text, Text> {
        // Map first decides whether the input line belongs to the left (factory) table
        // or the right (address) table, then splits it into its two columns. The join
        // column goes into the key; the other column, prefixed with a table tag
        // ("1+" for left, "2+" for right), goes into the value.
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString().trim();
            // Skip blank lines and the header line of either input file.
            if (line.isEmpty() || line.contains("factoryname") || line.contains("addressID")) {
                return;
            }
            if (Character.isDigit(line.charAt(0))) {
                // Right table: "<addressID> <addressname>"; split at the first space
                // so that multi-digit IDs are kept whole.
                int sep = line.indexOf(' ');
                if (sep < 0) {
                    return; // malformed line, no separator
                }
                String addressId = line.substring(0, sep);
                String addressName = line.substring(sep + 1).trim();
                context.write(new Text(addressId), new Text("2+" + addressName));
            } else {
                // Left table: "<factoryname> <addressID>"; the name may contain spaces,
                // so split at the last space.
                int sep = line.lastIndexOf(' ');
                if (sep < 0) {
                    return; // malformed line, no separator
                }
                String factoryName = line.substring(0, sep).trim();
                String addressId = line.substring(sep + 1);
                context.write(new Text(addressId), new Text("1+" + factoryName));
            }
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        // Write the header line once per reduce task, before any key is processed.
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            context.write(new Text("factoryname"), new Text("addressname"));
        }

        // Reduce parses the Map output, stores the values of the left and right
        // tables separately, then emits their Cartesian product.
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Lists instead of fixed-size arrays, so any number of rows per key fits.
            List<String> factories = new ArrayList<>();
            List<String> addresses = new ArrayList<>();
            for (Text value : values) {
                String record = value.toString();
                if (record.length() < 2) {
                    continue; // no tag prefix, ignore
                }
                String payload = record.substring(2); // strip the "1+" or "2+" tag
                if (record.charAt(0) == '1') {
                    // Left table
                    factories.add(payload);
                } else {
                    // Right table
                    addresses.add(payload);
                }
            }
            // Cartesian product; nothing is emitted if either side is empty.
            for (String factory : factories) {
                for (String address : addresses) {
                    context.write(new Text(factory), new Text(address));
                }
            }
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: MTjoin <in> <out>");
            System.exit(2);
        }
        // On Hadoop 2.x and later, Job.getInstance replaces the deprecated Job constructor.
        Job job = Job.getInstance(conf, "multiple table join");
        job.setJarByClass(MTjoin.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // <in> is expected to contain both the factory file and the address file.
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

}
