
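Table types in Hive

managed_table: managed table, also called a controlled or internal table

The lifecycle of the data is tied to the table definition: when the table is dropped, its data is deleted along with it.
Tables are created as managed tables by default.
Detailed information about a table can be viewed in the CLI with desc extended tableName, or looked up in the TBLS table of the Hive metastore database in MySQL.

external_table: external table

The lifecycle of the data is independent of the table definition: when the table is dropped, the underlying data remains.
In effect, the table is only a reference to the data.
Create an external table: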

create external table t6_external(
    id int
);

Add data:

alter table t6_external set location "/input/hive/hive-t6.txt";

The data location can also be specified when the external table is created:

create external table t6_external_1(
    id int
) location "/input/hive/hive-t6.txt";

The HQL above fails with:

MetaException(message:hdfs://ns1/input/hive/hive-t6.txt is not a directory or unable to create one

This means the location given at table creation must be a directory, not a single file:

create external table t6_external_1(
    id int
) location "/input/hive/";

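Delete-style operations on the data (for example truncate table) are rejected for an external table, but data can still be added, and added data is written into the referenced HDFS directory.

Typical uses of managed versus external tables:

When data safety matters, or when the data is shared across several departments, an external table is usually used.
When Hive is integrated with other frameworks (such as HBase), an external table is usually used.

Managed and external tables can be converted into each other:

External ---> managed
    alter table t6_external set tblproperties("EXTERNAL"="FALSE");
Managed ---> external
    alter table t2 set tblproperties("EXTERNAL"="TRUE");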
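
The difference between the two table kinds can be modelled in a few lines of Python. This is only a toy sketch, not Hive code: ToyMetastore, its methods, and the table names are hypothetical, and a local temporary directory stands in for HDFS. It shows that dropping a managed table removes the data directory, that dropping an external table leaves it alone, and that flipping the EXTERNAL flag changes what a later drop does.

```python
import shutil
import tempfile
from pathlib import Path


class ToyMetastore:
    """Hypothetical stand-in for the metastore: table name -> location and EXTERNAL flag."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name, location, external=False):
        location.mkdir(parents=True, exist_ok=True)
        self.tables[name] = {"location": location, "external": external}

    def set_external(self, name, external):
        # Models the effect of: alter table ... set tblproperties("EXTERNAL"="TRUE"/"FALSE")
        self.tables[name]["external"] = external

    def drop_table(self, name):
        entry = self.tables.pop(name)
        if not entry["external"]:
            # Managed table: the data directory goes away with the definition.
            shutil.rmtree(entry["location"])


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    store = ToyMetastore()

    store.create_table("t_managed", root / "t_managed")
    store.create_table("t_external", root / "t_external", external=True)
    for table in ("t_managed", "t_external"):
        (store.tables[table]["location"] / "data.txt").write_text("1\n2\n")

    store.drop_table("t_managed")
    store.drop_table("t_external")
    managed_data_left = (root / "t_managed").exists()
    external_data_left = (root / "t_external" / "data.txt").exists()

    # Converting an external table to managed changes what drop does.
    store.create_table("t_converted", root / "t_converted", external=True)
    store.set_external("t_converted", False)
    store.drop_table("t_converted")
    converted_data_left = (root / "t_converted").exists()

print(managed_data_left, external_data_left, converted_data_left)
```

In this model only the external table's data survives the drop, which is why external tables are the safer choice for shared data.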

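Persistent and temporary tables

All of the tables above are persistent: their existence has nothing to do with the session.
Temporary tables:

A temporary table exists only within the session that created it; when the session ends, the table and all of its data (including metadata) disappear. The data is stored temporarily: in practice, creating a temporary table creates files under the /tmp/hive directory in HDFS, and these are deleted as soon as the session exits.
It does not appear in the metastore database.
Temporary tables are typically used for temporary data storage or exchange.
Characteristics of temporary tables:
They cannot be partitioned.
Creating one is simple: use the same statement as for an external table, with the external keyword replaced by temporary.

Functional tables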

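Partitioned tables

Suppose the up_web_log table is stored as follows:

/user/hive/warehouse/up_web_log/
            web_log_2017-03-09.log
            web_log_2017-03-10.log
            web_log_2017-03-11.log
            web_log_2017-03-12.log
            ....
            web_log_2018-03-12.log

The layout is explained as follows:

The table stores web logs. In Hive, a table is a directory in HDFS. Web logs are collected per day, so at the end of
each day the log data is loaded into Hive, which is why the up_web_log directory holds many files. To Hive, all of
these logs belong to the single table up_web_log, and the table obviously keeps growing over time.

Problem with this table:

It is one big table holding many files. To look at the data for a single day,
the only option is to define a date column in the table (say dt) and filter on it in the query, for example where dt="2017-03-12".
This produces the right result, but the query has to read every file under the table, so a large amount of
irrelevant data is loaded and the HQL runs slowly.

How can the table be optimized?

A Hive table is really a managed folder: the table maps to a directory on HDFS.
One or more levels of subdirectories can therefore be created under the table directory
to split the big table, with each subdirectory named after some identifying feature, for example datadate=2017-03-09...
A query for one day then only needs to locate that subdirectory and load the data under it,
much like loading a table. The big table is no longer read in full; only part of it is loaded,
which reduces the data volume and makes the HQL run faster.

This technique is called partitioning, and such a table is called a partitioned table. Each subdirectory is called a partition of the table.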
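
The effect of partitioning on how many files a query must touch can be sketched in Python. This is an illustration only, not how Hive is implemented: a local temporary directory stands in for HDFS, and the helper names (build_partitioned_table, files_for_full_scan, files_for_partition) are made up for this example.

```python
import tempfile
from pathlib import Path


def build_partitioned_table(root, days):
    """Create a toy table directory with one datadate=<day> subdirectory per day."""
    table_dir = root / "up_web_log"
    for day in days:
        part_dir = table_dir / f"datadate={day}"
        part_dir.mkdir(parents=True)
        (part_dir / f"web_log_{day}.log").write_text(f"log lines for {day}\n")
    return table_dir


def files_for_full_scan(table_dir):
    """Without partition pruning, every file under the table is read."""
    return sorted(p.name for p in table_dir.rglob("*.log"))


def files_for_partition(table_dir, field, value):
    """With a partition filter, only one subdirectory is read."""
    part_dir = table_dir / f"{field}={value}"
    return sorted(p.name for p in part_dir.glob("*.log"))


DAYS = ["2017-03-09", "2017-03-10", "2017-03-11", "2017-03-12"]
with tempfile.TemporaryDirectory() as tmp:
    table = build_partitioned_table(Path(tmp), DAYS)
    full_scan = files_for_full_scan(table)
    pruned = files_for_partition(table, "datadate", "2017-03-09")

print(len(full_scan), pruned)
```

Here the full scan touches four files while the lookup for one day touches a single file; with a year of daily logs the gap grows accordingly.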

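A partition is made up as follows:

A partition consists of two parts, the partition column and a concrete partition value, joined by "=". Within the table, the partition column
behaves like an ordinary column; to query the data of one partition, use a filter such as where datadate="2017-03-09".
The storage layout of the table in HDFS is:
/user/hive/warehouse/up_web_log/
            /datadate=2017-03-09
                web_log_2017-03-09.log
            /datadate=2017-03-10
                web_log_2017-03-10.log
            /datadate=2017-03-11
                web_log_2017-03-11.log
            /datadate=2017-03-12
                web_log_2017-03-12.log
            ....
            /datadate=2018-03-12
                web_log_2018-03-12.log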

Create a partitioned table:

create table t7_partition (
    id int
) partitioned by (dt date comment "date partition field");

load data local inpath '/opt/data/hive/hive-t6.txt' into table t7_partition;
FAILED: SemanticException [Error 10062]: Need to specify partition columns because the destination table is partitioned
Data cannot be loaded into a partitioned table directly; the target partition, that is the subdirectory, must be specified before loading.

DDL for partitioned tables:

Add a partition:
    alter table t7_partition add partition(dt="2017-03-10");
List partitions:
    show partitions t7_partition;
Drop a partition:
    alter table t7_partition drop partition(dt="2017-03-10");

Add data:

Load data into a specific partition:
    load data local inpath '/opt/data/hive/hive-t6.txt' into table t7_partition partition (dt="2017-03-10");
Loading this way creates the partition automatically if it does not exist.

Multiple partition columns:

For example, to track a school's enrollment and employment figures per year and per subject:
create table t7_partition_1 (
    id int
) partitioned by (year int, school string);

Add data:
    load data local inpath '/opt/data/hive/hive-t6.txt' into table t7_partition_1 partition(year=2015, school='python');
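
With several partition columns, each column adds one directory level in the order the columns were declared. The sketch below is a plain Python illustration of that naming scheme; the warehouse path and the helpers partition_path and parse_partition_path are assumptions made for the example and are not part of Hive.

```python
from pathlib import PurePosixPath


def partition_path(table_dir, spec):
    """Build the directory for a partition; spec is an ordered list of (column, value)."""
    path = PurePosixPath(table_dir)
    for column, value in spec:
        path = path / f"{column}={value}"
    return str(path)


def parse_partition_path(table_dir, path):
    """Recover the (column, value) pairs from a partition directory (values come back as strings)."""
    relative = PurePosixPath(path).relative_to(table_dir)
    return [tuple(part.split("=", 1)) for part in relative.parts]


# Hypothetical table location, chosen only for illustration.
TABLE_DIR = "/user/hive/warehouse/t7_partition_1"
path = partition_path(TABLE_DIR, [("year", 2015), ("school", "python")])
spec = parse_partition_path(TABLE_DIR, path)

print(path)
print(spec)
```

The order of the pairs matters: year is the parent level and school the child, matching the order in the partitioned by clause.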

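Bucketed tables

Problem with partitioned tables:

Some partitions may end up very large while others are very small, so query cost becomes uneven, which is not what we want.
A technique is needed to spread the data more evenly; this is called bucketing, and a table organized this way is called a bucketed table.

Create a bucketed table:

create table t8_bucket(
    id int
) clustered by(id) into 3 buckets;

Add data to a bucketed table:

Data can only be populated from another table with insert ... select; the load approach shown above cannot be used, because it does not split the data into buckets.
insert into t8_bucket select * from t7_partition_1 where year=2016 and school="mysql";
FAILED: SemanticException [Error 10044]: Line 1:12 Cannot insert into target table because column number/types are different 't8_bucket': 
Table insclause-0 has 1 columns, but query has 3 columns.
The bucketed table has only one column, but the partitioned table has three (id plus the two partition columns), so when importing with insert into,
the number of columns on both sides must match.
insert into t8_bucket select id from t7_partition_1 where year=2016 and school="mysql";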

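After the insert, check the data in the table:
> select * from t8_bucket;
OK
6
3
4
1
5
2
Time taken: 0.08 seconds, Fetched: 6 row(s)
The rows come back in a different order than before, because the data was split into 3 parts. Buckets are assigned by hashing the value and taking it modulo the number of buckets:
6%3 = 0, 3%3 = 0: first bucket
4%3 = 1, 1%3 = 1: second bucket
5%3 = 2, 2%3 = 2: third bucket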
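
The bucket assignment above can be reproduced with a short Python sketch. It assumes the classic behaviour in which the hash of an int column is the int itself and the bucket is the non-negative hash modulo the bucket count; newer Hive versions may use a different hash function, so treat this as a model of the arithmetic rather than of Hive internals. The function names are made up for the example.

```python
def bucket_for(value, num_buckets):
    """Bucket index for an int value, assuming hash(int) == int (an assumption, see lead-in)."""
    return (value & 0x7FFFFFFF) % num_buckets


def assign_buckets(values, num_buckets):
    """Group values by bucket index, keeping input order inside each bucket."""
    buckets = [[] for _ in range(num_buckets)]
    for value in values:
        buckets[bucket_for(value, num_buckets)].append(value)
    return buckets


buckets = assign_buckets([1, 2, 3, 4, 5, 6], 3)
for index, members in enumerate(buckets):
    print(index, members)
```

The ids 3 and 6 land in bucket 0, 1 and 4 in bucket 1, and 2 and 5 in bucket 2. The order of rows inside each bucket file depends on how the job wrote them, so the sketch makes no claim about it.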

Check the layout of table t8_bucket in HDFS:
hive (mydb1)> dfs -ls /user/hive/warehouse/mydb1.db/t8_bucket;
Found 3 items
-rwxr-xr-x   3 uplooking supergroup          4 2018-03-09 23:00 /user/hive/warehouse/mydb1.db/t8_bucket/000000_0
-rwxr-xr-x   3 uplooking supergroup          4 2018-03-09 23:00 /user/hive/warehouse/mydb1.db/t8_bucket/000001_0
-rwxr-xr-x   3 uplooking supergroup          4 2018-03-09 23:00 /user/hive/warehouse/mydb1.db/t8_bucket/000002_0
As shown, the data is stored in 3 separate files under the t8_bucket directory, one per bucket.

Note: local mode does not take effect when operating on a bucketed table.

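Loading and exporting data

[] ==> optional, <> ==> required

Loading

load

load data [local] inpath 'path' [overwrite] into table tblName [partition (partition_spec)];
local:
    present ==> load from the local Linux file system
    absent ==> load from HDFS, which amounts to an mv operation ("absent" means the local keyword is omitted, not that the file is missing locally)
overwrite:
    present ==> replace the existing data in the table
    absent ==> append to the existing data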
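
The four combinations of local and overwrite can be modelled with ordinary file operations. The sketch below is a rough Python analogy under stated assumptions: local directories stand in for both the Linux file system and HDFS, load_data is a hypothetical helper, and name collisions inside the table directory are ignored.

```python
import shutil
import tempfile
from pathlib import Path


def load_data(src, table_dir, local=False, overwrite=False):
    """Toy model of load data [local] inpath ... [overwrite] into table ..."""
    table_dir.mkdir(parents=True, exist_ok=True)
    if overwrite:
        # overwrite: existing files of the table are removed first.
        for old_file in list(table_dir.iterdir()):
            old_file.unlink()
    target = table_dir / src.name
    if local:
        shutil.copy(src, target)            # local: the source file is copied and stays in place
    else:
        shutil.move(str(src), str(target))  # no local: the source file is moved, like mv


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    table_dir = root / "warehouse" / "t_demo"
    local_file = root / "local_a.txt"
    hdfs_file = root / "hdfs_b.txt"
    fresh_file = root / "local_c.txt"
    for f in (local_file, hdfs_file, fresh_file):
        f.write_text("1\n2\n3\n")

    load_data(local_file, table_dir, local=True)
    local_source_kept = local_file.exists()

    load_data(hdfs_file, table_dir, local=False)
    hdfs_source_kept = hdfs_file.exists()
    files_after_append = sorted(p.name for p in table_dir.iterdir())

    load_data(fresh_file, table_dir, local=True, overwrite=True)
    files_after_overwrite = sorted(p.name for p in table_dir.iterdir())

print(local_source_kept, hdfs_source_kept)
print(files_after_append, files_after_overwrite)
```

The point to take away is that a load without local removes the file from its original HDFS path, which can surprise anyone who expects the source to remain.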

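Loading from another table

insert <overwrite|into> [table (required when the preceding keyword is overwrite)] t_des select [...] from t_src [...];
    overwrite ==> replace the existing data in the table
    into ==> append to the existing data
==> the statement is translated into a MapReduce job

Note: the columns of t_des must correspond one-to-one with the [...] in select [...] from t_src.
When overwrite is chosen, the table keyword must follow it, for example:
insert overwrite table test select * from t8_bucket;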

Loading at table creation

create table t_des as select [...] from t_src [...];

This creates a table whose schema is given by the [...] in select [...] from t_src.
e.g. create temporary table tmp as select distinct(id) from t8_bucket;

Loading with dynamic partitions

Quickly copy a table's schema:

create table t_d_partition like t_partition_1;

hive (default)> show partitions t_partition_1;
OK
partition
year=2015/class=bigdata
year=2015/class=linux
year=2016/class=bigdata
year=2016/class=linux

To load all of the 2016 data into the matching partitions of t_d_partition:

insert into table t_d_partition partition(year=2016, class) select id, name, class from t_partition_1 where year=2016;

To load all of the data in t_partition_1 into the matching partitions of t_d_partition:

insert overwrite table t_d_partition partition(year, class) select id, name, year, class from t_partition_1;

(The message I got when running this:
            FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
            )
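
In a dynamic-partition insert, the trailing columns of the select list supply the partition values, and each row is routed to the partition those values name. The Python sketch below models only that routing; the rows are invented sample data and route_rows is a hypothetical helper, not a Hive function.

```python
from collections import defaultdict


def route_rows(rows, data_columns, partition_columns):
    """Split each row into data and trailing partition values, then group by partition."""
    width = len(data_columns)
    partitions = defaultdict(list)
    for row in rows:
        data, part_values = row[:width], row[width:]
        key = "/".join(f"{c}={v}" for c, v in zip(partition_columns, part_values))
        partitions[key].append(data)
    return dict(partitions)


# Invented rows in the shape (id, name, year, class).
rows = [
    (1, "a", 2015, "bigdata"),
    (2, "b", 2015, "linux"),
    (3, "c", 2016, "bigdata"),
    (4, "d", 2016, "linux"),
    (5, "e", 2016, "linux"),
]
routed = route_rows(rows, ["id", "name"], ["year", "class"])

for partition in sorted(routed):
    print(partition, routed[partition])
```

This is why the partition columns must come last in the select list and in the same order as in the partition clause: the mapping is positional.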

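Other issues:

Deleting data on HDFS does not delete the table definition. show partitions t_d_partition; reads from the metastore, so if the data was removed from HDFS by hand, the partition metadata is still there.

insert into t10_p_1 partition(year=2016, class) select * from t_partition_1;
FAILED: SemanticException [Error 10094]: Line 1:30 Dynamic partition cannot be the parent of a static partition 'professional'
A dynamic partition cannot be the parent directory of a static partition; in the partition clause, static partition columns must come before dynamic ones.
To use only dynamic partition columns, hive.exec.dynamic.partition.mode must be set to nonstrict. A related property limits the total number of dynamic partitions:
<property>
    <name>hive.exec.max.dynamic.partitions</name>
    <value>1000</value>
    <description>Maximum number of dynamic partitions allowed to be created in total.</description>
</property>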
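
The two rules behind errors 10096 and 10094 can be written down as a small check. This Python sketch is a paraphrase of the rules, not Hive's own validation code: check_partition_clause is hypothetical, its return strings are not Hive's messages, and a value of None is used here to mark a dynamic partition column.

```python
def check_partition_clause(spec, mode="strict"):
    """spec is an ordered list of (column, value); value None marks a dynamic column."""
    seen_dynamic = False
    for _column, value in spec:
        if value is None:
            seen_dynamic = True
        elif seen_dynamic:
            # A static column after a dynamic one would make the dynamic one its parent.
            return "dynamic partition cannot be the parent of a static partition"
    if mode == "strict" and all(value is None for _column, value in spec):
        return "strict mode requires at least one static partition column"
    return "ok"


static_then_dynamic = check_partition_clause([("year", 2016), ("class", None)])
dynamic_then_static = check_partition_clause([("class", None), ("year", 2016)])
all_dynamic_strict = check_partition_clause([("year", None), ("class", None)])
all_dynamic_nonstrict = check_partition_clause([("year", None), ("class", None)], mode="nonstrict")

print(static_then_dynamic)
print(dynamic_then_static)
print(all_dynamic_strict)
print(all_dynamic_nonstrict)
```

Only the first and last calls return "ok": a static column followed by a dynamic one is always accepted, and an all-dynamic clause is accepted once the mode is nonstrict.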

Importing data from HDFS with import:

import table stu from '/data/stu';

In my tests so far, the following errors occur. A likely cause is that import expects a directory previously written by export table, rather than a directory of plain data files:
hive (mydb1)> import table test from '/input/hive/';
FAILED: SemanticException [Error 10027]: Invalid path
hive (mydb1)> import table test from 'hdfs://input/hive/';
FAILED: SemanticException [Error 10324]: Import Semantic Analyzer Error
hive (mydb1)> import table test from 'hdfs://ns1/input/hive/';
FAILED: SemanticException [Error 10027]: Invalid path

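Exporting

1.
hadoop fs -cp src_uri dest_uri
(hdfs dfs -cp src_uri dest_uri)

2.
hive> export table tblName to 'hdfs_uri';
The target HDFS directory must be empty; if it does not exist, it is created automatically.
This method also exports the table metadata.

3.
insert overwrite [local] directory 'linux_fs_path' select ... from ... where ...;
Without local, the data is exported to HDFS; with local, it is exported to the Linux file system.
Either way, the directory is created if it does not exist, and overwritten if it does.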
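
The different target-directory rules of methods 2 and 3 are easy to mix up, so here is a small Python model of them. It is a sketch under assumptions: local directories stand in for HDFS, export_target_ok and insert_overwrite_directory are hypothetical helpers, and the output file name 000000_0 is borrowed from the bucket listing earlier purely for illustration.

```python
import tempfile
from pathlib import Path


def export_target_ok(target):
    """export table: the target must be missing (it gets created) or an empty directory."""
    if not target.exists():
        return True
    return target.is_dir() and not any(target.iterdir())


def insert_overwrite_directory(target, lines):
    """insert overwrite directory: create the directory if needed and replace its contents."""
    target.mkdir(parents=True, exist_ok=True)
    for old_file in list(target.iterdir()):
        old_file.unlink()
    (target / "000000_0").write_text("\n".join(lines) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    missing_ok = export_target_ok(root / "not_there")

    empty_dir = root / "empty"
    empty_dir.mkdir()
    empty_ok = export_target_ok(empty_dir)

    out_dir = root / "out"
    insert_overwrite_directory(out_dir, ["1", "2"])
    nonempty_ok = export_target_ok(out_dir)

    insert_overwrite_directory(out_dir, ["3"])
    final_content = (out_dir / "000000_0").read_text()

print(missing_ok, empty_ok, nonempty_ok)
print(repr(final_content))
```

In this model an export into a directory that already holds files is refused, while insert overwrite directory silently replaces whatever was there, so the latter should be pointed at a dedicated directory.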
