访问hive
pxf读取hive源¶
PXF读取hive数据配置文件描述如下:
从hive中导入数据,可以使用format
指定其存储类型提高导入性能。
-
rcfile
-
orc
-
parquet
-
text
-
avro
- 选项说明
- text和rc格式需要指定delimiter,与hive表相同。
- 支持的数据类型对应表
-
Hive Data Type SeaboxMpp Data Type boolean bool int int4 smallint int2 tinyint int2 bigint int8 float float4 doulbe float8 string text binary bytea timestamp timestamp - 访问方法说明
- 下面所有的例子中format不指定也可以成功创建外部表或者使用COPY导入,指定格式会提高导入的性能。
- Seabox创建目标表
create table test(a int, b text);
- 加载fdw和创建访问hive的server
create extension pxf_fdw;
create server hive_test foreign data wrapper hive_pxf_fdw options (host 'test-3', port '10000');
create user mapping for public server hive_test;
1) 外部表方式
```bash
seaboxsql=# create foreign table hive_text(a int, b text)server hive_test options (resource 't_text', format 'text', delimiter ',');
CREATE FOREIGN TABLE
seaboxsql=# select * from hive_text;
a | b
-----+---------
100 | string1
200 | string2
300 | string3
(3 rows)
```
2) COPY导入
```bash
seaboxsql=# copy test from 'fdw://hive_test/t_text' (format 'text', delimiter ',') ;
COPY 3
seaboxsql=# select * from test;
a | b
-----+---------
100 | string1
200 | string2
300 | string3
(3 rows)
```
- 访问RCFile格式的表
1) 外部表方式
```bash
seaboxsql=# create foreign table hive_rc(a int, b text)server hive_test options (resource 't_rc', format 'rc');
CREATE FOREIGN TABLE
seaboxsql=# select * from hive_rc;
a | b
-----+---------
100 | string1
200 | string2
300 | string3
(3 rows)
```
2) COPY导入
```bash
seaboxsql=# copy test from 'fdw://hive_test/t_rc' (format 'rc');
COPY 3
seaboxsql=# select * from test;
a | b
-----+---------
100 | string1
200 | string2
300 | string3
(3 rows)
```
- 访问Parquet格式的表
1) 外部表方式
```bash
seaboxsql=# create foreign table hive_parquet(a int, b text)server hive_test options(resource 't_parquet', format 'parquet');
CREATE FOREIGN TABLE
seaboxsql=# select * from hive_parquet;
a | b
-----+---------
100 | string1
200 | string2
300 | string3
(3 rows)
```
2) COPY导入
```bash
seaboxsql=# copy test from 'fdw://hive_test/t_parquet' (format 'parquet');
COPY 3
seaboxsql=# select * from test;
a | b
-----+---------
100 | string1
200 | string2
300 | string3
(3 rows)
```
- 访问ORC格式表
1) 外部表方式
```bash
seaboxsql=# create foreign table hive_orc(a int, b text)server hive_test options(resource 't_orc', format 'orc');
CREATE FOREIGN TABLE
seaboxsql=# select * from hive_orc;
a | b
-----+---------
100 | string1
200 | string2
300 | string3
(3 rows)
```
2) COPY导入
```bash
seaboxsql=# copy test from 'fdw://hive_test/t_orc' (format 'orc');
COPY 3
seaboxsql=# select * from test;
a | b
-----+---------
100 | string1
200 | string2
300 | string3
(3 rows)
```
- 访问SequenceFile格式的表
1) 外部表方式
```bash
seaboxsql=# create foreign table hive_sequencefile(a int, b text)server hive_test options(resource 't_sequencefile', format 'sequencefile');
CREATE FOREIGN TABLE
seaboxsql=# select * from hive_sequencefile;
a | b
-----+---------
100 | string1
200 | string2
300 | string3
(3 rows)
```
2) COPY导入
```bash
seaboxsql=# copy test from 'fdw://hive_test/t_sequencefile' (format 'sequencefile');
COPY 3
seaboxsql=# select * from test;
a | b
-----+---------
100 | string1
200 | string2
300 | string3
(3 rows)
```
- 访问Avro格式的表
1) 外部表方式
```bash
seaboxsql=# create foreign table hive_avro(a int, b text)server hive_test options(resource 't_avro'); --- 不支持format指定为avro
CREATE FOREIGN TABLE
seaboxsql=# select * from hive_avro;
a | b
-----+---------
100 | string1
200 | string2
300 | string3
(3 rows)
```
2) COPY导入
```bash
seaboxsql=# copy test from 'fdw://hive_test/t_avro';
COPY 3
seaboxsql=# select * from test;
a | b
-----+---------
100 | string1
200 | string2
300 | string3
(3 rows)
```
- Hive中准备数据
create table t_text(a int, b string)ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' stored as textfile;
create table t_avro(a int, b string)stored as avro;
create table t_sequencefile(a int, b string)stored as sequencefile;
create table t_parquet(a int, b string)stored as parquet;
create table t_rc(a int, b string)stored as rcfile;
create table t_avro(a int, b string)stored as avro;
insert into t_text values(100,'string1'),(200,'string2'),(300, 'string3');
insert into t_avro values(100,'string1'),(200,'string2'),(300, 'string3');
insert into t_sequencefile values(100,'string1'),(200,'string2'),(300, 'string3');
insert into t_parquet values(100,'string1'),(200,'string2'),(300, 'string3');
insert into t_rc values(100,'string1'),(200,'string2'),(300, 'string3');
insert into t_avro values(100,'string1'),(200,'string2'),(300, 'string3');
配置文件注意事项
-
$PXF_CONF/servers/default
或者对应server name下需要包含hive-site.xml -
$PXF_CONF/lib
下需要包含连接元数据库需要的jar包(test-3上的测试环境使用本地元存储,不需要此项配置,如果使用derby,mysql等元数据存储,需要javax.jdo以及对应元数据库的jdbc jar包)