跳转至

访问hive

pxf读取hive源

PXF读取hive数据配置文件描述如下:

从hive中导入数据,可以使用format指定其存储类型提高导入性能。

  • rcfile

  • orc

  • parquet

  • text

  • avro

选项说明
text和rc格式需要指定delimiter,与hive表相同。
支持的数据类型对应表
Hive Data Type SeaboxMpp Data Type
boolean bool
int int4
smallint int2
tinyint int2
bigint int8
float float4
doulbe float8
string text
binary bytea
timestamp timestamp
访问方法说明
下面所有的例子中format不指定也可以成功创建外部表或者使用COPY导入,指定格式会提高导入的性能。
  1. Seabox创建目标表
create table test(a int, b text);
  1. 加载fdw和创建访问hive的server

create extension pxf_fdw;
create server hive_test foreign data wrapper hive_pxf_fdw options (host 'test-3',    port '10000');
create user mapping for public server hive_test;
3. 访问文本格式的表

1) 外部表方式

  ```bash
  seaboxsql=# create foreign table hive_text(a int, b text)server hive_test options      (resource 't_text', format 'text', delimiter ',');
  CREATE FOREIGN TABLE
  seaboxsql=# select * from hive_text;
    a  |    b    
  -----+---------
   100 | string1
   200 | string2
   300 | string3
  (3 rows)
  ```

2) COPY导入

  ```bash
  seaboxsql=# copy test from 'fdw://hive_test/t_text' (format 'text', delimiter ',')      ;
  COPY 3
  seaboxsql=# select * from test;
    a  |    b    
  -----+---------
   100 | string1
   200 | string2
   300 | string3
  (3 rows)
  ```
  1. 访问RCFile格式的表

1) 外部表方式

  ```bash
  seaboxsql=# create foreign table hive_rc(a int, b text)server hive_test options      (resource 't_rc', format 'rc');
  CREATE FOREIGN TABLE
  seaboxsql=# select * from hive_rc;
    a  |    b    
  -----+---------
   100 | string1
   200 | string2
   300 | string3
  (3 rows)
  ```

2) COPY导入

  ```bash
  seaboxsql=# copy test from 'fdw://hive_test/t_rc' (format 'rc');
  COPY 3
  seaboxsql=# select * from test;
    a  |    b    
  -----+---------
   100 | string1
   200 | string2
   300 | string3
  (3 rows)
  
  ```
  1. 访问Parquet格式的表

1) 外部表方式

  ```bash
  seaboxsql=# create foreign table hive_parquet(a int, b text)server hive_test       options(resource 't_parquet', format 'parquet');
  CREATE FOREIGN TABLE
  seaboxsql=# select * from hive_parquet;
    a  |    b    
  -----+---------
   100 | string1
   200 | string2
   300 | string3
  (3 rows)
  ```

2) COPY导入

  ```bash
  seaboxsql=# copy test from 'fdw://hive_test/t_parquet' (format 'parquet');
  COPY 3
  seaboxsql=# select * from test;
    a  |    b    
  -----+---------
   100 | string1
   200 | string2
   300 | string3
  (3 rows)
  
  ```
  1. 访问ORC格式表

1) 外部表方式

  ```bash
  seaboxsql=# create foreign table hive_orc(a int, b text)server hive_test options(resource 't_orc', format 'orc');
  CREATE FOREIGN TABLE
  seaboxsql=# select * from hive_orc;
    a  |    b    
  -----+---------
   100 | string1
   200 | string2
   300 | string3
  (3 rows)
  ```

2) COPY导入

  ```bash
  seaboxsql=# copy test from 'fdw://hive_test/t_orc' (format 'orc');
  COPY 3
  seaboxsql=# select * from test;
    a  |    b    
  -----+---------
   100 | string1
   200 | string2
   300 | string3
  (3 rows)
  
  ```
  1. 访问SequenceFile格式的表

1) 外部表方式

  ```bash
  seaboxsql=# create foreign table hive_sequencefile(a int, b text)server hive_test options(resource 't_sequencefile', format 'sequencefile');
  CREATE FOREIGN TABLE
  seaboxsql=# select * from hive_sequencefile;
    a  |    b    
  -----+---------
   100 | string1
   200 | string2
   300 | string3
  (3 rows)
  ```

2) COPY导入

  ```bash
  seaboxsql=# copy test from 'fdw://hive_test/t_sequencefile' (format 'sequencefile');
  COPY 3
  seaboxsql=# select * from test;
    a  |    b    
  -----+---------
   100 | string1
   200 | string2
   300 | string3
  (3 rows)
  
  ```
  1. 访问Avro格式的表

1) 外部表方式

  ```bash
  seaboxsql=# create foreign table hive_avro(a int, b text)server hive_test options(resource 't_avro');   --- 不支持format指定为avro
  CREATE FOREIGN TABLE
  seaboxsql=# select * from hive_avro;
    a  |    b    
  -----+---------
   100 | string1
   200 | string2
   300 | string3
  (3 rows)
  ```

2) COPY导入

  ```bash
  seaboxsql=# copy test from 'fdw://hive_test/t_avro';
  COPY 3
  seaboxsql=# select * from test;
    a  |    b    
  -----+---------
   100 | string1
   200 | string2
   300 | string3
  (3 rows)
  
  ```
  1. Hive中准备数据
create table t_text(a int, b string)ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' stored as textfile;
create table t_avro(a int, b string)stored as avro;
create table t_sequencefile(a int, b string)stored as sequencefile;
create table t_parquet(a int, b string)stored as parquet;
create table t_rc(a int, b string)stored as rcfile;
create table t_avro(a int, b string)stored as avro;

insert into t_text values(100,'string1'),(200,'string2'),(300, 'string3');
insert into t_avro values(100,'string1'),(200,'string2'),(300, 'string3');
insert into t_sequencefile values(100,'string1'),(200,'string2'),(300, 'string3');
insert into t_parquet values(100,'string1'),(200,'string2'),(300, 'string3');
insert into t_rc values(100,'string1'),(200,'string2'),(300, 'string3');
insert into t_avro values(100,'string1'),(200,'string2'),(300, 'string3');

配置文件注意事项

  1. $PXF_CONF/servers/default或者对应server name下需要包含hive-site.xml

  2. $PXF_CONF/lib下需要包含连接元数据库需要的jar包(test-3上的测试环境使用本地元存储,不需要此项配置,如果使用derby,mysql等元数据存储,需要javax.jdo以及对应元数据库的jdbc jar包)