ClickHouse Data Migration - Tianyi Cloud (天翼云) Developer Community

### I. clickhouse-copier
#### 1. What is it?
clickhouse-copier is the official tool for copying data between clusters. It relies on ZooKeeper (zk) to coordinate cross-cluster copy jobs.

#### 2. How to use it
Suppose we want to copy table_dis from cluster1 [IP1, IP2, IP3] to cluster2 [IP4, IP5]. table_dis is a Distributed table; its underlying MergeTree table is table_local.

(1) zk.xml
``` XML
<yandex>
    <logger>
        <level>trace</level>
        <size>100M</size>
        <count>3</count>
    </logger>

    <zookeeper>
        <node index="1">
            <host>[ZK-IP]</host>
            <port>[ZK-PORT]</port>
        </node>
    </zookeeper>
</yandex>
```
Create the zk.xml file; the copier uses it when running the copy.
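Before launching the copier, it can help to confirm that the file parses and lists the ZooKeeper endpoints you expect. The following is a minimal pre-flight sketch and not part of clickhouse-copier; the function name `zk_endpoints`, the sample host `zk1.example`, and port `2181` are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Sample content mirroring zk.xml; host and port are placeholder values.
SAMPLE_ZK_XML = """
<yandex>
  <zookeeper>
    <node index="1">
      <host>zk1.example</host>
      <port>2181</port>
    </node>
  </zookeeper>
</yandex>
"""

def zk_endpoints(xml_text):
    """Return (host, port) pairs for every <node> under <zookeeper>."""
    root = ET.fromstring(xml_text)  # raises ParseError if the XML is malformed
    endpoints = []
    for node in root.findall("./zookeeper/node"):
        host = node.findtext("host")
        port = node.findtext("port")
        if not host or not port:
            raise ValueError("each <node> needs both <host> and <port>")
        endpoints.append((host.strip(), int(port)))
    if not endpoints:
        raise ValueError("no <zookeeper><node> entries found")
    return endpoints

print(zk_endpoints(SAMPLE_ZK_XML))
```

In practice you would read the text from the real zk.xml file instead of the embedded sample.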

(2) schema.xml
Replace the placeholders in square brackets with values for your environment.
``` XML
<yandex>
    <!-- Cluster definitions used by the copy task -->
    <remote_servers>
        <source_cluster>
            <shard>
                <internal_replication>false</internal_replication>
                <replica>
                    <host>[IP1]</host>
                    <port>[TCP PORT]</port>
                </replica>
            </shard>
        </source_cluster>
        <destination_cluster>
            <shard>
                <internal_replication>false</internal_replication>
                <replica>
                    <host>[IP4]</host>
                    <port>[TCP PORT]</port>
                </replica>
            </shard>
            <shard>
                <internal_replication>false</internal_replication>
                <replica>
                    <host>[IP5]</host>
                    <port>[TCP PORT]</port>
                </replica>
            </shard>
        </destination_cluster>
    </remote_servers>

    <!-- Number of copier workers allowed to run at the same time -->
    <max_workers>2</max_workers>

    <!-- Settings applied when reading from the source -->
    <settings_pull>
        <readonly>1</readonly>
    </settings_pull>
    <!-- Settings applied when writing to the destination -->
    <settings_push>
        <readonly>0</readonly>
    </settings_push>
    <!-- Settings shared by both directions -->
    <settings>
        <connect_timeout>3</connect_timeout>
        <insert_distributed_sync>1</insert_distributed_sync>
    </settings>

    <tables>
        <table_hits>
            <!-- Source: the Distributed table on IP1 -->
            <cluster_pull>source_cluster</cluster_pull>
            <database_pull>[DATABASE]</database_pull>
            <table_pull>[table_dis]</table_pull>

            <!-- Destination: the local table on each node of cluster2 -->
            <cluster_push>destination_cluster</cluster_push>
            <database_push>[DATABASE]</database_push>
            <table_push>[table_local]</table_push>

            <sharding_key>rand()</sharding_key>
            <engine>[ENGINE SYNTAX]</engine>
        </table_hits>
    </tables>
</yandex>
```

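* [IP1] & [table_dis]: we run the clickhouse-copier program on IP1, so IP1 is the host specified. Because source_cluster lists only one server, the copier queries the Distributed table directly. An alternative is to list all of IP1, IP2 and IP3 in source_cluster and use table_local in the table_pull tag.
* destination_cluster lists all nodes of cluster2, and table_push points at the local table, so the rows read from the Distributed table on IP1 are spread across the nodes of cluster2.
* [ENGINE SYNTAX]: in our tests we had to create the required tables manually on every node of the destination cluster, and the engine still had to be specified here. ***This looks like a bug.***
* The remaining parameters can be adjusted as needed according to their descriptions, for example to increase parallelism.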
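The alternative layout from the first bullet (every source node listed, pulling from table_local) is tedious to write by hand. Below is a sketch of generating the `<remote_servers>` part with Python's standard library. The helper name `build_cluster`, the `10.0.0.x` addresses, and port `9000` are illustrative assumptions, not values from this setup.

```python
import xml.etree.ElementTree as ET

def build_cluster(name, hosts, port):
    """Build a <name> element with one single-replica <shard> per host."""
    cluster = ET.Element(name)
    for host in hosts:
        shard = ET.SubElement(cluster, "shard")
        ET.SubElement(shard, "internal_replication").text = "false"
        replica = ET.SubElement(shard, "replica")
        ET.SubElement(replica, "host").text = host
        ET.SubElement(replica, "port").text = str(port)
    return cluster

# Placeholder addresses standing in for IP1..IP3 (source) and IP4..IP5 (destination).
remote_servers = ET.Element("remote_servers")
remote_servers.append(build_cluster("source_cluster", ["10.0.0.1", "10.0.0.2", "10.0.0.3"], 9000))
remote_servers.append(build_cluster("destination_cluster", ["10.0.0.4", "10.0.0.5"], 9000))

if hasattr(ET, "indent"):  # pretty-printing helper, available from Python 3.9
    ET.indent(remote_servers)
print(ET.tostring(remote_servers, encoding="unicode"))
```

The printed fragment can be pasted into schema.xml in place of the hand-written `<remote_servers>` section.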

#### 3. Execute
Run on IP1:

clickhouse-copier copier --daemon --config zk.xml --task-path /[ZK-CLICKHOUSE-PATH/NODE] --base-dir /[PATH] --task-upload-force true --task-file schema.xml

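| Parameter | Meaning |
| --- | --- |
| --daemon | Run in the background |
| /ZK-CLICKHOUSE-PATH/NODE | The node used in zk. ZK-CLICKHOUSE-PATH must be created manually beforehand; NODE need not be |
| PATH | A local path used for writing logs |
| --task-upload-force | Whether to upload schema.xml to the zk node on every run; defaults to false |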
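If the copy is launched from a script, the command can be assembled as an argument list so that paths with spaces are handled safely. This sketch only builds and prints the command; it does not run it. The function name `copier_command` and the example paths `/clickhouse/copier_task` and `/tmp/copier` are hypothetical.

```python
import shlex

def copier_command(task_path, base_dir, config="zk.xml", task_file="schema.xml"):
    """Return the argument list for the copier invocation (does not run it)."""
    return [
        "clickhouse-copier", "copier",
        "--daemon",
        "--config", config,
        "--task-path", task_path,
        "--base-dir", base_dir,
        "--task-upload-force", "true",
        "--task-file", task_file,
    ]

# Hypothetical zk node and local directory, for illustration only.
args = copier_command("/clickhouse/copier_task", "/tmp/copier")
print(shlex.join(args))
```

The resulting list could be passed to `subprocess.run(args, check=True)` on IP1.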

#### 4. Performance
In our test, copying 10 million rows from a 3-node cluster into a 2-node cluster took about 30 s.
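As a rough back-of-the-envelope reading of that figure, the throughput works out as follows. Row size and hardware are not stated, so the number is only indicative, and the per-shard value assumes rows are spread evenly across the two destination nodes.

```python
rows = 10_000_000   # rows copied in the test
seconds = 30        # approximate elapsed time
dest_shards = 2     # nodes in the destination cluster

rows_per_second = rows / seconds
print(f"{rows_per_second:,.0f} rows/s overall")
print(f"{rows_per_second / dest_shards:,.0f} rows/s per destination shard, if spread evenly")
```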

### II. remote table
```
-- Run on the destination node; replace the bracketed placeholders.
INSERT INTO [DATABASE].[TABLE] SELECT * FROM remote('[REMOTE-IP]', '[DATABASE].[ORIGIN-TABLE]', '[USER]', '[PASSWORD]')
```
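When several tables need to be migrated this way, the statement can be rendered from parameters. The sketch below only builds the SQL string and leaves execution to whichever client you use. The function name `remote_insert_sql` and all example values are assumptions; rather than attempting escaping, it rejects any value that would need it.

```python
import re

_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def _literal(value):
    """Quote a value as a SQL string literal; refuse characters that need escaping."""
    if "'" in value or "\\" in value:
        raise ValueError(f"unsupported character in {value!r}")
    return f"'{value}'"

def remote_insert_sql(dest_db, dest_table, src_addr, src_db, src_table, user, password):
    """Render an INSERT ... SELECT ... FROM remote(...) statement."""
    for ident in (dest_db, dest_table, src_db, src_table):
        if not _IDENT.match(ident):
            raise ValueError(f"not a plain identifier: {ident!r}")
    return (
        f"INSERT INTO {dest_db}.{dest_table} "
        f"SELECT * FROM remote({_literal(src_addr)}, {_literal(src_db + '.' + src_table)}, "
        f"{_literal(user)}, {_literal(password)})"
    )

# Placeholder address and credentials, for illustration only.
sql = remote_insert_sql("db", "table_local", "10.0.0.1", "db", "table_dis", "default", "secret")
print(sql)
```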
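### III. Comparison

* The remote table approach is simpler; the copier approach depends on zk and takes more configuration.
* The copier approach supports copying with multiple processes, which can improve throughput.
* The copier can write rows from the origin table to multiple nodes, for example evenly or by hash; the remote table approach is strictly one-to-one.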
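To illustrate the difference between an even (`rand()`) and a hash-based sharding key, here is a toy simulation over two destination shards. It is not how ClickHouse implements sharding internally: it assumes equal shard weights and `shard = key % number_of_shards`, and uses `zlib.crc32` as a stand-in hash. The key format `user_N` is made up for the example.

```python
import random
import zlib
from collections import Counter

NUM_SHARDS = 2  # cluster2 has two nodes: IP4 and IP5

def shard_by_rand(rng):
    """Pick a shard from a random 32-bit value, like a rand() sharding key."""
    return rng.randrange(2**32) % NUM_SHARDS

def shard_by_hash(row_key):
    """Pick a shard from a hash of the row key; the same key always maps to the same shard."""
    return zlib.crc32(row_key.encode("utf-8")) % NUM_SHARDS

rng = random.Random(0)  # fixed seed so the run is repeatable
keys = [f"user_{i % 1000}" for i in range(100_000)]

rand_counts = Counter(shard_by_rand(rng) for _ in keys)
hash_counts = Counter(shard_by_hash(k) for k in keys)

print("rand():", dict(rand_counts))
print("hash  :", dict(hash_counts))
```

With a random key the rows split roughly in half regardless of content; with a hash key all rows sharing a key land on the same node, which matters if later queries rely on data locality.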
