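When I was learning Spark, I used groupBy on RDDs and Datasets all the time; in Flink, surprisingly, it shows up much less often.
Its place is taken by keyBy.
As the name suggests, keyBy partitions records by key: conceptually, the key's hashCode modulo the number of partitions. (Internally Flink does not apply a plain modulo; it hashes the key into a key group and then maps key groups to parallel subtasks.)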
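To make the "hash modulo partition count" idea concrete, here is a minimal sketch in plain Java. It does not use Flink at all: the class `KeyPartitionSketch` and the helper `partitionFor` are hypothetical names invented for illustration, and the rule shown is the simplified one described above, not Flink's actual key-group assignment.

```java
// Simplified illustration only: NOT Flink's real assignment logic.
class KeyPartitionSketch {

    // Hypothetical helper: map a key to a partition index in [0, parallelism).
    static int partitionFor(Object key, int parallelism) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4;
        String[] keys = {"user-1", "user-2", "user-1", "user-3", "user-2"};
        for (String key : keys) {
            // Records with the same key always land in the same partition.
            System.out.println(key + " -> partition " + partitionFor(key, parallelism));
        }
    }
}
```

The property that matters is determinism: equal keys always map to the same partition, which is what lets a downstream operator keep per-key state.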
For instance, if we know that the load of the parallel partitions of a DataStream is skewed, we might want to rebalance the data to evenly distribute the computation load of subsequent operators. Alternatively, the application logic might require that all tasks of an operation receive the same data or that events be distributed following a custom strategy. In this section, we present DataStream methods that enable users to control partitioning strategies or define their own.
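The passage above is quoted from a book on Flink. In other words: if we know that the data across the parallel partitions of a DataStream is skewed, we may want to rebalance it so that the load on downstream operators is spread evenly.
Alternatively, the application logic may require that every task of an operator receive the same data, or that events be distributed according to a custom strategy. For these cases the DataStream API provides methods that let users control the partitioning strategy or define their own.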
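keyBy: unlike the three below, it produces a KeyedStream rather than a plain DataStream; its partitioning strategy is hash-based.
shuffle: the partitioning strategy is random.
rebalance: the partitioning strategy is round-robin.
rescale: similar to rebalance (also round-robin), but each sending task distributes only to a subset of the downstream tasks rather than to all of them.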
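The difference between the random and round-robin strategies can be sketched with two tiny channel selectors in plain Java. The names `ChannelSelectorSketch`, `RoundRobinSelector`, and `RandomSelector` are hypothetical and exist only for this illustration; they are not Flink classes. As a simplifying assumption, the round-robin selector here always starts at channel 0.

```java
import java.util.Random;

// Illustrative selectors only; these are not Flink's partitioner classes.
class ChannelSelectorSketch {

    // Round-robin (the idea behind rebalance): cycle through channels in order.
    static class RoundRobinSelector {
        private final int numChannels;
        private int next = 0; // assumption: start at channel 0

        RoundRobinSelector(int numChannels) {
            this.numChannels = numChannels;
        }

        int select() {
            int chosen = next;
            next = (next + 1) % numChannels;
            return chosen;
        }
    }

    // Random (the idea behind shuffle): pick a channel uniformly at random.
    static class RandomSelector {
        private final int numChannels;
        private final Random random;

        RandomSelector(int numChannels, long seed) {
            this.numChannels = numChannels;
            this.random = new Random(seed);
        }

        int select() {
            return random.nextInt(numChannels);
        }
    }

    public static void main(String[] args) {
        RoundRobinSelector roundRobin = new RoundRobinSelector(3);
        RandomSelector shuffle = new RandomSelector(3, 42L);
        for (int i = 0; i < 6; i++) {
            System.out.println("record " + i
                    + ": round-robin -> " + roundRobin.select()
                    + ", random -> " + shuffle.select());
        }
    }
}
```

Round-robin gives each channel exactly the same number of records over a full cycle, whereas random selection is only even on average.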
rebalance() will create communication channels between all sending tasks to all receiving tasks,
rescale() will only create channels from each task to some of the tasks of the downstream operator.
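The channel difference described in the quote can be sketched by listing which receiving tasks each sending task is connected to. This is a plain Java model, not Flink code: `ChannelSketch`, `rebalanceTargets`, and `rescaleTargets` are hypothetical names, and the sketch assumes that one parallelism is an exact multiple of the other. Which particular receivers a sender is paired with is also an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of sender-to-receiver channels; not Flink code.
class ChannelSketch {

    // rebalance: every sender has a channel to every receiver.
    static List<Integer> rebalanceTargets(int receivers) {
        List<Integer> targets = new ArrayList<>();
        for (int r = 0; r < receivers; r++) {
            targets.add(r);
        }
        return targets;
    }

    // rescale: each sender has channels to only a subset of receivers.
    // Assumption: one of senders/receivers is an exact multiple of the other.
    static List<Integer> rescaleTargets(int sender, int senders, int receivers) {
        List<Integer> targets = new ArrayList<>();
        if (receivers >= senders) {
            // Each sender gets its own contiguous block of receivers.
            int perSender = receivers / senders;
            for (int i = 0; i < perSender; i++) {
                targets.add(sender * perSender + i);
            }
        } else {
            // Several senders share a single receiver.
            int sendersPerReceiver = senders / receivers;
            targets.add(sender / sendersPerReceiver);
        }
        return targets;
    }

    public static void main(String[] args) {
        int senders = 2;
        int receivers = 4;
        for (int s = 0; s < senders; s++) {
            System.out.println("sender " + s
                    + ": rebalance -> " + rebalanceTargets(receivers)
                    + ", rescale -> " + rescaleTargets(s, senders, receivers));
        }
    }
}
```

With 2 senders and 4 receivers, rebalance needs 8 channels in this model while rescale needs only 4, which is why rescale can be cheaper when an even global distribution is not required.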
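This is the diagram for rescale: before the records are distributed, the tasks are effectively divided into groups, somewhat like a grouping or keyBy step.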