Problem: there are 11 DataFrames, the largest holding about 1,000,000 rows, and all 11 need to be written into 11 worksheets of the same Excel workbook.
The journey to a solution:
1. Use pd.ExcelWriter; the default engine (xlsxwriter) was switched to openpyxl:
ps: writing 1,000,000 rows takes about 2 minutes locally, but roughly 20 minutes on the server...
with pd.ExcelWriter(self.file_name, engine='openpyxl') as file_writer:
    sheet_name1 = 'aaa'
    # pass the writer object itself; engine/encoding args are not needed on to_excel
    frame1.to_excel(file_writer, sheet_name=sheet_name1, index=False)
    sheet_name2 = 'bbb'
    frame2.to_excel(file_writer, sheet_name=sheet_name2, index=False)
...
The approach above just writes the data naively and is fairly slow.
2. On top of the above, write the sheets from multiple threads; throughput improved by a bit under half.
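A minimal, self-contained sketch of this first approach, with two small illustrative frames standing in for the real 11 (the file name, sheet names, and data are made up):

```python
import pandas as pd

# Two small illustrative DataFrames standing in for the real 11.
frames = {
    "aaa": pd.DataFrame({"id": [1, 2], "value": ["x", "y"]}),
    "bbb": pd.DataFrame({"id": [3, 4], "value": ["z", "w"]}),
}

# One ExcelWriter, one worksheet per DataFrame; the file is saved
# when the with-block exits.
with pd.ExcelWriter("report.xlsx", engine="openpyxl") as file_writer:
    for sheet_name, frame in frames.items():
        frame.to_excel(file_writer, sheet_name=sheet_name, index=False)
```
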
with pd.ExcelWriter(self.file_name, engine="openpyxl") as file_writer:
    """Use multiple threads to speed up writing"""
    thread1 = threading.Thread(target=data_to_excel, args=(frame1, file_writer, 'aaa', 1))
    thread2 = threading.Thread(target=data_to_excel, args=(frame2, file_writer, 'bbb'))
    ...
    thread1.start()
    thread2.start()
    ...
    thread1.join()
    thread2.join()
    ...
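The data_to_excel worker used at this step is not shown above; it is presumably just a thin wrapper around to_excel, roughly like the sketch below (file name, frames, and sheet names are illustrative). Note that neither pandas' ExcelWriter nor openpyxl is documented as thread-safe, and the GIL limits how much the threads can actually overlap:

```python
import threading
import pandas as pd


def data_to_excel(frame, writer, sheet_name):
    """Write one DataFrame to one sheet of an already-open ExcelWriter."""
    frame.to_excel(writer, sheet_name=sheet_name, index=False)


frames = {
    "aaa": pd.DataFrame({"a": [1, 2]}),
    "bbb": pd.DataFrame({"b": [3, 4]}),
}

with pd.ExcelWriter("threaded.xlsx", engine="openpyxl") as file_writer:
    threads = [
        threading.Thread(target=data_to_excel, args=(frame, file_writer, name))
        for name, frame in frames.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```
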
3. Even with the threads, the final save step still takes a long time, so the next optimization drops pandas' writer and writes directly with the openpyxl library:
from openpyxl.utils.dataframe import dataframe_to_rows
from openpyxl import Workbook
file_writer = Workbook()
thread1 = threading.Thread(target=data_to_excel, args=(cell_del_frame, file_writer, 'aaa', 1))
thread2 = threading.Thread(target=data_to_excel, args=(cell_data, file_writer, 'bbb'))
...
thread1.start()
thread2.start()
...
thread1.join()
thread2.join()
...
file_writer.save('data.xlsx')
"""""""""""""
data_to_excel
"""""""""""""
def data_to_excel(frame, writer, sheet_name, target=None):
    """Write one DataFrame into a worksheet of the openpyxl Workbook."""
    print("start writing:", sheet_name)
    print("rows to write:", len(frame))
    # frame.to_excel(writer, sheet_name=sheet_name, index=False)
    if target == 1:
        # reuse the default sheet that Workbook() creates
        sheet = writer.active
        sheet.title = sheet_name
    else:
        sheet = writer.create_sheet(sheet_name)
    # header=True (the default) emits the column names as the first row
    for row in dataframe_to_rows(frame, index=False):
        sheet.append(row)
    print("finished writing:", sheet_name)
Overall write time improved by about 1/3: 18 minutes → 12 minutes.
4. One more update: another ~1/3 speedup over the approach above (time cut by 1/3), this time using the pyexcelerate library. The library's official description:
PyExcelerate library documentation
Sample code below:
import pyexcelerate as pe
import numpy as np
import datetime
t1 = datetime.datetime.now()
workbook = pe.Workbook()
for i, df in enumerate([data1, data2, data3]):
    sheet_name = f"sheet{i + 1}"
    df_columns = df.columns.tolist()
    col_length = len(df_columns)
    # df.values is a 2-D array without the header row, so prepend the columns manually!
    value = np.concatenate(([np.array(df_columns)], df.values), axis=0)
    workbook.new_sheet(sheet_name, data=value)
workbook.save("output.xlsx")
t2 = datetime.datetime.now()
print('pyexcelerate elapsed:', t2 - t1)