如何使用Python分块处理Excel文件数据?
介绍
看起来全世界都被Excel统治着。在我的数据工程工作中,我惊讶地发现我的许多同事都在使用Excel作为关键工具来做出决策。虽然我不是MS Office及其Excel电子表格的忠实粉丝,但我仍然会向您展示一个巧妙的技巧,以有效地处理大型Excel电子表格。
怎么做…
在直接进入程序之前,让我们了解一些使用Pandas处理Excel电子表格的基础知识。
1. 安装。继续安装openpyxl和xlwt。如果您不确定是否已安装,只需使用python终端中的pip freeze或pip list查看可用的包。
我们将首先通过传递数据元组来创建一个Excel电子表格。然后我们将数据加载到pandas数据框中。最后,我们将数据框数据写入新的工作簿。
import xlsxwriter import pandas as pd
2. 创建一个包含少量数据的Excel电子表格。我们将有一个小的函数来将字典数据写入Excel电子表格。所有代码逻辑都在每个步骤中定义。
# Function : write_data_to_files def write_data_to_files(inp_data, inp_file_name): """ function : create a csv file with the data passed to this code args : inp_data : tuple data to be written to the target file file_name : target file name to store the data return : none assumption : File to be created and this code are in same directory. """ print(f" *** Writing the data to - {inp_file_name}") # Create a Workbook. workbook = xlsxwriter.Workbook(inp_file_name) # add a worksheet. worksheet = workbook.add_worksheet() # Start from the first cell. Rows and columns are zero indexed. row = 0 col = 0 # read the input data and write them in rows and columns for player, titles in inp_data: worksheet.write(row, col, player) worksheet.write(row, col + 1, titles) row += 1 # close the workbook. workbook.close() print(f" *** Completed writing the data to - {inp_file_name}")
# Function : excel_functions_with_pandas def excel_functions_with_pandas(inp_file_name): """ function : Quick overview of functions you can apply on excel with pandas args : inp_file_name : input excel spread sheet. return : none assumption : Input excel spreadsheet and this code are in same directory. """ data = pd.read_excel(inp_file_name) # print top 2 rows print(f" *** Displaying top 2 rows of - {inp_file_name} \n {data.head()} ") # look at the data types print(f" *** Displaying info about {inp_file_name} - {data.info()}") # Create a new spreadsheet "Sheet2" and write data into it. new_players_info = pd.DataFrame(data=[ {"players": "new Roger Federer", "titles": 20}, {"players": "new Rafael Nadal", "titles": 20}, {"players": "new Novak Djokovic", "titles": 17}, {"players": "new Andy Murray", "titles": 3}], columns=["players", "titles"]) new_data = pd.ExcelWriter(inp_file_name) new_players_info.to_excel(new_data, sheet_name="Sheet2") if __name__ == '__main__': # Define your file name and data file_name = "temporary_file.xlsx" # tuple data for storage file_data = (['player', 'titles'], ['Federer', 20], ['Nadal', 20], ['Djokovic', 17], ['Murray', 3]) # write the file_data to file_name # write_data_to_files(file_data, file_name) # # Read excel file into pandas and apply functions. # excel_functions_with_pandas(file_name)
if __name__ == '__main__': # Define your file name and data file_name = "temporary_file.xlsx" # tuple data for storage file_data = (['player', 'titles'], ['Federer', 20], ['Nadal', 20], ['Djokovic', 17], ['Murray', 3]) # write the file_data to file_name # write_data_to_files(file_data, file_name) # # Read excel file into pandas and apply functions. # excel_functions_with_pandas(file_name)
输出
*** Writing the data to - temporary_file.xlsx *** Completed writing the data to - temporary_file.xlsx *** Displaying top 2 rows of - temporary_file.xlsx player titles 0 Federer 20 1 Nadal 20 2 Djokovic 17 3 Murray 3 <class 'pandas.core.frame.DataFrame'> RangeIndex: 4 entries, 0 to 3 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 player 4 non-null object 1 titles 4 non-null int64 dtypes: int64(1), object(1) memory usage: 192.0+ bytes *** Displaying info about temporary_file.xlsx - None
现在,当处理大型csv文件时,我们有很多选项,包括分块处理它们,但是对于Excel电子表格,Pandas默认情况下不提供分块选项。
因此,如果您想分块处理Excel电子表格,下面的程序非常有用。
示例
def global_excel_to_db_chunks(file_name, nrows): """ function : handle excel spreadsheets in chunks args : inp_file_name : input excel spread sheet. return : none assumption : Input excel spreadsheet and this code are in same directory. """ chunks = [] i_chunk = 0 # The first row is the header. We have already read it, so we skip it. skiprows = 1 df_header = pd.read_excel(file_name, nrows=1) while True: df_chunk = pd.read_excel( file_name, nrows=nrows, skiprows=skiprows, header=None) skiprows += nrows # When there is no data, we know we can break out of the loop. if not df_chunk.shape[0]: break else: print( f" ** Reading chunk number {i_chunk} with {df_chunk.shape[0]} Rows") # print(f" *** Reading chunk {i_chunk} ({df_chunk.shape[0]} rows)") chunks.append(df_chunk) i_chunk += 1 df_chunks = pd.concat(chunks) # Rename the columns to concatenate the chunks with the header. columns = {i: col for i, col in enumerate(df_header.columns.tolist())} df_chunks.rename(columns=columns, inplace=True) df = pd.concat([df_header, df_chunks]) print(f' *** Reading is Completed in chunks...') if __name__ == '__main__': print(f" *** Gathering & Displaying Stats on the excel spreadsheet ***") file_name = 'Sample-sales-data-excel.xls' stats = pd.read_excel(file_name) print(f" ** Total rows in the spreadsheet are - {len(stats.index)} Rows") # process the excel file in chunks of 1000 rows at a time. global_excel_to_db_chunks(file_name, 1000)
*** Gathering & Displaying Stats on the excel spreadsheet *** ** Total rows in the spreadsheet are - 9994 Rows ** Reading chunk number 0 with 1000 Rows ** Reading chunk number 1 with 1000 Rows ** Reading chunk number 2 with 1000 Rows ** Reading chunk number 3 with 1000 Rows ** Reading chunk number 4 with 1000 Rows ** Reading chunk number 5 with 1000 Rows ** Reading chunk number 6 with 1000 Rows ** Reading chunk number 7 with 1000 Rows ** Reading chunk number 8 with 1000 Rows ** Reading chunk number 9 with 994 Rows *** Reading is Completed in chunks...
输出
*** Gathering & Displaying Stats on the excel spreadsheet *** ** Total rows in the spreadsheet are - 9994 Rows ** Reading chunk number 0 with 1000 Rows ** Reading chunk number 1 with 1000 Rows ** Reading chunk number 2 with 1000 Rows ** Reading chunk number 3 with 1000 Rows ** Reading chunk number 4 with 1000 Rows ** Reading chunk number 5 with 1000 Rows ** Reading chunk number 6 with 1000 Rows ** Reading chunk number 7 with 1000 Rows ** Reading chunk number 8 with 1000 Rows ** Reading chunk number 9 with 994 Rows *** Reading is Completed in chunks...
广告