Convert '.xlsx' ResourceFiles to csv
#1337
Comments
Hi @tbhallett. Do you want all resource files under the resources folder to be in .csv format? |
Yes, I think that would be really good if possible. Where an Excel file has multiple sheets and various things that are not used, we can drop the unused stuff (as we can archive the Excel file in Dropbox), and just replace pd.read_excel with pd.read_csv in the code. Where multiple sheets in the Excel file are being used, we might need to create a neat little solution: for instance, putting multiple csv files into a folder, and then making a utility function so that reading this in behaves the same as pd.read_excel() did, i.e. if sheet_name=None, return a dict of pd.DataFrames for all the sheets, and otherwise provide a pd.DataFrame of just the target sheet. Let's tag @tamuri and @matt-graham for their thoughts and in case there is some ready-made solution for this. |
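A minimal sketch of the kind of utility function described above, assuming one folder per former Excel file with one CSV per sheet. The name `read_csv_folder` and its signature are hypothetical and not taken from the project. Note that, unlike pd.read_excel (whose default is the first sheet), this sketch defaults to sheet_name=None and returns every sheet.

```python
from pathlib import Path
from typing import Dict, Optional, Union

import pandas as pd


def read_csv_folder(
    folder: Union[str, Path], sheet_name: Optional[str] = None
) -> Union[pd.DataFrame, Dict[str, pd.DataFrame]]:
    """Read CSV files from a folder that stands in for a multi-sheet Excel file.

    Hypothetical helper: `folder` is assumed to contain one `<sheet>.csv`
    file per worksheet of the original workbook.
    """
    folder = Path(folder)
    if sheet_name is None:
        # Mimic pd.read_excel(..., sheet_name=None): a dict keyed by sheet name
        return {
            csv_path.stem: pd.read_csv(csv_path)
            for csv_path in sorted(folder.glob("*.csv"))
        }
    # Otherwise return only the DataFrame for the requested sheet
    return pd.read_csv(folder / f"{sheet_name}.csv")
```

With this shape, a call such as pd.read_excel(path, sheet_name="x") would become read_csv_folder(folder, sheet_name="x"), so the calling code changes very little.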
This sounds like a great solution to me - mapping an Excel file with multiple (in use) sheets to a directory of CSV files should keep the current functional grouping of related resource file sheets apparent, while removing the need to have Excel files. As @tbhallett suggests, having a helper function which takes a path to a directory and returns a dictionary mapping from CSV file name to a dataframe should mean the changes in code where currently using

In terms of automating this, I think using Pandas to deal with the conversion from Excel to CSV would work - something like (this is untested!)

    import pandas as pd

    # resource_file_path is assumed to be a pathlib.Path to the resources directory
    for excel_file_path in resource_file_path.rglob("*.xlsx"):
        # Read every worksheet into a dict mapping sheet name to dataframe
        sheet_dataframes = pd.read_excel(excel_file_path, sheet_name=None)
        excel_file_directory = excel_file_path.with_suffix("")
        # Create a container directory for per sheet CSVs
        if excel_file_directory.exists():
            print(f"Directory {excel_file_directory} already exists")
        else:
            excel_file_directory.mkdir()
        # Write a CSV for each worksheet (index=False avoids an extra index column)
        for sheet_name, dataframe in sheet_dataframes.items():
            dataframe.to_csv(excel_file_directory / f"{sheet_name}.csv", index=False)
        # Remove no longer needed Excel file
        excel_file_path.unlink()
|
Thanks Tim and Matt for the clarification on this. @tbhallett, do you want me to start working on this issue? I have some time. |
Yes please @mnjowe -- that would be brilliant |
Great! |
Hi both. To get started with this issue, I've written the function below and have already tested it in lifestyle. My question is: which is the right module to house this helper function, given that it will be needed in the read parameters section of all modules still using Excel files? @matt-graham, thanks for the above code (it didn't need a lot of modifications to do its job). All Excel files are now turned into folders with one or multiple .csv files. Thanks
|
How about making this functionality a part of the |
That's a good idea. I will add it to |
I'd lean towards making it a utility function in |
Good point Asif, thanks. Indeed, scenarios such as these will require us to think twice about the location of this function. @tbhallett, I will defer to you for a final decision on this. |
Happy to go with Asif's suggestion! (We could always put a shortcut to it from the module, for convenience). |
Great! |
Hi all. I have created a draft PR here where we can continue our discussion on this issue. I've started the implementation in lifestyle and simplified birth. Could you please look at this initial stage of the implementation and provide feedback, if any, before I move on to implementing the read csv files method in the rest of the disease modules? Thanks |
@tbhallett, now that all my PRs on this topic have been merged, can we close this issue? |
Just realised there is one folder |
We have some .xlsx ResourceFiles in use, but this makes comparison using git very cumbersome, and it's not necessary to use the Excel file format. It should be a straightforward task to convert the few .xlsx files into .csv format.