Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert '.xlsx' ResourceFiles to csv #1337

Open
tbhallett opened this issue May 13, 2024 · 16 comments · May be fixed by #1570
Open

Convert '.xlsx' ResourceFiles to csv #1337

tbhallett opened this issue May 13, 2024 · 16 comments · May be fixed by #1570
Assignees
Labels
enhancement New feature or request

Comments

@tbhallett
Copy link
Collaborator

We have some .xlsx ResourceFiles in use, but this makes comparison using git very cumbersome and it's not not necessary to use the Excel file format. It should be a straight forward task to convert the few .xlsx into .csv format.

@tbhallett tbhallett added the enhancement New feature or request label May 17, 2024
@mnjowe
Copy link
Collaborator

mnjowe commented Jul 9, 2024

Hi @tbhallett . Do you want all resource files under resources folder to be of .csv format?

@tbhallett
Copy link
Collaborator Author

Hi @tbhallett . Do you want all resource files under resources folder to be of .csv format?

Yes I think that would be really good if possible.

Where a sheet has multiple sheets and various things that are not used, we can drop the unused stuff (as we can archive the excel file in Dropbox), and just replace in the code pd.read_excel for pd.read_csv.

Where multiple sheets in the excel file are being used, we might need to create a neat little solution - for instance putting multiple csv files into a folder, and then making a utility function so that reading this in behaves the same as pd.read_excel() did: i.e, if sheet_name=None, return a dict of pd.DateFrames for all the sheets, and otherwise provide a pd.DateFrame of just the target sheet.

Let's tag @tamuri and @matt-graham for their thoughts and in case there is some ready-made solution for this.

@matt-graham
Copy link
Collaborator

Where multiple sheets in the excel file are being used, we might need to create a neat little solution - for instance putting multiple csv files into a folder, and then making a utility function so that reading this in behaves the same as pd.read_excel() did: i.e, if sheet_name=None, return a dict of pd.DateFrames for all the sheets, and otherwise provide a pd.DateFrame of just the target sheet.

This sounds like a great solution to me - mapping an Excel file with multiple (in use) sheets to a directory of CSV files should keep the current functional grouping of related resource file sheets apparent, while removing the need to have Excel files, and as @tbhallett suggests having a helper function which takes a path to a directory and returns a dictionary mapping from CSV file name to a dataframe should mean the changes in code where currently using read_excel should be minimal.

In terms of automating doing this, I think using Pandas to deal with conversion from Excel to CSV would work - something like (this is untested!)

for excel_file_path in resource_file_path.rglob("*.xslx"):
    sheet_dataframes = pd.read_excel(excel_file_path, sheet_name=None)
    excel_file_directory = excel_file_path.with_suffix("")
    # Create a container directory for per sheet CSVs
    if excel_file_directory.exists():
        print(f"Directory {excel_file_directory} already exists")
    else:
        excel_file_directory.mkdir()
    # Write a CSV for each worksheet
    for sheet_name, dataframe in sheet_dataframes.items():
        dataframe.to_csv(excel_file_directory / sheet_name + ".csv")
    # Remove no longer needed Excel file
    excel_file_path.unlink()

@mnjowe
Copy link
Collaborator

mnjowe commented Jul 12, 2024

Thanks Tim and Matt for the clarification on this. @tbhallett do you want me to start working on this issue? I have some time

@tbhallett
Copy link
Collaborator Author

Thanks Tim and Matt for the clarification on this. @tbhallett do you want me to start working on this issue? I have some time

Yes please @mnjowe -- that would be brilliant

@mnjowe
Copy link
Collaborator

mnjowe commented Jul 12, 2024

Great!

@mnjowe mnjowe self-assigned this Jul 12, 2024
@mnjowe
Copy link
Collaborator

mnjowe commented Jul 12, 2024

Hi both. In order to get started with this issue, I've written the below function and have tested it already in lifestyle. My question is which is the right module to house this helper function as it will be needed in the read parameters section of all modules still using excel files. @matt-graham thanks for the above code(it didn't need a lot of modifications to do its job). all excel files are now turned into folders with one or multiple .csv files. Thanks

    def read_csv_files(resource_filepath: Path = None, file_name: list[str] = None) -> dict[str, DataFrame]:
        """
        A function to read csv files in a similar way pandas read Excel files(pd.read_excel).

        :param resource_filepath:  path to resource file folder
        :param file_name: preferred csv file name(s). This is the same as sheet names in Excel file

        """
        all_data = {}

        if file_name is not None:
            for f_name in file_name:
                df = pd.read_csv(f'{resource_filepath}/{f_name}.csv')
                all_data[f_name] = df

        else:
            for f_name in resource_filepath.rglob("*.csv"):
                file_base_name = str(f_name).split('/')[-1].split('.')[0]  # Extract base name without extension
                df = pd.read_csv(f_name)
                all_data[file_base_name] = df

        return all_data 

@tbhallett
Copy link
Collaborator Author

Hi both. In order to get started with this issue, I've written the below function and have tested it already in lifestyle. My question is which is the right module to house this helper function as it will be needed in the read parameters section of all modules still using excel files. @matt-graham thanks for the above code(it didn't need a lot of modifications to do its job). all excel files are now turned into folders with one or multiple .csv files. Thanks

    def read_csv_files(resource_filepath: Path = None, file_name: list[str] = None) -> dict[str, DataFrame]:
        """
        A function to read csv files in a similar way pandas read Excel files(pd.read_excel).

        :param resource_filepath:  path to resource file folder
        :param file_name: preferred csv file name(s). This is the same as sheet names in Excel file

        """
        all_data = {}

        if file_name is not None:
            for f_name in file_name:
                df = pd.read_csv(f'{resource_filepath}/{f_name}.csv')
                all_data[f_name] = df

        else:
            for f_name in resource_filepath.rglob("*.csv"):
                file_base_name = str(f_name).split('/')[-1].split('.')[0]  # Extract base name without extension
                df = pd.read_csv(f_name)
                all_data[file_base_name] = df

        return all_data 

How about making this functionality a part of the Module class? The Module's read_parameters() method is where this is always used, and It's the same kind of thing as load_parameters_from_dataframe() which is a member function.

@mnjowe
Copy link
Collaborator

mnjowe commented Jul 12, 2024

That's a good idea. I will add it to Module class, implement it in lifestyle read_parameters( ) section and open a new PR for further discussions. Thanks

@tamuri
Copy link
Collaborator

tamuri commented Jul 12, 2024

I'd lean towards making it a utility function in util.py rather than Module. It feel likes a general purpose util - imagine wanting to load the files outside of a module, in an analysis file for example.

@mnjowe
Copy link
Collaborator

mnjowe commented Jul 15, 2024

Good point Asif, thanks. Indeed, scenarios as these will require us to think twice on the location of this function. @tbhallett I will defer to you for a final decision on this.

@tbhallett
Copy link
Collaborator Author

Happy to go with Asif's suggestion! (We could always put a shortcut to it from the module, for convenience).

@mnjowe
Copy link
Collaborator

mnjowe commented Jul 15, 2024

Great!

@mnjowe
Copy link
Collaborator

mnjowe commented Jul 17, 2024

Hi all. I have created a draft PR here where we can continue with our discussion on this issue. I've started implementation in lifestyle and simplified birth. May you please to look at this initial stage of implementation and provide feedback if any before I move on to implementing the read csv files method to the rest of the disease modules? Thanks

@mnjowe
Copy link
Collaborator

mnjowe commented Jan 16, 2025

@tbhallett , now that all my PR's on this topic have been merged can we close this Issue?

@mnjowe mnjowe moved this from To do to Done in Issue management Jan 16, 2025
@mnjowe mnjowe moved this from Done to In progress in Issue management Jan 16, 2025
@mnjowe
Copy link
Collaborator

mnjowe commented Jan 16, 2025

Just realised there is one folder cmd that was missed.

@mnjowe mnjowe linked a pull request Jan 16, 2025 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
Status: In progress
Development

Successfully merging a pull request may close this issue.

4 participants