Convert '.xlsx' ResourceFiles to `csv` #1337

tbhallett · 2024-05-13T10:03:00Z

We have some .xlsx ResourceFiles in use, but this makes comparison using git very cumbersome and it's not not necessary to use the Excel file format. It should be a straight forward task to convert the few .xlsx into .csv format.

The text was updated successfully, but these errors were encountered:

mnjowe · 2024-07-09T12:56:50Z

Hi @tbhallett . Do you want all resource files under resources folder to be of .csv format?

tbhallett · 2024-07-10T06:29:43Z

Hi @tbhallett . Do you want all resource files under resources folder to be of .csv format?

Yes I think that would be really good if possible.

Where a sheet has multiple sheets and various things that are not used, we can drop the unused stuff (as we can archive the excel file in Dropbox), and just replace in the code pd.read_excel for pd.read_csv.

Where multiple sheets in the excel file are being used, we might need to create a neat little solution - for instance putting multiple csv files into a folder, and then making a utility function so that reading this in behaves the same as pd.read_excel() did: i.e, if sheet_name=None, return a dict of pd.DateFrames for all the sheets, and otherwise provide a pd.DateFrame of just the target sheet.

Let's tag @tamuri and @matt-graham for their thoughts and in case there is some ready-made solution for this.

matt-graham · 2024-07-11T15:03:28Z

Where multiple sheets in the excel file are being used, we might need to create a neat little solution - for instance putting multiple csv files into a folder, and then making a utility function so that reading this in behaves the same as pd.read_excel() did: i.e, if sheet_name=None, return a dict of pd.DateFrames for all the sheets, and otherwise provide a pd.DateFrame of just the target sheet.

This sounds like a great solution to me - mapping an Excel file with multiple (in use) sheets to a directory of CSV files should keep the current functional grouping of related resource file sheets apparent, while removing the need to have Excel files, and as @tbhallett suggests having a helper function which takes a path to a directory and returns a dictionary mapping from CSV file name to a dataframe should mean the changes in code where currently using read_excel should be minimal.

In terms of automating doing this, I think using Pandas to deal with conversion from Excel to CSV would work - something like (this is untested!)

for excel_file_path in resource_file_path.rglob("*.xslx"):
    sheet_dataframes = pd.read_excel(excel_file_path, sheet_name=None)
    excel_file_directory = excel_file_path.with_suffix("")
    # Create a container directory for per sheet CSVs
    if excel_file_directory.exists():
        print(f"Directory {excel_file_directory} already exists")
    else:
        excel_file_directory.mkdir()
    # Write a CSV for each worksheet
    for sheet_name, dataframe in sheet_dataframes.items():
        dataframe.to_csv(excel_file_directory / sheet_name + ".csv")
    # Remove no longer needed Excel file
    excel_file_path.unlink()

mnjowe · 2024-07-12T06:40:31Z

Thanks Tim and Matt for the clarification on this. @tbhallett do you want me to start working on this issue? I have some time

tbhallett · 2024-07-12T07:46:56Z

Thanks Tim and Matt for the clarification on this. @tbhallett do you want me to start working on this issue? I have some time

Yes please @mnjowe -- that would be brilliant

mnjowe · 2024-07-12T07:49:12Z

Great!

mnjowe · 2024-07-12T10:18:22Z

Hi both. In order to get started with this issue, I've written the below function and have tested it already in lifestyle. My question is which is the right module to house this helper function as it will be needed in the read parameters section of all modules still using excel files. @matt-graham thanks for the above code(it didn't need a lot of modifications to do its job). all excel files are now turned into folders with one or multiple .csv files. Thanks

    def read_csv_files(resource_filepath: Path = None, file_name: list[str] = None) -> dict[str, DataFrame]:
        """
        A function to read csv files in a similar way pandas read Excel files(pd.read_excel).

        :param resource_filepath:  path to resource file folder
        :param file_name: preferred csv file name(s). This is the same as sheet names in Excel file

        """
        all_data = {}

        if file_name is not None:
            for f_name in file_name:
                df = pd.read_csv(f'{resource_filepath}/{f_name}.csv')
                all_data[f_name] = df

        else:
            for f_name in resource_filepath.rglob("*.csv"):
                file_base_name = str(f_name).split('/')[-1].split('.')[0]  # Extract base name without extension
                df = pd.read_csv(f_name)
                all_data[file_base_name] = df

        return all_data

tbhallett · 2024-07-12T12:56:09Z

Hi both. In order to get started with this issue, I've written the below function and have tested it already in lifestyle. My question is which is the right module to house this helper function as it will be needed in the read parameters section of all modules still using excel files. @matt-graham thanks for the above code(it didn't need a lot of modifications to do its job). all excel files are now turned into folders with one or multiple .csv files. Thanks
    def read_csv_files(resource_filepath: Path = None, file_name: list[str] = None) -> dict[str, DataFrame]:
        """
        A function to read csv files in a similar way pandas read Excel files(pd.read_excel).

        :param resource_filepath:  path to resource file folder
        :param file_name: preferred csv file name(s). This is the same as sheet names in Excel file

        """
        all_data = {}

        if file_name is not None:
            for f_name in file_name:
                df = pd.read_csv(f'{resource_filepath}/{f_name}.csv')
                all_data[f_name] = df

        else:
            for f_name in resource_filepath.rglob("*.csv"):
                file_base_name = str(f_name).split('/')[-1].split('.')[0]  # Extract base name without extension
                df = pd.read_csv(f_name)
                all_data[file_base_name] = df

        return all_data 

How about making this functionality a part of the Module class? The Module's read_parameters() method is where this is always used, and It's the same kind of thing as load_parameters_from_dataframe() which is a member function.

mnjowe · 2024-07-12T13:16:00Z

That's a good idea. I will add it to Module class, implement it in lifestyle read_parameters( ) section and open a new PR for further discussions. Thanks

tamuri · 2024-07-12T14:29:34Z

I'd lean towards making it a utility function in util.py rather than Module. It feel likes a general purpose util - imagine wanting to load the files outside of a module, in an analysis file for example.

mnjowe · 2024-07-15T07:58:02Z

Good point Asif, thanks. Indeed, scenarios as these will require us to think twice on the location of this function. @tbhallett I will defer to you for a final decision on this.

tbhallett · 2024-07-15T08:00:38Z

Happy to go with Asif's suggestion! (We could always put a shortcut to it from the module, for convenience).

mnjowe · 2024-07-15T09:05:59Z

Great!

mnjowe · 2024-07-17T08:53:30Z

Hi all. I have created a draft PR here where we can continue with our discussion on this issue. I've started implementation in lifestyle and simplified birth. May you please to look at this initial stage of implementation and provide feedback if any before I move on to implementing the read csv files method to the rest of the disease modules? Thanks

mnjowe · 2025-01-16T06:23:22Z

@tbhallett , now that all my PR's on this topic have been merged can we close this Issue?

mnjowe · 2025-01-16T06:44:55Z

Just realised there is one folder cmd that was missed.

tbhallett added the enhancement New feature or request label May 17, 2024

tbhallett mentioned this issue Jul 11, 2024

Parameter Table #1419

Closed

mnjowe self-assigned this Jul 12, 2024

mnjowe mentioned this issue Jul 17, 2024

Convert excel ResourceFiles to csv - Add a method to read converted files #1425

Merged

mnjowe linked a pull request Jul 17, 2024 that will close this issue

Convert excel ResourceFiles to csv - Add a method to read converted files #1425

Merged

tamuri mentioned this issue Jul 23, 2024

use of xlsx #1432

Closed

mnjowe removed a link to a pull request Jul 26, 2024

Convert excel ResourceFiles to csv - Add a method to read converted files #1425

Merged

github-project-automation bot added this to Issue management Aug 27, 2024

github-project-automation bot moved this to To do in Issue management Aug 27, 2024

mnjowe moved this from To do to Done in Issue management Jan 16, 2025

mnjowe moved this from Done to In progress in Issue management Jan 16, 2025

mnjowe linked a pull request Jan 16, 2025 that will close this issue

Convert Excel files to CSV - Cardiometabolic disorders #1570

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Convert '.xlsx' ResourceFiles to `csv` #1337

Convert '.xlsx' ResourceFiles to `csv` #1337

tbhallett commented May 13, 2024

mnjowe commented Jul 9, 2024

tbhallett commented Jul 10, 2024

matt-graham commented Jul 11, 2024

mnjowe commented Jul 12, 2024

tbhallett commented Jul 12, 2024

mnjowe commented Jul 12, 2024

mnjowe commented Jul 12, 2024

tbhallett commented Jul 12, 2024

mnjowe commented Jul 12, 2024

tamuri commented Jul 12, 2024

mnjowe commented Jul 15, 2024

tbhallett commented Jul 15, 2024

mnjowe commented Jul 15, 2024

mnjowe commented Jul 17, 2024

mnjowe commented Jan 16, 2025

mnjowe commented Jan 16, 2025

Convert '.xlsx' ResourceFiles to csv #1337

Convert '.xlsx' ResourceFiles to csv #1337

Comments

tbhallett commented May 13, 2024

mnjowe commented Jul 9, 2024

tbhallett commented Jul 10, 2024

matt-graham commented Jul 11, 2024

mnjowe commented Jul 12, 2024

tbhallett commented Jul 12, 2024

mnjowe commented Jul 12, 2024

mnjowe commented Jul 12, 2024

tbhallett commented Jul 12, 2024

mnjowe commented Jul 12, 2024

tamuri commented Jul 12, 2024

mnjowe commented Jul 15, 2024

tbhallett commented Jul 15, 2024

mnjowe commented Jul 15, 2024

mnjowe commented Jul 17, 2024

mnjowe commented Jan 16, 2025

mnjowe commented Jan 16, 2025

Convert '.xlsx' ResourceFiles to `csv` #1337

Convert '.xlsx' ResourceFiles to `csv` #1337