-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generate chain of events for individuals #1468
base: master
Are you sure you want to change the base?
Conversation
print_chains = True | ||
if self.target != self.sim.population: | ||
row = self.sim.population.props.iloc[[self.target]] | ||
row['person_ID'] = self.target |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't think needed?
src/tlo/events.py
Outdated
row['event'] = self | ||
row['event_date'] = self.sim.date | ||
row['when'] = 'After' | ||
self.sim.event_chains = pd.concat([self.sim.event_chains, row], ignore_index=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that it's faster to not do these kinds of pandas operations very often, but instead to collect the data in python native structures (sets, dicts, lists, tuples) and then assemble them into a data-frame at the end.
if sim.generate_event_chains is False: | ||
# Create (and store pointer to) the OtherDeathPoll and schedule first occurrence immediately | ||
self.other_death_poll = OtherDeathPoll(self) | ||
sim.schedule_event(self.other_death_poll, sim.date) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unsure why this change is included; perhaps done for something in debugging?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is required if we want to train disease-specific emulators: we don't want other causes, including those grouped as "Others" (which we do not explicitly include as a disease module when running a sim, because they are always included by default), to "interfere"/end a life short when we are trying to capture effect of single disease.
If we want to use this not for emu training but other purposes then it could become important to not exclude this.
src/tlo/methods/hsi_event.py
Outdated
print_chains = False | ||
df_before = [] | ||
|
||
if self.sim.generate_event_chains: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
obviously would be nice to factorise-out this logic, which repeats in events.py
(Shame we don't have HSI_Event inheriting from Event, and we'd get it for free. We used to.. but it was changed at some point for a reason I can no longer remember.)
src/tlo/methods/hsi_event.py
Outdated
new_rows_after['event_date'] = self.sim.date | ||
new_rows_after['when'] = 'After' | ||
|
||
self.event_chains = pd.concat([self.sim.event_chains,new_rows_before], ignore_index=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather than building up the enormous thing in memory in the format of a data frame, I have the feeling that it's more efficient to put out to a logger bit by bit.
src/tlo/simulation.py
Outdated
self.end_date = None | ||
self.output_file = None | ||
self.population: Optional[Population] = None | ||
self.event_chains: Optinoal[Population] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think event_chains is a data frame rather than an instance of Population?
src/tlo/events.py
Outdated
new_rows_after['person_ID'] = new_rows_after.index | ||
new_rows_after['event'] = self | ||
new_rows_after['event_date'] = self.sim.date | ||
new_rows_after['when'] = 'After' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not store only the changes?
…intained is generate_event_chains is None
…Log changes to logger.
Updates:
To add:
|
…th, label what is only used for ddebugging and will be later removed
Hi @tbhallett, The bulk of the changes we discussed are done. I just have a couple of quick questions (below) before being ready to submit this for formal review. To summarise:
Quick questions:
|
Hi @tbhallett @tamuri @mnjowe @matt-graham, Apologies for casting the net wide but I'm a bit stuck on this. I was hoping to get someone's input on the following two issues (summary of what this PR does below):
Many thanks in advance! If anything is unclear please let me know. Summary of PR
|
…ible to all modules. For now add person_ID to the dict of info printed as the outer dictionary key logging seems to have a problem.
Make this always be a list of Person_IDs, so that whether it's one or more doesn't change the type.
If in doubt, I coerce to strings for the purpose of logging (and then unpack it using
Sorted by first point?
I think probably using Asif/Matt will know more, but I was thinking that for an indidual-level event, storing only the row as a dict and then using some optimised tool for dict comparison (e.g. https://miguendes.me/the-best-way-to-compare-two-dictionaries-in-python) might be better. In any case, storing the results of that as a dict seems like the efficient choice to me. |
I would agree with Tim to use pandas compare() but only for population-level-events. However, I believe the individual level events are rather small dictionaries, so it may not be necessary to use compare() for them. For the individual-level-events, using the same method you're using should be okay as it seems to be straightforward. Just considering the trouble of converting to a different data structure before doing a comparison. |
…lysis file now collects all relevant info and prints them
This PR creates an option to use the simulation to log & print events and their effect on individuals across any number of modules. By collating this information in post-processing, we can create chains of events and outcomes for each simulated individual across different diseases, demographic changes, etc.
This will be useful to generate training data for module emulators/time series analysis, but also for general debugging. In order to assist in the former, the PR additionally gives the user the option to “ignore” the epidemiology of diseases in order to prioritise infecting as many individuals as possible during the runs. This (for now) is controlled by the “generate_event_chains” parameter in the Simulation class. Post-processing is done via postprocess_event_chains.py.
NOTE: This is working as intended, but not really feasible (RAM+CPU-time wise) if intending to use it for a large number of individuals. Next step will be to think about how to make it so.
How the PR works:
OPTIMISATION
To reduce CPU requirements:
To reduce RAM requirements:
--Assume that the individual has not been modified since the previous event, such that the last “After” state is the “Before” state of the subsequent event. This is a bit risky given that the user can decide not to print all events the individual has experienced. In addition to this, one could also only log the changes . Again, potential draw-back in CPU time from having to compare rows.
-- Decide to ignore all intermediate events, and clearly define a “start” (i.e. onset) and end (“death”/”resolution”) of the episode. This is what we would need for emulator, however loads of very interesting information in between that ultimately we would want to capture to explore the possibility of training time series / to capture “infectiousness” of individual (e.g. viral load status for HIV).