Clarification on Methodology & Datasets used in Paper? #15
I've dug into the code a bit more and have come up with some answers, though some things are still not clear. I would appreciate it if the authors could confirm these and give more clarity on the remaining questions (in bold):
1. I think my diagram was pretty close, with the difference that the output/leaf nodes are split differently at different branches, as opposed to sharing a common split (10 in my diagram above).
1.1 Yes, the default is 1000 trees, with sqrt(total number of genes) per tree.
1.2 Yes, the whole matrix is used. Conflicts are simply minimized: conflicting observations end up in the "wrong" basket, and this affects the final score (reduction of variance) at each node.
1.3 The split that minimizes impurity is calculated at each node.
1.4 Need information from the authors.
1.5 Yes, the decision trees are not used in the traditional sense; what matters is the metadata (reduction of variance at each node).
1.6 Need information from the authors.
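To make the answers about the learning sample and the per-node split concrete, here is a minimal sketch in plain Python. It is my own illustration, not the package's code: the helper names (`learning_sample`, `best_split`) and the toy expression values are made up, and it searches every candidate gene on every sample, whereas the real method draws random subsets. For a continuous output like expression level, the "impurity" is the variance of the output within a node, so the best split is the threshold that removes the most variance.

```python
from statistics import pvariance


def learning_sample(expr, target):
    """Split an expression table into (inputs, output) for one target gene.

    expr maps gene name -> list of expression values, one per sample.
    (Hypothetical data layout, chosen only for this sketch.)
    """
    output = expr[target]
    inputs = {gene: values for gene, values in expr.items() if gene != target}
    return inputs, output


def best_split(x, y):
    """Return (threshold, variance_reduction) for the best binary split of y on x.

    The reduction is n*Var(y) - n_left*Var(y_left) - n_right*Var(y_right).
    Candidate thresholds are midpoints between consecutive distinct x values.
    """
    total = len(y) * pvariance(y)
    best_threshold, best_reduction = None, 0.0
    values = sorted(set(x))
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        reduction = (total
                     - len(left) * pvariance(left)
                     - len(right) * pvariance(right))
        if reduction > best_reduction:
            best_threshold, best_reduction = threshold, reduction
    return best_threshold, best_reduction


# Toy data: 4 samples, GeneX is the output gene (gene "j" in the paper).
expr = {
    "GeneX": [1.0, 2.0, 9.0, 10.0],
    "Gene3": [0.5, 1.0, 3.0, 4.0],
    "Gene7": [5.0, 1.0, 4.0, 2.0],
}
inputs, output = learning_sample(expr, "GeneX")
# Score every candidate input gene at this node; the node keeps the best one.
scores = {gene: best_split(values, output) for gene, values in inputs.items()}
chosen_gene = max(scores, key=lambda gene: scores[gene][1])
```

In this toy example the node would split on Gene3, because it separates low and high GeneX values more cleanly than Gene7. Samples that land on the "wrong" side of a threshold are not discarded; they just leave more variance in that branch, which lowers the reduction credited to the splitting gene.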
Hi, I have some questions regarding the specific methodology of the paper.
Firstly, I'd like to apologise for not being an expert in stats. I watched some tutorials and read some chapters of relevant stats textbooks, but I am having trouble translating their examples to this method.
"For each gene j, a learning sample is generated with expression levels of j as output values and expression levels of all other genes as input values"
Is this what each tree would look like? (Gene "X" in my diagram would be equivalent of gene "j" in the paper)
1.1 And for each gene j, this is repeated many times, with all the inputs/intermediary genes randomized?
1.2 I assume that for each tree the whole counts matrix is used, i.e. all observations of gene expression profiles. How are conflicts handled? For example, in my diagram above, what if Gene3>2 has a few counts of GeneX>10, but ALSO a few counts of GeneX<=10? Is it a winner-takes-all binary split? Do these conflicts directly affect the weight of each tree, i.e. the "sums of total variance reduction"? If the whole matrix is not used, what is the splitting/bagging heuristic?
1.3 How does the tree arrive at the binary split threshold at each node? And what about the final output Gene X/j (e.g. Gene X > or <= "10")?
1.4 Why was "sums of total variance reduction" used as opposed to a more traditional metric, e.g. Gini impurity?
1.5 Ultimately, decision trees should each "vote" on an outcome... But not in this application, right? Only the total variance reduction matters? Or does it work in another way entirely?
1.6 The vignette mentions 2 ways of thresholding: for example, top 5 per gene, and weight >0.1. This was also left open in the paper. What value was used in the paper to win the DREAM4 challenge? How did the authors arrive at it?
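For reference, this is how I understand the two thresholding options in 1.6, as a plain-Python sketch. The link tuples, function names, and weights below are invented for illustration and are not the package's API. I am also assuming that "top 5 per gene" means the highest-weight regulators of each target gene, which may not be what the vignette intends.

```python
def top_k_per_target(links, k):
    """Keep the k highest-weight links for each target gene.

    links is a list of (regulator, target, weight) tuples (hypothetical layout).
    """
    by_target = {}
    for link in links:
        by_target.setdefault(link[1], []).append(link)
    kept = []
    for target_links in by_target.values():
        ranked = sorted(target_links, key=lambda link: link[2], reverse=True)
        kept.extend(ranked[:k])
    return kept


def above_weight(links, cutoff):
    """Keep every link whose weight is strictly greater than cutoff."""
    return [link for link in links if link[2] > cutoff]


# Made-up weights, only to show how the two filters differ.
links = [
    ("A", "X", 0.40),
    ("B", "X", 0.05),
    ("C", "X", 0.20),
    ("A", "Y", 0.15),
    ("B", "Y", 0.30),
]
top1 = top_k_per_target(links, 1)
strong = above_weight(links, 0.1)
```

The two filters can give quite different networks: the top-k filter always keeps the same number of links per gene regardless of how weak they are, while the weight cutoff can leave some genes with no links at all. That is part of why I am asking which one, and which value, was used for DREAM4.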
Am I right to say that this is a bulk sequencing of 907 E. coli cells in one go, on one plate, so the resulting expression file consists of only 1 column, i.e.
If yes, how can a random forest get constructed? Does this use the same splitting/bagging heuristic as above?
I'd also appreciate feedback on whether I am asking the right questions, or if there are other related questions that I should have asked but didn't.
Thanks!