-
Notifications
You must be signed in to change notification settings - Fork 1
Vertical Hoeffding Tree Classifier
Vertical Hoeffding Tree (VHT) classifier is a distributed classifier that utilizes vertical parallelism on top of Very Fast Decision Tree (VFDT) classifier.
VFDT is the pioneer of decision tree induction for online classification. VFDT uses Hoeffding bound to decide the minimum number of arriving instances to achieve certain level of confidence in splitting the node. This confidence level determines how close the statistics between the attribute chosen by VFDT and the attribute chosen by decision tree for batch learning.
For more comprehensive summary of VFDT, read chapter 3 of Data Stream Mining: A Practical Approach.
Vertical Parallelism is a parallelism approach which partitions the instances in term of attribute for parallel processing. Vertical-parallelism-based decision tree induction processes the partitioned instances (which consists of subset of attribute) to calculate the information-theoretic criteria in parallel. For example, if we have instances with 100 attributes and we partition the instances into 5 portions, we will have 20 attributes per portion. The algorithm processes the 20 attributes in parallel to determine the "local" best attribute to split and combine the parallel computation results to determine the "global" best attribute to split and grow the tree.
For more explanation about available parallelism types for decision tree induction, you can read chapter 4 of Distributed Decision Tree Learning for Mining Big Data Streams, the master thesis from the inception of SAMOA project.
We implemented VHT using SAMOA API, and the diagram below shows the implementation:
Refer to figure above, the source PI and the evaluator PI are parts of an evaluation task in SAMOA, such as Prequential Evaluation. The model-aggregator PI consists of the decision tree model. It connects to local-statistic PI via attribute stream and control stream. The model-aggregator PI splits instances based on attribute and each local-statistic PI contains local statistic for attributes that assigned to it. The Model-aggregator PI sends the split instances via attribute stream and it sends control messages to ask local-statistic PI to perform computation via control stream. Users configure n, which is the parallelism level of the algorithm. The parallelism level is translated into the number of local-statistic PIs in the algorithm.
The model-aggregator PI sends the classification result via result stream to an evaluator PI for the corresponding evaluation task or other destination PI. The evaluator PI performs evaluation of the algorithm and the evaluation could be in term of accuracy and throughput. Incoming instances to the model-aggregator PI arrive via source stream. The calculation results from local statistic arrive to the model-aggregator PI via computation-result stream.
For more details about the algorithms (i.e. pseudocode), go to section 4.2 of Distributed Decision Tree Learning for Mining Big Data Streams, the master thesis from the inception of SAMOA project.