Session 7: Machine Learning
Notes
- Machine Learning Techniques and Tools
- Our main focus is to analyze and understand variables that may affect the group’s performance
- Two machine learning and data-mining methods were used (Decision Tree & Feature Selection)
- Decision Tree
- C4.5 algorithm (Quinlan, 1993), an updated version of ID3 (Quinlan, 1986)
- handles both discrete and continuous data
- handles missing data
- supports tree pruning
- Decision trees are supervised learning methods that make use of already classified training data to build predictive models
- The aim of a decision tree classifier is to divide the training samples into partitions that are homogeneous with respect to the dependent variable
- Outputs a model in the form of a tree (end nodes are the final predictions)
- The C4.5 algorithm employs normalized information gain as the criterion for variable selection: the variable with the highest normalized information gain is chosen to split on at each node
- The biggest advantage of decision trees is that a single tree can describe the whole feature space; this ease of interpretation makes them quite popular among practitioners
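- A minimal sketch of entropy-based tree learning on synthetic data (assuming scikit-learn; note it implements CART rather than C4.5, but `criterion="entropy"` splits on information gain, the idea described above):

```python
# Sketch of entropy-based decision-tree learning (CART with an entropy
# criterion, standing in for C4.5's information-gain splitting).
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the (team metrics -> performance) training data.
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

# max_depth and ccp_alpha (pruning) keep the tree small and interpretable.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, ccp_alpha=0.01,
                             random_state=0)
clf.fit(X, y)

# The whole model is one readable tree: each leaf is a final prediction.
print(export_text(clf, feature_names=[f"metric_{i}" for i in range(5)]))
```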
- Feature Selection
- Feature selection aims to select a compact subset of independent variables that can predict the dependent variable without much loss of information
- The purpose is to trim the dataset into a manageable one by focusing on independent variables that have high predictive power
- Feature selection mines the most informative features and gets rid of the redundant or strongly correlated features.
- This process helps achieve a compact, smaller set of features and therefore improves model interpretability as well as training time and generalization through less overfitting
- Feature selection methods are mainly categorized into three types (a sketch of all three follows this list)
- Filter
- A subset of features is judged as informative or not using measures computed directly from the data, irrespective of any particular learning model
- Wrapper
- Wrapper methods evaluate model accuracy using a learning method for different subsets of features and return the best-performing feature subset
- Embedded
- Embedded methods, by contrast, try to merge the subset-search and evaluation phases by incorporating the search within the machine learning model itself
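- A minimal sketch contrasting the three families on the same synthetic data (assuming scikit-learn; the specific estimators and k=5 are illustrative choices, not the paper's setup):

```python
# Sketch of the three feature-selection families (illustrative estimators).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=16, n_informative=5,
                           random_state=0)

# Filter: score each feature from the data alone, no learning model involved.
filt = SelectKBest(mutual_info_classif, k=5).fit(X, y)
print("filter  :", filt.get_support(indices=True))

# Wrapper: repeatedly refit a model on candidate subsets, keep the best one.
wrap = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
print("wrapper :", wrap.get_support(indices=True))

# Embedded: the model's own training yields importances as a by-product.
emb = RandomForestClassifier(random_state=0).fit(X, y)
print("embedded:", emb.feature_importances_.argsort()[-5:][::-1])
```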
- WEKA
- GUI-based machine learning tool
- Dataset and metrics
- A game-based test-bed: SABRE (Situation Authorable Behavior Research Environment)
- Neverwinter Nights
- 56 teams of four members each
- Teams search for hidden weapons caches in an urban environment while earning or losing goodwill points (the dependent variable, used as the performance metric)
- Individual level metrics
- Role type
- These are based upon the kind of role the individual is playing within the team
- Skill type
- These reflect the skill of a team member
- Group level metrics
- Total
- aggregate individual score
- Information Entropy (Teachman)
- H(X) = -Σ_n p_n log2(p_n), where p_n is the fractional contribution of member n to individual metric X (a computation sketch follows)
- High score -> Highly homogeneous
- Low score -> Highly heterogeneous
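- A minimal sketch of how this entropy could be computed for one metric, assuming p_n is each member's share of the team total as defined above:

```python
import numpy as np

def teachman_entropy(contributions):
    """Entropy of a team's fractional contributions for one metric.

    Equal contributions give the maximum (log2 4 = 2 bits for a
    four-member team, i.e. homogeneous); one dominant member gives a
    value near 0 (heterogeneous).
    """
    c = np.asarray(contributions, dtype=float)
    p = c / c.sum()                      # fractional contribution p_n
    p = p[p > 0]                         # 0 * log(0) is taken as 0
    return -(p * np.log2(p)).sum()

print(teachman_entropy([10, 10, 10, 10]))  # 2.0  -> highly homogeneous
print(teachman_entropy([37, 1, 1, 1]))     # ~0.5 -> highly heterogeneous
```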
- Team configuration
- 1-1-1-1: all working separately
- 1-1-2: two working together and the other two separately
- 1-3: One working separately and three together
- 2-2: Working in groups of two
- 4: All working together
- Group performance
- 0-Low (bottom 25%)
- 1-Medium (middle 50%)
- 2-High (top 25%)
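- A minimal sketch of this quartile labeling (the `scores` array is a synthetic stand-in for the 56 team goodwill totals):

```python
import numpy as np

scores = np.random.default_rng(0).normal(size=56)  # stand-in for 56 team scores
q25, q75 = np.percentile(scores, [25, 75])
labels = np.where(scores < q25, 0, np.where(scores <= q75, 1, 2))
print(np.bincount(labels))  # roughly 14 / 28 / 14 teams
```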
- Experimentation
- 3 micro-level variable sets
- role metrics
- skill metrics
- Group configuration
- 1 macro-level set (all variables)
- Analysis
- Correlation
- How do individual metrics affect performance?
- How do pairs of individual metrics affect each other and performance?
- Decision Trees
- How do groups of individual metrics affect each other and performance?
- Feature Selection
- Select the most important group of metrics that affect performance
- Decision trees on the selected group of metrics
- How do groups of individual metrics affect each other and performance?
- Experiment 1
- Correlation analysis
- Total amount of tips sent and entropy of tips sent are significantly correlated
- Total metrics are, in general, more related to performance than the entropy metrics (a correlation sketch follows this list)
- The more a team interacts with the NPCs the more likely the team gets more tips from them
- High entropy for a given variable indicates that team members behave similarly with respect to that variable and low entropy indicates that there is a large variation among the team members for the given variable
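- A minimal sketch of this kind of correlation screen (assuming pandas; the column names and data are hypothetical stand-ins for the team-level table):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical team-level table: totals, entropies, and the performance label.
df = pd.DataFrame({
    "tips_sent_total":   rng.poisson(20, 56),
    "tips_sent_entropy": rng.uniform(0, 2, 56),
    "tips_recv_total":   rng.poisson(30, 56),
    "performance":       rng.integers(0, 3, 56),
})

# Pairwise correlations; the last column shows each metric vs. performance.
print(df.corr()["performance"].drop("performance"))
```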
- Decision Tree
- We are satisfied if the model fits the training data sufficiently well, and we focus on interpretation of the feature space
- Tips_recv_total and Tips_sent_entropy contain mostly medium- and high-performance leaves
- Higher tips circulated within the team and higher tips-sent entropy are both related to better team performance according to the model
- If the tips-receiving entropy of the group is less than 1.7, it is predicted to be high performing
- Feature Selection
- In machine learning, selecting a subset of the most important variables and ranking them is done using feature selection methods
- The output returns a ranked list of all the attributes according to their relevance
- In fact, for both decision trees and feature selection, there was almost no difference between the models built using the training set (with low error)
- Decision Tree on selected top 5 variables
- The decision tree shows details, but feature selection is a black-box model
- Combine the best of both methods (see the sketch at the end of this section)
- We used the top five highly ranked features
- In this way we leverage the ranking information from feature selection to reduce the feature set from 16 to the five most important variables
- The big marked circle on the right contains a sub-tree whose leaves are either medium or high performing, implying that if a team falls in this sub-tree it is highly probable that it would perform well
- CHAT_RECT_TOTAL
- TIPS_SENT_TOTAL
- Good performance
- Everyone in the team should be communicating via both chatting and exchanging tips, but only a few members should be receiving a lot of tips from NPCs and entering buildings
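- A minimal sketch of the combined pipeline referenced above: rank all 16 features, keep the top five, and refit a small interpretable tree on them (the notes name WEKA as the tool; this sketch uses scikit-learn with mutual information as an illustrative ranker, and synthetic data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=300, n_features=16, n_informative=5,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

# Feature selection gives a ranking but no readable structure ...
scores = mutual_info_classif(X, y, random_state=0)
top5 = np.argsort(scores)[-5:][::-1]
print("top 5 feature indices:", top5)

# ... while a tree fit on just those five features stays small and readable.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X[:, top5], y)
print(export_text(tree, feature_names=[f"f{i}" for i in top5]))
```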