8 Session 7: Machine Learning

8.1 Summary

  • Machine Learning Techniques and Tools
    • Our main focus is to analyze and understand variables that may affect the group’s performance
    • Two machine learning and data-mining methods are used (Decision Tree & Feature Selection)
  • Decision Tree
    • C4.5 Algorithm (Quinlan, 1986), an updated version of ID3
      • handles both discrete and continuous data
      • handles missing data
      • supports tree pruning
    • Decision trees are supervised learning methods that make use of already classified training data to build predictive models
    • The aim of a decision tree classifier is to divide the training samples into partitions that are homogeneous with respect to the dependent variable
      • Outputs a model in the form of a tree (end nodes are the final predictions)
    • The C4.5 algorithm employs normalized information gain as the criterion for variable selection: the variable with the highest normalized information gain is chosen for the split
    • The biggest advantage of decision trees is that a single tree can describe the whole feature space. This ease of interpretation makes them quite popular among practitioners
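
The normalized information gain that C4.5 uses as its split criterion is commonly called the gain ratio: the information gain of a split divided by the entropy of the split itself. The sketch below is a minimal illustration for a single discrete attribute, assuming base-2 entropy. The function names and the toy data are hypothetical and are not taken from the study or from WEKA.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels):
    """Information gain of splitting on a discrete attribute,
    normalized by the entropy of the split (split information)."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Weighted entropy of the class labels that remains after the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: an attribute that separates the classes perfectly.
attribute = ["r1", "r1", "r2", "r2"]
performance = [0, 0, 2, 2]
print(gain_ratio(attribute, performance))  # 1.0
```

An attribute that leaves every partition as mixed as the full sample gets a gain ratio of 0, so it would never be preferred for a split.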
  • Feature Selection
    • Feature selection aims to select a compact subset of independent variables that can predict the dependent variable without much loss of information
    • The purpose is to trim the dataset into a manageable one by focusing on independent variables that have high predictive power
    • Feature selection mines the most informative features and gets rid of the redundant or strongly correlated features.
    • This process yields a compact, smaller set of features and therefore improves model interpretability, training time, and generalization (through less overfitting)
    • Feature selection methods are mainly categorized into three types
      • Filter
        • Filter methods judge a subset of features as informative or not from properties of the data alone, irrespective of the learning method later used to predict the target dependent variable
      • Wrapper
        • Wrapper methods evaluate the model accuracy using a learning method for different subsets of features and return the best performing feature subset
      • Embedded
        • Embedded methods, by contrast, try to merge the subset search and evaluation phases by incorporating the search within the machine learning model itself
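
The difference between the first two categories can be sketched in a few lines. This is only an illustration under stated assumptions: the filter scores each feature by its absolute Pearson correlation with the target, and the wrapper scores each candidate subset by the leave-one-out accuracy of a 1-nearest-neighbour classifier. Both choices, all function names, and the toy data are hypothetical; the notes do not say which scoring functions or learners were used. Embedded methods are not shown.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def filter_rank(features, target):
    """Filter: rank features from data statistics alone, no learner involved."""
    return sorted(features,
                  key=lambda name: abs(pearson(features[name], target)),
                  reverse=True)


def loo_accuracy(rows, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    hits = 0
    for i, row in enumerate(rows):
        nearest = min((j for j in range(len(rows)) if j != i),
                      key=lambda j: math.dist(row, rows[j]))
        hits += labels[nearest] == labels[i]
    return hits / len(rows)


def wrapper_select(features, labels, k):
    """Wrapper: score every k-feature subset with the learner, keep the best."""
    best_subset, best_acc = None, -1.0
    for subset in combinations(sorted(features), k):
        rows = list(zip(*(features[name] for name in subset)))
        acc = loo_accuracy(rows, labels)
        if acc > best_acc:
            best_subset, best_acc = subset, acc
    return best_subset, best_acc


# Hypothetical toy data: one informative feature and one noisy feature.
features = {"signal": [1, 2, 3, 10, 11, 12], "noise": [5, 1, 4, 2, 6, 3]}
labels = [0, 0, 0, 1, 1, 1]
print(filter_rank(features, labels))
print(wrapper_select(features, labels, k=1))
```

The wrapper is more expensive because it retrains and re-evaluates the learner for every candidate subset, which is why exhaustive search like this is only feasible for small feature sets.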
  • WEKA
    • GUI-based machine learning tool
  • Dataset and metrics
    • A game-based test-bed: SABRE (Situation Authorable Behavior Research Environment)
    • Neverwinter Nights
      • 56 teams of four members each
      • The task is to search for hidden weapons caches in an urban environment while earning or losing goodwill points (the DV and performance metric)
    • Individual level metrics
      • Role type
        • These are based upon the kind of role the individual is playing within the team
      • Skill type
        • These reflect the skill of a team member
    • Group level metrics
      • Total
        • Aggregate of the individual scores
      • Information Entropy (Teachman)
        • Computed from the fractional contribution of member n to individual metric X
        • High score -> Highly homogeneous
        • Low score -> Highly heterogeneous
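
The two group-level aggregates can be sketched as follows. The notes do not give the entropy formula, so this assumes the usual Shannon form, H = -Σ p_n log p_n, where p_n is member n's fraction of the team total for metric X. The logarithm base (2 here) and the function names are assumptions.

```python
import math


def team_total(scores):
    """Total metric: sum of the individual scores for one metric."""
    return sum(scores)


def contribution_entropy(scores, base=2):
    """Entropy of the members' fractional contributions to one metric.

    Assumed form: H = -sum(p_n * log(p_n)), with p_n = score_n / total.
    """
    total = sum(scores)
    if total == 0:
        return 0.0  # nobody contributed; treated as zero entropy in this sketch
    fractions = [s / total for s in scores]
    return -sum(p * math.log(p, base) for p in fractions if p > 0)


# Four members contributing equally: approximately 2.0 (log2 of 4 members).
print(contribution_entropy([5, 5, 5, 5]))
# One member doing everything: zero entropy.
print(contribution_entropy([20, 0, 0, 0]))
```

Under this assumption a four-member team's entropy ranges from 0 (one member accounts for the whole metric) to about 2 (all members contribute equally), which matches the homogeneous/heterogeneous reading above.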
    • Team configuration
      • 1-1-1-1: All working separately
      • 1-1-2: two working together and the other two separately
      • 1-3: One working separately and three together
      • 2-2: Working in groups of two
      • 4: All working together
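
The five configurations listed above are exactly the ways of splitting four members into subgroups. As a small sketch, a configuration label can be derived from a per-member subgroup assignment. How the study actually decided which members were working together is not described in these notes, so the `group_ids` input is a hypothetical encoding.

```python
from collections import Counter


def configuration(group_ids):
    """Return a label such as '1-1-2' from one subgroup id per team member."""
    sizes = sorted(Counter(group_ids).values())
    return "-".join(str(size) for size in sizes)


# Hypothetical assignment: members 3 and 4 work together, the others alone.
print(configuration(["a", "b", "c", "c"]))  # 1-1-2
```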
    • Group performance
      • 0-Low (bottom 25%)
      • 1-Medium (50% in the middle)
      • 2-High (top 25%)
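
The three performance classes can be assigned by rank. The sketch below assumes a simple rank-based cut, with ties broken by input order. The notes do not say how ties or boundary cases were handled in the study, and the function name is hypothetical.

```python
def performance_labels(scores):
    """Label each team 0 (bottom 25%), 1 (middle 50%), or 2 (top 25%)."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])  # team indices, low to high
    cut = n // 4
    labels = [1] * n
    for i in order[:cut]:
        labels[i] = 0
    for i in order[n - cut:]:
        labels[i] = 2
    return labels


print(performance_labels([10, 40, 20, 30]))  # [0, 2, 1, 1]
```

With 56 teams this rule yields 14 low, 28 medium, and 14 high performing teams.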
  • Experimentation
    • 3 micro
      • role metrics
      • skills metrics
      • Group configuration
    • 1 macro (all variables)
  • Analysis
    • Correlation
      • How do individual metrics affect the performance?
      • How do pairs of individual metrics affect each other and the performance?
    • Decision Trees
      • How do groups of individual metrics affect each other and the performance?
    • Feature Selection
      • Select the most important group of metrics that affect performance
    • Decision trees on the selected group of metrics
      • How do groups of individual metrics affect each other and the performance?
  • Experiment 1
    • Correlation analysis
      • Total amount of tips sent and entropy of tips sent are significantly correlated
      • In general, total metrics are more strongly related to performance than entropy metrics
      • The more a team interacts with the NPCs, the more tips it is likely to get from them
      • High entropy for a given variable indicates that team members behave similarly with respect to that variable and low entropy indicates that there is a large variation among the team members for the given variable
    • Decision Tree
      • We are satisfied if our model fits the training data sufficiently well and focus on interpretation of feature space
      • The Tips_recv_total and Tips_sent_entropy branches contain mostly medium and high performance leaves
      • According to the model, more tips circulated within the team and higher tips-sent entropy are both related to team performance
      • If the tips-received entropy of the group is less than 1.7, the group is predicted to be high performing
    • Feature Selection
      • In machine learning, feature selection methods are used to pick a subset of the most important variables and to rank them
      • The output is a ranked list of all the attributes by relevance
      • In fact, both for decision trees and for feature selection, there was almost no difference between the models built using the training set (with low error)
    • Decision Tree on selected top 5 variables
      • The decision tree shows details, whereas feature selection is a black-box model
      • Combine the best of both methods
      • We used the top five highly ranked features
      • In this way we leverage the ranking information from feature selection to lower the size of the feature set from 16 to the five most important ones
      • The big marked circle on the right contains a sub-tree whose leaves are either medium or high performing, implying that if a team falls in this sub-tree it is highly probable that it would perform well
      • CHAT_RECT_TOTAL
      • TIPS_SENT_TOTAL
      • Good performance
        • Everyone in the team should be communicating via both chatting and exchanging tips, but only a few members should be receiving a lot of tips from NPCs and entering buildings
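
The "rank, then keep the top k" step that precedes the final decision tree can be sketched as below. The notes do not say which ranking criterion was used, so this assumes information gain of the best binary threshold split on each continuous feature, which is also how C4.5-style trees treat continuous data. The function names, feature names, and toy values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def best_split_gain(values, labels):
    """Largest information gain over all binary splits `value <= threshold`."""
    base, n, best = entropy(labels), len(labels), 0.0
    for t in sorted(set(values))[:-1]:  # the largest value cannot split the data
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        best = max(best, gain)
    return best


def select_top_k(table, labels, k=5):
    """Rank features by their best split gain and keep only the top k columns."""
    ranked = sorted(table,
                    key=lambda name: best_split_gain(table[name], labels),
                    reverse=True)
    reduced = {name: table[name] for name in ranked[:k]}
    return ranked, reduced


# Hypothetical toy table: three features, four teams, two performance classes.
table = {"f1": [1, 2, 3, 4], "f2": [1, 4, 2, 3], "f3": [7, 7, 7, 7]}
labels = [0, 0, 1, 1]
ranked, reduced = select_top_k(table, labels, k=2)
print(ranked)         # ['f1', 'f2', 'f3']
print(list(reduced))  # ['f1', 'f2']
```

The reduced table is what would then be passed to the decision tree learner, so the resulting tree stays small enough to read while still using the highest ranked variables.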

8.2 Further Reading