Platformy programowania
Prof. C. Jędrzejek, Room 317, building with the clock, mob. tel. 793 161 101
Winter 2020

Precision and Recall – classification measures

When evaluating a search tool or a classifier, we are interested in at least two performance measures:
• Precision: within a given set of positively labeled results, the fraction that are true positives = tp / (tp + fp)
• Recall: given a set of positively labeled results, the fraction of all true positives that were retrieved = tp / (tp + fn)
"Positively labeled" means judged "relevant" by the search engine, or assigned to the class by a classifier. tp = true positive, fp = false positive, etc.

Precision and Recall

Search tools and classifiers normally assign scores to items. Sorting by score gives a precision-recall plot, which shows what the performance would be at different score thresholds.
[Figure: precision-recall plot; the score increases along the curve. In IR, the query here consists of 5 words (features) describing an object.]

Be careful of "Accuracy"

The simplest measure of performance would be the fraction of items that are correctly classified, the "accuracy":

    accuracy = (tp + tn) / (tp + tn + fp + fn)

But this measure is dominated by the larger set (of positives or negatives) and favors trivial classifiers. E.g. if 5% of items are truly positive, then a classifier that always says "negative" is 95% accurate.

Confusion matrix with the numerical values of some classifier

Exercise 1 – similar to this one

    Sensitivity = TP / (TP + FN) = ?
    Specificity = TN / (TN + FP) = ?

Sensitivity: the ability of a test to correctly identify patients with a disease.
Specificity: the ability of a test to correctly identify people without the disease.
True positive: the person has the disease and the test is positive. ... False negative: the person has the disease and the test is negative.
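The measures above translate directly into code. A minimal Python sketch (the function names and counts are mine, not from the slides), using the hypothetical scenario from the text — 5% of 1000 items truly positive, and a trivial classifier that always answers "negative" — to show why accuracy alone is misleading:

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):          # same formula as sensitivity
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts: 1000 items, 50 truly positive, and a classifier
# that always says "negative" (so tp = fp = 0).
tp, fp, fn, tn = 0, 0, 50, 950

print(accuracy(tp, tn, fp, fn))   # 0.95 - looks great...
print(recall(tp, fn))             # 0.0  - ...but no positive was found
```

Note that precision is undefined here (0/0): the classifier never labels anything positive, which is exactly the kind of degenerate behavior that accuracy hides and recall exposes.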
Konstanz, 27-28.3.2000, EDBT2000 tutorial

Decision trees
• A tree where
  – internal node = test on a single attribute
  – branch = an outcome of the test
  – leaf node = class or class distribution
[Figure: a small example tree with test nodes A?, B?, C?, D? and a "Yes" leaf.]

Classical example: play tennis?

Outlook    Temperature  Humidity  Windy  Class
sunny      hot          high      false  N
sunny      hot          high      true   N
overcast   hot          high      false  P
rain       mild         high      false  P
rain       cool         normal    false  P
rain       cool         normal    true   N
overcast   cool         normal    true   P
sunny      mild         high      false  N
sunny      cool         normal    false  P
rain       mild         normal    false  P
sunny      mild         normal    true   P
overcast   mild         high      true   P
overcast   hot          normal    false  P
rain       mild         high      true   N

Training set from Quinlan's book.

Decision tree obtained with ID3 (Quinlan 86)
[Figure: root tests outlook; the sunny branch tests humidity (high → N, normal → P); overcast → P; the rain branch tests windy (false → P, true → N).]

From decision trees to classification rules
• One rule is generated for each path in the tree from the root to a leaf
• Rules are generally simpler to understand than trees
[Figure: the same ID3 tree.]
Example: IF outlook = sunny AND humidity = normal THEN play tennis

Example Tree for "Play?"
[Figure: the same tree over Outlook, Humidity and Windy, with Yes/No leaves.]

Building Decision Tree [Q93]
• Top-down tree construction
  – At the start, all training examples are at the root.
  – Partition the examples recursively by choosing one attribute at a time.
• Bottom-up tree pruning
  – Remove subtrees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases.

Choosing the Splitting Attribute
• At each node, the available attributes are evaluated on the basis of how well they separate the classes of the training examples. A goodness function is used for this purpose.
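The rule-extraction idea above can be made concrete: the ID3 tree for the play-tennis data is equivalent to one IF-THEN rule per root-to-leaf path. A sketch in Python (the function name `play_tennis` is mine, not from the slides):

```python
def play_tennis(outlook, humidity, windy):
    """The ID3 tree for the play-tennis data, one rule per leaf."""
    if outlook == "overcast":
        return "P"                                   # overcast -> always play
    if outlook == "sunny":
        return "P" if humidity == "normal" else "N"  # sunny -> check humidity
    return "N" if windy else "P"                     # rain -> check wind

# Spot-check against rows of the training set:
print(play_tennis("sunny", "high", False))     # N (first row)
print(play_tennis("overcast", "high", False))  # P (third row)
print(play_tennis("rain", "high", True))       # N (last row)
```

Note that temperature never appears: the learned tree classifies all 14 training examples correctly without it.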
• Typical goodness functions:
  – information gain (ID3/C4.5)
  – information gain ratio
  – gini index
(witten&eibe)

Which attribute to select?

A criterion for attribute selection
• Which is the best attribute?
  – The one that will result in the smallest tree
  – Heuristic: choose the attribute that produces the "purest" nodes
• Popular impurity criterion: information gain
  – Information gain increases with the average purity of the subsets that an attribute produces
• Strategy: choose the attribute that results in the greatest information gain

Exercise 2

Weather data with ID code

ID  Outlook   Temperature  Humidity  Windy  Play?
A   sunny     hot          high      false  No
B   sunny     hot          high      true   No
C   overcast  hot          high      false  Yes
D   rain      mild         high      false  Yes
E   rain      cool         normal    false  Yes
F   rain      cool         normal    true   No
G   overcast  cool         normal    true   Yes
H   sunny     mild         high      false  No
I   sunny     cool         normal    false  Yes
J   rain      mild         normal    false  Yes
K   sunny     mild         normal    true   Yes
L   overcast  mild         high      true   Yes
M   overcast  hot          normal    false  Yes
N   rain      mild         high      true   No

Information gain

Information gain IG(A) measures the difference in entropy from before to after the set S is split on an attribute A. In other words, it measures how much uncertainty in S was reduced by splitting S on attribute A.
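To make the definition concrete, here is a short Python sketch (function names are mine) that computes the entropy of the 14-example weather data above and the information gain of splitting on Outlook. Outlook partitions the set into sunny (2 Yes, 3 No), overcast (4 Yes, 0 No) and rain (3 Yes, 2 No):

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum_i p_i * log2(p_i) over the class proportions in S."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(values, labels, split_value_of):
    """IG(A) = H(S) - weighted sum of H(S_v) over the subsets S_v of the split."""
    total = len(labels)
    subsets = {}
    for v, label in zip(values, labels):
        subsets.setdefault(split_value_of(v), []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Outlook and Play? columns of the 14-row weather table (rows A..N):
outlook = ["sunny", "sunny", "overcast", "rain", "rain", "rain", "overcast",
           "sunny", "sunny", "rain", "sunny", "overcast", "overcast", "rain"]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

print(round(entropy(play), 3))                                 # 0.94
print(round(information_gain(outlook, play, lambda v: v), 3))  # 0.247
```

The values match the classic result for this dataset: H(S) ≈ 0.940 bits for the 9 Yes / 5 No split, and IG(Outlook) ≈ 0.247 bits, the largest of the four attributes — which is why ID3 puts outlook at the root.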