Example

This example illustrates decision trees by identifying the key factors that cause sunburn in people who visit the beach during the daytime. The four attributes used in this example are Weight, Height, Lotion, and Hair. These attributes are the condition, or independent, variables. Reference: P. Winston, 1992.

Given Data

Phase 1: From Data to Tree

A) In this phase we examine the data and build a tree from the given information. The first step is to perform average-entropy calculations on the full data collection for each of the four attributes:

Entropy Formula

Entropy describes the (im)purity, or homogeneity, of an arbitrary collection of samples. Given:

• n_b = the number of instances in branch b.
• n_bc = the number of instances in branch b belonging to class c.
• n_t = the total number of instances in all branches.

the average entropy of a split is

Average Entropy = sum over branches b of (n_b / n_t) * ( - sum over classes c of (n_bc / n_b) * log2(n_bc / n_b) )

• As a set moves between perfect balance and perfect homogeneity, entropy varies smoothly between one and zero:
  o Entropy is one when the set is perfectly inhomogeneous (evenly split between classes).
  o Entropy is zero when the set is perfectly homogeneous.

Let us inspect the entropy of the attributes Weight, Height, Hair Color, and Lotion. We look at Hair Color first:

Attribute: Hair Color
Reference: positive = sunburned, negative = none
b1 = blonde
b2 = red
b3 = brown
Average Entropy = 0.50

Sample average-entropy computation for the attribute "Hair Color"

B) Next we check whether the Height attribute has lower or higher entropy than Hair Color.

Entropy results for attribute "Height"
Attribute: Height
Reference: positive = sunburned, negative = none
b1 = short
b2 = average
b3 = tall
Average Entropy = 0.69

Sample average-entropy computation for the attribute "Height"

C) Next we check whether the Weight attribute has lower or higher entropy than Height and Hair Color.
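Before continuing, the average-entropy computation above can be sketched in Python. The per-branch class counts below are an assumption: they are reconstructed so that the hair-color split yields the 0.50 reported above (blonde: 2 sunburned, 2 none; red: 1 sunburned; brown: 3 none), since the full data table is not reproduced here.

```python
from math import log2

def branch_entropy(counts):
    """Entropy of one branch, given the per-class instance counts n_bc."""
    n_b = sum(counts)
    return -sum(c / n_b * log2(c / n_b) for c in counts if c > 0)

def average_entropy(branches):
    """Weighted average entropy over branches; each branch is (n_b1, n_b2, ...)."""
    n_t = sum(sum(b) for b in branches)
    return sum(sum(b) / n_t * branch_entropy(b) for b in branches)

# Assumed class counts (sunburned, none) per hair-color branch:
hair_color = [(2, 2),   # b1 = blonde
              (1, 0),   # b2 = red
              (0, 3)]   # b3 = brown
print(round(average_entropy(hair_color), 2))  # 0.5
```

The same two functions, applied to the branch counts of each attribute, reproduce the per-attribute averages compared in the following steps.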
Entropy results for attribute "Weight"
Attribute: Weight
Reference: positive = sunburned, negative = none
b1 = light
b2 = average
b3 = heavy
Average Entropy = 0.94

Sample average-entropy results for the attribute "Weight"

D) Finally, we check whether the Lotion attribute has lower or higher entropy than Weight, Height, and Hair Color.

Entropy results for attribute "Lotion"
Attribute: Lotion
Reference: positive = sunburned, negative = none
b1 = no
b2 = yes
Average Entropy = 0.6

Sample average-entropy results for the attribute "Lotion"

After the calculations, the attribute with the least entropy is picked.

Results

The attribute "hair color" is chosen as the first test, since it minimizes the entropy.

[Figure: decision tree after splitting on hair color - Liza, Sandy, Michael, Maria, David, Tania, Jeffery, Natasha]

We now pick another test to separate the sunburned people within the inhomogeneous blonde-haired subset {Natasha, Maria, Liza, Tania}.

Result

The attribute "lotion" is chosen, since it minimizes the entropy within the blonde-hair subset. Thus, using the "hair color" and "lotion" tests together correctly classifies all of the cases.

[Figure: decision tree after adding the lotion test - Natasha, Maria, Sandy, Michael, David, Liza, Tania, Jeffery]

Rules: We now derive rules from the decision trees above. Eliminating redundant rules gives the final rules:

See5 / C5.0

See5 / C5.0 is a package for discovering patterns in data, developed by RuleQuest Research Pty Ltd. It copes with over-fitting and with data that has missing attribute values, and its efficiency has been improved to a great extent. Data mining is about digging patterns out of a company's data center or warehouse. These patterns can be used to gain insight into various company operations, and to predict outcomes for future situations as a guide to making wise and sensible decisions. Patterns frequently concern the categories to which situations belong. For example, should an application for a loan be approved or not?
Will a particular customer ignore a mailout or respond to it? Will a process give a high, medium, or low yield on a batch of raw material?

See5 (Windows 7/8/10) and its UNIX counterpart C5.0 are practical data mining tools for finding patterns that describe categories, assembling them into classifiers, and using those classifiers to make predictions. Some main features are:

• See5/C5.0 has been designed to analyze databases containing thousands to millions of records, with numeric or nominal fields.
• To maximize interpretability, See5/C5.0 classifiers are expressed as decision trees or sets of rules, forms that are generally easier to understand than neural networks.
• See5/C5.0 is easy to use and does not presume advanced knowledge of statistics or machine learning.

Example

This is an illustration of the use of See5 for a medical application - mining a database of thyroid assays to construct diagnostic rules for hypothyroidism. Each case concerns a single referral and contains information on the source of the referral, patient data, requested tests, and the physician's comments. Here are three cases:

This is exactly the kind of task for which See5 was designed. Each case belongs to one of a small number of mutually exclusive classes (primary, secondary, negative, compensated). Properties of each case that may be relevant to its class are given, although some cases may have unknown or non-applicable values for certain attributes. There are 24 attributes in this example, but See5 can deal with any number of attributes. See5's job is to find out how to predict a case's class from the values of the other attributes. See5 does this by constructing a classifier that makes this prediction. As we will see, See5 can construct classifiers expressed as decision trees or as sets of rules.
Application filestem

Every See5 application has a short name called a filestem; we will use the filestem hypothyroid for this example. All files read or written by See5 for an application have names of the form filestem.extension, where filestem identifies the application and extension describes the contents. The case of letters in both the filestem and extension is significant - the file names app.data, App.Data, and APP.DATA are all different. It is important that the extensions are written exactly as shown below, otherwise See5 will not recognize the files for your application.

Names file

Two files are essential for all See5 applications, and there are three further optional files, each identified by its extension. The first essential file is the names file (e.g. hypothyroid.names) that describes the classes and attributes. There are two important subgroups of attributes:

• The value of an explicitly-defined attribute is given directly. A discrete attribute has a value drawn from a set of nominal values, a continuous attribute has a numeric value, a date attribute holds a calendar date, a time attribute holds a clock time, a timestamp attribute holds a date and time, and a label attribute serves only to identify a particular case.
• The value of an implicitly-defined attribute is computed from a formula. (Most attributes are explicitly defined, so you may never need implicitly-defined attributes.)

The file hypothyroid.names looks like this:

What's in a name?

Names, classes, and discrete values are introduced by arbitrary strings of characters, with some fine print:

• Tabs and spaces are permitted inside a name or value, but See5 collapses every sequence of these characters to a single space.
• Special characters (comma ',', colon ':', period '.', vertical bar '|') can appear in names and values, but must be preceded by the escape character '\'.
For instance, the name "Filch, Grabbit, and Co." would be written as 'Filch\, Grabbit\, and Co\.'. (Colons in times and periods in numbers do not need to be escaped.)

White space (spaces, tab characters, and blank lines) is ignored except inside a name or value, and can be used to improve readability. Unless it is escaped as above, the vertical bar '|' causes the remainder of the line to be ignored, and is handy for including comments. This use of '|' should not occur inside a value.

The first line of the names file gives the classes, either by naming a discrete attribute that contains the class value, or by listing them explicitly. The attributes are then defined in the order that they will be given for each case.

Explicitly-defined attributes

The name of each explicitly-defined attribute is followed by a colon ':' and a description of the values taken by the attribute. There are six possibilities:

continuous
The attribute takes numeric values.

date
The attribute's values are dates of the form YYYY-MM-DD or YYYY/MM/DD, e.g. 2017-09-30 or 2017/09/30.

time
The attribute's values are times of the form HH:MM:SS, with values between 00:00:00 and 23:59:59.

timestamp
The attribute's values are of the form YYYY/MM/DD HH:MM:SS or YYYY-MM-DD HH:MM:SS, for example 2017-09-30 16:08:12. (Note that there is a space separating the date and the time.)

a comma-separated list of names
The attribute takes discrete values, and these are the allowable values. The values may be prefaced by [ordered] to indicate that they are given in a meaningful order, otherwise they will be taken as unordered. For instance, the values low, medium, high are ordered, while fish, meat, vegetables, and poultry are not. The former might be declared as: [ordered] low, medium, high. If discrete attribute values have a natural order, it is better to declare them as such so that See5 can exploit the ordering.
(NB: the target attribute should not be declared as ordered.)

discrete N for some integer N
The attribute has discrete, unordered values, but the values are assembled from the data itself; N is the maximum number of such values. (This is not recommended, since the data cannot be checked, but it can be handy for unordered discrete attributes with many values.) (NB: this form cannot be used for the target attribute.)

ignore
The values of the attribute should be ignored.

label
This attribute contains an identifying label for each case, such as an order code or an account number. The value of the attribute is ignored when classifiers are constructed, but is used when referring to individual cases. A label attribute can make it easier to locate errors in the data and to cross-reference results to individual cases. If there are two or more label attributes, only the last is used.

Attributes defined by formulas

The name of each implicitly-defined attribute is followed by ':=' and then a formula defining the attribute value. The formula is written in the usual way, using parentheses where needed, and may refer to any attribute defined before this one. Constants in the formula can be numbers (written in decimal), dates, times, and discrete attribute values (enclosed in string quotes '"'). The operators and functions that can be used in the formula are:

• +, -, *, /, % (mod), ^ (meaning 'raised to the power')
• >, >=, <, <=, =, and <> or != (both meaning 'not equal')
• and, or
• sin(...), cos(...), tan(...), log(...), exp(...), and int(...) (meaning 'integer part of')

The value of such an attribute is either continuous or true/false depending on the formula. For instance, the attribute FTI above is continuous, since its value is obtained by dividing one number by another.
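The constructs described so far can be combined in a small names file. The attributes below are hypothetical, chosen to illustrate the syntax rather than to reproduce the hypothyroid file:

```
class.                        | first line: name of the target attribute

age:        continuous.
sex:        M, F.
admitted:   date.
severity:   [ordered] low, medium, high.
account:    label.
risk score := log(age) * 2.   | implicitly-defined attribute
class:      negative, primary.
```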
The value of a hypothetical attribute such as

strange := referral source = "EAST" or age > 35.

would be t or f, since the value given by the formula is either true or false. If the value of the formula cannot be determined for a particular case, because one or more of the attributes appearing in the formula have unknown or non-applicable values, the value of the implicitly-defined attribute is unknown.

Dates, times, and timestamps

Dates are stored by See5 as the number of days since a particular starting point, so some operations on dates make sense. Thus, if we have attributes

d1: date.
d2: date.

we can define

interval := d2 - d1.
gap := d1 <= d2 - 7.
d1-day-of-week := (d1 + 1) % 7 + 1.

interval represents the number of days from d1 to d2 (non-inclusive), and gap would have a true/false value signaling whether d1 is at least a week before d2. The last definition is a somewhat non-obvious way of determining the day of the week on which d1 falls, with values ranging from 1 (Monday) to 7 (Sunday).

Similarly, times are stored as seconds since midnight. If the names file includes

start: time.
finish: time.
elapsed := finish - start.

the value of elapsed is the number of seconds from start to finish.

Timestamps are a little more complex. A timestamp is rounded to the nearest minute, but limitations on the precision of floating-point numbers mean that the values stored for timestamps more than about thirty years from the starting point are approximate. If the names file includes

departure: timestamp.
arrival: timestamp.
flight time := arrival - departure.

the value of flight time is the number of minutes from departure to arrival.

Attributes used in classifiers

An optional final entry in the names file affects the way See5 constructs classifiers. This entry takes one of the forms

attributes included:
attributes excluded:

followed by a comma-separated list of attribute names.
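The date arithmetic above has direct analogues in Python's datetime module, where toordinal() counts days from a fixed origin. Note the origin differs from See5's, so the day-of-week formula needs (ordinal - 1) rather than See5's (d1 + 1); the sample dates are an illustration, not data from the text.

```python
from datetime import date, timedelta

d1 = date(2017, 9, 23)
d2 = date(2017, 9, 30)

# interval: number of days from d1 to d2
interval = (d2 - d1).days

# gap: true if d1 is at least a week before d2
gap = d1 <= d2 - timedelta(days=7)

# Day of week as 1 (Monday) .. 7 (Sunday). Python's day-count origin differs
# from See5's, hence (ordinal - 1) % 7 + 1 instead of (d1 + 1) % 7 + 1.
day_of_week = (d1.toordinal() - 1) % 7 + 1

print(interval, gap, day_of_week)  # 7 True 6  (2017-09-23 was a Saturday)
```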
The first form restricts the attributes used in classifiers to those specifically named; the second form specifies that classifiers must not use any of the named attributes. Excluding an attribute from classifiers is not the same as ignoring the attribute (see 'ignore' above). As an example, suppose that numeric attributes A and B are defined in the data, but background knowledge suggests that their difference is what matters. The names file might then contain the following entries:

...
A: continuous.
B: continuous.
Diff := A - B.
...
attributes excluded: A, B.

In this example the attributes A and B could not be defined as ignore, because the definition of Diff would then be invalid.

Data file

The second essential file, the application's data file (e.g. hypothyroid.data), provides information on the training cases from which See5 will extract patterns. The entry for each case consists of one or more lines that give the values of all explicitly-defined attributes. If the classes are listed in the first line of the names file, the attribute values are followed by the case's class value. Values are separated by commas, and the entry is optionally terminated by a period. Once again, anything on a line after a vertical bar '|' is ignored. (If the information for a case occupies more than one line, make sure that the line breaks occur after commas.) For example, the first three cases from the file hypothyroid.data are:

If there were no commas, See5 would not be able to process the information. Notice that '?' is used to denote a value that is unknown or missing. Similarly, 'N/A' denotes a value that is not applicable for a particular case. Also note that the cases do not contain values for the attribute FTI, since its values are computed from other attribute values.
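Reading a See5-style .data file is straightforward. The sketch below uses Python's csv module; the two sample rows are hypothetical, since the hypothyroid listing itself is not reproduced above. '?' is mapped to None (unknown), while 'N/A' is kept as a distinct not-applicable marker:

```python
import csv
import io

def parse_value(v):
    """Interpret one comma-separated field from a .data entry."""
    v = v.strip().rstrip('.')   # an entry may be terminated by an optional period
    if v == '?':
        return None             # unknown / missing value
    return v                    # 'N/A' stays as the not-applicable marker

# Hypothetical two-case excerpt in See5 .data format (class value last):
raw = io.StringIO(
    "41,F,no,?,1.3,negative.\n"
    "23,M,no,N/A,4.1,primary.\n"
)
cases = [[parse_value(v) for v in row] for row in csv.reader(raw)]
print(cases[0])  # ['41', 'F', 'no', None, '1.3', 'negative']
```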
The third kind of file used by See5 consists of new cases (e.g. hypothyroid.test) on which the classifier can be evaluated. This file is optional and, if used, has exactly the same format as the data file.

As a simple illustration, here is the main window of See5 after the hypothyroid application has been selected.

The main window of See5 has six buttons on its toolbar. From left to right, they are:

Locate Data
invokes a browser to find the files for your application, or to change the current application;

Construct Classifier
selects the type of classifier to be constructed and sets other options;

Stop
interrupts the classifier-generating process;

Review Output
re-displays the output from the last classifier construction;

Use Classifier
interactively applies the current classifier to one or more cases; and

Cross-Reference
shows how cases in the training or test data relate to (parts of) a classifier, and vice versa.

These functions can also be invoked from the software's File menu. The Edit menu enables changes to the names and costs files after an application's files have been located. On-line help is available through the Help menu.

Constructing Classifiers

Once the names, data, and optional files have been set up, everything is ready to use See5. The first step is to locate the data using the Locate Data button on the toolbar. There are several options that affect the type of classifier that See5 produces and the way it is constructed. The Construct Classifier button on the toolbar displays a panel that sets out these classifier construction options:

Many of the options have default values that should be satisfactory for most applications.
Decision trees

When See5 is invoked with the default values, it constructs a decision tree and produces output like this:

(Since hardware platforms can differ in floating-point precision and rounding, the output that you see may not be exactly identical to the above.)

See5 constructs a decision tree from the 2772 training cases in the file hypothyroid.data, and this appears next. Although it may not look much like a tree, this output can be paraphrased as:

and so on. The tree uses a case's attribute values to map it to a leaf designating one of the classes. Every leaf of the tree is followed by a cryptic (n) or (n/m). For instance, the last leaf of the decision tree is negative (3/0.2), for which n is 3 and m is 0.2. The value of n is the number of cases in the file hypothyroid.data that are mapped to this leaf, and m (if it appears) is the number of them that are classified incorrectly by the leaf. (A non-integral number of cases can arise because, when the value of an attribute tested in the tree is not known, See5 splits the case and sends a fraction down each branch.)

Rulesets

Decision trees can sometimes be quite difficult to understand. An important feature of See5 is its mechanism to convert trees into collections of rules called rulesets. The Rulesets option causes rules to be derived from trees produced as above, giving the following rules:

Each rule consists of:

• A rule number -- this is quite arbitrary and serves only to identify the rule.
• Statistics (n, lift x) or (n/m, lift x) that summarize the performance of the rule. As with a leaf, n is the number of training cases covered by the rule and m, if it appears, shows how many of them do not belong to the class predicted by the rule. The rule's accuracy is estimated by the Laplace ratio (n-m+1)/(n+2). The lift x is the result of dividing the rule's estimated accuracy by the relative frequency of the predicted class in the training set.
• One or more conditions that must all be satisfied if the rule is to be applicable.
• The class predicted by the rule.
• A value between 0 and 1 that indicates the confidence with which this prediction is made.

The Problem and Analysis Steps

Server log files from the acme study center web site are used in this thesis. The server log files (630 MB) are stored on the hard disk. The log records are then queried to extract only the logs where users used various keywords to arrive at the same web site. For example:

Keywords: acme, acme study center, acme courseware, computer center of acme, dha computer studies
Search engines: google.com, yahoo.com, bing.com, etc.

The keywords and their relevance to the web site reached are assessed manually, by eye, using knowledge and intuition. Once the user arrives at the site, the recorded activity tells whether the user ever wanted to be on the site: was he conducting any business, or did he arrive at the site accidentally? These conclusions are drawn by studying the log records of each particular user. The activity of each user in the main data file is then mined from the source file using 'grep'. The log records of the particular user, who has used a keyword in a search engine to arrive at this particular site, are obtained. From this it is understood whether the user was deliberate in his search or not. The possibility of error in such an inference exists. Nevertheless, it is more important to understand the bigger picture, so a general depiction is obtained from these server log files about a large number of users. Thus, the raw data about users stored in the server log records is mined to acquire knowledge, so that this data can be used to analyze whether the time and money spent on the web site are justified, and to provide a more user-friendly site. In business terms, this can be converted into retail dollars, where the knowledge gained can be used to make the site more user friendly and give users what they want.
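Returning to the rule statistics described earlier, the Laplace ratio (n-m+1)/(n+2) and the lift are easy to reproduce. The numbers below are made-up illustrations, not output from the hypothyroid run:

```python
def laplace_accuracy(n, m=0):
    """Estimated accuracy of a rule covering n cases, m of them misclassified."""
    return (n - m + 1) / (n + 2)

def lift(n, m, class_frequency):
    """Estimated accuracy divided by the relative frequency of the predicted class."""
    return laplace_accuracy(n, m) / class_frequency

# A hypothetical rule covering 30 cases, 1 misclassified, predicting a class
# that makes up 5% of the training set:
acc = laplace_accuracy(30, 1)            # (30 - 1 + 1) / (30 + 2) = 0.9375
print(acc, lift(30, 1, 0.05))            # 0.9375 18.75
```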
The steps taken

a. Access to server log files
b. Cleansing the data
c. Specific user activity
d. Analysis of user activity
e. Formation of data sets for deriving decision trees using See5 / C5.0
f. Understanding the decision trees
g. Analytics concluded
h. Applications

a) Access to server log files

The following is a small part of the original server log files from the acme courseware web site. The original log file is 630 MB and has 2.06 million records. This is a huge file, and patterns are extracted from this data.

format=%Ses->client.ip% - %Req->vars.oauth-user% [%SYSDATE%] "%Req->reqpigb.clfrequest%" %Req->srvhdris.clf-status% %Req->srvhdris.content-length% "%Req->headers.referer%" "%Req->headers.user-agent%" %Req->reqpb.method% %Req->reqpb.uri%

43.254.15.255 - - [02/Nov/2017:16:59:57 -0400] "GET /~knapp/is613.htm HTTP" 404 - "http://search.msn.com/spbasic.htm?MT=free%20relationship%20diagram%20software" "Mozilla/52.0 (compatible; MSIE 11.0; Windows Server 2012)" GET /~knapp/is613.htm
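A record in the format above can be broken apart with a regular expression. This sketch is an assumption about how such a record could be dissected, not the exact 'grep' pipeline used in the thesis; it pulls out the client IP, the status code, and the search keywords embedded in the referer's query string (MT here is msn.com's query parameter in the sample line):

```python
import re
from urllib.parse import urlparse, parse_qs

LOG_RE = re.compile(
    r'(?P<ip>\S+) - (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d+) (?P<length>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = ('43.254.15.255 - - [02/Nov/2017:16:59:57 -0400] '
        '"GET /~knapp/is613.htm HTTP" 404 - '
        '"http://search.msn.com/spbasic.htm?MT=free%20relationship%20diagram%20software" '
        '"Mozilla/52.0 (compatible; MSIE 11.0; Windows Server 2012)" '
        'GET /~knapp/is613.htm')

m = LOG_RE.match(line)
query = parse_qs(urlparse(m.group('referer')).query)
keywords = query.get('MT', [''])[0]
print(m.group('ip'), m.group('status'), keywords)
# 43.254.15.255 404 free relationship diagram software
```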