Data Mining

Orange Heart Disease Dataset

1 Problem Description

The dataset used in this exercise is the heart disease dataset, available in heart_disease.tab and obtained from the Orange datasets repository. This dataset describes risk factors for heart disease. The attribute diameter narrowing is the (binary) class attribute: class 1 means there is diameter narrowing; class 0 means there is no diameter narrowing. The main aim of this exercise is to predict heart disease, in terms of diameter narrowing, from the other attributes in the dataset. This is a classification problem.

The software to be used is Orange. However, feel free to try any ideas you may have to tackle the problem with any other software. The exercise is described step by step; I hope this gives you a better understanding of the various aspects and questions involved in the KDD (Knowledge Discovery in Databases) process.

1.1 Data Understanding

The first step in approaching the problem is to get acquainted with the data. Answering the following questions will help you to better understand the data. The data file heart_disease.tab contains some information about the data stored in it. Load the data file in Orange.

1. For each attribute, find the following information.
   (a) The attribute type, e.g. nominal, ordinal, numeric.
   (b) Percentage of missing values in the data.
   (c) Max, min, mean, standard deviation.
   (d) Are there any records that have a value for the attribute that no other record has?
   (e) Study the histogram at the lower right and informally describe how the attribute seems to influence the risk for heart disease. What do the pop-up messages that appear when you move the mouse over the graphic mean?
   (f) Are there any outliers for the attribute under consideration?
       i. Investigate the possibility of using the Orange widgets to detect outliers.
2. Use Visualize widgets to visualize 2D-scatter plots for each pair of attributes.
   (a) Which attributes seem to be the most/least linked to heart disease? Summarize your findings concerning the predictive value of each attribute in a table.
   (b) Does any pair of attributes seem to be correlated?
3. Also investigate possible multivariate associations of attributes with the class attribute, i.e. study scatter plots of two attributes X and Y and try to identify possible "dense" heart disease areas (if any).
   (a) If you find "dense" heart disease areas in any scatter plot, quantify the heart disease rate in these areas relative to the entire dataset.

1.2 Data Preprocessing

The second step is to preprocess the data so that the transformed data is in a more suitable form for the mining algorithms.

1. Attribute selection. Investigate the possibility of using the widget AttributeSelection for selecting a subset of attributes with good predictive capability. Then briefly describe the widget you used and compare your results with the conclusions you reached in the previous section.
2. Handling missing values. Consider the following methods for handling missing values and investigate each possibility within Orange. Note that, as a rule of thumb, if an attribute has more than 5% missing values then the records should not be deleted, and it is advisable to impute values where data is missing, using a suitable method.
   (a) Replace the missing values by the attribute mean if the attribute is numeric. Otherwise (if the attribute is categorical), replace missing values by the attribute mode. Save the resulting dataset, without missing values, in the file heart_disease2.tab.
   (b) Investigate the possibility of using (linear) regression to estimate the missing values for each attribute. Save the resulting dataset, without missing values, in the file heart_disease3.tab.
3. Eliminating outliers.
   (a) Eliminate the outlier records and save the resulting dataset, without outliers, in the file heart_disease4.tab.

1.3 Mining the Data

The third step is to use some of the classifier algorithms available in Orange to discover hidden patterns in the data. You should repeat the steps described below for each of the datasets you created during preprocessing, as well as for the original dataset (if possible).

1. Use more than one classifier (Decision Tree, SVM, K Nearest Neighbor).
   (a) What can you conclude? Compare your conclusions with those you reached in section 1.1.
   (b) Compare the accuracy of the classifier on the training set with the accuracy estimate obtained through 10-fold cross-validation. How do you explain the difference (if any)?
   (c) Describe the patterns you obtained and compare them with your previous conclusions.

1.4 Clustering Tendency

Investigate whether there is a clustering tendency in the dataset. You may start by clustering the data with the K Means Clustering algorithm.

1. Do not use the class attribute, diameter narrowing, for clustering.
2. Find a suitable value for k, i.e. the number of clusters you are going to build. Justify your choice of k.

1.5 Predicting Performance

In the previous step you built several models. Finally, you need to compare the different models and describe your final conclusions.

1. Orange outputs several performance measures. Choose some of them and motivate your choice.
2. Summarize in a table the performance measures for each classifier and each dataset.
3. What can you conclude?

1.6 Conclusions

Describe your final conclusions and indicate which risk factors for heart disease you have found in the data.
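As a companion to step 2(a) of section 1.2, the mean/mode replacement rule can also be tried outside Orange. The following is a minimal sketch in plain Python, not Orange's own imputation: the function name impute_mean_mode, the use of None to mark a missing value, and the small example records are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean


def impute_mean_mode(rows):
    """Return a copy of rows with missing values (None) filled in.

    Numeric columns are filled with the column mean, all other columns
    with the column mode. Hypothetical helper, not part of Orange.
    """
    if not rows:
        return []
    fills = []
    for j in range(len(rows[0])):
        # Values actually observed in column j.
        known = [r[j] for r in rows if r[j] is not None]
        if not known:
            fills.append(None)  # nothing observed, so the gap stays
        elif all(isinstance(v, (int, float)) and not isinstance(v, bool)
                 for v in known):
            fills.append(mean(known))  # numeric attribute: mean
        else:
            # Categorical attribute: mode. Ties go to the value seen first.
            fills.append(Counter(known).most_common(1)[0][0])
    # Build new rows so the input data is left unchanged.
    return [[fills[j] if v is None else v for j, v in enumerate(r)]
            for r in rows]


# Made-up records (age, chest pain type); None marks a missing value.
records = [
    [63, "typical"],
    [None, "asymptomatic"],
    [57, None],
    [45, "asymptomatic"],
]
completed = impute_mean_mode(records)
```

In this example the missing age is replaced by 55 (the mean of 63, 57 and 45) and the missing chest pain type by asymptomatic (the most frequent value). Comparing a classifier trained on such a dataset with one trained on the regression-imputed dataset is one way to approach the comparison asked for in section 1.5.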

Data Mining Final Project

Objective

The purpose of this project is to familiarize you with the process of data mining, using a modern programming toolkit to apply numerous data mining strategies.

Tools

This project uses Orange, a suite of data mining tools interfaced via C++, Python or through GUI widgets.

Deliverables

This project will require you to do the following things:

1. Read and briefly summarize all documents on the reading list.
   • Summarize the following:
     • What is data mining
     • Orange as a data mining tool
     • Basic data manipulation and preparation
     • Visualization
     • Data modeling
     • Evaluation of model performance
2. Using the tutorial and documentation examples as a guide, complete your own data mining process against a dataset of your choosing.

Your Own Data Mining Process

After you've completed step 1 above, you should have a good understanding of what tools are available to you in Orange. Now it's time to try some of these approaches on a dataset of your own choosing. For this part of the project, you must:

• Choose and explore a dataset. You can use any of the ones provided in Orange or import your own dataset.
• Select at least 3 data mining strategies and apply them to the dataset.
• Describe your intent, approach and results.
• Lastly, did you discover anything meaningful or surprising? If so, document your findings. If not, describe what you might do next to refine your process or choose different/improved mining strategies.

Your data mining process doesn't have to be perfect, or even yield incredibly interesting results; the important thing is the process. So don't be afraid to try something fun even if it may not yield amazing results.

3. Submit complete documentation of items 1 and 2 above.

Reading List

Data Mining, Material for lectures on Data mining, at Kyoto University, Dept. of Health Informatics: http://eprints.fri.uni-lj.si/1150/1/DataMining-Kyoto.pdf
A Data Mining Tutorial: http://maths-people.anu.edu.au/~steve/pdcn.pdf
Functional Genomics Workshop: http://docs.orange.biolab.si/_downloads/bio-tutorial.pdf