Lesson 1: Importing and Cleaning Data
Introduction and Context
Jenny is the marketing manager for a small Internet design and advertising firm. Jenny’s boss asks her to develop a data set containing information about Internet users. The company will use this data to determine what kinds of people are using the Internet and how the firm may be able to market their services to this group of users. To accomplish his assignment, Jenny creates an online survey and places links to the survey on several popular Web sites. Within two weeks, Jenny has collected enough data to begin analysis, but she finds that her data needs to be denormalized. She also notes that some observations in the set are missing values or that they appear to contain invalid values. Jenny realizes that some additional work on the data needs to take place before analysis begins.
After completing the reading and exercises in this chapter, you should be able to:
- Explain the concept and purpose of data scrubbing
- List possible solutions for handling missing data
- Explain the role and perform basic methods for data reduction
- Define and handle inconsistent data
- Discuss the importance and process of attribute reduction
Our first task in data preparation is to handle missing data. However, because this will be our first time using RapidMiner, the first few steps will involve getting RapidMiner set up. We’ll then move straight into handling missing data. Missing data are data that do not exist in a data set. Missing data is not the same as zero or some other value. It is blank, and the value is unknown. Missing data are also sometimes known in the database world as null. Depending on your objective in data mining, you may choose to leave missing data as they are, or you may wish to replace missing data with some other value.
To learn about handling missing data in RapidMiner, follow the steps below to connect to your data set and begin modifying it:
- Launch the RapidMiner application. This can be done by double clicking your desktop icon or by finding it in your application menu. The first time RapidMiner is launched, you will get a message about the need to set up a repository.
- For most purposes, a local repository will be sufficient. Click OK to accept the default option to create a new local repository. Select a location on your computer where you will store your data and click “Finish.”
- You may get a notice that updates are available. Don’t worry about the updates for now.
- RapidMiner will open and your window should look like:
- Click “Blank” under the Start a New Project row towards the top.
- Next, click the “Import data button” under the Repository are towards the top-left.
- When the dialogue box opens, you will select “My Computer,” and then navigate to find the Chapter 3 dataset that should already be stored on your computer.
- You will now see that Rapidminer has imported the dataset in a tabular format similar to what it would look like in Excel.
- Click “Next” and you will now see
- Notice how the view is similar to what you saw on the last screen. However, underneath the variable names (Gender, race, etc.), you will now see terms such as “polynominal” and “integer.” These are data types corresponding to the variables. Rapidminer typically guesses these variable types correctly, but not always. In this case, the variable types are all correct, so we don’t need to change them.
- You will also notice as you examine the data that there are some ? symbols. These depict cells that are missing data. We will be addressing these missing data shortly.
- Click “Next” and you will now see that Rapidminer is ready to import the dataset into the local repository. Once the dataset is located in the appropriate location, click “Finish.”
- You will now be taken to the Results view (which looks similar to the import windows in our last steps).
- This dataset only has 11 records, so it is fairly easy to look at it and identify where there are missing data. Yet, in a larger dataset, it wouldn’t be so easy. To identify which variables have missing and/or incorrect data, we can click on the Statistics” view (in the left navigation) to learn more about our data.
- There is a Missing column in this view that shows which variables have missing data. The variables that are missing data include:
- Years on Internet
- Hours Per Day
- Read News
- Online Shopping
- Online Gaming
- Other Social Network
- You’ll also see other statistics associated with the variables, including minimum and maximum values. Applying knowledge of our business problem which we’re ultimately trying to solve as well as our data, we might recognize that some of these minimum or maximum values can’t be right. Inaccurate data can be due to data that were keyed in inaccurately, decisions that were made in terms of what was entered if a value was required but not available (an example of this is if a birthday is required but the data entrant doesn’t know and inputs 01/01/1900 as a default). Entries such as this, while technically correct and allowable from the computer’s perspective, will cause problems when it comes time to develop models and analyses. This is where humans cannot yet be completely replaced by AI…! Let’s click the dropdown arrow related to the Twitter variable. You’ll notice that there are values of N, Y, and 99. 99 can’t be right since the data should only be telling us whether or not the respondent has a Twitter account. This is a data issue that we’ll soon be addressing.
- Click on the Design tab at the top to get back into the main Rapidminer design area, where we’ll locate our Chapter 3 dataset and drag it into the main design area.
- Each rectangle in a process in RapidMiner is an operator. The Retrieve operator simply gets a data set and makes it available for use. The small half-circles on the sides of the operator, and of the Main Process window, are called ports. An output (out) port from our data set’s Retrieve operator is connected to a result set (res) port via a spline. The splines, combined with the operators connected by them, constitute a data mining stream. To run a data mining stream and see the results, click the blue, triangular Play button in the toolbar at the top of the RapidMiner window. This will change your view from Design Perspective to Results Perspective, which shows your stream’s results. When you hit the Play button, you may be prompted to save your process, and you are encouraged to do so. RapidMiner may also ask you if you wish to overwrite a saved process each time it is run, and you can select your preference on this prompt as well.
- After you click the blue triangular play button, you will be taken back to the Results view on the ExampleSet (Retrieve Chapter 03Dataset) tab (exact same view and options as what we saw when we first imported the dataset)
Handling Missing Data
- Click back into Design view
- To find a tool in the Operators area, you can navigate through the folder tree in the lower left hand corner.
- RapidMiner offers many tools, and sometimes, finding the one you want can be tricky. There is a handy search box that enables you to type in keywords to find tools that might do what you need. Type the word ‘missing’ into this box, and you will see that RapidMiner automatically searches for tools with this word in their name. We want to replace missing values, and we can see that operator come up in our search results.
- Let’s add this Replace Missing Values” operator to our stream. Click and hold on the operator name, and drag it up to your spline. When you point your mouse cursor on the spline, the spline will turn slightly bold, indicating that, when you let go of your mouse button, the operator will be connected into the stream. If you let go and the Replace Missing Values operator fails to connect into your stream, you can reconfigure your splines manually. Simply click on the out port in your Retrieve operator, and then click on the exa port on the Replace Missing Values operator. Exa stands for example set, and remember that ‘examples’ is the word RapidMiner uses for observations in a data set. Be sure the exa port from the Replace Missing Values operator is connected to your result set (res) port so that when you run your process, you will have output. Your model should now look like:
- When an operator is selected in RapidMiner, it has an orange rectangle around it. This will also enable you to modify that operator’s parameters, or properties. The Parameters pane is located on the right side of the RapidMiner window. For this exercise, we have decided to change all missing values in the Online_Gaming attribute to be ‘N’, since this is the most common response in that attribute. To do this, change the ‘attribute filter type’ to ‘single’, and you will see that a dropdown box appears, allowing you to choose the Online_Gaming attribute as the target for modification.
- Next, expand the ‘default’ dropdown box, and select ‘value’, which will cause a ‘replenishment value’ box to appear.
- Type the replacement value ‘N’ in this box. Note that you may need to expand your RapidMiner window, or use the vertical scroll bar on the left of the Parameters pane in order to see all options, as the options change based on what you have selected. When you are finished, your parameters should look like
- You should understand that there are many other options available to you in the parameters pane. We will not explore all of them here, but feel free to experiment with them. For example, instead of changing a single attribute at a time, you could change a subset of the attributes in your data set. You will learn much about the flexibility and power of RapidMiner by trying out different tools and features. When you have your parameter set, click the play button. This will run your process and switch you to results perspective once again. Your results should look like
- You can see now that the Online_Gaming attribute has been moved to the front of our list, and that there are zero missing values. Click on the Statistics button, and you will see that the Online_Gaming variable is now populated with only ‘Y’ and ‘N’ values. We have successfully replaced all missing values in that attribute.
- Clicking back to the Data View, take note of how missing values are annotated in other variables, Online_Shopping for example. A question mark (?) denotes a missing value in an observation. Suppose that, for this variable, we do not wish to replace the null values with the mode, but rather, that we wish to remove those observations from our data set prior to mining it. This is accomplished through data reduction.
- Switch back to design perspective. The next set of steps will teach you to reduce the number of observations in your data set through the process of filtering. First, in the search box within the Operators tab, type in “filter examples”
- This will help you locate the ‘Filter Examples’ operator, which is what we will use in this example. Drag the Filter Examples operator over and connect it into your stream, right after the Replace Missing Values operator. Your window will look like
- In the condition class, choose “attribute_value_filter”
- and for the parameter_string, type the following: Online_Shopping=. (be sure to include the period).
- This parameter string refers to our attribute, Online_Shopping, and it tells RapidMiner to filter out all observations where the value in that attribute is missing. This is a bit confusing, because in Data View in results perspective, missings are denoted by a question mark (?), but when entering the parameter string, missings are denoted by a period (.). Once you’ve typed these parameter values in, your screen will look like
- Go ahead and run your model by clicking the play button. In results perspective, you will now see that your data set has been reduced from eleven observations (or examples) to nine.
- This is because the two observations where the Online_Shopping attribute had a missing value have been removed. You’ll be able to see that they’re gone by selecting the Statistics view. They have not been deleted from the original source data, they are simply removed from the data set at the point in the stream where the filter operator is located and will no longer be considered in any downstream data mining operations. In instances where the missing value cannot be safely assumed or computed, removal of the entire observation is often the best course of action. When attributes are numeric in nature, such as with ages or number of visits to a certain place, an arithmetic measure of central tendency, such as mean, median or mode might be an acceptable replacement for missing values, but in more subjective attributes, such as whether one is an online shopper or not, you may be better off simply filtering out observations where the datum is missing.
- Data mining can be confusing and overwhelming, especially when data sets get large. It doesn’t have to be though, if we manage our data well. The previous example has shown how to filter out observations containing undesired data (or missing data) in an attribute. Now, we’ll move on to handling inconsistent data.
Handling Inconsistent Data
- Remember the Twitter variable? The data values are N, Y, and 99. We know that 99 shouldn’t be a valid value, so we’re going to handle that.
- First, let’s delete the Replace Missing Values and Filter Examples operators from the Design view window so that we’ve reverted back to our initial dataset.
- Use the search feature in the Operators tab to find an operator called Replace.
- Drag this operator into your stream.
- Reconnect the splines as follows
- In the parameters pane, change the attribute filter type to single, then indicate Twitter as the attribute to be modified. In truth, in this data set there is only one instance of the value 99 across all attributes and observations, so this change to a single attribute is not actually necessary in this example, but it is good to be thoughtful and intentional with every step in a data mining process. Most data sets will be far larger and more complex that the Chapter 3 data set we are currently working with. In the ‘replace what’ field, type the value 99, since this is the value we’re looking to replace. Finally, in the ‘replace by’ field, we must decide what we want to have in the place of the 99. If we leave this field blank, then the observation will have a missing (?) when we run the model and switch to Data View in results perspective. We could also choose the mode of ‘N’, and given that 80% of the survey respondents indicated that they did not use Twitter, this would seem a safe course of action. You may choose the value you would like to use. For this example, we will enter ‘N’ and then run our model.
- You can see that we now have nine values of ‘N’, and two of ‘Y’ for our Twitter attribute.
- Keep in mind that not all inconsistent data is going to be as easy to handle as replacing a single value. It would be entirely possible that in addition to the inconsistent value of 99, values of 87, 96, 101, or others could be present in a data set. If this were the case, it might take multiple replacements and/or missing data operators to prepare the data set for mining. In numeric data we might also come across data which are accurate, but which are also statistical outliers. These might also be considered to be inconsistent data, so an example in a later chapter will illustrate the handling of statistical outliers. Sometimes data scrubbing can become tedious, but it will ultimately affect the usefulness of data mining results, so these types of activities are important, and attention to detail is critical.
- In many data sets, you will find that some attributes are simply irrelevant to answering a given question. In the next lesson, we will discuss methods for evaluating correlation, or the strength of relationships between given attributes. In some instances, you will not know the extent to which a certain attribute will be useful without statistically assessing that attribute’s correlation to the other data you will be evaluating. In our process stream in RapidMiner, we can remove attributes that are not very interesting in terms of answering a given question without completely deleting them from the data set. Remember, simply because certain variables in a data set aren’t interesting for answering a certain question doesn’t mean those variables won’t ever be interesting. This is why we recommended bringing in all attributes when importing the Chapter 3 data set earlier in this lesson.
- Uninteresting or irrelevant attributes are easy to exclude within your stream by following these steps:
- Return to design perspective.
- In the operator search field, type Select Attribute.
- The Select Attributes operator will appear.
- Drag it onto the end of your stream so that it fits between the Replace operator and the result set port. Your window should look like
- In the Parameters pane, set the attribute filter type to ‘subset’, then click the Select Attributes button.
- A window similar to the following will appear
- Using the blue right and left arrows, you can select which attributes you would like to keep. Suppose we were going to study the demographics of Internet users. In this instance, we might select Birth_Year, Gender, Marital_Status, and Race, and move them to the right under Selected Attributes using the right blue arrow. You can select more than one attribute at a time by holding down your control or shift keys (on a Windows computer) while clicking on the attributes you want to select or deselect. We could then click Apply, and these would be the only attributes we would see in results perspective when we run our model. All subsequent downstream data mining operations added to our model will act only upon this subset of our attributes.
This lesson has introduced you to a number of concepts related to data preparation. Recall that Data Preparation is the third step in the CRISP-DM process. Once you have established Organizational Understanding as it relates to your data mining plans, and developed Data Understanding in terms of what data you need, what data you have, where it is located, and so forth, you can begin to prepare your data for mining. This has been the focus of this lesson.
The lesson used a small and very simple data set to help you learn to set up the RapidMiner data mining environment. We have then stepped through adding Excel/CSV files to a RapidMiner data repository in order to handle missing data, reduce data through observation filtering, handle inconsistencies in data, and reduce the number of attributes in a model.
All of these methods will be used in future lessons to prepare data for modeling. Data mining is most successful when conducted upon a foundation of well-prepared data. Recall the quotation from Chapter 1 from Alice’s Adventures in Wonderland: Which way you go does not matter very much if you don’t know, or don’t care, where you are going. Likewise, the value of where you arrive when you complete a data mining exercise will largely depend upon how well you prepared to get there. Sometimes we hear the phrase “It’s better than nothing.” Well, in data mining, results gleaned from poorly prepared data might be “worse than nothing” because they may be misleading. Decisions based upon them could lead an organization down a detrimental and costly path. Learn to value the process of data preparation, and you will learn to be a better data miner.