Problem I – (10 points)
What is data mining? In your answer, address the following:
– Is it another fad?
– Of the three prerequisite data science skills (database management, statistics, and machine learning), which are the most important to master?
– Explain how the evolution of database technology led to data mining.
– Describe the steps involved in data mining when viewed as a process of knowledge discovery.
Problem II – (10 points)
Robust data loading poses a challenge in database systems because the input data are often dirty. In many cases, an input record may have several missing values and some records could be contaminated (i.e., with some data values out of range or of a different data type than expected). Work out a step-by-step data cleaning and loading procedure so that the erroneous data will be marked and contaminated data will not be mistakenly inserted into the database during data loading.
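For illustration only (not a model answer), the distinction drawn above between missing values, wrong data types, and out-of-range values can be sketched as a validation pass that runs before any insert. The schema, field names (`age`, `income`), and ranges below are hypothetical assumptions chosen for the example; a real procedure would take them from the target database's definitions.

```python
# Hypothetical schema (assumed for illustration): field -> (type, min, max).
SCHEMA = {
    "age": (int, 0, 120),
    "income": (float, 0.0, 1e9),
}

def validate(record):
    """Return a list of problems found in one record (empty if clean)."""
    problems = []
    for field, (ftype, lo, hi) in SCHEMA.items():
        raw = record.get(field)
        if raw is None or raw == "":
            problems.append(f"{field}: missing")
            continue
        try:
            value = ftype(raw)  # attempt conversion to the expected type
        except (TypeError, ValueError):
            problems.append(f"{field}: wrong type")
            continue
        if not lo <= value <= hi:
            problems.append(f"{field}: out of range")
    return problems

def split_records(records):
    """Separate clean records from rejected ones, keeping the reasons."""
    clean, rejected = [], []
    for rec in records:
        problems = validate(rec)
        if problems:
            rejected.append((rec, problems))  # marked, never inserted
        else:
            clean.append(rec)
    return clean, rejected
```

In this sketch only the `clean` list would be passed on to the database insert, while `rejected` keeps each bad record together with the reasons it was marked.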
Problem III – (10 points)
Outline the major steps of decision tree classification.
Problem IV – (10 points)
a. Compare the advantages and disadvantages of eager classification (e.g., decision tree, Bayesian, neural network) versus lazy classification (e.g., k-nearest neighbor, case-based reasoning).
b. Create a hypothetical example for one of the classifiers discussed in part a.
Problem V – (10 points)
Association rule mining often generates a large number of rules. Name at least one effective method that can be used to reduce the number of rules generated while still preserving most of the interesting rules.
Problem VI – (50 points)
You are a consultant working for the company “Data Mining R Us.” Your client is a major luxury automobile manufacturer, Lexcedes. They have come up with a brand-new model called the “Chimera” and they want to market the car to young, filthy rich individuals. In addition to their own company databases, Lexcedes has purchased a large collection of databases containing historical information about people, their attributes, and what they buy. They want to use data mining to help sell their new model.
Describe in detail a comprehensive step-by-step data mining procedure you would follow if you were given this task. Make sure that your answer reflects the situation stated above (in other words, do not give a generic answer). State your assumptions.