Executive Software Engineering Program
EE380L – Data Mining-SE
Spring 2010
Instructor: Joydeep Ghosh, Ph.D., Professor and Schlumberger Centennial Chair in Engineering
Email address: ghosh@ece.utexas.edu
URL: http://www.ideal.ece.utexas.edu/~ghosh
Course Title and Description:
Many companies that gather huge amounts of electronic data have now begun applying data mining techniques to their data warehouses to discover and extract “hidden” patterns useful for making smart business decisions. Effective data mining requires an understanding of concepts from exploratory data analysis, pattern recognition, machine learning/AI, heterogeneous databases, parallel processing and data visualization, in addition to knowledge of the application domain. I will focus on basic techniques for data mining, including methods useful for analyzing information from the World Wide Web. Demos using industrial-strength software (SAS) will be given, and some applications/case studies will be discussed. The course involves a mid-term exam, a paper presentation and a term project. There will be no final exam. You will also be exposed to Matlab, which is excellent for working with vectors and matrices, and for fast algorithm development.
Textbooks:
Author: Pang-Ning Tan, Michael Steinbach, and Vipin Kumar (TSK)
Title: Introduction to Data Mining
Publisher: Addison-Wesley (2005)
ISBN: 0-321-32136-7
Author: Witten and Frank (WF)
Title: Data Mining (2nd Ed)
Publisher: Morgan Kaufmann (2005)
ISBN: 0-12-088407-0
Course Expectations:
This course requires students to have an undergraduate-level understanding of some basic concepts from probability/statistics, data analysis and linear algebra. This is a graduate course, so the workload will be medium.
While studying techniques for database representation/modeling, clustering, classification, finding associations and sequence processing, emphasis will be placed on the issues of algorithm scalability, performance, interpretability and the ability to deal with garbage data. 10-15 minute student talks will be interwoven with the lectures, depending on class size. The last two classes will largely consist of student term-project presentations, followed by active discussion.
Class outline:
Introduction – January 15 and 16
Reading Assignment: TSK ch 1-3; Appendix A, B, C, D; WF ch 1, 2, 7.1-7.3
Area of study: overview; SAS demos; data quality and pre-processing; intro to regression.
February 12 and 13
Area of study: predictive modeling; classification methods; finding association rules.
March 12 and 13
Reading Assignment: TSK ch 8, 9; WF 3.9, 4.8, 6.6
Area of study: clustering/segmentation; market basket analysis.
April 9 and 10
Reading Assignment: TSK ch 5.6; WF 7.5 and 8.3
Area of study: combining multiple models; intro to web analytics: analyzing hyperlink structure.
May 14 and 15
Reading Assignment: notes
Area of study: web analytics (contd.): analyzing content and usage of web sites; project presentations; course wrap-up; the future of data mining.
Grading Information:
35% final project
30% written homework
20% mid-term
10% brief presentation of research paper/topic (groups of 2; list provided) (dropped for large class sizes)
5% class participation