Executive Software Engineering Program

EE380L – Data Mining-SE

Spring 2010

 

Instructor:

 

Joydeep Ghosh, Ph.D., Professor and Schlumberger Centennial Chair in Engineering

Email address: ghosh@ece.utexas.edu

URL: http://www.ideal.ece.utexas.edu/~ghosh

 

 

Course Title and Description:

           

Many companies that gather huge amounts of electronic data have begun applying data mining techniques to their data warehouses to discover and extract “hidden” patterns useful for making smart business decisions. Effective data mining requires an understanding of concepts from exploratory data analysis, pattern recognition, machine learning/AI, heterogeneous databases, parallel processing and data visualization, in addition to knowledge of the application domain. I will focus on basic techniques for data mining, including methods useful for analyzing information from the World Wide Web. Demos using industrial-strength software (SAS) will be given, and some applications/case studies will be discussed. The course involves a mid-term exam, a paper presentation and a term project. There will be no final exam. You will also be exposed to MATLAB, which is excellent for working with vectors and matrices and for fast algorithm development.

 

 

Textbooks:

 

Authors: Pang-Ning Tan, Michael Steinbach, and Vipin Kumar (TSK)

Title: Introduction to Data Mining

Publisher: Addison-Wesley (2005)

ISBN: 0-321-32136-7
 

Authors: Ian H. Witten and Eibe Frank (WF)

Title: Data Mining: Practical Machine Learning Tools and Techniques (2nd Ed.)

Publisher: Morgan Kaufmann (2005)

ISBN: 0-12-088407-0

 

 

 

Course Expectations:

 

This course requires students to have an undergraduate-level understanding of some basic concepts from probability/statistics, data analysis and linear algebra. This is a graduate course, so the workload will be medium.

While studying techniques for database representation/modeling, clustering, classification, finding associations and sequence processing, emphasis will be placed on the issues of algorithm scalability, performance, interpretability and the ability to deal with noisy or poor-quality data. Student talks of 10-15 minutes each will be interwoven with the lectures, depending on class size. The last two classes will largely consist of student term-project presentations, followed by active discussion.

 

 


 

 

 

 

 

 

 

 

Class outline:

Introduction – January 15 and 16

            Reading Assignment: TSK Ch 1-3; Appendix A, B, C, D; WF Ch 1, 2, 7.1-7.3

            Area of study: overview, SAS demos; data quality and pre-processing; intro to regression.

February 12 and 13

            Reading Assignment: TSK Ch 4, 5; parts of Ch 6; WF rest of Ch 4-6 (Ch 3.4)

            Area of study: predictive modeling; classification methods; finding association rules.

March 12 and 13

            Reading Assignment: TSK Ch 8, 9; WF 3.9, 4.8, 6.6

            Area of study: clustering/segmentation; market basket analysis.

April 9 and 10

            Reading Assignment: TSK Ch 5.6; WF 7.5 and 8.3

            Area of study: combining multiple models; intro to web analytics: analyzing hyperlink structure.

May 14 and 15

            Reading Assignment: notes

            Area of study: web analytics (contd.): analyzing content and usage of web sites; project presentations; course wrap-up; the future of data mining.

 

Grading Information:

 

35% final project

30% written homework

20% mid-term

10% brief presentation of a research paper/topic (groups of 2; list provided; dropped for large class sizes)

 5% class participation