Some papers are on the ML 101 web site:

http://machinelearning102.pbworks.com/w/browse/#view=ViewFolder&param=Papers%20for%20Presentation

The Breiman paper is on bagging. Breiman taught at Berkeley and worked with Jerome Friedman.

The paper by Fisher on the Iris data is one of the original classic papers.

Remember that the wine data set and the glass data set had few examples of some of the targets. Those are unbalanced data sets, and there is a paper on handling them. There is a paper on the use of SVMs to distinguish between more than two classes of data. There is also a paper on Bayesian belief networks, which are built using Bayes' theorem. Finally, there is a survey paper on statistical pattern recognition.
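A quick way to see this kind of imbalance is to count how often each target value occurs. The sketch below uses a small made-up label list (not the actual wine or glass targets), and the helper name class_proportions is hypothetical.

```python
from collections import Counter

def class_proportions(labels):
    """Return {label: fraction of examples carrying that label}."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Made-up targets for illustration only; not the real wine or glass labels.
labels = ["a"] * 8 + ["b"] * 1 + ["c"] * 1
proportions = class_proportions(labels)

# Print each class with its share of the data set.
for label, share in sorted(proportions.items()):
    print(f"{label}: {share:.0%}")
```

Classes with a very small share are the ones a classifier is most likely to ignore, which is the problem the unbalanced-data paper addresses.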

Here are some more interesting papers on clustering and anomaly detection:

1) A. K. Jain, M. N. Murty, and P. J. Flynn. Data Clustering: A review. ACM Computing Surveys, 31(3): 264-323, September 1999

http://portal.acm.org/citation.cfm?id=331504

Free with ACM sign-in. PDF attached as ClusteringSurveyp264-jain.pdf

This is a general survey of clustering techniques that many other papers cite. It reviews many topics the class covered and includes examples from computer vision.

2) J. M. Kleinberg. An Impossibility Theorem for Clustering. In Proc. of the 16th Annual Conf. on Neural Information Processing Systems, December 9-14, 2002

http://arxiv.org/PS_cache/arxiv/pdf/1011/1011.5270v2.pdf attached ClassifyingDluste...pdf

Discusses some of the trade-offs that clustering algorithms make and proves that it is impossible for a clustering algorithm to simultaneously possess three simple properties: scale-invariance, richness, and consistency.
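For reference, the three properties can be written out as follows. This is a paraphrase in notation chosen here, where f maps a distance function d on a point set S to a partition of S; consult the paper for the exact definitions.

```latex
\begin{align*}
\text{Scale-invariance:}\quad & f(\alpha \cdot d) = f(d) \quad \text{for every } \alpha > 0 \\
\text{Richness:}\quad & \operatorname{Range}(f) \text{ is the set of all partitions of } S \\
\text{Consistency:}\quad & f(d') = f(d) \text{ whenever }
  d'(i,j) \le d(i,j) \text{ for } i, j \text{ in the same cluster of } f(d) \\
& \text{and } d'(i,j) \ge d(i,j) \text{ for } i, j \text{ in different clusters of } f(d)
\end{align*}
```

The theorem says no clustering function satisfies all three at once, although any two can be satisfied together.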

3) P. S. Bradley, K. P. Bennett, and A. Demiriz. Constrained K-Means Clustering. Microsoft Research Technical Report MSR-TR-2000-65

http://research.microsoft.com/pubs/69796/tr-2000-65.pdf attached as ConstrainedKMean...pdf

This k-means variant avoids local solutions with empty or very small clusters by requiring each cluster to contain at least a minimum number of points. The paper targets the case where the number of dimensions is > 10 and the desired number of clusters is k > 20. The results are less prone to poor local solutions and produce a better summary of the underlying data.
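One way to picture a minimum-cluster-size constraint is to look at the assignment step on its own. The sketch below is not the paper's method: it swaps in brute-force enumeration, which is only feasible for a handful of points, purely to show what the constraint changes. The function name constrained_assign, the one-dimensional points, and the centers are all made up for illustration.

```python
from itertools import product

def constrained_assign(points, centers, min_size):
    """Cheapest assignment of 1-D points to centers such that every
    cluster receives at least min_size points (exhaustive search)."""
    k = len(centers)
    best_labels, best_cost = None, float("inf")
    for labels in product(range(k), repeat=len(points)):
        # Skip assignments that leave some cluster too small.
        if any(labels.count(c) < min_size for c in range(k)):
            continue
        # Sum of squared distances from each point to its assigned center.
        cost = sum((p - centers[c]) ** 2 for p, c in zip(points, labels))
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels

points = [0.0, 0.1, 0.2, 0.3, 5.0]
centers = [0.15, 9.0]

unconstrained = constrained_assign(points, centers, min_size=0)
constrained = constrained_assign(points, centers, min_size=2)
print(unconstrained)  # (0, 0, 0, 0, 1): the second cluster gets a single point
print(constrained)    # (0, 0, 0, 1, 1): the constraint pulls in one more point
```

With min_size=0 this reduces to the ordinary nearest-center assignment of k-means, so the two calls show the effect of the constraint directly.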

4) Alexander Strehl and Joydeep Ghosh. Cluster Ensembles - A Knowledge Reuse Framework for Combining Multiple Partitions. Journal of Machine Learning Research 3 (2002) 583-617, published 12/02

http://strehl.com/download/strehl-jmlr02.pdf attached as ClusterEnsembles..pdf

This paper introduces the problem of combining multiple partitionings of a set of objects into a single consolidated clustering without accessing the features or algorithms that determined these partitionings.
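To make "without accessing the features" concrete, the sketch below builds a co-association matrix from cluster labels alone. This is a common building block for consensus clustering and is shown only as an illustration, not as the paper's full procedure; the function name co_association and the three example partitions are made up.

```python
def co_association(partitions):
    """Fraction of partitions in which each pair of objects shares a cluster.

    partitions: list of label lists, one per clustering, all the same length.
    Only the labels are used; the original features are never needed.
    """
    n = len(partitions[0])
    matrix = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += 1.0
    return [[value / len(partitions) for value in row] for row in matrix]

partitions = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],  # same grouping as the first, with the label names swapped
    [0, 0, 0, 1],
]
similarity = co_association(partitions)
for row in similarity:
    print([round(value, 2) for value in row])
```

Because the matrix depends only on which objects are grouped together, the arbitrary label names used by each clustering do not matter. A consolidated clustering can then be obtained by clustering this similarity matrix.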

5) Ying Zhao and George Karypis. Data Clustering in Life Sciences. Molecular Biotechnology, 31(1), pp. 55-80, 2005

http://www-users.cs.umn.edu/~yzhao/ attached as Clusterbio...pdf

The primary goal of this article is to provide an overview of the various issues involved in clustering large datasets, describe the merits and underlying assumptions of some of the commonly used clustering approaches, and provide insights on how to cluster datasets arising in various areas within life sciences.

6) Fionn Murtagh. Clustering in Massive Data Sets.

http://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/Murtagh.pdf attached as ClustBigData...pdf

The paper analyzes the time and storage costs of clustering algorithms and presents adaptations of several clustering algorithms for big data sets, along with visual and image representations. Although this paper is older (1999), it sheds light on the scalability limitations of various methods.

7) Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly Detection: A Survey. Technical Report 07-017, University of Minnesota, Aug 15, 2007

This survey provides a structured and comprehensive overview of the research on anomaly detection. It discusses the different directions in which research has been done and how techniques developed in one area can be applied to other domains. The authors conclude with specific examples from 1) intrusion detection in computer networks and sensor networks, 2) fraud detection (identity theft, credit card fraud, mobile phone account misuse, insurance claim fraud, and insider trading), 3) medical and public health anomaly detection, 4) fault detection in mechanical and industrial components, 5) image processing, and 6) text processing.

http://www.cs.umn.edu/tech_reports_upload/tr2007/07-017.pdf attached as AnomalyDetSurvey...pdf
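As a small illustration of the simplest statistical style of anomaly detection, the sketch below flags values whose robust z-score (based on the median and the median absolute deviation) is large. It is a generic textbook technique chosen for illustration, not a method taken from the survey; the function name find_outliers, the sample readings, and the 3.5 threshold (a commonly used rule of thumb) are assumptions.

```python
from statistics import median

def find_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds threshold.

    The modified z-score uses the median and the median absolute
    deviation (MAD), so a single extreme value cannot mask itself
    by inflating the spread estimate.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against; nothing is flagged
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]

# Made-up sensor readings with one obvious anomaly.
readings = [10.0, 10.2, 9.9, 10.1, 10.0, 25.0]
outliers = find_outliers(readings)
print(outliers)  # [25.0]
```

Real applications such as intrusion or fraud detection usually need richer models, but the same idea applies: define what normal looks like, then score how far each observation departs from it.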

8) Pradeep Mohan, Shashi Shekhar, James A. Shine, James P. Rogers. Cascading spatio-temporal pattern discovery: A summary of results. (To appear) In Proc. of 10th SIAM International Data Mining (SDM) 2010, Columbus, OH, USA.

http://www.spatial.cs.umn.edu/paper_ps/mohan-shekhar-shine-rogers.pdf attached as SpatioTemp...pdf

Although this paper is not strictly about clustering, it offers an interesting approach to multi-resolution spatio-temporal data, which makes it a follow-up to the synthetic control data set problems.
