Similarity and Distance Measures
Similarity and distance measures are used in data mining to quantify how alike or how different objects, data points, or groups are. The choice of measure depends on the type of data being analyzed and the specific problem being addressed. Here are some commonly used similarity and distance measures (a short code sketch implementing them follows the list):
Euclidean distance: This is a commonly used distance measure for continuous data, which calculates the straight-line distance between two points in a multi-dimensional space.
Manhattan distance: Also known as the taxicab distance, this measure calculates the distance between two points by summing the absolute differences between their coordinates.
Cosine similarity: This measure computes the cosine of the angle between two non-zero vectors, capturing orientation rather than magnitude, and is commonly used in text analysis, document clustering, and recommendation systems.
Jaccard similarity: This measure compares two sets as the size of their intersection divided by the size of their union and is commonly used in clustering and classification problems.
Hamming distance: This distance measure counts the number of positions at which two binary strings of the same length differ.
Pearson correlation coefficient: This measure quantifies the strength and direction of the linear relationship between two variables, ranging from -1 to 1.
Minkowski distance: This distance measure generalizes the Euclidean and Manhattan distances through an order parameter p (p = 1 gives Manhattan, p = 2 gives Euclidean) and can be used for continuous data.
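To make these definitions concrete, here is a minimal Python/NumPy sketch of each measure. The function names and signatures are illustrative rather than drawn from any particular library; production code would typically use tested implementations such as those in SciPy or scikit-learn.

```python
import numpy as np

def euclidean(x, y):
    # Straight-line (L2) distance between two points.
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # Taxicab (L1) distance: sum of absolute coordinate differences.
    return np.sum(np.abs(x - y))

def minkowski(x, y, p):
    # Generalizes Manhattan (p = 1) and Euclidean (p = 2).
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def cosine_similarity(x, y):
    # Cosine of the angle between two non-zero vectors.
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def jaccard_similarity(a, b):
    # |intersection| / |union| for two sets.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def hamming(s, t):
    # Number of positions at which two equal-length strings differ.
    assert len(s) == len(t)
    return sum(c1 != c2 for c1, c2 in zip(s, t))

def pearson(x, y):
    # Linear correlation between two variables, in [-1, 1].
    return np.corrcoef(x, y)[0, 1]
```

Note how minkowski(x, y, 1) reproduces manhattan(x, y) and minkowski(x, y, 2) reproduces euclidean(x, y), reflecting the generalization described above.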
These measures underpin many data mining techniques, such as clustering, classification, and recommendation systems.
Partitioned Clustering Algorithms
Partitioned clustering algorithms are another commonly used clustering technique in data mining. Unlike hierarchical clustering, which builds a tree-like structure of nested clusters, partitioned clustering algorithms divide the data points directly into a predetermined number of clusters. Two widely used partitioned clustering algorithms are k-means and fuzzy c-means.
K-means: K-means is a centroid-based clustering algorithm that partitions the data points into a predetermined number of clusters, k. The algorithm starts by selecting k initial centroids (often k randomly chosen data points), assigns each point to its nearest centroid, updates each centroid to the mean of the points assigned to it, and repeats the assign-and-update steps until the centroids no longer move (or move less than a small tolerance).
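As a sketch of the loop just described, here is a minimal k-means in Python with NumPy. The function name, parameters, and the initialization scheme (sampling k distinct data points) are illustrative assumptions; real applications would typically use a library implementation such as scikit-learn's KMeans.

```python
import numpy as np

def kmeans(X, k, n_iters=100, tol=1e-6, seed=0):
    """Minimal k-means sketch: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by sampling k distinct data points (illustrative choice).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points;
        # keep the old centroid if a cluster happens to be empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids have effectively stopped moving.
        converged = np.linalg.norm(new_centroids - centroids) < tol
        centroids = new_centroids
        if converged:
            break
    return centroids, labels
```

Because the result depends on the initial centroids, k-means is often run several times with different seeds, keeping the clustering with the lowest within-cluster distance.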
Fuzzy c-means: Fuzzy c-means is a soft clustering algorithm that assigns a membership value to each point indicating the degree of membership in each cluster. Unlike k-means, which assigns each point to a single cluster, fuzzy c-means allows points to belong to multiple clusters to varying degrees. The algorithm works by initializing the membership values for each point and then iteratively updating the cluster centers and membership values until convergence.
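Below is a corresponding fuzzy c-means sketch in NumPy, using the standard update rules: cluster centers are membership-weighted means, and memberships are proportional to inverse distances raised to the power 2/(m - 1). The fuzzifier m and the random membership initialization are illustrative defaults.

```python
import numpy as np

def fuzzy_cmeans(X, c, m=2.0, n_iters=100, tol=1e-5, seed=0):
    """Minimal fuzzy c-means sketch: X is (n, d); m > 1 controls fuzziness."""
    rng = np.random.default_rng(seed)
    # Initialize the membership matrix U so each row sums to 1.
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iters):
        # Centers are membership-weighted means with weights u_ij ** m.
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Distances from every point to every center (epsilon avoids divide-by-zero).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        # Membership update: u_ij = 1 / sum_l (d_ij / d_il) ** (2 / (m - 1)).
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        # Converged when memberships change by less than the tolerance.
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U
```

Each row of U gives one point's degree of membership in every cluster; hardening the result (for example, U.argmax(axis=1)) recovers a k-means-style assignment.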
Partitioned clustering algorithms have several advantages over hierarchical clustering, including their ability to handle large datasets and their faster computation times. However, they can be sensitive to the initial choice of cluster centers and may not be suitable for datasets with irregularly shaped clusters or overlapping data points.
Overall, partitioned clustering algorithms are a powerful tool in data mining for dividing large datasets into a predetermined number of clusters. As with the measures above, the right choice of algorithm depends on the structure of the data and the goals of the analysis.