MATRUSRI ENGINEERING COLLEGE
(An Autonomous Institution)
(Sponsored by: Matrusri Education Society, Estd: 1980)
(Approved by AICTE and Affiliated to Osmania University)
#16-1-486, Saidabad, Hyderabad - 500059. Ph: 040-24072764

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING (DATA SCIENCE)
(Accredited by NBA and NAAC)

LAB MANUAL
Mining of Massive Datasets Lab (PC552DSU23)
III B.E SEMESTER - I (2025-2026)

DEPARTMENT VISION
The Computer Science and Engineering Department aims to produce competent professionals with strong analytical skills, technical skills, research aptitude and ethical values.

MISSION
M1: To provide hands-on experience and problem-solving skills by imparting quality education.
M2: To conduct skill-development programs in emerging technologies to serve the needs of industry, society and the scientific community.
M3: To promote comprehensive education and professional development for effective teaching-learning processes.
M4: To impart project management skills with an attitude for life-long learning with ethical values.

PROGRAMME EDUCATIONAL OBJECTIVES (PEOs)
1. To learn engineering knowledge and problem analysis skills to design and develop solutions for computer science and engineering problems.
2. To address feature engineering with the usage of modern IT and software tools.
3. To acquire and practice the profession with due consideration to environmental issues, in conformance with societal needs and ethical values.
4. To manage projects in multidisciplinary environments as a member and as a leader with effective communication.
5. To engage in life-long learning in the context of ever-changing technology.

PROGRAMME OUTCOMES (POs)
Upon completion of the programme, the student will be able to:
1. Engineering knowledge: Apply and integrate the knowledge of computing to computer science and engineering problems.
2. Problem analysis: Identify, formulate and analyze complex engineering problems using computer science and engineering knowledge.
3. Design/development of solutions: Design and develop components or processes for engineering problems as per specification, with environmental consideration.
4. Conduct investigations of complex problems: Interpret and integrate information to provide solutions to real-world problems.
5. Modern tool usage: Select and apply modern engineering and information technology tools for complex engineering problems.
6. The engineer and society: Assess and take responsibility for societal, health, safety, legal and cultural issues in professional practice.
7. Environment and sustainability: Understand the impact of computing solutions in the context of societal, environmental and economic development.
8. Ethics: Commit to professional ethics, responsibilities and norms of engineering practice.
9. Individual and team work: Function as an individual, and as a member or leader in multidisciplinary environments.
10. Communication: Acquire effective written and oral communication skills on technical and general aspects.
11. Project management and finance: Apply engineering and management principles to manage projects in multidisciplinary environments.
12. Life-long learning: Identify the need for self-learning and life-long learning in the broad context of technological evolution.

PROGRAM SPECIFIC OUTCOMES (PSOs)
Upon completion of the programme, the student will be able to:
1. Work with open-ended programming environments to develop software applications.
2. Apply the knowledge of computer system design, principles of algorithms and computer communications to manage projects in multidisciplinary environments.

COURSE OBJECTIVES AND COURSE OUTCOMES

Course Objectives:
1. To provide hands-on experience in working with massive datasets and various data streaming tools.
2. To implement data preprocessing techniques and evaluate them using statistical methods for handling large-scale data.
3. To apply frequent pattern mining, clustering, and data stream algorithms to extract meaningful insights from big data.
4. To develop skills in link analysis and social network mining, enabling students to analyze web and social media data effectively.

Course Outcomes:
On completion of this course, the student will be able to:
1. Explain the fundamental statistical and computational approaches in data mining and the similarity measures used for finding similar items.
2. Implement frequent itemset mining algorithms for discovering associations and correlations in large datasets.
3. Apply clustering techniques to group similar data points.
4. Develop methods for mining data streams, efficiently computing PageRank, and performing link analysis to understand web connectivity.
5. Utilize data mining techniques to analyze social network graphs and optimize online advertising strategies.

SYLLABUS
1. Introduction of various massive datasets and data streaming tools.
2. Implementation of pre-processing techniques and evaluation with various statistical methods for any given raw data.
3. Implementation of association rule mining - Apriori algorithm, FP-Growth algorithm.
4. Implementation of clustering algorithms
   Partitioning algorithm: K-means algorithm
   Hierarchical clustering: BIRCH algorithm, CURE algorithm, Agglomerative algorithm
   Density-based clustering: DBSCAN algorithm
   Implementation of data stream algorithm - Bloom filter
5. Implementation of link analysis algorithms - PageRank.
6. Case studies - Mining of social networks; Mining of advertising on the Web.

List of Programs
1. Introduction of various massive datasets and data streaming tools
2. Implementation of pre-processing techniques and evaluation with various statistical methods for any given raw data
3. Implementation of association rule mining - FP-Growth algorithm
4. Implementation of clustering algorithms - Partitioning algorithm: K-means algorithm
5. Implementation of clustering algorithms - Hierarchical clustering: BIRCH algorithm
6. Implementation of clustering algorithms - CURE algorithm
7. Implementation of clustering algorithms - Density-based clustering: DBSCAN algorithm
8. Implementation of clustering algorithms - Agglomerative algorithm
9. Implementation of data stream algorithm - Bloom filter
10. Implementation of link analysis algorithms - PageRank
11. Case study on Mining of Social Networks
12. Case study on Mining of Advertising on the Web

1. Introduction of various massive datasets and data streaming tools

Here is an introduction to various massive datasets and data streaming tools commonly used in data science, machine learning, and big data analytics.

Massive Datasets
Massive datasets, also known as big data, are typically characterized by the 3 Vs: Volume, Velocity, and Variety. These datasets often come from domains such as social media, the Internet of Things (IoT), scientific research, healthcare, finance, and more.
1. Common Examples of Massive Datasets

Open Datasets
• ImageNet - Large image dataset used in computer vision research.
• Common Crawl - Web crawling data collected and made freely available.
• Google Open Images - Annotated images for machine learning.
• Amazon Reviews - Customer reviews for products over many years.
• Enron Email Dataset - Large set of email data for NLP tasks.
• OpenStreetMap (OSM) - Geographic data for mapping and navigation.
• MIMIC-III - Critical care health data (de-identified).

Scientific Datasets
• CERN LHC Data - Particle physics data from the Large Hadron Collider at CERN (the European Organization for Nuclear Research).
• NASA (National Aeronautics and Space Administration) Earth Observing System Data - Satellite and environmental data.
• Human Genome Project - Genetic sequencing data.

AI/ML Datasets
• COCO (Common Objects in Context) - Object detection and segmentation.
• MNIST / CIFAR-10 - Classic datasets for image classification (MNIST: Modified National Institute of Standards and Technology database; CIFAR: Canadian Institute for Advanced Research).
• The Pile - A large-scale dataset for training language models.

Data Streaming Tools
Data streaming tools are used to process data in real time or near real time. This is essential for systems where decisions must be made instantly (e.g., fraud detection, live analytics, IoT). A minimal producer/consumer sketch is shown after the use-case table below.

1. Apache Kafka
• Distributed event streaming platform.
• Excellent for building real-time data pipelines.
• High throughput and scalability.

2. Apache Flink
• Stream-processing framework.
• Handles batch and stream data in one system.
• Provides exactly-once semantics and stateful computations.

3. Apache Spark Streaming
• Extension of Apache Spark for real-time data processing.
• Uses mini-batch processing.
• Integrates well with other big data tools.

4. Apache Pulsar
• Similar to Kafka but with additional features such as multi-tenancy and geo-replication.
• Supports both streaming and queuing.

5. Amazon Kinesis
• AWS cloud-native streaming service.
• Integrates easily with the AWS ecosystem.
• Used for real-time analytics and machine learning.

6. Google Cloud Dataflow
• Fully managed stream and batch data processing service.
• Based on Apache Beam.
• Unified model for real-time and batch processing.

7. Redis Streams
• Lightweight, in-memory streaming.
• Useful for event-driven microservices and real-time dashboards.

Use Cases for Datasets and Streaming

Domain        | Dataset Example         | Streaming Tool         | Use Case
Social Media  | Twitter API Data        | Kafka, Spark Streaming | Real-time sentiment analysis
Finance       | Stock Market Feeds      | Flink, Kafka           | Fraud detection, algorithmic trading
Healthcare    | MIMIC-III, Genomic Data | Google Dataflow        | Real-time patient monitoring
IoT           | Sensor Data             | AWS Kinesis, Pulsar    | Smart city analytics, predictive maintenance
E-commerce    | Amazon Reviews          | Kafka, Flink           | Recommendation systems, live product insights
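The sketch below illustrates the produce/consume pattern that tools such as Kafka expose. It is a minimal example using the kafka-python package; it assumes a Kafka broker is already running at localhost:9092, and the topic name 'sensor-readings' and the message fields are illustrative only.

# Install required package (run this first in your terminal/command prompt)
# pip install kafka-python
import json
import time
from kafka import KafkaProducer, KafkaConsumer

# Producer: publish a few JSON-encoded readings to a (hypothetical) topic
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)
for i in range(5):
    reading = {'sensor_id': i, 'temperature': 20 + i, 'timestamp': time.time()}
    producer.send('sensor-readings', value=reading)
producer.flush()

# Consumer: read the messages back from the beginning of the topic
consumer = KafkaConsumer(
    'sensor-readings',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',
    consumer_timeout_ms=5000,  # stop iterating if no message arrives for 5 seconds
    value_deserializer=lambda v: json.loads(v.decode('utf-8'))
)
for message in consumer:
    print(message.value)

The same pattern, producing events to a named stream and consuming them independently, carries over to Pulsar, Kinesis and Redis Streams with their respective client libraries.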
2. Implementation of pre-processing techniques and evaluation with various statistical methods for any given raw data

# Install dependencies (run this first in your terminal/command prompt)
# pip install mlxtend

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Sample dataset (transaction-style)
dataset = [
    ['milk', 'bread', 'butter'],
    ['bread', 'butter'],
    ['milk', 'bread'],
    ['milk', 'bread', 'butter', 'jam'],
    ['bread', 'jam']
]

# Convert to DataFrame format via one-hot encoding
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Step 1: Find frequent itemsets
frequent_itemsets = apriori(df, min_support=0.5, use_colnames=True)
print("Frequent Itemsets:\n", frequent_itemsets)

# Step 2: Generate association rules
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print("\nAssociation Rules:\n", rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

OUTPUT:
Python 3.12.5 (tags/v3.12.5:ff3bc82, Aug 6 2024, 20:45:27) [MSC v.1940 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license()" for more information.
= RESTART: C:\Users\admin\Desktop\PYTHON PROGRAMS\prog2.1.py
Frequent Itemsets:
    support         itemsets
0       1.0          (bread)
1       0.6         (butter)
2       0.6           (milk)
3       0.6  (butter, bread)
4       0.6    (milk, bread)

Warning (from warnings module):
  File "C:\Users\admin\AppData\Local\Programs\Python\Python312\Lib\site-packages\mlxtend\frequent_patterns\association_rules.py", line 186
    cert_metric = np.where(certainty_denom == 0, 0, certainty_num / certainty_denom)
RuntimeWarning: invalid value encountered in divide

Association Rules:
  antecedents consequents  support  confidence  lift
0    (butter)     (bread)      0.6         1.0   1.0
1      (milk)     (bread)      0.6         1.0   1.0
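The listing above demonstrates Apriori-based association mining with mlxtend. For the pre-processing and statistical evaluation named in the experiment title, a minimal sketch is given below; the raw DataFrame, its column names, and the chosen techniques (mean imputation, min-max scaling, summary statistics, correlation) are illustrative assumptions rather than a prescribed solution.

# Install dependencies (run this first in your terminal/command prompt)
# pip install pandas scikit-learn
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Small illustrative raw dataset (values and column names are hypothetical)
raw = pd.DataFrame({
    'age':    [25, 32, None, 45, 29, 38],
    'salary': [30000, 54000, 41000, None, 38000, 62000]
})

# Pre-processing: fill missing values with the column mean, then scale to [0, 1]
cleaned = raw.fillna(raw.mean(numeric_only=True))
scaled = pd.DataFrame(MinMaxScaler().fit_transform(cleaned), columns=cleaned.columns)

# Statistical evaluation: summary statistics and correlation of the cleaned data
print("Summary statistics (cleaned):\n", cleaned.describe())
print("\nCorrelation matrix:\n", cleaned.corr())
print("\nScaled data:\n", scaled)

Depending on the raw data supplied, typical extensions include outlier removal, encoding of categorical attributes, and hypothesis testing.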
3. Implementation of Association Rule Mining - FP-Growth Algorithm

# Install required package (run this first in your terminal/command prompt)
# pip install mlxtend

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

# Sample transaction data
dataset = [
    ['milk', 'bread', 'butter'],
    ['bread', 'butter'],
    ['milk', 'bread'],
    ['milk', 'bread', 'butter', 'jam'],
    ['bread', 'jam']
]

# Convert to one-hot encoded DataFrame
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Step 1: Generate frequent itemsets using FP-Growth
frequent_itemsets = fpgrowth(df, min_support=0.4, use_colnames=True)
print("Frequent Itemsets:\n", frequent_itemsets)

# Step 2: Generate association rules
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print("\nAssociation Rules:\n", rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

OUTPUT:
Python 3.12.5 (tags/v3.12.5:ff3bc82, Aug 6 2024, 20:45:27) [MSC v.1940 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license()" for more information.
= RESTART: C:\Users\admin\Desktop\MECS ACADEMICS\2025-26\MMD\PYTHON PROGRAMS\prog3.py
Frequent Itemsets:
    support               itemsets
0       1.0                (bread)
1       0.6                 (milk)
2       0.6               (butter)
3       0.4                  (jam)
4       0.6          (milk, bread)
5       0.6        (butter, bread)
6       0.4         (butter, milk)
7       0.4  (butter, milk, bread)
8       0.4           (jam, bread)

Warning (from warnings module):
  File "C:\Users\admin\AppData\Local\Programs\Python\Python312\Lib\site-packages\mlxtend\frequent_patterns\association_rules.py", line 186
    cert_metric = np.where(certainty_denom == 0, 0, certainty_num / certainty_denom)
RuntimeWarning: invalid value encountered in divide

Association Rules:
      antecedents consequents  support  confidence  lift
0          (milk)     (bread)      0.6         1.0   1.0
1        (butter)     (bread)      0.6         1.0   1.0
2  (butter, milk)     (bread)      0.4         1.0   1.0
3           (jam)     (bread)      0.4         1.0   1.0

4. Implementation of Clustering Algorithms - Partitioning Algorithm: K-Means Algorithm

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate sample data
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

class KMeans:
    def __init__(self, k=4, max_iters=100, tolerance=1e-4):
        self.k = k
        self.max_iters = max_iters
        self.tolerance = tolerance

    def fit(self, X):
        # Randomly initialize centroids
        np.random.seed(42)
        random_idxs = np.random.permutation(X.shape[0])[:self.k]
        self.centroids = X[random_idxs]
        for _ in range(self.max_iters):
            # Assign clusters
            self.labels = self._assign_clusters(X)
            # Calculate new centroids
            new_centroids = np.array([X[self.labels == i].mean(axis=0) for i in range(self.k)])
            # Check for convergence
            if np.all(np.linalg.norm(self.centroids - new_centroids, axis=1) < self.tolerance):
                break
            self.centroids = new_centroids

    def _assign_clusters(self, X):
        distances = np.linalg.norm(X[:, np.newaxis] - self.centroids, axis=2)
        return np.argmin(distances, axis=1)

    def predict(self, X):
        return self._assign_clusters(X)

# Run KMeans
kmeans = KMeans(k=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

# Plot results
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=30, cmap='viridis')
plt.scatter(kmeans.centroids[:, 0], kmeans.centroids[:, 1], c='red', s=200, alpha=0.75, marker='X')
plt.title('K-Means Clustering')
plt.show()

OUTPUT:
(Figure: "K-Means Clustering" scatter plot with cluster centroids marked as red X markers.)

5. Implementation of Clustering Algorithm - Hierarchical Clustering: BIRCH Algorithm

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate synthetic dataset
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

class KMeans:
    def __init__(self, k=3, max_iters=100, tol=1e-4):
        self.k = k
        self.max_iters = max_iters
        self.tol = tol

    def fit(self, X):
        np.random.seed(42)
        random_idx = np.random.permutation(X.shape[0])[:self.k]
        self.centroids = X[random_idx]
        for i in range(self.max_iters):
            self.labels = self._assign_clusters(X)
            new_centroids = np.array([X[self.labels == j].mean(axis=0) for j in range(self.k)])
            diff = np.linalg.norm(self.centroids - new_centroids)
            if diff < self.tol:
                break
            self.centroids = new_centroids

    def _assign_clusters(self, X):
        distances = np.linalg.norm(X[:, np.newaxis] - self.centroids, axis=2)
        return np.argmin(distances, axis=1)

    def predict(self, X):
        return self._assign_clusters(X)

# Run KMeans
model = KMeans(k=3)
model.fit(X)
y_pred = model.predict(X)

# Plotting
plt.scatter(X[:, 0], X[:, 1], c=y_pred, s=50, cmap='viridis')
plt.scatter(model.centroids[:, 0], model.centroids[:, 1], c='red', s=200, marker='X')
plt.title("K-Means Clustering (From Scratch)")
plt.show()

OUTPUT:
(Figure: "K-Means Clustering (From Scratch)" scatter plot with cluster centroids marked as red X markers.)
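Note that the listing above for experiment 5 re-implements K-means from scratch rather than BIRCH itself. For the BIRCH algorithm named in the experiment title, a minimal sketch using scikit-learn's Birch estimator on the same style of synthetic blobs is given below; the threshold and branching_factor values are illustrative assumptions.

# Install dependencies (run this first in your terminal/command prompt)
# pip install scikit-learn matplotlib
import matplotlib.pyplot as plt
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Synthetic data in the same style as the listing above
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

# BIRCH builds a clustering-feature (CF) tree in a single pass over the data;
# threshold and branching_factor control the granularity of the tree (illustrative values)
birch = Birch(n_clusters=3, threshold=0.5, branching_factor=50)
labels = birch.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, s=30, cmap='viridis')
plt.title('BIRCH Clustering (scikit-learn)')
plt.show()

The single-pass CF-tree construction is what makes BIRCH suitable for large datasets; the final global step groups the CF-tree leaf entries into n_clusters clusters.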
6. Implementation of Clustering Algorithms - CURE Algorithm

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from scipy.spatial.distance import cdist

# Generate synthetic data
X, _ = make_blobs(n_samples=100, centers=4, random_state=42)

class CureClustering:
    def __init__(self, k=4, c=5, alpha=0.2):
        self.k = k            # Final number of clusters
        self.c = c            # Representative points per cluster
        self.alpha = alpha    # Shrinking factor
        self.clusters = []    # Stores clusters

    def fit(self, X):
        # Initialize each point as its own cluster
        self.clusters = [[x] for x in X]
        while len(self.clusters) > self.k:
            # Compute distance matrix between clusters
            distances = np.full((len(self.clusters), len(self.clusters)), np.inf)
            for i in range(len(self.clusters)):
                for j in range(i + 1, len(self.clusters)):
                    d = self._cluster_distance(self.clusters[i], self.clusters[j])
                    distances[i][j] = d
            # Find and merge the closest pair of clusters
            i, j = np.unravel_index(np.argmin(distances), distances.shape)
            new_cluster = self.clusters[i] + self.clusters[j]
            self.clusters.pop(j)
            self.clusters.pop(i)
            self.clusters.append(new_cluster)

    def _cluster_distance(self, cluster1, cluster2):
        # Get representatives and compute the minimum distance between them
        reps1 = self._get_representatives(cluster1)
        reps2 = self._get_representatives(cluster2)
        return np.min(cdist(reps1, reps2))

    def _get_representatives(self, cluster):
        # Select the c points farthest from the centroid
        cluster = np.array(cluster)
        centroid = np.mean(cluster, axis=0)
        dists = cdist(cluster, [centroid]).flatten()
        idx = np.argsort(dists)[-self.c:]
        reps = cluster[idx]
        # Shrink the representatives toward the centroid
        return reps * (1 - self.alpha) + centroid * self.alpha

    def predict(self, X):
        # Assign each point to the nearest cluster
        labels = np.zeros(len(X))
        for i, x in enumerate(X):
            dists = [np.min(cdist([x], cluster)) for cluster in self.clusters]
            labels[i] = np.argmin(dists)
        return labels.astype(int)

# Run CURE clustering
cure = CureClustering(k=4, c=5, alpha=0.2)
cure.fit(X)
labels = cure.predict(X)

# Visualize results
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title("CURE Clustering (Simplified)")
plt.show()

OUTPUT:
(Figure: "CURE Clustering (Simplified)" scatter plot of the four clusters.)
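As a quick check on the quality of the grouping produced above, the short sketch below computes the silhouette score for the CURE labels; it assumes the X and labels variables from the listing above are still in scope. Scores closer to 1 indicate better-separated clusters.

# pip install scikit-learn
from sklearn.metrics import silhouette_score

# Silhouette score over the CURE cluster assignments produced above
score = silhouette_score(X, labels)
print("Silhouette score for CURE clustering:", score)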