CS 550: Massive Data Mining and Learning


Course Information


Instructor: Yongfeng Zhang
Email: yongfeng.zhang AT rutgers.edu
Office: CBIM 20

Time: Mondays and Thursdays, 12:00-1:20 pm
Location: Allison Road Classrooms (ARC) 107
Office Hours: Fridays 2:00-4:00pm or by appointment

TA: Sepehr Janghorbani
Email: sj620 AT scarletmail.rutgers.edu
Office: Hill 202
TA Office Hours: Wednesdays 12:00pm-2:00pm or by appointment

Textbook: (LRU) Mining of Massive Datasets by J. Leskovec, A. Rajaraman, J. D. Ullman



Announcements

  • 4/10: Project presentation schedules emailed to each team from Sakai
  • 4/09: Homework 4 released on Sakai, due date Sunday 4/22 11:59PM, one late day allowed
  • 3/25: Homework 3 released on Sakai, due date Sunday 4/08 11:59PM, one late day allowed
  • 3/20: Solutions of homework 2 released on Sakai
  • 3/02: Homework 2 released on Sakai, due date Sunday 3/18 11:59PM, one late day allowed
  • 2/27: Solutions of homework 1 released on Sakai
  • 2/26: Final project will be released and the project description will be uploaded to Sakai
  • 2/09: Homework 1 released on Sakai, due date Sunday 2/25 11:59PM, one late day allowed
  • 1/19: Homework 0 released on Sakai, this is a practice homework and does not require submission
  • 1/10: The first class is on Thursday, Jan 18.


Course Description

    This class introduces the computing infrastructure, algorithms, theories, and practice of massive data analytics and machine learning, as well as their applications in frequently used scenarios, including recommender systems, web search, social networks, computational advertising, and smart cities. Students will learn algorithms to store, process, mine, analyze, and synthesize streaming data, or data at rest that does not fit in Random Access Memory. The material covered here equips students with the main backend algorithms and infrastructure for the Capstone Project and for research tasks closely related to data science and analytics.



Prerequisites

  • CS 512 or CS 513 (Fundamental Algorithms)
  • Linear Algebra, Basic Probability (Moments, Typical Distributions, MLE)
  • Programming Languages: C++/Java
  • Infrastructure: Hadoop Cluster


Expected Work

  • Homework: 4 homework assignments (10% each, 40% total)
  • Midterm: Thu, Mar 8, 12:00pm – 1:15pm (25%)
  • Final Project: Completed in a team of at most 3 students. Teams choose between an assigned project and a self-proposed project, and give a 10-minute presentation (35%, plus a 5% bonus for self-proposed projects)

    The midterm is closed-book, but you may bring one letter-sized page of notes that you prepared yourself.

    Self-proposed projects should involve at least as much work as the assigned project, and are subject to pre-approval by the instructor.



Tentative Schedule

    The schedule is subject to change (e.g., due to snow or campus closure). Please check the course website frequently for the latest schedule.

    Week  Date  Topics and Assignments
    1     1/18  Introduction, MapReduce I (Reading: Ch 1, Ch 2.1-2.4)
    2     1/22  MapReduce II (Reading: Ch 2.1-2.4)
    2     1/25  Association Rule Mining (Reading: Ch 6)
    3     1/29  Frequent Itemset Mining (Reading: Ch 6)
    3     2/1   Locality-Sensitive Hashing I (Reading: Ch 3.1-3.4)
    4     2/5   Locality-Sensitive Hashing II (Reading: Ch 3.5-3.8)
    4     2/8   Clustering, similarity, k-means, BFR (Reading: Ch 7.1-7.4)
    5     2/12  Dimensionality Reduction, SVD (Reading: Ch 11.1-11.3)
    5     2/15  Dimensionality Reduction, CUR (Reading: Ch 11.4-11.5)
    6     2/19  Content-based Recommendation (Reading: Ch 9.1-9.2)
    6     2/22  Collaborative Filtering, Latent Factor Models (Reading: Ch 9.3-9.4)
    7     2/26  Learning to Rank and Deep Learning for RS, Project Description
    7     3/1   Link Analysis, PageRank (Reading: Ch 5.1-5.3, 5.5)
    8     3/5   Web Spam, TrustRank (Reading: Ch 5.4)
    8     3/8   Midterm exam
    9     3/12  No class, Spring recess
    9     3/15  No class, Spring recess
    10    3/19  Social Networks, Community Detection (Reading: Ch 10.1-10.2, 10.6)
    10    3/22  Spectral Clustering, Trawling (Reading: Ch 10.1-10.2, 10.6)
    11    3/26  Overlapping Communities (Reading: Ch 10.3-10.5, 10.7-10.8)
    11    3/29  Large-scale Machine Learning I (Reading: Ch 12)
    12    4/2   Large-scale Machine Learning II (Reading: Ch 12)
    12    4/5   Mining Data Streams I (Reading: Ch 4.1-4.3)
    13    4/9   Mining Data Streams II (Reading: Ch 4.4-4.7)
    13    4/12  Guest Lecture: Case Study in Smart Cities
    14    4/16  Computational Advertising (Reading: Ch 8)
    14    4/19  Learning through Experimentation with Bandit-based Learning
    15    4/23  Project Presentations I
    15    4/26  Project Presentations II
    16    4/30  Course Review