Dataset

JD.com E-Commerce Data

    Summary:
  • JD.com is one of the largest Chinese e-commerce websites. This dataset contains consumer purchasing behaviors, user ratings, reviews, and product metadata from Jan 1, 2011 to Mar 31, 2014 (three and a quarter years), covering 15 first-level product categories, 987 second-level product categories, nearly 2 million users, over 100K products, and over 60 million reviews.
    Each textual review in this dataset consists of three sub-reviews: a positive review, a negative review, and an overall review.
    The dataset can be downloaded by clicking here, and the download passcode is 3ru2.
  • References:
  • [1] Yongfeng Zhang, Min Zhang, Yi Zhang, Guokun Lai, Yiqun Liu, Honghui Zhang, Shaoping Ma. Daily-Aware Personalized Recommendation based on Feature-Level Time Series Analysis. In Proceedings of the 24th International World Wide Web Conference (WWW 2015), May 18 - 22, 2015, Florence, Italy. [PDF][slides]

Amazon Baby Registry Dataset

    Summary:
  • This dataset is from Amazon Baby Registry (http://www.amazon.com/babyregistry). On this website, people register a wishlist of products to purchase for their new baby. As a result, each list is a set of complementary products. Each list contains the user_id and a list of product_id values; for each product, we know its title, brand, price, category (book, toy, etc.), and product URL. The dataset can be downloaded here.
  • References:
  • [1] Qi Zhao, Yongfeng Zhang, Yi Zhang, Daniel Friedman. Multi-Product Utility Maximization for Economic Recommendation. In Proceedings of the 10th International Conference on Web Search and Data Mining (WSDM 2017), February 6 - 10, 2017, Cambridge, UK. [PDF][slides][code]

Dianping Review Dataset

    Summary:
  • This dataset contains user reviews and detailed business metadata crawled from DianPing.com, a well-known Chinese online review website. It comprises 3,605,300 reviews written by 510,071 users about 209,132 businesses. The numerical ratings of this dataset are used for collaborative filtering (Localized Matrix Factorization) in [1] and [2], and the textual reviews are used for sentiment analysis and explainable recommendation in [3] and [4], respectively. Detailed data format descriptions are included in the readme.txt file.
    This dataset can be downloaded at Dianping with the download password "t23c"; the extraction password for the zip file is "yongfeng.me".
  • References:
  • [1] Yongfeng Zhang, Min Zhang, Yiqun Liu, Shaoping Ma and Shi Feng. Localized Matrix Factorization for Recommendation based on Matrix Block Diagonal Forms. In Proceedings of the 22nd International Conference on World Wide Web (WWW 2013), May 13 - 17, 2013, Rio de Janeiro, Brazil. [PDF]
  • [2] Yongfeng Zhang, Min Zhang, Yiqun Liu and Shaoping Ma. Improve Collaborative Filtering Through Bordered Block Diagonal Form Matrices. In Proceedings of the 36th Annual International ACM SIGIR Conference on Research and Development on Information Retrieval (SIGIR 2013), July 28 - August 1, 2013, Dublin, Ireland. [PDF]
  • [3] Yongfeng Zhang, Haochen Zhang, Min Zhang, Yiqun Liu and Shaoping Ma. Do Users Rate or Review? Boost Phrase-level Sentiment Labeling with Review-level Sentiment Classification. In Proceedings of the 37th Annual International ACM SIGIR Conference on Research and Development on Information Retrieval (SIGIR 2014), July 6 - 11, 2014, Gold Coast, Australia. [PDF]
  • [4] Yongfeng Zhang, Guokun Lai, Min Zhang, Yi Zhang, Yiqun Liu and Shaoping Ma. Explicit Factor Models for Explainable Recommendation based on Phrase-level Sentiment Analysis. In Proceedings of the 37th Annual International ACM SIGIR Conference on Research and Development on Information Retrieval (SIGIR 2014), July 6 - 11, 2014, Gold Coast, Australia. [PDF]

Phrase-level Sentiment Labeled Reviews

    Summary:
  • This dataset contains reviews from two domains: restaurant reviews and digital camera (DC) reviews. From each review we extract all pairs of a product feature word and a user opinion word (e.g. service | good, price | reasonable, etc.), together with the sentiment polarity of each pair. The methods for sentiment polarity labeling and feature-opinion pair extraction are presented in [1] and [2], respectively, and this dataset is used for explainable recommendation in [3].
    The labeled Dianping dataset can be downloaded at Labeled Dianping with the download password "ouk2", and the labeled DC review dataset can be downloaded at Labeled DC reviews with the download password "tier". The extraction password for both zip files is "yongfeng.me".
  • Data Format:
  • A user review is formatted as an XML entry of the form:
    <DOC>
    userid itemid flavor_rating environment_rating service_rating
    review_text
    feature-opinion pairs matched in the review_text, each of the form [feature_word, opinion_word, sentiment_polarity, times_of_occurrence, reversed_or_not]
    </DOC>
    e.g. [service, good, +1, 1, Y] means that the pair 'service | good' is matched once in the review and that the pair itself carries a positive sentiment (+1). However, the pair is reversed by a negation word such as 'not' (Y means reversed, N means not reversed), so the final sentiment of this pair in this review is negative.
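The pair format above can be read with a few lines of Python. This is only a minimal sketch, not the authors' loader: it assumes each pair appears as a bracketed, comma-separated 5-tuple exactly as in the example, and the names FeatureOpinionPair and parse_pairs are hypothetical. The actual files may use a different encoding or separators, so check them before relying on it.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class FeatureOpinionPair:
    """One [feature, opinion, polarity, count, reversed] entry (hypothetical type)."""
    feature: str
    opinion: str
    polarity: int      # sentiment of the pair itself: +1 or -1
    occurrences: int   # how many times the pair is matched in the review
    is_reversed: bool  # True when the flag is Y (negated, e.g. by 'not')

    @property
    def final_sentiment(self) -> int:
        # A reversed pair flips the sign of its own polarity.
        return -self.polarity if self.is_reversed else self.polarity


# Assumption: pairs are written in square brackets with no nested brackets.
PAIR_RE = re.compile(r"\[([^\[\]]*)\]")


def parse_pairs(line: str) -> List[FeatureOpinionPair]:
    """Extract every well-formed 5-field pair from one line of a <DOC> entry."""
    pairs = []
    for body in PAIR_RE.findall(line):
        fields = [f.strip() for f in body.split(",")]
        if len(fields) != 5:
            continue  # skip anything that does not match the assumed layout
        feature, opinion, polarity, count, flag = fields
        pairs.append(
            FeatureOpinionPair(
                feature, opinion, int(polarity), int(count), flag.upper() == "Y"
            )
        )
    return pairs
```

For the example entry, parse_pairs("[service, good, +1, 1, Y]")[0].final_sentiment evaluates to -1, which matches the negative final sentiment described above.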
  • References:
  • [1] Yongfeng Zhang, Haochen Zhang, Min Zhang, Yiqun Liu and Shaoping Ma. Do Users Rate or Review? Boost Phrase-level Sentiment Labeling with Review-level Sentiment Classification. In Proceedings of the 37th Annual International ACM SIGIR Conference on Research and Development on Information Retrieval (SIGIR 2014), July 6 - 11, 2014, Gold Coast, Australia. [PDF]
  • [2] Yunzhi Tan, Yongfeng Zhang, Min Zhang, Yiqun Liu and Shaoping Ma. A Unified Framework for Emotional Elements Extraction based on Finite State Matching Machine. Natural Language Processing and Chinese Computing, Communications in Computer and Information Science (CCIS), Volume 400, 2013, pp 60-71. [PDF]
  • [3] Yongfeng Zhang, Guokun Lai, Min Zhang, Yi Zhang, Yiqun Liu and Shaoping Ma. Explicit Factor Models for Explainable Recommendation based on Phrase-level Sentiment Analysis. In Proceedings of the 37th Annual International ACM SIGIR Conference on Research and Development on Information Retrieval (SIGIR 2014), July 6 - 11, 2014, Gold Coast, Australia. [PDF]
Coming soon...

BiliBili Time Synchronized Comments (TSC) Dataset