Dataset

E-Commerce Conversational Search and Recommendation Dataset

    This is a semi-synthetic dataset for conversational search and recommendation in e-commerce. Each conversation is constructed from a single user review of an item. We extract product features and the user's opinions on these features from each review, and then construct a conversation in a "system ask, user respond" manner. Specifically, the system asks about each feature in the order in which the features appear in the review, and the sentence commenting on that feature is treated as the user's response. We constructed conversations in four domains: CDs_and_Vinyl, Cell_Phones_and_Accessories, Electronics, and Kindle_Store. Note that not all of the response sentences are of high quality, but they do show the user's opinion (or requirement) on the corresponding features. The features are extracted with our Sentires: Phrase-level Sentiment Analysis Toolkit, which can be downloaded from the software page of this website. Researchers who would like to increase the quality of the conversations, at the cost of quantity, can use this toolkit and adjust its parameter settings to extract higher-quality (but fewer) feature and opinion words. The dataset can be downloaded by clicking here (download).
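
The construction described above can be illustrated with a minimal sketch. It assumes the features and the sentences commenting on them have already been extracted and ordered by their appearance in the review. The function name `build_conversation`, the question template, and the example sentences are hypothetical and are not taken from the released dataset.

```python
from typing import List, Tuple


def build_conversation(feature_sentences: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Turn ordered (feature, review sentence) pairs into alternating turns.

    The input order is assumed to match the order in which the features
    appear in the review.
    """
    turns = []
    for feature, sentence in feature_sentences:
        # Hypothetical question template; the dataset's actual wording may differ.
        turns.append(("system", f"What about the {feature}?"))
        # The sentence that comments on the feature serves as the user's response.
        turns.append(("user", sentence))
    return turns


# Made-up example input, for illustration only.
example = [
    ("battery", "The battery lasts two full days."),
    ("screen", "The screen is a bit dim outdoors."),
]
conversation = build_conversation(example)
for speaker, text in conversation:
    print(f"{speaker}: {text}")
```

Each extracted feature yields exactly one system turn followed by one user turn, so a review with n extracted features produces a conversation of 2n turns.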

    References:
  • [1] Yongfeng Zhang, Xu Chen, Qingyao Ai, Liu Yang, and W. Bruce Croft. Towards Conversational Search and Recommendation: System Ask, User Respond. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM 2018), October 22 - 26, 2018, Turin, Italy. [PDF]

BiliBili Time Synchronized Comments (TSC) Dataset

    This is the Time Synchronized Comments (TSC) dataset from BiliBili.com (commonly known as the B-site), which is one of the largest user-interactive video sharing websites in China. We used the dataset for the Personalized Key Frame Recommendation research published in SIGIR 2017, which attempts to display personalized key frames to different users, even for the same video. This released dataset is larger and more complete than the one we used in the paper, including more than 500K users, 900 videos, and 1.5 million time synchronized comments. This data can support large-scale model training for various research tasks in Recommendation, IR, Multimedia, etc. Please stay tuned, as we are going to release a more complete version that includes the user profiles and video metadata. The dataset can be accessed by clicking here (download). We would appreciate it if you cite the following paper when using this dataset in your research.

    References:
  • [1] Xu Chen, Yongfeng Zhang, Qingyao Ai, Hongteng Xu, Junchi Yan, and Zheng Qin. Personalized Key Frame Recommendation. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017), August 7 - 11, 2017, Tokyo, Japan. [PDF][slides][code]

JD.com E-Commerce Data

    JD.com is one of the largest Chinese E-commerce websites. This dataset contains consumer purchasing behaviors, user ratings, reviews, and product metadata from Jan 1, 2011 to Mar 31, 2014 (three years and a quarter), covering 15 first-level product categories, 987 second-level product categories, nearly 2 million users, over 100K products, and over 60 million reviews. Each textual review in this dataset consists of three subreviews: a positive review, a negative review, and an overall review. The dataset can be downloaded by clicking here (download), and the download passcode is 3ru2.

    References:
  • [1] Yongfeng Zhang, Min Zhang, Yi Zhang, Guokun Lai, Yiqun Liu, Honghui Zhang, Shaoping Ma. Daily-Aware Personalized Recommendation based on Feature-Level Time Series Analysis. In Proceedings of the 24th International World Wide Web Conference (WWW 2015), May 18 - 22, 2015, Florence, Italy. [PDF][slides]

Amazon Baby Registry Dataset

    This dataset is from Amazon Baby Registry (http://www.amazon.com/babyregistry). On this website, people register a wishlist of products to purchase for their new baby. As a result, each list is a set of complementary products. Each list contains the user_id and a list of product_id values; for each product, we know its title, brand, price, category (book, toy, etc.), and product URL. The dataset can be downloaded here (download).

    References:
  • [1] Qi Zhao, Yongfeng Zhang, Yi Zhang, Daniel Friedman. Multi-Product Utility Maximization for Economic Recommendation. In Proceedings of the 10th International Conference on Web Search and Data Mining (WSDM 2017), February 6 - 10, 2017, Cambridge, UK. [PDF][slides][code]

Dianping Review Dataset

    This dataset contains user reviews and detailed business metadata crawled from DianPing.com, a well-known Chinese online review website. It includes 3,605,300 reviews written by 510,071 users about 209,132 businesses. The numerical ratings of this dataset are used for collaborative filtering (Localized Matrix Factorization) in [1] and [2], and the textual reviews are used for sentiment analysis and explainable recommendation in [3] and [4], respectively. Detailed data format descriptions are included in the readme.txt file. This dataset can be downloaded here (download). The download password is "t23c", and the extraction password for the zip file is "yongfeng.me".

    References:
  • [1] Yongfeng Zhang, Min Zhang, Yiqun Liu, Shaoping Ma and Shi Feng. Localized Matrix Factorization for Recommendation based on Matrix Block Diagonal Forms. In Proceedings of the 22nd International Conference on World Wide Web (WWW 2013), May 13 - 17, 2013, Rio de Janeiro, Brazil. [PDF]
  • [2] Yongfeng Zhang, Min Zhang, Yiqun Liu and Shaoping Ma. Improve Collaborative Filtering Through Bordered Block Diagonal Form Matrices. In Proceedings of the 36th Annual International ACM SIGIR Conference on Research and Development on Information Retrieval (SIGIR 2013), July 28 - August 1, 2013, Dublin, Ireland. [PDF]
  • [3] Yongfeng Zhang, Haochen Zhang, Min Zhang, Yiqun Liu and Shaoping Ma. Do Users Rate or Review? Boost Phrase-level Sentiment Labeling with Review-level Sentiment Classification. In Proceedings of the 37th Annual International ACM SIGIR Conference on Research and Development on Information Retrieval (SIGIR 2014), July 6 - 11, 2014, Gold Coast, Australia. [PDF]
  • [4] Yongfeng Zhang, Guokun Lai, Min Zhang, Yi Zhang, Yiqun Liu and Shaoping Ma. Explicit Factor Models for Explainable Recommendation based on Phrase-level Sentiment Analysis. In Proceedings of the 37th Annual International ACM SIGIR Conference on Research and Development on Information Retrieval (SIGIR 2014), July 6 - 11, 2014, Gold Coast, Australia. [PDF]

Phrase-level Sentiment Labeled Reviews

    This dataset contains reviews from two domains: restaurants and digital cameras. From each review, we extract all pairs of a product feature word and a user opinion word (e.g. service | good, price | reasonable, etc.), as well as the sentiment polarity of each pair. The methods for sentiment polarity labelling and feature-opinion pair extraction are presented in [1] and [2], respectively, and this dataset is used for explainable recommendation in [3]. We also provide the toolkit for extracting such product feature and user opinion words from arbitrary (English or Chinese) textual corpora. For details, please refer to the "Sentires" toolkit under the Software tab of my homepage.

    The labeled Dianping dataset can be downloaded here (download) with the password "ouk2", and the labeled DC (digital camera) review dataset can be downloaded here (download) with the password "tier". The extraction password for both zip files is "yongfeng.me". A brief description of the data format is as follows:

    A user review is formatted as an XML entry of the form:
    <DOC>
    userid itemid flavor_rating environment_rating service_rating
    review_text
    feature-opinion pairs matched in the review_text, each of the form [feature_word, opinion_word, sentiment_polarity, times_of_occurrence, reversed_or_not]
    </DOC>
    e.g. [service, good, +1, 1, Y] means that the pair 'service | good' is matched once in the review and that the pair itself carries a positive sentiment (+1). However, the pair is reversed by a negation word such as 'not' (Y means it is reversed, N means it is not), so the final sentiment of this pair in this review is negative.
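
The reversal rule above can be sketched in a few lines of Python. This is only an illustration. It assumes each pair is available as a standalone string in the bracketed, comma-separated form shown above and that neither word contains a comma. The names `FeatureOpinion`, `parse_pair`, and `final_sentiment` are hypothetical and are not part of the dataset or the Sentires toolkit; the readme shipped with the data remains the authoritative format description.

```python
from typing import NamedTuple


class FeatureOpinion(NamedTuple):
    feature: str
    opinion: str
    polarity: int      # sentiment of the pair itself, e.g. +1
    count: int         # times the pair occurs in the review
    is_reversed: bool  # True if a negation word reverses the pair


def parse_pair(text: str) -> FeatureOpinion:
    """Parse one pair such as '[service, good, +1, 1, Y]'."""
    # Assumption: five comma-separated fields, with no commas inside the words.
    fields = [field.strip() for field in text.strip().strip("[]").split(",")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {text!r}")
    feature, opinion, polarity, count, flag = fields
    return FeatureOpinion(feature, opinion, int(polarity), int(count), flag.upper() == "Y")


def final_sentiment(pair: FeatureOpinion) -> int:
    """Flip the sign of the pair's polarity when it is reversed by a negation word."""
    return -pair.polarity if pair.is_reversed else pair.polarity


pair = parse_pair("[service, good, +1, 1, Y]")
print(pair.feature, pair.opinion, final_sentiment(pair))
```

For the example entry, the pair's own polarity is +1 and the reversal flag is Y, so the computed final sentiment is negative, matching the description above.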

    References:
  • [1] Yongfeng Zhang, Haochen Zhang, Min Zhang, Yiqun Liu and Shaoping Ma. Do Users Rate or Review? Boost Phrase-level Sentiment Labeling with Review-level Sentiment Classification. In Proceedings of the 37th Annual International ACM SIGIR Conference on Research and Development on Information Retrieval (SIGIR 2014), July 6 - 11, 2014, Gold Coast, Australia. [PDF]
  • [2] Yunzhi Tan, Yongfeng Zhang, Min Zhang, Yiqun Liu and Shaoping Ma. A Unified Framework for Emotional Elements Extraction based on Finite State Matching Machine. Natural Language Processing and Chinese Computing, Communications in Computer and Information Science (CCIS), Volume 400, 2013, pp 60-71. [PDF]
  • [3] Yongfeng Zhang, Guokun Lai, Min Zhang, Yi Zhang, Yiqun Liu and Shaoping Ma. Explicit Factor Models for Explainable Recommendation based on Phrase-level Sentiment Analysis. In Proceedings of the 37th Annual International ACM SIGIR Conference on Research and Development on Information Retrieval (SIGIR 2014), July 6 - 11, 2014, Gold Coast, Australia. [PDF]