Public Data Release 1.0.0
This repo contains the description of the data released in conjunction with our Nature Scientific Reports paper Shopper Intent Prediction from Clickstream E‑Commerce Data with Minimal Browsing Information.
The dataset is available for research and educational purposes here. To obtain the dataset, you are required to fill out a form with information about you and your institution, and agree to the Terms And Conditions for fair usage of the data.
For convenience, the Terms And Conditions are also included in plain `txt` format in this repo: usage of the data implies acceptance of these Terms And Conditions.
The dataset is provided as a single large text file (`.csv`) inside a `zip` archive, which also contains a copy of the Terms And Conditions. The final dataset contains 5,433,611 individual events, and it is the first dataset of this kind to be released to the research community. A sample file is included in this repository to illustrate the data structure.
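As a minimal sketch of how the archive could be read without unpacking it to disk, the Python snippet below streams rows straight from the `.csv` member of the zip. It assumes the file is comma-separated, UTF-8 encoded, and starts with a header row; none of this is stated above, so check it against the sample file first. The function name `iter_events` and the archive name in the usage comment are hypothetical.

```python
import csv
import io
import zipfile
from typing import Dict, Iterator, Optional


def iter_events(archive, csv_name: Optional[str] = None) -> Iterator[Dict[str, str]]:
    """Yield one dict per event row from the .csv inside a zip archive.

    `archive` is a filesystem path or a binary file-like object.
    If `csv_name` is not given, the first member ending in ".csv" is used.
    Assumption: the csv is comma-separated with a header row.
    """
    with zipfile.ZipFile(archive) as zf:
        if csv_name is None:
            # Skip the Terms And Conditions copy and pick the data file.
            csv_name = next(n for n in zf.namelist() if n.lower().endswith(".csv"))
        with zf.open(csv_name) as raw:
            # ZipFile.open returns bytes; wrap it so csv can read text lazily.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            yield from csv.DictReader(text)


# Usage (hypothetical archive name):
# for row in iter_events("release.zip"):
#     print(row)
```

Reading lazily keeps memory use flat, which may matter for a file with several million rows.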
Field | Type | Description |
---|---|---|
session_id_hash | string | Hashed identifier of the shopping session. A session groups together events that are at most 30 minutes apart: for example, if the same user comes back to the target website 31 minutes after the last interaction, a new session identifier is assigned. |
event_type | enum | The type of event according to the Google Protocol, one of { pageview, event }; for example, an add event can happen on a page load, or as a stand-alone event. |
product_action | enum | One of { detail, add, purchase, remove, click }. If the field is empty, the event is a simple page view (e.g. the FAQ page) without associated products. |
product_skus_hash | string | If the event is a product event, hashed identifiers of all products in the event (e.g. all the products in a transaction), pipe separated. |
server_timestamp_epoch_ms | int | Epoch time, in milliseconds. The epoch time has been shifted in time to further anonymize the data. |
hashed_url | string | Hashed url of the current web page. |
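The field descriptions above can be turned into a small typed record. The following Python sketch is one possible reading, not an official loader: it assumes each row arrives as a dict of strings keyed by the field names in the table, treats an empty `product_action` as a plain page view, splits `product_skus_hash` on the pipe character, and expresses the 30-minute session rule as a gap check. The names `Event`, `parse_event`, and `starts_new_session` are hypothetical, and the handling of a gap of exactly 30 minutes is an assumption based on the "at most 30 minutes apart" wording.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Session rule from the table: events at most 30 minutes apart share a session.
SESSION_GAP_MS = 30 * 60 * 1000


@dataclass
class Event:
    session_id_hash: str
    event_type: str                  # "pageview" or "event"
    product_action: Optional[str]    # None for a simple page view
    product_skus_hash: List[str]     # empty when no products are attached
    server_timestamp_epoch_ms: int   # shifted epoch time, in milliseconds
    hashed_url: str


def parse_event(row: Dict[str, str]) -> Event:
    """Convert one raw csv row (all strings) into a typed Event."""
    skus = row.get("product_skus_hash") or ""
    return Event(
        session_id_hash=row["session_id_hash"],
        event_type=row["event_type"],
        # An empty field means a page view without associated products.
        product_action=row.get("product_action") or None,
        # Product identifiers are pipe separated.
        product_skus_hash=[s for s in skus.split("|") if s],
        server_timestamp_epoch_ms=int(row["server_timestamp_epoch_ms"]),
        hashed_url=row["hashed_url"],
    )


def starts_new_session(prev_ms: int, curr_ms: int) -> bool:
    """True if the gap between two consecutive events exceeds 30 minutes."""
    return curr_ms - prev_ms > SESSION_GAP_MS
```

Because the timestamps have been shifted for anonymization, only differences between them are meaningful; the absolute dates should not be interpreted. The dataset already ships with `session_id_hash`, so `starts_new_session` only illustrates the stated rule and is not needed to group events.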
We refer the reader to the original paper for an extended explanation of how to use the dataset for the clickstream prediction challenge. Usage of this data implies the acceptance of the Terms And Conditions as set forth on the download page.
For questions about the paper, please refer to the corresponding author, Lucas Lacasa.
For questions about the dataset, please reach out to Jacopo Tagliabue.
The original paper is the product of a collaboration between industry and academia, based on a dataset kindly provided by Coveo. The authors of the paper are:
- Borja Requena - Institut de Ciencies Fotoniques, The Barcelona Institute of Science and Technology
- Giovanni Cassani - Department of Cognitive Science and Artificial Intelligence, Tilburg University
- Jacopo Tagliabue - Coveo AI Labs
- Ciro Greco - Coveo AI Labs
- Lucas Lacasa - School of Mathematical Sciences, Queen Mary University of London
The authors wish to thank Richard Tessier and Coveo's legal team for supporting our research and believing in this data sharing initiative.
If you make use of this dataset, please cite our work:
@article{Requena2020,
author = {Requena, Borja and Cassani, Giovanni and Tagliabue, Jacopo and Greco, Ciro and Lacasa, Lucas},
title = {Shopper intent prediction from clickstream e-commerce data with minimal browsing information},
year = {2020},
journal = {Scientific Reports},
issn = {2045-2322},
volume = {10},
doi = {10.1038/s41598-020-73622-y}
}