The ICE toolkit embeds items and their concepts into a shared embedding space, so that the resulting embeddings can be compared in terms of overall conceptual similarity regardless of item type (ICE: Item Concept Embedding via Textual Information, SIGIR 2017). For example, a song can be used to retrieve conceptually similar songs (homogeneous retrieval) as well as conceptually related concepts (heterogeneous retrieval).
Specifically, ICE links items to their representative concepts (words extracted from each item's textual information) in a heterogeneous network and then learns embeddings for both items and concepts based on the concept words they share. Since items are defined in terms of concepts, adding expanded concepts to the network allows the learned embeddings to retrieve results that are conceptually more diverse yet still relevant.
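As a rough illustration of what comparing embeddings regardless of item type means, the sketch below ranks items and concept words by cosine similarity to a query. The vectors are hand-made toy values, not ICE output, and `cosine` and `retrieve` are hypothetical helper names that are not part of the toolkit.

```python
import math

# Toy 2-d vectors, hand-picked for illustration only (not produced by ICE).
embeddings = {
    "Toy_Story": [0.9, 0.1],
    "Star_Wars": [0.1, 0.9],
    "toys": [0.8, 0.2],
    "jedi": [0.2, 0.8],
}
items = {"Toy_Story", "Star_Wars"}   # the remaining keys are concept words
concepts = set(embeddings) - items


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def retrieve(query, candidates, k=1):
    """Return the k candidates most similar to the query, best first."""
    scored = [(cosine(embeddings[query], embeddings[c]), c)
              for c in candidates if c != query]
    return [c for _, c in sorted(scored, reverse=True)[:k]]


word_to_item = retrieve("jedi", items)           # heterogeneous: word -> item
item_to_word = retrieve("Toy_Story", concepts)   # heterogeneous: item -> word
item_to_item = retrieve("Toy_Story", items)      # homogeneous: item -> item
```

Because items and words live in the same space, the same `retrieve` call serves both the homogeneous and the heterogeneous case; only the candidate set changes.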
- gcc 6.4
- python3
- cython
$ git clone https://github.com/cnclabs/ICE
$ cd ./ICE/ICE
$ make ice
Alternatively, the toolkit can be used through its Python API. For usage, please refer to Section 2.2.2.
$ make python
[Note: The API has only been tested with Python 3.]
Users need to provide an entity-text network and a text-text network to construct an ICE network. For more details, please refer to our paper.
Toy_Story toys 1
Toy_Story stuffed_animals 1
Star_Wars jedi 1
Star_Wars rebel 1
toys toys 1
toys stuffed_animals 1
stuffed_animals toys 1
stuffed_animals stuffed_animals 1
jedi jedi 1
rebel rebel 1
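Judging from the samples above, each line of an edge file appears to hold a source node, a target node, and a weight, separated by whitespace. A minimal sketch for reading such lines under that assumption follows; `parse_edges` is a hypothetical helper, not part of the toolkit.

```python
def parse_edges(lines):
    """Parse 'source target weight' lines into (source, target, weight) tuples."""
    edges = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumes node names contain no whitespace, as in the samples.
        source, target, weight = line.split()
        edges.append((source, target, float(weight)))
    return edges


et_edges = parse_edges([
    "Toy_Story toys 1",
    "Star_Wars jedi 1",
])
```

To read a file instead, pass an open file object, for example `parse_edges(open(path))`.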
$ python3 construct_graph.py -et ../data/movie_et.edge -tt ../data/movie_tt.edge -ice movie_ice.edge
-et <string>, --et_network <string>
Input Entity-text Network
-tt <string>, --tt_network <string>
Input Text-text Network
-ice <string>, --ice_network <string>
Output ICE Network
For sample files, please see data/movie_et.edge and data/movie_tt.edge.
$ ./ice -train movie_ice.edge -save movie.embd -dim 4 -sample 10 -neg 5 -thread 1 -alpha 0.025
Options:
-train <string>
Path to the network used for embedding learning
-save <string>
Path to save the embedding file
-dim <int>
Dimension of embedding; default is 64
-neg <int>
Number of negative examples; default is 5
-sample <int>
Number of training samples, in millions; default is 10
-thread <int>
Number of training threads; default is 1
-alpha <float>
Initial learning rate; default is 0.025
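When several training runs need to be scripted, the options above can be assembled programmatically. The sketch below only builds the argument list from the documented flags and defaults; `build_ice_command` is a hypothetical helper, and the location of the `./ice` binary is an assumption.

```python
import shlex


def build_ice_command(train, save, dim=64, neg=5, sample=10, thread=1,
                      alpha=0.025, binary="./ice"):
    """Assemble the argument list for one training run of the ice binary."""
    return [
        binary,
        "-train", train,
        "-save", save,
        "-dim", str(dim),
        "-sample", str(sample),   # in millions, per the option list
        "-neg", str(neg),
        "-thread", str(thread),
        "-alpha", str(alpha),
    ]


cmd = build_ice_command("movie_ice.edge", "movie.embd", dim=4)
command_line = shlex.join(cmd)  # printable form of the same command
# To actually launch training: subprocess.run(cmd, check=True)
```

Passing the list form to `subprocess.run` avoids shell quoting problems with unusual file names.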
After compiling, run python3 example.py to execute the following code.
from pyICE import pyICE
ice = pyICE()
network = {
'MAYDAY': {'Taiwanese': 1, 'rock': 1,'band': 1},
'MAYDAY@': {'Taiwanese': 1, 'rock': 1, 'band': 1},
'Sodagreen': {'Taiwanese': 1, 'indie': 1, 'pop_rock': 1, 'band': 1},
'SEKAI_NO_OWARI': {'Japanese': 1, 'indie': 1, 'pop_rock': 1, 'band': 1},
'The_Beatles': {'England': 1, 'rock': 1, 'pop': 1}
}
ice.load_dict(network)
ice.init(dimension=4)
ice.train(sample=11, neg=5, alpha=0.025, workers=1)
ice.save_weights(model_name='example.embd')
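The dictionary passed to `load_dict` looks like a nested form of the edge files used by the command-line tool. Assuming it maps each source node to its `{target: weight}` neighbors, the sketch below writes such a dictionary out as edge lines; `network_to_edge_lines` is a hypothetical helper, not part of pyICE.

```python
def network_to_edge_lines(network):
    """Flatten {source: {target: weight}} into 'source target weight' lines."""
    lines = []
    for source, neighbors in network.items():
        for target, weight in neighbors.items():
            lines.append(f"{source} {target} {weight}")
    return lines


# Two entries taken from the example network above.
sample_network = {
    'MAYDAY': {'Taiwanese': 1, 'rock': 1, 'band': 1},
    'The_Beatles': {'England': 1, 'rock': 1, 'pop': 1},
}
edge_lines = network_to_edge_lines(sample_network)
```

Whether the API and the command-line tool interpret these edges identically is not confirmed here, so treat the conversion as a convenience for inspection rather than a guaranteed round trip.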
Here, we report the average performance based on 10 embeddings trained under the same setting. For more details, please refer to our paper.
- IMDB word-to-movie retrieval task:
- Graph construction: 20 representative words per item and 5 expanded words per representative word.
- Embedding learning: dim=256, sample=200, neg=2
Genre | Horror | Thriller | Western | Action | Short | Sci-Fi | Average |
---|---|---|---|---|---|---|---|
Precision@50 | 0.322 | 0.206 | 0.318 | 0.449 | 0.100 | 0.386 | 0.297 |
Precision@100 | 0.316 | 0.203 | 0.281 | 0.423 | 0.080 | 0.382 | 0.281 |
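For readers unfamiliar with the metric, Precision@k is the fraction of the top k retrieved results that are relevant. The sketch below implements this standard definition on made-up identifiers; the exact evaluation protocol used for the table is described in the paper and may differ in details, and `precision_at_k` is a hypothetical helper.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked results that are relevant.

    Divides by k even if fewer than k results were returned, so missing
    results count as misses.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


# Hypothetical ranking for one query word; identifiers are placeholders.
ranked_movies = ["m1", "m2", "m3", "m4"]
relevant_movies = {"m1", "m3", "m4"}

p_at_2 = precision_at_k(ranked_movies, relevant_movies, 2)  # 1 hit of 2
p_at_4 = precision_at_k(ranked_movies, relevant_movies, 4)  # 3 hits of 4
```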
@inproceedings{Wang:2017:IIC:3077136.3080807,
author = {Wang, Chuan-Ju and Wang, Ting-Hsiang and Yang, Hsiu-Wei and Chang, Bo-Sin and Tsai, Ming-Feng},
title = {ICE: Item Concept Embedding via Textual Information},
booktitle = {Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval},
series = {SIGIR '17},
year = {2017},
isbn = {978-1-4503-5022-8},
location = {Shinjuku, Tokyo, Japan},
pages = {85--94},
numpages = {10},
url = {http://doi.acm.org/10.1145/3077136.3080807},
doi = {10.1145/3077136.3080807},
acmid = {3080807},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {concept embedding, conceptual retrieval, information network, textual information},
}