
Q4 2024 update of Property Suggester data
Open, Needs Triage, Public

Description

Problem:
The Property Suggester data was last updated in 2023, so current suggestions do not reflect changes made to Wikidata since then. The data needs to be updated.

How-to guide: https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/PropertySuggester_update

Steps to Reproduce

  1. Go to wikidata.org and try to add a property to an existing item that isn't protected
  2. Observe whether the specified property appears in the list of suggestions
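
The same check could also be scripted instead of clicking through the UI. The following is a minimal sketch, assuming the PropertySuggester API module `wbsgetsuggestions` accepts `entity` and `limit` parameters and returns its hits under a `search` key with an `id` field per hit; the helper names and the sample response are hypothetical and should be verified against the live API.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Wikidata action API.
API = "https://www.wikidata.org/w/api.php"


def suggestions_url(entity_id, limit=50):
    """Build a request URL asking for property suggestions for one item."""
    params = {
        "action": "wbsgetsuggestions",  # assumed module name
        "entity": entity_id,
        "limit": limit,
        "format": "json",
    }
    return API + "?" + urlencode(params)


def is_suggested(response, property_id):
    """Return True if property_id is among the suggestions in a decoded response."""
    return any(hit.get("id") == property_id for hit in response.get("search", []))


# Illustrative response shape only, not real API output.
sample = {"search": [{"id": "P31"}, {"id": "P7859"}]}
print(suggestions_url("Q42"))
print(is_suggested(sample, "P7859"))
```

Fetching the URL and decoding the JSON is left out on purpose, so the sketch runs without network access.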

Acceptance criteria:

  • Property suggestions are based on data from within the last 3 months
  • property:P7859 is no longer available
  • ISNI is available
  • What 'legacy' means for property suggester is clarified and the doc is updated

Original report:

Event Timeline

Yes please, this is long overdue. I've added several properties to the refine script since the last update, which will hopefully also have a big impact on the quality of suggestions.

karapayneWMDE updated the task description.
karapayneWMDE moved this task from Incoming to Product Backlog on the Wikidata Dev Team board.
karapayneWMDE subscribed.

The following A/C are met as of 20 Nov 2024:

  • property:P7859 is no longer available
  • ISNI is available

How-to guide: https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/PropertySuggester_update

Well, this seems unfortunate:

Find the latest wbs_propertypairs on https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/Wikidata/wbs_propertypairs/ (generated on stat1005 by a cron from ladsgroup).

The latest directory at that link is 20240429, i.e. over half a year old. This may or may not be related to stat1005 being decommissioned in June (T353785).
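
To make the staleness check against the "last 3 months" acceptance criterion repeatable, something like the sketch below could be used. It assumes the dataset directories are named `YYYYMMDD` as in `20240429`; the function names, the 92-day threshold, and the sample listing are hypothetical, and the reference date is only illustrative.

```python
from datetime import date, datetime


def latest_snapshot(names):
    """Return the newest date among directory names shaped like YYYYMMDD."""
    dates = []
    for name in names:
        stripped = name.strip("/")
        # Skip anything that is not exactly eight digits (README files etc.).
        if len(stripped) != 8 or not stripped.isdigit():
            continue
        try:
            dates.append(datetime.strptime(stripped, "%Y%m%d").date())
        except ValueError:
            continue  # eight digits, but not a valid calendar date
    return max(dates) if dates else None


def is_stale(snapshot, today, max_age_days=92):
    """True if the snapshot is older than roughly three months."""
    return (today - snapshot).days > max_age_days


# Hypothetical listing; only 20240429 is taken from the comment above.
listing = ["20240401/", "20240429/", "README"]
newest = latest_snapshot(listing)
print(newest, is_stale(newest, date(2024, 11, 20)))
```

Obtaining the listing itself (for example by reading the published directory index) is not shown, since the format of that index is not described here.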

We might need to resurrect that cronjob first, assuming we can figure out what it even ran. (Codesearch for analyzed-out.gz comes up empty at the moment; the PropertySuggester README suggests it was wikibase/property-suggester-scripts.)

Apparently an older version of the guide had more details on how to generate analyzed-out.gz, though it’s just a guess that the cronjob was an automated version of the steps listed there (maybe Marius or Amir can confirm?).

To my big surprise the update script is still running (on stat1010, see below). Given that this used to take days, not weeks, I suspect something is wrong here. I'll look into it.

hoo      1436350  0.1  0.0  39260 29188 pts/27   RN+  Oct29  40:47 python scripts/dumpconverter.py /mnt/data/xmldatadumps/public/wikidatawiki/entities/latest-all.json.gz
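
For reading the line above, a small sketch that splits it into named fields may help. It assumes the columns follow the usual `ps aux` layout (USER, PID, %CPU, %MEM, VSZ, RSS, TTY, STAT, START, TIME, COMMAND) and that TIME is accumulated CPU time in minutes:seconds; both are assumptions about how the output was produced, and the helper names are hypothetical.

```python
PS_AUX_KEYS = [
    "user", "pid", "cpu", "mem", "vsz", "rss",
    "tty", "stat", "start", "time", "command",
]


def parse_ps_aux_line(line):
    """Split one ps line into named fields (assumes `ps aux` column order)."""
    # maxsplit=10 keeps the full command, including its arguments, in one field.
    fields = line.split(None, 10)
    return dict(zip(PS_AUX_KEYS, fields))


def cpu_seconds(time_field):
    """Convert an assumed MM:SS CPU time field into seconds."""
    minutes, seconds = time_field.split(":")
    return int(minutes) * 60 + int(seconds)


line = (
    "hoo      1436350  0.1  0.0  39260 29188 pts/27   RN+  Oct29  40:47 "
    "python scripts/dumpconverter.py "
    "/mnt/data/xmldatadumps/public/wikidatawiki/entities/latest-all.json.gz"
)
proc = parse_ps_aux_line(line)
print(proc["start"], proc["cpu"], cpu_seconds(proc["time"]))
```

If those assumptions hold, comparing the accumulated CPU time with the wall-clock time since the start date gives a rough idea of how much of its lifetime the process has actually spent working.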