Wikidata:Requests for permissions/Bot/Pi bot 14
- The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Approved--Ymblanter (talk) 12:55, 20 October 2020 (UTC)[reply]
Pi bot (talk • contribs • new items • new lexemes • SUL • Block log • User rights log • User rights • xtools)
Operator: Mike Peel (talk • contribs • logs)
Task/s: Import short descriptions from the English Wikipedia
Code: Available on BitBucket
Function details: The script looks through enwp short description tracking categories to find cases that don't match the Wikidata description beyond capitalisation differences. It then has two options:
- If there is no English description on Wikidata, then import the short description from English Wikipedia (from en:Category:Short description with empty Wikidata description)
- If Wikidata has an English description, but it doesn't match the English Wikipedia short description, then replace the Wikidata description with the enwp short description
The assumption is that these descriptions are short enough to be copyright free. Also, the first letter is always changed to lower case to match Wikidata's style.
This aims to solve the problem that the English Wikipedia is stopping using Wikidata descriptions as short descriptions in search etc., but that the descriptions are also used here and on other projects (such as in the Commons infobox). This proposal will likely be controversial, which is why I am proposing two options - we can either do (2) and (1), or just (1), or neither. As such, I've also posted a notice at Project chat and on enwp. Thanks. Mike Peel (talk) 19:32, 3 August 2020 (UTC)[reply]
- I also posted a matching bot proposal at en:Wikipedia:Bots/Requests for approval/Pi bot 5, also see the discussion at en:Wikipedia:Village_pump_(proposals)#Synchronising_short_descriptions_and_Wikidata_descriptions. Thanks. Mike Peel (talk) 21:39, 6 August 2020 (UTC)[reply]
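The comparison described under Function details can be illustrated with a minimal Python sketch. It is not taken from the bot's BitBucket code: the function name `choose_action` and its return values are hypothetical, and the only assumption is that two descriptions count as matching when they differ in capitalisation alone.

```python
from typing import Optional


def choose_action(wikidata_desc: Optional[str], enwp_desc: Optional[str]) -> str:
    """Return 'import', 'replace' or 'skip' for one item (hypothetical helper)."""
    if not enwp_desc or not enwp_desc.strip():
        return "skip"  # nothing to copy from enwp
    if not wikidata_desc or not wikidata_desc.strip():
        return "import"  # option 1: Wikidata has no English description
    # Differences in capitalisation alone do not count as a mismatch.
    if wikidata_desc.strip().casefold() == enwp_desc.strip().casefold():
        return "skip"
    return "replace"  # option 2: the two descriptions genuinely differ
```

Under this sketch, doing "just (1)" would mean acting only on the `import` result and ignoring `replace`.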
Discussion
- Strong oppose to overwrite existing Wikidata descriptions. There is absolutely no need to keep this synced. —MisterSynergy (talk) 19:47, 3 August 2020 (UTC)[reply]
- @MisterSynergy: Can you elaborate, please? I'm assuming that a newer description will be better than an older one, and that they are more likely to be improved on enwp than wikidata. The initial import may be different, but I think we need to do that sync at some point, even if it's hard. Thanks. Mike Peel (talk) 19:51, 3 August 2020 (UTC)[reply]
- No, why should we sync this? Enwiki chose to depart from the standard descriptions, which is fine—frankly we do not need them as data users, but please let this not backfire to our project. You're also mistaken when assuming that the enwiki "short description" is necessarily newer than a local description here at Wikidata. I personally would be pretty pissed if my descriptions would be overwritten by a bot based on content taken from another project, particularly one where there is an extremely hostile environment against everything which has to do with Wikidata. Consider what happens if I align an enwiki "short description" to what I would like to see in the Wikidata description field… Thanks, but no thanks. —MisterSynergy (talk) 20:00, 3 August 2020 (UTC)[reply]
- I think we do need them as data users, and that this is a problem we need to solve somehow. I'm open to other suggestions on how to do that. Thanks. Mike Peel (talk) 20:04, 3 August 2020 (UTC)[reply]
- They wouldn't return to usage of Wikidata descriptions if we now started to copy all their short descriptions anyways. We would just alienate our local users doing so.
Can you please elaborate why these two descriptions need to be in sync? I don't get this point, to be honest. —MisterSynergy (talk) 20:09, 3 August 2020 (UTC)[reply]
- Would they return to using Wikidata descriptions anyway? I'm trying to avoid the inevitable questions about why a given description isn't the same, and why one is worse than the other, in uses on Commons and elsewhere. Thanks. Mike Peel (talk) 20:13, 3 August 2020 (UTC)[reply]
- The problem with descriptions is that they are used in different contexts, which means there isn't necessarily an "ideal description" possible. It is also not correct to assume quality issues here at Wikidata when descriptions differ. It is no drama if enwiki sets up rules for their needs which are then incompatible with ours. We do not need another round of English Wikipedia imperialism in this project.
You can also try to propose overwriting "short descriptions" in enwiki if the Wikidata description differs. What do you think would happen with such a proposal? Yeah, they would reject it directly, of course (and I would understand it). —MisterSynergy (talk) 20:21, 3 August 2020 (UTC)[reply]
- What different contexts? If we need to handle different descriptions for different contexts, can we do that in structured data? I'll propose a mirror enwp bot request shortly. Thanks. Mike Peel (talk) 20:34, 3 August 2020 (UTC)[reply]
- Descriptions were originally meant to be used as disambiguators for items with identical labels. Years later someone found that these Wikidata descriptions are a relatively good resource that could be used to improve search and display in many, but not all scenarios. Enwiki "short descriptions" center around the latter use, which can be quite different from our original intentions with the Wikidata descriptions.
I don't think there is much to arrange with structured data here, given the purpose of (Wikidata) descriptions. What enwiki does with their local "short descriptions" is their problem. As a sidenote, I just figured that it is pretty complicated to compare Wikidata descriptions and enwiki "short descriptions", as there is apparently no possibility to query differences efficiently on a large scale; due to the way the "short descriptions" are stored locally in enwiki, comparisons need to be done basically individually which sucks. —MisterSynergy (talk) 20:52, 3 August 2020 (UTC)[reply]
- I can see how that would have made sense at the time, but it doesn't match the current usage where they are used to summarise the topic. Anyhow, I've started en:Wikipedia:Bots/Requests for approval/Pi bot 5. Thanks. Mike Peel (talk) 21:20, 3 August 2020 (UTC)[reply]
- Wikidata descriptions are nowadays still not used to summarize a topic. Anyways, if you want to allow imports from enwiki to Wikidata or vice versa, please make a tool and gamify this process. Collect items where the two descriptions differ, and provide a UI on toolforge where users can query items in which they are interested, in order to review the different descriptions and eventually update the Wikidata description or the enwiki "short description" if they deem that appropriate. This allows manual oversight, but users can still transfer their descriptions if desired. —MisterSynergy (talk) 21:41, 3 August 2020 (UTC)[reply]
- I can do that (along the same lines as [1]), but I'd really prefer not to do so as it would waste a lot of our editors' time. Thanks. Mike Peel (talk) 21:48, 3 August 2020 (UTC)[reply]
- Comment be careful with lowercasing as some may start with a proper noun, e.g. the short description of w:French Polynesia is "French overseas country in the Southern Pacific ocean" and w:Latinx is "U.S. gender-neutral term for people of Latin American heritage". Awkward42 (talk) 20:17, 3 August 2020 (UTC)[reply]
- @Awkward42: If there are standard ways to catch these exceptions, then I'd be happy to include them. Otherwise, using lower case in the comparison checks should avoid most problems here. Thanks. Mike Peel (talk) 20:26, 3 August 2020 (UTC)[reply]
- The premise of this bot seems to hinge on the assumption that descriptions in both systems need to be the same. While theoretically I might agree, I don't think that this will be helpful for the short term, as I expect it to trigger unnecessary strife and conflict between communities. Missing descriptions on the other hand, that might work. However, I still have copyright concerns with moving more than simple facts from WP to Wikidata. I know that there are lots of cases where people have done that manually, but still.... to do it with a bot (approved by the community). That just seems dangerous from a legal perspective. TheDJ (talk) 10:45, 4 August 2020 (UTC)[reply]
- Support, assuming the legal theory holds up (maybe have a check that it's not longer than x words?). Regarding #1, yes, given the proviso before. Regarding #2, could we check if the Wikidata one is newer and then not do the automatic overwrite? Or have some other way to stop the automatic overwrite? (E.g. a template on the talk page that says "English label intentionally diverged"). --Denny (talk) 13:18, 4 August 2020 (UTC)[reply]
- Oppose to replacing existing descriptions. Much per MisterSynergy and against enwiki dominance. Lymantria (talk) 11:19, 5 August 2020 (UTC)[reply]
- @Lymantria: "against enwiki dominance" - what do you mean by that please? Is it an objection about process or content? Thanks. Mike Peel (talk) 21:42, 6 August 2020 (UTC)[reply]
- @Mike Peel: It is an objection against assuming that the enwiki "short description" is better than an existing description here. Lymantria (talk) 07:25, 7 August 2020 (UTC)[reply]
- Oppose to replacing existing descriptions. Not any community except the Wikidata community should decide which description is better. By letting a bot replace existing Wikidata descriptions with the value from the enwiki short descriptions, the enwiki community gets to decide what we use as a description. We're not going to let enwiki decide what the description of an image on Commons should be (by forcing the enwiki caption as the commons description). And if a description has been cleared after the bot added it, the bot shouldn't readd it. Mbch331 (talk) 06:04, 7 August 2020 (UTC)[reply]
- Oppose replacing existing descriptions. We stopped or blocked two users doing that not too long ago, and the option in the enwiki script that did that was adjusted. If you want to overwrite enwiki descriptions with the ones from Wikidata, I'd be ok with that. --- Jura 00:32, 11 August 2020 (UTC)[reply]
The discussion here was archived, and the enwp one has also been closed (consensus against both options proposed there). Here, there's consensus against syncing with enwp (and I've struck that option out above accordingly), but I still think there's an opportunity to import short descriptions from enwp where we don't have a description here yet (from en:Category:Short description with empty Wikidata description, there are at least 170k of these). I haven't heard back from WMF legal about their opinion on this, and at this point I don't expect a reply, but in general I don't think short descriptions can be copyrighted (e.g., copyright.gov says "Copyright does not protect names, titles, slogans, or short phrases"). I could impose a word limit (less than 10 words?) before importing descriptions to further reduce that risk. @Lucas_Werkmeister, ChristianKl: you raised concerns about copyright in the discussion, can you elaborate on them and how you think the issues could be avoided please? @MisterSynergy, Awkward42, Denny, Lymantria, Mbch331, Jura1: do you want to change your !votes above per this change in direction or make a further comment? Thanks. Mike Peel (talk) 18:35, 5 September 2020 (UTC)[reply]
- I've modified the code (updated on bitbucket) so it only imports descriptions where there isn't already one here, and it skips short descriptions with over 10 bits of whitespace in them (not quite the same as words, as "this - example" would count as 3 words, but close enough). I've added an array with exceptions for words that it shouldn't convert to lower-case; I can easily add more exceptions if anyone wants to suggest them. Example edits at [2] and [3]. I'd like to start doing bot runs soon, initially with smaller runs to test the configuration, then a batch run to catch up, followed by daily imports after that. @Ymblanter: what do you think? Also pinging @Lucas_Werkmeister, ChristianKl: @MisterSynergy, Awkward42, Denny, Lymantria, Mbch331, Jura1: again in case my previous ping didn't get through. Thanks. Mike Peel (talk) 17:34, 12 September 2020 (UTC)[reply]
- Good enough for me. Lymantria (talk) 17:36, 12 September 2020 (UTC)[reply]
- Can't discover how you prevent re-adding the short description if removed (leaving an item with an empty description again), but other than that it's fine by me. And for your exception list: Dutch, Belgian, Flemish. Mbch331 (talk) 19:15, 12 September 2020 (UTC)[reply]
- @Mbch331: If the bot-imported description is removed then I would expect that the editor would put a new description in its place, which would stop the bot re-importing it - I can't see how insisting on having a blank description would be useful. I've added those words to the exception list, and I've also moved it on-wiki: if you add new lines to User:Pi bot/lowercase exceptions then they will also be excluded. Thanks. Mike Peel (talk) 19:46, 12 September 2020 (UTC)[reply]
- Let's see how often it happens. And thanks for moving the list onwiki. Easier to add new entries. Mbch331 (talk) 20:07, 12 September 2020 (UTC)[reply]
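The revised import rules discussed in this thread (import only where no description exists, skip descriptions with more than 10 runs of whitespace, lower-case the first letter unless the first word is on the exception list) could look roughly like the following Python sketch. This is an illustration under stated assumptions, not the bot's actual code: `prepare_description`, `LOWERCASE_EXCEPTIONS` and `MAX_WHITESPACE_RUNS` are hypothetical names, and the exception set is a stand-in built from words mentioned in this discussion rather than the contents of User:Pi bot/lowercase exceptions.

```python
import re
from typing import Optional

# Hypothetical stand-in for the on-wiki exception list; entries are
# illustrative, taken from examples raised in the discussion.
LOWERCASE_EXCEPTIONS = {"Dutch", "Belgian", "Flemish", "French", "U.S."}
MAX_WHITESPACE_RUNS = 10  # descriptions with more runs than this are skipped


def prepare_description(short_desc: str, existing: Optional[str]) -> Optional[str]:
    """Return the description to import, or None if nothing should be imported."""
    if existing and existing.strip():
        return None  # never overwrite an existing Wikidata description
    text = short_desc.strip()
    if not text:
        return None
    # Count runs of whitespace as a rough proxy for the number of words.
    if len(re.findall(r"\s+", text)) > MAX_WHITESPACE_RUNS:
        return None
    first_word = text.split()[0]
    if first_word in LOWERCASE_EXCEPTIONS:
        return text  # keep the capital of a listed proper noun or adjective
    return text[0].lower() + text[1:]
```

Because the check only looks at the first word, any capitalised first word that is missing from the list (for example "British") would still be lower-cased, which is why an editable exception list matters.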
- In total there are about 215k items without an English description whose English Wikipedia article does have a "short description". However, some short descriptions should not be imported and you should blacklist at least the following ones:
- Surname list
- Wikipedia list article
- Disambiguation page providing links to topics that could be referred to by the same search term
- Name list
- Wikimedia list article
- Index of articles associated with the same name
- Wikipedia bibliography
- list of ships with the same or similar names
- list of chemical structure articles associated with the same molecular formula
- Place
- Wikipedia index
- list of plants with the same or similar names
- wikimedia list article
- Wikipedia glossary
- Type of
- index of animals with the same common name
- Wikipedia list
- Wikipedia list of lists article
- Wikipedia disambiguation page
- Wikipedia list article: Flora of Palestine
- Wikipedia list of Marathi films
- Wikipedia timeline article
- Wikipedia film lists
- Links to Wikipedia articles about notable rock outcrops
- Wikipedia glos
- Wikipedia list of 2020 Mollywood films
- Wikipedia glossary list
- Wikipedia list articles
- Wikipedia list page
- Wikipedia list article of rivers of Georgia, U.S.
- Wikipedia's portal for exploring content related to General
- Wikipedia Biography
- Wikimedia disambiguation page
- Wikimedia index article
- Wikimedia list
- Wikimedia list articles
- Wikimedia list page
- Wikimedia data page
- Wikimedia List article
- wikimedia timeline article
- wikimedia index article
- index of chemical compounds with the same name
- list of people with the same nickname
- list of mountains with the same or similar names
- list of roads or other routes with the same name
- list of sports-related pages with the same or similar names
- index of enzymes associated with the same name
- list of locomotives with the same or similar names
- There is also wiki markup in some cases which you might want to remove from the short descriptions before importing them. Another very important case to ignore would be "United States historic place", which is the short description of ~38k Wikipedia articles, but as a heritage designation it is IMO not a good Wikidata description. —MisterSynergy (talk) 21:00, 12 September 2020 (UTC)[reply]
- @MisterSynergy: I've set up a blacklist at User:Pi bot/shortdesc exclusions, and updated the code to use it. Feel free to add more exceptions there if you want. Thanks. Mike Peel (talk) 08:31, 13 September 2020 (UTC)[reply]
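A check against such an exclusion list could be sketched as below. The sketch assumes the page holds one description per line and that matching is exact and case-sensitive; both are assumptions for illustration, and `parse_exclusions` and `is_excluded` are hypothetical names rather than functions from the bot's code.

```python
from typing import Set


def parse_exclusions(page_text: str) -> Set[str]:
    """Turn list text (one description per line) into a set of exact strings."""
    return {line.strip() for line in page_text.splitlines() if line.strip()}


def is_excluded(short_desc: str, exclusions: Set[str]) -> bool:
    """True if the short description exactly equals an excluded entry."""
    return short_desc.strip() in exclusions


# Example with a few entries from the list suggested above.
sample_page = """Surname list
Wikimedia list article
wikimedia list article

United States historic place
"""
exclusions = parse_exclusions(sample_page)
```

With exact matching, each capitalisation variant needs its own entry, and a longer description that merely contains an excluded phrase is not filtered out.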
@Ymblanter: I think this is ready to go, if it's something we want to do. Thanks. Mike Peel (talk) 18:51, 17 October 2020 (UTC)[reply]
- Let us wait for a couple of days and then I can approve the bot, I see that the objections have been addressed.--Ymblanter (talk) 18:55, 17 October 2020 (UTC)[reply]
- Maybe anything with "Wikipedia", "Wikimedia", "WMF", "article", "page" or "list" in the description shouldn't be imported. --- Jura 19:19, 17 October 2020 (UTC)[reply]
- +"this", "you", "we", "criminal", 7 words, etc. --- Jura 19:31, 17 October 2020 (UTC)[reply]
- +label in the description; descriptions starting with "is ", "a ", "the ". --- Jura 20:21, 17 October 2020 (UTC)[reply]
- @Jura1: I think those exclusions are too general, they could just as easily exclude good descriptions as they could bad ones. I'd prefer to stick with the list at User:Pi bot/shortdesc exclusions (which you can add to!), the candidates to import are checked against that (but wildcards aren't allowed). Thanks. Mike Peel (talk) 18:40, 19 October 2020 (UTC)[reply]
- Can you log those that match the above somewhere? --- Jura 18:42, 19 October 2020 (UTC)[reply]
- @Jura1: Probably it's easier to find them by a SPARQL query/set up a listeria page that tracks them? Thanks. Mike Peel (talk) 20:41, 19 October 2020 (UTC)[reply]
- @Jura1: Actually, I can temporarily set up a list of descriptions to double-check. I've set up a list of words at User:Pi bot/doublecheck words and starting words at User:Pi bot/doublecheck start words, which then flags them at User:Pi bot/doublecheck. Does that work for you? I would like to admin-protect these configuration pages at some point, though... Thanks. Mike Peel (talk) 21:00, 19 October 2020 (UTC)[reply]
- Looks good. BTW, for "we" (and possibly some of the others) it needs to include a word boundary marker, e.g. " we " or "\bwe\b". If you save it to a ".js" subpage, only your bot (and admins) can edit it. --- Jura 08:20, 20 October 2020 (UTC)[reply]
- Maybe I will recall a few other problems we had when we imported the short descriptions from the template they used to have at the bottom of articles. Plbot imported those. --- Jura 08:28, 20 October 2020 (UTC)[reply]
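The word-boundary point raised above can be illustrated with a short Python sketch. It assumes the double-check lists are plain lists of words and of starting strings; `compile_words` and `needs_double_check` are hypothetical names, and the sample entries are the ones suggested in this thread, not the actual contents of the configuration pages.

```python
import re
from typing import Iterable, Optional, Pattern


def compile_words(words: Iterable[str]) -> Optional[Pattern[str]]:
    """Build one case-insensitive pattern matching any listed word as a whole word."""
    cleaned = [re.escape(w.strip()) for w in words if w.strip()]
    if not cleaned:
        return None  # an empty list should flag nothing
    return re.compile(r"\b(?:" + "|".join(cleaned) + r")\b", re.IGNORECASE)


def needs_double_check(desc: str, pattern: Optional[Pattern[str]],
                       start_words: Iterable[str]) -> bool:
    """True if the description starts with a listed prefix or contains a listed word."""
    lowered = desc.lower()
    if any(lowered.startswith(s.lower()) for s in start_words if s):
        return True
    return pattern is not None and pattern.search(desc) is not None


# Sample entries taken from the suggestions in this thread.
word_pattern = compile_words(["this", "you", "we", "criminal"])
start_words = ["is ", "a ", "the "]
```

Without the `\b` markers, a plain substring test for "we" would also flag descriptions containing words such as "power" or "tower".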