Wikifunctions:Status updates/2024-12-12

Sketching a path to Abstract Wikipedia

The main goal of Wikifunctions is to support Abstract Wikipedia: a source of multi-lingual Wikipedia content where we can create and maintain the content only once, but have it available across many different languages to fill some of the gaps that currently exist in some Wikipedias.

Today, I would like to sketch out how the natural language generation for Abstract Wikipedia might develop. As an example goal, let’s take the following sentence (based on the English Wikipedia article about Waakye):

English
"Waakye is a Ghanaian dish of cooked rice and beans."
French
"Le waakye est un mets ghanéen de riz et de haricots cuits."
German
"Waakye ist ein ghanaisches Gericht aus gekochtem Reis und Bohnen."

We look at four stages to work towards this text.

Stage 1: String-based substitution

In Stage 1, we use simple string substitution, in the style of Mad Libs. This approach requires the user to carefully select the right strings, which is quite simple in English, but gets more complicated in French or German.

So we could have the following function calls:

Instance with origin string-based English("Waakye", "dish", "Ghanaian")

→ "Waakye is a Ghanaian dish."

Instance with origin string-based French("Le waakye", "un mets", "ghanéen")

→ "Le waakye est un mets ghanéen."

Instance with origin string-based German("Waakye", "ein Gericht", "ghanaisches")

→ "Waakye ist ein ghanaisches Gericht."

This is possible right now. It requires quite detailed grammatical knowledge by the function caller, as they need to enter the right form manually. The benefit of this method is difficult to see in this example.
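
As a rough illustration, Stage 1 can be sketched in Python as plain string templates. The function names below are hypothetical stand-ins for the three functions called above, not their actual implementations on Wikifunctions, and the templates cover only this sentence pattern.

```python
# Hypothetical stand-ins for the string-based functions above.
# The caller must pass strings that already agree grammatically.

def instance_with_origin_english(entity: str, category: str, origin: str) -> str:
    # The article "a" is hard-coded, so an origin such as "Italian" would break it.
    return f"{entity} is a {origin} {category}."


def instance_with_origin_french(entity: str, category: str, origin: str) -> str:
    # The adjective follows the noun; articles are part of the caller's input.
    return f"{entity} est {category} {origin}."


def instance_with_origin_german(entity: str, category: str, origin: str) -> str:
    # The caller passes "ein Gericht"; the adjective goes between article and noun.
    article, noun = category.split(" ", 1)
    return f"{entity} ist {article} {origin} {noun}."


print(instance_with_origin_english("Waakye", "dish", "Ghanaian"))
print(instance_with_origin_french("Le waakye", "un mets", "ghanéen"))
print(instance_with_origin_german("Waakye", "ein Gericht", "ghanaisches"))
```

Every grammatical decision (article, gender, adjective ending) sits with the caller here, which is the burden the later stages try to remove.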

Stage 2: Lexeme-based generation

In Stage 2, instead of using strings, we use Wikidata Lexemes, which has become possible in the past few months. This allows for a version of the function where the function caller does not have to worry about agreement or about entering the right form manually; instead, the function implementer needs to select the right form from the Lexeme. This shifts some of the burden from the function user to the function author.

This makes the calling much simpler: we don’t have to know whether "waakye" in French will be "Le waakye" or "La waakye", we don’t have to select the agreeing adjective in German ("ghanaisches Gericht" or "ghanaischer Gericht"), etc. The correct form will be chosen by the Function.

Now we would have the following function calls:

Instance with origin Lexeme-based English(Lxxx/Waakye, L3964/dish, Lxxx/Ghanaian)

→ "Waakye is a Ghanaian dish."

Zxxx/Instance with origin Lexeme-based French(Lxxx/waakye, L24812/mets, Lxxx/ghanéen)

→ "Le waakye est un mets ghanéen."

Zxxx/Instance with origin Lexeme-based German(Lxxx/Waakye, L500931/Gericht, Lxxx/ghanaisch)

→ "Waakye ist ein ghanaisches Gericht."

You will also find that a lot of Lexemes are missing for this particular example, such as the French Lexeme for something from Ghana. We in the Wikimedia movement need to think about how to approach this gap between what is currently in Wikidata's Lexemes and what should be there.

We were hoping that this would be possible right now, and we created a number of functions during our offsite to test these capabilities. Unfortunately, we learned that the system is currently failing to evaluate most such function calls, and accordingly we decided to put a big focus in the upcoming Quarter on getting these functions to run.
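
To make the shift of burden concrete, here is a minimal Python sketch of the German case. The `Lexeme` class is a toy stand-in invented for this example, not the Wikidata Lexeme data model or the Wikifunctions Type; it is assumed to carry a lemma, a grammatical gender, and adjective forms keyed by gender.

```python
from dataclasses import dataclass, field


@dataclass
class Lexeme:
    # Toy stand-in for a Wikidata Lexeme (an assumption for this sketch):
    # a lemma, a grammatical gender, and forms keyed by the gender they agree with.
    lemma: str
    gender: str = ""
    forms: dict = field(default_factory=dict)


# Nominative singular indefinite articles in German.
INDEFINITE_ARTICLE_DE = {"masculine": "ein", "feminine": "eine", "neuter": "ein"}


def instance_with_origin_german(entity: Lexeme, category: Lexeme, origin: Lexeme) -> str:
    # The function, not the caller, picks the article and the agreeing adjective form.
    article = INDEFINITE_ARTICLE_DE[category.gender]
    adjective = origin.forms[category.gender]
    return f"{entity.lemma} ist {article} {adjective} {category.lemma}."


waakye = Lexeme("Waakye")
gericht = Lexeme("Gericht", gender="neuter")
ghanaisch = Lexeme(
    "ghanaisch",
    forms={"masculine": "ghanaischer", "feminine": "ghanaische", "neuter": "ghanaisches"},
)

print(instance_with_origin_german(waakye, gericht, ghanaisch))
```

The caller now passes only the three Lexemes; the choice between "ghanaisches" and "ghanaischer" follows from the gender recorded on the noun.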

Stage 3: Item-based generation

In the third stage, we would use Wikidata items to help us select Lexemes from a given language that have comparable meanings. The function caller does not have to know or look up the right Lexeme in all the languages they want to generate the text in. They can just put in the relevant Wikidata items, and the function developer can implement the relevant lookups.

This means that whether or not the function caller knows that the concept "dish" is called "mets" in French or "Gericht" in German, they will still be able to create perfectly fluid and correct sentences in those languages.

This allows us to make the following calls (note that all three calls use the same function here, and the caller does not have to know the languages at all):

Instance with origin(Q14783691/Waakye, Q746549/dish, Q117/Ghana, Z1002/English)

→ "Waakye is a Ghanaian dish."

Instance with origin(Q14783691/Waakye, Q746549/dish, Q117/Ghana, Z1004/French)

→ "Le waakye est un mets ghanéen."

Instance with origin(Q14783691/Waakye, Q746549/dish, Q117/Ghana, Z1430/German)

→ "Waakye ist ein ghanaisches Gericht."

Note that the function will in most cases just route to the language-specific functions developed for the previous stage, but that happens behind the scenes and transparently for the function caller.

This is currently not possible to implement on Wikifunctions — we still need to add a function that allows us to find the Lexemes connected to a given Item. We will work on that in the coming Quarter, and are thankful to the Search and Wikidata teams for the necessary pre-work they have recently performed to unlock the possibility.
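
The routing described above could look roughly like the following Python sketch. The `LEXICON` table is a hypothetical placeholder for the missing Item-to-Lexeme lookup, and the two renderers are simplified stand-ins for the language-specific functions of the previous stage; to keep the sketch short, the table stores already inflected words rather than Lexemes.

```python
# Hypothetical lookup from (Wikidata Item, language) to a word. On Wikifunctions
# this would be a lookup of the Lexemes connected to an Item.
LEXICON = {
    ("Q14783691", "Z1002"): "Waakye",
    ("Q746549", "Z1002"): "dish",
    ("Q117", "Z1002"): "Ghanaian",
    ("Q14783691", "Z1430"): "Waakye",
    ("Q746549", "Z1430"): "Gericht",
    ("Q117", "Z1430"): "ghanaisches",  # pre-inflected for brevity
}


def render_english(entity: str, category: str, origin: str) -> str:
    return f"{entity} is a {origin} {category}."


def render_german(entity: str, category: str, origin: str) -> str:
    return f"{entity} ist ein {origin} {category}."


# One language-specific renderer per supported language.
RENDERERS = {"Z1002": render_english, "Z1430": render_german}


def instance_with_origin(entity: str, category: str, origin: str, language: str) -> str:
    # Resolve each Item to a word in the target language, then route to
    # the renderer for that language.
    words = [LEXICON[(item, language)] for item in (entity, category, origin)]
    return RENDERERS[language](*words)


print(instance_with_origin("Q14783691", "Q746549", "Q117", "Z1002"))
print(instance_with_origin("Q14783691", "Q746549", "Q117", "Z1430"))
```

The caller's arguments are identical across languages except for the last one, which is the point of this stage.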

Stage 4: Item-based content

The final stage we want to discuss today is based on using the knowledge in Wikidata to create text. We can pull from Wikidata that Q14783691/Waakye is a dish from Q117/Ghana, we can look up the ingredients and their Lexemes, etc. Given the current knowledge about Waakye in Wikidata, this could then generate the following sentences:

Food with origin and ingredients(Q14783691/Waakye, Z1002/English)

→ "Waakye is a Ghanaian dish with bean, rice, water, and salt."

Food with origin and ingredients(Q14783691/Waakye, Z1004/French)

→ "Le waakye est un plat ghanéen composé de haricots, de riz, d'eau et de sel."

Food with origin and ingredients(Q14783691/Waakye, Z1430/German)

→ "Waakye ist ein ghanaisches Gericht aus Bohnen, Reis, Wasser und Salz."

This further simplifies writing the function calls: all we need to select is the dish and the language, and we get a whole sentence that can, in many cases, make a good opening sentence for the Wikipedia article about the given dish, or as an entry or short description in various places.
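
As a sketch of the English case only, the following Python assumes the relevant statements have already been fetched into a small dictionary. The `CLAIMS` structure and its keys are invented for this example and do not mirror the Wikidata data model; the values are taken from the example sentence above.

```python
# Hypothetical, pre-fetched data about the Item (structure is an assumption).
CLAIMS = {
    "Q14783691": {
        "label": "Waakye",
        "instance_of": "dish",
        "origin_adjective": "Ghanaian",
        "ingredients": ["bean", "rice", "water", "salt"],
    }
}


def join_english(items: list) -> str:
    # Join a list with commas and a final "and" (serial comma for three or more).
    if len(items) <= 2:
        return " and ".join(items)
    return ", ".join(items[:-1]) + ", and " + items[-1]


def food_with_origin_and_ingredients_english(item: str) -> str:
    # Everything except the Item ID comes from the stored statements.
    claims = CLAIMS[item]
    ingredients = join_english(claims["ingredients"])
    return (
        f"{claims['label']} is a {claims['origin_adjective']} "
        f"{claims['instance_of']} with {ingredients}."
    )


print(food_with_origin_and_ingredients_english("Q14783691"))
```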

I hope that this gives a good overview of our next few planned steps with regards to natural language generation and how Wikifunctions can support bringing together our different language communities.

Team offsite in Lisbon

Abstract Wikipedia team at the offsite in Lisbon 2024. From left to right, front row: Cory Massaro, Grace Choi, Genoveva Galarza Heredero, Daphne Smit. Back row: James Forrester, Denny Vrandečić, David Martin, Sharvani Haran. Not in picture: Amy Tsay, Amin Al Hazwani, Luca Martinelli, Elena Tonkovidova, Vaughn Walters.

Last week, the team met for its annual meeting in Lisbon, Portugal. What a beautiful city! We enjoyed walking through the city, and had very productive meetings, discussing our plans, team procedures, and using the time for bonding and social cohesion – very difficult and important to achieve in a team that is fully remote.

The most tangible outcome is the planning for the next Quarter; we had very lively discussions to find a consensus, which we still need to write up. We will report on the plan in one of the next two updates.

New tool for querying Wikifunctions

Feeglgeef created a new tool that allows you to query Wikifunctions in a very flexible way. You can search for functions with implementations in Python, Types that use numbers as keys, functions that take three arguments, or return booleans. The tool is available on Replit (note that this is outside of Wikimedia servers), and examples and documentation of the query language are linked from the front page of the tool: wf-query.replit.app

Hogü-456 created an overview of existing tools. If you are aware of more tools, feel free to add them: Wikifunctions:Tools

Recent Changes in the software

There's no release of MediaWiki software this week due to the Release Engineering team's ordered release freeze, so nothing new to update. As always, please alert us if you run into any issues.

News in Types: Gregorian calendar date, Byte, Unicode code point

We finally have a Type for Gregorian calendar dates. We have been working towards it for a while, having created Types for the relevant months, for years, etc. The discussion was lengthy and didn’t lead to a full consensus. A rationale for the decisions on the design of the Type is provided. We invite you to create functions using the Type!

This has been by far the most complex Type we are providing so far.

We would like to create Types for other, non-Gregorian calendars, like the Chinese, Ethiopian, Japanese, Hebrew, and other calendars. If you know any of these calendars well, please reach out so that we can create the respective Types.

In other Type-related work, proposals for fixing the Byte Type and the Unicode code point Type (previously character Type) have been made. Input and discussion are very welcome.

Recordings of December’s Volunteers’ Corner

December 2024 Volunteers' Corner

We had a Volunteers’ Corner this Monday, December 9. It was lively with many good questions. A recording of the Corner is available on Commons.

The function we built together is featured below as the Function of the Week.

Recording of Denny’s SWIB24 keynote

Denny Vrandečić gave a keynote address at the Semantic Web in Libraries 2024 conference. The topic was the role of knowledge representations in a world of large language models. The recording is available on YouTube.

Function of the Week: how many days between two days in the Roman year

The last newsletter introduced the days of the Roman year as a new Type. As of now, we have 18 new functions using the Type. This week’s Volunteers’ Corner also created one of them, so we will take a look at it.

How many days are there between two days? Function Z20733 can answer that question. The function has three arguments: the two days, and a Boolean which tells us whether the days are in a leap year or not. It returns a natural number stating how many days are between the two given days.

It might be easiest to clarify what the function does by looking at the tests:

The tests are incomplete. The most notable omission is any test where the first day comes after the second, and what exactly that means for how the leap year argument is understood.

So far there is only one implementation for this function, a composition. We didn’t have much time left in the Volunteers’ Corner, and composition seemed the easiest way to implement the function.

The core of the composition is to turn both days into a number, counting which day of the year it is (i.e. 1 January is the first day, 2 January the second, 1 February the 32nd, etc.), and then subtract the first number from the second. The result is then turned from an integer to a natural number, in order to avoid negative numbers.
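
The same idea can be sketched in Python. This is not the composition used on Wikifunctions: days are represented here as hypothetical (month, day) tuples rather than the Roman year Type, and the step from integer to natural number is assumed to take the absolute value, which the text above does not specify.

```python
# Days per month in a non-leap year, January first.
DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]


def day_of_year(month: int, day: int, leap_year: bool) -> int:
    # Count the days in all earlier months, then add the day of the month.
    days_before = sum(DAYS_IN_MONTH[: month - 1])
    if leap_year and month > 2:
        days_before += 1  # 29 February has already passed
    return days_before + day


def days_between(first: tuple, second: tuple, leap_year: bool) -> int:
    # Subtract the first day's position in the year from the second's.
    difference = day_of_year(*second, leap_year) - day_of_year(*first, leap_year)
    # Assumption: the integer is made non-negative by taking its absolute value.
    return abs(difference)


print(day_of_year(2, 1, False))
print(days_between((2, 28), (3, 1), True))
```

With the absolute value, swapping the two days gives the same result; an implementation that wraps around the end of the year would behave differently, which is exactly the case the existing tests leave open.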