Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search/sort having trouble with special characters #258

Open
jfly opened this issue Dec 7, 2015 · 7 comments
Open

Search/sort having trouble with special characters #258

jfly opened this issue Dec 7, 2015 · 7 comments
Labels
AREA: api TECH: ruby Requires knowledge of Ruby TYPE: bug Bug reported by a stakeholder

Comments

@jfly jfly added the TYPE: bug Bug reported by a stakeholder label Dec 7, 2015
@jfly jfly changed the title Search having trouble with special characters Search/sort having trouble with special characters Jan 11, 2016
@jfly
Copy link
Contributor Author

jfly commented Jan 11, 2016

A related bug from Radu talking about the registrations page:

And one more observation for the software team. It seems that competitors are sorted by name, but "ș" letter is at the end, after all others, while normally it should be after "s".

@pedrosino
Copy link
Contributor

I meant the search at persons.php

@jfly
Copy link
Contributor Author

jfly commented Sep 5, 2016

There's a pretty heated discussion about the Ł character on the mysql bug tracker: https://bugs.mysql.com/bug.php?id=9604. There's a lot of confusion/unsatisfying answers, until you get to Peter Gulutzan's well written explanation of what's going on here. The answer is that Mysql's utf8_unicode_ci collation uses the Unicode Collation Algorithm (UCA) 4.0.0, which doesn't treat Ł as level 1 equivalent to L (see http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt). This was changed in UCA 4.1.0:

Data tables for 4.1.0 contain the following changes:
...
2. The change of weights for characters Æ, Ǽ, Ǣ; Đ, Ð; Ħ; Ł, Ŀ; and Ø, Ǿ (and their lowercase and accented forms) to have secondary (accent) differences from AE; D; H; L; and O, respectively. This is to provide a much better default for languages in which those characters are not tailored. See also the section on user expectations.

UCA 4.1.0 and above do treat Ł as level 1 equal to L (see jfly/translighterate#2 (comment) for some more information).

One fix would be to change the collation of our tables to utf8_unicode_520_ci, which does treat Ł as level 1 equal to L:

MariaDB [(none)]> select "L" = "Ł" COLLATE utf8_unicode_520_ci;
+----------------------------------------+
| "L" = "Ł" COLLATE utf8_unicode_520_ci  |
+----------------------------------------+
|                                      1 |
+----------------------------------------+
1 row in set (0.00 sec)

MariaDB [(none)]> select "L" = "Ł" COLLATE utf8_unicode_ci;
+------------------------------------+
| "L" = "Ł" COLLATE utf8_unicode_ci  |
+------------------------------------+
|                                  0 |
+------------------------------------+
1 row in set (0.00 sec)

Right now, our default table collation is ``:

and the Persons.name column has COLLATE utf8_unicode_ci,

I think the right thing to do here would be to change all our table column COLLATE's to utf8_unicode_520_ci. @timhabermaas, @larspetrus, @FatBoyXPC do you have any experience making such a change? Anything we should worry about?

@jfly
Copy link
Contributor Author

jfly commented Sep 5, 2016

Unfortunately, it turns out our production version of Mysql is 5.5.49:

~ @production> mysql -e "SELECT VERSION();"
+-------------------------+
| VERSION()               |
+-------------------------+
| 5.5.49-0ubuntu0.14.04.1 |
+-------------------------+

which doesn't support utf8_unicode_520_ci:

mysql> show collation where Collation like '%utf8_unicode_%';
+-----------------+---------+-----+---------+----------+---------+
| Collation       | Charset | Id  | Default | Compiled | Sortlen |
+-----------------+---------+-----+---------+----------+---------+
| utf8_unicode_ci | utf8    | 192 |         | Yes      |       8 |
+-----------------+---------+-----+---------+----------+---------+
1 row in set (0.01 sec)

It's hard for me to find out when support for utf8_unicode_520_ci was added, but according to Peter Gulutzan's comment here, it seems like support was added in Mysql 5.6.

EDIT: Updating our version of mysql should be as simple as changing this one line in our Chef configuration: https://github.com/jfly/worldcubeassociation.org/blob/66a9530b6ab0e6dd23839a3f552862901363243c/chef/site-cookbooks/wca/recipes/default.rb#L74.

jfly added a commit to jfly/worldcubeassociation.org that referenced this issue Sep 5, 2016
@jfly
Copy link
Contributor Author

jfly commented Aug 21, 2017

Note that thanks to #1489, we're actually using utf8mb4 instead of utf8. That means we want to use utf8mb4_unicode_520_ci instead of utf8_unicode_520_ci as mentioned above.

However, after reading this article, it sounds like we might want to switch to utf8mb4_0900_as_cs for accent sensitive and case sensitive search (I cannot imagine why we would want case insensitive search by default). Of course, mysql 8 isn't currently supported by AWS RDS =(

@toadfrommariokart64
Copy link

toadfrommariokart64 commented Aug 11, 2022

unsure if this is the same issue, or should be made a seperate post, but searching for Denmark & Friends will show the comp with the name "Denmark &amp ; Friends" (except there's no space before the semicolon), so there seems to be an issue when un-escaping characters?

@dunkOnIT dunkOnIT added the TECH: ruby Requires knowledge of Ruby label Oct 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
AREA: api TECH: ruby Requires knowledge of Ruby TYPE: bug Bug reported by a stakeholder
Projects
Status: No status
Development

No branches or pull requests

5 participants