Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data cleanliness #400

Open
jfly opened this issue Feb 23, 2016 · 7 comments
Open

Data cleanliness #400

jfly opened this issue Feb 23, 2016 · 7 comments
Labels
PRIORITY: low-priority A step above backlog - nice to have, but not essential to work on.

Comments

@jfly
Copy link
Contributor

jfly commented Feb 23, 2016

Right now, many database records don't actually pass our validations. I added a rake db:data:validate script in 168d45d to check all our records to see if they're valid (as of February 22nd, 2016, there are 1612 invalid records in our database). Here's a set of unique validation errors in our database (as of February 22nd, 2016):

An exception occurred: invalid date
An exception occurred: The single-table inheritance mechanism failed to locate the subclass: 'report'. This error is raised because the column 'type' is reserved for storing the class in case of inheritance. Please rename this column if you didn't intend it to be used for storing the inheritance class or overwrite CompetitionsMedia.inheritance_column to use another column for that information.
#<Competition errors: ["Competition nickname is invalid"]>
#<Competition errors: ["Competition nickname is invalid", "WCA delegate(s)  are not all delegates"]>
#<Competition errors: ["Eventspecs invalid event ids: 333bts"]>
#<Competition errors: ["Eventspecs invalid event ids: 333bts,360"]>
#<Competition errors: ["Eventspecs invalid event ids: 333ts,333bts"]>
#<Competition errors: ["Eventspecs invalid event ids: 360"]>
#<Competition errors: ["Eventspecs invalid event ids: mirbl"]>
#<Competition errors: ["Eventspecs invalid event ids: un3sbf"]>
#<Competition errors: ["Name is invalid"]>
#<Competition errors: ["Name is invalid", "Competition nickname is invalid"]>
#<Competition errors: ["Name is invalid", "Competition nickname is invalid", "WCA delegate(s)  are not all delegates"]>
#<Competition errors: ["Name is invalid", "WCA delegate(s)  are not all delegates"]>
#<Competition errors: ["Venue address can't be blank"]>
#<Competition errors: ["Venue address can't be blank", "Eventspecs invalid event ids: snake,222oh"]>
#<Competition errors: ["Venue address can't be blank", "WCA delegate(s)  are not all delegates"]>
#<Competition errors: ["Venue address can't be blank", "WCA delegate(s)  are not all delegates", "Eventspecs invalid event ids: 333r3,333bts"]>
#<Competition errors: ["Venue address can't be blank", "WCA delegate(s)  are not all delegates", "Eventspecs invalid event ids: snake"]>
#<Competition errors: ["Venue address can't be blank", "Website can't be blank"]>
#<Competition errors: ["Venue address can't be blank", "Website can't be blank", "WCA delegate(s)  are not all delegates"]>
#<Competition errors: ["Venue can't be blank"]>
#<Competition errors: ["Venue can't be blank", "Venue address can't be blank"]>
#<Competition errors: ["Venue can't be blank", "Venue address can't be blank", "WCA delegate(s)  are not all delegates"]>
#<Competition errors: ["Venue can't be blank", "Venue address can't be blank", "Website can't be blank"]>
#<Competition errors: ["Venue can't be blank", "Venue address can't be blank", "Website can't be blank", "WCA delegate(s)  are not all delegates"]>
#<Competition errors: ["Venue can't be blank", "WCA delegate(s)  are not all delegates"]>
#<Competition errors: ["Venue can't be blank", "Website can't be blank", "WCA delegate(s)  are not all delegates"]>
#<Competition errors: ["WCA delegate(s)  are not all delegates"]>
#<Competition errors: ["WCA delegate(s)  are not all delegates", "Eventspecs invalid event ids: 333bts"]>
#<Competition errors: ["WCA delegate(s)  are not all delegates", "Eventspecs invalid event ids: 333bts,333si"]>
#<Competition errors: ["WCA delegate(s)  are not all delegates", "Eventspecs invalid event ids: 333bts,360"]>
#<Competition errors: ["WCA delegate(s)  are not all delegates", "Eventspecs invalid event ids: 333ni"]>
#<Competition errors: ["WCA delegate(s)  are not all delegates", "Eventspecs invalid event ids: 333ni,333bts"]>
#<Competition errors: ["WCA delegate(s)  are not all delegates", "Eventspecs invalid event ids: 333ni,snake"]>
#<Competition errors: ["WCA delegate(s)  are not all delegates", "Eventspecs invalid event ids: mirbl"]>
#<Competition errors: ["Website can't be blank"]>
#<Competition errors: ["Website can't be blank", "WCA delegate(s)  are not all delegates"]>
#<Competition errors: ["Website is invalid"]>
#<Competition errors: ["Website is invalid", "Venue address can't be blank", "WCA delegate(s)  are not all delegates"]>
#<CompetitionOrganizer errors: ["Organizer can't be blank"]>
#<Poll errors: ["Deadline you can only change the deadline"]>
#<Post errors: ["Title has already been taken"]>
#<Registration errors: ["Competition Competition registration is closed"]>
#<Registration errors: ["Competition Competition registration is closed", "Events invalid event ids: 333bf"]>
#<Registration errors: ["Competition Competition registration is closed", "Events invalid event ids: 333bf,sq1,clock,333mbf,magic"]>
#<Registration errors: ["Competition Competition registration is closed", "Events invalid event ids: 333fm"]>
#<Registration errors: ["Competition Competition registration is closed", "Events invalid event ids: pyram"]>
#<Registration errors: ["Competition Competition registration is closed", "Events invalid event ids: sq1"]>
#<Registration errors: ["Competition Competition registration is closed", "Events invalid event ids: sq1,clock,magic"]>
#<Registration errors: ["Competition Competition registration is closed", "Events invalid event ids: sq1,magic"]>
#<Registration errors: ["Events invalid event ids: 222"]>
#<Registration errors: ["Events invalid event ids: 222,333ft"]>
#<Registration errors: ["Events invalid event ids: 333bf,333oh"]>
#<Registration errors: ["Events invalid event ids: 333fm"]>
#<Registration errors: ["Events invalid event ids: 333fm,333ft"]>
#<Registration errors: ["Events invalid event ids: 333fm,333mbf"]>
#<Registration errors: ["Events invalid event ids: 333fm,444bf,333mbf"]>
#<Registration errors: ["Events invalid event ids: 333fm,666,333mbf"]>
#<Registration errors: ["Events invalid event ids: 333fm,666,777"]>
#<Registration errors: ["Events invalid event ids: 333fm,666,777,333mbf"]>
#<Registration errors: ["Events invalid event ids: 333fm,666,777,444bf,333mbf"]>
#<Registration errors: ["Events invalid event ids: 333fm,666,777,444bf,555bf,333mbf"]>
#<Registration errors: ["Events invalid event ids: 333ft"]>
#<Registration errors: ["Events invalid event ids: 333mbf"]>
#<Registration errors: ["Events invalid event ids: 333mbf,mmagic"]>
#<Registration errors: ["Events invalid event ids: 333oh"]>
#<Registration errors: ["Events invalid event ids: 444bf"]>
#<Registration errors: ["Events invalid event ids: 444bf,333mbf"]>
#<Registration errors: ["Events invalid event ids: 555bf"]>
#<Registration errors: ["Events invalid event ids: 666,777"]>
#<Registration errors: ["Events invalid event ids: 666,777,444bf,333mbf"]>
#<Registration errors: ["Events invalid event ids: 666,777,444bf,555bf,333mbf"]>
#<Registration errors: ["Events invalid event ids: magic"]>
#<Registration errors: ["Events invalid event ids: mmagic"]>
#<Registration errors: ["Events must register for at least one event"]>
#<User errors: ["Birthdate must be in the past"]>
#<User errors: ["Full name can't be blank"]>
#<User errors: ["Unconfirmed WCA ID already assigned to a different user"]>
#<Vote errors: ["Poll poll is closed"]>

Also see #165.

@jonatanklosko
Copy link
Member

Just experienced #<Registration errors: ["Events must register for at least one event"]> while using <registration>.update!(<so on>) in a migration.
Moreover there is a registration whose competition is nil...

@jfly jfly changed the title Every record should pass validations Data cleanliness Jul 29, 2019
@jfly
Copy link
Contributor Author

jfly commented Jul 29, 2019

Discussed in email thread "Peculiar data in "events" tab for Toulon Open 2017": when we were launching the edit events page, we played around with it a bit for some competitions (there was no public view events page yet), and left some competitions in a state that does not reflect their final results. @SAuroux put together this query to identify mismatches:

SELECT competitionId, r_events, ce_events 
FROM (SELECT competitionId, count(distinct eventId) as r_events FROM Results GROUP BY competitionId) as r
INNER JOIN (SELECT competition_id, count(distinct event_id) as ce_events FROM rounds as ro INNER JOIN competition_events as ce on ce.id = ro.competition_event_id GROUP BY competition_id) as ce
ON r.competitionId = ce.competition_id
HAVING r_events <> ce_events

I think this query is correct, but "it only compares the number of events, and not the number of rounds of each event". @viroulep, could we leverage any of the results validation logic you've been working on to do a more complete check of the database?

@viroulep
Copy link
Contributor

Yes at some point running the individual validators will make it easier to do a complete check of the database (to the best of our knowledge at least).

@gregorbg gregorbg added PRIORITY: low-priority A step above backlog - nice to have, but not essential to work on. polish-serious-work labels Jun 18, 2021
@dunkOnIT
Copy link
Contributor

dunkOnIT commented Jan 1, 2025

@gregorbg this hasn't seemed to be an issue while I've been on the team

@gregorbg
Copy link
Member

gregorbg commented Jan 3, 2025

Well that depends on what you mean by "issue".

The core takeaway point here could be rephrased as "we have old legacy data in our DB [mostly Competitions, but also other models] that don't abide by our modern data quality standards". The simplest example is that a very, very old competition like WC2003 wouldn't pass our current-day validations anymore because it doesn't have a website schedule, etc.

This usually doesn't hurt us, because we never touch WC2003 and we most certainly don't update it in our application logic. But from a high-level perspective, it would still be nice to have all of our records conform to all of our validations.

@gregorbg gregorbg reopened this Jan 3, 2025
@dunkOnIT
Copy link
Contributor

dunkOnIT commented Jan 3, 2025

from a high-level perspective, it would still be nice to have all of our records conform to all of our validations.

What would our strategy be for achieving this, especially in the case of those older competitions? Give WRT a list of competitions that currently fail validations, the reasons why, and let them see what they're able to correct?

@gregorbg
Copy link
Member

gregorbg commented Jan 3, 2025

Depends. If it's a trivial error (like some regex not matching) we can fix it ourselves, but in cases like "schedule missing" yes WRT probably would have to deal with it. It really depends on a case-by-case basis.

The bigger issue would probably be to carve out time for such a big project anyways. It's not urgent at all, but still something "nice to have".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
PRIORITY: low-priority A step above backlog - nice to have, but not essential to work on.
Projects
Status: No status
Status: Done
Development

No branches or pull requests

5 participants