Data cleanliness #400
Comments
Just experienced |
Discussed in email thread "Peculiar data in "events" tab for Toulon Open 2017": when we were launching the edit events page, we played around with it a bit for some competitions (there was no public view events page yet), and left some competitions in a state that does not reflect their final results. @SAuroux put together this query to identify mismatches:

SELECT competitionId, r_events, ce_events
FROM (SELECT competitionId, count(distinct eventId) as r_events FROM Results GROUP BY competitionId) as r
INNER JOIN (SELECT competition_id, count(distinct event_id) as ce_events FROM rounds as ro INNER JOIN competition_events as ce on ce.id = ro.competition_event_id GROUP BY competition_id) as ce
ON r.competitionId = ce.competition_id
HAVING r_events <> ce_events

I think this query is correct, but "it only compares the number of events, and not the number of rounds of each event". @viroulep, could we leverage any of the results validation logic you've been working on to do a more complete check of the database?
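To illustrate the limitation quoted above, here is a minimal Ruby sketch of a finer-grained check that compares the number of rounds per event instead of only the number of events. It works on in-memory rows; the row keys (`:competition_id`, `:event_id`, `:round`) and both function names are assumptions made for this example, not the actual schema or any existing validator.

```ruby
require "set"

# Builds { competition_id => { event_id => number of distinct rounds } }.
# The row keys used here are hypothetical, chosen only for this sketch.
def rounds_per_event(rows)
  counts = {}
  rows.each do |row|
    events = (counts[row[:competition_id]] ||= {})
    (events[row[:event_id]] ||= Set.new) << row[:round]
  end
  counts.transform_values { |events| events.transform_values(&:size) }
end

# Returns the ids of competitions whose per-event round counts differ
# between the two sources (posted results vs. configured rounds).
def round_mismatches(result_rows, round_rows)
  from_results = rounds_per_event(result_rows)
  from_rounds = rounds_per_event(round_rows)
  (from_results.keys | from_rounds.keys).reject do |competition_id|
    from_results.fetch(competition_id, {}) == from_rounds.fetch(competition_id, {})
  end
end

# Hypothetical data: competition "A" has two rounds of 333 in its results
# but only one configured round, so an event count alone would not flag it.
example_results = [
  { competition_id: "A", event_id: "333", round: "1" },
  { competition_id: "A", event_id: "333", round: "f" },
  { competition_id: "B", event_id: "333", round: "f" },
]
example_rounds = [
  { competition_id: "A", event_id: "333", round: 1 },
  { competition_id: "B", event_id: "333", round: 1 },
]
puts round_mismatches(example_results, example_rounds).inspect
```

Unlike the INNER JOIN in the query, this sketch also reports competitions that appear in only one of the two sources, which may or may not be what you want.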
Yes, at some point running the individual validators will make it easier to do a complete check of the database (to the best of our knowledge at least).
@gregorbg this hasn't seemed to be an issue while I've been on the team.
Well, that depends on what you mean by "issue". The core takeaway here could be rephrased as "we have old legacy data in our DB [mostly Competitions, but also other models] that doesn't abide by our modern data quality standards". The simplest example is that a very, very old competition like This usually doesn't hurt us, because we never touch WC2003 and we most certainly don't update it in our application logic. But from a high-level perspective, it would still be nice to have all of our records conform to all of our validations.
What would our strategy be for achieving this, especially in the case of those older competitions? Give WRT a list of competitions that currently fail validations, along with the reasons why, and let them see what they're able to correct?
Depends. If it's a trivial error (like some regex not matching) we can fix it ourselves, but in cases like "schedule missing", yes, WRT would probably have to deal with it. It really has to be decided case by case. The bigger issue would probably be carving out time for such a big project anyway. It's not urgent at all, but still something "nice to have".
Right now, many database records don't actually pass our validations. I added a rake db:data:validate script in 168d45d to check all our records to see if they're valid (as of February 22nd, 2016, there are 1612 invalid records in our database). Here's a set of unique validation errors in our database (as of February 22nd, 2016):
Also see #165.
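The contents of the rake task in 168d45d are not shown here, so the following is only a rough sketch of the general idea: walk over records, skip the valid ones, and group the rest by unique error message. The `Record` struct, the `unique_validation_errors` helper, and the sample ids and messages are all hypothetical; in a Rails app, ActiveRecord objects already provide `valid?` and `errors.full_messages`, which would take the place of the stand-in struct.

```ruby
# Stand-in for a model record, used only so the sketch runs without Rails.
Record = Struct.new(:model, :id, :messages) do
  def valid?
    messages.empty?
  end
end

# Returns { "Model: message" => [ids of records with that error] }.
def unique_validation_errors(records)
  errors = Hash.new { |hash, key| hash[key] = [] }
  records.each do |record|
    next if record.valid?
    record.messages.each do |message|
      errors["#{record.model}: #{message}"] << record.id
    end
  end
  errors
end

# Hypothetical records; ids and messages are made up for illustration.
sample_records = [
  Record.new("Competition", "ExampleOpen2003", ["Schedule is missing"]),
  Record.new("Competition", "ExampleOpen2017", []),
  Record.new("Competition", "ExampleOpen2005", ["Schedule is missing"]),
]

unique_validation_errors(sample_records).each do |error, ids|
  puts "#{error} (#{ids.size} records)"
end
```

Grouping by message rather than by record keeps the report short, which makes it easier to tell trivial fixes apart from cases that need WRT input.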