Legal awareness, data management and documentation, together with direct collaboration, make for a sustainable long-term artificial intelligence journey
For centuries, libraries have kept the world’s information organized. In the age of big data, we would do well to respect key library principles a little more.
Have you ever wondered, while downloading messy source data or screening expensive base data collected a year ago, what the code CX-98/001 was used for, and whether you may use it in a new project?
How codes, labels and numbers in the data flowing into models are defined affects not only the design and development of the models, but also how the data is managed and governed. This context creates a number of tensions between the many disciplines involved. I share selected observations on how to address these conflicts; I intentionally do not cover finance and product management.
The article briefly discusses data and quality management in general before focusing on findings about documentation, data responsibility and legal aspects from a practitioner’s perspective.
Data management and quality
Honestly, every artificial intelligence / ML initiative lives on its data. Not only input data, but also training, validation, test, hyperparameter and test-result data, and in particular the final model parameters deployed in the live model. That is a lot of trackable data points, and things get crowded when people don’t necessarily know who changed what and when.
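As a minimal sketch of what “who changed what and when” can look like in practice (the artifact names, versions and people below are invented for illustration; a real setup would use an experiment tracker):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ArtifactRecord:
    """One tracked artifact of an AI/ML initiative (dataset split, config, model)."""
    name: str          # e.g. "train_split" or "final_model_params"
    version: str       # content hash or version label
    changed_by: str    # who last changed it
    changed_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

# Tiny in-memory registry; in practice this lives in a tracking system.
registry: list = []
registry.append(ArtifactRecord("train_split", "v3", "alice"))
registry.append(ArtifactRecord("hyperparameters", "lr=0.01,batch=32", "bob"))
registry.append(ArtifactRecord("final_model_params", "sha256:ab12", "alice"))

# "Who changed what" becomes a query instead of guesswork.
changed_by_alice = [r.name for r in registry if r.changed_by == "alice"]
```

Even a registry this simple beats reconstructing the history from chat logs after the fact.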
The importance of data quality and other operational issues is attracting more interest. In addition, with the rise of MLOps, data quality research is slowly finding its way into model operations, an important part of implementing AI / ML initiatives.
First-hand experience with AI / ML infrastructure both with and without data quality control has shown that mandatory procedures are needed, at the latest when a viable product is selected for production rollout. If this is not done, hidden technical debt catches up with you.
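A mandatory procedure can start very small. The sketch below (field names and thresholds are my own assumptions) shows a quality gate that refuses a batch before it reaches training:

```python
def quality_gate(rows, required_fields, min_rows=1):
    """Reject a batch that is empty or has records missing required fields."""
    problems = []
    if len(rows) < min_rows:
        problems.append(f"too few rows: {len(rows)} < {min_rows}")
    for i, row in enumerate(rows):
        missing = [f for f in required_fields if row.get(f) in (None, "")]
        if missing:
            problems.append(f"row {i} missing {missing}")
    return problems  # an empty list means the batch may proceed

batch = [
    {"item_code": "CX-98/001", "quantity": 4},
    {"item_code": "", "quantity": 2},  # would slip through unchecked
]
issues = quality_gate(batch, required_fields=["item_code", "quantity"])
```

Making such a gate a hard precondition of the pipeline is what turns good intentions into a procedure.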
What is CX-98/001?
In the example above, without further context, you cannot tell whether the code “CX-98/001” refers to a garment or a machine, or whether it is a typo and refers to a flight. This is exactly what happens when data and documentation are separated and maintained independently. The scientific communities established metadata standards for this quite a few years ago.
Too many times someone “figures out the data”, does this once and then forgets everything, or worse, leaves the team or the company. The effort spent understanding the data well enough to place it in context is irreversibly lost. Adopting scientific standards for data documentation clearly helps. Experience with both well-documented and poorly documented data shows a factor-of-three difference in project and employee ramp-up times.
In a better situation, some documents exist, but even then it may be unclear where they are stored or who has access. Data transfer agreements and data sheets are good tools. However, if either is separated from the code and data, the work is often futile. As soon as data and its meaning are separated, the meaning is lost during data discovery.
For decades, data elements have been encoded for more efficient storage or transmission. And that is fine. But when the data is used again, the semantics must be restored. One good practice to prevent this separation is to translate the codes back into their real meaning in all exposed data, keeping the encoded form only for transfer or storage, and ideally not even there. Another option is to use attribute descriptions diligently: most SQL systems have known the COMMENT statement for years. Nothing stops anyone from adding a wiki link in a comment when more detail needs to be provided.
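On the application side, the translation step can be a thin decoding layer over a code book kept next to the data (in SQL, the COMMENT statement plays a similar role). The meanings below are invented for illustration; CX-98/001 has no known meaning in the source:

```python
# Hypothetical code book; in practice maintained next to the data,
# e.g. as column COMMENTs or a versioned reference table.
CODE_BOOK = {
    "CX-98/001": "cotton shirt, model 98, first revision",
    "CX-98/002": "cotton shirt, model 98, second revision",
}

def decode(record):
    """Attach the human-readable meaning before exposing a record."""
    code = record["item_code"]
    return {**record, "item_meaning": CODE_BOOK.get(code, "UNDOCUMENTED")}

exposed = decode({"item_code": "CX-98/001", "quantity": 4})
```

Keeping the code book under version control next to the data is what prevents the two from drifting apart.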
When data and documentation are kept close together, the crucial information is available during data retrieval, and access is automatic. If for some reason the data quality system does not detect a new undocumented code, the next data scientist performing an analysis will at least wonder what that one code means, rather than wondering what all the codes mean and never realizing that one of them is new. This adds some tolerance for gaps in the data quality rules.
Unfortunately, the tension between writing code and documenting it is eternal, and the same shows up in data work. Small notes go a long way. A brief comment on the purpose of an attribute’s codes and a quick link to the details will not interrupt the workflow.
80% of the work
Obtaining the right data is clearly important before good features can be built. However, many organizations have split data acquisition into two expert roles: data engineer and data scientist. For all practical purposes that is not ideal, and it is better handled through role layering than role splitting. Data scientists and machine learning engineers usually know what they need. By speaking directly with the people who can provide the data, often themselves engineers or statisticians, selecting the right data together is faster than involving a third party. In addition, direct interaction helps create better features. Because data can be selected for the specifics of the use case, quality rules for incoming data can be developed quickly and legal aspects discussed immediately (see below). Rather than acting as a highly skilled IT broker, a data engineer’s expertise is much better spent on a scalable feature store and the organization of the data warehouse.
All this data work is not exactly a popular activity. But it beats running a model fed with data full of bogus codes, which then suffers degraded performance due to quality issues or, worse, cannot legally be used for the model’s purpose.
A friendly lawyer on the next floor
Lawful use and access rights are vital. As with documentation, these should preferably be stored so that they surface during data retrieval. If GDPR is any indication, the consequences of unlawful use of data can be serious. And that is before even considering the lost benefits of responsible artificial intelligence.
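One way to make legal status surface during retrieval is to store it as metadata on the record itself. A sketch, where the field names, the lawful basis and the purposes are all illustrative assumptions, not legal advice:

```python
# Lawful-use metadata stored next to the record so it travels with the data.
record = {
    "item_code": "CX-98/001",
    "quantity": 4,
    "_legal": {
        "lawful_basis": "contract",  # illustrative, e.g. a GDPR basis
        "allowed_purposes": ["fulfilment", "analytics"],
        "review_by": "2025-12-31",
    },
}

def may_use(record, purpose):
    """Refuse use for purposes not covered by the stored legal metadata."""
    return purpose in record["_legal"]["allowed_purposes"]

ok = may_use(record, "analytics")       # permitted purpose
blocked = may_use(record, "marketing")  # not covered, escalate to legal
```

A check like `may_use` at the start of a pipeline turns “we should ask legal” into a failing run instead of a late surprise.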
Even worse is when an organization does not know the legal status of its data, because that information was lost during countless transfers or was never collected. The effort needed to enable lawful use can be extensive. Legal risk may well delay or even prevent the rollout of a well-functioning machine learning use case.
Basically, it’s never too early to talk to a lawyer. This is true in life, as well as for the data you want to use in your models.
Respecting thousands of years of experience in organizing libraries, a few practical steps apply to storing today’s data.
In summary, to increase the likelihood that the organization understands what “CX-98/001” means and whether a record may be used, the following four aspects are key:
- Establish data management and data quality control
- Organize data documentation close to real data
- Keep lawful-use information close to the record
- Practice fast and direct communication between the data source and model-building teams
References
- Proposal for a Regulation laying down harmonised rules on artificial intelligence (2021, European Commission)
- If Your Data Is Bad, Your Machine Learning Tools Are Useless (2018, Thomas C. Redman)
- The DAMA Guide to the Data Management Body of Knowledge (2010, DAMA International)
- Data documentation (Illinois Library, online)
- Where does metadata live? (Tandem Holvi, online)
- Machine learning algorithms: a study on noise sensitivity (2003, Kalapanidas, Avouris, Craciun & Neagu)