Private Database Implementation of Generative AI in Accounting

Vectorized Database, Created by Microsoft Copilot (Powered by DALL-E 3) from Microsoft

A private database is a data source that is not publicly accessible or shared with outside parties. When you use a large language model such as ChatGPT, a private database can help prevent data leakage, maintain accounting client confidentiality, and support compliance with data protection laws and regulations, such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), by ensuring that the model is trained only on data that is relevant, authorized, and anonymized. Data leakage is the exposure of sensitive or private information from the training data through the model’s output. For example, if a large language model is trained on accounting journal entries that contain personally identifiable information (PII), it might memorize some of those details and leak them when other users prompt it with certain queries. A private database can mitigate this risk by filtering PII and other confidential data out of the training data, or by applying techniques such as differential privacy or encryption to protect the data from unauthorized access or inference. A private database also lets you control the quality and diversity of the training data and avoid biases or inaccuracies that might affect the model’s performance or reliability.

Consequently, if you must expose information from a private database, such as a SQL accounting database, to a large language model (LLM) in order to benefit from natural language processing, despite the risk of data leakage, there are best practices to follow. We will consider anonymization, encryption, and vectorization. (N.B. In our examples, the LangChain framework plays a chief role in anonymizing, deanonymizing, encrypting, decrypting, and vectorizing.)

Conclusion

Using a vectorized database is not necessarily superior to anonymizing or encrypting the data in the database; it is a different approach with its own advantages and disadvantages. All three methods protect personally identifiable information and, in particular, guard against data leakage and misuse. Depending on the use case and the requirements, one may choose a single method or combine several to protect the data.

Anonymizing data from the database means removing or replacing personally identifiable information (PII), such as names, phone numbers, and email addresses, before the data reaches the LLM during in-context learning. Encrypting data from the database means transforming it into an unreadable form using an algorithm and a secret key, likewise during in-context learning with the LLM, thus preventing unauthorized access to private data. Only those who hold the key can decrypt the data and restore its original form.
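As a rough illustration of the anonymization idea, the following is a minimal sketch in plain Python rather than the LangChain tooling used in the series. The class name `ReversibleAnonymizer`, the two regular expressions, and the sample journal entry are all hypothetical, and real PII detection needs far broader coverage than two patterns. The sketch swaps each detected value for a placeholder before the text would be sent to an LLM, and keeps a private mapping so the original values can be restored in the response.

```python
import re

# Hypothetical, deliberately narrow patterns for illustration only.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


class ReversibleAnonymizer:
    """Replaces matched PII with placeholders and remembers how to undo it."""

    def __init__(self):
        self.placeholder_to_value = {}
        self.value_to_placeholder = {}

    def _placeholder_for(self, label, value):
        # Reuse the same placeholder when a value appears more than once.
        if value not in self.value_to_placeholder:
            placeholder = f"<{label}_{len(self.placeholder_to_value) + 1}>"
            self.value_to_placeholder[value] = placeholder
            self.placeholder_to_value[placeholder] = value
        return self.value_to_placeholder[value]

    def anonymize(self, text):
        for label, pattern in PII_PATTERNS.items():
            text = pattern.sub(
                lambda match, label=label: self._placeholder_for(label, match.group(0)),
                text,
            )
        return text

    def deanonymize(self, text):
        for placeholder, value in self.placeholder_to_value.items():
            text = text.replace(placeholder, value)
        return text


# Made-up journal entry memo used only to exercise the sketch.
entry = "Invoice 1042 sent to jane.doe@example.com, phone 555-867-5309."
anonymizer = ReversibleAnonymizer()
masked = anonymizer.anonymize(entry)       # this is what the LLM would see
restored = anonymizer.deanonymize(masked)  # applied to text coming back
```

The mapping never leaves the accountant's environment, so only the placeholder text is exposed to the model. Encryption follows the same round-trip shape, with a secret key taking the place of the mapping.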

Using a vectorized database means converting data into numerical vectors that can be used for mathematical operations and machine learning, such as finding the records most similar to a query. A vectorized database can also help leverage the general knowledge and information in the data without compromising the data subjects’ identities. It is used as an adjunct to an LLM, which is not trained on the vectorized data and therefore does not memorize it.
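To make the idea of vectors and similarity search concrete, here is a small sketch in plain Python. It is an assumption-laden toy: the `embed` function is a simple word-count vector standing in for a real embedding model, the in-memory `index` list stands in for a vector database, and the three journal entry descriptions are invented. What it shows is the general pattern: text is converted to vectors once, a query is converted the same way, and the closest records are retrieved and handed to the LLM as context instead of being used for training.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented journal entry descriptions, used only for illustration.
documents = [
    "Accrued payroll expense recorded at month end",
    "Office rent payment for the downtown lease",
    "Depreciation of equipment using the straight line method",
]

# The "vectorized database": each document stored alongside its vector.
index = [(doc, embed(doc)) for doc in documents]


def search(query, k=1):
    """Return the k documents whose vectors are closest to the query vector."""
    query_vector = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


top = search("how is equipment depreciation calculated")
```

In this sketch `top` holds the depreciation entry, because it is the only document that shares words with the query. A production setup would use learned embeddings, which capture meaning rather than exact word overlap.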

Certainly, the accountant can achieve the goals of client confidentiality and compliance with data protection laws and regulations, such as GDPR and CCPA, when using an LLM through anonymization, encryption, vectorization, or a combination of them. Next time, we recapitulate the “AI Use in Accounting” series and provide further reference material.

–Richard Thomas (with the assistance of Microsoft Copilot)

 
