
A private database is a data source that is not publicly accessible or shared with anyone else. When using a large language model (LLM) like ChatGPT, it can help prevent data leakage, maintain accounting-client confidentiality, and support compliance with data protection laws and regulations, such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), by ensuring that the model only sees data that is relevant, authorized, and anonymized. Data leakage is the exposure of sensitive or private information from the training data through the model’s output. For example, if a large language model is trained on accounting journal entries that contain personally identifiable information (PII), it might memorize some of those details and leak them when other users issue certain queries. A private database can mitigate this risk by filtering out any PII or other confidential data before it reaches the model, or by applying techniques such as differential privacy or encryption to protect the data from unauthorized access or inference. By using a private database, one can also control the quality and diversity of the data, and avoid potential biases or inaccuracies that might affect the model’s performance or reliability.
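As a deliberately crude illustration of such PII filtering, the short sketch below uses simple regular expressions to scrub phone numbers, email addresses, and card numbers before any text reaches an LLM. The scrub function and its patterns are illustrative assumptions only; production systems should use a dedicated detection tool such as Microsoft Presidio, covered below.
Python Script
import re
# Pattern-based PII scrub (illustrative only; real PII detection needs a
# dedicated tool such as Microsoft Presidio, shown later in this article)
def scrub(text):
    text = re.sub(r"\b\d{4}-\d{4}-\d{4}-\d{4}\b", "[CREDIT_CARD]", text)
    text = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b", "[EMAIL_ADDRESS]", text)
    text = re.sub(r"\b\d{3}-\d{4}\b", "[PHONE_NUMBER]", text)
    return text
print(scrub("Call 555-1234 or email john.doe@example.com."))
# -> Call [PHONE_NUMBER] or email [EMAIL_ADDRESS].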
Consequently, if you must use information from a private SQL accounting database, for example, with a large language model (LLM) despite the risk of data leakage, in order to benefit from natural language processing, there are best practices. We will consider anonymization, encryption, and vectorization (N.B. in our examples, the LangChain framework plays a chief role in anonymizing, deanonymizing, and vectorizing, while the Python cryptography package handles encrypting and decrypting):
Anonymization is the process of removing or replacing personally identifiable information (PII) in data, such as names, phone numbers, credit card numbers, addresses, &c. This is done to protect the privacy and security of the data subjects and to comply with data protection laws and regulations. Anonymization can also help prevent data leaks and misuse when using large language models (LLMs) and generative AI.
To use anonymization with an LLM, one possible approach is as follows:
1. Choose an anonymization tool or framework that can perform both anonymization and deanonymization of data. Examples of such tools are Microsoft Presidio and Faker; differential privacy is a related, complementary technique for protecting individuals in aggregate data.
2. Use the anonymization tool to anonymize the data before passing it to the LLM. You can specify which types of PII you want to anonymize, such as PERSON, PHONE_NUMBER, EMAIL_ADDRESS, CREDIT_CARD, &c. The tool will replace the original data with fake or random placeholders, such as John Doe, 555-1234, john.doe@example.com, 1234-5678-9012-3456, &c.
3. Use the anonymized data as input or context for the LLM. With LangChain, for example, you can create a chat model (such as ChatOpenAI from the langchain_openai package, pointed at gpt-3.5-turbo or another model) and invoke it with the anonymized text; the call returns the LLM’s response.
4. Optionally, use the same anonymizer instance to deanonymize the LLM’s response. It restores the original data from the placeholders, such as
John Doe -> Alice Smith,
555-1234 -> 987-6543,
john.doe@example.com -> alice.smith@company.com,
1234-5678-9012-3456 -> 4321-8765-1098-7654, &c.
Here is an example of how the code might look:
Python Script
# This is a generalized outline of the required script; an OpenAI API key is
# required, and it is read from the OPENAI_API_KEY environment variable (or it
# can be set in code, e.g. os.environ["OPENAI_API_KEY"] = "sk-…").
#
# Import the required packages (the reversible anonymizer uses the
# presidio_analyzer, presidio_anonymizer, and faker packages internally,
# so they must be installed)
from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer
from langchain_openai import ChatOpenAI
# Create an instance of the PresidioReversibleAnonymizer class
anonymizer = PresidioReversibleAnonymizer(
    analyzed_fields=["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS", "CREDIT_CARD"],
    faker_seed=42,
)
# Anonymize the data before passing it to the LLM
data = (
    "Hello, my name is Alice Smith and my phone number is 987-6543. "
    "You can also email me at alice.smith@company.com. "
    "My credit card number is 4321-8765-1098-7654."
)
an_data = anonymizer.anonymize(data)
# Interact with an LLM like ChatGPT, using the anonymized data as the context
llm = ChatOpenAI(model_name="gpt-3.5-turbo")
res = llm.invoke(f"Summarize the following customer note:\n{an_data}")
# Deanonymize the LLM's response using the same anonymizer instance
de_res = anonymizer.deanonymize(res.content)
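Because faker_seed is fixed in this sketch, the anonymizer produces the same placeholder values on every run, and it keeps an internal mapping between placeholders and original values (exposed as deanonymizer_mapping on the instance), which is what makes step 4 possible.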
Anonymization Advantages and Disadvantages
The advantages of using anonymization with an LLM are:
· It protects the privacy and security of the data by removing or replacing any PII before passing it to the LLM.
· It reduces the risk of data leaks and misuse when using third-party LLMs and generative AI tools.
· It allows the LLM to access and leverage the general knowledge and information in the data without compromising the data subjects’ identities.
The disadvantages of using anonymization with an LLM are:
· It may introduce errors or inaccuracies in the anonymization and deanonymization processes, which could affect the quality and reliability of the data and the LLM’s response.
· It may not capture the nuances and subtleties of natural language, such as tone, emotion, sarcasm, &c.
· It may not be compatible with all types of data, LLMs, or anonymization models, which could limit the scope and applicability of the approach.
This was a brief and simplified overview of how to use anonymization with an LLM.
Encryption is the process of transforming data into an unreadable form using a secret key or algorithm. This is done to prevent unauthorized access or modification of the data. Only those with the key or algorithm can decrypt the data and restore its original form. Encryption can help protect the privacy and security of a private database when connecting it with a large language model (LLM).
To use encryption with an LLM, one possible approach is as follows:
1. Choose an encryption algorithm and library that can perform both encryption and decryption of data. Examples of such algorithms are AES and RSA (older ones such as DES are obsolete and best avoided).
2. Use the encryption library to encrypt the data from the private database so that it is unreadable at rest and in transit. With the Python cryptography package, for example, a Fernet cipher (built on AES) takes a secret key and transforms the original data into a bytes object of ciphertext.
3. Decrypt the data only at the point of use and pass the plaintext to the LLM as context; an LLM cannot interpret ciphertext, so encryption protects the data outside the prompt rather than inside it. As in the anonymization example, the chat model returns a string with the LLM’s response.
4. Optionally, encrypt the LLM’s response with the same key and cipher before storing it, so the stored output enjoys the same protection as the source data.
Here is an example of how the code might look:
Python Script
# This is a generalized outline of the required script; an OpenAI API key is
# required, and it is read from the OPENAI_API_KEY environment variable (or it
# can be set in code, e.g. os.environ["OPENAI_API_KEY"] = "sk-…").
#
# Import the required packages
from cryptography.fernet import Fernet
from langchain_openai import ChatOpenAI
# Generate a key (and store it securely); Fernet provides authenticated
# symmetric encryption built on AES
key = Fernet.generate_key()
cipher = Fernet(key)
# Encrypt the data from the private database so it is unreadable at rest
# and in transit
data = "SELECT * FROM customers"
enc = cipher.encrypt(data.encode())
# Decrypt the data only at the point of use; the LLM itself can work only
# with plaintext
plain = cipher.decrypt(enc).decode()
llm = ChatOpenAI(model_name="gpt-3.5-turbo")
res = llm.invoke(f"Explain what this SQL query does:\n{plain}")
# Optionally, encrypt the LLM's response with the same key before storing it
enc_res = cipher.encrypt(res.content.encode())
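One practical caveat the sketch glosses over: the Fernet key must itself be stored securely, for example in an environment variable or a key-management service, and never alongside the encrypted data, since whoever holds the key can read everything it protects.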
Encryption Advantages and Disadvantages
The advantages of using encryption with an LLM are:
· It protects the privacy and security of the data by transforming it into an unreadable form before passing it to the LLM.
· It reduces the risk of data leaks and misuse when using third-party LLMs and generative AI tools.
· It allows the LLM to access and leverage the general knowledge and information in the data without compromising the data integrity.
The disadvantages of using encryption with an LLM are:
· It may introduce errors or inaccuracies in the encryption and decryption processes, which could affect the quality and reliability of the data and the LLM’s response, although this risk is low.
· It may increase the computational cost and time of the data processing and analysis, as encryption and decryption are complex operations.
· It may not be compatible with all types of data, LLMs, or encryption models, which could limit the scope and applicability of the approach.
This was a brief and simplified overview of how to use encryption with an LLM.
Vectorization is the process of converting data into numerical vectors that can be used for mathematical operations and machine learning. In the context of LLMs, vectorization can help improve the performance and efficiency of data processing and analysis, as well as enable semantic similarity and retrieval tasks.
However, when connecting a vectorized database with an LLM, the LLM does not learn from the vectorized database but rather uses it as an adjunct source of information or knowledge. The LLM queries the vectorized database for the most relevant vectors that match the prompt or context, and uses the retrieved vectors and data as the input or context for generating a response or an action. The LLM does not modify or update the vectorized database, nor does it store or remember the vectorized database for future use. Therefore, the vectorized database remains unlearnt by the LLM.
The vectorized database remains unlearnt because the LLM is not designed or trained to learn from vectorized databases, but rather from natural language texts. The LLM’s neural network architecture and parameters are optimized for natural language generation, not for vector manipulation or storage. Its objective is to produce coherent and fluent text, not to learn from or update the vectorized database. The LLM therefore treats the vectorized database as a read-only resource, not as a learning target.
Again, the connection between the LLM and the vectorized database is temporary for the session, and the vectorized database is a temporary adjunct. The LLM does not establish a permanent or persistent connection with the vectorized database; it connects only when it needs to query or access it, and disconnects when it does not. To repeat: the LLM uses the vectorized database as a supplementary or auxiliary resource, not as a primary or essential one.
To use vectorization with an LLM, one possible approach is as follows:
1. Choose a pre-trained embedding model that can generate vector embeddings for words, sentences, or documents. Examples include spaCy or BERT embeddings, or hosted embedding models such as OpenAI’s (available in LangChain as OpenAIEmbeddings).
2. Embed the data into a vector store. With LangChain, for example, you can build a FAISS index from your texts; the store keeps each original text alongside its vector.
3. Retrieve the most relevant entries for a given question with a similarity search, and pass the retrieved text as input or context to the LLM, which returns a string with its response.
4. No separate “devectorization” step is needed: because the vector store keeps the original text with each vector, retrieval returns readable text directly (embeddings themselves are not generally reversible).
Here is an example of how the code might look:
Python Script
# Import the required packages (assumes OPENAI_API_KEY is set and that the
# faiss-cpu package is installed)
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
# Vectorize the private data into an in-memory vector store; each entry keeps
# its original text alongside its embedding
docs = ["Invoice 1001 was paid on March 3.", "Invoice 1002 is 60 days past due."]
store = FAISS.from_texts(docs, OpenAIEmbeddings())
# Retrieve the most relevant entries for a question via similarity search
question = "Which invoices are overdue?"
hits = store.similarity_search(question, k=1)
context = "\n".join(d.page_content for d in hits)
# Interact with an LLM like ChatGPT, using the retrieved text as the context
llm = ChatOpenAI(model_name="gpt-3.5-turbo")
res = llm.invoke(f"Context:\n{context}\n\nQuestion: {question}")
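Note that the FAISS index in this sketch lives only in memory, so it vanishes when the session ends, which is exactly the “temporary adjunct” behavior described above; the retrieved entries come back as ordinary text because the store keeps each original string alongside its vector.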
Vectorization Advantages and Disadvantages
The advantages of using vectorization with an LLM are:
· It reduces the size and complexity of the data, making it easier and faster to process and analyze.
· It enables semantic similarity and retrieval tasks, such as finding the most relevant documents or words for a given query or context.
· It allows the LLM to access and leverage the knowledge and information encoded in the vector embeddings.
The disadvantages of using vectorization with an LLM are:
· It may introduce errors or inaccuracies in the vectorization and retrieval processes, which could affect the quality and reliability of the data and the LLM’s response, although this risk is low.
· It may not capture the nuances and subtleties of natural language, such as tone, emotion, sarcasm, &c.
· It may not be compatible with all types of data, LLMs, or vector models, which could limit the scope and applicability of the approach.
This was a brief and simplified overview of how to use vectorization with an LLM.
Glossary of Terms
· Anonymization: the process of removing or replacing personally identifiable information (PII) from data, such as names, phone numbers, email addresses, &c. This is done to protect the privacy and security of the data subjects and comply with data protection laws and regulations.
· Anonymizer: a tool that can transform or mask sensitive data in a database, such as personal information, to protect the privacy of individuals or organizations. An anonymizer for databases can use different techniques, such as encryption, substitution, generalization, or perturbation, to make the data less identifiable or linkable. It is useful for complying with data protection regulations, such as GDPR, or for sharing data with external parties, such as researchers, without compromising confidentiality or utility.
· Deanonymization: the process of reversing the anonymization of data, that is, revealing the identity or personal information of individuals or organizations that have been hidden or masked in a dataset. Deanonymization can be done by cross-referencing anonymized data with other sources of data, such as public records, social media, or leaked databases, to find matches or correlations that can expose the original data.
· Differential privacy: a mathematical framework for ensuring the privacy of individuals in datasets. It adds noise to data in a controlled way while still allowing for the extraction of valuable insights.
· Encryption: the process of transforming data into an unreadable form using a secret key or algorithm. This is done to prevent unauthorized access or modification of the data. Only those with the key or algorithm can decrypt the data and restore its original form.
· LangChain: a high-level Python library/framework that simplifies the creation of applications using large language models (LLMs). It provides a standard interface for chains, many integrations with other tools, and end-to-end chains for common applications. LangChain can facilitate most use cases for LLMs and natural language processing (NLP), such as chatbots, intelligent search, question answering, summarization services, or even virtual agents capable of robotic process automation. It allows users to write expressions in natural language and execute them as SQL queries or LLM prompts. Also, it supports data anonymization using the Microsoft Presidio and Faker frameworks, and vectorization through its vector-store integrations.
· LLM: a large language model, such as ChatGPT, that can generate natural language texts based on a given prompt or context. LLMs are powered by deep neural networks and trained on massive amounts of text data.
· PII: personally identifiable information.
· psycopg2: a Python library for connecting to PostgreSQL databases; in this context, it lets a Python script retrieve data from a private SQL database for use with a large language model (LLM) like ChatGPT.
· Python: a popular, high-level, general-purpose programming language that can be used for various purposes. Its design philosophy emphasizes code readability with the use of significant indentation. Python is dynamically typed (the type of a value is determined and checked at run time rather than declared in advance).
· Regular expressions: also known as regex or regexp, sequences of characters that express patterns for text matching, used for purposes such as searching, replacing, validating, or extracting information from strings.
· SQL database: a relational database management system that uses the Structured Query Language (SQL) to store and manipulate data. Examples of SQL databases are MySQL, PostgreSQL, Oracle, &c.
· Vector embeddings: numerical representations of data that capture their meaning and relationships. They are often used for machine learning tasks such as text analysis, recommendation, and search. Vector embeddings can be created by using models that translate data into vectors, such as neural networks or word2vec. Vector embeddings can measure the similarity of data by calculating the distance or angle between their vectors in a multidimensional space (see the short sketch after this glossary).
· Vectorization: the process of converting data into numerical vectors that can be used for mathematical operations and machine learning. This is done to improve the performance and efficiency of data processing and analysis.
· Vectorized data: data that is stored and processed as vectors, which are ordered arrays of numbers. Vectorized data can be more efficient and faster to manipulate than scalar data, which is stored and processed as individual values. Vectorized data can also capture the meaning and relationships of the data by using techniques such as vector embeddings, which are numerical representations of features or attributes.
· Vectorized model: a mathematical representation of a system or phenomenon that uses vectors and matrices to describe its variables and parameters. A vectorized model can simplify the analysis and computation of complex systems by reducing the number of equations and operations involved. A vectorized model can also take advantage of the efficiency and functionality of vector and matrix operations, such as dot product, matrix multiplication, inverse, transpose, and eigenvalues.
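To make the vector-embedding entries above concrete, here is a minimal sketch of measuring similarity as the angle between vectors. The three-dimensional vectors and their word labels are toy assumptions; real embedding models produce vectors with hundreds or thousands of dimensions.
Python Script
import numpy as np
# Toy "embeddings" for three words (real models use far more dimensions)
invoice = np.array([0.9, 0.1, 0.2])
bill = np.array([0.8, 0.2, 0.3])
sunset = np.array([0.1, 0.9, 0.7])
def cosine_similarity(a, b):
    # Cosine of the angle between two vectors; 1.0 means identical direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine_similarity(invoice, bill))    # high: related meanings
print(cosine_similarity(invoice, sunset))  # low: unrelated meanings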
Conclusion
Using a vectorized database is not necessarily superior to anonymizing or encrypting data on the database; rather, it is a different approach with its own advantages and disadvantages. All three methods accomplish the task of protecting personally identifying information and, in particular, avoiding data leakage and misuse. Depending on the use case and the requirements, one may choose one method or combine them to protect the data.
Anonymizing data on the database means removing or replacing personally identifiable information (PII), such as names, phone numbers, email addresses, &c., before the data is shared with the LLM during in-context learning. Encrypting data from the database means transforming it into an unreadable form using a secret key or algorithm, preventing unauthorized access to private data while it is stored or in transit; only those with the key can decrypt the data and restore its original form.
Using a vectorized database means converting data into numerical vectors that can be used for mathematical operations and machine learning. A vectorized database can also help leverage the general knowledge and information in the data without compromising the data subjects’ identities, and it is used as an adjunct to an LLM, which does not and cannot memorize the vectorized data.
Certainly, the accountant can accomplish his goal of client confidentiality and compliance with data protection laws and regulations, such as GDPR and CCPA, when using an LLM through the methods of anonymization, encryption, vectorization, or a combination of them. Next time, we recapitulate the “AI Use in Accounting” series and provide further reference material.
–Richard Thomas (with the assistance of Microsoft Copilot)
Previous, Part VI
Next, Part VIII