Joint Text-EEG Embeddings

Here is a proposal I pitched to the Emotiv EEG company, since they have a bunch of EEG data and infrastructure for paying customers to help with distributed data collection. They were interested, but we ended up stopping short of having me work there for the summer.

My motivation is that the main approach to AI alignment right now is RLHF, but one key limitation is the bandwidth constraint of humans ranking text generated by large language models. It would be nice to have joint text-EEG embeddings in order to increase that bandwidth. I think there are naive ways of doing this that would be pretty bad – for example, if we did RLHF based on people’s internal reactions to a piece of text, I could see that optimizing toward parts of ourselves that we don’t identify with (e.g., stimuli that are pleasurable in the short term).

But I think there is a version of this where we scale the amount of communication between AIs and people, as AIs become more direct personal assistants. I am consciously thinking about EEG as opposed to the more reasonable fMRI because EEG is much more likely to be integrated into everyday use, which can lead to much more data being collected, which has positive feedback effects. Also, as far as I can tell, it somehow hasn’t really been tried with transformers before.

I have this broad intuition that more communication is good, but it is also possible in the short term for increased communication to lead to a smaller integral of total communication. I admit that sociopaths exist, and that generally speaking you don’t want to give out your social security number, which is indeed communication. Maybe a crux for communication is that in the process of sharing your concepts, you also share your values, since values are a particular kind of concept. I don’t think Hitler is a good example of orthogonality, because his values were not really disjoint from his conceptual structure of how the world works.

In any event, there is probably a better version of the following proposal where you don’t create a joint embedding space but rather use EEG embeddings to help a pretrained LM better predict the next word. I imagine this is a more reasonable task due to the low spatial resolution of EEG.
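
As a rough illustration of that alternative framing, here is a minimal sketch of conditioning a pretrained GPT-2 on EEG features by projecting them into the model’s token-embedding space and prepending them as soft-prompt tokens. The EEG features are a placeholder random tensor and the 128-dimensional feature size is an assumption, not something tied to a particular headset.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")

eeg_dim = 128                                    # assumed size of a pooled EEG feature vector
proj = nn.Linear(eeg_dim, model.config.n_embd)   # learned map into GPT-2's embedding space

eeg_features = torch.randn(1, 4, eeg_dim)        # placeholder: four pooled EEG "tokens"
text_ids = tok("The weather today is", return_tensors="pt").input_ids
text_embeds = model.transformer.wte(text_ids)    # ordinary token embeddings

# Prepend the projected EEG vectors as soft-prompt tokens, then predict the next word.
inputs_embeds = torch.cat([proj(eeg_features), text_embeds], dim=1)
logits = model(inputs_embeds=inputs_embeds).logits
print(tok.decode(logits[:, -1].argmax(-1)))
```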

A Proposal for Developing a Joint Text-EEG Embedding Model using Emotiv EEG Headsets

Author: Scott Viteri Affiliation: Stanford University Date: March 27th, 2023

Introduction

The goal of this project is to create a dataset and develop a joint text-EEG embedding model using Emotiv EEG headsets. This dataset will consist of pairs of language model completions and the corresponding EEG headset data when a participant reads those completions. By learning a joint EEG/text embedding space, we aim to align machine learning models with human values, allowing AI systems to better emulate human cognitive processes and values. Additionally, this project will provide an incentive for people to buy Emotiv headsets, as they can use the joint embedding model to fine-tune language models with their own data, creating a more personalized AI assistant.

Emotiv’s Market Opportunity

The joint text-EEG embedding model developed using Emotiv EEG headsets has the potential to significantly increase the demand for Emotiv’s products. If this joint embedding becomes generally available, it would provide a strong incentive for consumers to purchase Emotiv headsets, particularly those with an earbud form factor. The ability to fine-tune language models on their own thoughts using the embedding model and the headset would be a major selling point. By integrating the headset with smartphones or internet-connected devices, users could interact with language models in real-time using their thoughts as context, effectively creating a next-generation smart assistant experience. This innovation represents a natural progression from voice-based smart assistants, and the potential market size is vast.

By collaborating with Emotiv, we can ensure that the joint text-EEG embedding model is optimized for use with their EEG headsets. This will create a unique value proposition for Emotiv’s products, differentiating them from competitors and driving increased sales. Moreover, the success of this project could lead to further research and development opportunities, reinforcing Emotiv’s position as a leader in the field of brain-computer interfaces and AI-driven technologies.

Cost Estimation

The total cost estimation for this project is around $120K. This is broken down as follows:

  1. $50K for the joint text/EEG embeddings
  2. $30K for an EEG predictive model
  3. $40K to pay for the project lead’s summer internship

These costs can be justified by the potential to further solidify Emotiv’s position as a leader in the brain-computer interface field while opening up new avenues of innovation, connecting human thought to computers, to AI, and to other people in fundamentally new ways.

Data Collection Cost

We estimate a labeling cost of $30/hr; assuming a labeler can read and label a paragraph within a minute, this works out to 2 samples per dollar. To potentially reduce costs and increase participant engagement, we can consider allowing participants to choose their reading material. This can be achieved by providing a browser extension that displays one paragraph or sentence at a time or by hosting the interface on the Emotiv labs platform, where participants can input the URL of the content they want to read. While this approach might introduce some bias in the dataset, it could be worth it if we are participant-limited and can ensure a diverse set of readings, including content that elicits varying emotional responses.
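
As a quick check on that figure (the $30/hr rate and one paragraph per minute are the assumptions above):

```python
hourly_rate = 30.0                                   # USD per labeler-hour
paragraphs_per_hour = 60                             # one paragraph read and labeled per minute
cost_per_sample = hourly_rate / paragraphs_per_hour  # = $0.50 per (text, EEG) pair
samples_per_dollar = 1 / cost_per_sample             # = 2 samples per dollar
```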

Number of Data Points and Model Training Cost

For the contrastive model, we would naively expect 400 million pairs, similar to the CLIP model, which would cost $200M to annotate. However, since we can use OpenAI’s existing embedding model for the text-to-vector function, we might only need 100 million pairs, which would still cost $50M to annotate – an amount still out of range.

To reduce costs, we can pretrain a model to predict the next EEG signals using existing data repositories. By removing the last layer(s) of this predictive model, we can use it as a starting point for the brain-to-vector function. This approach could save three orders of magnitude, putting the cost at $50K. However, we would need a good predictive model for future brain data based on past data. Estimating the difficulty of this task to be around that of language modeling, the cost for a baseline-competence 110M-parameter model would be about $30K (https://arxiv.org/abs/2004.08900).
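
A back-of-the-envelope version of this scaling argument, using the $0.50-per-pair labeling cost from above; the final line assumes the pretrained EEG encoder cuts the number of annotated pairs needed by roughly a factor of 1,000:

```python
cost_per_pair = 0.50                      # USD per annotated (text, EEG) pair

naive_pairs = 400e6                       # CLIP-scale contrastive dataset
print(f"naive: ${cost_per_pair * naive_pairs / 1e6:.0f}M")                            # $200M

frozen_text_pairs = 100e6                 # reuse an existing text embedding model
print(f"frozen text encoder: ${cost_per_pair * frozen_text_pairs / 1e6:.0f}M")        # $50M

pretrained_eeg_pairs = frozen_text_pairs / 1000    # assumed ~3 orders of magnitude fewer pairs
print(f"pretrained EEG encoder: ${cost_per_pair * pretrained_eeg_pairs / 1e3:.0f}K")  # $50K
```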

Splitting the Problem: Pre-training and Contrastive Training

We split the problem into pre-training and contrastive training steps for several reasons. First, pre-training allows us to leverage existing data repositories to build a strong foundation for the brain-to-vector function at a lower cost. This step reduces the complexity of the problem and enables us to focus on learning a meaningful representation from brain data. Second, the contrastive training step refines the model to learn the joint embedding space between text and EEG data, ensuring that the model can effectively capture relevant information from both modalities. Splitting the problem this way facilitates a more efficient and cost-effective approach to developing the joint text-EEG embedding model.
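
A minimal PyTorch sketch of this split. The encoder architecture, the 14-channel/256-sample window sizes, and the next-window MSE objective are illustrative assumptions rather than a fixed design; the point is only that the pre-training head is discarded and the encoder is reused as the brain-to-vector function.

```python
import torch
import torch.nn as nn

class EEGEncoder(nn.Module):
    """Maps a window of raw EEG (channels x time) to a single feature vector."""
    def __init__(self, n_channels=14, window=256, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_channels * window, 512),
            nn.GELU(),
            nn.Linear(512, dim),
        )

    def forward(self, x):                      # x: (batch, n_channels, window)
        return self.net(x)

class EEGPredictor(nn.Module):
    """Pre-training model: predict the next EEG window from the current one."""
    def __init__(self, n_channels=14, window=256, dim=256):
        super().__init__()
        self.encoder = EEGEncoder(n_channels, window, dim)
        self.head = nn.Linear(dim, n_channels * window)   # discarded after pre-training

    def forward(self, x):
        return self.head(self.encoder(x)).view(x.shape)

# Phase 1: pre-train on existing EEG repositories (here, random stand-ins for real data).
predictor = EEGPredictor()
x_now, x_next = torch.randn(8, 14, 256), torch.randn(8, 14, 256)
pretrain_loss = nn.functional.mse_loss(predictor(x_now), x_next)

# Phase 2: keep the encoder, add a projection head, and train it contrastively against text.
eeg_to_vec = nn.Sequential(predictor.encoder, nn.Linear(256, 512))
```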

Incremental Spending

To reduce the risk of financial loss, the project can be executed incrementally, allowing for adjustments and the possibility of stopping data collection if necessary. The following strategies will be employed:

  1. Learn the predictive model of the EEG signal before starting human data collection.
  2. Use the EEG predictive model to create baselines and potentially find cheaper ways to translate between EEG and language.
  3. Train the contrastive model as data is collected, allowing for adjustments or stopping data collection if the model does not show progress.

Data Collection Procedure

Participant Recruitment

Recruit a diverse group of participants, ensuring a representative sample of different ages, genders, and cultural backgrounds. Obtain informed consent from all participants, and ensure they understand the purpose of the study and their right to withdraw at any time.

Onsite Collection Protocol

  1. Equip each participant with an Emotiv EEG headset, ensuring a comfortable fit and proper electrode placement. Use EEG headsets with multiple channels to capture a broader range of brain activity. Additionally, integrate the headset with a smartphone or computer application to facilitate real-time data collection and language model interaction.
  2. Prepare a diverse set of texts covering various topics and complexities to elicit a wide range of cognitive responses from participants. The texts can be generated by language models or curated from existing sources.
  3. Randomize the order of text presentations for each participant to minimize any order effects that may influence the EEG readings.
  4. For each participant, follow these steps (see the logging sketch after this list):
     a. Present a text to the participant and ask them to read it silently.
     b. Record the participant’s EEG data while they read the text.
     c. Store the (text, EEG data) pair securely.
  5. Repeat this process for several texts per participant, ensuring a sufficient number of (text, EEG data) pairs are collected.
  6. Store the data securely, anonymizing it to protect participant privacy.
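
For concreteness, here is a hedged sketch of the onsite collection loop. `record_eeg` is a stand-in for whatever the Emotiv SDK actually exposes, and the anonymized participant ID and JSONL on-disk format are illustrative choices.

```python
import json, random, time, uuid
import numpy as np

def record_eeg(duration_s, fs=128, n_channels=14):
    """Placeholder for the real headset stream: return a (samples x channels) EEG array."""
    return np.random.randn(int(duration_s * fs), n_channels)

def collect_session(texts, out_path):
    participant_id = uuid.uuid4().hex            # anonymized identifier, no PII stored
    random.shuffle(texts)                        # randomize presentation order (step 3)
    with open(out_path, "a") as f:
        for text in texts:
            print("\n" + text)                   # present the paragraph for silent reading
            start = time.time()
            input("Press Enter when you have finished reading...")
            eeg = record_eeg(time.time() - start)  # stand-in for samples buffered while reading
            f.write(json.dumps({"participant": participant_id,
                                "text": text,
                                "eeg": eeg.tolist()}) + "\n")

collect_session(["First paragraph ...", "Second paragraph ..."], "text_eeg_pairs.jsonl")
```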

Remote Collection Protocol

  1. Equip each participant with an Emotiv EEG headset and integrate it with a browser extension or an interface hosted on the Emotiv labs platform.
  2. Allow participants to input the URL of the content they want to read, ensuring a diverse set of readings and a wide range of cognitive responses.
  3. Follow the data collection, preprocessing, and model training steps outlined in the Onsite Collection Protocol section.
  4. Monitor the dataset for biases and take corrective measures as needed to ensure a balanced representation of content and emotional responses.

By implementing this data collection procedure, we can potentially reduce costs, increase participant engagement, and encourage a diverse range of readings. However, it is essential to be aware of potential biases and address them accordingly to ensure the quality and usefulness of the joint text-EEG embedding model.

Model Training

EEG Data Preprocessing

  1. Perform artifact removal and noise reduction on the collected EEG data to ensure high-quality signals for subsequent analysis.
  2. Apply Short-Time Fourier Transform (STFT) to the EEG data to generate spectrograms, converting time-domain signals into time-frequency representations. This process will help capture important features in the EEG data.
  3. Vertically stack the STFT signals from multiple EEG channels to create combined spectrogram representations.
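
A minimal SciPy sketch of steps 2 and 3 above. The 128 Hz sampling rate, 14 channels, and window length are assumptions, and the artifact-removal step (1) is omitted.

```python
import numpy as np
from scipy.signal import stft

def eeg_to_stacked_spectrogram(eeg, fs=128, nperseg=64):
    """eeg: (n_channels, n_samples) array -> (n_channels * n_freqs, n_frames) magnitude image."""
    spectrograms = []
    for channel in eeg:
        _, _, Z = stft(channel, fs=fs, nperseg=nperseg)   # time-frequency representation
        spectrograms.append(np.abs(Z))                    # keep magnitudes
    return np.vstack(spectrograms)                        # vertically stack per-channel spectrograms

eeg = np.random.randn(14, 10 * 128)             # placeholder: 14 channels, 10 s at 128 Hz
image = eeg_to_stacked_spectrogram(eeg)         # ready to be fed to a ViT as a one-channel image
print(image.shape)
```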

Model Training and Evaluation

  1. Use MuLan: A Joint Embedding of Music Audio and Natural Language (Huang et al., 2022) as a reference methodology for creating a joint EEG/text embedding model.
  2. Use a Vision Transformer (ViT) to process the combined STFT spectrograms and generate EEG signal embeddings.
  3. Train the model using a contrastive loss to find a joint embedding space between the text data and the EEG data (a minimal sketch of this objective follows the list).
  4. Evaluate the model’s performance using various metrics and qualitative analysis to ensure the model effectively captures meaningful information from both text and EEG data.
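
A minimal sketch of the contrastive objective in step 3, in the spirit of the CLIP/MuLan recipe. It assumes the EEG tower (e.g. the ViT over stacked spectrograms) and the text tower each already produce a batch of matched embeddings; the temperature value and embedding size are placeholders.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(eeg_emb, text_emb, temperature=0.07):
    """CLIP-style symmetric loss over a batch of matched (EEG, text) embedding pairs."""
    eeg_emb = F.normalize(eeg_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = eeg_emb @ text_emb.t() / temperature       # (batch, batch) similarity matrix
    targets = torch.arange(len(logits))                 # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Placeholder embeddings: in practice these come from the ViT over spectrograms
# and from a frozen text embedding model, respectively.
loss = contrastive_loss(torch.randn(32, 512), torch.randn(32, 512))
```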

Public Release and Kaggle Competition

  1. Prepare the dataset for public release, ensuring it complies with privacy regulations and ethical guidelines.
  2. Publish the initial joint text-EEG embedding model along with the dataset to provide a baseline for researchers and developers.
  3. Collaborate with Kaggle to organize a competition, encouraging participants to improve upon the initial model and explore novel applications of the joint embedding space.

Summer Internship

The proposer of this project, if accepted, would like to be considered for a summer internship at Emotiv. The internship’s primary focus would be to assist in the data collection process, model development, and organizing the Kaggle competition. Additionally, the intern would contribute to writing an academic paper targeting an ML-focused journal to report the project’s results and implications.

Project Timeline

  1. Phase 1: Develop an EEG predictive model (1 month)
  2. Phase 2: Begin human data collection and establish baselines (1 month)
  3. Phase 3: Train the contrastive model as data is collected, and monitor progress (3 months)
  4. Phase 4: Evaluate the project’s success and revise the data collection process, if necessary (1 month)

Total project duration: 6 months

Advancing Emotiv’s Vision for Brain-Computer Interfaces and Conclusion

The successful development of a joint text-EEG embedding model has the potential to advance Emotiv’s leadership in brain-computer interface technology. By connecting human thought to computers, to AI, and to other people, Emotiv will be ideally positioned to capitalize on the growing market for personalized AI assistants and innovative BCI applications.

In conclusion, this proposal outlines a plan to develop a joint text-EEG embedding model using Emotiv EEG headsets, with the ultimate goal of aligning machine learning models with human values and creating more personalized AI assistants. By publicly releasing the dataset and initial model, we aim to encourage research and innovation in this area, as well as generate interest in Emotiv’s products. We believe this collaboration will be mutually beneficial and result in significant advancements in the fields of AI and brain-computer interfaces.

Scott Viteri
PhD Candidate in Computer Science

I am a PhD candidate at the Center for Automated Reasoning at Stanford University, under the guidance of Prof. Clark Barrett. My research focuses on improving the prosocial tendencies of language models (LMs) through a series of unique developmental approaches. This includes the introduction of communication channels during autoregressive training (akin to a kindergarten setting), allowing a parent LM to guide a child LM by curating its training data, and enhancing human feedback on LMs via a combined embedding of EEG data and speech.