This site is a work in progress.
%% This script is for pre-processing EEG data for my new project % Write a short description of what the script is for
clear; % Clears the workspace
clc; % Clears the command window
eeglab; % Start EEGLAB
format bank; % displays numbers to 2 decimal places rather than in scientific notation
This requires you to know about directories, variables and how to concatenate strings
project_folder = 'D:\work\my_EEG_project\'; % this is the main folder where everything lives
raw_eeg_folder = [project_folder,'data\raw_eeg\']; % this is where the raw eeg data files will be
Notice how it is a concatenation of the 'project_folder' variable and a subfolder name
behavioural_data_folder = [project_folder,'behavioural_data\']; % this is where the excel file containing behavioural data will be
all_directories = dir(raw_eeg_folder); % gets all directory information of files in this folder
If you inspect 'all_directories' you'll notice that it is a structure that contains lots of information - we're only interested in the file names, so that's what we'll extract
all_filenames = {all_directories.name}; % extract the file names into a cell array
If you don't include the curly brackets, MATLAB returns the names as a comma-separated list rather than collecting them into a single cell array, and the assignment won't work properly.
If you inspect 'all_filenames' you'll notice that the first 2 elements are '.' and '..' - in computer language these stand for the 'current folder' and 'parent folder'. We don't need these, so we'll get rid of them.
all_filenames = all_filenames(3:end); % uses indexing to ignore the first 2 elements by keeping everything from position 3 onwards
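If you prefer not to rely on '.' and '..' always being the first two entries, a small alternative sketch is to remove them by name instead (this would replace the indexing line above):
all_filenames = all_filenames(~ismember(all_filenames,{'.','..'})); % keep everything that isn't '.' or '..'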
You can loop through filenames, or you can parse the filenames to extract any information that may be stored in them - I prefer to parse the filenames so I can access/store the subject ID and other information separately from each other and from the file extension.
If your filename contains multiple bits of information that you want to extract and separate, you can use 'regular expressions' to achieve this.
For example, 'P001_oddball_left.bdf' % the filename contains the participant id, task, condition and file extension - you might want to split these.
The code below splits the filenames by 'underscores' and 'dots', and stores the output in a cell-array.
filenames_split = regexp(all_filenames,'(\_|\.)','split');
You will notice that the output is actually a cell-array nested in a cell-array. One way to make the data more accessible is to convert the cell-array to a table.
filenames_split = cell2table(filenames_split);
You will notice that all of the columns are stored under the same 'filenames_split' variable, which makes it difficult to access each column. You can separate them using the code below. The first argument is the table, and the second is the name of the variable you wish to split. The output will be 'filenames_split_1', 'filenames_split_2' etc.
filenames_split = splitvars(filenames_split,{'filenames_split'});
Inspect the output and assign the desired columns of the 'filenames_split' table to separate variables using 'table_name.variable_name' notation.
all_subject_id = filenames_split.filenames_split_1;
all_tasks = filenames_split.filenames_split_2;
all_conditions = filenames_split.filenames_split_3;
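For example, if the first file were 'P001_oddball_left.bdf' (as above), you would expect something like:
all_subject_id{1} % -> 'P001'
all_tasks{1} % -> 'oddball'
all_conditions{1} % -> 'left'
The remaining column would hold the file extension ('bdf').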
for filei = 1:length(all_filenames) % will repeat code between for-end by as many times as there are files
filei is called a 'counter' - it's basically a variable whose value changes on each iteration of the loop
disp(filei); % will display on the command window the 'value' of filei - this number will change on each iteration of the loop
Inside this loop, you can include code for all the pre-processing steps.
end
Pro-Tip: The quickest and easiest way to create a script for EEG pre-processing is to use the EEGLAB graphical user interface. Do the thing you want by clicking through the menus, and once it's complete, type 'eegh' in the console (which stands for 'eeg history'). It will give you the code equivalent of all the operations you have just performed. You can copy and paste this code into your script, replacing any 'hard-coded' information like a filepath with a 'soft-coded' variable which can update on each loop. To understand what the code is doing, you can just google the function name to see what each argument does, and you can try to trim unnecessary information.
The benefit of using the 'eegh' method when creating new scripts is that the functions will update with newer versions of EEGLAB (if you have them installed).
file2use = [raw_eeg_folder,all_filenames{filei}]; % define file path of which file to open
Recall that all_filenames is the list of all files. The counter 'filei' can be used as an index to extract a single file (starting with the first). As the counter increases (+1) with each iteration, a different file will be loaded.
EEGLAB can import many file types and uses different functions to do so. The specific function you will need will depend on the file type.
EEG = pop_biosig(file2use); % loading Biosemi '.bdf' files
EEG = pop_loadset(file2use); % alternatively, if you are loading an existing EEGLAB '.set' file (already pre-processed in EEGLAB to some degree)
Delete unused channels
Inspect manually in EEGLAB to figure out which channels to exclude. In the MATLAB workspace, look for EEG.chanlocs (inside the EEG structure) to see a list of all channel names. Find and enter label names of unused channels you want to delete.
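A quick way to print every channel label to the command window (just a convenience snippet, not part of the pre-processing itself):
disp({EEG.chanlocs.labels}'); % list all channel labels as a column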
EEG = pop_select(EEG,'nochannel',{'EXG4','EXG5','EXG6','EXG7','EXG8'});
The Biosemi system has 8 extra external channels labelled 'EXG1-8'. Some of them might be used as 'eye-channels' but many of them will be unused.
In order for
Rename eye-channels
EEG = pop_chanedit(EEG,'changefield',{65,'labels','IO1'},'changefield',{66,'labels','LO1'},'changefield',{67,'labels','LO2'}); % relabel channels 65, 66 and 67
Inspect manually in EEGLAB to figure out which channels you want to relabel
'IO1' is below the left eye
'LO1' is lateral to left eye
'LO2' is lateral to right eye
Load channel locations
EEG = pop_chanedit(EEG,'lookup','Standard-10-5-Cap385_witheog.elp'); %load channel locations from chanlocs file
Filter the data (High Pass and Line Noise Removal, or Band-Pass)
EEG = pop_eegfiltnew(EEG,30,1,3380,0,[],0); % 1 - 30Hz FIR filter
The 3380 part is the filter order - this number will change depending on your sampling rate. This value is automatically determined when you run it manually. What I do is run the filter on a single participant, type 'eegh' in the console to extract the command and paste that into my code.
The '0's and '[]' are additional arguments that we don't need to change.
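If you'd rather not hard-code the filter order, my understanding is that you can leave it (and the trailing arguments) out and pop_eegfiltnew() will estimate a suitable order from your sampling rate - a minimal sketch:
EEG = pop_eegfiltnew(EEG,1,30); % 1 - 30 Hz band-pass, filter order chosen automatically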
Channel rejection and interpolation (using performReference() from the PREP pipeline plugin)
reference_params = struct('referenceChannels',1:64,...
'evaluationChannels',1:64,...
'rereferencedChannels',1:EEG.nbchan);
[EEG, EEG.etc.robust_reference_output] = performReference(EEG,reference_params);
Re-reference
EEG = pop_reref(EEG,[]); % computes average reference
Correct for artifacts using Artifact Subspace Reconstruction (ASR)
EEG = clean_asr(EEG,100); % the second argument is the burst-detection cutoff (in standard deviations) - 100 is a lenient setting
Define which event markers to extract data around
markers2use = {'101', '102', '103', '104'};
Define what time interval around the marker to extract data from
epoch_window = [-.1 1]; %in seconds
Extract epochs
EEG = pop_epoch(EEG,markers2use,epoch_window); % extract epochs around the specified markers
Keeping track of trials
It is very important to keep track of trials.
In my experiments, I export a table in excel containing trial-level information about: trial number, trial type, RT, accuracy - any relevant information. During my EEG analysis, I import this excel file and save it inside the EEG structure.
To keep track of trials, it is important that the number of rows in the excel data matches the number of epochs that are extracted. This means that when I extract EEG data, I make sure to extract every single trial (even if I won't use it).
If this is done properly, you can use the behavioural data to index which trials you want.
If you are extracting response-locked ERPs, trials without a response cannot be extracted, so you will have fewer epochs in the EEG compared to the behavioural data. What you need to do is subset the behavioural data to match. If you have trial-level RT, you can use behavioural_data(behavioural_data.RT ~= 9999,:) to keep only the response trials. This way you should have EEG and behavioural data for the same set of trials.
It is also possible to feed this back to the full data.
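As a rough sketch of what that import and matching might look like (the file-naming scheme, the readtable() call on an .xlsx file, and the 9999 'no response' code are all assumptions based on my own setup):
behavioural_file = [behavioural_data_folder,all_subject_id{filei},'_behaviour.xlsx']; % hypothetical file-naming scheme
behavioural_data = readtable(behavioural_file); % one row per trial
assert(height(behavioural_data) == EEG.trials,'Behavioural rows and EEG epochs do not match'); % sanity check when every trial has been epoched
behavioural_data_responses = behavioural_data(behavioural_data.RT ~= 9999,:); % for response-locked analyses, keep only the response trials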
Find out how many Independent Components to extract
When you do ICA, the maximum number of sources (or independent components) you can extract is equal to the 'data rank'. The data rank is the number of unique sources which contribute to the data - i.e. the number of channels. So, the more channels you have, the higher your rank will be. When you exclude bad channels, the number of independent components you can extract from the ICA should also decrease. Even if you interpolate those bad channels, the data rank doesn't go back up, because an interpolated channel does not contain any unique information - it's just a combination of surrounding channels.
A simple way to estimate the data rank is to subtract the total number of interpolated channels from the total number of channels.
data_rank = EEG.nbchan - total_interpolated_channels; % 'total_interpolated_channels' is however many channels you interpolated earlier
Another thing to consider is that ICA needs a certain amount of data to work properly. If you have a very short recording with a lot of channels (e.g. recording from 128 channels over 2 minutes), there might not be enough data to extract all 128 ICs properly. If you have a short recording, what you can do is decrease the number of ICs extracted. It is suggested that your dataset contains at least 20-30 times the number of channels squared in data points (e.g. 30*number_of_channels^2). So, if you have 16 channels, you need ~7680 data points (which is 30.72 seconds of data at 250 Hz) to extract 16 ICs. Keep in mind that, if you epoch your data before conducting ICA, only the data points within the epochs are included, not the data points between trials.
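As a quick sanity check of that heuristic (just the arithmetic above expressed in code):
min_points_needed = 30*EEG.nbchan^2; % heuristic: ~30 x (number of channels)^2 data points
points_available = EEG.pnts*EEG.trials; % data points actually available in the (epoched) dataset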
You can shuffle around the above equation to determine how many ICs you can extract.
ICs_to_extract = floor(sqrt(EEG.pnts*EEG.trials/30));
pcakeep = ICs_to_extract;
AMICA
You can make ICA run faster if you run it on down-sampled data, and apply the solution to the original data. You can do this because lower frequencies contribute more to the ICA solution than higher frequencies. Use the pop_resample() function to down-sample to 100 Hz and store the output in a new variable.
EEG_4_ICA = pop_resample(EEG,100);
You can now run ICA by using the runamica15() function and putting the down-sampled data as the first argument. Even though it's running on the down-sampled data, the results are being stored in the original 'EEG' structure which contains the original data.
You can adjust the number of threads depending on how many CPU cores you have (it is suggested you set it to the number of physical cores minus 1).
The 'do_reject' line allows AMICA to exclude data samples it considers unlikely from the ICA optimisation procedure.
'pcakeep' is where you can set how many independent components you want to extract.
[EEG.icaweights, EEG.icasphere, EEG.icamods] = runamica15(EEG_4_ICA.data(:,:), ... % (:,:) reshapes the epoched data into one long 2D 'super trial'
    'max_threads', 8, 'max_iter', 2000, ... % threads and iterations
    'do_reject', 1, 'numrej', 3, 'rejsig', 4, 'rejstart', 3, 'rejint', 3, ... % reject unlikely samples
    'pcakeep', pcakeep); % reduce rank based on interpolated channels
EEG = eeg_checkset(EEG, 'ica');
ICLabel
ICLabel is an algorithm which has been trained on 6000+ datasets to identify independent components. This function can be used to classify the ICs in our dataset. The output is a table of 7 numbers for each IC, telling us whether it thinks the component is 'brain', 'muscle', 'eye', 'heart', 'linenoise', 'channelnoise', or 'other'. These numbers represent the relative confidence of the classification, and the 7 numbers for each component add up to 1. The results of the ICLabel procedure are stored in 'EEG.etc.ic_classification.ICLabel.classifications'.
EEG = iclabel(EEG, 'default');
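If you want to eyeball the output, the snippet below (a sketch - it assumes the 'classes' and 'classifications' fields that ICLabel stores under EEG.etc) prints the probabilities as a labelled table:
ic_probs = EEG.etc.ic_classification.ICLabel.classifications; % one row per IC, one column per category
ic_classes = EEG.etc.ic_classification.ICLabel.classes; % the 7 category names
array2table(ic_probs,'VariableNames',matlab.lang.makeValidName(ic_classes)) % view as a table in the command window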
Reject unwanted Independent Components
There are many ways to decide which IC components to keep/remove. The method shown here keeps ICs for which the classifier is no more than 70% confident that they belong to any non-brain category. In other words, it removes ICs that ICLabel is more than 70% confident are muscle, eye, heart, linenoise, channelnoise or other. The first part finds, for each category, the ICs that fall at or below that threshold.
not_muscle_ic_idx = find(EEG.etc.ic_classification.ICLabel.classifications(:,2) <= 0.7);
not_eye_ic_idx = find(EEG.etc.ic_classification.ICLabel.classifications(:,3) <= 0.7);
not_heart_ic_idx = find(EEG.etc.ic_classification.ICLabel.classifications(:,4) <= 0.7);
not_linenoise_ic_idx = find(EEG.etc.ic_classification.ICLabel.classifications(:,5) <= 0.7);
not_channoise_ic_idx = find(EEG.etc.ic_classification.ICLabel.classifications(:,6) <= 0.7);
not_other_ic_idx = find(EEG.etc.ic_classification.ICLabel.classifications(:,7) <= 0.7);
The second part takes the intersection of all these index lists, i.e. the ICs that pass every threshold.
% finding ics that satisfy all thresholds
keep_ic_idx = intersect(not_muscle_ic_idx, not_eye_ic_idx);
keep_ic_idx = intersect(keep_ic_idx, not_heart_ic_idx);
keep_ic_idx = intersect(keep_ic_idx, not_linenoise_ic_idx);
keep_ic_idx = intersect(keep_ic_idx, not_channoise_ic_idx);
keep_ic_idx = intersect(keep_ic_idx, not_other_ic_idx);
This part uses the pop_subcomp() function to remove the unwanted ICs - note that the final argument '1' tells it to treat 'keep_ic_idx' as the components to keep (everything else is removed).
% remove ICS
EEG = pop_subcomp(EEG, keep_ic_idx, 0, 1);
Clear EEG.icaact (the pre-computed independent component activations, i.e. the IC time courses) so that EEGLAB recomputes it from the remaining components.
EEG.icaact = [];
This keeps the IC classification results only for the retained ICs (i.e. it deletes the rows for the removed ICs).
EEG.etc.ic_classification.ICLabel.classifications = EEG.etc.ic_classification.ICLabel.classifications(keep_ic_idx,:);
This checks the ICA result - always good to do this when you modify the data structure
EEG = eeg_checkset(EEG, 'ica');
Baseline Correction
Baseline correct to the pre-stimulus interval (in ms) - If you're running ICA on the epoched data, it's good to baseline correct after ICA because baseline correcting before ICA introduces offsets which can throw off the ICA algorithm.
EEG = pop_rmbase(EEG,[-100 0]);
Artifact Rejection of Trials
Just a simple procedure to detect whether any data points exceed 100 µV in either direction. This doesn't remove those trials, it just flags them. You can append this to the output to ensure that these trials are not included in statistical analyses. I don't like to delete trials because (1) it's harder to keep track when the matrix sizes become inconsistent between individuals and (2) you would also have to delete the corresponding behavioural data, which might still be valid.
artifact_trials = EEG.data > 100 | EEG.data < -100; % logical array (channels x time points x trials) marking offending samples
artifact_trials = squeeze(sum(artifact_trials,2)); % count offending samples per channel and trial
artifact_trials = squeeze(sum(artifact_trials,1)); % sum across channels - one count per trial
artifact_trials = (artifact_trials > 0)'; % flag any trial containing at least one offending sample
EEG.etc.artifact_trials = artifact_trials;
Store useful pre-processing information (e.g. data rank, ICA components removed, how many trials have artifacts)
EEG.etc.subject_id = all_subject_id{filei}; % the parsed subject ID for the current file
EEG.etc.artifact_trials = artifact_trials;
EEG.etc.good_ics = length(keep_ic_idx);
Store behavioural data
EEG.etc.behavioural_data = behavioural_data;
Save processed data
pop_saveset(EEG,'filename',[all_subject_id{filei},'_processed.set'],'filepath',export_folder); % save output ('export_folder' is wherever you want the processed files to go - define it with the other folders at the top of the script)
That's it - the next steps would be to (1) figure out how to plot grand-averaged waveforms and scalp maps, and (2) measure and export ERP amplitudes.