Thursday AI Seminar Contents
The Vanguard of Big Data and Artificial Intelligence in Marketing Science
: Weekly Seminar
Our research center becomes a hub every Thursday for an academic collective exploring the application of big data and artificial intelligence technologies in the field of marketing science. Under the guidance of Professor Sunnyoung Lee, students from Dongguk University form various research teams to delve into the latest trends in data analysis and machine learning algorithms, and their innovative impact on marketing science. This seminar integrates theoretical foundations with case study analyses to enable participants to understand the practical applications of big data and artificial intelligence technologies, providing deep insights into the latest academic developments in this field.
The research teams participating in this seminar share essential information for their studies and learning, acquiring advanced knowledge that can be utilized in each team's research. Through this process, students gain critical skills for making data-driven decisions, improving customer experiences, and developing strategies to secure a competitive advantage in the market.
This seminar has established itself as an essential forum for scholars seeking to apply big data and artificial intelligence technologies in marketing science, aiming to contribute to the expansion of knowledge and the advancement of innovative research in this area. We eagerly await the active participation of those who wish to conduct research at the forefront of data science alongside our research center.
24. 03. 28
Moonjung & Mingyu from RESEARCH 2 / 3
: Logistics & Numerical Sequence
Audio Data Loading and Preliminary Analysis Using Librosa
The script is setting up for an analysis of audio data. By mounting Google Drive, it allows for the use of large datasets without having to upload the data to the Colab environment directly. The use of librosa for loading the audio files is a common choice in audio signal processing, as it simplifies many tasks like reading audio files, extracting features, and more.
After loading the data, the script stores each audio signal in audio_data_list and their respective sample rates in sr_list. This is typically the first step in a pipeline that could include further analysis like feature extraction (e.g., Mel-frequency cepstral coefficients (MFCCs), Zero-crossing rate, Spectral centroid) and machine learning tasks such as classification or clustering of audio samples.
The printed information gives immediate feedback on the data being processed and ensures that the audio files are loaded correctly. This can be used to quickly check for inconsistencies or issues in the data loading phase before moving on to more complex analyses.
This script serves as a preliminary step in an audio data analysis project, setting the stage for more detailed exploration of the audio files' contents.
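A minimal sketch of such a loading script, assuming a Colab environment (the Drive path and file names are placeholders, not the team's actual data):

```python
# Minimal loading sketch; paths and file names are placeholders.
from google.colab import drive
import librosa

drive.mount('/content/drive')  # make files stored in Google Drive visible to Colab

audio_paths = ['/content/drive/MyDrive/audio/sample01.wav',
               '/content/drive/MyDrive/audio/sample02.wav']

audio_data_list = []  # raw waveforms
sr_list = []          # corresponding sample rates

for path in audio_paths:
    y, sr = librosa.load(path)   # waveform y and its sample rate sr
    audio_data_list.append(y)
    sr_list.append(sr)
    print(path, '->', y.shape[0], 'samples at', sr, 'Hz')  # quick sanity check
```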
Step 1. Create input and output variables
1. `X_train = np.zeros((40, 20))`:
- This line creates a 2D NumPy array of shape (40, 20) filled with zeros. This array is intended to be the feature set for training a machine learning model, where you have 40 samples and each sample has 20 features.
2. `y_train = np.zeros(40)`:
- Here, a 1D NumPy array of length 40 is created, also filled with zeros. This array represents the target variable or labels for the training set. In a typical machine learning task, this could be a binary classification problem where 0 might represent one class and 1 represents another class.
3. `y_train[0:20] = 1`:
- This line assigns the value 1 to the first 20 elements of the `y_train` array. This indicates that the first half of the samples belongs to one class (e.g., class 1), while the remaining half belongs to another class (e.g., class 0, since they are initialized to 0).
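Taken together, the three lines described above amount to the following initialization:

```python
import numpy as np

X_train = np.zeros((40, 20))  # 40 samples x 20 features, filled in later
y_train = np.zeros(40)        # one label per sample
y_train[0:20] = 1             # first 20 samples: class 1; remaining 20 stay class 0
```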
Step 2. Extract characteristics of sound data
Importing Audio Files and Feature Extraction:
The code contains two for loops, each designed to process a set of audio files that represent two categories of baby crying sounds: "hungry" and "laugh."
In the first loop, the code processes the files 'hungry02.wav' through 'hungry06.wav'.
In the second loop, it processes 'laugh03.wav' through 'laugh07.wav'.
The librosa.load function is called to load the audio file into the variable y with its sample rate in sr.
Then, the librosa.feature.mfcc function is called to compute the Mel-frequency cepstral coefficients (MFCCs) for the loaded audio. MFCCs are commonly used features in audio processing and speech recognition.
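A hedged sketch of these loops is shown below. Only the folder name babycriyingsound appears later in the summary, so the full Drive path is an assumption, and reducing each MFCC matrix to a fixed-length vector by averaging over time is also an assumption (the summary only states that `librosa.feature.mfcc` is called for each file):

```python
import librosa
import numpy as np

base = '/content/drive/MyDrive/babycriyingsound/'  # assumed path; only the folder name is given

hungry_files = [f'hungry{i:02d}.wav' for i in range(2, 7)]  # hungry02.wav .. hungry06.wav
laugh_files = [f'laugh{i:02d}.wav' for i in range(3, 8)]    # laugh03.wav .. laugh07.wav

features = []
for fname in hungry_files + laugh_files:
    y, sr = librosa.load(base + fname)                  # waveform and sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)  # shape: (20, n_frames)
    features.append(mfcc.mean(axis=1))                  # one 20-dimensional vector per file

features = np.array(features)  # how these vectors fill the 40 rows of X_train is not
                               # spelled out in the summary above
```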
Step 3. Bundle into data sets
Imports:
csv: This module is used to read and write data in CSV format, but it is not used in the snippet provided.
numpy: This is a fundamental package for scientific computing in Python. It is being used here to handle arrays and mathematical operations efficiently.
Dataset Initialization:
data_sets = np.zeros((40, 21)): This line creates a 2D numpy array with 40 rows and 21 columns, initialized with zeros. The comment suggests that this array is meant to hold 20 feature vectors (X_train) and a single column for the labels (y_train).
Populating Feature Vectors:
data_sets[:, :20] = X_train: This line assigns the feature vectors from X_train to the first 20 columns of data_sets. This assumes that X_train is a 2D array with a shape that matches the first 20 columns of data_sets (i.e., 40 rows and 20 columns).
Populating Labels:
data_sets[:, 20] = y_train: This assigns the label for each feature vector into the 21st column of data_sets. The y_train array is expected to be a 1D array with 40 elements, where each element is the label corresponding to the feature vector in the same row.
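In code, Step 3 is just two assignments into the combined array:

```python
import numpy as np

data_sets = np.zeros((40, 21))  # 20 feature columns + 1 label column
data_sets[:, :20] = X_train     # feature vectors fill the first 20 columns
data_sets[:, 20] = y_train      # labels go into the 21st (last) column
```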
Step 4. Extract to file using csv module
Imports:
The code begins by importing the necessary Python modules. csv is used for CSV file operations, and numpy is used for numerical operations in Python.
Writing to a CSV File:
with open(...) as f: This opens a file named baby_cry_data.csv in write mode ('w') located at the given path in the user's Google Drive folder babycriyingsound. The newline='' parameter is set to prevent the writer from adding extra newlines on Windows platforms.
writer = csv.writer(f): This creates a CSV writer object which will write to the file f.
Writing the Header Row:
writer.writerow(...): This writes the first row of the CSV file, which is typically used for headers. It generates headers for the feature columns as 'Feature_1', 'Feature_2', ..., 'Feature_20', and then adds a 'Label' column at the end.
Writing the Data Rows:
The for loop iterates 40 times (assuming there are 40 samples in the dataset).
writer.writerow(data_sets[i, :]): This writes a row to the CSV file for each sample. It takes the entire ith row from data_sets, which contains 20 feature values and 1 label, and writes it to the CSV file.
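A sketch of this export step, with the Drive path assumed from the folder name mentioned above:

```python
import csv

csv_path = '/content/drive/MyDrive/babycriyingsound/baby_cry_data.csv'  # assumed full path

with open(csv_path, 'w', newline='') as f:  # newline='' avoids extra blank lines on Windows
    writer = csv.writer(f)
    writer.writerow([f'Feature_{i + 1}' for i in range(20)] + ['Label'])  # header row
    for i in range(40):                   # one row per sample
        writer.writerow(data_sets[i, :])  # 20 feature values followed by the label
```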
The code provided outlines several functions used in a gradient descent optimization process for a logistic regression model. Let's go through the main steps and the functionality of each part:
1. AccumAscentCurv Function:
- This function calculates the accumulated absolute difference between consecutive elements in a given list `A`. This could be a feature extraction method, where `X` is constructed by applying `AccumAscentCurv` to each training example in `X_train`.
2. Sigmoid Function:
- It defines the sigmoid activation function, commonly used in logistic regression to map predictions to probabilities.
3. Cost Function:
- `cost_func` computes the cost using cross-entropy loss for logistic regression, which measures the difference between the predicted values (`Y_pred`) and actual labels (`a`).
- `delta` is a small value added to prevent division by zero when taking the logarithm.
4. Error Function:
- This function is actually the same as `cost_func` and calculates the cost using the cross-entropy loss. The name suggests it's used for readability to differentiate between calculating cost and interpreting it as an error.
5. Predict Function:
- Predicts whether the outcome is `1` (hungry) or `0` (laugh) based on the sigmoid of the linear combination of inputs and parameters, using `0.79` as the decision threshold.
6. Numerical Derivative Function:
- Computes the gradient of a function `f` at points `x` using a finite difference approach. It's used to approximate the derivative of the cost function with respect to the parameters `beta0` and `beta1`.
7. Parameters Update:
- Initializes the parameters `beta1` and `beta0` with random values.
- Applies the gradient descent optimization algorithm to update the parameters in the direction that decreases the cost function (i.e., along the negative gradient).
- The learning rate controls how much the parameters are updated during each iteration.
- The process iterates for a large number of epochs (`1000001`), printing out the error value every `100000` steps.
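The following is a hedged reconstruction of these pieces, not the presenters' exact code: function and variable names follow the summary, while the learning rate, initial values, and array shapes are assumptions.

```python
import numpy as np

def AccumAscentCurv(A):
    # Accumulated absolute difference between consecutive elements of A
    return np.sum(np.abs(np.diff(A)))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost_func(Y_pred, a):
    delta = 1e-7  # guards against log(0)
    return -np.sum(a * np.log(Y_pred + delta) + (1 - a) * np.log(1 - Y_pred + delta))

def numerical_derivative(f, x):
    # Central-difference approximation of the gradient of f at x (perturbs x in place)
    dx = 1e-4
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        tmp = x[idx]
        x[idx] = tmp + dx
        f_plus = f(x)
        x[idx] = tmp - dx
        f_minus = f(x)
        grad[idx] = (f_plus - f_minus) / (2 * dx)
        x[idx] = tmp
        it.iternext()
    return grad

# One feature per clip: the accumulated ascent of its feature vector
X = np.array([AccumAscentCurv(row) for row in X_train]).reshape(-1, 1)
a = y_train.reshape(-1, 1)

beta1 = np.random.rand(1, 1)   # slope
beta0 = np.random.rand(1)      # intercept
learning_rate = 1e-5           # placeholder value; not given in the summary

def loss(_):
    Y_pred = sigmoid(np.dot(X, beta1) + beta0)
    return cost_func(Y_pred, a)

for step in range(1000001):    # iteration count as described; reduce for a quick test
    beta1 -= learning_rate * numerical_derivative(loss, beta1)
    beta0 -= learning_rate * numerical_derivative(loss, beta0)
    if step % 100000 == 0:
        print('step', step, 'error', loss(None))

def predict(x):
    # 1 = hungry, 0 = laugh, using the 0.79 threshold described above
    p = sigmoid(x * beta1[0, 0] + beta0[0])
    return 1 if p > 0.79 else 0
```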
Observations and Potential Issues:
- Both `cost_func` and `Error` functions are identical. One of them could be removed to avoid duplication and simplify the code.
- The `predict` function uses a hard-coded threshold of `0.79`, which is an arbitrary value for classifying the predictions. In practice, this threshold should be determined through analysis or set to `0.5` for standard binary classification unless there is a specific reason for a different threshold.
- Using numerical differentiation is computationally expensive and less efficient than analytical derivatives. For logistic regression, analytical derivatives are straightforward to compute.
- There is no stopping criterion in the gradient descent loop aside from the fixed number of epochs, which means it will always run the full number of iterations, even if convergence is reached early.
- The parameters `beta0` and `beta1` are updated simultaneously within the loop without checking for convergence, which might lead to overshooting the minimum of the cost function if the learning rate is not set appropriately.
- There's no output or storage of the final model parameters, which are needed for making predictions on new data.
This is a fairly standard approach to logistic regression, but it's worth noting that practical implementations would typically use more sophisticated optimization algorithms (like Adam or RMSprop) and frameworks that can handle analytical gradients for efficiency, such as scikit-learn, TensorFlow, or PyTorch.
24. 03. 14
Dongyoon from RESEARCH 3
: K-Nearest Neighbor (KNN)
The K-Nearest Neighbors (KNN) algorithm is one of the most straightforward and easy-to-implement machine learning algorithms, utilized in pattern recognition and applicable to both classification and regression problems within supervised learning.
When Used for Classification:
The KNN classification algorithm is employed to predict which category a given data point belongs to. For instance, it can be used to classify emails as 'spam' or 'non-spam', or to determine whether a patient has a certain disease in medical diagnosis.
The mechanism is simple:
1. When a new data point is introduced, the algorithm identifies the 'K' nearest neighbors within the dataset.
2. The new data point is then classified into the category most common among its 'K' nearest neighbors, based on majority voting.
When Used for Regression:
In regression problems, KNN is used to predict continuous values. This could be predicting numerical values such as the price of a house or temperature.
The principle for regression is as follows:
1. The 'K' nearest neighbors to the new data point are located.
2. The new data point's value is predicted by calculating the average of the continuous values of these neighbors.
Key Considerations:
- Distance Measurement: KNN uses a metric of distance to determine neighbors. The most common distance metric used is the Euclidean distance.
- Choice of K Value: The value of 'K' greatly impacts performance. This value can be optimized using techniques such as Cross-validation.
- Data Normalization: As KNN is based on distance, it's important to normalize the data so that all features are on the same scale.
- Computational Complexity: Since KNN calculates the distance for every single data point, it can be computationally expensive with large datasets.
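A brief illustration of these considerations with scikit-learn; the toy data and the choice of K below are purely illustrative:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Toy 2-D data with two classes (illustrative only)
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# Normalization matters because KNN is distance-based
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

# Cross-validation is one way to compare candidate K values
scores = cross_val_score(knn, X, y, cv=3)
print('CV accuracy for K=3:', scores.mean())

knn.fit(X, y)
print('Predicted class for (2, 2):', knn.predict([[2, 2]])[0])
```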
Applications:
- Disease diagnosis in the medical field
- Credit rating in banking
- Anomaly detection in stock markets
- Recommending items to users in recommendation systems
KNN is a simple and effective algorithm; however, it can be inefficient with large datasets, is sensitive to outliers, and requires careful consideration when choosing the 'K' value and scaling the features.
K-Nearest Neighbor (KNN) Average
The image features the K-Nearest Neighbor (KNN) algorithm at work, highlighting the concept of '1-nearest neighbor' with accompanying graphics. At the center, a new data point is depicted, distinguished by a red dot labeled 'New'. Scattered around this central point are other dots, colored yellow and blue, each representing existing data points within the dataset.
This particular panel of the image serves to illustrate how the KNN algorithm functions when set to K=1, the simplest case. When a prediction is required for the new data point, KNN looks for the single closest existing neighbor. In this graphic, the proximity is visualized by a circle emanating from the 'New' data point, encompassing the nearest neighbor, indicated by the nearest yellow dot linked with a red arrow.
The crux of KNN's operation lies in this concept of neighborhood; the algorithm asserts that similar things exist in close proximity. In the context of K=1, the assumption is that the closest data point has the most influence and therefore, the label (or value in the case of regression) of the nearest neighbor is assigned directly to the new data point. This method of labeling based on the nearest single neighbor is a direct approach but can be sensitive to noise in the data. It makes the algorithm highly flexible but also quite susceptible to overfitting, especially when the dataset has a lot of variance.
The simplicity of K=1 KNN makes it a useful tool for understanding the fundamental mechanics of the algorithm. Yet, in practice, selecting a higher K value is common as it tends to yield better generalization by reducing the noise and ensuring that the classification or prediction is based on a broader sampling of the data. This accounts for more complexity in the relationships within the data, potentially leading to more robust and accurate predictions.
The image presents an illustrative example of the K-Nearest Neighbors (KNN) algorithm, specifically highlighting the scenario when K is set to 3. The KNN algorithm is a versatile and intuitive method used in machine learning for both classification and regression tasks. It belongs to the family of instance-based, non-parametric learning algorithms. Non-parametric means that it does not make any assumptions about the underlying data distribution. This is particularly useful in real-world scenarios, where practical data often do not follow theoretical distributional assumptions.
In the given visualization, we observe a bidimensional space populated with data points belonging to two distinct classes, typically represented by different colors. In our case, the classes are denoted by yellow and blue points. A new data point, marked as 'New' and depicted by a red dot, has been introduced into this space and requires classification.
The essence of the KNN algorithm lies in its simplicity: classify a new data point based on the majority class among its nearest neighbors. The circle drawn around the 'New' data point represents the boundary within which the three nearest neighbors are located. These neighbors are linked to the 'New' point by red lines, visually indicating the process of neighbors' selection. The choice of three neighbors is an arbitrary decision made by the practitioner and can greatly influence the classification accuracy.
In a real-world scenario, choosing the optimal K is a result of trade-offs: too small a K may lead to overfitting, overly complex models that capture noise in the dataset; too large a K may cause underfitting, overly simple models that cannot capture the complexity of the data. Optimal K values are usually selected through a process of cross-validation.
In this particular instance, with K=3, we notice two blue points and one yellow point falling within the new data point's neighborhood. The classification of the 'New' point will be based on the majority vote principle, implying that if we follow a simple majority rule, the 'New' data point will be classified as belonging to the blue class. This decision process encapsulates the core concept of KNN as a method that relies on local approximation and all available data, as opposed to a global generalization model.
The KNN algorithm doesn't learn any model with parameters and thus is considered a type of lazy learning. This characteristic means that the computational cost is incurred at the time of prediction because the algorithm must compute the distance between the new point and every other point in the dataset. Therefore, while KNN can be extremely accurate, it can also become computationally expensive and slow as the size of the dataset grows, making it less suitable for applications with very large datasets or those requiring real-time prediction.
The visualization conveys this mechanism with elegant simplicity, yet behind it lies the intricate balance of machine learning theory, practical application, and the constant quest for achieving the highest possible accuracy in predicting outcomes.
This image is a visual aid for understanding the K-Nearest Neighbors (KNN) algorithm, specifically how it is used for regression to predict a numerical value. In the context of KNN regression, the prediction is made by averaging the values of the K nearest neighbors to the new data point.
In the graphic, we have a two-dimensional feature space with numerous blue points, each representing an individual data point with its own value. The red point marked "New" is the query point for which we want to predict a value based on its neighbors.
There are two scenarios presented:
1. When K=1: The algorithm looks for the single nearest neighbor. In this example, the closest data point to the new red point has a value of 15. Hence, when K is set to 1, the predicted value for the new data point is simply the same as its nearest neighbor, which is 15.
2. When K=3: The prediction is an average of the three nearest neighbors' values. Here, the nearest neighbors have values of 15, 30, and 21. To calculate the prediction for the new data point, we average these values: (15 + 30 + 21) / 3, resulting in 22. Therefore, the predicted value for the new data point when K is set to 3 is 22.
The image effectively demonstrates how the choice of K affects the predicted outcome in KNN regression. With a larger K, the prediction is typically smoother and less sensitive to noise in the data, as it is based on a larger number of neighbors. However, if K is too large, it may smooth over important trends in the data, which can lead to underfitting. Conversely, a smaller K can capture more of the data's nuances but can also make the prediction susceptible to noise, potentially leading to overfitting.
The visual representation also includes two circles centered on the "New" data point, with the inner circle representing the boundary for K=1 and the outer circle for K=3. This illustrates how the neighborhood expands as K increases.
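The averaging described in the figure can be reproduced directly with the example values:

```python
import numpy as np

k1_neighbors = [15]          # value of the single nearest neighbor
k3_neighbors = [15, 30, 21]  # values of the three nearest neighbors

print('K=1 prediction:', np.mean(k1_neighbors))  # 15.0
print('K=3 prediction:', np.mean(k3_neighbors))  # (15 + 30 + 21) / 3 = 22.0
```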
This image is a visual representation of the Euclidean distance formula in a multi-dimensional space. The Euclidean distance is the straight-line distance between two points in Euclidean space, which can be extended beyond two dimensions.
The formula given at the top of the image is the general formula for Euclidean distance between two points A and B in a p-dimensional space:
\[ d(A, B) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \ldots + (a_p - b_p)^2} = \sqrt{\sum_{i=1}^{p} (a_i - b_i)^2} \]
Here, \( a_1, a_2, \ldots, a_p \) represent the coordinates of point A, and \( b_1, b_2, \ldots, b_p \) represent the coordinates of point B in a p-dimensional space.
On the left side of the image, we see a 2-dimensional representation where point A is at coordinates (1,1) and point B is at (3,3). The Euclidean distance between these two points in 2D space is calculated using the formula, resulting in a distance of \( \sqrt{8} \) or approximately 2.83 units.
On the right side of the image, the concept is expanded into a 3-dimensional space, with point A at (2,0,0) and point B at (0,3,2). The Euclidean distance is calculated using the 3-dimensional version of the formula, resulting in \( \sqrt{17} \), which is approximately 4.12 units.
The image visually demonstrates how the distance is the hypotenuse of a right-angled triangle in 2D, and the principle extends into 3D space. The blue lines depict the direct path from point A to point B, which represents the Euclidean distance. This concept is essential in many applications, including machine learning algorithms like K-Nearest Neighbors, where the distance between points is used to determine the nearest neighbors for classification or regression tasks.
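Both worked examples can be checked in a few lines, using the coordinates given above:

```python
import numpy as np

def euclidean(a, b):
    # Straight-line distance between two points with the same number of dimensions
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

print(euclidean([1, 1], [3, 3]))        # sqrt(8)  ~ 2.83
print(euclidean([2, 0, 0], [0, 3, 2]))  # sqrt(17) ~ 4.12
```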
24. 03. 07
Yujin from RESEARCH 1
: K-means Clustering
K-means clustering is a fundamental algorithm in unsupervised learning, a branch of machine learning where the data has no labels, and we aim to find the inherent structure in the data. The term 'cluster' refers to a collection of data points aggregated together because of certain similarities.
The concept of 'clustering' involves grouping a set of objects in such a way that objects in the same group, called a cluster, are more similar to each other than to those in other groups. It’s a method widely used for statistical data analysis, which serves to discover structure in a dataset.
In a nutshell, K-means clustering is a powerful tool for data analysis, pattern recognition, and feature learning, and it provides critical insights into the underlying structure of the data which can be further used for more sophisticated data analysis and machine learning tasks.
Euclidean Distance & Manhattan Distance
- Euclidean Distance
Two widely used distance measures are the Euclidean distance and the Manhattan distance; both appear in applications such as geometry, physics, and machine learning for measuring the distance between two points.
The Euclidean distance, often referred to as the "ordinary" distance, is the straight-line distance between two points in Euclidean space. The formula provided in the image outlines the calculation of this distance in both one-dimensional and two-dimensional spaces. For one dimension, the Euclidean distance is the absolute value of the difference between the two points (|p - q|), which simplifies to the length of the segment connecting the points on the number line. In two dimensions, it extends to \(\sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}\), which is the length of the diagonal of the rectangle formed between the two points (p and q) on the Cartesian plane.
The Manhattan distance, also known as the taxicab distance or city block distance, measures the distance between two points in a grid-based path like the street layout of Manhattan. It is the sum of the absolute differences of their coordinates. This measure is not provided in detail in the text but is often used in optimization problems where diagonal movement is not permitted.
The image also includes a graphical representation of the Euclidean distance in a two-dimensional Cartesian coordinate system, where the points p and q have coordinates (p1, p2) and (q1, q2), respectively. The Euclidean distance is the length of the line segment connecting these two points, which is the hypotenuse of a right-angled triangle with sides parallel to the axes, and can be computed using the Pythagorean theorem.
Overall, this image provides an overview of different ways to calculate distances between points, emphasizing the Euclidean distance's significance and its mathematical computation.
In the context of different dimensions:
In a single dimension, the Euclidean distance can be represented as the absolute value of the difference between two points.
In two dimensions, the distance formula expands to the square root of the sum of the squared differences in both the x and y coordinates.
For a space with k dimensions, the distance is the square root of the sum of the squared differences across all dimensions.
Visual Representation: The slide also includes a diagram showing a three-dimensional representation of Euclidean distance. In this diagram, the distance between point p (with coordinates p1, p2, p3) and point q (with coordinates q1, q2, q3) is the hypotenuse of a right-angled triangle in three-dimensional space, calculated as the square root of the sum of the squares of differences between corresponding coordinates.
Highlighting the Difference Between Points: The slide emphasizes that this distance metric can capture the direct 'as-the-crow-flies' distance, regardless of the dimensionality of the space.
This image outlines the Manhattan distance, also known as the taxicab distance or L1 norm. It’s a distance metric that measures the distance between two points in a grid-based path, considering only the vertical and horizontal steps, rather than the diagonal distance that the Euclidean distance would measure. The key points from the image can be described as follows:
- Manhattan Distance: This term defines a method to compute the distance that only allows for horizontal and vertical movement, mirroring the way one would navigate through a city block grid like that of Manhattan.
- In a practical context: The Manhattan distance is utilized in various applications where only orthogonal (right-angle) movements are permitted. For example, in digital signal processing, it can be used to calculate the similarity between different signals.
- Mathematical Definition: The mathematical formula for calculating the Manhattan distance between two points p and q in a k-dimensional space is the sum of the absolute differences of their respective coordinates. It is represented by the formula:
\[ d(p, q) = |p_1 - q_1| + |p_2 - q_2| + \ldots + |p_k - q_k| \]
- Sensitivity to Outliers: Compared to the Euclidean distance, the Manhattan distance is less influenced by outliers because it accumulates the absolute differences across each dimension, which can mitigate the impact of large discrepancies in any single dimension.
The image also includes a graphical representation, probably of a two-dimensional grid, where the Manhattan path is shown in right-angled steps connecting two points. This visualization helps to convey the concept that the Manhattan distance is effectively the sum of the absolute horizontal and vertical distances between points on a grid, which is a useful property in various real-world applications, including urban planning, robotics, and more.
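A small comparison of the two metrics on one pair of points (chosen only for illustration):

```python
import numpy as np

p = np.array([1.0, 1.0])
q = np.array([4.0, 5.0])

euclidean = np.sqrt(np.sum((p - q) ** 2))  # straight-line distance: 5.0
manhattan = np.sum(np.abs(p - q))          # grid (L1) distance: 3 + 4 = 7.0

print('Euclidean:', euclidean, 'Manhattan:', manhattan)
```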
K-means Clustering Process
Here is an elaboration of K-means clustering process:
1. Selection of K: The process starts with the selection of 'K', which is the number of clusters you want to identify in the dataset. This is a critical step as it defines the granularity of the clustering.
2. Initialization: Once 'K' is chosen, the next step is to initialize 'K' centroids in the data space. These centroids can be randomly selected from the data points, or they could be strategically placed through an approach like K-means++ to improve the chances of optimal clustering.
3. Assignment: The assignment phase involves allocating each data point to the nearest centroid. The "nearest" typically means the centroid with the minimum Euclidean distance from the data point. Each data point is evaluated against all centroids, and is associated with the centroid that is closest.
4. Update: After all points have been assigned to clusters, the position of the centroids is recalculated. This is done by taking the average of all the points within each cluster, hence the term "means" in K-means. This average becomes the new centroid for each cluster.
5. Iteration: The assignment and update steps are repeated iteratively. With each iteration, the centroids move within the data space, and the points are reassigned. This process continues until there is little to no change in the positions of the centroids, indicating that the clusters are stable and the algorithm has converged.
The end result is a segmentation of the dataset into 'K' clusters, where each cluster is defined by the proximity of its points to a central centroid. The K-means algorithm is a method to partition the dataset into distinct groups that minimize the variance within each group and maximize the variance between groups. It's a widely used technique for exploratory data analysis and pattern discovery.
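A compact way to run this process is with scikit-learn's KMeans; the data and the choice of K below are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two loose blobs of 2-D points (illustrative only)
X = np.array([[1, 2], [1, 4], [2, 3],
              [8, 8], [9, 9], [8, 10]], dtype=float)

# n_clusters is the 'K' chosen in step 1; n_init controls how many random
# initializations (step 2) are tried before keeping the best run
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print('Cluster assignments:', kmeans.labels_)         # step 3: nearest-centroid assignment
print('Final centroids:\n', kmeans.cluster_centers_)  # steps 4-5: means after convergence
```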
24. 02. 22
Seoyoung from RESEARCH 2
: BERT
This slide outlines the key sections of the presentation. It covers the fundamental concepts of BERT, how its architecture is designed, how to fine-tune BERT models for specific tasks, and finally, how to apply BERT in practical scenarios, including the actual code implementations. The visual backdrop of a machine implies the mechanical or systematic nature of the algorithm, even though BERT itself is a software-based model.
This slide provides an overview of BERT's (Bidirectional Encoder Representations from Transformers) main features:
Bidirectional: This term indicates that BERT reads the input data in both directions (left-to-right and right-to-left) for a comprehensive understanding of the context. This is in contrast to previous models that processed data in only one direction.
Encoder: The slide mentions the use of an encoder, which refers to the part of the transformer architecture that processes the input text. In BERT's case, the model is made up entirely of encoders stacked on top of each other (illustrated by multiple layers of "Encoders" on the slide).
Representations from: This part of the acronym refers to BERT's ability to generate representations from the input data. These representations capture the context of each word within the sentence.
Transformers: The diagram depicts the transformer model architecture, which consists of encoders and decoders. The transformer model is the underlying architecture of BERT. However, BERT only uses the encoder stack for its operations.
The slide focuses on the attention mechanism which is a key component of the BERT model. Here’s a breakdown of what the slide contains:
BERT 구조 (BERT Structure): The slide is titled 'BERT Structure', indicating that it will discuss the internal mechanisms of BERT.
Learning by giving the model whole sentences from the web with some words masked, and having it predict the word behind each mask: This text explains that BERT learns by being provided with sentences where some words are masked out (replaced with a [MASK] token). The training process involves predicting these masked words based on the context provided by the rest of the sentence.
Q, K, V, d_k: These are the components of the scaled dot-product attention mechanism:
Q (Query): Represents the vector that queries the keys.
K (Key): Represents the keys that are compared against the queries.
V (Value): Represents the values that are retrieved based on the query-key match.
d_k: Represents the dimension of the key vectors. It is used to scale the dot product to avoid extremely large values when computing the softmax function.
The matrices and equations below illustrate the calculation process of the attention scores and the output of the attention mechanism. The attention scores determine how much focus to put on other parts of the input sentence when encoding a particular word.
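A minimal numerical sketch of scaled dot-product attention as described; the random matrices below merely stand in for the learned projections of the token representations:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # compare each query against every key
    weights = softmax(scores, axis=-1)  # attention weights sum to 1 per query
    return weights @ V                  # weighted sum of the values

# Toy example: 4 tokens, key/query/value dimension 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))

print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8): one output vector per token
```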
This slide illustrates the Masked Language Model (MLM) pre-training aspect of BERT (Bidirectional Encoder Representations from Transformers). The MLM task is one of the two pre-training strategies used by BERT to understand language context and improve language representation.
In the MLM task, some percentage of the input tokens are masked at random, and the goal is to predict these masked tokens based on their context. This encourages the model to learn a deep, contextualized representation of language.
Here's what's depicted on the slide:
BERT Pre-train (1) 마스크 언어 모델 (Masked Language Model): This indicates the first pre-training component of BERT, which is the Masked Language Model.
BERT(12-layers): This refers to the architecture of BERT, indicating that the model has 12 transformer layers. These layers work together to process the input text and generate predictions for the masked tokens.
[CLS], my, [MASK], is, cute, [SEP], king, likes, play, ##ing, [SEP]: This is an example of an input sequence fed into BERT during pre-training. The [CLS] token is used for classification tasks and is added to the start of every input sequence. The [MASK] token indicates that the original word at this position is hidden from the model and must be predicted. [SEP] is a separator token, often used to separate two sentences or to indicate the end of a sequence.
MLM Classifier: Above the BERT architecture, there are several "MLM Classifier" blocks. These blocks likely represent the output layer specific to the MLM task, where each masked token is classified into one of the vocabulary tokens.
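The masked-word prediction can be tried directly with a pre-trained BERT, for example through the Hugging Face transformers library; the model name and sentence below are illustrative, not the slide's exact setup:

```python
from transformers import pipeline

# Ask a pre-trained BERT to fill in the [MASK] token, mirroring the MLM objective
unmasker = pipeline('fill-mask', model='bert-base-uncased')

for candidate in unmasker('My [MASK] is cute.')[:3]:
    print(candidate['token_str'], round(candidate['score'], 3))
```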
Here's what's on the slide, translated and explained:
BERT Pre-train (2) 다음 문장 예측 (Next Sentence Prediction): This indicates the second pre-training objective of BERT, focusing on predicting whether two sentences are sequentially related.
두 개의 문장을 준 후, 이어지는지 아닌지 맞추는 방식 (After giving two sentences, predicting whether they follow each other or not): BERT is provided with pairs of sentences during training, and it must predict whether the second sentence is a logical follow-up to the first.
이어지는 문장의 경우 (For the case where the sentence continues):
Sentence A: "The man went to the store."
Sentence B: "He bought a gallon of milk."
Label = IsNextSentence
이어지는 문장이 아닌 경우 (For the case where the sentence does not continue):
Sentence A: "The man went to the store."
Sentence B: "Dogs are so cute."
Label = NotNextSentence
In these examples, BERT learns from Sentence A and Sentence B to predict if they are likely to be adjacent sentences in a natural text sequence. For the first pair, Sentence B is a logical continuation of Sentence A, so the correct label is "IsNextSentence." For the second pair, Sentence B is not related to Sentence A, and the correct label is "NotNextSentence."
This NSP task enables BERT to understand the flow of ideas and narrative within a text, which significantly improves its performance on downstream NLP tasks that require understanding the relationship between sentences.
The BERT (Bidirectional Encoder Representations from Transformers) Embedding Layer is a crucial part of the BERT architecture, and it's responsible for converting input text into a form that the model can process. The embedding layer comprises three main components:
1. WordPiece Embeddings:
- BERT uses a subword tokenization algorithm called WordPiece. It splits words into smaller pieces (tokens) so that the model can handle a wide range of vocabulary without having a separate embedding for every possible word.
- For example, the word "playing" could be split into "play" and "##ing". This allows BERT to understand that "playing" is related to "play" and to generalize across various forms of the word.
- Each WordPiece token is assigned an initial embedding vector.
2. Positional Embeddings:
- Unlike traditional sequential models like RNNs or LSTMs, BERT processes all tokens simultaneously and hence does not inherently capture sequential information.
- To address this, BERT adds positional embeddings to the token embeddings. These are vectors that encode the position of each token in the sequence, allowing BERT to take into account the order of words.
- The positional embeddings have the same dimension as the token embeddings so that they can be added together.
3. Segment Embeddings:
- BERT can take a pair of sentences as input for tasks such as question answering or natural language inference.
- Segment embeddings are added to distinguish between the first and the second sentence in a pair. Each sentence is marked with a separate segment embedding, allowing the model to differentiate between them.
- There are only two segment embeddings in the model, one for each possible sentence in a pair.
When an input sequence is fed into BERT, each word is first converted into its corresponding WordPiece token and then mapped to its initial token embedding. Positional embeddings are added to these to capture the sequence order, and if the input is a pair of sentences, segment embeddings are included as well. The combination of these embeddings results in a rich representation of each token, infused with semantic, syntactic, and positional information.
All of these embeddings are learned during the pre-training phase of BERT on a large corpus of text, which allows the model to learn contextual relationships between words and their positions within a sentence or across sentence pairs. The final embedded vectors then serve as the input to the subsequent layers of the BERT model.
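How these three inputs come about can be seen from the tokenizer output for a sentence pair, here using the Hugging Face BertTokenizer with illustrative sentences:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# A sentence pair is encoded as: [CLS] sentence A [SEP] sentence B [SEP]
encoded = tokenizer('my dog is cute', 'he likes playing')

print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))  # WordPiece tokens
print(encoded['token_type_ids'])  # segment IDs: 0 for sentence A tokens, 1 for sentence B
# Positional embeddings are added inside the model from each token's index,
# so they do not appear in the tokenizer output.
```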
The slide titled "BERT Fine-Tuning" outlines the process of customizing a pre-trained BERT model for various specialized tasks. After its initial pre-training on a large corpus, BERT's parameters are fine-tuned with additional training on a task-specific dataset. This allows BERT to adjust its parameters to the specifics of a particular task.
The slide lists several NLP tasks suitable for BERT fine-tuning:
- Text Classification: Categorizing texts into predefined classes.
- Natural Language Inference: Determining the relationship between sentences, like entailment or contradiction.
- Question Answering: Providing answers to questions based on given context.
- Named Entity Recognition: Identifying and classifying entities within the text into predefined categories such as person names, organizations, locations, etc.
Lastly, the slide indicates that for their purposes, BERT will be fine-tuned for text classification tasks, which could involve classifying text by emotion or other categories. Fine-tuning tailors the BERT model to perform well on specific tasks by leveraging its general understanding of language, honed during pre-training, and applying it to the particularities of specialized datasets.
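A minimal fine-tuning sketch for text classification with the Hugging Face transformers library; the texts, label count, and hyperparameters below are placeholders rather than the setup used in the seminar:

```python
import torch
from torch.optim import AdamW
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Tiny placeholder dataset: two texts with binary labels
texts = ['I love this movie', 'This was a terrible experience']
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)  # the classification head also returns a loss
outputs.loss.backward()                  # one gradient step, as an illustration
optimizer.step()

print('training loss:', float(outputs.loss))
```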
You can access more presentation materials by getting in touch with Min-Gyu.