The first script extracts data from YouTube for a given search query, focusing on video titles, view counts, and URLs. It uses the Selenium and BeautifulSoup libraries to scrape the search results for videos matching the query, storing the title, views, and URL of each video in a list called videos_data.
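The collection step might be sketched as below. This is not the original script: the CSS selector and the view-count format ("1.2M views") are assumptions about YouTube's current markup and would need verifying against the live page before use.

```python
import re

def parse_views(view_text):
    """Convert a YouTube view string such as '1.2M views' to an integer.

    The 'K'/'M' suffix handling is an assumption about YouTube's display
    format; adjust if the page renders counts differently.
    """
    match = re.search(r"([\d.]+)([KM]?)", view_text.replace(",", ""))
    if not match:
        return 0
    number, suffix = float(match.group(1)), match.group(2)
    multiplier = {"": 1, "K": 1_000, "M": 1_000_000}[suffix]
    return int(number * multiplier)

def scrape_search_results(query, limit=20):
    """Sketch of the Selenium + BeautifulSoup collection step.

    Imports are deferred so the pure helper above stays testable
    without a browser. The 'a#video-title' selector is an assumption;
    YouTube's markup changes often.
    """
    from bs4 import BeautifulSoup
    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get(f"https://www.youtube.com/results?search_query={query}")
    soup = BeautifulSoup(driver.page_source, "html.parser")
    driver.quit()

    videos_data = []
    for item in soup.select("a#video-title")[:limit]:
        videos_data.append({
            "title": item.get("title", ""),
            "url": "https://www.youtube.com" + item.get("href", ""),
            "views": parse_views(item.get("aria-label", "")),
        })
    return videos_data
```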
Once the data is collected, the script sorts the videos based on their view counts, selecting the top 5 videos for deeper analysis. For each of these top videos, the script retrieves the number of likes using regular expressions applied to the HTML content of the video page. This information is then displayed alongside the video's title, views, and URL.
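The sorting and like-extraction steps could look like the following minimal sketch; the "likes" regex is an assumed pattern for the video-page HTML, not the original script's exact expression.

```python
import re

def top_videos(videos_data, n=5):
    """Sort the collected records by view count, highest first."""
    return sorted(videos_data, key=lambda v: v["views"], reverse=True)[:n]

def extract_likes(page_html):
    """Pull a like count out of raw video-page HTML with a regex.

    The pattern below is an assumption about how the like count
    appears in the page source and may need updating.
    """
    match = re.search(r"([\d,]+)\s+likes", page_html)
    return int(match.group(1).replace(",", "")) if match else None
```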
Following the extraction of view and like data, the script proceeds to scrape the transcripts of the top 5 most viewed videos. It employs the youtube_transcript_api library to access YouTube's automatically generated transcripts. For each video, the script extracts the video ID from the URL and retrieves the corresponding transcript using the YouTubeTranscriptApi.get_transcript(video_id) method.
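A sketch of that step, assuming standard watch URLs of the form https://www.youtube.com/watch?v=...; the get_transcript call matches the method named above but requires the package and network access, so its import is deferred.

```python
from urllib.parse import urlparse, parse_qs

def extract_video_id(url):
    """Pull the video ID out of the 'v' query parameter of a watch URL."""
    query = parse_qs(urlparse(url).query)
    return query.get("v", [None])[0]

def fetch_transcript(url):
    """Retrieve the auto-generated transcript for one video.

    Requires the youtube_transcript_api package and network access;
    the call mirrors YouTubeTranscriptApi.get_transcript(video_id)
    as used in the script.
    """
    from youtube_transcript_api import YouTubeTranscriptApi
    return YouTubeTranscriptApi.get_transcript(extract_video_id(url))
```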
Once the transcript data is obtained, the script cleans it by extracting only the text content from the JSON transcript objects. It iterates through each transcript item, collecting the spoken words, and combines them into a single string. This process effectively removes any timestamps or metadata associated with the transcript, resulting in a cohesive block of text representing the spoken content of the video.
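The cleaning step reduces to joining the "text" fields of the transcript items, which the API returns as dicts with "text", "start", and "duration" keys:

```python
def clean_transcript(transcript):
    """Strip timestamps and metadata, keeping only the spoken text.

    Each transcript item is a dict with 'text', 'start', and
    'duration' keys; only 'text' is kept, joined into one string.
    """
    return " ".join(item["text"].replace("\n", " ") for item in transcript)
```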
The first image displays one example of raw transcript data collected from our YouTube web scraping, and beneath it is a cleaned example. Here we use the topic of "Discrete Random Variables". In the first image, the text data includes timestamps, text indicators, and metadata. Beneath, the text is unstructured with no visible timestamps or metadata, indicating it has been cleaned and prepared for analysis. The focus of this data is solely the spoken content of the video: all lectures, discussions, and examples presented in the video.
The visualizations below display our preliminary exploratory statistical analysis. Here, we primarily employ the TextBlob Python library to quantify the text. In doing so, we can perform a range of natural language processing (NLP) tasks, such as noun phrase extraction, sentiment analysis, and subjectivity analysis. By utilizing TextBlob, which is built on the NLTK and Pattern libraries, we are able to comprehend and analyze the complexities of language data. Through these visualizations, we begin to uncover insights and patterns within the text, laying the groundwork for deeper analyses and informed model implementation.
To begin our exploratory analysis, we defined our sample video transcript as a multi-line string. Then, we constructed a TextBlob object from the transcript using TextBlob(text). Next, we utilized the noun_phrases attribute of the TextBlob object to perform noun phrase extraction, which extracts phrases that act as nouns in the text.
We display the extracted keywords, where the size of each keyword corresponds to its frequency of occurrence in the transcript. These keywords represent important phrases that can help identify the key topics or concepts discussed in the video.
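The occurrence counts that drive the keyword sizes can be computed with a simple frequency count over the extracted phrases, e.g.:

```python
from collections import Counter

def phrase_frequencies(phrases):
    """Count how often each extracted phrase occurs (case-insensitive);
    these counts drive the relative sizes in the keyword display."""
    return Counter(p.lower() for p in phrases)
```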
Subjectivity analysis is a natural language processing (NLP) technique used to determine the degree of subjectivity or objectivity present in a piece of text. The subjectivity score, ranging from 0 to 1, quantifies the overall subjectivity of the text, where a higher score indicates a higher degree of opinionated content. Understanding the subjectivity level of text is beneficial for various tasks, such as content filtering, summarization, and decision-making. It enables automated systems to prioritize subjective content for sentiment analysis or opinion mining tasks and provides valuable context for interpreting textual data in real-world applications.
The pie chart presents the degree of subjectivity in the text, distinguishing between factual and opinionated content. Through this visual representation, we can easily grasp the balance between the two, aiding in the understanding and interpretation of the text's tone. In the context of the provided example, a lecture on discrete random variables, it makes sense that the majority of the text would be factual. However, there is quite a bit of opinionated text, which we can cut from the data as we move to model implementation.
Sentence length refers to the number of words or characters contained within a sentence in written or spoken language. Analyzing sentence length can provide insights into the structure, complexity, and cohesion of a text. In natural language processing (NLP), researchers use sentence length statistics to assess writing style, readability, and linguistic features.
After computing the lengths of all sentences in terms of the number of words, we plotted a histogram with 10 bins, depicting the frequency of sentence lengths. Overlaying the histogram is an exponential distribution curve, which represents the hypothetical distribution of sentence lengths if they were governed by an exponential process. The curve is fitted to the histogram using the mean sentence length minus the minimum observed length as the scale parameter, ensuring alignment with the data.
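The length computation and the fitted scale parameter might be sketched as follows; the regex sentence split is a simple stand-in for a proper sentence tokenizer, not the original code.

```python
import re
from statistics import mean

def sentence_lengths(text):
    """Split on terminal punctuation (a rough stand-in for a real
    sentence tokenizer) and count words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

def exponential_scale(lengths):
    """Scale parameter of the overlaid exponential curve: mean
    sentence length minus the minimum observed length."""
    return mean(lengths) - min(lengths)
```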
The resulting graph provides a visual comparison between the observed distribution of sentence lengths and the exponential distribution, offering insights into the underlying structure of sentence lengths within the text. This visualization aids in understanding the typical sentence length patterns and the potential presence of outliers or non-exponential behavior in the data.
This box plot offers a comprehensive visualization of the distribution of sentence lengths within our given transcript. The box represents the interquartile range (IQR) of sentence lengths, providing insight into the structure of the language used. Here we can see that the median sentence length is about 12 words and there are a few outliers, one of which is over 70 words long. This is an area of interest: it may indicate a span that needs further cleaning, or a rambling sentence that is dense with important information. Either way, this analysis has guided our attention to an area that clearly needs it.
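The box plot's outlier rule, points beyond 1.5 times the IQR from the quartiles, can be reproduced with the standard library, e.g.:

```python
from statistics import quantiles

def length_outliers(lengths):
    """Flag sentence lengths outside 1.5 * IQR of the quartiles,
    the same rule a box plot uses for its outlier points."""
    q1, q2, q3 = quantiles(lengths, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in lengths if x < low or x > high]
```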
These visualizations allow us to observe how the average sentiment score varies across different ranges of sentence lengths. Here, we compute the length of each sentence in words and calculate the polarity score of each sentence using sentiment analysis techniques.
The resulting visualization is a scatter plot: the x-axis represents sentence length (measured in words) and the y-axis depicts the polarity score, reflecting the emotional sentiment expressed in each sentence. The scatter plot provides a visual depiction of the relationship between sentence length and sentiment polarity within the text, enabling a visual assessment of any potential correlations or patterns. This aids in understanding how the length of sentences may influence the emotional tone of the text, facilitating deeper analysis and interpretation of the text's content and sentiment.
The correlation coefficient, a statistical measure quantifying the strength and direction of the relationship between the two variables, is utilized to measure the degree of correlation between sentence length and polarity. A correlation coefficient close to 1 or -1 suggests a strong positive or negative correlation, respectively. A correlation coefficient close to 0 indicates no discernible correlation between the two variables. In our case, a correlation score of 0.06 suggests that changes in sentence length have little to no effect on polarity score.
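A minimal Pearson implementation of that computation (Python 3.10+ also offers statistics.correlation with the same behavior):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length
    numeric sequences: covariance divided by the product of the
    standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```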
The stacked bar chart provides a clear comparison between the sentiment scores within each range of sentence lengths. By examining the height of each bar, we can discern the average sentiment score within each range of sentence lengths. Overall, the stacked bar chart provides a comprehensive overview of the distribution of sentiment scores across different ranges of sentence lengths, aiding in the interpretation and analysis of the emotional content within the transcript.
Each cell in the heatmap corresponds to a bin defined by ranges of sentence lengths and polarity scores. The color intensity of each cell reflects the density of data points within that bin: brighter colors indicate lower densities, while darker colors represent higher densities. This visualization offers insights into patterns and trends within the text data. Areas of higher density may indicate regions where sentences cluster around specific lengths and polarity scores, highlighting potential areas of focus or interest, while regions of lower density suggest sparser data points, where sentences vary more widely in length and polarity.
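The binning behind such a heatmap can be sketched with a plain counter over (length bin, polarity bin) cells; the bin widths below are illustrative choices, not the plot's actual parameters.

```python
from collections import Counter

def density_grid(lengths, polarities, length_bin=5, polarity_bin=0.25):
    """Bin (sentence length, polarity) pairs into a grid of counts;
    each cell's count is what the heatmap colors by density."""
    grid = Counter()
    for length, polarity in zip(lengths, polarities):
        cell = (length // length_bin, int(polarity // polarity_bin))
        grid[cell] += 1
    return grid
```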
Because the majority of the heat map is low density, it suggests that there is a wide dispersion of sentence lengths and polarity scores across the text, with no clear patterns or clusters. This makes sense as there is little to no correlation between the two variables. It may also indicate that the transcript has very diverse sentence structures and emotional tones, resulting in a more uniform distribution of data points. Additionally, it may suggest that the text contains a mix of factual information and subjective opinions dispersed throughout, rather than concentrated in specific sections. This is supported by our previous subjectivity analysis. These low-density regions present an opportunity for further exploration and analysis. By examining these regions more closely, we can identify more nuanced patterns that may improve our understanding of the transcript's content and sentiment. These insights can allow for more targeted text processing and analysis techniques for our summarizations and model implementation.
Sentiment analysis identifies the most positive and negative sentences or phrases within the transcript by applying a pre-trained sentiment model to score the polarity of each segment. The segments with the strongest positive or negative polarity scores reveal where the transcript conveys strong sentiment, allowing for an understanding of the emotional content of the video.
This histogram offers a visualization of the distribution of sentiment polarity scores within our transcript. The x-axis of the histogram represents the range of sentiment polarity scores, spanning from the most negative to the most positive values. The y-axis illustrates the frequency or count of text elements falling within each polarity score bin. It is clear that the majority of the sentiment is neutral, centered around 0 with a wide dispersion, signifying variability in sentiment expressions and indicating the complexity of the text's emotional content. However, there is also a peak of outliers around -0.5 that lies beyond the main distribution. This may represent especially negative text elements that we may want to address before model implementation or remove from our summaries of the transcript.
This violin plot showcases the distribution of sentiment polarity scores, offering insight into the emotional spectrum present in the video transcript. In the context of sentiment analysis, since we're visualizing polarity scores, the most dense part of the violin plot indicates the range of polarity scores that occur most frequently in the dataset. Because the most dense part of the plot is centered around a polarity score of 0 (neutral sentiment), it suggests that the majority of the text data has a neutral sentiment. That being said, we can also observe the presence of both positive and negative outliers, potentially an area for further exploration or cleaning as we subject the data to model testing.
Here, each sentence in the transcript is considered as an index. The line plot visualizes the sentiment scores assigned to each sentence in the transcript. Each point on the plot represents a sentence, with the x-axis denoting the index of the sentence within the transcript and the y-axis indicating the sentiment score. Analyzing the distribution of sentiment scores across the transcript allows for a deeper understanding of the emotional context and narrative progression within the text. The median of the data set is denoted by the dashed green line, in agreement with both graphs above. Additionally, identifying outlier points or clusters of similarly scored sentences may highlight areas of particular significance or emotional intensity within the transcript. It appears that in the first half of the video the sentiment is largely negative or neutral, then tends to 0, and lastly makes one final negative-to-positive leap. We expect the neutral sections of the videos to contain the more useful, factual information. This may allow us to further focus our efforts on a particular section of the video for our summary content.