Today I was watching a video on LLM evaluation: https://www.youtube.com/watch?v=SnbGD677_u0. It is a long video (2.5 hours) with multiple sections, each covered by a different speaker. In one of the sections, the speaker walked through code in Jupyter notebooks. Because of the small font and the pace at which the speaker was talking, the section was hard to follow.
I wondered whether I could use yt-dlp along with an LLM to solve this problem. This is what I wanted to do:
- Download the specific section of the video
- Take screenshots of different frames in that section
- Send the screenshots to an LLM to extract the code
- Ask the LLM to explain the code step by step
The first two steps can be done using the following commands.
```bash
yt-dlp https://www.youtube.com/watch?v=SnbGD677_u0 -o out.mp4
ffmpeg -ss 1:13:00 -to 1:33:00 -i out.mp4 -vf fps=1/60 img%03d.jpg
```
The first command downloads the complete video. The second then captures one frame every 60 seconds from the 1:13:00 to 1:33:00 section. You can change fps to 1/30 to capture one frame every 30 seconds.
After executing the above two commands, you will have 20 images in your current directory.
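If you prefer to script these two steps, here is a minimal Python sketch that shells out to the same yt-dlp and ffmpeg commands. It assumes both tools are on your PATH; the timestamps and fps are simply the values used above.

```python
import subprocess

VIDEO_URL = "https://www.youtube.com/watch?v=SnbGD677_u0"

# Step 1: download the full video as out.mp4
subprocess.run(["yt-dlp", VIDEO_URL, "-o", "out.mp4"], check=True)

# Step 2: capture one frame per minute from the 1:13:00 to 1:33:00 section
subprocess.run(
    [
        "ffmpeg",
        "-ss", "1:13:00",   # section start
        "-to", "1:33:00",   # section end
        "-i", "out.mp4",
        "-vf", "fps=1/60",  # one frame every 60 seconds
        "img%03d.jpg",
    ],
    check=True,
)
```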
Now I needed to extract the code related to dataset preparation, which was in images img005.jpg to img008.jpg. For this I used SGPT, a command-line tool similar to llm. The llm tool does not support images, so I had to use SGPT.
We will loop over these four images and accumulate the extracted code in a single variable:
output=""
for i in {5..8}; do
img="img00${i}.jpg"
result=$(sgpt -m "gpt-4o" -t 0 -s 2000 -i "$img" "Extract the Python code from screenshots. Make sure code is extracted as is. Do not extract Jupyter notebook cells output.")
output="${output}${result}"
done
It did make some mistakes while extracting the code, so you will have to check the output for correctness and fix it if required.
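If you would rather stay in Python than shell out to sgpt, a rough equivalent of the extraction loop using the OpenAI Python SDK could look like the sketch below. This is my own sketch, not what sgpt does internally; it assumes the openai package is installed and OPENAI_API_KEY is set.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Extract the Python code from the screenshot. Make sure code is "
    "extracted as is. Do not extract Jupyter notebook cells output."
)

def extract_code(image_path: str) -> str:
    # Encode the screenshot as a base64 data URL for the vision model
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": PROMPT},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                    },
                ],
            }
        ],
    )
    return response.choices[0].message.content

# img005.jpg to img008.jpg hold the dataset preparation code
output = "\n".join(extract_code(f"img{i:03d}.jpg") for i in range(5, 9))
```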
Next, we will send the combined output back to the LLM and ask it to explain the code:
sgpt -m "gpt-4o" -t 0 -s 2000 "Explain the below Python code ${output}"
The output is shown below.
Sure, I'd be happy to explain the provided Python code snippets. Here’s a detailed breakdown:
### Part 1
```python
pd.set_option('display.max_colwidth', 1000)
```
This line sets an option in the pandas library to display a maximum column width of 1000 characters. It helps when you have large text data in your dataframe, ensuring that the entire content of a cell can be displayed without truncation.
### Part 2: Loading and Preparing FIB Dataset
```python
# Prepare FIB data
# FIB contains one-sentence summaries on CNN/DM & XSUM news articles.
# Note: We exclude the CNN/Daily Mail data as it is pretty bad.
fib_ds = load_dataset('t-feb/fib', split='test')
fib_df = fib_ds.to_pandas()
logger.info(f'No of rows in FIB: {len(fib_df)}')
```
1. **Comment**: Explains what the FIB dataset is and notes that CNN/Daily Mail data is excluded due to poor quality.
2. **Loading Dataset**: The `load_dataset` function is used to load the FIB dataset (test split) from the `t-feb/fib` configuration.
3. **Conversion to DataFrame**: Converts the loaded dataset into a pandas DataFrame using `to_pandas()`.
4. **Logging**: Logs the number of rows in the FIB DataFrame using `logger.info`.
### Part 3: Selecting Columns
```python
# Only keep selected columns.
fib_df[['dataset' == 'xsun'
fib_df.fib.df[['dataset' == 'cm_dm', 'input', 'list_choices', 'correct_choice']].head(5)
fib_df.df[['input', 'list_choices', 'correct_choice']].head(5)
```
There seems to be a syntax error here. Assuming the intent was to filter and select specific columns, the correct code might look something like:
```python
# Only keep selected columns for the 'xsun' and 'cm_dm' datasets.
xsun_df = fib_df[fib_df['dataset'] == 'xsun'][['input', 'list_choices', 'correct_choice']].head(5)
cm_dm_df = fib_df[fib_df['dataset'] == 'cm_dm'][['input', 'list_choices', 'correct_choice']].head(5)
# Display the first 5 rows of the combined result
combined_df = fib_df[['input', 'list_choices', 'correct_choice']].head(5)
```
### Part 4: Removing Duplicates and Saving Datasets
```python
fib_train = fib_train.drop_duplicates(subset=['input', 'label'])
fib_val = fib_val.drop_duplicates(subset=['input', 'label'])
fib_test = fib_test.drop_duplicates(subset=['input', 'label'])
logger.info(f"Rows in balanced train: {len(fib_train)}, val: {len(fib_val)}, test: {len(fib_test)}")
fib_train.to_csv('/content/drive/My Drive/fib/fib-train.csv', index=False)
fib_val.to_csv('/content/drive/My Drive/fib/fib-val.csv', index=False)
fib_test.to_csv('/content/drive/My Drive/fib/fib-test.csv', index=False)
```
1. **Removing Duplicates**: Deduplicate the training, validation, and test datasets using `drop_duplicates`, based on the 'input' and 'label' columns.
2. **Logging**: Log the number of rows in each deduplicated dataset.
3. **Saving to CSV**: Save the deduplicated training, validation, and test datasets to CSV files on Google Drive.
### Part 5: Re-loading and Renaming Columns in Dataset
```python
# Test loading into dataset
fib_files = {
'train': '/content/drive/My Drive/fib/fib-train.csv',
'val': '/content/drive/My Drive/fib/fib-val.csv',
'test': '/content/drive/My Drive/fib/fib-test.csv'
}
fib_ds = load_dataset('csv', data_files=fib_files)
fib_ds = fib_ds.rename_column('list_choices', 'hypothesis')
fib_ds = fib_ds.rename_column('premise', 'premise_column('list_choices', 'hypothesis').rename_column('label', 'label')
logger.info(f'Label distribution - Train: {Counter(fib_ds["train"]["label"])}, Val: {Counter(fib_ds["val"]["label"])}, Test: {Counter(fib_ds["test"]["label"])}')
```
1. **File Paths Definition**: Define file paths for the saved CSV files.
2. **Loading Dataset**: Reload datasets from the CSV files using `load_dataset`.
3. **Renaming Columns**:
- Renames the 'list_choices' column to 'hypothesis'.
- Appears to have a syntax error. Corrected, it might look like:
```python
fib_ds = fib_ds.rename_column('list_choices', 'hypothesis').rename_column('premise', 'premise_column').rename_column('label', 'label')
```
4. **Logging**: Log the distribution of labels in training, validation, and test datasets using `Counter`.
### Part 6: Downloading NLTK Data
```python
import nltk
nltk.download('punkt')
```
1. **Importing NLTK**: Imports the NLTK library.
2. **Downloading Dataset**: Downloads the 'punkt' tokenizer models used for tokenizing text into sentences and words.
### Part 7: Processing USB Dataset
```python
usb_train["label"] = usb_train["label"].apply(lambda x: 1 if x == 1 else 2)
usb_val["label"] = usb_val["label"].apply(lambda x: 1 if x == 1 else 2)
logger.info(f"Label distribution (train): \n{usb_train['label'].value_counts()}")
logger.info(f"Label distribution (val): \n{usb_val['label'].value_counts()}")
```
1. **Label Mapping**: Apply a lambda function to map labels. If the label is 1, it remains 1; otherwise, it changes to 2.
2. **Logging**: Log the distribution of labels in the training and validation datasets.
### Part 8: Saving and Loading USB Dataset
```python
usb_train.to_csv('/content/drive/My Drive/usb-train.csv', index=False)
usb_val.to_csv('/content/drive/My Drive/usb-val.csv', index=False)
```
1. **Saving to CSV**: Save the training and validation datasets to CSV files on Google Drive.
```python
usb_files = {
'train': '/content/drive/My Drive/usb-train.csv',
'val': '/content/drive/My Drive/usb-val.csv'
}
usb_ds = load_dataset('csv', data_files=usb_files)
usb_ds = usb_ds.select_columns(['source', 'summary', 'label'])
usb_ds = usb_ds.rename_column('source', 'premise').rename_column('summary', 'hypothesis')
logger.info(f"Label distribution - Train: Counter(usb_ds['train']['label']), Val: {Counter(usb_ds['val']['label'])}")
```
1. **File Paths Definition**: Define file paths for the new USB training and validation CSV files.
2. **Loading Dataset**: Reload datasets from the CSV files using `load_dataset`.
3. **Selecting Columns**: Select only the 'source', 'summary', and 'label' columns.
4. **Renaming Columns**:
- Rename 'source' to 'premise'.
- Rename 'summary' to 'hypothesis'.
5. **Logging**: Log the distribution of labels in the training and validation datasets for USB.
The code provided aims to preprocess and manage two datasets (FIB and USB), ensuring the datasets are clean, properly formatted, and ready for further analysis or model training.