Hello World example

The dataset in this module is used for malicious URL detection.

The dataset has only one feature attribute, the URL itself; the code below reads a second column as the label to predict.

Copy and paste the following link into your browser to open Google Colab:

https://colab.research.google.com/notebooks/welcome.ipynb

Then click File --> New notebook

Click the red box area on the page and change the file name to Logistic Regression for malicious URL detection.ipynb

Next, click Runtime, choose Change runtime type, and set Hardware accelerator to GPU. This step is optional here: scikit-learn trains logistic regression on the CPU, so the GPU will not speed up this example.

In the first code cell, copy and paste the following code to upload the dataset to Google Colab:

from google.colab import files

uploaded = files.upload()

Then click the run button (it looks like a play button) to run this code cell.

After the cell executes successfully, you should see a result like the following picture. Next, click the Choose Files button and upload the malicious URL dataset (data.csv) to Google Colab.

The malicious URL detection dataset can be downloaded directly from the link below.

You need to agree to the terms and conditions, and a Gmail account is required.

https://www.kaggle.com/antonyj453/urldataset

The upload might take about 3 minutes because the dataset is fairly large.
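Once the upload finishes, it can help to confirm that the file arrived under the expected name. The sketch below assumes that files.upload() returns a dictionary mapping each uploaded file name to its contents as bytes. The helper summarize_upload and the stand-in uploaded dictionary are hypothetical names used only for illustration, so the snippet also runs outside Colab; in Colab you would pass the real uploaded variable from the first cell instead.

```python
def summarize_upload(uploaded, expected_name="data.csv"):
    """Return (found, sizes) for a dict of file name -> file contents (bytes)."""
    sizes = {name: len(content) for name, content in uploaded.items()}
    return expected_name in sizes, sizes


# Stand-in for the dictionary returned by files.upload(); the contents are made up.
uploaded = {"data.csv": b"url,label\nexample.com/index.html,good\n"}

found, sizes = summarize_upload(uploaded)
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
print("data.csv found:", found)
```

If the expected name is missing, the file may have been uploaded under a different name, and the name passed to pd.read_csv later would need to match it.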

Next, we will import the libraries we need. Create a new code cell, copy and paste the following code into it, and run it.

import pandas as pd

import numpy as np

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score

Next, create a new cell, copy and paste the following code, and run it.

The purpose of this code is to read the data from the CSV file, convert the URLs into TF-IDF features, and split the dataset into a training set and a test set. The test size is 0.20, which means that the training set is 80% of the whole dataset and the test set is 20%.

dataset = pd.read_csv("data.csv",sep = ',')

X = dataset.iloc[:,0]

y = dataset.iloc[:,1]

vectorizer = TfidfVectorizer()

X = vectorizer.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
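To make this cell less of a black box, the following standard-library sketch imitates its two main steps on a handful of made-up URLs. It assumes the vectorizer's default settings (lowercasing, and tokens of two or more word characters) and a shuffled split whose test size is rounded up. The helpers url_tokens and shuffle_split are hypothetical, not scikit-learn functions, and the sketch stops at tokenization rather than computing TF-IDF weights.

```python
import math
import random
import re

# Token pattern assumed to match the TfidfVectorizer default: 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def url_tokens(url):
    """Lowercase a URL and split it into word-like tokens."""
    return TOKEN_PATTERN.findall(url.lower())


def shuffle_split(items, test_size=0.2, seed=0):
    """Shuffle a copy of items and return (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(test_size * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]


# Made-up URLs for illustration only.
urls = [f"site{i}.example.com/page" for i in range(10)]

print(url_tokens("http://Paypal-Login.example.com/verify?id=42"))
train, test = shuffle_split(urls)
print(len(train), "training URLs,", len(test), "test URLs")
```

Punctuation such as dots, slashes, and hyphens acts as a separator under this pattern, so each URL is reduced to a bag of word-like pieces before the weights are computed.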

We will use the logistic regression algorithm to train a model on the training set and evaluate it on the test set.

Finally, we print the accuracy.

Copy and paste the following code and then run it. The result should be similar to the following picture.

classifier = LogisticRegression(random_state = 0)

classifier.fit(X_train, y_train)

accuracy = classifier.score(X_test, y_test)

print("The accuracy is: {:.2%}".format(accuracy))

As we can see, the accuracy is 96.33%. Your number may differ slightly because train_test_split shuffles the data randomly on each run.
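For a classifier, score reports the mean accuracy, that is, the fraction of test samples whose predicted label matches the true label. The sketch below reproduces that calculation and the percentage formatting on a few made-up labels. The helper accuracy_from_labels is hypothetical, and the label values are invented for illustration rather than taken from the dataset.

```python
def accuracy_from_labels(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up labels, not taken from the dataset.
y_true = ["good", "bad", "good", "good"]
y_pred = ["good", "bad", "bad", "good"]

accuracy = accuracy_from_labels(y_true, y_pred)
print("The accuracy is: {:.2%}".format(accuracy))  # prints "The accuracy is: 75.00%"
```

Three of the four predictions match, so the accuracy is 0.75, which the format string renders as a percentage with two decimal places.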
