1. To get started, navigate to your Strigo lab in your browser. At the command prompt, run the script “get_elastic_password”. This prints the generated password for your Elastic cluster.
2. Now switch to the “Elastic” tab in Strigo. We might also have to “Reload this view”…
3. Once the page loads you should see a login prompt. Use “elastic” for the user and paste the password from the terminal window into the password field.
4. Once logged in you’ll be greeted by a message about adding integrations. We will be skipping this step as we do not need any integrations for this lab.
Click on “Explore on my own”.
5. On the resulting page, click on “Enterprise Search”.
6. In the middle of the next page click “Create an Elasticsearch Index”.
7. On the next screen choose “Use a Web Crawler”.
8. Be sure to name it “elastic-docs”, then click “Create Index”.
🤓 🤞 Naming it elastic-docs is important because the code for the next lab will reference this index by name.
9. Near the top of the screen select “Pipelines”. Pipelines in this context refer to inference pipelines. These are different from the ingest pipelines that process data before it is indexed; inference pipelines become processors within ingest pipelines.
10. Click on “Copy and customize” under Ingest Pipelines.
11. Click on “Add Inference Pipeline” in the Machine Learning Inference Pipelines box.
12. Enter “title-vector” for the name, then select the “Dense Text Embedding” model, which came preloaded into your cluster.
Elastic allows you to import and use multiple transformer models for different use cases. At the bottom, click “Continue”.
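If you’re curious which trained models are already loaded, you can list them later from Dev Tools (introduced in step 15). A minimal check; the exact model IDs vary by cluster:

# List the trained models loaded into the cluster
GET _ml/trained_models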
13. On the next screen, enter “title” for the Source Field. Leave the Target Field blank, then click “Continue” at the bottom.
Here we are telling the pipeline which field the transformer model should vectorize.
14. Click “Continue” again to skip the optional test of the model, then click “Create Pipeline”.
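To see how the inference pipeline gets nested into the index’s ingest pipeline, you can later inspect it in Dev Tools. The name pattern below is an assumption (managed pipelines are typically named after the index):

# Managed pipelines are usually named after the index (an assumption here);
# the inference pipeline appears as a processor inside the ingest pipeline.
GET _ingest/pipeline/search-elastic-docs*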
15. Now that the pipeline is created, we need to make an adjustment to the vector dimensions.
On the left-hand menu, select “Dev Tools” in the Management section.
16. Paste the code below into the console to tell Elastic that we’re going to use 768 dimensions.
Higher dimension counts (such as 2048) are supported, but the dims value must match the output of the embedding model, and larger vectors incur additional resource cost during ingest processing.
POST search-elastic-docs/_mapping
{
  "properties": {
    "title-vector": {
      "type": "dense_vector",
      "dims": 768,
      "index": true,
      "similarity": "dot_product"
    }
  }
}
17. Check for the following response on the right side of the screen:
{
  "acknowledged": true
}
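As an optional double check, the field mapping API will show the new dense_vector mapping:

# Confirm title-vector is a 768-dimension dense_vector field
GET search-elastic-docs/_mapping/field/title-vector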
Now we need to add an additional pipeline to compare vectorization with Elastic’s ELSER model.
18. Navigate back to Enterprise Search.
19. Click on “Indices” under Overview.
20. In the list of indices, click on “search-elastic-docs”.
Notice that we entered “elastic-docs” for the index name earlier; however, we reference it as “search-elastic-docs” here.
This is because Enterprise Search prefixes search indices with “search-”.
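You can also see the real index name from Dev Tools:

# Search-optimized indices carry the "search-" prefix
GET _cat/indices/search-*?v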
21. Near the top of the next screen click on “Pipelines”...
22. In the Machine Learning Inference Pipelines section we’ll add another pipeline, just as we did for the dense vectors.
Click on “Add Inference Pipeline”…
23. On this screen we will enter similar information as before, with a few adjustments.
Start by choosing “New Pipeline”, then set the name to “title-elser”.
Under Models we’ll choose “ELSER Text Expansion”.
Then click “Continue” at the bottom of the page.
24. On the next screen we’ll add a mapping.
In the list of source fields, select “title”.
Then click “Add” to the right.
Notice that the target field is named automatically.
25. At the bottom, click “Continue”.
We’ll skip testing the model for now, so click “Continue” again.
26. On the review page click “Create Pipeline”.
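If you want a peek at what ELSER’s text expansion produces before any documents exist, you can call the model directly in Dev Tools. This is a hedged sketch: the model ID “.elser_model_1” is an assumption, so check GET _ml/trained_models for the actual ID in your cluster.

# Expand sample text into weighted tokens with ELSER.
# The model ID ".elser_model_1" is an assumption; yours may differ.
POST _ml/trained_models/.elser_model_1/_infer
{
  "docs": [
    { "text_field": "How do I take a snapshot of my cluster?" }
  ]
}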
27. Now let's configure the crawler to capture the Elastic documentation.
On the navigation menu to the left, select Enterprise Search -> Overview
28. Under Content click on “Indices”.
29. Under “Available Indices” click on “search-elastic-docs”.
30. Click on the “Manage Domains” tab and enter “https://www.elastic.co/guide/en”.
Then click “Validate Domain”.
This checks that the domain we want to index is reachable and doesn’t have any limitations, such as a restrictive robots.txt file.
31. You’ll get a warning about robots.txt. This can be ignored.
32. After the checks complete click “Add Domain”.
33. Then click “Crawl Rules” and add the following rules one at a time.
These rules ensure that we don’t index data we don’t need or that won’t help our use case.
Rules can use different match formats and are ordered to apply specific logic.
Note: If you need to reorder the rules, click on the “=” icon and drag up or down until the order is correct.
34. Now scroll to the top of the page and click on the blue button titled “Crawl”.
Then select “Crawl all domains on this index”.
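The crawl runs in the background and takes a while. One quick way to watch documents arrive is a count query in Dev Tools:

# Count the documents the crawler has indexed so far
GET search-elastic-docs/_count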
Lab 1 is complete.
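Optionally, once the crawl has indexed some documents, you can compare the two pipelines from Dev Tools. The sketch below assumes the dense model ID is “sentence-transformers__all-distilroberta-v1” and that the wizard auto-named the ELSER target field “ml.inference.title_expanded”; verify both in your cluster before running.

# Dense-vector (kNN) search using the title-vector field.
# The model_id is an assumption; list models with GET _ml/trained_models.
GET search-elastic-docs/_search
{
  "knn": {
    "field": "title-vector",
    "k": 5,
    "num_candidates": 50,
    "query_vector_builder": {
      "text_embedding": {
        "model_id": "sentence-transformers__all-distilroberta-v1",
        "model_text": "how do I take a snapshot"
      }
    }
  }
}

# ELSER text expansion search. The target field and model_id are
# assumptions based on the wizard's auto-naming.
GET search-elastic-docs/_search
{
  "query": {
    "text_expansion": {
      "ml.inference.title_expanded.predicted_value": {
        "model_id": ".elser_model_1",
        "model_text": "how do I take a snapshot"
      }
    }
  }
}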