Advanced Text Search With Macrometa
The search is not over. As we continue with the second blog in our search patterns series we will cover prefix matching, full-text token search, and phrase and proximity search. If you missed our first blog, you may want to take a look to become more familiar with the basic search capabilities of the Macrometa Global Data Network (GDN) and exact value matching.
If you already downloaded the dataset from the first blog, you can skip the steps in the first two listings. The dataset can be downloaded from this URL. After downloading the JSON file, replace <DATA> in Listing 1 with the content of the JSON file in the following command. The dataset can be imported to your GDN federation issuing the CURL command on a terminal as shown in the Listing 1. Before executing the following CURL command you need to first create a fabric named Hotels in your GDN federation and then create a document collection called hotel_reviews within that fabric.
curl --location --request POST 'https://api-<HOST>/_fabric/Hotels/_api/import/hotel_reviews' \
--header 'accept: application/json' \
--header 'Authorization: <BEARER_TOKEN>' \
--header 'Content-Type: text/plain' \
--data-raw '{
"data": <DATA>,
"details": false,
"primaryKey": "",
"replace": false
}'
Listing 1: How to copy the content from the JSON file into the CURL command
curl --location --request POST 'https://api-<HOST>/_fabric/Hotels/_api/import/hotel_reviews' \
--header 'accept: application/json' \
--header 'Authorization: <BEARER_TOKEN>' \
--header 'Content-Type: text/plain' \
--data-raw '{
"data": [{
"Property Name": "The Savoy",
"Review Rating": 5,
"Review Title": "a legend",
"Review Text": "We stayed in May during a short family vacation. Location is perfect to explore all the London sights. Service and facilities are impeccable. The hotel staff was very nicely taking care of our kids. We'll be back for sure!",
"Location Of The Reviewer": "Oslo, Norway",
"Date Of Review": "6\/28\/2018"
}],
"details": false,
"primaryKey": "",
"replace": false
}'
Listing 2: Importing the sample dataset to GDN federation via invoking the REST API of a GDN node
In the above example in Listing 2 we have imported only a single review made for the hotel named The Savoy by specifying its JSON content. The values <HOST> and <BEARER_TOKEN> refers to the host name of the GDN node and the bearer token can be copied by referring to the REST API of the GDN node.
Implementing prefix matching
Many search scenarios can be found where you are interested in knowing all the strings which start with a particular prefix. Finding the longest matching prefix from a collection of keywords is a long-studied problem with multiple applications such as dictionary searches, computational geometry, Internet packet routing, and DNA Sequencing. Searching for strings or tokens which start with one or more substrings is accomplished via the prefix search facility of the Macrometa GDN.
Prefix matching can be performed at different levels of complexity. If you need to apply exact prefix matching then indexing strings with the identity analyzer is adequate. Let's begin by finding all the hotel names which start with the term "The". The corresponding search query can be written as shown in Listing 3.
FOR review IN sample1_view1
SEARCH ANALYZER(STARTS_WITH(review.Property_Name, "The "), "identity")
RETURN review.Property_Name
Listing 3: Find all the hotel names which starts with the term "The "
This should result in a list of 3963 records¹ as shown below.
Prefix matching can also be done considering multiple prefix terms. If you need to find all the reviews made for hotel names starting from either "The " or "Hotel " that can be accomplished using the following query. The results should indicate there are 4524 reviews¹ that satisfy these criteria.
FOR review IN sample1_view1
SEARCH ANALYZER(STARTS_WITH(review.Property_Name, "The ") OR STARTS_WITH(review.Property_Name, "Hotel "), "identity")
RETURN review.Property_Name
Listing 4: Find all the reviews having the hotel names which start with either the term "The " or "Hotel "
The following example in Listing 5 shows how prefix matching is conducted on multiple attributes. In this scenario, we are interested in finding hotel names that start with "The " and the review titles that start with "Awesome ".
FOR review IN sample1_view1
SEARCH ANALYZER(STARTS_WITH(review.Property_Name, "The ") AND STARTS_WITH(review.`Review Title`, "Awesome "), "identity")
RETURN {
Property_Name : review.Property_Name,
`Review Title` : review.`Review Title`
}
Listing 5: Find all the reviews having the hotel name start with "The " and review text start with "Awesome "
This should result in the following three reviews.
Using full-text token search
When searching strings it is highly useful to search for tokens in full-text which can occur in any order. Text Analyzers tokenize the full-text strings so that each token can get indexed separately. There are two ways to search for tokens that are called token search and phrase search. While the former is described in this section the latter is presented in the section below.
This approach searches for token which can appear in any order. The words that are searched for have to be contained in the source string. First, a text analyzer view has to be defined via invoking a CURL command as follows below.
curl --location --request POST 'https://api-<HOST>/_fabric/Hotels/_api/search/view' \
--header 'Content-Type: application/json' \
--header 'Accept: application/json' \
--header 'Authorization: <BEARER_TOKEN>' \
--data-raw '{
"name": "sample1_view8",
"links": {
"hotel_reviews": {
"analyzers": [],
"fields": {
"Review Text": {
"analyzers": [
"text_en"
]
}
}
}
},
"type": "search"
}
'
Listing 6: Defining a view for token search using the text Analyzer
Once the view is ready we can specify a token search query which searches for the occurrence of at least one of the praising words — Awesome or Excellent or Lovely — in a review text and select its review score as the result as shown in Listing 7.
FOR review IN sample1_view8
SEARCH ANALYZER(review.`Review Text` IN TOKENS("Awesome Excellent Lovely", "text_en"), "text_en")
RETURN review.`Review Rating`
Listing 7: Find how relevant the keywords of praising to the review rating of a hotel review
When executed this should list 3803 review ratings¹ as the results as shown below.
Applying phrase and proximity search
Phrase search allows you to search for phrases and nearby words in full text. You may also specify how many arbitrary tokens may occur between the defined tokens for word proximity searches. We can use the same search view defined in the previous section here as well.
Let's search for hotel review comments which say "rooms are small" and select the hotel names and their review ratings.
FOR review IN sample1_view8
SEARCH ANALYZER(PHRASE(review.`Review Text`, "rooms are small"), "text_en")
RETURN {
Property_Name: review.Property_Name,
`Review Rating`: review.`Review Rating`
}
Listing 8: Find the review comments which say "rooms are small"
This should result in 75 review comments as shown below.
The PHRASE() function allows for specifying tokens and the number of wild card tokens in alternating order. This can be effectively utilized for two words with one arbitrary word in between the two words. For example, one could search for review comments specifying the number of nights the reviewer has stayed in the hotel as follows.
FOR review IN sample1_view8
SEARCH ANALYZER(PHRASE(review.`Review Text`, "for", 1, "nights"), "text_en")
RETURN {
Property_Name: review.Property_Name,
`Review Rating`: review.`Review Rating`
}
Listing 9: Identify whether reviewer has mentioned about the number of nights the reviewer has stayed in the hotel
Execution of the above query should result in 859 results.
Now that we covered prefix matching, full-text token search, and phrase and proximity search, you can get started based on these examples. Be sure to tune in later this month as we go over range queries, faceted search, and geospatial search.
¹The GUI displays up to 1000 records. Complete results can be found via Macrometa GDN's REST API.
Photo by Johannes Mändle on Unsplash