How to Design a Recommendation System for Technical Blogs: Semi-Automatic Recommendations Based on Google Search

Compared with statistical approaches, it is often easier to recommend similar articles to users based on the content itself. There are two approaches to recommendation:

Manual Recommendations
Automatic Recommendations

(PS: I admit that this statement is essentially meaningless.)

As shown in the figure below:

What to Recommend

Manual Recommendations. In the technical field, authors are usually more knowledgeable than most readers, often knowing what the readers need. For example, if you read an article related to React, you might need content related to Redux.

Automatic Recommendations. This requires some prerequisites: integrating existing system data, obtaining some user information, and then calculating relevant content to return to the reader.

In this article, we will introduce:

Tag Generation Methods
Manual Tag-Based Recommendations
Semi-Automatic Tag Recommendations
Fully Automatic Content-Based Recommendations

Tag Generation

Articles differ significantly from the items we use daily. For example, a mobile phone has fixed specifications, such as price, screen size, RAM, internal storage, CPU, and rear and front camera resolution. From these features we can easily infer what a user might need. If a user browses a certain Pro 7 phone priced at 2880, then a Xiaomi 6 phone may suit that user better.

However, an article is a kind of unstructured data. Apart from information like the author and the writing date, it is difficult to directly describe its characteristics, making it hard to determine whether articles are similar. Therefore, we need to extract keywords, or tags, from the article to determine the categories users prefer.

For applications that use tags to recommend products to users, there are four methods for tag generation:

Manual Tags
Machine-Generated Recommendations
User-Generated Content (UGC) Tags
Hybrid Learning

Manual Tags

This means that relevant tags are manually added by the author or publisher, which is often the most reliable method. After all, authors tend to be more professional. For example, in the article “Creating an Alexa Smart Speaker Using AVS Device SDK on Raspberry Pi,” readers may not understand much beyond Raspberry Pi, while the author has tagged the keywords (tags) as avs device sdk, amazon alexa, amazon voice services, raspberry pi. In a sense, these keywords represent the characteristics of the document, from which we can infer the general content of the article.

For products, the situation is similar; they have corresponding product data, such as price, type, time, etc., at the time of listing.

Machine-Generated Recommendations

This method extracts relevant tags based on the content, title, and other information of the article. It then calculates relevance based on certain weights. For example, if keywords contained in tags are highly relevant, their weights should also be larger. If two articles have similar keywords in their tags, they may also be similar for users. This is what we will discuss later regarding ‘Content-Based Recommendations.’
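As a rough illustration of the idea (not the method used on this site), the sketch below scores candidate tags by term frequency, counts title words more heavily, and then compares two articles by the overlap of their tags. The function names, the stop-word list, and the `title_boost` value are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "on", "with", "to", "of", "for", "in", "using"}


def extract_tags(title, body, top_n=5, title_boost=3.0):
    """Return the top_n (term, weight) pairs; each title term adds title_boost."""
    def tokenize(text):
        return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]

    weights = Counter()
    for word in tokenize(body):
        weights[word] += 1.0
    for word in tokenize(title):
        weights[word] += title_boost
    return weights.most_common(top_n)


def tag_similarity(tags_a, tags_b):
    """Jaccard overlap of two tag lists: shared terms divided by all terms."""
    a = {term for term, _ in tags_a}
    b = {term for term, _ in tags_b}
    return len(a & b) / len(a | b) if a | b else 0.0
```

Two articles whose extracted tags overlap heavily get a similarity close to 1, while articles with no tags in common get 0. A production system would more likely use TF-IDF or a similar weighting so that words common to every article do not dominate.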

User-Generated Content (UGC) Tags

Products that lack descriptive content of their own rely on user-generated tags and comments. For example, movies and books on Douban depend on user-supplied tags to find similar items and recommend them to other users.

Douban Tag Example

At this point, if a new book has no user reviews yet, it may not be recommended to the right users. Such content has to deal with the ‘cold start’ side of tagging, for example by generating some initial tags. Where tags cannot be derived from content, users may be asked to select tags of interest after registration.

Zhihu Tag Example

Hybrid Learning

In the case of manual tags, if it is UGC content, users may intentionally or unintentionally add some irrelevant tags for more traffic. For example, if Tag A has more followers, it will likely attract more traffic, so articles tagged with A may receive more attention. However, if the article is not related to Tag A, it will inevitably lead to user dissatisfaction. If ordinary users can determine whether the article is relevant, it will help reduce this impact to some extent.

Similarly, as mentioned above, machine-generated tags may also encounter certain issues. Therefore, the best approach is to combine several different tagging methods.

Manual Tag-Based Recommendations: Tag Quantity Relevance

Since the Django-based CMS I use already includes a backend feature for manually recommending related articles, my idea is to filter related articles by how often certain tags are used.

In my first prototype, the approach I used was relatively primitive:

Get all tags of the article
Count all article tags
Take the tag with the highest count among the article’s tags and find blog posts with the same tag
From those blog posts, take the second most frequent tag and filter again

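from django.db.models import Count, Q

# Keyword and BlogPost are the CMS's own models, imported from the project.
# The function wrapper and its name are illustrative.
def related_posts(model, post_id):
    # Rank the post's keywords by assignment count, most used first.
    keywords_name = model.get_keywordsfield_name()
    assigned = getattr(model, keywords_name).all()
    all_keywords = Keyword.objects.filter(assignments__in=assigned)
    keywords = all_keywords.annotate(item_count=Count("assignments")).order_by('-item_count')

    # TODO: filter most popular tag
    # The two-step filter needs two keywords; keywords[1] would raise
    # IndexError on a post with only one.
    top_two = list(keywords[:2])
    if len(top_two) < 2:
        return []
    first_keyword, second_keyword = top_two

    # Posts sharing the top keyword, excluding the current post.
    first_filtered_blogposts = BlogPost.objects.published().filter(keywords__keyword__title__contains=first_keyword.title)
    first_filtered_blogposts = first_filtered_blogposts.filter(~Q(id=post_id))

    # Narrow to posts that also carry the second keyword; return up to three.
    blog_posts = first_filtered_blogposts.filter(keywords__keyword__title__contains=second_keyword.title)
    return blog_posts[:3]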
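To make the filtering logic easier to follow outside Django, here is a self-contained sketch of the same steps using plain dictionaries. The `posts` layout and the `related_by_top_tags` name are made up for illustration; the real implementation runs against the CMS's models.

```python
from collections import Counter


def related_by_top_tags(posts, post_id, limit=3):
    """posts maps a post id to its set of tags; returns ids of related posts."""
    # Count how often each tag is used across the whole site.
    site_counts = Counter(tag for tags in posts.values() for tag in tags)

    # Rank the current post's own tags by that site-wide count
    # (ties are broken alphabetically so the result is deterministic).
    ranked = sorted(posts[post_id], key=lambda tag: (-site_counts[tag], tag))
    if len(ranked) < 2:
        return []  # the two-step filter needs at least two tags

    first, second = ranked[0], ranked[1]
    # Keep other posts that carry both the most and second most frequent tag.
    matches = [
        pid for pid, tags in posts.items()
        if pid != post_id and first in tags and second in tags
    ]
    return matches[:limit]
```

Because both filters must match, a post with an unusual second tag will often get no recommendations at all, which is one reason this prototype is only a starting point.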

However, this recommendation algorithm has some issues. If there are too many articles in the same series, such as the many Vue clone-site tutorials online, users may have already mastered the material, and each additional article is worth less to them, or is as empty as feel-good ‘chicken soup’ writing. For example, on ‘What to Recommend’, a series of articles related to home assistant and raspberry pi may not differ meaningfully from one another.

Disadvantages

Within the site, this algorithm captures something specific: which tags are used most. However, it does not truly solve users’ problems. It may reflect what the site values, but it doesn’t necessarily provide value to users.

If a user arrives by searching for raspberry pi + homebridge, they can indeed be shown some related articles, but an article matching raspberry pi alexa gpio may be what they would actually prefer.

What to Recommend – User Search Results

In this case, recommendations made by editors may be more accurate. Unlike products, manual recommendations for articles often reflect content that readers may find valuable. This way, we can retain users and get a more favorable user behavior flow.

Google Analytics Behavior Flow Example

The above figure shows the user behavior flow of ‘What to Recommend’:

Starting page: In 387 sessions, 260 users left midway
First interaction: In 127 sessions, 47 users left midway
Second interaction: In 80 sessions, 31 users left midway

This means that about 33% of sessions (127 of 387) continued to another page after the starting page, and about 21% (80 of 387) continued to a further page after that. If we can improve the recommendation system so that first interactions reach 50%, that would be a considerable amount of traffic.
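The percentages come straight from the session counts above. As a quick check, with each rate measured against the 387 starting sessions:

```python
# Session counts read from the behavior flow figure.
starting, first_interaction, second_interaction = 387, 127, 80

first_rate = first_interaction / starting     # share of sessions reaching a second page
second_rate = second_interaction / starting   # share of sessions reaching a third page

# Extra first interactions needed if the first rate rose to the 50% target.
extra_at_target = 0.5 * starting - first_interaction

print(f"{first_rate:.1%} {second_rate:.1%} {extra_at_target:.1f}")
# prints: 32.8% 20.7% 66.5
```

In other words, reaching the 50% target would mean roughly 66 more first interactions for the same number of starting sessions.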

To improve the accuracy of our algorithm, we need an additional element: weights, and therefore a weighted calculation method. For articles, a simple weighting method is to count the keywords in the title. However, I strongly suspect that this would not be effective in practice: a single keyword may be valuable to the site itself but not necessarily to users.

Semi-Automatic Tag Recommendations: Optimizing Based on Google Search Weight

While using Google Analytics, I suddenly thought I could use Google Search Console to obtain keywords that users searched for. That is:

Google Search Result Example

The table below shows the corresponding position, click-through rate, impressions, and other information in Google Search Console:

Queries	Clicks	Impressions	CTR	Position
homebridge-miio	7	28	25%	8.2
home assistant broadlink	4	10	40%	15
amazon echo raspberry pi	3	10	30%	5.0
raspberry pi homebridge	2	6	33.33%	7.7
raspberry pi alexa gpio	2	4	50%	10
nodemcu homekit	2	3	66.67%	13
arduino homekit	1	3	33.33%	9.7
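A Search Console export with these columns is easy to load with Python's standard `csv` module. The sketch below assumes the CSV header matches the table above (`Queries`, `Clicks`, `Impressions`, `CTR`, `Position`) and uses three of its rows as inline sample data; the `parse_queries` name and the dictionary layout are illustrative, and a real export may label its columns differently.

```python
import csv
import io

# Three rows copied from the table above, in an assumed CSV layout.
SAMPLE_CSV = """Queries,Clicks,Impressions,CTR,Position
homebridge-miio,7,28,25%,8.2
home assistant broadlink,4,10,40%,15
raspberry pi homebridge,2,6,33.33%,7.7
"""


def parse_queries(csv_text):
    """Turn export rows into dicts with numeric fields."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            "query": row["Queries"],
            "clicks": int(row["Clicks"]),
            "impressions": int(row["Impressions"]),
            "ctr": float(row["CTR"].rstrip("%")) / 100,  # "25%" -> 0.25
            "position": float(row["Position"]),
        })
    return rows


queries = parse_queries(SAMPLE_CSV)
```

Each resulting dictionary can then be saved through whatever model or table the site uses to store search queries.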

As professional programmers, when we search for content we all use a machine-oriented ‘keyword’ interface. As professional MD programmers and SEO experts, we should likewise put keywords in our article titles.

For example, corresponding to the first search result homebridge-miio, its title is ‘Homekit + Siri Control Xiaomi Socket: Based on HomeBridge and homebridge-miio’; similarly, when users search for home assistant broadlink on Google, its corresponding article title is ‘Raspberry Pi + Home Assistant Smart Home (Part 2): Universal Remote Broadlink RM Pro Infrared Control of All Appliances’, and so on.

This, then, is the truly ‘valuable’ weight.

Updating Weights

Then I downloaded the CSV, created a new model, and imported it into the database. After that, I created a simple weight algorithm:

First Keyword = Keyword Frequency * 0.25 + Keyword Query Frequency * 0.75

The code is as follows:

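    # Blend tag frequency with Search Console clicks for each keyword.
    # Note: the 0.75 / 0.25 weights here are the reverse of the formula above.
    for index, keyword in enumerate(keywords):
        related_queries = Query.objects.filter(queries__contains=keyword.title)
        keyword.item_count *= 0.75

        # Add a share of the clicks from every search query that mentions this keyword.
        for query in related_queries:
            keyword.item_count += query.clicks * 0.25

        # Promote this keyword if its weighted score beats the previous one.
        if index > 1 and keyword.item_count > keywords[index - 1].item_count:
            top_rank_keyword = keyword.title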

In the code, the second keyword is still chosen by frequency; if it is the same as the first keyword, the next most frequent keyword is selected instead.
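For readers who want to experiment without a database, the weighting can be sketched in plain Python. This version follows the formula as written above (0.25 for tag frequency, 0.75 for search clicks). The `weighted_keywords` name, the input shapes, and the tag counts are assumptions for this example; only the click counts are taken from the Search Console table.

```python
def weighted_keywords(tag_counts, query_clicks, tag_weight=0.25, query_weight=0.75):
    """Rank tags by tag_weight * tag frequency + query_weight * matching clicks.

    tag_counts maps a tag to how many posts use it; query_clicks maps a
    search query to its click count.
    """
    scores = {}
    for tag, count in tag_counts.items():
        # Sum clicks from every query whose text contains the tag.
        clicks = sum(c for q, c in query_clicks.items() if tag in q)
        scores[tag] = tag_weight * count + query_weight * clicks
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranked = weighted_keywords(
    {"raspberry pi": 10, "homebridge": 4, "arduino": 6},  # made-up tag counts
    {"homebridge-miio": 7, "raspberry pi homebridge": 2, "arduino homekit": 1},
)
```

With these made-up tag counts, `homebridge` outranks `raspberry pi` even though fewer posts carry it, because it attracts more search clicks.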

Since the search data from Google Search Console is quite meaningful, I also added a related-search lookup. When users search for Raspberry Pi, might they also want to see Arduino? The site can then display search topics that users may be interested in on the right side of the page.
