Zero-Cost Custom Feeds on Bluesky

A simple stack for generating custom feeds for Bluesky programmatically without a backend server
Author

Amit Chaudhary

Published

December 1, 2024

Background

I recently built a custom feed on Bluesky to capture the latest discussions on pre-prints from arxiv.org and research papers from conferences like ACL. It was inspired by this Bluesky post from a researcher requesting such a feed.

While there are drag-and-drop custom feed generators like Skyfeed, you are limited to regular expressions for the filtering. If you use a regex pattern to capture all ‘arxiv.org’ links on Skyfeed, it will yield a bunch of false positives with papers from non-ML fields like Quantum Physics, Economics, and so on.

Though it's possible to instead build and host the custom feed from scratch ourselves, since Bluesky's protocol is open and provides programmatic access, it would be costly to run a server 24/7, especially if a large number of people subscribe to our custom feed.

As such, I thought of a nice alternative that circumvents the need to run a server by leveraging how the Bluesky protocol works with custom feeds. The Bluesky app only makes GET requests to the feed server to fetch a JSON list of post IDs. So, in theory, we can host those endpoints as static files containing the data the app expects, instead of running a backend server via Flask / FastAPI.
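
For example, the feed "skeleton" the app fetches is just a JSON object listing post URIs (the URIs below are made-up examples):

{
  "feed": [
    { "post": "at://did:plc:abc123/app.bsky.feed.post/3kexample1" },
    { "post": "at://did:plc:abc123/app.bsky.feed.post/3kexample2" }
  ]
}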

I implemented this idea and it works perfectly. We offload the initial filtering to Skyfeed, use GitHub Actions for periodic feed generation, filtering, and ranking, and host the resulting JSON files on a static site with Cloudflare Pages. This removes the need to run a backend server at all, so you can launch a custom feed 100% free.

The feed can be easily added to your homepage here: https://bsky.app/profile/amitness.com/feed/arxiv-feed

High-level Overview

We first use Skyfeed to filter the entire network of posts on Bluesky with a regular expression, keeping posts that link to arxiv.org papers.

Then, the resulting feed is filtered further in Python using Bluesky's atproto library. We iterate over each post and check whether the linked paper belongs to the arXiv categories for Machine Learning, NLP, or Computer Vision via the pyarxiv library. From the filtered list of papers, we generate the JSON format Bluesky expects for feeds and push it to Cloudflare Pages as a static site.

When the feed is loaded on the Bluesky app, the app will make a request to our static page on Cloudflare and get a list of the post IDs as a JSON response. The app will parse each post ID, render it in the app, and display the feed. This runs super quick.

Implementation

1. Clone the code locally

The code for the concept described above has been implemented at https://github.com/amitness/bluesky-arxiv.

First, make a fork of my repo from https://github.com/amitness/bluesky-arxiv and then clone your repo locally.

# Replace with the link to your repo
git clone git@github.com:amitness/bluesky-arxiv.git

Install the required libraries via the requirements.txt file in your virtual environment.

pip install -r requirements.txt

2. Set up Cloudflare Pages

We will need a Cloudflare Pages site to host the data in the format Bluesky expects.

You can create an account on Cloudflare Pages. Once the account is created, go to Workers and Pages > Overview from the left sidebar on the dashboard.

You should see two tabs: Workers and Pages. Click the Pages tab.

Then, click the “Upload Assets” button.

Then enter a name for the project. Cloudflare will provide you a unique domain based on it. Click Create Project.

You will be shown a page that allows you to upload a zip file or a folder. At this stage, just upload any folder from your device with at least one file in it. Once you're done, click Deploy site.

Once the site is deployed, you should see a message below with the URL of your domain.

In the repo that you cloned locally, change the SERVICE_DOMAIN variable in the config.py file to the domain you got from Cloudflare above.

config.py
# Domain provided by Cloudflare pages
SERVICE_DOMAIN = "bluesky-1tj.pages.dev"
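
The generate_feed.py script in step 5 also references a SERVICE_DID value. This is presumably just the did:web identifier derived from the same domain; check config.py in the repo for the actual definition.

# Assumed: the did:web identifier for the domain above
SERVICE_DID = f"did:web:{SERVICE_DOMAIN}"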

3. Initialize a custom feed on Bluesky

Now, we will initialize a custom feed programmatically on Bluesky.

In the repo, you will find a config.py file. You have to change a few configurations inside it.

First, change HANDLE to your Bluesky handle.

config.py
# YOUR bluesky handle
# Ex: user.bsky.social
HANDLE: str = "amitness.com"

Then you need to generate an app password for Bluesky. It’s available at https://bsky.app/settings/app-passwords and will allow us to get programmatic access to Bluesky in Python.

You can set a name to denote what the password is going to be used for. Here I set it to custom-feed.

Then you will receive your app password. Take note of it in a safe place as you won’t be able to access it again.

Now you can set the BLUESKY_APP_PASSWORD environment variable to your password.

export BLUESKY_APP_PASSWORD=...

This will be read by the setup_feed.py script.

config.py
# YOUR bluesky password, or preferably an App Password (found in your client settings)
# Ex: abcd-1234-efgh-5678
PASSWORD = os.environ["BLUESKY_APP_PASSWORD"]

Next, you can set the name of your custom feed, its description, and its slug. Here is what I have set.

config.py
# A short name for the record that will show in urls
# Lowercase with no spaces.
# Ex: whats-hot
RECORD_NAME: str = "arxiv-feed"

# A display name for your feed
# Ex: What's Hot
DISPLAY_NAME: str = "Papers"

# (Optional) A description of your feed
# Ex: Top trending content from the whole network
DESCRIPTION: str = dedent(
    """
    Latest ML research papers and preprints from arxiv.org discussed on Bluesky.

    Logic:
    - Fetch arxiv preprints & filters out non-ML via arxiv API
    - Ranks the items using hackernews algorithm
    """
).strip()

Here is how it will show up in the Bluesky app later on.

Once everything above is set up, you can run the script.

python setup_feed.py

This will initialize our custom feed on Bluesky. If everything was set up correctly, the script will print the value of FEED_URI.

Update FEED_URI in the config.py file with this value.

config.py
# Feed URI generated by running `python setup_feed.py`
FEED_URI = "at://did:plc:bpuq5cgmyvssgi3iwsyvd4gn/app.bsky.feed.generator/arxiv-feed"

Your feed has been created and now it needs to be populated before you can start using it in the app.

4. Set up Skyfeed

In this step, we will build an initial feed using the interface of the Skyfeed app.

You can sign up on skyfeed.app using your Bluesky handle and the app password you created in the previous step.

After logging in, go to the top-right and click Create Feed to create a new feed.

You will see a bunch of options. Since our goal is to surface all posts on Bluesky from the past 24 hours that mention arxiv.org or aclanthology.org, we can set up the options as follows.

First, the Input field specifies which posts to capture. We select Entire Network and set the time window to 24 hours because we want to run a regex over all posts on Bluesky indexed in the past 24 hours. Depending on your use case, you can modify this part.

As seen below, it yields 6 million posts in the past 24 hours.

Now, we will filter those 6 million posts down to only the ones that link to either arxiv.org or aclanthology.org. This can be done with the regex below, pasted into the RegEx field. Make sure the Post Text and Link items are green, as we want to search only the post text and links.

(arxiv.org/.+)|(aclanthology.org/.+)
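
Note that the unescaped dots in this regex match any character, so it is slightly more permissive than intended. If you want to be strict, you could escape them:

(arxiv\.org/.+)|(aclanthology\.org/.+)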

Here is how it should look after everything is set up correctly.

With this setup, we can now publish the feed by clicking the Update Feed button and then clicking Publish in the popup, as shown below. This creates a feed that can now be accessed via Bluesky.

You should see the link to your published skyfeed as shown below.

Copy the portion shown above into the SKYFEED_DID variable in config.py. We will filter this feed further using Python in the next steps.

config.py
# Skyfeed path
SKYFEED_DID = "did:plc:bpuq5cgmyvssgi3iwsyvd4gn/feed/aaagg56kp5qzi"

5. Feed Generation in Python

With the above steps done, we can build out the feed generation logic. The crux of the logic lives in the generate_feed.py file. Let's understand how it works:

1. Cloudflare Page Generation

The overall flow is defined in the main function.

generate_feed.py
def main():
    did_data = {
        "@context": ["https://www.w3.org/ns/did/v1"],
        "id": f"did:web:{config.SERVICE_DOMAIN}",
        "service": [
            {
                "id": "#bsky_fg",
                "type": "BskyFeedGenerator",
                "serviceEndpoint": f"https://{config.SERVICE_DOMAIN}",
            }
        ],
    }
    write_json(did_data, "./_site/.well-known/did.json")

    feed_generator_data = {
        "encoding": "application/json",
        "body": {"did": config.SERVICE_DID, "feeds": [{"uri": config.FEED_URI}]},
    }

    write_json(feed_generator_data, "./_site/xrpc/app.bsky.feed.describeFeedGenerator")

This part of the code generates metadata JSON that Bluesky will request from our Cloudflare Pages site at the following paths: /.well-known/did.json and /xrpc/app.bsky.feed.describeFeedGenerator.
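
The write_json helper is not shown in the snippet above; a minimal version (assuming it only needs to create the parent directories and dump the dictionary) could look like this:

import json
from pathlib import Path


def write_json(data: dict, path: str) -> None:
    # Create parent folders like _site/.well-known/ if they don't exist yet
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w") as f:
        json.dump(data, f, indent=2)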

2. Filtering Posts

The main logic lies in the code below, which generates the data for the endpoint that contains all the post IDs that should be rendered in the feed.

generate_feed.py
# Fetch latest posts and prepare data in the format expected by the Bluesky protocol
post_uris = fetch_latest_posts()

feed_skeleton = {"feed": [{"post": uri} for uri in post_uris]}
write_json(feed_skeleton, "./_site/xrpc/app.bsky.feed.getFeedSkeleton")

This generates the endpoint that returns the post IDs to be rendered in our custom feed (e.g. https://bluesky-1tj.pages.dev/xrpc/app.bsky.feed.getFeedSkeleton).

The main logic for the feed filtering is defined in the fetch_latest_posts() function in the generate_feed.py file.

generate_feed.py
def fetch_latest_posts():
    client = Client()
    client.login(config.HANDLE, config.PASSWORD)

    data = client.app.bsky.feed.get_feed(
        {
            "feed": config.SKYFEED_PATH,  # (1)
            "limit": 100,
        },
        timeout=100,
    )

    feed = data.feed
    for _ in range(2):
        data = client.app.bsky.feed.get_feed(
            {"feed": config.SKYFEED_PATH, "limit": 100, "cursor": data.cursor},  # (2)
            timeout=200,
        )
        feed.extend(data.feed)

    bool_filter = thread_map(filter_item, feed)  # (3)
    filtered_feed = compress(feed, bool_filter)
    sorted_feed = rank_posts(filtered_feed)  # (4)
    post_uris = [item.post.uri for item in sorted_feed]
    return post_uris
1. We fetch the feed from the Skyfeed custom feed we generated in the earlier step.
2. A cursor is used to paginate and fetch an additional 200 items from that feed.
3. The items are then filtered using the filter_item function, which checks whether the links in each post are indeed CS arXiv papers. We use thread_map to parallelize this (see the sketch after this list).
4. We re-rank the filtered items using the Hacker News algorithm.
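
The filter_item function itself is not shown here. As a rough sketch of the idea, here is one way it could work, querying the public arXiv export API via requests instead of the repo's actual helper; the category whitelist, the URL pattern, and the choice to only scan the post text are all assumptions for illustration:

import re

import requests

# Assumed whitelist of ML-related arXiv categories
ALLOWED_CATEGORIES = {"cs.LG", "cs.CL", "cs.CV", "stat.ML"}


def filter_item(item) -> bool:
    # Look for an arXiv ID in the post text (simplification: ignores link facets/embeds)
    match = re.search(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})", item.post.record.text)
    if not match:
        # Keep non-arXiv posts (e.g. aclanthology.org links) untouched
        return True
    arxiv_id = match.group(1)
    # Ask the arXiv API for the paper's categories
    response = requests.get(
        "http://export.arxiv.org/api/query",
        params={"id_list": arxiv_id},
        timeout=30,
    )
    categories = set(re.findall(r'<category term="([^"]+)"', response.text))
    return bool(categories & ALLOWED_CATEGORIES)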

3. Re-ranking with the Hacker News score

The re-ranking of the posts is defined in the rank_posts function. I made use of the Hacker News algorithm, which is quite simple. We compute the points for a post as the sum of its likes, quotes, replies, and reposts. That score is then decayed by the number of hours since the post was created, specifically score = points / (hours + 2)^gravity, so older items gradually sink; posts older than 12 hours get a score of 0. This balances popular versus recent research papers.

generate_feed.py
def hackernews_score(item, gravity: float = 2.5):
    hours_passed = (
        datetime.now(timezone.utc) - parse_date(item.post.indexed_at)
    ).total_seconds() / 3600
    if hours_passed >= 12:
        return 0
    else:
        points = (
            item.post.like_count
            + item.post.quote_count
            + item.post.reply_count
            + item.post.repost_count
        )
        score = points / ((hours_passed + 2) ** gravity)
        return score


def rank_posts(feed):
    return sorted(feed, key=hackernews_score, reverse=True)

6. Running periodically via GitHub Actions

To run our script periodically for free, we can leverage GitHub Actions. This will fetch the feed from Skyfeed, perform the filtering and re-ranking, and push the resulting data to Cloudflare Pages every 30 minutes.

The schedule for the cron job is defined in the build_and_deploy.yml file and can be modified there as needed.

.github/workflows/build_and_deploy.yml
name: Build and deploy site to cloudflare

on:
  push:
    branches:
 - main
  schedule:
 - cron: '*/30 * * * *'

Crontab.guru is a great website to visualize what the cron syntax does.

To enable the actions in your forked GitHub repo, go to “Settings > Secrets and variables > Actions”, click “New Repository Secret”, and add these three secrets one by one (or use the GitHub CLI, as shown after the list):

  • BLUESKY_APP_PASSWORD
  • CLOUDFLARE_ACCOUNT_ID
  • CLOUDFLARE_API_TOKEN
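
If you prefer the command line, the GitHub CLI can set the same secrets from inside your local clone (it will prompt for each value):

gh secret set BLUESKY_APP_PASSWORD
gh secret set CLOUDFLARE_ACCOUNT_ID
gh secret set CLOUDFLARE_API_TOKEN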

You can get your “CLOUDFLARE_ACCOUNT_ID” by logging in to Cloudflare Pages and then getting the value from the right sidebar as shown below.

To get the CLOUDFLARE_API_TOKEN, create a new token from https://dash.cloudflare.com/profile/api-tokens as shown below.

Once all three secret variables have been set up, you can enable GitHub Actions in your forked repo as shown below.

The action should now run automatically every 30 minutes: it will fetch the latest posts from Skyfeed, perform the filtering, generate the final set of posts to be displayed on Bluesky, and deploy the result to Cloudflare.

7. Access your feed

Your feed will now be listed on your profile and at bsky.app/feeds, and it can be pinned to your homepage as well.

You can find the link in the address bar when the feed is open and share it with others.

Conclusion

Thus, we saw an approach for making a custom feed on Bluesky with a combination of Skyfeed, GitHub Actions, and Cloudflare Pages, without running a backend server.

While we built it to get a feed of Arxiv papers, you can extend the same approach to do a bunch of useful stuff. You could integrate lightweight classifiers to classify/re-rank posts for relevance to your interests or even filter out toxic posts from your feed.

You can also skip Skyfeed as the initial source and instead read from the firehose or one of your existing feeds directly using atproto, handling the indexing with a small SQLite database or a JSON file committed to GitHub by the action (a minimal firehose sketch is shown below).
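
For reference, a minimal starting point for consuming the firehose with the atproto SDK might look like the sketch below; it only parses each commit event, and decoding the actual post records (and storing them) is left out. See the SDK's firehose examples for the full pattern.

from atproto import FirehoseSubscribeReposClient, models, parse_subscribe_repos_message

client = FirehoseSubscribeReposClient()


def on_message(message) -> None:
    # Decode the raw firehose frame into a typed event
    event = parse_subscribe_repos_message(message)
    # Only repo commits carry new records (posts, likes, reposts, ...)
    if isinstance(event, models.ComAtprotoSyncSubscribeRepos.Commit):
        print(event.repo, [op.path for op in event.ops])


# Blocks and streams events until interrupted
client.start(on_message)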

References