local.html: Social Discovery by Browser-Based Crawling

#indienews

Putting HTML on the web is a lonely activity.

So what is the simplest way to get feedback on your own and your friends' websites?

The answer I came up with is a PWA that I call local.html.

The app is a feed with a built-in crawler that runs entirely in the browser.

You can explore the project here: Live demo / Source code.

The user seeds the app with URLs. Those URLs are then crawled recursively, looking for links that carry rel=friend attributes.
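Link discovery can be sketched as a small extraction pass over fetched HTML. This is a regex-based sketch for illustration only (the real app can walk a parsed DOM via DOMParser instead), and the function name is mine:

```javascript
// Sketch: pull rel=friend targets out of fetched HTML.
// Regexes keep the example self-contained; a DOM parser is
// the more robust choice inside the browser.
function extractFriendLinks(html, baseUrl) {
  const links = [];
  for (const [tag] of html.matchAll(/<a\s[^>]*>/gi)) {
    const rel = /\brel\s*=\s*["']([^"']*)["']/i.exec(tag);
    const href = /\bhref\s*=\s*["']([^"']*)["']/i.exec(tag);
    // rel is space-separated; "me friend" also counts.
    if (rel && href && rel[1].toLowerCase().split(/\s+/).includes('friend')) {
      links.push(new URL(href[1], baseUrl).href); // resolve relative URLs
    }
  }
  return links;
}
```

Each discovered URL is fed back into the crawl queue, which is what makes the discovery recursive.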

When the app encounters new content at a URL, it saves the content and displays it in the feed.

All discovered URLs are crawled periodically to detect new content.
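New-content detection on recrawl can be sketched like this, with a Map standing in for whatever persistent store the app actually uses (names are illustrative):

```javascript
// Sketch: a feed item is produced only when a URL's content
// differs from the last stored snapshot. The Map is a stand-in
// for a persistent store such as IndexedDB.
function detectUpdate(store, url, content) {
  const prev = store.get(url);
  if (prev === content) return false; // unchanged, nothing to show
  store.set(url, content); // save the new snapshot
  return true; // new content: surface it in the feed
}
```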

Each user’s graph is local and may differ depending on their seed URLs.

Conceptually, this is the inverse of Webmention. Instead of notifying a target when you link to it, you declare relationships explicitly with rel=friend, and the graph is discovered by crawling instead of push notifications.
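Concretely, a page opts into the graph with nothing more than an ordinary link; for example (hypothetical domains):

```html
<!-- Somewhere on https://me.example/ -->
<a href="https://alice.example/" rel="friend">Alice</a>
```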

Crawling the web from the browser is challenging for various reasons.

Same-Origin Policy

Cross-origin crawling is constrained by the Same-Origin Policy.

Targets that set permissive CORS headers can be fetched directly.

If no CORS headers are set, the PWA can be configured to use a proxy.
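The fallback can be sketched as follows. The proxy URL shape (a base followed by the URL-encoded target) is an assumption; real proxies vary:

```javascript
// Sketch: fetch a crawl target directly, falling back to a
// user-configured CORS proxy when the direct request is blocked.
async function fetchPage(url, proxyBase) {
  try {
    const res = await fetch(url, { mode: 'cors' });
    if (res.ok) return await res.text();
  } catch (err) {
    // A cross-origin fetch without permissive CORS headers rejects here.
  }
  if (!proxyBase) throw new Error(`cannot fetch ${url} without a proxy`);
  const res = await fetch(proxyBase + encodeURIComponent(url));
  if (!res.ok) throw new Error(`proxy fetch failed with ${res.status}`);
  return await res.text();
}
```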

JavaScript

<script> elements are stripped from the HTML before being inserted into the feed.

All <iframe> elements get a sandbox attribute.
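The two passes can be sketched like this; the regexes are for brevity, and the real app can perform the same rewrites on a parsed DOM:

```javascript
// Sketch of the sanitization passes applied before feed insertion.
function sanitize(html) {
  return html
    // Drop <script> elements and their contents.
    .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, '')
    // Force a restrictive (empty) sandbox attribute onto every iframe,
    // discarding any sandbox value the page supplied itself.
    .replace(/<iframe\b([^>]*)>/gi, (m, attrs) => {
      const cleaned = attrs.replace(/\bsandbox\s*(=\s*("[^"]*"|'[^']*'|\S+))?/gi, '');
      return `<iframe sandbox${cleaned}>`;
    });
}
```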

But there are still many opportunities to insert JavaScript in HTML.

Therefore security is dependent on the PWA being served from an origin that sets a restrictive Content-Security-Policy header.
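For illustration, a policy along these lines would keep any injected inline script from executing while still letting the feed load cross-origin media; the exact directives are an assumption and depend on the deployment:

```
Content-Security-Policy:
  default-src 'none';
  script-src 'self';
  style-src 'self' 'unsafe-inline';
  img-src *;
  media-src *;
  font-src *;
  frame-src *;
  connect-src *
```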

CSS

Inline <style> blocks are rewritten to display properly inside a web component shadow root.

body and :root selectors are rewritten to :host, and viewport-relative units (vw, vh) are rewritten to their container-relative counterparts (cqw, cqh).

Inline style attributes are left unchanged. External CSS is stripped.
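The rewrites can be sketched as a few string substitutions; this version handles only vw and vh for brevity, and a real implementation would cover the remaining viewport units and use a CSS parser rather than regexes:

```javascript
// Sketch of the selector and unit rewrites applied to inline
// <style> blocks before injection into the shadow root.
function rewriteCss(css) {
  return css
    // The embedded page's root becomes the web component host.
    .replace(/:root\b/g, ':host')
    .replace(/\bbody\b/g, ':host')
    // Viewport-relative units become container-relative units.
    .replace(/\b(\d*\.?\d+)vw\b/g, '$1cqw')
    .replace(/\b(\d*\.?\d+)vh\b/g, '$1cqh');
}
```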

CSS is still allowed to load external resources, for example images via url() or fonts via @font-face.

Storage

Resources from third-party sites must be cached to persist in the feed.

Updates to a mutable remote resource (e.g. a media file) would update all feed items referencing it, destroying history. Therefore links to external resources are rewritten, versioned and snapshotted.

This preserves history when upstream media changes.

Even with the best of efforts, storage may be evicted unpredictably on some platforms, so the feed is not guaranteed to persist.

Persistence

The feed reflects the current state of the discovered graph, not a complete historical timeline.

If a URL is updated multiple times within a crawl interval, the intermediate states are not delivered. I consider this a feature of the web.

Privacy

The PWA does not aggregate or transmit user social graphs. But crawling is observable: each fetch reveals the user's interest in a URL to the site that serves it.

There is no access control mechanism, so all content in the social network is publicly accessible.

The PWA will attempt to load external content like images, fonts or iframes from any origin.

Linked sites may set security headers that affect how their content can be embedded. Since the PWA cannot require linked sites to set permissive headers, the user may choose to configure it to use a CORS proxy.

Any CORS proxy that the PWA is configured to use is a privacy and security risk because it terminates TLS and rewrites upstream responses.

Future Work