Quite. I have a message in my mh drafts folder called "HTTP
Caching: rules and Heuristics" that I can't ever seem to finish
and send out.
I'm trying to come up with some Larch[1] traits to formally
specify correct behaviour of information retrieval on the web.
Here's a snapshot of some working notes:
The idea is to capture the rules in Larch, ala:
WebArch: trait
includes
PartialOrder(Time for T)
introduces
rep: URI, Entity, Time, Time -> Bool
% for rep(r, e, t1, t2), read
% "from t1 to t2, e is a representation of r"
http: Request, Message -> Bool
addr: Request -> URI
date: Request -> Time
date: Message -> Time
ent: Message -> Entity
lmod: Message -> Time
exp: Message -> Time
asserts
forall r: URI, e: Entity, t1, t2, t3, t4: Time
rep(r, e, t1, t4) /\ t1 <= t2 /\ t2 <= t3 /\ t3 <= t4 =>
rep(r, e, t2, t3)
http(q, a) <=> rep(addr(q), ent(a), lmod(a), exp(a))
/\ lmod(a) <= date(q) /\ date(q) <= exp(a)
Then consider a protocol interaction like:
C1 -> P: GET http://S/path q1
P -> S: GET /path q2
S -> P: 200 ok a2
Date: tsresp
Last-Modified: tmod
Expires: texp rep('http://S/path', e, tmod, texp)
P -> C1: 200 ok a1
Forwarded: tPresp
Date: tsresp
Last-Modified: tmod
Expires: texp
C2 -> P: GET http://S/path q3
Try to prove that the proxy can return ent(a2) to C2.
1 thru 6 are the assumptions, 7 is the proof goal:
1. http(q2, a2) % origin server says so
2. addr(q3) = addr(q2) % proxy requires this
3. date(q3) < exp(a2) % proxy checks this
3' date(q3) > lmod(a2) % proxy checks this
4. ent(a3) = ent(a2) % proxy builds reply matching a2
5. lmod(a3) = lmod(a2)
6. exp(a3) = exp(a2)
7. Show http(q3, a3)
8. rep(addr(q2), ent(a2), lmod(a2), exp(a2)) 1, http I
9. rep(addr(q3), ent(a3), lmod(a3), exp(a3)) 8, 2, 4, 5, 6
10. lmod(a3) <= date(q3) /\ date(q3) < exp(a3) 3, 3', 5, 6
11. http(q3, a3) 9, 10, defn http
The trouble is that this keeps falling lower and lower on my
to-do list. :-(
Dan
p.s. note that format negotiation isn't addressed above. I'm
beginning to think that a server can't be bound to return
"the best" representation of a resource -- it just won't
scale. Rather, if a resource has multiple representations,
it must return one, and notify the client that others are
available.
And rather than asking "does the proxy return
exactly what the origin server would return?" we should ask
"does the proxy return something that the origin server
_could_ legally have returned?"
The net effect is that in the case of:
C1 prefers text to html, queries S for U via P, gets text D1
P caches U -> text D1
C2 prefers html to text, queries S for U via P, gets text D1
(because P had it cached)
P is acting correctly, provided that S has told P that the
html version is available (via URI: or Link: headers) and that P
passes this info along to C1 and C2. If the client doesn't like
what it gets, it can ask for the other version with "Pragma: no-cache"
in a subsequent transaction. The idea is that the proxy would
cache "the most popular" representation. The needs of the many...
Another idea that struck me is that the lack of an expires header
should be interpreted not as "it expires now" but "it expires
sometime soon, and clients can guess heuristically." In other
words, the onus is on the origin server to send out Expires:
headers; else proxies/clients are free to cache for some time.
The last idea I'll add here before I go is an answer to
"what does the Expires: header mean?" In the case of HTTP,
we're not so much concerned with when the data in the entity
becomes uninteresting, like a part announcement or a 3-month
sale. Rather, we're concerned with the binding between
a URL and an entity.
So in response to a query for U, an Expires header giving time t
says "the enclosed entity is a representation of the resource
identified by U until t."
Anybody who wants to help write this document is welcome to
step in!
Dan