[w3m-dev-en 00257] Re: bug in ':' function?

From: Adam M. Costello (amc@cs.berkeley.edu)
Date: Tue Oct 24 2000 - 18:57:41 CDT


    Peter Poeml <poeml@suse.de> wrote:

    > > http://www.kuro5hin.org/?op=displaystory;sid=2000/10/20/24336/134
    >
    > Normally, the separator character between two arguments appended to a
    > URL in a GET request is an ampersand (&).
    >
    > This leads me to saying that the URL you quoted is not correct.

    No, it just means that the semicolon is not an argument separator.
    There is one argument, whose name is "op" and whose value is
    "displaystory;sid=2000/10/20/24336/134".

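    A minimal sketch, assuming Python, of what splitting on "&" alone does
    to that query-string. The helper name split_query_args is hypothetical,
    made up for this illustration and not taken from any library:

```python
def split_query_args(query):
    """Split a query-string into (name, value) pairs, using '&' as the
    only argument separator. A semicolon is treated as ordinary data."""
    pairs = []
    for field in query.split("&"):
        if not field:
            continue  # an empty argument, e.g. from a trailing '&'
        name, _, value = field.partition("=")
        pairs.append((name, value))
    return pairs


query = "op=displaystory;sid=2000/10/20/24336/134"
print(split_query_args(query))
```

    Under that assumption the semicolon and everything after it stay
    inside the value of "op".
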
    Norman Walsh <ndw@nwalsh.com> wrote:

    > RFC 1738 has been replaced by RFC 2396

    Updated by, not replaced. RFC 2396 is the latest authority on generic
    URI syntax, but RFC 1738 still has things to say about particular kinds
    of URLs.

    > which describes the semicolon character in section 3.3:
    >
    > The path may consist of a sequence of path segments separated by a
    > single slash "/" character. Within a path segment, the characters
    > "/", ";", "=", and "?" are reserved. Each path segment may include
    > a sequence of parameters, indicated by the semicolon ";" character.

    This is describing the role of semicolon in the path component, which is
    irrelevant to the URL above, because in that URL the semicolon appears
    in the query-string, not the path.
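
    One way to check which component the semicolon falls in is to split
    the URL with a generic parser. This sketch assumes Python's standard
    urllib.parse module, which is just one convenient choice:

```python
from urllib.parse import urlsplit

url = "http://www.kuro5hin.org/?op=displaystory;sid=2000/10/20/24336/134"
parts = urlsplit(url)

print(parts.path)   # the path component, before the '?'
print(parts.query)  # the query-string, after the '?'
print(";" in parts.path, ";" in parts.query)
```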

    Furthermore, the whole issue of which characters are reserved and what
    they mean is irrelevant to this discussion. If a character is reserved,
    it serves as a delimiter; if it's not reserved, it can appear as a
    regular character. Either way it can appear in a URI, so knowing that a
    character is reserved does not help in deciding where a URI begins or
    ends.

    RFC 2396 defines the set of characters allowed in URIs, but even that is
    not terribly useful for finding URIs in plain text, because invalid URIs
    containing disallowed ASCII characters are often used and they usually
    work.

    There is no perfectly correct way to find URIs in plain text; you can
    only use heuristics. Here are some empirical observations I've found
    helpful:

    URIs are often enclosed in single quotes, double quotes, angle brackets,
    or parentheses.

    URIs never contain single quotes, double quotes, or angle brackets (but
    they do sometimes contain parentheses).

    URIs never contain whitespace, except when that whitespace includes a
    newline (that is, when the URI has been wrapped across lines).

    Even when the beginning of the URI is left off, the URI almost always
    begins with a letter.

    URIs usually end in a letter, digit, slash, or underscore. (When they
    end in a question mark (empty query-string) or an ampersand (empty
    argument), losing that character will probably have no effect.)
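
    The observations above could be turned into a rough finder along the
    following lines. This is only a sketch under stated assumptions, not
    w3m's code: it is written in Python, the names CANDIDATE and find_uris
    are hypothetical, it requires an explicit "scheme://" prefix (so it
    misses URIs whose beginning is left off), it does not rejoin URIs
    wrapped across a newline, and example.org is a placeholder host.

```python
import re

# A candidate starts with a letter, continues through "scheme://", and then
# runs until whitespace, a quote, or an angle bracket. Parentheses are
# allowed inside the candidate.
CANDIDATE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*://[^\s'\"<>]+")


def _plausible_last_char(ch):
    """True for the characters a URI usually ends in:
    an ASCII letter or digit, a slash, or an underscore."""
    return ch.isascii() and (ch.isalnum() or ch in "/_")


def find_uris(text):
    """Return heuristic URI matches found in plain text."""
    uris = []
    for match in CANDIDATE.finditer(text):
        candidate = match.group(0)
        # Trim trailing punctuation such as ')', '.', ',', '?' or '&'.
        while candidate and not _plausible_last_char(candidate[-1]):
            candidate = candidate[:-1]
        if candidate:
            uris.append(candidate)
    return uris


sample = (
    "See <http://www.kuro5hin.org/?op=displaystory;sid=2000/10/20/24336/134>, "
    "or (http://example.org/a_b)."
)
for uri in find_uris(sample):
    print(uri)
```

    Trimming the tail is a trade-off: it drops the closing parenthesis and
    the period that surround the second URI, but it would also drop a
    closing parenthesis that genuinely belongs to a URI.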

    AMC


