Text::Refer - parse Unix "refer" files
This is Alpha code, and may be subject to changes in its public interface. It will stabilize by June 1997, at which point this notice will be removed. Until then, if you have any feedback, please let me know!
Pull in the module:
use Text::Refer;
Parse a refer stream from a filehandle:
    while ($ref = input Text::Refer \*FH) {
        # ...do stuff with $ref...
    }
    defined($ref) or die "error parsing input";
Same, but using a parser object for more control:
    # Create a new parser:
    $parser = new Text::Refer::Parser LeadWhite=>'KEEP';

    # Parse:
    while ($ref = $parser->input(\*FH)) {
        # ...do stuff with $ref...
    }
    defined($ref) or die "error parsing input";
Manipulating reference objects, using high-level methods:
    # Get the title, author, etc.:
    $title      = $ref->title;
    @authors    = $ref->author;      # list context
    $lastAuthor = $ref->author;      # scalar context

    # Set the title and authors:
    $ref->title("Cyberiad");
    $ref->author(["S. Trurl", "C. Klapaucius"]);    # arrayref for >1 value!

    # Delete the abstract:
    $ref->abstract(undef);
Same, using low-level methods:
    # Get the title, author, etc.:
    $title      = $ref->get('T');
    @authors    = $ref->get('A');      # list context
    $lastAuthor = $ref->get('A');      # scalar context

    # Set the title and authors:
    $ref->set('T', "Cyberiad");
    $ref->set('A', "S. Trurl", "C. Klapaucius");

    # Delete the abstract:
    $ref->set('X');                    # sets to empty array of values
Output:
print $ref->as_string;
This module supersedes the old Text::Bib.
This module provides routines for parsing in the contents of "refer"-format bibliographic databases: these are simple text files which contain one or more bibliography records. They are usually found lurking on Unix-like operating systems, with the extension .bib.
Each record in a "refer" file describes a single paper, book, or article. Users of nroff/troff often employ such databases when typesetting papers.
Even if you don't use *roff, this simple, easily-parsed parameter-value format is still useful for recording/exchanging bibliographic information. With this module, you can easily post-process "refer" files: search them, convert them into LaTeX, whatever.
Here's a possible "refer" file with three entries:
    %T Cyberiad
    %A Stanislaw Lem
    %K robot fable
    %I Harcourt/Brace/Jovanovich

    %T Invisible Cities
    %A Italo Calvino
    %K city fable philosophy
    %X In this surreal series of fables, Marco Polo tells an
       aged Kublai Khan of the many cities he has visited in
       his lifetime.

    %T Angels and Visitations
    %A Neil Gaiman
    %D 1993
The lines separating the records must be completely blank; that is, they cannot contain anything but a single newline.
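To make the record layout concrete, here is a minimal sketch of splitting "refer" text into records by hand. It is only an illustration of the format described above, not the module's parser: the helper name parse_refer_text and the hash-of-arrayrefs layout are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Refer): split "refer" text into
# records, each a hash mapping a one-character field name to an arrayref
# of values.
sub parse_refer_text {
    my ($text) = @_;
    my @records;
    # Records are separated by one or more completely blank lines.
    for my $chunk (split /\n{2,}/, $text) {
        my (%rec, $last);
        for my $line (split /\n/, $chunk) {
            if ($line =~ /^%(\S) ?(.*)$/) {
                # A new field: "%", a one-character name, a space, the value.
                push @{ $rec{$1} }, $2;
                $last = $rec{$1};
            }
            elsif ($last) {
                # Any other line continues the most recent field.
                $last->[-1] .= "\n$line";
            }
        }
        push @records, \%rec if %rec;
    }
    return @records;
}

my $text = <<'EOF';
%T Cyberiad
%A Stanislaw Lem
%K robot fable

%T Angels and Visitations
%A Neil Gaiman
%D 1993
EOF

my @records = parse_refer_text($text);
printf "%d records; first title: %s\n", scalar(@records), $records[0]{T}[0];
```

For real work, the input() constructor shown in the synopsis is the thing to use; this sketch skips error handling and the parser options entirely.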
See refer(1) or grefer(1) for more information on "refer" files.
From the GNU manpage, grefer(1):
The bibliographic database is a text file consisting of records separated by one or more blank lines. Within each record fields start with a % at the beginning of a line. Each field has a one character name that immediately follows the %. It is best to use only upper and lower case letters for the names of fields. The name of the field should be followed by exactly one space, and then by the contents of the field. Empty fields are ignored. The conventional meaning of each field is as follows:
NOTE (%L, label): Uniquely identifies the entry. For example, "Able94".
NOTE (%Q, corp_author): Thanks to Mike Zimmerman for clarifying this for me: it means a "corporate" author, used when the "author" is listed as an organization such as the UN, or RAND Corporation, or whatever.
NOTE (%X, abstract): Basically, a brief abstract or description.
For all fields except A and E, if there is more than one occurrence of a particular field in a record, only the last such field will be used.
If accent strings are used, they should follow the character to be accented. This means that the AM macro must be used with the -ms macros. Accent strings should not be quoted: use one \ rather than two.
You will nearly always use the input() constructor to create new instances, and nearly always as shown in the "SYNOPSIS".
Internally, the records are parsed by a parser object; if you invoke the class method Text::Refer::input(), a special default parser is used, and this will be good enough for most tasks. However, for more complex tasks, feel free to use "class Text::Refer::Parser" to build (and use) your own fine-tuned parser, and input() from that instead.
Each instance of this class represents a single record in a "refer" file.
    while ($ref = input Text::Refer \*STDIN) {
        # ...do stuff with $ref...
    }
Do not use this as an instance method; it will not re-init the object you give it.
$ref->attr('X', undef); # delete the abstract
    $ref->attr('T', "The Police State Rears Its Ugly Head");
    $ref->attr('D', 1997);
$ref->attr('A', ["S. Trurl", "C. Klapaucius"]);
We use an arrayref since an empty array would be impossible to distinguish from the next two cases, where the goal is to "get" instead of "set"...
This method returns the current (or new) value of the given attribute, just as get() does:
    $author = $ref->attr('A');
will set $author to "C. Klapaucius".
    @authors = $ref->attr('A');
will set @authors to ("S. Trurl", "C. Klapaucius").
Note: this method is used as the basis of all "named" access methods; hence, the following are equivalent in every way:
    $ref->attr(T => $title)     <=>   $ref->title($title);
    $ref->attr(A => \@authors)  <=>   $ref->author(\@authors);
    $ref->attr(D => undef)      <=>   $ref->date(undef);
    $auth  = $ref->attr('A')    <=>   $auth  = $ref->author;
    @auths = $ref->attr('A')    <=>   @auths = $ref->author;
    A  author      G  govt_no     N  number        S  series
    B  book        I  publisher   O  other_info    T  title
    C  city        J  journal     P  page          V  volume
    D  date        K  keywords    Q  corp_author   X  abstract
    E  editor      L  label       R  report_no
Then, for each field F with high-level attribute name FIELDNAME, the method FIELDNAME() works as follows:
    $ref->attr('F', @args)      <=>   $ref->FIELDNAME(@args)
Which means:
    $ref->attr(T => $title)     <=>   $ref->title($title);
    $ref->attr(A => \@authors)  <=>   $ref->author(\@authors);
    $ref->attr(D => undef)      <=>   $ref->date(undef);
    $auth  = $ref->attr('A')    <=>   $auth  = $ref->author;
    @auths = $ref->attr('A')    <=>   @auths = $ref->author;
See the documentation of attr() for the argument list.
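If it helps to see the equivalence spelled out, here is a rough sketch of how named methods could be layered on top of an attr()-style method. Everything in it is an assumption for illustration: Toy::Ref is a made-up stand-in class, only four of the field names are wired up, and the real module may well be implemented differently.

```perl
use strict;
use warnings;

package Toy::Ref;    # hypothetical stand-in, not the real Text::Refer

sub new { return bless {}, shift }

# attr(FIELD)            - get
# attr(FIELD, undef)     - delete
# attr(FIELD, $scalar)   - set one value
# attr(FIELD, \@values)  - set several values
sub attr {
    my $self  = shift;
    my $field = shift;
    if (@_) {
        my $value = shift;
        $self->{$field} = !defined($value)       ? []
                        : ref($value) eq 'ARRAY' ? [@$value]
                        :                          [$value];
    }
    my @values = @{ $self->{$field} || [] };
    # List context: every value. Scalar context: the last one (or undef).
    return wantarray ? @values : $values[-1];
}

# Install one named accessor per field; only a few names are shown here.
my %name_for = (A => 'author', D => 'date', T => 'title', X => 'abstract');
for my $field (keys %name_for) {
    no strict 'refs';
    *{"Toy::Ref::$name_for{$field}"} = sub { shift->attr($field, @_) };
}

package main;

my $ref = Toy::Ref->new;
$ref->title("Cyberiad");
$ref->author(["S. Trurl", "C. Klapaucius"]);
my @authors    = $ref->author;    # list context: both authors
my $lastAuthor = $ref->author;    # scalar context: "C. Klapaucius"
```

Passing an arrayref for multiple values is what lets a bare call with no arguments mean "get" rather than "set to nothing".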
@authors = $ref->get('A'); # returns list of all authors
In a scalar context, it returns the last value (undefined if none):
$author = $ref->get('A'); # returns the last author
$ref->set('A', "S. Trurl", "C. Klapaucius");
An empty array of VALUES deletes the attribute:
$ref->set('A'); # deletes all authors
No useful return value is currently defined.
print $ref->as_string;
The options are:
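Quick - skips the safety measures described below. If the record was read with parser options that alter its whitespace (e.g., Newline=TOSPACE with LeadWhite=KILLALL), there is a risk that the output object will be an invalid "refer" record.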
The fields are output with %L first (if it exists), and then the remaining fields in alphabetical order. The following "safety measures" are normally taken:
These safety measures are slightly time-consuming, and are silly if you are merely outputting a "refer" object which you have read in verbatim (i.e., using the default parser-options) from a valid "refer" file. In these cases, you may want to use the Quick option.
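The field ordering described above (%L first, then the rest alphabetically) can be sketched with a two-level sort. The helper name output_order is hypothetical and exists only for this example; it says nothing about how as_string is actually written.

```perl
use strict;
use warnings;

# Hypothetical helper: return field names in the documented output order.
sub output_order {
    my @fields = @_;
    return sort {
        ($b eq 'L') <=> ($a eq 'L')    # %L sorts ahead of everything else
            or $a cmp $b               # then plain alphabetical order
    } @fields;
}

my @order = output_order(qw(T A X L D));
print join(' ', map { "%$_" } @order), "\n";
```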
Instances of this class do the actual parsing.
The options you may give to new() are as follows:
[\041-\176]
However, when compiling parser options, you can supply your own regular expression for validating (one-character) field names. (note: you must supply the square brackets; they are there to remind you that you should give a well-formed single-character expression). One standard expression is provided for you:
$Text::Refer::GroffFields = '[A-EGI-LN-TVX]'; # legal groff fields
Illegal fields which are encountered during parsing result in a syntax error.
NOTE: You really shouldn't use this unless you absolutely need to. The added regular expression test slows down the parser.
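As a rough picture of what such a validation amounts to, the sketch below tests one-character field names against the bracketed expression. The function is_legal_field is an assumption for illustration, and the expression is copied into a local variable so the example runs without the module loaded.

```perl
use strict;
use warnings;

# Local copy of the bracketed expression quoted above, so that this sketch
# does not depend on the module being installed.
my $groff_fields = '[A-EGI-LN-TVX]';

# Hypothetical check: a field name is legal if the whole (one-character)
# name matches the bracketed expression.
sub is_legal_field {
    my ($name, $class) = @_;
    return $name =~ /\A$class\z/ ? 1 : 0;
}

for my $name (qw(T A X Y)) {
    printf "%%%s is %s\n", $name,
        is_legal_field($name, $groff_fields) ? 'legal' : 'illegal';
}
```

The square brackets matter: without them, the anchored match would no longer describe a single character.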
    %T Incontrovertible Proof that Pi Equals Three
       (for Large Values of Three)
    %A S. Trurl
    %X The author shows how anyone can use various common household
       objects to obtain successively less-accurate estimations of
       pi, until finally arriving at a desired integer approximation,
       which nearly always is three.
This leading whitespace serves two purposes: (1) it makes it impossible to mistake a continuation line for a field, since % can no longer be the first character, and (2) it makes the entries easier to read.
The LeadWhite option controls what is done with this whitespace:
    KEEP    - default; the whitespace is untouched
    KILLONE - exactly one character of leading whitespace is removed
    KILLALL - all leading whitespace is removed
See the section below on "using the parser options" for hints and warnings.
The Newline option controls what is done with the newlines that separate adjacent lines in the same field:
    KEEP    - default; the newlines are kept in the field value
    TOSPACE - convert each newline to a single space
    KILL    - the newlines are removed
See the section below on "using the parser options" for hints and warnings.
Default values will be used for any options which are left unspecified.
The default values for Newline and LeadWhite will preserve the input text exactly.
The Newline=TOSPACE option, when used in conjunction with the LeadWhite=KILLALL option, effectively "word-wraps" the text of each field into a single line.
Be careful! If you use the Newline=KILL option with either the LeadWhite=KILLONE or the LeadWhite=KILLALL option, you could end up eliminating all whitespace that separates the word at the end of one line from the word at the beginning of the next line.
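The interaction between the two options is easier to see on a concrete value. The sketch below applies the documented semantics to the raw text of one multi-line field; apply_options is a hypothetical function written for this example and is not the module's own code.

```perl
use strict;
use warnings;

# Hypothetical illustration of the documented option semantics, applied to
# the raw text of one multi-line field.
sub apply_options {
    my ($value, %opt) = @_;
    my $lead    = $opt{LeadWhite} || 'KEEP';
    my $newline = $opt{Newline}   || 'KEEP';

    # Leading whitespace on continuation lines (everything after a newline).
    $value =~ s/\n[ \t]/\n/g  if $lead eq 'KILLONE';
    $value =~ s/\n[ \t]+/\n/g if $lead eq 'KILLALL';

    # Newlines separating adjacent lines of the same field.
    $value =~ s/\n/ /g if $newline eq 'TOSPACE';
    $value =~ s/\n//g  if $newline eq 'KILL';
    return $value;
}

my $raw = "Incontrovertible Proof that Pi Equals Three\n   (for Large Values of Three)";

# "Word-wrapped" onto one line, with a single space at the join.
my $wrapped = apply_options($raw, LeadWhite => 'KILLALL', Newline => 'TOSPACE');

# The risky combination: "Three" and "(for" end up fused together.
my $fused = apply_options($raw, LeadWhite => 'KILLALL', Newline => 'KILL');

print "$wrapped\n$fused\n";
```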
Text::Refer.
class() method.
Returns the object on success, '0' on expected end-of-file, and undefined on error.
Having two false values makes parsing very simple: just input() records until the result is false, then check to see if that last result was 0 (end of file) or undef (failure).
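Here is a small, self-contained sketch of that loop pattern. The reader next_record is a toy invented for this example (it treats each string as a one-field record); only the return convention mirrors the one described above.

```perl
use strict;
use warnings;

# Hypothetical reader mimicking the documented return convention:
# a record on success, 0 at end of input, undef on error.
sub next_record {
    my ($queue) = @_;
    return 0 unless @$queue;                   # expected end of input
    my $line = shift @$queue;
    return unless $line =~ /^%(\S) (.*)$/;     # malformed: undef in scalar context
    return +{ $1 => [$2] };
}

sub read_all {
    my @queue = @_;
    my ($rec, @got);
    while ($rec = next_record(\@queue)) {
        push @got, $rec;
    }
    # The loop ended on a false value: 0 means clean EOF, undef means failure.
    defined($rec) or die "error parsing input\n";
    return @got;
}

my @ok     = read_all('%T Cyberiad', '%T Invisible Cities');
my $failed = !eval { read_all('%T Cyberiad', 'not a field'); 1 };
```

The single defined() test after the loop is all it takes to tell the two endings apart.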
Each "refer" object has instance variables corresponding to the actual field names ('T', 'A', etc.). Each of these is a reference to an array of the actual values.
Notice that, for maximum flexibility and consistency (but at the cost of some space and access-efficiency), the semantics of "refer" records do not come into play at this time: since everything resides in an array, you can have as many %K, %D, etc. records as you like, and give them entirely different semantics.
For example, the Library Of Boring Stuff That Everyone Reads (LOBSTER) uses the unused %Y as a "year" field. The parser accommodates this case by politely not choking on LOBSTER .bibs (although why you would want to eat a lobster bib instead of the lobster is beyond me...).
Tolerable. On my 90MHz/32 MB RAM/I586 box running Linux 1.2.13 and Perl5.002, it parses a typical 500 KB "refer" file (of 1600 records) as follows:
     8 seconds of user time for input and no output
    10 seconds of user time for input and "quick" output
    16 seconds of user time for input and "safe" output
So, figure the individual speeds are:
    input:          200 records ( 60 KB) per second.
    "quick" output: 800 records (240 KB) per second.
    "safe" output:  200 records ( 60 KB) per second.
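For anyone who wants to check the arithmetic, the figures can be rederived from the timings above by subtracting the 8 seconds of input time from each total. The variable names below are just for this sketch; the KB rates come out at 62.5 and 250, which the table above gives in rounded form.

```perl
use strict;
use warnings;

my ($records, $kilobytes) = (1600, 500);

# Seconds spent in each stage; output times are the totals minus the
# 8 seconds of input.
my %seconds = (input => 8, quick => 10 - 8, safe => 16 - 8);

my %records_per_sec = map { $_ => $records   / $seconds{$_} } keys %seconds;
my %kb_per_sec      = map { $_ => $kilobytes / $seconds{$_} } keys %seconds;

for my $stage (qw(input quick safe)) {
    printf "%-6s %4d records (%5.1f KB) per second\n",
        $stage, $records_per_sec{$stage}, $kb_per_sec{$stage};
}
```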
By contrast, a C program which does the same work is about 8 times as fast.
But of course, the C code is 8 times as large, and 8 times as ugly... :-)
I actually do not use "refer" files for *roffing... I used them as a quick-and-dirty database for WebLib, and that's where this code comes from. If you're a serious user of "refer" files, and this module doesn't do what you need it to, please contact me: I'll add the functionality in.
Some combinations of parser-options are silly.
$Id: Refer.pm,v 1.106 1997/04/22 18:41:41 eryq Exp $
Copyright (C) 1997 by Eryq, eryq@enteract.com, http://www.enteract.com/~eryq.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
For a copy of the GNU General Public License, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.