1 """Low-level interface to NCBI's EUtils for Entrez search and retrieval.
2
3 For higher-level interfaces, see DBIdsClient (which works with a set
4 of database identifiers) and HistoryClient (which does a much better
5 job of handling history).
6
7 There are five classes of services:
8 ESearch - search a database
9 EPost - upload a list of indicies for further use
10 ESummary - get document summaries for a given set of records
11 EFetch - get the records translated to a given format
12 ELink - find related records in other databases
13
14 You can find more information about them at
15 http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
16 but that document isn't very useful. Perhaps the following is better.
17
18 EUtils offers a structured way to query Entrez, get the results in
19 various formats, and get information about related documents. The way
20 to start off is create an EUtils object.
21
22 >>> from Bio import EUtils
23 >>> from Bio.EUtils.ThinClient import ThinClient
24 >>> eutils = ThinClient.ThinClient()
25 >>>
26
27 You can search Entrez with the "esearch" method. This does a query on
28 the server, which generates a list of identifiers for records that
29 matched the query. However, not all the identifiers are returned.
30 You can request only a subset of the matches (using the 'retstart' and
31 'retmax') terms. This is useful because searches like 'cancer' can
32 have over 1.4 million matches. Most people would rather change the
33 query or look at more details about the first few hits than wait to
34 download all the identifiers before doing anything else.
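The 'retstart'/'retmax' pairing supports simple paging through a large
result set.  Here's a minimal sketch of the idea; 'run_search' is a
hypothetical helper standing in for an esearch call plus parsing of
the returned <Id> values:

```python
# Hypothetical sketch of paging through a result set with
# 'retstart'/'retmax'.  'run_search(retstart, retmax)' stands in for
# an esearch request plus extraction of the <Id> values it returns.
def fetch_all_ids(run_search, total, page_size=100):
    ids = []
    for start in range(0, total, page_size):
        # each call asks only for identifiers [start, start + page_size)
        ids.extend(run_search(start, page_size))
    return ids
```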

The esearch method, and indeed all these methods, returns a
'urllib.addinfourl' which is an HTTP socket connection that has
already parsed the HTTP header and is ready to read the data from the
server.

For example, here's a query and how to use it:

  Search in PubMed for the term cancer for the entrez date from the
  last 60 days and retrieve the first 10 IDs and translations using
  the history parameter.

>>> infile = eutils.esearch("cancer",
...                         daterange = EUtils.WithinNDays(60, "edat"),
...                         retmax = 10)
>>>
>>> print infile.read()
<?xml version="1.0"?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD eSearchResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd">
<eSearchResult>
    <Count>7228</Count>
    <RetMax>10</RetMax>
    <RetStart>0</RetStart>
    <IdList>
        <Id>12503096</Id>
        <Id>12503075</Id>
        <Id>12503073</Id>
        <Id>12503033</Id>
        <Id>12503030</Id>
        <Id>12503028</Id>
        <Id>12502932</Id>
        <Id>12502925</Id>
        <Id>12502881</Id>
        <Id>12502872</Id>
    </IdList>
    <TranslationSet>
        <Translation>
            <From>cancer%5BAll+Fields%5D</From>
            <To>(%22neoplasms%22%5BMeSH+Terms%5D+OR+cancer%5BText+Word%5D)</To>
        </Translation>
    </TranslationSet>
    <TranslationStack>
        <TermSet>
            <Term>"neoplasms"[MeSH Terms]</Term>
            <Field>MeSH Terms</Field>
            <Count>1407151</Count>
            <Explode>Y</Explode>
        </TermSet>
        <TermSet>
            <Term>cancer[Text Word]</Term>
            <Field>Text Word</Field>
            <Count>382919</Count>
            <Explode>Y</Explode>
        </TermSet>
        <OP>OR</OP>
        <TermSet>
            <Term>2002/10/30[edat]</Term>
            <Field>edat</Field>
            <Count>-1</Count>
            <Explode>Y</Explode>
        </TermSet>
        <TermSet>
            <Term>2002/12/29[edat]</Term>
            <Field>edat</Field>
            <Count>-1</Count>
            <Explode>Y</Explode>
        </TermSet>
        <OP>RANGE</OP>
        <OP>AND</OP>
    </TranslationStack>
</eSearchResult>

>>>

You get a raw XML input stream which you can process in many ways.
(The appropriate DTDs are included in the subdirectory "DTDs"; see
also the included POM reading code.)

WARNING!  As of this writing (2002/12/3) NCBI returns their
XML encoded as Latin-1 but their processing instruction implies
it is UTF-8 because they leave out the "encoding" attribute.
Until they fix it you will need to recode the input stream
before processing it with XML tools, like this:

    import codecs
    infile = codecs.EncodedFile(infile, "utf-8", "iso-8859-1")


The XML fields are mostly understandable:
  Count -- the total number of matches from this search
  RetMax -- the number of <Id> values returned in this subset
  RetStart -- the start position of this subset in the list of
          all matches

  IdList and Id -- the identifiers in this subset

  TranslationSet / Translation -- if the search field is not
          explicitly specified ("qualified"), then the server will
          apply a set of heuristics to improve the query.  E.g., in
          this case "cancer" is first parsed as
                cancer[All Fields]
          then turned into the query
                "neoplasms"[MeSH Terms] OR cancer[Text Word]

          Note that these terms are URL-escaped.
          For details on how the translation is done, see
  http://www.ncbi.nlm.nih.gov/entrez/query/static/help/pmhelp.html#AutomaticTermMapping

  TranslationStack -- the (possibly 'improved') query fully
          parsed out and converted into postfix (RPN) notation.
          The above example is written in the Entrez query language as

              ("neoplasms"[MeSH Terms] OR cancer[Text Word]) AND
                            2002/10/30:2002/12/29[edat]

          Note that these terms are *not* URL-escaped.  Nothing like
          a bit of inconsistency for the soul.

          The "Count" field shows how many matches were found for each
          term of the expression.  I don't know what "Explode" does.
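The TranslationStack can be turned back into that infix form by
evaluating it as a postfix expression.  Here's a minimal sketch; the
term/operator tuples stand in for parsed <TermSet> and <OP> elements,
and the RANGE handling assumes both terms carry the same field
qualifier, as in the example above:

```python
def rpn_to_infix(tokens):
    """tokens: ("TERM", text) or ("OP", name) pairs in postfix order."""
    stack = []
    for kind, value in tokens:
        if kind == "TERM":
            stack.append(value)
        elif value == "RANGE":
            # merge two 'YYYY/MM/DD[field]' terms into 'start:stop[field]'
            stop = stack.pop()
            start = stack.pop()
            stack.append(start.split("[")[0] + ":" + stop)
        else:  # AND / OR
            right = stack.pop()
            left = stack.pop()
            stack.append("(%s %s %s)" % (left, value, right))
    return stack[0]
```

Feeding it the stack shown above reproduces the query
(("neoplasms"[MeSH Terms] OR cancer[Text Word]) AND 2002/10/30:2002/12/29[edat]).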


Let's get more information about the first record, which has an id of
12503096.  There are two ways to query for information: one uses a set
of identifiers and the other uses the history.  I'll talk about the
history one in a bit.  To use a set of identifiers you need to make a
DBIds object containing that list.

>>> dbids = EUtils.DBIds("pubmed", ["12503096"])
>>>

Now get the summary using dbids:

>>> infile = eutils.esummary_using_dbids(dbids)
>>> print infile.read()
<?xml version="1.0"?>
<!DOCTYPE eSummaryResult PUBLIC "-//NLM//DTD eSummaryResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSummary_020511.dtd">
<eSummaryResult>
<DocSum>
    <Id>12503096</Id>
    <Item Name="PubDate" Type="Date">2003 Jan 30</Item>
    <Item Name="Source" Type="String">Am J Med Genet</Item>
    <Item Name="Authors" Type="String">Coyne JC, Kruus L, Racioppo M, Calzone KA, Armstrong K</Item>
    <Item Name="Title" Type="String">What do ratings of cancer-specific distress mean among women at high risk of breast and ovarian cancer?</Item>
    <Item Name="Volume" Type="String">116</Item>
    <Item Name="Pages" Type="String">222-8</Item>
    <Item Name="EntrezDate" Type="Date">2002/12/28 04:00</Item>
    <Item Name="PubMedId" Type="Integer">12503096</Item>
    <Item Name="MedlineId" Type="Integer">22390532</Item>
    <Item Name="Lang" Type="String">English</Item>
    <Item Name="PubType" Type="String"></Item>
    <Item Name="RecordStatus" Type="String">PubMed - in process</Item>
    <Item Name="Issue" Type="String">3</Item>
    <Item Name="SO" Type="String">2003 Jan 30;116(3):222-8</Item>
    <Item Name="DOI" Type="String">10.1002/ajmg.a.10844</Item>
    <Item Name="JTA" Type="String">3L4</Item>
    <Item Name="ISSN" Type="String">0148-7299</Item>
    <Item Name="PubId" Type="String"></Item>
    <Item Name="PubStatus" Type="Integer">4</Item>
    <Item Name="Status" Type="Integer">5</Item>
    <Item Name="HasAbstract" Type="Integer">1</Item>
    <Item Name="ArticleIds" Type="List">
        <Item Name="PubMedId" Type="String">12503096</Item>
        <Item Name="DOI" Type="String">10.1002/ajmg.a.10844</Item>
        <Item Name="MedlineUID" Type="String">22390532</Item>
    </Item>
</DocSum>
</eSummaryResult>
>>>

This is just a summary.  To get the full details, including an
abstract (if available), use the 'efetch' method.  I'll only print a
bit to convince you it has an abstract.

>>> s = eutils.efetch_using_dbids(dbids).read()
>>> print s[587:860]
<ArticleTitle>What do ratings of cancer-specific distress mean among women at high risk of breast and ovarian cancer?</ArticleTitle>
<Pagination>
<MedlinePgn>222-8</MedlinePgn>
</Pagination>
<Abstract>
<AbstractText>Women recruited from a hereditary cancer registry provided
>>>

Suppose instead you want the data in a text format.  Different
databases have different text formats.  For example, PubMed has a
"docsum" format which gives just the summary of a document and a
"medline" format as needed for a citation database.  To get these, use
a "text" "retmode" ("return mode") and select the appropriate
"rettype" ("return type").

Here are examples of those two return types:

>>> print eutils.efetch_using_dbids(dbids, "text", "docsum").read()[:497]
1: Coyne JC, Kruus L, Racioppo M, Calzone KA, Armstrong K.
What do ratings of cancer-specific distress mean among women at high risk of breast and ovarian cancer?
Am J Med Genet. 2003 Jan 30;116(3):222-8.
PMID: 12503096 [PubMed - in process]
>>> print eutils.efetch_using_dbids(dbids, "text", "medline").read()[:369]
UI  - 22390532
PMID- 12503096
DA  - 20021227
IS  - 0148-7299
VI  - 116
IP  - 3
DP  - 2003 Jan 30
TI  - What do ratings of cancer-specific distress mean among women at high risk
      of breast and ovarian cancer?
PG  - 222-8
AB  - Women recruited from a hereditary cancer registry provided ratings of
      distress associated with different aspects of high-risk status
>>>

It's also possible to get a list of records related to a given
article.  This is done through the "elink" method.  For example,
here's how to get the list of PubMed articles related to the above
PubMed record.  (Again, truncated because otherwise there is a lot of
data.)

>>> print eutils.elink_using_dbids(dbids).read()[:590]
<?xml version="1.0"?>
<!DOCTYPE eLinkResult PUBLIC "-//NLM//DTD eLinkResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eLink_020511.dtd">
<eLinkResult>
<LinkSet>
    <DbFrom>pubmed</DbFrom>
    <IdList>
        <Id>12503096</Id>
    </IdList>
    <LinkSetDb>
        <DbTo>pubmed</DbTo>
        <LinkName>pubmed_pubmed</LinkName>
        <Link>
            <Id>12503096</Id>
            <Score>2147483647</Score>
        </Link>
        <Link>
            <Id>11536413</Id>
            <Score>30817790</Score>
        </Link>
        <Link>
            <Id>11340606</Id>
            <Score>29939219</Score>
        </Link>
        <Link>
            <Id>10805955</Id>
            <Score>29584451</Score>
        </Link>
>>>

For a change of pace, let's work with the protein database to learn
how to work with history.  Suppose I want to do a multiple sequence
alignment of bacteriorhodopsin with all of its neighbors, where
"neighbors" is defined by NCBI.  There are good programs for this -- I
just need to get the records in the right format, like FASTA.

The bacteriorhodopsin I'm interested in is BAA75200, which is
GI:4579714, so I'll start by asking for its neighbors.

>>> results = eutils.elink_using_dbids(
...                EUtils.DBIds("protein", ["4579714"]),
...                db = "protein").read()
>>> print results[:454]
<?xml version="1.0"?>
<!DOCTYPE eLinkResult PUBLIC "-//NLM//DTD eLinkResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eLink_020511.dtd">
<eLinkResult>
<LinkSet>
    <DbFrom>protein</DbFrom>
    <IdList>
        <Id>4579714</Id>
    </IdList>
    <LinkSetDb>
        <DbTo>protein</DbTo>
        <LinkName>protein_protein</LinkName>
        <Link>
            <Id>4579714</Id>
            <Score>2147483647</Score>
        </Link>
        <Link>
            <Id>11277596</Id>
            <Score>1279</Score>
        </Link>
>>>

Let's get all the <Id> fields.  (While the following isn't a good way
to parse XML, it is easy to understand and works well enough for this
example.)  Note that I remove the first <Id> because that's from the
query and not from the results.

>>> import re
>>> ids = re.findall(r"<Id>(\d+)</Id>", results)
>>> ids = ids[1:]
>>> len(ids)
222
>>> dbids = EUtils.DBIds("protein", ids)
>>>

That's a lot of records.  I could use 'efetch_using_dbids' but there's
a problem with that.  EFetch uses the HTTP GET protocol to pass
information to the EUtils server.  ("GET" is what's used when you type
a URL into the browser.)  Each id takes about 9 characters, so the URL
would be over 2,000 characters long.  This may not work on some
systems; for example, some proxies do not support long URLs.  (Search
for "very long URLs" for examples.)

Instead, we'll upload the list to the server and then fetch the FASTA
version using the history.

The first step is to upload the data.  EPost places the uploaded
identifiers into the history.  There's no existing history yet, so
the webenv string is None.

>>> print eutils.epost(dbids, webenv = None).read()
<?xml version="1.0"?>
<!DOCTYPE ePostResult PUBLIC "-//NLM//DTD ePostResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/ePost_020511.dtd">
<ePostResult>
    <QueryKey>1</QueryKey>
    <WebEnv>%7BPgTHRHFBsJfC%3C%5C%5C%5B%3EAfJCKQ%5Ey%60%3CGkH%5DH%5E%3DJHGBKAJ%3F%40CbCiG%3FE%3C</WebEnv>
</ePostResult>

>>>

This says that the identifiers were saved as query #1, which will be
used later on as the "query_key" field.  The WebEnv is a cookie (or
token) used to tell the server where to find that query.  The WebEnv
changes after every history-enabled ESearch or EPost, so you'll need
to parse the output from those to get the new WebEnv field.  You'll
also need to unquote it since it is URL-escaped.
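Here's a sketch of pulling both fields out of the EPost output with
regular expressions.  The helper name is made up, and a real XML
parser would be more robust than this:

```python
import re
try:
    from urllib.parse import unquote   # Python 3
except ImportError:
    from urllib import unquote         # Python 2

def parse_epost_result(xml_text):
    """Hypothetical helper: extract QueryKey and the unquoted WebEnv."""
    query_key = re.search(r"<QueryKey>(\d+)</QueryKey>", xml_text).group(1)
    # the WebEnv value is URL-escaped; unquote it before reuse
    webenv = re.search(r"<WebEnv>([^<]+)</WebEnv>", xml_text).group(1)
    return query_key, unquote(webenv)
```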

Also, you will need to pass in the name of the database used for the
query in order to access the history.  Why?  I don't know -- I figure
the WebEnv and query_key should be enough to get the database name.

>>> import urllib
>>> webenv = urllib.unquote("%7BPgTHRHFBsJfC%3C%5C%5C%5B%3EAfJCKQ%5Ey%60%3CGkH%5DH%5E%3DJHGBKAJ%3F%40CbCiG%3FE%3C")
>>> print webenv
{PgTHRHFBsJfC<\\[>AfJCKQ^y`<GkH]H^=JHGBKAJ?@CbCiG?E<
>>>

Okay, now to get the data in FASTA format.  Notice that I need the
'retmax' in order to include all the records in the result.  (The
default is 20 records.)

>>> fasta = eutils.efetch_using_history("protein", webenv, query_key = "1",
...                                     retmode = "text", rettype = "fasta",
...                                     retmax = len(dbids)).read()
>>> fasta.count(">")
222
>>> print fasta[:694]
>gi|14194475|sp|O93742|BACH_HALSD Halorhodopsin (HR)
MMETAADALASGTVPLEMTQTQIFEAIQGDTLLASSLWINIALAGLSILLFVYMGRNLEDPRAQLIFVAT
LMVPLVSISSYTGLVSGLTVSFLEMPAGHALAGQEVLTPWGRYLTWALSTPMILVALGLLAGSNATKLFT
AVTADIGMCVTGLAAALTTSSYLLRWVWYVISCAFFVVVLYVLLAEWAEDAEVAGTAEIFNTLKLLTVVL
WLGYPIFWALGAEGLAVLDVAVTSWAYSGMDIVAKYLFAFLLLRWVVDNERTVAGMAAGLGAPLARCAPA
DD
>gi|14194474|sp|O93741|BACH_HALS4 Halorhodopsin (HR)
MRSRTYHDQSVCGPYGSQRTDCDRDTDAGSDTDVHGAQVATQIRTDTLLHSSLWVNIALAGLSILVFLYM
ARTVRANRARLIVGATLMIPLVSLSSYLGLVTGLTAGPIEMPAAHALAGEDVLSQWGRYLTWTLSTPMIL
LALGWLAEVDTADLFVVIAADIGMCLTGLAAALTTSSYAFRWAFYLVSTAFFVVVLYALLAKWPTNAEAA
GTGDIFGTLRWLTVILWLGYPILWALGVEGFALVDSVGLTSWGYSLLDIGAKYLFAALLLRWVANNERTI
AVGQRSGRGAIGDPVED
>>>

To round things out, here's a query which refines the previous query.
I want to get all records from the first search which also have the
word "structure" in them.  (My background was originally structural
biophysics, whaddya expect? :)

>>> print eutils.esearch("#1 AND structure", db = "protein", usehistory = 1,
...                      webenv = webenv).read()
<?xml version="1.0"?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD eSearchResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd">
<eSearchResult>
    <Count>67</Count>
    <RetMax>20</RetMax>
    <RetStart>0</RetStart>
    <QueryKey>2</QueryKey>
    <WebEnv>UdvMf%3F%60G%3DIE%60bG%3DGec%3E%3D%3Cbc_%5DgBAf%3EAi_e%5EAJcHgDi%3CIqGdE%7BmC%3C</WebEnv>
    <IdList>
        <Id>461608</Id>
        <Id>114808</Id>
        <Id>1364150</Id>
        <Id>1363466</Id>
        <Id>1083906</Id>
        <Id>99232</Id>
        <Id>99212</Id>
        <Id>81076</Id>
        <Id>114811</Id>
        <Id>24158915</Id>
        <Id>24158914</Id>
        <Id>24158913</Id>
        <Id>1168615</Id>
        <Id>114812</Id>
        <Id>114809</Id>
        <Id>17942995</Id>
        <Id>17942994</Id>
        <Id>17942993</Id>
        <Id>20151159</Id>
        <Id>20150922</Id>
    </IdList>
    <TranslationSet>
    </TranslationSet>
    <TranslationStack>
        <TermSet>
            <Term>#1</Term>
            <Field>All Fields</Field>
            <Count>222</Count>
            <Explode>Y</Explode>
        </TermSet>
        <TermSet>
            <Term>structure[All Fields]</Term>
            <Field>All Fields</Field>
            <Count>142002</Count>
            <Explode>Y</Explode>
        </TermSet>
        <OP>AND</OP>
    </TranslationStack>
</eSearchResult>

>>>

One last thing about history.  It doesn't last very long -- perhaps an
hour or so.  (Untested.)  You may be able to toss it some keep-alive
signal every once in a while.  Or you may want to keep a local copy of
the identifiers so you can rebuild the history if it expires.

The known 'db' fields and primary IDs (if known) are:
  genome -- GI number
  nucleotide -- GI number
  omim -- MIM number
  popset -- GI number
  protein -- GI number
  pubmed -- PMID
  sequences -- (not available; this will combine all sequence databases)
  structure -- MMDB ID
  taxonomy -- TAXID

The 'field' parameter is different for different databases.  The
fields for PubMed are listed at

http://www.ncbi.nlm.nih.gov/entrez/query/static/help/pmhelp.html#SearchFieldDescriptionsandTags

  Affiliation -- AD
  All Fields -- All
  Author -- AU
  EC/RN Number -- RN
  Entrez Date -- EDAT (also valid for 'datetype')
  Filter -- FILTER
  Issue -- IP
  Journal Title -- TA
  Language -- LA
  MeSH Date -- MHDA (also valid for 'datetype')
  MeSH Major Topic -- MAJR
  MeSH Subheadings -- SH
  MeSH Terms -- MH
  Pagination -- PG
  Personal Name as Subject -- PS
  Publication Date -- DP (also valid for 'datetype')
  Publication Type -- PT
  Secondary Source ID -- SI
  Subset -- SB
  Substance Name -- NM
  Text Words -- TW
  Title -- TI
  Title/Abstract -- TIAB
  Unique Identifiers -- UID
  Volume -- VI

The fields marked as 'datetype' can also be used for date searches.
Date searches can be done in the query itself, for example, as

    1990/01/01:1999/12/31[edat]

or by passing a WithinNDays or DateRange object to the 'daterange'
parameter of the search.
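As a hypothetical sketch of what a WithinNDays(60, "edat") restriction
expands to; the helper name here is invented (the real class lives in
Bio.EUtils), but the range term it builds matches the one in the
first esearch example:

```python
import datetime

def within_n_days_term(n, datetype="edat", today=None):
    """Hypothetical helper: build the 'start:stop[field]' range term
    covering the last n days."""
    today = today or datetime.date.today()
    start = today - datetime.timedelta(days=n)
    return "%s:%s[%s]" % (start.strftime("%Y/%m/%d"),
                          today.strftime("%Y/%m/%d"), datetype)
```

With today pinned to 2002/12/29, a 60-day window reproduces the
2002/10/30:2002/12/29[edat] term seen earlier.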


Please pay attention to the usage limits!  They are listed at
http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html

At the time of this writing they are:
  * Run retrieval scripts on weekends or between 9 PM and 5 AM ET
     weekdays for any series of more than 100 requests.
  * Make no more than one request every 3 seconds.
  * Only 5000 PubMed records may be retrieved in a single day.

  * NCBI's Disclaimer and Copyright notice must be evident to users
     of your service.  NLM does not hold the copyright on the PubMed
     abstracts; the journal publishers do.  NLM provides no legal
     advice concerning distribution of copyrighted materials; consult
     your legal counsel.

(Their disclaimer is at
http://www.ncbi.nlm.nih.gov/About/disclaimer.html )

"""

import urllib, urllib2, cStringIO
import time

DUMP_URL = 0
DUMP_RESULT = 0

TOOL = "EUtils_Python_client"
EMAIL = "biopython-dev@biopython.org"

assert " " not in TOOL
assert " " not in EMAIL
548
550 """Internal function: convert a list of ids to a comma-seperated string"""
551
552
553
554
555 if not dbids:
556 raise TypeError("dbids list must have at least one term")
557 for x in dbids.ids:
558 if "," in x:
559 raise TypeError("identifiers cannot contain a comma: %r " %
560 (x,))
561 id_string = ",".join(dbids.ids)
562 assert id_string.count(",") == len(dbids.ids)-1, "double checking"
563 return id_string
564
565
566
567 _open_previous = time.time()
568
570 """Client-side interface to the EUtils services
571
572 See the module docstring for much more complete information.
573 """
574 - def __init__(self,
575 opener = None,
576 tool = TOOL,
577 email = EMAIL,
578 baseurl = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"):
579 """opener = None, tool = TOOL, email = EMAIL, baseurl = ".../eutils/"
580
581 'opener' -- an object which implements the 'open' method like a
582 urllib2.OpenDirector. Defaults to urllib2.build_opener()
583
584 'tool' -- the term to use for the 'tool' field, used by NCBI to
585 track which programs use their services. If you write your
586 own tool based on this package, use your own tool name.
587
588 'email' -- a way for NCBI to contact you (the developer, not
589 the user!) if there are problems and to tell you about
590 updates or changes to their system.
591
592 'baseurl' -- location of NCBI's EUtils directory. Shouldn't need
593 to change this at all.
594 """
595
596 if tool is not None and " " in tool:
597 raise TypeError("No spaces allowed in 'tool'")
598 if email is not None and " " in email:
599 raise TypeError("No spaces allowed in 'email'")
600
601 if opener is None:
602 opener = urllib2.build_opener()
603
604 self.opener = opener
605 self.tool = tool
606 self.email = email
607 self.baseurl = baseurl
608
610 """Internal function to add and remove fields from a query"""
611 q = query.copy()
612
613
614 q["tool"] = self.tool
615 q["email"] = self.email
616
617
618
619
620 if "usehistory" in q:
621 if q["usehistory"]:
622 q["usehistory"] = "y"
623 else:
624 q["usehistory"] = None
625
626
627
628 for k, v in q.items():
629 if v is None:
630 del q[k]
631
632
633 return urllib.urlencode(q)
634
    def _wait(self, delay = 3.0):
        """Enforce the NCBI requirement of at most one request every three seconds.

        Ideally the calling code would have respected the 3 second rule,
        but as this often hasn't happened we check it here.

        'delay' -- minimum number of seconds between requests."""
        global _open_previous
        wait = _open_previous + delay - time.time()
        if wait > 0:
            time.sleep(wait)
        _open_previous = time.time()
    def _get(self, program, query):
        """Internal function: send the query string to the program as a GET request"""
        # Enforce the request-rate limit before opening the connection
        self._wait()

        q = self._fixup_query(query)
        url = self.baseurl + program + "?" + q
        if DUMP_URL:
            print "Opening with GET:", url
        if DUMP_RESULT:
            print " ================== Results ============= "
            s = self.opener.open(url).read()
            print s
            print " ================== Finished ============ "
            return cStringIO.StringIO(s)
        return self.opener.open(url)
    def esearch(self,
                term,
                db = "pubmed",
                field = None,
                daterange = None,
                retstart = 0,
                retmax = 20,
                usehistory = 0,
                webenv = None,
                ):
        """term, db="pubmed", field=None, daterange=None, retstart=0, retmax=20, usehistory=0, webenv=None

        Search the given database for records matching the query given
        in the 'term'.  See the module docstring for examples.

        'term' -- the query string in the Entrez query language; see
              http://www.ncbi.nlm.nih.gov/entrez/query/static/help/pmhelp.html
        'db' -- the database to search

        'field' -- the field to use for unqualified words
              E.g., "dalke[au] AND gene" with field==None becomes
                  dalke[au] AND (genes[MeSH Terms] OR gene[Text Word])
              and "dalke[au] AND gene" with field=="au" becomes
                  dalke[au] AND genes[Author]
              (Yes, I think the first "au" should be "Author" too)

        'daterange' -- a date restriction; either WithinNDays or DateRange
        'retstart' -- include identifiers in the output, starting with
              position 'retstart' (normally starts with 0)
        'retmax' -- return at most 'retmax' identifiers in the output
              (if not specified, NCBI returns 20 identifiers)

        'usehistory' -- flag to enable history tracking
        'webenv' -- if this string is given, add the search results
              to an existing history.  (WARNING: the history disappears
              after about an hour of non-use.)

        You will need to parse the output XML to get the new QueryKey
        and WebEnv fields.

        Returns an input stream from an HTTP request.  The stream
        contents are in XML.
        """
        query = {"term": term,
                 "db": db,
                 "field": field,
                 "retstart": retstart,
                 "retmax": retmax,
                 "usehistory": usehistory,
                 "WebEnv": webenv,
                 }
        if daterange is not None:
            query.update(daterange.get_query_params())

        return self._get(program = "esearch.fcgi", query = query)
    def epost(self,
              dbids,
              webenv = None,
              ):
        """dbids, webenv = None

        Create a new collection in the history containing the given
        list of identifiers for a database.

        'dbids' -- a DBIds, which contains the database name and
              a list of identifiers in that database
        'webenv' -- if this string is given, add the collection
              to an existing history.  (WARNING: the history disappears
              after about an hour of non-use.)

        You will need to parse the output XML to get the new QueryKey
        and WebEnv fields.  NOTE: the order of the IDs on the server
        is NOT NECESSARILY the same as the upload order.

        Returns an input stream from an HTTP request.  The stream
        contents are in XML.
        """
        id_string = _dbids_to_id_string(dbids)

        # EPost sends the identifier list in the request body (POST)
        # rather than in the URL, so very long lists are safe.
        program = "epost.fcgi"
        query = {"id": id_string,
                 "db": dbids.db,
                 "WebEnv": webenv,
                 }
        q = self._fixup_query(query)

        self._wait()

        if DUMP_URL:
            print "Opening with POST:", self.baseurl + program + "?" + q
        if DUMP_RESULT:
            print " ================== Results ============= "
            s = self.opener.open(self.baseurl + program, q).read()
            print s
            print " ================== Finished ============ "
            return cStringIO.StringIO(s)
        return self.opener.open(self.baseurl + program, q)
    def esummary_using_history(self,
                               db,
                               webenv,
                               query_key,
                               retstart = 0,
                               retmax = 20,
                               retmode = "xml",
                               ):
        """db, webenv, query_key, retstart = 0, retmax = 20, retmode = "xml"

        Get the summary for a collection of records in the history

        'db' -- the database containing the history/collection
        'webenv' -- the WebEnv cookie for the history
        'query_key' -- the collection in the history
        'retstart' -- get the summaries starting with this position
        'retmax' -- get at most this many summaries
        'retmode' -- can only be 'xml'.  (Are there others?)

        Returns an input stream from an HTTP request.  The stream
        contents are in 'retmode' format.
        """
        return self._get(program = "esummary.fcgi",
                         query = {"db": db,
                                  "WebEnv": webenv,
                                  "query_key": query_key,
                                  "retstart": retstart,
                                  "retmax": retmax,
                                  "retmode": retmode,
                                  })
809 """dbids, retmode = "xml"
810
811 Get the summary for records specified by identifier
812
813 'dbids' -- a DBIds containing the database name and list
814 of record identifiers
815 'retmode' -- can only be 'xml'
816
817 Returns an input stream from an HTTP request. The stream
818 contents are in 'retmode' format.
819 """
820
821 id_string = _dbids_to_id_string(dbids)
822 return self._get(program = "esummary.fcgi",
823 query = {"id": id_string,
824 "db": dbids.db,
825
826 "retmode": retmode,
827 })
828
829 - def efetch_using_history(self,
830 db,
831 webenv,
832 query_key,
833
834 retstart = 0,
835 retmax = 20,
836
837 retmode = None,
838 rettype = None,
839
840
841 seq_start = None,
842 seq_stop = None,
843 strand = None,
844 complexity = None,
845 ):
846 """db, webenv, query_key, retstart=0, retmax=20, retmode=None, rettype=None, seq_start=None, seq_stop=None, strand=None, complexity=None
847
848 Fetch information for a collection of records in the history,
849 in a variety of formats.
850
851 'db' -- the database containing the history/collection
852 'webenv' -- the WebEnv cookie for the history
853 'query_key' -- the collection in the history
854 'retstart' -- get the formatted data starting with this position
855 'retmax' -- get data for at most this many records
856
857 These options work for sequence databases
858
859 'seq_start' -- return the sequence starting at this position.
860 The first position is numbered 1
861 'seq_stop' -- return the sequence ending at this position
862 Includes the stop position, so seq_start = 1 and
863 seq_stop = 5 returns the first 5 bases/residues.
864 'strand' -- strand. Use EUtils.PLUS_STRAND (== 1) for plus
865 strand and EUtils.MINUS_STRAND (== 2) for negative
866 'complexity' -- regulates the level of display. Options are
867 0 - get the whole blob
868 1 - get the bioseq for gi of interest (default in Entrez)
869 2 - get the minimal bioseq-set containing the gi of interest
870 3 - get the minimal nuc-prot containing the gi of interest
871 4 - get the minimal pub-set containing the gi of interest
872
873 http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html
874
875 The valid retmode and rettype values are
876
877 For publication databases (omim, pubmed, journals) the
878 retmodes are 'xml', 'asn.1', 'text', and 'html'.
879
880 If retmode == xml ---> XML (default)
881 if retmode == asn.1 ---> ASN.1
882
883 The following rettype values work for retmode == 'text'.
884
885 docsum ----> author / title / cite / PMID
886 brief ----> a one-liner up to about 66 chars
887 abstract ----> cite / title / author / dept /
888 full abstract / PMID
889 citation ----> cite / title / author / dept /
890 full abstract / MeSH terms /
891 substances / PMID
892 medline ----> full record in medline format
893 asn.1 ----> full record in one ASN.1 format
894 mlasn1 ----> full record in another ASN.1 format
895 uilist ----> list of uids, one per line
896 sgml ----> same as retmode="xml"
897
898 Sequence databases (genome, protein, nucleotide, popset)
899 also have retmode values of 'xml', 'asn.1', 'text', and
900 'html'.
901
902 If retmode == 'xml' ---> XML (default; only supports
903 rettype == 'native')
904 If retmode == 'asn.1' ---> ASN.1 text (only works for rettype
905 of 'native' and 'sequin')
906
907 The following work with a retmode of 'text' or 'html'
908
909 native ----> Default format for viewing sequences
910 fasta ----> FASTA view of a sequence
911 gb ----> GenBank view for sequences, constructed sequences
912 will be shown as contigs (by pointing to its parts).
913 Valid for nucleotides.
914 gbwithparts --> GenBank view for sequences, the sequence will
915 always be shown. Valid for nucleotides.
916 est ----> EST Report. Valid for sequences from
917 dbEST database.
918 gss ----> GSS Report. Valid for sequences from dbGSS
919 database.
920 gp ----> GenPept view. Valid for proteins.
921 seqid ----> To convert list of gis into list of seqids
922 acc ----> To convert list of gis into list of accessions
923
924 # XXX TRY THESE
925 fasta_xml
926 gb_xml
927 gi (same as uilist?)
928
929
930
931 A retmode of 'file' is the same as 'text' except the data is
932 sent with a Content-Type of application/octet-stream, which tells
933 the browser to save the data to a file.
934
935 A retmode of 'html' is the same as 'text' except a HTML header
936 and footer are added and special character are properly escaped.
937
938 Returns an input stream from an HTTP request. The stream
939 contents are in the requested format.
940 """

        # When fetching from the start with a large retmax, clear retmax
        # so it is left out of the query (None-valued parameters are not
        # sent to the server).
        if retstart == 0 and retmax > 500:
            retmax = None
        return self._get(program = "efetch.fcgi",
                         query = {"db": db,
                                  "WebEnv": webenv,
                                  "query_key": query_key,
                                  "retstart": retstart,
                                  "retmax": retmax,
                                  "retmode": retmode,
                                  "rettype": rettype,
                                  "seq_start": seq_start,
                                  "seq_stop": seq_stop,
                                  "strand": strand,
                                  "complexity": complexity,
                                  })
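The retstart/retmax special case above works because parameters whose value is None are dropped before the URL is built. A minimal sketch of that filtering, assuming a `_get`-style helper (the names here are illustrative, not the module's real internals):

```python
from urllib.parse import urlencode

def build_query_string(params):
    # Drop None-valued parameters, as the ThinClient does before
    # issuing the EUtils request (illustrative sketch only).
    return urlencode({k: v for k, v in params.items() if v is not None})

params = {"db": "pubmed", "retstart": 0, "retmax": None,
          "retmode": "text", "rettype": None}
print(build_query_string(params))  # db=pubmed&retstart=0&retmode=text
```

With retmax set to None, the parameter simply disappears from the query string rather than being sent as an empty value.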

    def efetch_using_dbids(self,
                           dbids,
                           retmode = None,
                           rettype = None,
                           seq_start = None,
                           seq_stop = None,
                           strand = None,
                           complexity = None,
                           ):
        """dbids, retmode = None, rettype = None, seq_start = None, seq_stop = None, strand = None, complexity = None

        Fetch information for records specified by identifier

        'dbids' -- a DBIds containing the database name and list
                   of record identifiers
        'retmode' -- See the docstring for 'efetch_using_history'
        'rettype' -- See the docstring for 'efetch_using_history'

        These options work for sequence databases:

        'seq_start' -- return the sequence starting at this position.
                   The first position is numbered 1
        'seq_stop' -- return the sequence ending at this position.
                   Includes the stop position, so seq_start = 1 and
                   seq_stop = 5 returns the first 5 bases/residues.
        'strand' -- which strand. Use EUtils.PLUS_STRAND (== 1) for the
                   plus strand and EUtils.MINUS_STRAND (== 2) for the
                   minus strand
        'complexity' -- regulates the level of display. Options are:
                   0 - get the whole blob
                   1 - get the bioseq for the gi of interest (default in Entrez)
                   2 - get the minimal bioseq-set containing the gi of interest
                   3 - get the minimal nuc-prot containing the gi of interest
                   4 - get the minimal pub-set containing the gi of interest

        Returns an input stream from an HTTP request. The stream
        contents are in the requested format.
        """
        id_string = _dbids_to_id_string(dbids)
        return self._get(program = "efetch.fcgi",
                         query = {"id": id_string,
                                  "db": dbids.db,
                                  "retmode": retmode,
                                  "rettype": rettype,
                                  "seq_start": seq_start,
                                  "seq_stop": seq_stop,
                                  "strand": strand,
                                  "complexity": complexity,
                                  })
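EFetch receives the identifiers as a single comma-separated "id" parameter; `_dbids_to_id_string` (defined elsewhere in this module) presumably produces that form. A sketch under that assumption, using a hypothetical stand-in name:

```python
def dbids_to_id_string(ids):
    # Join record identifiers into the comma-separated "id" value
    # that EFetch expects (hypothetical stand-in for the module's
    # private _dbids_to_id_string helper).
    return ",".join(str(i) for i in ids)

print(dbids_to_id_string([9298984, 8871923]))  # 9298984,8871923
```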

    def elink_using_history(self,
                            dbfrom,
                            webenv,
                            query_key,
                            db = "pubmed",
                            retstart = 0,
                            retmax = 20,
                            cmd = "neighbor",
                            retmode = None,
                            term = None,
                            field = None,
                            daterange = None,
                            ):
        """dbfrom, webenv, query_key, db="pubmed", retstart=0, retmax=20, cmd="neighbor", retmode=None, term=None, field=None, daterange=None

        Find records related (in various ways) to a collection of
        records in the history.

        'dbfrom' -- the name of the database containing the
                   collection of records. NOTE! For the other methods
                   this is named 'db'. But I'm keeping NCBI's notation.
                   This is where the records come FROM.
        'webenv' -- the WebEnv cookie for the history
        'query_key' -- the collection in the history

        'db' -- Where the records link TO. This is where you want to
                   find the new records. For example, if you want to
                   find PubMed records related to a protein then 'dbfrom'
                   is 'protein' and 'db' is 'pubmed'.

        'cmd' -- one of the following (unless specified, retmode is the
                   default value, which returns data in XML)
            neighbor: Display neighbors and their scores by database and ID.
                   (This is the default 'cmd'.)
            prlinks: List the hyperlink to the primary LinkOut provider
                   for multiple IDs and database.
                   When retmode == 'ref' this URL redirects the browser
                   to the primary LinkOut provider for a single ID
                   and database.
            llinks: List LinkOut URLs and attributes for multiple IDs
                   and database.
            lcheck: Check for the existence (Y or N) of an external
                   link for multiple IDs and database.
            ncheck: Check for the existence of a neighbor link for
                   each ID, e.g., Related Articles in PubMed.

        'retstart' -- get the formatted data starting with this position
        'retmax' -- get data for at most this many records

        'retmode' -- only used with 'prlinks'

        'term' -- restrict results to records which also match this
                   Entrez search
        'field' -- the field to use for unqualified words

        'daterange' -- restrict results to records which also match this
                   date criterion; either WithinNDays or DateRange.
                   NOTE: DateRange must have both mindate and maxdate

        Some examples:
          In PubMed, to get a list of "Related Articles"
              dbfrom = pubmed
              cmd = neighbor

          To get only related articles in the MEDLINE subset
              dbfrom = pubmed
              db = pubmed
              term = medline[sb]
              cmd = neighbor

          Given a PubMed record, find the related nucleotide records
              dbfrom = pubmed
              db = nucleotide (or "protein" for related protein records)
              cmd = neighbor

          To get "LinkOuts" (external links) for a PubMed record set
              dbfrom = pubmed
              cmd = llinks

          Get the primary link information for a PubMed document; includes
          various hyperlinks, image URL for the provider, etc.
              dbfrom = pubmed
              cmd = prlinks
              (optional) retmode = "ref" (causes a redirect to the provider)

        Returns an input stream from an HTTP request. The stream
        contents are in XML unless 'retmode' is 'ref'.
        """
        query = {"WebEnv": webenv,
                 "query_key": query_key,
                 "db": db,
                 "dbfrom": dbfrom,
                 "cmd": cmd,
                 "retstart": retstart,
                 "retmax": retmax,
                 "retmode": retmode,
                 "term": term,
                 "field": field,
                 }
        if daterange is not None:
            from Bio.EUtils import Datatypes
            # Only DateRange carries mindate/maxdate; a WithinNDays
            # passes through unchecked, matching elink_using_dbids.
            if isinstance(daterange, Datatypes.DateRange) and \
               (daterange.mindate is None or daterange.maxdate is None):
                raise TypeError("Both mindate and maxdate must be set for eLink")
            query.update(daterange.get_query_params())
        return self._get(program = "elink.fcgi", query = query)
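ELink needs both endpoints of a date range, which is why a DateRange with a missing mindate or maxdate is rejected above. A minimal sketch of a DateRange-like object whose `get_query_params` yields the extra ELink date parameters (the attribute and method names follow this module's usage; the real `EUtils.Datatypes.DateRange` implementation is not shown here, and the "edat" default is an assumption):

```python
class DateRange:
    # Minimal stand-in for EUtils.Datatypes.DateRange; both endpoints
    # must be given, matching the check in elink_using_history.
    def __init__(self, mindate, maxdate, datetype="edat"):
        if mindate is None or maxdate is None:
            raise TypeError("Both mindate and maxdate must be set for eLink")
        self.mindate = mindate
        self.maxdate = maxdate
        self.datetype = datetype

    def get_query_params(self):
        # The extra query parameters merged into the ELink request.
        return {"mindate": self.mindate, "maxdate": self.maxdate,
                "datetype": self.datetype}

params = DateRange("2001/01/01", "2001/06/30").get_query_params()
print(sorted(params))  # ['datetype', 'maxdate', 'mindate']
```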

    def elink_using_dbids(self,
                          dbids,
                          db = "pubmed",
                          cmd = "neighbor",
                          retmode = None,
                          term = None,
                          field = None,
                          daterange = None,
                          ):
        """dbids, db="pubmed", cmd="neighbor", retmode=None, term=None, field=None, daterange=None

        Find records related (in various ways) to a set of records
        specified by identifier.

        'dbids' -- a DBIds containing the database name and list
                   of record identifiers
        'db' -- Where the records link TO. This is where you want to
                   find the new records. For example, if you want to
                   find PubMed records related to a protein then 'db'
                   is 'pubmed'. (The database they are from is part
                   of the DBIds object.)

        'cmd' -- see the docstring for 'elink_using_history'
        'retmode' -- see 'elink_using_history'
        'term' -- see 'elink_using_history'
        'field' -- see 'elink_using_history'
        'daterange' -- see 'elink_using_history'

        Returns an input stream from an HTTP request. The stream
        contents are in XML unless 'retmode' is 'ref'.
        """
        id_string = _dbids_to_id_string(dbids)
        query = {"id": id_string,
                 "db": db,
                 "dbfrom": dbids.db,
                 "cmd": cmd,
                 "retmode": retmode,
                 "field": field,
                 "term": term,
                 }
        if daterange is not None:
            from Bio.EUtils import Datatypes
            if isinstance(daterange, Datatypes.DateRange) and \
               (daterange.mindate is None or daterange.maxdate is None):
                raise TypeError("Both mindate and maxdate must be set for eLink")
            query.update(daterange.get_query_params())

        return self._get(program = "elink.fcgi", query = query)
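Putting the docstring's "Related Articles" example together, the parameters sent to elink.fcgi would look roughly like this (a sketch of the query only; the real request goes through self._get, and the helper name here is illustrative):

```python
def make_related_articles_query(pubmed_ids):
    # Build the ELink parameters for PubMed "Related Articles"
    # (illustrative helper, not part of the ThinClient API).
    return {"id": ",".join(str(i) for i in pubmed_ids),
            "dbfrom": "pubmed",
            "db": "pubmed",
            "cmd": "neighbor"}

q = make_related_articles_query([12368430])
print(q["id"], q["cmd"])  # 12368430 neighbor
```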