
Source Code for Module Bio.EUtils.ThinClient

   1  """Low-level interface to NCBI's EUtils for Entrez search and retrieval. 
   2   
   3  For higher-level interfaces, see DBIdsClient (which works with a set 
   4  of database identifiers) and HistoryClient (which does a much better 
   5  job of handling history). 
   6   
   7  There are five classes of services: 
   8    ESearch - search a database 
   9    EPost - upload a list of indices for further use 
  10    ESummary - get document summaries for a given set of records 
  11    EFetch - get the records translated to a given format 
  12    ELink - find related records in other databases 
  13   
  14  You can find more information about them at 
  15    http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html 
  16  but that document isn't very useful.  Perhaps the following is better. 
  17   
  18  EUtils offers a structured way to query Entrez, get the results in 
  19  various formats, and get information about related documents.  The way 
  20  to start off is to create an EUtils object. 
  21   
  22  >>> from Bio import EUtils 
  23  >>> from Bio.EUtils.ThinClient import ThinClient 
  24  >>> eutils = ThinClient.ThinClient() 
  25  >>>  
  26   
  27  You can search Entrez with the "esearch" method.  This does a query on 
  28  the server, which generates a list of identifiers for records that 
  29  matched the query.  However, not all the identifiers are returned. 
  30  You can request only a subset of the matches (using the 'retstart' and 
  31  'retmax' terms).  This is useful because searches like 'cancer' can 
  32  have over 1.4 million matches.  Most people would rather change the 
  33  query or look at more details about the first few hits than wait to 
  34  download all the identifiers before doing anything else. 
  35   
  36  The esearch method, and indeed all these methods, returns a 
  37  'urllib.addinfourl' which is an HTTP socket connection that has 
  38  already parsed the HTTP header and is ready to read the data from the 
  39  server. 
  40   
  41  For example, here's a query and how to use it 
  42   
  43    Search in PubMed for the term cancer for the entrez date from the 
  44    last 60 days and retrieve the first 10 IDs and translations using 
  45    the history parameter. 
  46   
  47  >>> infile = eutils.esearch("cancer", 
  48  ...                         daterange = EUtils.WithinNDays(60, "edat"), 
  49  ...                         retmax = 10) 
  50  >>> 
  51  >>> print infile.read() 
  52  <?xml version="1.0"?> 
  53  <!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD eSearchResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd"> 
  54  <eSearchResult> 
  55          <Count>7228</Count> 
  56          <RetMax>10</RetMax> 
  57          <RetStart>0</RetStart> 
  58          <IdList> 
  59                  <Id>12503096</Id> 
  60                  <Id>12503075</Id> 
  61                  <Id>12503073</Id> 
  62                  <Id>12503033</Id> 
  63                  <Id>12503030</Id> 
  64                  <Id>12503028</Id> 
  65                  <Id>12502932</Id> 
  66                  <Id>12502925</Id> 
  67                  <Id>12502881</Id> 
  68                  <Id>12502872</Id> 
  69          </IdList> 
  70          <TranslationSet> 
  71                  <Translation> 
  72                          <From>cancer%5BAll+Fields%5D</From> 
  73                          <To>(%22neoplasms%22%5BMeSH+Terms%5D+OR+cancer%5BText+Word%5D)</To> 
  74                  </Translation> 
  75          </TranslationSet> 
  76          <TranslationStack> 
  77                  <TermSet> 
  78                          <Term>"neoplasms"[MeSH Terms]</Term> 
  79                          <Field>MeSH Terms</Field> 
  80                          <Count>1407151</Count> 
  81                          <Explode>Y</Explode> 
  82                  </TermSet> 
  83                  <TermSet> 
  84                          <Term>cancer[Text Word]</Term> 
  85                          <Field>Text Word</Field> 
  86                          <Count>382919</Count> 
  87                          <Explode>Y</Explode> 
  88                  </TermSet> 
  89                  <OP>OR</OP> 
  90                  <TermSet> 
  91                          <Term>2002/10/30[edat]</Term> 
  92                          <Field>edat</Field> 
  93                          <Count>-1</Count> 
  94                          <Explode>Y</Explode> 
  95                  </TermSet> 
  96                  <TermSet> 
  97                          <Term>2002/12/29[edat]</Term> 
  98                          <Field>edat</Field> 
  99                          <Count>-1</Count> 
 100                          <Explode>Y</Explode> 
 101                  </TermSet> 
 102                  <OP>RANGE</OP> 
 103                  <OP>AND</OP> 
 104          </TranslationStack> 
 105  </eSearchResult> 
 106   
 107  >>> 
 108   
 109  You get a raw XML input stream which you can process in many ways. 
  110  (The appropriate DTDs are included in the subdirectory "DTDs"; see 
  111  also the included POM reading code.) 
 112   
 113      WARNING! As of this writing (2002/12/3) NCBI returns their 
 114      XML encoded as Latin-1 but their processing instruction says 
 115      it is UTF-8 because they leave out the "encoding" attribute. 
 116      Until they fix it you will need to recode the input stream 
 117      before processing it with XML tools, like this 
 118   
 119          import codecs 
 120          infile = codecs.EncodedFile(infile, "utf-8", "iso-8859-1") 
 121   
 122   
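The recoding recipe above feeds straight into a parser.  Here is a minimal sketch using `xml.etree.ElementTree` on a trimmed stand-in for a real eSearchResult (the stand-in happens to be pure ASCII, but the recoding matters once accented characters appear in the data):

```python
import codecs
import io
import xml.etree.ElementTree as ET

# A trimmed stand-in for a real eSearchResult response, encoded as
# Latin-1 the way NCBI was serving it at the time.
latin1_stream = io.BytesIO(
    b'<?xml version="1.0"?>\n'
    b'<eSearchResult>\n'
    b'  <Count>7228</Count>\n'
    b'  <IdList><Id>12503096</Id><Id>12503075</Id></IdList>\n'
    b'</eSearchResult>\n')

# Recode the stream to UTF-8 on the fly, then hand it to the parser.
infile = codecs.EncodedFile(latin1_stream, "utf-8", "iso-8859-1")
tree = ET.parse(infile)
count = int(tree.find("Count").text)
ids = [e.text for e in tree.findall(".//Id")]
```

In real use `latin1_stream` would be the object returned by `esearch`; everything else stays the same.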
 123  The XML fields are mostly understandable: 
 124    Count -- the total number of matches from this search 
 125    RetMax -- the number of <ID> values returned in this subset 
 126    RetStart -- the start position of this subset in the list of 
 127        all matches 
 128   
 129    IDList and ID -- the identifiers in this subset 
 130   
 131    TranslationSet / Translation -- if the search field is not 
 132        explicitly specified ("qualified"), then the server will 
  133        apply a set of heuristics to improve the query.  Eg, in 
 134        this case "cancer" is first parsed as 
 135          cancer[All Fields] 
 136        then turned into the query 
 137          "neoplasms"[MeSH Terms] OR cancer[Text Word] 
 138   
 139        Note that these terms are URL escaped. 
 140        For details on how the translation is done, see 
 141  http://www.ncbi.nlm.nih.gov/entrez/query/static/help/pmhelp.html#AutomaticTermMapping 
 142   
  143    TranslationStack -- The (possibly 'improved') query fully 
 144        parsed out and converted into a postfix (RPN) notation. 
 145        The above example is written in the Entrez query language as 
 146   
 147          ("neoplasms"[MeSH Terms] OR cancer[Text Word]) AND 
 148                       2002/10/30:2002/12/29[edat] 
 149        Note that these terms are *not* URL escaped.  Nothing like 
 150        a bit of inconsistency for the soul. 
 151   
 152        The "Count" field shows how many matches were found for each 
 153        term of the expression.  I don't know what "Explode" does. 
 154   
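Since the Translation terms come back URL-escaped, a quick sketch of undoing that escaping before displaying them (`unquote_plus` lives in `urllib.parse` on modern Python, and in `urllib` on the Python 2 this module was written for):

```python
# Undo the URL escaping on the <From>/<To> terms shown above.
try:
    from urllib.parse import unquote_plus   # Python 3
except ImportError:
    from urllib import unquote_plus         # Python 2

frm = unquote_plus("cancer%5BAll+Fields%5D")
to = unquote_plus("(%22neoplasms%22%5BMeSH+Terms%5D+OR+cancer%5BText+Word%5D)")
# frm is now 'cancer[All Fields]'
```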
 155   
 156  Let's get more information about the first record, which has an id of 
 157  12503096.  There are two ways to query for information, one uses a set 
 158  of identifiers and the other uses the history.  I'll talk about the 
 159  history one in a bit.  To use a set of identifiers you need to make a 
  160  DBIds object containing that list. 
 161   
 162  >>> dbids = EUtils.DBIds("pubmed", ["12503096"]) 
 163  >>> 
 164   
 165  Now get the summary using dbids 
 166   
 167  >>> infile = eutils.esummary_using_dbids(dbids) 
 168  >>> print infile.read() 
 169  <?xml version="1.0"?> 
 170  <!DOCTYPE eSummaryResult PUBLIC "-//NLM//DTD eSummaryResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSummary_020511.dtd"> 
 171  <eSummaryResult> 
 172  <DocSum> 
 173          <Id>12503096</Id> 
 174          <Item Name="PubDate" Type="Date">2003 Jan 30</Item> 
 175          <Item Name="Source" Type="String">Am J Med Genet</Item> 
 176          <Item Name="Authors" Type="String">Coyne JC, Kruus L, Racioppo M, Calzone KA, Armstrong K</Item> 
 177          <Item Name="Title" Type="String">What do ratings of cancer-specific distress mean among women at high risk of breast and ovarian cancer?</Item> 
 178          <Item Name="Volume" Type="String">116</Item> 
 179          <Item Name="Pages" Type="String">222-8</Item> 
 180          <Item Name="EntrezDate" Type="Date">2002/12/28 04:00</Item> 
 181          <Item Name="PubMedId" Type="Integer">12503096</Item> 
 182          <Item Name="MedlineId" Type="Integer">22390532</Item> 
 183          <Item Name="Lang" Type="String">English</Item> 
 184          <Item Name="PubType" Type="String"></Item> 
 185          <Item Name="RecordStatus" Type="String">PubMed - in process</Item> 
 186          <Item Name="Issue" Type="String">3</Item> 
 187          <Item Name="SO" Type="String">2003 Jan 30;116(3):222-8</Item> 
 188          <Item Name="DOI" Type="String">10.1002/ajmg.a.10844</Item> 
 189          <Item Name="JTA" Type="String">3L4</Item> 
 190          <Item Name="ISSN" Type="String">0148-7299</Item> 
 191          <Item Name="PubId" Type="String"></Item> 
 192          <Item Name="PubStatus" Type="Integer">4</Item> 
 193          <Item Name="Status" Type="Integer">5</Item> 
 194          <Item Name="HasAbstract" Type="Integer">1</Item> 
 195          <Item Name="ArticleIds" Type="List"> 
 196                  <Item Name="PubMedId" Type="String">12503096</Item> 
 197                  <Item Name="DOI" Type="String">10.1002/ajmg.a.10844</Item> 
 198                  <Item Name="MedlineUID" Type="String">22390532</Item> 
 199          </Item> 
 200  </DocSum> 
 201  </eSummaryResult> 
 202  >>> 
 203   
 204  This is just a summary.  To get the full details, including an 
 205  abstract (if available) use the 'efetch' method.  I'll only print a 
 206  bit to convince you it has an abstract. 
 207   
 208  >>> s = eutils.efetch_using_dbids(dbids).read() 
 209  >>> print s[587:860] 
 210  <ArticleTitle>What do ratings of cancer-specific distress mean among women at high risk of breast and ovarian cancer?</ArticleTitle> 
 211  <Pagination> 
 212  <MedlinePgn>222-8</MedlinePgn> 
 213  </Pagination> 
 214  <Abstract> 
 215  <AbstractText>Women recruited from a hereditary cancer registry provided 
 216  >>> 
 217   
 218  Suppose instead you want the data in a text format.  Different 
 219  databases have different text formats.  For example, PubMed has a 
 220  "docsum" format which gives just the summary of a document and 
 221  "medline" format as needed for a citation database.  To get these, use 
 222  a "text" "retmode" ("return mode") and select the appropriate 
 223  "rettype" ("return type"). 
 224   
 225  Here are examples of those two return types 
 226   
 227  >>> print eutils.efetch_using_dbids(dbids, "text", "docsum").read()[:497] 
 228  1:  Coyne JC, Kruus L, Racioppo M, Calzone KA, Armstrong K. 
 229  What do ratings of cancer-specific distress mean among women at high risk of breast and ovarian cancer? 
 230  Am J Med Genet. 2003 Jan 30;116(3):222-8. 
 231  PMID: 12503096 [PubMed - in process] 
 232  >>> print eutils.efetch_using_dbids(dbids, "text", "medline").read()[:369] 
 233  UI  - 22390532 
 234  PMID- 12503096 
 235  DA  - 20021227 
 236  IS  - 0148-7299 
 237  VI  - 116 
 238  IP  - 3 
 239  DP  - 2003 Jan 30 
 240  TI  - What do ratings of cancer-specific distress mean among women at high risk 
 241        of breast and ovarian cancer? 
 242  PG  - 222-8 
 243  AB  - Women recruited from a hereditary cancer registry provided ratings of 
 244        distress associated with different aspects of high-risk status 
 245  >>>  
 246   
 247  It's also possible to get a list of records related to a given 
 248  article.  This is done through the "elink" method.  For example, 
 249  here's how to get the list of PubMed articles related to the above 
 250  PubMed record.  (Again, truncated because otherwise there is a lot of 
 251  data.) 
 252   
 253  >>> print eutils.elink_using_dbids(dbids).read()[:590] 
 254  <?xml version="1.0"?> 
 255  <!DOCTYPE eLinkResult PUBLIC "-//NLM//DTD eLinkResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eLink_020511.dtd"> 
 256  <eLinkResult> 
 257  <LinkSet> 
 258          <DbFrom>pubmed</DbFrom> 
 259          <IdList> 
 260                  <Id>12503096</Id> 
 261          </IdList> 
 262          <LinkSetDb> 
 263                  <DbTo>pubmed</DbTo> 
 264                  <LinkName>pubmed_pubmed</LinkName> 
 265                  <Link> 
 266                          <Id>12503096</Id> 
 267                          <Score>2147483647</Score> 
 268                  </Link> 
 269                  <Link> 
 270                          <Id>11536413</Id> 
 271                          <Score>30817790</Score> 
 272                  </Link> 
 273                  <Link> 
 274                          <Id>11340606</Id> 
 275                          <Score>29939219</Score> 
 276                  </Link> 
 277                  <Link> 
 278                          <Id>10805955</Id> 
 279                          <Score>29584451</Score> 
 280                  </Link> 
 281  >>> 
 282   
 283  For a change of pace, let's work with the protein database to learn 
  284  how to work with history.  Suppose I want to do a multiple sequence 
 285  alignment of bacteriorhodopsin with all of its neighbors, where 
 286  "neighbors" is defined by NCBI.  There are good programs for this -- I 
 287  just need to get the records in the right format, like FASTA. 
 288   
 289  The bacteriorhodopsin I'm interested in is BAA75200, which is 
 290  GI:4579714, so I'll start by asking for its neighbors. 
 291   
 292  >>> results = eutils.elink_using_dbids( 
 293  ...             EUtils.DBIds("protein", ["4579714"]), 
 294  ...             db = "protein").read() 
 295  >>> print results[:454] 
 296  <?xml version="1.0"?> 
 297  <!DOCTYPE eLinkResult PUBLIC "-//NLM//DTD eLinkResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eLink_020511.dtd"> 
 298  <eLinkResult> 
 299  <LinkSet> 
 300          <DbFrom>protein</DbFrom> 
 301          <IdList> 
 302                  <Id>4579714</Id> 
 303          </IdList> 
 304          <LinkSetDb> 
 305                  <DbTo>protein</DbTo> 
 306                  <LinkName>protein_protein</LinkName> 
 307                  <Link> 
 308                          <Id>4579714</Id> 
 309                          <Score>2147483647</Score> 
 310                  </Link> 
 311                  <Link> 
 312                          <Id>11277596</Id> 
 313                          <Score>1279</Score> 
 314                  </Link> 
 315  >>> 
 316   
 317  Let's get all the <Id> fields.  (While the following isn't a good way 
 318  to parse XML, it is easy to understand and works well enough for this 
 319  example.)  Note that I remove the first <Id> because that's from the 
 320  query and not from the results. 
 321   
 322  >>> import re 
 323  >>> ids = re.findall(r"<Id>(\d+)</Id>", results) 
 324  >>> ids = ids[1:] 
 325  >>> len(ids) 
 326  222 
 327  >>> dbids = EUtils.DBIds("protein", ids) 
 328  >>>  
 329   
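When something sturdier than the regular expression is wanted, a real XML parser can tell the query's own `<IdList>` apart from the result `<Link>` entries.  A minimal sketch on a trimmed stand-in for the eLinkResult above (note the record still lists itself as its own top-scoring neighbor):

```python
import xml.etree.ElementTree as ET

# A trimmed stand-in for the eLinkResult shown above.
results = """<?xml version="1.0"?>
<eLinkResult><LinkSet>
  <DbFrom>protein</DbFrom>
  <IdList><Id>4579714</Id></IdList>
  <LinkSetDb>
    <DbTo>protein</DbTo>
    <Link><Id>4579714</Id><Score>2147483647</Score></Link>
    <Link><Id>11277596</Id><Score>1279</Score></Link>
  </LinkSetDb>
</LinkSet></eLinkResult>"""

root = ET.fromstring(results)
# Only <Id> elements under <Link> are results; the query's own id
# inside <IdList> is skipped automatically.
ids = [link.findtext("Id") for link in root.findall(".//LinkSetDb/Link")]
```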
 330  That's a lot of records.  I could use 'efetch_using_dbids' but there's 
 331  a problem with that.  Efetch uses the HTTP GET protocol to pass 
 332  information to the EUtils server.  ("GET" is what's used when you type 
 333  a URL in the browser.)  Each id takes about 9 characters, so the URL 
  334  would be over 2,000 characters long.  This may not work on some 
  335  systems; for example, some proxies do not support long URLs.  (Search 
 336  for "very long URLs" for examples.) 
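The arithmetic behind that claim, as a rough sketch (assuming GI numbers of about 8 digits, joined by commas into a single "id" parameter):

```python
# Rough size of the "id" parameter for the 222 neighbors found above.
n_ids = 222          # neighbors returned by elink
id_chars = 8         # a GI number is roughly 8 digits
length = n_ids * id_chars + (n_ids - 1)   # digits plus comma separators
# length is just under 2,000 characters before the base URL, the
# program name, and the other query fields are even added.
```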
 337   
 338  Instead, we'll upload the list to the server then fetch the FASTA 
 339  version using the history. 
 340   
 341  The first step is to upload the data.  We want to put that into the 
 342  history so we set 'usehistory' to true.  There's no existing history 
 343  so the webenv string is None. 
 344   
 345   
 346  >>> print eutils.epost(dbids, usehistory = 1, webenv = None).read() 
 347  <?xml version="1.0"?> 
 348  <!DOCTYPE ePostResult PUBLIC "-//NLM//DTD ePostResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/ePost_020511.dtd"> 
 349  <ePostResult> 
 350          <QueryKey>1</QueryKey> 
 351          <WebEnv>%7BPgTHRHFBsJfC%3C%5C%5C%5B%3EAfJCKQ%5Ey%60%3CGkH%5DH%5E%3DJHGBKAJ%3F%40CbCiG%3FE%3C</WebEnv> 
 352  </ePostResult> 
 353   
 354  >>> 
 355   
 356  This says that the identifiers were saved as query #1, which will be 
 357  used later on as the "query_key" field.  The WebEnv is a cookie (or 
 358  token) used to tell the server where to find that query.  The WebEnv 
 359  changes after every history-enabled ESearch or EPost so you'll need to 
 360  parse the output from those to get the new WebEnv field.  You'll also 
 361  need to unquote it since it is URL-escaped. 
 362   
 363  Also, you will need to pass in the name of the database used for the 
 364  query in order to access the history.  Why?  I don't know -- I figure 
 365  the WebEnv and query_key should be enough to get the database name. 
 366   
 367  >>> import urllib 
 368  >>> webenv = urllib.unquote("%7BPgTHRHFBsJfC%3C%5C%5C%5B%3EAfJCKQ%5Ey%60%3CGkH%5DH%5E%3DJHGBKAJ%3F%40CbCiG%3FE%3C") 
 369  >>> print webenv 
 370  {PgTHRHFBsJfC<\\[>AfJCKQ^y`<GkH]H^=JHGBKAJ?@CbCiG?E< 
 371  >>> 
 372   
 373  Okay, now to get the data in FASTA format.  Notice that I need the 
 374  'retmax' in order to include all the records in the result.  (The 
 375  default is 20 records.) 
 376   
 377  >>> fasta = eutils.efetch_using_history("protein", webenv, query_key = "1", 
 378  ...                                     retmode = "text", rettype = "fasta", 
 379  ...                                     retmax = len(dbids)).read() 
 380  >>> fasta.count(">") 
 381  222 
 382  >>> print fasta[:694] 
 383  >gi|14194475|sp|O93742|BACH_HALSD Halorhodopsin (HR) 
 384  MMETAADALASGTVPLEMTQTQIFEAIQGDTLLASSLWINIALAGLSILLFVYMGRNLEDPRAQLIFVAT 
 385  LMVPLVSISSYTGLVSGLTVSFLEMPAGHALAGQEVLTPWGRYLTWALSTPMILVALGLLAGSNATKLFT 
 386  AVTADIGMCVTGLAAALTTSSYLLRWVWYVISCAFFVVVLYVLLAEWAEDAEVAGTAEIFNTLKLLTVVL 
 387  WLGYPIFWALGAEGLAVLDVAVTSWAYSGMDIVAKYLFAFLLLRWVVDNERTVAGMAAGLGAPLARCAPA 
 388  DD 
 389  >gi|14194474|sp|O93741|BACH_HALS4 Halorhodopsin (HR) 
 390  MRSRTYHDQSVCGPYGSQRTDCDRDTDAGSDTDVHGAQVATQIRTDTLLHSSLWVNIALAGLSILVFLYM 
 391  ARTVRANRARLIVGATLMIPLVSLSSYLGLVTGLTAGPIEMPAAHALAGEDVLSQWGRYLTWTLSTPMIL 
 392  LALGWLAEVDTADLFVVIAADIGMCLTGLAAALTTSSYAFRWAFYLVSTAFFVVVLYALLAKWPTNAEAA 
 393  GTGDIFGTLRWLTVILWLGYPILWALGVEGFALVDSVGLTSWGYSLLDIGAKYLFAALLLRWVANNERTI 
 394  AVGQRSGRGAIGDPVED 
 395  >>>  
 396   
 397  To round things out, here's a query which refines the previous query. 
 398  I want to get all records from the first search which also have the 
 399  word "Structure" in them.  (My background was originally structural 
 400  biophysics, whaddya expect?  :) 
 401   
  402  >>> print eutils.esearch("#1 AND structure", db = "protein", usehistory = 1, 
 403  ...                     webenv = webenv).read() 
 404  <?xml version="1.0"?> 
 405  <!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD eSearchResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd"> 
 406  <eSearchResult> 
 407          <Count>67</Count> 
 408          <RetMax>20</RetMax> 
 409          <RetStart>0</RetStart> 
 410          <QueryKey>2</QueryKey> 
 411          <WebEnv>UdvMf%3F%60G%3DIE%60bG%3DGec%3E%3D%3Cbc_%5DgBAf%3EAi_e%5EAJcHgDi%3CIqGdE%7BmC%3C</WebEnv> 
 412          <IdList> 
 413                  <Id>461608</Id> 
 414                  <Id>114808</Id> 
 415                  <Id>1364150</Id> 
 416                  <Id>1363466</Id> 
 417                  <Id>1083906</Id> 
 418                  <Id>99232</Id> 
 419                  <Id>99212</Id> 
 420                  <Id>81076</Id> 
 421                  <Id>114811</Id> 
 422                  <Id>24158915</Id> 
 423                  <Id>24158914</Id> 
 424                  <Id>24158913</Id> 
 425                  <Id>1168615</Id> 
 426                  <Id>114812</Id> 
 427                  <Id>114809</Id> 
 428                  <Id>17942995</Id> 
 429                  <Id>17942994</Id> 
 430                  <Id>17942993</Id> 
 431                  <Id>20151159</Id> 
 432                  <Id>20150922</Id> 
 433          </IdList> 
 434          <TranslationSet> 
 435          </TranslationSet> 
 436          <TranslationStack> 
 437                  <TermSet> 
 438                          <Term>#1</Term> 
 439                          <Field>All Fields</Field> 
 440                          <Count>222</Count> 
 441                          <Explode>Y</Explode> 
 442                  </TermSet> 
 443                  <TermSet> 
 444                          <Term>structure[All Fields]</Term> 
 445                          <Field>All Fields</Field> 
 446                          <Count>142002</Count> 
 447                          <Explode>Y</Explode> 
 448                  </TermSet> 
 449                  <OP>AND</OP> 
 450          </TranslationStack> 
 451  </eSearchResult> 
 452   
 453  >>>  
 454   
 455  One last thing about history.  It doesn't last very long -- perhaps an 
  456  hour or so.  (Untested.)  You may be able to toss it some keep-alive 
  457  signal every once in a while.  Or you may want to keep a local copy. 
 458   
 459  The known 'db' fields and primary IDs (if known) are 
 460    genome -- GI number 
 461    nucleotide -- GI number 
 462    omim  -- MIM number 
 463    popset -- GI number 
 464    protein -- GI number 
 465    pubmed  -- PMID 
 466    sequences (not available; this will combine all sequence databases) 
 467    structure -- MMDB ID 
 468    taxonomy -- TAXID 
 469   
 470  The 'field' parameter is different for different databases.  The 
 471  fields for PubMed are listed at 
 472   
 473  http://www.ncbi.nlm.nih.gov/entrez/query/static/help/pmhelp.html#SearchFieldDescriptionsandTags 
 474   
 475    Affiliation -- AD 
 476    All Fields -- All 
 477    Author -- AU 
 478    EC/RN Number -- RN 
 479    Entrez Date -- EDAT  (also valid for 'datetype') 
 480    Filter -- FILTER 
 481    Issue -- IP 
 482    Journal Title -- TA 
 483    Language -- LA 
 484    MeSH Date -- MHDA  (also valid for 'datetype') 
 485    MeSH Major Topic -- MAJR 
 486    MeSH Subheadings -- SH 
 487    MeSH Terms -- MH 
 488    Pagination -- PG 
 489    Personal Name as Subject -- PS 
 490    Publication Date -- DP  (also valid for 'datetype') 
 491    Publication Type -- PT 
 492    Secondary Source ID -- SI 
 493    Subset -- SB 
 494    Substance Name -- NM 
 495    Text Words -- TW 
 496    Title -- TI 
 497    Title/Abstract -- TIAB 
 498    Unique Identifiers -- UID 
 499    Volume -- VI 
 500   
 501  The fields marked as 'datetype' can also be used for date searches. 
  502  Date searches can be done in the query, for example as 
  503   
  504     1990/01/01:1999/12/31[edat] 
  505   
  506  or by passing a WithinNDays or DateRange object to the 'daterange' 
  507  parameter of the search. 
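Building such a term is just string formatting.  A hypothetical helper (not part of the EUtils API) that produces the query syntax shown above:

```python
# Hypothetical helper: format a date-range term for the Entrez
# query language.  Dates are YYYY/MM/DD strings; 'field' is one of
# the datetype-capable fields (edat, mhda, dp).
def date_range_term(start, end, field="edat"):
    return "%s:%s[%s]" % (start, end, field)

term = date_range_term("1990/01/01", "1999/12/31")
```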
 508   
 509   
  510  Please pay attention to the usage limits!  They are listed at 
 511    http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html 
 512   
 513  At the time of this writing they are: 
 514      * Run retrieval scripts on weekends or between 9 PM and 5 AM ET 
 515              weekdays for any series of more than 100 requests. 
 516      * Make no more than one request every 3 seconds. 
 517      * Only 5000 PubMed records may be retrieved in a single day. 
 518   
 519      * NCBI's Disclaimer and Copyright notice must be evident to users 
  520        of your service.  NLM does not hold the copyright on the PubMed 
  521        abstracts; the journal publishers do.  NLM provides no legal 
  522        advice concerning distribution of copyrighted materials; consult 
 523        your legal counsel. 
 524   
 525  (Their disclaimer is at 
 526         http://www.ncbi.nlm.nih.gov/About/disclaimer.html ) 
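The one-request-every-3-seconds rule can be enforced client-side with a module-level timer, in the same spirit as the `_wait` helper defined later in this module.  A minimal sketch:

```python
import time

# Remember when the last request went out, at module level, so the
# limit holds even across multiple client objects.
_previous_request = time.time()

def wait_my_turn(delay=3.0):
    """Sleep just long enough that requests are >= 'delay' seconds apart."""
    global _previous_request
    pause = _previous_request + delay - time.time()
    if pause > 0:
        time.sleep(pause)
    _previous_request = time.time()
```

Call `wait_my_turn()` immediately before each request; the first call after a long idle period returns at once, while back-to-back calls sleep out the remainder of the delay.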
 527   
 528   
 529  """ # "  # Emacs cruft 
 530   
 531  import urllib, urllib2, cStringIO 
 532  import time 
 533   
 534  DUMP_URL = 0 
 535  DUMP_RESULT = 0 
 536   
 537  # These tell NCBI who is using the tool.  They are meant to provide 
 538  # hints to NCBI about how their service is being used and provide a 
 539  # means of getting ahold of the author. 
 540  # 
 541  # To use your own values, pass them in to the EUtils constructor. 
 542  # 
 543  TOOL = "EUtils_Python_client" 
 544  EMAIL = "biopython-dev@biopython.org" 
 545   
 546  assert " " not in TOOL 
 547  assert " " not in EMAIL 
 548   
  549  def _dbids_to_id_string(dbids):
  550      """Internal function: convert a list of ids to a comma-separated string"""
  551      # NOTE: the server strips out non-numeric characters
  552      # Eg, "-1" is treated as "1".  So do some sanity checking.
  553      # XXX Should I check for non-digits?
  554      # Are any of the IDs non-integers?
  555      if not dbids:
  556          raise TypeError("dbids list must have at least one term")
  557      for x in dbids.ids:
  558          if "," in x:
  559              raise TypeError("identifiers cannot contain a comma: %r " %
  560                              (x,))
  561      id_string = ",".join(dbids.ids)
  562      assert id_string.count(",") == len(dbids.ids)-1, "double checking"
  563      return id_string
  564  
  565  # Record the time at module level, in case the user has multiple copies
  566  # of the ThinClient class in operation at once.
  567  _open_previous = time.time()
  568  
  569  class ThinClient:
  570      """Client-side interface to the EUtils services
  571  
  572      See the module docstring for much more complete information.
  573      """
  574      def __init__(self,
  575                   opener = None,
  576                   tool = TOOL,
  577                   email = EMAIL,
  578                   baseurl = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"):
  579          """opener = None, tool = TOOL, email = EMAIL, baseurl = ".../eutils/"
  580  
  581          'opener' -- an object which implements the 'open' method like a
  582              urllib2.OpenerDirector.  Defaults to urllib2.build_opener()
  583  
  584          'tool' -- the term to use for the 'tool' field, used by NCBI to
  585              track which programs use their services.  If you write your
  586              own tool based on this package, use your own tool name.
  587  
  588          'email' -- a way for NCBI to contact you (the developer, not
  589              the user!) if there are problems and to tell you about
  590              updates or changes to their system.
  591  
  592          'baseurl' -- location of NCBI's EUtils directory.  Shouldn't need
  593              to change this at all.
  594          """
  595  
  596          if tool is not None and " " in tool:
  597              raise TypeError("No spaces allowed in 'tool'")
  598          if email is not None and " " in email:
  599              raise TypeError("No spaces allowed in 'email'")
  600  
  601          if opener is None:
  602              opener = urllib2.build_opener()
  603  
  604          self.opener = opener
  605          self.tool = tool
  606          self.email = email
  607          self.baseurl = baseurl
608
  609      def _fixup_query(self, query):
  610          """Internal function to add and remove fields from a query"""
  611          q = query.copy()
  612  
  613          # Set the 'tool' and 'email' fields
  614          q["tool"] = self.tool
  615          q["email"] = self.email
  616  
  617          # Kinda cheesy -- shouldn't really do this here.
  618          # If 'usehistory' is true, use the value of 'y' instead.
  619          # Otherwise, don't use history
  620          if "usehistory" in q:
  621              if q["usehistory"]:
  622                  q["usehistory"] = "y"
  623              else:
  624                  q["usehistory"] = None
  625  
  626          # This will also remove the history, email, etc. fields
  627          # if they are set to None.
  628          for k, v in q.items():
  629              if v is None:
  630                  del q[k]
  631  
  632          # Convert the query into the form needed for a GET.
  633          return urllib.urlencode(q)
634
  635      def _wait(self, delay = 3.0):
  636          """Enforce the NCBI requirement of one request every three seconds.
  637  
  638          Ideally the calling code would have respected the 3 second rule,
  639          but as this often hasn't happened we check this here.
  640  
  641          delay - number of seconds between queries."""
  642          global _open_previous
  643          wait = _open_previous + delay - time.time()
  644          if wait > 0:
  645              time.sleep(wait)
  646          _open_previous = time.time()
647
  648      def _get(self, program, query):
  649          """Internal function: send the query string to the program as GET"""
  650          # NOTE: epost uses a different interface
  651  
  652          self._wait()
  653  
  654          q = self._fixup_query(query)
  655          url = self.baseurl + program + "?" + q
  656          if DUMP_URL:
  657              print "Opening with GET:", url
  658          if DUMP_RESULT:
  659              print " ================== Results ============= "
  660              s = self.opener.open(url).read()
  661              print s
  662              print " ================== Finished ============ "
  663              return cStringIO.StringIO(s)
  664          return self.opener.open(url)
665
  666      def esearch(self,
  667                  term,              # In Entrez query language
  668                  db = "pubmed",     # Required field, default to PubMed
  669                  field = None,      # Field to use for unqualified words
  670                  daterange = None,  # Date restriction
  671  
  672                  retstart = 0,
  673                  retmax = 20,       # Default from NCBI is 20, so I'll use that
  674  
  675                  usehistory = 0,    # Enable history tracking
  676                  webenv = None,     # If given, add to an existing history
  677                  ):
  678  
  679          """term, db="pubmed", field=None, daterange=None, retstart=0, retmax=20, usehistory=0, webenv=None
  680  
  681          Search the given database for records matching the query given
  682          in the 'term'.  See the module docstring for examples.
  683  
  684          'term' -- the query string in the Entrez query language; see
  685              http://www.ncbi.nlm.nih.gov/entrez/query/static/help/pmhelp.html
  686          'db' -- the database to search
  687  
  688          'field' -- the field to use for unqualified words
  689              Eg, "dalke[au] AND gene" with field==None becomes
  690                  dalke[au] AND (genes[MeSH Terms] OR gene[Text Word])
  691              and "dalke[au] AND gene" with field=="au" becomes
  692                  dalke[au] AND genes[Author]
  693              (Yes, I think the first "au" should be "Author" too)
  694  
  695          'daterange' -- a date restriction; either WithinNDays or DateRange
  696          'retstart' -- include identifiers in the output, starting with
  697              position 'retstart' (normally starts with 0)
  698          'retmax' -- return at most 'retmax' identifiers in the output
  699              (if not specified, NCBI returns 20 identifiers)
  700  
  701          'usehistory' -- flag to enable history tracking
  702          'webenv' -- if this string is given, add the search results
  703              to an existing history.  (WARNING: the history disappears
  704              after about an hour of non-use.)
  705  
  706          You will need to parse the output XML to get the new QueryKey
  707          and WebEnv fields.
  708  
  709          Returns an input stream from an HTTP request.  The stream
  710          contents are in XML.
  711          """
  712          query = {"term": term,
  713                   "db": db,
  714                   "field": field,
  715                   "retstart": retstart,
  716                   "retmax": retmax,
  717                   "usehistory": usehistory,
  718                   "WebEnv": webenv,
  719                   }
  720          if daterange is not None:
  721              query.update(daterange.get_query_params())
  722  
  723          return self._get(program = "esearch.fcgi", query = query)
724
  725      def epost(self,
  726                dbids,
  727  
  728                webenv = None,   # If given, add to an existing history
  729                ):
  730          """dbids, webenv = None
  731  
  732          Create a new collection in the history containing the given
  733          list of identifiers for a database.
  734  
  735          'dbids' -- a DBIds, which contains the database name and
  736              a list of identifiers in that database
  737          'webenv' -- if this string is given, add the collection
  738              to an existing history.  (WARNING: the history disappears
  739              after about an hour of non-use.)
  740  
  741          You will need to parse the output XML to get the new QueryKey
  742          and WebEnv fields.  NOTE: The order of the IDs on the server
  743          is NOT NECESSARILY the same as the upload order.
  744  
  745          Returns an input stream from an HTTP request.  The stream
  746          contents are in XML.
  747          """
  748          id_string = _dbids_to_id_string(dbids)
  749  
  750          # Looks like it will accept *any* ids.  Wonder what that means.
  751          program = "epost.fcgi"
  752          query = {"id": id_string,
  753                   "db": dbids.db,
  754                   "WebEnv": webenv,
  755                   }
  756          q = self._fixup_query(query)
  757  
  758          self._wait()
  759  
  760          # Need to use a POST since the data set can be *very* long;
  761          # even too long for GET.
  762          if DUMP_URL:
  763              print "Opening with POST:", self.baseurl + program + "?" + q
  764          if DUMP_RESULT:
  765              print " ================== Results ============= "
  766              s = self.opener.open(self.baseurl + program, q).read()
  767              print s
  768              print " ================== Finished ============ "
  769              return cStringIO.StringIO(s)
  770          return self.opener.open(self.baseurl + program, q)
771
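epost uploads the identifiers as a single comma-separated string in the "id" field. A hedged sketch of what the _dbids_to_id_string helper presumably produces (the real helper may differ in detail):

```python
def dbids_to_id_string(ids):
    # Join database identifiers into the comma-separated "id" value
    # that epost (and esummary/efetch by identifier) expect.
    return ",".join(str(i) for i in ids)
```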
    def esummary_using_history(self,
                               db,  # This is required.  Don't use a
                                    # default here because it must match
                                    # that of the webenv
                               webenv,
                               query_key,
                               retstart = 0,
                               retmax = 20,
                               retmode = "xml",  # any other modes?
                               ):
        """db, webenv, query_key, retstart = 0, retmax = 20, retmode = "xml"

        Get the summary for a collection of records in the history

        'db' -- the database containing the history/collection
        'webenv' -- the WebEnv cookie for the history
        'query_key' -- the collection in the history
        'retstart' -- get the summaries starting with this position
        'retmax' -- get at most this many summaries
        'retmode' -- can only be 'xml'.  (Are there others?)

        Returns an input stream from an HTTP request.  The stream
        contents are in 'retmode' format.
        """
        return self._get(program = "esummary.fcgi",
                         query = {"db": db,
                                  "WebEnv": webenv,
                                  "query_key": query_key,
                                  "retstart": retstart,
                                  "retmax": retmax,
                                  "retmode": retmode,
                                  })

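The ESummary XML is a list of DocSum elements whose fields are typed Item elements. A minimal parsing sketch, again with an illustrative sample rather than real server output:

```python
from xml.dom.minidom import parseString

# Illustrative ESummary-style response; not captured from a live server.
SAMPLE = """<eSummaryResult><DocSum>
  <Id>11748933</Id>
  <Item Name="Title" Type="String">A demo title</Item>
  <Item Name="PubDate" Type="Date">2001 Dec</Item>
</DocSum></eSummaryResult>"""

def parse_docsums(xml_text):
    # Return one {field-name: value} dict per DocSum element.
    summaries = []
    for docsum in parseString(xml_text).getElementsByTagName("DocSum"):
        fields = {"Id": docsum.getElementsByTagName("Id")[0]
                               .firstChild.nodeValue}
        for item in docsum.getElementsByTagName("Item"):
            child = item.firstChild
            fields[item.getAttribute("Name")] = (
                child.nodeValue if child is not None else None)
        summaries.append(fields)
    return summaries
```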
    def esummary_using_dbids(self,
                             dbids,
                             retmode = "xml",  # any other modes?
                             ):
        """dbids, retmode = "xml"

        Get the summary for records specified by identifier

        'dbids' -- a DBIds containing the database name and list
            of record identifiers
        'retmode' -- can only be 'xml'

        Returns an input stream from an HTTP request.  The stream
        contents are in 'retmode' format.
        """

        id_string = _dbids_to_id_string(dbids)
        return self._get(program = "esummary.fcgi",
                         query = {"id": id_string,
                                  "db": dbids.db,
                                  # "retmax": len(dbids.ids), # needed?
                                  "retmode": retmode,
                                  })

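The methods that take a 'dbids' argument only rely on its '.db' attribute and its list of identifiers. A hypothetical stand-in for the real DBIds class (which lives elsewhere in Bio.EUtils and may differ) looks like:

```python
class DBIds:
    """Hypothetical stand-in for Bio.EUtils.DBIds; the real class may differ."""
    def __init__(self, db, ids):
        self.db = db          # database name, e.g. "pubmed"
        self.ids = list(ids)  # record identifiers in that database
```

For example, DBIds("pubmed", ["11748933", "11700088"]) names two PubMed records.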
    def efetch_using_history(self,
                             db,
                             webenv,
                             query_key,

                             retstart = 0,
                             retmax = 20,

                             retmode = None,
                             rettype = None,

                             # sequence only
                             seq_start = None,
                             seq_stop = None,
                             strand = None,
                             complexity = None,
                             ):
        """db, webenv, query_key, retstart=0, retmax=20, retmode=None, rettype=None, seq_start=None, seq_stop=None, strand=None, complexity=None

        Fetch information for a collection of records in the history,
        in a variety of formats.

        'db' -- the database containing the history/collection
        'webenv' -- the WebEnv cookie for the history
        'query_key' -- the collection in the history
        'retstart' -- get the formatted data starting with this position
        'retmax' -- get data for at most this many records

        These options work for sequence databases

        'seq_start' -- return the sequence starting at this position.
            The first position is numbered 1
        'seq_stop' -- return the sequence ending at this position
            Includes the stop position, so seq_start = 1 and
            seq_stop = 5 returns the first 5 bases/residues.
        'strand' -- strand.  Use EUtils.PLUS_STRAND (== 1) for the plus
            strand and EUtils.MINUS_STRAND (== 2) for the minus strand
        'complexity' -- regulates the level of display.  Options are
            0 - get the whole blob
            1 - get the bioseq for the gi of interest (default in Entrez)
            2 - get the minimal bioseq-set containing the gi of interest
            3 - get the minimal nuc-prot containing the gi of interest
            4 - get the minimal pub-set containing the gi of interest

        http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html

        The valid retmode and rettype values are:

        For publication databases (omim, pubmed, journals) the
        retmodes are 'xml', 'asn.1', 'text', and 'html'.

            If retmode == 'xml'   ---> XML (default)
            If retmode == 'asn.1' ---> ASN.1

        The following rettype values work for retmode == 'text':

            docsum    ----> author / title / cite / PMID
            brief     ----> a one-liner up to about 66 chars
            abstract  ----> cite / title / author / dept /
                                full abstract / PMID
            citation  ----> cite / title / author / dept /
                                full abstract / MeSH terms /
                                substances / PMID
            medline   ----> full record in MEDLINE format
            asn.1     ----> full record in one ASN.1 format
            mlasn1    ----> full record in another ASN.1 format
            uilist    ----> list of uids, one per line
            sgml      ----> same as retmode="xml"

        Sequence databases (genome, protein, nucleotide, popset)
        also have retmode values of 'xml', 'asn.1', 'text', and
        'html'.

            If retmode == 'xml'   ---> XML (default; only supports
                                        rettype == 'native')
            If retmode == 'asn.1' ---> ASN.1 text (only works for rettype
                                        of 'native' and 'sequin')

        The following work with a retmode of 'text' or 'html':

            native      ----> Default format for viewing sequences
            fasta       ----> FASTA view of a sequence
            gb          ----> GenBank view for sequences; constructed
                              sequences will be shown as contigs (by
                              pointing to their parts).  Valid for
                              nucleotides.
            gbwithparts ----> GenBank view for sequences; the sequence
                              will always be shown.  Valid for
                              nucleotides.
            est         ----> EST Report.  Valid for sequences from the
                              dbEST database.
            gss         ----> GSS Report.  Valid for sequences from the
                              dbGSS database.
            gp          ----> GenPept view.  Valid for proteins.
            seqid       ----> Convert a list of gis into a list of seqids
            acc         ----> Convert a list of gis into a list of
                              accessions

            # XXX TRY THESE
            fasta_xml
            gb_xml
            gi (same as uilist?)

        A retmode of 'file' is the same as 'text' except the data is
        sent with a Content-Type of application/octet-stream, which tells
        the browser to save the data to a file.

        A retmode of 'html' is the same as 'text' except an HTML header
        and footer are added and special characters are properly escaped.

        Returns an input stream from an HTTP request.  The stream
        contents are in the requested format.
        """

        # NOTE: found the list of possible values by sending illegal
        # parameters to see which came back in the error message, and
        # used that to supplement the information from the documentation.
        # Looks like efetch is based on the pmfetch code and uses the
        # same types.

        # If retmax is specified and larger than 500, NCBI returns only
        # 500 records.  Removing retmax from the URL relieves this
        # constraint, so if retstart is 0 and retmax is greater than
        # 500, set retmax to None.
        if retstart == 0 and retmax > 500:
            retmax = None
        return self._get(program = "efetch.fcgi",
                         query = {"db": db,
                                  "WebEnv": webenv,
                                  "query_key": query_key,
                                  "retstart": retstart,
                                  "retmax": retmax,
                                  "retmode": retmode,
                                  "rettype": rettype,
                                  "seq_start": seq_start,
                                  "seq_stop": seq_stop,
                                  "strand": strand,
                                  "complexity": complexity,
                                  })

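Because of the 500-record cap described in the comment above, a caller who wants a nonzero retstart over a large history must page through it with retstart/retmax. A sketch of the batch arithmetic (the function name is illustrative, not part of ThinClient):

```python
def fetch_batches(total, batch_size=500):
    # Return (retstart, retmax) pairs covering 'total' records
    # in chunks no larger than 'batch_size'.
    return [(start, min(batch_size, total - start))
            for start in range(0, total, batch_size)]
```

Each pair can then be passed as the retstart/retmax arguments of one efetch_using_history call.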
    def efetch_using_dbids(self,
                           dbids,
                           retmode = None,
                           rettype = None,

                           # sequence only
                           seq_start = None,
                           seq_stop = None,
                           strand = None,
                           complexity = None,
                           ):
        """dbids, retmode = None, rettype = None, seq_start = None, seq_stop = None, strand = None, complexity = None

        Fetch information for records specified by identifier

        'dbids' -- a DBIds containing the database name and list
            of record identifiers
        'retmode' -- See the docstring for 'efetch_using_history'
        'rettype' -- See the docstring for 'efetch_using_history'

        These options work for sequence databases

        'seq_start' -- return the sequence starting at this position.
            The first position is numbered 1
        'seq_stop' -- return the sequence ending at this position
            Includes the stop position, so seq_start = 1 and
            seq_stop = 5 returns the first 5 bases/residues.
        'strand' -- strand.  Use EUtils.PLUS_STRAND (== 1) for the plus
            strand and EUtils.MINUS_STRAND (== 2) for the minus strand
        'complexity' -- regulates the level of display.  Options are
            0 - get the whole blob
            1 - get the bioseq for the gi of interest (default in Entrez)
            2 - get the minimal bioseq-set containing the gi of interest
            3 - get the minimal nuc-prot containing the gi of interest
            4 - get the minimal pub-set containing the gi of interest

        Returns an input stream from an HTTP request.  The stream
        contents are in the requested format.
        """
        id_string = _dbids_to_id_string(dbids)
        return self._get(program = "efetch.fcgi",
                         query = {"id": id_string,
                                  "db": dbids.db,
                                  # "retmax": len(dbids.ids), # needed?
                                  "retmode": retmode,
                                  "rettype": rettype,
                                  "seq_start": seq_start,
                                  "seq_stop": seq_stop,
                                  "strand": strand,
                                  "complexity": complexity,
                                  })
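As the docstring notes, seq_start/seq_stop are 1-based and inclusive, unlike Python's 0-based half-open slices. A small conversion sketch (the function name is illustrative):

```python
def slice_to_entrez(py_start, py_stop):
    # Python slice seq[py_start:py_stop] (0-based, half-open) ->
    # Entrez (seq_start, seq_stop) (1-based, inclusive).
    return py_start + 1, py_stop
```

For example, the first 5 bases/residues, seq[0:5] in Python, become seq_start=1, seq_stop=5.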