Selects HTML elements using CSS 2 selectors.
The Selector class uses CSS selector expressions to match and select HTML elements.
For example:
selector = HTML::Selector.new "form.login[action=/login]"
creates a new selector that matches any form element with the class login and an attribute action with the value /login.
Matching Elements
Use the match method to determine if an element matches the selector.
For simple selectors, the method returns an array with that element, or nil if the element does not match. For complex selectors (see below), the method returns an array with all matched elements, or nil if no match is found.
For example:
if selector.match(element)
  puts "Element is a login form"
end
Selecting Elements
Use the select method to select all matching elements starting with one element and going through all children in depth-first order.
This method returns an array of all matching elements, or an empty array if no match is found.
For example:
selector = HTML::Selector.new "input[type=text]"
matches = selector.select(element)
matches.each do |match|
  puts "Found text field with name #{match.attributes['name']}"
end
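The depth-first order used when selecting can be illustrated without the library. The following is a minimal sketch, assuming a hypothetical Node struct (not the library's node class) and a block that stands in for the selector's match test. It always descends into children, which is a simplification of what the select method does.

```ruby
# Hypothetical stand-in for a parsed HTML element; not the library's node class.
Node = Struct.new(:name, :children)

# Walk the tree depth-first (document order) and collect every node for
# which the block returns true. Returns an empty array when nothing matches.
def select_depth_first(root)
  matches = []
  stack = [root]
  while (node = stack.pop)
    matches << node if yield(node)
    # Push children in reverse so the first child is popped first.
    stack.concat(node.children.reverse)
  end
  matches
end

tree = Node.new("form", [
  Node.new("p", [Node.new("input", [])]),
  Node.new("input", []),
  Node.new("span", [])
])

inputs = select_depth_first(tree) { |node| node.name == "input" }
puts inputs.size
```

The explicit stack avoids recursion; reversing the children before pushing them keeps the results in document order.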
Expressions
Selectors can match elements using any of the following criteria:
- name — Match an element based on its name (tag name). For example, p to match a paragraph. You can use * to match any element.
- #id — Match an element based on its identifier (the id attribute). For example, #page.
- .class — Match an element based on its class name, or on all class names if more than one is specified.
- [attr] — Match an element that has the specified attribute.
- [attr=value] — Match an element that has the specified attribute and value. (More operators are supported; see below.)
- :pseudo-class — Match an element based on a pseudo class, such as :nth-child and :empty.
- :not(expr) — Match an element that does not match the negation expression.
When using a combination of the above, the element name comes first, followed by identifier, class names, attributes, pseudo classes and negation in any order. Do not separate these parts with spaces! Space separation is used for descendant selectors.
For example:
selector = HTML::Selector.new "form.login[action=/login]"
The matched element must be of type form and have the class login. It may have other classes, but the class login is required to match. It must also have an attribute called action with the value /login.
This selector will match the following element:
<form class="login form" method="post" action="/login">
but will not match the element:
<form method="post" action="/logout">
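As a rough illustration of the rule above, the following sketch checks the same three conditions by hand. The method login_form? and its arguments are hypothetical and not part of HTML::Selector; the element's attributes are assumed to be available as a plain hash.

```ruby
# Hypothetical helper: checks by hand what "form.login[action=/login]"
# requires of an element's tag name and attribute hash.
def login_form?(name, attributes)
  classes = attributes.fetch("class", "").split
  name == "form" &&
    classes.include?("login") &&        # other classes are allowed
    attributes["action"] == "/login"    # value must match in full
end

puts login_form?("form", { "class" => "login form", "method" => "post", "action" => "/login" })
puts login_form?("form", { "method" => "post", "action" => "/logout" })
```

The first call corresponds to the element that matches, the second to the element that does not.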
Attribute Values
Several operators are supported for matching attributes:
- name — The element must have an attribute with that name.
- name=value — The element must have an attribute with that name and value.
- name^=value — The attribute value must start with the specified value.
- name$=value — The attribute value must end with the specified value.
- name*=value — The attribute value must contain the specified value.
- name~=word — The attribute value must contain the specified word (space separated).
- name|=word — The attribute value must start with specified word.
For example, the following two selectors match the same element:
#my_id
[id=my_id]
and so do the following two selectors:
.my_class
[class~=my_class]
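One way to picture the operators is as regular expressions applied to the attribute value. The sketch below is an assumption-laden illustration, not the library's code: the method name operator_regexp is hypothetical, and the |= case follows the definition given in this document (first space-separated word).

```ruby
# Hypothetical helper: build a regular expression for an attribute operator,
# following the operator list in this document.
def operator_regexp(operator, value)
  escaped = Regexp.escape(value)
  case operator
  when "="  then /\A#{escaped}\z/            # exact value
  when "^=" then /\A#{escaped}/              # starts with
  when "$=" then /#{escaped}\z/              # ends with
  when "*=" then /#{escaped}/                # contains
  when "~=" then /(\A|\s)#{escaped}(\s|\z)/  # space-separated word
  when "|=" then /\A#{escaped}(\s|\z)/       # first word
  else raise ArgumentError, "unknown operator #{operator}"
  end
end

class_attr = "login form"
puts operator_regexp("~=", "login").match?(class_attr)  # word match, like .login
puts operator_regexp("=", "login").match?(class_attr)   # not the full value
puts operator_regexp("^=", "/log").match?("/login")
```

This also shows why .my_class and [class~=my_class] are interchangeable: both test for a whole word in the class attribute.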
Alternatives, siblings, children
Complex selectors use a combination of expressions to match elements:
- expr1 expr2 — Match any element against the second expression if it has an ancestor element that matches the first expression.
- expr1 > expr2 — Match any element against the second expression if it is the child of an element that matches the first expression.
- expr1 + expr2 — Match any element against the second expression if it immediately follows an element that matches the first expression.
- expr1 ~ expr2 — Match any element against the second expression that comes after an element that matches the first expression.
- expr1, expr2 — Match any element against the first expression, or against the second expression.
Since children and sibling selectors may match more than one element given the first element, the match method may return more than one match.
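The difference between the child and descendant forms, and the reason a single starting element can yield several matches, can be sketched on a tiny tree. The Elem struct and both helper methods below are hypothetical and only stand in for the library's behavior.

```ruby
# Hypothetical stand-in for an element with a tag name and child elements.
Elem = Struct.new(:name, :children)

# Child combinator (expr1 > expr2): only direct children with the given name.
def children_named(element, name)
  element.children.select { |child| child.name == name }
end

# Descendant combinator (expr1 expr2): children, grandchildren, and so on.
def descendants_named(element, name)
  element.children.flat_map do |child|
    found = child.name == name ? [child] : []
    found + descendants_named(child, name)
  end
end

div = Elem.new("div", [
  Elem.new("p", []),
  Elem.new("blockquote", [Elem.new("p", [])])
])

puts children_named(div, "p").size     # 1
puts descendants_named(div, "p").size  # 2
```

Here "div > p" would find one paragraph, while "div p" would find two from the same starting div.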
Pseudo classes
Most of the pseudo classes listed here were introduced in CSS 3 (:first-child dates back to CSS 2). They are most often used to select elements in a given position:
- :root — Match the element only if it is the root element (no parent element).
- :empty — Match the element only if it has no child elements, and no text content.
- :only-child — Match the element if it is the only child (element) of its parent element.
- :only-of-type — Match the element if it is the only child (element) of its type within its parent element.
- :first-child — Match the element if it is the first child (element) of its parent element.
- :first-of-type — Match the element if it is the first child (element) of its parent element of its type.
- :last-child — Match the element if it is the last child (element) of its parent element.
- :last-of-type — Match the element if it is the last child (element) of its parent element of its type.
- :nth-child(b) — Match the element if it is the b-th child (element) of its parent element. The value b specifies its index, starting with 1.
- :nth-child(an+b) — Match the element if it is the b-th child (element) in each group of a child elements of its parent element.
- :nth-child(-an+b) — Match the element if it is the first child (element) in each group of a child elements, up to the first b child elements of its parent element.
- :nth-child(odd) — Match element in the odd position (i.e. first, third). Same as :nth-child(2n+1).
- :nth-child(even) — Match element in the even position (i.e. second, fourth). Same as :nth-child(2n+2).
- :nth-of-type(..) — As above, but only counts elements of its type.
- :nth-last-child(..) — As above, but counts from the last child.
- :nth-last-of-type(..) — As above, but counts from the last child and only elements of its type.
- :not(selector) — Match the element only if the element does not match the simple selector.
As you can see, the :nth-child pseudo class and its variants can get quite tricky, and the CSS specification doesn't do a much better job of explaining them. But after reading the examples and trying a few combinations, it's easy to figure out.
For example:
table tr:nth-child(odd)
Selects every second row in the table starting with the first one.
div p:nth-child(4)
Selects the fourth paragraph in the div, but not if the div contains other elements, since those are also counted.
div p:nth-of-type(4)
Selects the fourth paragraph in the div, counting only paragraphs, and ignoring all other elements.
div p:nth-of-type(-n+4)
Selects the first four paragraphs, ignoring all others.
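The an+b arithmetic behind these examples can be tried out directly. The sketch below follows the definition in the CSS specification (a position matches if it equals a*n + b for some integer n >= 0); the method name nth_position? is hypothetical, and this is not how the library computes it internally.

```ruby
# True when the 1-based position equals a*n + b for some integer n >= 0.
def nth_position?(a, b, position)
  return position == b if a.zero?
  diff = position - b
  # n = diff / a must be a non-negative integer.
  (diff % a).zero? && diff / a >= 0
end

positions = (1..8).to_a
odd        = positions.select { |pos| nth_position?(2, 1, pos) }
first_four = positions.select { |pos| nth_position?(-1, 4, pos) }
fourth     = positions.select { |pos| nth_position?(0, 4, pos) }

p odd         # => [1, 3, 5, 7]
p first_four  # => [1, 2, 3, 4]
p fourth      # => [4]
```

The three calls correspond to :nth-child(odd), :nth-child(-n+4) and :nth-child(4) applied to eight sibling elements.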
And you can always select an element that matches one set of rules but not another using :not. For example:
p:not(.post)
Matches all paragraphs that do not have the class post.
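The effect of the negation can be pictured as filtering out the elements that the inner expression would match. In this sketch each hash is made-up data standing in for the attributes of one paragraph; nothing here calls the library.

```ruby
# Each hash stands in for the attributes of one <p> element (made-up data).
paragraphs = [
  { "id" => "a", "class" => "post" },
  { "id" => "b", "class" => "intro lead" },
  { "id" => "c" },
  { "id" => "d", "class" => "post featured" }
]

# Keep the paragraphs whose class list does not include "post".
not_post = paragraphs.reject do |attrs|
  attrs.fetch("class", "").split.include?("post")
end

p not_post.map { |attrs| attrs["id"] }  # => ["b", "c"]
```

Note that a paragraph with no class attribute at all is kept, since it does not have the class post.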
Substitution Values
You can use substitution with identifiers, class names and element values. A substitution takes the form of a question mark (?) and uses the next value in the argument list following the CSS expression.
The substitution value may be a string or a regular expression. All other values are converted to strings.
For example:
selector = HTML::Selector.new "#?", /^\d+$/
matches any element whose identifier consists of one or more digits.
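How substitution values are consumed can be sketched as follows. Both helper methods are hypothetical and greatly simplified (the real parser handles each ? in the context where it appears); the sketch only shows that values are taken in order, that regular expressions are used as given, and that other values are converted to strings.

```ruby
# Turn a substitution value into a matcher: regular expressions are used
# as given, anything else is converted to a string and matched exactly.
def value_matcher(value)
  value.is_a?(Regexp) ? value : /\A#{Regexp.escape(value.to_s)}\z/
end

# Consume one value from the argument list for each "?" placeholder.
def bind_placeholders(expression, values)
  values = values.dup
  expression.scan("?").map { value_matcher(values.shift) }
end

id_matcher = bind_placeholders("#?", [/^\d+$/]).first
puts id_matcher.match?("12345")
puts id_matcher.match?("page")

name_matcher = bind_placeholders("input[name=?]", [42]).first
puts name_matcher.match?("42")
```

The second example passes an integer, which is converted to the string "42" before matching.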
- attribute_match
- for_class
- for_id
- match
- new
- next_element
- next_selector
- nth_child
- only_child
- select
- select_first
- simple_selector
Creates a new selector for the given class name.
# File vendor/rails/actionpack/lib/action_controller/vendor/html-scanner/html/selector.rb, line 214
def for_class(cls)
  self.new([".?", cls])
end
Creates a new selector for the given id.
# File vendor/rails/actionpack/lib/action_controller/vendor/html-scanner/html/selector.rb, line 223
def for_id(id)
  self.new(["#?", id])
end
Creates a new selector from a CSS 2 selector expression.
The first argument is the selector expression. All other arguments are used for value substitution.
Raises InvalidSelectorError if the selector expression is invalid.
# File vendor/rails/actionpack/lib/action_controller/vendor/html-scanner/html/selector.rb, line 239
def initialize(selector, *values)
  raise ArgumentError, "CSS expression cannot be empty" if selector.empty?
  @source = ""
  values = values[0] if values.size == 1 && values[0].is_a?(Array)
  # We need a copy to determine if we failed to parse, and also
  # preserve the original pass by-ref statement.
  statement = selector.strip.dup
  # Create a simple selector, along with negation.
  simple_selector(statement, values).each { |name, value| instance_variable_set("@#{name}", value) }

  # Alternative selector.
  if statement.sub!(/^\s*,\s*/, "")
    second = Selector.new(statement, values)
    (@alternates ||= []) << second
    # If there are alternate selectors, we group them in the top selector.
    if alternates = second.instance_variable_get(:@alternates)
      second.instance_variable_set(:@alternates, nil)
      @alternates.concat alternates
    end
    @source << " , " << second.to_s
  # Sibling selector: create a dependency into second selector that will
  # match element immediately following this one.
  elsif statement.sub!(/^\s*\+\s*/, "")
    second = next_selector(statement, values)
    @depends = lambda do |element, first|
      if element = next_element(element)
        second.match(element, first)
      end
    end
    @source << " + " << second.to_s
  # Adjacent selector: create a dependency into second selector that will
  # match all elements following this one.
  elsif statement.sub!(/^\s*~\s*/, "")
    second = next_selector(statement, values)
    @depends = lambda do |element, first|
      matches = []
      while element = next_element(element)
        if subset = second.match(element, first)
          if first && !subset.empty?
            matches << subset.first
            break
          else
            matches.concat subset
          end
        end
      end
      matches.empty? ? nil : matches
    end
    @source << " ~ " << second.to_s
  # Child selector: create a dependency into second selector that will
  # match a child element of this one.
  elsif statement.sub!(/^\s*>\s*/, "")
    second = next_selector(statement, values)
    @depends = lambda do |element, first|
      matches = []
      element.children.each do |child|
        if child.tag? && subset = second.match(child, first)
          if first && !subset.empty?
            matches << subset.first
            break
          else
            matches.concat subset
          end
        end
      end
      matches.empty? ? nil : matches
    end
    @source << " > " << second.to_s
  # Descendant selector: create a dependency into second selector that
  # will match all descendant elements of this one. Note,
  elsif statement =~ /^\s+\S+/ && statement != selector
    second = next_selector(statement, values)
    @depends = lambda do |element, first|
      matches = []
      stack = element.children.reverse
      while node = stack.pop
        next unless node.tag?
        if subset = second.match(node, first)
          if first && !subset.empty?
            matches << subset.first
            break
          else
            matches.concat subset
          end
        elsif children = node.children
          stack.concat children.reverse
        end
      end
      matches.empty? ? nil : matches
    end
    @source << " " << second.to_s
  else
    # The last selector is where we check that we parsed
    # all the parts.
    unless statement.empty? || statement.strip.empty?
      raise ArgumentError, "Invalid selector: #{statement}"
    end
  end
end
Matches an element against the selector.
For a simple selector this method returns an array with the element if the element matches, nil otherwise.
For a complex selector (sibling and descendant) this method returns an array with all matching elements, nil if no match is found.
Pass true for first_only if you are only interested in the first matching element.
For example:
if selector.match(element)
  puts "Element is a login form"
end
# File vendor/rails/actionpack/lib/action_controller/vendor/html-scanner/html/selector.rb, line 358
def match(element, first_only = false)
  # Match element if no element name or element name same as element name
  if matched = (!@tag_name || @tag_name == element.name)
    # No match if one of the attribute matches failed
    for attr in @attributes
      if element.attributes[attr[0]] !~ attr[1]
        matched = false
        break
      end
    end
  end

  # Pseudo class matches (nth-child, empty, etc).
  if matched
    for pseudo in @pseudo
      unless pseudo.call(element)
        matched = false
        break
      end
    end
  end

  # Negation. Same rules as above, but we fail if a match is made.
  if matched && @negation
    for negation in @negation
      if negation[:tag_name] == element.name
        matched = false
      else
        for attr in negation[:attributes]
          if element.attributes[attr[0]] =~ attr[1]
            matched = false
            break
          end
        end
      end
      if matched
        for pseudo in negation[:pseudo]
          if pseudo.call(element)
            matched = false
            break
          end
        end
      end
      break unless matched
    end
  end

  # If element matched but depends on another element (child,
  # sibling, etc), apply the dependent matches instead.
  if matched && @depends
    matches = @depends.call(element, first_only)
  else
    matches = matched ? [element] : nil
  end

  # If this selector is part of the group, try all the alternative
  # selectors (unless first_only).
  if @alternates && (!first_only || !matches)
    @alternates.each do |alternate|
      break if matches && first_only
      if subset = alternate.match(element, first_only)
        if matches
          matches.concat subset
        else
          matches = subset
        end
      end
    end
  end

  matches
end
Return the next element after this one. Skips sibling text nodes.
With the name argument, returns the next element with that name, skipping other sibling elements.
# File vendor/rails/actionpack/lib/action_controller/vendor/html-scanner/html/selector.rb, line 488
def next_element(element, name = nil)
  if siblings = element.parent.children
    found = false
    siblings.each do |node|
      if node.equal?(element)
        found = true
      elsif found && node.tag?
        return node if (name.nil? || node.name == name)
      end
    end
  end
  nil
end
Selects and returns an array with all matching elements, beginning with one node and traversing through all children depth-first. Returns an empty array if no match is found.
The root node may be any element in the document, or the document itself.
For example:
selector = HTML::Selector.new "input[type=text]"
matches = selector.select(element)
matches.each do |match|
  puts "Found text field with name #{match.attributes['name']}"
end
# File vendor/rails/actionpack/lib/action_controller/vendor/html-scanner/html/selector.rb, line 448
def select(root)
  matches = []
  stack = [root]
  while node = stack.pop
    if node.tag? && subset = match(node, false)
      subset.each do |match|
        matches << match unless matches.any? { |item| item.equal?(match) }
      end
    elsif children = node.children
      stack.concat children.reverse
    end
  end
  matches
end
Similar to select but returns the first matching element. Returns nil if no element matches the selector.
# File vendor/rails/actionpack/lib/action_controller/vendor/html-scanner/html/selector.rb, line 466
def select_first(root)
  stack = [root]
  while node = stack.pop
    if node.tag? && subset = match(node, true)
      return subset.first if !subset.empty?
    elsif children = node.children
      stack.concat children.reverse
    end
  end
  nil
end
Create a regular expression to match an attribute value based on the equality operator (=, ^=, |=, etc).
# File vendor/rails/actionpack/lib/action_controller/vendor/html-scanner/html/selector.rb, line 682
def attribute_match(equality, value)
  regexp = value.is_a?(Regexp) ? value : Regexp.escape(value.to_s)
  case equality
    when "=" then
      # Match the attribute value in full
      Regexp.new("^#{regexp}$")
    when "~=" then
      # Match a space-separated word within the attribute value
      Regexp.new("(^|\s)#{regexp}($|\s)")
    when "^="
      # Match the beginning of the attribute value
      Regexp.new("^#{regexp}")
    when "$="
      # Match the end of the attribute value
      Regexp.new("#{regexp}$")
    when "*="
      # Match substring of the attribute value
      regexp.is_a?(Regexp) ? regexp : Regexp.new(regexp)
    when "|=" then
      # Match the first space-separated item of the attribute value
      Regexp.new("^#{regexp}($|\s)")
    else
      raise InvalidSelectorError, "Invalid operation/value" unless value.empty?
      # Match all attributes values (existence check)
      //
  end
end
Called to create a dependent selector (sibling, descendant, etc). Passes the remainder of the statement, which will eventually be reduced to nothing, and the array of substitution values.
This method is called from four places, so it helps to put it here for reuse. The only logic deals with the need to detect comma separators (alternates) and apply them to the selector group of the top selector.
# File vendor/rails/actionpack/lib/action_controller/vendor/html-scanner/html/selector.rb, line 795
def next_selector(statement, values)
  second = Selector.new(statement, values)
  # If there are alternate selectors, we group them in the top selector.
  if alternates = second.instance_variable_get(:@alternates)
    second.instance_variable_set(:@alternates, nil)
    (@alternates ||= []).concat alternates
  end
  second
end
Returns a lambda that can match an element against the nth-child pseudo class, given the following arguments:
- a — Value of a part.
- b — Value of b part.
- of_type — True to test only elements of this type (of-type).
- reverse — True to count in reverse order (last-).
# File vendor/rails/actionpack/lib/action_controller/vendor/html-scanner/html/selector.rb, line 717
def nth_child(a, b, of_type, reverse)
  # a = 0 means select at index b, if b = 0 nothing selected
  return lambda { |element| false } if a == 0 && b == 0
  # a < 0 and b < 0 will never match against an index
  return lambda { |element| false } if a < 0 && b < 0
  b = a + b + 1 if b < 0 # b < 0 just picks last element from each group
  b -= 1 unless b == 0 # b == 0 is same as b == 1, otherwise zero based
  lambda do |element|
    # Element must be inside parent element.
    return false unless element.parent && element.parent.tag?
    index = 0
    # Get siblings, reverse if counting from last.
    siblings = element.parent.children
    siblings = siblings.reverse if reverse
    # Match element name if of-type, otherwise ignore name.
    name = of_type ? element.name : nil
    found = false
    for child in siblings
      # Skip text nodes/comments.
      if child.tag? && (name == nil || child.name == name)
        if a == 0
          # Shortcut when a == 0 no need to go past count
          if index == b
            found = child.equal?(element)
            break
          end
        elsif a < 0
          # Only look for first b elements
          break if index > b
          if child.equal?(element)
            found = (index % a) == 0
            break
          end
        else
          # Otherwise, break if child found and count == an+b
          if child.equal?(element)
            found = (index % a) == b
            break
          end
        end
        index += 1
      end
    end
    found
  end
end
Creates an only-child lambda. Pass true for of_type to only look at elements of the same type.
# File vendor/rails/actionpack/lib/action_controller/vendor/html-scanner/html/selector.rb, line 767
def only_child(of_type)
  lambda do |element|
    # Element must be inside parent element.
    return false unless element.parent && element.parent.tag?
    name = of_type ? element.name : nil
    other = false
    for child in element.parent.children
      # Skip text nodes/comments.
      if child.tag? && (name == nil || child.name == name)
        unless child.equal?(element)
          other = true
          break
        end
      end
    end
    !other
  end
end
Creates a simple selector given the statement and array of substitution values.
Returns a hash with the values tag_name, attributes, pseudo (classes) and negation.
Called the first time with can_negate true to allow negation. Called a second time with false since negation cannot be negated.
# File vendor/rails/actionpack/lib/action_controller/vendor/html-scanner/html/selector.rb, line 515
def simple_selector(statement, values, can_negate = true)
  tag_name = nil
  attributes = []
  pseudo = []
  negation = []

  # Element name. (Note that in negation, this can come at
  # any order, but for simplicity we allow if only first).
  statement.sub!(/^(\*|[[:alpha:]][\w\-]*)/) do |match|
    match.strip!
    tag_name = match.downcase unless match == "*"
    @source << match
    "" # Remove
  end

  # Get identifier, class, attribute name, pseudo or negation.
  while true
    # Element identifier.
    next if statement.sub!(/^#(\?|[\w\-]+)/) do |match|
      id = $1
      if id == "?"
        id = values.shift
      end
      @source << "##{id}"
      id = Regexp.new("^#{Regexp.escape(id.to_s)}$") unless id.is_a?(Regexp)
      attributes << ["id", id]
      "" # Remove
    end

    # Class name.
    next if statement.sub!(/^\.([\w\-]+)/) do |match|
      class_name = $1
      @source << ".#{class_name}"
      class_name = Regexp.new("(^|\s)#{Regexp.escape(class_name)}($|\s)") unless class_name.is_a?(Regexp)
      attributes << ["class", class_name]
      "" # Remove
    end

    # Attribute value.
    next if statement.sub!(/^\[\s*([[:alpha:]][\w\-]*)\s*((?:[~|^$*])?=)?\s*('[^']*'|"[^*]"|[^\]]*)\s*\]/) do |match|
      name, equality, value = $1, $2, $3
      if value == "?"
        value = values.shift
      else
        # Handle single and double quotes.
        value.strip!
        if (value[0] == ?" || value[0] == ?') && value[0] == value[-1]
          value = value[1..-2]
        end
      end
      @source << "[#{name}#{equality}'#{value}']"
      attributes << [name.downcase.strip, attribute_match(equality, value)]
      "" # Remove
    end

    # Root element only.
    next if statement.sub!(/^:root/) do |match|
      pseudo << lambda do |element|
        element.parent.nil? || !element.parent.tag?
      end
      @source << ":root"
      "" # Remove
    end

    # Nth-child including last and of-type.
    next if statement.sub!(/^:nth-(last-)?(child|of-type)\((odd|even|(\d+|\?)|(-?\d*|\?)?n([+\-]\d+|\?)?)\)/) do |match|
      reverse = $1 == "last-"
      of_type = $2 == "of-type"
      @source << ":nth-#{$1}#{$2}("
      case $3
        when "odd"
          pseudo << nth_child(2, 1, of_type, reverse)
          @source << "odd)"
        when "even"
          pseudo << nth_child(2, 2, of_type, reverse)
          @source << "even)"
        when /^(\d+|\?)$/ # b only
          b = ($1 == "?" ? values.shift : $1).to_i
          pseudo << nth_child(0, b, of_type, reverse)
          @source << "#{b})"
        when /^(-?\d*|\?)?n([+\-]\d+|\?)?$/
          a = ($1 == "?" ? values.shift :
               $1 == "" ? 1 : $1 == "-" ? -1 : $1).to_i
          b = ($2 == "?" ? values.shift : $2).to_i
          pseudo << nth_child(a, b, of_type, reverse)
          @source << (b >= 0 ? "#{a}n+#{b})" : "#{a}n#{b})")
        else
          raise ArgumentError, "Invalid nth-child #{match}"
      end
      "" # Remove
    end
    # First/last child (of type).
    next if statement.sub!(/^:(first|last)-(child|of-type)/) do |match|
      reverse = $1 == "last"
      of_type = $2 == "of-type"
      pseudo << nth_child(0, 1, of_type, reverse)
      @source << ":#{$1}-#{$2}"
      "" # Remove
    end
    # Only child (of type).
    next if statement.sub!(/^:only-(child|of-type)/) do |match|
      of_type = $1 == "of-type"
      pseudo << only_child(of_type)
      @source << ":only-#{$1}"
      "" # Remove
    end

    # Empty: no child elements or meaningful content (whitespaces
    # are ignored).
    next if statement.sub!(/^:empty/) do |match|
      pseudo << lambda do |element|
        empty = true
        for child in element.children
          if child.tag? || !child.content.strip.empty?
            empty = false
            break
          end
        end
        empty
      end
      @source << ":empty"
      "" # Remove
    end
    # Content: match the text content of the element, stripping
    # leading and trailing spaces.
    next if statement.sub!(/^:content\(\s*(\?|'[^']*'|"[^"]*"|[^)]*)\s*\)/) do |match|
      content = $1
      if content == "?"
        content = values.shift
      elsif (content[0] == ?" || content[0] == ?') && content[0] == content[-1]
        content = content[1..-2]
      end
      @source << ":content('#{content}')"
      content = Regexp.new("^#{Regexp.escape(content.to_s)}$") unless content.is_a?(Regexp)
      pseudo << lambda do |element|
        text = ""
        for child in element.children
          unless child.tag?
            text << child.content
          end
        end
        text.strip =~ content
      end
      "" # Remove
    end

    # Negation. Create another simple selector to handle it.
    if statement.sub!(/^:not\(\s*/, "")
      raise ArgumentError, "Double negatives are not missing feature" unless can_negate
      @source << ":not("
      negation << simple_selector(statement, values, false)
      raise ArgumentError, "Negation not closed" unless statement.sub!(/^\s*\)/, "")
      @source << ")"
      next
    end

    # No match: moving on.
    break
  end

  # Return hash. The keys are mapped to instance variables.
  {:tag_name=>tag_name, :attributes=>attributes, :pseudo=>pseudo, :negation=>negation}
end