| Register | FAQ | Calendar | Search | Today's Posts | Mark Forums Read |
|
#1
|
| Hi. This has probably been asked 1E+06 times, and I apologize, but I couldn't find any info on it. I have gotten the same basic results on a variety of ASE versions from 11.9.2.6 to 15.0.2 ESD#2 and ESD#5. I have a table with a million rows. There is a nonclustered index on a column. There are 50 distinct values of the column spread throughout the table. I realize that with so many rows per distinct index value, there will generally be lots of I/Os even if the index is being used. But still I am confused when I do any of these queries: select distinct my_indexed_column from mytable select my_indexed_column from mytable group by my_indexed_column select count(distinct my_indexed_column) from mytable I have verified that in all cases my index is being used, and it's a covering index (i.e. "Base table will not be read".) But each of those queries costs 11,000 I/Os. It just seems like logically ASE should be able to extract the 50 distinct values out of the index with a bare minimum of I/Os (like maybe, I don't know, 10?). Why so many I/Os? Thanks. - John. |
|
#2
|
| Because it requires a full index scan (leaf level) of the index to obtain those 50 values, as it's the leaf level of the index which is the only level which contains ALL of the values. Look at the structure of a b-tree index. There is nothing in the btree that says which page each (first occurence?) of the distinct value of the column resides. I'll do my best in ascii to illustrate the issue. Presume that I have 4 index leaf level pages with key values as below: --pg1-- 100 100 100 200 --pg2-- 200 300 400 400 400 --pg3-- 400 400 400 400 --pg4-- 400 500 500 The btree level above this leaf level doesn't tell me that I can skip reading p3 to find unique values, nor does it necessarily (it might, but it's a dense structure, so can't be guaranteed) tell me exactly which page contains the first occurence of "200". The challenge comes in finding a method to sort all of these values somehow, and I think that you'll find the methods used in ASE 15 (particularly "hash distinct") better performing at this sorting step especially with large sets of data to plow through. Remember, it's only the leaf level which contains all of those unique values that you are after. I find the phonebook analogy useful in these situations as well. Pretend that I ask you for a list of all of the unique last names from the white pages of the phone book. And of course, each page of the phone book has an index on it which lists the beginning and ending last names on those pages. You can't satisfy my request by simply reading the indexed names on the top of the page. You have to look at each name on every page of the phone book and sort out the distinct names by keeping track of which ones you've already seen. Not only that, you have to at least "look" at every page. Even if multiple pages contain "Smith", you have to open and look at the first entry of the next page to see if there is a different name starting on that page. If it's another "Smith", you then MUST at least look at the first entry of the next page, etc. Those are your page IO's, and you have to open/read each page. "John Flynn" news:48a5b35c-at-forums-1-dub... > Hi. > > This has probably been asked 1E+06 times, and I apologize, but I couldn't > find any info on it. I have gotten the same basic results on a variety of > ASE versions from 11.9.2.6 to 15.0.2 ESD#2 and ESD#5. > > I have a table with a million rows. There is a nonclustered index on a > column. There are 50 distinct values of the column spread throughout the > table. I realize that with so many rows per distinct index value, there > will generally be lots of I/Os even if the index is being used. But still > I am confused when I do any of these queries: > > select distinct my_indexed_column from mytable > select my_indexed_column from mytable group by my_indexed_column > select count(distinct my_indexed_column) from mytable > > I have verified that in all cases my index is being used, and it's a > covering index (i.e. "Base table will not be read".) But each of those > queries costs 11,000 I/Os. It just seems like logically ASE should be able > to extract the 50 distinct values out of the index with a bare minimum of > I/Os (like maybe, I don't know, 10?). Why so many I/Os? > > Thanks. > - John. > > |
|
#3
|
| Hi Kevin, > Because it requires a full index scan (leaf level) of the index to > obtain those 50 values, as it's the leaf level of the index which is > the only level which contains ALL of the values. Thanks for the explanation, I really appreciate it. It makes sense to me now. - John. |
|
#4
|
| If this index was created just as a covering index to support aggregate queries - creating it with ignore dupe key might help..... John Flynn wrote: > Hi Kevin, > >> Because it requires a full index scan (leaf level) of the index to >> obtain those 50 values, as it's the leaf level of the index which is >> the only level which contains ALL of the values. > > Thanks for the explanation, I really appreciate it. It makes sense to me > now. > > - John. > > |
|
#5
|
| ???? Ignore_dup_key is only for creating unique indexes, and only affects inserts/updates that attempt to create duplicate keys. An index on such a column as the OP has is obviously not unique. I'm not sure I understand the recommendation. "Jeff Tallman" news:48ab3962$1-at-forums-1-dub... > If this index was created just as a covering index to support aggregate > queries - creating it with ignore dupe key might help..... > > John Flynn wrote: >> Hi Kevin, >> >>> Because it requires a full index scan (leaf level) of the index to >>> obtain those 50 values, as it's the leaf level of the index which is >>> the only level which contains ALL of the values. >> >> Thanks for the explanation, I really appreciate it. It makes sense to me >> now. >> >> - John. >> |
|
#6
|
| It was a brain freeze while on a concall and trying to look at these - proof that multitasking in humans is impossible - I was thinking it might reject just the index keys - but instead it rejects the entire row (uuugghhhh). Been too long since I used these as usually they are a serious indication of a logic problem.... Oh for the want of sparse indexing!!! Sherlock, Kevin wrote: > ???? > > Ignore_dup_key is only for creating unique indexes, and only affects > inserts/updates that attempt to create duplicate keys. An index on such a > column as the OP has is obviously not unique. I'm not sure I understand the > recommendation. > > "Jeff Tallman" > news:48ab3962$1-at-forums-1-dub... >> If this index was created just as a covering index to support aggregate >> queries - creating it with ignore dupe key might help..... >> >> John Flynn wrote: >>> Hi Kevin, >>> >>>> Because it requires a full index scan (leaf level) of the index to >>>> obtain those 50 values, as it's the leaf level of the index which is >>>> the only level which contains ALL of the values. >>> Thanks for the explanation, I really appreciate it. It makes sense to me >>> now. >>> >>> - John. >>> > |
|
#7
|
| Kevin has provided a great explanation to "the question". Let me add my two cents, as I come across this all the time at cust sites. Of course I am treating your examples as real. > On 2008-08-16 02:48:28 +1000, "John Flynn" > > select distinct my_indexed_column from mytable > select my_indexed_column from mytable group by my_indexed_column If you have 50 distinct values over a 1m row table and you are doing this reasonably often, you really have a normalisation problem here. Low cardinality indices are prime candidates for performance improvement. You may be better off with a Parent table containing the 50 values; which will most probably be a description not the PK; create a proper PK; implement an FK to the child table; and my_indexed_column NOT indexed on the child table. That will reduce the above selects to 50 rows. The 11,000 I/Os to scan the index indicate that you probably have the description in the FK, carried in the child, which is not good. And of course, an Indexed column should never be var length or nullable. > select count(distinct my_indexed_column) from mytable If the performance of that is not acceptable, you can put the count in a column in the parent table, and update it whenever you insert/delete the child. After all, it is a data base of facts, not a heap that needs to be counted again and again. -- Cheers Derek Senior Sybase DBA / Information Architect Copyright © 2008 Software Gems Pty Ltd Quality Standards = Zero Maintenance + Zero Surprises Performance Standards = Predictability + Scaleability |
|
#8
|
| Derek Asirvadem wrote: > If you have 50 distinct values over a 1m row table and you are doing > this reasonably often, you really have a normalisation problem here. > Low cardinality indices are prime candidates for performance > improvement. You may be better off with a Parent table containing the > 50 values; which will most probably be a description not the PK; > create a proper PK; implement an FK to the child table; and > my_indexed_column NOT indexed on the child table. That will reduce > the above selects to 50 rows. If I'm understanding correctly what you're saying, then what you've described is very close to how I already have it. The "50 distinct values" are foreign keys to a master table. The master table is a table of branch offices. There are, say, 70 branch offices. Each branch office has a PK. Meanwhile, the million-row table is a table of sales. Each one is identified with which branch office it "belongs" to, by way of a FK that relates to the PK in the master table. Twenty of the branch offices don't have any sales in the big table. So the sales table only has 50 distinct values of the FK. I want to find out which those 50 are. I don't see how reading my master table of 70 branch offices helps me to get that answer. > The 11,000 I/Os to scan the index indicate that you probably have the > description in the FK, carried in the child, which is not good. No. The FK is a true FK. >> select count(distinct my_indexed_column) from mytable > > If the performance of that is not acceptable, you can put the count in > a column in the parent table, and update it whenever you insert/delete > the child. After all, it is a data base of facts, not a heap that > needs to be counted again and again. Agreed. Thanks. - John. |
|
#9
|
| Going back to your original request, ie, how many branches have (at least 1) record in sales: select b.PK from branches b where exists(select s.FK from sales s where s.FK = b.PK) - the optimizer will table scan branches as the outer table - for each of the 70 PKs in branches an exists() sub-query is run against sales - each exists() sub-query will use the index on s.FK and require a few IOs to return a true/false with just a few IOs John Flynn wrote: > The master table is a table of branch offices. There are, say, 70 branch > offices. Each branch office has a PK. Meanwhile, the million-row table is a > table of sales. Each one is identified with which branch office it "belongs" > to, by way of a FK that relates to the PK in the master table. Twenty of the > branch offices don't have any sales in the big table. So the sales table > only has 50 distinct values of the FK. I want to find out which those 50 > are. I don't see how reading my master table of 70 branch offices helps me > to get that answer. |
|
#10
|
| Thanks Derek and Mark, that is very helpful. |
![]() |
| Thread Tools | |
| Display Modes | |