Monday, March 26, 2012

index question?

I do have a dump table "table1"

Table1:

UniqueId varchar(4)

scandate datetime

courseno varchar(5).

This table is not related to any other tables in the database. This is used to track the student attendance for a particular course. There can 60000 distinct uniqueid's in the table and 300 disticnt courses. I query this table most of the time on uniqueid and coursecrn combination.

This table can grow up to 5 million records.

can anybody tell on what fields do I have to create indexes and also of what type(clustered or non-clustered). This table doesnt have a primary key.

Thanks

Sandeep:

It appears to me that at the very least you need an index based on (1) uniqueID, (2) courseNo and (3) scanDate. This index looks like it might be a candidate for a primary key; however, this also looks like a staging table. If that is the case it might be best to just index it, allow duplicates so that the data can be staged and deal with the duplicates later -- but without more information it's hard to say. I choose uniqueID first simply because it has the higher cardinality. Adding this index reduced logical IO from 11,000 to 3. Time from 313 ms to 0 ms. Test query for testing this index:

select * from table1 where courseNo = '2' and uniqueId = '1'

My guess is that you will also need an index based on (1) courseNo, (2) scanDate and (3) uniqueId. When I ran a query that did not include uniqueId as part of the filtering criteria this index brought a big performance boost because it avoided the table scan. Additing this index reduced logical IO from 11,000 to 9. Time went from 343 ms to 0 ms.

Test query for testing this index:

select * from table1 where courseNo = '2' and scanDate between '9/1/2006' and '9/30/2006'

You might also need an index based on scanDate. This depends on whether or not you have any queries that have scanDate as the filter criteria but do filter based on either courseNo or uniqueId. If I were creating this index I would also include the other two columns in the index to avoid potential bookmark lookups. I didn't create this hypothetical index and also did no associated tests.


Dave

|||

Dave:

I will tell you my database structure. This is simple. I have only three tables.

Student: Uniqueid int (PK), lname varchar(35), fname varchar(35)

courses: Courseno(PK), coursetime, courseday.

Table1(this is dump table): uniqueid, scandate, courseno.

This dump table is not related to student or courses table bc without a record in students table then student can attend a class.

If student attends a class for 45 days in a semester for a particular course then dump table will have 45 records. say if he swipes his id thrice (thinking something wrong ) then there will be three records in the dump table. This dump table is also the final table. If i want to see the attendance of students in a particular course i query this table. do you think I need a table out of this dump table or just adding the indexes if fine for this table.?

as far i understand from ur reply, you suggest to create non-clustered index on uniqueid, scandate, coursecrn?

|||

Sandeep:

This is not one of the indexes that I suggested. I suggested two indexes. I suggest the first index be based on (1) uniqueID, (2) courseNo and (3) scandate; I suggest the second index be based on (1) courseNo, (2) scandate and (3) uniqueID. My knee-jerk reaction is that the indexes should be sufficient for checking attendence of students in classes. It is not possible to give more than a knee-jerk response without knowing the full set of requirements and without knowing the entire usage spectrum of your application and these tables.


Dave

|||

do the order of the fields mentioned in index makes a difference?

|||

Yes, it makes a difference; moreover, there may be a good reason to want to change the order of the columns to meet particular needs of your application. Is there a particular reason that you are potentially interested in creating an index based on the order you stated -- uniqueid, scandate, coursecrn?

I chose a different order because of your first statement that " ... I query this table most of the time on uniqueid and coursecrn combination. ..." The order I chose provides the narrowest target for a query that does not include scandate as part of the filter criteria.


Dave

|||i just mentioned the order as an example...I will create two non-clustered indexes as you suggested. one more question: say, If the dump table has 10 million records and when we insert 300 more records to that ( as the table has indexes and it will re-organize the table when a new record is inserted) am i going to have a peformance issue? or not much|||

I don't think you will see much negative performance impact as long as you perform normal periodic maintenance. I was just working on something else and I realized something. It seems to me that if scanDate is a true datetime variable -- that is, the time portion of the data is generated and included -- that the logical key to your table is (1) uniqueID and (2) scanDate. The reason I believe this is that it is not physically possible for a student to be doing more than 1 badge scan at any one time. If that is true, then there is reason to set up the primary key based on these two columns.


Dave

|||

ya scandate is datetime datatype.

is this the final solution you suggest:

1. create primary key on uniqueid and scandate( is it clustered or non-clustered)

2. non-clustered on uniqueid, scandate, coursrno

3. non-clustered on courseno, scandate, uniqueid

can you please correct me if i am wrong?

|||

Sandeep:

What we need to do is run a query to verify whether or not the primary key can be applied. If it can, I dont think we will need index #2. Try running this query:

select uniqueID,
scanDate,
count(*)
from table1
group by uniqueID, scanDate
having count(*) > 1

if no rows are returned, then the primary key will work; if it will not, then we need to forget about the primary key.


Dave

|||

dave,

it doesnt return any rows so i think we can create primary key on uniqueid and scandate.

one more question: why dont we need this non-clustered index on courseno, scandate, uniqueid if we can create the primary key? sorry to ask you many questions, I am new to this indexing...

Thanks

|||

Sandeep:

I think that you DO still need index you referred to as number 3 -- the one based on (1) courseno, (2), scandate and (3) uniqueID. To some extent I am "guessing" at this because you sand that "most" of your queries were based on courseno and uniqueID. Again, when I tested a select that filtered records based on courseNo and a date range but did NOT filter based on uniqueID the select performed a table scan. What that means is that SQL Server examines each and every one of the 3 million records (my mock-up was 3 million instead of 10 million) to see if it selected by our filtering criteria. When we create the index, the number of records examined is reduced from 3 million to something like 400 -- and that is a huge reduction.

The real issue is whether or not we need the index you referred to as number 2. I have a good deal of doubt that it is worth the price of the trade-off. Also, if it is needed we can always defer that decision and add it in if it turns out to be especially beneficial. The issue with index #2 is that it is almost the same as the primary key. While there might be some queries that will benefit a little from index #2 it is likely that most of these queries will still run almost as fast off the primary key. Hence, the benefit of index number 2 might not be great.

The other factor for considering an index is the cost of the index. This table will have 10 million rows. There is an overhead cost of maintaing any index. In this case I doubt that the potential gain -- probably a minimal gain -- that will come with index #2 is not worth the cost of the index. That would NOT be the case if there was no primary key!

I will run a couple of quick tests tomorrow morning to put some numbers together to do an actual comparison between the primary key and index #2 so that you can make an appraisal.


Dave

|||

Sandeep:

I used my mockup table with the primary key composed of (1) uniqueID and (2) scandate and the index composed of (1) courseNo, (2) uniqueID and (3) scanDate and loaded the mockup table with 32767 different uniqueID and 300 different courseNo with about 90 different scan dates to create a mockup table with about 3 million rows. I used this mockup for the basis of my tests.


First I tested with this query:

select * from table1 where uniqueId = '1' and scanDate between '9/1/2006' and '9/30/2006'

This query took 0 ms with 3 logical reads; the query plan was based on a clustered index seek, therefore, filtering was efficient. 22 rows were returned.


I next tested this query:

select * from table1 where courseNo = '1' and scanDate between '9/1/2006' and '9/30/2006'

This query took 0 ms with 10 logical reads; the query plan was based on an index seek of the nonclustered index, therefore, filtering was efficient. 2290 rows were returned.

Finally, I tested this query:

select * from table1 where courseNo = '2' and uniqueId = '1'

This query took 0 ms with 3 logical reads; the query plan was based on a clustered index seek, therefore, filtering was efficient. 92 riws were returned.


After this, I added the index based on (1) uniqueId, (2) scanDate and (3) courseNo and reran all of the queries. The index plans of two of the queries was changed to use this index; however, there was no measurable difference. Execution time was still 0 ms and the required amount of logical reads did not change.


Dave

|||

Dave...Thanks a lot for your effort. what is the meaning of logical read?.

"the index composed of (1) courseNo, (2) uniqueID and (3) scanDate " is the order correct? because yday we discussed tht the index should be on courseno, scandate, uniqueid....

|||

Sandeep:

A "logical read" is one of the measurements of the work performed to satisfy a query. In this case the "3 logical reads" is three seeks of 8K data pages or 24K -- a very small amount of work. When we were performing either the table scan or the clustered index scan -- that is when we read without any indexes or a primary key -- we were reading 11,000 logical reads -- but my mockup was for 3 million rows of data instead of 10 million rows of data. 10 million rows of data would translate into about 38,000 logical reads -- about 308 Meg of data as opposed to 24K. This is a difference of a very large magnitude. This difference really becomes our primary motivation for indexing.

Related to the index needs:

The reason that the index needs to be based on (1) courseNo, (2) uniqueID and (3) scandate is to support queries that filter based on courseNo and uniqueID. If the order is changed to (1) courseNo, (2) scandate and (3) uniqueID there will be an increase in the number of logical reads required to fetch the data for this kind of search.


Dave

No comments:

Post a Comment