TOP (1) by group on a very huge (100,000,000+ rows) table
Setup
I have a huge table of ~115,382,254 rows.
The table is relatively simple and logs application process operations.
CREATE TABLE [data].[OperationData](
[SourceDeviceID] [bigint] NOT NULL,
[FileSource] [nvarchar](256) NOT NULL,
[Size] [bigint] NULL,
[Begin] [datetime2](7) NULL,
[End] [datetime2](7) NOT NULL,
[Date] AS (isnull(CONVERT([date],[End]),CONVERT([date],'19000101',(112)))) PERSISTED NOT NULL,
[DataSetCount] [bigint] NULL,
[Result] [int] NULL,
[Error] [nvarchar](max) NULL,
[Status] [int] NULL,
CONSTRAINT [PK_OperationData] PRIMARY KEY CLUSTERED
(
[SourceDeviceID] ASC,
[FileSource] ASC,
[End] ASC
))
CREATE TABLE [model].[SourceDevice](
[ID] [bigint] IDENTITY(1,1) NOT NULL,
[Name] [nvarchar](50) NULL,
CONSTRAINT [PK_DataLogger] PRIMARY KEY CLUSTERED
(
[ID] ASC
))
ALTER TABLE [data].[OperationData] WITH CHECK ADD CONSTRAINT [FK_OperationData_SourceDevice] FOREIGN KEY([SourceDeviceID])
REFERENCES [model].[SourceDevice] ([ID])
The table is partitioned into around 500 partitions, split on a daily basis.
Also, the table is well indexed by its PK, statistics are up to date, and the indexes are defragmented every night.
Index-based SELECTs are lightning fast, and we have had no problems with them.
Problem
I need to know the last (TOP) row by [End], partitioned by [SourceDeviceID]: that is, the very last [OperationData] row of every source device.
Question
I need to find a way to solve this efficiently, without pushing the DB to its limits.
Effort 1
The first try was the obvious GROUP BY or SELECT ... OVER (PARTITION BY ...) query. The problem here is also obvious: every query has to scan every partition to order the rows and find the top one. So the query is very slow and has a very high IO impact.
Example query 1
;WITH cte AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY [SourceDeviceID] ORDER BY [End] DESC) AS rn
FROM [data].[OperationData]
)
SELECT *
FROM cte
WHERE rn = 1
Example query 2
SELECT *
FROM [data].[OperationData] AS d
CROSS APPLY
(
SELECT TOP 1 *
FROM [data].[OperationData] AS di
WHERE di.[SourceDeviceID] = d.[SourceDeviceID]
ORDER BY di.[End] DESC
) AS ds
FAILED!
Effort 2
I created a helper table that always holds a reference to the TOP row.
CREATE TABLE [data].[LastOperationData](
[SourceDeviceID] [bigint] NOT NULL,
[FileSource] [nvarchar](256) NOT NULL,
[End] [datetime2](7) NOT NULL,
CONSTRAINT [PK_LastOperationData] PRIMARY KEY CLUSTERED
(
[SourceDeviceID] ASC
))
ALTER TABLE [data].[LastOperationData] WITH CHECK ADD CONSTRAINT [FK_LastOperationData_OperationData] FOREIGN KEY([SourceDeviceID], [FileSource], [End])
REFERENCES [data].[OperationData] ([SourceDeviceID], [FileSource], [End])
To fill the table, I created a trigger that adds/updates the row whenever a higher [End] value is inserted.
CREATE TRIGGER [data].[OperationData_Last]
ON [data].[OperationData]
AFTER INSERT
AS
BEGIN
SET NOCOUNT ON;
MERGE [data].[LastOperationData] AS [target]
USING (SELECT [SourceDeviceID], [FileSource], [End] FROM inserted) AS [source] ([SourceDeviceID], [FileSource], [End])
ON ([target].[SourceDeviceID] = [source].[SourceDeviceID])
WHEN MATCHED AND [target].[End] < [source].[End] THEN
UPDATE SET [target].[FileSource] = [source].[FileSource], [target].[End] = [source].[End]
WHEN NOT MATCHED THEN
INSERT ([SourceDeviceID], [FileSource], [End])
VALUES ([source].[SourceDeviceID], [source].[FileSource], [source].[End]);
END
The problem here is that it also has a very high IO impact, and I don't know why.
As you can see in the query plan, it also executes a scan over the whole [OperationData] table, which has a huge overall impact on my DB.
FAILED!
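As an aside, the MERGE above assumes inserted contains at most one row per [SourceDeviceID]; a multi-row INSERT with several rows for the same device can match the same target row more than once and make the MERGE fail. Below is a minimal sketch of a multi-row-safe variant; the pre-aggregation step is an assumption added for illustration, not part of the original trigger.

MERGE [data].[LastOperationData] AS [target]
USING (
    -- Sketch only: keep just the newest inserted row per device before merging,
    -- so one multi-row INSERT cannot hit the same target row twice.
    SELECT [SourceDeviceID], [FileSource], [End]
    FROM (
        SELECT i.[SourceDeviceID], i.[FileSource], i.[End],
               ROW_NUMBER() OVER (PARTITION BY i.[SourceDeviceID] ORDER BY i.[End] DESC) AS rn
        FROM inserted AS i
    ) AS x
    WHERE x.rn = 1
) AS [source]
ON ([target].[SourceDeviceID] = [source].[SourceDeviceID])
WHEN MATCHED AND [target].[End] < [source].[End] THEN
    UPDATE SET [target].[FileSource] = [source].[FileSource], [target].[End] = [source].[End]
WHEN NOT MATCHED THEN
    INSERT ([SourceDeviceID], [FileSource], [End])
    VALUES ([source].[SourceDeviceID], [source].[FileSource], [source].[End]);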
sql-server index partitioning azure-sql-database
In your first code block I can't see where the first column of the clustered index is coming from - is it right?
– George.Palacios
Nov 28 '18 at 10:19
Yes, sorry, SSMS does not include it in the CREATE TABLE script, but inside the query plan you will see the partitions. I will edit the question.
– Steffen Mangold
Nov 28 '18 at 10:23
Not an extra index, because it is included inside the PRIMARY KEY CLUSTERED. You think it may help?
– Steffen Mangold
Nov 28 '18 at 10:27
Sorry, that was an error; I modified the names for the question to be more clear, and I corrected it.
– Steffen Mangold
Nov 28 '18 at 10:30
@ypercubeᵀᴹ yes, because the SELECT [SourceID], [Source], [End] FROM inserted somehow does a table scan on [OperationData].
– Steffen Mangold
Nov 28 '18 at 10:32
1 Answer
If you have a table of SourceID values, and an index on your main table on (SourceID, End) include (othercolumns), just use OUTER APPLY.
SELECT d.*
FROM dbo.Sources s
OUTER APPLY (SELECT TOP (1) *
             FROM data.OperationData d
             WHERE d.SourceID = s.SourceID
             ORDER BY d.[End] DESC) d;
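For completeness, here is a minimal sketch of the kind of supporting index the answer describes. It uses the answer's simplified names (SourceID and data.OperationData; in the question's schema the key column corresponds to SourceDeviceID), and the index name and INCLUDE list are assumptions — include whatever columns your SELECT actually needs.

-- Sketch of the suggested (SourceID, End) INCLUDE (...) index; names/columns are illustrative.
CREATE NONCLUSTERED INDEX IX_OperationData_SourceID_End
ON data.OperationData (SourceID, [End] DESC)
INCLUDE (FileSource, Size, DataSetCount, Result, Status);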
If you know you’re only after your newest partition, you could include a filter on End, like AND d.[End] > DATEADD(day, -1, GETDATE()).
Edit: Because your clustered index is on (SourceID, Source, End), put Source into your Sources table too and join on that as well. Then you don’t need the new index.
SELECT d.*
FROM dbo.Sources s -- Small table
OUTER APPLY (SELECT TOP (1) *
             FROM data.OperationData d -- Big table quick seeks
             WHERE d.SourceID = s.SourceID
             AND d.Source = s.Source
             AND d.[End] > DATEADD(day, -1, GETDATE()) -- If you’re partitioning on [End], do this for partition elimination
             ORDER BY d.[End] DESC) d;
The index truly sped up the query. A second problem, though, is that an unpartitioned index on such a huge table is nearly unmaintainable. On all of our "big data" tables we work with partitioned indexes, because they can be maintained online, partition by partition. But as soon as the index is partitioned, the old problem returns: the query has to run through every partition.
– Steffen Mangold
Nov 28 '18 at 15:37
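(Editorial sketch of the partition-by-partition online maintenance mentioned in the comment above; the index name follows the earlier illustrative sketch and the partition number is arbitrary.)

-- Rebuild a single partition online (supported on SQL Server 2014+ and Azure SQL Database).
ALTER INDEX IX_OperationData_SourceID_End ON data.OperationData
REBUILD PARTITION = 42 WITH (ONLINE = ON);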
@SteffenMangold: The less data in an index the better (as long as it has everything you need) and excluding materialized views, the clustered index has the maximum amount of data possible. Clustered indexes are present because getting all of the data by the key is the norm. In this case you are getting all of the data, but you aren’t really getting it by the key, you are getting it by part of the key. You need an index that can be queried with part of the key.
– jmoreno
Nov 28 '18 at 15:38
I'm really sorry, but there is a Source table referencing the sourceID column. The source column is only a file name; it's a little confusing naming. For each Source device (sourceID) there can be only one single entry for one file source (column) at one timestamp. Also, I can't do partition elimination because the newest End is wildly fragmented. That's why I came up with the trigger solution. I think a live query will not work here.
– Steffen Mangold
Nov 28 '18 at 21:25
@Rob Farley I edited the question to be more clear
– Steffen Mangold
Nov 28 '18 at 21:30
With partitioning, you’ll find it does all those seeks into each partition. With the extra predicate, you can make it so that it doesn’t bother with all of them, and only does some. Make it a month if you need to.
– Rob Farley
Nov 29 '18 at 3:46