TOP(1) BY GROUP of very huge (100,000,000+) table












8














Setup



I have a huge table of ~115,382,254 rows.
The table is relatively simple and logs application process operations.



CREATE TABLE [data].[OperationData](
[SourceDeciveID] [bigint] NOT NULL,
[FileSource] [nvarchar](256) NOT NULL,
[Size] [bigint] NULL,
[Begin] [datetime2](7) NULL,
[End] [datetime2](7) NOT NULL,
[Date] AS (isnull(CONVERT([date],[End]),CONVERT([date],'19000101',(112)))) PERSISTED NOT NULL,
[DataSetCount] [bigint] NULL,
[Result] [int] NULL,
[Error] [nvarchar](max) NULL,
[Status] [int] NULL,
CONSTRAINT [PK_OperationData] PRIMARY KEY CLUSTERED
(
[SourceDeviceID] ASC,
[FileSource] ASC,
[End] ASC
))

CREATE TABLE [model].[SourceDevice](
[ID] [bigint] IDENTITY(1,1) NOT NULL,
[Name] [nvarchar](50) NULL,
CONSTRAINT [PK_DataLogger] PRIMARY KEY CLUSTERED
(
[ID] ASC
))

ALTER TABLE [data].[OperationData] WITH CHECK ADD CONSTRAINT [FK_OperationData_SourceDevice] FOREIGN KEY([SourceDeviceID])
REFERENCES [model].[SourceDevice] ([ID])


The table is clustered at around 500 clusters and on a daily base.



partitions



enter image description here



Also, the table is well indexed by PK, statistics are up-to-date and the INDEXer get defraged every night.



Index based SELECTs are lightning fast and we had no problem with it.



Problem



I need to know the last (TOP) row by [End] and partitioned by [SourceDeciveID]. To get the very last [OperationData] of every source device.



Question



I need to find a way to solve this in a good way and without bringing the DB to the limits.





Effort 1



The first try was obvious GROUP BY or SELECT OVER PARTITION BY query. The problem here is also obvious, every query has to scan over very partition order/find the top row. So the query is very slow and has a very high IO impact.



Example query 1



;WITH cte AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY [SourceDeciveID] ORDER BY [End] DESC) AS rn
FROM [data].[OperationData]
)
SELECT *
FROM cte
WHERE rn = 1


Example query 2



SELECT *
FROM [data].[OperationData] AS d
CROSS APPLY
(
SELECT TOP 1 *
FROM [data].[OperationData]
WHERE [SourceDeciveID] = d.[SourceDeciveID]
ORDER BY [End] DESC
) AS ds


FAILED!



Effort 2



I created a help table to always hold a reference to TOP row.



CREATE TABLE [data].[LastOperationData](
[SourceDeciveID] [bigint] NOT NULL,
[FileSource] [nvarchar](256) NOT NULL,
[End] [datetime2](7) NOT NULL,
CONSTRAINT [PK_LastOperationData] PRIMARY KEY CLUSTERED
(
[SourceDeciveID] ASC
)

ALTER TABLE [data].[LastOperationData] WITH CHECK ADD CONSTRAINT [FK_LastOperationData_OperationData] FOREIGN KEY([SourceDeciveID], [FileSource], [End])
REFERENCES [data].[OperationData] ([SourceDeciveID], [FileSource], [End])


To fill up the table a created a trigger to always add/update the source row if higher [End] column is inserted.



CREATE TRIGGER [data].[OperationData_Last]
ON [data].[OperationData]
AFTER INSERT
AS
BEGIN
SET NOCOUNT ON;

MERGE [data].[LastOperationData] AS [target]
USING (SELECT [SourceDeciveID], [FileSource], [End] FROM inserted) AS [source] ([SourceDeciveID], [FileSource], [End])
ON ([target].[SourceDeciveID] = [FileSource].[SourceDeciveID])

WHEN MATCHED AND [target].[End] < [source].[End] THEN
UPDATE SET [target].[FileSource] = source.[FileSource], [target].[End] = source.[End]

WHEN NOT MATCHED THEN
INSERT ([SourceDeciveID], [FileSource], [End])
VALUES (source.[SourceDeciveID], source.[FileSource], source.[End]);

END


The problem here is, that it also has a very huge IO impact and I don't know why.



As you can see here in the query plan it also executes a scan over the whole [OperationData] table.



It has a huge overall impact on my DB.
stats



FAILED!










share|improve this question




















  • 2




    In your first code block I can't see where the first column of the clustered index is coming from - is it right?
    – George.Palacios
    Nov 28 '18 at 10:19










  • Yes sorry SSMS does not include it into the CREATE TABLE script but inside the query plan you will see the partitions. I will edit the question.
    – Steffen Mangold
    Nov 28 '18 at 10:23










  • Not an extra index because included inside the PRIMARY KEY CLUSTERED you think it may help?
    – Steffen Mangold
    Nov 28 '18 at 10:27










  • Soryy that was an error, I modified the names for the question to by more clear, I corrected it.
    – Steffen Mangold
    Nov 28 '18 at 10:30










  • @ypercubeᵀᴹ yes because the SELECT [SourceID], [Source], [End] FROM inserted some how do a table scan on the [OperationData].
    – Steffen Mangold
    Nov 28 '18 at 10:32
















8














Setup



I have a huge table of ~115,382,254 rows.
The table is relatively simple and logs application process operations.



CREATE TABLE [data].[OperationData](
[SourceDeciveID] [bigint] NOT NULL,
[FileSource] [nvarchar](256) NOT NULL,
[Size] [bigint] NULL,
[Begin] [datetime2](7) NULL,
[End] [datetime2](7) NOT NULL,
[Date] AS (isnull(CONVERT([date],[End]),CONVERT([date],'19000101',(112)))) PERSISTED NOT NULL,
[DataSetCount] [bigint] NULL,
[Result] [int] NULL,
[Error] [nvarchar](max) NULL,
[Status] [int] NULL,
CONSTRAINT [PK_OperationData] PRIMARY KEY CLUSTERED
(
[SourceDeviceID] ASC,
[FileSource] ASC,
[End] ASC
))

CREATE TABLE [model].[SourceDevice](
[ID] [bigint] IDENTITY(1,1) NOT NULL,
[Name] [nvarchar](50) NULL,
CONSTRAINT [PK_DataLogger] PRIMARY KEY CLUSTERED
(
[ID] ASC
))

ALTER TABLE [data].[OperationData] WITH CHECK ADD CONSTRAINT [FK_OperationData_SourceDevice] FOREIGN KEY([SourceDeviceID])
REFERENCES [model].[SourceDevice] ([ID])


The table is clustered at around 500 clusters and on a daily base.



partitions



enter image description here



Also, the table is well indexed by PK, statistics are up-to-date and the INDEXer get defraged every night.



Index based SELECTs are lightning fast and we had no problem with it.



Problem



I need to know the last (TOP) row by [End] and partitioned by [SourceDeciveID]. To get the very last [OperationData] of every source device.



Question



I need to find a way to solve this in a good way and without bringing the DB to the limits.





Effort 1



The first try was obvious GROUP BY or SELECT OVER PARTITION BY query. The problem here is also obvious, every query has to scan over very partition order/find the top row. So the query is very slow and has a very high IO impact.



Example query 1



;WITH cte AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY [SourceDeciveID] ORDER BY [End] DESC) AS rn
FROM [data].[OperationData]
)
SELECT *
FROM cte
WHERE rn = 1


Example query 2



SELECT *
FROM [data].[OperationData] AS d
CROSS APPLY
(
SELECT TOP 1 *
FROM [data].[OperationData]
WHERE [SourceDeciveID] = d.[SourceDeciveID]
ORDER BY [End] DESC
) AS ds


FAILED!



Effort 2



I created a help table to always hold a reference to TOP row.



CREATE TABLE [data].[LastOperationData](
[SourceDeciveID] [bigint] NOT NULL,
[FileSource] [nvarchar](256) NOT NULL,
[End] [datetime2](7) NOT NULL,
CONSTRAINT [PK_LastOperationData] PRIMARY KEY CLUSTERED
(
[SourceDeciveID] ASC
)

ALTER TABLE [data].[LastOperationData] WITH CHECK ADD CONSTRAINT [FK_LastOperationData_OperationData] FOREIGN KEY([SourceDeciveID], [FileSource], [End])
REFERENCES [data].[OperationData] ([SourceDeciveID], [FileSource], [End])


To fill up the table a created a trigger to always add/update the source row if higher [End] column is inserted.



CREATE TRIGGER [data].[OperationData_Last]
ON [data].[OperationData]
AFTER INSERT
AS
BEGIN
SET NOCOUNT ON;

MERGE [data].[LastOperationData] AS [target]
USING (SELECT [SourceDeciveID], [FileSource], [End] FROM inserted) AS [source] ([SourceDeciveID], [FileSource], [End])
ON ([target].[SourceDeciveID] = [FileSource].[SourceDeciveID])

WHEN MATCHED AND [target].[End] < [source].[End] THEN
UPDATE SET [target].[FileSource] = source.[FileSource], [target].[End] = source.[End]

WHEN NOT MATCHED THEN
INSERT ([SourceDeciveID], [FileSource], [End])
VALUES (source.[SourceDeciveID], source.[FileSource], source.[End]);

END


The problem here is, that it also has a very huge IO impact and I don't know why.



As you can see here in the query plan it also executes a scan over the whole [OperationData] table.



It has a huge overall impact on my DB.
stats



FAILED!










share|improve this question




















  • 2




    In your first code block I can't see where the first column of the clustered index is coming from - is it right?
    – George.Palacios
    Nov 28 '18 at 10:19










  • Yes sorry SSMS does not include it into the CREATE TABLE script but inside the query plan you will see the partitions. I will edit the question.
    – Steffen Mangold
    Nov 28 '18 at 10:23










  • Not an extra index because included inside the PRIMARY KEY CLUSTERED you think it may help?
    – Steffen Mangold
    Nov 28 '18 at 10:27










  • Soryy that was an error, I modified the names for the question to by more clear, I corrected it.
    – Steffen Mangold
    Nov 28 '18 at 10:30










  • @ypercubeᵀᴹ yes because the SELECT [SourceID], [Source], [End] FROM inserted some how do a table scan on the [OperationData].
    – Steffen Mangold
    Nov 28 '18 at 10:32














8












8








8


3





Setup



I have a huge table of ~115,382,254 rows.
The table is relatively simple and logs application process operations.



CREATE TABLE [data].[OperationData](
[SourceDeciveID] [bigint] NOT NULL,
[FileSource] [nvarchar](256) NOT NULL,
[Size] [bigint] NULL,
[Begin] [datetime2](7) NULL,
[End] [datetime2](7) NOT NULL,
[Date] AS (isnull(CONVERT([date],[End]),CONVERT([date],'19000101',(112)))) PERSISTED NOT NULL,
[DataSetCount] [bigint] NULL,
[Result] [int] NULL,
[Error] [nvarchar](max) NULL,
[Status] [int] NULL,
CONSTRAINT [PK_OperationData] PRIMARY KEY CLUSTERED
(
[SourceDeviceID] ASC,
[FileSource] ASC,
[End] ASC
))

CREATE TABLE [model].[SourceDevice](
[ID] [bigint] IDENTITY(1,1) NOT NULL,
[Name] [nvarchar](50) NULL,
CONSTRAINT [PK_DataLogger] PRIMARY KEY CLUSTERED
(
[ID] ASC
))

ALTER TABLE [data].[OperationData] WITH CHECK ADD CONSTRAINT [FK_OperationData_SourceDevice] FOREIGN KEY([SourceDeviceID])
REFERENCES [model].[SourceDevice] ([ID])


The table is clustered at around 500 clusters and on a daily base.



partitions



enter image description here



Also, the table is well indexed by PK, statistics are up-to-date and the INDEXer get defraged every night.



Index based SELECTs are lightning fast and we had no problem with it.



Problem



I need to know the last (TOP) row by [End] and partitioned by [SourceDeciveID]. To get the very last [OperationData] of every source device.



Question



I need to find a way to solve this in a good way and without bringing the DB to the limits.





Effort 1



The first try was obvious GROUP BY or SELECT OVER PARTITION BY query. The problem here is also obvious, every query has to scan over very partition order/find the top row. So the query is very slow and has a very high IO impact.



Example query 1



;WITH cte AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY [SourceDeciveID] ORDER BY [End] DESC) AS rn
FROM [data].[OperationData]
)
SELECT *
FROM cte
WHERE rn = 1


Example query 2



SELECT *
FROM [data].[OperationData] AS d
CROSS APPLY
(
SELECT TOP 1 *
FROM [data].[OperationData]
WHERE [SourceDeciveID] = d.[SourceDeciveID]
ORDER BY [End] DESC
) AS ds


FAILED!



Effort 2



I created a help table to always hold a reference to TOP row.



CREATE TABLE [data].[LastOperationData](
[SourceDeciveID] [bigint] NOT NULL,
[FileSource] [nvarchar](256) NOT NULL,
[End] [datetime2](7) NOT NULL,
CONSTRAINT [PK_LastOperationData] PRIMARY KEY CLUSTERED
(
[SourceDeciveID] ASC
)

ALTER TABLE [data].[LastOperationData] WITH CHECK ADD CONSTRAINT [FK_LastOperationData_OperationData] FOREIGN KEY([SourceDeciveID], [FileSource], [End])
REFERENCES [data].[OperationData] ([SourceDeciveID], [FileSource], [End])


To fill up the table a created a trigger to always add/update the source row if higher [End] column is inserted.



CREATE TRIGGER [data].[OperationData_Last]
ON [data].[OperationData]
AFTER INSERT
AS
BEGIN
SET NOCOUNT ON;

MERGE [data].[LastOperationData] AS [target]
USING (SELECT [SourceDeciveID], [FileSource], [End] FROM inserted) AS [source] ([SourceDeciveID], [FileSource], [End])
ON ([target].[SourceDeciveID] = [FileSource].[SourceDeciveID])

WHEN MATCHED AND [target].[End] < [source].[End] THEN
UPDATE SET [target].[FileSource] = source.[FileSource], [target].[End] = source.[End]

WHEN NOT MATCHED THEN
INSERT ([SourceDeciveID], [FileSource], [End])
VALUES (source.[SourceDeciveID], source.[FileSource], source.[End]);

END


The problem here is, that it also has a very huge IO impact and I don't know why.



As you can see here in the query plan it also executes a scan over the whole [OperationData] table.



It has a huge overall impact on my DB.
stats



FAILED!










share|improve this question















Setup



I have a huge table of ~115,382,254 rows.
The table is relatively simple and logs application process operations.



CREATE TABLE [data].[OperationData](
[SourceDeciveID] [bigint] NOT NULL,
[FileSource] [nvarchar](256) NOT NULL,
[Size] [bigint] NULL,
[Begin] [datetime2](7) NULL,
[End] [datetime2](7) NOT NULL,
[Date] AS (isnull(CONVERT([date],[End]),CONVERT([date],'19000101',(112)))) PERSISTED NOT NULL,
[DataSetCount] [bigint] NULL,
[Result] [int] NULL,
[Error] [nvarchar](max) NULL,
[Status] [int] NULL,
CONSTRAINT [PK_OperationData] PRIMARY KEY CLUSTERED
(
[SourceDeviceID] ASC,
[FileSource] ASC,
[End] ASC
))

CREATE TABLE [model].[SourceDevice](
[ID] [bigint] IDENTITY(1,1) NOT NULL,
[Name] [nvarchar](50) NULL,
CONSTRAINT [PK_DataLogger] PRIMARY KEY CLUSTERED
(
[ID] ASC
))

ALTER TABLE [data].[OperationData] WITH CHECK ADD CONSTRAINT [FK_OperationData_SourceDevice] FOREIGN KEY([SourceDeviceID])
REFERENCES [model].[SourceDevice] ([ID])


The table is clustered at around 500 clusters and on a daily base.



partitions



enter image description here



Also, the table is well indexed by PK, statistics are up-to-date and the INDEXer get defraged every night.



Index based SELECTs are lightning fast and we had no problem with it.



Problem



I need to know the last (TOP) row by [End] and partitioned by [SourceDeciveID]. To get the very last [OperationData] of every source device.



Question



I need to find a way to solve this in a good way and without bringing the DB to the limits.





Effort 1



The first try was obvious GROUP BY or SELECT OVER PARTITION BY query. The problem here is also obvious, every query has to scan over very partition order/find the top row. So the query is very slow and has a very high IO impact.



Example query 1



;WITH cte AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY [SourceDeciveID] ORDER BY [End] DESC) AS rn
FROM [data].[OperationData]
)
SELECT *
FROM cte
WHERE rn = 1


Example query 2



SELECT *
FROM [data].[OperationData] AS d
CROSS APPLY
(
SELECT TOP 1 *
FROM [data].[OperationData]
WHERE [SourceDeciveID] = d.[SourceDeciveID]
ORDER BY [End] DESC
) AS ds


FAILED!



Effort 2



I created a help table to always hold a reference to TOP row.



CREATE TABLE [data].[LastOperationData](
[SourceDeciveID] [bigint] NOT NULL,
[FileSource] [nvarchar](256) NOT NULL,
[End] [datetime2](7) NOT NULL,
CONSTRAINT [PK_LastOperationData] PRIMARY KEY CLUSTERED
(
[SourceDeciveID] ASC
)

ALTER TABLE [data].[LastOperationData] WITH CHECK ADD CONSTRAINT [FK_LastOperationData_OperationData] FOREIGN KEY([SourceDeciveID], [FileSource], [End])
REFERENCES [data].[OperationData] ([SourceDeciveID], [FileSource], [End])


To fill up the table a created a trigger to always add/update the source row if higher [End] column is inserted.



CREATE TRIGGER [data].[OperationData_Last]
ON [data].[OperationData]
AFTER INSERT
AS
BEGIN
SET NOCOUNT ON;

MERGE [data].[LastOperationData] AS [target]
USING (SELECT [SourceDeciveID], [FileSource], [End] FROM inserted) AS [source] ([SourceDeciveID], [FileSource], [End])
ON ([target].[SourceDeciveID] = [FileSource].[SourceDeciveID])

WHEN MATCHED AND [target].[End] < [source].[End] THEN
UPDATE SET [target].[FileSource] = source.[FileSource], [target].[End] = source.[End]

WHEN NOT MATCHED THEN
INSERT ([SourceDeciveID], [FileSource], [End])
VALUES (source.[SourceDeciveID], source.[FileSource], source.[End]);

END


The problem here is, that it also has a very huge IO impact and I don't know why.



As you can see here in the query plan it also executes a scan over the whole [OperationData] table.



It has a huge overall impact on my DB.
stats



FAILED!







sql-server index partitioning azure-sql-database






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 28 '18 at 21:35

























asked Nov 28 '18 at 10:13









Steffen Mangold

397113




397113








  • 2




    In your first code block I can't see where the first column of the clustered index is coming from - is it right?
    – George.Palacios
    Nov 28 '18 at 10:19










  • Yes sorry SSMS does not include it into the CREATE TABLE script but inside the query plan you will see the partitions. I will edit the question.
    – Steffen Mangold
    Nov 28 '18 at 10:23










  • Not an extra index because included inside the PRIMARY KEY CLUSTERED you think it may help?
    – Steffen Mangold
    Nov 28 '18 at 10:27










  • Soryy that was an error, I modified the names for the question to by more clear, I corrected it.
    – Steffen Mangold
    Nov 28 '18 at 10:30










  • @ypercubeᵀᴹ yes because the SELECT [SourceID], [Source], [End] FROM inserted some how do a table scan on the [OperationData].
    – Steffen Mangold
    Nov 28 '18 at 10:32














  • 2




    In your first code block I can't see where the first column of the clustered index is coming from - is it right?
    – George.Palacios
    Nov 28 '18 at 10:19










  • Yes sorry SSMS does not include it into the CREATE TABLE script but inside the query plan you will see the partitions. I will edit the question.
    – Steffen Mangold
    Nov 28 '18 at 10:23










  • Not an extra index because included inside the PRIMARY KEY CLUSTERED you think it may help?
    – Steffen Mangold
    Nov 28 '18 at 10:27










  • Soryy that was an error, I modified the names for the question to by more clear, I corrected it.
    – Steffen Mangold
    Nov 28 '18 at 10:30










  • @ypercubeᵀᴹ yes because the SELECT [SourceID], [Source], [End] FROM inserted some how do a table scan on the [OperationData].
    – Steffen Mangold
    Nov 28 '18 at 10:32








2




2




In your first code block I can't see where the first column of the clustered index is coming from - is it right?
– George.Palacios
Nov 28 '18 at 10:19




In your first code block I can't see where the first column of the clustered index is coming from - is it right?
– George.Palacios
Nov 28 '18 at 10:19












Yes sorry SSMS does not include it into the CREATE TABLE script but inside the query plan you will see the partitions. I will edit the question.
– Steffen Mangold
Nov 28 '18 at 10:23




Yes sorry SSMS does not include it into the CREATE TABLE script but inside the query plan you will see the partitions. I will edit the question.
– Steffen Mangold
Nov 28 '18 at 10:23












Not an extra index because included inside the PRIMARY KEY CLUSTERED you think it may help?
– Steffen Mangold
Nov 28 '18 at 10:27




Not an extra index because included inside the PRIMARY KEY CLUSTERED you think it may help?
– Steffen Mangold
Nov 28 '18 at 10:27












Soryy that was an error, I modified the names for the question to by more clear, I corrected it.
– Steffen Mangold
Nov 28 '18 at 10:30




Soryy that was an error, I modified the names for the question to by more clear, I corrected it.
– Steffen Mangold
Nov 28 '18 at 10:30












@ypercubeᵀᴹ yes because the SELECT [SourceID], [Source], [End] FROM inserted some how do a table scan on the [OperationData].
– Steffen Mangold
Nov 28 '18 at 10:32




@ypercubeᵀᴹ yes because the SELECT [SourceID], [Source], [End] FROM inserted some how do a table scan on the [OperationData].
– Steffen Mangold
Nov 28 '18 at 10:32










1 Answer
1






active

oldest

votes


















9














If you have a table of SourceID values, and an index on your main table on (SourceID, End) include (othercolumns), just use OUTER APPLY.



SELECT d.*
FROM dbo.Sources s
OUTER APPLY (SELECT TOP (1) *
FROM data.OperationData d
WHERE d.SourceID = s.SourceID
ORDER BY d.[End] DESC) d;


If you know you’re only after your newest partition, you could include a filter on End, like AND d.[End] > DATEADD(day, -1, GETDATE())



Edit: Because your clustered index is on SourceID, Source, End), put Source into your Sources table too and join on that as well. Then you don’t need the new index.



SELECT d.*
FROM dbo.Sources s -- Small table
OUTER APPLY (SELECT TOP (1) *
FROM data.OperationData d -- Big table quick seeks
WHERE d.SourceID = s.SourceID
AND d.Source = s.Source
AND d.[End] > DATEADD(day, -1, GETDATE()) -- If you’re partitioning on [End], do this for partition elimination
ORDER BY d.[End] DESC) d;





share|improve this answer























  • The index truly speeded up the query. A second problem that comes up with it is that an unpartitioned index on such a huge table is nearly unmaintainable. On all of our "big-data" table we working with partitioned indexer. They can be maintained online partition by partition. As soon as the indexer is partitioned the problem is the old one because he has to run through every partition.
    – Steffen Mangold
    Nov 28 '18 at 15:37






  • 1




    @SteffenMangold: The less data in an index the better (as long as it has everything you need) and excluding materialized views, the clustered index has the maximum amount of data possible. Clustered indexes are present because getting all of the data by the key is the norm. In this case you are getting all of the data, but you aren’t really getting it by the key, you are getting it by part of the key. You need an index that can be queried with part of the key.
    – jmoreno
    Nov 28 '18 at 15:38










  • I'm really sorry but there is a Source Table referencing the sourceID column. The column source is only a file name. It's a little confusing naming. For each Source device (sourceID) there could be only one single entry for one file source (column) at one timestamp. Also I cant do partition elimination because the newest End is widly fragmentated. Thats why I came up with the trigger solution.I think a live query will not work here.
    – Steffen Mangold
    Nov 28 '18 at 21:25












  • @Rob Farley I edited the question to be more clear
    – Steffen Mangold
    Nov 28 '18 at 21:30










  • With partitioning, you’ll find it does all those seeks into each partition. With the extra predicate, you can make it so that it doesn’t bother with all of them, and only does some. Make it a month if you need to.
    – Rob Farley
    Nov 29 '18 at 3:46













Your Answer








StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "182"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f223631%2ftop1-by-group-of-very-huge-100-000-000-table%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









9














If you have a table of SourceID values, and an index on your main table on (SourceID, End) include (othercolumns), just use OUTER APPLY.



SELECT d.*
FROM dbo.Sources s
OUTER APPLY (SELECT TOP (1) *
FROM data.OperationData d
WHERE d.SourceID = s.SourceID
ORDER BY d.[End] DESC) d;


If you know you’re only after your newest partition, you could include a filter on End, like AND d.[End] > DATEADD(day, -1, GETDATE())



Edit: Because your clustered index is on SourceID, Source, End), put Source into your Sources table too and join on that as well. Then you don’t need the new index.



SELECT d.*
FROM dbo.Sources s -- Small table
OUTER APPLY (SELECT TOP (1) *
FROM data.OperationData d -- Big table quick seeks
WHERE d.SourceID = s.SourceID
AND d.Source = s.Source
AND d.[End] > DATEADD(day, -1, GETDATE()) -- If you’re partitioning on [End], do this for partition elimination
ORDER BY d.[End] DESC) d;





share|improve this answer























  • The index truly speeded up the query. A second problem that comes up with it is that an unpartitioned index on such a huge table is nearly unmaintainable. On all of our "big-data" table we working with partitioned indexer. They can be maintained online partition by partition. As soon as the indexer is partitioned the problem is the old one because he has to run through every partition.
    – Steffen Mangold
    Nov 28 '18 at 15:37






  • 1




    @SteffenMangold: The less data in an index the better (as long as it has everything you need) and excluding materialized views, the clustered index has the maximum amount of data possible. Clustered indexes are present because getting all of the data by the key is the norm. In this case you are getting all of the data, but you aren’t really getting it by the key, you are getting it by part of the key. You need an index that can be queried with part of the key.
    – jmoreno
    Nov 28 '18 at 15:38










  • I'm really sorry but there is a Source Table referencing the sourceID column. The column source is only a file name. It's a little confusing naming. For each Source device (sourceID) there could be only one single entry for one file source (column) at one timestamp. Also I cant do partition elimination because the newest End is widly fragmentated. Thats why I came up with the trigger solution.I think a live query will not work here.
    – Steffen Mangold
    Nov 28 '18 at 21:25












  • @Rob Farley I edited the question to be more clear
    – Steffen Mangold
    Nov 28 '18 at 21:30










  • With partitioning, you’ll find it does all those seeks into each partition. With the extra predicate, you can make it so that it doesn’t bother with all of them, and only does some. Make it a month if you need to.
    – Rob Farley
    Nov 29 '18 at 3:46


















9














If you have a table of SourceID values, and an index on your main table on (SourceID, End) include (othercolumns), just use OUTER APPLY.



SELECT d.*
FROM dbo.Sources s
OUTER APPLY (SELECT TOP (1) *
FROM data.OperationData d
WHERE d.SourceID = s.SourceID
ORDER BY d.[End] DESC) d;


If you know you’re only after your newest partition, you could include a filter on End, like AND d.[End] > DATEADD(day, -1, GETDATE())



Edit: Because your clustered index is on SourceID, Source, End), put Source into your Sources table too and join on that as well. Then you don’t need the new index.



SELECT d.*
FROM dbo.Sources s -- Small table
OUTER APPLY (SELECT TOP (1) *
FROM data.OperationData d -- Big table quick seeks
WHERE d.SourceID = s.SourceID
AND d.Source = s.Source
AND d.[End] > DATEADD(day, -1, GETDATE()) -- If you’re partitioning on [End], do this for partition elimination
ORDER BY d.[End] DESC) d;





share|improve this answer























  • The index truly speeded up the query. A second problem that comes up with it is that an unpartitioned index on such a huge table is nearly unmaintainable. On all of our "big-data" table we working with partitioned indexer. They can be maintained online partition by partition. As soon as the indexer is partitioned the problem is the old one because he has to run through every partition.
    – Steffen Mangold
    Nov 28 '18 at 15:37






  • 1




    @SteffenMangold: The less data in an index the better (as long as it has everything you need) and excluding materialized views, the clustered index has the maximum amount of data possible. Clustered indexes are present because getting all of the data by the key is the norm. In this case you are getting all of the data, but you aren’t really getting it by the key, you are getting it by part of the key. You need an index that can be queried with part of the key.
    – jmoreno
    Nov 28 '18 at 15:38










  • I'm really sorry but there is a Source Table referencing the sourceID column. The column source is only a file name. It's a little confusing naming. For each Source device (sourceID) there could be only one single entry for one file source (column) at one timestamp. Also I cant do partition elimination because the newest End is widly fragmentated. Thats why I came up with the trigger solution.I think a live query will not work here.
    – Steffen Mangold
    Nov 28 '18 at 21:25












  • @Rob Farley I edited the question to be more clear
    – Steffen Mangold
    Nov 28 '18 at 21:30










  • With partitioning, you’ll find it does all those seeks into each partition. With the extra predicate, you can make it so that it doesn’t bother with all of them, and only does some. Make it a month if you need to.
    – Rob Farley
    Nov 29 '18 at 3:46
















9












9








9






If you have a table of SourceID values, and an index on your main table on (SourceID, End) include (othercolumns), just use OUTER APPLY.



SELECT d.*
FROM dbo.Sources s
OUTER APPLY (SELECT TOP (1) *
FROM data.OperationData d
WHERE d.SourceID = s.SourceID
ORDER BY d.[End] DESC) d;


If you know you’re only after your newest partition, you could include a filter on End, like AND d.[End] > DATEADD(day, -1, GETDATE())



Edit: Because your clustered index is on SourceID, Source, End), put Source into your Sources table too and join on that as well. Then you don’t need the new index.



SELECT d.*
FROM dbo.Sources s -- Small table
OUTER APPLY (SELECT TOP (1) *
FROM data.OperationData d -- Big table quick seeks
WHERE d.SourceID = s.SourceID
AND d.Source = s.Source
AND d.[End] > DATEADD(day, -1, GETDATE()) -- If you’re partitioning on [End], do this for partition elimination
ORDER BY d.[End] DESC) d;





share|improve this answer














If you have a table of SourceID values, and an index on your main table on (SourceID, End) include (othercolumns), just use OUTER APPLY.



SELECT d.*
FROM dbo.Sources s
OUTER APPLY (SELECT TOP (1) *
FROM data.OperationData d
WHERE d.SourceID = s.SourceID
ORDER BY d.[End] DESC) d;


If you know you’re only after your newest partition, you could include a filter on End, like AND d.[End] > DATEADD(day, -1, GETDATE())



Edit: Because your clustered index is on SourceID, Source, End), put Source into your Sources table too and join on that as well. Then you don’t need the new index.



SELECT d.*
FROM dbo.Sources s -- Small table
OUTER APPLY (SELECT TOP (1) *
FROM data.OperationData d -- Big table quick seeks
WHERE d.SourceID = s.SourceID
AND d.Source = s.Source
AND d.[End] > DATEADD(day, -1, GETDATE()) -- If you’re partitioning on [End], do this for partition elimination
ORDER BY d.[End] DESC) d;






share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 28 '18 at 21:04

























answered Nov 28 '18 at 10:38









Rob Farley

13.7k12448




13.7k12448












  • The index truly speeded up the query. A second problem that comes up with it is that an unpartitioned index on such a huge table is nearly unmaintainable. On all of our "big-data" table we working with partitioned indexer. They can be maintained online partition by partition. As soon as the indexer is partitioned the problem is the old one because he has to run through every partition.
    – Steffen Mangold
    Nov 28 '18 at 15:37






  • 1




    @SteffenMangold: The less data in an index the better (as long as it has everything you need) and excluding materialized views, the clustered index has the maximum amount of data possible. Clustered indexes are present because getting all of the data by the key is the norm. In this case you are getting all of the data, but you aren’t really getting it by the key, you are getting it by part of the key. You need an index that can be queried with part of the key.
    – jmoreno
    Nov 28 '18 at 15:38










  • I'm really sorry but there is a Source Table referencing the sourceID column. The column source is only a file name. It's a little confusing naming. For each Source device (sourceID) there could be only one single entry for one file source (column) at one timestamp. Also I cant do partition elimination because the newest End is widly fragmentated. Thats why I came up with the trigger solution.I think a live query will not work here.
    – Steffen Mangold
    Nov 28 '18 at 21:25












  • @Rob Farley I edited the question to be more clear
    – Steffen Mangold
    Nov 28 '18 at 21:30










  • With partitioning, you’ll find it does all those seeks into each partition. With the extra predicate, you can make it so that it doesn’t bother with all of them, and only does some. Make it a month if you need to.
    – Rob Farley
    Nov 29 '18 at 3:46




















  • The index truly speeded up the query. A second problem that comes up with it is that an unpartitioned index on such a huge table is nearly unmaintainable. On all of our "big-data" table we working with partitioned indexer. They can be maintained online partition by partition. As soon as the indexer is partitioned the problem is the old one because he has to run through every partition.
    – Steffen Mangold
    Nov 28 '18 at 15:37






  • 1




    @SteffenMangold: The less data in an index the better (as long as it has everything you need) and excluding materialized views, the clustered index has the maximum amount of data possible. Clustered indexes are present because getting all of the data by the key is the norm. In this case you are getting all of the data, but you aren’t really getting it by the key, you are getting it by part of the key. You need an index that can be queried with part of the key.
    – jmoreno
    Nov 28 '18 at 15:38










  • I'm really sorry but there is a Source Table referencing the sourceID column. The column source is only a file name. It's a little confusing naming. For each Source device (sourceID) there could be only one single entry for one file source (column) at one timestamp. Also I cant do partition elimination because the newest End is widly fragmentated. Thats why I came up with the trigger solution.I think a live query will not work here.
    – Steffen Mangold
    Nov 28 '18 at 21:25












  • @Rob Farley I edited the question to be more clear
    – Steffen Mangold
    Nov 28 '18 at 21:30










  • With partitioning, you’ll find it does all those seeks into each partition. With the extra predicate, you can make it so that it doesn’t bother with all of them, and only does some. Make it a month if you need to.
    – Rob Farley
    Nov 29 '18 at 3:46


















The index truly speeded up the query. A second problem that comes up with it is that an unpartitioned index on such a huge table is nearly unmaintainable. On all of our "big-data" table we working with partitioned indexer. They can be maintained online partition by partition. As soon as the indexer is partitioned the problem is the old one because he has to run through every partition.
– Steffen Mangold
Nov 28 '18 at 15:37




The index truly speeded up the query. A second problem that comes up with it is that an unpartitioned index on such a huge table is nearly unmaintainable. On all of our "big-data" table we working with partitioned indexer. They can be maintained online partition by partition. As soon as the indexer is partitioned the problem is the old one because he has to run through every partition.
– Steffen Mangold
Nov 28 '18 at 15:37




1




1




@SteffenMangold: The less data in an index the better (as long as it has everything you need) and excluding materialized views, the clustered index has the maximum amount of data possible. Clustered indexes are present because getting all of the data by the key is the norm. In this case you are getting all of the data, but you aren’t really getting it by the key, you are getting it by part of the key. You need an index that can be queried with part of the key.
– jmoreno
Nov 28 '18 at 15:38




@SteffenMangold: The less data in an index the better (as long as it has everything you need) and excluding materialized views, the clustered index has the maximum amount of data possible. Clustered indexes are present because getting all of the data by the key is the norm. In this case you are getting all of the data, but you aren’t really getting it by the key, you are getting it by part of the key. You need an index that can be queried with part of the key.
– jmoreno
Nov 28 '18 at 15:38












I'm really sorry but there is a Source Table referencing the sourceID column. The column source is only a file name. It's a little confusing naming. For each Source device (sourceID) there could be only one single entry for one file source (column) at one timestamp. Also I cant do partition elimination because the newest End is widly fragmentated. Thats why I came up with the trigger solution.I think a live query will not work here.
– Steffen Mangold
Nov 28 '18 at 21:25






I'm really sorry but there is a Source Table referencing the sourceID column. The column source is only a file name. It's a little confusing naming. For each Source device (sourceID) there could be only one single entry for one file source (column) at one timestamp. Also I cant do partition elimination because the newest End is widly fragmentated. Thats why I came up with the trigger solution.I think a live query will not work here.
– Steffen Mangold
Nov 28 '18 at 21:25














@Rob Farley I edited the question to be more clear
– Steffen Mangold
Nov 28 '18 at 21:30




@Rob Farley I edited the question to be more clear
– Steffen Mangold
Nov 28 '18 at 21:30












With partitioning, you’ll find it does all those seeks into each partition. With the extra predicate, you can make it so that it doesn’t bother with all of them, and only does some. Make it a month if you need to.
– Rob Farley
Nov 29 '18 at 3:46






With partitioning, you’ll find it does all those seeks into each partition. With the extra predicate, you can make it so that it doesn’t bother with all of them, and only does some. Make it a month if you need to.
– Rob Farley
Nov 29 '18 at 3:46




















draft saved

draft discarded




















































Thanks for contributing an answer to Database Administrators Stack Exchange!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.





Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


Please pay close attention to the following guidance:


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f223631%2ftop1-by-group-of-very-huge-100-000-000-table%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Aardman Animations

Are they similar matrix

“minimization” problem in Euclidean space related to orthonormal basis