De-duping a table based on an expression using one field
Mariadb: mysql Ver 15.1 Distrib 10.3.8-MariaDB
I have a table built as a union of queries from several other tables. This table looks something like-
+-------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| source_id | int(11) | YES | | NULL | |
| code | varchar(64) | YES | | NULL | |
| description | varchar(255) | YES | | NULL | |
+-------------+--------------+------+-----+---------+----------------+
The code field can contain "extra" information in curly braces {think of them as comments}. I want to find all the rows where the code field is equal if the comments are ignored, and then keep just one row from each duplicate group.
It's the "keep one row of each group" bit that's causing me problems.
I have a query that will list the IDs of all the rows in each dup group-
select id
from mytable
group by (regexp_replace(code, '(?-i){.*?}', ''))
having count(*) > 1
but if I make this a subquery in a DELETE WHERE operation it will delete all the duplicates. Not what I want.
After searching around for an "all but one" or "pop one from the output list" function I came upon the 2-argument version of LIMIT with an OFFSET-
select id
from mytable
group by (regexp_replace(code, '(?-i){.*?}', ''))
having count(*) > 1 limit 1000000 offset 1
but if you put this into a DELETE command it fails-
delete from mytable where id in(
select id
from mytable
group by (regexp_replace(code, '(?-i){.*?}', ''))
having count(*) > 1 limit 1000 offset 1
);
ERROR 1235 (42000): This version of MariaDB doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'
Any idea how to achieve this? Preferably in-place but worst case I could SELECT into a temporary table then replace the original if that's what it takes.
mysql deduplication mariadb
add a comment |
Mariadb: mysql Ver 15.1 Distrib 10.3.8-MariaDB
I have a table built as a union of queries from several other tables. This table looks something like-
+-------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| source_id | int(11) | YES | | NULL | |
| code | varchar(64) | YES | | NULL | |
| description | varchar(255) | YES | | NULL | |
+-------------+--------------+------+-----+---------+----------------+
The code field can contain "extra" information in curly braces {think of them as comments}. I want to find all the rows where the code field is equal if the comments are ignored, and then keep just one row from each duplicate group.
It's the "keep one row of each group" bit that's causing me problems.
I have a query that will list the IDs of all the rows in each dup group-
select id
from mytable
group by (regexp_replace(code, '(?-i){.*?}', ''))
having count(*) > 1
but if I make this a subquery in a DELETE WHERE operation it will delete all the duplicates. Not what I want.
After searching around for an "all but one" or "pop one from the output list" function I came upon the 2-argument version of LIMIT with an OFFSET-
select id
from mytable
group by (regexp_replace(code, '(?-i){.*?}', ''))
having count(*) > 1 limit 1000000 offset 1
but if you put this into a DELETE command it fails-
delete from mytable where id in(
select id
from mytable
group by (regexp_replace(code, '(?-i){.*?}', ''))
having count(*) > 1 limit 1000 offset 1
);
ERROR 1235 (42000): This version of MariaDB doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'
Any idea how to achieve this? Preferably in-place but worst case I could SELECT into a temporary table then replace the original if that's what it takes.
mysql deduplication mariadb
add a comment |
Mariadb: mysql Ver 15.1 Distrib 10.3.8-MariaDB
I have a table built as a union of queries from several other tables. This table looks something like-
+-------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| source_id | int(11) | YES | | NULL | |
| code | varchar(64) | YES | | NULL | |
| description | varchar(255) | YES | | NULL | |
+-------------+--------------+------+-----+---------+----------------+
The code field can contain "extra" information in curly braces {think of them as comments}. I want to find all the rows where the code field is equal if the comments are ignored, and then keep just one row from each duplicate group.
It's the "keep one row of each group" bit that's causing me problems.
I have a query that will list the IDs of all the rows in each dup group-
select id
from mytable
group by (regexp_replace(code, '(?-i){.*?}', ''))
having count(*) > 1
but if I make this a subquery in a DELETE WHERE operation it will delete all the duplicates. Not what I want.
After searching around for an "all but one" or "pop one from the output list" function I came upon the 2-argument version of LIMIT with an OFFSET-
select id
from mytable
group by (regexp_replace(code, '(?-i){.*?}', ''))
having count(*) > 1 limit 1000000 offset 1
but if you put this into a DELETE command it fails-
delete from mytable where id in(
select id
from mytable
group by (regexp_replace(code, '(?-i){.*?}', ''))
having count(*) > 1 limit 1000 offset 1
);
ERROR 1235 (42000): This version of MariaDB doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'
Any idea how to achieve this? Preferably in-place but worst case I could SELECT into a temporary table then replace the original if that's what it takes.
mysql deduplication mariadb
Mariadb: mysql Ver 15.1 Distrib 10.3.8-MariaDB
I have a table built as a union of queries from several other tables. This table looks something like-
+-------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| source_id | int(11) | YES | | NULL | |
| code | varchar(64) | YES | | NULL | |
| description | varchar(255) | YES | | NULL | |
+-------------+--------------+------+-----+---------+----------------+
The code field can contain "extra" information in curly braces {think of them as comments}. I want to find all the rows where the code field is equal if the comments are ignored, and then keep just one row from each duplicate group.
It's the "keep one row of each group" bit that's causing me problems.
I have a query that will list the IDs of all the rows in each dup group-
select id
from mytable
group by (regexp_replace(code, '(?-i){.*?}', ''))
having count(*) > 1
but if I make this a subquery in a DELETE WHERE operation it will delete all the duplicates. Not what I want.
After searching around for an "all but one" or "pop one from the output list" function I came upon the 2-argument version of LIMIT with an OFFSET-
select id
from mytable
group by (regexp_replace(code, '(?-i){.*?}', ''))
having count(*) > 1 limit 1000000 offset 1
but if you put this into a DELETE command it fails-
delete from mytable where id in(
select id
from mytable
group by (regexp_replace(code, '(?-i){.*?}', ''))
having count(*) > 1 limit 1000 offset 1
);
ERROR 1235 (42000): This version of MariaDB doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'
Any idea how to achieve this? Preferably in-place but worst case I could SELECT into a temporary table then replace the original if that's what it takes.
mysql deduplication mariadb
mysql deduplication mariadb
asked Feb 13 at 22:04
IncansIncans
163
163
add a comment |
add a comment |
2 Answers
2
active
oldest
votes
You already figured out how to identify duplicates, so I will concentrate on how to delete extra rows, i.e. all rows but the first of each value.
We can do this by putting Ids of rows we want to keep in a temporary table, and then deleting all rows which Ids are not in this list.
I developed a little example script to keep it simple :
DROP TABLE IF EXISTS hasdupes;
CREATE TABLE hasdupes (
Id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
Value VARCHAR(100)
) ENGINE=InnoDB;
INSERT INTO hasdupes (Value) VALUES ('red'), ('green'), ('blue'), ('blue'),
('green'), ('blue'), ('red'), ('orange'), ('red'), ('blue'), ('green'), ('white');
SELECT Value, COUNT(*) FROM hasdupes GROUP BY Value;
/*
blue 4
green 3
orange 1
red 3
white 1
*/
/* SELECTS an Id for each value in a temporary table */
CREATE TEMPORARY TABLE keep SELECT Id FROM hasdupes GROUP BY Value;
DELETE FROM hasdupes WHERE hasdupes.Id NOT IN (SELECT Id FROM keep);
DROP TEMPORARY TABLE keep;
SELECT Value, COUNT(*) FROM hasdupes GROUP BY Value;
/*
blue 1
green 1
orange 1
red 1
white 1
*/
Thanks berty. Having worked out a solution that copies the "deduped" data into a different table, I then realised this was the basis of the full answer to my original question.
– Incans
Feb 18 at 13:30
add a comment |
I initially got this working by selecting the de-duped results into a new table-
truncate dedup_table;
insert into dedup_table
(source_id, source, code, description)
select source_id, source, code, description
from hasdupes_table
where id in
(select min(id) from hasdupes_table
group by regexp_replace(code, '(?-i){.*?}', ''));
This is basically the same approach recommended by @berty, but using a subquery instead of a temp table.
I then realised this could be made to work as I originally intended by inverting the subquery using NOT IN to work out which rows to delete from the "hasdupes" table.
I haven't actually tested this but I believe it would work-
delete from hasdupes_table where id not in
(select min(id) from hasdupes_table
group by regexp_replace(code, '(?-i){.*?}', ''));
add a comment |
Your Answer
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "3"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fsuperuser.com%2fquestions%2f1405465%2fde-duping-a-table-based-on-an-expression-using-one-field%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
You already figured out how to identify duplicates, so I will concentrate on how to delete extra rows, i.e. all rows but the first of each value.
We can do this by putting Ids of rows we want to keep in a temporary table, and then deleting all rows which Ids are not in this list.
I developed a little example script to keep it simple :
DROP TABLE IF EXISTS hasdupes;
CREATE TABLE hasdupes (
Id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
Value VARCHAR(100)
) ENGINE=InnoDB;
INSERT INTO hasdupes (Value) VALUES ('red'), ('green'), ('blue'), ('blue'),
('green'), ('blue'), ('red'), ('orange'), ('red'), ('blue'), ('green'), ('white');
SELECT Value, COUNT(*) FROM hasdupes GROUP BY Value;
/*
blue 4
green 3
orange 1
red 3
white 1
*/
/* SELECTS an Id for each value in a temporary table */
CREATE TEMPORARY TABLE keep SELECT Id FROM hasdupes GROUP BY Value;
DELETE FROM hasdupes WHERE hasdupes.Id NOT IN (SELECT Id FROM keep);
DROP TEMPORARY TABLE keep;
SELECT Value, COUNT(*) FROM hasdupes GROUP BY Value;
/*
blue 1
green 1
orange 1
red 1
white 1
*/
Thanks berty. Having worked out a solution that copies the "deduped" data into a different table, I then realised this was the basis of the full answer to my original question.
– Incans
Feb 18 at 13:30
add a comment |
You already figured out how to identify duplicates, so I will concentrate on how to delete extra rows, i.e. all rows but the first of each value.
We can do this by putting Ids of rows we want to keep in a temporary table, and then deleting all rows which Ids are not in this list.
I developed a little example script to keep it simple :
DROP TABLE IF EXISTS hasdupes;
CREATE TABLE hasdupes (
Id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
Value VARCHAR(100)
) ENGINE=InnoDB;
INSERT INTO hasdupes (Value) VALUES ('red'), ('green'), ('blue'), ('blue'),
('green'), ('blue'), ('red'), ('orange'), ('red'), ('blue'), ('green'), ('white');
SELECT Value, COUNT(*) FROM hasdupes GROUP BY Value;
/*
blue 4
green 3
orange 1
red 3
white 1
*/
/* SELECTS an Id for each value in a temporary table */
CREATE TEMPORARY TABLE keep SELECT Id FROM hasdupes GROUP BY Value;
DELETE FROM hasdupes WHERE hasdupes.Id NOT IN (SELECT Id FROM keep);
DROP TEMPORARY TABLE keep;
SELECT Value, COUNT(*) FROM hasdupes GROUP BY Value;
/*
blue 1
green 1
orange 1
red 1
white 1
*/
Thanks berty. Having worked out a solution that copies the "deduped" data into a different table, I then realised this was the basis of the full answer to my original question.
– Incans
Feb 18 at 13:30
add a comment |
You already figured out how to identify duplicates, so I will concentrate on how to delete extra rows, i.e. all rows but the first of each value.
We can do this by putting Ids of rows we want to keep in a temporary table, and then deleting all rows which Ids are not in this list.
I developed a little example script to keep it simple :
DROP TABLE IF EXISTS hasdupes;
CREATE TABLE hasdupes (
Id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
Value VARCHAR(100)
) ENGINE=InnoDB;
INSERT INTO hasdupes (Value) VALUES ('red'), ('green'), ('blue'), ('blue'),
('green'), ('blue'), ('red'), ('orange'), ('red'), ('blue'), ('green'), ('white');
SELECT Value, COUNT(*) FROM hasdupes GROUP BY Value;
/*
blue 4
green 3
orange 1
red 3
white 1
*/
/* SELECTS an Id for each value in a temporary table */
CREATE TEMPORARY TABLE keep SELECT Id FROM hasdupes GROUP BY Value;
DELETE FROM hasdupes WHERE hasdupes.Id NOT IN (SELECT Id FROM keep);
DROP TEMPORARY TABLE keep;
SELECT Value, COUNT(*) FROM hasdupes GROUP BY Value;
/*
blue 1
green 1
orange 1
red 1
white 1
*/
You already figured out how to identify duplicates, so I will concentrate on how to delete extra rows, i.e. all rows but the first of each value.
We can do this by putting Ids of rows we want to keep in a temporary table, and then deleting all rows which Ids are not in this list.
I developed a little example script to keep it simple :
DROP TABLE IF EXISTS hasdupes;
CREATE TABLE hasdupes (
Id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
Value VARCHAR(100)
) ENGINE=InnoDB;
INSERT INTO hasdupes (Value) VALUES ('red'), ('green'), ('blue'), ('blue'),
('green'), ('blue'), ('red'), ('orange'), ('red'), ('blue'), ('green'), ('white');
SELECT Value, COUNT(*) FROM hasdupes GROUP BY Value;
/*
blue 4
green 3
orange 1
red 3
white 1
*/
/* SELECTS an Id for each value in a temporary table */
CREATE TEMPORARY TABLE keep SELECT Id FROM hasdupes GROUP BY Value;
DELETE FROM hasdupes WHERE hasdupes.Id NOT IN (SELECT Id FROM keep);
DROP TEMPORARY TABLE keep;
SELECT Value, COUNT(*) FROM hasdupes GROUP BY Value;
/*
blue 1
green 1
orange 1
red 1
white 1
*/
answered Feb 16 at 18:03
bertyberty
113
113
Thanks berty. Having worked out a solution that copies the "deduped" data into a different table, I then realised this was the basis of the full answer to my original question.
– Incans
Feb 18 at 13:30
add a comment |
Thanks berty. Having worked out a solution that copies the "deduped" data into a different table, I then realised this was the basis of the full answer to my original question.
– Incans
Feb 18 at 13:30
Thanks berty. Having worked out a solution that copies the "deduped" data into a different table, I then realised this was the basis of the full answer to my original question.
– Incans
Feb 18 at 13:30
Thanks berty. Having worked out a solution that copies the "deduped" data into a different table, I then realised this was the basis of the full answer to my original question.
– Incans
Feb 18 at 13:30
add a comment |
I initially got this working by selecting the de-duped results into a new table-
truncate dedup_table;
insert into dedup_table
(source_id, source, code, description)
select source_id, source, code, description
from hasdupes_table
where id in
(select min(id) from hasdupes_table
group by regexp_replace(code, '(?-i){.*?}', ''));
This is basically the same approach recommended by @berty, but using a subquery instead of a temp table.
I then realised this could be made to work as I originally intended by inverting the subquery using NOT IN to work out which rows to delete from the "hasdupes" table.
I haven't actually tested this but I believe it would work-
delete from hasdupes_table where id not in
(select min(id) from hasdupes_table
group by regexp_replace(code, '(?-i){.*?}', ''));
add a comment |
I initially got this working by selecting the de-duped results into a new table-
truncate dedup_table;
insert into dedup_table
(source_id, source, code, description)
select source_id, source, code, description
from hasdupes_table
where id in
(select min(id) from hasdupes_table
group by regexp_replace(code, '(?-i){.*?}', ''));
This is basically the same approach recommended by @berty, but using a subquery instead of a temp table.
I then realised this could be made to work as I originally intended by inverting the subquery using NOT IN to work out which rows to delete from the "hasdupes" table.
I haven't actually tested this but I believe it would work-
delete from hasdupes_table where id not in
(select min(id) from hasdupes_table
group by regexp_replace(code, '(?-i){.*?}', ''));
add a comment |
I initially got this working by selecting the de-duped results into a new table-
truncate dedup_table;
insert into dedup_table
(source_id, source, code, description)
select source_id, source, code, description
from hasdupes_table
where id in
(select min(id) from hasdupes_table
group by regexp_replace(code, '(?-i){.*?}', ''));
This is basically the same approach recommended by @berty, but using a subquery instead of a temp table.
I then realised this could be made to work as I originally intended by inverting the subquery using NOT IN to work out which rows to delete from the "hasdupes" table.
I haven't actually tested this but I believe it would work-
delete from hasdupes_table where id not in
(select min(id) from hasdupes_table
group by regexp_replace(code, '(?-i){.*?}', ''));
I initially got this working by selecting the de-duped results into a new table-
truncate dedup_table;
insert into dedup_table
(source_id, source, code, description)
select source_id, source, code, description
from hasdupes_table
where id in
(select min(id) from hasdupes_table
group by regexp_replace(code, '(?-i){.*?}', ''));
This is basically the same approach recommended by @berty, but using a subquery instead of a temp table.
I then realised this could be made to work as I originally intended by inverting the subquery using NOT IN to work out which rows to delete from the "hasdupes" table.
I haven't actually tested this but I believe it would work-
delete from hasdupes_table where id not in
(select min(id) from hasdupes_table
group by regexp_replace(code, '(?-i){.*?}', ''));
edited Feb 19 at 12:23
answered Feb 18 at 13:37
IncansIncans
163
163
add a comment |
add a comment |
Thanks for contributing an answer to Super User!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fsuperuser.com%2fquestions%2f1405465%2fde-duping-a-table-based-on-an-expression-using-one-field%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown