De-duping a table based on an expression using one field

Mariadb: mysql  Ver 15.1 Distrib 10.3.8-MariaDB

I have a table built as a union of queries from several other tables. This table looks something like-

+-------------+--------------+------+-----+---------+----------------+

| Field       | Type         | Null | Key | Default | Extra          |

+-------------+--------------+------+-----+---------+----------------+

| id          | int(11)      | NO   | PRI | NULL    | auto_increment |

| source_id   | int(11)      | YES  |     | NULL    |                |

| code        | varchar(64)  | YES  |     | NULL    |                |

| description | varchar(255) | YES  |     | NULL    |                |

+-------------+--------------+------+-----+---------+----------------+

The code field can contain "extra" information in curly braces {think of them as comments}. I want to find all the rows where the code field is equal if the comments are ignored, and then keep just one row from each duplicate group.

It's the "keep one row of each group" bit that's causing me problems.

I have a query that will list the IDs of all the rows in each dup group-

select id

from mytable

group by (regexp_replace(code, '(?-i){.*?}', ''))

having count(*) > 1

but if I make this a subquery in a DELETE WHERE operation it will delete all the duplicates. Not what I want.

After searching around for an "all but one" or "pop one from the output list" function I came upon the 2-argument version of LIMIT with an OFFSET-

select id

from mytable

group by (regexp_replace(code, '(?-i){.*?}', ''))

having count(*) > 1 limit 1000000 offset 1

but if you put this into a DELETE command it fails-

delete from mytable where id in(

    select id

    from mytable

    group by (regexp_replace(code, '(?-i){.*?}', ''))

    having count(*) > 1 limit 1000 offset 1

);

ERROR 1235 (42000): This version of MariaDB doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'

Any idea how to achieve this? Preferably in-place but worst case I could SELECT into a temporary table then replace the original if that's what it takes.

asked Feb 13 at 22:04

Incans

163

add a comment |

Mariadb: mysql  Ver 15.1 Distrib 10.3.8-MariaDB

I have a table built as a union of queries from several other tables. This table looks something like-

+-------------+--------------+------+-----+---------+----------------+

| Field       | Type         | Null | Key | Default | Extra          |

+-------------+--------------+------+-----+---------+----------------+

| id          | int(11)      | NO   | PRI | NULL    | auto_increment |

| source_id   | int(11)      | YES  |     | NULL    |                |

| code        | varchar(64)  | YES  |     | NULL    |                |

| description | varchar(255) | YES  |     | NULL    |                |

+-------------+--------------+------+-----+---------+----------------+

It's the "keep one row of each group" bit that's causing me problems.

I have a query that will list the IDs of all the rows in each dup group-

select id

from mytable

group by (regexp_replace(code, '(?-i){.*?}', ''))

having count(*) > 1

but if I make this a subquery in a DELETE WHERE operation it will delete all the duplicates. Not what I want.

After searching around for an "all but one" or "pop one from the output list" function I came upon the 2-argument version of LIMIT with an OFFSET-

select id

from mytable

group by (regexp_replace(code, '(?-i){.*?}', ''))

having count(*) > 1 limit 1000000 offset 1

but if you put this into a DELETE command it fails-

delete from mytable where id in(

    select id

    from mytable

    group by (regexp_replace(code, '(?-i){.*?}', ''))

    having count(*) > 1 limit 1000 offset 1

);

ERROR 1235 (42000): This version of MariaDB doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'

Any idea how to achieve this? Preferably in-place but worst case I could SELECT into a temporary table then replace the original if that's what it takes.

asked Feb 13 at 22:04

Incans

163

add a comment |

Mariadb: mysql  Ver 15.1 Distrib 10.3.8-MariaDB

I have a table built as a union of queries from several other tables. This table looks something like-

+-------------+--------------+------+-----+---------+----------------+

| Field       | Type         | Null | Key | Default | Extra          |

+-------------+--------------+------+-----+---------+----------------+

| id          | int(11)      | NO   | PRI | NULL    | auto_increment |

| source_id   | int(11)      | YES  |     | NULL    |                |

| code        | varchar(64)  | YES  |     | NULL    |                |

| description | varchar(255) | YES  |     | NULL    |                |

+-------------+--------------+------+-----+---------+----------------+

It's the "keep one row of each group" bit that's causing me problems.

I have a query that will list the IDs of all the rows in each dup group-

select id

from mytable

group by (regexp_replace(code, '(?-i){.*?}', ''))

having count(*) > 1

but if I make this a subquery in a DELETE WHERE operation it will delete all the duplicates. Not what I want.

After searching around for an "all but one" or "pop one from the output list" function I came upon the 2-argument version of LIMIT with an OFFSET-

select id

from mytable

group by (regexp_replace(code, '(?-i){.*?}', ''))

having count(*) > 1 limit 1000000 offset 1

but if you put this into a DELETE command it fails-

delete from mytable where id in(

    select id

    from mytable

    group by (regexp_replace(code, '(?-i){.*?}', ''))

    having count(*) > 1 limit 1000 offset 1

);

ERROR 1235 (42000): This version of MariaDB doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'

Any idea how to achieve this? Preferably in-place but worst case I could SELECT into a temporary table then replace the original if that's what it takes.

asked Feb 13 at 22:04

Incans

163

Mariadb: mysql  Ver 15.1 Distrib 10.3.8-MariaDB

I have a table built as a union of queries from several other tables. This table looks something like-

+-------------+--------------+------+-----+---------+----------------+

| Field       | Type         | Null | Key | Default | Extra          |

+-------------+--------------+------+-----+---------+----------------+

| id          | int(11)      | NO   | PRI | NULL    | auto_increment |

| source_id   | int(11)      | YES  |     | NULL    |                |

| code        | varchar(64)  | YES  |     | NULL    |                |

| description | varchar(255) | YES  |     | NULL    |                |

+-------------+--------------+------+-----+---------+----------------+

It's the "keep one row of each group" bit that's causing me problems.

I have a query that will list the IDs of all the rows in each dup group-

select id

from mytable

group by (regexp_replace(code, '(?-i){.*?}', ''))

having count(*) > 1

but if I make this a subquery in a DELETE WHERE operation it will delete all the duplicates. Not what I want.

After searching around for an "all but one" or "pop one from the output list" function I came upon the 2-argument version of LIMIT with an OFFSET-

select id

from mytable

group by (regexp_replace(code, '(?-i){.*?}', ''))

having count(*) > 1 limit 1000000 offset 1

but if you put this into a DELETE command it fails-

delete from mytable where id in(

    select id

    from mytable

    group by (regexp_replace(code, '(?-i){.*?}', ''))

    having count(*) > 1 limit 1000 offset 1

);

ERROR 1235 (42000): This version of MariaDB doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'

Any idea how to achieve this? Preferably in-place but worst case I could SELECT into a temporary table then replace the original if that's what it takes.

mysql deduplication mariadb

asked Feb 13 at 22:04

Incans

163

asked Feb 13 at 22:04

Incans

163

asked Feb 13 at 22:04

Incans

163

asked Feb 13 at 22:04

Incans

163

asked Feb 13 at 22:04

Incans

163

add a comment |

2 Answers
2

active

oldest

votes

You already figured out how to identify duplicates, so I will concentrate on how to delete extra rows, i.e. all rows but the first of each value.

We can do this by putting Ids of rows we want to keep in a temporary table, and then deleting all rows which Ids are not in this list.

I developed a little example script to keep it simple :

DROP TABLE IF EXISTS hasdupes;

CREATE TABLE hasdupes (

 Id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,

 Value VARCHAR(100)

) ENGINE=InnoDB;



INSERT INTO hasdupes (Value) VALUES ('red'), ('green'), ('blue'), ('blue'),

('green'), ('blue'), ('red'), ('orange'), ('red'), ('blue'), ('green'), ('white');



SELECT Value, COUNT(*) FROM hasdupes GROUP BY Value;

/*

blue    4

green   3

orange  1

red     3

white   1

*/



/* SELECTS an Id for each value in a temporary table */

CREATE TEMPORARY TABLE keep SELECT Id FROM hasdupes GROUP BY Value;

DELETE FROM hasdupes WHERE hasdupes.Id NOT IN (SELECT Id FROM keep);

DROP TEMPORARY TABLE keep;



SELECT Value, COUNT(*) FROM hasdupes GROUP BY Value;

/*

blue    1

green   1

orange  1

red     1

white   1

*/

answered Feb 16 at 18:03

berty

113

Thanks berty. Having worked out a solution that copies the "deduped" data into a different table, I then realised this was the basis of the full answer to my original question.

– Incans
Feb 18 at 13:30

add a comment |

I initially got this working by selecting the de-duped results into a new table-

truncate dedup_table;

insert into dedup_table

    (source_id, source, code, description)

    select source_id, source, code, description

        from hasdupes_table

        where id in

        (select min(id) from hasdupes_table

         group by regexp_replace(code, '(?-i){.*?}', ''));

This is basically the same approach recommended by @berty, but using a subquery instead of a temp table.

I then realised this could be made to work as I originally intended by inverting the subquery using NOT IN to work out which rows to delete from the "hasdupes" table.

I haven't actually tested this but I believe it would work-

delete from hasdupes_table where id not in

            (select min(id) from hasdupes_table

             group by regexp_replace(code, '(?-i){.*?}', ''));

edited Feb 19 at 12:23

answered Feb 18 at 13:37

Incans

163

add a comment |

Your Answer

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "3"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fsuperuser.com%2fquestions%2f1405465%2fde-duping-a-table-based-on-an-expression-using-one-field%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

You already figured out how to identify duplicates, so I will concentrate on how to delete extra rows, i.e. all rows but the first of each value.

We can do this by putting Ids of rows we want to keep in a temporary table, and then deleting all rows which Ids are not in this list.

I developed a little example script to keep it simple :

DROP TABLE IF EXISTS hasdupes;

CREATE TABLE hasdupes (

 Id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,

 Value VARCHAR(100)

) ENGINE=InnoDB;



INSERT INTO hasdupes (Value) VALUES ('red'), ('green'), ('blue'), ('blue'),

('green'), ('blue'), ('red'), ('orange'), ('red'), ('blue'), ('green'), ('white');



SELECT Value, COUNT(*) FROM hasdupes GROUP BY Value;

/*

blue    4

green   3

orange  1

red     3

white   1

*/



/* SELECTS an Id for each value in a temporary table */

CREATE TEMPORARY TABLE keep SELECT Id FROM hasdupes GROUP BY Value;

DELETE FROM hasdupes WHERE hasdupes.Id NOT IN (SELECT Id FROM keep);

DROP TEMPORARY TABLE keep;



SELECT Value, COUNT(*) FROM hasdupes GROUP BY Value;

/*

blue    1

green   1

orange  1

red     1

white   1

*/

answered Feb 16 at 18:03

berty

113

Thanks berty. Having worked out a solution that copies the "deduped" data into a different table, I then realised this was the basis of the full answer to my original question.

– Incans
Feb 18 at 13:30

add a comment |

You already figured out how to identify duplicates, so I will concentrate on how to delete extra rows, i.e. all rows but the first of each value.

We can do this by putting Ids of rows we want to keep in a temporary table, and then deleting all rows which Ids are not in this list.

I developed a little example script to keep it simple :

DROP TABLE IF EXISTS hasdupes;

CREATE TABLE hasdupes (

 Id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,

 Value VARCHAR(100)

) ENGINE=InnoDB;



INSERT INTO hasdupes (Value) VALUES ('red'), ('green'), ('blue'), ('blue'),

('green'), ('blue'), ('red'), ('orange'), ('red'), ('blue'), ('green'), ('white');



SELECT Value, COUNT(*) FROM hasdupes GROUP BY Value;

/*

blue    4

green   3

orange  1

red     3

white   1

*/



/* SELECTS an Id for each value in a temporary table */

CREATE TEMPORARY TABLE keep SELECT Id FROM hasdupes GROUP BY Value;

DELETE FROM hasdupes WHERE hasdupes.Id NOT IN (SELECT Id FROM keep);

DROP TEMPORARY TABLE keep;



SELECT Value, COUNT(*) FROM hasdupes GROUP BY Value;

/*

blue    1

green   1

orange  1

red     1

white   1

*/

answered Feb 16 at 18:03

berty

113

Thanks berty. Having worked out a solution that copies the "deduped" data into a different table, I then realised this was the basis of the full answer to my original question.

– Incans
Feb 18 at 13:30

add a comment |

You already figured out how to identify duplicates, so I will concentrate on how to delete extra rows, i.e. all rows but the first of each value.

We can do this by putting Ids of rows we want to keep in a temporary table, and then deleting all rows which Ids are not in this list.

I developed a little example script to keep it simple :

DROP TABLE IF EXISTS hasdupes;

CREATE TABLE hasdupes (

 Id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,

 Value VARCHAR(100)

) ENGINE=InnoDB;



INSERT INTO hasdupes (Value) VALUES ('red'), ('green'), ('blue'), ('blue'),

('green'), ('blue'), ('red'), ('orange'), ('red'), ('blue'), ('green'), ('white');



SELECT Value, COUNT(*) FROM hasdupes GROUP BY Value;

/*

blue    4

green   3

orange  1

red     3

white   1

*/



/* SELECTS an Id for each value in a temporary table */

CREATE TEMPORARY TABLE keep SELECT Id FROM hasdupes GROUP BY Value;

DELETE FROM hasdupes WHERE hasdupes.Id NOT IN (SELECT Id FROM keep);

DROP TEMPORARY TABLE keep;



SELECT Value, COUNT(*) FROM hasdupes GROUP BY Value;

/*

blue    1

green   1

orange  1

red     1

white   1

*/

answered Feb 16 at 18:03

berty

113

You already figured out how to identify duplicates, so I will concentrate on how to delete extra rows, i.e. all rows but the first of each value.

We can do this by putting Ids of rows we want to keep in a temporary table, and then deleting all rows which Ids are not in this list.

I developed a little example script to keep it simple :

DROP TABLE IF EXISTS hasdupes;

CREATE TABLE hasdupes (

 Id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,

 Value VARCHAR(100)

) ENGINE=InnoDB;



INSERT INTO hasdupes (Value) VALUES ('red'), ('green'), ('blue'), ('blue'),

('green'), ('blue'), ('red'), ('orange'), ('red'), ('blue'), ('green'), ('white');



SELECT Value, COUNT(*) FROM hasdupes GROUP BY Value;

/*

blue    4

green   3

orange  1

red     3

white   1

*/



/* SELECTS an Id for each value in a temporary table */

CREATE TEMPORARY TABLE keep SELECT Id FROM hasdupes GROUP BY Value;

DELETE FROM hasdupes WHERE hasdupes.Id NOT IN (SELECT Id FROM keep);

DROP TEMPORARY TABLE keep;



SELECT Value, COUNT(*) FROM hasdupes GROUP BY Value;

/*

blue    1

green   1

orange  1

red     1

white   1

*/

answered Feb 16 at 18:03

berty

113

answered Feb 16 at 18:03

berty

113

answered Feb 16 at 18:03

berty

113

answered Feb 16 at 18:03

berty

113

Thanks berty. Having worked out a solution that copies the "deduped" data into a different table, I then realised this was the basis of the full answer to my original question.

– Incans
Feb 18 at 13:30

add a comment |

Thanks berty. Having worked out a solution that copies the "deduped" data into a different table, I then realised this was the basis of the full answer to my original question.

– Incans
Feb 18 at 13:30

Thanks berty. Having worked out a solution that copies the "deduped" data into a different table, I then realised this was the basis of the full answer to my original question.

– Incans
Feb 18 at 13:30

add a comment |

I initially got this working by selecting the de-duped results into a new table-

truncate dedup_table;

insert into dedup_table

    (source_id, source, code, description)

    select source_id, source, code, description

        from hasdupes_table

        where id in

        (select min(id) from hasdupes_table

         group by regexp_replace(code, '(?-i){.*?}', ''));

This is basically the same approach recommended by @berty, but using a subquery instead of a temp table.

I then realised this could be made to work as I originally intended by inverting the subquery using NOT IN to work out which rows to delete from the "hasdupes" table.

I haven't actually tested this but I believe it would work-

delete from hasdupes_table where id not in

            (select min(id) from hasdupes_table

             group by regexp_replace(code, '(?-i){.*?}', ''));

edited Feb 19 at 12:23

answered Feb 18 at 13:37

Incans

163

add a comment |

I initially got this working by selecting the de-duped results into a new table-

truncate dedup_table;

insert into dedup_table

    (source_id, source, code, description)

    select source_id, source, code, description

        from hasdupes_table

        where id in

        (select min(id) from hasdupes_table

         group by regexp_replace(code, '(?-i){.*?}', ''));

This is basically the same approach recommended by @berty, but using a subquery instead of a temp table.

I then realised this could be made to work as I originally intended by inverting the subquery using NOT IN to work out which rows to delete from the "hasdupes" table.

I haven't actually tested this but I believe it would work-

delete from hasdupes_table where id not in

            (select min(id) from hasdupes_table

             group by regexp_replace(code, '(?-i){.*?}', ''));

edited Feb 19 at 12:23

answered Feb 18 at 13:37

Incans

163

add a comment |

I initially got this working by selecting the de-duped results into a new table-

truncate dedup_table;

insert into dedup_table

    (source_id, source, code, description)

    select source_id, source, code, description

        from hasdupes_table

        where id in

        (select min(id) from hasdupes_table

         group by regexp_replace(code, '(?-i){.*?}', ''));

This is basically the same approach recommended by @berty, but using a subquery instead of a temp table.

I then realised this could be made to work as I originally intended by inverting the subquery using NOT IN to work out which rows to delete from the "hasdupes" table.

I haven't actually tested this but I believe it would work-

delete from hasdupes_table where id not in

            (select min(id) from hasdupes_table

             group by regexp_replace(code, '(?-i){.*?}', ''));

edited Feb 19 at 12:23

answered Feb 18 at 13:37

Incans

163

I initially got this working by selecting the de-duped results into a new table-

truncate dedup_table;

insert into dedup_table

    (source_id, source, code, description)

    select source_id, source, code, description

        from hasdupes_table

        where id in

        (select min(id) from hasdupes_table

         group by regexp_replace(code, '(?-i){.*?}', ''));

This is basically the same approach recommended by @berty, but using a subquery instead of a temp table.

I then realised this could be made to work as I originally intended by inverting the subquery using NOT IN to work out which rows to delete from the "hasdupes" table.

I haven't actually tested this but I believe it would work-

delete from hasdupes_table where id not in

            (select min(id) from hasdupes_table

             group by regexp_replace(code, '(?-i){.*?}', ''));

edited Feb 19 at 12:23

answered Feb 18 at 13:37

Incans

163

edited Feb 19 at 12:23

answered Feb 18 at 13:37

Incans

163

answered Feb 18 at 13:37

Incans

163

answered Feb 18 at 13:37

Incans

163

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Super User!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Jtdylktuy