De-duping a table based on an expression using one field












0















Mariadb: mysql  Ver 15.1 Distrib 10.3.8-MariaDB


I have a table built as a union of queries from several other tables. This table looks something like-



+-------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| source_id | int(11) | YES | | NULL | |
| code | varchar(64) | YES | | NULL | |
| description | varchar(255) | YES | | NULL | |
+-------------+--------------+------+-----+---------+----------------+


The code field can contain "extra" information in curly braces {think of them as comments}. I want to find all the rows where the code field is equal if the comments are ignored, and then keep just one row from each duplicate group.



It's the "keep one row of each group" bit that's causing me problems.



I have a query that will list the IDs of all the rows in each dup group-



select id
from mytable
group by (regexp_replace(code, '(?-i){.*?}', ''))
having count(*) > 1


but if I make this a subquery in a DELETE WHERE operation it will delete all the duplicates. Not what I want.



After searching around for an "all but one" or "pop one from the output list" function I came upon the 2-argument version of LIMIT with an OFFSET-



select id
from mytable
group by (regexp_replace(code, '(?-i){.*?}', ''))
having count(*) > 1 limit 1000000 offset 1


but if you put this into a DELETE command it fails-



delete from mytable where id in(
select id
from mytable
group by (regexp_replace(code, '(?-i){.*?}', ''))
having count(*) > 1 limit 1000 offset 1
);
ERROR 1235 (42000): This version of MariaDB doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'


Any idea how to achieve this? Preferably in-place but worst case I could SELECT into a temporary table then replace the original if that's what it takes.










share|improve this question



























    0















    Mariadb: mysql  Ver 15.1 Distrib 10.3.8-MariaDB


    I have a table built as a union of queries from several other tables. This table looks something like-



    +-------------+--------------+------+-----+---------+----------------+
    | Field | Type | Null | Key | Default | Extra |
    +-------------+--------------+------+-----+---------+----------------+
    | id | int(11) | NO | PRI | NULL | auto_increment |
    | source_id | int(11) | YES | | NULL | |
    | code | varchar(64) | YES | | NULL | |
    | description | varchar(255) | YES | | NULL | |
    +-------------+--------------+------+-----+---------+----------------+


    The code field can contain "extra" information in curly braces {think of them as comments}. I want to find all the rows where the code field is equal if the comments are ignored, and then keep just one row from each duplicate group.



    It's the "keep one row of each group" bit that's causing me problems.



    I have a query that will list the IDs of all the rows in each dup group-



    select id
    from mytable
    group by (regexp_replace(code, '(?-i){.*?}', ''))
    having count(*) > 1


    but if I make this a subquery in a DELETE WHERE operation it will delete all the duplicates. Not what I want.



    After searching around for an "all but one" or "pop one from the output list" function I came upon the 2-argument version of LIMIT with an OFFSET-



    select id
    from mytable
    group by (regexp_replace(code, '(?-i){.*?}', ''))
    having count(*) > 1 limit 1000000 offset 1


    but if you put this into a DELETE command it fails-



    delete from mytable where id in(
    select id
    from mytable
    group by (regexp_replace(code, '(?-i){.*?}', ''))
    having count(*) > 1 limit 1000 offset 1
    );
    ERROR 1235 (42000): This version of MariaDB doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'


    Any idea how to achieve this? Preferably in-place but worst case I could SELECT into a temporary table then replace the original if that's what it takes.










    share|improve this question

























      0












      0








      0








      Mariadb: mysql  Ver 15.1 Distrib 10.3.8-MariaDB


      I have a table built as a union of queries from several other tables. This table looks something like-



      +-------------+--------------+------+-----+---------+----------------+
      | Field | Type | Null | Key | Default | Extra |
      +-------------+--------------+------+-----+---------+----------------+
      | id | int(11) | NO | PRI | NULL | auto_increment |
      | source_id | int(11) | YES | | NULL | |
      | code | varchar(64) | YES | | NULL | |
      | description | varchar(255) | YES | | NULL | |
      +-------------+--------------+------+-----+---------+----------------+


      The code field can contain "extra" information in curly braces {think of them as comments}. I want to find all the rows where the code field is equal if the comments are ignored, and then keep just one row from each duplicate group.



      It's the "keep one row of each group" bit that's causing me problems.



      I have a query that will list the IDs of all the rows in each dup group-



      select id
      from mytable
      group by (regexp_replace(code, '(?-i){.*?}', ''))
      having count(*) > 1


      but if I make this a subquery in a DELETE WHERE operation it will delete all the duplicates. Not what I want.



      After searching around for an "all but one" or "pop one from the output list" function I came upon the 2-argument version of LIMIT with an OFFSET-



      select id
      from mytable
      group by (regexp_replace(code, '(?-i){.*?}', ''))
      having count(*) > 1 limit 1000000 offset 1


      but if you put this into a DELETE command it fails-



      delete from mytable where id in(
      select id
      from mytable
      group by (regexp_replace(code, '(?-i){.*?}', ''))
      having count(*) > 1 limit 1000 offset 1
      );
      ERROR 1235 (42000): This version of MariaDB doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'


      Any idea how to achieve this? Preferably in-place but worst case I could SELECT into a temporary table then replace the original if that's what it takes.










      share|improve this question














      Mariadb: mysql  Ver 15.1 Distrib 10.3.8-MariaDB


      I have a table built as a union of queries from several other tables. This table looks something like-



      +-------------+--------------+------+-----+---------+----------------+
      | Field | Type | Null | Key | Default | Extra |
      +-------------+--------------+------+-----+---------+----------------+
      | id | int(11) | NO | PRI | NULL | auto_increment |
      | source_id | int(11) | YES | | NULL | |
      | code | varchar(64) | YES | | NULL | |
      | description | varchar(255) | YES | | NULL | |
      +-------------+--------------+------+-----+---------+----------------+


      The code field can contain "extra" information in curly braces {think of them as comments}. I want to find all the rows where the code field is equal if the comments are ignored, and then keep just one row from each duplicate group.



      It's the "keep one row of each group" bit that's causing me problems.



      I have a query that will list the IDs of all the rows in each dup group-



      select id
      from mytable
      group by (regexp_replace(code, '(?-i){.*?}', ''))
      having count(*) > 1


      but if I make this a subquery in a DELETE WHERE operation it will delete all the duplicates. Not what I want.



      After searching around for an "all but one" or "pop one from the output list" function I came upon the 2-argument version of LIMIT with an OFFSET-



      select id
      from mytable
      group by (regexp_replace(code, '(?-i){.*?}', ''))
      having count(*) > 1 limit 1000000 offset 1


      but if you put this into a DELETE command it fails-



      delete from mytable where id in(
      select id
      from mytable
      group by (regexp_replace(code, '(?-i){.*?}', ''))
      having count(*) > 1 limit 1000 offset 1
      );
      ERROR 1235 (42000): This version of MariaDB doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'


      Any idea how to achieve this? Preferably in-place but worst case I could SELECT into a temporary table then replace the original if that's what it takes.







      mysql deduplication mariadb






      share|improve this question













      share|improve this question











      share|improve this question




      share|improve this question










      asked Feb 13 at 22:04









      IncansIncans

      163




      163






















          2 Answers
          2






          active

          oldest

          votes


















          0














          You already figured out how to identify duplicates, so I will concentrate on how to delete extra rows, i.e. all rows but the first of each value.



          We can do this by putting Ids of rows we want to keep in a temporary table, and then deleting all rows which Ids are not in this list.



          I developed a little example script to keep it simple :



          DROP TABLE IF EXISTS hasdupes;
          CREATE TABLE hasdupes (
          Id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
          Value VARCHAR(100)
          ) ENGINE=InnoDB;

          INSERT INTO hasdupes (Value) VALUES ('red'), ('green'), ('blue'), ('blue'),
          ('green'), ('blue'), ('red'), ('orange'), ('red'), ('blue'), ('green'), ('white');

          SELECT Value, COUNT(*) FROM hasdupes GROUP BY Value;
          /*
          blue 4
          green 3
          orange 1
          red 3
          white 1
          */

          /* SELECTS an Id for each value in a temporary table */
          CREATE TEMPORARY TABLE keep SELECT Id FROM hasdupes GROUP BY Value;
          DELETE FROM hasdupes WHERE hasdupes.Id NOT IN (SELECT Id FROM keep);
          DROP TEMPORARY TABLE keep;

          SELECT Value, COUNT(*) FROM hasdupes GROUP BY Value;
          /*
          blue 1
          green 1
          orange 1
          red 1
          white 1
          */





          share|improve this answer
























          • Thanks berty. Having worked out a solution that copies the "deduped" data into a different table, I then realised this was the basis of the full answer to my original question.

            – Incans
            Feb 18 at 13:30



















          0














          I initially got this working by selecting the de-duped results into a new table-



          truncate dedup_table;
          insert into dedup_table
          (source_id, source, code, description)
          select source_id, source, code, description
          from hasdupes_table
          where id in
          (select min(id) from hasdupes_table
          group by regexp_replace(code, '(?-i){.*?}', ''));


          This is basically the same approach recommended by @berty, but using a subquery instead of a temp table.



          I then realised this could be made to work as I originally intended by inverting the subquery using NOT IN to work out which rows to delete from the "hasdupes" table.



          I haven't actually tested this but I believe it would work-



          delete from hasdupes_table where id not in
          (select min(id) from hasdupes_table
          group by regexp_replace(code, '(?-i){.*?}', ''));





          share|improve this answer

























            Your Answer








            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "3"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fsuperuser.com%2fquestions%2f1405465%2fde-duping-a-table-based-on-an-expression-using-one-field%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            2 Answers
            2






            active

            oldest

            votes








            2 Answers
            2






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            0














            You already figured out how to identify duplicates, so I will concentrate on how to delete extra rows, i.e. all rows but the first of each value.



            We can do this by putting Ids of rows we want to keep in a temporary table, and then deleting all rows which Ids are not in this list.



            I developed a little example script to keep it simple :



            DROP TABLE IF EXISTS hasdupes;
            CREATE TABLE hasdupes (
            Id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
            Value VARCHAR(100)
            ) ENGINE=InnoDB;

            INSERT INTO hasdupes (Value) VALUES ('red'), ('green'), ('blue'), ('blue'),
            ('green'), ('blue'), ('red'), ('orange'), ('red'), ('blue'), ('green'), ('white');

            SELECT Value, COUNT(*) FROM hasdupes GROUP BY Value;
            /*
            blue 4
            green 3
            orange 1
            red 3
            white 1
            */

            /* SELECTS an Id for each value in a temporary table */
            CREATE TEMPORARY TABLE keep SELECT Id FROM hasdupes GROUP BY Value;
            DELETE FROM hasdupes WHERE hasdupes.Id NOT IN (SELECT Id FROM keep);
            DROP TEMPORARY TABLE keep;

            SELECT Value, COUNT(*) FROM hasdupes GROUP BY Value;
            /*
            blue 1
            green 1
            orange 1
            red 1
            white 1
            */





            share|improve this answer
























            • Thanks berty. Having worked out a solution that copies the "deduped" data into a different table, I then realised this was the basis of the full answer to my original question.

              – Incans
              Feb 18 at 13:30
















            0














            You already figured out how to identify duplicates, so I will concentrate on how to delete extra rows, i.e. all rows but the first of each value.



            We can do this by putting Ids of rows we want to keep in a temporary table, and then deleting all rows which Ids are not in this list.



            I developed a little example script to keep it simple :



            DROP TABLE IF EXISTS hasdupes;
            CREATE TABLE hasdupes (
            Id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
            Value VARCHAR(100)
            ) ENGINE=InnoDB;

            INSERT INTO hasdupes (Value) VALUES ('red'), ('green'), ('blue'), ('blue'),
            ('green'), ('blue'), ('red'), ('orange'), ('red'), ('blue'), ('green'), ('white');

            SELECT Value, COUNT(*) FROM hasdupes GROUP BY Value;
            /*
            blue 4
            green 3
            orange 1
            red 3
            white 1
            */

            /* SELECTS an Id for each value in a temporary table */
            CREATE TEMPORARY TABLE keep SELECT Id FROM hasdupes GROUP BY Value;
            DELETE FROM hasdupes WHERE hasdupes.Id NOT IN (SELECT Id FROM keep);
            DROP TEMPORARY TABLE keep;

            SELECT Value, COUNT(*) FROM hasdupes GROUP BY Value;
            /*
            blue 1
            green 1
            orange 1
            red 1
            white 1
            */





            share|improve this answer
























            • Thanks berty. Having worked out a solution that copies the "deduped" data into a different table, I then realised this was the basis of the full answer to my original question.

              – Incans
              Feb 18 at 13:30














            0












            0








            0







            You already figured out how to identify duplicates, so I will concentrate on how to delete extra rows, i.e. all rows but the first of each value.



            We can do this by putting Ids of rows we want to keep in a temporary table, and then deleting all rows which Ids are not in this list.



            I developed a little example script to keep it simple :



            DROP TABLE IF EXISTS hasdupes;
            CREATE TABLE hasdupes (
            Id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
            Value VARCHAR(100)
            ) ENGINE=InnoDB;

            INSERT INTO hasdupes (Value) VALUES ('red'), ('green'), ('blue'), ('blue'),
            ('green'), ('blue'), ('red'), ('orange'), ('red'), ('blue'), ('green'), ('white');

            SELECT Value, COUNT(*) FROM hasdupes GROUP BY Value;
            /*
            blue 4
            green 3
            orange 1
            red 3
            white 1
            */

            /* SELECTS an Id for each value in a temporary table */
            CREATE TEMPORARY TABLE keep SELECT Id FROM hasdupes GROUP BY Value;
            DELETE FROM hasdupes WHERE hasdupes.Id NOT IN (SELECT Id FROM keep);
            DROP TEMPORARY TABLE keep;

            SELECT Value, COUNT(*) FROM hasdupes GROUP BY Value;
            /*
            blue 1
            green 1
            orange 1
            red 1
            white 1
            */





            share|improve this answer













            You already figured out how to identify duplicates, so I will concentrate on how to delete extra rows, i.e. all rows but the first of each value.



            We can do this by putting Ids of rows we want to keep in a temporary table, and then deleting all rows which Ids are not in this list.



            I developed a little example script to keep it simple :



            DROP TABLE IF EXISTS hasdupes;
            CREATE TABLE hasdupes (
            Id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
            Value VARCHAR(100)
            ) ENGINE=InnoDB;

            INSERT INTO hasdupes (Value) VALUES ('red'), ('green'), ('blue'), ('blue'),
            ('green'), ('blue'), ('red'), ('orange'), ('red'), ('blue'), ('green'), ('white');

            SELECT Value, COUNT(*) FROM hasdupes GROUP BY Value;
            /*
            blue 4
            green 3
            orange 1
            red 3
            white 1
            */

            /* SELECTS an Id for each value in a temporary table */
            CREATE TEMPORARY TABLE keep SELECT Id FROM hasdupes GROUP BY Value;
            DELETE FROM hasdupes WHERE hasdupes.Id NOT IN (SELECT Id FROM keep);
            DROP TEMPORARY TABLE keep;

            SELECT Value, COUNT(*) FROM hasdupes GROUP BY Value;
            /*
            blue 1
            green 1
            orange 1
            red 1
            white 1
            */






            share|improve this answer












            share|improve this answer



            share|improve this answer










            answered Feb 16 at 18:03









            bertyberty

            113




            113













            • Thanks berty. Having worked out a solution that copies the "deduped" data into a different table, I then realised this was the basis of the full answer to my original question.

              – Incans
              Feb 18 at 13:30



















            • Thanks berty. Having worked out a solution that copies the "deduped" data into a different table, I then realised this was the basis of the full answer to my original question.

              – Incans
              Feb 18 at 13:30

















            Thanks berty. Having worked out a solution that copies the "deduped" data into a different table, I then realised this was the basis of the full answer to my original question.

            – Incans
            Feb 18 at 13:30





            Thanks berty. Having worked out a solution that copies the "deduped" data into a different table, I then realised this was the basis of the full answer to my original question.

            – Incans
            Feb 18 at 13:30













            0














            I initially got this working by selecting the de-duped results into a new table-



            truncate dedup_table;
            insert into dedup_table
            (source_id, source, code, description)
            select source_id, source, code, description
            from hasdupes_table
            where id in
            (select min(id) from hasdupes_table
            group by regexp_replace(code, '(?-i){.*?}', ''));


            This is basically the same approach recommended by @berty, but using a subquery instead of a temp table.



            I then realised this could be made to work as I originally intended by inverting the subquery using NOT IN to work out which rows to delete from the "hasdupes" table.



            I haven't actually tested this but I believe it would work-



            delete from hasdupes_table where id not in
            (select min(id) from hasdupes_table
            group by regexp_replace(code, '(?-i){.*?}', ''));





            share|improve this answer






























              0














              I initially got this working by selecting the de-duped results into a new table-



              truncate dedup_table;
              insert into dedup_table
              (source_id, source, code, description)
              select source_id, source, code, description
              from hasdupes_table
              where id in
              (select min(id) from hasdupes_table
              group by regexp_replace(code, '(?-i){.*?}', ''));


              This is basically the same approach recommended by @berty, but using a subquery instead of a temp table.



              I then realised this could be made to work as I originally intended by inverting the subquery using NOT IN to work out which rows to delete from the "hasdupes" table.



              I haven't actually tested this but I believe it would work-



              delete from hasdupes_table where id not in
              (select min(id) from hasdupes_table
              group by regexp_replace(code, '(?-i){.*?}', ''));





              share|improve this answer




























                0












                0








                0







                I initially got this working by selecting the de-duped results into a new table-



                truncate dedup_table;
                insert into dedup_table
                (source_id, source, code, description)
                select source_id, source, code, description
                from hasdupes_table
                where id in
                (select min(id) from hasdupes_table
                group by regexp_replace(code, '(?-i){.*?}', ''));


                This is basically the same approach recommended by @berty, but using a subquery instead of a temp table.



                I then realised this could be made to work as I originally intended by inverting the subquery using NOT IN to work out which rows to delete from the "hasdupes" table.



                I haven't actually tested this but I believe it would work-



                delete from hasdupes_table where id not in
                (select min(id) from hasdupes_table
                group by regexp_replace(code, '(?-i){.*?}', ''));





                share|improve this answer















                I initially got this working by selecting the de-duped results into a new table-



                truncate dedup_table;
                insert into dedup_table
                (source_id, source, code, description)
                select source_id, source, code, description
                from hasdupes_table
                where id in
                (select min(id) from hasdupes_table
                group by regexp_replace(code, '(?-i){.*?}', ''));


                This is basically the same approach recommended by @berty, but using a subquery instead of a temp table.



                I then realised this could be made to work as I originally intended by inverting the subquery using NOT IN to work out which rows to delete from the "hasdupes" table.



                I haven't actually tested this but I believe it would work-



                delete from hasdupes_table where id not in
                (select min(id) from hasdupes_table
                group by regexp_replace(code, '(?-i){.*?}', ''));






                share|improve this answer














                share|improve this answer



                share|improve this answer








                edited Feb 19 at 12:23

























                answered Feb 18 at 13:37









                IncansIncans

                163




                163






























                    draft saved

                    draft discarded




















































                    Thanks for contributing an answer to Super User!


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    To learn more, see our tips on writing great answers.




                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function () {
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fsuperuser.com%2fquestions%2f1405465%2fde-duping-a-table-based-on-an-expression-using-one-field%23new-answer', 'question_page');
                    }
                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    Index of /

                    Tribalistas

                    Listed building