Linux command line for a large data set?

The problem: I have a large Excel file of data with over 1,000 columns and over 40,000 rows. I need to identify the rows that have a value of >199 in any cell. Any row that does not have a value of >199 in any cell should be deleted, so that I am left with only rows where at least one cell has a value of >199.



I also have the same data file as a text file, so I was thinking that the best approach might be to use the Linux command line for this rather than the Excel file (which is bulky to work with given the number of rows and columns). But I am a novice at Linux and awk, so I am looking for general advice on how to approach this issue.



Thank you for your help.



An example image of the data set is below. Here I would want only the rows that have highlighted cells (because those are >200), but I can't just use the sort function or complicated if/then statements, because there are so many columns in my data set that that would be too time-consuming.



[screenshot of the data set, with the cells over the threshold highlighted]

Tags: microsoft-excel-2010

asked Nov 27 at 0:45 (edited Nov 27 at 1:30)
– Anna

  • I don't understand. Is this a question about Excel? Have you exported your data from Excel to a text file? Details, please. Please do not respond in comments; edit your question to make it clearer and more complete.
    – Scott
    Nov 27 at 1:02










  • Anna did describe her/his case clearly. She/he has "the same data file as a text file" and wants advice on how to approach parsing this txt file on the Linux command line, and which combination of tools to use for that: should several commands be piped together, or is some other approach better? I am interested in this also. Thank you for the question.
    – titus
    Nov 27 at 1:28

    @titus: I would expect somebody who has been on the site for eight years to know about edits and the revision history. If you look, you'll see that the "same data file as a text file" statement was added after I made my comment.
    – Scott
    Nov 27 at 3:09

2 Answers

Since you said you are "looking for general advice of how to approach this issue", here is one approach:



If you know how to use Python, you could save the file as a comma-separated file, write a small script that reads it with the csv module, and filter the rows from there. You can use any operating system that supports Python; a sketch of the idea is below.
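
For illustration, a minimal sketch of that approach, assuming the sheet was exported as a file named data.csv with a header row (the file name, the header assumption, and the output name filtered.csv are mine, not from the question):

import csv

# Keep only the rows in which at least one cell parses as a number > 199.
# Match the > 199 test to your final requirement (see the notes in the
# other answer about 199 vs. 200).
with open("data.csv", newline="") as src, \
     open("filtered.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow(next(reader))           # copy the header row unchanged
    for row in reader:
        values = []
        for cell in row:
            try:
                values.append(float(cell))  # non-numeric cells are ignored
            except ValueError:
                pass
        if any(v > 199 for v in values):
            writer.writerow(row)

Reading and writing one row at a time keeps memory use small, which matters with 40,000 rows of 1,000 columns.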

answered Nov 27 at 1:29
– Elmo

  • I am just learning Python, so I haven't written a script before. What script would do this?
    – Anna
    Nov 27 at 1:34










  • You could, for example, use the "xlrd" module, in particular its sheet.nrows and sheet.row_values(n) methods. A small tutorial for using xlrd is described on geeksforgeeks.org/reading-excel-file-using-python
    – Christoph Sommer
    Nov 27 at 1:46
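
To make that suggestion concrete, here is a minimal sketch using xlrd's sheet.nrows and sheet.row_values(n) methods, assuming the workbook is named data.xls with the data on its first sheet (both assumptions mine; note that xlrd 2.x reads only legacy .xls files, not .xlsx):

import xlrd  # third-party module: pip install xlrd

book = xlrd.open_workbook("data.xls")
sheet = book.sheet_by_index(0)      # assumes the data is on the first sheet

for n in range(sheet.nrows):
    row = sheet.row_values(n)       # cell values for row n, as a list
    numbers = [v for v in row if isinstance(v, (int, float))]
    if any(v > 199 for v in numbers):
        print(row)                  # keep rows with at least one value > 199

This skips the CSV export step entirely, at the cost of an extra dependency.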

I don't have much general advice. 
Specifically, I advise you to use this awk command:



awk '{
over = 0
for (i = 1; i <= NF; i++)    # scan every whitespace-separated field
    if ($i > 199) over = 1   # note when a field exceeds 199
if (over) print              # print the line only if at least one did
}'


I saved that command in an executable script, myscript, and created a small data file
based on numbers from your file, and a few I made up on my own:



$ cat input
81 23 40
31 0 416 12
2 2 1
157 41 80 201
417 42 17

$ ./myscript input
31 0 416 12
157 41 80 201
417 42 17


To delete rows from your file, do



$ ./myscript input > input.new
$ mv input.new input


Notes:




  • For your own sake,
    you should decide whether your requirement is > 199, > 200, ≥ 200, or what.

  • If you need to keep Row 1 (i.e., line 1, the header row), say so.

  • I haven't tested this on a large file. 
    awk shouldn't have any problem with a huge number of rows (lines). 
    A thousand columns (fields) might be an issue, but I doubt it.

answered Nov 27 at 3:05
– Scott