unix - split a huge .gz file by line

I'm sure someone has had the below need: what is a quick way of splitting a huge .gz file by line? The underlying text file has 120 million rows. I don't have enough disk space to gunzip the entire file at once, so I was wondering if someone knows of a bash/perl script or tool that could split the file (either the .gz or the inner .txt) into three 40-million-line files, i.e. calling it like:



    bash splitter.sh hugefile.txt.gz 40000000 1
would get lines 1 to 40 mn
    bash splitter.sh hugefile.txt.gz 40000000 2
would get lines 40 mn to 80 mn
    bash splitter.sh hugefile.txt.gz 40000000 3
would get lines 80 mn to 120 mn


Is perhaps doing a series of these a solution, or would the gunzip -c require enough space for the entire file to be unzipped (i.e. the original problem)? gunzip -c hugefile.txt.gz | head -n 40000000



Note: I can't get extra disk.



Thanks!

linux perl bash shell unix

asked Jan 23 '12 at 11:21 – toop

migrated from stackoverflow.com Jan 23 '12 at 11:28
This question came from our site for professional and enthusiast programmers.

  • Do you want the resulting files to be gzipped again?

    – user95605
    Jan 23 '12 at 11:25











  • You can use gunzip in a pipe. The rest can be done with head and tail

    – Ingo
    Jan 23 '12 at 11:25











  • @Tichodroma - no, I don't need them gzipped again. But I couldn't store all the split text files at once, so I would like to get the first split, do stuff with it, then delete the first split and get the 2nd split, etc., finally removing the original .gz

    – toop
    Jan 23 '12 at 11:42






  • @toop: Thanks for the clarification. Note that it's generally better to edit your question if you want to clarify it, rather than put it into a comment; that way everyone will see it.

    – sleske
    Jan 23 '12 at 12:06











  • The accepted answer is good if you only want a fraction of the chunks and do not know them in advance. If you want to generate all the chunks at once, the solutions based on split will be a lot faster, O(N) instead of O(N²).

    – b0fh
    Aug 13 '14 at 15:27
7 Answers


















11














How to do this best depends on what you want:




  • Do you want to extract a single part of the large file?

  • Or do you want to create all the parts in one go?




If you want a single part of the file, your idea to use gunzip and head is right. You can use:



gunzip -c hugefile.txt.gz | head -n 4000000


That would output the first 4000000 lines on standard out - you probably want to append another pipe to actually do something with the data.



To get the other parts, you'd use a combination of head and tail, like:



gunzip -c hugefile.txt.gz | head -n 8000000 | tail -n 4000000


to get the second block.




Is perhaps doing a series of these a solution or would the gunzip -c
require enough space for the entire file to be unzipped




No, the gunzip -c does not require any disk space - it decompresses as a stream in memory and writes the result to stdout.





If you want to create all the parts in one go, it is more efficient to create them all with a single command, because then the input file is only read once. One good solution is to use split; see jim mcnamara's answer for details.
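
Putting this together with the asker's process-then-delete workflow from the comments, a minimal sketch of the per-chunk loop, assuming three 40-million-line chunks as in the question and with do_something as a hypothetical placeholder for the actual processing (note the O(N²) caveat from the question comments: the compressed stream is re-read for every chunk):

    for i in 1 2 3; do
        # extract chunk i by re-reading the compressed stream each time
        gunzip -c hugefile.txt.gz | head -n $((i * 40000000)) | tail -n 40000000 > "chunk$i.txt"
        do_something "chunk$i.txt"   # hypothetical processing step
        rm "chunk$i.txt"             # free the disk space before the next chunk
    done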

answered Jan 23 '12 at 11:29, edited Sep 7 '16 at 7:19 – sleske

  • From a performance view: does gzip actually unzip the whole file? Or is it able to "magically" know that only 4 mn lines are needed?

    – Alois Mahdal
    Mar 22 '12 at 12:57






  • @AloisMahdal: Actually, that would be a good separate question :-). Short version: gzip does not know about the limit (which comes from a different process). If head is used, head will exit when it has received enough, and this will propagate to gzip (via SIGPIPE, see Wikipedia). For tail this is not possible, so yes, gzip will decompress everything.

    – sleske
    Mar 22 '12 at 15:26











  • But if you are interested, you should really ask this as a separate question.

    – sleske
    Mar 22 '12 at 15:35



















19














Pipe to split, using either gunzip -c or zcat to open the file:



gunzip -c bigfile.gz | split -l 400000


Add output specifications to the split command.
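
For example, a sketch using standard GNU split options (-d requests numeric suffixes, - names stdin as the input, and the final argument sets the output prefix; 40000000 matches the chunk size from the question):

    gunzip -c bigfile.gz | split -l 40000000 -d - hugefile_part_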

answered Jan 23 '12 at 16:41 – jim mcnamara

  • This is massively more efficient than the accepted answer, unless you only require a fraction of the split chunks. Please upvote.

    – b0fh
    Aug 13 '14 at 15:29











  • @b0fh: Yes, you are right. Upvoted, and referenced in my answer :-).

    – sleske
    Sep 7 '16 at 7:20











  • Best answer for sure.

    – Stephen Blum
    Mar 7 '18 at 21:27











  • What are the output specs so that the outputs are .gz files themselves?

    – Quetzalcoatl
    Oct 26 '18 at 5:53



















6














As you are working on a (non-rewindable) stream, you will want to use the '+N' form of tail to get lines starting from line N onwards.



zcat hugefile.txt.gz | head -n 40000000
zcat hugefile.txt.gz | tail -n +40000001 | head -n 40000000
zcat hugefile.txt.gz | tail -n +80000001 | head -n 40000000
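
Wrapped up as the splitter.sh interface the question asks for, a minimal sketch along the same lines (the script name and argument order are the hypothetical ones from the question; the chunk is streamed to stdout):

    #!/usr/bin/env bash
    # Usage: splitter.sh FILE.gz LINES_PER_CHUNK CHUNK_NUMBER
    file=$1; lines=$2; chunk=$3
    start=$(( (chunk - 1) * lines + 1 ))    # first line of the requested chunk
    zcat "$file" | tail -n +"$start" | head -n "$lines"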

answered Jan 23 '12 at 11:33 – zgpmax

    4














    I'd consider using split.




    split a file into pieces

    answered Jan 23 '12 at 11:24 – Michael Krelin - hacker, edited Feb 20 '12 at 0:22 by Tom Wijsman

      2














      Here's a python script to open a globbed set of files from a directory, gunzip them if necessary, and read through them line by line. It only uses the space necessary in memory for holding the filenames and the current line, plus a little overhead.

      #!/usr/bin/env python
      import gzip, bz2
      import os
      import fnmatch
      import argparse

      def gen_find(filepat, top):
          # walk the directory tree and yield paths matching the glob pattern
          for path, dirlist, filelist in os.walk(top):
              for name in fnmatch.filter(filelist, filepat):
                  yield os.path.join(path, name)

      def gen_open(filenames):
          # open each file with the right decompressor based on its extension
          for name in filenames:
              if name.endswith(".gz"):
                  yield gzip.open(name)
              elif name.endswith(".bz2"):
                  yield bz2.BZ2File(name)
              else:
                  yield open(name)

      def gen_cat(sources):
          # chain the open files into one stream of lines
          for s in sources:
              for item in s:
                  yield item

      def main(filepat, searchDir):
          fileNames = gen_find(filepat, searchDir)
          fileHandles = gen_open(fileNames)
          fileLines = gen_cat(fileHandles)
          for line in fileLines:
              print line

      if __name__ == '__main__':
          parser = argparse.ArgumentParser(description='Search globbed files line by line')
          parser.add_argument('filepat', type=str, default='*', help='Glob pattern to match')
          parser.add_argument('searchDir', type=str, default='.', help='Directory to search')
          args = parser.parse_args()
          main(args.filepat, args.searchDir)

      The print line command will send every line to stdout, so you can redirect to a file. Alternatively, if you let us know what you want done with the lines, I can add it to the python script and you won't need to leave chunks of the file lying around.
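
      For instance, a sketch of chaining the script's output into split (assuming the script is saved as gencat.py, a hypothetical name; the first argument is a shell-style glob handled by fnmatch):

          python gencat.py '*.gz' /path/to/dir | split -l 40000000 -d - chunk_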

      answered Jan 23 '12 at 13:39 – Spencer Rathbun

        2














        Here's a perl program that can be used to read stdin, and split the lines, piping each clump to a separate command that can use a shell variable $SPLIT to route it to a different destination. For your case, it would be invoked with



        zcat hugefile.txt.gz | perl xsplit.pl 40000000 'cat > tmp$SPLIT.txt; do_something tmp$SPLIT.txt; rm tmp$SPLIT.txt'



        Sorry the command-line processing is a little kludgy but you get the idea.



        #!/usr/bin/perl -w
        #####
        # xsplit.pl: like xargs but instead of clumping input into each command's args, clumps it into each command's input.
        # Usage: perl xsplit.pl LINES 'COMMAND'
        # where: 'COMMAND' can include shell variable expansions and can use $SPLIT, e.g.
        # 'cat > tmp$SPLIT.txt'
        # or:
        # 'gzip > tmp$SPLIT.gz'
        #####
        use strict;

        sub pipeHandler {
            my $sig = shift @_;
            print "Caught SIGPIPE: $sig\n";
            exit(1);
        }
        $SIG{PIPE} = \&pipeHandler;

        my $LINES = shift;
        die "LINES must be a positive number\n" if ($LINES <= 0);
        my $COMMAND = shift || die "second argument should be COMMAND\n";

        my $line_number = 0;

        while (<STDIN>) {
            if ($line_number % $LINES == 0) {
                close OUTFILE;
                my $split = $ENV{SPLIT} = sprintf("%05d", $line_number / $LINES + 1);
                print "$split\n";
                my $command = $COMMAND;
                open (OUTFILE, "| $command") or die "failed to write to command '$command'\n";
            }
            print OUTFILE $_;
            $line_number++;
        }

        exit 0;




































          1














          Directly split .gz file to .gz files:



          zcat bigfile.gz | split -l 400000 --filter='gzip > $FILE.gz'


          I think this is what the OP wanted, because he doesn't have much space.
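
          To then handle the chunks one at a time, a sketch (with split's default output names the compressed chunks come out as xaa.gz, xab.gz, ...; do_something is a hypothetical placeholder for the actual processing):

              for f in x*.gz; do
                  gunzip -c "$f" | do_something   # stream one chunk into the processing step
                  rm "$f"                         # free the space before the next chunk
              done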






























            Your Answer








            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "3"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fsuperuser.com%2fquestions%2f381394%2funix-split-a-huge-gz-file-by-line%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            7 Answers
            7






            active

            oldest

            votes








            7 Answers
            7






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            11














            How to do this best depends on what you want:




            • Do you want to extract a single part of the large file?

            • Or do you want to create all the parts in one go?




            If you want a single part of the file, your idea to use gunzip and head is right. You can use:



            gunzip -c hugefile.txt.gz | head -n 4000000


            That would output the first 4000000 lines on standard out - you probably want to append another pipe to actually do something with the data.



            To get the other parts, you'd use a combination of head and tail, like:



            gunzip -c hugefile.txt.gz | head -n 8000000 |tail -n 4000000


            to get the second block.




            Is perhaps doing a series of these a solution or would the gunzip -c
            require enough space for the entire file to be unzipped




            No, the gunzip -c does not require any disk space - it does everything in memory, then streams it out to stdout.





            If you want to create all the parts in one go, it is more efficient to create them all with a single command, because then the input file is only read once. One good solution is to use split; see jim mcnamara's answer for details.






            share|improve this answer


























            • From performance view: does gzip actually unzip whole file? Or is it able to "magically" know that only 4mn lines are needed?

              – Alois Mahdal
              Mar 22 '12 at 12:57






            • 3





              @AloisMahdal: Actually, that would be a good separate question :-). Short version: gzip does not know about the limit (which comes from a different process). If head is used, head will exit when it has received enough, and this will propagate to gzip (via SIGPIPE, see Wikipedia). For tail this is not possible, so yes, gzip will decompress everything.

              – sleske
              Mar 22 '12 at 15:26











            • But if you are interested, you should really ask this as a separate question.

              – sleske
              Mar 22 '12 at 15:35
















            11














            How to do this best depends on what you want:




            • Do you want to extract a single part of the large file?

            • Or do you want to create all the parts in one go?




            If you want a single part of the file, your idea to use gunzip and head is right. You can use:



            gunzip -c hugefile.txt.gz | head -n 4000000


            That would output the first 4000000 lines on standard out - you probably want to append another pipe to actually do something with the data.



            To get the other parts, you'd use a combination of head and tail, like:



            gunzip -c hugefile.txt.gz | head -n 8000000 |tail -n 4000000


            to get the second block.




            Is perhaps doing a series of these a solution or would the gunzip -c
            require enough space for the entire file to be unzipped




            No, the gunzip -c does not require any disk space - it does everything in memory, then streams it out to stdout.





            If you want to create all the parts in one go, it is more efficient to create them all with a single command, because then the input file is only read once. One good solution is to use split; see jim mcnamara's answer for details.






            share|improve this answer


























            • From performance view: does gzip actually unzip whole file? Or is it able to "magically" know that only 4mn lines are needed?

              – Alois Mahdal
              Mar 22 '12 at 12:57






            • 3





              @AloisMahdal: Actually, that would be a good separate question :-). Short version: gzip does not know about the limit (which comes from a different process). If head is used, head will exit when it has received enough, and this will propagate to gzip (via SIGPIPE, see Wikipedia). For tail this is not possible, so yes, gzip will decompress everything.

              – sleske
              Mar 22 '12 at 15:26











            • But if you are interested, you should really ask this as a separate question.

              – sleske
              Mar 22 '12 at 15:35














            11












            11








            11







            How to do this best depends on what you want:




            • Do you want to extract a single part of the large file?

            • Or do you want to create all the parts in one go?




            If you want a single part of the file, your idea to use gunzip and head is right. You can use:



            gunzip -c hugefile.txt.gz | head -n 4000000


            That would output the first 4000000 lines on standard out - you probably want to append another pipe to actually do something with the data.



            To get the other parts, you'd use a combination of head and tail, like:



            gunzip -c hugefile.txt.gz | head -n 8000000 |tail -n 4000000


            to get the second block.




            Is perhaps doing a series of these a solution or would the gunzip -c
            require enough space for the entire file to be unzipped




            No, the gunzip -c does not require any disk space - it does everything in memory, then streams it out to stdout.





            If you want to create all the parts in one go, it is more efficient to create them all with a single command, because then the input file is only read once. One good solution is to use split; see jim mcnamara's answer for details.






            share|improve this answer















            How to do this best depends on what you want:




            • Do you want to extract a single part of the large file?

            • Or do you want to create all the parts in one go?




            If you want a single part of the file, your idea to use gunzip and head is right. You can use:



            gunzip -c hugefile.txt.gz | head -n 4000000


            That would output the first 4000000 lines on standard out - you probably want to append another pipe to actually do something with the data.



            To get the other parts, you'd use a combination of head and tail, like:



            gunzip -c hugefile.txt.gz | head -n 8000000 |tail -n 4000000


            to get the second block.




            Is perhaps doing a series of these a solution or would the gunzip -c
            require enough space for the entire file to be unzipped




            No, the gunzip -c does not require any disk space - it does everything in memory, then streams it out to stdout.





            If you want to create all the parts in one go, it is more efficient to create them all with a single command, because then the input file is only read once. One good solution is to use split; see jim mcnamara's answer for details.







            share|improve this answer














            share|improve this answer



            share|improve this answer








            edited Sep 7 '16 at 7:19

























            answered Jan 23 '12 at 11:29









            sleskesleske

            18k85383




            18k85383













            • From performance view: does gzip actually unzip whole file? Or is it able to "magically" know that only 4mn lines are needed?

              – Alois Mahdal
              Mar 22 '12 at 12:57






            • 3





              @AloisMahdal: Actually, that would be a good separate question :-). Short version: gzip does not know about the limit (which comes from a different process). If head is used, head will exit when it has received enough, and this will propagate to gzip (via SIGPIPE, see Wikipedia). For tail this is not possible, so yes, gzip will decompress everything.

              – sleske
              Mar 22 '12 at 15:26











            • But if you are interested, you should really ask this as a separate question.

              – sleske
              Mar 22 '12 at 15:35



















            • From performance view: does gzip actually unzip whole file? Or is it able to "magically" know that only 4mn lines are needed?

              – Alois Mahdal
              Mar 22 '12 at 12:57






            • 3





              @AloisMahdal: Actually, that would be a good separate question :-). Short version: gzip does not know about the limit (which comes from a different process). If head is used, head will exit when it has received enough, and this will propagate to gzip (via SIGPIPE, see Wikipedia). For tail this is not possible, so yes, gzip will decompress everything.

              – sleske
              Mar 22 '12 at 15:26











            • But if you are interested, you should really ask this as a separate question.

              – sleske
              Mar 22 '12 at 15:35

















            From performance view: does gzip actually unzip whole file? Or is it able to "magically" know that only 4mn lines are needed?

            – Alois Mahdal
            Mar 22 '12 at 12:57





            From performance view: does gzip actually unzip whole file? Or is it able to "magically" know that only 4mn lines are needed?

            – Alois Mahdal
            Mar 22 '12 at 12:57




            3




            3





            @AloisMahdal: Actually, that would be a good separate question :-). Short version: gzip does not know about the limit (which comes from a different process). If head is used, head will exit when it has received enough, and this will propagate to gzip (via SIGPIPE, see Wikipedia). For tail this is not possible, so yes, gzip will decompress everything.

            – sleske
            Mar 22 '12 at 15:26





            @AloisMahdal: Actually, that would be a good separate question :-). Short version: gzip does not know about the limit (which comes from a different process). If head is used, head will exit when it has received enough, and this will propagate to gzip (via SIGPIPE, see Wikipedia). For tail this is not possible, so yes, gzip will decompress everything.

            – sleske
            Mar 22 '12 at 15:26













            But if you are interested, you should really ask this as a separate question.

            – sleske
            Mar 22 '12 at 15:35





            But if you are interested, you should really ask this as a separate question.

            – sleske
            Mar 22 '12 at 15:35













            19














            pipe to split use either gunzip -c or zcat to open the file



            gunzip -c bigfile.gz | split -l 400000


            Add output specifications to the split command.






            share|improve this answer



















            • 2





              This is massively more efficient than the accepted answer, unless you only require a fraction of the split chunks. Please upvote.

              – b0fh
              Aug 13 '14 at 15:29











            • @b0fh: Yes, your are right. Upvoted, and referenced in my answer :-).

              – sleske
              Sep 7 '16 at 7:20











            • Best answer for sure.

              – Stephen Blum
              Mar 7 '18 at 21:27











            • what are the output specs so that the outputs are .gz files themselves?

              – Quetzalcoatl
              Oct 26 '18 at 5:53
















            19














            pipe to split use either gunzip -c or zcat to open the file



            gunzip -c bigfile.gz | split -l 400000


            Add output specifications to the split command.






            share|improve this answer



















            • 2





              This is massively more efficient than the accepted answer, unless you only require a fraction of the split chunks. Please upvote.

              – b0fh
              Aug 13 '14 at 15:29











            • @b0fh: Yes, your are right. Upvoted, and referenced in my answer :-).

              – sleske
              Sep 7 '16 at 7:20











            • Best answer for sure.

              – Stephen Blum
              Mar 7 '18 at 21:27











            • what are the output specs so that the outputs are .gz files themselves?

              – Quetzalcoatl
              Oct 26 '18 at 5:53














            19












            19








            19







            pipe to split use either gunzip -c or zcat to open the file



            gunzip -c bigfile.gz | split -l 400000


            Add output specifications to the split command.






            share|improve this answer













            pipe to split use either gunzip -c or zcat to open the file



            gunzip -c bigfile.gz | split -l 400000


            Add output specifications to the split command.







            share|improve this answer












            share|improve this answer



            share|improve this answer










            answered Jan 23 '12 at 16:41









            jim mcnamarajim mcnamara

            75947




            75947








            • 2





              This is massively more efficient than the accepted answer, unless you only require a fraction of the split chunks. Please upvote.

              – b0fh
              Aug 13 '14 at 15:29











            • @b0fh: Yes, your are right. Upvoted, and referenced in my answer :-).

              – sleske
              Sep 7 '16 at 7:20











            • Best answer for sure.

              – Stephen Blum
              Mar 7 '18 at 21:27











            • what are the output specs so that the outputs are .gz files themselves?

              – Quetzalcoatl
              Oct 26 '18 at 5:53














            • 2





              This is massively more efficient than the accepted answer, unless you only require a fraction of the split chunks. Please upvote.

              – b0fh
              Aug 13 '14 at 15:29











            • @b0fh: Yes, your are right. Upvoted, and referenced in my answer :-).

              – sleske
              Sep 7 '16 at 7:20











            • Best answer for sure.

              – Stephen Blum
              Mar 7 '18 at 21:27











            • what are the output specs so that the outputs are .gz files themselves?

              – Quetzalcoatl
              Oct 26 '18 at 5:53








            2




            2





            This is massively more efficient than the accepted answer, unless you only require a fraction of the split chunks. Please upvote.

            – b0fh
            Aug 13 '14 at 15:29





            This is massively more efficient than the accepted answer, unless you only require a fraction of the split chunks. Please upvote.

            – b0fh
            Aug 13 '14 at 15:29













            @b0fh: Yes, your are right. Upvoted, and referenced in my answer :-).

            – sleske
            Sep 7 '16 at 7:20





            @b0fh: Yes, your are right. Upvoted, and referenced in my answer :-).

            – sleske
            Sep 7 '16 at 7:20













            Best answer for sure.

            – Stephen Blum
            Mar 7 '18 at 21:27





            Best answer for sure.

            – Stephen Blum
            Mar 7 '18 at 21:27













            what are the output specs so that the outputs are .gz files themselves?

            – Quetzalcoatl
            Oct 26 '18 at 5:53





            what are the output specs so that the outputs are .gz files themselves?

            – Quetzalcoatl
            Oct 26 '18 at 5:53











            6














            As you are working on a (non-rewindable) stream, you will want to use the '+N' form of tail to get lines starting from line N onwards.



            zcat hugefile.txt.gz | head -n 40000000
            zcat hugefile.txt.gz | tail -n +40000001 | head -n 40000000
            zcat hugefile.txt.gz | tail -n +80000001 | head -n 40000000





            share|improve this answer




























              6














              As you are working on a (non-rewindable) stream, you will want to use the '+N' form of tail to get lines starting from line N onwards.



              zcat hugefile.txt.gz | head -n 40000000
              zcat hugefile.txt.gz | tail -n +40000001 | head -n 40000000
              zcat hugefile.txt.gz | tail -n +80000001 | head -n 40000000





              share|improve this answer


























                6












                6








                6







                As you are working on a (non-rewindable) stream, you will want to use the '+N' form of tail to get lines starting from line N onwards.



                zcat hugefile.txt.gz | head -n 40000000
                zcat hugefile.txt.gz | tail -n +40000001 | head -n 40000000
                zcat hugefile.txt.gz | tail -n +80000001 | head -n 40000000





                share|improve this answer













                As you are working on a (non-rewindable) stream, you will want to use the '+N' form of tail to get lines starting from line N onwards.



                zcat hugefile.txt.gz | head -n 40000000
                zcat hugefile.txt.gz | tail -n +40000001 | head -n 40000000
                zcat hugefile.txt.gz | tail -n +80000001 | head -n 40000000






                share|improve this answer












                share|improve this answer



                share|improve this answer










                answered Jan 23 '12 at 11:33









                zgpmaxzgpmax

                1692




                1692























                    4














                    I'd consider using split.




                    split a file into pieces







                    share|improve this answer






























                      4














                      I'd consider using split.




                      split a file into pieces







                      share|improve this answer




























                        4












                        4








                        4







                        I'd consider using split.




                        split a file into pieces







                        share|improve this answer















                        I'd consider using split.




                        split a file into pieces








                        share|improve this answer














                        share|improve this answer



                        share|improve this answer








                        edited Feb 20 '12 at 0:22









                        Tom Wijsman

                        50.4k24164247




                        50.4k24164247










                        answered Jan 23 '12 at 11:24









                        Michael Krelin - hackerMichael Krelin - hacker

                        58636




                        58636























                            2














                            Here's a python script to open a globbed set of files from a directory, gunzip them if necessary, and read through them line by line. It only uses the space necessary in memory for holding the filenames, and the current line, plus a little overhead.



                            #!/usr/bin/env python
                            import gzip, bz2
                            import os
                            import fnmatch

                            def gen_find(filepat,top):
                            for path, dirlist, filelist in os.walk(top):
                            for name in fnmatch.filter(filelist,filepat):
                            yield os.path.join(path,name)

                            def gen_open(filenames):
                            for name in filenames:
                            if name.endswith(".gz"):
                            yield gzip.open(name)
                            elif name.endswith(".bz2"):
                            yield bz2.BZ2File(name)
                            else:
                            yield open(name)

                            def gen_cat(sources):
                            for s in sources:
                            for item in s:
                            yield item

                            def main(regex, searchDir):
                            fileNames = gen_find(regex,searchDir)
                            fileHandles = gen_open(fileNames)
                            fileLines = gen_cat(fileHandles)
                            for line in fileLines:
                            print line

                            if __name__ == '__main__':
                            parser = argparse.ArgumentParser(description='Search globbed files line by line', version='%(prog)s 1.0')
                            parser.add_argument('regex', type=str, default='*', help='Regular expression')
                            parser.add_argument('searchDir', , type=str, default='.', help='list of input files')
                            args = parser.parse_args()
                            main(args.regex, args.searchDir)


                            The print line command will send every line to std out, so you can redirect to a file. Alternatively, if you let us know what you want done with the lines, I can add it to the python script and you won't need to leave chunks of the file laying around.






                            share|improve this answer




























                              2














                              Here's a python script to open a globbed set of files from a directory, gunzip them if necessary, and read through them line by line. It only uses the space necessary in memory for holding the filenames, and the current line, plus a little overhead.



                              #!/usr/bin/env python
                              import gzip, bz2
                              import os
                              import fnmatch

                              def gen_find(filepat,top):
                              for path, dirlist, filelist in os.walk(top):
                              for name in fnmatch.filter(filelist,filepat):
                              yield os.path.join(path,name)

                              def gen_open(filenames):
                              for name in filenames:
                              if name.endswith(".gz"):
                              yield gzip.open(name)
                              elif name.endswith(".bz2"):
                              yield bz2.BZ2File(name)
                              else:
                              yield open(name)

                              def gen_cat(sources):
                              for s in sources:
                              for item in s:
                              yield item

                              def main(regex, searchDir):
                              fileNames = gen_find(regex,searchDir)
                              fileHandles = gen_open(fileNames)
                              fileLines = gen_cat(fileHandles)
                              for line in fileLines:
                              print line

                              if __name__ == '__main__':
                              parser = argparse.ArgumentParser(description='Search globbed files line by line', version='%(prog)s 1.0')
                              parser.add_argument('regex', type=str, default='*', help='Regular expression')
                              parser.add_argument('searchDir', , type=str, default='.', help='list of input files')
                              args = parser.parse_args()
                              main(args.regex, args.searchDir)


                              The print line command will send every line to std out, so you can redirect to a file. Alternatively, if you let us know what you want done with the lines, I can add it to the python script and you won't need to leave chunks of the file laying around.






                              share|improve this answer


























                                2












                                2








                                2







                                Here's a python script to open a globbed set of files from a directory, gunzip them if necessary, and read through them line by line. It only uses the space necessary in memory for holding the filenames, and the current line, plus a little overhead.



                                #!/usr/bin/env python
                                import gzip, bz2
                                import os
                                import fnmatch

                                def gen_find(filepat,top):
                                for path, dirlist, filelist in os.walk(top):
                                for name in fnmatch.filter(filelist,filepat):
                                yield os.path.join(path,name)

                                def gen_open(filenames):
                                for name in filenames:
                                if name.endswith(".gz"):
                                yield gzip.open(name)
                                elif name.endswith(".bz2"):
                                yield bz2.BZ2File(name)
                                else:
                                yield open(name)

                                def gen_cat(sources):
                                for s in sources:
                                for item in s:
                                yield item

                                def main(regex, searchDir):
                                fileNames = gen_find(regex,searchDir)
                                fileHandles = gen_open(fileNames)
                                fileLines = gen_cat(fileHandles)
                                for line in fileLines:
                                print line

                                if __name__ == '__main__':
                                parser = argparse.ArgumentParser(description='Search globbed files line by line', version='%(prog)s 1.0')
                                parser.add_argument('regex', type=str, default='*', help='Regular expression')
                                parser.add_argument('searchDir', , type=str, default='.', help='list of input files')
                                args = parser.parse_args()
                                main(args.regex, args.searchDir)


                                The print line command will send every line to std out, so you can redirect to a file. Alternatively, if you let us know what you want done with the lines, I can add it to the python script and you won't need to leave chunks of the file laying around.






                                share|improve this answer













                                Here's a python script to open a globbed set of files from a directory, gunzip them if necessary, and read through them line by line. It only uses the space necessary in memory for holding the filenames, and the current line, plus a little overhead.



                                #!/usr/bin/env python
                                import gzip, bz2
                                import os
                                import fnmatch

                                def gen_find(filepat,top):
                                for path, dirlist, filelist in os.walk(top):
                                for name in fnmatch.filter(filelist,filepat):
                                yield os.path.join(path,name)

                                def gen_open(filenames):
                                for name in filenames:
                                if name.endswith(".gz"):
                                yield gzip.open(name)
                                elif name.endswith(".bz2"):
                                yield bz2.BZ2File(name)
                                else:
                                yield open(name)

                                def gen_cat(sources):
                                for s in sources:
                                for item in s:
                                yield item

                                def main(regex, searchDir):
                                fileNames = gen_find(regex,searchDir)
                                fileHandles = gen_open(fileNames)
                                fileLines = gen_cat(fileHandles)
                                for line in fileLines:
                                print line

                                if __name__ == '__main__':
                                parser = argparse.ArgumentParser(description='Search globbed files line by line', version='%(prog)s 1.0')
                                parser.add_argument('regex', type=str, default='*', help='Regular expression')
                                parser.add_argument('searchDir', , type=str, default='.', help='list of input files')
                                args = parser.parse_args()
                                main(args.regex, args.searchDir)


                                The print line command will send every line to std out, so you can redirect to a file. Alternatively, if you let us know what you want done with the lines, I can add it to the python script and you won't need to leave chunks of the file laying around.







                                share|improve this answer












                                share|improve this answer



                                share|improve this answer










                                answered Jan 23 '12 at 13:39









                                Spencer RathbunSpencer Rathbun

                                27125




                                27125























                                    2














                                    Here's a perl program that can be used to read stdin, and split the lines, piping each clump to a separate command that can use a shell variable $SPLIT to route it to a different destination. For your case, it would be invoked with



                                    zcat hugefile.txt.gz | perl xsplit.pl 40000000 'cat > tmp$SPLIT.txt; do_something tmp$SPLIT.txt; rm tmp$SPLIT.txt'



                                    Sorry the command-line processing is a little kludgy but you get the idea.



                                    #!/usr/bin/perl -w
                                    #####
                                    # xsplit.pl: like xargs but instead of clumping input into each command's args, clumps it into each command's input.
                                    # Usage: perl xsplit.pl LINES 'COMMAND'
                                    # where: 'COMMAND' can include shell variable expansions and can use $SPLIT, e.g.
                                    # 'cat > tmp$SPLIT.txt'
                                    # or:
                                    # 'gzip > tmp$SPLIT.gz'
                                    #####
                                    use strict;

                                    sub pipeHandler {
                                    my $sig = shift @_;
                                    print " Caught SIGPIPE: $sign";
                                    exit(1);
                                    }
                                    $SIG{PIPE} = &pipeHandler;

                                    my $LINES = shift;
                                    die "LINES must be a positive numbern" if ($LINES <= 0);
                                    my $COMMAND = shift || die "second argument should be COMMANDn";

                                    my $line_number = 0;

                                    while (<STDIN>) {
                                    if ($line_number%$LINES == 0) {
                                    close OUTFILE;
                                    my $split = $ENV{SPLIT} = sprintf("%05d", $line_number/$LINES+1);
                                    print "$splitn";
                                    my $command = $COMMAND;
                                    open (OUTFILE, "| $command") or die "failed to write to command '$command'n";
                                    }
                                    print OUTFILE $_;
                                    $line_number++;
                                    }

                                    exit 0;





                                    share|improve this answer




























                                      2














                                      Here's a perl program that can be used to read stdin, and split the lines, piping each clump to a separate command that can use a shell variable $SPLIT to route it to a different destination. For your case, it would be invoked with



                                      zcat hugefile.txt.gz | perl xsplit.pl 40000000 'cat > tmp$SPLIT.txt; do_something tmp$SPLIT.txt; rm tmp$SPLIT.txt'



                                      Sorry the command-line processing is a little kludgy but you get the idea.



                                      #!/usr/bin/perl -w
                                      #####
                                      # xsplit.pl: like xargs but instead of clumping input into each command's args, clumps it into each command's input.
                                      # Usage: perl xsplit.pl LINES 'COMMAND'
                                      # where: 'COMMAND' can include shell variable expansions and can use $SPLIT, e.g.
                                      # 'cat > tmp$SPLIT.txt'
                                      # or:
                                      # 'gzip > tmp$SPLIT.gz'
                                      #####
                                      use strict;

                                      sub pipeHandler {
                                      my $sig = shift @_;
                                      print " Caught SIGPIPE: $sign";
                                      exit(1);
                                      }
                                      $SIG{PIPE} = &pipeHandler;

                                      my $LINES = shift;
                                      die "LINES must be a positive numbern" if ($LINES <= 0);
                                      my $COMMAND = shift || die "second argument should be COMMANDn";

                                      my $line_number = 0;

                                      while (<STDIN>) {
                                      if ($line_number%$LINES == 0) {
                                      close OUTFILE;
                                      my $split = $ENV{SPLIT} = sprintf("%05d", $line_number/$LINES+1);
                                      print "$splitn";
                                      my $command = $COMMAND;
                                      open (OUTFILE, "| $command") or die "failed to write to command '$command'n";
                                      }
                                      print OUTFILE $_;
                                      $line_number++;
                                      }

                                      exit 0;





share|improve this answer

answered Jan 23 '12 at 20:54

Liudvikas Bukys
Directly split the .gz file into .gz files:

    zcat bigfile.gz | split -l 400000 --filter='gzip > $FILE.gz'

I think this is what the OP wanted, because he doesn't have much space.
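Since each chunk is only needed transiently, the --filter hook can also run the processing directly, so no chunk has to be kept once it's handled. A sketch, assuming a GNU split recent enough to have --filter (it appeared around coreutils 8.13) and a hypothetical do_stuff command that reads the chunk from stdin:

    # Each 40M-line chunk is piped straight into do_stuff (placeholder
    # for the real per-chunk processing); $FILE is the name split would
    # have used for the chunk, handy for labelling the output.
    zcat bigfile.gz | split -l 40000000 --filter='do_stuff > "$FILE".out'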













share|improve this answer

answered Feb 24 at 10:53

siulkilulki